Title: Design and Evaluation of Architectures for Commercial Applications
1 Design and Evaluation of Architectures for Commercial Applications
Part I: Benchmarks
2 Why should architects learn about commercial applications?
- Because they are very different from typical benchmarks
- Because they are demanding on many interesting architectural features
- Because they are driving the sales of mid-range and high-end systems
3 Shortcomings of popular benchmarks
- SPEC
- uniprocessor-oriented
- small cache footprints
- exacerbates impact of CPU core issues
- SPLASH
- small cache footprints
- extremely optimized sharing
- STREAM
- no real sharing/communication
- mainly bandwidth-oriented
4 SPLASH vs. Online Transaction Processing (OLTP)
- A typical SPLASH app has
- > 3x the issue rate,
- 26x fewer cycles spent in memory barriers,
- 1/4 of the TLB miss ratio,
- < 1/2 the fraction of cache-to-cache transfers,
- a 22x smaller instruction cache miss ratio,
- 1/2 the L2 miss ratio
- ...of an OLTP app.
5 But the real reason we care?!
- Server market
- Total: > $50 billion
- Numeric/scientific computing: < $2 billion
- Remaining $48 billion?
- OLTP
- DSS
- Internet/Web
- Trend is for numerical/scientific to remain a niche
6 Relevance of server vs. PC market
- High profit margins
- Performance is a differentiating factor
- If you sell the server, you will probably sell
- the client
- the storage
- the networking infrastructure
- the middleware
- the service
- ...
7 Need for speed in the commercial market
- Applications pushing the envelope
- Enterprise resource planning (ERP)
- Electronic commerce
- Data mining/warehousing
- ADSL servers
- Specialized solutions
- Intel splitting the Pentium line into 3 tiers
- Oracle's "raw iron" initiative
- Network Appliance's machines
8 Seminar disclaimer
- Hardware-centric approach
- target is to build better machines, not better software
- focus on fundamental behavior, not on software features
- Stick to the general-purpose paradigm
- Emphasis on CPU/memory system issues
- Lots of things missing
- object-relational and object-oriented databases
- public domain/academic database engines
- many others
9 Overview
- Day 1: Introduction and workloads
- Background on commercial applications
- Software structure of a commercial RDBMS
- Standard benchmarks
- TPC-B
- TPC-C
- TPC-D
- TPC-W
- Cost and pricing trends
- Scaling down TPC benchmarks
10 Overview (2)
- Day 2: Evaluation methods/tools
- Introduction
- Software instrumentation (ATOM)
- Hardware measurement/profiling
- IPROBE
- DCPI
- ProfileMe
- Tracing / trace-driven simulation
- User-level simulators
- Complete machine simulators (SimOS)
11 Overview (3)
- Day 3: Architecture studies
- Memory system characterization
- Out-of-order processors
- Simultaneous multithreading
- Final remarks
12 Background on commercial applications
- Database applications
- Online Transaction Processing (OLTP)
- massive number of short queries
- read/update indexed tables
- canonical example: banking system
- Decision Support Systems (DSS)
- smaller number of complex queries
- mostly read-only over large (non-indexed) tables
- canonical example: business analysis
13 Background (2)
- Web/Internet applications
- Web server
- many requests for small/medium files
- Proxy
- many short-lived connection requests
- content caching and coherence
- Web search index
- DSS with a Web front-end
- E-commerce site
- OLTP with a Web front-end
14 Background (3)
- Common characteristics
- Large amounts of data manipulation
- Interactive response times required
- Highly multithreaded by design
- suitable for large multiprocessors
- Significant I/O requirements
- Extensive/complex interactions with the operating system
- Require robustness and resiliency to failures
15 Database performance bottlenecks
- I/O-bound until recently (Thakkar, ISCA '90)
- Many improvements since then
- multithreading of the DB engine
- I/O prefetching
- VLM (very large memory) database caching
- more efficient OS interactions
- RAIDs
- non-volatile DRAM (NVDRAM)
- Today's bottlenecks
- Memory system
- Processor architecture
16 Structure of a database workload
[Diagram: clients (formulate and issue DB queries), an optional application server (simple logic checks), and the database server (executes queries)]
17 Who is who in the database market?
- DB engine
- Oracle is dominant
- other players: Microsoft, Sybase, Informix
- Database applications
- SAP is dominant
- other players: Oracle Apps, PeopleSoft, Baan
- Hardware
- players: Sun, IBM, HP, and Compaq
18 Who is who in the database market? (2)
- Historically, mainly mainframes with proprietary OSes
- Today
- Unix 40%
- NT 8%
- Proprietary 52%
- In two years
- Unix 46%
- NT 19%
- Proprietary 35%
19 Overview of an RDBMS: Oracle8
- Similar in structure to most commercial engines
- Runs on
- uniprocessors
- SMP multiprocessors
- NUMA multiprocessors
- For clusters or message-passing multiprocessors
- Oracle Parallel Server (OPS)
20 The Oracle RDBMS
- Physical structure
- Control files
- basic info on the database, its structure, and status
- Data files
- tables: actual database data
- indexes: sorted lists of pointers to data
- rollback segments: keep data for recovery upon a failed transaction
- Log files
- compressed storage of DB updates
21 Index files
- Critical in speeding up access to data by avoiding expensive scans
- The more selective the index, the faster the access
- Drawbacks
- Very selective indexes may occupy lots of storage
- Updates to indexed data are more expensive
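The scan-vs-index tradeoff above can be demonstrated in any SQL engine. A minimal sketch using Python's bundled sqlite3 as a stand-in for a commercial RDBMS (the table and column names are illustrative assumptions, not from the slides):

```python
# Minimal sketch of how an index avoids an expensive full scan, using
# Python's bundled sqlite3. Table/column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER, branch INTEGER, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?, ?)",
                [(i, i % 10, 100.0) for i in range(1000)])

query = "SELECT balance FROM account WHERE id = ?"

# Without an index the query plan visits every row of the table.
before = con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[3]

# A selective index on the lookup column turns the scan into a search.
con.execute("CREATE INDEX account_id ON account(id)")
after = con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[3]

print(before)  # e.g. "SCAN account"
print(after)   # e.g. "SEARCH account USING INDEX account_id (id=?)"
```

The drawback from the slide also shows up here: once `account_id` exists, every insert or update of `id` must maintain the index as well.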
22 Files or raw disk devices
- Most DB engines can directly access disks as raw devices
- Idea is to bypass the file system
- Manageability/flexibility somewhat compromised
- Performance boost not large (10-15%)
- Most customer installations use file systems
23 Transactions: rollback segments
- Single transaction can access/update many items
- Atomicity is required
- transaction either happens or not
- old value of balance(X) is kept in a rollback segment
- rollback: old values restored, all locks released
- Example: bank transfer, transaction A (accounts X and Y, value M):
  read account balance(X); subtract M from balance(X); add M to balance(Y); commit
  (a failure before commit triggers a rollback)
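The bank-transfer example can be sketched with Python's sqlite3 standing in for a real RDBMS: either both balance updates commit, or a failure restores the old values, as a rollback segment would. Table and column names are illustrative assumptions.

```python
# Hedged sketch of the atomic bank transfer: commit applies both updates,
# any failure rolls both back. sqlite3 stands in for a commercial engine.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [("X", 100.0), ("Y", 50.0)])
con.commit()

def transfer(con, src, dst, amount):
    """Move `amount` from src to dst atomically; roll back on any failure."""
    try:
        con.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                    (amount, src))
        # Simulate a mid-transaction failure when funds are insufficient.
        (bal,) = con.execute("SELECT balance FROM account WHERE name = ?",
                             (src,)).fetchone()
        if bal < 0:
            raise RuntimeError("insufficient funds")
        con.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                    (amount, dst))
        con.commit()
    except Exception:
        con.rollback()  # old values restored, as with a rollback segment

transfer(con, "X", "Y", 30.0)    # succeeds: X=70, Y=80
transfer(con, "X", "Y", 500.0)   # fails: both balances are restored
balances = dict(con.execute("SELECT name, balance FROM account"))
print(balances)
```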
24 Transactions: log files
- A transaction is only committed after its side effects are in stable storage
- Writing all modified DB blocks would be too expensive
- random disk writes are costly
- a whole DB block has to be written back
- no coalescing of updates
- Alternative: write only a log of modifications
- sequential I/O writes (enables NVDRAM optimizations)
- batching of multiple commits
- Background process periodically writes dirty data blocks out
25 Transactions: log files (2)
- When a block is written to disk, its log file entries are deleted
- If the system crashes
- in-memory dirty blocks are lost
- Recovery procedure
- goes through the log files and applies all updates to the database
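The logging-and-recovery scheme on these two slides can be sketched as a toy model: committed updates reach an append-only log first, data blocks are flushed lazily, and after a "crash" replaying the log reconstructs the database. This is a simplified illustration, not Oracle's actual mechanism.

```python
# Toy model of redo logging and crash recovery.
log = []     # stable storage: sequential redo log (cheap to append)
disk = {}    # stable storage: data blocks, written back lazily
cache = {}   # volatile memory: dirty blocks

def commit(txn_updates):
    # Log records reach stable storage before the transaction commits;
    # the dirty blocks themselves stay in memory.
    log.extend(txn_updates)
    cache.update(txn_updates)

def flush(block):
    # Background writer pushes one dirty block out to its home location.
    disk[block] = cache[block]

def recover():
    # After a crash the cache is lost; replaying the log over the on-disk
    # blocks reapplies every committed update.
    db = dict(disk)
    for block, value in log:
        db[block] = value
    return db

commit([("A", 1), ("B", 2)])
flush("A")
cache.clear()        # crash: all in-memory dirty blocks are lost
print(recover())     # {'A': 1, 'B': 2}
```

Note why the log wins: appends are sequential I/O and multiple commits coalesce into one write, whereas flushing every modified block at commit time would mean random writes of whole blocks.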
26 Transactions: concurrency control
- Many transactions in flight at any given time
- Locking of data items is required
- Lock granularity
- Efficient row-level locking is needed for high transaction throughput
27 Row-level locking
- Each new transaction is assigned a unique ID
- A transaction table keeps track of all active transactions
- Lock: write the ID in the directory entry for the row
- Unlock: remove the ID from the transaction table
- Simultaneous release of all locks
[Diagram: transaction table and a data block with per-row lock entries]
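The scheme above can be sketched in a few lines: rows carry the ID of the owning transaction, the transaction table lists active IDs, and dropping an ID from the table releases all of that transaction's locks at once. This is a toy illustration, not Oracle's actual data structures.

```python
# Toy model of row-level locking via transaction IDs.
active_txns = set()   # transaction table: IDs of active transactions
row_owner = {}        # per-row directory entry: ID of owning transaction

def begin(txn_id):
    active_txns.add(txn_id)

def lock(txn_id, row):
    owner = row_owner.get(row)
    # A row is lockable if unowned, owned by us, or owned by a
    # transaction that is no longer in the transaction table.
    if owner is not None and owner != txn_id and owner in active_txns:
        return False               # held by another live transaction
    row_owner[row] = txn_id        # stamp our ID into the row's entry
    return True

def commit(txn_id):
    # Removing the ID releases every row stamped with it simultaneously;
    # no per-row unlock pass is needed.
    active_txns.discard(txn_id)

begin(1); begin(2)
assert lock(1, "row7")
assert not lock(2, "row7")   # blocked while txn 1 is active
commit(1)                    # simultaneous release of all of txn 1's locks
assert lock(2, "row7")       # stale owner ID is ignored once txn 1 is gone
```

The design choice this illustrates: unlock is O(1) regardless of how many rows a transaction touched, because the lock state is interpreted against the transaction table rather than cleaned up row by row.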
28 Transaction read consistency
- A transaction that reads a full table should see a consistent snapshot
- For performance, reads shouldn't lock the table
- Problem: intervening writes
- Solution: leverage the rollback mechanism
- an intervening write saves the old value in a rollback segment
29 Oracle software structure
- Server processes
- actual execution of transactions
- DB writer
- flushes dirty blocks to disk
- Log writer
- writes redo logs to disk at commit time
- Process and system monitors
- misc. activity monitoring and recovery
- Processes communicate through the SGA and IPC
30 Oracle software structure (2)
- SGA (System Global Area)
- shared memory segment mapped by all processes
- Block buffer area
- cache of database blocks
- larger portion of physical memory
- Metadata area
- where most communication takes place
- synchronization structures
- shared procedures
- directory information
[Diagram: SGA layout along increasing virtual addresses — fixed region, metadata area (shared pool, data dictionary, redo buffers), block buffer area]
31 Oracle software structure (3)
- Hiding I/O latency
- many server processes per processor
- large block buffer area
- Process dynamics
- server reads/updates the database
- (allocates entries in the redo buffer pool)
- at commit time, the server signals the Log writer and sleeps
- the Log writer wakes up, coalesces multiple commits, and issues the log file write
- after the log is written, the Log writer signals the suspended servers
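The commit protocol just described can be sketched with Python threads standing in for Oracle's processes and IPC: servers post redo entries and sleep at commit, a single log writer coalesces all pending commits into one sequential write, then wakes every suspended server. A simplified model, not the real implementation.

```python
# Toy group-commit sketch: one log writer coalesces commits from many
# server threads into a single log write, then wakes the sleepers.
import threading

log_disk = []   # stable storage for redo entries
pending = []    # redo entries awaiting the log writer
lock = threading.Condition()

def server_commit(entry):
    # Server posts its entry, signals the log writer, and sleeps until
    # the write covering its entry has completed.
    with lock:
        pending.append(entry)
        lock.notify_all()                # wake the log writer
        while entry not in log_disk:     # sleep until durable
            lock.wait()

def log_writer(n_expected):
    written = 0
    with lock:
        while written < n_expected:
            while not pending:
                lock.wait()
            batch, pending[:] = pending[:], []   # coalesce multiple commits
            log_disk.extend(batch)               # one sequential log write
            written += len(batch)
            lock.notify_all()                    # wake suspended servers

writer = threading.Thread(target=log_writer, args=(3,))
writer.start()
servers = [threading.Thread(target=server_commit, args=(f"txn{i}",))
           for i in range(3)]
for s in servers: s.start()
for s in servers: s.join()
writer.join()
print(sorted(log_disk))   # ['txn0', 'txn1', 'txn2']
```

This also makes the NUMA concern on the next slide concrete: everything funnels through one log writer and one shared structure, so that process becomes a serialization point as the machine scales.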
32 Oracle NUMA issues
- Single SGA region complicates NUMA localization
- Single log writer process becomes a bottleneck
- Oracle8 is incorporating NUMA-friendly optimizations
- Current large NUMA systems use OPS even on a single address space
33 Oracle Parallel Server (OPS)
- Runs on clusters of SMPs/NUMAs
- Layered on top of the RDBMS engine
- Data shared through disk
- Performance very dependent on how well data can be partitioned
- Not supported by most application vendors
34 Running Oracle: other issues
- Most memory allocated to the block buffer area
- Need to eliminate OS double buffering
- Best performance attained by limiting process migration
- In large SMPs, dedicating one processor to I/O may be advantageous
35 TPC Database Benchmarks
- Transaction Processing Performance Council (TPC)
- Established about 10 years ago
- Mission: define representative benchmark standards for vendors (hardware/software) to compare their products
- Focus on both performance and price/performance
- Strict rules about how the benchmark is run
- The only widely used benchmarks in this space
36 TPC pricing rules
- Must include
- All hardware
- server, I/O, networking, switches, clients
- All software
- OS, any middleware, database engine
- 5-year maintenance contract
- Can include usual discounts
- Audited components must be products
37 TPC: history of benchmarks
- TPC-A
- First OLTP benchmark
- Based on Jim Gray's Debit-Credit benchmark
- TPC-B
- Simpler version of TPC-A
- Meant as a stress test of the server only
- TPC-C
- Current TPC OLTP benchmark
- Much more complex than TPC-A/B
- TPC-D
- Current TPC DSS benchmark
- TPC-W
- New Web-based e-commerce benchmark
38 The TPC-B benchmark
- Models a bank with many branches
- 1 transaction type: account update
- Metrics
- tpsB (transactions/second)
- $/tpsB
- Scale requirement
- 1 tpsB needs 100,000 accounts
- Transaction: begin; update account balance; write entry in history table; update teller balance; update branch balance; commit
[Diagram: schema with branch, teller (10 per branch), account (100,000 per branch), and history tables]
39 TPC-B: other requirements
- System must be ACID
- (A)tomicity
- transactions either commit or leave the system as if they were never issued
- (C)onsistency
- transactions take the system from one consistent state to another
- (I)solation
- concurrent transactions execute as if in some serial order
- (D)urability
- results of committed transactions are resilient to faults
40 The TPC-C benchmark
- Current TPC OLTP benchmark
- Moderately complex OLTP
- Models a wholesale supplier managing orders
- Workload consists of five transaction types
- Users and database scale linearly with throughput
- Specification was approved July 23, 1992
41 TPC-C schema
42 TPC-C transactions
- New-order: enter a new order from a customer
- Payment: update customer balance to reflect a payment
- Delivery: deliver orders (done as a batch transaction)
- Order-status: retrieve status of a customer's most recent order
- Stock-level: monitor warehouse inventory
43 TPC-C transaction flow
[Diagram: transaction flow — (1) select transaction from menu, measuring menu response time; (2) input screen, keying time; (3) submit, measuring transaction response time; then output screen, think time, and go back to (1)]
44 TPC-C: other requirements
- Transparency
- tables can be split horizontally and vertically, provided it is hidden from the application
- Skew
- 1% of new-order txns go to a random remote warehouse
- 15% of payment txns go to a random remote warehouse
- Metrics
- performance: new-order transactions/minute (tpmC)
- cost/performance: $/tpmC
45 TPC-C scale
- Maximum of 12 tpmC per warehouse
- Consequently
- A quad-Xeon system today (20,000 tpmC) needs
- over 1,668 warehouses
- over 1 TB of disk storage!!
- That's a VERY expensive benchmark to run!
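The arithmetic above follows directly from the 12 tpmC-per-warehouse cap. A quick hedged check (the ~600 MB of storage per warehouse is an assumption used here only to illustrate how the ~1 TB total arises; it is not from the slides):

```python
# Back-of-envelope TPC-C sizing from the scale rule.
import math

target_tpmc = 20_000
warehouses = math.ceil(target_tpmc / 12)   # scale rule: <= 12 tpmC/warehouse
print(warehouses)

bytes_per_warehouse = 600e6                # assumed figure, not from the slides
total_tb = warehouses * bytes_per_warehouse / 1e12
print(round(total_tb, 2))
```

Because both the warehouse count and the disk capacity grow linearly with target throughput, every doubling of tpmC roughly doubles the cost of the benchmark configuration.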
46 TPC-C: side effects of the skew rules
- Very small fraction of transactions go to remote warehouses
- Transparency rules allow data partitioning
- Consequence
- Clusters of powerful machines show exceptional numbers
- Compaq has the current TPC-C record of over 100 KtpmC with an 8-node Memory Channel cluster
- Skew rules are expected to change in the future
47 The TPC-D benchmark
- Current DSS benchmark from TPC
- Moderately complex decision support workload
- Models a worldwide reseller of parts
- Queries ask real-world business questions
- 17 ad hoc DSS queries (Q1 to Q17)
- 2 update queries
48 TPC-D schema
- Customer: SF x 150K rows
- Nation: 25 rows
- Region: 5 rows
- Order: SF x 1500K rows
- Supplier: SF x 10K rows
- Part: SF x 200K rows
- LineItem: SF x 6000K rows
- PartSupp: SF x 800K rows
49 TPC-D scale
- Unlike TPC-C, scale is not tied to performance
- Size determined by a Scale Factor (SF)
- SF = 1, 10, 30, 100, 300, 1000, 3000, 10000
- SF = 1 means a 1 GB database size
- Majority of current results are in the 100 GB and 300 GB range
- Indices and temporary tables can significantly increase the total disk capacity (3-5x is typical)
50 TPC-D example query
- Forecasting Revenue Query (Q6)
- This query quantifies the amount of revenue increase that would have resulted from eliminating company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues
- Considers all line-items shipped in a given year
- Query definition:
  SELECT SUM(L_EXTENDEDPRICE * L_DISCOUNT) AS REVENUE
  FROM LINEITEM
  WHERE L_SHIPDATE >= DATE '[DATE]'
    AND L_SHIPDATE < DATE '[DATE]' + INTERVAL '1' YEAR
    AND L_DISCOUNT BETWEEN [DISCOUNT] - 0.01 AND [DISCOUNT] + 0.01
    AND L_QUANTITY < [QUANTITY]
51 TPC-D execution rules
- Power Test
- Queries submitted in a single stream (i.e., no concurrency)
- Each Query Set is a permutation of the 17 read-only queries
- Throughput Test
- Multiple concurrent query streams
- Single update stream
[Diagram: power test — cache flush, optional warm-up Query Set 0, then Query Set 0 with update functions UF1 and UF2; throughput test — Query Sets 1..N running concurrently against a single update stream executing UF1/UF2 pairs]
52 TPC-D metrics
- Power Metric (QppD)
- Geometric Mean
- Throughput (QthD)
- Arithmetic Mean
- Both metrics represent queries per gigabyte-hour
53 TPC-D metrics (2)
- Composite Query-per-Hour Rating (QphD)
- The Power and Throughput metrics are combined to get the composite queries per hour
- Reported metrics are
- Power: QppD@Size
- Throughput: QthD@Size
- Price/Performance: $/QphD@Size
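The metric structure on the last two slides can be sketched numerically. The exact constants below (the 3600·SF scaling of the power metric, 19 timings in the geometric mean for 17 queries plus 2 update functions, and the composite being the geometric mean of QppD and QthD) are assumptions based on the TPC-D definition, not taken verbatim from the slides; the run parameters are hypothetical.

```python
# Sketch of the TPC-D metric arithmetic (formulas are assumptions).
import math

def qppd(power_test_seconds, scale_factor):
    # Power metric: geometric mean of per-query times, single stream.
    geomean = math.exp(sum(math.log(t) for t in power_test_seconds)
                       / len(power_test_seconds))
    return 3600.0 * scale_factor / geomean

def qthd(num_streams, queries_per_stream, elapsed_seconds, scale_factor):
    # Throughput metric: arithmetic rate over all concurrent streams.
    return (num_streams * queries_per_stream * 3600.0 * scale_factor
            / elapsed_seconds)

def qphd(power, throughput):
    # Composite query-per-hour rating combines the two.
    return math.sqrt(power * throughput)

# Hypothetical run at SF = 100 (a 100 GB database):
power = qppd([30.0] * 17 + [60.0, 60.0], scale_factor=100)  # 17 queries + 2 UFs
thru = qthd(num_streams=5, queries_per_stream=17,
            elapsed_seconds=7200, scale_factor=100)
print(round(power), round(thru), round(qphd(power, thru)))
```

The geometric mean in the power metric keeps one pathologically slow query from dominating the rating, while the throughput metric rewards sustained concurrent progress; the composite forces vendors to do well on both.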
54 TPC-D: other issues
- Queries are complex and long-running
- Crucial that the DB engine parallelizes queries for acceptable performance
- Quality of the query parallelizer is the most important factor
- Large improvements are still observed from generation to generation of software
55 The TPC-W benchmark
- Just introduced
- Represents a business that markets and sells over the Internet
- Includes security/authentication
- Uses dynamically generated pages (e.g., cgi-bins)
- Metric: Web Interactions Per Second (WIPS)
- Transactions
- browse, shopping-cart, buy, user-registration, and search
56 A look at current audited TPC-C systems
- Leader in price/performance
- Compaq ProLiant 7000-6/450, MS SQL 7.0, NT
- 4x 450 MHz Xeons, 2 MB cache, 4 GB DRAM, 1.4 TB disk
- 22,479 tpmC, $18.84/tpmC
- Leader in non-cluster performance
- Sun Enterprise 6500, Sybase 11.9, Solaris 7
- 24x 336 MHz UltraSPARC IIs, 4 MB cache, 24 GB DRAM, 4 TB disk
- 53,050 tpmC, $76.00/tpmC
57 Audited TPC-C systems: price breakdown
- Server sub-component prices
58 Using TPC benchmarks for architecture studies
- Brute-force approach: use a full audit-sized system
- Who can afford it?
- How can you run it on top of a simulator?
- How can you explore a wide design space?
- Solution: scaling down the size
59 Careful Scaling of Workloads
- Identify the architectural issue under study
- Apply appropriate scaling to simplify monitoring and enable simulation studies
- Most scaling experiments on real machines
- simulation-only is not a viable option!
- Validation through sanity checks and comparison with audit-sized runs
60 Scaling OLTP
- Forget about TPC compliance
- Determine a lower bound on DB size
- monitor contention for smaller tables/indexes
- DB size will change with the number of processors
- I/O bandwidth requirements vary with the fraction of the DB resident in memory
- completely in-memory run: no special I/O requirements
- favor many small disks over a few large ones
- place all redo logs on a separate disk
- reduce OS double-buffering
- Limit the number of transactions executed
61 Scaling OLTP (2)
- Achieve representative cache behavior
- relevant data structures >> size of hardware caches (metadata area size is key)
- maintain the same number of processes/CPU as the larger run
- Simplify setup by running clients on the server machine
- need to make lighter-weight versions of the clients
- Ensure efficient execution
- excessive migration, idle time, and OS or application spinning distort metrics
62 Scaling DSS
- Determine a lower bound on DB size
- sufficient work in the parallel section
- Ensure representative cache behavior
- DB >> hardware caches
- maintain the same number of processes/CPU as the large run
- Reduce execution time through sampling
- Major difficulty is ensuring representative query plans
- DSS results are more volatile due to improvements in query optimizers
63 Tuning, tuning, tuning
- Ensure the scaled workload is running efficiently
- Requires a large number of monitoring runs on the actual hardware platform
- Resembles a black art on Oracle
- Self-tuning features in Microsoft SQL 7.0 are promising
- ability for user overrides is desirable, but missing
65 TPC-C: scaled vs. full size
- Breakdown profile of CPU cycles
- Platform: 8-proc. AlphaServer 8400
66 Using simpler OLTP benchmarks
- Although obsolete, TPC-B can be used in architectural studies
67 Benchmarks wrap-up
- Commercial applications are complex, but need to be considered during design evaluation
- TPC benchmarks cover a wide range of commercial application areas
- Scaled-down TPC benchmarks can be used for architecture studies
- The architect needs a deep understanding of the workload