Simulating a $2M Commercial Server on a $2K PC

1
Simulating a $2M Commercial Server on a $2K PC
  • Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin
    Moore, Min Xu, Daniel Sorin, Mark D. Hill,
    David A. Wood
  • Multifacet Project (www.cs.wisc.edu/multifacet)
  • Computer Sciences Department
  • University of Wisconsin-Madison
  • February 2003

2
Summary
  • Context
  • Commercial server design is important
  • Multifacet project seeks improved designs
  • Must evaluate alternatives
  • Commercial Servers
  • Processors, memory, disks → $2M
  • Run large multithreaded transaction-oriented
    workloads
  • Use commercial applications on commercial OS
  • To Simulate on $2K PC
  • Scale & tune workloads (keep L2 miss rates, etc.)
  • Manage simulation complexity (separate timing & function)
  • Cope with workload variability (use randomness & statistics)

3
Outline
  • Context
  • Commercial Servers
  • Multifacet Project
  • Workload & Simulation Methods
  • Separate Timing & Functional Simulation
  • Cope with Workload Variability
  • Summary

4
Why Commercial Servers?
  • Many (Academic) Architects
  • Desktop computing
  • Wireless appliances
  • We focus on servers
  • (Important Market)
  • Performance Challenges
  • Robustness Challenges
  • Methodological Challenges

5
3-Tier Internet Service
[Figure labels: PCs w/ soft state; LAN / SAN; servers running applications for business rules; LAN / SAN; servers running databases for hard state]
6
Multifacet Commercial Server Design
  • Wisconsin Multifacet Project
  • Directed by Mark D. Hill & David A. Wood
  • Sponsors: NSF, WI, Compaq, IBM, Intel, Sun
  • Current Contributors: Alaa Alameldeen, Brad
    Beckman, Nikhil Gupta, Pacia Harper, Jarrod
    Lewis, Milo Martin, Carl Mauer, Kevin Moore,
    Daniel Sorin, Min Xu
  • Past Contributors: Anastassia Ailamaki, Ender
    Bilir, Ross Dickson, Ying Hu, Manoj Plakal,
    Anne Condon
  • Analysis
  • Want 4-64 processors
  • Many cache-to-cache misses
  • Neither snooping nor directories ideal
  • Multifacet Designs
  • Snooping w/ multicast [ISCA99] or unordered
    network [ASPLOS00]
  • Bandwidth-adaptive [HPCA02] & token coherence
    [ISCA03]

7
Outline
  • Context
  • Workload & Simulation Methods
  • Select, scale, & tune workloads
  • Transition workload to simulator
  • Specify & test the proposed design
  • Evaluate design with simple/detailed processor
    models
  • Separate Timing & Functional Simulation
  • Cope with Workload Variability
  • Summary

8
Multifacet Simulation Overview
Commercial Server (Sun E6000)
Scaled Workloads
Full Workloads
Workload Development
Full System Functional Simulator (Simics)
Memory Protocol Generator (SLICC)
Pseudo-Random Protocol Checker
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
  • Virtutech Simics (www.virtutech.com)
  • Rest is Multifacet software

9
Select Important Workloads
Full Workloads
  • Online Transaction Processing: DB2 w/ TPC-C-like
  • Java Server Workload: SPECjbb
  • Static web content serving: Apache
  • Dynamic web content serving: Slashcode
  • Java-based Middleware (soon)

10
Setup & Tune Workloads (on real hardware)
Commercial Server (Sun E6000)
Full Workloads
  • Tune workload, OS parameters
  • Measure transaction rate, speed-up, miss rates,
    I/O
  • Compare to published results

11
Scale & Re-tune Workloads
Commercial Server (Sun E6000)
Scaled Workloads
  • Scale-down for PC memory limits
  • Retaining similar behavior (e.g., L2 cache miss
    rate)
  • Re-tune to achieve higher transaction
    rates (OLTP: raw disk, multiple disks, more
    users, etc.)

12
Transition Workloads to Simulation
Scaled Workloads
Full System Functional Simulator (Simics)
  • Create disk dumps of tuned workloads
  • In simulator: boot OS, start, & warm application
  • Create Simics checkpoint (snapshot)

13
Specify Proposed Computer Design
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
  • Coherence Protocol (control tables: states × events)
  • Cache Hierarchy (parameters & queues)
  • Interconnect (switches & queues)
  • Processor (later)
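
A "states × events" control table can be pictured as a lookup from (state, event) to (actions, next state). The sketch below is illustrative only: it is plain Python, not SLICC syntax, and the three-state MSI protocol, the event names, and the action names are all assumptions chosen for brevity. Real protocols also need transient states, which are omitted here.

```python
from enum import Enum

class State(Enum):
    I = "Invalid"
    S = "Shared"
    M = "Modified"

class Event(Enum):
    LOAD = "Load"              # local processor read
    STORE = "Store"            # local processor write
    OTHER_GETS = "Other_GetS"  # another cache requests a readable copy
    OTHER_GETX = "Other_GetX"  # another cache requests an exclusive copy

# Control table: (state, event) -> (actions, next state).
# Hypothetical, simplified MSI with no transient states.
TABLE = {
    (State.I, Event.LOAD):       (["issue_GetS"], State.S),
    (State.I, Event.STORE):      (["issue_GetX"], State.M),
    (State.I, Event.OTHER_GETS): ([], State.I),
    (State.I, Event.OTHER_GETX): ([], State.I),
    (State.S, Event.LOAD):       (["hit"], State.S),
    (State.S, Event.STORE):      (["issue_GetX"], State.M),
    (State.S, Event.OTHER_GETS): ([], State.S),
    (State.S, Event.OTHER_GETX): (["invalidate"], State.I),
    (State.M, Event.LOAD):       (["hit"], State.M),
    (State.M, Event.STORE):      (["hit"], State.M),
    (State.M, Event.OTHER_GETS): (["send_data"], State.S),
    (State.M, Event.OTHER_GETX): (["send_data", "invalidate"], State.I),
}

def step(state, event):
    """Look up the actions to perform and the state to move to."""
    return TABLE[(state, event)]

def table_is_complete():
    """True when every (state, event) pair has an entry."""
    return all((s, e) in TABLE for s in State for e in Event)
```

Keeping the protocol as data makes it easy to check that no state/event pair was forgotten, which is one plausible reason to specify protocols in table form.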

14
Test Proposed Computer Design
Memory Timing Simulator (Ruby)
Pseudo-Random Protocol Checker
  • Randomly select write action & later read check
  • Massive false-sharing for interaction
  • Perverse network stresses design
  • Transient error & deadlock detection
  • Sound but not complete
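
The write-then-check idea can be sketched in a few lines. This is a toy under stated assumptions, not the Multifacet checker: `ToyMemory` is a hypothetical stand-in for the memory system under test, the `drop_write_to` bug is injected purely for demonstration, and a small address pool is used to force frequent interaction on the same locations.

```python
import random

class ToyMemory:
    """Hypothetical stand-in for the memory system under test."""
    def __init__(self, drop_write_to=None):
        self.data = {}
        self.drop_write_to = drop_write_to  # injected bug: lose writes to this address

    def write(self, addr, value):
        if addr != self.drop_write_to:
            self.data[addr] = value

    def read(self, addr):
        return self.data.get(addr, 0)

def random_test(memory, num_addrs=8, steps=1000, seed=0):
    """Randomly write values, then later read them back and check.

    Returns None if no error was observed, else a description of the error.
    """
    rng = random.Random(seed)
    expected = {}  # addr -> last value written, awaiting a later check
    for step in range(steps):
        addr = rng.randrange(num_addrs)
        if addr in expected and rng.random() < 0.5:
            # Check action: the read must return the last written value.
            if memory.read(addr) != expected.pop(addr):
                return f"step {step}: stale data at address {addr}"
        else:
            value = rng.randrange(1, 1 << 16)  # never 0, the default read value
            memory.write(addr, value)
            expected[addr] = value
    return None
```

The sketch mirrors "sound but not complete": any error it reports is a real mismatch, but a clean run only means no bug was hit on the paths the random sequence happened to exercise.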

15
Simulate with Simple Blocking Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
  • Warm-up caches or sometimes sufficient
    (SafetyNet)
  • Run for fixed number of transactions
  • Some transactions partially done at start
  • Other transactions partially done at end
  • Cope with workload variability (later)

16
Simulate with Detailed Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
  • Accurate (future) timing & (current) function
  • Simulation complexity decoupled (discussed soon)
  • Same transaction methodology & workload variability
    issues

17
Simulation Infrastructure & Workload Process
Commercial Server (Sun E6000)
Scaled Workloads
Full Workloads
Full System Functional Simulator (Simics)
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
Pseudo-Random Protocol Checker
  • Select important workloads: run, tune, scale, &
    re-tune
  • Specify system & pseudo-randomly test
  • Create warm workload checkpoint
  • Simulate with simple or detailed processor
  • Fixed transactions, manage simulation complexity
    (next), cope with workload variability (next next)

18
Outline
  • Context
  • Simulation Infrastructure & Workload Process
  • Separate Timing & Functional Simulation
  • Simulation Challenges
  • Managing Simulation Complexity
  • Timing-First Simulation
  • Evaluation
  • Cope with Workload Variability
  • Summary

19
Challenges to Timing Simulation
  • Execution driven simulation is getting harder
  • Micro-architecture complexity
  • Multiple in-flight instructions
  • Speculative execution
  • Out-of-order execution
  • Thread-level parallelism
  • Hardware Multi-threading
  • Traditional Multi-processing

20
Challenges to Functional Simulation
  • Commercial workloads have high functional
    fidelity demands

[Figure labels: Database, Web Server, Operating System, SPEC Benchmarks, Kernels]
21
Managing Simulator Complexity
Timing feedback - Tight Coupling - Performance?
Timing feedback Using existing simulators
Software development advantages
22
Timing-First Simulation
  • Timing Simulator
  • does functional execution of user and privileged
    operations
  • does speculative, out-of-order multiprocessor
    timing simulation
  • does NOT implement functionality of full
    instruction set or any devices
  • Functional Simulator
  • does full-system multiprocessor simulation
  • does NOT model detailed micro-architectural timing

[Figure labels: System, CPU, Network, RAM, Timing Simulator, Functional Simulator]
23
Timing-First Operation
  • As instruction retires, step CPU in functional
    simulator
  • Verify instruction's execution
  • Reload state if timing simulator deviates from
    functional
  • Loads in multi-processors
  • Instructions with unidentified side-effects
  • NOT loads/stores to I/O devices
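
The retire, step, verify, reload cycle can be sketched with a toy instruction set. Everything here is hypothetical and for illustration only: the two-instruction ISA, the register-dictionary state, and the choice to leave `mul` unmodeled in the timing simulator (standing in for an instruction with side effects the timing model does not implement). It does not use the Simics or TFsim interfaces.

```python
def functional_step(regs, instr):
    """Reference semantics (the 'functional simulator') for a toy ISA."""
    op, dst, a, b = instr
    new = dict(regs)
    if op == "add":
        new[dst] = regs[a] + regs[b]
    elif op == "mul":
        new[dst] = regs[a] * regs[b]
    return new

def timing_step(regs, instr):
    """The timing simulator's own execution; it only implements 'add'."""
    op, dst, a, b = instr
    new = dict(regs)
    if op == "add":
        new[dst] = regs[a] + regs[b]
    # 'mul' is deliberately not modeled, so the destination is left stale.
    return new

def run_timing_first(program, init_regs):
    """Run the program, checking the timing simulator at each retirement."""
    timing = dict(init_regs)
    functional = dict(init_regs)
    deviations = 0
    for instr in program:
        timing = timing_step(timing, instr)              # instruction retires
        functional = functional_step(functional, instr)  # step functional simulator
        if timing != functional:                         # verify execution
            deviations += 1
            timing = dict(functional)                    # reload state on deviation
    return timing, deviations
```

A short usage example: with the program below, only the `mul` deviates, and the reload keeps later instructions correct.

```python
program = [("add", "r1", "r1", "r2"),
           ("mul", "r3", "r1", "r2"),
           ("add", "r1", "r3", "r1")]
final_regs, deviations = run_timing_first(program, {"r1": 1, "r2": 2, "r3": 0})
```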

[Figure labels: System, CPU, Network, RAM, Timing Simulator, Functional Simulator]
24
Benefits of Timing-First
  • Supports speculative multi-processor timing
    models
  • Leverages existing simulators
  • Software development advantages
  • Increases flexibility and reduces code complexity
  • Immediate, precise check on timing simulator
  • However
  • How much performance error is introduced in this
    approach?
  • Are there simulation performance penalties?

25
Evaluation
  • Our implementation, TFsim, uses
  • Functional Simulator: Virtutech Simics
  • Timing Simulator: implemented in less than
    one person-year
  • Evaluated using OS intensive commercial workloads
  • OS Boot: > 1 billion instructions of Solaris 8
    startup
  • OLTP: TPC-C-like benchmark using a 1 GB database
  • Dynamic Web: Apache serving message board, using
    code and data similar to slashdot.org
  • Static Web: Apache web server serving static web
    pages
  • Barnes-Hut: Scientific SPLASH-2 benchmark

26
Measured Deviations
  • Less than 20 deviations per 100,000 instructions
    (0.02%)

27
If the Timing Simulator Modeled Fewer Events
28
Sensitivity Results
29
Analysis of Results
  • Runs full-system workloads!
  • Timing performance impact of deviations
  • Worst case less than 3% performance error
  • Overhead of redundant execution
  • 18% on average for uniprocessors
  • 18% (2 processors) up to 36% (16 processors)

30
Performance Comparison
  • Absolute simulation performance comparison
  • In kilo-instructions committed per second (KIPS)
  • RSIM (scaled): 107 KIPS
  • Uniprocessor TFsim: 119 KIPS

[Table residue: TFsim vs. RSIM comparison, entries rated match / close / different]
31
Bundled Retires
32
Timing-First Conclusions
  • Execution-driven simulators are increasingly
    complex
  • How to manage complexity?
  • Our answer
  • Introduces relatively little performance error
    (worst case 3%)
  • Has low-overhead (18% uniprocessor average)
  • Rapid development time

33
Outline
  • Context
  • Workload Process & Infrastructure
  • Separate Timing & Functional Simulation
  • Cope with Workload Variability
  • Variability in Multithreaded Workloads
  • Coping in Simulation
  • Examples & Statistics
  • Summary

34
What is Happening Here?
OLTP
35
What is Happening Here?
  • How can slower memory lead to faster workload?
  • Answer: Multithreaded workload takes different
    path
  • Different lock race outcomes
  • Different scheduling decisions
  • (1) Does this happen for real hardware?
  • (2) If so, what should we do about it?

36
One Second Intervals (on real hardware)
OLTP
37
60 Second Intervals (on real hardware)
16-day simulation
OLTP
38
Coping with Workload Variability
  • Running (simulating) long enough not appealing
  • Need to separate coincidental & real effects
  • Standard statistics on real hardware
  • Variation within base system runs
  • vs. variation between base & enhanced system
    runs
  • But deterministic simulation has no within
    variation
  • Solution with deterministic simulation
  • Add pseudo-random delay on L2 misses
  • Simulate base (enhanced) system many times
  • Use simple or complex statistics
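
A minimal sketch of the "perturb, then repeat" recipe follows. The toy model is an assumption made for illustration: each run is reduced to a sum of L2 miss latencies, and the miss count, base latency, and perturbation range are hypothetical parameters. In a real full-system simulation the small delays would also change lock race outcomes and scheduling decisions, which this toy does not model.

```python
import random
import statistics

def simulate_run(l2_misses, base_latency_ns, max_extra_ns, seed):
    """Toy run: total memory time with a pseudo-random delay on each L2 miss."""
    rng = random.Random(seed)  # one seed per run keeps each run reproducible
    return sum(base_latency_ns + rng.randint(0, max_extra_ns)
               for _ in range(l2_misses))

def perturbed_runs(num_runs, **params):
    """Simulate the same configuration many times, varying only the seed."""
    return [simulate_run(seed=s, **params) for s in range(num_runs)]

def summarize(runs):
    """Mean and sample standard deviation across runs."""
    return statistics.mean(runs), statistics.stdev(runs)
```

Running `perturbed_runs(20, l2_misses=1000, base_latency_ns=80, max_extra_ns=4)` for both the base and the enhanced configuration gives two samples whose within-configuration variation can then be compared against the difference between them.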

39
Coincidental (Space) Variability
40
Wrong Conclusion Ratio
  • WCR (16,32) = 18%
  • WCR (16,64) = 7.5%
  • WCR (32,64) = 26%
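
The slide does not define the wrong conclusion ratio, so the sketch below rests on an assumed definition: the percentage of single-run comparisons (one run from each configuration) whose outcome contradicts the comparison of the two means. The function name and the "lower is better" metric (for example, cycles per transaction) are also assumptions.

```python
from itertools import product
from statistics import mean

def wrong_conclusion_ratio(runs_a, runs_b):
    """Percent of (a, b) run pairs that contradict the comparison of means.

    Assumes a lower-is-better metric. A tie in a pair counts as wrong
    whenever configuration A is better on average.
    """
    a_better_overall = mean(runs_a) < mean(runs_b)
    pairs = list(product(runs_a, runs_b))
    wrong = sum(1 for a, b in pairs if (a < b) != a_better_overall)
    return 100.0 * wrong / len(pairs)
```

For example, with `runs_a = [10, 12, 14]` and `runs_b = [11, 13, 15]`, configuration A is better on average, yet 3 of the 9 possible single-run pairings show B ahead, so one-third of single-run experiments would point the wrong way.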

41
More Generally Use Standard Statistics
  • As one would for a measurement of a live system
  • Confidence Intervals
  • 95% confidence intervals contain true value 95%
    of the time
  • Non-overlapping confidence intervals give
    statistically significant conclusions
  • Use ANOVA or Hypothesis Testing even better!
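
As a rough illustration of the confidence-interval check, the sketch below computes an interval for each configuration and tests for overlap. It assumes a normal approximation (a z value from the standard library) rather than Student's t, so with only a handful of runs the true interval would be somewhat wider. The function names and sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of samples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(samples) / sqrt(len(samples))
    center = mean(samples)
    return center - half_width, center + half_width

def intervals_overlap(ci_a, ci_b):
    """True when the two (low, high) intervals share any point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```

Usage: compute `confidence_interval` over the runs of the base system and of the enhanced system; if `intervals_overlap` returns False, the difference is unlikely to be a coincidence of one particular execution path.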

42
Confidence Interval Example
ROB
  • Estimate runs to get non-overlapping confidence
    intervals
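
One way to estimate the number of runs is sketched here, under simplifying assumptions that are not from the slides: both configurations have the same standard deviation, both get the same number of runs, and the interval uses a normal approximation. Two intervals of half-width z·s/√n stop overlapping once twice the half-width is below the gap between the means, which gives n > (2·z·s / gap)².

```python
from math import ceil
from statistics import NormalDist

def runs_needed(stdev, mean_gap, confidence=0.95):
    """Estimated runs per configuration for non-overlapping intervals.

    stdev:    assumed per-run standard deviation (same for both configurations)
    mean_gap: assumed absolute difference between the two means
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil((2 * z * stdev / mean_gap) ** 2)
```

The quadratic dependence is the practical point: halving the performance difference one wants to resolve roughly quadruples the number of simulations required.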

43
Also Time Variability (on real hardware)
OLTP
  • Therefore, select checkpoint(s) carefully

44
Workload Variability Summary
  • Variability is a real phenomenon for
    multi-threaded workloads
  • Runs from same initial conditions are different
  • Variability is a challenge for simulations
  • Simulations are short
  • Wrong conclusions may be drawn
  • Our solution accounts for variability
  • Multiple runs, confidence intervals
  • Reduces wrong conclusion probability

45
Talk Summary
  • Simulations of $2M Commercial Servers must
  • Complete in reasonable time (on $2K PCs)
  • Handle OS, devices, multithreaded hardware
  • Cope with variability of multithreaded software
  • Multifacet
  • Scale & tune transactional workloads
  • Separate timing & functional simulation
  • Cope w/ workload variability via randomness &
    statistics
  • References (www.cs.wisc.edu/multifacet/papers)
  • Simulating a $2M Commercial Server on a $2K PC
    [Computer03]
  • Full-System Timing-First Simulation
    [Sigmetrics02]
  • Variability in Architectural Simulations
    [HPCA03]

46
Other Multifacet Methods Work
  • Specifying & Verifying Coherence Protocols
  • [SPAA98], [HPCA99], [SPAA99], [TPDS02]
  • Workload Analysis & Improvement
  • Database systems [VLDB99] & [VLDB01]
  • Pointer-based [PLDI99] & [Computer00]
  • Middleware [HPCA03]
  • Modeling & Simulation
  • Commercial workloads [Computer02] & [HPCA03]
  • Decoupling timing/functional simulation
    [Sigmetrics02]
  • Simulation generation [PLDI01]
  • Analytic modeling [Sigmetrics00] & [TPDS TBA]
  • Micro-architectural slack [ISCA02]

47
Backup Slides
48
One Ongoing/Future Methods Direction
  • Middleware Applications
  • Memory system behavior of Java Middleware
    [HPCA03]
  • Machine measurements
  • Full-system simulation
  • Future Work: Multi-Machine Simulation
  • Isolate middle-tier from client emulators and
    database
  • Understand fundamental workload behaviors
  • Drives future system design

49
ECPerf vs. SpecJBB
  • Different cache-to-cache transfer ratios!

50
Online Transaction Processing (OLTP)
  • DB2 with a TPC-C-like workload. The TPC-C
    benchmark is widely used to evaluate system
    performance for the on-line transaction
    processing market. The benchmark itself is a
    specification that describes the schema, scaling
    rules, transaction types and transaction mix, but
    not the exact implementation of the database.
    TPC-C transactions are of five transaction types,
    all related to an order-processing environment.
    Performance is measured by the number of New
    Order transactions performed per minute (tpmC).
  • Our OLTP workload is based on the TPC-C v3.0
    benchmark. We use IBM's DB2 V7.2 EEE database
    management system and an IBM benchmark kit to
    build the database and emulate users. We build an
    800 MB 4000-warehouse database on five raw disks
    and an additional dedicated database log disk. We
    scaled down the sizes of each warehouse by
    maintaining the reduced ratios of 3 sales
    districts per warehouse, 30 customers per
    district, and 100 items per warehouse (compared
    to 10, 30,000 and 100,000 required by the TPC-C
    specification). Each user randomly executes
    transactions according to the TPC-C transaction
    mix specifications, and we set the think and
    keying times for users to zero. A different
    database thread is started for each user. We
    measure all completed transactions, even those
    that do not satisfy timing constraints of the
    TPC-C benchmark specification.

51
Java Server Workload (SPECjbb)
  • Java-based middleware applications are
    increasingly used in modern e-business settings.
    SPECjbb is a Java benchmark emulating a 3-tier
    system with emphasis on the middle tier server
    business logic. SPECjbb runs in a single Java
    Virtual Machine (JVM) in which threads represent
    terminals in a warehouse. Each thread
    independently generates random input (tier 1
    emulation) before calling transaction-specific
    business logic. The business logic operates on
    the data held in binary trees of java objects
    (tier 3 emulation). The specification states that
    the benchmark does no disk or network I/O.
  • We used Sun's HotSpot 1.4.0 Server JVM and
    Solaris's native thread implementation. The
    benchmark includes driver threads to generate
    transactions. We set the system heap size to
    1.8 GB and the new object heap size to 256 MB to
    reduce the frequency of garbage collection. Our
    experiments used 24 warehouses, with a data size
    of approximately 500 MB.

52
Static Web Content Serving: Apache
  • Web servers such as Apache represent an important
    enterprise server application. Apache is a
    popular open-source web server used in many
    internet/intranet settings. In this benchmark, we
    focus on static web content serving.
  • We use Apache 2.0.39 for SPARC/Solaris 8
    configured to use pthread locks and minimal
    logging at the web server. We use the Scalable
    URL Request Generator (SURGE) as the client.
    SURGE generates a sequence of static URL requests
    which exhibit representative distributions for
    document popularity, document sizes, request
    sizes, temporal and spatial locality, and
    embedded document count. We use a repository of
    20,000 files (totalling 500 MB), and use clients
    with zero think time. We compiled both Apache and
    SURGE using Sun's WorkShop C 6.1 with aggressive
    optimization.

53
Dynamic Web Content Serving: Slashcode
  • Dynamic web content serving has become
    increasingly important for web sites that serve
    large amounts of information. Dynamic content is
    used by online stores, instant news, and
    community message board systems. Slashcode is an
    open-source dynamic web message posting system
    used by the popular slashdot.org message board
    system.
  • We used Slashcode 2.0, Apache 1.3.20, and
    Apache's mod_perl module 1.25 (with perl 5.6) on
    the server side. We used MySQL 3.23.39 as the
    database engine. The server content is a snapshot
    from the slashcode.com site, containing
    approximately 3000 messages with a total size of
    5 MB. Most of the run time is spent on dynamic
    web page generation. We use a multi-threaded user
    emulation program to emulate user browsing and
    posting behavior. Each user independently and
    randomly generates browsing and posting requests
    to the server according to a transaction mix
    specification. We compiled both server and client
    programs using Sun's WorkShop C 6.1 with
    aggressive optimization.