Title: Simics Accelerator Virtualizing Large Systems
1Simics Accelerator: Virtualizing Large Systems
- Dr. Mikael Bergqvist, Senior Application Engineer
- 2008-05-30
2Topic
- Speeding up the simulation of large target systems
- Bring virtualized software development to the big stuff
- Outline
- Virtualized software development
- Apologies if you attended the morning presentation on multicore debug, some parts will be repeated. But with two tracks we cannot be sure that you all saw that.
- Target and host system trends
- Multithreading virtual hardware models
- Leveraging redundant information with Page Sharing
- Results
3- Virtualization for Software Developers
4What is Virtual Hardware?
- A piece of software
- Running on a regular PC, server, or workstation
- Functionally identical to a particular hardware
- Runs the same software as the physical hardware system
Virtual HW
5Virtutech Core Technology
- Model any electronic system on a PC or workstation
- Simics is a software program, no hardware required
- Run the exact same software as the physical target (complete binary)
- Run it fast (100s of MIPS)
- Model any target system
- Networks, SoCs, boards, ASICs, ... no limits
- Here is where Accelerator comes in
- For the benefit of software developers and hardware providers
- Enables process change in software development
User application code
Middleware and libraries
Target operating system(s)
Virtual target hardware
6Why do we use Virtual Hardware?
- Business Reasons
- It hits the bottom line
- Develop software before hardware becomes available
- Shorten time-to-market
- Decouple hardware and software development
- Reduce software risk
- Increase quality
- Availability and flexibility
- Engineering Reasons
- It is cool
- Checkpoint restore
- Virtual time
- Precisely synchronized
- Stopped at any point
- Repeatability
- Reverse execution
- Configurable
- Control
- Change anything
- Inspection power
- See anything
- No debug bandwidth limit
7Value Proposition
Optimize
Accelerate
Enhance System Debug
Replace
Early Software Development
Test and configuration
Cost of Recall and System Maintenance
Time to Market
Capital Expenditure Reduction
8Replace
- Availability
- Virtual system is software
- Trivial to copy
- Trivial to distribute
- Cheaper than custom HW
- Each engineer can have a custom hardware system at their desk
- Scalability
- No physical supply limit
- Any number of any board
- Any type of system in infinite supply at no cost
- Old systems or new
- A virtual system can be big or small by simple software (re)configuration
9Accelerate
- Virtual hardware created from the system specification
- Model available much earlier than prototype hardware
- Software development starts much earlier
- Software available when hardware starts shipping
- Shorter sales cycles, less product risk, shorter time-to-market
Board design
Board prototype production
Hardware/Software Integration and Test
Hardware-dependent software development
Virtual model production
Application software development
10Optimize
- Take advantage of the full power of virtualized software development and virtual hardware
- Factor it into the project plan for a system
- Observed effects
- Software not blocked by hardware availability
- Development schedules that start earlier and end earlier
- Shorter development time for equivalent functionality
- Shorter time to find and fix the really hard bugs
- Fewer show-stoppers
- More tested software
- Improved hardware and hardware documentation quality
- Very short time before software runs on first hardware
11Optimized Debugging Power
- Virtual hardware has very nice debugging and testing abilities
...
con0.wait-for-string ">"
con0.input "bootm\n"
con0.wait-for-string "login"
con0.input "root\n"
...
break -x 0x0000 0x1F00
break-io uart0
break-exception int13
12The Disk Corruption Example Bug
- Distributed fault-tolerant file system got corrupted
- Rack-based system with many boards
- Intermittent error
- Error seen as a composite state across multiple disks: they suddenly and intermittently became inconsistent
- Months spent chasing it on physical hardware
- Simics solution
- Reproduce corruption in Simics model of target
- Pinpoint the time when it happens, by interval halving
- Around the critical time, take periodic snapshots of disks
- Check consistency of disk states in offline scripts
- Result
- Found the precise instruction causing the problem
- Captured the network traffic pattern causing the issue
- Communicated the complete setup and reproduction instructions to development, greatly facilitating fixing the bug
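The interval-halving step can be sketched in a few lines of Python. This is a minimal illustration, not Simics scripting: the function `check` is a hypothetical stand-in for "restore a checkpoint, run the deterministic simulation to virtual time t, and run the offline disk consistency check", and the corruption time is an invented value. The sketch assumes the state stays corrupt once it has become corrupt.

```python
from typing import Callable

def first_bad_time(is_corrupt_at: Callable[[int], bool], good: int, bad: int) -> int:
    """Return the earliest time in (good, bad] at which the state is corrupt.

    Assumes the state is consistent at `good`, corrupt at `bad`,
    and never becomes consistent again once corrupted.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_corrupt_at(mid):
            bad = mid    # corruption happened at or before mid
        else:
            good = mid   # corruption happened after mid
    return bad

# Hypothetical stand-in for a repeatable simulation run plus consistency check.
CORRUPTION_TIME = 73_421   # invented for the demonstration
probes = []

def check(t: int) -> bool:
    probes.append(t)
    return t >= CORRUPTION_TIME

found = first_bad_time(check, 0, 1_000_000)
print(found, "found after", len(probes), "probes")
```

Because each probe halves the interval, a window of a million time units needs at most 20 probes. The approach depends on repeatability: every probe has to replay the same execution.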
13What Types of Systems Can Be Virtualized?
Examples at each level:
- Complete Systems and Networks: satellite constellation, telecom network
- Racks of Boards and Backplanes: telecom rack, avionics bay, blade server
- Complete Boards: MPC8572DS board, ebony board, custom
- SoC Devices: MPC8572E, PPC440GX, or CSSP ASIC
- Devices and Buses: PCIe, RapidIO, I2C, custom FPGA
- Processor and Memory: e300, e500, 440, 970, 7450, Power6, ...
Callout: This is where performance becomes an issue
14- Technology Trends and Simics Accelerator
15Trends
- Target systems are getting more complex
- Multiple boards
- Multiple processors
- Multicore SoCs
- More and larger memories
- Reduces perceived simulation performance as more work is needed per target time unit
- Host hardware is parallel
- Multicore processors
- Multiple processors
- Clusters of PCs
- Multicore standard for desktop
- 600 EUR for a 2-core PC
- 3000 EUR gets 8-core server
- Increases processing power for software which is parallel
- NB: Memory size is not increasing as quickly as cores
16Simics Accelerator
- Launched with Simics 4.0 in April 2008
- Contains a set of technologies for speeding up execution of large target systems in Simics
- Tackle more complex target systems
- Using multiple host processor cores
- Taking advantage of redundancy in target system
- Without impacting Simics determinism, control, synchronization, insight, and reverse execution
17The Target Systems
- Large, complex targets
- Multiple boards
- Multiple networks
- 20-100 processors
- Heterogeneous processors
- Many gigabytes of memory
- Almost overwhelming, but not with Accelerator!
- Brings a whole new level of systems into the bracket of conveniently fast
- Typical target markets
- Telecom network equipment (racks and clusters)
- Military/aerospace racks
- Datacenter blade enclosures
- Distributed systems
- Networked systems
18 19Not Trivial to do Right
20Multithreading Simics Overview
[Figure: three Simics configurations on host workstations, starting from a single thread, compared on target simulation speed and total simulator work; extracted values: 1.0, 1.0, 25, 100, 100, 4.0]
21Multithreading Simics Details
- Simics 4.0 can utilize multiple host processors for simulation
- The simulation is divided up into cells
- The cells can run concurrently in different threads
- Objects in different cells can only communicate with each other through message passing (Simics links)
- Processors that share memory or devices have to be in the same cell (currently)
- Boards or machines that communicate over Ethernet and other networks can be in separate cells
- Typically, one or a few boards/machines in a cell
- Links connecting machines require some smarts
- Orthogonal to other Simics features
- Reuses target structure from earlier Simics versions
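The cell rule above (processors that share memory or devices go in the same cell, network-connected machines may be split) amounts to finding connected components. The following Python sketch illustrates that idea with a small union-find; the function name, the processor names, and the sharing pairs are all hypothetical, and this is not how Simics configurations are actually written.

```python
def partition_into_cells(processors, sharing_pairs):
    """Group processors so that any two that share memory or a device
    end up in the same cell. Processors linked only by networks stay apart."""
    parent = {p: p for p in processors}

    def find(p):
        # Walk up to the representative, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in sharing_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    cells = {}
    for p in processors:
        cells.setdefault(find(p), []).append(p)
    return list(cells.values())

# Hypothetical target: two dual-processor boards and one single-processor board,
# all connected to each other over Ethernet only.
processors = ["cpu0", "cpu1", "cpu2", "cpu3", "cpu4"]
sharing = [("cpu0", "cpu1"), ("cpu2", "cpu3")]
cells = partition_into_cells(processors, sharing)
print(cells)
```

Each resulting cell can then be given to its own simulation thread, since nothing inside one cell touches another cell except through links.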
22Hierarchical Synchronization
- Deterministic semantics
- Regardless of host cores
- Periodic synchronization between different cells and target machines
- Puts a minimum latency on communication propagation
- Synch interval determines simulation results, not number of execution threads in Simics
- Latency within a cell
- 1000-10000 cycles
- Works well for SMP OS
- Latency between cells
- 10 to 1000 ms
- Works well for latency-tolerant networks
- Builds on current Simics experience in temporally decoupled simulation
- This works well in practice
[Figure: cells connected by links. Callouts: "Synchronize shared memory machine tightly"; "Short latency between machines with tight network coupling, inside a single cell"; "Longer latency on network between cells"]
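Why a minimum link latency gives deterministic results regardless of host scheduling can be shown with a toy model in Python. This is a sketch under stated assumptions, not the Simics implementation: cells run one synchronization interval at a time, every message arrives no earlier than one interval after it was sent, and messages are exchanged only at the barrier. The random shuffle stands in for unpredictable host thread scheduling; all names and numbers are illustrative.

```python
import random

def simulate(num_cells, num_quanta, sync_interval, link_latency, schedule_seed):
    """Run cells quantum by quantum; return what each cell received and when."""
    if link_latency < sync_interval:
        raise ValueError("link latency must be at least one sync interval")
    rng = random.Random(schedule_seed)
    inboxes = [[] for _ in range(num_cells)]    # pending (arrival_time, payload)
    received = [[] for _ in range(num_cells)]   # delivered messages, per cell
    for q in range(num_quanta):
        start = q * sync_interval
        end = start + sync_interval
        order = list(range(num_cells))
        rng.shuffle(order)                      # arbitrary execution order
        outgoing = []
        for cell in order:
            # Deliver messages that fall inside this quantum, in time order.
            due = sorted(msg for msg in inboxes[cell] if msg[0] < end)
            inboxes[cell] = [msg for msg in inboxes[cell] if msg[0] >= end]
            received[cell].extend(due)
            # Each cell sends one message per quantum to its neighbour.
            send_time = start + cell
            dest = (cell + 1) % num_cells
            outgoing.append((dest, (send_time + link_latency, f"cell{cell}-q{q}")))
        # Barrier: messages cross cell boundaries only between quanta.
        for dest, msg in outgoing:
            inboxes[dest].append(msg)
    return received

runs = [simulate(3, 5, 100, 250, seed) for seed in range(4)]
print(all(run == runs[0] for run in runs))
```

Since no message can arrive within the quantum in which it was sent, the order in which cells execute inside a quantum cannot influence what any cell observes.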
23Scaling Out
- Multithreading and distribution of the simulation can be combined to simulate extremely large systems
- Make more cores and more host memory available
- Takes Simics into the hundreds of nodes domain
- Distribute at network links, just like cell boundaries
[Figure: three Simics processes, each on its own host workstation, connected by links through a switch]
24- Leveraging Target Redundancy
25Redundancy in Target Systems
- Large systems are not built from all-unique components
- Software repeats
- Machines use the same OS, middleware, applications
- Data repeats
- Redundant databases
- Data packets passed around in a cluster
- Copies within machine
- Code and data copied from disk to memory to be used
- Simulator sees the whole system, leverage repetition to reduce memory footprint
[Figure: several target machines holding repeated contents: RTOS, Linux, App A, DB, Dataset, Packet]
26Data Page Sharing Implementation
- Simics memory images used for all data stores (flash, RAM, ROM, disks, etc.)
- Standard Simics feature
- Identical pages in different memory images stored in a single copy
- Within machines
- Between machines
- Regardless of type of memory in the target
- Copy-on-write semantics for safety (obviously)
- Reduces memory footprint, increases data locality, helps maintain performance
[Figure: one Simics instance simulating several machines, each with cpu, RAM, flash, and dev objects]
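The idea of storing identical pages once, with copy-on-write for safety, can be illustrated with a small Python sketch. It is an assumption-laden toy, not the Simics memory image implementation: the class names `PageStore` and `MemoryImage`, the 4096-byte page size, and the use of SHA-256 content hashes to detect identical pages are all choices made for this example, and pages that are no longer referenced are never freed.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch

class PageStore:
    """Holds one immutable copy of every distinct page content."""
    def __init__(self):
        self._pages = {}  # content digest -> page bytes

    def intern(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._pages.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._pages[digest]

    def unique_pages(self) -> int:
        return len(self._pages)

class MemoryImage:
    """A RAM, flash, or disk image expressed as references into a PageStore."""
    def __init__(self, store: PageStore, num_pages: int):
        self._store = store
        zero_page = store.intern(bytes(PAGE_SIZE))
        self._refs = [zero_page] * num_pages   # all pages start as the shared zero page

    def read(self, page_no: int) -> bytes:
        return self._store.get(self._refs[page_no])

    def write(self, page_no: int, offset: int, data: bytes) -> None:
        assert offset + len(data) <= PAGE_SIZE
        # Copy-on-write: modify a private copy, then re-share it by content.
        page = bytearray(self.read(page_no))
        page[offset:offset + len(data)] = data
        self._refs[page_no] = self._store.intern(bytes(page))

store = PageStore()
ram_a = MemoryImage(store, 1024)
ram_b = MemoryImage(store, 1024)
ram_a.write(3, 0, b"kernel")
ram_b.write(3, 0, b"kernel")
print(store.unique_pages())   # 2048 target pages backed by 2 stored pages
```

A later `ram_b.write(3, 0, b"KERNEL")` gives `ram_b` its own page and leaves `ram_a` untouched, which is the safety property copy-on-write provides.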
27- Simics Accelerator Results
28Accelerator Scaling
- Many times better scalability for virtual hardware
- Brings virtualization to larger system setups
- More boards and larger memories handled with same host
- (No real effect on single-machine setups)
- Better use of host hardware
- Use all the cores in a workstation
- Do not waste workstation memory
- Same semantics everywhere: start on a small machine, move to a larger one for large simulations if needed
- Overall, removes target system size as an obstacle for using virtualized software development
29Single Point of Control
Eight machines simulated by two threads; inspect any part of any machine from a single interface
30Multithreading Performance Results
- Performance effect of multithreading depends on
- Target system characteristics
- Software latency requirements
- Target system load balance
- Target system communications pattern
- Synthetic experiments and lab experience
- Single-thread performance not affected
- Simics works just as well as before on a single core
- No impact on idle loop simulation
- Up to 10x Simics 3.2 performance
- 8-core host, 64 target machines, no communication
- Up to 6x scaling on 8-core host
- Pretty respectable
31Page Sharing Results
All results are for networks of machines booted to prompt, but no applications loaded
Local unique data 4
Data repeated within the machine 20
Shared data across machines 96
Total data savings 65
Total data savings 20
Local unique data 1
Shared data across and within machines 98
Zero pages 90
Other shared 1
Total data savings 89
Total data savings 91
32 33Munich
34 35Simulation Speed
- Detail level determines speed
- The more detail, the slower the simulation
- You can run lots of software with low detail level
- or not very much software with high detail level
- But not lots of software with high detail level
36Workload Sizes
37Temporal Decoupling Speed Impact
- Experimental data
- 4 virtual PPC440 boards
- Booting Linux
- Which is a particularly hard workload, lots of device accesses
- Execution quanta of 1, 10, 100, ... 1,000,000 cycles
- Notable points
- 10x performance increase from 10 to 1000 quantum
- 30 from 1000 to 1,000,000 quantum
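What the execution quantum controls can be sketched with a simple round-robin loop in Python. This is only an illustrative model: it counts how often the simulator has to switch between machines, which is one overhead that a longer quantum reduces, and it does not reproduce the measured speedups above. The function name and the total cycle count are assumptions for the example.

```python
def run_round_robin(num_machines: int, total_cycles: int, quantum: int) -> int:
    """Advance every machine to total_cycles, one quantum at a time.

    Returns the number of times the simulator switched to a machine.
    With a quantum of Q cycles, machines may be up to Q cycles apart
    in virtual time at any moment.
    """
    clocks = [0] * num_machines
    switches = 0
    while min(clocks) < total_cycles:
        for m in range(num_machines):
            step = min(quantum, total_cycles - clocks[m])
            if step > 0:
                clocks[m] += step
                switches += 1
    return switches

# Four machines (as in the experiment above), one million cycles each.
for quantum in (10, 1000, 1_000_000):
    print(quantum, run_round_robin(4, 1_000_000, quantum))
```

The trade-off is that a longer quantum also means machines see each other's actions later, which is why the synchronization latency has to suit the software being run.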
38Simics 4.0 Accelerator Performance
- Running a single machine in a single thread is equal in performance with Simics 3.2
- Setups with many machines are often faster than with 3.2
- Multithreading makes it much easier to utilize multicore and multiprocessor host machines
- Linear scaling seen for simple cases such as compute-intense workloads or boot with little communication
- Variability of the workload limits performance (see next slide)
- Performance reduced if low-latency communication is required
- Page sharing is not yet optimized for performance
- Current implementation saves memory without affecting performance (neither better nor worse than without page sharing)
- Has potential to improve performance
39Multithreading Performance in Practice
- Multiple boards in a single target system
- Virtual time progress, with time quanta
[Figure: one Simics instance simulating machines A, B, and C; time quanta A1-A3, B1-B3, and C1-C3 along a virtual time axis]
40Execution on Single-Threaded Simics
Virtual time progress, with quanta
[Figure: time quanta A1-A3, B1-B3, and C1-C3 of machines A, B, and C along a virtual time axis]
Serialized execution on single-threaded Simics
[Figure: quanta A1, A2, B1, B2, C1, and C2 laid out along a real time axis]
The simulation of the three target machines is interleaved on a single processor
The real time it takes to execute each time quantum tends to vary with target hardware and software characteristics
41Execution on Multi-threaded Simics
Virtual time progress, with quanta
[Figure: time quanta A1-A3, B1-B3, and C1-C3 of machines A, B, and C along a virtual time axis]
Each time quantum has to be finished on all machines before progressing to the next quantum
Best case: all target time quanta take the same time to simulate.
Serialized execution on single-threaded Simics
[Figure: quanta A1, A2, B1, B2, C1, and C2 laid out along a real time axis]
Parallel execution on multi-threaded Simics
[Figure: quanta A1-A3, B1-B3, and C1-C3 on separate threads along a real time axis, with stalls where a thread waits for the others]
42Execution on Multi-threaded Simics
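Virtual time progress, with quanta
[Figure: time quanta A1-A3, B1-B3, and C1-C3 of machines A, B, and C along a virtual time axis]
Speed-up over single-threaded Simics will vary over time, and is limited by load balance
Serialized execution on single-threaded Simics
[Figure: quanta A1, A2, B1, B2, C1, and C2 laid out along a real time axis]
Parallel execution on multi-threaded Simics
[Figure: quanta A1-A3, B1-B3, and C1-C3 on separate threads along a real time axis, with stalls where a thread waits for the others]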
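The effect of load balance on speed-up can be worked through with a short Python sketch. The per-quantum costs below are invented numbers chosen for illustration, not measurements, and the model assumes one host thread per machine with a barrier after every quantum, as described on the slides.

```python
def serial_time(costs):
    """Single-threaded: every quantum of every machine runs back to back."""
    return sum(sum(row) for row in costs.values())

def parallel_time(costs):
    """One thread per machine with a barrier after each quantum:
    each quantum takes as long as its slowest machine."""
    return sum(max(column) for column in zip(*costs.values()))

def stall_time(costs):
    """Real time each machine's thread spends waiting at barriers."""
    total = parallel_time(costs)
    return {name: total - sum(row) for name, row in costs.items()}

# Hypothetical real-time cost of simulating quanta 1..3 on machines A, B, C.
unbalanced = {"A": [2, 5, 1], "B": [4, 1, 3], "C": [3, 2, 6]}
balanced = {"A": [3, 3, 3], "B": [3, 3, 3], "C": [3, 3, 3]}

for name, costs in (("unbalanced", unbalanced), ("balanced", balanced)):
    speedup = serial_time(costs) / parallel_time(costs)
    print(name, serial_time(costs), parallel_time(costs), round(speedup, 2), stall_time(costs))
```

Both workloads need the same total simulator work (27 units), but the balanced one finishes in 9 units on three threads while the unbalanced one needs 15, because threads stall waiting for the slowest machine in each quantum.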
43Simics Accelerator vs Simics Central
Simics Central coordinates a set of separate Simics processes.
Accelerator uses multiple threads inside a single Simics instance.
[Figure: separate Simics processes on a host workstation (Simics Central) next to a single multithreaded Simics instance on a host workstation (Accelerator)]
- Accelerator advantages
- Easier to set up, control, and coordinate the simulation
- Potentially more efficient use of host machine resources
44Are Cores for Free?