Title: Climate Machine Update
1. Climate Machine Update
- David Donofrio
- RAMP Retreat
- 8/20/2008
2. Agenda
- Project Overview
- Tensilica Architecture and Design Flow
- Tensilica Tools Demo
- Why we need RAMP
- Current Progress
- Next Steps
3. A New Approach to HPC
- Current HPC design approach
- Leverage commodity processors from Intel, AMD, etc.
- Once the machine is built, optimize problems to run on it
- Power wall prevents scaling to exaflop performance
- Power is the new design point
Olukotun and Sutter
Moore's Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate
4. A New Approach to HPC
- Our approach
- Identify the application, then tailor the machine using semi-custom design
- Optimize the CPU architecture and further extend it with a semi-custom ISA
- Leverage auto-tuning to access architecture-specific optimizations
- Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
- Learn from the embedded market, where Flops / Watt and rapid design cycles are crucial
- Start with building blocks from embedded designs rather than a full-custom ASIC
- Preserve the ability to run general-purpose C code
- Application target: 1 km scale climate model
- Tailor machine architecture to the application to reduce waste
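A rough worked example of the simple-core argument above. It reads "1/4 as computationally efficient" as one quarter of the per-core throughput, which is an assumption (the slide does not define the term), and the function name is illustrative only. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def relative_power_budget(relative_throughput, efficiency_gain):
    """Power a simple core may draw, relative to a complex core.

    Efficiency is throughput per watt, so
    efficiency_gain = relative_throughput / relative_power.
    """
    return Fraction(relative_throughput) / Fraction(efficiency_gain)

# Figures quoted on the slide: 1/4 the throughput, 100x the power efficiency.
budget = relative_power_budget(Fraction(1, 4), 100)
print(budget)  # 1/400

# Simple cores needed to match one complex core's throughput (assumed reading).
cores_to_match = 1 / Fraction(1, 4)
print(cores_to_match)  # 4
```

Under this reading, each simple core would need to draw about 1/400 of the complex core's power for the 100x figure to hold.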
5. Climate Model Resource Requirements
- DOE has identified high-resolution climate modeling as a leading justification for exascale computing
- Must express 20M-way parallelism
- Requires performance of 200 Pflops peak
- Simulation must run 1000x faster than real time
- Amenable to massively concurrent architectures composed of power-efficient embedded cores
- Actively working with the climate science community to enable a new icosahedral model
NASA
Randall / CSU
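The requirements above imply some simple per-thread and wall-clock figures. The sketch below works them out, assuming the peak rate is divided evenly across all parallel threads and a 365-day year; neither assumption comes from the slide.

```python
PEAK_FLOPS = 200 * 10**15   # 200 Pflops peak, from the slide
PARALLELISM = 20 * 10**6    # 20M-way parallelism, from the slide
SPEEDUP = 1000              # simulation runs 1000x faster than real time

# Peak throughput each parallel thread must deliver (even split assumed).
flops_per_thread = PEAK_FLOPS // PARALLELISM
print(flops_per_thread / 1e9, "Gflops per thread")  # 10.0 Gflops per thread

# Wall-clock days needed to simulate one century (365-day years assumed).
days_per_century = 100 * 365 / SPEEDUP
print(days_per_century, "days per simulated century")  # 36.5 days per simulated century
```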
6. Tensilica Processor Design Flow
- Complete solution: hardware, software, and verification
- Fully customizable
- Required base ISA ensures support for general-purpose applications
- Processor configuration is submitted to Tensilica's servers, where synthesis is performed
- Returned design can be spun for ASIC or FPGA
- Bit file available for Avnet boards
- Building-block approach drastically reduces design cycle time compared to full-custom design
Tensilica Inc.
7. Tensilica Architecture Features
- Verilog-like TIE language allows for custom ISA extensions
- Functional and performance verification built in
- Auto-generated compiler intrinsics
- 64-bit IEEE-DP floating point coded up in TIE and available
- Custom VLIW support
- Inter-processor communication easily enabled through:
- TIE Ports
- TIE Queues
- Access to direct HW support for inter-processor communication
- TIE Lookups
- Allows interface to external ROMs or other RTL blocks
8. Tensilica Architecture Overview
Tensilica Inc.
9. Tensilica Performance Debug
- Processor viewed as black box
- State can be compressed (via HW) and pushed out the JTAG port
- Intended for program replay
- Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
- Hit / miss with virtual address
- Branch taken / not taken
- Call / return
- Resource dependency
- Etc.
- Opportunity for hundreds of performance counters to be made available
Tensilica Inc.
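As an illustration of how trace events of the kinds listed above could be folded into performance counters, here is a minimal sketch. The record format, the event names, and the `tally` function are all hypothetical; the slides do not describe the actual Xtensa trace port format or which structure the hit / miss events refer to.

```python
from collections import Counter

# Hypothetical trace records: (event kind, outcome). Illustrative only;
# this is not the real Xtensa trace format.
trace = [
    ("access", "hit"),
    ("access", "miss"),
    ("branch", "taken"),
    ("branch", "not_taken"),
    ("access", "hit"),
    ("call", "call"),
    ("call", "return"),
]

def tally(events):
    """Fold a stream of trace events into named performance counters."""
    counters = Counter()
    for kind, outcome in events:
        counters[f"{kind}.{outcome}"] += 1
    return counters

counters = tally(trace)
hits, misses = counters["access.hit"], counters["access.miss"]
miss_rate = misses / (hits + misses)
print(dict(counters))
print(f"miss rate: {miss_rate:.2f}")  # miss rate: 0.33
```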
10. Tensilica Tools Demo
11. Why we need RAMP
- Fast, accurate emulation enables:
- Dual nested loop of HW / SW co-design
- Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW / SW co-tuning
- RAMP is critical to accelerate:
- Rapid prototyping and analysis of Tensilica architectural options
- Inter-processor communication architecture exploration
- Running FULL climate code, providing a more complete performance picture
- Cycle-accurate simulator currently runs at 100 kHz vs. 50 MHz on V5
- Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
Tensilica-provided emulation environment kick-starts this effort
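The two clock rates quoted above give a sense of the gap between software simulation and FPGA emulation. In the sketch below the 10^12-cycle workload is a hypothetical figure chosen for illustration, not a measurement of the climate code.

```python
SIM_HZ = 100_000       # cycle-accurate simulator: 100 kHz (from the slide)
FPGA_HZ = 50_000_000   # Xtensa on Virtex-5: 50 MHz (from the slide)

speedup = FPGA_HZ // SIM_HZ
print(speedup)  # 500

def wall_clock_seconds(cycles, hz):
    """Time to execute `cycles` processor cycles at `hz` cycles per second."""
    return cycles / hz

cycles = 10**12  # hypothetical workload size
sim_days = wall_clock_seconds(cycles, SIM_HZ) / 86_400
fpga_hours = wall_clock_seconds(cycles, FPGA_HZ) / 3_600
print(f"simulator: {sim_days:.1f} days")  # simulator: 115.7 days
print(f"FPGA: {fpga_hours:.1f} hours")    # FPGA: 5.6 hours
```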
12. Current Status
- ML505 used for initial design exploration
- Basic Xtensa processor, JTAG, and memory controller occupy 50% of a Virtex-5 50T
- Runs at 50 MHz
- ASIC in 65G process runs at 650 MHz
- On-chip debug working
- Can load / run programs using main memory synthesized from BRAM
- DRAM interface coded, currently being debugged
- RTL license recently obtained; full simulation environment (in ModelSim) being brought up
13. Next Steps
- Transition to BEE3 from ML505
- Bring up XTOS environment on a single Xtensa processor on BEE3
- Run a single column of climate code on a single processor
- Demo at SC08 in November
- Continue HW / SW co-tuning optimization
- Begin multi-processor emulation
- Emulation of a single socket, 32 cores, using networked BEE3s
- Running the full 2-million-line climate model
14. Backup
15. The Need for Exascale Computing
- DOE has identified high-resolution climate modeling as a leading justification for exascale computing
- 1 km resolution targeted for an accurate cloud-resolving model
- Difficult to scale existing systems
- HPC design using commodity processors estimated to draw 179 MW
- BlueGene design estimated to draw 20 MW
- By leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected
Randall / CSU
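The power estimates above can be compared directly. This sketch assumes all three figures refer to machines of comparable capability, which the slide implies but does not state; the variable names are illustrative.

```python
COMMODITY_MW = 179       # commodity-processor design estimate
BLUEGENE_MW = 20         # BlueGene design estimate
EMBEDDED_MW = (3, 5)     # projected envelope for the embedded-core design

print(COMMODITY_MW / BLUEGENE_MW)  # 8.95

# Power reduction of the embedded design versus each alternative,
# evaluated at both ends of the 3-5 MW range.
reductions = {
    mw: (COMMODITY_MW / mw, BLUEGENE_MW / mw) for mw in EMBEDDED_MW
}
for mw, (vs_commodity, vs_bluegene) in reductions.items():
    print(f"{mw} MW: {vs_commodity:.1f}x vs commodity, {vs_bluegene:.1f}x vs BlueGene")
# 3 MW: 59.7x vs commodity, 6.7x vs BlueGene
# 5 MW: 35.8x vs commodity, 4.0x vs BlueGene
```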
LBNL will seek an external vendor to build the machine if our approach is proven valid. LBNL is not entering the commercial HPC market.