Title: Customizable Domain-Specific Computing -- Proposal for NSF
1. Customizable Domain-Specific Computing --
Proposal for NSF Expedition in Computing Program
- Point of Contact: Jason Cong
- cong_at_cs.ucla.edu
- Participating Universities
- UCLA (lead), Rice, Ohio State, and UC Santa
Barbara
- (Complete list of PI/Co-PI available inside)
- Motivation
- Overall approach
- Research plan
- Management and collaboration plan
- Value added as Expedition
- Education and outreach plan
- Deliverables and knowledge transfer
3. The Power Barrier
Source: Shekhar Borkar, Intel
4. Current Solution: Parallelization
Source: Shekhar Borkar, Intel
5. Rise of Multi-core Processors
Sony-Toshiba-IBM Cell Processor (1 PPE + 8 SPE)
Intel Larrabee (32 cores)
Nvidia's GT200 GPU (30 x 8 = 240 cores)
Sun Rock processor (4 x 4 = 16 cores)
6. Cluster of Computers
IBM BlueGene/L: No. 1 in the Top500 list of
Nov. 2007, now No. 4 in the newest Top500 list
7. Cost and Energy Are Still a Big Issue
- Cost of computing
- HW acquisition
- Energy bill
- Heat removal
- Space
8. Our Proposal: Beyond Parallelization
Customizable Domain-Specific Computing
Source: Shekhar Borkar, Intel
9. Motivation and Vision
- A few facts
- We have sufficient computing power for most
applications
- Each user/enterprise needs high computing power
for only limited tasks in his/her
application domain
- Application-specific integrated circuits (ASICs)
can lead to 1000X better power-performance
efficiency, but are too expensive to design and
manufacture
- Our vision and approach
- A general, domain-specific customizable platform
with customizable computing engines and
interconnects
- Can be customized to a wide range of applications
in the domain with novel compilation and runtime
systems
- A supercomputer-in-a-box for the intended
domain with 100X performance/power efficiency
(vs. general-purpose solutions)
- Can be mass-produced with cost efficiency
- Can be programmed efficiently
- Analogy: advance of civilization via
10. Overview of the Proposed Research
- Domain-specific modeling
- Domain-Specific Coordination Graph (DSCG) and
Domain-Specific Language Extensions (DSLEs)
- Executable models that generate application
characterizations for CHP mapping
- Creation of customizable heterogeneous platform
(CHP) for domain-specific computing
- Customizable computing engines
- Customizable interconnects
- CHP mapping
- Source-to-source CHP mapper
- Compilation for customization
- Adaptive runtime
- Application domain: healthcare
- Medical imaging
- Hemodynamic simulation
- Integration and demonstration
11. Need Slides from Glenn, Vivek, and Alex
- 1. Overview of the research tasks in the thrust
(1)
- 2. Transformative nature of research in terms of
its impact on society and the field (1)
- 3. Fundamental theoretical contribution and
implication, if applicable (1)
- 4. A well-integrated milestone chart with annual
milestones covering all activities by PI/Co-PIs
in the thrust (1)
- 5. A list of possible summer research projects
for UG and high-school students (1)
12. Application Domains: Medical Image Processing
and Hemodynamic Simulation
- Medical imaging has changed the nature of
healthcare and research
- An in vivo method for understanding the nature of
disease and the human condition
- It is estimated that medical imaging accounts for
$100 billion/year in US healthcare costs
- Better/faster algorithms can minimize the time
spent by the patient in the scanner and improve
clinical assessment
- Many advanced image processing techniques to
improve images and analyses are too slow for
clinical purposes
- Compressive sensing promises much faster imaging
but needs computationally demanding image recovery
algorithms
- Methods are needed to drive costs down while
addressing computational needs
- Hemodynamic simulation
- Surgical procedures involving blood flow and
vasculature increasingly consider hemodynamics
- Planning reduces complications during the
operation
- Simulations built from angiography can take
several days to construct
Magnetic resonance (MR) angiography of an aneurysm
Intracranial aneurysm reconstruction with
13. Application Domains: Medical Image Processing
- Minimization of energy formed of data fidelity
term (modeling Rician noise) and total variation
regularization term (non-explicit solution, many
iterations).
- One-step explicit solution, requires non-local
communication, non-iterative
total variational algorithm
highly parallel, local and global communication
sparse linear algebra, structured grid,
optimization methods
Current models use physics principles with local
(linear/nonlinear) regularization, and with local
(L2) or non-local (mutual information, MI)
similarity measures. MI requires computation of
(non-local) histograms. PDEs are nonlinear.
Full analysis of an anatomical volume (e.g.,
brain) takes 3 hours, but for real-time clinical
applications, this full pipeline must be
performed in < 2 minutes
fluid registration
parallel, global communication
dense linear algebra, optimization methods
level set methods
Involves solving a system of nonlinear PDEs and
the use of an implicit surface (level sets) that
evolve to detect boundaries (anatomical regions)
local communication
dense linear algebra, spectral methods, MapReduce
- 3D Navier-Stokes equations
- Population-based comparisons
local communication
sparse linear algebra, n-body methods, graphical
14. Application Domains: Research Tasks
- Creation of a real-time imaging pipeline
- Each step involves computationally intensive
algorithms with distinct communication and
processing patterns
- Implement current image processing and hemodynamic
simulation algorithms
- ITK concurrent collection-based implementation
- Total variational techniques
- Develop models of aneurysms for surgical planning
- Establish baseline benchmarks for core
algorithms, comparing to GPU and FPGA
- A CHP-based environment will also foster a new
class of image processing algorithms
- Compressive sensing
- Investigate new algorithms based on CHP, allowing
for changes in performance parameters given a
dynamic platform
- Evaluation of speed-up and benefit analysis
(cost, quality of life) using real clinical data
Importantly, core methods in both areas have
applications to other computational
domains.
Insert results from Yi here
15. Domain-Specific Modeling: Research Tasks
- Design and implementation of Domain-Specific
Coordination Graph (DSCG) and Domain-Specific
Language Extensions (DSLEs)
- DSLEs include domain-specific stencil
computation, type systems, data structures,
bitwidths, along with wrappers for
domain-specific libraries
- Design and implementation of simulation tools for
executable models that generate application
characterization to drive CHP Creation
- Application characterization includes
identification of intrinsic parallelism,
communication topologies, and selection of
operand clusters for customized CHP instructions
- Creation of executable models for medical imaging
and hemodynamic flow simulation domains
- Simulations should be driven by both synthetic
and (anonymized) real-world data
- Transformative nature: use of Domain-Specific
Modeling to drive CHP Creation instead of
software playing second fiddle to hardware
16. Domain-Specific Modeling: Fundamental Theoretical
Contributions and Implications
- Deterministic executable models with implicit
parallelism
- Executable models are inherently fault-tolerant
- Failed step can be re-executed without change in
semantics
- High-level stencil operations and their semantics
- $Y_{i,j} = \dots - b\,(X_{i+2,j} - 6X_{i+1,j} - 6X_{i-1,j} + X_{i-2,j} + X_{i,j+2} - 6X_{i,j+1} - 6X_{i,j-1} + X_{i,j-2} + X_{i+1,j+1} + X_{i-1,j+1} + X_{i+1,j-1} + X_{i-1,j-1})$
- For example, replace the above statement by
$Y = -b \cdot \mathrm{applyStencil}(X, St)$
- Type systems with error significance and
probabilities
- Enables new forms of local reasoning about
uncertainties and errors in software
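The stencil statement on this slide can be read as a weighted sum over neighbor offsets. The following is a minimal Python sketch under stated assumptions: terms whose operator did not survive in the slide text are taken to be additions, the names `ST`, `apply_stencil`, and `scaled_update` are hypothetical stand-ins for `St` and `applyStencil`, and boundary cells are simply left at zero.

```python
# Hypothetical stencil: (di, dj) offset -> weight, read off the slide's statement.
# Assumption: terms with no visible sign in the slide are added.
ST = {
    (2, 0): 1, (1, 0): -6, (-1, 0): -6, (-2, 0): 1,
    (0, 2): 1, (0, 1): -6, (0, -1): -6, (0, -2): 1,
    (1, 1): 1, (-1, 1): 1, (1, -1): 1, (-1, -1): 1,
}


def apply_stencil(x, stencil):
    """Return a grid whose interior cells hold the weighted neighbor sum."""
    rows, cols = len(x), len(x[0])
    # Largest offset in any direction; cells closer to the edge are skipped.
    reach = max(max(abs(di), abs(dj)) for di, dj in stencil)
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(reach, rows - reach):
        for j in range(reach, cols - reach):
            out[i][j] = sum(
                w * x[i + di][j + dj] for (di, dj), w in stencil.items()
            )
    return out


def scaled_update(x, b, stencil=ST):
    """Compute Y = -b * applyStencil(X, St) element by element."""
    s = apply_stencil(x, stencil)
    return [[-b * v for v in row] for row in s]
```

Keeping the weights in one table, separate from the loop nest, is what would let a compiler or runtime choose how to tile or parallelize the traversal without touching the numerical definition.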
17. CHP Creation: Research Tasks
- Hierarchical simulation methodology for CHP
design space exploration
- Fast analytical/statistical models
- Initial design space pruning
- Kernel-level simulation
- Cycle-accurate simulation to refine the candidate
set (e.g. SESC/MC-Sim)
- Full-system simulation
- Cycle-accurate simulation of domain applications
(e.g. SIMICS/GEMS)
- CHP creation and optimization
- Intelligent search of optimal CHP guided by
domain-specific models and knowledge
- E.g. knowledge on working set, ILP, etc.
- Pruning guided by fast analytical/statistical
models and/or kernel-level simulation
- Validation by full-system simulation
- Considering the impact of compilation and runtime
systems
- Silicon implementation of CHP prototypes
- Design based on simulation-driven exploration of
design space
- Industry partners will provide implementation
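The hierarchical methodology on this slide can be pictured as a funnel: cheap models score many candidates, and only the survivors reach the slower, more detailed simulators. The sketch below is illustrative only. The function `funnel`, the toy design space, and the three scoring functions are hypothetical stand-ins and do not model SESC, MC-Sim, SIMICS, or GEMS.

```python
from itertools import product


def funnel(candidates, stages):
    """Run candidates through (score_fn, keep) stages, cheapest model first.

    Each stage scores the survivors (lower is better) and keeps the best `keep`.
    """
    survivors = list(candidates)
    for score_fn, keep in stages:
        survivors = sorted(survivors, key=score_fn)[:keep]
    return survivors


# Toy design space: (number of cores, L2 cache size in KB).
space = list(product([2, 4, 8, 16], [256, 512, 1024, 2048]))


def analytical_model(cfg):
    """Stage 1 stand-in: a closed-form estimate, cheap to evaluate."""
    cores, cache_kb = cfg
    return 100.0 / cores + 512.0 / cache_kb


def kernel_sim(cfg):
    """Stage 2 stand-in: adds a per-core overhead the first model ignores."""
    cores, cache_kb = cfg
    return analytical_model(cfg) + 0.5 * cores


def full_system_sim(cfg):
    """Stage 3 stand-in: adds a cache-size cost seen only at full-system level."""
    cores, cache_kb = cfg
    return kernel_sim(cfg) + 0.001 * cache_kb


best = funnel(
    space,
    [(analytical_model, 8), (kernel_sim, 3), (full_system_sim, 1)],
)
```

Because each later stage can rank candidates differently from the earlier ones, the `keep` values at the cheap stages should stay generous so that a good design is not discarded before the detailed models see it.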
18. CHP Design Space Exploration
19. CHP Design Space Exploration
- Core Parameters
- NoC Parameters
20. CHP Design Space Exploration
- Core Parameters
- NoC Parameters
- Custom Instructions and Accelerators
The key question: What is the desired level of
tunability for a given domain?
21. Adaptive Interconnect with RF-I
- NoC Topology Adapts to Application Demand
- One example is application-specific shortcuts in
hybrid mesh topology
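The effect of application-specific shortcuts can be estimated by comparing traffic-weighted hop counts on a plain mesh and on the same mesh with extra links. The sketch below assumes a small 2D mesh, treats each shortcut as a single hop, and uses hypothetical names (`mesh_links`, `add_shortcuts`, `hops`, `average_hops`) and a made-up traffic table. It does not model RF-I bandwidth, arbitration, or routing policy.

```python
from collections import deque


def mesh_links(width, height):
    """Adjacency sets for a width x height 2D mesh of routers."""
    adj = {(x, y): set() for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for nx, ny in ((x + 1, y), (x, y + 1)):
            if (nx, ny) in adj:
                adj[(x, y)].add((nx, ny))
                adj[(nx, ny)].add((x, y))
    return adj


def add_shortcuts(adj, shortcuts):
    """Return a copy of the topology with bidirectional shortcut links added."""
    out = {node: set(neigh) for node, neigh in adj.items()}
    for a, b in shortcuts:
        out[a].add(b)
        out[b].add(a)
    return out


def hops(adj, src, dst):
    """Minimum hop count from src to dst by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


def average_hops(adj, traffic):
    """Traffic-weighted mean hop count; traffic maps (src, dst) -> weight."""
    total = sum(traffic.values())
    return sum(hops(adj, s, d) * w for (s, d), w in traffic.items()) / total


mesh = mesh_links(4, 4)
# Hypothetical traffic profile: one corner-to-corner flow dominates.
traffic = {((0, 0), (3, 3)): 3, ((0, 3), (3, 0)): 1}
adapted = add_shortcuts(mesh, [((0, 0), (3, 3))])
```

Placing the shortcut on the heaviest flow lowers the weighted average even though the lighter flow sees no benefit, which is the sense in which the topology adapts to application demand.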
Physical Topology
22. Tri-band On-Chip RF-I Test Results
Process: IBM 90nm CMOS Digital Process
Total 3 Channels: 30GHz, 50GHz, Base Band
Data Rate in each channel: RF Band 4Gbps, Base Band 2Gbps
Total Data Rate: 10Gbps
Bit Error Rate Across all Bands: <10E-9
Latency: 6 ps/mm
Energy Per Bit (RF): 0.09pJ/bit/mm
Energy Per Bit (BB): 0.125pJ/bit/mm
VCO power (5mW) can be shared by all (many tens)
parallel RF-I links in NoC and does not
burden individual links significantly.
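The figures on this slide combine by simple arithmetic: two RF bands at 4 Gbps plus one base band at 2 Gbps give the 10 Gbps total, and the per-millimeter latency and energy scale with link length. The short sketch below makes that explicit. The constants are taken from the slide, while the function names, the 20 mm link length, and the count of 50 links used in the usage note are hypothetical choices for illustration.

```python
# Figures taken from the test-results slide.
RF_BANDS_GHZ = (30, 50)
RF_RATE_GBPS = 4
BB_RATE_GBPS = 2
LATENCY_PS_PER_MM = 6
RF_PJ_PER_BIT_MM = 0.09
BB_PJ_PER_BIT_MM = 0.125
VCO_POWER_MW = 5


def total_data_rate_gbps():
    """Aggregate rate: one RF_RATE_GBPS stream per RF band plus the base band."""
    return len(RF_BANDS_GHZ) * RF_RATE_GBPS + BB_RATE_GBPS


def link_latency_ps(length_mm):
    """Propagation latency of a link of the given length."""
    return LATENCY_PS_PER_MM * length_mm


def energy_per_bit_pj(pj_per_bit_mm, length_mm):
    """Energy to move one bit across a link of the given length."""
    return pj_per_bit_mm * length_mm


def vco_share_mw(num_links):
    """VCO power attributed to each link when num_links share one VCO."""
    return VCO_POWER_MW / num_links
```

For a hypothetical 20 mm link, `link_latency_ps(20)` and `energy_per_bit_pj(RF_PJ_PER_BIT_MM, 20)` give the latency and RF energy per bit, and `vco_share_mw(50)` shows how small the shared VCO cost becomes when tens of links use it.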
30GHz Channel
50GHz Channel
Base Band Channel
Output Spectrum of the RF-Bands, 30GHz and 50GHz
Data Output waveform
23. Further Amortization of CHP Cost
Daughter Board CHP
- One generic CHP for all domains/applications
- Still expect better performance/power efficiency
over existing CMPs due to heterogeneity and
programmability
- One base CHP shared for a spectrum of domains
- Contains key components and tunability for the
intersection of many domain-optimal CHPs
- Further cost amortization over many domains
- Each domain has a domain-specific co-processor
- Provides further customization within a domain
- RF-I or optical provides low-latency
communication between CHP and co-processor
General Purpose CHP
RF or optical connection
Fine-grain Cores
Domain-Specific (DS) Daughter Board
DS IP blocks
Fine-grain Cores
DCT Unit
Layer 2
Layer 1
24. CHP Mapping: Overall Structure
25. CHP Mapping: Research Tasks
- Source-to-source CHP Mapper for given CHP
- Includes loop transformations, polyhedral
optimizations, space-time scheduling, RF-I
bandwidth allocation, mapping to heterogeneous
cores
- Reconfiguring and optimizing back-end
- Includes selection of register file sizes, cache
sizes, datapath bitwidths
- C/C++-to-RTL synthesizer for FPGAs
- Includes SDC scheduling, communication and
behavior co-optimization
- Adaptive Runtime
- Includes fine-grained task scheduling using
domain-specific information, as well as
adaptation to different phases of the application
- Software Reliability
- Complement hardware reliability and fault-tolerant
algorithms with type checking for error
significance, DSCG test coverage, and
re-execution of DSCG steps
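The slide mentions SDC scheduling for the C-to-RTL synthesizer without expanding the term. Assuming SDC stands for a system of difference constraints, in which every timing requirement has the form s[b] - s[a] >= w over operation start cycles, a feasible earliest schedule can be found by longest-path relaxation. The sketch below is a simplified illustration with a hypothetical function name and example operations. It omits the optimization objective and resource constraints that a real synthesizer would add.

```python
def sdc_schedule(ops, constraints):
    """Solve s[b] - s[a] >= w for every (a, b, w) in constraints.

    Returns the earliest non-negative start cycle of each operation,
    or None if the constraints contradict each other.
    """
    start = {op: 0 for op in ops}
    # Bellman-Ford-style relaxation: a feasible system settles within
    # len(ops) - 1 passes, so a change on the last pass means a positive cycle.
    for _ in range(len(ops)):
        changed = False
        for a, b, w in constraints:
            if start[a] + w > start[b]:
                start[b] = start[a] + w
                changed = True
        if not changed:
            return start
    return None


ops = ["load", "mul", "add", "store"]
constraints = [
    ("load", "mul", 2),     # mul starts at least 2 cycles after load
    ("mul", "store", 3),    # store starts at least 3 cycles after mul
    ("add", "store", 1),    # store starts at least 1 cycle after add
    ("store", "load", -6),  # store starts at most 6 cycles after load
]
schedule = sdc_schedule(ops, constraints)
```

Both data dependences and maximum-delay requirements fit the same constraint form, which is why a single solver pass can handle them together. Tightening the last constraint to 4 cycles makes the system infeasible, and the function returns `None`.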
26. Tentative Experimental Hardware Platform
- Nallatech FSB Compute module
- FPGA-based accelerator unit
- (Xilinx Virtex-5 LX330T FPGA, 51,840 slices)
- Xeon-socket compatible
- Allows stacking (2 to 4) compute modules
- NVIDIA Tesla C1070
- The fastest GPU / Computing Processor by NVIDIA
- 4GB device memory
- 30 multiprocessors (each has 8 cores)
- Standard PCI-express 2.0 interface, 8GB/s
- Intel S7000 series server motherboard
- Supporting up to 4 Xeon CPUs
- 1066 MHz FSB (bandwidth 8.5GB/s)
27. CDSC Organization
Thrust | UCLA | Rice | UCSB | Ohio State
Domain-specific specification | Bui, Reinman, Potkonjak | Sarkar, Baraniuk | | Sadayappan
CHP creation | Chang, Cong, Reinman | | Cheng |
CHP mapping | Cong, Palsberg, Potkonjak | Sarkar | | Sadayappan
Application modeling | Aberle, Bui, Vese | Baraniuk | |
Experimental systems | All (led by Cong and Bui) | All | All | All
Sarkar (Associate Dir)
28. Management and Collaboration Plan
- Director: Jason Cong (UCLA); Associate Director:
Vivek Sarkar (Rice)
- Oversee the center operation
- Research Executive Committee (REC): leaders of 4
research thrusts + 2 directors
- Monthly teleconferences to review the research
progress and facilitate inter-thrust
collaboration
- Each thrust will have weekly or biweekly meetings
driven by research milestones
- Leveraging extensive collaboration history among
PI/Co-PIs
- Everyone had/has joint projects/publications with
others in the center
- Inter-campus student exchanges are planned and
encouraged
- Three center-wide meetings each year
- January, May, and September (annual review, with
guests from NSF and industry)
- Research talks + poster sessions + brainstorm
sessions + feedback session (at annual review)
- Student activities
- Seminars and workshops on interdisciplinary
research, career development, ethics,
29. Application Domains: Milestones
- Year 1
- Identify and prioritize components of the ITK
library to transform to concurrent collections
and CHP
- Select major medical image processing algorithms
to form benchmarks as part of image pipeline
- Initiate GPU and FPGA implementations of the
selected algorithms (as appropriate)
- Identify core hemodynamic simulation algorithms
for transformation into CHP
- Establish image testbed with gold standard
results
- Year 2
- Complete base image testbed
- Demonstration of initial implementation of select
image processing algorithms on Prototype 1a
- Assess potential speed-up and subsequent points
for compiler and hardware improvements, compare
to baseline benchmarks
- Ascertain issues related to translation of C
code to target CHP code representation
- Year 3
- Complete and document initial implementation of
ITK library components
- Demonstrate remaining medical image processing
algorithms in Prototype 1b, including changes
identified in Year 2 testing
- Complete compressive sensing, TV methods
- Year 4
- Demonstrate initial hemodynamic simulation
algorithms running under CHP, Prototype 1b
- Demonstrate the adaptive runtime environment
based on the algorithms thus far for CHP
- Assess degree of recoding required to move from
C to CHP
- Perform profiling to inform improvements to
compiler and hardware implementations
- Year 5
- Final demonstration of ITK library for image
processing and hemodynamics simulation on CHP
prototype
- Evaluation of CHP performance and impact relative
to real-world clinical data
30. CHP Creation: Milestones
- Year 1
- Simulation Infrastructure
- Initial CHP prototype: COTS components (Prototype
1a) to enable SW development
- Year 2
- CHP design space exploration: initial space
pruning
- Domain-specific component synthesis and selection
- Prototype RF-I chip (Prototype 1b) with traffic
generators and multicast
- Year 3
- CHP design space exploration: refining with
kernel simulation
- CHP testbed creation: component design and unit
test
- Year 4
- CHP design space exploration: full system
simulation
- CHP testbed prototyping (Prototype 2) on FPGAs
- Year 5
- CHP testbed tapeout (Prototype 2)
- Full system integration and demonstration
31. Integrated Research and Education
- New courses planned based on the research
- Architecture and compilation for domain-specific
computing
- Computational techniques for medical imaging, and
- Programming models and application development
for domain-specific computing
- With projects for new domains, e.g. scientific
computing, VLSI CAD, and digital entertainment
- May be jointly taught (multi-disciplinary)
- Will be distributed and shared on Connexions
(cnx.org), an open-access education project that
now has about 750,000 users per month
- Graduate student training
- Estimated around 18 students in total across four
campuses
- Undergraduate student training
- 10 summer research fellowships each year, via UCLA
FOCUS, Rice AGEP and similar programs
- AGEP program especially targets women and URM
candidates
- Outreach to high-school graduates
- 5-7 each year, via UCLA SMARTS or similar programs
32. Outreach Partner: Frontier Opportunities in
Computing for Underrepresented Students (FOCUS)
- Aims at increasing the number of underrepresented
minorities interested in computing disciplines.
- Currently has 50 underrepresented undergraduates
- 23 in CS
- 27 in CSE.
- http://ceed.ucla.edu/focus/
2007 summer research poster competition
The first prize winner
33. Outreach Partner: Science Mathematics
Achievement and Research Technology for Students
(SMARTS)
- A six-week summer college preparation program at
UCLA
- Engage underrepresented students in science,
technology, engineering and math training.
- SMARTS activities
- Course-related activities
- Math courses (Intro to Statistics and AP Calculus
Readiness)
- SAT preparation
- Research activities
- Will have CDSC faculty and graduate students
involved to serve as mentors and provide projects
- This year, SMARTS program has over 80 applicants
- 30-35 will be admitted (due to limitation of
34. Possible Partner: Teach For America www.teachfora
- About 300 teachers in LA area (6000 nationwide)
- Cover 3,000 students in LA area (400,000
nationwide/year)
- 95% underrepresented students
- Initial contact with Celia Alvarado, Manager for
LA area
- Will let a CDSC speaker present at orientation of
teachers in related areas (e.g. math, science)
- TFA teachers will introduce CDSC summer program
to high school students, and recommend
students for the program
- Will contact TFA in areas close to other CDSC
35. CHP Creation: Summer Outreach Projects
- Premise
- Small-scale, introductory projects that are
self-contained
- Leverage our development infrastructure to
shorten development time
- Expose students to cutting-edge tools and ideas
- Simulation Infrastructure
- Sample Undergraduate Project: Refinement of
statistical regression models
- Sample High School Project: Exploration of new
design drivers and kernels
- Physical Design
- Sample Undergraduate Project: Analysis of
critical loops in new design drivers and creation
of custom accelerators (leveraging our design
framework)
- Sample High School Project: Basic power modeling
methodology
- RF Interconnect
- Sample Undergraduate Project: NoC exploration
with RF
36. Value Added As Expedition
37. CHP Creation: Transformative Nature of Research
- Custom Heterogeneous CMPs
- Conventional designs exploit parallelism with
homogeneous resources
- Designs are general
- Resource allocation (i.e. datapath width, cache
organization, types of functional units)
- Instruction implementation (i.e. generally
designed ISAs and machine organization)
- Next transformative step in computing
- Domain-specific integration of
- Tunable processing cores (i.e. match the resource
requirements of the application)
- Programmable fabric (i.e. offload critical
computation to customized bit-parallel datapaths)
- Power-efficient domain-specific performance
- Reconfigurable Network on Chip
- Adapts to domain demand
- Provide power-hungry bandwidth only where it is
required
- Shift away from designing for worst-case behavior
- Two emerging technologies each enable efficient
reconfiguration
- RF Interconnect
- Optical Interconnect
38. Knowledge Transfer
- Main outcome of the project
- CHP prototypes
- Compilation and runtime system for CHP mapping
- Application drivers: both the original source
code and modified source code with
domain-specific modeling
- General methodology for customizable computing
(mainly through publications)
- Items 1-3 above will be shared with the research
community via web as they become available
- Industrial partners
- Altera, IBM, Intel, Magma, Mentor Graphics,
Nvidia, Xilinx
- More will be contacted and included if the
project is officially funded
- Startup experience
- Aplus Design Technologies (acquired by Magma in
2003), AutoESL Design Technologies (Magma and
Xilinx were investors)
- Extensive experience working with Office of IP
Administration (OIPA) for tech transfer
39. Backup Slides
40. Domain of Focus: Health Care
- Health care consumes 16% of US economic output
as of 2004 and is still increasing rapidly
- Has the most direct impact on quality of life
- A revolution in this area, driven by the rapid
advance of computer and information technologies,
arguably has the most significant impact on
society and the national economy
- Many problems in this domain are extremely
computationally challenging and beyond the reach
of current computing technology
41. Simulation Framework
- Statistical/Analytical Models: initial design
space pruning
- Architecture-agnostic application model
- Working set size, thread scaling, sequential vs.
parallel sections, etc.
- Core model
- NoC/coherence model
- MC-Sim: design space refinement
- Cycle-accurate simulator based on SESC (MIPS
emulation)
- No operating system overhead: useful for running
application kernels
- SIMICS/GEMS: full system simulation
- Cycle-accurate simulation with operating system
support
- Used to run the full applications in the domain
42. Intelligent Design Space Exploration
- Pruning
- Initial studies with statistical/analytical
models and kernel simulation will prune portions
of the design space
- Pruning will be conservative to avoid false
negatives
- Guided Space Exploration
- Rather than explore the entire space in a
brute-force manner, we will filter certain
architectural parameters based on domain-specific
knowledge
- Example 1: Working set size is 2 MB for a
particular domain
- We will guide the design space to only consider
architectural configurations which directly (e.g.
2MB caches) or indirectly (e.g. aggressive
prefetching or thread-level speculation) provide
this effective working set size
- Example 2: Domain has very limited ILP
- We will prune dynamically scheduled cores from
the space to avoid wasting power on needless ILP
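The two examples on this slide amount to rule-based filters over candidate configurations. The sketch below encodes them directly. The `Config` and `DomainProfile` classes and their fields are hypothetical, and treating prefetching as always sufficient to cover the working set is a deliberate simplification of the "indirect" case.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    cache_mb: float      # on-chip cache capacity
    prefetch: bool       # aggressive prefetching available
    out_of_order: bool   # dynamically scheduled core


@dataclass(frozen=True)
class DomainProfile:
    working_set_mb: float
    limited_ilp: bool


def covers_working_set(cfg, profile):
    """Example 1: direct (large cache) or indirect (prefetching) coverage."""
    return cfg.cache_mb >= profile.working_set_mb or cfg.prefetch


def keep(cfg, profile):
    """Apply both domain rules; True means the design point survives."""
    if not covers_working_set(cfg, profile):
        return False
    # Example 2: with little ILP, dynamic scheduling only costs power.
    if profile.limited_ilp and cfg.out_of_order:
        return False
    return True


def prune(space, profile):
    """Return the design points that satisfy every domain rule."""
    return [cfg for cfg in space if keep(cfg, profile)]


space = [
    Config(c, p, o)
    for c, p, o in product([0.5, 2.0], [False, True], [False, True])
]
survivors = prune(space, DomainProfile(working_set_mb=2.0, limited_ilp=True))
```

Expressing each piece of domain knowledge as a separate predicate keeps the pruning auditable: a rule that turns out to be too aggressive can be relaxed without touching the others.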
43. Connexions: Open Access Course Ware
- Connexions (cnx.org) is a non-profit open access
educational publishing system based at Rice
- goal: make high-quality educational content
available for free on the web and at very low
cost in print
- open-licensed repository of 10,000 Lego-block
modules for authors, instructors, learners
- global reach: >1M users monthly from nearly
200 countries
- A vibrant community will develop around the
materials (think Wikipedia)
- Will take leadership role in quality evaluation
of open materials
- build on successful Connexions partnership with