Title: Advanced Architectures CSE 190
1. Advanced Architectures, CSE 190
- Reagan W. Moore
- San Diego Supercomputer Center
- moore@sdsc.edu
- http://www.npaci.edu/DICE
2. Course Organization
- Professors / TA
- Sid Karin - Director, San Diego Supercomputer Center <skarin@sdsc.edu>
- Reagan Moore - Associate Director, SDSC <moore@sdsc.edu>
- Holly Dail - UCSD TA <hdail@cs.ucsd.edu>
- Seminars
- State of the art computer architectures
- Mid-term / SDSC tour
- Final exam
3. Seminars
- 4/3 Reagan Moore - Performance evaluation heuristics modeling
- 4/10 Sid Karin - Historical perspective
- 4/17 Richard Kaufmann, Compaq - Teraflops systems
- 4/24 IBM or Sun
- 5/1 Mark Seager, LLNL - ASCI 10 Tflops computer
- 5/8 Midterm / SDSC Tour
- 5/15 John Feo, Tera - Multi-threaded architectures
- 5/22 Peter Beckman, LANL - Clusters
- 5/29 Holiday / no class
- 6/5 Thomas Sterling, Caltech - Petaflops computers
- 6/12 Final exam
4. Supercomputers for Simulation and Data Mining
[Diagram: Data Mining, Distributed Archives, Application, Collection Building, Information Discovery, Digital Library]
5. Heuristics for Characterizing Supercomputers
- Generators of data - numerically intensive computing
- Usage models for the rate at which supercomputers move data between memory, disk, and archives
- Usage models for capacity of the data caches (memory size, local disk, and archival storage)
- Analyzers of data - data intensive computing
- Performance models for combining data analysis with data movement (between caches, disks, archives)
6. Heuristics
- Experience based models of computer usage
- Dependent on computer architecture
- Presence of data caches, memory-mapped I/O
- Architectures used at SDSC
- CRAY vector computers
- X/MP, Y/MP, C-90, T-90
- Parallel computers
- MPPs - iPSC/860, Paragon, T3D, T3E
- Clusters - SP
7. Supercomputer Data Flow Model
8. Y-MP Heuristics
- Utilization measured on Cray Y-MP
- Real memory architecture - entire job context is in memory, no paging of data
- Exceptional memory bandwidth
- I/O rate from CPU to memory was 28 Bytes per cycle
- Maximum execution rate was 2 Flops per cycle
- Scaled memory on C-90 to test heuristics
- Noted that increasing memory from 1 GB to 2 GBs decreased idle time from 10% to 2%
- Sustained execution rate was 1.8 GFlops
9. Data Generation Metrics
[Diagram: data generation flow]
- CPU to Memory: 7 Bytes/Flop
- Memory: 1 Byte of storage per Flops
- Memory to Local Disk: 1 Byte/60 Flops
- Local Disk: holds data for 1 day; 1/7 of data persists for a day; 1/7 of data is sent to the archive
- Archive Disk: holds data for 1 week; all data is sent to tape
- Archive Tape: holds data forever
10. Peak Teraflops System
[Diagram: TeraFlops system, with rates and capacities to be determined]
- Compute Engine: 0.5-1 TB memory, sustains ? GF
- Local Disk (1 day cache): ? TB, at ? GB/sec
- Archive Disk (1 week cache): ? TB, at ? MB/sec
- Archive Tape: ? PB, at ? MB/sec
11. Data Sizes on Disk
- How much scratch space is used by each job?
- Disk space is 20 - 40 times the memory size.
- Data lasts for about one day
- Average execution time for long running jobs
- 30 minutes to 1 hour
- For jobs using all of memory
- Between 48 and 24 jobs per day
- Each job uses (Disk space) / (Number of jobs)
- Or (40/48) of memory, about 80% of memory
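The arithmetic above can be sketched in a few lines; function and parameter names are mine, and the values are the slide's worst-case figures (disk at 40x memory, 48 full-memory jobs per day).

```python
# Per-job scratch-space estimate, assuming disk turns over roughly daily
# so the day's jobs share the full scratch pool.

def scratch_per_job(memory_gb, disk_to_memory_ratio, jobs_per_day):
    """Scratch space used by one job, as a fraction of memory size."""
    disk_gb = disk_to_memory_ratio * memory_gb
    per_job_gb = disk_gb / jobs_per_day
    return per_job_gb / memory_gb

# Worst case from the slide: 40x memory on disk, 48 jobs/day.
frac = scratch_per_job(memory_gb=1024, disk_to_memory_ratio=40, jobs_per_day=48)
print(f"per-job scratch = {frac:.2f} x memory")  # 40/48 = 0.83
```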
12. Peak Teraflops Data Flow Model
[Diagram: TeraFlops system data flow]
- Compute Engine: 0.5-1 TB memory, sustains 150 GF
- Local Disk (1 day cache): 10 TB, at 1 GB/sec
- Archive Disk (1 week cache): 5 TB, at 40 MB/sec
- Archive Tape: 0.5-1 PB, at 40 MB/sec
13. HPSS Archival Storage System
14. Equivalent of Ohm's Law for Computer Science
- How does one relate application requirements to computation rates and I/O bandwidths?
- Use a prototype data movement problem to derive physical parameters that characterize applications.
15. Data Distribution Comparison
Reduce the size of the data from S bytes to s bytes and analyze.
- The data resides at the Data Handling Platform; the analysis result is needed at the Supercomputer.
- Execution rates: r (data handling platform) < R (supercomputer)
- Bandwidths linking the systems: B (local read), b (network)
- Operations per bit for analysis: C
- Operations per bit for data transfer: c
Should the data reduction be done before transmission?
16. Distributing Services
Compare the times for analyzing data with size reduction from S to s.

Case 1 - reduce at the Data Handling Platform, then transmit the reduced data to the Supercomputer:
- Read Data: S / B
- Reduce Data: C S / r
- Transmit Data: c s / r
- Network: s / b
- Receive Data: c s / R
T(Archive) = S/B + CS/r + cs/r + s/b + cs/R

Case 2 - transmit all of the data, then reduce at the Supercomputer:
- Read Data: S / B
- Transmit Data: c S / r
- Network: S / b
- Receive Data: c S / R
- Reduce Data: C S / R
T(Super) = S/B + cS/r + S/b + cS/R + CS/R
17. Comparison of Time
18. Optimization Parameter Selection
We have an algebraic inequality with eight independent variables. T(Super) < T(Archive) becomes

S/B + cS/r + S/b + cS/R + CS/R < S/B + CS/r + cs/r + s/b + cs/R

Which variable provides the simplest optimization criterion?
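The two times are simple enough to evaluate directly. A minimal sketch, with the slide's symbols as parameter names and made-up example values (the slides state C and c per bit; using per-byte units rescales both sides equally and does not change the comparison):

```python
# Times for the two service-distribution strategies from the slides.
# S/s: data sizes (bytes), R/r: execution rates (ops/sec),
# B/b: bandwidths (bytes/sec), C/c: ops per byte for analysis/transfer.

def t_archive(S, s, R, r, B, b, C, c):
    """Reduce S -> s at the data handling platform, then transmit s."""
    return S/B + C*S/r + c*s/r + s/b + c*s/R

def t_super(S, s, R, r, B, b, C, c):
    """Transmit all S bytes, then reduce on the supercomputer."""
    return S/B + c*S/r + S/b + c*S/R + C*S/R

# Example: a high-complexity analysis (large C) favors moving the data
# to the faster machine despite the larger transfer.
params = dict(S=1e9, s=1e7, R=1e9, r=1e8, B=1e8, b=1e7, C=1000.0, c=1.0)
print(t_super(**params) < t_archive(**params))  # True
```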
19. Scaling Parameters
- Data size reduction ratio: s/S
- Execution slow down ratio: r/R
- Problem complexity: c/C
- Communication/Execution balance: r/(cb)

Note: (r/c) is the number of bits/sec that can be processed.
When r/(cb) = 1, the data processing rate is the same as the data transmission rate.
Optimal designs have r/(cb) = 1.
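The balance parameter is a one-line computation; a tiny sketch with illustrative values (names mine):

```python
# r/(c*b): ratio of processing rate (r/c bits/sec) to link rate (b).
def balance(r_ops_per_sec, c_ops_per_bit, b_bits_per_sec):
    """Communication/execution balance r/(cb); 1.0 means balanced."""
    return r_ops_per_sec / (c_ops_per_bit * b_bits_per_sec)

print(balance(1e9, 10.0, 1e8))  # 1.0: the design is balanced
```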
20. Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the network is sufficiently fast.
21. Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the supercomputer is sufficiently fast:

R > r [1 + (c/C)(1 - s/S)] / [1 - (c/C)(1 - s/S)(1 + r/(cb))]

Note that the denominator changes sign when C < c (1 - s/S)(1 + r/(cb)). Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
22. Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:

s > S [1 - (C/c)(1 - r/R) / (1 + r/R + r/(cb))]

Note that the criterion changes sign when C > c (1 + r/R + r/(cb)) / (1 - r/R). When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit.
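The execution-rate criterion from slide 21 can be checked numerically against the direct time comparison. A sketch with made-up parameter values; all symbols follow the slides:

```python
# Verify that R just above the slide's threshold makes T(Super) win,
# and R just below it makes T(Archive) win.

def t_archive(S, s, R, r, B, b, C, c):
    return S/B + C*S/r + c*s/r + s/b + c*s/R

def t_super(S, s, R, r, B, b, C, c):
    return S/B + c*S/r + S/b + c*S/R + C*S/R

def r_threshold(S, s, r, b, C, c):
    """R above which T(Super) < T(Archive); valid when denominator > 0."""
    num = 1 + (c/C) * (1 - s/S)
    den = 1 - (c/C) * (1 - s/S) * (1 + r/(c*b))
    return r * num / den

p = dict(S=1e9, s=1e8, r=1e8, b=1e7, C=100.0, c=1.0)
Rt = r_threshold(**p)
above = t_super(R=Rt*1.01, B=1e8, **p) < t_archive(R=Rt*1.01, B=1e8, **p)
below = t_super(R=Rt*0.99, B=1e8, **p) < t_archive(R=Rt*0.99, B=1e8, **p)
print(above, below)  # True False
```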
23. Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), when the analysis is sufficiently complex.
24. Characterization of Supercomputer Systems
- Sufficiently high complexity
- Move data to processing engine
- Digital Library execution of remote services
- Traditional supercomputer processing of applications
- Sufficiently low complexity
- Move process to the data source
- Metacomputing execution of remote applications
- Traditional digital library service
25. Computer Architectures
- Processor in memory
- Do computations within memory
- Complexity of supported operations
- Commodity processors
- L2 caches
- L3 caches
- Parallel computers
- Memory bandwidth between nodes
- MPP - shared memory
- Cluster - distributed memory
26. Characterization Metric
- Describe systems in terms of their balance
- Optimal designs have r/(cb) = 1
- Equivalent of Ohm's law
- R = C B
- Characterize applications in terms of their complexity
- Operations per byte of data
- C = R / B
27. Second Example
- Inclusion of latency (time for a process to start) and overhead (time to execute the communication protocol)
- Illustrate with a combined optimization of the use of network and CPU
28. Optimizing Use of Resources
- Compare the time needed to do calculations with the time needed to access data over a network
- Time spent using a CPU
- Execution time + protocol processing time
- Time = Cc Sc / Rc + Cp St / Rp
- Where
- St = size of transmitted data (bytes)
- Sc = size of application data (bytes)
- Cc = number of operations per byte of application data
- Cp = number of operations per byte to process protocol
- Rc = execution rate of application
- Rp = execution rate of protocol
29. Characterizing Latency
- Time during which a network transmits data
- Latency for initiating transfer + transmission time
- Time = L + St / B
- Where
- L = round trip latency at the speed of light (sec)
- B = bandwidth (bytes/sec)
30. Solve for Balanced System
- CPU utilization time = Network utilization time
- Solve for the transmission size as a function of Sc/St
- St = L B / [B Cp / Rp + (B Cc / Rc)(Sc / St) - 1]
- A solution exists when Sc/St > (Rc / (B Cc))(1 - B Cp / Rp)
- and B Cp / Rp < 1
31. Comparing Utilization of Resources
- Network utilization
- Un = Transmission time / (Transmission + latency)
- Un = 1 / [1 + (L B / St)]
- CPU utilization
- Uc = Execution time / (Execution + Protocol processing)
- Uc = 1 / [1 + (Cp Rc) / (Cc Rp) (St / Sc)]
- Define h = Sc / St
32. Comparing Efficiencies
[Plot: utilization U-cpu and U-network versus h = S-compute / S-transmit]
33. Crossover Point
- When utilization of bandwidth and execution resources is balanced
- 1 / [1 + (L B / St)] = 1 / [1 + (Cp Rc) / (Cc Rp) / h]
- For optimal St, solve for h = Sc/St, and find
- h = (Rc Cp / (2 Rp Cc)) [sqrt(1 + 4 Rp / (Cp B)) - 1]
- For small B Cp / Rp
- h = Rc / (Cc B), or St / B = Sc Cc / Rc
- And transmission time = execution time
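In terms of h = Sc/St, Un falls and Uc rises as h grows, so the two curves cross once. A sketch (parameter values illustrative) that locates the crossing numerically by bisection rather than relying on the closed form:

```python
# Utilization curves from the slides, and a bisection search for the
# single crossover where network and CPU utilization are equal.

def u_network(h, Sc, L, B):
    St = Sc / h                              # transmitted size for this h
    return 1.0 / (1.0 + L * B / St)

def u_cpu(h, Cc, Rc, Cp, Rp):
    return 1.0 / (1.0 + (Cp * Rc) / (Cc * Rp) / h)

def crossover(Sc, L, B, Cc, Rc, Cp, Rp, lo=1e-9, hi=1e9):
    f = lambda h: u_network(h, Sc, L, B) - u_cpu(h, Cc, Rc, Cp, Rp)
    for _ in range(200):                     # bisect f(h) = 0
        mid = (lo * hi) ** 0.5               # geometric midpoint (h spans decades)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo * hi) ** 0.5

# With these values the curves reduce to 1/(1+h) and 1/(1+1/h),
# so the crossover is at h = 1.
h_star = crossover(Sc=1e6, L=0.01, B=1e8, Cc=10.0, Rc=1e9, Cp=1.0, Rp=1e8)
print(h_star)
```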
34. Application Summary
- Optimal application for a given architecture
- B Cc / Rc = 1
- (Bytes/sec) (Operations/byte) / (Operations/sec) = 1
- Cc = Rc / B
- Also need the cost of network utilization to be small
- B Cp / Rp < 1
- And the amount of data transmitted proportional to latency
- St = L B / [B Cp / Rp + (B Cc / Rc)(Sc / St) - 1]
35. Further Information
http://www.npaci.edu/DICE