IRAM and ISTORE Projects

About This Presentation

Slides: 57

Provided by: davidoppe

Learn more at: http://iram.cs.berkeley.edu

Transcript and Presenter's Notes

1
IRAM and ISTORE Projects

Aaron Brown, James Beck, Rich Fromm, Joe Gebis,
Paul Harvey, Adam Janin, Dave Judd,
Kimberly Keeton, Christoforos Kozyrakis, David
Martin, Rich Martin, Thinh Nguyen, David
Oppenheimer, Steve Pope, Randi Thomas,
Noah Treuhaft, Sam Williams, John
Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/istore
Winter 2000 IRAM/ISTORE Retreat

2
Intelligent RAM: IRAM

Microprocessor + DRAM on a single chip
10X capacity vs. DRAM
on-chip memory latency 5-10X, bandwidth 50-100X
improve energy efficiency 2X-4X (no off-chip bus)
serial I/O 5-10X vs. buses
smaller board area/volume
IRAM advantages extend to:
a single chip system
a building block for larger systems

3
IRAM Vision: Intelligent PDA

Pilot PDA
+ gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
+ Wireless data (WWW)
+ Speech, vision, video
+ Voice output for conversations

Speech control + Vision to see, scan documents, read bar code, ...
4
ISTORE Hardware Vision

System-on-a-chip enables computer, memory,
without significantly increasing size of disk
5-7 year target

MicroDrive: 1.7" x 1.4" x 0.2"; 2006: ?
1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor
2x height
Connected via crossbar switch
growing like Moore's law
10,000 nodes in one rack!
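
The 2006 MicroDrive figures follow from compounding the 1999 figures at the growth rates quoted above. A minimal sketch of that arithmetic, assuming seven full years of growth (1999 to 2006); the function and variable names are illustrative only:

```python
def project(value, growth_per_year, years):
    """Compound a starting value at a fixed yearly growth factor."""
    return value * growth_per_year ** years


YEARS = 2006 - 1999  # assumption: seven full years of compounding

capacity_mb_2006 = project(340, 1.6, YEARS)   # 1999 MicroDrive: 340 MB, 1.6X/yr
bandwidth_mbs_2006 = project(5, 1.4, YEARS)   # 1999 MicroDrive: 5 MB/s, 1.4X/yr

print(f"2006 capacity:  {capacity_mb_2006 / 1000:.1f} GB")
print(f"2006 bandwidth: {bandwidth_mbs_2006:.1f} MB/s")
```

Under these assumptions the projection comes out at roughly 9.1 GB and 52.7 MB/s, consistent with the rounded 9 GB and 50 MB/s on the slide.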

5
VIRAM System on a Chip

Prototype scheduled for tape-out 1H 2000
0.18 um EDL process
16 MB DRAM, 8 banks
MIPS Scalar core and caches @ 200 MHz
4 64-bit vector unit pipelines @ 200 MHz
4 100 MB parallel I/O lines
17x17 mm, 2 Watts
25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
1.6 Gflops (64-bit), 6.4 GOPs (16-bit)

[Floorplan labels: Memory (64 Mbits / 8 MBytes), Xbar, I/O, Memory (64 Mbits / 8 MBytes)]
6
IRAM Architecture Update

ISA mostly frozen since 6/99
better fixed-point model and instructions
gained some experience using them over past year
better exception model
better support for short vectors
auto-increment memory addressing
instructions for in-register reductions
butterfly-permutations
memory consistency model spec refined (poster)
Suite of simulators actively used and maintained
vsim-isa (functional), vsim-p (performance),
vsim-db (debugger), vsim-sync (memory
synchronization)

7
IRAM Software Update

Vectorizing Compiler for VIRAM
retargeting CRAY vectorizing compiler (talk)
Initial backend complete: scalar and vector instructions
Extensive testing for correct functionality
Instruction scheduling and performance tuning begun
Applications using compiler underway
Speech processing (talk)
Small benchmarks; suggestions welcome
Hand-coded fixed point applications
Video encoder application complete (poster)
FFT: floating point done, fixed point started (talk)

8
IRAM Chip Update

IBM to supply embedded DRAM/Logic (98%)
DRAM macro added to 0.18 micron logic process
DRAM specs under NDA; final agreement in UCB bureaucracy
MIPS to supply scalar core (99%)
MIPS processor, caches, TLB
MIT to supply FPU (100%)
single precision (32 bit) only
VIRAM-1 Tape-out scheduled for mid-2000
Some updates of micro-architecture based on benchmarks (talk)
Layout of multiplier (poster), register file nearly complete
Test strategy developed (talk)
Demo system high level hardware design complete (talk)
Network interface design complete (talk)

9
VIRAM-1 block diagram
10
Microarchitecture configuration

2 arithmetic units
both execute integer operations
one executes FP operations
4 64-bit datapaths (lanes) per unit
2 flag processing units
for conditional execution and speculation support
1 load-store unit
optimized for strides 1,2,3, and 4
4 addresses/cycle for indexed and strided
operations
decoupled indexed and strided stores

Memory system
8 DRAM banks
256-bit synchronous interface
1 sub-bank per bank
16 Mbytes total capacity
Peak performance
3.2 GOPS64, 12.8 GOPS16 (w. madd)
1.6 GOPS64, 6.4 GOPS16 (wo. madd)
0.8 GFLOPS64, 3.2 GFLOPS32 (w. madd)
6.4 Gbyte/s memory bandwidth
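
The integer and 32-bit FP peak rates above can be reproduced from the unit counts on this slide. The sketch below is one reading of the slide, not a statement of how the figures were actually derived: it assumes a 200 MHz clock for both the datapaths and the memory interface, that a 64-bit lane splits into 64/width sub-words, and that a multiply-add counts as two operations. The names are illustrative.

```python
CLOCK_HZ = 200e6   # assumption: datapaths and memory interface both run at 200 MHz
LANES = 4          # 64-bit lanes per arithmetic unit
LANE_BITS = 64


def peak_gops(units, width_bits, madd):
    """Peak giga-operations/s for `units` arithmetic units at a given data width."""
    subwords = LANE_BITS // width_bits          # narrower data packs more ops per lane
    ops_per_cycle = units * LANES * subwords * (2 if madd else 1)
    return ops_per_cycle * CLOCK_HZ / 1e9


def memory_gbytes_per_s(interface_bits=256):
    """Peak bandwidth of the synchronous memory interface."""
    return interface_bits / 8 * CLOCK_HZ / 1e9


print(peak_gops(2, 64, madd=True), peak_gops(2, 16, madd=True))    # integer, w. madd
print(peak_gops(2, 64, madd=False), peak_gops(2, 16, madd=False))  # integer, w/o madd
print(peak_gops(1, 32, madd=True))                                 # FP32 on the one FP unit
print(memory_gbytes_per_s())
```

This simple model matches the GOPS64, GOPS16, GFLOPS32, and memory bandwidth entries; it does not attempt to explain the GFLOPS64 entry.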

11
Media Kernel Performance
12
Base-line system comparison

All numbers in cycles/pixel
MMX and VIS results assume all data in L1 cache

13
Scaling to 10K Processors

IRAM + micro-disk offer huge scaling opportunities
Still many hard system problems, SAM AME (talk)
Availability
24x7 databases without human intervention
Discrete vs. continuous model of machine being up
Maintainability
42% of system failures are due to administrative errors
self-monitoring, tuning, and repair
Evolution
Dynamic scaling with plug-and-play components
Scalable performance, gracefully down as well as up
Machines become heterogeneous in performance at scale

14
ISTORE-1 Hardware for AME

Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
intelligence used to collect and filter monitoring data
diagnostics and fault injection enhance robustness
networked to create a scalable shared-nothing cluster

Intelligent Disk "Brick"
Portable PC Processor: Pentium II + DRAM
Redundant NICs (4 100 Mb/s links)
Diagnostic Processor

Intelligent Chassis
80 nodes, 8 per tray
2 levels of switches
20 100 Mb/s
2 1 Gb/s
Environment Monitoring:
UPS, redundant PS, fans, heat and vibration sensors...

15
ISTORE Brick Block Diagram
[Block diagram labels: Mobile Pentium II Module, CPU, North Bridge, South Bridge, PCI, SCSI, Disk (18 GB), DRAM 256 MB, Super I/O, BIOS, Dual UART, Ethernets 4x100 Mb/s, Diagnostic Net, Diagnostic Processor, Monitor & Control, Flash, RTC, RAM]

Sensors for heat and vibration
Control over power to individual nodes

16
ISTORE Software Approach

Two-pronged approach to providing reliability
1) reactive self-maintenance: dynamic reaction to exceptional system events
self-diagnosing, self-monitoring hardware
software monitoring and problem detection
automatic reaction to detected problems
2) proactive self-maintenance: continuous online self-testing and self-analysis
automatic characterization of system components
in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they're used

17
ISTORE Applications

Storage-intensive, reliable services for ISTORE-1:
infrastructure for thin clients, e.g., PDAs
web services, such as mail and storage
large-scale databases (talk)
information retrieval (search and on-the-fly indexing)
Scalable memory-intensive computations for ISTORE in 2006:
Performance estimates through IRAM simulation model
not major emphasis
Large-scale defense and scientific applications enabled by high memory bw and arithmetic performance

18
Performance Availability

System performance limited by the weakest link
NOW Sort experience: performance heterogeneity is the norm
disks: inner vs. outer track (50%), fragmentation
processors: load (1.5-5x) and heat
Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

19
ISTORE Update

High level hardware design by UCB complete (talk)
Design of ISTORE boards handed off to Anigma
First run complete; SCSI problem to be fixed
Testing of UCB design (DP) to start asap
10 nodes by end of 1Q 2000, 80 by 2Q 2000
Design of BIOS handed off to AMI
Most parts donated or discounted
Adaptec, Andataco, IBM, Intel, Micron, Motorola, Packet Engines
Proposal for Quantifying AME (talk)
Beginning work on short-term applications, which will be used to drive principled system design
Mail server
Web server
Large database
Decision support primitives

20
Conclusions

IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth
Mobile consumer electronic devices
Scaleable infrastructure
IRAM benchmarking result: faster than DSPs
ISTORE: hardware/software architecture for large scale network services
Scaling systems requires
new continuous models of availability
performance not limited by the weakest link
self-* systems to reduce human interaction

21
Backup Slides
22
Introduction and Ground Rules

Who is here?
Mixed IRAM/ISTORE experience
Questions are welcome during talks
Schedule: lecture from Brewster Kahle during Thursday's Open Mic Session.
Feedback is required (Fri am)
Be careful, we have been known to listen to you
Mixed experience: please ask
Time for skiing and talking tomorrow afternoon

23
2006 ISTORE

ISTORE node
Add 20% pad to MicroDrive size for packaging, connectors
Then double thickness to add IRAM
2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
Crossbar switches growing by Moore's Law
2x/1.5 yrs => 4X transistors/3 yrs
Crossbars grow by N^2 => 2X switch/3 yrs
16 x 16 in 1999 => 64 x 64 in 2005
ISTORE rack (19" x 33" x 84")
1 tray (3" high) => 16 x 32 => 512 ISTORE nodes / tray
20 trays + switches + UPS => 10,240 ISTORE nodes / rack (!)
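
The sizing chain on this slide can be checked step by step. The sketch below assumes the MicroDrive dimensions are in inches (1.7 x 1.4 x 0.2), that the pad is 20% on every dimension, that inches are rounded to one decimal before converting to millimetres, and that crossbar cost grows with the square of the port count. Variable names are illustrative.

```python
import math

MM_PER_INCH = 25.4
PAD = 1.20  # assumption: 20% pad applied to every dimension

# ISTORE node: padded MicroDrive, then doubled thickness for the IRAM.
microdrive_in = (1.7, 1.4, 0.2)
length, width, thickness = (d * PAD for d in microdrive_in)
thickness *= 2
node_in = tuple(round(d, 1) for d in (length, width, thickness))
node_mm = tuple(round(d * MM_PER_INCH) for d in node_in)

# Crossbar: 4X transistors per 3 years, cost ~ N^2, so ports double per 3 years.
transistor_growth_per_3yrs = 4
port_growth_per_3yrs = math.isqrt(transistor_growth_per_3yrs)
ports_2005 = 16 * port_growth_per_3yrs ** ((2005 - 1999) // 3)

# Rack: 16 x 32 nodes per tray, 20 trays per rack.
nodes_per_tray = 16 * 32
nodes_per_rack = 20 * nodes_per_tray

print(node_in, node_mm, ports_2005, nodes_per_tray, nodes_per_rack)
```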

24
IRAM/VSUIF Decryption (IDEA)
[Chart axis labels: lanes, virtual processor width]

IDEA Decryption operates on 16-bit ints
Compiled with IRAM/VSUIF
Note scalability of both lanes and data width
Some hand-optimizations (unrolling) will be automated by Cray compiler

25
1D FFT on IRAM

FFT study on IRAM
bit-reversal time included; cost hidden using indexed store
Faster than DSPs on floating point (32-bit) FFTs
CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)
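
One way to read "cost hidden using indexed store" is that the bit-reversal permutation is folded into the final write: each result is scattered straight to its bit-reversed address instead of being reordered in a separate pass. The sketch below illustrates that idea in plain Python; it is not the VIRAM code, and the function names are hypothetical.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def store_bit_reversed(values):
    """Write values to bit-reversed positions with one indexed (scatter) store."""
    n = len(values)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "length must be a power of two"
    index = [bit_reverse(i, bits) for i in range(n)]   # the index vector
    out = [None] * n
    for i, v in enumerate(values):
        out[index[i]] = v                              # scatter: out[index[i]] = v
    return out


print(store_bit_reversed(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```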

26
3D FFT on ISTORE 2006

Performance of large 3D FFTs depends on 2 factors
speed of 1D FFT on a single node (next slide)
network bandwidth for transposing data
1.3 Tflop FFT possible w/ 1K IRAM nodes, if network bisection bandwidth scales (!)

27
ISTORE-1 System Layout
28
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz; 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB

[Block diagram labels: 2-way Superscalar Processor, Vector Instruction Queue, 16K I cache, 16K D cache, Vector Registers, Load/Store, 4 x 64 or 8 x 32 or 16 x 16, 4 x 64, 4 x 64]
29
Fixed-point multiply-add model
[Datapath diagram: Multiply half word, Shift & Round, Add & Saturate; signal labels z, x, y, w, a; widths n and n/2; Round; sat]

Same basic model, different set of instructions
fixed-point: multiply & shift & round, shift right & round, shift left & saturate
integer saturated arithmetic: add or sub & saturate
added multiply-add instruction for improved
performance and energy consumption
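
A rough software model of the multiply-add datapath may help when reading the diagram. The sketch below assumes x and y are signed n/2-bit operands, w is a signed n-bit addend, a is the right-shift amount, rounding is to nearest with ties toward positive infinity, and the result z saturates to n bits. These are assumptions made for illustration; the slide does not pin down the rounding mode or operand roles.

```python
def saturate(value, bits):
    """Clamp value to the signed two's-complement range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def fixed_madd(x, y, w, a, n=32):
    """z = sat_n(w + round((x * y) >> a)) for n/2-bit x, y and n-bit w."""
    half = n // 2
    assert saturate(x, half) == x and saturate(y, half) == y, "x, y must fit in n/2 bits"
    product = x * y                                   # multiply half words: n-bit product
    if a > 0:
        product = (product + (1 << (a - 1))) >> a     # shift right with rounding
    return saturate(w + product, n)                   # add, then saturate


# Q15 example: 0.5 * 0.5 + 0 = 0.25
print(hex(fixed_madd(0x4000, 0x4000, 0, 15)))  # 0x2000
```

Saturation means an overflowing sum clamps to the largest or smallest representable value instead of wrapping around.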

30
Other ISA modifications

Auto-increment loads/stores
a vector load/store can post-increment its base
address
added base (16), stride (8), and increment (8)
registers
necessary for applications with short vectors or
scaled-up implementations
Butterfly permutation instructions
perform step of a butterfly permutation within a
vector register
used for FFT and reduction operations
Miscellaneous instructions added
min and max instructions (integer and FP)
FP reciprocal and reciprocal square root
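
To illustrate why a butterfly permutation is useful for reductions, the sketch below sums a vector in log2(N) steps: at each step every element is paired with the element whose index differs in one bit, and the pair is added. Modelling a butterfly step as an exchange with index i XOR d is an assumption chosen for illustration; the actual VIRAM instruction semantics may differ.

```python
def butterfly_step(v, d):
    """Return v permuted so that position i receives element i XOR d."""
    return [v[i ^ d] for i in range(len(v))]


def butterfly_reduce(v):
    """Sum a power-of-two-length vector using log2(N) permute-and-add steps."""
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    d = n // 2
    while d >= 1:
        partner = butterfly_step(v, d)               # in-register permutation
        v = [a + b for a, b in zip(v, partner)]      # element-wise vector add
        d //= 2
    return v[0]                                      # every element now holds the total


print(butterfly_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```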

31
Major architecture updates

Integer arithmetic units support multiply-add
instructions
1 load store unit
complexity Vs. benefit
Optimize for strides 2, 3, and 4
useful for complex arithmetic and image
processing functions
Decoupled strided and indexed stores
memory stalls due to bank conflicts do not stall
the arithmetic pipelines
allows scheduling of independent arithmetic
operations in parallel with stores that
experience many stalls
implemented with address, not data, buffering
currently examining a similar optimization for
loads

32
Micro-kernel results: simulated systems

Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4

33
Micro-kernels

Vectorization and scheduling performed manually

34
Scaled system results

Near linear speedup for all applications apart from iDCT
iDCT bottlenecks
large number of bank conflicts
4 addresses/cycle for strided accesses

35
iDCT scaling with sub-banks

Sub-banks reduce bank conflicts and increase
performance
Alternative (but not as effective) ways to reduce
conflicts
different memory layout
different address interleaving schemes

36
Compiling for VIRAM

Long-term success of DIS technology depends on simple programming model, i.e., a compiler
Needs to handle significant class of applications
IRAM: multimedia, graphics, speech and image processing
ISTORE: databases, signal processing, other DIS benchmarks
Needs to utilize hardware features for performance
IRAM: vectorization
ISTORE: scalability of shared-nothing programming model

37
IRAM Compilers

IRAM/Cray vectorizing compiler [Judd]
Production compiler
Used on the T90, C90, as well as the T3D and T3E
Being ported (by SGI/Cray) to the SV2 architecture
Has C, C++, and Fortran front-ends (focus on C)
Extensive vectorization capability
outer loop vectorization, scatter/gather, short loops, ...
VIRAM port is under way
IRAM/VSUIF vectorizing compiler [Krashinsky]
Based on VSUIF from Corinna Lee's group at Toronto, which is based on MachineSUIF from Mike Smith's group at Harvard, which is based on the SUIF compiler from Monica Lam's group at Stanford
This is a research compiler, not intended for compiling large complex applications
It has been working since 5/99.

38
IRAM/Cray Compiler Status

MIPS backend developed this year
Validated using a commercial test suite for code
generation
Vector backend recently started
Testing with simulator under way
Leveraging from Cray
Automatic vectorization

39
VIRAM/VSUIF Matrix/Vector Multiply

VIRAM/VSUIF does reasonably well on long loops

256x256 single matrix
Compare to 1600 Mflop/s (peak without multadd)
Note: BLAS-2 (little reuse)
350 on Power3 and EV6
Problems specific to VSUIF
hand strip-mining results in short loops
reductions
no multadd support

[Chart series: mvm, vmm]
40
Reactive Self-Maintenance

ISTORE defines a layered system model for
monitoring and reaction

ISTORE API defines interface between runtime
system and app. reaction mechanisms

Policies define the system's monitoring, detection, and reaction behavior

41
Proactive Self-Maintenance

Continuous online self-testing of HW and SW
detects flaky, failing, or buggy components via:
fault injection: triggering hardware and software error handling paths to verify their integrity/existence
stress testing: pushing HW/SW components past normal operating parameters
scrubbing: periodic restoration of potentially decaying hardware or software state
automates preventive maintenance
Dynamic HW/SW component characterization
used to adapt to heterogeneous hardware and
behavior of application software components

42
ISTORE-0 Prototype and Plans

ISTORE-0: testbed for early experimentation with ISTORE research ideas
Hardware: cluster of 6 PCs
intended to model ISTORE-1 using COTS components
nodes interconnected using ISTORE-1 network fabric
custom fault-injection hardware on subset of nodes
Initial research plans
runtime system software
fault injection
scalability, availability, maintainability benchmarking
applications: block storage server, database, FFT

43
Runtime System Software

Demonstrate simple policy-driven adaptation
within context of a single OS and application
software monitoring information collected and processed in realtime
e.g., health & performance parameters of OS, application
problem detection and coordination of reaction
controlled by a stock set of configurable policies
application-level adaptation mechanisms
invoked to implement reaction
Use experience to inform ISTORE API design
Investigate reinforcement learning as technique
to infer appropriate reactions from goals

44
Record-breaking performance is not the common case

NOW-Sort records demonstrate peak performance
But perturb just 1 of 8 nodes and...

45
Virtual Streams: Dynamic load balancing for I/O

Replicas of data serve as second sources
Maintain a notion of each process's progress
Arbitrate use of disks to ensure equal progress
The right behavior, but what mechanism?

46
Graduated Declustering: A Virtual Streams implementation

Clients send progress, servers schedule in
response
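
The idea of scheduling in response to client progress can be illustrated with a toy simulation. The sketch below is not the Graduated Declustering implementation; it assumes each client's data is mirrored on two neighbouring disks in a ring, that each disk serves a fixed number of blocks per tick, and that a disk always serves whichever of its clients has made the least progress. All names and the ring layout are hypothetical.

```python
def simulate(rates, ticks, use_replicas=True):
    """Return per-client progress (blocks read) after `ticks` time steps.

    rates[d] is the number of blocks disk d can serve per tick. Client i's
    data lives on disk i and, when replicas are used, also on disk (i + 1) % n.
    """
    n = len(rates)
    progress = [0] * n                      # progress reported by each client
    for _tick in range(ticks):
        for disk, rate in enumerate(rates):
            # Clients this disk can serve: its primary client, plus the
            # neighbour mirrored onto it when replicas are enabled.
            clients = [disk, (disk - 1) % n] if use_replicas else [disk]
            for _block in range(rate):
                # Serve the client that is furthest behind (ties: lowest id).
                laggard = min(clients, key=lambda c: (progress[c], c))
                progress[laggard] += 1
    return progress


rates = [1, 2, 2]                           # disk 0 runs at half speed
print(simulate(rates, 30, use_replicas=False))
print(simulate(rates, 30, use_replicas=True))
```

Without replicas, the client bound to the slow disk finishes with half the progress of the others; with replicas and progress-based scheduling, the total bandwidth is shared and all clients advance at the same average rate.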

47
Read Performance: Multiple Slow Disks
48
Storage Priorities: Research v. Users

Traditional Research Priorities
1) Performance
1) Cost
3) Scalability
4) Availability
5) Maintainability

ISTORE Priorities
1) Maintainability
2) Availability
3) Scalability
4) Performance
5) Cost

[Figure annotations: easy to measure, hard to measure]
49
Intelligent Storage Project Goals

ISTORE: a hardware/software architecture for building scaleable, self-maintaining storage
An introspective system: it monitors itself and acts on its observations
Self-maintenance: does not rely on administrators to configure, monitor, or tune system

50
Self-maintenance

Failure management
devices must fail fast without interrupting
service
predict failures and initiate replacement
failures should not require immediate human intervention
System upgrades and scaling
new hardware automatically incorporated without
interruption
new devices immediately improve performance or
repair failures
Performance management
system must adapt to changes in workload or
access patterns

51
ISTORE-I: 2H99

Intelligent disk
Portable PC Hardware: Pentium II, DRAM
Low Profile SCSI Disk (9 to 18 GB)
4 100-Mbit/s Ethernet links per node
Placed inside Half-height canister
Monitor Processor/path to power off components?
Intelligent Chassis
64 nodes: 8 enclosures, 8 nodes/enclosure
64 x 4 or 256 Ethernet ports
2 levels of Ethernet switches: 14 small, 2 large
Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit
Just for prototype; crossbar chips for real system
Enclosure sensing, UPS, redundant PS, fans, ...

52
Disk Limit

Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
Slow improvement in seek, rotation (8%/yr)
Time to read whole disk
Year    Sequentially    Randomly (1 sector/seek)
1990    4 minutes       6 hours
1999    35 minutes      1 week(!)
Does the 3.5" form factor make sense in 5-7 years?

53
Related Work

ISTORE adds to several recent research efforts
Active Disks, NASD (UCSB, CMU)
Network service appliances (NetApp, Snap!, Qube,
...)
High availability systems (Compaq/Tandem, ...)
Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S
Millennium)
Plug-and-play system construction (Jini, PC Plug&Play, ...)

54
Other (Potential) Benefits of ISTORE

Scalability: add processing power, memory, network bandwidth as you add disks
Smaller footprint vs. traditional server/disk
Less power
embedded processors vs. servers
spin down idle disks?
For decision-support or web-service applications,
potentially better performance than traditional
servers

55
Disk Limit: I/O Buses

Cannot use 100% of bus
Queuing Theory (< 70%)
Command overhead (Effective size = size x 1.2)

Multiple copies of data, SW layers

[Diagram labels: CPU, Memory bus, Memory, Internal I/O bus (PCI), External I/O bus (SCSI), Controllers, (15 disks)]

Bus rate vs. Disk rate
SCSI: Ultra2 (40 MHz), Wide (16 bit): 80 MByte/s
FC-AL: 1 Gbit/s = 125 MByte/s (single disk in 2002)

56
State of the Art: Seagate Cheetah 36

36.4 GB, 3.5 inch disk
12 platters, 24 surfaces
10,000 RPM
18.3 to 28 MB/s internal media transfer rate (14 to 21 MB/s user data)
9772 cylinders (tracks), (71,132,960 sectors total)
Avg. seek: read 5.2 ms, write 6.0 ms (Max. seek: 12/13 ms, 1 track: 0.6/0.9 ms)
$2100 or 17 MB/$ (6 cents/MB) (list price)
0.15 ms controller time

source: www.seagate.com
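
The Cheetah 36 figures are enough to estimate how long it takes to read the whole disk, sequentially and one random sector at a time. The sketch below assumes the midpoint of the 14 to 21 MB/s user data rate, an average rotational latency of half a revolution, and ignores the per-sector transfer time; these simplifications are assumptions, not figures from the data sheet.

```python
CAPACITY_BYTES = 36.4e9
SECTORS = 71_132_960
USER_RATE_MBS = (14 + 21) / 2        # assumption: midpoint of the user data rate
AVG_READ_SEEK_S = 5.2e-3
CONTROLLER_S = 0.15e-3
RPM = 10_000

half_rotation_s = 0.5 * 60 / RPM     # average rotational latency: 3 ms

# Sequential: stream the full capacity at the sustained user data rate.
sequential_s = CAPACITY_BYTES / (USER_RATE_MBS * 1e6)

# Random: one seek + half rotation + controller overhead per sector.
random_s = SECTORS * (AVG_READ_SEEK_S + half_rotation_s + CONTROLLER_S)

print(f"sequential: {sequential_s / 60:.0f} minutes")
print(f"random:     {random_s / 86400:.1f} days")
```

Under these assumptions the estimate is about 35 minutes sequentially and close to a week randomly.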
57
User Decision Support Demand vs. Processor speed
Database demand: 2X / 9-12 months ("Greg's Law")
CPU speed: 2X / 18 months ("Moore's Law")
Database-Proc. Performance Gap
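
The size of the database-processor gap follows directly from the two doubling periods. A minimal sketch, assuming both curves start equal and grow as pure exponentials; the function name is illustrative:

```python
def gap(months, demand_doubling_months, cpu_doubling_months=18):
    """Ratio of database demand to CPU speed after `months`, starting equal."""
    demand = 2 ** (months / demand_doubling_months)
    cpu = 2 ** (months / cpu_doubling_months)
    return demand / cpu


# After three years, for the slow (12-month) and fast (9-month) demand estimates:
print(gap(36, 12), gap(36, 9))  # 2.0 4.0
```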
