Title: Joe Gebis
1IRAM Chip Status
- Joe Gebis
- Computer Science Division
- University of California, Berkeley
- gebis_at_cs.berkeley.edu
- http://iram.cs.berkeley.edu
2Outline
- Overview of VIRAM-1 organization
- Hardware status
- CAD plan
3VIRAM1 Block Diagram
Figure (block diagram). Labeled components:
- Flag Unit 0, Flag Unit 1, Flag Register File (512B)
- Arithmetic Unit 0, Arithmetic Unit 1, Vector Register File (8KB)
- Memory Unit, TLB, DMA, Memory Crossbar
- SysAD IF, JTAG IF, JTAG
- DRAM0 (2MB), DRAM1 (2MB), ..., DRAM7 (2MB)
- Datapath width labels: 32B, 32B, 32B, 8B, 8B
4VIRAM1 Vector Units
Figure labels: Datapath, Lane, Vector Registers, Functional Unit, Memory Unit
- 4 partitionable 64-bit lanes
- 2 arithmetic functional units (one FP), 2 flag processing units
- Lane is the basic unit of design and replication
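As a rough illustration of what "partitionable" lanes mean for throughput, the sketch below counts how many elements one functional unit could process per cycle at different element widths. The lane count and lane width come from the slide; the element widths (64, 32, 16 bits) and the function names are assumptions made for this example only.

```python
LANES = 4        # from the slide: 4 lanes
LANE_BITS = 64   # from the slide: 64-bit lanes


def elements_per_cycle(element_bits, lanes=LANES, lane_bits=LANE_BITS):
    """Elements one functional unit handles per cycle across all lanes."""
    if element_bits <= 0 or lane_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the lane width")
    return lanes * (lane_bits // element_bits)


def split_lane_word(word, element_bits, lane_bits=LANE_BITS):
    """Split one lane-wide word into sub-elements, least significant first."""
    if element_bits <= 0 or lane_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the lane width")
    mask = (1 << element_bits) - 1
    return [(word >> shift) & mask for shift in range(0, lane_bits, element_bits)]


if __name__ == "__main__":
    for width in (64, 32, 16):  # assumed partition widths
        print(width, "bit elements:", elements_per_cycle(width), "per cycle")
```

Under these assumptions, narrower elements give proportionally more parallelism from the same datapath, which is the usual motivation for partitioning a lane.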
5VIRAM1 Layout
- IBM SA-27E process
- 0.18 µm, 6 copper layers
- 290 mm² area
- 150M transistors
- 1.2V logic, 1.8V DRAM
- 2W Power consumption
6Scalar Core Status
- Have synthesizable MIPS64 5Kc core
- Will run at 200 MHz
- Has 8KB instruction and data caches
- Caches will be compiled by IBM
7Vector Integer Unit Status
- Complete: design of all blocks, layout of subblocks
- Partially done: assembling block components
- Remaining: final assembly
Figure (integer datapath). Labeled blocks: Logical Unit, Multiplier, Shifter, Rounder, shamt, Adder, zero detect, Saturate
- Status label: Design complete, basic subblocks layout done
- Status label: Design complete, components ready for assembly
8Vector Register File
- Have a register file from Transmeta which was successfully fabbed in the same process
- Using the complete Transmeta register file?
- Contains shadow registers we couldn't use
- Has more ports than we need
- Would require combining 8 duplicates
- Use the Transmeta bit cell?
- It is larger than it needs to be for our purposes
- Build our own bit cell?
- Possibly a significant amount of work
9Control
- Small changes to work with new MIPS core
- Working model of the vector unit complete
- Some small glue logic remains to be able to do
complete simulation with core and on-chip DRAM
10Floating Point Vector Unit
- Synthesizable Verilog received from MIT RAW architecture group
- FPU as received not fully IEEE compliant
- Required some changes to work with core
11Crossbar Design
Figure (crossbar). Labeled ports and blocks:
- DRAM 0 Port, DRAM 1 Port, DRAM 2 Port, DRAM 3 Port
- 256-bit load crossbar
- 256-bit store crossbar
- Scalar / DMA Port
- VL0 Port, VL1 Port, VL2 Port, VL3 Port
12Memory and Crossbar
- Model for DRAM controllers complete
- Crossbar design is complete, layout progressing
- Crossbar issues
- Switches only 64-bit words
- Operates at 1.2V, contains level shifters to interface to 1.8V DRAM
- Segmented with repeaters at approximately 2mm intervals
- 5 ns cycle time, interfaces to DRAM without additional subclocks
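The crossbar figures can be turned into a back-of-the-envelope peak bandwidth estimate. The sketch below uses the 256-bit crossbar width, the 64-bit switching granularity, and the 5 ns cycle time from the slides; it assumes one full-width transfer per cycle in each direction, which the slides do not state, and the function names are hypothetical.

```python
CROSSBAR_BITS = 256   # from the slides: 256-bit load and store crossbars
WORD_BITS = 64        # from the slides: the crossbar switches 64-bit words
CYCLE_TIME_NS = 5.0   # from the slides: 5 ns cycle time


def clock_mhz(cycle_ns):
    """Clock frequency in MHz implied by a cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def peak_bandwidth_gb_per_s(width_bits, cycle_ns):
    """Peak bandwidth in GB/s (10^9 bytes/s), assuming one transfer per cycle."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle / cycle_ns  # bytes per nanosecond equals GB/s


if __name__ == "__main__":
    print("clock (MHz):", clock_mhz(CYCLE_TIME_NS))
    print("words per cycle:", CROSSBAR_BITS // WORD_BITS)
    print("peak GB/s per direction:",
          peak_bandwidth_gb_per_s(CROSSBAR_BITS, CYCLE_TIME_NS))
```

With these inputs the estimate is 32 bytes per cycle at 200 MHz, or 6.4 GB/s for each of the load and store crossbars, as an upper bound under the stated assumption.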
13CAD Plan - Synthesized Blocks
- Synopsys VCS Verilog compiler and environment
- Synopsys Design Compiler
- Synopsys Module Compiler
- Avant! Apollo place and route
- Synopsys PrimeTime static timing analysis
14CAD Plan - Custom Blocks
- Cadence layout editor
- Cadence schematic editor
- Avant! Hercules DRC and LVN
- Avant! StarRC parasitic extraction
- Avant! Hspice
- Synopsys TimeMill dynamic timing
- Synopsys PowerMill power consumption simulation
15CAD Plan - Integrated Blocks
- Avant! Apollo place and route
- Avant! Hercules DRC
- Synopsys PrimeTime static timing analysis
16CAD Plan - Other Blocks
- Cache blocks
- IBM SRAM compilers
- Functional verification
- Synopsys VCS
17Remaining Work
- Some design, layout work remains
- Synthesizing blocks
- Verification
- Tapeout planned for late fall
19Vector Execution Model
- Scalar execution: add r3, r1, r2 operates on single register values
- Vector execution: add.vv v3, v1, v2 operates on every element up to the vector length
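The difference between the two instructions on this slide can be sketched in a few lines of Python. This is only a behavioral model: registers are plain Python integers and lists, fixed register widths and overflow are ignored, and the function names are made up for the example.

```python
def scalar_add(r1, r2):
    """Models `add r3, r1, r2`: one addition producing one result."""
    return r1 + r2


def vector_add(v1, v2, vl):
    """Models `add.vv v3, v1, v2`: elementwise add over the first vl elements."""
    if vl < 0 or vl > min(len(v1), len(v2)):
        raise ValueError("vector length exceeds the register contents")
    return [v1[i] + v2[i] for i in range(vl)]


if __name__ == "__main__":
    print(scalar_add(1, 2))
    # One vector instruction performs vl independent additions.
    print(vector_add([1, 2, 3, 4], [10, 20, 30, 40], vl=3))
```

A single vector instruction replaces a whole scalar loop, with the vector length deciding how many elements take part.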
20Vector Architectural State
21VIRAM ISA Extensions
- Scalar: MIPS64 scalar instruction set
- Vector ALU
- Vector Memory
- Vector Register
- All ALU / mem operations under mask
- Plus flag, convert, fixed-point, and transfer operations
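Execution "under mask" can be illustrated with a small behavioral sketch. The slide does not say what happens to masked-off elements; the code below assumes they keep their previous destination value, which is a common convention but an assumption here. All names are hypothetical.

```python
def masked_vector_add(v1, v2, old_dest, mask, vl):
    """Elementwise add under mask.

    Elements whose mask bit is 1 receive v1[i] + v2[i]; elements whose mask
    bit is 0 keep the old destination value (assumed behavior).
    """
    if vl < 0 or vl > min(len(v1), len(v2), len(old_dest), len(mask)):
        raise ValueError("vector length exceeds the register contents")
    result = list(old_dest)  # elements beyond vl are left untouched as well
    for i in range(vl):
        if mask[i]:
            result[i] = v1[i] + v2[i]
    return result


if __name__ == "__main__":
    v1 = [1, 2, 3, 4]
    v2 = [10, 20, 30, 40]
    dest = [0, 0, 0, 0]
    flags = [1, 0, 1, 0]  # stand-in for one flag register
    print(masked_vector_add(v1, v2, dest, flags, vl=4))
```

Masks of this kind let conditional code inside a loop be vectorized without branches.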
22Fixed-point Arithmetic
- Multiply upper or lower halves, shift and round
- Add/Sub and saturate
- Shift right and round, shift left and saturate
- All combinations of multiply and add/sub instructions
- Saturate to narrower width
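The operations listed above can be sketched on plain integers. This is a minimal model, not the VIRAM-1 definition: it assumes signed two's-complement values, a 16-bit default width, and round-half-up rounding on right shifts (the slide does not specify the rounding mode). Function names are hypothetical.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range of a bits-wide two's-complement value."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def add_sat(a, b, bits=16):
    """Add and saturate instead of wrapping around on overflow."""
    return saturate(a + b, bits)


def shift_right_round(value, shamt):
    """Arithmetic shift right, adding half an LSB first (assumed rounding mode)."""
    if shamt < 0:
        raise ValueError("shift amount must be non-negative")
    if shamt == 0:
        return value
    return (value + (1 << (shamt - 1))) >> shamt


def mul_shift_round_sat(a, b, shamt, bits=16):
    """Multiply, shift right with rounding, then saturate to a narrower width."""
    return saturate(shift_right_round(a * b, shamt), bits)


if __name__ == "__main__":
    print(add_sat(30000, 10000))                  # clamps at the 16-bit maximum
    print(mul_shift_round_sat(0x4000, 0x4000, 15))  # 0.5 * 0.5 in Q15 format
```

Chaining multiply, shift-and-round, and saturate in one step mirrors the "all combinations" bullet: the full-width product is only narrowed once, at the end.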
23Multiplier Partitioning
Figure labels: 16-bit Multiplier Block, result[15:0], result[31:16], 16-bit adder
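The general idea behind a partitioned multiplier can be sketched as follows: the same 16-bit multiplier blocks either work independently on narrow elements or have their partial products combined into a wider result. The sketch assumes unsigned operands and a simple four-block arrangement; the actual VIRAM-1 partitioning is not given in the slide, and all names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """One 16-bit multiplier block: 16 x 16 -> 32-bit unsigned product."""
    if not (0 <= a <= MASK16 and 0 <= b <= MASK16):
        raise ValueError("operands must fit in 16 bits")
    return a * b


def mul32_from_blocks(a, b):
    """32 x 32 -> 64-bit unsigned product built from four 16-bit blocks."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must fit in 32 bits")
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    low = mul16(a_lo, b_lo)
    cross = mul16(a_lo, b_hi) + mul16(a_hi, b_lo)  # partial products at bit 16
    high = mul16(a_hi, b_hi)
    return low + (cross << 16) + (high << 32)


def mul16_pairs(a, b):
    """Partitioned mode: two independent 16-bit multiplies on packed 32-bit words."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must fit in 32 bits")
    return [mul16(a & MASK16, b & MASK16), mul16(a >> 16, b >> 16)]


if __name__ == "__main__":
    a, b = 0x12345678, 0x9ABCDEF0
    print(mul32_from_blocks(a, b) == a * b)
    print([hex(p) for p in mul16_pairs(a, b)])
```

In wide mode the cross terms are what the extra adders combine; in partitioned mode they are simply not used.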
24Scaling
- Scaled-down version from the original
- Vector unit with same control
- Or scale up for future versions
25Scalar Core
- Synthesizable core from MIPS
- 64-bit (MIPS64 ISA)
- 6-stage pipeline
- Single instruction issue
- 8 KB direct-mapped data and instruction caches
- Has coprocessor interface used for vector unit access and FPU
26Floating-point datapath
- Single precision
- Contains add, sub, mul, div, compare, convert, truncate
- Does not contain mul-add, sqrt
- Only supports round to nearest even mode
- Fully pipelined
- 3 cycle latency for add/sub/mul/compare/convert
- 10 cycle latency for divide, 8 cycle repeat rate
- Fast execution mode
- Exceptions for each element are noted in the flag register; an exception is raised at the end of instruction execution
- Precise execution mode
- Following FP instructions are stalled early in the pipeline until execution of the previous instruction is complete and any exceptions are raised
- Operates at half performance (0.8 GFLOPS)
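The 0.8 GFLOPS figure for precise mode can be cross-checked against numbers from earlier slides. The sketch below assumes the vector unit runs at 200 MHz (the scalar core target and the 5 ns crossbar cycle suggest this, but the slides do not state it) and that each 64-bit lane processes two single-precision elements per cycle in the one FP-capable unit. The divide estimate assumes the 8-cycle repeat rate applies per element slot. All names are hypothetical.

```python
LANES = 4            # from the slides: 4 lanes
LANE_BITS = 64       # from the slides: 64-bit lanes
FP_UNITS = 1         # from the slides: one of the two arithmetic units does FP
ELEMENT_BITS = 32    # from the slides: single precision
CLOCK_MHZ = 200.0    # assumption: vector unit clock matches the 5 ns cycle
DIV_REPEAT_CYCLES = 8  # from the slides: divide repeat rate


def peak_gflops(lanes, lane_bits, element_bits, fp_units, clock_mhz):
    """Peak FP operations per second, in GFLOPS, for fully pipelined ops."""
    ops_per_cycle = lanes * (lane_bits // element_bits) * fp_units
    return ops_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    fast = peak_gflops(LANES, LANE_BITS, ELEMENT_BITS, FP_UNITS, CLOCK_MHZ)
    precise = fast / 2                      # slide: half performance
    divides = fast / DIV_REPEAT_CYCLES      # rough divide throughput estimate
    print("fast mode GFLOPS:", fast)
    print("precise mode GFLOPS:", precise)
    print("divide rate (G/s):", divides)
```

Under these assumptions the fast-mode peak is 1.6 GFLOPS, and halving it reproduces the 0.8 GFLOPS quoted for precise mode.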