Title: CPSC 614: Graduate Computer Architecture - Memory Technology
1 CPSC 614: Graduate Computer Architecture - Memory Technology
- Based on lectures by
- Prof. David Culler
- Prof. David Patterson
- UC Berkeley
2 Main Memory Background
- Random Access Memory (vs. Serial Access Memory)
- Different flavors at different levels
- Physical makeup (CMOS, DRAM)
- Low-level architectures (FPM, EDO, BEDO, SDRAM)
- Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor)
- Size: DRAM/SRAM 4-8, Cost/Cycle time: SRAM/DRAM 8-16
- Main Memory is DRAM: Dynamic Random Access Memory
- Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
- Addresses divided into 2 halves (memory as a 2D matrix):
- RAS or Row Access Strobe
- CAS or Column Access Strobe
3 Static RAM (SRAM)
- Six transistors in cross-connected fashion
- Provides regular AND inverted outputs
- Implemented in CMOS process
Single Port 6-T SRAM Cell
4 SRAM Read Timing (typical)
- tAA (access time for address): how long it takes to get stable output after a change in address.
- tACS (access time for chip select): how long it takes to get stable output after CS is asserted.
- tOE (output-enable time): how long it takes for the three-state output buffers to leave the high-impedance state when OE and CS are both asserted.
- tOZ (output-disable time): how long it takes for the three-state output buffers to enter the high-impedance state after OE or CS is negated.
- tOH (output-hold time): how long the output data remains valid after a change to the address inputs.
5 SRAM Read Timing (typical)
[Timing diagram: ADDR, CS_L, OE_L, and DOUT waveforms with WE_L held HIGH; DOUT becomes valid tOE after OE_L is asserted and stays stable while ADDR is stable.]
6 Dynamic RAM
- SRAM cells exhibit high speed / poor density
- DRAM: simple transistor/capacitor pairs in high-density form
[Diagram: DRAM bit line; each cell is a capacitor C gated onto the bit line by a word line, with a sense amp at the end of the bit line.]
7 Basic DRAM Cell
- Planar cell
- Polysilicon-diffusion capacitance, diffused bitlines
- Problem: uses a lot of area (< 1 Mb)
- You can't just ride the process curve to shrink C (discussed later)
8 Advanced DRAM Cells
9 Advanced DRAM Cells
- Trench Cell (expand DOWN)
10 DRAM Operations
- Write
- Charge bitline HIGH or LOW and set wordline HIGH
- Read
- Bit line is precharged to a voltage halfway between HIGH and LOW, and then the word line is set HIGH.
- Depending on the charge in the cap, the precharged bitline is pulled slightly higher or lower.
- Sense amp detects the change
- Explains why the cap can't shrink
- Need to sufficiently drive the bitline
- Increased density => increased parasitic capacitance
11 DRAM logical organization (4 Mbit)
[Diagram: 2,048 x 2,048 memory array with row decoder and column decoder, sense amps / I/O between the array and the D/Q data pins, multiplexed address inputs, and a storage cell on each word line.]
- Square root of bits per RAS/CAS
12 So, Why do I freaking care?
- By its nature, DRAM isn't built for speed
- Response times depend on capacitive circuit properties, which get worse as density increases
- The DRAM process isn't easy to integrate into a CMOS process
- DRAM is off chip
- Connectors, wires, etc. introduce slowness
- IRAM efforts are looking to integrate the two
- Memory architectures are designed to minimize the impact of DRAM latency
- Low-level memory chips
- High-level memory designs
- You will pay, and then some, for a good memory system.
13 So, Why do I freaking care?
- 1960-1985: Speed = ƒ(no. operations)
- 1990:
- Pipelined execution & fast clock rate
- Out-of-order execution
- Superscalar instruction issue
- 1998: Speed = ƒ(non-cached memory accesses)
- What does this mean for
- Compilers? Operating systems? Algorithms? Data structures?
14 4 Key DRAM Timing Parameters
- tRAC: minimum time from RAS line falling to valid data output.
- Quoted as the speed of a DRAM when you buy one
- A typical 4 Mb DRAM has tRAC = 60 ns
- The "speed" of the DRAM, since it's on the purchase sheet
- tRC: minimum time from the start of one row access to the start of the next.
- tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS line falling to valid data output.
- 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next.
- 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
15 DRAM Read Timing
- Every DRAM access begins with
- the assertion of RAS_L
- 2 ways to read: early or late vs. CAS
[Timing diagram: one DRAM read cycle; the row address and then the column address are presented on A (latched by RAS_L and CAS_L), and D goes from high-Z to data out after the read access time plus the output-enable delay, under control of WE_L and OE_L.]
- Early read cycle: OE_L asserted before CAS_L
- Late read cycle: OE_L asserted after CAS_L
16 DRAM Performance
- A 60 ns (tRAC) DRAM can
- perform a row access only every 110 ns (tRC)
- perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
- In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
- Can it be made faster?
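As a concrete check on these numbers, here is a minimal C sketch that converts tRC and tPC into best-case bandwidth bounds; the 4-byte word size is an assumption, not from the slides.

    #include <stdio.h>

    int main(void) {
        /* Timing parameters for the example 4 Mbit DRAM, in ns. */
        double tRC = 110.0;            /* row cycle time            */
        double tPC = 35.0;             /* column (page) cycle time  */
        double bytes_per_access = 4.0; /* assumed 32-bit data path  */

        /* Best case if every access opens a new row: ~36 MB/s.    */
        printf("row-access BW: %.1f MB/s\n", bytes_per_access / tRC * 1e3);
        /* Best case streaming columns in an open row: ~114 MB/s.  */
        printf("page-mode BW:  %.1f MB/s\n", bytes_per_access / tPC * 1e3);
        return 0;
    }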
17 Fast Page Mode DRAM
- Page: all bits on the same ROW (spatial locality)
- Don't need to wait for the wordline to recharge
- Toggle CAS with a new column address
18 Extended Data Out (EDO)
- Overlap data output with the CAS toggle
- Later brother, Burst EDO (CAS toggle used to get the next address)
19 Synchronous DRAM
- Has a clock input.
- Data output is in bursts, with each element clocked
- Flavors: SDRAM, DDR
20 RAMBUS (RDRAM)
- Protocol-based RAM with a narrow (16-bit) bus
- High clock rate (400 MHz), but long latency
- Pipelined operation
- Multiple arrays, with data transferred on both edges of the clock
[Figures: a RAMBUS bank and an RDRAM memory system.]
21 RDRAM Timing
22 DRAM History
- DRAM capacity: +60%/yr; cost: -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- A '98 DRAM fab line costs $2B
- DRAM only: density, leakage vs. speed
- Rely on increasing no. of computers & memory per computer (60% market)
- SIMM or DIMM is the replaceable unit => computers use any generation of DRAM
- Commodity, second-source industry => high volume, low profit, conservative
- Little organization innovation in 20 years
- Don't want to be chip foundries (bad for RDRAM)
- Order of importance: 1) Cost/bit 2) Capacity
- First RAMBUS: 10X BW, +30% cost => little impact
23 Main Memory Organizations
- Simple
- CPU, cache, bus, memory all the same width (32 or 64 bits)
- Wide
- CPU/Mux 1 word; Mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512 bits)
- Interleaved
- CPU, cache, bus 1 word; memory N modules (4 modules); example is word interleaved
24 Main Memory Performance
- Timing model (word size is 32 bits)
- 1 cycle to send address,
- 6 cycles access time, 1 cycle to send data
- Cache block is 4 words
- Simple M.P. = 4 x (1+6+1) = 32
- Wide M.P. = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4x1 = 11
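The same arithmetic as a small C sketch; the cycle counts are the slide's timing model, just spelled out as formulas.

    #include <stdio.h>

    int main(void) {
        int addr = 1, access = 6, xfer = 1; /* cycles: send address, access, send word */
        int block = 4;                      /* cache block size in words */

        int simple      = block * (addr + access + xfer); /* one word at a time  */
        int wide        = addr + access + xfer;           /* whole block at once */
        int interleaved = addr + access + block * xfer;   /* accesses overlapped */

        /* prints: simple 32, wide 8, interleaved 11 */
        printf("simple %d, wide %d, interleaved %d\n", simple, wide, interleaved);
        return 0;
    }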
25 Independent Memory Banks
- Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessor
- I/O
- CPU with hit-under-n-misses, non-blocking cache
- Superbank: all memory active on one block transfer (also called a bank)
- Bank: portion within a superbank that is word interleaved (also called a subbank)
[Address fields: Superbank Number | Superbank Offset, where the superbank offset further splits into Bank Number | Bank Offset.]
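A sketch in C of how such an address might be decomposed; all the field widths here are illustrative assumptions, not values from the slides.

    #include <stdint.h>
    #include <stdio.h>

    #define BANK_OFFSET_BITS 2  /* assumed: 4-word interleaved banks */
    #define BANK_BITS        3  /* assumed: 8 banks per superbank    */

    int main(void) {
        uint32_t addr = 0x1234;  /* a word address */
        uint32_t bank_offset = addr & ((1u << BANK_OFFSET_BITS) - 1);
        uint32_t bank_number = (addr >> BANK_OFFSET_BITS) & ((1u << BANK_BITS) - 1);
        uint32_t superbank   = addr >> (BANK_OFFSET_BITS + BANK_BITS);
        printf("superbank %u, bank %u, offset %u\n",
               (unsigned)superbank, (unsigned)bank_number, (unsigned)bank_offset);
        return 0;
    }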
26 Independent Memory Banks
- How many banks?
- number of banks ≥ number of clocks to access a word in a bank
- For sequential accesses; otherwise the CPU will return to the original bank before it has the next word ready
- Increasing DRAM => fewer chips => fewer banks
- RIMMs can have a HOTSPOT (literally)
27 Avoiding Bank Conflicts
- Lots of banks
- int x[256][512];
- for (j = 0; j < 512; j = j+1)
-   for (i = 0; i < 256; i = i+1)
-     x[i][j] = 2 * x[i][j];
- Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
- SW: loop interchange or declaring the array not a power of 2 (array padding; see the sketch below)
- HW: prime number of banks
- bank number = address mod number of banks
- address within bank = address / number of words in bank
- modulo & divide per memory access with a prime no. of banks?
- address within bank = address mod number of words in bank
- bank number? easy if 2^N words per bank
28 Fast Bank Number
- Chinese Remainder Theorem: as long as two sets of integers ai and bi follow these rules
- bi = x mod ai, 0 ≤ bi < ai, 0 ≤ x < a0 × a1 × a2 × ...
- and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping)
- bank number = b0, number of banks = a0 (= 3 in example)
- address within bank = b1, number of words in bank = a1 (= 8 in example)
- N word address 0 to N-1, prime no. of banks, words per bank a power of 2
Address        Seq. Interleaved       Modulo Interleaved
within Bank    Bank: 0    1    2      Bank: 0    1    2
    0                0    1    2            0   16    8
    1                3    4    5            9    1   17
    2                6    7    8           18   10    2
    3                9   10   11            3   19   11
    4               12   13   14           12    4   20
    5               15   16   17           21   13    5
    6               18   19   20            6   22   14
    7               21   22   23           15    7   23
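A short C sketch that regenerates the modulo-interleaved mapping above (3 banks and 8 words per bank, as in the example); both fields are cheap to compute.

    #include <stdio.h>

    int main(void) {
        const int banks = 3, words = 8;  /* co-prime, as the CRT requires */
        for (int addr = 0; addr < banks * words; addr++) {
            int bank   = addr % banks;   /* bank number */
            int offset = addr % words;   /* address within bank: just the
                                            low 3 bits, since words = 2^3 */
            printf("addr %2d -> bank %d, offset %d\n", addr, bank, offset);
        }
        return 0;
    }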
29 DRAMs per PC over Time
[Table: DRAMs per PC as a function of minimum memory size (4 MB to 256 MB) and DRAM generation: '86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb.]
30 Need for Error Correction!
- Motivation
- Failures/time proportional to the number of bits!
- As DRAM cells shrink, they become more vulnerable
- Went through a period in which the failure rate was low enough without error correction that people didn't do correction
- DRAM banks too large now
- Servers have always used corrected memory systems
- Basic idea: add redundancy through parity bits
- Simple but wasteful version
- Keep three copies of everything, vote to find the right value
- 200% overhead, so not good!
- Common configuration: random error correction
- SEC-DED (single error correct, double error detect; sketched below)
- One example: 64 data bits + 8 parity bits (11% overhead)
- Really want to handle failures of physical components as well
- Organization is multiple DRAMs/SIMM, multiple SIMMs
- Want to recover from a failed DRAM and a failed SIMM!
- Requires more redundancy to do this
- All major vendors thinking about this in high-end machines
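To make the parity idea concrete, here is a minimal Hamming(7,4) single-error-correcting sketch in C. It is much smaller than the 64+8 SEC-DED configuration above and omits the extra parity bit SEC-DED uses for double-error detection, but the mechanism is the same.

    #include <stdio.h>

    /* Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7:
     * p1 p2 d0 p4 d1 d2 d3), so any single bit flip can be corrected. */
    static unsigned encode(unsigned d) {
        unsigned d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
        unsigned p1 = d0 ^ d1 ^ d3;  /* parity over positions 3,5,7 */
        unsigned p2 = d0 ^ d2 ^ d3;  /* parity over positions 3,6,7 */
        unsigned p4 = d1 ^ d2 ^ d3;  /* parity over positions 5,6,7 */
        return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
    }

    /* XOR of the positions of all set bits: 0 for a valid codeword,
     * otherwise the position of the single flipped bit.             */
    static unsigned syndrome(unsigned c) {
        unsigned s = 0;
        for (int pos = 1; pos <= 7; pos++)
            if ((c >> (pos - 1)) & 1) s ^= (unsigned)pos;
        return s;
    }

    int main(void) {
        unsigned c = encode(0xB);     /* store data bits 1011       */
        unsigned bad = c ^ (1u << 4); /* a cell at position 5 flips */
        printf("syndrome = %u\n", syndrome(bad)); /* prints 5: correct bit 5 */
        return 0;
    }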
31 Architecture in practice
- (As reported in Microprocessor Report, Vol. 13, No. 5)
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
- Graphics Synthesizer: 2.4 billion pixels per second
- Claim: Toy Story realism brought to games!
32 FLASH Memory
- Floating-gate transistor
- Presence of charge => 0
- Erase: electrically or UV (EPROM)
- Performance
- Reads like DRAM (ns)
- Writes like DISK (ms). Write is a complex operation
33 More esoteric Storage Technologies?
- Tunneling Magnetic Junction RAM (TMJ-RAM)
- Speed of SRAM, density of DRAM, non-volatile (no refresh)
- New field called "spintronics": combination of quantum spin and electronics
- Same technology used in high-density disk drives
- MEMS storage devices
- Large magnetic sled floating on top of lots of little read/write heads
- Micromechanical actuators move the sled back and forth over the heads
34 Tunneling Magnetic Junction
35 MEMS-based Storage
- Magnetic sled floats on an array of read/write heads
- Approx. 250 Gbit/in²
- Data rates: IBM 250 MB/s with 1000 heads; CMU 3.1 MB/s with 400 heads
- Electrostatic actuators move the media around to align it with the heads
- Sweep sled 50 µm in < 0.5 µs
- Capacity estimated to be in the 1-10 GB range in 10 cm²
See Ganger et al.: http://www.lcs.ece.cmu.edu/research/MEMS
36 Main Memory Summary
- Wider memory
- Interleaved memory for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM-specific optimizations: page mode & specialty DRAM
- Need error correction
37 Virtual Memory
38 Terminology
- Page: a virtual memory block
- Page fault: a virtual memory miss
- Memory mapping (memory translation): converting a virtual address produced by the CPU to a physical address
39 Mapping of Virtual Memory to Physical Memory
40 Typical Parameter Ranges
Parameter        First-level cache    Virtual memory
Block size       16-128 B             4,096-65,536 B
Hit time         1-3 cycles           50-150 cycles
Miss penalty     8-150 cycles         1,000,000-10,000,000 cycles
  access time    6-130 cycles         800,000-8,000,000 cycles
  transfer time  2-20 cycles          200,000-2,000,000 cycles
Miss rate        0.1-10%              0.00001-0.001%
41 Design Issues
- A page fault takes millions of cycles to process.
- Pages should be large enough to amortize the high access time. (4 KB - 64 KB)
- Fully associative placement of pages is used.
- Page faults can be handled in software.
- Write-back (a write-through scheme does not work.)
42 Where to Place a Page and How to Find It
- Fully associative placement
- A page table is used to locate pages.
- Resides in memory
- Indexed with the page number from the virtual address; contains the corresponding physical page number.
- Each program has its own page table.
- To indicate the location of the page table in memory, the page table register is used.
- A valid bit in each entry (off: the page is not in memory => page fault)
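A minimal C sketch of this lookup, assuming 4 KB pages and a one-level table; the PTE layout is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                  /* 4 KB pages */
    #define VALID     (1u << 31)          /* valid bit in each entry */

    uint32_t page_table[1 << 20];         /* indexed by virtual page number */

    /* Translate a virtual address; returns 0 on a page fault. */
    int translate(uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        uint32_t pte    = page_table[vpn];
        if (!(pte & VALID))
            return 0;                     /* not in memory => page fault */
        *paddr = ((pte & ~VALID) << PAGE_BITS) | offset;
        return 1;
    }

    int main(void) {
        page_table[0x12345] = VALID | 0x42;  /* map one virtual page */
        uint32_t pa;
        if (translate(0x12345678, &pa))
            printf("physical address 0x%08x\n", (unsigned)pa);  /* 0x00042678 */
        return 0;
    }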
43 Translation of Virtual Address
44 Page Table
[Figure: page table indexed by the virtual page number.]
45 Writes in Virtual Memory
- Writes to the next level of the memory hierarchy (disk) take millions of cycles.
- Write-through (with a write buffer) is not practical.
- Write-back (copy back): virtual memory systems perform the individual writes into the page in memory and copy the page back to disk when it is replaced.
- A dirty bit indicates the page has been modified.
46 TLB (Translation Lookaside Buffer)
- Each memory access by a program takes at least twice as long:
- One access to obtain the physical address from the page table
- One to get the data
- TLB (Translation Lookaside Buffer)
- A cache that holds only page table mappings
- Includes the reference bit, the dirty bit, and the valid bit.
- With a TLB, we don't need to access the page table on every reference.
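A sketch of the idea in C: a tiny fully associative TLB consulted before the page table; the size and field layout are illustrative assumptions.

    #include <stdint.h>

    #define TLB_ENTRIES 8  /* assumed size; real TLBs are larger */

    struct tlb_entry {
        int      valid, ref, dirty;  /* status bits kept per mapping    */
        uint32_t vpn, ppn;           /* virtual -> physical page number */
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns 1 on a hit, so the page table in memory is never touched. */
    int tlb_lookup(uint32_t vpn, uint32_t *ppn) {
        for (int i = 0; i < TLB_ENTRIES; i++)  /* fully associative: check all */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = 1;                /* reference bit for replacement */
                *ppn = tlb[i].ppn;
                return 1;
            }
        return 0;  /* miss: walk the page table, then fill an entry */
    }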
47 TLB Structure
[Figure: each TLB entry holds Valid, Dirty, and Ref bits plus a Virtual Page / Physical Page pair.]
48 TLB Acting as a Cache on the Page Table
49 TLB Design Issues
- When a TLB entry is replaced, we need to copy its reference and dirty bits back to the page table entry.
- Write-back (due to the small miss rate)
- Fully associative mapping (due to the small TLB)
- If larger TLBs are used, no or small associativity can be used.
- Randomly choose an entry to replace.
50 Alpha 21264 Data TLB
[Figure: Alpha 21264 data TLB; entries include a PID field.]
51 MIPS R2000 TLB
[Figure: MIPS R2000 TLB translating a virtual address.]
52 Memory Hierarchy
[Diagram: Processor -> First-level Cache (with TLB) -> Second-level Cache -> Memory -> Disk; addresses and blocks move between the caches and memory, while the page table manages pages between memory and disk.]