Title: CPSC 614: Graduate Computer Architecture - Memory Technology
1 CPSC 614: Graduate Computer Architecture - Memory Technology
- Based on lectures by
- Prof. David Culler
- Prof. David Patterson
- UC Berkeley
2 Main Memory Background
- Random Access Memory (vs. Serial Access Memory)
- Different flavors at different levels
- Physical makeup (CMOS, DRAM)
- Low-level architectures (FPM, EDO, BEDO, SDRAM)
- Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor)
- Size: DRAM/SRAM 4-8, Cost/Cycle time: SRAM/DRAM 8-16
- Main Memory is DRAM: Dynamic Random Access Memory
- Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
- Addresses divided into 2 halves (memory as a 2D matrix):
- RAS or Row Access Strobe
- CAS or Column Access Strobe
3 Static RAM (SRAM)
- Six transistors in cross-connected fashion
- Provides regular AND inverted outputs
- Implemented in CMOS process
Single Port 6-T SRAM Cell
4 SRAM Read Timing (typical)
- tAA (access time for address): how long it takes to get stable output after a change in address.
- tACS (access time for chip select): how long it takes to get stable output after CS is asserted.
- tOE (output-enable time): how long it takes for the three-state output buffers to leave the high-impedance state when OE and CS are both asserted.
- tOZ (output-disable time): how long it takes for the three-state output buffers to enter the high-impedance state after OE or CS is negated.
- tOH (output-hold time): how long the output data remains valid after a change to the address inputs.
5 SRAM Read Timing (typical)
[Timing diagram: ADDR, CS_L, OE_L, and DOUT waveforms with WE_L held HIGH; DOUT becomes valid tOE after OE_L is asserted and stays stable while ADDR is stable.]
6 Dynamic RAM
- SRAM cells exhibit high speed / poor density
- DRAM: simple transistor/capacitor pairs in high-density form
[Diagram: DRAM bit line; each cell is a capacitor C gated onto the bit line by a word line, with a sense amp at the end of the bit line.]
7 Basic DRAM Cell
- Planar cell
- Polysilicon-diffusion capacitance, diffused bitlines
- Problem: uses a lot of area (< 1 Mb)
- You can't just ride the process curve to shrink C (discussed later)
8 Advanced DRAM Cells
9 Advanced DRAM Cells
- Trench Cell (expand DOWN)
10 DRAM Operations
- Write
- Charge bitline HIGH or LOW and set wordline HIGH
- Read
- Bit line is precharged to a voltage halfway between HIGH and LOW, and then the word line is set HIGH.
- Depending on the charge in the cap, the precharged bitline is pulled slightly higher or lower.
- Sense amp detects the change
- Explains why the cap can't shrink
- Need to sufficiently drive the bitline
- Increased density => increased parasitic capacitance
11 DRAM logical organization (4 Mbit)
[Diagram: 2,048 x 2,048 memory array with row decoder and column decoder, sense amps / I/O between the array and the D/Q data pins, multiplexed address inputs, and a storage cell on each word line.]
- Square root of bits per RAS/CAS
12 So, Why do I freaking care?
- By its nature, DRAM isn't built for speed
- Response times depend on capacitive circuit properties, which get worse as density increases
- The DRAM process isn't easy to integrate into a CMOS process
- DRAM is off chip
- Connectors, wires, etc. introduce slowness
- IRAM efforts are looking to integrate the two
- Memory architectures are designed to minimize the impact of DRAM latency
- Low-level memory chips
- High-level memory designs
- You will pay, and then some, for a good memory system.
13 So, Why do I freaking care?
- 1960-1985: Speed = ƒ(no. operations)
- 1990:
- Pipelined execution & fast clock rate
- Out-of-order execution
- Superscalar instruction issue
- 1998: Speed = ƒ(non-cached memory accesses)
- What does this mean for
- Compilers? Operating systems? Algorithms? Data structures?
14 4 Key DRAM Timing Parameters
- tRAC: minimum time from RAS line falling to valid data output.
- Quoted as the speed of a DRAM when you buy one
- A typical 4 Mb DRAM has tRAC = 60 ns
- The "speed" of the DRAM, since it's on the purchase sheet
- tRC: minimum time from the start of one row access to the start of the next.
- tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS line falling to valid data output.
- 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next.
- 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
15 DRAM Read Timing
- Every DRAM access begins with
- the assertion of RAS_L
- 2 ways to read: early or late vs. CAS
[Timing diagram: one DRAM read cycle; the row address and then the column address are presented on A (latched by RAS_L and CAS_L), and D goes from high-Z to data out after the read access time plus the output-enable delay, under control of WE_L and OE_L.]
- Early read cycle: OE_L asserted before CAS_L
- Late read cycle: OE_L asserted after CAS_L
16 DRAM Performance
- A 60 ns (tRAC) DRAM can
- perform a row access only every 110 ns (tRC)
- perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
- In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
- Can it be made faster?
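As a concrete check on these numbers, here is a minimal C sketch that converts tRC and tPC into best-case bandwidth bounds; the 4-byte word size is an assumption, not from the slides.

    #include <stdio.h>

    int main(void) {
        /* Timing parameters for the example 4 Mbit DRAM, in ns. */
        double tRC = 110.0;            /* row cycle time            */
        double tPC = 35.0;             /* column (page) cycle time  */
        double bytes_per_access = 4.0; /* assumed 32-bit data path  */

        /* Best case if every access opens a new row: ~36 MB/s.    */
        printf("row-access BW: %.1f MB/s\n", bytes_per_access / tRC * 1e3);
        /* Best case streaming columns in an open row: ~114 MB/s.  */
        printf("page-mode BW:  %.1f MB/s\n", bytes_per_access / tPC * 1e3);
        return 0;
    }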
17 Fast Page Mode DRAM
- Page: all bits on the same ROW (spatial locality)
- Don't need to wait for the wordline to recharge
- Toggle CAS with a new column address
18 Extended Data Out (EDO)
- Overlap data output with the CAS toggle
- Later brother, Burst EDO (CAS toggle used to get the next address)
19 Synchronous DRAM
- Has a clock input.
- Data output is in bursts, with each element clocked
- Flavors: SDRAM, DDR
20 RAMBUS (RDRAM)
- Protocol-based RAM with a narrow (16-bit) bus
- High clock rate (400 MHz), but long latency
- Pipelined operation
- Multiple arrays, with data transferred on both edges of the clock
[Figures: a RAMBUS bank and an RDRAM memory system.]
21 RDRAM Timing
22 DRAM History
- DRAM capacity: +60%/yr; cost: -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- A '98 DRAM fab line costs $2B
- DRAM only: density, leakage vs. speed
- Rely on increasing no. of computers & memory per computer (60% market)
- SIMM or DIMM is the replaceable unit => computers use any generation of DRAM
- Commodity, second-source industry => high volume, low profit, conservative
- Little organization innovation in 20 years
- Don't want to be chip foundries (bad for RDRAM)
- Order of importance: 1) Cost/bit 2) Capacity
- First RAMBUS: 10X BW, +30% cost => little impact
23 Main Memory Organizations
- Simple
- CPU, cache, bus, memory all the same width (32 or 64 bits)
- Wide
- CPU/Mux 1 word; Mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512 bits)
- Interleaved
- CPU, cache, bus 1 word; memory N modules (4 modules); example is word interleaved
24 Main Memory Performance
- Timing model (word size is 32 bits)
- 1 cycle to send address,
- 6 cycles access time, 1 cycle to send data
- Cache block is 4 words
- Simple M.P. = 4 x (1+6+1) = 32
- Wide M.P. = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4x1 = 11
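The same arithmetic as a small C sketch; the cycle counts are the slide's timing model, just spelled out as formulas.

    #include <stdio.h>

    int main(void) {
        int addr = 1, access = 6, xfer = 1; /* cycles: send address, access, send word */
        int block = 4;                      /* cache block size in words */

        int simple      = block * (addr + access + xfer); /* one word at a time  */
        int wide        = addr + access + xfer;           /* whole block at once */
        int interleaved = addr + access + block * xfer;   /* accesses overlapped */

        /* prints: simple 32, wide 8, interleaved 11 */
        printf("simple %d, wide %d, interleaved %d\n", simple, wide, interleaved);
        return 0;
    }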
25 Independent Memory Banks
- Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessor
- I/O
- CPU with hit-under-n-misses, non-blocking cache
- Superbank: all memory active on one block transfer (also called a bank)
- Bank: portion within a superbank that is word interleaved (also called a subbank)
[Address fields: Superbank Number | Superbank Offset, where the superbank offset further splits into Bank Number | Bank Offset.]
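A sketch in C of how such an address might be decomposed; all the field widths here are illustrative assumptions, not values from the slides.

    #include <stdint.h>
    #include <stdio.h>

    #define BANK_OFFSET_BITS 2  /* assumed: 4-word interleaved banks */
    #define BANK_BITS        3  /* assumed: 8 banks per superbank    */

    int main(void) {
        uint32_t addr = 0x1234;  /* a word address */
        uint32_t bank_offset = addr & ((1u << BANK_OFFSET_BITS) - 1);
        uint32_t bank_number = (addr >> BANK_OFFSET_BITS) & ((1u << BANK_BITS) - 1);
        uint32_t superbank   = addr >> (BANK_OFFSET_BITS + BANK_BITS);
        printf("superbank %u, bank %u, offset %u\n",
               (unsigned)superbank, (unsigned)bank_number, (unsigned)bank_offset);
        return 0;
    }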
26 Independent Memory Banks
- How many banks?
- number of banks ≥ number of clocks to access a word in a bank
- For sequential accesses; otherwise the CPU will return to the original bank before it has the next word ready
- Increasing DRAM => fewer chips => fewer banks
- RIMMs can have a HOTSPOT (literally)
27 Avoiding Bank Conflicts
- Lots of banks
- int x[256][512];
- for (j = 0; j < 512; j = j+1)
-   for (i = 0; i < 256; i = i+1)
-     x[i][j] = 2 * x[i][j];
- Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
- SW: loop interchange or declaring the array not a power of 2 (array padding; see the sketch below)
- HW: prime number of banks
- bank number = address mod number of banks
- address within bank = address / number of words in bank
- modulo & divide per memory access with a prime no. of banks?
- address within bank = address mod number of words in bank
- bank number? easy if 2^N words per bank
28 Fast Bank Number
- Chinese Remainder Theorem: as long as two sets of integers ai and bi follow these rules
- bi = x mod ai, 0 ≤ bi < ai, 0 ≤ x < a0 × a1 × a2 × ...
- and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping)
- bank number = b0, number of banks = a0 (= 3 in example)
- address within bank = b1, number of words in bank = a1 (= 8 in example)
- N word address 0 to N-1, prime no. of banks, words per bank a power of 2
Address        Seq. Interleaved       Modulo Interleaved
within Bank    Bank: 0    1    2      Bank: 0    1    2
    0                0    1    2            0   16    8
    1                3    4    5            9    1   17
    2                6    7    8           18   10    2
    3                9   10   11            3   19   11
    4               12   13   14           12    4   20
    5               15   16   17           21   13    5
    6               18   19   20            6   22   14
    7               21   22   23           15    7   23
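A short C sketch that regenerates the modulo-interleaved mapping above (3 banks and 8 words per bank, as in the example); both fields are cheap to compute.

    #include <stdio.h>

    int main(void) {
        const int banks = 3, words = 8;  /* co-prime, as the CRT requires */
        for (int addr = 0; addr < banks * words; addr++) {
            int bank   = addr % banks;   /* bank number */
            int offset = addr % words;   /* address within bank: just the
                                            low 3 bits, since words = 2^3 */
            printf("addr %2d -> bank %d, offset %d\n", addr, bank, offset);
        }
        return 0;
    }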
29 DRAMs per PC over Time
[Table: DRAMs per PC as a function of minimum memory size (4 MB to 256 MB) and DRAM generation: '86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb.]
30 Need for Error Correction!
- Motivation
- Failures/time proportional to the number of bits!
- As DRAM cells shrink, they become more vulnerable
- Went through a period in which the failure rate was low enough without error correction that people didn't do correction
- DRAM banks too large now
- Servers have always used corrected memory systems
- Basic idea: add redundancy through parity bits
- Simple but wasteful version
- Keep three copies of everything, vote to find the right value
- 200% overhead, so not good!
- Common configuration: random error correction
- SEC-DED (single error correct, double error detect; sketched below)
- One example: 64 data bits + 8 parity bits (11% overhead)
- Really want to handle failures of physical components as well
- Organization is multiple DRAMs/SIMM, multiple SIMMs
- Want to recover from a failed DRAM and a failed SIMM!
- Requires more redundancy to do this
- All major vendors thinking about this in high-end machines
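To make the parity idea concrete, here is a minimal Hamming(7,4) single-error-correcting sketch in C. It is much smaller than the 64+8 SEC-DED configuration above and omits the extra parity bit SEC-DED uses for double-error detection, but the mechanism is the same.

    #include <stdio.h>

    /* Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7:
     * p1 p2 d0 p4 d1 d2 d3), so any single bit flip can be corrected. */
    static unsigned encode(unsigned d) {
        unsigned d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
        unsigned p1 = d0 ^ d1 ^ d3;  /* parity over positions 3,5,7 */
        unsigned p2 = d0 ^ d2 ^ d3;  /* parity over positions 3,6,7 */
        unsigned p4 = d1 ^ d2 ^ d3;  /* parity over positions 5,6,7 */
        return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
    }

    /* XOR of the positions of all set bits: 0 for a valid codeword,
     * otherwise the position of the single flipped bit.             */
    static unsigned syndrome(unsigned c) {
        unsigned s = 0;
        for (int pos = 1; pos <= 7; pos++)
            if ((c >> (pos - 1)) & 1) s ^= (unsigned)pos;
        return s;
    }

    int main(void) {
        unsigned c = encode(0xB);     /* store data bits 1011       */
        unsigned bad = c ^ (1u << 4); /* a cell at position 5 flips */
        printf("syndrome = %u\n", syndrome(bad)); /* prints 5: correct bit 5 */
        return 0;
    }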
31 Architecture in practice
- (As reported in Microprocessor Report, Vol. 13, No. 5)
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
- Graphics Synthesizer: 2.4 billion pixels per second
- Claim: Toy Story realism brought to games!
32 FLASH Memory
- Floating-gate transistor
- Presence of charge => 0
- Erase: electrically or UV (EPROM)
- Performance
- Reads like DRAM (ns)
- Writes like DISK (ms). Write is a complex operation
33 More esoteric Storage Technologies?
- Tunneling Magnetic Junction RAM (TMJ-RAM)
- Speed of SRAM, density of DRAM, non-volatile (no refresh)
- New field called "spintronics": combination of quantum spin and electronics
- Same technology used in high-density disk drives
- MEMS storage devices
- Large magnetic sled floating on top of lots of little read/write heads
- Micromechanical actuators move the sled back and forth over the heads
34 Tunneling Magnetic Junction
35 MEMS-based Storage
- Magnetic sled floats on an array of read/write heads
- Approx. 250 Gbit/in²
- Data rates: IBM 250 MB/s with 1000 heads; CMU 3.1 MB/s with 400 heads
- Electrostatic actuators move the media around to align it with the heads
- Sweep sled 50 µm in < 0.5 µs
- Capacity estimated to be in the 1-10 GB range in 10 cm²
See Ganger et al.: http://www.lcs.ece.cmu.edu/research/MEMS
36 Main Memory Summary
- Wider memory
- Interleaved memory for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM-specific optimizations: page mode & specialty DRAM
- Need error correction
37 Virtual Memory
38 Terminology
- Page: a virtual memory block
- Page fault: a virtual memory miss
- Memory mapping (memory translation): converting a virtual address produced by the CPU to a physical address
39 Mapping of Virtual Memory to Physical Memory
40 Typical Parameter Ranges
Parameter        First-level cache    Virtual memory
Block size       16-128 B             4,096-65,536 B
Hit time         1-3 cycles           50-150 cycles
Miss penalty     8-150 cycles         1,000,000-10,000,000 cycles
  access time    6-130 cycles         800,000-8,000,000 cycles
  transfer time  2-20 cycles          200,000-2,000,000 cycles
Miss rate        0.1-10%              0.00001-0.001%
41 Design Issues
- A page fault takes millions of cycles to process.
- Pages should be large enough to amortize the high access time. (4 KB - 64 KB)
- Fully associative placement of pages is used.
- Page faults can be handled in software.
- Write-back (a write-through scheme does not work.)
42 Where to Place a Page and How to Find It
- Fully associative placement
- A page table is used to locate pages.
- Resides in memory
- Indexed with the page number from the virtual address; contains the corresponding physical page number.
- Each program has its own page table.
- To indicate the location of the page table in memory, the page table register is used.
- A valid bit in each entry (off: the page is not in memory => page fault)
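A minimal C sketch of this lookup, assuming 4 KB pages and a one-level table; the PTE layout is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                  /* 4 KB pages */
    #define VALID     (1u << 31)          /* valid bit in each entry */

    uint32_t page_table[1 << 20];         /* indexed by virtual page number */

    /* Translate a virtual address; returns 0 on a page fault. */
    int translate(uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        uint32_t pte    = page_table[vpn];
        if (!(pte & VALID))
            return 0;                     /* not in memory => page fault */
        *paddr = ((pte & ~VALID) << PAGE_BITS) | offset;
        return 1;
    }

    int main(void) {
        page_table[0x12345] = VALID | 0x42;  /* map one virtual page */
        uint32_t pa;
        if (translate(0x12345678, &pa))
            printf("physical address 0x%08x\n", (unsigned)pa);  /* 0x00042678 */
        return 0;
    }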
43 Translation of Virtual Address
44 Page Table
[Figure: page table indexed by the virtual page number.]
45 Writes in Virtual Memory
- Writes to the next level of the memory hierarchy (disk) take millions of cycles.
- Write-through (with a write buffer) is not practical.
- Write-back (copy back): virtual memory systems perform the individual writes into the page in memory and copy the page back to disk when it is replaced.
- A dirty bit indicates the page has been modified.
46 TLB (Translation Lookaside Buffer)
- Each memory access by a program takes at least twice as long:
- One access to obtain the physical address from the page table
- One to get the data
- TLB (Translation Lookaside Buffer)
- A cache that holds only page table mappings
- Includes the reference bit, the dirty bit, and the valid bit.
- With a TLB, we don't need to access the page table on every reference.
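A sketch of the idea in C: a tiny fully associative TLB consulted before the page table; the size and field layout are illustrative assumptions.

    #include <stdint.h>

    #define TLB_ENTRIES 8  /* assumed size; real TLBs are larger */

    struct tlb_entry {
        int      valid, ref, dirty;  /* status bits kept per mapping    */
        uint32_t vpn, ppn;           /* virtual -> physical page number */
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns 1 on a hit, so the page table in memory is never touched. */
    int tlb_lookup(uint32_t vpn, uint32_t *ppn) {
        for (int i = 0; i < TLB_ENTRIES; i++)  /* fully associative: check all */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = 1;                /* reference bit for replacement */
                *ppn = tlb[i].ppn;
                return 1;
            }
        return 0;  /* miss: walk the page table, then fill an entry */
    }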
47 TLB Structure
[Figure: each TLB entry holds Valid, Dirty, and Ref bits plus a Virtual Page / Physical Page pair.]
48 TLB Acting as a Cache on the Page Table
49 TLB Design Issues
- When a TLB entry is replaced, we need to copy its reference and dirty bits back to the page table entry.
- Write-back (due to the small miss rate)
- Fully associative mapping (due to the small TLB)
- If larger TLBs are used, no or small associativity can be used.
- Randomly choose an entry to replace.
50 Alpha 21264 Data TLB
[Figure: Alpha 21264 data TLB; entries include a PID field.]
51 MIPS R2000 TLB
[Figure: MIPS R2000 TLB translating a virtual address.]
52 Memory Hierarchy
[Diagram: Processor -> First-level Cache (with TLB) -> Second-level Cache -> Memory -> Disk; addresses and blocks move between the caches and memory, while the page table manages pages between memory and disk.]