CPSC 614:Graduate Computer Architecture Memory Technology - PowerPoint PPT Presentation

Slides: 53
Provided by: johnk204
1
CPSC 614: Graduate Computer Architecture
Memory Technology
  • Based on lectures by
  • Prof. David Culler
  • Prof. David Patterson
  • UC Berkeley

2
Main Memory Background
  • Random Access Memory (vs. Serial Access Memory)
  • Different flavors at different levels
  • Physical Makeup (CMOS, DRAM)
  • Low Level Architectures (FPM, EDO, BEDO, SDRAM)
  • Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor)
  • Size: DRAM/SRAM 4-8; Cost/Cycle time: SRAM/DRAM
    8-16
  • Main Memory is DRAM: Dynamic Random Access Memory
  • Dynamic since it needs to be refreshed periodically
    (8 ms, 1% of the time)
  • Addresses divided into 2 halves (Memory as a 2D
    matrix)
  • RAS or Row Access Strobe
  • CAS or Column Access Strobe

3
Static RAM (SRAM)
  • Six transistors in cross connected fashion
  • Provides regular AND inverted outputs
  • Implemented in CMOS process

Single Port 6-T SRAM Cell
4
SRAM Read Timing (typical)
  • tAA (access time for address): how long it takes
    to get stable output after a change in address.
  • tACS (access time for chip select): how long it
    takes to get stable output after CS is
    asserted.
  • tOE (output-enable time): how long it takes for
    the three-state output buffers to leave the
    high-impedance state when OE and CS are both
    asserted.
  • tOZ (output-disable time): how long it takes for
    the three-state output buffers to enter the
    high-impedance state after OE or CS is negated.
  • tOH (output-hold time): how long the output
    data remains valid after a change to the
    address inputs.

5
SRAM Read Timing (typical)
[Timing diagram: ADDR, CS_L, OE_L and DOUT waveforms with WE_L held HIGH; tOE marked between OE_L assertion and valid DOUT]
6
Dynamic RAM
  • SRAM cells exhibit high speed/poor density
  • DRAM: simple transistor/capacitor pairs in
    high-density form

[Figure: DRAM cell array with word line, storage capacitor C, bit line, and sense amp]
7
Basic DRAM Cell
  • Planar Cell
  • Polysilicon-Diffusion Capacitance, Diffused
    Bitlines
  • Problem: uses a lot of area (< 1 Mb)
  • You can't just ride the process curve to shrink C
    (discussed later)

8
Advanced DRAM Cells
  • Stacked cell (Expand UP)

9
Advanced DRAM Cells
  • Trench Cell (Expand DOWN)

10
DRAM Operations
  • Write:
  • Charge bitline HIGH or LOW and set wordline HIGH
  • Read:
  • Bit line is precharged to a voltage halfway
    between HIGH and LOW, and then the word line is
    set HIGH.
  • Depending on the charge in the capacitor, the
    precharged bitline is pulled slightly higher or
    lower.
  • Sense amp detects the change
  • Explains why the capacitor can't shrink
  • Need to sufficiently drive bitline
  • Increased density => increased parasitic
    capacitance
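
The read step above is a charge-sharing event, which is the reason the cell capacitor cannot simply shrink. A minimal sketch in C, assuming the standard charge-conservation model (the function name and all numeric values used with it are illustrative assumptions, not figures from the lecture):

```c
/* Bitline voltage change when a cell capacitor (c_cell, holding v_cell)
 * is connected to a bitline (c_bit, precharged to v_pre).
 * Charge conservation gives
 *     dV = (v_cell - v_pre) * c_cell / (c_cell + c_bit)
 * Capacitances may be in any unit as long as both use the same one.
 * Hypothetical helper, not part of the original slides. */
double bitline_swing(double v_cell, double v_pre, double c_cell, double c_bit)
{
    return (v_cell - v_pre) * c_cell / (c_cell + c_bit);
}
```

With an assumed bitline ten times larger than the cell (c_cell = 1, c_bit = 9) and a 1 V difference, the swing is only about 0.1 V. Halving c_cell or growing the parasitic c_bit makes the swing smaller still, so the sense amp has less signal to detect.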

11
DRAM logical organization (4 Mbit)
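[Figure: 4 Mbit DRAM organization. Address inputs A0...A10 (11 bits) feed the row decoder and the column decoder; the row decoder drives the word lines of a 2,048 x 2,048 memory array of storage cells; sense amps and I/O connect the array to data in (D) and data out (Q)]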
  • Square root of bits per RAS/CAS
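
A minimal sketch of the address split for the 2,048 x 2,048 array, assuming the upper 11 bits of a 22-bit cell number are sent with RAS and the lower 11 bits with CAS (the slides do not say which half goes first; the type and function names are hypothetical):

```c
#include <stdint.h>

#define ROW_BITS 11u  /* 2,048 rows    = 2^11 */
#define COL_BITS 11u  /* 2,048 columns = 2^11 */

typedef struct {
    uint32_t row;  /* presented on the address pins with RAS */
    uint32_t col;  /* presented on the same pins with CAS */
} dram_addr;

/* Split a 22-bit cell number into its row and column halves.
 * Assumption: the row is the upper half of the address. */
dram_addr split_address(uint32_t cell)
{
    dram_addr a;
    a.row = (cell >> COL_BITS) & ((1u << ROW_BITS) - 1u);
    a.col = cell & ((1u << COL_BITS) - 1u);
    return a;
}
```

Because both halves share the same 11 pins, the chip needs roughly the square root of the bit count in address lines per RAS/CAS phase.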

12
So, Why do I freaking care?
  • By its nature, DRAM isn't built for speed
  • Response times depend on capacitive circuit
    properties, which get worse as density increases
  • DRAM process isn't easy to integrate into CMOS
    process
  • DRAM is off chip
  • Connectors, wires, etc. introduce slowness
  • IRAM efforts are looking to integrate the two
  • Memory Architectures are designed to minimize
    impact of DRAM latency
  • Low Level Memory chips
  • High Level memory designs.
  • You will pay and then some for a good
    memory system.

13
So, Why do I freaking care?
  • 1960-1985: Speed = ƒ(no. operations)
  • 1990:
  • Pipelined Execution & Fast Clock Rate
  • Out-of-Order execution
  • Superscalar Instruction Issue
  • 1998: Speed = ƒ(non-cached memory accesses)
  • What does this mean for
  • Compilers? Operating Systems? Algorithms? Data
    Structures?

14
4 Key DRAM Timing Parameters
  • tRAC: minimum time from RAS line falling to the
    valid data output.
  • Quoted as the speed of a DRAM when you buy it
  • A typical 4 Mbit DRAM has tRAC = 60 ns
  • Speed of DRAM, since it is on the purchase sheet?
  • tRC: minimum time from the start of one row
    access to the start of the next.
  • tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60
    ns
  • tCAC: minimum time from CAS line falling to valid
    data output.
  • 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
  • tPC: minimum time from the start of one column
    access to the start of the next.
  • 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns

15
DRAM Read Timing
  • Every DRAM access begins at the assertion of
    RAS_L
  • 2 ways to read: early or late v. CAS
[Timing diagram: DRAM read cycle. Signals shown: CAS_L, A (row address, then column address, junk in between), WE_L, OE_L, D (High Z, then data out). Intervals marked: DRAM read cycle time, read access time, output enable delay]
Early Read Cycle: OE_L asserted before CAS_L
Late Read Cycle: OE_L asserted after CAS_L
16
DRAM Performance
  • A 60 ns (tRAC) DRAM can
  • perform a row access only every 110 ns (tRC)
  • perform column access (tCAC) in 15 ns, but time
    between column accesses is at least 35 ns (tPC).
  • In practice, external address delays and turning
    around buses make it 40 to 50 ns
  • These times do not include the time to drive the
    addresses off the microprocessor nor the memory
    controller overhead!
  • Can it be made faster?

17
Fast Page Mode DRAM
  • Page: all bits on the same ROW (Spatial Locality)
  • Don't need to wait for wordline to recharge
  • Toggle CAS with new column address
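
A rough sketch of what page mode buys, using the 4 Mbit timing values quoted earlier in the deck (tRAC = 60 ns, tRC = 110 ns, tPC = 35 ns). The model is a simplification and an assumption: it ignores controller overhead and bus turnaround, and the struct and function names are hypothetical.

```c
/* DRAM timing parameters in nanoseconds. */
typedef struct {
    unsigned t_rac;  /* RAS falling to valid data */
    unsigned t_rc;   /* start of one row access to start of the next */
    unsigned t_pc;   /* start of one column access to start of the next */
} dram_timing;

/* Time until the last of n words is valid when every word pays a full
 * row access: a new access can only start every t_rc, and the last
 * word's data is valid t_rac after its own RAS. */
unsigned read_ns_row_per_word(dram_timing t, unsigned n)
{
    if (n == 0) return 0;
    return (n - 1) * t.t_rc + t.t_rac;
}

/* Same, but all n words sit in one open row: one row access, then one
 * CAS toggle per additional word. */
unsigned read_ns_page_mode(dram_timing t, unsigned n)
{
    if (n == 0) return 0;
    return t.t_rac + (n - 1) * t.t_pc;
}
```

For four words from the same row, this simplified model gives 390 ns with a row access per word against 165 ns in page mode.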

18
Extended Data Out (EDO)
  • Overlap data output w/ CAS toggle
  • Later sibling: Burst EDO (CAS toggle used to get
    next address)

19
Synchronous DRAM
  • Has a clock input.
  • Data output is in bursts w/ each element clocked
  • Flavors: SDRAM, DDR

20
RAMBUS (RDRAM)
  • Protocol based RAM w/ narrow (16-bit) bus
  • High clock rate (400 MHz), but long latency
  • Pipelined operation
  • Multiple arrays w/ data transferred on both edges
    of clock

RAMBUS Bank
RDRAM Memory System
21
RDRAM Timing
22
DRAM History
  • DRAMs: capacity +60%/yr, cost -30%/yr
  • 2.5X cells/area, 1.5X die size in 3 years
  • '98 DRAM fab line costs $2B
  • DRAM only: density, leakage v. speed
  • Rely on increasing no. of computers & memory per
    computer (60% market)
  • SIMM or DIMM is replaceable unit => computers
    use any generation DRAM
  • Commodity, second source industry => high
    volume, low profit, conservative
  • Little organization innovation in 20 years
  • Don't want to be chip foundries (bad for RDRAM)
  • Order of importance: 1) Cost/bit 2) Capacity
  • First RAMBUS: 10X BW, +30% cost => little impact

23
Main Memory Organizations
  • Simple:
  • CPU, Cache, Bus, Memory same width (32 or 64
    bits)
  • Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
    (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
  • Interleaved:
  • CPU, Cache, Bus 1 word; Memory N modules (4
    modules); example is word interleaved

24
Main Memory Performance
  • Timing model (word size is 32 bits)
  • 1 to send address,
  • 6 access time, 1 to send data
  • Cache Block is 4 words
  • Simple M.P. = 4 x (1 + 6 + 1) = 32
  • Wide M.P. = 1 + 6 + 1 = 8
  • Interleaved M.P. = 1 + 6 + 4 x 1 = 11
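
The three miss penalties (M.P.) follow directly from the timing model. A small sketch in C, with hypothetical function names; the wide case assumes memory and bus are wide enough to move the whole block at once, and the interleaved case assumes at least one bank per word of the block:

```c
/* All times are in clock cycles.
 * addr   = cycles to send the address
 * access = cycles of memory access time
 * xfer   = cycles to send one bus transfer of data */

/* Simple: every word pays address + access + transfer. */
unsigned simple_penalty(unsigned words, unsigned addr, unsigned access,
                        unsigned xfer)
{
    return words * (addr + access + xfer);
}

/* Wide: one address, one access, one transfer for the whole block. */
unsigned wide_penalty(unsigned addr, unsigned access, unsigned xfer)
{
    return addr + access + xfer;
}

/* Interleaved: banks are accessed in parallel, then the words are
 * sent one at a time over the one-word bus. */
unsigned interleaved_penalty(unsigned words, unsigned addr, unsigned access,
                             unsigned xfer)
{
    return addr + access + words * xfer;
}
```

With words = 4, addr = 1, access = 6 and xfer = 1 these return 32, 8 and 11 cycles.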

25
Independent Memory Banks
  • Memory banks for independent accesses vs. faster
    sequential accesses
  • Multiprocessor
  • I/O
  • CPU with Hit under n Misses, Non-blocking Cache
  • Superbank: all memory active on one block
    transfer (or Bank)
  • Bank: portion within a superbank that is word
    interleaved (or Subbank)


[Figure: address fields. The address splits into Superbank Number and Superbank Offset; the Superbank Offset in turn splits into Bank Number and Bank Offset]
26
Independent Memory Banks
  • How many banks?
  • number of banks >= number of clocks to access a
    word in a bank
  • For sequential accesses; otherwise the access
    stream returns to the original bank before it
    has the next word ready
  • Increasing DRAM capacity => fewer chips => fewer
    banks
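
The rule of thumb can be checked with a toy model. The sketch below assumes one new word request is issued per clock, word i goes to bank i mod nbanks, and a bank is busy for access_clocks after it starts an access. These modeling choices and the function name are assumptions for illustration, not taken from the lecture.

```c
#define MAX_BANKS 64u

/* Count the clocks lost waiting for a busy bank while streaming
 * nwords sequential words. Returns (unsigned long)-1 if nbanks is
 * outside 1..MAX_BANKS. */
unsigned long sequential_stall_cycles(unsigned nbanks, unsigned access_clocks,
                                      unsigned nwords)
{
    unsigned long free_at[MAX_BANKS] = {0};  /* clock at which each bank is free */
    unsigned long now = 0, stalls = 0;

    if (nbanks == 0 || nbanks > MAX_BANKS)
        return (unsigned long)-1;

    for (unsigned i = 0; i < nwords; i++) {
        unsigned b = i % nbanks;
        if (free_at[b] > now) {          /* bank still busy: wait for it */
            stalls += free_at[b] - now;
            now = free_at[b];
        }
        free_at[b] = now + access_clocks;
        now += 1;                        /* next request goes out one clock later */
    }
    return stalls;
}
```

With a 6-clock access time, 6 or 8 banks stream 32 words without a stall, while 4 banks lose 14 clocks in this model because the stream wraps around to a bank that is still busy.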

RIMMs can have a HOTSPOT (literally)
27
Avoiding Bank Conflicts
  • Lots of banks
  • int x[256][512];
  • for (j = 0; j < 512; j = j+1)
  •   for (i = 0; i < 256; i = i+1)
  •     x[i][j] = 2 * x[i][j];
  • Even with 128 banks, since 512 is a multiple of
    128, conflict on word accesses
  • SW: loop interchange, or declaring array not a
    power of 2 (array padding)
  • HW: prime number of banks
  • bank number = address mod number of banks
  • address within bank = address / number of banks
  • modulo & divide per memory access with prime no.
    banks?
  • address within bank = address mod number of words
    in bank
  • bank number? easy if 2^N words per bank
28
Fast Bank Number
  • Chinese Remainder Theorem: as long as two sets of
    integers ai and bi follow these rules
  • bi = x mod ai, 0 <= bi < ai, 0 <= x < a0 x a1 x
    a2 x ...
  • and ai and aj are co-prime if i != j, then
    the integer x has only one solution (unambiguous
    mapping)
  • bank number = b0, number of banks = a0 (= 3 in
    example)
  • address within bank = b1, number of words in bank
    = a1 (= 8 in example)
  • N word address 0 to N-1, prime no. of banks, words
    per bank a power of 2
                       Seq. Interleaved    Modulo Interleaved
Bank Number:             0    1    2         0    1    2
Address within Bank:
  0                      0    1    2         0   16    8
  1                      3    4    5         9    1   17
  2                      6    7    8        18   10    2
  3                      9   10   11         3   19   11
  4                     12   13   14        12    4   20
  5                     15   16   17        21   13    5
  6                     18   19   20         6   22   14
  7                     21   22   23        15    7   23
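
A minimal sketch of the modulo-interleaved mapping for the 3-bank, 8-words-per-bank example. The type and function names are hypothetical; only the two mod operations come from the slide.

```c
#define NUM_BANKS      3u  /* a0: prime number of banks (from the example) */
#define WORDS_PER_BANK 8u  /* a1: power of two (from the example) */

typedef struct {
    unsigned bank;    /* b0 = x mod a0 */
    unsigned offset;  /* b1 = x mod a1 */
} bank_loc;

/* Map a word address to (bank, address within bank). The offset is a
 * cheap bit mask because WORDS_PER_BANK is a power of two; only the
 * bank number needs a true modulo by a prime. */
bank_loc map_address(unsigned addr)
{
    bank_loc loc;
    loc.bank = addr % NUM_BANKS;
    loc.offset = addr & (WORDS_PER_BANK - 1u);
    return loc;
}

/* Returns 1 if every address 0..N-1 lands in its own (bank, offset)
 * slot, which is what the Chinese Remainder Theorem guarantees when
 * the two moduli are co-prime. */
int mapping_is_unambiguous(void)
{
    unsigned char used[NUM_BANKS][WORDS_PER_BANK] = {{0}};

    for (unsigned addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        bank_loc loc = map_address(addr);
        if (used[loc.bank][loc.offset])
            return 0;
        used[loc.bank][loc.offset] = 1;
    }
    return 1;
}
```

For instance, address 16 maps to bank 1 at offset 0 and address 8 to bank 2 at offset 0, matching the modulo-interleaved column of the table.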
29
DRAMs per PC over Time
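[Table: number of DRAM chips per PC, by DRAM generation ('86: 1 Mb, '89: 4 Mb, '92: 16 Mb, '96: 64 Mb, '99: 256 Mb, '02: 1 Gb) against minimum memory size (4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 256 MB). Of the cell values, only 16 and 4 survive in this transcript]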
30
Need for Error Correction!
  • Motivation
  • Failures/time proportional to number of bits!
  • As DRAM cells shrink, more vulnerable
  • Went through a period in which the failure rate
    was low enough without error correction that
    people didn't do correction
  • DRAM banks too large now
  • Servers always corrected memory systems
  • Basic idea: add redundancy through parity bits
  • Simple but wasteful version:
  • Keep three copies of everything, vote to find
    right value
  • 200% overhead, so not good!
  • Common configuration: random error correction
  • SEC-DED (single error correct, double error
    detect)
  • One example: 64 data bits + 8 parity bits (11%
    overhead)
  • Really want to handle failures of physical
    components as well
  • Organization is multiple DRAMs/SIMM, multiple
    SIMMs
  • Want to recover from failed DRAM and failed SIMM!
  • Requires more redundancy to do this
  • All major vendors thinking about this in high-end
    machines
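
Two of the ideas above can be sketched in a few lines of C. The first function is the "three copies, vote" scheme; the second counts check bits for a SEC-DED code, assuming the usual Hamming construction (smallest r with 2^r >= data bits + r + 1, plus one overall parity bit). The slides do not name a specific code, so that construction and both function names are assumptions.

```c
#include <stdint.h>

/* Bitwise majority vote over three stored copies of a word: each
 * result bit is the value held by at least two of the copies. */
uint64_t vote3(uint64_t a, uint64_t b, uint64_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Check bits needed for SEC-DED over data_bits of data, assuming a
 * Hamming code extended with one overall parity bit. */
unsigned secded_check_bits(unsigned data_bits)
{
    unsigned r = 0;
    while ((1ull << r) < (unsigned long long)data_bits + r + 1u)
        r++;
    return r + 1u;  /* +1 for the double-error-detect parity bit */
}
```

For 64 data bits this gives 8 check bits, matching the 64 + 8 configuration on the slide: 8 of the 72 stored bits are overhead, about 11%, against 200% for triplication.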

31
Architecture in practice
  • (as reported in Microprocessor Report, Vol 13,
    No. 5)
  • Emotion Engine: 6.2 GFLOPS, 75 million polygons
    per second
  • Graphics Synthesizer: 2.4 billion pixels per
    second
  • Claim: Toy Story realism brought to games!

32
FLASH Memory
  • Floating gate transistor
  • Presence of charge => 0
  • Erase: electrically or UV (EPROM)
  • Performance
  • Reads like DRAM (ns)
  • Writes like DISK (ms). Write is a complex
    operation

33
More esoteric Storage Technologies?
  • Tunneling Magnetic Junction RAM (TMJ-RAM)
  • Speed of SRAM, density of DRAM, non-volatile (no
    refresh)
  • New field called Spintronics: combination of
    quantum spin and electronics
  • Same technology used in high-density disk drives
  • MEMS storage devices
  • Large magnetic sled floating on top of lots of
    little read/write heads
  • Micromechanical actuators move the sled back and
    forth over the heads

34
Tunneling Magnetic Junction
35
MEMS-based Storage
  • Magnetic sled floats on array of read/write
    heads
  • Approx. 250 Gbit/in^2
  • Data rates: IBM: 250 MB/s w/ 1000 heads; CMU: 3.1
    MB/s w/ 400 heads
  • Electrostatic actuators move media around to
    align it with heads
  • Sweep sled 50 µm in < 0.5 µs
  • Capacity estimated to be 1-10 GB in 10 cm^2

See Ganger et al.: http://www.lcs.ece.cmu.edu/research/MEMS
36
Main Memory Summary
  • Wider Memory
  • Interleaved Memory for sequential or independent
    accesses
  • Avoiding bank conflicts: SW & HW
  • DRAM-specific optimizations: page mode &
    specialty DRAM
  • Need Error correction

37
Virtual Memory
38
Terminology
  • Page: a virtual memory block
  • Page fault: a virtual memory miss
  • Memory mapping (memory translation): converting a
    virtual address produced by the CPU to a physical
    address

39
Mapping of Virtual Memory to Physical Memory
40
Typical Parameter Ranges
Parameter          First-level cache    Virtual memory
Block size         16-128 B             4096-65,536 B
Hit time           1-3 cycles           50-150 cycles
Miss penalty       8-150 cycles         1,000,000-10,000,000 cycles
  access time      6-130 cycles         800,000-8,000,000 cycles
  transfer time    2-20 cycles          200,000-2,000,000 cycles
Miss rate          0.1-10%              0.00001-0.001%
41
Design Issues
  • A page fault takes millions of cycles to process.
  • Pages should be large enough to amortize the high
    access time (4 KB to 64 KB).
  • Fully associative placement of pages is used.
  • Page faults can be handled in software.
  • Write-back (Write-through scheme does not work.)

42
Where to Place a Page and How to Find it
  • Fully associative placement
  • A page table is used to locate pages.
  • Resides in memory
  • Indexed with the page number from the virtual
    address and contains the corresponding physical
    page number.
  • Each program has its own page table.
  • To indicate the location of the page table in
    memory, the page table register is used.
  • A valid bit in each entry (off: the page is not
    in memory => page fault)
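
A minimal sketch of a single-level page table lookup along the lines described above. The 4 KB page size, the 32-bit addresses, the entry layout and the function name are assumptions made for the example; the `page_table` pointer plays the role of the page table register.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                        /* assumed 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)  /* low bits = page offset */

typedef struct {
    uint32_t ppn;    /* physical page number */
    uint8_t  valid;  /* 0: page is not in memory */
} pte;

/* Translate vaddr using the table that page_table points to.
 * Returns 0 and stores the physical address in *paddr on success,
 * or -1 to signal a page fault. */
int translate(const pte *page_table, size_t num_pages,
              uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;  /* index into the page table */

    if (vpn >= num_pages || !page_table[vpn].valid)
        return -1;

    *paddr = (page_table[vpn].ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return 0;
}
```

Giving each program its own `page_table` array is what gives each program its own mapping; a real operating system would handle the -1 case by bringing the page in from disk.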

43
Translation of Virtual Address
44
Page Table
[Figure: page table, indexed by virtual page number]
45
Writes in Virtual Memory
  • Writes to the next level of memory hierarchy
    (disk) take millions of cycles.
  • Write-through (with write buffer) is not
    practical.
  • Write-back (copy back): virtual memory systems
    perform the individual writes into the page in
    memory and copy the page back to disk when it is
    replaced.
  • Dirty bit indicates the page has been modified.

46
TLB (Translation Lookaside Buffer)
  • With the page table in memory, each memory access
    by a program takes at least twice as long:
  • One access to obtain the physical address from
    the page table
  • One access to get the data
  • TLB (Translation Lookaside Buffer)
  • A cache that holds only page table mappings
  • Includes the reference bit, the dirty bit, and
    the valid bit.
  • We don't need to access the page table on every
    reference.

47
TLB Structure
[Figure: TLB entry fields: Valid, Ref, Dirty, Virtual Page, Physical Page]
48
TLB Acting as a Cache on Page Table
49
TLB Design Issues
  • When a TLB entry is replaced, we need to copy the
    reference and dirty bits back to the page table
    entry.
  • Write-back (due to small miss rate)
  • Fully associative mapping (due to small TLB)
  • If larger TLBs are used, no or small
    associativity can be used.
  • Randomly choose an entry to replace.
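
A minimal sketch of a small fully associative TLB with random replacement and write-back of the reference and dirty bits. The entry count, field widths and function names are assumptions for illustration, and the refill routine assumes the caller has already checked that the page is valid in the page table.

```c
#include <stdint.h>
#include <stdlib.h>

#define TLB_ENTRIES 8  /* assumed size; small enough to search fully */

typedef struct { uint32_t ppn; uint8_t valid, ref, dirty; } pte;
typedef struct { uint32_t vpn, ppn; uint8_t valid, ref, dirty; } tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare vpn against every valid entry.
 * Returns the matching slot (and sets its reference bit) or -1. */
int tlb_lookup(uint32_t vpn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].ref = 1;
            return i;
        }
    }
    return -1;
}

/* On a TLB miss: pick a victim at random, copy its reference and
 * dirty bits back to its page table entry, then load the mapping
 * for vpn. Returns the slot that was filled. */
int tlb_refill(pte *page_table, uint32_t vpn)
{
    int victim = rand() % TLB_ENTRIES;

    if (tlb[victim].valid) {
        page_table[tlb[victim].vpn].ref |= tlb[victim].ref;
        page_table[tlb[victim].vpn].dirty |= tlb[victim].dirty;
    }
    tlb[victim].vpn = vpn;
    tlb[victim].ppn = page_table[vpn].ppn;
    tlb[victim].valid = 1;
    tlb[victim].ref = 1;
    tlb[victim].dirty = 0;
    return victim;
}
```

A store that hits in the TLB would set `tlb[slot].dirty`; the page table only learns about it when the entry is evicted, which is the write-back policy named on the slide.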

50
Alpha 21264 Data TLB
[Figure: Alpha 21264 data TLB; the only surviving label is PID]
51
MIPS R2000 TLB
[Figure: MIPS R2000 TLB; the only surviving label is Virtual address]
52
Memory Hierarchy
[Figure: memory hierarchy. Components: Processor, TLB, First-level Cache, Second-level Cache, Memory (holding the Page Table), Disk. Transfer labels: address, blocks, pages]