Title: Main Memory and Virtual Memory

1. Main Memory and Virtual Memory
- Vincent H. Berk
- October 26, 2005
- Reading for today: Sections 5.1–5.4 (Jouppi article)
- Reading for Friday: Sections 5.5–5.8
- Reading for Monday: Sections 5.8–5.12 and 5.16
2. Main Memory Background
- Performance of Main Memory:
  - Latency: cache miss penalty
    - Access Time: time between request and word arriving
    - Cycle Time: time between requests
  - Bandwidth: I/O and large block miss penalty (L2)
- Main Memory is DRAM: dynamic random access memory
  - Dynamic since it needs to be refreshed periodically (≈1% of the time)
  - Addresses divided into 2 halves (memory as a 2-D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: static random access memory
  - No refresh; 6 transistors/bit vs. 1 transistor
  - Size: DRAM/SRAM ≈ 4–8
  - Cost and cycle time: SRAM/DRAM ≈ 8–16
3. 4 Key DRAM Timing Parameters
- tRAC: minimum time from the RAS line falling to valid data output
  - Quoted as the speed of a DRAM when buying
  - A typical 512 Mbit DRAM has a tRAC of 40–60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 80 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns
- tCAC: minimum time from the CAS line falling to valid data output
  - 5 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 15 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns
4. DRAM Performance
- A 40 ns (tRAC) DRAM can:
  - perform a row access only every 80 ns (tRC)
  - perform a column access (tCAC) in 5 ns, but the time between column accesses is at least 15 ns (tPC)
    - In practice, external address delays and turning around buses make it 20 to 25 ns
- These times do not include the time to drive the addresses off the microprocessor or the memory controller overhead!
5. DRAM History
- DRAM capacity: +60%/yr; cost: –30%/yr
  - 2.5X cells/area, 1.5X die size in 3 years
- A '98 DRAM fab line costs $2B
- Relies on an increasing number of computers and more memory per computer (60% market)
  - SIMM or DIMM is the replaceable unit ⇒ computers can use any generation of DRAM
- Commodity, second-source industry ⇒ high volume, low profit, conservative
  - Little organizational innovation in 20 years
- Order of importance: 1) cost/bit, 2) capacity
  - First RAMBUS: 10X BW, +30% cost ⇒ little impact
- Current SDRAM yield very high: > 80%
6. Main Memory Performance
- Simple:
  - CPU, Cache, Bus, Memory all the same width (32 or 64 bits)
- Wide:
  - CPU/Mux: 1 word; Mux/Cache, Bus, Memory: N words (Alpha: 64 bits → 256 bits; UltraSPARC: 512)
- Interleaved:
  - CPU, Cache, Bus: 1 word; Memory: N modules (4 modules); example is word interleaved
7. Main Memory Performance
- Timing model (word size is 32 bits):
  - 1 cycle to send the address,
  - 6 cycles for access time, 1 cycle to send data
- Cache block is 4 words
- Simple memory ⇒ 4 × (1 + 6 + 1) = 32 cycles
- Wide memory ⇒ 1 + 6 + 1 = 8 cycles
- Interleaved memory ⇒ 1 + 6 + 4 × 1 = 11 cycles
8. Independent Memory Banks
- Memory banks for independent accesses vs. faster sequential accesses:
  - Multiprocessor
  - I/O (DMA)
  - CPU with hit under n misses, non-blocking cache
- Superbank: all memory active on one block transfer (or Bank)
- Bank: portion within a superbank that is word interleaved (or subbank)

[Diagram: address split into Superbank number | Superbank offset (Bank); the Superbank offset splits further into Bank number | Bank offset]
9. Independent Memory Banks
- How many banks?
  - number of banks ≥ number of clocks to access a word in a bank
  - For sequential accesses; otherwise the stream returns to the original bank before it has the next word ready
  - (as in the vector case)
- Increasing DRAM density ⇒ fewer chips ⇒ harder to have many banks
10. Avoiding Bank Conflicts
- Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

- Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
- SW: loop interchange, or declaring the array dimension not a power of 2 (array padding)
- HW: prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of words in bank
  - a modulo and a divide per memory access with a prime number of banks?
  - address within bank = address mod number of words in bank
  - bank number? easy if 2^N words per bank
11. Fast Memory Systems: DRAM-specific
- Multiple CAS accesses: several names (page mode)
  - Extended Data Out (EDO): 30% faster in page mode
- New DRAMs to address the gap: what will they cost, will they survive?
  - RAMBUS: startup company; reinvented the DRAM interface
    - Each chip is a module vs. a slice of memory
    - Short bus between CPU and chips
    - Does its own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
  - Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66–150 MHz)
  - Intel claims RAMBUS Direct is the future of PC memory
- Niche memory or main memory?
  - e.g., Video RAM for frame buffers: DRAM + fast serial output
12. Virtual Memory
- Virtual address (2^32, 2^64) to physical address (2^28) mapping
- Virtual memory in terms of cache:
  - Cache block?
  - Cache miss?
- How is virtual memory different from caches?
  - What controls replacement
  - Size (transfer unit, mapping mechanisms)
  - Lower-level use
13. Figure 5.36: The logical program in its contiguous virtual address space is shown on the left; it consists of four pages: A, B, C, and D.

14. Figure 5.37: Typical ranges of parameters for caches and virtual memory.
15. Virtual Memory
- 4 Questions for Virtual Memory (VM):
  - Q1: Where can a block be placed in the upper level?
    - fully associative, set associative, or direct mapped?
  - Q2: How is a block found if it is in the upper level?
  - Q3: Which block should be replaced on a miss?
    - random or LRU?
  - Q4: What happens on a write?
    - write back or write through?
- Other issues: size; pages or segments or hybrid
16. Figure 5.40: The mapping of a virtual address to a physical address via a page table.
17. Fast Translation: Translation Lookaside Buffer (TLB)
- A cache of translated addresses
- The data portion usually includes the physical page frame number, protection field, valid bit, use bit, and dirty bit
- Alpha 21064 data TLB: 32 entries, fully associative
18. Selecting a Page Size
- Reasons for a larger page size:
  - Page table size is inversely proportional to the page size; memory is therefore saved
  - Fast cache hit time is easy when cache size ≤ page size (virtually addressed caches); a bigger page makes this feasible as the cache grows
  - Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  - The number of TLB entries is restricted by the clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
- Reasons for a smaller page size:
  - Fragmentation: don't waste storage; data must be contiguous within a page
  - Quicker process start for small processes
- Hybrid solution: multiple page sizes
  - Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)
19. Alpha VM Mapping
- 64-bit virtual address divided into 3 segments:
  - seg0 (bit 63 = 0): user code/heap
  - seg1 (bit 63 = 1, bit 62 = 1): user stack
  - kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
- Three-level page table, each level one page
  - Alpha uses only 43 bits of VA
  - (future: min page size up to 64 KB ⇒ 55 bits of VA)
- PTE bits: valid, kernel/user, read/write enable (no reference, use, or dirty bit)
  - What do you do?

[Diagram: virtual address split into the seg0/seg1 selector (000...0 or 111...1), level1, level2, and level3 indices (10 bits each), and a 13-bit page offset; the L1, L2, and L3 page tables are walked in sequence to reach main memory; each PTE is 8 bytes, a 32-bit address plus 32 bits of fields]
20. Protection
- Prevent separate processes from accessing each other's memory
  - Violations cause a segmentation fault (SIGSEGV)
- Useful for multitasking systems
- An operating system issue
- At least two levels of protection:
  - Supervisor (kernel) mode (privileged)
    - Creates page tables, sets process bounds, handles exceptions
  - User mode (non-privileged)
    - Can only make requests to the kernel, called SYSCALLs
- Shared memory
  - SYSCALL parameter passing
21. Protection 2
- Each page needs:
  - PID bit
  - Read/Write/Execute bits
- Each process needs:
  - Stack frame page(s)
  - Text or code pages
  - Data or heap pages
- State bookkeeping:
  - PC and other CPU status registers
  - State of all registers
22. Alpha 21064
- Separate instruction and data TLBs and caches
- TLBs fully associative
- TLB updates in SW (Privileged Architecture Library, PALcode)
- Caches 8 KB direct mapped, write through
- Critical 8 bytes first
- Prefetch: instruction stream buffer
- 2 MB L2 cache, direct mapped, WB (off-chip)
- 256-bit path to main memory, 4 × 64-bit modules
- Victim buffer to give reads priority over writes
- 4-entry write buffer between D$ and L2

[Diagram: instruction and data paths through the stream buffer, write buffer, and victim buffer]
23. Alpha CPI Components
- Instruction stall: branch mispredict (green)
- Data cache (blue); instruction cache (yellow); L2 (pink)
- Other: compute, register conflicts, structural conflicts
24. Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)
- 4 KB data cache: miss rate 8%, 12%, or 28%?
- 1 KB instruction cache: miss rate 0%, 3%, or 10%?
- Alpha vs. MIPS for 8 KB data cache: 17% vs. 10%
- Why 2X Alpha vs. MIPS?

[Chart: miss rate (0–35%) vs. cache size (1–128 KB) for the instruction and data caches of gcc, espresso, and tomcatv]
25. Pitfall: Simulating Too Small an Address Trace

[Chart: cumulative average memory access time (1–4.5) vs. instructions executed (0–12 billion)]
- I$: 4 KB, 16 B blocks; D$: 4 KB, 16 B blocks; L2: 512 KB, 128 B blocks; miss penalties: 12 and 200 cycles
26. Additional Pitfalls
- Having too small an address space
- Ignoring the impact of the operating system on the performance of the memory hierarchy

27. Figure 5.53: Summary of the memory-hierarchy examples in Chapter 5.