

1
CPE 626 CPU Resources: ARM Cache Memories
  • Aleksandar Milenkovic
  • E-mail: milenka@ece.uah.edu
  • Web: http://www.ece.uah.edu/milenka

2
On-chip RAM
  • On-chip memory is essential if a processor is to
    deliver its best performance
  • zero wait state access speed
  • power efficiency
  • reduced electromagnetic interference
  • In many embedded systems simple on-chip RAM is
    preferred to a cache
  • Advantages
  • simpler, cheaper, less power
  • more deterministic behavior
  • Disadvantages
  • requires explicit control by the programmer

3
Unified instruction and data cache
4
Separate data and instruction caches
5
Direct-mapped cache organization
6
Two-way set-associative cache organization
7
Fully associative cache
8
An Example
  • The ARM3, designed in 1989, was the first ARM
    processor to incorporate an on-chip cache
  • Design steps
  • analysis using the ARM2: collect hardware traces
    while running typical benchmarks
  • exploring the upper-bound performance benefit
    considering a perfect cache (always contains the
    requested data)
  • Assuming a 20MHz cache and 8MHz main memory, the
    relative performance of various systems is: no
    cache 1, instruction cache only 1.95, data cache
    only 1.13, instruction + data cache 2.5
  • investigate different cache organizations and
    sizes
  • write-back, write-through, write-allocate,
    write-no-allocate, replacement policies,
    associativity, power

9
Summary of cache organizational options
10
Unified cache performance as a function of size
and organization
11
The effect of associativity on performance and
bandwidth requirement
12
ARM3 cache organization
64-way, 4KB cache
13
ARM600 cache control state machine
  • After initialization, the processor enters the
    <Check tag> state
  • If the address is non-sequential, does not fault
    in the MMU, and is either a read hit or a
    buffered write, the state machine remains in
    <Check tag>
  • a data value is read or written every clock cycle
  • When the next address is a sequential read within
    the same cache line or a sequential buffered
    write, the state moves to <Sequential fast>,
    where the data may be accessed without checking
    the tag and without activating the MMU
  • a data value is read or written every clock cycle
  • If the address is not in the cache or is an
    unbuffered write, an external access is needed;
    this begins in the <Start external> state. Reads
    from uncacheable memory and unbuffered writes are
    completed as single memory transactions.
    Cacheable reads perform a quad-word line fetch,
    after fetching the necessary translation
    information if it was not already in the MMU
  • Cycles where the CPU does not use memory are
    spent in the <Idle> state
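
The states above can be summarized in a minimal C sketch. The access_t fields and the transition conditions are simplified, illustrative assumptions (MMU faults and the line-fetch sequencing are omitted), not the actual ARM600 control logic:

```c
#include <stdbool.h>

/* Sketch of the four ARM600 cache-control states described above.
   State names come from the slide; inputs and transitions are
   simplified approximations. */
typedef enum { CHECK_TAG, SEQUENTIAL_FAST, START_EXTERNAL, IDLE } cache_state_t;

typedef struct {
    bool uses_memory;     /* CPU issues a memory access this cycle */
    bool sequential;      /* address is sequential to the previous one */
    bool hit;             /* tag comparison succeeds */
    bool buffered_write;  /* write goes through the write buffer */
} access_t;

cache_state_t next_state(access_t a)
{
    if (!a.uses_memory)
        return IDLE;                /* CPU not using memory this cycle */
    if (a.sequential && (a.hit || a.buffered_write))
        return SEQUENTIAL_FAST;     /* skip tag check, don't wake the MMU */
    if (a.hit || a.buffered_write)
        return CHECK_TAG;           /* non-sequential cached access */
    return START_EXTERNAL;          /* miss or unbuffered write */
}
```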

14
ARM600 cache control state machine
15
Memory Management
  • Today computer systems typically run multiple
    processes, each with its own address space
  • It would be too expensive to dedicate a
    full-address-space worth of memory to each
    process (many use only a small part of their
    address space)
  • If the Principle of Locality allows caches to
    offer the speed of cache memory with the size of
    DRAM memory, then, recursively, DRAM can act as a
    cache for secondary storage (disk) => Virtual
    Memory
  • Virtual memory divides physical memory into
    blocks and allocates them to different processes

16
Virtual Memory Motivation
  • Historically virtual memory was invented when
    programs became too large for physical memory
  • Allows OS to share memory and protect programs
    from each other (main reason today)
  • Provides illusion of very large memory
  • sum of the memory of many jobs can be greater
    than physical memory
  • allows each job to exceed the size of physical
    memory
  • Allows available physical memory to be very well
    utilized
  • Exploits memory hierarchy to keep average access
    time low

17
Virtual Memory Terminology
  • Virtual Address
  • address used by the programmer; the CPU produces
    virtual addresses
  • Virtual Address Space
  • collection of such addresses
  • Memory (Physical or Real) Address
  • address of word in physical memory
  • Memory mapping or address translation
  • process of virtual to physical address
    translation

18
Paging vs. Segmentation
  • Fixed-size blocks, called pages (4KB to 64KB)
  • Both logical and physical memory are divided into
    fixed-size components called pages (typically a
    few KBs)
  • Relationship between the logical and physical
    pages is stored in page tables (PTs) which are
    held in main memory
  • Variable-size blocks, called segments (1B to
    64KB/4GB); each segment contains a particular
    sort of information
  • e.g., code segment, data segment, stack segment
  • Paged segments: a segment is an integral number
    of pages

19
Segmented memory management
  • Segmentation allows a program to have its own
    private view of memory
  • Segments are of variable size => free memory
    becomes fragmented over time
  • it is possible that a new program is unable to
    start when the memory is fragmented into small
    pieces, none of which is big enough to hold a
    segment, even if there is enough free memory =>
    the OS is responsible for coalescing the free
    memory into one large piece

20
Paging memory management
  • Use a table lookup (the Page Table) for mappings;
    the Virtual Page number is the index
  • Virtual Memory Mapping Function (sketched in code
    after the figure)
  • Physical Offset = Virtual Offset
  • Physical Page Number (P.P.N., or page frame) =
    PageTable[Virtual Page Number]

(Figure: a virtual address is split into a virtual
page number and a page offset; translation replaces
the page number while the offset passes through
unchanged into the physical address.)
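
A minimal C sketch of this mapping function, assuming 4KB pages and a toy eight-entry page table with made-up contents:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4KB pages => 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Toy single-level page table: entry i holds the physical page
   number for virtual page i. Values are purely illustrative. */
static uint32_t page_table[8] = { 5, 2, 7, 0, 1, 3, 6, 4 };

uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = va & (PAGE_SIZE - 1);   /* unchanged by translation */
    uint32_t ppn    = page_table[vpn];        /* physical page number */
    return (ppn << PAGE_SHIFT) | offset;
}

int main(void)
{
    /* VPN 2 maps to PPN 7, so 0x2ABC translates to 0x7ABC. */
    printf("VA 0x%05x -> PA 0x%05x\n", 0x2ABC, translate(0x2ABC));
    return 0;
}
```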
21
Paging memory management (contd)
(Figure: the virtual page number field of the
virtual address indexes the Page Table; each entry
holds a Valid bit, Access Rights, and a Physical
Page Number, which is combined with the offset to
form the physical address.)
22
Paging memory management (contd)
  • Size of a PT? With 4KB pages and a 32-bit VA:
    2^20 entries x ~20 bits each, about 2.5MB
  • Use two or more levels of page table
  • Example
  • the 10 MSBs index the first-level page table
    directory to find the appropriate second-level
    page table
  • the next ten bits of the address then identify
    the page-table entry, which contains the physical
    page number
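
A C sketch of that two-level walk, assuming a 10/10/12 split of a 32-bit VA; the types are simplified (no valid bits or permissions):

```c
#include <stdint.h>

/* First level: 1024 pointers to second-level tables, each holding
   1024 physical page numbers. */
typedef struct { uint32_t *second_level[1024]; } page_directory_t;

uint32_t translate2(const page_directory_t *dir, uint32_t va)
{
    uint32_t l1  = (va >> 22) & 0x3FF;  /* 10 MSBs: directory index  */
    uint32_t l2  = (va >> 12) & 0x3FF;  /* next 10 bits: table index */
    uint32_t off =  va        & 0xFFF;  /* 12 bits: page offset      */

    uint32_t *table = dir->second_level[l1];  /* first memory access  */
    uint32_t  ppn   = table[l2];              /* second memory access */
    return (ppn << 12) | off;
}
```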

23
Mapping Virtual to Physical Memory
  • Program with 4 pages (A, B, C, D)
  • Any chunk of Virtual Memory can be assigned to
    any chunk (page) of Physical Memory

(Figure: virtual pages A, B, C, D on 4KB boundaries
map to arbitrary 4KB frames of physical memory, and
some pages may reside on disk instead.)
24
Fast Address Translation
  • PTs are stored in main memory => every memory
    access logically takes at least twice as long:
    one access to obtain the physical address and a
    second access to get the data
  • Observation: if there is locality in the pages of
    data, there must be locality in the virtual
    addresses of those pages => remember the last
    translation(s)
  • Address translations are kept in a special cache
    called Translation Look-Aside Buffer or TLB
  • The TLB must be on chip; its access time is
    comparable to the cache's

25
Typical TLB Format
Virtual Addr. | Physical Addr. | Dirty | Ref | Valid | Access Rights
  • Tag: portion of the virtual address
  • Data: physical page number
  • Dirty: since we use write-back, we need to know
    whether or not to write the page to disk when it
    is replaced
  • Ref: used to help approximate LRU on replacement
  • Valid: entry is valid
  • Access rights: R (read permission), W (write
    permission)
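
The format above can be sketched as a C struct with bit-fields, assuming a 32-bit address space with 4KB pages (so 20-bit page numbers); real TLB entries are packed in hardware-specific ways:

```c
#include <stdint.h>

/* Illustrative TLB entry following the field list above. */
typedef struct {
    uint32_t vpn   : 20;  /* tag: virtual page number                   */
    uint32_t ppn   : 20;  /* data: physical page number                 */
    uint32_t dirty : 1;   /* page modified, must be written back        */
    uint32_t ref   : 1;   /* referenced recently, used for LRU estimate */
    uint32_t valid : 1;   /* entry holds a live translation             */
    uint32_t read  : 1;   /* access rights: read permission             */
    uint32_t write : 1;   /* access rights: write permission            */
} tlb_entry_t;
```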

26
Translation Look-Aside Buffers
  • TLBs are usually small, typically 128-256 entries
  • Like any other cache, the TLB can be fully
    associative, set associative, or direct mapped

(Figure: the processor issues a VA; a TLB hit yields
the PA used to access the cache, while a TLB miss
goes to the translation path in main memory before
the data access.)
27
TLB Translation Steps
  • Assume a 32-entry, fully associative TLB (Alpha
    AXP 21064)
  • 1. The processor sends the virtual address to all
    tags
  • 2. If there is a hit (an entry in the TLB has
    that virtual page number and its valid bit is 1)
    and there is no access violation, then
  • 3. the matching tag sends the corresponding
    physical page number
  • 4. Combine the physical page number and the page
    offset to get the full physical address
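
A C sketch of these four steps for a small fully associative TLB. Hardware compares all tags in parallel; the loop below models that comparison sequentially, and the entry layout is a simplified version of the slide-25 format:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32

typedef struct {
    uint32_t vpn, ppn;        /* virtual and physical page numbers */
    bool valid, read, write;  /* valid bit and access rights       */
} tlb_entry_t;

/* Returns true on a hit and fills *pa; on a miss or an access
   violation the caller must fall back to translation (slide 28). */
bool tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES],
                uint32_t va, bool is_write, uint32_t *pa)
{
    uint32_t vpn = va >> 12, offset = va & 0xFFF;
    for (int i = 0; i < TLB_ENTRIES; i++) {        /* step 1: VA to all tags */
        if (tlb[i].valid && tlb[i].vpn == vpn) {   /* step 2: valid hit      */
            if (is_write ? !tlb[i].write : !tlb[i].read)
                return false;                      /* access violation       */
            *pa = (tlb[i].ppn << 12) | offset;     /* steps 3-4: PPN + offset */
            return true;
        }
    }
    return false;                                  /* TLB miss */
}
```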

28
What if not in TLB?
  • Option 1: the hardware walks the page table and
    loads the new Page Table Entry into the TLB
  • Option 2: the hardware traps to the OS; it is up
    to the OS to decide what to do
  • While in the operating system, we don't do
    translation (virtual memory is turned off)
  • The operating system knows which program caused
    the TLB fault or page fault, and knows which
    virtual address was requested
  • So it looks the data up in the page table
  • If the data is in memory, simply add the entry to
    the TLB, evicting an old entry from the TLB
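
A sketch of the software-managed refill (Option 2). The page table, TLB, and victim choice are reduced to illustrative stubs; a real OS would also check permissions and handle multiple processes:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t ppn; bool present; } pte_t;

static pte_t page_table[1 << 20];                 /* flat table, 4KB pages */
static struct { uint32_t vpn, ppn; bool valid; } tlb[32];

static int pick_tlb_victim(void) { return 0; }    /* stub: e.g. random/LRU */

void tlb_miss_handler(uint32_t faulting_va)       /* translation off here  */
{
    uint32_t vpn = faulting_va >> 12;
    pte_t pte = page_table[vpn];                  /* look it up in the PT  */

    if (pte.present) {                            /* in memory: refill TLB */
        int v = pick_tlb_victim();                /* evict an old entry    */
        tlb[v].vpn = vpn;
        tlb[v].ppn = pte.ppn;
        tlb[v].valid = true;
    }
    /* else: page fault; bring the page in from disk (next slide) */
}
```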

29
What if the data is on disk?
  • We load the page off the disk into a free block
    of memory, using a DMA transfer
  • Meanwhile, we switch to some other process that
    is waiting to run
  • When the DMA is complete, we get an interrupt and
    update the process's page table
  • So when we switch back to the task, the desired
    data will be in memory

30
What if we don't have enough memory?
  • We choose some other page belonging to a program
    and transfer it to disk if it is dirty
  • If clean (other copy is up-to-date), just
    overwrite that data in memory
  • We choose the page to evict based on a
    replacement policy (e.g., LRU)
  • And update that program's page table to reflect
    the fact that its memory moved somewhere else

31
Page Replacement Algorithms
  • First-In/First-Out
  • in response to page fault, replace the page that
    has been in memory for the longest period of time
  • does not make use of the principle of locality:
    an old but frequently used page could be replaced
  • easy to implement (the OS maintains a history
    thread through the page-table entries)
  • usually exhibits the worst behavior
  • Least Recently Used
  • selects the least recently used page for
    replacement
  • requires knowledge of past references
  • more difficult to implement, good performance
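
A toy C sketch of the two victim-selection rules above, assuming per-frame timestamps; the frame array is illustrative, not an actual OS structure. FIFO evicts the page loaded earliest, LRU the page used least recently:

```c
#include <stdint.h>

#define NFRAMES 4

typedef struct { uint64_t loaded_at, last_used_at; } frame_t;

int fifo_victim(const frame_t f[NFRAMES])
{
    int v = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (f[i].loaded_at < f[v].loaded_at) v = i;  /* oldest load */
    return v;
}

int lru_victim(const frame_t f[NFRAMES])
{
    int v = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (f[i].last_used_at < f[v].last_used_at) v = i;  /* stalest use */
    return v;
}
```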

32
Page Replacement Algorithms
  • Not Recently Used (an estimation of LRU)
  • A reference-bit flag is associated with each page
    table entry such that: Ref flag = 1 if the page
    has been referenced in the recent past, and Ref
    flag = 0 otherwise
  • If replacement is necessary, choose any page
    frame such that its reference bit is 0
  • OS periodically clears the reference bits
  • Reference bit is set whenever a page is accessed
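
A minimal C sketch of this scheme, assuming a flat array of reference bits for a handful of frames; the hardware-set/OS-cleared split follows the bullets above:

```c
#include <stdbool.h>

#define NFRAMES 8

static bool ref_bit[NFRAMES];

void on_access(int frame) { ref_bit[frame] = true; }  /* set by hardware  */

void clear_ref_bits(void)                             /* periodic OS task */
{
    for (int i = 0; i < NFRAMES; i++) ref_bit[i] = false;
}

int nru_victim(void)
{
    for (int i = 0; i < NFRAMES; i++)
        if (!ref_bit[i]) return i;   /* any frame not referenced recently */
    return 0;                        /* all referenced: fall back to any  */
}
```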

33
Virtual and physical caches
  • When a system incorporates both an MMU and a
    cache, the cache may operate with either virtual
    or physical addresses
  • Virtual cache
  • + cache access may start immediately; there is no
    need to activate the MMU if the data is found in
    the cache => saves power and eliminates address
    translation from a cache hit
  • - drawbacks
  • every time a process is switched, the VAs refer
    to different physical addresses => the cache must
    be flushed on each process switch,
  • or the width of the cache address tag must be
    increased with a PID
  • the OS and user programs may use two different
    VAs for the same physical location (synonyms,
    aliases) => this may result in two copies of the
    same data;
  • if we modify one, the other will have the wrong
    value

34
Virtual and physical caches (contd)
  • A paging MMU only affects the high-order address
    bits, while the cache is accessed by the
    low-order address bits
  • if these two sets of bits do not overlap, the
    cache and the MMU may proceed in parallel
  • the physical address from the MMU arrives at the
    right time to be compared with the physical
    address tags from the cache, hiding the address
    translation time behind the cache tag access
  • Limits: with a 4KB page, the max cache is 4KB
    direct-mapped, or 8KB 2-way set-associative, or
    16KB 4-way, etc., as worked out below
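
A worked version of this limit: since the set-index and block-offset bits must lie within the untranslated 12-bit page offset, the maximum cache size is page size times associativity:

```c
#include <stdio.h>

int main(void)
{
    unsigned page_size = 4096;                     /* 4KB pages     */
    for (unsigned ways = 1; ways <= 4; ways *= 2)  /* 1-, 2-, 4-way */
        printf("%u-way: max cache = %u KB\n", ways,
               page_size * ways / 1024);
    return 0;
}
/* Prints 4 KB, 8 KB, and 16 KB, matching the slide's limits. */
```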

35
The ARM710T cache organization
4-way, 4KB, 16B blocks, random replacement
policy, write-through, virtual cache