1
TLP on Chip: SMT and CMP
2
SMT
  • Discussed simultaneous multithreading (SMT) in
    the last lecture
  • Basic goal is to run multiple threads at the same
    time
  • Helps hide large memory latency: even if one
    thread is blocked on a cache miss, ready
    instructions from other threads can still be
    scheduled without incurring the overhead of a
    context switch
  • Improves memory level parallelism (MLP)
  • Overall, improves resource utilization enormously
    as compared to a superscalar processor
  • Latency of a particular thread may not improve,
    but the overall throughput of the system
    increases (i.e. average number of retired
    instructions per cycle)

3
Multi-threading
  • Three design choices for single-core hardware
    multi-threading
  • Coarse-grain multithreading: execute one thread
    at a time; when the running thread is blocked on
    a long-latency event, e.g., a cache miss, swap in
    a new thread; this swap can take place in hardware
    (needs extra support and extra cycles for
    flushing the pipe and saving register values
    unless renamed registers remain pinned)
  • Fine-grain multithreading: fetch, decode, rename,
    issue, and execute instructions from threads in
    round-robin fashion; improves utilization across
    cycles, but the problem remains within a cycle:
    if a thread gets blocked on a long-latency event,
    its slots go wasted for many cycles
  • Simultaneous multithreading (SMT): mix
    instructions from all threads every cycle for
    maximum utilization of resources (a toy contrast
    of the three policies is sketched below)
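
A minimal sketch of the three policies, assuming a
4-wide fetch and a simple per-thread 'blocked' flag
(illustrative only, not any specific machine's logic):

    # Toy model: each cycle the front end has FETCH_WIDTH slots to fill.
    # threads[i]['blocked'] is True while thread i waits on a long-latency event.
    FETCH_WIDTH = 4

    def coarse_grain(threads, current):
        # Run one thread until it blocks, then swap in another ready thread
        # (in real hardware the swap itself costs extra cycles).
        if threads[current]['blocked']:
            ready = [i for i, t in enumerate(threads) if not t['blocked']]
            current = ready[0] if ready else current
        slots = [] if threads[current]['blocked'] else [current] * FETCH_WIDTH
        return slots, current

    def fine_grain(threads, cycle):
        # Round-robin across cycles: the whole cycle belongs to one thread,
        # so a blocked thread's turn is simply wasted.
        tid = cycle % len(threads)
        return [] if threads[tid]['blocked'] else [tid] * FETCH_WIDTH

    def smt(threads):
        # Mix ready threads within the same cycle for maximum slot utilization.
        ready = [i for i, t in enumerate(threads) if not t['blocked']]
        return [ready[s % len(ready)] for s in range(FETCH_WIDTH)] if ready else []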

4
Problems of SMT
  • Offers a processor that can deliver reasonably
    good multithreaded performance with fine-grained
    fast communication through cache
  • Although it is possible to design an SMT
    processor with a small die area increase (about
    5% in Pentium 4), for good performance it is
    necessary to rethink the resource allocation
    policies at various stages of the pipe
  • Also, verifying an SMT processor is much harder
    than the basic underlying superscalar design
  • Must think about various deadlock/livelock
    possibilities since the threads interact with
    each other through shared resources on a
    per-cycle basis
  • Why not exploit the transistors available today
    to just replicate existing superscalar cores and
    design a single chip multiprocessor (CMP)?

5
CMP
  • CMP is the mantra of today's microprocessor
    industry
  • Intel's dual-core Pentium 4: each core is still
    hyperthreaded (just uses existing cores)
  • Intel's quad-core Whitefield is coming up in a
    year or so
  • For the server market Intel has announced a
    dual-core Itanium 2 (code-named Montecito); again,
    each core is 2-way threaded
  • AMD released dual-core Opteron in 2005
  • IBM released its first dual-core processor,
    POWER4, circa 2001; the next-generation POWER5
    also uses two cores, but each core is also 2-way
    threaded
  • Sun's UltraSPARC IV (released in early 2004) is a
    dual-core processor and integrates two UltraSPARC
    III cores

6
Why CMP?
  • Today microprocessor designers can afford to have
    a lot of transistors on the die
  • Ever-shrinking feature size leads to dense
    packing
  • What would you do with so many transistors?
  • Can invest some in caches, but beyond a certain
    point it doesn't help
  • The natural choice was to think about a greater
    level of integration
  • A few chip designers decided to bring the memory
    and coherence controllers along with the router
    onto the die
  • The next obvious choice was to replicate the
    entire core; it is fairly simple: just use the
    existing cores and connect them through a
    coherent interconnect

7
Moore's law
  • The number of transistors on a die doubles every
    18-24 months (a quick growth calculation is
    sketched below)
  • Exponential growth in available transistor count
  • If transistor utilization is constant, this would
    lead to exponential performance growth, but life
    is slightly more complicated
  • Wires don't scale with transistor technology;
    wire delay becomes the bottleneck
  • Short wires are good; this dictates localized
    logic design
  • But superscalar processors exercise centralized
    control requiring long wires (or pipelined long
    wires)
  • However, to utilize the transistors well, we need
    to overcome the memory wall problem
  • To hide memory latency we need to extract more
    independent instructions, i.e., more ILP
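
As a quick check on that growth rate (the doubling
period and horizon below are just example numbers):

    # Transistor budget growth if the count doubles every
    # 'months_per_doubling' months.
    def transistor_growth(years, months_per_doubling=18):
        return 2 ** (years * 12 / months_per_doubling)

    print(transistor_growth(6))      # ~16x over 6 years at an 18-month doubling period
    print(transistor_growth(6, 24))  # ~8x at a 24-month doubling period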

8
Moore's law
  • Extracting more ILP directly requires more
    available in-flight instructions
  • But for that we need a bigger ROB, which in turn
    requires a bigger register file
  • Also, we need bigger issue queues to be able to
    find more parallelism
  • None of these structures scales well; the main
    problem is wiring
  • So the best solution to utilize these transistors
    effectively at low cost must not require long
    wires and must be able to leverage existing
    technology; CMP satisfies these goals exactly
    (use existing processors and invest transistors
    to have more of them on-chip instead of trying
    to scale the existing processor for more ILP)

9
Moore's law
10
Power consumption?
  • Hey, didn't I just make my power consumption
    roughly N-fold by putting N cores on the die?
  • Yes, if you do not scale down voltage or
    frequency
  • Usually CMPs are clocked at a lower frequency
  • Oops! My games run slower!
  • Voltage scaling happens due to smaller process
    technology
  • Overall, roughly cubic dependence of power on
    voltage or frequency
  • Need to talk about different metrics
  • Performance/Watt (same as the reciprocal of
    energy for a fixed amount of work)
  • More generally, Performance^(k+1)/Watt (k > 0);
    a small example computing these metrics is
    sketched below
  • Need smarter techniques to further improve these
    metrics
  • Online voltage/frequency scaling
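
A small example comparing two hypothetical design
points under these metrics (the numbers are made up
for illustration; performance is in MIPS):

    # Hypothetical design points: (performance in MIPS, power in watts).
    designs = {
        'superscalar @ 3 GHz': (4000.0, 100.0),
        'dual-core CMP @ 2 GHz': (6000.0, 80.0),
    }

    def perf_per_watt(perf, watt, k=0):
        # k = 0 gives plain Performance/Watt (work per unit energy);
        # larger k weights performance more heavily: Performance^(k+1)/Watt.
        return perf ** (k + 1) / watt

    for name, (perf, watt) in designs.items():
        print(name, perf_per_watt(perf, watt), perf_per_watt(perf, watt, k=1))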

11
Clustered arch.
  • An alternative to CMP is clustered
    microarchitecture
  • Still tries to extract ILP and runs a single
    thread
  • But divides the execution unit into clusters,
    where each cluster has a separate register file
  • The number of ports per register file goes down
    dramatically, reducing the complexity
  • Can even replicate/partition caches
  • Big disadvantage: keeping the register file and
    cache partitions coherent may need global wires
  • Key factor: frequency of communication
  • Also, standard problems of single-threaded
    execution remain: branch prediction, fetch
    bandwidth, etc.

12
Clustered arch.
May want to steer dependent instructions to the
same cluster to minimize communication; a simple
steering heuristic is sketched below
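
A minimal sketch of dependence-based steering
(assuming each instruction lists its source and
destination registers; the tie-breaking rule is an
illustrative heuristic, not any machine's policy):

    # Steer an instruction to the cluster that already holds most of its
    # producers, breaking ties in favor of the less loaded cluster.
    def steer(instr, producer_cluster, cluster_load, n_clusters=2):
        votes = [0] * n_clusters
        for src in instr['srcs']:
            if src in producer_cluster:
                votes[producer_cluster[src]] += 1
        best = max(range(n_clusters), key=lambda c: (votes[c], -cluster_load[c]))
        producer_cluster[instr['dst']] = best   # its consumers will now prefer this cluster
        cluster_load[best] += 1
        return best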
13
ABCs of CMP
  • Where to put the interconnect?
  • Do not want to access the interconnect too
    frequently because these wires are slow
  • It probably does not make much sense to have the
    L1 cache shared among the cores: it requires very
    high bandwidth and may necessitate a redesign of
    the L1 cache and surrounding load/store unit,
    which we do not want to do; so settle for private
    L1 caches, one per core
  • Makes more sense to share the L2 or L3 caches
  • Need a coherence protocol at the L2 interface to
    keep the private L1 caches coherent; may use a
    high-speed custom-designed snoopy bus connecting
    the L1 controllers, or may use a simple directory
    protocol
  • An entirely different design choice is not to
    share the cache hierarchy at all (dual-core AMD
    and Intel); this rids you of the on-chip coherence
    protocol, but there is no gain in communication
    latency

14
Shared cache design
  • Need to be banked
  • How many coherence engines per bank?
  • Notion of home bank? Miss in home bank means
    what?
  • Snoop or directory?
  • COMA with home bank?

15
Hierarchical MP
  • SMT and CMP add a couple more levels to the
    hierarchical multiprocessor design
  • If you just have an SMT processor, among the
    threads you can do shared memory multiprocessing
    with possibly the fastest communication; you can
    connect the SMT processors to build an SMP over a
    snoopy bus; you can connect these SMP nodes over
    a network with a directory protocol
  • Can do the same thing with CMP; the only
    difference is that you need to design the on-chip
    coherence logic (that is not automatically
    enforced as in SMT)
  • If you have a CMP with each core being an SMT,
    then you really have a tall hierarchy of shared
    memory; the communication becomes costlier as you
    go up the hierarchy, and it also becomes very
    much non-uniform

16
IBM POWER4
17
IBM POWER4
  • Dual-core chip multiprocessor

18
4-chip 8-way NUMA
19
32-way ring bus
20
POWER4 core
  • 8-wide fetch, 8-wide issue, 5-wide commit
  • Features out-of-order issue with renaming and
    branch prediction (bimodal + gshare hybrid)
  • Allows 20 groups of at most 5 instructions each
    to be in-flight beyond dispatch (100 instructions)

21
POWER4 pipeline
  • Relatively short pipe
  • Clocked at more than 1 GHz in 0.18 µm technology
  • Minimum 15 cycles for integer instructions
  • Minimum 12-cycle branch misprediction penalty
  • 11 small parallel issue queues (divided into four
    groups) for fast selection
  • Back-to-back issue of dependent instructions is
    not allowed (slow bypass, or bypass absent?);
    requires at least a one-cycle gap
  • Out-of-order load issue with load-load and
    load-store replay; load-load replay is optimized
    with a load queue snoop bit
  • Write-through, write-no-allocate private L1 data
    cache; at most 8 outstanding L1 load misses
  • Inclusion maintained between L2 and L1

22
POWER4 pipeline
23
POWER4 caches
  • Private L1 instruction and data caches (on chip)
  • L1 icache: 64 KB / direct-mapped / 128-byte lines
  • L1 dcache: 32 KB / 2-way associative / 128-byte
    lines / LRU
  • No M state in the L1 data cache (write-through)
  • On-chip shared L2 (the on-chip coherence point)
  • 1.5 MB / 8-way associative / 128-byte lines /
    pseudo-LRU
  • For on-chip coherence, the L2 tag is augmented
    with a two-bit sharer vector used to invalidate
    the L1 on the other core's write (see the sketch
    below)
  • Three L2 controllers, and each L2 controller has
    four local coherence units; each L2 controller
    handles roughly 512 KB of data divided into four
    SRAM partitions
  • For off-chip coherence, each L2 controller has
    four snoop engines; executes an enhanced MESI
    protocol with seven states
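
A rough sketch of how a per-line sharer vector at the
shared L2 keeps the two private L1s coherent (this
abstracts away the real controller structure; because
the L1s are write-through, the L2 always holds the
latest data):

    # One sharer bit per core for every L2 line; a write from one core
    # invalidates the line in the other core's L1 and leaves the writer
    # as the only sharer.
    N_CORES = 2

    class L2Line:
        def __init__(self):
            self.sharers = [False] * N_CORES

    def l1_read_miss(line, core):
        line.sharers[core] = True            # core now holds the line in its L1

    def l1_write(line, core, invalidate_l1):
        for other in range(N_CORES):
            if other != core and line.sharers[other]:
                invalidate_l1(other)         # cross-invalidate the other private L1
                line.sharers[other] = False
        line.sharers[core] = True            # write-through: the L2 copy is updated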

24
POWER4 L2 cache
25
POWER4 L3 cache
  • On-chip tag (IBM calls it the directory),
    off-chip data
  • 32 MB / 8-way associative / 512-byte lines
  • Contains eight coherence/snoop controllers
  • Does not maintain inclusion with L2; requires the
    L3 to snoop the fabric interconnect as well
  • Maintains five coherence states
  • Putting the L3 cache on the other side of the
    fabric requires every L2 cache miss (even a local
    miss) to cross the fabric, which increases latency
    quite a bit

26
POWER4 L3 cache
27
POWER4 die photo
28
IBM POWER5
29
IBM POWER5
  • Carries POWER4 on to the next generation
  • Each core of the dual-core chip is 2-way SMT, at
    a 24% area growth per core
  • More than two threads would not only add
    complexity, but may not provide extra performance
    benefit; in fact, performance may degrade because
    of resource contention and cache thrashing unless
    all shared resources are scaled up accordingly
    (hits a complexity wall)
  • The L3 cache is moved to the processor side so
    that the L2 cache can talk to it directly; this
    reduces bandwidth demand on the interconnect (at
    least L3 hits do not go on the bus)
  • This change enabled POWER5 designers to scale to
    64-processor systems (i.e., 32 chips with a total
    of 128 threads)
  • Bigger L2 and L3 caches: 1.875 MB L2, 36 MB L3
  • On-chip memory controller

30
IBM POWER5
Reproduced from IEEE Micro
31
IBM POWER5
  • Same pipeline structure as POWER4
  • Added SMT facility
  • Like Pentium 4, fetches from each thread in
    alternate cycles (8-instruction fetch per cycle,
    just like POWER4)
  • Threads share the ITLB and ICache
  • Increased register file size compared to POWER4
    to support two threads: 120 integer and
    floating-point registers (POWER4 has 80 integer
    and 72 floating-point registers); this also
    improves single-thread performance compared to
    POWER4; the smaller technology (0.13 µm) made it
    possible to access a bigger register file in the
    same or shorter time, leading to the same
    pipeline as POWER4
  • Doubled the associativity of the L1 caches to
    reduce conflict misses: the icache is 2-way and
    the dcache is 4-way

32
IBM POWER5
Reproduced from IEEE Micro
33
IBM POWER5
  • Thread priority
  • Software can set the priority of a thread, and
    the hardware (essentially the decoder) reads these
    priority registers to decide which thread to
    process in a given cycle
  • The higher-priority thread gets more decode
    cycles in the long run, i.e., injects more
    instructions into the pipe (a hypothetical
    weighting is sketched below)
  • Eight priority levels for each thread; level 0
    means idle
  • Real-time tasks get higher priority, while a
    thread looping on a spin lock gets lower priority
  • Level 1 is the lowest priority for an active
    thread; if both threads are running at level 1,
    the processor throttles the overall decode rate
    to save dynamic power
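
A hypothetical sketch of priority-weighted decode
allocation (the slide does not give POWER5's actual
weighting; here decode slots over a window are simply
split in proportion to the priority levels, with
level 0 treated as idle):

    # Share a window of decode slots between two threads according to their
    # priority levels (0-7); level 0 means the thread is idle.
    def decode_share(prio_a, prio_b, decode_slots=100):
        if prio_a == 0 and prio_b == 0:
            return 0, 0
        total = prio_a + prio_b
        share_a = decode_slots * prio_a // total
        return share_a, decode_slots - share_a

    print(decode_share(7, 2))   # the higher-priority thread gets most decode cycles
    print(decode_share(1, 1))   # both at level 1: the real core also slows decode to save power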

34
IBM POWER5
  • Adaptive resource balancing
  • Mainly three hardware mechanisms used by POWER5
    to make sure that one thread is not hogging too
    many resources (a sketch follows below)
  • If one thread is found to consume too many GCT
    entries, i.e., has too many in-flight instructions
    (one GCT entry holds at most 5 instructions), that
    thread gets fewer decode cycles until the GCT
    occupancy reaches a balanced state (note the
    difference with ICOUNT)
  • If a thread has too many outstanding L2 cache
    misses, that thread is given fewer decode cycles
    (why?)
  • If a thread is executing a sync, all instructions
    belonging to that thread that are waiting in the
    pipe at the dispatch stage are flushed, and
    fetching from that thread is inhibited until the
    sync finishes (why?)
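
A sketch of the balancing idea (the thresholds and
the throttle decision below are illustrative
assumptions, not POWER5's actual values):

    # Decide whether a thread should get fewer decode cycles, based on its
    # share of global completion table (GCT) entries, its outstanding L2
    # misses, and whether it is executing a sync.
    GCT_ENTRIES = 20          # 20 groups of up to 5 instructions each

    def should_throttle(thread, gct_share_limit=0.6, l2_miss_limit=4):
        if thread['gct_entries'] > gct_share_limit * GCT_ENTRIES:
            return True       # hogging in-flight instruction slots
        if thread['l2_misses_outstanding'] > l2_miss_limit:
            return True       # its instructions would only sit and occupy shared resources
        if thread['executing_sync']:
            return True       # POWER5 goes further: flush at dispatch and stop fetching
        return False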

35
IBM POWER5
  • Dynamic power management
  • With SMT and CMP the average amount of switching
    per cycle increases, leading to more power
    consumption
  • Need to reduce power consumption without losing
    performance; the simple solution is to clock the
    chip at a slower frequency, but that hurts
    performance
  • POWER5 employs fine-grain clock gating: in every
    cycle the power management logic decides if a
    certain latch will be used in the next cycle; if
    not, it disables or gates the clock for that
    latch so that it will not switch unnecessarily in
    the next cycle (a toy model is sketched below)
  • The clock-gating and power management logic
    themselves should be very simple
  • If both threads are running at priority level 1,
    the processor switches to a low-power mode where
    it dispatches instructions at a much slower pace
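
A toy model of the saving from per-latch clock gating
(purely illustrative; the real decision is made by
simple combinational logic in hardware):

    # Count how often a latch is clocked with and without gating, given a
    # per-cycle flag saying whether the latch will hold useful data next cycle.
    def latch_updates(useful_next_cycle, gated=True):
        updates = 0
        for useful in useful_next_cycle:
            if useful or not gated:
                updates += 1      # an ungated latch is clocked (and may switch) every cycle
        return updates

    activity = [True, False, False, True, False]
    print(latch_updates(activity, gated=False), latch_updates(activity, gated=True))  # 5 vs 2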

36
POWER5 die photo
37
Intel Montecito
38
Features
  • Dual core Itanium 2, each core dual threaded
  • 1.7 billion transistors, 21.5 mm x 27.7 mm die
  • 27 MB of on-chip cache in three levels
  • Not shared among cores
  • 1.8 GHz, 100 W
  • Single-thread enhancements
  • An extra shifter improves the performance of
    crypto codes by 100%
  • Improved branch prediction
  • Improved data and control speculation recovery
  • Separate L2 instruction and data caches buy a 7%
    improvement over Itanium 2; four times bigger L2I
    (1 MB)
  • Asynchronous 12 MB L3 cache

39
Overview
Reproduced from IEEE Micro
40
Dual threads
  • SMT only for cache, not for core resources
  • Simulations showed high resource utilization at
    core level, but low utilization of cache
  • The branch predictor is still shared but uses
    thread id tags
  • Thread switch is implemented by flushing the pipe
  • More like coarse-grain multithreading
  • Five thread switch events
  • L3 cache miss (immense impact on in-order pipe)/
    L3 cache refill
  • Quantum expiry
  • Spin lock/ ALAT invalidation
  • Software-directed switch
  • Execution in low power mode

41
Thread urgency
  • Each thread has eight urgency levels
  • Every L3 miss decrements urgency by one
  • Every L3 refill increments urgency by one until
    urgency reaches 5
  • A switch due to time quantum expiry sets the
    urgency of the switched thread to 7
  • Arrival of asynchronous interrupt for a
    background thread sets the urgency level of that
    thread to 6
  • A switch due to an L3 miss additionally requires
    the urgency levels of the two threads to be
    compared (the update rules are sketched below)
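
The urgency rules above can be summarized as a small
state-update sketch (the 0-7 clamping and the exact
comparison on an L3 miss are assumptions where the
slide is not explicit):

    # Urgency is a 0-7 value per thread; the core prefers the more urgent thread.
    def on_l3_miss(urgency):
        return max(0, urgency - 1)   # each L3 miss makes the thread less attractive

    def on_l3_refill(urgency):
        return min(5, urgency + 1)   # refills restore urgency, but only up to 5

    def on_quantum_expiry():
        return 7                     # a thread switched out on quantum expiry gets urgency 7

    def on_async_interrupt():
        return 6                     # an interrupt for the background thread raises it to 6

    def switch_on_l3_miss(urgency_running, urgency_other):
        # Assumed rule: an L3 miss forces a switch only if the other thread
        # looks at least as urgent as the running one.
        return urgency_other >= urgency_running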

42
Thread urgency
Reproduced from IEEE Micro
43
Core arbiter
Reproduced from IEEE Micro
44
Power efficiency
  • Foxton technology
  • Blind replication of Itanium 2 cores at 90 nm
    would lead to roughly 300 W peak power
    consumption (Itanium 2 consumes 130 W peak at 130
    nm)
  • When the power consumption is below the ceiling,
    the voltage is increased, leading to higher
    frequency and performance
  • 10% boost for enterprise applications
  • Software or the OS can also dictate a frequency
    change if power saving is required
  • 100 ms response time for the feedback loop
  • Frequency control is achieved by 24 voltage
    sensors distributed across the chip; the entire
    chip runs at a single frequency (other than the
    asynchronous L3)
  • Clock gating found limited application in
    Montecito

45
Foxton technology
Reproduced from IEEE Micro
  • Embedded microcontroller runs a real-time
    scheduler to execute various tasks

46
Die photo
47
Sun Niagara / UltraSPARC T1
48
Features
  • Eight pipelines or cores, each shared by 4
    threads
  • 32-way multithreading on a single chip
  • Starting frequency of 1.2 GHz, consumes 60 W
  • Shared 3 MB L2 cache, 4-way banked, 12-way set
    associative, 200 GB/s bandwidth
  • Single-issue, six-stage pipe
  • The target market is web services, where ILP is
    limited but TLP is huge (independent
    transactions)
  • Throughput matters

49
Pipeline details
Reproduced from IEEE Micro
50
Pipeline details
  • Four threads share a six-stage pipeline
  • Shared L1 caches and TLBs
  • Dedicated register file per thread
  • Fetches two instructions every cycle from a
    selected thread
  • Thread select logic also determines which
    thread's instruction should be fed into the pipe
    (a sketch of the selection follows below)
  • Although the pipe is in-order, there is an
    8-entry store buffer per thread (why?)
  • Instructions come with predecoded bits to
    facilitate thread selection
  • Threads may run into structural hazards due to
    the limited number of FUs
  • The divider is granted to the least recently
    executed thread
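
A rough sketch of the per-cycle thread-select
decision (the stall bookkeeping and the
least-recently-run fairness rule for selection are
simplifying assumptions; the divider grant follows
the last bullet):

    # Pick which of the four threads feeds the single-issue pipe this cycle.
    # A thread is skipped while it is stalled on a long-latency event (load,
    # divide, multiply, branch), a cache miss, a trap, or a structural hazard.
    def select_thread(threads):
        ready = [t for t in threads if not t['stalled']]
        if not ready:
            return None                  # no ready thread: the pipe sees a bubble
        return min(ready, key=lambda t: t['last_run_cycle'])

    def grant_divider(waiting_threads):
        # The shared divider goes to the least recently executed waiting thread.
        return min(waiting_threads, key=lambda t: t['last_run_cycle'])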

51
Cache hierarchy
  • L1 instruction cache
  • 16 KB / 4-way / 32 bytes / random replacement
  • Fetches two instructions every cycle
  • If both instructions are useful, next cycle is
    free for icache refill
  • L1 data cache
  • 8 KB / 4-way / 16-byte lines / write-through,
    no-allocate
  • On average, a 10% miss rate for the target
    benchmarks
  • L2 cache extends the tag to maintain a directory
    for keeping the core L1s coherent
  • L2 cache is writeback with silent clean eviction

52
Thread selection
  • Based on long latency events such as load,
    divide, multiply, branch
  • Also based on pipeline stalls due to cache
    misses, traps, or structural hazards
  • Speculative load-dependent issue with low priority