1
Kilo-instruction Processors
Mateo Valero, UPC
HPCA-10, Madrid, February 14-17, 2004
2
Motivation
Technology works against ILP: faster clock rates
=> lower ILP
Justin Rattner, Intel-MRL, Keynote lecture,
Micro-32
3
The trends are changing
  • 1990s architecture
  • Short pipelines
  • Low memory latencies
  • 2010 architectures
  • Long pipelines
  • 30-50 stages
  • Power-Thermal-Wire delay aware architecture
  • Long memory latencies
  • 500 to 1000 cycles
  • ISCA-2003 50 to 160

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
4
Memory Wall Problem
[Chart data labels: 0.6X, 0.45X]
Memory latency has enormous impact on IPC
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
5
Reducing Memory Latency
  • Technology
  • Caches
  • Prefetching
  • Hardware, Software and combined
  • Assisted/SSMT Threads
  • Kilo-instruction Processor

6
Kilo-instruction Processors
  • Our goals
  • Better tolerate increasing memory latency
  • Further improve ILP, even at these longer memory
    latencies
  • Allow additional optimizations enabled by the new
    architecture (see below)
  • Our proposal: Kilo-instruction Processors
  • Out-of-order processors with thousands of
    instructions in flight (very large instruction
    windows)
  • Intelligent use of resources (resource
    requirements grow much more slowly than window
    size)

7
Kilo-instruction Processor
  • It is not...
  • A heavy processor?
  • Cyber-205 like processor
  • Vector Processor
  • Blue-Gene like
  • Multiscalar, Trace Processor
  • Raw, Imagine, Levo, TRIPS
  • It is...
  • An affordable out-of-order superscalar processor
    with thousands of in-flight instructions

8
Outline
  • Motivation
  • Increasing the number of in-flight instructions
  • Kilo-instruction Processor Ingredients
  • Multi-Checkpointing the ROB
  • Out-of-Order Commit
  • Early Release of Resources
  • Ephemeral Registers
  • Load Queues
  • Locality Exploitation
  • Instruction Queues
  • LSQ
  • Cross-pollination with other techniques
  • Kilo-processor and multiprocessor systems
  • kilo-vector processor
  • Kilo-SMT processor
  • Further Improvements
  • Branch prediction
  • kilo-valpred processor

9
ROB Activity
[Figure: ROB, Register File, and IQ holding loads,
branches, and other instructions; labels: 128-entry,
1024-entry]
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
10
Integer, 8-way, L2 1MB
[Chart data labels: 1.22X, 1.1X, 1.86X, 0.6X, 1.41X]
Research Proposal to Intel (July 2001) and
presentation to Intel-MRL, Feb. 2002
Cristal et al. Large Virtual ROBs by Processor
Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
11
Floating-point, 8-way, L2 1MB
[Chart data labels: 2.34X, 2X, 4.58X, 3.91X, 0.45X]
Research Proposal to Intel (July 2001) and
presentation to Intel-MRL, Feb. 2002
Cristal et al. Large Virtual ROBs by Processor
Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
12
Scalability
  • Thousands of in-flight instructions and in-order
    commit make designs impractical
  • ROB: needs to maintain a copy of every in-flight
    instruction
  • IQs: instructions depending on long-latency
    instructions remain in these queues for a long
    time
  • LSQs: instructions remain in the queue until
    commit
  • Registers: a new physical register for each
    instruction producing a new value
  • We would like to get the IPC of thousands of
    in-flight instructions without drastically
    increasing resource requirements

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
13
Late Allocation/Early Release of Registers
[Figure: Register File, ROB, IQ, and Virtual
Registers; physical registers R1 and R2 and values
a, b, c illustrate early release]
Monreal et al. Delaying physical register
allocation through virtual-physical registers,
MICRO'99
T. Monreal et al., Late allocation and early
release of physical registers, IEEE-TC (to appear)
14
Nearby and Distant Parallelism
[Figure: ROB and Register File split into a Nearby
region and a Distant (Speculative, Replayable)
region of loads and branches]
Balasubramonian et al. Dynamically Allocating
Processor Resources, ISCA'01
15
Dynamic Vectorization
[Figure: ROB from ROB_head to ROB_tail containing a
load, a branch (br), and instructions C.I. 1 and
C.I. 2, alongside the register file]
A. Pajuelo et al. Control-Flow Independence
Reuse via Dynamic Vectorization, UPC-DAC
17
Checkpointing the ROB
  • Checkpointing to support precise exceptions
  • Well-established and widely used technique
  • W.M. Hwu and Y.N. Patt, ISCA 1987
  • Checkpointing for early release of resources
  • Quite recent concept
  • Cherry: J. Martínez et al., MICRO, Nov. 2002
  • Large VROB: A. Cristal et al., TR-UPC-DAC, July
    2002

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
18
Cherry
[Figure: ROB divided at the Point of No Return (PNR)
into an irreversible and a reversible part; Cherry
early release applies to registers, loads, and
stores]
Martínez et al. Cherry: Checkpointed Early
Resource Recycling, MICRO'02
19
Multi-Checkpoint
[Figure: ROB, IQ, and Checkpointing Table.
Checkpoint 1 (load 1) and Checkpoint 2 (branch 2);
each table entry lists PC, status, counter; labels:
Gang commit Checkpoint 1, OOO commit]
Cristal et al. Large Virtual ROBs by Processor
Checkpointing, TR UPC-DAC, July 2002
Research Proposal to Intel (July 2001) and
presentation to Intel-MRL, Feb. 2002
21
Early Release of Resources
[Figure: timeline from Fetch to Commit spanning the
memory latency (e.g., 1000 cycles)]
T. Karkhanis and J. Smith, A day in the life of a
data cache miss, Workshop on Memory Performance
Issues, ISCA-2002
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
22
Registers
  • Register File is a critical component of a modern
    superscalar processor
  • Large number of entries to support out-of-order
    execution and memory latency
  • Large number of ports to increase issue width
  • Power and access time are key issues for register
    file design
  • It is always beneficial to reduce the number of
    physical registers

23
Physical Registers
  • Conventional renaming scheme
  • Virtual-Physical Registers
  • Early Release
  • Ephemeral Registers: checkpoint and
    virtual-physical

[Figure: for each scheme, the intervals during which
a register is unused and used]
T. Monreal et al. Delaying physical register
allocation through virtual-physical registers,
MICRO'99
M. Moudgill et al., Register renaming and dynamic
speculation: an alternative approach, MICRO'93
T. Monreal et al., Late allocation and early
release of physical registers, IEEE-TC (to appear)
J. Martínez et al., Ephemeral Registers, Technical
Report CSL-TR-2003-1035, 2003
24
State of Registers (FP, ROB2048)
A. Cristal et al., A case for resource-conscious
out-of-order processors, IEEE TCCA CA Letters,
Vol. 2, Oct. 2003
26
IQs and Kilo processors
  • Increasing the number of IQ entries increases
    power, area, and access time
  • Wake-up and selection logic must be implemented
    efficiently
  • Kilo-instruction processors may have many
    in-flight instructions
  • We need a new organization for the IQs in order
    to have affordable kilo-instruction processors

27
Execution Time of Instructions
  • Lebeck et al., A large, fast instruction window
    for tolerating cache misses, ISCA-29, 2002.
  • Brekelbaum et al., Hierarchical scheduling
    windows, MICRO-35, 2002.
  • Cristal et al., Out-of-Order Commit Processors,
    TR UPC-DAC-2003-44, July 2003; HPCA-10, Feb.
    2004

[Figure: ROB, Secondary Buffer, and IQ, with
instructions numbered 1 to 3]
28
Load/Store Queues
  • Efficient and affordable memory disambiguation is
    mandatory for kilo-instruction processors
  • We need to guarantee that loads and stores reach
    memory in the correct order
  • Increasing the number of in-flight instructions
    can make the load/store queues a true bottleneck
    in both latency and power

29
State of LD Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
30
State of ST Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
31
Search Filtering
  • Determine independence without associative search
    on addresses
  • Use a Bloom filter to control associative search
  • Approximate tracking (false positives are
    possible)
  • No false negatives => no mispredictions

[Figure: filter; search associatively only if the
hashed bit is set to 1]
S. Sethumadhavan et al. Scalable Hardware Memory
Disambiguation for High ILP Processors, Micro-36,
2003
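
The filtering idea on this slide can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `SearchFilter`, the single hash function, the 64-bit table, and the plain list standing in for the store queue are all hypothetical and are not taken from Sethumadhavan et al. Clearing bits when stores leave the queue (which would need counters) is not modelled.

```python
class SearchFilter:
    """Toy Bloom-style filter in front of an associative store-queue search."""

    def __init__(self, num_bits=64):
        self.num_bits = num_bits
        self.bits = [0] * num_bits   # one bit per hash bucket
        self.store_queue = []        # addresses of in-flight stores
        self.searches = 0            # associative searches actually performed

    def _hash(self, addr):
        # Hypothetical hash: word address modulo the table size.
        return (addr >> 2) % self.num_bits

    def add_store(self, addr):
        self.store_queue.append(addr)
        self.bits[self._hash(addr)] = 1

    def load_may_conflict(self, addr):
        # Bit clear: no in-flight store can match (no false negatives),
        # so the associative search is skipped.
        if not self.bits[self._hash(addr)]:
            return False
        # Bit set: a real match or a false positive, so search the queue.
        self.searches += 1
        return addr in self.store_queue


flt = SearchFilter(num_bits=64)
flt.add_store(0x1000)
print(flt.load_may_conflict(0x1000))  # True: real match found by the search
print(flt.load_may_conflict(0x2004))  # False: bit clear, search skipped
print(flt.load_may_conflict(0x1100))  # False: false positive, search performed
print(flt.searches)                   # 2
```

In this sketch a false positive only costs an unnecessary search; correctness is preserved because a clear bit guarantees that no in-flight store shares the address.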
32
Putting It All Together
[Chart labels: Physical Registers, Virtual
Registers, Memory Latency; IQs of 128 entries]
A. Cristal et al. Kilo-instruction Processors.
Invited paper. ISHPC-V, Tokyo, LNCS-2858. October
20-22, 2003
34
Kilo-processor and multiprocessor systems
First results Ideal Network
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
35
Kilo-processor and multiprocessor systems
Impact of the network-ROB 64
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
36
Kilo-processor and multiprocessor systems
First Results
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
37
Kilo-processor and multiprocessor systems
Network latency, Radix, 250 cyc. latency
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
38
Kilo-vector processor
[Figure: execution-time bars. Program: 20 and 80.
Vector: 20 and 8, speedup 3.5. Kilo: 5 and 8,
speedup 7.7]
F. Quintana et al, Kilo-vector processors,
UPC-DAC
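
The speedups on this slide can be checked with a few lines of Python. This is a hedged reading of the figure rather than something the slide states outright: it assumes the numbers are execution-time units, that the baseline program spends 20 units in non-vector code and 80 in vectorizable code, that the vector processor cuts the 80 to 8, and that the kilo-vector processor additionally cuts the 20 to 5. The function name `speedup` is hypothetical.

```python
def speedup(baseline_parts, new_parts):
    """Ratio of total baseline time to total new time."""
    return sum(baseline_parts) / sum(new_parts)


baseline = (20, 80)      # assumed: (non-vector time, vectorizable time)
vector = (20, 8)         # vector unit shortens only the vectorizable part
kilo_vector = (5, 8)     # kilo-instruction window also shortens the rest

print(round(speedup(baseline, vector), 2))       # 3.57
print(round(speedup(baseline, kilo_vector), 2))  # 7.69
```

Under these assumptions the results, 100/28 and 100/13, are consistent with the 3.5 and 7.7 shown on the slide.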
40
Kilo-valpred processor
T. Ramírez et al. Kilo-value prediction
processor UPC-DAC
41
Kilo and Control Independence
  • More opportunities to find control independent
    instructions
  • Squash reuse
  • Control-independent instruction reexecution
    removal
  • Savings
  • Power/energy
  • Execution bandwidth
  • Resources
  • Helps to go far ahead in the instruction window
    faster

42
UPC contribution to kilo processors
  • We started our work in June 2001
  • Grant proposal to Intel-MRL (Konrad Lai and Ronny
    Ronen) on January 28th, 2002
  • Presentation to Intel-MRL in February 2002
  • A. Cristal et al. Large virtual ROBs by
    processor checkpointing, Technical Report
    UPC-DAC-2002-39, July 2002. (Rejected for
    Micro-2002)
  • Multiple Checkpointers
  • Out-of-order commit, no need for ROB
  • Early release of registers and loads
  • A. Cristal and M. Valero, ROBs virtuales
    utilizando checkpointing. Spanish Workshop on
    Parallelism. Lleida, Sept. 2002
  • Same as the previous report, but in Spanish
  • A. Cristal, J. Martínez, M. Valero and J. Llosa,
    Ephemeral Registers, Technical Report
    CSL-TR-2003-1035, 2003. Rejected for ISCA 2003
    and Micro 2003
  • Checkpoint, early release, and late allocation
    of registers
  • Presentation to Intel-MRL in March 2003
  • A. Cristal, J. Martínez, J. Llosa and M. Valero,
    A case for resource-conscious out-of-order
    processors, IEEE TCCA Computer Architecture
    Letters, Vol. 2, October 2003
  • Underutilization of resources

43
UPC contribution to kilo processors
  • A. Cristal et al. A case for
    resource-conscious out-of-order processors:
    Towards Kilo-instruction in-flight processors.
    MEDEA Workshop, Sept. 2003 and ACM CAN, March 2004
  • A. Cristal et al. Kilo-instruction Processors.
    Invited paper. ISHPC-V, Tokyo, LNCS-2858. October
    20-22, 2003
  • A. Cristal et al. Future ILP Processors.
    Invited paper. IJHPCN, to be published
  • A. Cristal et al. Out-of-Order Commit
    Processors, Technical Report UPC-DAC-2003-44,
    July 2003. HPCA-10, Madrid, Feb. 2004
  • Remove-Reinsert Mechanism
  • Simple reinsert mechanism
  • M. Galluzzi et al. A First glance at
    Kiloinstruction Based Multiprocessors Invited
    Paper. ACM Computing Frontiers Conference.
    Ischia, Italy, April 10-12, 2004
  • Much new work done at this moment

44
Talks about Kilo processors, from UPC
  • Presentation in Barcelona, to Intel-MRL in
    February 2002
  • Spanish Workshop on Parallelism. Lleida, Sept.,
    2002
  • Presentation to Intel-MRL in March 2003
  • Invited presentation. NSF Panel On the Future
    of Computer Architecture Research Wise Views and
    Fresh Perspectives. San Diego, June 2003
  • Invited Lecture. PA3CT Conference. Edegem,
    Belgium, September 22-23, 2003
  • MEDEA Workshop. New Orleans, September 2003
  • Invited Lecture. ISHPC-V. The 5th International
    Symposium on High Performance Computing. Tokyo,
    Japan, October 20-22, 2003
  • Keynote lecture. Seminar on Compilers and
    Architecture. IBM Haifa, November 11th, 2003
  • Invited lecture. Intel MRL. Haifa, Israel, Nov.
    12th, 2003
  • HPCA-10, Madrid, February 14-18, 2004
  • Keynote lecture. HPCA-10. Madrid, February 14-18,
    2004
  • Invited lecture. ACM Computing Frontiers. Ischia,
    April, 2004
  • ACM Invited lecture. ENCAR México, May 2004
  • More future presentations scheduled

45
Memory Latency
  • N. Jouppi and P. Ranganathan, The relative
    importance of memory latency, bandwidth and
    branch prediction. Workshop on Mixing Logic and
    DRAM: Chips that compute and remember, during
    ISCA-24, 1997
  • S. Srinivasan and A. Lebeck, Load latency
    tolerance in dynamically scheduled processors,
    Micro-31, 1998
  • K. Skadron, P. Ahuja, M. Martonosi and D. Clark,
    Branch prediction, instruction window size and
    cache size: Performance tradeoffs and simulation
    techniques, IEEE-TC, pp. 1260-1281, 1999.

46
Large Reorder Buffers
  • G. Sohi, S. Breach, and T. N. Vijaykumar,
    Multiscalar processors, ISCA-22, 1995.
  • E. Rotenberg, Q. Jacobson, Y. Sazeides, and J.
    Smith, Trace processors, MICRO-30, 1997
  • H. Akkary and M. Driscoll, A dynamic
    multithreaded processor, Micro-31, 1998
  • R. Balasubramonian, S. Dwarkadas, and D.
    Albonesi, Dynamically allocating processor
    resources between nearby and distant ILP, ISCA,
    June 2001.
  • Save some resources allocated for eager
    execution
  • P. Ranganathan, V. Pai, and S. Adve, Using
    speculative retirement and large instruction
    windows to narrow the performance gap between
    memory consistency models, SPAA, 1997
  • J. M. Tendler, S. Dodson, S. Fields, H. Lee, and
    B. Sinharoy, Power4 System Microarchitecture,
    IBM Journal of Research and Development, pp.
    5-25, January 2002.

47
Checkpointing
  • W.M. Hwu and Y.N. Patt, Checkpoint repair for
    out-of-order execution machines, ISCA-14, 1987.
  • Checkpointing as a recovery mechanism
  • Early Release of Resources
  • A. Cristal, M. Valero, and J. Llosa. Large
    virtual ROBs by processor checkpointing,
    Technical Report UPC-DAC-2002-39, July 2002.
  • Multiple Checkpointers
  • Out-of-order commit, no need for ROB
  • Early release of registers and loads
  • J.F. Martínez, J. Renau, M.C. Huang, M.
    Prvulovic, and J. Torrellas. Cherry: checkpointed
    early resource recycling in out-of-order
    microprocessors. MICRO-35, Nov. 2002.
  • One checkpoint
  • Early release of resources

48
Register File
  • M. Moudgill, K. Pingali and S. Vassiliadis,
    Register renaming and dynamic speculation: an
    alternative approach, in Proceedings of the 26th
    Annual International Symposium on
    Microarchitecture, 1993.
  • Early Release of Registers
  • T. Monreal, A. González, M. Valero, J. González,
    V. Viñals, Delaying Physical Register Allocation
    through Virtual-Physical Registers, in
    Proceedings of the 32nd Annual International
    Symposium on Microarchitecture, 1999.
  • Virtual registers, late allocation of registers
  • A. Cristal, J. Martínez, M. Valero and J. Llosa,
    Ephemeral Registers, Technical Report
    CSL-TR-2003-1035, 2003.
  • Checkpoint, early release, and late allocation
    of registers
  • T. Monreal et al., Late allocation and early
    release of physical registers, IEEE-TC (to
    appear)

49
Instruction Queues
  • S. Palacharla, N.P. Jouppi, and J.E. Smith,
    Complexity-effective superscalar processors,
    ISCA-24, 1997.
  • Divide the instruction queues into a set of FIFO
    queues
  • A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan,
    and E. Rotenberg, A large, fast instruction
    window for tolerating cache misses, ISCA-29,
    2002.
  • Remove-Reinsert Mechanism
  • Keep the load dependence of all instructions
  • E. Brekelbaum, J. Rupley, C. Wilkerson, and B.
    Black, Hierarchical scheduling windows, MICRO-35,
    2002.
  • Two clusters: a slow, big one and a fast, small
    one for critical instructions
  • A. Cristal, D. Ortega, J. Llosa and M. Valero,
    Out-of-Order Commit Processors, Technical Report
    UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb.
    2004
  • Remove-Reinsert Mechanism
  • Simple reinsert mechanism

50
References for LSQ for Large ROB
  • A. Cristal, M. Valero, and J. Llosa. Large
    virtual ROBs by processor checkpointing,
    Technical Report UPC-DAC-2002-39, July 2002
  • J.F. Martínez, J. Renau, M.C. Huang, M.
    Prvulovic, and J. Torrellas. Cherry:
    checkpointed early resource recycling in
    out-of-order microprocessors. MICRO-35, 2002
  • H. Akkary, R. Rajwar and S. T. Srinivasan,
    Checkpoint Processing and Recovery: Towards
    Scalable Large Instruction Window Processors,
    Micro-36, 2003
  • S. Sethumadhavan, R. Desikan, D. Burger, C.R.
    Moore and S. W. Keckler, Scalable Hardware Memory
    Disambiguation for High ILP Processors, Micro-36,
    2003

51
Conclusion
  • Affordable Kilo-instruction Processors
  • Checkpointing and resource-conscious
    architectures
  • Out-of-order commit
  • Ephemeral registers
  • Two-level instruction queues
  • Early release of loads
  • Load/store queue management
  • New ideas to watch for
  • Better branch predictors
  • Predication and Multi-path execution
  • Control and data independent instructions
  • Reuse of large blocks of instructions
  • New processor paradigms
  • Kilo-based multiprocessor systems
  • Kilo-vector processors
  • Kilo-SMT processors
  • Kilo-valpred processors

52
Acknowledgments
  • Yale Patt
  • Alex Veidenbaum
  • Guri Sohi
  • Mark Hill
  • Wen-mei Hwu
  • Mon Beivide
  • Valentín Puente
  • José Angel Gregorio
  • Teresa Monreal
  • Victor Viñals
  • Intel, Konrad Lai and Ronny Ronen
  • Adrián Cristal
  • José Martínez
  • Josep Llosa
  • Daniel Ortega
  • Fran Cazorla
  • Enrique Fernández
  • Ayose Falcón
  • Alex Pajuelo
  • Marco Galluzzi
  • Tanausu Ramírez
  • Jim Smith

53
  • Thank you very much

54
Processor-DRAM Gap (latency)
[Figure: Performance (log scale, 1 to 1000) versus
Time (1980 to 2000). CPU (µProc, Moore's Law):
60%/yr. DRAM: 7%/yr. Processor-Memory Performance
Gap grows 50%/year.]
D.A. Patterson, New directions in Computer
Architecture, Berkeley, June 1998
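
The three rates on this slide are consistent with each other, as the short calculation below shows. It assumes that the slide's "60/yr." and "7/yr." are annual percentage improvements in processor and DRAM performance; the function names are hypothetical and the sketch simply compounds the two rates.

```python
def gap_growth(cpu_rate, dram_rate):
    """Yearly growth factor of the processor/DRAM performance ratio."""
    return (1 + cpu_rate) / (1 + dram_rate)


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Relative gap after `years`, starting from a ratio of 1."""
    return gap_growth(cpu_rate, dram_rate) ** years


print(round(gap_growth(0.60, 0.07), 3))  # 1.495, i.e. about 50% per year
```

Compounding that factor with `gap_after` shows why the gap spans orders of magnitude on the slide's log-scale axis within two decades.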
55
Runahead Execution
[Figure: ROB in Runahead Mode after an L2 cache
miss; a checkpoint is taken and some entries are
marked INV]
  • generate bogus value
  • invalidate dep. registers
  • continue execution

  • Virtually increases ROB size
  • Prefetches data of future loads

Mutlu et al. Runahead Execution: An
Alternative, HPCA'03
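
The three runahead steps listed on this slide (generate a bogus value, invalidate dependent registers, continue execution) can be sketched as INV-bit propagation. This is a minimal illustration, not the mechanism of Mutlu et al.: the `(op, dest, sources)` instruction format, the function name `runahead`, and the register names are all hypothetical, and the checkpoint restore at the end of runahead mode is not modelled.

```python
def runahead(instructions, missing_loads):
    """Pre-execute instructions in runahead mode.

    instructions: list of (op, dest, sources) tuples (hypothetical format).
    missing_loads: set of indices of loads that miss in the L2 cache.
    Returns (registers marked INV, indices of loads whose miss was issued).
    """
    inv = set()       # registers currently holding bogus (INV) values
    requests = []     # misses sent to memory, including the triggering one
    for i, (op, dest, sources) in enumerate(instructions):
        if any(src in inv for src in sources):
            # Depends on a bogus value: invalidate the destination, move on.
            if dest is not None:
                inv.add(dest)
        elif op == "load" and i in missing_loads:
            # Valid address but a miss: the request goes out to memory,
            # a bogus value is generated, and the destination is marked INV.
            requests.append(i)
            inv.add(dest)
        elif dest is not None:
            # Normal execution overwrites the register with a valid value.
            inv.discard(dest)
    return inv, requests


program = [
    ("load", "r1", ["r10"]),    # 0: L2 miss that triggers runahead
    ("add", "r2", ["r1"]),      # 1: depends on the miss
    ("branch", None, ["r2"]),   # 2: depends on an INV register
    ("add", "r3", ["r11"]),     # 3: independent, executes normally
    ("load", "r4", ["r3"]),     # 4: second miss, issued early as a prefetch
    ("add", "r5", ["r4"]),      # 5: depends on the second miss
]
inv, requests = runahead(program, missing_loads={0, 4})
print(sorted(inv))   # ['r1', 'r2', 'r4', 'r5']
print(requests)      # [0, 4]
```

In this toy run the second miss is issued while the first is still outstanding, which is the prefetching effect the slide attributes to runahead mode.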
56
Kilo and Control Independence
  • Larger windows improve
  • The probability of finding the reconvergence
    point
  • The correct detection of control-independent
    instructions, because the wrong path is
    completely executed
  • The execution of more control-independent
    instructions for later reuse

[Figure: wrong path and correct path with
reconvergence point (RP) and control-independent
(CI) instructions, for current instruction windows
and kilo-instruction windows]
57
Kilo and Control Independence
  • The larger the window the more opportunities to
    find the reconvergence point.

Current instruction windows
58
Grant Proposal to Intel January 28th, 2002
  • In the first semester we worked on the smart
    register file and the associated ISA, and
    evaluated the proposed architecture with a few
    kernels. We showed speedups around 20 in the
    tested kernels. At the end of the first semester,
    we began the work on wide registers. From April
    to August 2001, we investigated three
    different approaches to using register files with
    wide ports (i.e., ports that allow reading several
    consecutive registers in a single access). The
    first one tried to find subgraphs in the
    data dependence graph that have the same shape.
    The drawback of this approach is that it requires
    moving loads above stores in order to achieve
    significant coverage. Some type of dependence
    speculation, which adds non-negligible complexity,
    is required. We also studied the potential
    to exploit wide registers by looking at
    instructions in a window of 32 instructions. For
    Spec95 programs, we found that 48.9% and 52.3%
    of the operands were not wide for integer and
    FP codes respectively. We continue working on an
    approach that tries to group the two values of
    all two-operand instructions in a single wide
    register.
  • Since August 2001, we have been working on
    committing instructions out of order, which allows
    processor resources to be freed in advance and
    the execution of new instructions to continue. The
    main idea is as follows: when the processor finds
    an old instruction in the ROB with a large
    latency and the ROB is full, the processor
    removes this instruction by checkpointing the
    state of the processor at the last committed
    instruction. The processor continues its work
    normally and moves all instructions that
    depend on the checkpointed instruction to the
    checkpointing table. In case of misspeculation or
    an exception in either the checkpointed
    instruction or any instruction dependent on it,
    the checkpointed state is restored. The design of
    the mechanism is still in progress. We are
    building a simulation environment that will
    permit us to evaluate the proposal.
  • The work we plan to do during this year
    concerning the out-of-order commit mechanism
    is the following:
  • To finish the simulator and start
    evaluating different alternatives for the
    implementation of the out-of-order commit
    mechanism
  • To optimize the mechanism for those branches
    where the branch predictor fails frequently.
  • To study new organizations for the load-store
    queues.
  • To use the concept of virtual registers to
    optimize the register file organization.
  • Concerning the work dealing with wide
    registers, we are going to finish the design of
    the mechanism and evaluate it.
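
The checkpoint-and-restore idea described in this proposal can be sketched in Python. The proposal itself says the design was still in progress, so this is only a toy model under stated assumptions: the class `CheckpointedCore`, its method names, and the dictionary of registers are hypothetical, and moving dependent instructions into the checkpointing table is not modelled.

```python
class CheckpointedCore:
    """Toy model: speculative register state plus a checkpointing table."""

    def __init__(self, registers):
        self.registers = dict(registers)   # current (speculative) state
        self.checkpoints = []              # checkpointing table, oldest first

    def take_checkpoint(self, pc):
        # Snapshot the state at the last committed instruction so the
        # long-latency instruction at `pc` can leave the ROB.
        self.checkpoints.append({"pc": pc, "registers": dict(self.registers)})

    def execute(self, dest, value):
        # Younger instructions keep executing and update state speculatively.
        self.registers[dest] = value

    def commit_oldest(self):
        # Everything guarded by the oldest checkpoint completed without a
        # fault: drop the checkpoint, making those results final together.
        return self.checkpoints.pop(0)["pc"]

    def rollback(self):
        # Misspeculation or exception: restore the youngest checkpoint and
        # return the PC from which execution restarts. The checkpoint is
        # kept so that re-execution is still protected.
        cp = self.checkpoints[-1]
        self.registers = dict(cp["registers"])
        return cp["pc"]


core = CheckpointedCore({"r1": 0, "r2": 0})
core.take_checkpoint(pc=0x40)    # long-latency load at 0x40 blocks the ROB
core.execute("r2", 7)            # a younger instruction runs ahead
restart_pc = core.rollback()     # e.g. the load raised an exception
print(hex(restart_pc), core.registers)  # 0x40 {'r1': 0, 'r2': 0}
```

If no fault occurs, `commit_oldest()` releases the checkpoint instead, which corresponds to committing all instructions it guards at once.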
