Kiloinstruction Processors

1
Kilo-instruction Processors
Mateo Valero, UPC
HPCA-10, Madrid, February 14-17, 2004
2
Motivation
Technology works against ILP: faster clock rates => lower ILP
Justin Rattner, Intel-MRL, keynote lecture, Micro-32
3
The trends are changing

1990s architecture
Short pipelines
Low memory latencies
2010 architectures
Long pipelines
30-50 stages
Power-Thermal-Wire delay aware architecture
Long memory latencies
500 to 1000 cycles
ISCA-2003 50 to 160

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
4
Memory Wall Problem
[Chart annotations: 0.6X, 0.45X]
Memory latency has enormous impact on IPC
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
5
Reducing Memory Latency

Technology
Caches
Prefetching
Hardware, Software and combined
Assisted/SSMT Threads
Kilo-instruction Processor

6
Kilo-instruction Processors

Our goals
Better tolerate increasing memory latency
Further improve ILP, even with these longer memory latencies
Allow additional optimizations enabled by the new architecture (see below)
Our proposal: Kilo-instruction Processors
Out-of-order processors with thousands of instructions in flight (very large instruction windows)
Intelligent use of resources (resource requirements grow much more slowly than window size)

7
Kilo-instruction Processor

It is not:
A heavy processor
A Cyber-205 like processor
A vector processor
Blue-Gene like
Multiscalar, Trace Processor
Raw, Imagine, Levo, TRIPS
It is:
An affordable out-of-order superscalar processor with thousands of in-flight instructions

8
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

9
ROB Activity
[Figure: contents of the ROB, IQ and register file, with entries labelled load 1, branch 1, load 2, branch 3, a, b and x, shown for a 128-entry and a 1024-entry ROB]
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
10
Integer, 8-way, L2 1MB
[Chart annotations: 1.22X, 1.1X, 1.86X, 0.6X, 1.41X]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
11
Floating-point, 8-way, L2 1MB
[Chart annotations: 2.34X, 2X, 4.58X, 3.91X, 0.45X]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
12
Scalability

Thousands of in-flight instructions and in-order commit make designs impractical
ROB: needs to maintain a copy of every in-flight instruction
IQs: instructions depending on long-latency instructions remain in these queues for a long time
LSQs: instructions remain in the queue until commit
Registers: a new physical register for each instruction producing a new value
We would like to get the IPC of thousands of in-flight instructions without drastically increasing resource requirements

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
13
Late Allocation/Early Release of Registers
[Figure: ROB, IQ, register file and virtual registers holding load 1, branch 1, branch 2, load 2, branch 3 and instructions a, b, c; physical registers R1 and R2 are marked for early release]
Monreal et al., Delaying physical register allocation through virtual-physical registers, MICRO'99
T. Monreal et al., Late allocation and early release of physical registers, IEEE-TC (to appear)
14
Nearby Distant Parallelism
[Figure: ROB and register file; nearby instructions (load X, f(X), branch) versus distant, speculative and replayable instructions (further loads and branches)]
Balasubramonian et al., Dynamically Allocating Processor Resources, ISCA'01
15
Dynamic Vectorization
[Figure: ROB between ROB_head and ROB_tail containing a load, a branch (br) and the instructions C.I. 1 and C.I. 2, next to the register file]
A. Pajuelo et al., Control-Flow Independence Reuse via Dynamic Vectorization, UPC-DAC
16
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

17
Checkpointing the ROB

Checkpointing to support precise exceptions
A well-established and widely used technique
W.M. Hwu and Y.N. Patt, ISCA 1987
Checkpointing for early release of resources
A quite recent concept
Cherry: J. Martínez et al., MICRO, Nov. 2002
Large VROB: A. Cristal et al., TR-UPC-DAC, July 2002

M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
18
Cherry
[Figure: ROB with a load and a branch; the point of no return (PNR) separates the irreversible region, where registers, loads and stores are released early, from the reversible region]
Martínez et al., Cherry: Checkpointed Early Resource Recycling, MICRO'02
19
Multi-Checkpoint
[Figure: ROB and IQ alongside a checkpointing table. Checkpoint 1 (load 1) and Checkpoint 2 (branch 2) each record PC, status, counter, ...; instructions commit out of order and Checkpoint 1 is gang-committed]
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
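
The checkpointing table on this slide (entries holding a PC, a status and a counter, with a whole checkpoint committed at once) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' design: the class and method names are hypothetical, the counter is assumed to count unfinished instructions dispatched under a checkpoint, and register and memory state are not modelled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Checkpoint:
    pc: int               # where fetch restarts after a rollback
    pending: int = 0      # instructions under this checkpoint not yet finished
    closed: bool = False  # True once a younger checkpoint exists


class CheckpointTable:
    """Toy model: instructions finish in any order, checkpoints commit in order."""

    def __init__(self, start_pc=0):
        self.table = deque([Checkpoint(start_pc)])

    def take_checkpoint(self, pc):
        # Close the current checkpoint and open a younger one.
        self.table[-1].closed = True
        self.table.append(Checkpoint(pc))

    def dispatch(self):
        # A new instruction is charged to the youngest checkpoint.
        cp = self.table[-1]
        cp.pending += 1
        return cp

    def finish(self, cp):
        # An instruction finished; try to gang-commit the oldest checkpoints.
        cp.pending -= 1
        released = []
        while self.table[0].closed and self.table[0].pending == 0:
            released.append(self.table.popleft().pc)
        return released

    def rollback(self, cp):
        # Misprediction or exception: discard younger checkpoints, restart at cp.
        while self.table[-1] is not cp:
            self.table.pop()
        cp.pending = 0
        cp.closed = False
        return cp.pc


table = CheckpointTable(start_pc=0x100)
load1 = table.dispatch()            # long-latency load under checkpoint 0x100
table.take_checkpoint(0x200)
younger = table.dispatch()          # instruction under checkpoint 0x200
first = table.finish(younger)       # finishes first, nothing can commit yet
second = table.finish(load1)        # oldest checkpoint now commits as a group
```

In this model no per-instruction ROB entry is kept: the only ordering state is the queue of checkpoints, which is the point the slide makes about replacing a very large ROB.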
20
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

21
Early Release of Resources
[Figure: timeline from fetch to commit across a memory latency of, e.g., 1000 cycles]
T. Karkhanis and J. Smith, A day in the life of a data cache miss, Workshop on Memory Performance Issues, ISCA-2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
22
Registers

Register File is a critical component of a modern
superscalar processor
Large number of entries to support out-of-order
execution and memory latency
Large number of ports to increase issue width
Power and access time are key issues for register
file design
It is always beneficial to reduce the number of
physical registers

23
Physical Registers

Conventional renaming scheme
Virtual-Physical Registers
Early Release
Ephemeral Registers (checkpoint + virtual-physical)

[Figure: periods in which a register is unused and used under the schemes listed above]
T. Monreal et al., Delaying physical register allocation through virtual-physical registers, MICRO'99
M. Moudgill et al., Register renaming and dynamic speculation: an alternative approach, MICRO'93
T. Monreal et al., Late allocation and early release of physical registers, IEEE-TC (to appear)
J. Martínez et al., Ephemeral Registers, Technical Report CSL-TR-2003-1035, 2003
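
The idea behind virtual-physical registers (late allocation, early release) can be sketched in a few lines. This is a toy model under assumptions, not the mechanism from the cited papers: the class and method names are hypothetical, a destination gets only a storage-free virtual tag at rename, a physical register is taken at writeback, and the exact condition for calling `release` is left to the caller.

```python
class VirtualPhysicalRenamer:
    """Toy renamer: virtual tags at rename, physical registers only at writeback."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # free physical registers
        self.next_tag = 0
        self.map = {}       # logical register -> latest virtual tag
        self.storage = {}   # virtual tag -> physical register, once a value exists

    def rename_dest(self, logical):
        # Rename consumes a tag, not a physical register.
        tag = self.next_tag
        self.next_tag += 1
        self.map[logical] = tag
        return tag

    def writeback(self, tag):
        # Late allocation: storage is needed only when the value is produced.
        if not self.free:
            return None  # a real design would have to stall or re-execute here
        phys = self.free.pop()
        self.storage[tag] = phys
        return phys

    def release(self, tag):
        # Early release: return the register as soon as the value is dead.
        self.free.append(self.storage.pop(tag))


renamer = VirtualPhysicalRenamer(num_physical=2)
tags = [renamer.rename_dest(r) for r in ("r1", "r2", "r3")]
free_after_rename = len(renamer.free)      # three results in flight, none allocated
renamer.writeback(tags[0])
free_after_writeback = len(renamer.free)
renamer.release(tags[0])
free_after_release = len(renamer.free)
```

With a conventional scheme the three renamed destinations would already need three physical registers; here only values that actually exist occupy storage.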
24
State of Registers (FP, ROB2048)
A. Cristal et al., A case for resource-conscious
out-of-order processors, IEEE TCCA CA Letters,
Vol. 2, Oct. 2003
25
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

26
IQs and Kilo processors

Increasing the number of IQ entries increases
power, area and access time
Wake-up and selection logic need to be implemented
efficiently
Kilo-instruction processors may have many
in-flight instructions
We need a new organization for the IQs in order to
have affordable kilo-instruction processors

27
Execution Time of Instructions

Lebeck et al., A large, fast instruction window for tolerating cache misses, ISCA-29, 2002.
Brekelbaum et al., Hierarchical scheduling windows, MICRO-35, 2002.
Cristal et al., Out-of-Order Commit Processors, TR UPC-DAC-2003-44, July 2003; HPCA-10, Feb. 2004

[Figure: instructions 1, 2 and 3 distributed among the ROB, the IQ and a secondary buffer]
28
Load/Store Queues

Efficient and affordable memory disambiguation is
mandatory for kilo-instruction processors
We need to guarantee that loads and stores reach
memory in the correct order
Increasing the number of in-flight instructions
can make the load/store queues a true bottleneck,
both in latency and power

29
State of LD Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
30
State of ST Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
31
Search Filtering

Determine independence without associative search on addresses
Use a Bloom filter to control the associative search
Approximate tracking (false positives are possible)
No false negatives => no mispredictions

[Figure: filter in front of the queue; search associatively only if the hashed bit is set to 1]
S. Sethumadhavan et al., Scalable Hardware Memory Disambiguation for High ILP Processors, Micro-36, 2003
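
The search-filtering idea on this slide can be sketched as follows. This is a minimal illustration, not the cited paper's design: the hash function, the filter size and the names `SearchFilter` and `issue_load` are assumptions, and counters are used instead of single bits so that entries can be removed when a store leaves the queue.

```python
class SearchFilter:
    """Bloom-style filter that decides whether an associative search is needed."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.counters = [0] * num_entries

    def _index(self, address):
        # Hypothetical hash: drop the byte-offset bits, then take the modulo.
        return (address >> 3) % self.num_entries

    def insert(self, address):
        self.counters[self._index(address)] += 1

    def remove(self, address):
        idx = self._index(address)
        if self.counters[idx] > 0:
            self.counters[idx] -= 1

    def may_conflict(self, address):
        # False means "certainly no in-flight store to this address".
        return self.counters[self._index(address)] > 0


def issue_load(address, store_filter, store_queue):
    """Return (searched, forwarded_value); store_queue is oldest-first."""
    if not store_filter.may_conflict(address):
        return False, None          # independent: skip the associative search
    for store_address, value in reversed(store_queue):
        if store_address == address:
            return True, value      # youngest matching store forwards its value
    return True, None               # false positive: searched, nothing found


store_filter = SearchFilter(num_entries=64)
store_queue = [(0x1000, 7)]
store_filter.insert(0x1000)

hit = issue_load(0x1000, store_filter, store_queue)
skipped = issue_load(0x1008, store_filter, store_queue)
false_positive = issue_load(0x1200, store_filter, store_queue)
```

A false positive only costs an unnecessary search, while the absence of false negatives means a load is never wrongly declared independent, which is why the slide notes that no mispredictions result.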
32
Putting It All Together
[Figure: chart labelled with physical registers, virtual registers and memory latency, for IQs of 128 entries]
A. Cristal et al., Kilo-instruction Processors. Invited paper. ISHPC-V, Tokyo, LNCS-2858. October 20-22, 2003
33
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

34
Kilo-processor and multiprocessor systems
First results Ideal Network
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
35
Kilo-processor and multiprocessor systems
Impact of the network-ROB 64
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
36
Kilo-processor and multiprocessor systems
First Results
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
37
Kilo-processor and multiprocessor systems
Network latency, Radix, 250 cyc. latency
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
38
Kilo-vector processor
[Figure: execution-time bars. Program: 20 and 80. Vector: 20 and 8 (speedup 3.5). Kilo: 5 and 8 (speedup 7.7)]
F. Quintana et al., Kilo-vector processors, UPC-DAC
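
The two speedups on this slide are consistent with a simple sum-of-parts calculation. The reading below is an assumption about the figure, not something the transcript states: the original program is taken as 20 units of scalar time plus 80 units of vectorizable time, the vector processor shrinks the second part to 8, and the kilo-instruction version additionally shrinks the first part to 5.

```python
def speedup(baseline_parts, new_parts):
    """Ratio of total baseline time to total new time."""
    return sum(baseline_parts) / sum(new_parts)


program = (20, 80)      # assumed: scalar part, vectorizable part
vector = (20, 8)        # assumed: only the vectorizable part shrinks
kilo_vector = (5, 8)    # assumed: the scalar part shrinks as well

vector_speedup = speedup(program, vector)
kilo_speedup = speedup(program, kilo_vector)

print(round(vector_speedup, 2))  # 3.57
print(round(kilo_speedup, 2))    # 7.69
```

Under this reading the values 100/28 and 100/13 match the slide's 3.5 and 7.7 up to rounding, and most of the second gain comes from attacking the part that vectorization leaves untouched.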
39
Outline

Motivation
Increasing the number of in-flight instructions
Kilo-instruction Processor Ingredients
Multi-Checkpointing the ROB
Out-of-Order Commit
Early Release of Resources
Ephemeral Registers
Load Queues
Locality Exploitation
Instruction Queues
LSQ
Cross-pollination with other techniques
Kilo-processor and multiprocessor systems
kilo-vector processor
Kilo-SMT processor
Further Improvements
Branch prediction
kilo-valpred processor

40
Kilo-valpred processor
T. Ramírez et al. Kilo-value prediction
processor UPC-DAC
41
Kilo and Control Independence

More opportunities to find control independent
instructions
Squash reuse
Control-independent instruction
reexecution removal
Savings
Power/energy
Execution bandwidth
Resources
Helps to go far ahead in the instruction window
faster

42
UPC contribution to kilo processors

We started our work in June 2001
Grant proposal to Intel-MRL (Konrad Lai and Ronny
Ronen) on January 28th, 2002
Presentation to Intel-MRL in February 2002
A. Cristal, et al. Large virtual ROBs by
processor checkpointing Technical Report
UPC-DAC-2002-39, July 2002. (Rejected for
Micro-2002)
Multiple Checkpointers
Out-of-order Commit, No need for ROB
Early release of registers and loads
A. Cristal and M. Valero, ROBs virtuales
utilizando checkpointing. Spanish Workshop on
Parallelism. Lleida, Sept., 2002
Same as the previous report, but in Spanish
A. Cristal, J. Martínez, M. Valero and J. Llosa,
Ephemeral Registers, Technical Report
CSL-TR-2003-1035 , 2003. Rejected for ISCA 2003
and Micro 2003
Checkpoint, early release, late allocation of
registers
Presentation to Intel-MRL in March 2003
A. Cristal, J. Martínez, J. LLosa and M. Valero,
A case for resource-conscious out-of-order
processors, IEEE TCCA Computer Architecture
Letters, Vol. 2, October 2003
Underutilization of resources

43
UPC contribution to kilo processors

A. Cristal, et al. A case for
resource-conscious out-of-order processors
Towards Kilo-instruction in-flight processors.
MEDEA Workshop, Sept 2003 and ACM-CAN, March 2004
A. Cristal et al. Kilo-instruction Processors.
Invited paper. ISHPC-V.Tokyo, LNCS-2858. October
20-22th, 2003
A. Cristal et al. Future ILP Processors.
Invited paper. IJHPCN, to be published
A. Cristal, et al. Out-of-Order Commit
Processors Technical Report UPC-DAC-2003-44,
July 2003. HPCA-10, Madrid, Feb. 2004
Remove-Reinsert Mechanism
Simple reinsert mechanism
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
Much new work done at this moment

44
Talks about Kilo processors, from UPC

Presentation in Barcelona, to Intel-MRL in
February 2002
Spanish Workshop on Parallelism. Lleida, Sept.,
2002
Presentation to Intel-MRL in March 2003
Invited presentation. NSF Panel On the Future
of Computer Architecture Research Wise Views and
Fresh Perspectives. San Diego, June 2003
Invited Lecture. PA3CT Conference. Edegem,
Belgium, September 22-23, 2003
MEDEA Workshop. New Orleans, September 2003
Invited Lecture. ISHPC-V. The 5th International
Symposium on High Performance Computing. Tokyo,
Japan, October 20-22, 2003
Keynote lecture. Seminar on Compilers and
Architecture. IBM Haifa. November 11th., 2003.
Invited lecture. Intel MRL. Haifa., Israel. Nov.
12th., 2003
HPCA-10, Madrid, February 14-18, 2004
Keynote lecture. HPCA-10. Madrid, February 14-18,
2004
Invited lecture. ACM Computing Frontiers. Ischia,
April, 2004
ACM Invited lecture. ENCAR México, May 2004
More future presentations scheduled

45
Memory Latency

Jouppi and P. Ranganathan. The relative
importance of memory latency, bandwidth and
branch prediction. Workshop on Mixing Logic and
DRAM: Chips that Compute and Remember, during
ISCA-24, 1997
S. Srinivasan and A. Lebeck, Load latency
tolerance in dynamically scheduled processors,
Micro-31, 1998
K. Skadron, P. Ahuja, M. Martonosi and D. Clark
Branch prediction, instruction window size and
cache size Performance tradeoffs and simulation
techniques IEEE-TC, pp. 1260-1281, 1999.

46
Large Reorder Buffers

G. Sohi, S. Breach, and T. N. Vijaykumar
Multiscalar processors ISCA-22, 1995.
E. Rotenberg, Q. Jacobson, Y. Sazeides, and J.
Smith Trace processors ISCA-24, 1997
H. Akkary and M. Driscoll, A dynamic
multithreaded processor, Micro-31, 1998
R. Balasubramonian, S. Dwarkadas, and D.
Albonesi. Dynamically allocating processor
resources between nearby and distant ILP, ISCA,
June 2001.
Save some resources allocated for eager
execution
P. Ranganathan, V. Pai, and S. Adve Using
speculative retirement and large instruction
windows to narrow the performance gap between
memory consistency models SPAA, 1997
J. M. Tendler, S. Dodson, S. Fields, H. Lee, and
B. Sinharoy Power4 System Microarchitecture
IBM Journal of Research and Development, pp.
5-25, January 2002.

47
Checkpointing

W.M. Hwu and Y. N. Patt, Checkpoint repair for
out-of-order execution machines ISCA-14, 1987.
Checkpointing as a recovery mechanism
Early Release of Resources
A. Cristal, M. Valero, and J. LLosa. Large
virtual ROBs by processor checkpointing
Technical Report UPC-DAC-2002-39, July 2002.
Multiple Checkpointers
Out-of-order Commit, No need for ROB
Early release of registers and loads
J.F. Martínez, J. Renau, M.C. Huang, M.
Prvulovic, and J. Torrellas. Cherry checkpointed
early resource recycling in out-of-order
microprocessors. MICRO-35, Nov. 2002.
One checkpoint
Early release of resources

48
Register File

M. Moudgill and K. Pingali and S. Vassiliadis,
Register renaming and dynamic speculation an
alternative approach, In Proceedings of the 26th
annual international symposium on
Microarchitecture, 1993.
Early Release of Registers
T. Monreal, A. González, M. Valero, J. González,
V. Viñals, Delaying Physical Register Allocation
through Virtual-Physical Registers, In
Proceedings of the 32nd annual international
symposium on Microarchitecture, 1999.
Virtual Registers, Late allocation of registers
A. Cristal, J. Martínez, M. Valero and J. Llosa,
Ephemeral Registers, Technical Report
CSL-TR-2003-1035 , 2003.
Checkpoint, early release, late allocation of
registers
T. Monreal et al., Late allocation and early
release of physical registers, IEEE-TC (to
appear)

49
Instruction Queues

S. Palacharla, N.P. Jouppi, and J.E. Smith
Complexity-effective superscalar processors
ISCA-24, 1997.
Divide the Instruction queues in a set of FIFO
queues
A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan,
and E. Rotenberg A large, fast instruction
window for tolerating cache misses ISCA-29,
2002.
Remove-Reinsert Mechanism
Keep the load dependence of all instructions
E. Brekelbaum, J. Rupley, C. Wilkerson, and B.
Black, Hierarchical scheduling windows, MICRO-35,
2002.
Two clusters, a slow/big one, and a faster/small
one for critical instructions
A. Cristal, D. Ortega, J. Llosa and M. Valero
Out-of-Order Commit Processors Technical Report
UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb.
2004
Remove-Reinsert Mechanism
Simple reinsert mechanism

50
References for LSQ for Large ROB

A. Cristal, M. Valero, and J. LLosa. Large
virtual ROBs by processor checkpointing
Technical Report UPC-DAC-2002-39, July 2002
J.F. Martínez, J. Renau, M.C. Huang, M.
Prvulovic, and J. Torrellas. Cherry
checkpointed early resource recycling in
out-of-order microprocessors. MICRO-35, 2002
H. Akkary, R. Rajwar and S. T. Srinivasan,
Checkpoint Processing and Recovery: Towards
Scalable Large Instruction Window Processors,
Micro-36, 2003
S. Sethumadhavan, R. Desikan, D. Burger, C.R.
Moore and S. W. Keckler Scalable Hardware Memory
Disambiguation for High ILP Processors Micro-36,
2003

51
Conclusion

Affordable Kilo-instruction Processors
Checkpointing and resource-conscious
architectures
Out-of-order commit
Ephemeral registers
Two-level instruction queues
Early release of loads
Load/store queue management
New ideas to watch for
Better branch predictors
Predication and Multi-path execution
Control and data independent instructions
Reuse of large blocks of instructions
New processor paradigms
Kilo-based multiprocessor systems
Kilo-vector processors
Kilo-SMT processors
Kilo-valpred processors

52
Acknowledgments

Yale Patt
Alex Veidenbaum
Guri Sohi
Mark Hill
Wen-mei Hwu
Mon Beivide
Valentín Puente
José Angel Gregorio
Teresa Monreal
Victor Viñals
Intel, Konrad Lai and Ronny Ronen

Adrián Cristal
José Martínez
Josep Llosa
Daniel Ortega
Fran Cazorla
Enrique Fernández
Ayose Falcón
Alex Pajuelo
Marco Galluzzi
Tanausu Ramírez
Jim Smith

Thank you very much

54
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc): 60%/yr, following Moore's Law. DRAM: 7%/yr. The processor-memory performance gap grows 50%/year]
D.A. Patterson, New directions in Computer Architecture, Berkeley, June 1998
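
The gap figure on this slide follows from the two growth rates, read here as percentages per year (60 for the processor, 7 for DRAM, about 50 for the gap). A short check, in which the variable names are purely illustrative:

```python
cpu_growth = 1.60    # processor performance multiplier per year
dram_growth = 1.07   # DRAM performance multiplier per year

# The gap is the ratio of the two, so it compounds at the ratio of the rates.
gap_per_year = cpu_growth / dram_growth
gap_percent = (gap_per_year - 1) * 100

print(f"{gap_percent:.1f}% per year")  # 49.5% per year

# Compounded over the 20 years shown on the time axis (1980 to 2000).
gap_after_20_years = gap_per_year ** 20
```

Because the gap compounds, even a modest yearly difference turns into the multi-hundred-cycle memory latencies that motivate the rest of the talk.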
55
Runahead Execution
[Figure: ROB in runahead mode. An L2 cache miss on load 1 triggers a checkpoint; the processor generates a bogus value, invalidates (INV) the dependent registers and continues execution past branch 1, load 2 and branch 3]
Virtually increases the ROB size
Prefetches data for future loads
Mutlu et al., Runahead Execution: An Alternative, HPCA'03
56
Kilo and Control Independence

Larger windows improve
The probability of finding the reconvergence point
The correct detection of control-independent instructions, because the wrong path is completely executed
The execution of more control-independent instructions for later reuse

[Figure: wrong path and correct path up to the reconvergence point (RP), followed by control-independent (CI) instructions, for current instruction windows and kilo-instruction windows]
57
Kilo and Control Independence

The larger the window the more opportunities to
find the reconvergence point.

Current instruction windows
58
Grant Proposal to Intel January 28th, 2002

In the first semester we worked on the smart register file and the associated ISA,
and evaluated the proposed architecture with a few kernels. We showed speedups
around 20 in the tested kernels. At the end of the first semester, we began the
work on wide registers. From April to August 2001, we investigated three different
approaches to using register files with wide ports (i.e. ports that allow several
consecutive registers to be read in a single access). The first one tried to find
subgraphs in the data dependence graph that have the same shape. The drawback of
this approach is that it requires moving loads above stores in order to achieve
significant coverage, so some type of dependence speculation, which adds
non-negligible complexity, is required. We also studied the potential to exploit
wide registers by looking at instructions in a window of 32 instructions. For
Spec95 programs, we found that 48.9% and 52.3% of the operands were not wide for
integer and FP codes respectively. We continue working on an approach that tries
to group the two values of all two-operand instructions in a single wide register.
Since August 2001, we have been working on committing instructions out of order,
which allows processor resources to be freed in advance and the execution of new
instructions to continue. The main idea is as follows: when the processor finds
an old instruction in the ROB with a large latency and the ROB is full, the
processor removes this instruction by checkpointing the state of the processor at
the last committed instruction. The processor continues its work normally and
moves all instructions that depend on the checkpointed instruction to the
checkpointing table. In case of misspeculation or an exception in either the
checkpointed instruction or any instruction dependent on it, the checkpointed
state is restored. The design of the mechanism is still in progress. We are
building a simulation environment that will permit us to evaluate the proposal.
The work we plan to do during this year concerning the out-of-order commit
mechanism is the following:
To finish the simulator and start evaluating different alternatives for the
implementation of the out-of-order commit mechanism
To optimize the mechanism for those branches where the branch predictor fails
frequently
To study new organizations for the load-store queues
To use the concept of virtual registers to optimize the register file
organization
Concerning the work on wide registers, we are going to finish the design of the
mechanism and evaluate it.

