Title: Instruction Level Parallelism (ILP)

1
Instruction Level Parallelism (ILP)

Advanced Computer Architecture
CSE 8383
Spring 2004 2/19/2004
Presented By
Saad Al-Harbi
Saeed Abu Nimeh

2
Outline

What's ILP?
ILP vs Parallel Processing
Sequential execution vs ILP execution
Limitations of ILP
ILP Architectures
Sequential Architecture
Dependence Architecture
Independence Architecture
ILP Scheduling
Open Problems
References

3
What's ILP?

An architectural technique that allows the overlap
of individual machine operations (add, mul,
load, store)
Multiple operations execute in parallel
(simultaneously)
Goal: speed up execution
Example
load R1 ← R2        add R3 ← R3, 1
add R3 ← R3, 1      add R4 ← R3, R2
add R4 ← R4, R2     store R4 ← R0

4
Example Sequential vs ILP

Sequential execution (without ILP)
Add r1, r2 → r8    4 cycles
Add r3, r4 → r7    4 cycles
Total of 8 cycles
ILP execution (overlapped execution)
Add r1, r2 → r8
Add r3, r4 → r7
Total of 5 cycles
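
The cycle counts on this slide can be reproduced with a small sketch. It assumes (this is not stated on the slide) that the overlapped case is a simple pipeline in which the second add starts one cycle after the first, which is one reading consistent with the 5-cycle total. The function names are illustrative only.

```python
def sequential_cycles(n_instructions: int, latency: int) -> int:
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * latency


def overlapped_cycles(n_instructions: int, latency: int) -> int:
    """Assumed model: a new instruction starts every cycle, so the last
    one finishes `latency` cycles after its own start."""
    if n_instructions == 0:
        return 0
    return latency + (n_instructions - 1)


if __name__ == "__main__":
    # Two adds of 4 cycles each, as on the slide.
    print(sequential_cycles(2, 4))   # 8
    print(overlapped_cycles(2, 4))   # 5
```

Under this model the saving grows with the number of independent instructions, since each extra instruction costs one cycle instead of four.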

5
ILP vs Parallel Processing

ILP
Overlap individual machine operations (add, mul,
load) so that they execute in parallel
Transparent to the user
Goal: speed up execution

Parallel Processing
Separate processors work on separate
chunks of the program (the processors are programmed to
do so)
Not transparent to the user
Goal: speed-up and quality-up

6
ILP Challenges

To achieve parallelism, there should be no
dependences among instructions that are
executing in parallel
H/W terminology: data hazards (RAW, WAR, WAW)
S/W terminology: data dependences

7
Dependences and Hazards

Dependences are a property of programs
If two instructions are data dependent, they
cannot execute simultaneously
A dependence can result in a hazard, and the hazard
can cause a stall, depending on the pipeline organization
Data dependences may occur through registers or
memory

8
Types of Dependencies

Name dependencies
Output dependence
Anti-dependence
Data (true) dependence
Control Dependence
Resource Dependence

9
Name dependences

Output dependence
When instructions i and j write the same register
or memory location. The ordering must be
preserved to leave the correct value in the
register
i: add r7,r4,r3
j: div r7,r2,r8
Anti-dependence
When instruction j writes a register or memory
location that instruction i reads
i: add r6,r5,r4
j: sub r5,r8,r11
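
As a rough illustration, the register-level checks behind these definitions can be sketched as follows. The parser assumes a three-operand `op dest,src1,src2` format, which matches the examples on this slide but is an assumption about the instruction set; `Instr`, `parse`, and `dependences` are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def parse(text: str) -> Instr:
    """Parse 'op dest,src1,src2' (assumed format: first operand is the destination)."""
    op, operands = text.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def dependences(i: Instr, j: Instr) -> set:
    """Return the dependences of the later instruction j on the earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("true (RAW)")      # j reads what i writes
    if j.dest in i.srcs:
        found.add("anti (WAR)")      # j overwrites what i reads
    if i.dest == j.dest:
        found.add("output (WAW)")    # both write the same register
    return found


if __name__ == "__main__":
    print(dependences(parse("add r7,r4,r3"), parse("div r7,r2,r8")))
    print(dependences(parse("add r6,r5,r4"), parse("sub r5,r8,r11")))
```

The first pair reports only an output dependence and the second only an anti-dependence, matching the two slide examples.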

10
Data Dependences

An instruction j is data dependent on instruction
i if either of the following holds:
instruction i produces a result that may be used
by instruction j, or
instruction j is data dependent on instruction k,
and instruction k is data dependent on
instruction i

LOOP: LD  F0, 0(R1)
      ADD F4, F0, F2
      SD  F4, 0(R1)
      SUB R1, R1, -8
      BNE R1, R2, LOOP
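
The two-part definition above (direct use of a result, plus chains of dependences) can be sketched for one iteration of this loop. The tuple encoding of each instruction as (name, destination register or None, source registers) is an assumption made for the example, and loop-carried dependences between iterations are deliberately ignored.

```python
def direct_deps(program):
    """For each instruction, find earlier instructions whose result it reads."""
    last_writer = {}
    deps = {idx: set() for idx in range(len(program))}
    for idx, (_, dest, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                deps[idx].add(last_writer[reg])
        if dest is not None:
            last_writer[dest] = idx
    return deps


def all_deps(program):
    """Add transitive dependences: j depends on i if j depends on k and k on i."""
    direct = direct_deps(program)
    closed = {}
    for idx in range(len(program)):  # producers always have smaller indices
        closed[idx] = set(direct[idx])
        for producer in direct[idx]:
            closed[idx] |= closed[producer]
    return closed


# One iteration of the loop on this slide (hand-encoded, an assumption).
LOOP_BODY = [
    ("LD", "F0", ("R1",)),
    ("ADD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
    ("SUB", "R1", ("R1",)),
    ("BNE", None, ("R1", "R2")),
]

if __name__ == "__main__":
    for idx, producers in all_deps(LOOP_BODY).items():
        print(LOOP_BODY[idx][0], "depends on", [LOOP_BODY[p][0] for p in sorted(producers)])
```

In this encoding SD depends directly on ADD and transitively on LD, while BNE depends on SUB.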

11
Control Dependences

A control dependence determines the ordering of
an instruction i with respect to a branch
instruction, so that instruction i is executed
in correct program order.
Example
if p1 { S1 }
if p2 { S2 }
S1 is control dependent on p1; S2 is control dependent on p2 but not on p1

Two constraints imposed by control dependences
An instruction that is control dependent on a
branch cannot be moved before the branch
An instruction that is not control dependent on
a branch cannot be moved after the branch, where
its execution would be controlled by the branch

12
Resource dependences

An instruction is resource-dependent on a
previously issued instruction if it requires a
hardware resource which is still being used by a
previously issued instruction.
e.g., two divides, assuming a single divide unit
div r1, r2, r3
div r4, r2, r5

13
ILP Architectures

Computer Architecture is a contract (instruction
format and the interpretation of the bits that
constitute an instruction) between the class of
programs that are written for the architecture
and the set of processor implementations of that
architecture.
In ILP architectures, the contract also includes
information embedded in the program pertaining to
the available parallelism between the instructions
and operations in the program

14
ILP Architectures Classifications

Sequential architectures: the program is not
expected to convey any explicit information
regarding parallelism (superscalar processors)
Dependence architectures: the program explicitly
indicates the dependences that exist between
operations (dataflow processors)
Independence architectures: the program provides
information as to which operations are
independent of one another (VLIW processors)

15
Sequential architecture and superscalar processors

Program contains no explicit information
regarding dependencies that exist between
instructions
Dependencies between instructions must be
determined by the hardware
It is only necessary to determine dependencies
with sequentially preceding instructions that
have been issued but not yet completed
The compiler may reorder instructions to facilitate
the hardware's task of extracting parallelism

16
Superscalar Processors

Superscalar processors attempt to issue multiple
instructions per cycle
However, essential dependencies are specified only by
the sequential ordering, so operations must be
processed in sequential order
This proves to be a performance bottleneck that
is very expensive to overcome
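
A minimal sketch of this bottleneck, under simplifying assumptions that are not from the slides: in-order issue, every operation takes one cycle, no register renaming (so WAR and WAW conflicts also block issue), and instructions hand-encoded as (destination or None, sources). The two test sequences are the ones from the "What's ILP" slide; treating the store as reading both R4 and R0 is an interpretation.

```python
def issue_cycles(program, width):
    """Return the cycle in which each instruction issues (in-order, `width` per cycle)."""
    cycles = []
    cycle, issued_this_cycle = 0, 0
    for idx, (dest, srcs) in enumerate(program):
        # Hardware must compare against sequentially preceding instructions.
        earliest = 0
        for prev in range(idx):
            p_dest, p_srcs = program[prev]
            raw = p_dest is not None and p_dest in srcs
            war = dest is not None and dest in p_srcs
            waw = dest is not None and dest == p_dest
            if raw or war or waw:
                earliest = max(earliest, cycles[prev] + 1)
        # In-order issue: move to a later cycle if blocked or the cycle is full.
        if earliest > cycle or issued_this_cycle == width:
            cycle = max(cycle + 1, earliest)
            issued_this_cycle = 0
        cycles.append(cycle)
        issued_this_cycle += 1
    return cycles


# load R1 <- R2 ; add R3 <- R3, 1 ; add R4 <- R4, R2
INDEPENDENT = [("R1", ("R2",)), ("R3", ("R3",)), ("R4", ("R4", "R2"))]
# add R3 <- R3, 1 ; add R4 <- R3, R2 ; store R4 <- R0
DEPENDENT = [("R3", ("R3",)), ("R4", ("R3", "R2")), (None, ("R4", "R0"))]

if __name__ == "__main__":
    print(issue_cycles(INDEPENDENT, width=3))  # [0, 0, 0]
    print(issue_cycles(DEPENDENT, width=3))    # [0, 1, 2]
```

The pairwise comparison inside the loop hints at why the hardware cost grows quickly with issue width: each candidate instruction must be checked against every instruction still in flight.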

17
Dependence architecture and dataflow processors

The compiler (or programmer) identifies the
parallelism in the program and communicates it to
the hardware (by specifying the dependences between
operations)
The hardware determines at run time when each
operation is independent of the others and performs
the scheduling
Here, there is no scanning of the sequential program to
determine dependences
Objective: execute each instruction at the
earliest possible time (subject to available input operands
and functional units).

18
Dependence architectures: Dataflow processors

Dataflow processors are representative of
Dependence architectures
Execute instruction at earliest possible time
subject to availability of input operands and
functional units
Dependences are communicated by providing, with each
instruction, a list of all its successor instructions
As soon as all input operands of an instruction
are available, the hardware fetches the
instruction
The instruction is executed as soon as a
functional unit is available
Few Dataflow processors currently exist
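
The firing rule described on this slide can be sketched as a small simulation. It assumes unit latency for every instruction and a fixed number of identical functional units; the instruction names and the example graph are hypothetical.

```python
def dataflow_schedule(successors, operand_count, num_units):
    """successors: name -> list of successor instruction names.
    operand_count: name -> number of operands produced by other instructions.
    Returns name -> cycle in which the instruction executes."""
    missing = dict(operand_count)
    ready = sorted(name for name, n in missing.items() if n == 0)
    executed_in = {}
    cycle = 0
    while ready:
        # Execute as many ready instructions as there are functional units.
        firing, ready = ready[:num_units], ready[num_units:]
        for name in firing:
            executed_in[name] = cycle
        # Deliver results: a successor becomes ready once all operands arrived.
        for name in firing:
            for succ in successors.get(name, []):
                missing[succ] -= 1
                if missing[succ] == 0:
                    ready.append(succ)
        cycle += 1
    return executed_in


# Hypothetical graph: c needs a and b, d needs c, e is independent.
SUCCESSORS = {"a": ["c"], "b": ["c"], "c": ["d"], "d": [], "e": []}
OPERANDS = {"a": 0, "b": 0, "c": 2, "d": 1, "e": 0}

if __name__ == "__main__":
    print(dataflow_schedule(SUCCESSORS, OPERANDS, num_units=2))
```

Note that no sequential scan of the program is involved: readiness is driven entirely by the successor lists attached to each instruction.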

19
Dataflow strengths and limitations

Dataflow processors use control parallelism alone
to fully utilize the functional units (FUs).
A dataflow processor is more successful than others
at looking far down the execution path to find
control parallelism
When successful, it is better than speculative
execution:
Every instruction that is executed is useful
The processor does not have to deal with error
conditions caused by speculative operations

20
Independence architecture and VLIW processors

By knowing which operations are independent, the
hardware needs no further checking to determine
which instructions can be issued in the same
cycle
The set of independent operations is much larger than (>>) the set of
dependent operations
Only a subset of the independent operations is
specified
The compiler may additionally specify on which
functional unit and in which cycle an operation
is executed
The hardware then needs to make no run-time decisions

21
VLIW processors

Operation vs. instruction
Operation: a unit of computation (add, load,
branch; an instruction in a sequential architecture)
Instruction: a set of operations that are intended
to be issued simultaneously
The compiler decides which operations go into each
instruction (scheduling)
All operations that are supposed to begin at the
same time are packaged into a single VLIW
instruction
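
The packaging step can be illustrated with a simple greedy sketch. This is not a production scheduling algorithm: it assumes unit latency, identical functional units, and dependences supplied by hand; `pack_vliw` and the example operations are hypothetical.

```python
def pack_vliw(ops, deps, width):
    """ops: operation names in program order.
    deps: name -> set of earlier operations it depends on.
    width: maximum number of operations per VLIW instruction.
    Returns a list of VLIW instructions, each a list of operation names."""
    slot_of = {}
    bundles = []
    for op in ops:
        # An operation must come after every operation it depends on.
        earliest = 0
        for producer in deps.get(op, ()):
            earliest = max(earliest, slot_of[producer] + 1)
        # Take the first instruction at or after `earliest` with a free slot.
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(op)
        slot_of[op] = slot
    return bundles


if __name__ == "__main__":
    ops = ["a", "b", "c", "d", "e"]
    deps = {"c": {"a", "b"}, "d": {"c"}}
    for n, bundle in enumerate(pack_vliw(ops, deps, width=2)):
        print(f"instruction {n}: {bundle}")
```

All decisions here are made before execution, which is the point of the independence architecture: the hardware only has to issue each packaged instruction as a unit.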

22
VLIW strengths

The hardware is very simple,
consisting of a collection of functional units
(adders, multipliers, branch units, etc.)
connected by a bus, plus some registers and
caches
More silicon goes to the actual processing
(rather than being spent on branch prediction,
for example)
It should run fast, as the only limit is the
latency of the functional units themselves
Programming a VLIW chip is very much like writing
microcode

23
VLIW limitations

The need for a powerful compiler
Increased code size arising from aggressive
scheduling policies
Larger memory bandwidth and register-file
bandwidth requirements
Limitations due to lock-step operation
Binary compatibility problems across implementations with
varying numbers of functional units and latencies

24
Summary: ILP Architectures

| | Sequential Architecture | Dependence Architecture | Independence Architecture |
|---|---|---|---|
| Additional info required in the program | None | Specification of dependences between operations | Minimally, a partial list of independences. A complete specification of when and where each operation is to be executed |
| Typical kind of ILP processor | Superscalar | Dataflow | VLIW |
| Dependences analysis | Performed by HW | Performed by compiler | Performed by compiler |
| Independences analysis | Performed by HW | Performed by HW | Performed by compiler |
| Scheduling | Performed by HW | Performed by HW | Performed by compiler |
| Role of compiler | Rearranges the code to make the analysis and scheduling HW more successful | Replaces some analysis HW | Replaces virtually all the analysis and scheduling HW |

25
ILP Scheduling

Static scheduling boosted by parallel code optimization
Done by the compiler
The processor receives dependency-free and
optimized code for parallel execution
Typical for VLIWs and a few pipelined processors
(e.g. MIPS)

Dynamic scheduling without static parallel code optimization
Done by the processor
The code is not optimized for parallel execution.
The processor detects and resolves dependencies
on its own
Early ILP processors (e.g. CDC 6600, IBM 360/91)

Dynamic scheduling boosted by static parallel code optimization
Done by the processor in conjunction with a parallel
optimizing compiler
The processor receives optimized code for
parallel execution, but it detects and resolves
dependencies on its own
Usual practice for pipelined and superscalar
processors (e.g. RS6000)

26
ILP Scheduling: Trace scheduling

An optimization technique that has been widely
used for VLIW, superscalar, and pipelined
processors.
It selects a sequence of basic blocks as a trace
and schedules the operations from the trace
together.
Example
Instr1
Instr2
Branch x
Instr3

27
Trace Scheduling

Extract more ILP
Increase machine fetch bandwidth by storing
logically consecutive blocks in physically
contiguous cache locations (making it possible to fetch
multiple basic blocks in one cycle)
Trace scheduling can be implemented by hardware
or software

28
Trace Scheduling in HW

The hardware technique makes use of the large amount of
information available during dynamic execution to form traces
dynamically and schedule the instructions in a
trace more efficiently.
Since dependences and memory access addresses
have been resolved during dynamic execution,
instructions in a trace can be reordered more
easily and efficiently.
Example: the trace cache approach

29
Trace scheduling in SW

A supplement for machines without hardware trace
scheduling support.
Forms traces based on static profile data, and
schedules instructions using traditional compiler
scheduling and optimization techniques.
It faces some difficulties, such as code explosion
and exception handling.
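
The trace-formation step from profile data can be sketched as follows. This covers only trace selection (following the most likely successor of each basic block); the scheduling of the trace and any compensation code are not modeled. The block names and branch probabilities are hypothetical.

```python
def select_trace(cfg, start):
    """cfg: block -> list of (successor, probability) taken from profile data.
    Follow the most likely successor until an exit block or a block
    that is already in the trace is reached."""
    trace = [start]
    seen = {start}
    block = start
    while cfg.get(block):
        succ, _ = max(cfg[block], key=lambda edge: edge[1])
        if succ in seen:
            break  # do not follow a loop back-edge into the trace
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


# Hypothetical control-flow graph with profiled branch probabilities.
CFG = {
    "B1": [("B2", 0.9), ("B3", 0.1)],
    "B2": [("B4", 1.0)],
    "B3": [("B4", 1.0)],
    "B4": [],
}

if __name__ == "__main__":
    print(select_trace(CFG, "B1"))  # ['B1', 'B2', 'B4']
```

Operations from the selected blocks would then be scheduled together as if they formed one large basic block, which is where the code growth mentioned above comes from.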

30
ILP open problems

Pipelined scheduling: optimized scheduling of
pipelined behavioral descriptions. There are two simple
types of pipelining (structural and functional).
Controller cost: most scheduling algorithms do
not consider the controller cost, which depends
directly on the controller style used
during scheduling.
Area constraints: resource-constrained
algorithms could have better interaction between
scheduling and floorplanning.
Realism:
Scheduling realistic design descriptions that
contain several special language constructs.
Using more realistic libraries and cost
functions.
Scheduling algorithms must also be expanded to
incorporate different target architectures.

31
References

Instruction-Level Parallel Processing: History, Overview and Perspective. B. Ramakrishna Rau, Joseph A. Fisher. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
Limits of Control Flow on Parallelism. Monica S. Lam, Robert P. Wilson. 19th ISCA, May 1992, pages 19-21.
Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Joseph A. Fisher. Technical Report, HP Labs HPL-93-43, Jun. 1993.
VLIW at IBM Research. http://www.research.ibm.com/vliw
Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. Dick Pountain. http://www.byte.com/art/9604/sec8/art3.htm
Hardware and Software Trace Scheduling. http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
ILP open problems. http://www.ececs.uc.edu/ddel/projects/dss/hls_paper/node9.html
Computer Architecture: A Quantitative Approach. Hennessy & Patterson, 3rd edition, Morgan Kaufmann.