Title: Ch 6: Pipelining (modified from Dave Patterson's notes)
1 Ch 6: Pipelining (modified from Dave Patterson's notes)
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers
2 Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to 2 AM; tasks A-D run one after another, sixteen 30-minute steps]
- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
3 Pipelined Laundry: Start work ASAP
[Figure: pipelined laundry timeline from 6 PM to 9:30 PM; tasks A-D overlap in different stages]
- Pipelined laundry takes 3.5 hours for 4 loads!
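The laundry timings can be checked with a short calculation. The sketch below assumes n loads and k equally long stages: sequential time is n x k x 30 minutes, while pipelined time is (k + n - 1) x 30 minutes, since after the pipeline fills, one load finishes every stage time.

```python
# Sequential vs. pipelined laundry: 4 loads, 4 stages of 30 minutes each.
STAGE_MIN = 30  # wash, dry, fold, stash each take 30 minutes
STAGES = 4
LOADS = 4

sequential = LOADS * STAGES * STAGE_MIN        # 480 minutes = 8 hours
pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # 210 minutes = 3.5 hours

print(sequential / 60, "hours sequential")  # 8.0
print(pipelined / 60, "hours pipelined")    # 3.5
```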
4 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependences
5 Pipelining
- Improve performance by increasing instruction throughput.
- To increase throughput, minimize the duration of each individual stage.
- One natural way to minimize stage duration is to split an instruction into more stages.
- One disadvantage of more stages is branching, because it alters the otherwise sequential instruction flow.
- What do we need to add to actually split the datapath into stages?
- Answer: add storage devices (pipeline registers) between stages.
6 The Five Stages of Load
[Figure: Load occupying cycles 1-5, one stage per cycle]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
7 Conventional Pipelined Execution Representation
[Figure: instructions staggered one stage apart; time runs left to right, program flow top to bottom]
8 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock diagrams comparing the single-cycle implementation (one long cycle sized for the slowest instruction, so shorter ones like Store waste time), the multiple-cycle implementation (Load, Store, R-type spread across cycles 1-10), and the pipelined implementation (Load, Store, R-type overlapped)]
9 Why Pipeline?
- Suppose we execute 100 instructions
- Single Cycle Machine
- 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle Machine
- 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
- 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
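The three totals follow from cycle time x total cycles; a quick check:

```python
# Execution time for 100 instructions on each implementation.
n = 100

single_cycle = 45 * 1 * n          # 45 ns/cycle, CPI = 1   -> 4500 ns
multicycle = 10 * 4.6 * n          # 10 ns/cycle, CPI = 4.6 -> 4600 ns
ideal_pipeline = 10 * (1 * n + 4)  # 10 ns/cycle, 1 CPI plus 4 drain cycles -> 1040 ns

print(single_cycle, multicycle, ideal_pipeline)
```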
10 Why Pipeline? Because the resources are there!
[Figure: Inst 0-4 overlapped across clock cycles; every stage's resources are busy every cycle]
11 Can pipelining get us into trouble?
- Yes: pipeline hazards
- Structural hazards: attempt to use the same resource two different ways at the same time
- E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
- Data hazards: attempt to use an item before it is ready
- E.g., one sock of a pair is in the dryer and one in the washer; we can't fold until the sock gets from the washer through the dryer
- An instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: attempt to make a decision before the condition is evaluated
- E.g., washing football uniforms and we need the proper detergent level; we need to see the result after the dryer before the next load goes in
- Branch instructions
- We can always resolve hazards by waiting
- Pipeline control must detect the hazard
- and take action (or delay action) to resolve the hazard
12 Single Memory is a Structural Hazard
[Figure: Load and Instrs 1-4 in the pipeline; the Load's Mem access and a later instruction's fetch hit the single memory in the same cycle]
Detection is easy in this case! (right half
highlight means read, left half write)
13 Structural Hazards limit performance
- Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
- average CPI is at least 1.3
- otherwise the memory resource is more than 100% utilized
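The bound follows from utilization: with 1.3 memory accesses per instruction and one memory port serving one access per cycle, no schedule can average fewer than 1.3 cycles per instruction.

```python
# A single shared memory port bounds CPI from below.
mem_accesses_per_inst = 1.3
accesses_per_cycle = 1  # one memory port

min_cpi = mem_accesses_per_inst / accesses_per_cycle
print(min_cpi)  # a lower CPI would need >100% utilization of the port
```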
14 Control Hazard Solutions
- Stall: wait until the decision is clear
- It's possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read
- Impact: 2 clock cycles per branch instruction, i.e. slow
[Figure: Add, Beq, Load in the pipeline; Load's fetch is stalled until the Beq resolves]
15 Control Hazard Solutions
- Predict: guess one direction, then back up if wrong
- Predict not taken
- Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of the time)
- More dynamic scheme: history of 1 branch (~90% right)
[Figure: Add, Beq, Load in the pipeline; Load is fetched speculatively right after the Beq]
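The per-branch figures above can be folded into an expected cost. A small helper, assuming 1 cycle on a correct prediction and 2 on a misprediction as on the slide:

```python
def expected_branch_cycles(accuracy, right=1, wrong=2):
    """Average clock cycles per branch for a given prediction accuracy."""
    return accuracy * right + (1 - accuracy) * wrong

# 50%-accurate "predict not taken" vs. a ~90%-accurate 1-branch history scheme
print(expected_branch_cycles(0.5))  # 1.5 cycles per branch
print(expected_branch_cycles(0.9))  # about 1.1 cycles per branch
```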
16 Control Hazard Solutions
- Redefine branch behavior (the branch takes place after the next instruction): delayed branch
- Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the slot (~50% of the time)
- As we launch more instructions per clock cycle, this becomes less useful
[Figure: Add, Beq, Misc (delay-slot instruction), Load in the pipeline]
17 Data Hazard on r1
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
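These hazards can be found mechanically: any instruction that reads a register written by one of the two instructions immediately before it sees the stale value (with write-then-read register timing, a gap of three is safe, which is why the or and xor are fine). A sketch, with a made-up (dest, src1, src2) encoding:

```python
# Find RAW hazards: a read of a register whose writer is still in the pipeline.
# Tuples are (dest, src1, src2); this encoding is illustrative, not a real ISA.
program = [
    ("r1", "r2", "r3"),    # add r1, r2, r3
    ("r4", "r1", "r3"),    # sub r4, r1, r3
    ("r6", "r1", "r7"),    # and r6, r1, r7
    ("r8", "r1", "r9"),    # or  r8, r1, r9   (safe: reads r1 as add writes it back)
    ("r10", "r1", "r11"),  # xor r10, r1, r11 (safe)
]

hazards = []
for i, (dest, _, _) in enumerate(program):
    # only the next two instructions read the register file before write-back
    for j in range(i + 1, min(i + 3, len(program))):
        _, src1, src2 = program[j]
        if dest in (src1, src2):
            hazards.append((i, j, dest))

print(hazards)  # [(0, 1, 'r1'), (0, 2, 'r1')]: sub and and need r1 too early
```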
18 Data Hazard on r1
- Dependencies backwards in time are hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11; the dependent reads of r1 occur before add's write-back]
19 Data Hazard Solution
- Forward the result from one stage to another
- The "or r8, r1, r9" is OK anyway if register read/write within a cycle are defined properly (write first, then read)
[Figure: same pipeline diagram with add's result forwarded from its EX/MEM stages to the dependent instructions]
20 Forwarding (or Bypassing): What about Loads?
- Dependencies backwards in time are hazards
- Can't solve this one with forwarding alone
- Must delay/stall the instruction dependent on the load
[Figure: lw r1,0(r2) followed by sub r4,r1,r3; the loaded value is available only after MEM, one cycle too late for sub's EX]
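A load-use hazard can likewise be detected and patched with a bubble: if an instruction's source was written by the immediately preceding load, one stall cycle is inserted so forwarding from MEM can cover the rest. A sketch with an illustrative (op, dest, srcs) encoding:

```python
# Insert a one-cycle bubble after a load whose result is used immediately.
program = [
    ("lw", "r1", ["r2"]),         # lw  r1, 0(r2)
    ("sub", "r4", ["r1", "r3"]),  # sub r4, r1, r3  <- load-use hazard
]

scheduled = []
for k, (op, dest, srcs) in enumerate(program):
    prev = program[k - 1] if k > 0 else None
    if prev is not None and prev[0] == "lw" and prev[1] in srcs:
        scheduled.append(("bubble", None, []))  # stall one cycle, then forward
    scheduled.append((op, dest, srcs))

print([s[0] for s in scheduled])  # ['lw', 'bubble', 'sub']
```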
21 Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- Associate resources with states
- Ensure that flows do not conflict, or figure out how to resolve conflicts
- Assert control in the appropriate stage
22 Pipelined Processor (almost) for slides
- What happens if we start a new instruction every
cycle?
[Figure: pipelined datapath: Next PC/PC, Inst. Mem, Reg File, Exec, Mem Access/Data Mem, with pipeline instruction registers (IR, IRex, IRmem, IRwb), an Equal comparator, and per-stage control (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl)]
23 Control and Datapath
[Figure: the same pipelined datapath (Next PC/PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem) annotated with control signals]
24 Pipelining the Load Instruction
[Figure: three lw instructions entering the pipeline on successive cycles, occupying cycles 1-7]
- The five independent functional units in the pipeline datapath are:
- Instruction Memory for the Ifetch stage
- the Register File's read ports (bus A and bus B) for the Reg/Dec stage
- the ALU for the Exec stage
- Data Memory for the Mem stage
- the Register File's write port (bus W) for the Wr stage
25 The Four Stages of R-type
[Figure: R-type occupying cycles 1-4]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
- ALU operates on the two register operands
- Update PC
- Wr: Write the ALU output back to the register file
26 Pipelining the R-type and Load Instruction
[Figure: R-type, R-type, Load, R-type, R-type entering the pipeline on successive cycles, cycles 1-9]
Oops! We have a problem!
- We have a pipeline conflict, a structural hazard:
- Two instructions try to write to the register file at the same time!
- There is only one write port
27 Important Observation
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions
- Load uses the Register File's write port during its 5th stage
- R-type uses the Register File's write port during its 4th stage
- There are 2 ways to solve this pipeline hazard.
28 Solution 1: Insert a Bubble into the Pipeline
[Figure: Load, R-type, R-type, bubble, then R-type, R-type in the pipeline, cycles 1-9]
- Insert a bubble into the pipeline to prevent 2 writes in the same cycle
- The control logic can be complex.
- We lose an instruction fetch and issue opportunity.
- No instruction is started in Cycle 6!
29 Solution 2: Delay R-type's Write by One Cycle
- Delay the R-type's register write by one cycle:
- Now R-type instructions also use the Register File's write port at Stage 5
- The Mem stage becomes a NOOP stage: nothing is being done.
[Figure: R-type now has 5 stages (1-5) with Mem as a NOOP; R-type, R-type, Load, R-type, R-type all write back in stage 5 with no conflict, cycles 1-9]
30 The Four Stages of Store
[Figure: Store occupying cycles 1-4; the Wr stage is unused]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory
31 The Three Stages of Beq
[Figure: Beq occupying cycles 1-3; the Mem and Wr stages are unused]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec:
- Registers Fetch and Instruction Decode
- Exec:
- compare the two register operands,
- select the correct branch target address,
- and latch it into the PC
32 Control Diagram
[Figure: pipelined control diagram: Next PC/PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem, with control fields]
33 Let's Try It Out
10  lw   r1, r2(35)
14  addI r2, r2, 3
20  sub  r3, r4, r5
24  beq  r6, r7, 100
30  ori  r8, r9, 17
34  add  r10, r11, r12
100 and  r13, r14, 15
(these addresses are octal)
34 Start: Fetch 10
[Figure: pipeline state: PC = 10, fetching the lw; all later stages empty]
35 Fetch 14, Decode 10
[Figure: pipeline state: PC = 14; lw r1, r2(35) in Decode]
36 Fetch 20, Decode 14, Exec 10
[Figure: pipeline state: PC = 20; addI r2, r2, 3 in Decode; lw r1 in Exec computing r2 + 35]
37 Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: pipeline state: PC = 24; sub r3, r4, r5 in Decode; addI in Exec computing r2 + 3; lw r1 in Mem accessing address r2 + 35]
38 Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: pipeline state: PC = 30; beq r6, r7, 100 in Decode; sub in Exec; addI in Mem with r2 + 3; lw writing back r1 = M[r2 + 35]]
39 Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: pipeline state: PC = 34; ori r8, r9, 17 in Decode; beq in Exec comparing r6 and r7; sub in Mem with r4 - r5; addI writing back r2 = r2 + 3]
40 Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: pipeline state: the branch resolves and PC = 100; add r10, r11, r12 in Decode; ori in Exec; beq in Mem; sub writing back r3 = r4 - r5]
Oops, we should have only one delayed instruction!
41 Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: pipeline state: PC = 104; and r13, r14, r15 in Decode; add r10 in Exec; ori in Mem with r9 | 17; beq completing with no register write]
Squash the extra instruction in the branch shadow!
42 Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: pipeline state: next PC = 110; and r13 in Exec; add r10 in Mem with r11 + r12; ori writing back r8 = r9 | 17]
Squash the extra instruction in the branch shadow!
43 Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: pipeline state: PC = 114; squashed slot in Exec (NO WB, NO Ovflow); and r13 in Mem with r14 and r15; add r10 writing back r11 + r12]
Squash the extra instruction in the branch shadow!
44 Summary: Pipelining
- What makes it easy?
- All instructions are the same length
- Just a few instruction formats
- Memory operands appear only in loads and stores
- What makes it hard?
- Structural hazards: suppose we had only one memory
- Control hazards: need to worry about branch instructions
- Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.
45 Summary
- Pipelining is a fundamental concept
- Multiple steps using distinct resources
- Utilize the capabilities of the datapath by pipelined instruction processing
- Start the next instruction while working on the current one
- Limited by the length of the longest stage (plus fill/flush)
- Detect and resolve hazards
46 Ch 6 Supplementary
- Branch prediction
- Branch prediction is critical for superpipelined and superscalar computers
- The more instructions issued at the same time, the larger the penalty of hazards
- Statistically, 60% of conditional branches will branch
- Higher-level (and more powerful) instruction sets need fewer conditional branches, such as those supporting variable-length operands
- Conditional branching can be classified into two types:
- Program loops
- Random decision making
47 Static branch prediction
- Static and Dynamic Branch Prediction
- Static: the compiler determines how conditional branches are predicted
- Dynamic: predictions are generated at run-time (during execution)
- Static prediction is good for looping:
- Loop exit test at loop start: predict continue
- Loop exit test at loop end: predict branching
- Random decision: guess branch to be taken
48 Dynamic branch prediction
- Dynamic: data sensitive, at run-time (at execution)
- One-bit dynamic branch prediction
- Predict as the previous record shows
- Two-bit dynamic branch prediction
- If the previous two outcomes are the same, predict the same
- If the previous two outcomes alternate, predict alternation:
- Branch, branch: predict branch
- Not branch, not branch: predict not branch
- Branch, not branch: predict branch
- Not branch, branch: predict not branch
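The four cases above collapse to a simple rule: the prediction always equals the older of the two recorded outcomes. A sketch of the scheme exactly as stated (this is the slide's alternation-tracking variant, not the more common two-bit saturating counter):

```python
def predict(prev, last):
    """Two-bit prediction from the slide: prev and last are the two most
    recent outcomes (True = branch taken). If they agree, predict the same;
    if they alternate, predict that the alternation continues."""
    if prev == last:
        return last   # e.g. branch, branch -> predict branch
    return prev       # e.g. branch, not-branch -> predict branch (alternation)

# The four cases listed on the slide:
print(predict(True, True))    # branch, branch         -> True  (branch)
print(predict(False, False))  # not branch, not branch -> False (not branch)
print(predict(True, False))   # branch, not branch     -> True  (branch)
print(predict(False, True))   # not branch, branch     -> False (not branch)
```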
49 Branch prediction (cont.)
- Branch prediction cache
- A cache with entries for the instruction addresses that host branch instructions, plus bits recording whether each branch previously branched or not
[Figure: a cache entry holds the branch instruction address, the last outcome, and the previous-to-last outcome (B = Branch, NB = Not Branch); the prediction is used when the entry is Valid AND Matched]
50 Other schemes for minimizing the incorrect branch prediction penalty
- Speculative execution
- Execute first, with a way to roll back
- by doing the store-back on a shadow copy
- by keeping a backup copy for undo
- Conditional execution
- Minimizes conditional branching for some common actions such as clear, set-to-1, move, or add
- Delayed (conditional) branching
- Always execute the next instruction and then do the conditional branching
- Saves a cycle or more
51 Branching
- Call and Return are branching instructions that branch through the stack
- Software interrupts are implicit branches and are transparent to the program even though their actions are carried out. They are treated as parts of the program.
52 Superscalar
- Superscalar
- Fetch and execute two or more instructions concurrently
- To achieve CPI < 1
- Dynamic issue: the schedule for executing two or more instructions concurrently is decided at run-time
- Requires multiple copies of functional units such as instruction fetch, arithmetic execution, etc., and multi-port register files and caches (or cache buffers)
- There are more potential data dependency hazards, resource hazards, and control hazards
- The penalty for incorrect branch prediction is BIG
53 VLIW
- Very Long Instruction Word
- A VLIW instruction is machine (implementation) dependent
- A VLIW instruction consists of various fields
- Each field specifies the operation of a functional unit, such as Ifetch (instruction fetch), Idecode, Ofetch (operand fetch), EX (integer execute), FPA (floating-point add), FPMUL, and FPDIV
- Static issue: instructions are generated by compilers
- Multiple instructions are issued and executed at the same time
54 VLIW advantages
- Static code generation (by the compiler)
- Compilers can take a lot of time to pack the VLIW instructions; otherwise the packing is done dynamically by a hardware instruction scheduler (circuitry that analyzes and schedules the functional units)
- Easier to power down individual functional units when they are not used, and easier for compilers to deliberately arrange functional unit execution to minimize power consumption
- Can execute the instruction sets of different computer architectures on one machine through the respective compilers.
- However, the functional units must be constructed to support these instruction sets and architectures.
55 VLIW disadvantages
- Compilers are hard to build
- Machine dependent: must have different compilers for different machines of the same architecture
- Binary incompatible: must have different binary codes for different machines of the same architecture
- The compiler cannot see the input data when compiling, so it must prepare for all possible cases of input data
- Difficult to recover from compiler mistakes, and the time penalty can be BIG
- Difficult to debug
- Non-VLIW machines can also power down individual functional units when not used