Software Exploits for ILP

Slides: 35
Provided by: rfox

1
Software Exploits for ILP
  • We have mostly concentrated on dynamic approaches
    to exploit ILP
  • branch prediction through buffers
  • superscalar dynamic scheduling
  • register renaming
  • Some architectures attempt to simplify the
    hardware by using compiler techniques to further
    exploit ILP instead
  • IA-64 and Itanium 2
  • Techniques will include
  • static scheduling for superscalars and VLIW
    (and a variation called EPIC)
  • we already looked at these in chapter 2
  • analyzing high-level language code for loop
    dependencies
  • eliminating computation dependencies
  • symbolic loop unrolling
  • global code scheduling for branch prediction
  • predicated instructions

2
Loop Dependencies
  • In order to achieve greater ILP, we need to
    promote LLP (loop level parallelism)
  • we already used loop unrolling to do so, but we
    must ensure that there are no loop dependencies
  • a loop carried dependence arises if one iteration
    depends on a value produced by another iteration
  • Consider
  • for (i = 1; i <= 100; i++) x[i] = x[i] + s;
  • data dependence between x[i] on lhs and rhs, but
    it only exists within an iteration, not across
    iterations (as in x[i+1] = x[i] + s)
  • for (i = 1; i <= 100; i++) x[i] = x[i-1] + s;
  • here, the data dependence on x is loop carried
  • a compiler may not be able to unroll a loop with
    a loop carried dependence
  • question: how far is the distance of the
    dependence?
  • if far enough, unrolling may still succeed
    (consider x[i] = x[i-5] + s)
  • three forms of dependencies: true (data), anti
    and output

3
Example
  • Identify any dependencies and note which are loop
    carried
  • can we parallelize the loop?
  • note: a loop is parallel if it has no loop
    carried dependencies or it can be written without
    a cycle of dependencies
  • for (i = 1; i <= 100; i++) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
    }
  • data dependence on A in S1 (not loop carried)
  • data dependence on B from S2 to S1 (loop carried)
  • the loop carried dependence implies no LLP as
    written, but the dependence is not a circular one
  • this loop can be parallelized
  • Convert this loop into
  • A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i++) {
      B[i+1] = C[i] + D[i];
      A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
  • now the loop is parallel
  • the dependencies are not loop carried
  • B[i+1] from S1 to S2
  • A[i+1] to A[i+1] in S2
  • notice the reversed order between the two
    statements
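The transformation on this slide can be checked with a small C sketch. The function names and the parameter n (the slide uses n = 100) are illustrative assumptions, not part of the original deck. Both versions use 1-based indexing like the slide, so A, C and D need valid indices 1..n and B needs 1..n+1.

```c
#include <assert.h>

/* Original loop: S1 in iteration i reads B[i], which S2 wrote in
   iteration i-1, so the dependence on B is loop carried. */
void loop_original(double *A, double *B, const double *C, const double *D, int n) {
    assert(n >= 1);
    for (int i = 1; i <= n; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed loop: the first S1 is peeled off in front, the last S2 is
   peeled off behind, and the two statements swap order inside the loop.
   Every dependence now stays within a single iteration. */
void loop_transformed(double *A, double *B, const double *C, const double *D, int n) {
    assert(n >= 1);
    A[1] = A[1] + B[1];
    for (int i = 1; i <= n - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[n + 1] = C[n] + D[n];
}
```

Running both functions on identical inputs should leave A and B with identical contents, which is what makes the rewrite legal.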
4
Identifying Dependencies
  • Detecting data dependencies is done through
    matching symbolic names, so there are situations
    where the compiler may not be able to help us
  • variables may be aliased or pointed to
  • variables may use indirect referencing through
    arrays
  • consider R1 = 1000 and R2 = 1004: a dependence
    exists between 4(R1) and 0(R2), yet it may not be
    identified by the compiler
  • Other forms of dependencies might be easier to
    identify (particularly in the high-level code)
    where a loop-based dependence is usually in the
    form of a recurrence
  • for (i = 2; i <= 100; i++)
  • y[i] = y[i-2] + y[i];
  • here, y[i] is dependent on a value that was
    computed 2 iterations ago
  • this recurrence distance is called the dependence
    distance
  • for (i = 6; i <= 100; i++)
  • y[i] = y[i-5] + y[i];
  • here, the dependence distance is 5

5
Algorithm for Identifying Recurrence
  • Array accesses are affine if indices follow a
    pattern like a * i + b
  • a and b are constants, i is the loop index
  • Almost all loop carried dependence algorithms
    rely on the array accesses being affine
  • The GCD test states that given two accesses of
    the same array
  • where the accesses are to elements a * i + b and
    c * i + d
  • if d - b is not divisible by GCD(a, c) then there
    are no loop carried dependencies
  • note that if d - b is divisible by GCD(a, c),
    then there is no conclusion (there might or might
    not be loop carried dependencies)
  • additional tests might be applied in this case
    although some of these tests are NP-Complete, so
    we might use approximation methods instead
  • Does the following loop have loop carried
    dependences?
  • for (i = 1; i <= 100; i = i + 1) x[2*i+3] = x[2*i] * 5;
  • a = 2, b = 3, c = 2, d = 0
  • GCD(a, c) = 2, d - b = -3
  • 2 does not divide -3 (-3/2 has a remainder), so
    the loop has no loop carried dependencies
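The GCD test above is small enough to sketch directly in C. The function name gcd_test_may_depend and its return convention are assumptions made for illustration: 0 means the test proves there is no dependence, 1 means the test is inconclusive.

```c
#include <assert.h>
#include <stdlib.h>

/* Euclid's algorithm on absolute values. */
static long gcd(long p, long q) {
    p = labs(p);
    q = labs(q);
    while (q != 0) {
        long r = p % q;
        p = q;
        q = r;
    }
    return p;
}

/* GCD test for a store to x[a*i + b] and a load of x[c*i + d].
   Returns 0 when GCD(a, c) does not divide d - b (no dependence is
   possible) and 1 when it does (a dependence may or may not exist). */
int gcd_test_may_depend(long a, long b, long c, long d) {
    assert(a != 0 || c != 0);   /* GCD(0, 0) is undefined for this test */
    long g = gcd(a, c);
    return (d - b) % g == 0;
}
```

For the loop on this slide, gcd_test_may_depend(2, 3, 2, 0) returns 0, matching the conclusion that there are no loop carried dependencies.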

6
Removing Dependencies
  • Consider the following example
  • what are the types of dependencies?
  • how can we rewrite the code to remove
    dependencies to make the loop parallelizable?
  • true: from S1 to S3 and from S1 to S4 (y[i])
  • anti: from S1 to S2 (x[i]) and from S3 to S4 (y[i])
  • output: from S1 to S4 (y[i])
  • y[i] is used in S3 and S4 as a source and in S1
    and S4 as a destination, so we rename the value
    written by S1 to t[i]
  • x[i] is used in S2 as a destination and in S1 and
    S2 as a source, so we rename the destination to
    x1[i]
  • the original loop is shown first, the renamed
    solution second

for (i = 1; i <= 100; i = i + 1) {
  y[i] = x[i] / c;     /* S1 */
  x[i] = x[i] + c;     /* S2 */
  z[i] = y[i] + c;     /* S3 */
  y[i] = c - y[i];     /* S4 */
}

for (i = 1; i <= 100; i = i + 1) {
  t[i] = x[i] / c;
  x1[i] = x[i] + c;
  z[i] = t[i] + c;
  y[i] = c - t[i];
}
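A quick way to convince yourself that the renaming preserves results is to run both loops side by side. This is a minimal C sketch under stated assumptions: the function names are invented, the arrays are 0-based rather than 1-based as on the slide, and n replaces the fixed bound of 100.

```c
#include <assert.h>

/* Loop as given: contains anti and output dependencies on x[i] and y[i]. */
void loop_with_name_deps(double *X, double *Y, double *Z, double c, int n) {
    assert(c != 0.0);
    for (int i = 0; i < n; i++) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }
}

/* Renamed loop: T holds the value S1 used to place in Y, and X1 holds the
   value S2 used to write back into X. Only true dependencies remain. */
void loop_renamed(const double *X, double *X1, double *Y, double *Z,
                  double *T, double c, int n) {
    assert(c != 0.0);
    for (int i = 0; i < n; i++) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
}
```

After the renamed loop, any later code that read x must read x1 instead, which is the price of removing the antidependence.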
7
Dealing with Pointers
  • Determining if two pointers are pointing at the
    same memory location (aliases) is nearly
    impossible statically
  • however, there are some things we can determine
  • are two pointers pointing into the same data
    structure (list or tree)?
  • are two pointers pointing at the same type of
    data?
  • is a pointer, which is being used to pass an
    address as a parameter, pointing at an object
    being referenced in the function?
  • of two pointers, is one pointing at a local
    object and the other at a global object?
  • so, while there is no solution to identifying
    aliases, we can use some analysis (called
    points-to analysis) to rule out aliases in
    specific situations
  • and therefore, we will be able to guarantee that
    two pointers are not aliases
  • that is, if points-to analysis claims two
    pointers are not aliases, we can trust that
    result, if points-to analysis fails, we cannot
    conclude anything and should be more cautious
    about loop unrolling
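To see why aliasing matters for unrolling, consider the following C sketch. The function add_scalar is a hypothetical example, not taken from the slides. With distinct arrays its iterations are independent, but if the caller passes overlapping pointers the same code becomes the recurrence x[i+1] = x[i] + s.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = src[i] + s. Nothing in the signature tells the compiler that
   dst and src do not overlap, so it must assume they might. */
void add_scalar(double *dst, const double *src, size_t n, double s) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + s;
}
```

Calling add_scalar(x + 1, x, 4, 1.0) on an array of five zeros yields 0, 1, 2, 3, 4, because each store feeds the next load. In C, the restrict qualifier is one way for the programmer to promise that such overlap never happens.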

8
Back Substitution
  • Within a basic block of code, we can eliminate
    operations that copy values using copy
    propagation
  • as a simple example, we can reduce the two
    statements to the right to a single statement
    with the single statement below them
  • we have already done this type of optimization in
    loop unrolling by combining array index
    adjustments
  • another example is to take advantage of algebraic
    associativity and rearrange the order of
    computations in an expression as shown to the
    right
  • in the first block of code, there are data
    dependencies between the first and second and the
    second and third instructions, which might result
    in an extra stall that does not appear in the
    second block of code

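Copy propagation, before:
        DADDUI R1, R2, 4
        DADDUI R1, R1, 4
Copy propagation, after:
        DADDUI R1, R2, 8

Associativity, first block:
        DADD   R1, R2, R3
        DADD   R4, R1, R6
        DADD   R8, R4, R7
Associativity, second block:
        DADD   R1, R2, R3
        DADD   R4, R6, R7
        DADD   R8, R1, R4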
9
Software Pipelining
  • Compiler-based loop unrolling adds instructions
    to the program and uses more registers
  • Another idea is to symbolically unroll the loop
    by arranging the loop components in their
    opposite order
  • the new loop interleaves execution of the
    original loop
  • consider a loop that has three parts A, B, C,
    then the new loop contains iteration i+2 of A,
    i+1 of B and i of C for iteration i
  • this requires manipulating the loop maintenance
    mechanisms and adding pre-loop and post-loop
    instructions

10
Example
Loop:   L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DSUBI  R1,R1,8
        BNE    R1,R2,Loop

Iteration i:    L.D   F0,0(R1)
                ADD.D F4,F0,F2
                S.D   F4,0(R1)     (selected)
Iteration i+1:  L.D   F0,0(R1)
                ADD.D F4,F0,F2     (selected)
                S.D   F4,0(R1)
Iteration i+2:  L.D   F0,0(R1)     (selected)
                ADD.D F4,F0,F2
                S.D   F4,0(R1)
The selected instructions (bold-faced on the slide) are the ones unrolled into the new loop
  • The compiler selects the appropriate instruction
    from three iterations of the loop and builds a
    new loop out of them, adding the proper pre- and
    post-code

        L.D    F0, 16(R1)
        L.D    F6, 8(R1)
        ADD.D  F4, F6, F2
Loop:   S.D    F4,16(R1)
        ADD.D  F4,F0,F2
        L.D    F0,0(R1)
        DSUBI  R1,R1,8
        BNE    R1,R2,Loop
        ADD.D  F8, F0, F2
        S.D    F4, 8(R1)
        S.D    F8, 0(R1)
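The same idea can be expressed at the source level. The C sketch below is an illustration under assumptions, not the slide's code: the function names are invented, the loop walks upward through the array instead of downward, and two local variables stand in for the registers. Each trip through the steady-state loop stores for iteration i, adds for iteration i+1 and loads for iteration i+2.

```c
#include <assert.h>

/* Reference version: load, add and store all belong to one iteration. */
void add_scalar_plain(double *x, int n, double s) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version with explicit startup and cleanup code. */
void add_scalar_pipelined(double *x, int n, double s) {
    assert(n >= 0);
    if (n < 2) {                      /* too short to fill the pipeline */
        add_scalar_plain(x, n, s);
        return;
    }
    /* startup: load and add for element 0, load for element 1 */
    double loaded = x[0];
    double summed = loaded + s;
    loaded = x[1];
    for (int i = 0; i + 2 < n; i++) {
        x[i] = summed;                /* store for iteration i   */
        summed = loaded + s;          /* add for iteration i+1   */
        loaded = x[i + 2];            /* load for iteration i+2  */
    }
    /* cleanup: finish the two iterations still in flight */
    x[n - 2] = summed;
    x[n - 1] = loaded + s;
}
```

Within the loop body, the store, add and load work on three different elements, so none of them waits on a result produced in the same trip.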
11
Another Example
  • This loop has 4 operations to unroll
  • We precede this loop with the three L.Ds, 2
    ADD.Ds, 1 MUL.D, and follow the loop with 1
    ADD.D, 2 MUL.Ds and 3 S.Ds

Original loop:
Loop:   L.D    F0, 0(R1)
        ADD.D  F2, F1, F0
        MUL.D  F4, F2, F3
        S.D    F4, 0(R1)
        DSUBI  R1, R1, 8
        BNEZ   R1, Loop

Software-pipelined loop:
Loop:   S.D    F4, 24(R1)     // S.D from iteration i - 3
        MUL.D  F4, F2, F3     // MUL.D from iteration i - 2
        ADD.D  F2, F1, F0     // ADD.D from iteration i - 1
        L.D    F0, 0(R1)      // L.D from iteration i
        DSUBI  R1, R1, 8
        BNEZ   R1, Loop
12
Global Code Scheduling
  • Here, the compiler attempts to select a path
    through selection statements based on branch
    predictions
  • some code might be moved prior to a branch to
    make it more efficient
  • or, we might combine this with loop unrolling so
    that the compiler is able to perform the
    operation without having to know whether a
    condition will be true or not
  • The compiler generates a straight line of code
    without a branch
  • We must preserve data and control dependencies
    which makes it tricky
  • since this relies on branch predictions, the
    straight line of code may lead to a violation of
    data dependencies if the prediction is
    inaccurate, so the approach must include
    mechanisms for failed predictions such as
    canceling instructions or not allowing the
    instructions to write to registers/memory
  • in addition, we have to ensure that if a moved
    instruction raises an exception, that we only
    handle the exception if the instruction would
    have been executed anyway (that is, if we
    predicted correctly)

13
Example
  • Consider the following code
  • a[i] = a[i] + b[i]; if (a[i] == 0) b[i] = ...;
    else c[i] = ...;
  • if we have knowledge that says the condition
    (a[i] == 0) is mostly true
  • then we can move the assignment to b[i] before
    the comparison
  • removing one of the branches
  • we would still have one branch to branch around
    the else clause

Another option is to move the assignment to c[i]
before the if-statement or into the branch delay slot
  • in moving the assignment to b[i] we must ensure
    that we have not violated other dependencies
  • for instance, if the condition compared a[i]
    with b[i], then we couldn't move the assignment
    statement

14
Trace Scheduling
Original code:
        SGT    R3, R1, R2
        BEQZ   R3, else
        DADDI  R1, R1, 1
        J      next
else:   DSUBI  R2, R2, 1
next:
  • Trace scheduling is a form of global scheduling
    where we rearrange code to assume one branch will
    be taken
  • Consider the code
  • if (x > y) x++; else y--;
  • the MIPS code is given above, assuming
    that x is stored in R1 and y in R2
  • Assume that x > y is true 90% of the time, we can
    then revise our code as shown below
  • original code: if true, 4 instructions, if
    false, 3 instructions
  • new code: if true, 3 instructions, if false, 5
    instructions
  • Assuming each instruction takes 1 cycle with no
    stalls, we have a speedup of
  • (0.90 * 4 + 0.10 * 3) / (0.90 * 3 + 0.10 * 5)
    = 3.9 / 3.2 = 1.22, or a 22% speedup

Revised code:
        SGT    R3, R1, R2
        DADDI  R1, R1, 1
        BNEZ   R3, next
        DSUBI  R1, R1, 1
        DSUBI  R2, R2, 1
next:
if we are wrong about the prediction, we have to
reset R1
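The speedup arithmetic on this slide is a weighted average of the two path lengths. A small C sketch makes it reusable for other probabilities; the function names are assumptions introduced here, and the model follows the slide in charging 1 cycle per instruction with no stalls.

```c
#include <assert.h>

/* Expected cycles when the predicted path is taken with probability
   p_predicted and the other path is taken otherwise. */
double expected_cycles(double p_predicted, double cycles_if_predicted,
                       double cycles_otherwise) {
    assert(p_predicted >= 0.0 && p_predicted <= 1.0);
    return p_predicted * cycles_if_predicted
         + (1.0 - p_predicted) * cycles_otherwise;
}

/* Speedup of the new code over the old code. */
double speedup(double old_cycles, double new_cycles) {
    assert(new_cycles > 0.0);
    return old_cycles / new_cycles;
}
```

For this slide, speedup(expected_cycles(0.9, 4, 3), expected_cycles(0.9, 3, 5)) evaluates 3.9 / 3.2, roughly 1.22. Lowering the probability shows how quickly the benefit shrinks when the prediction is less reliable.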
15
Conditional Instructions
  • A conditional instruction is an instruction that
    can combine a comparison and an ALU operation
  • although we don't want to combine conditions and
    branches, we can combine conditions and simple
    ALU operations
  • such as a register move or a data load
  • if the condition is false, we just cancel the
    operation before the datum is placed into the
    destination register
  • The advantage: in an if statement (without the
    else clause) there is an explicit branch, but
    with a conditional instruction the branch is
    eliminated
  • in MIPS, we have conditional move operations
  • MOVZ R1, R2, R3 (R1 ← R2 if R3 == 0) and MOVN
    (move if not zero: R1 ← R2 if R3 != 0)
  • although MIPS does not have a conditional load,
    we can envision one
  • LWC R2, 0(R3), R1 (load 0(R3) into R2 if R1
    != 0)
  • these operations will start performing the move
    or load operation but only store the result in
    the WB stage if the condition evaluates to true

16
Examples
  • Consider
  • if (a == 0) s = t;
  • R1 stores a
  • R2 stores s
  • R3 stores t
  • We can replace
        BNEZ   R1, L
        DADDI  R2, R3, 0
    L:
  • with CMOVZ R2, R3, R1
  • The CMOV instruction has no branch penalty
  • 1 cycle instead of potentially 3 (including
    branch delay)
  • Consider a superscalar where we can issue a load
    together with an ALU operation, but not a branch
    together with an ALU operation
  • This code will incur a stall if the branch is not
    taken between the two LW instructions
  • if we assume the branch is most often not taken,
    we can change LW R8, 20(R10) into LWC R8,
    20(R10), R10 and move it to before the BEQZ into
    the vacant instruction spot
  • LWC = conditional load
17
Handling Exceptions
  • The conditional instruction should not cause an
    exception during the move or load (if the
    condition is false, the operation would never
    have taken place)
  • Consider the instruction from the previous
    example
  • LWC R8, 20(R10), R10
  • if R10 is 0 then this causes an exception since
    location 20 (R10 = 0, so the address is 20 + 0)
    is most likely a part of the OS
  • Two approaches to handling this problem
  • hardware-software cooperation: when an exception
    arises, hardware alerts the OS of whether the
    exception was raised through an ordinary or a
    speculated instruction
  • poison bits: add a bit to each register and to
    each instruction and set the bit for any
    speculated instruction
  • set a register's poison bit if the instruction is
    speculated, or if the register was assigned a
    value computed with a register with a set poison
    bit
  • if an exception arises from a correctly
    speculated instruction, then all registers with
    set poison bits will have to be reinitialized

18
Hardware for Compiler Speculation
  • We can go further with compiler-based speculation
    by adding hardware to support what the compiler
    speculates
  • consider the following if-else statement
  • if (a == 0) a = b; else a = a + 4;
  • where a is stored at 0(R3) and b is at 0(R2),
    assume that the condition is true 90% of the
    time, then the original code below can become the
    speculated code by adding an extra register
  • here we use an additional register combined with
    trace scheduling to make the code execute more
    efficiently

Original code:
        LW     R1, 0(R3)
        BNEZ   R1, L1
        LW     R1, 0(R2)
        J      L2
L1:     ADDI   R1, R1, 4
L2:     SW     R1, 0(R3)

Speculated code:
        LW     R1, 0(R3)
        LW     R14, 0(R2)
        BEQZ   R1, L3
        ADDI   R14, R1, 4
L3:     SW     R14, 0(R3)

Discounting stalls, the original code takes
0.90 * 5 + 0.10 * 4 = 4.9 cycles and the new code takes
0.90 * 4 + 0.10 * 5 = 4.1 cycles, a speedup of
4.9 / 4.1 = 1.195, or almost 20%
19
Speculative Load and Check
  • We can also add to the hardware two (or more)
    speculative instructions that preserve exception
    handling behavior
  • From our previous example where we speculated
    that the if clause was going to be executed, we
    can add a speculative load (sLD) and a
    speculative check (SPECCK) as follows

        LD     R1,0(R3)
        sLD    R14,0(R2)    // speculatively load B such that it
                            // cannot cause a terminating exception
        BNEZ   R1,L1
        SPECCK 0(R2)        // check the speculation here
        J      L2
L1:     DADDI  R14,R1,4
L2:     SD     R14,0(R3)
20
Limitations on Speculated Instructions
  • Instructions that are annulled (turned into
    no-ops) still take execution time
  • Conditional instructions are most useful when the
    condition can be evaluated early
  • such as during the ID stage of our pipeline
  • Speculated instructions may cause a slow down
    compared to unconditional instructions requiring
    either a slower clock rate or greater number of
    cycles
  • The use of conditional instructions can be
    limited when the control flow involves more than
    a simple alternative sequence
  • for example, moving an instruction across
    multiple branches requires making it conditional
    on both branches, which requires two conditions
    to be specified or requires additional
    instructions to compute the controlling predicate
  • if such capabilities are not present, the
    overhead of if conversion will be larger,
    reducing its advantage

21
The Intel IA-64
  • We wrap up our examination of software support
    for ILP by examining a processor that relies
    heavily on this
  • RISC-style, load-store instruction set
  • compiler speculated instructions known as EPIC
    (explicitly parallel instruction computing)
  • a variation on VLIW where the compiler explicitly
    denotes where instruction parallelism stops due
    to dependencies
  • 128 integer and 128 FP registers
  • integer registers are 65 bits to hold a poison
    bit
  • 64 1-bit predicate registers to store the
    assumption of a predicated instruction (which way
    was predicted)
  • 8 64-bit branch registers (for indirect branches)
  • register windows although in this case, stored
    on a stack, for quick parameter passing

22
Bundles
  • Compiler builds instruction groups out of
    individual instructions
  • take the next X instructions and place
    parallelizable ones together
  • stops are inserted after each group to indicate
    a limitation in parallelization
  • Next, the compiler takes groups and makes VLIW
    instruction bundles out of them
  • a bundle will consist of 3 instructions (128
    bits), possibly including no-ops to fill in slots
    that can't be used because of limitations in ILP
  • if a bundle contains a stop, then the stop is
    added to the VLIW to indicate that a stall might
    be necessary
  • Instructions are issued to one of 5 units
  • M: memory
  • I: integer
  • F: FP
  • B: branch
  • L + X: extended instructions which take up 2
    slots in a VLIW
  • the hardware can handle up to 2 M or 2 I in one
    bundle
  • These categories help identify legal types of
    bundles

23
Partial Listing of IA-64 Bundles
See page G-36, figure G.7, for the complete table.
Heavy lines indicate stops, that is, locations where
a stall is required or where the instructions that
follow are not parallel.
Some bundles have multiple stops (see 3)
24
Example
  • Unroll the x[i] = x[i] + s loop seven times and
    schedule it for the IA-64 to minimize the number
    of cycles

First, we must determine instruction groups. What
follows is the unrolled but unscheduled code; on the
slide, lines break it up into instruction groups.

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        L.D    F22, -40(R1)
        L.D    F26, -48(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        ADD.D  F24, F22, F2
        ADD.D  F28, F26, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        S.D    F16, -24(R1)
        S.D    F20, -32(R1)
        S.D    F24, -40(R1)
        S.D    F28, -48(R1)
        DADDI  R1, R1, -56
        BNE    R1, R2, Loop

Note that the final stop is not necessarily going
to be needed because the IA-64 uses speculation
and will branch if the target buffer indicates to
branch
25
Solution
Most of the stops will not cause stalls because
there is adequate distance; however, the final
ADD.D is too close to the final S.D and so a
stall is required, thus the execution cycle
jumps from 9 to 11. Total time to execute: 12
cycles
26
Sample Problem 1
  • Determine all dependencies (true, output, anti)
    in the loop below and determine if the loop is
    parallelizable

for (j = 0; j < 100; j++) {
  a[i-1] = b[i] + a[i];       // S1
  b[i] = c[i-1] + c[i+1];     // S2
  c[i]++;                     // S3
  a[i] = c[i] + s;            // S4
}

Output dependence: loop carried on a from S1 to S4
Antidependence: non-loop carried on b from S2 to S1,
non-loop carried on a from S4 to S1, and loop carried
on a from S1 to S1
True dependence: non-loop carried on c from S3 to S3
and loop carried on c from S3 to S2
The loop carried true dependence makes this loop
non-parallelizable
27
Sample Problem 2
  • Use the GCD test to determine if there is a
    dependency
  • for (i = 2; i <= 100; i += 2) a[i] = a[50*i+1];
  • first normalize the code
  • by dividing the for-loop values by 2 and
    multiplying the array index by 2
  • for (i = 1; i <= 50; i += 1) a[i*2] = a[100*i+1];
  • this gives us a = 2, b = 0, c = 100, d = 1
  • GCD(a, c) = 2
  • d - b = 1
  • since 2 does not divide into 1, there are no loop
    carried dependencies in this loop
  • Repeat the same problem with this new loop
  • for (i = 2; i <= 100; i += 2) a[i] = a[i-1];
  • first normalize the loop
  • for (i = 1; i <= 50; i += 1) a[2*i] = a[2*i-1];
  • a = 2, b = 0, c = 2, d = -1
  • GCD(a, c) = 2
  • d - b = -1
  • 2 does not divide into -1 so again this loop has
    no loop carried dependencies

28
Sample Problem 3
  • Consider the following payroll code
  • first, show how this code would appear in MIPS
    without speculation
  • assume F0 is 40.0, F2 is 1.5, F4 is hours and F6
    is wages
  • next, assume that the then clause is taken most
    of the time and rewrite the MIPS with this
    speculation using extra registers as needed
  • assume each instruction takes 1 cycle to execute
    and there are no stalls; if the speculation is
    correct 95% of the time, how much faster is the
    speculated code than the original?

if (hours <= 40) pay = wages * hours;
else pay = wages * 40 + wages * 1.5 * (hours - 40);
29
Solution
Non-speculative code:
        C.LE.D F4, F0
        BC1F   else
        MUL.D  F8, F4, F6
        J      out
else:   MUL.D  F8, F6, F0
        SUB.D  F10, F4, F0
        MUL.D  F10, F10, F6
        MUL.D  F10, F10, F2
        ADD.D  F8, F8, F10
out:    S.D    F8, ...

Speculative code:
        C.LE.D F4, F0
        MUL.D  F8, F4, F6
        BC1T   out
else:   MUL.D  F8, F6, F0
        SUB.D  F10, F4, F0
        MUL.D  F10, F10, F6
        MUL.D  F10, F10, F2
        ADD.D  F8, F8, F10
out:    S.D    F8, ...

The non-speculative code takes 5 instructions if
the then clause is executed and 8 instructions if
the else clause is executed, whereas the speculative
code takes 4 instructions if the then clause is
executed and 9 instructions if the else clause is
executed. If the then clause is taken 95% of the
time, we have
speculative: 0.95 * 4 + 0.05 * 9 = 4.25
non-speculative: 0.95 * 5 + 0.05 * 8 = 5.15
Speedup = 5.15 / 4.25 = 1.212, or a 21% speedup!
30
Sample Problem 4
if (x != 0) y--; else y++;
  • For the if-else statement above
  • write the MIPS code without any speculation
  • write the MIPS code with predicated instructions
    so that there are no branches
  • write the MIPS code to speculate that the ELSE
    clause is taken most often
  • which code executes fastest?
  • assume R1 and R2 store x and y

Without speculation:
        BEQZ   R1, else
        DSUBI  R2, R2, 1
        J      cont
else:   DADDI  R2, R2, 1
cont:

With predicated instructions:
        DADDI  R5, R2, 1
        DSUBI  R6, R2, 1
        SEQ    R4, R1, R0
        CMOVZ  R2, R6, R4
        CMOVZ  R2, R5, R1

Speculating the else clause:
        DADDI  R2, R2, 1
        BEQZ   R1, cont
        DSUBI  R2, R2, 2
cont:

Non-speculation and speculating the else clause
are the same (assuming no stalls) when the else
clause is taken, but speculation is superior if
the then clause is taken, and the predicated
instructions take 5 cycles no matter what
31
Sample Problem 4
  • Assume that the MUL.D instruction takes 7 cycles
    to execute
  • Perform symbolic loop unrolling and scheduling on
    the following loop so that there are no stalls
    required to execute the code
  • how much performance increase is there over the
    code as given below (assume forwarding is
    available and branches are handled by assuming
    not taken), not counting the startup and cleanup
    code?

Loop:   L.D    F0, 0(R1)
        L.D    F1, 0(R2)
        MUL.D  F2, F0, F1
        S.D    F2, 0(R3)
        DSUBI  R1, R1, 8
        DSUBI  R2, R2, 8
        DSUBI  R3, R3, 8
        BNEZ   R3, Loop
32
Solution
        // start up code will require 2 L.D and 1 MUL.D
Loop:   S.D    F2, 16(R3)
        MUL.D  F2, F0, F1
        L.D    F0, 0(R1)
        L.D    F1, 0(R2)
        DSUBI  R1, R1, 8
        DSUBI  R2, R2, 8
        DSUBI  R3, R3, 8
        BNEZ   R3, Loop
        // cleanup code will require 1 MUL.D and 2 S.D

The original code would contain 1 stall after
the second L.D, 6 stalls after MUL.D, 1 stall
after the third DSUBI, and the branch delay slot,
so 9 stalls per iteration. This code has no
stalls (outside of the startup and cleanup code),
so the loop takes 8 cycles versus 17 cycles, a
speedup of 17 / 8 = 2.125 (over 100%!)
33
Sample Problem 5
  • Rewrite the given code without branches by using
    predicated instructions (conditional moves and
    conditional loads)

Consider that cmovz and cmovn don't tell us if a
previous condition was true or false, so we have
to organize our code cleverly

if (a > b) x = 1; else if (c < d) x = 2; else x = 3;

assume a, b, c, d and x are stored in R1, R2,
R3, R4, R5 and R11, R12, R13 store 1, 2 and 3

Solution:
        DADD   R5, R0, R13     // initialize x to 3
        SLT    R20, R3, R4     // R20 = 1 if c < d, else 0
        SGT    R21, R1, R2     // R21 = 1 if a > b, else 0
        CMOVN  R5, R12, R20    // reset x to 2 if c < d
        CMOVN  R5, R11, R21    // reset x to 1 if a > b
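The ordering trick in this solution (apply the lowest-priority assignment first and the highest-priority one last) can be checked with a C sketch. The helper cmovn below is a hypothetical model of a "move if not zero" conditional move, and both function names are assumptions; the sketch models the semantics only and makes no claim about what a compiler emits for it.

```c
#include <assert.h>

/* Model of a conditional move: returns src when cond is nonzero,
   otherwise leaves dst unchanged. */
static long cmovn(long dst, long src, long cond) {
    return cond != 0 ? src : dst;
}

/* Reference version with ordinary branches. */
long select_branching(long a, long b, long c, long d) {
    if (a > b) return 1;
    else if (c < d) return 2;
    else return 3;
}

/* Branch-free version: start from the final else value, then let each
   condition overwrite x, with the first if-test applied last so that it
   wins whenever it holds. */
long select_predicated(long a, long b, long c, long d) {
    long x = 3;
    long c_lt_d = (c < d);
    long a_gt_b = (a > b);
    x = cmovn(x, 2, c_lt_d);
    x = cmovn(x, 1, a_gt_b);
    assert(x >= 1 && x <= 3);
    return x;
}
```

Because the conditional moves carry no memory of earlier conditions, the priority of the if-else chain is encoded purely by the order of the two moves.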
34
Sample Problem 6
  • Unroll and schedule the bundles for the IA-64 for
    the loop that performs a[i] = b[i] + c[i] such
    that the code executes in as few cycles as
    possible