Title: Review last week
1 Review last week
- The software problem
- Robust SW coding techniques
- Regression testing
- Reliability models for software
- Redundancy in Software
- Reliability of N-versioning
- Software rejuvenation
2 Today
- Reliability of networks
- Hardware related FTC techniques
- Watchdog Techniques
- Redundancy in time (Re-execution)
- RESO
- Processes, threads
- Superscalar, CMP, SMT
- Research on FT microarchitectures
- AR-stream, DIVA
- Other error detection mechanisms in HW
- BIST
3 Reliability of Networks
- Based on graph theory: nodes represent computers, branches represent communication links
- Simplest model assumes nodes do not fail but links do
- Link failures may be due to traffic congestion or physical failures
- A path is a collection of branches that provides communication between a specific pair of nodes
- In general we are interested in knowing
- $R_{all} = P(\text{all nodes are connected})$
- $R_{st} = P(\text{nodes } s \text{ and } t \text{ are connected})$
- $R_{k} = P(k \text{ nodes are connected})$
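4 Reliability of Networks
Simple state space enumeration
[Figure: example graph with nodes a, b, c, d and links 1 to 6]
Represent all possible ways to go from node a to b, considering that links fail
[Equation: $R_{ab}$ as a sum of state probabilities, with terms labeled "Prob. 1 link failure" and "Prob. 2 links failure"]
If all links are equal, with $p$ = prob. of being up and $q$ = prob. of being down
If $p = 0.9$ and $q = 0.1$ then $R_{ab} = 0.997$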
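The enumeration can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the topology (a complete graph on a, b, c, d) and the link numbering in `LINKS` are assumptions, chosen only because they are consistent with the cut sets listed on the Cut Sets slide. The names `connected` and `st_reliability` are hypothetical.

```python
from itertools import product

# ASSUMPTION: complete graph on a, b, c, d; numbering picked to agree with
# the cut sets {1,4,5}, {1,6,2}, {1,5,6,3}, {1,2,3,4}.
LINKS = {1: ("a", "b"), 2: ("b", "d"), 3: ("c", "d"),
         4: ("a", "c"), 5: ("a", "d"), 6: ("b", "c")}

def connected(up_links, s, t):
    """Return True if t can be reached from s using only working links."""
    reached, frontier = {s}, [s]
    while frontier:
        node = frontier.pop()
        for u, v in up_links:
            if node in (u, v):
                other = v if node == u else u
                if other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return t in reached

def st_reliability(links, p, s, t):
    """Sum the probabilities of all link states in which s and t are connected."""
    q = 1.0 - p
    ids = sorted(links)
    total = 0.0
    for state in product((True, False), repeat=len(ids)):  # 2^6 = 64 states
        up = [links[i] for i, is_up in zip(ids, state) if is_up]
        if connected(up, s, t):
            total += p ** len(up) * q ** (len(ids) - len(up))
    return total

r_ab = st_reliability(LINKS, p=0.9, s="a", t="b")
print(round(r_ab, 6))
```

Under these assumptions the result is about 0.9978, in line with the 0.997 quoted on the slide. The loop visits every state, so the cost doubles with each added link.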
5 Reliability of Networks
[Figure: the example graph with nodes a, b, c, d and links 1 to 6]
- To improve network reliability we can increase p or add more branches to the network
- There are other, more efficient methods to compute network reliability
- Cut sets
- Graph reduction
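6 Cut Sets
[Figure: the example graph with nodes a, b, c, d and links 1 to 6]
Cut set: a group of links that breaks all paths between s and t when removed from the graph (s = a, t = b in the example graph)
$C_1 = \{1, 4, 5\}$, $C_2 = \{1, 6, 2\}$, $C_3 = \{1, 5, 6, 3\}$, $C_4 = \{1, 2, 3, 4\}$
$R = 1 - P(C_1 \text{ or } C_2 \text{ or } C_3 \text{ or } \dots \text{ or } C_J)$
$R_{ab} = 1 - P(C_1 \text{ or } C_2 \text{ or } C_3 \text{ or } C_4)$
$R_{ab} = 1 - P(145 \text{ or } 162 \text{ or } 1563 \text{ or } 1234)$
$R_{ab} = 1 - [P(145) + P(162) + P(1563) + P(1234) - P(12456) - P(13456) - P(12345) - P(12356) - P(12346) + 2P(123456)]$
Here $P(145)$ is the probability that links 1, 4, and 5 are all down. The expansion repeatedly applies
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$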
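The cut-set calculation can be sketched as inclusion-exclusion over the four cut sets listed above. This is an illustrative sketch: `cutset_reliability` is a hypothetical name, and it assumes links fail independently with the same probability q, as on the state space enumeration slide.

```python
from itertools import combinations

def cutset_reliability(cut_sets, q):
    """Return 1 - P(at least one cut set has all of its links down)."""
    p_union = 0.0
    for r in range(1, len(cut_sets) + 1):
        sign = 1 if r % 2 else -1          # +singles, -pairs, +triples, ...
        for combo in combinations(cut_sets, r):
            links = frozenset().union(*combo)   # links that must all be down
            p_union += sign * q ** len(links)
    return 1.0 - p_union

# Cut sets between a and b, taken from the slide.
CUT_SETS = [frozenset({1, 4, 5}), frozenset({1, 6, 2}),
            frozenset({1, 5, 6, 3}), frozenset({1, 2, 3, 4})]

r_ab = cutset_reliability(CUT_SETS, q=0.1)
print(round(r_ab, 6))
```

With q = 0.1 this gives about 0.9978. Only 15 terms are evaluated here, compared with 64 link states for full enumeration of six links.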
7 Primary Graph Reductions
- Graph reductions facilitate calculation
- Series: a sequence of edges is required simultaneously; combine with the axiom of probability
- $P(A \cap B) = P(A)\,P(B)$
- Parallel: the network is operational if any of these edges is operational; combine with the axiom of probability
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
[Figure: S connects to T through A and through B; each of the four edges has reliability .9]
Serial reduction: each two-edge path reduces to $.9 \times .9 = .81$
Parallel reduction: $P(A \cup B) = .81 + .81 - (.81 \times .81) = .9639$
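The two reductions can be written as small helper functions. This is a sketch assuming independent edges; `series` and `parallel` are hypothetical names, and the example network (two parallel paths of two .9 edges each) is inferred from the numbers on the slide.

```python
def series(*edge_reliabilities):
    """All edges are needed: multiply the reliabilities."""
    result = 1.0
    for r in edge_reliabilities:
        result *= r
    return result

def parallel(*path_reliabilities):
    """Any path suffices: 1 minus the probability that every path fails."""
    prob_all_fail = 1.0
    for r in path_reliabilities:
        prob_all_fail *= 1.0 - r
    return 1.0 - prob_all_fail

path_via_a = series(0.9, 0.9)      # .81
path_via_b = series(0.9, 0.9)      # .81
r_st = parallel(path_via_a, path_via_b)
print(round(r_st, 4))
```

For two paths, `parallel` is algebraically the same as P(A) + P(B) - P(A)P(B), so the result matches the .9639 above.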
8 Watchdog Techniques
- Key concept
- The actions of a process or processor are checked by another unit (normally hardware), for example whether the process is still active and alive and whether it is executing an incorrect path
9 Watchdog Timers
- Check for aliveness
- Processor resets the timer at a certain interval or on certain conditions
- Timer raises an error flag if it is not reset before it overruns
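The aliveness check can be modeled in software as follows. This is a minimal sketch, not a hardware description: the class name `WatchdogTimer` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class WatchdogTimer:
    """Raises an error flag if reset() is not called before the deadline."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.deadline = now + timeout
        self.error_flag = False

    def poll(self, now):
        """Called periodically by the watchdog side; latches the error flag."""
        if now > self.deadline:
            self.error_flag = True
        return self.error_flag

    def reset(self, now):
        """Called by the monitored processor to show it is still alive."""
        self.poll(now)                      # a late reset must not hide an overrun
        self.deadline = now + self.timeout

wd = WatchdogTimer(timeout=10)
wd.reset(now=8)                 # processor is alive, deadline moves to 18
ok_at_15 = wd.poll(now=15)      # still within the deadline
hung_at_30 = wd.poll(now=30)    # no reset since t=8, timer has overrun
```

In a real system the error flag would typically trigger a reset or an interrupt; that policy is left out of the sketch.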
10 Watchdog Timers (contd.)
- Check for timeout
- Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)
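A software version of the timeout check might look like the sketch below. It assumes the second processor can be modeled as a Python callable running in its own thread; `request_with_timeout`, `healthy`, and `hung` are hypothetical names.

```python
import queue
import threading

def request_with_timeout(responder, message, timeout_s):
    """Send message, start the timer, and wait for the reply.

    Returns (reply, timed_out).
    """
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=lambda: replies.put(responder(message)), daemon=True)
    worker.start()
    try:
        return replies.get(timeout=timeout_s), False
    except queue.Empty:          # no reply arrived before the timer expired
        return None, True

def healthy(message):
    return "ack:" + message

def hung(message):
    threading.Event().wait(0.5)  # simulates a processor that answers too late
    return "ack:" + message

reply, timed_out = request_with_timeout(healthy, "ping", timeout_s=2.0)
late_reply, late_timed_out = request_with_timeout(hung, "ping", timeout_s=0.05)
```

The sender only learns that no reply arrived in time. It cannot tell a dead receiver from a slow one or from a lost message.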
11 Watchdog Timers (contd.)
- Applications
- Processor control systems (chemical, mechanical and other control systems)
- Switching systems: messages sent or received often wait a certain length of time before they are repeated
- Networks: email messages often have timeouts associated with them
12 Watchdog Processors
- Consider the following simple architecture
The watchdog can:
- Observe the address bus
- Observe the data
- Observe instructions
- Check the flow of program control
Need to know what kind of errors can occur
13 Watchdog Control flow checking
- Some studies have found that 60% of all transient faults could be detected by monitoring control flow
- Control flow: basic principle
- Analyze the program and extract control information
- Branch-free intervals
- Subroutine calls
- Assign signatures to branch-free intervals and provide these signatures to the watchdog processor, which checks them
- Signatures can be checksums of instruction opcodes
14 Watchdog Control flow checking (contd.)
Watchdog:
- Receive start
- Observe instruction flow
- Calculate signature
- Check with stored signature
Program:
- Start branch-free code
- End branch-free code
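The watchdog steps above can be sketched as follows. The signature here is an 8-bit checksum of opcodes, as the previous slide suggests. The opcode values, interval names, and the class `ControlFlowWatchdog` are made up for illustration.

```python
def signature(opcodes):
    """8-bit checksum of the opcodes in one branch-free interval."""
    return sum(opcodes) & 0xFF

class ControlFlowWatchdog:
    def __init__(self, stored_signatures):
        self.stored = stored_signatures   # interval id -> expected signature
        self.interval = None
        self.observed = []

    def start(self, interval_id):         # "Receive start"
        self.interval = interval_id
        self.observed = []

    def observe(self, opcode):            # "Observe instruction flow"
        self.observed.append(opcode)

    def end(self):                        # "Calculate" and "Check"
        return signature(self.observed) == self.stored[self.interval]

def run_interval(watchdog, interval_id, executed_opcodes):
    watchdog.start(interval_id)
    for opcode in executed_opcodes:
        watchdog.observe(opcode)
    return watchdog.end()

# Hypothetical program: two branch-free intervals with invented opcodes.
PROGRAM = {"B0": [0x10, 0x22, 0x31], "B1": [0x05, 0x40]}
watchdog = ControlFlowWatchdog(
    {name: signature(ops) for name, ops in PROGRAM.items()})

ok = run_interval(watchdog, "B0", PROGRAM["B0"])
corrupted = run_interval(watchdog, "B0", [0x10, 0x23, 0x31])  # one bit flipped
```

A plain checksum does not notice reordered instructions or errors that cancel out, so stronger signature functions may be preferred in practice.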
15 Watchdog Mem access
- What to do about memory/data errors
- Use ECC
- AMD Opteron and Intel Pentium D multicore processors use ECC techniques to avoid transient errors in memory access
- A few other methods use watchdog techniques
- Check for nonexistent memory addresses
- Check for out-of-range addresses
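The two address checks can be sketched as a simple classification of each observed access. The memory map below is invented for the example, and `check_access` is a hypothetical name.

```python
def check_access(address, installed_memory, allowed_region):
    """Classify one memory access observed on the address bus."""
    if address not in installed_memory:
        return "nonexistent address"
    if address not in allowed_region:
        return "out of range"
    return "ok"

# ASSUMPTION: 32 KiB of installed memory, with the running process
# limited to the region 0x1000..0x1FFF.
INSTALLED = range(0x0000, 0x8000)
ALLOWED = range(0x1000, 0x2000)

results = [check_access(a, INSTALLED, ALLOWED)
           for a in (0x1234, 0x3000, 0x9000)]
```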
16 Fault Detection in Complex Processors
- High density and complexity of current processors increase the probability of transient, intermittent and permanent faults
- Diverse techniques are used to detect these faults
- RESO
- Re-execution
- BIST
17 Re-execution with Shifting operands (RESO)
- Re-execute the same arithmetic operations, but with shifted operands
- Goal: detect errors in the ALU
- Example: shift left by 2
- 1 0 1 0 1 0 X X
- 1 0 0 1 0 1 X X
- 0 0 1 0 1 1 X X
- By comparing output bit 0 of the first execution and output bit 2 of the shifted re-execution, we detect an error in the ALU, since they should be equal
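The idea can be sketched with a software model of an adder that has one stuck output bit. Everything here is an assumption made for illustration: the fault model (a stuck-at fault on a single ALU output bit), the 16-bit width, and the names `make_alu_add` and `reso_add`.

```python
def make_alu_add(stuck_bit=None, stuck_value=0, width=16):
    """Return an adder; optionally one output bit is stuck at 0 or 1."""
    mask = (1 << width) - 1
    def add(a, b):
        result = (a + b) & mask
        if stuck_bit is not None:
            if stuck_value:
                result |= 1 << stuck_bit
            else:
                result &= ~(1 << stuck_bit)
        return result
    return add

def reso_add(alu_add, a, b, shift=2):
    """Run the addition twice on the same ALU, the second time shifted."""
    first = alu_add(a, b)
    second = alu_add(a << shift, b << shift) >> shift   # shift the result back
    return first, first != second       # (result, error detected?)

healthy_result, healthy_error = reso_add(make_alu_add(), 0b101010, 0b100101)
faulty_result, faulty_error = reso_add(
    make_alu_add(stuck_bit=0, stuck_value=0), 0b101010, 0b100101)
```

The stuck bit corrupts bit 0 of the first result, but in the shifted run that bit position carries no result data, so the two runs disagree and the fault is detected. The ALU needs `shift` extra bits of width so the shifted operands do not overflow.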
18 Re-execution
- Replicate the actions on a module, either on the same module (temporal redundancy) or on spare modules (temporal/spatial redundancy)
- Good for detecting and/or correcting transient faults
- A transient error will only affect one execution
- Can implement this policy at many different levels
- ALU
- Thread context
- Processor
- System
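Temporal redundancy on a single module can be sketched as running the same operation several times and voting. Three runs with a majority vote are an assumption of this sketch; two runs would be enough for detection alone. The names `re_execute` and `make_transient_adder` are hypothetical.

```python
from collections import Counter

def re_execute(operation, args, runs=3):
    """Run operation several times; return (majority value or None, error seen?)."""
    results = [operation(*args) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    error_detected = votes < runs        # at least one run disagreed
    has_majority = votes > runs // 2
    return (value if has_majority else None), error_detected

def make_transient_adder(faulty_call):
    """Adder whose result has one bit flipped on exactly one call."""
    calls = {"n": 0}
    def add(a, b):
        calls["n"] += 1
        result = a + b
        if calls["n"] == faulty_call:
            result ^= 0b100              # simulated transient bit flip
        return result
    return add

value, error_detected = re_execute(make_transient_adder(faulty_call=2), (20, 22))
```

Because the transient fault affects only one of the three executions, the vote both detects and corrects it. A permanent fault would corrupt every run in the same way and pass unnoticed.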
19 Race Conditions
- In concurrent applications race conditions may happen
- A race condition is a bug that occurs when the outcome of a program depends on which of two or more threads reaches a particular block of code first. Running the program many times produces different results, and the result of any given run cannot be predicted.
- Re-execution of the same threads may be used to detect a race condition.
20 Break
21 Re-execution with Processes
- Idea: use redundant processes to detect errors
- Problem in a uniprocessor: serialization, slowdown factor of 2
- In a multicore/multiprocessor, we can execute multiple copies of the same process simultaneously on 2 processors and have them periodically compare their results
- Almost no slowdown, except for comparisons
- Disadvantage: not using the other processor to perform non-redundant work
[Figure: two copies of a process on one CPU with error checking, and one copy on each of two CPUs with error checking]
22 Current Multi-Core Processors
- A multi-core CPU combines independent processors (cores) onto a single silicon chip.
- Intel: distinguishes between logical and physical processors
- Logical refers to the Hyper-Threading side; physical means core.
- An Intel dual-core processor has two physical processors in the same chip package (Paxville)
- AMD: uses the concept of logical processor count to refer to multiple cores existing within the same chip package.
- Dual-core Opteron and AMD64 (X2) dual-core
23 Shared Memory Multiprocessor Architectures
[Figures: Athlon 64FX2 and Pentium D]
24 Past, Present and the Future?
[Figure: three arrangements of processing elements (PE) and memory]
- Basic multicore: IBM Power5
- Traditional multiprocessor
- Integrated multicore: 16-tile MIT Raw
25 Re-execution of microinstructions
Superscalar uniprocessor microarchitecture
[Figure: pipeline stages IF, ID, RD (in order); dispatch buffer; EX on functional units ALU, MEM1, MEM2, FP1, FP2, FP3, BR (out of order); reorder buffer; WB (in order)]
Re-execute instructions on different functional units
Drawback: tests only the FUs, not the whole pipeline
26 Re-execution with Threads
- Use redundant threads to detect errors
- Many current superscalar microprocessors are multithreaded (Intel Pentium 4, IBM Power5, Compaq's Alpha 21464, Sun's UltraSparc 3)
- Each processor can run multiple processes or multiple threads of the same process
- Can re-execute a program on multiple thread contexts, just like with multiple processors
- Better performance than re-execution with multiple processors, since the comparison can be performed on-chip
- Lower cost to use an extra thread context rather than an extra processor
27 SMT
- Simultaneous multithreading (SMT) is the same concept as Intel's Pentium 4 Hyper-Threading
- Main idea of SMT
- Improve efficiency of a superscalar processor by exploiting thread-level parallelism (TLP) and instruction-level parallelism (ILP) at the same time
- Threads are generated by a compiler or OS (processes)
- According to Intel's data, SMT provides a 30% improvement at the cost of 5% more chip area
28 SMT - Flow of Instructions
[Figure: flow of instructions from Thread 1, Thread 2, Thread 3, and Thread 4]
29 Re-execution with Simultaneous Multithreaded (SMT)
- Motivation (Rotenberg 99)
- Increasingly high clock rates and chip density may cause transient errors in high-performance microprocessors
- High cost of multiprocessors (at that time)
- AR-SMT: active stream/redundant stream simultaneous multithreading
- Low overhead, broad coverage of transient faults and some permanent faults
- In AR-SMT, two explicit copies of the program run concurrently on the same processor resources
30 Re-execution with Simultaneous Multithreaded (SMT)
- The A-stream is executed on the SMT and its results are committed to the delay buffer
- The R-stream executes on the SMT, delayed from the A-stream by no more than the size of the delay buffer
- R-stream results are compared to the A-stream results in the delay buffer; a fault is detected if the results differ
- SMT pipeline
- Time-shared: in any given cycle, the pipeline stage is consumed entirely by one thread.
- Space-shared: every cycle a fraction of the bandwidth is allocated to both threads.
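The interaction of the two streams through the delay buffer can be sketched as a small simulation. This is a toy model, not the AR-SMT microarchitecture: instructions are tuples, each loop iteration lets either the A-stream or the R-stream advance by one instruction, and the names `execute`, `ar_smt_run`, and `with_transient_fault` are hypothetical.

```python
from collections import deque

def execute(instr):
    op, a, b = instr
    return a + b if op == "add" else a * b

def ar_smt_run(program, a_execute, r_execute, delay_buffer_size=4):
    """Return the index of the first instruction where the streams differ, or None."""
    delay_buffer = deque()
    next_a = 0
    while next_a < len(program) or delay_buffer:
        if next_a < len(program) and len(delay_buffer) < delay_buffer_size:
            # A-stream runs ahead and commits its result to the delay buffer.
            delay_buffer.append((next_a, a_execute(program[next_a])))
            next_a += 1
        else:
            # R-stream re-executes the oldest buffered instruction and compares.
            index, a_result = delay_buffer.popleft()
            if r_execute(program[index]) != a_result:
                return index
    return None

def with_transient_fault(fault_on_call):
    """Executor that flips one result bit on exactly one call."""
    calls = 0
    def faulty(instr):
        nonlocal calls
        calls += 1
        result = execute(instr)
        return result ^ 1 if calls == fault_on_call else result
    return faulty

program = [("add", 1, 2), ("mul", 3, 4), ("add", 5, 6),
           ("mul", 7, 8), ("add", 9, 10), ("mul", 2, 2)]
clean = ar_smt_run(program, execute, execute)
faulty = ar_smt_run(program, with_transient_fault(fault_on_call=3), execute)
```

The A-stream can never be more than `delay_buffer_size` instructions ahead, which matches the bound stated above.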
31 DIVA: Dynamic Implementation Verification Architecture
- Permits detection of and recovery from all functional and electrical faults
- Extends the speculative mechanism to fault detection and recovery
- Addresses recovery from permanent faults that may be caused by design faults
32 A high level view of processor
[Figure: DIVA processor]
33 DIVA Overview
- The processor is divided into a deeply speculative core and a functionally and electrically robust DIVA checker
- The core has all the stages except the retirement stage
- The DIVA checker verifies the correctness of the computations before they are saved in architected storage
- Incorporates a watchdog timer that is used to restart the core if no forward progress is being made
34 DIVA - Architecture
- Two pipelines
- CHKcomp: verifies the integrity of all functional unit computations
- CHKcomm: verifies register and memory communications between the instructions
- CT: commit stage. Instructions are committed if both CHKcomp and CHKcomm pass
35 DIVA
- EX: results of the instruction are recomputed
- CMP: recomputed results are compared with the ones from the core
- RD: reads register/memory values from architected storage
- CHK: compares the values read to the input operands from the core
- A bypass is provided for the case where the immediately preceding instruction is still writing the values being checked
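The two checks can be sketched in software. This is a simplified model under stated assumptions: instructions are checked strictly one after another (so no bypass is needed), only register operands are modeled, and on a mismatch the checker commits its own recomputed value as the recovery action. The record layout and the name `diva_check_and_commit` are invented for the example.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def diva_check_and_commit(core_output, registers):
    """Check each instruction from the core; return indices that failed a check."""
    faults = []
    for i, entry in enumerate(core_output):
        # CHKcomm (RD, CHK): were the operands the core used the architected ones?
        actual = tuple(registers[r] for r in entry["srcs"])
        comm_ok = actual == tuple(entry["operands"])
        # CHKcomp (EX, CMP): recompute and compare with the core's result.
        comp_ok = OPS[entry["op"]](*entry["operands"]) == entry["result"]
        if comm_ok and comp_ok:
            registers[entry["dest"]] = entry["result"]            # CT: commit
        else:
            faults.append(i)
            registers[entry["dest"]] = OPS[entry["op"]](*actual)  # assumed recovery
    return faults

registers = {"r1": 6, "r2": 7, "r3": 0, "r4": 0}
core_output = [
    {"op": "mul", "srcs": ("r1", "r2"), "operands": (6, 7), "dest": "r3", "result": 42},
    # The core delivers a wrong sum here (42 + 6 is 48, not 49).
    {"op": "add", "srcs": ("r3", "r1"), "operands": (42, 6), "dest": "r4", "result": 49},
]
faults = diva_check_and_commit(core_output, registers)
```

The checker can be simple because the core has already resolved dependences and supplied the operands. It only re-executes single instructions in order.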
36 Other Error Detection Mechanisms
- Testing techniques are used to detect errors in critical components of a processor
- BIST (Built-In Self-Test): random test patterns of bits are applied to the circuit under test and the output is checked for errors
- ATPG (Automatic Test Pattern Generation)
37 Basic BIST Architecture
[Figure: block diagram with Test Controller (BIST Start, BIST Done), Test Pattern Generator, Input Isolation circuitry, Circuit Under Test, Output Response Analyzer (ORA, Pass/Fail), System Inputs, and System Outputs]
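The roles of the blocks can be sketched in software. The choices below are assumptions for illustration only: the test pattern generator is an 8-bit LFSR, the circuit under test is a 4-bit adder, and the ORA compares each response with a fault-free reference model instead of compacting responses into a signature. All function names are hypothetical.

```python
def lfsr_patterns(seed, count):
    """Test pattern generator: 8-bit Fibonacci LFSR (seed must be nonzero)."""
    state = seed & 0xFF
    for _ in range(count):
        yield state
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (feedback << 7)

def adder(pattern):
    """Fault-free circuit under test: adds the two 4-bit halves of the pattern."""
    return (pattern & 0xF) + (pattern >> 4)

def faulty_adder(pattern):
    """Same circuit with output bit 0 stuck at 1."""
    return adder(pattern) | 1

def run_bist(circuit, golden, seed=0x01, count=32):
    """Test controller: apply the patterns and let the ORA report pass or fail."""
    for pattern in lfsr_patterns(seed, count):
        if circuit(pattern) != golden(pattern):   # ORA comparison
            return "fail"
    return "pass"

good_chip = run_bist(adder, adder)
bad_chip = run_bist(faulty_adder, adder)
```

Hardware implementations usually compress the responses into a short signature to save area. The direct comparison here keeps the sketch free of aliasing effects.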
38 Advantages of BIST
- Can be used at all levels of testing
- System-level testing in the field
- No need for external test machines
- Fewer I/O pins needed for testing
- Burn-in test made easy
- No need for test vector development
39 Disadvantages of BIST
- Area overhead, susceptibility to manufacturing defects
- Performance penalties
- Extra effort to design and verify proper operation of BIST at the design level
- Additional risk in the project
40 Summary
- What were the main points of the lecture?