HW/SW Fault Analysis of Multiprocessor Systems


Slides: 40

Provided by: jensbraune

1
HW/SW Fault Analysis of Multiprocessor Systems

Rainer G. Spallek,
Steffen Köhler
TU Dresden

2
Outline

Software Bugs and Fault Analysis
Principles of debugging uniprocessor systems
Performance and Efficiency of Embedded Microprocessors
The evolution of embedded systems and its software development requirements
Symmetric Multiprocessing Architectures
A flexible approach to provide performance scalability
SMP Operating System and Application Development
Partitioning of execution load through task/thread creation
Software Debugging in SMP Environments
Debugging concurrent tasks/threads at different abstraction levels

3
Software Bugs and Fault Analysis
4
Software Fault Sources

The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because all code is initially written by a programmer, the defect is technically created by the programmer.
If the programmer creates a defect, does that mean the programmer was at fault? Not in every case.
A program behavior may become classified as a failure only when the user sees it for the first time.
In a modular program, a failure may happen because of incompatible interfaces of two modules.
In a distributed program (e.g. a multiprocessor system), a failure may be the result of some unpredictable interaction of several components.

5
Software Fault Propagation

The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection - that is, after execution of the defect, the program state differs from what the programmer intended.
A defect in the code does not necessarily cause an infection.
The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.
An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.
The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.
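
The chain from defect to infection to failure can be illustrated with a small sketch. The functions average() and capped_report() are hypothetical and exist only for this illustration; the defect is planted on purpose.

```python
def average(values):
    # Defect: the divisor is hard-coded to 3 instead of len(values).
    return sum(values) / 3


def capped_report(values, cap):
    # A later program action that may mask an infected intermediate value.
    return min(average(values), cap)


# Defect executed, but no infection: the list happens to have 3 elements.
no_infection = average([1, 2, 3])          # 2.0, which is correct

# Defect executed under conditions that infect the state:
# the intended result is 3.0, the computed one is 2.0.
infected = average([2, 4])

# Infection masked: the cap hides the wrong intermediate value,
# so no failure is observable from outside.
masked = capped_report([2, 4], cap=1.0)    # 1.0, same as with a correct average

# Infection propagates to the output: an externally observable failure.
failure = capped_report([2, 4], cap=10.0)  # 2.0 instead of the intended 3.0
```

The same defective line is executed in all four calls, yet only the last one produces a failure that a user could see.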

6
Performance and Efficiency of Embedded Microprocessors
7
Evolution of Embedded Systems

Embedded systems in the past
Single processor
Simple memory system
Small applications
Software development was reasonably simple
Embedded systems today and in the future
Many processors
Multi-stage memory/communication hierarchy
Complex parallel and concurrent applications
Software development is getting more and more complex

8
Microprocessor Design Challenges
9
The Cycles Per Instruction Problem

Limited amount of instruction level parallelism in programs
Super-scalar units not fully utilized
High hardware costs for out-of-order issue implementation

Solution: use of task and thread level parallelism

10
Multiprocessor Architectures

Hardware efficiency is the main argument
Potential performance gain scales nearly linearly with the number of processor cores
Chip area and power consumption scale linearly with the number of processor cores
One exception: communication and memory hierarchy
Utilization of task and thread parallelism is a complex task
Identify large blocks of data-independent program code
Partition these blocks in such a way that communication between them can be achieved with the available resources
Map the concurrent program blocks to physical processor cores
Adapt this mapping in accordance with the current load situation
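
As a rough illustration of the mapping step, the sketch below assigns independent program blocks to cores with a greedy "largest block first, least loaded core first" heuristic. The heuristic, the function name map_blocks_to_cores() and the per-block cost estimates are assumptions made for this example; the slides do not prescribe a particular mapping algorithm, and communication cost is ignored here.

```python
import heapq


def map_blocks_to_cores(block_costs, num_cores):
    """Greedy mapping: place the most expensive block on the least loaded core.

    block_costs: dict of block name -> estimated execution cost (assumed known).
    Returns a dict of core index -> list of block names.
    """
    # Min-heap of (current load, core index); all cores start empty.
    heap = [(0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    mapping = {core: [] for core in range(num_cores)}

    # Largest blocks first; ties are broken by name to keep the result stable.
    ordered = sorted(block_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, cost in ordered:
        load, core = heapq.heappop(heap)
        mapping[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    return mapping


# Hypothetical blocks of an application and their estimated costs.
example = map_blocks_to_cores({"fft": 8, "filter": 5, "io": 4, "ui": 3}, 2)
```

Re-running the function with updated cost estimates is one simple way to adapt the mapping to a changed load situation.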

11
Symmetric Multiprocessing Architectures
12
SMP Basics

Symmetric Multi Processing
Symmetry implies that every task can be executed on any processor core
Trade-off between higher hardware effort for a universal communication network (shared memory) and a more flexible task and thread partitioning scheme
All CPUs are equivalent in access and performance
Interconnected with a bus or crossbar
Simplified programming through shared memory model
Unified interrupt distribution sub-system
Several IP vendors provide SMP enabled processor solutions for embedded SoCs (ARM, PowerPC, etc.)

13
ARM11_MPCore Architecture
14
SMP Programmer's Model

Software developer partitions the application manually or semi-automatically
Concurrent execution of threads in the same address space
Thread synchronization issues are handled within the application context through additional dedicated OS functions
Developer is responsible for synchronization. The OS supports this by providing dedicated functions (e.g. Linux futex, pthread mutex)
Mapping of application threads to physical processor cores is handled transparently to the user by the OS kernel
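
A minimal sketch of developer-managed synchronization in a shared address space follows. It uses Python's threading.Lock as a stand-in for the pthread mutex named on the slide; the names counter, add_many() and run() are hypothetical.

```python
import threading

counter = 0                      # shared state, visible to all threads
counter_lock = threading.Lock()  # plays the role of a pthread mutex


def add_many(n):
    """Increment the shared counter n times inside a critical section."""
    global counter
    for _ in range(n):
        with counter_lock:       # acquire on entry, release on exit
            counter += 1


def run(num_threads=4, n=10_000):
    """Start num_threads workers and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The application decides where the lock is taken; which core each thread runs on is left entirely to the operating system.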

15
SMP Software Development

User driven application partitioning is an iterative process
Sophisticated development tools required (compiler, profiler, trace analysis, etc.)
Objective: find a partitioning that maximizes the overall system performance

16
Problems Introduced by Concurrency

Efficiency problems
Inefficient partitioning through lack of thread parallelism in the considered application (synchronization and communication reduce performance benefits)
OS kernel overhead through automatic thread mapping onto a particular core, thread migration to a different core, use of synchronization primitives (load balancing inefficiencies)
Potential software bugs
Deadlock, blocking or 'starving' situations
Race conditions: data race, message race, relaxed order memory access
Unprotected entries into critical sections
Shared use of local variables (re-entrancy)
Non-thread-safe libraries

17
SMP Operating System and Application Development
18
SMP Control Inside Linux Kernel
Contain SMP support functions
19
Single Processor vs. SMP

Objectives
Fair load sharing, efficient load distribution
System speedup close to Ncpu
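
How close the speedup can get to Ncpu depends on how much of the work actually runs in parallel. The sketch below uses Amdahl's law, a standard model that is not part of the slides, to put numbers on this; the function name and the example fractions are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, n_cpu):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the run time that can be spread over cores (0..1).
    n_cpu: number of processor cores.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpu)


# A fully parallel workload reaches the ideal speedup of Ncpu.
ideal = amdahl_speedup(1.0, 4)

# With half of the run time serialized, two cores give well under 2x.
limited = amdahl_speedup(0.5, 2)
```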

20
Task Migration

Kernel function load_balance()
Called at most every 200 ms
Is also called on each CPU when its run-queue is empty

Pulls from 'busiest run-queue'
Pulls expired tasks first
Pulls highest priority first
Pulls 'not running' tasks first
Repeat the last 2 steps until the 'busiest run-queue' has no excess load compared to this CPU's run-queue

Figure: load_balance() sequence: Lock(rq1, rq2), Pull(task), Unlock(rq1, rq2). CPU 0, CPU 1 and CPU 2 each run a scheduler with their own run-queue of tasks in shared memory.
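
The pull rules can be sketched as a toy model. This is not the Linux kernel implementation: the classes Task and RunQueue, the convention that a lower number means higher priority, the ordering of the three preferences, and the stop condition (queue lengths differ by at most one) are all assumptions made for this illustration.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int           # assumption: lower number = higher priority
    expired: bool = False
    running: bool = False


@dataclass
class RunQueue:
    cpu: int
    tasks: list = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)


def load_balance(this_rq, all_rqs):
    """Pull tasks from the busiest run-queue into this_rq; return pulled names."""
    others = [rq for rq in all_rqs if rq is not this_rq]
    if not others:
        return []
    busiest = max(others, key=lambda rq: len(rq.tasks))

    # Lock both queues in a fixed order (by CPU number) to avoid deadlock.
    first, second = sorted((this_rq, busiest), key=lambda rq: rq.cpu)
    pulled = []
    with first.lock, second.lock:
        while len(busiest.tasks) - len(this_rq.tasks) > 1:
            # Prefer expired tasks, then high priority, then tasks not running.
            task = min(busiest.tasks,
                       key=lambda t: (not t.expired, t.priority, t.running))
            busiest.tasks.remove(task)
            this_rq.tasks.append(task)
            pulled.append(task.name)
    return pulled
```

Calling load_balance() for an idle CPU moves tasks over until the two queues are roughly level, which mirrors the "repeat until balanced" step of the slide.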
21
Multi-Thread Application Problems

Locking

Shared and private memory access

Order of execution

Synchronization / communication overhead

22
Example: Relaxed Order Memory Access

double G, L;
#pragma omp parallel shared(G) private(L)
{
    G = 0.0;
    L = work();
    #pragma omp atomic
    G += L;
}

Thread 0 and thread 1 both execute the parallel region, each through its own temporary view of memory, and a write can be stalled in that view.
G is changed by thread 0. The changed G is then overwritten by thread 1, so the results of thread 0 are lost.
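
The lost update can be replayed deterministically with a toy model of buffered writes. The class RelaxedMemory and both region functions are hypothetical; they only mimic the interleaving described on the slide and make no claim about how a real OpenMP runtime or memory system behaves.

```python
class RelaxedMemory:
    """Toy shared memory with one buffered 'temporary view' per thread."""

    def __init__(self):
        self.shared = {}    # values visible to all threads
        self.pending = {}   # thread id -> writes not yet visible to others

    def write(self, tid, name, value):
        # The write is only buffered; it reaches shared memory on flush().
        self.pending.setdefault(tid, []).append((name, value))

    def flush(self, tid):
        for name, value in self.pending.pop(tid, []):
            self.shared[name] = value

    def atomic_add(self, name, value):
        # The atomic update acts directly on shared memory.
        self.shared[name] += value


def racy_region(work0, work1):
    """Both threads reset G inside the parallel region (as on the slide)."""
    mem = RelaxedMemory()
    mem.write(0, "G", 0.0)
    mem.write(1, "G", 0.0)       # thread 1's write is stalled
    mem.flush(0)
    mem.atomic_add("G", work0)   # G changed by thread 0
    mem.flush(1)                 # stalled write arrives and overwrites G
    mem.atomic_add("G", work1)
    return mem.shared["G"]


def fixed_region(work0, work1):
    """G is initialized once, before the parallel region starts."""
    mem = RelaxedMemory()
    mem.shared["G"] = 0.0
    mem.atomic_add("G", work0)
    mem.atomic_add("G", work1)
    return mem.shared["G"]
```

With work results 1.5 and 2.5, racy_region() ends with G equal to 2.5 because thread 0's contribution is lost, while fixed_region() yields the intended 4.0.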
23
Application Thread Partitioning

pthread library
Explicit creation and termination of threads
Explicit, fast synchronization primitives (locks, mutexes, conditions)
OpenMP (compiler directed)
Explicitly specified parallel regions
Implicit creation and termination of threads
Explicit synchronization primitives (nested locks, memory access barriers, ordered execution, critical section, etc.)
Extraction of thread parallelism is controlled by additional compiler pragma statements

24
Software Debugging in SMP Environments
25
System Behavior vs. Invasive Debugging

Typically stop, evaluate and restart the entire target system
All cores are synchronously controlled, but may operate asynchronously to peripheral devices and memory sub-system
System state may not be completely restorable after restart
System observability depends on debug HW/SW system capabilities and includes all typical cases
Relaxed memory access race conditions
Execution order or timing dependent bugs
Transient system state dependent bugs
Preemption based timing conditions
Sources of performance losses
Communication conditions
Cache performance

26
JTAG/TAP Emulation

Entire system state is observable
Physical mapping of tasks / threads onto particular SMP cores
High level of intrusion - system completely stopped
Stopping a single core in a multi-core SMP system might impact the system stability
Expensive and complex debug access (JTAG accelerator)
Application state has to be evaluated through a complex interpretation of the raw SMP system state (core specific MMU tables, kernel thread table, etc.)
May be required for kernel development, but is rather excessive for pure application development

27
MMU Page Table Interpretation
28
Test Access Port IEEE 1149.1
29
MultiCore Debug-Interface IEEE 1500
30
MultiCore Debugger Architecture
31
Detecting Bugs through Trace-HW

E.g. deadlock detection at a given check-point
For each thread, build the list of exclusive locks that are owned by the thread or that the thread is blocked by
Build the dependency graph and find a cycle
Required information (for hardware trace)
Thread ID (content of Context/Thread ID Register)
Enter/leave trigger events of lock functions (content of program counter)
Addresses of locks (content of first argument register or stack)
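
The check-point analysis can be sketched as follows, assuming the trace has already been decoded into two tables: which thread owns each lock, and which lock each blocked thread is waiting for. The function name find_deadlock(), the table layout and the thread and lock names are hypothetical.

```python
def find_deadlock(owns, waits_for):
    """Return a list of threads forming a wait cycle, or None if there is none.

    owns:      dict of lock address -> thread ID that currently holds the lock
    waits_for: dict of thread ID -> lock address the thread is blocked on
    """
    # Dependency graph: blocked thread -> thread that owns the wanted lock.
    edges = {}
    for thread, lock in waits_for.items():
        owner = owns.get(lock)
        if owner is not None and owner != thread:
            edges[thread] = owner

    # Each thread waits for at most one lock, so every node has at most one
    # outgoing edge and following the chain is enough to find a cycle.
    for start in edges:
        path = []
        node = start
        while node in edges and node not in path:
            path.append(node)
            node = edges[node]
        if node in path:
            return path[path.index(node):]
    return None


# Hypothetical check-point: T1 holds A and wants B, T2 holds B and wants A.
cycle = find_deadlock({"A": "T1", "B": "T2"}, {"T1": "B", "T2": "A"})
```

In this example the returned cycle is ["T1", "T2"]; if T2 were not blocked, the function would return None.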

32
Intelligent Trace Hardware Required

Paradigm shift from post-trace analysis to pre-trace specification
User has to specify
What has to be captured?
When does it have to be captured?
On-chip trace pre-processing requires extensive HW support
On-chip filter capabilities
Trace data compression
Trigger logic
Cross triggers (multi-core / multi-component)

33
On-Chip Trace Architecture (CoreSight)

Unfortunately still not available on any SMP processor hardware

34
Trace Programming Model

Everything is hierarchical
Overall system is built from components (ETMs, HTMs, Embedded Cross Trigger, ...)
Register based programming model for each trace component
Registers for identification and management
Component specific control registers
Memory mapped interface to provide access to component registers
Typically access via AMBA bus bridge

35
Operating System Debug Extensions

Application level debugging is supported by the Linux kernel through several interfaces (ptrace, thread_db).
One of the most common debuggers based on these interfaces is GDB.
Low level of intrusion
Only selected tasks/threads are stopped / observed
Simple debug system
Thread / core mapping is transparent
Only effects of core mapping and concurrent execution are visible
Only user threads are observable

36
Thread State Observation

Display the number of current application threads
Observe the related run state of every particular application
All threads are started and stopped synchronously, allowing the debugger to evaluate all thread contexts

37
Conclusion
38
Conclusion

SMP offers a good trade-off when considering hardware effort and programming complexity
For a low number of processor cores, hardware cost scales nearly linearly with the potential performance gain.
Through the unified shared memory model, the implementation of multi-threaded applications is significantly simplified.
Debugging SMP systems is a complex task
Kernel level development requires physical core access (JTAG)
User application development might also benefit from physical core access, but high intrusion level effects can make debugging inefficient
OS kernel provided thread debug interfaces are sufficient in most cases
Non-invasive trace support is always beneficial when debugging concurrent tasks / threads
Load situations requiring task migration become more important in the embedded system domain