Title: HW/SW Fault Analysis of Multiprocessor Systems
1 HW/SW Fault Analysis of Multiprocessor Systems
- Rainer G. Spallek,
- Steffen Köhler
- TU Dresden
2 Outline
- Software Bugs and Fault Analysis
  - Principles of debugging uniprocessor systems
- Performance and Efficiency of Embedded Microprocessors
  - The evolution of embedded systems and its software development requirements
- Symmetric Multiprocessing Architectures
  - A flexible approach to provide performance scalability
- SMP Operating System and Application Development
  - Partitioning of execution load through task/thread creation
- Software Debugging in SMP Environments
  - Debugging concurrent tasks/threads at different abstraction levels
3 Software Bugs and Fault Analysis
4 Software Fault Sources
- The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because all code is initially written by a programmer, the defect is technically created by the programmer.
- If the programmer creates a defect, does that mean the programmer was at fault? Not in every case.
  - A program behavior may become classified as a failure only when the user sees it for the first time.
  - In a modular program, a failure may happen because of incompatible interfaces of two modules.
  - In a distributed program (e.g. a multiprocessor system), a failure may be the result of some unpredictable interaction of several components.
5 Software Fault Propagation
- The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection: after execution of the defect, the program state differs from what the programmer intended.
  - A defect in the code does not necessarily cause an infection.
  - The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.
  - An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.
- The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.
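The chain from defect to infection to failure can be illustrated with a small sketch. The functions `limit` and `report` are hypothetical and exist only for this illustration: the defect sits on one branch, so it infects the state only for some inputs, and a later action can mask the infection before it becomes visible.

```python
def limit(value):
    """Intended behaviour: clamp value into the range 0..100."""
    if value > 100:
        return 10            # DEFECT: the programmer meant 100
    return max(value, 0)


def report(value, muted=False):
    level = limit(value)     # program state; infected only if value > 100
    if muted:
        level = 0            # a later action overwrites (masks) the infection
    return f"level={level}"  # externally observable behaviour


no_infection = report(50)            # defect is not executed
masked = report(150, muted=True)     # defect executed, infection overwritten
failure = report(150)                # infection reaches the output
print(no_infection, masked, failure)
```

Only the last call shows a failure, although the defect is present in all three runs.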
6 Performance and Efficiency of Embedded Microprocessors
7 Evolution of Embedded Systems
- Embedded systems in the past
  - Single processor
  - Simple memory system
  - Small applications
  - Software development was reasonably simple
- Embedded systems today and in future
  - Many processors
  - Multi-stage memory/communication hierarchy
  - Complex parallel and concurrent applications
  - Software development is getting more and more complex
8 Microprocessor Design Challenges
9 The Cycles Per Instruction Problem
- Limited amount of instruction-level parallelism in programs
  - Superscalar units not fully utilized
  - High hardware costs for out-of-order issue implementation
- Solution: use of task- and thread-level parallelism
10 Multiprocessor Architectures
- Hardware efficiency is the main argument
  - Potential performance gain scales nearly linearly with the number of processor cores
  - Chip area and power consumption scale linearly with the number of processor cores
  - One exception: communication and memory hierarchy
- Utilization of task and thread parallelism is a complex task
  - Identify large blocks of data-independent program code
  - Partition these blocks in such a way that communication between them can be achieved with the available resources
  - Map the concurrent program blocks to physical processor cores
  - Adapt this mapping in accordance with the current load situation
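The identify, partition and map steps can be sketched with a worker pool. This is a minimal illustration, not a tuned implementation: `partition` and `work` are hypothetical helpers, the placement of worker threads on cores is left to the operating system, and standard CPython serialises bytecode execution, so the sketch shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, blocks):
    """Split data into `blocks` contiguous chunks that share no elements."""
    size, extra = divmod(len(data), blocks)
    chunks, start = [], 0
    for i in range(blocks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def work(chunk):
    """Data-independent block: needs nothing from the other chunks."""
    return sum(x * x for x in chunk)


data = list(range(1000))
chunks = partition(data, blocks=4)

# Map the concurrent blocks to workers; the only communication is the
# single partial result each worker hands back.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(work, chunks))

total = sum(partial_sums)
print(total)
```

Keeping the communication down to one result per block is the design choice that matters here; a finer partitioning would spend more of its time exchanging data.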
11 Symmetric Multiprocessing Architectures
12 SMP Basics
- Symmetric Multi Processing
  - From symmetry follows: every task can be executed on every particular processor core
  - Trade-off between higher hardware effort for a universal communication network (shared memory) and a more flexible task and thread partitioning scheme
- All CPUs are equivalent in access and performance
  - Interconnected with a bus or crossbar
  - Simplified programming through shared memory model
  - Unified interrupt distribution sub-system
- Several IP vendors provide SMP-enabled processor solutions for embedded SoCs (ARM, PowerPC, etc.)
13 ARM11 MPCore Architecture
14 SMP Programmer's Model
- Software developer partitions the application manually or semi-automatically
  - Concurrent execution of threads in the same address space
- Thread synchronization issues are handled within the application context through additional dedicated OS functions
  - The developer is responsible for synchronization. The OS supports this by providing dedicated functions (e.g. Linux futex, pthread mutex)
- Mapping of application threads to physical processor cores is handled transparently to the user by the OS kernel
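A minimal sketch of this model, using Python's `threading` module as a stand-in for the pthread mutex named above: all threads share one address space, the developer places the lock explicitly, and nothing in the code says which core a thread runs on. The names `counter` and `worker` are illustrative only.

```python
import threading

counter = 0                        # lives in the shared address space
counter_lock = threading.Lock()    # synchronization is the developer's job


def worker(iterations):
    global counter
    for _ in range(iterations):
        with counter_lock:         # critical section: one thread at a time
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Without the lock, the read-modify-write on `counter` is not guaranteed to be atomic, so updates could be lost.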
15 SMP Software Development
- User-driven application partitioning is an iterative process
- Sophisticated development tools required (compiler, profiler, trace analysis, etc.)
- Objective: find a partitioning that maximizes the overall system performance
16 Problems Introduced by Concurrency
- Efficiency problems
  - Inefficient partitioning through lack of thread parallelism in the considered application (synchronization and communication reduce performance benefits)
  - OS kernel overhead through automatic thread mapping onto a particular core, thread migration to a different core, use of synchronization primitives (load balancing inefficiencies)
- Potential software bugs
  - Deadlock, blocking or 'starving' situations
  - Race conditions: data race, message race, relaxed order memory access
  - Unprotected entries into critical sections
  - Shared use of local variables (re-entrancy)
  - Non-thread-safe libraries
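One common way to rule out the deadlock case is to acquire locks in a fixed global order. The sketch below is an illustration under that assumption and is not taken from the slides; `Account` and `transfer` are hypothetical names. Two threads transfer in opposite directions, which would risk a deadlock if each simply locked its source first.

```python
import threading


class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # Always lock in name order, whatever the direction of the transfer,
    # so two threads can never hold one lock each while waiting for the other.
    first, second = sorted((src, dst), key=lambda account: account.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def repeat_transfer(src, dst, amount, times):
    for _ in range(times):
        transfer(src, dst, amount)


a, b = Account("a", 100), Account("b", 100)
t1 = threading.Thread(target=repeat_transfer, args=(a, b, 1, 1000))
t2 = threading.Thread(target=repeat_transfer, args=(b, a, 2, 1000))
t1.start()
t2.start()
t1.join()
t2.join()

print(a.balance, b.balance)
```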
17 SMP Operating System and Application Development
18 SMP Control Inside Linux Kernel
[Figure: Linux kernel components that contain SMP support functions]
19 Single Processor vs. SMP
- Objectives
  - Fair load sharing, efficient load distribution
  - System speedup close to Ncpu (the number of CPUs)
20 Task Migration
- Kernel function load_balance()
  - Called at most every 200 ms
  - Called on each CPU whose run-queue is empty
- Pulls from the 'busiest run-queue'
  - Pulls expired tasks first
  - Pulls highest priority first
  - Pulls 'not running' tasks first
  - Repeats the last 2 steps until the 'busiest run-queue' no longer holds excess load compared with this CPU's run-queue
- Call sequence: load_balance(), Lock(rq1, rq2), Pull(task), Unlock(rq1, rq2)
[Figure: CPU 0, CPU 1 and CPU 2, each with a scheduler task and its own run-queue of tasks held in shared memory]
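The pull policy can be approximated in a few lines. This is a simplified simulation and not the kernel's code: the task fields, the rule that a lower `prio` number means higher priority, and the decision never to move a running task are all assumptions made for the sketch.

```python
def load_balance(run_queues, this_cpu):
    """Pull tasks into this_cpu's queue from the busiest queue until balanced."""
    busiest = max(run_queues, key=lambda cpu: len(run_queues[cpu]))
    mine, theirs = run_queues[this_cpu], run_queues[busiest]
    pulled = []
    # A kernel would hold both run-queue locks around this loop:
    # Lock(rq1, rq2) ... Pull(task) ... Unlock(rq1, rq2)
    while len(theirs) - len(mine) > 1:
        candidates = [t for t in theirs if not t["running"]]
        if not candidates:
            break
        # Expired tasks first, then highest priority (lowest number).
        task = min(candidates, key=lambda t: (not t["expired"], t["prio"]))
        theirs.remove(task)
        mine.append(task)
        pulled.append(task["name"])
    return pulled


def task(name, prio, expired=False, running=False):
    return {"name": name, "prio": prio, "expired": expired, "running": running}


run_queues = {
    0: [task("A", 1, running=True), task("B", 5, expired=True),
        task("C", 3), task("D", 2)],
    1: [],                              # idle CPU that triggers the balancing
    2: [task("E", 4, running=True)],
}

pulled = load_balance(run_queues, this_cpu=1)
print(pulled, {cpu: len(q) for cpu, q in run_queues.items()})
```

In this run CPU 1 first takes the expired task B and then the highest-priority waiting task D, after which both queues hold two tasks and the loop stops.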
21 Multi-Thread Application Problems
- Shared and private memory access
- Synchronization / communication overhead
22 Example: Relaxed Order Memory Access
    double G, L;
    #pragma omp parallel shared(G) private(L)
    {
        /* Parallel region, executed by thread 0 and thread 1 */
        G = 0.0;
        L = work();
        #pragma omp atomic
        G += L;
    }
[Figure: thread 0 and thread 1 each access memory through their own temporary view; the write to G is marked "write stalled" in both threads]
- G is changed by thread 0.
- The changed G is overwritten by thread 1. The results of thread 0 are lost.
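The lost update can be replayed deterministically in a single-threaded model. The sketch below is an illustration only: it hard-codes one interleaving, models a thread's temporary view as a list of stalled writes, and uses the made-up values 1.5 and 2.5 for the two results of `work()`.

```python
def replay(init_inside_region):
    """Replay one fixed interleaving of two threads on a shared cell G."""
    memory = {"G": 0.0}        # G as seen by main memory
    stalled = []               # writes issued but not yet visible in memory

    def atomic_add(value):
        memory["G"] += value   # models the atomic update G += L

    # Thread 0 runs first and completes its whole region.
    if init_inside_region:
        memory["G"] = 0.0      # thread 0: G = 0.0 reaches memory at once
    atomic_add(1.5)            # thread 0: L = work() gave 1.5

    # Thread 1 issues G = 0.0, but the write stalls in its temporary view.
    if init_inside_region:
        stalled.append(("G", 0.0))
    # ... thread 1 computes L = work(), which gives 2.5 ...
    for name, value in stalled:
        memory[name] = value   # stalled write lands and wipes thread 0's result
    atomic_add(2.5)
    return memory["G"]


lost = replay(init_inside_region=True)    # initialisation inside the region
kept = replay(init_inside_region=False)   # G initialised once, before the region
print(lost, kept)
```

Moving the initialisation of G out of the parallel region leaves only the atomic updates inside it, so no plain write can overtake them.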
23 Application Thread Partitioning
- pthread library
  - Explicit creation and termination of threads
  - Explicit, fast synchronization primitives (locks, mutexes, conditions)
- OpenMP (compiler directed)
  - Explicitly specified parallel regions
  - Implicit creation and termination of threads
  - Explicit synchronization primitives (nested locks, memory access barriers, ordered execution, critical section, etc.)
  - Extraction of thread parallelism is controlled by additional compiler pragma statements
24 Software Debugging in SMP Environments
25 System Behavior vs. Invasive Debugging
- Typically stop, evaluate and restart the entire target system
  - All cores are synchronously controlled, but may operate asynchronously to peripheral devices and the memory sub-system
  - System state may not be completely restorable after restart
- System observability depends on debug HW/SW system capabilities and includes all typical cases
  - Relaxed memory access race conditions
  - Execution order or timing dependent bugs
  - Transient system state dependent bugs
  - Preemption based timing conditions
- Sources of performance losses
  - Communication conditions
  - Cache performance
26 JTAG/TAP Emulation
- Entire system state is observable
  - Physical mapping of tasks / threads onto particular SMP cores
- High level of intrusion: system completely stopped
  - Stopping a single core in a multi-core SMP system might impact the system stability
- Expensive and complex debug access (JTAG accelerator)
  - Application state has to be evaluated through a complex interpretation of the raw SMP system state (core-specific MMU tables, kernel thread table, etc.)
- May be required for kernel development, but is rather oversized for pure application development
27 MMU Page Table Interpretation
28 Test Access Port IEEE 1149.1
29 MultiCore Debug-Interface IEEE 1500
30 MultiCore Debugger Architecture
31 Detecting Bugs through Trace-HW
- E.g. deadlock detection at a given check-point
  - Build the list of exclusive locks
    - Owned by the thread, or
    - That the thread is blocked by
  - Build the dependency graph and find a cycle
- Required information (for hardware trace)
  - Thread ID (content of Context/Thread ID Register)
  - Enter/leave trigger events of lock functions (content of program counter)
  - Addresses of locks (content of first argument register or stack)
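The check-point analysis can be sketched as follows. The trace record format `(thread_id, event, lock_address)` and the event names are hypothetical: `enter` means the thread called the lock function, `leave` means the call returned with the lock held, and `unlock` means the lock was released. Real trace hardware would deliver program-counter and register values that first have to be decoded into such records.

```python
def find_deadlock(trace):
    """Return the list of thread IDs forming a wait cycle, or None."""
    owner = {}     # lock address -> thread that currently owns it
    waiting = {}   # thread -> lock address it is blocked on

    for tid, event, lock in trace:
        if event == "enter":
            waiting[tid] = lock
        elif event == "leave":
            waiting.pop(tid, None)
            owner[lock] = tid
        elif event == "unlock" and owner.get(lock) == tid:
            del owner[lock]

    # Dependency graph: a blocked thread points at the owner of the lock
    # it waits for. Follow these edges until they end or repeat.
    for start in waiting:
        path = [start]
        current = start
        while True:
            blocker = owner.get(waiting.get(current))
            if blocker is None:
                break
            if blocker in path:
                return path[path.index(blocker):]
            path.append(blocker)
            current = blocker
    return None


LOCK_A, LOCK_B = 0x1000, 0x2000
deadlocked = [
    (1, "enter", LOCK_A), (1, "leave", LOCK_A),
    (2, "enter", LOCK_B), (2, "leave", LOCK_B),
    (1, "enter", LOCK_B),          # thread 1 blocks on B, owned by thread 2
    (2, "enter", LOCK_A),          # thread 2 blocks on A, owned by thread 1
]
healthy = [
    (1, "enter", LOCK_A), (1, "leave", LOCK_A), (1, "unlock", LOCK_A),
    (2, "enter", LOCK_A), (2, "leave", LOCK_A),
]

cycle = find_deadlock(deadlocked)
no_cycle = find_deadlock(healthy)
print(cycle, no_cycle)
```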
32 Intelligent Trace Hardware Required
- Paradigm shift from post-trace analysis to pre-trace specification
  - User has to specify
    - What has to be captured?
    - When does it have to be captured?
- On-chip trace pre-processing requires extensive HW support
  - On-chip filter capabilities
  - Trace data compression
  - Trigger logic
  - Cross triggers (multi-core / multi-component)
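A pre-trace specification can be thought of as a pair of predicates plus a buffer size. The software model below is purely illustrative and does not describe any real trace unit: `capture`, the event fields and the example events are assumptions. The `trigger` predicate answers "when", the `keep` predicate answers "what", and a trigger event from one core arms the capture for all cores, which mimics a cross trigger.

```python
def capture(events, trigger, keep, depth):
    """Store at most `depth` events that pass `keep`, once `trigger` has fired."""
    armed = False
    buffer = []
    for event in events:
        if not armed and trigger(event):
            armed = True               # "when": capturing starts at this event
        if armed and keep(event):      # "what": filter applied before storage
            buffer.append(event)
            if len(buffer) == depth:   # trace buffer is full
                break
    return buffer


events = [
    {"core": 0, "kind": "branch", "pc": 0x100},
    {"core": 1, "kind": "lock_enter", "pc": 0x200},
    {"core": 0, "kind": "branch", "pc": 0x104},
    {"core": 1, "kind": "lock_leave", "pc": 0x204},
    {"core": 0, "kind": "lock_enter", "pc": 0x200},
]

captured = capture(
    events,
    trigger=lambda e: e["kind"] == "lock_enter",
    keep=lambda e: e["kind"].startswith("lock_"),
    depth=2,
)
print([(e["core"], e["kind"]) for e in captured])
```

Filtering before storage is what keeps the trace volume manageable; the branch events never reach the buffer.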
33 On-Chip Trace Architecture (CoreSight)
- Unfortunately still not available on any SMP
processor hardware
34 Trace Programming Model
- Everything is hierarchical
  - Overall system is built from components (ETMs, HTMs, Embedded Cross Trigger, ...)
- Register-based programming model for each trace component
  - Registers for identification and management
  - Component-specific control registers
- Memory-mapped interface to provide access to component registers
  - Typically accessed via an AMBA bus bridge
35 Operating System Debug Extensions
- Application-level debugging is supported under Linux through several interfaces (the kernel's ptrace, the thread_db library).
  - One of the most common debuggers based on these interfaces is GDB.
- Low level of intrusion
  - Only selected tasks/threads are stopped / observed
  - Simple debug system
- Thread / core mapping is transparent
  - Only effects of core mapping and concurrent execution are visible
  - Only user threads are observable
36 Thread State Observation
- Display the number of current application threads
- Observe the related run state of every particular application thread
- All threads are started and stopped synchronously, allowing the debugger to evaluate all thread contexts
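A rough in-process analogue of this kind of observation, assuming a Python application rather than a debugger attached through ptrace: `threading.enumerate()` lists the live threads and `sys._current_frames()` exposes the current stack frame of each. The helper `snapshot` and the `worker-` naming scheme are illustrative only.

```python
import sys
import threading


def snapshot(prefix="worker-"):
    """Return {thread name: function it is currently executing}."""
    frames = sys._current_frames()          # thread ident -> topmost frame
    state = {}
    for thread in threading.enumerate():    # only threads that are still alive
        if thread.name.startswith(prefix):
            frame = frames.get(thread.ident)
            state[thread.name] = frame.f_code.co_name if frame else "unknown"
    return state


stop = threading.Event()
workers = [threading.Thread(target=stop.wait, name=f"worker-{i}")
           for i in range(3)]
for w in workers:
    w.start()

running = snapshot()      # all three workers are blocked waiting for the event
stop.set()                # release them together
for w in workers:
    w.join()
finished = snapshot()     # no worker threads are left

print(len(running), running)
print(len(finished))
```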
37 Conclusion
38 Conclusion
- SMP offers a good trade-off when considering hardware effort and programming complexity
  - For a low number of processor cores, hardware cost scales nearly linearly with the potential performance gain.
  - Through the unified shared memory model, the implementation of multi-threaded applications is significantly simplified.
- Debugging SMP systems is a complex task
  - Kernel-level development requires physical core access (JTAG)
  - User application development might also benefit from physical core access, but high intrusion level effects can make debugging inefficient
  - OS kernel provided thread debug interfaces are sufficient in most cases
  - Non-invasive trace support is always beneficial when debugging concurrent tasks / threads
- Load situations requiring task migration are becoming more important in the embedded system domain
39 Thank You