Title: Power Management in Real-Time Systems
1. Power Management in Real-Time Systems
Collaborators: Daniel Mosse, Bruce Childers. PhD students: Hakan Aydin, Dakai Zhu, Cosmin Rusu, Nevine AbouGhazaleh, Ruibin Xu.
2. Power Management
- Why?
  - Battery-operated devices: laptops, PDAs, and cell phones
  - Heat dissipation in complex servers (multiprocessors)
  - Power-aware: maintain QoS while reducing energy
- How?
  - Power off unused parts (e.g., LCD and disk on a laptop)
  - Gracefully reduce the performance
- CPU dynamic power: Pd = Cef · Vdd² · f (see the sketch below)
  - Cef: effective switching capacitance
  - Vdd: supply voltage
  - f: processor frequency, roughly linearly related to Vdd
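A minimal sketch of what this model implies for energy, assuming Vdd scales linearly with f as stated above; the constants and workload values are illustrative, not from the slides:

```python
# Sketch of the CPU dynamic power model P_d = C_ef * Vdd^2 * f.
# Since f is roughly proportional to Vdd, P_d scales ~ f^3, so running
# a task slower (and longer) still wins on energy.

def dynamic_power(f, c_ef=1.0, k=1.0):
    """Power at frequency f, with Vdd ~ k * f (linear relation)."""
    vdd = k * f
    return c_ef * vdd**2 * f          # ~ f^3

def task_energy(cycles, f):
    """Energy = power * execution time, with time = cycles / f."""
    return dynamic_power(f) * (cycles / f)   # ~ cycles * f^2

if __name__ == "__main__":
    cycles = 1e9
    for f in (1.0, 0.75, 0.5):        # normalized frequencies
        print(f"f={f:.2f}: energy={task_energy(cycles, f):.2e}")
    # Halving f quarters the energy (at twice the execution time).
```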
3. Power-Aware Scheduling
- Static Power Management (SPM)
  - Static slack: uniformly slow down all tasks (see the sketch after the figure)
  - Gets more interesting for multiprocessors
[Figure: tasks T1 and T2 execute at fmax and finish before the deadline D, leaving the static slack as idle time.]
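A sketch of SPM under these assumptions (a simple frame-based task set, speeds normalized to fmax); the function name and model are illustrative:

```python
# Static power management (SPM): with WCETs known offline, uniformly
# slow all tasks so the schedule stretches exactly to the deadline D,
# converting the static slack into energy savings.

def spm_speed(wcets, deadline, f_max=1.0):
    """Uniform speed that consumes all static slack (normalized to f_max)."""
    demand = sum(wcets)               # time needed at f_max
    if demand > deadline:
        raise ValueError("not schedulable even at f_max")
    return f_max * demand / deadline  # slower speed, no idle time

print(spm_speed([2.0, 3.0], deadline=10.0))   # -> 0.5
```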
4. Dynamic Power Management (DPM)
- Dynamic slack: tasks often complete well under their WCET (average execution can be around 10% of it)
- Utilize the slack to slow down future tasks (proportional, greedy, aggressive, etc.), as sketched below
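A sketch of two of these slack-reclamation policies; the single-processor, frame-based setting and the exact sharing rule are illustrative assumptions:

```python
# When a task finishes early, the unused WCET (dynamic slack) is given
# to future tasks: greedy hands it all to the next ready task, while
# proportional spreads it over all remaining tasks.

def greedy(next_wcet, slack):
    """Next task gets all the slack: speed drops to wcet/(wcet+slack)."""
    return next_wcet / (next_wcet + slack)

def proportional(remaining_wcets, slack):
    """Each remaining task i gets a slack share of slack * wcet_i / total."""
    total = sum(remaining_wcets)
    return [w / (w + slack * w / total) for w in remaining_wcets]

print(greedy(2.0, 1.0))               # next task slows to 2/3 speed
print(proportional([2.0, 2.0], 1.0))  # both remaining tasks slow to 0.8
```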
5. Stochastic Power Management
[Figure: slack and the speed factor β1.]
6. Computing βi in Reverse Order
[Figure: tasks T1 through T4; the speed factors βi are computed from the last task backward (sketched below).]
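The slides do not give the exact recurrence, so the sketch below is an assumed variant: suffix sums of average-case and worst-case remaining work are built in reverse order, and each task runs at the slower "expected" speed while never dropping below what keeps the worst case feasible:

```python
# Speed factors beta_i computed in reverse order over a frame [0, deadline].
# Assumes the task set is feasible at f_max.

def make_beta(avg, wcet, deadline, f_max=1.0):
    n = len(avg)
    avg_rem = [0.0] * (n + 1)   # expected remaining work from task i on
    wc_rem = [0.0] * (n + 1)    # worst-case remaining work from task i on
    for i in range(n - 1, -1, -1):                  # reverse order
        avg_rem[i] = avg[i] + avg_rem[i + 1]
        wc_rem[i] = wcet[i] + wc_rem[i + 1]

    def beta(i, now):
        """Speed for task i when it starts at time `now`."""
        left = deadline - now
        s_opt = avg_rem[i] / left                         # stretch expected work
        s_min = wcet[i] / (left - wc_rem[i + 1] / f_max)  # keep WCET feasible
        return min(f_max, max(s_opt, s_min))

    return beta

beta = make_beta(avg=[1, 1, 1, 1], wcet=[2, 2, 2, 2], deadline=10)
print(beta(0, 0.0))   # first task's speed factor -> 0.5
```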
7. Dynamic Speed Adjustment Techniques for Non-Linear Code
[Figure: control-flow graph with PMPs (power-management points) and branches taken with probabilities p1, p2, p3; the min, average, and max remaining execution times are computed at a PMP.]
- The remaining WCET is based on the longest path
- The remaining average-case execution time is based on the branching probabilities (from trace information), as sketched below
8. Who Should Manage?
- Compiler (knows the future better): static analysis
- OS (knows the past better): run-time information
9. Maximizing System Utility
(as opposed to minimizing energy consumption)
- Energy constraints
- Time constraints (deadlines or rates)
- System utility (reward)
  - Increased reward with increased execution
- Determine the appropriate versions to execute
- Determine the most rewarding subset of tasks to execute
10. Many Problem Formulations
- Continuous frequencies, continuous reward functions
- Discrete operating frequencies, no reward for partial execution
- Version programming as an alternative to the IRIS (IC) QoS model
- Optimal solutions
- Heuristics
- Example: for homogeneous power functions, reward is maximized when power is allocated equally to all tasks.
[Flowchart: add a task; if a constraint is violated, repair the schedule (see the sketch below); otherwise keep adding.]
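A sketch of the add-then-repair loop in the flowchart; the reward-density ordering and the drop-lowest-reward repair rule are assumptions, not from the slides:

```python
# Greedy heuristic: add tasks in order of reward density; whenever the
# energy or time budget is violated, repair by dropping the task with
# the lowest reward.

def build_schedule(tasks, energy_budget, time_budget):
    """tasks: list of (reward, energy, time) tuples."""
    chosen = []
    for t in sorted(tasks, key=lambda t: t[0] / (t[1] + t[2]), reverse=True):
        chosen.append(t)
        while (sum(e for _, e, _ in chosen) > energy_budget or
               sum(d for _, _, d in chosen) > time_budget):
            chosen.remove(min(chosen, key=lambda t: t[0]))  # repair step
    return chosen

tasks = [(10, 3, 2), (6, 1, 1), (4, 2, 2)]
print(build_schedule(tasks, energy_budget=4, time_budget=4))
```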
11. Rechargeable Systems
(additional constraints on energy and power)
[Figure: available power over time is split between consumption and recharging the battery; power intervals split and merge while the system stays schedulable.]
Example:
- Solar panel (needs light)
- Tasks are continuously executed
- Keep the battery level above a threshold at all times (see the sketch below)
- Frame-based system
- Three dynamic policies (greedy, speculative, and proportional)
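A sketch of a frame-based policy that keeps the battery above the threshold; the linear battery model, the speed set, and the power function are illustrative assumptions:

```python
# Before each frame, pick the highest speed whose energy draw keeps the
# projected battery level above the threshold; harvested (solar) energy
# is credited as it arrives.

def pick_speed(battery, harvest, threshold, work, speeds, power):
    """Return the fastest speed that keeps the battery above threshold."""
    for f in sorted(speeds, reverse=True):
        energy = power(f) * (work / f)            # draw over the frame
        if battery + harvest - energy >= threshold:
            return f
    return min(speeds)                            # degrade gracefully

battery, threshold = 5.0, 4.0
for harvest in (3.0, 0.5, 0.0):                   # per-frame solar input
    f = pick_speed(battery, harvest, threshold, work=1.0,
                   speeds=[0.5, 0.75, 1.0], power=lambda f: 2 * f**3)
    battery += harvest - 2 * f**3 * (1.0 / f)
    print(f"speed={f}, battery={battery:.2f}")    # slows when light fades
```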
12. Multiprocessing Systems
13. Scheduling Policy
- Partition tasks to processors
  - Each processor applies PM individually
  - Distributed
- Global management
  - Shared memory
14. Dynamic Power Management
- Greedy
  - Any available slack is given to the next ready task
  - Feasible for single-processor systems
  - Fails for multiprocessor systems (slack reclaimed on one processor can delay tasks that other processors wait on)
15. Streaming Applications
- Streaming applications are prevalent
  - Audio, video, real-time tasks, cognitive applications
- Executing on
  - Servers, embedded systems
  - Multiprocessors and processor clusters
  - Chip multiprocessors: TRIPS, RAW, etc.
- Constraints
  - Interarrival time (T)
  - End-to-end delay (D)
- Two possible strategies
  - Master-slave
  - Pipelining
[Figure: a stream of frames with interarrival time T and end-to-end delay D.]
16. Master-Slave Strategy
- Single streaming application
  - The optimal number, n, of active PEs strikes a balance between static and dynamic power (see the sketch below)
  - Given n, the speed of each PE is chosen to minimize energy consumption
- Multiple streaming applications
  - Determine the optimal number of active PEs
  - Given the number of active PEs:
    - First assign streams to groups of PEs (e.g., balance the load using the minimum-span algorithm)
    - Adjust the speed of each PE to minimize energy
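A sketch of that static-vs-dynamic balance for a single stream; the power model (per-PE static power plus cubic dynamic power) and all constants are illustrative assumptions:

```python
# With workload W per period T spread over n slaves, each PE can run at
# speed W/(n*T): more PEs lower the (cubic) dynamic energy, but each one
# adds static power, so an intermediate n minimizes total energy.

def energy(n, W, T, p_static=0.1, c=1.0, f_max=1.0):
    f = W / (n * T)                       # per-PE speed
    if f > f_max:
        return float("inf")               # infeasible with only n PEs
    return n * (p_static + c * f**3) * T  # static + dynamic, one period

W, T = 2.0, 1.0
best = min(range(1, 17), key=lambda n: energy(n, W, T))
print(best, energy(best, W, T))           # optimum is neither 2 nor 16
```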
17. Pipeline Strategy
- (1) Linear pipeline (# of stages = # of PEs)
18. Pipeline Strategy (cont.)
- (2) Linear pipeline (# of stages = # of PEs)
  - Solution 1 (optimal): discretize time and use dynamic programming (sketched below)
  - Solution 2: use heuristics
- (3) Nonlinear pipeline (# of stages = # of PEs)
  - Formulate an optimization problem with multiple sets of constraints, each corresponding to a linear pipeline
  - Problem: the number of constraints can be exponential
  - Solution: add variables denoting the finishing time of each stage
- (4) # of stages > # of PEs
  - Apply partitioning/mapping first and then do power management
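A sketch of Solution 1 for the linear pipeline: discretize the end-to-end delay into ticks and run a dynamic program over the stages; the energy model and the per-stage cap derived from T are assumptions:

```python
# Split the end-to-end delay D into Q ticks; each stage gets a whole
# number of ticks, capped by the period T (so throughput is sustained).
# The DP minimizes total energy, modeled as E_i = w_i * (w_i/t_i)^2.

def pipeline_dp(work, D, T, Q=100):
    tick = D / Q
    cap = int(T / tick)                           # per-stage cap from T
    INF = float("inf")
    best = [0.0] + [INF] * Q                      # best[q]: energy using q ticks
    for w in work:                                # add one stage at a time
        new = [INF] * (Q + 1)
        for q in range(1, Q + 1):
            for t in range(1, min(cap, q) + 1):   # ticks for this stage
                e = best[q - t] + w * (w / (t * tick)) ** 2
                new[q] = min(new[q], e)
        best = new
    return min(best)                              # min energy within delay D

print(pipeline_dp(work=[1.0, 2.0, 1.5], D=6.0, T=2.5))
```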
19. Scheduling onto a 2-D Processor Array (CMP)
[Figure: a task graph with nodes A through J mapped onto the processor array.]
- Step 1: topological-sort-based morphing
- Step 2: a dynamic-programming approach to find the optimal # of stages and the optimal # of processors for each stage
20. Tradeoff: Energy vs. Dependability
21. Time Slack (unused processor capacity)
Time slack can be used in three ways:
- Use it to reduce speed → power management (note the effect of DVS on reliability)
- Use it for redundancy (space or time) → fault tolerance
- Use it to do more work → increased productivity
22. Exploring Time Redundancy
- The slack is used to (1) add checkpoints, (2) reserve recovery time, and (3) reduce the processing speed.
- For a given slack and checkpoint overhead, we can find the number and placement of checkpoints that minimize energy consumption while guaranteeing recovery and timeliness (see the sketch below).
[Plot: energy vs. # of checkpoints.]
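A sketch of the tradeoff behind that plot, assuming n equally spaced checkpoints of overhead r each and a recovery reserve of one segment (w/n); under that model the overhead-plus-reserve term n*r + w/n is minimized near n = sqrt(w/r):

```python
# More checkpoints shrink the re-execution reserve (w/n) but add
# overhead (n*r); the slack left over lets the speed (and energy) drop.

from math import sqrt

def speed(n, w, D, r):
    """Lowest feasible speed with n checkpoints (inf if infeasible)."""
    usable = D - n * r - w / n        # time left after overhead + reserve
    return w / usable if usable > 0 else float("inf")

w, D, r = 4.0, 8.0, 0.25              # work, deadline, checkpoint overhead
n_star = round(sqrt(w / r))           # minimizes n*r + w/n  -> 4
for n in (1, 2, n_star, 8):
    print(n, f"{speed(n, w, D, r):.3f}")   # speed (hence energy) dips at n*
```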
23. TMR vs. Duplex
[Plot: energy-efficient operating regions in the (r, p) plane, where r is the checkpoint overhead and p is the ratio of static to dynamic power, shown for loads 0.5, 0.6, and 0.7 (region boundaries near r = 0.02-0.035 and p = 0.1-0.2); TMR is more energy-efficient in one region, Duplex in the other.]
- Identified energy-efficient operating regions for TMR and Duplex.
24. Effect of DVS on SEU Rate
- Lower voltages → higher fault rate
- Lower speed → less slack for recovery
- The reliability requirement, the fault model, and the available slack together determine the acceptable level of DVS.
25. Near-Memory Caching for Improved Energy Efficiency
26. Near-CPU vs. Near-Memory Caches
[Figure: memory hierarchy with a cache near the CPU and a cache near main memory.]
- Caching masks memory delays
- Where should the cache go: near the CPU or near the memory?
- Which is more power- and performance-efficient?
- Thesis: the allocation between the two must be balanced for better delay and energy.
27. Near-Memory Caching: Cached DRAM (CDRAM)
- On-memory SRAM cache
  - Accessing the fast SRAM cache → improves performance
  - High internal bandwidth → large block sizes can be used
  - Improves performance but consumes more energy
- Same configuration as in Hsu et al., 2003
28. Power-Aware CDRAM
- Power management in near-memory caches
  - Use distributed near-memory caches
  - Choose an adequate cache configuration to reduce miss rate and energy per access
- Power management in the DRAM core
  - Use a moderately sized SRAM cache
  - Turn the DRAM core to a low-power state
  - Use immediate shutdown
- Near-memory versus DRAM energy
  - Tradeoff: cache block size
29. Wireless Networks
Collaborators: Daniel Mosse. PhD student: Sameh Gobrial.
30. Saving Power
- Transmit power is proportional to the square of the distance
- The closer the nodes, the less power is needed
- Power-Aware Routing (PARO) identifies new nodes between other nodes and re-routes packets through them to save energy (worked example below)
- Nodes decide to reduce/increase their transmit power
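A worked example of the saving, assuming the stated distance-squared power model with a relay at the midpoint; constants are illustrative:

```python
# PARO-style relaying: if transmit power grows as d^2, one relay halfway
# between sender and receiver halves the total transmit power.

def tx_power(d, k=1.0):
    return k * d * d                             # power ~ distance^2

d = 10.0
direct = tx_power(d)                             # C -> A directly
relayed = tx_power(d / 2) + tx_power(d / 2)      # C -> B -> A via midpoint
print(direct, relayed)                           # 100.0 vs 50.0
```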
31. Asymmetry in Transmit Power
- Instead of C sending directly to A, traffic can go through B
- This saves transmit power, but may cause some problems.
32. Problems Due to One-Way Links
- The collision-avoidance (RTS/CTS) scheme is impaired
  - Even across bidirectional links!
- Unreliable transmissions through one-way links
  - May need multi-hop ACKs at the data-link layer
- Link outage can be discovered only at downstream nodes
[Figure: nodes A, B, and C connected by a unidirectional link.]
33. Problems for Routing Protocols
- Route discovery mechanism
  - Cannot reply using the inverse path of the route request
  - Need to identify unidirectional links (AODV)
- Route maintenance
  - Needs an explicit neighbor-discovery mechanism
- Connectivity of the network
  - Gets worse (partitions!) if only bidirectional links are used
34. Wireless Bandwidth and Power Savings
- In addition to transmit power, what else can we do to save energy?
- Power has a direct relation to the signal-to-noise ratio (SNR)
  - The higher the power, the stronger the signal, the fewer the errors, and the more data a node can transmit
  - Increasing the power allows for higher bandwidth
- Turn transceivers off when not in use; this creates problems when a node needs to relay messages for other nodes
35. Using Optical Interconnections in Supercomputers
Collaborators: Alex Jones. PhD students: Ding Zhu, Dan Li (now doing AI), Shuyi Shao.
36. Motivation for Using Optical Circuit Switching (OCS) in Supercomputers
- Many HPCS applications have only a small degree (6-10) of high-bandwidth communication among processes/threads
  - The rest of a thread's or process's communication traffic consists of low-bandwidth exceptions
- Many HPCS applications have persistent communication patterns
  - Fixed over the program's entire run time, or slowly changing
  - But there are badly behaved applications, or phases within applications, that are chaotic (GUPS!)
- Optics is good for high bandwidth but bad for fast switching; electronics is the other way around, and is good for processing (collectives)
  - The two networks are needed to complement each other
37. The OCS Network Fabric
- Two networks complement each other:
  - Circuit-switched, all-optical fat trees built from 512x512 MEMS-based optical switches (one of multiple fat-tree networks)
  - An intelligent network (1/10 or less of the bandwidth), including collective communication
- A storage/IO network
[Figure: PERCS D-blocks connected through the OCS by multiple parallel fat-tree networks.]
38. Communication Pattern: AMR CTH (node 48)
[Figure: communication phases for node 48; the pattern changes in phases lasting tens of seconds, e.g., a 250-second phase.]
39. UMT2K: Fixed, Irregular Communication Pattern
[Figure: communication matrix and percentage of traffic by bandwidth. The maximum communication degree from each node to any other node is about 10; the pattern is irregular but fixed.]
40. Handling HPCS Applications in OCS
Communication is handled according to its predictability and temporal locality:
- Statically analyzable at compile time → compiled communication
- Predictable at run time → run-time predictor
- Unpredictable → multiple hops through the OCS, or the intelligent network
NOTE: No changes to the application's code; the OCS is set up by the compiler, run-time auto-prediction, and multi-hop routing.
41. Paradigm of Compiled Communication
[Diagram: the compiler takes an MPI application and emits (a) MPI trace code, run on HPC systems to collect traces that yield communication patterns, and (b) optimized MPI code enhanced with network-configuration instructions; the enhanced MPI code and network configurations drive compiled communication and the run-time predictor on HPC systems (or a simulator), producing performance statistics.]
42. Compilation Framework
- Compiler
  - Recognize and represent communication patterns
- Communication compiling
  - Enhance applications with network-configuration instructions
- Automate trace generation
43. Communication Pattern
- Communication classification
  - Static
  - Persistent
  - Dynamic
- Executions of parallel applications exhibit phases (communication phases)
44. The Communication Predictor
- Initially, set up the OCS for random traffic
- Keep track of connection utilization
- A migration policy to create circuits in the OCS
  - A simple threshold policy (sketched below)
  - An intelligent pattern predictor
- An evacuation policy to remove circuits from the OCS
  - LRU replacement
  - A compiler-inserted directive
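A sketch combining the simple threshold migration policy with LRU evacuation; the class, counters, and threshold value are illustrative assumptions:

```python
# Per-destination traffic counters earn a circuit once they cross a
# threshold (migration); when all circuits are in use, the least
# recently used one is evacuated.

from collections import OrderedDict

class CircuitPredictor:
    def __init__(self, capacity, threshold):
        self.capacity = capacity          # circuits available
        self.threshold = threshold        # traffic needed to earn a circuit
        self.counts = {}                  # per-destination traffic counters
        self.circuits = OrderedDict()     # dst -> circuit, in LRU order

    def on_message(self, dst, size):
        if dst in self.circuits:
            self.circuits.move_to_end(dst)        # refresh LRU position
            return "circuit"
        self.counts[dst] = self.counts.get(dst, 0) + size
        if self.counts[dst] >= self.threshold:    # migration policy
            if len(self.circuits) >= self.capacity:
                self.circuits.popitem(last=False) # evacuate LRU circuit
            self.circuits[dst] = True
            self.counts[dst] = 0
            return "circuit-setup"
        return "packet-network"                   # low-BW traffic stays off OCS

p = CircuitPredictor(capacity=2, threshold=3)
for dst in ["n1", "n1", "n1", "n2", "n1"]:
    print(dst, p.on_message(dst, size=1))
```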
45. Dealing with Unpredictable Communications
- Set up the OCS planes so that any D-block can reach any other D-block with at most two hops through the network.
- Example: route from node 2 to node 4 (the second node in the second group).
46. Scheduling in Buffer-Limited Networks
Collaborators: Taieb Znati. PhD student: Mahmoud Elhaddad.
47. Packet-Switched Networks with Fixed-Size Buffers
- Packet routers connected via time-slotted, buffer-limited links
  - Packet duration is one slot
  - Packet buffers cannot be freely sized to prevent loss
    - All-optical packet routers
    - On-chip (line-driver chip) SRAM buffers
- Connections
  - Ingress-egress traffic aggregates
  - Fixed bandwidth demand
  - Each connection has a fixed path
- Loss rate of a connection
  - The fraction of lost packets
  - The goal is to guarantee a loss rate
  - The loss guarantee depends on the connection's path
48. Link Scheduling Algorithms
- Packet service discipline
  - FCFS, LIFO, Fixed Priority, Nearest-To-Go
- Drop policy
  - Drop tail, drop front, random drop, Furthest-To-Go
- Must be work-conserving
  - Drop excess packets only when the buffer overflows
  - Serve a packet in every slot as long as the buffer is not empty
- Must use only local information
  - No hints or coordination between routers
49. Link Scheduling in Buffer-Limited Networks
- Problem
  - Minimize the guaranteed loss rate for every connection
- Key question: is there a class of algorithms that leads to better loss bounds as a function of utilization and path length?
[Figure: FCFS scheduling with drop-tail vs. the proposed Rolling Priority scheduling.]
50. Link Scheduling in Buffer-Limited Networks: Findings
- A local fairness property is necessary to minimize the guaranteed loss rate for every path length and utilization constraint
  - FCFS/RD (Random Drop) is locally fair
- Rolling Priority, a locally fair algorithm, improves the loss guarantees compared to FCFS/RD and is simple to implement
  - Rolling Priority is optimal
  - FCFS/RD is near-optimal at light load
51. Rolling Priority
- Time is divided into epochs of fixed duration nT
- Connection initialization
  - The ingress chooses a phase at random from the duration of an epoch
  - At t_offset, the ingress sends an init packet along the connection's path
  - Init packets are rare and never dropped
- At every link, a new epoch starts periodically
- At each time slot, every link gives higher priority to the connection whose current epoch started earlier (sketched below)
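A sketch of that priority rule at a single link; the per-connection epoch bookkeeping is simplified and the data structures are illustrative assumptions:

```python
# Each connection's epochs roll with fixed duration from its random
# initial phase; in each slot the link serves the queued connection
# whose *current* epoch started earliest.

import random

class Connection:
    def __init__(self, epoch_len):
        self.epoch_len = epoch_len
        self.phase = random.uniform(0, epoch_len)   # chosen at the ingress

    def epoch_start(self, now):
        """Start time of the epoch this connection is currently in."""
        return self.phase + ((now - self.phase) // self.epoch_len) * self.epoch_len

def serve(queued, now):
    """Pick the queued connection with the earliest current epoch start."""
    return min(queued, key=lambda c: c.epoch_start(now))

random.seed(1)
conns = [Connection(epoch_len=8.0) for _ in range(3)]
winner = serve(conns, now=20.0)
print([round(c.epoch_start(20.0), 2) for c in conns], conns.index(winner))
```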
52. Roaming Honeypots for Mitigating Denial-of-Service Attacks
Collaborators: Daniel Mosse, Taieb Znati. PhD student: Sherif Khattab.
53. Denial-of-Service (DoS) Attacks
- DoS attacks aim at disrupting the legitimate utilization of network and server resources.
54. [Figure: clients send requests to servers; under attack, legitimate requests are dropped.]
55. Packet Filtering
- Not scalable: filter state grows with the number of users.
56. Packet Filtering
- Filtering attackers instead is more scalable.
57. Roaming Honeypots: Basic Idea
[Figure: attacker A1's requests arrive at a pool of servers, some of which act as honeypots.]
58. Roaming Honeypots: Basic Idea (cont.)
[Figure: the honeypot role roams among the servers; attacker A1's requests now land on honeypots and are filtered (selection sketched below).]
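A sketch of one way the roaming set could be derived so that legitimate clients (sharing a secret) can avoid the honeypots while attackers cannot; the keyed-hash construction is an illustrative assumption, not necessarily the scheme's specified mechanism:

```python
# In each time epoch, a pseudo-random subset of the servers acts as
# honeypots, derived deterministically from a shared secret; traffic
# that still hits a honeypot exposes itself as attack traffic.

import hashlib

def honeypots(servers, epoch, key, k):
    """Deterministically pick k honeypots for this epoch from the key."""
    def rank(s):
        h = hashlib.sha256(f"{key}:{epoch}:{s}".encode()).hexdigest()
        return int(h, 16)
    return sorted(servers, key=rank)[:k]

servers = [f"srv{i}" for i in range(6)]
for epoch in range(3):                       # the honeypot set roams
    print(epoch, honeypots(servers, epoch, key="shared-secret", k=2))
```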
59. Effect of Attack Load
- With roaming honeypots, the service exhibits a stable average response time even in the presence of attacks of increasing intensity.