Title: Diagnosing and Debugging Wireless Sensor Networks
1. Diagnosing and Debugging Wireless Sensor Networks
- Eric Osterweil
- Nithya Ramanathan
2. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
3. What do apples, oranges, and peaches have in common?
- Well, they are all fruits, they all grow in groves of trees, etc.
- However, grapes are also fruits, but they grow on vines! :)
4. Defining the Problem
- Debugging: an iterative process of detecting and discovering the root cause of faults
- Distinct debugging phases:
- Pre-deployment
- During deployment
- Post-deployment
- Ongoing maintenance / performance analysis
- How is this different from debugging?
5. Characteristic Failures [1,2]
- Pre-deployment
- Bugs characteristic of wireless, embedded, and distributed platforms
- During deployment
- Not receiving data at the sink
- Neighbor density (or lack thereof)
- Badly placed nodes
- Flaky/variable link connectivity

[1] R. Szewczyk, J. Polastre, A. Mainwaring, D. Culler. "Lessons from a Sensor Network Expedition." In EWSN, 2004.
[2] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler. "Wireless Sensor Networks for Habitat Monitoring." In ACM International Workshop on Wireless Sensor Networks and Applications.
6. Characteristic Failures (continued)
- Post-Deployment
- Failed/rebooted nodes
- Funny nodes/sensors
- Batteries with low voltage levels
- Uncalibrated sensors
- Ongoing Maintenance / Performance
- Low bandwidth / dropped data from certain regions
- High power consumption
- Poor load-balancing, or high re-transmission rate
7. Scenarios
- You have just deployed a sensor network in the forest and are not getting data from any node. What do you do?
- You are getting wildly fluctuating averages from a region. Is this caused by:
- Actual environmental fluctuations?
- Bad sensors?
- Randomly dropped data?
- Calculation / algorithmic errors?
- Tampered nodes?
8. Challenges
- Existing tools fall short for sensor networks:
- Limited visibility
- Resource-constrained nodes (can't run gdb)
- Bugs characteristic of embedded, distributed, and wireless platforms
- Can't always use existing Internet fault-tolerance techniques (e.g., rebooting)
- Extracting debugging information:
- With minimal disturbance to the network
- Identifying information used to infer internal state
- Minimizing central processing
- Minimizing resource consumption
9. Challenges (continued)
- Applications behave differently in the field
- Testing configuration changes
- Can't easily log on to nodes
- Identifying performance-blocking bugs
- Can't continually monitor the network manually (often physically impossible, depending on the deployment environment)
10. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
11. What is Network Management?
- I don't have to know anything about my neighbor to count on them
12. Network Management
- Observing and tracking nodes
- Routers
- Switches
- Hosts
- Ensuring that nodes are providing connectivity
- i.e., doing their jobs
13. Problem
- Connectivity failures versus device failures
- Correlating outages with their cause(s)
14. Outage Example
15. Approach
- Polling
- ICMP
- SNMP
- Downstream event suppression (see the sketch after this list)
- If routing has failed, ignore events about downstream nodes
- Modeling
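To make downstream event suppression concrete, here is a minimal sketch in Python, assuming a simple parent-pointer topology; the node names and data structures are hypothetical, not drawn from any particular management product.

```python
# Sketch: suppress alarms for nodes sitting behind a failed device, so the
# operator only sees the root cause. Topology and names are hypothetical.

parents = {                # child -> next hop toward the monitoring station
    "host-a": "switch-1",
    "host-b": "switch-1",
    "switch-1": "router-1",
    "router-1": None,      # directly reachable from the monitoring station
}

failed = {"router-1"}      # devices that failed a poll (e.g., ICMP timeout)

def masked_by_upstream_failure(node: str) -> bool:
    """True if any hop between node and the monitoring station has failed."""
    hop = parents.get(node)
    while hop is not None:
        if hop in failed:
            return True
        hop = parents.get(hop)
    return False

def filter_alarms(alarms: list[str]) -> list[str]:
    """Keep only alarms that are not explained by an upstream outage."""
    return [a for a in alarms if not masked_by_upstream_failure(a)]

# router-1's own alarm survives; everything behind it is suppressed.
print(filter_alarms(["router-1", "switch-1", "host-a", "host-b"]))
```

The point is that one upstream failure explains many downstream symptoms, so only the first failed hop is worth reporting.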
16. Outage Example (2)
17. How does this area differ from WSNs?
18. Applied to WSNs
- Similarities
- Similar topologies
- Intersecting operations
- Network forwarding, routing, etc.
- Connectivity vs. device failures
- Differences
- Network links
- Topology dynamism
19. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
20. What is Parallel Processing?
- If one car is fast, are 1,000 cars 1,000 times
faster?
21. Parallel Processing
- Coordinating large sets of nodes
- Cluster sizes can range to the order of 10^4 nodes
- Knowing nodes' states
- Efficient resource allocation
- Low communication overhead
22. Problem
- Detecting faults
- Recovering from faults
- Reducing communication overhead
- Maintenance
- Software distribution, upgrades, etc.
23. Approach
- Low-overhead state checks (see the sketch after this list)
- ICMP
- UDP-based protocols and topology sensitivity
- Ganglia
- Process recovery
- Process checkpoints
- Condor
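Below is a minimal sketch of the low-overhead state-check idea, in the spirit of Ganglia's periodic UDP announcements; it is not Ganglia's actual wire protocol, and the JSON message format is invented.

```python
# Sketch: UDP heartbeats as a low-overhead cluster state check. In the
# spirit of Ganglia's periodic announcements, but NOT its wire protocol;
# the JSON message format here is invented.
import json
import socket
import time

PORT = 8649  # Ganglia's default gmond port, reused here for flavor only

def send_heartbeat(sock: socket.socket, node_id: str, dest: str) -> None:
    """One small datagram announces liveness plus a timestamp."""
    msg = json.dumps({"node": node_id, "ts": time.time()}).encode()
    sock.sendto(msg, (dest, PORT))

def stale_nodes(last_seen: dict[str, float], timeout: float = 30.0) -> list[str]:
    """Nodes whose latest heartbeat is older than `timeout` seconds."""
    now = time.time()
    return [n for n, ts in last_seen.items() if now - ts > timeout]

# Example: node-3 has not been heard from for a minute -> flagged.
print(stale_nodes({"node-1": time.time(), "node-3": time.time() - 60}))
```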
24. How does this area differ from WSNs?
25. Applied to WSNs
- Similarities
- Potentially large sets of nodes
- Relatively difficult to track state (due to resource constraints)
- Communication overheads are limiting
26. Applied to WSNs (continued)
- Differences
- Topology is more dynamic in WSNs
- Communications are more constrained
- Deployment is not structured around computation
- Energy is limiting, rather than computation overhead
- WSNs are much less latency-sensitive
27. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
28. What is Distributed Fault Tolerance?
- Put me in, coach... PUT ME IN!
29. Distributed Fault Tolerance
- High Availability is a broad category
- Hot backups (failover)
- Load balancing
- etc.
30. Problem(s)
- HA
- Track status of nodes
- Keeping access to critical resources available as much as possible
- Sacrificing hardware for low latency
- Load balancing
- Track status of nodes
- Keeping load even
31. Approach
- HA
- High-frequency / low-latency heartbeats
- Failover techniques
- Virtual interfaces
- Shared volume mounting
- Load balancing
- Metric selection (round robin, least connections, etc.; see the sketch below)
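The load-balancing metrics can be sketched in a few lines; this is an illustrative model only (a real balancer tracks connection counts from live state), and the backend names are hypothetical.

```python
# Sketch: two common load-balancing metrics. Illustrative only; a real
# balancer updates connection counts from live state. Names are invented.
from itertools import cycle

backends = ["node-a", "node-b", "node-c"]

rr = cycle(backends)                  # round robin: rotate through the pool
def pick_round_robin() -> str:
    return next(rr)

open_conns = {"node-a": 12, "node-b": 3, "node-c": 7}
def pick_least_connections() -> str:
    return min(open_conns, key=open_conns.get)  # fewest open connections wins

print(pick_round_robin())         # node-a
print(pick_least_connections())   # node-b
```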
32. How does this area differ from WSNs?
33. Applied to WSNs
- HA / load balancing
- Similarities
- Redundant resources
- Differences
- Where to begin? MANY
34. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
35. What are WSNs?
- Warning: any semblance of an orderly system is purely coincidental
36. BluSH [1]
- Shell interface for Intel's iMotes
- Enables interactive debugging: one can walk up to a mote and access internal state

[1] Tom Schoellhammer
37. Sympathy [1,2]
- Aids in debugging
- Pre-, during, and post-deployment
- Nodes collect metrics, periodically broadcast to the sink (see the sketch below)
- Sink ensures "good qualities" specified by the programmer
- Based on metrics and other gathered information
- Faults are identified and categorized by metrics and tests
- Spatio-temporal correlation of distributed events to root-cause failures
- Test injection
- Proactively injects network probes to validate a fault hypothesis
- Triggers self-tests (internal actuation)

[1] N. Ramanathan, E. Kohler, D. Estrin. "Towards a Debugging System for Sensor Networks." International Journal of Network Management, 2005.
[2] N. Ramanathan, E. Kohler, L. Girod, D. Estrin. "Sympathy: A Debugging System for Sensor Networks." In Proceedings of the First IEEE Workshop on Embedded Networked Sensors, Tampa, Florida, USA, November 16, 2004.
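To illustrate the metric-driven flow, here is a minimal sketch of a sink-side check; the metric names, thresholds, and quality predicates below are invented for illustration and are not Sympathy's actual metric set.

```python
# Sketch of a Sympathy-style check at the sink: nodes report metrics,
# and the sink flags violations of programmer-specified "good qualities".
# Metric names and thresholds are invented, not Sympathy's actual set.

reports = {
    "node-7":  {"pkts_heard_at_sink": 48, "neighbors": 4, "uptime_s": 86400},
    "node-12": {"pkts_heard_at_sink": 0,  "neighbors": 0, "uptime_s": 55},
}

qualities = {
    "pkts_heard_at_sink": lambda v: v > 0,    # sink should be hearing the node
    "neighbors":          lambda v: v >= 1,   # node should have neighbors
    "uptime_s":           lambda v: v > 300,  # a recent reboot is suspicious
}

def violated(node: str) -> list[str]:
    """Names of the qualities this node's metrics fail."""
    return [q for q, ok in qualities.items() if not ok(reports[node][q])]

for node in reports:
    if violated(node):
        print(f"{node}: candidate fault, violates {violated(node)}")
```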
38. SNMS [1]
- Enables interactive health monitoring of a WSN in the field
- Three pieces:
- Parallel dissemination and collection
- Query system for exported attributes (see the sketch below)
- Logging system for asynchronous events
- Small footprint / low overhead
- Introduces overhead only when a human queries

[1] Gilman Tolle, David Culler. "Design of an Application-Cooperative Management System for WSN." In Second EWSN, Istanbul, Turkey, January 31 - February 2, 2005.
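A rough model of the exported-attribute query idea, for illustration only: SNMS itself runs on TinyOS motes, so this Python sketch with invented attribute names just shows the shape of the mechanism, including that cost is incurred only when someone asks.

```python
# Sketch of an SNMS-style attribute query. SNMS runs on TinyOS motes; this
# Python model (with invented attribute names) only shows the idea: a node
# exports named attributes and pays the read cost only when queried.

attributes = {
    "battery_mv": lambda: 2731,   # stand-in for an on-demand ADC read
    "parent_id":  lambda: 4,      # current routing parent
    "radio_rssi": lambda: -71,    # last observed signal strength
}

def query(names: list[str]) -> dict[str, int]:
    """Resolve only the requested attributes -- no overhead otherwise."""
    return {n: attributes[n]() for n in names if n in attributes}

print(query(["battery_mv", "parent_id"]))  # {'battery_mv': 2731, 'parent_id': 4}
```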
39. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
40. What is Calibration and Modeling?
- Hey, if you and I both think the answer is true, then who's to say we're wrong? :)
41. Modeling [1,2,3]
- Root-cause localization in large-scale systems
- The process of identifying the source of problems in a system using purely external observations
- Identify anomalous behavior based on externally observed metrics
- Statistical analysis and Bayesian networks are used to identify faults (see the sketch below)

[1] E. Kiciman, A. Fox. "Detecting Application-Level Failures in Component-Based Internet Services." IEEE Transactions on Neural Networks, Spring 2004.
[2] A. Fox, E. Kiciman, D. Patterson, M. Jordan, R. Katz. "Combining Statistical Monitoring and Predictable Recovery for Self-Management." In Proc. of the Workshop on Self-Managed Systems, Oct 2004.
[3] E. Kiciman, L. Subramanian. "Root Cause Localization in Large Scale Systems."
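As a toy stand-in for the statistical side (a plain z-score test; the cited papers use richer statistics and Bayesian networks), the sketch below flags components whose externally observed metric deviates sharply from the rest. The component names and latency values are invented.

```python
# Sketch: root-cause candidates from external observations alone, via a
# simple z-score test. The cited work uses richer statistics and Bayesian
# networks; component names and latencies here are invented.
import statistics

latency_ms = {
    "frontend": 21.0, "auth": 19.5, "catalog": 22.3,
    "checkout": 98.0, "search": 20.1,
}

def anomalies(metrics: dict[str, float], threshold: float = 1.5) -> list[str]:
    """Components more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(metrics.values())
    stdev = statistics.stdev(metrics.values())
    return [c for c, v in metrics.items() if abs(v - mean) / stdev > threshold]

print(anomalies(latency_ms))  # ['checkout']
```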
42. Calibration [1,2]
- Model physical phenomena in order to predict which sensors are faulty
- The model can be based on:
- The environment that is monitored, e.g., assume that the majority of sensors are providing correct data, and then identify sensors that make this model inconsistent [1]
- Assumptions about the environment, e.g., in a densely sampled area, values of neighboring sensors should be similar [2] (see the sketch below)
- Debugging can be viewed as sensor network system calibration
- Use system metrics instead of sensor data
- Based on a model of what metrics should look like in a properly behaving system, faulty behavior can be identified from inconsistent metrics
- Locating and using ground truth:
- In situ deployments
- Low communication/energy budgets
- Bias
- Noise

[1] Jessica Feng, S. Megerian, M. Potkonjak. "Model-Based Calibration for Sensor Networks." IEEE International Conference on Sensors, Oct 2003.
[2] Vladimir Bychkovskiy, Seapahn Megerian, et al. "A Collaborative Approach to In-Place Sensor Calibration."
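A minimal sketch of the neighbor-similarity assumption (illustrative only; not the algorithm of the cited calibration papers): compare each sensor's reading with the median of its neighbors' readings and flag large disagreements. The readings, adjacency, and threshold are all invented.

```python
# Sketch: under dense sampling, a sensor that disagrees sharply with the
# median of its neighbors is suspect. Illustrative only -- not the
# algorithm of the cited papers; all values here are invented.
import statistics

readings = {"s1": 19.8, "s2": 20.1, "s3": 34.6, "s4": 20.3}
neighbors = {                     # hypothetical adjacency from placement
    "s1": ["s2", "s4"],
    "s2": ["s1", "s3", "s4"],
    "s3": ["s1", "s2", "s4"],
    "s4": ["s1", "s2", "s3"],
}

def suspect_sensors(max_diff: float = 5.0) -> list[str]:
    """Sensors deviating from their neighbors' median by more than max_diff."""
    out = []
    for s, nbrs in neighbors.items():
        med = statistics.median(readings[n] for n in nbrs)
        if abs(readings[s] - med) > max_diff:
            out.append(s)
    return out

print(suspect_sensors())  # ['s3']
```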
43. Contents
- Introduction
- Network Management
- Parallel Processing
- Distributed Fault Tolerance
- WSNs
- Calibration / Model Based
- Conclusion
44. Promising Ideas
- Management by Delegation
- Naturally supports heterogeneous architectures by distributing control over the network
- Dynamically tasks/empowers less-capable nodes using mobile code
- AINs
- A node can monitor its own behavior and detect, diagnose, and repair issues
- Model-based fault detection
- Models of the physical environment
- Bayesian inference engines
45. Comparison
- Network Management
- Close, but includes some inflexible assumptions
- Parallel Processing
- Many similar, but divergent constraints
- Distributed Fault Tolerance
- Almost totally different
- WSNs
- New techniques emerging
- Calibration
- WSN related work becoming available
46. Conclusion
- Distributed debugging is as distributed debugging does [1]
- WSNs are a particular class of distributed system
- There are numerous techniques for distributed debugging
- Different conditions warrant different approaches
- OR different spins on existing techniques

[1] F. Gump et al.
47. References
- Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny. "Condor - A Distributed Job Scheduler." In Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN 0-262-69274-0.
- http://www.open.com/pdfs/alarmsuppression.pdf
- http://www.top500.org/
- D.E. Culler and J.P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999. ISBN 1-55860-343-3.
- Matthew L. Massie, Brent N. Chun, and David E. Culler. "The Ganglia Distributed Monitoring System: Design, Implementation, and Experience." Parallel Computing, Vol. 30, Issue 7, July 2004.
- "HA-OSCAR Release 1.0 Beta: Unleashing HA-Beowulf." 2nd Annual OSCAR Symposium, Winnipeg, Manitoba, Canada, May 2004.
48. Questions?