Title: Tracing Hadoop
1. Tracing Hadoop
- Andy Konwinski, Matei Zaharia, Randy Katz, Ion Stoica
2. Objectives
- Monitor, debug, and profile applications written using the Hadoop framework (distributed file system + MapReduce)
- Detect problems automatically from traces
3. Approach
- Instrument Hadoop using X-Trace
- Analyze traces in a web-based UI
- Detect problems using machine learning
- Identify bugs in new versions of software from behavior anomalies
- Identify nodes with faulty hardware from performance anomalies
4. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
5. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
6. Architecture
Diagram: each Hadoop node (one master, several slaves) runs an X-Trace front end (FE); the front ends forward reports over TCP to the X-Trace backend, which stores traces in BerkeleyDB; the trace analysis web UI, the user, and the fault detection programs talk to the backend over HTTP.
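The record below is a hypothetical sketch of what the front ends might forward to the backend, not X-Trace's actual wire format: it assumes each instrumented operation emits a report carrying a job-wide task ID, its own operation ID, and edges to its causal parents, from which the backend can rebuild the event graph.

```python
# Hypothetical sketch of an X-Trace-style report record (NOT the real
# X-Trace wire format). Each instrumented operation emits one report;
# reports sharing a task_id form the event graph for one traced job.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Report:
    task_id: str        # shared by every event in one traced job
    op_id: str          # unique to this event
    label: str          # human-readable event name
    parents: list = field(default_factory=list)   # op_ids of causal parents
    timestamp: float = field(default_factory=time.time)

def new_report(task_id, label, parents=()):
    """Create one report; in the real system it would be sent to the backend."""
    return Report(task_id, uuid.uuid4().hex, label, list(parents))

# A master event followed by a slave event it caused (names are illustrative):
job = uuid.uuid4().hex
start = new_report(job, "JobTracker.submitJob")
map_task = new_report(job, "TaskTracker.launchMap", parents=[start.op_id])
```

The backend can then group reports by `task_id` and follow the `parents` edges to reconstruct causality across machines.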
7. Trace Analysis UI
- Web-based
- Provides:
- Performance statistics for RPC and DFS ops
- Graphs of utilization, performance versus various factors, etc.
- Critical path analysis / breakdown of the slowest map and reduce tasks
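A minimal sketch of what a critical-path breakdown could compute, assuming trace events form a DAG of (duration, parents) records; the event names and durations here are illustrative, not the UI's actual data model:

```python
# Find the causally-ordered chain of events with the largest total
# duration (the critical path) in a trace DAG.
def critical_path(events):
    """events: {op_id: (duration, [parent op_ids])}.
    Returns (total duration, [op_ids along the critical path])."""
    best = {}  # op_id -> (cost of heaviest path ending here, predecessor)

    def cost(op):
        if op not in best:
            dur, parents = events[op]
            if parents:
                pred = max(parents, key=lambda p: cost(p)[0])
                best[op] = (dur + cost(pred)[0], pred)
            else:
                best[op] = (dur, None)
        return best[op]

    end = max(events, key=lambda op: cost(op)[0])
    path, node = [], end
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[end][0], path[::-1]

# Toy trace of one reduce task: the shuffle dominates.
events = {
    "map": (3.0, []),
    "shuffle": (10.0, ["map"]),
    "sort": (2.0, ["shuffle"]),
    "reduce": (4.0, ["sort"]),
}
total, path = critical_path(events)  # 19.0, map -> shuffle -> sort -> reduce
```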
8. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
9. Optimizing Job Performance
- Examined performance of the Apache Nutch web indexing engine on a Wikipedia crawl
- How long should creating an inverted link index of a 50 GB crawl take?
- With the default configuration: 2 hours
- With an optimized configuration: 7 minutes
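A toy back-of-the-envelope model (not from the measurements above) of why reducer count matters this much: reduce work divides across min(reducers, free slots), so a single reduce task fully serializes that phase. All numbers below are invented for illustration.

```python
# Toy model: reduce-phase time when `total_work` units are split evenly
# across `reducers` tasks, scheduled in waves on `slots` parallel slots.
import math

def reduce_phase_time(total_work, reducers, slots):
    per_task = total_work / reducers
    waves = math.ceil(reducers / slots)   # how many rounds of scheduling
    return waves * per_task

# Hypothetical numbers: 100 units of reduce work, 50 available task slots.
t_one = reduce_phase_time(100, 1, 50)     # all work on one task
t_fifty = reduce_phase_time(100, 50, 50)  # one wave of 50 parallel tasks
```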
10. Optimizing Job Performance
Figure: machine utilization under the default configuration
11. Optimizing Job Performance
- Problem: one single Reduce task, which actually fails several times at the beginning
12. Optimizing Job Performance
Figure: active tasks vs. time with the improved configuration (50 reduce tasks instead of one)
13. Faulty Machine Detection
- Motivated by observing a slow machine, diagnosed to be a failing hard drive
14. Faulty Machine Detection
- Approach: compare the performance of each machine vs. the others in that trace
- 4 statistical tests:
- Welch's t-test
- Welch's t-test on ranks
- Wilcoxon rank-sum test
- Permutation test
- 3 significance levels
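Of the four tests, the permutation test is the easiest to sketch in a few lines of pure Python; the machine timings below are invented for illustration:

```python
# Permutation test: is one machine's mean operation time different from
# the rest of the cluster by more than chance relabeling would produce?
import random

def permutation_test(sample_a, sample_b, rounds=10_000, seed=0):
    """Approximate two-sided p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly relabel which times belong to which group
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / rounds

# Suspect machine's operation times vs. the rest (made-up milliseconds):
suspect = [120, 135, 150, 160, 145]
others = [40, 52, 45, 48, 55, 43, 50, 47]
p = permutation_test(suspect, others)
# flag the machine if p falls below the chosen significance level (0.01 or 0.05)
```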
15. Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
16. Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
Failing disk, significance 0.05
No failing disk, significance 0.05
18. Software Bug Detection
- Generate contingency tables containing counts of X-Trace graph features (e.g. adjacent events)
- Compare a test run against a set of good runs
- Scenarios:
- Simulated software bugs: random System.exit() calls
- Real bug in a previous version of Hadoop
- Statistical tests:
- Chi-squared (three variations)
- Naïve Bayes
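A hedged sketch of the contingency-table comparison using a plain Pearson chi-squared statistic (the slides mention three variations; this is just the basic form). The feature columns and counts below are invented:

```python
# Score a test run against pooled good runs: rows are run groups, columns
# are counts of graph features (e.g. adjacent event pairs). A large Pearson
# chi-squared statistic suggests the test run's behavior is anomalous.
def chi_squared(table):
    """table: list of rows of counts. Returns the Pearson X^2 statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            x2 += (observed - expected) ** 2 / expected
    return x2

good = [400, 380, 210]   # feature counts pooled over known-good runs
test = [395, 150, 205]   # middle feature is depressed -- a possible bug
score = chi_squared([good, test])
# compare `score` to a chi-squared critical value for the table's degrees
# of freedom at the chosen significance level
```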
19. Software Bug Detection
Figure: sample function contingency table (counts of pairs of adjacent function calls)
20. Software Bug Detection
Figure: fault detection rates using the event contingency table
22. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
23. Findings about Hadoop
- Hardcoded timeouts may be inappropriate in some cases (e.g. a long reduce task)
- Highly variable DFS performance under load, which can slow the entire job
24. Findings about Hadoop
Figure: variable DFS performance in a typical Hadoop job
25. Overhead of X-Trace
- Negligible overhead in our test clusters
- Partly because Hadoop operations are large, so tracing is invoked infrequently
- E.g. an 18-minute Apache Nutch indexing job with 390 tasks generates 50,000 reports (20 MB of text data)
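Quick arithmetic on those figures helps gauge the tracing load; the per-second and per-report numbers are derived here, not stated on the slides (MB taken as 10^6 bytes):

```python
# Back-of-the-envelope check on the reported tracing volume.
reports = 50_000
job_seconds = 18 * 60        # 18-minute job
data_bytes = 20 * 10**6      # ~20 MB of report text

reports_per_second = reports / job_seconds  # ~46 reports/s cluster-wide
bytes_per_report = data_bytes / reports     # ~400 bytes per report
```

At a few dozen small reports per second across a whole cluster, it is plausible that tracing cost stays in the noise.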
26. Conclusion
- Successfully used traces of Hadoop to find interesting behavior, with low overhead
- Detected software and hardware errors with high accuracy and few false positives
- Improved X-Trace scalability
27. Future Work
- Larger-scale tracing
- Observe Hadoop at data center scale (EC2?)
- Observe real-world hardware failures
- Many opportunities for machine learning
- More advanced bug detection
- Failure diagnosis
- Performance modeling
- Correlate with telemetry data
- Integrate tracing into open-source Hadoop
- Trace younger, buggier software projects
28. Questions?
29. Findings about Hadoop
Figure: unusual run with very slow small reads at the start of the job
30. Optimizing Job Performance
Figure: breakdown of the longest map with 3x more mappers
- Tracing also illustrates the inefficiency of having too many mappers and reducers
31. Raw Event Graphs
Figure: successful DFS write
32. Raw Event Graphs
Figure: failed DFS write
33. Related Work
- Per-machine logs
- Active tracing: log4j, etc.
- Passive tracing: DTrace, strace
- Log aggregation tools: Sawzall, Pig
- Cluster monitoring tools: Ganglia