Title: Tracing Hadoop
1. Tracing Hadoop
- Andy Konwinski, Matei Zaharia, Randy Katz, Ion Stoica
2. Objectives
- Monitor, debug, and profile applications written using the Hadoop framework (distributed file system + MapReduce)
- Detect problems automatically from traces
3. Approach
- Instrument Hadoop using X-Trace
- Analyze traces in a web-based UI
- Detect problems using machine learning
- Identify bugs in new versions of software from behavior anomalies
- Identify nodes with faulty hardware from performance anomalies
4. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
5. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
6. Architecture
Diagram: each Hadoop node (one master, several slaves) runs an X-Trace front end (FE); the front ends forward reports over TCP to the X-Trace backend, which stores traces in BerkeleyDB; the trace analysis web UI, the user, and the fault detection programs talk to the backend over HTTP.
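The record below is a hypothetical sketch of what the front ends might forward to the backend, not X-Trace's actual wire format: it assumes each instrumented operation emits a report carrying a job-wide task ID, its own operation ID, and edges to its causal parents, from which the backend can rebuild the event graph.

```python
# Hypothetical sketch of an X-Trace-style report record (NOT the real
# X-Trace wire format). Each instrumented operation emits one report;
# reports sharing a task_id form the event graph for one traced job.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Report:
    task_id: str        # shared by every event in one traced job
    op_id: str          # unique to this event
    label: str          # human-readable event name
    parents: list = field(default_factory=list)   # op_ids of causal parents
    timestamp: float = field(default_factory=time.time)

def new_report(task_id, label, parents=()):
    """Create one report; in the real system it would be sent to the backend."""
    return Report(task_id, uuid.uuid4().hex, label, list(parents))

# A master event followed by a slave event it caused (names are illustrative):
job = uuid.uuid4().hex
start = new_report(job, "JobTracker.submitJob")
map_task = new_report(job, "TaskTracker.launchMap", parents=[start.op_id])
```

The backend can then group reports by `task_id` and follow the `parents` edges to reconstruct causality across machines.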
7. Trace Analysis UI
- Web-based
- Provides:
- Performance statistics for RPC and DFS ops
- Graphs of utilization, performance versus various factors, etc.
- Critical path analysis / breakdown of the slowest map and reduce tasks
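A minimal sketch of what a critical-path breakdown could compute, assuming trace events form a DAG of (duration, parents) records; the event names and durations here are illustrative, not the UI's actual data model:

```python
# Find the causally-ordered chain of events with the largest total
# duration (the critical path) in a trace DAG.
def critical_path(events):
    """events: {op_id: (duration, [parent op_ids])}.
    Returns (total duration, [op_ids along the critical path])."""
    best = {}  # op_id -> (cost of heaviest path ending here, predecessor)

    def cost(op):
        if op not in best:
            dur, parents = events[op]
            if parents:
                pred = max(parents, key=lambda p: cost(p)[0])
                best[op] = (dur + cost(pred)[0], pred)
            else:
                best[op] = (dur, None)
        return best[op]

    end = max(events, key=lambda op: cost(op)[0])
    path, node = [], end
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[end][0], path[::-1]

# Toy trace of one reduce task: the shuffle dominates.
events = {
    "map": (3.0, []),
    "shuffle": (10.0, ["map"]),
    "sort": (2.0, ["shuffle"]),
    "reduce": (4.0, ["sort"]),
}
total, path = critical_path(events)  # 19.0, map -> shuffle -> sort -> reduce
```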
8. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
9. Optimizing Job Performance
- Examined performance of the Apache Nutch web indexing engine on a Wikipedia crawl
- How long should creating an inverted link index of a 50 GB crawl take?
- With the default configuration: 2 hours
- With an optimized configuration: 7 minutes
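A toy back-of-the-envelope model (not from the measurements above) of why reducer count matters this much: reduce work divides across min(reducers, free slots), so a single reduce task fully serializes that phase. All numbers below are invented for illustration.

```python
# Toy model: reduce-phase time when `total_work` units are split evenly
# across `reducers` tasks, scheduled in waves on `slots` parallel slots.
import math

def reduce_phase_time(total_work, reducers, slots):
    per_task = total_work / reducers
    waves = math.ceil(reducers / slots)   # how many rounds of scheduling
    return waves * per_task

# Hypothetical numbers: 100 units of reduce work, 50 available task slots.
t_one = reduce_phase_time(100, 1, 50)     # all work on one task
t_fifty = reduce_phase_time(100, 50, 50)  # one wave of 50 parallel tasks
```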
10. Optimizing Job Performance
Figure: machine utilization under the default configuration
11. Optimizing Job Performance
- Problem: one single Reduce task, which actually fails several times at the beginning
12. Optimizing Job Performance
Figure: active tasks vs. time with the improved configuration (50 reduce tasks instead of one)
13. Faulty Machine Detection
- Motivated by observing a slow machine, diagnosed to be a failing hard drive
14. Faulty Machine Detection
- Approach: compare the performance of each machine vs. the others in that trace
- 4 statistical tests:
- Welch's t-test
- Welch's t-test on ranks
- Wilcoxon rank-sum test
- Permutation test
- 3 significance levels
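Of the four tests, the permutation test is the easiest to sketch in a few lines of pure Python; the machine timings below are invented for illustration:

```python
# Permutation test: is one machine's mean operation time different from
# the rest of the cluster by more than chance relabeling would produce?
import random

def permutation_test(sample_a, sample_b, rounds=10_000, seed=0):
    """Approximate two-sided p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly relabel which times belong to which group
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / rounds

# Suspect machine's operation times vs. the rest (made-up milliseconds):
suspect = [120, 135, 150, 160, 145]
others = [40, 52, 45, 48, 55, 43, 50, 47]
p = permutation_test(suspect, others)
# flag the machine if p falls below the chosen significance level (0.01 or 0.05)
```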
15. Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
16. Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
Failing disk, significance 0.05
No failing disk, significance 0.05
18. Software Bug Detection
- Generate contingency tables containing counts of X-Trace graph features (e.g. adjacent events)
- Compare a test run against a set of good runs
- Scenarios:
- Simulated software bugs: random System.exit() calls
- Real bug in a previous version of Hadoop
- Statistical tests:
- Chi-squared (three variations)
- Naïve Bayes
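A hedged sketch of the contingency-table comparison using a plain Pearson chi-squared statistic (the slides mention three variations; this is just the basic form). The feature columns and counts below are invented:

```python
# Score a test run against pooled good runs: rows are run groups, columns
# are counts of graph features (e.g. adjacent event pairs). A large Pearson
# chi-squared statistic suggests the test run's behavior is anomalous.
def chi_squared(table):
    """table: list of rows of counts. Returns the Pearson X^2 statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            x2 += (observed - expected) ** 2 / expected
    return x2

good = [400, 380, 210]   # feature counts pooled over known-good runs
test = [395, 150, 205]   # middle feature is depressed -- a possible bug
score = chi_squared([good, test])
# compare `score` to a chi-squared critical value for the table's degrees
# of freedom at the chosen significance level
```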
19. Software Bug Detection
Figure: sample function contingency table (counts of pairs of adjacent function calls)
20. Software Bug Detection
Figure: fault detection rates using the event contingency table
22. Outline
- Introduction
- Architecture
- Trace analysis UI
- Applications
- Optimizing job performance
- Faulty machine detection
- Software bug detection
- Findings about Hadoop
- Overhead of X-Trace
- Conclusion
23. Findings about Hadoop
- Hardcoded timeouts may be inappropriate in some cases (e.g. a long reduce task)
- Highly variable DFS performance under load, which can slow the entire job
24. Findings about Hadoop
Figure: variable DFS performance in a typical Hadoop job
25. Overhead of X-Trace
- Negligible overhead in our test clusters
- Partly because Hadoop operations are large, so tracing is invoked infrequently
- E.g. an 18-minute Apache Nutch indexing job with 390 tasks generates 50,000 reports (20 MB of text data)
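Quick arithmetic on those figures helps gauge the tracing load; the per-second and per-report numbers are derived here, not stated on the slides (MB taken as 10^6 bytes):

```python
# Back-of-the-envelope check on the reported tracing volume.
reports = 50_000
job_seconds = 18 * 60        # 18-minute job
data_bytes = 20 * 10**6      # ~20 MB of report text

reports_per_second = reports / job_seconds  # ~46 reports/s cluster-wide
bytes_per_report = data_bytes / reports     # ~400 bytes per report
```

At a few dozen small reports per second across a whole cluster, it is plausible that tracing cost stays in the noise.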
26. Conclusion
- Successfully used traces of Hadoop to find interesting behavior, with low overhead
- Detected software and hardware errors with high accuracy and few false positives
- Improved X-Trace scalability
27. Future Work
- Larger-scale tracing
- Observe Hadoop at data center scale (EC2?)
- Observe real-world hardware failures
- Many opportunities for machine learning
- More advanced bug detection
- Failure diagnosis
- Performance modeling
- Correlate with telemetry data
- Integrate tracing into open-source Hadoop
- Trace younger, buggier software projects
28. Questions?
29. Findings about Hadoop
Figure: unusual run with very slow small reads at the start of the job
30. Optimizing Job Performance
Figure: breakdown of the longest map with 3x more mappers
- Tracing also illustrates the inefficiency of having too many mappers and reducers
31. Raw Event Graphs
Figure: successful DFS write
32. Raw Event Graphs
Figure: failed DFS write
33. Related Work
- Per-machine logs
- Active tracing: log4j, etc.
- Passive tracing: DTrace, strace
- Log aggregation tools: Sawzall, Pig
- Cluster monitoring tools: Ganglia