Tracing Hadoop (presentation transcript)
1
Tracing Hadoop
  • Andy Konwinski, Matei Zaharia, Randy Katz, Ion
    Stoica

2
Objectives
  • Monitor, debug, and profile applications written
    using the Hadoop framework (distributed file
    system and MapReduce)
  • Detect problems automatically from traces

3
Approach
  • Instrument Hadoop using X-Trace
  • Analyze traces in web-based UI
  • Detect problems using machine learning
  • Identify bugs in new versions of software from
    behavior anomalies
  • Identify nodes with faulty hardware from
    performance anomalies

4
Outline
  • Introduction
  • Architecture
  • Trace analysis UI
  • Applications
    • Optimizing job performance
    • Faulty machine detection
    • Software bug detection
  • Findings about Hadoop
  • Overhead of X-Trace
  • Conclusion


6
Architecture
(Diagram) The Hadoop master and each Hadoop slave run an
X-Trace front end (FE). The front ends send reports over TCP
to the X-Trace backend, which stores them in BerkeleyDB. The
trace analysis web UI and the fault detection programs query
the backend over HTTP, and users access the web UI over HTTP.
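A rough sketch of how X-Trace-style metadata ties events from many Hadoop processes into one trace graph. The names and report format below are illustrative, not the real X-Trace API: each report carries the trace's task ID, its own operation ID, and its parent operation ID, which is what lets the backend stitch reports into a graph.

```python
# Illustrative sketch of X-Trace-style trace metadata (NOT the real
# X-Trace API): each report names its task, operation, and parent
# operation, so the backend can rebuild the distributed event graph.
import itertools
import uuid

_op_ids = itertools.count(1)

def report(task_id, label, parent_op=None):
    """Build one trace report; a real front end would ship this over TCP."""
    return {"task": task_id, "op": next(_op_ids),
            "parent": parent_op, "label": label}

task = uuid.uuid4().hex  # one task ID spans the whole distributed job
r1 = report(task, "JobClient.submitJob")
r2 = report(task, "TaskTracker.launchTask", parent_op=r1["op"])
```

Because every report shares the task ID and names its parent operation, the backend can group reports by task and follow parent links to recover the DAG.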
7
Trace Analysis UI
  • Web-based
  • Provides:
    • Performance statistics for RPC and DFS operations
    • Graphs of utilization, performance versus various
      factors, etc.
    • Critical path analysis: breakdown of the slowest
      map and reduce tasks

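The critical path analysis above can be illustrated with a toy sketch: given a trace DAG of events with durations and parent links (the event names and numbers below are made up, not real Hadoop trace data), the critical path is the most expensive chain of dependent events.

```python
# Toy critical-path computation over a trace DAG (event names and
# durations are invented for illustration, not real Hadoop trace data).
trace = {
    # event: (duration_ms, parent events)
    "map_start":   (0,  []),
    "fetch_split": (10, ["map_start"]),
    "read_input":  (40, ["map_start", "fetch_split"]),
    "map_compute": (90, ["read_input"]),
    "spill":       (30, ["map_compute"]),
    "map_end":     (5,  ["spill"]),
}

def critical_path(trace, event):
    """Return (total_ms, path) for the most expensive chain ending at `event`."""
    duration, parents = trace[event]
    if not parents:
        return duration, [event]
    # Recurse into each parent and keep the costliest chain.
    cost, path = max(critical_path(trace, p) for p in parents)
    return cost + duration, path + [event]
```

Running `critical_path(trace, "map_end")` on this toy graph follows the `fetch_split` branch, since it is costlier than going straight from `map_start` to `read_input`.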
8
Outline
  • Introduction
  • Architecture
  • Trace analysis UI
  • Applications
    • Optimizing job performance
    • Faulty machine detection
    • Software bug detection
  • Findings about Hadoop
  • Overhead of X-Trace
  • Conclusion

9
Optimizing Job Performance
  • Examined the performance of the Apache Nutch web
    indexing engine on a Wikipedia crawl
  • How long should building an inverted link index
    of a 50 GB crawl take?
  • With the default configuration: 2 hours
  • With an optimized configuration: 7 minutes

10
Optimizing Job Performance
Machine utilization under default configuration
11
Optimizing Job Performance
Problem: a single reduce task, which actually
fails several times at the beginning
12
Optimizing Job Performance
Active tasks vs. time with improved
configuration (50 reduce tasks instead of one)
13
Faulty Machine Detection
  • Motivated by observing a slow machine

Diagnosed as a failing hard drive
14
Faulty Machine Detection
  • Approach: compare the performance of each machine
    against the others in the same trace
  • 4 statistical tests:
    • Welch's t-test
    • Welch's t-test on ranks
    • Wilcoxon rank-sum test
    • Permutation test
  • 3 significance levels

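The per-machine comparison can be sketched with the first of the four tests, Welch's t-test. This is a pure-stdlib illustration, not the authors' code, and it approximates the t distribution with a normal distribution when computing the p-value, which is reasonable for the large per-machine sample sizes a trace provides.

```python
# Per-machine anomaly check via Welch's t-test (a stdlib sketch, not the
# authors' implementation; the p-value uses a normal approximation to
# the t distribution, adequate for large samples).
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate two-sided p-value)."""
    m1, m2 = mean(sample_a), mean(sample_b)
    v1, v2 = variance(sample_a), variance(sample_b)
    se = sqrt(v1 / len(sample_a) + v2 / len(sample_b))
    t = (m1 - m2) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # normal approximation
    return t, p

def is_anomalous(machine_times, other_times, significance=0.01):
    """Flag a machine whose operation times differ from the rest."""
    _, p = welch_t(machine_times, other_times)
    return p < significance
```

A machine whose disk is failing would show operation times far from the cluster's, driving the p-value below the chosen significance level.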
15
Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
16
Faulty Machine Detection
Failing disk, significance 0.01
No failing disk, significance 0.01
Failing disk, significance 0.05
No failing disk, significance 0.05
18
Software Bug Detection
  • Generate contingency tables containing counts of
    X-Trace graph features (e.g. adjacent events)
  • Compare a test run against a set of known-good runs
  • Scenarios:
    • Simulated software bugs: random System.exit() calls
    • A real bug in a previous version of Hadoop
  • Statistical tests:
    • Chi-squared (three variations)
    • Naïve Bayes

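The chi-squared comparison can be sketched as follows (an illustration of one of the listed tests, not the authors' implementation): build expected feature counts from the known-good runs, scale them to the test run's size, and compute Pearson's chi-squared statistic; a large statistic marks the run as anomalous. The threshold here is a made-up placeholder.

```python
# Sketch of the chi-squared check on contingency tables (illustrative;
# the threshold is a placeholder, not a value from the paper).
from collections import Counter
from math import fsum

def expected_counts(good_runs, total):
    """Average feature frequencies over known-good runs, scaled to `total`."""
    pooled = Counter()
    for run in good_runs:
        pooled.update(run)
    pooled_total = sum(pooled.values())
    return {k: total * v / pooled_total for k, v in pooled.items()}

def chi_squared(observed, expected):
    """Pearson chi-squared statistic over features seen in the good runs."""
    return fsum((observed.get(k, 0) - e) ** 2 / e
                for k, e in expected.items() if e > 0)

def looks_buggy(test_run, good_runs, threshold=10.0):
    # threshold is hypothetical; a real deployment would calibrate it
    expected = expected_counts(good_runs, sum(test_run.values()))
    return chi_squared(test_run, expected) > threshold
```

A run whose feature mix matches the good runs scores near zero; a run that, say, never reaches a whole class of events (as after a random System.exit()) scores high.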
19
Software Bug Detection
Sample function contingency table (counts of
pairs of adjacent function calls)
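Building such a table from traced call sequences amounts to counting adjacent pairs; a minimal sketch (the path data is illustrative):

```python
# Tally (previous, next) event pairs across traced paths to form the
# contingency table of adjacent function calls (illustrative data).
from collections import Counter

def contingency_table(trace_paths):
    """Count adjacent event pairs along each traced path."""
    table = Counter()
    for path in trace_paths:
        table.update(zip(path, path[1:]))
    return table
```

For example, the path `["open", "read", "read", "close"]` yields one count each for `(open, read)`, `(read, read)`, and `(read, close)`.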
20
Software Bug Detection
Fault detection rates using the event contingency
table
22
Outline
  • Introduction
  • Architecture
  • Trace analysis UI
  • Applications
    • Optimizing job performance
    • Faulty machine detection
    • Software bug detection
  • Findings about Hadoop
  • Overhead of X-Trace
  • Conclusion

23
Findings about Hadoop
  • Hardcoded timeouts may be inappropriate in some
    cases (e.g. a long reduce task)
  • Highly variable DFS performance under load, which
    can slow the entire job

24
Findings about Hadoop
Variable DFS performance in a typical Hadoop job
25
Overhead of X-Trace
  • Negligible overhead in our test clusters
  • Partly because Hadoop operations are large, so
    tracing is invoked infrequently
  • E.g. an 18-minute Apache Nutch indexing job with
    390 tasks generates 50,000 reports (20 MB of
    text data)

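A back-of-the-envelope check of these numbers shows why the overhead is negligible:

```python
# Sanity-check the slide's figures: report rate and bytes per report.
reports, minutes, megabytes = 50_000, 18, 20
rate = reports / (minutes * 60)          # ~46 reports per second, cluster-wide
size = megabytes * 1_000_000 / reports   # 400 bytes per report
```

Roughly 46 reports per second across 390 tasks, at about 400 bytes each, is a trivial load compared with the data a MapReduce job moves.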
26
Conclusion
  • Successfully used traces of Hadoop to find
    interesting behavior, with low overhead
  • Detected software and hardware errors with high
    accuracy and few false positives
  • Improved X-Trace scalability

27
Future Work
  • Larger-scale tracing
    • Observe Hadoop at data center scale (EC2?)
    • Observe real-world hardware failures
  • Many opportunities for machine learning
    • More advanced bug detection
    • Failure diagnosis
    • Performance modeling
    • Correlate with telemetry data
  • Integrate tracing into open-source Hadoop
  • Trace younger, buggier software projects

28
Questions?
29
Findings about Hadoop
Unusual run: very slow small reads at the start of
a job
30
Optimizing Job Performance
Breakdown of the longest map task with 3x more mappers.
Tracing also illustrates the inefficiency of
having too many mappers and reducers.
31
Raw Event Graphs
Successful DFS write
32
Raw Event Graphs
Failed DFS write
33
Related Work
  • Per-machine logs
  • Active tracing: log4j, etc.
  • Passive tracing: DTrace, strace
  • Log aggregation tools: Sawzall, Pig
  • Cluster monitoring tools: Ganglia