Title: Control Theory in Log Processing Systems
1Control Theory in Log Processing Systems
- Wei Xu (xuw_at_cs.berkeley.edu)
- UC Berkeley
- Joseph L. Hellerstein
- IBM T.J. Watson Research Center
2Outline
- Data streams and log processing
- Applying control theory
- Controlling queue length
- Load balancing
- Lessons learned
3Introduction
- Goal of our project
- A tool
- A testbed
- Problem: data rate up to 1 TB a day
- Distributed Infrastructure
- How to make the infrastructure itself reliable?
4Example of system log data
- request data
- Apache log, etc
- performance data
- CPU, mem etc.
- failure data
- Detected problems /error messages
- reports from operators
5The big picture
[Diagram: Production System, Data Collection, Preprocessing, Repository (Sanitized Data), Automatic Analysis, Failure Detection]
6Preprocessing
- Sanitize the data
- Put logs into common format
- Merge information from various sources
- Sampling
- Needs to be fast
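The preprocessing steps listed above can be illustrated with a minimal sketch. It assumes the input is an Apache access log in Common Log Format; the record field names, the hashing-based sanitization policy, and the `sample_every` parameter are hypothetical choices for illustration, not the project's actual design.

```python
import hashlib
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')


def sanitize_host(host):
    """Replace a client address with a short, stable pseudonym (assumed policy)."""
    return hashlib.sha256(host.encode("utf-8")).hexdigest()[:8]


def to_common_format(line):
    """Parse one log line into a common record, or return None if it does not parse."""
    match = CLF.match(line)
    if match is None:
        return None
    host, _ident, _user, timestamp, request, status, size = match.groups()
    return {
        "source": "apache",
        "time": timestamp,
        "client": sanitize_host(host),
        "event": request,
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),
    }


def preprocess(lines, sample_every=1):
    """Yield every `sample_every`-th parseable record in the common format."""
    parsed = 0
    for line in lines:
        record = to_common_format(line)
        if record is None:
            continue  # unparseable lines are skipped in this sketch
        if parsed % sample_every == 0:
            yield record
        parsed += 1
```

Because `preprocess` is a generator, it handles one line at a time and never holds the whole log in memory, which fits the stream-oriented setting.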
7Stream processing
- Log data are data streams
- Preprocessing tasks are continuous queries
- Telegraph Continuous Query (TCQ)
- SQL queries
- adaptive execution optimized on-the-fly
- performance doesn't depend on queries
8Data preprocessing architecture
[Diagram: load splitter, SLT 1, SLT 2, combiner; stages: Intra-Event Processing, Inter-Event Processing]
9Problem: performance disturbance
- CPU contention
- Maintenance tasks
- Packet drops
- Other failures
- Selectivity changes
10The result of disturbance
[Plot: end-to-end response time (ms) vs. time (s)]
11Solution: Control Theory
- Treat this as a failure?
- Not necessary and too expensive
- Feedback control theory as a first-tier defense mechanism
- Try to keep the system stable, at least for some time
- If that doesn't work, try recovery
12Outline
- Data streams and log processing
- Applying control theory
- Controlling queue length
- Load balancing
- Lessons learned
13The problem
[Figure: Source, Buffer, TCQ, Result Q]
14Why does this happen?
- TCQ: complex internal structure
- TCQ drops tuples silently if the result queue is full
- Back pressure not possible
[Diagram: Input Buffer, Controlled Data Source]
15Control Problems
- Goal?
- No dropping tuples
- What to control?
- The result queue length
- The Knob?
- Input data rate to the TCQ node
16Control block diagram
Target system (System identification)
PI controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$
- $e(k)$: error
- $u(k)$: data rate in next interval
- $e(k-1)$: last error
- $u(k-1)$: data rate in last interval
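A minimal sketch of the incremental PI control law, applied to a toy queue, may make the loop concrete. The queue model (length changes by input rate minus drain rate each interval), the gains, the target of 50 tuples, and the drain rates are all assumptions chosen for illustration; they are not measurements of TCQ.

```python
class PIController:
    """Incremental PI law: u(k) = u(k-1) + (Kp + Ki) * e(k) - Kp * e(k-1)."""

    def __init__(self, kp, ki, u0=0.0):
        self.kp = kp
        self.ki = ki
        self.u_prev = u0   # data rate in the last interval
        self.e_prev = 0.0  # last error

    def update(self, target, measured):
        e = target - measured
        u = self.u_prev + (self.kp + self.ki) * e - self.kp * self.e_prev
        u = max(0.0, u)  # a data rate cannot be negative
        self.u_prev, self.e_prev = u, e
        return u


def simulate(steps=200, target=50.0):
    """Toy result queue: q(k+1) = q(k) + input_rate(k) - drain_rate(k)."""
    controller = PIController(kp=0.3, ki=0.05, u0=100.0)  # assumed gains
    queue = 0.0
    trace = []
    for k in range(steps):
        # Assumed disturbance: the consumer slows down at k = 100.
        drain = 100.0 if k < 100 else 90.0
        rate = controller.update(target, queue)
        queue = max(0.0, queue + rate - drain)
        trace.append(queue)
    return trace
```

In this toy model the queue settles near the target before the disturbance, rises when the drain rate drops, and is then pulled back to the target as the controller lowers the input rate.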
17Result Under CPU Contention
[Figure: Source, Buffer, TCQ, Result Q]
18Why useful?
- Original system
- Input data rate -> tuple drop vs. no drop
- New system
- Input data rate -> response time
- Make it ready for load balancing
19Outline
- System log as data streams
- Applying control theory
- Controlling queue length
- Load balancing
- Lessons learned
20The problem
- Barrier in system
- Different response times
- End to end response time matches the slower node
21The control problem
- Goal?
- Make the response time equal
- What to control?
- Response time on each node
- The knob?
- Tuples assigned to each node
- What to monitor?
- Queue length vs. response time
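The load-balancing loop can be sketched with two nodes behind a barrier. This is a toy model under stated assumptions: response time on a node is taken to be assigned tuples divided by node capacity, end-to-end response time is the slower of the two, and an integral-style controller shifts the split toward the faster node. The capacities, the gain, and the function names are hypothetical.

```python
def response_times(fraction, total, capacities):
    """Per-node response time (s) when `fraction` of `total` tuples go to node 1."""
    c1, c2 = capacities
    return fraction * total / c1, (1.0 - fraction) * total / c2


def simulate_balancer(steps=100, total=1000.0, gain=0.1):
    fraction = 0.5  # share of tuples assigned to node 1
    fractions, end_to_end = [], []
    for k in range(steps):
        # Assumed disturbance: node 2 loses half its capacity at k = 50.
        capacities = (1000.0, 1000.0) if k < 50 else (1000.0, 500.0)
        rt1, rt2 = response_times(fraction, total, capacities)
        # Barrier: end-to-end response time matches the slower node.
        end_to_end.append(max(rt1, rt2))
        # Move load away from whichever node is currently slower.
        fraction -= gain * (rt1 - rt2)
        fraction = min(1.0, max(0.0, fraction))
        fractions.append(fraction)
    return fractions, end_to_end
```

With these numbers the split moves from 1/2 toward 2/3 after node 2 slows down, and the end-to-end response time falls from 1.0 s back to about 0.67 s.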
22System with control
[Diagram; labeled quantity: Response time]
23Control block diagram
24Result
[Plot: end-to-end response time (ms) vs. time (s)]
25Outline
- System log as data streams
- Applying control theory
- Controlling queue length
- Load balancing
- Lessons learned
26Advantages of control theory
- Performance can be analyzed
- Stability
- Accuracy
- Settling time
- Overshoot
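These four properties can be checked numerically once a model is available. The sketch below assumes a first-order plant y(k+1) = a*y(k) + b*u(k) under the incremental PI law, for which the closed-loop characteristic polynomial is z^2 + (b(Kp+Ki) - a - 1)z + (a - b*Kp). The settling-time figure uses the common approximation -4/ln(r), where r is the largest pole magnitude; overshoot and steady-state error are read off a simulated step response. The parameter values in the usage note are arbitrary.

```python
import cmath
import math


def closed_loop_poles(a, b, kp, ki):
    """Roots of z^2 + c1*z + c0 for the assumed plant and PI controller."""
    c1 = b * (kp + ki) - a - 1.0
    c0 = a - b * kp
    disc = cmath.sqrt(c1 * c1 - 4.0 * c0)
    return (-c1 + disc) / 2.0, (-c1 - disc) / 2.0


def step_response(a, b, kp, ki, steps=200, reference=1.0):
    """Simulate the closed loop from rest for a constant reference."""
    y, u_prev, e_prev, out = 0.0, 0.0, 0.0, []
    for _ in range(steps):
        e = reference - y
        u = u_prev + (kp + ki) * e - kp * e_prev
        y = a * y + b * u
        u_prev, e_prev = u, e
        out.append(y)
    return out


def analyze(a, b, kp, ki, reference=1.0):
    radius = max(abs(p) for p in closed_loop_poles(a, b, kp, ki))
    stable = radius < 1.0
    if not stable:
        settling = math.inf
    elif radius == 0.0:
        settling = 0.0
    else:
        settling = -4.0 / math.log(radius)  # approximate, in sample intervals
    response = step_response(a, b, kp, ki, reference=reference)
    return {
        "stable": stable,
        "pole_radius": radius,
        "settling_steps": settling,
        "overshoot": max(0.0, max(response) - reference) / reference,
        "steady_state_error": reference - response[-1],
    }
```

For example, `analyze(0.5, 1.0, 0.1, 0.3)` reports a stable loop, while `analyze(0.5, 1.0, 3.0, 0.0)` reports an unstable one because a pole lies outside the unit circle.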
27Other advantages
- Simple implementation
- Encourages good system design
- Modeling the system
- Treat system as black box
- First defense mechanism against disturbances in the system
28Limitations
- Not all software systems are designed to be controlled
- Finite input produces unbounded output
- E.g. Join in TCQ
- Useful state not measurable
- Queuing theory helps, but other good theory is lacking
- Many binary variables
- Failed vs. working correctly
29Other Limitations
- The model for the target system is complex
- Lack of a reliable knob
- E.g. changing the result queue length of TCQ sometimes crashes it
- What is the range over which you can turn it?
- How often can you turn it?
- How long does the system take to respond?
- Cannot find the cause of the problem
30Solution?
- More advanced modeling and controller?
- Adaptive control
- Design controller-friendly systems?
- A simple model
- User-configurable parameters -> knobs?
31Future Work
- As a tool, real users?
- Scheduling multiple streams
- Dynamically scale up/down
- Other control theory applications
32Backup Slides
33Future Work
- Load balancer
- Load control across multiple tiers
- Scheduling of multiple streams
34System with control
35Result
[Figure: Source, Buffer, TCQ, Result Q]
36Conclusion
- Advantages of feedback control
- Makes the system more robust under disturbance
- Allows more time for failure detection
- Treat complex systems as black boxes
- Cope with the system's characteristics instead of having to change them
- Theoretical analysis
- Implementation is easy
- System statistics can also be used for SLT
37What is going on?
[Block diagram: Controlled Output Thread (code reuse), Queue Length Controller; signals: desired queue length, data rate to TCQ, actual queue length]
38Theory meets reality
[Plot: queue length vs. time]
39Tricky part of parameter estimation
- Model evaluation
- Making the system operate in the desired range
- Easy for data source, but queue length ..
[Plot: data rate vs. free space, with a non-linear range marked]
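Parameter estimation and model evaluation can be sketched as a least-squares fit. The example assumes a first-order linear model y(k+1) = a*y(k) + b*u(k), where u is the knob setting and y the measured output, and reports R^2 as a simple quality measure. The model form and the function name are assumptions for illustration. Samples should come from the range where the system behaves roughly linearly, since a fit across the non-linear range would be misleading.

```python
def fit_first_order(u, y):
    """Least-squares estimate of (a, b) in y(k+1) = a*y(k) + b*u(k).

    `u` holds n inputs and `y` holds n + 1 outputs; returns (a, b, r_squared).
    """
    if len(y) != len(u) + 1:
        raise ValueError("need one more output sample than input samples")
    n = len(u)
    s_yy = sum(y[k] * y[k] for k in range(n))
    s_yu = sum(y[k] * u[k] for k in range(n))
    s_uu = sum(u[k] * u[k] for k in range(n))
    s_yn = sum(y[k] * y[k + 1] for k in range(n))
    s_un = sum(u[k] * y[k + 1] for k in range(n))
    det = s_yy * s_uu - s_yu * s_yu
    if det == 0.0:
        raise ValueError("input does not excite the system enough to identify it")
    # Solve the 2x2 normal equations by Cramer's rule.
    a = (s_uu * s_yn - s_yu * s_un) / det
    b = (s_yy * s_un - s_yu * s_yn) / det
    # Model evaluation: fraction of output variance explained by the model.
    mean_next = sum(y[1:]) / n
    sse = sum((y[k + 1] - a * y[k] - b * u[k]) ** 2 for k in range(n))
    sst = sum((y[k + 1] - mean_next) ** 2 for k in range(n))
    r_squared = 1.0 - sse / sst if sst > 0.0 else 0.0
    return a, b, r_squared
```

The input sequence has to vary during the experiment; a constant input leaves the two parameters indistinguishable.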
40Why do we need control?
- Data source does not provide an accurate data rate
41Control Problems
- Not accurate for various reasons
- Scheduling
- Time spent on I/O
- Etc.
- Providing an accurate data source using feedback control
- By controlling the input of desired rate
42The Control Architecture
[Diagram values: 1500, 1900, 1600]
P controller (with precompensation): $u(k) = K_P\,e(k)$
PI controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$
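The difference between the two control laws can be seen in a small simulation. The sketch assumes the data source behaves like a first-order system y(k+1) = a*y(k) + b*u(k), with a precompensation factor computed from the nominal model so that the P controller reaches the desired rate exactly. All numbers (a, b, the gains, the desired rate of 1000, the change in b at k = 30) are hypothetical.

```python
A = 0.4          # assumed plant pole
B_NOMINAL = 0.5  # assumed plant gain used when designing the controllers


def plant_gain(k):
    """Assumed disturbance: the source becomes less responsive at k = 30."""
    return B_NOMINAL if k < 30 else 0.4


def run_p_with_precompensation(kp=0.8, desired=1000.0, steps=60):
    # Scale the reference so the nominal closed loop has unit steady-state gain.
    precomp = (1.0 - A + B_NOMINAL * kp) / (B_NOMINAL * kp)
    y, trace = 0.0, []
    for k in range(steps):
        e = precomp * desired - y
        u = kp * e                      # u(k) = Kp * e(k)
        y = A * y + plant_gain(k) * u
        trace.append(y)
    return trace


def run_pi(kp=0.4, ki=0.6, desired=1000.0, steps=60):
    y, u_prev, e_prev, trace = 0.0, 0.0, 0.0, []
    for k in range(steps):
        e = desired - y
        u = u_prev + (kp + ki) * e - kp * e_prev
        y = A * y + plant_gain(k) * u
        u_prev, e_prev = u, e
        trace.append(y)
    return trace
```

In this toy setup both controllers deliver the desired rate while the model is accurate. Once the plant gain changes, the precompensated P controller settles about 13% below the desired rate, whereas the PI controller returns to it, because precompensation depends on the model while integral action does not.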
43Result: An accurate data source
[Plot panels: P Controller with Pre-compensation, PI Controller]
44Zoom In
- A lot of small disturbances in a Java program
- Incremental garbage collection
[Plot panels: P Controller, PI Controller]