Title: Continuously Adaptive Continuous Queries (CACQ) over Streams
1. Continuously Adaptive Continuous Queries (CACQ) over Streams
Samuel Madden, Mehul Shah, Joseph Hellerstein,
and Vijayshankar Raman
Presented by Bhuvan Urgaonkar
2. CACQ Introduction
- Proposed continuous query (CQ) systems are based on static plans
  - But CQs are long running
  - Assumptions that were valid initially become less so over time
  - Static optimizers at their worst!
- CACQ insight: apply the continuous adaptivity of eddies to continuous queries
  - Dynamic operator ordering avoids the static-optimizer danger
  - Process multiple queries simultaneously
  - Interestingly, enables sharing of work and storage
3. Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
  - Results & Experiments
4. Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
  - Results & Experiments
5. Motivating Applications
- Monitoring queries look for recent events in data streams
  - Sensor data processing
  - Stock analysis
  - Router, web, or phone events
- In CACQ, we confine our view to queries over recent history
  - Only tuples currently entering the system
  - Stored in in-memory data tables for time-windowed joins between streams
6. Continuous Queries
- Long-running, standing queries, similar to trigger systems
  - Once installed, they continuously produce results until removed
- Lots of queries over the same data sources
  - Opportunity for work sharing!
- Idea: adaptive heuristics
7. Eddies: Adaptivity
- Eddies (Avnur & Hellerstein, SIGMOD 2000): continuous adaptivity
  - No static ordering of operators
  - Policy dynamically orders operators on a per-tuple basis
  - done and ready bits encode where a tuple has been and where it can go
8. Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
  - Results & Experiments
9. CACQ Contributions
- Adaptivity
  - Policies for continuous queries
  - Single eddy for multiple queries
- Tuple lineage
  - In addition to ready and done bits, encode a tuple's output history in its queriesCompleted bits
  - Enables flexible sharing of operators between queries
- Grouped filter
  - Efficiently compute selections over multiple queries
- Join sharing through State Modules (SteMs)
10. Explication by Example
- First, an example with just one query and only selections
- Then, add multiple queries
- Then, (briefly) discuss joins
11. Eddies & CACQ: Single Query, Single Source

SELECT * FROM R WHERE R.a > 10 AND R.b < 15

- Use ready bits to track what to do next
  - All 1s for a single source
- Use done bits to track what has been done
  - Tuple can be output when all bits are set
- Routing policy dynamically orders tuples (a rough code sketch follows the figure below)
[Figure: the eddy routes example tuples R1 (a=5, b=25) and R2 (a=15, b=0) through the two selections; each tuple's ready and done bits are updated as it visits an operator, until the tuple is either dropped or output with all done bits set.]
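As a rough sketch of this mechanism (not the actual Telegraph code), the eddy can be modeled as a loop that routes a tuple to an operator it has not yet visited, marks the corresponding done bit when the tuple returns, and outputs the tuple once all bits are set. The class and method names below (StreamTuple, SelectionOp, SingleQueryEddy) are hypothetical.

```java
import java.util.BitSet;
import java.util.List;

// Illustrative sketch of eddy-style routing for a single query with two
// selections (R.a > 10 AND R.b < 15); names are hypothetical, not Telegraph's.
public class SingleQueryEddy {

    // A tuple tagged with ready/done bits, one bit per operator.
    static class StreamTuple {
        final int a, b;
        final BitSet ready = new BitSet();  // operators the tuple may visit next
        final BitSet done = new BitSet();   // operators the tuple has already passed
        StreamTuple(int a, int b) { this.a = a; this.b = b; }
    }

    interface SelectionOp {
        boolean apply(StreamTuple t);       // true if the tuple passes the predicate
    }

    private final List<SelectionOp> ops;

    SingleQueryEddy(List<SelectionOp> ops) { this.ops = ops; }

    // Route a newly arrived tuple until it is output or dropped.
    void process(StreamTuple t) {
        t.ready.set(0, ops.size());         // single source: all ready bits start at 1
        while (t.done.cardinality() < ops.size()) {
            int next = pickNext(t);         // routing policy: here, lowest-numbered ready op
            if (!ops.get(next).apply(t)) {
                return;                     // failed a predicate: drop the tuple
            }
            t.done.set(next);               // record where the tuple has been
            t.ready.clear(next);
        }
        output(t);                          // all done bits set: tuple satisfies the query
    }

    private int pickNext(StreamTuple t) { return t.ready.nextSetBit(0); }

    private void output(StreamTuple t) {
        System.out.println("output: a=" + t.a + ", b=" + t.b);
    }

    public static void main(String[] args) {
        List<SelectionOp> ops = List.of(
                t -> t.a > 10,              // R.a > 10
                t -> t.b < 15);             // R.b < 15
        SingleQueryEddy eddy = new SingleQueryEddy(ops);
        eddy.process(new StreamTuple(15, 0));   // passes both predicates, is output
        eddy.process(new StreamTuple(5, 25));   // fails R.a > 10, dropped
    }
}
```

Here the routing choice is trivially "lowest-numbered ready operator"; the routing-policy slide later in the deck replaces this with a ticket-based lottery.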
12. Multiple Queries

[Figure: grouped filters over R.a (e.g. R.a > 10, R.a > 20) and over R.b (e.g. R.b < 15, R.b <> 50) process tuple R1 (a=5, b=25); the tuple's bits are filled in as it visits each grouped filter.]
13. Multiple Queries (continued)

[Figure: the same grouped filters process tuple R2 (a=15, b=0), but the eddy reorders the operators; routing is chosen per tuple, so the operator order can differ from tuple to tuple.]
14Outputting Tuples
completionMasks completionMasks completionMasks completionMasks completionMasks
? a b c d
Q1 1 1 0 0
Q2 0 1 1 1
- Store a completionMask bitmap for each query
- One bit per operator
- Set if the operator in the query
- To determine if a tuple t can be output to query
q - Eddy ANDs qs completionMask with ts done bits
- Output only if qs bit not set in ts
queriesCompleted bits - Every time a tuple returns from an operator
completionMasks
Done 1100
QueriesCompleted0 0
Q1 1100
Q2 0111
Done 0111
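A minimal sketch of this output check, assuming a BitSet-based representation of the masks and bits; the names below (OutputCheck, maybeOutput, the completionMasks map) are illustrative rather than the Telegraph implementation.

```java
import java.util.BitSet;
import java.util.Map;

// Illustrative sketch of the per-query output check using completionMasks,
// done bits, and queriesCompleted bits; names are hypothetical.
public class OutputCheck {

    // Build a completionMask: one bit per operator, set if the operator
    // participates in the query.
    static BitSet mask(int... opIndexes) {
        BitSet m = new BitSet();
        for (int i : opIndexes) m.set(i);
        return m;
    }

    // Called every time a tuple returns from an operator.
    static void maybeOutput(BitSet done, BitSet queriesCompleted,
                            Map<Integer, BitSet> completionMasks) {
        for (Map.Entry<Integer, BitSet> e : completionMasks.entrySet()) {
            int queryId = e.getKey();
            if (queriesCompleted.get(queryId)) continue;   // already handled for this query

            BitSet needed = (BitSet) e.getValue().clone();
            needed.andNot(done);                           // operators still missing for this query
            if (needed.isEmpty()) {                        // done bits cover the query's mask
                System.out.println("output tuple to query " + queryId);
                queriesCompleted.set(queryId);             // never output to this query again
            }
        }
    }

    public static void main(String[] args) {
        // Operators indexed a=0, b=1, c=2, d=3, as on the slide.
        Map<Integer, BitSet> masks = Map.of(
                1, mask(0, 1),        // Q1 uses operators a, b   (mask 1100)
                2, mask(1, 2, 3));    // Q2 uses operators b, c, d (mask 0111)

        BitSet done = mask(0, 1);     // this tuple has passed a and b
        BitSet queriesCompleted = new BitSet();
        maybeOutput(done, queriesCompleted, masks);   // outputs to Q1 only
    }
}
```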
15. Grouped Filter
- Use binary trees to efficiently index range predicates (a code sketch follows the figure)
  - Two trees (LT and GT) per attribute
  - Insert the predicate's constant
- When a tuple arrives
  - Scan everything to the right (for GT) or left (for LT) of the tuple's attribute value in the tree
  - Those are the queries that the tuple does not pass
- Hash tables index equality and inequality predicates

[Figure: greater-than tree over S.a containing the predicates S.a > 1, S.a > 7, and S.a > 11.]
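A rough sketch of a grouped filter for greater-than predicates, using a sorted map over the predicate constants as a stand-in for the binary tree described above; all names are illustrative.

```java
import java.util.BitSet;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch of a grouped filter for ">" predicates over one attribute.
// A sorted map over predicate constants stands in for the slide's binary tree.
public class GreaterThanGroupedFilter {

    // constant -> ids of queries whose predicate is "attr > constant"
    private final NavigableMap<Double, BitSet> tree = new TreeMap<>();

    // Install a predicate "attr > constant" for the given query.
    void addPredicate(double constant, int queryId) {
        tree.computeIfAbsent(constant, c -> new BitSet()).set(queryId);
    }

    // Return the ids of queries whose predicate the attribute value passes:
    // every query whose constant is strictly less than the value.
    BitSet matchingQueries(double value) {
        BitSet passing = new BitSet();
        for (BitSet qs : tree.headMap(value, false).values()) {
            passing.or(qs);
        }
        // Equivalently, the entries to the right (constants >= value) are the
        // queries the tuple does NOT pass, as described on the slide.
        return passing;
    }

    public static void main(String[] args) {
        GreaterThanGroupedFilter f = new GreaterThanGroupedFilter();
        f.addPredicate(1, 0);   // query 0: S.a > 1
        f.addPredicate(7, 1);   // query 1: S.a > 7
        f.addPredicate(11, 2);  // query 2: S.a > 11
        System.out.println(f.matchingQueries(9));  // {0, 1}: passes S.a > 1 and S.a > 7
    }
}
```

A less-than tree is symmetric (scan with tailMap instead of headMap), and equality and inequality predicates would be indexed with hash tables, as the slide notes.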
16. Work Sharing via Tuple Lineage

Q1: SELECT * FROM S WHERE A, B, C
Q2: SELECT * FROM S WHERE A, B, D

[Figure: over data stream S, conventional static plans must apply the shared filters A and B first in order to share them, and without sharing, tuples in the intersection of C and D pass through A and B an extra time; lineage (the queriesCompleted bits) enables any operator ordering, e.g. C, D, B, A, while A and B are still applied only once per tuple.]
17. Tradeoff: Overhead vs. Shared Work
- Overhead is in the additional bits per tuple
  - Experiments studying performance and size are in the paper
  - The bit / query / tuple cost is the most significant
- Trading accounting overhead for work sharing
  - 100 bits / tuple allows a tuple to be processed once, not 100 times
- Reduce overhead by not keeping state about operators a tuple will never pass through
18. Joins in CACQ
- Use symmetric hash join to avoid blocking (sketched below)
- Use State Modules (SteMs) to share storage between joins with a common base relation
- Details about the effect on implementation and the benefit are in the paper
- See Raman, UC Berkeley Ph.D. Thesis, 2002
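A minimal sketch of the symmetric (pipelined) hash join idea, with each side's hash table playing the role of a SteM that several joins over the same base relation could in principle share; the classes below (SymmetricHashJoin, SteM) are illustrative, not the Telegraph code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a symmetric hash join: each arriving tuple is first
// built into its own side's hash table (the SteM), then probes the other side,
// so the join never blocks waiting for either input to finish.
public class SymmetricHashJoin<T> {

    // A SteM: a hash table over one stream, keyed on the join attribute.
    static class SteM<T> {
        private final Map<Object, List<T>> table = new HashMap<>();
        void build(Object key, T tuple) {
            table.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
        }
        List<T> probe(Object key) {
            return table.getOrDefault(key, List.of());
        }
    }

    private final SteM<T> left = new SteM<>();
    private final SteM<T> right = new SteM<>();

    // Called when a tuple arrives on the left stream; symmetric for the right.
    List<T> onLeftTuple(Object joinKey, T tuple) {
        left.build(joinKey, tuple);          // remember it for future right tuples
        return right.probe(joinKey);         // join against right tuples seen so far
    }

    List<T> onRightTuple(Object joinKey, T tuple) {
        right.build(joinKey, tuple);
        return left.probe(joinKey);
    }

    public static void main(String[] args) {
        SymmetricHashJoin<String> join = new SymmetricHashJoin<>();
        System.out.println(join.onLeftTuple("IBM", "stocks: IBM @ 120"));   // [] (no articles yet)
        System.out.println(join.onRightTuple("IBM", "article about IBM"));  // [stocks: IBM @ 120]
    }
}
```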
19. Routing Policies
- The system described so far provides correctness; the routing policy is responsible for performance
- The policy is consulted to determine where to route every tuple that
  - Enters the system
  - Returns from an operator
- Basic ticket policy (sketched in code below)
  - Give operators tickets for consuming tuples; take away tickets for producing them
  - To choose the next operator to route to, run a lottery
  - More selective operators are scheduled earlier
- Modification for CACQ
  - Give more tickets to operators shared by multiple queries (e.g. grouped filters)
  - When a shared operator outputs a tuple, charge it multiple tickets
  - Intuition: cardinality-reducing shared operators reduce global work more than unshared operators
  - Not optimizing for the throughput of a single query!
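A rough sketch of lottery-based ticket routing under these rules; the bookkeeping below (OperatorStats, onConsume/onProduce, the sharedQueries weighting) is illustrative, with the CACQ-specific tweaks noted in comments.

```java
import java.util.List;
import java.util.Random;

// Illustrative sketch of the ticket-based lottery routing policy: operators
// earn tickets for consuming tuples and lose them for producing tuples, so
// more selective operators accumulate tickets and get scheduled earlier.
public class TicketRoutingPolicy {

    static class OperatorStats {
        final String name;
        long tickets = 1;                 // start with one ticket so every operator can win
        OperatorStats(String name) { this.name = name; }
    }

    private final List<OperatorStats> ops;
    private final Random rng = new Random();

    TicketRoutingPolicy(List<OperatorStats> ops) { this.ops = ops; }

    void onConsume(OperatorStats op, int sharedQueries) {
        // CACQ tweak: grant shared operators (e.g. grouped filters) more tickets.
        op.tickets += sharedQueries;
    }

    void onProduce(OperatorStats op, int sharedQueries) {
        // CACQ tweak: charge shared operators multiple tickets per output tuple.
        op.tickets = Math.max(1, op.tickets - sharedQueries);
    }

    // Run a lottery over the eligible operators, weighted by their tickets.
    OperatorStats pickNext(List<OperatorStats> eligible) {
        long total = eligible.stream().mapToLong(o -> o.tickets).sum();
        long draw = (long) (rng.nextDouble() * total);
        for (OperatorStats op : eligible) {
            draw -= op.tickets;
            if (draw < 0) return op;
        }
        return eligible.get(eligible.size() - 1);   // fallback for rounding
    }

    public static void main(String[] args) {
        OperatorStats selective = new OperatorStats("R.a > 90");
        OperatorStats permissive = new OperatorStats("R.e > 10");
        TicketRoutingPolicy policy = new TicketRoutingPolicy(List.of(selective, permissive));
        // The selective filter consumes tuples but rarely produces them, so it
        // accumulates tickets and tends to win the lottery.
        for (int i = 0; i < 100; i++) policy.onConsume(selective, 1);
        for (int i = 0; i < 100; i++) { policy.onConsume(permissive, 1); policy.onProduce(permissive, 1); }
        System.out.println(policy.pickNext(List.of(selective, permissive)).name);
    }
}
```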
20. Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
  - Results & Experiments
21. Evaluation
- Real Java implementation on top of the Telegraph QP
  - 4,000 new lines of code in a 75,000-line codebase
- Server platform
  - Linux 2.4.10
  - Pentium III 733 MHz, 756 MB RAM
- Queries posed from a separate workstation
  - Output suppressed
- Lots of experiments in the paper; just a few here
22. Results: Routing Policy

All attributes uniformly distributed over [0, 100]

    Query 1: From S select index where a > 90
    Query 2: From S select index where a > 90 and b > 70
    Query 3: From S select index where a > 90 and b > 70 and c > 50
    Query 4: From S select index where a > 90 and b > 70 and c > 50 and d > 30
    Query 5: From S select index where a > 90 and b > 70 and c > 50 and d > 30 and e > 10
23. CACQ vs. NiagaraCQ
- Performance is competitive on the workload from the NiagaraCQ paper
- A different workload on which CACQ outperforms NiagaraCQ:

    SELECT stocks.sym, articles.text
    FROM stocks, articles
    WHERE stocks.sym = articles.sym AND UDF(stocks)

  (the UDF is expensive)

- See Chen et al., SIGMOD 2000, ICDE 2002
24CACQ vs. NiagaraCQ 2
SA
SA
SA
Lineage Allows Join To Be Applied Just Once
S
A
No shared subexpressions, so no shared work!
25. CACQ vs. NiagaraCQ: Graph
26. Conclusion
- CACQ: sharing and adaptivity for high-performance monitoring queries over data streams
- Features
  - Adaptivity: adapt to a changing query workload without costly multi-query reoptimization
  - Work sharing via tuple lineage, without constraining the available plans
  - Computation sharing via grouped filters
  - Storage sharing via SteMs
- Future work
  - More sophisticated routing policies
  - Batching / query grouping
  - Better integration with historical results (Chandrasekaran, VLDB 2002)