Title: Content-Based Routing: Different Plans for Different Data
1Content-Based RoutingDifferent Plans for
Different Data
- Pedro Bizarro
- Joint work withShivnath Babu, David DeWitt,
Jennifer Widom - September 1, 2005
- VLDB 2005
2Introduction
Introduction
- The opportunity to improve
- Optimizers pick a single plan for a query
- However, different subsets of data may have very
different statistical properties - May be more efficient to use different plans for
different subsets of data
3Overview of CBR
- Eliminates single plan assumption
- Identifies tuple classes
- Uses multiple plans, each customized for a
different tuple class - CBR applies to any streaming data
- E.gs. stream systems, regular DBMS operators
using iterators, and acquisitional systems. - Adaptive and low overhead algorithm
- Implemented in TelegraphCQ as an extension to
Eddies
4Overview of Eddies
- Eddy routes tuples through a pool of operators
- Routing decisions based on operator
characteristics selectivity, cost, queue size,
etc.
- Routing Tuples notdifferentiated based on
content - We call it SBR Source-Based Routing
5Content-Based Routing Example
- Consider stream S processed by O1, O2, O3
O1 O2 O3
Selectivities 30 40 60
Overall Operator Selectivities
- Best routing order is O1, then O2, then O3
6Content-Based Routing Example
- Let A be an attribute with domain a,b,c
Value of A O1 O2 O3
Aa 32 10 55
Ab 31 20 65
Ac 27 90 60
Overall 30 40 60
Content-Specific Selectivities
- Best routing order for Aa O2, O1, O3
- Best routing order for Ab O2, O1, O3
- Best routing order for Ac O1, O3, O2
7Classifier Attributes
- Goal identify tuple classes
- Each with a different optimal operator ordering
- CBR considers
- Tuple classes distinguished by content, i.e.,
attribute values - Classifier attribute (informal definition)
- Attribute A is classifier attribute for operator
O if the value of A is correlated with
selectivity of O.
8Best Classifier Attribute Example
- Attribute A with domain a, b, c
- Attribute B with domain x, y, z
- Which is the best to use for routing decisions?
- Similar to AI problem classifier attributes for
decision trees - Two labels pass operator, dropped by operator
- AI solution Use GainRatio to pick best
classifier attribute
s 1-s
Bx 43 57
By 38 62
Bz 39 61
Overall 40 60
s 1-s
Aa 10 90
Ab 20 80
Ac 90 10
Overall 40 60
9GainRatio to Measure Correlation
s 1-s
Aa 10 90
Ab 20 80
Ac 90 10
Overall 40 60
s 1-s
Bx 43 57
By 38 62
Bz 39 61
Overall 40 60
GainRatio(R, A) 0.87 GainRatio(R, B)
0.002
- R random sample of tuples processed by operator O
Formulas from T. Mitchell, Machine Learning.
McGraw-Hill, '97.
10Classifier AttributesDefinition
- An attribute A is a classifier attribute for
operator O, if for any large random sample R of
tuples processed by O, GainRatio(R,A)gt??, for
some threshold ?
11Content-Learns AlgorithmLearning Routes
Automatically
- Content-Learns consists of two continuous,
concurrent steps - Optimization For each Ol ? O1, ,On find
- that Ol does not have a classifier attribute or
- find the best classifier attribute, Cl, of Ol.
- Routing Route tuples according to the
- selectivities of Ol if Ol does not have a
classifier attribute or - according to the content-specific selectivities
of the pair ltOl, Clgt if Cl is the best classifier
attribute of Ol
12Content-Learns Optimization Step
- Find Cl by profiling Ol
- Route a fraction of input tuples to Ol
- For each sampled tuple
- For each attribute
- map attribute values to d partitions
- update pass/fail counters
- When all sample tuples seen, compute Cl
2
sampled tuple
3
1
2
4
4
7
1
1
1
2
1
1
1
2
1
1
2 partitions
1
1
1
2
1
2
1
1
2
1
1
2
1
2
corresponding partitions
3 attributes
13Content-Learns Routing Step
- SBR routes to Ol with probability inversely
proportional to Ols selectivity, Wl - CBR routes to operator with minimum??
- If Ol does not have a classifier attribute, its
?Wl - If Ol has a classifier attribute, its ?Sl,i,
jCAl, ifj(t.Cj)
40
1
2
-1
50
20
2 partitions
1
55
14Adaptivity and Overhead
- CBR introduces new routing and learning overheads
- Overheads at odds with adaptivity
- Adaptivity ability to find efficient plan
quickly when data or system characteristics change
15CBR Update Overheads
- Once per tuple
- selectivities as fresh as possible
- Once per sampled tuple
- correlations between operators and content
- Once per sample (2500 tuples)
- Computing GainRatio and updating one entry in
array CA
W
operator selectivities
25
40
50
60
CA
classifier attributes
2
-1
2
1
In
Out
partitions 1,,d
0
1
2
0
1
1
50
20
75
-
tuples in, tuples out
detailed selectivities
S
2
1
0
1
0
0
0
80
55
-
operators 1,...,n
attributes 1,,k
16Experimental ResultsDatasets
- Stream-star dataset
- Synthetic dataset based on a star-schema
- SELECT FROM stream S, d1, d2, , dN
- WHERE s.fkd1 d1.pk // Operator O1
-
- AND s.fkdN dN.pk // Operator ON
- Attribute S.attrC best classifier attribute for
all operators - 8 other attributes in S not correlated with
operator selectivities - 100K records in stream, 10K records in dimension
tables - Lab dataset
- Explained later
17Experimental ResultsSystem, Metrics, Defaults
- TelegraphCQ version 0.2
- Tao Linux release 1, 512 MB RAM, Pentium 4 2.4
GHz - Metrics improvement running time and routing
calls - Default values
Parameter Defaults Comment
P 6 Tuple sampling probability
R 150 tuples Sample size
d 24 Number of partitions
Confidence 95 Conf. interval in graphs
18Experimental ResultsRun-time Overheads
- Routing overhead
- time to perform routing decisions (SBR, CBR)
- Learning overhead
- Time to update data structures (SBR, CBR) plus
- Time to compute gain ratio (CBR only).
Overhead increase 30-45
19Experimental ResultsVarying Skew
- One operator with selectivity A, all others with
selectivity B - Skew is A-B. A varied from 5 to 95
- Overall selectivity 5
6 joins
20Experimental ResultsRandom Selectivities
- Attribute attrC correlated with the selectivities
of the operators - Other attributes in stream tuples not correlated
with selectivities - Random selectivities in each operator
21Experimental ResultsVarying Aggregate
Selectivity
- Aggregate selectivity in previous experiments was
5 or 8 - Here we vary aggregate selectivity between 5 to
35 - Random selectivities within these bounds
6 joins
22Experimental ResultsDatasets
- Lab dataset
- Real data
- 2.2 M readings from 54 sensors (Intel Research
Berkeley Lab) - Single stream with attributes
- Light
- Humidity
- Temperature
- Voltage
- sensorID
- Year
- Month
- Day
- Hours
- Minutes
- Seconds
23Experimental ResultsChallenging Adaptivity
Experiment (1)
- Using Lab dataset
- Example querySELECT FROM sensors WHERE
lightgt500 - Observations
- Very high variation in selectivity
- Best classifier attributes change with time
- No classifier attribute found for over half the
time
24Experimental ResultsChallenging Adaptivity
Experiment (2)
- Query SELECT FROM sensors
- WHERE light BETWEEN lowL AND highL
- AND temperature BETWEEN lowT AND highT
- AND humidity BETWEEN lowH AND highH
- AND voltage BETWEEN lowV AND highV
- lowX random number from lower 25 of domain
- highX random number from upper 25
- Results for 50 different queries.
- Average improvement of
- 8 in routing calls
- 5 in execution time
- 7 in time spent evaluating operators
- 18 in routing calls until a tuple is dropped
25Experimental ResultsVarying Operator Cost
- Run random query from previous slide
- Run query for periods with correlations
- Varied operator cost by running CPU intensive
computations
4 operators
26Conclusions
- CBR eliminates single plan assumption
- Explores correlation between tuple content and
operator selectivities - Adaptive learner of correlations with negligible
overhead - Performance improvements over non-CBR routing
- Selectivity changes much more than correlations
27Acknowledgements
- Sam Madden and Amol Deshpande for providing the
Lab dataset. - Sailesh Krishnamurthy, Amol Deshpande, Joe
Hellerstein, and the rest of the TelegraphCQ team
for providing TelegraphCQ and answering all my
questions.
28Q A?
29Extra slides
30Motivational Example (1)Intrusion Detection
Query
- Track packets with destination address matching
a prefix in table T, and containing the 100-byte
and 256-byte sequences 0xa...8 and 0x7...b
respectively as subsequence - SELECT FROM packetsWHERE matches(destination,
T)AND contains(data, 0xa...8)AND
contains(data, 0x7...b)
O1
O2
O3
31Motivational Example (2)Intrusion Detection
Query
- Assume
- costs are c3gtc1gtc2
- selectivities are ??3gt?1gt?2
- SBR routing converges to O2, O1, O3
O3
SBR
O2
O1
almost all tuples follow this route
Stream of tuples
32Motivational Example (3)Intrusion Detection
Query
- Suppose an attack (O2 and O3) on a network whose
prefix is not in T (O1) is underway - s2 and s3 will be very high, s1 will be very low
- O1, O2, O3 will be the most efficient ordering
for attack tuples
O3
O3
SBR
CBR
O2
O1
O2
O1
addr
attack tuples follow this route
almost all tuples follow this route
non-attack tuples follow this route
Stream of tuples
Stream of tuples
33Experimental ResultsVarying Skew
- One operator with selectivity A, all others with
selectivity B - Skew is A-B. A varied from 5 to 95
- Overall selectivity 5
2 joins
6 joins
34Related Work
- Adaptivity in Stream Systems
- Avnur Eddies Continuously Adaptive Query
Processing. SIGMOD'00. - Arasu STREAM The Stanford Stream Data
Manager. DEBul 26(1). - Babu Adaptive ordering of pipelined stream
filters. SIGMOD'04. - Babu StreaMon An Adaptive Engine for Stream
Query Processing. SIGMOD'04. - Carney Monitoring streams a new class of
data management applications. VLDB'02. - Chandrasekaran TelegraphCQ Continuous
dataflow processing for an uncertain world.
CIDR'03. - Chandrasekaran Psoup a system for streaming
queries over streaming data. VLDBj 12(2). - Deshpande An initial study of overheads of
eddies. SIGMODrec 33(1). - Deshpande Lifting the Burden of History from
Adaptive Query Processing. VLDB'04. - Madden Continuously adaptive continuous
queries over streams. SIGMOD'02. - Raman Using state modules for adaptive query
processing. ICDE'03. - Tian Tuple Routing Strategies for Distributed
Eddies. VLDB'03. - Viglas Maximizing the Output Rate of Multi-Way
Join Queries over Streaming Information Sources.
VLDB'03. - AQP Surveys
- Babu Adaptive Query Processing in the Looking
Glass. CIDR'05. - Hellerstein Adaptive query processing
Technology in evolution. DEBul 23(2).
- Adaptivity in DBMS
- Babu Proactive Re-optimization. SIGMOD'05.
- Graefe Dynamic query evaluation plans.
SIGMOD'89. - Kabra Efficient Mid-Query Re-Optimization of
Sub-Optimal Query Execution Plans. SIGMOD'98. - Markl Robust Query Processing through
Progressive Optimization. SIGMOD'04. - Wong Decomposition - a strategy for query
processing. TODS 1(3). - Adaptivity in Distributed Systems
- Bouganim Dynamic query scheduling in data
integration systems. ICDE'00. - Ives An adaptive query execution system for
data integration. SIGMOD'99. - Ives Adaptive query processing for internet
applications. DEBul 23(2). - Ives Adapting to Source Properties in
Processing Data Integration Queries. SIGMOD'04. - Ives Efficient Query Processing for Data
Integration. PhD thesis. - Ng Dynamic query re-optimization. ICSSDM'99.
- Urhan Dynamic pipeline scheduling for
improving interactive performance of online
queries. VLDB'01. - Urhan Cost based query scrambling for initial
delays. SIGMOD'98. - Zhu Dynamic plan migration for continuous
queries over data streams. SIGMOD'04.
35Related Work (contd)
- Acquisitional Systems
- Madden The Design of an Acquisitional Query
Processor for Sensor Networks. SIGMOD'03. - Deshpande Model-Driven Data Acquisition in
Sensor Networks. VLDB'04 - Deshpande Exploiting Correlated Attributes in
Acquisitional Query Processing. ICDE'05. - Multiple Plans in Same Query
- Antoshenkov Query processing and optimization
in oracle rdb. VLDBj 5(4). - Bizarro Content-Based Routing Different Plans
for Different Data. VLDB'05. - Polyzotis Selectivity-Based Partitioning A
Divide-and-Union Paradigm For Effective Query
Optimization. Unpublished. - URDs and Modeling Uncertainty
- Anderson Index key range estimator. U. S.
Patent 4,774,657. - Babcock Towards a Robust Query Optimizer A
Principled and Practical Approach. SIGMOD'05. - Chu Least expected cost query optimization An
exercise in utility. SoPoDS'99. - Ramamurthy Buffer-pool Aware Query
Optimization. CIDR'05. - Viglas Novel Query Optimization and Evaluation
Techniques, Ph.D. Thesis.
- Optimization, Cardinality Estimation,
Correlations - Selinger Access Path Selection in a Relational
Database Management System. SIGMOD'79. - Acharya Join Synopses for Approximate Query
Answering. SIGMOD'99. - Christodoulakis Implications of certain
assumptions in database performance evaluation.
TODS 9(2). - Graefe Query Evaluation Techniques for Large
Databases. ACM Comput. Surv. 25(2). - Getoor Selectivity Estimation using
Probabilistic Models. SIGMOD'01. - Ioannidis On the Propagation of Errors in the
Size of Join Results. SIGMOD'91. - Ilyas CORDS Automatic discovery of
correlations and soft functional dependencies.
SIGMOD'04. - Stillger LEO - DB2s LEarning Optimizer.
VLDB'01. - AI and Learning from Streams
- Mitchell Machine Learning. McGraw-Hill, '97.
- Guha Data-streams and histograms. ACM Symp. on
Theory of Computing, '01. - Domingos Mining high-speed data streams.
SIGKDD'00. - Gehrke On computing correlated aggregates over
continual data streams. SIGMOD'01.