Title: Querying Big Data
1 TDD Research Topics in Distributed Databases
- Querying Big Data
- Tractability revisited for querying big data
- BD-tractability
- Reductions, complete problems, separation results
- Querying big data
- Scale independence
- Making big data small
- Approximate query answering
- Relaxing query semantics
- Data-driven approximation
2 Big data
- Volume: in PB (10^15 B) or EB (10^18 B)
- Variety: heterogeneous, semi-structured or unstructured
- Velocity: dynamic
- Veracity: trust in its quality
In fact, any data that cannot be handled with your available resources
The new challenges introduced by big data?
- Computer science is the study of the computation of a function f(x)
- x is big: PB (10^15 B) or EB (10^18 B)
3 A new complexity theory for big data
4 The good, the bad and the ugly
- Traditional computational complexity theory, of almost 50 years
  - The good: polynomial time computable (PTIME)
  - The bad: NP-hard (intractable)
  - The ugly: PSPACE-hard, EXPTIME-hard, undecidable
What happens when it comes to big data?
- Assume an SSD with a scan rate of 6 GB/s. A linear scan of a data set D would take
  - 1.9 days when D is of 1 PB (10^15 B)
  - 5.28 years when D is of 1 EB (10^18 B)
- O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data
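The scan times quoted for 1 PB and 1 EB can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "6G/s" means 6 × 10^9 bytes per second and a 365-day year; the name `scan_seconds` is made up for illustration.

```python
SCAN_RATE = 6e9  # bytes per second; assumes "6G/s" means 6 GB/s

def scan_seconds(size_bytes, rate=SCAN_RATE):
    """Time for one sequential pass over size_bytes of data."""
    return size_bytes / rate

pb_days = scan_seconds(1e15) / 86_400            # seconds per day
eb_years = scan_seconds(1e18) / (86_400 * 365)   # seconds per 365-day year
print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```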
5 Tractability revisited for queries on big data
- A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
  - for any database D on which queries of Q are defined, D' = Π(D) (hence D' is of polynomial size), and
  - for all queries Q in the class that are defined on D, Q(D) can be computed by evaluating Q (possibly after rewriting) on D' in parallel polylog time (NC), i.e., parallel log^k(|D|, |Q|) time
[Figure: D is preprocessed into Π(D); queries Q1, Q2, ... are then evaluated on Π(D)]
- Does it work? If a linear scan of D could be done in log(|D|) time:
  - 15 seconds when D is of 1 PB, instead of 1.9 days
  - 18 seconds when D is of 1 EB, rather than 5.28 years
BD-tractable queries are feasible on big data
6 BD-tractable queries
- A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
  - for any database D on which queries of Q are defined, D' = Π(D), and
  - for all queries Q in the class that are defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC), i.e., in parallel, with more resources
ΠT^0_Q: the set of all BD-tractable query classes
- Preprocessing
  - a one-time process, offline, once for all queries in Q
  - indices, compression, views, incremental computation
  - does not necessarily reduce the size of D
Preprocessing: a common practice of database people
7 What query classes are BD-tractable?
- Boolean selection queries
  - Input: A dataset D
  - Query: Does there exist a tuple t in D such that t[A] = c?
  - Build a B-tree on the A-column values in D. Then all such selection queries can be answered in O(log(|D|)) time.
- Graph reachability queries (NL-complete)
  - Input: A directed graph G
  - Query: Does there exist a path from node s to t in G?
What else?
- Relational algebra + set recursion on ordered relational databases
Some natural query classes are BD-tractable
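The Boolean selection case can be sketched as follows. A sorted list searched with `bisect` stands in for the B-tree here (a simplifying assumption, not the index named on the slide), and the names `preprocess` and `exists` are hypothetical.

```python
import bisect

def preprocess(tuples, attr):
    """Π(D): sorted list of the A-column values, built once, offline."""
    return sorted(t[attr] for t in tuples)

def exists(index, c):
    """Is there a tuple t with t[A] = c? Binary search, O(log |D|)."""
    i = bisect.bisect_left(index, c)
    return i < len(index) and index[i] == c

D = [{"A": 7, "B": "x"}, {"A": 3, "B": "y"}, {"A": 42, "B": "z"}]
index = preprocess(D, "A")
print(exists(index, 42), exists(index, 5))
```

The sort is paid once in PTIME; every later selection query touches only a logarithmic number of index entries.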
8 Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
- Breadth-Depth Search (BDS)
  - Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
  - Question: Is u visited before v in the breadth-depth search of G?
  - The search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s's children are visited, it continues with the node on the top of the stack, which plays the role of s.
Is this problem (query class) BD-tractable? Here D is empty, and Q is (G, (u, v)).
- No. The problem is well known to be P-complete!
- We need PTIME to process each query (G, (u, v))!
- Preprocessing does not help us answer such queries.
Can we make it BD-tractable?
9 Make queries BD-tractable
Factorization: partition instances to identify a data part D for preprocessing, and a query part Q for operations
- Breadth-Depth Search (BDS)
  - Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
  - Question: Is u visited before v in the breadth-depth search of G?
Factorization: D is G = (V, E), Q is (u, v)
- Preprocessing Π(G): performs BDS on G, and returns a list M consisting of the nodes in V in the same order as they are visited
- For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time
ΠT_Q: the set of all query classes that can be made BD-tractable after proper factorization
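The factorization for BDS can be sketched as below. The sketch makes several assumptions: nodes are numbered by integers, the search starts at the lowest-numbered node and restarts at the lowest unvisited node for each further connected component, and a sorted list of (node, position) pairs is used so that the lookup is a genuine binary search. All function names are hypothetical.

```python
import bisect

def breadth_depth_order(adj):
    """Π(G): list M of nodes in the order breadth-depth search visits them."""
    visited, order = set(), []
    for root in sorted(adj):                  # restart once per component
        if root in visited:
            continue
        visited.add(root)
        order.append(root)
        stack = [root]
        while stack:
            s = stack.pop()
            children = [v for v in sorted(adj[s]) if v not in visited]
            visited.update(children)
            order.extend(children)            # visit all children of s first
            stack.extend(reversed(children))  # lowest-numbered child on top
    return order

def build_rank_index(order):
    """Sorted (node, position) pairs, so positions are found by binary search."""
    return sorted((node, pos) for pos, node in enumerate(order))

def position(index, node):
    i = bisect.bisect_left(index, (node, -1))
    return index[i][1]

def visited_before(index, u, v):
    """Query (u, v): answered in O(log |M|) after preprocessing."""
    return position(index, u) < position(index, v)

adj = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6], 5: [3], 6: [4]}
M = breadth_depth_order(adj)
index = build_rank_index(M)
print(M, visited_before(index, 6, 5))
```

On this graph the visiting order differs from both plain BFS and plain DFS, which is why node 6 comes before node 5.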
10 Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query classes are tractable on big data.
Are we done yet?
- No, there are a number of questions in connection with a complexity class!
  - Reductions: how to transform a problem to another in the class that we know how to solve, and hence make it BD-tractable?
  - Complete problems: is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the complexity class can be reduced? This is analogous to our familiar NP-complete problems.
  - How large is ΠT_Q? ΠT^0_Q? Compared to P? NC?
Why do we care?
- These questions are fundamental to any complexity class, such as P and NP
11 Reductions
Transformations for making queries BD-tractable
Departing from our familiar polynomial-time reductions, we need reductions that are in NC, and deal with both data D and query Q!
- NC-factor reductions ≤_NC: a pair of NC functions that allow re-factorizations (repartitioning the data and query parts), for ΠT_Q
- F-reductions ≤_F: a pair of NC functions that do not allow re-factorizations, for ΠT^0_Q; used to determine whether a query class is BD-tractable
- Properties
  - transitivity: if Q1 ≤_NC Q2 and Q2 ≤_NC Q3, then Q1 ≤_NC Q3 (also for ≤_F)
  - compatibility:
    - if Q1 ≤_NC Q2 and Q2 is in ΠT_Q, then so is Q1
    - if Q1 ≤_F Q2 and Q2 is in ΠT^0_Q, then so is Q1
Transform a given problem to one that we know how to solve
12 Complete problems
- A query class Q is complete for ΠT_Q if Q is in ΠT_Q and, moreover, for any query class Q' in ΠT_Q, Q' ≤_NC Q
- A query class Q is complete for ΠT^0_Q if Q is in ΠT^0_Q and, for any query class Q' in ΠT^0_Q, Q' ≤_F Q
Is there a complete problem for ΠT_Q (ΠT^0_Q)?
- There exists a natural query class Q that is complete for ΠT_Q
- Not for ΠT^0_Q
  - Unless P = NC, a query class complete for ΠT^0_Q is a witness for P \ NC (finding one is as hard as the big open question of whether P = NC)
  - Whether P = NC is as hard as whether P = NP
It is hard to find a complete problem for ΠT^0_Q
13 Comparing with P and NC
How large is ΠT_Q? How large is ΠT^0_Q?
- NC ⊆ ΠT_Q = P
  - All PTIME query classes can be made BD-tractable!
  - We need proper factorizations to answer PTIME queries on big data
- Unless P = NC, NC ⊆ ΠT^0_Q ⊊ P (separation: ΠT^0_Q is properly contained in P)
  - Unless P = NC, not all PTIME query classes are BD-tractable
[Figure: within PTIME, the BD-tractable query classes versus those that are not BD-tractable]
14 What can we get from BD-tractability?
Guidelines for the following.
- What query classes are feasible on big data? ΠT^0_Q
- What query classes can be made feasible to answer on big data? ΠT_Q
- How to determine whether it is feasible to answer a class Q of queries on big data?
  - Reduce Q to a complete problem Qc for ΠT_Q via ≤_NC
- If so, how to answer queries in Q?
  - Identify factorizations (≤_NC reductions) such that Q ≤_NC Qc
  - Compose the reduction and the algorithm for answering queries of Qc
A revision of the classical computational complexity theory
15 Making big data small
16 Scale independence
- The scale independence problem
  - Input: A dataset D, a query Q, and a bound M
  - Query: Does there exist a subset DQ of D such that
    - |DQ| ≤ M, and
    - Q(D) = Q(DQ)?
- A more general setting
  - Input: A query Q defined over a schema R, and a bound M
  - Query: Is it the case that for all instances D of R, there exists a subset DQ of D such that
    - |DQ| ≤ M, and
    - Q(D) = Q(DQ)?
Why do we care?
- The cost of query processing is independent of |D|!
- Scalable with big data D, when D grows!
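To make the first problem statement concrete, the sketch below checks it by brute force on a toy dataset. It only illustrates the definition: enumerating subsets is exponential and is not how the problem would be solved in practice. The relation, the query and the name `scale_independent_witness` are hypothetical.

```python
from itertools import combinations

def scale_independent_witness(D, Q, M):
    """Return a subset D_Q of D with |D_Q| <= M and Q(D_Q) == Q(D), or None."""
    target = Q(D)
    for size in range(min(M, len(D)) + 1):
        for subset in combinations(D, size):
            if Q(list(subset)) == target:
                return list(subset)
    return None

# Hypothetical relation of (person, city) tuples.
D = [("ann", "Edinburgh"), ("bob", "Paris"), ("cat", "Edinburgh"), ("dan", "Rome")]

def Q(db):
    """People who live in Edinburgh."""
    return {person for person, city in db if city == "Edinburgh"}

print(scale_independent_witness(D, Q, M=2))
print(scale_independent_witness(D, Q, M=1))
```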
17 Scale independent queries in practice?
Personalized social search queries (Facebook Graph Search)
- Find me all my friends who live in Edinburgh and like cycling
- Find all restaurants rated A that are in King of Prussia Mall
- Find me all restaurants in Edinburgh where my friends dined in 2013
- Why bounded?
  - Facebook: at most 5000 friends per person
  - At most K restaurants in a mall
  - At most 5000 friends, 365 days each year, and each (normal) person dines out at most once per day
To answer such a query, we need to access only a bounded amount of data
18 Query processing
- Access schemas (R, X, N)
  - an index on X for instances D of R
  - there exist at most N tuples sharing the same X-values in D (e.g., 365 days per year), and these tuples can be fetched efficiently
- Find a query plan that visits a bounded amount of data
- Decide whether a query is scale independent
- Complexity: the scale independence problem is
  - Σ^p_3-complete for conjunctive queries (SPC)
  - PSPACE-complete for first-order logic queries (SQL), but
  - in O(1) time for Boolean conjunctive queries if |Q| ≤ M!
- There are sufficient conditions for scale independence, based on rules
Incremental scale independence? Using views?
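A bounded query plan over access schemas might look like the sketch below, for the query "find my friends who live in Edinburgh and like cycling". The relations `friend(pid, fid)` and `person(pid, city, hobby)`, the bound of one `person` tuple per pid, and the class `AccessSchemaIndex` are assumptions made for illustration; only the 5000-friend bound comes from the slides.

```python
from collections import defaultdict

class AccessSchemaIndex:
    """Index for an access constraint (R, X, N): at most N tuples per X-value."""
    def __init__(self, tuples, key, bound):
        self.bound = bound
        self.buckets = defaultdict(list)
        for t in tuples:
            self.buckets[key(t)].append(t)
        assert all(len(b) <= bound for b in self.buckets.values()), "bound N violated"

    def fetch(self, x):
        return self.buckets.get(x, [])

friend = [("me", "ann"), ("me", "bob"), ("zed", "cat")]        # (pid, fid)
person = [("ann", "Edinburgh", "cycling"), ("bob", "Paris", "cycling"),
          ("cat", "Edinburgh", "cycling")]                      # (pid, city, hobby)

friend_by_pid = AccessSchemaIndex(friend, key=lambda t: t[0], bound=5000)
person_by_pid = AccessSchemaIndex(person, key=lambda t: t[0], bound=1)

def friends_in_city_with_hobby(pid, city, hobby):
    """Plan: fetch <= 5000 friend tuples, then <= 1 person tuple per friend."""
    accessed, answer = 0, []
    for _, fid in friend_by_pid.fetch(pid):
        accessed += 1
        for _, c, h in person_by_pid.fetch(fid):
            accessed += 1
            if c == city and h == hobby:
                answer.append(fid)
    return answer, accessed

answer, accessed = friends_in_city_with_hobby("me", "Edinburgh", "cycling")
print(answer, accessed)
```

Under these bounds the plan reads at most 5000 + 5000 × 1 tuples, however many people the two relations contain.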
19 How to make a query tractable on big data?
- Querying big data
  - Input: a query Q and big data G
  - Output: Q(G), the set of answers to Q in G
- A number of techniques
  - Distributed query processing
  - Query preserving data compression
  - Query answering using views
  - Bounded incremental evaluation
  - Top-k query answering with early termination
The cost of query processing: a function of |G| and |Q|. Too costly: O(|G|) time is already beyond reach in practice!
Can we effectively query big data?
- Approximate or inexact algorithms
- Exact algorithms? Make the cost of query processing independent of |G|!
MapReduce is not the only solution, and is not even the best one!
20 Distributed query processing
The cost of an evaluation algorithm: f(|G|, |Q|). O(n^2) or O(n^3) is too costly.
It is unlikely that we can lower its complexity, but can we reduce the size of its parameter G?
- Divide and conquer
  - partition G into fragments (G1, ..., Gn) of manageable sizes, distributed to various sites
  - upon receiving a query Q,
    - evaluate Q(Gi) in parallel, i.e., evaluate Q on the smaller Gi
    - collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Performance guarantees for evaluating regular reachability queries based on partial evaluation
Network traffic and response time: independent of |G|
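The divide-and-conquer scheme can be illustrated with plain reachability (a simpler query class than the regular reachability queries mentioned above). In this sketch, which simulates the sites in one process, each fragment reports only which border nodes its entry nodes can reach locally, and the coordinator searches the resulting small dependency graph. The function names and the `owner` mapping from node to fragment are hypothetical.

```python
from collections import defaultdict, deque

def partition(edges, owner):
    """Fragment G by node owner; record nodes entered from other fragments."""
    frags = {f: {"adj": defaultdict(list), "entry": set()} for f in set(owner.values())}
    for u, v in edges:
        frags[owner[u]]["adj"][u].append(v)
        if owner[u] != owner[v]:
            frags[owner[v]]["entry"].add(v)
    return frags

def local_eval(fid, frag, owner, s, t):
    """Partial answer of one site: entry node -> border nodes (or t) reached locally."""
    starts = set(frag["entry"])
    if owner[s] == fid:
        starts.add(s)
    partial = {}
    for e in starts:
        seen, queue, out = {e}, deque([e]), set()
        while queue:
            u = queue.popleft()
            if u == t or owner[u] != fid:
                out.add(u)           # target found, or the path leaves this fragment
                continue
            for v in frag["adj"].get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        partial[e] = out
    return partial

def reachable(edges, owner, s, t):
    frags = partition(edges, owner)
    dep = {}
    for fid, frag in frags.items():  # each call could run at its own site, in parallel
        dep.update(local_eval(fid, frag, owner, s, t))
    seen, queue = {s}, deque([s])    # coordinator: search the assembled partial answers
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in dep.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 1)]
owner = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
print(reachable(edges, owner, 1, 5), reachable(edges, owner, 5, 6))
```

The partial answers mention only entry and border nodes, so what is shipped to the coordinator depends on the number of cross-fragment edges rather than on the size of the fragments.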
21 Query preserving data compression
The cost of query processing: f(|G|, |Q|). Can we reduce the parameter?
- Query preserving compression <R, P> for a class L of queries
  - For any data collection G, Gc = R(G)
  - For any Q in L, Q(G) = P(Q, Gc)
Compress big G into a smaller Gc
22 What is new about query preserving compression?
- Query preserving compression <R, P> for a class L of queries
  - For any dataset G, Gc = R(G)
  - For any Q in L, Q(G) = P(Q, Gc)
- Relative to a class L of queries of the users' choice
  - Better compression ratio: Gc keeps only the information needed by L queries
- No need to decompress Gc
  - For any Q in L, Q(Gc) can be directly computed
  - Any algorithms and indexing structures for G can be used for Gc
  - In contrast to lossless compression, no need to restore the original graph G
- Gc is computed once for all queries Q in L, and is incrementally maintained
Reduction: 95% on average for reachability queries
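A small instance of a pair <R, P> for reachability queries is sketched below. It uses a simpler compression than the one the slides report results for: R collapses each strongly connected component into a single node (Kosaraju's algorithm), and P rewrites a query (u, v) to component identifiers and searches Gc. The names `compress` and `reach` are hypothetical, and node values are assumed never to be `None`.

```python
from collections import defaultdict, deque

def compress(nodes, edges):
    """R: collapse each strongly connected component into one node."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    seen, finish = set(), []
    for root in nodes:                       # pass 1: DFS finishing order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            u, it = stack[-1]
            nxt = next((v for v in it if v not in seen), None)
            if nxt is None:
                finish.append(u)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(adj[nxt])))
    comp = {}
    for root in reversed(finish):            # pass 2: sweep the reversed graph
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if v not in comp:
                    comp[v] = root
                    stack.append(v)
    gc = defaultdict(set)
    for u, v in edges:
        if comp[u] != comp[v]:
            gc[comp[u]].add(comp[v])
    return comp, gc

def reach(comp, gc, u, v):
    """P: rewrite (u, v) to component ids and search Gc, never touching G."""
    src, dst = comp[u], comp[v]
    seen, queue = {src}, deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            return True
        for y in gc.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
comp, gc = compress(nodes, edges)
print(len(set(comp.values())), reach(comp, gc, 1, 5), reach(comp, gc, 5, 1))
```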
23 Answering queries using views
The cost of query processing: f(|G|, |Q|). Can we compute Q(G) without accessing G, i.e., independent of |G|?
- Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
  - Q and Q' are equivalent: for any G, Q(G) = Q'(G)
  - Q' only accesses V(G)
- Answering graph pattern queries on big social graphs
  - Regardless of how big G is, the cost is independent of |G|: the complexity is no longer a function of |G|
  - V(G) is often much smaller than G (4% to 12% on real-life data)
Improvement: 97% for graph pattern matching
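The idea can be shown on a relational toy example (simpler than the graph pattern views the slides report on). The relation `dined(person, restaurant, city, year)`, the view and both queries below are hypothetical; the point is only that Q' returns the same answer as Q while reading nothing but V(G).

```python
def view(dined):
    """V(G): materialized (restaurant, city, year) triples, duplicates removed."""
    return {(r, city, year) for _person, r, city, year in dined}

def q(dined, city, year):
    """Q: restaurants in a city where somebody dined in a year, on the base data G."""
    return {r for _person, r, c, y in dined if c == city and y == year}

def q_rewritten(v, city, year):
    """Q': equivalent to Q, but accesses only the view V(G)."""
    return {r for r, c, y in v if c == city and y == year}

G = [("ann", "r1", "Edinburgh", 2013), ("bob", "r1", "Edinburgh", 2013),
     ("cat", "r2", "Nanjing", 2013), ("ann", "r3", "Edinburgh", 2012)]
V = view(G)
print(q(G, "Edinburgh", 2013), q_rewritten(V, "Edinburgh", 2013), len(V), len(G))
```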
24 Incremental query answering
- Real-life data is dynamic: it constantly changes, by ΔG (e.g., 5%/week in Web graphs)
- Re-compute Q(G ⊕ ΔG) starting from scratch?
- Changes ΔG are typically small
Compute Q(G) once, and then incrementally maintain it
- Incremental query processing
  - Input: Q, G, Q(G) (the old output), and ΔG (changes to the input)
  - Output: ΔM (changes to the output) such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM (the new output)
When changes ΔG to the data G are small, typically so are the changes ΔM to the output Q(G ⊕ ΔG)
Minimizing unnecessary recomputation
25 Complexity of incremental problems
- Incremental query answering
  - Input: Q, G, Q(G), ΔG
  - Output: ΔM such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM
Incremental algorithms?
- The cost of query processing: a function of |G| and |Q|
- For incremental algorithms: |CHANGED|, the size of changes in
  - the input: ΔG, and
  - the output: ΔM
  This is the updating cost that is inherent to the incremental problem itself
- Bounded: is the cost expressible as f(|CHANGED|)?
- Optimal: in O(|CHANGED|)? This is the amount of work absolutely necessary for any incremental algorithm
Complexity analysis in terms of the size of changes
Effective on graph pattern matching
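As a small illustration of maintaining an output instead of recomputing it, the sketch below keeps the set of nodes reachable from a fixed source up to date under edge insertions. The query and the function names are assumptions chosen for brevity (the slides report results on graph pattern matching), and deletions are not handled.

```python
from collections import defaultdict, deque

def reachable_from(adj, s):
    """Q(G): all nodes reachable from s, computed from scratch."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def insert_edge(adj, reached, u, v):
    """Apply ΔG = {insert (u, v)} and return ΔM, the newly reachable nodes."""
    adj[u].append(v)
    if u not in reached or v in reached:
        return set()                       # the output does not change
    delta, queue = {v}, deque([v])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in reached and y not in delta:
                delta.add(y)
                queue.append(y)
    reached |= delta                       # Q(G ⊕ ΔG) = Q(G) ⊕ ΔM
    return delta

adj = defaultdict(list, {1: [2], 3: [4], 4: [5]})
reached = reachable_from(adj, 1)
delta = insert_edge(adj, reached, 2, 3)
print(delta, reached)
```

The update explores only the newly reachable nodes and their outgoing edges, so its work tracks the size of the change rather than |G|.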
26 Top-k query answering
Traditional query answering: compute Q(G)
- It is expensive to compute when G is large
- The result Q(G) is excessively large for the users to inspect: it may be larger than G
- Top-k query answering
  - Input: Query Q, dataset G and a positive integer k
  - Output: A top-ranked set of k elements in Q(G)
Early termination: return top-k matches without computing Q(G)
Improvement: 65% on graph pattern matching
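Early termination can be sketched as follows, under the assumption that candidates are already available in descending rank order (for example from a preprocessed index) and that testing a candidate against Q is the expensive step. The names `top_k` and `matches` are hypothetical.

```python
from itertools import islice

def top_k(ranked_candidates, matches, k):
    """Return the k best matches; scanning stops as soon as k are found."""
    checked = 0

    def hits():
        nonlocal checked
        for item in ranked_candidates:
            checked += 1
            if matches(item):
                yield item

    result = list(islice(hits(), k))
    return result, checked

# Hypothetical (score, name) candidates, best first; a match has an even score.
ranked = [(9, "a"), (8, "b"), (7, "c"), (6, "d"), (5, "e"), (4, "f")]
result, checked = top_k(ranked, lambda item: item[0] % 2 == 0, k=2)
print(result, checked)
```

Only four of the six candidates are examined here; the full answer set Q(G) is never materialized.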
27 Answering queries on big data
Yes, MapReduce is useful, but it is not the only way!
- Partial evaluation for distributed query processing: can we get performance guarantees?
- Query preserving compression: convert big data to small data
- Query answering using views: make big data small
- Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data
- Top-k query answering and early termination: find answers without traversing the entire data set
Preprocessing methods
Make big data small
Combinations of these can do better than MapReduce!
28 Further reading
- W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012.
- W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT, 2012 (invited).
- W. Fan, X. Wang, and Y. Wu. Performance Guarantees for Distributed Reachability Queries, VLDB, 2012.
- W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching, VLDB, 2014.
- W. Fan, J. Li, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, SIGMOD, 2011 (TODS 38(3), 2013).
- W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism Revisited for Graph Matching, VLDB, 2010.
- W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph Pattern Matching: From Intractable to Polynomial Time, VLDB, 2010.
29 Approximate query answering
30 Graph Pattern Matching
- Given a pattern graph Q and a data graph G, find all the matches of Q in G.
- Subgraph isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff (f(u), f(u')) ∈ G
- Applications
  - pattern recognition
  - knowledge discovery
  - intelligence analysis
  - transportation network analysis
  - Web site classification
  - social position detection
  - user targeted advertising
Widely used in social network analysis
31 Problems
- Real-life social graphs are typically large, e.g., Facebook: 1B users, 140B links
- Subgraph isomorphism
  - What is the complexity of determining whether there exists a match of a pattern Q in a graph G? NP-complete
  - Given a pattern Q and a graph G, how many matches of Q can possibly exist in G? Possibly exponentially many
- O(|G|) time is already beyond reach in practice! Subgraph isomorphism is too costly for social network analysis
- Nonetheless, we need to conduct graph pattern matching on social networks, among other things
What can we do if a class of queries is NOT BD-tractable?
32 Relaxing the semantics of queries
So, graph simulation for social data analysis, instead of subgraph isomorphism
- Graph simulation: a binary relation S on nodes such that
  - for each node u in Q, there exists a node v in G such that (u, v) ∈ S, and
  - for each pair (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G such that (u', v') ∈ S
- Much cheaper
  - Complexity of computing the set of matches: quadratic time
  - The number of matches of Q in G: 1, since there exists a unique, maximum match relation S
- Effective
  - Social position detection
  - User targeted advertising
A variety of extensions to capture topology, with low complexity
Quadratic time is still too expensive! How to deal with it?
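The maximum match relation under graph simulation can be computed by a simple fixpoint, sketched below. Two assumptions go beyond the slide: nodes carry labels and a pattern node may only match data nodes with the same label, and the pruning loop is the naive one rather than an optimized quadratic-time algorithm. The name `max_simulation` is hypothetical.

```python
def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """q_nodes / g_nodes map node -> label; edges are sets of (src, dst) pairs."""
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Start from all label-compatible pairs, then prune until nothing changes.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u], changed = keep, True
    if any(not vs for vs in sim.values()):
        return {}            # some pattern node has no match: Q does not match G
    return sim

Qn, Qe = {"x": "A", "y": "B"}, {("x", "y")}
Gn = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
Ge = {("a1", "b1")}
S = max_simulation(Qn, Qe, Gn, Ge)
print(S)
```

Here a2 is pruned because it has no outgoing edge to a B-labelled node, while both b1 and b2 remain matches of y.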
33 The approximation theory revisited
- If a query class is not BD-tractable and its semantics can't be relaxed, is it still feasible to answer such queries on big data? Yes, by approximation
- When exact algorithms are infeasible, we find inexact algorithms with performance guarantees
  - the answers can't be too far off!
  - feasible on big data: reducing big data to small data
  - performance guarantees whenever possible
The need for revising the traditional approximation theory, for querying big data
Data-driven approximation
34 Data-driven approximation
- Resource-bounded query answering
  - Input: A dataset D, a class Q of queries, a resource ratio α ∈ (0, 1)
  - Question: Develop an algorithm that, given any query Q in the class, computes Q(D) by accessing at most α|D| amount of data
Make big data small!
- Personalized social searches and reachability queries
  - Find me all my friends who live in Nanjing and like cycling
  - Does Michael connect to Lady Gaga through social links?
We can do personalized social search with α = 0.0015%!
- 1.5 × 10^-5 × 1 PB (10^15 B) = 15 × 10^9 B = 15 GB
- We are making big data of PB size as small as 15 GB!
We can make big data of PB size fit into our memory!
35 Summing up
36 Summary and Review
- What is BD-tractability? Why do we care about it?
- What is scale independence?
- How to make big data small?
- Is MapReduce the only way for querying big data? Can we do better than it?
- What is query preserving data compression? Query answering using views? Bounded incremental query answering? Top-k query answering?
- If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?
- Develop an algorithm for processing a class of queries on big data, by combining various methods discussed