Querying Big Data

About This Presentation
Title:

Querying Big Data

Description:

TDD: Research Topics in Distributed Databases Schema matching, schema mapping, data integration Schema matching Schema mapping: XML Data integration – PowerPoint PPT presentation

Slides: 37
Provided by: homepage91

Transcript and Presenter's Notes

Title: Querying Big Data


1

TDD: Research Topics in Distributed Databases
  • Querying Big Data
  • Tractability revisited for querying big data
  • BD-tractability
  • Reductions, complete problems, separation results
  • Querying big data
  • Scale independence
  • Making big data small
  • Approximate query answering
  • Relaxing query semantics
  • Data-driven approximation

2
Big data
  • Volume: in PB (10^15 B) or EB (10^18 B)
  • Variety: heterogeneous, semi-structured or
    unstructured
  • Velocity: dynamic
  • Veracity: trust in its quality

What new challenges does big data introduce?
  • Computer science is about the computation of
    a function f(x)
  • x is big: PB (10^15 B) or EB (10^18 B)

In fact, big data is any data that cannot be
handled with your available resources
3
A new complexity theory for big data
4
The good, the bad and the ugly
  • Traditional computational complexity theory of
    almost 50 years
  • The good: polynomial time computable (PTIME)
  • The bad: NP-hard (intractable)
  • The ugly: PSPACE-hard, EXPTIME-hard, undecidable

What happens when it comes to big data?
  • Assume an SSD with a scan rate of 6 GB/s. A
    linear scan of a dataset D would take
  • 1.9 days when D is of 1 PB (10^15 B)
  • 5.28 years when D is of 1 EB (10^18 B)
  • O(n) time is already beyond reach on big data in
    practice!

Polynomial time queries become intractable on big
data
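The scan times quoted above can be checked with a few lines of arithmetic. This is only a sketch: the 6 GB/s rate comes from the slide, while the constant and function names are made up for illustration, and a year is taken to be 365 days.

```python
SCAN_RATE = 6e9  # bytes per second: the 6 GB/s SSD assumed on the slide

def scan_seconds(size_bytes, rate=SCAN_RATE):
    """Time for one sequential pass over size_bytes of data."""
    return size_bytes / rate

PB, EB = 10**15, 10**18
days_for_pb = scan_seconds(PB) / 86_400            # 86,400 seconds per day
years_for_eb = scan_seconds(EB) / (86_400 * 365)   # 365-day year assumed

print(f"1 PB: {days_for_pb:.2f} days")
print(f"1 EB: {years_for_eb:.2f} years")
```

Rounded to the precision used on the slide, this gives about 1.9 days for 1 PB and about 5.28 years for 1 EB.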
5
Tractability revisited for queries on big data
  • A class Q of queries is BD-tractable if there
    exists a PTIME preprocessing function Π such
    that
  • for any database D on which queries of Q are
    defined, D' = Π(D), and
  • for all queries Q in the class defined on D,
    Q(D) can be computed by evaluating Q on D' in
    parallel polylog time (NC)

Π is in PTIME, hence D' is of polynomial size
Q may be rewritten before it is evaluated on D'
Evaluation takes parallel log^k(|D|, |Q|) time
[Figure: D is preprocessed once into Π(D); queries
Q1, Q2, ... are then answered on Π(D)]
  • Does it work? Suppose a linear scan of D could be
    done in log(|D|) time
  • 15 seconds when D is of 1 PB instead of 1.9 days
  • 18 seconds when D is of 1 EB rather than 5.28
    years

BD-tractable queries are feasible on big data
6
BD-tractable queries
  • A class Q of queries is BD-tractable if there
    exists a PTIME preprocessing function Π such
    that
  • for any database D on which queries of Q are
    defined, D' = Π(D), and
  • for all queries Q in the class defined on D,
    Q(D) can be computed by evaluating Q on D' in
    parallel polylog time (NC)

ΠT^0_Q: the set of all BD-tractable query classes
Evaluation is in parallel, with more resources
  • Preprocessing
  • one-time process, offline, once for all queries
    in Q
  • indices, compression, views, incremental
    computation, ...

Preprocessing does not necessarily reduce the size
of D
Preprocessing: a common practice of database
people
7
What query classes are BD-tractable?
  • Boolean selection queries
  • Input: A dataset D
  • Query: Does there exist a tuple t in D such that
    t[A] = c?
  • Build a B-tree on the A-column values in D. Then
    all such selection queries can be answered in
    O(log |D|) time.
  • Graph reachability queries (NL-complete)
  • Input: A directed graph G
  • Query: Does there exist a path from node s to t
    in G?

What else?
Relational algebra + set recursion on ordered
relational databases
Some natural query classes are BD-tractable
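A minimal sketch of the Boolean selection case: a sorted list with binary search stands in for the B-tree named on the slide (a swapped-in structure, chosen only because it ships with Python). The function names and the toy dataset are illustrative assumptions.

```python
from bisect import bisect_left

def preprocess(dataset, attribute):
    """One-off preprocessing: sort the attribute's values (stand-in for a B-tree)."""
    return sorted(row[attribute] for row in dataset)

def exists(index, c):
    """Boolean selection: is there a tuple t with t[A] = c? Uses O(log |D|) comparisons."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c

# Toy dataset; the attribute names are hypothetical.
D = [{"A": 7, "B": "x"}, {"A": 3, "B": "y"}, {"A": 9, "B": "z"}]
index_on_A = preprocess(D, "A")
print(exists(index_on_A, 9), exists(index_on_A, 4))
```

The sort is paid once, offline; every later query touches only a logarithmic number of entries.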
8
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
  • Breadth-Depth Search (BDS)
  • Input: An undirected graph G = (V, E) with a
    numbering on its nodes, and a pair (u, v) of
    nodes in V
  • Question: Is u visited before v in the
    breadth-depth search of G?

Breadth-depth search starts at a node s and visits
all its children, pushing them onto a stack in the
reverse order induced by the vertex numbering.
After all of s's children are visited, it continues
with the node on the top of the stack, which plays
the role of s.
Is this problem (query class) BD-tractable?
Here D is empty, and Q is (G, (u, v))
  • No. The problem is well known to be P-complete!
  • We need PTIME to process each query (G, (u, v))!
  • Preprocessing does not help us answer such
    queries.

Can we make it BD-tractable?
9
Make queries BD-tractable
Factorization: partition instances to identify a
data part D for preprocessing, and a query part Q
for operations
  • Breadth-Depth Search (BDS)
  • Input: An undirected graph G = (V, E) with a
    numbering on its nodes, and a pair (u, v) of
    nodes in V
  • Question: Is u visited before v in the
    breadth-depth search of G?

Factorization: D is G = (V, E), Q is (u, v)
  • Preprocessing: Π(G) performs BDS on G, and
    returns a list M consisting of the nodes in V in
    the same order as they are visited
  • For all queries (u, v), whether u occurs before v
    can be decided by a binary search on M, in
    log(|M|) time

BDS becomes BD-tractable after proper factorization
ΠT_Q: the set of all query classes that can be
made BD-tractable
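The factorization of BDS can be sketched as follows. This is one reading of the informal description of breadth-depth search, not a reference implementation. Assumptions made for illustration: the graph is an adjacency dict, the search starts at a caller-chosen node, and every queried node is reachable from it. M is stored as (node, visit rank) pairs sorted by node so that binary search applies.

```python
from bisect import bisect_left

def breadth_depth_order(adj, start):
    """Visit order of a breadth-depth search from `start`."""
    visited, order, stack = {start}, [start], []
    current = start
    while True:
        # Visit all unvisited neighbours of the current node in numbering order.
        children = [w for w in sorted(adj[current]) if w not in visited]
        visited.update(children)
        order.extend(children)
        # Push in reverse order so the lowest-numbered child ends up on top.
        stack.extend(reversed(children))
        if not stack:
            return order
        current = stack.pop()

def preprocess(adj, start):
    """Pi(G): run BDS once, keep (node, visit rank) pairs sorted by node."""
    order = breadth_depth_order(adj, start)
    pairs = sorted((node, rank) for rank, node in enumerate(order))
    return [n for n, _ in pairs], [r for _, r in pairs]

def visited_before(index, u, v):
    """Query (u, v): two binary searches; assumes u and v were both visited."""
    nodes, ranks = index
    return ranks[bisect_left(nodes, u)] < ranks[bisect_left(nodes, v)]

# Toy undirected graph as an adjacency dict (hypothetical data).
G = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6], 5: [3], 6: [4]}
M = preprocess(G, start=1)
print(visited_before(M, 6, 5))
```

The PTIME work happens once in preprocess; each query (u, v) then costs only logarithmic time in |M|.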
10
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine
what query classes are tractable on big data.
Are we done yet?
  • No, a number of questions arise in connection
    with a complexity class!
  • Reductions: how to transform a problem to
    another in the class that we know how to solve,
    and hence make it BD-tractable?
  • Complete problems: Is there a natural problem (a
    class of queries) that is the hardest one in the
    complexity class? A problem to which all problems
    in the complexity class can be reduced
    (analogous to our familiar NP-complete problems)
  • How large is ΠT_Q? ΠT^0_Q? Compared to P? NC?

Why do we care?
Fundamental to any complexity class: P, NP, ...
11
Reductions
Transformations for making queries BD-tractable
Departing from our familiar polynomial-time
reductions, we need reductions that are in NC,
and deal with both data D and query Q!
  • NC-factor reductions ≤NC: a pair of NC functions
    that allow re-factorizations (repartitioning the
    data and query parts), for ΠT_Q
  • F-reductions ≤F: a pair of NC functions that
    do not allow re-factorizations, for ΠT^0_Q

Used to determine whether a query class is
BD-tractable
  • Properties
  • transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then
    Q1 ≤NC Q3 (also for ≤F)
  • compatibility:
  • if Q1 ≤NC Q2 and Q2 is in ΠT_Q, then so is Q1.
  • if Q1 ≤F Q2 and Q2 is in ΠT^0_Q, then so is Q1.

Transform a given problem to one that we know how
to solve
12
Complete problems
  • A query class Q is complete for ΠT_Q if Q is in
    ΠT_Q, and moreover, for any query class Q' in
    ΠT_Q, Q' ≤NC Q
  • A query class Q is complete for ΠT^0_Q if Q is in
    ΠT^0_Q, and for any query class Q' in ΠT^0_Q,
    Q' ≤F Q

Are there complete problems for ΠT_Q (ΠT^0_Q)?
  • There exists a natural query class Q that is
    complete for ΠT_Q
  • Not for ΠT^0_Q
  • Unless P = NC, a query class complete for ΠT^0_Q
    is a witness for P \ NC (finding one is as hard
    as the big open question of whether P = NC)
  • Whether P = NC is as hard as whether P = NP

It is hard to find a complete problem for ΠT^0_Q
13
Comparing with P and NC
How large is ΠT_Q? How large is ΠT^0_Q?
  • NC ⊆ ΠT_Q = P
  • All PTIME query classes can be made
    BD-tractable!
  • Unless P = NC, NC ⊆ ΠT^0_Q ⊂ P
  • Unless P = NC, not all PTIME query classes are
    BD-tractable

Separation: we need proper factorizations to answer
PTIME queries on big data
[Figure: PTIME splits into BD-tractable and not
BD-tractable query classes; the BD-tractable ones
are properly contained in P]
14
What can we get from BD-tractability?
Guidelines for the following.
  • What query classes are feasible on big data?
    ΠT^0_Q
  • What query classes can be made feasible to answer
    on big data? ΠT_Q
  • How to determine whether it is feasible to answer
    a class Q of queries on big data?
  • Reduce Q to a complete problem Qc for ΠT_Q via
    ≤NC
  • If so, how to answer queries in Q?
  • Identify factorizations (≤NC reductions) such
    that Q ≤NC Qc
  • Compose the reduction and the algorithm for
    answering queries of Qc

A revision of the classical computational
complexity theory
15
Making big data small
16
Scale independence
  • The scale independence problem
  • Input: A dataset D, a query Q, and a bound M
  • Query: Does there exist a subset DQ of D such
    that
  • |DQ| ≤ M, and
  • Q(D) = Q(DQ)?
  • A more general setting
  • Input: A query Q defined over a schema R, and a
    bound M
  • Query: Is it the case that for all instances D
    of R, there exists a subset DQ of D such that
  • |DQ| ≤ M, and
  • Q(D) = Q(DQ)?
  • Why do we care?
  • The cost of query processing is independent of
    |D|!
  • Scalable with big data D, when D grows!

17
Scale independent queries in practice?
Personalized social search queries (Facebook
Graph Search)
  • Find me all my friends who live in Edinburgh and
    like cycling
  • Find all restaurants rated A that are in King
    of Prussia Mall
  • Find me all restaurants in Edinburgh where my
    friends dined in 2013.
  • Bounded number of tuples
  • Why bounded?
  • Facebook: at most 5000 friends per person
  • At most K restaurants in a mall
  • At most 5000 friends, there are 365 days each
    year, and each person dines out at most once per
    day (a normal person)

To answer a query, we need to access a bounded
amount of data
18
Query processing
  • Access schemas (R, X, N)
  • an index on X for instances D of R
  • there exist at most N tuples sharing the same X
    values in D (e.g., 365 days per year), and these
    tuples can be fetched efficiently
  • find a query plan that visits a bounded amount of
    data
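A small sketch of how an access schema (R, X, N) bounds the data a query plan touches, using the "friends who live in a given city and like cycling" search. Everything here is an assumption made for illustration: the class name, the relation layouts, and the bounds (5000 friends per user, one profile per id).

```python
from collections import defaultdict

class BoundedIndex:
    """Index on attribute X of a relation, with at most `bound` tuples per X-value."""
    def __init__(self, rows, key, bound):
        self.bound, self.fetched = bound, 0
        self.buckets = defaultdict(list)
        for row in rows:
            self.buckets[row[key]].append(row)
            if len(self.buckets[row[key]]) > bound:
                raise ValueError("data violates the access schema")

    def fetch(self, value):
        """Return the (at most `bound`) tuples for one X-value, counting accesses."""
        rows = self.buckets.get(value, [])
        self.fetched += len(rows)
        return rows

def friends_in_city_with_hobby(friend_idx, person_idx, me, city, hobby):
    """Query plan: fetch my friends, then one profile per friend; never scan D."""
    result = []
    for f in friend_idx.fetch(me):
        for p in person_idx.fetch(f["friend"]):
            if p["city"] == city and hobby in p["hobbies"]:
                result.append(p["id"])
    return result

# Hypothetical toy data.
friends = [{"user": "ann", "friend": "bob"}, {"user": "ann", "friend": "cat"},
           {"user": "dan", "friend": "ann"}]
people = [{"id": "ann", "city": "Edinburgh", "hobbies": {"chess"}},
          {"id": "bob", "city": "Edinburgh", "hobbies": {"cycling"}},
          {"id": "cat", "city": "Glasgow", "hobbies": {"cycling"}},
          {"id": "dan", "city": "Edinburgh", "hobbies": {"cycling"}}]
friend_idx = BoundedIndex(friends, "user", bound=5000)
person_idx = BoundedIndex(people, "id", bound=1)
answer = friends_in_city_with_hobby(friend_idx, person_idx, "ann", "Edinburgh", "cycling")
print(answer, friend_idx.fetched + person_idx.fetched)
```

However large the two relations grow, this plan fetches at most 5000 friend tuples and 5000 profile tuples per query.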

Deciding whether a query is scale independent
  • Complexity: the scale independence problem is
  • Σ^p_3-complete for conjunctive queries (SPC)
  • PSPACE-complete for first-order logic queries
    (SQL), but
  • in O(1) time for Boolean conjunctive queries if
    |Q| ≤ M!

There are sufficient conditions for scale
independence, based on rules
Incremental scale independence? Using views?
19
How to make a query tractable on big data?
  • Querying big data
  • Input: Query Q, and big data G
  • Output: Q(G), the set of answers to Q in G
  • A number of techniques
  • Distributed query processing
  • Query preserving data compression
  • Query answering using views
  • Bounded incremental evaluation
  • Top-k query answering with early termination

Computing Q(G) directly is too costly
The cost of query processing: a function of |G|
and |Q|
O(|G|) time is already beyond reach in practice!
Can we effectively query big data?
  • Approximate or inexact algorithms
  • Exact algorithms?

Make the cost of query processing independent
of |G|!
MapReduce is not the only solution, and is not
even the best one!
20
Distributed query processing
The cost of the evaluation algorithm: f(|G|, |Q|)
O(n^2) or O(n^3) is too costly
It is unlikely that we can lower its complexity,
but can we reduce the size of its parameter G?
  • Divide and conquer
  • partition G into fragments (G1, ..., Gn) of
    manageable sizes, distributed to various sites
  • upon receiving a query Q,
  • evaluate Q(Gi) on the smaller Gi in parallel
  • collect partial answers at a coordinator site,
    and assemble them to find the answer Q(G) in
    the entire G

Performance guarantees for evaluating regular
reachability queries based on partial evaluation
Network traffic and response time: independent of
|G|
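The divide-and-conquer scheme can be illustrated for plain reachability queries. This is a simplified single-process sketch in the spirit of partial evaluation, not the algorithm of any particular paper. The "sites" are simulated by a loop, and the function names and the `site_of` mapping are assumptions.

```python
from collections import defaultdict, deque

def local_partial_answer(fragment_nodes, out_edges, entries, target):
    """At one site: for each entry node, the border nodes (and target) it reaches locally."""
    result = {}
    for e in entries:
        seen, queue, reached = {e}, deque([e]), set()
        while queue:
            x = queue.popleft()
            if x == target:
                reached.add(target)
            if x not in fragment_nodes:  # node owned by another site: stop here
                reached.add(x)
                continue
            for y in out_edges.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        result[e] = reached
    return result

def reachable(edges, site_of, source, target):
    """Coordinator: gather partial answers, then search the small dependency graph."""
    sites, out_edges, entries = defaultdict(set), defaultdict(list), defaultdict(set)
    for node, site in site_of.items():
        sites[site].add(node)
    for a, b in edges:
        out_edges[a].append(b)
        if site_of[a] != site_of[b]:
            entries[site_of[b]].add(b)  # b is entered from another fragment
    entries[site_of[source]].add(source)
    dependency = {}
    for site, nodes in sites.items():  # conceptually runs in parallel, one call per site
        local_edges = {a: out_edges[a] for a in nodes}
        dependency.update(local_partial_answer(nodes, local_edges, entries[site], target))
    seen, queue = {source}, deque([source])
    while queue:
        x = queue.popleft()
        if x == target:
            return True
        for y in dependency.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Toy graph split over three hypothetical sites.
edges = [(1, 2), (2, 3), (3, 4), (5, 1)]
site_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
print(reachable(edges, site_of, 1, 4), reachable(edges, site_of, 4, 1))
```

Each site reports only facts about its border nodes, so what the coordinator receives depends on the number of cross-fragment edges rather than on the size of G.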
21
Query preserving data compression
The cost of query processing: f(|G|, |Q|)
Can we reduce the parameter |G|?
  • Query preserving compression <R, P> for a class L
    of queries
  • For any data collection G, Gc = R(G)
  • For any Q in L, Q(G) = P(Q, Gc)

Compress big G into a smaller Gc
22
What is new about query preserving compression?
  • Query preserving compression <R, P> for a class L
    of queries
  • For any dataset G, Gc = R(G)
  • For any Q in L, Q(G) = P(Q, Gc)
  • Relative to a class L of queries of the user's
    choice
  • Better compression ratio: Gc keeps only the
    information needed by L queries
  • No need to decompress Gc
  • For any Q in L, Q(Gc) can be directly computed
  • Any algorithms and indexing structures for G can
    be used for Gc

In contrast to lossless compression, no need to
restore the original graph G
  • Gc is computed once for all queries Q in L, and
    is incrementally maintained

Reduction: 95% on average for reachability queries
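A minimal instance of a pair <R, P> for reachability queries. Here R collapses each strongly connected component into one node (via Kosaraju's algorithm), which is a simpler compression than the one in the cited SIGMOD 2012 paper and is used only as an illustration. P rewrites a query (u, v) to the component representatives and searches Gc alone. All names are hypothetical.

```python
from collections import defaultdict

def compress(nodes, edges):
    """R: map each node to its SCC representative and return the compressed edge set."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    def dfs_postorder(start, adj, seen, out):
        # Iterative DFS that appends nodes to `out` in finishing order.
        stack = [(start, iter(adj[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    seen, finish = set(), []
    for n in nodes:
        if n not in seen:
            dfs_postorder(n, fwd, seen, finish)
    comp, seen = {}, set()
    for n in reversed(finish):  # second pass on the reversed graph
        if n not in seen:
            members = []
            dfs_postorder(n, rev, seen, members)
            for m in members:
                comp[m] = n  # n acts as the component's representative
    c_edges = {(comp[a], comp[b]) for a, b in edges if comp[a] != comp[b]}
    return comp, c_edges

def reaches(comp, c_edges, u, v):
    """P: answer 'does u reach v in G?' on Gc only, without decompressing."""
    adj = defaultdict(list)
    for a, b in c_edges:
        adj[a].append(b)
    src, dst = comp[u], comp[v]
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        if x == dst:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
comp, c_edges = compress(nodes, edges)
print(len(c_edges), reaches(comp, c_edges, 2, 5), reaches(comp, c_edges, 5, 1))
```

The cycle 1, 2, 3 collapses to a single node, so five edges shrink to two while every reachability answer is preserved.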
23
Answering queries using views
The cost of query processing: f(|G|, |Q|)
Can we compute Q(G) without accessing G, i.e.,
independent of |G|?
  • Query answering using views: given a query Q in a
    language L and a set V of views, find another
    query Q' such that
  • Q and Q' are equivalent: for any G, Q(G) = Q'(G)
  • Q' only accesses V(G)
  • Answering graph pattern queries on big social
    graphs
  • Regardless of how big G is, the cost is
    independent of |G|
  • V(G) is often much smaller than G (4% to 12%
    on real-life data)

Improvement: 97% for graph pattern matching
The complexity is no longer a function of |G|
24
Incremental query answering
  • Real-life data is dynamic: it constantly changes
    (ΔG), e.g., 5% per week in Web graphs
  • Re-compute Q(G ⊕ ΔG) starting from scratch?
  • Changes ΔG are typically small

Compute Q(G) once, and then incrementally
maintain it
  • Incremental query processing
  • Input: Q, G, Q(G) (the old output), and ΔG
    (changes to the input)
  • Output: ΔM (changes to the output) such that
    Q(G ⊕ ΔG) = Q(G) ⊕ ΔM (the new output)

When changes ΔG to the data G are small,
typically so are the changes ΔM to the output
Q(G ⊕ ΔG)
Minimizing unnecessary recomputation
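As a small illustration of incremental evaluation, the sketch below maintains Q(G) = "the set of nodes reachable from a fixed source" under edge insertions and returns the change to the output for each update. It handles insertions only; deletions need more machinery. The class and method names are assumptions.

```python
from collections import defaultdict

class IncrementalReach:
    """Maintains the set of nodes reachable from `source` as edges are inserted."""
    def __init__(self, edges, source):
        self.adj = defaultdict(set)
        for a, b in edges:
            self.adj[a].add(b)
        self.reached = set()
        self._expand(source)  # compute Q(G) once

    def _expand(self, start):
        """Add every node newly reachable through `start`; return that delta."""
        delta, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in self.reached:
                continue
            self.reached.add(x)
            delta.add(x)
            stack.extend(self.adj[x])
        return delta

    def insert_edge(self, a, b):
        """Apply one change to the input; return the resulting change to the output."""
        self.adj[a].add(b)
        if a in self.reached and b not in self.reached:
            return self._expand(b)
        return set()  # the output is unaffected, so no traversal happens

r = IncrementalReach([(1, 2), (3, 4)], source=1)
first_delta = r.insert_edge(2, 3)
second_delta = r.insert_edge(5, 6)
print(first_delta, second_delta, r.reached)
```

The traversal after an insertion visits only newly reachable nodes and their outgoing edges, so the work tracks the size of the change rather than the size of G.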
25
Complexity of incremental problems
  • Incremental query answering
  • Input: Q, G, Q(G), ΔG
  • Output: ΔM such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM

Incremental algorithms?
The cost of query processing: a function of |G|
and |Q|
  • The cost of incremental algorithms: a function of
    |CHANGED|, the size of changes in
  • the input ΔG, and
  • the output ΔM

|CHANGED| captures the updating cost that is
inherent to the incremental problem itself
  • Bounded: the cost is expressible as f(|CHANGED|)?
  • Optimal: in O(|CHANGED|)?

O(|CHANGED|) is the amount of work absolutely
necessary to perform for any incremental algorithm
Effective on graph pattern matching
Complexity analysis in terms of the size of
changes
26
Top-k query answering
Traditional query answering: compute Q(G)
  • It is expensive to compute when G is large
  • The result Q(G) may be excessively large for the
    users to inspect, even larger than G
  • Top-k query answering
  • Input: Query Q, dataset G and a positive
    integer k.
  • Output: A top-ranked set of k elements in Q(G)

Improvement: 65% on graph pattern matching
Early termination: return top-k matches without
computing Q(G)
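Early termination can be sketched under one strong assumption: candidates can be read in non-increasing score order (for example, from a precomputed ranking), so the first k matches found are the top-k. The function name, data layout, and toy ranking are illustrative.

```python
from itertools import islice

def top_k(candidates_by_score, is_match, k):
    """Return the first k matches from a stream assumed sorted by descending score."""
    matches = (c for c in candidates_by_score if is_match(c))
    return list(islice(matches, k))  # stops pulling candidates after k matches

inspected = []  # records which candidates the (possibly costly) test examined

def is_cycling_fan(candidate):
    inspected.append(candidate[1])
    return "cycling" in candidate[2]

# Hypothetical ranking: (score, name, hobbies), highest score first.
ranked = [(0.9, "ann", {"cycling"}), (0.8, "bob", {"chess"}),
          (0.7, "cat", {"cycling"}), (0.6, "dan", {"cycling"}),
          (0.5, "eve", {"cycling"})]
best = top_k(ranked, is_cycling_fan, 2)
print([name for _, name, _ in best], inspected)
```

Only three of the five candidates are inspected; the rest of Q(G) is never computed.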
27
Answering queries on big data
Yes, MapReduce is useful, but it is not the only
way!
  • Partial evaluation for distributed query
    processing: can we get performance guarantees?
  • Query preserving compression: convert big data to
    small data
  • Query answering using views: make big data small
  • Bounded incremental query answering: depending on
    the size of the changes rather than the size of
    the original big data
  • Top-k query answering and early termination: find
    answers without traversing the entire data set

Preprocessing methods: make big data small
Combinations of these can do better than
MapReduce!
28
  • Further reading
  • W. Fan, J. Li, X. Wang, and Y. Wu. Query
    Preserving Graph Compression, SIGMOD, 2012.
  • W. Fan. Graph Pattern Matching Revised for Social
    Network Analysis, ICDT 2012 (invited).
  • W. Fan, X. Wang, and Y. Wu. Performance
    Guarantees for Distributed Reachability Queries,
    VLDB, 2012.
  • W. Fan, X. Wang, and Y. Wu. Diversified Top-k
    Graph Pattern Matching, VLDB, 2014.
  • W. Fan, J. Li, X. Wang, and Y. Wu. Incremental
    Graph Pattern Matching, SIGMOD, 2011 (TODS 38(3),
    2013).
  • W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu.
    Graph Homomorphism Revisited for Graph Matching,
    VLDB, 2010.
  • W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu.
    Graph Pattern Matching: From Intractable to
    Polynomial Time, VLDB, 2010.

29
Approximate query answering
30
Graph Pattern Matching
  • Given a pattern graph Q and a data graph G, find
    all the matches of Q in G.
  • subgraph isomorphism: a bijective function f on
    nodes such that (u, u') ∈ Q iff
    (f(u), f(u')) ∈ G
  • Applications
  • pattern recognition
  • knowledge discovery
  • intelligence analysis
  • transportation network analysis
  • Web site classification
  • social position detection
  • user targeted advertising

Widely used in social network analysis
31
Problems
  • Real-life social graphs are typically large
    (Facebook: 1B users, 140B links)
  • Subgraph isomorphism
  • What is the complexity of determining
    whether there exists a match of a pattern Q in a
    graph G? NP-complete
  • Given a pattern Q and a graph G, how many matches
    of Q can possibly exist in G? Possibly
    exponential
  • O(|G|) time is already beyond reach in practice!
  • Nonetheless, we need to conduct graph pattern
    matching on social networks, among other things

Subgraph isomorphism is too costly for social
network analysis
What can we do if a class of queries is NOT
BD-tractable?
32
Relaxing the semantics of queries
  • Graph simulation
  • Much cheaper
  • Complexity of computing the set of matches:
    quadratic time
  • The number of matches of Q in G: 1 (there exists
    a unique, maximum match relation S)
  • a binary relation S on nodes
  • for each node u in Q, there exists v in G such
    that (u, v) ∈ S,
  • for each pair (u, v) ∈ S, each edge (u, u') in Q
    is mapped to an edge (v, v') in G, such that
    (u', v') ∈ S
  • Effective
  • Social position detection
  • User targeted advertising, ...

Quadratic time is still too expensive! How to
deal with it?
A variety of extensions to capture topology, with
low complexity
So, graph simulation for social data analysis,
instead of subgraph isomorphism
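The maximum simulation relation can be computed by a simple fixpoint: start from all label-compatible pairs and repeatedly discard pairs that violate the edge condition. The sketch below is a naive refinement loop, not the quadratic-time algorithm referred to above. Matching on node labels is an added assumption (the definition above leaves node conditions implicit), and the names are illustrative.

```python
def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """q_nodes / g_nodes: dict node -> label; edges: iterable of (source, target).

    Returns the maximum simulation as dict pattern_node -> set of data nodes,
    or an empty dict if some pattern node has no match.
    """
    g_succ = {v: set() for v in g_nodes}
    for a, b in g_edges:
        g_succ[a].add(b)
    q_succ = {u: set() for u in q_nodes}
    for a, b in q_edges:
        q_succ[a].add(b)
    # Start with every label-compatible pair, then refine.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for v in list(sim[u]):
                # For every pattern edge (u, u2), v needs a successor that matches u2.
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True
    if any(not s for s in sim.values()):
        return {}
    return sim

# Pattern: an A-node pointing to a B-node. Data: only a1 has such a successor.
Q_nodes, Q_edges = {"x": "A", "y": "B"}, [("x", "y")]
G_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
G_edges = [("a1", "b1")]
S = max_simulation(Q_nodes, Q_edges, G_nodes, G_edges)
print(S)
```

Unlike subgraph isomorphism, the result is a single relation rather than a possibly exponential set of matches: here x maps to a1 only, while y maps to both b1 and b2.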
33
The approximation theory revisited
  • If a query class is not BD-tractable and its
    semantics can't be relaxed, is it still feasible
    to answer such queries on big data?

Yes, approximation
  • When exact algorithms are infeasible, we find
    inexact algorithms with performance guarantees:
    the answers can't be too far off!
  • feasible on big data: reducing big data to small
    data
  • performance guarantees whenever possible

The need for revising the traditional
approximation theory, for querying big data
Data-driven approximation
34
Data-driven approximation
  • Resource-bounded query answering
  • Input: A dataset D, a class Q of queries, a
    resource ratio α ∈ [0, 1)
  • Question: Develop an algorithm that, given any
    query Q in the class, computes Q(D) by accessing
    at most α|D| amount of data

Make big data small!
  • Personalized social searches and reachability
    queries
  • Find me all my friends who live in Nanjing and
    like cycling
  • Does Michael connect to Lady Gaga through
    social links?

We can do personalized social search with α =
0.0015%!
  • 1.5 × 10^-5 × 1 PB (10^15 B) = 15 × 10^9 B
    = 15 GB
  • We are making big data of PB size as small as
    15 GB!

We can make big data of PB size fit into our
memory!
35
Summing up
36
Summary and Review
  • What is BD-tractability? Why do we care about
    it?
  • What is scale independence?
  • How to make big data small?
  • Is MapReduce the only way for querying big data?
    Can we do better than it?
  • What is query preserving data compression? Query
    answering using views? Bounded incremental query
    answering? Top-k query answering?
  • If a class of queries is known not to be
    BD-tractable, how can we process the queries in
    the context of big data?
  • Develop an algorithm for processing a class of
    queries on big data, by combining various methods
    discussed