Title: Scalable SPARQL Querying of Large RDF Graphs
1Scalable SPARQL Querying of Large RDF Graphs
In PVLDB, 4(11), 2011
2Outline
- About Presenter
- Semantic Web
- Previous Work
- New Problem
- SYSTEM ARCHITECTURE
- EXPERIMENTS
- CONCLUSIONS AND FUTURE WORK
3About Presenter
- Daniel Abadi
- Associate Professor of Computer Science at Yale University
- Research
- Column-Oriented Database Systems
- Petascale Parallel Database Systems (HadoopDB)
- Semantic Web Data Management
4Semantic Web
- The vision of Semantic Web is to build a "web of
data" that enables machines to understand the
semantics of information on the Web
5Google Knowledge Graph
6Key Technology
7The Disadvantage of XML
- David Billington is a lecturer of Discrete Mathematics.
- There is no standard way of assigning meaning to tag nesting
8The Disadvantage of Xpath
- Suppose we want to collect all academic staff members. A path expression in XPath might be //academicStaffMember
- XML is semantically unsatisfactory
9RDF
- Resource Description Framework
- Describes Web resources (identified by Uniform Resource Identifiers, URIs) in terms of properties and property values
10RDF as Triples and a Graph
11SPARQL
- RDF query language
- A basic graph pattern
- Answering SPARQL can be seen as finding subgraphs
in the RDF data that match the graph pattern
12Example for Star Pattern
- Find the names of the strikers that play for FC Barcelona.
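A star query like this one can be read as a set of triple patterns that share one variable. The following is a minimal sketch of basic graph pattern matching over an in-memory list of triples, not the system's actual query engine. The predicate names (`position`, `playsFor`, `name`), the toy data, and the helper `match_pattern` are all assumptions made for illustration.

```python
def match_pattern(triples, patterns):
    """Return every variable binding that satisfies all triple patterns.

    Terms starting with '?' are variables; everything else is a constant.
    This is a naive nested-loop matcher, written for clarity only.
    """
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for p_term, t_term in zip(pattern, triple):
                    if p_term.startswith("?"):
                        # Bind the variable, or check an existing binding.
                        if candidate.setdefault(p_term, t_term) != t_term:
                            ok = False
                            break
                    elif p_term != t_term:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Toy data (hypothetical): two players at the same club.
TRIPLES = [
    ("playerA", "position", "striker"),
    ("playerA", "playsFor", "FC_Barcelona"),
    ("playerA", "name", "Player A"),
    ("playerB", "position", "midfielder"),
    ("playerB", "playsFor", "FC_Barcelona"),
    ("playerB", "name", "Player B"),
]

# Star pattern: every triple pattern has ?player as its subject.
STAR_QUERY = [
    ("?player", "position", "striker"),
    ("?player", "playsFor", "FC_Barcelona"),
    ("?player", "name", "?name"),
]

names = [row["?name"] for row in match_pattern(TRIPLES, STAR_QUERY)]
```

Because every pattern shares the subject `?player`, all triples needed for one answer have the same subject.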
13Another Example
- Find football players playing for clubs in a populous region where they were born.
15Previous Work
- RDF In RDBMSs
- Property Tables
- Vertically Partitioned Approach
16RDF In RDBMSs
- Get the title of the book(s) Joe Fox wrote in 2001
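To make the triple-store approach concrete, here is a hedged sketch using Python's built-in `sqlite3` module: one three-column table and a query that self-joins it once per property. The table name `triples`, the column names, the property names (`author`, `copyright`, `title`), and the sample rows are assumptions for illustration; the slide does not give the schema.

```python
import sqlite3


def titles_by_author_and_year(rows, author, year):
    """Return book titles for an author and year from a single triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    # One self-join per property: this is what makes the plain
    # triples-table layout expensive for multi-property queries.
    query = """
        SELECT t.obj
        FROM triples AS a
        JOIN triples AS c ON c.subj = a.subj
        JOIN triples AS t ON t.subj = a.subj
        WHERE a.prop = 'author' AND a.obj = ?
          AND c.prop = 'copyright' AND c.obj = ?
          AND t.prop = 'title'
        ORDER BY t.obj
    """
    result = [row[0] for row in conn.execute(query, (author, year))]
    conn.close()
    return result


# Hypothetical sample data.
ROWS = [
    ("book1", "author", "Joe Fox"),
    ("book1", "copyright", "2001"),
    ("book1", "title", "Title A"),
    ("book2", "author", "Joe Fox"),
    ("book2", "copyright", "1999"),
    ("book2", "title", "Title B"),
]

titles = titles_by_author_and_year(ROWS, "Joe Fox", "2001")
```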
17Property Tables
18Vertically Partitioned Approach
19New Problem
- Single node RDF management systems are abundant
- Sesame
- Jena
- RDF-3X
- 3store
- Clustered RDF management has been explored far less; it is the focus of this talk
20SYSTEM ARCHITECTURE
21Graph Partitioning
- Hash vs. Graph partitioning
- Hash: only efficient for star patterns
- Graph: takes advantage of the graph model
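Why hash partitioning suits star patterns can be seen in a small sketch: hashing on the subject sends every triple with the same subject to the same machine, so a subject-centered star query never has to cross machines. The function name `hash_partition` and the use of `zlib.crc32` as the hash are assumptions chosen to keep the example deterministic.

```python
import zlib


def hash_partition(triples, num_machines):
    """Assign each triple to a machine by hashing its subject."""
    partitions = [[] for _ in range(num_machines)]
    for s, p, o in triples:
        machine = zlib.crc32(s.encode("utf-8")) % num_machines
        partitions[machine].append((s, p, o))
    return partitions


# Hypothetical toy data.
TRIPLES = [
    ("playerA", "position", "striker"),
    ("playerA", "playsFor", "clubX"),
    ("playerB", "playsFor", "clubX"),
    ("clubX", "region", "regionY"),
]

PARTS = hash_partition(TRIPLES, 4)


def machines_holding(subject):
    """Indices of the partitions that hold triples with this subject."""
    return [i for i, part in enumerate(PARTS) if any(s == subject for s, _, _ in part)]
```

A path such as `playerA -> clubX -> regionY` gets no such guarantee, because `playerA` and `clubX` may hash to different machines.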
22Graph Partitioning
- Edge vs. Vertex partitioning
- Edge: natural, but inefficient for query execution
- Vertex: superior for common graph patterns
23Vertex Partitioning
- Preprocess
- remove triples whose predicate is rdf:type
- METIS partitioner
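The preprocessing step can be sketched as follows: drop `rdf:type` triples and build an adjacency list for a graph partitioner. Treating the remaining edges as undirected is an assumption here, as is the helper name `build_partitioner_input`. The call into METIS itself is deliberately left out, because its exact interface is not given on the slide.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_partitioner_input(triples):
    """Return an undirected adjacency list, ignoring rdf:type triples."""
    adjacency = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            # Type objects would link many otherwise unrelated vertices.
            continue
        adjacency[s].add(o)
        adjacency[o].add(s)
    return {vertex: sorted(neighbors) for vertex, neighbors in adjacency.items()}


# Hypothetical toy data.
GRAPH = build_partitioner_input([
    ("a", "rdf:type", "Person"),
    ("b", "rdf:type", "Person"),
    ("a", "knows", "b"),
])
```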
24Triple Placement
- Minimizing data shuffling/exchange
- Allowing data overlap
- N-hop guarantee
- The extent of data overlap
- If a vertex is assigned to a machine, any vertex
that is within n-hop of this vertex is also
stored in this machine
25DIRECTED N-HOP GUARANTEE
26A potential problem
- Triples (s, p, o) and (o, p', o')
- Covered by the directed 2-hop guarantee
- Triples (s, p, o) and (s', p', o)
- Not guaranteed to be on the same machine
- Object-connected triples are not unusual
- Fix: the undirected n-hop guarantee
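The difference between the two guarantees can be illustrated with a small sketch that collects the triples to place alongside a set of core vertices. This is an illustrative reading of the n-hop guarantee, not the placement algorithm from the talk; the function name `n_hop_triples` and its `undirected` flag are assumptions.

```python
def n_hop_triples(triples, core, n, undirected=False):
    """Collect triples within n hops of the core vertices.

    Directed: follow edges only from subject to object.
    Undirected: also follow edges from object back to subject.
    """
    frontier = set(core)
    seen = set(core)
    placed = set()
    for _ in range(n):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier:
                placed.add((s, p, o))
                if o not in seen:
                    next_frontier.add(o)
            if undirected and o in frontier:
                placed.add((s, p, o))
                if s not in seen:
                    next_frontier.add(s)
        seen |= next_frontier
        frontier = next_frontier
    return placed


# Hypothetical toy graph: s -> o -> o2, plus s2 -> o (object-connected with s).
TRIPLES = [("s", "p", "o"), ("o", "p2", "o2"), ("s2", "p3", "o")]

directed = n_hop_triples(TRIPLES, {"s"}, 2)
undirected = n_hop_triples(TRIPLES, {"s"}, 2, undirected=True)
```

With `s` as the core vertex, the directed 2-hop set misses `(s2, p3, o)`, which shares only its object with `(s, p, o)`; the undirected variant includes it.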
27Triple Placement Algorithm
28Query Processing
- Queries are executed in RDF-stores and/or Hadoop
- Query execution is more efficient in RDF-stores than in Hadoop
- Pushing as much of the processing as possible into RDF-stores
- Minimizing the number of Hadoop jobs
- The larger the hop guarantee, the more work is done in RDF-stores
29To Communicate, or not to Communicate
- Given a query and an n-hop guarantee, is communication (a Hadoop job) between nodes needed?
- Choose the center of the query graph
- Calculate the distance from the center to the furthest edge
- If distance > n, communication is needed; otherwise it is not
30Determining whether a Query is PWOC
- PWOC query
- parallelizable without communication
- DoFE
- distance of farthest edge
- The vertex with the smallest DoFE is the most central vertex in the graph
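A minimal sketch of the PWOC test, assuming an undirected guarantee and a connected query graph: the distance from a vertex to an edge is taken as one plus the distance to the edge's nearer endpoint, and the DoFE is the largest such distance. That definition is an assumption, but it reproduces the DoFE values quoted on the Example slide (2, 2, 2 and 1). The edge list for that query is also assumed, since the slide names only the vertices.

```python
from collections import deque


def dofe(edges, start):
    """Distance of farthest edge from `start` (edges treated as undirected)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search for vertex distances from `start`.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    # An edge is one hop beyond its nearer endpoint.
    return max(min(dist[a], dist[b]) + 1 for a, b in edges)


def is_pwoc(edges, n):
    """True if the most central vertex has DoFE <= n."""
    vertices = {v for edge in edges for v in edge}
    return min(dofe(edges, v) for v in vertices) <= n


# Assumed edges for the first example query (predicates omitted).
QUERY = [("manager", "club"), ("club", "footballClub"), ("club", "Barcelona")]

DOFES = {v: dofe(QUERY, v) for v in ("manager", "footballClub", "Barcelona", "club")}
```

Here `club` is the center with DoFE 1, so under these assumptions the query is PWOC even with a 1-hop guarantee.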
31The algorithm
32the issue of duplicate results
- naive approach
- remove duplicates after the query has completed
- owner-computes model
- add triples (v, <isOwned>, Yes) to a partition
- For each query issued to the RDF-stores, add an additional pattern (core, <isOwned>, Yes)
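The owner-computes idea can be sketched as follows: each partition marks the vertices it owns, and a local result is kept only if the vertex bound to the query's core is owned by that partition. The helper names `mark_owned` and `owned_results`, and the toy partitions with overlapping data, are assumptions for illustration.

```python
IS_OWNED = "<isOwned>"


def mark_owned(partition, owned_vertices):
    """Add (v, <isOwned>, Yes) for every vertex this partition owns."""
    return set(partition) | {(v, IS_OWNED, "Yes") for v in owned_vertices}


def owned_results(partition, local_results, core_var):
    """Keep results whose core binding is owned by this partition."""
    return [
        row for row in local_results
        if (row[core_var], IS_OWNED, "Yes") in partition
    ]


# Two hypothetical partitions; (a, knows, b) is replicated in both.
PART_1 = mark_owned({("a", "knows", "b")}, {"a"})
PART_2 = mark_owned({("a", "knows", "b"), ("b", "knows", "c")}, {"b"})

# Local answers to the pattern (?x, knows, ?y), with ?x as the core.
LOCAL_1 = [{"?x": "a", "?y": "b"}]
LOCAL_2 = [{"?x": "a", "?y": "b"}, {"?x": "b", "?y": "c"}]

combined = owned_results(PART_1, LOCAL_1, "?x") + owned_results(PART_2, LOCAL_2, "?x")
```

The replicated answer for `a` is produced only by the partition that owns `a`, so no deduplication pass is needed after the union.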
33A query is not PWOC
- decompose the query into PWOC subqueries
- use Hadoop jobs to join the results of the PWOC subqueries
- The number of Hadoop jobs required to complete the query increases as the number of subqueries increases
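The join step can be illustrated with a small in-memory hash join on the variables two subquery results share. This is only a stand-in for what a Hadoop job would do at scale, not Hadoop code; the function name `join_bindings` and the sample bindings are assumptions, and every row on one side is assumed to bind the same set of variables.

```python
def join_bindings(left, right):
    """Hash-join two lists of variable bindings on their shared variables."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Build a hash index over the right side, keyed by the shared variables.
    index = {}
    for row in right:
        index.setdefault(tuple(row[v] for v in shared), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**row, **match})
    return joined


# Hypothetical results of two PWOC subqueries sharing ?club.
LEFT = [{"?player": "p1", "?club": "c1"}, {"?player": "p2", "?club": "c2"}]
RIGHT = [{"?club": "c1", "?region": "r1"}]

result = join_bindings(LEFT, RIGHT)
```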
34minimal number of subqueries
- reduces to the problem of finding a minimal edge partitioning of a graph into subgraphs of bounded diameter
- brute-force
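A brute-force search of this kind might look like the sketch below: try 1, 2, 3, ... groups, enumerate every assignment of edges to groups, and return the first partition in which every group is connected and has a vertex with DoFE at most n. The helper names and the edge list (taken from the football-player query, with predicates omitted) are assumptions, and the search is exponential, so it only suits small query graphs.

```python
from collections import deque
from itertools import product


def min_dofe(edges):
    """Smallest DoFE over all vertices; infinity if the edges are disconnected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    best = float("inf")
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if len(dist) < len(adjacency):
            return float("inf")
        best = min(best, max(min(dist[a], dist[b]) + 1 for a, b in edges))
    return best


def decompose(edges, n):
    """Return a smallest edge partition whose groups are all PWOC for n."""
    edges = list(edges)
    for k in range(1, len(edges) + 1):
        for labels in product(range(k), repeat=len(edges)):
            if len(set(labels)) != k:
                continue  # skip assignments that leave a group empty
            groups = [
                [e for e, label in zip(edges, labels) if label == g]
                for g in range(k)
            ]
            if all(min_dofe(group) <= n for group in groups):
                return groups
    return []


# Assumed edges of the second example query.
QUERY = [
    ("player", "footballer"), ("player", "region"), ("player", "club"),
    ("club", "region"), ("region", "pop"),
]

subqueries = decompose(QUERY, 1)
```

Under a 1-hop guarantee this query needs two subqueries; under a 2-hop guarantee it is PWOC as a whole.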
35Example
- The DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1
- The DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2
36Decompose Example
37EXPERIMENTS
- 20-machine cluster
- Lehigh University Benchmark (LUBM): 270 million triples
- Competitors
- Single-node RDF-3X
- SHARD triple-store system in Hadoop
- Graph partitioning (the proposed system)
- Hash partitioning on subjects
38Data Load Time
39Performance Comparison
40Varying Number of Machines
41Summary
42Thanks!