Scalable SPARQL Querying of Large RDF Graphs

Title: 1 Author: Leon Description: TR Template of SOARingLab Last modified by: Jxulie Created Date: 10/5/2005 1:11:43 AM Document presentation format – PowerPoint PPT presentation



1
Scalable SPARQL Querying of Large RDF Graphs
  • Xu Bo
  • 2012.06.11

In PVLDB, 4(11), 2011
2
Outline
  • About Presenter
  • Semantic Web
  • Previous Work
  • New Problem
  • SYSTEM ARCHITECTURE
  • EXPERIMENTS
  • CONCLUSIONS AND FUTURE WORK

3
About Presenter
  • Daniel Abadi
  • Associate Professor of Computer Science at Yale University
  • Research
  • Column-Oriented Database Systems
  • Petascale Parallel Database Systems (HadoopDB)
  • Semantic Web Data Management

4
Semantic Web
  • The vision of the Semantic Web is to build a "web of
    data" that enables machines to understand the
    semantics of information on the Web

5
Google Knowledge Graph
6
Key Technology
  • HTML
  • XML

7
The Disadvantage of XML
  • David Billington is a lecturer of Discrete
    Mathematics.
  • There is no standard way of assigning meaning to
    tag nesting

8
The Disadvantage of XPath
  • Suppose we want to collect all academic staff
    members. A path expression in XPath might be
    //academicStaffMember
  • XML is semantically unsatisfactory

9
RDF
  • Resource Description Framework
  • Web resources are identified by Uniform Resource
    Identifiers (URIs) and described through properties

10
RDF as Triples and a Graph
11
SPARQL
  • RDF query language
  • A basic graph pattern
  • Answering SPARQL can be seen as finding subgraphs
    in the RDF data that match the graph pattern

12
Example for Star Pattern
  • Find the names of the strikers that play for FC
    Barcelona.
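A minimal sketch of the "find matching subgraphs" view, using a star pattern like the one on this slide. The triples, the predicate names (position, playsFor, name) and the helper match_bgp are made up for illustration; a real SPARQL engine uses indexes and join ordering rather than this nested scan.

```python
# Toy data: (subject, predicate, object) triples. All values are hypothetical.
TRIPLES = [
    ("player1", "position", "striker"),
    ("player1", "playsFor", "FC_Barcelona"),
    ("player1", "name", "Player One"),
    ("player2", "position", "midfielder"),
    ("player2", "playsFor", "FC_Barcelona"),
    ("player2", "name", "Player Two"),
    ("player3", "position", "striker"),
    ("player3", "playsFor", "Other_Club"),
    ("player3", "name", "Player Three"),
]

# Star pattern: every triple pattern shares the subject variable ?p.
STAR_QUERY = [
    ("?p", "position", "striker"),
    ("?p", "playsFor", "FC_Barcelona"),
    ("?p", "name", "?name"),
]


def _unify(term, value, binding):
    """Match one pattern term against one value, extending the binding."""
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value


def match_bgp(triples, patterns):
    """Return all variable bindings under which every pattern is in triples."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)  # copy, so failed matches leave no trace
                if all(_unify(t, v, candidate) for t, v in zip(pattern, triple)):
                    extended.append(candidate)
        solutions = extended
    return solutions


names = [b["?name"] for b in match_bgp(TRIPLES, STAR_QUERY)]
```

Each binding returned corresponds to one subgraph of the data that matches the pattern; here only player1 satisfies all three triple patterns.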

13
Another Example
  • Find football players playing for clubs in a
    populous region where they were born.

15
Previous Work
  • RDF In RDBMSs
  • Property Tables
  • Vertically Partitioned Approach

16
RDF In RDBMSs
  • Get the title of the book(s) Joe Fox wrote in 2001
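A rough sketch of how such a query looks when all triples sit in one relational table: every additional condition on the same book becomes another self-join. The table layout (subj, prop, obj), the property names and the sample rows are assumptions for illustration, not the schema of any particular system.

```python
import sqlite3


def build_triple_store(triples):
    """Load triples into a single three-column table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


# One alias of the triples table per triple pattern, joined on the subject.
TITLE_QUERY = """
SELECT t.obj
FROM triples AS a
JOIN triples AS c ON c.subj = a.subj
JOIN triples AS t ON t.subj = a.subj
WHERE a.prop = 'author'    AND a.obj = ?
  AND c.prop = 'copyright' AND c.obj = ?
  AND t.prop = 'title'
ORDER BY t.obj
"""


def titles_written_in(conn, author, year):
    """Titles of the books written by `author` with copyright year `year`."""
    return [row[0] for row in conn.execute(TITLE_QUERY, (author, year))]


# Hypothetical sample data.
conn = build_triple_store([
    ("book1", "author", "Joe Fox"),
    ("book1", "copyright", "2001"),
    ("book1", "title", "Book One"),
    ("book2", "author", "Joe Fox"),
    ("book2", "copyright", "1999"),
    ("book2", "title", "Book Two"),
])
titles = titles_written_in(conn, "Joe Fox", "2001")
```

The two self-joins for a three-pattern query hint at why property tables and vertical partitioning were proposed as alternatives.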

17
Property Tables
18
Vertically Partitioned Approach
19
New Problem
  • Single-node RDF management systems are abundant
  • Sesame
  • Jena
  • RDF-3X
  • 3store
  • Clustered RDF management has been explored far less;
    it is the focus of this talk

20
SYSTEM ARCHITECTURE
21
Graph Partitioning
  • Hash vs. graph partitioning
  • Hash: only efficient for star patterns
  • Graph: takes advantage of the graph model
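A small sketch of why hashing on the subject suits star patterns: all triples that share a subject land on the same machine, so a star around that subject can be answered locally, while a path that continues through an object may cross machines. The function names and sample triples are hypothetical; zlib.crc32 is used only because it is a deterministic hash.

```python
import zlib
from collections import defaultdict


def machine_for(subject, num_machines):
    """Deterministically map a subject to a machine id."""
    return zlib.crc32(subject.encode("utf-8")) % num_machines


def hash_partition(triples, num_machines):
    """Place every triple on the machine chosen by hashing its subject."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[machine_for(s, num_machines)].append((s, p, o))
    return dict(partitions)


# Hypothetical data: a star around player1, then a path player1 -> club1 -> city1.
TRIPLES = [
    ("player1", "position", "striker"),
    ("player1", "name", "Player One"),
    ("player1", "playsFor", "club1"),
    ("club1", "locatedIn", "city1"),
    ("city1", "population", "1000000"),
]

parts = hash_partition(TRIPLES, 4)

# Machines that hold at least one triple with subject player1 (always exactly one).
star_machines = {m for m, ts in parts.items() for s, _, _ in ts if s == "player1"}
```

Nothing in this scheme keeps club1 or city1 near player1, which is the gap that graph partitioning is meant to close.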

22
Graph Partitioning
  • Edge vs. vertex partitioning
  • Edge: natural, but inefficient for query execution
  • Vertex: superior for common graph patterns

23
Vertex Partitioning
  • Preprocess
  • Remove triples whose predicate is rdf:type
  • METIS partitioner

24
Triple Placement
  • Minimizing data shuffling/exchange
  • Allowing data overlap
  • N-hop guarantee
  • The extent of data overlap
  • If a vertex is assigned to a machine, any vertex
    that is within n-hop of this vertex is also
    stored in this machine
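The placement rule can be sketched as follows, assuming the vertex partitioner has already produced an owner map (vertex to machine id). The function place_triples, the directed flag and the sample graph are illustrative names and data, not the system's actual code: for every owned vertex, the machine also stores each triple reachable within n hops.

```python
from collections import defaultdict


def place_triples(triples, owner, n, directed=True):
    """Return machine id -> set of triples stored there under an n-hop guarantee.

    owner maps each vertex to the machine chosen by the vertex partitioner.
    With directed=True edges are followed from subject to object only.
    """
    neighbours = defaultdict(list)  # vertex -> [(adjacent vertex, triple)]
    for s, p, o in triples:
        neighbours[s].append((o, (s, p, o)))
        if not directed:
            neighbours[o].append((s, (s, p, o)))

    placement = defaultdict(set)
    for vertex, machine in owner.items():
        frontier, seen = {vertex}, {vertex}
        for _ in range(n):  # one breadth-first layer per hop
            next_frontier = set()
            for v in frontier:
                for other, triple in neighbours.get(v, ()):
                    placement[machine].add(triple)
                    if other not in seen:
                        seen.add(other)
                        next_frontier.add(other)
            frontier = next_frontier
    return placement


# Hypothetical graph: a -> b -> c <- d, with a and b on machine 0, c and d on machine 1.
TRIPLES = [("a", "p", "b"), ("b", "q", "c"), ("d", "r", "c")]
OWNER = {"a": 0, "b": 0, "c": 1, "d": 1}

directed_2hop = place_triples(TRIPLES, OWNER, n=2, directed=True)
undirected_2hop = place_triples(TRIPLES, OWNER, n=2, directed=False)
```

In this toy graph the directed 2-hop placement keeps (d, r, c) off machine 0, while the undirected variant replicates all three triples on both machines: more overlap buys more locality.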

25
DIRECTED N-HOP GUARANTEE
26
A potential problem
  • Triples (s, p, o) and (o, p', o')
  • covered by the directed 2-hop guarantee
  • Triples (s, p, o) and (s', p', o)
  • not guaranteed to be stored together
  • Object-connected triples are not unusual
  • Hence the undirected n-hop guarantee

27
Triple Placement Algorithm
28
Query Processing
  • Queries are executed in RDF-stores and/or Hadoop
  • Query execution is more efficient in RDF-stores
    than in Hadoop
  • Pushing as much of the processing as possible
    into RDF-stores
  • Minimizing the number of Hadoop jobs
  • The larger the hop guarantee, the more work is
    done in RDF-stores

29
To Communicate, or not to Communicate
  • Given a query and n-hop guarantee, is
    communication (Hadoop job) between nodes needed?
  • Choose the center of the query graph
  • Calculate the distance from the center to the
    furthest edge
  • If distance > n, communication is needed; otherwise
    it is not

30
Determining whether a Query is PWOC
  • PWOC query: parallelizable without communication
  • DoFE: distance of farthest edge
  • The vertex with the smallest DoFE is the most
    central vertex of the graph
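A sketch of the PWOC test under stated assumptions: the query graph is given as (subject, object) pairs with predicates omitted, an edge adjacent to the start vertex counts as distance 1, and the query is treated as PWOC when the smallest DoFE does not exceed n. The function names and the sample query graph are hypothetical.

```python
import math
from collections import defaultdict, deque


def dofe(edges, start, directed=False):
    """Distance from start to its farthest edge (inf if some edge is unreachable)."""
    adjacent = defaultdict(list)
    for u, v in edges:
        adjacent[u].append(v)
        if not directed:
            adjacent[v].append(u)

    # Breadth-first search for vertex distances from start.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacent.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)

    farthest = 0
    for u, v in edges:
        # A directed edge is entered through its subject only.
        ends = (u,) if directed else (u, v)
        reachable = [dist[x] for x in ends if x in dist]
        if not reachable:
            return math.inf
        farthest = max(farthest, min(reachable) + 1)
    return farthest


def core_dofe(edges, directed=False):
    """Smallest DoFE over all vertices, i.e. the DoFE of the most central vertex."""
    vertices = dict.fromkeys(v for edge in edges for v in edge)
    return min(dofe(edges, v, directed) for v in vertices)


def is_pwoc(edges, n, directed=False):
    """True if the query can run without communication under an n-hop guarantee."""
    return core_dofe(edges, directed) <= n


# Hypothetical query graph with four variables.
QUERY_EDGES = [
    ("?player", "?club"),
    ("?club", "?region"),
    ("?player", "?region"),
    ("?region", "?pop"),
]
```

For this graph the undirected DoFE of ?pop is 3 and the smallest DoFE is 2, so the query would be PWOC under a 2-hop guarantee but not under a 1-hop guarantee.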

31
The algorithm
32
the issue of duplicate results
  • naive approach
  • remove duplicates after the query has completed
  • owner-computes model
  • add triples (v, <isOwned>, "Yes") to a partition
    for each vertex v that the partition owns
  • For each query issued to the RDF-stores, add an
    additional pattern (core, <isOwned>, "Yes")
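A toy illustration of the owner-computes idea for a single triple pattern (?x, knows, ?y) whose core is ?x. The two partitions overlap on one triple; the helper names, the predicate and the data are assumptions made for this sketch only.

```python
IS_OWNED = "<isOwned>"


def add_ownership(partition_triples, owned_vertices):
    """Add a (v, <isOwned>, "Yes") marker for every vertex this partition owns."""
    markers = {(v, IS_OWNED, "Yes") for v in owned_vertices}
    return set(partition_triples) | markers


def local_matches(partition, owner_computes):
    """Evaluate (?x, knows, ?y) on one partition.

    With owner_computes=True the extra pattern (?x, <isOwned>, "Yes") must
    also hold, so only the owner of the core vertex reports the match.
    """
    results = []
    for s, p, o in sorted(partition):
        if p != "knows":
            continue
        if owner_computes and (s, IS_OWNED, "Yes") not in partition:
            continue
        results.append((s, o))
    return results


# The triple (a, knows, b) is replicated on both machines because of data overlap.
machine0 = add_ownership({("a", "knows", "b")}, owned_vertices={"a"})
machine1 = add_ownership({("a", "knows", "b")}, owned_vertices={"b"})

naive = local_matches(machine0, False) + local_matches(machine1, False)
deduplicated = local_matches(machine0, True) + local_matches(machine1, True)
```

The naive union reports the match twice and needs a later deduplication pass, whereas the owner-computes variant returns it once, from the machine that owns a.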

33
A query is not PWOC
  • decompose the query into PWOC subqueries
  • use Hadoop jobs to join the results of the PWOC
    subqueries
  • The number of Hadoop jobs required to complete
    the query increases as the number of subqueries
    increases

34
minimal number of subqueries
  • reduces to the problem of finding minimal edge
    partitioning of a graph into subgraphs of bounded
    diameter
  • brute-force
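One way to picture the brute-force search, as a sketch rather than the system's actual procedure: try k = 1, 2, ... groups, enumerate every assignment of query edges to k groups, and stop at the first assignment in which each group has a vertex whose DoFE is at most n. Edges are undirected (subject, object) pairs and all names are hypothetical.

```python
import math
from collections import defaultdict, deque
from itertools import product


def dofe(edges, start):
    """Undirected distance from start to its farthest edge (inf if unreachable)."""
    adjacent = defaultdict(list)
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacent[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    farthest = 0
    for u, v in edges:
        reachable = [dist[x] for x in (u, v) if x in dist]
        if not reachable:
            return math.inf  # disconnected groups are rejected this way
        farthest = max(farthest, min(reachable) + 1)
    return farthest


def is_pwoc(edges, n):
    """True if some vertex of the subquery has DoFE <= n."""
    vertices = {v for edge in edges for v in edge}
    return any(dofe(edges, v) <= n for v in vertices)


def min_decomposition(edges, n):
    """Smallest list of edge groups that are each PWOC under an n-hop guarantee."""
    for k in range(1, len(edges) + 1):
        # Exhaustive: every labelling of the edges with k group ids.
        for labels in product(range(k), repeat=len(edges)):
            if len(set(labels)) < k:
                continue  # some group would be empty
            groups = [[e for e, g in zip(edges, labels) if g == i] for i in range(k)]
            if all(is_pwoc(group, n) for group in groups):
                return groups
    return []


# Hypothetical path-shaped query: a - b - c - d - e.
PATH_QUERY = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
one_hop = min_decomposition(PATH_QUERY, n=1)
two_hop = min_decomposition(PATH_QUERY, n=2)
```

With a 1-hop guarantee the path splits into two star-shaped subqueries, and with a 2-hop guarantee it stays whole. The enumeration is exponential in the number of triple patterns, so it is only practical for small queries.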

35
Example
DoFEs for manager, footballClub, Barcelona and
club are 2, 2, 2 and 1.
The DoFEs for footballer, pop, region, player and
club are 3, 3, 2, 2 and 2.
36
Decompose Example
37
EXPERIMENTS
  • 20-machine cluster
  • Lehigh University Benchmark (LUBM), 270 million
    triples
  • Competitors
  • Single-node RDF-3X
  • SHARD triple-store system in Hadoop
  • Graph partitioning (the proposed system)
  • Hash partitioning on subjects

38
Data Load Time
39
Performance Comparison
40
Varying Number of Machines
41
Summary
42
Thanks!