Title: The Graph Query Language: Towards a Unification of Graph Query Approaches
1The Graph Query Language: Towards a Unification of Graph Query Approaches
JTC1 SC32N1634
- David Silberberg
- david.silberberg_at_jhuapl.edu
- 443-778-6231
2Outline
- Goals and Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions
3Goals of the Graph Query Language (GQL) Project
- To unify disparate graph query approaches into a single, seamless, and declarative language
- Supports semantic search over graph data structures represented by schemas
- Supports traditional graph algorithms that systematically follow edges to discover interesting subgraphs (e.g., shortest path, minimal spanning tree)
- Supports metrics-oriented graph algorithms (e.g., social network analysis)
- Supports special commands tailored to analysis of graphs
- Supports ontology-assisted query
- To quantify the scalability of this type of language
4Assumptions
- Data model is a typed graph that adheres to a schema
- Not XML: graphs tend to be more highly connected
- Not a semantic model: inference cannot, in general, be performed on the schema
- Data graphs can be large
- Query languages are only an abstract representation of questions
- The objective is to find the right abstraction for the way people think about interacting with graphs
- Other query languages over other data models will work, but do those languages facilitate or hinder the formulation of requests and the interpretation of results?
- Algorithms are external to the graph management system
- There are too many algorithms
- New algorithms may be implemented or modified regularly
- We are not the experts in writing efficient algorithms
5Benefits of GQL
- Potential for significant reduction in time to perform analysis
- Provides visual analysis applications with a new paradigm for interacting with graph data
- Reduces the time to find information useful to analysts
- Enables interactive analysis using large data graphs
6Graph Interaction Methods
- Graph interactions take many forms
- Browse
- One-step-at-a-time exploration of a graph
- Semantic Schema-Based Search
- Several-steps-at-a-time graph query
- Algorithms
- Find subgraphs
- Calculate graph metrics
- Analysis
- Hypothesis expressions, etc.
- GQL is a declarative graph query language for
integrating all these approaches!
7Example Scenario
- Farmer Jones' lettuce crop did well this year, but few other farmers did well. Why?
- First, find Farmer Jones. (Browsing)
8Example Scenario
- Rabbits usually eat lettuce. Let's find the
rabbits that ate Farmer Jones' lettuce. (Semantic
Schema-Based Search)
9Example Scenario
- Let's look at all the farmers, and their
locations, whose lettuce was eaten by fewer than
5 rabbits. (Semantic Schema-Based Search)
10Example Scenario
- What commonalities do the farmers have with each
other and with the rabbits? (Semantic and
Algorithmic Search)
11Example Scenario
- If Fred fox ate Prize lettuce, what else would we
learn? (Analysis-specific Methods, Semantic
Search, and Algorithmic Search)
12Outline
- Goals and Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions
13Related Work
- Four categories of graph query languages and examples
- Knowledge base (subject-predicate-object) query languages: SPARQL, RQL, RAL, RDF Query Language
- Graph reasoning query languages: OWL-QL, GraphLog, Query and Inference Service for RDF
- Query languages with graph operators: GOQL, GRAM
- Graphical user interface query language: QGRAPH
14Features of GQL that Support Analysis
- Schema-based graph query
- Returns a single graph or a set of graphs (not tables or XML files)
- Aliasing
- Graph exploration through wildcard search
- Embedded queries (help achieve first-order logic expressiveness)
- Creates new graph structures in query results
- Query over defined patterns (of activity or behavior, for example)
- Special commands tailored to analysis
- Hypothesis expressions
- Composite vertices (of vertices and edges)
- External algorithms that return graphs (e.g., shortest path)
- External algorithms that return metrics (e.g., social network analysis)
- Ontology-assisted graph query
15Example Graph Model
16GQL Operators - Overview
- Basic Syntax
- SUBGRAPH clause
- Finds a subgraph in the source graph
- CONSTRAINT clause
- Filters the subgraph based on property constraints
- RETURN clause
- Describes the resulting graph or sets of graphs to return
- Syntax for analysis
- ASSUME clause
- Supports hypothesis statements
- PATTERN clause
- Defines search patterns
17Basic GQL Operators
- Subgraph Template Operators: SUBGRAPH clause
- Conjunctions and disjunctions of path-segment operators
- Hierarchy operators (for composite vertices)
- Constraint Operators: CONSTRAINT clause
- Standard first-order logic
- Conjunctions, disjunctions, and negations, as well as universal and existential quantification of predicates
- Projection Operators: RETURN clause
- Constructs the result graph(s)
- Path segment operator
- Hierarchy operator (for composite vertices)
- Present results as a set of graphs
- Edge expansion operator
- Common join operator
18Simple Query that Returns a Single Graph
- SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit
- CONSTRAINT Chases.time < Eats.time
- RETURN Fox Chases Rabbit AND Fox Eats Rabbit
- The type name represents the variable
- Motivated by languages like SQL
- In contrast to explicit variables such as (Fox ?f1)
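The way this query binds the type name Fox to the same vertex in both path segments can be sketched in Python. This is only an illustration, not the GQL implementation: the in-memory layout, the helper names `segments` and `chase_then_eat`, and the toy data (times as hours of the day) are all assumptions.

```python
from itertools import product

# Hypothetical toy data loosely modeled on the fox/rabbit example; times are hours.
vertices = {
    "fox1": {"type": "Fox", "name": "George"},
    "fox2": {"type": "Fox", "name": "Fred"},
    "rabbit1": {"type": "Rabbit", "name": "Peter"},
    "rabbit3": {"type": "Rabbit", "name": "Jack"},
}
# Each edge: (edge type, source vertex, target vertex, properties)
edges = [
    ("Chases", "fox1", "rabbit1", {"time": 14}),
    ("Eats", "fox1", "rabbit1", {"time": 15}),
    ("Eats", "fox2", "rabbit3", {"time": 9}),
]

def segments(src_type, edge_type, dst_type):
    """Path-segment matches: edges of edge_type from a src_type vertex to a dst_type vertex."""
    return [e for e in edges
            if e[0] == edge_type
            and vertices[e[1]]["type"] == src_type
            and vertices[e[2]]["type"] == dst_type]

def chase_then_eat():
    """SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit, CONSTRAINT Chases.time < Eats.time."""
    result = []
    for chase, eat in product(segments("Fox", "Chases", "Rabbit"),
                              segments("Fox", "Eats", "Rabbit")):
        # The type name acts as the variable: Fox and Rabbit must be the same vertices.
        same_fox_and_rabbit = chase[1] == eat[1] and chase[2] == eat[2]
        if same_fox_and_rabbit and chase[3]["time"] < eat[3]["time"]:
            result.append((chase, eat))
    return result

matches = chase_then_eat()
```

In this toy graph only fox1 both chases and later eats the same rabbit, so `matches` holds a single pair of edges.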
19Returning a Set of Graphs
- Can be done with edge expansion or joins in the RETURN clause
- Can be seamlessly integrated with non-graph expansion expressions
- Any query can be returned as a set of graphs if desired
- SUBGRAPH Fox Chases Rabbit
- RETURN Fox Chases Rabbit
20Aliasing
- SUBGRAPH Fox ALIAS ChasingFox Chases Rabbit AND
- Fox ALIAS EatingFox Eats Rabbit
- CONSTRAINT ChasingFox.name <> EatingFox.name
- RETURN ChasingFox Chases Rabbit AND
- EatingFox Eats Rabbit
- If our graph had an additional edge in which George Fox chased Jack Rabbit at 8 a.m., the result would look like:
[Figure: result graph. Fox fox1 (name George, age 3) Chases chases3 (time 8am) Rabbit rabbit3 (name Jack, age 1); Fox fox2 (name Fred, age 2) Eats eats2 (time 9am) Rabbit rabbit3]
21Embedded Queries
- Significant component of first-order logic expressiveness
- To request the first fox that ate a rabbit, the following existential query is formulated:
- SUBGRAPH Fox Eats ALIAS E1 Rabbit
- CONSTRAINT NOT EXISTS
- (SUBGRAPH Fox Eats ALIAS E2 Rabbit
- CONSTRAINT E1.time > E2.time)
- RETURN Fox Eats Rabbit
[Figure: result graph. Fox fox2 (name Fred, age 2) Eats eats2 (time 9am) Rabbit rabbit3 (name Jack, age 1)]
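The NOT EXISTS construction above can be read as "keep an Eats edge E1 only if no Eats edge E2 is earlier". A minimal Python sketch of that reading follows; the tuple layout, the function name `first_eats`, and the hour-of-day times are assumptions made for illustration.

```python
# Hypothetical Eats edges: (fox, rabbit, time as hour of day)
eats = [("fox1", "rabbit1", 15), ("fox2", "rabbit3", 9)]

def first_eats(eats_edges):
    """Keep each edge E1 for which no edge E2 exists with E1.time > E2.time."""
    return [e1 for e1 in eats_edges
            if not any(e1[2] > e2[2] for e2 in eats_edges)]

earliest = first_eats(eats)
```

The nested scan mirrors the embedded query directly and is quadratic in the number of Eats edges; a single pass that tracks the minimum time would give the same answer.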
22New Result Graph Structure Query
- SUBGRAPH Fox Eats Rabbit AND Rabbit Eats Lettuce
- RETURN Fox new(Ingests) Lettuce
[Figure: result graph. Fox fox1 (name George, age 3) Ingests ingests1 Lettuce lettuce1 (name PrizeLettuce); Fox fox2 (name Fred, age 2) Ingests ingests3 Lettuce lettuce2 (name Icy)]
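One way to picture how new(Ingests) builds edges that are not stored in the source graph is as a composition of the two Eats segments. The sketch below is an assumption-laden illustration: which rabbit ate which lettuce is invented toy data, and `ingests` is a hypothetical helper, not a GQL operator.

```python
# Hypothetical toy data: (source, target) pairs for each Eats path segment.
fox_eats_rabbit = [("fox1", "rabbit1"), ("fox2", "rabbit3")]
rabbit_eats_lettuce = [("rabbit1", "lettuce1"), ("rabbit3", "lettuce2")]

def ingests(fox_eats, rabbit_eats):
    """Compose Fox Eats Rabbit with Rabbit Eats Lettuce into new Ingests edges."""
    lettuce_by_rabbit = {}
    for rabbit, lettuce in rabbit_eats:
        lettuce_by_rabbit.setdefault(rabbit, []).append(lettuce)
    new_edges = set()
    for fox, rabbit in fox_eats:
        # The shared Rabbit vertex joins the two segments and is dropped from the result.
        for lettuce in lettuce_by_rabbit.get(rabbit, []):
            new_edges.add((fox, "Ingests", lettuce))
    return sorted(new_edges)

ingests_edges = ingests(fox_eats_rabbit, rabbit_eats_lettuce)
```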
23Hypothesis Expressions
- Enables queries on hypothetical data
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Chases.time < 8am
- RETURN Fox new(Ingests) Lettuce
- ASSUME EDGE Chases NEW time = 7am
- FROM Fox CONSTRAINT name = Fred
- TO Rabbit CONSTRAINT name = Jack
- Motivated by OWL-QL
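A hypothesis expression can be pictured as running the same query over the stored edges plus the assumed ones, without modifying the stored graph. The Python sketch below follows that reading; the data, the hour-of-day times, and the helper `ingests_before` are hypothetical and not part of GQL.

```python
def ingests_before(chases, eats, rabbit_eats, cutoff):
    """Fox new(Ingests) Lettuce for foxes that chased a rabbit before cutoff,
    ate that rabbit, and whose rabbit ate lettuce."""
    early = {(fox, rabbit) for fox, rabbit, time in chases if time < cutoff}
    return sorted({(fox, lettuce)
                   for fox, rabbit in eats if (fox, rabbit) in early
                   for eater, lettuce in rabbit_eats if eater == rabbit})

# Hypothetical stored graph (times are hours of the day).
chases = [("fox1", "rabbit1", 14)]
eats = [("fox1", "rabbit1"), ("fox2", "rabbit3")]
rabbit_eats = [("rabbit1", "lettuce1"), ("rabbit3", "lettuce2")]

# ASSUME EDGE Chases at 7am from Fred (fox2) to Jack (rabbit3).
assumed_chases = [("fox2", "rabbit3", 7)]

without_assumption = ingests_before(chases, eats, rabbit_eats, cutoff=8)
# chases + assumed_chases builds a new list, so the stored edges stay untouched.
with_assumption = ingests_before(chases + assumed_chases, eats, rabbit_eats, cutoff=8)
```

With this toy data the query returns nothing against the stored graph and one Ingests edge once the assumed Chases edge is included.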
24Composite Vertices
- Composite vertices
- Composed of vertices and edges
- Contained vertices can be composite as well
25Composite Vertex Queries - continued
- SUBGRAPH HuntingEvent OccuredAt Place AND
- HuntingEvent DIRECTLY CONTAINS Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Place.name = "Smith Game Park"
- RETURN Rabbit Eats Lettuce
[Figure: result schema. Rabbit (name, age) Eats (time) Lettuce (name)]
- Addresses a subset of Harel's Higraphs
- Multiple hops
- CONTAINS or IS-CONTAINED-BY
- Feasible because of the hierarchy
26Wildcard Queries
- SUBGRAPH Fox ALIAS InterestingEdge Rabbit
- RETURN Fox InterestingEdge Rabbit
[Figure: result graph. Vertices: Fox fox1 (name George, age 3), Fox fox2 (name Fred, age 2), Rabbit rabbit1 (name Peter, age 2), Rabbit rabbit2 (name Bugs, age 4), Rabbit rabbit3 (name Jack, age 1). Edges: Chases chases1 (time 2pm), Eats eats1 (time 3pm), Chases chases2 (time 5pm), Eats eats2 (time 9am)]
- One edge wildcard queries
- Multiple hops
- May be computationally expensive in a graph
- Can be handled by an external AllPath() algorithm
27Pattern Definition
- Assigns names to interesting graph patterns
- Can be reused in multiple queries
- PATTERN Predator (Fox new(PreysUpon) Rabbit)
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit
- CONSTRAINT Chases.time < Eats.time
- RETURN Fox new(PreysUpon) Rabbit
28Pattern Use
- Query
- SUBGRAPH Predator(Fox PreysUpon Rabbit) AND
- Rabbit Eats Lettuce
- RETURN Fox new(Ingests) Lettuce
- Is evaluated as if it were
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Chases.time < Eats.time
- RETURN Fox new(Ingests) Lettuce
29External Graph Algorithms that Return Subgraphs
- Shortest Path
- SUBGRAPH GameWarden Chases Fox AND
- ShortestPath(Fox, Rabbit) ALIAS SP_alias AND
- Rabbit Eats Lettuce
- RETURN GameWarden Chases Fox AND
- SP_alias AND
- Rabbit Eats Lettuce
- Adjacent Vertices
- SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias
- CONSTRAINT count_edges(Rabbit) > 10
- RETURN AV_alias
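Because the algorithms are external to the graph management system, any routine that takes vertices and returns a subgraph could stand behind ShortestPath. As one possible stand-in, here is a breadth-first search sketch for unweighted graphs; the adjacency-dictionary format and the vertex names are assumptions for illustration only.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Return the vertices on one shortest (fewest-edge) path, or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:      # walk parent links back to the start
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in adjacency.get(v, []):
            if w not in parent:       # first visit is along a shortest route
                parent[w] = v
                queue.append(w)
    return None

# Hypothetical adjacency lists; the intermediate vertices are invented.
adjacency = {
    "fox1": ["den1", "field1"],
    "den1": ["burrow1"],
    "field1": ["rabbit1"],
    "burrow1": ["rabbit1"],
}
sp = shortest_path(adjacency, "fox1", "rabbit1")
```

The returned vertex list would still need to be converted into a graph in whatever standard format the query engine expects from external algorithms.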
30External Graph Algorithms that Return Metrics
- Centrality: Find the Foxes that eventually Eat the Rabbits who play a central role in the garden activities
- SUBGRAPH Fox Eats Rabbit
- CONSTRAINT Centrality (Fox, Rabbit, Lettuce) > .8
- RETURN Fox Eats Rabbit
- Clustering Coefficient: Find the Foxes that are likely to work together when Chasing Rabbits
- SUBGRAPH Fox ALIAS Fox1 Chases Rabbit AND
- Fox ALIAS Fox2 Chases Rabbit
- CONSTRAINT ClusteringCoefficient (Fox1, Fox2) > .6
- AND Fox1 <> Fox2
- RETURN Fox Eats Rabbit
31Some Issues with External Algorithms
- Algorithms do not filter results; they operate directly on the graph and tie into the rest of the results
- Algorithms need to return a set of graphs (or a graph under some circumstances) in a standard format
- Order of query execution
- No current way to refer to the result vertices and edges of algorithms that are not specifically identified in the query
- SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias
- CONSTRAINT ClusteringCoefficient (<Vertex1 ?>, <Vertex2 ?>) > .6
- RETURN <Vertex1 ?> <Edge1 ?> Rabbit AND
- <Vertex2 ?> <Edge2 ?> Rabbit
32Ontology Assisted Query
[Figure: an ontology connected by mappings to the graph schema. Ontology: Animal and Vegetable isA Organism; Carnivore and Herbivore isA Animal; Wolf and Fox isA Carnivore; Hare and Sheep isA Herbivore; Lettuce and Carrot isA Vegetable; relationships Chases and Eats. Graph Schema: vertex types Fox and Rabbit (name, age), Lettuce and Carrot (name); edge types Chases and Eats (time)]
33Ontology-Assisted Query Result
- SUBGRAPH Carnivore Eats Herbivore AND
- Herbivore Eats Vegetable
- RETURN Carnivore new(Ingests) Vegetable
[Figure: result graph. Fox fox1 (name George, age 3) Ingests ingests1 Lettuce lettuce1 (name PrizeLettuce); Fox fox2 (name Fred, age 2) Ingests ingests3 Lettuce lettuce2 (name Icy)]
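An ontology-assisted query such as the one above could be resolved by replacing each ontology class with the schema types it subsumes before the query runs. The sketch below assumes the isA hierarchy shown on the ontology slide; the mapping table, in particular mapping the schema type Rabbit to the ontology concept Hare, is a guess made purely for illustration.

```python
# isA links read from the ontology slide (child -> parent).
is_a = {
    "Wolf": "Carnivore", "Fox": "Carnivore",
    "Hare": "Herbivore", "Sheep": "Herbivore",
    "Lettuce": "Vegetable", "Carrot": "Vegetable",
    "Carnivore": "Animal", "Herbivore": "Animal",
    "Animal": "Organism", "Vegetable": "Organism",
}
# Hypothetical mappings from schema vertex types to ontology concepts.
schema_to_ontology = {"Fox": "Fox", "Rabbit": "Hare",
                      "Lettuce": "Lettuce", "Carrot": "Carrot"}

def subsumed_by(concept, ancestor):
    """True if concept equals ancestor or reaches it by following isA links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

def schema_types_for(ontology_class):
    """Schema vertex types whose mapped concept falls under ontology_class."""
    return sorted(t for t, c in schema_to_ontology.items()
                  if subsumed_by(c, ontology_class))

carnivores = schema_types_for("Carnivore")
herbivores = schema_types_for("Herbivore")
vegetables = schema_types_for("Vegetable")
```

Under these assumed mappings, Carnivore Eats Herbivore expands to Fox Eats Rabbit, and Herbivore Eats Vegetable expands to Rabbit Eats Lettuce or Rabbit Eats Carrot.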
34Some Issues of Ontology-Assisted Query
- Why not just have an ontology query language?
- Performance issues?
- Scaling issues?
- Capitalize on features that semantics bring to bear on a graph query language
- Semantic abstraction (e.g., subsumption, hierarchy)
- Use inference to create semantically consistent models
- Impose semantics on the graph model
35Outline
- Goals and Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions
36Query Optimization
- Query execution time is the key to success for any query language; GQL is no exception
- We apply relational database optimization techniques to graph queries
- Optimization issues
- Query optimization on a per-path-segment basis: addressed
- Path-segment ordering: initial thoughts
- Management of large amounts of intermediate query results: not yet addressed
- Incorporating external algorithms: not yet addressed
- Ontology elaboration performance: not yet addressed
37Query Optimization
- Query plan representations are used to define query execution plans
- Query plan representations are constructed to optimize the query execution time
- Via graph algebra
- Via graph statistics to estimate query costs for each operation
- Query optimizer determines
- The best algorithm to execute each operation
- The best operation ordering to optimize overall query execution time
38Query Planning and Optimization
- Query planning process determines the operators required to solve a query
- Query optimization process determines the most efficient way to
- Execute query operators
- Order the execution of query operators
- Heuristics have been identified to implement query planning and optimization based on statistical analysis
39Graph Statistics
- Estimating costs requires statistical knowledge of the graph
- We estimate the cost of the path segment operator
- One of the most common and costly operations
- Statistics that we initially considered useful
- Vertex Cardinality: The number of vertices of type v is count(v), or just $V$.
- Vertex Edge Set Cardinality: The total number of edges e that emanate from all vertices of type v is count($e_v$), or just $E_V$.
- Edge Cardinality: The number of edges of type e is count(e), or just $E$.
- Edge Distribution: The number of different vertex type pairs that edges of type e connect, or just $ED$.
- Selectivity Factor: The percentage of vertices or edges that match a property constraint is $sel(\theta)$, where $\theta$ is the property constraint.
- Uniformity assumption
- Independence assumption
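To make the four counts concrete, the sketch below computes them for a tiny hypothetical graph. It reads Edge Distribution as the number of distinct (source type, target type) pairs connected by edges of a given type, which is one interpretation of the definition above; the data and the function name `graph_statistics` are assumptions.

```python
# Hypothetical toy graph.
vertex_type = {"fox1": "Fox", "fox2": "Fox", "rabbit1": "Rabbit",
               "rabbit3": "Rabbit", "lettuce1": "Lettuce"}
# Each edge: (edge type, source vertex, target vertex)
edges = [("Chases", "fox1", "rabbit1"), ("Eats", "fox1", "rabbit1"),
         ("Eats", "fox2", "rabbit3"), ("Eats", "rabbit1", "lettuce1")]

def graph_statistics(v_type, e_type):
    """Return V, EV, E, and ED for vertex type v_type and edge type e_type."""
    V = sum(1 for t in vertex_type.values() if t == v_type)
    # All edges, of any type, that emanate from vertices of type v_type.
    EV = sum(1 for _, src, _ in edges if vertex_type[src] == v_type)
    E = sum(1 for t, _, _ in edges if t == e_type)
    # Assumed reading of ED: distinct (source type, target type) pairs for e_type.
    ED = len({(vertex_type[s], vertex_type[d]) for t, s, d in edges if t == e_type})
    return {"V": V, "EV": EV, "E": E, "ED": ED}

stats = graph_statistics("Fox", "Eats")
```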
40Path Segment Vertex Search, No Indices
- Algorithm
- Iterate through a set of vertices of type v in $O(V)$ time
- For each vertex, iterate through its edge list to find edges of type e in $O(E_V/V)$ time
- Follow the edge to vertex w in constant time
- Execution time is $O(V \cdot (E_V/V)) = O(E_V)$
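The unindexed vertex search can be sketched as two nested loops. The adjacency-list layout and the `edges_examined` counter are assumptions added to make the $O(E_V)$ cost visible; they are not part of the original algorithm description.

```python
def path_segment_scan(vertices_of_type_v, edge_lists, edge_type):
    """Scan every edge of every vertex of type v and keep edges of edge_type."""
    matches, edges_examined = [], 0
    for v in vertices_of_type_v:              # O(V) vertices
        for etype, w in edge_lists[v]:        # about EV/V edges per vertex
            edges_examined += 1
            if etype == edge_type:
                matches.append((v, etype, w)) # following the edge to w is O(1)
    return matches, edges_examined

# Hypothetical edge lists for the Fox vertices: (edge type, target vertex).
edge_lists = {
    "fox1": [("Chases", "rabbit1"), ("Eats", "rabbit1")],
    "fox2": [("Eats", "rabbit3")],
}
found, examined = path_segment_scan(["fox1", "fox2"], edge_lists, "Eats")
```

Here `examined` equals the total number of edges leaving Fox vertices, which is the $E_V$ of the cost estimate.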
41Path Segment Indices on Vertex Edge Set
- Requires each edge set to be indexed through a logarithmic-time search tree (e.g., B tree)
- Next values are (virtually) collocated with the matching value
- Enables a constant time search for the next value(s)
- Algorithm
- Iterate through vertices of type v in time $O(V)$
- Find matching edge(s) in logarithmic time $O(\log(E_V/V))$
- Iterate through the matching edges in time $O(E/(ED \cdot V))$
- Execution time is $O(V \cdot (\log(E_V/V) + E/(ED \cdot V))) = O(V \log(E_V/V) + E/ED)$
- If $ED \approx E$ (i.e., one edge of type e emanates from each v), then the algorithm tends to operate in time $O(V \log(E_V/V))$
- If $ED \approx E$ and $E_V \approx V$, the algorithm tends to operate in time $O(V)$
- If $ED \approx E$ and $E_V \gg V$, the algorithm tends to operate in time $O(V \log(E_V))$
- If $ED \ll E$, the $E/ED$ term dominates and the algorithm tends to operate in time $O(E/ED)$
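A sorted list searched with binary search can stand in for the per-vertex search tree, since matching entries are then adjacent. The sketch below uses Python's `bisect` module on that assumption; it is not a B tree, and the data layout is hypothetical.

```python
from bisect import bisect_left

def path_segment_indexed(vertices_of_type_v, sorted_edge_lists, edge_type):
    """Find edges of edge_type per vertex via binary search on a sorted edge list."""
    matches = []
    for v in vertices_of_type_v:                   # O(V)
        edge_list = sorted_edge_lists[v]           # sorted by (edge type, target)
        # (edge_type,) sorts before every (edge_type, target) tuple, so this
        # lands on the first edge of the requested type: O(log(EV/V)).
        i = bisect_left(edge_list, (edge_type,))
        # Matching edges are collocated, so each next value is found in O(1).
        while i < len(edge_list) and edge_list[i][0] == edge_type:
            matches.append((v, edge_type, edge_list[i][1]))
            i += 1
    return matches

# Hypothetical per-vertex edge sets, kept sorted.
sorted_edge_lists = {
    "fox1": sorted([("Eats", "rabbit1"), ("Chases", "rabbit1"), ("Chases", "rabbit2")]),
    "fox2": [("Eats", "rabbit3")],
}
chases = path_segment_indexed(["fox1", "fox2"], sorted_edge_lists, "Chases")
```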
42Path Segment Edge Indices, Constraint
- Beneficial when the query includes a constraint $\theta_v$ on an indexed property of vertices of type v
- Vertex edge sets are indexed as well
- Algorithm
- Logarithmic-time search through the indexed properties for $\theta_v$ in time $O(\log(V))$
- Iterate through vertices (collocated in the index) that satisfy the constraint in time $O(sel(\theta_v) \cdot V)$
- Perform a logarithmic-time search on the edges of each matching vertex in time $O(\log(E_V/V))$
- Iterate through the matching edges in time $O(E/(ED \cdot V))$
- Execution time is $O(\log(V) + sel(\theta_v) \cdot V \cdot (\log(E_V/V) + E/(ED \cdot V))) = O(\log(V) + sel(\theta_v) \cdot V \log(E_V/V) + sel(\theta_v) \cdot E/ED)$
- If $sel(\theta_v) \approx 0$, the dominant factor is the search for vertices, or $O(\log(V))$
- If the selectivity factor is higher, the execution time approaches the times of the previous slide
43Path Segment Edge Search, No Indices
- Algorithm
- Iterate over edges of type e and select those that connect v to w in time $O(E)$
- Find the corresponding vertices in constant time
- Execution time is $O(E)$
44Path Segment Edge Search, Constraint
- Beneficial when the query statement includes a constraint $\theta_e$ on an indexed property of edges of type e
- Algorithm
- Perform a logarithmic-time search through properties to find the first matching edge in time $O(\log(E))$
- Perform a linear search through all subsequent matching edges in time $O(sel(\theta_e) \cdot E)$
- Find both vertices attached to each edge in constant time
- Execution time is $O(\log(E) + sel(\theta_e) \cdot E)$
- If $sel(\theta_e) \approx 0$, the algorithm tends to an execution time of $O(\log(E))$
- Otherwise, the algorithm tends to an execution time of $O(E)$
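The indexed edge search can be illustrated with a range constraint on an edge property. In this sketch a list of Eats edges sorted by time plays the role of the property index; the index layout, the half-open time range, and the function name `eats_between` are assumptions.

```python
from bisect import bisect_left

def eats_between(index, start, end):
    """Return indexed edges with start <= time < end."""
    i = bisect_left(index, (start,))              # first matching edge: O(log E)
    selected = []
    while i < len(index) and index[i][0] < end:   # subsequent matches: O(sel * E)
        selected.append(index[i])                 # both endpoints are read in O(1)
        i += 1
    return selected

# Hypothetical index over Eats edges: (time as hour of day, source, target), sorted by time.
eats_index = sorted([(15, "fox1", "rabbit1"), (9, "fox2", "rabbit3"), (17, "fox1", "rabbit2")])
afternoon = eats_between(eats_index, 10, 16)
```

When few edges satisfy the constraint the binary search dominates; when most do, the linear walk approaches a full scan of the edges.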
45Varying Number of Vertices per Vertex Type
46Varying Number of Edges per Vertex
47Varying Edge Types with Constraints
48Path Segment Ordering
- Assume the following query
- SUBGRAPH Fox Chases Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Rabbit.age < 3
- RETURN Fox new(Ingests) Lettuce
- Query processing produces the following query execution plan
[Figure: query execution plan. Projection π Fox new(Ingests) Lettuce over selection σ Rabbit.age < 3 over the path segments Fox Chases Rabbit and Rabbit Eats Lettuce]
49Path Segment Execution Order Choice
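[Figure: two alternative query execution plans for π Fox new(Ingests) Lettuce with selection σ Rabbit.age < 3, differing in the order in which the path segments Fox Chases Rabbit and Rabbit Eats Lettuce are executed]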
50Execution Order Heuristics
- In simple terms
- Identify the path segment operation that promises to return the fewest results
- Then identify the next operation that promises to return the next fewest results
- It is actually more complicated than this
- Need to search an exponential number of orderings to find the most efficient ordering
- Heuristics can make this search tractable
51Path-Segment Ordering Metric
- Order the path segment operators to return the fewest results
- Rough heuristic
- If predicates $\theta_v$, $\theta_e$, and $\theta_w$ are applied to V, E, and W respectively
- Start with V and use selectivity factors to estimate execution time
- Execution time is
- $V \cdot sel(\theta_v) \cdot (E/(ED \cdot V)) \cdot sel(\theta_e) \cdot (W \cdot ED/E) \cdot sel(\theta_w)$
- Or, $sel(\theta_v) \cdot sel(\theta_e) \cdot sel(\theta_w) \cdot W$
- Use this formula to determine whether Fox Chases Rabbit should precede or follow Rabbit Eats Lettuce
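The metric can be applied to the two path segments of the example query as sketched below. All statistics and selectivity values are invented for illustration, including the assumption that Rabbit.age < 3 matches 20% of rabbits; the function name `estimated_results` is hypothetical.

```python
def estimated_results(V, E, ED, W, sel_v=1.0, sel_e=1.0, sel_w=1.0):
    """Slide estimate: V*sel(v) * E/(ED*V) * sel(e) * W*ED/E * sel(w)."""
    return V * sel_v * (E / (ED * V)) * sel_e * (W * ED / E) * sel_w

# Fox Chases Rabbit: the age constraint applies to the target vertex (Rabbit).
chases_first = estimated_results(V=100, E=400, ED=2, W=50, sel_w=0.2)
# Rabbit Eats Lettuce: the age constraint applies to the source vertex (Rabbit).
eats_first = estimated_results(V=50, E=300, ED=3, W=200, sel_v=0.2)

# Run the segment with the smaller estimate first.
first_segment = ("Fox Chases Rabbit" if chases_first <= eats_first
                 else "Rabbit Eats Lettuce")
```

Because the V, E, and ED terms cancel, each estimate reduces to the product of the selectivities and W, which is why the simplified form on the slide depends only on those quantities.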
52Outline
- Goals and Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions
53Prototype Implementation Schedule
- Currently Implemented
- Schema search returning a single graph
- Pattern matching
- Aliasing
- Ontology-assisted graph query
- Next to be implemented within approximately 6 months
- Externally defined functions
- Wildcard search
- Hypothesis expressions
- Future
- Return a set of graphs (instead of a single graph)
- Embedded queries
- Return new graph structures in query results
- Composite vertices (of vertices and edges)
- Predefined patterns
- Query Optimization
54Future Work
- Relate GQL to a graphical interface
- Enables analysts to express queries through graphical means
- Can leverage several technologies (QGraph, Conceptual Graphs, etc.)
- Augment GQL to include Uncertainty, Geospatial, and Temporal operators and data structures
- Address query optimization techniques
- Create a generic (as much as possible) back-end API to integrate with data sources
- Relational
- Different graph approaches