Title: SPARQLeR: Extended Sparql for Semantic Association Discovery
1. SPARQLeR: Extended SPARQL for Semantic Association Discovery
- Krzysztof Kochut and Maciej Janik
ESWC 2007, Innsbruck, Austria June 4, 2007
Work supported by the National Science Foundation Grant No. IIS-0325464, entitled "SemDIS: Discovering Complex Relationships in the Semantic Web."
2. Paths in RDF
[Figure: example RDF graph with edges labeled child, older, and works_for, illustrating three kinds of path: a directed path; an undirected path with specific properties and directionality; and an undirected path.]
3. Why are paths interesting?
- A path describes how entities are related.
  - Relationships on the path define the meaning of this connection.
  - Entities on the path specify the content.
- Do you have migraine? Try taking magnesium!
  - Path discovered by Dr. D. R. Swanson from partial information available in PubMed publications:
    - stress can lead to loss of magnesium in the human body
    - migraine patients seem to be experiencing stress
  - that's why migraine could lead to a loss of magnesium, so take magnesium to fight migraine!
Swanson, D.R. Migraine and Magnesium: Eleven Neglected Connections. Perspectives in Biology and Medicine, 31(4), 526-557.
4. Formally, what is a simple path?
- Simple directed path between resources $r_0$ and $r_n$ in a description base $R$:
  - a sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ ($n > 0$)
  - $(r_0, p_1, r_1), (r_1, p_2, r_2), \ldots, (r_{n-2}, p_{n-1}, r_{n-1}), (r_{n-1}, p_n, r_n)$ ($n > 0$) are triples in $R$
  - all of the resources $r_i$ ($0 \le i \le n$) in the path are distinct
- Simple undirected path between resources $r_0$ and $r_n$ in $R$:
  - a sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ ($n > 0$)
  - for each $r_{i-1}\, p_i\, r_i$ ($0 < i \le n$) in the path, either $(r_{i-1}, p_i, r_i)$ or $(r_i, p_i, r_{i-1})$ is a triple in $R$
  - all of the resources $r_i$ ($0 \le i \le n$) in the path are distinct
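The two definitions above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides: the function name `is_simple_path`, the representation of the description base as a set of `(subject, property, object)` tuples, and the flat list `[r0, p1, r1, ..., pn, rn]` are all assumptions made for illustration.

```python
def is_simple_path(triples, seq, directed=True):
    """Check whether seq = [r0, p1, r1, ..., pn, rn] is a simple path.

    triples  -- the description base R as a set of (s, p, o) tuples
    directed -- True for the directed definition, False for the undirected one
    """
    # A path needs at least one property (n > 0) and must alternate
    # resource, property, resource, ..., so its length is odd and >= 3.
    if len(seq) < 3 or len(seq) % 2 == 0:
        return False
    # All resources r_i on the path must be distinct.
    resources = seq[0::2]
    if len(set(resources)) != len(resources):
        return False
    # Every hop must be backed by a triple in R (in either direction
    # when the path is undirected).
    for i in range(1, len(seq), 2):
        s, p, o = seq[i - 1], seq[i], seq[i + 1]
        if (s, p, o) in triples:
            continue
        if not directed and (o, p, s) in triples:
            continue
        return False
    return True


# Two parents sharing a child: connected only by an undirected path.
R = {("a", "child", "b"), ("c", "child", "b")}
print(is_simple_path(R, ["a", "child", "b", "child", "c"], directed=True))
print(is_simple_path(R, ["a", "child", "b", "child", "c"], directed=False))
```

The example graph mirrors the "child" edges of the earlier figure: the two resources `a` and `c` are related only when edge direction is ignored.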
5. Paths and SPARQL
- A SPARQL query can express only static graph patterns.
  - Some flexibility is introduced by an OPTIONAL part, but it does not solve path problems.
- No support for flexible-length path expressions.
  - A glycan biosynthesis pathway in biology has a specific pattern (properties), but its length may be unknown.
  - Path discovery may involve unknown length and pattern, as in Dr. Swanson's example.
6. What do we need to discover paths?
- Knowledge discovery needs more flexible patterns.
  - Patterns may be partially known or even unknown (unrestricted path).
- Properties on the path, their order and directionality create a specific meaning.
- Entities on the path provide content.
- Relationships to entities outside of the path give an additional context.
7. Proposed extensions
- A path may have a flexible length
  - For computational reasons, length is limited.
- Constraints on properties
  - Specific properties must appear in the path.
  - Their order and directionality are meaningful.
  - They can form a repeating pattern.
- Constraints on resources
  - Specific resources must be on the path.
  - They can be anywhere on the path or at specific positions.
8. SPARQLeR
- Extension of SPARQL for semantic association discovery.
- Seamlessly integrated into the SPARQL syntax.
- Graph patterns incorporating simple paths with constraints.
- Constraints are based on regular expressions over properties.
9. What is a path in SPARQLeR?
- Path is a meta-property that connects two resources.
- Defined as a sequence of interleaving properties and resources.
- Starts and ends with properties (endpoint resources are not included).
- A path of length 1 is a sequence with just one property.
    <rdfs:Class rdf:about="http://meta.org/rdf-meta-schema#Path">
      <rdfs:isDefinedBy rdf:resource="http://meta.org/rdf-meta-schema"/>
      <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
      <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
      <rdfs:label>Path</rdfs:label>
      <rdfs:comment>The class of RDFMS paths.</rdfs:comment>
    </rdfs:Class>
10. Path patterns in SPARQLeR
- Meta-property: similar concept to a property
  - Resource --property--> Resource
  - Resource --path--> Resource
- Path as a sequence
  - Test if a resource is in the path
    - rdfs:member
  - Test if a resource is at a specific position in the path
    - rdf:_2, rdf:_4, ...
- SPARQLeR-specific path properties
  - Test all resources or all properties in the path
    - rdfms:entityResource and rdfms:propertyResource
  - Example: all resources on a path must be of type foo:Person
    - %path rdfms:entityResource ?x . ?x rdf:type foo:Person
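The sequence view of a path can be sketched in a few lines of Python. This is an illustration only: it assumes a path is stored as a flat list that starts and ends with a property (endpoint resources excluded, as stated on the previous slide), so odd positions hold properties and even positions hold resources. The helper names and the `types` dictionary are hypothetical, not part of SPARQLeR.

```python
def at_position(path, n):
    """Element at 1-based position n, in the spirit of rdf:_n."""
    return path[n - 1]


def is_member(path, resource):
    """True if the resource appears on the path (rdfs:member-style test)."""
    return resource in path[1::2]


def path_length(path):
    """Path length counted as the number of properties (hops)."""
    return len(path[0::2])


def all_entities_of_type(path, types, wanted):
    """True if every resource on the path has the wanted type.

    types -- hypothetical dict mapping a resource to its set of rdf:type values
    """
    return all(wanted in types.get(r, set()) for r in path[1::2])


# A two-hop path between endpoints <r> and <s> (the endpoints are not stored).
path = ["foo:knows", "ex:alice", "foo:worksWith"]
types = {"ex:alice": {"foo:Person"}}

print(at_position(path, 1))   # first property on the path
print(at_position(path, 2))   # first intermediate resource
print(is_member(path, "ex:alice"))
print(all_entities_of_type(path, types, "foo:Person"))
```

Under this layout, `rdf:_1` addresses a property while `rdf:_2` and `rdf:_4` address resources, which matches the positions used on the slide.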
11. Path pattern anatomy
[Figure: anatomy of a path pattern, with properties labeled p1, p2, and p3.]
12. Path types in SPARQLeR
- Directionality of relationships in the path defines its specific semantics.
- SPARQLeR allows definition of the following path types:
  - As defined in graph theory
    - Directed
    - Undirected
  - SPARQLeR-specific extension
    - Defined directionality path (includes directed path)
13. Directionality of properties in path
- Defined directionality paths
  - Neither directed nor undirected
  - Each property in a path has a specified directionality
- Example: simple graph with p relationship
  - (a) X p Y, directed path
  - (b) X p Y, undirected path
  - (c) X ( p p^-1 ) Y, directional path
[Figure: panels (a), (b), and (c), each showing a chain of p-labeled edges between X and Y.]
14. Inverse property operator
- In standard SPARQL there is no need for an inverse property operator
  - Pattern syntax is based on individual statements, so it is easy to reverse direction.
- Defining path constraints requires the inverse operator
  - A path expression defines constraints on properties, not on individual statements.
  - Without the inverse property operator some path constraints would be impossible to express (as shown in the previous example).
15. RegExp in path constraints
- Path constraints on properties are based on regular expressions
  - Uses syntax similar to lex
  - Easy for grep users
- Examples
- a c d a (bc) a
- abc c? d ( b a-1 ) c
16. Path constraints in SPARQLeR
- Defined as regular path expressions
  - Can specify patterns of properties in the path
  - Directionality requirement needs the inverse operator: - (minus), as in -p
- Supported regular expressions
  - p (single property)
  - -p (the inverse of p)
  - [p1 p2 ... pn] (class of properties)
  - -[p1 p2 ... pn] (class of inverse properties)
  - [^p1 p2 .. pn] (complement of properties)
  - -[^p1 p2 .. pn] (inverse of complement of properties)
  - . (wildcard)
  - x | y (alternative)
  - xy (concatenation)
  - x* (Kleene star)
  - x+ (one or more repetitions)
  - (x) (match a path matched by x)
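To make the operators listed above concrete, here is a minimal Python sketch that translates such a path expression into an ordinary regular expression over an encoded sequence of hops. It is not the SPARQLeR implementation (which, per the slides, simulates DFAs). The encoding of each hop as `+name ` or `-name `, the function names, and the choice to let the wildcard match a hop in either direction are all assumptions made for this example.

```python
import re

# One token: a (possibly inverted) property class, a (possibly inverted)
# property name, or a single operator character.
TOKEN = re.compile(r"\s*(-?\[\^?[^\]]*\]|-?[A-Za-z_][\w:]*|[.|*+()])")


def compile_path_expr(expr):
    """Translate a path expression into a compiled Python regex."""
    parts, pos, expr = [], 0, expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {expr[pos:]!r}")
        tok, pos = m.group(1), m.end()
        if tok in {"|", "*", "+", ")"}:
            parts.append(tok)                    # same meaning in Python regexes
        elif tok == "(":
            parts.append("(?:")                  # grouping only, no capture
        elif tok == ".":
            parts.append(r"(?:[+-]\S+ )")        # wildcard: any hop (assumption)
        else:
            sign = re.escape("-" if tok.startswith("-") else "+")
            body = tok.lstrip("-")
            if body.startswith("["):
                negated = body.startswith("[^")
                names = body[2 if negated else 1:-1].split()
                alts = "|".join(re.escape(n) for n in names)
                if negated:                      # any property except the listed ones
                    parts.append(f"(?:{sign}(?!(?:{alts}) )\\S+ )")
                else:                            # any one of the listed properties
                    parts.append(f"(?:{sign}(?:{alts}) )")
            else:                                # single property, maybe inverted
                parts.append(f"(?:{sign}{re.escape(body)} )")
    return re.compile("".join(parts))


def encode_path(steps):
    """Encode hops [(property, inverted), ...] as '+p ' or '-p ' tokens."""
    return "".join(("-" if inv else "+") + name + " " for name, inv in steps)


def matches(expr, steps):
    """True if the whole hop sequence is matched by the path expression."""
    return compile_path_expr(expr).fullmatch(encode_path(steps)) is not None


print(matches("a [b c]* -a", [("a", False), ("b", False), ("a", True)]))
print(matches("[^a] .", [("a", False), ("b", False)]))
```

Reusing Python's `re` engine keeps the sketch short; a production matcher would more likely build an automaton over property identifiers directly, so that partial paths can be rejected during the search.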
17. Path constraints (cont'd)
- Class of properties and inverse operator
  - Complement operator can be applied only to defined properties, not their inverses
- Inverse operator
  - Not allowed inside class of properties
  - Inverses set created from defined properties
- Example
  - properties: q r s t
  - [^r t] → q s
  - -[^q r] → t^-1 s^-1 (inverses)
  - ([^s t] | -[^t]) → q r q^-1 r^-1 s^-1
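The set arithmetic behind this example can be reproduced with a short Python sketch. It assumes the lex-style notation in which `[^...]` is the complement of a property class and a leading `-` takes the inverses of the resulting set. The helper names and the `^-1` suffix for inverse labels are illustrative choices, not SPARQLeR syntax.

```python
# The defined properties of the example.
PROPERTIES = {"q", "r", "s", "t"}


def complement(*names):
    """[^p1 ... pn]: every defined property except the listed ones."""
    return PROPERTIES - set(names)


def inverse(props):
    """-[...]: the inverses of a set of defined properties."""
    return {p + "^-1" for p in props}


# [^r t]
print(sorted(complement("r", "t")))
# -[^q r]
print(sorted(inverse(complement("q", "r"))))
# ([^s t] | -[^t])
print(sorted(complement("s", "t") | inverse(complement("t"))))
```

Because the complement is always taken over the defined properties first and the inverse is applied afterwards, a class never mixes a property with its own inverse unless an alternative (`|`) combines two classes explicitly.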
18. Integrating paths into SPARQL
- Path variable binds a path
  - Name begins with % instead of ?
- Simple patterns: path between two resources
  - SELECT ?prop WHERE { <r> ?prop <s> }
  - SELECT %path WHERE { <r> %path <s> }
- Single source path
  - SELECT %path, ?res WHERE { <r> %path ?res }
19. Integrating paths into SPARQL
- Resources on the path
  - SELECT %path WHERE { <r> %path <s> . %path rdfs:member <e> }
  - SELECT %path WHERE { <r> %path <s> . %path rdf:_1 <p> }
- Listing path elements: list operator
  - SELECT list(%path) WHERE { <r> %path <s> }
20. Expressing path constraints
- Bounded path length
  - only constants allowed
  - FILTER(length(%path) <= 5)
  - FILTER(length(%path) >= 3 && length(%path) <= 7)
21. Expressing path constraints
- Constraints added as a regular expression filter (existing syntax in SPARQL)
  - regex( pathvariable, pathexpr, pathflags )
  - FILTER(regex(%path, ".foo:prop.", "uis"))
- Flags: i (instances), s (schema), l (literals), h (match using hierarchy), d (set directionality), u (undirected)
- Default flags: d i
22. Some examples
SELECT list(%path), ?res WHERE {
  <r> %path ?res .
  %path rdfs:member ?x .
  ?x foo:locatedIn wiki:Europe
  FILTER(regex(%path, "foo:prop"))
}
SELECT list(%path) WHERE {
  <r> %path <s> .
  %path rdfms:entityResource ?x .
  ?x rdf:type foo:Person
  FILTER(regex(%path, "(foo:prop|foo:rel)", "u"))
}
SELECT list(%path) WHERE {
  <r> %path <s>
  FILTER(length(%path) <= 6 && length(%path) >= 4 &&
         regex(%path, "(foo:prop -foo:rel)"))
}
23. SPARQLeR Prototype Implementation
- Prototype implementation is based on the BRAHMS RDF/S main-memory storage.
- Path search is based on a bi-directional BFS for simple paths.
- Checking of path constraints in regex is implemented as a simulation of DFAs.
Janik, M. and Kochut, K., BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery. ISWC 2005.
24. Implementation details
- Each path expression (FILTER regex) is translated into a DFA.
- For a path between two resources, partial constraints are checked while building the search trie from both endpoints (forward and reverse DFAs).
- When a path is connected, the forward DFA is used to check the full path constraint.
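The idea of checking partial constraints during the search can be illustrated with a simplified Python sketch. It deliberately differs from the prototype described above: it searches from one endpoint only (a depth-first enumeration rather than a bi-directional BFS), and the DFA is written by hand as a transition dictionary instead of being compiled from a regex. All names, the toy triples, and the DFA are assumptions for illustration.

```python
from collections import defaultdict


def find_paths(triples, source, target, dfa, start, accepting, max_len):
    """Enumerate simple paths from source to target accepted by the DFA.

    dfa maps (state, (property, inverted)) -> next state; a missing entry
    means the partial path can never satisfy the constraint, so that
    branch is pruned immediately.
    """
    # Each triple can be traversed forwards, or backwards as an inverse hop.
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append(((p, False), o))
        adj[o].append(((p, True), s))

    results = []

    def extend(node, state, visited, hops):
        if node == target and hops:
            if state in accepting:
                results.append(list(hops))
            return                      # a simple path ends at the target
        if len(hops) == max_len:
            return
        for label, nxt in adj[node]:
            nstate = dfa.get((state, label))
            if nstate is None or nxt in visited:
                continue                # constraint violated or node repeated
            visited.add(nxt)
            hops.append((label, nxt))
            extend(nxt, nstate, visited, hops)
            hops.pop()
            visited.remove(nxt)

    extend(source, start, {source}, [])
    return results


# Toy data: two reactions chained through an intermediate product B.
triples = [
    ("R1", "has_acceptor_substrate", "A"), ("R1", "has_product", "B"),
    ("R2", "has_reactant", "B"), ("R2", "has_product", "C"),
]
# Hand-built DFA for ((-has_acceptor_substrate | -has_reactant) has_product)*
dfa = {
    (0, ("has_acceptor_substrate", True)): 1,
    (0, ("has_reactant", True)): 1,
    (1, ("has_product", False)): 0,
}
for hops in find_paths(triples, "A", "C", dfa, start=0, accepting={0}, max_len=4):
    print([node for _, node in hops])
```

Pruning on a missing DFA transition is what keeps the search from exploring hops that cannot lead to a matching path. Searching from both endpoints, as the prototype does, additionally needs a reversed automaton for the backward half.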
25. Experiments: biology pathway
- Biosynthesis paths in biology (glycomics)
  - How is a specific glyco peptide created from a basic structure?
- Find the pathway between dolichol phosphate and glyco peptide G00009
  - Path has 15 reactions (30 hops, as each reaction is represented by its substrates and products)
  - Only an undirected path connects the endpoint resources, but a specific directionality pattern is present
[Figure: RDF representation of sample reactions in the path.]
26. Experiments: biology pathway
- Functionality test: proof of concept
- N-glycan biosynthesis pathway
SELECT list(%path)
WHERE {
  glyco:dolichol_phosphate %path glyco:glyco_peptide_G00009 .
  %path rdfs:member enzyo:R05969
  FILTER ( length(%path) <= 30 &&
           regex(%path, "((-glyco:has_acceptor_substrate | -glyco:has_reactant) glyco:has_product)*" ) )
}
Ontology: GlycO
Length: 30 hops
Consists of 15 reactions
Search time: milliseconds (less than 1 tick)...
Courtesy of Dr. Alison Vandersall-Nairn, University of Georgia
27. Experiments
- Scalability
  - Modified DBLP datasets in RDF (added random citations)
  - Test on increasing dataset (adding older years of publications)
  - Search for cited publications (transitive)
PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#>
SELECT ?end_publication WHERE {
  <http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06> %path ?end_publication
  FILTER ( length(%path) <= 26 && regex(%path, "(opus:cites_publication)*" ) )
}
B. Aleman-Meza et al. Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. WWW 2006.
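The single-source query above asks for every publication reachable through a bounded chain of citations. Here is a minimal Python sketch of that search, assuming the citation graph is available as a dictionary from a publication to the publications it cites. The function name and the toy data are hypothetical. Because only one property is followed in one direction, a plain breadth-first search is enough: a publication is the end of some simple citation path of at most `max_hops` hops exactly when its BFS distance is at most `max_hops`.

```python
from collections import deque


def cited_within(cites, start, max_hops):
    """Publications reachable from start by following at most max_hops citations."""
    seen = {start}                  # the start itself is never an end point
    reached = set()
    frontier = deque([(start, 0)])
    while frontier:
        pub, depth = frontier.popleft()
        if depth == max_hops:
            continue                # length bound reached on this branch
        for cited in cites.get(pub, ()):
            if cited not in seen:
                seen.add(cited)
                reached.add(cited)
                frontier.append((cited, depth + 1))
    return reached


# Toy citation graph containing a cycle A -> B -> C -> A.
cites = {"A": ["B"], "B": ["C"], "C": ["A", "D"]}
print(sorted(cited_within(cites, "A", 2)))
print(sorted(cited_within(cites, "A", 26)))
```

This shortcut works only for reachability under a single repeated property. Listing the paths themselves, or matching a richer expression, requires enumerating simple paths explicitly.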
28. Experiments: dataset characteristics
29. Experiments: results, single source paths
Search paths up to length 26
30. Experiments: results, two endpoint paths
31. More complex uses of path expressions
- Discover connecting paths with a shared node
  - Path between A and B, length up to 4
  - Path between C and D, length up to 4
  - Both paths have a shared resource
A %path_1 B . FILTER(length(%path_1) <= 4)
C %path_2 D . FILTER(length(%path_2) <= 4)
%path_1 rdfs:member ?x . %path_2 rdfs:member ?x
Potential subgraph discovery
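The shared-resource join can be sketched as a set intersection over the resources of two candidate paths. The following Python fragment is illustrative only: it assumes each path has already been found and is stored as a flat list that starts and ends with a property (endpoints excluded), and the function names are hypothetical.

```python
def path_length(path):
    """Number of properties (hops) on the path."""
    return len(path[0::2])


def resources(path):
    """Intermediate resources on the path."""
    return set(path[1::2])


def intersecting_pairs(paths_ab, paths_cd, max_len=4):
    """Pairs of length-bounded paths that share at least one resource."""
    pairs = []
    for p1 in paths_ab:
        if path_length(p1) > max_len:
            continue
        for p2 in paths_cd:
            if path_length(p2) > max_len:
                continue
            shared = resources(p1) & resources(p2)
            if shared:              # plays the role of the common ?x binding
                pairs.append((p1, p2, shared))
    return pairs


paths_ab = [["p", "x", "q"], ["p", "y", "q"]]    # candidate paths A ... B
paths_cd = [["r", "x", "s"]]                     # candidate paths C ... D
for p1, p2, shared in intersecting_pairs(paths_ab, paths_cd):
    print(p1, p2, sorted(shared))
```

Each resulting pair, together with its shared resources, outlines a small connected subgraph spanning A, B, C, and D, which is the sense in which such queries hint at subgraph discovery.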
32. SPARQLeR summary
- Path expressions
  - use of regular expressions over properties
- Flexible path specification
  - Undirected
  - Defined directionality paths
  - Directed
  - Length restricted
- Complex path patterns
  - Test of resources and properties on the path
  - Intersecting paths
33. Conclusion and future work
- SPARQLeR extension fits seamlessly into the current SPARQL syntax.
- Performance of path queries is acceptable (if the defined expression is highly selective).
- Optimization of path queries, complex expressions and multiple paths in a query.
- Inclusion of context.
34. SPARQLeR
Krys Kochut, Maciej Janik
35. Predicate vs. statement expressions
- Predicate alphabet
  - p
  - -p
  - _ (wildcard)
  - simplicity
- Statement alphabet
  - s p o
  - _ p o
  - s _ o
  - s p _
  - _ _ o
  - _ p _
  - s _ _
  - _ _ _
Additional rules: which statement pattern can be connected with which one