Title: Ontology Based Extraction of RDF Data from the World Wide Web
1Ontology Based Extraction of RDF Data from the World Wide Web
- Tim Chartrand
- Masters Thesis
- Research Supported By NSF
2Introduction
- World Wide Web
- Has a huge amount of existing information
- Designed primarily for human consumption
- Semantic Web
- Is an extension of WWW
- Gives information a well-defined meaning
- Allows automation of tasks
- DEG contribution: Extract data from the WWW
- Solution
- Extract Semantic Web data from the WWW
- Superimpose extracted data
3Research Overview
4RDF: What is it?
- Resource Description Framework
- Language of the Semantic Web
- Set of <subject> <predicate> <object> triples
- <mailto:tim@cs.byu.edu> <genealogy:age> 25
- <mailto:tim@cs.byu.edu> <genealogy:fatherOf> <mailto:tyler@thechartrands.com>
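The triple structure above can be sketched in a few lines of Python. This is only an illustration: the namespace URI bound to the `genealogy` prefix is a hypothetical placeholder (the slide does not give one), the helper `to_ntriples` is an invented name, and literal escaping is deliberately left out.

```python
# Hypothetical namespace for the "genealogy" prefix; the slide does not state its URI.
GENEALOGY = "http://example.org/genealogy#"


def to_ntriples(subject, predicate, obj, literal=False):
    """Serialize one <subject> <predicate> <object> triple as an N-Triples line.

    `obj` is written as a quoted literal when `literal` is True, otherwise as a
    URI reference. Escaping of special characters is not handled in this sketch.
    """
    obj_term = f'"{obj}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


# The two example triples from the slide.
triples = [
    ("mailto:tim@cs.byu.edu", GENEALOGY + "age", "25", True),
    ("mailto:tim@cs.byu.edu", GENEALOGY + "fatherOf",
     "mailto:tyler@thechartrands.com", False),
]

for s, p, o, is_literal in triples:
    print(to_ntriples(s, p, o, is_literal))
```

The first triple has a literal object and the second has a resource object, which are the two cases the slide shows.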
5DAML
- Core Concepts
- daml:Class defines a class
- daml:Property defines a binary relation, has a value
- rdfs:domain specifies class to which a property applies
- rdfs:range specifies possible values of a property
- daml:UniqueProperty, daml:UnambiguousProperty specify cardinality constraints for a property
6Example Ontology
- . . .
- <daml:Class rdf:ID="Program">
- <rdfs:label>Program</rdfs:label>
- </daml:Class>
- <daml:Class rdf:ID="OperatingSystem">
- <rdfs:label>OperatingSystem</rdfs:label>
- </daml:Class>
- . . .
- <daml:DatatypeProperty rdf:ID="Name">
- <rdf:type rdf:resource="daml:UniqueProperty"/>
- <rdf:type rdf:resource="daml:UnambiguousProperty"/>
- <rdfs:domain rdf:resource="Program"/>
- <rdfs:range rdf:resource="rdfs:Literal"/>
- </daml:DatatypeProperty>
- <daml:Property rdf:ID="supportsOperatingSystem">
- <rdfs:domain rdf:resource="Program"/>
- <rdfs:range rdf:resource="OperatingSystem"/>
- </daml:Property>
- . . .
7DAML → OSM
- Class → Non-lexical object set
- Property → Binary relationship set between object sets
- Literal property → Lexical object set and binary relationship set between non-lexical and lexical object sets
- Cardinality restriction → Participation constraint
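The four mapping rules above can be sketched as a small conversion routine. This is a minimal illustration, not the thesis algorithm: the `OsmModel` container, the dictionary layout of a property, the generated relationship name, and the `0:1` / `0:*` notation for participation constraints are all assumptions made for the example. It relies only on the standard DAML reading that a unique property allows at most one value per subject and an unambiguous property allows at most one subject per value.

```python
from dataclasses import dataclass, field


@dataclass
class OsmModel:
    """Hypothetical, minimal stand-in for an OSM ontology."""
    nonlexical: set = field(default_factory=set)
    lexical: set = field(default_factory=set)
    # Each entry: (name, from_set, to_set, (from_constraint, to_constraint))
    relationships: list = field(default_factory=list)


def daml_to_osm(classes, properties):
    """Apply the four mapping rules to already-parsed DAML declarations.

    `properties` is a list of dicts with keys: name, domain, and either
    range (object property) or literal=True, plus optional unique / unambiguous.
    """
    model = OsmModel()
    # Rule 1: class -> non-lexical object set
    model.nonlexical.update(classes)
    for prop in properties:
        if prop.get("literal"):
            # Rule 3: literal property -> lexical object set + relationship set
            target = prop["name"]
            model.lexical.add(target)
            rel_name = f'{prop["domain"]} has {target}'
        else:
            # Rule 2: property -> binary relationship set between object sets
            target = prop["range"]
            rel_name = prop["name"]
        # Rule 4: cardinality restriction -> participation constraint (assumed notation)
        constraints = (
            "0:1" if prop.get("unique") else "0:*",
            "0:1" if prop.get("unambiguous") else "0:*",
        )
        model.relationships.append((rel_name, prop["domain"], target, constraints))
    return model


# The Program / OperatingSystem ontology from the example slide.
osm = daml_to_osm(
    classes=["Program", "OperatingSystem"],
    properties=[
        {"name": "Name", "domain": "Program", "literal": True,
         "unique": True, "unambiguous": True},
        {"name": "supportsOperatingSystem", "domain": "Program",
         "range": "OperatingSystem"},
    ],
)
for rel in osm.relationships:
    print(rel)
```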
8DAML → OSM
- <daml:Class rdf:ID="Program">
- <rdfs:label>Program</rdfs:label>
- </daml:Class>
- <daml:Class rdf:ID="OperatingSystem">
- <rdfs:label>OperatingSystem</rdfs:label>
- </daml:Class>
- . . .
- <daml:DatatypeProperty rdf:ID="Name">
- <rdf:type rdf:resource="daml:UniqueProperty"/>
- <rdf:type rdf:resource="daml:UnambiguousProperty"/>
- <rdfs:domain rdf:resource="Program"/>
- <rdfs:range rdf:resource="rdfs:Literal"/>
- </daml:DatatypeProperty>
- <daml:Property rdf:ID="supportsOperatingSystem">
- <rdfs:domain rdf:resource="Program"/>
- <rdfs:range rdf:resource="OperatingSystem"/>
- </daml:Property>
9Data Frames
- Lexical object sets need data frames.
- Use data-frame library
- Match lexical object sets with data frames
- Compare stemmed names and aliases
- Levenshtein edit distance
- Soundex
- Longest common subsequence
- Weighted average
- Specialization heuristic
- Choose most similar data frame (above a threshold)
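The matching steps listed above could look roughly like the following sketch. It combines Levenshtein distance, Soundex, and longest common subsequence in a weighted average and picks the best data frame above a threshold. The weights, the threshold of 0.6, the function names, and the sample library are all assumptions for illustration; stemming is approximated by lowercasing, and the specialization heuristic is not modeled.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def soundex(word):
    """Four-character American Soundex code, e.g. 'Robert' gives 'R163'."""
    groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
    codes = {ch: digit for digit, letters in enumerate(groups, 1) for ch in letters}
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result, last = word[0].upper(), codes.get(word[0])
    for ch in word[1:]:
        code = codes.get(ch)
        if code is not None and code != last:
            result += str(code)
        if ch not in "hw":  # h and w do not separate equal codes
            last = code
    return (result + "000")[:4]


def similarity(name, other, weights=(0.5, 0.2, 0.3)):
    """Weighted average of the three measures, in [0, 1]. Weights are assumed."""
    a, b = name.lower(), other.lower()
    longest = max(len(a), len(b)) or 1
    lev = 1 - levenshtein(a, b) / longest
    snd = 1.0 if soundex(a) == soundex(b) else 0.0
    lcs = lcs_length(a, b) / longest
    w_lev, w_snd, w_lcs = weights
    return w_lev * lev + w_snd * snd + w_lcs * lcs


def best_data_frame(object_set_name, library, threshold=0.6):
    """Return the most similar data frame name, or None if below the threshold.

    `library` maps a data frame name to a list of its aliases.
    """
    scored = [
        (max(similarity(object_set_name, n) for n in [frame, *aliases]), frame)
        for frame, aliases in library.items()
    ]
    score, frame = max(scored)
    return frame if score >= threshold else None


# Hypothetical data-frame library for the example.
library = {
    "Phone": ["Telephone", "PhoneNumber"],
    "RentalRate": ["Rent", "Price"],
    "Bedrooms": ["Beds"],
}
print(best_data_frame("Bedroom", library))
```

Returning `None` below the threshold leaves the lexical object set unmatched, so the user can assign a data frame by hand.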
10User Modification
- Provide graphical ontology editor
- Automate graph layout
- Allow the user to edit participation constraints
- Allow user to edit data-frame mapping
- Provide data frame editor
11Extracting the Data
12Pointing to the Data
<html> . . . <body> <table>
<tr> <td>
<a href="..."><b>Stick Death 1.0</b></a><br />
Advance in levels, grab weapons, and unlock new levels and characters.<br />
<b>OS</b> Windows 3.x/95/98/Me/NT/2000/XP<br />
<b>File Size</b>2.66MB<br />
<b>License</b>Free<br />
</td> <td>05/14/2002<br />
<i><b>new</b></i>
</td>
<td></td> <td>2,235</td>
<td><a href="...">Download now</a><br /><br /></td> </tr>
. . .
xpointer(string-range(/html[1]/body[1]/table[1]/tr[1], "", 10, 3))
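To make the pointer idea concrete, here is a rough Python sketch of resolving a pointer of this shape back to text. It takes a simplified reading of string-range: concatenate the text under the addressed element and take a given number of characters starting at a 1-based offset. The helper name `string_range`, the tiny well-formed sample document, and the offsets used are assumptions for illustration and do not implement the full XPointer semantics.

```python
import xml.etree.ElementTree as ET


def string_range(root, path, offset, length):
    """Return `length` characters starting at the 1-based `offset` within the
    string value of the first element matching `path` (simplified reading)."""
    node = root.find(path)
    if node is None:
        raise ValueError(f"no element matches {path!r}")
    text = "".join(node.itertext())
    start = offset - 1
    return text[start:start + length]


# Small well-formed stand-in for the page; real HTML would need a tolerant parser.
doc = ET.fromstring(
    "<html><body><table><tr>"
    "<td><b>Stick Death 1.0</b></td>"
    "<td>05/14/2002</td>"
    "</tr></table></body></html>"
)

# The root element is <html>, so the path is written relative to it.
print(string_range(doc, "body[1]/table[1]/tr[1]", 7, 5))
print(string_range(doc, "body[1]/table[1]/tr[1]", 16, 10))
```

Storing a path plus an offset and length lets the extracted value point back to its exact location in the source page instead of copying the page itself.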
13Convert to RDF
14Superimposed Data
15Results
- RDF Data Extraction and Viewing
- Built 4 data-extraction ontologies
- 3 from DAML ontologies for data extraction
- 1 from an existing DAML ontology
- Most existing DAML ontologies not good for data extraction
- Data Frame Matcher
- 8 training ontologies, 16 test ontologies
- 128 lexical object sets, 40 correct matches, 12 incorrect matches
- Precision: 77%
- Recall: 89%
- Experiment (apartment rentals): 6 students, 3 data frames
- Phone: 2.8 min
- RentalRate: 16.5 min
- Bedrooms: 17.5 min
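The data-frame matcher figures above can be checked with a short calculation. Precision follows directly from the correct and incorrect match counts. The slide does not say how many lexical object sets were expected to match, so the sketch only reports which denominators would be consistent with the stated recall; the function name `implied_expected_matches` is invented for this example.

```python
correct, incorrect = 40, 12

# Precision = correct matches / all matches proposed by the matcher.
precision = correct / (correct + incorrect)
print(f"precision = {precision:.1%}")


def implied_expected_matches(correct, reported_percent, upper):
    """Denominators for which correct / n rounds to the reported recall percent.

    This is an inference from the slide's numbers, not a figure from the slide.
    """
    return [n for n in range(correct, upper + 1)
            if round(100 * correct / n) == reported_percent]


candidates = implied_expected_matches(correct, 89, upper=128)
print("expected matches consistent with 89% recall:", candidates)
```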
16Contributions
- Advancement of Semantic Web
- Application of Information Extraction to building Semantic Web content
- Semantic Web data as superimposed information
- Algorithm for ontology conversion
17Future Work
- Data extraction
- Enhance name matcher with data values
- Support n-ary relationship sets
- RDF data generation
- Generate only one URI for an object
- Associate concepts from DAML ontologies to well-known DAML ontologies