Toward Tomorrow - PowerPoint PPT Presentation

Slides: 139
Provided by: davidw8
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes

Title: Toward Tomorrow


1
Toward Tomorrow's Semantic Web
  • An Approach Based on
  • Information Extraction Ontologies

David W. Embley Brigham Young University
Funded in part by the National Science Foundation
2
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

3
Grand Challenge
Semantic Understanding
4
Grand Challenge
Semantic Understanding
"If ever there were a technology that could
generate trillions of dollars in savings
worldwide, it would be the technology that
makes business information systems
interoperable." (Jeffrey T. Pollock, VP of
Technology Strategy, Modulant Solutions)
5
Grand Challenge
Semantic Understanding
"The Semantic Web ... content that is meaningful
to computers and that will unleash a revolution
of new possibilities ... Properly designed, the
Semantic Web can assist the evolution of human
knowledge ..." (Tim Berners-Lee, Weaving the
Web)
6
Grand Challenge
Semantic Understanding
"20th Century: Data Processing. 21st Century:
Data Exchange. The issue now is mutual
understanding." (Stefano Spaccapietra, Editor in
Chief, Journal on Data Semantics)
7
Grand Challenge
Semantic Understanding
"The Grand Challenge of semantic understanding
has become mission critical. Current solutions
won't scale. Businesses need economic growth
dependent on the web working and scaling (cost:
$1 trillion/year)." (Michael Brodie, Chief
Scientist, Verizon Communications)
8
What is Semantic Understanding?
Semantics: The meaning or the interpretation of
a word, sentence, or other language form.
Understanding: To grasp or comprehend
what's intended or expressed.
- Dictionary.com
9
Can We Achieve Semantic Understanding?
A computer doesn't truly understand anything.

But computers can manipulate terms in ways that
are useful and meaningful to the human user.
- Tim Berners-Lee
Key Point: it only has to be good enough. And
that's our challenge and our opportunity!
10
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

11
Information Value Chain
Translating data into meaning
12
Foundational Definitions
  • Meaning: knowledge that is relevant or activates
  • Knowledge: information with a degree of certainty
    or community agreement
  • Information: data in a conceptual framework
  • Data: attribute-value pairs

- Adapted from Meadow92
13
Foundational Definitions
  • Meaning: knowledge that is relevant or activates
  • Knowledge: information with a degree of certainty
    or community agreement (ontology)
  • Information: data in a conceptual framework
  • Data: attribute-value pairs

- Adapted from Meadow92
16
Data
  • Attribute-Value Pairs
  • Fundamental for information
  • Thus, fundamental for knowledge and meaning

17
Data
  • Attribute-Value Pairs
  • Fundamental for information
  • Thus, fundamental for knowledge and meaning
  • Data Frame
  • Extensive knowledge about a data item
  • Everyday data: currency, dates, time, weights
    and measures
  • Textual appearance, units, context, operators,
    I/O conversion
  • Abstract data type with an extended framework

18
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

19
?
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm
23
Digital Camera
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm
24
?
Year: 2002
Make: Ford
Model: Thunderbird
Mileage: 5,500 miles
Features: Red, ABS, 6 CD changer, keyless entry
Price: $33,000
Phone: (916) 972-9117
28
Car Advertisement
Year: 2002
Make: Ford
Model: Thunderbird
Mileage: 5,500 miles
Features: Red, ABS, 6 CD changer, keyless entry
Price: $33,000
Phone: (916) 972-9117
29
?
Flight     Class  From  Time/Date           To   Time/Date           Stops
Delta 16   Coach  JFK   6:05 pm, 02 01 04   CDG  7:35 am, 03 01 04   0
Delta 119  Coach  CDG   10:20 am, 09 01 04  JFK  1:00 pm, 09 01 04   0
31
Airline Itinerary
Flight     Class  From  Time/Date           To   Time/Date           Stops
Delta 16   Coach  JFK   6:05 pm, 02 01 04   CDG  7:35 am, 03 01 04   0
Delta 119  Coach  CDG   10:20 am, 09 01 04  JFK  1:00 pm, 09 01 04   0
32
?
Monday, October 13, 2003
Group A      W  L  T  GF  GA  Pts.
USA          3  0  0  11   1  9
Sweden       2  1  0   5   3  6
North Korea  1  2  0   3   4  3
Nigeria      0  3  0   0  11  0
Group B      W  L  T  GF  GA  Pts.
Brazil       2  0  1   8   2  7
34
World Cup Soccer
Monday, October 13, 2003
Group A      W  L  T  GF  GA  Pts.
USA          3  0  0  11   1  9
Sweden       2  1  0   5   3  6
North Korea  1  2  0   3   4  3
Nigeria      0  3  0   0  11  0
Group B      W  L  T  GF  GA  Pts.
Brazil       2  0  1   8   2  7
35
?
Calories: 250 cal
Distance: 2.50 miles
Time: 23.35 minutes
Incline: 1.5 degrees
Speed: 5.2 mph
Heart Rate: 125 bpm
38
Treadmill Workout
Calories: 250 cal
Distance: 2.50 miles
Time: 23.35 minutes
Incline: 1.5 degrees
Speed: 5.2 mph
Heart Rate: 125 bpm
39
?
Place: Bonnie Lake
County: Duchesne
State: Utah
Type: Lake
Elevation: 10,000 feet
USGS Quad: Mirror Lake
Latitude: 40.711ºN
Longitude: 110.876ºW
42
Maps
Place: Bonnie Lake
County: Duchesne
State: Utah
Type: Lake
Elevation: 10,100 feet
USGS Quad: Mirror Lake
Latitude: 40.711ºN
Longitude: 110.876ºW
43
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

44
Information Extraction Ontologies
Source
Target
Information Extraction
Information Exchange
45
What is an Extraction Ontology?
  • Augmented Conceptual-Model Instance
  • Object relationship sets
  • Constraints
  • Data frame value recognizers
  • Robust Wrapper (Ontology-Based Wrapper)
  • Extracts information
  • Works even when site changes or when new sites
    come on-line

46
CarAds Extraction Ontology
<ObjectSet x="329" y="51" lexical="true" name="Mileage" id="osmx50">
  <DataFrame>
    <InternalRepresentation>
      <DataType typeName="String"/>
    </InternalRepresentation>
    <ValuePhraseList>
      <ValuePhrase hint="Mileage Pattern 1">
        <ValueExpression color="ffffff">
          <ExpressionText>[1-9]\d{0,2}[kK]</ExpressionText>
        </ValueExpression>
        <LeftContextExpression color="ffffff">
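The value recognizer in a data frame can be sketched in a few lines of Python. This is only an illustration, not the ontology engine behind the slide: it assumes the expression text above reads `[1-9]\d{0,2}[kK]` (brackets and braces restored), and the `DataFrame` class and its field names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class DataFrame:
    """Hypothetical stand-in for the data frame of one object set."""
    name: str
    value_patterns: list[str]

    def find_values(self, text: str) -> list[tuple[str, int, int]]:
        """Return (string, start, end) per match; 0-based, end exclusive."""
        hits = []
        for pattern in self.value_patterns:
            for m in re.finditer(pattern, text):
                hits.append((m.group(), m.start(), m.end()))
        return sorted(hits, key=lambda h: h[1])


# Assumed reading of "Mileage Pattern 1": 1 to 3 digits followed by k or K.
mileage = DataFrame(name="Mileage", value_patterns=[r"\b[1-9]\d{0,2}[kK]\b"])

hits = mileage.find_values("2001 Civic, 42K, runs great. Also a 1999 Accord with 105k.")
print(hits)
```

The word boundaries are an addition for the sketch; they keep the pattern from firing inside longer numbers such as a model year.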
47
Extraction Ontologies: An Example of Semantic
Understanding
  • Intelligent Symbol Manipulation
  • Gives the Illusion of Understanding
  • Obtains Meaningful and Useful Results

48
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

49
A Variety of Applications
  • Information Extraction
  • Semantic Web Page Annotation
  • Free-Form Semantic Web Queries
  • Task Ontologies for Free-Form Service Requests
  • High-Precision Classification
  • Schema Mapping for Ontology Alignment
  • Record Linkage
  • Accessing the Hidden Web
  • Ontology Discovery and Generation
  • Challenging Applications (e.g. BioInformatics)

50
Application 1: Information Extraction
51
Constant/Keyword Recognition
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles
on her. Previous owner heart broken! Asking
only $11,995. 1415. JERRY SEINER MIDVALE,
566-3800 or 566-3888
Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
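The recognition step can be reproduced with ordinary regular expressions. The sketch below is an assumption-laden illustration rather than the system's actual recognizers: the patterns in `RECOGNIZERS` are hypothetical, and the ad is typed with "on her" and a dollar sign before the price, which is the reading under which the 1-based, inclusive character positions on the slide line up.

```python
import re

# Ad text as assumed for this sketch (see the note above on "on her" and "$").
AD = ("'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. "
      "Previous owner heart broken! Asking only $11,995. 1415. "
      "JERRY SEINER MIDVALE, 566-3800 or 566-3888")

# Hypothetical recognizers: (descriptor, pattern). If a pattern has a
# capture group, the group is the extracted constant.
RECOGNIZERS = [
    ("Year", r"'(\d{2})\b"),
    ("Make", r"\bCHEV"),
    ("Make", r"\bCHEVY\b"),
    ("Model", r"\bCavalier\b"),
    ("Feature", r"\bRed\b"),
    ("Feature", r"\b\d spd\b"),
    ("Price", r"\$(\d{1,3},\d{3})\b"),
    ("Mileage", r"\b\d{1,3},\d{3}\b"),
    ("KEYWORD(Mileage)", r"\bmiles\b"),
    ("PhoneNr", r"\b\d{3}-\d{4}\b"),
]


def recognize(text):
    """Return (descriptor, string, start, end) rows with 1-based inclusive positions."""
    rows = []
    for descriptor, pattern in RECOGNIZERS:
        for m in re.finditer(pattern, text):
            g = m.lastindex or 0  # use the capture group when there is one
            rows.append((descriptor, m.group(g), m.start(g) + 1, m.end(g)))
    return sorted(rows, key=lambda r: (r[2], r[3]))


table = recognize(AD)
for row in table:
    print("|".join(str(x) for x in row))
```

Note that 11,995 is reported both as a Price and as a Mileage, and CHEV is reported alongside CHEVY; resolving such conflicts is the job of the heuristics, not of the recognizers.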
52
Heuristics
  • Keyword proximity
  • Subsumed and overlapping constants
  • Functional relationships
  • Nonfunctional relationships
  • First occurrence without constraint violation

53
Keyword Proximity
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
D = 2 (Mileage 7,000 to the keyword miles)
D = 52 (the keyword miles to Mileage 11,995)
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles
on her. Previous owner heart broken! Asking
only $11,995. 1415. JERRY SEINER MIDVALE,
566-3800 or 566-3888
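The two distances on this slide follow directly from the character positions: 44 - 42 = 2 and 100 - 48 = 52. A minimal sketch of the keyword-proximity choice, with hypothetical function names and the spans taken from the table above:

```python
def gap(a, b):
    """Characters between two (start, end) spans; 0 if they overlap."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return s2 - e1
    if e2 < s1:
        return s1 - e2
    return 0


def nearest_to_keyword(candidates, keywords):
    """Pick the candidate (string, start, end) closest to any keyword span."""
    return min(candidates, key=lambda c: min(gap(c[1:], k) for k in keywords))


mileage_candidates = [("7,000", 38, 42), ("11,995", 100, 105)]
mileage_keywords = [(44, 48)]  # the keyword "miles"

best = nearest_to_keyword(mileage_candidates, mileage_keywords)
print(best)
```

Under this rule 7,000 wins the Mileage slot, which leaves 11,995 to be claimed as the Price.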
54
Subsumed/Overlapping Constants
55
Functional Relationships
56
Nonfunctional Relationships
57
First Occurrence without Constraint Violation
58
Database-Instance Generator
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800')
insert into CarFeature values(1001, 'Red')
insert into CarFeature values(1001, '5 spd')
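Once the heuristics have settled on one value per single-valued field, generating the database instance is mechanical. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table layouts and column types are assumptions modelled on the insert statements above, and `load` is a hypothetical helper.

```python
import sqlite3

# Values chosen by the heuristics for the example ad.
record = {"Year": "97", "Make": "CHEVY", "Model": "Cavalier",
          "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800"}
features = ["Red", "5 spd"]


def load(conn, car_id, record, features):
    """Insert one car and its features using parameterized statements."""
    conn.execute(
        "insert into Car values (?, ?, ?, ?, ?, ?, ?)",
        (car_id, record["Year"], record["Make"], record["Model"],
         record["Mileage"], record["Price"], record["PhoneNr"]),
    )
    conn.executemany(
        "insert into CarFeature values (?, ?)",
        [(car_id, f) for f in features],
    )


conn = sqlite3.connect(":memory:")
# Assumed schema; every extracted value is kept as text.
conn.execute("create table Car (Car integer primary key, Year text, Make text, "
             "Model text, Mileage text, Price text, PhoneNr text)")
conn.execute("create table CarFeature (Car integer, Feature text)")
load(conn, 1001, record, features)

cars = conn.execute("select Make, Model, PhoneNr from Car").fetchall()
feats = sorted(r[0] for r in conn.execute("select Feature from CarFeature"))
print(cars, feats)
```

Parameterized statements are used instead of string-built SQL so that extracted text containing quotes cannot break the insert.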
59
Application 2: Semantic Web Page Annotation
60
Annotated Web Page
(Demo)
61
OWL
<owl:Class rdf:ID="CarAds">
  <rdfs:label xml:lang="en">CarAds</rdfs:label>
  ......
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="hasMileage" />
      <owl:minCardinality rdf:datatype="&xsd;nonNegativeInteger">0</owl:minCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="hasMileage" />
      <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">1</owl:maxCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="hasMileage" />
      <owl:allValuesFrom rdf:resource="Mileage" />
  ......
<CarAds rdf:ID="CarAdsIns2">
  <CarAdsValue rdf:datatype="&xsd;string">2</CarAdsValue>
</CarAds>
<Mileage rdf:ID="MileageIns2">
  <StartingCharPosition rdf:datatype="&xsd;nonNegativeInteger">237</StartingCharPosition>
  <EndingCharPosition rdf:datatype="&xsd;nonNegativeInteger">241</EndingCharPosition>
</Mileage>
.
<owl:Thing rdf:about="CarAdsIns2">
  <hasMake rdf:resource="MakeIns2" />
  <hasModel rdf:resource="ModelIns2" />
  <hasYear rdf:resource="YearIns2" />
  <hasMileage rdf:resource="MileageIns2" />
  <hasPhoneNr rdf:resource="PhoneNrIns2" />
  <hasPrice rdf:resource="PriceIns2" />
</owl:Thing>

62
Application 3: Free-Form Semantic Web Queries
63
Find Ontology
  • Tell me about cruises on San Francisco Bay. I'd
    like to know scheduled times, cost, and the
    duration of cruises on Friday of next week.

64
Formulate Query
Friday, Oct. 29th
cost
duration
?
?
Result
(
)
65
StartTime | Price | Duration | Source
10:45 am, 12:00 pm, 1:15, 2:30, 4:00 | 20.00, 16.00, 12.00 | | 1
10:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm | 17.00, 16.00, 12.00 | 1 Hour | 2
66
Application 4: Task Ontologies for Free-Form
Service Requests
67
Basic Idea
  • Service Request
  • Match with Task Ontology
  • Domain Ontology
  • Process Ontology
  • Complete, Negotiate, Finalize

I want to see a dermatologist next week; any day
would be ok for me, at 4:00 p.m. The
dermatologist must be within 20 miles from my
home and must accept my insurance.
68
Domain Ontology
69
Appointment
  context keywords/phrases: appointment | want to see a
Dermatologist
  context keywords/phrases: ([Dd]ermatologist)

I want to see a dermatologist next week; any day
would be ok for me, at 4:00 p.m. The
dermatologist must be within 20 miles from my
home and must accept my insurance.
73
Date
  NextWeek(d1: Date, d2: Date) returns (Boolean {T,F})
    context keywords/phrases: next week | week from now
Distance
  internal representation: real
  input (s: String)
    context keywords/phrases: miles | mile | mi | kilometers |
      kilometer | meters | meter | centimeter
  Within(d1: Distance, 20) returns (Boolean {T or F})
    context keywords/phrases: within | not more than | ≤
    return (d1 ≤ d2)
  end
Appointment
  context keywords/phrases: appointment | want to see a
Dermatologist
  context keywords/phrases: ([Dd]ermatologist)

I want to see a dermatologist next week; any day
would be ok for me, at 4:00 p.m. The
dermatologist must be within 20 miles from my
home and must accept my insurance.
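The Distance data frame pairs an input conversion with an operation. A minimal Python sketch of that pairing follows; it assumes the comparison in `Within` is "less than or equal", normalizes everything to miles, and uses hypothetical names (`parse_distance`, `within`) and rounded conversion factors.

```python
import re

# Conversion factors to miles (rounded); unit spellings follow the slide's keyword list.
_TO_MILES = {
    "mi": 1.0, "mile": 1.0, "miles": 1.0,
    "kilometer": 0.621371, "kilometers": 0.621371,
    "meter": 0.000621371, "meters": 0.000621371,
}


def parse_distance(s):
    """Input conversion: turn a string such as '20 miles' into a real number of miles."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", s)
    if m is None or m.group(2).lower() not in _TO_MILES:
        raise ValueError(f"not a recognizable distance: {s!r}")
    return float(m.group(1)) * _TO_MILES[m.group(2).lower()]


def within(d1, d2):
    """Operation: true when distance d1 does not exceed the limit d2 (both in miles)."""
    return d1 <= d2


limit = parse_distance("20 miles")
print(within(parse_distance("30 kilometers"), limit))  # about 18.6 miles
print(within(parse_distance("40 kilometers"), limit))  # about 24.9 miles
```

Keeping the recognizer, the internal representation, and the operations together is what lets a constraint such as "within 20 miles" be checked against candidate providers later.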
76
Process Ontology
77
Specification Satisfaction
Date(28 Dec 04) and NextWeek(28 Dec 04, 5 Jan 05)
Dermatologist(Dermatologist0) is at Address(Orem 600 State St.) and
  Within(DistanceBetween(Provo 300 State St., Orem 600 State St.), 22)
∃i2 (Dermatologist(Dermatologist0) accepts
  Insurance(i2) and Equal(IHC, i2))
78
Application 5: High-Precision Classification
79
An Extraction Ontology Solution
80
Density Heuristic
81
Expected Values Heuristic
82
Vector Space of Expected Values
          OV     D1   D2
Year      0.98   16    6
Make      0.93   10    0
Model     0.91   12    0
Mileage   0.45    6    2
Price     0.80   11    8
Feature   2.10   29    0
PhoneNr   1.15   15   11
Similarity to OV: D1 0.996, D2 0.567
Grouping Heuristic
84
Grouping
Car Ads
----------------
Year, Year, Make, Model        3
Price, Year, Model, Year       3
Make, Model, Mileage, Year     4
Model, Mileage, Price, Year    4
Grouping: 0.875

Sale Items
----------------
Year, Year, Year, Mileage      2
Mileage, Year, Price, Price    3
Year, Price, Price, Year       2
Price, Price, Price, Price     1
Grouping: 0.500

Expected Number in Group = floor(Σ Ave) = 4 (for our example),
summed over the 1-Max object sets
Grouping = (Sum of Distinct 1-Max Object Sets in each Group) /
           (Number of Groups × Expected Number in a Group)
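Both grouping scores can be checked by hand or with a short script. In the sketch below the set of 1-Max object sets (Year, Make, Model, Mileage, Price) and the use of the expected values from the vector-space slide as the averages are assumptions; they do give a group size of floor(4.07) = 4, matching the slide.

```python
import math

# Assumed 1-Max object sets and their average occurrences per record.
ONE_MAX_AVERAGES = {"Year": 0.98, "Make": 0.93, "Model": 0.91,
                    "Mileage": 0.45, "Price": 0.80}

group_size = math.floor(sum(ONE_MAX_AVERAGES.values()))


def grouping_score(object_sets, size):
    """Distinct 1-Max object sets per group, summed, over groups x expected size."""
    groups = [object_sets[i:i + size] for i in range(0, len(object_sets), size)]
    distinct = sum(len(set(g) & ONE_MAX_AVERAGES.keys()) for g in groups)
    return distinct / (len(groups) * size)


car_ads = ["Year", "Year", "Make", "Model",
           "Price", "Year", "Model", "Year",
           "Make", "Model", "Mileage", "Year",
           "Model", "Mileage", "Price", "Year"]
sale_items = ["Year", "Year", "Year", "Mileage",
              "Mileage", "Year", "Price", "Price",
              "Year", "Price", "Price", "Year",
              "Price", "Price", "Price", "Price"]

car_score = grouping_score(car_ads, group_size)
sale_score = grouping_score(sale_items, group_size)
print(group_size, car_score, sale_score)
```

A high score means each window of recognized values looks like one record with one value per field, which is what a page of car ads produces and a generic sale-items page does not.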
85
Application 6: Schema Mapping for Ontology
Alignment
86
Problem: Different Schemas
  • Target Database Schema
  • {Car, Year, Make, Model, Mileage, Price,
    PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
  • Different Source Table Schemas
  • {Run , Yr, Make, Model, Tran, Color, Dr}
  • {Make, Model, Year, Colour, Price, Auto, Air
    Cond., AM/FM, CD}
  • {Vehicle, Distance, Price, Mileage}
  • {Year, Make, Model, Trim, Invoice/Retail, Engine,
    Fuel Economy}

87
Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour,
Price, Auto, Air Cond, AM/FM, CD))
88
Solution: Replace Boolean Values
89
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
90
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto>,
<Air Cond>, <AM/FM>
91
Solution: Do Extraction
92
Solution: Infer Mappings
{Car, Year, Make, Model, Mileage, Price,
PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
94
Solution: Do Extraction
π Price Table
{Car, Year, Make, Model, Mileage, Price,
PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
95
Solution: Do Extraction
ρ Colour←Feature π Colour Table ∪
ρ Auto←Feature π Auto β Auto (Yes, ) Table ∪
ρ Air Cond.←Feature π Air Cond. β Air Cond. (Yes, ) Table ∪
ρ AM/FM←Feature π AM/FM β AM/FM (Yes, ) Table ∪
ρ CD←Feature π CD β CD (Yes, ) Table
{Car, Year, Make, Model, Mileage, Price,
PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
96
Application 7: Record Linkage
97
Kelly Flanagan Query
98
A Multi-faceted Approach
  • Gather evidence from each of several different
    facets
  • Attributes
  • Links
  • Page Similarity
  • Combine the evidence

99
Attributes
  • Phone number, email address, state, city, zip
    code
  • Data-frame recognizers

100
Links
101
Page Similarity
  • Adjacent cap-word pairs
  • Cap-Word (Connector | Preposition
    (Article)? | (Capital-Letter Dot))? Cap-Word

102
Confidence Matrix for Each Facet
      C1   C2   ..  Ci   ..  Cj   ..  Cn
C1    1    C12      C1i      C1j      C1n
C2         1        C2i      C2j      C2n
..
Ci                  1        Cij      Cin
..
Cj                           1        Cjn
..
Cn                                    1
Cij = 0 if there is no evidence for a facet f; otherwise
Cij = P(Ci and Cj refer to the same person | evidence for
a facet f)
Training set to compute the conditional
probabilities
103
Final Matrix
Confidence Matrix for Attributes
Confidence Matrix for Links
Confidence Matrix for Page Similarity
0.96 + 0 + 0.78 - 0.96 × 0 - 0.96 × 0.78 - 0.78 × 0
+ 0.96 × 0 × 0.78 = 0.9912
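The figures on this slide match the inclusion-exclusion expansion of 1 - (1 - a)(1 - b)(1 - c) for the three facet confidences 0.96, 0 and 0.78. A small sketch of that combination step, with a hypothetical function name and assuming the facets are treated as independent evidence:

```python
def combine_facets(confidences):
    """Combine per-facet confidences as 1 minus the product of (1 - c)."""
    remaining_doubt = 1.0
    for c in confidences:
        remaining_doubt *= (1.0 - c)
    return 1.0 - remaining_doubt


# Attributes, links, page similarity for one pair of citations.
final = combine_facets([0.96, 0.0, 0.78])
print(round(final, 4))
```

A facet with no evidence contributes 0 and leaves the result unchanged, which is why the links facet drops out of the example.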
104
Grouping Algorithm
  • Input: final confidence matrix
  • Output: citations grouped by same person
  • The idea:
  • If {Ci, Cj} and {Cj, Ck} are highly confident, then {Ci, Cj, Ck}
  • The threshold we use for highly
    confident is 0.8.
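The grouping idea amounts to taking the transitive closure of the "highly confident" pairs. One common way to do that is a union-find structure, sketched below; the slide does not say which algorithm was used, the example matrix is made up, and treating a value equal to the threshold as highly confident is an assumption.

```python
def group_citations(conf, threshold=0.8):
    """Group citation indices whose pairwise confidence chains reach the threshold."""
    n = len(conf)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if conf[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical upper-triangular final confidence matrix for four citations.
final_matrix = [
    [1, 0.9912, 0.10, 0.00],
    [0, 1,      0.85, 0.20],
    [0, 0,      1,    0.30],
    [0, 0,      0,    1],
]
groups = group_citations(final_matrix)
print(groups)
```

Citations 0 and 2 end up together even though their direct confidence is only 0.10, because each is confidently linked to citation 1.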

105
Experimental Results
106
Application 8: Accessing the Hidden Web
107
Obtaining Data Behind Forms
  • Web information is stored in databases
  • Databases are accessed through forms
  • Forms are designed in various ways

108
Hidden Web Extraction System
Find green cars costing no more than 9000.
Site Form
User Query
Input Analyzer
Application Extraction Ontology
Extracted Information
Retrieved Page(s)
Output Analyzer
109
Application 9: Ontology Discovery and Generation
110
TANGO: Table Analysis for Generating Ontologies
  • Recognize and normalize table information
  • Construct mini-ontologies from tables
  • Discover inter-ontology mappings
  • Merge mini-ontologies into a growing ontology

111
Recognize Table Information

Country      Population        Religion
             (July 2001 est.)  Albanian  Muslim  Roman     Shia    Sunni   other
                               Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15      84      1
Albania      3,510,484         20        70      30
112
Construct Mini-Ontology
113
Discover Mappings
114
Merge
115
Application 10: Challenging Applications (e.g.
BioInformatics)
116
Large Extraction Ontologies
117
Complex Semi-Structured Pages
118
Additional Analysis Opportunities
  • Sibling Page Comparison
  • Semi-automatic Lexicon Update
  • Seed Ontology Recognition

119
Sibling Page Comparison
120
Sibling Page Comparison
Attributes
121
Sibling Page Comparison
122
Sibling Page Comparison
123
Semi-automatic Lexicon Update
Additional Source Species or Organisms
Additional Protein Names
124
Seed Ontology Recognition
Homo sapiens human
nucleus zinc ion binding nucleic acid binding
9606
Eukaryota Metazoa Chordata Craniata Vertebrata
Euteleostomi Mammalia Eutheria Primates
Catarrhini Hominidae Homo
zinc ion binding nucleic acid binding
NP_079345
nucleus
linear
NP_079345
FLJ14299
GTTTTTGTGTT.ATAAGTGCATTAACGGCCCACATG
msdspagsnprtpessgsgsggtagpyyspyalygqrlasasalgyq

8 eight
8?p\s?12 8?p11.2 8?p11.23
hypothetical protein FLJ14299
37,?612,?680
37,?610,?585
125
Seed Ontology Recognition
126
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

127
Limitations and Pragmatics
  • Data-Rich, Narrow Domain
  • Ambiguities Context Assumptions
  • Incompleteness Implicit Information
  • Common Sense Requirements
  • Knowledge Prerequisites

128
Busiest Airport in 2003?
Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.)
        - 931,000 Landings (Federal Aviation Admin.)
Atlanta - 58,875,694 Passengers (Sep., latest numbers available)
Memphis - 2,494,190 Metric Tons (Airports Council Intl.)
131
Busiest Airport in 2003?
Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.)
        - 931,000 Landings (Federal Aviation Admin.)
Atlanta - 58,875,694 Passengers (Sep., latest numbers available)
Memphis - 2,494,190 Metric Tons (Airports Council Intl.)
Ambiguous: Whom do we trust? (How do they count?)
132
Busiest Airport in 2003?
Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.)
        - 931,000 Landings (Federal Aviation Admin.)
Atlanta - 58,875,694 Passengers (Sep., latest numbers available)
Memphis - 2,494,190 Metric Tons (Airports Council Intl.)
Important qualification
133
Dow Jones Industrial Average
            High      Low       Last      Chg
30 Indus    10527.03  10321.35  10409.85  85.18
20 Transp   3038.15   2998.60   3008.16   9.83
15 Utils    268.78    264.72    266.45    1.72
66 Stocks   3022.31   2972.94   2993.12   19.65
Graphics, Icons,
135
Presentation Outline
  • Grand Challenge
  • Meaning, Knowledge, Information, Data
  • Fun and Games with Data
  • Information Extraction Ontologies
  • Applications
  • Limitations and Pragmatics
  • Summary and Challenges

136
Some Key Ideas
  • Data, Information, and Knowledge
  • Data Frames
  • Knowledge about everyday data items
  • Recognizers for data in context
  • Ontologies
  • Resilient Extraction Ontologies
  • Shared Conceptualizations
  • Limitations and Pragmatics

137
Some Research Issues
  • Building a library of open source data
    recognizers
  • Precisely finding and gathering relevant
    information
  • Subparts of larger data
  • Scattered data (linked, factored, implied)
  • Data behind forms in the hidden web
  • Improving concept matching
  • Indirect matching
  • Calculations, unit conversions, alternative
    representations,

138
Some Research Challenges
(Machine Learning)
  • Web Page Understanding
  • Suppose extraction is 85% accurate
  • Generate a page grammar
  • Increased recall (more extracted)
  • Increased precision (fewer false positives)
  • Fast extraction from same-site sibling pages
  • Universal Rules for Schema Matching
  • Must rules be domain-specific?
  • Can some rules be universal?
  • Boundaries of Usefulness
    When should machine
    learning not be used?
  • Application to Significant Problems
  • Like those above
  • Many more

www.deg.byu.edu