Title: Semiautomatic Generation of Resilient Data-Extraction Ontologies
1Semiautomatic Generation of Resilient
Data-Extraction Ontologies
- Yihong Ding
- Data Extraction Group
- Brigham Young University
- Sponsored by NSF
2Wrapper-Driven Data Extraction
- Web data extraction
- Obtain user-specified information from Web documents
- Wrapper
- Convert implicit HTML data into explicit formatted data
- Data-source-specified, high performance
- Examples
- SoftMealy, STALKER, WIEN, Omini, ROADRUNNER,
3Common Problem of Wrappers
SoftMealy
- <LI> <A HREF=""> Mani Chandy </A>,
- <I>Professor of Computer Science</I>
- and <I>Executive Officer for Computer Science</I>
Resiliency: fixed domain, changeable layout
Scalability: unchanged existing wrapper, extendable domain and functions
4Data-Extraction Ontology
- Structure
- Object sets
- Relationship sets
- Participation constraints
- Data frames
- Pros: resilient and scalable
- Cons: hard to create
- Knowledge requirements
- Tedious and error-prone work
- Car [-> object]
- Car [0:1] has Make [1:*]
- Make matches [10]
- constant extract "\baudi\b"
- end
- Car [0:1] has Model [1:*]
- Model matches [25]
- constant extract "80"
- context "\baudi\S\s80\b"
- end
- Car [0:1] has Mileage [1:*]
- Mileage matches [8]
- constant extract "\b[1-9]\d{0,2}k"
- substitute "[kK]" -> "000"
- end
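
The Mileage data frame above pairs an extraction pattern with a substitution rule. A minimal Python sketch of that idea follows. It assumes the extraction pattern is the regular expression `\b[1-9]\d{0,2}k` applied case-insensitively, and that the substitution rewrites a trailing k or K to 000. The dictionary layout and the function name `apply_frame` are illustrative and not part of the ontology language.

```python
import re

# Hypothetical, simplified stand-in for a data frame: one extraction
# pattern plus a substitution applied to each matched string.
MILEAGE_FRAME = {
    # Assumption: matching is case-insensitive, so "72k" and "72K" both hit.
    "extract": re.compile(r"\b[1-9]\d{0,2}k", re.IGNORECASE),
    "substitute": (re.compile(r"[kK]"), "000"),
}


def apply_frame(frame, text):
    """Return the normalized value of every match of the frame in text."""
    pattern, replacement = frame["substitute"]
    return [pattern.sub(replacement, match.group(0))
            for match in frame["extract"].finditer(text)]


print(apply_frame(MILEAGE_FRAME, "95 Audi 80, 72K miles, runs great"))
```

The extraction step finds the raw string and the substitution step turns it into a canonical value, so downstream consumers never see the "k" shorthand.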
5Motif of Ontology Generation
6Thesis Statement
- Given: knowledge base
- Input: sample Web pages of interest
- Output: a data-extraction ontology for the domain of interest
- Between input and output: this is the work of this thesis
7Ontology-Generation Procedure
8Primary Knowledge Source
- Requirements
- Available
- General in coverage
- Rich in meaningful relationships
- Encoded in or easily converted to XML
- Mikrokosmos (μK) Ontology
- Developed by NMSU jointly with U.S. DoD
- Contains over 5000 concepts
- Averages 14 links per concept
- Represented in XML format
9Integrated Knowledge Base
KNOWLEDGE BASE
μK Ontology
Data-Frame Library
Lexicons
Synonym Dictionary (WordNet)
10Ontology-Generation Procedure
11Domain Specification
- Training documents
- Data-rich
- Narrow in topic breadth
- Preprocessing
12Example Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, 10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, 11,999. 373-4445 dlr 2755
Record 4: 00 Buick Century Stk HU7159 Green 9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
13Ontology-Generation Procedure
14Concept Selection
- Selection strategies
- Compare a string with the name of a concept
- Compare a string with the values belonging to a concept
- Apply data-frame recognizers to recognize a string
KB
<PHONE-NR>
00 Buick Century Stk HU7159 Green 9,319,
714-2200 To Apply By Phone, 1-877-228-9486, OREM
Utah
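
The three selection strategies above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny `KB` dictionary, its `PHONE` and `COLOR` entries, the phone-number pattern, and the function name `select_concepts` are all hypothetical and do not come from the actual knowledge base.

```python
import re

# Hypothetical miniature knowledge base: concept name -> known values
# (a lexicon) and an optional recognizer pattern (a data frame).
KB = {
    "COLOR": {"values": {"green", "red"}, "pattern": None},
    "PHONE-NR": {"values": set(),
                 "pattern": re.compile(r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b")},
    "PHONE": {"values": set(), "pattern": None},
}


def select_concepts(record, kb):
    """Return the names of all concepts selected by some strategy."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", record)}
    selected = set()
    for name, entry in kb.items():
        if name.lower() in words:
            selected.add(name)      # strategy 1: concept name matches
        elif words & entry["values"]:
            selected.add(name)      # strategy 2: a known value matches
        elif entry["pattern"] and entry["pattern"].search(record):
            selected.add(name)      # strategy 3: recognizer matches
    return selected


record = ("00 Buick Century Stk HU7159 Green 9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
print(sorted(select_concepts(record, KB)))
```

Here the word "Phone" selects a concept by name, "Green" selects one by value, and the phone numbers select one through the recognizer, which is why overlapping selections later need conflict resolution.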
15Concept Selection
- Reasons for conflict
- Synonymy
- Polysemy
- Conflict resolution
- Same string: only one meaning
- Favor longer matches over shorter ones
- Context decides meaning
KB
02 Buick Century Custom, Pwr Seat, Nada Retail
13,695 221-1250.
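
One way to read the "favor longer over shorter" and "only one meaning" rules is as a greedy choice among overlapping matches. The sketch below assumes each candidate is a `(start, end, concept)` span. The candidate list, the reading of "Buick Century" as an automobile, and the names `span` and `resolve_conflicts` are hypothetical.

```python
def span(record, text, concept):
    """Build a (start, end, concept) tuple for the first occurrence of text."""
    start = record.index(text)
    return (start, start + len(text), concept)


def resolve_conflicts(matches):
    """Keep non-overlapping matches, preferring longer spans."""
    kept = []
    # Longest span first; ties are broken by the earlier start position.
    for start, end, concept in sorted(
            matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, concept))
    return sorted(kept)


record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250."
candidates = [
    span(record, "Century", "CENTURY"),           # polysemy: a time period
    span(record, "Buick Century", "AUTOMOBILE"),  # longer match: a car
    span(record, "221-1250", "PHONE-NR"),
]
print([concept for _, _, concept in resolve_conflicts(candidates)])
```

Because the longer "Buick Century" span wins, the shorter "Century" reading is discarded and each string keeps a single meaning.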
16Ontology-Generation Procedure
17Relationship Retrieval
KB
<AUTOMOBILE>
<MILEAGE>
<YEAR>
<PRICE>
<PHONE-NR>
<AUDIO-MEDIA-ARTIFACT>
<CENTURY>
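
Relationship retrieval can be pictured as searching the knowledge-base graph for paths that connect the selected concepts. The sketch below uses breadth-first search over a hypothetical graph fragment. The link labels follow the relationship-set name AUTOMOBILE IsA.ARTIFACT.CostOfProduction PRICE that appears elsewhere in these slides, while `KB_LINKS`, `find_path`, `path_name`, and the `max_hops` limit are assumptions made for illustration.

```python
from collections import deque

# Hypothetical fragment of a knowledge-base graph:
# concept -> list of (link label, target concept).
KB_LINKS = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [],
}


def find_path(links, source, target, max_hops=4):
    """Breadth-first search for a shortest labeled path, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) >= max_hops:
            continue
        for label, nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(label, nxt)]))
    return None


def path_name(path):
    """Join link labels and intermediate concepts with dots."""
    parts = []
    for label, node in path[:-1]:
        parts.extend([label, node])
    parts.append(path[-1][0])  # the last hop ends at the target concept
    return ".".join(parts)


path = find_path(KB_LINKS, "AUTOMOBILE", "PRICE")
print("AUTOMOBILE", path_name(path), "PRICE")
```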
18Ontology-Generation Procedure
19Constraint Discovery
02 Buick Century, lo mi, mint cond, green, pwr
seat, 11,999. 373-4445 dlr 2755
AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
00 Buick Century Stk HU7159 Green 9,319,
714-2200 To Apply By Phone, 1-877-228-9486, OREM
Utah
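
A plausible reading of constraint discovery is that minimum and maximum participation are inferred from how often a concept's values occur in each training record. The sketch below assumes that reading. The `[min:max]` output format, the counts, and the function name `participation` are hypothetical and may differ from the procedure actually used.

```python
def participation(counts):
    """Derive a [min:max] constraint from non-empty per-record counts."""
    low = 0 if min(counts) == 0 else 1
    high = "1" if max(counts) <= 1 else "*"
    return f"[{low}:{high}]"


# Hypothetical counts of recognized values in four training records.
observed = {
    "PRICE": [1, 1, 1, 1],     # exactly one price in every record
    "MILEAGE": [0, 0, 1, 0],   # mileage is often missing
    "PHONE-NR": [1, 1, 1, 2],  # one record lists two phone numbers
}
for concept, counts in observed.items():
    print(concept, participation(counts))
```

With only a handful of records, a rare case such as a second phone number may never be seen, and the inferred constraint will be too strict.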
20Ontology-Generation Procedure
21Ontology Generation
- concept nodes → object sets
- paths → relationship sets
- discovered constraints → participation constraints
- concept recognizers → data frames
22Automatically Generated Ontology -- Car
Advertisement
(01) Automobile [-> object]
(02) Automobile [0:1] has Mileage [1:1]
(03) Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1]
(12) Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*]
(20) Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo
  ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*]
  relatesTo AudioMediaArtifact [1:*] relatesTo
  CommunicationDevice [1:*] relatesTo ControlEvent [1:*]
  relatesTo TravelEvent [1:*]
23Ontology-Generation Procedure
24Updating Strategies
- Remove all bad relationship sets
- Modify remaining incorrect relationship sets
- Substitute incorrect object sets
- Reduce long n-ary relationship sets
- Fix participation constraints
- Adjust names or re-arrange sequences
- Add new relationship sets
25Final Ontology
- Car [-> object]
- Car [0:1] has Year [1:*]
- Car [0:1] has Mileage [1:*]
- Car [0:1] has Price [1:*]
- PhoneNr [1:*] is for Car [0:1]
- PhoneNr [0:1] has Extension [1:*]
- Car [0:*] has Feature [1:*]
- Car [0:1] has Make [1:*]
- Car [0:1] has Model [1:*]
26Evaluation Criteria
- Basic measures
- POG (Precision of Ontology Generation)
- ROG (Recall of Ontology Generation)
- Human constraints
- PROG (Pseudo-ROG)
- Comparing with an expert-created ontology
- Knowledge base constraints
- EPROG (Effective-PROG)
- Correctness dependency
- DEPROG (Dependent-EPROG)
- For example, relationship sets depend on object sets
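
The slides do not give formulas for these measures. As a rough illustration of the two basic ones, the sketch below treats POG and ROG as ordinary precision and recall over sets of ontology components. That interpretation, the sample object-set names, and the function name `precision_recall` are assumptions. The derived measures (PROG, EPROG, DEPROG) are not modeled.

```python
def precision_recall(generated, expected):
    """Return (precision, recall) of generated items against expected ones."""
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical object sets: what a generator produced versus what an
# expert-created ontology contains.
generated = {"Year", "Mileage", "Price", "PhoneNr", "Truck"}
expected = {"Year", "Mileage", "Price", "PhoneNr",
            "Extension", "Feature", "Make", "Model"}

pog, rog = precision_recall(generated, expected)
print(f"precision={pog:.2f} recall={rog:.2f}")
```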
27Evaluation Results
28Discussion of Results
- Bottleneck: cannot generate what is not in the knowledge base
- Object sets
- Concept-selection procedure works well
- Desired concept not shown in training records
- Rarely occurring concept → not severe even if we don't fix the error
- Example: extension
- Aggregation and union
- USAddressCity, USAddressState, USAddressZipCode → Location
- CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
- Close-meaning concepts: FurniturePart → Furnished
29Discussion of Results
- Relationship sets
- Binary relationship sets: over 95%
- Most errors due to incorrectly generated object sets
- Semantically incorrect relationship sets
- Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
- n-ary relationship sets (usually huge)
- Participation constraints
- Error due to lack of training examples
- How much is enough?
30Knowledge Base Extensibility
- Add SALT -- a new knowledge source
- Successfully integrated into existing KB
- Sample new relationship set (DOE abstract domain)
- CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation
31Conclusion
- Experimented with knowledge-base construction and extension
- Standardized application domain specification
- Generated data-extraction ontologies from a specified domain and an integrated knowledge base
- Showed DEPROG results of more than 70% on average and over 90% for well-defined domains
32Future Work
- Build a general-purpose knowledge source for data-extraction usage
- Study more about data frames
- Can a system correctly identify concepts with data frames?
- Can a system update a data frame to fit a special situation?
- Can a system generate a data frame from a collection of information of interest?