Title: Semi-Automatically Generating Data-Extraction Ontology
1 Semi-Automatically Generating Data-Extraction Ontology
- Yihong Ding
- March 6, 2001
2 Extract information from Web documents
-- Cars Application Ontology
--
-- Revision 1.2
--
-- Log: cars.osm,v
-- Revision 1.2  1998/02/20 00:15:55  liddl
-- Cleaned up header
--
-- Revision 1.1  1998/02/20 00:14:14  liddl
-- Initial revision
--
Car [-> object];
Car [0:1] has Year [1:*];
Year matches [4]
  constant
    { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\dkK]"; substitute "^" -> "19"; },
    { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
    { extract "\d{2}"; context "\b'[4-9]\d\b"; substitute "^" -> "19"; },
    { extract "\d{2}"; context "([^\$\d]|^)0\d,[^\dkK]"; substitute "^" -> "20"; },
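As a rough illustration of how a rule built from `extract`, `context`, and `substitute` might behave, here is a minimal Python sketch. It assumes that `context` locates a surrounding pattern, `extract` pulls the value out of that match, and `substitute` prepends a century to the two-digit year. The function name `apply_rule`, the sample ad, and the simplified apostrophe pattern are all hypothetical; the actual extraction engine may work differently.

```python
import re

def apply_rule(text, extract, context, prefix):
    """Collect values matched by `extract` inside each `context` match.

    `prefix` is prepended to each value, standing in for a
    substitute rule that inserts text at the start of the value.
    """
    values = []
    for ctx in re.finditer(context, text):
        inner = re.search(extract, ctx.group(0))
        if inner:
            values.append(prefix + inner.group(0))
    return values

# Hypothetical car ad; the context pattern is an illustrative
# apostrophe-year pattern, not a verbatim copy of the slide's rule.
ad = "For sale: '95 Audi 80, 72k miles, $4500"
years = apply_rule(ad, extract=r"\d{2}", context=r"'[4-9]\d\b", prefix="19")
print(years)
```

Restricting `extract` to the text of the context match keeps other two-digit numbers in the ad (the model "80", the mileage "72k") from being read as years.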
3 Ontology
Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10]
  constant
    { extract "\baudi\b"; }
end;
Car [0:1] has Model [1:*];
Model matches [25]
  constant
    { extract "80"; context "\baudi\S*\s*80\b"; }
end;
Car [0:1] has Mileage [1:*];
Mileage matches [8]
  constant
    { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; }
end;
Car [0:1] has Price [1:*];
Price matches [8]
  constant
    { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; }
end;
- A computational entity: a resource containing knowledge about what concepts exist in the world and how they relate to one another
- Components
  - Concepts
    - Domain dependent
      - Context free
      - Context sensitive
    - Domain independent
      - Context free
      - Context sensitive
  - Relationships (relational schema between the concepts)
  - Constraints
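The three components listed above (concepts, relationships, constraints) could be held in a small data structure such as the following Python sketch. All class and field names here (`Concept`, `Relationship`, `Ontology`, `min_card`, `max_card`) are hypothetical and chosen only for illustration; they are not the representation used by the actual system.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    domain_dependent: bool    # True if tied to one application domain
    context_sensitive: bool   # True if recognition needs surrounding text
    patterns: list = field(default_factory=list)  # recognizer regexes

@dataclass
class Relationship:
    source: str
    target: str
    min_card: int = 0         # assumed participation bounds of `source`
    max_card: int = 1

@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # free-form for now

    def add_concept(self, concept):
        self.concepts[concept.name] = concept

    def relate(self, source, target, min_card=0, max_card=1):
        # Refuse relationships that mention unknown concepts.
        for name in (source, target):
            if name not in self.concepts:
                raise KeyError(f"unknown concept: {name}")
        rel = Relationship(source, target, min_card, max_card)
        self.relationships.append(rel)
        return rel

onto = Ontology()
onto.add_concept(Concept("Car", domain_dependent=True, context_sensitive=False))
onto.add_concept(Concept("Year", True, True, [r"\d{2}"]))
rel = onto.relate("Car", "Year")
print(rel)
```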
4 My work
- Assumptions
  - Given an information knowledge base that already contains domain-dependent and domain-independent concepts
    - Pre-defined ontologies
      - Mikrokosmos, Gene, our ontologies, etc.
    - Component recognizers
      - date, time, price, phone number, etc.
  - Given sample training Web documents
- Semi-automatically generate the ontology
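To make the idea of component recognizers concrete, the sketch below pairs each domain-independent concept with one regular expression. The patterns, the `RECOGNIZERS` table, and the `recognize` function are assumptions made for this example; real recognizers for dates, times, prices, and phone numbers would need to cover many more formats.

```python
import re

# Hypothetical, deliberately narrow patterns (US-style formats only).
RECOGNIZERS = {
    "Date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "Time": re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE),
    "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    "PhoneNumber": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}

def recognize(text):
    """Return (offset, concept, matched text) triples in document order."""
    hits = []
    for concept, pattern in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), concept, m.group(0)))
    return sorted(hits)

sample = "Call (801) 555-1234 before 5:30 pm on 3/6/2001; asking $4,500."
for offset, concept, value in recognize(sample):
    print(offset, concept, value)
```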
5 Architecture
6 Example: CIA Factbook
- Country: China
- Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea and Vietnam
- Geographic coordinates: 35 00 N, 105 00 E
- Map references: Asia
- Area
  - total: 9,596,960 sq km
  - land: 9,326,410 sq km
  - water: 270,550 sq km
7 Partially completed ontology
- CountryName matches [30]
  - constant { extract "\bChina\b"; },
  - { extract "\bUnited States\b"; }
- end;
- Location matches [50]
  - constant { extract "\bAsia\b"; },
  - { extract "\bEurope\b"; },
  - { extract "\bYellow Sea\b"; },
- end;
- Latitude matches [10]
  - constant { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(N|S)"; },
- end;
- Longitude matches [10]
  - constant { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(E|W)"; },
- end;
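The partially completed ontology can be read as a lookup table from concepts to constant patterns. The following Python sketch applies the `CountryName` and `Location` constants from the slide to part of the Factbook entry; the `KNOWLEDGE_BASE` dictionary and the `match_concepts` function are hypothetical names, and the colon-separated sample text is an assumed rendering of the page.

```python
import re

# Constants taken from the slide; the dictionary layout is an assumption.
KNOWLEDGE_BASE = {
    "CountryName": [r"\bChina\b", r"\bUnited States\b"],
    "Location": [r"\bAsia\b", r"\bEurope\b", r"\bYellow Sea\b"],
}

def match_concepts(text):
    """Map each concept to the strings its patterns match in `text`."""
    found = {}
    for concept, patterns in KNOWLEDGE_BASE.items():
        hits = []
        for pattern in patterns:
            hits.extend(m.group(0) for m in re.finditer(pattern, text))
        if hits:
            found[concept] = hits
    return found

sample = (
    "Country: China\n"
    "Location: Eastern Asia, bordering the East China Sea, Korea Bay, "
    "Yellow Sea, and South China Sea\n"
    "Map references: Asia"
)
matches = match_concepts(sample)
print(matches)
```

In this sketch the context-free pattern `\bChina\b` also fires inside "East China Sea" and "South China Sea", which suggests why some concepts may need context-sensitive rules.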
8 Raw completed ontology
- Country [-> object];
- Country [0:1] has CountryName [1:1];
- Country [0:1] has Location1 [1:*];
- ...
- Country [0:1] has Location8 [1:*];
- Country [0:1] has Latitude [1:*];
- Country [0:1] has Longitude [1:*];
- Country [0:1] has Number1 [1:*];
- Country [0:1] has Number2 [1:*];
- Country [0:1] has Number3 [1:*];
- -- Generalization/Specializations
- Location1 Location
- ...
- Location8 Location
- Number1 Number
- Number2 Number
- Number3 Number
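One way the numbered attributes and their generalizations might be derived from the recognized values is sketched below. It assumes the input is a list of (concept, value) pairs in document order; the function `raw_attributes` and its return shape are hypothetical, and the sketch does not attempt to produce cardinalities.

```python
from collections import Counter

def raw_attributes(tagged):
    """Number repeated concepts and record their generalizations.

    tagged: list of (concept, value) pairs in document order.
    Returns (attribute names, [(specialization, generalization), ...]).
    """
    totals = Counter(concept for concept, _ in tagged)
    seen = Counter()
    attributes, generalizations = [], []
    for concept, _ in tagged:
        if totals[concept] == 1:
            # A concept seen once keeps its own name.
            attributes.append(concept)
        else:
            # Repeated concepts become Concept1, Concept2, ...
            seen[concept] += 1
            name = f"{concept}{seen[concept]}"
            attributes.append(name)
            generalizations.append((name, concept))
    return attributes, generalizations

tagged = [
    ("CountryName", "China"),
    ("Location", "Eastern Asia"),
    ("Location", "East China Sea"),
    ("Latitude", "35 00 N"),
    ("Number", "9,596,960"),
    ("Number", "9,326,410"),
]
attrs, gens = raw_attributes(tagged)
print(attrs)
print(gens)
```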
9 User control interface
- Output to user
- raw completed ontology
- tagged training web pages
- the query results
- User may
- modify attribute name
- combine attributes
- delete useless attributes
- change relationships
- add new attributes, new relations, and constraints
- When satisfied, output the final ontology
- Country: China [CountryName]
- Location: Eastern Asia [Location1], bordering the East China Sea [Location2], Korea Bay [Location3], Yellow Sea [Location4], and South China Sea [Location5], between North Korea [Location6] and Vietnam [Location7]
- Geographic coordinates: 35 00 N [Latitude], 105 00 E [Longitude]
- Map references: Asia [Location8]
- Area
  - total: 9,596,960 [Number1] sq km
  - land: 9,326,410 [Number2] sq km
  - water: 270,550 [Number3] sq km
- Country: China [CountryName]
- Location: Eastern Asia [Location1], bordering the East China Sea [Location2], Korea Bay [Location3], Yellow Sea [Location4], and South China Sea [Location5], between North Korea [Location6] and Vietnam [Location7]
- Geographic coordinates: 35 00 N [Latitude], 105 00 E [Longitude]
- Map references: Asia [MapReference]
- Area
  - total: 9,596,960 [TotalArea] sq km
  - land: 9,326,410 [LandArea] sq km
  - water: 270,550 [WaterArea] sq km
- Country: China [CountryName]
- Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea and Vietnam [Location]
- Geographic coordinates: 35 00 N [Latitude], 105 00 E [Longitude]
- Map references: Asia [MapReference]
- Area
  - total: 9,596,960 [TotalArea] sq km
  - land: 9,326,410 [LandArea] sq km
  - water: 270,550 [WaterArea] sq km
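The renaming and combining steps shown in the tagged pages above could be modeled as simple operations on a list of (value, tag) pairs, as in this Python sketch. The functions `rename` and `combine` are hypothetical; in particular, `combine` only joins the tagged values, whereas the slide's combined `Location` also covers the connecting words between them.

```python
def rename(tags, old, new):
    """Return a copy of `tags` with attribute `old` renamed to `new`."""
    return [(value, new if tag == old else tag) for value, tag in tags]

def combine(tags, names, new, sep=", "):
    """Merge all attributes in `names` into one attribute called `new`.

    The merged entry takes the position of the first merged attribute.
    """
    merged = sep.join(value for value, tag in tags if tag in names)
    result, placed = [], False
    for value, tag in tags:
        if tag in names:
            if not placed:
                result.append((merged, new))
                placed = True
        else:
            result.append((value, tag))
    return result

tags = [
    ("China", "CountryName"),
    ("Eastern Asia", "Location1"),
    ("East China Sea", "Location2"),
    ("Asia", "Location8"),
    ("9,596,960", "Number1"),
]
tags = rename(tags, "Location8", "MapReference")
tags = rename(tags, "Number1", "TotalArea")
tags = combine(tags, {"Location1", "Location2"}, "Location")
print(tags)
```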
10 Problems
- Obtain the knowledge base
- Classify related concepts for the sample documents
- Refine
- Tag the document based on the raw completed ontology
- User interface design and control
- Strategy for updating the raw completed ontology based on user modifications
11 Contribution
- Exploit existing knowledge
- Semi-automatically generate an extraction ontology