Title: Automatically Generated DAML Markup for Semistructured Documents
1Automatically Generated DAML Markup for
Semistructured Documents
- William Krueger, Jonathan Nilsson, Tim Oates, Tim
Finin
Supported by DARPA contract F30602-00-2-0591
2DAML and the Semantic Web
- The most efficient way for machines to understand
the semantics of the vast amount of information
on the web is to add semantic markup to the
information
- DAML (DARPA Agent Markup Language) is one
existing semantic markup language
3The Problem
- Semantically marking up large amounts of data by
hand is far too time-consuming
- We use machine learning techniques to automate
the task
4An Excerpt From a Talk Announcement
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center>Michael
Kleinschmidt</center> Medical Physics Group,
University of Oldenburg, Germany<BR>
Who is the speaker?
5An Excerpt From a Talk Announcement (Solution)
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center><Speaker>Michael
Kleinschmidt</Speaker></center> Medical Physics
Group, University of Oldenburg, Germany<BR>
6Outline
- Talk Ontology
- Hierarchical Wrapper Induction
- Contributions
- Experimental Results
- Future Considerations
7Our Talk Ontology
- An ontology is the hierarchically organized
vocabulary used to semantically mark up
information sources
- The root of our talk ontology is Talk
- The ontological children of Talk include elements
such as TalkTitle and TalkBeginTime
- The element TalkBeginTime has its own
ontological children, TalkBeginTimeHour and
TalkBeginTimeMinute
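The hierarchy described above can be sketched as a nested mapping. This is only an illustration: the dictionary layout and the helper `path_to` are assumptions, and only the element names (Talk, TalkTitle, TalkBeginTime, TalkBeginTimeHour, TalkBeginTimeMinute) come from the slides.

```python
# Hypothetical in-memory form of the talk ontology; only the element
# names are taken from the slides.
TALK_ONTOLOGY = {
    "Talk": {
        "TalkTitle": {},
        "TalkBeginTime": {
            "TalkBeginTimeHour": {},
            "TalkBeginTimeMinute": {},
        },
    }
}


def path_to(element, tree=TALK_ONTOLOGY, prefix=()):
    """Return the chain of ancestors leading to `element`, or None."""
    for name, children in tree.items():
        current = prefix + (name,)
        if name == element:
            return current
        found = path_to(element, children, current)
        if found is not None:
            return found
    return None


print(path_to("TalkBeginTimeHour"))
```

The path from the root is what lets a learner restrict itself to the parent segment when it looks for a child element.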
8Advantages of a Hierarchy
- Using a hierarchical data model, we can break up
documents into embedded segments
- When learning rules for the speaker's first name,
for example, we only have to consider the speaker
segment of each document
9Wrappers
- A wrapper is the set of rules used to extract
data along with the code required to perform the
extraction
10The STALKER Algorithm
- STALKER is a hierarchical wrapper induction
algorithm developed at ISI
- We use a modified STALKER algorithm to do
information extraction on a source
- The extracted information along with a DAML
ontology can then be used to create markup for
the source
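The last step, turning extracted fields into markup, might look roughly like the sketch below. It assumes the extracted data is available as a nested dictionary keyed by ontology element names; the function `to_markup` is hypothetical, and the RDF wrapper and namespace declarations a real DAML file needs are left out.

```python
from xml.sax.saxutils import escape


def to_markup(element, value, indent=0):
    """Render nested extracted fields as indented XML-style markup."""
    pad = "  " * indent
    if isinstance(value, dict):
        # An element with ontological children becomes a container.
        inner = "\n".join(to_markup(k, v, indent + 1) for k, v in value.items())
        return f"{pad}<{element}>\n{inner}\n{pad}</{element}>"
    # A leaf element holds the extracted text, escaped for XML.
    return f"{pad}<{element}>{escape(value)}</{element}>"


fields = {
    "Title": "New Developments in Still Image and Video Compression",
    "Speaker": {"Name": "Hans L. Cycon"},
}
print(to_markup("Talk", fields))
```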
11Defining Rule, Landmark, Token, etc
- A token is an elementary piece of text
  - Lowercase words, HTML tags, Numbers, Alphanumeric
  words, Symbols, etc.
- A landmark is a sequence of one or more
consecutive tokens
- A rule clause contains one landmark and is one of
two types: SkipTo or SkipUntil
- A rule is an ordered list of rule clauses
  - can be applied either forward or backward
  - used to locate both the beginning and end of an
  information field
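These definitions can be made concrete with a small sketch. The slides do not spell out the difference between the two clause types, so the code assumes the reading usually given for STALKER: SkipTo stops just after the landmark, SkipUntil stops just before it. The tokenizer regex and the function names are illustrative, and only forward application is shown.

```python
import re


def tokenize(text):
    """Split text into HTML tags, word/number runs, and single symbols."""
    return re.findall(r"</?\w+[^>]*>|\w+|[^\w\s]", text)


def find_landmark(tokens, landmark, start):
    """Index of the first occurrence of `landmark` at or after `start`, else -1."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return -1


def apply_rule(tokens, rule):
    """Apply (kind, landmark) clauses in order, scanning forward.

    Returns the token index where the rule stops, or None if a clause fails.
    """
    pos = 0
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i < 0:
            return None
        # Assumed semantics: SkipTo stops after the landmark,
        # SkipUntil stops in front of it.
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos


tokens = tokenize('a talk <p><BR> "Title"<BR> <center>Michael Kleinschmidt</center>')
begin = apply_rule(tokens, [("SkipTo", ["<center>"])])
end = apply_rule(tokens, [("SkipUntil", ["</center>"])])
print(tokens[begin:end])
```

Backward application could reuse the same code on a reversed token list with reversed landmarks.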
12Rule Disjunction
- Because our system is based on a sequential
covering algorithm, a rule disjunction is learned
for each tag
- A rule disjunction is an ordered set of rules
that are applied in order when placing a tag
- The first of that set to match in the document is
used to place the tag
- Keep in mind that it is a rule disjunction of one
or more rules that is learned for each tag
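Applying a disjunction is a first-match loop over its rules. In the sketch below a rule is modeled as a function from a token list to a position (or None when it does not match); `skip_to` and `apply_disjunction` are illustrative names, not part of the described system.

```python
def skip_to(landmark):
    """Build a one-clause rule that stops just after the first `landmark`."""
    def rule(tokens):
        n = len(landmark)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == landmark:
                return i + n
        return None
    return rule


def apply_disjunction(rules, tokens):
    """Try the rules in order; the first one that matches places the tag."""
    for rule in rules:
        pos = rule(tokens)
        if pos is not None:
            return pos
    return None


speaker_begin = [skip_to(["Speaker", ":"]), skip_to(["<center>"])]
doc_a = ["Speaker", ":", "Jane", "Doe"]
doc_b = ["<p>", "<center>", "Michael", "Kleinschmidt", "</center>"]
print(apply_disjunction(speaker_begin, doc_a), apply_disjunction(speaker_begin, doc_b))
```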
13Example of a rule matching
14Refining a Rule
- A rule initially contains a single token
- The token is taken from the tokens immediately
adjacent to the target data item
- Examples: SkipTo(SYMBOL) or SkipUntil(John)
- Then, either a landmark is added to the rule or a
token is added to one of the existing landmarks
15Refining a Rule
- Example
- SkipTo(SYMBOL) can become
- SkipTo(be SYMBOL)
- SkipTo(speaker) SkipTo(SYMBOL)
- etc.
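The two kinds of refinement can be enumerated mechanically. The sketch below is a simplification under stated assumptions: every clause is treated as SkipTo, a rule is a tuple of landmarks, and all insertion positions are generated, whereas a real learner would likely prune some of them.

```python
def refinements(rule, candidate_tokens):
    """Yield every rule obtainable from `rule` in one refinement step.

    A rule is a tuple of landmarks; a landmark is a tuple of tokens.
    """
    for tok in candidate_tokens:
        # Grow one existing landmark by one token, at the front or the back.
        for i, landmark in enumerate(rule):
            yield rule[:i] + ((tok,) + landmark,) + rule[i + 1:]
            yield rule[:i] + (landmark + (tok,),) + rule[i + 1:]
        # Add a new one-token landmark at any position in the rule.
        for i in range(len(rule) + 1):
            yield rule[:i] + ((tok,),) + rule[i:]


def show(rule):
    """Format a rule in the notation used on the slides."""
    return " ".join("SkipTo(" + " ".join(lm) + ")" for lm in rule)


start = (("SYMBOL",),)
candidates = list(refinements(start, ["be", "speaker"]))
for candidate in candidates:
    print(show(candidate))
```

Both refinements named in the example, SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL), appear among the generated candidates.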
16Refining a Rule
- After refining a rule, the best candidate rule is
chosen and is determined to be either perfect or
imperfect
- The best candidate rule has the greatest number
of matches on the remaining training documents
- Early and failed matches are preferred over late
matches
- If the best candidate is perfect, it is returned;
otherwise it is refined again
17Keeping a Rule
- We want to keep rules that have perfect accuracy
on the training documents
  - No negative matches, where the rule being
  evaluated misplaces a tag in a training document
  - No false positive matches, where the rule places a
  tag for a data item in some training document
  where that data item does not exist
- When a rule continues being refined without
becoming perfect, it reaches a limit and is
returned as is
  - The rule in this case is probably not very useful
  - This case is infrequent
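The selection and acceptance tests can be sketched as follows. The data layout is an assumption: each candidate carries one predicted position per training document (None when the rule fails to match), and each target is the true position (None when the data item is absent). The preference for early and failed matches over late ones is omitted for brevity.

```python
def is_perfect(predictions, targets):
    """True if the rule never puts a tag in the wrong place.

    A failed match (None) is allowed; a misplaced tag or a tag placed
    where no data item exists is not.
    """
    return all(p is None or p == t for p, t in zip(predictions, targets))


def correct_matches(predictions, targets):
    """Number of documents where the tag lands exactly on the target."""
    return sum(1 for p, t in zip(predictions, targets) if p is not None and p == t)


def best_candidate(candidates, targets):
    """Pick the (name, predictions) pair with the most correct matches."""
    return max(candidates, key=lambda c: correct_matches(c[1], targets))


targets = [4, 7, None]             # the third document lacks the data item
cand_a = ("A", [4, None, None])    # one correct match, no mistakes
cand_b = ("B", [4, 7, 2])          # two correct matches, one false positive
best = best_candidate([cand_a, cand_b], targets)
print(best[0], is_perfect(best[1], targets))
```

Here candidate B is the best candidate but is imperfect, so it would be refined again.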
18General overview of our improvements
- Minimum Refinements
- Rule Score
- Refinement Window
- Wildcards
- In the upcoming examples, we often explain how
each of these improvements is useful in finding a
begin tag for an ontology element; the usefulness
for end tags is similar
19Minimum Refinements
- Forces rules to be refined some minimum number of
times
- We typically use a minimum number of 5
20Minimum Refinements Example
- Consider the rule SkipTo(George)
- Suppose this rule is perfect
- In general, this rule would be very ineffective
at finding the speaker's first name
- We would force this rule to be refined further so
that it might ultimately have a greater coverage
over all documents and reflect the structure of
the domain
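One way to picture the minimum-refinements constraint is as an extra condition in the refinement loop. The sketch below is hypothetical: `refine_once` and `is_perfect` stand in for the real refinement and acceptance steps, and `max_refinements` is an assumed name and value for the refinement limit.

```python
def learn_rule(seed, refine_once, is_perfect, min_refinements=5, max_refinements=20):
    """Refine `seed` until it is perfect AND has been refined often enough.

    Returns the final rule and the number of refinement steps taken.
    """
    rule = seed
    for step in range(1, max_refinements + 1):
        rule = refine_once(rule)
        # A perfect rule is only accepted once the minimum is reached.
        if step >= min_refinements and is_perfect(rule):
            return rule, step
    # Limit reached: return the rule as is, even though it is imperfect.
    return rule, max_refinements


# Toy stand-ins: a "rule" is just a counter that becomes perfect at 2.
result_forced = learn_rule(0, lambda r: r + 1, lambda r: r >= 2)
result_free = learn_rule(0, lambda r: r + 1, lambda r: r >= 2, min_refinements=0)
print(result_forced, result_free)
```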
21Rule Score
- Utilizes an evaluation set of documents
- Decides whether forward rules or backward rules
are better for a particular tag based on their
performance on the evaluation set
22Rule Score Example
- What should we do when forward and backward rules
disagree on the location of a tag?
- We test the forward and backward rules on a set
of evaluation documents that were not used during
the training
- If the forward rules have a better score on the
evaluation set, they are stored as the rules for
placing that tag
- Requires additional marked-up documents
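The comparison can be sketched with a simple accuracy score. The scoring function is an assumption (the slides do not define the score), as is keeping the forward rules on a tie; `forward` and `backward` are stand-ins for learned rule disjunctions.

```python
def rule_score(predict, eval_set):
    """Fraction of evaluation documents where the tag is placed correctly."""
    hits = sum(1 for doc, target in eval_set if predict(doc) == target)
    return hits / len(eval_set)


def choose_direction(forward, backward, eval_set):
    """Keep whichever rule set scores better on the held-out documents."""
    f = rule_score(forward, eval_set)
    b = rule_score(backward, eval_set)
    return "forward" if f >= b else "backward"


# Toy evaluation set: (tokens, true position of the tag).
eval_set = [(["a", "X", "b"], 1), (["X", "c"], 0), (["d", "e", "X"], 2)]
forward = lambda doc: doc.index("X")   # always finds the marker
backward = lambda doc: len(doc) - 1    # always guesses the last token
print(choose_direction(forward, backward, eval_set))
```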
23Refinement Window
- Only consider the closest n tokens to a tag when
refining a rule
- We typically use n = 10
24Refinement Window Example
- Consider the tag TalkTitle
- Its ontological parent is Talk, the entire talk
announcement
- Without a refinement window, many irrelevant
tokens would be considered when learning rules
for the title
- At worst, some irrelevant tokens would actually
be used in a rule
- Such a rule would not generalize well
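A refinement window amounts to slicing the token list around the tag position. The sketch assumes that forward rules draw candidates from the tokens before the tag and backward rules from the tokens after it; the function name is illustrative.

```python
def refinement_window(tokens, tag_index, n=10, forward=True):
    """Tokens eligible for refining a rule that places a tag at `tag_index`."""
    if forward:
        # The n tokens immediately before the tag position.
        return tokens[max(0, tag_index - n):tag_index]
    # The n tokens starting at the tag position.
    return tokens[tag_index:tag_index + n]


tokens = [f"t{i}" for i in range(30)]
print(refinement_window(tokens, 25))
print(refinement_window(tokens, 25, forward=False))
```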
25Wildcards
- Both domain-dependent and domain-independent
wildcards can be used in place of tokens
- Allow us to better generalize a document's
structure
- Examples are MONTH, NUMBER, HTML_TAG, etc.
26Wildcards Example
- Consider the tags TalkDateMonth and
TalkDateDayOfWeek
- We might start with the rule
SkipUntil(INITIAL_CAP_WORD) for finding the month,
but this rule would match the day of the week, as well
- By virtue of the wildcard MONTH, we can use the
rule SkipUntil(MONTH) to accurately locate the
month
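The month example can be reproduced with a small wildcard table. The wildcard names come from the slides; their definitions here (for instance, what counts as an initial-cap word) are assumptions made for illustration.

```python
import re

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}

# Each wildcard is a predicate over a single token.
WILDCARDS = {
    "MONTH": lambda t: t in MONTHS,
    "NUMBER": lambda t: t.isdigit(),
    "HTML_TAG": lambda t: re.fullmatch(r"</?\w+[^>]*>", t) is not None,
    "INITIAL_CAP_WORD": lambda t: t[:1].isupper() and t.isalpha(),
}


def matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def skip_until(tokens, pattern):
    """Index of the first token matching `pattern`, or None."""
    for i, tok in enumerate(tokens):
        if matches(pattern, tok):
            return i
    return None


tokens = ["<br>", "Thursday", ",", "March", "27", ",", "2002"]
print(tokens[skip_until(tokens, "INITIAL_CAP_WORD")])
print(tokens[skip_until(tokens, "MONTH")])
```

The first lookup stops at the day of the week, the second at the month, which is the distinction the example describes.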
27Marked Up by a Human
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
    increases, .... We will also show a real-time IP video conference
    system based on earlier versions of these wavelet codecs.</Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>3</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>4</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L. Cycon</Name>
      <Organization>FHTW Berlin, University of Applied Sciences</Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
28Marked Up by our Basic System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression Prof.
    Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
    increases, .... We will also show a real-time IP video conference
    system based on earlier versions of these wavelet codecs. </Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year></time:Year>
      <time:Month></time:Month>
      <time:Day></time:Day>
      <time:DayOfWeek></time:DayOfWeek>
      <time:Hour></time:Hour>
      <time:Minute>00</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year></time:Year>
      <time:Month></time:Month>
      <time:Day></time:Day>
      <time:DayOfWeek></time:DayOfWeek>
      <time:Hour>3:30-4</time:Hour>
      <time:Minute>30-4:30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L </Name>
      <Organization></Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
29Marked Up by our Full System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression Prof.
    Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
    increases, .... We will also show a real-time IP video conference
    system based on earlier versions of these wavelet codecs.</Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>3</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>4</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L. Cycon</Name>
      <Organization>Prof. Hans L. Cycon FHTW Berlin, University of
      Applied Sciences</Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
30Experimental Setup
- 3 Domains
  - UC Berkeley, UCSB, and ITTALKS
- 6 Systems
  - Basic, Min Refine, Score, Refinement Window,
  Wildcards, and Full
- 10 Partitions
  - 20/20/20 split
  - Training/Evaluation/Testing Sets
- Recall: number of correctly extracted data items
divided by the total number of data items in the
documents
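The recall measure can be computed as in the sketch below. Treating an item as correct only on an exact string match is an assumption; the three Speaker fields are taken from the human, basic, and full markup shown earlier, and the document id "talk1" is made up for the example.

```python
def recall(gold, extracted):
    """Correctly extracted items divided by all items in the documents.

    Both arguments map (document id, tag) to the field's text.
    """
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items() if extracted.get(key) == value)
    return correct / len(gold)


gold = {
    ("talk1", "Name"): "Hans L. Cycon",
    ("talk1", "Organization"): "FHTW Berlin, University of Applied Sciences",
    ("talk1", "Email"): "hcycon_at_fhtw-berlin.de",
}
basic = {
    ("talk1", "Name"): "Hans L ",
    ("talk1", "Organization"): "",
    ("talk1", "Email"): "hcycon_at_fhtw-berlin.de",
}
full = {
    ("talk1", "Name"): "Hans L. Cycon",
    ("talk1", "Organization"):
        "Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences",
    ("talk1", "Email"): "hcycon_at_fhtw-berlin.de",
}
print(recall(gold, basic), recall(gold, full))
```

On these three fields the basic output gets one of three items right and the full output two of three.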
31Average Recall Over All Tags
UC Berkeley Domain
UCSB Domain
32Performance Improvements on Individual Tags
UC Berkeley Domain
33Performance Improvements on Individual Tags
UCSB Domain
34Conclusion
- Our system extends the state-of-the-art algorithm
STALKER
- Our system performs DAML markup on talk
announcements
- It can trivially be extended to different markup
languages and different domains
- A working implementation of everything described
here exists!
35Future Considerations
- Active Learning: select training documents that
yield rules with the greatest possible coverage
- Cardinality Issues: ontology elements that appear
in lists
- Linguistic Information: use a system like
Aerotext to preprocess the documents
- Google API: check to see if our tag placement
makes sense
36Acknowledgements
- This work was supported in part by the Defense
Advanced Research Projects Agency under contract
F30602-00-2-0591 as part of the DAML program
(http://daml.org/)
- It was also supported by a Northrop Grumman
Fellowship
37References
- Ciravegna, F. (2001). (LP)2, an Adaptive Algorithm for
Information Extraction from Web-related Texts. In
Proceedings of the IJCAI-2001 Workshop on Adaptive
Text Extraction and Mining, held in conjunction with
17th International Joint Conference on Artificial
Intelligence (IJCAI).
- Cost, R. S., T. Finin, A. Joshi, Y. Peng, C. Nicholas,
I. Soboroff, H. Chen, L. Kagal, F. Perich, Y. Zou, and
S. Tolia. (2002). ITtalks: A Case Study in the Semantic
Web and DAML+OIL. IEEE Intelligent Systems,
17(1):40-47.
- Hendler, J. (2001). Agents and the Semantic Web.
IEEE Intelligent Systems, 16(2):30-37.
- Hendler, J., and D. L. McGuinness. (2000). The
DARPA Agent Markup Language. IEEE Intelligent Systems,
15(6):67-73.