Automatically Generated DAML Markup for Semistructured Documents

1
Automatically Generated DAML Markup for
Semistructured Documents

William Krueger, Jonathan Nilsson, Tim Oates, Tim
Finin

Supported by DARPA contract F30602-00-2-0591
2
DAML and the Semantic Web

The most efficient way for machines to understand
the semantics of the vast amount of information
on the web is to add semantic markup to the
information
DAML (DARPA Agent Markup Language) is one
existing semantic markup language

3
The Problem

Semantically marking up large amounts of data by
hand is far too time-consuming
We use machine learning techniques to automate
the task

4
An Excerpt From a Talk Announcement
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center>Michael
Kleinschmidt</center> Medical Physics Group,
University of Oldenburg, Germany<BR>
Who is the speaker?
5
An Excerpt From a Talk Announcement (Solution)
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center><Speaker>Michael
Kleinschmidt</Speaker></center> Medical Physics
Group, University of Oldenburg, Germany<BR>
6
Outline

Talk Ontology
Hierarchical Wrapper Induction
Contributions
Experimental Results
Future Considerations

7
Our Talk Ontology

An ontology is the hierarchically organized
vocabulary used to semantically mark up
information sources
The root of our talk ontology is Talk
The ontological children of Talk include elements
such as TalkTitle and TalkBeginTime
The element TalkBeginTime has its own
ontological children, TalkBeginTimeHour and
TalkBeginTimeMinute
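
The hierarchy described above can be sketched as a nested dictionary. This is only an illustration: the encoding and the helper path_to are hypothetical, and only the element names mentioned on this slide are included.

```python
# Hypothetical nested-dict encoding of the talk ontology; only the element
# names that appear on the slide are listed.
TALK_ONTOLOGY = {
    "Talk": {
        "TalkTitle": {},
        "TalkBeginTime": {
            "TalkBeginTimeHour": {},
            "TalkBeginTimeMinute": {},
        },
    },
}


def path_to(element, tree=TALK_ONTOLOGY, prefix=()):
    """Return the chain of elements from the root down to `element`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == element:
            return list(here)
        found = path_to(element, children, here)
        if found is not None:
            return found
    return None


print(path_to("TalkBeginTimeHour"))
# ['Talk', 'TalkBeginTime', 'TalkBeginTimeHour']
```

The path from the root to an element is what lets extraction for a child be restricted to the segment already found for its parent.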

8
Advantages of a Hierarchy

Using a hierarchical data model, we can break up
documents into embedded segments
When learning rules for the speaker's first name,
for example, we only have to consider the speaker
segment of each document

9
Wrappers

A wrapper is the set of rules used to extract
data along with the code required to perform the
extraction

10
The STALKER Algorithm

STALKER is a hierarchical wrapper induction
algorithm developed at ISI
We use a modified STALKER algorithm to do
information extraction on a source
The extracted information along with a DAML
ontology can then be used to create markup for
the source
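
As a rough sketch of the last step, extracted fields can be turned into nested markup with Python's standard xml.etree.ElementTree module. The dictionary layout and the helpers build and to_markup are assumptions made for illustration; namespaces and the DAML ontology lookup are left out.

```python
import xml.etree.ElementTree as ET


def build(parent, fields):
    """Append one child element per extracted field; nested dicts become nested elements."""
    for name, value in fields.items():
        child = ET.SubElement(parent, name)
        if isinstance(value, dict):
            build(child, value)
        else:
            child.text = value


def to_markup(fields, root_name="Talk"):
    """Serialize the extracted fields under a single root element (no namespaces)."""
    root = ET.Element(root_name)
    build(root, fields)
    return ET.tostring(root, encoding="unicode")


# Hypothetical extraction result; the element names follow the markup shown later.
extracted = {
    "Title": "New Developments in Still Image and Video Compression",
    "Speaker": {"Name": "Hans L. Cycon"},
}
markup = to_markup(extracted)
print(markup)
```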

11
Defining Rule, Landmark, Token, etc

A token is an elementary piece of text
Token types include lowercase words, HTML tags,
numbers, alphanumeric words, symbols, etc.
A landmark is a sequence of one or more
consecutive tokens
A rule clause contains one landmark and is one of
two types: SkipTo or SkipUntil
A rule is an ordered list of rule clauses that
can be applied either forward or backward and is
used to locate both the beginning and end of an
information field
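
The definitions above can be made concrete with a small sketch. It assumes the usual reading of the two clause types (SkipTo consumes its landmark, SkipUntil stops in front of it); the tokenizer regex and the function names are hypothetical and much simpler than a real STALKER implementation.

```python
import re


def tokenize(text):
    """Split text into HTML tags, words/numbers, and single symbols."""
    return re.findall(r"</?\w+[^>]*>|\w+|[^\w\s]", text)


def find_landmark(tokens, landmark, start):
    """Index of the first occurrence of the token sequence `landmark` at or after `start`, else -1."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return -1


def apply_rule(tokens, rule):
    """Apply an ordered list of (kind, landmark) clauses going forward.

    Returns the token index the rule ends on, or None if any clause fails.
    """
    pos = 0
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i < 0:
            return None
        # Assumed semantics: SkipTo consumes the landmark, SkipUntil stops before it.
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos


tokens = tokenize("a talk <BR> <P> <center>Michael Kleinschmidt</center> Medical Physics Group")
start = apply_rule(tokens, [("SkipTo", ["<center>"])])
end = apply_rule(tokens, [("SkipUntil", ["</center>"])])
speaker = " ".join(tokens[start:end])
print(speaker)  # Michael Kleinschmidt
```

Here both rules run forward for brevity; a backward rule would apply the same clauses to the reversed token list.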

12
Rule Disjunction

Because our system is based on a sequential
covering algorithm, a rule disjunction is learned
for each tag
A rule disjunction is an ordered set of rules
that are applied in order when placing a tag
The first of that set to match in the document is
used to place the tag
Keep in mind that it is a rule disjunction of one
or more rules that is learned for each tag
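
Applying a learned disjunction can be sketched as follows. Each rule is modelled as a function from a token list to a position (or None on failure); skip_to is a hypothetical one-clause rule used only to make the example runnable.

```python
def apply_disjunction(tokens, rules):
    """Try the rules in order and return the position from the first one that matches."""
    for rule in rules:
        pos = rule(tokens)
        if pos is not None:
            return pos
    return None


def skip_to(token):
    """Hypothetical one-clause rule: the position just after the first `token`."""
    def rule(tokens):
        return tokens.index(token) + 1 if token in tokens else None
    return rule


# An ordered disjunction for a begin tag; the second rule is only tried if the first fails.
speaker_begin = [skip_to("<center>"), skip_to("Speaker:")]

print(apply_disjunction(["<P>", "<center>", "Michael"], speaker_begin))  # 2
print(apply_disjunction(["Speaker:", "Ann"], speaker_begin))             # 1
```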

13
Example of a rule matching
14
Refining a Rule

A rule initially contains a single token
The token is taken from the tokens immediately
adjacent to the target data item
Examples: SkipTo(SYMBOL) or SkipUntil(John)
Then, either a landmark is added to the rule or a
token is added to one of the existing landmarks

15
Refining a Rule

Example:
SkipTo(SYMBOL) can become
SkipTo(be SYMBOL)
SkipTo(speaker) SkipTo(SYMBOL)
etc.
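
The two kinds of refinement can be sketched as a generator of candidate rules. The representation (a rule as a list of (kind, landmark) pairs, a landmark as a tuple of tokens) and the choice to always add new clauses as SkipTo at the front are assumptions made for illustration.

```python
def refinements(rule, candidate_tokens):
    """Yield every rule that is one refinement step away from `rule`."""
    for tok in candidate_tokens:
        # Landmark refinement: add a new one-token clause in front of the rule.
        yield [("SkipTo", (tok,))] + rule
        # Token refinement: lengthen one existing landmark by one token on the left.
        for i, (kind, landmark) in enumerate(rule):
            yield rule[:i] + [(kind, (tok,) + landmark)] + rule[i + 1:]


seed = [("SkipTo", ("SYMBOL",))]
candidates = list(refinements(seed, ["be", "speaker"]))
for candidate in candidates:
    print(candidate)
```

With the seed SkipTo(SYMBOL) and the tokens "be" and "speaker", the candidates include both SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL) from the slide.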

16
Refining a Rule

After refining a rule, the best candidate rule is
chosen and is determined to be either perfect or
imperfect
The best candidate rule has the greatest number
of matches on the remaining training documents
Early and failed matches are preferred over late
matches
If the best candidate is perfect, it is returned;
otherwise it is refined again

17
Keeping a Rule

We want to keep rules that have perfect accuracy
on the training documents
No negative matches, where the rule being
evaluated misplaces a tag in a training document
No false positive matches, where the rule places a
tag for a data item in some training document
where that data item does not exist
When a rule continues being refined without
becoming perfect, it reaches a limit and is
returned as is
The rule in this case is probably not very useful
This case is infrequent
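
One possible reading of the "perfect rule" test is sketched below. It assumes that a rule which simply fails to match a document is tolerated (another rule in the disjunction may cover that document), while a misplaced tag or a tag placed where no item exists disqualifies the rule. The example rules and documents are made up.

```python
def is_perfect(rule, examples):
    """Check a rule against training examples.

    `examples` is a list of (tokens, true_index) pairs; true_index is None
    when the data item does not occur in that document.
    """
    for tokens, true_index in examples:
        predicted = rule(tokens)
        if predicted is None:
            continue  # failed match: tolerated under the assumption stated above
        if predicted != true_index:
            return False  # misplaced tag, or a tag placed where no item exists
    return True


# Hypothetical rules and training examples, for illustration only.
def after_center(tokens):
    return tokens.index("<center>") + 1 if "<center>" in tokens else None


def always_zero(tokens):
    return 0


examples = [
    (["<center>", "Ann"], 1),
    (["Speaker", ":", "Bob"], 2),
    (["no", "speaker", "here"], None),
]
print(is_perfect(after_center, examples))  # True
print(is_perfect(always_zero, examples))   # False
```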

18
General overview of our improvements

Minimum Refinements
Rule Score
Refinement Window
Wildcards
In the upcoming examples, we often explain how
each of these improvements is useful in finding a
begin tag for an ontology element; the usefulness
for end tags is similar

19
Minimum Refinements

Forces rules to be refined some minimum number of
times
We typically use a minimum number of 5

20
Minimum Refinements Example

Consider the rule SkipTo(George)
Suppose this rule is perfect
In general, this rule would be very ineffective
at finding the speaker's first name
We would force this rule to be refined further so
that it might ultimately have a greater coverage
over all documents and reflect the structure of
the domain
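
A minimal sketch of how a refinement floor might interact with the refinement loop. The minimum of 5 comes from the slides; the upper limit of 20 and the callback interface (refine, is_perfect) are hypothetical.

```python
def learn_rule(seed, refine, is_perfect, min_refinements=5, max_refinements=20):
    """Refine `seed` until it is perfect AND has been refined at least `min_refinements` times.

    If the (assumed) upper limit is reached first, the current rule is returned as is.
    """
    rule, steps = seed, 0
    while steps < max_refinements:
        if steps >= min_refinements and is_perfect(rule):
            return rule
        rule = refine(rule)
        steps += 1
    return rule


# Toy stand-ins: a "rule" is just a counter of how often it has been refined.
forced = learn_rule(0, refine=lambda r: r + 1, is_perfect=lambda r: True)
print(forced)  # 5: a rule that is perfect from the start is still refined five times
```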

21
Rule Score

Utilizes an evaluation set of documents
Decides whether forward rules or backward rules
are better for a particular tag based on their
performance on the evaluation set

22
Rule Score Example

What should we do when forward and backward rules
disagree on the location of a tag?
We test the forward and backward rules on a set
of evaluation documents that were not used during
the training
If the forward rules have a better score on the
evaluation set, they are stored as the rules for
placing that tag
Requires additional marked-up documents
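
The slides do not define the score itself, so the sketch below assumes the simplest option: the fraction of evaluation documents on which a rule places the tag correctly. The tie-break in favor of forward rules is also an assumption.

```python
def score(rule, evaluation):
    """Fraction of evaluation documents on which `rule` places the tag correctly."""
    hits = sum(1 for tokens, true_index in evaluation if rule(tokens) == true_index)
    return hits / len(evaluation)


def choose_direction(forward, backward, evaluation):
    """Keep whichever rule set scores better on the held-out evaluation set."""
    # Ties go to the forward rule here; that tie-break is an assumption.
    if score(forward, evaluation) >= score(backward, evaluation):
        return "forward"
    return "backward"


# Toy rules and evaluation documents, for illustration only.
def forward_rule(tokens):
    return 0


def backward_rule(tokens):
    return len(tokens) - 1


evaluation = [(["a", "b"], 1), (["c"], 0), (["d", "e", "f"], 2)]
print(choose_direction(forward_rule, backward_rule, evaluation))  # backward
```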

23
Refinement Window

Only consider the closest n tokens to a tag when
refining a rule
We typically use n = 10

24
Refinement Window Example

Consider the tag TalkTitle
Its ontological parent is Talk, the entire talk
announcement
Without a refinement window, many irrelevant
tokens would be considered when learning rules
for the title
At worst, some irrelevant tokens would actually
be used in a rule
Such a rule would not generalize well
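
A sketch of the window for a forward rule that places a begin tag, assuming "closest" means the n tokens immediately before the tag position. The function name is hypothetical.

```python
def refinement_window(tokens, tag_index, n=10):
    """Return the n tokens immediately before `tag_index`.

    Only these tokens would be offered as candidates when refining a forward
    rule for the tag at `tag_index`.
    """
    return tokens[max(0, tag_index - n):tag_index]


tokens = [str(i) for i in range(30)]
print(refinement_window(tokens, 25))       # tokens '15' through '24'
print(refinement_window(tokens, 3))        # ['0', '1', '2']
```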

25
Wildcards

Both domain-dependent and domain-independent
wildcards can be used in place of tokens
Allow us to better generalize a document's
structure
Examples are MONTH, NUMBER, HTML_TAG, etc.

26
Wildcards Example

Consider the tags TalkDateMonth and
TalkDateDayOfWeek
We might start with the rule
SkipUntil(INITIAL_CAP_WORD) for finding the month,
but this rule would match the day of the week, as
well
By virtue of the wildcard MONTH, we can use the
rule SkipUntil(MONTH) to accurately locate the
month
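
The difference between the two wildcards can be shown with a small sketch. The wildcard names are taken from the slides; the membership tests and the helper skip_until are simplified assumptions.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def wildcards(token):
    """Return the set of wildcard classes that `token` belongs to."""
    classes = set()
    if token in MONTHS:
        classes.add("MONTH")  # domain-dependent wildcard
    if token.isdigit():
        classes.add("NUMBER")
    if token.startswith("<") and token.endswith(">"):
        classes.add("HTML_TAG")
    if token[:1].isupper() and token[1:].islower():
        classes.add("INITIAL_CAP_WORD")
    return classes


def skip_until(tokens, wildcard):
    """Index of the first token matching `wildcard`, or None."""
    for i, tok in enumerate(tokens):
        if wildcard in wildcards(tok):
            return i
    return None


tokens = ["Thursday", ",", "March", "27"]
print(skip_until(tokens, "INITIAL_CAP_WORD"))  # 0: stops at the day of the week
print(skip_until(tokens, "MONTH"))             # 2: stops at the month
```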

27
Marked Up by a Human
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
      increases, .... We will also show a real-time IP video conference
      system based on earlier versions of these wavelet codecs.</Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>3</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>4</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L. Cycon</Name>
      <Organization>FHTW Berlin, University of Applied Sciences</Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
28
Marked Up by our Basic System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression Prof. Hans
      L. Cycon FHTW Berlin, University of Applied Sciences</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
      increases, .... We will also show a real-time IP video conference
      system based on earlier versions of these wavelet codecs. </Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year></time:Year>
      <time:Month></time:Month>
      <time:Day></time:Day>
      <time:DayOfWeek></time:DayOfWeek>
      <time:Hour></time:Hour>
      <time:Minute>00</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year></time:Year>
      <time:Month></time:Month>
      <time:Day></time:Day>
      <time:DayOfWeek></time:DayOfWeek>
      <time:Hour>3:30-4</time:Hour>
      <time:Minute>30-4:30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L </Name>
      <Organization></Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
29
Marked Up by our Full System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont">
  <Talk>
    <Title>New Developments in Still Image and Video Compression Prof. Hans
      L. Cycon FHTW Berlin, University of Applied Sciences</Title>
    <X-URI></X-URI>
    <DAML-URI>./daml/trainfile1.daml</DAML-URI>
    <Abstract>While the demand on quality of digital images and videos
      increases, .... We will also show a real-time IP video conference
      system based on earlier versions of these wavelet codecs.</Abstract>
    <BeginTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>3</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </BeginTime>
    <EndTime rdf:parseType="Resource">
      <time:Year>2002</time:Year>
      <time:Month>March</time:Month>
      <time:Day>27</time:Day>
      <time:DayOfWeek>Thursday</time:DayOfWeek>
      <time:Hour>4</time:Hour>
      <time:Minute>30</time:Minute>
      <time:Second>00</time:Second>
    </EndTime>
    <Speaker rdf:parseType="Resource">
      <Name>Hans L. Cycon</Name>
      <Organization>Prof. Hans L. Cycon FHTW Berlin, University of Applied
        Sciences</Organization>
      <Email>hcycon_at_fhtw-berlin.de</Email>
    </Speaker>
  </Talk>
</rdf:RDF>
30
Experimental Setup

3 Domains:
UC Berkeley, UCSB, and ITTALKS
6 Systems:
Basic, Min Refine, Score, Refinement Window,
Wildcards, and Full
10 Partitions:
20/20/20 split into
Training/Evaluation/Testing Sets
Recall: number of correctly extracted data items
divided by the total number of data items in the
documents
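
The recall definition above can be written out directly. The four fields below are copied from the human markup and the basic-system markup shown earlier, purely to illustrate the arithmetic; the resulting number is not one of the reported experimental results, and the flat dictionary layout is an assumption.

```python
def recall(extracted, gold):
    """Correctly extracted data items divided by the total number of data items."""
    correct = sum(1 for key, value in gold.items() if extracted.get(key) == value)
    return correct / len(gold)


# Four fields from the human markup (gold) and from the basic system's output.
gold = {
    "Title": "New Developments in Still Image and Video Compression",
    "Name": "Hans L. Cycon",
    "Month": "March",
    "Email": "hcycon_at_fhtw-berlin.de",
}
basic = {
    "Title": "New Developments in Still Image and Video Compression "
             "Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences",
    "Name": "Hans L ",
    "Month": "",
    "Email": "hcycon_at_fhtw-berlin.de",
}
print(recall(basic, gold))  # 0.25: only the email matches exactly
```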

31
Average Recall Over All Tags
UC Berkeley Domain
UCSB Domain
32
Performance Improvements on Individual Tags
UC Berkeley Domain
33
Performance Improvements on Individual Tags
UCSB Domain
34
Conclusion

Our system extends the state-of-the-art algorithm
STALKER
Our system performs DAML markup on talk
announcements
It can trivially be extended to different markup
languages and different domains
A working implementation of everything described
here exists!

35
Future Considerations

Active Learning: select training documents that
yield rules with the greatest possible coverage
Cardinality Issues: ontology elements that appear
in lists
Linguistic Information: use a system like
Aerotext to preprocess the documents
Google API: check to see if our tag placement
makes sense

36
Acknowledgements

This work was supported in part by the Defense
Advanced Research Projects Agency under contract
F30602-00-2-0591 as part of the DAML program
(http://daml.org/)
It was also supported by a Northrop Grumman
Fellowship
