Title: Automatically Generated DAML Markup for Semistructured Documents


1
Automatically Generated DAML Markup for
Semistructured Documents
  • William Krueger, Jonathan Nilsson, Tim Oates, Tim
    Finin

Supported by DARPA contract F30602-00-2-0591
2
DAML and the Semantic Web
  • The most efficient way for machines to understand
    the semantics of the vast amount of information
    on the web is to add semantic markup to the
    information
  • DAML (DARPA Agent Markup Language) is one
    existing semantic markup language

3
The Problem
  • Semantically marking up large amounts of data by
    hand is far too time consuming
  • We use machine learning techniques to automate
    the task

4
An Excerpt From a Talk Announcement
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center>Michael
Kleinschmidt</center> Medical Physics Group,
University of Oldenburg, Germany<BR>
Who is the speaker?
5
An Excerpt From a Talk Announcement (Solution)
The International Computer Science Institute <br>
is pleased to present a talk <p><BR> "Automatic
Classification of Acoustic Signals Based on
Psychoacoustic and Neurophysiological
Knowledge"<BR> <P> <center><Speaker>Michael
Kleinschmidt</Speaker></center> Medical Physics
Group, University of Oldenburg, Germany<BR>
6
Outline
  • Talk Ontology
  • Hierarchical Wrapper Induction
  • Contributions
  • Experimental Results
  • Future Considerations

7
Our Talk Ontology
  • An ontology is the hierarchically organized
    vocabulary used to semantically mark up
    information sources
  • The root of our talk ontology is Talk
  • The ontological children of Talk include elements
    such as TalkTitle and TalkBeginTime
  • The element TalkBeginTime has its own
    ontological children, TalkBeginTimeHour and
    TalkBeginTimeMinute
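
As a rough illustration of the hierarchy described on this slide, the ontology can be pictured as a nested mapping. This is only a sketch: the dictionary layout and the helper `parent_of` are hypothetical and are not taken from the system itself; only the element names come from the slide.

```python
# Hypothetical nested-dict view of the part of the talk ontology named on the slide.
TALK_ONTOLOGY = {
    "Talk": {
        "TalkTitle": {},
        "TalkBeginTime": {
            "TalkBeginTimeHour": {},
            "TalkBeginTimeMinute": {},
        },
    }
}


def parent_of(element, tree=TALK_ONTOLOGY, parent=None):
    """Return the ontological parent of `element`, or None for the root or an unknown name."""
    for name, children in tree.items():
        if name == element:
            return parent
        found = parent_of(element, children, name)
        if found is not None:
            return found
    return None
```

For example, `parent_of("TalkBeginTimeHour")` walks down from Talk and reports TalkBeginTime as the enclosing segment.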

8
Advantages of a Hierarchy
  • Using a hierarchical data model, we can break up
    documents into embedded segments
  • When learning rules for the speaker's first name,
    for example, we only have to consider the speaker
    segment of each document

9
Wrappers
  • A wrapper is the set of rules used to extract
    data along with the code required to perform the
    extraction

10
The STALKER Algorithm
  • STALKER is a hierarchical wrapper induction
    algorithm developed at ISI
  • We use a modified STALKER algorithm to do
    information extraction on a source
  • The extracted information along with a DAML
    ontology can then be used to create markup for
    the source

11
Defining Rule, Landmark, Token, etc.
  • A token is an elementary piece of text
  • Token types include lowercase words, HTML tags,
    numbers, alphanumeric words, symbols, etc.
  • A landmark is a sequence of one or more
    consecutive tokens
  • A rule clause contains one landmark and is one of
    two types: SkipTo or SkipUntil
  • A rule is an ordered list of rule clauses
  • A rule can be applied either forward or backward
  • Rules are used to locate both the beginning and
    the end of an information field
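
The definitions above can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the regular expression, the function names, and the `(kind, landmark)` tuple layout are assumptions, and it assumes the usual reading in which SkipTo stops just after its landmark while SkipUntil stops just before it.

```python
import re


def tokenize(text):
    """Split text into HTML tags, words/numbers, and single symbols."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", text)


def find_landmark(tokens, landmark, start):
    """Index of the first occurrence of `landmark` at or after `start`, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None


def apply_rule(tokens, rule):
    """Apply the rule clauses in order; return the final token index, or None on failure."""
    pos = 0
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i is None:
            return None  # the rule does not match this document
        # Assumed semantics: SkipTo consumes the landmark, SkipUntil stops before it.
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos


tokens = tokenize("a talk <p><center>Michael Kleinschmidt</center> Medical")
begin = apply_rule(tokens, [("SkipTo", ["talk"]), ("SkipTo", ["<center>"])])
```

Here `begin` points at the token `Michael`, the start of the speaker field. Under the same assumptions, a backward rule could be run by applying the same function to the reversed token list.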

12
Rule Disjunction
  • Because our system is based on a sequential
    covering algorithm, a rule disjunction is learned
    for each tag
  • A rule disjunction is an ordered set of rules
    that are applied in order when placing a tag
  • The first of that set to match in the document is
    used to place the tag
  • Keep in mind that it is a rule disjunction of one
    or more rules that is learned for each tag
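
A rule disjunction can be sketched as an ordered list that is tried until one rule matches. In this sketch each rule is modelled as a plain function from a token list to a position (or None); `skip_to` and `apply_disjunction` are hypothetical helper names, not part of the system described here.

```python
def skip_to(token):
    """Build a one-clause rule that stops just after the first occurrence of `token`."""
    def rule(tokens):
        return tokens.index(token) + 1 if token in tokens else None
    return rule


def apply_disjunction(tokens, rules):
    """Try the rules in order; return (rule_index, position) for the first match, else None."""
    for idx, rule in enumerate(rules):
        pos = rule(tokens)
        if pos is not None:
            return idx, pos
    return None


tokens = ["Speaker", ":", "Michael"]
placed = apply_disjunction(tokens, [skip_to("by"), skip_to(":")])
```

The first rule fails on this document, so the second rule places the tag; later rules in the list are only consulted when earlier ones do not match.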

13
Example of a rule matching
14
Refining a Rule
  • A rule initially contains a single token
  • The token is taken from the tokens immediately
    adjacent to the target data item
  • Examples: SkipTo(SYMBOL) or SkipUntil(John)
  • Then, either a landmark is added to the rule or a
    token is added to one of the existing landmarks

15
Refining a Rule
  • Example: SkipTo(SYMBOL) can become
  • SkipTo(be SYMBOL)
  • SkipTo(speaker) SkipTo(SYMBOL)
  • etc.
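
The two kinds of refinement can be sketched as a function that enumerates candidate rules. This is an illustrative sketch only: the rule representation is assumed, new landmarks are always added as SkipTo clauses for simplicity, and the name `refinements` is hypothetical.

```python
def refinements(rule, candidates):
    """Return every rule obtained from `rule` by one refinement step using `candidates`."""
    out = []
    for tok in candidates:
        # Add a new one-token landmark clause in front of an existing clause.
        for i in range(len(rule)):
            out.append(rule[:i] + [("SkipTo", [tok])] + rule[i:])
        # Add the token to one of the existing landmarks, on either side.
        for i, (kind, landmark) in enumerate(rule):
            out.append(rule[:i] + [(kind, [tok] + landmark)] + rule[i + 1:])
            out.append(rule[:i] + [(kind, landmark + [tok])] + rule[i + 1:])
    return out


start = [("SkipTo", ["SYMBOL"])]
candidates = refinements(start, ["be", "speaker"])
```

Among the generated candidates are the two refinements shown on the slide, SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL).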

16
Refining a Rule
  • After refining a rule, the best candidate rule is
    chosen and is determined to be either perfect or
    imperfect
  • The best candidate rule has the greatest number
    of matches on the remaining training documents
  • Early and failed matches are preferred over late
    matches
  • If the best candidate is perfect, it is returned;
    otherwise it is refined again

17
Keeping a Rule
  • We want to keep rules that have perfect accuracy
    on the training documents
  • No negative matches, where the rule being
    evaluated misplaces a tag in a training document
  • No false positive matches, where the rule places
    a tag for a data item in some training document
    where that data item does not exist
  • When a rule keeps being refined without
    becoming perfect, it eventually reaches a limit
    and is returned as is
  • The rule in this case is probably not very useful
  • This case is infrequent

18
General overview of our improvements
  • Minimum Refinements
  • Rule Score
  • Refinement Window
  • Wildcards
  • In the upcoming examples, we often explain how
    each of these improvements is useful in finding a
    begin tag for an ontology element; the usefulness
    for end tags is similar

19
Minimum Refinements
  • Forces rules to be refined some minimum number of
    times
  • We typically use a minimum number of 5

20
Minimum Refinements Example
  • Consider the rule SkipTo(George)
  • Suppose this rule is perfect
  • In general, this rule would be very ineffective
    at finding the speaker's first name
  • We would force this rule to be refined further so
    that it might ultimately have a greater coverage
    over all documents and reflect the structure of
    the domain

21
Rule Score
  • Utilizes an evaluation set of documents
  • Decides whether forward rules or backward rules
    are better for a particular tag based on their
    performance on the evaluation set

22
Rule Score Example
  • What should we do when forward and backward rules
    disagree on the location of a tag?
  • We test the forward and backward rules on a set
    of evaluation documents that were not used during
    the training
  • If the forward rules have a better score on the
    evaluation set, they are stored as the rules for
    placing that tag
  • Requires additional marked-up documents
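
The choice between forward and backward rules can be sketched as a comparison of scores on the evaluation set. This is a sketch under assumptions: the score used here is simply the fraction of evaluation documents where the tag lands on the hand-marked position, ties are given to the forward rules, and the function names are hypothetical.

```python
def score(predict, eval_docs):
    """Fraction of evaluation documents where `predict` returns the hand-marked position."""
    correct = sum(1 for tokens, gold in eval_docs if predict(tokens) == gold)
    return correct / len(eval_docs)


def choose_rules(forward, backward, eval_docs):
    """Keep whichever rules score higher on the evaluation set (ties go to forward)."""
    if score(forward, eval_docs) >= score(backward, eval_docs):
        return "forward"
    return "backward"


# Each evaluation document is (tokens, hand-marked tag position).
eval_docs = [(["a", "b", "c"], 1), (["x", "y"], 1), (["q"], 0)]
always_one = lambda tokens: 1   # stand-in for the learned forward rules
always_zero = lambda tokens: 0  # stand-in for the learned backward rules
kept = choose_rules(always_one, always_zero, eval_docs)
```

Because the evaluation documents are held out from training, this comparison needs extra marked-up documents, as the slide notes.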

23
Refinement Window
  • Only consider the closest n tokens to a tag when
    refining a rule
  • We typically use n = 10

24
Refinement Window Example
  • Consider the tag TalkTitle
  • Its ontological parent is Talk, the entire talk
    announcement
  • Without a refinement window, many irrelevant
    tokens would be considered when learning rules
    for the title
  • At worst, some irrelevant tokens would actually
    be used in a rule
  • Such a rule would not generalize well
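
A refinement window amounts to slicing the token list around the tag position before picking refinement candidates. The sketch below assumes a forward rule for a begin tag, so only the n tokens immediately before the tag are kept; `window_candidates` is a hypothetical name.

```python
def window_candidates(tokens, tag_pos, n=10):
    """Return the (at most n) tokens immediately preceding position `tag_pos`."""
    return tokens[max(0, tag_pos - n):tag_pos]


tokens = [f"t{i}" for i in range(20)]
near = window_candidates(tokens, 15)
```

With the window in place, tokens from the far end of the announcement never become candidates for a TalkTitle rule, even though the parent segment is the whole document.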

25
Wildcards
  • Both domain-dependent and domain-independent
    wildcards can be used in place of tokens
  • Allow us to better generalize a document's
    structure
  • Examples are MONTH, NUMBER, HTML_TAG, etc.

26
Wildcards Example
  • Consider the tags TalkDateMonth and
    TalkDateDayOfWeek
  • We might start with the rule
    SkipUntil(INITIAL_CAP_WORD) for finding the month,
    but this rule would match the day of the week, as
    well
  • By virtue of the wildcard MONTH, we can use the
    rule SkipUntil(MONTH) to accurately locate the
    month
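
Wildcard matching can be sketched as mapping each token to the set of classes it belongs to. The class tests below are assumptions made for illustration (the slides name the wildcards but do not define them), and the helper names are hypothetical.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def wildcards(token):
    """Return the set of wildcard classes that `token` belongs to."""
    classes = set()
    if token in MONTHS:
        classes.add("MONTH")
    if token.isdigit():
        classes.add("NUMBER")
    if token.startswith("<") and token.endswith(">"):
        classes.add("HTML_TAG")
    if token[:1].isupper() and token[1:].islower():
        classes.add("INITIAL_CAP_WORD")
    return classes


def matches(pattern, token):
    """A landmark element matches a token literally or through one of its wildcard classes."""
    return pattern == token or pattern in wildcards(token)
```

Both `Thursday` and `March` match INITIAL_CAP_WORD, but only `March` matches MONTH, which is why SkipUntil(MONTH) separates the month from the day of the week.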

27
Marked Up by a Human
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont#"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#">
<Talk>
  <Title>New Developments in Still Image and Video Compression</Title>
  <X-URI></X-URI>
  <DAML-URI>./daml/trainfile1.daml</DAML-URI>
  <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract>
  <BeginTime rdf:parseType="Resource">
    <time:Year>2002</time:Year>
    <time:Month>March</time:Month>
    <time:Day>27</time:Day>
    <time:DayOfWeek>Thursday</time:DayOfWeek>
    <time:Hour>3</time:Hour>
    <time:Minute>30</time:Minute>
    <time:Second>00</time:Second>
  </BeginTime>
  <EndTime rdf:parseType="Resource">
    <time:Year>2002</time:Year>
    <time:Month>March</time:Month>
    <time:Day>27</time:Day>
    <time:DayOfWeek>Thursday</time:DayOfWeek>
    <time:Hour>4</time:Hour>
    <time:Minute>30</time:Minute>
    <time:Second>00</time:Second>
  </EndTime>
  <Speaker rdf:parseType="Resource">
    <Name>Hans L. Cycon</Name>
    <Organization>FHTW Berlin, University of Applied Sciences</Organization>
    <Email>hcycon_at_fhtw-berlin.de</Email>
  </Speaker>
</Talk>
</rdf:RDF>
28
Marked Up by our Basic System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont#"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#">
<Talk>
  <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title>
  <X-URI></X-URI>
  <DAML-URI>./daml/trainfile1.daml</DAML-URI>
  <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. <Abstract>
  <BeginTime rdf:parseType="Resource">
    <time:Year></time:Year>
    <time:Month></time:Month>
    <time:Day></time:Day>
    <time:DayOfWeek></time:DayOfWeek>
    <time:Hour></time:Hour>
    <time:Minute>00</time:Minute>
    <time:Second>00</time:Second>
  </BeginTime>
  <EndTime rdf:parseType="Resource">
    <time:Year></time:Year>
    <time:Month></time:Month>
    <time:Day></time:Day>
    <time:DayOfWeek></time:DayOfWeek>
    <time:Hour>3:30-4</time:Hour>
    <time:Minute>30-4:30</time:Minute>
    <time:Second>00</time:Second>
  </EndTime>
  <Speaker rdf:parseType="Resource">
    <Name>Hans L </Name>
    <Organization></Organization>
    <Email>hcycon_at_fhtw-berlin.de</Email>
  </Speaker>
</Talk>
</rdf:RDF>
29
Marked Up by our Full System
<?xml version="1.0" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:daml="http://www.daml.org/2000/10/daml-ont#"
  xmlns="http://daml.umbc.edu/ontologies/talk-ont#"
  xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#">
<Talk>
  <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title>
  <X-URI></X-URI>
  <DAML-URI>./daml/trainfile1.daml</DAML-URI>
  <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract>
  <BeginTime rdf:parseType="Resource">
    <time:Year>2002</time:Year>
    <time:Month>March</time:Month>
    <time:Day>27</time:Day>
    <time:DayOfWeek>Thursday</time:DayOfWeek>
    <time:Hour>3</time:Hour>
    <time:Minute>30</time:Minute>
    <time:Second>00</time:Second>
  </BeginTime>
  <EndTime rdf:parseType="Resource">
    <time:Year>2002</time:Year>
    <time:Month>March</time:Month>
    <time:Day>27</time:Day>
    <time:DayOfWeek>Thursday</time:DayOfWeek>
    <time:Hour>4</time:Hour>
    <time:Minute>30</time:Minute>
    <time:Second>00</time:Second>
  </EndTime>
  <Speaker rdf:parseType="Resource">
    <Name>Hans L. Cycon</Name>
    <Organization>Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Organization>
    <Email>hcycon_at_fhtw-berlin.de</Email>
  </Speaker>
</Talk>
</rdf:RDF>
30
Experimental Setup
  • 3 Domains: UC Berkeley, UCSB, and ITTALKS
  • 6 Systems: Basic, Min Refine, Score, Refinement
    Window, Wildcards, and Full
  • 10 Partitions, each with a 20/20/20 split into
    Training/Evaluation/Testing Sets
  • Recall: the number of correctly extracted data
    items divided by the total number of data items
    in the documents
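
The recall measure defined above can be written out directly. This is a minimal sketch: representing a document's data items as a tag-to-value dictionary and requiring an exact string match are assumptions, and the sample values are illustrative rather than results from the experiments.

```python
def recall(extracted, gold):
    """Correctly extracted data items divided by the total number of data items."""
    correct = sum(1 for tag, value in gold.items() if extracted.get(tag) == value)
    return correct / len(gold)


# Illustrative values only, loosely modelled on the marked-up example slides.
gold = {"Name": "Hans L. Cycon", "Month": "March", "Hour": "3", "Minute": "30"}
extracted = {"Name": "Hans L", "Month": "March", "Hour": "3", "Minute": "30"}
r = recall(extracted, gold)
```

Here three of the four items match exactly, so `r` is 0.75; the truncated name counts as a miss under the exact-match assumption.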

31
Average Recall Over All Tags
UC Berkeley Domain
UCSB Domain
32
Performance Improvements on Individual Tags
UC Berkeley Domain
33
Performance Improvements on Individual Tags
UCSB Domain
34
Conclusion
  • Our system extends the state-of-the-art algorithm
    STALKER
  • Our system performs DAML markup on talk
    announcements
  • It can trivially be extended to different markup
    languages and different domains
  • A working implementation of everything described
    here exists!

35
Future Considerations
  • Active Learning: select training documents that
    yield rules with the greatest possible coverage
  • Cardinality Issues: ontology elements that appear
    in lists
  • Linguistic Information: use a system like
    Aerotext to preprocess the documents
  • Google API: check to see if our tag placement
    makes sense

36
Acknowledgements
  • This work was supported in part by the Defense
    Advanced Research Projects Agency under contract
    F30602-00-2-0591 as part of the DAML program
    (http://daml.org/)
  • It was also supported by a Northrop Grumman
    Fellowship
