Discovering Semantic Sibling Groups from Web Documents with XTREEM-SG

1
Discovering Semantic Sibling Groups from Web
Documents with XTREEM-SG
  • Marko Brunzel, Myra Spiliopoulou
  • marko.brunzel_at_dfki.uni-kl.de
  • Otto-von-Guericke-University Magdeburg and
  • German Research Centre for Artificial
    Intelligence (DFKI GmbH)
  • EKAW 2006, 05.10.2006

2
Content
  • Introduction
  • Group-By-Path
  • XTREEM-SG Procedure
  • Experimental Results
  • Conclusion and Outlook

3
Context
  • Semantic Web
  • Ontologies
  • Knowledge Acquisition Bottleneck
  • Ontology Learning

4
Semi-Structured Data Source
  • Structure
  • Dictionaries, glossaries, database schemas
  • Rarely available
  • Unstructured Text
  • 50 methods
  • Shared conceptualization in written text?
  • Semi-Structured Text
  • Web Documents (X)HTML

5
WWW as Data Source
  • Domain Document Collection
  • As input → manually crafted → laborious
  • Subjective - limited - confidential
  • WWW Documents
  • As part of the mining process
  • Shared - all topics - freely available
  • Added value of mark-up

6
Semi-Structured Web Content
  • HTML Text Mark-up
  • XHTML
  • Enforces Tree Structure
  • HTML to XHTML Conversion
  • Mark-Up is used for
  • Tables / Lists / Headings / (Named) Links /
    Highlightings

→ XTREEM - Xhtml TREE Mining
7
Hints on Semantic Relatedness - Examples
Headings - distributed over paragraphs
<h2>Wordnet</h2><p>Was developed ...</p>
<h2>Germanet</h2><p>Analogous ...</p>
Highlighted Keywords - with regular text between
<p>... there are different important standards for building the <strong>Semantic Web</strong>. ... is <strong>RDF</strong>. <strong>RDFS</strong> adds ... whereas <strong>OWL</strong> is ...</p>
9
XHTML-Tree
<html>
  <head> ... </head>
  <body>
    <h1>Lexical Resources</h1>
    <p></p>
    <h2>Wordnet</h2>
    <p>Was developed ...</p>
    <h2>Germanet</h2>
    <p>Analogous to Wordnet for the English ...</p>
    ...
  </body>
</html>
10
XHTML-Tree Paths
<html>
<html><head>
<html><head> ...
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed ...</p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English ...</p>
<html><body> ...
<html></body>
</html>
11
XHTML-Tree Paths Text Elements
<html>
<html><head>
<html><head> ...
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed ...</p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English ...</p>
<html><body> ...
<html></body>
</html>
12
XHTML-Tree Paths Text Elements - Filtered
<html>
<html><head>
<html><head> ...
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed ...</p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English ...</p>
<html><body> ...
<html></body>
</html>
13
Group-By-Path
<html>
<html><head>
<html><head> ...
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed ...</p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English ...</p>
<html><body> ...
<html></body>
</html>
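The grouping step on these slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class `PathCollector`, the function `group_by_path`, and the sample document `DOC` are hypothetical names, and well-formed XHTML is assumed so that closing tags always match the innermost open tag. Each non-empty text span is paired with the sequence of tags leading to it, and spans that share the same path are collected into one group.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Pair every non-empty text span with its tag path (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags, root first
        self.spans = []      # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumption: well-formed XHTML, so the closing tag matches the top.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # filtering: paths without text are dropped
            path = "".join(f"<{t}>" for t in self.open_tags)
            self.spans.append((path, text))


def group_by_path(xhtml):
    """Group text spans that share an identical tag path."""
    collector = PathCollector()
    collector.feed(xhtml)
    collector.close()
    groups = defaultdict(list)
    for path, text in collector.spans:
        groups[path].append(text)
    return dict(groups)


# Sample document modelled on the slide example (elided text left out).
DOC = """<html><head></head><body>
<h1>Lexical Resources</h1><p></p>
<h2>Wordnet</h2><p>Was developed</p>
<h2>Germanet</h2><p>Analogous to Wordnet for the English</p>
</body></html>"""

groups = group_by_path(DOC)
print(groups["<html><body><h2>"])  # ['Wordnet', 'Germanet']
```

The two `<h2>` headings end up in one group even though paragraphs sit between them, which is the sibling candidate set the following slides work with.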
15
Sibling Sets
  • Sibling Text-Spans - Item Sets
  • Wordnet, Germanet
  • Errors
  • Large amounts of Web documents → redundancy →
    processing → reduce sporadically occurring siblings
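One way to picture how redundancy separates stable siblings from sporadic ones is to count how often two terms appear together in a sibling set across many documents. The sketch below is an assumption-laden illustration rather than the XTREEM-SG procedure itself: `sibling_pair_counts`, the lower-casing step, the sample `SETS`, and the cut-off of 2 are all hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def sibling_pair_counts(sibling_sets):
    """Count in how many sibling sets each unordered term pair co-occurs."""
    counts = Counter()
    for siblings in sibling_sets:
        # Normalise case and sort so each pair has one canonical form.
        terms = sorted({s.lower() for s in siblings})
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical sibling sets harvested from three different documents.
SETS = [
    ["Wordnet", "Germanet"],
    ["WordNet", "GermaNet", "Cyc"],
    ["Wordnet", "Contact"],  # a sporadic, erroneous sibling
]

counts = sibling_pair_counts(SETS)
frequent = {pair for pair, n in counts.items() if n >= 2}
print(frequent)  # {('germanet', 'wordnet')}
```

The erroneous pair with "contact" occurs only once and falls below the cut-off, while the pair seen in two documents survives.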

16
XTREEM-SG Data Flow Diagram
17
Resulting Clusters
Cluster labels: Publication Parts, TopicMap, Metadata, Ontologies, Authors, Webservice, RDF/OWL
  • topic association scope role associations subject
    topics resource occurrence topic_map
  • abstract keywords title references acknowledgements
    speaker introduction contact conclusion authors
  • thesaurus taxonomy ontology metadata controlled_vocabulary
    faceted_classification semantic_web topic_maps classification rdf
  • ian_horrocks dieter_fensel stefan_decker steffen_staab
    frank_van_harmelen peter_f__patel_schneider deborah_l__mcguinness
    raphael_volz brian_mcbride sean_bechhofer
  • xml rdf html owl daml_oil semantic_web rdf_schema
    xml_schema http w3c
  • class property domain range subclassof mincardinality
    cardinality disjointwith string list
  • wordnet cyc opencyc sensus sumo semanticweb_org ontosaurus
    umls daml_oil shoe
18
Evaluation
  • Automatic vs. Manual
  • Gold standard ontologies
  • F-Measure on average sibling overlap
  • Set of sets / set of sets
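The slides name the measure but do not define it, so the following is only one plausible reading of "F-Measure on average sibling overlap", stated as an assumption: each gold-standard sibling set is matched with the cluster that gives the best F-measure, and these best scores are averaged. The function names and the sample sets are hypothetical, and the authors' actual definition may differ.

```python
def f_measure(found, reference):
    """Harmonic mean of precision and recall between two term sets."""
    found, reference = set(found), set(reference)
    common = len(found & reference)
    if common == 0:
        return 0.0
    precision = common / len(found)
    recall = common / len(reference)
    return 2 * precision * recall / (precision + recall)


def average_best_f(clusters, gold_sets):
    """Assumed reading: average, over gold sets, of the best F-measure."""
    if not gold_sets:
        return 0.0
    best = [
        max((f_measure(c, g) for c in clusters), default=0.0)
        for g in gold_sets
    ]
    return sum(best) / len(best)


# Hypothetical gold-standard sibling sets and learned clusters.
GOLD = [{"wordnet", "germanet"}, {"rdf", "owl", "rdfs"}]
CLUSTERS = [{"wordnet", "germanet", "cyc"}, {"rdf", "owl"}]

score = average_best_f(CLUSTERS, GOLD)
print(round(score, 2))
```

Both sides of the comparison are sets of sets, which matches the "set of sets / set of sets" bullet: the learned clusters on one side, the sibling sets of the gold-standard ontology on the other.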

19
Variations on the Preprocessing method
  • Group-By-Path (GBP)
  • traditional Bag-Of-Words (BOW) vector space model
  • use of Mark-Up only (MU)

20
Evaluation Results I
21
  • FMASO
  • More intuitive (since often seen before)
  • precision and recall

22
Evaluation Results PR
23
Variations on the Web Document Collection
24
Evaluation Results II
25
Variations on the required support
  • Threshold on the required support of terms in
    the Web Document Collection
  • As the threshold rises, weakly supported terms
    are increasingly ignored
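A support threshold of this kind can be sketched as follows. The slides do not say how support is counted, so the sketch assumes it is the number of documents in which a term occurs. `filter_by_support` and the sample `DOCS` are hypothetical, and dropping sets that shrink below two terms is a further assumption.

```python
from collections import Counter


def filter_by_support(sibling_sets_per_doc, min_support):
    """Drop terms that occur in fewer than min_support documents."""
    support = Counter()
    for doc_sets in sibling_sets_per_doc:
        # Count each term at most once per document.
        support.update({term for s in doc_sets for term in s})

    kept = []
    for doc_sets in sibling_sets_per_doc:
        for s in doc_sets:
            reduced = [t for t in s if support[t] >= min_support]
            if len(reduced) >= 2:  # a sibling set needs at least two terms
                kept.append(reduced)
    return kept


# Hypothetical collection: one sibling set per document.
DOCS = [
    [["rdf", "owl", "xml"]],
    [["rdf", "owl", "daml_oil"]],
    [["rdf", "xml", "home"]],
]

print(filter_by_support(DOCS, 2))
# [['rdf', 'owl', 'xml'], ['rdf', 'owl'], ['rdf', 'xml']]
print(filter_by_support(DOCS, 3))
# []
```

Raising the threshold from 2 to 3 leaves only "rdf" with enough support, so no sibling set with two or more terms remains. This illustrates the trade-off between removing noise and losing coverage that the variation on required support explores.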

26
Evaluation Results III
27
Findings
  • Improvement of FMASO from 14.18 to 21.47
  • Cluster Characteristics
  • The conventional relations between terms in a
    cluster vary
  • XTREEM-SG Sibling Relationship
  • Explanation
  • Authors group texts at the same level into item
    lists, headlines, etc., usually motivated by the
    intention to present sibling concepts in an
    intuitive way

29
Conclusion and Outlook
  • Discovery of
  • Siblings (co-hyponyms, co-meronyms, ...)
  • Single-word and multi-word term expressions
  • Language and domain independent
  • Web scale
  • Large scale application / integration
  • number of clusters to be generated/inspected
  • Corresponding super-concept

30
  • Thank you for your attention!
  • Questions?