Title: Discovering Semantic Sibling Groups from Web Documents with XTREEM-SG
1Discovering Semantic Sibling Groups from Web
Documents with XTREEM-SG
- Marko Brunzel, Myra Spiliopoulou
- marko.brunzel_at_dfki.uni-kl.de
- Otto-von-Guericke-University Magdeburg and
- German Research Centre for Artificial
Intelligence (DFKI GmbH) - EKAW 2006, 05.10.2006
2Content
- Introduction
- Group-By-Path
- XTREEM-SG Procedure
- Experimental Results
- Conclusion and Outlook
3Context
- Semantic Web
- Ontologies
- Knowledge Acquisition Bottleneck
- Ontology Learning
4Semi-Structured Data Source
- Structure
- Dictionaries, glossaries, database schemas
- Rarely available
- Unstructured Text
- 50 methods
- Shared conceptualization in written text?
- Semi-Structured Text
- Web Documents (X)HTML
5WWW as Data Source
- Domain Document Collection
- As Input → manually crafted → laborious
- Subjective - limited - confidential
- WWW Documents
- As part of the mining process
- Shared - all topics - freely available
- Added value of mark-up
6Semi-Structured Web Content
- HTML Text Mark-up
- XHTML
- Enforces Tree Structure
- HTML to XHTML Conversion
- Mark-Up is used for
- Tables / Lists / Headings / (Named) Links /
Highlightings
→ XTREEM - Xhtml TREE Mining
7Hints on Semantic Relatedness - Examples
Headings - distributed over paragraphs
<h2>Wordnet</h2><p>Was developed </p>
<h2>Germanet</h2><p>Analogous </p>
Highlighted Keywords - with regular text between
<p> there are different important standards
for building the <strong>Semantic Web</strong>.
is <strong>RDF</strong>. <strong>RDFS </strong>
adds whereas <strong>OWL </strong> is </p>
9XHTML-Tree
<html>
  <head>
  </head>
  <body>
    <h1>Lexical Resources</h1>
    <p></p>
    <h2>Wordnet</h2>
    <p>Was developed </p>
    <h2>Germanet</h2>
    <p>Analogous to Wordnet for the English </p>
  </body>
</html>
10XHTML-Tree Paths
<html>
  <head>
  </head>
  <body>
    <h1>Lexical Resources</h1>
    <p></p>
    <h2>Wordnet</h2>
    <p>Was developed </p>
    <h2>Germanet</h2>
    <p>Analogous to Wordnet for the English </p>
  </body>
</html>
Tag paths:
<html>
<html><head>
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed </p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English </p>
<html></body>
</html>
11XHTML-Tree Paths Text Elements
<html>
<html><head>
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed </p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English </p>
<html></body>
</html>
12XHTML-Tree Paths Text Elements - Filtered
<html>
<html><head>
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed </p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English </p>
<html></body>
</html>
13 Group-By-Path
<html>
<html><head>
<html></head>
<html><body>
<html><body><h1>Lexical Resources</h1>
<html><body><p></p>
<html><body><h2>Wordnet</h2>
<html><body><p>Was developed </p>
<html><body><h2>Germanet</h2>
<html><body><p>Analogous to Wordnet for the English </p>
<html></body>
</html>
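The Group-By-Path steps on the preceding slides (tag paths, text elements, filtering, grouping) can be sketched roughly as follows. This is a minimal illustration, not the XTREEM-SG implementation: the function names `text_paths` and `group_by_path` are hypothetical, the path notation `html/body/h2` is a simplification, and the rule "keep only groups with at least two text spans" is an assumption made here so that each result is a set of siblings.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example page from the slides, written as well-formed XHTML.
XHTML = """<html>
  <head></head>
  <body>
    <h1>Lexical Resources</h1>
    <p></p>
    <h2>Wordnet</h2>
    <p>Was developed</p>
    <h2>Germanet</h2>
    <p>Analogous to Wordnet for the English</p>
  </body>
</html>"""


def text_paths(xhtml):
    """Yield (tag_path, text) for every element, e.g. ('html/body/h2', 'Wordnet').

    Only the text directly inside an element is used; tail text that
    follows a child element is ignored in this sketch.
    """
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        yield path, (node.text or "").strip()
        for child in node:
            yield from walk(child, path)

    yield from walk(ET.fromstring(xhtml), "")


def group_by_path(xhtml):
    """Group non-empty text spans that share the same tag path."""
    groups = defaultdict(list)
    for path, text in text_paths(xhtml):
        if text:  # filter step: drop elements that carry no text
            groups[path].append(text)
    # Assumption: a group needs at least two members to count as siblings.
    return {path: texts for path, texts in groups.items() if len(texts) >= 2}


sibling_sets = group_by_path(XHTML)
for path, texts in sibling_sets.items():
    print(path, texts)
```

On this page the `h2` path yields the pair Wordnet and Germanet. The two paragraph texts also share a path and are grouped, which illustrates why a single document can produce unwanted sibling sets.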
15Sibling Sets
- Sibling Text-Spans - Item Sets
- Wordnet, Germanet
- Errors
- Large Amounts of Web Documents → Redundancy →
Processing → reduce sporadically occurring Siblings
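One way to picture how redundancy helps is to count how often two terms appear together in sibling sets collected from many documents, and to disregard pairs that occur only sporadically. The sketch below is an illustration under that assumption only; the slides do not state how XTREEM-SG does this processing, and the names `pair_counts`, `frequent_pairs` and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(sibling_sets):
    """Count in how many sibling sets each unordered term pair co-occurs."""
    counts = Counter()
    for siblings in sibling_sets:
        terms = sorted({t.lower() for t in siblings})  # normalise case, dedupe
        counts.update(combinations(terms, 2))
    return counts


def frequent_pairs(sibling_sets, min_count=2):
    """Keep only pairs seen in at least min_count sibling sets."""
    return {pair: n for pair, n in pair_counts(sibling_sets).items()
            if n >= min_count}


# Hypothetical sibling sets gathered from three different pages.
collected = [
    ["Wordnet", "Germanet"],
    ["WordNet", "GermaNet", "Cyc"],
    ["Home", "Wordnet"],
]
repeated = frequent_pairs(collected)
print(repeated)
```

Here only the Wordnet and Germanet pair is seen more than once; pairs such as Home and Wordnet, which come from a single page, are dropped.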
16XTREEM-SG Data Flow Diagram
17Resulting Clusters
- Cluster labels: Publication Parts, TopicMap, Metadata, Ontologies, Authors, Webservice, RDF/OWL
- topic association scope role associations subject topics resource occurrence topic_map
- abstract keywords title references acknowledgements speaker introduction contact conclusion authors
- thesaurus taxonomy ontology metadata controlled_vocabulary faceted_classification semantic_web topic_maps classification rdf
- ian_horrocks dieter_fensel stefan_decker steffen_staab frank_van_harmelen peter_f__patel_schneider deborah_l__mcguinness raphael_volz brian_mcbride sean_bechhofer
- xml rdf html owl daml_oil semantic_web rdf_schema xml_schema http w3c
- class property domain range subclassof mincardinality cardinality disjointwith string list
- wordnet cyc opencyc sensus sumo semanticweb_org ontosaurus umls daml_oil shoe
18Evaluation
- Automatic vs. Manual
- Gold standard ontologies
- F-Measure on average sibling overlap
- Set of sets / set of sets
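The slides do not spell out how the F-measure on average sibling overlap is computed, so the following is only one plausible reading, stated as an assumption: overlap between two sibling sets is taken as intersection over union, each set is matched to its best counterpart on the other side, the best overlaps are averaged in both directions, and the two averages are combined by the harmonic mean. All function names are hypothetical.

```python
def overlap(a, b):
    """Overlap of two term sets as intersection over union (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_best_overlap(source, target):
    """Average, over the source sets, of the best overlap with any target set."""
    if not source or not target:
        return 0.0
    return sum(max(overlap(s, t) for t in target) for s in source) / len(source)


def f_measure_on_overlap(clusters, reference):
    """Harmonic mean of the two directed average overlaps."""
    p = average_best_overlap(clusters, reference)   # clusters vs. gold standard
    r = average_best_overlap(reference, clusters)   # gold standard vs. clusters
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical data: two generated clusters and two gold-standard sibling sets.
clusters = [{"wordnet", "germanet", "cyc"}, {"title", "abstract"}]
reference = [{"wordnet", "germanet"}, {"title", "abstract", "authors"}]
score = f_measure_on_overlap(clusters, reference)
print(round(score, 4))
```

Both inputs are sets of sets, which matches the "set of sets / set of sets" comparison named on the slide.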
19Variations on the Preprocessing method
- Group-By-Path (GBP)
- traditional Bag-Of-Words (BOW) vector space model
- use of Mark-Up only (MU)
20Evaluation Results I
21- FMASO
- More intuitive (since often seen before)
- precision and recall
22Evaluation Results PR
23Variations on the Web Document Collection
24Evaluation Results II
25Variations on the required support
- Threshold on the required support of terms in the Web Document Collection
- As the threshold rises, weakly supported terms are increasingly ignored
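A support threshold of this kind might be applied as in the sketch below. It assumes that support means the number of documents in which a term occurs, which the slides do not define precisely; `term_support`, `filter_sibling_sets` and `min_support` are hypothetical names.

```python
from collections import Counter


def term_support(documents):
    """Number of documents in which each term occurs (assumed definition)."""
    support = Counter()
    for terms in documents:
        support.update(set(terms))  # count each term once per document
    return support


def filter_sibling_sets(sibling_sets, support, min_support):
    """Drop weakly supported terms; discard sets left with fewer than two terms."""
    filtered = []
    for siblings in sibling_sets:
        kept = [t for t in siblings if support[t] >= min_support]
        if len(kept) >= 2:
            filtered.append(kept)
    return filtered


# Hypothetical collection: each inner list holds the terms of one document.
documents = [["rdf", "owl", "xml"], ["rdf", "owl"], ["rdf", "foo"]]
sibling_sets = [["rdf", "owl", "xml"], ["rdf", "foo"]]

support = term_support(documents)
strict = filter_sibling_sets(sibling_sets, support, min_support=2)
print(strict)
```

Raising `min_support` removes more terms and, as a side effect, more sibling sets, so the threshold trades coverage against noise.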
26Evaluation Results III
27Findings
- Improvement of FMASO from 14.18 to 21.47
- Cluster Characteristics
- Conventional relations between the terms in a cluster
vary - XTREEM-SG Sibling Relationship
- Explanation
- Authors group texts at the same level into item
lists, headlines, etc., usually motivated by the
intention to present sibling concepts in an
intuitive way
29Conclusion and Outlook
- Discovery of
- Siblings (co-hyponyms, co-meronyms, ...)
- Single-word and multi-word term expressions
- Language and domain independent
- Web scale
- Large scale application / integration
- number of clusters to be generated/inspected
- Corresponding super-concept
30- Thank you for your attention!
- Questions?