Title: Bettina Berendt
1Semantic Web Mining today: semantics for and from blogs
- Bettina Berendt
- Humboldt-Universität zu Berlin www.berendt.de
- with many co-authors
- with Roberto Navigli, Università La Sapienza, Roma, Italy
2Agenda
- Motivation and overview
- Why the Web? Why blogs?
- Semantic Web Mining
- Finding your way through blogspace
- Using semantics for cross-domain blog analysis
4The goal
5Make humanity's knowledge effectively accessible to as many people as possible.
6Macrocosm: World Wide Web
7Microcosm: blogosphere
8Concrete goals (examples for part 2 of this talk)
- Classification: "This blog deals with content from nutrition and gastronomy." -> suggestions of meta tags for the blog -> support for blog search engines
- Recommendations with explanation: "If you found this blog interesting, you may also be interested in blog ..., because ..."
9The potential
10A great deal of knowledge, accessible to humans.
11The problems
12A great deal of knowledge, accessible to humans.
13Web Mining
14Forms
- Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [1]
- Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
Web mining areas: Web content mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press.
15Web mining: examples
Web mining areas:
- Web content mining
- Web structure mining
- Web usage mining
16The main problem of Web mining
17Syntax in, Syntax out.
19Semi-automatic tagging: tag recommendation based on syntax and existing labels
20Tagyu also works (with limitations) for resources in other languages
21Does this really work? (1)
22Does this really work? (2)
23The Wikipedia 300 Component Model, generated with discrete PCA (cosco.hiit.fi/search/H300.html/topic_list)
Common phrases of selected components:
- process water air pressure gas body of water natural gas high pressure hot water fresh water
- Mark Gospel Matthew Luke Rose Virgin Virgin Mary Gospel of John Gospel of Mark Gospel of Luke
- part text Britannica entry Encyclopedia Britannica Encyclopædia Britannica Encyclopaedia Britannica domain Encyclopædia Britannica public domain Encyclopædia Britannica public domain text
- property theorem elements proof subset axioms proposition natural numbers fundamental theorem mathematical logic
- Dove AMD Dove Streptopelia imperial crown Imperial army imperial court imperial family Collared Dove Streptopelia Imperial Russia
- side feet long time long period right side left side long distances different types short distance opposite side
- David bill Bob Jim Allen Dave Current stars former members Bill Clinton former President
- magazine newspaper political parties public domain text public opinion political career public schools own right political life public service
- way things boy cat long time same way same thing only way different ways good thing
- problems zero sum digits natural numbers positive integer mathematical analysis decimal digits natural logarithm
- population density couples races total area makeup Demographics median age income density housing units
- Torres Iraqi KASUMI KHAZAD Khufu Granada Spa Fra General information General Public License General Bernardo New Granada Torres Strait
- love Me Rolling Stones love songs Rolling Stone magazine Love Me Fall in Love Meet Me love story professional wrestler
In summary, weaknesses of purely statistical approaches: Interpretation of the results? Existence of results? Correctness? Inferences?
24Semantic Web
25The Semantic Web
- "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in co-operation." [1]
- "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming." [2]
[1] Berners-Lee, T., Hendler, J., Lassila, O. (2001). The Semantic Web. Scientific American, May.
[2] http://www.w3.org/2001/sw/
[3] Berners-Lee, T. (2000). Semantic Web. XML2000. www.w3.org/2000/Talks/1206-xml2k-tbl/
26Semantic Web: example
Category structure:
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
    ....
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    ...
    <narrow r:resource="Top/Arts/Artists"/>
    <symbolic r:resource="Typography:Top/Computers/Fonts"/>
  </Topic>
  ....
</RDF>
Resources:
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  ...
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3...ca/./file.html"/>
  </Topic>
  <ExternalPage about="http://wwwca/file.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://foo.asdfsa.."/>
  </Topic>
</RDF>
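A minimal sketch of how a category structure like the one on this slide could be read by a program. It assumes the namespace prefixes `r` and `d` as in the directory.mozilla.org example and works on a shortened inline copy; the function name `read_topics` and the prefix `m` for the default namespace are hypothetical, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the category-structure example (elided parts left out).
RDF_SNIPPET = """
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    <narrow r:resource="Top/Arts/Artists"/>
  </Topic>
</RDF>
""".strip()

# Prefix -> namespace URI; "m" is a made-up prefix for the default namespace.
NS = {
    "r": "http://www.w3.org/TR/RDF/",
    "d": "http://purl.org/dc/elements/1.0/",
    "m": "http://directory.mozilla.org/rdf",
}


def read_topics(xml_text):
    """Return {topic id: {"title": ..., "narrow": [child topic ids]}}."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.findall("m:Topic", NS):
        # Namespaced attributes are addressed as "{uri}localname".
        topic_id = topic.get("{%s}id" % NS["r"])
        title = topic.findtext("d:Title", default="", namespaces=NS)
        narrow = [
            n.get("{%s}resource" % NS["r"]) for n in topic.findall("m:narrow", NS)
        ]
        topics[topic_id] = {"title": title, "narrow": narrow}
    return topics


topics = read_topics(RDF_SNIPPET)
```

Here `topics["Top/Arts"]["narrow"]` lists the subcategories of Arts, which is the hierarchical information a structured search can exploit.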
27Why the Semantic Web? Example: structured search
Metadata according to Dublin Core (DC)
28Semantic search, example 2: metadata according to DC and a domain ontology
29The main problem of the Semantic Web
30Who is supposed to do all this?
31The approach
32Web mining: machine learning extracts knowledge from data
The Semantic Web makes knowledge machine-understandable
37Agenda
- Motivation and overview
- Why the Web? Why blogs?
- Semantic Web Mining
- Finding your way through blogspace
- Using semantics for cross-domain blog analysis
38Context
- Semi-automatic tagging
- Blog recommendation
- Semantics-enhanced text mining, word sense disambiguation
- Exploratory analyses of blog contents
- Computational Approaches to Analyzing Weblogs, AAAI 2006 Spring Symposium
- Read more in the paper:
- http://www2.wiwi.hu-berlin.de/berendt/Papers/SS0603BerendtB.pdf
39Blog recommendation: collaborative and content-based filtering (www.iro.umontreal.ca/aimeur/publications/Workshop20.pdf)
40An example of exploratory blog analysis (in which a syntax-based approach is sufficient): the run-up to the 2004 US presidential election (Adamic & Glance, 2005)
41Our procedure
- Take a set of blog corpora (corpus = collection of blogs manually labelled as belonging to one topic)
- In all of the following analyses:
- what is a blog corpus about?
- to which other blog corpora is it related, and why?
- syntactic analysis: keyphrases
- semantic analysis I: domain labels
- semantic analysis II: structural semantic interconnections
42Data
43Sample data: 4 blog corpora
- Food and drink
- Health and medicine
- Law
- Weblogs about blogging
- Randomly sampled from the Yahoo! blog directory, 140-330 K words each
- Available at:
- http://www.wiwi.hu-berlin.de/berendt/Blogs/Sample20050917/
44Syntactic analysis
45What is a blog about? Term extraction
- Domain relevance (DR) and domain consensus (DC)
- Keyphrases: thresholds DR 0.35, DC 0.23 (values from previous experiments)
- t: term, D: corpus (here: blog corpus), b: a blog (here: as an element of a corpus Dk)
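The transcript does not preserve the DR and DC formulas, so the following is only a sketch under stated assumptions: domain relevance is taken as the term's relative frequency in one corpus divided by its highest relative frequency in any corpus, and domain consensus as the entropy of the term's distribution over the blogs of a corpus, normalised here to [0, 1]. Treating the two thresholds as lower bounds is likewise an assumption, and all function names and toy counts are hypothetical.

```python
import math
from collections import Counter


def domain_relevance(term, corpus_counts, k):
    """Assumed form: DR(t, D_k) = P(t | D_k) / max_j P(t | D_j)."""
    def p(counts):
        total = sum(counts.values())
        return counts[term] / total if total else 0.0

    best = max(p(c) for c in corpus_counts.values())
    return p(corpus_counts[k]) / best if best else 0.0


def domain_consensus(term, blog_counts):
    """Assumed form: entropy of the term's distribution over the blogs b of
    one corpus, divided by log(number of blogs) so the value lies in [0, 1]."""
    freqs = [c[term] for c in blog_counts if c[term] > 0]
    total = sum(freqs)
    if total == 0 or len(blog_counts) < 2:
        return 0.0
    entropy = -sum((f / total) * math.log(f / total) for f in freqs)
    return entropy / math.log(len(blog_counts))


def is_keyphrase(term, corpus_counts, k, blog_counts, dr_min=0.35, dc_min=0.23):
    """Keep a term if both scores reach their threshold (assumed lower bounds)."""
    return (domain_relevance(term, corpus_counts, k) >= dr_min
            and domain_consensus(term, blog_counts) >= dc_min)


# Toy data: one Counter of term frequencies per blog (hypothetical numbers).
food_blogs = [Counter({"cream cheese": 3, "blog": 1}),
              Counter({"cream cheese": 3, "blog": 2})]
law_blogs = [Counter({"law firm": 4, "blog": 2}),
             Counter({"law firm": 1, "blog": 2})]
corpora = {"food": sum(food_blogs, Counter()), "law": sum(law_blogs, Counter())}

dr_cheese = domain_relevance("cream cheese", corpora, "food")
dr_blog = domain_relevance("blog", corpora, "food")
dc_cheese = domain_consensus("cream cheese", food_blogs)
dc_law_firm = domain_consensus("law firm", law_blogs)
```

In this toy setting a term that occurs only in the food corpus and evenly across its blogs scores highest on both measures, while a term shared with the law corpus gets a lower DR.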
46What is shared by two blogs? Syntactic similarity: Jaccard coefficient
T(C): keyphrases / terminology of corpus C
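The Jaccard coefficient of two corpora is the size of the intersection of their keyphrase sets divided by the size of their union, |T(C1) ∩ T(C2)| / |T(C1) ∪ T(C2)|. A minimal sketch, with made-up keyphrase sets standing in for T(C):

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient of two keyphrase sets; 0.0 if both are empty."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical keyphrase sets T(C) for two blog corpora.
t_food = {"cream cheese", "olive oil", "health food"}
t_health = {"health food", "blood pressure", "olive oil"}

similarity = jaccard(t_food, t_health)  # 2 shared phrases out of 4 distinct ones
```

Because the measure only compares strings, two corpora that discuss the same topic with different vocabulary get a low score, which is what motivates the semantic analyses that follow.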
47Semantic analysis I: WordNet and WordNet domains
48WordNet
49Hierarchical knowledge: domain labels
50Domain label statistics show that the blog corpora have clear thematic foci
frequency of domain D in corpus C = no. of keyphrases in C with a sense that maps to D
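The frequency definition above can be sketched as a simple count. The mapping from keyphrases to domain labels below is a hypothetical stand-in for a lookup in WordNet and WordNet Domains; the function names are illustrative only.

```python
from collections import Counter

# Hypothetical lookup: keyphrase -> domain labels of its senses.
SENSE_DOMAINS = {
    "cream cheese": {"gastronomy"},
    "onion": {"gastronomy", "botany"},
    "bank": {"economy", "geography"},
    "thing": {"factotum"},
}


def domain_frequencies(keyphrases, sense_domains):
    """Count, per domain D, the keyphrases with at least one sense mapping to D."""
    freq = Counter()
    for phrase in keyphrases:
        # A set of labels: each keyphrase counts at most once per domain.
        for domain in sense_domains.get(phrase, set()):
            freq[domain] += 1
    return freq


def top_domains(freq, n=5):
    """The n most frequent domains of a corpus."""
    return [domain for domain, _ in freq.most_common(n)]


freq = domain_frequencies(["cream cheese", "onion", "bank"], SENSE_DOMAINS)
```

Ranking the resulting counts with `top_domains` gives the kind of per-corpus focus list shown on the next slide.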
51Blog foci: Top 5 domains
Rank  Food          Health       Law             Meta-blogs
1     Gastronomy    Medicine     Law             Telecommunications
2     Alimentation  Time period  Quality         Time period
3     Quality       Quality      Politics        Person
4     Botany        Biology      Administration  Publishing
5     Person        Physics      Economy         Economy
52Top-10 intersections
- Law and meta-blogs:
  - Law, politics, economy (plus 3 factotum)
- Law and health:
  - Law, psychology (plus 2 factotum)
- Health and meta-blogs:
  - Law (plus 2 factotum)
- Food and law:
  - Sociology (plus 2 factotum)
- No overlap: food and health, health and law
53Semantic analysis II: hierarchical and non-hierarchical knowledge: WordNet and SSI (structural semantic interconnections)
54The need for word sense disambiguation
She sat by the bank and looked sentimentally at
the last coins.
She sat by the bank and looked sentimentally at
the last fish.
55WordNet semantic relations
56Structural semantic interconnections: bank and fish
Details of SSI's enhanced lexical database (extending WordNet) and of SSI's word sense disambiguation are described in R. Navigli & P. Velardi. Structural Semantic Interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7), July 2005.
57Structural semantic interconnections: bank and coin
58Knowledge-based similarity between blogs
- Example:
- connection between two terms from the domain computer science
- path weights: 0.33, 0.25, 0.25 (weight = 1 / path length in no. of edges)
- Procedure: for each blog pair:
- find all SSI paths between all pairs of a term (keyphrase) from blog 1 and a term from blog 2
- (in all conditions but the baseline: choose only terms that map to senses in the top domain(s), and choose only those senses)
- Measure of blog pair similarity: sum over the weights of all these paths
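The procedure above can be sketched on a toy graph. This is not the SSI algorithm or its lexical database: the graph, its node names, the path-length cap `max_edges`, and the function names are assumptions made for illustration. Only the weighting (1 / path length in edges) and the summation come from the slide.

```python
from itertools import product

# Hypothetical undirected semantic graph as adjacency lists.
GRAPH = {
    "bank": ["money", "deposit"],
    "money": ["bank", "coin", "deposit"],
    "coin": ["money"],
    "deposit": ["bank", "money"],
}


def path_weight(n_edges):
    """Weight of one path: 1 / path length in number of edges."""
    return 1.0 / n_edges


def simple_paths(graph, start, goal, max_edges):
    """Yield every cycle-free path from start to goal with at most max_edges edges."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_edges:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append(path + [nxt])


def blog_similarity(terms_1, terms_2, graph, max_edges=4):
    """Sum of path weights over all term pairs (one term from each blog)."""
    total = 0.0
    for t1, t2 in product(terms_1, terms_2):
        for path in simple_paths(graph, t1, t2, max_edges):
            total += path_weight(len(path) - 1)
    return total


sim = blog_similarity(["bank"], ["coin"], GRAPH)
```

In the toy graph, "bank" reaches "coin" via one 2-edge path and one 3-edge path, so the similarity is 1/2 + 1/3. The slide's weights 0.33, 0.25, 0.25 correspond to paths of 3, 4 and 4 edges.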
59Experimental settings
60Results (Quantitative view)
61Results: Qualitative view
- Baseline: spurious connections between law and metablogs via computer science terms -> filtered out in domain-label conditions
- Correct connections throughout: food and health: greasy food (cream cheese, chocolate sauce, ...), other fats, or health food
- 1/3-relatedness reveals important connections
- Expected: law and metablogs: enterprise (related to law), computer science (related to telecommunications), publishing, politics: law firms, news organizations, news story, political party
- Unexpected: law and food: local government, town planning (including parking lots, the main drag)
- Single-term expressions particularly visible in food and health (eggs, onions, ... health food, disease, beef) -> lexicalization effect, depends on domains (also related domains in law and metablogs)
- 3-relatedness: many highly generic single-word terms (activity, life, computer, area, food) establish many generic paths to a 2nd corpus (these terms are related to nearly everything else) -> topic drift
62Restricting path grammar to find valid interconnections
- Starting from 3-relatedness:
- 1 related-to link -> filters out 88.8% of the paths
- 2 types of links -> filters out 53.4% of the paths
- Results:
- Mostly, meaningful paths were retained.
- But further research is needed.
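A sketch of what such a path-grammar restriction could look like. It assumes that the two restrictions are upper bounds (at most N "related-to" links per path, at most M distinct link types per path), which the transcript does not state; the relation names other than "related-to", the toy paths, and the function names are hypothetical.

```python
def allowed(edge_labels, max_related_to=None, max_link_types=None):
    """Check one path, given as the list of its edge labels, against the limits."""
    if max_related_to is not None and edge_labels.count("related-to") > max_related_to:
        return False
    if max_link_types is not None and len(set(edge_labels)) > max_link_types:
        return False
    return True


def filter_paths(paths, **limits):
    """Return the retained paths and the share of paths that was filtered out."""
    kept = [p for p in paths if allowed(p, **limits)]
    filtered_share = 1 - len(kept) / len(paths) if paths else 0.0
    return kept, filtered_share


# Toy paths, each a sequence of edge labels (hypothetical relation names).
PATHS = [
    ["hypernym", "hypernym"],
    ["related-to", "hypernym"],
    ["related-to", "related-to", "hypernym"],
    ["hypernym", "part-of", "related-to"],
]

kept_a, share_a = filter_paths(PATHS, max_related_to=1)
kept_b, share_b = filter_paths(PATHS, max_link_types=2)
```

The share returned by `filter_paths` is the quantity the slide reports for each restriction; whether the retained paths are meaningful still has to be judged by inspection.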
63Questions / future work
- Evaluation
- Standard datasets (senseval for blogs): try the following?!
- http://www.blogpulse.com/www2006-workshop/
- 10 M posts from 1 M weblogs from three weeks in July 2005.
- "This data set has been selected as it spans a period of time during which an event of global significance occurred, namely the London bombings."
- Compare syntax- and semantics-based approaches
- Assuming that the semi-automatic approaches of Semantic Web Mining give qualitatively better results:
- How can the quality gains be weighed against the additional costs of manual post-processing?
- Improve path grammars
- Ontology learning
64Thank you