Bettina Berendt - PowerPoint PPT Presentation

1
Semantic Web Mining Today: Semantics for and from
Blogs

Bettina Berendt
Humboldt-Universität zu Berlin www.berendt.de
with many co-authors
with Roberto Navigli, Università La Sapienza,
Roma, Italy

2
Agenda

Motivation and overview
Why the Web? Why blogs?
Semantic Web Mining
Finding your way through blogspace
Using semantics for cross-domain blog analysis

3
Agenda

Motivation and overview
Why the Web? Why blogs?
Semantic Web Mining
Finding your way through blogspace
Using semantics for cross-domain blog analysis

4
The goal
5
Making humanity's knowledge effectively accessible
to as many people as possible.
6
Macrocosm: World Wide Web
7
Microcosm: Blogosphere
8
Concrete goals (examples for Part 2 of this talk)
Classification: "This blog covers content from
nutrition and gastronomy." → Suggestions of
meta tags for the blog → Support for
blog search engines
Recommendations with explanation: "If you found this
blog interesting, you may also be interested
in blog ..., because ..."
9
The potential
10
A great deal of knowledge, accessible to humans.
11
The problems
12
A great deal of knowledge, accessible to humans.
13
Web Mining
14
Forms

Knowledge discovery (aka data mining):
"the non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data." [1]
Web mining:
the application of data mining techniques to the
content, (hyperlink) structure, and usage of
Web resources.

Web mining areas: Web content mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R. (Eds.) (1996). Advances in
Knowledge Discovery and Data Mining. Boston, MA:
AAAI/MIT Press.
15
Web Mining: examples
Web mining areas:
Web content mining
Web structure mining
Web usage mining
16
The main problem of Web mining
17
Syntax in, Syntax out.
19
Semi-automatic tagging: tag recommendation based on
syntax and existing labels
20
Tagyu also works (with limitations) for
resources in other languages
21
Does this really work? (1)
22
Does this really work? (2)
23
The Wikipedia 300 Component Model, generated with
discrete PCA: cosco.hiit.fi/search/H300.html/topic_list
- common phrases of selected components

process water air pressure gas body of
water natural gas high pressure hot water
fresh water
Mark Gospel Matthew Luke Rose Virgin Virgin
Mary Gospel of John Gospel of Mark Gospel of
Luke
part text Britannica entry Encyclopedia
Britannica Encyclopædia Britannica
Encyclopaedia Britannica domain Encyclopædia
Britannica public domain Encyclopædia
Britannica public domain text
property theorem elements proof subset
axioms proposition natural numbers fundamental
theorem mathematical logic
Dove AMD Dove Streptopelia imperial crown
Imperial army imperial court imperial family
Collared Dove Streptopelia Imperial Russia
side feet long time long period right side
left side long distances different types short
distance opposite side
David bill Bob Jim Allen Dave Current
stars former members Bill Clinton former
President
magazine newspaper political parties public
domain text public opinion political career
public schools own right political life public
service
way things boy cat long time same way same
thing only way different ways good thing
problems zero sum digits natural numbers
positive integer mathematical analysis decimal
digits natural logarithm
population density couples races total area
makeup Demographics median age income
density housing units
Torres Iraqi KASUMI KHAZAD Khufu Granada Spa
Fra General information General Public License
General Bernardo New Granada Torres Strait
love Me Rolling Stones love songs Rolling
Stone magazine Love Me Fall in Love Meet Me
love story professional wrestler

In summary, weaknesses of purely statistical
approaches: Interpretation of the results? Existence
of results? Correctness? Inferences?
24
Semantic Web
25
The Semantic Web

"The Semantic Web is an extension of the current
web in which information is given well-defined
meaning, better enabling computers and people to
work in co-operation." [1]
"The Semantic Web provides a common framework
that allows data to be shared and reused across
application, enterprise, and community
boundaries. It is a collaborative effort led by
W3C with participation from a large number of
researchers and industrial partners. It is based
on the Resource Description Framework (RDF),
which integrates a variety of applications using
XML for syntax and URIs for naming." [2]

[1] Berners-Lee, T., Hendler, J., Lassila, O.
(2001). The Semantic Web. Scientific American, May.
[2] http://www.w3.org/2001/sw/
[3] Berners-Lee, T. (2000). Semantic Web - XML2000.
www.w3.org/2000/Talks/1206-xml2k-tbl/
26
Semantic Web: example

Category structure:

<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
    ....
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    ...
    <narrow r:resource="Top/Arts/Artists"/>
    <symbolic r:resource="Typography:Top/Computers/Fonts"/>
  </Topic>
  ....
</RDF>

Resources:

<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  ...
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3...ca/./file.html"/>
  </Topic>
  <ExternalPage about="http://wwwca/file.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://foo.asdfsa.."/>
  </Topic>
</RDF>
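
To illustrate why such a machine-readable category structure is useful, here is a minimal sketch that reads the subcategory ("narrow") links of each topic. It assumes the listing is well-formed XML with the namespace prefixes r: and d: as on the slide. The embedded snippet is a shortened copy of the slide's example, and the function name subcategories and the prefix odp are illustrative choices, not part of the directory format.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the category-structure example (elided parts left out).
RDF_SNIPPET = """\
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    <narrow r:resource="Top/Arts/Artists"/>
  </Topic>
</RDF>
"""

# "odp" is a made-up prefix for the document's default namespace.
NS = {
    "r": "http://www.w3.org/TR/RDF/",
    "d": "http://purl.org/dc/elements/1.0/",
    "odp": "http://directory.mozilla.org/rdf",
}


def subcategories(xml_text):
    """Map each topic id to the list of its direct subcategories."""
    root = ET.fromstring(xml_text)
    tree = {}
    for topic in root.findall("odp:Topic", NS):
        # ElementTree stores namespaced attributes as "{uri}name".
        topic_id = topic.get("{%s}id" % NS["r"])
        tree[topic_id] = [
            narrow.get("{%s}resource" % NS["r"])
            for narrow in topic.findall("odp:narrow", NS)
        ]
    return tree


if __name__ == "__main__":
    for topic, children in subcategories(RDF_SNIPPET).items():
        print(topic, "->", children)
```

Because the hierarchy is explicit in the data, a program can walk from a category to its subcategories without guessing from text.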
27
Why the Semantic Web? Example: structured search
Metadata according to Dublin Core (DC)
28
Semantic search, example 2: metadata according to DC;
domain ontology
29
The main problem of the Semantic Web
30
Who is supposed to do all this?
31
The approach
32
Web mining / machine learning extracts knowledge from
data
The Semantic Web makes knowledge machine-understandable
37
Agenda

Motivation and overview
Why the Web? Why blogs?
Semantic Web Mining
Finding your way through blogspace
Using semantics for cross-domain blog analysis

38
Context

Semi-automatic tagging
Blog recommendation
Semantics-enhanced text mining, word sense
disambiguation
Exploratory analyses of blog contents
Computational Approaches to Analyzing Weblogs,
AAAI 2006 Spring Symposium
Read more in the paper:
http://www2.wiwi.hu-berlin.de/berendt/Papers/SS0603BerendtB.pdf

39
Blog recommendation: collaborative and
content-based filtering
(www.iro.umontreal.ca/aimeur/publications/Workshop20.pdf)
40
An example of exploratory blog analysis (in
which a syntax-based approach is sufficient): the
run-up to the 2004 US presidential election
(Adamic & Glance, 2005)
41
Our procedure

Take a set of blog corpora (= collections of blogs
manually labelled as belonging to one topic)
In all of the following analyses:
What is a blog corpus about?
To which other blog corpora is it related, and
why?
Syntactic analysis: keyphrases
Semantic analysis I: domain labels
Semantic analysis II: structural semantic
interconnections

42
Data
43
Sample data: 4 blog corpora

Food and drink
Health and medicine
Law
Weblogs about blogging
Randomly sampled from the Yahoo! blog directory,
140-330 K words each
Available at
http://www.wiwi.hu-berlin.de/berendt/Blogs/Sample20050917/

44
Syntactic analysis
45
What is a blog about? Term extraction

Domain relevance (DR) and domain consensus (DC)
Keyphrases: thresholds DR 0.35, DC 0.23 (values from
previous experiments)
t = term, D = corpus (here: blog corpus), b = a
blog (here: an element of a corpus Dk)
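
The formulas for the two measures did not survive in this transcript. The following is a minimal sketch under stated assumptions: domain relevance is taken as the term's relative frequency in the target corpus divided by its highest relative frequency in any corpus, and domain consensus as the entropy of the term's distribution over the blogs of one corpus. These follow definitions common in the terminology extraction literature. The normalization by log(number of blogs), the ">=" comparison against the thresholds, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def term_probability(term, blogs):
    """Relative frequency of `term` among all term occurrences in a corpus.

    `blogs` is a list of Counter objects, one per blog (term -> count).
    """
    total = sum(sum(blog.values()) for blog in blogs)
    return sum(blog[term] for blog in blogs) / total if total else 0.0


def domain_relevance(term, corpora, target):
    """DR(t, D_k) = P(t | D_k) / max_j P(t | D_j) (assumed definition)."""
    best = max(term_probability(term, blogs) for blogs in corpora.values())
    return term_probability(term, corpora[target]) / best if best else 0.0


def domain_consensus(term, blogs):
    """Entropy of the term's distribution over the blogs of one corpus.

    Divided by log(number of blogs) so the value lies in [0, 1];
    this normalization is an assumption.
    """
    counts = [blog[term] for blog in blogs if blog[term] > 0]
    total = sum(counts)
    if total == 0 or len(blogs) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return entropy / math.log(len(blogs))


def is_keyphrase(term, corpora, target, min_dr=0.35, min_dc=0.23):
    """Apply the slide's thresholds; the >= comparison is an assumption."""
    return (domain_relevance(term, corpora, target) >= min_dr
            and domain_consensus(term, corpora[target]) >= min_dc)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    corpora = {
        "food": [Counter(cheese=3, recipe=1), Counter(cheese=3, court=1)],
        "law": [Counter(court=4, cheese=1), Counter(court=3)],
    }
    print(is_keyphrase("cheese", corpora, "food"))
    print(is_keyphrase("court", corpora, "food"))
```

The intuition: a keyphrase should be typical of its corpus compared with other corpora (relevance) and spread across many blogs of that corpus rather than concentrated in one (consensus).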

46
What is shared by two blogs? Syntactic
similarity: Jaccard coefficient
T(C) = keyphrases / terminology of corpus C
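
The Jaccard coefficient of two corpora is the size of the intersection of their keyphrase sets divided by the size of their union, |T(A) ∩ T(B)| / |T(A) ∪ T(B)|. A minimal sketch follows; the function name and the toy keyphrases are illustrative, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient of two keyphrase collections."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    # Two empty sets share nothing; 0.0 is a convention, not a definition.
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    food = {"cream cheese", "health food", "recipe"}
    health = {"health food", "disease", "recipe"}
    print(jaccard(food, health))
```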
47
Semantic analysis I: WordNet and WordNet domains
48
WordNet
49
Hierarchical knowledge: domain labels
50
Domain label statistics show that the blog
corpora have clear thematic foci
Frequency of domain D in corpus C = no. of
keyphrases in C with a sense that maps to D
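
The frequency definition above can be sketched as a simple count. The sketch assumes a lookup table from each keyphrase to its senses, where every sense carries a set of domain labels. The table below is invented toy data, not real WordNet Domains content, and the function name is illustrative.

```python
from collections import Counter


def domain_frequencies(keyphrases, sense_domains):
    """Count, per domain, the keyphrases with at least one sense in it.

    sense_domains maps a keyphrase to a list of senses; each sense is a
    set of domain labels (hypothetical structure for this sketch).
    """
    freq = Counter()
    for phrase in keyphrases:
        domains = set()
        for sense in sense_domains.get(phrase, []):
            domains.update(sense)
        # A keyphrase counts once per domain, however many senses map there.
        freq.update(domains)
    return freq


if __name__ == "__main__":
    toy_senses = {
        "bank": [{"economy"}, {"geography"}],
        "court": [{"law"}, {"sport"}],
        "loan": [{"economy"}],
    }
    freq = domain_frequencies(["bank", "court", "loan"], toy_senses)
    print(freq.most_common(2))
```

Ranking the resulting counts per corpus yields a "top domains" list of the kind shown on the next slide.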
51
Blog foci: Top 5 domains
   Food          Health       Law             Meta-blogs
1  Gastronomy    Medicine     Law             Telecommunications
2  Alimentation  Time period  Quality         Time period
3  Quality       Quality      Politics        Person
4  Botany        Biology      Administration  Publishing
5  Person        Physics      Economy         Economy
52
Top-10 intersections

Law and meta-blogs:
Law, politics, economy (3 factotum)
Law and health:
Law, psychology (2 factotum)
Health and meta-blogs:
Law (2 factotum)
Food and law:
Sociology (2 factotum)
No overlap: food and health, health and law

53
Semantic analysis II: Hierarchical and
non-hierarchical knowledge: WordNet and SSI
(structural semantic interconnections)
54
The need for word sense disambiguation
She sat by the bank and looked sentimentally at
the last coins.
She sat by the bank and looked sentimentally at
the last fish.
55
WordNet semantic relations
56
Structural semantic interconnections: bank and fish
Details of SSI's enhanced lexical
database (extending WordNet) and of SSI's word
sense disambiguation are described in R.
Navigli & P. Velardi. Structural Semantic
Interconnections: a knowledge-based approach to
word sense disambiguation. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
27(7), July 2005.
57
Structural semantic interconnections: bank and coin
58
Knowledge-based similarity between blogs

Example:
connection between two terms from the domain
computer science
path weights: 0.33, 0.25, 0.25 (= 1 / path
length in no. of edges)
Procedure: for each blog pair,
find all SSI paths between all pairs of a term
(keyphrase) from blog 1 and a term from blog 2
(in all conditions but the baseline: choose only
terms that map to senses in the top domain(s),
and choose only those senses)
Measure of blog pair similarity: sum over the
weights of all these paths
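
The similarity measure can be sketched as follows. This is not the SSI algorithm itself: SSI finds paths in an enhanced lexical database using a path grammar, whereas this sketch substitutes a plain enumeration of simple paths in a small, invented toy graph. Only the weighting (1 / path length in edges) and the summation follow the slide; the graph, the max_edges limit, and the function names are assumptions.

```python
def simple_paths(graph, start, goal, max_edges):
    """Yield paths without repeated nodes from start to goal, depth-first.

    Stand-in for SSI path finding; `graph` maps a node to its neighbours.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_edges:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


def blog_similarity(terms1, terms2, graph, max_edges=4):
    """Sum of 1 / (number of edges) over all paths between all term pairs."""
    total = 0.0
    for t1 in terms1:
        for t2 in terms2:
            for path in simple_paths(graph, t1, t2, max_edges):
                total += 1.0 / (len(path) - 1)
    return total


if __name__ == "__main__":
    # Invented undirected toy graph: two 2-edge paths link "a" and "d".
    toy_graph = {
        "a": ["b", "c"],
        "b": ["a", "d"],
        "c": ["a", "d"],
        "d": ["b", "c"],
    }
    print(blog_similarity(["a"], ["d"], toy_graph))
```

Shorter paths contribute more, so two blogs whose terms are closely connected in the lexical database score higher than blogs linked only through long chains.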

59
Experimental settings
60
Results (Quantitative view)
61
Results: Qualitative view

Baseline: spurious connections between law and
metablogs via computer science terms → filtered
out in domain-label conditions
Correct connections throughout: food and health:
greasy food (cream cheese, chocolate sauce, ...)
other fats, or health food
1/3-relatedness reveals important connections
Expected: law and metablogs: enterprise (related to
law), computer science (related to
telecommunications), publishing, politics: law
firms, news organizations, news story, political
party
Unexpected: law and food: local government town
planning (including parking lots, the main drag)
Single-term expressions particularly visible in
food and health (eggs, onions, ... health food
disease beef) → lexicalization effect, depends
on domains (also related domains in law and
metablogs)
3-relatedness: topic drift; many highly generic
single-word terms (activity, life, computer,
area, food) establish many generic paths to a 2nd
corpus (these terms are related to nearly
everything else) → topic drift

62
Restricting path grammar to find valid
interconnections

Starting from 3-relatedness
1 related-to link → filters out 88.8% of the
paths
2 types of links → filters out 53.4% of the
paths
Results
Mostly, meaningful paths were retained.
But further research is needed.
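
The two restrictions might be expressed as predicates over the sequence of link types along a path. The slide does not spell out the exact constraints, so this sketch assumes "at most one related-to link" and "at most two distinct link types". The link-type names other than related-to, the toy paths, and the function names are invented for illustration.

```python
def at_most_one_related_to(edge_types):
    """Assumed reading of the '1 related-to link' restriction."""
    return edge_types.count("related-to") <= 1


def at_most_two_link_types(edge_types):
    """Assumed reading of the '2 types of links' restriction."""
    return len(set(edge_types)) <= 2


def share_filtered_out(paths, keep):
    """Fraction of paths that the predicate `keep` rejects."""
    if not paths:
        return 0.0
    dropped = sum(1 for path in paths if not keep(path))
    return dropped / len(paths)


if __name__ == "__main__":
    # Each toy path is the list of link types along its edges.
    toy_paths = [
        ["hypernym", "related-to"],
        ["related-to", "related-to", "hypernym"],
        ["hypernym", "meronym", "related-to"],
        ["hypernym", "hypernym"],
    ]
    print(share_filtered_out(toy_paths, at_most_one_related_to))
    print(share_filtered_out(toy_paths, at_most_two_link_types))
```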

63
Questions / future work

Evaluation
Standard datasets ("Senseval for blogs"): try the
following?!
http://www.blogpulse.com/www2006-workshop/
"10 M posts from 1 M weblogs from three weeks in
July 2005.
This data set has been selected as it spans a
period of time during which an event of global
significance occurred, namely the London
bombings."
Compare syntax- and semantics-based approaches
Assuming that the semi-automatic approaches of
Semantic Web Mining give qualitatively better
results:
How can the quality gains be weighed against
the additional costs of manual post-processing?
Improve path grammars
Ontology learning