Bettina Berendt - PowerPoint PPT Presentation


Title: Bettina Berendt


1
Semantic Web Mining Today: Semantics for and from
Blogs
  • Bettina Berendt
  • Humboldt-Universität zu Berlin www.berendt.de
  • with many co-authors
  • with Roberto Navigli, Università La Sapienza,
    Roma, Italy

2
Agenda
  • Motivation and overview
  • Why the Web? Why blogs?
  • Semantic Web Mining
  • Finding your way through blogspace
  • Using semantics for cross-domain blog analysis

3
Agenda
  • Motivation and overview
  • Why the Web? Why blogs?
  • Semantic Web Mining
  • Finding your way through blogspace
  • Using semantics for cross-domain blog analysis

4
The goal
5
Make humanity's knowledge effectively accessible
to as many people as possible.
6
Macrocosm: the World Wide Web
7
Microcosm: the blogosphere
8
Concrete goals (examples for Part 2 of this talk)
Classification: "This blog covers content from
nutrition and gastronomy." → suggestions of
meta-tags for the blog → support for blog search
engines
Recommendations with explanation: "If you found
this blog interesting, you may also be interested
in blog ..., because ..."
9
The potential
10
A great deal of knowledge, accessible to humans.
11
The problems
12
A great deal of knowledge, accessible to humans.
13
Web Mining
14
Forms
  • Knowledge discovery (aka data
    mining)
  • "the non-trivial process of identifying valid,
    novel, potentially useful, and ultimately
    understandable patterns in data." 1
  • Web mining
  • the application of data mining techniques to
    the content, (hyperlink) structure, and usage of
    Web resources.

Web mining areas: Web content mining
1 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R. (Eds.) (1996). Advances in
Knowledge Discovery and Data Mining. Boston, MA:
AAAI/MIT Press.
15
Web mining: examples
Web mining areas: Web content mining
Web structure mining
Web usage mining
16
The main problem of Web mining
17
Syntax in, Syntax out.
18
(No Transcript)
19
Semi-automatic tagging: tag recommendation based
on syntax and existing labels
20
Tagyu also works (with limitations) for
resources in other languages
21
Does this really work? (1)
22
Does this really work? (2)
23
The Wikipedia 300 Component Model, generated with
discrete PCA (cosco.hiit.fi/search/H300.html/topic_list):
common phrases of selected components
  • process water air pressure gas body of
    water natural gas high pressure hot water
    fresh water
  • Mark Gospel Matthew Luke Rose Virgin Virgin
    Mary Gospel of John Gospel of Mark Gospel of
    Luke
  • part text Britannica entry Encyclopedia
    Britannica Encyclopædia Britannica
    Encyclopaedia Britannica domain Encyclopædia
    Britannica public domain Encyclopædia
    Britannica public domain text
  • property theorem elements proof subset
    axioms proposition natural numbers fundamental
    theorem mathematical logic
  • Dove AMD Dove Streptopelia imperial crown
    Imperial army imperial court imperial family
    Collared Dove Streptopelia Imperial Russia
  • side feet long time long period right side
    left side long distances different types short
    distance opposite side
  • David bill Bob Jim Allen Dave Current
    stars former members Bill Clinton former
    President
  • magazine newspaper political parties public
    domain text public opinion political career
    public schools own right political life public
    service
  • way things boy cat long time same way same
    thing only way different ways good thing
  • problems zero sum digits natural numbers
    positive integer mathematical analysis decimal
    digits natural logarithm
  • population density couples races total area
    makeup Demographics median age income
    density housing units
  • Torres Iraqi KASUMI KHAZAD Khufu Granada Spa
    Fra General information General Public License
    General Bernardo New Granada Torres Strait
  • love Me Rolling Stones love songs Rolling
    Stone magazine Love Me Fall in Love Meet Me
    love story professional wrestler

In summary, weaknesses of purely statistical
approaches: Interpretation of the results? Existence
of results? Correctness? Inferences?
24
Semantic Web
25
The Semantic Web
  • The Semantic Web is an extension of the current
    web in which information is given well-defined
    meaning, better enabling computers and people to
    work in co-operation. 1
  • The Semantic Web provides a common framework
    that allows data to be shared and reused across
    application, enterprise, and community
    boundaries. It is a collaborative effort led by
    W3C with participation from a large number of
    researchers and industrial partners. It is based
    on the Resource Description Framework (RDF),
    which integrates a variety of applications using
    XML for syntax and URIs for naming. 2

1 Berners-Lee, T., Hendler, J., Lassila, O.
(2001). The Semantic Web. Sci. American, May.
2 http://www.w3.org/2001/sw/
3 Berners-Lee, T. (2000). Semantic Web XML2000.
www.w3.org/2000/Talks/1206-xml2k-tbl/
26
Category structure
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
    ....
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    ...
    <narrow r:resource="Top/Arts/Artists"/>
    <symbolic r:resource="Typography:Top/Computers/Fonts"/>
  </Topic>
  ....
</RDF>
Resources
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  ...
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3...ca/./file.html"/>
  </Topic>
  <ExternalPage about="http://wwwca/file.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://foo.asdfsa.."/>
  </Topic>
</RDF>
Semantic Web: example
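As a rough illustration of what machine-readable category data allows, the sketch below reads a shortened copy of the category structure and lists each topic's narrower categories. It uses Python's standard xml.etree.ElementTree. The function name `subcategories` and the prefix `m` for the default namespace are assumptions made for this example, not part of the slide.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the category structure from the slide (elided parts left out).
RDF_SNIPPET = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    <narrow r:resource="Top/Arts/Artists"/>
  </Topic>
</RDF>"""

R = "http://www.w3.org/TR/RDF/"
# "m" is an arbitrary prefix chosen here for the document's default namespace.
NS = {"m": "http://directory.mozilla.org/rdf"}


def subcategories(xml_text):
    """Map each topic id to the list of its narrower topic ids."""
    root = ET.fromstring(xml_text)
    result = {}
    for topic in root.findall("m:Topic", NS):
        # Namespaced attributes are addressed as "{namespace-uri}name".
        topic_id = topic.get("{%s}id" % R)
        result[topic_id] = [
            narrow.get("{%s}resource" % R)
            for narrow in topic.findall("m:narrow", NS)
        ]
    return result


hierarchy = subcategories(RDF_SNIPPET)
print(hierarchy["Top/Arts"])
```

Because the hierarchy is explicit in the markup, no statistical guessing is needed to answer "which categories are below Arts?".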
27
Why the Semantic Web? Example: structured search
with metadata according to Dublin Core (DC)
28
Semantic search, example 2: metadata according to DC
and a domain ontology
29
The main problem of the Semantic Web
30
Who is supposed to do all this?
31
The approach
32
Web mining: machine learning extracts knowledge
from data
The Semantic Web makes knowledge
machine-understandable
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
Agenda
  • Motivation and overview
  • Why the Web? Why blogs?
  • Semantic Web Mining
  • Finding your way through blogspace
  • Using semantics for cross-domain blog analysis

38
Context
  • Semi-automatic tagging
  • Blog recommendation
  • Semantics-enhanced text mining, word sense
    disambiguation
  • Exploratory analyses of blog contents
  • Computational Approaches to Analyzing Weblogs,
    AAAI 2006 Spring Symposium
  • Read more in the paper:
  • http://www2.wiwi.hu-berlin.de/berendt/Papers/SS0603BerendtB.pdf

39
Blog recommendation: collaborative and
content-based filtering (www.iro.umontreal.ca/aimeur/publications/Workshop20.pdf)
40
An example of exploratory blog analysis (in
which a syntax-based approach is sufficient): the
run-up to the 2004 US presidential election
(Adamic & Glance, 2005)
41
Our procedure
  • Take a set of blog corpora (corpus = collection of
    blogs manually labelled as belonging to one topic)
  • In all of the following analyses, ask:
  • What is a blog corpus about?
  • To which other blog corpora is it related, and
    why?
  • Syntactic analysis: keyphrases
  • Semantic analysis I: domain labels
  • Semantic analysis II: structural semantic
    interconnections

42
Data
43
Sample data: 4 blog corpora
  • Food and drink
  • Health and medicine
  • Law
  • Weblogs about blogging
  • Randomly sampled from the Yahoo! blog directory,
    140-330 K words each
  • Available at
  • http://www.wiwi.hu-berlin.de/berendt/Blogs/Sample20050917/

44
Syntactic analysis
45
What is a blog about? Term extraction
  • Domain relevance (DR) and domain consensus (DC)
  • Keyphrases: thresholds DR 0.35, DC 0.23 (values
    from previous experiments)
  • t = term, D = corpus (here: blog corpus), b = a
    blog (here: as an element of a corpus Dk)
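The formulas for domain relevance and domain consensus did not survive in this transcript. The sketch below assumes OntoLearn-style definitions, which the slide does not confirm: DR(t, Dk) = P(t|Dk) / max_j P(t|Dj), and DC(t, Dk) as the entropy of the distribution of t over the blogs b in Dk. Normalizing the entropy to [0, 1], the function names, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def domain_relevance(term, corpora, k):
    """Assumed DR: P(term | D_k) divided by the largest P(term | D_j) over all corpora."""
    def prob(blogs):
        total = sum(sum(b.values()) for b in blogs)
        return sum(b[term] for b in blogs) / total if total else 0.0

    probs = {name: prob(blogs) for name, blogs in corpora.items()}
    top = max(probs.values())
    return probs[k] / top if top > 0 else 0.0


def domain_consensus(term, blogs):
    """Assumed DC: entropy of the term's distribution over blogs, scaled to [0, 1]."""
    counts = [b[term] for b in blogs]
    total = sum(counts)
    if total == 0 or len(blogs) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(blogs))


# Toy data: each corpus is a list of blogs, each blog a Counter of term frequencies.
corpora = {
    "food": [Counter({"cream cheese": 3, "recipe": 1}),
             Counter({"cream cheese": 3, "recipe": 1})],
    "law": [Counter({"court": 4}),
            Counter({"court": 3, "cream cheese": 1})],
}

dr_food = domain_relevance("cream cheese", corpora, "food")
dr_law = domain_relevance("cream cheese", corpora, "law")
dc_food = domain_consensus("cream cheese", corpora["food"])
dc_law = domain_consensus("cream cheese", corpora["law"])
print(dr_food, dr_law, dc_food, dc_law)
```

Under these assumptions a term becomes a keyphrase of a corpus when it is both typical of that corpus (high DR) and used evenly across its blogs (high DC).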

46
What is shared by two blogs? Syntactic
similarity: Jaccard coefficient
T(C) = keyphrases / terminology of corpus C
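The Jaccard coefficient of two keyphrase sets is the size of their intersection divided by the size of their union. A minimal sketch, with made-up keyphrase sets standing in for T(C1) and T(C2):

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative keyphrase sets, not taken from the actual corpora.
t_food = {"cream cheese", "health food", "olive oil"}
t_health = {"health food", "blood pressure", "olive oil", "side effect"}

similarity = jaccard(t_food, t_health)
print(similarity)  # 2 shared phrases out of 5 distinct ones
```

The measure is purely syntactic: two corpora count as similar only if they share identical keyphrases.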
47
Semantic analysis I: WordNet and WordNet domains
48
WordNet
49
Hierarchical knowledge: domain labels
50
Domain label statistics show that the blog
corpora have clear thematic foci
frequency of domain D in corpus C = no. of
keyphrases in C with a sense that maps to D
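The domain frequency defined above can be sketched as a simple count. The mapping from keyphrases to the domain labels of their senses is hypothetical here; in the talk it comes from WordNet and WordNet domains.

```python
from collections import Counter

# Hypothetical mapping: keyphrase -> domain labels of its senses.
SENSE_DOMAINS = {
    "court": ["law", "sport"],
    "appeal": ["law", "factotum"],
    "cream cheese": ["gastronomy"],
}


def domain_frequencies(keyphrases, sense_domains):
    """Count, per domain, the keyphrases that have at least one sense in it."""
    freq = Counter()
    for phrase in keyphrases:
        # set(): a keyphrase counts once per domain, however many senses map there.
        for domain in set(sense_domains.get(phrase, ())):
            freq[domain] += 1
    return freq


freq = domain_frequencies(["court", "appeal", "cream cheese"], SENSE_DOMAINS)
print(freq.most_common(1))
```

Ranking the domains by this count gives the thematic focus of a corpus.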
51
Blog foci: top 5 domains
    Food           Health        Law             Meta-blogs
1   Gastronomy     Medicine      Law             Telecommunications
2   Alimentation   Time period   Quality         Time period
3   Quality        Quality       Politics        Person
4   Botany         Biology       Administration  Publishing
5   Person         Physics       Economy         Economy
52
Top-10 intersections
  • Law and meta-blogs:
  • Law, politics, economy (+ 3 factotum)
  • Law and health:
  • Law, psychology (+ 2 factotum)
  • Health and meta-blogs:
  • Law (+ 2 factotum)
  • Food and law:
  • Sociology (+ 2 factotum)
  • No overlap: food and health, health and law

53
Semantic analysis II: Hierarchical and
non-hierarchical knowledge: WordNet and SSI
(structural semantic interconnections)
54
The need for word sense disambiguation
She sat by the bank and looked sentimentally at
the last coins.
She sat by the bank and looked sentimentally at
the last fish.
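To make the ambiguity concrete, the toy sketch below picks a sense of "bank" by counting words shared between the sentence and a hand-made signature for each sense. This is a simplified Lesk-style overlap heuristic, swapped in for illustration only. It is not the SSI algorithm used in the talk, and the sense labels and signatures are invented.

```python
# Invented sense inventory: sense label -> words typically found near that sense.
BANK_SENSES = {
    "bank#financial": {"money", "coins", "account", "deposit", "institution"},
    "bank#river": {"river", "water", "fish", "shore", "slope"},
}


def disambiguate(senses, sentence):
    """Return the sense whose signature shares the most words with the sentence."""
    context = {word.lower().strip(".,") for word in sentence.split()}
    return max(senses, key=lambda sense: len(senses[sense] & context))


coins = disambiguate(
    BANK_SENSES,
    "She sat by the bank and looked sentimentally at the last coins.",
)
fish = disambiguate(
    BANK_SENSES,
    "She sat by the bank and looked sentimentally at the last fish.",
)
print(coins, fish)
```

A single changed word flips the chosen sense, which is why keyphrases need to be disambiguated before they are mapped to domains.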
55
WordNet semantic relations
56
Structural semantic interconnections: bank, fish
Details of SSI's enhanced lexical
database (extending WordNet) and of SSI's word
sense disambiguation are described in R.
Navigli & P. Velardi. Structural Semantic
Interconnections: a knowledge-based approach to
word sense disambiguation. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
27(7), July 2005.
57
Structural semantic interconnections: bank, coin
58
Knowledge-based similarity between blogs
  • Example:
  • connection between two terms from the domain
    computer science
  • path weights: 0.33, 0.25, 0.25 (weight = 1 / path
    length in no. of edges)
  • Procedure: for each blog pair,
  • find all SSI paths between all pairs of a term
    (keyphrase) from blog 1 and a term from blog 2
  • (in all conditions but the baseline: choose only
    terms that map to senses in the top domain(s),
    and choose only those senses)
  • Measure of blog pair similarity: sum over the
    weights of all these paths
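The similarity measure can be sketched on a toy graph: enumerate the paths between terms of the two blogs, weight each path by 1 / (number of edges), and sum the weights. The graph below is invented and stands in for the SSI lexical database. The length bound `max_edges` and the restriction to simple paths are assumptions of this sketch.

```python
def build_graph(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def simple_paths(graph, start, goal, max_edges):
    """Yield all simple paths from start to goal with at most max_edges edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_edges:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def blog_similarity(graph, terms1, terms2, max_edges=4):
    """Sum of 1 / path length over all paths between terms of the two blogs."""
    total = 0.0
    for a in terms1:
        for b in terms2:
            if a == b:
                continue
            for path in simple_paths(graph, a, b, max_edges):
                total += 1.0 / (len(path) - 1)
    return total


# Invented toy graph: one 3-edge path and two 4-edge paths from "server" to "browser".
GRAPH = build_graph([
    ("server", "host"), ("host", "client"), ("client", "browser"),
    ("server", "network"), ("network", "internet"),
    ("internet", "web"), ("web", "browser"),
    ("server", "computer"), ("computer", "software"),
    ("software", "program"), ("program", "browser"),
])

similarity = blog_similarity(GRAPH, ["server"], ["browser"])
print(round(similarity, 2))
```

The three toy paths contribute 1/3 + 1/4 + 1/4, matching the pattern of weights in the example on the slide.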

59
Experimental settings
60
Results (Quantitative view)
61
Results: qualitative view
  • Baseline: spurious connections between law and
    metablogs via computer science terms → filtered
    out in domain-label conditions
  • Correct connections throughout: food and health:
    greasy food (cream cheese, chocolate sauce, ...),
    other fats, or health food
  • 1/3-relatedness reveals important connections
  • Expected: law and metablogs: enterprise (related
    to law), computer science (related to
    telecommunications), publishing, politics: law
    firms, news organizations, news story, political
    party
  • Unexpected: law and food: local government, town
    planning (including parking lots, the main drag)
  • Single-term expressions particularly visible in
    food and health (eggs, onions, ... health food
    disease beef) → lexicalization effect, depends
    on domains (also related domains in law and
    metablogs)
  • 3-relatedness: many highly generic single-word
    terms (activity, life, computer, area, food)
    establish many generic paths to a 2nd corpus
    (these terms are related to nearly everything
    else) → topic drift

62
Restricting the path grammar to find valid
interconnections
  • Starting from 3-relatedness
  • 1 related-to link → filters out 88.8% of the
    paths
  • 2 types of links → filters out 53.4% of the
    paths
  • Results
  • Mostly, meaningful paths were retained.
  • But further research is needed.
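One way to read these restrictions is as filters on the sequence of link types along a path. The sketch below assumes that "1 related-to link" means at most one related-to edge per path and that "2 types of links" means at most two distinct link types per path. Both readings, the link names, and the toy paths are assumptions, so the resulting shares have nothing to do with the percentages reported above.

```python
def at_most_one_related_to(labels):
    """Keep a path only if it uses at most one 'related-to' link (assumed reading)."""
    return labels.count("related-to") <= 1


def at_most_two_link_types(labels):
    """Keep a path only if it uses at most two distinct link types (assumed reading)."""
    return len(set(labels)) <= 2


def filtered_share(paths, keep):
    """Fraction of paths removed by the filter `keep`."""
    if not paths:
        return 0.0
    removed = sum(1 for labels in paths if not keep(labels))
    return removed / len(paths)


# Toy paths, each given as the list of its edge labels.
PATHS = [
    ["hypernym", "hyponym", "hypernym"],
    ["related-to", "related-to", "hypernym"],
    ["hypernym", "related-to", "meronym"],
    ["related-to", "hyponym", "hyponym"],
]

share_related = filtered_share(PATHS, at_most_one_related_to)
share_types = filtered_share(PATHS, at_most_two_link_types)
print(share_related, share_types)
```

Expressing each restriction as a predicate makes it easy to try alternative path grammars against the same set of paths.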

63
Questions / future work
  • Evaluation
  • Standard datasets (Senseval for blogs): try the
    following?!
  • http://www.blogpulse.com/www2006-workshop/
  • 10 M posts from 1 M weblogs from three weeks in
    July 2005.
  • This data set has been selected as it spans a
    period of time during which an event of global
    significance occurred, namely the London
    bombings.
  • Compare syntax- and semantics-based approaches
  • Assuming that the semi-automatic approaches of
    Semantic Web Mining give qualitatively better
    results:
  • How can the quality gains be weighed against
    the additional costs of manual post-processing?
  • Improve path grammars
  • Ontology learning

64
Thank you
  • for your attention!