1
Analyzing Social Bookmarking Systems: A del.icio.us Cookbook
Robert Wetzker, Carsten Zimmermann, Christian Bauckhage
Workshop on Mining Social Data, ECAI 2008
23 November, 2009
Dipl.-Ing. Robert Wetzker | robert.wetzker_at_dai-labor.de
3
Why this paper?
  • Why social bookmarking?
      • Provides a vast amount of user-generated
        annotations for web content.
      • Reflects the interests of millions of users.
      • Wisdom of crowds.
  • Research areas
      • (Web) search
      • (Web) content classification
      • Ontology building
      • Trend detection
      • Recommendation

4
Outline
  • The del.icio.us bookmarking service
  • Bookmarking patterns
  • Tagging patterns
  • Social bookmarking and spam
  • Conclusions and future work

5
The del.icio.us bookmarking service
7
The growth of del.icio.us
10
The dataset
  • We recursively crawled del.icio.us tag-wise,
    starting with the tag web2.0 (Oct.-Dec. 2007).
  • From the retrieved corpus of 45 million
    bookmarks we extracted the 1 million most
    frequent users and downloaded the bookmarks of
    these users (Dec. 2007 to Apr. 2008).
  • For the analysis presented here, we only
    considered the 142 million bookmarks obtained
    from the user-wise crawl.

Corpus details
> 80% of del.icio.us
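The two-stage collection described above (a recursive tag-wise crawl, then selecting the most frequent users) can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory dictionary stands in for the real service, and the names BOOKMARKS_BY_TAG, crawl_tagwise and most_frequent_users are hypothetical, not taken from the talk.

```python
from collections import Counter, deque

# Hypothetical toy stand-in for the service: tag -> bookmarks carrying that tag.
# Each bookmark is (user, url, tags).
BOOKMARKS_BY_TAG = {
    "web2.0": [("alice", "http://a.example", ("web2.0", "ajax")),
               ("bob", "http://b.example", ("web2.0", "design"))],
    "ajax": [("alice", "http://a.example", ("web2.0", "ajax")),
             ("carol", "http://c.example", ("ajax",))],
    "design": [("bob", "http://b.example", ("web2.0", "design")),
               ("alice", "http://d.example", ("design",))],
}


def crawl_tagwise(seed_tag, fetch):
    """Breadth-first crawl over tags, collecting each bookmark once."""
    seen_tags = {seed_tag}
    queue = deque([seed_tag])
    bookmarks = set()
    while queue:
        tag = queue.popleft()
        for user, url, tags in fetch(tag):
            bookmarks.add((user, url, tuple(tags)))
            # Every co-occurring tag becomes a new crawl target.
            for other in tags:
                if other not in seen_tags:
                    seen_tags.add(other)
                    queue.append(other)
    return bookmarks


def most_frequent_users(bookmarks, n):
    """The n users that appear most often in the crawled corpus."""
    counts = Counter(user for user, _, _ in bookmarks)
    return [user for user, _ in counts.most_common(n)]


corpus = crawl_tagwise("web2.0", lambda tag: BOOKMARKS_BY_TAG.get(tag, []))
top_users = most_frequent_users(corpus, 1)
```

In a real crawl the fetch function would issue network requests, and the selected users' full bookmark lists would be downloaded in a second pass.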
11
Bookmarking patterns
12
Bookmarking patterns
  • The del.icio.us community is biased toward web
    community and web technology related content.

Top 10 most frequent URLs in the corpus
13
Bookmarking patterns
  • The del.icio.us community is biased toward web
    community and web technology related content.

Top 10 most frequent domains in the corpus
14
Bookmarking patterns
  • The top 1% of users contribute 22% of all
    bookmarks.
  • 39% of all bookmarks link to 1% of all URLs.
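Concentration figures of this kind can be computed as the share of all bookmarks held by the top fraction of users or URLs. The sketch below assumes bookmarks are available as (user, url) pairs; the helper name top_share and the toy data are hypothetical.

```python
from collections import Counter
import math


def top_share(counts, fraction):
    """Share of all items held by the top `fraction` of keys, ranked by count."""
    ordered = sorted(counts.values(), reverse=True)
    k = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical toy data: (user, url) pairs.
bookmarks = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"), ("u1", "d"), ("u1", "e"),
    ("u1", "f"), ("u2", "a"), ("u3", "a"), ("u4", "b"), ("u5", "a"),
]

per_user = Counter(user for user, _ in bookmarks)
per_url = Counter(url for _, url in bookmarks)

user_share = top_share(per_user, 0.2)  # top 20% of users in the toy data
url_share = top_share(per_url, 0.5)    # top 50% of URLs in the toy data
```

On the real corpus the same function would be called with fraction 0.01 for both users and URLs.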

15
Bookmarking patterns
  • The del.icio.us community pays attention to new
    content only for a very short period of time.

16
Tagging patterns
17
Tagging patterns
  • Each bookmark is labeled with 3.16 tags on
    average.
  • About 7% of all bookmarks are not tagged at all.

Top 20 most frequent tags in the corpus
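The basic tagging statistics on this slide (average tags per bookmark, share of untagged bookmarks, most frequent tags) could be derived along these lines. The bookmark layout (user, url, tags) and the toy values are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus: each bookmark is (user, url, list_of_tags).
bookmarks = [
    ("u1", "http://a.example", ["design", "css", "web"]),
    ("u2", "http://b.example", ["design"]),
    ("u3", "http://c.example", []),
    ("u4", "http://d.example", ["web", "design", "blog", "tools"]),
]

# How often each tag was assigned across the whole corpus.
tag_counts = Counter(tag for _, _, tags in bookmarks for tag in tags)

avg_tags = sum(len(tags) for _, _, tags in bookmarks) / len(bookmarks)
untagged_share = sum(1 for _, _, tags in bookmarks if not tags) / len(bookmarks)
top_tags = tag_counts.most_common(2)
```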
18
Tagging patterns
  • 700 of 7,000,000 tags account for 50% of all
    labels.
  • 55% of all tags appear only once.
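Both long-tail figures follow from the tag frequency table: count how many of the most frequent tags are needed to cover a given share of all labels, and how many tags occur exactly once. A minimal sketch, assuming a tag-to-count mapping; the function names and the toy counts are hypothetical.

```python
from collections import Counter


def tags_needed_for_coverage(tag_counts, coverage):
    """Smallest number of top tags whose assignments reach `coverage` of all labels."""
    total = sum(tag_counts.values())
    running = 0
    for rank, (_, count) in enumerate(tag_counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(tag_counts)


def singleton_share(tag_counts):
    """Fraction of distinct tags that were assigned exactly once."""
    return sum(1 for c in tag_counts.values() if c == 1) / len(tag_counts)


# Hypothetical toy frequency table.
tag_counts = Counter({"design": 5, "web": 3, "css": 1, "blog": 1})

needed_for_half = tags_needed_for_coverage(tag_counts, 0.5)
once_only = singleton_share(tag_counts)
```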

19
Tagging patterns
  • Trends in the del.icio.us tag distribution
    strongly correlate with upcoming and periodic
    external events.

Occurrence of 5 sample tags in 2007.
20
Social bookmarking and spam
22
Social bookmarking and spam
  • Del.icio.us is highly vulnerable to spam.
  • 19 of the top 20 users are apparently of
    non-human origin, accounting for 1.3 million
    bookmarks, around 1% of the corpus.
  • We find spammers to exhibit one or more of the
    following characteristics:
      • very high activity
      • bookmarking only a few domains
      • high tagging rate
      • very low tagging rate
      • bulk posts
      • a combination of the above
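The characteristics listed above can be turned into simple per-user flags. The sketch below is one possible reading, not the authors' method: all thresholds are hypothetical values picked for the toy data, "bulk posts" is interpreted as many bookmarks sharing one timestamp, and "few domains" as a low ratio of distinct domains to bookmarks.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical thresholds, chosen only to make the toy example work.
MAX_BOOKMARKS = 5        # above this: "very high activity"
MIN_DOMAIN_RATIO = 0.5   # distinct domains per bookmark below this: "few domains"
MAX_AVG_TAGS = 10.0      # above this: "high tagging rate"
MIN_AVG_TAGS = 0.5       # below this: "very low tagging rate"
MAX_PER_TIMESTAMP = 3    # more bookmarks than this at one timestamp: "bulk posts"


def spam_flags(bookmarks):
    """Map each user to the set of heuristic flags it triggers.

    `bookmarks` is an iterable of (user, url, tags, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for user, url, tags, timestamp in bookmarks:
        by_user[user].append((url, tags, timestamp))

    flags = {}
    for user, items in by_user.items():
        n = len(items)
        domains = {urlparse(url).netloc for url, _, _ in items}
        avg_tags = sum(len(tags) for _, tags, _ in items) / n
        largest_batch = max(Counter(ts for _, _, ts in items).values())

        found = set()
        if n > MAX_BOOKMARKS:
            found.add("high_activity")
        if len(domains) / n < MIN_DOMAIN_RATIO:
            found.add("few_domains")
        if avg_tags > MAX_AVG_TAGS:
            found.add("high_tagging")
        if avg_tags < MIN_AVG_TAGS:
            found.add("low_tagging")
        if largest_batch > MAX_PER_TIMESTAMP:
            found.add("bulk_posts")
        flags[user] = found
    return flags


# Hypothetical toy data: one automated account and one ordinary user.
bookmarks = [("bot", f"http://spam.example/p{i}", [], 100) for i in range(6)]
bookmarks += [
    ("human", "http://a.example/x", ["python", "howto"], 101),
    ("human", "http://b.example/y", ["news", "tech"], 250),
]
flags = spam_flags(bookmarks)
```

Flags of this kind only narrow down candidates; a user triggering several of them would still need closer inspection.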

23
Social bookmarking and spam
The number of bookmarks and the number of users
linking to a domain.
24
Social bookmarking and spam
The number of user bookmarks and the average
number of tags per bookmark.
25
The diffusion of attention
27
The diffusion of attention
  • In some cases spam detection may prove
    computationally expensive or ambiguous.
  • The diffusion of attention concept reduces the
    effect of spam on the tag distribution without
    requiring actual spam detection.
  • We define the attention given to a tag as the
    number of users using the tag.
  • The diffusion of attention for a tag is then
    given by the number of users that assign a tag
    for the first time in a given period.
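The difference between raw tag occurrence and diffusion of attention can be made concrete with a small sketch. It assumes tag assignments are available as (user, tag, period) records and reads "for the first time" as the user's first use of that tag anywhere in the data; the function names and toy log are hypothetical.

```python
from collections import Counter

# Hypothetical toy log of (user, tag, period) tag assignments.
assignments = [
    ("bot", "iphone", 1), ("bot", "iphone", 1), ("bot", "iphone", 1),
    ("bot", "iphone", 2), ("bot", "iphone", 2),
    ("alice", "iphone", 1),
    ("bob", "iphone", 2), ("carol", "iphone", 2),
    ("alice", "iphone", 2),
]


def occurrence(assignments):
    """Raw number of assignments per (tag, period)."""
    return Counter((tag, period) for _, tag, period in assignments)


def diffusion_of_attention(assignments):
    """Number of users assigning a tag for the first time, per (tag, period)."""
    first_use = {}
    for user, tag, period in assignments:
        key = (user, tag)
        if key not in first_use or period < first_use[key]:
            first_use[key] = period
    return Counter((tag, period) for (_, tag), period in first_use.items())


raw = occurrence(assignments)
diffusion = diffusion_of_attention(assignments)
```

In the toy log the automated account inflates the raw counts in both periods, while it contributes to the diffusion measure only once, in the period of its first use.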

29
The diffusion of attention
Tagging trends by tag occurrence.
Tagging trends by diffusion of attention.
30
Future work
  • Provide automatic and scalable spam detection
    methods.
  • Topic-aware detection of trends.

Follow-up paper: "Detecting Trends in Social
Bookmarking Systems using a Probabilistic
Generative Model and Smoothing", R. Wetzker, T.
Plumbaum, A. Korth, C. Bauckhage, T. Alpcan, F.
Metze, International Conference on Pattern
Recognition (ICPR), 2008, Tampa, USA (to appear)
31
Thank you.
Questions?
32
Social bookmarking and spam
The number of bookmarks and the number of users
linking to a domain.
http://d.hatena.ne.jp