Title: Slide 1
1Analyzing Social Bookmarking Systems: A del.icio.us Cookbook
Robert Wetzker, Carsten Zimmermann, Christian Bauckhage
Workshop on Mining Social Data, ECAI 2008
23 November, 2009
Dipl.-Ing. Robert Wetzker
robert.wetzker_at_dai-labor.de
2Why this paper?
- Why social bookmarking?
- Provides a vast amount of user-generated annotations for web content.
- Reflects the interests of millions of users.
- Wisdom-of-crowds.
- Research areas
- (Web-) Search
- (Web-) Content classification
- Ontology building
- Trend detection
- Recommendation
4Outline
- The del.icio.us bookmarking service
- Bookmarking patterns
- Tagging patterns
- Social bookmarking and spam
- Conclusions and future work
5The del.icio.us bookmarking service
7The growth of del.icio.us
8The dataset
- We recursively crawled del.icio.us tag-wise, starting with the tag web2.0 (Oct.-Dec. 2007).
- From the retrieved corpus of 45 million bookmarks we extracted the 1 million most frequent users and downloaded the bookmarks of these users (Dec. 2007 - Apr. 2008).
- For the analysis presented here, we only considered the 142 million bookmarks obtained from the user-wise crawling.
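The two crawling steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: `fetch_bookmarks_for_tag` is a hypothetical callable standing in for whatever retrieval the real crawl used, and the toy data is invented.

```python
from collections import Counter, deque


def crawl_tagwise(seed_tag, fetch_bookmarks_for_tag, max_tags=None):
    """Breadth-first crawl over tags, starting from seed_tag.

    fetch_bookmarks_for_tag is a hypothetical callable that returns an
    iterable of (user, url, tags) tuples for one tag.
    """
    seen_tags = {seed_tag}
    queue = deque([seed_tag])
    bookmarks = set()
    visited = 0
    while queue and (max_tags is None or visited < max_tags):
        tag = queue.popleft()
        visited += 1
        for user, url, tags in fetch_bookmarks_for_tag(tag):
            bookmarks.add((user, url, tuple(tags)))
            # Every co-occurring tag not seen before is crawled later.
            for other in tags:
                if other not in seen_tags:
                    seen_tags.add(other)
                    queue.append(other)
    return bookmarks, seen_tags


def most_frequent_users(bookmarks, n):
    """Users with the most bookmarks in the tag-wise corpus."""
    per_user = Counter(user for user, _, _ in bookmarks)
    return [user for user, _ in per_user.most_common(n)]


# Toy in-memory stand-in for the service (invented data).
TOY = [
    ("alice", "http://a.example", ("web2.0", "ajax")),
    ("alice", "http://d.example", ("ajax",)),
    ("bob", "http://b.example", ("ajax", "javascript")),
    ("carol", "http://c.example", ("cooking",)),
]


def toy_fetch(tag):
    return [b for b in TOY if tag in b[2]]


corpus, tags = crawl_tagwise("web2.0", toy_fetch)
print(sorted(tags), most_frequent_users(corpus, 1))
```

In the toy data, the bookmark tagged only "cooking" is never reached from the seed tag, which is one reason a second, user-wise crawl gives a more complete picture of each user.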
Corpus details
> 80% of del.icio.us
12Bookmarking patterns
- The del.icio.us community is biased toward web
community and web technology related content.
Top 10 most frequent URLs in the corpus
Top 10 most frequent domains in the corpus
14Bookmarking patterns
- The top 1% of users contribute 22% of all bookmarks.
- 39% of all bookmarks link to 1% of all URLs.
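Concentration figures of this kind can be computed along the following lines. This is a sketch under assumptions: bookmarks are taken to be (user, url) pairs, and the function names are illustrative rather than taken from the paper.

```python
import math
from collections import Counter


def top_share(counts, top_fraction=0.01):
    """Share of the total held by the top `top_fraction` of items."""
    values = sorted(counts, reverse=True)
    if not values:
        return 0.0
    # At least one item is always counted as "top".
    k = max(1, math.ceil(len(values) * top_fraction))
    return sum(values[:k]) / sum(values)


def bookmark_concentration(bookmarks, top_fraction=0.01):
    """Return (share of bookmarks by top users, share pointing to top URLs)."""
    per_user = Counter(user for user, _ in bookmarks)
    per_url = Counter(url for _, url in bookmarks)
    return (
        top_share(per_user.values(), top_fraction),
        top_share(per_url.values(), top_fraction),
    )
```

Calling `bookmark_concentration(bookmarks, 0.01)` on the full corpus would yield the two shares reported on this slide.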
15Bookmarking patterns
- The del.icio.us community pays attention to new
content only for a very short period of time.
17Tagging patterns
- Each bookmark is labeled with 3.16 tags on average.
- About 7% of all bookmarks are not tagged at all.
Top 20 most frequent tags in the corpus
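A minimal sketch of how these per-bookmark tagging statistics and the top-tag list could be derived, assuming each bookmark is represented simply as a list of its tags (an assumed input format):

```python
from collections import Counter


def tagging_summary(bookmark_tags, top_n=20):
    """bookmark_tags: one list of tags per bookmark (may be empty).

    Returns (mean tags per bookmark, fraction of untagged bookmarks,
    the top_n most frequent tags with their counts).
    """
    n = len(bookmark_tags)
    if n == 0:
        return 0.0, 0.0, []
    total_labels = sum(len(tags) for tags in bookmark_tags)
    untagged = sum(1 for tags in bookmark_tags if not tags)
    freq = Counter(tag for tags in bookmark_tags for tag in tags)
    return total_labels / n, untagged / n, freq.most_common(top_n)
```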
18Tagging patterns
- 700 of 7,000,000 tags account for 50% of all labels.
- 55% of all tags appear only once.
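Both long-tail figures follow directly from a tag frequency table. The sketch below assumes `freq` maps each tag to its number of assignments; the helper names are illustrative.

```python
from collections import Counter


def tags_covering(freq, share=0.5):
    """Smallest number of most frequent tags that together account
    for at least `share` of all labels."""
    total = sum(freq.values())
    running = 0
    for k, count in enumerate(sorted(freq.values(), reverse=True), start=1):
        running += count
        if running >= share * total:
            return k
    return 0  # only reached for an empty table


def singleton_fraction(freq):
    """Fraction of distinct tags that appear exactly once."""
    if not freq:
        return 0.0
    return sum(1 for c in freq.values() if c == 1) / len(freq)


freq = Counter({"design": 5, "blog": 3, "xyzzy": 1, "todo2007": 1})
print(tags_covering(freq), singleton_fraction(freq))
```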
19Tagging patterns
- Tendencies in the del.icio.us tag distribution
strongly correlate with upcoming and periodic
external events.
Occurrence of 5 sample tags in 2007.
21Social bookmarking and spam
- Del.icio.us is highly vulnerable to spam.
- 19 of the Top 20 users are of apparently non-human origin, accounting for 1.3 million bookmarks, around 1% of the corpus.
- We find spammers to exhibit one or more of the following characteristics:
- very high activity
- bookmarking only few domains
- high tagging rate
- very low tagging rate
- bulk posts
- a combination of the above
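The characteristics listed above could be turned into simple per-user checks, for example as below. This is only an illustration: the slides give no thresholds, so every value in `Thresholds` is a made-up assumption, as is the (domain, tags, timestamp) input format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical; the slides do not specify any.
    very_high_activity: int = 10000    # bookmarks per user
    min_bookmarks_for_ratios: int = 50  # ignore ratio checks for tiny accounts
    few_domains: int = 2               # distinct domains
    high_tag_rate: float = 20.0        # average tags per bookmark
    low_tag_rate: float = 0.1          # average tags per bookmark
    bulk_size: int = 100               # bookmarks sharing one timestamp


def spam_signals(user_bookmarks, t=Thresholds()):
    """user_bookmarks: list of (domain, tags, timestamp) for one user.

    Returns the set of listed characteristics the user exhibits.
    """
    signals = set()
    n = len(user_bookmarks)
    if n == 0:
        return signals
    domains = {domain for domain, _, _ in user_bookmarks}
    avg_tags = sum(len(tags) for _, tags, _ in user_bookmarks) / n
    per_timestamp = Counter(ts for _, _, ts in user_bookmarks)
    if n > t.very_high_activity:
        signals.add("very high activity")
    if n >= t.min_bookmarks_for_ratios:
        if len(domains) <= t.few_domains:
            signals.add("few domains")
        if avg_tags > t.high_tag_rate:
            signals.add("high tagging rate")
        if avg_tags < t.low_tag_rate:
            signals.add("very low tagging rate")
    if max(per_timestamp.values()) >= t.bulk_size:
        signals.add("bulk posts")
    return signals
```

A user who triggers several signals at once corresponds to the last item in the list; how to weigh the signals against each other is left open here.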
23Social bookmarking and spam
The number of bookmarks and the number of users
linking to a domain.
24Social bookmarking and spam
The number of user bookmarks and the average
number of tags per bookmark.
26The diffusion of attention
- In some cases spam detection may prove computationally expensive or ambiguous.
- The diffusion of attention concept reduces the effect of spam on the tag distribution without the actual need for spam detection.
- We define the attention given to a tag as the number of users using the tag.
- The diffusion of attention for a tag is then given by the number of users that assign the tag for the first time in a given period.
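The two definitions above can be sketched as follows. The input format, a stream of (user, tag, period) assignments with sortable period labels such as "2007-03", is an assumption, as is counting attention per period.

```python
from collections import defaultdict


def attention_and_diffusion(assignments):
    """assignments: iterable of (user, tag, period) tuples.

    Returns two dicts keyed by (tag, period):
    attention = number of distinct users using the tag in that period,
    diffusion = number of users assigning the tag for the first time
                in that period.
    """
    users_in_period = defaultdict(set)
    first_period = {}
    for user, tag, period in assignments:
        users_in_period[(tag, period)].add(user)
        key = (user, tag)
        if key not in first_period or period < first_period[key]:
            first_period[key] = period
    attention = {k: len(users) for k, users in users_in_period.items()}
    diffusion = defaultdict(int)
    for (_, tag), period in first_period.items():
        diffusion[(tag, period)] += 1
    return attention, dict(diffusion)
```

Because both measures count users rather than tag occurrences, a single account posting the same tag thousands of times contributes at most one to either value, which is how the concept dampens spam without detecting it.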
28The diffusion of attention
Tagging trends by tag occurrence.
Tagging trends by diffusion of attention.
30Future work
- Provide automatic and scalable spam detection methods.
- Topic-aware detection of trends.
Follow-up paper: "Detecting Trends in Social Bookmarking Systems using a Probabilistic Generative Model and Smoothing", R. Wetzker, T. Plumbaum, A. Korth, C. Bauckhage, T. Alpcan, F. Metze, International Conference on Pattern Recognition (ICPR), 2008, Tampa, USA (to appear)
31Thank you.
Questions?
32Social bookmarking and spam
The number of bookmarks and the number of users
linking to a domain.
http://d.hatena.ne.jp