Title: Visualizing Weblog Term Spaces
Elijah Wright and Kazuhiro Seki
{ellwrigh, kseki}_at_indiana.edu
Ph.D. Students, School of Library and Information Science,
Indiana University, Bloomington
Abstract: It is commonly assumed that weblogs (news or diary pages with entries presented in reverse-chronological order) have a neighborhood, or a community of other blogs which discuss related matters. How can we verify or disprove this assumption? In this project, we collected a large sample of weblog data and compared the performance of link-based and content-based clustering methods on the same data. With these methods, we hope to infer the presence of both topical and link-based communities.
- Inlink similarity (co-citation)
- In similar fashion, we also examined the use of inlinks which point to at least one of the 2,740 blogs. These inlinks were obtained by using Google to carry out a backlink search. Of the 2,740 blogs, 1,209 were found to have no backlinks. Based on these results, we created a co-citation matrix C, where C(x,y) = 1 means that web site (or blog) x has a link to blog y. The size of the resulting matrix was 4,518 × 1,209.
- As before, the clusters were plotted such that dot colors and patterns correspond to the content-based clusters as shown in the figure below, and scatter plots for content- and inlink-based similarities for every pair of blogs were drawn. Both show a similar trend to the case of blogrolls (outlinks) and did not reveal a clear relation between content and inlinks.
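The co-citation idea above (two blogs are similar when the same sites link to both) can be sketched as follows. This is a toy illustration with invented data; the function name and structure are ours, not the study's code.

```python
from collections import defaultdict

def cocitation_counts(links):
    """links: dict mapping each citing site to the set of blogs it links to.
    Returns {(blog_a, blog_b): number of sites that cite both}."""
    counts = defaultdict(int)
    for targets in links.values():
        blogs = sorted(targets)
        # every pair of blogs cited by the same site is co-cited once
        for i in range(len(blogs)):
            for j in range(i + 1, len(blogs)):
                counts[(blogs[i], blogs[j])] += 1
    return dict(counts)

# Toy data: three citing sites, three blogs.
links = {
    "site1": {"blogA", "blogB"},
    "site2": {"blogA", "blogB", "blogC"},
    "site3": {"blogB", "blogC"},
}
print(cocitation_counts(links))
# blogA and blogB are co-cited by site1 and site2, giving a count of 2
```

In matrix terms, these counts are the off-diagonal entries of C^T C for the binary co-citation matrix C described above.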
Data collection: 5,000 randomly-sampled weblogs, with no redundancy. We used http://blo.gs, which allowed us to randomly select a blog from an existing database of over one million weblogs. We were able to extract the textual content of weblog titles and entries by parsing them out of the RDF/XML feeds. Blogrolls (commonly found as a simple list of anchor tags in the sidebar of a weblog) were found in 1,869 of our 5,000 weblogs. Only 810 blogs had both RDF/XML feeds and blogrolls. To compare the outcomes from content-based clusters and link-based communities, we found it desirable for both feed and blogroll information to be identified for each collected weblog. Weblogs which did not meet this criterion were discarded.
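Extracting title and entry text from a feed might look like the following minimal sketch. The toy feed and tag list are illustrative only; real RDF/RSS feeds vary in namespaces and structure, which the study's parser would have had to handle.

```python
import xml.etree.ElementTree as ET

# A deliberately simplified feed (no namespaces) for illustration.
FEED = """<rss><channel><item>
  <title>Post one</title>
  <description>Hello world entry text</description>
</item></channel></rss>"""

def extract_text(feed_xml):
    """Pull title/description/summary/content text out of each feed item."""
    root = ET.fromstring(feed_xml)
    texts = []
    for item in root.iter("item"):
        for tag in ("title", "description", "summary", "content"):
            el = item.find(tag)
            if el is not None and el.text:
                texts.append(el.text.strip())
    return texts

print(extract_text(FEED))  # ['Post one', 'Hello world entry text']
```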
Dot colors and patterns correspond to the resulting clusters above. Top clusters for LSA-style clustering were as follows: 1: business, 3: spirituality/religion, 4: gay rights, 19: tech/programming, 7: government/policy, 18: US presidential election, 11: diary-type/social words, 6: sports. Minor/noise clusters contained foreign-language documents, unexpected HTML tag text, and words most commonly found in interfaces of weblog software like LiveJournal (cluster 20).
- Methodology II
- Link-based analysis
- URL normalization
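URL normalization of this kind typically canonicalizes case, host aliases, and trailing slashes so that links to the same blog compare equal. The rules below are a hypothetical sketch of such a step (the study does not spell out its exact rules); query strings and fragments are dropped.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Hypothetical URL normalization: lowercase scheme and host,
    drop a leading 'www.', strip trailing slash, query, and fragment."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{host}{path}"

print(normalize("HTTP://WWW.Example.com/blog/"))  # http://example.com/blog
```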
- Adjacency matrix creation: After normalizing URLs, we found that the resulting matrix was very sparse. The mean number of links a blog has within our data set is only 0.33, which would not be sufficient for link structure-based community discovery (for example, Flake et al., 2002). As a result, we also explored the use of outlink similarity (co-reference) and inlink similarity (co-citation).
- Outlink similarity (co-reference)
- Using the blogroll data, a co-reference matrix R was created, where R(x,y) = n denotes that blog x has n links to blog y. The size of R is 2,740 × 32,221 (number of blogs by number of blogroll links). We discarded 29,817 blogroll entries which were pointed to by only one blog, as well as 1,274 blogs which did not have any blogroll links associated with them, resulting in a 1,466 × 2,404 matrix. We then applied TFIDF, SVD, and K-means to the matrix. We expected that TFIDF would give an appropriate weight to each blogroll link within each blog and that SVD would reveal implicit associations among blogs, even if they did not share the same blogroll links.
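The TFIDF weighting step above, applied to a blog-by-blogroll-link count matrix, can be sketched as follows. This is our own minimal implementation on invented data, not the study's code; a ubiquitous link (one present in every blog) receives zero weight, while rarer links are weighted up.

```python
import math

def tfidf(matrix):
    """matrix: list of rows (blogs), each a list of raw link counts.
    Returns TF * log(N / DF) weights per entry."""
    n_docs = len(matrix)
    n_feats = len(matrix[0])
    # document frequency: number of blogs containing each link
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_feats)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_feats)]
        for row in matrix
    ]

# Toy matrix: 3 blogs, 3 candidate blogroll links.
R = [[1, 1, 0],
     [1, 0, 2],
     [1, 0, 0]]
weighted = tfidf(R)
# The first link appears in every blog (DF = 3), so its weight is log(3/3) = 0.
```

In the study, SVD and K-means would then be run on such a weighted matrix.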
- Methodology I
- Content-based clustering
- Content extraction from the feed files: RDF/XML <content>, <description>, or <summary> tags.
- Applied stopword list
- Porter stemmer
- Discarded tiny content files; normalized document length.
- Low document frequency (DF) words were eliminated in order to reduce the feature space; it has been shown that low DF words are not useful for text classification tasks (Yang and Pedersen, 1997). Words with a DF of less than 5 were removed, resulting in 19,398 features (87.0% reduction).
- TFIDF (term frequency by inverse document frequency) techniques applied.
- SVD then applied via the modules present within the GNU R Statistical Computing Environment. Experimentally, 100 dimensions were used.
- Clustering
- For clustering, we applied the K-means clustering algorithm (where the number of clusters was empirically set to 20). To describe the resulting clusters, we computed chi-square statistics for each (word, cluster) pair. We produced a table showing the 10 most discriminative words (stemmed) for each cluster, where a class label was manually identified for each class. We were able to apply multidimensional scaling (MDS) in order to produce a visualization of the data space (see next fig.).
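The K-means step can be sketched as below. This is a toy Lloyd's-algorithm implementation on invented 2-D data with k = 2 for brevity; the study ran k = 20 on 100-dimensional SVD vectors.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy Lloyd's algorithm: points is a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # update step: each center moves to its cluster's mean
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of two points each.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(points, 2)
```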
Conclusions and Future Work: With these early results, we found that LSA (on content) was more effective with our unique data set than inlink- or outlink-based clustering. The results might have been significantly different with data gathered in a different manner; in particular, the matrix used to calculate co-citation and co-reference would probably have been much denser with a link-based rather than a random sample. Much care is required in order to get very good results from messy Web data. In the next six months we plan to automate this process so that similar work can be done with much less manual intervention. We are interested in tracking the changes in cluster size and boundaries over time; hopefully, this will allow us to visually examine the evolution and growth of weblog discussion topics as events of national importance occur. Further research and application of these techniques to other samples of data may provide validation and support for the decisions we made in the course of this research.
References
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIST, vol. 41, no. 6, pp. 391-407.
Flake, G., Lawrence, S., Giles, C. L., and Coetzee, F. (2002). Self-Organization of the Web and Identification of Communities. IEEE Computer, vol. 35, no. 3, pp. 66-71.
Menczer, F. (To appear). Lexical and Semantic Clustering by Web Links. JASIST.
Porter, M. (1980). An algorithm for suffix stripping. Program, vol. 14, no. 3, pp. 130-137.
Yang, Y. and Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 412-420.
R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org/.