Visualizing Weblog Term Spaces
Elijah Wright and Kazuhiro Seki
{ellwrigh, kseki}@indiana.edu
Ph.D. Students, School of Library and Information
Science, Indiana University, Bloomington
Abstract: It is commonly assumed that weblogs
(news or diary pages with entries presented in
reverse-chronological order) have a
neighborhood or a community of other blogs
which discuss related matters. How can we verify
or disprove this assumption? In this project, we
collected a large sample of weblog data and
compared the performance of link-based and
content-based clustering methods on the same
data. With these methods, we hope to infer the
presence of both topical and link-based
communities.
  • Inlink similarity (co-citation)
  • In similar fashion, we also examined the use of
    inlinks which point to at least one of the 2,740
    blogs. These inlinks were obtained by using
    Google to carry out a backlink search. Of the
    2,740 blogs, 1,209 did not have any backlinks.
    Based on these results, we created a co-citation
    matrix C, where C(x,y)=1 means that web site (or
    blog) x has a link to blog y. The size of the
    resulting matrix was 4,518×1,209.
  • As before, the clusters were plotted such that
    dot colors and patterns correspond to the
    content-based clusters (as shown in the figure
    below), and scatter plots of content- and
    inlink-based similarities for every pair of blogs
    were drawn. Both show a trend similar to the case
    of blogrolls (outlinks) and did not reveal a
    clear relation between content and inlinks.
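The co-citation step above can be sketched at toy scale. This is a minimal Python sketch, not the poster's actual code (which is not shown); the function names and input shapes are assumptions, and the similarity measure (cosine over inlink columns) is one standard choice for co-citation similarity.

```python
# Sketch: build a binary co-citation matrix C from backlink lists,
# where C[x, y] = 1 means site x links to blog y (as in the poster).
import numpy as np

def cocitation_matrix(backlinks, blogs):
    """backlinks: dict mapping each blog URL to the set of sites
    linking to it. Returns (C, sites) with C binary, sites x blogs."""
    sites = sorted({s for linkers in backlinks.values() for s in linkers})
    site_idx = {s: i for i, s in enumerate(sites)}
    blog_idx = {b: j for j, b in enumerate(blogs)}
    C = np.zeros((len(sites), len(blogs)), dtype=int)
    for blog, linkers in backlinks.items():
        for site in linkers:
            C[site_idx[site], blog_idx[blog]] = 1
    return C, sites

def inlink_similarity(C, j, k):
    """Cosine similarity of two blogs' inlink columns."""
    a, b = C[:, j].astype(float), C[:, k].astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Blogs with no backlinks (1,209 of 2,740 in the poster's data) simply yield all-zero columns, which is why they were excluded before clustering.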

Data collection: 5,000 randomly-sampled weblogs,
with no redundancy. We used http://blo.gs, which
allows us to randomly select a blog from its
existing database of over one million weblogs.
We were able to extract the textual content of
weblog titles and entries by parsing them out of
the RDF/XML feeds. Blogrolls (commonly found as
a simple list of anchor tags in the sidebar of a
weblog) were found in 1,869 of our 5,000 weblogs.
Only 810 blogs had both RDF/XML feeds and
blogrolls. To compare the outcomes from
content-based clusters and link-based
communities, we found it desirable for both feed
and blogroll information to be identified for
each collected weblog. Weblogs which did not
meet this criterion were discarded.
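The feed-parsing step described above can be sketched roughly as follows. The poster gives no code, so this is a hypothetical helper; it pulls text from the tag names the poster mentions while ignoring XML namespaces, whereas production code would need fuller handling of the various RSS/RDF/Atom dialects.

```python
# Sketch: extract entry text from an RSS/RDF feed by collecting the
# <content>, <description>, or <summary> elements named in the poster.
import xml.etree.ElementTree as ET

def extract_entry_text(feed_xml):
    """Return the stripped text of every content/description/summary
    element in the feed, regardless of XML namespace."""
    root = ET.fromstring(feed_xml)
    wanted = {"content", "description", "summary"}
    texts = []
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if tag in wanted and elem.text and elem.text.strip():
            texts.append(elem.text.strip())
    return texts
```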
Dot colors and patterns correspond to the
resulting clusters above. Top clusters for
LSA-style clustering were as follows: 1=business,
3=spirituality/religion, 4=gay rights,
19=tech/programming, 7=government/policy, 18=US
presidential election, 11=diary type / social
words, 6=sports. Minor / noise clusters contained
foreign-language documents, unexpected HTML tag
text, and words most commonly found in interfaces
of weblog software like LiveJournal (cluster 20).
  • Methodology II
  • Link-based analysis
  • URL normalization
  • Adjacency matrix creation: After normalizing
    URLs, we found that the resulting matrix was very
    sparse. The mean number of links a blog has
    within our data set is only 0.33, which would
    not be sufficient for link structure-based
    community discovery (for example, Flake et al.,
    2002). As a result, we also explored the use of
    outlink similarity (co-reference) and inlink
    similarity (co-citation).
  • Outlink similarity (co-reference)
  • Using the blogroll data, a co-reference matrix R
    was created, where R(x,y)=n denotes that blog x
    has n links to blog y. The size of R is
    2,740×32,221 (number of blogs by number of
    blogroll entries). We discarded 29,817 blogroll
    entries which were pointed to by only one blog,
    as well as 1,274 blogs which do not have any
    blogroll links associated with them, resulting
    in a 1,466×2,404 matrix. We then applied TFIDF,
    SVD, and K-means to the matrix. We expected that
    TFIDF would give an appropriate weight to each
    blogroll link within each blog and that SVD
    would reveal implicit associations among blogs,
    even if they did not share the same blogroll
    links.
  • Methodology I
  • Content-based clustering
  • Content extraction from the RDF/XML feed files:
    <content>, <description>, or <summary> tags.
  • Applied stopword list
  • Porter stemmer
  • Discarded tiny content files, normalizing
    document length.
  • Low document frequency (DF) words were eliminated
    in order to reduce the feature space; it has
    been shown that low-DF words are not useful for
    text classification tasks (Yang and Pedersen,
    1997). Words with a DF of less than 5 were
    removed, resulting in 19,398 features (87.0%
    reduction).
  • TFIDF (term frequency by inverse document
    frequency) weighting applied.
  • SVD then applied via the modules present within
    the GNU R Statistical Computing Environment.
    Experimentally, 100 dimensions were used.
  • Clustering
  • For clustering, we applied the K-means
    clustering algorithm (where the number of
    clusters was empirically set to 20). To describe
    the resulting clusters, we computed chi-square
    statistics for each (word, cluster) pair. We
    produced a table showing the 10 most
    discriminative words (stemmed) for each cluster,
    where a class label was manually identified for
    each class. We applied multidimensional scaling
    (MDS) in order to produce a visualization of the
    data space (see next fig.).
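The content pipeline above (TFIDF weighting, SVD reduction, K-means clustering) can be sketched at toy scale. The poster's actual analysis used GNU R with 100 SVD dimensions and k=20, so this numpy version is an illustrative reconstruction under assumed data shapes, not the authors' code.

```python
# Sketch of the content pipeline: TFIDF -> truncated SVD -> K-means.
import numpy as np

def tfidf(counts):
    """counts: docs x terms raw term-frequency matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=0)          # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def svd_reduce(X, dims):
    """Project documents onto the top `dims` singular directions
    (the poster used 100 dimensions)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    d = min(dims, len(s))
    return U[:, :d] * s[:d]

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm (the poster set k=20 empirically)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each document to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned documents
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

On the poster's scale this runs over a 2,740-blog, 19,398-feature matrix; the chi-square step for labeling clusters would then score each (word, cluster) pair against the resulting labels.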

Conclusions and Future Work: With these early
results, we found that LSA (on content) was more
effective with our unique data set than inlink-
or outlink-based clustering. The results might
have been significantly different with data
gathered in a different manner; in particular,
the matrices used to calculate co-citation and
co-reference would probably have been much less
sparse with a link-based rather than a random
sample. Much care is required in order to get
very good results from messy Web data. In the
next six months we plan to automate this process
so that similar work can be done with much less
manual intervention. We are interested in
tracking the changes in cluster size and
boundaries over time; hopefully, this will allow
us to visually examine the evolution and growth
of weblog discussion topics as events of national
importance occur. Further research and
application of these techniques to other samples
of data may provide validation and support for
decisions we made in the course of this research.
References
Deerwester, Dumais, Furnas, Landauer, and
Harshman (1990). Indexing by Latent Semantic
Analysis. JASIST, vol. 41, no. 6, pp. 391-407.
Gary Flake, Steve Lawrence, C. Lee Giles, and
Frans Coetzee (2002). Self-Organization of the
Web and Identification of Communities. IEEE
Computer, vol. 35, no. 3, pp. 66-71.
Filippo Menczer (to appear). Lexical and Semantic
Clustering by Web Links. JASIST.
Martin Porter (1980). An algorithm for suffix
stripping. Program, vol. 14, no. 3, pp. 130-137.
Yiming Yang and Jan O. Pedersen (1997). A
Comparative Study on Feature Selection in Text
Categorization. In Proceedings of the Fourteenth
International Conference on Machine Learning
(ICML'97), pp. 412-420.
R Development Core Team (2004). R: A language and
environment for statistical computing. R
Foundation for Statistical Computing, Vienna,
Austria. ISBN 3-900051-00-3. URL:
http://www.R-project.org/.