Curtis Spencer - PowerPoint PPT Presentation

About This Presentation
Title:

Curtis Spencer

Slides: 19
Provided by: ezra242
Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

1
An Internet Forum Index
  • Curtis Spencer
  • Ezra Burgoyne

2
The Problem
  • Forums provide a wealth of information often
    overlooked by search engines.
  • Semi-structured data provided by forums is not
    taken advantage of by popular search software
    (e.g. Google, Yahoo).
  • Despite being crawled, many useful,
    information-rich posts never appear in results
    due to low PageRank.
  • Discovering what the best forums are for a given
    topic is difficult even when the help of a search
    engine is enlisted.
  • Forum users are often unaware of related
    information found on rival forums.
  • A forum's own search software is often slow and
    returns poor results.

3
Quick summary of solution
  • Forum detection crawlers continually find new
    forums with the help of a web search engine
    (e.g., Dogpile).
  • These discovered forums are eventually wrapped
    in their entirety through a distributed crawler.
  • Forum content collected in the database is
    indexed using the latest MySQL full-text natural
    language index.
  • The search ranking algorithm uses data ignored by
    traditional search engines, such as number of
    replies, number of views, and popularity of the
    poster.

5
Forums supported by Forum Looter
  • phpBB
  • vBulletin
7
Discovering forums
  • Using WordNet, a program serves dictionary words
    and their synonyms to a set of distributed
    crawlers.
  • Every link returned by Dogpile is subjected to a
    detection algorithm that checks URL formats as
    well as common patterns in the markup.
  • The algorithm detects the three most popular
    forum types used on the internet: vBulletin,
    phpBB, and Invision.
  • To be good netizens, we accessed the Dogpile
    website only once every two minutes and
    respected robots.txt.
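
The detection step described above can be sketched as a small set of URL and markup signatures per forum type. This is a minimal illustration in Python (the original system was written in Java); the regular expressions below are assumptions based on commonly seen script names and "Powered by" footers, not the patterns the authors actually used, and `detect_forum` is a hypothetical helper name. The sketch also assumes that either a URL match or a markup match is enough.

```python
import re
from typing import Optional

# Illustrative signatures only; the slides do not list the real patterns.
SIGNATURES = {
    "phpBB": {
        "url": re.compile(r"/(viewtopic|viewforum)\.php", re.I),
        "markup": re.compile(r"powered by\s*(<[^>]+>\s*)*phpBB", re.I),
    },
    "vBulletin": {
        "url": re.compile(r"/(showthread|forumdisplay)\.php", re.I),
        "markup": re.compile(r"powered by\s*(<[^>]+>\s*)*vBulletin", re.I),
    },
    "Invision": {
        "url": re.compile(r"index\.php\?.*\bshow(topic|forum)=", re.I),
        "markup": re.compile(r"powered by\s*(<[^>]+>\s*)*Invision", re.I),
    },
}


def detect_forum(url: str, html: str) -> Optional[str]:
    """Return the first forum type whose URL or markup signature matches."""
    for name, sig in SIGNATURES.items():
        if sig["url"].search(url) or sig["markup"].search(html):
            return name
    return None
```

A crawler would call `detect_forum(link, page_source)` on each search result and keep only the links for which it returns a forum type.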

8
Dogpile example query
10
Distributed forum wrapping (architecture)
  • Synchronized Java RMI server stores a queue of
    jobs.
  • Distributed crawlers retrieve jobs from the
    central RMI server.
  • Distributed crawlers wrap whatever page their
    fetched job specifies and save the results into
    the database.
  • Distributed crawlers can schedule new jobs on
    the RMI server, too.
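
The job flow on this slide can be sketched with an in-process queue standing in for the central RMI server. This is a simplified Python illustration, not the original Java code: `JobQueue`, `crawl`, and the `wrap` callback are hypothetical names, and the de-duplication of already scheduled pages is an assumption the slides do not mention.

```python
import threading
from collections import deque
from typing import Callable, Deque, List, Optional, Set, Tuple


class JobQueue:
    """In-process stand-in for the central job server (originally Java RMI)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()       # plays the role of "synchronized"
        self._pending: Deque[str] = deque()
        self._seen: Set[str] = set()        # assumption: each page is scheduled once

    def schedule(self, url: str) -> bool:
        """Queue a page to wrap; return False if it was already scheduled."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._pending.append(url)
            return True

    def fetch(self) -> Optional[str]:
        """Hand out the next job, or None when the queue is empty."""
        with self._lock:
            return self._pending.popleft() if self._pending else None


def crawl(queue: JobQueue, wrap: Callable[[str], Tuple[str, List[str]]]) -> List[str]:
    """Worker loop: fetch a job, wrap the page, schedule newly found links."""
    results = []
    while (url := queue.fetch()) is not None:
        record, links = wrap(url)
        results.append(record)              # the real crawlers saved to a database
        for link in links:
            queue.schedule(link)
    return results
```

Several worker threads or processes could run `crawl` against the same queue; the lock keeps fetching and scheduling consistent.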

11
More on distributed forum wrapping (access)
  • To be good netizens, each forum website is
    accessed at most once in any 20-second period.
  • At the mentioned rate, it would take two months
    to completely wrap some of the largest forums out
    there.
  • Individual client crawlers set the last request
    time for each forum website, and the RMI server
    kept track of these times.
  • An exponential backoff algorithm was used for
    slow sites.
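
The access policy above can be sketched as a per-forum table of last request times plus a delay that doubles for slow sites. This is a Python illustration under stated assumptions: the 20-second interval comes from the slide, while `MAX_INTERVAL`, the doubling factor, the reset after a normal response, and all names are hypothetical.

```python
from typing import Dict

MIN_INTERVAL = 20.0     # seconds between requests to one forum (from the slide)
MAX_INTERVAL = 3600.0   # hypothetical cap on the backed-off delay


class PolitenessTable:
    """Tracks last request times per forum, as the central server did."""

    def __init__(self) -> None:
        self._last_request: Dict[str, float] = {}
        self._interval: Dict[str, float] = {}

    def wait_time(self, host: str, now: float) -> float:
        """Seconds a crawler must still wait before contacting host."""
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        interval = self._interval.get(host, MIN_INTERVAL)
        return max(0.0, last + interval - now)

    def record_request(self, host: str, now: float, slow: bool = False) -> None:
        """Store the request time; double the delay for a slow site, else reset it."""
        current = self._interval.get(host, MIN_INTERVAL)
        self._interval[host] = min(current * 2, MAX_INTERVAL) if slow else MIN_INTERVAL
        self._last_request[host] = now


def pages_per_period(days: float, interval: float = MIN_INTERVAL) -> int:
    """Number of requests that fit into the given number of days."""
    return int(days * 24 * 60 * 60 / interval)
```

Taking two months as 60 days, `pages_per_period(60)` gives 259,200 requests, which indicates the rough size of a forum that would take that long to wrap at this rate.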

12
More on distributed forum wrapping (performance)
  • The Java RMI server performed very well, using
    only 10% of an AMD Athlon CPU.
  • Individual crawlers were rather memory
    intensive.
  • Individual crawlers used JTidy for parsing pages
    and performed DOM manipulation in addition to
    regular expressions to extract data.
  • The memory use is attributed to Hibernate and
    JTidy.
  • Database access was the main bottleneck in the
    distributed system.

14
Indexing of data
  • A shadow database periodically pulls in forum
    data from the crawl and builds an incremental
    MySQL full-text index.
  • MySQL full-text search has built-in support for
    stop words and document frequency scaling.
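
To show what stop words and document frequency scaling mean in practice, here is a toy scorer in Python. It is not MySQL's actual weighting formula; the stop-word list, the `log(N / df)` scaling, and the function names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List

STOP_WORDS = {"the", "a", "is", "to", "of", "and"}  # tiny illustrative list


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def score(query: str, docs: List[str]) -> List[float]:
    """Score each document: term frequency scaled by log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(
            sum(tf[t] * math.log(n / df[t]) for t in tokenize(query) if df[t])
        )
    return scores
```

A term that appears in every document contributes nothing to the score, while rarer terms count for more; stop words are ignored in both the query and the documents.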

15
Ranking algorithm for search results
  • Forum software has many missed opportunities for
    metadata analysis.
  • We do a hierarchical weighted value calculation
    for each post once it is matched by a natural
    language query in MySQL.
  • An approximation of this calculation is:
  • Value = w(apc) * apc + w(numViews)
    + w(numReplies) + w(isThread())
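
Reading the approximation above as a weighted sum (the operators are an assumption, since the slide text does not preserve them), the per-post value could be sketched as follows. The weights and weight functions are hypothetical, and `apc` is kept under the name used on the slide because its exact meaning is not given there.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    apc: float        # "apc" as named on the slide; meaning not stated there
    num_views: int
    num_replies: int
    is_thread: bool   # True if the post starts a thread


# Hypothetical weights; the slides do not give the real values.
W_APC = 0.01
W_THREAD = 1.0


def w_views(n: int) -> float:
    """Damped weight so very high view counts do not dominate."""
    return math.log1p(n)


def w_replies(n: int) -> float:
    """Replies weighted more heavily than views (an assumption)."""
    return 2.0 * math.log1p(n)


def post_value(p: Post) -> float:
    """Value = w(apc) * apc + w(numViews) + w(numReplies) + w(isThread())."""
    thread_bonus = W_THREAD if p.is_thread else 0.0
    return W_APC * p.apc + w_views(p.num_views) + w_replies(p.num_replies) + thread_bonus
```

Posts matched by the full-text query would then be ordered by this value, possibly combined with the relevance score that the index returns.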

16
Future changes
  • Natural language parsing of the forum post corpus
    so ranking may be affected by replies such as
    "thank you" or "great post".
  • Collaborative filtering of results per query by
    tracking user clicks.
  • Improve resource usage of crawlers.
