Finding Information on the web - PowerPoint PPT Presentation

1
Finding Information on the web
  • Srinivasan Seshadri
  • CTO Kosmix

2
Early Internet (1992-1994)
  • Mozilla Browser
  • People linked to others' home pages and other
    interesting pages
  • People really browsed

3
INTERNET (1995-2002)
  • Search - AltaVista, Lycos
  • Google
  • Used Hyperlink Graph Structure to Rank Results
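
Ranking by hyperlink graph structure can be illustrated with a small power-iteration sketch in the spirit of PageRank. The toy graph, the damping factor of 0.85, and the function name are assumptions made for illustration; this is not Google's production method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration rank over a dict mapping page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up ranked first, and the page it links to inherits much of that rank.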

4
Internet Now
  • Kosmix: bringing back the joys of browsing and
    exploring
  • 360-degree view of any topic
  • Topic home page (why not a topic?)
  • Top informational sites for a topic and a preview
    (snippets) are the results!

5
INFORMATION TYPES
  • Factual Information (Wiki etc.)
  • Videos
  • Images
  • Forum Discussions
  • Questions and Answers
  • News
  • Blogs
  • Structured Information

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
FUTURE OF SEARCH
  • First step towards providing multiple pivot
    points for a topic or search
  • Need to make this conversational and stateful, like
    talking to an expert on the topic

11
Transient Intent and Persistent Intent
  • TRANSIENT INTENT
    • Searching for a needle in the haystack
    • Exploring the haystack for a topic
  • PERSISTENT INTENT
    • Interested in the topic for a long time
    • Carnatic Music, Indian Cricket, Internet
      Industry, Venture Capital

12
INFORMATION
Deliver information to the consumer:
what they want, when they want,
how they want, where they want
13
PERSONALIZED NEWSPAPER
  • My World is Changing
  • Cannot keep track of it
  • Can my world come to me?

14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
MEDIA INDUSTRY AND INTERNET
  • Huge pressure on newspapers
  • Ad spending moving online
  • More and more content online
  • Reputed journalists have their own blogs
  • Content production, aggregation, and distribution
    are becoming disaggregated
  • A vanilla online newspaper does not exploit what
    the internet enables
  • Ability to personalize to nano-interests
  • Publish a personalized newspaper for everyone, at
    any time

19
Key Technology Ingredients
  • Cloud Computing
  • Categorization
  • Relevance

20
Cloud Computing at Kosmix
  • Storage
  • Biggest productivity boost at Kosmix in the first
    year
  • Getting machines to be remotely rebooted!
  • KFS (Kosmix File System) further lowered the
    time to make data accessible after machine
    failures
  • Computation
  • Long-running computations need to be broken into
    small restartable/replayable components
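
One way to picture "small restartable components" is a job that records its progress after every step, so that a rerun after a machine failure resumes instead of starting over. The sketch below is a minimal illustration under assumed names (`run_in_steps`, a JSON checkpoint file); it does not describe Kosmix's actual infrastructure.

```python
import json
import os
import tempfile


def run_in_steps(items, process, checkpoint_path):
    """Process items one at a time, saving progress so a rerun resumes where it stopped."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(items)):
        state["results"].append(process(items[i]))
        state["next_index"] = i + 1
        # Write to a temporary file and rename it, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]


# Simulate a failure on the first attempt at item 3, then restart the job.
ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
calls = []


def square_failing_once_on_three(x):
    calls.append(x)
    if x == 3 and calls.count(3) == 1:
        raise RuntimeError("simulated machine failure")
    return x * x


try:
    run_in_steps([1, 2, 3, 4], square_failing_once_on_three, ckpt)
except RuntimeError:
    pass  # the "machine" died partway through

results = run_in_steps([1, 2, 3, 4], square_failing_once_on_three, ckpt)
```

On the second run, items 1 and 2 are not recomputed; only the step that failed and the steps after it are executed.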

21
Cloud Computing at Kosmix
  • Computation Templates
  • Most of the computation could be expressed as
    some variant of a single table scan and some
    aggregate operation (group by), which Google
    calls MapReduce
  • MapReduce is not friendly enough to non-programmers
  • SQL is not powerful enough in many situations
  • Need a nice scripting language
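
The "single table scan plus group-by aggregate" template can be sketched in a few lines. This is a single-machine illustration only; the function name and the sample rows are assumptions, and a real MapReduce system distributes the scan and the aggregation across many machines.

```python
from collections import defaultdict


def scan_group_aggregate(rows, key_fn, value_fn, reduce_fn):
    """One pass over the rows (map), then one aggregate per key (reduce)."""
    groups = defaultdict(list)
    for row in rows:
        # Map step: emit a (key, value) pair for each scanned row.
        groups[key_fn(row)].append(value_fn(row))
    # Reduce step: collapse each key's values into a single result.
    return {key: reduce_fn(values) for key, values in groups.items()}


# Hypothetical crawl log: bytes fetched per page.
crawl_log = [
    {"host": "a.example", "bytes": 120},
    {"host": "b.example", "bytes": 300},
    {"host": "a.example", "bytes": 80},
]

bytes_per_host = scan_group_aggregate(
    crawl_log,
    key_fn=lambda row: row["host"],
    value_fn=lambda row: row["bytes"],
    reduce_fn=sum,
)
```

The SQL equivalent would be a `SELECT host, SUM(bytes) ... GROUP BY host`; the slide's point is that arbitrary `key_fn` and `reduce_fn` logic is harder to express in SQL.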

22
Opportunity?
  • Many, many companies are trying to provide
    interesting web services
  • A gold mine of information on the web that can
    be used by companies
  • Impractical for each of the companies to build a
    huge web-scale support system (crawling,
    indexing, KFS, MapReduce, etc.)
  • Further, most companies want slivers of the web
    (typically category-based slivers: health
    forums, travel, news sites, etc.)
  • The web and all the derived information is
    perhaps the biggest database. Can someone make
    this accessible and easy to use (using some
    pay-as-you-go model), or is there perhaps some
    non-profit (academia?) angle here?

23
Categorization
  • Concept space: the space in which all connections
    are made within Kosmix
  • Documents, queries, external modules,
    advertisements, and people are all mapped to points
    in this space and matched
  • Documents about the Internet industry or venture
    capital need to be mapped to these categories even
    if they don't contain those exact words

24
Categorization at Kosmix
  • Leverage human-curated sources
  • Wiki corpus is a major source of knowledge
  • Huge automatically curated taxonomy
  • 6 million concepts
  • Building a Concept Graph with relationship labels
    where possible
  • Use a web index to match short pieces of texts
    with concepts and use taxonomy to refine the
    matches
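
The last bullet (match short text against an index, then refine with the taxonomy) can be sketched with toy data. Everything here is hypothetical: the three concepts, their descriptive terms, the parent categories, and the refinement rule (keep only concepts under the best-supported parent) are assumptions for illustration, not Kosmix's actual index or taxonomy.

```python
from collections import Counter, defaultdict

# Hypothetical concept descriptions and taxonomy parents.
CONCEPT_TEXT = {
    "Venture Capital": "startup funding investors term sheet series round",
    "Internet Industry": "web startup search online advertising browser",
    "Carnatic Music": "raga tala concert vocal violin",
}
TAXONOMY_PARENT = {
    "Venture Capital": "Business",
    "Internet Industry": "Business",
    "Carnatic Music": "Arts",
}


def build_index(concept_text):
    """Inverted index: term -> set of concepts whose description contains it."""
    index = defaultdict(set)
    for concept, text in concept_text.items():
        for term in text.lower().split():
            index[term].add(concept)
    return index


def categorize(text, index, parents):
    """Match terms to concepts, then keep only concepts under the best-supported parent."""
    hits = Counter()
    for term in text.lower().split():
        for concept in index.get(term, ()):
            hits[concept] += 1
    if not hits:
        return []
    parent_hits = Counter()
    for concept, count in hits.items():
        parent_hits[parents[concept]] += count
    best_parent = parent_hits.most_common(1)[0][0]
    return [c for c, _ in hits.most_common() if parents[c] == best_parent]


INDEX = build_index(CONCEPT_TEXT)
labels = categorize(
    "investors at a concert discuss search startup funding round",
    INDEX,
    TAXONOMY_PARENT,
)
```

The sample text never contains the words "venture" or "capital", yet it maps to that concept, and the stray match on "concert" is dropped because its parent category has little support.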

25
Relevance
  • Need to combine multiple signals into one number
    to enable ranking
  • Say Query Relevance Score and Page Relevance
    Score (text score and PageRank)
  • Signals need to be made comparable
  • Normalization alone (making ranges the same) is
    not enough
  • Need to reconcile different distributions
  • Deviations from the mean
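
"Deviations from the mean" suggests standardizing each signal before combining it with the others. The sketch below uses z-scores (distance from the mean in units of standard deviation); the sample scores and the equal weighting are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as its deviation from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine(text_scores, link_scores, text_weight=0.5):
    """Weighted sum of the two standardized signals, one number per document."""
    zt = z_scores(text_scores)
    zl = z_scores(link_scores)
    return [text_weight * t + (1 - text_weight) * l for t, l in zip(zt, zl)]


# Hypothetical scores for four documents. Both signals lie in [0, 1],
# but the text scores bunch near the top while the link scores are skewed.
text_scores = [0.90, 0.80, 0.85, 0.20]
link_scores = [0.001, 0.002, 0.5, 0.001]

combined = combine(text_scores, link_scores)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Both signals already share the same range, so range normalization alone would change nothing; standardizing is what makes a large link score count for as much as a large text score.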

26
Relevance
  • More data always beats smarter algorithms
  • Adding position information to the index
    greatly increases quality
  • Adding stemming saw a CTR rise of 10%
  • Adding anchors (and PageRank) distinguished
    Google
  • Counting the origin of anchors (hosts) is a much
    better measure of independent votes
  • Using demand-side popularity (Alexa, Quantcast)
    complements web popularity
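
The point about anchor origins can be sketched by counting distinct linking hosts rather than raw links. The URLs and the function name below are hypothetical, and a real system would also need to handle mirrors and hosts under common ownership.

```python
from urllib.parse import urlsplit


def host_votes(inlinks):
    """Map each target URL to the number of distinct hosts that link to it."""
    hosts = {}
    for source, target in inlinks:
        hosts.setdefault(target, set()).add(urlsplit(source).hostname)
    return {target: len(source_hosts) for target, source_hosts in hosts.items()}


# Hypothetical link data: target A gets three links from one host,
# target B gets two links from two different hosts.
inlinks = [
    ("http://linkfarm.example/p1", "http://target-a.example/"),
    ("http://linkfarm.example/p2", "http://target-a.example/"),
    ("http://linkfarm.example/p3", "http://target-a.example/"),
    ("http://one.example/x", "http://target-b.example/"),
    ("http://two.example/y", "http://target-b.example/"),
]

votes = host_votes(inlinks)
```

By raw link count target A wins three to two, but by independent hosts target B wins two to one.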

27
RELEVANCE
  • What is a news story?
  • Cluster news articles
  • Use size of cluster as a measure of popularity
  • How does one do this efficiently?
  • Needs to be online since interests/queries are ad
    hoc
  • Need to combine some offline preclustering and
    online methods
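
Clustering articles and using cluster size as popularity can be sketched with a single-pass "leader" clustering over headline words. The Jaccard similarity measure, the 0.4 threshold, and the sample headlines are assumptions for illustration; this covers only the online half and omits the offline preclustering the slide calls for.

```python
def tokens(headline):
    """Lowercased set of words in a headline."""
    return set(headline.lower().split())


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_online(headlines, threshold=0.4):
    """Assign each headline to the most similar cluster leader, or start a new cluster."""
    clusters = []  # each cluster: {"leader": set of words, "members": list of headlines}
    for headline in headlines:
        words = tokens(headline)
        best = max(clusters, key=lambda c: jaccard(words, c["leader"]), default=None)
        if best is not None and jaccard(words, best["leader"]) >= threshold:
            best["members"].append(headline)
        else:
            clusters.append({"leader": words, "members": [headline]})
    # Larger clusters are treated as more popular stories.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)


# Hypothetical headlines.
stories = cluster_online([
    "india wins cricket series against australia",
    "india beats australia to win cricket series",
    "kosmix launches new topic pages",
    "australia loses cricket series against india",
])
sizes = [len(story["members"]) for story in stories]
```

Each new headline is compared against every existing cluster, so the cost grows with the number of stories; that is one reason to narrow the candidate set with offline preclustering.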

28
Summary
  • Consumer
    • The internet has come a long way in terms of getting
      information to people
    • Utopian goal of a smart, chatty expert is still far
      away; kosmix.com is a great first step
    • Need good tools to keep on top of the information
      explosion; the personalized newspaper (meehive.com)
      is our first stab at this
  • Technology
    • Need to deal with large volumes of data
    • Efficient data analysis and annotation (e.g.,
      categorization)
    • A humming next-gen database system that grows
      incrementally, is immune to failures, and is
      expressive for non-programmers