Web Communities: The World Online

1
Web Communities The World Online
  • Raghu Ramakrishnan
  • Chief Scientist for Audience and Cloud Computing
  • Research Fellow
  • Yahoo!
  • (On leave, Univ. of Wisconsin-Madison)

2
Outline
  • Evolution of Online Communities
  • Social Search
  • PeopleWeb
  • Trends in Search and Information Discovery
  • Move towards task-centricity
  • Need to interpret content
  • Community Information Management
  • Web Data Infrastructure
  • Massively distributed computing and hosted
    cloud services

3
Evolution of Online Communities
5
Rate of content creation
  • Estimated growth of content
  • Published content from traditional sources: 3-4
    Gb/day
  • Professional web content: 2 Gb/day
  • User-generated content: 8-10 Gb/day
  • Private text content: 3 Tb/day (200x more)
  • Upper bound on typed content: 700 Tb/day

(Towards a PeopleWeb, Ramakrishnan & Tomkins,
IEEE Computer, August 2007)
6
Metadata
  • Estimated growth of metadata
  • Anchortext: 100Mb/day
  • Tags: 40Mb/day
  • Pageviews: 100-200Gb/day
  • Reviews: Around 10Mb/day
  • Ratings: <small>

Drove most advances in search from 1996-present
Increasingly rich and available, but not yet
useful in search
This is in spite of the fact that interactions on
the web are currently limited by the fact that
each site is essentially a silo
7
PeopleWeb: Site-Centric to People-Centric
Global Object Model
Portable Social Environment
Community Search
  • Common web-wide id for objects (incl. users)
  • Even common attributes? (e.g., pixels for camera
    objects)
  • As users move across sites, their personas and
    social networks will be carried along
  • Increased semantics on the web through community
    activity (another path to the goals of the
    Semantic Web)

(Towards a PeopleWeb, Ramakrishnan & Tomkins,
IEEE Computer, August 2007)
8
Content Access and Ownership
(Slide courtesy Andrew Tomkins)
9
Facebook Apps, Open Social
  • Web site provides canvas
  • Third party apps can paint on this canvas
  • Paint comes from data on and off-network
  • Via APIs that each site chooses to expose
  • What is the core asset of a web portal?
  • What are the computational implications?
  • App hosting and caching
  • Dynamic, personalized content
  • Searching over spaghetti information threads

10
Trends in Search
11
Search and Content Supply
  • Premise
  • People don't want to search
  • People want to get tasks done

Broder 2002, A Taxonomy of web search
12
Structure Intent
Query: seafood san francisco
Category: restaurant; Location: San Francisco
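A minimal sketch of the query interpretation shown on this slide, assuming a tiny hand-built lookup of category terms and place names. The dictionaries and the `interpret` function are hypothetical and only illustrate the idea of mapping a keyword query to structured intent; a production system would not work from fixed tables like these.

```python
# Hypothetical lookup tables; real systems would learn or curate far larger ones.
CATEGORY_TERMS = {"seafood": "restaurant", "sushi": "restaurant", "hotel": "lodging"}
LOCATIONS = {"san francisco": "San Francisco", "new york": "New York"}


def interpret(query):
    """Map a keyword query to a dict of structured attributes."""
    remaining = query.lower()
    intent = {}
    # Match multi-word locations first so their tokens are not reused below.
    for phrase, canonical in LOCATIONS.items():
        if phrase in remaining:
            intent["Location"] = canonical
            remaining = remaining.replace(phrase, " ")
            break
    # Any remaining token that names a known category term sets the category.
    for token in remaining.split():
        if token in CATEGORY_TERMS:
            intent["Category"] = CATEGORY_TERMS[token]
            break
    return intent


print(interpret("seafood san francisco"))
```

Queries with no recognizable term simply return an empty dict, which a caller could treat as "fall back to plain keyword search".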
13
Y! Shortcuts
14
Google Base
15
Search as Killer App for Web Data Semantics
  • Publishers and search engine collaborate
  • Example: Abstracts surfacing structured content
  • Users see richer search experience
  • Accomplish their tasks faster and more effectively

16
Social Search
17
Social Search
  • Explicitly open up search
  • Enable communities, sites and consumers to
    explicitly re-define search results (e.g.,
    SearchMonkey, Boss)
  • What is the right unit for a search result? Can
    we intelligently stitch together more
    informative abstracts, possibly from multiple
    sources?
  • Facilitate creation of specialized ranking
    engines based on different kinds of tasks, or
    aimed at different communities of users
  • Implicitly leverage socially engaged users and
    their interactions
  • Learning from shared community interactions, and
    leveraging community interactions to create and
    refine content
  • Expanding search results to include sources of
    information
  • E.g., Experts, sub-communities of shared
    interest, particular search engines (in a world
    with many, this is valuable!)

Reputation, Quality, Trust, Privacy
18
Opening Up Yahoo! Search
  • Phase 1: Giving site owners and developers
    control over the appearance of Yahoo! Search
    results.
  • Phase 2: BOSS takes Yahoo!'s open strategy to
    the next level by providing Yahoo! Search
    infrastructure and technology to developers and
    companies to help them build their own search
    experiences.

(Slide courtesy Prabhakar Raghavan)
19
What Is It?
An open platform for using structured data to
build more useful and relevant search results
Before
After
(Slide courtesy Amit Kumar)
20
What's New?
task links: buy this, user reviews, best trips
media: product images, business photos, profile
pictures
user choice: remove, report spam
favicon
send result: share this rich result with others
structured data: review ratings, product
prices, hours of operation
(Slide courtesy Amit Kumar)
21
How Does It Work?
(Slide courtesy Amit Kumar)
22
Publishing Structured Data Support for Emerging
Semantic Web Standards
  • Microformats
  • hCard, hEvent, hReview, hAtom, XFN
  • More as they get adopted
  • RDFa and eRDF markup
  • OpenSearch
  • extensions to return structured data
  • Atom/RSS Feeds
  • extensions to embed structured data

markup (crawl)
apis (pull)
push
(Slide courtesy Andrew Tomkins)
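As an illustration of the "markup (crawl)" path, here is a minimal sketch of how a crawler might pull hCard fields out of a page using only Python's standard `html.parser`. The class names `vcard`, `fn`, `org`, and `tel` come from the hCard microformat; the `HCardParser` class, the `parse_hcards` helper, and the sample markup are assumptions made for this example, and the sketch assumes well-nested markup with no void elements such as `<br>` inside the card.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collects fn/org/tel text from elements nested inside class="vcard"."""

    FIELDS = ("fn", "org", "tel")

    def __init__(self):
        super().__init__()
        self.cards = []
        self._stack = []      # one marker per open element
        self._current = None  # the card being filled, if any

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if "vcard" in classes:
            self._current = {}
            self.cards.append(self._current)
            self._stack.append("vcard")
            return
        field = next((f for f in self.FIELDS if f in classes), None)
        self._stack.append(field if self._current is not None else None)

    def handle_endtag(self, tag):
        if self._stack and self._stack.pop() == "vcard":
            self._current = None

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if self._current is not None and field in self.FIELDS and data.strip():
            self._current[field] = self._current.get(field, "") + data.strip()


def parse_hcards(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.cards


SAMPLE_HTML = (
    '<div class="vcard"><span class="fn">Raghu Ramakrishnan</span>, '
    '<span class="org">Yahoo!</span></div>'
)
print(parse_hcards(SAMPLE_HTML))
```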
23
Infobars Integrating 3rd Party Data
Pull in data from any web service
(Slide courtesy Amit Kumar)
24
Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
(Slide courtesy Andrew Tomkins)
25
BOSS Offerings
BOSS offers two options for companies and
developers and has partnered with top technology
universities to drive search experimentation,
innovation and research into next generation
search.
  • API: A self-service, web services model for
    developers and start-ups to quickly build and
    deploy new search experiences.
  • CUSTOM: Working with 3rd parties to build a more
    relevant, brand/site specific web search
    experience. This option is jointly built by
    Yahoo! and select partners.
  • ACADEMIC: Working with the following
    universities to allow for wide-scale research in
    the search field
  • University of Illinois Urbana Champaign
  • Carnegie Mellon University
  • Stanford University
  • Purdue University
  • MIT
  • Indian Institute of Technology Bombay
  • University of Massachusetts

(Slide courtesy Prabhakar Raghavan)
26
BOSS Could Enable Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
27
Partner Examples
28
Web Search Results for Lisa
Latest news results for Lisa. Mostly about
people because Lisa is a popular name
41 results from My Web!
Web search results are very diversified, covering
pages about organizations, projects, people,
events, etc.
29
Save / Tag Pages You Like
Enter your note for personal recall and sharing
purpose
You can save / tag pages you like into My Web
from toolbar / bookmarklet / save buttons
You can pick tags from the suggested tags based
on collaborative tagging technology
Type-ahead based on the tags you have used
You can specify a sharing mode
You can save a cache copy of the page content
(Courtesy Raymie Stata)
30
My Web 2.0 Search Results for Lisa
Excellent set of search results from my community
because a couple of people in my community are
interested in Usenix Lisa-related topics
31
Google Co-Op
Query-based direct-display, programmed by
Contributor
This query matches a pattern provided by
Contributor
so SERP displays (query-specific) links
programmed by Contributor.
Subscribed Link
edit remove
Users opt in by subscribing to them
33
Tech Support at COMPAQ
In newsgroups, conversations disappear and you
have to ask the same question over and over
again. The thing that makes the real difference
is the ability for customers to collaborate and
have information be persistent. That's how we
found QUIQ. It's exactly the philosophy we're
looking for.
Tech support people can't
keep up with generating content and are not
experts on how to effectively utilize the product.
Mass Collaboration is the next step in Customer
Service.
Steve Young, VP of Customer Care, Compaq
34
How It Works
[Diagram labels: Customer, QUESTION, KNOWLEDGE BASE,
SELF SERVICE, Support Agent, ANSWER]
Answer added to power self service
35
Timely Answers
77% of answers provided within 24h
  • No effort to answer each question
  • No added experts
  • No monetary incentives for enthusiasts

Questions: 6,845 (74% answered)
Answers provided in 3h: 40% (2,057)
Answers provided in 12h: 65% (3,247)
Answers provided in 24h: 77% (3,862)
Answers provided in 48h: 86% (4,328)
36
Power of Knowledge Creation
[Diagram labels: Support, Shield 1, Shield 2,
Knowledge Creation, Support Incidents, Agent Cases]
Self-Service (*): 80%
Customer Mass Collaboration (*): 5-10%
(*) Averages from QUIQ implementations
37
Mass Contribution
Users who on average provide only 2 answers
provide 50% of all answers
Answers: 100% (6,718)
Contributed by mass of users: 50% (3,329)
Contributing users
Top users: 7% (120)
Mass of users: 93% (1,503)
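The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the two user groups partition all contributors and that the answers not attributed to the mass of users came from the top users; the slide implies this but does not state it, so the per-top-user average is a derived number rather than a reported one.

```python
# Counts as they appear on the slide.
TOTAL_ANSWERS = 6718
MASS_ANSWERS = 3329
TOP_USERS = 120
MASS_USERS = 1503


def contribution_stats():
    """Derive shares and per-user averages from the slide's raw counts."""
    all_users = TOP_USERS + MASS_USERS
    top_answers = TOTAL_ANSWERS - MASS_ANSWERS  # assumption: the remainder
    return {
        "mass_share_of_answers_pct": round(100 * MASS_ANSWERS / TOTAL_ANSWERS),
        "top_share_of_users_pct": round(100 * TOP_USERS / all_users),
        "answers_per_mass_user": round(MASS_ANSWERS / MASS_USERS, 1),
        "answers_per_top_user": round(top_answers / TOP_USERS, 1),
    }


for name, value in contribution_stats().items():
    print(name, value)
```

Under these assumptions the mass of users averages about two answers each, which matches the slide's headline claim.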
38
Interesting Problems
  • Question categorization
  • Detecting undesirable questions & answers
  • Identifying trolls
  • Ranking results in Answers search
  • Finding related questions
  • Estimating question & answer quality

(Byron Dom SIGIR talk)
39
Supplying Structured Search Content
  • Semantic Web?
  • Unleash community computing: PeopleWeb!
  • Three ways to create semantically rich summaries
    that address the user's information needs
  • Editorial, Extraction, UGC

Challenge: Design social interactions that lead
to creation and maintenance of high-quality
structured content
40
Better Search via Information Extraction
  • Extract, then exploit, structured data from raw
    text

For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
(from Cohen's IE tutorial, 2003)
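To make "extract, then exploit" concrete, here is a toy sketch that fills a PEOPLE table from the sample passage and then runs the slide's query. The three regular expressions are hand-written for this one passage and are an assumption of the example, not the extraction method of the tutorial or of any real system; the table is an in-memory SQLite database.

```python
import re
import sqlite3

TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates was against open source. "
    "But today he appears to have changed his mind. "
    '"We can be open source. We love the concept of shared source," said Bill '
    "Veghte, a Microsoft VP. Richard Stallman, founder of the Free Software "
    "Foundation, countered."
)

NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"
# Hypothetical patterns, each exposing name/title/org groups.
PATTERNS = [
    re.compile(rf"(?P<org>[A-Z]\w+) Corporation (?P<title>CEO) (?P<name>{NAME})"),
    re.compile(rf"(?P<name>{NAME}), a (?P<org>[A-Z]\w+) (?P<title>VP)"),
    re.compile(
        rf"(?P<name>{NAME}), (?P<title>founder) of the "
        r"(?P<org>[A-Z]\w+(?: [A-Z]\w+)*)"
    ),
]


def extract_people(text):
    """Extract step: return (name, title, organization) tuples found in text."""
    rows = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            rows.append((m["name"], m["title"], m["org"]))
    return rows


def names_at(rows, organization):
    """Exploit step: load the rows into a table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT name FROM people WHERE organization = ? ORDER BY name",
        (organization,),
    )
    return [row[0] for row in cursor]


print(names_at(extract_people(TEXT), "Microsoft"))
```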
41
Community Information Management (CIM)
  • Many real-life communities have a Web presence
  • Database researchers, movie fans, stock traders
  • Each community = many data sources + people
  • Members want to query and track at a semantic
    level
  • Any interesting connection between researchers X
    and Y?
  • List all courses that cite this paper
  • Find all citations of this paper in the past one
    week on the Web
  • What is new in the past 24 hours in the database
    community?
  • Which faculty candidates are interviewing this
    year, where?
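Queries like these presuppose that extracted facts are stored as typed, timestamped relationships between entities. The sketch below shows one way two of them ("any connection between X and Y?" and "what is new in the past 24 hours?") could be answered over such a store. The `Mention` record, its field names, and the sample data are all hypothetical and are not taken from any system described here.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Mention(NamedTuple):
    """One extracted fact: subject --relation--> target, seen at a given time."""
    subject: str
    relation: str
    target: str
    seen_at: datetime


def connections(mentions: List[Mention], x: str, y: str) -> List[Mention]:
    """Facts that link x and y directly, in either direction."""
    return [m for m in mentions if {m.subject, m.target} == {x, y}]


def recent(mentions: List[Mention], now: datetime,
           window: timedelta = timedelta(hours=24)) -> List[Mention]:
    """Facts first seen within the window ending at `now`."""
    return [m for m in mentions if now - window <= m.seen_at <= now]


# Hypothetical sample data.
NOW = datetime(2008, 1, 2, 12, 0)
MENTIONS = [
    Mention("researcher_x", "co-author", "researcher_y", datetime(2008, 1, 2, 9, 0)),
    Mention("course_a", "cites", "paper_p", datetime(2007, 12, 1, 0, 0)),
]

print(connections(MENTIONS, "researcher_y", "researcher_x"))
print(recent(MENTIONS, NOW))
```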

42
DBLife
  • Integrated information about a (focused)
    real-world community
  • Collaboratively built and maintained by the
    community
  • Semantic web via extraction + community

43
DBLife
  • Faculty: AnHai Doan & Raghu Ramakrishnan
  • Students: P. DeRose, W. Shen, F. Chen, R. McCann,
    Y. Lee, M. Sayyadian
  • Prototype system up and running since early 2005
  • Plan to release a public version of the system in
    Spring 2007
  • 1164 sources, crawled daily, 11000 pages / day
  • 160 MB, 121400 people mentions, 5600 persons
  • See DE overview article, CIDR 2007 demo

44
DBLife Papers
  • Efficient Information Extraction over Evolving
    Text Data, F. Chen, A. Doan, J. Yang, R.
    Ramakrishnan. ICDE-08.
  • Building Structured Web Community Portals: A
    Top-Down, Compositional, and Incremental
    Approach, P. DeRose, W. Shen, F. Chen, A. Doan,
    R. Ramakrishnan. VLDB-07.
  • Declarative Information Extraction Using Datalog
    with Embedded Extraction Predicates, W. Shen, A.
    Doan, J. Naughton, R. Ramakrishnan. VLDB-07.
  • Source-aware Entity Matching: A Compositional
    Approach, W. Shen, A. Doan, J.F. Naughton, R.
    Ramakrishnan. ICDE-07.
  • OLAP over Imprecise Data with Domain Constraints,
    D. Burdick, A. Doan, R. Ramakrishnan, S.
    Vaithyanathan. VLDB-07.
  • Community Information Management, A. Doan, R.
    Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R.
    McCann, M. Sayyadian, and W. Shen. IEEE Data
    Engineering Bulletin, Special Issue on
    Probabilistic Databases, 29(1), 2006.
  • Managing Information Extraction, A. Doan, R.
    Ramakrishnan, S. Vaithyanathan. SIGMOD-06
    Tutorial.

45
DBLife
  • Integrate data of the DB research community
  • 1164 data sources

Crawled daily: 11000 pages (160 MB) / day
46
Entity Extraction and Resolution
co-authors A. Doan, Divesh Srivastava, ...
Raghu Ramakrishnan
47
Resulting ER Graph
48
Challenges
  • Extraction
  • Domain-level vs. site-level extraction
    templates
  • Compositional, customizable approach to
    extraction planning
  • Blending extraction with other sources (feeds,
    wiki-style user edits)
  • Maintenance of extracted information
  • Managing information extraction
  • Incremental maintenance of extracted views at
    large scales
  • Mass Collaboration: community-based maintenance
  • Exploitation
  • Search/query over extracted structures in a
    community
  • Search across communities: Semantic Web through
    the back door!
  • Detect interesting events and changes

49
Mass Collaboration
  • We want to leverage user feedback to improve the
    quality of extraction over time.
  • Maintaining an extracted view on a collection
    of documents over time is very costly; getting
    feedback from users can help
  • In fact, distributing the maintenance task across
    a large group of users may be the best approach

50
Mass Collaboration: A Simplified Example
Not David!
Picture is removed if enough users vote no.
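A minimal sketch of the voting rule on this slide. The slide only says "enough users"; the threshold of three "no" votes and the function name `should_remove` are assumptions of this example. Votes are keyed by user so that each user counts at most once.

```python
from typing import Dict


def should_remove(votes: Dict[str, bool], min_no_votes: int = 3) -> bool:
    """Return True once enough distinct users say the picture is wrong.

    `votes` maps a user id to True ("yes, this is the person") or
    False ("no, it is not"). The threshold is a hypothetical default.
    """
    no_votes = sum(1 for is_match in votes.values() if not is_match)
    return no_votes >= min_no_votes


votes = {"u1": False, "u2": False, "u3": True}
print(should_remove(votes))   # two "no" votes so far
votes["u4"] = False
print(should_remove(votes))   # a third "no" vote arrives
```

A fixed count like this is easy to game by spam accounts, which is the concern the next slide raises; weighting votes by user reputation would be one possible refinement.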
51
Mass Collaboration Meets Spam
Jeffrey F. Naughton swears that this is David J.
DeWitt
52
Incorporating Feedback
"A. Gupta, D. Smith, Text mining, SIGMOD-06"
System extracted "Gupta, D" as a person name
User says this is wrong
System extracted "Gupta, D" using rules
(R1) "David Gupta" is a person name
(R2) If "first-name last-name" is a person name,
then "last-name, f" is also a person name.
  • Knowing this, system can potentially improve
    extraction accuracy.
  • Discover corrective rules
  • Find and fix other incorrect applications of R1
    and R2

A general framework for incorporating feedback?
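One way to picture "find and fix other incorrect applications of R1 and R2" is to record, for every extraction, which rules produced it. The sketch below does that for the example citation string. The `Extraction` record, the second dictionary name ("Tom Smith"), and the helper names are hypothetical, and this is an illustration of provenance tracking rather than the framework the slide asks about.

```python
from typing import List, NamedTuple, Tuple


class Extraction(NamedTuple):
    span: str                # text labeled as a person name
    rules: Tuple[str, ...]   # provenance: rules that produced the label


TEXT = "A. Gupta, D. Smith, Text mining, SIGMOD-06"
# R1-style facts. "Tom Smith" is a made-up entry added for illustration.
KNOWN_NAMES = ["David Gupta", "Tom Smith"]


def r2_variant(full_name: str) -> str:
    """R2: 'first-name last-name' implies 'last-name, f'."""
    first, last = full_name.split()
    return f"{last}, {first[0]}"


def extract(text: str) -> List[Extraction]:
    found = []
    for name in KNOWN_NAMES:
        if name in text:
            found.append(Extraction(name, ("R1",)))
        variant = r2_variant(name)
        if variant in text:
            found.append(Extraction(variant, ("R1", "R2")))
    return found


def suspects(extractions: List[Extraction], flagged: Extraction) -> List[Extraction]:
    """Other extractions derived with exactly the same rules as the flagged one."""
    return [e for e in extractions if e != flagged and e.rules == flagged.rules]


results = extract(TEXT)
flagged = results[0]  # the user says "Gupta, D" is wrong
print([e.span for e in results])
print([e.span for e in suspects(results, flagged)])
```

Here the same rule chain also fires on "Smith, T" (from "Smith, Text mining"), so one piece of user feedback surfaces a second likely error for review.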
53
Collaborative Editing
  • Users should be able to
  • Correct/add to the imported data
  • E.g., User imports a paper, system provides bib
    item
  • Challenges
  • Incentives, reputation
  • Handling malicious/spam users
  • Ownership model
  • My home page vs. a citation that appears on it
  • Reconciliation
  • Extracted vs. manual input
  • Conflicting input from different users

54
The Purple SOX Project
(SOcial eXtraction)
Application Layer
Shopping, Travel, Autos
Academic Portals (DBLife/MeYahoo)
Enthusiast Platform
and many others
Operator Library
55
Web Data Management: Massively Distributed Hosted
Systems
56
Two Key Subsystems
  • Serving system
  • Takes queries and returns results
  • Supports low-latency updates
  • Content system
  • Gathers input of various kinds (including
    crawling)
  • Generates the data sets used by serving system
  • Both highly parallel

Serving system goal: scaleup. Hardware increments
support larger loads.
Content system goal: speedup. Hardware increments
speed computations.
[Diagram labels: Users, Serving System, Data sets,
Logs, Data updates, Content System, Web sites]
(Courtesy Raymie Stata)
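The slide names the two goals without defining them. The sketch below uses the definitions commonly given for parallel database systems, which is an assumption about what was meant: speedup compares elapsed times for the same job on a small and a larger system, while scaleup compares a small job on the small system with a proportionally larger job on the larger system. The function names and the timings are illustrative only.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same job, more hardware: how many times faster does it finish?"""
    return small_system_time / big_system_time


def scaleup(small_job_on_small_system: float, big_job_on_big_system: float) -> float:
    """Bigger job, proportionally more hardware: 1.0 means elapsed time held steady."""
    return small_job_on_small_system / big_job_on_big_system


# Illustrative timings in seconds (made up for the example).
print(speedup(120.0, 30.0))  # with 4x the hardware, 4.0 would be linear speedup
print(scaleup(2.0, 2.5))     # below 1.0: the larger load got somewhat slower
```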
57
An Example Web App
Heavy use of simple database operations
Updates
Queries
58
The Problem
What does it take to build the next big app?
59
Why Hosted?
simple API
  • No maintenance worries for application
  • Single ops team
  • Resource sharing leads to savings

60
Data Analysis Platforms
  • Understanding online communities, and
    provisioning their data needs
  • Exploratory analysis over massive data sets
  • Challenges: Analyze shared, evolving social
    networks of users, content, and interactions to
    learn models of individual preferences and
    characteristics and of community structure and
    dynamics; develop robust frameworks for
    evolution of authority and trust; extract and
    exploit structure from web content
  • Examples
  • Bigtable, Map-Reduce, Hadoop, PIG
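For readers unfamiliar with the programming model behind Map-Reduce and Hadoop, here is a single-process sketch of its three steps (map, group by key, reduce) applied to counting tags. It runs on one machine and only illustrates the model; it does not use Hadoop or any of the listed systems, and the `map_reduce` helper and the sample data are made up for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(save):
    """Emit (tag, 1) for every tag a user attached to a saved page."""
    _user, tags = save
    for tag in tags:
        yield tag, 1


def count_reducer(_tag, ones):
    return sum(ones)


# Hypothetical input: (user, tags) pairs from a bookmarking service.
SAVES = [
    ("u1", ["database", "search"]),
    ("u2", ["search"]),
    ("u3", ["search", "wiki"]),
]
TAG_COUNTS = map_reduce(SAVES, tag_mapper, count_reducer)
print(TAG_COUNTS)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both steps can be spread across many machines, which is what the hosted platforms provide.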

61
The Bigger Picture
  • Software-as-a-service
  • E.g., Salesforce.com
  • Hosted data systems
  • E.g., Amazon's S3/Dynamo and EC2
  • Web application development
  • Ning, Ruby-on-rails
  • Change tracking
  • Stream management

62
Implications
  • Data management as a service
  • Scientists and others who've resisted
    (installing, maintaining, and) using DBMSs will
    find it much easier to reap the benefits
  • Data centers and Computing Centers will come
    into vogue again
  • Hosted back-ends and RAD tools will make Web
    application development accessible to all
  • The Web is becoming open
  • E.g., OpenSocial, OpenID
  • Ideas will be the most valuable currency, not the
    wherewithal to build complex systems
  • Paradigm shifts possible for how we do research
    in many fields
  • Build applications that embed your algorithms and
    test them directly in the field: computer
    scientists can interact directly with users
    (ironically, this would still be a breakthrough
    of sorts after four decades!)
  • Many other disciplines (e.g., Sociology,
    microeconomics) can design and conduct online
    experiments involving unprecedented numbers of
    participants

63
Summary
  • Online communities represent a tremendous
    resource for organizing information online
  • Open APIs and cloud services -> mass engagement
  • Extraction + mass collaboration -> semantics
  • Web is becoming
  • More people-centric, less site-centric
  • Highly intertwined, distributed, dynamic,
    personalized
  • Models of ownership, trust, incentives?
  • Next generation of search algorithms and
    infrastructure?

64
Further Reading
  • Content, Metadata, and Behavioral Information:
    Directions for Yahoo! Research, The Yahoo!
    Research Team, IEEE Data Engineering Bulletin,
    Dec 2006 (Special Issue on Web-Scale Data,
    Systems, and Semantics)
  • Systems, Communities, Community Systems on the
    Web, Community Systems Group at Yahoo! Research,
    SIGMOD Record, Sept 2007
  • Towards a PeopleWeb, R. Ramakrishnan and A.
    Tomkins, IEEE Computer, August 2007 (Special
    Issue on Web Search)