Web Data Management - PowerPoint PPT Presentation

1
Web Data Management
  • 198:673
  • Fall 2008
  • Amélie Marian

2
Goals
  • Get background and understanding of the new
    challenges in Web Data Management research
  • Read a mix of influential and recent papers on
    the topic
  • Think of new research directions
  • Hopefully start some research through projects

3
Administrative Details
  • Thursday 1:40-4:40, SEC 207
  • Office hours: Mondays 3-4pm
  • Class web site will have schedule
  • Most of the communication (and slides) will be on
    Sakai

4
Class organization
  • A mix of papers and lectures
  • About 2 papers per week
  • Student presentations
  • In-class discussion
  • 1-2 presentations per student
  • A class project
  • In teams
  • Class presentation at the end of the semester

5
Grading
  • 60% project
  • 20% presentation
  • 20% participation

6
What does that mean?
  • The presentation will require some substantial
    work
  • In-depth understanding of the paper
  • Possibly look into related work
  • Ideas for in-class discussion topics
  • Should meet with me on Monday before presentation
    to go over your slides
  • Class participation is very important
  • Everyone should read the papers before class
  • Students who are not presenting should submit a
    1-page note on each paper by Wednesday evening
  • Summary of the key ideas (do not copy abstract)
  • What you liked/disliked
  • 2-3 questions/discussion points

7
Project
  • No predefined project
  • Ideally should be a potential research project
  • You are responsible for defining your project
  • In the wide context of Web data management
  • I will offer some comments and topic suggestions
    if you are stuck

8
So what about Web Data Management?
  • Didn't Google solve everything?

The following slides are from Raghu Ramakrishnan's (Yahoo!)
keynote talk on Web Data Management
9
Outline
  • Trends in Search and Information Discovery
  • Move towards task-centricity
  • Need to interpret content
  • Evolution of Online Communities
  • Social Search
  • PeopleWeb
  • Community Information Management

10
Trends in Search
11
Structure Intent
Query: seafood san francisco
Category: restaurant; Location: San Francisco
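A minimal sketch of how a keyword query could be mapped to a structured intent like the one above. The dictionaries `CATEGORY_TERMS` and `LOCATIONS` and the function `interpret` are hypothetical, introduced only for illustration; the slides do not describe how the mapping is actually done.

```python
# Hypothetical gazetteers (assumption): a real system would use far larger,
# learned or curated dictionaries.
CATEGORY_TERMS = {"seafood": "restaurant", "sushi": "restaurant", "hotel": "lodging"}
LOCATIONS = {"san francisco", "new york", "boston"}


def interpret(query: str) -> dict:
    """Map a keyword query to a structured intent using dictionary lookups."""
    tokens = query.lower().split()
    intent = {}
    keywords = []
    i = 0
    while i < len(tokens):
        # Try the longest location phrase (up to 3 words) starting at position i.
        for n in (3, 2, 1):
            window = tokens[i:i + n]
            if len(window) == n and " ".join(window) in LOCATIONS:
                intent["location"] = " ".join(window).title()
                i += n
                break
        else:
            # Not a location: check whether the token implies a category.
            if tokens[i] in CATEGORY_TERMS:
                intent["category"] = CATEGORY_TERMS[tokens[i]]
            keywords.append(tokens[i])
            i += 1
    intent["keywords"] = keywords
    return intent


print(interpret("seafood san francisco"))
```

Dictionary lookup is only one possible approach; it is used here because it keeps the example self-contained.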
12
Y! Shortcuts
13
Google Base
14
Supplying Structured Search Content
  • Semantic Web?
  • Unleash community computing: PeopleWeb!
  • Three ways to create semantically rich summaries
    that address the user's information needs
  • Editorial, Extraction, UGC

Challenge: Design social interactions that lead
to creation and maintenance of high-quality
structured content
15
Search and Content Supply
  • Premise
  • People don't want to search
  • People want to get tasks done

(Broder 2002, "A Taxonomy of Web Search")
16
Social Search
  • Improve web search by
  • Learning from shared community interactions, and
    leveraging community interactions to create and
    refine content
  • Enhance and amplify user interactions
  • Expanding search results to include sources of
    information (e.g., experts, sub-communities of
    shared interest)

Reputation, Quality, Trust, Privacy
17
Evolution of Online Communities
18
Rate of content creation
  • Estimated growth of content
  • Published content from traditional sources: 3-4 Gb/day
  • Professional web content: 2 Gb/day
  • User-generated content: 8-10 Gb/day
  • Private text content: 3 Tb/day (200x more)
  • Upper bound on typed content: 700 Tb/day

19
Metadata
  • Estimated growth of metadata
  • Anchortext: 100 Mb/day
  • Tags: 40 Mb/day
  • Pageviews: 100-200 Gb/day
  • Reviews: around 10 Mb/day
  • Ratings: <small>

Drove most advances in search from 1996-present
Increasingly rich and available, but not yet
useful in search
This is in spite of the fact that interactions on
the web are currently limited by the fact that
each site is essentially a silo
20
PeopleWeb: Site-Centric to People-Centric
Global Object Model, Portable Social Environment, Community Search
  • Common web-wide id for objects (incl. users)
  • Even common attributes? (e.g., pixels for camera
    objects)
  • As users move across sites, their personas and
    social networks will be carried along
  • Increased semantics on the web through community
    activity (another path to the goals of the
    Semantic Web)

("Towards a PeopleWeb", Ramakrishnan and Tomkins,
IEEE Computer, August 2007)
21
Facebook Apps, Open Social
  • Web site provides canvas
  • Third party apps can paint on this canvas
  • Paint comes from data on and off-network
  • Via APIs that each site chooses to expose
  • What is the core asset of a web portal?
  • What are the computational implications?
  • App hosting and caching
  • Dynamic, personalized content
  • Searching over spaghetti information threads

23
Web Search Results for Lisa
Latest news results for Lisa. Mostly about
people because Lisa is a popular name
41 results from My Web!
Web search results are very diversified, covering
pages about organizations, projects, people,
events, etc.
24
Save / Tag Pages You Like
Enter your note for personal recall and sharing
purposes
You can save / tag pages you like into My Web
from toolbar / bookmarklet / save buttons
You can pick tags from the suggested tags based
on collaborative tagging technology
Type-ahead based on the tags you have used
You can specify a sharing mode
You can save a cache copy of the page content
(Courtesy Raymie Stata)
25
My Web 2.0 Search Results for Lisa
Excellent set of search results from my community
because a couple of people in my community are
interested in Usenix Lisa-related topics
27
Tech Support at COMPAQ
In newsgroups, conversations disappear and you
have to ask the same question over and over
again. The thing that makes the real difference
is the ability for customers to collaborate and
have information be persistent. That's how we
found QUIQ. It's exactly the philosophy we're
looking for.
Tech support people can't
keep up with generating content and are not
experts on how to effectively utilize the product.
Mass Collaboration is the next step in Customer
Service. (Steve Young, VP of Customer
Care, Compaq)
28
Timely Answers
77% of answers provided within 24h
  • No effort to answer each question
  • No added experts
  • No monetary incentives for enthusiasts

Questions: 6,845 (74% answered)
Answers provided in 3h: 40% (2,057)
Answers provided in 12h: 65% (3,247)
Answers provided in 24h: 77% (3,862)
Answers provided in 48h: 86% (4,328)
29
Mass Contribution
Users who on average provide only 2 answers
provide 50% of all answers
Answers: 100% (6,718), of which 50% (3,329) contributed by mass of users
Contributing users: top users 7% (120), mass of users 93% (1,503)
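The headline claim can be checked against the counts on this slide. The sketch below assumes that the 120 top users and the 1,503 other users together make up all contributing users, and that the top users wrote every answer not attributed to the mass of users; the slide does not state either assumption explicitly.

```python
# Counts taken from the slide.
total_answers = 6718
mass_answers = 3329   # answers contributed by the mass of users
mass_users = 1503
top_users = 120

# Assumption: answers not written by the mass of users come from the top users.
top_answers = total_answers - mass_answers

mass_answer_share = mass_answers / total_answers
mass_user_share = mass_users / (mass_users + top_users)
answers_per_mass_user = mass_answers / mass_users
answers_per_top_user = top_answers / top_users

print(f"share of answers from mass users: {mass_answer_share:.1%}")
print(f"share of contributors who are mass users: {mass_user_share:.1%}")
print(f"answers per mass user: {answers_per_mass_user:.1f}")
print(f"answers per top user: {answers_per_top_user:.1f}")
```

Under these assumptions the mass of users averages roughly two answers each, which matches the slide's wording.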
30
Interesting Problems
  • Question categorization
  • Detecting undesirable questions and answers
  • Identifying trolls
  • Ranking results in Answers search
  • Finding related questions
  • Estimating question and answer quality

(Byron Dom, SIGIR talk)
31
Better Search via Information Extraction
  • Extract, then exploit, structured data from raw
    text

For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying...

Select Name From PEOPLE Where Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Result: Bill Gates, Bill Veghte
(from Cohen's IE tutorial, 2003)
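The "exploit" half of this example can be sketched with Python's built-in sqlite3 module. The extraction step is not implemented here: the rows are assumed to have already been extracted from the passage, and the table and column names are chosen only for illustration.

```python
import sqlite3

# Assumption: these tuples are the output of some extractor run on the passage.
extracted_rows = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", extracted_rows)

# Structured query over the extracted data, equivalent to the one on the slide.
cursor = conn.execute(
    "SELECT name FROM people WHERE organization = ? ORDER BY name",
    ("Microsoft",),
)
names = [row[0] for row in cursor]
print(names)
conn.close()
```

Once the text has been turned into rows, a question that keyword search handles poorly ("who works at Microsoft?") becomes a one-line query.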
32
Community Information Management (CIM)
  • Many real-life communities have a Web presence
  • Database researchers, movie fans, stock traders
  • Each community = many data sources + people
  • Members want to query and track at a semantic
    level
  • Any interesting connection between researchers X
    and Y?
  • List all courses that cite this paper
  • Find all citations of this paper in the past one
    week on the Web
  • What is new in the past 24 hours in the database
    community?
  • Which faculty candidates are interviewing this
    year, where?

33
DBLife
  • Integrated information about a (focused)
    real-world community
  • Collaboratively built and maintained by the
    community
  • Semantic web via extraction + community

34
DBLife
  • Faculty: AnHai Doan, Raghu Ramakrishnan
  • Students: P. DeRose, W. Shen, F. Chen, R. McCann,
    Y. Lee, M. Sayyadian
  • Prototype system up and running since early 2005
  • 1164 sources, crawled daily, 11000 pages / day
  • 160 MB, 121400 people mentions, 5600 persons
  • See DE overview article, CIDR 2007 demo

35
Entity Extraction and Resolution
co-authors A. Doan, Divesh Srivastava, ...
Raghu Ramakrishnan
36
Resulting ER Graph
37
Challenges
  • Extraction
  • Domain-level vs. site-level extraction
    templates
  • Compositional, customizable approach to
    extraction planning
  • Blending extraction with other sources (feeds,
    wiki-style user edits)
  • Maintenance of extracted information
  • Managing information extraction
  • Incremental maintenance of extracted views at
    large scales
  • Mass collaboration: community-based maintenance
  • Exploitation
  • Search/query over extracted structures in a
    community
  • Search across communities: Semantic Web through
    the back door!
  • Detect interesting events and changes

38
Mass Collaboration: A Simplified Example
Not David!
Picture is removed if enough users vote no.
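A minimal sketch of the voting rule described above. The slide does not say how many votes count as "enough", so the thresholds `min_votes` and `no_fraction`, and the class name `PictureVotes`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PictureVotes:
    """Tally of user votes on whether a picture shows the right person."""
    yes: int = 0
    no: int = 0

    def vote(self, is_correct: bool) -> None:
        if is_correct:
            self.yes += 1
        else:
            self.no += 1

    def should_remove(self, min_votes: int = 5, no_fraction: float = 0.6) -> bool:
        # Hypothetical policy: require a minimum number of votes, then remove
        # the picture if a large enough share of them are "no".
        total = self.yes + self.no
        return total >= min_votes and self.no / total >= no_fraction


picture = PictureVotes()
for is_correct in [False, False, True, False, False]:
    picture.vote(is_correct)
print(picture.should_remove())
```

Requiring a minimum number of votes keeps a single user from removing a picture alone, though it does not by itself handle coordinated spam.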
39
Mass Collaboration Meets Spam
Jeffrey F. Naughton swears that this is David J.
DeWitt
40
Incorporating Feedback
"A. Gupta, D. Smith, Text mining, SIGMOD-06"
System extracted "Gupta, D" as a person name
User says this is wrong
System extracted "Gupta, D" using rules:
(R1) "David Gupta" is a person name
(R2) If "first-name last-name" is a person name, then "last-name, f"
is also a person name.
  • Knowing this, system can potentially improve
    extraction accuracy.
  • Discover corrective rules
  • Find and fix other incorrect applications of R1
    and R2

A general framework for incorporating feedback?
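A toy sketch of this example: each extraction records which rules produced it, so a user's "this is wrong" can be traced back to R1 and R2. The function names and the substring-matching strategy are assumptions for illustration, not the system's actual design.

```python
from collections import Counter

# R1: "David Gupta" is a person name (stored as first name, last name).
KNOWN_NAMES = {("David", "Gupta")}


def r2_variant(first: str, last: str) -> str:
    """R2: if 'first last' is a person name, then 'last, f' is one too."""
    return f"{last}, {first[0]}"


def extract(text: str) -> list:
    """Return (mention, rules_used) pairs; the rules serve as provenance."""
    found = []
    for first, last in sorted(KNOWN_NAMES):
        variant = r2_variant(first, last)
        if variant in text:
            found.append((variant, ("R1", "R2")))
    return found


def blame(wrong_extractions: list) -> Counter:
    """Count how often each rule contributed to an extraction marked wrong."""
    counts = Counter()
    for _mention, rules in wrong_extractions:
        counts.update(rules)
    return counts


citation = "A. Gupta, D. Smith, Text mining, SIGMOD-06"
extractions = extract(citation)   # "Gupta, D" spans two different authors
flagged = blame(extractions)      # the user marked every extraction as wrong
print(extractions, flagged)
```

With the provenance recorded, the system could look up other extractions that used the same rules and re-check them, which is the "find and fix" step in the bullets above.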
41
Collaborative Editing
  • Users should be able to
  • Correct/add to the imported data
  • E.g., User imports a paper, system provides bib
    item
  • Challenges
  • Incentives, reputation
  • Handling malicious/spam users
  • Ownership model
  • My home page vs. a citation that appears on it
  • Reconciliation
  • Extracted vs. manual input
  • Conflicting input from different users

42
Provenance and Collaboration
  • Provenance/lineage/explanation becomes a key
    issue if we want to leverage user feedback to
    improve the quality of extraction over time.
  • Explanations must be succinct, from the end-user
    perspective, not from the derivation perspective
  • Maintaining an extracted view on a collection
    of documents over time is very costly; getting
    feedback from users can help
  • In fact, distributing the maintenance task across
    a large group of users may be the best approach

43
Summary
  • Online communities represent a tremendous
    resource for organizing information online
  • Extraction + mass collaboration = semantics
  • Web is becoming
  • More people-centric, less site-centric
  • Highly intertwined, distributed, dynamic,
    personalized
  • Models of ownership, trust, incentives?
  • Next generation of search algorithms and
    infrastructure?