Title: MOBS: Mass Collaboration to Build Systems
1MOBS: Mass Collaboration to Build Systems
- Prof. AnHai Doan, Alexander Kramnik, Rob McCann, Warren Shen, Olu Sobulo, Vanitha Varadarajan
- DAIS Research at UIUC
2Data Retrieval Problem
Unstructured Query, Broad Coverage
Structured Query, Low Coverage
3Solution: Data Integration Systems
Structured Query
Broad Coverage
4Building A Data Integration System
Construct Mediated Interface
5Building A Data Integration System
Locate Sources
6Building A Data Integration System
Learn To Translate A Query
7Building A Data Integration System
Combine Results For User
8Building A Data Integration System Is Expensive
- Automated tools help but are inaccurate.
- Today, DI systems are still built manually!
- There are very few DI systems on the web.
9Use Mass Collaboration
MOBS: Mass Collaboration to Build Systems
- Treat a DI system as having a finite set of parameters.
- System admins construct and deploy an initial system shell.
- Users help the system converge to correct parameter values.
10Parameterize System
Parameter: Form 1 is an online bookstore
Value: Yes / No
11Parameterize System
Parameter: Form 2, WRITER matches Author
Value: Yes / No
12Ask The Users
13Comparison to Database Tuning
- Database Tuning
- set values of physical-design knobs (e.g., buffer size)
- using feedback from query execution (time, resources consumed, etc.)
- to further improve query execution performance
- Mass Collaboration for DI Systems
- set values of logical-design knobs (e.g., "a matches b?")
- using feedback from users
- to improve system correctness and further expand
14Mass Collaboration Used In Many Places
- Review sites (Amazon, IMDb, Epinions)
- Open-Source software (Linux, GNU)
- Knowledge Bases (mindpixel, openmind, bibserv)
- Peer-to-Peer Systems (Napster)
- The World Wide Web
- Wired Magazine (11/03) highlighted the growing popularity of mass collaboration.
- Why Not Data Integration?
15Potential High Impact
- If it succeeds:
- Dramatically reduce development cost and time
- Launch numerous DI systems on the Web and in enterprises
- everyday domains: books, movies, cars, travel, etc.
- "niche" domains, e.g., fire fighting
- scientific domains, e.g., bioinformatics
- within/across enterprises
- Applicable to other data management tasks
- building P2P systems, info extraction from text, Semantic Web, ...
- Our current work
- Start by exploring a few initial settings
- Online Bookstores, Hub Finder, Publication Extractor
- Use these settings to understand key challenges.
- Develop, deploy, evaluate general solutions.
- Proposed a new solution to building DI systems
- Mass Collaboration
- Proposed several methods for gathering feedback
- Monopoly, Better Service, Cooperative
- Evaluated our approach across a variety of populations and deployed 3 such systems on the web
- Bookstores, Hubs, Peanut
17Integrating Online Bookstores
- As a first step, we restrict our attention to two crucial tasks
- Interface Recognition
- Query Translation
18Interface Recognition
19Interface Recognition
20Query Translation
First: Label Each Field With A Relevant Semantic Concept
Ranked Concepts
21Query Translation
First: Label Each Field With A Relevant Semantic Concept
Ranked Concepts
- Title
- Author
- Genre
- ISBN
24Query Translation
25Query Translation
How To Translate A Query To A Source?
Author-Specific Rules
- If 1 text field: "FN LN" or "LN, FN"
- If 2 text fields: LN, FN or FN, LN
- If a drop-down: search for LN
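The author-specific rules above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the MOBS implementation: the names `translate_author`, `form_kind`, and `last_name_first` are hypothetical, and the name order a source expects is assumed to be already known (for example, from user feedback).

```python
from dataclasses import dataclass


@dataclass
class Author:
    first: str  # FN on the slide
    last: str   # LN on the slide


def translate_author(author: Author, form_kind: str,
                     last_name_first: bool = False) -> list[str]:
    """Return the value(s) to submit to a source's author input(s).

    form_kind and last_name_first are hypothetical descriptors of the
    source's query form; they are not names taken from the slides.
    """
    fn, ln = author.first, author.last
    if form_kind == "one_text_field":
        # A single box takes either "FN LN" or "LN, FN".
        return [f"{ln}, {fn}"] if last_name_first else [f"{fn} {ln}"]
    if form_kind == "two_text_fields":
        # Two boxes take the names separately, in the source's order.
        return [ln, fn] if last_name_first else [fn, ln]
    if form_kind == "drop_down":
        # A drop-down is searched by last name only.
        return [ln]
    raise ValueError(f"unknown form kind: {form_kind}")
```

For example, `translate_author(Author("Jane", "Doe"), "one_text_field", last_name_first=True)` yields a single value in "LN, FN" order.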
26Query Translation
How To Translate A Query To A Source?
27Query Translation
How To Translate A Query To A Source?
- Formulate a Mass Collaboration Framework suitable for such DI tasks.
- Motivate Users To Participate.
- Handle Malicious / Error-Prone Users.
- Resolve Differences In User Opinion.
- Boost Accuracy Of Mass Collaboration Results.
- Reduce Workload Placed Upon Users.
29Mass Collaboration Framework
- Cast a DI task into a sequence of simple binary decisions.
- Is this source an online bookstore?
- Is this field used to query on Author?
- Making these decisions is equivalent to solving the particular DI task.
- Simple decisions can be solved by users.
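One way to picture the casting step is to enumerate the yes/no questions for a set of sources. This is an illustrative sketch only: the function `binary_questions` and its input shapes are assumptions, and the question wording simply mirrors the two examples on the slide.

```python
from itertools import product


def binary_questions(sources, fields_by_source, concepts):
    """Enumerate yes/no questions for interface recognition and field labeling.

    sources: list of source names
    fields_by_source: dict mapping a source name to its form field names
    concepts: semantic concepts such as "Title" or "Author"
    All three shapes are hypothetical choices made for this sketch.
    """
    # Interface recognition: one question per source.
    questions = [f"Is {s} an online bookstore?" for s in sources]
    # Field labeling: one question per (field, concept) pair of each source.
    for s in sources:
        for field, concept in product(fields_by_source.get(s, []), concepts):
            questions.append(
                f"Is field '{field}' of {s} used to query on {concept}?"
            )
    return questions
```

Answering every question in the returned list fixes both which sources are bookstores and which concept each field carries, which is the sense in which the decisions solve the task.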
30User Participation
- Sell a Monopolized Service: CS 311 Site
- Sell Improved Service: Query Interface
- Cooperative Environment: A Community
- There are several ways to collect feedback.
31Malicious Users
Inject known questions to detect and remove bad users.
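A minimal sketch of the known-question check follows. The thresholds `min_checked` and `min_accuracy` and the function name `flag_bad_users` are assumptions for illustration; the slides do not say how many known questions are injected or what error rate marks a user as bad.

```python
def flag_bad_users(answers, known, min_checked=3, min_accuracy=0.8):
    """Return the set of users whose accuracy on known questions is too low.

    answers: dict mapping user -> {question_id: bool answer}
    known:   dict mapping question_id -> correct bool answer
    min_checked and min_accuracy are hypothetical thresholds.
    """
    bad = set()
    for user, given in answers.items():
        # Only questions with a known answer can be used to grade a user.
        checked = [q for q in given if q in known]
        if len(checked) < min_checked:
            continue  # not enough evidence yet to judge this user
        correct = sum(given[q] == known[q] for q in checked)
        if correct / len(checked) < min_accuracy:
            bad.add(user)
    return bad
```

Answers from flagged users would then be dropped before any consensus is formed.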
32Forming a Consensus
- Must decide when and how each decision is made from given feedback.
- Our current scheme
- If one answer (Yes/No) has a convincing lead, immediately make that decision.
- Otherwise, at some max number of answers, take the majority opinion.
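The scheme above can be sketched as a function over the stream of answers for one decision. The values `lead=3` and `max_answers=9` are assumptions chosen for illustration, since the slides do not say what counts as a convincing lead or what the cap is.

```python
def decide(answers, lead=3, max_answers=9):
    """Return True (Yes), False (No), or None if still undecided.

    answers: iterable of bool answers in arrival order.
    lead and max_answers are hypothetical parameter values.
    """
    yes = no = 0
    for answer in answers:
        if answer:
            yes += 1
        else:
            no += 1
        # A convincing lead settles the decision immediately.
        if abs(yes - no) >= lead:
            return yes > no
        # At the cap, fall back to the majority opinion. A tie is only
        # possible with an even cap and counts as No in this sketch.
        if yes + no >= max_answers:
            return yes > no
    return None  # more answers are needed
```

An odd cap avoids ties at the majority step.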
33Overcoming Exponential Error
- Dependent (sequential) decisions snowball error exponentially fast.
Use A Semi-Parallel Pool Scheme
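To see why sequential decisions compound error, consider a simplified model that is an assumption of this sketch rather than something stated on the slides: each decision is correct independently with probability `p`, and a chain of `n` dependent decisions is correct only if every one of them is.

```python
def chain_accuracy(p: float, n: int) -> float:
    """Probability that all n sequential decisions are correct,
    assuming each is independently correct with probability p."""
    return p ** n


if __name__ == "__main__":
    # Accuracy of the whole chain decays geometrically with its length.
    for n in (1, 5, 10, 20):
        print(n, round(chain_accuracy(0.95, n), 3))
```

Under this model, even fairly accurate individual decisions give a much less accurate chain, which motivates making decisions in parallel where possible.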
34Pool Scheme
- We need parallelism to be accurate, but
- Round-Robin is too much work
- Lessen workload by Zooming
35Empirical Evaluation
- Interface Recognition
- 55 Query Forms (from Books, Movies, and Music)
Fully Automated: selected 24 forms, 0.70 precision, 0.89 recall
Automation + Mass Collab: selected 17 of the 24 forms, 1.00 precision, 0.89 recall
75 users averaged 8 answers
36Empirical Evaluation
- Query Translation
- Just deployed on CS 311 Site.
- Labeled 2 sources (5 fields) with 100% accuracy, with 53 users averaging 5 answers apiece.
- Simulated 500 Users over 18 Sources (114 Fields)
37Empirical Evaluation
- Query Translation
- Simulated 500 Users over 18 Sources (114 Fields)
8 Answers
Workload is spread thinly.
38MOBS Is A General Technique
- We were also able to use MOBS to build two applications on the Surface Web.
- Hubs: Locate CS faculty hubs.
- Peanut: Locate CS publication lists.
39Hubs System
- Overall Goal: Build an IR system over content found in CS faculty homepages.
- First Step: Use Mass Collaboration to find faculty homepages. Do this using department hubs.
- Look only for 1 page per site.
- More stable than individual homepages.
40Hubs System
- Use automated techniques to crawl a dept website and create a ranked list of hub candidates.
- Use mass collaboration to select the correct hub from the candidates.
- Is this page a CS faculty hub?
- Currently deployed on Prof. Zhai's 397 site
- Finished 2 CS depts.
- Machine learning was wrong in both cases.
- Mass Collaboration boosted accuracy from 0% to 100%,
- with the help of 20 users averaging 9.5 answers.
41Peanut System
- Overall Goal: Build a mini-CiteSeer over relevant CS publications.
- Standard MOBS Approach: Automatically parse candidate lists from web pages and use mass collaboration to verify/reject candidates.
- Is this a publication list?
- Currently deployed atop the Google search engine
- Extracted 10 publication lists from 50 lists over 10 pages
- with 100% precision and 100% recall
- with the help of 10 users averaging 8.4 answers.
42Related Work
- Mass Collaboration
- Review sites, knowledge bases, open-source.
- Data Integration
- Lots of research, but few works addressing the entire process (Rosenthal & Seligman, 01).
- Autonomic Systems
- A system built by mass collaboration exhibits self-healing and self-improving behavior.
43Future Work
- Theoretical guarantees of accuracy/speed.
- Automatic self-tuning in response to observations on the population.
- More sophisticated consensus techniques (e.g., linear regression).
- Maintain over changing sources.
- Most importantly, find The Application.
44System Demos
- Bookstore
- http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mobs.pl?system4
- http://hanoi.cs.uiuc.edu/cs311
- http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mb_system_deep_web_portal.pl
- Hubs
- http://hanoi.cs.uiuc.edu/cs397
- http://hanoi.cs.uiuc.edu/cgi-bin/deploy/hubs_public_stats.pl
- Peanut
- http://hanoi.cs.uiuc.edu/peanut