Title: MOBS: Mass Collaboration to Build Systems
1. MOBS: Mass Collaboration to Build Systems
- Prof. AnHai Doan, Alexander Kramnik, Rob McCann, Warren Shen, Olu Sobulo, Vanitha Varadarajan
- DAIS Research at UIUC
2. Data Retrieval Problem
- Unstructured query: broad coverage
- Structured query: low coverage
3. Solution: Data Integration Systems
- Structured query
- Broad coverage
4. Building A Data Integration System
Construct Mediated Interface
5. Building A Data Integration System
Locate Sources
6. Building A Data Integration System
Learn To Translate A Query
7. Building A Data Integration System
Combine Results For User
8. Building A Data Integration System Is Expensive
- Automated tools help but are inaccurate.
- Today, DI systems are still built manually!
- There are very few DI systems on the web.
9. Use Mass Collaboration
MOBS: Mass Collaboration to Build Systems
- Treat a DI system as having a finite set of parameters.
- System admins construct and deploy an initial system shell.
- Users help the system converge to the correct parameter values.
10. Parameterize System
Parameter: Form 1 is an online bookstore
Value: ? (Yes / No)
11. Parameterize System
Parameter: Form 2 WRITER matches Author
Value: ? (Yes / No)
12. Ask The Users
13. Comparison to Database Tuning
- Database Tuning
  - set values of physical-design knobs (e.g., buffer size)
  - using feedback from query execution (time, resources consumed, etc.)
  - to further improve query execution performance
- Mass Collaboration for DI Systems
  - set values of logical-design knobs (e.g., "does a match b?")
  - using feedback from users
  - to improve system correctness and further expand the system
14. Mass Collaboration Used In Many Places
- Review sites (amazon, imdb, epinions)
- Open-source software (Linux, GNU)
- Knowledge bases (mindpixel, openmind, bibserv)
- Peer-to-peer systems (napster)
- The World Wide Web
- Wired Magazine (11/03) highlighted the growing popularity of mass collaboration.
- Why not data integration?
15. Potential High Impact
- If it succeeds
  - Dramatically reduce development cost and time
  - Launch numerous DI systems on the Web and in enterprises
    - everyday domains: books, movies, cars, travel, etc.
    - "niche" domains, e.g., fire fighting
    - scientific domains, e.g., bioinformatics
    - within/across enterprises
  - Applicable to other data management tasks
    - building P2P systems, info extraction from text, Semantic Web, ...
- Our current work
  - Start by exploring a few initial settings
    - Online Bookstores, Hub Finder, Publication Extractor
  - Use these settings to understand key challenges.
  - Develop, deploy, evaluate general solutions.
16. Contributions
- Proposed a new solution to building DI systems: Mass Collaboration
- Proposed several methods for gathering feedback
  - Monopoly, Better Service, Cooperative
- Evaluated our approach across a variety of populations and deployed 3 such systems on the web
  - Bookstores, Hubs, Peanut
17. Integrating Online Bookstores
- As a first step, we restrict our attention to two crucial tasks
  - Interface Recognition
  - Query Translation
18. Interface Recognition
[Figure: two query forms, each labeled "Candidate"]
19. Interface Recognition
20-23. Query Translation
First, label each field with a relevant semantic concept.
Ranked Concepts
- Title
- Author
- Genre
- ISBN
24. Query Translation
25. Query Translation
How To Translate A Query To A Source?
Author-Specific Rules (FN = first name, LN = last name)
- If 1 text field: "FN LN" / "LN, FN"
- If 2 text fields: "LN", "FN" / "FN", "LN"
- If a drop-down: search for LN
26. Query Translation
How To Translate A Query To A Source?
27. Query Translation
How To Translate A Query To A Source?
Example field values: Frost, Robert
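The author-specific rules on slide 25 could be sketched roughly as follows. This is only an illustration: the function name, the `form_kind` labels, and the dictionary keys are hypothetical, and reading each rule as a pair of alternative fillings to try is an assumption about what the slide intended.

```python
from typing import Dict, List


def author_query_variants(first: str, last: str, form_kind: str) -> List[Dict[str, str]]:
    """Return candidate ways to fill a source's author field(s).

    form_kind is a hypothetical label: "one_text", "two_text", or "dropdown".
    """
    if form_kind == "one_text":
        # One text field: try "FN LN", then "LN, FN".
        return [{"field1": f"{first} {last}"}, {"field1": f"{last}, {first}"}]
    if form_kind == "two_text":
        # Two text fields: the source may expect (LN, FN) or (FN, LN) order.
        return [
            {"field1": last, "field2": first},
            {"field1": first, "field2": last},
        ]
    if form_kind == "dropdown":
        # Drop-down: search the option list for the last name.
        return [{"select_containing": last}]
    raise ValueError(f"unknown form kind: {form_kind}")


one_field = author_query_variants("Robert", "Frost", "one_text")
two_fields = author_query_variants("Robert", "Frost", "two_text")
dropdown = author_query_variants("Robert", "Frost", "dropdown")
```

Which variant a given source actually accepts is exactly the kind of yes/no parameter that users could be asked to settle.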
28. Challenges
- Formulate a mass collaboration framework suitable for such DI tasks.
- Motivate users to participate.
- Handle malicious / error-prone users.
- Resolve differences in user opinion.
- Boost accuracy of mass collaboration results.
- Reduce workload placed upon users.
29. Mass Collaboration Framework
- Cast a DI task into a sequence of simple binary decisions.
  - Is this source an online bookstore?
  - Is this field used to query on Author?
- Making these decisions is equivalent to solving the particular DI task.
- Simple decisions can be solved by users.
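One way to picture this casting step is the sketch below. It is not the MOBS implementation; the `Question` class and both helper functions are hypothetical names introduced only to show how a source's recognition and field-labeling tasks could be expanded into yes/no questions and how the settled answers map back to a solution.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Question:
    source: str
    field: Optional[str]    # None for source-level questions
    concept: Optional[str]  # None for source-level questions
    text: str               # the yes/no question shown to a user


def questions_for_source(source: str, fields: List[str], concepts: List[str]) -> List[Question]:
    """Expand one source into a list of simple binary questions."""
    qs = [Question(source, None, None, f"Is {source} an online bookstore?")]
    for fld in fields:
        for concept in concepts:
            qs.append(Question(source, fld, concept,
                               f"Is field '{fld}' used to query on {concept}?"))
    return qs


def field_labels(decisions: Dict[Question, bool]) -> Dict[str, str]:
    """Read the field -> concept labeling off the settled 'yes' decisions."""
    return {q.field: q.concept
            for q, yes in decisions.items()
            if yes and q.field is not None}


qs = questions_for_source("Form 2", ["WRITER"], ["Title", "Author"])
decisions = {qs[0]: True, qs[1]: False, qs[2]: True}
labels = field_labels(decisions)
```

Here the three decisions together say that Form 2 is a bookstore and that its WRITER field queries on Author, which is the solved DI task for that source.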
30. User Participation
There are several ways to collect feedback:
- Sell a monopolized service: CS 311 site
- Sell improved service: query interface
- Cooperative environment: a community
31. Malicious Users
Inject known questions to detect and remove bad users.
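A minimal sketch of this idea, assuming users are scored on questions whose answers the system already knows. The function name and the `min_accuracy` and `min_gold_seen` thresholds are hypothetical choices for illustration, not values from the talk.

```python
from typing import Dict, Set


def untrusted_users(
    answers: Dict[str, Dict[str, bool]],  # user -> {question id -> answer}
    gold: Dict[str, bool],                # question id -> known correct answer
    min_accuracy: float = 0.8,            # hypothetical threshold
    min_gold_seen: int = 3,               # hypothetical minimum evidence
) -> Set[str]:
    """Return users whose accuracy on known questions is too low."""
    bad = set()
    for user, user_answers in answers.items():
        graded = [(q, a) for q, a in user_answers.items() if q in gold]
        if len(graded) < min_gold_seen:
            continue  # not enough known questions answered yet
        correct = sum(1 for q, a in graded if a == gold[q])
        if correct / len(graded) < min_accuracy:
            bad.add(user)
    return bad


gold = {"g1": True, "g2": False, "g3": True}
answers = {
    "alice": {"g1": True, "g2": False, "g3": True, "q9": True},
    "mallory": {"g1": False, "g2": True, "g3": True},
    "bob": {"g1": False},
}
flagged = untrusted_users(answers, gold)
```

In this example only "mallory" is flagged; "bob" has answered too few known questions to be judged either way.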
32. Forming a Consensus
- Must decide when and how each decision is made from the given feedback.
- Our current scheme
  - If one answer (Yes/No) has a convincing lead, immediately make that decision.
  - Otherwise, at some max number of answers, take the majority opinion.
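The scheme above can be sketched as a small function called after each new answer arrives. The slide does not say what counts as a "convincing lead" or what the maximum is, so `lead=3` and `max_answers=9` are assumed values, and returning `None` on an exact tie is also an assumption.

```python
from typing import List, Optional


def decide(answers: List[bool], lead: int = 3, max_answers: int = 9) -> Optional[bool]:
    """Return the decision (True = Yes, False = No), or None if still undecided.

    lead and max_answers are assumed values, not taken from the slides.
    """
    yes = sum(answers)
    no = len(answers) - yes
    if abs(yes - no) >= lead:
        # One answer has a convincing lead: decide immediately.
        return yes > no
    if len(answers) >= max_answers and yes != no:
        # Answer budget exhausted: take the majority opinion.
        return yes > no
    return None  # keep collecting answers (or an exact tie at the cap)


early = decide([True, True, True])
undecided = decide([True, False, True])
majority = decide([True, False] * 4 + [True])
```

Choosing an odd `max_answers` avoids ties at the cap, so the majority branch always yields a decision.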
33. Overcoming Exponential Error
- Dependent (sequential) decisions snowball error exponentially fast.
Use a semi-parallel pool scheme.
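As a rough illustration of the snowballing, assume (this is not stated on the slide) that each decision in a chain of n dependent decisions is made correctly with the same probability p, independently of the others. The chain is correct only if every decision is:

```latex
\[
P(\text{all } n \text{ decisions correct}) = p^{n},
\qquad
\text{e.g. } p = 0.9,\ n = 10:\quad 0.9^{10} \approx 0.35 .
\]
```

Under that assumption, even fairly reliable individual decisions give a poor end result once ten of them depend on one another.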
34. Pool Scheme
- We need parallelism to be accurate, but
  - Round-robin is too much work
  - Lessen workload by zooming
35. Empirical Evaluation
- Interface Recognition
  - 55 query forms (from Books, Movies, and Music pages)

Method                     Forms selected   Precision   Recall
Fully automated            24               0.70        0.89
Automation + mass collab   17 of the 24     1.00        0.89

75 users averaged 8 answers
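For readers who want to relate these figures to counts, the sketch below recomputes precision and recall. The slide gives only the number of selected forms, so `TRUE_POS = 17` and `RELEVANT = 19` are values inferred here because they roughly reproduce the reported figures; they are assumptions, not data from the talk.

```python
def precision(true_pos: int, selected: int) -> float:
    """Fraction of selected forms that are correct."""
    return true_pos / selected


def recall(true_pos: int, relevant: int) -> float:
    """Fraction of all correct forms that were selected."""
    return true_pos / relevant


TRUE_POS = 17  # inferred, not stated on the slide
RELEVANT = 19  # inferred, not stated on the slide

auto_p = precision(TRUE_POS, 24)        # automation selected 24 forms
auto_r = recall(TRUE_POS, RELEVANT)
mc_p = precision(TRUE_POS, 17)          # mass collaboration kept 17 of the 24
mc_r = recall(TRUE_POS, RELEVANT)
```

Under these inferred counts, mass collaboration removes the incorrectly selected forms without losing any correct ones, which is why precision rises while recall stays the same.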
36. Empirical Evaluation
- Query Translation
  - Just deployed on CS 311 site.
  - Labeled 2 sources (5 fields) with 100% accuracy, with 53 users averaging 5 answers apiece.
  - Simulated 500 users over 18 sources (114 fields)
37. Empirical Evaluation
- Query Translation
  - Simulated 500 users over 18 sources (114 fields)
[Figure label: 8 answers]
Workload is spread thinly.
38. MOBS Is A General Technique
- We were also able to use MOBS to build two applications on the Surface Web.
  - Hubs: locate CS faculty hubs.
  - Peanut: locate CS publication lists.
39. Hubs System
- Overall goal: build an IR system over content found in CS faculty homepages.
- First step: use mass collaboration to find faculty homepages. Do this using department hubs.
  - Look only for 1 page per site.
  - More stable than individual homepages.
40. Hubs System
- Use automated techniques to crawl a dept website and create a ranked list of hub candidates.
- Use mass collaboration to select the correct hub from the candidates.
  - Is this page a CS faculty hub?
- Currently deployed on Prof. Zhai's 397 site
  - Finished 2 CS depts.
  - Machine learning was wrong in both cases.
  - Mass collaboration boosted accuracy from 0% to 100%, with the help of 20 users averaging 9.5 answers.
41. Peanut System
- Overall goal: build a mini-citeseer over relevant CS publications.
- Standard MOBS approach: automatically parse candidate lists from web pages and use mass collaboration to verify/reject candidates.
  - Is this a publication list?
- Currently deployed atop the Google search engine
  - Extracted 10 publication lists from 50 lists over 10 pages
  - with 100% precision and 100% recall
  - with the help of 10 users averaging 8.4 answers.
42. Related Work
- Mass Collaboration
  - Review sites, knowledge bases, open-source.
- Data Integration
  - Lots of research, but few works address the entire process (Rosenthal & Seligman, 01).
- Autonomic Systems
  - A system built by mass collaboration exhibits self-healing and self-improving qualities.
43. Future Work
- Theoretical guarantees of accuracy/speed.
- Automatic self-tuning in response to observations on the population.
- More sophisticated consensus techniques (e.g., linear regression).
- Maintain over changing sources.
- Most importantly, find The Application.
44. System Demos
- Bookstore
  - http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mobs.pl?system4
  - http://hanoi.cs.uiuc.edu/cs311
  - http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mb_system_deep_web_portal.pl
- Hubs
  - http://hanoi.cs.uiuc.edu/cs397
  - http://hanoi.cs.uiuc.edu/cgi-bin/deploy/hubs_public_stats.pl
- Peanut
  - http://hanoi.cs.uiuc.edu/peanut