Title: Semex: A Platform for Personal Information Management and Integration
1. Semex: A Platform for Personal Information Management and Integration
- Xin (Luna) Dong
- University of Washington
- June 24, 2005
2-3. Is Your Personal Information a Mine or a Mess?
[Figure labels: Intranet, Internet]
4. Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
5. Index Data from Different Sources
E.g., Google, MSN desktop search
[Figure labels: Intranet, Internet]
6. Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?
7. Organize Data in a Semantically Meaningful Way
[Figure labels: Intranet, Internet]
8. Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?
- Which of the SIGMOD'05 authors do I know?
9. Integrate Organizational and Public Data with Personal Data
[Figure labels: Intranet, Internet]
11. SEMEX (SEMantic EXplorer) I: Provide a Logical View of Data
12. SEMEX (SEMantic EXplorer) II: On-the-fly Data Integration
13. How to Find Alon's Papers on My Desktop?
14. How to Find Alon's Papers on My Desktop? Google Search Results
Search: Alon Halevy
Send me the semex demo slides again?
15. How to Find Alon's Papers on My Desktop? Google Search Results
Search: Alon Halevy
Ignore previous request, I found them
16. How to Find Alon's Papers on My Desktop? Google Search Results
17. Semex Goal
- Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
- Build the logical view automatically
  - Extract object instances and associations
  - Remove duplicate instances
- Leverage the logical view for on-the-fly data integration
- Exploit the logical view for information search and browsing to improve people's productivity
- Be resilient to the evolution of the logical view
18-19. An Ideal PIM Is a Magic Wand
20. Outline
- Problem definition and project goals
- Technical issues
  - System architecture and instance extraction [CIDR'05]
  - Reference reconciliation [SIGMOD'05]
  - On-the-fly data integration
  - Association search and browsing
  - Domain model personalization and evolution [WebDB'05]
- Interleaved with Semex demo [Best demo in SIGMOD'05]
- Overarching PIM themes
21. System Architecture
[Figure labels: Data Analysis Module, Domain Management Module, Data Collection Module, Domain Manager]
23. Reference Reconciliation in Semex
[Figure: references to one person, grouped under the labels "Names" and "Emails". References shown: Xin (Luna) Dong; Lab-dong xin; dong xin luna; luna; x. dong; dongxin; xin dong]
24. Semex Without Reference Reconciliation
Search results for "luna": 23 persons
luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)
25. Semex Without Reference Reconciliation
Search results for "luna": 23 persons
Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)
26. Semex Without Reference Reconciliation
A Platform for Personal Information Management and Integration
27. Semex Without Reference Reconciliation
9 persons: dong xin, xin dong
28. Semex NEEDS Reference Reconciliation
29. Reference Reconciliation
- A very active area of research in databases, data mining, and AI (surveyed in Cohen et al., 2003)
- Traditional approaches assume matching tuples from a single table
- Based on pair-wise comparisons
- Harder in our context
30-31. Challenges
- Article
  - a1("Bounds on the Sample Complexity of Bayesian Learning", "703-746", {p1, p2, p3}, c1)
  - a2("Bounds on the sample complexity of bayesian learning", "703-746", {p4, p5, p6}, c2)
- Venue
  - c1("Computational learning theory", "1992", "Austin, Texas")
  - c2("COLT", "1992", null)
- Person
  - p1("David Haussler", null)
  - p2("Michael Kearns", null)
  - p3("Robert Schapire", null)
  - p4("Haussler, D.", null)
  - p5("Kearns, M. J.", null)
  - p6("Schapire, R.", null)
  - p7("Robert Schapire", "schapire@research.att.com")
  - p8(null, "mkearns@cis.uppen.edu")
  - p9("mike", "mkearns@cis.uppen.edu")
- Challenge 1: Multiple classes
- Challenge 2: Limited information
- Challenge 3: Multi-valued attributes
32. Intuition
- Complex information spaces can be considered as networks of instances and associations between the instances
- Key: exploit the network, specifically the clues hidden in the associations
33. I. Exploiting Richer Evidence
- Cross-attribute similarity: name and email
  - p5("Stonebraker, M.", null)
  - p8(null, "stonebraker@csail.mit.edu")
- Context information I: contact list
  - p5("Stonebraker, M.", null, {p4, p6})
  - p8(null, "stonebraker@csail.mit.edu", {p7})
  - p6 = p7
- Context information II: authored articles
  - p2("Michael Stonebraker", null)
  - p5("Stonebraker, M.", null)
  - p2 and p5 authored the same article
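The first two kinds of evidence on this slide can be illustrated with a minimal sketch. The slides do not give Semex's actual similarity functions, so `name_matches_email` and `share_contact` below are hypothetical helpers that use deliberately simple heuristics.

```python
import re

def name_tokens(name):
    """Lowercase alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())

def name_matches_email(name, email):
    """Cross-attribute evidence (assumed heuristic): a multi-letter name
    token appears inside the local part of the email address."""
    local = email.split("@", 1)[0].lower()
    long_tokens = [t for t in name_tokens(name) if len(t) > 1]
    return any(t in local for t in long_tokens)

def share_contact(contacts_a, contacts_b, reconciled_pairs):
    """Context evidence (assumed heuristic): some contact of one reference
    is identical to, or already reconciled with, a contact of the other."""
    for x in contacts_a:
        for y in contacts_b:
            if x == y or frozenset((x, y)) in reconciled_pairs:
                return True
    return False

# p5 and p8 from the slide: no shared attribute, but the name resembles the email.
print(name_matches_email("Stonebraker, M.", "stonebraker@csail.mit.edu"))
# p5's contacts {p4, p6} and p8's contacts {p7}, given that p6 = p7.
print(share_contact({"p4", "p6"}, {"p7"}, {frozenset(("p6", "p7"))}))
```

Either signal alone is weak; the point of the slide is that combining them gives evidence that attribute-by-attribute comparison cannot see.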
34. Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: person references: 24076; real-world persons (gold standard): 1750; labeled value: 3159]
35. Considering Richer Evidence Improves the Recall
[Chart: person references: 24076; real-world persons: 1750]
36. II. Propagate Information between Reconciliation Decisions
- Article
  - a1("Distributed Query Processing", "169-180", {p1, p2, p3}, c1)
  - a2("Distributed query processing", "169-180", {p4, p5, p6}, c2)
- Venue
  - c1("ACM Conference on Management of Data", "1978", "Austin, Texas")
  - c2("ACM SIGMOD", "1978", null)
- Person
  - p1("Robert S. Epstein", null)
  - p2("Michael Stonebraker", null)
  - p3("Eugene Wong", null)
  - p4("Epstein, R.S.", null)
  - p5("Stonebraker, M.", null)
  - p6("Wong, E.", null)
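A minimal sketch of the propagation idea, using the references on this slide: once the two articles are judged to be the same, that decision is pushed to their venues and to authors with compatible names. The union-find structure, the `parse` name heuristic, and the exact-match article test are assumptions made for illustration, not the algorithm from the SIGMOD'05 paper.

```python
class UnionFind:
    """Tracks which reference ids have been reconciled into one entity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def parse(name):
    """Reduce a name to (first initial, surname); handles 'Last, F.' and 'First Last'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    return first[0].lower(), last.lower()

ARTICLES = {
    "a1": {"title": "Distributed Query Processing", "pages": "169-180",
           "authors": ["p1", "p2", "p3"], "venue": "c1"},
    "a2": {"title": "Distributed query processing", "pages": "169-180",
           "authors": ["p4", "p5", "p6"], "venue": "c2"},
}
PERSONS = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}

def reconcile(articles, persons):
    uf = UnionFind()
    ids = sorted(articles)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            x, y = articles[a], articles[b]
            if x["title"].lower() == y["title"].lower() and x["pages"] == y["pages"]:
                uf.union(a, b)
                # Propagate: one article has one venue, so the venues must match.
                uf.union(x["venue"], y["venue"])
                # Propagate: authors of the same article with compatible names match.
                for p in x["authors"]:
                    for q in y["authors"]:
                        if parse(persons[p]) == parse(persons[q]):
                            uf.union(p, q)
    return uf

uf = reconcile(ARTICLES, PERSONS)
print(uf.find("c1") == uf.find("c2"), uf.find("p2") == uf.find("p5"))
```

Note that c1 and c2 share no similar name string; they are merged only because the article decision is propagated to them.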
37. Propagating Information between Reconciliation Decisions Further Improves Recall
[Chart: person references: 24076; real-world persons: 1750]
38. III. Reference Enrichment
- p2("Michael Stonebraker", null, {p1, p3})
  p8(null, "stonebraker@csail.mit.edu", {p7})
  p9("mike", "stonebraker@csail.mit.edu", null)
- p8-9("mike", "stonebraker@csail.mit.edu", {p7})
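Enrichment can be sketched as pooling the attribute values of two references once they are known to match, so the merged reference carries more evidence into later comparisons. The dictionary layout and the `enrich` helper are assumptions for illustration only.

```python
def enrich(ref_a, ref_b):
    """Merge two references to the same person by pooling their attribute values."""
    return {key: ref_a[key] | ref_b[key] for key in ("names", "emails", "contacts")}

# p8 and p9 from the slide share an email address, so they can be merged.
p8 = {"names": set(), "emails": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"stonebraker@csail.mit.edu"}, "contacts": set()}

p8_9 = enrich(p8, p9)
print(p8_9)
```

The merged reference p8-9 now has a name, an email, and a contact, where p8 and p9 each had only two of the three.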
39. Reference Enrichment Improves Recall More than Information Propagation
[Chart: person references: 24076; real-world persons: 1750]
40. Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
[Chart: person references: 24076; real-world persons: 1750]
42. Importing External Data Sources
43-44. Intuition: Explore Associations in Schema Mapping
- Traditional approaches proceed in two steps
  - Step 1. Schema matching (surveyed in Rahm & Bernstein, 2001)
    - Generate term matching candidates
    - E.g., paperTitle in table Author matches title in table Article
  - Step 2. Query discovery [Miller et al., 2000]
    - Take term matching as input, generate mapping expressions (typically queries)
    - E.g., SELECT Article.title AS paperTitle, Person.name AS author FROM Article, Person WHERE Article.author = Person.id
- User input is needed to fill in the gap between Step 1 output and Step 2 input
- Our approach: check association violations to filter inappropriate matching candidates
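To make the Step 2 output concrete, the mapping query from the slide can be run against toy tables with Python's built-in `sqlite3` module. The column types and the sample rows are assumptions added for illustration; only the table names, column names, and the query itself come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Article (title TEXT, author INTEGER REFERENCES Person(id));
    -- Illustrative rows only.
    INSERT INTO Person VALUES (1, 'Michael Stonebraker'), (2, 'Eugene Wong');
    INSERT INTO Article VALUES ('Distributed Query Processing', 1),
                               ('Distributed Query Processing', 2);
""")

# The mapping expression: it renames Article.title to the source's paperTitle
# and resolves the author id through the join with Person.
rows = conn.execute("""
    SELECT Article.title AS paperTitle, Person.name AS author
    FROM Article, Person
    WHERE Article.author = Person.id
    ORDER BY Person.id
""").fetchall()

for paper_title, author in rows:
    print(paper_title, "|", author)
```

The join condition is the part that term matching alone cannot supply, which is the gap between Step 1 and Step 2 that the slide points out.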
45-46. Integration Example
- Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year)
- Associations shown: authoredBy (twice), publishedIn
- External source: Webpage-item(title, author, conf, year)
[The second build of the slide marks two places in the diagram with "?"]
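One way to read "check association violations" on this example is sketched below. Everything beyond the class and attribute names is an assumption: which classes each association connects, the candidate matches per source column, and the rule itself (a mapping is kept only if one of its classes is directly associated with every other class it uses). The slides do not spell out Semex's actual rule.

```python
from itertools import product

# Assumed reading of the diagram: Article and Book are authoredBy Person,
# and Article is publishedIn Conference.
ASSOCIATED = {
    frozenset(("Article", "Person")),
    frozenset(("Book", "Person")),
    frozenset(("Article", "Conference")),
}

# Hypothetical Step 1 output: candidate (class, attribute) matches per source column.
CANDIDATES = {
    "title": [("Book", "title"), ("Article", "title")],
    "author": [("Person", "name")],
    "conf": [("Conference", "name")],
    "year": [("Book", "year"), ("Conference", "year")],
}

def violates(mapping):
    """True if no class in the mapping is directly associated with all the others."""
    classes = {cls for cls, _ in mapping.values()}
    for hub in classes:
        if all(frozenset((hub, other)) in ASSOCIATED for other in classes - {hub}):
            return False
    return True

def consistent_mappings(candidates):
    """Enumerate every combination of candidates and drop the violating ones."""
    columns = list(candidates)
    kept = []
    for choice in product(*(candidates[c] for c in columns)):
        mapping = dict(zip(columns, choice))
        if not violates(mapping):
            kept.append(mapping)
    return kept

for mapping in consistent_mappings(CANDIDATES):
    print(mapping)
```

Under these assumptions only one of the four combinations survives: title maps to Article.title and year to Conference.year, because a Book has no association with a Conference.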
48. Explore the Association Network 1: Find the Relationship between Two Instances
- Example: How did I know this person?
- Solution: lineage
  - Find an association chain between two object instances
  - Shortest chain?
  - Earliest chain OR latest chain
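For the "shortest chain" variant, a breadth-first search over the association network is enough. The sketch below assumes the network is stored as an adjacency map from an instance to `(association label, neighbor)` pairs; the instance names and labels in the toy graph are hypothetical.

```python
from collections import deque

def shortest_chain(graph, start, goal):
    """Return the shortest list of (source, label, target) hops, or None if unreachable."""
    if start == goal:
        return []
    previous = {start: None}          # instance -> (parent, label) used to reach it
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for label, neighbor in graph.get(node, []):
            if neighbor in previous:
                continue
            previous[neighbor] = (node, label)
            if neighbor == goal:
                return _unwind(previous, goal)
            queue.append(neighbor)
    return None

def _unwind(previous, goal):
    """Walk the parent links back from goal to start and return the hops in order."""
    chain = []
    node = goal
    while previous[node] is not None:
        parent, label = previous[node]
        chain.append((parent, label, node))
        node = parent
    chain.reverse()
    return chain

# Hypothetical toy network.
graph = {
    "me": [("authorOf", "paper1"), ("senderOf", "email1")],
    "paper1": [("authoredBy", "coauthor")],
    "email1": [("recipient", "colleague")],
    "coauthor": [("mentionedIn", "email2")],
    "email2": [("sender", "stranger")],
}

for hop in shortest_chain(graph, "me", "stranger"):
    print(hop)
```

An earliest or latest chain would need timestamps on the associations and a different search order, which this sketch does not cover.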
49. Explore the Association Network 2: Find All Instances Related to a Given Keyword
- Example: Who is working on schema matching?
- Solution
  - Naive approach: index object instances on attribute values
  - A list of papers on schema matching
  - A list of emails on schema matching
  - A list of persons working on schema matching
  - A list of conferences for schema-matching papers
  - A list of institutes that conduct schema-matching research
  - Our approach: index objects on the attributes of associated objects
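A minimal sketch of the difference between the two indexing strategies: an inverted index that also files each object under the attribute words of the objects it is associated with. The object layout, the one-hop limit, and the sample instances are assumptions for illustration.

```python
from collections import defaultdict

def build_index(objects, associations):
    """Map each word to the ids of objects that contain it or are associated with an object that does."""
    index = defaultdict(set)

    def add(text, obj_id):
        for word in text.lower().split():
            index[word].add(obj_id)

    # Naive part: index every object on its own attribute values.
    for obj_id, obj in objects.items():
        for value in obj["attrs"].values():
            add(value, obj_id)

    # Association part: index each object on its neighbors' attribute values too.
    for source, target in associations:
        for value in objects[target]["attrs"].values():
            add(value, source)
        for value in objects[source]["attrs"].values():
            add(value, target)
    return index

def search(index, query):
    """Return ids of objects indexed under every word of the query."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

# Hypothetical instances.
objects = {
    "paper1": {"class": "Article", "attrs": {"title": "Generic schema matching"}},
    "person1": {"class": "Person", "attrs": {"name": "Some Author"}},
    "conf1": {"class": "Conference", "attrs": {"name": "ExampleConf"}},
    "person2": {"class": "Person", "attrs": {"name": "Unrelated Person"}},
}
associations = [("paper1", "person1"), ("paper1", "conf1")]

index = build_index(objects, associations)
print(sorted(search(index, "schema matching")))
```

With the naive index alone the query would return only the paper; the association part is what brings in the author and the conference.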
50. Explore the Association Network 3: Rank Returned Instances in a Keyword Search
- Example: What are important papers on schema matching?
- Solution
  - Naive approach: rank by TF/IDF metric
  - Our approach: rank by
    - Significance score: PageRank measure
    - Relevance score: TF/IDF metric
    - Usage score: last visit time and modification time
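The three scores can be sketched as follows. The slide names the ingredients but not how they are computed or combined, so the smoothed IDF, the recency formula, and the weighted sum in `combined_score` are assumptions; `pagerank` is the standard power iteration.

```python
import math

def pagerank(links, damping=0.85, iterations=50):
    """Significance: standard PageRank by power iteration over a link map."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n) or nodes  # dangling nodes spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def tf_idf(query, doc_words, all_docs):
    """Relevance: sum of TF * smoothed IDF over the query terms."""
    if not doc_words:
        return 0.0
    score = 0.0
    for term in query.lower().split():
        tf = doc_words.count(term) / len(doc_words)
        df = sum(1 for doc in all_docs if term in doc)
        if df:
            score += tf * math.log(1.0 + len(all_docs) / df)
    return score

def usage_score(days_since_last_use):
    """Usage (assumed form): decays as the last visit or modification gets older."""
    return 1.0 / (1.0 + days_since_last_use)

def combined_score(significance, relevance, usage, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: a weighted sum of the three scores."""
    ws, wr, wu = weights
    return ws * significance + wr * relevance + wu * usage

# Hypothetical citation links: p1 and p2 both cite p3.
links = {"p1": ["p3"], "p2": ["p3"], "p3": []}
ranks = pagerank(links)
docs = [["schema", "matching", "survey"], ["query", "processing"]]
print(combined_score(ranks["p3"], tf_idf("schema matching", docs[0], docs), usage_score(2)))
```

In practice the weights would need tuning, and the three scores would need to be normalized to comparable ranges before being added.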
51. Explore the Association Network 4: Fuzzy Queries
- Queries we pose today: something we can describe
  - Find me something with (related to) keyword X
  - Find me the co-authors of Person Y
- Fuzzy queries
  - Q: What do I want to know?
  - A: In this webpage, 5 papers are written by your friends
  - Q: What significant things have happened today?
  - A: The President wrote an email to you!!
53. The Domain Model
- The logical view is described with a domain model
- Semex provides very basic classes and associations as a default domain model
- Users can personalize the domain model
[Figure label: cite]
54. Problems in Domain Model Personalization
- Problem: it is hard to precisely model a domain
- At a certain point we are not able to give a precise domain model
  - Not enough knowledge of the domain
  - Inherent evolution of a domain
  - Non-existence of a precise model
- Overly detailed models may be a burden to users
  - Modeling every detail of the information on one's desktop is often overwhelming
  - We may want to leave part of the domain unstructured
- Extract descriptions at different levels of granularity: Address vs. street, city, state, zip
55. Malleable Schemas
- Key idea: capture the important aspects of the domain model without committing to a strict schema
[Figure labels: Unstructured data sources, Clean Schema, Structured data sources]
56. Malleable Schema
- Introduce text into schemas
  - Phrases as element names. E.g., InitialPlanningPhaseParticipant
  - Regular expressions as element names. E.g., Phone, StateProvince
  - Chains as element names. E.g., name/firstName
- Introduce imprecision into queries
  - SELECT S.name, S.phone FROM Student AS S, Project AS P WHERE (S initialParticipant P) AND (P.name = "Semex")
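One simple way an imprecise predicate such as `initialParticipant` could be resolved against a phrase-style element name is an in-order word match. The slides do not describe how Semex evaluates such predicates, so the `words` and `loosely_matches` helpers below are purely an illustrative assumption.

```python
import re

def words(name):
    """Split a camelCase or PascalCase element name into lowercase words."""
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", name)]

def loosely_matches(query_name, element_name):
    """True if the query's words appear, in order, among the element's words."""
    remaining = iter(words(element_name))
    # 'w in remaining' advances the iterator, so word order is respected.
    return all(w in remaining for w in words(query_name))

print(words("InitialPlanningPhaseParticipant"))
print(loosely_matches("initialParticipant", "InitialPlanningPhaseParticipant"))
```

A real system would likely return a graded similarity rather than a yes/no answer, so that query results can be ranked.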
58. Overarching PIM Themes
- It is PERSONAL data!
  - How to build a system supporting users in their own habitat?
  - How to create an "AHA!" browsing experience and increase users' productivity?
- There can be any kind of INFORMATION
  - How to combine structured and unstructured data?
- We are pursuing life-long data MANAGEMENT
  - What is the right granularity for modeling personal data?
  - How to manage data and schema that evolve over time?
59. Related Work
- Personal Information Management Systems
  - Indexing
    - Stuff I've Seen (MSN Desktop Search) [Dumais et al., 2003]
    - Google Desktop Search [2004]
  - Richer relationships
    - MyLifeBits [Gemmell et al., 2002]
    - Placeless Documents [Dourish et al., 2000]
    - LifeStreams [Freeman and Gelernter, 1996]
  - Objects and associations
    - Haystack [Karger et al., 2005]
60. Summary
- 60 years have passed since the personal Memex was envisioned
  - It's time to get serious
  - Great challenges for data management
- Deliverables of the project
  - An approach to automatically build a database of objects and associations from personal data
  - An algorithm for on-the-fly integration
  - Algorithms for data analysis for association search and browsing
  - The concept of malleable schema as a modeling tool
  - A PIM system incorporating the above
61. Association Network for Semex
[Figure label: Project Semex]