Semex: A Platform for Personal Information Management and Integration


1
Semex: A Platform for Personal Information
Management and Integration
  • Xin (Luna) Dong
  • University of Washington
  • June 24, 2005

2
Is Your Personal Information a Mine or a Mess?
Intranet Internet
4
Questions Hard to Answer
  • Where are my SEMEX papers and presentation slides
    (maybe in an attachment)?

5
Index Data from Different Sources
E.g., Google, MSN desktop search
Intranet Internet
6
Questions Hard to Answer
  • Where are my SEMEX papers and presentation slides
    (maybe in an attachment)?
  • Who is working on SEMEX?
  • What are the emails sent by my PKU alumni?
  • What are the phone numbers and emails of my
    coauthors?

7
Organize Data in a Semantically Meaningful Way
Intranet Internet
8
Questions Hard to Answer
  • Where are my SEMEX papers and presentation slides
    (maybe in an attachment)?
  • Who is working on SEMEX?
  • What are the emails sent by my PKU alumni?
  • What are the phone numbers and emails of my
    coauthors?
  • Which of the SIGMOD'05 authors do I know?

9
Integrate Organizational and Public Data with
Personal Data
Intranet Internet
11
SEMEX (SEMantic EXplorer): I. Provide a
Logical View of Data
12
SEMEX (SEMantic EXplorer): II. On-the-fly Data
Integration
13
How to Find Alon's Papers on My Desktop?
14
How to Find Alon's Papers on My Desktop? Google
Search Results
Search: Alon Halevy
Send me the semex demo slides again?
15
How to Find Alon's Papers on My Desktop? Google
Search Results
Search: Alon Halevy
Ignore previous request, I found them
16
How to Find Alon's Papers on My Desktop? Google
Search Results
17
Semex Goal
  • Build a Personal Information Management (PIM)
    system prototype that provides a logical view of
    personal information
  • Build the logical view automatically
      • Extract object instances and associations
      • Remove duplicate instances
  • Leverage the logical view for on-the-fly data
    integration
  • Exploit the logical view for information search
    and browsing to improve people's productivity
  • Be resilient to the evolution of the logical view

18
An Ideal PIM is a Magic Wand
20
Outline
  • Problem definition and project goals
  • Technical issues
      • System architecture and instance extraction
        [CIDR'05]
      • Reference reconciliation [SIGMOD'05]
      • On-the-fly data integration
      • Association search and browsing
      • Domain model personalization and evolution
        [WebDB'05]
  • Interleaved with Semex demo (best demo in
    SIGMOD'05)
  • Overarching PIM Themes

21
System Architecture
[Architecture diagram; labeled components: Data Analysis Module,
Domain Management Module, Data Collection Module, Domain Manager]
23
Reference Reconciliation in Semex
[Figure: references drawn from names and emails: Xin (Luna) Dong;
Lab-dong xin; dong xin luna; xinluna dong; luna; x. dong; dongxin;
xin dong]
24
Semex Without Reference Reconciliation
Search results for "luna":
23 persons
luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)
25
Semex Without Reference Reconciliation
Search results for "luna":
23 persons
Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)
26
Semex Without Reference Reconciliation
A Platform for Personal Information Management
and Integration
27
Semex Without Reference Reconciliation
9 Persons dong xin xin dong
28
Semex NEEDS Reference Reconciliation
29
Reference Reconciliation
  • A very active area of research in databases, data
    mining, and AI (surveyed in Cohen et al., 2003)
  • Traditional approaches assume matching tuples
    from a single table
      • Based on pair-wise comparisons
  • Harder in our context

31
Challenges
  • Article
      • a1 = ("Bounds on the Sample Complexity of Bayesian Learning",
        "703-746", {p1, p2, p3}, c1)
      • a2 = ("Bounds on the sample complexity of bayesian learning",
        "703-746", {p4, p5, p6}, c2)
  • Venue
      • c1 = ("Computational learning theory", "1992", "Austin, Texas")
      • c2 = ("COLT", "1992", null)
  • Person
      • p1 = ("David Haussler", null)
      • p2 = ("Michael Kearns", null)
      • p3 = ("Robert Schapire", null)
      • p4 = ("Haussler, D.", null)
      • p5 = ("Kearns, M. J.", null)
      • p6 = ("Schapire, R.", null)
      • p7 = ("Robert Schapire", "schapire@research.att.com")
      • p8 = (null, "mkearns@cis.uppen.edu")
      • p9 = ("mike", "mkearns@cis.uppen.edu")

1. Multiple classes
2. Limited information
3. Multi-valued attributes
32
Intuition
  • Complex information spaces can be considered as
    networks of instances and associations between
    the instances
  • Key exploit the network, specifically, the clues
    hidden in the associations

33
I. Exploiting Richer Evidence
  • Cross-attribute similarity: name and email
      • p5 = ("Stonebraker, M.", null)
      • p8 = (null, "stonebraker@csail.mit.edu")
  • Context information I: contact list
      • p5 = ("Stonebraker, M.", null, {p4, p6})
      • p8 = (null, "stonebraker@csail.mit.edu", {p7})
      • p6 = p7
  • Context information II: authored articles
      • p2 = ("Michael Stonebraker", null)
      • p5 = ("Stonebraker, M.", null)
      • p2 and p5 authored the same article
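
The cross-attribute evidence on this slide can be illustrated with a minimal Python sketch. The helper names (`surname_of`, `name_matches_email`) and the matching rule (the surname appears in the local part of the email address) are illustrative assumptions, not the similarity measure Semex actually uses. Email addresses are written with `@` here.

```python
import re

def surname_of(name):
    """Return the lowercased surname of a 'Last, F.' or 'First Last' style name."""
    name = name.strip()
    if "," in name:
        last = name.split(",", 1)[0]
    else:
        last = name.split()[-1]
    return re.sub(r"[^a-z]", "", last.lower())

def name_matches_email(name, email):
    """Hypothetical cross-attribute test: the email's local part contains the surname."""
    if not name or not email:
        return False
    local_part = email.split("@", 1)[0].lower()
    surname = surname_of(name)
    return bool(surname) and surname in local_part

# The two references from the slide: one has only a name, the other only an email.
p5 = {"name": "Stonebraker, M.", "email": None}
p8 = {"name": None, "email": "stonebraker@csail.mit.edu"}

print(name_matches_email(p5["name"], p8["email"]))
```

Comparing p5 and p8 attribute by attribute finds nothing to compare, since each reference is missing the attribute the other has. Comparing the name of one against the email of the other gives the two references a chance to match.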

34
Considering Only Attribute-wise Similarities
Cannot Merge Persons Well
[Chart: person references: 24076; real-world persons (gold standard): 1750; chart label: 3159]
35
Considering Richer Evidence Improves the Recall
Person references: 24076; real-world persons: 1750
36
II. Propagate Information between Reconciliation
Decisions
  • Article
      • a1 = ("Distributed Query Processing", "169-180", {p1, p2, p3}, c1)
      • a2 = ("Distributed query processing", "169-180", {p4, p5, p6}, c2)
  • Venue
      • c1 = ("ACM Conference on Management of Data", "1978",
        "Austin, Texas")
      • c2 = ("ACM SIGMOD", "1978", null)
  • Person
      • p1 = ("Robert S. Epstein", null)
      • p2 = ("Michael Stonebraker", null)
      • p3 = ("Eugene Wong", null)
      • p4 = ("Epstein, R.S.", null)
      • p5 = ("Stonebraker, M.", null)
      • p6 = ("Wong, E.", null)
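
One way to picture propagation between reconciliation decisions is a work queue in which every merge lends extra evidence to the pairs that depend on it. The Python sketch below follows that reading. The similarity numbers, the `threshold` and `boost` parameters, and the `depends_on` table are all hypothetical values chosen for illustration; the slide does not give them.

```python
from collections import deque

# Similarity from attribute comparison alone (hypothetical numbers).
base_score = {
    ("a1", "a2"): 0.95,  # same title and pages
    ("p1", "p4"): 0.60,  # "Robert S. Epstein" vs. "Epstein, R.S."
    ("p2", "p5"): 0.60,  # "Michael Stonebraker" vs. "Stonebraker, M."
    ("p3", "p6"): 0.60,  # "Eugene Wong" vs. "Wong, E."
    ("c1", "c2"): 0.55,  # "ACM Conference on Management of Data" vs. "ACM SIGMOD"
}

# Merging the key pair lends evidence to each listed pair (assumed dependencies).
depends_on = {
    ("a1", "a2"): [("p1", "p4"), ("p2", "p5"), ("p3", "p6"), ("c1", "c2")],
}

def reconcile(base_score, depends_on, threshold=0.8, boost=0.3):
    """Merge pairs whose score reaches the threshold, propagating each merge."""
    score = dict(base_score)
    merged = set()
    queue = deque(score)
    while queue:
        pair = queue.popleft()
        if pair in merged or score[pair] < threshold:
            continue
        merged.add(pair)
        for other in depends_on.get(pair, []):
            if other not in merged:
                score[other] = min(1.0, score[other] + boost)
                queue.append(other)  # re-examine with the new evidence
    return merged

with_propagation = reconcile(base_score, depends_on)
without_propagation = reconcile(base_score, {})
```

With these numbers, attribute comparison alone merges only the two article references. Once that merge is propagated, the author pairs and the venue pair also cross the threshold.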

37
Propagating Information between Reconciliation
Decisions Further Improves Recall
Person references: 24076; real-world persons: 1750
38
III. Reference Enrichment
  • p2 = ("Michael Stonebraker", null, {p1, p3});
    p8 = (null, "stonebraker@csail.mit.edu", {p7});
    p9 = ("mike", "stonebraker@csail.mit.edu", null)
  • p8-9 = ("mike", "stonebraker@csail.mit.edu", {p7})
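
The enrichment step on this slide can be sketched in Python as a union of attribute values. Representing each reference as a dictionary of sets, and the function name `enrich`, are assumptions made for this illustration.

```python
def enrich(ref_a, ref_b):
    """Merge two references judged to denote the same person into a richer one."""
    return {
        "names": ref_a["names"] | ref_b["names"],
        "emails": ref_a["emails"] | ref_b["emails"],
        "contacts": ref_a["contacts"] | ref_b["contacts"],
    }

# p8 and p9 from the slide; null values are modelled as empty sets.
p8 = {"names": set(), "emails": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"stonebraker@csail.mit.edu"}, "contacts": set()}

p8_9 = enrich(p8, p9)
```

The merged reference carries a name, an email, and a contact list at once, so later comparisons (for example against p2) have more attributes to work with than either p8 or p9 offered alone.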

39
Reference Enrichment Improves Recall More than
Information Propagation
Person references: 24076; real-world persons: 1750
40
Applying Both Information Propagation and
Reference Enrichment Gets the Highest Recall
Person references: 24076; real-world persons: 1750
42
Importing External Data Sources
44
Intuition: Explore Associations in Schema Mapping
  • Traditional approaches proceed in two steps
      • Step 1. Schema matching (surveyed in
        Rahm and Bernstein, 2001)
          • Generate term matching candidates
          • E.g., paperTitle in table Author matches
            title in table Article
      • Step 2. Query discovery [Miller et al., 2000]
          • Take term matching as input, generate mapping
            expressions (typically queries)
          • E.g., SELECT Article.title AS paperTitle,
            Person.name AS author FROM Article,
            Person WHERE Article.author = Person.id
  • User input is needed to fill the gap between
    Step 1 output and Step 2 input
  • Our approach: check association violations to
    filter out inappropriate matching candidates

45
Integration Example
[Diagram: classes Person(name, email), Book(title, year),
Article(title, page), and Conference(name, year); association labels
authoredBy, publishedIn, authoredBy]
Webpage-item (title, author, conf, year)
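
As a rough illustration of filtering matching candidates by association violations, the Python sketch below enumerates candidate mappings for Webpage-item and keeps only those whose target classes hang together through associations. Several things here are assumptions: the endpoints of the three associations (read off the diagram labels), the candidate lists, and the violation rule itself (the chosen classes must form one connected group). Semex's actual rule may differ.

```python
from itertools import product

# Associations in the domain model (endpoints are an assumption).
associations = {
    ("Article", "Person"),      # authoredBy
    ("Book", "Person"),         # authoredBy
    ("Article", "Conference"),  # publishedIn
}

# Hypothetical schema-matching output for Webpage-item(title, author, conf, year).
candidates = {
    "title": ["Article.title", "Book.title"],
    "author": ["Person.name"],
    "conf": ["Conference.name"],
    "year": ["Book.year", "Conference.year"],
}

def is_connected(classes):
    """True if the classes form one connected group under the associations."""
    classes = set(classes)
    seen, stack = set(), [next(iter(classes))]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for a, b in associations:
            if a == current and b in classes:
                stack.append(b)
            elif b == current and a in classes:
                stack.append(a)
    return seen == classes

def consistent_mappings(candidates):
    """Yield one-attribute-per-column mappings that violate no association."""
    columns = list(candidates)
    for choice in product(*(candidates[c] for c in columns)):
        classes = [attr.split(".")[0] for attr in choice]
        if is_connected(classes):
            yield dict(zip(columns, choice))

surviving = list(consistent_mappings(candidates))
```

Under these assumptions every mapping that sends title to Book.title is rejected, because no association links Book to Conference. The year column stays ambiguous, so a stricter rule or user input would still be needed there.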
46
Integration Example
[Diagram: classes Person(name, email), Book(title, year),
Article(title, page), and Conference(name, year); association label
authoredBy; two question marks]
Webpage-item (title, author, conf, year)
48
Explore the association network 1: Find the
relationship between two instances
  • Example: How did I know this person?
  • Solution: lineage
      • Find an association chain between two object
        instances
      • Shortest chain?
      • Earliest chain or latest chain?
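
For the "shortest chain" option, a breadth-first search over the association network is one straightforward choice. The Python sketch below assumes associations are stored as (source, label, target) triples and can be traversed in either direction; the instance identifiers and association labels are made up for the example.

```python
from collections import deque

def shortest_chain(associations, start, goal):
    """Return the shortest list of (from, label, to) hops linking start to goal, or None."""
    graph = {}
    for source, label, target in associations:
        graph.setdefault(source, []).append((label, target))
        graph.setdefault(target, []).append((label, source))  # traverse both ways

    queue = deque([start])
    previous = {start: None}  # node -> (parent, label) used to reach it
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for label, neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = (node, label)
                queue.append(neighbor)

    if goal not in previous:
        return None
    chain, node = [], goal
    while previous[node] is not None:
        parent, label = previous[node]
        chain.append((parent, label, node))
        node = parent
    return list(reversed(chain))

# Hypothetical association network.
associations = [
    ("me", "recipientOf", "email-17"),
    ("email-17", "sentBy", "person-X"),
    ("me", "authorOf", "paper-3"),
    ("paper-3", "cites", "paper-9"),
    ("paper-9", "authoredBy", "person-X"),
]

chain = shortest_chain(associations, "me", "person-X")
```

Here the two-hop chain through the email is returned rather than the three-hop chain through the cited paper. Finding the earliest or latest chain would additionally need timestamps on the associations, which this sketch does not model.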

49
Explore the association network 2: Find all
instances related to a given keyword
  • Example: Who is working on schema matching?
  • Solution
      • Naive approach: index object instances on
        attribute values
          • A list of papers on schema matching
          • A list of emails on schema matching
          • A list of persons working on schema matching
          • A list of conferences for schema-matching papers
          • A list of institutes that conduct
            schema-matching research
      • Our approach: index objects on the attributes of
        associated objects
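
The difference between the two indexing approaches can be shown with a small inverted index in Python. This is a minimal sketch under stated assumptions: each object has a single `text` attribute, only directly associated objects (one hop) contribute tokens, and all object identifiers and values are invented for the example.

```python
from collections import defaultdict

# Hypothetical objects and associations.
objects = {
    "paper-1": {"class": "Article", "text": "Tutorial on schema matching"},
    "person-1": {"class": "Person", "text": "Person A"},
    "conf-1": {"class": "Conference", "text": "Conference B"},
}
associations = [
    ("paper-1", "authoredBy", "person-1"),
    ("paper-1", "publishedIn", "conf-1"),
]

def build_index(objects, associations):
    """Index each object on its own text and on the text of associated objects."""
    index = defaultdict(set)

    def add(obj_id, text):
        for token in text.lower().split():
            index[token].add(obj_id)

    for obj_id, obj in objects.items():
        add(obj_id, obj["text"])  # naive part: the object's own attribute values
    for source, _label, target in associations:
        add(source, objects[target]["text"])  # attributes of the associated object
        add(target, objects[source]["text"])
    return index

def search(index, objects, query, wanted_class):
    """Return ids of objects of wanted_class that match every query token."""
    hits = None
    for token in query.lower().split():
        ids = index.get(token, set())
        hits = ids if hits is None else hits & ids
    return sorted(i for i in (hits or set()) if objects[i]["class"] == wanted_class)

index = build_index(objects, associations)
people = search(index, objects, "schema matching", "Person")
```

A person's own attributes never mention "schema matching", so an index on attribute values alone cannot answer the question. Indexing the person on the title of the associated paper makes the person retrievable by that keyword.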

50
Explore the association network 3: Rank
returned instances in a keyword search
  • Example: What are important papers on schema
    matching?
  • Solution
      • Naive approach: rank by TF/IDF metric
      • Our approach: rank by
          • Significance score: PageRank measure
          • Relevance score: TF/IDF metric
          • Usage score: last visit time and modification time
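
The slide names three scores but not how they are combined. The Python sketch below assumes a weighted sum, takes the significance (PageRank) and relevance (TF/IDF) scores as precomputed inputs in the range 0 to 1, and derives the usage score from recency with an exponential decay. The weights, the half-life, and the sample data are all hypothetical.

```python
def usage_score(days_since_visit, days_since_modification, half_life_days=30.0):
    """Recency score in (0, 1]: 1.0 if touched today, halving every half_life_days."""
    most_recent = min(days_since_visit, days_since_modification)
    return 0.5 ** (most_recent / half_life_days)

def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores (the weights are an assumption)."""
    w_sig, w_rel, w_use = weights
    return w_sig * significance + w_rel * relevance + w_use * usage

def rank(items):
    """Sort items from highest to lowest combined score."""
    return sorted(
        items,
        key=lambda p: combined_score(
            p["significance"],
            p["relevance"],
            usage_score(p["visited"], p["modified"]),
        ),
        reverse=True,
    )

# Hypothetical search results; "visited" and "modified" are days ago.
papers = [
    {"id": "paper-A", "significance": 0.9, "relevance": 0.5, "visited": 200, "modified": 400},
    {"id": "paper-B", "significance": 0.2, "relevance": 0.9, "visited": 1, "modified": 3},
    {"id": "paper-C", "significance": 0.1, "relevance": 0.3, "visited": 90, "modified": 90},
]

ranking = [p["id"] for p in rank(papers)]
```

With these sample values, paper-B ranks first because it is both relevant and recently used, while the well-connected but long-untouched paper-A comes second.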

51
Explore the association network 4: Fuzzy
Queries
  • Queries we pose today: something we can describe
      • Find me something with (related to) keyword X
      • Find me the co-authors of Person Y
  • Fuzzy queries
      • Q: What do I want to know?
      • A: In this webpage, 5 papers are written by your
        friends
      • Q: What significant things have happened today?
      • A: The President wrote an email to you!!

53
The Domain Model
  • The logical view is described with a domain
    model
  • Semex provides very basic classes and
    associations as a default domain model
  • Users can personalize the domain model

54
Problems in Domain Model Personalization
  • Problem: it is hard to model a domain precisely
  • At a certain point we are not able to give a
    precise domain model
      • Not enough knowledge of the domain
      • Inherent evolution of the domain
      • Non-existence of a precise model
  • Overly detailed models may be a burden to users
      • Modeling every detail of the information on
        one's desktop is often overwhelming
  • We may want to leave part of the domain
    unstructured
  • Extract descriptions at different levels of
    granularity: Address vs. street, city, state, zip

55
Malleable Schemas
  • Key idea: capture the important aspects of the
    domain model without committing to a strict schema

[Diagram labels: unstructured data sources, clean schema, structured data sources]
56
Malleable Schema
  • Introduce text into schemas
      • Phrases as element names, e.g.,
        InitialPlanningPhaseParticipant
      • Regular expressions as element names, e.g.,
        Phone, StateProvince
      • Chains as element names, e.g., name/firstName
  • Introduce imprecision into queries
      • SELECT S.name, S.phone
        FROM Student AS S, Project AS P
        WHERE (S initialParticipant P) AND (P.name = "Semex")

58
Overarching PIM Themes
  • It is PERSONAL data!
      • How to build a system supporting users in their
        own habitat?
      • How to create an AHA! browsing experience and
        increase users' productivity?
  • There can be any kind of INFORMATION
      • How to combine structured and unstructured data?
  • We are pursuing life-long data MANAGEMENT
      • What is the right granularity for modeling
        personal data?
      • How to manage data and schema that evolve over
        time?
59
Related Work
  • Personal Information Management Systems
      • Indexing
          • Stuff I've Seen (MSN Desktop Search) [Dumais et
            al., 2003]
          • Google Desktop Search [2004]
      • Richer relationships
          • MyLifeBits [Gemmell et al., 2002]
          • Placeless Documents [Dourish et al., 2000]
          • LifeStreams [Freeman and Gelernter, 1996]
      • Objects and associations
          • Haystack [Karger et al., 2005]

60
Summary
  • 60 years have passed since the personal Memex was
    envisioned
      • It's time to get serious
      • Great challenges for data management
  • Deliverables of the project
      • An approach to automatically build a database of
        objects and associations from personal data
      • An algorithm for on-the-fly integration
      • Algorithms for data analysis for association
        search and browsing
      • The concept of malleable schema as a modeling
        tool
      • A PIM system incorporating the above

61
Association Network for Semex
Project Semex