Searching and Integrating Information on the Web

Seminar 4: Ranking Queries and Data Privacy. Professor Chen Li, UC Irvine.

1
Searching and Integrating Information on the Web
  • Seminar 4: Ranking Queries and Data Privacy
  • Professor Chen Li
  • UC Irvine

2
Outline and readings
  • Ranking Queries
    • Fagin, R., Combining Fuzzy Information from
      Multiple Systems, PODS 1996.
    • Fagin et al., Optimal Aggregation Algorithms for
      Middleware, PODS 2001.
  • Data privacy
    • Database-as-service
      • Executing SQL over Encrypted Data in the
        Database-Service-Provider Model. Hakan Hacigumus,
        Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD
        2002.
    • XML Data publishing
      • Secure XML Publishing without Information Leakage
        in the Presence of Data Inference. Xiaochun Yang
        and Chen Li. To appear in VLDB'04.

3
Outline
  • Ranking Queries
  • Data privacy
  • XML Data publishing
  • Database-as-service

4
Top-k queries
  1. Finding the multi-attribute tuples with the top-k
    highest scores
  2. Scoring function: aggregates the scores on the
    attributes, e.g., w1*A1 + ... + wn*An, where wi
    is the weight for attribute Ai.
  3. Monotone aggregation functions: if tuple A has a
    grade at least as high as tuple B on each
    attribute, then A's overall grade is at least as
    high as B's.
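
As a baseline for the algorithms on the following slides, here is a minimal Python sketch of a weighted-sum scoring function and a naive top-k scan. The names `weighted_sum` and `top_k_naive`, the weights, and the car grades are illustrative assumptions, not part of the slides.

```python
import heapq

def weighted_sum(weights):
    """Build t(x1..xn) = w1*x1 + ... + wn*xn; monotone when every wi >= 0."""
    if any(w < 0 for w in weights):
        raise ValueError("a negative weight would break monotonicity")
    def score(grades):
        return sum(w * x for w, x in zip(weights, grades))
    return score

def top_k_naive(tuples, score, k):
    """Score every tuple and keep the k best (reads the whole database)."""
    return heapq.nlargest(k, tuples.items(), key=lambda item: score(item[1]))

# Hypothetical grades per car for (year, mileage, price), each in [0, 1].
cars = {"a": (0.9, 0.8, 0.9), "b": (1.0, 0.3, 0.2), "e": (0.8, 0.6, 0.7)}
score = weighted_sum([0.5, 0.3, 0.2])
best = top_k_naive(cars, score, 1)
```

The naive scan touches every tuple; the point of FA and TA is to avoid that by exploiting sorted lists and monotonicity.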

5
Applications
  • Multimedia databases
  • Web search queries
  • Restaurants
  • Houses
  • Cars

6
Modes of Data Access (Fagin)
  • Underlying middleware (e.g., search engines,
    Garlic, QBIC) supports 2 modes
  • 1. Sorted access
    - Attribute Ai (column) forms a list Li sorted
      based on the score of Ai.
    - The list is output one object at a time.
  • 2. Random access
    - Ask the system for the grade of any given
      object
  • Goal: minimize the total cost to get the top-k
    results

[Figure: three sorted lists labeled year, mileage, and price; the list contents shown are (b, e, f, ...), (a, c, e, ...), and (a, d, e, ...)]
7
FA: Fagin's algorithm (PODS'96)
  1. Do sorted access in parallel to each of the m
    sorted lists Li. Wait until there is a set H of
    at least k objects such that each of these
    objects has been seen in each of the m lists.
  2. For each object R that has been seen, do random
    access as needed to each of the lists Li to find
    the i-th field xi of R.
  3. Compute the aggregate results.
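
The three steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each list holds `(object, grade)` pairs sorted by descending grade, every object appears in every list, random access is simulated with a dictionary per list, and the function name `fagin_fa` and the integer grades are hypothetical.

```python
def fagin_fa(lists, agg, k):
    """Sketch of FA. Returns (top-k as (grade, object) pairs, sorted-access depth)."""
    m = len(lists)
    index = [dict(lst) for lst in lists]   # simulated random access
    seen_in = {}                           # object -> set of lists it was seen in
    fully_seen = 0                         # objects seen in all m lists
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in every list.
    while fully_seen < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                fully_seen += 1
        depth += 1
    # Steps 2 and 3: random access for every seen object, then aggregate.
    scored = [(agg([index[i][obj] for i in range(m)]), obj) for obj in seen_in]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k], depth

# Hypothetical grades on a 0-10 scale.
year    = [("b", 10), ("e", 9), ("f", 8), ("a", 7), ("c", 3), ("d", 2)]
mileage = [("a", 10), ("c", 9), ("e", 8), ("b", 4), ("d", 3), ("f", 1)]
price   = [("a", 10), ("d", 9), ("e", 8), ("f", 5), ("b", 3), ("c", 2)]
fa_top, fa_depth = fagin_fa([year, mileage, price], sum, 1)
```

With this data the sorted access stops once `e` has been seen in all three lists, yet the winner is `a`, which illustrates why every seen object must still be scored.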

8
Example
[Figure: the year, mileage, and price sorted lists, with contents (b, e, f, ...), (a, c, e, ...), and (a, d, e, ...), and a cut-off line]
  1. Suppose k = 1. Given the three partial lists
    retrieved so far, e appears in all of them. We
    can say that the top 1 tuple must be in
    {a, b, c, d, e, f}.
  2. Reason: since the function is monotonic, tuple
    e blocks all tuples below the cut-off line, since
    they cannot have a higher overall grade than e.
  3. The algorithm does random access for these 5
    tuples to get their grades, and picks the top 1.
  4. Notice that we cannot say e must be the top 1,
    since other tuples (e.g., a) may still have a
    higher overall score.
  5. Minor point: one possible improvement is to note
    that f can never be better than e.

9
General case
[Figure: the year, mileage, and price sorted lists, each marked with k, and a cut-off line]
  1. Once k tuples have appeared in all the partial
    lists, halt.
  2. Reason: these k tuples block all the tuples
    below, which cannot be better than these k tuples.
  3. Do random access for the retrieved tuples to get
    their overall grades, and find the top-k.

10
FA's Properties
  1. Can correctly find top-k results for monotone
    aggregation functions
  2. Cost for a database with N objects:
    O(N^((m-1)/m) * k^(1/m)) with arbitrarily high
    probability.
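
To get a feel for the bound O(N^((m-1)/m) k^(1/m)), here is a worked instance. The numbers are illustrative assumptions, not taken from the slides: m = 2 lists, N = 10^6 objects, and k = 10.

```latex
N^{(m-1)/m}\, k^{1/m}
  = \left(10^{6}\right)^{1/2} \cdot 10^{1/2}
  = \sqrt{10^{7}}
  \approx 3162
```

Up to the constant hidden in the O-notation, that is on the order of a few thousand accesses rather than a scan of all 10^6 objects.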

11
FA's Drawbacks
  • The number of sorted accesses is still large.
  • Since all seen tuples should be buffered, the
    required buffer size is unbounded.
  • Does not exploit the bound given by the
    aggregation function to determine when to stop
    sorted access.

12
TA: Threshold Algorithm (PODS 2001)
  1. Do sorted access in parallel to each of the m
    sorted lists. As an object R is seen under sorted
    access in some list, do random access to the
    other lists to find the grade xi of object R in
    each of them. Then compute the aggregate grade of
    object R. If it is one of the k highest seen so
    far, remember R; otherwise discard it.
  2. For each list Li, let xi be the grade of the last
    object seen under sorted access. Define the
    threshold value T to be t(x1, ..., xm). As soon as
    at least k objects have been seen whose grade is
    at least equal to T, halt.
  3. Return the k objects that have been seen with the
    highest grades.
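
A minimal Python sketch of these steps follows. It assumes each list holds `(object, grade)` pairs sorted by descending grade, that every object appears in every list, and that random access can be simulated with a dictionary per list; the name `threshold_algorithm` and the integer grades are hypothetical. For simplicity the threshold is checked once per round of m sorted accesses.

```python
import heapq

def threshold_algorithm(lists, agg, k):
    """Sketch of TA. Returns (top-k as (grade, object) pairs, rounds of sorted access)."""
    m = len(lists)
    index = [dict(lst) for lst in lists]   # simulated random access
    top = []                               # min-heap holding at most k (grade, object)
    buffered = set()                       # objects currently in the heap
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last = []                          # last grade seen in each list
        for i in range(m):
            obj, grade = lists[i][depth]
            last.append(grade)
            if obj in buffered:
                continue
            total = agg([index[j][obj] for j in range(m)])
            if len(top) < k:
                heapq.heappush(top, (total, obj))
                buffered.add(obj)
            elif total > top[0][0]:
                _, evicted = heapq.heapreplace(top, (total, obj))
                buffered.discard(evicted)
                buffered.add(obj)
        # Halt once k buffered objects have grades at least the threshold T.
        if len(top) == k and top[0][0] >= agg(last):
            break
    return sorted(top, reverse=True), rounds

# Hypothetical grades on a 0-10 scale.
year    = [("b", 10), ("e", 9), ("f", 8), ("a", 7), ("c", 3), ("d", 2)]
mileage = [("a", 10), ("c", 9), ("e", 8), ("b", 4), ("d", 3), ("f", 1)]
price   = [("a", 10), ("d", 9), ("e", 8), ("f", 5), ("b", 3), ("c", 2)]
ta_top, ta_rounds = threshold_algorithm([year, mileage, price], sum, 1)
```

Only the k best objects are kept, so the buffer stays bounded by k. An object that was discarded earlier may be scored again if it reappears, which trades extra random accesses for the small buffer.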

13
Example
[Figure: the mileage, year, and price sorted lists, with contents (b, e, f, ...), (a, d, e, ...), and (a, c, e, ...), a buffer for the top-k tuples, and a threshold window]
  1. A buffer keeps the top-k tuples that have been
    found so far.
  2. For any tuple in a sorted list, do a random
    access to get its overall grade. Compare it with
    the tuples in the buffer queue, and decide to
    insert it or discard it.
  3. The threshold window (including the previous m
    records) represents the best top-k results we
    can see, assuming we can combine the best values
    from different tuples.
  4. Notice that this window may not be horizontal
    if we use different speeds to access different
    lists.
  5. This window helps us decide when to stop: once we
    find k tuples whose grades are at least equal to
    that of the window tuple, we halt.

14
TA's Properties
  • TA is instance optimal for all monotone functions
    and over every database.
  • Compared to FA, TA requires only a small,
    constant-size buffer.
  • TA allows early stopping.
  • Can show TA never stops later than FA. (Why?)
  • There are times when the user is satisfied with
    an approximate top-k list. TA can be modified to
    give such an approximation.
  • TA can be modified for the case where random
    access is impossible.

15
Instance Optimality
  1. Algorithm b is instance optimal over an algorithm
    set A and a database instance set D, if b is in
    A, and for every algorithm a in A and every
    instance d in D, we have cost(b, d) =
    O(cost(a, d)).
  2. Similar to competitive ratio
  3. Essentially b is the best algorithm in A.
  4. Stronger than optimality in the worst case
  5. TA is instance optimal among all correct
    algorithms (including nondeterministic
    algorithms).
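
Spelled out, the O-notation in the definition means that two constants exist that work for every competing algorithm and every instance. In the block below, c and c' are the assumed names for those constants; b, a, A, d, and D are as in the definition.

```latex
\exists\, c, c' > 0 \;\; \forall a \in A,\ \forall d \in D:\quad
\mathrm{cost}(b, d) \;\le\; c \cdot \mathrm{cost}(a, d) + c'
```

The constant c plays the role of the ratio in a competitive-ratio argument: b may lose to a on a given instance, but never by more than a constant factor.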

16
Variations of TA
  • NRA: when no random access is possible
    • Example: Web search engines, which typically do
      not allow you to enter a URL and get its ranking
  • TAZ: when no sorted access is possible for some
    predicates
    • Example: find good restaurants near location x
      (sorted and random access for restaurant ratings,
      random access only for distances from a mapping
      site)
  • CA: when the relative costs of random and sorted
    accesses matter
  • TAθ: when only approximate answers are needed
    • Example: Web search, with lots of good-quality
      answers

17
Outline
  • Ranking Queries
  • Data privacy
  • XML Data publishing
  • Database-as-service

18
Motivation
  • Privacy in publishing XML data
  • Applications
  • Web publishing
  • Data sharing and exchange, e.g., in P2P systems

19
Example: hospital XML data
[Figure: XML tree rooted at hospital, with physician and patient subtrees; visible labels include phname, pname, Smith, Walker, Tom, W403, and cancer]
  • Goal: hide Alice's disease
  • Common knowledge: patients in the same ward have
    the same disease

20
Problem
  • Given
    • An XML document to be published
    • Sensitive data in the document
    • Common knowledge that public users can use to
      do data inference
  • Find
    • A partial document to be released so that users
      cannot infer the sensitive data

21
Research challenges
  • How to model data inference using common
    knowledge?
  • How to compute all possible inferred data?
  • How to compute a partial document to be published
    without leaking sensitive information?

22
Roadmap
  • → Information Leakage
  • Defining sensitive data
  • Describing common knowledge
  • Computing inferred documents
  • Prevent information leakage

23
Defining sensitive data
  • Sensitive data is defined using an XQuery, called
    a regulating query
  • A special node in the query is marked to indicate
    the sensitive data

24
Example 1
[Figure: hospital tree with patients (1), (2), and (3)]
  • Map the query to the XML tree
  • For each mapping, the target of the node is
    sensitive.

25
Example 2
[Figure: hospital tree with patients (1), (2), and (3)]
26
Common Knowledge
  • Represented as XML constraints
  • Could be obtained in various ways, e.g.,
  • possible schema
  • analysis from the published data

27
Common Constraints
  • Child constraints: //p → //p/c
    • //patient → //patient/pname
  • Descendant constraints: //p → //p//d
    • //patient → //patient//disease
  • Functional dependencies: //p/a → //p/b
    • //patient/ward → //patient/disease
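
As an illustration of the functional dependency from ward to disease, the following Python sketch checks whether a document satisfies it. The helper `fd_conflicts`, the sample documents, and the patient name Betty are hypothetical; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

def fd_conflicts(root, parent_path, lhs, rhs):
    """Return the lhs values that map to more than one rhs value."""
    first_seen = {}
    conflicts = set()
    for parent in root.iterfind(parent_path):
        a = parent.findtext(lhs)
        b = parent.findtext(rhs)
        if a is None or b is None:
            continue                      # constraint does not apply here
        if first_seen.setdefault(a, b) != b:
            conflicts.add(a)
    return conflicts

# Hypothetical hospital document.
good = ET.fromstring("""
<hospital>
  <patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Tom</pname><ward>W403</ward><disease>cancer</disease></patient>
</hospital>
""")
bad = ET.fromstring("""
<hospital>
  <patient><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><ward>W305</ward><disease>cancer</disease></patient>
</hospital>
""")
good_conflicts = fd_conflicts(good, ".//patient", "ward", "disease")
bad_conflicts = fd_conflicts(bad, ".//patient", "ward", "disease")
```

When the dependency holds, anyone who knows it can copy a disease from one patient to another patient in the same ward, which is exactly the inference the publisher has to anticipate.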

[Figure: tree patterns for the three constraints: Patient with child pname; Patient with descendant disease; two Patient nodes with wards w1, w2 and diseases d1, d2, where if w1 = w2 then d1 = d2 (value equal)]
28
Modify partial document using constraints
Partial document P
C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
29
Apply C1 on document P
C1(P)
C1: //patient → //patient/pname
30
Apply C2 on document P
C2(P)
C2: //patient → //patient//disease
  • Floating branch: exact location unknown

31
Apply C3 on document P
C3(P)
C3: //patient/ward → //patient/disease
32
Apply a sequence of constraints <C2, C3>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
33
Another user applies a different sequence of
constraints <C3, C2>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
  • After applying C3, we cannot use C2 to expand the
    tree
  • No more floating branch!

34
They look different!
  • P1 is m-contained in P2
  • There is a mapping from P1 to P2.
  • A floating branch can be mapped to a path.
  • The m-containing document P2 has more information
  • P2 is also m-contained in P1.
  • Thus they are m-equivalent!

35
What documents can users infer?
  • Different users can use different sequences of
    constraints to do inference
  • Thus they can infer different documents
  • Questions
  • Can an inference process terminate?
  • What inferred document should we consider to
    prevent leakage of sensitive data?

36
Theorem
  • Given a partial document P of an XML document D
    and a set of constraints C = {C1, ..., Ck}, there
    is a document M that can be inferred from P using
    a sequence of constraints, such that
    • for any sequence of constraints, its resulting
      document is m-contained in M.
  • M can be computed using a greedy approach.
  • Such a document is unique under m-equivalence.

37
Information leakage
  • For a partial document P, if there exists a
    regulating query A, such that the maximal
    inferred document M can produce a non-empty
    answer to the query A, then we say P causes
    information leakage.
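
The leakage test can be sketched in Python on a deliberately simplified model. The assumptions are loud ones: patients are flat records rather than XML subtrees, the only constraint is the functional dependency from ward to disease, and the regulating query is a plain predicate. The names `infer_with_fd` and `leaks`, and the patient Betty, are hypothetical.

```python
def infer_with_fd(patients):
    """Apply the FD ward -> disease until nothing new can be inferred."""
    inferred = [dict(p) for p in patients]
    changed = True
    while changed:
        changed = False
        known = {p["ward"]: p["disease"]
                 for p in inferred if "ward" in p and "disease" in p}
        for p in inferred:
            if "disease" not in p and p.get("ward") in known:
                p["disease"] = known[p["ward"]]
                changed = True
    return inferred

def leaks(patients, regulating_query):
    """True if the query has a non-empty answer on the inferred records."""
    return any(regulating_query(p) for p in infer_with_fd(patients))

def alice_disease(p):
    # Simplified regulating query: Alice's record exposes a disease.
    return p.get("pname") == "Alice" and "disease" in p

# Partial documents with Alice's disease already removed.
with_ward = [
    {"pname": "Alice", "ward": "W305"},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Tom", "ward": "W403", "disease": "cancer"},
]
without_ward = [
    {"pname": "Alice"},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Tom", "ward": "W403", "disease": "cancer"},
]
leak_a = leaks(with_ward, alice_disease)
leak_b = leaks(without_ward, alice_disease)
```

Removing the sensitive value alone is not enough in the first document, because the ward still lets a reader infer it; hiding Alice's ward as well breaks the inference.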

[Figure: partial document P and regulating query A]
38
Roadmap
  • Information Leakage
  • → Prevent information leakage

39
Formal Problem
  • Given an XML document D, a regulating query A,
    and common knowledge represented as constraints
    C1, ..., Ck
  • How to find a partial document P without
    information leakage?
  • Called a valid partial document
  • The empty document is a trivial one
  • We want the published document to have as much
    data as possible

40
An algorithm
  • We develop an algorithm for solving this problem
  • We use the running example to illustrate the
    algorithm

41
Example
[Figure: hospital tree with patients (1), (2), and (3); regulating query A with nodes patient, disease, Alice, and S]
Functional dependency: //patient/ward → //patient/disease
42
Remove sensitive data A(D)
[Figure: hospital tree with patients (1), (2), and (3); regulating query nodes patient, disease, Alice, and S; the remaining document is D - A(D)]
43
Compute the maximal inferred document M of D-A(D)
[Figure: maximal inferred document M: hospital tree with patients (1), (2), and (3); regulating query nodes patient, disease, Alice, and S]
44
Testing Information Leakage
[Figure: hospital tree with patients (1), (2), and (3); regulating query A with nodes patient, disease, Alice, and S]
There is a mapping from A to P, so information is
leaked.
45
Computing a valid partial document
[Figure: D - A(D) and A(D)]
How to break the mappings? How to chase back the
inference steps?
46
AND/OR Graphs
  • A structure representing how a goal can be
    reached by solving subproblems.
  • We use such graphs to formulate the process of
    finding a valid partial document

47
[Figure: hospital tree with patients (1), (2), and (3); regulating query A with nodes patient, disease, Alice, and S; AND/OR graph with START, an OR connector, and nodes Alice (1) and leukemia (1)]
  • Consider mapping images of the leaf nodes in A
  • An OR connector shows that solving any of the
    subproblems can solve the parent problem.

48
[Figure: hospital tree with patients (1), (2), and (3); regulating query A with nodes patient, disease, and Alice; expanded AND/OR graph with START, OR and AND connectors, and nodes labeled Alice, leukemia, and W305 for patients (1), (2), and (3)]
  • Multiple ways to infer the sensitive data.
  • An AND connector shows that solving ALL the
    subproblems can solve the parent problem.

49
[Figure: hospital tree with patients (1), (2), and (3); regulating query A with nodes patient, disease, Alice, and S]
  • Continue expanding the AND/OR graph

50
AND/OR Graphs (cont)
  • A special START node representing the goal of
    computing a valid partial document.
  • The graph has nodes corresponding to nodes in the
    maximal inferred document M.
  • Such a node represents the subproblem of hiding
    its corresponding node n in M
  • This node n should be removed from M
  • It cannot be inferred using the constraints and
    other nodes in M.

51
Solution graphs
  • A connected subgraph (of M) including the START
    node
  • For each node in the subgraph, its successor
    connectors are also in the subgraph.
  • If it contains an OR connector, it must also
    contain one of the connector's successors.
  • If it contains an AND connector, it must also
    contain all the successors of the connector.
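
One way to picture these rules is a small recursive search over an acyclic AND/OR graph, sketched below in Python. The encoding is an assumption made for illustration: a connector is a tuple `("AND" | "OR", successors)`, a node is a string, and the node names such as `ward(1)` are hypothetical. At each OR connector the sketch picks the smallest sub-solution locally, so the result is a solution graph but is not guaranteed to be the smallest one overall.

```python
def solve(graph, item):
    """Return the set of nodes in one solution graph rooted at item."""
    if isinstance(item, tuple):
        kind, successors = item
        subs = [solve(graph, s) for s in successors]
        if kind == "AND":
            return set().union(*subs)      # AND: keep all successors
        return min(subs, key=len)          # OR: keep one successor
    chosen = {item}                        # item is a node name
    if item in graph:                      # follow the node's connector
        chosen |= solve(graph, graph[item])
    return chosen

# Hypothetical AND/OR graph for hiding the disease of patient (1).
graph = {
    "START": ("OR", ["pname(1)", "disease(1)"]),
    "disease(1)": ("AND", [
        ("OR", ["ward(1)", "ward(2)", "disease(2)"]),
        ("OR", ["ward(1)", "ward(3)", "disease(3)"]),
    ]),
}
from_start = solve(graph, "START")
from_disease = solve(graph, "disease(1)")
```

Every node in the returned set other than `START` names a document node to hide, so different choices at the OR connectors correspond to different valid partial documents.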

52
Example solution graphs
[Figure: two example solution graphs; labels shown: START, OR, AND, Alice, (1), leukemia, and W305]
53
Computing a valid partial document using a
solution graph
  • For a solution graph G, for each node in G, we
    remove the corresponding node in M to get a valid
    partial document

[Figure: the two example solution graphs (labels START, OR, AND, Alice, (1), leukemia, W305) next to the hospital tree with patients (1), (2), and (3)]
54
Constructing an AND/OR Graph
  • We give an algorithm for computing an AND/OR
    graph
  • Consider inference steps of different constraints
  • Many algorithms have been proposed for finding a
    solution graph; they are applicable here
  • No need to construct the entire AND/OR graph.
    Search for a solution graph on the fly.

55
Related work
Different scenarios of database security based on
trust domains ("|" separates trust domains)
  A. Single-user DBMS: Data, Execution, Query
  B. C/S access control: Data, Execution | Query
  C. Database as a service: Data, Query | Execution
  D. Data publishing (our work): Query, Execution | Data
56
Summary of 2nd paper
  • Formulated problem of publishing XML document
    without information leakage due to data inference
  • Showed the effect of constraints on inference
  • Algorithm for finding a valid partial document of
    a given document

57
Outline
  • Ranking Queries
  • Data privacy
  • XML Data publishing
  • Database-as-service (DAS) model