Title: Searching and Integrating Information on the Web
1Searching and Integrating Information on the Web
- Seminar 4 Ranking Queries and Data Privacy
- Professor Chen Li
- UC Irvine
2Outline and readings
- Ranking Queries
- Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
- Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.
- Data privacy
- Database-as-service
- Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
- XML Data publishing
- Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04.
3Outline
- Ranking Queries
- Data privacy
- XML Data publishing
- Database-as-service
4Top-k queries
- Finding multi-attribute tuples with the top-k highest scores
- Scoring function: aggregates scores on attributes, e.g., w1·A1 + ... + wn·An, where wi is the weight for attribute Ai.
- Monotone aggregation functions: if tuple A has a grade at least as high as tuple B on each attribute, then A's overall grade is at least as high as B's.
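The weighted-sum scoring function above can be sketched as follows. This is a minimal illustration, not code from the lecture: the weights and grades are hypothetical, and the function is monotone only under the assumption that all weights are non-negative.

```python
from typing import Sequence


def weighted_sum(grades: Sequence[float], weights: Sequence[float]) -> float:
    """Overall score w1*A1 + ... + wn*An for one tuple."""
    if len(grades) != len(weights):
        raise ValueError("grades and weights must have the same length")
    return sum(w * g for w, g in zip(weights, grades))


# Hypothetical weights and per-attribute grades in [0, 1].
weights = [0.5, 0.3, 0.2]
car_a = [0.9, 0.8, 0.7]
car_b = [0.6, 0.8, 0.5]  # no better than car_a on any attribute

score_a = weighted_sum(car_a, weights)
score_b = weighted_sum(car_b, weights)

# Monotonicity: car_a is at least as good on every attribute,
# so with non-negative weights it cannot score lower overall.
assert score_a >= score_b
```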
5Applications
- Multimedia databases
- Web search queries
- Restaurants
- Houses
- Cars
6Modes of Data Access (Fagin)
- Underlying middleware (e.g., search engines, Garlic, QBIC) supports 2 modes:
- 1. Sorted access
- - Attribute Ai (column) forms a list Li sorted based on the score of Ai.
- - The list is output one object at a time.
- 2. Random access
- - Ask the system for the grade of any given object.
- Goal: minimize the total cost to get the top-k results.
[Figure: three sorted lists (year, mileage, price): (b, e, f, ...), (a, c, e, ...), (a, d, e, ...)]
7FA: Fagin's algorithm (PODS'96)
- Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.
- For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.
- Compute the aggregate grade of each seen object and return the k objects with the highest grades.
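The three steps of FA can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: every object is assumed to appear exactly once in every list, dictionary lookups stand in for the middleware's random access, and the grades are hypothetical (only the top of each list echoes the deck's example).

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

SortedList = List[Tuple[str, int]]  # (object id, grade), highest grade first


def fagin_fa(lists: Sequence[SortedList],
             agg: Callable[[Sequence[int]], int],
             k: int) -> List[Tuple[str, int]]:
    """Sketch of FA; assumes every object appears once in every list."""
    m = len(lists)
    # Dict lookups stand in for the middleware's random access.
    index: List[Dict[str, int]] = [dict(lst) for lst in lists]
    seen_in: Dict[str, int] = {}  # object -> number of lists it was seen in
    fully_seen = 0                # size of the set H (seen in all m lists)
    depth = 0
    # Step 1: sorted access in parallel until H holds at least k objects.
    while fully_seen < k and depth < len(lists[0]):
        for lst in lists:
            obj, _ = lst[depth]
            seen_in[obj] = seen_in.get(obj, 0) + 1
            if seen_in[obj] == m:
                fully_seen += 1
        depth += 1
    # Step 2: random access to fetch all m grades of every seen object.
    scored = [(obj, agg([idx[obj] for idx in index])) for obj in seen_in]
    # Step 3: keep the k objects with the highest aggregate grade.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical grades for six objects in three lists.
lists = [
    [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 50), ("d", 40)],
    [("a", 90), ("c", 80), ("e", 70), ("b", 30), ("d", 20), ("f", 10)],
    [("a", 95), ("d", 85), ("e", 75), ("b", 30), ("c", 20), ("f", 10)],
]
top1 = fagin_fa(lists, sum, k=1)
```

With these grades, e is the first object seen in all three lists, yet the top-1 answer is a, which shows why the random-access step is needed.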
8Example
[Figure: three sorted lists (year, mileage, price) with a cut-off line: (b, e, f, ...), (a, c, e, ...), (a, d, e, ...)]
- Suppose k = 1. Given the three partial lists retrieved so far, e appears in all of them. We can say that the top-1 tuple must be in {a, b, c, d, e, f}.
- Reason: since the function is monotonic, tuple e blocks all tuples below the cut-off line, since they can only have an overall grade no higher than e's.
- The algorithm does random access for the other 5 tuples to get their grades, and picks the top 1.
- Notice that we cannot say e must be the top 1, since other tuples (e.g., a) may still have a higher overall score.
- Minor point: one possible improvement is to observe that f can never be better than e.
9General case
[Figure: three sorted lists (year, mileage, price), each holding k tuples above the cut-off line]
- Once k tuples have appeared in all the partial lists, halt.
- Reason: these k tuples block all the tuples below, which cannot be better than these k tuples.
- Do random access for the retrieved tuples to get their overall grades, and find the top-k.
10FA's Properties
- Can correctly find top-k results for monotone aggregation functions
- Cost for a database with N objects: $O(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability.
11FA's Drawbacks
- The number of sorted accesses is still large.
- Since all seen tuples should be buffered, the required buffer size is unbounded.
- Does not exploit the bound given by the aggregation function to determine when to stop sorted access.
12TA: Threshold Algorithm (PODS 2001)
- Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in those lists. Then compute the aggregate grade for this object R. If this is one of the k highest seen so far, insert it; else discard it.
- For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, ..., xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt.
- Return the k objects that have been seen with the highest grades.
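The TA steps above can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes every object appears once in every list and that all lists have the same length, uses dictionary lookups in place of random access, and runs on hypothetical integer grades. The function also returns the number of sorted-access rounds so the early stop is visible.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

SortedList = List[Tuple[str, int]]  # (object id, grade), highest grade first


def threshold_algorithm(lists: Sequence[SortedList],
                        agg: Callable[[Sequence[int]], int],
                        k: int) -> Tuple[List[Tuple[str, int]], int]:
    """Sketch of TA; returns (top-k list, sorted-access rounds used)."""
    index: List[Dict[str, int]] = [dict(lst) for lst in lists]
    top: List[Tuple[int, str]] = []  # min-heap of (grade, object), size <= k
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last_seen = []  # grade of the last object seen in each list
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if any(obj == kept for _, kept in top):
                continue  # already in the top-k buffer
            # Random access to the other lists, then aggregate.
            score = agg([idx[obj] for idx in index])
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        threshold = agg(last_seen)  # T = t(x1, ..., xm)
        if len(top) == k and top[0][0] >= threshold:
            break  # k objects have grade at least T
    result = sorted(((obj, score) for score, obj in top),
                    key=lambda pair: pair[1], reverse=True)
    return result, rounds


# Hypothetical grades for six objects in three lists.
lists = [
    [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 50), ("d", 40)],
    [("a", 90), ("c", 80), ("e", 70), ("b", 30), ("d", 20), ("f", 10)],
    [("a", 95), ("d", 85), ("e", 75), ("b", 30), ("c", 20), ("f", 10)],
]
top1, rounds_used = threshold_algorithm(lists, sum, k=1)
```

On this data TA halts after two rounds for k = 1, one round before any object has been seen in all three lists, which is the point where FA would stop its sorted access.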
13Example
[Figure: three sorted lists (mileage, year, price): (b, e, f, ...), (a, d, e, ...), (a, c, e, ...), with a threshold window and a buffer for the top-k tuples]
- A buffer keeps the top-k tuples that have been found so far.
- For any tuple in a sorted list, do a random access to get its overall grade. Compare it with the tuples in the buffer queue, and decide whether to insert it or discard it.
- The threshold window (including the previous m records) represents the best top-k results we could still see, assuming we can combine the best values from different tuples.
- Notice that this window may not be horizontal if we use different speeds to access different lists.
- This window helps us decide when to stop: once we find k tuples whose grades are at least equal to the grade of the window tuple, we halt.
14TA's Properties
- TA is instance optimal for all monotone aggregation functions and over every database.
- Compared to FA, TA requires a small, constant-size buffer.
- TA allows early stopping.
- Can show TA never stops later than FA. (Why?)
- There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
- TA can be modified to handle the case where random access is impossible.
15Instance Optimality
- Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A and, for every algorithm a in A and every instance d in D, we have $\mathrm{cost}(b, d) = O(\mathrm{cost}(a, d))$.
- Similar to competitive ratio
- Essentially b is the best algorithm in A.
- Stronger than worst-case optimality
- TA is instance optimal among all correct algorithms (nondeterministic algorithms).
[Figure: algorithms a and b inside the algorithm set A]
16Variations of TA
- NRA: when no random access is possible
- Example: Web search engines, which typically do not allow you to enter a URL and get its ranking
- TAZ: when no sorted access is possible for some predicates
- Example: find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
- CA: when the relative costs of random and sorted accesses matter
- TAθ: when only approximate answers are needed
- Example: Web search, with lots of good-quality answers
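As an illustration of the approximate variant, the sketch below relaxes TA's halting rule: it stops as soon as k objects have a grade of at least T/θ for a chosen θ ≥ 1. This assumes the approximation meant here is the θ-approximation of the PODS 2001 paper, non-negative grades, and the same toy setting as before (every object in every list, dictionary lookups as random access, hypothetical grades).

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

SortedList = List[Tuple[str, int]]  # (object id, grade), highest grade first


def ta_theta(lists: Sequence[SortedList],
             agg: Callable[[Sequence[int]], int],
             k: int,
             theta: float) -> Tuple[List[Tuple[str, int]], int]:
    """Sketch of approximate TA; theta = 1 gives the exact halting rule."""
    if theta < 1:
        raise ValueError("theta must be at least 1")
    index: List[Dict[str, int]] = [dict(lst) for lst in lists]
    top: List[Tuple[int, str]] = []  # min-heap of (grade, object), size <= k
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if any(obj == kept for _, kept in top):
                continue
            score = agg([idx[obj] for idx in index])
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        # Relaxed rule: compare against T / theta instead of T.
        if len(top) == k and top[0][0] >= agg(last_seen) / theta:
            break
    result = sorted(((obj, score) for score, obj in top),
                    key=lambda pair: pair[1], reverse=True)
    return result, rounds


# Hypothetical grades for six objects in three lists.
lists = [
    [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 50), ("d", 40)],
    [("a", 90), ("c", 80), ("e", 70), ("b", 30), ("d", 20), ("f", 10)],
    [("a", 95), ("d", 85), ("e", 75), ("b", 30), ("c", 20), ("f", 10)],
]
approx_top1, approx_rounds = ta_theta(lists, sum, k=1, theta=1.2)
```

A larger θ trades accuracy for fewer accesses: here θ = 1.2 stops after one round, while θ = 1 needs two.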
17Outline
- Ranking Queries
- Data privacy
- XML Data publishing
- Database-as-service
18Motivation
- Privacy in publishing XML data
- Applications
- Web publishing
- Data sharing and exchange, e.g., in P2P systems
19Example: Hospital XML data
[Figure: hospital XML tree with physician and patient subtrees; visible labels include phname, pname, Smith, Walker, Tom, W403, cancer]
- Goal: hide Alice's disease
- Common knowledge: patients in the same ward have the same disease
20Problem
- Given
- An XML document to be published
- Sensitive data in the document
- Common knowledge that public users can use to do data inference
- Find
- A partial document to be released so that users cannot infer the sensitive data
21Research challenges
- How to model data inference using common knowledge?
- How to compute all possible inferred data?
- How to compute a partial document to be published without leaking sensitive information?
22Roadmap
- → Information Leakage
- Defining sensitive data
- Describing common knowledge
- Computing inferred documents
- Prevent information leakage
23Defining sensitive data
- Using an XQuery, called a regulating query
- A special node is marked to indicate the sensitive data
24Example 1
[Figure: regulating query mapped onto the hospital XML tree with its patient nodes]
- Map the query to the XML tree.
- For each mapping, the target of the marked node is sensitive.
25Example 2
[Figure: hospital XML tree with its patient nodes]
26Common Knowledge
- Represented as XML constraints
- Could be obtained in various ways, e.g.,
- possible schema
- analysis from the published data
27Common Constraints
- Child constraints: //p → //p/c
- //patient → //patient/pname
- Descendant constraints: //p → //p//d
- //patient → //patient//disease
- Functional dependencies: //p/a → //p/b
- //patient/ward → //patient/disease
[Figure: a patient node gains a pname child; a patient node gains a disease descendant; two patient nodes with wards w1, w2 and diseases d1, d2: if w1 = w2, then d1 = d2 (value equal)]
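To make the functional dependency concrete, here is a small sketch of how a user could apply //patient/ward → //patient/disease to fill in a hidden value. It is a simplification: patients are modeled as flat dictionaries rather than XML subtrees, `None` marks an unpublished field, and the patient named Betty is a hypothetical addition to the running example.

```python
from typing import Dict, List, Optional

Patient = Dict[str, Optional[str]]  # field -> value; None = not published


def apply_ward_disease_fd(patients: List[Patient]) -> List[Patient]:
    """Infer hidden diseases: equal ward values imply equal disease values."""
    # Collect every ward whose disease is visible somewhere in the document.
    known: Dict[str, str] = {}
    for p in patients:
        ward, disease = p.get("ward"), p.get("disease")
        if ward is not None and disease is not None:
            known[ward] = disease
    # Fill in missing diseases for patients in a ward with a known disease.
    inferred = []
    for p in patients:
        q = dict(p)
        if q.get("disease") is None and q.get("ward") in known:
            q["disease"] = known[q["ward"]]
        inferred.append(q)
    return inferred


# Hypothetical partial document: Alice's disease has been removed.
published = [
    {"pname": "Alice", "ward": "W305", "disease": None},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Tom", "ward": "W403", "disease": "cancer"},
]
after_inference = apply_ward_disease_fd(published)
```

Even though Alice's disease node was removed, the shared ward value lets a reader recover it, which is the kind of inference the constraints are meant to model.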
28Modify partial document using constraints
Partial document P
C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
29Apply C1 on document P
C1(P)
C1: //patient → //patient/pname
30Apply C2 on document P
C2(P)
C2: //patient → //patient//disease
- Floating branch: exact location unknown
31Apply C3 on document P
C3(P)
C3: //patient/ward → //patient/disease
32Apply a sequence of constraints <C2, C3>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
33Another user applies a different sequence of constraints <C3, C2>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
- After applying C3, we cannot use C2 to expand the tree.
- No more floating branch!
34They look different!
- P1 is m-contained in P2
- There is a mapping from P1 to P2.
- A floating branch can be mapped to a path.
- The m-containing document P2 has more information
- P2 is also m-contained in P1.
- Thus they are m-equivalent!
35What documents can users infer?
- Different users can use different sequences of constraints to do inference.
- Thus they can infer different documents.
- Questions
- Can an inference process terminate?
- What inferred document should we consider to prevent leakage of sensitive data?
36Theorem
- Given a partial document P of an XML document D and a set of constraints C = {C1, ..., Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that
- for any sequence of constraints, its resulting document is m-contained in M.
- M can be computed using a greedy approach.
- Such a document is unique under m-equivalence.
37Information leakage
- For a partial document P, if there exists a
regulating query A, such that the maximal
inferred document M can produce a non-empty
answer to the query A, then we say P causes
information leakage.
[Figure: partial document P and regulating query A]
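The leakage test can be sketched as a fixpoint computation followed by a query. This is a simplified illustration, not the paper's algorithm: documents are flat lists of patient dictionaries instead of XML trees, the only constraint is the ward/disease functional dependency, the regulating query is hard-coded to "Alice's disease", and Betty is a hypothetical patient.

```python
from typing import Callable, Dict, List, Optional

Patient = Dict[str, Optional[str]]  # field -> value; None = not published
Document = List[Patient]
Constraint = Callable[[Document], Document]


def ward_disease_fd(doc: Document) -> Document:
    """//patient/ward -> //patient/disease: equal wards, equal diseases."""
    known: Dict[str, str] = {}
    for p in doc:
        if p.get("ward") is not None and p.get("disease") is not None:
            known[p["ward"]] = p["disease"]
    result = []
    for p in doc:
        q = dict(p)
        if q.get("disease") is None and q.get("ward") in known:
            q["disease"] = known[q["ward"]]
        result.append(q)
    return result


def maximal_inferred(doc: Document,
                     constraints: List[Constraint]) -> Document:
    """Greedy fixpoint: apply constraints until nothing new is inferred."""
    current = [dict(p) for p in doc]
    changed = True
    while changed:
        changed = False
        for constraint in constraints:
            nxt = constraint(current)
            if nxt != current:
                current, changed = nxt, True
    return current


def regulating_query(doc: Document) -> List[str]:
    """Hypothetical regulating query A: the disease of patient Alice."""
    return [p["disease"] for p in doc
            if p.get("pname") == "Alice" and p.get("disease") is not None]


def causes_leakage(partial: Document, constraints: List[Constraint]) -> bool:
    """P leaks if A has a non-empty answer on the maximal inferred document."""
    return len(regulating_query(maximal_inferred(partial, constraints))) > 0


leaky = [
    {"pname": "Alice", "ward": "W305", "disease": None},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
]
# Also hiding Alice's ward breaks the inference step.
safe = [
    {"pname": "Alice", "ward": None, "disease": None},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
]
```

The loop terminates in this toy model because each constraint application can only turn a missing value into a known one.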
38Roadmap
- Information Leakage
- → Prevent information leakage
39Formal Problem
- Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, ..., Ck
- How to find a partial document P without information leakage?
- Called a valid partial document
- The empty document is a trivial one
- We want the published document to have as much data as possible
40An algorithm
- We develop an algorithm for solving this problem
- We use the running example to illustrate the
algorithm
41Example
[Figure: hospital document tree with its patient nodes, and regulating query A with labels patient, disease, Alice, S]
Functional dependency: //patient/ward → //patient/disease
42Remove sensitive data A(D)
[Figure: hospital document tree with its patient nodes, and the removed data A(D) with labels patient, disease, Alice, S]
Remaining document: D - A(D)
43Compute the maximal inferred document M of D-A(D)
[Figure: hospital document tree with its patient nodes, shown with the labels patient, disease, Alice, S]
Maximal inferred document M
44Testing Information Leakage
[Figure: hospital document tree with its patient nodes, and regulating query A with labels patient, disease, Alice, S]
There is a mapping from A to P, so information is leaked.
45Computing a valid partial document
D - A(D)
A(D)
How to break the mappings? How to chase back the
inference steps?
46AND/OR Graphs
- A structure representing how a goal can be reached by solving subproblems.
- We use such graphs to formulate the process of finding a valid partial document.
47[Figure: hospital document tree, regulating query A, and the start of the AND/OR graph: START, an OR connector, and the nodes Alice and leukemia of patient (1)]
- Consider mapping images of the leaf nodes in A.
- An OR connector shows that solving any of the subproblems can solve the parent problem.
48[Figure: hospital document tree, regulating query A, and the AND/OR graph: START, an OR connector over Alice and leukemia of patient (1), then an AND connector over two OR connectors whose successors are W305 and leukemia nodes]
- Multiple ways to infer the sensitive data.
- An AND connector shows that solving ALL the subproblems can solve the parent problem.
49[Figure: hospital document tree and regulating query A with labels patient, disease, Alice, S]
- Continue expanding the AND/OR graph
50AND/OR Graphs (cont)
- A special START node representing the goal of computing a valid partial document.
- The graph has nodes corresponding to nodes in the maximal inferred document M.
- Such a node represents the subproblem of hiding its corresponding node n in M:
- This node n should be removed from M.
- It cannot be inferred using the constraints and other nodes in M.
51Solution graphs
- A connected subgraph (of M) including the START node
- For each node in the subgraph, its successor connectors are also in the subgraph.
- If it contains an OR connector, it must also contain one of the connector's successors.
- If it contains an AND connector, it must also contain all the successors of the connector.
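The rules above translate into a short recursive search. The sketch below is illustrative only: it assumes an acyclic AND/OR graph with non-empty connectors, picks the smallest option at each OR connector (which gives a small solution graph, not necessarily a minimum one), and uses hypothetical node names that loosely echo the running example.

```python
from typing import Dict, List, Set, Tuple

Connector = Tuple[str, List[str]]   # ("AND" | "OR", successor node names)
Graph = Dict[str, List[Connector]]  # node -> its successor connectors


def solution_graph(graph: Graph, node: str) -> Set[str]:
    """Return the node set of one solution graph rooted at `node`."""
    chosen = {node}
    # Every successor connector of a chosen node must be covered.
    for kind, successors in graph.get(node, []):
        subgraphs = [solution_graph(graph, s) for s in successors]
        if kind == "AND":
            for sub in subgraphs:      # AND: include all successors
                chosen |= sub
        elif kind == "OR":
            chosen |= min(subgraphs, key=len)  # OR: include one successor
        else:
            raise ValueError(f"unknown connector kind: {kind}")
    return chosen


# Hypothetical graph: hide Alice's name, or hide her disease and block
# both ways of inferring it through patients in the same ward.
graph: Graph = {
    "START": [("OR", ["Alice/pname", "Alice/disease"])],
    "Alice/disease": [
        ("OR", ["Alice/ward", "patient2/ward", "patient2/disease"]),
        ("OR", ["Alice/ward", "patient3/ward", "patient3/disease"]),
    ],
}
nodes_to_hide = solution_graph(graph, "START")
```

Every node in the returned set other than START corresponds to a node to remove from M; here hiding the name alone is the smaller of the two options under START.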
52Example solution graphs
[Figure: two example solution graphs rooted at START; node labels include Alice, leukemia, and W305 of patient (1), joined by OR and AND connectors]
53Computing a valid partial document using a
solution graph
- For a solution graph G, for each node in G, we remove the corresponding node in M to get a valid partial document.
[Figure: the two example solution graphs next to the hospital document tree with its patient nodes]
54Constructing an AND/OR Graph
- We give an algorithm for computing an AND/OR graph.
- It considers the inference steps of different constraints.
- Many algorithms have been proposed for finding a solution graph, and they are applicable here.
- No need to construct the entire AND/OR graph: search for a solution graph on the fly.
55Related work
Different scenarios of database security based on trust domains (items grouped as in the figure):
- A. Single-user DBMS: Data, Execution, Query
- B. C/S access control: Data, Execution | Query
- C. Database as a service: Data, Query | Execution
- D. Data publishing (our work): Query, Execution | Data
56Summary of 2nd paper
- Formulated the problem of publishing an XML document without information leakage due to data inference
- Showed the effect of constraints on inference
- Gave an algorithm for finding a valid partial document of a given document
57Outline
- Ranking Queries
- Data privacy
- XML Data publishing
- Database-as-service (DAS) model