Publishing Set-Valued Data via Differential Privacy - PowerPoint PPT Presentation

About This Presentation
Title:

Publishing Set-Valued Data via Differential Privacy

Description:

Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia University – PowerPoint PPT presentation

Number of Views:197
Avg rating:3.0/5.0
Slides: 39
Provided by: Pai99
Category:

less

Transcript and Presenter's Notes

Title: Publishing Set-Valued Data via Differential Privacy


1
Publishing Set-Valued Data via Differential
Privacy
  • Rui Chen, Concordia University
  • Noman Mohammed, Concordia University
  • Benjamin C. M. Fung, Concordia University
  • Bipin C. Desai, Concordia University
  • Li Xiong, Emory University

VLDB 2011
2
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

3
Introduction
  • The problem non-interactive set-valued data
    publication under differential privacy
  • Typical set-valued data transaction data, web
    search queries

4
Introduction
  • Set-valued data refers to the data in which each
    record owner is associated with a set of items
    drawn from an item universe.

TID Items
t1 I1, I2, I3, I4
t2 I2, I4
t3 I2
t4 I1, I2
t5 I2
t6 I1
t7 I1, I2, I3, I4
t8 I2, I3, I4
5
Introduction
  • Existing works 1, 2, 3, 4, 5, 6, 7 on
    publishing set-valued data are based on
    partitioned-based privacy models 8.
  • They provide insufficient privacy protection.
  • Composition attack 8
  • deFinetti attack 9
  • Foreground knowledge attack 10
  • They are vulnerable to background knowledge.

6
Introduction
  • Differential privacy is independent of an
    adversary background knowledge and computational
    power (with exceptions 11).
  • The outcome of any analysis should not overly
    depend on a single data record.
  • Existing differentially private data publishing
    approaches are not adequate in terms of both
    utility and scalability for our problem.

7
Introduction
  • Problems of data-independent publishing
    approaches

I1
I2
I3
I1, I2
I1, I3
I2, I3
I1, I2, I3
Universe I I1, I2, I3
  • Scalability O(2n)
  • Utility noise accumulates exponentially

8
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

9
Preliminaries
  • Context-free taxonomy tree
  • Each internal node is a set of their leaves, not
    necessarily the semantic generalization

10
Preliminaries
  • Differential privacy 12

D
D
  • D and D are neighbors if they differ on at most
    one record

A non-interactive privacy mechanism A gives
e-differential privacy if for all neighbours D,
D, and for any possible sanitized database D ?
Range(A), PrAA(D) D
exp(e) PrAA(D) D
11
Preliminaries
  • Laplace mechanism 12

Global Sensitivity
For example, for a single counting query Q over a
dataset D, returning Q(D)Laplace(1/e) gives
e-differential privacy.
12
Preliminaries
  • Exponential mechanism 13

Given a utility function q (D R) ? R for a
database instance D, the mechanism A, A(D, q)
return r with probability ?
exp(eq(D, r)/2?q) gives e-differential
privacy.
13
Preliminaries
  • Composition properties 14

Sequential composition ?iei differential
privacy
Parallel composition max(ei)differential privacy
14
Preliminaries
  • Utility metrics

For a given itemset I I , a counting query Q
over a dataset D is defined to be
A privacy mechanism A is (a, d)-useful if with
probability 1- d, for every counting query and
every dataset D, for DA(D), Q(D)-Q(D)lt a.
15
15
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

16
Sanitization Algorithm
  • Top-down partitioning

TID Items
t1 I1, I2, I3, I4
t2 I2, I4
t3 I2
t4 I1, I2
t5 I2
t6 I1
t7 I1, I2, I3, I4
t8 I2, I3, I4
  • Generalize all records to a single partition
  • Keep partitioning non-empty partitions until leaf
    partitions are reached

17
Sanitization Algorithm
  • Privacy budget allocation
  • We reserve B/2 to obtain noisy sizes of leaf
    partitions and the rest B/2 to guide the
    partitioning.
  • Assign less budget to more general partitions and
    more budget to more specific partitions.

18
Sanitization Algorithm
  • Privacy budget allocation

A hierarchy cut needs at most
partition operations to reach leaf partitions.
Example I1,2, I3, 4 needs at most two
partition operations to reach leaf partitions
19
Sanitization Algorithm
  • Privacy budget allocation
  • We reserve B/2 to obtain noisy sizes of leaf
    partitions and the rest B/2 to guide the
    partitioning.
  • Assign less budget to more general partitions and
    more budget to more specific partitions.

B/2/3 B/6
(B/2-B/6)/2 B/6
B/6B/2 2B/3
20
Sanitization Algorithm
  • Sub-partition generation
  • For a non-leaf partition, we need to consider all
    possible sub-partitions to satisfy differential
    privacy.
  • Efficient implementation separately handling
    empty and non-empty partitions (inspired by 16).

21
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

22
Experiments
  • Two real-life set-valued datasets are used.
  • MSNBC is publicly available at UCI machine
    learning repository(http//archive.ics.uci.edu/ml/
    index.html).
  • STM is provided by Societe de transport de
    Montreal (STM) (http//www.stm.info).

23
Experiments
  • Average relative error vs. privacy budget

B0.5
B0.75
B1.0
24
Experiments
  • Utility for frequent itemset mining

B0.75
B0.5
B1.0
25
Experiments
  • Scalability O(DI)

Runtime vs. D
Runtime vs. I
26
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

27
Conclusions
  • Differential privacy can be successfully applied
    to non-interactive set-valued data publishing
    with guaranteed utility.
  • Differential privacy can be achieved by
    data-dependent solutions with improved efficiency
    and accuracy.
  • The general idea of data-dependent solutions
    applies to other types of data, for example,
    relational data 17 and trajectory data 18.

28
References
  • 1 J. Cao, P. Karras, C. Raissi, and K.-L. Tan.
    ?uncertainty inference proof transaction
    anonymization. In VLDB, pp. 10331044, 2010.
  • 2 G. Ghinita, Y. Tao, and P. Kalnis. On the
    anonymization of sparse high-dimensional data. In
    ICDE, pp. 715724, 2008.
  • 3 Y. He and J. F. Naughton. Anonymization of
    set-valued data via top-down, local
    generalization. In VLDB, pp. 934945, 2009.
  • 4 M. Terrovitis, N. Mamoulis, and P. Kalnis.
    Privacy-preserving anonymization of set-valued
    data. In VLDB, pp.115125, 2008.
  • 5 M. Terrovitis, N. Mamoulis, and P. Kalnis.
    Local and global recoding methods for anonymizing
    set-valued data.VLDBJ, 20(1)83106, 2011.
  • 6 Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu,
    and J. Pei. Publishing sensitive transactions for
    itemset utility. In ICDM, pp. 11091114, 2008.
  • 7 Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu.
    Anonymizing transaction databases for
    publication. In SIGKDD, pp. 767775, 2008.

29
References
  • 8 S. R. Ganta, S. P. Kasiviswanathan, and A.
    Smith. Composition attacks and auxiliary
    information in data privacy. In SIGKDD, pp.
    265-273, 2008.
  • 9 D. Kifer. Attacks on privacy and deFinettis
    theorem. In SIGMOD, pp. 127138, 2009.
  • 10 R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and
    J. Pei. Can the utility of anonymized data be
    used for privacy breaches, ACM Transactions on
    Knowledge Discovery from Data, to appear.
  • 11 D. Kifer and A. Machanavajjhala. No free
    lunch in data privacy. In SIGMOD, 2011.
  • 12 C. Dwork, F. McSherry, K. Nissim, and A.
    Smith. Calibrating noise to sensitivity in
    private data analysis. In Theory of Cryptography
    Conference, pp. 265284, 2006.
  • 13 F. McSherry and K. Talwar. Mechanism design
    via differential privacy. In FOCS, pp. 94103,
    2007.
  • 14 F. McSherry. Privacy integrated queries An
    extensible platform for privacy-preserving data
    analysis. In SIGMOD, pp. 1930, 2009.
  • 15 A. Blum, K. Ligett, and A. Roth. A learning
    theory approach to non-interactive database
    privacy. In STOC, pp.609618, 2008.

30
References
  • 16 G. Cormode, M. Procopiuc, D. Srivastava, and
    T. T. L. Tran. Differentially Private Publication
    of Sparse Data. In CoRR, 2011.
  • 17 N. Mohammed, R. Chen, B. C. M. Fung, and P.
    S. Yu. Differentially private data release for
    data mining. In SIGKDD, 2011.
  • 18 R. Chen, B. C. M. Fung, and B. C. Desai.
    Differentially private trajectory data
    publication. ICDE, under review, 2012.

31
  • Thank you!
  • Q A

32
  • Backup Slides

33
Lower Bound Results
  • In the interactive setting, only a limited number
    of queries could be answered otherwise, an
    adversary would be able to precisely reconstruct
    almost the entire original database.
  • In the non-interactive setting, one can only
    guarantee the utility of restricted classes of
    queries.

34
(No Transcript)
35
(No Transcript)
36
Threshold Selection
  • We design the threshold as a function of the
    standard deviation of the noise and the height of
    a partitions hierarchy cut

37
Relative error
  • (a, d)-usefulness is effective to give an overall
    estimation of utility, but fails to produce
    intuitive experimental results.
  • We experimentally measure the utility of
    sanitized data for counting queries by relative
    error

Sanity bound
38
Experiments
  • Average relative error vs. taxonomy tree fan-out

B0.75
B0.5
B1.0
Write a Comment
User Comments (0)
About PowerShow.com