Title: On the Anonymization of Sparse High-Dimensional Data
1On the Anonymization of Sparse High-Dimensional Data
Gabriel Ghinita1 Yufei Tao2 Panos Kalnis1
- 1 National University of Singapore
- ghinitag,kalnis_at_comp.nus.edu.sg
- 2 Chinese University of Hong Kong
- taoyf_at_cse.cuhk.edu.hk
2Publishing Transaction Data
- Publishing transaction data
- Retail chain-owned shopping cart data
- Infer consumer spending patterns
- Correlations among purchased items
- e.g., 90% of cereal buyers also buy milk
- What about privacy?
3Privacy Threat
Quasi-identifying Items
Sensitive Items
4Privacy Paradigm
- l-diversity
- prevent association between quasi-identifier and sensitive attributes
- Create groups of transactions
- freq. of an SA value in a group < 1/p
- Objective
- Enforce privacy
- Preserve correlations among items
- Challenge: high data dimensionality
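The group condition on this slide can be checked mechanically. The following is a minimal Python sketch, assuming each transaction is a set of item names; the function name `satisfies_privacy`, the `strict` flag and the sample items are illustrative and do not come from the slides.

```python
from collections import Counter

def satisfies_privacy(group, sensitive_items, p, strict=True):
    """Return True if every sensitive item occurs in less than a 1/p
    fraction of the group's transactions (at most 1/p if strict=False)."""
    counts = Counter()
    for transaction in group:
        for item in transaction & sensitive_items:
            counts[item] += 1
    size = len(group)
    # Integer comparison count * p < size avoids floating-point division.
    if strict:
        return all(c * p < size for c in counts.values())
    return all(c * p <= size for c in counts.values())

# Hypothetical example: three transactions, one sensitive item.
group = [{"wine", "meat", "pregnancy_test"}, {"wine", "cream"}, {"meat"}]
sensitive = {"pregnancy_test"}
print(satisfies_privacy(group, sensitive, p=2))
print(satisfies_privacy(group, sensitive, p=3))
```

Whether a frequency of exactly 1/p should be allowed is a modelling choice; the `strict` flag is there only to make that choice explicit.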
5Data Re-organization
PRESERVES CORRELATIONS!
Band Matrix Organization
6Published Data
Summary of Sensitive Items
7Contributions
- Novel data representation
- Preserves correlation among items
- Efficient heuristic for group formation
- Time linear in data size
- Supports multiple sensitive items
8State-of-the-art: Mondrian [FWR06]
- Generalization-based
- data-space partitioning
- similar to k-d-trees
- split recursively until privacy condition does not hold
- constrained global recoding
[Figure: Mondrian partitioning example, k = 2; axes: Age and Weight]
GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
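The recursive data-space partitioning described on this slide can be sketched as follows. This is a simplified illustration, not the implementation from [FWR06]: it assumes numeric records, cuts at the median of the attribute with the widest range, and stops when no cut leaves at least k records on each side. The names `mondrian` and `generalize` and the sample (age, weight) values are made up for the example.

```python
def mondrian(records, k):
    """Recursively split records (tuples of numbers) at the median of the
    widest attribute, keeping every partition at size >= k."""
    if len(records) < 2 * k:
        return [records]  # any split would leave one side below k
    dims = range(len(records[0]))

    def value_range(d):
        return max(r[d] for r in records) - min(r[d] for r in records)

    # Try attributes from the widest to the narrowest value range.
    for d in sorted(dims, key=value_range, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut on any attribute

def generalize(partition):
    """Replace each attribute by the (min, max) range over the partition."""
    dims = range(len(partition[0]))
    return [(min(r[d] for r in partition), max(r[d] for r in partition))
            for d in dims]

# Hypothetical (age, weight) records.
data = [(25, 50), (30, 55), (45, 70), (50, 90), (55, 60), (60, 95)]
parts = mondrian(data, k=2)
for part in parts:
    print(len(part), generalize(part))
```

Each published range grows with the spread of its partition, which is why the information loss becomes large when there are many attributes.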
9State-of-the-art: Anatomy [XT06]
- Permutation-based method
- discloses exact QID values
Original table
Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia
Anatomized table
Age  ZipCode  Disease (per group)
42   52000    Ulcer(1) Pneumonia(1)
47   43000
51   32000    Flu(1) Dyspepsia(1)
62   41000
55   27000    Gastritis(1) Dyspepsia(1)
67   55000
G! permutations
RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006
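The split into exact QID values plus a per-group summary of sensitive values can be illustrated with the records from this slide. The sketch below is an assumption-laden simplification: it groups consecutive records in the given order, whereas [XT06] chooses groups so that a diversity condition holds. The function name `anatomize` is hypothetical.

```python
from collections import Counter

def anatomize(records, group_size):
    """Split (age, zipcode, disease) records into a QID table with group ids
    and a per-group count of sensitive values, using consecutive grouping."""
    qid_table, summary = [], {}
    for i, (age, zipcode, disease) in enumerate(records):
        gid = i // group_size + 1
        qid_table.append((age, zipcode, gid))
        summary.setdefault(gid, Counter())[disease] += 1
    return qid_table, summary

# Records from the slide, in the order of the anatomized table.
records = [
    (42, 52000, "Ulcer"), (47, 43000, "Pneumonia"),
    (51, 32000, "Flu"), (62, 41000, "Dyspepsia"),
    (55, 27000, "Gastritis"), (67, 55000, "Dyspepsia"),
]
qid_table, summary = anatomize(records, group_size=2)
for gid, counts in summary.items():
    print(gid, dict(counts))
```

Within a group, a reader of the published tables cannot tell which of the listed diseases belongs to which row, only how often each one occurs.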
10Band Matrix Representation
- Bandwidth = U + L + 1
- Minimizing bandwidth is NP-hard
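As a small worked illustration of the bandwidth formula, the sketch below assumes that U and L are the upper and lower bandwidths, that is, the largest distance of a non-zero entry above and below the main diagonal. The function name and the two sample matrices are illustrative only.

```python
def bandwidth(matrix):
    """Return (U, L, U + L + 1) for a 0/1 matrix given as a list of rows."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # distance above the diagonal
                lower = max(lower, i - j)  # distance below the diagonal
    return upper, lower, upper + lower + 1

banded = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
scattered = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
print(bandwidth(banded))     # (1, 1, 3)
print(bandwidth(scattered))  # (3, 3, 7)
```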
11Reverse Cuthill-McKee (RCM)
- Heuristic Bandwidth Minimization
- Solves corresponding graph labeling problem
- Permutes rows and columns
- Complexity: O(N D log D)
- N = number of matrix rows (transactions)
- D = maximum degree of any vertex
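The graph-labeling view of RCM can be sketched in plain Python. The version below is a textbook-style simplification that works on an undirected graph given as an adjacency dictionary; how the graph is derived from the transaction matrix is not stated on this slide and is left out. The helper `graph_bandwidth` and the sample path graph are illustrative.

```python
from collections import deque

def reverse_cuthill_mckee(adjacency):
    """Return a vertex ordering for an undirected graph {vertex: neighbours}:
    breadth-first search visiting low-degree vertices first, then reversed."""
    degree = {v: len(adjacency[v]) for v in adjacency}
    visited, order = set(), []
    # Start each connected component from a minimum-degree vertex.
    for start in sorted(adjacency, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            unvisited = adjacency[v] - visited
            for w in sorted(unvisited, key=lambda w: (degree[w], w)):
                visited.add(w)
                queue.append(w)
    return order[::-1]

def graph_bandwidth(adjacency, order):
    """Largest distance between the positions of two adjacent vertices."""
    position = {v: i for i, v in enumerate(order)}
    return max((abs(position[v] - position[w])
                for v in adjacency for w in adjacency[v]), default=0)

# A path 0-3-1-4-2 whose labels hide its simple structure.
graph = {0: {3}, 1: {3, 4}, 2: {4}, 3: {0, 1}, 4: {1, 2}}
order = reverse_cuthill_mckee(graph)
print(order, graph_bandwidth(graph, sorted(graph)), graph_bandwidth(graph, order))
```

For real sparse matrices, SciPy ships `scipy.sparse.csgraph.reverse_cuthill_mckee`, which returns a permutation of this kind for a sparse matrix.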
12Group Formation
- Correlation-aware Anonymization of High-Dimensional Data (CAHD)
- Use the order given by RCM
- Consecutive transactions highly correlated
- O(pN) complexity
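The idea of forming groups from neighbouring rows can be sketched as below. This is a loose illustration under several assumptions, not the CAHD heuristic itself: each row is a pair of sets (quasi-identifying items, sensitive items), every sensitive row is grouped with the p - 1 nearby rows that share the most quasi-identifying items, a sensitive item may appear at most once per group, and no check is made that the leftover rows still satisfy the privacy condition. The name `form_groups` and the window parameter `alpha` are hypothetical.

```python
def form_groups(transactions, p, alpha=3):
    """transactions: list of (qid_items, sensitive_items) set pairs in
    band-matrix order. Returns groups as lists of row indices.
    Simplification: leftover rows are not re-checked for privacy."""
    n = len(transactions)
    free = [True] * n
    groups = []
    for t in range(n):
        qid_t, sens_t = transactions[t]
        if not free[t] or not sens_t:
            continue
        # Collect roughly alpha * p nearby free rows that do not conflict with t.
        candidates = []
        for dist in range(1, n):
            for c in (t - dist, t + dist):
                if 0 <= c < n and free[c] and not (transactions[c][1] & sens_t):
                    candidates.append(c)
            if len(candidates) >= alpha * p:
                break
        # Prefer candidates with the largest overlap in quasi-identifying items.
        candidates.sort(key=lambda c: len(transactions[c][0] & qid_t), reverse=True)
        group, used = [t], set(sens_t)
        for c in candidates:
            if len(group) == p:
                break
            if not (transactions[c][1] & used):
                group.append(c)
                used |= transactions[c][1]
        if len(group) == p:
            for r in group:
                free[r] = False
            groups.append(sorted(group))
    leftover = [r for r in range(n) if free[r]]
    if leftover:
        groups.append(leftover)
    return groups

# Hypothetical rows, assumed to be already in RCM order.
rows = [
    ({"a"}, set()),
    ({"a", "b", "c"}, {"s1"}),
    ({"b", "c"}, set()),
    ({"x", "y"}, set()),
    ({"x", "y", "z"}, {"s2"}),
    ({"z"}, set()),
]
print(form_groups(rows, p=2))
```

Because consecutive rows in the band matrix tend to share items, restricting the search to a small window around each sensitive row is what keeps both the cost and the loss of correlation low.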
13Group Formation
14Experimental Evaluation
15RCM Visualization
16Experimental Setting
- BMS dataset
- Compare with hybrid PermMondrian (PM)
- Combines Mondrian with Anatomy
- Query Workload
- Reconstruction Error
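The slide does not say how reconstruction error is computed. One common way to compare the actual distribution of a sensitive item with the distribution estimated from the published groups is the Kullback-Leibler divergence; the sketch below assumes that measure, and the function name and sample distributions are illustrative.

```python
import math

def kl_divergence(actual, estimated):
    """Kullback-Leibler divergence sum(a * ln(a / e)) between two discrete
    distributions given as equal-length lists of probabilities."""
    total = 0.0
    for a, e in zip(actual, estimated):
        if a == 0:
            continue  # terms with zero actual probability contribute nothing
        if e == 0:
            return math.inf  # estimate rules out something that does occur
        total += a * math.log(a / e)
    return total

# Hypothetical distributions over two query cells.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

A value of zero means the estimate matches the actual distribution exactly; larger values mean more information was lost by anonymization.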
17Reconstruction Error vs p
18Execution Time
19Conclusions
- Anonymizing transaction data
- High-dimensionality
- Preserving correlation
- Future work
- Different encodings for data representation
- Enhance correlation among consecutive rows