Title: Privacy Preserving Database Application Testing
1Privacy Preserving Database Application Testing
- Xintao Wu, Yongge Wang, Yuliang Zheng
- UNC Charlotte
2Overview
- Milestone
- Initial investigation from May 2002 to Dec 2002
- Officially started in Sept 2003, supported by NSF CCR-0310974 (200k, Sept 2003 to August 2005)
- The prototype system was finished in April 2005. Developed using C and Oracle, with 22K lines of source code
- Demo at several banks, May 2005
- Personnel
- Faculty: Xintao Wu, Yongge Wang, Yuliang Zheng
- Current graduate students: Songtao Guo, Ying Wu, Chintan Sanghvi, Guodong Jiao
- Previous graduate students: Jing Jin, Amol Kedar
- Several senior undergraduate students
- More Info
- http://www.cs.uncc.edu/xwu/privacy
- xwu_at_uncc.edu
3Motivation
- To generate synthetic data for DB application testing, especially performance testing.
- Many applications involve large-scale databases with sensitive information.
- Complete testing is essential for database applications to function correctly and to provide acceptable performance.
4Our Approach
- To generate synthetic databases based on a priori knowledge about the current production databases
- The needed a priori knowledge is generally available from ER, DDL, and the Data Dictionary: schema, data integrity rules, and basic statistical information
- Detailed statistical information can be extracted if the original data or samples from the production database are available
- The generated data can be of a realistic size or of any other size
- Better controllability, observability, and privacy
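As a rough illustration of the approach above, the following Python sketch generates rows from a small, hand-written description of a table. The table, its column names, the category weights, and the mean and standard deviation are all hypothetical assumptions made for this example. It is not the prototype's generator (which was written in C against Oracle), and it does not implement the General Location Model.

```python
import random

# Hypothetical a priori knowledge for one table. Every name and number here
# is an illustrative assumption, not taken from the prototype.
SCHEMA = {
    "account_id": {"type": "int", "unique": True},
    "state": {"type": "category", "weights": {"NC": 0.6, "SC": 0.3, "VA": 0.1}},
    "balance": {"type": "float", "mean": 2500.0, "stddev": 800.0, "min": 0.0},
}


def generate_rows(schema, n_rows, seed=0):
    """Generate n_rows synthetic rows that follow the schema description."""
    rng = random.Random(seed)  # fixed seed so a test run is repeatable
    rows = []
    for i in range(n_rows):
        row = {}
        for col, spec in schema.items():
            if spec["type"] == "int" and spec.get("unique"):
                # A sequential key trivially satisfies a uniqueness rule.
                row[col] = i + 1
            elif spec["type"] == "category":
                values = list(spec["weights"])
                weights = list(spec["weights"].values())
                row[col] = rng.choices(values, weights=weights, k=1)[0]
            elif spec["type"] == "float":
                value = round(rng.gauss(spec["mean"], spec["stddev"]), 2)
                # Clamp to the domain minimum; this slightly distorts the
                # lower tail of the distribution.
                row[col] = max(spec["min"], value)
        rows.append(row)
    return rows


rows = generate_rows(SCHEMA, 1000)
```

Because the row count is a parameter, the same description can produce a table of realistic size or of any other size, which is the controllability the slide refers to.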
5Three Characteristics of Synthetic Data
- Valid
- The synthetic data need to satisfy the same constraints and business rules as the live data
- Necessary for functional testing
- Privacy preserving
- No disclosure of any confidential information that needs to be protected
- Resembling real data
- The synthetic data need to have statistical distributions or patterns similar to those of the live data
- Necessary for performance testing, as the statistical nature of the data determines query performance
We will show that if data distributions are not similar, the execution time of the same workload may be totally different.
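To make the last point concrete, the sketch below runs the same query against two tables of equal size whose value distributions differ. It uses an in-memory SQLite database as a stand-in, and the table names, the `status` column, and the 1% and 50% proportions are assumptions chosen for illustration. It compares the fraction of rows the predicate selects rather than wall-clock time, since selectivity is one of the main factors that determines how much work a query does.

```python
import sqlite3


def build_table(conn, name, statuses):
    """Create a two-column table and load one row per status value."""
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        f"INSERT INTO {name} (id, status) VALUES (?, ?)",
        list(enumerate(statuses, start=1)),
    )


def selectivity(conn, name, status):
    """Return the fraction of rows in the table matching status = ?."""
    total = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    hits = conn.execute(
        f"SELECT COUNT(*) FROM {name} WHERE status = ?", (status,)
    ).fetchone()[0]
    return hits / total


conn = sqlite3.connect(":memory:")

# Hypothetical "live-like" data: 1 row in 100 is OPEN.
skewed = ["OPEN" if i % 100 == 0 else "CLOSED" for i in range(10_000)]
# Hypothetical synthetic data that ignores the skew: half the rows are OPEN.
uniform = ["OPEN" if i % 2 == 0 else "CLOSED" for i in range(10_000)]

build_table(conn, "live_like", skewed)
build_table(conn, "uniform_synth", uniform)

sel_skewed = selectivity(conn, "live_like", "OPEN")
sel_uniform = selectivity(conn, "uniform_synth", "OPEN")
print(f"live-like selectivity: {sel_skewed:.2%}")
print(f"uniform selectivity:   {sel_uniform:.2%}")
```

Both tables satisfy the same schema, yet the same predicate returns fifty times as many rows from the uniform table, so a performance test run on it would say little about the behavior on the skewed one.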
6Architecture
[Figure: architecture diagram. Labels shown: ER, DDL, Data, Catalog; R, NR, S; Schema Domain Filter; Disclosure Assessment; Performance Assessment; Schema; Domain; General Location Model; Data Generator; Synthetic database]
7Building a Project
8Data Dictionary Information
9Statistical Information Extraction Basic
10Statistical Information Extraction Advance
11Generating Meta Data File
12Generating Confidential File
13Disclosure Analysis - Categorical
14Numerical Disclosure Basic Batch Mode
15Numerical Disclosure Basic Single Mode
16Creating Final Categorical File
17Creating Final Rule File (GLM Format)
18Generating Data