Title: Security of Statistical Databases
1Security of Statistical Databases
- Mirka Miller
- University of Ballarat
- AUSTRALIA
3Talk Overview
- Confidentiality and Privacy
- Statistical Databases
- Statistical Compromise
- Security Mechanisms
- Noise Addition
- Query Restriction
- Audit Expert
- Usability
- Application of Combinatorics of Finite Sets to
solve a security problem
- Current Research
4Confidentiality and Privacy
- Confidential personal data is kept in
- Census databases
- Medical databases
- Employee databases
- Taxation office records
- Criminal records
- Bank balances and credit records
- Phone calls
- Shopping habits
- Driving records
- and many many others
- Confidentiality of personal data is a major issue
5Confidentiality and Privacy
- This data is used for statistical analysis,
knowledge discovery and data mining, and
facilitates research in many areas, including
- Marketing
- Medicine
- Crime investigation
- Examples of databases
- Census
- Medical
6Statistical Databases - Census
- Census in Australia goes back more than 200 years
- Until mid 1880s: musters for convicts, blue books
for free settlers
- By 1850 it became a stigma to be the descendant
of a convict
- Usually there were about 5 copies of a census,
kept well locked up
- Not enough protection: early census books were
ordered destroyed (privacy concerns)
- Oldest surviving census in Australia is the 1828
NSW census
7Statistical Databases - Census (cont.)
- Summary tables based on the census were published
about a year later
- Census in Australia now
- Data is kept on a computer
- Release of summary tables takes several years
- No on-line access
- Much more detailed information: privacy concerns
8Statistical Databases - Medical
- Another common concern regarding privacy is
medical records
- Traditionally
- confidentiality was assured by the doctor's
Hippocratic oath and the privacy of the household
- The situation now
- patient records are usually kept in a
computerised database: access issues
- Statistical access: de-personalising (removing
patient names before performing statistics) is not
enough; de-identified does not always mean
unidentifiable
9Statistical Databases
- What is sensitive data?
- A hundred years ago in Australia: convict ancestry
- Now more likely to be the amount of money a
person makes or having an illness such as AIDS or
mental illness
- Demand for privacy of personal data is nothing
new
- What is new: our use of computers gives us a
great capacity to gather data, process data,
communicate data, and generate statistics based on
the data
10Security Problem of SDB
- A statistical database (SDB) is an ordinary database
that returns only statistical information to user
queries, based on groups of records
- An SDB is compromised if it is possible to derive
confidential individual information from
statistics.
- The security problem of a statistical database is
to control the use of the database so
that only statistical information about groups of
records is available and no sequence of queries
is sufficient to infer protected information
about any individual
11Statistical Databases
- Various sizes
- A few hundred records or fewer, e.g., medical
data collected in an experiment
- Hundreds of thousands of records or more, e.g.,
census data, large medical databases
- General purpose or for statistics only
- On-line or off-line
- Static or dynamic
12Statistical Database Example
Table 1. Employee
13Database Compromise Example
- If anybody knows that Joe is the only male staff
member in the Operations section, then it is possible to
disclose his salary using a SUM query
- select SUM(Salary)
- from Employee
- where Section = 'Operations' AND Gender = 'M'
- Answer: 37500
14Database Compromise Example
- If a statistical query that selects only one
record is not allowed then we can use two
queries
- select SUM(Salary)
- from Employee
- where Section = 'Operations' AND Rank = 'Officer' AND
Gender = 'F'
- Answer: 89500; rows satisfying the query are R3
and R5
- select SUM(Salary)
- from Employee
- where Section = 'Operations' AND Rank = 'Officer'
- Answer: 127000; rows satisfying the query are
R3, R5, R10
- Then it can be deduced that Joe's salary is
127000 - 89500 = 37500
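The two-query attack above can be replayed in a few lines of Python. This is only a sketch: Table 1 is not reproduced in this transcript, so the names Ann and Eve and the individual salaries of R3 and R5 are invented. Only the two SUM answers and Joe's salary come from the slides.

```python
# Hypothetical rows: only the sums 89500 and 127000 and Joe's 37500 are
# taken from the slides; every other value here is invented for illustration.
EMPLOYEES = [
    {"id": "R3", "name": "Ann", "section": "Operations", "rank": "Officer", "gender": "F", "salary": 42000},
    {"id": "R5", "name": "Eve", "section": "Operations", "rank": "Officer", "gender": "F", "salary": 47500},
    {"id": "R10", "name": "Joe", "section": "Operations", "rank": "Officer", "gender": "M", "salary": 37500},
]


def sum_salary(rows, predicate):
    """Answer a SUM(Salary) query over the rows selected by predicate."""
    return sum(row["salary"] for row in rows if predicate(row))


def is_ops_officer(row):
    return row["section"] == "Operations" and row["rank"] == "Officer"


# Query 1: female officers in Operations (selects R3 and R5).
female_officers = sum_salary(EMPLOYEES, lambda r: is_ops_officer(r) and r["gender"] == "F")
# Query 2: all officers in Operations (selects R3, R5 and R10).
all_officers = sum_salary(EMPLOYEES, is_ops_officer)

# Neither query selects a single record, yet their difference does.
joe_salary = all_officers - female_officers
print(female_officers, all_officers, joe_salary)
```

Each query on its own looks harmless; the disclosure comes from the overlap between the two query sets.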
15Types of Compromise
- Positive compromise occurs whenever it is found
that an individual belongs to a particular
category or holds a particular data value
- Negative compromise occurs when it is determined
that the individual does not belong to the
category or does not possess a particular data
value
- If everything in a database can be deduced, the
database is completely compromised
- If information concerning at least one
individual can be determined, the database is
partially compromised
16Types of Compromise
- If the restricted information about a person is
disclosed, it is considered to be an individual
compromise; on the other hand, disclosure of
information about a group of individuals is
called a group compromise
- A subset S (|S| > 1) of records in a database is
relatively compromised if the relative order of
magnitude of the individuals in S is known
17Data Inference and Compromise
- All statistics contain vestiges of information
about individual records on which they are based
- It is often possible to compare several
statistics and be able to derive some additional
information about an individual; this is called
data inference
- If the data inference concerns supposedly
protected (confidential) information then the
statistical database has been compromised
- We are mostly interested in databases used
- For day-to-day processing of individual records,
as well as
- To obtain statistics about various subpopulations
of the database
18Security Problem of SDB
- Many information security problems can be solved
by the use of
- proper access methods
- encryption
- but the problem of statistical compromise cannot
be solved this way, since it concerns an authorised user
performing authorised actions and the statistics
are needed in the clear
19Control Mechanisms
- Most data inference techniques rely on set theory
and/or linear algebra
- Many techniques for the prevention of data
compromise have been proposed
- Two basic categories
- Noise addition
- Query restriction
20Noise Addition
- Noise addition techniques include
- Data perturbation
- Response perturbation
- Data swapping
- Random queries
- All introduce some error to the answers of
statistical queries
- Problem is, how much error should be used?
- If too little then approximate compromise is still possible
- If too much then the statistics are useless
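The trade-off in the last two bullets can be illustrated with response perturbation. The sketch below is an assumption-laden toy: the uniform noise model, the function names and the three salaries are all invented for illustration and are not the scheme of any particular system.

```python
import random


def perturbed_sum(values, max_error, rng):
    """Return the true sum plus uniform noise in [-max_error, +max_error]."""
    return sum(values) + rng.uniform(-max_error, max_error)


def difference_attack(max_error, rng):
    """Estimate one salary as the difference of two perturbed SUM answers."""
    others = [42000, 47500]  # hypothetical salaries
    target = 37500           # the value the attacker wants
    with_target = perturbed_sum(others + [target], max_error, rng)
    without_target = perturbed_sum(others, max_error, rng)
    return with_target - without_target


rng = random.Random(0)  # fixed seed only so that runs are repeatable

# Little noise: the estimate is within 200 of the true salary,
# which amounts to an approximate compromise.
small_noise_estimate = difference_attack(100, rng)

# Heavy noise: the estimate may be off by up to 100000, but honest
# SUM answers are distorted by up to 50000 as well.
large_noise_estimate = difference_attack(50000, rng)

print(round(small_noise_estimate), round(large_noise_estimate))
```

The estimate's error is bounded by twice the per-answer error, so the noise level that protects one individual is roughly the same noise level every legitimate user has to tolerate.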
21Query Restriction
- Query restriction techniques include
- Query set-size restriction
- Query overlap restriction
- Maximum order queries
- Audit Expert
- Query restriction techniques either answer a
query correctly (precisely) or not at all,
possibly giving an error message
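As an illustration of the first technique in the list, here is a minimal sketch of query set-size restriction. It assumes the usual formulation in which a query is answered only if its query set has between k and n - k records; the slides do not spell out the rule, and the salary list and the name restricted_sum are hypothetical.

```python
def restricted_sum(values, query_set, k):
    """Answer a SUM query only if k <= |query set| <= n - k, else refuse."""
    n = len(values)
    if not (k <= len(query_set) <= n - k):
        return None  # refused: query set too small or too large
    return sum(values[i] for i in query_set)


# Ten hypothetical salaries, indexed 0..9; record 9 is the target.
salaries = [41000, 39000, 42000, 52000, 47500, 61000, 45000, 38000, 55000, 37500]
K = 2

single = restricted_sum(salaries, {9}, K)               # refused
nearly_all = restricted_sum(salaries, set(range(9)), K)  # refused (9 > n - k)

# Two permitted queries whose query sets differ in exactly one record.
with_target = restricted_sum(salaries, {2, 4, 9}, K)
without_target = restricted_sum(salaries, {2, 4}, K)
leaked = with_target - without_target

print(single, nearly_all, leaked)
```

The restriction blocks the direct query but not the difference of two permitted queries, which is why overlap control and auditing appear in the same list.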
22Audit Expert
- The task of auditing may be delegated to the
database system so that the database system
- keeps track of the history of answered queries
and changes in the SDB
- checks for possible compromise by every new query
23Audit Expert Advantages
- Absolute Security
- By checking the past history of all the answered
queries, auditing allows the SDB to answer a
query only when it is secure to do so
- Maximum Information
- Given the previous query history of the SDB,
auditing can provide the maximum information to
the users. This includes accurate answers and as
many query answers to the users as the security
of the SDB permits
24Security Problem of SDB - Usability
- A query is answerable if answering it does not
lead to a compromise.
- SDB usability is the ratio of the number of
answerable queries to the number of posable
queries.
- Statistical query types: SUM, COUNT, MAX, MIN,
MEAN, etc.
- The security problem of SDB is to find a control
mechanism which would prevent compromise while
maximising the usability.
25Problem 1
- Given: a collection of SUM queries;
compromise: exact
- Question: is the collection of queries
compromise-free?
26Audit Expert
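- In a database of n+1 records, a collection of k
SUM queries can be thought of as a system of k
linear equations
$$
\begin{aligned}
\alpha_{1,1} x_1 + \alpha_{1,2} x_2 + \cdots + \alpha_{1,n+1} x_{n+1} &= q_1 \\
\alpha_{2,1} x_1 + \alpha_{2,2} x_2 + \cdots + \alpha_{2,n+1} x_{n+1} &= q_2 \\
&\;\;\vdots \\
\alpha_{k,1} x_1 + \alpha_{k,2} x_2 + \cdots + \alpha_{k,n+1} x_{n+1} &= q_k
\end{aligned}
$$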
27Audit Expert
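- In matrix form, $M_k X = Q$
- $M_k$ is called the query matrix.
$$
M_k = \begin{pmatrix}
\alpha_{1,1} & \alpha_{1,2} & \cdots & \alpha_{1,n+1} \\
\alpha_{2,1} & \alpha_{2,2} & \cdots & \alpha_{2,n+1} \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_{k,1} & \alpha_{k,2} & \cdots & \alpha_{k,n+1}
\end{pmatrix}
$$
- In order to avoid compromise we can have at most
n linearly independent queries.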
28Audit Expert
Table 1. Employee
Salary of all Males: query set (1 0 0 1 0)
Salary of all Females: (0 1 1 0 1)
Salary of all Officers: (1 0 1 0 1)
Salary of all under 25: (0 0 0 1 1)
29Audit Expert
- Salary of all Males: query set (1 0 0 1 0)
- Salary of all Females: (0 1 1 0 1)
- Salary of all Officers: (1 0 1 0 1)
- Salary of all under 25: (0 0 0 1 1)
- M5 ?
30Audit Expert
- Using Gaussian elimination we obtain
- $M_k = (I_k \mid M'_k)$
- where $I_k$ is the $k \times k$ identity matrix and $M'_k$ is a
$k \times (n-k+1)$ matrix.
- Answer to Problem 1
- Database is compromised if and only if there is a
row in the normalised query matrix $M_k$ that
contains exactly one non-zero element. (Chin and
Ozsoyoglu, 1982)
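The test stated above can be sketched in Python. This is an illustration, not the Audit Expert implementation: it computes a reduced row echelon form with exact fractions instead of permuting columns into the $(I_k \mid M'_k)$ layout (the rows are the same up to column order), and the function names are hypothetical. The query vectors are the four listed on the Employee slide.

```python
from fractions import Fraction


def reduced_row_echelon(rows):
    """Return the non-zero rows of the reduced row echelon form (exact)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == len(m):
            break
        # Find a row at or below pivot_row with a non-zero entry in this column.
        found = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if found is None:
            continue
        m[pivot_row], m[found] = m[found], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [x / pivot for x in m[pivot_row]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m[:pivot_row]


def is_compromised(query_vectors):
    """True if some normalised row has exactly one non-zero element."""
    return any(
        sum(1 for x in row if x != 0) == 1
        for row in reduced_row_echelon(query_vectors)
    )


# Query vectors from the Employee slide: Males, Females, Officers, under 25.
SLIDE_QUERIES = [
    (1, 0, 0, 1, 0),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (0, 0, 0, 1, 1),
]

print(is_compromised(SLIDE_QUERIES))                      # these four are safe together
print(is_compromised(SLIDE_QUERIES + [(0, 0, 0, 0, 1)]))  # a fifth query breaks it
```

Exact fractions are used because a floating-point elimination could leave tiny residues that hide a row with a single non-zero entry.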
31Problem 2
- Given: a database of n+1 records;
posable queries: all SUM queries;
compromise: exact
- Question: what is the maximum number
of queries that can be answered
without compromise?
32Mathematical Formulation
- In a multiset $A = \{a_1, a_2, \ldots, a_n\}$, $a_i \in \mathbb{R}$,
$a_i \neq 0$, what is the maximum number of partial
sums that are equal to 0 or 1?
- What can the numbers $a_1, a_2, \ldots, a_n$ be to
achieve the maximum?
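For small multisets the question can be explored by brute force. The sketch below simply enumerates all subsets; the function name is hypothetical, the empty subset (sum 0) is counted, and the comparison is exact only for integer or Fraction values.

```python
from itertools import combinations


def count_zero_one_sums(values):
    """Count subsets (the empty one included) whose element sum is 0 or 1."""
    count = 0
    for size in range(len(values) + 1):
        # combinations() works on positions, so repeated values are kept apart.
        for subset in combinations(values, size):
            if sum(subset) in (0, 1):
                count += 1
    return count


# The values a = -2, b = -1, c = 1, d = 1 used later in the talk.
talk_example = count_zero_one_sums([-2, -1, 1, 1])
# Two 1s and two -1s.
balanced = count_zero_one_sums([1, 1, -1, -1])

print(talk_example, balanced)
```

The enumeration is exponential in the number of values, so it is only useful for checking small cases by hand.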
33Mathematical Formulation (cont.)
- Where does this formulation come from?
- Mn
34Example of Low Usability
- Example: n = 6, maximum number of answerable
queries is 6.
- M6
35Example of High Usability
- Example: n = 6, maximum number of answerable
queries is 20.
- M6
36Usability of Secure SDB
- Answer to Problem 2
- Theorem (Miller, Roberts, Simpson, 1991)
- The maximum usability of a statistical database
of n records in which all queries are posable,
for exact compromise, is
37Usability of Secure SDB
- Proof makes use of a symmetric chain
decomposition of an n-set and the following
theorem.
- Theorem (Anderson, 1987)
- The subsets of an n-set can be partitioned into
$\binom{n}{\lfloor n/2 \rfloor}$ disjoint symmetric chains.
The number of symmetric chains containing exactly
i subsets is equal to
$\binom{n}{\lfloor (n+i)/2 \rfloor} - \binom{n}{\lfloor (n+i+1)/2 \rfloor}$
38Symmetric Chain Decomposition
- Symmetric chain decomposition of a 4-set
S = {a,b,c,d}
- Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
- {b} ⊂ {b,c} ⊂ {b,c,d}
- {c} ⊂ {c,d} ⊂ {a,c,d}
- {d} ⊂ {a,d} ⊂ {a,b,d}
- {a,c}
- {b,d}
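A symmetric chain decomposition can be generated for any small n. The sketch below uses the standard inductive construction (each chain is extended by the new element and also spawns a shorter sibling), which is one common way to build such a decomposition and not necessarily the one used for the slide. The individual chains it produces differ from those listed above, but the chain counts agree.

```python
from collections import Counter
from math import comb


def symmetric_chains(elements):
    """Build a symmetric chain decomposition of the subsets of elements."""
    chains = [[frozenset()]]
    for e in elements:
        grown = []
        for chain in chains:
            # Longer chain: keep every set and add e to the top set.
            grown.append(chain + [chain[-1] | {e}])
            # Shorter sibling: add e to every set except the top one.
            if len(chain) > 1:
                grown.append([s | {e} for s in chain[:-1]])
        chains = grown
    return chains


chains = symmetric_chains("abcd")
lengths = Counter(len(chain) for chain in chains)

print(len(chains), comb(4, 2))  # number of chains and C(4, 2)
print(sorted(lengths.items()))  # how many chains of each length

for chain in chains:
    print(" < ".join("{" + ",".join(sorted(s)) + "}" for s in chain))
```

For n = 4 this gives one chain of 5 subsets, three of 3 subsets and two singletons, the same profile as the decomposition on the slide.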
39Symmetric Chain Decomposition
- Suppose a = -2, b = -1, c = 1, d = 1. Associate with each
subset the sum of the values of its elements.
- Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
- 0, -2, -3, -2, -1
- {b} ⊂ {b,c} ⊂ {b,c,d}
- -1, 0, 1
- {c} ⊂ {c,d} ⊂ {a,c,d}
- 1, 2, 0
- {d} ⊂ {a,d} ⊂ {a,b,d}
- 1, -1, -2
- {a,c}: -1 and {b,d}: 0
40Symmetric Chain Decomposition
- a = -2, b = -1, c = 1, d = 1. Rearrange the order of the
subsets so that the partial sums in an
antichain are strictly increasing
- Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
- {a,b}, {b}, Ø, {c}, {c,d}
- -3, -1, 0, 1, 2
- {b} ⊂ {b,c} ⊂ {b,c,d}
- {c} ⊂ {c,d} ⊂ {a,c,d}
- {d} ⊂ {a,d} ⊂ {a,b,d}
- {a,c}
- {b,d}
41Symmetric Chain Decomposition
- The rest of the proof is a routine application
of Anderson's Theorem, adding once the number of
singleton antichains plus twice the number of
non-singleton antichains.
- {a,b}, {b}, Ø, {c}, {c,d}
- {a}, {a,c}, {a,c,d}
- {a,b,c}, {a,b,c,d}, {b,c,d}
- {a,b,d}, {b,d}, {d}
- {b,c}
- {a,d}
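The rearranged rows can be reproduced mechanically. One way to obtain them, which is an observation about this example and is not stated in the slides, is to replace each subset X in a chain by its symmetric difference with the set of negative-valued elements, here {a,b}. Each step along a chain then either adds a positive value or removes a negative one, so the sums strictly increase. The chains and values below are the ones given in the talk.

```python
VALUES = {"a": -2, "b": -1, "c": 1, "d": 1}
NEGATIVE = frozenset(e for e, v in VALUES.items() if v < 0)  # {a, b}

# The symmetric chains of {a,b,c,d} as listed in the talk.
CHAINS = [
    ["", "a", "ab", "abc", "abcd"],
    ["b", "bc", "bcd"],
    ["c", "cd", "acd"],
    ["d", "ad", "abd"],
    ["ac"],
    ["bd"],
]


def flip(subset):
    """Symmetric difference of a subset with the negative-valued elements."""
    return frozenset(subset) ^ NEGATIVE


def total(subset):
    """Sum of the values of the elements of a subset."""
    return sum(VALUES[e] for e in subset)


rearranged = [[flip(s) for s in chain] for chain in CHAINS]
sums = [[total(s) for s in row] for row in rearranged]

# Strictly increasing sums mean each row holds at most one subset with sum 0
# and at most one with sum 1: count 1 per singleton row, 2 per longer row.
bound = sum(1 if len(row) == 1 else 2 for row in rearranged)

for row, row_sums in zip(rearranged, sums):
    print(["".join(sorted(s)) or "Ø" for s in row], row_sums)
print(bound)
```

For these particular values only 7 subsets actually have sum 0 or 1, so the bound of 10 is not attained by this choice of a, b, c, d.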
42Maximum Usability of Secure SDB
- The maximum usability of a statistical database
of n records in which all queries are posable,
for exact compromise, is
- It is achieved by having roughly half 1s and
half -1s.
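The "roughly half 1s and half -1s" statement can be checked numerically for small cases. The sketch below is only an exploration, restricted to multisets whose values are all 1 or -1; the function names are hypothetical and the empty subset is counted. It tallies subset sums with a small dynamic program instead of enumerating subsets.

```python
from collections import Counter


def sum_distribution(values):
    """Map each achievable subset sum to the number of subsets giving it."""
    dist = Counter({0: 1})  # the empty subset has sum 0
    for v in values:
        step = Counter()
        for s, ways in dist.items():
            step[s] += ways      # subsets that leave v out
            step[s + v] += ways  # subsets that include v
        dist = step
    return dist


def zero_one_count(values):
    """Number of subsets (empty one included) whose sum is 0 or 1."""
    dist = sum_distribution(values)
    return dist[0] + dist[1]


def sweep(n):
    """For n values in {1, -1}: number of -1s -> count of 0/1 subset sums."""
    return {m: zero_one_count([1] * (n - m) + [-1] * m) for m in range(n + 1)}


counts = sweep(5)
best = max(counts, key=counts.get)
print(counts)
print(best, counts[best])
```

With five values the count peaks at three 1s and two -1s and falls off quickly towards either extreme, which is the behaviour the slide describes.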
43Our Research
- Identifying a new type of compromise called a
relative compromise. We gave techniques both for
achieving it and for preventing it
(with Jennie Seberry)
- Finding general upper bounds for the number of
statistical queries that can be safely answered
(with Jamie Simpson, Ian Roberts, Ljiljana
Brankovic)
- For practical reasons we allow a given
set of queries to be included before maximising (with Ljiljana
Brankovic, Jozef Siran)
44Our Research
- Supplementary knowledge is always needed for the
identification of a particular record. We have
given a general framework of data inference and
compromise incorporating supplementary knowledge
- Lately we became interested in range queries
(with Ljiljana Brankovic, Peter Horak)
- Implementing an expert system including legal
requirements (with Vivek Mishra, Andrew
Stranieri, Joe Ryan)
- Theoretical research continues (with Yuliya
Lenard)
45Our Research
- The security problem of statistical databases is
- An important practical problem
- Should be of concern to any owner of a database
that releases statistics based on confidential
personal data
- Is becoming more and more important as new
privacy acts and legislation are introduced
- As insurance companies get involved it will be
necessary to demonstrate that security measures
are used
46Our Research
- The security problem of statistical databases is
- An ideal topic for a PhD thesis
- An excellent vehicle for theoretical research,
practical implementations, comparative study of
security mechanisms
- Links with other researchers in Australia and
overseas
- Information security courses are being introduced
in many universities so chances of employment may
be higher than in other areas