Title: Privacy-oriented Data Mining by Proof Checking
1. Privacy-oriented Data Mining by Proof Checking
- Stan Matwin
- (joint work with Amy Felty)
- SITE
- University of Ottawa, Canada
- stan_at_site.uottawa.ca
2. The TAMALE Group
- 4 profs
- Some 30 graduate students
- Areas: machine learning, data mining, text mining, NLP, data warehousing
- Research in:
- Inductive Logic Programming
- Text mining
- Learning in the presence of knowledge
- Applications of ML/DM (e.g. in SE tools for
maintenance personnel)
3.
- Why did I get into this research?
- What is already being done, and why it is not enough
- The main idea
- Its operation
- Discussion: correctness
- The prototype: Coq and CIC
- Example
- Some technical challenges
- Acceptance?
4. Some useful concepts...
- opting out vs opting in
- Use Limitation Principle (ULP): data should be used only for the explicit purpose for which it has been collected
5. ...and existing technical proposals
- On the web: P3P (Platform for Privacy Preferences)
- A W3C standard
- XML specifications, on websites and in browsers, of what can be collected and for what purpose
- ULP?
- Handles cookies
- A data exchange protocol more than a privacy protocol: no provisions for opting out after an initial opt-in
- The ULP part is in natural language, hence not verifiable
6. Agrawal's data perturbation transformations
- Data is perturbed by random distortion: x_i → x_i + r, with r uniform or Gaussian (see the sketch below)
- A procedure to reconstruct a PAC estimation of the original distribution (but not the values)
- A procedure to build an accurate decision tree on the perturbed distribution
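A minimal sketch of the value-level randomization described above (an illustration only, not Agrawal's implementation); the spread of the noise is a free parameter of the scheme:

    import random

    def perturb(values, noise="uniform", spread=10.0):
        """Release x_i + r instead of x_i; only the distribution of r is disclosed."""
        if noise == "uniform":
            return [x + random.uniform(-spread, spread) for x in values]
        if noise == "gaussian":
            return [x + random.gauss(0.0, spread) for x in values]
        raise ValueError("noise must be 'uniform' or 'gaussian'")

    # Example: ages released to the data miner instead of the true values.
    print(perturb([23, 35, 41, 58], noise="gaussian"))

The miner sees only the perturbed values and the noise distribution, from which the original distribution (but not the individual values) can be estimated.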
7. Agrawal's transformations, cont'd
- Proposes a measure to quantify privacy: estimated intervals and their size
- Lately extended to non-numerical attributes and to association rules
- Does not address the ULP
- How do we know it is applied?
8. The main idea: towards a verifiable ULP
- The user sets permissions: what can and cannot be done with her data
- A claim that the software respects these permissions is a theorem about the software
- Verifying the claim is then checking a proof of that theorem against the software
9. Who are the players?
- User: C
- Data miner: Org
- Data mining software developer: Dev
- Independent verifier: Veri
- BUT: no one owns the data D
10.
- D: database scheme
- A: a given set of database and data mining operations
- S: source code for A
- PC(D,A): C's permissions
- T(PC,S): theorem that S respects PC
- R(PC,S): proof of T(PC,S)
- B: binary code of S
11. Discussion: properties
- It can be proven that C's permissions are respected (or not): PC is in fact a verifiable ULP
- PC can be negative (opt-out) or positive (opt-in)
- Proof construction needs to be done only once for a given PC, D and A
- The scheme is robust against cheating by Dev or Org
12. Acceptance issues
- No Org will give Veri access to S
- Too much overhead to check R(PC,S) for each task and each user
- Too cumbersome for C
- Based on all Orgs buying in
13. Acceptance 1: Veri's operation and access
- Veri needs:
- PC from C
- R(S, PC) from Dev
- S from Dev
- B from Org
- Veri could check R(S, PC) at Dev's site (see the sketch below)
- Veri needs to verify that S (normally belonging to Dev) corresponds to the B that Org runs.
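A minimal sketch of the proof-checking step from Veri's side, assuming the permissions PC, the source-level model of S, and the proof R are packaged as a single Coq file (the file name proof_of_PC.v is hypothetical); the Coq compiler accepts the file only if the proof checks:

    import subprocess

    def proof_checks(coq_file="proof_of_PC.v"):
        """True iff Coq accepts the development, i.e. the submitted proof R(PC, S) checks."""
        result = subprocess.run(["coqc", coq_file], capture_output=True, text=True)
        return result.returncode == 0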
14. Acceptance 2: overhead
- Veri runs proof checking on a control (spot-check) basis
- Org's execution overhead?
15. Issues
- Naming the fields: XML or disclosure
- A restricted class of theorems for a given PC; automating proof techniques for this class
16. Acceptance 3: C's perspective
- Building PCs must be easy for C, based on D and the processing schema (initially a closed set?)
- Permissions could be encoded on a credit card, a smart card, or in an electronic wallet
- Or held at the CA; they can then be dynamically modified and revoked
17. Political aspects: who is Veri?
- Generally trusted
- A consumer association?
- Ralph Nader?
- Transparency International?
- An IT expert, at the level of instrumenting and running the proof checker; a connection to the Open Software Foundation?
- Theorem proving can be cast as better testing
18. How to make Orgs buy in?
- A first Org is needed to volunteer
- A "Green Data Mining" logo will be granted and administered (verified) by Veri
- Other Orgs will then have an incentive to join
19. Future work
- Build the tools
- Expand the prototype
- Extend from Weka to commercial data mining packages
- Integrate with P3P?
- Find a willing Org
20. (No transcript)
21. Link between S and B
- Compilation is not an option
- Watermarking solution: B is watermarked by a slightly modified compiler with MD5(tar(S)), a 128-bit mark
- Marks are inserted by a trusted makefile-and-compiler at locations in B given by Veri and unknown to Org
22. Link, cont'd
- Veri, given access to S, can verify that B corresponds to S (see the sketch below)
- An attack by I requires hacking the compiler
- An attack by Org requires knowing the locations of the watermarks
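A minimal sketch of how Veri could verify this link (an illustration, not the actual instrumented compiler); it assumes the tar archive of S is built reproducibly, so that MD5(tar(S)) is identical for Dev and Veri, and that the byte offsets are the watermark locations known only to Veri:

    import hashlib
    import io
    import tarfile

    def source_digest(source_dir):
        """MD5(tar(S)): a 128-bit (16-byte) digest of the archived source tree."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            tar.add(source_dir, arcname="S")
        return hashlib.md5(buf.getvalue()).digest()

    def binary_matches_source(binary_path, source_dir, offsets):
        """Veri's check: B carries MD5(tar(S)) at every offset Veri chose."""
        mark = source_digest(source_dir)
        with open(binary_path, "rb") as f:
            data = f.read()
        return all(data[o:o + len(mark)] == mark for o in offsets)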
23. Example
- C restricts her Employee data from participating in a join with her Payroll data

    Require Import List String.

    Record Payroll : Set :=
      mkPay { PID : nat; JoinInd : bool; Position : string; Salary : nat }.
    Record Employee : Set :=
      mkEmp { Name : string; EID : nat }.
    Record Combined : Set :=
      mkComb { CID : nat; CName : string; Csalary : nat }.
24.

    Fixpoint Join (Ps : list Payroll) (Es : list Employee) : list Combined :=
      match Ps with
      | nil => nil
      | cons p ps => app (check_JoinInd_and_find_employee_record p Es)
                         (Join ps Es)
      end.

- (check_JoinInd_and_find_employee_record p Es): if a record is found in Es whose EID matches p's PID and p's JoinInd permits the join, then a list of length 1 with the result of the join is returned; otherwise the empty list is returned
25.

    Definition PC (S : list Payroll -> list Employee -> list Combined) : Prop :=
      forall (Ps : list Payroll) (Es : list Employee),
        UniqueJoinInd Ps ->
        forall P : Payroll, In P Ps -> JoinInd P = false ->
          ~ exists C : Combined, In C (S Ps Es) /\ CID C = PID P.

- PC(S) is written as (PC Join); Coq expands the definition of PC and provides the theorem
- A request to Coq's proof-checking operator will then check this proof, i.e. it will check that the user's permissions are encoded into the given Join program
- Whole proof: 300 lines of Coq code; proof checking: 1 sec on a 600 MHz machine