Title: Foundations of Privacy Lecture 10
1 Foundations of Privacy: Lecture 10
2 Recap of lecture two weeks ago
- Continually changing data
- Counters
- How to combine expert advice
- Multi-counter and the list update problem
- Pan Privacy
3 What if the data is dynamic?
- Want to handle situations where the data keeps changing
  - Not all data is available at the time of sanitization
[Figure label: Curator / Sanitizer]
4 Google Flu Trends
We've found that certain search terms are good
indicators of flu activity. Google Flu Trends
uses aggregated Google search data to estimate
current flu activity around the world in near
real-time.
5 Example of Utility: Google Flu Trends
6 What if the data is dynamic?
- Want to handle situations where the data keeps changing
  - Not all data is available at the time of sanitization
- Issues:
  - When does the algorithm make an output?
  - What does the adversary get to examine?
  - How do we define an individual whom we should protect? (D + Me)
  - Efficiency measures of the sanitizer
7 Data Streams
- Data is a stream of items
- Sanitizer sees each item and updates its internal state
- Produces output either on the fly or at the end
[Figure: Data Stream → Sanitizer → output]
8 Three new issues/concepts
- Continual Observation
  - The adversary gets to examine the output of the sanitizer all the time
- Pan Privacy
  - The adversary gets to examine the internal state of the sanitizer. Once? Several times? All the time?
- "User" vs. "Event" Level Protection
  - Are the items "singletons" or are they related?
9 Randomized Response
- Randomized Response Technique [Warner 1965]
  - Method for polling stigmatizing questions
  - Idea: lie with known probability
    - Specific answers are deniable
    - Aggregate results are still valid
- The data is never stored "in the plain": trust no one
- Popular in DB literature [Mishra and Sandler]
[Figure: each user bit (1, 0, 1) passes through noise before it is reported]
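The randomized response idea can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the parameter `p_truth` (the probability of answering truthfully) and both function names are assumptions made for the example. With `p_truth = 0.75`, each report is ln(3)-differentially private, since the two possible true bits produce any given report with probabilities in ratio at most 0.75/0.25.

```python
import random

def randomized_response(bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise the flipped bit."""
    return bit if rng.random() < p_truth else 1 - bit

def estimate_fraction_of_ones(reports, p_truth):
    """Unbiased estimate of the true fraction of 1s.

    E[report] = (1 - p_truth) + (2 * p_truth - 1) * true_fraction,
    so we invert that linear relation.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000          # true fraction of 1s is 0.3
reports = [randomized_response(b, 0.75, rng) for b in true_bits]
estimate = estimate_fraction_of_ones(reports, 0.75)
```

Individual reports are deniable, yet `estimate` lands close to the true fraction because the bias introduced by lying is known and can be removed in aggregate.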
10 The Dynamic Privacy Zoo
[Figure: "petting zoo" diagram relating the privacy notions: Differentially Private, Continual Observation, Pan Private, User-Level Private, Randomized Response, and User-Level Continual Observation Pan Private]
11 Continual Output Observation
- Data is a stream of items
- Sanitizer sees each item and updates its internal state
- Produces an output observable to the adversary
[Figure: Sanitizer → Output]
12 Continual Observation
- Alg: algorithm working on a stream of data
  - Mapping prefixes of data streams to outputs
  - Step i output: $\sigma_i$
- Alg is $\varepsilon$-differentially private against continual observation if for all
  - adjacent data streams $S$ and $S'$
  - prefixes t and outputs $\sigma_1 \sigma_2 \dots \sigma_t$

  $e^{-\varepsilon} \le \frac{\Pr[Alg(S) = \sigma_1 \sigma_2 \dots \sigma_t]}{\Pr[Alg(S') = \sigma_1 \sigma_2 \dots \sigma_t]} \le e^{\varepsilon} \approx 1 + \varepsilon$

- Adjacent data streams: can get from one to the other by changing one element
  - Example: $S$ = acgtbxcde, $S'$ = acgtbycde
13 The Counter Problem
- 0/1 input stream: 011001000100000011000000100101
- Goal: a publicly observable counter, approximating the total number of 1s so far
- Continual output: each time period, output the total number of 1s
- Want to hide individual increments while providing reasonable accuracy
14 Counters w. Continual Output Observation
- Data is a stream of 0/1 values
- Sanitizer sees each $x_i$ and updates its internal state
- Produces a value observable to the adversary
[Figure: input stream 1 0 0 1 0 0 1 1 0 0 0 1 → Sanitizer → Output 1 1 1 2]
15 Counters w. Continual Output Observation
- Continual output: each time period, output the total number of 1s
- Initial idea: at each time period, on input $x_i \in \{0, 1\}$
  - Update counter by input $x_i$
  - Add independent Laplace noise with magnitude $1/\varepsilon$
- Privacy: since each increment is protected by Laplace noise, the output is differentially private whether $x_i$ is 0 or 1
- Accuracy: noise cancels out, error $\tilde{O}(\sqrt{T})$, where T is the total number of time periods
  - For sparse streams this error is too high
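The initial idea above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `laplace` (sampled as a difference of two exponentials) and the function name `noisy_counter` are not from the lecture, and the stream is synthetic.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_counter(stream, epsilon, rng):
    """Yield the running count, adding x_i plus fresh Laplace(1/epsilon) noise at each step."""
    count = 0.0
    for x in stream:
        count += x + laplace(1.0 / epsilon, rng)
        yield count

rng = random.Random(1)
stream = [rng.randint(0, 1) for _ in range(10_000)]   # synthetic 0/1 stream
outputs = list(noisy_counter(stream, epsilon=1.0, rng=rng))
final_error = abs(outputs[-1] - sum(stream))
```

Because T independent noise terms accumulate in the counter, the error grows like the square root of T regardless of how many 1s the stream contains, which is why this approach is poor for sparse streams.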
16 Why So Inaccurate?
- Operate essentially as in randomized response
  - No utilization of the state
- Problem: we do the same operations when the stream is sparse as when it is dense
  - Want to act differently when the stream is dense
- The times where the counter is updated are potential leakage
17 Delayed Updates
- Main idea: update the output value only when there is a large gap between the actual count and the output
- Have a good way of outputting the value of the counter once: the actual counter + noise
- Maintain:
  - Actual count $A_t$ (+ noise)
  - Current output $out_t$ (+ noise)
  - $D$: update threshold
18 Delayed Output Counter
- $Out_t$: current output
- $A_t$: count since last update
- $D_t$: noisy threshold
- If $A_t - D_t >$ fresh noise then
  - $Out_{t+1} \leftarrow Out_t + A_t +$ fresh noise
  - $A_{t+1} \leftarrow 0$
  - $D_{t+1} \leftarrow D +$ fresh noise
- Noise: independent Laplace noise with magnitude $1/\varepsilon$
- Accuracy:
  - For threshold D: w.h.p. update about N/D times
  - Total error: $(N/D)^{1/2} \cdot$ noise $+ D +$ noise $+$ noise
  - Set $D = N^{1/3}$ ⇒ accuracy about $N^{1/3}$
[Figure label: delay]
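The update rule of the delayed output counter can be sketched as below. This is a hedged illustration, not the lecture's implementation: the function and variable names are assumptions, the stream is synthetic, and the sketch demonstrates only the mechanics and the accuracy behavior, not a privacy proof.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def delayed_output_counter(stream, epsilon, threshold, rng):
    """Return the published output after each item, updating only on large noisy gaps."""
    scale = 1.0 / epsilon
    out = 0.0                                          # Out_t: current published output
    pending = 0                                        # A_t: count since the last update
    noisy_threshold = threshold + laplace(scale, rng)  # D_t
    outputs = []
    for x in stream:
        pending += x
        if pending - noisy_threshold > laplace(scale, rng):   # noisy threshold test
            out += pending + laplace(scale, rng)              # noisy update value
            pending = 0
            noisy_threshold = threshold + laplace(scale, rng)
        outputs.append(out)
    return outputs

rng = random.Random(2)
stream = [rng.randint(0, 1) for _ in range(2000)]     # synthetic 0/1 stream
threshold = round(len(stream) ** (1 / 3))             # D = N^(1/3), using the length as the bound N
outputs = delayed_output_counter(stream, 1.0, threshold, rng)
num_updates = len(set(outputs)) - 1                   # distinct published values, minus the initial 0
```

The published value changes only about N/D times, so far fewer noise terms accumulate than in the per-step counter, at the cost of a delay of roughly D.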
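19 Privacy of Delayed Output
- Update rule: if $A_t - D_t >$ fresh noise, then $Out_{t+1} \leftarrow Out_t + A_t +$ fresh noise and $D_{t+1} \leftarrow D +$ fresh noise
- Protect: update time and update value
- For any two adjacent sequences
  - 101101110001
  - 101101010001
- Can pair up noise vectors
  - $\eta_1 \eta_2 \dots \eta_{k-1}\ \eta_k\ \eta_{k+1} \dots$
  - $\eta_1 \eta_2 \dots \eta_{k-1}\ \eta'_k\ \eta_{k+1} \dots$
- Identical in all locations except one
  - $\eta'_k = \eta_k + 1$
  - Where: the first update after the difference occurred
[Figure labels: $D_t$, $D'_t$; Prob ≈ $e^{\varepsilon}$]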
20 Dynamic from Static
- Idea: apply conversion of static algorithms into dynamic ones [Bentley-Saxe 1980]
- Run many accumulators in parallel:
  - each accumulator counts the number of 1s in a fixed segment of time, plus noise
  - an accumulator is measured while the stream is in its time frame
- Value of the output counter at any point in time: sum of the accumulators of a few segments
  - Only finished segments are used
- Accuracy: depends on the number of segments in the summation and the accuracy of the accumulators
- Privacy: depends on the number of accumulators that a point $x_t$ influences
21 The Segment Construction
- Based on the bit representation: each point t is in $\lceil \log t \rceil$ segments
- $\sum_{i=1}^{t} x_i$: sum of at most $\log t$ accumulators
- By setting $\varepsilon' \approx \varepsilon / \log T$ can get the desired privacy
- Accuracy: with all but negligible in T probability, the error at every step t is at most $O((\log^{1.5} T)/\varepsilon)$
[Figure label: canceling]
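The segment construction can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `dyadic_segments` and `segment_counter` are made up for the example, the per-accumulator noise scale is taken as (number of levels)/ε, and the whole stream is kept in memory for simplicity, whereas a streaming implementation would keep only the accumulators it still needs.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dyadic_segments(t):
    """Split [1, t] into disjoint segments, one per 1-bit in the binary representation of t."""
    segments, start = [], 1
    for bit in reversed(range(t.bit_length())):
        length = 1 << bit
        if t & length:
            segments.append((start, start + length - 1))
            start += length
    return segments

def segment_counter(stream, epsilon, rng):
    """Output at step t the sum of the noisy accumulators of the finished segments covering [1, t]."""
    T = len(stream)
    levels = max(1, T.bit_length())      # each item lies in at most this many segments
    scale = levels / epsilon             # i.e. epsilon' = epsilon / levels per accumulator
    prefix = [0]
    for x in stream:
        prefix.append(prefix[-1] + x)
    noisy = {}                           # segment -> noisy accumulator, drawn once and reused
    outputs = []
    for t in range(1, T + 1):
        total = 0.0
        for a, b in dyadic_segments(t):
            if (a, b) not in noisy:
                noisy[(a, b)] = prefix[b] - prefix[a - 1] + laplace(scale, rng)
            total += noisy[(a, b)]
        outputs.append(total)
    return outputs

rng = random.Random(3)
stream = [rng.randint(0, 1) for _ in range(1024)]     # synthetic 0/1 stream
outputs = segment_counter(stream, epsilon=1.0, rng=rng)
true_counts = [sum(stream[:t]) for t in range(1, len(stream) + 1)]
max_error = max(abs(o - c) for o, c in zip(outputs, true_counts))
```

Each output sums at most log T noisy accumulators, and each accumulator's noise is drawn once and reused, so an item's influence is confined to the logarithmically many segments that contain it.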
22 Synthetic Counter
- Can make the counter synthetic:
  - Monotone
  - Each round the counter goes up by at most 1
- Applies to any monotone function
23 Lower Bound on Accuracy
- Theorem: additive inaccuracy of $\log T$ is essential for $\varepsilon$-differential privacy, even for $\varepsilon = 1$
- Consider the stream $0^T$ compared to the collection of $T/b$ streams of the form $S_j = 0^{jb} 1^b 0^{T-(j+1)b}$
  - $S_j$ = 000000001111000000000000 (the block of 1s has length b)
- Call an output sequence correct if it is a b/3 approximation for all points in time
24 Lower Bound on Accuracy
$S_j$ = 000000001111000000000000
- Important properties:
  - For any output: the ratio of probabilities under stream $S_j$ and $0^T$ should be at least $e^{-\varepsilon b}$
    - Hybrid argument from differential privacy
  - Any output sequence is correct (a b/3 approximation for all points in time) for at most one $S_j$ or $0^T$
- Say the probability of a good output sequence is at least $\gamma$
  - Output good for $S_j$: probability under $0^T$ is at least $\gamma e^{-\varepsilon b}$
  - With $b = \frac{1}{2} \log T$ and $\gamma = 1/2$: $(T/b)\, \gamma e^{-\varepsilon b} > 1 - \gamma$, a contradiction
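25 Hybrid Proof
- Want to show that for any event B:
  $\frac{\Pr[A(0^T) \in B]}{\Pr[A(S_j) \in B]} \ge e^{-\varepsilon b}$
- Let $S_j^i = 0^{jb} 1^i 0^{T-jb-i}$, so that $S_j^0 = 0^T$ and $S_j^b = S_j$
- Neighboring hybrids are adjacent streams:
  $\frac{\Pr[A(S_j^i) \in B]}{\Pr[A(S_j^{i+1}) \in B]} \ge e^{-\varepsilon}$
- Multiplying over the b steps:
  $\frac{\Pr[A(S_j^0) \in B]}{\Pr[A(S_j^b) \in B]} = \frac{\Pr[A(S_j^0) \in B]}{\Pr[A(S_j^1) \in B]} \cdots \frac{\Pr[A(S_j^{b-1}) \in B]}{\Pr[A(S_j^b) \in B]} \ge e^{-\varepsilon b}$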
26 What shall we do with the counter?
- Privacy-preserving counting is a basic building block in more complex environments
- General characterizations and transformations: event-level pan-private continual-output algorithm for any low sensitivity function
- Following expert advice privately: track experts over time, choose whom to follow
  - Need to track how many times each expert was correct
27 Following Expert Advice
[Hannan 1957] [Littlestone Warmuth 1989]
- n experts, in every time period each gives 0/1 advice
  - pick which expert to follow
  - then learn the correct answer, say in 0/1
- Goal: over time, competitive with the best expert in hindsight

Expert 1: 1 1 1 0 1
Expert 2: 0 1 1 0 0
Expert 3: 0 0 1 1 1
Correct:  0 1 1 0 0
28 Following Expert Advice
- n experts, in every time period each gives 0/1 advice
  - pick which expert to follow
  - then learn the correct answer, say in 0/1
- Goal: over time, competitive with the best expert in hindsight
  - Goal: # mistakes of chosen experts ≈ # mistakes made by the best expert in hindsight
  - Want a $1 + o(1)$ approximation

Expert 1: 1 1 1 0 1
Expert 2: 0 1 1 0 0
Expert 3: 0 0 1 1 1
Correct:  0 1 1 0 0
29 Following Expert Advice, Privately
- n experts, in every time period each gives 0/1 advice
  - pick which expert to follow
  - then learn the correct answer, say in 0/1
- Goal: over time, competitive with the best expert in hindsight
- New concern: protect privacy of experts' opinions and outcomes
  - User-level privacy: lower bound, no non-trivial algorithm (was the expert consulted at all?)
  - Event-level privacy: counting gives a $(1 + o(1))$-competitive algorithm
30 Algorithm for Following Expert Advice
- Follow perturbed leader [Kalai Vempala]: for each expert keep the perturbed # of mistakes; follow the expert with the lowest perturbed count
- Idea: use counter, count the privacy-preserving # of mistakes
- Problem: not every perturbation works; need a counter with a well-behaved noise distribution
- Theorem [Follow the Privacy-Perturbed Leader]: for n experts, over T time periods, the # of mistakes is within ≈ poly$(\log n, \log T, 1/\varepsilon)$ of the best expert
31 List Update Problem
- There are n distinct elements $A = \{a_1, a_2, \dots, a_n\}$
- Have to maintain them in a list: some permutation
  - Given a request sequence $r_1, r_2, \dots$
    - Each $r_i \in A$
  - For request $r_i$: cost is how far $r_i$ is in the current permutation
  - Can rearrange the list between requests
- Want to minimize the total cost for the request sequence
  - Sequence not known in advance
- Our goal: do it while providing privacy for the request sequence, assuming the list order is public
  - for each request $r_i$: cannot tell whether $r_i$ is in the sequence or not
32 List Update Problem
- In general: cost can be very high
- First problem to be analyzed in the competitive framework, by Sleator and Tarjan (1985)
- Compared to the best algorithm that knows the sequence in advance
- Best algorithms:
  - 2-competitive deterministic
  - Better randomized: about 1.5
- Assume free rearrangements between requests
- Bad news: cannot be better than $(1/\varepsilon)$-competitive if we want to keep privacy
  - Cannot act until $1/\varepsilon$ requests to an element appear
33 Lower bound for Deterministic Algorithms
- Bad schedule: always ask for the last element in the list
- Cost of online: $n \cdot t$
- Cost of best fixed list: sort the list according to popularity
  - Average cost: $\frac{1}{2} n$
  - Total cost: $\frac{1}{2} n t$
34 List Update Problem: Static Optimality
- A more modest performance goal: compete with the best algorithm that fixes the permutation in advance
- Blum-Chawla-Kalai: can be $1 + o(1)$ competitive w.r.t. the best static algorithm (probabilistic)
- BCK algorithm is based on the number of times each element has been requested
- Algorithm:
  - Start with random weights $r_i$ in range $[1, c]$
  - At all times $w_i = r_i + c_i$
    - $c_i$ is the # of times element $a_i$ was requested
  - At any point in time: arrange elements according to weights
35 Privacy with Static Optimality
- Algorithm:
  - Start with random weights $r_i$ in range $[1, c]$
  - At any point in time $w_i = r_i + c_i$
    - $c_i$ is the # of times element $a_i$ was requested (run with a private counter)
  - Arrange elements according to weights
- Privacy: from privacy of counters
  - list depends on counters plus randomness
- Accuracy: can show that the BCK proof can be modified to handle approximate counts as well
- What about efficiency?
36 The multi-counter problem
- How to run n counters for T time steps
- In each round: few counters are incremented
  - Identity of the incremented counter is kept private
- Work per increment: logarithmic in n and T
- Idea: arrange the n counters in a binary tree with n leaves
  - Output counters associated with leaves
  - For each internal node: maintain a counter corresponding to the sum of leaves in its subtree
37 The multi-counter problem
- Idea: arrange the n counters in a binary tree with n leaves
  - Output counters associated with leaves
  - For each internal node maintain:
    - Counter corresponding to the sum of leaves in its subtree
    - Register with the number of increments since the last output update (determines when to update the subtree)
- When a leaf counter is updated:
  - All log n nodes on the path to the root are incremented
  - Internal state of the root is updated
  - If the output of a parent node is updated, the internal state of its children is updated
[Figure label: (internal, output)]
38 Tree of Counters
[Figure: tree of counters; nodes are labeled (counter, register), with the output counters marked]
39 The multi-counter problem
- Work per increment:
  - log n increments + the number of counters that need to be updated
  - Amortized complexity is O(n log n / k)
    - k: number of times we expect to increment a counter until its output is updated
- Privacy: each increment of a leaf counter affects log n counters
- Accuracy: we have introduced some delay:
  - After $t \ge k \log n$ increments all nodes on the path have been updated
40 Pan-Privacy
- In the privacy literature: the data curator is trusted
- In reality:
  - even a well-intentioned curator is subject to mission creep, subpoena, security breach
    - Pro baseball anonymous drug tests
    - Facebook policies to protect users from application developers
    - Google accounts hacked
    - "think of the children"
- Goal: curator accumulates statistical information, but never stores sensitive data about individuals
- Pan-privacy: the algorithm is private inside and out
  - internal state is privacy-preserving
41 Randomized Response [Warner 1965]
- Method for polling stigmatizing questions
- Idea: participants lie with known probability
  - Specific answers are deniable
  - Aggregate results are still valid
- Data never stored "in the clear"; popular in DB literature [MiSa06]
- Strong guarantee: no trust in curator
- Makes sense when each user's data appears only once, otherwise limited utility
- New idea: curator aggregates statistical information, but never stores sensitive data about individuals
[Figure: User Data (1, 0, 1) + noise → User Response]
42 Aggregation Without Storing Sensitive Data?
- Streaming algorithms: small storage
  - Information stored can still be sensitive
  - "My data": many appearances, arbitrarily interleaved with those of others
- Pan-Private Algorithm
  - Private "inside and out"
  - Even the internal state completely hides the appearance pattern of any individual: presence, absence, frequency, etc. ("user level")
43 Pan-Privacy Model
- Data is a stream of items; each item belongs to a user
- Data of different users is interleaved arbitrarily
- Curator sees items, updates internal state, outputs at stream end
- Pan-Privacy: for every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private
- Can also consider multiple intrusions
44 Adjacency: User Level
- Universe U of users whose data is in the stream; $x \in U$
- Streams are x-adjacent if they have the same projections of users onto $U \setminus \{x\}$
  - Example: axbxcxdxxxex and abcdxe are x-adjacent
    - Both project to abcde
    - Notion of "corresponding locations" in x-adjacent streams
- U-adjacent: $\exists x \in U$ for which they are x-adjacent
  - Simply "adjacent", if U is understood
- Note: streams of different lengths can be adjacent
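The user-level adjacency definition can be checked mechanically. In this small sketch each stream item is treated as the identifier of the user it belongs to (an assumption for illustration), and the function names are made up for the example.

```python
def project_out(stream, user):
    """Projection of the stream onto all users other than `user`."""
    return [item for item in stream if item != user]

def are_x_adjacent(s1, s2, user):
    """Two streams are x-adjacent if they agree once all items of user x are removed."""
    return project_out(s1, user) == project_out(s2, user)

def are_adjacent(s1, s2, universe):
    """U-adjacent: there exists some x in U for which the streams are x-adjacent."""
    return any(are_x_adjacent(s1, s2, x) for x in universe)

# The example from the slide: both streams project to abcde.
s1, s2 = "axbxcxdxxxex", "abcdxe"
x_adjacent = are_x_adjacent(s1, s2, "x")
```

Note that `s1` and `s2` have different lengths: user-level adjacency lets all of one user's appearances change at once, unlike event-level adjacency, where a single element changes.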
45 Example: Stream Density or # Distinct Elements
- Universe U of users; estimate how many distinct users in U appear in the data stream
- Application: # distinct users who searched for "flu"
- Ideas that don't work:
  - Naïve: keep list of users that appeared (bad privacy and space)
  - Streaming:
    - Track random sub-sample of users (bad privacy)
    - Hash each user, track minimal hash (bad privacy)
46 Pan-Private Density Estimator
- Inspired by randomized response: store for each user $x \in U$ a single bit $b_x$
  - Initially: every $b_x$ is 0 w.p. ½ and 1 w.p. ½ (distribution $D_0$)
  - When encountering x: redraw $b_x$ as 0 w.p. ½ − ε and 1 w.p. ½ + ε (distribution $D_1$)
  - Final output: [(fraction of 1s in table − ½)/ε] + noise
- Pan-privacy:
  - If the user never appeared: entry drawn from $D_0$
  - If the user appeared any # of times: entry drawn from $D_1$
  - $D_0$ and $D_1$ are 4ε-differentially private
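The density estimator can be sketched as follows. This is a hedged illustration: the function names and the synthetic data are assumptions, and in particular the scale of the final output noise is a placeholder chosen for the example, since the slide only says "+ noise" without specifying it.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def pan_private_density(stream, universe, epsilon, rng):
    """Estimate the fraction of users in `universe` that appear in `stream`."""
    # Initially every bit is drawn from D0: 1 with probability 1/2.
    bits = {x: (1 if rng.random() < 0.5 else 0) for x in universe}
    for x in stream:
        # On every appearance the bit is redrawn from D1: 1 with probability 1/2 + epsilon.
        bits[x] = 1 if rng.random() < 0.5 + epsilon else 0
    fraction_of_ones = sum(bits.values()) / len(universe)
    # Placeholder output noise; its scale is an assumption, not taken from the lecture.
    noise = laplace(1.0 / (epsilon * len(universe)), rng)
    return (fraction_of_ones - 0.5) / epsilon + noise

rng = random.Random(4)
universe = range(20_000)
stream = list(range(5_000)) * 2      # users 0..4999 appear twice each: true density is 0.25
estimate = pan_private_density(stream, universe, epsilon=0.25, rng=rng)
```

The table never records how often a user appeared: after any number of appearances the bit is simply a fresh draw from D1, so the stored state for a user follows one of only two nearby distributions.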
47 Pan-Private Density Estimator
- Inspired by randomized response: store for each user $x \in U$ a single bit $b_x$
  - Initially: every $b_x$ is 0 w.p. ½ and 1 w.p. ½
  - When encountering x: redraw $b_x$ as 0 w.p. ½ − ε and 1 w.p. ½ + ε
  - Final output: [(fraction of 1s in table − ½)/ε] + noise
- Improved accuracy and storage:
  - Multiplicative accuracy using hashing
  - Small storage using sub-sampling
48 Pan-Private Density Estimator
- Theorem [density estimation streaming algorithm]: $\varepsilon$ pan-privacy, multiplicative error $\alpha$, space is poly$(1/\alpha, 1/\varepsilon)$
49 Density Estimation with Multiple Intrusions
- If intrusions are announced, can handle multiple intrusions: accuracy degrades exponentially in the # of intrusions
- Can we do better?
- Theorem [multiple intrusion lower bounds]: if there are either
  - two unannounced intrusions (for finite-state algorithms), or
  - non-stop intrusions (for any algorithm),
  then additive accuracy cannot be better than $\Omega(n)$
50 What other statistics have pan-private algorithms?
- Density: # of users that appeared at least once
- Incidence counts: # of users appearing k times exactly
- Cropped means: mean, over users, of min(t, # appearances)
- Heavy-hitters: # of users appearing at least k times
51 Counters and Pan Privacy
- Is the counter algorithm pan-private?
  - No: the internal counts accurately reflect what happened since the last update
- Easy to correct: store them together with noise
  - Add $(1/\varepsilon)$-Laplacian noise to all accumulators
    - Both at storage and when added
    - At most doubles the noise
[Figure: accumulator = count + noise]
52 Continual Intrusion
- Consider multiple intrusions
- Most desirable: resistance to continual intrusion
  - Adversary can continually examine the internal state of the algorithm
  - Implies also continual observation
  - Something can be done: randomized response
- But:
  - Theorem: any counter that is $\varepsilon$-pan-private under continual observation and with m intrusions must have additive error $\Omega(\sqrt{m})$ with constant probability
53 Proof of lower bound
- Two distributions:
  - $I_0$: the all-0 stream
  - $I_1$: $x_i = 0$ with probability $1 - 1/(k\sqrt{n})$ and $x_i = 1$ with probability $1/(k\sqrt{n})$
- Let $D_b$ be the distribution on states when running on $I_b$
- Claim: the statistical distance between $D_0$ and $D_1$ is small
- Key point: can represent the transition probabilities as
  - $Q^0_s(x) = \frac{1}{2} C'(x) + \frac{1}{2} C''(x)$
  - $Q^1_s(x) = \left(\frac{1}{2} - \frac{1}{k\sqrt{n}}\right) C'(x) + \left(\frac{1}{2} + \frac{1}{k\sqrt{n}}\right) C''(x)$
- Randomized Response is the best we can do
54 Pan Privacy under Continual Observation
- Definition: for all U-adjacent streams $S$ and $S'$, the joint distribution on the internal state at any single location and the sequence of all outputs is differentially private
[Figure labels: Internal state, Output]
55 A General Transformation
- Transform any static algorithm A to continual output, maintaining:
  - Pan-privacy
  - Storage size
- Hit in accuracy is low for large classes of algorithms
- Main idea: delayed updates. Update the output value only rarely, when there is a large gap between A's current estimate and the output
56 Theorem [General Transformation]
- Transform any algorithm A for monotone function f with error $\alpha$, sensitivity $sens_A$ (max output difference on adjacent streams), maximum value N
- New algorithm:
  - has $\varepsilon$-privacy under continual observation
  - maintains A's pan-privacy and storage
  - Error is $\tilde{O}(\alpha + \sqrt{N} \cdot sens_A / \varepsilon)$
57 General Transformation: Main Idea
[Figure: input a0bcbbde → A → out]
- Assume A is a pan-private estimator for monotone $f \le N$
- If $|A_t - out_{t-1}| > D$ then $out_t \leftarrow A_t$
- For threshold D: w.h.p. update about N/D times
58 General Transformation: Main Idea
[Figure: input a0bcbbde → A → out]
- Assume A is a pan-private estimator for monotone $f \le N$
  - A's output may not be monotonic
- If $|A_t - out_{t-1}| > D$ then $out_t \leftarrow A_t$
  - What about privacy? Update times, update values
- For threshold D: w.h.p. update about N/D times
  - Quit if the # of updates exceeds Bound ≈ N/D
59 General Transformation: Privacy
- If $|A_t - out_{t-1}| > D$ then $out_t \leftarrow A_t$
- What about privacy? Update times, update values
- Add noise:
  - Noisy threshold test ⇒ privacy-preserving update times
  - Noisy update ⇒ privacy-preserving update values
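60 General Transformation: Privacy
- If $|A_t - out_{t-1}| +$ noise $> D$
  - then $out_t \leftarrow A_t +$ noise
- Scale noise(s) to Bound $\cdot\, sens_A / \varepsilon$
- Yields $(\varepsilon, \delta)$-diff. privacy: $\Pr[z \mid S] \le e^{\varepsilon} \Pr[z \mid S'] + \delta$
  - Proof pairs noise vectors that are "far" from causing quitting on $S$ with noise vectors for which $S'$ has the exact same update times
  - Few noise vectors are bad; paired vectors are "ε-private"
- Error: $\tilde{O}[D + (sens_A \cdot N)/(D \varepsilon)]$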
61 Theorem [General Transformation]
- Transform any algorithm A for monotone function f
  - with error $\alpha$, sensitivity $sens_A$, maximum value N
- New algorithm:
  - satisfies $\varepsilon$-privacy with continual observation
  - maintains A's pan-privacy and storage
  - Error is $\tilde{O}(\alpha + \sqrt{N} \cdot sens_A / \varepsilon)$
- Extends from monotone to "stable" functions
- Loose characterization of functions that can be computed privately under continual observation without pan-privacy
62 What other statistics have pan-private algorithms?
- Pan-private streaming algorithms for:
  - Stream density / number of distinct elements
  - t-cropped mean: mean, over users, of min(t, # appearances)
  - Fraction of users appearing k times exactly
  - Fraction of heavy-hitters: users appearing at least k times
63 Incidence Counting
- Universe X of users. Given k, estimate what fraction of users in X appear exactly k times in the data stream
- Difficulty: can't track an individual's # of appearances (user-level privacy!)
- Idea: keep track of the noisy # of appearances
  - However: can't accurately track whether an individual appeared 0, k or 100k times
- Different approach: follows the "count-min" [CM05] idea from the streaming literature
64 Incidence Counting à la "Count-Min"
- Use a pan-private algorithm that gets as input:
  - a hash function $h: Z \to M$ (for small range M)
  - a target val
- It outputs the fraction of users with h(# appearances) = val
- Given this, estimate k-incidence as the fraction of users with h(# appearances) = h(k)
- Concern: might we over-estimate? (hash collisions)
- Accuracy: if h has low collision probability, then with some probability collisions are few and the estimate is accurate
- Repeat to amplify (output the minimal estimate)
65 Putting it together
- Hash by choosing a small random prime p: $h(z) = z \pmod p$
- Pan-private modular incidence counter: gets p and val, estimates the fraction of users with # appearances = val (mod p)
  - space is poly(p), but small p suffices
- Theorem [k-incidence counting streaming algorithm]:
  - $\varepsilon$ pan-privacy, multiplicative error $\alpha$, upper bound N on the number of appearances
  - Space is poly$(1/\alpha, 1/\varepsilon, \log N)$
66 t-Incidence Estimator
- Let $R = \{1, 2, \dots, r\}$ be the smallest range of integers containing at least $4 \log N / \alpha$ distinct prime numbers
- Choose at random L distinct primes $p_1, p_2, \dots, p_L$
- Run a modular incidence counter for each of these L primes
  - When a value $x \in M$ appears: update each of the L modular counters
- For any desired t: for each $i \in [L]$
  - Let $f_i$ be the output of the i-th modular incidence counter for the value t (mod $p_i$)
  - Output the (noisy) minimum of these fractions
67 Pan-Private Modular Incidence Counter
- For every user x, keep a counter $c_x \in \{0, \dots, p-1\}$. Increase the counter (mod p) every time the user appears
  - If initially 0: no privacy, but perfect accuracy
  - If initially random: perfect privacy, but no accuracy
- Initialize using a distribution slightly biased towards 0 (over the range 0 to p−1):
  - $\Pr[c_x = i] \propto e^{-\varepsilon i/(p-1)}$
- Privacy: a user's # of appearances has only a small effect on the distribution of $c_x$
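The biased initialization can be sketched as follows. This is an illustrative sketch: the function names, the choice p = 7 and ε = 0.5, and the toy stream are assumptions. It shows only the state of the counters, not the estimation step.

```python
import math
import random

def initial_distribution(p, epsilon):
    """Probabilities Pr[c_x = i] proportional to exp(-epsilon * i / (p - 1)), for i = 0..p-1."""
    weights = [math.exp(-epsilon * i / (p - 1)) for i in range(p)]
    total = sum(weights)
    return [w / total for w in weights]

def modular_counters(stream, universe, p, epsilon, rng):
    """Keep one counter mod p per user, started from the slightly biased distribution."""
    probs = initial_distribution(p, epsilon)
    counters = {x: rng.choices(range(p), weights=probs)[0] for x in universe}
    for x in stream:
        counters[x] = (counters[x] + 1) % p      # increase (mod p) on every appearance
    return counters

p, epsilon = 7, 0.5
probs = initial_distribution(p, epsilon)
max_ratio = max(probs) / min(probs)              # largest ratio between any two entries

rng = random.Random(5)
counters = modular_counters(stream="abacab", universe="abcd", p=p, epsilon=epsilon, rng=rng)
```

A user's appearances only shift the starting distribution cyclically, and any two entries of that distribution differ by a factor of at most e^ε (the value of `max_ratio`), which is why the stored counter reveals little about how often the user appeared.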
68 Modular Incidence Counter: Accuracy
- For $j \in \{0, \dots, p-1\}$:
  - $o_j$ is the # of users with observed "noisy" count j
  - $t_j$ is the true # of users that truly appear j times (mod p)
  - $o_j \approx \sum_{k=0}^{p-1} t_{j-k \pmod p} \cdot e^{-\varepsilon k/(p-1)}$
- Using the observed $o_j$'s: get p (approx.) equations in p variables (the $t_k$'s); solve using linear programming
- Solution is close to the true counts
69 Pan-private Algorithms
Continual Observation
- Density: # of users that appeared at least once
- Incidence counts: # of users appearing k times exactly
- Cropped means: mean, over users, of min(t, # appearances)
- Heavy-hitters: # of users appearing at least k times
70 The Dynamic Privacy Zoo
[Figure: "petting zoo" diagram relating the privacy notions: Differentially Private Outputs, Privacy under Continual Observation, Pan Privacy, Continual Pan Privacy, User-Level Privacy, and Sketch vs. Stream]