Title: Population Stratification with Limited Data
1. Population Stratification with Limited Data
- By Kamalika Chaudhuri, Eran Halperin, Satish Rao, and Shuheng Zhou
2. The Problem
- Given
  - Samples from two hidden distributions P1 and P2
  - Unknown labels
- Each sample/individual has
  - k features with 0/1 values
  - In population P1, feature f is 1 w.p. p1f
  - In population P2, feature f is 1 w.p. p2f
  - Unknown feature probabilities
3. The Problem
- Given
  - 2n samples from the two hidden distributions P1 and P2
  - Unknown labels
- Goal: classify each individual correctly, for most inputs (a simulation sketch of this setup follows)
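A minimal simulation sketch of this setup, assuming the per-feature gap from the example given later in the talk; the function name and parameter values are mine, for illustration only:

```python
import numpy as np

def sample_population(n=500, k=200, rng=None):
    """Draw 2n unlabeled 0/1 samples: n from P1 and n from P2."""
    rng = np.random.default_rng(rng)
    gap = np.sqrt(np.log(2 * n) / k)            # per-feature gap sqrt(log n / k)
    p1 = np.full(k, 0.5)                        # P1: every feature is 1 w.p. 1/2
    p2 = p1 + gap                               # P2: means shifted by the gap
    X1 = (rng.random((n, k)) < p1).astype(int)  # n samples from P1
    X2 = (rng.random((n, k)) < p2).astype(int)  # n samples from P2
    X = np.vstack([X1, X2])                     # 2n unlabeled samples
    labels = np.array([0] * n + [1] * n)        # hidden truth, not given to the algorithm
    return X, labels
```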
4. Applications
- Preprocessing step in statistical analysis
  - Analyze the factors that cause a complex disease, such as cancer
  - Cluster the samples into populations, then apply statistical analysis
- Collaborative filtering
  - A feature can be "likes Star Wars" or not
  - Cluster users into types using the features
5. The Problem
- Given
  - Samples from two hidden distributions P1 and P2
  - Unknown labels
- Need some separation between the distributions
6. Our Results
- Need some separation between the distributions!
- Measure of separation: distance between means (computed in the sketch below)
  - γ = (L1 distance between means) / k
  - γ₂ = (squared L2 distance between means) / k
- Our results
  - Optimization function and poly-time algorithm: γk = Ω(√(k log n))
  - Optimization function alone: γ₂k = Ω(log n)
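As a concrete reading of these definitions, a hypothetical helper that computes both measures from the mean vectors (which are of course unknown to the algorithm itself):

```python
import numpy as np

def separations(p1, p2):
    """gamma = L1 distance between means / k; gamma2 = squared L2 distance / k."""
    k = len(p1)
    gamma = np.abs(p1 - p2).sum() / k
    gamma2 = ((p1 - p2) ** 2).sum() / k
    return gamma, gamma2
```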
7. Our Results
- This talk
  - Optimization function and poly-time algorithm: γk = Ω(√(k log n))
- Example (worked out below)
  - P1: for each feature f, p1f = ½
  - P2: for each feature f, p2f = ½ + √(log n)/√k
- Information-theoretically optimal
  - There exist two distributions with this separation and constant overlap in probability mass
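A quick check, not on the original slides: summing the example's per-feature gap over all k features shows that both separation bounds are met with equality, which is why this example witnesses optimality.

```latex
\[
  \gamma k \;=\; \sum_{f=1}^{k} \lvert p_{1f} - p_{2f} \rvert
           \;=\; k \cdot \sqrt{\tfrac{\log n}{k}}
           \;=\; \sqrt{k \log n},
  \qquad
  \gamma_2 k \;=\; \sum_{f=1}^{k} (p_{1f} - p_{2f})^2
             \;=\; k \cdot \tfrac{\log n}{k}
             \;=\; \log n .
\]
```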
8. Optimization Function
- What measure should we optimize to get the correct clustering?
- We need a robust measure that works for small separations
9. A Robust Measure
- Find the balanced partition (S, S̄) that maximizes
  - Σf |Nf(S) − Nf(S̄)|
- Nf(S), Nf(S̄) = number of individuals with feature f in S, S̄
10. A Robust Measure
- Find the balanced partition (S, S̄) that maximizes (see the helper below)
  - Σf |Nf(S) − Nf(S̄)|
- Nf(S), Nf(S̄) = number of individuals with feature f in S, S̄
- Theorem: Optimizing this measure yields the correct partition w.h.p. if γk = Ω(√(k log n))
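An illustrative helper (my naming, not the authors' code) that evaluates this measure for a candidate partition, where X is the 2n-by-k 0/1 sample matrix and in_S marks the individuals placed in S:

```python
import numpy as np

def partition_score(X, in_S):
    """Sum over features f of |Nf(S) - Nf(S_bar)| for the partition given by mask in_S."""
    Nf_S = X[in_S].sum(axis=0)       # number of individuals in S with each feature
    Nf_Sbar = X[~in_S].sum(axis=0)   # number of individuals in S-bar with each feature
    return int(np.abs(Nf_S - Nf_Sbar).sum())
```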
11. Proof Sketch
- How does the optimal partition behave? (simulated below)
- (I) For the correct partition P:
  E[f(P)] = γkn + Θ(k√n), and Pr[|f(P) − E[f(P)]| > n√k] ≤ 2⁻ⁿ
- (II) For any fixed balanced partition:
  E[f] = O(k√n), and Pr[|f − E[f]| > n√k] ≤ 2⁻ⁿ
- The partition with the optimal value of f in (I) dominates all the partitions in (II) w.h.p. under the separation conditions
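A small experiment, reusing sample_population and partition_score from the sketches above, to illustrate the domination claim: the planted partition should score far above any random balanced partition.

```python
import numpy as np

X, labels = sample_population(n=500, k=200, rng=0)
true_score = partition_score(X, labels == 0)   # the planted (correct) partition

rng = np.random.default_rng(1)
rand_scores = []
for _ in range(200):
    mask = np.zeros(len(X), dtype=bool)
    mask[rng.permutation(len(X))[: len(X) // 2]] = True   # a random balanced partition
    rand_scores.append(partition_score(X, mask))

print(true_score, max(rand_scores))   # the planted partition should dominate
```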
12. An Algorithm
- How can we find the partition that optimizes this measure?
- Theorem: There exists an algorithm that finds the correct partition when γk = Ω(√(k log² n)); running time O(nk log² n)
13. An Algorithm
- Algorithm
  - Divide the individuals into two sets A and B
  - Start with a random partition of A
  - Iterate log n times:
    - Classify B using the current partition of A and a proximity score
    - Do the same for A
14. An Algorithm
- Iterate:
  - Classify B using the current partition of A and a score
  - And vice versa
- Random partition
  - Has (1/2 + 1/√n)-imbalance
  - Each iteration produces a partition with more imbalance
15. Classification Score
- Our score: for each feature f,
  - If Nf(S) > Nf(S̄): add 1 to the score if f is present, else subtract 1
  - If Nf(S) < Nf(S̄): add 1 to the score if f is absent, else subtract 1
- Classify (see the sketch below):
  - Individuals above the median score go to S
  - Individuals below the median score go to S̄
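A sketch of the full iterative procedure as I read the slides (not the authors' code); for simplicity it reuses all k features in every round, whereas the analysis assumes a fresh batch of features per round.

```python
import numpy as np

def classify_round(X_ref, in_S, X_target):
    """Score X_target against the partition (in_S) of X_ref; split at the median."""
    Nf_S = X_ref[in_S].sum(axis=0)       # feature counts inside S
    Nf_Sbar = X_ref[~in_S].sum(axis=0)   # feature counts inside S-bar
    sign = np.sign(Nf_S - Nf_Sbar)       # +1 where f is S-majority, -1 where S-bar-majority
    # each feature contributes +1 when the individual agrees with the majority side, -1 otherwise
    scores = (sign * (2 * X_target - 1)).sum(axis=1)
    return scores > np.median(scores)    # above-median individuals are assigned to S

def stratify(X_A, X_B, rng=None):
    """Alternate classifying B from A's partition and A from B's partition."""
    rng = np.random.default_rng(rng)
    in_S_A = rng.random(len(X_A)) < 0.5  # random initial partition of A
    for _ in range(int(np.ceil(np.log2(len(X_A))))):   # log n rounds
        in_S_B = classify_round(X_A, in_S_A, X_B)
        in_S_A = classify_round(X_B, in_S_B, X_A)
    return in_S_A, in_S_B
```

With the generator above, `in_S_A, in_S_B = stratify(X[::2], X[1::2])` should recover the hidden split up to relabeling; the interleaved slicing just ensures A and B each contain a mix of both populations.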
16. Classification
- Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance, for ε < c
- Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions
- Θ(log n) rounds are needed to get the correct partition (checked empirically below)
- Use a fresh set of features in each round to get independence
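An empirical look at the imbalance-doubling behavior, reusing the sketches above; the hidden labels are used only to measure each round's imbalance, never to cluster.

```python
import numpy as np

X, labels = sample_population(n=500, k=200, rng=0)
X_A, X_B = X[::2], X[1::2]            # interleave so A and B both mix the populations
lab_A, lab_B = labels[::2], labels[1::2]

rng = np.random.default_rng(1)
in_S_A = rng.random(len(X_A)) < 0.5   # random start: imbalance about 1/2 + 1/sqrt(n)
for r in range(10):
    in_S_B = classify_round(X_A, in_S_A, X_B)
    in_S_A = classify_round(X_B, in_S_B, X_A)
    frac = (in_S_A & (lab_A == 0)).sum() / max(1, in_S_A.sum())
    print(f"round {r}: fraction of P1 inside S = {frac:.3f}")   # drifts from ~0.5 toward 0 or 1
```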
17. Proof Sketch
- Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance, for ε < c
- Gap between the mean scores of the two populations: G = Ω(ε γ₂ k√n)
- Initially G = Ω(log n); within each population the scores X, Y are distributed roughly as Bin(k, ½)
[Figure: two overlapping score distributions for Population 1 and Population 2, with means separated by G]
18. Proof Sketch
- Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance, for ε < c
- G = Ω(ε γ₂ k√n)
- Pr[correct classification] ≥ ½ + aG/√k > ½ + 2ε, from the separation conditions
[Figure: the two score distributions for Population 1 and Population 2, separated by G]
19. Proof Sketch
- Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions
- G = Ω(ε γ₂ k√n)
- All but a 1/poly(n) fraction is correctly classified
[Figure: the two score distributions for Population 1 and Population 2, now well separated]
20. Related Work
- Learning mixtures of Gaussians [D99]
- Best performance by spectral algorithms [VW02, AM05, KSV05]
- Our algorithm
  - Matches the bounds in [VW02] for two clusters
  - Is not a spectral algorithm!
21. Open Questions
- How can our algorithm be extended to work for multiple clusters?
- What is the relationship between our algorithm and spectral algorithms?
  - It matches the spectral algorithms of [M01] for two-way graph partitioning
  - Can our algorithm do better?