1
Population Stratification with Limited Data
  • By Kamalika Chaudhuri, Eran Halperin, Satish Rao, and
    Shuheng Zhou

2
The Problem
  • Given:
  • Samples from two hidden distributions P1 and P2
  • Unknown labels
  • Each sample/individual has k features with 0/1 values
  • In population P1, feature f is 1 w.p. p1f
  • In population P2, feature f is 1 w.p. p2f
  • The feature probabilities are unknown

3
The Problem
  • Given:
  • 2n samples from two hidden distributions P1 and
    P2
  • Unknown labels
  • Goal: Classify each individual correctly, for most
    inputs (a minimal sketch of this generative model follows)
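To make the setup concrete, here is a minimal sketch (not from the paper; the sizes and feature probabilities are arbitrary demo choices) that draws 2n unlabeled samples from two such feature distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 200                      # n individuals per population, k 0/1 features

# Feature probabilities, unknown to the algorithm (demo values only):
p1 = np.full(k, 0.5)                             # P1: every feature is a fair coin
p2 = np.full(k, 0.5 + np.sqrt(np.log(n) / k))    # P2: each feature slightly biased

# One row per individual; entry (i, f) is 1 w.p. the population's p_f.
X = np.vstack([(rng.random((n, k)) < p1).astype(int),
               (rng.random((n, k)) < p2).astype(int)])

labels = np.repeat([0, 1], n)        # hidden ground truth
perm = rng.permutation(2 * n)        # the algorithm sees X shuffled and unlabeled
X, labels = X[perm], labels[perm]
```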

4
Applications
  • Preprocessing step in statistical analysis
  • Analyze the factors that cause a complex disease,
    such as cancer
  • Cluster the samples into populations, then apply
    statistical analysis
  • Collaborative Filtering
  • A feature can be "likes Star Wars" or not
  • Cluster users into types using the features

5
The Problem
  • Given:
  • Samples from two hidden distributions P1 and P2
  • Unknown labels

Need Some Separation Between the Distributions
6
Our Results
  • Need some separation between the distributions!
  • Measure of separation: distance between means
  • Δ = (L1 distance between means) / k
  • Δ₂² = (L2² distance between means) / k
  • Our Results:
  • Optimization function and poly-time algorithm:
    Δk = Ω(√(k log n))
  • Optimization function: Δ₂²k = Ω(log n)

7
Our Results
  • This talk:
  • Optimization function and poly-time algorithm:
    Δk = Ω(√(k log n))
  • Example:
  • P1: For each feature f, p1f = ½
  • P2: For each feature f, p2f = ½ + √(log n / k)
  • Information-theoretically optimal:
  • There exist two distributions with this
    separation and constant overlap in probability
    mass
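(Checking the example against the bound: each feature differs by |p1f − p2f| = √(log n / k), so Δ = (1/k) Σf |p1f − p2f| = √(log n / k), and Δk = √(k log n), i.e., the example sits exactly at the Ω(√(k log n)) threshold.)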

8
Optimization Function
  • What measure to optimize to get the correct
    clustering?
  • Need a robust measure which works for small
    separations

9
A Robust Measure
  • Find the best balanced partition (S, S̄) such
    that
  • Σf |Nf(S) − Nf(S̄)|
  • is maximum
  • Nf(S), Nf(S̄) = # of individuals with feature f
    in S, S̄

10
A Robust Measure
  • Find the best balanced partition (S, S̄) such
    that
  • Σf |Nf(S) − Nf(S̄)|
  • is maximum
  • Nf(S), Nf(S̄) = # of individuals with feature f
    in S, S̄

Theorem: Optimizing this measure provides the
correct partition w.h.p. if
Δk = Ω(√(k log n))
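As an illustration (the function and variable names are my own, not the paper's), the measure is straightforward to evaluate for any candidate balanced partition:

```python
import numpy as np

def partition_score(X, in_S):
    """Measure from the slides: sum over features f of |N_f(S) - N_f(S-bar)|,
    where N_f counts the individuals in the set whose feature f equals 1.

    X    : (2n, k) 0/1 matrix, one row per individual
    in_S : boolean mask of length 2n, True for individuals placed in S
    """
    Nf_S = X[in_S].sum(axis=0)        # per-feature counts inside S
    Nf_Sbar = X[~in_S].sum(axis=0)    # per-feature counts inside S-bar
    return int(np.abs(Nf_S - Nf_Sbar).sum())
```

The theorem says the maximizer over balanced partitions is the correct one w.h.p.; the algorithm on the following slides finds it without enumerating partitions.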
11
Proof Sketch
  • How does the optimal partition behave?

(I) E[f(P)] ≥ Δkn − O(k√n); Pr[|f(P) − E[f(P)]| > n√k] ≤ 2^(−n)
(II) For any fixed balanced partition, E[f] = O(k√n); Pr[|f − E[f]| > n√k] ≤ 2^(−n)
The partition with the optimal value of f in (I)
dominates all the partitions in (II) w.h.p. under
the separation conditions
12
An Algorithm
  • How can we find the partition which optimizes
    this measure?

Theorem: There exists an algorithm which finds
the correct partition when Δk = Ω(√(k log² n)).
Running time: O(nk log² n)
13
An Algorithm
  • Algorithm:
  • Divide the individuals into two sets, A and B
  • Start with a random partition of A
  • Iterate log n times:
  • Classify B using the current partition of A and a
    proximity score
  • And the same for A

14
An Algorithm
  • Iterate:
  • Classify B using the current partition of A and a
    score
  • And vice versa
  • Random partition:
  • (1/2 + 1/√n)-imbalance
  • Each iteration produces a partition with more
    imbalance

15
Classification Score
  • Our Score: For each feature f,
  • If Nf(S) > Nf(S̄):
  • add 1 to the score if f is present,
    else subtract 1
  • If Nf(S) < Nf(S̄):
  • add 1 to the score if f is absent,
    else subtract 1
  • Classify:
  • Individuals above the median score → S
  • Individuals below the median score → S̄
  • (A runnable sketch of the full loop, using this score, follows.)
16
Classification
  • Lemma: If the current partition has (1/2 +
    γ)-imbalance, the next iteration produces a
    partition with (1/2 + 2γ)-imbalance, for γ < c
  • Lemma: If the current partition has (1/2 +
    c)-imbalance, the next iteration produces the
    correct partition under our separation conditions
  • O(log n) rounds needed to get the correct
    partition
  • Use a fresh set of features in each round to get
    independence
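(Why O(log n) rounds: the random start has imbalance γ₀ ≈ 1/√n, and by the first lemma each round at least doubles it, so γt ≈ 2^t/√n reaches the constant c after t = log₂(c√n) = O(log n) rounds; splitting the features into fresh blocks, one per round, is what keeps each round's score independent of the current partition.)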

17
Proof Sketch
  • Lemma: If the current partition has (1/2 +
    γ)-imbalance, the next iteration produces a
    partition with (1/2 + 2γ)-imbalance, for γ < c

G = Ω(γ Δ² k √n)
Initially, G = Ω(log n); the two score distributions are X, Y ~ Bin(k, ½), shifted apart by G
[Figure: score distributions of Population 1 and Population 2, separated by the gap G]
18
Proof Sketch
  • Lemma: If the current partition has (1/2 +
    γ)-imbalance, the next iteration produces a
    partition with (1/2 + 2γ)-imbalance, for γ < c

G = Ω(γ Δ² k √n)
Pr[correct classification] ≥ ½ + Ω(G/√k) > ½ + 2γ, from the separation conditions
[Figure: score distributions of Population 1 and Population 2, separated by the gap G]
19
Proof Sketch
  • Lemma: If the current partition has (1/2 +
    c)-imbalance, the next iteration produces the
    correct partition under our separation conditions

G = Ω(γ Δ² k √n)
All but a 1/poly(n) fraction are correctly
classified
[Figure: score distributions of Population 1 and Population 2, separated by the gap G]
20
Related Work
  • Learning Mixtures of Gaussians [D99]
  • Best performance by spectral algorithms [VW02,
    AM05, KSV05]
  • Our algorithm:
  • Matches the bounds in [VW02] for two clusters
  • Not a spectral algorithm!

21
Open Questions
  • How to extend our algorithm to work for multiple
    clusters?
  • What is the relationship between our algorithm
    and spectral algorithms?
  • Matches the spectral algorithm of [M01] for two-way
    graph partitioning
  • Can our algorithm do better?

22
  • Thank You!