Title: CSE182-L17
1CSE182-L17
- Clustering
- Population Genetics Basics
2Unsupervised Clustering
- Given a set of points (in n dimensions) and k, compute the k best clusters.
- In k-means, clustering is done by choosing k centers (means).
- Each point is assigned to the closest center.
- The notion of "best" is defined by distances to the center.
- Question: How can we compute the k best centers?
3Distance
- Given a data point v and a set of points X, define the distance from v to X, $d(v, X)$, as the (Euclidean) distance from v to the closest point from X.
- Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the Squared Error Distortion
- $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
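The two definitions above translate almost directly into code. The following is a minimal sketch, assuming points are tuples of floats; the function names `dist_to_set` and `squared_error_distortion` and the sample data are illustrative, not from the lecture.

```python
import math


def dist_to_set(v, X):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in X)


def squared_error_distortion(V, X):
    """d(V, X): mean of the squared distances d(v_i, X)^2 over all n points."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)


# Hypothetical data: four points, two candidate centers.
V = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
X = [(0.0, 1.0), (4.0, 1.0)]
print(squared_error_distortion(V, X))  # every point is at distance 1 -> 1.0
```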
4K-Means Clustering Problem Formulation
- Input: A set V consisting of n points and a parameter k
- Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
- This problem is NP-hard in general.
51-Means Clustering Problem an Easy Case
- Input: A set V consisting of n points.
- Output: A single point X that minimizes d(V,X) over all possible choices of X.
- This problem is easy: the optimal point is the centroid (mean) of V.
- However, it becomes very difficult for more than one center.
- An efficient heuristic method for k-Means clustering is the Lloyd algorithm.
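For a single center, the squared error distortion is minimized by the centroid (coordinate-wise mean) of V. The sketch below checks this numerically on made-up data; `centroid` and `distortion_single` are illustrative names, not from the lecture.

```python
def centroid(V):
    """Coordinate-wise mean of the points in V."""
    n = len(V)
    return tuple(sum(coords) / n for coords in zip(*V))


def distortion_single(V, x):
    """Squared error distortion d(V, {x}) for a single center x."""
    return sum(sum((a - b) ** 2 for a, b in zip(v, x)) for v in V) / len(V)


# Hypothetical data: three points in the plane.
V = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(V)  # (1.0, 1.0)
best = distortion_single(V, c)

# Nudging the center in any direction should only increase the distortion.
for dx, dy in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    assert distortion_single(V, (c[0] + dx, c[1] + dy)) > best
```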
6K-means Lloyd's algorithm
- Choose k centers at random: $X' = \{x'_1, x'_2, x'_3, \ldots, x'_k\}$
- Repeat
  - $X = X'$
  - Assign each $v \in V$ to the closest cluster j: $d(v, x_j) = d(v, X) \Rightarrow C_j = C_j \cup \{v\}$
  - Recompute X': $x'_j = \left(\sum_{v \in C_j} v\right) / |C_j|$
- until ($X' = X$)
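The loop above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation. Assumptions not stated on the slide: the initial centers are drawn from the data points themselves, an empty cluster keeps its previous center, and `max_iter` is a safety cap.

```python
import math
import random


def lloyd_kmeans(V, k, seed=0, max_iter=100):
    """Lloyd's heuristic: alternate assignment and center recomputation."""
    rng = random.Random(seed)
    new_X = rng.sample(V, k)  # assumption: initial centers are k data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        X = new_X
        # Assign each v in V to the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            j = min(range(k), key=lambda c: math.dist(v, X[c]))
            clusters[j].append(v)
        # Recompute each center as the mean of its cluster.
        new_X = [
            tuple(sum(c) / len(C) for c in zip(*C)) if C else X[j]
            for j, C in enumerate(clusters)
        ]
        if new_X == X:  # until X' = X
            break
    return new_X, clusters


# Hypothetical data: two well-separated groups of three points.
V = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd_kmeans(V, k=2)
print(centers)
print(clusters)
```

Because the result depends on the random start, it is common to run the procedure from several seeds and keep the clustering with the lowest distortion.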
11Conservative K-Means Algorithm
- The Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
- A more conservative method would be to move one point at a time, only if it improves the overall clustering cost.
- The smaller the clustering cost of a partition of data points, the better that clustering is.
- Different methods can be used to measure this clustering cost (for example, in the last algorithm the squared error distortion was used).
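One way to read the conservative method is sketched below: try moving a single point to another cluster and keep the move only if the cost strictly drops. The cost used here is the squared error distortion with each cluster's own mean as its center. The function names, the scan order over points, and the starting partition are assumptions for illustration.

```python
def clustering_cost(V, labels, k):
    """Squared error distortion of a partition, using each cluster's mean."""
    total = 0.0
    for j in range(k):
        members = [v for v, lab in zip(V, labels) if lab == j]
        if not members:
            continue
        mean = [sum(c) / len(members) for c in zip(*members)]
        total += sum(
            sum((a - m) ** 2 for a, m in zip(v, mean)) for v in members
        )
    return total / len(V)


def conservative_kmeans(V, labels, k):
    """Move one point at a time, only when the move lowers the cost."""
    labels = list(labels)
    best = clustering_cost(V, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(V)):
            for j in range(k):
                if j == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = j  # tentative move of point i to cluster j
                cost = clustering_cost(V, labels, k)
                if cost < best:
                    best, improved = cost, True
                else:
                    labels[i] = previous  # undo: no improvement
    return labels, best


# Hypothetical data and a deliberately poor starting partition.
V = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cost = conservative_kmeans(V, [0, 1, 0, 1, 0, 1], k=2)
print(labels, cost)
```

Each accepted move strictly lowers the cost, so the loop terminates, though only at a local optimum.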
12Microarray summary
- Microarrays (like MS) are a technology for probing the dynamic state of the cell.
- We answered questions like the following:
  - Which genes are coordinately regulated (i.e., have similar expression patterns in different conditions)?
  - How can we reduce the dimensionality of the system?
  - Using gene expression values from a sample, can you predict if the sample is normal (state A) or diseased (state B)?
- The techniques employed for classification, clustering, etc. are general and can be employed in a number of contexts.
13Microarray non-summary
- We did not cover:
  - How are the gene expression values measured (the technology)? (CSE183)
  - How do you control variability across different experiments (normalization)? (CSE183)
  - What controls the expression of a gene (gene regulation), or a set of genes? (CSE181)
14Population Genetics
- The sequence of an individual does not say anything about the diversity of a population.
- Small individual genetic differences can have a profound impact on phenotypes:
  - Response to drugs
  - Susceptibility to diseases
- Soon, we will have sequences of many individuals
from the same species. Studying the differences
will be a major challenge.
15Population Structure
- 377 locations (loci) were sampled in 1000 people from 52 populations.
- 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al., Science 2003)
(Figure labels: Oceania, Eurasia, East Asia, America, Africa)
16Population Genetics
- What is it about our genetic makeup that makes us measurably different?
- These genetic differences are correlated with phenotypic differences.
- With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals.
- Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it.
17What causes variation in a population?
- Mutations (may lead to SNPs)
- Recombinations
- Other genetic events (e.g., microsatellite repeats)
- Deletions, inversions
18Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
19Short Tandem Repeats
Sequence                   Repeats
GCTAGATCATCATCATCATTGCTAG  4
GCTAGATCATCATCATTGCTAGTTA  3
GCTAGATCATCATCATCATCATTGC  5
GCTAGATCATCATCATTGCTAGTTA  3
GCTAGATCATCATCATTGCTAGTTA  3
GCTAGATCATCATCATCATCATTGC  5
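The repeat counts for the six sequences on this slide can be recovered by finding the longest run of back-to-back copies of the repeat unit. A minimal sketch follows. Taking the unit to be `ATC` is an assumption (`TCA` or `CAT` would give the same counts here), and `max_tandem_repeats` is an illustrative name.

```python
import re


def max_tandem_repeats(seq, unit="ATC"):
    """Length, in copies, of the longest run of consecutive `unit` in seq."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# The six sequences from the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
print([max_tandem_repeats(s) for s in sequences])  # [4, 3, 5, 3, 3, 5]
```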
20STR can be used as a DNA fingerprint
- Consider a collection of regions with variable length repeats.
- Variable length repeats will lead to variable length DNA.
- The vector of lengths is a fingerprint.
(Figure: matrix of repeat lengths indexed by individuals and positions, with entries 4 2 3 3 5 1 3 2 3 1 5 3)
21Recombination
00000000
11111111
00011111
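The third sequence on this slide is what a single crossover between the first two would produce: a prefix of one parent followed by the suffix of the other. A minimal sketch, assuming one breakpoint and equal-length parents (`recombine` and its parameter names are illustrative):

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: prefix of parent_a, then suffix of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


print(recombine("00000000", "11111111", 3))  # 00011111
```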
22What if there were no recombinations?
- Life would be simpler
- Each sequence would have a single parent
- The relationship is expressed as a tree.
23The Infinite Sites Assumption
(Figure: a tree. The root 0 0 0 0 0 0 0 0 mutates at site 3 to give 0 0 1 0 0 0 0 0, which has two children: 0 0 1 0 1 0 0 0 via a mutation at site 5, and 0 0 1 0 0 0 0 1 via a mutation at site 8.)
- The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
- Some phenotypes could be linked to the polymorphisms.
- Some of the linkage is destroyed by recombination.
24Infinite sites assumption and Perfect Phylogeny
- Each site is mutated at most once in the history.
- All descendants must carry the mutated value, and
all others must carry the ancestral value
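(Figure: a tree with mutation i on one edge. The subtree below that edge has 1 in position i; the rest of the tree has 0 in position i.)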
25Perfect Phylogeny
- Assume an evolutionary model in which no recombination takes place, only mutation.
- The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
- How can one reconstruct such a tree?
26The 4-gamete condition
- A column i partitions the set of species into two sets, i0 and i1.
- A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
- Ex: i is heterogeneous w.r.t. {A, D, E}
Species  A B C D E F
i        0 0 0 1 1 1
i0 = {A, B, C}, i1 = {D, E, F}
274 Gamete Condition
- 4 Gamete Condition
- There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), j is not heterogeneous w.r.t. i0 or not heterogeneous w.r.t. i1.
- Equivalent to:
- There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), the following 4 rows do not all occur: (0,0), (0,1), (1,0), (1,1)
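The second form of the condition is straightforward to test: for every pair of columns, collect the distinct (row[i], row[j]) pairs and flag the pair of columns if all four gametes appear. The sketch below assumes a 0/1 matrix given as a list of rows; `four_gamete_violations` is an illustrative name, and the first matrix is the five-species example used later in the lecture.

```python
from itertools import combinations


def four_gamete_violations(matrix):
    """Return the column pairs (i, j) in which all four gametes occur."""
    n_cols = len(matrix[0])
    bad = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:  # (0,0), (0,1), (1,0), (1,1) all present
            bad.append((i, j))
    return bad


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
print(four_gamete_violations(example))  # []

conflict = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(four_gamete_violations(conflict))  # [(0, 1)]
```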
284-gamete condition proof
- Depending on the edge on which mutation j occurs, either i0 or i1 should be homogeneous.
- (only if) Every perfect phylogeny satisfies the 4-gamete condition.
- (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
29An algorithm for constructing a perfect phylogeny
- We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction will be lifted later.
- In any tree, each node (except the root) has a single parent.
- It is sufficient to construct a parent for every node.
- In each step, we add a column and refine some of the nodes containing multiple children.
- Stop when all columns have been considered.
30Inclusion Property
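- For any pair of columns i, j:
- $i < j$ if and only if $i_1 \supseteq j_1$ (every species with a 1 in column j also has a 1 in column i)
- Note that if $i < j$ then the edge containing i is an ancestor of the edge containing j
(Figure: edge i drawn above edge j on a path from the root.)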
31Example
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent.
32Sort columns
- Sort columns according to the inclusion property (note that the columns are already sorted here).
- This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order.
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
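The sorting step can be sketched as follows: read each column top to bottom as a binary number and order the columns by decreasing value. The helper name `sort_columns` is illustrative, and column indices in the code are 0-based while the slide numbers columns from 1.

```python
def sort_columns(matrix):
    """Order columns by decreasing binary value (row 1 = most significant bit)."""

    def column_value(j):
        bits = "".join(str(row[j]) for row in matrix)
        return int(bits, 2)

    order = sorted(range(len(matrix[0])), key=column_value, reverse=True)
    return order, [column_value(j) for j in order]


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
order, values = sort_columns(example)
print(order)   # [0, 1, 2, 3, 4]: already sorted
print(values)  # [21, 20, 10, 4, 2]
```

A column whose set of 1s strictly contains another column's set always has the larger binary value, which is why this ordering respects the inclusion property.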
33Add first column
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
- In adding column i:
  - Check each edge and decide which side you belong to.
  - Finally, add a node if you can resolve a clade.
(Figure: after adding column 1, the root r has children B, D, and a new node u; u has children A, C, and E.)
34Adding other columns
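  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
- Add other columns on edges using the ordering property
(Figure: the final tree rooted at r. Edge 1 leads to a node with leaf E and, below edge 2, a node with leaf A and, below edge 4, leaf C. Edge 3 leads to a node with leaf B and, below edge 5, leaf D.)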
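The rooted construction can be sketched as a parent map, following the idea that it suffices to find a parent for every node. This is one possible reading of the procedure, not necessarily the lecture's exact algorithm. Columns are processed in decreasing binary order, each column's edge hangs below the smallest earlier column whose set of 1s contains its own, and each species hangs below the deepest column in which it has a 1. The node names `r` and `c1`..`c5` and the function name are illustrative.

```python
def build_perfect_phylogeny(matrix, species):
    """Return a child -> parent map for the rooted case (0 is ancestral)."""
    n_cols = len(matrix[0])
    # Species carrying the mutation (a 1) in each column.
    carriers = {
        j: frozenset(s for s, row in zip(species, matrix) if row[j] == 1)
        for j in range(n_cols)
    }
    # Decreasing binary order, most significant bit in the first row.
    order = sorted(
        range(n_cols), key=lambda j: [row[j] for row in matrix], reverse=True
    )
    parent = {}
    placed = []
    for j in order:
        node = f"c{j + 1}"
        parent[node] = "r"
        for i in placed:
            if carriers[i] >= carriers[j]:
                parent[node] = f"c{i + 1}"  # later supersets are smaller
            elif carriers[i] & carriers[j]:
                raise ValueError(f"columns {i + 1} and {j + 1} conflict")
        placed.append(j)
    for s in species:
        parent[s] = "r"
        for j in order:
            if s in carriers[j]:
                parent[s] = f"c{j + 1}"  # deepest column containing s
    return parent


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = build_perfect_phylogeny(example, ["A", "B", "C", "D", "E"])
print(tree)
```

On the example matrix this yields c2 below c1, c4 below c2, c5 below c3, with A under c2, B under c3, C under c4, D under c5, and E under c1.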
35Unrooted case
- Switch the values in each column, so that 0 is the majority element.
- Apply the algorithm for the rooted case.
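The relabeling step can be sketched as below: any column in which 1 is the majority value is complemented, so that 0 becomes the majority everywhere. Leaving tied columns unchanged is an assumption, and `flip_to_majority_zero` is an illustrative name.

```python
def flip_to_majority_zero(matrix):
    """Complement every column whose majority value is 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # A column is flipped when strictly more than half of its entries are 1.
    flip = [sum(row[j] for row in matrix) * 2 > n_rows for j in range(n_cols)]
    flipped = [
        [1 - x if flip[j] else x for j, x in enumerate(row)] for row in matrix
    ]
    return flipped, flip


# Hypothetical 3-species, 2-site matrix; column 1 has a majority of 1s.
flipped, flip = flip_to_majority_zero([[1, 0], [1, 0], [0, 1]])
print(flipped)  # [[0, 0], [0, 0], [1, 1]]
print(flip)     # [True, False]
```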
36Handling recombination
- A tree is not sufficient, as a sequence may have 2 parents.
- Recombination leads to loss of correlation between columns.
37Linkage (Dis)-equilibrium (LD)
- Consider sites A, B
- Case 1: No recombination
  - $\Pr[A,B = (0,1)] = 0.25$
  - Linkage disequilibrium
- Case 2: Extensive recombination
  - $\Pr[A,B = (0,1)] = 0.125$
  - Linkage equilibrium
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
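The two probabilities on this slide can be reproduced from the haplotype table. The sketch below assumes the sixteen values are read as eight (A, B) rows. Under that reading, the observed frequency of (0,1) is 0.25, while the product of the marginals Pr[A=0] Pr[B=1], which is what full independence (linkage equilibrium) would give, is 0.125. The function name `ld_summary` is illustrative.

```python
from collections import Counter


def ld_summary(haplotypes, a=0, b=1):
    """Return (observed Pr[A=a,B=b], Pr[A=a]*Pr[B=b], their difference D)."""
    n = len(haplotypes)
    joint = Counter(haplotypes)[(a, b)] / n
    p_a = sum(1 for x, _ in haplotypes if x == a) / n
    p_b = sum(1 for _, y in haplotypes if y == b) / n
    expected = p_a * p_b  # value under linkage equilibrium
    return joint, expected, joint - expected


# Assumed row-wise reading of the slide's table: (A, B) per individual.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
print(ld_summary(haplotypes))  # (0.25, 0.125, 0.125)
```

A difference D of zero corresponds to linkage equilibrium; the nonzero value here reflects the correlation between the two sites.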