Title: CSE182-L18
1CSE182-L18
2Perfect Phylogeny
- Assume an evolutionary model in which no
recombination takes place, only mutation. - The evolutionary history is explained by a tree
in which every mutation is on an edge of the
tree. All the species in one sub-tree contain a
0, and all species in the other contain a 1. Such
a tree is called a perfect phylogeny. - How can one reconstruct such a tree?
3The 4-gamete condition
- A column i partitions the set of species into two
sets i0, and i1 - A column is homogeneous w.r.t a set of species,
if it has the same value for all species.
Otherwise, it is heterogenous. - EX i is heterogenous w.r.t A,D,E
i A 0 B 0 C 0 D 1 E 1 F 1
i0
i1
44 Gamete Condition
- 4 Gamete Condition
- There exists a perfect phylogeny if and only if
for all pair of columns (i,j), either j is not
heterogenous w.r.t i0, or i1. - Equivalent to
- There exists a perfect phylogeny if and only if
for all pairs of columns (i,j), the following 4
rows do not exist - (0,0), (0,1), (1,0), (1,1)
54-gamete condition proof
- Depending on which edge the mutation j occurs,
either i0, or i1 should be homogenous. - (only if) Every perfect phylogeny satisfies the
4-gamete condition - (if) If the 4-gamete condition is satisfied, does
a prefect phylogeny exist?
6An algorithm for constructing a perfect phylogeny
- We will consider the case where 0 is the
ancestral state, and 1 is the mutated state. This
will be fixed later. - In any tree, each node (except the root) has a
single parent. - It is sufficient to construct a parent for every
node. - In each step, we add a column and refine some of
the nodes containing multiple children. - Stop if all columns have been considered.
7Inclusion Property
- For any pair of columns i,j
- i lt j if and only if i1 ? j1
- Note that if iltj then the edge containing i is an
ancestor of the edge containing i
i
j
8Example
1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D
0 0 1 0 1 E 1 0 0 0 0
Initially, there is a single clade r, and each
node has r as its parent
9Sort columns
- Sort columns according to the inclusion property
(note that the columns are already sorted here). - This can be achieved by considering the columns
as binary representations of numbers (most
significant bit in row 1) and sorting in
decreasing order
1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1
0 D 0 0 1 0 1 E 1 0 0 0 0
10Add first column
1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1
0 D 0 0 1 0 1 E 1 0 0 0 0
- In adding column i
- Check each edge and decide which side you belong.
- Finally add a node if you can resolve a clade
r
u
B
D
A
C
E
11Adding other columns
1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D
0 0 1 0 1 E 1 0 0 0 0
- Add other columns on edges using the ordering
property
r
1
3
E
B
2
5
4
D
A
C
12Unrooted case
- Switch the values in each column, so that 0 is
the majority element. - Apply the algorithm for the rooted case
13Summary No recombination leads to correlation
between sites
A
B
0 0 0 0 0 0 0 0
3
0 0 1 0 0 0 0 0
5
8
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1
- The different sites are linked. A 1 in position
8 implies 0 in position 5, and vice versa. - The history of a population can be expressed as a
tree. - The tree can be constructed efficiently
14Recombination
- A tree is not sufficient as a sequence may have 2
parents - Recombination leads to violation of 4 gamete
property. - Recombination leads to loss of correlation
between columns
00000000 11111111 00011111
15Studying recombination
- A tree is not sufficient as a sequence may have 2
parents - Recombination leads to loss of correlation
between columns - How can we measure recombination?
16Linkage (Dis)-equilibrium (LD)
A B 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0
A B 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0
- Extensive Recombination
- PrA,B(0,1)0.125
- Linkage equilibrium
- No recombination
- PrA,B0,1 0.25
- Linkage disequilibrium
17Measuring LD
- Consider two bi-allelic sites A and B, with
values 0 and 1. - Let p1 probabilityindividual has allele 1 in
site A - q1 probabilityindividual has allele 1 in site
B - P11 Prob individual has allele 1 in site A,
and B - Linkage Disequilibrium, D P11-p1q1
P01-p0q1 . - If D0, sites are uncorrelated, (are in linkage
equilibrium) - If D gtgt0, sites are highly correlated (have
high LD) - Other measures exist, but they all measure
similar quantities.
18LD can be used to map disease genes
LD
0 1 1 0 0 1
D N N D D N
- LD decays with distance from the disease allele.
- By plotting LD, one can short list the region
containing the disease gene.
19Population sub-structure can cause problems in
disease gene mapping
20Population sub-structure can increase LD
- Consider two populations that were isolated and
evolving independently. - They might have different allele frequencies in
some regions. - Pick two regions that are far apart (LD is very
low, close to 0)
21Recent ad-mixing of population
- If the populations came together recently (Ex
African and European population), artificial LD
might be created. - D 0.15 (instead of 0.01), increases 10-fold
- This spurious LD might lead one false
associations - Other genetic events can cause LD to arise, and
one needs to be careful
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1
Pop. AB
p10.5 q10.5 P110.1 D0.1-0.250.15
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0
22Determining population sub-structure
- Given a mix of people, can you sub-divide them
into ethnic populations. - Turn the problem of spurious LD into a clue.
- Find markers that are too far apart to show LD
- If they do show LD (correlation), that shows the
existence of multiple populations. - Sub-divide them into populations so that LD
disappears.
23Determining Population sub-structure
- Same example as before
- The two markers are too similar to show any LD,
yet they do show LD. - However, if you split them so that all 0..1 are
in one population and all 1..0 are in another, LD
disappears
24Iterative Algorithm for Population Substructure
- Assume that there are 2 sub-populations
- Randomly partition the individuals into two.
- Select an individual, and compute the
probabilities Pr(xA), and Pr (xB) - Assign the individual to A with probability
- Pr(xA)/ (Pr(xA)Pr(xB))
- Continue.
25Iterative algorithm for population sub-structure
- Define
- N number of individuals (each has a single
chromosome) - k number of sub-populations.
- Z ? 1..kN is a vector giving the
sub-population. - Zik gt individual i is assigned to population
k - Xi,j allelic value for individual i in position
j - Pk,j,l frequency of allele l at position j in
population k
26Example
- Ex consider the following assignment
- P1,1,0 0.9
- P2,1,0 0.1
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
27Goal
- X is known.
- P, Z are unknown.
- The goal is to estimate Pr(P,ZX)
- Various learning techniques can be employed.
- Here a Bayesian (MCMC) scheme is employed. We
will only consider a simplified version
28AlgorithmStructure
- Iteratively estimate
- (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
- After convergence, Z(m) is the answer.
- Iteration
- Guess Z(0)
- For m 1,2,..
- Sample P(m) from Pr(P X, Z(m-1))
- Sample Z(m) from Pr(P X, P(m-1))
- How is this sampling done?
29Example
- Choose Z at random, so each individual is
assigned to be in one of 2 populations. See
example. - Now, we need to sample P(1) from Pr(P X, Z(0))
- Simply count
- Nk,j,l number of people in pouplation k which
have allele l in position j - pk,j,l Nk,j,l / N
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
30Example
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
- Nk,j,l number of people in population k which
have allele l in position j - pk,j,l Nk,j,l / N1,1,
- N1,1,0 4
- N1,1,1 6
- p1,1,0 4/10
- p1,2,0 4/10
- Thus, we can sample P(m)
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
31Sampling Z
- PrZ1 1 Pr01 belongs to population 1?
- We know that each position should be in linkage
equilibrium and independent. - Pr01 Population 1 p1,1,0 p1,2,1
(4/10)(6/10)(0.24) - Pr01 Population 2 p2,1,0 p2,2,1
(6/10)(4/10)0.24 - Pr Z1 1 0.24/(0.240.24) 0.5
32Sampling
- Suppose, during the iteration, there is a bias.
- Then, in the next step of sampling Z, we will do
the right thing - Pr01 pop. 1 p1,1,0 p1,2,1 0.70.7
0.49 - Pr01 pop. 2 p2,1,0 p2,2,1 0.30.3
0.09 - PrZ1 1 0.49/(0.490.09) 0.85
- PrZ6 1 0.49/(0.490.09) 0.85
- Eventually all 01 will become 1 population, and
all 10 will become a second population
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
33Population Structure
- 377 locations (loci) were sampled in 1000 people
from 52 populations. - 6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al. Science 2003)
Oceania
Eurasia
East Asia
America
Africa
34Other topics
Assembly
Genomic Analysis/ Pop. Genetics
Protein Sequence Analysis
Sequence Analysis
Gene Finding
ncRNA
35ncRNA gene finding
- Gene is transcribed but not translated.
- What are the clues to non-coding genes?
- Look for signals selecting start of transcription
and translation. Non coding genes are transcribed
by Pol III - Non-coding genes have structure. Look for genomic
sequences that fold into an RNA structure - Structure Given a sequence, what is the
structure into which it can fold with minimum
energy?
36tRNA structure
37RNA structure Basics
- Key RNA is single-stranded. Think of a string
over 4 letters, AC,G, and U. - The complementary bases form pairs.
- Base-pairing defines a secondary structure. The
base-pairing is usually non-crossing.
38RNA structure pseudoknots
Sometimes, unpaired bases in loops form
crossing pairs. These are pseudoknots
39A Static picture of the cell is insufficient
- Each Cell is continuously active,
- Genes are being transcribed into RNA
- RNA is translated into proteins
- Proteins are PT modified and transported
- Proteins perform various cellular functions
- Can we probe the Cell dynamically
Pathways
Gene Regulation
Proteomic profiling
Transcript profiling
40Gene expression
- The expression of transcripts and protein in the
cell is not static. It changes in response to
signals. - The expression can be measured using
micro-arrays. - What causes the change in expression?
41Transcriptional machinery
- DNA polymerase (II) scans the genome, initiating
transcription, and terminating it. - The same machinery is used for every gene, so
while Pol II is required, it is not sufficient
to confer specificity
42TF binding
Transcription factors
- Other transcription factors interact with the
core machinery and upstream DNA to provide
specificity. - TFs bind to TF binding sites which are clustered
in upstream enhancer and promoter elements. - The enhancer elements may be located many kb
upstream of the core-promoter
Upstream elements
43TF binding sites
- TF binding sites are weak signal (about 10 bp
with 5bp conserved) - If two genes are co-regulated, they are likely to
share binding sites - Discovery of binding site motifs is an important
research problem.
TCAGGAG
g1
TGAGGAG
g2
g3
TCAGGTG
g4
TGAGGTG
g5
TCAGGTG
44http//www.gene-regulation.com/pub/databases.html
transfac
45Discovering TF binding sites
- Identification of these TF binding sites/switches
is critical. - Requires identification of co-regulated genes
(genes containing the same set of switches). - How do we find co-regulated genes?
46Idea1 Use orthologous genes from different
species
- The species are too close (EX humans and
chimps). Binding non-binding sites are both
conserved. - The species are distant. Binding sites are
conserved but not other sequence. - The species are very distant. Even binding sites
are not conerved. The genes have alternative
regulators.
ACGGCAGCTCGCCGCCGCGC
ACGGC-GGGCGCCGCCCCGC
ACGGCAGCTCGCCGCCGC-C
AGTGC-GGGCGCCGCCTCAT
ACGGC-GC-TCGCCGCCGCGC
AT-ACGAAGTAGCGG-ATGGT
47Idea2 Microarray
- Expression level of all genes
48Pathways
- Proteins interact to transduce signal, catalyze
reactions, etc. - The interactions can be captured in a database.
- Queries on this database are about looking for
interesting sub-graphs in a large graph.
49Summary
- Biological databases cannot be understood without
understanding the data, and the tools for
querying and accessing these data. - While database technology (XML, Relational OO
databases, text formats) is used to store this
data, its use is (often) transparent for
Bioinformatics people. - In this course, we looked at various
data-streams, and pointed to databases that store
these data-streams - Nucleic Acids Research brings out a database
issue every January
2005 issue