CSE182-L18

Description:

CSE182-L18 Population Genetics – PowerPoint PPT presentation

Slides: 50

Provided by: Vine61

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
CSE182-L18

Population Genetics

2
Perfect Phylogeny

Assume an evolutionary model in which no
recombination takes place, only mutation.
The evolutionary history is explained by a tree
in which every mutation is on an edge of the
tree. All the species in one sub-tree contain a
0, and all species in the other contain a 1. Such
a tree is called a perfect phylogeny.
How can one reconstruct such a tree?

3
The 4-gamete condition

A column i partitions the set of species into two
sets: i0 (species with a 0 in column i) and i1
(species with a 1 in column i).
A column is homogeneous w.r.t. a set of species
if it has the same value for all species in the set.
Otherwise, it is heterogeneous.
Ex: i is heterogeneous w.r.t. {A, D, E}.

   i
A  0
B  0
C  0
D  1
E  1
F  1

i0 = {A, B, C}, i1 = {D, E, F}
4
4 Gamete Condition

There exists a perfect phylogeny if and only if,
for every pair of columns (i, j), j is homogeneous
w.r.t. i0 or w.r.t. i1.
Equivalently:
There exists a perfect phylogeny if and only if,
for every pair of columns (i, j), the following 4
rows do not all occur:
(0,0), (0,1), (1,0), (1,1)
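
The pairwise test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `four_gamete_violations` and the representation of the matrix as a list of 0/1 strings (one string per species) are assumptions made for the example.

```python
from itertools import combinations

def four_gamete_violations(rows):
    """Return the column pairs (i, j), 0-based, in which all four
    gametes (0,0), (0,1), (1,0), (1,1) occur."""
    n_cols = len(rows[0])
    violations = []
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct (column i, column j) value pairs over all species.
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations

# Hypothetical inputs: one string per species, one character per column.
compatible = ["11000", "00100", "11010", "00101", "10000"]
conflicting = ["00", "01", "10", "11"]

print(four_gamete_violations(compatible))
print(four_gamete_violations(conflicting))
```

An empty result means every pair of columns passes the test; any reported pair rules out a perfect phylogeny.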

5
4-gamete condition proof

Depending on the edge on which mutation j occurs,
column j must be homogeneous w.r.t. either i0 or i1.
(only if) Every perfect phylogeny satisfies the
4-gamete condition.
(if) If the 4-gamete condition is satisfied, does
a perfect phylogeny exist?

6
An algorithm for constructing a perfect phylogeny

We will consider the case where 0 is the
ancestral state, and 1 is the mutated state. This
restriction will be removed later (unrooted case).
In any tree, each node (except the root) has a
single parent.
It is sufficient to construct a parent for every
node.
In each step, we add a column and refine some of
the nodes containing multiple children.
Stop if all columns have been considered.

7
Inclusion Property

For any pair of columns i, j:
i < j if and only if i1 ⊇ j1
Note that if i < j, then the edge containing i is an
ancestor of the edge containing j.

(Figure: edge i lies above edge j in the tree.)
8
Example
   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Initially, there is a single clade r, and each
node has r as its parent.
9
Sort columns

Sort columns according to the inclusion property
(note that the columns are already sorted here).
This can be achieved by considering the columns
as binary representations of numbers (most
significant bit in row 1) and sorting in
decreasing order

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
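
As a small sketch of the sorting step, each column can be read top to bottom as a binary number and the column indices sorted by that value in decreasing order. The function name `sort_columns` and the list-of-strings matrix format are assumptions for illustration only.

```python
def sort_columns(rows):
    """Return 0-based column indices sorted by the binary number each
    column spells out (first row = most significant bit), largest first."""
    n_cols = len(rows[0])

    def column_value(j):
        return int("".join(row[j] for row in rows), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)

# The example matrix, rows A..E, columns 1..5.
matrix = ["11000", "00100", "11010", "00101", "10000"]
print(sort_columns(matrix))

# The same matrix with its columns reversed (hypothetical input).
reversed_cols = [row[::-1] for row in matrix]
print(sort_columns(reversed_cols))
```

For the example matrix the column values are 21, 20, 10, 4, 2, so the columns are already in sorted order.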
10
Add first column
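   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

In adding column i:
For each species, check each edge and decide which
side of the edge the species belongs on.
Finally, add a node if a clade can be resolved.

(Figure: after adding column 1, the root r has children
B, D, and a new node u; u has children A, C, E.)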
11
Adding other columns
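   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Add other columns on edges using the ordering
property

(Figure: the final tree. Edge 1 leads from r to the clade
{A, C, E}; below it, edge 2 leads to {A, C} and edge 4 to C.
Edge 3 leads from r to the clade {B, D}; below it, edge 5
leads to D.)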
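
The construction on the last few slides can be sketched as follows. This is one common way to realize it (process the sorted columns and record, for each column, the previous column with a 1 in the same row), offered as an illustration rather than as the lecturer's exact procedure. The function name `perfect_phylogeny`, the root label "r", and the returned dictionaries are assumptions; columns are reported 1-based to match the slides, and 0 is assumed to be the ancestral state.

```python
def perfect_phylogeny(names, rows):
    """Return (parent, leaf) for a rooted perfect phylogeny, or None.

    parent[c] is the column whose edge lies directly above the edge of
    column c ("r" for the root); leaf[s] is the column whose edge
    species s hangs below ("r" if s carries no mutation).
    """
    n_cols = len(rows[0])
    # Sort columns as binary numbers, largest first.
    order = sorted(range(n_cols),
                   key=lambda j: int("".join(r[j] for r in rows), 2),
                   reverse=True)
    parent, leaf = {}, {}
    for name, row in zip(names, rows):
        prev = "r"
        for j in order:
            if row[j] == "1":
                # Every species carrying mutation j must agree on its parent.
                if parent.setdefault(j + 1, prev) != prev:
                    return None  # conflict: no perfect phylogeny
                prev = j + 1
        leaf[name] = prev
    return parent, leaf

names = ["A", "B", "C", "D", "E"]
matrix = ["11000", "00100", "11010", "00101", "10000"]
result = perfect_phylogeny(names, matrix)
print(result)
```

On the example matrix this gives edges 1 and 3 below the root, edge 2 below 1, edge 4 below 2, and edge 5 below 3, with E, A, C, B, D hanging below edges 1, 2, 4, 3, 5 respectively.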
12
Unrooted case

Switch the values in each column, so that 0 is
the majority element.
Apply the algorithm for the rooted case
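
A minimal sketch of the column-switching step, assuming the matrix is a list of 0/1 strings. The helper name `flip_to_majority_zero` is hypothetical; ties are left unflipped, which is an arbitrary choice made for this example.

```python
def flip_to_majority_zero(rows):
    """Complement every column in which 1 is the strict majority value."""
    n_rows, n_cols = len(rows), len(rows[0])
    flip = [2 * sum(row[j] == "1" for row in rows) > n_rows
            for j in range(n_cols)]
    swapped = {"0": "1", "1": "0"}
    return ["".join(swapped[c] if flip[j] else c for j, c in enumerate(row))
            for row in rows]

# Hypothetical input: column 1 is all 1s, so it is complemented.
example = ["110", "100", "101"]
print(flip_to_majority_zero(example))
```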

13
Summary: No recombination leads to correlation
between sites

(Figure: a tree rooted at 00000000. A mutation at site 3
gives 00100000; from there, mutations at sites 5 and 8
give 00101000 and 00100001 on separate branches.)

The different sites are linked. A 1 in position
8 implies 0 in position 5, and vice versa.
The history of a population can be expressed as a
tree.
The tree can be constructed efficiently

14
Recombination

A tree is not sufficient as a sequence may have 2
parents
Recombination leads to violation of 4 gamete
property.
Recombination leads to loss of correlation
between columns

Example: 00000000 and 11111111 recombine to give 00011111.
15
Studying recombination

A tree is not sufficient as a sequence may have 2
parents
Recombination leads to loss of correlation
between columns
How can we measure recombination?

16
Linkage (Dis)-equilibrium (LD)
Extensive recombination
(A,B): 00 01 00 00 11 10 10 10
Pr[(A,B) = (0,1)] = 0.125
Linkage equilibrium

No recombination
(A,B): 01 01 00 00 10 10 10 10
Pr[(A,B) = (0,1)] = 0.25
Linkage disequilibrium

17
Measuring LD

Consider two bi-allelic sites A and B, with
values 0 and 1.
Let p1 = probability that an individual has allele 1 at
site A (and p0 = 1 - p1)
q1 = probability that an individual has allele 1 at site
B
P11 = probability that an individual has allele 1 at both
sites A and B (P01: allele 0 at A and allele 1 at B)
Linkage disequilibrium: D = |P11 - p1 q1|
= |P01 - p0 q1|
If D = 0, the sites are uncorrelated (they are in linkage
equilibrium).
If D >> 0, the sites are highly correlated (they have
high LD).
Other measures exist, but they all measure
similar quantities.
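
As a worked illustration, the sketch below computes D = |P11 - p1 q1| for two small samples of (A, B) allele pairs, one in which all four allele combinations occur and one in which only three do. The function name `ld` and the list-of-tuples input are assumptions for this example.

```python
def ld(pairs):
    """D = |P11 - p1*q1| for a list of (allele at A, allele at B) pairs."""
    n = len(pairs)
    p1 = sum(a for a, _ in pairs) / n    # frequency of allele 1 at site A
    q1 = sum(b for _, b in pairs) / n    # frequency of allele 1 at site B
    p11 = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)

# All four allele combinations present (as after extensive recombination).
recombined = [(0, 0), (0, 1), (0, 0), (0, 0), (1, 1), (1, 0), (1, 0), (1, 0)]
# Only three combinations present (consistent with no recombination).
linked = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

print(ld(recombined))  # P11 = 1/8, p1*q1 = 0.5 * 0.25
print(ld(linked))      # P11 = 0,   p1*q1 = 0.5 * 0.25
```

The first sample gives D = 0 (linkage equilibrium) and the second gives D = 0.125.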

18
LD can be used to map disease genes

Marker allele:                        0 1 1 0 0 1
Phenotype (D = disease, N = normal):  D N N D D N

LD decays with distance from the disease allele.
By plotting LD, one can shortlist the region
containing the disease gene.

19
Population sub-structure can cause problems in
disease gene mapping
20
Population sub-structure can increase LD

Consider two populations that were isolated and
evolving independently.
They might have different allele frequencies in
some regions.
Pick two regions that are far apart (LD is very
low, close to 0)

21
Recent ad-mixing of population

If the populations came together recently (e.g.,
African and European populations), artificial LD
might be created.
D = 0.15 (instead of 0.01), an increase of more than 10-fold.
This spurious LD might lead to false
associations.
Other genetic events can cause LD to arise, and
one needs to be careful.

Pop. A: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1
Pop. B: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0
Pop. A+B: p1 = 0.5, q1 = 0.5, P11 = 0.1, D = |0.1 - 0.25| = 0.15
22
Determining population sub-structure

Given a mix of people, can you sub-divide them
into ethnic populations?
Turn the problem of spurious LD into a clue.
Find markers that are too far apart to show LD
If they do show LD (correlation), that shows the
existence of multiple populations.
Sub-divide them into populations so that LD
disappears.

23
Determining Population sub-structure

Same example as before
The two markers are too far apart to show any LD,
yet they do show LD.
However, if you split the individuals so that all 0..1 are
in one population and all 1..0 are in another, LD
disappears.
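
The effect can be checked numerically. The sketch below assumes 10 haplotypes per population (eight of the dominant type plus one 0..0 and one 1..1, as in the later assignment example) and keeps only the two marker values, dropping the elided region between them. The names `ld`, `pop_a` and `pop_b` are hypothetical.

```python
def ld(pairs):
    """D = |P11 - p1*q1| for a list of (marker 1, marker 2) allele pairs."""
    n = len(pairs)
    p1 = sum(a for a, _ in pairs) / n
    q1 = sum(b for _, b in pairs) / n
    p11 = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)

# Assumed sample: 10 haplotypes per population.
pop_a = [(0, 1)] * 8 + [(0, 0), (1, 1)]
pop_b = [(1, 0)] * 8 + [(0, 0), (1, 1)]

print(round(ld(pop_a + pop_b), 4))  # pooled sample
print(round(ld(pop_a), 4))          # first population alone
print(round(ld(pop_b), 4))          # second population alone
```

Pooled, the two markers show D = 0.15; within either population alone D drops to 0.01.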

24
Iterative Algorithm for Population Substructure

Assume that there are 2 sub-populations, A and B.
Randomly partition the individuals into two.
Select an individual x, and compute the
probabilities Pr(x | A) and Pr(x | B).
Assign the individual to A with probability
Pr(x | A) / (Pr(x | A) + Pr(x | B)).
Continue.

25
Iterative algorithm for population sub-structure

Define
N = number of individuals (each has a single
chromosome)
k = number of sub-populations
Z ∈ {1..k}^N is a vector giving the
sub-population of each individual:
Zi = k means individual i is assigned to population
k
Xi,j = allelic value for individual i in position
j
Pk,j,l = frequency of allele l at position j in
population k

26
Example

Ex: consider the following assignment
P1,1,0 = 0.9
P2,1,0 = 0.1

Individuals 1-10:  0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                 1    1    1    1    1    1    1    1    1    1
Individuals 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                 2    2    2    2    2    2    2    2    2    2
27
Goal

X is known.
P, Z are unknown.
The goal is to estimate Pr(P, Z | X)
Various learning techniques can be employed.
Here a Bayesian (MCMC) scheme is employed. We
will only consider a simplified version

28
Algorithm: Structure

Iteratively estimate
(Z(0),P(0)), (Z(1),P(1)), .., (Z(m),P(m))
After convergence, Z(m) is the answer.
Iteration
Guess Z(0)
For m = 1, 2, ..
Sample P(m) from Pr(P | X, Z(m-1))
Sample Z(m) from Pr(Z | X, P(m))
How is this sampling done?

29
Example

Choose Z at random, so each individual is
assigned to one of 2 populations. See
example.
Now, we need to sample P(1) from Pr(P | X, Z(0))
Simply count
Nk,j,l = number of people in population k who
have allele l in position j
pk,j,l = Nk,j,l / Nk,j,* (Nk,j,* = Nk,j,0 + Nk,j,1, the
number of people in population k)

Individuals 1-10:  0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                 1    2    2    1    1    2    1    2    1    2
Individuals 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                 1    2    2    1    1    2    1    2    2    1
30
Example
Individuals 1-10:  0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                 1    2    2    1    1    2    1    2    1    2
Individuals 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                 1    2    2    1    1    2    1    2    2    1

Nk,j,l = number of people in population k who
have allele l in position j
pk,j,l = Nk,j,l / Nk,j,* (Nk,j,* = number of people in
population k)
N1,1,0 = 4
N1,1,1 = 6
p1,1,0 = 4/10
p1,2,0 = 4/10
Thus, we can sample P(m)
31
Sampling Z

Pr[Z1 = 1] = Pr["01" belongs to population 1]?
We know that each position should be in linkage
equilibrium and independent.
Pr["01" | Population 1] = p1,1,0 × p1,2,1
= (4/10)(6/10) = 0.24
Pr["01" | Population 2] = p2,1,0 × p2,2,1
= (6/10)(4/10) = 0.24
Pr[Z1 = 1] = 0.24 / (0.24 + 0.24) = 0.5
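
The counting and assignment arithmetic of this example can be reproduced with a short sketch. Haplotypes are written as two-character strings holding only the two marker alleles (the elided region is dropped), and the names `frequencies` and `likelihood` are hypothetical helpers, not part of any published tool. The sketch assumes every population has at least one member.

```python
haps = ["01", "01", "00", "11", "01", "01", "01", "01", "01", "01",
        "10", "10", "00", "11", "10", "10", "10", "10", "10", "10"]
z = [1, 2, 2, 1, 1, 2, 1, 2, 1, 2,
     1, 2, 2, 1, 1, 2, 1, 2, 2, 1]

def frequencies(haps, z, k=2):
    """p[pop][j][allele]: fraction of pop's members with that allele at position j."""
    p = {}
    for pop in range(1, k + 1):
        members = [h for h, zi in zip(haps, z) if zi == pop]
        p[pop] = [{a: sum(h[j] == a for h in members) / len(members)
                   for a in "01"}
                  for j in range(len(haps[0]))]
    return p

def likelihood(hap, p_pop):
    """Pr(hap | population), treating positions as independent."""
    out = 1.0
    for j, allele in enumerate(hap):
        out *= p_pop[j][allele]
    return out

p = frequencies(haps, z)
l1 = likelihood("01", p[1])
l2 = likelihood("01", p[2])
pr_z1_is_1 = l1 / (l1 + l2)
print(p[1][0]["0"], p[1][1]["0"], l1, l2, pr_z1_is_1)
```

With this random assignment both populations explain "01" equally well, so the individual is assigned to either with probability 0.5.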

32
Sampling

Suppose that, during the iteration, there is a bias.
Then, in the next step of sampling Z, we will do
the right thing:
Pr["01" | pop. 1] = p1,1,0 × p1,2,1 = 0.7 × 0.7
= 0.49
Pr["01" | pop. 2] = p2,1,0 × p2,2,1 = 0.3 × 0.3
= 0.09
Pr[Z1 = 1] = 0.49 / (0.49 + 0.09) ≈ 0.84
Pr[Z6 = 1] = 0.49 / (0.49 + 0.09) ≈ 0.84
Eventually all 01 will become one population, and
all 10 will become a second population.

Individuals 1-10:  0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                 1    1    1    2    1    2    1    2    1    1
Individuals 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                 2    2    2    1    2    2    1    2    2    1
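
Putting the two steps together gives the following simplified loop. It is only a sketch of the idea: the frequencies are point estimates obtained by counting (not draws from a posterior, as in a full MCMC scheme), a pseudocount is added so that no frequency is exactly zero, and populations are labeled 0 and 1. The function name `structure_sketch` and the parameters `iters`, `seed` and `pseudo` are assumptions made for this example.

```python
import random

def structure_sketch(haps, k=2, iters=200, seed=0, pseudo=0.5):
    """Alternate between estimating P from Z and resampling Z from P."""
    rng = random.Random(seed)
    n_pos = len(haps[0])
    z = [rng.randrange(k) for _ in haps]           # guess Z(0) at random
    for _ in range(iters):
        # Estimate allele frequencies in each population by counting.
        counts = [[[pseudo, pseudo] for _ in range(n_pos)] for _ in range(k)]
        for hap, zi in zip(haps, z):
            for j, allele in enumerate(hap):
                counts[zi][j][int(allele)] += 1
        freq = [[[c0 / (c0 + c1), c1 / (c0 + c1)] for c0, c1 in pop]
                for pop in counts]
        # Resample each individual's population in proportion to Pr(x | pop).
        for i, hap in enumerate(haps):
            weights = []
            for pop in range(k):
                like = 1.0
                for j, allele in enumerate(hap):
                    like *= freq[pop][j][int(allele)]
                weights.append(like)
            z[i] = rng.choices(range(k), weights=weights)[0]
    return z

haps = ["01"] * 8 + ["00", "11"] + ["10"] * 8 + ["00", "11"]
z = structure_sketch(haps)
print(z)
```

Because the procedure is randomized, the labels in the output depend on the seed; the slide's argument is that assignments tend to drift toward separating the 01 and 10 haplotypes.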
33
Population Structure

377 locations (loci) were sampled in about 1000 people
from 52 populations.
6 genetic clusters were obtained, 5 of which
corresponded to major geographic regions (Rosenberg
et al., Science 2002)

Oceania
Eurasia
East Asia
America
Africa
34
Other topics
Assembly
Genomic Analysis/ Pop. Genetics
Protein Sequence Analysis
Sequence Analysis
Gene Finding
ncRNA
35
ncRNA gene finding

The gene is transcribed but not translated.
What are the clues to non-coding genes?
Look for signals marking the start of transcription.
Many non-coding genes (e.g., tRNAs) are transcribed
by Pol III.
Non-coding genes have structure. Look for genomic
sequences that fold into an RNA structure.
Structure: given a sequence, what is the
structure into which it can fold with minimum
energy?

36
tRNA structure
37
RNA structure Basics

Key: RNA is single-stranded. Think of a string
over 4 letters: A, C, G, and U.
The complementary bases form pairs.
Base-pairing defines a secondary structure. The
base-pairing is usually non-crossing.
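
Because the pairing is non-crossing, the best structure over a substring can be built from the best structures of shorter substrings. The sketch below uses this with a simplified objective: it maximizes the number of complementary pairs (a Nussinov-style dynamic program) instead of minimizing free energy, so it is a stand-in for the minimum-energy question rather than a real folding algorithm. The function name `max_pairs`, the restriction to A-U and G-C pairs, and the minimum loop length of 3 are assumptions for illustration.

```python
def max_pairs(seq, min_loop=3):
    """Maximum number of non-crossing A-U / G-C pairs in seq, with at
    least min_loop unpaired bases between the two ends of any pair."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case: j is unpaired
            for k in range(i, j - min_loop):        # case: j pairs with k
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_pairs("GGGAAACCC"))  # a hairpin: three G-C pairs around an AAA loop
print(max_pairs("AAAA"))
```

Pseudoknots (crossing pairs) are outside what this recurrence can represent.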

38
RNA structure: pseudoknots
Sometimes, unpaired bases in loops form
crossing pairs. These are pseudoknots.
39
A Static picture of the cell is insufficient

Each cell is continuously active:
Genes are being transcribed into RNA
RNA is translated into proteins
Proteins are post-translationally modified and transported
Proteins perform various cellular functions
Can we probe the cell dynamically?

Pathways
Gene Regulation
Proteomic profiling
Transcript profiling
40
Gene expression

The expression of transcripts and protein in the
cell is not static. It changes in response to
signals.
The expression can be measured using
micro-arrays.
What causes the change in expression?

41
Transcriptional machinery

RNA polymerase II (Pol II) scans the genome, initiating
transcription and terminating it.
The same machinery is used for every gene, so
while Pol II is required, it is not sufficient
to confer specificity.

42
TF binding
Transcription factors

Other transcription factors interact with the
core machinery and upstream DNA to provide
specificity.
TFs bind to TF binding sites which are clustered
in upstream enhancer and promoter elements.
The enhancer elements may be located many kb
upstream of the core-promoter

Upstream elements
43
TF binding sites

TF binding sites are a weak signal (about 10 bp,
with about 5 bp conserved).
If two genes are co-regulated, they are likely to
share binding sites
Discovery of binding site motifs is an important
research problem.

g1  TCAGGAG
g2  TGAGGAG
g3  TCAGGTG
g4  TGAGGTG
g5  TCAGGTG
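
Given a set of aligned sites like the five above, a simple first summary is a per-position count of bases and the resulting consensus string. The sketch below shows this for the listed sites. It does not discover motifs in unaligned sequence, which is the harder research problem; the helper names `position_counts` and `consensus` are hypothetical.

```python
from collections import Counter

sites = ["TCAGGAG", "TGAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]

def position_counts(sites):
    """One Counter of bases per alignment column."""
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))

counts = position_counts(sites)
conserved = [j for j, c in enumerate(counts) if len(c) == 1]

print(consensus(sites))
print(conserved)  # 0-based positions where all five sites agree
```

Here five of the seven positions are identical across all sites, and the other two each vary between two bases.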
44
http://www.gene-regulation.com/pub/databases.html
TRANSFAC
45
Discovering TF binding sites

Identification of these TF binding sites/switches
is critical.
Requires identification of co-regulated genes
(genes containing the same set of switches).
How do we find co-regulated genes?

46
Idea 1: Use orthologous genes from different
species

The species are too close (e.g., humans and
chimps). Both binding and non-binding sites are
conserved.
The species are distant. Binding sites are
conserved but other sequence is not.
The species are very distant. Even binding sites
are not conserved. The genes have alternative
regulators.

ACGGCAGCTCGCCGCCGCGC
ACGGC-GGGCGCCGCCCCGC
ACGGCAGCTCGCCGCCGC-C
AGTGC-GGGCGCCGCCTCAT
ACGGC-GC-TCGCCGCCGCGC
AT-ACGAAGTAGCGG-ATGGT
47
Idea 2: Microarray

Expression level of all genes

48
Pathways

Proteins interact to transduce signal, catalyze
reactions, etc.
The interactions can be captured in a database.
Queries on this database are about looking for
interesting sub-graphs in a large graph.

49
Summary

Biological databases cannot be understood without
understanding the data, and the tools for
querying and accessing these data.
While database technology (XML, relational and OO
databases, text formats) is used to store this
data, its use is (often) transparent to
bioinformatics people.
In this course, we looked at various
data-streams, and pointed to databases that store
these data-streams
Nucleic Acids Research brings out a database
issue every January

2005 issue
