CSE182-L18 - PowerPoint PPT Presentation

Slides: 50
Provided by: Vine61
Learn more at: https://cseweb.ucsd.edu

1
CSE182-L18
  • Population Genetics

2
Perfect Phylogeny
  • Assume an evolutionary model in which no
    recombination takes place, only mutation.
  • The evolutionary history is explained by a tree
    in which every mutation is on an edge of the
    tree. All the species in one sub-tree contain a
    0, and all species in the other contain a 1. Such
    a tree is called a perfect phylogeny.
  • How can one reconstruct such a tree?

3
The 4-gamete condition
  • A column i partitions the set of species into two
    sets: i0 (species with a 0 in column i) and i1
    (species with a 1 in column i).
  • A column is homogeneous w.r.t. a set of species
    if it has the same value for all species in the
    set. Otherwise, it is heterogeneous.
  • Ex: i is heterogeneous w.r.t. {A, D, E}

      A  B  C  D  E  F
  i   0  0  0  1  1  1

  i0 = {A, B, C}, i1 = {D, E, F}
4
4 Gamete Condition
  • There exists a perfect phylogeny if and only if,
    for all pairs of columns (i,j), j is homogeneous
    w.r.t. either i0 or i1.
  • Equivalent to
  • There exists a perfect phylogeny if and only if,
    for all pairs of columns (i,j), the following 4
    rows do not all occur
  • (0,0), (0,1), (1,0), (1,1)
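
The pairwise test above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name four_gamete_violations and the list-of-rows matrix layout are assumptions, and the example matrix is the 5-species, 5-column matrix used in the later Example slides.

```python
from itertools import combinations


def four_gamete_violations(matrix):
    """Return all column pairs (i, j) in which all four gametes occur.

    matrix is a list of rows (one row per species) holding 0/1 values.
    An empty result means the 4-gamete condition is satisfied.
    """
    n_cols = len(matrix[0])
    violations = []
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct (value in column i, value in column j) rows.
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


# Rows are species A..E, columns are sites 1..5.
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
# Two columns showing all four gametes, as a recombinant pair would.
recombinant = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(four_gamete_violations(example))      # no violating pairs
print(four_gamete_violations(recombinant))  # the single pair (0, 1)
```

The check is quadratic in the number of columns, which is fine for a sketch; column indices in the output are 0-based.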

5
4-gamete condition proof
  • Depending on the edge on which mutation j occurs,
    either i0 or i1 should be homogeneous.
  • (only if) Every perfect phylogeny satisfies the
    4-gamete condition
  • (if) If the 4-gamete condition is satisfied, does
    a perfect phylogeny exist?

6
An algorithm for constructing a perfect phylogeny
  • We will consider the case where 0 is the
    ancestral state, and 1 is the mutated state. This
    will be fixed later.
  • In any tree, each node (except the root) has a
    single parent.
  • It is sufficient to construct a parent for every
    node.
  • In each step, we add a column and refine some of
    the nodes containing multiple children.
  • Stop if all columns have been considered.

7
Inclusion Property
  • For any pair of columns i,j
  • i < j if and only if i1 ⊇ j1 (every species that
    carries mutation j also carries mutation i)
  • Note that if i < j then the edge containing i is an
    ancestor of the edge containing j

8
Example
     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
Initially, there is a single clade r, and each
node has r as its parent
9
Sort columns
  • Sort columns according to the inclusion property
    (note that the columns are already sorted here).
  • This can be achieved by considering the columns
    as binary representations of numbers (most
    significant bit in row 1) and sorting in
    decreasing order

     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
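
The sorting step can be sketched as follows. The function name sort_columns and the list-of-rows layout are illustrative assumptions; the idea (read each column top to bottom as a binary number and sort in decreasing order) is the one stated on the slide.

```python
def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value.

    Each column is read top to bottom as a binary number, so the first
    row is the most significant bit. Ties keep their original order.
    """
    n_cols = len(matrix[0])

    def column_value(c):
        return int("".join(str(row[c]) for row in matrix), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)


# Rows are species A..E, columns are sites 1..5.
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]

# The example columns have values 21, 20, 10, 4, 2, so they are already sorted.
print(sort_columns(example))
```

Indices in the result are 0-based, so index 0 corresponds to column 1 of the slide.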
10
Add first column
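     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
  • When adding column i:
  • Check each edge and decide which side of it you
    belong to.
  • Finally, add a node if you can resolve a clade.

(Figure: after adding column 1, the root r has children
B, D and a new node u; u has children A, C and E.)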
11
Adding other columns
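     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
  • Add other columns on edges using the ordering
    property

(Figure: the final tree. From the root r, edge 1 leads to
the clade {A, C, E}; within it E branches off, edge 2
leads to {A, C}, and edge 4 leads to C. Edge 3 leads from
r to the clade {B, D}, and edge 5 leads to D.)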
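
A minimal sketch of the rooted construction, assuming 0 is the ancestral state. The function name build_perfect_phylogeny, the root label "r" and the parent-dictionary output are illustrative choices, not taken from the slides. Each column becomes a tree node (the lower end of the edge carrying that mutation), and each species hangs below the last mutated column it carries.

```python
def build_perfect_phylogeny(matrix, species):
    """Return a dict mapping every column (1-based) and species to its parent.

    Raises ValueError if two rows disagree about the parent of a column,
    in which case no rooted perfect phylogeny exists for the matrix.
    """
    n_cols = len(matrix[0])

    def column_value(c):
        return int("".join(str(row[c]) for row in matrix), 2)

    # Ancestral mutations (larger 1-sets) come first in this order.
    order = sorted(range(n_cols), key=column_value, reverse=True)

    root = "r"
    parent = {}
    for name, row in zip(species, matrix):
        prev = root
        for c in order:
            if row[c] == 1:
                label = c + 1  # 1-based column number, as on the slides
                if parent.setdefault(label, prev) != prev:
                    raise ValueError(
                        "column %d has two candidate parents" % label)
                prev = label
        parent[name] = prev  # the species attaches below its last mutation
    return parent


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = build_perfect_phylogeny(example, ["A", "B", "C", "D", "E"])
for child in [1, 2, 3, 4, 5, "A", "B", "C", "D", "E"]:
    print(child, "<-", tree[child])
```

For the example matrix this gives column 1 and column 3 below the root, column 2 below 1, column 4 below 2 and column 5 below 3, with E, A, C, B, D attached to columns 1, 2, 4, 3, 5 respectively.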
12
Unrooted case
  • Switch the values in each column, so that 0 is
    the majority element.
  • Apply the algorithm for the rooted case
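
The relabeling step for the unrooted case can be sketched as below. The function name make_zero_majority is hypothetical, and leaving tied columns unflipped is an assumption made here, since the slide does not say how ties are handled.

```python
def make_zero_majority(matrix):
    """Flip every column in which 1 is the strict majority value.

    Returns the relabeled matrix and the list of flipped column indices
    (0-based). Columns with a tie are left unchanged (an assumption).
    """
    n_rows = len(matrix)
    flipped_cols = [
        c for c in range(len(matrix[0]))
        if 2 * sum(row[c] for row in matrix) > n_rows
    ]
    flipped = set(flipped_cols)
    relabeled = [
        [1 - v if c in flipped else v for c, v in enumerate(row)]
        for row in matrix
    ]
    return relabeled, flipped_cols


relabeled, flipped_cols = make_zero_majority([[1, 0], [1, 0], [0, 1]])
print(relabeled)     # column 0 had two 1s out of three, so it is flipped
print(flipped_cols)
```

The relabeled matrix can then be passed to any rooted-case procedure; the flipped column list records which columns had their 0/1 labels exchanged.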

13
Summary: No recombination leads to correlation
between sites
(Figure: a tree rooted at 00000000. A mutation at site 3
gives 00100000; from there, a mutation at site 5 gives
00101000 and a mutation at site 8 gives 00100001.)
  • The different sites are linked. A 1 in position
    8 implies 0 in position 5, and vice versa.
  • The history of a population can be expressed as a
    tree.
  • The tree can be constructed efficiently

14
Recombination
  • A tree is not sufficient as a sequence may have 2
    parents
  • Recombination leads to violation of 4 gamete
    property.
  • Recombination leads to loss of correlation
    between columns

Ex: parents 00000000 and 11111111 recombine to give 00011111
15
Studying recombination
  • A tree is not sufficient as a sequence may have 2
    parents
  • Recombination leads to loss of correlation
    between columns
  • How can we measure recombination?

16
Linkage (Dis)-equilibrium (LD)
  • Extensive recombination
  • $\Pr_{A,B}(0,1) = 0.125$
  • Linkage equilibrium

      A  B
      0  0
      0  1
      0  0
      0  0
      1  1
      1  0
      1  0
      1  0

  • No recombination
  • $\Pr_{A,B}(0,1) = 0.25$
  • Linkage disequilibrium

      A  B
      0  1
      0  1
      0  0
      0  0
      1  0
      1  0
      1  0
      1  0
17
Measuring LD
  • Consider two bi-allelic sites A and B, with
    values 0 and 1.
  • Let $p_1$ = probability [individual has allele 1 in
    site A]
  • $q_1$ = probability [individual has allele 1 in site
    B]
  • $P_{11}$ = probability [individual has allele 1 in
    site A and in site B]
  • Linkage disequilibrium, $D = |P_{11} - p_1 q_1|
    = |P_{01} - p_0 q_1| = \dots$
  • If $D = 0$, the sites are uncorrelated (are in linkage
    equilibrium)
  • If $D \gg 0$, the sites are highly correlated (have
    high LD)
  • Other measures exist, but they all measure
    similar quantities.
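
As a small worked example, D can be computed directly from a list of two-site haplotypes. The function name linkage_disequilibrium is illustrative, and the absolute value follows the later worked example (D = |0.1 - 0.25| = 0.15); the two data sets are the eight-haplotype tables from the Linkage (Dis)-equilibrium slide.

```python
def linkage_disequilibrium(haplotypes):
    """Return D = |P11 - p1 * q1| for a list of (A, B) allele pairs."""
    n = len(haplotypes)
    p1 = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at site A
    q1 = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at site B
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)


# Extensive recombination: Pr(0,1) = 1/8 = (1/2) * (1/4).
equilibrium = [(0, 0), (0, 1), (0, 0), (0, 0), (1, 1), (1, 0), (1, 0), (1, 0)]
# No recombination: allele 1 at site B only occurs together with 0 at site A.
disequilibrium = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

print(linkage_disequilibrium(equilibrium))     # sites look independent
print(linkage_disequilibrium(disequilibrium))  # sites are correlated
```

In both tables p1 = 0.5 and q1 = 0.25; only the joint frequency P11 differs (1/8 versus 0), which is what D picks up.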

18
LD can be used to map disease genes
  Marker allele     0  1  1  0  0  1
  Phenotype (D/N)   D  N  N  D  D  N
  • LD decays with distance from the disease allele.
  • By plotting LD, one can shortlist the region
    containing the disease gene.

19
Population sub-structure can cause problems in
disease gene mapping
20
Population sub-structure can increase LD
  • Consider two populations that were isolated and
    evolving independently.
  • They might have different allele frequencies in
    some regions.
  • Pick two regions that are far apart (LD is very
    low, close to 0)

21
Recent ad-mixing of population
  • If the populations came together recently (e.g.,
    African and European populations), artificial LD
    might be created.
  • $D = 0.15$ (instead of $0.01$), a more than 10-fold
    increase
  • This spurious LD might lead one to false
    associations
  • Other genetic events can cause LD to arise, and
    one needs to be careful

Pop. A:    0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1
Pop. B:    1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0
Pop. A+B:  $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = |0.1 - 0.25| = 0.15$
22
Determining population sub-structure
  • Given a mix of people, can you sub-divide them
    into ethnic populations?
  • Turn the problem of spurious LD into a clue.
  • Find markers that are too far apart to show LD
  • If they do show LD (correlation), that shows the
    existence of multiple populations.
  • Sub-divide them into populations so that LD
    disappears.

23
Determining Population sub-structure
  • Same example as before
  • The two markers are too far apart to show any LD,
    yet they do show LD.
  • However, if you split them so that all 0..1 are
    in one population and all 1..0 are in another, LD
    disappears
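
The effect of splitting can be checked numerically. This sketch assumes ten haplotypes per population (eight 0..1 plus one 0..0 and one 1..1 in the first, and the mirror image in the second), which reproduces the P11 = 0.1 quoted earlier; the helper name ld is hypothetical.

```python
def ld(haplotypes):
    """Return D = |P11 - p1 * q1| for a list of (first, last) allele pairs."""
    n = len(haplotypes)
    p1 = sum(a for a, _ in haplotypes) / n
    q1 = sum(b for _, b in haplotypes) / n
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)


# Assumed data: each pair is (allele at the first marker, allele at the second).
pop_a = [(0, 1)] * 8 + [(0, 0), (1, 1)]
pop_b = [(1, 0)] * 8 + [(0, 0), (1, 1)]

print("mixed:", round(ld(pop_a + pop_b), 3))  # substantial LD in the mixture
print("pop A:", round(ld(pop_a), 3))          # close to zero after the split
print("pop B:", round(ld(pop_b), 3))
```

Under these assumed counts the mixture gives D = 0.15, while each sub-population on its own gives D = 0.01.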

24
Iterative Algorithm for Population Substructure
  • Assume that there are 2 sub-populations, A and B
  • Randomly partition the individuals into two.
  • Select an individual x, and compute the
    probabilities $\Pr(x \mid A)$ and $\Pr(x \mid B)$
  • Assign the individual to A with probability
  • $\Pr(x \mid A) / (\Pr(x \mid A) + \Pr(x \mid B))$
  • Continue.

25
Iterative algorithm for population sub-structure
  • Define
  • $N$ = number of individuals (each has a single
    chromosome)
  • $k$ = number of sub-populations.
  • $Z \in \{1, \dots, k\}^N$ is a vector giving the
    sub-population of each individual.
  • $Z_i = k \Rightarrow$ individual $i$ is assigned to
    population $k$
  • $X_{i,j}$ = allelic value for individual $i$ in
    position $j$
  • $P_{k,j,l}$ = frequency of allele $l$ at position $j$ in
    population $k$

26
Example
  • Ex: consider the following assignment
  • $P_{1,1,0} = 0.9$
  • $P_{2,1,0} = 0.1$

Individuals 1-10:   0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                  1    1    1    1    1    1    1    1    1    1
Individuals 11-20:  1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                  2    2    2    2    2    2    2    2    2    2
27
Goal
  • X is known.
  • P, Z are unknown.
  • The goal is to estimate $\Pr(P, Z \mid X)$
  • Various learning techniques can be employed.
  • Here a Bayesian (MCMC) scheme is employed. We
    will only consider a simplified version

28
Algorithm: Structure
  • Iteratively estimate
  • $(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \dots, (Z^{(m)},P^{(m)})$
  • After convergence, $Z^{(m)}$ is the answer.
  • Iteration
  • Guess $Z^{(0)}$
  • For $m = 1, 2, \dots$
  • Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
  • Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
  • How is this sampling done?
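
The alternation above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the published method: P is set to smoothed allele counts instead of being drawn from its posterior, the pseudocount (which avoids zero frequencies and empty populations) is an addition made here, populations are numbered from 0, and all function names are hypothetical.

```python
import random


def estimate_frequencies(X, Z, k, pseudocount=1.0):
    """Return P[pop][site][allele], estimated by (smoothed) counting."""
    n_sites = len(X[0])
    counts = [[[pseudocount, pseudocount] for _ in range(n_sites)]
              for _ in range(k)]
    for x, z in zip(X, Z):
        for j, allele in enumerate(x):
            counts[z][j][allele] += 1
    return [[[c0 / (c0 + c1), c1 / (c0 + c1)] for c0, c1 in pop]
            for pop in counts]


def sample_assignments(X, P, k, rng):
    """Draw a population for each individual, treating sites as independent."""
    Z = []
    for x in X:
        weights = []
        for pop in range(k):
            w = 1.0
            for j, allele in enumerate(x):
                w *= P[pop][j][allele]
            weights.append(w)
        Z.append(rng.choices(range(k), weights=weights)[0])
    return Z


def cluster(X, k=2, iterations=50, seed=0):
    """Alternate between estimating P from Z and resampling Z from P."""
    rng = random.Random(seed)
    Z = [rng.randrange(k) for _ in X]  # the random initial guess Z(0)
    for _ in range(iterations):
        P = estimate_frequencies(X, Z, k)
        Z = sample_assignments(X, P, k, rng)
    return Z


# Twenty two-site haplotypes, assumed to match the running example.
X = ([(0, 1), (0, 1), (0, 0), (1, 1)] + [(0, 1)] * 6 +
     [(1, 0), (1, 0), (0, 0), (1, 1)] + [(1, 0)] * 6)
print(cluster(X))
```

Because the procedure is stochastic, the printed labels depend on the seed, and the numbering of the two populations is arbitrary.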

29
Example
  • Choose Z at random, so each individual is
    assigned to be in one of 2 populations. See
    example.
  • Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$
  • Simply count
  • $N_{k,j,l}$ = number of people in population $k$ which
    have allele $l$ in position $j$
  • $p_{k,j,l} = N_{k,j,l} / N_k$, where $N_k$ is the number
    of people in population $k$

Individuals 1-10:   0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                  1    2    2    1    1    2    1    2    1    2
Individuals 11-20:  1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                  1    2    2    1    1    2    1    2    2    1
30
Example
Individuals 1-10:   0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                  1    2    2    1    1    2    1    2    1    2
Individuals 11-20:  1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                  1    2    2    1    1    2    1    2    2    1
  • $N_{k,j,l}$ = number of people in population $k$ which
    have allele $l$ in position $j$
  • $p_{k,j,l} = N_{k,j,l} / N_k$, where $N_k$ is the number
    of people in population $k$
  • $N_{1,1,0} = 4$
  • $N_{1,1,1} = 6$
  • $p_{1,1,0} = 4/10$
  • $p_{1,2,0} = 4/10$
  • Thus, we can sample $P^{(m)}$
31
Sampling Z
  • $\Pr[Z_1 = 1] = \Pr[\text{01 belongs to population 1}] = ?$
  • We know that each position should be in linkage
    equilibrium and independent.
  • $\Pr[01 \mid \text{Population 1}] = p_{1,1,0}\, p_{1,2,1}
    = (4/10)(6/10) = 0.24$
  • $\Pr[01 \mid \text{Population 2}] = p_{2,1,0}\, p_{2,2,1}
    = (6/10)(4/10) = 0.24$
  • $\Pr[Z_1 = 1] = 0.24/(0.24 + 0.24) = 0.5$

32
Sampling
  • Suppose, during the iteration, there is a bias.
  • Then, in the next step of sampling Z, we will do
    the right thing
  • $\Pr[01 \mid \text{pop. 1}] = p_{1,1,0}\, p_{1,2,1} = 0.7 \times 0.7
    = 0.49$
  • $\Pr[01 \mid \text{pop. 2}] = p_{2,1,0}\, p_{2,2,1} = 0.3 \times 0.3
    = 0.09$
  • $\Pr[Z_1 = 1] = 0.49/(0.49 + 0.09) \approx 0.85$
  • $\Pr[Z_6 = 1] = 0.49/(0.49 + 0.09) \approx 0.85$
  • Eventually all 01 will become one population, and
    all 10 will become a second population

Individuals 1-10:   0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z:                  1    1    1    2    1    2    1    2    1    1
Individuals 11-20:  1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z:                  2    2    2    1    2    2    1    2    2    1
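
The numbers on the last two slides can be reproduced by plain counting. In this sketch the function names are illustrative, and the haplotype list and the two assignment vectors are read off the example figures (individuals 1-10 first, then 11-20).

```python
def allele_frequencies(X, Z, pop):
    """Return p[site][allele] for the individuals assigned to pop."""
    members = [x for x, z in zip(X, Z) if z == pop]
    n_sites = len(X[0])
    return [
        [sum(1 for x in members if x[j] == allele) / len(members)
         for allele in (0, 1)]
        for j in range(n_sites)
    ]


def prob_assign_to_1(x, X, Z):
    """Return Pr[x | pop 1] / (Pr[x | pop 1] + Pr[x | pop 2])."""
    likelihood = {}
    for pop in (1, 2):
        p = allele_frequencies(X, Z, pop)
        likelihood[pop] = 1.0
        for j, allele in enumerate(x):
            likelihood[pop] *= p[j][allele]  # sites treated as independent
    return likelihood[1] / (likelihood[1] + likelihood[2])


X = ([(0, 1), (0, 1), (0, 0), (1, 1)] + [(0, 1)] * 6 +
     [(1, 0), (1, 0), (0, 0), (1, 1)] + [(1, 0)] * 6)
Z_random = [1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1]
Z_biased = [1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1]

print(prob_assign_to_1((0, 1), X, Z_random))  # no preference yet
print(prob_assign_to_1((0, 1), X, Z_biased))  # the bias is reinforced
```

With the random assignment the ratio is 0.24/(0.24 + 0.24) = 0.5; with the biased assignment it is 0.49/0.58, about 0.845, which the slide quotes as 0.85.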
33
Population Structure
  • 377 locations (loci) were sampled in 1000 people
    from 52 populations.
  • 6 genetic clusters were obtained, 5 of which
    corresponded to major geographic regions
    (Rosenberg et al. Science 2002)

Regions: Oceania, Eurasia, East Asia, America, Africa
34
Other topics
Assembly
Genomic Analysis/ Pop. Genetics
Protein Sequence Analysis
Sequence Analysis
Gene Finding
ncRNA
35
ncRNA gene finding
  • The gene is transcribed but not translated.
  • What are the clues to non-coding genes?
  • Look for signals selecting start of transcription
    and translation. Non-coding genes are transcribed
    by Pol III
  • Non-coding genes have structure. Look for genomic
    sequences that fold into an RNA structure
  • Structure: Given a sequence, what is the
    structure into which it can fold with minimum
    energy?
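
As a rough stand-in for minimum-energy folding, one can maximize the number of non-crossing complementary base pairs with a Nussinov-style dynamic program. This is a swapped-in simplification, not an energy model: only A-U and G-C pairs are counted, and the function name max_base_pairs and the min_loop parameter (minimum number of unpaired bases enclosed by a pair) are assumptions made for this sketch.

```python
def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of non-crossing A-U / G-C pairs in seq."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, leaving >= min_loop
            # unpaired bases between them.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # a three-pair stem closing an AAA loop
```

The table has O(n^2) entries and each takes O(n) work, so the sketch runs in cubic time; a traceback over the same table would recover the pairs themselves.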

36
tRNA structure
37
RNA structure: Basics
  • Key: RNA is single-stranded. Think of a string
    over 4 letters: A, C, G, and U.
  • The complementary bases form pairs.
  • Base-pairing defines a secondary structure. The
    base-pairing is usually non-crossing.

38
RNA structure: pseudoknots
Sometimes, unpaired bases in loops form
crossing pairs. These are pseudoknots
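
Whether a set of base pairs contains a pseudoknot can be checked by looking for two pairs that cross. A minimal sketch, assuming pairs are given as 0-based index tuples (i, j) with i < j; the function name has_pseudoknot is illustrative.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs (i, j) and (k, l) cross.

    Two pairs cross when exactly one end of one pair lies strictly
    between the ends of the other, i.e. i < k < j < l or k < i < l < j.
    """
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l or k < i < l < j:
                return True
    return False


nested = [(0, 8), (1, 7), (2, 6)]   # a simple stem: pairs are nested
crossing = [(0, 5), (3, 9)]         # position 3 pairs outside the (0, 5) arc

print(has_pseudoknot(nested))
print(has_pseudoknot(crossing))
```

Nested and side-by-side pairs are both non-crossing; only interleaved pairs count as a pseudoknot here.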
39
A Static picture of the cell is insufficient
  • Each Cell is continuously active,
  • Genes are being transcribed into RNA
  • RNA is translated into proteins
  • Proteins are post-translationally modified and
    transported
  • Proteins perform various cellular functions
  • Can we probe the cell dynamically?

Pathways
Gene Regulation
Proteomic profiling
Transcript profiling
40
Gene expression
  • The expression of transcripts and protein in the
    cell is not static. It changes in response to
    signals.
  • The expression can be measured using
    micro-arrays.
  • What causes the change in expression?

41
Transcriptional machinery
  • RNA polymerase II (Pol II) scans the genome,
    initiating transcription, and terminating it.
  • The same machinery is used for every gene, so
    while Pol II is required, it is not sufficient
    to confer specificity

42
TF binding
Transcription factors
  • Other transcription factors interact with the
    core machinery and upstream DNA to provide
    specificity.
  • TFs bind to TF binding sites which are clustered
    in upstream enhancer and promoter elements.
  • The enhancer elements may be located many kb
    upstream of the core-promoter

Upstream elements
43
TF binding sites
  • TF binding sites are a weak signal (about 10 bp
    with 5 bp conserved)
  • If two genes are co-regulated, they are likely to
    share binding sites
  • Discovery of binding site motifs is an important
    research problem.

  g1  TCAGGAG
  g2  TGAGGAG
  g3  TCAGGTG
  g4  TGAGGTG
  g5  TCAGGTG
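
To make the "weak signal" concrete, the five example sites can be summarized by per-position counts and a consensus. This is only a sketch: pairing the labels g1..g5 with the sequences in the order shown is an assumption, and the variable and function names are illustrative.

```python
from collections import Counter

# Assumed pairing of gene labels and upstream sites from the figure.
sites = {
    "g1": "TCAGGAG",
    "g2": "TGAGGAG",
    "g3": "TCAGGTG",
    "g4": "TGAGGTG",
    "g5": "TCAGGTG",
}


def column_counts(seqs):
    """Return one Counter of bases per aligned position."""
    return [Counter(column) for column in zip(*seqs)]


counts = column_counts(list(sites.values()))
# Most frequent base at each position.
consensus = "".join(c.most_common(1)[0][0] for c in counts)
# Positions (1-based) at which all five sites agree.
conserved = [pos + 1 for pos, c in enumerate(counts) if len(c) == 1]

print(consensus)
print(conserved)
```

For these five sites the consensus is TCAGGTG and positions 1, 3, 4, 5 and 7 are fully conserved, while positions 2 and 6 vary between genes.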
44
http://www.gene-regulation.com/pub/databases.html
TRANSFAC
45
Discovering TF binding sites
  • Identification of these TF binding sites/switches
    is critical.
  • Requires identification of co-regulated genes
    (genes containing the same set of switches).
  • How do we find co-regulated genes?

46
Idea 1: Use orthologous genes from different
species
  1. The species are too close (e.g., humans and
    chimps). Binding and non-binding sites are both
    conserved.
  2. The species are distant. Binding sites are
    conserved but not other sequence.
  3. The species are very distant. Even binding sites
    are not conserved. The genes have alternative
    regulators.

ACGGCAGCTCGCCGCCGCGC
ACGGC-GGGCGCCGCCCCGC
ACGGCAGCTCGCCGCCGC-C
AGTGC-GGGCGCCGCCTCAT
ACGGC-GC-TCGCCGCCGCGC
AT-ACGAAGTAGCGG-ATGGT
47
Idea 2: Microarray
  • Expression level of all genes

48
Pathways
  • Proteins interact to transduce signal, catalyze
    reactions, etc.
  • The interactions can be captured in a database.
  • Queries on this database are about looking for
    interesting sub-graphs in a large graph.

49
Summary
  • Biological databases cannot be understood without
    understanding the data, and the tools for
    querying and accessing these data.
  • While database technology (XML, relational and OO
    databases, text formats) is used to store this
    data, its use is (often) transparent for
    bioinformatics people.
  • In this course, we looked at various
    data-streams, and pointed to databases that store
    these data-streams
  • Nucleic Acids Research brings out a database
    issue every January

2005 issue