CSE280a: Algorithmic topics in bioinformatics

1
CSE280a: Algorithmic topics in bioinformatics

Vineet Bafna

2
The scope/syllabus

We will cover topics from the following areas
Population genetics
Computational Mass Spectrometry
Biological networks (emphasis on comparative
analysis)
ncRNA
The focus will be on the use of algorithms for
analyzing data in these areas
Some background in algorithms (mathematical
maturity) is helpful.
Relevant biology will be discussed on a
need-to-know basis

3
Logistics

This class has little required homework.
All reading is optional, but recommended.
1 final exam (20%), and 1 research project (70%)
Students help edit class notes in LaTeX (10%)
At most one topic per student
Groups of <= 2 per topic
Project Goal
Address research problems with minimum
preparation
Lectures will be given by instructors and by
students.
Most communication is electronic.
Check http://www.cse.ucsd.edu/classes/wi08/cse280a

4
From an individual to a population

Individual genomes vary by about 1 in 1000bp.
These small variations account for significant
phenotype differences.
Disease susceptibility.
Response to drugs
How can we understand genetic variation in a
population, and its consequences?
It took a long time (10-15 yrs) to produce the
draft sequence of the human genome.
Soon (within 10-15 years), entire populations can
have their DNA sequenced. Why do we care?

5
Population Genetics

Individuals in a species (population) are
phenotypically different.
Often these differences are inherited (genetic).
Understanding the genetic basis of these
differences is a key challenge of biology!
The analysis of these differences involves many
interesting algorithmic questions.
We will use these questions to illustrate
algorithmic principles, and use algorithms to
interpret genetic data.

6
Population genetics

We are all similar, yet we are different. How
substantial are the differences?
What are the sources of variation?
As mutations arise, they are either neutral and
subject to evolutionary drift, or they are
(dis-)advantageous and under selective pressure.
Can we tell?
If you had DNA from many sub-populations, Asian,
European, African, can you separate them?
How can we detect recombination?
Why are some people more likely to get a disease
than others? How is disease gene mapping done?
Phasing of chromosomes

7
Computational mass spectrometry

Mass Spectrometry is a key technology for
measuring active proteins, their interactions,
and their post-translational modifications
Computation plays a key role in interpreting mass
spectrometry data.

8
Back to the genome

Recall that the genome contains protein-coding
genes (1% of the genome)
Another 5% might encode RNA, and regulatory sites
How do we find regions of functional interest?

9
Population genetics basics
10
Scope of genetics lectures

Basic terminology
Key principles
Sources of variation
HW equilibrium
Linkage
Coalescent theory
Recombination/Ancestral Recombination Graph
Haplotypes/Haplotype phasing
Population sub-structure
Structural polymorphisms
Medical genetics basis: association
mapping/pedigree analysis

11
Terminology: allele

Allele: a specific variant at a location
The notion of alleles predates the concept of
gene, and DNA.
Initially, alleles referred to variants that
described a measurable trait (round/wrinkled
seed)
Now, an allele might be a nucleotide on a
chromosome, with no measurable phenotype.
As we discuss source of variation, we will have
different kinds of alleles.

12
Terminology

Locus: the location of the allele
A nucleotide position.
A genetic marker
A gene
A chromosomal segment

13
Terminology

Genotype: genetic makeup of (part of) an
individual
Phenotype: a measurable trait in an organism,
often the consequence of a genetic variation
Humans are diploid: they have 2 copies of each
chromosome.
They may have heterozygosity/homozygosity at a
location
Other organisms (plants) have higher forms of
ploidy.
Additionally, some sites might have 2 allelic
forms, or even many allelic forms.
Haplotype: genetic makeup of (part of) a single
chromosome

14
What causes variation in a population?

Mutations (may lead to SNPs)
Recombinations
Other crossover events (gene conversion)
Structural Polymorphisms

15
Single Nucleotide Polymorphisms

Small mutations that are sustained in a
population are called SNPs
SNPs are the most common source of variation
studied
The data is a matrix (rows are individuals,
columns are loci). Only the variant positions are
kept.

A -> G
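As a minimal sketch of how such a matrix could be built, the snippet below keeps only the variant columns of a set of aligned sequences and encodes each allele as 0 or 1. The function name `snp_matrix`, the toy alignment, and the convention that 0 means "same as the first sequence" are assumptions made for illustration, not part of the lecture.

```python
def snp_matrix(sequences):
    """Return (positions, rows) for a list of aligned sequences.

    positions: indices of the columns where at least two sequences differ.
    rows: one 0/1 string per sequence, restricted to those columns.
    Assumes equal-length, gap-free sequences and bi-allelic sites;
    0 marks the allele carried by the first sequence, 1 any other allele.
    """
    reference = sequences[0]
    positions = [j for j in range(len(reference))
                 if any(seq[j] != reference[j] for seq in sequences)]
    rows = ["".join("0" if seq[j] == reference[j] else "1" for j in positions)
            for seq in sequences]
    return positions, rows


# Hypothetical toy alignment (not data from the slides).
toy = ["ACGTA", "ACGTA", "ATGTA", "ATGTG"]
positions, rows = snp_matrix(toy)
print(positions)  # indices of the variant columns
print(rows)       # one binary row per individual chromosome
```

In practice the 0/1 labels are often chosen as ancestral/derived or major/minor allele instead; that choice does not change which columns are kept.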
16
Single Nucleotide Polymorphisms
Infinite Sites Assumption: each site mutates at
most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
17
Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
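The repeat counts listed with the six sequences can be recovered mechanically. The sketch below assumes the repeat unit is `ATC` (the slide does not name the unit; `TCA` or `CAT` would give the same counts) and takes the longest uninterrupted run of that unit in each sequence. The helper name `longest_repeat_run` is made up for this example.

```python
import re


def longest_repeat_run(sequence, unit):
    """Number of copies in the longest uninterrupted tandem run of `unit`."""
    runs = re.findall("(?:%s)+" % re.escape(unit), sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# The six sequences shown on the slide.
slide_sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# Assumed repeat unit: ATC.
fingerprint = [longest_repeat_run(seq, "ATC") for seq in slide_sequences]
print(fingerprint)
```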
18
STR can be used as a DNA fingerprint

Consider a collection of regions with variable
length repeats.
Variable length repeats will lead to variable
length DNA
Vector of lengths is a finger-print

[Figure: matrix of repeat lengths; rows are individuals, columns are loci. Entries: 4 2 3 3 5 1 3 2 3 1 5 3]
19
Structural polymorphisms

Large scale structural changes
(deletions/insertions/inversions) may occur in a population.
Copy Number variation
Certain diseases (cancers) are marked by an
abundance of these events

20
Personalized genome sequencing

These variants (of which 1,288,319 were novel)
included
3,213,401 single nucleotide polymorphisms (SNPs),
53,823 block substitutions (2-206 bp),
292,102 heterozygous insertion/deletion events
(indels) (1-571 bp),
559,473 homozygous indels (1-82,711 bp),
90 inversions, as well as numerous segmental
duplications and copy number variation regions.
Non-SNP DNA variation accounts for 22% of all
events identified in the donor, however they
involve 74% of all variant bases. This suggests
an important role for non-SNP genetic alterations
in defining the diploid genome structure.
Moreover, 44% of genes were heterozygous for one
or more variants.

PLoS Biology, 2007
21
Recombination
00000000 11111111 00011111

Not all DNA recombines!

22
Human DNA

Not all DNA recombines.
mtDNA is inherited from the mother, and
y-chromosome from the father

http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png
23
Gene Conversion

Gene Conversion versus single crossover
Hard to distinguish in a population

24
Topic 1: Basic Principles

In a stable population, the distribution of
alleles obeys certain laws
Not really, and the deviations are interesting
HW Equilibrium
(due to mixing in a population)
Linkage (dis)-equilibrium
Due to recombination

25
Hardy Weinberg equilibrium

Consider a locus with 2 alleles, A, a
p (respectively, q) is the frequency of A (resp.
a) in the population
3 genotypes: AA, Aa, aa
Q: What is the frequency of each genotype?

If various assumptions are satisfied (such as
random mating, no natural selection), then
$P_{AA} = p^2$
$P_{Aa} = 2pq$
$P_{aa} = q^2$

26
Hardy Weinberg why?

Assumptions
Diploid
Sexual reproduction
Random mating
Bi-allelic sites
Large population size,
Why? Each individual randomly picks his two
chromosomes. Therefore, Prob.(Aa) = pq + qp = 2pq,
and so on.
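A small sketch of the random-pick argument: if each of the two chromosomes independently carries A with probability p, the three genotype frequencies follow directly. The function name `hardy_weinberg` and the value p = 0.3 are illustrative choices only.

```python
def hardy_weinberg(p):
    """Genotype frequencies at a bi-allelic locus under the HW assumptions.

    p is the frequency of allele A; q = 1 - p is the frequency of allele a.
    """
    q = 1.0 - p
    return {
        "AA": p * p,      # both chromosomes carry A
        "Aa": 2 * p * q,  # A then a, or a then A
        "aa": q * q,      # both chromosomes carry a
    }


freqs = hardy_weinberg(0.3)  # illustrative allele frequency
print(freqs)
print(sum(freqs.values()))   # the three frequencies sum to 1
```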

27
Hardy Weinberg Generalizations

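Multiple alleles with frequencies $p_1, p_2, \ldots, p_k$
By HW, $\Pr[\text{homozygous for allele } i] = p_i^2$ and $\Pr[\text{heterozygous for alleles } i, j] = 2 p_i p_j$
Multiple loci?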

28
Hardy Weinberg Implications

The allele frequency does not change from
generation to generation. Why?
It is observed that 1 in 10,000 Caucasians have
the disease phenylketonuria. The disease
mutation(s) are all recessive. What fraction of
the population carries the mutation?
Males are 100 times more likely to have the red
type of color blindness than females. Why?
Conclusion: while the HW assumptions are rarely
satisfied, the principle is still important as a
baseline assumption, and significant deviations
are interesting.
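One way to work through the three questions above, as a hedged sketch. It assumes the HW genotype frequencies p^2, 2pq, q^2, treats "carries the mutation" as "is a heterozygous carrier", and assumes the red type of color blindness is recessive and X-linked, so that a male (one X chromosome) is affected with probability q while a female (two X chromosomes) is affected with probability q^2. Variable names are illustrative.

```python
import math

# Phenylketonuria: affected individuals are aa, so q^2 = 1/10,000.
q = math.sqrt(1 / 10_000)
p = 1 - q
carrier_fraction = 2 * p * q          # heterozygous (Aa) carriers

# Allele frequency after one generation of random mating:
# an AA parent always transmits A, an Aa parent transmits A half the time.
p_next = p * p + 0.5 * (2 * p * q)    # equals p, so the frequency is unchanged

# X-linked recessive trait: males affected with probability q_x,
# females with probability q_x^2, so the ratio is 1 / q_x.
q_x = 1 / 100                         # chosen so that the ratio is 100
male_to_female_ratio = q_x / q_x ** 2

print(carrier_fraction)       # about 0.0198, roughly 1 person in 50
print(p_next - p)             # about 0
print(male_to_female_ratio)   # about 100
```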

29
Recombination
00000000 11111111 00011111
30
What if there were no recombinations?

Life would be simpler
Each individual sequence would have a single
parent (even for higher ploidy)
The genealogical relationship is expressed as a
tree. This principle is used to track ancestry of
an individual

31
The Infinite Sites Assumption
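[Figure: tree of haplotypes. The root is 0 0 0 0 0 0 0 0; a mutation at
site 3 gives 0 0 1 0 0 0 0 0; from there, a mutation at site 5 gives
0 0 1 0 1 0 0 0 and a mutation at site 8 gives 0 0 1 0 0 0 0 1.]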

The different sites are linked. A 1 in position
8 implies 0 in position 5, and vice versa.
Some phenotypes could be linked to the
polymorphisms
Some of the linkage is destroyed by
recombination

32
Infinite sites assumption and Perfect Phylogeny

Each site is mutated at most once in the history.
All descendants must carry the mutated value, and
all others must carry the ancestral value

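[Figure: tree with a mutation at site i on one edge. Sequences below
that edge have 1 in position i; all others have 0 in position i.]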
33
Perfect Phylogeny

Assume an evolutionary model in which no
recombination takes place, only mutation.
The evolutionary history is explained by a tree
in which every mutation is on an edge of the
tree. All the species in one sub-tree contain a
0, and all species in the other contain a 1. Such
a tree is called a perfect phylogeny.
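A minimal sketch of how one might test whether a binary SNP matrix fits a perfect phylogeny. It assumes 0 is the ancestral state at every site and uses the standard pairwise condition: for any two sites, the sets of rows carrying the 1 allele must be disjoint or nested. The function name and the second, failing example are illustrative and not taken from the lecture.

```python
from itertools import combinations


def admits_perfect_phylogeny(rows):
    """rows: equal-length strings over '0'/'1', with '0' the ancestral state.

    Returns True if every pair of sites has disjoint or nested carrier sets.
    """
    n_sites = len(rows[0])
    carriers = [{i for i, row in enumerate(rows) if row[j] == "1"}
                for j in range(n_sites)]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True


# Haplotypes from the "Infinite Sites Assumption" slide.
slide_rows = ["00000000", "00100000", "00101000", "00100001"]
print(admits_perfect_phylogeny(slide_rows))

# Hypothetical example with all four combinations at two sites,
# which cannot arise from single mutations without recombination.
conflicting_rows = ["00", "10", "01", "11"]
print(admits_perfect_phylogeny(conflicting_rows))
```

This pairwise check is quadratic in the number of sites; it only decides whether a tree exists and does not build the tree.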

34
Handling recombination

A tree is not sufficient as a sequence may have 2
parents
Recombination leads to loss of correlation
between columns

35
Quiz 1

Allele, locus, genotype, haplotype
Hardy Weinberg equilibrium?
Today: Linkage (dis)-equilibrium

36
Quiz 2

Recall that a SNP data-set is a binary matrix.
Rows are individuals (chromosomes)
Columns are alleles at a specific locus
Suppose you have 2 SNP datasets of a contiguous
genomic region.
One from an African population, and one from a
European Population.
Can you tell which is which?
How long does the genomic region have to be?

37
Recombination, and populations

Think of a population of N individual
chromosomes.
The population remains stable from generation to
generation.
Without recombination, each individual has
exactly one parent chromosome from the previous
generation.
With recombinations, each individual is derived
from one or two parents.
We will formalize this notion later in the
context of coalescent theory.

38
Linkage (Dis)-equilibrium (LD)

Consider sites A and B
Case 1: No recombination
Each new individual chromosome chooses a parent
from the existing haplotypes

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 0
39
Linkage (Dis)-equilibrium (LD)

Consider sites A and B
Case 2: diploidy and recombination
Each new individual chooses a parent from the
existing alleles

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 1
40
Linkage (Dis)-equilibrium (LD)

Consider sites A and B
Case 1: No recombination
Each new individual chooses a parent from the
existing haplotypes
$\Pr[A,B = 0,1] = 0.25$
Linkage disequilibrium
Case 2: Extensive recombination
Each new individual simply chooses an allele
from either site
$\Pr[A,B = 0,1] = 0.125$
Linkage equilibrium

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
41
LD

In the absence of recombination,
Correlation between columns
The joint probability $\Pr[A=a, B=b]$ is different
from $P(a)P(b)$
With extensive recombination
$\Pr(a,b) = P(a)P(b)$

42
Measures of LD

Consider two bi-allelic sites with alleles marked
with 0 and 1
Define
$P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
$P_{0*}$ = Pr[Allele 0 in locus 1] (similarly $P_{*0}$ for locus 2)
Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
$D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}|$
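As a quick check of the definition, the sketch below computes D both ways for the eight two-site haplotypes listed on the "Linkage (Dis)-equilibrium" slides. Reading that flattened table as eight (A, B) pairs is an assumption, and the function name `ld_d` is made up for this example.

```python
def ld_d(haplotypes):
    """Return D computed from P00 and from P01 for a list of (a, b) pairs."""
    n = len(haplotypes)
    p00 = sum(1 for a, b in haplotypes if (a, b) == (0, 0)) / n
    p01 = sum(1 for a, b in haplotypes if (a, b) == (0, 1)) / n
    p0_any = sum(1 for a, _ in haplotypes if a == 0) / n   # allele 0 at locus 1
    p_any0 = sum(1 for _, b in haplotypes if b == 0) / n   # allele 0 at locus 2
    p_any1 = 1 - p_any0                                    # allele 1 at locus 2
    return abs(p00 - p0_any * p_any0), abs(p01 - p0_any * p_any1)


# Assumed reading of the haplotype table on the LD slides.
slide_haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0),
                    (1, 0), (1, 0), (1, 0), (1, 0)]
d_from_00, d_from_01 = ld_d(slide_haplotypes)
print(d_from_00, d_from_01)  # the two expressions agree
```

Here Pr[A,B = 0,1] is 0.25 while the product of the single-site frequencies is 0.5 * 0.25 = 0.125, which matches the two cases contrasted on the earlier slide.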

43
LD over time

With random mating, and fixed recombination rate
r between the sites, Linkage Disequilibrium will
disappear
Let $D(t)$ = LD at time $t$
$P^{(t)}_{00} = (1-r) P^{(t-1)}_{00} + r P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
$D(t) = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (Why?)
$D(t) = (1-r) D(t-1) = (1-r)^t D(0)$
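The recurrence can be checked numerically. This sketch iterates the update for the two-site haplotype frequency, holding the single-site allele frequencies fixed (as the "Why?" step relies on), and compares the result with the closed form (1-r)^t D(0). The starting frequencies and r = 0.1 are illustrative values, and D is taken as the signed difference used in the derivation.

```python
def ld_trajectory(p00, p0_any, p_any0, r, generations):
    """Signed LD over time under random mating with recombination rate r.

    p00: frequency of haplotype (0, 0); p0_any, p_any0: frequencies of
    allele 0 at locus 1 and locus 2, which stay constant over time.
    """
    trajectory = [p00 - p0_any * p_any0]
    for _ in range(generations):
        # A chromosome is non-recombinant with prob. 1 - r; otherwise its
        # two alleles come from independent draws.
        p00 = (1 - r) * p00 + r * p0_any * p_any0
        trajectory.append(p00 - p0_any * p_any0)
    return trajectory


r = 0.1  # illustrative recombination rate
trajectory = ld_trajectory(0.25, 0.5, 0.75, r, 20)
closed_form = [(1 - r) ** t * trajectory[0] for t in range(21)]
print(trajectory[:3])
print(closed_form[:3])
```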

44
Other measures of LD

$D'$ is obtained by dividing $D$ by the largest
possible value
$D_{\max} = \min(P_{1*}P_{*0}, P_{0*}P_{*1})$ if $P_{11} - P_{1*}P_{*1} > 0$, else $\min(P_{1*}P_{*1}, P_{0*}P_{*0})$
Ex: $D' = |P_{11} - P_{1*} P_{*1}| / D_{\max}$
$\rho = D / (P_{1*} P_{0*} P_{*1} P_{*0})^{1/2}$
Let N be the number of individuals
Show that $\rho^2 N$ is the $\chi^2$ statistic between the two
sites

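[Table: 2x2 contingency table cross-tabulating alleles 0/1 at Site 1
against alleles 0/1 at Site 2; the (0,0) cell holds $P_{00}N$ and the
margin for allele 0 holds $P_{0*}N$.]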
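A numerical sketch of the exercise: it computes the normalised D, the correlation measure, and the Pearson chi-square statistic from the four haplotype counts, so the identity between the squared correlation times N and the chi-square statistic can be checked on an example. The normalisation uses the standard convention (the largest |D| that is compatible with the allele frequencies, which depends on the sign of D). The counts are the assumed reading of the haplotype table from the earlier LD slides, the function name is illustrative, and both sites are assumed to be polymorphic.

```python
import math


def ld_statistics(n00, n01, n10, n11):
    """Return (d_prime, rho, chi2, n) from counts n_ab of haplotype (a, b)."""
    n = n00 + n01 + n10 + n11
    p1_any = (n10 + n11) / n          # allele 1 at site 1
    p_any1 = (n01 + n11) / n          # allele 1 at site 2
    p0_any, p_any0 = 1 - p1_any, 1 - p_any1
    d = n11 / n - p1_any * p_any1     # signed LD

    # Largest |D| compatible with the allele frequencies (sign dependent).
    if d >= 0:
        d_max = min(p1_any * p_any0, p0_any * p_any1)
    else:
        d_max = min(p1_any * p_any1, p0_any * p_any0)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    rho = d / math.sqrt(p1_any * p0_any * p_any1 * p_any0)

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    observed = {(0, 0): n00, (0, 1): n01, (1, 0): n10, (1, 1): n11}
    chi2 = 0.0
    for (a, b), count in observed.items():
        expected = n * (p1_any if a else p0_any) * (p_any1 if b else p_any0)
        chi2 += (count - expected) ** 2 / expected
    return d_prime, rho, chi2, n


d_prime, rho, chi2, n = ld_statistics(2, 2, 4, 0)
print(d_prime)            # 1: only three of the four haplotypes occur
print(rho ** 2 * n, chi2) # the two values agree
```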
45
LD over distance

Assumption
Recombination rate increases linearly with
distance
LD decays exponentially with distance.
The assumption is reasonable, but recombination
rates vary from region to region, adding to
complexity
This simple fact is the basis of disease
association mapping.
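To see why a linear recombination rate gives exponential decay, substitute r = c * d into D(t) = (1-r)^t D(0): for small r this is close to exp(-c * d * t) D(0). The sketch below compares the two expressions. The per-base rate c, the number of generations, and the distances are hypothetical values chosen only to make the numbers readable.

```python
import math


def ld_fraction_remaining(distance_bp, generations, rate_per_bp):
    """Fraction D(t)/D(0) left after `generations`, with r = rate * distance."""
    r = rate_per_bp * distance_bp
    return (1 - r) ** generations


rate_per_bp = 1e-8        # hypothetical recombination rate per base pair
generations = 1000        # hypothetical number of generations
distances = [10_000, 100_000, 1_000_000]

remaining = [ld_fraction_remaining(d, generations, rate_per_bp)
             for d in distances]
approximation = [math.exp(-rate_per_bp * d * generations) for d in distances]

for d, exact, approx in zip(distances, remaining, approximation):
    print(d, round(exact, 4), round(approx, 4))
```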

46
LD and disease mapping

Consider a mutation that is causal for a disease.
The goal of disease gene mapping is to discover
which gene (locus) carries the mutation.
Consider every polymorphism, and check for
association with the disease?
There might be too many polymorphisms
Multiple mutations (even at a single locus) may
lead to the same disease
Instead, consider a dense sample of polymorphisms
that span the genome

47
LD can be used to map disease genes
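[Figure: LD between a marker and disease status. Marker alleles
0 1 1 0 0 1 for six individuals with status D N N D D N
(D = disease, N = normal).]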

LD decays with distance from the disease allele.
By plotting LD, one can short list the region
containing the disease gene.
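A toy sketch of the short-listing idea: score each marker by its squared correlation with disease status and keep the best-scoring one. The first marker and the status vector are the six values shown on the slide; the second marker, the marker names, and the function name are hypothetical and added only so that there is something to compare against.

```python
def r_squared(x, y):
    """Squared correlation between two equal-length 0/1 vectors.

    Assumes both vectors contain at least one 0 and one 1.
    """
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    d = pxy - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


# Disease status from the slide: D = 1, N = 0.
status = [1 if s == "D" else 0 for s in "DNNDDN"]

markers = {
    "marker_from_slide": [0, 1, 1, 0, 0, 1],
    "hypothetical_distant_marker": [0, 0, 1, 0, 1, 1],  # made-up values
}

scores = {name: r_squared(alleles, status) for name, alleles in markers.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

With real data one would plot the score against genomic position and look for a peak rather than take a single maximum, and the problems listed on the later slide (marker density, complex diseases, sub-structure) still apply.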

49

269 individuals
90 Yorubans
90 Europeans (CEPH)
44 Japanese
45 Chinese
1M SNPs

50
LD and disease gene mapping problems

Marker density?
Complex diseases
Population sub-structure

51
Topic 2: Simulating population data

We described various population genetic concepts
(HW, LD), and their applicability
The values of these parameters depend critically
upon the population assumptions.
What if we do not have infinite populations?
No random mating (Ex: geographic isolation)
Sudden growth
Bottlenecks
Ad-mixture
It would be nice to have a simulation of such a
population to test various ideas. How would you
do this simulation?
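One possible starting point, offered as a sketch rather than as the method the course goes on to develop: a forward simulation of a fixed-size population of two-site chromosomes. Each new chromosome picks a random parent; with probability r it takes its second site from another randomly chosen parent. The function name, the parameters, and the starting population (the assumed reading of the haplotype table from the LD slides) are illustrative.

```python
import random


def simulate(haplotypes, generations, r, rng):
    """Forward-simulate a constant-size population of (a, b) haplotypes.

    r is the per-generation probability of recombination between the sites;
    rng is a random.Random instance so that runs are reproducible.
    """
    population = list(haplotypes)
    for _generation in range(generations):
        new_population = []
        for _child in range(len(population)):
            parent = rng.choice(population)
            if rng.random() < r:
                # Recombinant: site A from one parent, site B from another.
                other = rng.choice(population)
                child = (parent[0], other[1])
            else:
                child = parent
            new_population.append(child)
        population = new_population
    return population


start = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
no_recombination = simulate(start, 20, 0.0, random.Random(0))
full_recombination = simulate(start, 20, 1.0, random.Random(0))
print(no_recombination)
print(full_recombination)
```

Bottlenecks, sudden growth, or ad-mixture could be explored by letting the number of children vary per generation or by drawing parents from more than one sub-population. Such extensions are not shown here.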