Title: CSE280a: Algorithmic topics in bioinformatics
1CSE280a Algorithmic topics in bioinformatics
2The scope/syllabus
- We will cover topics from the following areas
- Population genetics
- Computational Mass Spectrometry
- Biological networks (emphasis on comparative analysis)
- ncRNA
- The focus will be on the use of algorithms for analyzing data in these areas
- Some background in algorithms (mathematical maturity) is helpful.
- Relevant biology will be discussed on a need-to-know basis
3Logistics
- This class has little required homework.
- All reading is optional, but recommended.
- 1 Final Exam (20%), and 1 research project (70%)
- Students help edit class notes in LaTeX (10%)
- At most one topic per student
- Groups of at most 2 per topic
- Project Goal
- Address research problems with minimum preparation
- Lectures will be given by instructors and by students.
- Most communication is electronic.
- Check http://www.cse.ucsd.edu/classes/wi08/cse280a
4From an individual to a population
- Individual genomes vary by about 1 in 1000 bp.
- These small variations account for significant phenotype differences.
- Disease susceptibility.
- Response to drugs
- How can we understand genetic variation in a population, and its consequences?
- It took a long time (10-15 yrs) to produce the draft sequence of the human genome.
- Soon (within 10-15 years), entire populations can have their DNA sequenced. Why do we care?
5Population Genetics
- Individuals in a species (population) are phenotypically different.
- Often these differences are inherited (genetic). Understanding the genetic basis of these differences is a key challenge of biology!
- The analysis of these differences involves many interesting algorithmic questions.
- We will use these questions to illustrate algorithmic principles, and use algorithms to interpret genetic data.
6Population genetics
- We are all similar, yet we are different. How substantial are the differences?
- What are the sources of variation?
- As mutations arise, they are either neutral and subject to evolutionary drift, or they are (dis-)advantageous and under selective pressure. Can we tell?
- If you had DNA from many sub-populations (Asian, European, African), can you separate them?
- How can we detect recombination?
- Why are some people more likely to get a disease than others? How is disease gene mapping done?
- Phasing of chromosomes
7Computational mass spectrometry
- Mass Spectrometry is a key technology for measuring active proteins, their interactions, and their post-translational modifications
- Computation plays a key role in interpreting mass spectrometry data.
8Back to the genome
- Recall that the genome contains protein-coding genes (1% of the genome)
- Another 5% might encode RNA, and regulatory sites
- How do we find regions of functional interest?
9Population genetics basics
10Scope of genetics lectures
- Basic terminology
- Key principles
- Sources of variation
- HW equilibrium
- Linkage
- Coalescent theory
- Recombination/Ancestral Recombination Graph
- Haplotypes/Haplotype phasing
- Population sub-structure
- Structural polymorphisms
- Medical genetics basis: Association mapping/pedigree analysis
11Terminology: allele
- Allele: A specific variant at a location
- The notion of alleles predates the concept of gene, and DNA.
- Initially, alleles referred to variants that described a measurable trait (round/wrinkled seed)
- Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
- As we discuss sources of variation, we will have different kinds of alleles.
12Terminology
- Locus: The location of the allele
- A nucleotide position.
- A genetic marker
- A gene
- A chromosomal segment
13Terminology
- Genotype: genetic makeup of (part of) an individual
- Phenotype: A measurable trait in an organism, often the consequence of a genetic variation
- Humans are diploid, they have 2 copies of each chromosome.
- They may have heterozygosity/homozygosity at a location
- Other organisms (plants) have higher forms of ploidy.
- Additionally, some sites might have 2 allelic forms, or even many allelic forms.
- Haplotype: genetic makeup of (part of) a single chromosome
14What causes variation in a population?
- Mutations (may lead to SNPs)
- Recombinations
- Other crossover events (gene conversion)
- Structural Polymorphisms
15Single Nucleotide Polymorphisms
- Small mutations that are sustained in a population are called SNPs
- SNPs are the most common source of variation studied
- The data is a matrix (rows are individuals, columns are loci). Only the variant positions are kept.
A -> G
16Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
17Short Tandem Repeats
Sequence                    Repeat count
GCTAGATCATCATCATCATTGCTAG   4
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
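The repeat counts on this slide can be recomputed from the six sequences. The sketch below assumes the repeat unit is ATC (the slide does not name the unit; TCA gives the same counts), and the function name `longest_tandem_run` is only illustrative.

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# The six sequences from the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# Assumed repeat unit: ATC.
repeat_counts = [longest_tandem_run(s, "ATC") for s in sequences]
print(repeat_counts)
```

Applying the same function at several STR loci gives one vector of repeat counts per individual.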
18STR can be used as a DNA fingerprint
- Consider a collection of regions with variable length repeats.
- Variable length repeats will lead to variable length DNA
- Vector of lengths is a fingerprint
4 2 3 3 5 1 3 2 3 1 5 3
individuals
loci
19Structural polymorphisms
- Large scale structural changes (deletions/insertions/inversions) may occur in a population.
- Copy Number variation
- Certain diseases (cancers) are marked by an abundance of these events
20Personalized genome sequencing
- These variants (of which 1,288,319 were novel) included
- 3,213,401 single nucleotide polymorphisms (SNPs),
- 53,823 block substitutions (2-206 bp),
- 292,102 heterozygous insertion/deletion events (indels) (1-571 bp),
- 559,473 homozygous indels (1-82,711 bp),
- 90 inversions, as well as numerous segmental duplications and copy number variation regions.
- Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure.
- Moreover, 44% of genes were heterozygous for one or more variants.
PLoS Biology, 2007
21Recombination
00000000 11111111 00011111
22Human DNA
- Not all DNA recombines.
- mtDNA is inherited from the mother, and
- y-chromosome from the father
http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png
23Gene Conversion
- Gene Conversion versus single crossover
- Hard to distinguish in a population
24Topic 1: Basic Principles
- In a stable population, the distribution of alleles obeys certain laws
- Not really, and the deviations are interesting
- HW Equilibrium
- (due to mixing in a population)
- Linkage (dis)-equilibrium
- Due to recombination
25Hardy Weinberg equilibrium
- Consider a locus with 2 alleles, A, a
- p (respectively, q) is the frequency of A (resp. a) in the population
- 3 Genotypes: AA, Aa, aa
- Q: What is the frequency of each genotype?
- If various assumptions are satisfied (such as random mating, no natural selection), then
- $P_{AA} = p^2$
- $P_{Aa} = 2pq$
- $P_{aa} = q^2$
26Hardy Weinberg why?
- Assumptions
- Diploid
- Sexual reproduction
- Random mating
- Bi-allelic sites
- Large population size
- Why? Each individual randomly picks his two chromosomes. Therefore, Prob(Aa) = pq + qp = 2pq, and so on.
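The "randomly picks two chromosomes" argument can be checked by enumerating both draws. This is a minimal sketch, assuming the two chromosomes are drawn independently from the allele frequencies; the function name and the example frequencies (0.7, 0.3) are illustrative and not from the lecture.

```python
from itertools import product


def genotype_frequencies(allele_freqs):
    """Genotype frequencies when two alleles are drawn independently.

    `allele_freqs` maps allele name -> frequency. Genotypes are unordered,
    so the ordered draws (A, a) and (a, A) are pooled into one genotype.
    """
    freqs = {}
    for (a1, f1), (a2, f2) in product(allele_freqs.items(), repeat=2):
        genotype = tuple(sorted((a1, a2)))
        freqs[genotype] = freqs.get(genotype, 0.0) + f1 * f2
    return freqs


# Illustrative allele frequencies: p = 0.7 for A, q = 0.3 for a.
hw = genotype_frequencies({"A": 0.7, "a": 0.3})
for genotype, freq in sorted(hw.items()):
    print(genotype, round(freq, 4))
```

The heterozygote receives two contributions (pq and qp), which is where the factor 2 comes from. The same enumeration works unchanged for more than two alleles.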
27Hardy Weinberg Generalizations
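- Multiple alleles $A_1, \ldots, A_k$ with frequencies $p_1, \ldots, p_k$
- By HW, $P_{A_i A_i} = p_i^2$ and $P_{A_i A_j} = 2 p_i p_j$ for $i \neq j$
- Multiple loci?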
28Hardy Weinberg Implications
- The allele frequency does not change from generation to generation. Why?
- It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?
- Males are 100 times more likely to have the red type of color blindness than females. Why?
- Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
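One way to work through the three questions on this slide is sketched below. It assumes HW proportions hold, that "carries the mutation" means heterozygous carriers (2pq), and that the color blindness allele is X-linked and recessive, so that males are affected with frequency q and females with frequency q^2. All function names are illustrative.

```python
import math


def next_generation_allele_freq(p):
    """Frequency of A among gametes of an HW population: P(AA) + P(Aa)/2."""
    q = 1.0 - p
    p_AA, p_Aa = p * p, 2 * p * q
    return p_AA + 0.5 * p_Aa  # p^2 + pq = p(p + q) = p


def carrier_fraction(affected_fraction):
    """Recessive disease: affected = q^2, heterozygous carriers = 2pq."""
    q = math.sqrt(affected_fraction)
    return 2 * (1 - q) * q


def male_to_female_ratio(q):
    """Assumed X-linked recessive trait: males q, females q^2."""
    return q / (q * q)


pku_carriers = carrier_fraction(1 / 10_000)  # q = 0.01, so 2 * 0.99 * 0.01
print(next_generation_allele_freq(0.3), pku_carriers, male_to_female_ratio(0.01))
```

Under these assumptions roughly 2% of the population carries a phenylketonuria mutation, and a 100-fold male excess corresponds to an allele frequency of about 0.01.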
29Recombination
00000000 11111111 00011111
30What if there were no recombinations?
- Life would be simpler
- Each individual sequence would have a single parent (even for higher ploidy)
- The genealogical relationship is expressed as a tree. This principle is used to track ancestry of an individual
31The Infinite Sites Assumption
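[Figure: genealogy tree with mutations on its edges]
0 0 0 0 0 0 0 0 (ancestral sequence)
  mutation at site 3 -> 0 0 1 0 0 0 0 0
    mutation at site 5 -> 0 0 1 0 1 0 0 0
    mutation at site 8 -> 0 0 1 0 0 0 0 1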
- The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
- Some phenotypes could be linked to the polymorphisms
- Some of the linkage is destroyed by recombination
32Infinite sites assumption and Perfect Phylogeny
- Each site is mutated at most once in the history.
- All descendants must carry the mutated value, and
all others must carry the ancestral value
[Figure: a tree edge labeled i; sequences below the edge have 1 in position i, all other sequences have 0 in position i]
33Perfect Phylogeny
- Assume an evolutionary model in which no recombination takes place, only mutation.
- The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
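Whether a binary SNP matrix can be explained by a perfect phylogeny can be tested column pair by column pair. The sketch below assumes that 0 is the ancestral state at every site. In that case a perfect phylogeny exists exactly when no pair of columns shows all three of the patterns (0,1), (1,0) and (1,1). The function name and the `conflict` example are illustrative.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Pairwise column test, assuming 0 is the ancestral state at every site.

    Rows are haplotypes, columns are sites. Two sites are compatible when the
    sets of rows carrying a 1 are disjoint or nested, i.e. the patterns
    (0,1), (1,0) and (1,1) do not all occur together.
    """
    n_cols = len(matrix[0])
    for i, j in combinations(range(n_cols), 2):
        seen = {(row[i], row[j]) for row in matrix}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


# Derived sequences from the infinite-sites example (mutations at sites 3, 5, 8).
tree_like = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1],
]
# Hypothetical data where two sites conflict, as a recombination would produce.
conflict = [[1, 0], [0, 1], [1, 1]]

print(admits_perfect_phylogeny(tree_like), admits_perfect_phylogeny(conflict))
```

A failed test means the data cannot be explained by mutation alone under the infinite sites assumption.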
34Handling recombination
- A tree is not sufficient as a sequence may have 2 parents
- Recombination leads to loss of correlation between columns
35Quiz 1
- Allele, locus, genotype, haplotype
- Hardy Weinberg equilibrium?
- Today: Linkage (dis)-equilibrium
36Quiz 2
- Recall that a SNP data-set is a binary matrix.
- Rows are individuals (chromosomes)
- Columns are alleles at a specific locus
- Suppose you have 2 SNP datasets of a contiguous genomic region.
- One from an African population, and one from a European population.
- Can you tell which is which?
- How long does the genomic region have to be?
37Recombination, and populations
- Think of a population of N individual chromosomes.
- The population remains stable from generation to generation.
- Without recombination, each individual has exactly one parent chromosome from the previous generation.
- With recombinations, each individual is derived from one or two parents.
- We will formalize this notion later in the context of coalescent theory.
38Linkage (Dis)-equilibrium (LD)
- Consider sites A and B
- Case 1: No recombination
- Each new individual chromosome chooses a parent from the existing haplotypes
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 0
39Linkage (Dis)-equilibrium (LD)
- Consider sites A and B
- Case 2: diploidy and recombination
- Each new individual chooses a parent from the existing alleles
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 1
40Linkage (Dis)-equilibrium (LD)
- Consider sites A and B
- Case 1: No recombination
- Each new individual chooses a parent from the existing haplotypes
- Pr[A,B = 0,1] = 0.25
- Linkage disequilibrium
- Case 2: Extensive recombination
- Each new individual simply chooses an allele from either site
- Pr[A,B = 0,1] = 0.125
- Linkage equilibrium
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
41LD
- In the absence of recombination,
- Correlation between columns
- The joint probability Pr[A=a, B=b] is different from P(a)P(b)
- With extensive recombination
- Pr(a,b) = P(a)P(b)
42Measures of LD
- Consider two bi-allelic sites with alleles marked with 0 and 1
- Define
- $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
- $P_{0*}$ = Pr[Allele 0 in locus 1]
- Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
- $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}|$
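The measure D can be computed directly from a list of two-site haplotypes. The sketch below uses the eight (A, B) haplotypes shown on the Linkage (Dis)-equilibrium slides; the function name `ld_d` is illustrative.

```python
from collections import Counter


def ld_d(haplotypes, a=0, b=0):
    """|P_ab - P_a* P_*b| for allele `a` at site 1 and allele `b` at site 2."""
    n = len(haplotypes)
    p_ab = Counter(haplotypes)[(a, b)] / n
    p_a = sum(1 for x, _ in haplotypes if x == a) / n  # marginal at site 1
    p_b = sum(1 for _, y in haplotypes if y == b) / n  # marginal at site 2
    return abs(p_ab - p_a * p_b)


# (A, B) haplotypes from the slides.
haps = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

print(ld_d(haps, 0, 0), ld_d(haps, 0, 1))
```

Both calls give the same value, which illustrates that for two bi-allelic sites D does not depend on which pair of alleles is used.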
43LD over time
- With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
- Let $D^{(t)}$ = LD at time t
- $P^{(t)}_{00} = (1-r)\, P^{(t-1)}_{00} + r\, P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
- $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (Why?)
- $D^{(t)} = (1-r)\, D^{(t-1)} = (1-r)^t D^{(0)}$
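The recursion can be iterated numerically and compared with the closed form. This sketch assumes the single-site allele frequencies stay constant from one generation to the next (which is what the "Why?" step relies on) and tracks the signed value of D. The function name and the starting values are illustrative.

```python
def ld_over_time(p00, p0_site1, p0_site2, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and return D(0), ..., D(T).

    The marginal frequencies p0_site1 and p0_site2 are held fixed, which is
    the assumption behind the step marked "(Why?)".
    """
    d_values = []
    for _ in range(generations + 1):
        d_values.append(p00 - p0_site1 * p0_site2)
        p00 = (1 - r) * p00 + r * p0_site1 * p0_site2
    return d_values


# Illustrative start: P00 = 0.25, P0* = 0.5, P*0 = 0.75, r = 0.1.
trajectory = ld_over_time(0.25, 0.5, 0.75, r=0.1, generations=50)
print(trajectory[0], trajectory[10], trajectory[50])
```

Each generation multiplies D by (1 - r), so the values shrink geometrically toward 0.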
44Other measures of LD
- $D'$ is obtained by dividing $D$ by the largest possible value
- $D_{\max} = \max\{P_{1*}P_{*1},\ P_{0*}P_{*1},\ P_{1*}P_{*0},\ P_{0*}P_{*0}\}$
- Ex: $D' = |P_{11} - P_{1*} P_{*1}| / D_{\max}$
- $\rho = D / (P_{1*} P_{0*} P_{*1} P_{*0})^{1/2}$
- Let N be the number of individuals
- Show that $\rho^2 N$ is the $\chi^2$ statistic between the two sites
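[Table: 2x2 contingency table of Site 1 (0/1) against Site 2 (0/1); the (0,0) cell holds the count $P_{00}N$ and the margin for allele 0 holds $P_{0*}N$]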
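The exercise on this slide can be checked numerically on a small 2x2 table of haplotype counts. The sketch below writes the correlation measure as rho and uses the standard Pearson chi-square statistic. The counts (2, 2, 4, 0) are those of the eight two-site haplotypes shown on the earlier LD slides; the function names are illustrative.

```python
def rho_squared(n00, n01, n10, n11):
    """D^2 / (P1* P0* P*1 P*0) from the counts of the four haplotypes."""
    n = n00 + n01 + n10 + n11
    p1_site1 = (n10 + n11) / n
    p1_site2 = (n01 + n11) / n
    d = n11 / n - p1_site1 * p1_site2
    return d * d / (p1_site1 * (1 - p1_site1) * p1_site2 * (1 - p1_site2))


def chi_square(n00, n01, n10, n11):
    """Pearson chi-square statistic of the 2x2 table."""
    n = n00 + n01 + n10 + n11
    observed = ((n00, n01), (n10, n11))
    rows = (n00 + n01, n10 + n11)
    cols = (n00 + n10, n01 + n11)
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


counts = (2, 2, 4, 0)  # n00, n01, n10, n11
n_individuals = sum(counts)
print(rho_squared(*counts) * n_individuals, chi_square(*counts))
```

Both printed numbers agree, which is the identity the slide asks for. The code only illustrates it on one table and is not a proof.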
45LD over distance
- Assumption
- Recombination rate increases linearly with distance
- LD decays exponentially with distance.
- The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
- This simple fact is the basis of disease association mapping.
46LD and disease mapping
- Consider a mutation that is causal for a disease.
- The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
- Consider every polymorphism, and check
- There might be too many polymorphisms
- Multiple mutations (even at a single locus) that lead to the same disease
- Instead, consider a dense sample of polymorphisms that span the genome
47LD can be used to map disease genes
LD
0 1 1 0 0 1
D N N D D N
- LD decays with distance from the disease allele.
- By plotting LD, one can shortlist the region containing the disease gene.
49
- 269 individuals
- 90 Yorubans
- 90 Europeans (CEPH)
- 44 Japanese
- 45 Chinese
- 1M SNPs
50LD and disease gene mapping problems
- Marker density?
- Complex diseases
- Population sub-structure
51Topic 2: Simulating population data
- We described various population genetic concepts (HW, LD), and their applicability
- The values of these parameters depend critically upon the population assumptions.
- What if we do not have infinite populations?
- No random mating (Ex: geographic isolation)
- Sudden growth
- Bottlenecks
- Ad-mixture
- It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
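One possible starting point for such a simulation is a forward, generation-by-generation model in the spirit of the "population of N chromosomes" picture used earlier. The sketch below is only one option, under these assumptions: constant population size, non-overlapping generations, no mutation, and at most one crossover per child. All names and parameter values are illustrative.

```python
import random


def simulate_generation(population, recomb_prob, rng):
    """Produce the next generation of haplotypes (tuples of 0/1).

    Each child copies one randomly chosen parent. With probability
    `recomb_prob` it instead copies a prefix from one parent and the
    remaining suffix from a second randomly chosen parent.
    """
    n_sites = len(population[0])
    children = []
    for _ in range(len(population)):
        first = rng.choice(population)
        if n_sites > 1 and rng.random() < recomb_prob:
            second = rng.choice(population)
            cut = rng.randrange(1, n_sites)  # crossover point
            children.append(first[:cut] + second[cut:])
        else:
            children.append(first)
    return children


def simulate(population, generations, recomb_prob=0.0, seed=0):
    """Run the model for a number of generations with a fixed random seed."""
    rng = random.Random(seed)
    for _ in range(generations):
        population = simulate_generation(population, recomb_prob, rng)
    return population


# Founders: the two parental haplotypes from the recombination slide.
founders = [(0,) * 8] * 5 + [(1,) * 8] * 5
no_recomb = simulate(founders, generations=20, recomb_prob=0.0, seed=1)
with_recomb = simulate(founders, generations=20, recomb_prob=0.5, seed=1)
print(no_recomb)
print(with_recomb)
```

Bottlenecks, sudden growth, or isolated sub-populations could be explored by changing how many children are drawn per generation or by restricting which parents a child may choose. Those extensions are not implemented here.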