CSE280a: Algorithmic topics in bioinformatics - PowerPoint PPT Presentation

1
CSE280a Algorithmic topics in bioinformatics
  • Vineet Bafna

2
The scope/syllabus
  • We will cover topics from the following areas
  • Population genetics
  • Computational Mass Spectrometry
  • Biological networks (emphasis on comparative
    analysis)
  • ncRNA
  • The focus will be on the use of algorithms for
    analyzing data in these areas
  • Some background in algorithms (mathematical
    maturity) is helpful.
  • Relevant biology will be discussed on a need to
    know basis

3
Logistics
  • This class has little required homework.
  • All reading is optional, but recommended.
  • 1 Final Exam (20%), and 1 research project (70%)
  • Students help edit class notes in LaTeX (10%)
  • At most one topic per student
  • Groups of at most 2 per topic
  • Project Goal
  • Address research problems with minimum
    preparation
  • Lectures will be given by instructors and by
    students.
  • Most communication is electronic.
  • Check http://www.cse.ucsd.edu/classes/wi08/cse280a

4
From an individual to a population
  • Individual genomes vary by about 1 in 1000bp.
  • These small variations account for significant
    phenotype differences.
  • Disease susceptibility.
  • Response to drugs
  • How can we understand genetic variation in a
    population, and its consequences?
  • It took a long time (10-15 yrs) to produce the
    draft sequence of the human genome.
  • Soon (within 10-15 years), entire populations can
    have their DNA sequenced. Why do we care?

5
Population Genetics
  • Individuals in a species (population) are
    phenotypically different.
  • Often these differences are inherited (genetic).
    Understanding the genetic basis of these
    differences is a key challenge of biology!
  • The analysis of these differences involves many
    interesting algorithmic questions.
  • We will use these questions to illustrate
    algorithmic principles, and use algorithms to
    interpret genetic data.

6
Population genetics
  • We are all similar, yet we are different. How
    substantial are the differences?
  • What are the sources of variation?
  • As mutations arise, they are either neutral and
    subject to evolutionary drift, or they are
    (dis-)advantageous and under selective pressure.
    Can we tell?
  • If you had DNA from many sub-populations, Asian,
    European, African, can you separate them?
  • How can we detect recombination?
  • Why are some people more likely to get a disease
    than others? How is disease gene mapping done?
  • Phasing of chromosomes

7
Computational mass spectrometry
  • Mass Spectrometry is a key technology for
    measuring active proteins, their interactions,
    and their post-translational modifications
  • Computation plays a key role in interpreting mass
    spectrometry data.

8
Back to the genome
  • Recall that the genome contains protein-coding
    genes (1% of the genome)
  • Another 5% might encode RNA, and regulatory sites
  • How do we find regions of functional interest?

9
Population genetics basics
10
Scope of genetics lectures
  • Basic terminology
  • Key principles
  • Sources of variation
  • HW equilibrium
  • Linkage
  • Coalescent theory
  • Recombination/Ancestral Recombination Graph
  • Haplotypes/Haplotype phasing
  • Population sub-structure
  • Structural polymorphisms
  • Medical genetics basics: association
    mapping/pedigree analysis

11
Terminology: allele
  • Allele: A specific variant at a location
  • The notion of alleles predates the concept of
    gene, and DNA.
  • Initially, alleles referred to variants that
    described a measurable trait (round/wrinkled
    seed)
  • Now, an allele might be a nucleotide on a
    chromosome, with no measurable phenotype.
  • As we discuss sources of variation, we will have
    different kinds of alleles.

12
Terminology
  • Locus: The location of the allele
  • A nucleotide position.
  • A genetic marker
  • A gene
  • A chromosomal segment

13
Terminology
  • Genotype: genetic makeup of (part of) an
    individual
  • Phenotype: A measurable trait in an organism,
    often the consequence of a genetic variation
  • Humans are diploid: they have 2 copies of each
    chromosome.
  • They may be heterozygous or homozygous at a
    location
  • Other organisms (plants) have higher forms of
    ploidy.
  • Additionally, some sites might have 2 allelic
    forms, or even many allelic forms.
  • Haplotype: genetic makeup of (part of) a single
    chromosome

14
What causes variation in a population?
  • Mutations (may lead to SNPs)
  • Recombinations
  • Other crossover events (gene conversion)
  • Structural Polymorphisms

15
Single Nucleotide Polymorphisms
  • Small mutations that are sustained in a
    population are called SNPs
  • SNPs are the most common source of variation
    studied
  • The data is a matrix (rows are individuals,
    columns are loci). Only the variant positions are
    kept.

A -> G
16
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at
most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
17
Short Tandem Repeats
Sequence                     Repeat count
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
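
As a rough illustration of how the repeat counts on this slide could be obtained, the Python sketch below finds the longest uninterrupted run of a repeat unit in each sequence. The unit `ATC` and the helper name `max_tandem_repeats` are assumptions made for this example; the slide does not state which unit was counted.

```python
import re

def max_tandem_repeats(seq: str, unit: str) -> int:
    """Number of copies in the longest uninterrupted run of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

# The six sequences shown on the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# Assumed repeat unit: "ATC".
counts = [max_tandem_repeats(s, "ATC") for s in sequences]
print(counts)  # [4, 3, 5, 3, 3, 5]
```

Repeating this at several loci gives one vector of counts per individual, which is the fingerprint idea discussed next.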
18
STR can be used as a DNA fingerprint
  • Consider a collection of regions with variable
    length repeats.
  • Variable length repeats will lead to variable
    length DNA
  • Vector of lengths is a finger-print

[Figure: matrix of repeat counts with axes labeled "individuals" and "loci"; entries as extracted: 4 2 3 3 5 1 3 2 3 1 5 3]
19
Structural polymorphisms
  • Large scale structural changes
    (deletions/insertions/inversions) may occur in a
    population.
  • Copy Number variation
  • Certain diseases (cancers) are marked by an
    abundance of these events

20
Personalized genome sequencing
  • These variants (of which 1,288,319 were novel)
    included
  • 3,213,401 single nucleotide polymorphisms (SNPs),
  • 53,823 block substitutions (2-206 bp),
  • 292,102 heterozygous insertion/deletion events
    (indels) (1-571 bp),
  • 559,473 homozygous indels (1-82,711 bp),
  • 90 inversions, as well as numerous segmental
    duplications and copy number variation regions.
  • Non-SNP DNA variation accounts for 22% of all
    events identified in the donor, however they
    involve 74% of all variant bases. This suggests
    an important role for non-SNP genetic alterations
    in defining the diploid genome structure.
  • Moreover, 44% of genes were heterozygous for one
    or more variants.

PLoS Biology, 2007
21
Recombination
00000000 11111111 00011111
  • Not all DNA recombines!
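
The three bit strings on this slide can be read as two parental chromosomes and a recombinant child. A minimal sketch of a single crossover follows, assuming the child copies the first parent up to a breakpoint and the second parent after it; the function name `crossover` and the parameter `cut` are illustrative only.

```python
def crossover(parent_a: str, parent_b: str, cut: int) -> str:
    """Single crossover: sites before `cut` come from parent_a, the rest from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of sites")
    if not 0 <= cut <= len(parent_a):
        raise ValueError("cut must lie within the chromosome")
    return parent_a[:cut] + parent_b[cut:]

# The slide's example: a breakpoint after the third site.
child = crossover("00000000", "11111111", 3)
print(child)  # 00011111
```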

22
Human DNA
  • Not all DNA recombines.
  • mtDNA is inherited from the mother, and
  • y-chromosome from the father

http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png
23
Gene Conversion
  • Gene Conversion versus single crossover
  • Hard to distinguish in a population

24
Topic 1: Basic Principles
  • In a stable population, the distribution of
    alleles obeys certain laws
  • Not really, and the deviations are interesting
  • HW Equilibrium
  • (due to mixing in a population)
  • Linkage (dis)-equilibrium
  • Due to recombination

25
Hardy Weinberg equilibrium
  • Consider a locus with 2 alleles, A, a
  • p (respectively, q) is the frequency of A (resp.
    a) in the population
  • 3 genotypes: AA, Aa, aa
  • Q: What is the frequency of each genotype?
  • If various assumptions are satisfied (such as
    random mating, no natural selection), then
  • $P_{AA} = p^2$
  • $P_{Aa} = 2pq$
  • $P_{aa} = q^2$
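
A small sketch of the Hardy Weinberg genotype frequencies for a bi-allelic locus, given the frequency p of allele A (so q = 1 - p), with genotype frequencies p^2, 2pq and q^2. The function name `hardy_weinberg` is an assumption for illustration.

```python
def hardy_weinberg(p: float) -> dict:
    """Expected genotype frequencies at a bi-allelic locus under HW equilibrium."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

freqs = hardy_weinberg(0.7)
print(freqs)
# The three genotype frequencies always sum to (p + q)^2 = 1.
print(sum(freqs.values()))
```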

26
Hardy Weinberg why?
  • Assumptions
  • Diploid
  • Sexual reproduction
  • Random mating
  • Bi-allelic sites
  • Large population size
  • Why? Each individual randomly picks his two
    chromosomes. Therefore, Prob(Aa) = pq + qp = 2pq,
    and so on.

27
Hardy Weinberg Generalizations
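  • Multiple alleles $A_1, \ldots, A_k$ with frequencies $p_1, \ldots, p_k$
  • By HW, $P_{A_i A_i} = p_i^2$ and $P_{A_i A_j} = 2 p_i p_j$ for $i \neq j$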
  • Multiple loci?

28
Hardy Weinberg Implications
  • The allele frequency does not change from
    generation to generation. Why?
  • It is observed that 1 in 10,000 caucasians have
    the disease phenylketonuria. The disease
    mutation(s) are all recessive. What fraction of
    the population carries the mutation?
  • Males are 100 times more likely to have the red
    type of color blindness than females. Why?
  • Conclusion: While the HW assumptions are rarely
    satisfied, the principle is still important as a
    baseline assumption, and significant deviations
    are interesting.
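
One way to work through the three questions on this slide is sketched below. It assumes Hardy Weinberg proportions throughout, pools all phenylketonuria mutations into a single recessive allele, and treats red color blindness as an X-linked recessive trait (so males, with one X, are affected with probability q and females with probability q^2). The function names are illustrative.

```python
import math

def next_generation_allele_freq(p: float) -> float:
    """Frequency of A among gametes produced by a population in HW proportions."""
    q = 1.0 - p
    p_AA, p_Aa = p * p, 2 * p * q
    # AA parents always transmit A; Aa parents transmit A half the time.
    return p_AA + p_Aa / 2

def carrier_fraction(prevalence: float) -> float:
    """Fraction of heterozygous carriers of a recessive disease with the given prevalence."""
    q = math.sqrt(prevalence)       # prevalence = q^2
    return 2 * (1.0 - q) * q        # carriers = 2pq

def x_linked_male_to_female_ratio(q: float) -> float:
    """Ratio of affected males (q) to affected females (q^2) for an X-linked recessive trait."""
    return q / (q * q)

print(next_generation_allele_freq(0.3))     # stays at about 0.3
print(carrier_fraction(1 / 10_000))         # about 0.0198, i.e. roughly 1 in 50
print(x_linked_male_to_female_ratio(0.01))  # about 100
```

Under these assumptions, a disease prevalence of 1 in 10,000 corresponds to q = 0.01 and roughly 2% carriers, and a 100-fold male excess corresponds to an allele frequency of about 0.01.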

29
Recombination
00000000 11111111 00011111
30
What if there were no recombinations?
  • Life would be simpler
  • Each individual sequence would have a single
    parent (even for higher ploidy)
  • The genealogical relationship is expressed as a
    tree. This principle is used to track ancestry of
    an individual

31
The Infinite Sites Assumption
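[Figure: tree rooted at 0 0 0 0 0 0 0 0. A mutation at site 3 gives 0 0 1 0 0 0 0 0; from there, a mutation at site 5 gives 0 0 1 0 1 0 0 0 and a separate mutation at site 8 gives 0 0 1 0 0 0 0 1.]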
  • The different sites are linked. A 1 in position
    8 implies 0 in position 5, and vice versa.
  • Some phenotypes could be linked to the
    polymorphisms
  • Some of the linkage is destroyed by
    recombination

32
Infinite sites assumption and Perfect Phylogeny
  • Each site is mutated at most once in the history.
  • All descendants must carry the mutated value, and
    all others must carry the ancestral value

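[Figure: a tree edge labeled with mutation i; the subtree below the edge carries 1 in position i, the rest of the tree carries 0 in position i.]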
33
Perfect Phylogeny
  • Assume an evolutionary model in which no
    recombination takes place, only mutation.
  • The evolutionary history is explained by a tree
    in which every mutation is on an edge of the
    tree. All the species in one sub-tree contain a
    0, and all species in the other contain a 1. Such
    a tree is called a perfect phylogeny.
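
Whether a binary SNP matrix is compatible with such a tree can be checked with the standard four-gamete condition: if some pair of columns shows all four combinations 00, 01, 10 and 11, no perfect phylogeny exists. The sketch below applies that test; the slides do not name it, so treat the choice of test and the function name `has_perfect_phylogeny` as assumptions. It checks the unrooted version (no ancestral state is assumed).

```python
from itertools import combinations

def has_perfect_phylogeny(matrix: list) -> bool:
    """Four-gamete test on a list of equal-length 0/1 strings (rows = haplotypes)."""
    n_sites = len(matrix[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:   # 00, 01, 10 and 11 all present
            return False
    return True

# Haplotypes built by mutations at sites 3, 5 and 8 with no recombination.
tree_like = ["00000000", "00100000", "00101000", "00100001"]
print(has_perfect_phylogeny(tree_like))                 # True

# A hypothetical recombinant carrying both the site-5 and site-8 mutations
# makes sites 5 and 8 show all four gametes.
print(has_perfect_phylogeny(tree_like + ["00101001"]))  # False
```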

34
Handling recombination
  • A tree is not sufficient as a sequence may have 2
    parents
  • Recombination leads to loss of correlation
    between columns

35
Quiz 1
  • Allele, locus, genotype, haplotype
  • Hardy Weinberg equilibrium?
  • Today: Linkage (dis)-equilibrium

36
Quiz 2
  • Recall that a SNP data-set is a binary matrix.
  • Rows are individual (chromosomes)
  • Columns are alleles at a specific locus
  • Suppose you have 2 SNP datasets of a contiguous
    genomic region.
  • One from an African population, and one from a
    European Population.
  • Can you tell which is which?
  • How long does the genomic region have to be?

37
Recombination, and populations
  • Think of a population of N individual
    chromosomes.
  • The population remains stable from generation to
    generation.
  • Without recombination, each individual has
    exactly one parent chromosome from the previous
    generation.
  • With recombinations, each individual is derived
    from one or two parents.
  • We will formalize this notion later in the
    context of coalescent theory.

38
Linkage (Dis)-equilibrium (LD)
  • Consider sites A and B
  • Case 1: No recombination
  • Each new individual chromosome chooses a parent
    from the existing haplotypes

Existing haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0
New chromosome: 1 0
39
Linkage (Dis)-equilibrium (LD)
  • Consider sites A and B
  • Case 2: diploidy and recombination
  • Each new individual chooses a parent from the
    existing alleles

Existing haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0
New chromosome: 1 1
40
Linkage (Dis)-equilibrium (LD)
  • Consider sites A and B
  • Case 1: No recombination
  • Each new individual chooses a parent from the
    existing haplotypes
  • $\Pr[A,B = 0,1] = 0.25$
  • Linkage disequilibrium
  • Case 2: Extensive recombination
  • Each new individual simply chooses an allele
    from either site
  • $\Pr[A,B = 0,1] = 0.125$
  • Linkage equilibrium

Existing haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0
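
The two probabilities on this slide can be recomputed from the eight haplotypes listed. In the sketch below, the "no recombination" case samples a whole existing haplotype, and the "extensive recombination" case samples the allele at each site independently; the function name `joint_and_product` is an illustrative choice.

```python
# The eight (A, B) haplotypes from the slide.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

def joint_and_product(haps, a, b):
    """Return (Pr[A=a, B=b], Pr[A=a] * Pr[B=b]) estimated from the haplotypes."""
    n = len(haps)
    joint = sum(1 for h in haps if h == (a, b)) / n
    p_a = sum(1 for h in haps if h[0] == a) / n
    p_b = sum(1 for h in haps if h[1] == b) / n
    return joint, p_a * p_b

no_recomb, extensive_recomb = joint_and_product(haplotypes, 0, 1)
print(no_recomb)         # 0.25  (2 of 8 haplotypes are 0 1)
print(extensive_recomb)  # 0.125 (0.5 * 0.25)
```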
41
LD
  • In the absence of recombination,
  • Correlation between columns
  • The joint probability $\Pr[A=a, B=b]$ is different
    from $P(a)P(b)$
  • With extensive recombination
  • $\Pr(a,b) = P(a)P(b)$

42
Measures of LD
  • Consider two bi-allelic sites with alleles marked
    with 0 and 1
  • Define
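  • $P_{00} = \Pr[\text{allele 0 in locus 1 and allele 0 in locus 2}]$
  • $P_{0*} = \Pr[\text{allele 0 in locus 1}]$, and similarly $P_{*0} = \Pr[\text{allele 0 in locus 2}]$
  • Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
  • $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}|$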
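
A short sketch of this measure, computing D both from the (0, 0) haplotype and from the (0, 1) haplotype to show that the two expressions agree. The input is a list of (locus 1, locus 2) allele pairs; the example data and the function name `ld_d` are assumptions for illustration.

```python
def ld_d(haps):
    """Return D computed two ways: |P00 - P0* P*0| and |P01 - P0* P*1|."""
    n = len(haps)
    p00 = sum(1 for a, b in haps if (a, b) == (0, 0)) / n
    p01 = sum(1 for a, b in haps if (a, b) == (0, 1)) / n
    p0_locus1 = sum(1 for a, _ in haps if a == 0) / n   # P0*
    p0_locus2 = sum(1 for _, b in haps if b == 0) / n   # P*0
    p1_locus2 = 1.0 - p0_locus2                         # P*1
    return abs(p00 - p0_locus1 * p0_locus2), abs(p01 - p0_locus1 * p1_locus2)

# Example haplotypes (locus 1, locus 2).
example = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
print(ld_d(example))  # (0.125, 0.125)
```

The two values coincide because P00 + P01 = P0* and P*0 + P*1 = 1, so the two deviations differ only in sign.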

43
LD over time
  • With random mating, and fixed recombination rate
    r between the sites, Linkage Disequilibrium will
    disappear
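  • Let $D(t)$ = LD at time $t$
  • $P_{00}^{(t)} = (1-r) P_{00}^{(t-1)} + r P_{0*}^{(t-1)} P_{*0}^{(t-1)}$, where $P_{0*}$ and $P_{*0}$ are the frequencies of allele 0 at locus 1 and locus 2
  • $D(t) = P_{00}^{(t)} - P_{0*}^{(t)} P_{*0}^{(t)} = P_{00}^{(t)} - P_{0*}^{(t-1)} P_{*0}^{(t-1)}$ (Why?)
  • $D(t) = (1-r) D(t-1) = (1-r)^t D(0)$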
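
The recurrence can be checked numerically. The sketch below iterates the haplotype frequency P00 under random mating with recombination rate r, holding the allele frequencies fixed (as Hardy Weinberg implies), and records the signed D at each generation. The starting values and the function name `simulate_ld_decay` are assumptions chosen for illustration.

```python
def simulate_ld_decay(p00, p0_locus1, p0_locus2, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and return the list of D(t)."""
    product = p0_locus1 * p0_locus2        # unchanged across generations
    d_values = [p00 - product]
    for _ in range(generations):
        p00 = (1 - r) * p00 + r * product
        d_values.append(p00 - product)
    return d_values

r = 0.1
d_values = simulate_ld_decay(p00=0.25, p0_locus1=0.5, p0_locus2=0.75, r=r, generations=20)
for t in (0, 1, 5, 20):
    # Simulated D(t) next to the closed form (1-r)^t D(0).
    print(t, d_values[t], (1 - r) ** t * d_values[0])
```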

44
Other measures of LD
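  • $D'$ is obtained by dividing $D$ by the largest
    possible value
  • With $D = P_{11} - P_{1*} P_{*1}$ (allele frequencies $P_{a*}$ at site 1 and $P_{*b}$ at site 2): $D_{\max} = \min\{P_{1*} P_{*0}, P_{0*} P_{*1}\}$ if $D > 0$, and $\min\{P_{1*} P_{*1}, P_{0*} P_{*0}\}$ otherwise
  • Ex: $D' = |P_{11} - P_{1*} P_{*1}| / D_{\max}$
  • $\rho = D / (P_{1*} P_{0*} P_{*1} P_{*0})^{1/2}$
  • Let $N$ be the number of individuals
  • Show that $\rho^2 N$ is the $\chi^2$ statistic between the two
    sites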

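[Table: 2x2 contingency table of allele counts for Site 1 against Site 2 (alleles 0 and 1); the cell for alleles (0, 0) holds P00 N and the margin for allele 0 holds P0 N.]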
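
The exercise on this slide can be checked numerically. The sketch below computes the squared correlation between two bi-allelic sites from a 2x2 table of haplotype counts, then the ordinary Pearson chi-square statistic for the same table, and compares N times the former with the latter. The counts and function names are assumptions chosen for illustration.

```python
def rho_squared(n00, n01, n10, n11):
    """Squared correlation D^2 / (P1* P0* P*1 P*0) from haplotype counts."""
    n = n00 + n01 + n10 + n11
    p11 = n11 / n
    p1_site1 = (n10 + n11) / n
    p1_site2 = (n01 + n11) / n
    d = p11 - p1_site1 * p1_site2
    return d * d / (p1_site1 * (1 - p1_site1) * p1_site2 * (1 - p1_site2))

def chi_square(n00, n01, n10, n11):
    """Pearson chi-square statistic for the 2x2 table of observed counts."""
    observed = [[n00, n01], [n10, n11]]
    n = n00 + n01 + n10 + n11
    row_totals = [n00 + n01, n10 + n11]
    col_totals = [n00 + n10, n01 + n11]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

counts = (2, 2, 4, 0)   # n00, n01, n10, n11 for N = 8 chromosomes
n_total = sum(counts)
print(rho_squared(*counts) * n_total)  # about 2.667
print(chi_square(*counts))             # about 2.667
```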
45
LD over distance
  • Assumption
  • Recombination rate increases linearly with
    distance
  • LD decays exponentially with distance.
  • The assumption is reasonable, but recombination
    rates vary from region to region, adding to
    complexity
  • This simple fact is the basis of disease
    association mapping.

46
LD and disease mapping
  • Consider a mutation that is causal for a disease.
  • The goal of disease gene mapping is to discover
    which gene (locus) carries the mutation.
  • Consider every polymorphism, and check
  • There might be too many polymorphisms
  • Multiple mutations (even at a single locus) that
    lead to the same disease
  • Instead, consider a dense sample of polymorphisms
    that span the genome

47
LD can be used to map disease genes
[Figure: LD plotted along the genome. Marker alleles 0 1 1 0 0 1 are shown aligned with disease status D N N D D N.]
  • LD decays with distance from the disease allele.
  • By plotting LD, one can short list the region
    containing the disease gene.
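
A toy version of this scan is sketched below: for each marker, compute the squared correlation between the marker allele and disease status, then pick the marker with the strongest association. The first marker and the status vector follow the slide's example (0 1 1 0 0 1 against D N N D D N, with D coded as 1); the other two markers and all names are made up for illustration.

```python
def r_squared(x, y):
    """Squared correlation between two 0/1 vectors of equal length."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    d = pxy - px * py
    denom = px * (1 - px) * py * (1 - py)
    return d * d / denom if denom > 0 else 0.0

# Disease status per chromosome: D = 1, N = 0 (slide example: D N N D D N).
status = [1, 0, 0, 1, 1, 0]

markers = {
    "snp_near": [0, 1, 1, 0, 0, 1],   # alleles from the slide
    "snp_mid":  [0, 1, 0, 0, 1, 1],   # hypothetical marker
    "snp_far":  [1, 0, 1, 0, 1, 0],   # hypothetical marker
}

scores = {name: r_squared(alleles, status) for name, alleles in markers.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)  # snp_near
```

In a real study the markers would be ordered by genomic position, and the peak of the association curve would point to the candidate region.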

48
(No Transcript)
49
  • 269 individuals
  • 90 Yorubans
  • 90 Europeans (CEPH)
  • 44 Japanese
  • 45 Chinese
  • 1M SNPs

50
LD and disease gene mapping problems
  • Marker density?
  • Complex diseases
  • Population sub-structure

51
Topic 2: Simulating population data
  • We described various population genetic concepts
    (HW, LD), and their applicability
  • The values of these parameters depend critically
    upon the population assumptions.
  • What if we do not have infinite populations?
  • No random mating (Ex: geographic isolation)
  • Sudden growth
  • Bottlenecks
  • Ad-mixture
  • It would be nice to have a simulation of such a
    population to test various ideas. How would you
    do this simulation?
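
One possible starting point is a forward simulation in the style of the Wright-Fisher model: a fixed number of chromosomes per generation, each copying its allele from a uniformly chosen parent in the previous generation. The sketch below tracks one bi-allelic site without recombination, mutation or selection; the model choice, parameter values and the function name `wright_fisher` are assumptions, not something specified in the slides.

```python
import random

def wright_fisher(n_chromosomes, p0, generations, seed=0):
    """Simulate allele-frequency drift at one site; returns frequencies per generation."""
    rng = random.Random(seed)
    count = round(p0 * n_chromosomes)          # copies of the allele in generation 0
    freqs = [count / n_chromosomes]
    for _ in range(generations):
        p = count / n_chromosomes
        # Each child picks a random parent, so it carries the allele with probability p.
        count = sum(1 for _child in range(n_chromosomes) if rng.random() < p)
        freqs.append(count / n_chromosomes)
    return freqs

trajectory = wright_fisher(n_chromosomes=100, p0=0.5, generations=50, seed=1)
print(trajectory[:5])
print(trajectory[-1])
```

Bottlenecks or sudden growth could be explored by letting the number of chromosomes change between generations, and sub-structure by running several such populations with occasional migration.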