Title: Sequence Variation Informatics
1Sequence Variation Informatics
BI420 Introduction to Bioinformatics
Gabor T. Marth
Department of Biology, Boston College marth_at_bc.edu
2Sequence variations
- The Human Genome Project produced a reference genome
sequence that is 99.9% common to each human being
3Why do we care about variations?
phenotypic differences
4Where do variations come from?
- sequence variations are the result of mutation
events
TAAAAAT
5SNP discovery
- comparative analysis of multiple sequences from
the same region of the genome (redundant sequence
coverage)
6Steps of SNP discovery
7Computational SNP mining -- PolyBayes
8Computational SNP mining -- PolyBayes
- sequence clustering: simplifies to a database search
with the genome reference
- multiple alignment: by anchoring fragments to the
genome reference
- paralog filtering: by counting mismatches weighted
by quality values
- SNP detection: by differentiating true
polymorphism from sequencing error using quality
values
9SNP discovery with PolyBayes
genome reference sequence
10Sequence clustering
- Clustering simplifies to search against sequence
database to recruit relevant sequences
- Clusters: groups of overlapping sequence
fragments matching the genome reference
[Figure: fragments matching the genome reference, grouped into cluster 1, cluster 2, and cluster 3]
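The clustering idea can be sketched in a few lines. This is a minimal illustration, not the PolyBayes implementation: it assumes each fragment has already been placed on the genome reference by a database search, and that the hit is available as a hypothetical `(name, start, end)` tuple with half-open coordinates. Fragments whose intervals overlap end up in the same cluster.

```python
from typing import List, Tuple

# Hypothetical hit record: (fragment name, start, end) on the reference,
# with half-open coordinates [start, end).
Hit = Tuple[str, int, int]

def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    # Sweep left to right along the reference.
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # Gap on the reference: close the cluster and start a new one.
            if current:
                clusters.append(current)
            current = [name]
            current_end = end
    if current:
        clusters.append(current)
    return clusters

hits = [("a", 0, 100), ("b", 50, 150), ("c", 300, 400),
        ("d", 390, 450), ("e", 1000, 1100)]
print(cluster_hits(hits))  # [['a', 'b'], ['c', 'd'], ['e']]
```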
11(Anchored) multiple alignment
- The genomic reference sequence serves as an
anchor
- fragments are pair-wise aligned to the genomic sequence
- insertions are propagated (sequence padding)
- Advantages
- efficient -- only involves pair-wise comparisons
- accurate -- correctly aligns alternatively
spliced ESTs
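A rough sketch of how pair-wise alignments might be merged into one anchored multiple alignment. The input format is an assumption made for illustration: each fragment comes as a pair of gapped strings `(ref_aln, frag_aln)` spanning the whole reference, with `-` in `ref_aln` marking an insertion in the fragment. Insertions are propagated by padding every other row at that position.

```python
def split_columns(ref_aln, frag_aln):
    """Split one pair-wise alignment into per-reference-base columns.

    Returns (bases, ins): bases[k] is the fragment character aligned to
    reference base k; ins[k] holds fragment bases inserted before base k.
    """
    bases, ins = [], [""]
    for r, f in zip(ref_aln, frag_aln):
        if r == "-":
            ins[-1] += f      # insertion relative to the reference
        else:
            bases.append(f)
            ins.append("")    # open the next insertion slot
    return bases, ins

def merge_anchored(reference, pairs):
    """Merge pair-wise alignments that share the same reference anchor."""
    parsed = [split_columns(r, f) for r, f in pairs]
    n = len(reference)
    # Widest insertion seen at each slot decides how much padding is needed.
    width = [max((len(ins[k]) for _, ins in parsed), default=0)
             for k in range(n + 1)]

    def build(bases, ins):
        out = []
        for k in range(n + 1):
            out.append(ins[k].ljust(width[k], "-"))  # pad shorter insertions
            if k < n:
                out.append(bases[k])
        return "".join(out)

    ref_row = build(list(reference), [""] * (n + 1))
    return ref_row, [build(b, i) for b, i in parsed]

ref_row, rows = merge_anchored("ACGT", [
    ("ACGT", "ACGT"),    # identical to the reference
    ("AC-GT", "ACTGT"),  # one inserted base
    ("ACGT", "A-GT"),    # one deleted base
])
print(ref_row)  # AC-GT
print(rows)     # ['AC-GT', 'ACTGT', 'A--GT']
```

Only pair-wise comparisons are needed, which is the efficiency advantage named above.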
12Paralog filtering -- idea
- The paralog problem
- unrecognized paralogs give rise to spurious SNP
predictions
- SNPs in duplicated regions may be useless for
genotyping
13Paralog filtering -- probabilities
- Pair-wise comparison between EST and genomic
sequence
- Model of expected discrepancies
- Native: sequencing error + polymorphisms
- Paralog: sequencing error + paralogous sequence
difference
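The two discrepancy models can be compared numerically. The sketch below is a simplified stand-in, not the published PolyBayes formula: it treats the number of mismatches as Poisson-distributed around the expected count under each model. The rates `poly_rate` and `paralog_rate` and the prior `prior_native` are illustrative assumptions; only the Phred conversion (error = 10^(-Q/10)) is standard.

```python
import math

def phred_to_error(q):
    """Phred quality Q corresponds to error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def prob_native(quals, n_mismatch,
                poly_rate=0.001,     # assumed polymorphism rate per bp
                paralog_rate=0.02,   # assumed paralogous difference per bp
                prior_native=0.5):   # assumed prior
    """Posterior probability that a fragment is native, not paralogous."""
    expected_errors = sum(phred_to_error(q) for q in quals)
    # Native: sequencing error + polymorphisms
    lam_native = expected_errors + poly_rate * len(quals)
    # Paralog: sequencing error + paralogous sequence difference
    lam_paralog = expected_errors + paralog_rate * len(quals)
    native = prior_native * poisson_pmf(n_mismatch, lam_native)
    paralog = (1 - prior_native) * poisson_pmf(n_mismatch, lam_paralog)
    return native / (native + paralog)

quals = [30] * 400                 # a 400 bp EST, Q30 throughout
print(prob_native(quals, 1))       # close to 1: consistent with native
print(prob_native(quals, 10))      # close to 0: consistent with a paralog
```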
14Paralog filtering -- paralogs
15Paralog filtering -- selectivity
375 paralogous ESTs
1,579 native ESTs
16SNP detection
- Goal: to discern true variation from sequencing
error
17Bayesian-statistical SNP detection
18The SNP score
polymorphism
specific variation
19SNP priors
- Polymorphism rate in population -- e.g. 1 / 300
bp
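To make the role of base quality and the prior concrete, here is a deliberately simplified Bayesian calculation for one alignment column. It is not the PolyBayes SNP score: it assumes only two kinds of site (monomorphic, or polymorphic with two alleles each carried by a read with probability 1/2) and spreads the 1/300 prior evenly over the six possible allele pairs. All function names are hypothetical.

```python
import math
from itertools import combinations

BASES = "ACGT"

def base_likelihood(observed, q, true_base):
    """P(observed base | true base), with the error split over 3 other bases."""
    e = 10 ** (-q / 10)
    return 1 - e if observed == true_base else e / 3

def snp_probability(column, prior_poly=1 / 300):
    """Posterior probability that a column of (base, quality) is polymorphic."""
    # Monomorphic site: every read carries the same true base.
    mono = sum(
        (1 - prior_poly) / 4
        * math.prod(base_likelihood(b, q, a) for b, q in column)
        for a in BASES
    )
    # Polymorphic site: each read carries one of two alleles (assumed 50/50).
    poly = sum(
        prior_poly / 6
        * math.prod(0.5 * base_likelihood(b, q, a1)
                    + 0.5 * base_likelihood(b, q, a2) for b, q in column)
        for a1, a2 in combinations(BASES, 2)
    )
    return poly / (mono + poly)

balanced = [("A", 40)] * 4 + [("G", 40)] * 4
low_quality_singleton = [("A", 40)] * 7 + [("G", 10)]
print(snp_probability(balanced))               # near 1
print(snp_probability(low_quality_singleton))  # near 0
```

A single discordant base with low quality is explained away as sequencing error, while several high-quality discordant bases outweigh the small prior.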
20Selectivity of detection
21Validation by pooled sequencing
22Validation by re-sequencing
23Rare alleles are hard to detect
- frequent alleles are easier to detect
- high-quality alleles are easier to detect
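One reason rare alleles are hard to detect is simple sampling: the allele has to be present among the sequenced chromosomes at all. As a small worked example, assuming chromosomes are sampled independently, the chance of seeing an allele of frequency f at least once among n chromosomes is 1 - (1 - f)^n. The function name is hypothetical.

```python
def prob_allele_sampled(freq, n_chromosomes):
    """Chance that an allele of frequency `freq` appears at least once."""
    return 1 - (1 - freq) ** n_chromosomes

for freq in (0.01, 0.05, 0.30):
    print(freq, round(prob_allele_sampled(freq, 8), 3))
```

With 8 chromosomes, a 1% allele is sampled in under 8% of cases, while a 30% allele is sampled in about 94%.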
24The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
- First statistically rigorous SNP discovery tool
- Correctly analyzes alternative cDNA splice forms
- Available for use (70 licenses)
Marth et al., Nature Genetics, 1999
25INDEL discovery
- Sequencing chemistry context-dependent
- There is no base quality value for deleted
nucleotide(s)
- No reliable prior expectation for INDEL rates of
various classes
26INDEL discovery
[Figure: a deletion between its two deletion flanks, and an insertion between its two insertion flanks]
Q(deletion) = average of Q(deletion flank)
Q(insertion flank) > 35
Q(deletion flank) > 35
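The flank-based quality rule can be written down directly. This is a minimal sketch: the function names are hypothetical, and requiring both flanks to exceed the threshold is an assumption, since the slide does not say whether one or both flanks must pass.

```python
def deletion_quality(left_flank_q, right_flank_q):
    """A deleted base has no quality of its own; average the two flanks."""
    return (left_flank_q + right_flank_q) / 2

def flanks_pass(left_flank_q, right_flank_q, threshold=35):
    """Assumed rule: both flanking base qualities must exceed the threshold."""
    return left_flank_q > threshold and right_flank_q > threshold

print(deletion_quality(40, 38))  # 39.0
print(flanks_pass(40, 38))       # True
print(flanks_pass(40, 30))       # False
```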
27INDEL discovery
- 123,035 candidate INDELs (25% of substitutions)
- Majority 1-4 bp insertion length (1 bp: 68%,
2 bp: 13%)
- Validation rate steeply increases with insertion
length
[Figure: surviving values 61.7, 60.8, 14.3; labels not recoverable]
28SNP discovery in diploid traces
usually, PCR products are sequenced from multiple
individuals
29SNP discovery in diploid traces
30SNP mining -- genome BAC overlaps
- inter- and intra-chromosomal duplications
- known human repeats
- fragmentary nature of draft data
31BAC overlap mining results
30,000 clones
>CloneX ACGTTGCAACGT GTCAATGCTGCA
>CloneY ACGTTGCAACGT GTCAATGCTGCA
25,901 clones (7,122 finished, 18,779 draft with
base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
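The two overlap sequences above differ at a single position. A minimal sketch of how candidate SNPs could be listed from an already aligned clone overlap follows; the function name, the optional quality lists, and the `min_q` cutoff are assumptions for illustration, not the procedure used in the project.

```python
def overlap_mismatches(seq_a, seq_b, qual_a=None, qual_b=None, min_q=0):
    """List (position, base_a, base_b) where two aligned sequences differ.

    Positions are 0-based. Gap columns are skipped. If base qualities are
    supplied, both bases must reach min_q for the mismatch to be reported.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    candidates = []
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b or "-" in (a, b):
            continue
        if qual_a is not None and qual_a[i] < min_q:
            continue
        if qual_b is not None and qual_b[i] < min_q:
            continue
        candidates.append((i, a, b))
    return candidates

clone_x = "ACCTAGGAGACTGAACTTACTG"
clone_y = "ACCTAGGAGACCGAACTTACTG"
print(overlap_mismatches(clone_x, clone_y))  # [(11, 'T', 'C')]
```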
32SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC
overlaps
Weber et al., AJHG 2002
33The current variation resource
- The current public resource (dbSNP) contains
over 2 million SNPs as a dense genome map of
polymorphic markers
1. How are these SNPs structured within the
genome?
2. What can we learn about the processes
that shape human variability?
34New sequencers for SNP discovery