Sequence Variation Informatics - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence Variation Informatics

Description:

Seminar talk given at the Children's Hospital Informatics Program, 10-02-2003.

Slides: 35

Provided by: Gabo90

Learn more at: http://clavius.bc.edu

Transcript and Presenter's Notes

Title: Sequence Variation Informatics

1
Sequence Variation Informatics
BI420 Introduction to Bioinformatics
Gabor T. Marth
Department of Biology, Boston College marth_at_bc.edu
2
Sequence variations

The Human Genome Project produced a reference genome
sequence that is 99.9% common to each human being

3
Why do we care about variations?
phenotypic differences
4
Where do variations come from?

sequence variations are the result of mutation
events

TAAAAAT
5
SNP discovery

comparative analysis of multiple sequences from
the same region of the genome (redundant sequence
coverage)

6
Steps of SNP discovery
7
Computational SNP mining: PolyBayes
8
Computational SNP mining: PolyBayes
sequence clustering simplifies to a database search with the genome reference
multiple alignment is built by anchoring fragments to the genome reference
paralog filtering works by counting mismatches weighted by quality values
SNP detection differentiates true polymorphism from sequencing error using quality values
9
SNP discovery with PolyBayes
genome reference sequence
10
Sequence clustering

Clustering simplifies to a search against the sequence
database to recruit relevant sequences

Clusters: groups of overlapping sequence
fragments matching the genome reference

genome reference
fragments
cluster 1
cluster 2
cluster 3
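The clustering idea on this slide can be sketched as follows. This is a minimal illustration, not the PolyBayes code: it assumes each fragment has already been matched to the genome reference and is described by a hypothetical (name, start, end) tuple, and it groups fragments whose reference intervals overlap.

```python
def cluster_fragments(hits):
    """Group fragments whose intervals on the genome reference overlap.

    hits: list of (name, start, end) tuples with 0-based, half-open
    reference coordinates (a hypothetical representation of search hits).
    """
    clusters = []
    cluster_end = None
    for name, start, end in sorted(hits, key=lambda h: h[1]):
        if cluster_end is None or start >= cluster_end:
            clusters.append([name])       # no overlap: open a new cluster
            cluster_end = end
        else:
            clusters[-1].append(name)     # overlaps the current cluster
            cluster_end = max(cluster_end, end)
    return clusters


hits = [("f1", 0, 100), ("f2", 50, 150), ("f3", 300, 400),
        ("f4", 350, 420), ("f5", 900, 1000)]
print(cluster_fragments(hits))
```

Sorting by start coordinate means each fragment only has to be compared with the running end of the current cluster.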
11
(Anchored) multiple alignment

The genomic reference sequence serves as an
anchor
fragments are aligned pair-wise to the genomic sequence
insertions are propagated (sequence padding)

Advantages
efficient -- only involves pair-wise comparisons
accurate -- correctly aligns alternatively
spliced ESTs
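A small sketch of how pair-wise alignments against a common anchor could be merged by padding. It assumes, hypothetically, that every fragment is supplied as a pair of equal-length gapped strings (reference, fragment) spanning the whole reference; the function names are illustrative and are not taken from PolyBayes.

```python
def parse_pairwise(ref_aln, frag_aln):
    """Split one pair-wise alignment by reference position.

    bases[i]   : fragment character aligned to reference base i
    inserts[i] : fragment bases inserted before reference base i
                 (inserts[len(ref)] holds a trailing insertion)
    """
    bases, inserts, pending = [], [], ""
    for r, f in zip(ref_aln, frag_aln):
        if r == "-":
            pending += f                  # insertion relative to reference
        else:
            inserts.append(pending)
            bases.append(f)
            pending = ""
    inserts.append(pending)
    return bases, inserts


def anchored_msa(pairwise):
    """Merge pair-wise alignments that share one reference into padded rows."""
    parsed = [parse_pairwise(r, f) for r, f in pairwise]
    ref = pairwise[0][0].replace("-", "")
    n = len(ref)
    # Widest insertion seen at each slot; every row is padded to this width.
    width = [max(len(ins[i]) for _, ins in parsed) for i in range(n + 1)]
    rows = ["".join("-" * width[i] + ref[i] for i in range(n)) + "-" * width[n]]
    for bases, ins in parsed:
        row = "".join(ins[i].ljust(width[i], "-") + bases[i] for i in range(n))
        rows.append(row + ins[n].ljust(width[n], "-"))
    return rows


pairwise = [("AC-GT", "ACTGT"), ("ACGT", "A-GT"), ("ACG--T", "ACGAAT")]
for row in anchored_msa(pairwise):
    print(row)
```

Only pair-wise comparisons are needed; the padding step is what propagates each insertion to all other rows.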

12
Paralog filtering -- idea

The paralog problem
unrecognized paralogs give rise to spurious SNP
predictions
SNPs in duplicated regions may be useless for
genotyping

13
Paralog filtering -- probabilities

Pair-wise comparison between EST and genomic
sequence

Model of expected discrepancies
Native: sequencing error + polymorphisms
Paralog: sequencing error + paralogous sequence
difference
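One way to turn the two discrepancy models into a probability is sketched below. Everything numeric here is an assumption made for illustration: the Poisson model for the mismatch count, the polymorphism rate, the paralogous divergence, and the 50/50 prior are not taken from the slides. The expected number of sequencing errors is derived from Phred-style quality values, where the error probability is 10^(-Q/10).

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def p_native(n_mismatches, quals, poly_rate=0.001, paralog_div=0.02,
             prior_native=0.5):
    """Posterior probability that an EST is native rather than paralogous.

    quals: base quality values over the aligned region.
    poly_rate, paralog_div, prior_native: illustrative, assumed values.
    """
    expected_errors = sum(10 ** (-q / 10) for q in quals)
    n = len(quals)
    lam_native = expected_errors + n * poly_rate      # error + polymorphism
    lam_paralog = expected_errors + n * paralog_div   # error + paralog difference
    num = prior_native * poisson_pmf(n_mismatches, lam_native)
    den = num + (1 - prior_native) * poisson_pmf(n_mismatches, lam_paralog)
    return num / den


quals = [30] * 400
print(p_native(1, quals))    # few mismatches: favors native
print(p_native(10, quals))   # many mismatches: favors paralog
```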

14
Paralog filtering -- paralogs
15
Paralog filtering -- selectivity
375 paralogous ESTs
1,579 native ESTs
16
SNP detection

Goal: to discern true variation from sequencing
error

17
Bayesian-statistical SNP detection
18
The SNP score
polymorphism
specific variation
19
SNP priors

Polymorphism rate in population -- e.g. 1 / 300
bp
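To make the role of the prior and the quality values concrete, here is a deliberately simplified Bayesian calculation for one alignment column. It is not the PolyBayes formula: it assumes only two kinds of hypotheses (the site is monomorphic for one base, or it carries two alleles that each sequence samples with probability 0.5), spreads the prior evenly within each kind, and treats a sequencing error as equally likely to yield any of the other three bases.

```python
from itertools import combinations

BASES = "ACGT"


def base_likelihood(observed, true_base, qual):
    """P(observed base | true base), with error probability 10^(-Q/10)."""
    err = 10 ** (-qual / 10)
    return 1 - err if observed == true_base else err / 3


def snp_posterior(column, prior_poly=1 / 300):
    """Posterior probability that a column is polymorphic.

    column: list of (base, quality) pairs, one per aligned sequence.
    prior_poly: prior polymorphism rate, e.g. 1 per 300 bp.
    """
    mono = 0.0
    for a in BASES:                                   # monomorphic hypotheses
        like = 1.0
        for obs, q in column:
            like *= base_likelihood(obs, a, q)
        mono += (1 - prior_poly) / 4 * like
    poly = 0.0
    for a, b in combinations(BASES, 2):               # two-allele hypotheses
        like = 1.0
        for obs, q in column:
            like *= 0.5 * base_likelihood(obs, a, q) + 0.5 * base_likelihood(obs, b, q)
        poly += prior_poly / 6 * like
    return poly / (mono + poly)


strong = [("C", 40), ("C", 40), ("T", 40), ("T", 40)]
weak = [("C", 40), ("C", 40), ("C", 40), ("T", 10)]
print(snp_posterior(strong), snp_posterior(weak))
```

In this toy model a mismatch supported by a single low-quality base is explained away as sequencing error, while two high-quality copies of each allele give a posterior close to 1.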

20
Selectivity of detection
21
Validation by pooled sequencing
22
Validation by re-sequencing
23
Rare alleles are hard to detect

frequent alleles are easier to detect
high-quality alleles are easier to detect
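The effect of allele frequency can be illustrated with a short calculation. Assuming the n sequences are independent random draws from the population (an idealization), both alleles of a site with minor allele frequency f appear in the sample with probability 1 - f^n - (1 - f)^n.

```python
def p_both_alleles_sampled(maf, n):
    """Probability that n independently sampled sequences include both alleles.

    maf: minor allele frequency (0 < maf <= 0.5); n: number of sequences.
    """
    return 1 - maf ** n - (1 - maf) ** n


for maf in (0.5, 0.1, 0.01):
    print(maf, round(p_both_alleles_sampled(maf, 4), 4))
```

With four sequences, a site with two equally frequent alleles shows both of them most of the time, whereas a 1% allele is seen in only about 4% of samples.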

24
The PolyBayes software
http://genome.wustl.edu/gsc/polybayes

First statistically rigorous SNP discovery tool

Correctly analyzes alternative cDNA splice forms

Available for use (70 licenses)

Marth et al., Nature Genetics, 1999
25
INDEL discovery
Sequencing chemistry is context-dependent
There is no base quality value for deleted
nucleotide(s)
No reliable prior expectation for INDEL rates of
various classes
26
INDEL discovery
Diagram labels: a deletion between two deletion flanks; an insertion between two insertion flanks
Q(deletion) = average of Q(deletion flank)
Q(insertion flank) > 35
Q(deletion flank) > 35
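The flank-based quality rules can be sketched as follows. Two points are assumptions rather than statements from the slide: the number of flanking bases used on each side, and the reading of "Q(flank) > 35" as a requirement on every flanking base instead of on their average.

```python
def deletion_quality(left_flank_q, right_flank_q):
    """Quality assigned to a deletion: the average of its flanking qualities."""
    flank = list(left_flank_q) + list(right_flank_q)
    return sum(flank) / len(flank)


def passes_flank_filter(left_flank_q, right_flank_q, threshold=35):
    """True if every flanking base quality exceeds the threshold (assumed reading)."""
    return all(q > threshold for q in list(left_flank_q) + list(right_flank_q))


print(deletion_quality([40, 38], [36, 42]))
print(passes_flank_filter([40, 38], [36, 42]))
print(passes_flank_filter([40, 30], [36, 42]))
```

Borrowing the flank qualities is a workaround for the fact that a deleted nucleotide has no quality value of its own.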
27
INDEL discovery

123,035 candidate INDELs (25% of substitutions)

Majority 1-4 bp insertion length (1 bp: 68%,
2 bp: 13%)

Validation rate steeply increases with insertion
length

[Figure: validation rate by insertion length; values shown: 61.7, 60.8, 14.3]
28
SNP discovery in diploid traces
usually, PCR products are sequenced from multiple
individuals
29
SNP discovery in diploid traces
30
SNP mining: genome BAC overlaps
inter- and intra-chromosomal duplications
known human repeats
fragmentary nature of draft data
31
BAC overlap mining results
30,000 clones
>CloneX ACGTTGCAACGT GTCAATGCTGCA
>CloneY ACGTTGCAACGT GTCAATGCTGCA
25,901 clones (7,122 finished, 18,779 draft with
base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
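The two overlap sequences shown on this slide differ at a single position. A minimal sketch of locating such candidate sites in an already aligned, equal-length overlap is given below; it ignores base quality values and indels, which a real overlap-mining step would have to consider.

```python
def mismatches(seq_a, seq_b):
    """Return (position, base_a, base_b) for each difference, 0-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


clone_x = "ACCTAGGAGACTGAACTTACTG"
clone_y = "ACCTAGGAGACCGAACTTACTG"
print(mismatches(clone_x, clone_y))   # [(11, 'T', 'C')]
```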
32
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC
overlaps
Weber et al., AJHG 2002
33
The current variation resource

The current public resource (dbSNP) contains
over 2 million SNPs as a dense genome map of
polymorphic markers

1. How are these SNPs structured within the genome?
2. What can we learn about the processes that shape human variability?
34
New sequencers for SNP discovery
