Sequence Variation Informatics

Seminar talk given at the Children's Hospital Informatics Program, 10-02-2003.

Learn more at: http://clavius.bc.edu
1
Sequence Variation Informatics
BI420 Introduction to Bioinformatics
Gabor T. Marth
Department of Biology, Boston College marth_at_bc.edu
2
Sequence variations
  • The Human Genome Project produced a reference genome
    sequence that is 99.9% common to each human being

3
Why do we care about variations?
phenotypic differences
4
Where do variations come from?
  • sequence variations are the result of mutation
    events

TAAAAAT
5
SNP discovery
  • comparative analysis of multiple sequences from
    the same region of the genome (redundant sequence
    coverage)

6
Steps of SNP discovery
7
Computational SNP mining: PolyBayes
8
Computational SNP mining: PolyBayes
  • sequence clustering simplifies to database search
    with genome reference
  • multiple alignment by anchoring fragments to
    genome reference
  • paralog filtering by counting mismatches weighted
    by quality values
  • SNP detection by differentiating true
    polymorphism from sequencing error using quality
    values
9
SNP discovery with PolyBayes
genome reference sequence
10
Sequence clustering
  • Clustering simplifies to search against sequence
    database to recruit relevant sequences
  • Clusters: groups of overlapping sequence
    fragments matching the genome reference

(Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference)
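
One way to picture the clustering step: once a database search has placed each fragment on the genome reference, a cluster is a run of fragments whose reference intervals overlap. The sketch below assumes the search has already produced (name, start, end) hits. The function name and the input format are illustrative and are not taken from PolyBayes.

```python
def cluster_hits(hits):
    """Group fragments whose intervals on the genome reference overlap.

    hits: list of (name, start, end) tuples with half-open coordinates on
    the reference (a hypothetical format for database-search output).
    """
    clusters = []
    cluster_end = None
    for name, start, end in sorted(hits, key=lambda h: h[1]):
        if cluster_end is None or start >= cluster_end:
            clusters.append([name])      # no overlap: open a new cluster
            cluster_end = end
        else:
            clusters[-1].append(name)    # overlaps the current cluster
            cluster_end = max(cluster_end, end)
    return clusters


hits = [("est1", 0, 400), ("est2", 300, 700), ("est3", 1200, 1600),
        ("est4", 1500, 1900), ("est5", 3000, 3300)]
print(cluster_hits(hits))
```

Sorting by start coordinate means each fragment only has to be compared with the right-most end of the cluster currently being built.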
11
(Anchored) multiple alignment
  • The genomic reference sequence serves as an
    anchor
  • fragments pair-wise aligned to genomic sequence
  • insertions are propagated -- sequence padding
  • Advantages:
  • efficient -- only involves pair-wise comparisons
  • accurate -- correctly aligns alternatively
    spliced ESTs
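
The anchoring idea can be sketched as follows: each fragment is aligned pair-wise to the reference, the widest insertion seen after each reference position sets the padding at that position, and every row is padded to that width. This is a simplified illustration, not the PolyBayes implementation. It assumes each pair-wise alignment is already given as two gapped strings covering the whole reference, and the function names are hypothetical.

```python
def split_columns(ref_aln, seq_aln):
    """Return (bases, inserts) for one pair-wise alignment.

    bases[i] is the fragment character under reference base i;
    inserts[i + 1] holds fragment bases inserted after reference base i
    (inserts[0] holds anything inserted before the first reference base).
    """
    bases, inserts = [], [""]
    for r, s in zip(ref_aln, seq_aln):
        if r == "-":
            inserts[-1] += s             # insertion relative to the reference
        else:
            bases.append(s)
            inserts.append("")
    return bases, inserts


def anchored_msa(reference, pairwise):
    """pairwise: non-empty list of (ref_aln, frag_aln) gapped strings, each a
    global alignment of one fragment against the same reference."""
    cols = [split_columns(r, s) for r, s in pairwise]
    n = len(reference)
    # The widest insertion at each slot decides how much padding is needed.
    width = [max(len(ins[i]) for _, ins in cols) for i in range(n + 1)]

    def build(bases, inserts):
        row = inserts[0].ljust(width[0], "-")
        for i in range(n):
            row += bases[i] + inserts[i + 1].ljust(width[i + 1], "-")
        return row

    ref_row = build(list(reference), [""] * (n + 1))
    return [ref_row] + [build(b, ins) for b, ins in cols]


rows = anchored_msa("ACGT", [("ACGT", "ACGT"),
                             ("AC-GT", "ACTGT"),
                             ("ACGT", "A-GT")])
print("\n".join(rows))
```

Only pair-wise alignments are ever computed; the multiple alignment falls out of padding, which is the efficiency point made on the slide.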

12
Paralog filtering -- idea
  • The paralog problem
  • unrecognized paralogs give rise to spurious SNP
    predictions
  • SNPs in duplicated regions may be useless for
    genotyping

13
Paralog filtering -- probabilities
  • Pair-wise comparison between EST and genomic
    sequence
  • Model of expected discrepancies
  • Native: sequencing error + polymorphisms
  • Paralog: sequencing error + paralogous sequence
    difference
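
A minimal sketch of this comparison, assuming mismatch counts follow a Poisson distribution under each model. The expected error count comes from Phred quality values. The polymorphism rate, the paralogous difference rate, and the prior are illustrative assumptions and are not values from the talk.

```python
import math


def expected_errors(quals):
    """Expected number of sequencing errors from Phred quality values."""
    return sum(10 ** (-q / 10) for q in quals)


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def p_native(mismatches, quals, poly_rate=0.001, paralog_rate=0.02,
             prior_native=0.9):
    """Posterior probability that a fragment is native rather than paralogous.

    poly_rate, paralog_rate (per bp) and prior_native are assumed values
    chosen only for illustration.
    """
    n = len(quals)
    err = expected_errors(quals)
    lam_native = err + poly_rate * n       # sequencing error + polymorphisms
    lam_paralog = err + paralog_rate * n   # sequencing error + paralogous difference
    num = prior_native * poisson_pmf(mismatches, lam_native)
    den = num + (1 - prior_native) * poisson_pmf(mismatches, lam_paralog)
    return num / den


quals = [30] * 500                         # 500 aligned bases, all Q30
print(p_native(1, quals), p_native(12, quals))
```

With these assumed rates, a 500 bp fragment showing one mismatch is almost certainly native, while one showing twelve mismatches is almost certainly a paralog.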

14
Paralog filtering -- paralogs
15
Paralog filtering -- selectivity
375 paralogous ESTs
1,579 native ESTs
16
SNP detection
  • Goal: to discern true variation from sequencing
    error

17
Bayesian-statistical SNP detection
18
The SNP score
polymorphism
specific variation
19
SNP priors
  • Polymorphism rate in population -- e.g. 1 / 300
    bp
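
To illustrate how such a prior enters a Bayesian SNP score, the sketch below evaluates one alignment column. It is a deliberately simplified model and not the PolyBayes formula: a monomorphic site has a single true base, a polymorphic site has two alleles that each sequence carries with probability 1/2, and the 1 / 300 prior is spread evenly over the possible allele pairs. All function names are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"


def p_call(call, true_base, q):
    """Probability of observing `call` given the true base and Phred quality q."""
    e = 10 ** (-q / 10)
    return 1 - e if call == true_base else e / 3


def snp_posterior(column, prior_poly=1 / 300):
    """column: list of (base_call, phred_quality), one per aligned sequence."""
    mono = 0.0
    for a in BASES:                              # monomorphic: one true base
        like = 1.0
        for call, q in column:
            like *= p_call(call, a, q)
        mono += (1 - prior_poly) / 4 * like
    poly = 0.0
    for a, b in combinations(BASES, 2):          # polymorphic: two alleles
        like = 1.0
        for call, q in column:
            like *= 0.5 * p_call(call, a, q) + 0.5 * p_call(call, b, q)
        poly += prior_poly / 6 * like
    return poly / (mono + poly)


confident = [("C", 40), ("C", 40), ("T", 40), ("T", 40)]
doubtful = [("C", 40), ("C", 40), ("C", 40), ("T", 10)]
print(snp_posterior(confident), snp_posterior(doubtful))
```

In this toy model, two high-quality calls of each allele outweigh the small prior, whereas a single low-quality discrepant call is explained more cheaply as a sequencing error.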

20
Selectivity of detection
21
Validation by pooled sequencing
22
Validation by re-sequencing
23
Rare alleles are hard to detect
  • frequent alleles are easier to detect
  • high-quality alleles are easier to detect
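
A back-of-the-envelope sketch of why allele frequency matters, assuming the sequences covering a site are independent draws from the population (an assumption made here for illustration): an allele with frequency f appears at least once among n sequences with probability 1 - (1 - f)^n.

```python
def p_allele_sampled(freq, n_sequences):
    """Probability that an allele of the given population frequency appears
    at least once among n independently sampled sequences."""
    return 1 - (1 - freq) ** n_sequences


for freq in (0.5, 0.1, 0.01):
    print(freq, round(p_allele_sampled(freq, 4), 4))
```

At four-fold coverage a 50% allele is almost always sampled, while a 1% allele is seen in only about 4% of cases, before sequencing quality is even considered.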

24
The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
  • First statistically rigorous SNP discovery tool
  • Correctly analyzes alternative cDNA splice forms
  • Available for use (70 licenses)

Marth et al., Nature Genetics, 1999
25
INDEL discovery
  • Sequencing chemistry is context-dependent
  • There is no base quality value for deleted
    nucleotide(s)
  • No reliable prior expectation for INDEL rates of
    various classes
26
INDEL discovery
(Figure: a deletion with its two deletion flanks, and an insertion with its two insertion flanks)
Q(deletion) = average of Q(deletion flank)
Q(insertion flank) > 35
Q(deletion flank) > 35
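
Read as rules, the slide says a deletion takes its quality from the average of its flanking base qualities, and flanking bases must have quality above 35. A minimal sketch of that reading follows. The function names are hypothetical, and requiring both flanks to pass the threshold is an assumption.

```python
def deletion_quality(left_flank_q, right_flank_q):
    """Stand-in quality for a deletion: the mean of the two flanking base
    qualities, since deleted bases have no quality value of their own."""
    return (left_flank_q + right_flank_q) / 2


def passes_flank_filter(left_flank_q, right_flank_q, threshold=35):
    """Assumed rule: both flanking qualities must exceed the threshold."""
    return left_flank_q > threshold and right_flank_q > threshold


print(deletion_quality(40, 36), passes_flank_filter(40, 36))
```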
27
INDEL discovery
  • 123,035 candidate INDELs (25% of substitutions)
  • Majority: 1-4 bp insertion length (1 bp: 68%,
    2 bp: 13%)
  • Validation rate steeply increases with insertion
    length

(Figure; surviving values: 61.7, 60.8, 14.3)
28
SNP discovery in diploid traces
usually, PCR products are sequenced from multiple
individuals
29
SNP discovery in diploid traces
30
SNP mining: genome BAC overlaps
  • inter- and intra-chromosomal duplications
  • known human repeats
  • fragmentary nature of draft data
31
BAC overlap mining results
30,000 clones
>CloneX ACGTTGCAACGT GTCAATGCTGCA
>CloneY ACGTTGCAACGT GTCAATGCTGCA
25,901 clones (7,122 finished, 18,779 draft with
base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
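
The two overlap sequences above differ at a single base. A minimal sketch of how candidate SNPs could be listed from an overlap, assuming the two clone sequences are already aligned and of equal length (a real pipeline would also weigh base quality values, which this sketch ignores):

```python
def candidate_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch between two
    pre-aligned, equal-length sequences. Positions are 0-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


clone_x = "ACCTAGGAGACTGAACTTACTG"
clone_y = "ACCTAGGAGACCGAACTTACTG"
print(candidate_snps(clone_x, clone_y))   # [(11, 'T', 'C')]
```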
32
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC
overlaps
Weber et al., AJHG 2002
33
The current variation resource
  • The current public resource (dbSNP) contains
    over 2 million SNPs as a dense genome map of
    polymorphic markers

1. How are these SNPs structured within the
   genome?
2. What can we learn about the processes
   that shape human variability?
34
New sequencers for SNP discovery