Title: http://creativecommons.org/licenses/by-sa/2.0/
1http://creativecommons.org/licenses/by-sa/2.0/
2Sequencing and Sequence Alignment
David Wishart, University of Alberta
3Objectives
- Understand how DNA sequence data is collected and prepared
- Be aware of the importance of sequence searching and sequence alignment in biology and medicine
- Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment
4High Throughput DNA Sequencing
530,000
6Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
7Principles of DNA Sequencing
Primer
DNA fragment
Amp
PBR322
Tet
Ori
Denature with heat to produce ssDNA
Klenow ddNTP dNTP primers
8The Secret to Sanger Sequencing
9Principles of DNA Sequencing
3' Template
G C A T G C
5'
5' Primer
dATP dCTP dGTP dTTP
ddCTP
GddC
GCddA
GCAddT
ddG
GCATGddC
GCATddG
10Principles of DNA Sequencing
G
T
short
_
_
C
A
G C A T G C
long
11Capillary Electrophoresis
Separation by Electro-osmotic Flow
12Multiplexed CE with Fluorescent detection
ABI 3700
96x700 bases
13Shotgun Sequencing
Assembled Sequence
Sequence Chromatogram
Send to Computer
14Shotgun Sequencing
- Very efficient process for small-scale (10 kb) sequencing (preferred method)
- First applied to whole genome sequencing in 1995 (H. influenzae)
- Now standard for all prokaryotic genome sequencing projects
- Successfully applied to D. melanogaster
- Moderately successful for H. sapiens
15The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAG ATTACAGATTACA
GATTACAGATTACAGATTACAGA TTACAGATTACAGATTACAGATTACA
GATTACAGAT TACAGATTAGAGATTACAGATTACAGATTACAGATT AC
AGATTACAGATTACAGATTACAGATTACAGATTA CAGATTACAGATTAC
AGATTACAGATTACAGATTAC AGATTACAGATTACAGATTACAGATTAC
AGATTACA GATTACAGATTACAGATTACAGATTACAGATTACAG ATTA
CAGATTACAGATTACAGATTACAGATTACAGA TTACAGATTACAGATTA
CAGATTACAGATTACAGAT
16Sequencing Successes
T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins
Escherichia coli: completed in 1997; 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes
17Sequencing Successes
Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes
Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes
Homo sapiens: completed in 2003; 3,201,762,515 bp, 31,780 genes
18Genomes to Date
- 8 vertebrates (human, mouse, rat, fugu, zebrafish)
- 3 plants (Arabidopsis, rice, poplar)
- 2 insects (fruit fly, mosquito)
- 2 nematodes (C. elegans, C. briggsae)
- 1 sea squirt
- 4 parasites (plasmodium, guillardia)
- 4 fungi (S. cerevisiae, S. pombe)
- 200 bacteria and archaebacteria
- 2000 viruses
19So what do we do with all this sequence data?
20Sequence Alignment
21Alignments tell us about...
- Function or activity of a new gene/protein
- Structure or shape of a new protein
- Location or preferred location of a protein
- Stability of a gene or protein
- Origin of a gene or protein
- Origin or phylogeny of an organelle
- Origin or phylogeny of an organism
22Factoid
Sequence comparisons lie at the heart of
all bioinformatics
23Similarity versus Homology
- Similarity refers to the likeness or identity between 2 sequences
- Similarity means sharing a statistically significant number of bases or amino acids
- Similarity does not imply homology
- Homology refers to shared ancestry
- Two sequences are homologous if they are derived from a common ancestral sequence
- Homology usually implies similarity
24Similarity versus Homology
- Similarity can be quantified
- It is correct to say that two sequences are X% identical
- It is correct to say that two sequences have a similarity score of Z
- It is generally incorrect to say that two sequences are X% similar
25Similarity versus Homology
- Homology cannot be quantified
- If two sequences have a high identity it is OK to say they are homologous
- It is incorrect to say two sequences have a homology score of Z
- It is incorrect to say two sequences are X% homologous
26Homologues and All That
- Homologue (or Homolog)
  - Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)
- Paralogue (or Paralog)
  - A homologue which arose through gene duplication in the same species/chromosome
- Orthologue (or Ortholog)
  - A homologue which arose through speciation (found in different species)
27Sequence Complexity
MCDEFGHIKLAN. High Complexity
ACTGTCACTGAT. Mid Complexity
NNNNTTTTTNNN. Low Complexity
Translate those DNA sequences!!!
28Assessing Sequence Similarity
Two Character Strings
THESTORYOFGENESIS
THISBOOKONGENETICS
Character Comparison
THESTORYOFGENESI-S
THISBOOKONGENETICS
Context Comparison
THE STORY OF GENESIS
THIS BOOK ON GENETICS
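The character comparison of the two strings can be turned into a number. The sketch below is only an illustration: the function name `percent_identity` is hypothetical, and it assumes identity is counted over every alignment column, gap columns included (other conventions divide by the shorter sequence length instead).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same letter in both rows.

    Assumption: the denominator is the full alignment length, so a gap
    column counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# The gapped character comparison from the slide
identity = percent_identity("THESTORYOFGENESI-S", "THISBOOKONGENETICS")
print(f"{identity:.1f}% identical")
```

A number like this is a statement about identity only; it does not by itself say the two strings are related.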
29Assessing Sequence Similarity
is this alignment significant?
30Is This Alignment Significant?
31Some Simple Rules
- If two sequences are > 100 residues and > 25% identical, they are likely related
- If two sequences are 15-25% identical they may be related, but more tests are needed
- If two sequences are < 15% identical they are probably not related
- If you need more than 1 gap for every 20 residues the alignment is suspicious
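These rules are simple enough to write down as a small function. This is a rough sketch, not a statistical test: the name `relatedness_hint` and its return format are hypothetical, and the branch for short sequences (100 residues or fewer) that are more than 25% identical is an assumption, since the rules do not cover that case.

```python
def relatedness_hint(n_residues, pct_identity, n_gaps=0):
    """Apply the rules of thumb; returns (verdict, list_of_warnings)."""
    warnings = []
    # More than 1 gap for every 20 residues makes the alignment suspicious
    if n_gaps > n_residues / 20:
        warnings.append("suspicious: more than 1 gap per 20 residues")

    if pct_identity < 15:
        verdict = "probably not related"
    elif pct_identity <= 25:
        verdict = "may be related, more tests needed"
    elif n_residues > 100:
        verdict = "likely related"
    else:
        # Assumption: the rules say nothing about short sequences with
        # > 25% identity, so this sketch stays cautious.
        verdict = "may be related, more tests needed"
    return verdict, warnings


print(relatedness_hint(150, 30))
print(relatedness_hint(200, 30, n_gaps=15))
```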
32Doolittle's Rules of Thumb
33Sequence Alignment - Methods
- Dot Plots
- Dynamic Programming
- Heuristic (Fast) Local Alignment
- Multiple Sequence Alignment
- Contig Assembly
34Dot Plots
35Dot Plots
- Invented in 1970 by Gibbs and McIntyre
- Good for quick graphical overview
- Simplest method for sequence comparison
- Inter-sequence comparison
- Intra-sequence comparison
- Identifies internal repeats
- Identifies domains or modules
36Dot Plots Internal Repeats
37Dot Plot Algorithm
- Take two sequences (A and B), write sequence A out as a row (length = m) and sequence B as a column (length = n)
- Create a table or matrix of m columns and n rows
- Compare each letter of sequence A with every letter in sequence B. If there's a match mark it with a dot, if not, leave blank
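The three steps above can be sketched in a few lines. This is a minimal illustration with a hypothetical function name (`dot_plot`), and it has no window or stringency filter, which real dot plot programs usually add to reduce noise.

```python
def dot_plot(seq_a, seq_b):
    """Return the dot matrix as a list of strings.

    seq_a runs across the columns (length m), seq_b down the rows
    (length n). A match is marked with '.', a mismatch is left blank.
    """
    return [
        "".join("." if a == b else " " for a in seq_a)
        for b in seq_b
    ]


# Self-comparison of the sequence used on the next slide
rows = dot_plot("ACDEFGHG", "ACDEFGHG")
print("  ACDEFGHG")
for letter, row in zip("ACDEFGHG", rows):
    print(letter, row)
```

In a self-comparison the main diagonal is always filled; the off-diagonal dots (here the two G residues) are what reveal internal repeats.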
38Dot Plot Algorithm
A C D E F G H G
A C D E F G H G
39Dot Plots
- Most commercial packages offer pretty good dot plot programs, including
  - GCG/Omiga (Pharmacopeia)
  - PepTool (BioTools Inc.)
  - LaserGene (DNAStar)
- Popular freeware package is Dotter www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
- Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
- JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php
40Dynamic Programming
41Dynamic Programming
- Developed by Needleman and Wunsch (1970)
- Refined by Smith and Waterman (1981)
- Ideal for quantitative assessment
- Guaranteed to be mathematically optimal
- Slow N^2 algorithm
- Performed in 2 stages
  - Prepare a scoring matrix using recursive function
  - Scan matrix diagonally using traceback protocol
42The Recursive Function
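$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2 \le x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2 \le y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$
w = gap penalty, S = alignment score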
43Identity Scoring Matrix (Sij)
44A Simple Example...
    A  A  T  V  D
A   1  1  0  0  0
V   0  1  1  2
V
D
45A Simple Example...
    A  A  T  V  D
A   1  1  0  0  0
V   0  1  1  2  1
V   0  1  1  2  2
D   0  1  1  1  3

A A T V D
A - V V D

A A T V D
A V - V D
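The filled matrix for AATVD versus AVVD can be reproduced with a short sketch of the recursive function. Several things here are assumptions for illustration: the function names are hypothetical, the gap penalty is a single constant rather than a length-dependent w, it is set to 0 for this identity-scoring example, and cells in the first row or column simply take their own match score. Traceback is not shown.

```python
def fill_score_matrix(seq_a, seq_b, match_score, gap_penalty=0):
    """Fill S with seq_a across the columns and seq_b down the rows."""
    n_rows, n_cols = len(seq_b), len(seq_a)
    S = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            candidates = []
            if i > 0 and j > 0:
                # Diagonal neighbour: no gap
                candidates.append(S[i - 1][j - 1])
                # Previous column, two or more rows up: gap
                for x in range(2, i + 1):
                    candidates.append(S[i - x][j - 1] + gap_penalty)
                # Previous row, two or more columns left: gap
                for y in range(2, j + 1):
                    candidates.append(S[i - 1][j - y] + gap_penalty)
            best_previous = max(candidates) if candidates else 0
            S[i][j] = match_score(seq_b[i], seq_a[j]) + best_previous
    return S


def identity_score(a, b):
    """Identity scoring: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0


matrix = fill_score_matrix("AATVD", "AVVD", identity_score)
for residue, row in zip("AVVD", matrix):
    print(residue, row)
```

The bottom-right cell holds the best score (3), which is the number of identical columns in either of the two alignments shown on the slide.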
46Could We Do Better?
- Key to the performance of Dynamic Programming is the scoring function
- Dynamic Programming always gives the mathematically correct answer
- Dynamic Programming does not always give the biologically correct answer
- The weakest link -- The Scoring Matrix
47Scoring Matrices
- An empirical model of evolution, biology and chemistry all wrapped up in a 20 X 20 table of integers
- Structurally or chemically similar residues should ideally have high diagonal or off-diagonal numbers
- Structurally or chemically dissimilar residues should ideally have low diagonal or off-diagonal numbers
48A Better Matrix - PAM250
49Using PAM250...
PAM250 (excerpt)
    A  T  V  D
A   2
T   1  3
V   0  0  4
D   0  0 -2  4

Gap Penalty = -1

    A  A  T  V  D
A   2  1  0 -1 -1
V  -1  2  1  5
V
D
50Using PAM250...
Gap Penalty = -1

    A  A  T  V  D
A   2  1  0 -1 -1
V  -1  2  1  5 -1
V  -1  1  2  5  3
D  -1  1  1  0  9

A A T V D
A V - V D
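The final score of 9 can be checked by scoring the finished alignment column by column. The sketch below is only a check of the arithmetic: the dictionary holds just the ten PAM250 values shown on the slide, the helper names are hypothetical, and every gap column is assumed to cost a flat -1.

```python
# The PAM250 values shown on the slide (lower triangle only)
PAM250_EXCERPT = {
    ("A", "A"): 2,
    ("T", "A"): 1, ("T", "T"): 3,
    ("V", "A"): 0, ("V", "T"): 0, ("V", "V"): 4,
    ("D", "A"): 0, ("D", "T"): 0, ("D", "V"): -2, ("D", "D"): 4,
}


def pair_score(a, b):
    """Look up a symmetric score regardless of residue order."""
    if (a, b) in PAM250_EXCERPT:
        return PAM250_EXCERPT[(a, b)]
    return PAM250_EXCERPT[(b, a)]


def alignment_score(row_a, row_b, gap_penalty=-1, gap="-"):
    """Sum substitution scores over columns; gap columns cost gap_penalty."""
    total = 0
    for a, b in zip(row_a, row_b):
        total += gap_penalty if gap in (a, b) else pair_score(a, b)
    return total


score = alignment_score("AATVD", "AV-VD")
print(score)
```

Placing the gap one column earlier (A-VVD) gives the same total under these values, so the scoring alone does not choose between the two.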
51PAM Matrices
- Developed by M.O. Dayhoff (1978)
- PAM = Point Accepted Mutation
- Matrix assembled by looking at patterns of substitutions in closely related proteins
- 1 PAM corresponds to 1 amino acid change per 100 residues
- 1 PAM ~ 1% divergence or 1 million years in evolutionary history
52Dynamic Programming
- Great for doing pairwise global alignments
- Produces a quantitative alignment score
- Problems if one tries to do alignments with very large sequences (memory requirement grows as N^2 or as N x M)
- Serious problems if one tries to align one sequence against a database (10s of hours)
- Need an alternative...
53Fast Local Alignment Methods
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...
54Fast Local Alignment Methods
- Developed by Lipman and Pearson (1985/88)
- Refined by Altschul et al. (1990/97)
- Ideal for large database comparisons
- Uses heuristics and statistical simplification
- Fast N-type algorithm (similar to Dot Plot)
- Cuts sequences into short words (k-tuples)
- Uses Hash Tables to speed comparison
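The k-tuple and hash table idea can be sketched as follows. This is a toy illustration, not FASTA or BLAST: the function names are hypothetical, k = 2 is chosen arbitrarily, and a plain Python dictionary stands in for the hash table. The two sequences are the first two from the previous slide.

```python
from collections import Counter, defaultdict


def index_ktuples(sequence, k=2):
    """Hash table mapping each k-tuple (word) to its start positions."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table


def shared_ktuples(query, subject, k=2):
    """Return (query_pos, subject_pos) pairs where a k-tuple matches."""
    table = index_ktuples(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in table.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


hits = shared_ktuples("ACDEAGHNKLM"[:0] + "ACDEFGHIKLM", "KKDEFGHPKLM", k=2)
# Hits sharing the same offset (q_pos - s_pos) lie on one diagonal
diagonals = Counter(q - s for q, s in hits)
print(hits)
print(diagonals)
```

Each query word costs one table lookup instead of a scan of the whole subject, which is where the roughly linear running time comes from. Many hits on one diagonal point to a region worth aligning more carefully.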
55Fast Alignment Algorithm
56Fast Alignment Algorithm
57Fast Alignment Algorithm
A C D E F G D E F...
L M R G CD D Y G
58Fast Alignment Algorithm
59FASTA
- Developed in 1985 and 1988 (W. Pearson)
- Looks for clusters of nearby or locally dense identical k-tuples
- init1 score = score for first set of k-tuples
- initn score = score for gapped k-tuples
- opt score = optimized alignment score
- Z-score = number of S.D. above random
- expect = expected number of random matches
60FASTA
61Multiple Sequence Alignment
Multiple alignment of Calcitonins
62Multiple Alignment Algorithm
1) Take all n sequences and perform all possible pairwise (n(n-1)/2) alignments
2) Identify the highest scoring pair, perform an alignment and create a consensus sequence
3) Select the next most similar sequence and align it to the initial consensus, then regenerate a second consensus
4) Repeat step 3 until finished
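The first two steps (score every pair, then pick the best one) can be sketched as below. This is a deliberately crude illustration: `ungapped_identity` is a hypothetical stand-in for a real pairwise alignment score, and the four sequences are taken from the Fast Local Alignment Methods slide purely as sample input. Consensus building and the later steps are not shown.

```python
from itertools import combinations


def ungapped_identity(a, b):
    """Crude stand-in score: fraction of matching positions, no gaps."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def best_pair(sequences):
    """Score every pair; return (score, i, j) for the highest-scoring one."""
    return max(
        (ungapped_identity(sequences[i], sequences[j]), i, j)
        for i, j in combinations(range(len(sequences)), 2)
    )


seqs = ["ACDEAGHNKLM", "KKDEFGHPKLM", "MCDEFGHNKLV", "ACDEFGHIKLM"]
n_pairs = len(list(combinations(seqs, 2)))  # n(n-1)/2 pairwise comparisons
score, i, j = best_pair(seqs)
print(n_pairs, "pairs; best pair:", seqs[i], seqs[j], round(score, 2))
```

Because the number of pairs grows as n(n-1)/2, this first step dominates the cost when n is large.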
63Multiple Sequence Alignment
- Developed and refined by many (Doolittle, Barton, Corpet) through the 1980s
- Used extensively for extracting hidden phylogenetic relationships and identifying sequence families
- Powerful tool for extracting new sequence motifs and signature sequences
64Multiple Alignment
- Most commercial vendors offer good multiple alignment programs, including
  - GCG (Accelrys)
  - PepTool/GeneTool (BioTools Inc.)
  - LaserGene (DNAStar)
- Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
- Popular freeware includes PHYLIP and PAUP
65Multi-Align Websites
- Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
- MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
- T-Coffee http://www.ch.embnet.org/software/TCoffee.html
- MULTALIN http://www.toulouse.inra.fr/multalin.html
- CLUSTALW http://www.ebi.ac.uk/clustalw/
67T-Coffee
- Uses standard progressive alignment but with a twist to avoid local minima
- Allows the combination of a collection of multiple/pairwise, global or local alignments into a single model
- It also allows one to estimate the level of consistency of each position within the new alignment with the rest of the alignments
68Multi-alignment Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT TAGCTACGCATCGT
CTGATGGCAATGCTACGGAA..
TAGCTACGCATCGT
TAGCAGACTACCGTT
ATCGATGCGTAGC
GTTACGATGCCTT
69Contig Assembly
- Read, edit and trim DNA chromatograms
- Remove overlaps and ambiguous calls
- Read in all sequence files (10-10,000)
- Reverse complement all sequences (doubles the number of sequences to align)
- Remove vector sequences (vector trim)
- Remove regions of low complexity
- Perform multiple sequence alignment
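The reverse complement step is easy to show in code. This is a minimal sketch with a hypothetical helper name; it handles only A, C, G and T (upper or lower case) and leaves any other character, such as an ambiguous N, unchanged. The second read is the one that gets flipped on the Contig Alignment - Process slide.

```python
# Translation table for the four unambiguous bases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


reads = ["ATCGATGCGTAGC", "TGCTACGCATCG"]
# Keeping both orientations doubles the number of sequences to align
both_strands = reads + [reverse_complement(r) for r in reads]
print(reverse_complement("TGCTACGCATCG"))
```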
70Contig Assembly Multiple Alignment
- Only accept a very high sequence identity
- Accept unlimited number of end gaps
- Very high cost for opening internal gaps
- A short match with high score/residue is
preferred over a long match with low score/residue
71Assembly Parameters
- User-selected parameters
  - minimum length of overlap
  - percent identity within overlap
- Non-adjustable parameters
  - sequence quality factors
72Chromatogram Editing
73Sequence Loading
74Sequence Alignment
75Contig Alignment - Process
ATCGATGCGTAGC
TAGCAGACTACCGTT
GTTACGATGCCTT
TGCTACGCATCG
CGATGCGTAGCA
CGATGCGTAGCA
ATCGATGCGTAGC
TAGCAGACTACCGTT
GTTACGATGCCTT
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT
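The last step of the process, joining overlapping reads into one contig, can be sketched with the three reads shown above. This is a toy version under loud assumptions: the function names are hypothetical, the reads are already in the right order and orientation, overlaps must match exactly (100% identity), and the minimum overlap of 3 bases is chosen only so the example works. Real assemblers such as Phrap also use quality values and tolerate mismatches.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads, min_overlap=3):
    """Extend the contig with each read's non-overlapping tail."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_overlap)
        if k == 0:
            raise ValueError(f"no overlap of at least {min_overlap} bases for {read}")
        contig += read[k:]
    return contig


contig = merge_in_order(["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"])
print(contig)
```

Overlaps this short would be far too permissive on real data, where repeats produce many chance matches; that is why minimum overlap length and percent identity are user-selected parameters.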
76Problems for Assembly
- Repeat regions
  - Capture sequences from non-contiguous regions
- Polymorphisms
  - Cause failure to join correct regions
- Large data volume
  - Requires large numbers of pair-wise comparisons
77Sequence Assembly Programs
- Phred - base calling program that does detailed statistical analysis (UNIX) http://www.phrap.org/
- Phrap - sequence assembly program (UNIX) http://www.phrap.org/
- TIGR Assembler - microbial genomes (UNIX) http://www.tigr.org/softlab/assembler/
- The Staden Package (UNIX) http://www.mrc-lmb.cam.ac.uk/pubseq/
- GeneTool/ChromaTool/Sequencher (PC/Mac)
78Phrap
- Phrap is a program for assembling shotgun DNA sequence data
- Uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats
- Constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus
- Handles large datasets
79http://bio.ifom-firc.it/ASSEMBLY/assemble.html
80Conclusions
- Sequence alignments and database searching are key to all of bioinformatics
- There are four different methods for doing sequence comparisons: 1) Dot Plots 2) Dynamic Programming 3) Fast Alignment and 4) Multiple Alignment
- Understanding the significance of alignments requires an understanding of statistics and distributions