Title: Alignment of Pairs of Sequences
1 Alignment of Pairs of Sequences
Chapter-3
Luonan Chen
2 Synthesizing DNA Fragments by PCR (polymerase chain reaction)
[Figure labels: primer, DNA, heat, anneal, ddNTPA, ddNTPs]
3 Scanned Data from Electrophoresed Fragments
4 Shotgun Sequencing for DNA
[Figure labels: a large DNA molecule; repetitive sequences?]
5 Sequence Analysis
- Homology search: similarity search, posed as combinatorial optimization (sequence alignment)
- Motif search: machine learning from a motif library, collecting common properties and structures (motif, domain, fingerprint)
- Sequence alignment methods:
- Multiple sequence alignment
- Alignment of pairs of sequences
6 Definition of Sequence Alignment
- Sequence alignment is the procedure of comparing two or more sequences by searching for a series of individual characters (bases or amino acids) or character patterns that are in the same order in the sequences.
- Global alignment:
    LGPSSKQTGKGS-SRIWDN
    LN-ITKSAGKGAIMRLFDA
- Local alignment:
    -------TGKG--------
    -------AGKG--------
7 Methods for Searching Similarity
- Dot matrix analysis (intuitive)
- DP algorithm (exact)
- Word or k-tuple methods (FASTA, BLAST) (heuristic)
- Motivation: homology, motif, domain, classification, structure/function prediction, phylogenetic tree, interaction
- Similarity is a measure of the matching characters in an alignment.
- Homology is a statement of common evolutionary origin (genes are descended from a common ancestor).
8 Global vs. Local Alignments
- Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
- Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
9 Dot Matrix Analysis
- 1) Place the two sequences on the vertical and horizontal axes of a graph
- 2) Put a dot wherever there is a match
- 3) A diagonal line is a region of identity (local alignment)
- 4) Apply a window filter: put a dot when n of m positions match (window size m and stringency n; m = 15, n = 10 for DNA; m = n = 1 for protein)
- Applications: similarity between two different sequences; direct and inverted repeats for a sequence compared with itself.
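The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `dot_matrix` and `render` are made up for this example, and the two short DNA strings are only sample input.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions where a dot is drawn.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    positions seq_a[i:i+window] and seq_b[j:j+window] match.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a runs down, seq_b runs across."""
    lines = ["  " + " ".join(seq_b)]
    for i, ch in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


a, b = "ATGGGTC", "TGGATCGA"  # sample sequences, chosen only for illustration
print(render(a, b, dot_matrix(a, b)))                          # unfiltered
print(render(a, b, dot_matrix(a, b, window=4, stringency=3)))  # 4-base window, stringency 3
```

With the filter switched on, isolated chance matches disappear and only runs along a diagonal survive, which is the point of step 4.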
10 Simple Dot Matrix Analysis
11 Dot Matrix Filtered with a 4-Base Window and a Stringency of 3
12 Dot Matrix Analysis for Similarity
The amino acid sequences of the phage λ cI (horizontal sequence) and phage P22 c2 (vertical sequence) repressors. The window size and stringency are both 1.
13 Sequence Repeats by Dot Matrix
[Figure labels: polymorphic, SNP, diathesis]
14 Scoring Similarity
Mutations A↔G and T↔C (transitions) are actually more likely than A↔T and G↔C (transversions).
- 1) Can only score aligned sequences
- 2) DNA is usually scored as identical or not
- 3) Modified scoring for gaps: single vs. multiple base gaps (gap extension, affine penalty)
- 4) Amino acids have varying degrees of similarity
- a. number of mutations needed to convert one to another
- b. chemical similarity
- c. observed mutation frequencies
- 5) Score systems:
- PAM matrices: based on an evolutionary model of protein (or DNA) change (mutations), from a small data set
- BLOSUM matrices: designed to identify members of the same family, from a large data set
- Log-odds score: a score of $S$ (in half-bit units) means the pair is $2^{S/2}$-fold more likely than expected by chance
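The log-odds reading of a score can be checked with a small calculation. The sketch below assumes scores in half-bit units (the convention BLOSUM62 uses); the function names and the three frequencies are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def log_odds_half_bits(q_ij, p_i, p_j):
    """Log-odds score in half-bit units: S = 2 * log2(q_ij / (p_i * p_j)).

    q_ij is the observed frequency of the aligned pair; p_i and p_j are the
    background frequencies of the two residues.
    """
    return 2.0 * math.log2(q_ij / (p_i * p_j))


def fold_over_chance(score_half_bits):
    """Invert the score: S corresponds to a 2**(S/2)-fold enrichment."""
    return 2.0 ** (score_half_bits / 2.0)


# Hypothetical frequencies: the pair is seen 4 times as often as chance predicts.
s = round(log_odds_half_bits(q_ij=0.04, p_i=0.1, p_j=0.1))
print(s, fold_over_chance(s))
```

Rounding to an integer mirrors how published matrices store scores; the rounding is also why a matrix entry only approximates the underlying ratio.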
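15 The PAM 250 Scoring Matrix
- PAM: percent accepted mutation; 250 corresponds to 2.5 changes per position ($2.5 \times 10^{7}$ years evolutionary distance)
- $M$: transition matrix $(p_{ij})$ of PAM1
- Score: $\log\frac{M_{ij}}{\mathrm{prob}_j} = \log\frac{f_{ij}\, m_i}{f_i\, \mathrm{prob}_j} = \log\frac{f_{ij}}{100\, f\, \mathrm{prob}_i\, \mathrm{prob}_j}$
- PAM$n$ = $P^{n}$, with $P = (p_{ij})$; $f$: number of mutations
- The shorter and nearer the sequences, the smaller $n$.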
16 Example of Scoring a Sequence Alignment
- DNA (gap penalty = -2):
    seq 1:  A  T  G  G  -  T  A
    seq 2:  A  A  C  G  T  T  A
    score:  2  1  1  2 -2  2  2    Score = 2×4 + 2×1 - 2 = 8
- (Scores are set artificially. Transitions between A and G or between C and T are more probable!)
- Protein (gap opening penalty = -10, gap extension penalty = -8):
    seq 1:  V  D  S   -   -  C  Y
    seq 2:  V  E  S   L   D  C  Y
    score:  4  2  4 -10  -8  9  7    Score = 26 - 18 = 8
- (Scores are based on the BLOSUM62 matrix.)
- Results depend on the choice of a scoring system.
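Both totals on this slide can be reproduced by summing column scores. The sketch below is illustrative only: `score_alignment`, `dna_score`, `protein_score` and `SLIDE_PAIRS` are names invented here, and the protein table holds just the five pair scores that appear on the slide, not a full substitution matrix.

```python
def score_alignment(top, bottom, pair_score, gap_open, gap_extend):
    """Sum the column scores of an already-aligned pair ('-' marks a gap).

    The first column of a gap run costs `gap_open`; each further column in
    the same run costs `gap_extend`.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap = None  # row that held the gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_row = "top" if x == "-" else "bottom"
            total += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            total += pair_score(x, y)
            prev_gap = None
    return total


def dna_score(x, y):
    """Artificial DNA scores from the slide: match 2, mismatch 1."""
    return 2 if x == y else 1


# Only the pair scores shown on the slide (illustrative subset).
SLIDE_PAIRS = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
               ("C", "C"): 9, ("Y", "Y"): 7}


def protein_score(x, y):
    return SLIDE_PAIRS[(x, y)]


print(score_alignment("ATGG-TA", "AACGTTA", dna_score, -2, -2))       # 8
print(score_alignment("VDS--CY", "VESLDCY", protein_score, -10, -8))  # 8
```

Changing any of the penalties changes the total, which is the slide's closing point: the result depends on the scoring system.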
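17 DP Algorithm for Global Alignment (exact, handling gaps)
- Sequences: $a = a_1 a_2 \dots a_m$, $b = b_1 b_2 \dots b_n$
- Score: $S_{i,j} = S(a_1 a_2 \dots a_i,\ b_1 b_2 \dots b_j)$, with $s(a_i, b_j)$ from PAM
- $w_x$, $w_y$: the penalties for a gap of length $x$ and $y$ in $a$ and $b$
- $S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}(S_{i-x,j} - w_x),\ \max_{y \ge 1}(S_{i,j-y} - w_y) \,\}$
- The alignment starts from position $(m, n)$ and is traced back to $(1, 1)$.
- Computational complexity: $O(nm)$ table entries; $O(nm^2 + n^2 m)$ time with general gap penalties, for $n < m$
Can the computational complexity be reduced to $O(nm)$? Yes, for example with linear or affine gap penalties.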
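A minimal sketch of the global DP, assuming a linear gap penalty (every gap column costs the same `gap`), which is one way to reach $O(nm)$ time. The function name `needleman_wunsch`, the `simple` scoring function and the test sequences are illustrative assumptions, not taken from the slides.

```python
def needleman_wunsch(a, b, score, gap):
    """Global alignment with a linear gap penalty.

    `score(x, y)` scores an aligned pair; `gap` is a positive penalty per
    gap column. Returns (best_score, aligned_a, aligned_b).
    """
    m, n = len(a), len(b)
    # S[i][j] = best score of aligning a[:i] with b[:j]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = -i * gap
    for j in range(1, n + 1):
        S[0][j] = -j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # a_i aligned with b_j
                S[i - 1][j] - gap,                            # a_i against a gap
                S[i][j - 1] - gap,                            # b_j against a gap
            )
    # Trace back from (m, n) to the origin.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


def simple(x, y):
    """Toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


print(needleman_wunsch("ACGT", "AGT", simple, 1))  # (2, 'ACGT', 'A-GT')
```

Each cell looks at three neighbours only, so the table is filled in time proportional to its size; the general recurrence on the slide instead scans a whole row and column per cell.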
18 Dynamic Programming
- Dynamic programming is a very general programming technique.
- It is applicable when a large search space can be structured into a succession of stages, such that:
- the initial stage contains trivial solutions to sub-problems
- each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
- the final stage contains the overall solution
19 Global Alignment by the Needleman-Wunsch Algorithm
23 DP Algorithm for Local Alignment (exact, handling gaps)
- Sequences: $a = a_1 a_2 \dots a_m$, $b = b_1 b_2 \dots b_n$
- Score: $H_{i,j} = H(a_1 a_2 \dots a_i,\ b_1 b_2 \dots b_j)$, with $s(a_i, b_j)$ from PAM
- $w_x$, $w_y$: the penalties for a gap of length $x$ and $y$ in $a$ and $b$
- $H_{i,j} = \max\{\, H_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}(H_{i-x,j} - w_x),\ \max_{y \ge 1}(H_{i,j-y} - w_y),\ 0 \,\}$
- The alignment starts from the highest-scoring position and is traced back to a zero.
- Mismatches score negative, $H_{i,j} \ge 0$, and initial and end gaps carry a penalty of 0.
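The local recurrence differs from the global one only by the extra 0 and by where the traceback starts and stops. A minimal sketch, again assuming a linear gap penalty; `smith_waterman`, the `simple` scoring function and the padded test strings are illustrative assumptions.

```python
def smith_waterman(a, b, score, gap):
    """Local alignment with a linear gap penalty.

    `score(x, y)` must be negative for mismatches; `gap` is a positive
    penalty per gap column. Returns (best_score, aligned_a, aligned_b).
    """
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a_i, b_j (never below 0)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def simple(x, y):
    """Toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


print(smith_waterman("XXXTGKGYYY", "ZZZZTGKGWWW", simple, 2))  # (8, 'TGKG', 'TGKG')
```

The flanking residues never enter the result because any alignment that would score below zero is reset to zero, so the shared TGKG block is reported on its own.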
24 Local Alignment by the Smith-Waterman Algorithm
26 Improvement of Algorithm
- Computational complexity and storage: $O(mn)$
- Approximate algorithms, parallel computation
- Substitution matrix (PAM, BLOSUM)
- (PAM: mutation matrix → substitution matrix)
- Gap penalties
- Bayes alignment
- Assessing significance of a sequence alignment (comparing $S$ with the scores $R$ of random sequences)
- $P(S > R) = 1 - e^{-Kmn\,e^{-\lambda R}}$: the Gumbel extreme value distribution, not a normal distribution
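The extreme value formula is easy to evaluate directly. In the sketch below the threshold is the score being tested, `m` and `n` are the sequence lengths, and `K` and `lam` depend on the scoring system; the values passed in the example call are placeholders for illustration, not fitted parameters. The function name `alignment_p_value` is an assumption of this example.

```python
import math


def alignment_p_value(score, m, n, K, lam):
    """P = 1 - exp(-K * m * n * exp(-lam * score)).

    Probability that two random sequences of lengths m and n reach at least
    `score`, under the Gumbel extreme value model.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


# Placeholder parameters (K = 0.1, lam = 0.3), chosen only for illustration.
for s in (20, 40, 60):
    print(s, alignment_p_value(s, m=250, n=250, K=0.1, lam=0.3))
```

Because the tail decays exponentially in the score rather than as a Gaussian, a modest increase in score lowers the probability by orders of magnitude.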
27 What Program to Use for Searching?
- 1) BLAST is fastest and easily accessed on the Web
- limited sets of databases
- nice translation tools (BLASTX, TBLASTN)
- 2) FASTA works best in GCG
- integrated with GCG
- precise choice of databases
- more sensitive for DNA-DNA comparisons
- FASTX and TFASTX can find similarities in sequences with frameshifts
- 3) Smith-Waterman is slower, but more sensitive
- known as a rigorous or exhaustive search
- SSEARCH in GCG and standalone FASTA
28 FASTA
- 1) Derived from logic of the dot plot
- compute best diagonals from all frames of alignment
- 2) Word method looks for exact matches between words in query and test sequence
- hash tables (fast computer technique)
- DNA words are usually 6 bases
- protein words are 1 or 2 amino acids
- only searches for diagonals in regions of word matches, which makes searching faster
29 Query and Hash Table
Query: A T G G G T C
Test sequence: T G G A T C G A
2-Tuple
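The word lookup for this slide's two sequences can be sketched as follows. The helper names `build_word_table` and `diagonal_hits` are invented for this example; the offset is taken as query position minus test position, which is one common convention and an assumption here.

```python
from collections import Counter, defaultdict


def build_word_table(seq, k):
    """Hash table mapping each k-tuple of `seq` to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, test, k):
    """Count exact k-tuple matches per diagonal (offset = query pos - test pos)."""
    table = build_word_table(query, k)
    hits = Counter()
    for j in range(len(test) - k + 1):
        for i in table.get(test[j:j + k], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("ATGGGTC", "TGGATCGA", 2)
print(sorted(hits.items()))   # [(-3, 1), (1, 3), (2, 1)]
print(hits.most_common(1))    # [(1, 3)]
```

Offset 1 collects three word matches (TG, GG, TC), so it is the diagonal a FASTA-style search would examine first; no full dot matrix is ever built.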
30 FASTA Algorithm
31 Makes Longest Diagonal
- 3) After all diagonals are found, tries to join diagonals by adding gaps (connects diagonals with close offset values using restricted DP with gaps)
- 4) Computes alignments in regions of best diagonals
32 FASTA Alignments
33 FASTA on the Web
- Many websites offer FASTA searches
- Various databases and various other services
- Be sure to use FASTA 3
- Each server has its limits
- Be aware that you are depending on the kindness
of strangers.
34 BLAST
- Uses word matching like FASTA
- Similarity matching of words (3 amino acids, 11 bases)
- does not require identical words
- If no words are similar, then no alignment
- won't find matches for very short sequences
- Does not handle gaps.
- Lower sensitivity but faster than FASTA (about 10 times)
- (good for motifs and the like, due to high consensus without gaps)
- Uses a finite automaton for pattern recognition
- Newer gapped BLAST and PSI-BLAST are better
35 BLAST Algorithm
Add similar words besides those in the query.
36 Extend Hits One Base at a Time
The extended hits are called HSPs.
37 HSPs Are Aligned Regions
- The results of the word matching and attempts to extend the alignment are segments called HSPs (High-scoring Segment Pairs)
- BLAST often produces several short HSPs rather
than a single aligned region
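Ungapped extension of a word hit into an HSP can be sketched as below. This is a simplified illustration, not BLAST's actual procedure: the function name `extend_hit`, the drop-off parameter `x_drop`, and the +1/-1 scoring are assumptions made for this example, and real BLAST scores residues with a substitution matrix and its own cutoffs.

```python
def extend_hit(query, subject, q_start, s_start, word_len, score, x_drop):
    """Extend a word hit without gaps, one position at a time, in both directions.

    Extension in a direction stops when the running score falls more than
    `x_drop` below the best score seen so far in that direction.
    Returns (hsp_score, q_begin, s_begin, length).
    """
    def walk(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length  # keep the best extension so far
            elif best - running > x_drop:
                break                             # score dropped too far: stop
            qi += step
            si += step
        return best, best_len

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(word_len))
    right, right_len = walk(q_start + word_len, s_start + word_len, +1)
    left, left_len = walk(q_start - 1, s_start - 1, -1)
    return (seed + left + right, q_start - left_len, s_start - left_len,
            left_len + word_len + right_len)


def simple(x, y):
    """Toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Word hit "TG" at query position 1 and subject position 0.
print(extend_hit("ATGGGTC", "TGGATCGA", 1, 0, 2, simple, x_drop=2))  # (4, 1, 0, 6)
print(extend_hit("ATGGGTC", "TGGATCGA", 1, 0, 2, simple, x_drop=0))  # (3, 1, 0, 3)
```

With a tolerant drop-off the extension steps over the single mismatch and returns one HSP of length 6; with no tolerance it stops at the mismatch, which is how several short HSPs can arise from one similar region.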
38 Gapped BLAST and PSI-BLAST
- Ungapped extension for finding HSPs
- Using a window (e.g. 11), let the HSP with the highest score be a seed
- Gapped extension for the seed by DP.
- PSI-Blast can be used for multiple sequence
alignment.
39 Genome Alignment
- How to match a protein or mRNA to genomic sequence?
- There is a Genome BLAST server at NCBI
- Each of the genome websites has a similar search function
- What about introns?
- An intron is penalized as a gap, or each exon is treated as a separate alignment with its own E-value
- Need a search algorithm that looks for consensus intron splice sites and points in the alignment where similarity drops off.