Alignment of Pairs of Sequence

1
Alignment of Pairs of Sequence
Chapter-3
Luonan Chen
2
Synthesizing DNA Fragments by PCR (polymerase
chain reaction)
[Figure labels: DNA, primer, heat, anneal, ddNTP A, ddNTPs]
3
Scanned Data from electrophoresed fragments
4
Shotgun Sequencing for DNA
Repetitive sequences ?
A large DNA molecule
5
Sequence Analysis

Homology search: similarity search,
combinatorial optimization (sequence alignment)
Motif search: machine learning from a motif library,
collecting common properties and structures
(motif, domain, fingerprint)

Sequence Alignment Methods
Multiple sequence alignment
Alignment of pairs of sequences
6
Definition of Sequence Alignment

Sequence alignment is the procedure of comparing
two or more sequences by searching for a series
of individual characters (bases or amino acids)
or character patterns that are in the same order
in the sequences.

Global Alignment
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLFDA

Local Alignment
--------TGKG---------
--------AGKG---------
7
Methods for Searching Similarity

Dot matrix analysis (intuitive)
DP algorithm (exact)
Word or k-tuple methods (FASTA, BLAST) (heuristic)

Motivation: homology, motif, domain,
classification, structure/function prediction,
phylogenetic tree, interaction
Similarity is a measure of the matching
characters in an alignment.
Homology is a statement of common evolutionary
origin (genes are descended from a common
ancestor).

8
Global vs. Local Alignments

Global alignment algorithms start at the
beginning of two sequences and add gaps to each
until the end of one is reached.
Local alignment algorithms find the region (or
regions) of highest similarity between two
sequences and build the alignment outward from
there.

9
Dot Matrix Analysis

1) place the two sequences on the vertical and horizontal axes
of a graph
2) put a dot wherever there is a match
3) a diagonal line is a region of identity (local
alignment)
4) apply a window filter: put a dot when at least n of
m consecutive positions match
(window size m and stringency n: m = 15, n = 10
for DNA; m = n = 1 for protein)
--- Applications: similarity between two different
sequences; direct and inverted repeats when a
sequence is compared with itself.
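
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_matrix` and `render` and the text rendering are assumptions, not part of the original slides. A dot is recorded at (i, j) when at least `stringency` of the `window` positions starting there are identical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) start positions that receive a dot.

    A dot is placed when at least `stringency` of the `window` aligned
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, ch in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "ACGTAC", "ACCTAC"
    # 4-base window with stringency 3, as in the filtered example slide
    print(render(a, b, dot_matrix(a, b, window=4, stringency=3)))
```

Comparing a sequence with itself through the same function would show repeats as off-diagonal runs of dots.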

10
Simple Dot Matrix Analysis
11
Dot matrix filtered with a 4-base window and a
stringency of 3
12
Dot matrix analysis for similarity
The amino acid sequences of the phage λ cI
(horizontal sequence) and phage P22 c2 (vertical
sequence) repressors. The window size and
stringency are both 1.
13
Sequence Repeats by Dot Matrix
Polymorphic, SNP
diathesis
14
Scoring Similarity
In practice, the mutations A↔G and T↔C (transitions) are more likely
than A↔T and G↔C (transversions).

1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) Modified scoring for gaps: single vs.
multiple base gaps (gap extension, affine
penalty)
4) Amino acids have varying degrees of similarity
a. number of mutations needed to convert one to another
b. chemical similarity
c. observed mutation frequencies
5) Score systems:
PAM matrices: based on an evolutionary model of
protein (or DNA) change (mutations), from a small data set
BLOSUM matrices: designed to identify members
of the same family, from a large data set
-- Log-odds score: a score $S_{nm}$ means the pair is
$2^{S_{nm}/2}$-fold more likely than expected by chance
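
As a rough illustration of the log-odds idea, the sketch below converts frequencies into a score in half-bit units and back into a fold-over-chance ratio. The function names and the example frequencies are hypothetical; real PAM and BLOSUM tables are built from curated data and rounded to integers.

```python
import math


def log_odds_half_bits(pair_freq, freq_a, freq_b):
    """Score S = 2 * log2(q_ab / (p_a * p_b)), i.e. in half-bit units.

    pair_freq is the observed frequency of the aligned pair (hypothetical
    input here); freq_a and freq_b are the background frequencies.
    """
    return 2 * math.log2(pair_freq / (freq_a * freq_b))


def fold_over_chance(score_half_bits):
    """Invert the score: a score S means 2**(S/2)-fold over chance."""
    return 2 ** (score_half_bits / 2)


if __name__ == "__main__":
    # Hypothetical numbers: the pair is seen 4 times more often than chance.
    s = log_odds_half_bits(pair_freq=0.04, freq_a=0.1, freq_b=0.1)
    print(round(s), fold_over_chance(round(s)))
```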

15
The PAM 250 scoring matrix
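PAM: percent accepted mutation. PAM250: 250% change, i.e. 2.5
changes per position ($2.5 \times 10^7$ years evolutionary
distance)
$M = (p_{ij})$: transition matrix of PAM1
Score of PAM1: $\log(M_{ij}/\mathrm{prob}_j) = \log\big(f_{ij}\, m_i / (f_i\, \mathrm{prob}_j)\big) = \log\big(f_{ij} / (100\, f\, \mathrm{prob}_i\, \mathrm{prob}_j)\big)$
PAMn: $M^n$; f: number of mutations
The shorter and more closely related the sequences, the smaller n.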
16
Example of Scoring a Sequence Alignment

DNA (gap penalty = -2):
        A   T   G   G   -   T   A
        A   A   C   G   T   T   A
score   2   1   1   2  -2   2   2     Score = 2×4 + 2×1 - 2 = 8
(scores are set artificially; transitions between
A and G or between C and T are more probable!)

Protein (gap opening penalty = -10, gap extension penalty = -8):
        V   D   S   -   -   C   Y
        V   E   S   L   D   C   Y
score   4   2   4 -10  -8   9   7     Score = 26 - 18 = 8
(the substitution scores shown are BLOSUM62 values)
--- Results depend on the choice of a scoring
system.
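
The column-by-column scoring in this example can be reproduced with a short function. This is a sketch under stated assumptions: the name `score_alignment` is made up, the pair scores are just the column scores shown on the slide, and a gap run is charged the opening penalty for its first column and the extension penalty for each further column, without distinguishing which sequence carries the gap.

```python
def score_alignment(row_a, row_b, pair_score, gap_open, gap_extend):
    """Sum the column scores of an alignment given as two equal-length rows.

    '-' marks a gap. The first column of a gap run costs `gap_open`,
    every further column of the same run costs `gap_extend`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(x, y)
            in_gap = False
    return total


# DNA example: identity scores 2, any mismatch scores 1, every gap column -2.
dna = score_alignment("ATGG-TA", "AACGTTA",
                      lambda x, y: 2 if x == y else 1,
                      gap_open=-2, gap_extend=-2)

# Protein example: only the pair scores that appear on the slide.
slide_scores = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
                ("C", "C"): 9, ("Y", "Y"): 7}
protein = score_alignment("VDS--CY", "VESLDCY",
                          lambda x, y: slide_scores[(x, y)],
                          gap_open=-10, gap_extend=-8)

if __name__ == "__main__":
    print(dna, protein)  # both alignments score 8
```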

17
DP Algorithm for Global Alignment(exact,
handling gap)

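Sequences: $a = a_1 a_2 \cdots a_m$, $b = b_1 b_2 \cdots b_n$
Score: $S_{i,j} = S(a_1 a_2 \cdots a_i,\ b_1 b_2 \cdots b_j)$; $s(a_i, b_j)$ from PAM
$w_x$, $w_y$: the penalties for a gap of length x and y
in a and b
$$S_{i,j} = \max\Big\{\, S_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}\big(S_{i-x,j} - w_x\big),\ \max_{y \ge 1}\big(S_{i,j-y} - w_y\big) \Big\}$$
-- The alignment starts from position (m,n) and is traced
back to (1,1)
-- Computational complexity: from $O(nm)$ up to $O(nm^2 + n^2 m)$,
for $n < m$

Can the computational complexity be reduced to $O(nm)$?
Yes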
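
A minimal sketch of the global recurrence is given below, assuming a linear gap penalty (each gap column costs the same `gap` value), which is the simplest case where the table fills in O(nm) time. The function name, the default penalty, and the tie-breaking order in the traceback are illustrative choices, not taken from the slides.

```python
def needleman_wunsch(a, b, score, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (best score, aligned a, aligned b). `score(x, y)` is the
    substitution score; `gap` is added for every gap column.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap              # a prefix aligned only to gaps
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align a_i, b_j
                S[i - 1][j] + gap,                            # gap in b
                S[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from (m, n) to the origin, re-deriving each choice.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    # Same artificial DNA scores as in the scoring example: 2 / 1 / -2.
    print(needleman_wunsch("ATGGTA", "AACGTTA", lambda x, y: 2 if x == y else 1))
```

Several alignments can share the optimal score, so the rows returned depend on the tie-breaking order; only the score is unique.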
18
Dynamic Programming

Dynamic Programming is a very general programming
technique.
It is applicable when a large search space can be
structured into a succession of stages, such
that
-- the initial stage contains trivial solutions to
sub-problems
-- each partial solution in a later stage can be
calculated by recurring on a fixed number of partial
solutions in an earlier stage
-- the final stage contains the overall solution

19
Global Alignment by Needleman-Wunsch Algorithm
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
DP Algorithm for Local Alignment(exact, handling
gap)

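Sequences: $a = a_1 a_2 \cdots a_m$, $b = b_1 b_2 \cdots b_n$
Score: $H_{i,j} = H(a_1 a_2 \cdots a_i,\ b_1 b_2 \cdots b_j)$; $s(a_i, b_j)$ from PAM
$w_x$, $w_y$: the penalties for a gap of length x and y
in a and b
$$H_{i,j} = \max\Big\{\, H_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}\big(H_{i-x,j} - w_x\big),\ \max_{y \ge 1}\big(H_{i,j-y} - w_y\big),\ 0 \Big\}$$
-- The alignment starts from the highest-scoring position and is
traced back to a zero
-- Negative scores for mismatches, $H_{i,j} \ge 0$,
initial end gap penalty 0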
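
The local variant differs from the global one only in the extra 0 in the maximum and in where the traceback starts and stops. The sketch below again assumes a linear gap penalty; the function name and the default penalty are illustrative.

```python
def smith_waterman(a, b, score, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned segment of a, aligned segment of b).
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,                                            # start afresh
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    # Mismatches must score negatively, otherwise local alignments never end.
    print(smith_waterman("CCCTGKGCCC", "AAATGKGAAA",
                         lambda x, y: 2 if x == y else -1))
```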

24
Local Alignment by the Smith-Waterman Algorithm
25
(No Transcript)
26
Improvement of Algorithm

Computational complexity and storage: O(mn)
Approximate algorithms, parallel computation
Substitution matrix (PAM, BLOSUM)
(PAM mutation matrix → substitution matrix)
Gap penalties
Bayes alignment
Assessing significance of a sequence alignment
(comparing the score S with scores R of random
sequences)
$P(S > R) = 1 - e^{-Kmn\,e^{-\lambda R}}$: the Gumbel extreme value
distribution, not the normal distribution.
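
The extreme value formula, read as 1 - exp(-K m n exp(-lambda R)), is easy to evaluate directly. In the sketch below the function name is made up and the values of K and lambda are placeholders chosen only for illustration; in practice they depend on the scoring system and have to be estimated.

```python
import math


def gumbel_p_value(score, m, n, K, lam):
    """Probability that a random alignment scores above `score`.

    Implements 1 - exp(-K * m * n * exp(-lam * score)) for sequences of
    length m and n. expm1 keeps precision when the result is tiny.
    """
    return -math.expm1(-K * m * n * math.exp(-lam * score))


if __name__ == "__main__":
    # Placeholder parameters, not taken from any real scoring matrix.
    for s in (20, 40, 60):
        print(s, gumbel_p_value(s, m=300, n=300, K=0.1, lam=0.3))
```

The probability falls off roughly exponentially with the score, which is why a normal approximation would overstate the significance of high scores.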

27
What program to use for searching?

1) BLAST is fastest and easily accessed on the
Web
limited sets of databases
nice translation tools (BLASTX, TBLASTN)
2) FASTA works best in GCG
integrated with GCG
precise choice of databases
more sensitive for DNA-DNA comparisons
FASTX and TFASTX can find similarities in
sequences with frameshifts
3) Smith-Waterman is slower, but more sensitive
known as a rigorous or exhaustive search
SSEARCH in GCG and standalone FASTA

28
FASTA

1) Derived from the logic of the dot plot
compute best diagonals from all frames of
alignment
2) Word method looks for exact matches between
words in query and test sequence
hash tables (fast computer technique)
DNA words are usually 6 bases
protein words are 1 or 2 amino acids
only searches for diagonals in regions of word
matches, which makes searching faster

29
Query and Hash Table
Query: A T G G G T C
Test sequence: T G G A T C G A
2-tuple
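
The word lookup can be sketched with an ordinary dictionary as the hash table, using the query and test sequence from this slide with 2-tuples. The function names are hypothetical, and the sketch only counts word hits per diagonal (offset = query position minus test position); it does not rescore or join diagonals as the full method does.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Hash table mapping each k-letter word to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k):
    """Count exact k-tuple matches on each diagonal.

    A diagonal is identified by its offset: query position minus
    target position. Many hits on one offset suggest a similar region.
    """
    index = word_index(query, k)
    hits = Counter()
    for t_pos in range(len(target) - k + 1):
        for q_pos in index.get(target[t_pos:t_pos + k], ()):
            hits[q_pos - t_pos] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("ATGGGTC", "TGGATCGA", k=2)
    print(hits.most_common())  # offset 1 collects the most word hits
```

Only the positions listed in the table are ever compared, which is where the speed-up over a full dot matrix comes from.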
30
FASTA Algorithm
31
Makes Longest Diagonal

3) after all diagonals are found, tries to join
diagonals by adding gaps (connects segments
with close offset values by restricted DP with
gaps)
4) computes alignments in regions of best
diagonals

32
FASTA Alignments
33
FASTA on the Web

Many websites offer FASTA searches
Various databases and various other services
Be sure to use FASTA 3
Each server has its limits
Be aware that you are depending on the kindness
of strangers.

34
BLAST

Uses word matching like FASTA
Similarity matching of words (3 amino acids, 11 bases);
does not require identical words.
If no words are similar, then no alignment is found;
it won't find matches for very short sequences
Does not handle gaps.
Lower sensitivity but faster than FASTA (10
times)
(good for motifs etc., due to high
consensus without gaps)
Uses a finite automaton for pattern recognition
Newer gapped BLAST (and PSI-BLAST) is better

35
BLAST Algorithm
Add similar words besides those in the query.
36
Extend hits one base at a time;
the extended hits are called HSPs
37
HSPs are Aligned Regions

The results of the word matching and attempts to
extend the alignment are segments
- called HSPs (High scoring Segment Pairs)
BLAST often produces several short HSPs rather
than a single aligned region
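
Extending a word hit into an HSP can be sketched as an ungapped walk in both directions that remembers the best score seen so far. The stopping rule used here (give up once the running score falls more than `x_drop` below the best) is an assumption made for the sketch, as are the function name and the scores; the slides only say that hits are extended one base at a time.

```python
def extend_hit(query, target, q_pos, t_pos, word_len, score, x_drop=5):
    """Extend a word hit without gaps in both directions.

    Returns (HSP score, query start, target start, HSP length).
    """
    seed = sum(score(query[q_pos + k], target[t_pos + k]) for k in range(word_len))

    def extend(q, t, step):
        best = run = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            run += score(query[q], target[t])
            length += 1
            if run > best:
                best, best_len = run, length     # keep the best extent so far
            elif best - run > x_drop:
                break                            # score dropped too far
            q += step
            t += step
        return best, best_len

    right_gain, right_len = extend(q_pos + word_len, t_pos + word_len, +1)
    left_gain, left_len = extend(q_pos - 1, t_pos - 1, -1)
    return (seed + left_gain + right_gain,
            q_pos - left_len,
            t_pos - left_len,
            left_len + word_len + right_len)


if __name__ == "__main__":
    match = lambda x, y: 1 if x == y else -2
    # The word TGG matches at position 3 in both sequences.
    print(extend_hit("AAATGGCATTT", "CCATGGCACCC", 3, 3, 3, match, x_drop=3))
```

Because the walk never inserts gaps, two similar regions separated by an insertion come out as two separate HSPs, which matches the observation above that BLAST often reports several short HSPs.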

38
Gapped Blast and PSI-Blast

Ungapped extension for finding HSP
Using window (e.g. 11), let HSP with highest
scores be a seed
Gapped extension for the seed by DP.
PSI-Blast can be used for multiple sequence
alignment.

39
Genome Alignment

How to match a protein or mRNA to genomic
sequence?
There is a Genome BLAST server at NCBI
Each of the Genome websites has a similar search
function
What about introns?
An intron is penalized as a gap, or each exon is
treated as a separate alignment with its own
e-score
Need a search algorithm that looks for consensus
intron splice sites and points in the alignment
where similarity drops off.