Title: Heuristic approaches
1Heuristic approaches and scoring matrices
- M.Prasad Naidu
- MSc Medical Biochemistry, Ph.D.
2Introduction
- These methods are represented by two algorithms:
- BLAST
- FASTA
- FASTA is an algorithm developed by Pearson and Lipman. It is generally regarded as more sensitive than BLAST.
- BLAST is an algorithm developed by Altschul et al. in 1990. It finds high-scoring local alignments between two sequences. Gapped versions are now available.
3BLASTP algorithm
- The BLAST algorithm involves the following steps.
- Breaking the query sequence into words of a defined size.
- Finding a match that seeds an HSP (high-scoring segment pair).
- Aligning the word and extending the alignment.
4Breaking of the sequence into defined word size
- Query: AILDTGATGDA
- Word size: 4
- Words: AILD ILDT LDTG DTGA TGAT GATG ATGD TGDA
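The word list on this slide can be reproduced with a sliding window. The following is a minimal sketch; the function name split_into_words is hypothetical and chosen only for illustration.

```python
def split_into_words(sequence, word_size):
    """Return every overlapping word of length word_size, left to right."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


words = split_into_words("AILDTGATGDA", 4)
print(" ".join(words))
# prints: AILD ILDT LDTG DTGA TGAT GATG ATGD TGDA
```

A query of length 11 with word size 4 gives 11 - 4 + 1 = 8 overlapping words, as listed on the slide.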
5Finding a High scoring Pair
- Subject: MQVWGWAILDTVATDAAMLL
- Word:    ......AILD
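One way to locate such word matches is to index every word of the subject and then look up each query word. This is a simplified sketch that only finds exact matches; the function name find_word_hits is an assumption made for this example, and real BLASTP also accepts similar words that score above a threshold.

```python
def find_word_hits(query, subject, word_size):
    """Return (query_index, subject_index) pairs of exact word matches (0-based)."""
    # Index every word of the subject by its start positions.
    positions = {}
    for s in range(len(subject) - word_size + 1):
        positions.setdefault(subject[s:s + word_size], []).append(s)

    # Look up each query word in the index.
    hits = []
    for q in range(len(query) - word_size + 1):
        for s in positions.get(query[q:q + word_size], []):
            hits.append((q, s))
    return hits


hits = find_word_hits("AILDTGATGDA", "MQVWGWAILDTVATDAAMLL", 4)
print(hits)
# prints: [(0, 6), (1, 7)]
```

The word AILD of the query matches the subject at 0-based position 6, and the next word ILDT matches at position 7, so both hits lie on the same diagonal.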
6Extending the alignment
- Subject: MQVWGWAILDTVATDAAMLL
- Query:   ......AILDTGATGDA
Parameters in a BLAST result:
- Percentage of homology
- Score of the alignment
- Number of residues aligned
- E-value
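The extension step on this slide can be sketched as an ungapped extension to the right of the seed word. The scoring used here (match +1, mismatch -1) and the x_drop cut-off of 2 are assumptions chosen for illustration; BLASTP itself scores residue pairs with a substitution matrix and also extends to the left. The function name extend_hit is hypothetical.

```python
def extend_hit(query, subject, q_start, s_start, word_size,
               match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit to the right without gaps.

    Returns (best_score, aligned_length) of the best-scoring extension.
    """
    score = word_size * match          # the seed word matches exactly
    best_score, best_len = score, word_size
    offset = word_size
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        if score > best_score:
            best_score, best_len = score, offset + 1
        elif best_score - score > x_drop:
            break                      # score fell too far below the best
        offset += 1
    return best_score, best_len


query = "AILDTGATGDA"
subject = "MQVWGWAILDTVATDAAMLL"
score, length = extend_hit(query, subject, q_start=0, s_start=6, word_size=4)
print(subject[6:6 + length])
print(query[0:length])
print("score:", score)
# prints:
# AILDTVAT
# AILDTGAT
# score: 6
```

Under these assumed scores the seed AILD grows into an 8-residue segment pair with seven identities and one mismatch.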
7FastA algorithm
- The word size in the FASTA algorithm is defined as the k-tuple (ktup).
- Generally the k-tuple is between 3 and 6 for nucleotide sequences and 1 or 2 for protein sequences.
- The FASTA algorithm involves steps similar to those of the BLAST tool, but the procedure for generating the alignment is different.
8Breaking of the sequence into defined k-tuple
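- Residue:   F  A  M  L  G  F  I  K  Y  L   P   G   C   M
- Position:  1  2  3  4  5  6  7  8  9  10  11  12  13  14
- Lookup table of the positions at which each residue occurs (- means the residue is absent):
Residue   A  B  C   D  E  F     G      H  I  K  L      M
Position  2  -  13  -  -  1, 6  5, 12  -  7  8  4, 10  3, 14

Residue   N  P   Q  R  S  T  V  W  Y  Z
Position  -  11  -  -  -  -  -  -  9  -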
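The lookup table on this slide can be built in a single pass over the sequence. This is a minimal sketch with a k-tuple of 1, as on the slide; the function name build_lookup and the parameter name ktup are chosen for illustration.

```python
from collections import defaultdict


def build_lookup(sequence, ktup=1):
    """Map each k-tuple to the 1-based positions at which it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        table[sequence[i:i + ktup]].append(i + 1)
    return dict(table)


lookup = build_lookup("FAMLGFIKYLPGCM")
for residue in sorted(lookup):
    print(residue, lookup[residue])
```

Residues that never occur in the sequence simply have no entry in the dictionary, which corresponds to the empty columns of the table.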
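9Offsets between the two sequences
- For each residue of the second sequence, the offset is its position in the first sequence minus its position in the second sequence (n/a means the residue does not occur in the first sequence):
Residue   T    G      F      I  K  Y  L      P  G      A   C   T
Position  1    2      3      4  5  6  7      8  9      10  11  12
Offset    n/a  3, 10  -2, 3  3  3  3  -3, 3  3  -4, 3  -8  2   n/a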
The most frequently occurring offset (position in the first sequence minus position in the second) is 3, so the alignment starts after skipping the first three residues of the first sequence.
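The offset counting described above can be sketched as follows, assuming a k-tuple of 1 and 1-based positions as on the slides. The function names are hypothetical. The offset is the position in the first sequence minus the position in the second.

```python
from collections import Counter, defaultdict


def build_lookup(sequence, ktup=1):
    """Map each k-tuple of the first sequence to its 1-based positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        table[sequence[i:i + ktup]].append(i + 1)
    return table


def offset_counts(seq1, seq2, ktup=1):
    """Count how often each offset (position in seq1 - position in seq2) occurs."""
    lookup = build_lookup(seq1, ktup)
    counts = Counter()
    for j in range(len(seq2) - ktup + 1):
        for i in lookup.get(seq2[j:j + ktup], []):
            counts[i - (j + 1)] += 1
    return counts


counts = offset_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_offset, n_hits = counts.most_common(1)[0]
print(best_offset, n_hits)
# prints: 3 8
```

Offset 3 is supported by eight matching residues, while every other offset occurs only once, which is why the alignment is anchored at that offset.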
10Alignment of the sequences
- Sequence 1: F A M L G F I K Y L P G C M
- Sequence 2: . . . T G F I K Y L P G A C T
Parameters in a FASTA result:
- Percentage of homology
- Score of the alignment
- Number of residues aligned
- P-score
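As a rough illustration of how the number of aligned residues and the percentage of identical residues could be read off this alignment, the sketch below pairs the two sequences at an offset of 3 without gaps. The function name ungapped_pairs is hypothetical, and this is a simplification: FASTA itself rescores the best regions with a substitution matrix and then refines the alignment with gaps.

```python
def ungapped_pairs(seq1, seq2, offset):
    """Pair seq2[j] with seq1[j + offset] wherever both residues exist."""
    pairs = []
    for j, residue2 in enumerate(seq2):
        i = j + offset
        if 0 <= i < len(seq1):
            pairs.append((seq1[i], residue2))
    return pairs


seq1 = "FAMLGFIKYLPGCM"
seq2 = "TGFIKYLPGACT"
pairs = ungapped_pairs(seq1, seq2, offset=3)
identities = sum(1 for a, b in pairs if a == b)
percent_identity = 100.0 * identities / len(pairs)
print(len(pairs), identities, round(percent_identity, 1))
# prints: 11 8 72.7
```

Eleven residue pairs overlap at this offset and eight of them are identical (G F I K Y L P G).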
11Scoring schemes
- Identity scoring matrix
- Residue-to-residue scores are represented here in the form of similarity.
- A 4 x 4 matrix is built for the nucleotides and a 20 x 20 matrix for the amino acids.
- In the matrix below a match scores 1 and a mismatch scores 0; a mismatch penalty of -1 is also used.
  A T G C
A 1 0 0 0
T 0 1 0 0
G 0 0 1 0
C 0 0 0 1
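An identity matrix of this kind is easy to generate for any alphabet. The sketch below is illustrative only; the function names are hypothetical, and the mismatch score is left as a parameter so that either 0 or -1 can be used.

```python
def identity_matrix(alphabet, match=1, mismatch=0):
    """Build a nested dict: matrix[a][b] is the score for aligning a with b."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the pairwise scores of two equal-length sequences (no gaps)."""
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))


dna = identity_matrix("ATGC")                            # 4 x 4
protein = identity_matrix("ARNDCQEGHILKMFPSTWYV")        # 20 x 20

print(score_ungapped("ATGCA", "ATGGA", dna))
# prints: 4
print(score_ungapped("ATGCA", "ATGGA", identity_matrix("ATGC", mismatch=-1)))
# prints: 3
```

The two sequences agree at four of five positions, so the score is 4 with a mismatch score of 0 and 3 with a mismatch score of -1.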
12PAM Matrices
- These were first developed by Margaret Dayhoff and co-workers in 1978.
- The model assumes that evolutionary changes follow a Markov model, i.e. a residue change occurs independently of the previous mutations.
- One PAM is a unit of evolutionary divergence in which 1% of the amino acids have changed. This does not imply that 100 PAM leaves every amino acid different, because a position can change more than once or change back.
- Dayhoff and co-workers calculated the frequencies of accepted mutations for 1 PAM by analyzing closely related families of sequences.
- The scores are represented as log-odds ratios.
- The 1 PAM table can be extended to any number of PAMs: the table for N PAM is obtained by multiplying the 1 PAM mutation probability matrix by itself N times.
- For closely related protein sequences a lower-distance PAM is used, and a higher PAM is used for more divergent proteins.
- For example, PAM 30 is used for closely related proteins and PAM 250 for divergent ones.
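The extension from 1 PAM to N PAM and the conversion to log-odds scores can be illustrated with a toy example. Everything numeric below is an assumption made for illustration: a made-up three-letter alphabet, a symmetric 1 PAM matrix in which each residue is unchanged with probability 0.99, and uniform background frequencies. These are not Dayhoff's values. Rows index the original residue and columns the replacement.

```python
from math import log10


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times (n = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m, freqs):
    """Score = 10 * log10(observed probability / background frequency), rounded."""
    size = len(m)
    return [[round(10 * log10(m[i][j] / freqs[j])) for j in range(size)]
            for i in range(size)]


# Hypothetical 1 PAM matrix: 1% chance of change per residue.
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]
freqs = [1 / 3, 1 / 3, 1 / 3]

toy_pam100 = mat_power(toy_pam1, 100)
toy_pam250 = mat_power(toy_pam1, 250)
print("unchanged after 100 PAM:", round(toy_pam100[0][0], 3))
print("unchanged after 250 PAM:", round(toy_pam250[0][0], 3))
print(log_odds(toy_pam100, freqs))
```

In this toy model the probability that a residue is unchanged falls as N grows but never reaches zero, which illustrates why 100 PAM does not mean that every residue differs.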
13PAM 250 scoring matrix
14BLOSUM Matrices
- These matrices were developed by Henikoff and Henikoff in 1992.
- The matrices are constructed in a similar fashion to the PAM matrices.
- The data were derived from local alignments of distantly related proteins deposited in the BLOCKS database.
- BLOSUM 30 is used for comparing highly divergent sequences and BLOSUM 90 is used for closely related proteins.
- The most commonly used BLOSUM matrix is BLOSUM 62, which is derived from blocks in which sequences sharing 62% identity or more are clustered together.
15BLOSUM 62 Matrix