CAP5510

About This Presentation

Slides: 41

Provided by: tamerk

Learn more at: http://www.cise.ufl.edu

Transcript and Presenter's Notes

1
CAP5510 Bioinformatics: Database Searches for
Biological Sequences or Imperfect Alignments

Tamer Kahveci
CISE Department
University of Florida

2
Goals

Understand how major heuristic methods for
sequence comparison work
FASTA
BLAST
Understand how search results are evaluated

3
What is Database Search?

Find a particular (usually short) sequence in a
database of sequences (or in one huge sequence).
The problem is identical to local sequence alignment,
but on a much larger scale.
We must also have some idea of the significance
of a database hit.
Databases always return some kind of hit; how
much attention should be paid to the result?
A similar problem is the global alignment of two
large sequences.
General idea: good alignments contain high-scoring
regions.

4
Imperfect Alignment

What is an imperfect alignment?
Why imperfect alignment?
The result may not be optimal.
Finding the optimal alignment is usually too costly in
terms of time and memory.

5
Database Search Methods

Hash table based methods
FASTA family
FASTP, FASTA, TFASTA, FASTAX, FASTAY
BLAST family
BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ,
MegaBLAST, PsiBLAST, PhiBLAST
Others
FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
Suffix tree based methods
Mummer, AVID, Reputer, MGA, QUASAR

6
History of sequence searching

1970 Needleman-Wunsch (NW)
1981 Smith-Waterman (SW)
1985 FASTA
1990 BLAST

7
Hash Table
8
Hash Table

K-gram: subsequence of length K
A^K entries
A is the alphabet size
Linear time construction
Constant lookup time
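
As a minimal sketch of the hash table described on this slide, the snippet below maps every k-gram of a sequence to its start positions. The function name build_kgram_index and the example sequence are illustrative assumptions, not part of the original slides; a Python dictionary stands in for the A^K-entry table.

```python
from collections import defaultdict


def build_kgram_index(sequence, k):
    """Map each k-gram of `sequence` to the list of its 0-based start positions.

    Hypothetical helper for illustration: one pass over the sequence,
    so construction is linear in the sequence length for a fixed k.
    """
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


# Example: index the 2-grams of a short protein fragment.
index = build_kgram_index("ncspta", 2)
print(dict(index))
```

Looking up a k-gram is then a single dictionary access, for example index.get("cs"), which is expected constant time.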

9
FASTP

Lipman and Pearson, 1985

10
FASTP

Three-phase algorithm
Find short good matches using k-grams
k = 1 or 2
Find start and end positions for good matches
Use DP to align good matches

11
FASTP Phase 1 (1)
position   1 2 3 4 5 6 7 8 9 10 11
protein 1  n c s p t a . . . .  .
protein 2  . . . . . a c s p r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
a            6                       6                        0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -

Note the common offset for the 3 amino acids c, s and p.
A possible alignment can be quickly found:
protein 1  n c s p t a
protein 2  a c s p r k

12
FASTP Phase 1 (2)

Similar to dot plot
Offsets range from 1 - m to n - 1 (n and m are the
lengths of proteins A and B)
Each offset is scored as
(number of matches) - (number of mismatches)
Diagonals (offsets) with large score show local
similarities
How does it depend on k?
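
The offset bookkeeping of Phase 1 can be sketched as follows. This is a simplified illustration, not the FASTP implementation: it only counts k-gram matches per offset (posA - posB) and leaves out the mismatch penalty. The function name diagonal_hits is an assumption, and the dots in the second sequence stand for the unspecified residues in the slide's example.

```python
from collections import Counter, defaultdict


def diagonal_hits(seq_a, seq_b, k=1):
    """Count k-gram matches on each diagonal, keyed by offset posA - posB.

    Positions are 1-based to match the slide's example. Only matches
    are counted; the mismatch penalty is omitted in this sketch.
    """
    positions = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions[seq_a[i:i + k]].append(i + 1)

    hits = Counter()
    for j in range(len(seq_b) - k + 1):
        for pos_a in positions.get(seq_b[j:j + k], ()):
            hits[pos_a - (j + 1)] += 1
    return hits


# Protein 1 occupies positions 1-6; protein 2 has a, c, s, p, r, k at 6-11.
hits = diagonal_hits("ncspta", ".....acsprk", k=1)
best_offset, best_count = hits.most_common(1)[0]
print(best_offset, best_count)
```

With a larger k, fewer chance matches land on each diagonal, which makes the scan faster but less sensitive to short or diverged similarities.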

13
FASTP Phase 2

5 best diagonal runs are found
Rescore these 5 regions using PAM250.
Initial score
Indels are not considered yet

14
FASTP Phase 3

Sort the aligned regions in descending score
Optimize these alignments using Needleman-Wunsch
Report the results

15
FASTP - Discussion

Results are not optimal. Why?
How does performance compare to Smith-Waterman?
What is the impact of k?
How does this idea work for DNA?
k = 4 or 6 for DNA

16
FASTA Improvement Over FASTP

Pearson 1995

17
FASTA (1)

Phase 2: choose the 10 best diagonal runs instead of 5

18
FASTA (2)

Phase 2.5
Eliminate diagonals that score less than a given
threshold.
Combine matches to find longer matches. Each join
incurs a penalty similar to a gap penalty.

19
FASTA Variations

TFASTAX and TFASTAY: protein query against a DNA
library in all reading frames
FASTAX and FASTAY: DNA query in all reading frames
against a protein database

20
BLAST

Altschul, Gish, Miller, Myers, Lipman, 1990

21
BLAST (or BLASTP)

BLAST: Basic Local Alignment Search Tool
An approximation of Smith-Waterman
Designed for database searches
Short query sequence against long database
sequence or a database of many sequences
Sacrifices search sensitivity for speed

22
BLAST Algorithm (1)

Eliminate low complexity regions from the query
sequence.
Replace them with X (protein) or N (DNA)
Build a hash table on the query sequence.
k = 3 for proteins

23
BLAST Algorithm (2)

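For each k-gram of the query, find all k-grams that
align with it with a score of at least the cutoff T
using BLOSUM62
20^k candidates
About 50 neighbors on average per k-gram
About 50n for the entire query of length n
Build the hash table

Example query: PQGMCGPFILGTYC
Query k-grams: PQG, QGM, ...
Candidate neighbors of PQG, with scores:
PQG 18, PEG 15, PRG 14, PSG 13, PQA 12
With T = 13, PQA (score 12) is rejected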
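
The neighborhood step can be sketched by brute force: enumerate all 20^k words and keep those that score at least T against the query word. The function names word_score and neighborhood are illustrative assumptions. To stay self-contained, the snippet hard-codes only the three BLOSUM62 rows needed for the word PQG; they were transcribed by hand and should be checked against an authoritative copy of the matrix before being reused.

```python
from itertools import product

# Column order used by the hand-transcribed rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumption: only the BLOSUM62 rows for P, Q and G are included,
# which is enough to score candidates against the query word "PQG".
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(word, candidate):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(
        BLOSUM62_ROWS[w][AMINO_ACIDS.index(c)] for w, c in zip(word, candidate)
    )


def neighborhood(word, threshold):
    """Return {candidate: score} for every candidate scoring >= threshold."""
    result = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            result[candidate] = score
    return result


neighbors = neighborhood("PQG", 13)
for candidate, score in sorted(neighbors.items(), key=lambda kv: -kv[1]):
    print(candidate, score)
```

Real implementations avoid scanning all 20^k words for every query k-gram; the exhaustive loop is used here only because it is the most direct statement of the definition.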
24
BLAST Algorithm (3)

Sequentially scan the database and locate each
k-gram in the hash table
Each match is a seed for an ungapped alignment.
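
A minimal sketch of the scan, assuming exact k-gram matches only (the neighborhood words are left out to keep the example short). The names build_query_index and find_seeds, and the toy database string, are hypothetical.

```python
from collections import defaultdict


def build_query_index(query, k):
    """Hash table from each k-gram of the query to its 0-based positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, database, k):
    """Scan the database left to right and return (query_pos, db_pos) seeds."""
    index = build_query_index(query, k)
    seeds = []
    for j in range(len(database) - k + 1):
        for i in index.get(database[j:j + k], ()):
            seeds.append((i, j))
    return seeds


seeds = find_seeds("PQGMCGPFILGTYC", "AAPQGMCAAAFILGAA", 3)
print(seeds)
```

Each seed lies on the diagonal db_pos - query_pos, so seeds that share a diagonal are candidates for the same ungapped alignment.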

25
BLAST Algorithm (4)

HSP (High Scoring Pair): a match between a query
word and the database
Find a hit: two non-overlapping HSPs on the same
diagonal within distance A
Extend the hit until the score drops more than a
threshold value, X, below the best score seen so far
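
The extension rule can be sketched as an ungapped, one-directional extension that stops once the running score has dropped more than X below the best score seen so far. This is a simplified illustration under stated assumptions: it uses +1/-1 match/mismatch scoring instead of BLOSUM62, extends only to the right, and the name extend_right is hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right from (q_start, s_start).

    Returns (best_score, best_length). The extension stops when the
    running score falls more than `x_drop` below the best score so far.
    """
    score = best_score = 0
    length = best_length = 0
    i, j = q_start, s_start
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            break
        i += 1
        j += 1
    return best_score, best_length


result = extend_right("ACGTACGTTT", "ACGTACGAAA", 0, 0, x_drop=2)
print(result)
```

A full extension would run the same loop to the left of the seed as well and add the two best scores together.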

26
BLAST Algorithm (5)

Keep only the extended matches that have a score
at least S.
Determine the statistical significance of the
result

27
What is Statistical Significance?

Two one-on-one games, two scores.
Which result is more significant?
Expected: may be a random result.
Unexpected: significant, and may carry real meaning.

(Figure: the two games, each shown with the scores 13 and 15.)
28
Statistical Significance

E-value: the expected number of matches with a
score of at least S
E = K m n e^{-\lambda S}
m, n: sequence lengths
S: alignment score
K, \lambda: normalization parameters
P-value: the probability of having at least one
match with a score of at least S
P = 1 - e^{-E}
The smaller these values are, the more
significant the result
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
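
As a small worked example of the two formulas, E = K m n e^{-\lambda S} and P = 1 - e^{-E}, the sketch below evaluates them for a few scores. The function names and all numeric values (K, \lambda, the sequence lengths and the scores) are illustrative assumptions, not values taken from the slides.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such match: 1 - exp(-E).

    Written with expm1 so that very small E values keep their precision.
    """
    return -math.expm1(-e)


# Illustrative parameters only (assumed, not from the slides).
K, lam = 0.134, 0.318
m, n = 250, 1_000_000

for score in (30, 50, 80):
    e = e_value(score, m, n, K, lam)
    print(score, e, p_value(e))
```

For small E the two quantities are nearly equal, which is why E-values are often read as if they were probabilities when they are well below 1.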

29
BLAST - Analysis

k (k-gram length)
Lower: more sensitive, but slower.
T (neighbor cutoff)
Lower: finds more distant neighbors, but introduces noise.
X (extension cutoff)
Higher: lower chance of getting stuck in a local
minimum, but slower.

30
Sample Query

http://www.ncbi.nlm.nih.gov/BLAST/

Dhal_ecoli
I D R A M S A A R G V F E R G D W S L S S P A K
R K A V L N K L A D L M E A H A E E L A L L E T L
D T G K P I R H S L R D D I P G A A R A I R W Y A
E A I D K V Y G E V A T T S S H E L A M I V R E P
V G V I A A I V P W N F P L L L T C W K L G P A L
A A G N S V I L K P S E K S P L S A I R L A G L A
K E A G L P D G V L N V V T G F G H E A G Q A L S
R H N D I D A I A F T G S T R T G K Q L L K D A G
D S N M K R V W L E A G G K S A N I V F A D C P D
L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R
L L L E E S I A D E F L A L L K Q Q A Q N W Q P G
H P L D P A T T M G T L I D C A H A D S V H S F I
R E G E S K G Q L L L D G R N A G L A A A I G P T
I F V D V D P N A S L S R E E I F G P V L V V T R
F T S E E Q A L Q L A N D S Q Y G L G A A V W T R
D L S R A H R M S R R L K A G S V F V N N Y N D G
D M T V P F G G Y K Q S G N G R D K S L H A L E K
F T E L K T I W I
31
BLASTN

BLAST for nucleic acids
k = 11
Exact match instead of neighborhood search.

32
BLAST Variations
Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped
33
Even More Variations