Title: Database Searches
1Database Searches
2Database searches: Why?
- To discover or verify identity of a newly sequenced gene
- To find other members of a multigene family
- To classify groups of genes
3Database searching
- In practice, we cannot use Smith-Waterman to search for sequences in a database
- Databases are huge (GenBank: 30 million sequences; Swiss-Prot: >> 100,000 sequences)
- S-W is slow: time is proportional to N × n^2, where n = sequence length and N = number of sequences in the database
- Instead, use faster heuristic approaches
- FASTA
- BLAST
- Tradeoff: sensitivity vs. false positives
- Smith-Waterman is slower, but more sensitive
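To make the N × n^2 cost concrete, here is a minimal sketch of an exhaustive Smith-Waterman database scan. The scoring values (match, mismatch, gap) and the function names are assumptions chosen for illustration, and a linear gap penalty is used for brevity.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    Fills a len(a) x len(b) table, so the work is about n^2 for two
    sequences of length n.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                  # start a new local alignment
                          prev[j - 1] + s,    # match or mismatch
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against all N sequences: N full tables in total."""
    hits = [(smith_waterman_score(query, seq), name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    db = {"s1": "TTACGA", "s2": "GGGG"}
    print(search_database("ACGTT", db))
```

Every database sequence gets its own full table, which is what the heuristic methods avoid.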
4Dot Plots
5Dot Plots
4-base window and 75% identity
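A windowed dot plot with these settings can be sketched as follows. This is an illustrative version: the function names and the text rendering are assumptions, and a dot is placed at (i, j) when the windows starting at those positions reach the identity threshold.

```python
def dot_plot(seq1, seq2, window=4, min_identity=0.75):
    """Return the set of (i, j) window start positions that get a dot."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots, window=4):
    """Draw the plot as text: one row per seq1 window, one column per seq2 window."""
    rows = []
    for i in range(len(seq1) - window + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len(seq2) - window + 1)))
    return rows


if __name__ == "__main__":
    a, b = "ACGTACGTTT", "ACGTACCTTT"
    for row in render(a, b, dot_plot(a, b)):
        print(row)
```

Related regions show up as runs of dots along a diagonal, which is the idea FASTA builds on.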
6FASTA
- Originally developed in 1985 by Lipman and Pearson
- Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
- Based on the dot plot idea
7FASTA Step 1
- Look for exact matches between words in query and test sequence (these matches are the hot spots used in later steps)
- Words are short
- DNA words are usually 6 bases
- Protein words are 1 or 2 amino acids
- Ktup denotes word length
- Use hash tables to locate words quickly
8FASTA Details
- Hashing: map a string of characters to an integer, e.g.,
- AAA → 0
- AAC → 1
- ...
- TTT → 63 (oversimplified)
- Preprocess the database and create a table that stores the locations of each possible k-tuple
- 20^k for amino acids (400 if k = 2),
- 4^k for DNA (4096 if k = 6),
- Use the hash code computed from query sequence k-tuples for quick lookup
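The hashing and lookup described above can be sketched as below. The base-4 encoding reproduces the AAA → 0, TTT → 63 example; the function names are hypothetical, and the sketch assumes sequences contain only A, C, G and T.

```python
from collections import defaultdict

DNA_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktup_hash(word):
    """Read a DNA word as a base-4 number (AAA -> 0, AAC -> 1, TTT -> 63)."""
    code = 0
    for base in word:
        code = code * 4 + DNA_CODE[base]
    return code


def build_index(seq, k):
    """Table from hash code to every position where that k-tuple occurs."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[ktup_hash(seq[pos:pos + k])].append(pos)
    return index


def find_hot_spots(query, index, k):
    """Look up each query k-tuple; return (query_pos, target_pos) pairs."""
    hot_spots = []
    for i in range(len(query) - k + 1):
        for j in index.get(ktup_hash(query[i:i + k]), []):
            hot_spots.append((i, j))
    return hot_spots


if __name__ == "__main__":
    index = build_index("ACGTACG", k=3)
    print(find_hot_spots("TACG", index, k=3))
```

Each query word costs one table lookup, instead of a scan over the whole test sequence.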
9FASTA
10FASTA Step 2
- Find the 10 best diagonal runs (a diagonal run is a sequence of nearby hot spots on the same diagonal)
- Give each hot spot a positive score, and each space between consecutive hot spots a negative score that decreases with distance
- similar to affine gap costs in S-W
- Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions) but no indels
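One way to sketch the diagonal-run search: group hot spots by diagonal (query position minus target position), then scan each diagonal, adding a fixed score per hot spot and subtracting a penalty proportional to the space between consecutive hot spots. The linear penalty, the default scores and the function name are assumptions for illustration, not the exact scheme used by the FASTA program.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, k, hit_score=None, gap_penalty=1, top=10):
    """Return up to `top` runs as (score, diagonal, query_start, query_end)."""
    hit_score = k if hit_score is None else hit_score
    by_diag = defaultdict(list)
    for i, j in hot_spots:
        by_diag[i - j].append(i)  # same i - j means same diagonal

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best = (0, None, None)
        score, run_start, prev_end = 0, None, None
        for s in starts:
            if run_start is not None:
                # Penalize the interspot region since the previous hot spot.
                score -= gap_penalty * max(0, s - prev_end)
            if run_start is None or score <= 0:
                score, run_start = 0, s  # start a fresh run here
            score += hit_score
            prev_end = s + k
            if score > best[0]:
                best = (score, run_start, prev_end)
        runs.append((best[0], diag, best[1], best[2]))

    runs.sort(reverse=True)
    return runs[:top]


if __name__ == "__main__":
    spots = [(0, 0), (3, 3), (20, 20), (5, 1)]
    print(best_diagonal_runs(spots, k=3))
```

In the example, the hot spot at position 20 is too far from the first two to extend their run, so the best run on diagonal 0 covers query positions 0 to 6.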
11FASTA Step 3
- Evaluate each diagonal run using an appropriate scoring matrix and find the best-scoring run
- Discard runs with low scores (filtration)
- The highest-scoring diagonal is reported as init1
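A rough sketch of this rescoring step follows. A simple match/mismatch function stands in for a real scoring matrix, and the cutoff value and function names are assumptions. Each run is given as (diagonal, query_start, query_end), where diagonal is query position minus target position.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Stand-in for a scoring-matrix lookup (assumed values)."""
    return match if a == b else mismatch


def rescore_diagonal(query, target, diag, q_start, q_end):
    """Best-scoring ungapped sub-segment of one diagonal run.

    Returns (score, segment_start, segment_end) in query coordinates.
    """
    best = (0, q_start, q_start)
    score, seg_start = 0, q_start
    for i in range(q_start, q_end):
        score += score_pair(query[i], target[i - diag])
        if score <= 0:
            score, seg_start = 0, i + 1  # drop a non-positive prefix
        elif score > best[0]:
            best = (score, seg_start, i + 1)
    return best


def best_init1(query, target, runs, cutoff=10):
    """Rescore runs, discard those below cutoff, return the best one or None."""
    kept = []
    for diag, q_start, q_end in runs:
        score, s, e = rescore_diagonal(query, target, diag, q_start, q_end)
        if score >= cutoff:
            kept.append((score, diag, s, e))
    return max(kept, default=None)


if __name__ == "__main__":
    runs = [(2, 2, 6), (0, 6, 8)]
    print(best_init1("GGACGTAC", "ACGTTTAC", runs, cutoff=15))
```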
12FASTA Step 4
- After all diagonals are found, try to join diagonals by adding gaps
- Use a weighted directed acyclic graph between segments, with edges representing those which could be combined using indels
- Find a maximum-weight path in this graph; it corresponds to a local alignment, reported as initn
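The joining step can be sketched as a longest-path computation. Here each segment is a tuple (q_start, q_end, t_start, t_end, score), an edge u → v exists when v begins at or after the end of u in both sequences, and every join costs a fixed penalty. The constant penalty and the function name are assumptions made for illustration.

```python
def initn_score(segments, join_penalty=20):
    """Maximum-weight path through compatible diagonal segments."""
    # Sorting by start position gives a topological order: an edge u -> v
    # requires u to end before v starts, so u always sorts before v.
    segs = sorted(segments, key=lambda s: (s[0], s[2]))
    best = []  # best[i] = weight of the best path ending at segment i
    for idx, (qs, qe, ts, te, score) in enumerate(segs):
        best_here = score  # path consisting of this segment alone
        for p in range(idx):
            _, p_qe, _, p_te, _ = segs[p]
            if p_qe <= qs and p_te <= ts:  # p can precede this segment
                best_here = max(best_here, best[p] + score - join_penalty)
        best.append(best_here)
    return max(best, default=0)


if __name__ == "__main__":
    segments = [(0, 10, 0, 10, 50), (12, 20, 15, 23, 40), (5, 15, 30, 40, 30)]
    print(initn_score(segments))
```

In the example, the first two segments can be chained (50 + 40 - 20 = 70), while the third overlaps the first in the query and cannot follow it.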
13Adding gaps
14FASTA Step 5
- If score reaches a threshold value, compute an alternative local alignment
- Form a band around init1 in the dynamic programming table
- Width depends on ktup
- Use Smith-Waterman to find the best alignment restricted to that band.
- Result is called opt
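A banded Smith-Waterman can be sketched by filling only the cells within a fixed distance of the init1 diagonal. The scores, the linear gap penalty and the parameter names (diag, half_width) are assumptions for illustration; diag is the query position minus the target position of the init1 diagonal.

```python
def banded_smith_waterman(a, b, diag, half_width, match=5, mismatch=-4, gap=-8):
    """Best local alignment score using only cells with |(i - j) - diag| <= half_width."""
    neg_inf = float("-inf")
    prev = {}  # column j -> score in the previous row (band cells only)
    best = 0
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i - diag - half_width)
        hi = min(len(b), i - diag + half_width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,           # diagonal move
                        prev.get(j, neg_inf) + gap,       # gap in b
                        curr.get(j - 1, neg_inf) + gap)   # gap in a
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # The query has one extra T, so the best alignment needs a single gap.
    print(banded_smith_waterman("ACGTTACGT", "ACGTACGT", diag=0, half_width=1))
```

Cells outside the band are treated as unreachable, so the work per row is bounded by the band width rather than by the length of the database sequence.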
15FASTA Final Steps
- Rank database sequences according to opt scores
- Use the full Smith-Waterman method to align the query sequence against each of the highest-ranking sequences from the database
- Perform statistical analysis
16!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of b2.seq from 1 to 693 December 9, 2002 1402
TO /u/browns02/Victor/Search-set/.seq
Sequences: 2,050   Symbols: 913,285   Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunDatafastadna.cmp
Constant pamfactor used
Gap creation penalty: 16   Gap extension penalty: 4

Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores

z-score   obs   exp
   < 20     0     0
     22     0     0
     24     3     0
     26     2     0
     28     5     0
     30    11     3
     32    19    11
     34    38    30
     36    58    61
     38    79   100
     40   134   140
     42   167   171
     44   205   189
     46   209   192
     48   177   184
17List
The best scores are:                            init1  initn    opt    z-sc  E(1018780)..
SWPPI1_HUMAN Begin 1 End 269
! Q00169 homo sapiens (human). phosph...         1854   1854   1854  2249.3  1.8e-117
SWPPI1_RABIT Begin 1 End 269
! P48738 oryctolagus cuniculus (rabbi...         1840   1840   1840  2232.4  1.6e-116
SWPPI1_RAT Begin 1 End 270
! P16446 rattus norvegicus (rat). pho...         1543   1543   1837  2228.7  2.5e-116
SWPPI1_MOUSE Begin 1 End 270
! P53810 mus musculus (mouse). phosph...         1542   1542   1836  2227.5  2.9e-116
SWPPI2_HUMAN Begin 1 End 270
! P48739 homo sapiens (human). phosph...         1533   1533   1533  1861.0  7.7e-96
SPTREMBL_NEWBAC25830 Begin 1 End 270
! Bac25830 mus musculus (mouse). 10, ...         1488   1488   1522  1847.6  4.2e-95
SP_TREMBLQ8N5W1 Begin 1 End 268
! Q8n5w1 homo sapiens (human). simila...         1477   1477   1522  1847.6  4.3e-95
SWPPI2_RAT Begin 1 End 269
! P53812 rattus norvegicus (rat). pho...         1482   1482   1516  1840.4  1.1e-94
18Alignments
SCORES   Init1: 1515   Initn: 1565   Opt: 1687   z-score: 1158.1   E(): 2.3e-58
>>GB_IN3DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360