Database Searches - PowerPoint PPT Presentation

About This Presentation

Title:

Database Searches

Description:

In practice, we cannot use Smith-Waterman to search for sequences in ... Each inset symbol represents 1 search set sequences. z-scores computed from opt scores ... – PowerPoint PPT presentation

Slides: 19

Provided by: davidfern2

Transcript and Presenter's Notes

Title: Database Searches

1
Database Searches

FASTA

2
Database searches Why?

To discover or verify identity of a newly
sequenced gene
To find other members of a multigene family
To classify groups of genes

3
Database searching

In practice, we cannot use Smith-Waterman to
search for sequences in a database
Databases are huge (GenBank: 30 million
sequences; Swiss-Prot: >> 100,000 sequences)
S-W is slow: time is proportional to N n^2, where
n = sequence length and N = number of sequences
in the database
Instead, use faster heuristic approaches
FASTA
BLAST
Tradeoff: sensitivity vs. false positives
Smith-Waterman is slower, but more sensitive
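
To make the N n^2 cost concrete, here is a minimal Python sketch of the Smith-Waterman scoring recurrence. The scoring values (match 2, mismatch -1, linear gap -2) and the function names are illustrative assumptions, not parameters from the slides. The doubly nested loop fills one cell per pair of positions, and a database search repeats it once per database sequence.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):          # one row per residue of a
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # one cell per residue of b
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                 # start a new local alignment
                          prev[j - 1] + s,   # match or mismatch
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every sequence: N runs of the quadratic loop."""
    scored = [(smith_waterman_score(query, seq), seq) for seq in database]
    return sorted(scored, reverse=True)
```

Only two rows of the table are kept at a time, which saves memory but not time; the number of cells visited is still the product of the two sequence lengths.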

4
Dot Plots
5
Dot Plots
4-base window and 75% identity
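
A windowed dot plot of this kind can be sketched in a few lines of Python. The function name and the choice to return a list of coordinates (rather than draw a picture) are assumptions made for illustration. A dot is placed at (i, j) when the windows starting at those positions agree in at least the required fraction of positions.

```python
def dot_plot(x, y, window=4, min_identity=0.75):
    """Return (i, j) pairs where x[i:i+window] and y[j:j+window] are similar enough."""
    dots = []
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.append((i, j))
    return dots
```

With a 4-base window and a 75% threshold, a window pair earns a dot when at least 3 of its 4 bases match. Runs of dots along a diagonal then indicate similar regions.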
6
FASTA

Originally developed in 1985 by Lipman and Pearson
Goal: perform fast, approximate local alignments
to find sequences in the database that are
related to the query sequence
Based on dot plot idea

7
FASTA Step 1

Look for exact matches between words in query and
test sequence
Words are short
DNA words are usually 6 bases
Protein words are 1 or 2 amino acids
Ktup denotes word length
Use hash tables to locate words quickly

8
FASTA Details

Hashing: map a string of characters to an integer,
e.g.,
AAA → 0
AAC → 1
...
TTT → 63 (oversimplified)
Preprocess the database and create a table that
stores the locations of each possible k-tuple
20^k for amino acids (400 if k = 2),
4^k for DNA (4096 if k = 6)
Use hash codes computed from the query sequence's
k-tuples for quick lookup
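
The hashing and lookup idea can be sketched in Python as follows. The names `ktup_hash`, `build_table` and `find_hot_spots` are hypothetical, and the sketch assumes sequences contain only A, C, G and T (an ambiguity code such as N would raise a KeyError). Each word is read as a base-4 number, which reproduces the AAA → 0, AAC → 1, TTT → 63 mapping above.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktup_hash(word):
    """Read a DNA word as a base-4 number, so a k-tuple maps to 0 .. 4**k - 1."""
    code = 0
    for base in word:
        code = code * 4 + BASE_CODE[base]
    return code


def build_table(seq, k):
    """Map each k-tuple hash code to the positions where it occurs in seq."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[ktup_hash(seq[pos:pos + k])].append(pos)
    return table


def find_hot_spots(query, table, k):
    """Return (query_pos, target_pos) pairs for every exact k-tuple match."""
    hits = []
    for i in range(len(query) - k + 1):
        for j in table.get(ktup_hash(query[i:i + k]), []):
            hits.append((i, j))
    return hits
```

For example, `find_hot_spots("GACG", build_table("ACGACG", 3), 3)` looks up the query words GAC and ACG and returns their matching positions without comparing the sequences character by character.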

9
FASTA
10
FASTA Step 2

Find 10 best diagonal runs (sequence of nearby
hot spots on same diagonal)
Give each hot spot a positive score, and each
space between consecutive hot spots a negative
score that grows more negative with distance
(similar to affine gap costs in S-W)
Each diagonal run is composed of matches (hot
spots themselves) and mismatches (interspot
regions) but no indels
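
One way to sketch this step in Python: group hot spots by diagonal (target position minus query position), then scan each diagonal and keep the best-scoring stretch of consecutive hot spots. The scoring values (`hit_score`, `gap_penalty`), the linear distance penalty and the tuple layout are assumptions for illustration, not the values FASTA uses.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, k, hit_score=4, gap_penalty=1, top=10):
    """hot_spots: (query_pos, target_pos) pairs of k-tuple matches.

    Returns up to `top` runs as (score, diagonal, query_start, query_end),
    best first. A run never leaves its diagonal, so it contains no indels.
    """
    by_diag = defaultdict(list)
    for i, j in hot_spots:
        by_diag[j - i].append(i)

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best = None
        score, run_start, prev_end = 0, None, None
        for i in starts:
            # Penalise the unmatched stretch since the previous hot spot.
            penalty = 0 if prev_end is None else gap_penalty * max(0, i - prev_end)
            if run_start is None or score - penalty <= 0:
                score, run_start = hit_score, i       # start a fresh run here
            else:
                score = score - penalty + hit_score   # extend the current run
            prev_end = i + k
            if best is None or score > best[0]:
                best = (score, diag, run_start, prev_end)
        runs.append(best)

    runs.sort(reverse=True)
    return runs[:top]
```

Two adjacent hot spots on one diagonal reinforce each other, while a distant third hot spot is only absorbed into the run if its score outweighs the penalty for the gap.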

11
FASTA Step 3

Evaluate each diagonal run using an appropriate
scoring matrix and find best scoring run
Discard runs with low scores (filtration)
The highest-scoring diagonal is reported as init1
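
Rescoring can be sketched as a maximum-scoring ungapped segment search along a diagonal. In this hedged Python sketch the whole diagonal is scanned rather than only the extent of a run, and `dna_score` (match +5, mismatch -4) stands in for a real scoring matrix; both are simplifying assumptions, as are the function names.

```python
def dna_score(a, b):
    """Stand-in scoring matrix (assumed values): +5 for a match, -4 otherwise."""
    return 5 if a == b else -4


def rescore_diagonal(query, target, diag, score_pair=dna_score):
    """Best ungapped segment score on diagonal diag = target_pos - query_pos."""
    i = max(0, -diag)      # first query position that lies on this diagonal
    j = i + diag
    best = running = 0
    while i < len(query) and j < len(target):
        # Drop the segment whenever its running score would go negative.
        running = max(0, running + score_pair(query[i], target[j]))
        best = max(best, running)
        i += 1
        j += 1
    return best


def init1(query, target, diagonals, score_pair=dna_score):
    """Highest rescored value over the candidate diagonals."""
    return max(rescore_diagonal(query, target, d, score_pair) for d in diagonals)
```

Filtration then amounts to dropping diagonals whose rescored value falls below a chosen cutoff before taking the maximum.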

12
FASTA Step 4

After all diagonals are found, try to join
diagonals by adding gaps
Build a weighted directed acyclic graph over the
segments, with edges between those that could be
combined using indels
A maximum-weight path in this graph corresponds
to a local alignment; its score is reported as
initn
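
The maximum-weight path can be found with a simple dynamic program over the runs. In this Python sketch, runs are (score, diagonal, query_start, query_end) tuples with query_end exclusive, and every join costs a flat `join_penalty`; the tuple layout and the penalty value are assumptions chosen for illustration.

```python
def chain_runs(runs, join_penalty=20):
    """Best total score of a chain of compatible runs (an initn-like value).

    Run u may precede run v only if u ends before v starts in both the
    query and the target, so the edges form a directed acyclic graph.
    """
    order = sorted(runs, key=lambda r: r[2])   # query_start gives a topological order
    best_ending = []                           # best chain score ending at each run
    for idx, (score, diag, start, end) in enumerate(order):
        best = score                           # chain consisting of this run alone
        for prev_idx in range(idx):
            _, p_diag, _, p_end = order[prev_idx]
            ends_before_in_query = p_end <= start
            ends_before_in_target = p_end + p_diag <= start + diag
            if ends_before_in_query and ends_before_in_target:
                best = max(best, best_ending[prev_idx] + score - join_penalty)
        best_ending.append(best)
    return max(best_ending, default=0)
```

With a small join penalty, two strong runs on different diagonals combine into one chain; with a large penalty, the best single run wins on its own.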

13
Adding gaps
14
FASTA Step 5

If score reaches a threshold value, compute an
alternative local alignment
Form a band around init1 in dynamic programming
table
Width depends on ktup
Use Smith-Waterman to find best alignment
restricted to that band.
Result is called opt
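
A banded version of Smith-Waterman can be sketched as below. Only cells whose diagonal lies within `half_width` of `center_diag` are filled; everything outside the band is treated as unreachable. The scoring values, the linear gap penalty and the parameter names are assumptions for illustration, and the band width is left as a free parameter rather than derived from ktup.

```python
def banded_smith_waterman(a, b, center_diag, half_width,
                          match=5, mismatch=-4, gap=-6):
    """Local alignment score restricted to |(j - i) - center_diag| <= half_width."""
    unreachable = float("-inf")
    prev = {}                                   # column -> score in the previous row
    best = 0
    for i in range(1, len(a) + 1):
        lo = max(1, i + center_diag - half_width)
        hi = min(len(b), i + center_diag + half_width)
        curr = {}
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The diagonal neighbour is either in the band or on the zero border.
            diagonal = prev.get(j - 1, 0) + s
            up = prev.get(j, unreachable) + gap     # may fall outside the band
            left = curr.get(j - 1, unreachable) + gap
            curr[j] = max(0, diagonal, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

The work per row is bounded by the band width rather than the length of the database sequence, which is what makes this step affordable for many candidate sequences.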

15
FASTA Final Steps

Rank database sequences according to opt scores
Use the full Smith-Waterman method to align the query
sequence against each of the highest ranking
sequences from the database
Perform statistical analysis

16
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of b2.seq from 1 to 693 December 9, 2002 1402
TO /u/browns02/Victor/Search-set/.seq
Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunDatafastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores

z-score   obs   exp
           ()    ()
   < 20     0     0
     22     0     0
     24     3     0
     26     2     0
     28     5     0
     30    11     3
     32    19    11
     34    38    30
     36    58    61
     38    79   100
     40   134   140
     42   167   171
     44   205   189
     46   209   192
     48   177   184

17
List
The best scores are:                            init1  initn    opt    z-sc  E(1018780)..
SWPPI1_HUMAN Begin 1 End 269
! Q00169 homo sapiens (human). phosph...         1854   1854   1854  2249.3  1.8e-117
SWPPI1_RABIT Begin 1 End 269
! P48738 oryctolagus cuniculus (rabbi...         1840   1840   1840  2232.4  1.6e-116
SWPPI1_RAT Begin 1 End 270
! P16446 rattus norvegicus (rat). pho...         1543   1543   1837  2228.7  2.5e-116
SWPPI1_MOUSE Begin 1 End 270
! P53810 mus musculus (mouse). phosph...         1542   1542   1836  2227.5  2.9e-116
SWPPI2_HUMAN Begin 1 End 270
! P48739 homo sapiens (human). phosph...         1533   1533   1533  1861.0  7.7e-96
SPTREMBL_NEWBAC25830 Begin 1 End 270
! Bac25830 mus musculus (mouse). 10, ...         1488   1488   1522  1847.6  4.2e-95
SP_TREMBLQ8N5W1 Begin 1 End 268
! Q8n5w1 homo sapiens (human). simila...         1477   1477   1522  1847.6  4.3e-95
SWPPI2_RAT Begin 1 End 269
! P53812 rattus norvegicus (rat). pho...         1482   1482   1516  1840.4  1.1e-94
18
Alignments
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

                     60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                    120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                    180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                    240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
