Database Searches - PowerPoint PPT Presentation

About This Presentation
Title:

Database Searches


Slides: 19
Provided by: davidfern2

Transcript and Presenter's Notes

Title: Database Searches


1
Database Searches
  • FASTA

2
Database searches: Why?
  • To discover or verify identity of a newly
    sequenced gene
  • To find other members of a multigene family
  • To classify groups of genes

3
Database searching
  • In practice, we cannot use Smith-Waterman to
    search for sequences in a database
  • Databases are huge (GenBank: 30 million
    sequences; Swiss-Prot: >> 100,000 sequences)
  • S-W is slow: time is proportional to N n^2, where
    n = sequence length and N = number of sequences
    in the database
  • Instead, use faster heuristic approaches
  • FASTA
  • BLAST
  • Tradeoff: sensitivity vs. false positives
  • Smith-Waterman is slower, but more sensitive
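
To make the N n^2 cost concrete, here is a minimal sketch of a score-only Smith-Waterman and an exhaustive search built on it. The function names, the scoring values (+1/-1) and the linear gap penalty are assumptions chosen for illustration, not values from the slides. Each query/database comparison fills a full table of roughly n x n cells, and the search repeats that for all N database sequences.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a vs. b (linear gap penalty, assumed scores)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # start a new local alignment
                          prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def exhaustive_search(query, database):
    """Score the query against every entry: one full DP table per sequence."""
    hits = [(smith_waterman_score(query, seq), name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)
```

For example, `exhaustive_search("ACGT", {"x": "TTACGTTT", "y": "GGGG"})` ranks `x` first. The two nested loops inside, repeated once per database entry, are what heuristic methods try to avoid.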

4
Dot Plots
5
Dot Plots
4-base window and 75% identity
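
A windowed dot plot like the one described can be sketched as follows. The function name and the representation of the plot as a set of (i, j) coordinates are assumptions for illustration; with a 4-base window and a 75% threshold, a dot is placed wherever at least 3 of the 4 bases agree.

```python
def dot_plot(x, y, window=4, min_identity=0.75):
    """Return the set of (i, j) where x[i:i+window] and y[j:j+window]
    agree in at least min_identity of their positions."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots
```

Comparing a sequence with itself, e.g. `dot_plot("ACGTACGT", "ACGTACGT")`, gives the main diagonal plus off-diagonal dots for the repeated `ACGT`. Runs of dots along a diagonal are the signal that FASTA looks for.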
6
FASTA
  • Originally developed in 1985 by Lipman and Pearson
  • Goal: perform fast, approximate local alignments
    to find sequences in the database that are
    related to the query sequence
  • Based on dot plot idea

7
FASTA Step 1
  • Look for exact matches between words in query and
    test sequence
  • Words are short
  • DNA words are usually 6 bases
  • Protein words are 1 or 2 amino acids
  • ktup denotes word length
  • Use hash tables to locate words quickly

8
FASTA Details
  • Hashing: map strings of characters to integers,
    e.g.,
  • AAA → 0
  • AAC → 1
  • ...
  • TTT → 63 (oversimplified)
  • Preprocess the database and create a table that
    stores locations of each possible k-tuple
  • 20^k for amino acids (400 if k = 2)
  • 4^k for DNA (4096 if k = 6)
  • Use hash code computed from query sequence
    k-tuples for quick look up
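
The hashing and lookup described above can be sketched for DNA as follows. The base-4 encoding reproduces the AAA → 0, AAC → 1, TTT → 63 mapping. The function names and the list-of-lists table are assumptions for illustration, and the sketch only accepts the letters A, C, G and T.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktup_code(word):
    """Read a DNA word as a base-4 number (raises KeyError on other letters)."""
    code = 0
    for base in word:
        code = code * 4 + BASE_CODE[base]
    return code


def build_index(sequence, k):
    """Table with 4**k slots; slot c lists every position where word c starts."""
    table = [[] for _ in range(4 ** k)]
    for pos in range(len(sequence) - k + 1):
        table[ktup_code(sequence[pos:pos + k])].append(pos)
    return table


def hot_spots(query, index, k):
    """All (query_pos, database_pos) pairs that share an exact k-tuple."""
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index[ktup_code(query[i:i + k])]]
```

For instance, `hot_spots("ACG", build_index("TTACGACG", 3), 3)` returns `[(0, 2), (0, 5)]`. Each query word costs one table lookup rather than a scan of the database sequence.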

9
FASTA
10
FASTA Step 2
  • Find 10 best diagonal runs (sequence of nearby
    hot spots on same diagonal)
  • Give each hot spot a positive score, and each
    space between consecutive hot spots a negative
    score that decreases with distance
  • similar to affine gap costs in S-W
  • Each diagonal run is composed of matches (hot
    spots themselves) and mismatches (interspot
    regions) but no indels
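
A minimal sketch of this step, assuming hot spots are given as (query_pos, target_pos) pairs: hot spots with the same i - j lie on the same diagonal, each one adds a fixed positive score, and the stretch between consecutive hot spots subtracts a penalty that grows with its length. The score values and the function name are illustrative assumptions, not FASTA's actual parameters.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, k, hot_score=2, gap_penalty=1, keep=10):
    """Return up to `keep` runs as (score, diagonal, query_start, query_end)."""
    by_diag = defaultdict(list)
    for i, j in hot_spots:
        by_diag[i - j].append(i)            # same i - j means same diagonal

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best_score, best_span = 0, None
        score, run_start, prev_end = 0, None, None
        for i in starts:
            # Penalty grows with the distance since the previous hot spot.
            penalty = 0 if prev_end is None else gap_penalty * max(0, i - prev_end)
            if run_start is not None and score - penalty > 0:
                score = score - penalty + hot_score    # extend the current run
            else:
                score, run_start = hot_score, i        # start a new run here
            prev_end = i + k
            if score > best_score:
                best_score, best_span = score, (run_start, prev_end)
        runs.append((best_score, diag, best_span[0], best_span[1]))

    runs.sort(reverse=True)
    return runs[:keep]
```

Because a run stays on one diagonal, it can contain matches and mismatches but never an insertion or deletion.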

11
FASTA Step 3
  • Evaluate each diagonal run using an appropriate
    scoring matrix and find best scoring run
  • Discard runs with low scores (filtration)
  • The highest-scoring diagonal is reported as init1

12
FASTA Step 4
  • After all diagonals are found, try to join diagonals
    by adding gaps
  • Build a weighted directed acyclic graph over the
    segments, with edges between those that could be
    combined using indels
  • Find a maximum-weight path in this graph; it
    corresponds to a local alignment, reported as
    initn
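
The joining step can be sketched as a longest-path computation over segments. In this sketch each segment is a tuple (q_start, q_end, t_start, t_end, score) with half-open ranges, an edge from u to v exists when v begins after u ends in both sequences, and every join costs a fixed penalty. The tuple layout, the function name and the flat `join_penalty` are assumptions for illustration.

```python
def initn(segments, join_penalty=20):
    """Maximum-weight chain of compatible segments (0 if there are none)."""
    # Sorting by query start gives a topological order: any predecessor of a
    # segment must end, and therefore start, before that segment begins.
    segs = sorted(segments, key=lambda s: (s[0], s[2]))
    best = []                                  # best[v]: best chain ending at v
    for v, (qs, qe, ts, te, score) in enumerate(segs):
        best_v = score                         # chain consisting of v alone
        for u in range(v):
            _, u_qe, _, u_te, _ = segs[u]
            if u_qe <= qs and u_te <= ts:      # u precedes v in both sequences
                best_v = max(best_v, best[u] + score - join_penalty)
        best.append(best_v)
    return max(best, default=0)
```

For example, joining a segment of score 50 with a compatible one of score 40 at a penalty of 20 gives 70, which beats either segment alone. With a penalty of 60 the join no longer pays off and the best single segment wins.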

13
Adding gaps
14
FASTA Step 5
  • If score reaches a threshold value, compute an
    alternative local alignment
  • Form a band around init1 in dynamic programming
    table
  • Width depends on ktup
  • Use Smith-Waterman to find best alignment
    restricted to that band.
  • Result is called opt
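
A banded Smith-Waterman can be sketched by filling only the cells whose diagonal lies within a fixed distance of a chosen diagonal. The sketch below uses a linear gap penalty and +5/-4 scores for brevity. These values, the function name and the `half_width` parameter are assumptions for illustration, and how the width is derived from ktup is not modeled.

```python
def banded_smith_waterman(a, b, diag, half_width, match=5, mismatch=-4, gap=-6):
    """Best local score using only cells with |(i - j) - diag| <= half_width."""
    best = 0
    prev = {}                                   # row i-1: column -> score
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i - diag - half_width)
        hi = min(len(b), i - diag + half_width)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start.
            curr[j] = max(0,
                          prev.get(j - 1, 0) + sub,
                          prev.get(j, 0) + gap,
                          curr.get(j - 1, 0) + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With `a = "ACGTACGTAC"` and `b = "ACGTTACGTAC"` (one extra T in `b`), a band of half-width 1 around diagonal 0 recovers the gapped alignment, while half-width 0 is limited to the ungapped diagonal. The work is proportional to the band width times the sequence length rather than to the full table.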

15
FASTA Final Steps
  • Rank database sequences according to opt scores
  • Use the full Smith-Waterman method to align query
    sequence against each of the highest ranking
    sequences from the database
  • Perform statistical analysis
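
As a rough illustration of ranking by opt and attaching a statistic, the sketch below computes a plain z-score: how many standard deviations each opt score lies from the mean of all opt scores in the search. This is a simplified stand-in and not FASTA's actual procedure, which also corrects for the length of each database sequence. The function names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(opt_scores):
    """Plain z-score per sequence; needs at least two scores that differ."""
    mu = mean(opt_scores.values())
    sd = stdev(opt_scores.values())
    return {name: (score - mu) / sd for name, score in opt_scores.items()}


def rank_hits(opt_scores):
    """Database entries sorted from most to least significant."""
    z = z_scores(opt_scores)
    return sorted(z.items(), key=lambda item: item[1], reverse=True)
```

A caveat of this simple version is that genuinely related sequences inflate both the mean and the standard deviation, so it understates how unusual the top hits are.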

16
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunDatafastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4

Histogram Key:
  Each histogram symbol represents 4 search set sequences
  Each inset symbol represents 1 search set sequences
  z-scores computed from opt scores

z-score   obs   exp
          ()    ()
   < 20     0     0
     22     0     0
     24     3     0
     26     2     0
     28     5     0
     30    11     3
     32    19    11
     34    38    30
     36    58    61
     38    79   100
     40   134   140
     42   167   171
     44   205   189
     46   209   192
     48   177   184

17
List
The best scores are:                         init1  initn    opt    z-sc  E(1018780)..
SWPPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...      1854   1854   1854  2249.3  1.8e-117
SWPPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...      1840   1840   1840  2232.4  1.6e-116
SWPPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...      1543   1543   1837  2228.7  2.5e-116
SWPPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...      1542   1542   1836  2227.5  2.9e-116
SWPPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...      1533   1533   1533  1861.0  7.7e-96
SPTREMBL_NEWBAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ...      1488   1488   1522  1847.6  4.2e-95
SP_TREMBLQ8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...      1477   1477   1522  1847.6  4.3e-95
SWPPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...      1482   1482   1516  1840.4  1.1e-94
18
Alignments
SCORES  Init1: 1515  Initn: 1565  Opt: 1687  z-score: 1158.1  E(): 2.3e-58
>>GB_IN3DMU09374 (2038 nt)
 initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
 66.2% identity in 875 nt overlap (83-957:151-1022)

                    60        70        80        90       100       110
u39412.gb_pr  CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
DMU09374      AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                     130       140       150       160       170       180

                   120       130       140       150       160       170
u39412.gb_pr  GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
DMU09374      GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                     190       200       210       220       230       240

                   180       190       200       210       220       230
u39412.gb_pr  TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
DMU09374      AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                     250       260       270       280       290       300

                   240       250       260       270       280       290
u39412.gb_pr  AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
DMU09374      AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                     310       320       330       340       350       360