1
Sequence Alignment and Approaches to Database
Searching
  • Jessica Kissinger 2001

2
Why do we align sequences?
  • To discover functional, structural and
    evolutionary similarities
  • Because similarity may be an indicator of
    homology and thus provide some insight into
    function or gene identification.

3
Origins of similar sequences
[Figure: diagram with labels: gene A; gene duplication producing A1 and A2; speciation into Species A and Species B, each carrying A1 and A2; gene conversion; horizontal gene transfer.]
4
The various algorithms
  • Dynamic programming algorithms provide a rigorous
    mathematical approach to sequence alignment.
    They are guaranteed to find the best alignment
    for a given scoring matrix and gap penalty.
  • Local alignments, as opposed to global alignments,
    are better for DB searching and for finding
    similar domains
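
As a rough illustration of the dynamic programming idea behind local alignment, here is a minimal Smith-Waterman sketch that returns only the best local score. The match, mismatch and gap values are illustrative assumptions, not values from the slides; a real search would use a scoring matrix and a separate gap opening and extension penalty.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions only.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 is what makes the alignment local:
            # a poor prefix is dropped instead of carried along.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("AAAGCTTTT", "CCGCTCC"))
```

Because every cell is compared against every scoring choice, the result is the optimum for the chosen scores, which is the guarantee the first bullet refers to.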

5
Scoring Matrices are designed to detect signal
above background, to detect similarities beyond
what would be observed by chance alone
6
Why do we need these matrices?
  • Database searching
  • Need different levels of sensitivity
  • Close relationships (Low PAM, high BLOSUM)
  • Distant relationships (High PAM, low BLOSUM)

7
Dot Plot Nuts & Bolts
[Figure: dot plot, word size 1. Horizontal sequence: g c t g g a a g g c a t g c. Vertical sequence: a g a g c a c t.]
8
Dot Plot Nuts & Bolts
[Figure: dot plot, word size 2. Horizontal sequence: g c t g g a a g g c a t g c. Vertical sequence: a g a g c a c t.]
9
Dot Plot Nuts & Bolts
[Figure: dot plot, word size 3. Horizontal sequence: g c t g g a a g g c a t g c. Vertical sequence: a g a g c a c t.]
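
The effect of word size on a dot plot can be sketched in a few lines. The two short sequences below are taken as gctggaaggcatgc and agagcact, which is an assumption about how the slide's axes read; the function names are hypothetical.

```python
def dot_plot(seq_x, seq_y, word_size=1):
    """Return the set of (x, y) positions where words of length
    word_size starting at x in seq_x and y in seq_y are identical."""
    dots = set()
    for x in range(len(seq_x) - word_size + 1):
        for y in range(len(seq_y) - word_size + 1):
            if seq_x[x:x + word_size] == seq_y[y:y + word_size]:
                dots.add((x, y))
    return dots


def render(seq_x, seq_y, word_size=1):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    dots = dot_plot(seq_x, seq_y, word_size)
    lines = ["  " + " ".join(seq_x)]
    for y, residue in enumerate(seq_y):
        cells = ["*" if (x, y) in dots else "." for x in range(len(seq_x))]
        lines.append(residue + " " + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    horizontal = "gctggaaggcatgc"   # assumed reading of the slide
    vertical = "agagcact"           # assumed reading of the slide
    for w in (1, 2, 3):
        print(f"word size {w}")
        print(render(horizontal, vertical, w))
```

Raising the word size removes most of the chance matches, so only runs of consecutive identical residues remain visible.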
10
Plasmodium falciparum circumsporozoite protein
MMRKLAILSVSSFLFVEALFQEYQCYGSSSNTRVLNELNYDNAGTNLYNE
LEMNYYGKQENWYSLKKNSRSLGENDDGNN NNGDNGREGKDEDKRDGNN
EDNEKLRKPKHKKLKQPGDGNPDPNANPNVDPNANPNVDPNANPNVDPNA
NPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNAN
PNANPNANPNVDPNANPNANPNANPNANPNANPNANPNANPN ANPNANP
NANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNKNNQGNGQG
HNMPNDPNRNVDENANANNAVKN NNNEEPSDKHIEQYLKKIKNSISTEW
SPCSVTCGNGIQVRIKPGSANKPKDELDYENDIEKKICKMEKCSSVFNVV
NSSI GLIMVLSFLFLN
Plasmodium vivax circumsporozoite protein
MKNFILLAVSSILLVDLFPTHCGHNVDLSKAINLNGVNFNNVDASSLGAA
HVGQSASRGRGLGENPDDEEGDAKKKKDGK KAEPKNPRENKLKQPGDRA
DGQPAGDRADGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRADGQPA
GDRADGQPAGD RADGQPAGDRAAGQPAGDRAAGQPAGDRADGQPAGDRA
AGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAG DRAAGQP
AGDRAAGQPAGDRAAGQPAGNGAGGQAAGGNAGGGQGQNNEGANAPNEKS
VKEYLDKVRATVGTEWTPCSVTC GVGVRVRRRVNAANKKPEDLTLNDLE
TDVCTMDKCAGIFNVVSNSLGLVILLVLALFN
11
[Figure: dot plot of the Plasmodium falciparum CS protein against the Plasmodium vivax CS protein, window 2]
12
[Figure: dot plot of the Plasmodium falciparum CS protein against the Plasmodium vivax CS protein, window 7]
13
Database Searching
  • Database Searching ≠ Sequence alignment
  • Database searching is the application of
    knowledge gained from previous experiments to the
    problem of gene discovery
  • Similarity ≠ Homology

14
Database Searching
  • The Assumptions
  • The sequences being sought have an evolutionary
    ancestral sequence in common with the query
    sequence
  • The best guess at the actual path of evolution is
    the path that requires the fewest evolutionary
    events (most parsimonious)
  • Not all substitutions are equally likely, and
    they should be weighted accordingly
  • Insertions and deletions are less likely than
    substitutions and should be weighted accordingly

15
Database Searching
  • Applied Considerations
  • The choice of search algorithm influences the
    sensitivity and selectivity of the search
  • The choice of matrix determines both the pattern
    and the extent of substitution in the sequences
    the database search is most likely to discover

16
Protein vs Nucleotide
  • Which molecules should you search with?
  • Which databases should you search, nucleotide or
    protein?

17
Why can't we just look at the DNA sequence for
the protein?
  • It was once thought that we might be able to
    calculate a minimum mutation matrix, i.e. one in
    which the minimum number of steps needed to
    change from one aa to another is counted. The
    problem is that, because of the degeneracy of the
    genetic code, likely and unlikely mutations
    often receive the same score

18
BLAST
  • BLAST is less sensitive than Smith-Waterman (SW)
  • Basic BLAST uses a word size of 3 for proteins
    and is more sensitive than FASTA (even though
    FASTA uses a word size of 2)
  • Basic BLAST uses a word size of 11 or 12 for
    nucleic acid sequences
  • The heuristic is applied to the words in BLAST
    via a threshold value, T, for alignments of
    words.

19
Basic BLAST Algorithms
  • BLASTN - compares a nucleotide query to a
    nucleotide database
  • BLASTP - compares a protein query to a protein
    database
  • BLASTX - compares a nucleotide query sequence
    translated in all reading frames against a
    protein sequence database
  • TBLASTN - compares a protein query sequence
    against a nucleotide sequence database
    dynamically translated in all reading frames.
  • TBLASTX - compares the six-frame translations of
    a nucleotide query sequence against the six-frame
    translations of a nucleotide sequence database.
    Please note that the tblastx program cannot be used
    with the nr database on the BLAST Web page.

20
BLAST Nuts & Bolts
  • Search a database with initial words and the
    expanded word set (neighborhood words) with scores
    above some threshold, T.
  • If a match is found between a word and a DB
    entry, attempt to extend the alignment until the
    score falls off by some value, i.e. the score is
    no longer maximal
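
The neighborhood-word step can be sketched as follows. To keep the example small it uses only the A, R, N and D entries of the PAM 120 matrix and a four-letter alphabet; that restriction, the function names, and the choice to keep words scoring at or above T are assumptions made for illustration.

```python
from itertools import product

# A/R/N/D corner of the PAM 120 matrix (lower triangle).
PAM120_SUBSET = {
    ("A", "A"): 3,
    ("R", "A"): -3, ("R", "R"): 6,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 4,
    ("D", "A"): 0, ("D", "R"): -3, ("D", "N"): 2, ("D", "D"): 5,
}


def pair_score(a, b):
    """Symmetric lookup into the lower-triangular table."""
    if (a, b) in PAM120_SUBSET:
        return PAM120_SUBSET[(a, b)]
    return PAM120_SUBSET[(b, a)]


def neighborhood(word, alphabet, threshold):
    """All words of the same length whose score against `word`
    is at least `threshold`, mapped to that score."""
    neighbors = {}
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(w, c) for w, c in zip(word, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


if __name__ == "__main__":
    # Query word RND with T = 11 over the reduced alphabet.
    print(neighborhood("RND", "ARND", 11))
```

Lowering T enlarges the word set and so raises sensitivity at the cost of more hits to examine.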

21
BLAST in a Nutshell
22
PAM 120
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3
R  -3   6
N  -1  -1   4
D   0  -3   2   5
C  -3  -4  -5  -7   9
Q  -1   1   0   1  -7   6
E   0  -3   1   3  -7   2   5
G   1  -4   0   0  -4  -3  -1   5
H  -3   1   2   0  -4   3  -1  -4   7
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12
Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8
V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5
B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4
Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4
X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
23
Blast Extension
Database Sequence T G Y A A S S S T Y M Q V G P
R E G V L K
P R E G A I
Word has a hit
Extend the word
During extension, the matrix is used to calculate
the score. Extension continues until the target score
is reached or the score falls from its best value by
the specified drop-off cutoff.
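
A minimal sketch of ungapped extension with a drop-off cutoff, extending to the right only. The nucleotide scores of +1 for a match and -3 for a mismatch, the function names, and the example sequences are assumptions chosen for illustration; a protein search would look the scores up in a matrix instead.

```python
def nucleotide_score(a, b, match=1, mismatch=-3):
    """Assumed simple scoring: +1 for identity, -3 otherwise."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 score_fn=nucleotide_score):
    """Extend a word hit to the right without gaps.

    Stops when the running score has fallen x_drop or more below the
    best score seen so far, or when either sequence ends.
    Returns (best_score, length_of_best_segment).
    """
    score = sum(score_fn(query[q_start + k], subject[s_start + k])
                for k in range(word_len))
    best, best_len = score, word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_fn(query[q_start + i], subject[s_start + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break   # score has deteriorated past the cutoff
    return best, best_len


if __name__ == "__main__":
    # Word hit of length 4 at the start of both sequences.
    print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, 4, x_drop=5))
```

The segment reported is the one ending at the best score, not at the position where the extension was abandoned.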
24
BLAST Nuts & Bolts
  • Normal word sizes for proteins are W=3 with
    T=14 or W=4 with T=16.
  • Normal word sizes for nucleic acids are W=11 or
    W=12
  • The default scoring matrix for nucleic acid
    sequences is (1, -3) for NCBI BLAST and (5, -4)
    for WUBLAST

25
Gapped BLAST - 3 Changes to the Algorithm
  • Criterion for extending word pairs modified; this
    gives an increase in speed
  • Ability to create gapped alignments added
  • Smith-Waterman calculations are used to produce
    the final alignment

26
Word Extension
  • In the older versions of BLAST, if a word pair
    with a score above T was encountered when
    screening the DB, it was extended.
  • In the newer version, two non-overlapping words
    located within some distance X (the hitdist) of
    each other must hit the same sequence in the DB
    before an extension is performed.
  • To maintain sensitivity, T must be lowered.
    This yields more hits, but few are extended.
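
The two-hit criterion can be sketched as a filter over word hits. This is a simplified illustration, not the BLAST implementation: it assumes the two words must lie on the same diagonal (subject position minus query position), compares only neighboring hits on each diagonal, and uses hypothetical names throughout.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len, max_dist):
    """Pick pairs of word hits that justify an extension.

    hits      list of (query_pos, subject_pos) word hits in one DB sequence
    word_len  word size W, used to reject overlapping words
    max_dist  largest allowed distance between the two hits
    Returns a sorted list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        # Only neighboring hits are compared: a simplification.
        for first, second in zip(q_positions, q_positions[1:]):
            distance = second - first
            if word_len <= distance <= max_dist:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


if __name__ == "__main__":
    hits = [(0, 10), (5, 15), (2, 30), (3, 31), (40, 50)]
    print(two_hit_seeds(hits, word_len=3, max_dist=40))
```

A lone hit, or two hits that overlap, never triggers an extension, which is why T can be lowered without the number of extensions growing in step with the number of hits.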

27
Gapped Alignment
  • Original BLAST found many HSPs and used all of
    them to generate a SUM statistic
  • With gapped alignment, you need to find only one
    rather than all of the ungapped alignments.
  • This allows T to be raised and increases the
    speed of the initial scan
  • Gapped alignments are achieved via dynamic
    programming to extend a central pair of aligned
    residues in both directions.

28
PSI-BLAST
  • Distant relationships are often best detected by
    motif or profile searches rather than pairwise
    comparisons
  • PSI-BLAST searches are iterated, with a
    position-specific matrix generated from
    significant alignments found in round i used in
    round i + 1.
  • BLAST uses a generalized matrix
  • May not be as sensitive as motif search but is
    very general and easy to use.

29
A PSSM (position specific scoring matrix) for
PSI-BLAST
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
20 N   0   0   3  -2  -4   2   0   0  -2   0   0   2  -2  -4  -3   2   0  -5  -3  -3
21 S  -2   0   3   0  -4   0   0   0  -2  -4  -4   1  -3  -4  -3   2   2   4  -3  -3
22 G   1   0   2  -2  -3   0  -2   1   2  -2   0   1  -2  -3  -3   1  -2  -4  -3   0
23 W  -2   2   1   1  -4   0   1   0   2  -1  -3   0  -3   2  -3   1  -2   3  -2  -3
24 D  -3   0   0   4  -4  -1   3  -3   1  -2   0   0  -2  -4   0  -2   0  -5  -3  -1
25 Q  -2   0   1   0  -4   2   3   0  -2  -1  -4  -1  -3  -3  -3   1   2  -4   0  -3
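
To show how a PSSM is read, this sketch scores a peptide against the six rows shown on this slide (positions 20 to 25). Unlike a general matrix such as PAM 120, the score of a residue depends on the position it is aligned to. The function name and the example peptides are assumptions for illustration.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the slide, keyed by alignment position.
PSSM_ROWS = {
    20: [0, 0, 3, -2, -4, 2, 0, 0, -2, 0, 0, 2, -2, -4, -3, 2, 0, -5, -3, -3],
    21: [-2, 0, 3, 0, -4, 0, 0, 0, -2, -4, -4, 1, -3, -4, -3, 2, 2, 4, -3, -3],
    22: [1, 0, 2, -2, -3, 0, -2, 1, 2, -2, 0, 1, -2, -3, -3, 1, -2, -4, -3, 0],
    23: [-2, 2, 1, 1, -4, 0, 1, 0, 2, -1, -3, 0, -3, 2, -3, 1, -2, 3, -2, -3],
    24: [-3, 0, 0, 4, -4, -1, 3, -3, 1, -2, 0, 0, -2, -4, 0, -2, 0, -5, -3, -1],
    25: [-2, 0, 1, 0, -4, 2, 3, 0, -2, -1, -4, -1, -3, -3, -3, 1, 2, -4, 0, -3],
}


def score_segment(segment, start_pos, pssm=PSSM_ROWS, columns=COLUMNS):
    """Sum the position-specific scores of `segment`, with its first
    residue aligned to PSSM position `start_pos` (ungapped)."""
    total = 0
    for offset, residue in enumerate(segment):
        total += pssm[start_pos + offset][columns.index(residue)]
    return total


if __name__ == "__main__":
    print(score_segment("NSGWDQ", 20))   # the residues labeling the rows
    print(score_segment("DDDDDD", 20))   # same residue, different scores
```

The second call makes the position dependence visible: the same residue D contributes a different value at each of the six positions.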
30
There are 2 BLAST Variants
  • NCBI BLAST (http://ncbi.nlm.nih.gov/BLAST/) or
    via local install
  • WUBLAST (http://blast.wustl.edu/) for
    information. This program is most often used at
    database web sites and for local installs.

31
Available GenBank Peptide Sequence Databases
nr                 All non-redundant GenBank CDS translations + PDB +
                   SwissProt + PIR + PRF
month              All new or revised GenBank CDS translation + PDB +
                   SwissProt + PIR + PRF released in the last 30 days.
swissprot          Last major release of the SWISS-PROT protein
                   sequence database (no updates)
Drosophila genome  Drosophila genome proteins provided by Celera and
                   Berkeley Drosophila Genome Project (BDGP).
yeast              Yeast (Saccharomyces cerevisiae) genomic CDS
                   translations
ecoli              Escherichia coli genomic CDS translations
pdb                Sequences derived from the 3-dimensional structure
                   from Brookhaven Protein Data Bank
Patent             Protein sequences derived from the Patent division
                   of GenBank
32
Available GenBank Nucleotide Sequence Databases
nr                 All GenBank + EMBL + DDBJ + PDB sequences (but no
                   EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences).
                   No longer "non-redundant".
month              All new or revised GenBank + EMBL + DDBJ + PDB
                   sequences released in the last 30 days.
Drosophila genome  Drosophila genome provided by Celera and Berkeley
                   Drosophila Genome Project (BDGP).
dbest              Database of GenBank + EMBL + DDBJ sequences from
                   EST Divisions
dbsts              Database of GenBank + EMBL + DDBJ sequences from
                   STS Divisions
htgs               Unfinished High Throughput Genomic Sequences:
                   phases 0, 1 and 2 (finished, phase 3 HTG sequences
                   are in nr)
gss                Genome Survey Sequence, includes single-pass
                   genomic data, exon-trapped sequences, and Alu PCR
                   sequences.
33
Available GenBank Nucleotide Databases continued
yeast              Yeast (Saccharomyces cerevisiae) genomic nucleotide
                   sequences
E. coli            Escherichia coli genomic nucleotide sequences
pdb                Sequences derived from the 3-dimensional structure
                   from Brookhaven Protein Data Bank
Patent             Nucleotide sequences derived from the Patent
                   division of GenBank
vector             Vector subset of GenBank(R), NCBI, in
                   ftp://ncbi.nlm.nih.gov/blast/db/
mito               Database of mitochondrial sequences
alu                Select Alu repeats from REPBASE, suitable for
                   masking Alu repeats from query sequences. It is
                   available by anonymous FTP from ncbi.nlm.nih.gov
                   (under the /pub/jmc/alu directory). See "Alu
                   alert" by Claverie and Makalowski, Nature vol.
                   371, page 752 (1994).
34
Essential BLAST Parameters
  • W word size
  • T neighborhood word score threshold
    (varies by word size and matrix used)
  • V number of descriptions to report
  • B number of alignments to report
  • M value of a nucleotide match
  • N value of a nucleotide mismatch
  • X word hit extension drop off score
  • E Expected frequency of chance occurrences
  • S Score at which a single HSP would satisfy E
  • -matrix defines a matrix to use
  • -filter defines a specific filter program

35
Command line BLAST
  • Format: algorithm db query options
  • Example: blastp nr myprot.txt -matrix=pam70 V=10
    B=10
  • Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5
  • Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5 >
    blast.out

36
Making your own BLAST DB
  • Any sequence file of fasta formatted sequences
    can be turned into a BLAST DB.
  • How you do this depends on which BLAST variant
    you are using.
  • NCBI BLAST - protein DB: setdb myseqfile
  • NCBI BLAST - nucleotide DB: pressdb myseqfile
  • WUBLAST - protein DB: formatdb -p myseqfile
  • WUBLAST - nucleotide DB