Transcript and Presenter's Notes

1
http://creativecommons.org/licenses/by-sa/2.0/
2
Sequence Similarity Searching: Understanding
and Using Web-Based BLAST
Dr. Joanne Fox joanne@bioinformatics.ubc.ca
3
Concepts of Sequence Similarity Searching

The premise:
The sequence itself is not informative; it must
be analyzed by comparative methods against
existing databases to develop hypotheses
concerning relatives and function.

4
Important Terms for Sequence Similarity Searching
with very different meanings

Similarity
The extent to which nucleotide or protein
sequences are related. The extent of similarity
between two sequences can be based on percent
sequence identity and/or conservation. In BLAST,
similarity refers to a positive matrix score.
Identity
The extent to which two (nucleotide or amino
acid) sequences are invariant.
Homology
Similarity attributed to descent from a common
ancestor.
It is your responsibility as an informed
bioinformatician to use these terms correctly. A
sequence is either homologous or not. Don't use
% with this term!
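
To make the "Identity" definition concrete, here is a minimal sketch of percent identity for two already-aligned sequences. The function name, the "-" gap character, and the choice to divide by the alignment length (the way BLAST reports "Identities") are assumptions for illustration, not part of the slides.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length; gap columns never count as identical.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    # 4 identical columns out of 6 alignment columns.
    print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Identity is a number that can carry a percent sign; homology, as the slide says, is a yes-or-no conclusion drawn from evidence such as this.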

5
Sequence Similarity Searching: The Approach

Sequence similarity searching involves the use of
a set of algorithms (such as the BLAST programs)
to compare a query sequence to all the sequences
in a specified database.
Comparisons are made in a pairwise fashion. Each
comparison is given a score reflecting the degree
of similarity between the query and the sequence
being compared.
The higher the score, the greater the degree of
similarity. The similarity is measured and shown
by aligning two sequences.

6
Sequence Similarity Searching: The Alignment

Alignments can be global or local (this is
algorithm-specific).
A global alignment is an optimal alignment that
includes all characters from each sequence
(Clustal generates global alignments).
A local alignment is an optimal alignment that
includes only the most similar local region or
regions (BLAST generates local alignments).
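
The difference between global and local alignment can be sketched with the two classic dynamic-programming scores: Needleman-Wunsch (global) and Smith-Waterman (local). This is not how Clustal or BLAST are implemented; BLAST in particular uses a faster heuristic. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning ALL of a against ALL of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over any pair of sub-regions of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Resetting to 0 lets the alignment start anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    a, b = "TTTACGTTTT", "GGGACGTGGG"
    # The shared core "ACGT" dominates the local score, while the
    # unrelated flanks drag the global score down.
    print(global_score(a, b), local_score(a, b))
```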

7
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
8
Topics
BLAST program

The different blast programs
Understanding the BLAST algorithm
Word size
HSPs
Understanding BLAST statistics
The alignment score (S)
Scoring Matrices
Dealing with gaps in an alignment
The expectation value (E)

9
The BLAST algorithm

The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison
algorithms introduced in 1990 that are used to
search sequence databases for optimal local
alignments to a query.
Altschul SF, Gish W, Miller W, Myers EW, Lipman
DJ (1990) Basic local alignment search tool. J.
Mol. Biol. 215:403-410.
Altschul SF, Madden TL, Schaeffer AA, Zhang J,
Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res.
25:3389-3402.

10
http://www.ncbi.nlm.nih.gov/BLAST/
blastn
11
Several different BLAST programs

12
Other BLAST programs

BLAST 2 Sequences (bl2seq)
Aligns two sequences of your choice
Can do different types of comparison, e.g. blastx
Gives dot-plot-like output
VecScreen
Compares query with sequences of known cloning
vectors
Both very handy for sequencing!

13
More BLAST programs

BLAST against genomes
Many available
BLAST parameters pre-optimized
Handy for mapping query to genome
Search for short exact matches
BLAST parameters pre-optimized
Great for checking probes and primers

14
MegaBLAST

megaBLAST
For aligning sequences which differ slightly due
to sequencing errors etc.
Very efficient for long query sequences
Uses big word (k-tuple) sizes to start search
Very fast
Accepts batch submissions of ESTs
Can upload files of sequences as queries
For more detailed info, see the megaBLAST pages

15
How Does BLAST Really Work?

The BLAST programs improved the overall speed of
searches while retaining good sensitivity
(important as databases continue to grow) by
breaking the query and database sequences into
fragments ("words"), and initially seeking
matches between fragments.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
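
The word-then-extend idea can be sketched in a few lines. This is a simplified illustration, not the NCBI implementation: it uses exact word matches only (protein BLAST also considers similar "neighborhood" words), extends without gaps, and stops when the running score falls a fixed amount below the best seen. The word size, match/mismatch scores, and drop-off value are assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, w=4):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, w=4, match=1, mismatch=-3, xdrop=5):
    """Extend a seed in both directions without gaps.

    Returns (best_score, query_start, query_end) with query_end exclusive.
    Extension stops once the score drops more than xdrop below the best.
    """
    score = best = w * match
    best_start, best_end = i, i + w

    qi, sj = i + w, j + w  # extend to the right
    while qi < len(query) and sj < len(subject) and best - score <= xdrop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, best_end = score, qi

    score = best  # extend to the left from the best right-hand result
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0 and best - score <= xdrop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, best_start = score, qi
        qi -= 1
        sj -= 1

    return best, best_start, best_end


if __name__ == "__main__":
    query, subject = "GGACGTACGTCC", "TTACGTACGTAA"
    i, j = find_seeds(query, subject)[0]
    score, start, end = extend_ungapped(query, subject, i, j)
    print(score, query[start:end])
```

An extended segment would be reported only if its score exceeds the threshold S mentioned above.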

16
Picture used with permission from Chapter 11 of
Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins
19
Each BLAST hit generates an alignment that can
contain one or more of these high-scoring segment
pairs (HSPs)
20
Where does the score (S) come from?

The quality of each pair-wise alignment is
represented as a score and the scores are ranked.
Scoring matrices are used to calculate the score
of the alignment base by base (DNA) or amino acid
by amino acid (protein).
The alignment score will be the sum of the scores
for each position.

21
What's a scoring matrix?

Substitution matrices are used for amino acid
alignments. These are matrices in which each
possible residue substitution is given a score
reflecting the probability that it is related to
the corresponding residue in the query.
A unitary matrix is used for DNA pairs because
each position can be given a score of 1 if it
matches and a score of zero if it does not.
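
A minimal sketch of scoring an ungapped alignment as the sum of per-position scores, using the unitary DNA matrix described above. The function names are hypothetical; a protein substitution matrix such as BLOSUM62 could be plugged in as a different `score_pair` function, but no matrix values are reproduced here.

```python
def unitary_score(x, y):
    """Unitary matrix for DNA: 1 for a match, 0 for a mismatch."""
    return 1 if x == y else 0


def ungapped_alignment_score(seq_a, seq_b, score_pair=unitary_score):
    """Sum the per-position scores of two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(score_pair(x, y) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Positions 3 and 6 differ, so 4 of 6 positions score 1.
    print(ungapped_alignment_score("ACGTAC", "ACCTAG"))
```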

22
PAM vs. BLOSUM scoring matrices

BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships. A search for
distant relatives may be more sensitive with a
different matrix.

23
PAM vs BLOSUM scoring matrices

The PAM family
PAM matrices are based on global alignments of
closely related proteins.
The PAM1 is the matrix calculated from
comparisons of sequences with no more than 1%
divergence.
Other PAM matrices are extrapolated from PAM1.

The BLOSUM family
BLOSUM matrices are based on local alignments.
BLOSUM 62 is a matrix calculated from comparisons
of sequences sharing no more than 62% identity.
All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins.

24
What happens if you have a gap in the alignment?

A gap is a position in the alignment at which a
letter is paired with a null.
Gap scores are negative. Since a single
mutational event may cause the insertion or
deletion of more than one residue, the presence
of a gap is frequently ascribed more significance
than the length of the gap.
Hence opening a gap is penalized heavily, whereas
a lesser penalty is assigned to each subsequent
residue in the gap.
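
This "heavy to open, cheap to extend" scheme is usually called an affine gap penalty. A minimal sketch follows; the existence cost of 11 and extension cost of 1 are assumed example values, and the convention that a gap of length k costs existence + k x extension is one common choice rather than the only one.

```python
def gap_score(length, gap_open=11, gap_extend=1):
    """Affine gap score: a fixed cost for the gap plus a per-residue cost."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


if __name__ == "__main__":
    # One gap of three residues is far cheaper than three separate gaps,
    # reflecting that a single mutational event can insert or delete
    # several residues at once.
    print(gap_score(3), 3 * gap_score(1))
```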

25
What do the Score and the E-value really mean?

The quality of the alignment is represented by
the Score.
Score (S)
The score of an alignment is calculated as the
sum of substitution and gap scores. Substitution
scores are given by a look-up table (PAM, BLOSUM)
whereas gap scores are assigned empirically.
The significance of each alignment is computed as
an E value.
E value (E)
Expectation value. The number of different
alignments with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the more
significant the score.

26
Is the E-value the same as the P-value?

E value (E)
Expectation value. The number of different
alignments with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the more
significant the score.
When E < 0.01, the P-value and the E-value are
nearly identical.
So, the E-value is the number of times you expect
to see your hit occur in the database (with as
good as or better score) due to random chance
alone.
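
The near-equality of E and P at small values follows from the standard relationship P = 1 - exp(-E), which treats chance hits as Poisson-distributed. The sketch below simply evaluates that formula; the function name is hypothetical.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-6, 0.01, 1.0, 5.0):
        print(e, p_value_from_e(e))
```

For small E the two numbers agree closely; once E approaches 1 or more, P saturates toward 1 while E keeps growing.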

27
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
28
Topics
BLAST databases

The different blast databases provided by the
NCBI
Protein databases
Nucleotide databases
Genomic databases
Considerations for choosing a BLAST database
Custom databases for BLAST

29
BLAST protein databases available through the
blastp web interface @ NCBI
30
BLAST nucleotide databases available through the
blastn web interface @ NCBI
31
Considerations for choosing a BLAST database

First consider your research question
Are you looking for an ortholog in a particular
species?
BLAST against the genome of that species.
Are you looking for additional members of a
protein family across all species?
BLAST against nr; if you can't find hits, check
wgs, htgs, and the trace archives.
Are you looking to annotate genes in your species
of interest?
BLAST against known genes (RefSeq) and/or ESTs
from a closely related species.

32
When choosing a database for BLAST

It is important to know your reagents.
Changing your choice of database is changing your
search space completely
Database size affects the BLAST statistics.
Record BLAST parameters, database choice, and
database size in your bioinformatics lab book,
just as you would for your wet-bench experiments.
Databases change rapidly and are updated
frequently
It may be necessary to repeat your analyses
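
Why database size matters can be seen from the usual bit-score form of the E-value, E = m x n x 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The sketch below is a simplification: actual BLAST reports use length-corrected "effective" values for m and n, and the numbers in the usage example are arbitrary.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for bit score S', query length m, database length n."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The same alignment (same bit score) becomes less significant
    # as the search space grows.
    small = expect_value(bit_score=50, query_len=300, db_len=1e9)
    large = expect_value(bit_score=50, query_len=300, db_len=2e9)
    print(small, large)
```

This is one reason a hit that looked significant last year may need re-checking after the database has grown.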

33
Creating Custom Databases for BLAST
UBiC FAQ
34
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
35
Topics
BLAST results

Choosing the right BLAST program
Running a blastp search
BLAST parameters and options to consider
Viewing BLAST results
Look at your alignments
Using the BLAST taxonomy report

36
http://www.ncbi.nlm.nih.gov/BLAST/
blastn
37
http://www.ncbi.nlm.nih.gov/BLAST/
Program selection guide
38
What BLAST program should I use? Check the
NCBI's BLAST program selection guide
39
http://www.ncbi.nlm.nih.gov/BLAST/
40
Input your query (gi231571) as FASTA, raw
sequence, or Accession/ID and choose your
database
41
Links to more information can be found on the
BLAST page
links
42
BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
43
More BLAST parameters and options to consider
filtering
gap penalties
matrix
44
Run your BLAST search
BLAST
45
The BLAST Queue
click for more info
Note your RID
46
Formatting and Retrieving your BLAST results
Results
options
47
A graphical view of your BLAST results
48
The BLAST hit list
Score
E-Value
GenBank
alignment
EntrezGene
49
The BLAST pairwise alignments
Identity
Similarity
50
Sorting BLAST results by Taxonomy
Taxonomy Report
51
Tax BLAST Report
Summary hits by lineage
BLAST hits by organism
52
BLAST statistics to record in your bioinformatics
labbook
Record the statistics that are found at bottom of
your BLAST results page
53
Homology: Some Guidelines

Similarity can be indicative of homology
Generally, if two sequences are significantly
similar over their entire length they are likely
homologous
Low complexity regions can be highly similar
without being homologous
Homologous sequences are not always highly similar
Suggested BLAST Cutoffs
(source: Chapter 11, Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins)
For nucleotide-based searches, one should look
for hits with E-values of 10^-6 or less and
sequence identity of 70% or more
For protein-based searches, one should look for
hits with E-values of 10^-3 or less and sequence
identity of 25% or more
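
The suggested cutoffs can be applied mechanically to a saved hit table. The sketch below assumes the results were exported in BLAST's 12-column tab-separated tabular format (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); that export step is not covered in these slides, and the hit lines in the usage example are made up for illustration.

```python
import csv
import io

FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# (maximum E-value, minimum percent identity) from the guidelines above.
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def filter_hits(tabular_text, search_type):
    """Keep only the hits that pass the suggested cutoffs for the search type."""
    max_e, min_identity = CUTOFFS[search_type]
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) <= max_e and float(hit["pident"]) >= min_identity:
            kept.append(hit)
    return kept


if __name__ == "__main__":
    # Made-up example hits, not real search results.
    example = (
        "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-95\t350\n"
        "query1\tsubjB\t65.0\t120\t42\t0\t5\t124\t30\t149\t2e-10\t70.1\n"
        "query1\tsubjC\t87.5\t40\t5\t0\t1\t40\t7\t46\t0.5\t30.2\n"
    )
    print([h["sseqid"] for h in filter_hits(example, "nucleotide")])
```

Passing these cutoffs is a starting point for inspection, not proof of homology; as noted above, the alignments themselves still need to be examined.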

54
Advanced BLAST programs

The NCBI BLAST pages have several advanced BLAST
methods available
PSI-BLAST
PHI-BLAST
RPS-BLAST
All are powerful methods based on protein
similarities

55
PSI-BLAST

Position Specific Iterated BLAST
A cycling/iterative method
Gives increased sensitivity for detecting
distantly related proteins
Can give insight into functional relationships
Very refined statistical methods
Fast: still based on BLAST methods
Simple to use

56
How does PSI-BLAST work?

1. First, a standard blastp is performed
2. The highest scoring hits are used to generate
a multiple alignment
3. A Position Specific Scoring Matrix (PSSM) is
generated from the multiple alignment.
- Highly conserved residues get high scores
- Less conserved residues get lower scores
- The PSSM describes the sequence similarity
between your query and all significant blastp
hits
4. Another similarity search is performed, this
time using the new PSSM instead of the standard
BLOSUM or PAM matrices
- This PSSM (scoring matrix) is now customized to
find sequences that are related to your original
query
5. Steps 2-4 can be repeated until convergence
- Convergence occurs when no new sequences appear
after iteration
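
The idea of a position-specific scoring matrix can be sketched as per-column log-odds scores computed from a multiple alignment. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure, which also weights sequences and uses data-dependent pseudocounts. The uniform background frequency, the flat pseudocount, and the tiny example alignment are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment by summing the column-specific residue scores."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    # Made-up three-column alignment: column 1 is fully conserved (M),
    # column 2 is variable, column 3 is mostly V.
    alignment = ["MKV", "MRV", "MKI", "MQV"]
    pssm = build_pssm(alignment)
    print(score_with_pssm(pssm, "MKV"), score_with_pssm(pssm, "WWW"))
```

Unlike BLOSUM or PAM, the same residue receives a different score at each position, which is what lets later iterations pick up more distant relatives.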

57
http://www.ncbi.nlm.nih.gov/BLAST/
PSI-BLAST
58
Format results for PSI-BLAST with inclusion
E-value set at 0.005
PSI-BLAST
BLAST
59
Contributors

Special thanks to David Wishart, Andy Baxevanis,
Stephanie Minnema, Sohrab Shah, and Francis
Ouellette for their contributions to these
materials
You are now ready to complete the BLAST
assignment
