Title: Intro to Bioinformatics
1Tutorial 3 - BLAST
2BLAST
- What is BLAST?
- Basic Local Alignment Search Tool
- Set of similarity search programs for exploring
sequence databases.
Database
Query
BLAST program
3Why perform similarity search?
- One sequence by itself is not informative; it
must be analyzed by comparative methods against
existing sequence databases to develop hypotheses
concerning relatives and function.
- There are 3 possibilities
- A perfect match.
- A similar sequence.
- Not even one similar sequence.
4BLAST Databases
Automatically searches opposite strand
The query is genomic, translated to protein using
6 possible reading frames
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
One search in tblastx is like ___ searches of
blastp
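The six reading frames mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and an unambiguous DNA query; the function names (`reverse_complement`, `translate`, `six_frame_translations`) and the frame labels are illustrative choices, not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translations(dna):
    """Translate three offsets on each strand (hypothetical labels +1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCC").items():
        print(label, protein)
```

Each translated query or database entry therefore yields six protein sequences to compare, which is why the translated programs are more expensive than blastp.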
5http://www.ncbi.nlm.nih.gov/BLAST/
6Place Query
Choose Database
7BLASTN Databases
Gene collection        GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
Genomic + Transcript   Complete human and mouse genome + transcriptome
EST                    Expressed sequence tags
mito                   Mitochondrial sequences
vector                 Vector subset of GenBank
month                  GenBank, EMBL, DDBJ, PDB from the last 30 days
Envi                   Environmental samples
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
8Place Query
Choose Database
Optimize similarity level of the search
Limit output size
Threshold for results significance
Primary word match (16-64 nt)
Reward and penalty for matching and mismatching
bases
Cost to create and extend a gap
Remove low information content
Limit search to specific organism
9Search for homologs of the chick olfactory
receptor 6 gene
10Global Alignments
Local Alignments
Query sequence
Matched Areas of database sequences
11Sequence description
E value
Score(bits)
Sequence Identifier
Identity
Coverage
12Score and E value
Identities and gaps
Strand
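The "Identities and gaps" line of a hit can be reproduced by tallying the columns of the pairwise alignment. The sketch below is a simplified illustration, assuming both aligned strings have equal length and use `-` for gaps; `alignment_stats` is a hypothetical helper name, and the sample strings are the hit used in the exercise later in this tutorial.

```python
def alignment_stats(query, subject):
    """Count identities, mismatches and gap columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = mismatches = gaps = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1          # a column where one sequence has a gap
        elif q == s:
            identities += 1    # identical residues
        else:
            mismatches += 1    # aligned but different residues
    return {
        "length": len(query),
        "identities": identities,
        "mismatches": mismatches,
        "gaps": gaps,
    }


if __name__ == "__main__":
    stats = alignment_stats("ACGTCGATCGAGCT", "AGGTCGTC-GAGGT")
    n = stats["length"]
    print(f"Identities = {stats['identities']}/{n}, Gaps = {stats['gaps']}/{n}")
```

For the sample hit this gives 9 identities, 4 mismatches and 1 gap over 14 columns.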
13Multiple hits on the same subject
14Design of the BLAST survey
- Consider your research question
- Are you looking for a particular gene in a
particular species? BLAST against the genome of
that species.
- Are you looking for additional members of a gene
family across all species? BLAST against the
gene collection database.
- Are you looking for exact motif matches?
Increase the gap penalty or use megablast.
15Score and E-value
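Score: $S = \sum(\text{identities} + \text{mismatches}) - \sum \text{gaps}$
Bit score: $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
E-value: $E = K \, m \, n \, e^{-\lambda S}$
- $K$: depends on search space
- $\lambda$: depends on scoring system
- $m$: query length (bp)
- $n$: effective length (total number of bases) of the
database (bp)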
16Score and E-value
- The score is a measure of the similarity of the
query to a sequence from the database.
- The E-value is a measure of the reliability of
the score.
- The definition of the E-value is: the number of
alignments expected, due to chance alone, with a
similarity score at least as high as the given S score.
17Score and E-value
- The Size of the E-value
- The typical threshold for a good E-value from a
BLAST search is $E = 10^{-6}$ (e-6) or lower.
- The reason for such low values is that an E = 0.001
in a million-entry database would still leave
1000 entries due to chance. An E = e-6 would only
leave one entry due to chance.
18Exercise
Calculate the S, S' and E for the following BLAST
hit
ACGTCGATCGAGCT
AGGTCGTC-GAGGT
- Given the following parameters
- Query length = 150
- $\lambda = 1.37$
- $K = 0.711$
- Average sequence length in database = 270
- Number of sequences in database = 4,554,026

$S = \sum(\text{Id} + \text{MM}) - \sum \text{GP} = 13 - 1 = 12$
$S' = \dfrac{1.37 \times 12 - \ln(0.711)}{\ln 2} = \dfrac{16.44 + 0.341}{0.693} \approx 24.2$
19Exercise
Calculate the S, S' and E for the following BLAST
hit
ACGTCGATCGAGCT
AGGTCGTC-GAGGT
- Given the following parameters
- Query length = 150
- $\lambda = 1.37$
- $K = 0.711$
- Average sequence length in database = 270
- Number of sequences in database = 4,554,026

$E = 0.711 \times 150 \times 270 \times 4{,}554{,}026 \times e^{-1.37 \times 12}$
$E \approx 131135455683 \times 7.2477 \times 10^{-8} \approx 9504.27$
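The arithmetic of this exercise can be checked with a few lines of Python. This is a minimal sketch of the bit-score and E-value formulas as they are used in the worked solution; the function names are illustrative, and the effective database length is approximated, as on the slide, by average sequence length times number of sequences.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Parameters given in the exercise.
    lam, k = 1.37, 0.711
    query_len = 150
    db_len = 270 * 4_554_026   # average length x number of sequences
    raw = 13 - 1               # S from the exercise

    print(f"S' = {bit_score(raw, lam, k):.1f}")
    print(f"E  = {e_value(raw, lam, k, query_len, db_len):.2f}")
```

An E-value in the thousands means that a hit this short and this weak is expected many times by chance in a database of this size.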
20Exercise
What is the minimal score needed to
achieve a significant E value (e-6, i.e. $10^{-6}$)?
$131135455683 \, e^{-1.37 S} = 10^{-6}$
$\ln(131135455683 \, e^{-1.37 S}) = \ln(10^{-6})$
$\ln(131135455683) + \ln(e^{-1.37 S}) = -13.82$
$25.6 - 1.37 S = -13.82$
$S = \dfrac{-13.82 - 25.6}{-1.37} \approx 28.77$
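The same rearrangement can be written as a small function that solves $E = K m n e^{-\lambda S}$ for $S$. This is a sketch under the exercise's parameters; `min_raw_score` is a hypothetical name, and the database length is again taken as average sequence length times number of sequences.

```python
import math


def e_value(raw_score, lam, k, query_len, db_len):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def min_raw_score(e_threshold, lam, k, query_len, db_len):
    """Solve K * m * n * exp(-lambda * S) = E for S."""
    return math.log(k * query_len * db_len / e_threshold) / lam


if __name__ == "__main__":
    lam, k = 1.37, 0.711
    query_len = 150
    db_len = 270 * 4_554_026

    s_min = min_raw_score(1e-6, lam, k, query_len, db_len)
    print(f"Minimal S = {s_min:.2f}")
    # Plugging the result back in should recover the threshold.
    print(f"E at that score = {e_value(s_min, lam, k, query_len, db_len):.1e}")
```

Because raw scores are whole numbers in this exercise, any hit would need a score of at least 29 to pass the threshold.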
211. ????? ????? ?????????? ??? CFTR ????
222. ???? ????? ?????? ??? CFTR ??????? ???????
?????
233. ??? CFTR ???? ?????? ,ABC transporters????
???? ?????? ???? ?????? ??ABC transporters
244. ?????? ??? ?? ?????, ???? ??????? ????? ??
?-BLAST . ?????? ??????? ?? ????? ?????? ????
??????? ????? (???? ?? ?-Algorithm parameters,
????????? ?? Filters and Masking ???? ?? ????? ??
""Low Complexity regions) ????
>my protein
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGAR
PCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIA
QYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMG
ADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTN
SISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVF
ARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPIS
SSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMA
NNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGT
SGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
a. ????? ????? ?? BLAST ???????? BLAST
PROTEIN ????? ???? ??????? Swissprot ??? ??????
?????? ???????? ????? ?????? ??? - Paired box
protein Pax-6 . ?????????? ?????? ????
?????????? ?? ???? ????? ?? 731 (Rattus
norvegicus, Human, Bovine). b. ???? ?? ?-BLAST
?-alignments ????? ??? ????? ?? ?????? ???????
??????? ????? ??????. ????? ?? ???????? ????
???????? ?????? ???? ??????? ?????, ??? ???????
???? ???????? ?????? ?? ??????? ??????? ?????,
??? ???? ?????????? ????? ???.
255. ????? ?? ?????? RecA ?? E. coli (???? ????
P0A7G6. ???? ????? ?? ???? ???????? ?????, ??
???? ????? ?? ?? ???? ?????). ??? ????? ?? ???
(Saccharomyces cerevisiae . ???? ???????? ??
??????? ?? organism) ?- BLAST ??? ?? ?????? .a
????? ????? ?? BLAST ???????? TBLASTN (nr
Database) .b ?? ?? ?????? ??? ???? ?? ????? ????
?????? RAD57 . c ????? ?????? ???? ?????
BLOSUM62 ???? ?- gap penalty 11,1(????? ?"? ?????
?? ?????? Search Summary ?????? ???? ????? ????
?? ?????? )? d. ???? ????? ?? ????? ???? ???????
14,042,622