Title: Stand alone BLAST on Linux
1. Stand alone BLAST on Linux
CBW Bioinformatics Vancouver 2004, Lab 4.1
Sohrab Shah, bioinformatics.ubc.ca, sohrab_at_bioinformatics.ubc.ca
Stephanie Minnema, University of Calgary
Will Hsiao, Simon Fraser University
2. Outline
- What is stand alone BLAST?
- Why stand alone BLAST?
- Installing BLAST
- Formatting databases for BLAST
- Running stand alone BLAST searches
- Changing parameters
- Formatting BLAST output
- Assignment
3. What is stand alone BLAST?
- A local installation of the NCBI BLAST suite of programs
  - Requires CPU, disk and RAM
- The same application that drives the NCBI WWW BLAST server
- Software distribution and documentation available from
  - ftp://ftp.ncbi.nih.gov/blast/executables/release
4. Why stand alone BLAST?
- Allows creation of custom databases
  - Specific data sets for specific tasks
  - Increase computational efficiency
  - Increase specificity of results
- Secure querying
  - Important for IP protection: no internet traffic
- Facilitates high-throughput analyses
  - No queues: only competing with internal users
  - Can automate searches
5. Some drawbacks
- Often need significant hardware resources
- Need to maintain the databases
6. Installing BLAST
- The BLAST distribution
  - Point your browser to
  - ftp://ftp.ncbi.nih.gov/blast/executables/release
- Mailing list
  - http://www.ncbi.nlm.nih.gov/mailman/listinfo/blast-announce
  - Distribution announcements
  - Bug reports/fixes
7. ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.6
We have already downloaded the distribution, but
this is the ftp directory
8. Installing BLAST
9. Unpack the distribution
Unzip the distribution
Helpful info:
- Standalone BLAST is distributed as a gzipped tar archive
- The .gz file extension indicates that the file has been compressed with gzip, a standard Unix compression utility
- The gunzip utility uncompresses the file
- See > man gunzip for more info
10. Unpack the distribution
Untar the distribution
- Helpful info
  - The .tar extension indicates that the file is a tape archive created with tar, a standard Unix archiving tool
  - The tar command extracts the archive into the current working directory
  - See > man tar for more info
  - Options used:
    - x: extract
    - p: preserve permissions
    - f: file
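The unzip and untar steps can be sketched as follows. The archive demo_dist.tar.gz and its contents are hypothetical stand-ins for the downloaded BLAST distribution, built here only so the commands can be tried in a scratch directory.

```shell
# Work in a scratch directory so nothing in the real filesystem is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in archive (hypothetical; the real one comes from the NCBI ftp site).
mkdir -p demo_dist/data
echo "read me" > demo_dist/README.demo
tar cf demo_dist.tar demo_dist
gzip demo_dist.tar            # produces demo_dist.tar.gz and removes demo_dist.tar
rm -r demo_dist

# Step 1: uncompress. gunzip replaces demo_dist.tar.gz with demo_dist.tar.
gunzip demo_dist.tar.gz

# Step 2: extract into the current directory
# (x = extract, p = preserve permissions, f = read from the named file).
tar xpf demo_dist.tar

# The extracted tree is now in the current working directory.
ls demo_dist
```

With the real distribution, only the gunzip and tar xpf lines are needed, run on the downloaded file.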
11. List the contents of the distribution
- A suite of tools for
  - running various blast searches
  - formatting and extracting sequences
- Documentation
  - README.* files: read 'em!
- Data files with scoring matrices
  - data
12. Configuring BLAST
- We need to configure the system so the BLAST programs can function correctly
- Set the PATH environment variable by editing ~/.bashrc
- Save the file
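A minimal sketch of the line that could be added to ~/.bashrc, assuming the BLAST programs were unpacked into /home/guest/blast. That location is hypothetical; use the directory that actually contains blastall and formatdb.

```shell
# Hypothetical install location; substitute the directory holding the BLAST programs.
BLAST_HOME="/home/guest/blast"

# The line to add to ~/.bashrc: append the BLAST directory to the search path.
export PATH="$PATH:$BLAST_HOME"

# Check that the directory is now part of PATH.
case ":$PATH:" in
    *":$BLAST_HOME:"*) echo "BLAST directory is on the PATH" ;;
    *)                 echo "BLAST directory is NOT on the PATH" ;;
esac
```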
13. Configuring BLAST
- We need to set up a configuration file, ~/.ncbirc, to point to the data directory in the distribution
- Open a file
  - emacs ~/.ncbirc
- Save the file
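As an illustration, the configuration file for the legacy NCBI toolkit is a short section with a Data entry. The data directory path below is an assumption, and the sketch writes to a scratch file rather than to the real ~/.ncbirc.

```shell
# Hypothetical data directory; point this at the data directory of your distribution.
BLAST_DATA="/home/guest/blast/data"

# Written to a scratch file here; on a real system the target would be ~/.ncbirc.
ncbirc_demo=$(mktemp)
cat > "$ncbirc_demo" <<EOF
[NCBI]
Data=$BLAST_DATA
EOF

# Show what was written.
cat "$ncbirc_demo"
```

Typing the same two lines into emacs and saving them as ~/.ncbirc has the same effect.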
14. Exit the shell
- Exit the shell
- Start a new shell
- When you start a new shell, your environment will
be set up to run BLAST
15. Formatting the swissprot database for BLAST
- Change directory to /home/guest/blast/db
- View the contents of the directory
- Unzip the swissprot database
16. View the contents of the swissprot database
17. FASTA format
>SOME DEFINITION OF THE SEQUENCE
ACGATCGACTACGATCAGCAGCATAGCTACAGATAG
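To make the layout concrete, here is a small sketch that writes a two-record FASTA file and counts its entries. The identifiers seq1 and seq2 and the second sequence are made up for illustration.

```shell
# Each record starts with a '>' definition line, followed by one or more sequence lines.
fasta_demo=$(mktemp)
cat > "$fasta_demo" <<'EOF'
>seq1 SOME DEFINITION OF THE SEQUENCE
ACGATCGACTACGATCAGCAGCATAGCTACAGATAG
>seq2 a second, hypothetical entry
ACGTACGTACGT
ACGTAC
EOF

# Count the records by counting definition lines.
n_records=$(grep -c '^>' "$fasta_demo")
echo "records: $n_records"

# List only the definition lines.
grep '^>' "$fasta_demo"
```

The same grep -c '^>' count works on a large database file such as swissprot.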
18. FASTA -> BLASTable
- FASTA formatted files cannot be used directly by the BLAST programs
- You need to prepare the FASTA files for BLAST with formatdb
- This indexes the entries in the FASTA file and enables BLAST to run much faster
19. formatdb
- Formats FASTA formatted databases for BLAST
20. Formatting swissprot
- Format the swissprot database using formatdb
- List the contents of the directory
- The formatdb command will take a few minutes
- Useful info
  - There should be seven files that are a combination of indexes and data
  - Note the formatdb.log file
  - View its contents with more formatdb.log
  - Ignore WARNING errors (potential bug in the new release)
  - You should see "Formatted 143046 sequences in volume 0" as the last line in the file
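The formatting step might look like the sketch below. It assumes the unzipped FASTA file is named swissprot and sits in /home/guest/blast/db, and it uses the legacy formatdb options -i, -p and -o. The command is only executed when formatdb and the file are actually present; otherwise it is just printed.

```shell
# Assumed location and file name of the unzipped FASTA database.
db_dir="/home/guest/blast/db"

# Legacy formatdb options:
#   -i  input FASTA file
#   -p  T = protein sequences (F would mean nucleotide)
#   -o  T = parse sequence identifiers and build the extra index files
formatdb_cmd="formatdb -i swissprot -p T -o T"

if command -v formatdb >/dev/null 2>&1 && [ -f "$db_dir/swissprot" ]; then
    cd "$db_dir"
    $formatdb_cmd           # relies on shell word splitting of the command string
    ls                      # index and data files appear next to the FASTA file
    tail -n 1 formatdb.log  # last line reports how many sequences were formatted
else
    echo "formatdb or $db_dir/swissprot not found; would run: $formatdb_cmd"
fi
```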
21. formatdb documentation
22-25. Running BLAST - parameters
26. Running BLAST: try it
- Change directory to /home/guest/Lab4.1
- List the contents
- Useful info
  - bact_genome.fna: 12 Kb of genomic sequence of Pseudomonas aeruginosa, for the assignment
  - hs_tryp_trna_synth.aa: human tryptophanyl tRNA synthetase, to try psi-blast
  - test_blast.aa: test protein to try blastp and rpsblast
  - unknown1.aa: mystery protein for the assignment
  - unknown2.aa: mystery protein for the assignment
27. Running BLAST: try it
Run the blastall command. What will this command do?
What is the protein in test_blast.aa?
Repeat the search with a higher e-value cut-off (10). How does the output change?
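One way such a pair of searches could be written is sketched below. This is not necessarily the command from the original slide: the database path, the stricter cut-off of 0.001 and the output file names are assumptions chosen for illustration. The options -p, -d, -i, -e and -o are standard legacy blastall options.

```shell
# Assumed database (formatted earlier) and query file from the lab directory.
db="/home/guest/blast/db/swissprot"
query="test_blast.aa"

# Legacy blastall options:
#   -p program   -d database   -i query file   -e E-value cut-off   -o output file
strict_cmd="blastall -p blastp -d $db -i $query -e 0.001 -o test_blast.strict.out"
loose_cmd="blastall -p blastp -d $db -i $query -e 10 -o test_blast.loose.out"

for cmd in "$strict_cmd" "$loose_cmd"; do
    if command -v blastall >/dev/null 2>&1 && [ -f "$query" ]; then
        $cmd                      # relies on shell word splitting of the command string
    else
        echo "would run: $cmd"    # blastall or the query file is not available here
    fi
done
```

Comparing the two output files shows how the looser cut-off affects the hit list.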
28-29. BLAST output
30. rpsblast
- Reverse Position Specific BLAST
  - Query: protein sequence
  - Database: domains
- We have installed Pfam on your laptop
  - http://pfam.wustl.edu/
- Other domain databases
  - Smart
    - http://smart.embl-heidelberg.de/
  - CDD
    - http://www.ncbi.nih.gov/Structure/cdd/cdd.shtml
- For creating local blastable domain databases, consult
  - ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/README
31. Running rpsblast
32. Run rpsblast
- Search test_blast.aa against Pfam
- Produce HTML output with -T
- Open the results in your browser
What domains are present?
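A sketch of what the rpsblast search might look like. The location and name of the local Pfam database and the output file name are placeholders, since the source does not state them. The command is only executed when rpsblast and the query file are available.

```shell
# Placeholder path to the locally installed Pfam domain database (assumption).
pfam_db="/home/guest/blast/db/Pfam"
query="test_blast.aa"

# Legacy rpsblast options:
#   -i query file   -d domain database   -T T = HTML output   -o output file
rps_cmd="rpsblast -i $query -d $pfam_db -T T -o test_blast.rps.html"

if command -v rpsblast >/dev/null 2>&1 && [ -f "$query" ]; then
    $rps_cmd                  # relies on shell word splitting of the command string
    echo "open test_blast.rps.html in your browser"
else
    echo "would run: $rps_cmd"
fi
```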
33. rpsblast output
34. Running psiblast
- Preferred option when dealing with an unknown protein or trying to find distant homologues
- Much more sensitive than blastall
- Less specific with each iteration
- Use blastpgp to run psiblast on the command line
35-38. blastpgp parameters
39. Running psiblast (blastpgp)
- Search swissprot with human tryptophanyl tRNA synthetase using psiblast with 4 iterations. Generate HTML output
How does the hit list change with each iteration?
How can the matrix.ctx file be used in downstream analysis?
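The search described above could be sketched as follows, assuming the formatted swissprot database lives in /home/guest/blast/db and using a made-up output file name. Only the iteration count and HTML options are shown; any option that writes a matrix or checkpoint file is left out because the source does not say which one was used.

```shell
# Assumed database location and the query file from the lab directory.
db="/home/guest/blast/db/swissprot"
query="hs_tryp_trna_synth.aa"

# Legacy blastpgp options:
#   -i query file   -d database   -j maximum number of iterations
#   -T T = HTML output   -o output file
psi_cmd="blastpgp -i $query -d $db -j 4 -T T -o tryp_psiblast.html"

if command -v blastpgp >/dev/null 2>&1 && [ -f "$query" ]; then
    $psi_cmd                  # relies on shell word splitting of the command string
else
    echo "would run: $psi_cmd"
fi
```

The HTML report lists the hits found in each round, which is where changes between iterations can be compared.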
40. psiblast results
41. Further information
- Consult README files in BLAST distribution
42. Summary
- A standalone BLAST server enables custom, secure, high-throughput searches
- BLAST distribution available from
  - ftp://ftp.ncbi.nih.gov/blast/executables/release
- Use command line parameters to tune your searches and format your results
- Use different BLAST tools for different purposes
  - Regular (blastall: blastp, blastn, blastx, tblastn, tblastx)
  - Searching for domains (rpsblast: cdd search)
  - Distant homologues (blastpgp: psi/phi blast)
43. Assignment
- Four questions
- Running
  - blastp: identify a protein
  - rpsblast: search for domains in a protein
  - blastx: annotate a genomic sequence
  - psiblast: find a function for an unknown protein
- Some searches may take a few minutes
- Where applicable, report the e-value of hits and their locations on the query sequence, and the command you used to run the search
- No longer than 2 printed pages
- Submit to Saara by Fri 9am