Title: Please use linux today if possible
1Please use linux today if possible!
2Introduction to Molecular Biology Databases
Alinda Nagy, Hedi Hegyi, PhD
Institute of Enzymology, Budapest
The BioSapiens Permanent School of Bioinformatics
Budapest, Sept 4-8, 2006
3Databases
4What is a database?
- A database is a structured collection of
information. (An organized array of information.)
- A database consists of basic objects called
records or entries.
- Each record consists of fields, which hold
defined data that is related to that record.
- For example, a protein database would typically
have proteins as records and protein properties
as fields (e.g. name, length, sequence,
taxonomical origin).
Noam Kaplan
5What is a database?
- A database is searchable (index)
- -> table of contents
- A database is updated periodically (release)
- -> new edition
- A database is cross-referenced (hyperlinks)
- -> links with other databases
6Why Databases?
- The purpose of databases is not merely to collect
and organize data, but mainly to allow advanced
data retrieval.
- A query is a method to retrieve information from
the database.
- The organization of each record into
predetermined fields allows us to use queries on
fields.
- Example: Find all human proteins that are enzymes
and have a length of 1000-1200 aa.
Noam Kaplan
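The idea of records, fields and a query on fields can be sketched in a few lines of Python. This is only an illustration: the `ProteinRecord` class, its field names and the four toy entries are invented for this example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record; each attribute is a field (hypothetical schema)."""
    name: str
    organism: str
    length: int       # number of amino acids
    is_enzyme: bool


# Toy data, invented for illustration only.
records = [
    ProteinRecord("hypothetical kinase A", "Homo sapiens", 1105, True),
    ProteinRecord("hypothetical structural protein B", "Homo sapiens", 1150, False),
    ProteinRecord("hypothetical protease C", "Mus musculus", 1010, True),
    ProteinRecord("hypothetical ligase D", "Homo sapiens", 450, True),
]


def find_enzymes(records, organism, min_length, max_length):
    """Return enzyme records from one organism within a length range."""
    return [
        r for r in records
        if r.organism == organism
        and r.is_enzyme
        and min_length <= r.length <= max_length
    ]


# "Find all human proteins that are enzymes and have a length of 1000-1200 aa."
hits = find_enzymes(records, "Homo sapiens", 1000, 1200)
for record in hits:
    print(record.name, record.length)
```

Because every record has the same predetermined fields, the query is just a filter on those fields; real database systems do the same thing with indexes so that the whole collection does not have to be scanned.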
7Databases on the Internet
- Biological databases often have a web interface,
which allows the user to send queries to the
database.
- Some databases can be accessed by different web
servers, each offering a different interface.
[Diagram: User, Web server and Database server, connected by arrows labelled query, request, result and web page]
Noam Kaplan
8Databases on the Internet
Information system
Query system
Storage System
Data
Francis Ouellette
11Databases on the Internet
- A list you look at
- A catalogue
- Indexed files
- SQL
- grep
Information system
Query system
Storage System
Data
Francis Ouellette
13Database download
- Nearly all biological databases are available for
download as simple text files.
- A local version of the database removes
limitations on how you process the data.
- Processing data in files requires some minimal
computer-programming skills.
- Perl is an easy programming language that can be
used for extraction and analysis of data from
files.
Noam Kaplan
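As a minimal sketch of what processing a downloaded text file looks like, the snippet below reads sequences in FASTA format and reports their lengths. The slide recommends Perl; Python is used here purely as an illustrative substitute, and the function name `read_fasta` and the two demo entries are made up for this example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous entry
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:                # last entry in the file
        yield header, "".join(chunks)


# Stand-in for a downloaded file; with a real file use open("file.fa").
example = io.StringIO(""">demo1 hypothetical entry
MKTAYIAK
QRQISFVK
>demo2 hypothetical entry
MSTAVLENPG
""")

lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(example)}
print(lengths)
```

The same loop works on a full database download, since the generator holds only one entry in memory at a time.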
14Tour of the major molecular biology databases
- There is a tremendous amount of information about
biomolecules in publicly available databases.
- Today, we will just look at some of the main
databases and what kind of information they
contain.
- Exercises will give you a little practice at
browsing databases.
16List of molecular biology databases
- Nucleic Acids Research publishes an annual
database issue. The 2006 update of the online
Molecular Biology Database Collection includes
858 databases.
- http://www3.oup.co.uk/nar/database/c/
17Large Growth in the Number of Biological Databases
NAR Database Issue
18Molecular biology data types
Organisms
Genome maps
Lei Liu
19Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
Lei Liu
20Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
Protein sequences
...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
Lei Liu
21Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA structures
RNA sequences
Protein sequences
Protein structures
PDB entry 1CIS P.Osmark, P.Sorensen, F.M.Poulsen
Lei Liu
22Molecular biology data types
Organisms
Genome maps
DNA motifs
DNA sequences
RNA expression
RNA structures
RNA sequences
Protein sequences
Protein structures
Protein motifs
Lei Liu
23Types of molecular biology databases
- 14 main NAR categories
- Nucleotide Sequence
- RNA sequence
- Protein sequence
- Structure
- Genomics (non-vertebrate)
- Metabolic and Signaling Pathways
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Microarray Data and other Gene Expression
- Proteomics Resources
- Other
- Organelle
- Plant
- Immunological
24Resources are Becoming More Diverse
NAR Database Categories
2004
2006
25NAR 2006: A Closer Look
- Genome scale databases have proliferated
- Traditional sequence databases are now a small
part
- Databases around new specific data types are
emerging
- Pathway and disease orientated databases are
emerging
26Database searches
27Using a database
- How to get information out of a database
- Summaries: how many entries, average or extreme
values
- Browsing: no targeted information to retrieve
- Search: looking for particular information
- Searching a database
- Must have a key that identifies the element(s) of
the database that are of interest.
- Name of gene
- Sequence of gene
- Other information
Larry Hunter
28Searching sequence databases
- Start from sequence, find information about it
- Many kinds of input sequences
- Could be amino acid or nucleotide sequence
- Genomic or mRNA/cDNA or protein sequence
- Complete or fragmentary sequences
- Exact matches are rare (even uninteresting in
many cases), so often the goal is to retrieve a set
of similar sequences.
- Both small (mutations) and large (required for
function) differences within similar sequences can be
interesting.
Larry Hunter
29What might we want to know about a sequence?
- Is this sequence similar to any known genes? How
close is the best match? Significance?
- What do we know about that gene?
- Genomic (chromosomal location, allelic
information, regulatory regions, etc.)
- Structural (known structure? structural domains?
etc.)
- Functional (molecular, cellular, disease)
- Evolutionary information
- Is this gene found in other organisms?
- What is its taxonomic tree?
Larry Hunter
30What can be discovered about a gene by a database
search?
- A little or a lot, depending on the gene
- Evolutionary information: homologous genes,
taxonomic distributions, allele frequencies,
synteny, etc.
- Genomic information: chromosomal location,
introns, UTRs, regulatory regions, shared
domains, etc.
- Structural information: associated protein
structures, fold types, structural domains
- Expression information: expression specific to
particular tissues, developmental stages,
phenotypes, diseases, etc.
- Functional information: enzymatic/molecular
function, pathway/cellular role, localization,
role in diseases
Larry Hunter
31NCBI and Entrez
32NCBI and Entrez
- One of the most useful and comprehensive sources
of databases is the NCBI (National Center for
Biotechnology Information), part of the NIH
(National Institutes of Health).
- NCBI provides interesting summaries, browsers for
genome data, and search tools
- Entrez is their database search interface:
http://www.ncbi.nlm.nih.gov/Entrez
- Can search on gene names, sequences, chromosomal
location, diseases, keywords, ...
Larry Hunter
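Entrez can also be queried from a script. The sketch below only builds a search URL for NCBI's E-utilities `esearch` service and does not contact the server. The E-utilities interface is not mentioned on the slides; the helper name `entrez_search_url` and the example search term are assumptions made for illustration, so check the current NCBI documentation before relying on them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities search service (assumed still current).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database, term, max_results=20):
    """Build (but do not send) an Entrez text-search URL."""
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical query: protein records mentioning cochlin, restricted to human.
url = entrez_search_url("protein", "cochlin AND Homo sapiens[Organism]")
print(url)

# Decoding the URL again shows which fields the query carries.
print(parse_qs(urlsplit(url).query)["term"][0])
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; that step is left out so the sketch runs without network access.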
34BLAST: Searching with a sequence
- Goal is to find other sequences that are more
similar to the query than would be expected by
chance (and therefore are likely homologous).
- Can start with nucleotide or amino acid sequence,
and search for either (or both)
- Many options
- E.g. ignore low information (repetitive)
sequence, set significance critical value
- Defaults are not always appropriate: READ THE
NCBI EDUCATION PAGES!
Larry Hunter
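"More similar than would be expected by chance" is usually quantified with an expected number of chance hits (E-value). A rough sketch, assuming the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S): the function name is invented here, and the default `lam` and `k` values are illustrative constants not taken from the slides; BLAST chooses its own values depending on the scoring matrix and gap penalties.

```python
import math


def expected_chance_hits(raw_score, query_length, database_length,
                         lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S.

    lam and k are illustrative defaults (assumptions), not BLAST output.
    """
    return k * query_length * database_length * math.exp(-lam * raw_score)


# A 250-residue query against a database of one million residues.
for score in (40, 60, 80):
    e_value = expected_chance_hits(score, 250, 1_000_000)
    print(f"score {score}: E = {e_value:.3g}")
```

Two properties are worth noticing: E falls exponentially as the score rises, and it grows in proportion to the database size, so the same alignment is less significant in a larger database.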
36- Major choices
- Translation
- Database
- Filters
- Restrictions
- Matrix
Larry Hunter
39Close hit: Rat ADH alpha
Larry Hunter
40Distant hit: Human sorbitol dehydrogenase
Larry Hunter
41Parameters (at bottom!)
Larry Hunter
42Click on
Larry Hunter
44BLAST searches online
- http://www.ncbi.nlm.nih.gov/BLAST/
- Sequences:
>ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED
>ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV
45BLAST output for ENSP00000002501
46BLAST output for ENSP00000002501
47BLAST output for ENSP00000314902
48BLAST output for ENSP00000314902
49Take home messages
- There are a lot of molecular biology databases,
containing a lot of valuable information
- Not even the best databases have everything (or
the best of everything)
- These databases are moderately well cross-linked,
and there are linker databases
- Sequence is a good identifier, maybe even better
than gene name!
Larry Hunter
50Protein sequence databases
- General sequence databases (e.g. UniProt)
- Protein properties (e.g. PFD - Protein Folding
Database)
- Protein localization and targeting
- (e.g. NPD - Nuclear Protein Database)
- Protein sequence motifs and active sites
- (e.g. BLOCKS, InterPro, PROSITE, PRINTS)
- Protein domain databases; protein classification
- (e.g. InterPro, ProDom, SMART, Pfam)
- Databases of individual protein families
- (e.g. Histone Database)
- http://www3.oup.co.uk/nar/database/cat/1
51UniProt (The Universal Protein Resource)
http://www.uniprot.org/
ftp://ftp.uniprot.org/pub/databases/
Wu CH et al. The Universal Protein Resource
(UniProt): an expanding universe of protein
information. Nucleic Acids Res. 2006 Jan
1;34(Database issue):D187-91.
52Margaret Dayhoff
- The first protein database was created by
Margaret Dayhoff, called The Atlas of Protein
Sequences.
- It was a book.
53The Atlas of Protein Sequences
- Dayhoff had the idea that a compilation of all
protein sequences in the literature into one
resource would be a useful research tool.
- She and her co-workers collected all known
sequences and published them together.
- Then, when a new sequence was obtained, there
was a single resource available for determining
its relationship to other known sequences.
54What is UniProt
55What is UniProt
- The world's most comprehensive catalog of
information on proteins.
- Central repository of protein sequence and
function.
- Created by joining the information contained in
Swiss-Prot, TrEMBL, and PIR.
- Collaboration between EBI (European
Bioinformatics Institute), SIB (Swiss Institute
of Bioinformatics) and PIR (DDBJ to join).
- Funded mainly by NIH.
- Three database components:
- UniProt Knowledgebase (UniProtKB)
- UniProt Reference Clusters (UniRef)
- UniProt Archive (UniParc)
56What is UniProt
1. UniProt Knowledgebase (UniProtKB): central
access point for extensive curated protein
information, including function, classification,
and cross-references. It comprises the manually
annotated UniProtKB/Swiss-Prot section and the
automatically annotated UniProtKB/TrEMBL
section.
2. UniProt Reference Clusters (UniRef): combines
closely related sequences into a single record.
This speeds similarity searches via sequence space
compression, by merging sequences that are 100%
(UniRef100), 90% (UniRef90) or 50% (UniRef50)
identical.
3. UniProt Archive (UniParc): comprehensive
repository reflecting the history of all
protein sequences. It stores all publicly available
protein sequences, with the history of
sequence data and links to the source databases.
57What is UniProt
The UniProt databases collect both protein
sequences obtained through experimental
determination and protein sequences derived from
the translation of nucleotide sequences (which
were predicted or determined to code for a
protein).
[Diagram: the protein sequence databases (PIR, Swiss-Prot, TrEMBL) take in amino acid sequences determined through experimental analysis and translations from the nucleotide sequence databases (GenBank, DDBJ, EMBL); entries are validated and enriched with specific information]
58UniProt Goals
- High level of annotation
- Minimal redundancy
- High level of integration with other databases
- Complete and up-to-date
59Annotation concepts
- UniParc
- No annotation
- UniProtKB
- Annotated
- UniRef
- No annotation, just description line of
UniProtKB or UniParc master entry in the cluster
for use in FASTA files
60Minimal redundancy
- UniParc
- All sequences that are 100% identical over their
entire length are merged into a single entry,
regardless of species. UniParc represents each
protein sequence once and only once, assigning it
a unique identifier. UniParc cross-references the
accession numbers of the source databases.
- UniProtKB
- Aims to describe in a single record all protein
products derived from a certain gene (or genes if
the translation from different genes in a genome
leads to indistinguishable proteins) from a
certain species.
- UniRef
- Merges sequences automatically.
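The UniParc rule (one entry per distinct sequence, with cross-references back to every source accession) can be illustrated with a small dictionary-based sketch. This is not how UniParc is implemented; the function name, the `ARCHIVE0001`-style identifiers and the source accessions below are all invented for the example.

```python
from collections import defaultdict


def merge_identical(entries):
    """Merge entries whose sequences are 100% identical over their full length.

    entries: iterable of (source_accession, sequence) pairs.
    Returns {archive_id: (sequence, [source accessions])}.
    """
    by_sequence = defaultdict(list)
    for source_accession, sequence in entries:
        # Identical sequence text means same key, whatever the species.
        by_sequence[sequence.upper()].append(source_accession)

    archive = {}
    for number, (sequence, accessions) in enumerate(by_sequence.items(), start=1):
        archive[f"ARCHIVE{number:04d}"] = (sequence, accessions)
    return archive


# Hypothetical source records from two made-up databases.
entries = [
    ("dbA:0001", "MKVLAAGIVG"),
    ("dbB:9001", "MKVLAAGIVG"),   # same sequence, different source
    ("dbA:0002", "MKVLAAGIVA"),   # differs by one residue: kept separate
]

archive = merge_identical(entries)
for archive_id, (sequence, accessions) in archive.items():
    print(archive_id, sequence, accessions)
```

UniRef90 and UniRef50 go further by clustering sequences that are similar but not identical, which needs pairwise sequence comparison rather than an exact-match lookup.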
61Integration with other databases
- UniParc
- Linked back to source records
- UniProtKB
- Linked to >60 other databases
- UniRef
- UniRef clusters link back to UniProtKB and
UniParc records in the cluster
62Complete and up-to-date
- UniParc
- All publicly available protein sequences,
updated every 2 weeks (05/06, Rel 8.0: 7,116,519
entries)
- UniProtKB
- All suitable stable protein sequences, updated
every 2 weeks (05/06, Rel 8.0: 3,170,612 entries)
- UniRef
- All protein sequences in the UniProtKB and in
UniParc useful for sequence similarity searches,
updated every 2 weeks (05/06, Rel 8.0: 3,511,676
UniRef100, 2,254,474 UniRef90, 1,148,123 UniRef50
entries)
63An example
64An example
65An example
66An example
67An example
68Exercise 1: Text search
1. Go to ExPASy. Click "UniProt Knowledgebase
(Swiss-Prot and TrEMBL)" and then search for
human cochlin. Notice that there is a wealth of
information about this protein. Furthermore,
there are many links to sequence analysis tools
(some of which you will learn about later) and some
other nice features. Note that this is merely a
graphical display of the original
UniProtKB/Swiss-Prot database entry (which is in
text).
2. Try to answer all of the questions below.
- 1. In which year was the NMR structure of the
LCCL domain determined?
- 2. Where is the protein expressed?
- 3. Which diseases are associated with the protein?
69Exercise 2: BLAST search
- 1. Go to ExPASy. Click "UniProt Knowledgebase
(Swiss-Prot and TrEMBL)" and then BLAST.
- 2. Copy the following human amino acid sequence.
- MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAK
VLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRH
DIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFK
DPVYRARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKT
HACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLS
SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSS
FGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVR
NFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQK
IK
- 3. Paste the sequence into the query sequence
window and adjust the options as necessary. You
won't need to specify advanced options, but you
should choose a program and database. For
simplicity, use e.g. the UniProtKB database.
- 4. Run the search and identify the protein. Use
the link provided to see the UniProtKB/Swiss-Prot
report.
70Exercise 2: BLAST search
- 5. Now, try to answer all of the questions below.
- 1. What is the Swiss-Prot primary accession
number?
- 2. What is the common name of the protein?
- 3. What is the gene called?
- 4. In which year was the crystal structure of the
catalytic domain determined? Name the first author.
- 5. Does the enzyme require a co-factor to function?
If so, what?
- 6. Name the most common disease that arises as a
result of deficiency of this enzyme.
- 7. How many amino acid residues are there in the
protein?
- 8. What is the molecular weight of the protein?
71Patterns and Profiles, Protein Motifs and Domains
- InterPro - an integrated database of protein
families, domains, motifs and functional sites.
- Blocks - multiply aligned ungapped segments for
the most highly conserved regions of proteins.
- Motif - a server that scans databases to find
motifs or patterns and that can generate sequence
profiles.
- Pfam - multiple sequence alignments and HMMs of
protein domains and families.
- PRINTS - database of groups of conserved motifs,
or protein fingerprints.
- ProDom - protein domain families automatically
generated from SWISS-PROT and TrEMBL.
- PROSITE - database of protein families and
domains defined by functional sites, patterns and
profiles.
- SMART - Simple Modular Architecture Research Tool
for the identification of domains.
- COGs database - clusters of sequences determined
by comparing sequences from whole genomes.
72InterPro (Integrated resource of Protein
Families, Domains and Sites)
- http://www.ebi.ac.uk/interpro/
- ftp://ftp.ebi.ac.uk/pub/databases/interpro
- Mulder NJ et al. (2005) InterPro, progress and
status in 2005. Nucleic Acids Res. 33 (Database
Issue): D201-5.
73What is InterPro
- Secondary protein databases on functional sites
and domains are vital resources for identifying
distant relationships in novel sequences, and
hence for predicting protein function and
structure.
- Unfortunately, these signature databases do not
share the same formats and nomenclature, and each
database has its own strengths and weaknesses.
- Thus, for best results, search strategies should
ideally combine all of them.
74What is InterPro
- InterPro is a collaborative project aimed at
providing an integrated layer on top of the most
commonly used signature databases by creating a
unique, non-redundant characterization of a
given protein family, domain or functional site.
- Integrates the PROSITE, PRINTS, Pfam, ProDom, SMART,
TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D
and PANTHER databases; the addition of others
is scheduled.
- Has cross-references to the BLOCKS database as
well as many specialized protein family and
protein structure databases.
75InterPro
- The latest release of InterPro (12.1) contains
12,953 entries, with 78% coverage of all proteins
in UniProtKB.
- Each entry has annotation provided in the name,
GO mapping and abstract fields, and all matches
against the Swiss-Prot and TrEMBL components of
UniProt are precomputed and available for viewing
in different formats.
- Protein 3D structural information is integrated
from MSD, CATH and SCOP, and this data is
available in the match views to provide an
at-a-glance comparison of sequence and structural
domains.
76InterPro
77InterProScan result
78PROSITE http://www.expasy.org/prosite/
- Database of protein families and domains
79PROSITE
- consists of a large collection of biologically
meaningful signatures that are described as
patterns or profiles that help to reliably
identify to which known protein family (if any) a
new sequence belongs
- the latest version (release 19.11) contains 1329
patterns and 552 profile entries
- each signature is linked to documentation
providing information on the protein family or
domain detected by the signature: origin of its
name, taxonomic occurrence, domain architecture,
function, 3D structure, main characteristics of
the sequence, domain size and some references
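To make the idea of a PROSITE pattern concrete, the sketch below translates the pattern notation into a Python regular expression and scans a sequence with it. It handles only the basic syntax (`x`, `[...]`, `{...}`, repeat counts and the `<` / `>` terminus anchors) and is not the ScanProsite implementation; the function name and the test sequence are invented. The example pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site signature.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):            # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                         # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        # [ABC] alternatives and single letters are already valid regex.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Made-up sequence: scan it and report 0-based start positions of matches.
sequence = "MKNGTALNPSANVSQ"
starts = [m.start() for m in re.finditer(regex, sequence)]
print(starts)
```

Note that `re.finditer` reports non-overlapping matches only, and that a short pattern like this one matches many proteins by chance, which is why ScanProsite offers the option to exclude patterns with a high probability of occurrence.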
80PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
81PRINTS
- The PRINTS database is a compendium of protein
fingerprints.
- A fingerprint is a group of conserved sequence
motifs that together provide diagnostic
signatures for protein families.
- Fingerprints are diagnostically more powerful
than single motifs because they make use of the
biological context inherent in a multiple-motif
method.
- The fingerprinting method is a reliable technique
for detecting members of large, highly divergent
protein super-families.
82PFAM http://www.sanger.ac.uk/Software/Pfam/
83PFAM
- Database of multiple sequence alignments and HMMs
of protein domains and families.
- Profile hidden Markov models are statistical
models of the primary structure consensus of a
sequence family.
- The construction and use of Pfam is tightly tied
to the HMMER software package.
84PFAM
- Composed of two sets of families
- Pfam-A
- curated part containing over 8296 protein
families
- Pfam-B
- automatically generated supplement containing a
large number of small families taken from the
PRODOM database that do not overlap with Pfam-A
(lower quality)
85PFAM
- Each family has the following data:
- A seed alignment, which is a hand-edited multiple
alignment representing the family.
- Hidden Markov Models (HMMs) derived from the seed
alignment, which can be used to find new members
of the domain and to realign a set of sequences
to the model. One HMM is in ls mode (global), the
other in fs mode (local).
- A full alignment, which is an automatic alignment
of all the examples of the domain, using the two
HMMs to find and then align the sequences.
- Annotation, which contains a brief description of
the domain, links to other databases and some
Pfam-specific data recording how the family was
constructed.
86A PFAM entry
87A PFAM entry, contd
88PFAM searches
89PFAM results
90PRODOM http://www.toulouse.inra.fr/prodom.html
91PRODOM
- Database of protein domain families automatically
generated from SWISS-PROT and TrEMBL databases by
sequence comparison.
- Useful for analysing the domain arrangements of
complex protein families and the homology
relationships in modular proteins.
- Contains (release 2003.1) 144,444 domain families
containing two or more individual domains.
92SMART http://smart.embl-heidelberg.de/
- Simple Modular Architecture Research Tool
93SMART
- Allows the identification and annotation of
protein domains and the analysis of domain
architectures.
- The current release has more than 600 domain
families represented among nuclear, signalling
and extracellular proteins.
- Extensive annotation for each domain family is
available, providing information on function,
subcellular localization, phyletic distribution
and tertiary structure, with links to OMIM in cases
where a human disease is associated with one or
more mutations in a particular domain.
94Exercise 3: Domain search
- 1. Go to the PROSITE site.
- 2. Under "Tools for PROSITE" choose ScanProsite.
- 3. Paste the sequence below into the box and tick
the option "Exclude patterns with a high
probability of occurrence" (finding very common
patterns will not tell you much about your
protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDM
NPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKK
GPVGMPKEATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDG
LTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASP
ETPELDMAAPALLNNPVHQSVTMGETVSFCDVVGRPRPEITWEKQLEDRE
NVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFP
LSVVRGHQAAATSESSPNGTAFPAAELKPPDSEDCGEEQTRWHFDAQANN
CLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRW
AYNSQTGQCQSFVYGGCEGNGNNFESREACEESPFPRGNQRCRACKPRQK
LVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEP
LEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVGASSARR
VRKLREVMHKKTCDVLKEFLGLH
- 4. Start the scan.
- Which motifs are found?
95Exercise 4: Domain search
- 1. Go to the Pfam site.
- 2. Click Search by protein name or sequence.
- 3. Paste the sequence below into the box and
choose Both Global and Fragment Pfam search.
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDM
NPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKK
GPVGMPKEATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDG
LTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASP
ETPELDMAAPALLNNPVHQSVTMGETVSFCDVVGRPRPEITWEKQLEDRE
NVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFP
LSVVRGHQAAATSESSPNGTAFPAAELKPPDSEDCGEEQTRWHFDAQANN
CLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRW
AYNSQTGQCQSFVYGGCEGNGNNFESREACEESPFPRGNQRCRACKPRQK
LVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEP
LEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVGASSARR
VRKLREVMHKKTCDVLKEFLGLH
- 4. Search Pfam.
- 1. Which domains are found?
- 2. What may be the function of this protein?
96Exercise 5: BLAST searches on your computer
- 1. Download the file blast-2.2.14-ia32-linux.tar.gz
from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST
- 2. Make a subdirectory in your home directory
- mkdir ~/blast
- 3. Move the blast file there
- mv blast-2.2.14-ia32-linux.tar.gz ~/blast/
- 4. Go to the blast directory
- cd ~/blast/
- 5. Unzip the file, then unpack it
- gunzip blast-2.2.14-ia32-linux.tar.gz
- tar xvf blast-2.2.14-ia32-linux.tar
97Exercise 5 Blast searches, contd
- 6. Get the first 100 human proteins in Swissprot
- - go to http://www.expasy.org/srs5/
- - click on Start
- - unmark TREMBL, to search only in Swissprot
- - press Continue
98Exercise 5 Blast searches, contd
- In the first Info line, select Organism and type
in human
- Press Do Query; this will retrieve all human
proteins in Swissprot in batches of 100
99Exercise 5 Blast searches, contd
Press save
100Exercise 5 Blast searches, contd
1. Change view to FastaSeqs
2. Change Sequence Format to fasta
3. Press SAVE
101Exercise 5 Blast searches, contd
- Save the file, e.g. as 100seq.fa
- 7. Format your database of 100 sequences to make
it searchable by blast
- ~/blast/blast-2.2.14/bin/formatdb -i 100seq.fa
- 8. Now that you have a searchable database, you can
search with an input sequence of your choice.
E.g. make a file from the first sequence in
100seq.fa: select the first sequence with the mouse
and type
- cat > seq1.fa
- then paste the sequence into the terminal and press <Ctrl-d>
- 9. Now you have an input sequence and a database,
type
- ~/blast/blast-2.2.14/bin/blastall -p blastp -i seq1.fa -d 100seq.fa -o seq1-vs-100seq.blastp
- 10. After it has finished running (it will be ready
almost immediately) you will find your output in
the file seq1-vs-100seq.blastp. If you invoke the
blastall program without any switches it will
list all the options you can use.
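If you want to repeat steps 7 and 9 for many query files, the two commands can be assembled in a script. The sketch below only builds the argument lists for the legacy `formatdb` and `blastall` programs used in this exercise and does not run them. The helper name `blast_commands` and the install directory are assumptions; `-p T` (protein input) is added to `formatdb` explicitly even though it is that program's default.

```python
from pathlib import Path


def blast_commands(blast_bin, database_fasta, query_fasta, output_file):
    """Return (formatdb_argv, blastall_argv) for a protein-protein search."""
    blast_bin = Path(blast_bin)
    formatdb = [
        str(blast_bin / "formatdb"),
        "-i", database_fasta,     # FASTA file to index
        "-p", "T",                # T = protein sequences
    ]
    blastall = [
        str(blast_bin / "blastall"),
        "-p", "blastp",           # program: protein query vs protein database
        "-i", query_fasta,        # input (query) sequence
        "-d", database_fasta,     # formatted database
        "-o", output_file,        # where to write the report
    ]
    return formatdb, blastall


# Assumed install location, matching the directory created in the exercise.
bin_dir = Path.home() / "blast" / "blast-2.2.14" / "bin"
formatdb_cmd, blastall_cmd = blast_commands(
    bin_dir, "100seq.fa", "seq1.fa", "seq1-vs-100seq.blastp"
)
print(" ".join(formatdb_cmd))
print(" ".join(blastall_cmd))
```

To actually execute the commands, each list can be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with unusual file names.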