Please use linux today if possible

1
Please use linux today if possible!
2
Introduction to Molecular Biology Databases
Alinda Nagy and Hedi Hegyi, PhD, at the Institute of
Enzymology, Budapest
The BioSapiens Permanent School of Bioinformatics
Budapest, Sept 4-8, 2006
3
Databases
4
What is a database?
  • A database is a structured collection of
    information. (An organized array of information.)
  • A database consists of basic objects called
    records or entries.
  • Each record consists of fields, which hold
    defined data that is related to that record.
  • For example, a protein database would typically
    have proteins as records and protein properties
    as fields (e.g. name, length, sequence,
    taxonomic origin)

Noam Kaplan
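The record and field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, its fields, and the two entries are made up here and do not follow the format of any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record (entry); each attribute is a field."""
    name: str
    organism: str   # taxonomic origin
    sequence: str   # amino acids in one-letter code

    @property
    def length(self) -> int:
        # Length is derived from the sequence rather than stored separately.
        return len(self.sequence)


# A toy "database": a structured collection of records.
# The entries below are invented for illustration only.
database = [
    ProteinRecord("toy protein A", "Homo sapiens", "MEPPEGAGTG"),
    ProteinRecord("toy protein B", "Rattus norvegicus", "MPVAGSEL"),
]

for record in database:
    print(record.name, record.organism, record.length)
```

Because every record has the same fields, any later processing (searching, sorting, counting) can rely on them being present.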
5
What is a database?
  • A database is searchable (index)
    -> table of contents
  • A database is updated periodically (release)
    -> new edition
  • A database is cross-referenced (hyperlinks)
    -> links with other db
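As a rough sketch of what "searchable" and "cross-referenced" mean in practice, the Python snippet below builds two small indexes and follows cross-references between records. The accessions (`X0001` and so on) and the `xrefs` field are hypothetical and chosen only for this example.

```python
# Toy records keyed by a made-up accession; "xrefs" are cross-references.
records = [
    {"accession": "X0001", "name": "alcohol dehydrogenase", "xrefs": ["X0003"]},
    {"accession": "X0002", "name": "insulin", "xrefs": []},
    {"accession": "X0003", "name": "sorbitol dehydrogenase", "xrefs": ["X0001"]},
]

# Primary index: accession -> record, like page numbers in a table of contents.
by_accession = {rec["accession"]: rec for rec in records}

# Keyword index: each word of a name -> accessions of records containing it.
by_word = {}
for rec in records:
    for word in rec["name"].split():
        by_word.setdefault(word, []).append(rec["accession"])


def linked_names(accession):
    """Follow the cross-references of one record and return the linked names."""
    return [by_accession[x]["name"] for x in by_accession[accession]["xrefs"]]


print(by_word["dehydrogenase"])   # accessions found through the index
print(linked_names("X0001"))      # names reached through cross-references
```

With an index, a lookup does not have to scan every record, which is the same reason a book has a table of contents.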

6
Why Databases?
  • The purpose of databases is not merely to collect
    and organize data, but mainly to allow advanced
    data retrieval.
  • A query is a method to retrieve information from
    the database.
  • The organization of each record into
    predetermined fields allows us to use queries on
    fields.
  • Example: Find all human proteins that are enzymes
    and have a length of 1000-1200 aa.

Noam Kaplan
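The example query above can be written as a filter over fields. The sketch below assumes each record is a Python dictionary with `organism`, `is_enzyme` and `length` fields; these field names and the four records are invented for illustration.

```python
# Made-up records; a real database would hold many more fields and entries.
proteins = [
    {"name": "enzyme 1", "organism": "Homo sapiens", "is_enzyme": True, "length": 1100},
    {"name": "enzyme 2", "organism": "Homo sapiens", "is_enzyme": True, "length": 350},
    {"name": "receptor", "organism": "Homo sapiens", "is_enzyme": False, "length": 1150},
    {"name": "enzyme 3", "organism": "Mus musculus", "is_enzyme": True, "length": 1200},
]


def query(records, organism, min_length, max_length):
    """Return enzymes from one organism whose length is within the range."""
    return [
        rec for rec in records
        if rec["organism"] == organism
        and rec["is_enzyme"]
        and min_length <= rec["length"] <= max_length
    ]


hits = query(proteins, "Homo sapiens", 1000, 1200)
print([rec["name"] for rec in hits])
```

Each condition in the filter tests one field, which is only possible because the records were organized into predetermined fields in the first place.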
7
Databases on the Internet
  • Biological databases often have a web interface,
    which allows the user to send queries to the
    database.
  • Some databases can be accessed by different web
    servers, each offering a different interface.

[Diagram: the User sends a query to the Web server,
which passes a request to the Database server; the
result comes back to the User as a web page.]
Noam Kaplan
8
Databases on the Internet
Information system
Query system
Storage System
Data
Francis Ouellette
11
Databases on the Internet
  • A list you look at
  • A catalogue
  • Indexed files
  • SQL
  • grep
Information system
Query system
Storage System
Data
Francis Ouellette
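To make the difference between a "grep" style search and an SQL query concrete, here is a small Python sketch using the standard `sqlite3` module. The table name `protein`, its columns and the three rows are assumptions made for this example only.

```python
import sqlite3

# Made-up rows: (name, organism, length).
rows = [
    ("toy kinase", "Homo sapiens", 1100),
    ("toy histone", "Homo sapiens", 130),
    ("toy kinase", "Mus musculus", 1090),
]

# "grep" style: scan every line of a flat text listing for a keyword.
flat_file = ["\t".join(str(field) for field in row) for row in rows]
grep_hits = [line for line in flat_file if "kinase" in line]

# SQL style: load the same rows into a table and query on fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (name TEXT, organism TEXT, length INTEGER)")
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", rows)
sql_hits = conn.execute(
    "SELECT name, length FROM protein "
    "WHERE organism = ? AND length BETWEEN ? AND ? ORDER BY length",
    ("Homo sapiens", 1000, 1200),
).fetchall()
conn.close()

print(grep_hits)
print(sql_hits)
```

The text scan finds any line containing the keyword, while the SQL query can combine conditions on individual fields.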
13
Database download
  • Nearly all biological databases are available for
    download as simple text files.
  • A local version of the database removes
    limitations on how you process the data.
  • Processing data in files requires some minimal
    computer-programming skills.
  • PERL is an easy programming language that can be
    used for extraction and analysis of data from
    files.

Noam Kaplan
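The slide suggests PERL for this kind of file processing; the sketch below shows the same idea in Python instead, reading sequences from a FASTA-format text file. The function name `read_fasta` and the two entries are illustrative, and the in-memory text stands in for a downloaded file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA-format text file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A small in-memory stand-in for a downloaded file (made-up entries).
example = io.StringIO(">seq1 toy entry\nMEPPEG\nAGTG\n>seq2 another toy entry\nMPVAGS\n")
entries = list(read_fasta(example))
for header, sequence in entries:
    print(header.split()[0], len(sequence))
```

For a real file, `open("some_file.fa")` would replace the `io.StringIO` object; the rest of the code stays the same.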
14
Tour of the major molecular biology databases
  • There is a tremendous amount of information about
    biomolecules in publicly available databases.
  • Today, we will just look at some of the main
    databases and what kind of information they
    contain.
  • Exercises will give you a little practice at
    browsing databases.

15
List of molecular biology databases
16
List of molecular biology databases
  • Nucleic Acids Research publishes an annual
    database issue. The 2006 update of the online
    Molecular Biology Database Collection includes
    858 databases.
  • http://www3.oup.co.uk/nar/database/c/

17
Large Growth in the Number of Biological Databases
NAR Database Issue
18
Molecular biology data types
Organisms
Genome maps
Lei Liu
19
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
Lei Liu
20
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
Protein sequences
...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
Lei Liu
21
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA structures
RNA sequences
Protein sequences
Protein structures
PDB entry 1CIS (P. Osmark, P. Sorensen, F.M. Poulsen)
Lei Liu
22
Molecular biology data types
Organisms
Genome maps
DNA motifs
DNA sequences
RNA expression
RNA structures
RNA sequences
Protein sequences
Protein structures
Protein motifs
Lei Liu
23
Types of molecular biology databases
  • 14 main NAR categories:
  • Nucleotide Sequence
  • RNA sequence
  • Protein sequence
  • Structure
  • Genomics (non-vertebrate)
  • Metabolic and Signaling Pathways
  • Human and other Vertebrate Genomes
  • Human Genes and Diseases
  • Microarray Data and other Gene Expression
  • Proteomics Resources
  • Other
  • Organelle
  • Plant
  • Immunological

24
Resources are Becoming More Diverse
NAR Database Categories
2004
2006
25
NAR 2006: A Closer Look
  • Genome scale databases have proliferated
  • Traditional sequence databases are now a small
    part
  • Databases around new specific data types are
    emerging
  • Pathway and disease orientated databases are
    emerging

26
Database searches
27
Using a database
  • How to get information out of a database:
  • Summaries: how many entries, average or extreme
    values
  • Browsing: no targeted information to retrieve
  • Search: looking for particular information
  • Searching a database:
  • Must have a key that identifies the element(s) of
    the database that are of interest:
  • Name of gene
  • Sequence of gene
  • Other information

Larry Hunter
28
Searching sequence databases
  • Start from sequence, find information about it
  • Many kinds of input sequences
  • Could be amino acid or nucleotide sequence
  • Genomic or mRNA/cDNA or protein sequence
  • Complete or fragmentary sequences
  • Exact matches are rare (even uninteresting in
    many cases), so often the goal is to retrieve a
    set of similar sequences.
  • Both small (mutations) and large (required for
    function) differences within similar sequences
    can be interesting.

Larry Hunter
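One simple way to put a number on "similar" is percent identity. The Python sketch below assumes the two sequences are already aligned and have the same length; real search tools such as BLAST build gapped local alignments with substitution matrices, which this sketch does not attempt. The sequences are made up.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


query_seq = "MEPPEGAGTG"
exact = "MEPPEGAGTG"
mutated = "MEPAEGAGTG"   # one substitution relative to the query

print(percent_identity(query_seq, exact))
print(percent_identity(query_seq, mutated))
```

A search for similar sequences then amounts to keeping every database entry whose similarity to the query is above some chosen threshold.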
29
What might we want to know about a sequence?
  • Is this sequence similar to any known genes? How
    close is the best match? Significance?
  • What do we know about that gene?
  • Genomic (chromosomal location, allelic
    information, regulatory regions, etc.)
  • Structural (known structure? structural domains?
    etc.)
  • Functional (molecular, cellular, disease)
  • Evolutionary information
  • Is this gene found in other organisms?
  • What is its taxonomic tree?

Larry Hunter
30
What can be discovered about a gene by a database
search?
  • A little or a lot, depending on the gene
  • Evolutionary information: homologous genes,
    taxonomic distributions, allele frequencies,
    synteny, etc.
  • Genomic information: chromosomal location,
    introns, UTRs, regulatory regions, shared
    domains, etc.
  • Structural information: associated protein
    structures, fold types, structural domains
  • Expression information: expression specific to
    particular tissues, developmental stages,
    phenotypes, diseases, etc.
  • Functional information: enzymatic/molecular
    function, pathway/cellular role, localization,
    role in diseases

Larry Hunter
31
NCBI and Entrez
32
NCBI and Entrez
  • One of the most useful and comprehensive sources
    of databases is the NCBI (National Center for
    Biotechnology Information), part of the NIH
    (National Institutes of Health).
  • NCBI provides interesting summaries, browsers for
    genome data, and search tools
  • Entrez is their database search interface:
    http://www.ncbi.nlm.nih.gov/Entrez
  • Can search on gene names, sequences, chromosomal
    location, diseases, keywords, ...

Larry Hunter
33
(No Transcript)
34
BLAST Searching with a sequence
  • The goal is to find other sequences that are more
    similar to the query than would be expected by
    chance (and are therefore likely to be homologous).
  • Can start with nucleotide or amino acid sequence,
    and search for either (or both)
  • Many options
  • E.g. ignore low information (repetitive)
    sequence, set significance critical value
  • Defaults are not always appropriate: READ THE
    NCBI EDUCATION PAGES!

Larry Hunter
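"More similar than expected by chance" is usually expressed as an E-value, the expected number of chance hits with at least a given score. Under Karlin-Altschul statistics this is E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The Python sketch below evaluates that formula; the values used for lambda and K are illustrative assumptions, since the real ones depend on the scoring matrix and gap penalties and are reported in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumed here, not taken from the slides).
LAMBDA, K = 0.267, 0.041

for score in (50, 100, 200):
    e = expect_value(score, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
    print(score, round(bit_score(score, LAMBDA, K), 1), f"{e:.2e}")
```

The E-value falls exponentially as the score rises and grows in proportion to the database size, which is why the same alignment is less significant in a larger database.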
35
(No Transcript)
36
  • Major choices
  • Translation
  • Database
  • Filters
  • Restrictions
  • Matrix

Larry Hunter
37
Larry Hunter
38
Larry Hunter
39
Close hit: Rat ADH alpha
Larry Hunter
40
Distant hit: Human sorbitol dehydrogenase
Larry Hunter
41
Parameters (at bottom!)
Larry Hunter
42
Click on
Larry Hunter
43
Larry Hunter
44
BLAST searches online
  • http://www.ncbi.nlm.nih.gov/BLAST/
  • Sequences:
    >ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
    MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
    PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
    SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED
    >ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
    MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
    MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
    VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
    PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV

45
BLAST output for ENSP00000002501
46
BLAST output for ENSP00000002501
47
BLAST output for ENSP00000314902
48
BLAST output for ENSP00000314902
49
Take home messages
  • There are a lot of molecular biology databases,
    containing a lot of valuable information
  • Not even the best databases have everything (or
    the best of everything)
  • These databases are moderately well cross-linked,
    and there are linker databases
  • Sequence is a good identifier, maybe even better
    than gene name!

Larry Hunter
50
Protein sequence databases
  • General sequence databases (e.g. UniProt)
  • Protein properties (e.g. PFD, the Protein Folding
    Database)
  • Protein localization and targeting
    (e.g. NPD, the Nuclear Protein Database)
  • Protein sequence motifs and active sites
    (e.g. BLOCKS, InterPro, PROSITE, PRINTS)
  • Protein domain databases; protein classification
    (e.g. InterPro, ProDom, SMART, Pfam)
  • Databases of individual protein families
    (e.g. Histone Database)
  • http://www3.oup.co.uk/nar/database/cat/1

51
UniProt (The Universal Protein Resource)
http://www.uniprot.org/
ftp://ftp.uniprot.org/pub/databases/
Wu CH et al. The Universal Protein Resource
(UniProt): an expanding universe of protein
information. Nucleic Acids Res. 2006 Jan 1;
34(Database issue):D187-91.
52
Margaret Dayhoff
  • The first protein database, called The Atlas of
    Protein Sequences, was created by Margaret
    Dayhoff.
  • It was a book.

53
The Atlas of Protein Sequences
  • Dayhoff had the idea that a compilation of all
    protein sequences in the literature into one
    resource would be a useful research tool.
  • She and her co-workers collected all known
    sequences and published them together.
  • Then, when a new sequence was obtained, there
    was a single resource available for determining
    its relationship to other known sequences.

54
What is UniProt
55
What is UniProt
  • The world's most comprehensive catalog of
    information on proteins.
  • Central repository of protein sequence and
    function.
  • Created by joining the information contained in
    Swiss-Prot, TrEMBL, and PIR.
  • Collaboration between EBI (European
    Bioinformatics Institute), SIB (Swiss Institute
    of Bioinformatics) and PIR (DDBJ to join).
  • Funded mainly by NIH.
  • Three database components
  • UniProt Knowledgebase (UniProtKB)
  • UniProt Reference Clusters (UniRef)
  • UniProt Archive (UniParc)

56
What is UniProt
1. UniProt Knowledgebase (UniProtKB): central
   access point for extensive curated protein
   information, including function, classification,
   and cross-reference; comprises the manually
   annotated UniProtKB/Swiss-Prot section and the
   automatically annotated UniProtKB/TrEMBL section.
2. UniProt Reference Clusters (UniRef): combines
   closely related sequences into a single record;
   speeds similarity searches via sequence space
   compression by merging sequences that are 100%
   (UniRef100), 90% (UniRef90) or 50% (UniRef50)
   identical.
3. UniProt Archive (UniParc): comprehensive
   repository reflecting the history of all protein
   sequences; stores all publicly available protein
   sequences, containing the history of sequence
   data with links to the source databases.
57
What is UniProt
The UniProt databases collect both protein
sequences obtained through experimental
determination and protein sequences derived from
the translation of nucleotide sequences (which
were predicted or determined to code for a
protein).
[Diagram: the protein sequence databases (PIR,
Swiss-Prot, TrEMBL) take in amino acid sequences
determined through experimental analysis and
translations from the nucleotide sequence
databases (GenBank, DDBJ, EMBL); entries are
validated and enriched with specific information.]
58
UniProt Goals
  • High level of annotation
  • Minimal redundancy
  • High level of integration with other databases
  • Complete and up-to-date

59
Annotation concepts
  • UniParc
  • No annotation
  • UniProtKB
  • Annotated
  • UniRef
  • No annotation, just description line of
    UniProtKB or UniParc master entry in the cluster
    for use in FASTA files

60
Minimal redundancy
  • UniParc
  • All sequences that are 100% identical over their
    entire length are merged into a single entry,
    regardless of species. UniParc represents each
    protein sequence once and only once, assigning it
    a unique identifier. UniParc cross-references the
    accession numbers of the source databases.
  • UniProtKB
  • Aims to describe in a single record all protein
    products derived from a certain gene (or genes if
    the translation from different genes in a genome
    leads to indistinguishable proteins) from a
    certain species.
  • UniRef
  • Merges sequences automatically.
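The UniParc rule described above (one entry per distinct sequence, with cross-references back to the sources) can be sketched as a dictionary keyed by sequence. This is a toy illustration: the source names, accessions and the `ARC` identifier format are hypothetical and are not UniParc's own.

```python
# Made-up source entries: (source database, accession, sequence).
source_entries = [
    ("DB_A", "A001", "MEPPEGAGTG"),
    ("DB_B", "B417", "MEPPEGAGTG"),   # same sequence, different source
    ("DB_A", "A002", "MPVAGSELPR"),
]

archive = {}   # sequence -> {"id": ..., "xrefs": [...]}
for source, accession, sequence in source_entries:
    entry = archive.get(sequence)
    if entry is None:
        # First time this exact sequence is seen: assign a new identifier.
        # The "ARC" prefix is a hypothetical format used only in this sketch.
        entry = {"id": f"ARC{len(archive) + 1:06d}", "xrefs": []}
        archive[sequence] = entry
    # Every source record is kept as a cross-reference of the merged entry.
    entry["xrefs"].append((source, accession))

for sequence, entry in archive.items():
    print(entry["id"], len(sequence), entry["xrefs"])
```

Three source records collapse into two archive entries here, and no source accession is lost because each one is kept as a cross-reference.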

61
Integration with other databases
  • UniParc
  • Linked back to source records
  • UniProtKB
  • Linked to >60 other databases
  • UniRef
  • UniRef clusters link back to UniProtKB and
    UniParc records in the cluster

62
Complete and up-to-date
  • UniParc
  • All publicly available protein sequences,
    updated every 2 weeks (05/06, Rel 8.0: 7,116,519
    entries)
  • UniProtKB
  • All suitable stable protein sequences, updated
    every 2 weeks (05/06, Rel 8.0: 3,170,612 entries)
  • UniRef
  • All protein sequences in the UniProtKB and in
    UniParc useful for sequence similarity searches,
    updated every 2 weeks (05/06, Rel 8.0: 3,511,676
    UniRef100, 2,254,474 UniRef90, 1,148,123 UniRef50
    entries)

63
An example
64
An example
65
An example
66
An example
67
An example
68
Exercise 1 Text search
1. Go to EXPASY. Click "UniProt Knowledgebase
(Swiss-Prot and TrEMBL)" and then search for
human cochlin. Notice that there is a wealth of
information about this protein. Furthermore,
there are many links to sequence analysis tools
(some of which you will learn later) and some
other nice features. Note that this is merely a
graphical display of the original
UniProtKB/Swiss-Prot database entry (which is in
text).
2. Try to answer all of the questions below.
   1. In which year was the NMR structure of the
      LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the
      protein?
69
Exercise 2 BLAST search
  • 1. Go to EXPASY. Click "UniProt Knowledgebase
    (Swiss-Prot and TrEMBL)" and then BLAST.
  • 2. Copy the following human amino acid sequence.
  • MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAK
    VLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRH
    DIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFK
    DPVYRARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKT
    HACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLS
    SRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
    QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSS
    FGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVR
    NFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQK
    IK
  • 3. Paste the sequence into the query sequence
    window and adjust the options as necessary. You
    won't need to specify advanced options, but you
    should choose a program and database. For
    simplicity, use e.g. the UniProtKB database.
  • 4. Run the search and identify the protein. Use
    the link provided to see the UniProtKB/SWISS-PROT
    report.

70
Exercise 2 BLAST search
  • 5. Now, try to answer all of the questions below.
  • 1. What is the SWISS-PROT primary accession
    number?
  • 2. What is the common name of the protein?
  • 3. What is the gene called?
  • 4. In which year was the crystal structure of the
    catalytic domain determined? Name the first
    author.
  • 5. Does the enzyme require a co-factor to function?
    If so, what?
  • 6. Name the most common disease that arises as a
    result of deficiency of this enzyme.
  • 7. How many amino acid residues are there in the
    protein?
  • 8. What is the molecular weight of the protein?

71
Patterns and Profiles, Protein Motifs and Domains
  • InterPro - an integrated database of protein
    families, domains, motifs and functional sites.
  • Blocks - multiply aligned ungapped segments for
    the most highly conserved regions of proteins.
  • Motif - a server that scans databases to find
    motifs or patterns and that can generate sequence
    profiles.
  • Pfam - multiple sequence alignments and HMMs of
    protein domains and families.
  • PRINTS - database of groups of conserved motifs,
    or protein fingerprints.
  • ProDom - protein domain families automatically
    generated from SWISS-PROT and TrEMBL.
  • PROSITE - database of protein families and
    domains defined by functional sites, patterns and
    profiles.
  • SMART - Simple Modular Architecture Research Tool
    for the identification of domains.
  • COGS database - clusters of sequences determined
    by comparing sequences from whole genomes.

72
InterPro (Integrated resource of Protein
Families, Domains and Sites)
  • http://www.ebi.ac.uk/interpro/
  • ftp://ftp.ebi.ac.uk/pub/databases/interpro
  • Mulder NJ et al. (2005) InterPro, progress and
    status in 2005. Nucleic Acids Res. 33 (Database
    Issue): D201-5.

73
What is InterPro
  • Secondary protein databases on functional sites
    and domains are vital resources for identifying
    distant relationships in novel sequences, and
    hence for predicting protein function and
    structure.
  • Unfortunately, these signature databases do not
    share the same formats and nomenclature, and each
    database has its own strengths and weaknesses.
  • Thus, for best results, search strategies should
    ideally combine all of them.

74
What is InterPro
  • InterPro is a collaborative project aimed at
    providing an integrated layer on top of the most
    commonly used signature databases by creating a
    unique, non-redundant characterization of a
    given protein family, domain or functional site.
  • Integrates PROSITE, PRINTS, Pfam, ProDom, SMART,
    TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D
    and PANTHER databases and the addition of others
    is scheduled.
  • Has cross-references to the BLOCKS database as
    well as many specialized protein family and
    protein structure databases.

75
InterPro
  • The latest release of InterPro (12.1) contains
    12,953 entries, with 78% coverage of all proteins
    in UniProtKB.
  • Each entry has annotation provided in the name,
    GO mapping and abstract fields, and all matches
    against the Swiss-Prot and TrEMBL components of
    UniProt are precomputed and available for viewing
    in different formats.
  • Protein 3D structural information is integrated
    from MSD, CATH and SCOP, and this data is
    available in the match views to provide an
    at-a-glance comparison of sequence and structural
    domains.

76
InterPro
  • Dataflow scheme

77
InterProScan result
78
PROSITE http://www.expasy.org/prosite/
  • Database of protein families and domains

79
PROSITE
  • consists of a large collection of biologically
    meaningful signatures that are described as
    patterns or profiles that help to reliably
    identify to which known protein family (if any) a
    new sequence belongs
  • the latest version (release 19.11) contains 1329
    patterns and 552 profile entries
  • each signature is linked to documentation
    providing information on the protein family or
    domain detected by the signature: origin of its
    name, taxonomic occurrence, domain architecture,
    function, 3D structure, main characteristics of
    the sequence, domain size and some references
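A PROSITE pattern is close to a regular expression, so scanning a sequence for one can be sketched in Python. The converter below is a simplified assumption-laden sketch: it handles `x`, `[...]`, `{...}`, repeat counts and the `<` and `>` anchors, but not every detail of the PROSITE syntax. The pattern is an illustrative one written in that syntax, and the sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # a residue or a [set] of residues
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


# An illustrative pattern in PROSITE syntax and a made-up sequence containing it.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
sequence = "AAA" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AAA"
match = re.search(regex, sequence)

print(regex)
print(match.start(), match.group())
```

Short patterns like this can match by chance in long sequences, which is one reason tools offer an option to exclude patterns with a high probability of occurrence.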

80
PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
81
PRINTS
  • The PRINTS database is a compendium of protein
    fingerprints.
  • A fingerprint is a group of conserved sequence
    motifs that together provide diagnostic
    signatures for protein families.
  • Fingerprints are diagnostically more powerful
    than single motifs by making use of the
    biological context inherent in a multiple-motif
    method.
  • The fingerprinting method is a reliable technique
    for detecting members of large, highly divergent
    protein super-families.

82
PFAM http://www.sanger.ac.uk/Software/Pfam/
83
PFAM
  • Database of multiple sequence alignments and HMMs
    of protein domains and families.
  • Profile hidden Markov models are statistical
    models of the primary structure consensus of a
    sequence family.
  • The construction and use of Pfam is tightly tied
    to the HMMER software package.

84
PFAM
  • Composed of two sets of families:
  • Pfam-A: curated part containing over 8296 protein
    families
  • Pfam-B: automatically generated supplement
    containing a large number of small families taken
    from the PRODOM database that do not overlap with
    Pfam-A (lower quality)

85
PFAM
  • Each family has the following data:
  • A seed alignment, which is a hand-edited multiple
    alignment representing the family.
  • Hidden Markov Models (HMMs) derived from the seed
    alignment, which can be used to find new members
    of the domain and to realign a set of sequences
    to the model. One HMM is an ls mode (global)
    model; the other is an fs mode (local) model.
  • A full alignment, which is an automatic alignment
    of all the examples of the domain, using the two
    HMMs to find and then align the sequences.
  • Annotation, which contains a brief description of
    the domain, links to other databases and some
    Pfam-specific data recording how the family was
    constructed.

86
A PFAM entry
87
A PFAM entry, contd
88
PFAM searches
89
PFAM results
90
PRODOM http://www.toulouse.inra.fr/prodom.html
91
PRODOM
  • Database of protein domain families automatically
    generated from SWISS-PROT and TrEMBL databases by
    sequence comparison.
  • Useful for analysing the domain arrangements of
    complex protein families and the homology
    relationships in modular proteins.
  • Contains (release 2003.1) 144,444 domain families
    containing two or more individual domains.

92
SMART http://smart.embl-heidelberg.de/
  • Simple Modular Architecture Research Tool

93
SMART
  • Allows the identification and annotation of
    protein domains and the analysis of domain
    architectures.
  • The current release has more than 600 domain
    families represented among nuclear, signalling
    and extracellular proteins.
  • Extensive annotation for each domain family is
    available, providing information on function,
    subcellular localization, phyletic distribution
    and tertiary structure, with links to OMIM in
    cases where a human disease is associated with
    one or more mutations in a particular domain.

94
Exercise 3 Domain search
  • 1. Go to the PROSITE site.
  • 2. Under "Tools for PROSITE" choose ScanProsite.
  • 3. Paste the sequence below into the box and tick
    the option "Exclude patterns with a high
    probability of occurrence" (finding very common
    patterns will not tell you much about your
    protein).
  • MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDM
    NPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKK
    GPVGMPKEATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDG
    LTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASP
    ETPELDMAAPALLNNPVHQSVTMGETVSFCDVVGRPRPEITWEKQLEDRE
    NVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFP
    LSVVRGHQAAATSESSPNGTAFPAAELKPPDSEDCGEEQTRWHFDAQANN
    CLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRW
    AYNSQTGQCQSFVYGGCEGNGNNFESREACEESPFPRGNQRCRACKPRQK
    LVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEP
    LEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVGASSARR
    VRKLREVMHKKTCDVLKEFLGLH
  • 4. Start the scan.
  • Which are the motifs that are found?

95
Exercise 4 Domain search
  • 1. Go to the Pfam site.
  • 2. Click Search by protein name or sequence.
  • 3. Paste the sequence below into the box and
    choose Both Global and Fragment Pfam search.
  • MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDM
    NPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKK
    GPVGMPKEATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDG
    LTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASP
    ETPELDMAAPALLNNPVHQSVTMGETVSFCDVVGRPRPEITWEKQLEDRE
    NVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFP
    LSVVRGHQAAATSESSPNGTAFPAAELKPPDSEDCGEEQTRWHFDAQANN
    CLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRW
    AYNSQTGQCQSFVYGGCEGNGNNFESREACEESPFPRGNQRCRACKPRQK
    LVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEP
    LEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVGASSARR
    VRKLREVMHKKTCDVLKEFLGLH
  • 4. Search Pfam.
  • 1. Which domains are found?
  • 2. What may be the function of this protein?

96
Exercise 5 Blast searches on your computer
  • 1. Download the file blast-2.2.14-ia32-linux.tar.gz
    from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST
  • 2. Make a subdirectory in your home directory
  • mkdir ~/blast
  • 3. Move the blast file there
  • mv blast-2.2.14-ia32-linux.tar.gz ~/blast/
  • 4. Go to the blast directory and unzip the file
  • cd ~/blast/
  • gunzip blast-2.2.14-ia32-linux.tar.gz
  • 5. Unpack it
  • tar xvf blast-2.2.14-ia32-linux.tar

97
Exercise 5 Blast searches, contd
  • 6. Get the first 100 human proteins in Swissprot
  • - go to http://www.expasy.org/srs5/
  • - click on Start
  • - unmark TREMBL, to search only in Swissprot
  • - press Continue
98
Exercise 5 Blast searches, contd
  • In the first Info line, select Organism and type
    in human
  • Press Do Query; this will retrieve all human
    proteins in Swissprot in batches of 100
99
Exercise 5 Blast searches, contd
Press save
100
Exercise 5 Blast searches, contd
1. Change view to FastaSeqs
2. Change Sequence Format to fasta
3. Press SAVE
101
Exercise 5 Blast searches, contd
  • 7. Save the file, e.g. as 100seq.fa
  • 8. Format your database of 100 sequences to make
    it searchable by blast
  • ~/blast/blast-2.2.14/bin/formatdb -i 100seq.fa
  • 9. Now that you have a searchable database, you
    can search with an input sequence of your choice.
    E.g. make a file from the first sequence in
    100seq.fa: grab the first sequence with the mouse
    and type
  • cat > seq1.fa
  • and paste it into the file, then press <Ctrl-d>
  • 10. Now that you have an input sequence and a
    database, type
  • ~/blast/blast-2.2.14/bin/blastall -p blastp -i seq1.fa -d 100seq.fa -o seq1-vs-100seq.blastp
  • 11. After it has finished running (it will be
    ready immediately) you will get your output in
    the seq1-vs-100seq.blastp file. If you invoke the
    blastall program without the switches it will
    list all the options you can use.
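As an alternative to copying the first sequence with the mouse, the query file can be produced by a short script. The Python sketch below copies the first FASTA record of one file into another; the function name `write_first_record` is made up, and the demonstration runs on a small invented file in a temporary directory rather than on the real 100seq.fa.

```python
import os
import tempfile


def write_first_record(fasta_path, out_path):
    """Copy the first FASTA record of fasta_path into out_path; return its header."""
    header = None
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if header is not None:
                    break              # second record reached: stop copying
                header = line[1:].strip()
            if header is not None:
                dst.write(line)
    return header


# Demonstration on a made-up two-record file in a temporary directory.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "100seq.fa")
query_path = os.path.join(workdir, "seq1.fa")
with open(db_path, "w") as handle:
    handle.write(">toy1 first entry\nMEPPEG\nAGTG\n>toy2 second entry\nMPVAGS\n")

first_header = write_first_record(db_path, query_path)
with open(query_path) as handle:
    query_text = handle.read()

print(first_header)
print(query_text)
```

Run in the directory that holds the real files, `write_first_record("100seq.fa", "seq1.fa")` would create the input file for the blastall step.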