Title: Computer Storage of Sequences
1. Computer Storage of Sequences
(Chapter 2 of Bioinformatics: Sequence and Genome Analysis, by David W. Mount)
- CSE730 Seminar on Information Retrieval of Biomedical Text and Data
2. Outline
- Storing DNA/protein sequences in computer files or databases.
- Related information placed in the database along with the sequence, in a number of sequence data formats.
- Online public-access databases for sequence retrieval.
3. Nucleotide Sequence
Codes from the Nomenclature Committee of the International Union of Biochemistry

Code  Nucleic acid(s)    Code  Nucleic acid(s)
A     Adenine            M     A or C (amino)
C     Cytosine           R     A or G (purine)
G     Guanine            W     A or T (weak)
T     Thymine            S     C or G (strong)
U     Uracil             Y     C or T (pyrimidine)
                         K     G or T (keto)
                         V     A or C or G
                         H     A or C or T
                         D     A or G or T
                         B     C or G or T
                         N     A or G or C or T (any)
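The ambiguity codes in the table map naturally onto a lookup table. The following is a minimal sketch, assuming the goal is to search a DNA sequence for a motif written with IUPAC codes; the names `IUPAC_NUCLEOTIDES`, `iupac_to_regex`, and `find_motif` are illustrative and not taken from the book, and U is treated as T for simplicity.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
# Assumption: U (uracil) is treated as T so one table serves DNA searches.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate a motif written in IUPAC codes into a regular expression."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_NUCLEOTIDES[code]  # raises KeyError for an unknown code
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return start positions (0-based) of non-overlapping motif matches."""
    pattern = re.compile(iupac_to_regex(motif))
    return [match.start() for match in pattern.finditer(sequence.upper())]
```

For example, `iupac_to_regex("GAYN")` yields `GA[CT][ACGT]`, since Y stands for C or T and N for any base.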
4. Protein Sequence
Code  Amino acid                   Code  Amino acid
A     Alanine                      N     Asparagine
B     Aspartic acid or asparagine  P     Proline
C     Cysteine                     Q     Glutamine
D     Aspartic acid                R     Arginine
E     Glutamic acid                S     Serine
F     Phenylalanine                T     Threonine
G     Glycine                      V     Valine
H     Histidine                    W     Tryptophan
I     Isoleucine                   X     Unknown
K     Lysine                       Y     Tyrosine
L     Leucine                      Z     Glutamic acid or glutamine
M     Methionine
Adapted from IUPAC-IUB (1969, 1972, 1983)
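A one-letter code table like this is often used to check a protein sequence before storing it. Below is a small sketch under that assumption; `AMINO_ACIDS` and `invalid_residues` are hypothetical names, and B and Z are entered as the standard ambiguity codes (aspartic acid or asparagine, glutamic acid or glutamine).

```python
# One-letter amino acid codes from the table above.
# B and Z are ambiguity codes; X marks an unknown residue.
AMINO_ACIDS = {
    "A": "Alanine", "B": "Aspartic acid or asparagine", "C": "Cysteine",
    "D": "Aspartic acid", "E": "Glutamic acid", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "V": "Valine", "W": "Tryptophan", "X": "Unknown", "Y": "Tyrosine",
    "Z": "Glutamic acid or glutamine",
}


def invalid_residues(sequence):
    """Return the sorted set of characters that are not in the code table."""
    return sorted(set(sequence.upper()) - set(AMINO_ACIDS))
```

An empty result means every character in the sequence is a recognized code.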
5. Sequence Formats
- A sequence is stored as ASCII text (i.e., a string of A, G, C, T) along with annotations.
- Different sequence analysis programs recognize different sequence formats.
- A sequence format includes accessory information (gene names, source organism, investigator name, references) as well as the actual sequence.
6. Sequence Formats (continued)
- FASTA
- GenBank flat file format
- PIR/CODATA format
- EMBL sequence entry format
- Intelligenetics sequence entry format
- GCG (Genetics Computer Group) sequence entry format
- ASN.1
- XML
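FASTA is the simplest of the formats listed: each record is a header line beginning with `>` followed by one or more lines of sequence. The sketch below shows one way to read such a file into a dictionary; `parse_fasta` is an illustrative name, and the parser assumes well-formed input rather than handling every variant seen in practice.

```python
def parse_fasta(lines):
    """Parse FASTA text lines into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Calling `parse_fasta(open(path))` on a file, or `parse_fasta(text.splitlines())` on a string, gives one entry per record with the sequence lines joined together.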
7. Databases
- NCBI
  - GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, Maryland
- NBRF
  - Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC
8. Databases (continued)
- SwissProt
  - The SwissProt protein sequence database at ISREC, the Swiss Institute for Experimental Cancer Research
- EMBL
  - European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England
- DDBJ
  - DNA Data Bank of Japan (DDBJ) at Mishima, Japan
9. Databases on the Internet
- NCBI: http://www.ncbi.nlm.nih.gov
- PIR: http://www-nbrf.georgetown.edu/pirwww
- SwissProt: http://www.expasy.ch/cgi-bin/sprot-search-de
- EMBL: http://www.ebi.ac.uk/embl/index.html
- DDBJ: http://www.ddbj.nig.ac.jp/
10. NCBI
- National resource for molecular biology information.
- Maintains comprehensive databases for a variety of biotechnology-related information.
- Develops and manages access to a range of databases and software for the scientific and medical communities.
11. NCBI Integrated Databases
- Literature Databases
  - PubMed
  - PubMed Central
  - OMIM
  - PROW
  - Bookshelf
12. NCBI Integrated Databases (continued)
- Nucleotide Databases
  - GenBank
  - EST Database
  - GSS Database
  - SNPs Database
  - RefSeq
  - STS Database
13. NCBI Integrated Databases (continued)
- Entrez Databases
  - PubMed
  - Protein Sequence Database
  - Nucleotide Sequence Database
  - Taxonomy
  - OMIM
14. GenBank
- GenBank is the NIH genetic sequence database.
- An annotated collection of all publicly available DNA sequences.
- GenBank is part of an international collaboration of sequence databases, along with EMBL and DDBJ.
15. GenBank DNA Sequence Format
- A DNA sequence entry in GenBank is organized into distinct fields, as follows:
  - Locus: locus name, sequence length, division, date
  - Definition: description of entry
  - Accession: unique accession number
  - Version: version of sequence
  - Keywords: keywords for cross-referencing
16. GenBank DNA Sequence Format (continued)
- Source: source organism of DNA
- Organism: description of organism
- References: authors, title, journal, Medline, etc.
- Features: information about the sequence
- Base count: number of each base in the sequence
- Origin: the sequence data begin on the line following ORIGIN.
- GenBank sample
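The field layout described above can be read with a short parser. This is a rough sketch, not a complete GenBank reader: it assumes top-level keywords start in the first column, folds indented continuation and sub-keyword lines into the preceding field, and stops at the `//` terminator. The record in `SAMPLE_RECORD` is made up for illustration (the accession `XX000000` is not a real entry), and `parse_genbank` is a hypothetical helper name.

```python
import re

# A made-up record for illustration only; not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence, for illustration only.
ACCESSION   XX000000
VERSION     XX000000.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggctagct agctaggcta gcta
//
"""


def parse_genbank(text):
    """Collect top-level fields and the sequence from one flat-file record."""
    fields = {}
    sequence = []
    key = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters.
            sequence.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Top-level keyword in the first column, value after it.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None and line.strip():
            # Indented line: continuation or sub-keyword of the current field.
            fields[key] += " " + line.strip()
    fields["SEQUENCE"] = "".join(sequence).upper()
    return fields
```

Running `parse_genbank(SAMPLE_RECORD)` returns a dictionary with keys such as `LOCUS`, `DEFINITION`, and `ACCESSION`, plus the sequence with its numbering and spacing removed.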
17. NCBI Tools
- Tools for data retrieval and submission
  - Text term searching
  - Sequence similarity searching
  - Taxonomy searching
  - Sequence submission
18. NCBI Entrez
- Entrez is a search and retrieval system that integrates information from databases at NCBI.
- These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, MEDLINE, PubMed, etc.
- Entrez
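Besides the web interface, Entrez records can be requested programmatically. The slides do not cover this, so the following is only a sketch: it assumes NCBI's E-utilities `efetch` endpoint with its `db`, `id`, `rettype`, and `retmode` parameters, and it only builds the request URL without contacting the server. `build_efetch_url` is an illustrative helper and `XX000000` is a placeholder identifier.

```python
from urllib.parse import urlencode

# Assumed endpoint: the efetch service of NCBI's E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(database, record_id, rettype="fasta"):
    """Build a URL asking Entrez for one record as plain text."""
    params = {
        "db": database,      # e.g. "nucleotide" or "protein"
        "id": record_id,     # accession number or other identifier
        "rettype": rettype,  # e.g. "fasta" or "gb" (GenBank flat file)
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)
```

The resulting URL could then be passed to `urllib.request.urlopen` or any HTTP client to download the record.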
19. NCBI BLAST
- BLAST: Basic Local Alignment Search Tool
- It is a set of similarity search programs designed to explore available sequence databases.
- It uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity.
- Q-BLAST: a queuing system for BLAST that lets users retrieve results at their convenience and format them.
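To give a feel for how a heuristic can find isolated regions of similarity without aligning whole sequences, here is a heavily simplified sketch of word-based seeding: index every short word in one sequence, then look up the words of the other. This is not the BLAST implementation, which also scores near-matching words, extends hits into alignments, and assesses their statistical significance; `find_seeds` and its `word_size` parameter are illustrative names.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    # Index every word of length word_size in the subject sequence.
    index = defaultdict(list)
    for pos in range(len(subject) - word_size + 1):
        index[subject[pos:pos + word_size]].append(pos)

    # Look up each query word; every hit is a candidate region of similarity.
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds
```

Seeds that fall on the same diagonal (equal difference between the two positions) belong to the same ungapped region, which is where an extension step would start.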
20. NCBI BLAST (continued)
- Access to the BLAST service
  - Web BLAST
  - Standalone BLAST
  - Network BLAST
  - BLAST URL API
21. NCBI BLAST (continued)
- BLAST programs
  - blastp: compares an amino acid query sequence against a protein sequence database
  - blastn: compares a nucleotide query sequence against a nucleotide sequence database
  - blastx: compares a nucleotide query sequence, translated in all reading frames, against a protein sequence database
  - tblastn: compares a protein query sequence against a nucleotide sequence database translated in all reading frames
- BLAST
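The four programs differ only in the type of the query and the type of the database, so the choice can be written as a lookup. A minimal sketch, with `BLAST_PROGRAMS` and `choose_blast_program` as illustrative names:

```python
# (query type, database type) -> BLAST program, as listed above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type):
    """Pick the program for a query/database pairing; KeyError if unknown."""
    return BLAST_PROGRAMS[(query_type, database_type)]
```

For instance, a nucleotide query searched against a protein database selects `blastx`.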
22. NBRF PIR
- Protein Information Resource
- Three major databases
  - PSD (Protein Sequence Database)
  - iProClass
  - PIR-NREF (Non-redundant REFerence protein database)
23. PIR PSD
- The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD).
- A comprehensive and expertly annotated protein sequence database.
- The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, the published literature, and direct submissions to PIR-International.
24. PIR PSD (continued)
- The PIR-PSD data are available in XML, NBRF, and PIR/CODATA formats. The sequence file is available in FASTA format.
- Also available from the PIR FTP server:
  - ftp://ftp.pir.georgetown.edu/pir_databases/psd/
25. CODATA Format
- CODATA format carries approximately the same information as a GenBank or EMBL sequence file, but is formatted slightly differently and uses different field names.
- Also called PIR format; used by NBRF.
- CODATA sample
26. PIR iProClass
- The iProClass database provides comprehensive descriptions of all proteins and serves as a framework for data integration in a distributed networking environment.
- Descriptions are presented in a very user-friendly form.
27. PIR-NREF (Non-redundant REFerence protein database)
- Comprehensive: contains all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, and GenPept, and is updated biweekly.
- Non-redundant: clustered by sequence identity and taxonomy at the species level.
- Source attribution: contains protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography.
- The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt, and TrEMBL proteins organized into more than 36,200 PIR superfamilies and 145,340 families, with links to over 50 molecular biology databases.
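The combination of non-redundancy and source attribution can be illustrated with a small grouping step. This sketch assumes that "sequence identity" means identical sequences from the same species, and it keeps the originating database and ID of every merged entry; the function name, the tuple layout, and the example IDs are all hypothetical, not PIR's actual procedure.

```python
from collections import defaultdict


def build_nonredundant(entries):
    """Group (database, protein_id, species, sequence) tuples.

    Entries with the same sequence and species collapse into one cluster
    that lists every source ID, preserving attribution.
    """
    clusters = defaultdict(list)
    for database, protein_id, species, sequence in entries:
        clusters[(sequence.upper(), species)].append(database + ":" + protein_id)
    return dict(clusters)
```

Each key of the result is one non-redundant protein, and its value lists where that protein was found.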
28. Swiss-Prot
- Swiss-Prot is a protein knowledgebase established in 1986.
- Maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (now the Swiss Institute of Bioinformatics) and the EMBL Data Library.
- Swiss-Prot sequence entry example
29. Sequence Format Conversion
- READSEQ
  - Sequence format conversion program.
  - http://bimas.dcrt.nih.gov/molbio/readseq/
  - Can convert to/from:
    - ASN.1
    - FASTA
    - CODATA
    - GCG
    - EMBL format
    - GenBank format and many other formats
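To show what a format conversion involves, the sketch below rewrites FASTA records in the NBRF/PIR sequence-file layout. It is not how READSEQ works internally; it assumes the usual NBRF/PIR conventions (a `>P1;identifier` line, a description line, then the sequence terminated by `*`) and that `>` appears only at the start of header lines. `fasta_to_pir` and its parameters are illustrative names.

```python
import textwrap


def fasta_to_pir(fasta_text, seq_type="P1", width=60):
    """Convert FASTA text to NBRF/PIR-style text (assumed layout)."""
    output = []
    # Assumption: '>' occurs only as the first character of a header line.
    for block in fasta_text.split(">")[1:]:
        lines = block.strip().splitlines()
        header = lines[0]
        sequence = "".join(line.strip() for line in lines[1:])
        identifier, _, description = header.partition(" ")
        output.append(">" + seq_type + ";" + identifier)  # type code and ID
        output.append(description)                         # free-text line
        # The sequence ends with '*' and is wrapped to a fixed line width.
        output.extend(textwrap.wrap(sequence + "*", width))
    return "\n".join(output) + "\n"
```

The `seq_type` code `P1` is assumed here to denote a complete protein; other codes would be needed for nucleic acid sequences.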
30. References
- http://www.ncbi.nlm.nih.gov
- http://www-nbrf.georgetown.edu/pirwww
- http://www.expasy.ch/cgi-bin/sprot-search-de
- http://www.ebi.ac.uk/embl/index.html
- http://www.ddbj.nig.ac.jp/
31. Thank You
Presented by Hemal Patel, Jeetal Shah