Computer Storage of Sequences - PowerPoint PPT Presentation

Provided by: publicco

Learn more at: http://www.cedar.buffalo.edu

Transcript and Presenter's Notes

1
Computer Storage of Sequences
(Chapter 2 of Bioinformatics: Sequence and
Genome Analysis by David W. Mount)

CSE730 Seminar on
Information Retrieval of Biomedical Text and
Data

2
Outline

Storing DNA and protein sequences in computer
files or databases.
Related information stored in the database along
with the sequence, in a number of sequence data
formats.
Online public-access databases for sequence
retrieval.

3
Nucleotide Sequence
Nomenclature Committee of the International Union of Biochemistry
Code  Nucleic acid(s)
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
M     A or C (amino)
R     A or G (purine)
W     A or T (weak)
S     C or G (strong)
Y     C or T (pyrimidine)
K     G or T (keto)
V     A or C or G
H     A or C or T
D     A or G or T
B     C or G or T
N     A or G or C or T (any)
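
The ambiguity codes lend themselves to a simple lookup table. The following is a minimal Python sketch, not part of the original slides, that maps each code to the bases it can stand for and checks whether a concrete sequence is compatible with an ambiguity-coded pattern. The names IUPAC_NUCLEOTIDES and matches are illustrative assumptions.

```python
# Nucleotide codes from the table, mapped to the bases each one can represent.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def matches(pattern, sequence):
    """Return True if every base in `sequence` is allowed by the code at the same position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_NUCLEOTIDES[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("GAYTTN", "GACTTG"))  # True: Y allows C, N allows G
print(matches("GAYTTN", "GAATTG"))  # False: Y does not allow A
```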
4
Protein Sequence
Code  Amino acid
A     Alanine
B     Aspartic acid or asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
V     Valine
W     Tryptophan
X     Unknown
Y     Tyrosine
Z     Glutamic acid or glutamine
Adapted from IUPAC-IUB (1969, 1972, 1983)
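
One practical use of the one-letter codes is checking that a stored protein sequence contains only recognized characters. The sketch below is an illustration rather than anything from the slides; AMINO_ACIDS and invalid_residues are assumed names, and B and Z are treated as the two-residue ambiguity codes.

```python
# One-letter amino acid codes; B and Z are ambiguity codes, X is unknown.
AMINO_ACIDS = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic acid", "E": "Glutamic acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
    "B": "Aspartic acid or asparagine",
    "Z": "Glutamic acid or glutamine",
    "X": "Unknown",
}


def invalid_residues(sequence):
    """Return (1-based position, character) for each character that is not a known code."""
    return [
        (position, char)
        for position, char in enumerate(sequence.upper(), start=1)
        if char not in AMINO_ACIDS
    ]


print(invalid_residues("MKTAYIAKQR"))  # []
print(invalid_residues("MKTJ*"))       # [(4, 'J'), (5, '*')]
```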
5
Sequence Formats

A sequence is stored as ASCII text (i.e. a string
of A, G, C, T) along with annotations.
Different sequence analysis programs recognize
different sequence formats.
A sequence format includes accessory information
(gene names, source organism, investigator name,
references) as well as the actual sequence.

6
Sequence Formats (continued)

FASTA
GenBank Flat File format
PIR/CODATA format
EMBL sequence entry format
Intelligenetics sequence entry format
GCG (Genetics Computer Group) sequence entry
format.
ASN.1
XML
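
Of the formats listed, FASTA is the simplest: each record is a header line beginning with ">" followed by one or more lines of sequence. As a rough illustration (not taken from the slides), a reader for it might look like the following. The function name parse_fasta and the two example records are made up for this sketch.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 hypothetical example record
ATGGCGTAA
CCGT
>seq2 another made-up record
GGGTTTAAA
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```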

7
Databases

NCBI
GenBank at the National Center for Biotechnology
Information (NCBI), National Library of Medicine,
Bethesda, MD
NBRF
Protein Information Resource (PIR) database at
the National Biomedical Research Foundation in
Washington, DC

8
Databases (continued)

Swiss-Prot
The Swiss-Prot protein sequence database at
ISREC, Swiss Institute for Experimental Cancer
Research.
EMBL
European Molecular Biology Laboratory (EMBL)
Outstation at Hinxton, England
DDBJ
DNA DataBank of Japan (DDBJ) at Mishima, Japan

9
Databases on Internet

NCBI http://www.ncbi.nlm.nih.gov
PIR http://www-nbrf.georgetown.edu/pirwww
Swiss-Prot http://www.expasy.ch/cgi-bin/sprot-search-de
EMBL http://www.ebi.ac.uk/embl/index.html
DDBJ http://www.ddbj.nig.ac.jp/

10
NCBI

National resource for molecular biology
information.
Maintains comprehensive databases for a variety
of biotechnology-related information.
Develops and manages access to a range of
databases and software for the scientific and
medical communities.

11
NCBI Integrated Databases

Literature Databases
PubMed
PubMed Central
OMIM
PROW
BookShelf

12
NCBI Integrated Databases (continued)

Nucleotide Databases
GenBank
EST Database
GSS Database
SNPs Database
RefSeq
STS Database

13
NCBI Integrated Databases (continued)

Entrez Databases
PubMed
Protein Sequence Database
Nucleotide Sequence Database
Taxonomy
OMIM

14
GenBank

GenBank is the NIH genetic sequence database.
Annotated collection of all publicly available
DNA sequences.
GenBank is a part of an international
collaboration of sequence databases along with
EMBL and DDBJ.

15
GenBank DNA Sequence Format

A DNA sequence entry in GenBank is divided into
distinct fields, as follows:
Locus: locus name, sequence length, division,
date
Definition: description of entry
Accession: unique accession number
Version: version of sequence
Keywords: keywords for cross-referencing

16
GenBank DNA Sequence Format(continued)

Source: source organism of DNA
Organism: description of organism
References: authors, title, journal, Medline, etc.
Features: information about the sequence
Base count: number of each base in the sequence
Origin: sequence data begin on the line following
Origin.
GenBank sample
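
To make the field layout concrete, here is a minimal Python sketch (not from the slides) that splits a GenBank-style flat-file record into its fields. It assumes the usual flat-file layout in which the keyword occupies the first 12 columns, continuation lines are indented, sequence lines follow ORIGIN, and "//" ends the record. The record itself, including the accession XX000000, is a made-up placeholder, and the feature table is deliberately left out because it needs its own parsing.

```python
def parse_genbank(text):
    """Split one GenBank-style record into a dict of fields and the sequence."""
    fields, sequence, key, in_origin = {}, [], None, False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        keyword = line[:12].strip()          # keyword area: columns 1-12
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif keyword == "ORIGIN":
            in_origin = True
        elif keyword:
            key = keyword
            fields[key] = line[12:].strip()
        elif key and line.strip():           # continuation of the previous field
            fields[key] += " " + line.strip()
    return fields, "".join(sequence).upper()


# Hypothetical record for illustration only.
record = """\
LOCUS       EXAMPLE01  13 bp  DNA
DEFINITION  Hypothetical record for illustration,
            continued on a second line.
ACCESSION   XX000000
VERSION     XX000000.1
SOURCE      unknown organism
  ORGANISM  unknown organism
BASE COUNT  3 a 3 c 4 g 3 t
ORIGIN
        1 atggcgtaac cgt
//
"""

fields, dna = parse_genbank(record)
print(fields["ACCESSION"], len(dna), dna)
```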

17
NCBI Tools

Tools for Data Retrieval and submission
Text Term Searching
Sequence Similarity Searching
Taxonomy Searching
Sequence Submission

18
NCBI ENTREZ

Entrez is a search and retrieval system that
integrates information from databases at NCBI.
These databases include nucleotide sequences,
protein sequences, macromolecular structures,
whole genomes, MEDLINE (through PubMed), and
others.
Entrez

19
NCBI BLAST

BLAST: Basic Local Alignment Search Tool
A set of similarity search programs designed to
explore the available sequence databases.
It uses a heuristic algorithm that can detect
relationships among sequences that share only
isolated regions of similarity.
Q-BLAST: a queuing system for BLAST that lets
users retrieve results at their convenience and
choose how the results are formatted.
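
The slide does not describe the heuristic itself. As a loose illustration of how isolated regions of similarity can be found without aligning whole sequences, the sketch below looks for short words shared by two sequences. This is a toy stand-in, not BLAST: the real programs also score, extend, and statistically evaluate such hits. The function name shared_words, the word length k=4, and both sequences are assumptions made for the example.

```python
def shared_words(query, subject, k=4):
    """Return (query_pos, subject_pos) for every length-k word the two sequences share."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)   # word -> positions in subject
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


# Two made-up sequences that share only the island GATTACA.
query = "TTTTGATTACAGGGG"
subject = "CCCCCGATTACACCC"

hits = shared_words(query, subject)
print(hits)  # [(4, 5), (5, 6), (6, 7), (7, 8)]
```

All four hits lie on the same diagonal (subject position minus query position is 1), which is what marks them as one local region of similarity.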

20
NCBI BLAST (continued)

Access to BLAST service
Web-BLAST
Standalone BLAST
Network BLAST
BLAST URL API

21
NCBI BLAST (continued)

BLAST Programs
blastp: compares an amino acid query sequence
against a protein sequence database
blastn: compares a nucleotide query sequence
against a nucleotide sequence database
blastx: compares a nucleotide query sequence
against a protein sequence database
tblastn: compares a protein query sequence
against a nucleotide sequence database
BLAST
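
The choice among these four programs depends only on the type of the query and the type of the database, so it can be written as a lookup. The sketch below is illustrative; BLAST_PROGRAMS and choose_program are assumed names, and only the four programs listed on this slide are covered.

```python
# (query type, database type) -> program name, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, database_type):
    """Return the BLAST program for a query/database pairing, or raise ValueError."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported pairing: {query_type} vs {database_type}")


print(choose_program("nucleotide", "protein"))  # blastx
```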

22
NBRF PIR

Protein Information Resource
3 Major Databases
PSD (Protein Sequence Database)
iProClass
PIR-NREF
(Nonredundant REFerence protein database)

23
PIR PSD

The PIR, in collaboration with MIPS and JIPID,
produces and distributes the PIR-International
Protein Sequence Database (PSD).
Comprehensive and expertly annotated protein
sequence database.
The primary sources of PSD data are sequences
from GenBank/EMBL/DDBJ translations, published
literature, and direct submission to
PIR-International.

24
PIR PSD (continued)

The PIR-PSD data are available in XML format and
in NBRF and PIR/CODATA formats. The sequence file
is available in FASTA format.
Also available from the PIR FTP server at
ftp://ftp.pir.georgetown.edu/pir_databases/psd/

25
CODATA format

CODATA format holds approximately the same
information as a GenBank or EMBL sequence file,
but is formatted slightly differently and uses
different field names.
Also called PIR format; used by NBRF.
CODATA Sample

26
PIR iProClass

The iProClass database provides comprehensive
descriptions of all proteins and serves as a
framework for data integration in a distributed
networking environment.
Very user-friendly description.

27
PIR NREF (Non-redundant REFerence protein
database)

Comprehensive: contains all sequences from
PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, and GenPept;
updated biweekly.
Non-redundant: clustered by sequence identity and
taxonomy at the species level.
Source attribution: contains protein IDs and
names from associated databases (with hypertext
links), in addition to protein sequence,
taxonomy, and bibliography.
The current version (July 2002) consists of more
than 809,000 non-redundant PIR-PSD, Swiss-Prot,
and TrEMBL proteins organized into more than
36,200 PIR superfamilies and 145,340 families,
with links to over 50 molecular biology databases.
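
The non-redundancy and source-attribution ideas can be sketched in a few lines. The example below is an assumption-laden illustration, not PIR's actual procedure: it takes "sequence identity" to mean an exact match, groups entries that share both species and sequence, and keeps the contributing database IDs for each group. The function name merge_redundant and all entries are hypothetical.

```python
from collections import defaultdict


def merge_redundant(entries):
    """Group (source_db, protein_id, species, sequence) tuples by species and sequence."""
    clusters = defaultdict(list)
    for source_db, protein_id, species, sequence in entries:
        # Entries are redundant only if both species and sequence agree.
        clusters[(species, sequence.upper())].append(f"{source_db}:{protein_id}")
    return dict(clusters)


# Made-up entries for illustration.
entries = [
    ("PIR-PSD", "ID_A", "Homo sapiens", "MKTAYIAK"),
    ("Swiss-Prot", "ID_B", "Homo sapiens", "mktayiak"),
    ("TrEMBL", "ID_C", "Mus musculus", "MKTAYIAK"),
]

clusters = merge_redundant(entries)
for (species, sequence), sources in clusters.items():
    print(species, sequence, sources)
```

Here the two human entries collapse into one record that lists both sources, while the mouse entry stays separate because the species differs.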

28
Swiss-Prot

Swiss-Prot is a protein knowledgebase established
in 1986.
It is maintained collaboratively by the Department
of Medical Biochemistry of the University of
Geneva (now the Swiss Institute of Bioinformatics)
and the EMBL Data Library.
Swiss-Prot Sequence Entry Example

29
Sequence Format Conversion