1
Computer Storage of Sequences
(Chapter 2 of Bioinformatics: Sequence and
Genome Analysis by David W. Mount)
  • CSE730 Seminar on Information Retrieval of
    Biomedical Text and Data

2
Outline
  • Storing DNA/Protein sequences into computer files
    or databases.
  • Related information placed in the database along
    with the sequence in a number of sequence data
    formats.
  • Online public access Databases for sequence
    retrieval.

3
Nucleotide Sequence
Nomenclature Committee of the International Union of Biochemistry
Code  Nucleic acid(s)    Code  Nucleic acid(s)
A     Adenine            M     A or C (amino)
C     Cytosine           R     A or G (purine)
G     Guanine            W     A or T (weak)
T     Thymine            S     C or G (strong)
U     Uracil             Y     C or T (pyrimidine)
                         K     G or T (keto)
                         V     A or C or G
                         H     A or C or T
                         D     A or G or T
                         B     C or G or T
                         N     A or G or C or T (any)
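
The code table above maps naturally onto a lookup structure. The following is a minimal sketch, assuming a plain Python dictionary; the names `IUB_CODES`, `is_valid_nucleotide_sequence`, and `matches` are illustrative and do not come from the presentation.

```python
# Each code from the table, mapped to the concrete bases it can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def is_valid_nucleotide_sequence(seq):
    """Return True if every character of seq is a code from the table."""
    return all(ch in IUB_CODES for ch in seq.upper())


def matches(code, base):
    """Return True if the concrete base is one that the code allows."""
    return base.upper() in IUB_CODES[code.upper()]
```

For example, `matches("R", "G")` is true because R stands for a purine (A or G).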
4
Protein Sequence
Code  Amino acid                    Code  Amino acid
A     Alanine                       N     Asparagine
B     Aspartic acid or Asparagine   P     Proline
C     Cysteine                      Q     Glutamine
D     Aspartic acid                 R     Arginine
E     Glutamic acid                 S     Serine
F     Phenylalanine                 T     Threonine
G     Glycine                       V     Valine
H     Histidine                     W     Tryptophan
I     Isoleucine                    X     Unknown
K     Lysine                        Y     Tyrosine
L     Leucine                       Z     Glutamic acid or Glutamine
M     Methionine
Adapted from IUPAC-IUB (1969, 1972, 1983)
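
As a small illustration of how the one-letter codes can be used in a program, the sketch below spells out a protein sequence and counts its ambiguous positions. It treats B and Z as ambiguity codes (aspartic acid or asparagine, glutamic acid or glutamine) and X as unknown, following the IUPAC-IUB convention. All names here are illustrative, not part of the presentation.

```python
# The 20 standard amino acids and their one-letter codes.
AMINO_ACIDS = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic acid",
    "E": "Glutamic acid", "F": "Phenylalanine", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "K": "Lysine", "L": "Leucine",
    "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
}

# Codes that do not identify a single residue.
AMBIGUOUS = {
    "B": "Aspartic acid or Asparagine",
    "Z": "Glutamic acid or Glutamine",
    "X": "Unknown",
}


def spell_out(seq):
    """Return the residue names for a one-letter protein sequence."""
    names = {**AMINO_ACIDS, **AMBIGUOUS}
    return [names[ch] for ch in seq.upper()]


def count_ambiguous(seq):
    """Count positions whose code is B, Z, or X."""
    return sum(ch in AMBIGUOUS for ch in seq.upper())
```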
5
Sequence Formats
  • A sequence is stored as ASCII text (e.g. a string
    of the characters A, G, C, T) along with annotations.
  • Different sequence analysis programs recognize
    different sequence formats.
  • A sequence format includes accessory information
    (gene names, source organism, investigator name,
    references) together with the actual sequence.

6
Sequence Formats (continued)
  • FASTA
  • GenBank Flat File format
  • PIR/CODATA format
  • EMBL sequence entry format
  • Intelligenetics sequence entry format
  • GCG (Genetics Computer Group) sequence entry
    format.
  • ASN.1
  • XML
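
FASTA is the simplest of the formats listed: each record is a header line beginning with ">" followed by one or more lines of sequence. The parser below is a minimal sketch under that assumption; the function name and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records, for illustration only.
EXAMPLE_FASTA = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2 another made-up record
GGCC
"""
```

Calling `parse_fasta(EXAMPLE_FASTA)` joins the wrapped lines of each record into one sequence string.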

7
Databases
  • NCBI: GenBank at the National Center for
    Biotechnology Information (NCBI), National
    Library of Medicine, Bethesda, MD
  • NBRF: Protein Information Resource (PIR) database
    at the National Biomedical Research Foundation
    in Washington, DC

8
Databases (continued)
  • SwissProt: the SwissProt protein sequence database
    at ISREC, the Swiss Institute for Experimental
    Cancer Research
  • EMBL: European Molecular Biology Laboratory (EMBL)
    Outstation at Hinxton, England
  • DDBJ: DNA DataBank of Japan (DDBJ) at Mishima, Japan

9
Databases on Internet
  • NCBI: http://www.ncbi.nlm.nih.gov
  • PIR: http://www-nbrf.georgetown.edu/pirwww
  • SwissProt: http://www.expasy.ch/cgi-bin/sprot-search-de
  • EMBL: http://www.ebi.ac.uk/embl/index.html
  • DDBJ: http://www.ddbj.nig.ac.jp/

10
NCBI
  • National resource for molecular biology
    information.
  • Maintains comprehensive databases covering a
    variety of biotechnology-related information.
  • Develops and manages access to a range of
    databases and software for the scientific and
    medical communities.

11
NCBI Integrated Databases
  • Literature Databases
  • PubMed
  • PubMed Central
  • OMIM
  • PROW
  • BookShelf

12
NCBI Integrated Databases (continued)
  • Nucleotide Databases
  • GenBank
  • EST Database
  • GSS Database
  • SNPs Database
  • RefSeq
  • STS Database

13
NCBI Integrated Databases (continued)
  • Entrez Databases
  • PubMed
  • Protein Sequence Database
  • Nucleotide Sequence Database
  • Taxonomy
  • OMIM

14
GenBank
  • GenBank is the NIH genetic sequence database.
  • Annotated collection of all publicly available
    DNA sequences.
  • GenBank is a part of an international
    collaboration of sequence databases along with
    EMBL and DDBJ.

15
GenBank DNA Sequence Format
  • A DNA sequence entry in GenBank is organized into
    distinct fields, as follows:
  • Locus: locus name, sequence length, division,
    date
  • Definition: description of the entry
  • Accession: unique accession number
  • Version: version of the sequence
  • Keywords: keywords for cross-referencing

16
GenBank DNA Sequence Format (continued)
  • Source: source organism of the DNA
  • Organism: description of the organism
  • References: authors, title, journal, Medline, etc.
  • Features: information about the sequence
  • Base count: count of each base in the sequence
  • Origin: the sequence data begin on the line
    after Origin.
  • GenBank sample
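
The field layout described on the two slides above can be read with a few lines of code. The sketch below assumes the usual flat-file conventions: keywords appear in upper case at the start of a line, continuation lines are indented, the sequence follows ORIGIN, and "//" ends the record. It is a simplification: indented sub-keywords such as ORGANISM are simply appended to the field above them. The record is entirely made up (the accession XX000000 is not a real entry) and the function name is illustrative.

```python
def parse_genbank_record(text):
    """Return a dict of top-level fields plus the sequence under 'SEQUENCE'."""
    fields = {}
    sequence = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry a position number and spaced blocks;
            # keep only the letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # A top-level keyword starts in the first column.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Indented line: continuation (or sub-keyword) of the current field.
            fields[current] += " " + line.strip()
    fields["SEQUENCE"] = "".join(sequence)
    return fields


# Made-up record, for illustration only.
EXAMPLE_RECORD = """LOCUS       EXAMPLE1                  20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   XX000000
VERSION     XX000000.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""
```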

17
NCBI Tools
  • Tools for Data Retrieval and submission
  • Text Term Searching
  • Sequence Similarity Searching
  • Taxonomy Searching
  • Sequence Submission

18
NCBI ENTREZ
  • Entrez is a search and retrieval system that
    integrates information from databases at NCBI.
  • These databases include nucleotide sequences,
    protein sequences, macromolecular structures,
    whole genomes, and the MEDLINE literature through
    PubMed, among others.
  • Entrez
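
Entrez can also be queried from a program. The slides do not cover this, so the following is only a sketch that assumes NCBI's E-utilities web interface and its `esearch.fcgi` endpoint with the `db`, `term`, and `retmax` parameters. It builds the query URL without contacting the server; the function name is illustrative.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an E-utilities search URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"
```

For instance, `esearch_url("protein", "hemoglobin", retmax=5)` yields a URL that searches the protein database for "hemoglobin" and limits the result list to five identifiers.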

19
NCBI BLAST
  • BLAST: Basic Local Alignment Search Tool
  • A set of similarity search programs designed to
    explore available sequence databases.
  • Uses a heuristic algorithm that can detect
    relationships among sequences that share only
    isolated regions of similarity.
  • Q-BLAST: a queuing system for BLAST that lets
    users retrieve and format their results at their
    convenience.
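
To make "isolated regions of similarity" concrete, the sketch below finds short words shared exactly by two sequences, which is the kind of seed a word-based heuristic starts from. It is a deliberately simplified illustration, not the BLAST algorithm itself: it does no scoring, extension, or statistics, and the function name and default word size are assumptions.

```python
def shared_words(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of the subject sequence by its start position.
    index = {}
    for i in range(len(subject) - word_size + 1):
        index.setdefault(subject[i:i + word_size], []).append(i)
    # Look up each word of the query in that index.
    hits = []
    for j in range(len(query) - word_size + 1):
        for i in index.get(query[j:j + word_size], []):
            hits.append((j, i))
    return hits
```

Each hit marks a position pair from which a local alignment could be extended in both directions.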

20
NCBI BLAST (continued)
  • Access to BLAST service
  • Web-BLAST
  • Standalone BLAST
  • Network BLAST
  • BLAST URL API

21
NCBI BLAST (continued)
  • BLAST programs:
  • blastp: compares an amino acid query sequence
    against a protein sequence database
  • blastn: compares a nucleotide query sequence
    against a nucleotide sequence database
  • blastx: compares a nucleotide query sequence
    against a protein sequence database
  • tblastn: compares a protein query sequence
    against a nucleotide sequence database
  • BLAST
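
The four programs above differ only in the type of the query and the type of the database, so the choice can be written as a lookup. This is a minimal sketch covering just the programs listed on this slide; the type labels and function name are illustrative.

```python
# (query type, database type) -> program, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type):
    """Return the program name for a query/database type combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no program listed for {query_type!r} query "
            f"against {database_type!r} database"
        ) from None
```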

22
NBRF PIR
  • Protein Information Resource
  • 3 Major Databases
  • PSD (Protein Sequence Database)
  • iProClass
  • PIR-NREF (Non-redundant REFerence protein
    database)

23
PIR PSD
  • The PIR, in collaboration with MIPS and JIPID,
    produces and distributes the PIR-International
    Protein Sequence Database (PSD).
  • Comprehensive and expertly annotated protein
    sequence database.
  • The primary sources of PSD data are sequences
    from GenBank/EMBL/DDBJ translations, published
    literature, and direct submission to
    PIR-International.

24
PIR PSD (continued)
  • The PIR-PSD data are available in XML, NBRF, and
    PIR/CODATA formats. The sequence file is also
    available in FASTA format.
  • Also available from the PIR UNIX FTP server:
    ftp://ftp.pir.georgetown.edu/pir_databases/psd/

25
CODATA format
  • The CODATA format carries approximately the same
    information as a GenBank or EMBL sequence file,
    but is formatted slightly differently and uses
    different field names.
  • Also called the PIR format; used by NBRF.
  • CODATA Sample

26
PIR iProClass
  • The iProClass database provides comprehensive
    descriptions of all proteins and serves as a
    framework for data integration in a distributed
    networking environment.
  • Very user-friendly description.

27
PIR NREF (Non-redundant REFerence protein
database)
  • Comprehensive: contains all sequences from
    PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, and GenPept,
    and is updated biweekly.
  • Non-redundant: sequences are clustered by
    sequence identity and taxonomy at the species
    level.
  • Source attribution: contains protein IDs and
    names from the associated databases (with
    hypertext links), in addition to protein
    sequence, taxonomy, and bibliography.
  • The current version (July 2002) consists of more
    than 809,000 non-redundant PIR-PSD, Swiss-Prot,
    and TrEMBL proteins organized with more than
    36,200 PIR superfamilies, 145,340 families, and
    links to over 50 molecular biology databases.
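
The non-redundancy and source-attribution ideas can be illustrated with a toy merge step. The sketch below assumes that "sequence identity" means exact string identity and that entries arrive as (source database, protein ID, species, sequence) tuples; the actual PIR-NREF procedure is not described in the slides, and all names and data here are hypothetical.

```python
def collapse_redundant(entries):
    """Merge entries with identical sequence and species.

    entries: iterable of (source_db, protein_id, species, sequence).
    Returns a dict mapping (species, sequence) to the list of
    (source_db, protein_id) pairs that share it.
    """
    merged = {}
    for source_db, protein_id, species, sequence in entries:
        key = (species, sequence.upper())
        # Keep every source ID so attribution is not lost in the merge.
        merged.setdefault(key, []).append((source_db, protein_id))
    return merged


# Made-up entries, for illustration only.
EXAMPLE_ENTRIES = [
    ("PIR-PSD", "ID1", "Homo sapiens", "MKVL"),
    ("Swiss-Prot", "ID2", "Homo sapiens", "mkvl"),
    ("TrEMBL", "ID3", "Mus musculus", "MKVL"),
]
```

In the example, the first two entries collapse into one record with two source IDs, while the third stays separate because it belongs to a different species.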

28
Swiss-Prot
  • Swiss-Prot is a protein knowledgebase established
    in 1986.
  • Maintained collaboratively by the Department of
    Medical Biochemistry of the University of Geneva
    (now the Swiss Institute of Bioinformatics) and
    the EMBL Data Library.
  • Swiss-Prot Sequence Entry Example

29
Sequence Format Conversion
  • READSEQ: a sequence format conversion program.
  • http://bimas.dcrt.nih.gov/molbio/readseq/
  • Can convert to and from:
  • ASN.1
  • FASTA
  • CODATA
  • GCG
  • EMBL format
  • GenBank format, and many other formats
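
A format conversion amounts to reading the fields of one layout and writing them out in another. The sketch below converts a single FASTA record to the NBRF/PIR layout as commonly described: a ">P1;identifier" line, a description line, and the sequence terminated by "*". It is an illustration only, not READSEQ's code; the function name, the "P1" type code for a complete protein, and the 60-character line width are assumptions.

```python
def fasta_to_nbrf(header, sequence, seq_type="P1", width=60):
    """Render one FASTA record (header without '>') in NBRF/PIR layout."""
    # The first word of the FASTA header becomes the identifier;
    # the rest becomes the description line.
    identifier, _, description = header.partition(" ")
    lines = [f">{seq_type};{identifier}", description]
    body = sequence + "*"  # an asterisk terminates the sequence
    lines.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(lines) + "\n"
```

For example, `fasta_to_nbrf("seq1 made-up protein", "MKV")` produces three lines: `>P1;seq1`, `made-up protein`, and `MKV*`.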

30
References
  • http://www.ncbi.nlm.nih.gov
  • http://www-nbrf.georgetown.edu/pirwww
  • http://www.expasy.ch/cgi-bin/sprot-search-de
  • http://www.ebi.ac.uk/embl/index.html
  • http://www.ddbj.nig.ac.jp/

31
Thank You
Presented by Hemal Patel and Jeetal Shah