Sequence Databases - PowerPoint PPT Presentation

Slides: 37
Provided by: PCC53
Transcript and Presenter's Notes

1
Sequence Databases
2
  • 1.1 Introduction
  • 1.2 Primary and Secondary Databases
  • 1.3 Nucleotide Sequence Databases
  • 1.4 Nucleotide Sequence Flatfiles: A Dissection
  • 1.5 Protein Sequence Databases
  • 1.6 Summary

3
1.1 Introduction
  • The history of sequence databases began in the
    early 1960s, when Margaret Dayhoff and colleagues,
    whose collection later grew into the Protein
    Information Resource (PIR), collected all of the
    protein sequences known at that time.
  • DNA sequence databases followed in 1982,
    initiated by the European Molecular Biology
    Laboratory (EMBL) and joined shortly thereafter
    by GenBank.

4
  • The DNA Databank of Japan (DDBJ) joined the
    data-collecting collaboration a few years later.
  • DDBJ/EMBL/GenBank records are now exchanged and
    updated automatically every 24 hours at all
    three sites.

5
1.2 Primary and Secondary Databases
  • Primary databases contain, for the most part,
    experimental results, but they are not curated
    reviews.
  • Curated reviews are found in what are called
    secondary databases.
  • The DNA, RNA, or protein sequences are the most
    valuable component of primary databases.

6
1.3 Nucleotide Sequence Databases
  • The major sources of nucleotide sequence data are
    the databases of the DDBJ/EMBL/GenBank
    collaboration.
  • DDBJ/EMBL/GenBank nucleotide records often are
    the primary source of sequence and biological
    information from which records in other databases
    are derived.

7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
Database Formats - FASTA
  • Beginning symbol (>)
  • Definition
  • Accession number
  • 60 characters per line
  • Accession number format: accession number.version
    number, e.g. U54469.1
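
The FASTA layout described on this slide can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the first whitespace-delimited token after ">" is the accession.version identifier (real headers may carry extra database tags), and the definition text and sequence in the example are placeholders, not the actual content of U54469.1.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, definition, sequence) for each FASTA record."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # beginning symbol of a record
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence lines, usually 60 chars each
    if header is not None:
        yield _build(header, chunks)


def _build(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Assumption: identifier is the first token, the rest is the definition.
    ident, _, definition = header.partition(" ")
    return ident, definition, "".join(chunks)


def split_version(ident: str) -> Tuple[str, Optional[int]]:
    """Split 'U54469.1' into accession 'U54469' and version 1."""
    accession, _, version = ident.partition(".")
    return accession, int(version) if version else None


def wrap(sequence: str, width: int = 60) -> str:
    """Re-wrap a sequence to the conventional 60 characters per line."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))


# Placeholder record: only the identifier comes from the slide.
EXAMPLE = """>U54469.1 placeholder definition line
ACGTACGTAC
GGTTAACC
"""

records = list(parse_fasta(EXAMPLE))
```

Joining the sequence lines on parsing and re-wrapping on output keeps the line width a presentation detail rather than part of the data.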
12
1.4 Nucleotide Sequence Flatfiles: A Dissection
  • Flatfiles can be separated into three major
    parts:
  • Header: information (descriptors) that applies
    to the entire record.
  • Feature table: annotation on the record
    (biological information).
  • End: // on the last line.
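
As a rough illustration of this dissection, the sketch below splits a DDBJ/GenBank-style record into its parts. It assumes the feature table starts at a line beginning with FEATURES, the sequence starts at ORIGIN, and the record ends at //. The EMBL layout uses different line codes and is not handled here. The record itself is a toy placeholder with a made-up accession.

```python
from typing import Dict, List

# Toy record for illustration only; the accession TOY00001 is hypothetical.
TOY_RECORD = """\
LOCUS       TOY00001                  12 bp    DNA
DEFINITION  Placeholder record for illustration.
ACCESSION   TOY00001
VERSION     TOY00001.1
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
"""


def split_flatfile(record: str) -> Dict[str, List[str]]:
    """Split one flatfile record into header, feature table, and sequence lines."""
    header: List[str] = []
    features: List[str] = []
    sequence: List[str] = []
    section = header                      # every record starts with the header
    for line in record.splitlines():
        if line.startswith("//"):         # end-of-record marker
            break
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence
        section.append(line)
    return {"header": header, "features": features, "sequence": sequence}


parts = split_flatfile(TOY_RECORD)
```

For real work an established parser is the safer choice; the point here is only that the section boundaries are plain keywords at the start of a line.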

13
DDBJ/GenBank
14
EMBL
15
Third Party Annotation (TPA)
  • Primary database entries are owned by the
    original submitter and the co-authors of the
    submission publication(s), and owners have
    privileges to update the data contained in the
    record.
  • Entries in the TPA dataset include reannotations
    of existing entries and combinations of novel
    sequence with existing primary entries.

16
(No Transcript)
17
RefSeq (NCBI)
  • RefSeq is a curated secondary database that aims
    to provide a comprehensive, integrated,
    nonredundant set of sequences, including genomic
    DNA, transcripts (RNA), and protein products, for
    selected organisms.

18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
  • NT_123456: Genomic contigs (DNA)
  • NM_123456: mRNAs
  • NP_123456: Proteins
  • XM_123456: Model mRNAs
  • XP_123456: Model proteins
  • N prefix: actual, experimentally determined
  • X prefix: computationally predicted
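
The prefix scheme on this slide lends itself to a small lookup. The sketch below covers only the five prefixes listed here (RefSeq defines more), and the function name `classify_refseq` is a hypothetical helper, not part of any NCBI tool.

```python
from typing import Optional, Tuple

# Molecule type by two-letter prefix, limited to the prefixes on the slide.
MOLECULE_BY_PREFIX = {
    "NT": "genomic contig (DNA)",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA",
    "XP": "model protein",
}

# Evidence level by first letter, as stated on the slide.
EVIDENCE_BY_LETTER = {
    "N": "experimentally determined",
    "X": "computationally predicted",
}


def classify_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule type, evidence) for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore or prefix not in MOLECULE_BY_PREFIX:
        return None                       # not one of the prefixes covered here
    return MOLECULE_BY_PREFIX[prefix], EVIDENCE_BY_LETTER[prefix[0]]


mrna = classify_refseq("NM_123456")
model_protein = classify_refseq("XP_123456")
not_refseq = classify_refseq("U54469.1")
```

The underscore is what distinguishes these identifiers from DDBJ/EMBL/GenBank accession numbers such as U54469.1, which is why the last call returns nothing.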

24
EMBL Genome Reviews
  • Genome Reviews is a secondary database that
    provides an up-to-date, standardized, and
    comprehensively annotated view of the genomic
    sequences of organisms with completely deciphered
    genomes.

25
1.5 Protein Sequence Databases
  • With the increasing number of complete genomes,
    efforts are now focused on the identification and
    functional analysis of the proteins encoded by
    these genomes.
  • Most sequence data in protein databases are
    derived from the translation of nucleotide
    sequences, so protein databases are, by and
    large, secondary databases.

26
  • Protein databases can be divided into two broad
    categories:
  • 1. Sequence repositories: GenPept, UniParc
  • 2. Curated databases: RefSeq, UniProt, UniProt
    Knowledgebase

27
GenPept - Sequence repository
  • Protein sequences in GenPept are derived from
    translations of the sequences in
    DDBJ/EMBL/GenBank and contain the annotation
    present in the nucleotide record.
  • GenPept does not contain proteins derived through
    direct amino acid sequencing.
  • Redundant database.
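
To make "derived from translations" concrete, here is a minimal sketch of conceptual translation. It assumes the standard genetic code, a forward-strand coding sequence, and reading frame 0; real records state the translation table and coordinates in their coding-sequence annotation, which this sketch ignores.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")   # accept RNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        # Codons with ambiguous bases are reported as X.
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":             # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


short_peptide = translate("ATGGCCTAA")
longer_peptide = translate("atggcctggtga")
```

Because every annotated coding region yields its own entry, identical proteins from different nucleotide records are stored repeatedly, which is what makes such a repository redundant.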

28
UniParc - Sequence repository
  • UniParc contains protein sequences from
    translations of DDBJ/EMBL/GenBank sequences and
    primary protein sequences from direct sequencing
    of proteins.
  • It includes protein data from Swiss-Prot, TrEMBL,
    PIR-PSD, IPI, RefSeq, FlyBase, WormBase, and
    others.
  • Non-redundant database, but with no annotation.
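
One way to picture a non-redundant archive without annotation is a table keyed by the sequence itself, with each entry remembering only where the sequence came from. The sketch below is an assumption-laden illustration: the SHA-256 key, the function name, and the identifiers are all hypothetical and do not describe how UniParc is actually implemented.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def build_archive(entries: Iterable[Tuple[str, str, str]]) -> Dict[str, dict]:
    """Store each distinct sequence once and record its source databases.

    entries: (source database, source identifier, protein sequence)
    """
    archive: Dict[str, dict] = {}
    for source, identifier, sequence in entries:
        sequence = sequence.upper()
        # Hypothetical key choice: a digest of the exact sequence string.
        key = hashlib.sha256(sequence.encode("ascii")).hexdigest()
        record = archive.setdefault(key, {"sequence": sequence, "sources": []})
        record["sources"].append((source, identifier))
    return archive


# Placeholder identifiers and sequences, for illustration only.
ENTRIES = [
    ("Swiss-Prot", "hypo_1", "MAWKT"),
    ("RefSeq", "hypo_2", "MAWKT"),
    ("FlyBase", "hypo_3", "MAWRT"),
]

archive = build_archive(ENTRIES)
shared = [r for r in archive.values() if r["sequence"] == "MAWKT"][0]
```

Only exactly identical sequences collapse into one entry here; sequences that differ by a single residue stay separate.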

29
(No Transcript)
30
  • Non-redundant level: similar sequences are
    merged.
  • One gene, one entry.
  • Extending annotation and computational analysis
  • New and updated sequences

31
UniProt Curated Database
32
FTP Sites
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)