Protein Database - PowerPoint PPT Presentation

Slides: 34
Provided by: hsun2
Transcript and Presenter's Notes

1
Protein Database
  • Bioinformatics Lab

2
Sequence Databases
  • GenBank--DNA sequences and derived protein
    sequences
  • EMBL --DNA sequences and derived protein
    sequences
  • DDBJ --DNA sequences and derived protein
    sequences
  • SWISS-PROT--Protein sequences
  • PDB--three-dimensional structures of proteins

3
GenBank, EMBL, and DDBJ
  • GenBank is the NIH genetic sequence database, an
    annotated collection of all publicly available
    DNA sequences.
  • A new release is made every two months. GenBank
    is part of the International Nucleotide Sequence
    Database Collaboration, which comprises the
    DNA DataBank of Japan (DDBJ), the European
    Molecular Biology Laboratory (EMBL), and GenBank
    at NCBI.
  • These three organizations exchange data on a
    daily basis.

4
GenBank, EMBL, and DDBJ
  • GenBank Release 122.0, Feb. 15, 2001
  • 10,897,000 sequence records
  • 11,720,000,000 bases
  • EMBL Release 66, Mar. 2, 2000
  • 11,169,673 sequence records
  • 11,916,112,872 bases
  • DDBJ: run by the center for operating DDBJ at the
    National Institute of Genetics (NIG), Japan,
    established in April 1995.

5
Protein Databases
  • Protein databases come in many styles, such as
    protein sequences, motifs, classification,
    structure, structure alignment, and curation
  • GenBank, EMBL, and DDBJ (derived sequences,
    http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
  • SWISS-PROT, PIR (sequences)
  • PROSITE, PRINTS (sequence motifs)
  • HSSP, FSSP (classification, alignment)
  • PDB (3-D structure)

6
SWISS-PROT/TrEMBL
  • Annotated protein sequences
  • Established in 1986
  • Developed by the SWISS-PROT groups at SIB and at
    EBI.
  • Maintained collaboratively, since 1987, by the
    Department of Medical Biochemistry of the
    University of Geneva and the EMBL Data
    Library (now the EMBL Outstation - The European
    Bioinformatics Institute (EBI)).
  • Website: http://www.expasy.ch/

7
Different Features of SWISS-PROT
  • Format follows as closely as possible that of
    EMBL
  • Curated protein sequence database
  • Three differences from other protein sequence
    databases:
  • Strives to provide a high level of
    annotation
  • Minimal level of redundancy
  • High level of integration with other databases

8
Three Distinct Criteria
  • 1. Annotation
  • Each entry holds core data and annotation. The
    core data are the sequence data, the citation
    information (bibliographical references), and the
    taxonomic data (description of the biological
    source of the protein). The annotation describes
    items such as protein functions,
    post-translational modifications, domains and
    sites, secondary structure, quaternary structure,
    similarities to other proteins, diseases
    associated with deficiencies in the protein,
    sequence conflicts, variants, etc.

9
  • 2. Minimal Redundancy
  • Many sequence databases contain, for a given
    protein sequence, separate entries which
    correspond to different literature reports.
    SWISS-PROT tries as much as possible to merge all
    these data so as to minimize the redundancy. If
    conflicts exist between various sequencing
    reports, they are indicated in the feature table
    of the corresponding entry.

10
  • 3. Integration With Other Databases
  • SWISS-PROT and TrEMBL - Protein sequences
  • PROSITE - Protein families and domains
  • SWISS-2DPAGE - Two-dimensional polyacrylamide gel
    electrophoresis
  • SWISS-3DIMAGE - 3D images of proteins and other
    biological macromolecules
  • SWISS-MODEL Repository - Automatically generated
    protein models
  • CD40Lbase - CD40 ligand defects
  • ENZYME - Enzyme nomenclature
  • SeqAnalRef - Sequence analysis bibliographic
    references

11
SWISS-PROT/TrEMBL
  • TrEMBL is a computer-annotated supplement of
    SWISS-PROT that contains all the translations of
    EMBL nucleotide sequence entries not yet
    integrated in SWISS-PROT
  • SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152
    entries
  • TrEMBL Release 16.2 of 23-Mar-2001: 436,924
    entries

12
SWISS-PROT FORMAT
 
 
13
Access to SWISS-PROT and TrEMBL
  • SRS - Access to SWISS-PROT, TrEMBL and other
    databases using the Sequence Retrieval System
  • Full text search in SWISS-PROT and TrEMBL
  • by accession number or ID (AC or ID line;
    SWISS-PROT and TrEMBL)
  • by description or identification (any word in the
    DE, OS, OG, GN and ID lines; SWISS-PROT and
    TrEMBL)
  • by author (RA line; SWISS-PROT and TrEMBL)
  • by citation (RL line; SWISS-PROT only)
  • Retrieve a list of SWISS-PROT/TrEMBL entries
  • Randomly retrieve a SWISS-PROT/TrEMBL entry
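
The search options above rely on the two-letter line codes (ID, AC, DE, OS, RA, RL, and so on) that start every line of a SWISS-PROT flat-file entry. The following is a minimal Python sketch of how those codes could be used to index an entry. The sample entry, its accession number, and the function name group_by_line_code are made up for illustration and do not come from the slides or from a real record.

```python
from collections import defaultdict

# Hypothetical, heavily shortened entry; every value here is illustrative only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   216 AA.
AC   X00000;
DE   EXAMPLE PROTEIN (ILLUSTRATIVE ENTRY).
OS   Homo sapiens (Human).
RA   Doe J., Roe R.;
RL   J. Example 1:1-2(2001).
//
"""

def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):      # '//' marks the end of an entry
            break
        # Columns 1-2 hold the line code; the data start at column 6.
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    return dict(fields)

fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["AC"])   # accession line(s), as used by the AC search
print(fields["RA"])   # author line(s), as used by the RA search
```

With the lines grouped this way, a search "by author" or "by citation" reduces to looking at the RA or RL lists of each entry.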

14
Protein Data Bank
  • PDB is an archive of three-dimensional structures
    of proteins; some nucleic acids are also included
  • PDB is operated by RCSB (Research Collaboratory
    for Structural Bioinformatics), funded by NSF,
    DOE, and two units of NIH: NIGMS (National
    Institute of General Medical Sciences) and NLM
    (National Library of Medicine).
  • Established at BNL (Brookhaven National
    Laboratory) in 1971 as an archive for biological
    macromolecular crystal structures
  • In the 1980s, the number of deposited structures
    began to increase dramatically.
  • In October 1998, the management of the PDB became
    the responsibility of RCSB.
  • Website: http://www.rcsb.org

15
PDB Holdings List 27-Mar-2001
Molecule type (columns) by experimental technique (rows):

Exp. Tech.                   Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  Total
                             and Viruses          Acid Complexes   Acids
X-ray Diffraction and other  11045                526              552      14             12137
NMR                          1832                 71               366      4              2273
Theoretical Modeling         281                  19               21       0              321
Total                        13158                616              939      18             14731
  • Structure Factor Files
  • 968 NMR Restraint Files

16
PDB Content Growth
17
PDB Growth in New Folds
18
PDB Data File Format
  • There are two main formats: PDB and CIF
  • PDB is a fixed-column format
  • CIF is a free format
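
To make "fixed-column" concrete, here is a minimal Python sketch that writes and reads an ATOM record by column position. The column ranges follow the published PDB coordinate-record layout; the helper names format_atom and parse_atom are assumptions for this example, and the second record (with its -100.000 coordinates) is a made-up case chosen to show two fields touching.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Write one ATOM record in the fixed PDB column layout."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def parse_atom(line):
    """Read an ATOM record by slicing columns, not by splitting on spaces."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }

# Values of atom 20 (N of LEU 4) from the 2MCG entry quoted in this deck.
leu_n = format_atom(20, " N  ", "LEU", "1", 4, 30.279, -25.716, 105.041, 1.00, 10.60)
# Hypothetical record whose x and y fields run together without a space.
fused = format_atom(1, " CA ", "ALA", "A", 1, -100.0, -100.0, 5.0, 1.0, 20.0)

print(parse_atom(leu_n))
print(len(leu_n.split()), len(fused.split()))
```

The first record splits into 11 whitespace-separated tokens, but the made-up second one yields only 10 because "-100.000-100.000" has no space between the fields. Column slicing still recovers both values, which is the practical consequence of a fixed format; a free format such as CIF separates values by whitespace instead.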

19
PDB Format
  • HEADER First line of the entry, contains PDB ID
    code, classification, and date of deposition.
  • OBSLTE Statement that the entry has been
    removed from distribution and list of the ID
    code(s) which replaced it.
  • TITLE Description of the experiment represented
    in the entry.
  • CAVEAT Severe error indicator. Entries with
    this record must be used with care.
  • COMPND Description of macromolecular contents
    of the entry.
  • SOURCE Biological source of macromolecules in
    the entry.
  • KEYWDS List of keywords describing the
    macromolecule.
  • EXPDTA Experimental technique used for the
    structure determination.
  • AUTHOR List of contributors.
  • REVDAT Revision date and related information.
  • SPRSDE List of entries withdrawn from release
    and replaced by current entry.
  • JRNL Literature citation that defines the
    coordinate set.
  • REMARK General remarks, some are structured and
    some are free form.
  • DBREF Reference to the entry in the sequence
    database(s).
  • SEQADV Identification of conflicts between PDB
    and the named sequence database.
  • SEQRES Primary sequence of backbone residues.
  • MODRES Identification of modifications to
    standard residues.
  • HET Identification of non-standard groups or
    residues (heterogens)
  • HETNAM Compound name of the heterogens.

20
  • SHEET Identification of sheet substructures.
  • TURN Identification of turns.
  • SSBOND Identification of disulfide bonds.
  • LINK Identification of inter-residue bonds.
  • HYDBND Identification of hydrogen bonds.
  • SLTBRG Identification of salt bridges
  • CISPEP Identification of peptide residues in
    cis conformation.
  • SITE Identification of groups comprising
    important sites.
  • CRYST1 Unit cell parameters, space group, and
    Z.
  • ORIGXn Transformation from orthogonal
    coordinates to the submitted coordinates (n = 1,
    2, or 3).
  • SCALEn Transformation from orthogonal
    coordinates to fractional crystallographic
    coordinates (n = 1, 2, or 3).
  • MTRIXn Transformations expressing
    non-crystallographic symmetry (n = 1, 2, or 3).
    There may be multiple sets of these records.
  • TVECT Translation vector for infinite
    covalently connected structures.
  • MODEL Specification of model number for
    multiple structures in a single coordinate entry.
  • ATOM Atomic coordinate records for standard
    groups.
  • SIGATM Standard deviations of atomic
    parameters.
  • ANISOU Anisotropic temperature factors.
  • SIGUIJ Standard deviations of anisotropic
    temperature factors.
  • TER Chain terminator.
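
Every record type listed above is identified by the name in the first six columns of its line, so a reader can dispatch on that field. The following is a small Python sketch that tallies record types. The sample lines are abbreviated from the 2MCG entry quoted in this deck; their internal spacing is assumed, since the original column alignment did not survive in the transcript, and record_name is a name chosen for this example.

```python
from collections import Counter

def record_name(line):
    """Return the record name stored in columns 1-6 of a PDB line."""
    return line[:6].strip()

# Abbreviated lines; spacing after the record name is an assumption.
lines = [
    "HEADER    IMMUNOGLOBULIN",
    "COMPND    IMMUNOGLOBULIN LAMBDA LIGHT CHAIN DIMER",
    "COMPND   2 (TRIGONAL FORM)",
    "SOURCE    HUMAN (HOMO SAPIENS)",
    "JRNL        AUTH   K.R.ELY,J.N.HERRON,M.HARKER,A.B.EDMUNDSON",
    "REMARK   1 REFERENCE 1",
    "ORIGX1      0.013831  0.007985  0.000000        0.00000",
    "TER",
]

counts = Counter(record_name(line) for line in lines)
for name, n in counts.items():
    print(f"{name:<6} {n}")
```

Short names such as JRNL and TER are padded with blanks up to column 6 in a real file, which is why the sketch strips the slice before counting.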

21
An Example of PDB
  • HEADER IMMUNOGLOBULIN 09-MAY-89 2MCG 2MCG 2
  • COMPND IMMUNOGLOBULIN LAMBDA LIGHT CHAIN DIMER (/MCG) 2MCG 3
  • COMPND 2 (TRIGONAL FORM) 2MCG 4
  • SOURCE HUMAN (HOMO SAPIENS) 2MCG 5
  • AUTHOR K.R.ELY,J.N.HERRON,A.B.EDMUNDSON 2MCG 6
  • REVDAT 2 15-JUL-92 2MCGA 1 SPRSDE 2MCGA 1
  • SPRSDE 15-OCT-90 2MCG 1MCG 2MCGA 2
  • JRNL AUTH K.R.ELY,J.N.HERRON,M.HARKER,A.B.EDMUNDSON 2MCG 9
  • JRNL TITL THREE-DIMENSIONAL STRUCTURE OF A LIGHT CHAIN 2MCG 10
  • REMARK 1 REFERENCE 1 2MCG 16
  • REMARK 1 AUTH A.B.EDMUNDSON,K.R.ELY,J.N.HERRON,B.D.CHESON 2MCG 17
  • SEQRES 1 1 216 PCA SER ALA LEU THR GLN PRO PRO SER ALA SER GLY SER 2MCG 183
  • FORMUL 3 HOH 318(H2 O1) 2MCG 217
  • SSBOND 1 CYS 1 22 CYS 1 90 2MCG 218
  • CRYST1 72.300 72.300 185.900 90.00 90.00 120.00 P 31 2 1 6 2MCG 223
  • ORIGX1 0.013831 0.007985 0.000000 0.00000 2MCG 224
  • ORIGX2 0.000000 0.015971 0.000000 0.00000 2MCG 225
  • ORIGX3 0.000000 0.000000 0.005379 0.00000 2MCG 226

22
Fragment of CIF example
  • ATOM_SITE
  • loop_
  • _atom_site.label_seq_id
  • _atom_site.group_PDB
  • _atom_site.type_symbol
  • _atom_site.label_atom_id
  • _atom_site.label_comp_id
  • _atom_site.label_asym_id
  • _atom_site.auth_seq_id
  • _atom_site.label_alt_id
  • _atom_site.cartn_x
  • _atom_site.cartn_y
  • _atom_site.cartn_z
  • _atom_site.occupancy
  • _atom_site.B_iso_or_equiv
  • _atom_site.footnote_id
  • _atom_site.label_entity_id
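
In a CIF loop_, the data names listed above act as column headers, and each following row supplies one whitespace-separated value per name. The sketch below is a simplified Python reader under stated assumptions: the two data rows are not on the slide and were assembled from the 2MCG leucine coordinates quoted in this deck, with "." for unknown values and an entity id of 1 assumed. It also ignores quoted strings and multi-line values, which real CIF files may contain.

```python
CIF_FRAGMENT = """\
loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
4 ATOM N N  LEU 1 4 . 30.279 -25.716 105.041 1.00 10.60 . 1
4 ATOM C CA LEU 1 4 . 31.406 -26.518 104.496 1.00 9.39 . 1
"""

def parse_loop(text):
    """Pair each row of values with the data names declared in the loop."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            names.append(line)                          # a column header
        else:
            rows.append(dict(zip(names, line.split()))) # one data row
    return rows

rows = parse_loop(CIF_FRAGMENT)
for row in rows:
    print(row["_atom_site.label_atom_id"], row["_atom_site.cartn_x"])
```

Because values are matched to names by position rather than by column number, extra spaces between values do not matter, unlike in the fixed-column PDB format.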

23
3-D Structure from PDB
  • 20 Amino acids
  • http://www.clunet.edu/BioDev/omm/aa/aa.htm
  • http://www.nyu.edu/pages/mathmol/library/life/
  • http://inquiry.uiuc.edu/bioweb/tutorial/amino_acids.htm

25
How to Construct 3-D Molecule
  • Read coordinates from PDB
  • Set up data structure of molecules
  • Form bonds among atoms and groups
  • Calculate secondary structure
  • Implement 3-D graphical algorithms
  • Render the 3-D graphics in various styles: wires,
    sticks, balls, ribbons, and the like.
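
The first two steps (reading coordinates and setting up a molecular data structure) can be sketched in Python as follows. The slides do not describe the data structure actually used, so the Atom and Residue classes and the build_residues function are hypothetical; the four sample atoms are taken from the 2MCG ATOM records quoted in this deck.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    serial: int
    name: str
    x: float
    y: float
    z: float

@dataclass
class Residue:
    name: str
    seq: int
    atoms: list = field(default_factory=list)

def build_residues(records):
    """Group (serial, atom, residue, residue number, x, y, z) tuples by residue."""
    residues = []
    for serial, atom_name, res_name, res_seq, x, y, z in records:
        # Start a new residue whenever the residue number changes.
        if not residues or residues[-1].seq != res_seq:
            residues.append(Residue(res_name, res_seq))
        residues[-1].atoms.append(Atom(serial, atom_name, x, y, z))
    return residues

records = [
    (15, "N", "ALA", 3, 27.758, -24.228, 105.876),
    (16, "CA", "ALA", 3, 28.328, -25.397, 106.456),
    (20, "N", "LEU", 4, 30.279, -25.716, 105.041),
    (21, "CA", "LEU", 4, 31.406, -26.518, 104.496),
]

residues = build_residues(records)
print([(r.name, r.seq, len(r.atoms)) for r in residues])
```

Keeping atoms grouped by residue makes the later steps simpler: bonds inside a residue and bonds between neighboring residues can be handled separately.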

26
Bonds among atoms
  • ATOM 20 N LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
  • ATOM 21 CA LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
  • ATOM 22 C LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
  • ATOM 23 O LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
  • ATOM 24 CB LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
  • ATOM 25 CG LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
  • ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
  • ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256

Leucine (LEU, L)
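
The slides do not say how bonds inside a residue are derived from these coordinates. One common heuristic, sketched here in Python as an assumption and not as the method used in the presentation, is to connect any two atoms that lie closer than a cutoff distance. The 1.9 angstrom cutoff and the name find_bonds are choices made for this example; the coordinates are the eight leucine atoms listed above.

```python
from itertools import combinations
from math import dist

# Atom name -> (x, y, z) for LEU 4 of 2MCG, copied from the records above.
LEU4 = {
    "N":   (30.279, -25.716, 105.041),
    "CA":  (31.406, -26.518, 104.496),
    "C":   (32.658, -25.786, 105.165),
    "O":   (32.890, -24.586, 104.967),
    "CB":  (31.615, -26.794, 103.141),
    "CG":  (31.552, -27.440, 101.860),
    "CD1": (32.732, -26.945, 100.970),
    "CD2": (31.706, -28.963, 102.016),
}

MAX_BOND = 1.9  # assumed cutoff in angstroms for bonds among C, N, and O

def find_bonds(atoms, cutoff=MAX_BOND):
    """Return every pair of atom names whose distance is within the cutoff."""
    return [(a, b) for a, b in combinations(atoms, 2)
            if dist(atoms[a], atoms[b]) <= cutoff]

bonds = find_bonds(LEU4)
for a, b in bonds:
    print(f"{a}-{b}: {dist(LEU4[a], LEU4[b]):.2f}")
```

For these eight atoms the cutoff recovers the seven covalent bonds of leucine (N-CA, CA-C, C-O, CA-CB, CB-CG, CG-CD1, CG-CD2), because bonded pairs here are about 1.2 to 1.6 angstroms apart while the nearest non-bonded pairs are about 2.4 angstroms apart. Hydrogens and heavier elements would need different cutoffs.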
27
Bonds between groups
  • ATOM 9 N SER 1 2 25.548 -22.930 103.333 1.00 16.05 2MCG 238
  • ATOM 10 CA SER 1 2 26.608 -22.758 104.327 1.00 15.38 2MCG 239
  • ATOM 11 C SER 1 2 27.351 -24.076 104.604 1.00 14.81 2MCG 240
  • ATOM 12 O SER 1 2 27.530 -24.949 103.740 1.00 15.00 2MCG 241
  • ATOM 13 CB SER 1 2 25.887 -22.406 105.682 1.00 15.73 2MCG 242
  • ATOM 14 OG SER 1 2 25.193 -23.586 106.117 1.00 15.14 2MCG 243
  • ATOM 15 N ALA 1 3 27.758 -24.228 105.876 1.00 13.72 2MCG 244
  • ATOM 16 CA ALA 1 3 28.328 -25.397 106.456 1.00 12.33 2MCG 245
  • ATOM 17 C ALA 1 3 29.255 -26.303 105.686 1.00 11.58 2MCG 246
  • ATOM 18 O ALA 1 3 29.033 -27.552 105.641 1.00 11.28 2MCG 247
  • ATOM 19 CB ALA 1 3 27.101 -26.228 106.998 1.00 12.39 2MCG 248
  • ATOM 20 N LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
  • ATOM 21 CA LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
  • ATOM 22 C LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
  • ATOM 23 O LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
  • ATOM 24 CB LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
  • ATOM 25 CG LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
  • ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
  • ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256
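
Consecutive residues in a chain are joined by a peptide bond between the carbonyl C of one residue and the backbone N of the next. The Python sketch below checks that link for the three residues listed above by measuring the C to N distance. The 1.5 angstrom cutoff and the name peptide_links are assumptions for this example; a typical peptide C-N bond length is about 1.33 angstroms.

```python
from math import dist

# Backbone N and C coordinates copied from the 2MCG records above.
BACKBONE = [
    ("SER", 2, {"N": (25.548, -22.930, 103.333), "C": (27.351, -24.076, 104.604)}),
    ("ALA", 3, {"N": (27.758, -24.228, 105.876), "C": (29.255, -26.303, 105.686)}),
    ("LEU", 4, {"N": (30.279, -25.716, 105.041), "C": (32.658, -25.786, 105.165)}),
]

def peptide_links(residues, cutoff=1.5):
    """Link residue i to residue i+1 when C(i) and N(i+1) are within the cutoff."""
    links = []
    for (name1, seq1, atoms1), (name2, seq2, atoms2) in zip(residues, residues[1:]):
        d = dist(atoms1["C"], atoms2["N"])
        if d <= cutoff:
            links.append((f"{name1}{seq1}", f"{name2}{seq2}", d))
    return links

links = peptide_links(BACKBONE)
for first, second, d in links:
    print(f"{first} C - {second} N: {d:.3f}")
```

Both neighboring pairs (SER 2 to ALA 3 and ALA 3 to LEU 4) fall within the cutoff, so the three groups form one connected chain. A distance well above the cutoff between consecutive residues would instead suggest a chain break.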

28
Nucleic Acid Database(NDB)
  • The NDB Project is funded by the National Science
    Foundation and the Department of Energy
  • The goal of the NDB Project is to assemble and
    distribute structural information about nucleic
    acids
  • NDB uses the same format as PDB.

29
Molvie 1.0
  • A visual and interactive environment to
    display, analyze, fold, and compare molecular
    structures.
  • Developed by us in Java AWT.
  • Runs as a Java application or as an applet
    embedded in a webpage
    (http://www.cs.ucsb.edu/mli/Bioinf/software/index.html)

30
Some features
  • Molvie 1.0 is programmed in Java, hence it is
    platform-independent.
  • There is no limit on the number of molecules,
    atoms, residues or the number of animation frames
    displayed, as long as there is enough computer
    memory.
  • Molvie has many rendering styles.
  • Molvie can display two molecules simultaneously
    and allows the user to align secondary structures
    by dragging the mouse.
  • Molvie also allows the user to click on a
    part of the 3-D structure of a protein and
    displays the corresponding primary amino acid
    sequence.

31
Molvie Application Screen
32
Molvie Applet Screen
33
  • Show Molvie