Structure Databases - PowerPoint PPT Presentation

Structure Databases

Provided by: VictorAS
Transcript and Presenter's Notes

1
Structure Databases
  • DNA/Protein structure-function analysis and
    prediction
  • Lecture 6
  • Bioinformatics Section, Vrije Universiteit,
    Amsterdam

2
The dictionary definition
  • Main Entry: database. Pronunciation:
    'dA-t-"bAs, 'da- also 'dä-. Function: noun. Date:
    circa 1962
  • a usually large collection of data organized
    especially for rapid search and retrieval (as by
    a computer)
  • - Webster dictionary

3
WHAT is a database?
  • A collection of data that needs to be:
  • Structured
  • Searchable
  • Updated (periodically)
  • Cross referenced
  • Challenge:
  • To change meaningless data into useful
    information that can be accessed and analysed the
    best way possible.
  • For example:
  • HOW would YOU organise all biological sequences
    so that the biological information is optimally
    accessible?
  • You need an appropriate database management
    system (DBMS)

4
DBMS
  • A unity of programs that store, extract and
    modify data
Database
  • Internal organization
  • Controls speed and flexibility
[Diagram: USER(S) store, extract and modify data in the database through the DBMS]
5
DBMS organisation types
  • Flat file databases (flat DBMS)
  • Simple, restrictive, table
  • Hierarchical databases (hierarchical DBMS)
  • Simple, restrictive, tables
  • Relational databases (RDBMS)
  • Complex, versatile, tables
  • Object-oriented databases (ODBMS)
  • Complex, versatile, objects

6
Relational databases
  • Data is stored in multiple related tables
  • Data relationships across tables can be either
    many-to-one or many-to-many
  • A few rules allow the database to be viewed in
    many ways
  • Let's convert the course details to a relational
    database

7
Our flat file database
FLAT DATABASE 2
Course details
Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   ..
Student 1  Chemistry  Maths    C   C   B   A   A   ..
Student 1  Chemistry  English  A   A   A   A   A   ..
...
Student 2  Ecology    Biology  A   B   A   A   A   ..
Student 2  Ecology    Maths    A   D   A   A   A   ..
...
8
Normalize (1NF)
  • We remove repeating records (rows)

9
Normalize (2NF)
  • We remove redundant fields (columns)

10
Relational Databases
  • What have we achieved?
  • No repeating information
  • Less storage space
  • Better reality representation
  • Easy modification/management
  • Easy usage of any combination of records
  • Remember
  • the DBMS has programs to access and edit this
    information so ignore the human reading
    limitation of the primary keys
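The conversion from the flat course table to related tables can be sketched in a few lines of Python. This is only an illustration, assuming the built-in sqlite3 module as the RDBMS; the table and key names (student, course, grade, sID, cID) are hypothetical and not taken from the slides, and only the first exam column (E1) is carried over to keep it short.

```python
import sqlite3

# Flat rows as on the "Course details" slide: (name, department, course, E1).
flat_rows = [
    ("Student 1", "Chemistry", "Biology", "A"),
    ("Student 1", "Chemistry", "Maths", "C"),
    ("Student 1", "Chemistry", "English", "A"),
    ("Student 2", "Ecology", "Biology", "A"),
    ("Student 2", "Ecology", "Maths", "A"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each student and each course is stored once,
# and the grade table links them through their primary keys.
conn.executescript("""
    CREATE TABLE student (sID INTEGER PRIMARY KEY, name TEXT UNIQUE, department TEXT);
    CREATE TABLE course  (cID INTEGER PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE grade   (sID INTEGER REFERENCES student(sID),
                          cID INTEGER REFERENCES course(cID),
                          e1  TEXT,
                          PRIMARY KEY (sID, cID));
""")

for name, department, course, e1 in flat_rows:
    # INSERT OR IGNORE skips students and courses that are already present.
    conn.execute("INSERT OR IGNORE INTO student (name, department) VALUES (?, ?)",
                 (name, department))
    conn.execute("INSERT OR IGNORE INTO course (title) VALUES (?)", (course,))
    s_id = conn.execute("SELECT sID FROM student WHERE name = ?", (name,)).fetchone()[0]
    c_id = conn.execute("SELECT cID FROM course WHERE title = ?", (course,)).fetchone()[0]
    conn.execute("INSERT INTO grade (sID, cID, e1) VALUES (?, ?, ?)", (s_id, c_id, e1))

n_students = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
n_courses = conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]
n_grades = conn.execute("SELECT COUNT(*) FROM grade").fetchone()[0]
print(n_students, n_courses, n_grades)
```

The department now appears once per student instead of once per row, and the grade table contains only keys and grades.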

11
Accessing database information
  • A request for data from a database is called a
    query
  • Queries can be of three forms
  • Choose from a list of parameters
  • Query by example (QBE)
  • Query language

12
Query Languages
  • The standard:
  • SQL (Structured Query Language), originally
    called SEQUEL (Structured English QUEry Language)
  • Developed by IBM in 1974, introduced commercially
    in 1979 by Oracle Corp.
  • Standard interactive and programming language for
    getting information from and updating a database.
  • RDBMS (SQL), ODBMS (Java, C, OQL etc.)
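As a small illustration of "getting information from and updating a database" with SQL, the sketch below runs one SELECT and one UPDATE through Python's built-in sqlite3 module. The single grade table and its column names are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-table layout, just enough to run a query against.
conn.execute("CREATE TABLE grade (student TEXT, course TEXT, e1 TEXT)")
conn.executemany(
    "INSERT INTO grade VALUES (?, ?, ?)",
    [
        ("Student 1", "Biology", "A"),
        ("Student 1", "Maths", "C"),
        ("Student 2", "Biology", "A"),
        ("Student 2", "Maths", "A"),
    ],
)

# Getting information: who scored an A in the first Biology exam?
rows = conn.execute(
    "SELECT student FROM grade WHERE course = ? AND e1 = ? ORDER BY student",
    ("Biology", "A"),
).fetchall()
biology_a = [row[0] for row in rows]

# Updating: correct one grade in place.
conn.execute(
    "UPDATE grade SET e1 = ? WHERE student = ? AND course = ?",
    ("B", "Student 1", "Maths"),
)
updated = conn.execute(
    "SELECT e1 FROM grade WHERE student = ? AND course = ?",
    ("Student 1", "Maths"),
).fetchone()[0]

print(biology_a, updated)
```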

13
Distributed databases
  • From local to global attitude
  • Data appears to be in one location but is most
    definitely not
  • A definition: two or more data files in different
    locations, periodically synchronized by the DBMS
    to keep data in all locations consistent (A,B,C)
  • An intricate network for combining and sharing
    information
  • Administrators praise fast network
    technologies!!!
  • Users praise the internet!!!

14
Data warehouse
  • Periodically, one imports data from databases and
    stores it (locally) in the data warehouse.
  • Now a local database can be created, containing
    for instance protein family data (sequence,
    structure, function and pathway/process data
    integrated with the gene expression and other
    experimental data).
  • Disadvantage: expensive, intensive, needs to be
    updated.
  • Advantage: easy control of integrated data-mining
    pipeline.
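The periodic import step can be pictured as merging records from several sources into one local store. The sketch below is a toy version under strong assumptions: the three source dictionaries, their field names and the shared identifiers P1 and P2 are all hypothetical stand-ins for remote databases.

```python
# Hypothetical remote sources, each keyed by a shared protein identifier.
sequence_db = {"P1": {"sequence": "MCTUYT"}, "P2": {"sequence": "FSTYRC"}}
structure_db = {"P1": {"fold": "example fold"}}
expression_db = {"P2": {"expression_level": 4.2}}


def refresh_warehouse(*sources):
    """Import every source and merge the fields per identifier into one local store."""
    warehouse = {}
    for source in sources:
        for identifier, fields in source.items():
            # Create the local record on first sight, then add this source's fields.
            warehouse.setdefault(identifier, {}).update(fields)
    return warehouse


# Re-running this after the sources change is the "needs to be updated" cost.
warehouse = refresh_warehouse(sequence_db, structure_db, expression_db)
print(warehouse["P1"])
```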

15
So why do biologists care?
16
Three main reasons
  • Database proliferation
  • Dozens to hundreds at the moment
  • More and more scientific discoveries result from
    inter-database analysis and mining
  • Rising complexity of required data-combinations
  • E.g. translational medicine: from bench to
    bedside (genomic data vs. clinical data)

17
Biological databases
  • Like any other database
  • Data organization for optimal analysis
  • Data is of different types
  • Raw data (DNA, RNA, protein sequences)
  • Curated data (DNA, RNA and protein annotated
    sequences and structures, expression data)

18
Raw Biological data: Nucleic Acids (DNA)
19
Raw Biological data: Amino acid residues (proteins)
20
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Mass spectrometry
Expression data
21
Curated Biological Data
Proteins, residue sequences
Mass spectrometry (metabolomics, proteomics)
Extended sequence information
MCTUYTCUYFSTYRCCTYFSCD
Secondary structure
Post-Translational protein Modification (PTM)
Hydrophobicity, motif data
Protein-protein interaction
22
Curated Biological data: 3D Structures, folds
23
Biological Databases
The 2003 NAR Database Issue:
http://nar.oupjournals.org/content/vol31/issue1/
24
Distributed information
  • Pearson's Law: The usefulness of a column of data
    varies as the square of the number of columns it
    is compared to.

25
A few biological databases
  • Nucleotide Databases: Alternative Splicing,
    EMBL-Bank, Ensembl, Genomes Server, Genome, MOT,
    EMBL-Align, Simple Queries, dbSTS Queries,
    Parasites, Mutations, IMGT
  • Genome Databases: Human, Mouse, Yeast,
    C.elegans, FLYBASE, Parasites
  • Protein Databases: Swiss-Prot, TrEMBL,
    InterPro, CluSTr, IPI, GOA, GO, Proteome
    Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT,
    PANDIT
  • Structure Databases: PDB, MSD, FSSP, DALI
  • Microarray Database: ArrayExpress
  • Literature Databases: MEDLINE, Software
    Biocatalog, Flybase Archives
  • Alignment Databases: BAliBASE, Homstrad, FSSP

26
Structural Databases
  • Protein Data Bank (PDB): http://www.rcsb.org/pdb/
  • Structural Classification of Proteins (SCOP):
  • http://scop.berkeley.edu
  • http://scop.mrc-lmb.cam.ac.uk/scop/

27
PDB
  • 3D Macromolecular structural data
  • Data originates from NMR or X-ray crystallography
    techniques
  • Total number of structures: 34,626 (17/01/2006)
  • If the 3D structure of a protein is solved ...
    they have it

28
PDB content
29
PDB information
  • The PDB files have a standard format
  • Key features
  • Informative descriptors

30
PDB-mirror on the WWW
e.g. 1AE5
31
Example output 1AE5
32
SCOP
  • Structural Classification Of Proteins
  • 3D Macromolecular structural data grouped based
    on structural classification
  • Data originates from the PDB
  • Current version (v1.69)
  • 25973 PDB Entries (July 2005).
  • 70859 Domains

33
SCOP levels bottom-up
  • Family: Clear evolutionary relationship. Proteins
    clustered together into families are clearly
    evolutionarily related. Generally, this means
    that pairwise residue identities between the
    proteins are 30% and greater. However, in some
    cases similar functions and structures provide
    definitive evidence of common descent in the
    absence of high sequence identity; for example,
    many globins form a family though some members
    have sequence identities of only 15%.
  • Superfamily: Probable common evolutionary
    origin. Proteins that have low sequence
    identities, but whose structural and functional
    features suggest that a common evolutionary
    origin is probable, are placed together in
    superfamilies. For example, actin, the ATPase
    domain of the heat shock protein, and hexokinase
    together form a superfamily.
  • Fold: Major structural similarity. Proteins are
    defined as having a common fold if they have the
    same major secondary structures in the same
    arrangement and with the same topological
    connections. Different proteins with the same
    fold often have peripheral elements of secondary
    structure and turn regions that differ in size
    and conformation. In some cases, these differing
    peripheral regions may comprise half the
    structure. Proteins placed together in the same
    fold category may not have a common evolutionary
    origin: the structural similarities could arise
    just from the physics and chemistry of proteins
    favouring certain packing arrangements and chain
    topologies.
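The 30% guideline for the family level can be made concrete with a pairwise identity calculation over two already aligned sequences. This is a rough sketch, not SCOP's procedure: SCOP assignments rely on expert judgement, and as the globin example shows, identity alone does not decide. The function names and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the aligned columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def suggests_family_level(identity, threshold=30.0):
    """True when the identity reaches the rule-of-thumb threshold from the slide."""
    return identity >= threshold


# Toy aligned sequences, not real proteins.
close_pair = percent_identity("ACDEFGHIKL", "ACDEFGHIKV")
distant_pair = percent_identity("ACDEFGHIKL", "ACWWWWWWWW")
print(close_pair, suggests_family_level(close_pair))
print(distant_pair, suggests_family_level(distant_pair))
```

A pair that falls below the threshold may still belong to the same family or superfamily when structure and function point to common descent.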

34
SCOP-mirror on the WWW
35
Enter SCOP at the top of the hierarchy
36
Keyword search of SCOP entries
37
CATH
  • Class, derived from secondary structure content,
    is assigned for more than 90% of protein
    structures automatically.
  • Architecture, which describes the gross
    orientation of secondary structures, independent
    of connectivities, is currently assigned
    manually.
  • Topology level clusters structures according to
    their topological connections and numbers of
    secondary structures.
  • The Homologous superfamilies cluster proteins
    with highly similar structures and functions. The
    assignments of structures to topology families
    and homologous superfamilies are made by sequence
    and structure comparisons.
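The four levels nest, so a position in the hierarchy can be written as four numbers. The sketch below assumes the dotted identifiers CATH uses for its nodes (class.architecture.topology.homologous superfamily); the two codes are used purely as example strings, and CathCode, parse_cath_code and deepest_shared_level are hypothetical names.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted four-number code into its C, A, T and H levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two codes still agree, or None."""
    deepest = None
    for level, x, y in zip(CathCode._fields, a, b):
        if x != y:
            break  # levels are nested, so stop at the first difference
        deepest = level
    return deepest


first = parse_cath_code("3.40.50.720")
second = parse_cath_code("3.40.50.300")
print(deepest_shared_level(first, second))
```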

38
CATH-mirror on the WWW
39
DSSP
  • Dictionary of secondary structure of proteins
  • The DSSP database comprises the secondary
    structures of all PDB entries
  • DSSP is actually software that translates the PDB
    structural co-ordinates into secondary
    (standardized) structure elements
  • A similar example is STRIDE

40
WHY bother???
  • Researchers create and use the data
  • Use of known information for analyzing new data
  • New data needs to be screened
  • Structural/Functional information
  • Extends the knowledge and information on a higher
    level than DNA or protein sequences

41
In the end ...
"Computers can figure out all kinds of problems,
except the things in the world that just don't
add up." (James Magary)
We should add: "For that we employ the human brain,
experts and experience."
42
Bio-databases: A short word on problems
  • Even today we face some key limitations
  • There is no standard format
  • Every database or program has its own format
  • There is no standard nomenclature
  • Every database has its own names
  • Data is not fully optimized
  • Some datasets have missing information without
    indications of it
  • Data errors
  • Data is sometimes of poor quality, erroneous,
    misspelled
  • Error propagation resulting from computer
    annotation

43
What to take home
  • Databases are a collection of data
  • Need to access and maintain easily and flexibly
  • Biological information is vast and sometimes very
    redundant
  • Distributed databases bring it all together with
    quality controls, cross-referencing and
    standardization
  • Computers can only create data, they do not give
    answers
  • Review suggestion: Integrating biological
    databases, Stein, Nature 2003