Structure Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Structure Databases

Slides: 44

Provided by: VictorAS

Transcript and Presenter's Notes

1
Structure Databases

DNA/Protein structure-function analysis and
prediction
Lecture 6
Bioinformatics Section, Vrije Universiteit,
Amsterdam

2
The dictionary definition

Main Entry: database
Pronunciation: 'dA-t-"bAs, 'da- also 'dä-
Function: noun
Date: circa 1962
A usually large collection of data organized
especially for rapid search and retrieval (as by
a computer)
- Webster dictionary

3
WHAT is a database?

A collection of data that needs to be:
Structured
Searchable
Updated (periodically)
Cross-referenced
Challenge:
To change meaningless data into useful
information that can be accessed and analysed in
the best way possible.
For example:
HOW would YOU organise all biological sequences
so that the biological information is optimally
accessible?
You need an appropriate database management
system (DBMS)

4
DBMS
Database

Internal organization
Controls speed and flexibility
A set of programs that
Store
Extract
Modify

USER(S)
5
DBMS organisation types

Flat file databases (flat DBMS)
Simple, restrictive, table
Hierarchical databases (hierarchical DBMS)
Simple, restrictive, tables
Relational databases (RDBMS)
Complex, versatile, tables
Object-oriented databases (ODBMS)
Complex, versatile, objects

6
Relational databases

Data is stored in multiple related tables
Data relationships across tables can be either
many-to-one or many-to-many
A few rules allow the database to be viewed in
many ways
Let's convert the course details to a relational
database

7
Our flat file database
FLAT DATABASE 2
Course details
Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   ..
Student 1  Chemistry  Maths    C   C   B   A   A   ..
Student 1  Chemistry  English  A   A   A   A   A   ..
.          .          .        .
Student 2  Ecology    Biology  A   B   A   A   A   ..
Student 2  Ecology    Maths    A   D   A   A   A   ..
.          .          .        .
8
Normalize (1NF)

We remove repeating records (rows)

9
Normalize (2NF)

We remove redundant fields (columns)
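
The two normalization steps can be sketched on the course table from the flat file slide. This is only an illustration under assumptions: the integer IDs, the three-table split and the names `normalize`, `students`, `courses` and `grades` are hypothetical and do not come from the slides, and only the E1 column is carried along to keep the example short.

```python
# Flat rows in the style of the "Our flat file database" slide:
# (name, department, course, E1). Name and department repeat on every row.
flat_rows = [
    ("Student 1", "Chemistry", "Biology", "A"),
    ("Student 1", "Chemistry", "Maths", "C"),
    ("Student 1", "Chemistry", "English", "A"),
    ("Student 2", "Ecology", "Biology", "A"),
    ("Student 2", "Ecology", "Maths", "A"),
]


def normalize(rows):
    """Split flat rows into three related tables linked by integer keys."""
    student_ids = {}  # (name, department) -> student id
    course_ids = {}   # course title -> course id
    students = {}     # student id -> (name, department)
    courses = {}      # course id -> course title
    grades = []       # (student id, course id, E1)
    for name, department, course, e1 in rows:
        # setdefault returns the existing id, or assigns the next free one.
        sid = student_ids.setdefault((name, department), len(student_ids) + 1)
        cid = course_ids.setdefault(course, len(course_ids) + 1)
        students[sid] = (name, department)
        courses[cid] = course
        grades.append((sid, cid, e1))
    return students, courses, grades


students, courses, grades = normalize(flat_rows)
```

Each student and each course is now stored once, and the grades table refers to them only by key.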

10
Relational Databases

What have we achieved?
No repeating information
Less storage space
Better reality representation
Easy modification/management
Easy usage of any combination of records
Remember:
the DBMS has programs to access and edit this
information, so ignore the fact that primary keys
are hard for humans to read

11
Accessing database information

A request for data from a database is called a
query
Queries can take three forms:
Choose from a list of parameters
Query by example (QBE)
Query language

12
Query Languages

The standard:
SQL (Structured Query Language), originally
called SEQUEL (Structured English QUEry Language)
Developed by IBM in 1974; introduced commercially
in 1979 by Oracle Corp.
Standard interactive and programming language for
getting information from and updating a database.
RDBMS (SQL), ODBMS (Java, C, OQL etc.)
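
As a minimal sketch of what an SQL query looks like in practice, the following uses Python's built-in sqlite3 module with an in-memory database. The table layout, the column names and the helper `students_with_grade` are assumptions made for illustration; the slides do not specify a schema.

```python
import sqlite3

# In-memory database with a small, hypothetical course schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, department TEXT);
CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE grade (
    student_id INTEGER REFERENCES student(id),
    course_id INTEGER REFERENCES course(id),
    e1 TEXT
);
INSERT INTO student VALUES (1, 'Student 1', 'Chemistry'), (2, 'Student 2', 'Ecology');
INSERT INTO course VALUES (1, 'Biology'), (2, 'Maths');
INSERT INTO grade VALUES (1, 1, 'A'), (1, 2, 'C'), (2, 1, 'A'), (2, 2, 'A');
""")


def students_with_grade(conn, course_title, grade):
    """Return the names of students who scored `grade` on E1 of a course."""
    rows = conn.execute(
        """
        SELECT s.name
        FROM grade g
        JOIN student s ON s.id = g.student_id
        JOIN course c ON c.id = g.course_id
        WHERE c.title = ? AND g.e1 = ?
        ORDER BY s.name
        """,
        (course_title, grade),
    ).fetchall()
    return [name for (name,) in rows]


maths_a = students_with_grade(conn, "Maths", "A")
biology_a = students_with_grade(conn, "Biology", "A")
```

The JOINs recombine the related tables at query time, which is what lets a relational database be viewed in many ways without storing anything twice.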

13
Distributed databases

From local to global attitude
Data appears to be in one location but is most
definitely not
A definition: two or more data files in different
locations, periodically synchronized by the DBMS
to keep data in all locations consistent (A, B, C)
An intricate network for combining and sharing
information
Administrators praise fast network
technologies!!!
Users praise the internet!!!
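
A toy sketch of periodic synchronization between locations A, B and C. It assumes a "newest timestamp wins" rule and represents each location as a dictionary of key -> (value, timestamp); both choices are hypothetical simplifications, and a real DBMS uses considerably more careful replication protocols.

```python
def synchronize(replicas):
    """Make every replica hold the newest version of every record.

    Each replica maps key -> (value, timestamp). For each key, the
    version with the largest timestamp is kept (an assumed rule).
    """
    merged = {}
    for replica in replicas:
        for key, (value, stamp) in replica.items():
            if key not in merged or stamp > merged[key][1]:
                merged[key] = (value, stamp)
    # Push the merged state back to every location.
    for replica in replicas:
        replica.clear()
        replica.update(merged)
    return merged


site_a = {"P1": ("sequence v1", 1)}
site_b = {"P1": ("sequence v2", 2), "P2": ("structure", 1)}
site_c = {}
synchronize([site_a, site_b, site_c])
```

After the call, all three locations hold identical data, so a user querying any one of them sees the same answer.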

14
Data warehouse

Periodically, one imports data from databases and
stores it (locally) in the data warehouse.
Now a local database can be created, containing
for instance protein family data (sequence,
structure, function and pathway/process data
integrated with the gene expression and other
experimental data).
Disadvantage: expensive, labour-intensive, needs
to be updated.
Advantage: easy control of an integrated
data-mining pipeline.
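
The import step can be pictured as gathering everything known about one protein under a single key. The sketch below assumes each source is already available as a dictionary keyed by a shared accession; the source names, the accessions and `build_warehouse` are all hypothetical.

```python
def build_warehouse(sources):
    """Combine per-source records into one local record per accession.

    `sources` maps a source name to {accession: value}. The result maps
    each accession to {source name: value}.
    """
    warehouse = {}
    for source_name, records in sources.items():
        for accession, value in records.items():
            warehouse.setdefault(accession, {})[source_name] = value
    return warehouse


# Made-up example data, not taken from any real database.
sources = {
    "sequence": {"ACC1": "MKT", "ACC2": "MAL"},
    "structure": {"ACC1": "entry-x"},
    "expression": {"ACC2": 3.5},
}
warehouse = build_warehouse(sources)
```

An accession that is missing from a source simply has no entry for it, which makes gaps in the integrated data explicit.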

15
So why do biologists care?
16
Three main reasons

Database proliferation
Dozens to hundreds at the moment
More and more scientific discoveries result from
inter-database analysis and mining
Rising complexity of required data-combinations
E.g. translational medicine: from bench to
bedside (genomic data vs. clinical data)

17
Biological databases

Like any other database
Data organization for optimal analysis
Data is of different types
Raw data (DNA, RNA, protein sequences)
Curated data (DNA, RNA and protein annotated
sequences and structures, expression data)

18
Raw biological data: Nucleic acids (DNA)
19
Raw biological data: Amino acid residues (proteins)
20
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Mass spectrometry
Expression data
21
Curated Biological Data
Proteins, residue sequences
Mass spectrometry (metabolomics, proteomics)
Extended sequence information
MCTUYTCUYFSTYRCCTYFSCD
Secondary structure
Post-Translational protein Modification (PTM)
Hydrophobicity, motif data
Protein-protein interaction
22
Curated biological data: 3D structures, folds
23
Biological Databases
The 2003 NAR Database Issue:
http://nar.oupjournals.org/content/vol31/issue1/
24
Distributed information

Pearson's Law: The usefulness of a column of data
varies as the square of the number of columns it
is compared to.
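
Read literally, the law is a quadratic scaling rule. In the notation below, U for usefulness and n for the number of columns compared against are assumed symbols, not the lecture's. Going from 2 to 10 comparison columns then makes a column 25 times as useful:

```latex
U(n) \propto n^{2}
\qquad\Longrightarrow\qquad
\frac{U(10)}{U(2)} = \frac{10^{2}}{2^{2}} = 25
```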

25
A few biological databases

Nucleotide Databases: Alternative Splicing,
EMBL-Bank, Ensembl, Genomes Server, Genome, MOT,
EMBL-Align, Simple Queries, dbSTS Queries,
Parasites, Mutations, IMGT
Genome Databases: Human, Mouse, Yeast,
C. elegans, FLYBASE, Parasites
Protein Databases: Swiss-Prot, TrEMBL,
InterPro, CluSTr, IPI, GOA, GO, Proteome
Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT,
PANDIT
Structure Databases: PDB, MSD, FSSP, DALI
Microarray Database: ArrayExpress
Literature Databases: MEDLINE, Software
Biocatalog, Flybase Archives
Alignment Databases: BAliBASE, Homstrad, FSSP

26
Structural Databases

Protein Data Bank (PDB): http://www.rcsb.org/pdb/
Structural Classification of Proteins (SCOP):
http://scop.berkeley.edu
http://scop.mrc-lmb.cam.ac.uk/scop/

27
PDB

3D Macromolecular structural data
Data originates from NMR or X-ray crystallography
techniques
Total number of structures: 34,626 (17/01/2006)
If the 3D structure of a protein is solved ...
they have it

28
PDB content
29
PDB information

The PDB files have a standard format
Key features
Informative descriptors
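
The standard format is column-based: in an ATOM record each field sits at a fixed character position. A minimal sketch of reading one such record follows. The column ranges are those of the published PDB format, while the function name `parse_atom_line`, the dictionary keys and the example line itself are illustrative assumptions (the line is made up and not taken from any particular entry).

```python
def parse_atom_line(line):
    """Extract the main fields of a PDB ATOM/HETATM record by column."""
    line = line.ljust(80)  # pad so that slicing short lines is safe
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16, atom name
        "res_name": line[17:20].strip(), # columns 18-20, residue name
        "chain": line[21],               # column 22, chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26, residue number
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


# Illustrative record laid out according to the fixed columns above.
EXAMPLE_LINE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_line(EXAMPLE_LINE)
```

Splitting on whitespace is a tempting shortcut, but adjacent fields can touch without a separating space, so slicing by column is the safer choice.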

30
PDB-mirror on the WWW
e.g. 1AE5
31
Example output 1AE5
32
SCOP

Structural Classification Of Proteins
3D Macromolecular structural data grouped based
on structural classification
Data originates from the PDB
Current version (v1.69)
25973 PDB Entries (July 2005).
70859 Domains

33
SCOP levels bottom-up

Family: Clear evolutionary relationship. Proteins
clustered together into families are clearly
evolutionarily related. Generally, this means
that pairwise residue identities between the
proteins are 30% and greater. However, in some
cases similar functions and structures provide
definitive evidence of common descent in the
absence of high sequence identity; for example,
many globins form a family though some members
have sequence identities of only 15%.
Superfamily: Probable common evolutionary
origin. Proteins that have low sequence
identities, but whose structural and functional
features suggest that a common evolutionary
origin is probable, are placed together in
superfamilies. For example, actin, the ATPase
domain of the heat shock protein, and hexokinase
together form a superfamily.
Fold: Major structural similarity. Proteins are
defined as having a common fold if they have the
same major secondary structures in the same
arrangement and with the same topological
connections. Different proteins with the same
fold often have peripheral elements of secondary
structure and turn regions that differ in size
and conformation. In some cases, these differing
peripheral regions may comprise half the
structure. Proteins placed together in the same
fold category may not have a common evolutionary
origin: the structural similarities could arise
just from the physics and chemistry of proteins
favouring certain packing arrangements and chain
topologies.
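
SCOP summarises a domain's position in this hierarchy as a dotted string of the form class.fold.superfamily.family (the "sccs" identifier). The sketch below, with the hypothetical helper names `parse_sccs` and `deepest_shared_level`, finds the most specific level two domains share; the identifiers in the usage lines are chosen only to illustrate the format.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs):
    """Split an identifier such as 'a.1.1.2' into its four levels."""
    parts = sccs.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} dot-separated parts: {sccs!r}")
    return dict(zip(LEVELS, parts))


def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific level both domains share, or None."""
    a, b = parse_sccs(sccs_a), parse_sccs(sccs_b)
    shared = None
    for level in LEVELS:
        if a[level] != b[level]:
            break  # once a level differs, deeper levels cannot be shared
        shared = level
    return shared


same_superfamily = deepest_shared_level("a.1.1.2", "a.1.1.3")
same_fold_only = deepest_shared_level("a.1.1.2", "a.1.2.1")
```

Two domains that share only a fold are structurally similar but, as the slide notes, need not be evolutionarily related.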

34
SCOP-mirror on the WWW
35
Enter SCOP at the top of the hierarchy
36
Keyword search of SCOP entries
37
CATH

Class, derived from secondary structure content,
is assigned for more than 90% of protein
structures automatically.
Architecture, which describes the gross
orientation of secondary structures, independent
of connectivities, is currently assigned
manually.
The Topology level clusters structures according
to their topological connections and numbers of
secondary structures.
The Homologous superfamily level clusters
proteins with highly similar structures and
functions. The assignments of structures to
topology families and homologous superfamilies
are made by sequence and structure comparisons.

38
CATH-mirror on the WWW
39
DSSP

Dictionary of secondary structure of proteins
The DSSP database comprises the secondary
structures of all PDB entries
DSSP is actually software that translates the PDB
structural co-ordinates into secondary
(standardized) structure elements
A similar example is STRIDE
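
DSSP assigns one of eight states per residue (H, G, I, E, B, T, S, or blank for none). Many analyses collapse these to three states: helix, strand, coil. The mapping below is one common convention, not the only one, and the function names and the example string are made up for illustration.

```python
# One common 8-to-3 reduction (an assumed convention; others exist):
# H, G, I -> helix (H); E, B -> strand (E); everything else -> coil (C).
REDUCTION = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp_string):
    """Map a per-residue DSSP string onto helix/strand/coil."""
    return "".join(REDUCTION.get(code, "C") for code in dssp_string)


def state_fractions(three_state):
    """Fraction of residues in each of the three states."""
    total = len(three_state)
    return {state: three_state.count(state) / total for state in "HEC"}


reduced = reduce_to_three_states("HHHGGTT EEEBS")
fractions = state_fractions("HHEC")
```

Summaries of this kind give the secondary structure content that classification schemes such as the CATH Class level are derived from.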

40
WHY bother???

Researchers create and use the data
Use of known information for analyzing new data
New data needs to be screened
Structural/Functional information
Extends the knowledge and information on a higher
level than DNA or protein sequences

41
In the end ...
"Computers can figure out all kinds of problems,
except the things in the world that just don't
add up." (James Magary)
We should add: "For that we employ the human
brain, experts and experience."
42
Bio-databases: A short word on problems

Even today we face some key limitations
There is no standard format
Every database or program has its own format
There is no standard nomenclature
Every database has its own names
Data is not fully optimized
Some datasets have missing information without
indications of it
Data errors
Data is sometimes of poor quality, erroneous,
misspelled
Error propagation resulting from computer
annotation

43
What to take home

Databases are a collection of data
Need to access and maintain easily and flexibly
Biological information is vast and sometimes very
redundant
Distributed databases bring it all together with
quality controls, cross-referencing and
standardization
Computers can only create data, they do not give
answers
Review suggestion: Integrating biological
databases, Stein, Nature Reviews Genetics 2003