Title: Structure Databases
1Structure Databases
- DNA/Protein structure-function analysis and prediction
- Lecture 6
- Bioinformatics Section, Vrije Universiteit, Amsterdam
2The dictionary definition
- Main Entry: database
- Pronunciation: 'dA-t-"bAs, 'da- also 'dä-
- Function: noun
- Date: circa 1962
- a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
- Webster dictionary
3WHAT is a database?
- A collection of data that needs to be
- Structured
- Searchable
- Updated (periodically)
- Cross referenced
- Challenge
- To change meaningless data into useful information that can be accessed and analysed the best way possible.
- For example
- HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
- You need an appropriate database management system (DBMS)
4DBMS
Database
- Internal organization
- Controls speed and flexibility
- A unity of programs that
- Store
- Extract
- Modify
[Figure: DBMS diagram with the labels Store, Extract, Modify and USER(S)]
5DBMS organisation types
- Flat file databases (flat DBMS)
- Simple, restrictive, table
- Hierarchical databases (hierarchical DBMS)
- Simple, restrictive, tables
- Relational databases (RDBMS)
- Complex, versatile, tables
- Object-oriented databases (ODBMS)
- Complex, versatile, objects
6Relational databases
- Data is stored in multiple related tables
- Data relationships across tables can be either many-to-one or many-to-many
- A few rules allow the database to be viewed in many ways
- Let's convert the course details to a relational database
7Our flat file database
FLAT DATABASE 2
Course details
Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   ..
Student 1  Chemistry  Maths    C   C   B   A   A   ..
Student 1  Chemistry  English  A   A   A   A   A   ..
.          .          .        .
Student 2  Ecology    Biology  A   B   A   A   A   ..
Student 2  Ecology    Maths    A   D   A   A   A   ..
.          .          .        .
8Normalize (1NF)
- We remove repeating records (rows)
9Normalize (2NF)
- We remove redundant fields (columns)
wID Project
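The conversion from the flat course table to related tables can be sketched in Python with the standard sqlite3 module. This is only one possible layout, not the one shown on the original slides: the table names (student, course, result) and the column names are assumptions made for illustration. The grades are the ones listed in the flat file above.

```python
import sqlite3

# Flat rows as in the course details table: name, department, course, grades E1..P2.
FLAT_ROWS = [
    ("Student 1", "Chemistry", "Biology", "A", "B", "B", "A", "C"),
    ("Student 1", "Chemistry", "Maths",   "C", "C", "B", "A", "A"),
    ("Student 1", "Chemistry", "English", "A", "A", "A", "A", "A"),
    ("Student 2", "Ecology",   "Biology", "A", "B", "A", "A", "A"),
    ("Student 2", "Ecology",   "Maths",   "A", "D", "A", "A", "A"),
]
ASSESSMENTS = ("E1", "E2", "E3", "P1", "P2")


def build_relational(flat_rows):
    """Split the flat rows into three related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (
            student_id INTEGER PRIMARY KEY,
            name       TEXT UNIQUE,
            department TEXT
        );
        CREATE TABLE course (
            course_id INTEGER PRIMARY KEY,
            name      TEXT UNIQUE
        );
        CREATE TABLE result (
            student_id INTEGER REFERENCES student(student_id),
            course_id  INTEGER REFERENCES course(course_id),
            assessment TEXT,
            grade      TEXT,
            PRIMARY KEY (student_id, course_id, assessment)
        );
    """)
    for name, dept, course, *grades in flat_rows:
        # Each student and each course is stored once; repeats are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO student (name, department) VALUES (?, ?)",
            (name, dept),
        )
        conn.execute("INSERT OR IGNORE INTO course (name) VALUES (?)", (course,))
        student_id = conn.execute(
            "SELECT student_id FROM student WHERE name = ?", (name,)
        ).fetchone()[0]
        course_id = conn.execute(
            "SELECT course_id FROM course WHERE name = ?", (course,)
        ).fetchone()[0]
        # One row per grade instead of one column per assessment.
        conn.executemany(
            "INSERT INTO result VALUES (?, ?, ?, ?)",
            [(student_id, course_id, a, g) for a, g in zip(ASSESSMENTS, grades)],
        )
    return conn


if __name__ == "__main__":
    db = build_relational(FLAT_ROWS)
    for table in ("student", "course", "result"):
        print(table, db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0])
```

The department is now stored once per student rather than once per course row, and the integer keys link the tables together.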
10Relational Databases
- What have we achieved?
- No repeating information
- Less storage space
- Better reality representation
- Easy modification/management
- Easy usage of any combination of records
- Remember
- The DBMS has programs to access and edit this information, so ignore the fact that the primary keys are hard for humans to read
11Accessing database information
- A request for data from a database is called a query
- Queries can be of three forms
- Choose from a list of parameters
- Query by example (QBE)
- Query language
12Query Languages
- The standard
- SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
- Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
- Standard interactive and programming language for getting information from and updating a database.
- RDBMS (SQL), ODBMS (Java, C, OQL etc.)
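As a small illustration of "getting information from and updating a database", the sketch below runs one SELECT and one UPDATE through Python's standard sqlite3 module. The single grades table and the function names are assumptions chosen to keep the example short; the rows are the E1 grades from the course example.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory table with the E1 grades (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE grades (student TEXT, course TEXT, exam TEXT, grade TEXT)"
    )
    conn.executemany(
        "INSERT INTO grades VALUES (?, ?, ?, ?)",
        [
            ("Student 1", "Biology", "E1", "A"),
            ("Student 1", "Maths",   "E1", "C"),
            ("Student 1", "English", "E1", "A"),
            ("Student 2", "Biology", "E1", "A"),
            ("Student 2", "Maths",   "E1", "A"),
        ],
    )
    return conn


def students_with_grade(conn, course, exam, grade):
    """Getting information: a SELECT with placeholders for the search values."""
    rows = conn.execute(
        "SELECT student FROM grades "
        "WHERE course = ? AND exam = ? AND grade = ? "
        "ORDER BY student",
        (course, exam, grade),
    )
    return [row[0] for row in rows]


def change_grade(conn, student, course, exam, new_grade):
    """Updating: an UPDATE statement; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE grades SET grade = ? "
        "WHERE student = ? AND course = ? AND exam = ?",
        (new_grade, student, course, exam),
    )
    return cur.rowcount


if __name__ == "__main__":
    db = make_demo_db()
    print(students_with_grade(db, "Biology", "E1", "A"))
    change_grade(db, "Student 1", "Maths", "E1", "B")
    print(students_with_grade(db, "Maths", "E1", "B"))
```

The same statements could be typed interactively into any SQL client; wrapping them in functions only shows the "programming language" use mentioned above.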
13Distributed databases
- From local to global attitude
- Data appears to be in one location but is most definitely not
- A definition: two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A, B, C)
- An intricate network for combining and sharing information
- Administrators praise fast network technologies!!!
- Users praise the internet!!!
14Data warehouse
- Periodically, one imports data from databases and stores it (locally) in the data warehouse.
- Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).
- Disadvantage: expensive, intensive, needs to be updated.
- Advantage: easy control of integrated data-mining pipeline.
15So why do biologists care?
16Three main reasons
- Database proliferation
- Dozens to hundreds at the moment
- More and more scientific discoveries result from inter-database analysis and mining
- Rising complexity of required data-combinations
- E.g. translational medicine: from bench to bedside (genomic data vs. clinical data)
17Biological databases
- Like any other database
- Data organization for optimal analysis
- Data is of different types
- Raw data (DNA, RNA, protein sequences)
- Curated data (DNA, RNA and protein annotated
sequences and structures, expression data)
18Raw Biological data: Nucleic Acids (DNA)
19Raw Biological data: Amino acid residues (proteins)
20Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Mass spectrometry
Expression data
21Curated Biological Data
Proteins, residue sequences
Mass spectrometry (metabolomics, proteomics)
Extended sequence information
MCTUYTCUYFSTYRCCTYFSCD
Secondary structure
Post-Translational protein Modification (PTM)
Hydrophobicity, motif data
Protein-protein interaction
22Curated Biological data: 3D Structures, folds
23Biological Databases
The 2003 NAR Database Issue: http://nar.oupjournals.org/content/vol31/issue1/
24Distributed information
- Pearson's Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
25A few biological databases
- Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
- Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
- Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
- Structure Databases: PDB, MSD, FSSP, DALI
- Microarray Database: ArrayExpress
- Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
- Alignment Databases: BAliBASE, Homstrad, FSSP
26Structural Databases
- Protein Data Bank (PDB): http://www.rcsb.org/pdb/
- Structural Classification of Proteins (SCOP)
- http://scop.berkeley.edu
- http://scop.mrc-lmb.cam.ac.uk/scop/
27PDB
- 3D Macromolecular structural data
- Data originates from NMR or X-ray crystallography techniques
- Total no. of structures: 34,626 (17/01/2006)
- If the 3D structure of a protein is solved ... they have it
28PDB content
29PDB information
- The PDB files have a standard format
- Key features
- Informative descriptors
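Because PDB files use fixed column positions, coordinate records can be read with simple string slicing. The sketch below is a minimal reader for ATOM/HETATM lines only, assuming the standard column layout of the PDB format; the function and class names are made up for this example, and the two sample atoms use invented coordinates, not data from a real entry.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


# Made-up sample text laid out in the fixed-column PDB style.
SAMPLE_PDB = """\
REMARK   Made-up coordinates for illustration only
ATOM      1  N   GLY A   1      11.000  22.000  33.000  1.00 20.00           N
ATOM      2  CA  GLY A   1      12.500  22.500  33.500  1.00 20.00           C
END
"""


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from ATOM/HETATM records using fixed column slices."""
    atoms = []
    for line in pdb_text.splitlines():
        # Columns 1-6 hold the record name; other record types are skipped.
        if line[:6].strip() not in ("ATOM", "HETATM"):
            continue
        atoms.append(
            Atom(
                serial=int(line[6:11]),        # columns 7-11
                name=line[12:16].strip(),      # columns 13-16
                res_name=line[17:20].strip(),  # columns 18-20
                chain=line[21],                # column 22
                res_seq=int(line[22:26]),      # columns 23-26
                x=float(line[30:38]),          # columns 31-38
                y=float(line[38:46]),          # columns 39-46
                z=float(line[46:54]),          # columns 47-54
            )
        )
    return atoms


if __name__ == "__main__":
    for atom in parse_atoms(SAMPLE_PDB):
        print(atom)
```

For real work an established parser is the safer choice, since actual entries contain many more record types and special cases than this sketch handles.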
30PDB-mirror on the WWW
e.g. 1AE5
31Example output 1AE5
32SCOP
- Structural Classification Of Proteins
- 3D Macromolecular structural data grouped based on structural classification
- Data originates from the PDB
- Current version (v1.69)
- 25973 PDB Entries (July 2005).
- 70859 Domains
33SCOP levels bottom-up
- Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
- Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
- Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
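The pairwise residue identity used in the Family definition can be computed from two aligned sequences. The sketch below is one simple convention, assumed here for illustration: the sequences are already aligned, gaps are written as "-", and identity is counted over columns where neither sequence has a gap. The function names are hypothetical, and the threshold of 30 comes from the definition above. As the globin example shows, identity alone does not decide family membership.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over the non-gap aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def meets_family_identity(aligned_a: str, aligned_b: str,
                          threshold: float = 30.0) -> bool:
    """True if the identity reaches the rule-of-thumb threshold.

    This is only the sequence criterion; structural and functional
    evidence is not considered here.
    """
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Hypothetical aligned fragments, not taken from real proteins.
    seq_a = "ACDE-GHIK"
    seq_b = "ACGEFGHLK"
    print(percent_identity(seq_a, seq_b))
    print(meets_family_identity(seq_a, seq_b))
```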
34SCOP-mirror on the WWW
35Enter SCOP at the top of the hierarchy
36Keyword search of SCOP entries
37CATH
- Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.
- Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
- Topology level clusters structures according to their topological connections and numbers of secondary structures.
- The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
38CATH-mirror on the WWW
39DSSP
- Dictionary of secondary structure of proteins
- The DSSP database comprises the secondary structures of all PDB entries
- DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements
- A similar example is STRIDE
40WHY bother???
- Researchers create and use the data
- Use of known information for analyzing new data
- New data needs to be screened
- Structural/Functional information
- Extends the knowledge and information on a higher
level than DNA or protein sequences
41In the end .
"Computers can figure out all kinds of problems, except the things in the world that just don't add up." (James Magary)
We should add: "For that we employ the human brain, experts and experience."
42Bio-databases A short word on problems
- Even today we face some key limitations
- There is no standard format
- Every database or program has its own format
- There is no standard nomenclature
- Every database has its own names
- Data is not fully optimized
- Some datasets have missing information without indications of it
- Data errors
- Data is sometimes of poor quality, erroneous, misspelled
- Error propagation resulting from computer annotation
43What to take home
- Databases are a collection of data
- Need to access and maintain easily and flexibly
- Biological information is vast and sometimes very redundant
- Distributed databases bring it all together with quality controls, cross-referencing and standardization
- Computers can only create data, they do not give answers
- Review suggestion: Integrating biological databases, Stein, Nature Reviews Genetics 2003