Title: Structuring molecular biology databanks into a databank network
1 Structuring molecular biology databanks into a databank network
Cambridge, 14 October 2001
- Thure Etzold, MD of Lion Bioscience Ltd, Cambridge
2 Our situation
- Lots of databanks
- Databanks appear (many) and disappear (far fewer)
- Some databanks are very different (heterogeneity)
- Some databanks are very similar, but not quite the same
- Many databanks have links to other databanks
- What about analysis tools?
3 Different Degrees of Databank Interoperation
X03456
4 The Data Jungle
High-throughput screening
Pharm. / Tox.
Physiology
Molecular Biology
Genetics
Too much data - isolated research - suboptimal communication
5 Our mantra: the library network
6 Querying several databanks simultaneously
- Only fields that the databanks have in common are shown in the query form.
- Only views that the databanks have in common are displayed.
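The "fields in common" rule can be sketched as a set intersection. This is a minimal illustration only: the databank names come from the slides, but the field names and the `common_fields` helper are assumptions, not the actual SRS field definitions.

```python
# Hypothetical field sets per databank (illustrative, not the real SRS schema).
DATABANK_FIELDS = {
    "EMBL": {"ID", "Description", "Organism", "Keywords", "Sequence"},
    "SWISSPROT": {"ID", "Description", "Organism", "Keywords", "GeneName"},
    "PDB": {"ID", "Description", "Organism", "Resolution"},
}


def common_fields(databanks):
    """Return the fields shared by every selected databank, sorted by name."""
    field_sets = [DATABANK_FIELDS[name] for name in databanks]
    return sorted(set.intersection(*field_sets))


# Only these fields would appear in a query form spanning all three databanks.
query_form = common_fields(["EMBL", "SWISSPROT", "PDB"])
print(query_form)
```

Selecting fewer databanks widens the form: with only EMBL and SWISSPROT selected, the hypothetical `Keywords` field becomes available as well.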
7 Our mantra: the library network
8 Linking Databases with SRS
- Explicit cross-references: unique ID
- Implicit links by:
- organism name (Taxonomy)
- gene name (?)
- small compound names (structure similarity search)
- gene function (GO)
- Literature citation (?)
- Sequence similarity (protein family databanks, sequence similarity search)
9 The Problem with Nomenclature
The same compound is referred to in BRENDA as the following:
- Pyruvic acid: common or trivial name
- Pyruvate: common name for the anionic species
- 2-Oxopropanoic acid: IUPAC name
- 2-Oxopropionic acid: systematic name
- 2-Oxopropionate: systematic name for the anionic species
- alpha-Keto-propanoic acid: systematic name
- CH3COCOOH: line diagram notation
- CH3COCOO-: line diagram notation for anionic species
Other names (or descriptors) for this parent structure include Acetylformic acid, BTS, Pyroracemic acid, alpha-Oxo-propanoic acid and 2-Keto-propionate; there are plenty more possibilities.
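One way to make such names linkable is a synonym table that maps every spelling to a single canonical key. The sketch below uses the names listed on this slide; the canonical key "pyruvate", the `normalise` rule and the `canonical` helper are assumptions made for illustration, not how BRENDA or SRS resolve names.

```python
# Synonyms taken from the slide; the canonical key is an arbitrary choice.
SYNONYMS = {
    "pyruvate": [
        "Pyruvic acid", "Pyruvate", "2-Oxopropanoic acid", "2-Oxopropionic acid",
        "2-Oxopropionate", "alpha-Keto-propanoic acid", "CH3COCOOH", "CH3COCOO-",
        "Acetylformic acid", "BTS", "Pyroracemic acid",
        "alpha-Oxo-propanoic acid", "2-Keto-propionate",
    ],
}


def normalise(name):
    """Lower-case and drop spaces so trivially different spellings collide."""
    return name.lower().replace(" ", "")


# Reverse index: normalised synonym -> canonical key.
LOOKUP = {normalise(s): key for key, names in SYNONYMS.items() for s in names}


def canonical(name):
    """Return the canonical key for a compound name, or None if it is unknown."""
    return LOOKUP.get(normalise(name))


print(canonical("pyruvic ACID"))
print(canonical("CH3COCOO-"))
print(canonical("glucose"))
```

A table like this only covers names someone has already recorded, which is why the slide lists structure similarity search as the route for linking small compounds.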
10 Introducing Links
[Figure: example entries in EMBL, SWISSPROT and PDB with ID lines (ID ABC, ID ABC, ID XYZ) and DR cross-reference lines (DR EMABC, DR SW123, DR PDBXYZ)]
- Entries may contain references to other databases, e.g. an EMBL entry may contain a cross-reference to the SWISSPROT entry for the protein it encodes
- Cross-references may be bidirectional or unidirectional
- Cross-references can be indirect
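The three bullets above can be sketched with a toy parser. The record layout (`ID <id>` and `DR <databank>:<id>` lines) is a simplified, hypothetical format chosen for illustration and is not the real EMBL or SWISS-PROT flat-file syntax; `parse_links` and `reachable` are likewise assumed helper names.

```python
from collections import defaultdict, deque

# Hypothetical, simplified flat-file records: (databank, entry text).
RECORDS = [
    ("EMBL", "ID ABC\nDR SWISSPROT:P123\n"),
    ("SWISSPROT", "ID P123\nDR EMBL:ABC\nDR PDB:XYZ\n"),
    ("PDB", "ID XYZ\n"),
]


def parse_links(records):
    """Map each (databank, id) to the entries its DR lines point to."""
    links = defaultdict(list)
    for databank, text in records:
        entry_id = None
        for line in text.splitlines():
            tag, _, value = line.partition(" ")
            if tag == "ID":
                entry_id = value.strip()
                links[(databank, entry_id)]  # register entries without DR lines
            elif tag == "DR" and entry_id is not None:
                target_db, _, target_id = value.strip().partition(":")
                links[(databank, entry_id)].append((target_db, target_id))
    return links


def reachable(links, start):
    """Breadth-first walk over cross-references, including indirect ones."""
    seen, queue = {start}, deque([start])
    while queue:
        for target in links.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}


links = parse_links(RECORDS)
print(reachable(links, ("EMBL", "ABC")))
```

In this toy data the EMBL and SWISSPROT entries reference each other (bidirectional), SWISSPROT points to PDB without a back-reference (unidirectional), and the EMBL entry reaches PDB only through SWISSPROT (indirect).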
11 Different Degrees of Databank Interoperation
X03456
- hypertext links
- indexed links
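12 The Link Operators
[Figure: indexing turns cross-references inside entries (e.g. ID A1 with DR B3) into links between the entries of databank A (A1 to A5) and databank B (B1 to B4). Two link queries are illustrated: "A > B" (also written "B < A") and "A < B" (also written "B > A").]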
13 Queries Using Links
- EMBL > Swissprot: proteins encoded by genes
- EMBL < Swissprot: genes coding for proteins
- Swissprot < EPD: all eukaryotic proteins for which the promoter is further characterised
- Swissprot > Prosite > Swissprot: a single protein is expanded by all members of its family (find all similar sequences)
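The behaviour of the two link operators in the queries above can be imitated with a small in-memory link index. This is a sketch of the idea only, not the SRS implementation: the `LinkIndex` class, its `gt` and `lt` methods and all entry IDs are hypothetical.

```python
class LinkIndex:
    """Toy link index between databanks, built from cross-references."""

    def __init__(self):
        # (source databank, target databank) -> {(source id, target id)}
        self.pairs = {}

    def add(self, db_a, id_a, db_b, id_b):
        # Store the link in both directions so it can be followed either way.
        self.pairs.setdefault((db_a, db_b), set()).add((id_a, id_b))
        self.pairs.setdefault((db_b, db_a), set()).add((id_b, id_a))

    def gt(self, db_a, ids_a, db_b):
        """A > B: the entries of B that are linked to the given entries of A."""
        return {b for a, b in self.pairs.get((db_a, db_b), set()) if a in ids_a}

    def lt(self, db_a, ids_a, db_b, ids_b):
        """A < B: the given entries of A that are linked to the given entries of B."""
        return {a for a, b in self.pairs.get((db_a, db_b), set())
                if a in ids_a and b in ids_b}


index = LinkIndex()
index.add("EMBL", "E1", "Swissprot", "P1")
index.add("EMBL", "E2", "Swissprot", "P3")
index.add("Swissprot", "P1", "Prosite", "PS1")
index.add("Swissprot", "P2", "Prosite", "PS1")
index.add("Swissprot", "P3", "Prosite", "PS2")

# Swissprot > Prosite > Swissprot: expand one protein to its whole family.
families = index.gt("Swissprot", {"P1"}, "Prosite")
family_members = index.gt("Prosite", families, "Swissprot")

# EMBL < Swissprot: genes coding for the selected proteins.
genes = index.lt("EMBL", {"E1", "E2"}, "Swissprot", {"P1"})
print(family_members, genes)
```

Because each operator returns a plain set of entries, operators can be chained, which is what makes a query such as "Swissprot > Prosite > Swissprot" possible.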
14 Integration Supports Enquiry
TIGR
HSSP
PDB
SwissProt
PATHWAY
ENZYME
All H. pylori genes, encoding membrane bound proteins, involved in glucose metabolism, and with a homologue of known 3D structure with resolution better than 2 Å
15 Solving a murder case
- Ideal: we find the knife with blood stains of the victim and the fingerprints of the murderer
- More probable: circumstantial evidence
- Murder weapon, but no fingerprints.
- A cracked watch belonging to the victim, probably with the precise time of the murder.
- A suspect denies knowing the victim, but has the address of the victim in his address book.
- Alibi of suspect not watertight (no witnesses).
- Witness saw a person of the same height as the suspect disappear from the victim's house shortly after the estimated time of the murder, but it was too dark to see more.
16 Scientific discovery
- Ideal: information for a new drug target contained in a single SWISS-PROT entry.
- If it is that easy, someone else has found it already.
- Real-world concepts are often distributed over many databanks, e.g., transcription regulation, gene, protein family.
17 Different Degrees of Databank Interoperation
X03456
- hypertext links
- indexed links
- composite structures
new datastructure
18 The Object Loader
19 Adding a genome to genomeSCOUT
- Input is the genome sequence and a list of putative genes
- Computation of protein function prediction (bioSCOUT)
- Computation of orthologs for each genome pairing
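The slides do not say how orthologs are computed. As an illustration of what "for each genome pairing" implies, the sketch below enumerates all pairings and applies reciprocal best hits, a common heuristic that is swapped in here as an assumption and is not confirmed as the genomeSCOUT method. Genome names, gene names and scores are all hypothetical.

```python
from itertools import combinations


def best_hits(scores, reverse=False):
    """For each gene on one side, return its highest-scoring partner on the other."""
    best = {}
    for (a, b), score in scores.items():
        query, hit = (b, a) if reverse else (a, b)
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def reciprocal_best_hits(scores):
    """Keep only gene pairs that are each other's best hit."""
    forward = best_hits(scores)
    backward = best_hits(scores, reverse=True)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Every new genome must be compared with every genome already present.
genomes = ["genome_A", "genome_B", "genome_C"]
pairings = list(combinations(genomes, 2))

# Hypothetical similarity scores for one pairing: (gene in A, gene in B) -> score.
scores = {("a1", "b1"): 90, ("a1", "b2"): 40, ("a2", "b2"): 35, ("a3", "b1"): 50}
orthologs = reciprocal_best_hits(scores)
print(pairings, orthologs)
```

The number of pairings grows quadratically with the number of genomes, which is one reason the pipeline needs to be automated.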
20 bioSCOUT: automating the pipeline
21 Searching genomeSCOUT databanks
22 Identification of Orthologs
23 Comparative Pathway View
Straightforward analysis of pathways and enzymes in any selection of organisms
24 Applications in SRS
How many members of the TM4 family did I find?
Did I find any enzymes in the phenylalanine pathway?
Remove all viral sequences from my hit list
25 Comparing Results of Different Search Methods
- Run a BLAST search
- Run a FASTA search with the same sequence
- Use links to:
- Create a view that shows both result types
- Ask questions like "Give me all hits found by BLAST and FASTA"
[Figure: BLAST and FASTA results linked to SWISS-PROT; a combined table of hits 1 to 3 with FASTA and BLAST columns]
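Once both result sets link to the same SWISS-PROT entries, a question such as "all hits found by BLAST and FASTA" reduces to set operations on entry IDs. The sketch below assumes hypothetical hit IDs and a hypothetical `combined_view` helper; it does not run or parse real BLAST or FASTA output.

```python
# Hypothetical SWISS-PROT entry IDs returned by each search method.
blast_hits = {"P1", "P2", "P3"}
fasta_hits = {"P2", "P3", "P4"}


def combined_view(blast, fasta):
    """One row per entry: (entry id, found by BLAST, found by FASTA)."""
    return [(entry, entry in blast, entry in fasta)
            for entry in sorted(blast | fasta)]


found_by_both = sorted(blast_hits & fasta_hits)
only_blast = sorted(blast_hits - fasta_hits)

for row in combined_view(blast_hits, fasta_hits):
    print(row)
print("both:", found_by_both)
```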
26 The BLAST results viewer
27 The protein story
28 Selective Display of Protein/Protein Interactions
29 Inspecting the evidence
30 Scalability is important
[Figure: frustration level plotted against the number of databanks (axis ticks 1, 10, 15, 50, 100, 200) for System A, System B and SRS, with the mloft level marked]
mloft: Maximum Level Of Frustration Tolerance
31 The system must be kept up to date
- Obtain and install the latest versions of all involved databanks
- Regenerate all derived databanks (e.g. FASTA files, nonredundant sequence databanks)
- Update links between databanks
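The three update steps depend on one another: derived databanks can only be rebuilt after their sources are installed, and links can only be refreshed once both ends exist. A minimal sketch of ordering such tasks with a topological sort follows; the task names and dependency graph are hypothetical and this is not a description of how SRS Prisma schedules work. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical update tasks: each task maps to the tasks that must finish first.
DEPENDENCIES = {
    "install EMBL": set(),
    "install SWISSPROT": set(),
    "build nonredundant databank": {"install EMBL", "install SWISSPROT"},
    "build FASTA files": {"build nonredundant databank"},
    "update links": {"install EMBL", "install SWISSPROT",
                     "build nonredundant databank"},
}


def update_order(dependencies):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = update_order(DEPENDENCIES)
for step, task in enumerate(order, start=1):
    print(step, task)
```

Several valid orders usually exist; the sort guarantees only that no task runs before the tasks it depends on.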
32 Complete Automation with SRS Prisma
33 LION's Company Profile
- Establishment: March 1997, Heidelberg, Germany
- IPO: August 2000
- Employees: > 420
- Locations: Heidelberg, Germany; Cambridge, UK; Cambridge MA, USA
- San Diego (formerly Trega)
- Represented by CTC in Japan
- Revenues: 1. Software product licenses 2. Comprehensive Bio-IT solutions 3. Drug discovery partnerships