Title: From Biological Data to Biological Knowledge
1 From Biological Data to Biological Knowledge
- Volker Stümpflen
- Group for Biological Information Systems
- MIPS / Institute for Bioinformatics
- GSF National Research Center for Environment and Health
2 Something About Our Problem
- For a long time we focused on individual genes / proteins
- But humans, for example, do not have many more genes than simple organisms
- This is because complexity occurs at the level of biological networks
- We cannot understand anything without understanding the context
3 Small Scale Knowledge Generation
- Accessing some of the several hundred (web) resources (publicly available data > 2 petabytes)
=> Compilation of required knowledge by hand
4 Large Scale Assessment of Information and Knowledge
- R. Shamir et al., Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, PNAS, Vol. 101, No. 9, 2004, pp. 2981-2986
- To gain deeper understanding of the biological systems, it is pertinent to analyze heterogeneous data sources in a truly integrated fashion and shape the analysis results into one body of knowledge.
- By integrating experimental data of heterogeneous sources and types, we are able to perform analysis on a much broader scope than previous studies.
5 Technical Problems
- Information integration from heterogeneous and distributed data sources (databases AND applications)
- Solvable with n-tier architectures
- E.g. GenRE at MIPS
- J2EE based middleware
- Enterprise Java Beans (EJBs) and Web Services (WS)
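The n-tier idea on this slide can be illustrated with a minimal sketch. It is written in Python for brevity rather than in J2EE, and it is not the GenRE implementation: the class names `DatabaseSource`, `ApplicationSource`, and `MiddleTier`, the `lookup` method, and the sample data are all assumptions made for illustration. The point is only that clients talk to a middle tier, which hides whether a value comes from a database or from an application.

```python
from typing import Protocol


class Source(Protocol):
    """Common interface the middle tier expects from every back end."""

    def lookup(self, protein_id: str) -> dict: ...


class DatabaseSource:
    """Stands in for a database (here just a dict of rows)."""

    def __init__(self, rows: dict):
        self._rows = rows

    def lookup(self, protein_id: str) -> dict:
        return dict(self._rows.get(protein_id, {}))


class ApplicationSource:
    """Stands in for an application that computes a value on demand."""

    def __init__(self, sequences: dict):
        self._sequences = sequences

    def lookup(self, protein_id: str) -> dict:
        seq = self._sequences.get(protein_id)
        return {} if seq is None else {"length": len(seq)}


class MiddleTier:
    """Middle tier: clients call this, never the back ends directly."""

    def __init__(self, sources: list):
        self._sources = sources

    def describe(self, protein_id: str) -> dict:
        # Ask every registered source and merge the partial answers.
        merged = {"id": protein_id}
        for source in self._sources:
            merged.update(source.lookup(protein_id))
        return merged


# Hypothetical sample data, not taken from any real resource.
tier = MiddleTier([
    DatabaseSource({"P1": {"description": "hypothetical kinase"}}),
    ApplicationSource({"P1": "MKTAYIAKQR"}),
])
result = tier.describe("P1")
```

Adding a further data source then means adding one more class with a `lookup` method; the client-facing `describe` call stays unchanged.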
6 Semantic Problems
- Sloppy definitions, e.g. "Gene has Function"
- Homonym / synonym problems, e.g. gene identifiers
- Ambiguity of terms
- Differences in the meaning of terms between different biological communities
- Results of in vitro experiments often differ with the experimental scope (e.g. protein interactions)
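The homonym / synonym problem for gene identifiers can be made concrete with a small sketch. Everything here is hypothetical: the namespaces `db_a` and `db_b`, the names, and the `GENE:...` identifiers are invented for illustration. The sketch assumes that a curated synonym table exists; building such a table is the hard part and is not shown.

```python
from collections import defaultdict

# Hypothetical synonym table: (namespace, name, canonical id).
SYNONYMS = [
    ("db_a", "abc1", "GENE:0001"),
    ("db_b", "ABC-1", "GENE:0001"),  # synonym: same gene, other spelling
    ("db_a", "xyz", "GENE:0002"),
    ("db_b", "xyz", "GENE:0003"),    # homonym: same name, different gene
]


def normalise(name):
    """Fold case and drop hyphens so spelling variants compare equal."""
    return name.lower().replace("-", "")


def build_index(rows):
    """Map each normalised name to {namespace: canonical id}."""
    index = defaultdict(dict)
    for namespace, name, canonical in rows:
        index[normalise(name)][namespace] = canonical
    return index


def resolve(index, name, namespace=None):
    """Return the canonical id; raise if the name is unknown or ambiguous."""
    by_namespace = index.get(normalise(name), {})
    if namespace is not None:
        candidates = {by_namespace[namespace]} if namespace in by_namespace else set()
    else:
        candidates = set(by_namespace.values())
    if len(candidates) != 1:
        raise LookupError(f"{name!r} maps to {sorted(candidates)}")
    return candidates.pop()


index = build_index(SYNONYMS)
```

Synonyms resolve without further context, while a homonym such as `xyz` is only resolvable once the caller states which community's namespace is meant.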
7 Strategies
- Complete semantic annotation of (all) resources
- Funding?
- Data models?
- Modeling of individual domains
- Suited for biologists (Topic Maps)
- Access to relevant data sources
- Merging of individual domains to obtain the complete picture
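The merging step can be sketched as follows, assuming that each domain map identifies its topics by a shared subject identifier, so that topics with the same identifier collapse into one. The dictionary layout, the `subj:` identifiers, and the sample values are assumptions for illustration, not a real topic map serialization.

```python
def merge_topic_maps(*maps):
    """Merge topic maps: topics sharing a subject identifier become one topic."""
    merged = {}
    for topic_map in maps:
        for subject_id, topic in topic_map.items():
            target = merged.setdefault(
                subject_id, {"names": set(), "occurrences": {}}
            )
            # Union the names and combine the occurrences of both topics.
            target["names"] |= set(topic.get("names", ()))
            target["occurrences"].update(topic.get("occurrences", {}))
    return merged


# Two hypothetical domain maps that both mention the same protein.
protein_domain = {
    "subj:protein/P1": {"names": {"P1"}, "occurrences": {"length": 120}},
}
pathway_domain = {
    "subj:protein/P1": {
        "names": {"protein one"},
        "occurrences": {"pathway": "demo pathway"},
    },
    "subj:pathway/demo": {"names": {"demo pathway"}},
}

merged = merge_topic_maps(protein_domain, pathway_domain)
```

After merging, the protein topic carries the names and occurrences from both domains. If two maps disagree on the same occurrence key, this sketch simply lets the later map win; a real merge would need an explicit policy.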
8 Static Generation of Topic Maps
- Highly flexible data model
- Straightforward process
- Intuitive user interface
- Finding the right information is easy
- Topic maps tend to be very large
- Redundant information in DBs and Topic Map files
- Update problems
- Dynamic generation of topic maps
9 Dynamic Topic Map Generation
- Dynamic information retrieval via EJBs / Web Services
- Each topic type is mapped to an EJB / Web Service
- Each association is also represented by an EJB / Web Service
- Straightforward extension of the data model
- User adjustments are possible afterwards
- Intuitive navigation of related information
[Diagram: the topics Protein and EC Number are linked by a Protein ECNum Association ("has" / "is associated to"). Each element is backed by its own service: Protein Web Services, EC Number Web Services, Protein ECNum Association Web Services.]
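The Protein / EC Number example can be sketched as three small services and one function that assembles a topic map fragment at request time. This is a Python illustration rather than EJB or Web Service code, and the function names, the fragment layout, and the sample identifiers (`P1`, `EC-demo-1`) are hypothetical.

```python
# Hypothetical back-end data; in the real system each lookup would be a
# remote call to an EJB or Web Service.
PROTEINS = {"P1": {"name": "example protein"}}
EC_NUMBERS = {"EC-demo-1": {"description": "placeholder enzyme class"}}
PROTEIN_TO_EC = {"P1": ["EC-demo-1"]}


def protein_service(topic_id):
    """Topic service for the topic type Protein."""
    return PROTEINS.get(topic_id)


def ec_number_service(topic_id):
    """Topic service for the topic type EC Number."""
    return EC_NUMBERS.get(topic_id)


def protein_ec_association_service(protein_id):
    """Association service: EC numbers associated with a protein."""
    return PROTEIN_TO_EC.get(protein_id, [])


def build_fragment(protein_id):
    """Assemble a topic map fragment for one protein on demand."""
    protein = protein_service(protein_id)
    if protein is None:
        return {"topics": [], "associations": []}
    topics = [{"type": "Protein", "id": protein_id, **protein}]
    associations = []
    for ec_id in protein_ec_association_service(protein_id):
        ec = ec_number_service(ec_id)
        if ec is None:
            continue  # skip dangling references
        topics.append({"type": "EC Number", "id": ec_id, **ec})
        associations.append({
            "type": "Protein ECNum Association",
            "roles": {"Protein": protein_id, "EC Number": ec_id},
        })
    return {"topics": topics, "associations": associations}


fragment = build_fragment("P1")
```

Because the fragment is built per request, nothing is stored redundantly in a topic map file, which is the motivation given for moving away from static generation.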
10 Interface Definition
- Information retrieval via EJBs (Web Services)
- Each topic type is mapped to an EJB / WS
- Each association type is also represented by an EJB / WS
- Straightforward extension of the data model
- User adjustments are possible afterwards
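One way to picture such an interface definition is a pair of abstract contracts plus a registry, sketched here in Python. The names `TopicService`, `AssociationService`, `Registry`, and the two example implementations are assumptions; the actual EJB / WS interfaces are not given on the slide.

```python
from abc import ABC, abstractmethod
from typing import Optional


class TopicService(ABC):
    """Contract for a service that serves exactly one topic type."""

    topic_type: str

    @abstractmethod
    def get_topic(self, topic_id: str) -> Optional[dict]:
        """Return the topic's data, or None if the id is unknown."""


class AssociationService(ABC):
    """Contract for a service that serves exactly one association type."""

    association_type: str

    @abstractmethod
    def related(self, topic_id: str) -> list:
        """Return the ids of topics associated with topic_id."""


class Registry:
    """Keeps one service per topic type and per association type."""

    def __init__(self):
        self.topic_services = {}
        self.association_services = {}

    def register(self, service):
        if isinstance(service, TopicService):
            self.topic_services[service.topic_type] = service
        elif isinstance(service, AssociationService):
            self.association_services[service.association_type] = service
        else:
            raise TypeError("service must implement one of the two contracts")


# Extending the data model: add a class and register it.
class GenomeService(TopicService):
    topic_type = "Genome"

    def get_topic(self, topic_id):
        return {"id": topic_id, "type": self.topic_type}


class ProteinGenomeAssociation(AssociationService):
    association_type = "belongs"

    def related(self, topic_id):
        return ["genome-demo"] if topic_id == "P1" else []


registry = Registry()
registry.register(GenomeService())
registry.register(ProteinGenomeAssociation())
```

Under these assumptions, extending the data model touches no existing code: a new topic or association type is one new class and one `register` call.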
11 DTMG Architecture (Extension of GenRE)
K. Nenova
12 Worst Case Example
- Combination of two large resources at MIPS
- Annotated proteins: calculated properties of genes / proteins from various organisms
- Orthologs: calculated similarities of proteins (all against all)
K. Nenova / R. Gregory
13 Large Scale Annotation with PEDANT (Protein Extraction, Description and Analysis Tool)
- Currently covers > 400 genomes
- 1000 expected by the end of this year
14 SIMAP: Precalculated Sequence Homologies
- SIMAP database
- 450 proteomes
- 4 sequence collections
- 7.5 million protein entries
- 3.5 million sequences
- 8 billion FASTA hits
- BOINC
- 12600 hosts
- 2.3 TeraFLOPS
[Diagram: SIMAP architecture. Components shown: SIMAP database (database and file server), NFS server, grid master, grid execution hosts on the LAN, web server, external users and MIPS WWW users, and SIMAP clients reached over the Internet running the BOINC core / BOINC daemons on Linux, Mac, and Windows.]
R. Arnold, T. Rattei, P. Tischler, V. Stümpflen, M-D. Truong and HW. Mewes, Bioinformatics, in press
15 Topic Map Schema
[Diagram: topic map schema. Labels recoverable from the figure: topics such as Protein, Genome, EC Number, PFAM Domain, Domain, and Fun Cat; association labels "has" / "is associated to", "contains" / "belongs", "is represented by", and "has orthologs"; properties and links including Description, Classification, Length, Molecular Weight, Contig Name, Sequence, Taxonomy Id, Strain, Status, URL, Pedant URL, Pfam URL, KEGG URL, and FunCat URL.]
16 Some Screenshots
17 Improvements
- Parallel searches based on Message Driven Beans
R. Gregory
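The effect of parallel searches can be sketched with a thread pool that sends one search term to several services at once and collects the answers. This is only an analogy in Python: Message Driven Beans are asynchronous, message-based J2EE components, and nothing here reproduces that mechanism. The three search functions and their return values are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical services; each would normally be a slow remote call.
def search_proteins(term):
    return [f"protein:{term}"]


def search_orthologs(term):
    return [f"ortholog:{term}"]


def search_domains(term):
    return [f"domain:{term}"]


SERVICES = {
    "proteins": search_proteins,
    "orthologs": search_orthologs,
    "domains": search_domains,
}


def parallel_search(term, services=SERVICES):
    """Submit the term to every service at once, then gather the results."""
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in services.items()}
        # result() blocks until the corresponding service has answered.
        return {name: future.result() for name, future in futures.items()}


results = parallel_search("kinase")
```

With slow back ends, the total waiting time approaches that of the slowest service instead of the sum over all services.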
18 Further Improvements
- More maps
- Diseases, metabolism
- Combination with text mining
- Inference engines, reasoners
- "Computer: Show me all proteins in Mus musculus involved in transmembrane signal transduction and show me the orthologs in Rattus norvegicus"
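Once the information is integrated, the example request reduces to a filter followed by an association lookup, as in the following sketch. The protein ids, the annotations, and the ortholog table are invented sample data, and the function name `proteins_with_orthologs` is an assumption; the natural language front end is not covered.

```python
# Hypothetical integrated data: proteins with organism and functional category.
PROTEINS = {
    "mm_p1": {"organism": "Mus musculus",
              "funcat": {"transmembrane signal transduction"}},
    "mm_p2": {"organism": "Mus musculus", "funcat": {"metabolism"}},
    "rn_p1": {"organism": "Rattus norvegicus",
              "funcat": {"transmembrane signal transduction"}},
    "rn_p2": {"organism": "Rattus norvegicus", "funcat": {"metabolism"}},
    "hs_p1": {"organism": "Homo sapiens",
              "funcat": {"transmembrane signal transduction"}},
}

# Hypothetical "has orthologs" associations.
ORTHOLOGS = {"mm_p1": ["rn_p1", "hs_p1"], "mm_p2": ["rn_p2"]}


def proteins_with_orthologs(organism, category, target_organism):
    """Select proteins by organism and category, then list their orthologs
    in the target organism."""
    answer = {}
    for protein_id, protein in PROTEINS.items():
        if protein["organism"] != organism or category not in protein["funcat"]:
            continue
        answer[protein_id] = [
            ortholog
            for ortholog in ORTHOLOGS.get(protein_id, [])
            if PROTEINS.get(ortholog, {}).get("organism") == target_organism
        ]
    return answer


answer = proteins_with_orthologs(
    "Mus musculus", "transmembrane signal transduction", "Rattus norvegicus"
)
```

In this toy data set the answer contains one mouse protein with one rat ortholog; the human ortholog is filtered out because it is not in the target organism.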
19 Conclusion
- Topic Maps are suitable for semantic information integration
- Development of a Dynamic Topic Map Generation (DTMG) framework
- Generation of fragments based on component and service oriented architectures
- Enables a deeper understanding of biological entities and systems in a truly integrated fashion
20 Acknowledgements
- Filka Nenova, Richard Gregory, Matthias Oesterheld, Roland Arnold, Octave Noubibou, Marisa Thoma, Konrad Schreiber
- Thomas Rattei
- Ulrich Güldener, Martin Münsterkötter
- Funding: Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.