Title: Biological Database Systems
1Biological Database Systems
- 12.1. Integration of biological data
2Schema heterogeneity
- Fundamental reason
- Data schemas are developed independently
- Varying structures represent the same or overlapping concepts
- Even the same or overlapping domains are modeled in different ways
- These differences (in schemas describing the same domain) are referred to as semantic heterogeneity
- An increasingly important challenge in the broad field of data management
- Including, specifically, scientific (biological) data management
3Schema heterogeneity
- Database schemas for the same domain developed by independent parties are usually quite different
- Why differing structures?
- People think differently even when faced with the same modeling goal
- Example: given the same database requirements, students in a database course invariably create different database designs (reference here)
- Another example: web search interfaces in one domain (e.g., car search, hotel search) are very heterogeneous
4Schema heterogeneity
- Bio-example
- Consider the largest nucleotide sequence databases: DDBJ (1984), EMBL (1982), GenBank (1983)
- In February 1986, GenBank and EMBL began a collaborative effort (joined by DDBJ in 1987) to devise a common feature table format and common standards for annotation practice
- DDBJ/EMBL/GenBank adhere to documented guidelines
- The DDBJ/EMBL/GenBank Feature Table Definition, regulating the content and syntax of the database entries
- A set of database policies issued and published by the International Advisors to DDBJ/EMBL/GenBank
- The International Nucleotide Sequence Database Collaboration (INSDC)
5Schema heterogeneity
- From the practical point of view
- Schema heterogeneity is a problem since it requires people who understand the meaning of the combined schemas and people skilled in writing transformations (e.g., SQL or XQuery experts)
- Even more challenging for programs: a schema is just text to them, so they cannot capture the meaning and intent of the schemas
- Kinds of semantic discrepancies between schemas
- The same schema element is given different names in two schemas (e.g., InsertDate and SeqInsertDate)
- Attributes are grouped into table structures in different ways
- One schema may cover aspects of the domain not covered by the other schema
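A minimal sketch of the naming and coverage discrepancies listed above, assuming two hypothetical sources. Only InsertDate and SeqInsertDate come from the slide; every other field name, the values, and the hand-written mappings are made up for illustration.

```python
# Hypothetical records from two sources describing the same sequence entry.
# Only InsertDate / SeqInsertDate come from the slide; the rest is invented.
source_a = {"Accession": "X00001", "InsertDate": "1986-02-01", "Organism": "E. coli"}
source_b = {"acc": "X00001", "SeqInsertDate": "1986-02-01"}

# Hand-written mappings from each source's attribute names to a shared vocabulary.
MAPPING_A = {"Accession": "accession", "InsertDate": "insert_date", "Organism": "organism"}
MAPPING_B = {"acc": "accession", "SeqInsertDate": "insert_date"}


def to_common(record, mapping):
    """Rename the attributes of one record according to a mapping."""
    return {mapping[name]: value for name, value in record.items() if name in mapping}


common_a = to_common(source_a, MAPPING_A)
common_b = to_common(source_b, MAPPING_B)

# Attributes covered by schema A but not by schema B.
only_in_a = set(common_a) - set(common_b)
```

The mappings are exactly the part that a person who understands both schemas has to write; the program only applies them.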
6Schema heterogeneity example
7Schema heterogeneity
- Using standard schemas to resolve semantic heterogeneity?
- Create a standard schema describing a particular domain and use it
- In practice, standards have had limited success, and only in domains where the incentives to agree on standards are quite strong
- But even with standards, data providers may share their data using a standard and still use their original data schemas
8Data heterogeneity
- Heterogeneity occurs not only in schemas (semantic level) but also in the data values themselves (data level)
- Multiple ways of referring to the same product, gene, protein, etc.
- Multiple ways of describing the same things (e.g., order and number of address elements)
- Multiple formats (e.g., the A, C, T, G, - alphabet versus the IUPAC nucleic acid code alphabet for nucleotide sequence data)
- Data incompleteness (e.g., people's names) and uncertainty
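As an illustration of the format point above, the following sketch classifies a nucleotide sequence by the alphabet it uses. The function name and the three labels are assumptions made for this example; the symbol set is the standard IUPAC nucleotide code with "-" as the gap character.

```python
# Plain alphabet: the four DNA bases plus the gap symbol.
PLAIN = set("ACGT-")
# IUPAC nucleotide codes: bases (including U) plus ambiguity symbols, with "-" as gap.
IUPAC = set("ACGTURYSWKMBDHVN-")


def alphabet_of(sequence):
    """Return 'plain', 'iupac', or 'unknown' for a nucleotide sequence string."""
    symbols = set(sequence.upper())
    if symbols <= PLAIN:
        return "plain"
    if symbols <= IUPAC:
        return "iupac"
    return "unknown"
```

An integration system would typically run such a check before comparing or merging sequence data from different sources.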
9Integration of bio-databases
- Data integration: providing a unified interface to access multiple databases
- At present, more than 1000 bioinformatics databases exist
- All types of biological data (e.g., metabolic pathways, protein structures, nucleotide sequences, diseases)
- Many biological questions require querying, searching, and integrating data from several sources
- To integrate databases, heterogeneities at various levels have to be overcome
- Technical
- Data-level and semantic-level
- Administrative
10Integration of bio-databases
From "Integration of life science databases" by Kohler, Drug Discovery Today: BIOSILICO, 2(2), 2004
11Integration of bio-databases
- Technical heterogeneity
- Storage
- Flat-file storage
- Relational DBMSs
- Object-oriented DBMSs
- XML data (text-to-XML conversion)
- Access methods and interfaces
- HTTP/FTP
- Web interfaces
- Web services
- Query languages
- SQL
- OQL (Object Query Language)
- XQuery/XPath
12Database integration architectures
- Link navigation
- Hyperlinks to entries in other databases
- Provide links to the most relevant (known) databases
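A small sketch of link navigation, assuming each entry carries a list of (database, identifier) cross-references. The database names and URL templates below are hypothetical placeholders, not the link formats of any real resource.

```python
# Hypothetical URL templates; real databases publish their own link formats.
URL_TEMPLATES = {
    "dbA": "https://example.org/dbA/entry/{id}",
    "dbB": "https://example.org/dbB/view?acc={id}",
}


def cross_links(xrefs):
    """Turn (database, identifier) cross-references into hyperlinks.

    Cross-references to databases without a known template are skipped,
    which mirrors linking only to the known, most relevant databases.
    """
    return [URL_TEMPLATES[db].format(id=ident) for db, ident in xrefs if db in URL_TEMPLATES]


links = cross_links([("dbA", "X1"), ("dbC", "Y2"), ("dbB", "Z3")])
```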
13Database integration architectures
- Data warehousing
- Integrates and aggregates data from several databases into one database
- Best suited for the close integration of a limited number of data sources with stable database schemas
- XML can solve some problematic issues
- Simplifies schema integration (i.e., each database can have its own XML Schema, which becomes a part of the large integrated schema)
- Data exchanged in XML format
- Queries using XQuery/XPath
- BUT the performance and reliability of current XML storage systems are not adequate
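The warehousing idea can be sketched with Python's built-in sqlite3 module: data extracted from several sources is loaded into one database and then queried in one place. The two source extracts, the table names, and the values are hypothetical. A relational store is used here instead of XML purely to keep the example short.

```python
import sqlite3

# Hypothetical extracts from two source databases (names and values are made up).
genes_source = [("geneA", "Escherichia coli"), ("geneB", "Homo sapiens")]
pathways_source = [("geneA", "glycolysis"), ("geneB", "apoptosis"), ("geneB", "cell cycle")]

# The warehouse: one integrated schema holding the data of both sources.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE gene (name TEXT PRIMARY KEY, organism TEXT)")
warehouse.execute("CREATE TABLE gene_pathway (gene TEXT, pathway TEXT)")
warehouse.executemany("INSERT INTO gene VALUES (?, ?)", genes_source)
warehouse.executemany("INSERT INTO gene_pathway VALUES (?, ?)", pathways_source)

# A single query now spans data that originally lived in two databases.
rows = warehouse.execute(
    "SELECT g.name, g.organism, p.pathway "
    "FROM gene g JOIN gene_pathway p ON p.gene = g.name "
    "WHERE g.name = ? ORDER BY p.pathway",
    ("geneB",),
).fetchall()
```

The cost of this approach is visible even in the sketch: the integrated schema has to be designed up front, and the load step has to be repeated whenever a source changes.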
14Database integration architectures
15Database integration architectures
Metadata
16Database integration architectures
- Database mediation and federation
- No data is converted to some unique format
- Wrappers (uniform access to heterogeneous data sources)
- Integration layer (decomposes user queries, sends them to the corresponding wrappers, and integrates query results from all involved sources)
- Query interface
- Database federations
- (Autonomous) data sources provide their own query functionality
- Wrappers mainly translate between different interfaces
- Mediated databases
- Wrappers have a more active role and implement query methods for data sources if necessary
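The wrapper and integration-layer roles described above can be sketched as follows. All class names, the `find` method, and the sample data are assumptions for illustration. The sketch is also simplified: the same lookup is sent to every wrapper, whereas a real integration layer would decompose the query first.

```python
class DictWrapper:
    """Wraps a source that stores records as a dict keyed by gene name."""

    def __init__(self, data):
        self._data = data

    def find(self, gene):
        record = self._data.get(gene)
        return [record] if record else []


class RowsWrapper:
    """Wraps a source that stores records as (gene, field, value) rows."""

    def __init__(self, rows):
        self._rows = rows

    def find(self, gene):
        merged = {}
        for name, field, value in self._rows:
            if name == gene:
                merged[field] = value
        return [merged] if merged else []


class Mediator:
    """Integration layer: forwards a lookup to all wrappers and merges the results."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def query(self, gene):
        result = {}
        for wrapper in self._wrappers:
            for record in wrapper.find(gene):
                result.update(record)
        return result


mediator = Mediator([
    DictWrapper({"geneA": {"organism": "E. coli"}}),
    RowsWrapper([("geneA", "pathway", "glycolysis")]),
])
answer = mediator.query("geneA")
```

Both wrappers expose the same `find` interface even though the underlying sources are organized differently, and no data is copied into a common store.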
17Database integration architectures
- Database mediation and federation
- Examples: DiscoveryLink from IBM (Chap. 11 of the Bioinformatics: Managing Scientific Data book), K2/BioKleisli (Chap. 6, 8 of the Bioinformatics: Managing Scientific Data book)
18Database integration architectures
- Database mediation and federation
19Database integration architectures
- Indexing flat-files
- Allows integration of a large number of heterogeneous databases
- Principle: databases to be integrated are provided as flat files
20Database integration architectures
- Indexing flat-files
- The integration system uses a script (specific to each database) to index the flat files from a database
- The script is also responsible for discriminating data types and generating links to other relevant databases
- Users search several databases via their indexes
- Maintenance of an integrated database schema is not required
- Adding or removing any number of databases is easy
- Example: Sequence Retrieval System (SRS), Chap. 5 of the Bioinformatics: Managing Scientific Data book
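A minimal sketch of flat-file indexing, assuming a simplified EMBL-like layout (two-letter line codes, "//" ending each entry). The sample entries, function names, and index structures are made up for illustration. This is not how SRS itself is implemented.

```python
import io
from collections import defaultdict

# Simplified EMBL-like flat file: two-letter line codes, "//" ends an entry.
FLAT_FILE = """\
ID   ENTRY1
DE   alcohol dehydrogenase gene
//
ID   ENTRY2
DE   heat shock protein gene
//
"""


def build_indexes(handle):
    """Index entry IDs by file offset and description words by entry ID."""
    offsets = {}
    keywords = defaultdict(set)
    current = None
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith("ID"):
            current = line.split()[1]
            offsets[current] = position
        elif line.startswith("DE") and current:
            for word in line.split()[1:]:
                keywords[word.lower()].add(current)
        elif line.startswith("//"):
            current = None
    return offsets, keywords


def fetch(handle, offsets, entry_id):
    """Read one entry back from the flat file using the offset index."""
    handle.seek(offsets[entry_id])
    lines = []
    for line in iter(handle.readline, ""):
        if line.startswith("//"):
            break
        lines.append(line)
    return "".join(lines)


handle = io.StringIO(FLAT_FILE)
offsets, keywords = build_indexes(handle)
hits = keywords["gene"]
entry = fetch(handle, offsets, "ENTRY2")
```

Only the per-database parsing logic in `build_indexes` would change for another flat-file format; the search and retrieval steps work on the indexes alone, which is why no integrated schema has to be maintained.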
21Integration approaches
- Which way of integration is the best?
- Depends on
- Purpose of integration
- E.g., data warehouses may be a good solution for compiling all available information on a specific topic (e.g., a model organism or a metabolic pathway); warehouses can even integrate many resources if these are semantically closely related
- Number of resources
- How often data sources would be added/removed
- How up-to-date the data in the integration system should be
- Should the functionality (specific tools and access methods) of the underlying data sources be integrated?
22Integration approaches
- Depends on
- Is mapping of equivalent entries among different data sources involved?
- Number of users
- A flat-file indexing system can handle many simultaneous users, while federated databases can have certain limits on the number of users
- Complexity of schemas to be integrated
- Restrictions applied to the usage of the data
- Legal/protective issues: some sources cannot be fully distributed (i.e., it might be illegal to retrieve all the data from a particular resource and use it in a warehouse system)
- E.g., it might be required to always display the origin of the data
23- References
- "Integration of life science databases" by Kohler, Drug Discovery Today: BIOSILICO, 2(2), 2004
- "Bioinformatics: Managing Scientific Data" by Lacroix and Critchlow, Morgan Kaufmann, 2003 (ISBN-10: 155860829X)