Biological Database Systems

Slides: 24

Provided by: denissh

Transcript and Presenter's Notes

1
Biological Database Systems

12.1. Integration of biological data

2
Schema heterogeneity

Fundamental reason: data schemas are developed independently
Varying structures represent the same or overlapping concepts
Even the same or overlapping domains are modeled in different ways
These differences (in schemas describing the same domain) are referred to as semantic heterogeneity
An increasingly important challenge in the broad field of data management, including scientific (biological) data management

3
Schema heterogeneity

Database schemas for the same domain developed by independent parties are usually quite different
Why do the structures differ?
People think differently even when faced with the same modeling goal
Example: given the same database requirements, students of a database course invariably create different database designs (reference here)
Another example: web search interfaces in one domain (e.g., car search, hotel search) are very heterogeneous

4
Schema heterogeneity

Bio-example
Consider the largest nucleotide sequence databases: DDBJ (1984), EMBL (1982), GenBank (1983)
In February 1986, GenBank and EMBL began a collaborative effort (joined by DDBJ in 1987) to devise a common feature table format and common standards for annotation practice
DDBJ/EMBL/GenBank adhere to documented guidelines:
The DDBJ/EMBL/GenBank Feature Table Definition, regulating the content and syntax of the database entries
A set of database policies issued and published by the International Advisors to DDBJ/EMBL/GenBank
The three databases form the International Nucleotide Sequence Database Collaboration (INSDC)

5
Schema heterogeneity

From the practical point of view
Schema heterogeneity is a problem because resolving it requires people who understand the meaning of the schemas being combined and people skilled in writing transformations (e.g., SQL or XQuery experts)
Even more challenging for programs: a schema is just some text to them, so they cannot capture the meaning and intent of the schemas
Kinds of semantic discrepancies between schemas:
The same schema element is given different names in two schemas (e.g., InsertDate and SeqInsertDate)
Attributes are grouped into table structures in different ways
One schema may cover aspects of the domain not covered by the other schema
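
A minimal sketch of the first and third discrepancies, assuming two hypothetical sources that name the insertion date differently and cover different attributes. The records, the attribute names other than InsertDate and SeqInsertDate, and the shared vocabulary are all invented for illustration; real mappings are usually written by the experts mentioned above.

```python
# Hypothetical records from two sources describing the same sequence entry.
source_a = {"Accession": "ABC123", "InsertDate": "1986-02-01", "Organism": "yeast"}
source_b = {"acc": "ABC123", "SeqInsertDate": "1986-02-01"}

# Hand-written mappings from each source's attribute names to a shared vocabulary.
MAPPING_A = {"Accession": "accession", "InsertDate": "insert_date", "Organism": "organism"}
MAPPING_B = {"acc": "accession", "SeqInsertDate": "insert_date"}


def to_common(record, mapping):
    """Rename attributes to the shared vocabulary; unmapped attributes are dropped."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def merge(*records):
    """Combine records; the first source that supplies an attribute wins."""
    merged = {}
    for record in records:
        for key, value in record.items():
            merged.setdefault(key, value)
    return merged


a = to_common(source_a, MAPPING_A)
b = to_common(source_b, MAPPING_B)
combined = merge(a, b)
print(combined)
```

Source B does not cover the organism at all, so that attribute reaches the combined record only through source A.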

6
Schema heterogeneity example

7
Schema heterogeneity

Using standard schemas to resolve semantic heterogeneity?
Create a standard schema describing a particular domain and use it
Experience shows that standards have limited success, and only in domains where the incentives to agree on standards are quite strong
Even with standards, data providers may share their data using a standard and still use their original data schemas

8
Data heterogeneity

Heterogeneity occurs not only in schemas (semantic level) but also in the data values themselves (data level)
Multiple ways of referring to the same product, gene, protein, etc.
Multiple ways of describing the same things (e.g., order and number of address elements)
Multiple formats (e.g., the A, C, T, G, - alphabet versus the IUPAC nucleic acid codes for nucleotide sequence data)
Data incompleteness (e.g., people's names) and uncertainty
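
The alphabet mismatch can be sketched as follows. The table holds the standard IUPAC nucleotide ambiguity codes (without U); the function names and the rule that a gap matches only a gap are assumptions made for this illustration.

```python
# IUPAC nucleotide codes mapped to the plain bases each symbol can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-",  # assumption: a gap is only compatible with another gap
}
PLAIN = set("ACGT-")


def uses_plain_alphabet(seq):
    """True if the sequence contains only A, C, G, T and the gap symbol."""
    return set(seq.upper()) <= PLAIN


def expand(seq):
    """Return, per position, the bases the symbol at that position can denote."""
    return [IUPAC[symbol] for symbol in seq.upper()]


def compatible(seq_a, seq_b):
    """True if two equal-length sequences could denote the same bases everywhere."""
    if len(seq_a) != len(seq_b):
        return False
    return all(set(x) & set(y) for x, y in zip(expand(seq_a), expand(seq_b)))


print(uses_plain_alphabet("ACRT"), compatible("ACGT", "ACRT"))
```

A source that stores ACRT and one that stores ACGT may describe the same sequence, which a plain string comparison would miss.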

9
Integration of bio-databases

Data integration: providing a unified interface to access multiple databases
At present, more than 1000 bioinformatics databases exist
They cover all types of biological data (e.g., metabolic pathways, protein structures, nucleotide sequences, diseases)
Many biological questions require querying, searching, or integrating data from several sources
To integrate databases, heterogeneities at various levels have to be overcome:
Technical
Data-level and semantic-level
Administrative

10
Integration of bio-databases
From: Integration of life science databases, by Kohler, Drug Discovery Today: BIOSILICO, 2(2), 2004

11
Integration of bio-databases

Technical heterogeneity
Storage
Flat-file storage
Relational DBMSs
Object-oriented DBMSs
XML data (text-to-XML conversion)
Access methods and interfaces
HTTP/FTP
Web interfaces
Web services
Query languages
SQL
OQL (Object Query Language)
XQuery/XPath

12
Database integration architectures

Link navigation
Hyperlinks to entries in other databases
Links are provided to the most relevant (known) databases

13
Database integration architectures

Data warehousing
Integrates and aggregates data from several databases into one database
Best suited for the close integration of a limited number of data sources with stable database schemas
XML can solve some problematic issues:
Simplifies schema integration (i.e., each database can have its own XML Schema, which becomes a part of the large integrated schema)
Data is exchanged in XML format
Queries use XQuery/XPath
BUT the performance and reliability of current XML storage systems are not adequate
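
A minimal sketch of querying integrated XML data with XPath, assuming a small hypothetical warehouse document in which each source contributes its own block of entries. The element and attribute names are invented. Python's standard xml.etree.ElementTree module supports only a limited subset of XPath and no XQuery, so the paths below stay within that subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical integrated document: one <source> block per original database.
DOC = """
<warehouse>
  <source name="sequences">
    <entry id="s1"><organism>yeast</organism><length>1200</length></entry>
    <entry id="s2"><organism>mouse</organism><length>900</length></entry>
  </source>
  <source name="pathways">
    <entry id="p1"><organism>yeast</organism><pathway>glycolysis</pathway></entry>
  </source>
</warehouse>
"""

root = ET.fromstring(DOC)

# All entries about yeast, regardless of which source they came from.
yeast_ids = [e.get("id") for e in root.findall(".//entry[organism='yeast']")]

# All entries that came from one particular source.
sequence_ids = [e.get("id") for e in root.findall("./source[@name='sequences']/entry")]

print(yeast_ids, sequence_ids)
```

The first query cuts across sources, which is the benefit of holding everything in one integrated document.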

14
Database integration architectures

Data warehousing

15
Database integration architectures

Data warehousing

Metadata

16
Database integration architectures

Database mediation and federation
Data is not converted into a single common format
Wrappers (uniform access to heterogeneous data sources)
Integration layer (decomposes user queries, sends them to the corresponding wrappers, and integrates the query results from all involved sources)
Query interface
Database federations:
(Autonomous) data sources provide their own query functionality
Wrappers mainly translate between different interfaces
Mediated databases:
Wrappers have a more active role and implement query methods for the data sources if necessary
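
The wrapper and integration-layer roles can be sketched as below. The two sources, their storage layouts, and all class and gene names are hypothetical. The integration layer here simply forwards the same lookup to every wrapper, whereas a real system would decompose more complex queries.

```python
class DictWrapper:
    """Wraps a source that stores records as {gene: {field: value}}."""

    def __init__(self, name, data):
        self.name = name
        self.data = data

    def query(self, gene):
        record = self.data.get(gene)
        if record is None:
            return []
        return [{"source": self.name, "gene": gene, **record}]


class RowWrapper:
    """Wraps a source that stores (gene, field, value) rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def query(self, gene):
        fields = {field: value for g, field, value in self.rows if g == gene}
        if not fields:
            return []
        return [{"source": self.name, "gene": gene, **fields}]


class Mediator:
    """Integration layer: sends the query to each wrapper and collects the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, gene):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(gene))
        return results


proteins = DictWrapper("proteins", {"geneX": {"protein": "protX"}})
pathways = RowWrapper("pathways", [("geneX", "pathway", "pathY"), ("geneZ", "pathway", "pathW")])
mediator = Mediator([proteins, pathways])
answer = mediator.query("geneX")
print(answer)
```

The data stays in each source's own layout; only the wrappers know how to read it, and the caller sees one uniform record shape.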

17
Database integration architectures

Database mediation and federation
Examples: DiscoveryLink from IBM (Chap. 11 of the book Bioinformatics: Managing Scientific Data), K2/BioKleisli (Chap. 6 and 8 of the same book)

18
Database integration architectures

Database mediation and federation

19
Database integration architectures

Indexing flat-files
Allows a large number of heterogeneous databases to be integrated
Principle: the databases to be integrated are provided as flat-files

20
Database integration architectures

Indexing flat-files
The integration system uses a script (specific to each database) to index the flat-files from a database
The script is also responsible for discriminating data types and generating links to other relevant databases
Users search several databases via their indexes
Maintenance of an integrated database schema is not required
Adding or removing any number of databases is easy
Example: Sequence Retrieval System (SRS) (Chap. 5 of the book Bioinformatics: Managing Scientific Data)
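
A minimal sketch of the indexing idea, assuming two hypothetical flat-file layouts and one small parser per database. It is not how SRS itself is implemented. The file contents, the layouts, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical flat-file for database A: tagged lines, entries end with "//".
FLATFILE_A = """ID   A1
DE   yeast hexokinase
//
ID   A2
DE   mouse kinase
//
"""

# Hypothetical flat-file for database B: one "id|description" line per entry.
FLATFILE_B = "B1|yeast glycolysis pathway\nB2|mouse insulin signalling\n"


def parse_a(text):
    """Database-specific script for layout A: yields (entry id, description)."""
    entry_id = None
    for line in text.splitlines():
        if line.startswith("ID"):
            entry_id = line.split()[1]
        elif line.startswith("DE"):
            yield entry_id, line[2:].strip()


def parse_b(text):
    """Database-specific script for layout B: yields (entry id, description)."""
    for line in text.splitlines():
        entry_id, description = line.split("|", 1)
        yield entry_id, description


def build_index(databases):
    """Map each description word to the set of (database, entry id) pairs using it."""
    index = defaultdict(set)
    for db_name, (parser, text) in databases.items():
        for entry_id, description in parser(text):
            for word in description.lower().split():
                index[word].add((db_name, entry_id))
    return index


def search(index, word):
    """Look a word up across all indexed databases."""
    return sorted(index.get(word.lower(), ()))


index = build_index({"A": (parse_a, FLATFILE_A), "B": (parse_b, FLATFILE_B)})
print(search(index, "yeast"))
```

Adding a database means supplying one more parser and flat-file; no integrated schema has to be revised.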

21
Integration approaches

Which way of integration is the best?
Depends on:
Purpose of integration
E.g., data warehouses may be a good solution for compiling all available information on a specific topic (e.g., model organism, metabolic pathway); warehouses can even integrate many resources if these are semantically closely related
Number of resources
How often data sources would be added or removed
How up-to-date the data in the integration system should be
Whether the functionality (specific tools and access methods) of the underlying data sources should be integrated

22
Integration approaches

Depends on:
Whether mapping of equivalent entries among different data sources is involved
Number of users
A system that indexes flat-files can handle many simultaneous users, while federated databases can have certain limits on the number of users
Complexity of the schemas to be integrated
Restrictions applied to the usage of the data
Legal/protective issues: some sources cannot be fully distributed (i.e., it might be illegal to retrieve all the data from a particular resource and use it in a warehouse system)
E.g., it might be required to always display the origin of the data

References
Integration of life science databases, by Kohler, Drug Discovery Today: BIOSILICO, 2(2), 2004
Bioinformatics: Managing Scientific Data, by Lacroix and Critchlow, Morgan Kaufmann, 2003 (ISBN-10: 155860829X)
