Biological Database Systems - PowerPoint PPT Presentation

Provided by: denissh

1
Biological Database Systems
  • 12.1. Integration of biological data

2
Schema heterogeneity
  • Fundamental reason
  • Data schemas are developed independently
  • Varying structures represent the same or
    overlapping concepts
  • Even same or overlapping domains are modeled in
    different ways
  • These differences (in schemas describing the same
    domain) are referred to as semantic heterogeneity
  • Increasingly important challenge in the broad
    field of data management
  • Including specifically scientific (biological)
    data management of course

3
Schema heterogeneity
  • Database schemas for the same domain developed by
    independent parties are usually quite different
  • Why differing structures?
  • People think differently even when faced with the
    same modeling goal
  • Example: given the same database requirements,
    students of a database course invariably create
    different database designs (reference here)
  • Another example: web search interfaces in one
    domain (e.g., car search, hotel search, etc.) are
    very heterogeneous

4
Schema heterogeneity
  • Bio-example
  • Consider the largest nucleotide sequence
    databases: DDBJ (1984), EMBL (1982), GenBank (1983)
  • In February, 1986, GenBank and EMBL began a
    collaborative effort (joined by DDBJ in 1987) to
    devise a common feature table format and common
    standards for annotation practice.
  • DDBJ/EMBL/GenBank adhere to documented
    guidelines
  • The DDBJ/EMBL/GenBank Feature Table Definition
    regulating the content and syntax of the database
    entries
  • A set of database policies issued and published
    by the International Advisors to DDBJ/EMBL/GenBank
  • The International Nucleotide Sequence Database
    Collaboration (INSDC)

5
Schema heterogeneity
  • From the practical point of view
  • Schema heterogeneity is a problem because
    resolving it requires people who understand the
    meaning of the schemas being combined and people
    skilled in writing transformations (e.g., SQL or
    XQuery experts)
  • It is even more challenging for programs: to them
    a schema is just text, so they cannot capture the
    meaning and intent of the schemas
  • Kinds of semantic discrepancies between schemas
  • Same schema element in two schemas is given
    different names (e.g., InsertDate and
    SeqInsertDate)
  • Attributes in schemas are grouped into table
    structures in different ways
  • One schema may cover aspects of the domain not
    covered by the other schema
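
The three kinds of discrepancy listed above can be made concrete with a small sketch. It assumes two made-up record layouts for the same sequence entry; apart from InsertDate and SeqInsertDate, every field name, value, and helper here is hypothetical and chosen only for illustration.

```python
# Hypothetical records from two sources describing the same sequence entry.
# Source A keeps everything in one flat record; source B groups the same
# attributes into two structures and also covers an aspect (organism)
# that source A does not.
record_a = {"Accession": "ACC0001", "InsertDate": "2004-01-15", "Length": 1200}
record_b = {
    "entry": {"acc": "ACC0001", "SeqInsertDate": "2004-01-15"},
    "sequence": {"seq_len": 1200, "organism": "Homo sapiens"},
}

# Hand-written mappings: unified attribute -> path inside the source record.
MAPPING_A = {
    "accession": ("Accession",),
    "insert_date": ("InsertDate",),
    "length": ("Length",),
}
MAPPING_B = {
    "accession": ("entry", "acc"),
    "insert_date": ("entry", "SeqInsertDate"),
    "length": ("sequence", "seq_len"),
    "organism": ("sequence", "organism"),
}
UNIFIED_FIELDS = ("accession", "insert_date", "length", "organism")


def to_unified(record, mapping):
    """Translate a source record into the unified layout.

    Attributes that the source schema does not cover are set to None.
    """
    unified = {}
    for field in UNIFIED_FIELDS:
        path = mapping.get(field)
        if path is None:
            unified[field] = None
            continue
        value = record
        for key in path:  # walk down the nested structure
            value = value[key]
        unified[field] = value
    return unified


unified_a = to_unified(record_a, MAPPING_A)
unified_b = to_unified(record_b, MAPPING_B)
print(unified_a)
print(unified_b)
```

The mappings are written by hand, which is exactly the step that needs someone who understands both schemas.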

6
Schema heterogeneity example

7
Schema heterogeneity
  • Using standard schemas to resolve semantic
    heterogeneity?
  • Create a standard schema describing a particular
    domain and use it
  • Experience shows that standards have limited
    success, and only in domains where the incentives
    to agree on standards are quite strong
  • Even with a standard in place, data providers may
    share their data using the standard and still use
    their original data schemas

8
Data heterogeneity
  • Heterogeneity occurs not only in schema
    (semantic-level) but also in data values
    themselves (data-level)
  • Multiple ways of referring to the same product,
    gene, protein, etc.
  • Multiple ways of describing the same things
    (order and number of address elements)
  • Multiple formats (e.g., the A, C, T, G, - alphabet
    versus the IUPAC nucleic acid code alphabet for
    nucleotide sequence data)
  • Data incompleteness (e.g., people's names) and
    uncertainty
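
As an illustration of the alphabet mismatch mentioned above, the following sketch checks whether a sequence uses only the strict A, C, G, T, - alphabet and whether two sequences written in different alphabets could denote the same bases. The table holds the standard IUPAC DNA ambiguity codes; the function names and example sequences are made up, and the gap handling is a simplifying assumption.

```python
# IUPAC nucleic acid codes mapped to the set of bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
STRICT_ALPHABET = set("ACGT-")


def is_strict(seq):
    """True if the sequence uses only A, C, G, T and the gap symbol."""
    return set(seq.upper()) <= STRICT_ALPHABET


def compatible(seq1, seq2):
    """True if two equal-length sequences could denote the same bases.

    Two positions are compatible when their IUPAC base sets overlap.
    Assumption of this sketch: a gap is only compatible with another gap.
    Symbols outside the table above raise KeyError.
    """
    if len(seq1) != len(seq2):
        return False
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            if a != b:
                return False
        elif not set(IUPAC[a]) & set(IUPAC[b]):
            return False
    return True


print(is_strict("ACGT-A"))              # strict alphabet only
print(is_strict("ACRTNA"))              # contains ambiguity codes R and N
print(compatible("ACGTTA", "ACRTNA"))   # G fits R (A or G), T fits N
print(compatible("ACCTTA", "ACRTNA"))   # C does not fit R
```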

9
Integration of bio-databases
  • Data integration - providing a unified interface
    to access multiple databases
  • At present, more than 1000 bioinformatics
    databases exist
  • All types of biological data (e.g., metabolic
    pathways, protein structures, nucleotide
    sequences, diseases, etc.)
  • Many biological questions require querying,
    searching, and integrating data from several
    sources
  • To integrate databases, heterogeneities at the
    various levels have to be overcome
  • Technical
  • Data-level and semantic-level
  • Administrative

10
Integration of bio-databases
From: Integration of life science databases, by
Kohler, Drug Discovery Today: BIOSILICO, 2(2), 2004

11
Integration of bio-databases
  • Technical heterogeneity
  • Storage
  • Flat-file storage
  • Relational DBMSs
  • Object-oriented DBMSs
  • XML data (text-to-XML conversion)
  • Access methods and interfaces
  • HTTP/FTP
  • Web interfaces
  • Web services
  • Query languages
  • SQL
  • OQL (Object Query Language)
  • XQuery/XPath

12
Database integration architectures
  • Link navigation
  • Hyperlink to entries in other databases
  • Provide links to the most relevant (known)
    databases

13
Database integration architectures
  • Data warehousing
  • Integrates and aggregates data of several
    databases into one database
  • Best suited for the close integration of a
    limited number of data sources with stable
    database schemas
  • XML can solve some problematic issues
  • Simplifies schema integration (i.e., each
    database can have its own XML Schema, which
    becomes a part of the large integrated schema)
  • Data exchanged in XML-format
  • Queries using XQuery/XPath
  • BUT, performance/reliability of current XML
    storage systems is not adequate
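
A minimal sketch of the XML-based warehouse idea, assuming two made-up source documents with different element layouts. It uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset (not XQuery); all element names, attributes, and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical source documents, each with its own element layout.
SOURCE_A = """<sequences>
  <entry acc="A1" organism="human"><length>1200</length></entry>
  <entry acc="A2" organism="mouse"><length>800</length></entry>
</sequences>"""
SOURCE_B = """<proteins>
  <protein id="P1" organism="human"><name>kinase</name></protein>
</proteins>"""

# The "warehouse" is one document whose root holds each source as a subtree,
# mirroring the idea that every source schema becomes part of a larger one.
warehouse = ET.Element("warehouse")
for name, text in (("source_a", SOURCE_A), ("source_b", SOURCE_B)):
    holder = ET.SubElement(warehouse, name)
    holder.append(ET.fromstring(text))

# One XPath-style query now spans both sources.
human_entries = warehouse.findall(".//*[@organism='human']")
ids = sorted(e.get("acc") or e.get("id") for e in human_entries)
print(ids)
```

Note that the query works across sources only because both happen to use an `organism` attribute; the identifier attribute still differs (`acc` versus `id`), so nesting the schemas does not by itself remove semantic heterogeneity.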

14
Database integration architectures
  • Data warehousing

15
Database integration architectures
  • Data warehousing

16
Database integration architectures
  • Database mediation and federation
  • Data is not converted to a single common format
  • Wrappers (uniform access to heterogeneous data
    sources)
  • Integration layer (decomposes user queries, sends
    them to corresponding wrappers, and integrates
    query results from all involved sources)
  • Query interface
  • Database federations
  • (Autonomous) data sources provide their own query
    functionality
  • Wrappers mainly translate between different
    interfaces
  • Mediated databases
  • Wrappers have a more active role and implement
    query methods for data sources if necessary
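
The wrapper and integration-layer roles described above can be sketched as follows. This is an illustrative toy, not how DiscoveryLink or K2/BioKleisli work: the class names, the `find_genes` method, and the sample data are all assumptions made for the example.

```python
import csv
import io


class CsvWrapper:
    """Wraps a hypothetical flat CSV source with columns gene,organism."""

    def __init__(self, text):
        self._rows = list(csv.DictReader(io.StringIO(text)))

    def find_genes(self, organism):
        # This source has no query facility of its own, so the wrapper
        # implements the filtering itself (the "mediated" style).
        return [{"gene": r["gene"], "source": "csv"}
                for r in self._rows if r["organism"] == organism]


class DictWrapper:
    """Wraps a hypothetical in-memory source keyed by organism."""

    def __init__(self, data):
        self._data = data

    def find_genes(self, organism):
        # This source can already look up by organism, so the wrapper
        # mainly translates the interface (the "federated" style).
        return [{"gene": g, "source": "dict"}
                for g in self._data.get(organism, [])]


class Mediator:
    """Integration layer: sends a query to every wrapper, merges results."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def find_genes(self, organism):
        merged = {}
        for wrapper in self._wrappers:
            for hit in wrapper.find_genes(organism):
                # Merge duplicates by gene name and remember every origin.
                merged.setdefault(hit["gene"], set()).add(hit["source"])
        return {gene: sorted(sources) for gene, sources in merged.items()}


mediator = Mediator([
    CsvWrapper("gene,organism\nTP53,human\nBRCA1,human\nTrp53,mouse\n"),
    DictWrapper({"human": ["TP53", "EGFR"]}),
])
result = mediator.find_genes("human")
print(result)
```

No data is copied into a common store here; each query is answered from the sources at request time, which keeps results current at the cost of depending on every source being available.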

17
Database integration architectures
  • Database mediation and federation
  • Examples: DiscoveryLink from IBM (Chap. 11 of the
    book Bioinformatics: Managing Scientific Data)
    and K2/BioKleisli (Chaps. 6 and 8 of the same
    book)

18
Database integration architectures
  • Database mediation and federation

19
Database integration architectures
  • Indexing flat-files
  • Allows integrating a large number of
    heterogeneous databases
  • Principle: the databases to be integrated are
    provided as flat files

20
Database integration architectures
  • Indexing flat-files
  • The integration system uses a script (specific to
    each database) to index the flat files from that
    database
  • The script is also responsible for discriminating
    data types and generating links to other relevant
    databases
  • Users search several databases via their indexes
  • Maintenance of integrated database schema is not
    required
  • Adding/removal of any number of databases is easy
  • Example: Sequence Retrieval System (SRS), Chap. 5
    of the book Bioinformatics: Managing Scientific
    Data
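
The indexing principle can be sketched roughly as below. This is not SRS itself: the two flat-file formats, the parser functions, and the database names are invented for illustration, and link generation between databases is left out.

```python
import re
from collections import defaultdict

# Hypothetical flat files in two different line-oriented formats.
FLATFILE_A = """ID   A1
DE   human insulin precursor
//
ID   A2
DE   mouse insulin receptor
//
"""
FLATFILE_B = """>B1 human hemoglobin alpha
>B2 human insulin receptor
"""


def parse_a(text):
    """Database-specific script for format A: yields (id, description)."""
    entry_id = None
    for line in text.splitlines():
        if line.startswith("ID"):
            entry_id = line[2:].strip()
        elif line.startswith("DE"):
            yield entry_id, line[2:].strip()


def parse_b(text):
    """Database-specific script for format B: yields (id, description)."""
    for line in text.splitlines():
        if line.startswith(">"):
            entry_id, _, description = line[1:].partition(" ")
            yield entry_id, description


def build_index(sources):
    """Build one word index over all databases; no shared schema is needed."""
    index = defaultdict(set)
    for db_name, parser, text in sources:
        for entry_id, description in parser(text):
            for word in re.findall(r"\w+", description.lower()):
                index[word].add((db_name, entry_id))
    return index


def search(index, *words):
    """Return entries containing all query words, across all databases."""
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []


index = build_index([
    ("dbA", parse_a, FLATFILE_A),
    ("dbB", parse_b, FLATFILE_B),
])
print(search(index, "insulin", "receptor"))
```

Adding another database means supplying one more parser and rebuilding the index, which is why this approach scales to many sources without maintaining an integrated schema.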

21
Integration approaches
  • Which way of integration is the best?
  • Depends on
  • Purpose of integration
  • E.g., data warehouses may be a good solution for
    compiling all available information on a specific
    topic (e.g., model organism, metabolic pathway,
    etc.); warehouses can even integrate many
    resources if these are semantically closely
    related
  • Number of resources
  • How often data sources would be added/removed
  • How up-to-date data in the integration system
    should be
  • Should the functionality (specific tools and
    access methods) of underlying data sources be
    integrated?

22
Integration approaches
  • Depends on
  • Is mapping of equivalent entries among different
    data sources involved?
  • Number of users
  • Flat-file indexing systems can handle many
    simultaneous users, while federated databases can
    have certain limits on the number of users
  • Complexity of schemas to be integrated
  • Restrictions applied to the usage of the data
  • Legal/protective issues: some sources cannot be
    fully distributed (i.e., it might be illegal to
    retrieve all the data from a particular resource
    and use it in a warehouse system)
  • E.g., it might be required to always display the
    origin of the data source

23
  • References
  • Integration of life science databases, by Kohler,
    Drug Discovery Today: BIOSILICO, 2(2), 2004
  • Bioinformatics: Managing Scientific Data, by
    Lacroix and Critchlow, Morgan Kaufmann, 2003
    (ISBN-10: 155860829X)