Title: Data Integration in the Life Sciences
1Data Integration in the Life Sciences
- Kenneth Griffiths and Richard Resnick
2Tutorial Agenda
- 1:30 - 1:45  Introduction
- 1:45 - 2:00  Tutorial Survey
- 2:00 - 3:00  Approaches to Integration
- 3:00 - 3:05  Bio Break
- 3:05 - 4:00  Approaches to Integration (cont.)
- 4:00 - 4:15  Question and Answer
- 4:15 - 4:30  Break
- 4:30 - 5:00  Metadata Session
- 5:00 - 5:30  Domain-specific example (GxP)
- 5:30         Wrap-up
3Life Science Data
Recent focus on genetic data: genomics, the study of genes and their function.
"Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy." - The Pharmaceutical Research and Manufacturers of America, http://www.phrma.org/genomics/lexicon/g.html
- Study of genes and their function
- Understanding molecular mechanisms of disease
- Development of drugs, vaccines, and diagnostics
4The Study of Genes...
- Chromosomal location
- Sequence
- Sequence Variation
- Splicing
- Protein Sequence
- Protein Structure
5 and Their Function
- Homology
- Motifs
- Publications
- Expression
- HTS
- In Vivo/Vitro Functional Characterization
6Understanding Mechanisms of Disease
7Development of Drugs, Vaccines, Diagnostics
- Differing types of Drugs, Vaccines, and
Diagnostics - Small molecules
- Protein therapeutics
- Gene therapy
- In vitro, In vivo diagnostics
- Development requires
- Preclinical research
- Clinical trials
- Long-term clinical research
- All of which often feeds back into ongoing
Genomics research and discovery.
8The Industry's Problem
- Too much unintegrated data
- from a variety of incompatible sources
- no standard naming convention
- each with a custom browsing and querying
mechanism (no common interface)
- and poor interaction with other data sources
9What are the Data Sources?
- Flat Files
- URLs
- Proprietary Databases
- Public Databases
- Data Marts
- Spreadsheets
- Emails
-
10Sample Problem: Hyperprolactinemia
- Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
- Hyperprolactinemia is characterized by
- inappropriate milk production
- disruption of the menstrual cycle
- can lead to conception difficulty
11Understanding transcription factors for prolactin
production
Show me all genes in the public literature that
are putatively related to hyperprolactinemia,
have more than 3-fold expression differential
between hyperprolactinemic and normal pituitary
cells, and are homologous to known transcription
factors.
(Q1 ∩ Q2 ∩ Q3)
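Once the three domains are integrated, this question can be expressed as a single query. Below is a minimal SQL sketch, assuming hypothetical integrated tables gene, literature_gene_link, expression_result, and homology_hit that share a gene_id key; the table and column names are illustrative, not part of any schema in this tutorial.

    -- Q1: genes putatively linked to hyperprolactinemia in the public literature
    -- Q2: genes with a more than 3-fold expression differential
    --     between hyperprolactinemic and normal pituitary cells
    -- Q3: genes homologous to known transcription factors
    SELECT DISTINCT g.gene_id, g.gene_name
    FROM   gene g
    JOIN   literature_gene_link l ON l.gene_id = g.gene_id
    JOIN   expression_result    e ON e.gene_id = g.gene_id
    JOIN   homology_hit         h ON h.gene_id = g.gene_id
    WHERE  l.disease_term = 'hyperprolactinemia'
    AND    e.comparison   = 'hyperprolactinemic_vs_normal_pituitary'
    AND    e.fold_change  > 3
    AND    h.target_class = 'transcription factor';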
13Approaches to Integration
- In order to ask this type of question across
multiple domains, data integration at some level
is necessary. When discussing the different
approaches to data integration, a number of key
issues need to be addressed:
- Accessing the original data sources
- Handling redundant as well as missing data
- Normalizing analytical data from different data sources
- Conforming terminology to industry standards
- Accessing the integrated data as a single logical repository
- Metadata (used to traverse domains)
14Approaches to Integration (cont.)
- So if one agrees that the preceding issues are
important, where are they addressed? In the
client application, the middleware, or the
database? Where they are addressed can make a
huge difference in usability and performance.
Currently there are a number of approaches for
data integration:
- Federated Databases
- Data Warehousing
- Indexed Data Sources
- Memory-mapped Data Structures
15Federated Database Approach
Show me all genes that are homologous to known
transcription factors
Show me all genes that have more than 3-fold
expression differential between
hyperprolactinemic and normal cells
Show me all genes in the public literature that
are putatively related to hyperprolactinemia
16Advantages to Federated Database Approach
- quick to configure
- architecture is easy to understand
- no knowledge of the domain is necessary
- achieves a basic level of integration with minimal effort
- can wrap and plug in new data sources as they come into existence
17Problems with Federated Database Approach
- Integration of queries and query results occurs at the integrated application level, requiring complex low-level logic to be embedded at the highest level
- Naming conventions across systems must be adhered to or query results will be inaccurate - imposes constraints on original data sources
- Data sources are not necessarily clean; integrating dirty data makes integrated dirty data
- No query optimization across multiple systems can be performed
- If one source system goes down, the entire integrated application may fail
- Not readily suitable for data mining or generic visualization tools
- Relies on CORBA or other middleware technology, shown to have performance (and reliability?) problems
18Solving Federated Database Problems
(Diagram: a Semantic Cleaning Layer sits on top of the middleware (CORBA, DCOM, etc.), which wraps LITERATURE sources such as PubMed, Medline, and a proprietary application.)
19Data Warehousing for Integration
- Data warehousing is a process as much as it is a repository. There are a few primary concepts behind data warehousing:
- ETL (Extraction, Transformation, Load)
- Component-based (datamarts)
- Typically utilizes a dimensional model
- Metadata-driven
20Data Warehousing
E (Extraction) T (Transformation) L (Load)
21Data-level Integration Through Data Warehousing
22Data Staging
- Storage area and set of processes that
- extracts source data
- transforms data
- cleans incorrect data, resolves missing elements, conforms to standards
- purges fields not needed
- combines data sources
- creates surrogate keys for data to avoid dependence on legacy keys (see the sketch after this list)
- builds aggregates where needed
- archives/logs
- loads and indexes data
- Does not provide query or presentation services
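As a concrete illustration of the surrogate-key and conformance steps above, here is a minimal SQL sketch. The staging table stage_sequence, the dimension dim_gene, and the sequence gene_key_seq are hypothetical, and sequence/identity syntax varies by RDBMS.

    -- Load cleaned, conformed gene records into the warehouse dimension,
    -- assigning a surrogate key so the warehouse does not depend on legacy keys.
    INSERT INTO dim_gene (gene_key, source_system, source_gene_id, gene_symbol)
    SELECT gene_key_seq.NEXTVAL,          -- surrogate key
           s.source_system,
           s.source_gene_id,              -- legacy key retained only for lineage
           UPPER(TRIM(s.gene_symbol))     -- conform terminology to a standard form
    FROM   stage_sequence s
    WHERE  s.gene_symbol IS NOT NULL      -- skip records with missing elements
    AND    NOT EXISTS (SELECT 1
                       FROM   dim_gene d
                       WHERE  d.source_system  = s.source_system
                       AND    d.source_gene_id = s.source_gene_id);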
23Data Staging (cont.)
- Sixty to seventy percent of development is here
- Engineering is generally done using database automation and scripting technology
- Staging environment is often an RDBMS
- Generally done in a centralized fashion and as often as desired, having no effect on source systems
- Solves the integration problem once and for all, for most queries
24Warehouse Development and Deployment
Two development paradigms:
- Top-down warehouse design: conceptualize the entire warehouse, then build. Tends to take 2 years or more, and requirements change too quickly.
- Bottom-up design and deployment: pivoted around completely functional subsections of the warehouse architecture. Takes about 2 months and enables modular development.
25Warehouse Development and Deployment (cont.)
- The Data Mart
- A logical subset of the complete data warehouse
- represents a completable project
- by itself is a fully functional data warehouse
- A Data Warehouse is the union of all constituent data marts.
- Enables bottom-up development
26Warehouse Development and Deployment (cont.)
- Examples of data marts in Life Science
- Sequence/Annotation - brings together sequence and annotation from public and proprietary dbs
- Expression Profiling datamart - integrates multiple TxP approaches (cDNA, oligo)
- High-throughput screening datamart - stores HTS information on proprietary high-throughput compound screens
- Clinical trial datamart - integrates clinical trial information from multiple trials
- All of these data marts are pieced together along conformed entities as they are developed, bottom up
27Advantages of Data-level Integration Through
Data Warehousing
- Integration of data occurs at the lowest level, eliminating the need for integration of queries and query results
- Run-time semantic cleaning services are no longer required - this work is performed in the data staging environment
- FAST!
- Original source systems are left completely untouched, and if they go down, the Data Warehouse still functions
- Query optimization across multiple systems' data can be performed
- Readily suitable for data mining by generic visualization tools
28Issues with Data-level Integration Through Data
Warehousing
- ETL process can take considerable time and effort
- Requires an understanding of the domain to represent relationships among objects correctly
- More scalable when accompanied by a Metadata repository which provides a layer of abstraction over the warehouse to be used by the application. Building this repository requires additional effort.
29Indexing Data Sources
- Indexes and links a large number of data sources (e.g., files, URLs)
- Data integration takes place by using the results of one query to link and jump to a keyed record in another location
- Users have the ability to develop custom applications by using a vendor-specific language
30Indexed Data Source Architecture
Index Traversal Support Mechanism
31Indexed Data Sources Pros and Cons
- Advantages
- quick to set up
- easy to understand
- achieves a basic level of integration with
minimal effort
- Disadvantages
- does not clean and normalize the data
- does not have a way to directly integrate data from relational DBMSs
- difficult to browse and mine
- sometimes requires knowledge of a vendor-specific language
32Memory-mapped Integration
- The idea behind this approach is to integrate the actual analytical data in memory and not in a relational database system
- Performance is fast since the application retrieves the data from memory rather than disk
- True data integration is achieved for the analytical data, but the descriptive or complementary data resides in separate databases
33Memory Map Architecture
(Diagram: a Data Integration Layer holds the analytical data from Sequence DB 1 and Sequence DB 2 in memory, while sample/source and other descriptive information is reached via CORBA.)
34Memory Maps Pros and Cons
- Disadvantages
- typically does not put non-analytical data (gene names, tissue types, etc.) through the ETL process
- not easily extensible when adding new databases with descriptive information
- performance hit when accessing anything outside of memory (tough to optimize)
- scalability restricted by memory limitations of the machine
- difficult to mine due to complicated architecture
- Advantages
- true analytical data integration
- quick access
- cleans analytical data
- simple matrix representation
35The Need for Metadata
- For all of the previous approaches, one underlying concept plays a critical role in their success: Metadata.
- Metadata is a concept that many people still do not fully understand. Some common questions include:
- What is it?
- Where does it come from?
- Where do you keep it?
- How is it used?
36Metadata
The data about the data
- Describes data types, relationships, joins, histories, etc.
- A layer of abstraction, much like a middle layer, except...
- Stored in the same repository as the data, accessed in a consistent database-like way (see the sketch below)
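As a small illustration of metadata living in the same repository as the data, a nomenclature-metadata table can be created and queried like any other table. The table and column names below are hypothetical.

    -- Maps physical field names to scientist-friendly display names.
    CREATE TABLE md_nomenclature (
        table_name   VARCHAR(64)  NOT NULL,
        column_name  VARCHAR(64)  NOT NULL,
        display_name VARCHAR(128) NOT NULL,
        definition   VARCHAR(512),
        PRIMARY KEY (table_name, column_name)
    );

    -- An application can build its labels and query screens directly from the metadata.
    SELECT column_name, display_name
    FROM   md_nomenclature
    WHERE  table_name = 'GE_RESULTS';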
37Metadata (cont.)
Back-end metadata - supports the developers
- Source system metadata - versions, formats, access stats, verbose information
- Business metadata - schedules, logs, procedures, definitions, maps, security
- Database metadata - data models, indexes, physical/logical design, security
Front-end metadata - supports the scientist and the application
- Nomenclature metadata - valid terms, mapping of DB field names to understandable names
- Query metadata - query templates, join specifications, views, can include back-end metadata
- Reporting/visualization metadata - template definitions, association maps, transformations
- Application security metadata - security profiles at the application level
38Metadata Benefits
- Enables the application designer to develop generic applications that grow as the data grows
- Provides a repository for the scientist to become better informed on the nature of the information in the database
- Is a high-performance alternative to developing an object-relational layer between the database and the application
- Extends gracefully as the database extends
40Integration Technologies
- Technologies that support integration efforts
- Data Interchange
- Object Brokering
- Modeling techniques
41Data Interchange
- Standards for inter-process and inter-domain communication
- Two types of data
- Data - the actual information that is being interchanged
- Metadata - the information on the structural and semantic aspects of the Data
- Examples
- EMBL format
- ASN.1
- XML
42XML Emerges
- Allows uniform description of data and metadata
- Metadata described through DTDs
- Data conforms to metadata description
- Provides open source solution for data integration between components
- Lots of support in CompSci community (proportional to cardinality of Perl modules developed)
- XML::CGI - a module to convert CGI parameters to and from XML
- XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an Object Oriented data structure with a DOM Level 1 compliant interface.
- XML::Dumper - a simple package to experiment with converting Perl data structures to XML and converting XML to Perl data structures.
- XML::Encoding - a subclass of XML::Parser, parses encoding map XML files.
- XML::Generator - an extremely simple module to help in the generation of XML.
- XML::Grove - provides simple objects for parsed XML documents. The objects may be modified but no checking is performed.
- XML::Parser - a Perl extension interface to James Clark's XML parser, expat
- XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
- XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.
43XML in Life Sciences
- Lots of momentum in Bio community
- GFF (Gene Finding Features)
- GAME (Genomic Annotation Markup Elements)
- BIOML (BioPolymer markup language)
- EBI's XML format for gene expression data
-
- Will be used to specify ontological descriptions
of Biology data
44XML DTDs
- Interchange format is defined through a DTD (Document Type Definition)
  <!ELEMENT bioxml-game:seq_relationship (bioxml-game:span, bioxml-game:alignment?)>
  <!ATTLIST bioxml-game:seq_relationship
      seq  IDREF #IMPLIED
      type (query | subject | peer | subseq) #IMPLIED >
- And data conforms to the DTD
  <seq_relationship seq="seq1" type="query">
    <span>
      <begin>10</begin>
      <end>15</end>
    </span>
  </seq_relationship>
  <seq_relationship seq="seq2" type="subject">
    <span>
      <begin>20</begin>
      <end>25</end>
    </span>
    <alignment>
      query   atgccg
      subject atgacg
    </alignment>
  </seq_relationship>
45XML Summary
Benefits
- Metadata and data have the same format
- HTML-like
- Broad support in CompSci and Biology
- Sufficiently flexible to represent any data model
- XSL style sheets map from one DTD to another
Drawbacks
- Doesn't allow for abstraction or partial inheritance
- Interchange can be slow in certain data migration tasks
46Object Brokering
- The details of data can often be encapsulated in objects
- Only the interfaces need definition
- Forget DTDs and data description
- Mechanisms for moving objects around based solely on their interfaces would allow for seamless integration
47Enter CORBA
- Common Object Request Broker Architecture
- Applications have access to method calls through IDL stubs
- A client makes a method call, which is transferred through an ORB to the object implementation
- The implementation returns its result back through the ORB
48CORBA IDL
- IDL: Interface Definition Language
- Like C++/Java headers, but with slightly more type flexibility
49CORBA Summary
Benefits
- Distributed
- Component-based architecture
- Promotes reuse
- Doesn't require knowledge of implementation
- Platform independent
Drawbacks
- Distributed
- Level of abstraction is sometimes not useful
- Can be slow to broker objects
- Different ORBs do different things
- Unreliable?
- OMG website is brutal
50Modeling Techniques
- E-R Modeling
- Optimized for transactional data
- Eliminates redundant data
- Preserves dependencies in UPDATEs
- Doesn't allow for inconsistent data
- Useful for transactional systems
- Dimensional Modeling
- Optimized for queryability and performance
- Does not eliminate redundant data, where appropriate
- Constraints unenforced
- Models data as a hypercube
- Useful for analytical systems
51Illustrating Dimensional Data Space
Nomenclature: x, y, z, and t are dimensions; temperature is a fact; the data space is a hypercube of size 4.
52Dimensional Modeling Primer
- Represents the data domain as a collection of hypercubes that share dimensions
- Allows for highly understandable data spaces
- Direct optimizations for such configurations are provided through most DBMS frameworks
- Supports data mining and statistical methods such as multi-dimensional scaling, clustering, self-organizing maps
- Ties in directly with most generalized visualization tools
- Only two types of entities - dimensions and facts
53Dimensional Modeling Primer -Relational
Representation
- Contains a table for each dimension
- Contains one central table for all facts, with a
multi-part key
- Each dimension table has a single-part primary key that corresponds to exactly one of the components of the multi-part key in the fact table (see the DDL sketch below).
The Star Schema: the basic component of Dimensional Modeling
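A minimal DDL sketch of the Temperature Fact star schema shown in the accompanying figure (X, Y, Z, and Time dimensions around a central fact table); data types and exact names are illustrative.

    CREATE TABLE x_dimension    (x_key    INTEGER PRIMARY KEY, x_value     FLOAT);
    CREATE TABLE y_dimension    (y_key    INTEGER PRIMARY KEY, y_value     FLOAT);
    CREATE TABLE z_dimension    (z_key    INTEGER PRIMARY KEY, z_value     FLOAT);
    CREATE TABLE time_dimension (time_key INTEGER PRIMARY KEY, measured_at TIMESTAMP);

    -- Central fact table: one row per (x, y, z, t) cell of the hypercube.
    -- Its multi-part primary key is composed of the four dimension foreign keys.
    CREATE TABLE temperature_fact (
        x_key       INTEGER NOT NULL REFERENCES x_dimension,
        y_key       INTEGER NOT NULL REFERENCES y_dimension,
        z_key       INTEGER NOT NULL REFERENCES z_dimension,
        time_key    INTEGER NOT NULL REFERENCES time_dimension,
        temperature FLOAT   NOT NULL,
        PRIMARY KEY (x_key, y_key, z_key, time_key)
    );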
54Dimensional Modeling Primer -Relational
Representation
- Each dimension table most often contains descriptive textual information about a particular scientific object. Dimension tables are typically the entry points into a datamart. Examples: Gene, Sample, Experiment
- The fact table relates the dimensions that surround it, expressing a many-to-many relationship. The more useful fact tables also contain facts about the relationship -- additional information not stored in any of the dimension tables.
(Star schema diagram: a central Temperature Fact table whose multi-part key is composed of foreign keys to the X, Y, Z, and Time dimension tables, each of which has its own single-part primary key. The Star Schema: the basic component of Dimensional Modeling.)
55Dimensional Modeling Primer -Relational
Representation
- Dimension tables are typically small, on the order of 100 to 100,000 records. Each record measures a physical or conceptual entity.
- The fact table is typically very large, on the order of 1,000,000 or more records. Each record measures a fact around a grouping of physical or conceptual entities.
56Dimensional Modeling Primer -Relational
Representation
- Neither dimension tables nor fact tables are necessarily normalized!
- Normalization increases complexity of design and worsens performance with joins
- Non-normalized tables can easily be understood with SELECT and GROUP BY (see the sketch below)
- Database tablespace is therefore required to be larger to store the same data - the gain in overall performance and understandability outweighs the cost of extra disks!
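For example, a denormalized star can be summarized with a single SELECT/GROUP BY. A sketch against the hypothetical temperature star introduced earlier:

    -- Average temperature per z level and time point, read straight off the star;
    -- only simple key joins to the dimension tables are needed.
    SELECT z.z_value, t.measured_at, AVG(f.temperature) AS avg_temperature
    FROM   temperature_fact f
    JOIN   z_dimension    z ON z.z_key    = f.z_key
    JOIN   time_dimension t ON t.time_key = f.time_key
    GROUP  BY z.z_value, t.measured_at;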
57Sequence Clustering
Case in Point
(E-R schema excerpt: Run (run_id, who, when, purpose), Sequence (seq_id, bases, length), and related tables.)
Show me all sequences in the same cluster as sequence XA501 from my last run.
- PROBLEMS
- not browsable (confusing)
- poor query performance
- little or no data mining support
58Dimensionally Speaking
Sequence Clustering
Show me all sequences in the same cluster as
sequence XA501 from my last run.
CONCEPTUAL IDEA - The Star Schema: a historical, denormalized, subject-oriented view of scientific facts -- the data mart. A centralized fact table stores the single scientific fact of sequence membership in a cluster and a subcluster. Smaller dimension tables around the fact table represent key scientific objects (e.g., sequence).
Membership Facts: seq_id, cluster_id, subcluster_id, run_id, paramset_id, run_date, run_initiator, seq_start, seq_end, seq_orientation, cluster_size, subcluster_size
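Against this star schema, the question above reduces to a short query. A sketch assuming the Membership Facts table as listed (here called membership_facts), with 'jdoe' standing in for the current user; names are illustrative.

    -- All sequences in the same cluster as sequence XA501 from my most recent run.
    SELECT DISTINCT m2.seq_id
    FROM   membership_facts m1
    JOIN   membership_facts m2
      ON   m2.run_id     = m1.run_id
     AND   m2.cluster_id = m1.cluster_id
    WHERE  m1.seq_id = 'XA501'
    AND    m1.run_id IN (SELECT run_id
                         FROM   membership_facts
                         WHERE  run_initiator = 'jdoe'
                         AND    run_date = (SELECT MAX(run_date)
                                            FROM   membership_facts
                                            WHERE  run_initiator = 'jdoe'));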
- Benefits
- Highly browsable, understandable model for scientists
- Vastly improved query performance
- Immediate data mining support
- Extensible database componentry model
59Dimensional Modeling - Strengths
- Predictable, standard framework allows database systems and end-user query tools to make strong assumptions about the data
- Star schemas withstand unexpected changes in user behavior -- every dimension is equivalent: symmetrically equal entry points into the fact table
- Gracefully extensible to accommodate unexpected new data elements and design decisions
- High performance, optimized for analytical queries
60The Need for Standards
- In order for any integration effort to be successful, there needs to be agreement on certain topics:
- Ontologies - concepts, objects, and their relationships
- Object models - how the ontologies are represented as objects
- Data models - how the objects and data are stored persistently
61Standard Bio-Ontologies
- Currently, there are efforts being undertaken to help identify a practical set of technologies that will aid in the knowledge management and exchange of concepts and representations in the life sciences.
- GO Consortium: http://genome-www.stanford.edu/GO/
- The third annual Bio-Ontologies meeting is being held after ISMB 2000 on August 24th.
62Standard Object Models
- Currently, there is an effort being undertaken to develop object models for the different domains in the Life Sciences. This is primarily being done by the Life Science Research (LSR) working group within the OMG (Object Management Group). Please see their homepage for further details:
- http://www.omg.org/homepages/lsr/index.html
63In Conclusion
- Data integration is the problem to solve to support human and computer discovery in the Life Sciences.
- There are a number of approaches one can take to achieve data integration.
- Each approach has advantages and disadvantages associated with it. Particular problem spaces require particular solutions.
- Regardless of the approach, Metadata is a critical component for any integrated repository.
- Many technologies exist to support integration.
- Technologies do nothing without syntactic and semantic standards.
65Accessing Integrated Data
- Once you have an integrated repository of information, access tools enable future experimental design and discovery. They can be categorized into four types:
- browsing tools
- query tools
- visualization tools
- mining tools
66Browsing
- One of the most critical, and most often overlooked, requirements is the ability to browse the integrated repository, since users typically do not know what is in it and are not familiar with other investigators' projects. Requirements include:
- ability to view summary data
- ability to view high-level descriptive information on a variety of objects (projects, genes, tissues, etc.)
- ability to dynamically build queries while browsing (using a wizard or drag-and-drop mechanism)
67Querying
- Along with browsing, retrieving data from the repository is one of the most underdeveloped areas in bioinformatics. The visualization tools that are currently available are great at visualizing data, but if users cannot get their data into these tools, how useful are they? Requirements include:
- ability to intelligently help the user build ad-hoc queries (wizard paradigm, dynamic filtering of values)
- provide a power-user interface for analysts (query templates with the ability to edit the actual SQL)
- should allow users to iterate over the queries so they do not have to build them from scratch each time
- should be tightly integrated with the browser to allow for easier query construction
68Visualizing
- There are a number of visualization tools currently available to help investigators analyze their data. Some are easier to use than others, and some are better suited for either smaller or larger data sets. Regardless, they should all:
- be easy to use
- save templates which can be used in future visualizations
- view different slices of the data simultaneously
- apply complex statistical rules and algorithms to the data to help elucidate associations and relationships
69Data Mining
- Life science has large volumes of data that, in its rawest form, is not easy to use to help drive new experimentation. Ideally, one would like to automate data mining tools to extract information by allowing them to take advantage of a predictable database architecture. This is more easily attainable using dimensional modeling (star schemas) than E-R modeling, since E-R schemas are very different from database to database and do not conform to any standard architecture (see the sketch below).
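Because every mart exposes the same fact-plus-dimensions shape, a mining or visualization tool can generate the same kind of aggregate for any star simply by substituting table and column names; the query below is a hypothetical sketch of that pattern, with illustrative names.

    -- Generic pattern: pick a fact measure and a dimension attribute, then aggregate.
    -- Here applied to a hypothetical expression mart.
    SELECT d.tissue_name,
           AVG(f.fold_change) AS mean_fold_change,
           COUNT(*)           AS n_measurements
    FROM   expression_fact f
    JOIN   tissue_dim d ON d.tissue_key = f.tissue_key
    GROUP  BY d.tissue_name
    ORDER  BY mean_fold_change DESC;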
71Database Schemas for 3 independent Genomics
systems
(Figure: the full E-R schemas of three independent genomics systems, Homology Data, Gene Expression, and SNP Data, comprising tables such as SEQUENCE, SEQUENCE_DATABASE, ALIGNMENT, SCORE, ALGORITHM, PARAMETER_SET, QUALIFIER, ORGANISM, MAP_POSITION, GE_RESULTS, CHIP, ANALYSIS, RNA_SOURCE, GENOTYPE, TREATMENT, CELL_LINE, TISSUE, DISEASE, ALLELE, SNP_FREQUENCY, SNP_METHOD, SNP_POPULATION, STS_SOURCE, PCR_PROTOCOL, PCR_BUFFER, and LINKAGE; each system uses its own keys and structure.)
72The Warehouse
Three star schemas of heterogeneous data joined through a conformed dimension
(Figure: the Gene Expression, SNP Data, and Homology Data star schemas share a conformed sequence dimension.)