Title: Data Integration in the Life Sciences
1Data Integration in the Life Sciences
- Kenneth Griffiths and Richard Resnick
2Tutorial Agenda
- 1:30 - 1:45  Introduction
- 1:45 - 2:00  Tutorial Survey
- 2:00 - 3:00  Approaches to Integration
- 3:00 - 3:05  Bio Break
- 3:05 - 4:00  Approaches to Integration (cont.)
- 4:00 - 4:15  Question and Answer
- 4:15 - 4:30  Break
- 4:30 - 5:00  Metadata Session
- 5:00 - 5:30  Domain-specific example (GxP)
- 5:30         Wrap-up
3Life Science Data
Recent focus on genetic data: genomics, the study of genes and their function.
"Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy." - The Pharmaceutical Research and Manufacturers of America, http://www.phrma.org/genomics/lexicon/g.html
- Study of genes and their function
- Understanding molecular mechanisms of disease
- Development of drugs, vaccines, and diagnostics
4The Study of Genes...
- Chromosomal location
- Sequence
- Sequence Variation
- Splicing
- Protein Sequence
- Protein Structure
5 and Their Function
- Homology
- Motifs
- Publications
- Expression
- HTS
- In Vivo/Vitro Functional Characterization
6Understanding Mechanisms of Disease
7Development of Drugs, Vaccines, Diagnostics
- Differing types of Drugs, Vaccines, and
Diagnostics - Small molecules
- Protein therapeutics
- Gene therapy
- In vitro, In vivo diagnostics
- Development requires
- Preclinical research
- Clinical trials
- Long-term clinical research
- All of which often feeds back into ongoing
Genomics research and discovery.
8The Industry's Problem
- Too much unintegrated data
- from a variety of incompatible sources
- no standard naming convention
- each with a custom browsing and querying
mechanism (no common interface)
- and poor interaction with other data sources
9What are the Data Sources?
- Flat Files
- URLs
- Proprietary Databases
- Public Databases
- Data Marts
- Spreadsheets
- Emails
-
10Sample Problem: Hyperprolactinemia
- Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
- Hyperprolactinemia is characterized by
- inappropriate milk production
- disruption of the menstrual cycle
- can lead to conception difficulty
11Understanding transcription factors for prolactin
production
Show me all genes in the public literature that
are putatively related to hyperprolactinemia,
have more than 3-fold expression differential
between hyperprolactinemic and normal pituitary
cells, and are homologous to known transcription
factors.
(Q1 ∩ Q2 ∩ Q3)
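Once the three domains are integrated, this question can be expressed as a single query. Below is a minimal SQL sketch, assuming hypothetical integrated tables gene, literature_gene_link, expression_result, and homology_hit that share a gene_id key; the table and column names are illustrative, not part of any schema in this tutorial.

    -- Q1: genes putatively linked to hyperprolactinemia in the public literature
    -- Q2: genes with a more than 3-fold expression differential
    --     between hyperprolactinemic and normal pituitary cells
    -- Q3: genes homologous to known transcription factors
    SELECT DISTINCT g.gene_id, g.gene_name
    FROM   gene g
    JOIN   literature_gene_link l ON l.gene_id = g.gene_id
    JOIN   expression_result    e ON e.gene_id = g.gene_id
    JOIN   homology_hit         h ON h.gene_id = g.gene_id
    WHERE  l.disease_term = 'hyperprolactinemia'
    AND    e.comparison   = 'hyperprolactinemic_vs_normal_pituitary'
    AND    e.fold_change  > 3
    AND    h.target_class = 'transcription factor';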
13Approaches to Integration
- In order to ask this type of question across
multiple domains, data integration at some level
is necessary. When discussing the different
approaches to data integration, a number of key
issues need to be addressed:
- Accessing the original data sources
- Handling redundant as well as missing data
- Normalizing analytical data from different data sources
- Conforming terminology to industry standards
- Accessing the integrated data as a single logical repository
- Metadata (used to traverse domains)
14Approaches to Integration (cont.)
- So if one agrees that the preceding issues are
important, where are they addressed? In the
client application, the middleware, or the
database? Where they are addressed can make a
huge difference in usability and performance.
Currently there are a number of approaches for
data integration:
- Federated Databases
- Data Warehousing
- Indexed Data Sources
- Memory-mapped Data Structures
15Federated Database Approach
Show me all genes that are homologous to known
transcription factors
Show me all genes that have more than 3-fold
expression differential between
hyperprolactinemic and normal cells
Show me all genes in the public literature that
are putatively related to hyperprolactinemia
16Advantages to Federated Database Approach
- quick to configure
- architecture is easy to understand
- no knowledge of the domain is necessary
- achieves a basic level of integration with minimal effort
- can wrap and plug in new data sources as they come into existence
17Problems with Federated Database Approach
- Integration of queries and query results occurs at the integrated application level, requiring complex low-level logic to be embedded at the highest level
- Naming conventions across systems must be adhered to or query results will be inaccurate - imposes constraints on original data sources
- Data sources are not necessarily clean; integrating dirty data makes integrated dirty data
- No query optimization across multiple systems can be performed
- If one source system goes down, the entire integrated application may fail
- Not readily suitable for data mining or generic visualization tools
- Relies on CORBA or other middleware technology, shown to have performance (and reliability?) problems
18Solving Federated Database Problems
(Diagram: a Semantic Cleaning Layer sits on top of the middleware (CORBA, DCOM, etc.), which wraps LITERATURE sources such as PubMed, Medline, and a proprietary application.)
19Data Warehousing for Integration
- Data warehousing is a process as much as it is a repository. There are a few primary concepts behind data warehousing:
- ETL (Extraction, Transformation, Load)
- Component-based (datamarts)
- Typically utilizes a dimensional model
- Metadata-driven
20Data Warehousing
E (Extraction) T (Transformation) L (Load)
21Data-level Integration Through Data Warehousing
22Data Staging
- Storage area and set of processes that
- extracts source data
- transforms data
- cleans incorrect data, resolves missing elements, conforms to standards
- purges fields not needed
- combines data sources
- creates surrogate keys for data to avoid dependence on legacy keys (see the sketch after this list)
- builds aggregates where needed
- archives/logs
- loads and indexes data
- Does not provide query or presentation services
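As a concrete illustration of the surrogate-key and conformance steps above, here is a minimal SQL sketch. The staging table stage_sequence, the dimension dim_gene, and the sequence gene_key_seq are hypothetical, and sequence/identity syntax varies by RDBMS.

    -- Load cleaned, conformed gene records into the warehouse dimension,
    -- assigning a surrogate key so the warehouse does not depend on legacy keys.
    INSERT INTO dim_gene (gene_key, source_system, source_gene_id, gene_symbol)
    SELECT gene_key_seq.NEXTVAL,          -- surrogate key
           s.source_system,
           s.source_gene_id,              -- legacy key retained only for lineage
           UPPER(TRIM(s.gene_symbol))     -- conform terminology to a standard form
    FROM   stage_sequence s
    WHERE  s.gene_symbol IS NOT NULL      -- skip records with missing elements
    AND    NOT EXISTS (SELECT 1
                       FROM   dim_gene d
                       WHERE  d.source_system  = s.source_system
                       AND    d.source_gene_id = s.source_gene_id);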
23Data Staging (cont.)
- Sixty to seventy percent of development is here
- Engineering is generally done using database automation and scripting technology
- Staging environment is often an RDBMS
- Generally done in a centralized fashion and as often as desired, having no effect on source systems
- Solves the integration problem once and for all, for most queries
24Warehouse Development and Deployment
Two development paradigms:
- Top-down warehouse design: conceptualize the entire warehouse, then build. Tends to take 2 years or more, and requirements change too quickly.
- Bottom-up design and deployment: pivoted around completely functional subsections of the warehouse architecture. Takes about 2 months and enables modular development.
25Warehouse Development and Deployment (cont.)
- The Data Mart
- A logical subset of the complete data warehouse
- represents a completable project
- by itself is a fully functional data warehouse
- A Data Warehouse is the union of all constituent data marts.
- Enables bottom-up development
26Warehouse Development and Deployment (cont.)
- Examples of data marts in Life Science
- Sequence/Annotation - brings together sequence and annotation from public and proprietary dbs
- Expression Profiling datamart - integrates multiple TxP approaches (cDNA, oligo)
- High-throughput screening datamart - stores HTS information on proprietary high-throughput compound screens
- Clinical trial datamart - integrates clinical trial information from multiple trials
- All of these data marts are pieced together along conformed entities as they are developed, bottom up
27Advantages of Data-level Integration Through
Data Warehousing
- Integration of data occurs at the lowest level, eliminating the need for integration of queries and query results
- Run-time semantic cleaning services are no longer required - this work is performed in the data staging environment
- FAST!
- Original source systems are left completely untouched, and if they go down, the Data Warehouse still functions
- Query optimization across multiple systems' data can be performed
- Readily suitable for data mining by generic visualization tools
28Issues with Data-level Integration Through Data
Warehousing
- ETL process can take considerable time and effort
- Requires an understanding of the domain to represent relationships among objects correctly
- More scalable when accompanied by a Metadata repository which provides a layer of abstraction over the warehouse to be used by the application. Building this repository requires additional effort.
29Indexing Data Sources
- Indexes and links a large number of data sources (e.g., files, URLs)
- Data integration takes place by using the results of one query to link and jump to a keyed record in another location
- Users have the ability to develop custom applications by using a vendor-specific language
30Indexed Data Source Architecture
Index Traversal Support Mechanism
31Indexed Data Sources Pros and Cons
- Advantages
- quick to set up
- easy to understand
- achieves a basic level of integration with
minimal effort
- Disadvantages
- does not clean and normalize the data
- does not have a way to directly integrate data from relational DBMSs
- difficult to browse and mine
- sometimes requires knowledge of a vendor-specific language
32Memory-mapped Integration
- The idea behind this approach is to integrate the actual analytical data in memory and not in a relational database system
- Performance is fast since the application retrieves the data from memory rather than disk
- True data integration is achieved for the analytical data, but the descriptive or complementary data resides in separate databases
33Memory Map Architecture
(Diagram: a Data Integration Layer holds the analytical data from Sequence DB 1 and Sequence DB 2 in memory, while sample/source and other descriptive information is reached via CORBA.)
34Memory Maps Pros and Cons
- Disadvantages
- typically does not put non-analytical data (gene names, tissue types, etc.) through the ETL process
- not easily extensible when adding new databases with descriptive information
- performance hit when accessing anything outside of memory (tough to optimize)
- scalability restricted by memory limitations of the machine
- difficult to mine due to complicated architecture
- Advantages
- true analytical data integration
- quick access
- cleans analytical data
- simple matrix representation
35The Need for Metadata
- For all of the previous approaches, one underlying concept plays a critical role in their success: Metadata.
- Metadata is a concept that many people still do not fully understand. Some common questions include:
- What is it?
- Where does it come from?
- Where do you keep it?
- How is it used?
36Metadata
The data about the data
- Describes data types, relationships, joins, histories, etc.
- A layer of abstraction, much like a middle layer, except...
- Stored in the same repository as the data, accessed in a consistent database-like way (see the sketch below)
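As a small illustration of metadata living in the same repository as the data, a nomenclature-metadata table can be created and queried like any other table. The table and column names below are hypothetical.

    -- Maps physical field names to scientist-friendly display names.
    CREATE TABLE md_nomenclature (
        table_name   VARCHAR(64)  NOT NULL,
        column_name  VARCHAR(64)  NOT NULL,
        display_name VARCHAR(128) NOT NULL,
        definition   VARCHAR(512),
        PRIMARY KEY (table_name, column_name)
    );

    -- An application can build its labels and query screens directly from the metadata.
    SELECT column_name, display_name
    FROM   md_nomenclature
    WHERE  table_name = 'GE_RESULTS';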
37Metadata (cont.)
Back-end metadata - supports the developers
- Source system metadata - versions, formats, access stats, verbose information
- Business metadata - schedules, logs, procedures, definitions, maps, security
- Database metadata - data models, indexes, physical/logical design, security
Front-end metadata - supports the scientist and the application
- Nomenclature metadata - valid terms, mapping of DB field names to understandable names
- Query metadata - query templates, join specifications, views, can include back-end metadata
- Reporting/visualization metadata - template definitions, association maps, transformations
- Application security metadata - security profiles at the application level
38Metadata Benefits
- Enables the application designer to develop generic applications that grow as the data grows
- Provides a repository for the scientist to become better informed on the nature of the information in the database
- Is a high-performance alternative to developing an object-relational layer between the database and the application
- Extends gracefully as the database extends
40Integration Technologies
- Technologies that support integration efforts
- Data Interchange
- Object Brokering
- Modeling techniques
41Data Interchange
- Standards for inter-process and inter-domain communication
- Two types of data
- Data - the actual information that is being interchanged
- Metadata - the information on the structural and semantic aspects of the Data
- Examples
- EMBL format
- ASN.1
- XML
42XML Emerges
- Allows uniform description of data and metadata
- Metadata described through DTDs
- Data conforms to metadata description
- Provides open source solution for data integration between components
- Lots of support in CompSci community (proportional to cardinality of Perl modules developed)
- XML::CGI - a module to convert CGI parameters to and from XML
- XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an Object Oriented data structure with a DOM Level 1 compliant interface.
- XML::Dumper - a simple package to experiment with converting Perl data structures to XML and converting XML to Perl data structures.
- XML::Encoding - a subclass of XML::Parser, parses encoding map XML files.
- XML::Generator - an extremely simple module to help in the generation of XML.
- XML::Grove - provides simple objects for parsed XML documents. The objects may be modified but no checking is performed.
- XML::Parser - a Perl extension interface to James Clark's XML parser, expat
- XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
- XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.
43XML in Life Sciences
- Lots of momentum in Bio community
- GFF (Gene Finding Features)
- GAME (Genomic Annotation Markup Elements)
- BIOML (BioPolymer markup language)
- EBI's XML format for gene expression data
-
- Will be used to specify ontological descriptions
of Biology data
44XML DTDs
- Interchange format is defined through a DTD (Document Type Definition)
  <!ELEMENT bioxml-game:seq_relationship (bioxml-game:span, bioxml-game:alignment?)>
  <!ATTLIST bioxml-game:seq_relationship
      seq  IDREF #IMPLIED
      type (query | subject | peer | subseq) #IMPLIED >
- And data conforms to the DTD
  <seq_relationship seq="seq1" type="query">
    <span>
      <begin>10</begin>
      <end>15</end>
    </span>
  </seq_relationship>
  <seq_relationship seq="seq2" type="subject">
    <span>
      <begin>20</begin>
      <end>25</end>
    </span>
    <alignment>
      query   atgccg
      subject atgacg
    </alignment>
  </seq_relationship>
45XML Summary
Benefits
- Metadata and data have the same format
- HTML-like
- Broad support in CompSci and Biology
- Sufficiently flexible to represent any data model
- XSL style sheets map from one DTD to another
Drawbacks
- Doesn't allow for abstraction or partial inheritance
- Interchange can be slow in certain data migration tasks
46Object Brokering
- The details of data can often be encapsulated in objects
- Only the interfaces need definition
- Forget DTDs and data description
- Mechanisms for moving objects around based solely on their interfaces would allow for seamless integration
47Enter CORBA
- Common Object Request Broker Architecture
- Applications have access to method calls through IDL stubs
- A client makes a method call, which is transferred through an ORB to the object implementation
- The implementation returns its result back through the ORB
48CORBA IDL
- IDL: Interface Definition Language
- Like C++/Java headers, but with slightly more type flexibility
49CORBA Summary
Benefits
- Distributed
- Component-based architecture
- Promotes reuse
- Doesn't require knowledge of implementation
- Platform independent
Drawbacks
- Distributed
- Level of abstraction is sometimes not useful
- Can be slow to broker objects
- Different ORBs do different things
- Unreliable?
- OMG website is brutal
50Modeling Techniques
- E-R Modeling
- Optimized for transactional data
- Eliminates redundant data
- Preserves dependencies in UPDATEs
- Doesn't allow for inconsistent data
- Useful for transactional systems
- Dimensional Modeling
- Optimized for queryability and performance
- Does not eliminate redundant data, where appropriate
- Constraints unenforced
- Models data as a hypercube
- Useful for analytical systems
51Illustrating Dimensional Data Space
Nomenclature: x, y, z, and t are dimensions; temperature is a fact; the data space is a hypercube of size 4.
52Dimensional Modeling Primer
- Represents the data domain as a collection of hypercubes that share dimensions
- Allows for highly understandable data spaces
- Direct optimizations for such configurations are provided through most DBMS frameworks
- Supports data mining and statistical methods such as multi-dimensional scaling, clustering, self-organizing maps
- Ties in directly with most generalized visualization tools
- Only two types of entities - dimensions and facts
53Dimensional Modeling Primer -Relational
Representation
- Contains a table for each dimension
- Contains one central table for all facts, with a
multi-part key
- Each dimension table has a single-part primary key that corresponds to exactly one of the components of the multi-part key in the fact table (see the DDL sketch below).
The Star Schema: the basic component of Dimensional Modeling
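A minimal DDL sketch of the Temperature Fact star schema shown in the accompanying figure (X, Y, Z, and Time dimensions around a central fact table); data types and exact names are illustrative.

    CREATE TABLE x_dimension    (x_key    INTEGER PRIMARY KEY, x_value     FLOAT);
    CREATE TABLE y_dimension    (y_key    INTEGER PRIMARY KEY, y_value     FLOAT);
    CREATE TABLE z_dimension    (z_key    INTEGER PRIMARY KEY, z_value     FLOAT);
    CREATE TABLE time_dimension (time_key INTEGER PRIMARY KEY, measured_at TIMESTAMP);

    -- Central fact table: one row per (x, y, z, t) cell of the hypercube.
    -- Its multi-part primary key is composed of the four dimension foreign keys.
    CREATE TABLE temperature_fact (
        x_key       INTEGER NOT NULL REFERENCES x_dimension,
        y_key       INTEGER NOT NULL REFERENCES y_dimension,
        z_key       INTEGER NOT NULL REFERENCES z_dimension,
        time_key    INTEGER NOT NULL REFERENCES time_dimension,
        temperature FLOAT   NOT NULL,
        PRIMARY KEY (x_key, y_key, z_key, time_key)
    );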
54Dimensional Modeling Primer -Relational
Representation
- Each dimension table most often contains descriptive textual information about a particular scientific object. Dimension tables are typically the entry points into a datamart. Examples: Gene, Sample, Experiment
- The fact table relates the dimensions that surround it, expressing a many-to-many relationship. The more useful fact tables also contain facts about the relationship -- additional information not stored in any of the dimension tables.
(Star schema diagram: a central Temperature Fact table whose multi-part key is composed of foreign keys to the X, Y, Z, and Time dimension tables, each of which has its own single-part primary key. The Star Schema: the basic component of Dimensional Modeling.)
55Dimensional Modeling Primer -Relational
Representation
- Dimension tables are typically small, on the order of 100 to 100,000 records. Each record measures a physical or conceptual entity.
- The fact table is typically very large, on the order of 1,000,000 or more records. Each record measures a fact around a grouping of physical or conceptual entities.
56Dimensional Modeling Primer -Relational
Representation
- Neither dimension tables nor fact tables are necessarily normalized!
- Normalization increases complexity of design and worsens performance with joins
- Non-normalized tables can easily be understood with SELECT and GROUP BY (see the sketch below)
- Database tablespace is therefore required to be larger to store the same data - the gain in overall performance and understandability outweighs the cost of extra disks!
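For example, a denormalized star can be summarized with a single SELECT/GROUP BY. A sketch against the hypothetical temperature star introduced earlier:

    -- Average temperature per z level and time point, read straight off the star;
    -- only simple key joins to the dimension tables are needed.
    SELECT z.z_value, t.measured_at, AVG(f.temperature) AS avg_temperature
    FROM   temperature_fact f
    JOIN   z_dimension    z ON z.z_key    = f.z_key
    JOIN   time_dimension t ON t.time_key = f.time_key
    GROUP  BY z.z_value, t.measured_at;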
57Sequence Clustering
Case in Point
(E-R schema excerpt: Run (run_id, who, when, purpose), Sequence (seq_id, bases, length), and related tables.)
Show me all sequences in the same cluster as sequence XA501 from my last run.
- PROBLEMS
- not browsable (confusing)
- poor query performance
- little or no data mining support
58Dimensionally Speaking
Sequence Clustering
Show me all sequences in the same cluster as
sequence XA501 from my last run.
CONCEPTUAL IDEA - The Star Schema: a historical, denormalized, subject-oriented view of scientific facts -- the data mart. A centralized fact table stores the single scientific fact of sequence membership in a cluster and a subcluster. Smaller dimension tables around the fact table represent key scientific objects (e.g., sequence).
Membership Facts: seq_id, cluster_id, subcluster_id, run_id, paramset_id, run_date, run_initiator, seq_start, seq_end, seq_orientation, cluster_size, subcluster_size
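Against this star schema, the question above reduces to a short query. A sketch assuming the Membership Facts table as listed (here called membership_facts), with 'jdoe' standing in for the current user; names are illustrative.

    -- All sequences in the same cluster as sequence XA501 from my most recent run.
    SELECT DISTINCT m2.seq_id
    FROM   membership_facts m1
    JOIN   membership_facts m2
      ON   m2.run_id     = m1.run_id
     AND   m2.cluster_id = m1.cluster_id
    WHERE  m1.seq_id = 'XA501'
    AND    m1.run_id IN (SELECT run_id
                         FROM   membership_facts
                         WHERE  run_initiator = 'jdoe'
                         AND    run_date = (SELECT MAX(run_date)
                                            FROM   membership_facts
                                            WHERE  run_initiator = 'jdoe'));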
- Benefits
- Highly browsable, understandable model for scientists
- Vastly improved query performance
- Immediate data mining support
- Extensible database componentry model
59Dimensional Modeling - Strengths
- Predictable, standard framework allows database systems and end-user query tools to make strong assumptions about the data
- Star schemas withstand unexpected changes in user behavior -- every dimension is equivalent: symmetrically equal entry points into the fact table
- Gracefully extensible to accommodate unexpected new data elements and design decisions
- High performance, optimized for analytical queries
60The Need for Standards
- In order for any integration effort to be successful, there needs to be agreement on certain topics:
- Ontologies - concepts, objects, and their relationships
- Object models - how the ontologies are represented as objects
- Data models - how the objects and data are stored persistently
61Standard Bio-Ontologies
- Currently, there are efforts being undertaken to help identify a practical set of technologies that will aid in the knowledge management and exchange of concepts and representations in the life sciences.
- GO Consortium: http://genome-www.stanford.edu/GO/
- The third annual Bio-Ontologies meeting is being held after ISMB 2000 on August 24th.
62Standard Object Models
- Currently, there is an effort being undertaken to develop object models for the different domains in the Life Sciences. This is primarily being done by the Life Science Research (LSR) working group within the OMG (Object Management Group). Please see their homepage for further details:
- http://www.omg.org/homepages/lsr/index.html
63In Conclusion
- Data integration is the problem to solve to support human and computer discovery in the Life Sciences.
- There are a number of approaches one can take to achieve data integration.
- Each approach has advantages and disadvantages associated with it. Particular problem spaces require particular solutions.
- Regardless of the approach, Metadata is a critical component for any integrated repository.
- Many technologies exist to support integration.
- Technologies do nothing without syntactic and semantic standards.
65Accessing Integrated Data
- Once you have an integrated repository of information, access tools enable future experimental design and discovery. They can be categorized into four types:
- browsing tools
- query tools
- visualization tools
- mining tools
66Browsing
- One of the most critical, and most often overlooked, requirements is the ability to browse the integrated repository, since users typically do not know what is in it and are not familiar with other investigators' projects. Requirements include:
- ability to view summary data
- ability to view high-level descriptive information on a variety of objects (projects, genes, tissues, etc.)
- ability to dynamically build queries while browsing (using a wizard or drag-and-drop mechanism)
67Querying
- Along with browsing, retrieving data from the repository is one of the most underdeveloped areas in bioinformatics. The visualization tools that are currently available are great at visualizing data, but if users cannot get their data into these tools, how useful are they? Requirements include:
- ability to intelligently help the user build ad-hoc queries (wizard paradigm, dynamic filtering of values)
- provide a power-user interface for analysts (query templates with the ability to edit the actual SQL)
- should allow users to iterate over the queries so they do not have to build them from scratch each time
- should be tightly integrated with the browser to allow for easier query construction
68Visualizing
- There are a number of visualization tools currently available to help investigators analyze their data. Some are easier to use than others, and some are better suited for either smaller or larger data sets. Regardless, they should all:
- be easy to use
- save templates which can be used in future visualizations
- view different slices of the data simultaneously
- apply complex statistical rules and algorithms to the data to help elucidate associations and relationships
69Data Mining
- Life science has large volumes of data that, in its rawest form, is not easy to use to help drive new experimentation. Ideally, one would like to automate data mining tools to extract information by allowing them to take advantage of a predictable database architecture. This is more easily attainable using dimensional modeling (star schemas) than E-R modeling, since E-R schemas are very different from database to database and do not conform to any standard architecture (see the sketch below).
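Because every mart exposes the same fact-plus-dimensions shape, a mining or visualization tool can generate the same kind of aggregate for any star simply by substituting table and column names; the query below is a hypothetical sketch of that pattern, with illustrative names.

    -- Generic pattern: pick a fact measure and a dimension attribute, then aggregate.
    -- Here applied to a hypothetical expression mart.
    SELECT d.tissue_name,
           AVG(f.fold_change) AS mean_fold_change,
           COUNT(*)           AS n_measurements
    FROM   expression_fact f
    JOIN   tissue_dim d ON d.tissue_key = f.tissue_key
    GROUP  BY d.tissue_name
    ORDER  BY mean_fold_change DESC;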
71Database Schemas for 3 independent Genomics
systems
(Figure: the full E-R schemas of three independent genomics systems, Homology Data, Gene Expression, and SNP Data, comprising tables such as SEQUENCE, SEQUENCE_DATABASE, ALIGNMENT, SCORE, ALGORITHM, PARAMETER_SET, QUALIFIER, ORGANISM, MAP_POSITION, GE_RESULTS, CHIP, ANALYSIS, RNA_SOURCE, GENOTYPE, TREATMENT, CELL_LINE, TISSUE, DISEASE, ALLELE, SNP_FREQUENCY, SNP_METHOD, SNP_POPULATION, STS_SOURCE, PCR_PROTOCOL, PCR_BUFFER, and LINKAGE; each system uses its own keys and structure.)
72The Warehouse
Three star schemas of heterogeneous data joined through a conformed dimension
(Figure: the Gene Expression, SNP Data, and Homology Data star schemas share a conformed sequence dimension.)