Title: Databases for Microarrays
1Databases for Microarrays
- Vidhya Jagannathan
- SIB, Lausanne
2Overview
- Microarray data in a nutshell
- Why databases?
- What data to represent?
- What is a database?
- Different data models
- E-R modelling
- Microarray Databases
- Standards being developed
3Microarray Experiment
4Microarray Data in a Nutshell
- Lots of data to be managed before and after the
experiment.
- Data to be stored before the experiment:
- Description of the array and the sample.
- Direct access to all the cDNA and gene sequences,
annotations, and physical DNA resources.
- Data to be stored after the experiment:
- Raw data: scanned images.
- Gene expression matrix: relative expression
levels observed at various sites on the array.
- Hence database software capable of dealing with
large volumes of numeric and image data is
required.
5Why Databases?
- Tailored to datatype
- Tailored to the Scientists
- Intuitive ways to query the data
- Diagrams, forms, point and click, text etc.
- Support for efficient answering of queries.
- Query optimisation, indexes, compact physical
storage.
6Data Representation
- Goal: represent data in an intuitive and
convenient manner
- Without unnecessary replication of information
- Making it easy to write queries to find required
information
- Supporting efficient retrieval of required
information
7What is a Database?
- A database is an organised collection of pieces
of structured electronic information.
- Example 1: Libraries use a database system to
keep track of library inventory and loans.
- Example 2: Airlines use a database system to
manage their flights and reservations.
- A collection of records kept for a common
purpose such as these is known as a database.
- The records of the database normally reside on a
hard disk and are retrieved into computer memory
only when they are accessed.
- So it is clear why we need to discuss
microarray databases.
8Data Models
- Describes a container for data and methods to
store and retrieve data from that container.
- Abstract math algorithms and concepts.
- Cannot touch a data model.
- Very useful
9Types of Data Models
- Ad-hoc file formats (not really data models!)
- Relational data model
- Object-relational data model
- Object-oriented data model
- XML (Extensible Markup Language)
10Ad-hoc File Formats
- The various 'ad-hoc' file formats in use for
microarray data are:
- Flat file formats.
- Spreadsheet formats.
- Last but not least: even MS-Word documents!
- A very rudimentary method of storing data.
- Sometimes contain redundant information.
- Extremely inefficient for retrieval of particular
subsets of the results.
11Relational Data Model
- The most prevalent model, used in many databases
developed today.
- A collection of related information is
represented as a set of tables.
- Each data value is stored at the intersection of
a row and a column.
- Values in a column are all of the same kind, which
gives a simple form of data validation.
- Rows are unique, so there is no data redundancy;
every row is meaningful and can be identified by
its unique key.
- Uses Structured Query Language (SQL) for data
storage, retrieval and manipulation.
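The points above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The gene table, its columns and its values are hypothetical and chosen only for illustration; any relational database would behave similarly.

```python
import sqlite3

# In-memory database; the "gene" table and its contents are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene ("
    " gene_id TEXT PRIMARY KEY,"  # unique key identifying each row
    " name TEXT NOT NULL,"
    " chromosome TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("G1", "geneA", "chr1"), ("G2", "geneB", "chr2")],
)

# Retrieval with SQL: each value sits at the intersection of a row and a column.
rows = conn.execute(
    "SELECT gene_id, name FROM gene WHERE chromosome = 'chr1'"
).fetchall()

# A second row with an existing key is rejected, so rows stay unique.
try:
    conn.execute("INSERT INTO gene VALUES ('G1', 'duplicate', 'chr3')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The primary key is what enforces row uniqueness here: the second insert with gene_id G1 fails instead of creating a redundant row.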
12Terminology
- Relation or Table
- Attributes or Columns
- Records or Rows
13Example
Table
Row or Record
Field or Column
14Advantages of Relational Model
- Allows information to be broken up into logical
units and stored in tables.
- Allows combining data from different tables in
different ways to derive useful information.
- Great for queries involving information from
multiple original sources.
- Can easily gather related information.
- e.g. information about a particular gene from
multiple datasets/experiments
15Example - ArrayExpress
16Database Design Entity-Relationship Concept
Relationship
Entity A
Entity B
Examples
17Entities
- are real-world objects
- e.g. gene
- contain attributes
- e.g. gene_id, sequence
- are drawn as rectangular boxes that hold the name
of the entity and its attributes; two different
notations are shown, as there is no single standard
Gene
notation 2
notation 1
18Relationship
- Relationships provide connections between two or
more entities
- e.g. which genes were used in which experiment
- When two entities are involved in a relationship,
it is known as a binary relationship.
- When three entities are involved in a
relationship, it is called a ternary relationship.
- When more than three entities are involved in a
relationship, it is usually broken into one or
more binary or ternary relationships.
- are drawn as a line linking the involved entities,
as in
used_in
Gene
Experiment
19Connectivity and Cardinality
- Connectivity: describes the mapping of
associated entity instances in a relationship
- one or many
- Cardinality: the actual number of related
occurrences for each of the two entities
- one-to-one, one-to-many, many-to-many
20Example of one to one relationship
21Example of one to many relationship
22Example of many to many relationship (unresolved)
23Example of many to many relationship (linking
table)
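A many-to-many relationship resolved through a linking table can be sketched as follows. The gene, experiment and used_in tables are assumptions modelled on the Gene / used_in / Experiment example from the Relationship slide; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
CREATE TABLE gene (gene_id TEXT PRIMARY KEY);
CREATE TABLE experiment (experiment_id INTEGER PRIMARY KEY);
-- Linking table: one row per (gene, experiment) pair.
CREATE TABLE used_in (
    gene_id TEXT NOT NULL REFERENCES gene(gene_id),
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    PRIMARY KEY (gene_id, experiment_id)
);
INSERT INTO gene VALUES ('G1'), ('G2');
INSERT INTO experiment VALUES (1), (2);
INSERT INTO used_in VALUES ('G1', 1), ('G1', 2), ('G2', 2);
""")

# The relationship can be followed in both directions through the linking table.
experiments_for_g1 = [r[0] for r in conn.execute(
    "SELECT experiment_id FROM used_in WHERE gene_id = 'G1' ORDER BY experiment_id")]
genes_in_exp2 = [r[0] for r in conn.execute(
    "SELECT gene_id FROM used_in WHERE experiment_id = 2 ORDER BY gene_id")]
```

Each side of the relationship keeps its own table, and the composite primary key of used_in prevents the same pair from being recorded twice.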
24Example E-R Diagram
Expt-Exptr
Expt-Sample
Multivalued attribute
Notation
Expt-Array
Many-to-one
25Transforming E-R to Relational Database
- Entities and relationships are translated to
relations or tables
- Attributes of an entity are translated to columns
or fields
- The identifying attribute forms the primary key
of a table
- An instance of an entity becomes a record or row
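As a rough sketch of these translation rules, the many-to-one Expt-Array relationship from the example diagram could be mapped as below. The table names, column names and rows are hypothetical, since the original diagram is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Entity -> table, attribute -> column, identifying attribute -> primary key.
CREATE TABLE array_design (
    array_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Many-to-one relationship -> foreign key column on the "many" side.
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    description TEXT,
    array_id INTEGER NOT NULL REFERENCES array_design(array_id)
);
-- Entity instances -> rows.
INSERT INTO array_design VALUES (10, 'hypothetical chip');
INSERT INTO experiment VALUES (1, 'heat shock', 10), (2, 'control', 10);
""")

per_array = conn.execute(
    "SELECT array_id, COUNT(*) FROM experiment GROUP BY array_id").fetchall()

# An experiment that refers to a non-existent array is rejected.
try:
    conn.execute("INSERT INTO experiment VALUES (3, 'orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```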
26E-R Diagram to Relational Schema
Multivalued attribute
27Object Oriented Model
- The object-oriented model allows real-world data to
be represented as objects.
- Objects encapsulate the data and provide methods
to access or manipulate it.
- Objects with a specific structure and set of
methods are said to belong to an object class.
- Allows new classes to be created by extending
the description of the parent class.
- Child classes inherit the data and methods of the
parent class.
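Encapsulation and inheritance can be illustrated with two small Python classes. Sample and TumorSample are hypothetical names used only for this sketch; they are not taken from any particular object database.

```python
class Sample:
    """Parent class: encapsulates its data and exposes a method to access it."""

    def __init__(self, sample_id, organism):
        self._sample_id = sample_id
        self._organism = organism

    def describe(self):
        return f"{self._sample_id} ({self._organism})"


class TumorSample(Sample):
    """Child class: inherits data and methods and extends the parent description."""

    def __init__(self, sample_id, organism, tumor_type):
        super().__init__(sample_id, organism)
        self._tumor_type = tumor_type

    def describe(self):
        # Reuse the inherited method, then add the extra attribute.
        return f"{super().describe()}, tumor type: {self._tumor_type}"


plain = Sample("S1", "E. coli")
tumor = TumorSample("S2", "H. sapiens", "glioma")
```

A TumorSample is still a Sample, so code written against the parent class can handle both kinds of object.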
28Example
OODBMS
29Object relational data model
- Improves the relational model by adding some
features from object data models.
- Information is represented as in relational
models, but column values are not restricted to a
single value: multiple values are allowed.
- Example (sample table in previous slides)
30Queries, queries, queries!!
- Given a collection of microarray-generated gene
expression data, what kinds of questions do users
wish to pose?
- Constructing an extensive list of possible
interesting queries and data mining problems that
have to be supported by the database will
facilitate the design process.
31Queries, queries, queries!!
- Queries on the data
- Which genes are linked?
- Which genes are expressed similarly to my gene
XYZ?
- Which genes have changed expression in a
second condition?
- Which genes are co-expressed in differing
conditions?
- Classification (of tumors, diseased tissues,
etc.): which patterns are characteristic of a
certain class of samples, and which genes are
involved?
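One possible way to answer "which genes are expressed similarly to my gene XYZ?" is to rank genes by the Pearson correlation of their expression profiles. This is only a sketch: the choice of Pearson correlation, the gene names and the expression values are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical expression matrix: gene -> levels across four conditions.
expression = {
    "XYZ": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],  # same trend as XYZ
    "geneB": [4.0, 3.0, 2.0, 1.0],  # opposite trend
    "geneC": [1.0, 3.0, 2.0, 4.0],  # partly similar
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def most_similar(query, matrix):
    """Other genes, ordered from most to least correlated with the query gene."""
    scores = {gene: pearson(matrix[query], profile)
              for gene, profile in matrix.items() if gene != query}
    return sorted(scores, key=scores.get, reverse=True)


ranking = most_similar("XYZ", expression)
```

A database designed with this kind of query in mind needs fast access to whole expression profiles, not just to single values.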
32More Queries !!!
- Queries that link in additional knowledge
- Functional classification of genes: are changes
clustered in particular classes?
- Metabolic pathway information: is a certain
pathway, or a route in a pathway, affected?
- Disease information: clinical follow-up and its
correlation to expression patterns.
- Phenotype information for mutants: are there
correlations between particular phenotypes and
expression patterns?
33More Queries !!!
- In what region of the genome is the interesting
gene located?
- Is there synteny in this region with other
species?
- Is there a known trait that maps to this region?
34Query Language
- The language in which a user requests information
from the database.
- SQL
- Data definition helps you implement your model;
data manipulation helps you modify and retrieve
data.
- Advantages
- Can specify a query declaratively and let the
database system figure out the best way of finding
answers
- Supports queries of medium complexity
- Specialized languages
- SQL statements are not abstract but very close to
spoken language.
35Basic SQL Queries
- Find the image for experiment number 1345
- select image from experiment where
experiment_id = 1345
- Find the experiment ID and image of all
experiments involving E. coli
- select experiment_id, image from experiment,
sample where experiment.sample_id =
sample.sample_id and sample.organism = 'e.coli'
- All combinations of rows from the relations in
the from clause are considered, and those that
satisfy the where conditions are output
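The two queries can be run end to end with Python's built-in sqlite3 module. In this sketch the table layouts, the underscore spelling of the column names and all rows are assumptions; only the shape of the queries follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    organism TEXT NOT NULL
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    image TEXT NOT NULL,
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id)
);
INSERT INTO sample VALUES (1, 'e.coli'), (2, 'yeast');
INSERT INTO experiment VALUES
    (1345, 'scan_1345.tif', 1),
    (1346, 'scan_1346.tif', 2),
    (1347, 'scan_1347.tif', 1);
""")

# Query 1: the image for experiment number 1345.
image = conn.execute(
    "SELECT image FROM experiment WHERE experiment_id = ?", (1345,)
).fetchone()[0]

# Query 2: experiment ID and image of all experiments involving e.coli.
# Row combinations from both tables are filtered by the where conditions.
ecoli = conn.execute(
    "SELECT experiment.experiment_id, experiment.image "
    "FROM experiment, sample "
    "WHERE experiment.sample_id = sample.sample_id "
    "AND sample.organism = ? "
    "ORDER BY experiment.experiment_id",
    ("e.coli",),
).fetchall()
```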
37Gene Expression Databases Require Integration
- There are many different types of data presenting
numerous relationships.
- There are a number of databases with lots of
information.
- Experiments need to be compared because they
are very difficult to perform and very expensive.
- Solution: make all the databases talk the same
language.
- XML was chosen as the data interchange format.
38Why XML?
- Why XML? XML provides a method for defining
the meaning or semantics of data.
- Example: an XML file for the table we defined
earlier
<gene_features>
<gene_id>GBVN32</gene_id>
<contig_id>NT_010651</contig_id>
<contig_start>2354807</contig_start>
<contig_end>2360778</contig_end>
<contig_strand>Complement</contig_strand>
</gene_features>
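Because each value is wrapped in a named element, a program can read the record without knowing its layout in advance. A minimal sketch with Python's standard xml.etree.ElementTree module, using the record from this slide:

```python
import xml.etree.ElementTree as ET

xml_text = """
<gene_features>
  <gene_id>GBVN32</gene_id>
  <contig_id>NT_010651</contig_id>
  <contig_start>2354807</contig_start>
  <contig_end>2360778</contig_end>
  <contig_strand>Complement</contig_strand>
</gene_features>
""".strip()

root = ET.fromstring(xml_text)

# Element names act as field names, so the record maps naturally to a dict.
record = {child.tag: child.text for child in root}

# XML text is untyped; numeric fields must be converted explicitly.
span = int(record["contig_end"]) - int(record["contig_start"])
```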
39Mapping XML to Relational Database
- The data structure in XML is defined in a Document
Type Definition (DTD) as follows
- <!ELEMENT gene_id (#PCDATA)>
- <!ELEMENT contig_id (#PCDATA)>
- <!ELEMENT contig_start (#PCDATA)>
- <!ELEMENT contig_end (#PCDATA)>
- <!ELEMENT contig_sequence (#PCDATA)>
- This kind of DTD also helps us to have control
over the vocabulary used.
- SQL
- create table gene (
- gene_id varchar(5) primary key,
- contig_id varchar(10) not null,
- contig_start integer not null,
- contig_end integer not null,
- contig_sequence text not null)
- So the DTD can be directly mapped into a
relational database.
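The mapping can be sketched by loading XML records into the gene table, one element per column. The genes and gene wrapper elements and the record values are hypothetical; the column names come from the DTD above. SQLite is used for convenience and does not enforce the varchar lengths.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: one <gene> element per row, one child per column.
xml_text = """<genes>
  <gene>
    <gene_id>G0001</gene_id>
    <contig_id>NT_000001</contig_id>
    <contig_start>100</contig_start>
    <contig_end>250</contig_end>
    <contig_sequence>ACGTACGT</contig_sequence>
  </gene>
</genes>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene ("
    " gene_id varchar(5) PRIMARY KEY,"
    " contig_id varchar(10) NOT NULL,"
    " contig_start integer NOT NULL,"
    " contig_end integer NOT NULL,"
    " contig_sequence text NOT NULL)"
)

for gene in ET.fromstring(xml_text).findall("gene"):
    conn.execute(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        (
            gene.findtext("gene_id"),
            gene.findtext("contig_id"),
            int(gene.findtext("contig_start")),  # XML text -> integer column
            int(gene.findtext("contig_end")),
            gene.findtext("contig_sequence"),
        ),
    )

stored = conn.execute(
    "SELECT gene_id, contig_start, contig_end FROM gene").fetchall()
```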
40MAGE-ML As Data Interchange Format
Expression Data
Converter (program)
MAGE-ML
Databases