Title: Making Sense of Public Domain Expression Data: GeneVestigator
1. Making Sense of Public Domain Expression Data: GeneVestigator
2. On the Agenda
- Microarray databases: characteristics
  - pros and cons
- Examples
  - GEO and ArrayExpress
  - GeneVestigator: a meta-analytical approach
3. Meta-data in Microarray Experiments
Gene expression studies generate large amounts of data!
Source: http://titan.biotec.uiuc.edu/cs491jh/slides/cs491jh-Yong.ppt (slide 6, "Capturing Data and Meta-data in Microarray Experiments")
4. Properties of High-throughput Data
Microarray databases can accept, store, and export (share) large quantities of data.
Stored data contain:
- Many genes
- Many samples
- Various organisms/tissues
- A variety of biological phenomena
- Time courses
- Replicates
- Different technologies
- Various data formats
Data retrieval:
- User-friendly web-based interfaces
- Links to analysis tools
5. Gene Expression Matrix
The final gene expression matrix (on the right) is needed for higher-level analysis and mining.
[Figure: gene expression matrix; axes labelled Samples and Genes, cells hold gene expression levels]
Source: http://titan.biotec.uiuc.edu/cs491jh/slides/cs491jh-Yong.ppt (slide 8, "Gene Expression Matrix")
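To make the matrix concrete, here is a minimal Python sketch. It assumes a genes-by-samples layout (one row per gene, one column per sample); the gene names, sample names, and values are invented for illustration and do not come from any real experiment.

```python
# Illustrative gene expression matrix: rows are genes, columns are samples.
# All names and values below are made up for the example.
genes = ["geneA", "geneB", "geneC"]
samples = ["sample1", "sample2", "sample3", "sample4"]
matrix = [
    [7.1, 7.3, 9.8, 9.9],  # geneA
    [5.0, 5.2, 5.1, 4.9],  # geneB
    [8.4, 8.1, 6.0, 6.2],  # geneC
]


def expression_profile(gene):
    """Return {sample: level} for one gene (one row of the matrix)."""
    row = matrix[genes.index(gene)]
    return dict(zip(samples, row))


def sample_column(sample):
    """Return {gene: level} for one sample (one column of the matrix)."""
    col = samples.index(sample)
    return {g: matrix[i][col] for i, g in enumerate(genes)}


print(expression_profile("geneA"))
print(sample_column("sample3"))
```

Reading a row gives a gene-centered view and reading a column gives a sample-centered view; higher-level analysis such as clustering starts from one of these two.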
6. Microarray Data Precision and Loss
Electron microscopy
Only provided in 0.1% of public experiments
Processed data loses precision!
90% of CEL files generated from microarray experiments have never been deposited to any repository. Stokes et al., BMC Bioinformatics 2008, 9(Suppl 6):S18
http://www.bio-miblab.org/arraywiki
7. Microarray Data Formats
- A. Raw image data: the intensity of the signal at each spot is proportional to the expression level of the gene under test.
  - Image intensities are quantified using image analysis software.
- B. Raw numerical data (signal intensities).
- C. Processed data.
8. Problem: Raw Data
- A complete description of complex experiments is desired.
- We don't always know what's important.
  - "Noise" probes could end up being informative (e.g. detection of a splice variant).
- The future:
  - Better (more accurate) summarization algorithms will emerge.
  - New uses for raw data may emerge.
- Challenge: store the raw data in an accessible form.
Different labs have different needs; a central system is needed!
9. Complexity and Categories of Data, and MIAME: 6 Parts
The MIAME (Minimum Information About a Microarray Experiment) guidelines set standards for the publication of microarray information. Brazma et al. (2001), Nature Genetics 29(4), 365-71
- Publication
- Experimental design
- Sample: source, treatment, prep., labelling
- Source (e.g., taxonomy)
- Array design
- Normalization
- Data measurements
Source: http://www.ict.ox.ac.uk/odit/projects/digitalrepository/docs/workshop/Helen_Parkinson-RDMW0608.ppt (slide 18)
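A checklist such as MIAME lends itself to a simple completeness check. The Python sketch below is only an illustration: the section names are taken from the labels on this slide, while the record layout, the field names, and the example values are hypothetical and are not part of the MIAME specification.

```python
# Section names follow the labels on the slide; the record layout is hypothetical.
REQUIRED_SECTIONS = [
    "experimental_design",
    "sample",
    "array_design",
    "data_measurements",
    "normalization",
    "publication",
]


def missing_sections(record):
    """Return the required sections that are absent or empty in a submission record."""
    return [name for name in REQUIRED_SECTIONS if not record.get(name)]


# Made-up submission: normalization is empty and publication is absent.
record = {
    "experimental_design": "treated vs. control, 3 replicates each",
    "sample": {"source": "example organism", "treatment": "example", "labelling": "example"},
    "array_design": "example array",
    "data_measurements": "raw and processed files",
    "normalization": "",
}

print(missing_sections(record))
```

A repository could run a check of this kind at submission time, before any curator looks at the experiment.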
10. Microarray Database Repositories are Biased
The relative size of each pie corresponds to the number of experiments contained in each repository.
[Figure: pie charts, one per repository, annotated: All human data; Mostly old data; Mostly custom arrays; Mostly human data; Mainly Affy chips]
Stokes et al., BMC Bioinformatics 2008, 9(Suppl 6):S18, http://www.biomedcentral.com/1471-2105/9/S6/S18
11. Overlaps of Data Between Repositories
Stokes et al., BMC Bioinformatics 2008, 9(Suppl 6):S18, http://www.biomedcentral.com/1471-2105/9/S6/S18
Total experiments: 2376
August 2005 June 2006
12. User-Friendly Microarray Databases
- Many gene expression databases exist, both commercial and non-commercial.
- Most focus on a particular technology, a particular organism, or both.
- We will discuss the most promising ones:
  - ArrayExpress (AE, EBI)
  - The Gene Expression Omnibus (GEO, NCBI)
  - GeneVestigator
13. http://www.ncbi.nlm.nih.gov/geo/
The Gene Expression Omnibus (GEO) is a public repository within the Entrez system that holds high-throughput gene expression data, hosted by NCBI at the National Library of Medicine (NIH). GEO was designed to accommodate diverse types of data.
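Because GEO is part of Entrez, it can also be searched programmatically through the NCBI E-utilities, which the slides do not cover. The Python sketch below only builds an ESearch URL against the `gds` (GEO DataSets) Entrez database and makes no network call; the function name and the search term are chosen for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the Entrez database for GEO DataSets.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=20):
    """Build an ESearch URL that looks up GEO DataSets records matching a term."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


url = geo_search_url("Duchenne muscular dystrophy")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns a JSON document listing matching record identifiers; NCBI asks automated clients to respect its usage guidelines.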
14. Gene Expression Omnibus: Experiment-centered view (GDS)
15. Gene Expression Omnibus: Gene-centered view
Expression profile of the Dystrophin gene in a DataSet examining skeletal muscle biopsies from 12 Duchenne muscular dystrophy patients and 12 normal subjects.
- Red bars: level of abundance of an individual transcript across the Samples that make up a DataSet. Values are presented as arbitrary units.
  - Single channel: Values are normalized signal count data.
  - Dual channel: submitted Values are normalized log ratios.
- Blue squares: rank order, giving an indication of where the expression of that gene falls with respect to all other genes on that array (enrichment).
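The rank-order idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not GEO's actual procedure: it computes a simple percentile rank of one gene among all genes measured in a single sample, and the gene names and values are invented.

```python
def percentile_rank(values, gene):
    """Percentage of genes on the array whose value is at or below this gene's value."""
    target = values[gene]
    at_or_below = sum(1 for v in values.values() if v <= target)
    return 100.0 * at_or_below / len(values)


# Made-up signal values for four genes measured in one sample.
sample = {"geneA": 120.0, "geneB": 15.0, "geneC": 980.0, "geneD": 45.0}

for gene in sample:
    print(gene, percentile_rank(sample, gene))
```

Unlike the raw values, which are in arbitrary units, ranks of this kind can be compared across samples because each is expressed relative to the other genes on the same array.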
16. http://www.ebi.ac.uk/microarray-as/ae/
Metsada Pasmanik-Chor, TAU Bioinformatics Unit,
19/3/09
17. Query ArrayExpress
[Screenshot annotations: Annotations; Experiments and description; Condition; Gene name; Species]
Results: a list of all experiments, ordered by p-value. For each experiment: a short description, experimental factors, and gene expression.
18. Query ArrayExpress: Similarly Expressed Genes
Select the "find 3 closest genes" option. IER2, FOS, and JUN have expression similar to NFKBIA.
19. HeatMap Atlas Output
[Figure: heatmap; rows are experimental conditions, cells show the number of up/down regulated genes]
http://www.ebi.ac.uk/microarray-as/atlas/qr?q_genesaa4q_updnupdnq_orgnMUSMUSCULUSq_expt28allconditions29viewheatmapview
20. GeneVestigator: a reference expression database and meta-analysis system
21. Genevestigator: a system for the meta-analysis of microarray data
A database and Web-browser data-mining interface for Affymetrix GeneChip data, based on the new concept of Meta-Profiles, which rely on reference expression databases. It allows biologists to study the expression and regulation of genes in a broad variety of contexts by summarizing information from hundreds of manually curated microarray experiments. Workspaces and views can be stored in files (.gvw, which stands for GenevestigatorWorkspace) and re-opened for another analysis session.
[Figure: system overview; Application server, Java application, Analysis output]
Source: http://bar.utoronto.ca/ICAR19/ICAR19_BioinfoWorkshop%20-%20Genevestigator.ppt (slide 2, "Overview of the Genevestigator system")
22. Database Content and Quality
- The database consists of a large and varied collection of manually curated and quality-controlled Affymetrix chips.
- Quality control of EACH experiment is done manually by Genevestigator curators using a pipeline of Bioconductor packages performing normalization and probe-level analysis.
- Low-quality arrays are those that:
  - fall out of range relative to the other arrays from the same experiment,
  - exhibit higher RNA degradation,
  - are particularly noisy,
  - do not correlate with replicate samples.
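The last criterion, poor correlation with replicates, can be illustrated with a short Python sketch. It is not the Genevestigator pipeline (which the slide describes as Bioconductor-based and manually reviewed); the use of Pearson correlation, the median over replicates, the 0.9 threshold, and the data are all assumptions made for the example.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def flag_low_correlation(replicates, threshold=0.9):
    """Flag arrays whose median correlation with the other replicates is below threshold."""
    flagged = []
    for name, values in replicates.items():
        others = [v for other, v in replicates.items() if other != name]
        if median(pearson(values, o) for o in others) < threshold:
            flagged.append(name)
    return flagged


# Made-up replicate arrays (five probes each); rep3 disagrees with the rest.
replicates = {
    "rep1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rep2": [1.1, 2.1, 2.9, 4.2, 5.0],
    "rep3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "rep4": [0.9, 2.2, 3.1, 3.9, 5.1],
}

print(flag_low_correlation(replicates))
```

The median is used so that a single bad array does not drag down the score of the good replicates it is compared against.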
23. User Hardware Requirements
- Genevestigator is a web-based application running in Java.
- A Java applet provides several advantages:
  - users don't have to install any software
  - users always work with the latest software release
  - Java is more powerful than HTML/JavaScript for data manipulation
- To run the application, client machines must have a Java Runtime Environment (JRE version 1.4.2 or higher) installed (usually available by default on PCs).
- JRE is freely available for download from Sun Microsystems (http://www.Java.com).
- To work optimally with the Genevestigator application, we recommend:
  - screen resolution: 1024 x 768 or higher
  - memory: preferably 512 MB RAM or more
24. GeneVestigator Species Availability
Mammals
Species      Arrays                                    Number of arrays
Human        Human 133_2; Human Genome 10k, 20k, 47k   1109, 3786, 2782
Mouse        Mouse Genome 12k, 40k                     3071, 1967
Rat          Rat Genome 8k, 31k                        2146, 858
Plants
Species      Arrays                                    Number of arrays
Arabidopsis  Arabidopsis Genome 22k                    3110
Barley       Barley Genome 22k                         706
Rice         Rice Genome 22k                           -
Soybean
25. Data Sources and Referencing
The Genevestigator analysis platform comprises a large database of manually curated microarray experiments collected from the public domain or from individual contributors. The array annotations necessary for data analysis were retrieved from public repositories or, where these were insufficient, from the authors themselves. Genevestigator contains data from the following repositories and databases:
Database                                              Link
Gene Expression Omnibus (GEO)                         http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress                                          http://www.ebi.ac.uk/arrayexpress/
ChipperDB                                             http://chipperdb.chip.org/adb/adb-home
The Arabidopsis Information Resource (TAIR)           http://www.arabidopsis.org/
MUSC Microarray Database                              http://proteogenomics.musc.edu/ma
Public Expression Profiling Resource (PEPR)           http://pepr.cnmcresearch.org
NASC Microarray Database (NASCArrays)                 http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl
NIH Neuroscience Microarray Consortium                http://arrayconsortium.tgen.org/np2/home.do
Gene Expression Open Source System (GEOSS)            https://genes.med.virginia.edu/intro to geoss.html
RNA Abundance Database (RAD)                          http://www.cbil.upenn.edu/RAD/php/index.php
26. GeneVestigator focuses on gene expression in the context of:
- Time (gene expression during stages of development/life cycle).
- Space (tissue-specific expression).
- Response (expression caused by stimuli: biotic stress, abiotic stress, chemical, hormone, light, drug treatment, disease).
Users can query the database to retrieve the expression patterns of individual genes across chosen environmental conditions, growth stages, or organs. Conversely, mining tools allow users to identify genes specifically expressed during selected stresses, growth stages, or in particular organs.
Access: free / by license
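The kind of query described on this slide, one gene summarized across organs, stages, or treatments, can be sketched in Python as a grouped average. This is only an illustration of the meta-profile idea under simple assumptions (a plain mean per annotation category); the sample IDs, organ labels, and values are invented, and Genevestigator's actual computation is not described in these slides.

```python
from collections import defaultdict


def meta_profile(expression, annotation):
    """Average one gene's expression over all samples that share an annotation."""
    groups = defaultdict(list)
    for sample, level in expression.items():
        groups[annotation[sample]].append(level)
    return {category: sum(levels) / len(levels) for category, levels in groups.items()}


# Made-up expression levels of a single gene in five samples.
expression = {"s1": 8.0, "s2": 9.0, "s3": 3.0, "s4": 5.0, "s5": 4.0}
# Made-up curated annotation: the organ each sample was taken from.
annotation = {"s1": "root", "s2": "root", "s3": "leaf", "s4": "leaf", "s5": "leaf"}

print(meta_profile(expression, annotation))
```

Swapping the annotation dictionary for one that records developmental stage or treatment gives the Time and Response views from the same expression data.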
27. http://sbw.kgi.edu/
28. Thank you!
Dr. Metsada Pasmanik-Chor, Bioinformatics Unit, Life Science, TAU
Tel: x 6992
E-mail: metsada_at_bioinfo.tau.ac.il
Bioinfo. Unit webpage: http://bioinfo.tau.ac.il
Bioinformatics Intro, 15/12/2008, Metsada
Pasmanik-Chor