Title: Data Analysis Tools
1- Data Analysis Tools Techniques II
2In this presentation
- Part 1: Gene Expression Microarray Data
- Part 2: Global Expression Sequence Data Analysis
- Part 3: Proteomic Data Analysis
3Part1
Gene Expression Data Processing
4Conversion to matrix
- Whichever platform is used, the aim of data processing is to convert the hybridization signals into numbers, which can be used to build a gene expression matrix
- This matrix can be regarded as a table in which the rows represent genes (different features on the array) and the columns represent the treatments, samples or conditions used in the experiment
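A minimal sketch of such a matrix in Python. The intensity values and the labels G1-G3 and C1-C3 are invented for illustration and do not come from a real experiment:

```python
# Hypothetical, made-up expression values; rows are genes, columns are conditions.
genes = ["G1", "G2", "G3"]
conditions = ["C1", "C2", "C3"]
matrix = [
    [2.0, 8.0, 1.0],   # G1
    [2.5, 7.5, 6.0],   # G2
    [9.0, 1.0, 0.5],   # G3
]

def expression(gene, condition):
    """Look up one data point by gene (row) and condition (column)."""
    return matrix[genes.index(gene)][conditions.index(condition)]

def profile(gene):
    """All measurements for one gene: a row of the matrix."""
    return matrix[genes.index(gene)]

def column(condition):
    """All gene measurements for one condition: a column of the matrix."""
    j = conditions.index(condition)
    return [row[j] for row in matrix]
```

Reading along a row gives one gene across all conditions; reading down a column gives all genes under one condition.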
5What do they represent?
- For a dual hybridization experiment using a glass microarray, each of the probes represents a different experimental condition
- In other cases, a whole series of conditions or treatments may be used, e.g. a series of concentrations of a particular drug, or a series of developmental time points
6Schematic of an idealized expression array, in which the results from three experiments are combined. Three genes (NG = 3: G1, G2, G3) are labeled on the vertical axis and three experimental conditions (NC = 3: C1, C2, C3) are labeled on the horizontal axis, giving a total of NC x NG = 9 data points. The shading of each data point represents the level of gene expression, with darker colours representing higher expression levels
8Gene expression matrix
9Expression profile
- Interpretation of a microarray experiment is carried out by grouping data according to similar expression profiles
- An expression profile is defined as the expression measurements of a given gene over a set of conditions; essentially, it means reading along a row of data in the matrix
- Intensity of shading is used to represent expression levels
- With experimental conditions C1 and C2 alone, genes G1 and G2 look functionally similar and G3 appears different. However, if C3 is included, a functional link between genes G1 and G3 can be seen
- Analysis methods are either supervised or unsupervised
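The effect of adding condition C3 can be illustrated with a small sketch. The numbers are invented so that G1 and G2 agree under C1 and C2 while G1 and G3 agree under C3. Euclidean distance is used here only as one possible similarity measure:

```python
from math import sqrt

# Hypothetical profiles; the columns are C1, C2, C3.
profiles = {
    "G1": [8.0, 2.0, 9.0],
    "G2": [8.0, 2.0, 1.0],
    "G3": [1.0, 7.0, 9.0],
}

def distance(gene_a, gene_b, columns):
    """Euclidean distance between two profiles over the chosen condition columns."""
    a, b = profiles[gene_a], profiles[gene_b]
    return sqrt(sum((a[j] - b[j]) ** 2 for j in columns))

c1_c2 = [0, 1]   # look at conditions C1 and C2 only
c3 = [2]         # look at condition C3 only

same_under_c1_c2 = distance("G1", "G2", c1_c2)   # G1 and G2 coincide here
same_under_c3 = distance("G1", "G3", c3)         # G1 and G3 coincide here
```

Which genes look related therefore depends on which columns of the matrix are included in the comparison.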
10Microarray Data Analysis Types
- Gene Selection
- find genes for therapeutic targets
- Classification
- classify disease based on genes
- predict outcome / select best treatment
- Clustering
- find new biological classes / refine existing ones
- Exploration
11Microarray Data Mining Challenges
- too few records (samples), usually < 100
- too many columns (genes), usually > 1,000
- too many columns are likely to lead to false positives
- for exploration, a large set of all relevant genes is desired
- for diagnostics or identification of therapeutic targets, the smallest reliable set of genes is needed
- model needs to be explainable to biologists
12Data Mining Methodology is Critical!
CRISP-DM methodology
Data Mining is a Continuous Process! Following
Correct Methodology is Critical!
13Building Classification Models
(Diagram: gene data and class data feed a pipeline of preparation, feature selection, model building and evaluation)
14Supervised analysis method
- Supervised methods are essentially classification systems, i.e. they incorporate some kind of classifier so that expression profiles are assigned to one or more predefined categories
- For instance, supervised analysis of gene expression profiles from different leukemias allows samples to be divided into two distinct subtypes: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
- Examples of supervised methods include the support vector machine (SVM) and learning vector quantization (LVQ)
15Clustering
16Unsupervised analysis method
- Unsupervised methods have no built-in classifiers, so the number and nature of the groups depend only on the algorithm used and the nature of the data themselves
- This type of analysis is known as clustering
- Examples include k-means, principal component analysis (PCA), self-organizing maps (SOM) and hierarchical clustering
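As an illustration of one of the methods listed, here is a minimal sketch of agglomerative hierarchical clustering with single linkage. The profiles are invented, and stopping at a fixed number of clusters is an assumption made to keep the example short:

```python
from itertools import combinations
from math import dist

def single_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(profiles[x], profiles[y]) for x in a for y in b)

    clusters = [[name] for name in profiles]   # start with one gene per cluster
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]

# Hypothetical expression profiles over three conditions.
profiles = {
    "G1": [1.0, 1.2, 0.9],
    "G2": [1.1, 1.0, 1.0],
    "G3": [5.0, 5.2, 4.8],
    "G4": [5.1, 4.9, 5.0],
}
groups = sorted(single_linkage(profiles, 2))
```

No class labels are supplied; the grouping comes only from the distance measure, the linkage rule and the data.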
17Classification
18Feature reduction
- Since microarray data sets are so large, classification and clustering can be laborious and demanding in terms of computer resources
- It is possible to use feature reduction, where non-informative or redundant data points are removed from the data set, to make the algorithms run more quickly
- For instance, if two conditions have exactly the same effect on gene expression, these data are redundant and one entire column of the matrix can be eliminated
- If the expression of a particular gene is the same over a range of conditions, it is neither necessary nor beneficial to use this gene in further analysis because it provides no useful information on differential gene expression. An entire row can be removed
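The two reductions described above can be sketched as follows. The matrix values and the function name drop_uninformative are hypothetical, and exact equality is used for simplicity where real data would need a tolerance:

```python
def drop_uninformative(matrix, tolerance=0.0):
    """Remove duplicate columns (redundant conditions) and flat rows (unchanging genes)."""
    # Keep a column only if it differs from every column already kept.
    columns = list(zip(*matrix))
    kept = []
    for j, col in enumerate(columns):
        if all(columns[k] != col for k in kept):
            kept.append(j)
    reduced = [[row[j] for j in kept] for row in matrix]
    # Keep a row only if its values vary by more than the tolerance.
    return [row for row in reduced if max(row) - min(row) > tolerance]

# Hypothetical matrix: rows are genes, columns are conditions.
matrix = [
    [1.0, 1.0, 4.0],   # gene varies between conditions
    [3.0, 3.0, 3.0],   # flat gene: no differential information
    [2.0, 2.0, 0.5],   # gene varies between conditions
]
reduced = drop_uninformative(matrix)
```

Here the second column duplicates the first and is dropped, and the flat middle row is removed.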
19Other feature reduction methods
- Several approaches can be used to automatically select such redundant or non-informative data sets, but a popular method is principal component analysis (closely related to singular value decomposition)
- Redundant data are combined to form a single, composite data set, thus reducing the dimensions of the gene expression matrix and simplifying analysis
- Feature reduction can also be used in supervised analysis methods to reduce the number of features required to classify profiles correctly (also called cherry picking)
- In one method, this can be achieved simply by weighting classification features according to their usefulness and eliminating those that are least informative
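A minimal sketch of weighting and eliminating features for a two-class problem. The weight used here, the difference of class means divided by the sum of class standard deviations, is one common signal-to-noise style score and is an assumption, not necessarily the method the slide has in mind. All data values are invented:

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """Weight of one gene: separation of class means relative to within-class spread."""
    difference = abs(mean(class_a) - mean(class_b))
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / spread

def select_features(data, keep):
    """Rank genes by weight and keep only the `keep` most informative ones."""
    ranked = sorted(data, key=lambda g: signal_to_noise(*data[g]), reverse=True)
    return ranked[:keep]

# Hypothetical values: gene -> (samples of class A, samples of class B)
data = {
    "G1": ([1.0, 1.2, 0.8], [5.0, 5.2, 4.8]),   # well separated
    "G2": ([3.0, 2.0, 4.0], [3.1, 2.1, 4.1]),   # classes overlap almost fully
    "G3": ([2.0, 2.5, 3.0], [4.0, 4.5, 5.0]),   # moderately separated
}
selected = select_features(data, keep=2)
```

G2 barely distinguishes the two classes, so it receives the lowest weight and is eliminated.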
20Microarray data format
- Unlike sequence and structural data, there is no international convention for the representation of data from microarray experiments
- This is due to the wide variation in experimental design, assay platforms and methodologies
- Recently, an initiative to develop a common language for the representation and communication of microarray data has been proposed
- Experiments are described in a standard format called MIAME and communicated using a standardized data exchange model and a microarray markup language based on XML
21Micro Array Gene Expression Markup Language
- Micro Array Gene Expression Markup Language (MAGE-ML) creates a syntax that can manage the enormous number of variables involved in microarray experiments, and provides a mutually intelligible format to permit data merges or comparisons
- This is a collaborative effort of Lion Bioscience, The Institute for Genomic Research, Rosetta Biosoftware and the Institute for Systems Biology, among others, under the chairmanship of Paul Spellman
- It is expected to become the standard for microarray experiments run under different conditions in different labs worldwide
- See Paul T. Spellman et al. (2002) for more information on MAGE-ML
22Tools for microarray data analysis
- Many software applications are available for the analysis of microarray data and these can be downloaded and installed on local computers
- There are also several resources for microarray data analysis over the Internet, Expression Profiler being the most widely used
- Several gene expression databases have been constructed for the storage and dissemination of microarray data
- These include the NCBI Gene Expression Omnibus and the EBI ArrayExpress database
23From expression data to pathways
- Reconstructing molecular pathways from expression data is a difficult task
- One approach is to simulate pathways using a variety of mathematical models and then choose the model that fits the data
- Reverse engineering is a less demanding approach in which models are built on the basis of the observed behaviour of molecular pathways
- Models using simultaneous differential equations or Boolean networks each suffer from disadvantages, so hybrid models, such as the finite state linear model, are preferred
24Representation of molecular pathways
- There are two well-studied ways of representing a molecular pathway
- The classical biochemical representation involves the use of simultaneous differential equations
- The Boolean network representation
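In a Boolean network each gene is either on or off, and its next state is a logical function of the current states. A minimal sketch, assuming a hypothetical three-gene wiring and synchronous updates (both are illustrative choices, not taken from the slides):

```python
def step(state, rules):
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

def trajectory(state, rules, n_steps):
    """List of states visited, starting from the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states

# Hypothetical wiring, invented for illustration only.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B together activate C
}

path = trajectory({"A": True, "B": False, "C": False}, rules, 5)
```

With this wiring the network returns to its starting state after five updates, i.e. it settles into a repeating cycle.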
25Part2
- Global Expression Sequence Data Analysis
26Sequence sampling data analysis
- Differential gene expression can be investigated by sampling random clones from different cDNA libraries, or by sampling EST data, which are obtained by single-pass sequencing of randomly picked cDNA clones and deposited in public or proprietary databases
- Thousands of sequences have to be sampled for such analysis to be statistically significant, even in the case of moderately abundant mRNAs
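Why so many clones are needed can be seen from a simple calculation. This sketch assumes independent random sampling, so that an mRNA making up a fraction f of the library is seen at least once in n clones with probability 1 - (1 - f)^n. The abundance used in the example is hypothetical:

```python
from math import ceil, log

def detection_probability(abundance, n_clones):
    """Probability of sampling a given mRNA at least once in n_clones."""
    return 1.0 - (1.0 - abundance) ** n_clones

def clones_needed(abundance, confidence=0.99):
    """Smallest number of clones for which the detection probability reaches confidence."""
    return ceil(log(1.0 - confidence) / log(1.0 - abundance))

# Hypothetical transcript present as 1 in 10,000 clones.
n = clones_needed(1e-4, confidence=0.99)
```

Note that this only covers detecting a transcript once; comparing its abundance between two libraries requires larger samples still.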
27Global expression data analysis
- Global expression data analysis refers to any experiment in which the expression of all genes is monitored simultaneously
- Such experiments generate large amounts of data, but unlike sequence and structural data, there is no universal system for the description of gene expression profiles
- Global protein expression data are obtained predominantly as signal intensities on 2D protein gels
28RNA expression data analysis
- At the RNA level, expression data may be obtained as digital expression readouts following direct sequence sampling from libraries or databases, or using more sophisticated techniques like SAGE
- Most global RNA expression data, however, are obtained as signal intensities from microarray experiments
29SAGE
- SAGE (serial analysis of gene expression) is a sequence sampling technique in which very short sequence tags (9-15 nt) are joined into long concatemers
- The SAGE tag is short enough for high-throughput analysis, yet genes can still be identified unambiguously
- A concatemer may contain more than 50 tags, and each SAGE sequence is thus equivalent to more than 50 independent cDNA sequencing experiments
- SAGE is therefore appropriate for the analysis of rare mRNAs
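Counting tags in a sequenced concatemer can be sketched as follows. This is a deliberately simplified layout in which every tag directly follows a CATG anchoring site and has a fixed length of 10 nt. Real SAGE data arrive as ditags, with the second tag in reverse-complement orientation, which this sketch ignores. The sequences are invented:

```python
from collections import Counter

ANCHOR = "CATG"    # assumed anchoring-enzyme site punctuating the concatemer
TAG_LENGTH = 10    # assumed fixed tag length for this sketch

def extract_tags(concatemer):
    """Return the fixed-length tag that follows each anchoring site."""
    tags = []
    start = concatemer.find(ANCHOR)
    while start != -1:
        tag_start = start + len(ANCHOR)
        tag = concatemer[tag_start:tag_start + TAG_LENGTH]
        if len(tag) == TAG_LENGTH:       # ignore a truncated tag at the end
            tags.append(tag)
        start = concatemer.find(ANCHOR, tag_start)
    return tags

# Hypothetical concatemer built from three invented tags.
concatemer = "".join(ANCHOR + tag for tag in ["ACGTACGTAC", "TTGGCCAATT", "ACGTACGTAC"])
counts = Counter(extract_tags(concatemer))
```

The tag counts act as a digital readout of relative transcript abundance: a tag seen twice as often suggests a transcript roughly twice as abundant.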
30Starting points for SAGE analysis
Resource | URL
Johns Hopkins SAGE site. Includes protocols, access to SAGE data and an extensive bibliography | www.sagenet.org
NCBI SAGE site. Includes tools for data analysis, access to SAGE data, and a library of tags and ditags | www.ncbi.nlm.nih.gov/SAGE
Saccharomyces genome database SAGE query site | http://genome-www.stanford.edu/cgi-bin/SGD/SAGE/querySAGE
A useful SAGE site run by Genzyme Molecular Oncology Inc., which owns the license for commercial distribution of SAGE technology | www.genzymemolecularoncology.com/sage
31Part3
32Proteomic data analysis
- 2D-PAGE or gel electrophoresis
- Mass spectrometry
332D protein gels
- Global protein expression analysis is achieved using high-resolution 2D gel electrophoresis
- In this technique, proteins are separated in the first dimension by isoelectric focusing in an immobilized pH gradient, and in the second dimension according to molecular mass
- After staining the gel, the resulting pattern of spots is a reproducible fingerprint of the proteins in the sample
- Comparison between samples can identify proteins that are differentially expressed, or induced in response to drugs, and so on
- Excised spots are analyzed by MS to characterize the proteins
34Raw data from 2D-PAGE gels
- 2D-PAGE is a protein separation technique that allows the resolution of thousands of proteins on a single gel, on the basis of charge and mass
- Separated proteins appear as spots, the nature and distribution of which constitute a protein fingerprint of any sample
35Data processing
- Data extraction from 2D-PAGE gels involves
- staining (to reveal the position of individual protein spots)
- scanning (to obtain a digital image)
- spot detection and quantitation
- The quality of the image, in terms of spatial and densitometric resolution, is an important factor in accurate spot measurement
- A number of algorithms are used to resolve complex overlapping spots and assemble a final spot list
36Gel matching
- To study differential protein expression, a series of 2D-PAGE gels must be compared
- However, minute inconsistencies in gel structure and electrophoretic conditions make it impossible to replicate any experiment exactly
- Sophisticated algorithms are required to follow individual spots through a series of gels, a process known as gel matching
- MELANIE II is a widely used gel-matching software application
37Protein expression matrices
- Differential protein expression data are assembled into a protein expression matrix
- This can be used to find distances between particular proteins or treatments, leading to classification or clustering of proteins according to similar expression profiles
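One possible distance between two rows of a protein expression matrix is 1 minus the Pearson correlation, which compares the shape of two profiles rather than their absolute intensity. The choice of this measure and the spot intensities below are assumptions for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_distance(x, y):
    """0 for identical trends, 2 for exactly opposite trends."""
    return 1.0 - pearson(x, y)

# Hypothetical spot intensities over four treatments.
spots = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],   # same trend as P1 at twice the intensity
    "P3": [4.0, 3.0, 2.0, 1.0],   # opposite trend
}
d_12 = correlation_distance(spots["P1"], spots["P2"])
d_13 = correlation_distance(spots["P1"], spots["P3"])
```

A table of such pairwise distances is the usual input to a clustering method.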
382D-PAGE database
- Data from 2D-PAGE experiments are deposited in dedicated 2D-PAGE databases containing digital gel images with links from individual protein spots to useful annotations
- Internet 2D-PAGE databases are indexed at ExPASy WORLD-2DPAGE
- These allow 2D-PAGE data to be shared with scientists around the world, and comparisons between gels can be carried out using Java applets such as Flicker or CAROL
39Raw data from mass spectrometry
- Raw data from MS experiments are the mass/charge (m/z) ratios of ions in a vacuum
- These are used to determine accurate molecular masses
- The masses can be used in peptide mass fingerprinting or fragment ion searching to find correlations in protein databases
- Alternatively, peptide ladders can be generated and used to determine protein sequences de novo
40Virtual digests
- Virtual digests are theoretical protein cleavage reactions performed by computers based on known protein sequences and the known specificity of a cleavage agent such as an endoproteinase
- Although many different polypeptides can generate the same peptide digest pattern, in practice a correlation between the masses of two or more peptides produced from the same protein and the theoretical peptides produced in a virtual digest provides very strong evidence for a database match
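A minimal sketch of a virtual digest, assuming trypsin as the endoproteinase (cleavage after K or R, but not before P) and monoisotopic residue masses. The protein sequence is invented, and missed cleavages and modifications are ignored:

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini

def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

# Hypothetical protein sequence.
peptides = tryptic_digest("GASKPLRVTEKAG")
masses = {peptide: round(peptide_mass(peptide), 4) for peptide in peptides}
```

The resulting list of theoretical masses is what experimentally measured peptide masses are compared against.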
41Dual digests
- Dual digests, carried out on the same protein either separately or sequentially, can provide extra data to correlate experimentally determined molecular masses with less robust data resources such as dbEST
- Alternatively, single digests can be carried out before and after protein modification, or ragged termini can be generated from proteins with clustered arginine and lysine residues, providing the masses of multiple fragments to use as database search terms
42Database search tools
- Algorithms for database searching may attempt to match the experimentally determined mass of a peptide or peptide fragment to the masses predicted from sequence database entries. The program SEQUEST works on this principle
- Alternatively, the amino acid composition of a particular peptide or peptide fragment can be predicted from its mass
- The order of amino acids cannot be predicted, so all permitted permutations are used as a database search query. The program Lutefisk works on this principle
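The first approach, matching measured masses against masses predicted from database entries, can be sketched with a simple counting score. This is not the scoring used by SEQUEST or any other named program. The entry names, masses and tolerance are all hypothetical:

```python
def count_matches(observed, theoretical, tolerance=0.2):
    """Number of observed masses within +/- tolerance (Da) of any theoretical mass."""
    return sum(
        any(abs(mass - predicted) <= tolerance for predicted in theoretical)
        for mass in observed
    )

def rank_entries(observed, database, tolerance=0.2):
    """Database entries sorted by how many observed masses they explain."""
    scores = {
        name: count_matches(observed, masses, tolerance)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical theoretical peptide masses per database entry.
database = {
    "protein_A": [500.30, 842.51, 1045.56, 1479.80],
    "protein_B": [500.25, 927.49, 1200.62],
}
observed = [842.45, 1045.60, 1479.75, 500.28]   # hypothetical measured masses
ranking = rank_entries(observed, database)
```

Several matching peptides from one entry give much stronger evidence than a single coincidental mass match.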
43Limitations of MS analysis
- Failure of MS data to elicit a high-confidence hit on a sequence database may not always reflect the absence of that protein from the database
- In some cases, it may reflect the presence of unknown or unanticipated post-translational modifications, or it may be caused by non-specific proteolysis or contaminating proteins
- Imperfect matches may be generated if the experimental protein itself is absent from the database but a close homolog, with a related sequence, is present
44WWW resources for MS based protein identification
Resource | URL | Features and comments
CBRG, ETH-Zurich | cbrg.inf.ethz.ch/Masssearch.html | Peptide mass search
European Molecular Biology Laboratory, Heidelberg | www.mann.embl-heidelberg/Services/PeptideSearch/PeptideSearchIntro.html | Peptide mass and fragment ion search
ExPASy | www.expasy.ch/tools/proteome | Peptide mass and fragment ion search
Mascot | www.matrix-science.com/cgi/index.pl?page/home.html | Peptide mass and fragment ion search
Rockefeller University, New York | prowl.rockfeller.edu | Peptide mass and fragment ion search
SEQNET, Daresbury, UK | www.seqnet.dl.ac.uk/Bioinformatics/welapp/mowse | Peptide mass and fragment ion search
University of California | prospector.ucsf.edu, dontatello.ucsf.edu | Peptide mass (MS-Fit) and fragment ion (MS-Tag) search
University of Washington | thompson.mbt.washington.edu/sequest | Instructions on how to get the SEQUEST fragment ion search program
45Part4
Microarray Data Format
46Standard format
- The scope of bioinformatics has widened to include analysis of gene and protein expression data
- A standard format has been adopted for the representation of 2D gel electrophoresis (2D-PAGE) protein gels, but there is no similar convention for microarrays, even though microarray experiments produce some of the largest data sets bioinformatics has to deal with
- This reflects the different array platforms available (e.g. nylon macroarrays, spotted glass microarrays, high-density oligonucleotide chips) and the large amount of variation in experimental design, hybridization protocols and data gathering techniques
47Recent development
- Recently, there has been an international effort to develop a common language for the communication of microarray data
- The requirements for this language are that it should be minimal, yet convey enough information to enable an experiment to be repeated if necessary
- The convention is known as MIAME (minimum information about a microarray experiment), devised by the MGED group (Microarray Gene Expression Database group)
48MIAME standard
- The MIAME standard incorporates six elements
- Overall experimental design
- Array design (identification of each spot on each array)
- Probe source and labeling method
- Hybridization procedures and parameters
- Measurement procedure (including normalization methods)
- Control types, values and specifications
49Contents of MIAME standard
- A data exchange model (MAGE-Object Model or MAGE-OM) is modeled using the unified modeling language (UML)
- A data exchange format (MAGE-Markup Language or MAGE-ML) uses the extensible markup language (XML)
- For more information visit the Microarray Gene Expression Database (MGED) website at http://www.mged.org
50Part5
General Information
51Analysis software and resources
URL | Product(s) | Comments
http://genome-www4.stanford.edu/Microarray/SMD/restech.html | Cluster, Xcluster, SAM, Scanalyze, many more | Extensive list of software resources from Stanford University and other sources, both downloadable and WWW-based
http://ihome.cukh.edu.hk/b400559/arraysoft.html | Cluster, Cleaver, GeneSpring, Genesis, many more | Comprehensive list of downloadable and WWW-based software for microarray analysis and data mining, plus links to gene expression databases
http://ep.ebi.ac.uk/EP | Expression Profiler | Very powerful suite of programs from EBI for analysis and clustering of expression data
http://www.ncgr.org/genex | GeneX | The GeneX gene expression database is an integrated tool set for analysis and comparison of microarray data
52Analysis software and resources
URL | Product(s) | Comments
http://bioinfo.cnio.es/dnarray/analysis | DNA arrays analysis tools | A suite of programs from the Spanish National Cancer Centre (CNIO) including two-sample correlation plot, hierarchical clustering, SOM, SVM, tree viewers, etc.
http://www.ncbi.nlm.nih.gov/geo | NCBI Gene Expression Omnibus | Gene expression and hybridization database; can be searched directly or through the Entrez ProbeSet search interface
http://www.ebi.ac.uk/microarray/ArrayExpress/arrayexpress.html | ArrayExpress | EBI microarray gene expression database; developed by MGED and supports MIAME
53More on microarray chips
- The protein chip market is expected to reach 700 million by 2006
- Chips for agricultural purposes will be in great demand
- Peptide microarray chips
- Silicon-based microfluidics chips
- 2000 to 4000 peptide sequences on a 1.5 cm² chip
- Protein
- Secreted
- Membranal
54Accuracy of new tech chips
- New software technologies can reduce the inter-experiment variability from 1500-200 genes down to 10-15 genes by identification and suppression of background noise in producing microarray data
- They can be used for high-throughput sequencing, protein detection and SNP analysis
- Reduces the error rate of false positives from 30 down to 1
- Current DNA chips III are equipped to handle multiple mRNA transcripts
55Front-end and back-end processing
- These terms are widely used by the biotech industry
- Front end DNA microarray processes
- Sample preparation
- Microarray production
- Back end DNA microarray processes
- Hybridization
- Imaging and analysis
56DNA chip test
- Cancers can act differently even when they look the same. To decide how to treat breast tumors, doctors look at a range of indicators such as whether the cancer has spread to nearby lymph nodes, tumor size, and certain characteristics of the tumor cells. However, none of these factors is very accurate
- The DNA chip test reveals which of 70 genes are turned on or off in the cancer cells
- According to the Netherlands Cancer Institute, the tumors most likely to spread usually show a different pattern of gene expression than their less dangerous counterparts