Title: RennieC1
A systematic, data-driven approach to the
combined analysis of microarray and QTL data
Rennie C (1), Hulme H (2), Fisher P (2), Hall L (3), Agaba
M (4), Noyes HA (1), Kemp SJ (1,4), Brass A (2,5)
Abstract High-throughput technologies inevitably
produce vast quantities of data. This presents
challenges in terms of developing effective
analysis methods, particularly where the analysis
involves combining data derived from different
experimental technologies. In this investigation,
we applied a systematic approach to combine
microarray gene expression data, QTL data and
pathway analysis resources in order to identify
functional candidate genes underlying tolerance
of Trypanosoma congolense infection in cattle
(see Agaba et al poster at this conference). We
automated much of the analysis using Taverna
workflows previously developed for the study of
trypanotolerance in the mouse model. We
identified pathways represented by genes within
the QTL regions, and subsequently ranked this
list according to which pathways were
over-represented in the set of genes that were
differentially expressed (over time or between
tolerant N'dama and susceptible Boran breeds) at
various timepoints after T. congolense infection.
The genes within the QTL that played a role in
the highest-ranked pathways were flagged as
strong candidates for experimental confirmation.
Background African bovine trypanosomiasis is one
of the most important diseases affecting African
livestock production. West African taurine
cattle, such as the N'dama, are more resistant to
the pathological consequences of trypanosomiasis
(trypanotolerant) than East African zebu cattle,
such as the Boran. A microarray timecourse
experiment was carried out to investigate gene
expression in N'dama and Boran cattle infected
with Trypanosoma congolense, in order to identify
genes underlying trypanotolerance (see Agaba et
al poster at this conference).
Trypanotolerance Trypanotolerance is a complex
phenotype involving several distinct components,
likely to involve separate genetic control
mechanisms. Key features include the ability to
control anaemia, control parasitaemia and
maintain bodyweight. Data on trypanotolerance QTL
suggest that phenotypic traits involved in
trypanotolerance may be influenced by multiple
genetic loci and possibly complex epistatic or
environmental effects (Proc Natl Acad Sci USA
2003; 100(13): 7443-7448).
Microarray data Microarray data for liver samples
extracted from Boran and N'dama cattle at 0, 12,
15, 18, 21, 26, 29, 32 and 35 days post-infection
were analysed. Outliers were identified using
dChip and removed before the remaining
hybridisations were normalised using the Robust
Multi-Array (RMA) method. Principal Components
Analysis (PCA) was used to check that the
hybridisations clustered as expected. T-tests
were used to identify genes that were
differentially expressed (p < 0.01) between the
two breeds at each timepoint and paired T-tests
(using data for the same individual animals at
different timepoints) were used to identify genes
that were differentially expressed (p < 0.01)
within breed at any timepoint compared to day 0.
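The two kinds of comparison described above can be sketched for a single probe-set as follows. This is a minimal illustration, not the pipeline used in the study: the expression values are invented, and the pooled-variance (Student) form of the between-breed test is an assumption, since the text does not say which variant was applied. Only the test statistics are computed here; the p-value compared against 0.01 would come from the t distribution with the matching degrees of freedom (for example via `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`).

```python
from math import sqrt
from statistics import mean, stdev, variance


def two_sample_t(a, b):
    """Student's t statistic (pooled variance) for two independent groups.

    Used here for the between-breed comparison at one timepoint.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def paired_t(day0, later):
    """Paired t statistic for the same animals measured at day 0 and later."""
    diffs = [y - x for x, y in zip(day0, later)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical log2 expression values for one probe-set (four animals each).
ndama_day21 = [8.1, 8.4, 8.0, 8.3]
boran_day21 = [7.2, 7.5, 7.1, 7.4]
boran_day0 = [7.0, 7.1, 6.9, 7.2]

t_between = two_sample_t(ndama_day21, boran_day21)  # breed vs breed, day 21
t_within = paired_t(boran_day0, boran_day21)        # Boran, day 21 vs day 0

print(f"between-breed t = {t_between:.3f}, within-breed paired t = {t_within:.3f}")
```

The paired form matters for the within-breed comparison because each animal acts as its own control, so animal-to-animal baseline differences cancel out of the day 0 contrast.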
(1) School of Biological Sciences, BioSciences Building, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
(2) School of Computer Science, Kilburn Building, University of Manchester, Oxford Road, Manchester M13 9PL, UK
(3) Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian EH25 9PS, UK
(4) International Livestock Research Institute (ILRI), PO Box 30709, Nairobi 00100, Kenya
(5) Faculty of Life Sciences, University of Manchester, Smith Building, Oxford Road, Manchester M13 9PT, UK
QTL location    Phenotype
BTA2            Anaemia
BTA4            Parasitaemia
BTA7            Anaemia and parasitaemia
BTA16           Anaemia
BTA27           Anaemia
QTL data Sixteen trypanotolerance QTL had been
identified in a previous mapping study (Proc Natl
Acad Sci USA 2003; 100(13): 7443-7448). Five of these
QTL were selected based on the phenotypic trait
involved, the mapping resolution and the strength
of the effect (see the QTL table for a
summary of the QTL and associated
phenotypes). The base-pair positions of these QTL
relative to the EnsEMBL bovine genome preliminary
build Btau2.0 were determined manually.
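Once a QTL has base-pair boundaries, collecting the genes that lie within it is an interval-overlap query. The sketch below illustrates the idea only; the `Gene` and `QTL` types, the gene identifiers and all coordinates are hypothetical and are not taken from Btau2.0 or from the mapping study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chromosome: str
    start: int  # base-pair start position
    end: int    # base-pair end position


class QTL(NamedTuple):
    chromosome: str
    start: int
    end: int
    phenotype: str


def genes_in_qtl(genes, qtl):
    """Return genes on the QTL's chromosome whose span overlaps the QTL interval."""
    return [
        g for g in genes
        if g.chromosome == qtl.chromosome
        and g.start <= qtl.end
        and g.end >= qtl.start
    ]


# Hypothetical coordinates, for illustration only.
qtl = QTL("BTA7", 1_000_000, 5_000_000, "Anaemia and parasitaemia")
genes = [
    Gene("geneA", "BTA7", 900_000, 1_050_000),    # straddles the left boundary
    Gene("geneB", "BTA7", 2_000_000, 2_010_000),  # fully inside
    Gene("geneC", "BTA7", 6_000_000, 6_020_000),  # outside the interval
    Gene("geneD", "BTA2", 2_000_000, 2_010_000),  # right position, wrong chromosome
]

qtl_gene_ids = [g.gene_id for g in genes_in_qtl(genes, qtl)]
print(qtl_gene_ids)
```

Genes that only partly overlap a boundary are kept in this sketch; whether to include them is a design choice, and a stricter rule would require the whole gene to fall inside the interval.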
Combined analysis approach The gene underlying a
QTL is not assumed to be differentially
expressed. However, it is expected to connect
biologically with differentially expressed genes.
The rationale behind this approach is to
establish the possible connections. The analysis
procedure is described in Figure 1 (right). In
brief, it involves mapping QTL genes and
Affymetrix microarray probes to genes in the
EnsEMBL bovine preliminary build Btau2.0, then
identifying KEGG pathways that include the
EnsEMBL genes. The two resulting pathway lists
are compared to generate a list of KEGG pathways
that include at least one differentially
expressed gene and at least one gene in the QTL.
The pathway list is then ranked according to the
results of a Fisher exact test performed on the
microarray data using DAVID, and annotated using
literature searches and various public databases
of gene and pathway information. Large sections
of the analysis were automated (shown in blue in
Figure 1) by adapting Taverna workflows
previously developed for the study of
trypanosomiasis responses in mice (Nucl Acids Res
2007; 35(16): 5625-5633). The adaptations required
involved mapping genes to human homologues and
using bovine IDs and human IDs in the analysis,
rather than murine IDs.
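The comparison and ranking steps can be sketched as set operations followed by an over-representation test. This is an illustration under stated assumptions, not the Taverna workflow or DAVID's own implementation: the pathway names, gene identifiers and background set are invented, and the p-value is a plain one-sided Fisher exact (hypergeometric tail) probability computed directly from binomial coefficients.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    Probability of seeing at least k pathway genes among n differentially
    expressed genes, when K of the N background genes belong to the pathway.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def rank_pathways(pathways, de_genes, qtl_genes, background):
    """Keep pathways with >=1 DE gene and >=1 QTL gene, ranked by p-value."""
    ranked = []
    for name, members in pathways.items():
        de_hits = members & de_genes
        qtl_hits = members & qtl_genes
        if not de_hits or not qtl_hits:
            continue  # pathway must link the microarray data to a QTL
        p = fisher_enrichment_p(
            len(de_hits), len(de_genes), len(members & background), len(background)
        )
        ranked.append((p, name, sorted(qtl_hits)))
    return sorted(ranked)


# Hypothetical inputs: 20 background genes, 5 differentially expressed,
# 3 located in QTL regions, and three made-up pathways.
background = {f"g{i}" for i in range(1, 21)}
de_genes = {"g1", "g2", "g3", "g5", "g10"}
qtl_genes = {"g4", "g6", "g9"}
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g5", "g6", "g7"},
    "pathway_C": {"g8", "g9"},  # has a QTL gene but no DE gene, so it is dropped
}

ranking = rank_pathways(pathways, de_genes, qtl_genes, background)
for p, name, candidates in ranking:
    print(f"{name}: p = {p:.4f}, QTL candidates = {candidates}")
```

Note that the QTL gene itself never needs to be differentially expressed: it only has to share a pathway with genes that are, which mirrors the rationale stated above.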
Results The analysis procedure itself could be
reused or adapted for studying another species or
another phenotypic trait for which QTL data are
available. In the case of the bovine
trypanotolerance study, the result can be
quantified in terms of the reduction of an
enormous set of potential targets for
investigation to a manageable shortlist of the
most likely targets. Out of 24128 probe-sets on
the array, 12591 were significantly
differentially expressed (p < 0.01 in one or
more T-tests comparing expression between breeds
or over time). Of these, 8342 probe-sets could be
mapped to a known gene; in total they represented
7071 unique gene symbols. In contrast, there were
127 genes in the QTL that were involved in
pathways identified by the combined analysis
protocol. If we only include pathways with a
significant (p < 0.05) score on the DAVID Fisher
exact test, the list of targets is reduced to
only 51 genes (shown in the table below). Note
that these results are based on an analysis with
EnsEMBL bovine genome preliminary build Btau2.0.
A more recent preliminary build is available; the
analysis will be repeated and key findings
discussed in a future publication.
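The scale of the reduction can be worked through from the counts quoted above. The short calculation below uses only those counts; treating the 7071 gene symbols as the starting list, against which the 51-gene shortlist is compared, follows the framing of the text and is otherwise an assumption of this sketch.

```python
# Counts quoted in the Results text.
probe_sets_on_array = 24128
de_probe_sets = 12591        # differentially expressed in one or more tests
mapped_probe_sets = 8342     # DE probe-sets mapped to a known gene
unique_gene_symbols = 7071   # unique symbols among the mapped probe-sets
qtl_pathway_genes = 127      # QTL genes in pathways from the combined analysis
shortlist_genes = 51         # QTL genes in significantly ranked pathways


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


de_fraction = percent(de_probe_sets, probe_sets_on_array)
mapped_fraction = percent(mapped_probe_sets, de_probe_sets)
shortlist_fraction = percent(shortlist_genes, unique_gene_symbols)
fold_reduction = unique_gene_symbols / shortlist_genes

print(f"DE probe-sets: {de_fraction:.1f}% of the array")
print(f"Mapped to a known gene: {mapped_fraction:.1f}% of DE probe-sets")
print(f"Shortlist: {shortlist_fraction:.2f}% of DE gene symbols "
      f"(about {fold_reduction:.0f}-fold fewer targets)")
```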
Figure 1. Summary of the combined analysis
procedure. Stages of the analysis that were
automated using Taverna workflows are in blue.
Discussion Automated approaches are becoming
increasingly necessary to enable researchers to
handle the output from modern high-throughput
technologies. Data-driven methods are useful in
studying complex phenotypes where an analysis
based solely on biological processes already
known to be involved may be insufficient.
Pathway-based approaches provide a means to link
microarray data to QTL data in a biologically
meaningful way. Pathway-based, data-driven,
systematic, semi-automated analysis approaches
provide an excellent means to triage data from
high-throughput technologies, yielding a
shortlist of viable targets for thorough manual
investigation and experimental confirmation.
Acknowledgements This work was wholly supported
by The Wellcome Trust. The authors would also
like to thank Dr Park based in Dr McHugh's group
at University College Dublin for sharing bovine
gene symbol information for Affymetrix probes.