DiscoverySpace - PowerPoint PPT Presentation

1 / 1
About This Presentation
Title:

DiscoverySpace

Description:

Biological data is complex, dynamic and is generated at a phenomenal rate. ... Long SAGE libraries from the species Homo Sapiens and from Mammary Gland tissue. ... – PowerPoint PPT presentation

Number of Views:44
Avg rating:3.0/5.0
Slides: 2
Provided by: Mart326
Category:

less

Transcript and Presenter's Notes

Title: DiscoverySpace


1
DiscoverySpace A tool for gene expression
analysis and biological discovery
Richard Varhol, Neil Robertson, Mehrdad
Oveisi-Fordorei, Chris Fjell, Derek Leung, Asim
Siddiqui, Marco Marra, Steve Jones
www.bcgsc.ca/discoveryspace
Analysis using DiscoverySpace
DiscoverySpace
Further analysis
To begin a SAGE experiment we need to select SAGE
Libraries for further analysis. For the purposes
of this experiment we will use public data from
CGAP (Cancer Genome Anatomy Project), which
publishes over 280 SAGE libraries. To narrow the
field of available libraries, we search for
libraries with the specific qualities we are
interested in Long SAGE libraries from the
species Homo Sapiens and from Mammary
Gland tissue. This is done by defining a Query
(below) to restrict the set of CGAP Libraries to
the subset of interest.
Biological data is complex, dynamic and is
generated at a phenomenal rate. To obtain
meaningful results from such data
sets, familiarity is needed not only with the
types of data available, but also with how these
data sets are structured and how they are related.
DiscoverySpace is a graphical software
application that eliminates the need for a
detailed understanding of the underlying
data structures, allowing the researcher to
focus better on experimental results and
implications. Originally designed to support SAGE
gene expression analysis, DiscoverySpace is
targeted at the biologist who wishes to utilize
high-throughput bioinformatic techniques without
the complication of scripting languages
and database queries.
The figure to the left shows a DiscoverySpace
Query. Each green arrow represents the location
of a constraint, the details of which can be seen
in the lower section of the panel. e.g. The Name
property is restricted so that we view only those
libraries with names beginning with "LSAGE" (Long
SAGE). The figure above shows the results of the
Query defined. This query has narrowed the 280
CGAP libraries to just 14 of interest.
In another screenshot from the Explorer (above)
our set of Tag Sequences is mapped to only those
Refseq genes which are predicted to produce to
extracellular proteins. One can see that Refseq
NM_004994 is given a 78 chance of producing
extracellular proteins. We also view the Gene
Ontology (GO) Terms associated with each gene.
The one-to-many relationship to GO Terms is
represented in the Explorer using "expansion
points" which can be collapsed and expanded. Two
expansion points in this image are expanded, with
the others collapsed. The greyed out sections
signify the repeated data that is projected in
such an expansion. In this screenshot Refseq
NM_004994 is related to seven GO terms including
"extracellular space".
Now that we have identified two SAGE libraries of
interest, one benign and one cancer, we need to
define further queries to capture the tags from
those libraries. Once these two sets are defined
and loaded they can be compared using
DiscoverySpace's inbuilt "Scatter Plot" tool. The
Scatter Plot uses the Audic-Claverie algorithm to
detect statistical similarities and
differences between two sets of Tag Sequences.
The Scatter Plot provides an interactive
visualization that allows the user to select
further subsets. In this example we will isolate
only those Tag Sequences which are highly
expressed in the cancer library.
The image above shows a view of the
DiscoverySpace desktop and the various tools
available within the application. DiscoverySpace
provides an intuitive query builder for defining
subsets of the available data, which can then
be related to associated datasets in a familiar
spreadsheet-like interface (the Explorer). The
application provides statistical analysis tools
(Scatter Plot, Venn Table, Tag Search), which
allow the user to compare large sets of data and
elicit similarities and patterns. Depending upon
available annotations, datasets can be also
interpreted with regards to their associated Gene
Ontology Terms using DiscoverySpaces GO Analysis
Tool.
DiscoveryDB
In the screenshot from the Venn Table (above) our
two cancer libraries are compared against other
breast cancer libraries from CGAP. The Venn Table
allows the user to perform bulk statistical
analysis upon multisets of Tag Sequences to
detect patterns of expression. Expression can be
interpreted using a number of formulae, including
percentage, ratio and p-value. In the screenshot
we are excluding those Tag Sequences present in
the benign library and we are including only
those Tag Sequences which are present in all
cancer libraries. By applying this constraint
and by removing singleton tags, we have reduced
our set of interest to only 137 Tag Sequences.
Backed by a relational database system,
DiscoverySpace is able to query data from over 25
publicly available data sources, as well as
internal experimental results. DiscoveryDB
currently contains data about genes, proteins,
pathways, functional genetic components and
biological ontologies, all of which are
accessible through DiscoverySpace. The
collection of available data sources can easily
be expanded. While some resources publish data in
relational formats, others use proprietary
flat-file formats that require the development of
custom parsers. However, once import mechanisms
have been developed for a specific data source,
the database can be automatically updated based
on a schedule specified by the administrator.
The Query tool provides an intuitive interface
for joining and constraining across multiple data
sources. Here we create two Queries, one for each
desired tag set. In the Query above, we constrain
the CGAP SAGE Library datatype to specify only
library "LSAGE_Breast_fibroadenoma_MD". We
have moved the endpoint of the Query to capture
the Tag Sequences of the Experimental Tags of the
specified Library.
The figure above shows an image from the Scatter
Plot. Each point represents a particular Tag
Sequence and its location indicates the relative
expression of the given tag. The points in blue
represent those tags that are similarly
expressed. The points coloured red are those tags
that are upregulated in the cancer library. These
Tags have been selected and will now be isolated
into a further set for continued analysis.
3. Sub-Section Labels
The image above is a screenshot from the
DiscoverySpace Gene Ontology (GO) Analyzer. In
this example we have isolated those Refseq genes
which have unambiguously mapped to Tag Sequences
from the benign library. It is important to use
unambiguous mappings so that we can accurately
preserve expression statistics. The Refseq gene
records are then viewed in the GO Analyzer. Each
gene is mapped to its associated GO Terms and
those Terms are then scored with relation to the
whole expression set. In this image we are
looking at the level 3 GO Terms related to our
gene set. One can see, for example, that around
48 of expressed genes are related to the Term
"metabolism".
The image above represents the publicly available
data sources resident within DiscoveryDB and the
linkages between them. Most major resources
cross-reference one another, facilitating queries
across the knowledge space. In this image above
one can see that SwissProt is very heavily
referenced by other resources. In addition
analytical linkages can be generated by
performing post-processing of primary data. These
large scale computational analyses, for instance
BLAST and HMM, are stored in the database to
allow for rapid access. DiscoveryDB provides
specific support for SAGE (Serial Analysis of
Gene Expression) gene mapping operations. As gene
sequence data, such as Refseq and MGC, is
imported into DiscoveryDB, the sequence data is
parsed and Virtual Tags are extracted. These
Virtual Tags can then quickly be linked to the
Tag Sequences from experimental SAGE libraries.
DiscoveryDB also houses the CMOST database which
compiles the results of extensive data processing
to increase the accuracy and reliability of tag
to gene mapping.
Acknowledgments
Now that we have isolated the set of upregulated
Tag Sequences we need to map them to counterpart
genes. The figure above is a screenshot from the
DiscoverySpace Explorer which depicts a SAGE
mapping operation. In the main Table View (right)
one can see our set of Tag Sequences (light
green). The numbers in the left hand gutter
indicate the expression value of each of the
tags. The Tag Sequences are linked to the Refseq
Virtual Tags (darker green), which are themselves
linked to Refseq genes (blue). The Tree View
(left), which reflects the structure of the Table
View, is used to explore the data model and to
add and remove columns of interest. The hatched
rows in the table mark Tag Sequences for which no
mapping can be found. The data from the table can
easily be exported to other tools (MS Excel) for
additional analysis.
Write a Comment
User Comments (0)
About PowerShow.com