Title: Microarray Data Capture Workshop
1Microarray Data Capture Workshop
Friday 17th June 2005
2Program
3Presentation Overview
- Importance of meta-data capture
- MIAME, MGED ontology and MAGE
- Introduction to microarray storage using maxdLoad2
- Advanced features of maxdLoad2 including import and export of data
4Meta-data capture
5Applications of Microarrays
- Microarrays have many applications.
- They can target genes which react to various pharmacological agents over time.
- Which genes are involved in disease and which treatments affect these genes.
- Which genes are involved in the reaction of plants to environmental conditions.
- Determine formulas based on a gene's expression for diagnosing or predicting future outcomes (e.g., cancer recurrence).
6Microarrays
Biological question
Experiment design
Microarray experiment
Image analysis
Pre-processing
Analysis
Expression quantification
Normalisation
..
Prediction
Testing
Clustering
Estimation
Biological verification and interpretation
7Microarrays
Diagram taken from NCBI
8Diagram taken from NCBI
9The common scenario.....
- So we've done our experiment
- Extracted, amplified and labelled the mRNA
- Hybridised our samples to the arrays
- Scanned the arrays
- Analysed the data
- Written a paper
- Submitted it to PNAS
- Oh no, I've just re-read the information for
authors and they want it MIAME compliant and
publicly accessible for the review process
10Microarray Data
- Data is an asset
- Very long lived
- Can be used in many unforeseen ways
- Data mining
- Microarray Data
- Costly to generate
- Can be irreproducible
11Why capture meta-data?
- Sequence data is static.
- Post-genome data is highly state-dependent.
- Transcriptomic meta-data scales with the no. of cells and the no. of environmental conditions.
- Annotation is important!
- e.g., hybridisations carried out by different experimenters can account for one of the largest sources of systematic variation in an array-based experiment.
- We need to take lessons from the gene naming debacle, where one gene carries many names:
- Protein-tyrosine phosphatase, non-receptor type 6; Protein-tyrosine phosphatase 1C; PTP-1C; Hematopoietic cell protein-tyrosine phosphatase; SH-PTP1; Protein-tyrosine phosphatase SHP-1
- LARD; death receptor 3 beta; WSL-1R protein; lymphocyte associated receptor of death; death receptor 3
12Meta-data quality
- Accuracy
- Completeness
- Currency
- Important to be able to reference external sources rather than duplicate them
- Functional annotation that is not updated
- Gene names can change or obtain synonyms, without this being reflected in the data
- Chip files can be out of date
- Credibility
13Meta-data quality contd
- Portability
- Can the data be used outside of the context of its creation?
- Incomplete meta-data limits portability
14Microarray data repositories
- A repository needs to keep all relevant meta-data associated with a data set
- To be easily submitted, and to be searchable, data must adhere to standards, both in content and format
15Microarray repositories
- ArrayExpress is the repository of choice for many groups, particularly within Europe.
- Its good points:
- High quality data to search against
- Accepts MAGE-ML input from software pipelines
- Some of its disadvantages:
- Complicated web-based data entry tool (MIAMExpress)
- Convincing people to gather the extra data when other repositories may require less and are still MIAME compliant for publication. Activation energy.
- GEO (Gene Expression Omnibus) is hosted at the NCBI.
16Benefits of using a data repository
17The MGED Society
- To facilitate microarray data storage and communication, MGED have created:
- MAGE-OM: an object model linking the concepts behind a microarray experiment in packages
- MAGE-ML: an XML based language that represents the packages in MAGE-OM
- MGED Ontology: a controlled hierarchical vocabulary representing experimental concepts for annotation
18What is MIAME?
- MIAME is the internationally adopted standard for the Minimum Information About a Microarray Experiment.
- The result of an MGED-driven effort to codify the description of a microarray experiment.
- MIAME aims to define the core that is common to most experiments.
- Ultimately, it tries to specify the collection of information that would be needed to allow somebody to completely reproduce an experiment that was performed elsewhere.
- Exactly what "minimum" means is open to interpretation and depends on operator, software and most importantly the experiment being described.
19MIAME extensions
- MIAME does not have all the required vocabulary to describe all types of experiments.
- e.g., environmental genomics and toxicogenomics.
- This has resulted in the development of MIAME/Env and MIAME/Tox.
- MIAME/Env is an initiative spearheaded by the EGTDC to extend MIAME standards for annotation of environmental genomic data
- Includes the development of controlled vocabularies / ontologies to describe environmental genomic experiments.
- MIAME/Env is developed with the support of the MGED society and in collaboration with MIAME/Tox and members of the EBI.
20MAGE
- MAGE stands for MicroArray Gene Expression.
- It is broken into two equally important parts: MAGE-OM and MAGE-ML.
- MAGE-OM is an object model of a microarray experiment.
- It represents a generalised experiment which:
- can be manipulated to represent a specific experiment by adding information to objects (attributes).
- can model an experiment of any complexity by repeatedly linking objects to each other through treatments.
21MAGE-OM (doesn't stand for "oh my")
22MAGE-OM slightly simpler
23MGED Ontology
- Provides standard terms for the annotation of microarray experiments.
- An ontology is the formal representation of a domain, and allows complex paradigms to be reasoned over by automated systems.
- The terms enable:
- Structured queries of the elements of the experiments.
- Unambiguous descriptions of how the experiment was performed.
- Current version: 1.1.9 (updated every few months)
- 226 classes, 109 properties, 644 individuals
- Expands to add new terms to map to new experiment types/new uses of terms (and to correct existing errors as they're found).
24maxdLoad2: A tool for microarray experimental annotation
25Main features
- Loading, browsing, editing and searching.
- Extensible customisable attributes for each part of the schema.
- MIAME data capture.
- MAGE-ML data export.
26maxdLoad2: An extensible, MIAME-compliant database for microarray experiments
- A database schema and a software application.
- The second generation of maxdLoad.
- Integrated data loading, browsing, editing and searching.
- Written in Java, runs on most computers
- Supports any SQL92 database: Oracle, MySQL, Postgres, Sybase, Firebird
27Evolution of maxdLoad2
- The maxd software has been in development since 2000.
- The analysis and visualisation suite maxdView:
- Is based on a modular design - new features can be added as plugins.
- Lots of normalisation, filtering and plotting features are provided.
- The database component, maxdLoad, was based on the EBI's original ArrayExpress reference model.
- In maxdLoad2, the database design has been modified to more closely correspond to MIAME and MAGE concepts.
- The major advance is the customisable/extensible attribute mechanism; this feature is being used for rapid prototyping by the MIAME/Env project.
28System architecture
- maxdLoad2 is NOT accessed via a web-browser
- It is a stand-alone application, written in Java (this makes it very portable).
- maxdLoad2 and the database server can run on the same machine; no network connection or web server is needed.
- However, maxdLoad2 and the database server can be on separate machines connected via a network.
29Microarray experiment workflow
- A typical microarray experiment is a sequence of steps starting with one or more BioMaterials and ending up with a big pile of numbers.
- These steps can be thought of as transformations (material A + treatment → material B) and combinations (image + scanning → data).
- Each of the steps needs to be recorded in the database.
- Many of the steps will be standardised, for example, the protocol used for labelling. They will only have to be defined once.
30Why record everything?
- The more meta-data that is captured, the better the chance of explaining things when it all goes wrong!
- Most studies that have looked at between-study variation find that the biggest component of difference is lab (person, protocol, equipment), then array, then biology.
31Why all the structuring?
Free Text: easy to generate, hard to understand
- What's wrong with just describing what happened as a nice big document?
- It is very hard for software to understand the process and therefore difficult for the software to behave intelligently, or to assist the user in any way
- It makes reusing common bits of the description tricky; a general rule of thumb is "reuse is good, cut-and-paste is bad"
Structured Objects: hard to generate, easy to understand
32What is in the database?
- Experiment
- A collection of related hybridisations and the resulting data
- Experiment, Measurements, Images and Hybridisations
- Array Design
- The contents and the layout of a microarray
- ArrayType, Features, Reporters and Genes
- Bio-Materials
- The actual biological entities that are used
- LabelledExtract, Extract, TreatedSample, Sample, Source
- Protocols
- Standardised methods of operation in the laboratory
- ImageAnalysisProtocol, ScanningProtocol, etc.
33Bio-Materials model the experiment
- Source
- original organism, tissue sample
- Sample
- acquisition of material from a source
- TreatedSample
- is a sample which has something done to it
- Extract
- a portion of a TreatedSample selected for analysis
- LabelledExtract
- a TreatedSample that has been prepared for hybridisation
- These elements are generally constructed in the order shown above. The methods used in preparation and production are recorded using their associated Protocol elements.
34Modelling an experiment
- The various elements can be plugged together in different ways to represent the way the experiment is constructed.
- Components are wired together in reverse order: connections are based on where things came from, rather than on the sequence in which they were generated.
- Pooling and splitting operations are represented by having one instance linked to more than one other instance, or vice versa.
[Diagram: LabelledExtract, Extract, TreatedSample, Sample, Source, each linked back to the element it came from]
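The reverse-order wiring, pooling and splitting can be sketched in a few lines of Python. This is an illustration only, not maxdLoad2's schema: the `BioMaterial` class, its `derived_from` field and the `origins` helper are hypothetical names, and the instance names are made up. Only the five element kinds come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BioMaterial:
    """Hypothetical stand-in for one Bio-Material instance."""
    kind: str   # Source, Sample, TreatedSample, Extract or LabelledExtract
    name: str
    # Links point backwards: each instance records where it came from.
    derived_from: List["BioMaterial"] = field(default_factory=list)


def origins(material: BioMaterial) -> List[str]:
    """Follow the backward links and return the names of all Sources reached."""
    if material.kind == "Source":
        return [material.name]
    found: List[str] = []
    for parent in material.derived_from:
        for name in origins(parent):
            if name not in found:
                found.append(name)
    return found


# Pooling: one Sample linked to two Sources.
yeast_a = BioMaterial("Source", "yeast culture A")
yeast_b = BioMaterial("Source", "yeast culture B")
pooled = BioMaterial("Sample", "pooled sample", [yeast_a, yeast_b])

# Splitting: two TreatedSamples linked to the same Sample.
control = BioMaterial("TreatedSample", "control", [pooled])
shocked = BioMaterial("TreatedSample", "shocked", [pooled])

extract = BioMaterial("Extract", "shocked extract", [shocked])
labelled = BioMaterial("LabelledExtract", "shocked labelled extract", [extract])

print(origins(labelled))
```

Because every link records where an instance came from, the provenance of any LabelledExtract can be recovered by walking the links backwards, whatever order the instances were created in.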
35Protocols
[Diagram: a Sample gives TreatedSample Control (TreatmentProtocol: do nothing) and TreatedSample Shocked (TreatmentProtocol: heat_shock). Each of these gives two further TreatedSamples (TreatmentProtocol: wait 20 minutes, and TreatmentProtocol: wait 40 minutes), leading to four Extracts: Control 20 minutes, Control 40 minutes, Shocked 20 minutes and Shocked 40 minutes. In the diagram, "A" represents the application of a protocol.]
- The Protocol links explain why the Bio-Material
components have been connected in the way they
are.
36Arrays, Features, Reporters and Genes
- ArrayType models the platform
- Feature models the spaces where Reporters go (number, placing, size)
- Reporter models the contents of the Features: type of content (control, experimental), nature of sequence
- Gene models the relation of sequences to genetic information
Array
ArrayType
Feature Row 34, Col 17
Feature Row 3, Col 91
Feature Row 19, Col 28
Reporter
Reporter
Reporter
Gene
37Hybridisations: Storing the data
- A Measurement represents the collection of results from analysing the scanned image of a microarray after hybridisation. A Measurement can have any number of Properties associated with it.
- Each Property corresponds to one column in the file that came from the scanner (or to data generated by subsequent data analysis such as normalisation).
38Connecting to the database
- Connecting to a database requires the following information:
- The Database, which identifies the machine and server that is hosting the database, and the name of the database (one server may be hosting more than one database)
- The Driver File and Driver Name, which tell maxdLoad2 which database driver to use (these drivers are database specific)
- The User Name and Password, which identify which account on the database server should be used.
- Information about one or more connections can be saved and accessed from the list on the left-hand side.
- The built-in help system provides more details on how to set up a new database connection.
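As a rough sketch, the connection details listed above can be thought of as one small record per saved connection. The `ConnectionProfile` class and its field names are hypothetical, and the JDBC-style example values are an assumption based on maxdLoad2 being a Java application; the slides do not show the exact format maxdLoad2 expects.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ConnectionProfile:
    """Hypothetical record of the details needed for one connection."""
    database: str     # machine, server and database name
    driver_file: str  # location of the database-specific driver
    driver_name: str  # name of the driver to load
    user_name: str    # account on the database server
    password: str

    def missing_fields(self) -> List[str]:
        """Return the names of any details that have been left empty."""
        return [key for key, value in asdict(self).items() if not value]


# Example values are illustrative assumptions, not maxdLoad2 defaults.
profile = ConnectionProfile(
    database="jdbc:postgresql://localhost/maxd",
    driver_file="postgresql.jar",
    driver_name="org.postgresql.Driver",
    user_name="workshop",
    password="",
)
print(profile.missing_fields())
```

Checking for empty details before attempting to connect gives a clearer message than a failed login.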
39The User Interface
40The User-interface continued..
- These buttons control which mode the software is in (create, browse, find, edit or load)
- These buttons are used to open the form used to input or explore the data for each of the database components
- The arrows show how the components are interconnected
- These buttons access the other main features: import, export, options and the built-in help system.
41The Navigator Tree
- A representation of the schema as a hierarchy can also be displayed (in a separate window).
- This view shows all of the links from one instance to all others.
- Instances can be selected by clicking on their name.
- When multiple links exist between instances (e.g., 1 Extract linked to 5 Samples), individual links are highlighted as the mouse passes over them.
42The Navigator Tree
- An alternative representation of the schema
- Displayed either in a separate window or down the side of the main window
- Instances are selected by clicking on their name
- The navigator view is useful during instance creation as an aid in keeping track of which instances have been provided and which have not.
- The red line shows the path taken to the current form.
- Instances which have not yet been specified are tagged with yellow dots.
- As instances are selected or created, they are tagged with green dots, and their names are shown.
43The User-interface continued..
Clicking on one of the buttons opens up a
panel/form in which full details can be browsed
or edited
44The User-interface continued..
- Name(s) identify instances
- Links combine instances together and are defined by:
- Selecting the item from lists.
- Recursively filling in another form.
- Attributes store all other data about the instance
- Entered by typing data directly into the fields.
- Useful information can be found by clicking on the attribute name.
- Quick-copy function in data entry modes.
45Create Mode
- Required fields (yellow) are coloured differently to optional fields (blue).
- All required fields must be completed before a new instance can be created.
- Links to other instances that have recently been visited are chosen from pull-down lists.
- If a link to an instance which has not been defined is required, the "Create New" button opens a new form, which is then used to define the new instance
46Browse mode
- Allows the exploration of the database and the examination of links between instances.
- The "Find Linked" feature can find the connections between the current instance and instance(s) in any other table.
- Instances can be viewed by selecting from a list.
- The list can be filtered for ease of searching.
- The list can be sorted in chronological or alphabetical order.
47Find mode
- Search for instance(s).
- Instances can be found by specifying any combination of:
- One or more linked instances
- One or more attribute values
- All or part of a name
- This is done by filling in one or more fields in a form.
- The collection of matching instances is then displayed using Browse mode.
48Edit mode
- This mode is a combination of Create mode and Browse mode.
- The form is essentially the same as that of Create mode.
- Names, links and attributes can all be edited.
- Warning!!
- There is no audit trail, therefore once a value is changed, the previous value is lost forever.
49Advanced features
- In addition to the annotation features shown so far, maxdLoad2 has some tricks up its sleeve to improve data loading
- These are useful when numbers of arrays exceed a moderate figure
- We've processed experiments with 50 arrays manually
- We've processed experiments with >350 arrays with loading scripts
- They are also useful if dealing with less computer-confident experimentalists
- They are based around spreadsheets, which can be pre-filled with most relevant data, leaving gaps for day-to-day details
50Loading data
- In addition to entering data by hand using Create Mode, it is also possible to create instances by extracting data directly from a text file or Excel spreadsheet.
- Data is extracted by tagging lines and columns of the data source that are interesting.
- The Load Mode forms are essentially the same as Create Mode forms. However, instead of supplying final values for things, the column(s) containing the values are identified.
- As this process can be automated, it is useful for integrating maxdLoad2 with other lab software, especially LIM systems.
51Loading data user interface
- Presets
- File parser settings
- Data value settings
52Loading Data simple example
Extracted Data
Source Data File
Column Specification
5regex1,0-9
1
2
3
4
5
6
3regex2,0-9
53Automated data loading
- A data loading script (in an XML format) can be used to automate the process of loading data from one or more files.
- The format is exactly the same as the column specification method used in manual data loading.
- Existing preset settings can be used directly.
- Loading scripts could be generated automatically.
54Loading expression data
- The actual expression data (i.e., the results of processing the scanned image) are also loaded using the bulk data loading system.
- A collection of values (one per Feature) is called a Property
- All of the data manipulation methods described previously are available when loading the expression data (this is useful for handling missing values).
55Measurements and Properties
- A Measurement represents the collection of results from analysing the scanned image of a microarray after hybridisation.
- A Measurement can have any number of Properties associated with it.
- Each Property corresponds to one column in the file that came from the scanner (or to data generated by subsequent data analysis, such as normalisation).
- Extra Properties can be added to an existing Measurement at any time.
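The relationship between a scanner file, a Measurement and its Properties can be sketched as follows. This is a minimal illustration, assuming a tab-delimited file with a header row; the function names and the column names (`Feature`, `Signal`, `Background`, `Corrected`) are hypothetical and do not come from maxdLoad2.

```python
import csv
import io


def load_measurement(text, delimiter="\t"):
    """Split a delimited scanner file into named columns.

    Returns a dict mapping each column header to its list of values,
    one value per Feature (row). Each entry plays the role of a Property.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    properties = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            properties[name].append(value)
    return properties


def add_property(measurement, name, values):
    """Attach an extra Property, e.g. values produced by later analysis."""
    lengths = {len(column) for column in measurement.values()}
    if lengths and lengths != {len(values)}:
        raise ValueError("a Property needs one value per Feature")
    measurement[name] = list(values)


scanner_output = "Feature\tSignal\tBackground\nF1\t1200\t100\nF2\t800\t90\n"
measurement = load_measurement(scanner_output)

# A derived column added after the original load.
corrected = [int(s) - int(b)
             for s, b in zip(measurement["Signal"], measurement["Background"])]
add_property(measurement, "Corrected", corrected)
print(measurement["Corrected"])
```

Keeping each column as its own named collection is what lets extra Properties be attached later without touching the ones already loaded.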
56Measurement meta-data
- To define a Property the following information is required:
- Quantitation Type
- Scale
- Unit
- Origin
- An optional link can be created between a Property and a LabelledExtract instance.
- The Property element is directly compatible with the corresponding entities in the MAGE Object Model (MAGE-OM).
- Sufficient detail for the data to be describable by MAGE-OM is required.
- Data can be subsequently uploaded to public repositories without further annotation.
57Customising the database
- Involves changing the attribute description.
- Each table has fully customisable attributes.
- The description of the attributes is stored in the database.
- The descriptions can be referenced via URLs to facilitate sharing.
58Attribute descriptions
<Group name="Feature">
  <String name="Feature Group"/>
  <Group name="Location">
    <Integer name="Plate Row" completion="OPTIONAL" comment="Which plate on the array " />
    <Integer name="Plate Column" completion="OPTIONAL" comment="Which plate on the array " />
    <Integer name="Row" completion="OPTIONAL" comment="Which row position on the plate" />
    <Integer name="Column" completion="OPTIONAL" comment="Which column position on the plate" />
  </Group>
  <ListRef name="Shape" url="file://xml/MGEDOntology.xml#MGEDOntology.ArrayDesign.FeatureShape" />
  <Group name="Absolute Position" >
    <GroupRef name="X" url="file://xml/MAGE_Fundamental_Types.xml#MAGE.Measurement.Distance" />
    <GroupRef name="Y" url="file://xml/MAGE_Fundamental_Types.xml#MAGE.Measurement.Distance" />
  </Group>
  ...
59Attribute descriptions continued..
- Attributes can be tagged as OPTIONAL or REQUIRED.
- A default value can be provided.
- Integer and double attributes are type checked, and illegal values are indicated.
- Attribute descriptions can refer to external elements using an HTTP URL.
- This makes it easier to access predefined standard attribute definitions (e.g., MGED ontology terms).
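The kind of checking described above can be sketched in Python. This is not maxdLoad2's implementation: the `check` function is hypothetical, the `Double` tag and the `default` attribute are assumed names (the slides show only `Integer`, `String` and `completion`), and `Row` is marked REQUIRED here purely for illustration.

```python
import xml.etree.ElementTree as ET

# A small description in the style of the Feature example; the REQUIRED
# setting on "Row" is an illustrative change, not taken from the slides.
DESCRIPTION = """
<Group name="Location">
  <Integer name="Row" completion="REQUIRED" comment="Which row position on the plate" />
  <Integer name="Column" completion="OPTIONAL" comment="Which column position on the plate" />
</Group>
"""

# Tag name -> Python type used for the check ("Double" is an assumed tag).
CASTS = {"Integer": int, "Double": float, "String": str}


def check(description_xml, values):
    """Return a list of problems found in `values` for this description."""
    problems = []
    for element in ET.fromstring(description_xml):
        name = element.get("name")
        # Fall back to a default value if the description provides one.
        raw = values.get(name, element.get("default"))
        if raw is None:
            if element.get("completion") == "REQUIRED":
                problems.append(f"{name}: required value is missing")
            continue
        cast = CASTS.get(element.tag)
        if cast is None:
            continue  # nested groups and references are not handled here
        try:
            cast(raw)
        except ValueError:
            problems.append(f"{name}: {raw!r} is not a valid {element.tag}")
    return problems


print(check(DESCRIPTION, {"Row": "3", "Column": "x"}))
print(check(DESCRIPTION, {"Column": "7"}))
```

Because the rules live in the description rather than in the code, adding or tightening an attribute changes what is checked without changing the checker.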
60Adding and removing attributes
- Attributes are easily inserted by specifying where they should appear relative to an existing attribute
- Unwanted attributes can be removed from the definition using the RemoveAttribute element as follows:
<AddAttribute parent="Source" position="after" target="Sex">
  <Integer name="Specimen Weight" comment="weight in grams"/>
</AddAttribute>
<RemoveAttribute name="Source.Specimen Weight" />
61MAGE-ML Export
- Datasets in the database can be exported into MAGE-ML to ease submission into the ArrayExpress database.
- For each table there is an associated output template.
- There is a set of standard attribute files adhering to the MIAME attribute definitions.
- The file is modified when new attributes are added to an element, so they are included in the MAGE-ML outputs.
62Output templates
- A portion of an output template used to generate the MAGE-ML for describing an Image instance
- It instructs maxdLoad2 to generate a BioMaterialMeasurement for each of the LabelledExtract instances that are linked to the Image.
- The variables will be replaced with values extracted from the database as the output file is created.
Raw output
Control flow
Variable
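The three ingredients of a template (raw output, control flow and variables) can be illustrated with Python's standard `string.Template`. This is a sketch of the general idea only: it does not use maxdLoad2's template syntax, and the element and attribute names in the output are placeholders rather than real MAGE-ML.

```python
from string import Template
from xml.sax.saxutils import quoteattr

# Placeholder fragment: $extract_name is the variable part.
ITEM = Template("<BioMaterialMeasurement bioMaterial=$extract_name/>")


def render_image(image_name, labelled_extracts):
    """Render one Image with an entry for each linked LabelledExtract."""
    lines = [f"<Image name={quoteattr(image_name)}>"]    # raw output
    for extract_name in labelled_extracts:               # control flow
        # quoteattr escapes the value and wraps it in quotes.
        lines.append("  " + ITEM.substitute(extract_name=quoteattr(extract_name)))
    lines.append("</Image>")                             # raw output
    return "\n".join(lines)


print(render_image("scan 1", ["control labelled extract", "drugX labelled extract"]))
```

The loop plays the part of the control flow: it repeats the same fragment once per linked instance, filling in the variable with a value taken from the data.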
63Output templates
- Most users will never see (or want to see) the MAGE-ML output templates.
- They only need to be manipulated when new attributes have been added and these new attributes are required to appear in the MAGE-ML that maxdLoad2 creates.
- In most cases, extending the output templates will be a simple cut-and-paste operation.
- Completely different output templates could potentially be defined for exporting data in other formats (e.g., for import into a LIMS, or for automated web-page generation).
64Availability
- maxdLoad2 is an open-source product, released under the Perl Artistic Licence.
- The latest version is available at:
- http://www.bioinf.man.ac.uk/microarray/maxd/
- To join the mailing list for further announcements send an email to:
- ecartis_at_cs.man.ac.uk
65Developers
- maxd development is currently funded by NERC as part of a large UK-wide project themed on environmental genomics. http://envgen.nox.ac.uk/
- Microarray Bioinformatics Group at The University of Manchester
- Dave Hancock
- Norman Morrison
- Prof. Andy Brass
66File parser settings
- These settings describe how the file should be converted into a row/column matrix:
- Text Encoding (most files are US-ASCII, but some are not)
- Delimiter (how to split lines into columns)
- Ignore first, ignore last (skip header and footer lines)
- Ignore until, ignore after (regular expressions identifying start and end lines)
- Ignore beginning (to detect comment lines)
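A minimal Python sketch of a parser driven by settings like those above is given here. The function and parameter names are hypothetical, and the exact behaviour (for example, that the lines matching "ignore until" and "ignore after" are themselves dropped) is an assumption rather than a description of maxdLoad2.

```python
import re


def parse_matrix(raw, encoding="us-ascii", delimiter="\t",
                 ignore_first=0, ignore_last=0,
                 ignore_until=None, ignore_after=None,
                 ignore_beginning=None):
    """Convert the bytes of a text file into a row/column matrix."""
    lines = raw.decode(encoding).splitlines()
    # Skip a fixed number of header and footer lines.
    lines = lines[ignore_first:len(lines) - ignore_last]
    # Drop everything up to and including the first line matching ignore_until.
    if ignore_until is not None:
        for index, line in enumerate(lines):
            if re.search(ignore_until, line):
                lines = lines[index + 1:]
                break
    # Drop the first line matching ignore_after and everything following it.
    if ignore_after is not None:
        for index, line in enumerate(lines):
            if re.search(ignore_after, line):
                lines = lines[:index]
                break
    # Drop comment lines, identified by how they begin.
    if ignore_beginning is not None:
        lines = [line for line in lines if not line.startswith(ignore_beginning)]
    return [line.split(delimiter) for line in lines if line]


raw = b"scanner v1\n[data]\n# comment\nA\t1\nB\t2\n[end]\nchecksum\n"
matrix = parse_matrix(raw, ignore_first=1, ignore_last=1,
                      ignore_until=r"^\[data\]", ignore_after=r"^\[end\]",
                      ignore_beginning="#")
print(matrix)
```

Each setting removes one kind of non-data line, so what remains can be split into columns with a single delimiter.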
67Data value settings
- Column Specification: settings that describe how the data values are extracted from the row/column matrix
- Default and unwanted values can be specified (on a per-column basis)
- Values can be formed by combining multiple columns
- Regular Expressions can be used to modify the data format (e.g., changing 11/31/02 to 31-11-02)
- Values can be translated (substituting one value for another)
- Values can be converted to upper- or lower-case
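The value-level operations listed above can be sketched as one small Python function applied to a row of the matrix. The function name, its parameters and the order in which the steps are applied are all assumptions made for illustration; column numbers here are zero-based.

```python
import re


def extract_value(row, columns, joiner=" ", default=None, unwanted=(),
                  pattern=None, replacement=None, translate=None, case=None):
    """Build one data value from one or more columns of a row."""
    # Combine multiple columns into a single value.
    value = joiner.join(row[index] for index in columns)
    # Replace empty or unwanted values with the default.
    if value == "" or value in unwanted:
        return default
    # Rewrite the format with a regular expression.
    if pattern is not None:
        value = re.sub(pattern, replacement, value)
    # Substitute one value for another.
    if translate is not None:
        value = translate.get(value, value)
    # Convert to upper or lower case.
    if case == "upper":
        value = value.upper()
    elif case == "lower":
        value = value.lower()
    return value


row = ["11/31/02", "ctrl", "N/A", "yeast", "s288c"]

print(extract_value(row, [0], pattern=r"(\d+)/(\d+)/(\d+)", replacement=r"\2-\1-\3"))
print(extract_value(row, [1], translate={"ctrl": "Control"}))
print(extract_value(row, [2], unwanted=("N/A",), default="unknown"))
print(extract_value(row, [3, 4], case="upper"))
```

The first call reproduces the date example from the slide, swapping the first two fields and replacing the slashes with hyphens.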
68Presets
- Once a set of column specifications and parser options have been determined, they can be saved as a Preset.
- These settings can then be easily recalled next time a file with the same format is encountered.
- Presets are stored as plain-text files which can be shared between users.
69Hands-on session
70Questionnaires
- http://www2.cs.man.ac.uk/nashara/questionnaire/
71Background Information
- Saccharomyces cerevisiae is an emerging
opportunistic pathogen in immunosuppressed and
immunocompromised patients and has been
associated with fungemia, endocarditis,
peritonitis, meningitis, ventriculitis, and with
polymicrobial fatal pneumonia in AIDS patients.
72What are we investigating?
- We are investigating the action of an experimental drug, drugX, purified from the Mabuti tree.
- The drug has been shown to significantly inhibit the growth of S. cerevisiae in minimal media.
- The aim is to determine drugX's mechanism of action by microarray analysis.
74Create a Source
- Source: original organism, tissue sample, etc.
- In this experiment, yeast is the source
75Creating a Sample
- Choose the source
- Choose the sampling protocol
- Fill in various fields about how the sample was
produced
76Create TreatedSamples
- How many TreatedSamples are there?
- A TreatedSample is something that has had
something done to it
77TreatedSamples
78TreatedSamples
- Provide a name
- Select the Sample
- Select the TreatmentProtocol
- Describe the TreatedSample
79Creating LabelledExtracts
- Provide a name
- Select Extract and LabellingProtocol
- Enter
- Material type
- Labelling compound
- Quantity
- Do for each LabelledExtract
80Create Array
- Name
- ArrayType
- Repeat for the number of arrays in experiment (in
this case there are 2 for one experiment as there
is a dye-swap)
81Create Hybridisation
- Provide a Name
- Select
- HybridisationProtocol
- LabelledExtracts for hybridisation
- Array used for hybridisation
- Fill in any other fields
82Create Image entry
83Create Experiment entry
- Create a new submitter and fill in required details
- Select the Measurements in the experiment
- Describe the experiment by selecting the relevant design type, experimental factors, etc.