Microarray Data Capture Workshop

Provided by: gmo69
1
Microarray Data Capture Workshop
Friday 17th June 2005
2
Program
3
Presentation Overview
  • Importance of meta-data capture
  • MIAME, MGED ontology and MAGE
  • Introduction to microarray storage using
    maxdLoad2
  • Advanced features of maxdLoad2 including import
    and export of data

4
Meta-data capture
5
Applications of Microarrays
  • Has many applications
  • Can target genes which react to various
    pharmacological agents over time.
  • Which genes are involved in disease and which
    treatments affect these genes.
  • Which genes are involved in the reaction of
    plants to environmental conditions
  • Determine formulas based on a gene's expression
    for diagnosing or predicting future outcomes
    (e.g., cancer recurrence).

6
Microarrays
  • Biological question
  • Experiment design
  • Microarray experiment
  • Image analysis
  • Pre-processing (expression quantification, normalisation, ..)
  • Analysis (estimation, testing, clustering, prediction)
  • Biological verification and interpretation
7
Microarrays
Diagram taken from NCBI
8
Diagram taken from NCBI
9
The common scenario.....
  • So we've done our experiment
  • Extracted, amplified and labelled the mRNA
  • Hybridised our samples to the arrays
  • Scanned the arrays
  • Analysed the data
  • Written a paper
  • Submitted it to PNAS
  • Oh no, I've just re-read the information for
    authors and they want it MIAME compliant and
    publicly accessible for the review process

10
Microarray Data
  • Data is an asset
  • Very long lived
  • Can be used in many unforeseen ways
  • Data mining
  • Microarray Data
  • Costly to generate
  • Can be irreproducible

11
Why capture meta-data?
  • Sequence data is static.
  • Post-genome data is highly state-dependent.
  • The amount of transcriptomic meta-data scales with
    the no. of cells and the no. of environmental
    conditions.
  • Annotation is important!
  • e.g., hybridisations carried out by different
    experimenters can account for one of the largest
    sources of systematic variation in an array-based
    experiment.
  • We need to take lessons from the gene naming debacle,
    where a single gene carries many names, e.g.:
  • Protein-tyrosine phosphatase, non-receptor type
    6, Protein-tyrosine phosphatase 1C, PTP-1C,
    Hematopoietic cell protein-tyrosine phosphatase,
    SH-PTP1, Protein-tyrosine phosphatase SHP-1
  • LARD, death receptor 3 beta, WSL-1R protein,
    lymphocyte associated receptor of death, death
    receptor 3

12
Meta-data quality
  • Accuracy
  • Completeness
  • Currency
  • Important to be able to reference external sources
    rather than duplicate them
  • Functional annotation that is not updated
  • Gene names can change or obtain synonyms, without
    this being reflected in the data
  • Chip files can be out of date
  • Credibility

13
Meta-data quality cont'd
  • Portability
  • Can the data be used outside of the context of
    its creation?
  • Incomplete meta-data limits portability

14
Microarray data repositories
  • Repository needs to keep all relevant meta-data
    associated with a data set
  • To be easily submitted, and to be searchable,
    data must adhere to standards, both in content
    and format

15
Microarray repositories
  • ArrayExpress is the repository of choice for many
    groups, particularly within Europe.
  • Its good points
  • High quality data to search against
  • Accepts MAGE-ML input from software pipelines
  • Some of its disadvantages
  • Complicated web-based data entry tool
    (MIAMExpress)
  • Convincing people to gather the extra data when
    other repositories may require less and are still
    MIAME compliant for publication. Activation
    energy.
  • GEO (Gene Expression Omnibus) is hosted at the
    NCBI.

16
Benefits of using a data repository
17
The MGED Society
  • To facilitate microarray data storage and
    communication, MGED have created
  • MAGE-OM
  • An object model linking the concepts behind a
    microarray experiment in packages
  • MAGE-ML
  • An XML based language that represents the
    packages in MAGE-OM
  • MGED Ontology
  • A controlled hierarchical vocabulary representing
    experimental concepts for annotation

18
What is MIAME?
  • MIAME is the internationally adopted standard for
    the Minimal Information About a Microarray
    Experiment.
  • The result of an MGED-driven effort to codify the
    description of a microarray experiment.
  • MIAME aims to define the core that is common to
    most experiments.
  • Ultimately, it tries to specify the collection of
    information that would be needed to allow
    somebody to completely reproduce an experiment
    that was performed elsewhere.
  • Exactly what "minimum" means is open to
    interpretation and depends on operator, software
    and most importantly the experiment being
    described.

19
MIAME extensions
  • MIAME does not have all the required vocabulary
    to describe all types of experiments.
  • e.g., environment genomics and toxicogenomics.
  • This resulted in the development of MIAME/Env
    and MIAME/Tox.
  • MIAME/Env is an initiative spearheaded by the
    EGTDC to extend MIAME standards for annotation of
    environmental genomic data
  • Includes the development of controlled
    vocabularies / ontologies to describe
    environmental genomic experiments.
  • MIAME/Env is developed with the support of the MGED
    Society and in collaboration with MIAME/Tox and
    members of the EBI.

20
MAGE
  • MAGE stands for MicroArray Gene Expression.
  • It is broken into two equally important parts:
    MAGE-OM and MAGE-ML.
  • MAGE-OM is an object model of a microarray
    experiment.
  • It represents a generalised experiment, which can
    be manipulated to represent a specific experiment by
  • adding information to objects (attributes).
  • linking objects to each other by treatments
    reiteratively to model any complexity of
    experiment.

21
MAGE-OM (doesn't stand for "oh my")
22
MAGE-OM slightly simpler
23
MGED Ontology
  • Provides standard terms for the annotation of
    microarray experiments.
  • An ontology is the formal representation of a
    domain, and allows complex paradigms to be
    reasoned over by automated systems.
  • The terms enable
  • structured queries of the elements of the
    experiments.
  • Unambiguous descriptions of how the experiment
    was performed.
  • Current version: 1.1.9 (updated every few months)
  • 226 classes, 109 properties, 644 individuals
  • Expands to add new terms to map to new experiment
    types/new uses of terms (and to correct existing
    errors as they're found).

24
maxdLoad2: a tool for microarray experimental
annotation
25
Main features
  • Loading, browsing, editing and searching.
  • Extensible customisable attributes for each part
    of the schema.
  • MIAME data capture.
  • MAGE-ML data export.

26
maxdLoad2: an extensible, MIAME-compliant
database for microarray experiments
  • A database schema and a software application.
  • The second-generation of maxdLoad.
  • Integrated data loading, browsing, editing and
    searching.
  • Written in Java, runs on most computers
  • Supports any SQL92 database: Oracle, MySQL,
    Postgres, Sybase, Firebird

27
Evolution of maxdLoad2
  • The maxd software has been in development since
    2000.
  • The analysis and visualisation suite, maxdView,
    is based on a modular design: new features can
    be added as plugins.
  • Lots of normalisation, filtering and plotting
    features are provided.
  • The database component, maxdLoad, was based on the
    EBI's original ArrayExpress reference model.
  • In maxdLoad2, the database design has been
    modified to more closely correspond to MIAME and
    MAGE concepts.
  • The major advance is the customisable/extensible
    attribute mechanism; this feature is being
    used for rapid prototyping by the MIAME/Env
    project.

28
System architecture
  • maxdLoad2 is NOT accessed via a web-browser
  • It is a stand-alone application, written in Java
    (this makes it very portable).
  • maxdLoad2 and the database server can run on the
    same machine, no network connection or web server
    is needed.
  • However, maxdLoad2 and the database server can be
    on separate machines connected via a network.

29
Microarray experiment workflow
  • A typical microarray experiment is a sequence of
    steps starting with one or more BioMaterials
    and ending up with a big pile of numbers.
  • These steps can be thought of as
    transformations (material A + treatment →
    material B) and combinations (image + scanning →
    data).
  • Each of the steps needs to be recorded in the
    database.
  • Many of the steps will be standardised, for
    example, the protocol used for labelling. They
    will only have to be defined once.

30
Why record everything?
  • The more meta-data that is captured, the better
    the chance of explaining things when it all goes
    wrong!
  • Most studies that have looked at between-study
    variation find that the biggest component of
    difference is lab (person, protocol, equipment),
    then array, then biology

31
Why all the structuring?
Free text: easy to generate, hard to understand
  • What's wrong with just describing what happened
    as a nice big document?
  • It is very hard for software to understand the
    process and therefore difficult for the software
    to behave intelligently, or to assist the user in
    any way
  • It makes reusing common bits of the description
    tricky; a general rule of thumb is that reuse is
    good and cut-and-paste is bad

Structured objects: hard to generate, easy to
understand
32
What is in the database?
  • Experiment
  • A collection of related hybridisations and the
    resulting data
  • Experiment, Measurements, Images and
    Hybridisations
  • Array Design
  • The contents and the layout of a microarray
  • ArrayType, Features, Reporters and Genes
  • Bio-Materials
  • The actual biological entities that are used
  • LabelledExtract, Extract, TreatedSample, Sample,
    Sources
  • Protocols
  • Standardised methods of operation in the
    laboratory
  • ImageAnalysisProtocol, ScanningProtocol, etc.

33
Bio-Materials model the experiment
  • Source
  • original organism, tissue sample
  • Sample
  • acquisition of material from a source
  • Treated Sample
  • is a sample which has something done to it
  • Extract
  • a portion of a TreatedSample selected for
    analysis
  • LabelledExtract
  • an Extract that has been prepared for
    hybridisation
  • These elements are generally constructed in the
    order shown above. The methods used in
    preparation and production are recorded using
    their associated Protocol elements.

34
Modelling an experiment
  • The various elements can be plugged together in
    different ways to represent the way the
    experiment is constructed.
  • Components are wired together in reverse order:
    connections are based on where things came from,
    rather than on the sequence in which they were
    generated.
  • Pooling and splitting operations are represented
    by having one instance linked to more than one
    other instance, or vice versa.

Diagram: LabelledExtract → Extract → TreatedSample → Sample → Source
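The "wired together in reverse order" idea can be sketched in a few lines of Python. This is only an illustration of the linking scheme described on the slide, not maxdLoad2's actual schema or code: the class name BioMaterial, its fields and the example instance names are all assumptions made for the sketch. Each element stores links to whatever it came from, so pooling is simply an element with more than one link.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BioMaterial:
    """One element of the chain (hypothetical structure, for illustration)."""
    kind: str                       # "Source", "Sample", "TreatedSample", ...
    name: str
    protocol: Optional[str] = None  # protocol that produced this element
    derived_from: List["BioMaterial"] = field(default_factory=list)


def sources_of(material: BioMaterial) -> List[str]:
    """Follow the 'came from' links back to every originating Source."""
    if material.kind == "Source":
        return [material.name]
    found: List[str] = []
    for parent in material.derived_from:
        for name in sources_of(parent):
            if name not in found:
                found.append(name)
    return found


# Two sources pooled into one Extract: one instance linked to two others.
yeast_a = BioMaterial("Source", "yeast culture A")
yeast_b = BioMaterial("Source", "yeast culture B")
sample_a = BioMaterial("Sample", "sample A", "sampling", [yeast_a])
sample_b = BioMaterial("Sample", "sample B", "sampling", [yeast_b])
treated_a = BioMaterial("TreatedSample", "A shocked", "heat_shock", [sample_a])
treated_b = BioMaterial("TreatedSample", "B shocked", "heat_shock", [sample_b])
pooled = BioMaterial("Extract", "pooled extract", "extraction",
                     [treated_a, treated_b])
labelled = BioMaterial("LabelledExtract", "pooled Cy5", "labelling", [pooled])

print(sources_of(labelled))
```

Because every link points backwards, the provenance of any LabelledExtract can be recovered by walking the links, whatever order the instances were created in.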
35
Protocols
Diagram: one Sample is treated with two TreatmentProtocols, "do nothing"
(giving TreatedSample Control) and "heat_shock" (giving TreatedSample
Shocked). Each of these is then treated with "wait 20 minutes" and "wait 40
minutes", giving four TreatedSamples, from which four Extracts are taken:
  • Extract Control 20 minutes
  • Extract Control 40 minutes
  • Extract Shocked 20 minutes
  • Extract Shocked 40 minutes
Each "A" node in the diagram represents the application of a protocol.
  • The Protocol links explain why the Bio-Material
    components have been connected in the way they
    are.

36
Arrays, Features, Reporters and Genes
  • ArrayType models the platform
  • Feature models the spaces where Reporters go
    (number, placing, size)
  • Reporter models the contents of the Features:
    type of content (control, experimental) and nature
    of sequence
  • Gene models the relation of sequences to genetic
    information

Diagram: an Array of a given ArrayType holds Features (e.g., Row 34, Col 17;
Row 3, Col 91; Row 19, Col 28), each Feature contains a Reporter, and
Reporters are linked to a Gene.
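A minimal Python sketch of the ArrayType / Feature / Reporter / Gene relationship follows. It is a simplification under stated assumptions: a Feature is reduced to a (row, column) position, the class and field names are hypothetical rather than taken from the maxdLoad2 schema, and the reporter and gene names are invented examples. The positions are the ones shown in the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Gene:
    name: str


@dataclass(frozen=True)
class Reporter:
    name: str
    content: str                  # "control" or "experimental"
    gene: Optional[Gene] = None   # control reporters may have no gene


@dataclass
class ArrayType:
    name: str
    # Feature position (row, column) -> Reporter placed at that position
    features: Dict[Tuple[int, int], Reporter] = field(default_factory=dict)

    def positions_for_gene(self, gene_name: str) -> List[Tuple[int, int]]:
        """Return every Feature position whose Reporter maps to the gene."""
        return sorted(
            position
            for position, reporter in self.features.items()
            if reporter.gene is not None and reporter.gene.name == gene_name
        )


gene = Gene("example gene")                      # invented name
design = ArrayType("example array type")
design.features[(34, 17)] = Reporter("reporter 1", "experimental", gene)
design.features[(3, 91)] = Reporter("reporter 2", "experimental", gene)
design.features[(19, 28)] = Reporter("blank", "control")

print(design.positions_for_gene("example gene"))
```

Keeping Reporter separate from Feature is what lets the same sequence appear at several positions on the array, and lets several Reporters point at one Gene.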
37
Hybridisations: storing the data
  • A Measurement represents the collection of
    results from analysing the scanned image of a
    microarray after hybridisation. Any number of
    Properties can be associated with a Measurement.
  • Each Property corresponds to one column in the
    file that came from the scanner (or to data
    generated by subsequent data analysis such as
    normalisation).

38
Connecting to the database
  • Connecting to a database requires the following
    information
  • The Database, which identifies the machine and
    server that is hosting the database, and the name
    of the database (one server may be hosting more
    than one database)
  • The Driver File and Driver Name, which tell
    maxdLoad2 which database driver to use (these
    drivers are database specific)
  • The User Name and Password, which identify which
    account on the database server should be used.
  • Information about one or more connections can be
    saved and accessed from the list on the left-hand
    side.
  • The built in help system provides more details on
    how to set up a new database connection.

39
The User Interface
40
The User-interface continued..
  • These buttons control which mode the software is
    in (create, browse, find, edit or load)
  • These buttons are used to open the form used to
    input or explore the data for each of the
    database components
  • The arrows show how the components are
    interconnected
  • These buttons access the other main features
    import, export, options and the built-in help
    system.

41
The Navigator Tree
  • A representation of the schema as a hierarchy can
    also be displayed (in a separate window).
  • This view shows all of the links from one
    instance to all others.
  • Instances can be selected by clicking on their
    name.
  • When multiple links exist between instances
    (e.g., 1 Extract linked to 5 Samples),
    individual links are highlighted as the mouse
    passes over them.

42
The Navigator Tree
  • An alternative representation of the schema
  • Displayed either in a separate window or down the
    side of the main window
  • Instances are selected by clicking on their name
  • The navigator view is useful during instance
    creation as an aid in keeping track of which
    instances have been provided and which have not.
  • The red line shows the path taken to the current
    form.
  • Instances which have not yet been specified are
    tagged with yellow dots.
  • As instances are selected or created, they are
    tagged with green dots, and their names are
    shown.

43
The User-interface continued..
Clicking on one of the buttons opens up a
panel/form in which full details can be browsed
or edited
44
The User-interface continued..
  • Name(s) identify instances
  • Links combine instances together and are
    defined by
  • Selecting the item from lists.
  • Recursively filling in another form.
  • Attributes store all other data about the
    instance
  • Entered by typing data directly into the fields.
  • Useful information can be found by clicking on
    the attribute name.
  • Quick-copy function in data entry modes.

45
Create Mode
  • Required fields (yellow) are coloured
    differently to optional fields (blue).
  • All required fields must be completed before a
    new instance can be created.
  • Links to other instances that have recently been
    visited are chosen from pull-down lists.
  • If a link to an instance which has not been
    defined is required, the Create New button
    opens a new form, which is then used to define
    the new instance

46
Browse mode
  • Allows the exploration of the database and the
    examination of links between instances.
  • The Find Linked function can find the connections
    between the current instance and instance(s) in
    any other table.
  • Instances can be viewed by selecting them from a list.
  • The list can be filtered for ease of searching.
  • The list can be sorted in chronological or
    alphabetical order.

47
Find mode
  • Search for instance(s).
  • Instances can be found by specifying any
    combination of
  • One or more linked instances
  • One or more attribute values
  • All or part of a name
  • This is done by filling in one or more fields in
    a form.
  • The collection of matching instances is then
    displayed using Browse mode.
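The matching rule described above (any combination of name fragment, attribute values and linked instances) can be sketched as a simple filter. This is not how maxdLoad2 implements Find mode, which queries the database; the function name find, the dictionary layout and the example records are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Optional


def find(instances: Iterable[dict],
         name_part: Optional[str] = None,
         attributes: Optional[Dict[str, str]] = None,
         linked_to: Optional[List[str]] = None) -> List[str]:
    """Return names of instances that satisfy every criterion supplied."""
    matches = []
    for instance in instances:
        # All or part of a name (case-insensitive)
        if name_part and name_part.lower() not in instance["name"].lower():
            continue
        # One or more attribute values, all of which must match
        if attributes and any(instance["attributes"].get(key) != value
                              for key, value in attributes.items()):
            continue
        # One or more linked instances, all of which must be present
        if linked_to and not set(linked_to) <= set(instance["links"]):
            continue
        matches.append(instance["name"])
    return matches


extracts = [
    {"name": "Control 20 minutes",
     "attributes": {"Material type": "total RNA"},
     "links": ["TreatedSample Control"]},
    {"name": "Shocked 20 minutes",
     "attributes": {"Material type": "total RNA"},
     "links": ["TreatedSample Shocked"]},
    {"name": "Shocked 40 minutes",
     "attributes": {"Material type": "mRNA"},
     "links": ["TreatedSample Shocked"]},
]

print(find(extracts, name_part="shocked",
           attributes={"Material type": "total RNA"}))
print(find(extracts, linked_to=["TreatedSample Shocked"]))
```

Leaving a criterion empty means it is not applied, which mirrors filling in only some of the fields on the form.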

48
Edit mode
  • This mode is a combination of Create mode and
    Browse mode.
  • Its form is essentially the same as that of Create
    mode.
  • Names, links and attributes can all be edited.
  • Warning!!
  • There is no audit trail, therefore once a value
    is changed, the previous value is lost forever.

49
Advanced features
  • In addition to the annotation features shown so far,
    maxdLoad2 has some tricks up its sleeve to
    improve data loading
  • These are useful when numbers of arrays exceed a
    moderate figure
  • We've processed experiments with 50 arrays
    manually
  • We've processed experiments with >350 arrays with
    loading scripts
  • They are also useful when dealing with less
    computer-confident experimentalists
  • They are based around spreadsheets, which can be
    pre-filled with most relevant data, leaving gaps
    for day-to-day details

50
Loading data
  • In addition to entering data by hand using
    Create Mode, it is also possible to create
    instances by extracting data directly from a text
    file or Excel spreadsheet.
  • Data is extracted by tagging lines and columns of
    the data source that are interesting.
  • The Load Mode forms are essentially the same as
    Create Mode forms. However, instead of
    supplying final values for things, the column(s)
    containing the values are identified.
  • As this process can be automated, it is useful
    for integrating maxdLoad2 with other lab
    software, especially LIM systems.

51
Loading data user interface
  • Presets
  • File parser settings
  • Data value settings

52
Loading Data simple example
Figure: a Source Data File with columns numbered 1 to 6, a Column
Specification for each extracted value (shown as 5regex1,0-9 and
3regex2,0-9), and the resulting Extracted Data.
53
Automated data loading
  • A data loading script (in an XML format) can be
    used to automate the process of loading data from
    one or more files.
  • The format is exactly the same as the column
    specification method used in manual data
    loading.
  • Existing preset settings can be used directly.
  • Loading scripts could be generated
    automatically.

54
Loading expression data
  • The actual expression data (i.e., the results of
    processing the scanned image) are also loaded
    using the bulk data loading system.
  • A collection of values (one per Feature) is
    called a Property
  • All of the data manipulation methods described
    previously are available when loading the
    expression data (this is useful for handling
    missing values).

55
Measurements and Properties
  • A Measurement represents the collection of
    results from analysing the scanned image of a
    microarray after hybridisation.
  • Any number of Properties can be associated
    with a Measurement.
  • Each Property corresponds to one column in the
    file that came from the scanner (or to data
    generated by subsequent data analysis, such as
    normalisation).
  • Extra properties can be added to an existing
    Measurement at any time.

56
Measurement meta-data
  • To define a property the following information is
    required
  • Quantitation Type
  • Scale
  • Unit
  • Origin
  • An optional link can be created between a
    Property and a LabelledExtract instance.
  • The Property element is directly compatible
    with the corresponding entities in the MAGE
    Object Model (MAGE-OM).
  • Sufficient detail for the data to be describable
    by MAGE-OM is required.
  • Data can be subsequently uploaded to public
    repositories without further annotation.
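As a rough sketch of the Measurement / Property relationship, the Python below treats a Property as one column of values (one per Feature) that must carry the four pieces of meta-data listed above. The class layout, the field names and the example values such as "signal" and "linear" are assumptions for illustration; they are not maxdLoad2's schema or MGED Ontology terms.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Property:
    name: str
    quantitation_type: str
    scale: str
    unit: str
    origin: str
    values: List[float]                      # one value per Feature
    labelled_extract: Optional[str] = None   # optional link

    def __post_init__(self) -> None:
        # All four descriptive fields are required.
        for label in ("quantitation_type", "scale", "unit", "origin"):
            if not getattr(self, label):
                raise ValueError(f"Property {self.name!r} is missing {label}")


@dataclass
class Measurement:
    name: str
    feature_count: int
    properties: Dict[str, Property] = field(default_factory=dict)

    def add_property(self, prop: Property) -> None:
        """Extra Properties can be added to a Measurement at any time."""
        if len(prop.values) != self.feature_count:
            raise ValueError(
                f"{prop.name!r} has {len(prop.values)} values, "
                f"expected {self.feature_count}")
        self.properties[prop.name] = prop


scan = Measurement("hybridisation 1 scan", feature_count=3)
scan.add_property(Property("Cy5 median", "signal", "linear", "counts",
                           "scanner output", [523.0, 88.0, 1450.0],
                           labelled_extract="shocked Cy5"))
# A column produced later by normalisation is just another Property.
scan.add_property(Property("Cy5 normalised", "signal", "log2", "none",
                           "normalisation", [9.03, 6.46, 10.5]))
print(sorted(scan.properties))
```

Rejecting a Property that lacks its descriptive fields at load time is one way to make sure the data can later be exported without further annotation.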

57
Customising the database
  • Involves changing the attribute description.
  • Each table has fully customisable attributes.
  • The description of the attributes is stored in
    the database.
  • The descriptions can be referenced via URLs to
    facilitate sharing.

58
Attribute descriptions
<Group name="Feature">
  <String name="Feature Group"/>
  <Group name="Location">
    <Integer name="Plate Row" completion="OPTIONAL" comment="Which plate on the array" />
    <Integer name="Plate Column" completion="OPTIONAL" comment="Which plate on the array" />
    <Integer name="Row" completion="OPTIONAL" comment="Which row position on the plate" />
    <Integer name="Column" completion="OPTIONAL" comment="Which column position on the plate" />
  </Group>
  <ListRef name="Shape" url="file://xml/MGEDOntology.xml#MGEDOntology.ArrayDesign.FeatureShape" />
  <Group name="Absolute Position">
    <GroupRef name="X" url="file://xml/MAGE_Fundamental_Types.xml#MAGE.Measurement.Distance" />
    <GroupRef name="Y" url="file://xml/MAGE_Fundamental_Types.xml#MAGE.Measurement.Distance" />
  </Group>
  ...
59
Attribute descriptions continued..
  • Attributes can be tagged as OPTIONAL and
    REQUIRED.
  • A default value can be provided.
  • Integer and double attributes are type checked,
    and illegal values are indicated.
  • Attribute descriptions can refer to external
    elements using an HTTP URL.
  • This makes it easier to access predefined
    standard attribute definitions (e.g., MGED
    ontology terms).
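The checking behaviour described above (OPTIONAL versus REQUIRED, default values, type checking of integers and doubles) can be sketched as follows. This is an illustration only: the AttributeSpec class and the check function are hypothetical, and the real checks happen inside the maxdLoad2 forms.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AttributeSpec:
    name: str
    type: str                       # "String", "Integer" or "Double"
    completion: str = "OPTIONAL"    # or "REQUIRED"
    default: Optional[str] = None


def check(specs: List[AttributeSpec], values: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the values are legal."""
    problems = []
    for spec in specs:
        raw = values.get(spec.name, spec.default)
        if raw is None or raw == "":
            if spec.completion == "REQUIRED":
                problems.append(f"{spec.name}: required value is missing")
            continue
        try:
            if spec.type == "Integer":
                int(raw)
            elif spec.type == "Double":
                float(raw)
        except ValueError:
            problems.append(f"{spec.name}: {raw!r} is not a valid {spec.type}")
    return problems


specs = [
    AttributeSpec("Sex", "String", completion="REQUIRED"),
    AttributeSpec("Specimen Weight", "Integer"),
    AttributeSpec("Plate Row", "Integer", default="1"),
]

print(check(specs, {"Specimen Weight": "12.5"}))
print(check(specs, {"Sex": "female", "Specimen Weight": "12"}))
```

The first call reports a missing required value and an illegal integer; the second passes, with Plate Row falling back to its default.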

60
Adding and removing attributes
  • Attributes are easily inserted by specifying
    where they should appear relative to an existing
    attribute
  • Unwanted attributes can be removed from the
    definition using the RemoveAttribute element as
    follows

<AddAttribute parent="Source" position="after" target="Sex">
  <Integer name="Specimen Weight" comment="weight in grams"/>
</AddAttribute>
<RemoveAttribute name="Source.Specimen Weight" />
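To make the effect of the AddAttribute and RemoveAttribute elements concrete, here is a small Python sketch that applies the same two operations to an attribute definition held as an XML tree. It only mimics the behaviour described on the slide using the standard library; it is not maxdLoad2 code, and the "Organism" and "Age" attributes in the starting definition are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical starting definition; only "Sex" comes from the slide.
DEFINITION = """
<Group name="Source">
  <String name="Organism"/>
  <String name="Sex"/>
  <String name="Age"/>
</Group>
"""


def find_group(root, name):
    """Return the Group element called `name` (the root itself may match)."""
    for group in root.iter("Group"):
        if group.get("name") == name:
            return group
    raise KeyError(name)


def add_attribute(root, parent, position, target, new_element):
    """Insert new_element before or after the attribute named `target`."""
    group = find_group(root, parent)
    names = [child.get("name") for child in group]
    index = names.index(target)
    if position == "after":
        index += 1
    group.insert(index, new_element)


def remove_attribute(root, dotted_name):
    """Remove an attribute addressed as 'Parent.Attribute Name'."""
    parent, attribute = dotted_name.split(".", 1)
    group = find_group(root, parent)
    for child in list(group):
        if child.get("name") == attribute:
            group.remove(child)


root = ET.fromstring(DEFINITION)
weight = ET.Element("Integer",
                    {"name": "Specimen Weight", "comment": "weight in grams"})
add_attribute(root, "Source", "after", "Sex", weight)
names_after_add = [child.get("name") for child in root]
remove_attribute(root, "Source.Specimen Weight")
names_after_remove = [child.get("name") for child in root]

print(names_after_add)
print(names_after_remove)
```

After the add, "Specimen Weight" sits directly after "Sex"; after the remove, the definition is back to its starting attributes.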
61
MAGE-ML Export
  • Datasets in the database can be exported into
    MAGE-ML to ease submission into the ArrayExpress
    database.
  • For each table there is an associated output
    template.
  • There is a set of standard attribute files
    adhering to the MIAME attribute definitions.
  • The file is modified when new attributes are
    added to an element, so they are included in the
    MAGE-ML outputs.

62
Output templates
  • A portion of an output template used to generate
    the MAGE-ML for describing an Image instance
  • It instructs maxdLoad2 to generate a
    BioMaterialMeasurement for each of the
    LabelledExtract instances that are linked to
    the Image.
  • The variables will be replaced with values
    extracted from the database as the output file is
    created.

Raw output
Control flow
Variable
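The three ingredients of an output template (raw output, control flow and variables) can be illustrated with Python's standard string.Template. This is a sketch of the idea only: the template syntax is Python's, not maxdLoad2's, and the element and attribute layout below is simplified for the example and should not be read as valid MAGE-ML.

```python
from string import Template
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# Raw output with $variables; quoteattr supplies the surrounding quotes.
IMAGE_TEMPLATE = Template("<Image name=$name>\n$measurements</Image>\n")
MEASUREMENT_TEMPLATE = Template(
    "  <BioMaterialMeasurement labelledExtract=$extract/>\n")


def render_image(image_name, extract_names):
    """Emit one BioMaterialMeasurement per linked LabelledExtract."""
    # Control flow: repeat the inner template for each linked instance.
    measurements = "".join(
        MEASUREMENT_TEMPLATE.substitute(extract=quoteattr(name))
        for name in extract_names
    )
    # Variables: replaced with values that would come from the database.
    return IMAGE_TEMPLATE.substitute(name=quoteattr(image_name),
                                     measurements=measurements)


xml_text = render_image("scan 1", ["control Cy3", "shocked Cy5"])
print(xml_text)

# The result should at least be well-formed XML.
parsed = ET.fromstring(xml_text)
print(len(parsed))
```

Adding a new attribute to the export then amounts to adding one more variable to the relevant template, which is why extending a template is usually a copy-and-adapt job.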
63
Output templates
  • Most users will never see (or want to see) the
    MAGE-ML output templates.
  • They only need to be manipulated when new
    attributes have been added and these new
    attributes are required to appear in the MAGE-ML
    that maxdLoad2 creates.
  • In most cases, extending the output templates
    will be a simple cut-and-paste operation.
  • Completely different output templates could
    potentially be defined for exporting data in
    other formats (e.g., for import into a LIMS, or
    for automated web-page generation).

64
Availability
  • maxdLoad2 is an open-source product, released
    under the Perl Artistic Licence.
  • The latest version is available at
  • http://www.bioinf.man.ac.uk/microarray/maxd/
  • To join the mailing list for further
    announcements send an email to
  • ecartis_at_cs.man.ac.uk

65
Developers
  • maxd development is currently funded by NERC as
    part of a large UK-wide project themed on
    environmental genomics. http://envgen.nox.ac.uk/
  • Microarray Bioinformatics Group at The University
    of Manchester
  • Dave Hancock
  • Norman Morrison
  • Prof. Andy Brass

66
File parser settings
  • These settings describe how the file should be
    converted into a row/column matrix
  • Text Encoding (most files are US-ASCII, but some
    are not)
  • Delimiter (how to split lines into columns)
  • Ignore first, ignore last (skip header and footer
    lines)
  • Ignore until, ignore after (regular expressions
    identifying start and end lines)
  • Ignore beginning (to detect comment lines)
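A minimal Python sketch of how these parser settings could turn a file into a row/column matrix is given below. The parameter names follow the setting names on the slide, but the exact semantics are assumptions: here "ignore until" drops everything up to and including the first matching line, "ignore after" drops the first matching line and everything after it, and the sample file contents are invented.

```python
import re
from typing import List, Optional


def parse_rows(text: str,
               delimiter: str = "\t",
               ignore_first: int = 0,
               ignore_last: int = 0,
               ignore_until: Optional[str] = None,
               ignore_after: Optional[str] = None,
               ignore_beginning: Optional[str] = None) -> List[List[str]]:
    """Convert file text into a list of rows, each a list of column values."""
    lines = text.splitlines()
    # Ignore first / ignore last: fixed-size header and footer.
    lines = lines[ignore_first:len(lines) - ignore_last]
    # Ignore until: skip to just past the first line matching the regex.
    if ignore_until is not None:
        for i, line in enumerate(lines):
            if re.search(ignore_until, line):
                lines = lines[i + 1:]
                break
    # Ignore after: cut at the first line matching the regex.
    if ignore_after is not None:
        for i, line in enumerate(lines):
            if re.search(ignore_after, line):
                lines = lines[:i]
                break
    rows = []
    for line in lines:
        if not line.strip():
            continue
        # Ignore beginning: lines starting with this prefix are comments.
        if ignore_beginning and line.startswith(ignore_beginning):
            continue
        rows.append(line.split(delimiter))
    return rows


RAW = (b"scanner v1\nBegin Data\n# comment\n"
       b"1\tYAL001C\t523\n2\tYAL002W\t88\nEnd Data\nchecksum 42\n")

# Text Encoding: decode the bytes before parsing.
rows = parse_rows(RAW.decode("us-ascii"),
                  ignore_first=1, ignore_last=1,
                  ignore_until=r"^Begin Data", ignore_after=r"^End Data",
                  ignore_beginning="#")
print(rows)
```

The settings overlap on purpose in this example; in practice a file usually needs only one way of skipping its header and footer.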

67
Data value settings
  • Column Specification settings that describe how
    the data values are extracted from the row/column
    matrix
  • Default and unwanted values can be specified (on
    a per-column basis)
  • Values can be formed by combining multiple
    columns
  • Regular Expressions can be used to modify the
    data format (e.g., changing 11/31/02 to
    31-11-02)
  • Values can be translated (substituting one
    value for another)
  • Values can be converted to upper- or lower-case
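The data value settings can be sketched as a single function applied to one row of the matrix. This is an illustration under assumptions, not maxdLoad2's column specification syntax: the function name, its parameters, the order in which the steps are applied and the example row are all hypothetical. The date pattern reproduces the 11/31/02 to 31-11-02 example from the slide.

```python
import re
from typing import Dict, List, Optional, Sequence


def extract_value(row: List[str],
                  columns: Sequence[int],
                  joiner: str = "_",
                  default: Optional[str] = None,
                  unwanted: Sequence[str] = (),
                  pattern: Optional[str] = None,
                  replacement: str = "",
                  translate: Optional[Dict[str, str]] = None,
                  case: Optional[str] = None) -> Optional[str]:
    """Build one value from a row; columns are numbered from 1."""
    # Combine one or more columns into a single value.
    value = joiner.join(row[c - 1] for c in columns)
    # Unwanted or empty values fall back to the default.
    if value == "" or value in unwanted:
        return default
    # Regular expression rewrite of the data format.
    if pattern is not None:
        value = re.sub(pattern, replacement, value)
    # Translation: substitute one value for another.
    if translate:
        value = translate.get(value, value)
    # Case conversion.
    if case == "upper":
        value = value.upper()
    elif case == "lower":
        value = value.lower()
    return value


row = ["11/31/02", "cy5", "N/A", "yal001c"]

print(extract_value(row, [1], pattern=r"(\d\d)/(\d\d)/(\d\d)",
                    replacement=r"\2-\1-\3"))
print(extract_value(row, [2], translate={"cy5": "Cyanine 5"}))
print(extract_value(row, [3], unwanted=["N/A"], default="unknown"))
print(extract_value(row, [4, 2], case="upper"))
```

Each column of the target form gets its own settings, so a single source row can feed several differently cleaned values.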

68
Presets
  • Once a set of column specifications and parser
    options have been determined, they can be saved
    as a Preset.
  • These settings can then be easily recalled next
    time a file with the same format is encountered.
  • Presets are stored as plain-text files which
    can be shared between users.

69
Hands-on session
70
Questionnaires
  • http://www2.cs.man.ac.uk/nashara/questionnaire/

71
Background Information
  • Saccharomyces cerevisiae is an emerging
    opportunistic pathogen in immunosuppressed and
    immunocompromised patients and has been
    associated with fungemia, endocarditis,
    peritonitis, meningitis, ventriculitis, and with
    polymicrobial fatal pneumonia in AIDS patients.

72
What are we investigating?
  • We are investigating the action of an experimental
    drug, drugX, purified from the Mabuti tree
  • The drug has been shown to significantly inhibit
    the growth of S. cerevisiae in minimal media
  • The aim is to determine drugX's mechanism of action by
    microarray analysis.

74
Create a Source
  • Source: original organism, tissue sample, etc.
  • In this experiment, yeast is the source

75
Creating a Sample
  • Choose the source
  • Choose the sampling protocol
  • Fill in various fields about how the sample was
    produced

76
Create TreatedSamples
  • How many TreatedSamples are there?
  • A TreatedSample is something that has had
    something done to it

77
TreatedSamples
78
TreatedSamples
  • Provide a name
  • Select the Sample
  • Select the TreatmentProtocol
  • Describe the TreatedSample

79
Creating LabelledExtracts
  • Provide a name
  • Select Extract and LabellingProtocol
  • Enter
  • Material type
  • Labelling compound
  • Quantity
  • Do for each LabelledExtract

80
Create Array
  • Name
  • ArrayType
  • Repeat for the number of arrays in experiment (in
    this case there are 2 for one experiment as there
    is a dye-swap)

81
Create Hybridisation
  • Provide a Name
  • Select
  • HybridisationProtocol
  • LabelledExtracts for hybridisation
  • Array used for hybridisation
  • Fill in any other fields

82
Create Image entry
83
Create Experiment entry
  • Create a new submitter and fill in required
    details
  • Select the Measurements in the experiment
  • Describe the experiment by selecting the relevant
    design type, experimental factors, etc.