Data integration via XML - PowerPoint PPT Presentation

About This Presentation

Title:

Data integration via XML

Description:

Ela Hunt. John Wilson. Vangelis Pafilis. Inga Tulloch. http://xtect.cis.strath.ac.uk ... Hunt, Wilson, Pafilis and Tulloch, Glasgow ... – PowerPoint PPT presentation

Slides: 24

Provided by: ela50

Transcript and Presenter's Notes

Title: Data integration via XML

1
Data integration via XML

Ela Hunt
John Wilson
Vangelis Pafilis
Inga Tulloch

http://xtect.cis.strath.ac.uk/
2
Overview

Four biological scenarios of data integration
Data integration - problem definition
XTECT indexing approach
Literature review
Current status and further work

3
Scenario 1: Cardiovascular Functional Genomics

AIM: discover genes causing hypertension
Rat animal models of hypertension (rat strains
which suffer from stroke)
Microarrays are used to compare gene expression
in sick and healthy rats; typically 100-400 genes
are differentially expressed
Microarray results are visualised on maps and
data are interpreted using public web databases
(browsing and querying)

4
SyntenyVista
5
Scenario 2: Mouse mammary gland development as a
model of cancer proliferation

AIM: find genes active in cancer growth
Take mouse samples and apply to a microarray
slide
Measure trends in gene expression, identify 400
genes of interest
Use public web databases to interpret information
on 400 genes (interpreting 100 genes took 6
months, and the information is now out of date)

6
Scenario 3: Rat model of schizophrenia

AIM: understand which genes are expressed during
schizophrenia
Rats have symptoms of schizophrenia after a
chemical treatment (2 models are used)
Measure gene expression in two models
Interpret data on 250 genes: find if microarray
probes correspond to genes by using BLAST (DNA
sequence comparison) and PubMed (bibliographic
database)
Gather DNA sequences for real genes from Ensembl
(BLAST hits), design probes

7
Scenario 4: Proteomics

AIM: understand and record protein functions
Case 1: study the proteome of Trypanosoma brucei.
For all proteins identified, find information on
the web which might shed light on their function
Case 2: interpret data on human proteins
differentially expressed in human cells invaded
by Toxoplasma gondii.
Compare protein and gene expression
Use SwissProt, PubMed, GeneOntology and any
other web resources

8
Problem definition

Given a large microarray or proteomics experiment
(a list of gene names or peptide masses)
Find all known information about those genes or
proteins on the web
Make this information accessible

9
What we expect to achieve
Result 1: table of integrated information
Result 2: map of probes and synteny
Query: table of names
Result 3: clusters based on the number of
relevant query terms found
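
One possible reading of Result 3, sketched in Python. This is only an illustration under stated assumptions: the record ids, the record texts, and the function name `cluster_by_hits` are hypothetical, and matching on whitespace-separated words is a simplification rather than the method used in the project.

```python
from collections import defaultdict

def cluster_by_hits(records, query_terms):
    """Group record ids by how many distinct query terms each record contains."""
    terms = {t.lower() for t in query_terms}
    clusters = defaultdict(list)
    for rec_id, text in records.items():
        words = set(text.lower().split())
        hits = len(terms & words)  # number of distinct query terms found
        if hits:
            clusters[hits].append(rec_id)
    return dict(clusters)

# Hypothetical records and gene names, for illustration only.
records = {
    "r1": "interactions of agene1 with agene2",
    "r2": "agene1 expression in rat",
    "r3": "unrelated abstract",
}
clusters = cluster_by_hits(records, ["agene1", "agene2"])
```

Records that mention more of the query names land in a higher-numbered cluster, so they can be shown first.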
10

Use item matching - XML leaves - to start
Match starting from leaves and extend towards the
schemas expressed as paths
Use database techniques - indexing
Use data mining techniques: get statistics on
data

11
More detail

Index all paths and leaves in XML trees for a
representative set of biological databases
Relational technology
Warehouse
Match leaves (data values)
Find path overlaps -> remove redundancies in data
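
A minimal sketch of indexing paths and leaves, assuming small XML documents that fit in memory. The function name `index_leaves`, the sample entry, and its tag names are hypothetical; the slides describe a relational warehouse, which this in-memory dictionary only stands in for.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_leaves(db_name, xml_text):
    """Map each leaf value to the set of root-to-leaf paths where it occurs."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        path = f"{path}/{elem.tag}"
        children = list(elem)
        text = (elem.text or "").strip()
        if not children and text:
            index[text].add(path)  # leaf: record value -> path
        for child in children:
            walk(child, path)

    walk(root, db_name)
    return index

# Hypothetical entry; the tag names are illustrative, not a real schema.
sample = "<entry><GeneName>socs3</GeneName><PubMedID>12345</PubMedID></entry>"
idx = index_leaves("Swiss-Prot", sample)
```

Looking up a data value such as `idx["socs3"]` then returns every path under which that value was seen.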

12
First problem solved: query expansion

30K human, 30K rat, and 30K mouse genes, some of
which have synonyms
Query expansion to include the synonyms
Prototype in Java, 300 ms for synonym lookup
Same idea as in GeneCards which focuses on human
data
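
The query expansion step could look roughly like the sketch below. It is written in Python for brevity, although the prototype is in Java. The gene names and the functions `build_synonym_table` and `expand_query` are hypothetical, and the real synonym source is not specified here.

```python
def build_synonym_table(groups):
    """Map every name (lower-cased) to the full set of names in its group."""
    table = {}
    for group in groups:
        names = {n.lower() for n in group}
        for n in names:
            table.setdefault(n, set()).update(names)
    return table

def expand_query(names, table):
    """Return the query names plus all known synonyms, sorted."""
    expanded = set()
    for name in names:
        key = name.lower()
        expanded |= table.get(key, {key})  # unknown names pass through
    return sorted(expanded)

# Hypothetical synonym groups, for illustration only.
groups = [["agene1", "AG1", "alpha-gene-1"], ["agene2", "AG2"]]
table = build_synonym_table(groups)
expanded = expand_query(["AG1", "unknown1"], table)
```

Names without a known synonym are kept unchanged, so the expanded list never loses a term from the original query.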

13
Second: indexing XML

Medline (40 GB) in XML (bibliographic)
SwissProt and TrEMBL: 1 GB in XML (proteins)
OMIM and HUGO: databases of genes, small (human
diseases and human genes)
Affymetrix microarray files for the mouse: small,
XML
Ensembl: no XML files, access via MySQL (human,
mouse, rat genomes and predicted genes)
Mouse Genome MGD: direct access to Sybase, no
XML
Rat database RGD: stores little data!
Gene Ontology: around 1 GB in XML

Paths and tags indexed using integer encoding,
preserving XML order
Indexing of Medline and OMIM needs to be resolved
(text and XML)
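
One way to read "integer encoding, preserving XML order" is sketched below. The scheme is an assumption, not the XTECT encoding: each distinct tag gets a small integer, each path becomes a tuple of tag ids, and each node gets its pre-order position in the document. The function name and sample tags are hypothetical.

```python
import xml.etree.ElementTree as ET

def encode_document(xml_text):
    """Return (tag_ids, rows); each row is (document_order, path_of_tag_ids, text)."""
    tag_ids = {}
    rows = []
    counter = 0

    def walk(elem, prefix):
        nonlocal counter
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        path = prefix + (tag_id,)
        counter += 1  # pre-order position preserves XML document order
        rows.append((counter, path, (elem.text or "").strip()))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return tag_ids, rows

# Hypothetical citation record, for illustration only.
sample = "<cit><id>12345</id><abs>text</abs><id>67890</id></cit>"
tag_ids, rows = encode_document(sample)
```

Rows of this shape map naturally onto a relational table with an integer order column and a compact path key.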

15
How the index will work
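[Figure: example records from two databases and the paths indexed for them]
PubMed record:
  accession: 12345
  abstract: ... interactions of agene1 with agene2 ...
Swiss-Prot record:
  PubMedID: 12345
  GeneName: agene1
Indexed paths: Swiss-Prot/PubMedID, PubMed/accession,
Swiss-Prot/GeneName, PubMed/abstract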
16
Matching

Db1/path1/socs3 and Db2/path2/socs3 -> synonymous
paths
Get statistics for full and partial path matches
and postulate schema matches
Manually inspect the matched paths, and examine
support for each path match
Automate the procedure
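
The matching idea can be sketched as a count of shared leaf values per pair of paths from different databases. This is a simplified sketch that handles full path matches only. The input layout (value mapped to a set of "Db/path" strings), the sample values, and the function name `path_match_support` are assumptions.

```python
from collections import Counter
from itertools import combinations

def path_match_support(leaf_index):
    """Count, per cross-database path pair, how many leaf values they share."""
    support = Counter()
    for value, paths in leaf_index.items():
        for a, b in combinations(sorted(paths), 2):
            # Only pairs from different databases suggest a schema match.
            if a.split("/", 1)[0] != b.split("/", 1)[0]:
                support[(a, b)] += 1
    return support

# Hypothetical leaf index, for illustration only.
leaf_index = {
    "socs3": {"Db1/path1", "Db2/path2"},
    "agene1": {"Db1/path1", "Db2/path2"},
    "12345": {"Db1/path3", "Db2/path4"},
    "agene2": {"Db1/path1"},
}
support = path_match_support(leaf_index)
best = support.most_common(1)
```

Path pairs with high support are candidates for manual inspection before a schema match is postulated.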

17
Architecture
[Figure: architecture diagram]
Inputs: Microarray experiment, Proteomics experiment
INTERACTION: Visualisation, List of names
PROCESSING LAYER: Synonym expander, XML tree merger,
XML tree finder
INDEX / WAREHOUSE: Gene trees (XML), Mapping
generation and lookup
18
Status

Mirroring external XML data
Query expansion is implemented
Software to XMLise OMIM and some of the MGD
Testing indexing software for loading into Oracle
Designing an algorithm for data mining
Developing ideas on adding sequence comparison
and text retrieval, and connecting to
visualisation tools (collaboration with e-Science
project BRIDGES)

19
THE VISION
To tabular summaries
To multiple alignment
To sequence
20
Other work

Schema-based approaches: look at the schemas to
find mappings between them
use constraints, tree shape, some data
involve the user/programmer: YATL, Clio, REVERE
Data-based approaches: look at data values in
order to find mappings between attributes
ML approaches are inefficient, all-against-all
Problems
Expensive in terms of labour (programmer or user)
Only very similar schemas can be matched
Not scalable

21
Recent papers

Kurgan et al., 2002: machine learning for schema
matching (2 very similar schemas)
Doan et al., VLDBJ03: machine learning, 2
semi-structured schemas (ontologies), schemas
and some data
Chua et al., VLDBJ03: (RDBMS) given entity
matches (table names), match attributes (values),
based on a variety of statistical tests
Halevy et al., CIDR-2003: user-driven schema
matching by example, and mapping by transitivity
(no algorithm has been given)

22
Summary

Aim - to overcome the problems associated with
manual or schema-based mapping approaches which
are expensive
Scale up, take into account data values
Provide a digest of information for a list of
gene/protein names of interest
Using XML and relational indexes

23
Collaborators at Glasgow
Barry Gusterson
Andy Jones, Torsten Stein, Inga Tulloch, Catherine
Winchester, Anna F. Dominiczak, Neil Hanlon, BRIDGES
project (uses DB2)
Vangelis Pafilis
FUNDING: Carnegie Trust for the Universities of
Scotland, Medical Research Council (UK), Royal
Society, Synergy
John Wilson
