Title: Data integration via XML
1 Data integration via XML
- Ela Hunt
- John Wilson
- Vangelis Pafilis
- Inga Tulloch
http://xtect.cis.strath.ac.uk/
2 Overview
- Four biological scenarios of data integration
- Data integration - problem definition
- XTECT indexing approach
- Literature review
- Current status and further work
3 Scenario 1: Cardiovascular Functional Genomics
- AIM: discover genes causing hypertension
- Rat animal models of hypertension (rat strains which suffer from stroke)
- Microarrays are used to compare gene expression in sick and healthy rats; typically 100-400 genes are differentially expressed
- Microarray results are visualised on maps and data are interpreted using public web databases (browsing and querying)
4 SyntenyVista
5 Scenario 2: Mouse mammary gland development as a model of cancer proliferation
- AIM: find genes active in cancer growth
- Take mouse samples and apply to a microarray slide
- Measure trends in gene expression, identify 400 genes of interest
- Use public web databases to interpret information on 400 genes (interpreting 100 genes took 6 months, now the information is out of date)
6 Scenario 3: Rat model of schizophrenia
- AIM: understand which genes are expressed during schizophrenia
- Rats have symptoms of schizophrenia after a chemical treatment (2 models are used)
- Measure gene expression in two models
- Interpret data on 250 genes: find if microarray probes correspond to genes by using BLAST (DNA sequence comparison) and PubMed (bibliographic database)
- Gather DNA sequences for real genes from Ensembl (BLAST hits), design probes
7 Scenario 4: Proteomics
- AIM: understand and record protein functions
- Case 1: study the proteome of Trypanosoma brucei. For all proteins identified, find information on the web which might shed light on their function
- Case 2: interpret data on human proteins differentially expressed in human cells invaded by Toxoplasma gondii
- Compare protein and gene expression
- Use SwissProt, PubMed, GeneOntology and any other web resources
8 Problem definition
- Given a large microarray or proteomics experiment (a list of gene names or peptide masses)
- Find all known information about those genes or proteins on the web
- Make this information accessible
9 What we expect to achieve
Query: table of names
Result 1: table of integrated information
Result 2: map of probes and synteny
Result 3: clusters based on the number of relevant query terms found
10
- Use item matching - XML leaves - to start
- Match starting from leaves and extend towards the schemas expressed as paths
- Use database techniques: indexing
- Use data mining techniques: get statistics on data
11 More detail
- Index all paths and leaves in XML trees for a representative set of biological databases
- Relational technology
- Warehouse
- Match leaves (data values)
- Find path overlaps => remove redundancies in data
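The steps on this slide can be sketched in a few lines. This is an illustrative sketch only, not the XTECT implementation: SQLite stands in for the relational warehouse, and the table name `leaf`, the helper names, and the two sample records are all assumptions. It stores every (database, path, leaf value) triple and then finds paths from different databases that hold the same leaf value.

```python
import sqlite3
import xml.etree.ElementTree as ET


def leaf_rows(db_name, xml_text):
    """Yield (db, path, value) for every element with text and no children."""
    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        children = list(elem)
        text = (elem.text or "").strip()
        if not children and text:
            yield (db_name, path, text)
        for child in children:
            yield from walk(child, path)

    yield from walk(ET.fromstring(xml_text), "")


def build_warehouse(sources):
    """Load all leaves into one indexed table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leaf (db TEXT, path TEXT, value TEXT)")
    conn.execute("CREATE INDEX leaf_value ON leaf (value)")
    for db_name, xml_text in sources.items():
        conn.executemany("INSERT INTO leaf VALUES (?, ?, ?)",
                         leaf_rows(db_name, xml_text))
    return conn


def shared_leaves(conn):
    """Pairs of paths from different databases that hold the same leaf value."""
    return conn.execute(
        """
        SELECT a.db, a.path, b.db, b.path, a.value
        FROM leaf AS a JOIN leaf AS b
          ON a.value = b.value AND a.db < b.db
        ORDER BY a.db, a.path, b.db, b.path
        """
    ).fetchall()


# Made-up sample records; real database exports are far larger.
SOURCES = {
    "Db1": "<entry><gene>socs3</gene><organism>rat</organism></entry>",
    "Db2": "<record><symbol>socs3</symbol><chromosome>10</chromosome></record>",
}
conn = build_warehouse(SOURCES)
matches = shared_leaves(conn)
print(matches)
```

Here the only shared leaf is `socs3`, so the join reports `/entry/gene` in Db1 and `/record/symbol` in Db2 as candidate overlapping paths.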
12 First problem solved: query expansion
- 30K human, 30K rat, and 30K mouse genes, some of them have synonyms
- Query expansion to include the synonyms
- Prototype in Java, 300 ms for synonym lookup
- Same idea as in GeneCards which focuses on human data
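A minimal sketch of synonym-based query expansion follows. It is written in Python for brevity rather than as a reproduction of the Java prototype, and the synonym groups, gene names, and function names are made up for illustration. Each name maps to its whole synonym group, so a lookup is a single dictionary access.

```python
def build_lookup(groups):
    """Map every (lower-cased) name to the full set of its synonyms."""
    lookup = {}
    for group in groups:
        names = frozenset(name.lower() for name in group)
        for name in names:
            lookup[name] = names
    return lookup


def expand_query(names, lookup):
    """Return the query names plus all known synonyms, without duplicates."""
    expanded, seen = [], set()
    for name in names:
        key = name.lower()
        # Unknown names are kept as they are, so nothing is lost.
        for candidate in sorted(lookup.get(key, {key})):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Hypothetical synonym groups; a real table would cover about 90K genes.
SYNONYM_GROUPS = [
    ["agene1", "ag1", "a-gene-1"],
    ["agene2", "ag2"],
]
lookup = build_lookup(SYNONYM_GROUPS)
expanded = expand_query(["AGENE1", "unknown9"], lookup)
print(expanded)  # ['a-gene-1', 'ag1', 'agene1', 'unknown9']
```

The sketch assumes each name belongs to one group; if a name appeared in two groups, the later group would overwrite the earlier one.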
13 Second: indexing XML
- Medline (40 GB) in XML (bibliographic)
- SwissProt and TrEMBL, 1 GB in XML (proteins)
- OMIM and HUGO: databases of genes, small (human diseases and human genes)
- Affymetrix microarray files for the mouse, small, XML
- Ensembl: no XML files, access via MySQL (human, mouse, rat genomes and predicted genes)
- Mouse Genome MGD: direct access to Sybase, no XML
- Rat database RGD: stores little data!
- Gene Ontology: around 1 GB in XML
14
- Paths and tags indexed using integer encoding, preserving XML order
- Indexing of Medline and OMIM needs to be resolved (text XML)
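One possible reading of integer encoding that preserves XML order is sketched below. The scheme is an assumption, since the slides do not specify it: each distinct tag and each distinct root-to-node path gets an integer in first-seen order, and each node gets its preorder position, which follows document order. The function name and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode(xml_text):
    """Return (tag_ids, path_ids, nodes) for one XML document."""
    tag_ids = {}    # tag name -> integer
    path_ids = {}   # tuple of tag integers -> integer
    nodes = []      # (document-order position, path id, leaf text or None)

    def walk(elem, parent_path):
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        path = parent_path + (tag_id,)
        path_id = path_ids.setdefault(path, len(path_ids) + 1)
        text = (elem.text or "").strip() or None
        # Preorder numbering keeps the original document order.
        nodes.append((len(nodes) + 1, path_id, text))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return tag_ids, path_ids, nodes


# Made-up record for illustration.
SAMPLE = ("<entry><name>agene1</name><ref><id>12345</id></ref>"
          "<name>agene2</name></entry>")
tag_ids, path_ids, nodes = encode(SAMPLE)
print(tag_ids)   # {'entry': 1, 'name': 2, 'ref': 3, 'id': 4}
print(nodes[-1])  # (5, 2, 'agene2')
```

Both `name` elements share path id 2, while their positions 2 and 5 record where each one occurs in the document.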
15 How the index will work
[Diagram: a PubMed record with accession 12345 and abstract ".. interactions of agene1 with agene2 ..."; a Swiss-Prot record with PubMedID 12345 and GeneName agene1]
Matched paths: Swiss-Prot/PubMedID and PubMed/accession; Swiss-Prot/GeneName and PubMed/abstract
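The example on this slide can be reproduced with a small inverted index from data values to the paths that contain them. This is a sketch under assumptions: records are given as plain dictionaries rather than XML, text fields are split into word tokens, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# The two records from the slide, written as field -> text.
RECORDS = {
    "PubMed": {
        "accession": "12345",
        "abstract": ".. interactions of agene1 with agene2 ...",
    },
    "Swiss-Prot": {
        "PubMedID": "12345",
        "GeneName": "agene1",
    },
}


def build_index(records):
    """Map each token to the set of database/field paths where it occurs."""
    index = defaultdict(set)
    for db, fields in records.items():
        for field, text in fields.items():
            for token in re.findall(r"\w+", text.lower()):
                index[token].add(f"{db}/{field}")
    return index


def linked_paths(index):
    """Keep only tokens that occur in more than one database."""
    return {
        token: sorted(paths)
        for token, paths in index.items()
        if len({path.split("/")[0] for path in paths}) > 1
    }


links = linked_paths(build_index(RECORDS))
print(links)
```

Only `12345` and `agene1` occur in both databases, which yields the two path pairs shown on the slide: Swiss-Prot/PubMedID with PubMed/accession, and Swiss-Prot/GeneName with PubMed/abstract.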
16 Matching
- Db1/path1/socs3 and Db2/path2/socs3 => synonymous paths
- Get statistics for full and partial path matches and postulate schema matches
- Manually inspect the matched paths, and examine support for each path match
- Automate the procedure
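One way to put a number on the support for a path match is sketched below. The measure is an assumption, not taken from the slides: the count of shared leaf values divided by the size of the smaller value set. Only value overlap is covered, not partial path matching, and the sample values are made up.

```python
from itertools import combinations


def path_match_support(path_values):
    """Rank cross-database path pairs by shared leaf values.

    path_values maps "Db/path" to the set of leaf values found under it.
    Returns (path_a, path_b, shared_count, support), best match first.
    """
    results = []
    for a, b in combinations(sorted(path_values), 2):
        if a.split("/")[0] == b.split("/")[0]:
            continue  # only compare paths from different databases
        shared = path_values[a] & path_values[b]
        if shared:
            smaller = min(len(path_values[a]), len(path_values[b]))
            results.append((a, b, len(shared), len(shared) / smaller))
    results.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return results


# Made-up leaf values for three paths.
PATH_VALUES = {
    "Db1/path1": {"socs3", "agene1", "agene2"},
    "Db2/path2": {"socs3", "agene1", "agene3", "agene4"},
    "Db2/path3": {"12345", "socs3"},
}
ranked = path_match_support(PATH_VALUES)
for row in ranked:
    print(row)
```

With these values, Db1/path1 and Db2/path2 share two of three values and rank first; such a ranking could guide the manual inspection step before the procedure is automated.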
17 Architecture
[Diagram labels: Microarray experiment; Proteomics experiment; Visualisation; INTERACTION; List of names; Synonym expander; XML tree merger; PROCESSING LAYER; XML tree finder; INDEX; WAREHOUSE; Gene trees XML; Mapping generation and lookup]
18 Status
- Mirroring external XML data
- Query expansion is implemented
- Software to XMLise OMIM and some of the MGD
- Testing indexing software for loading into Oracle
- Designing an algorithm for data mining
- Developing ideas on adding sequence comparison
and text retrieval, and connecting to
visualisation tools (collaboration with e-Science
project BRIDGES)
19 THE VISION
To tabular summaries
To multiple alignment
To sequence
20 Other work
- Schema-based approaches: look at the schemas to find mappings between them
  - use constraints, tree shape, some data
  - involve the user/programmer: YATL, Clio, REVERE
- Data-based approaches: look at data values in order to find mappings between attributes
  - ML approaches are inefficient, all-against-all
- Problems
  - Expensive in terms of labour (programmer or user)
  - Only very similar schemas can be matched
  - Not scalable
21 Recent papers
- Kurgan et al., 2002, machine learning for schema matching (2 very similar schemas)
- Doan et al., VLDBJ03, machine learning, 2 semi-structured schemas (ontologies), schemas and some data
- Chua et al., VLDBJ03, (RDBMS) given entity matches (table names), match attributes (values), based on a variety of statistical tests
- Halevy et al., CIDR-2003, user-driven schema matching by example, and mapping by transitivity (no algorithm has been given)
22 Summary
- Aim: to overcome the problems associated with manual or schema-based mapping approaches, which are expensive
- Scale up, take into account data values
- Provide a digest of information for a list of gene/protein names of interest
- Using XML and relational indexes
23 Collaborators at Glasgow
Barry Gusterson
Andy Jones, Torsten Stein, Inga Tulloch, Catherine Winchester, Anna F. Dominiczak, Neil Hanlon, BRIDGES project (uses DB2)
Vangelis Pafilis
FUNDING: Carnegie Trust for the Universities of Scotland, Medical Research Council (UK), Royal Society, Synergy
John Wilson