Title: Comparing Two Approaches of WDK Federation
1Comparing Two Approaches of WDK Federation
2Background
We have three Apicomplexan organisms Plamso,
Crypto and Tox Question how can we provide an
uniform access way to end user
3ApiDB
- GUS(Genomics Unified Schema)all of them share
the same database schema, accommodating
information emerging from DNA sequencing and
functional and comparative genomics projects - WDKthe front-end GUI for end user based on MVC
- ApiDBWDK website based upon database federation
on the same schema(GUS) (testing site)
4Five Strategies for Federation
- DB Link
- Small centralized ApiDB with DB Links to
component DBs - Web Services
- No fixed centralized ApiDB with minimal
centralized caching - JDBC
- Analogous to WS except accessing component dbs
through port 1521 - Central ApiDB Data Warehouse
- Automatic updates via streams
- Manual updates via import/export or data pump
scripts - Portlets
- Simultaneous and convenient access to ApiDB and
the component DBs
5Similarities between DB Link, WS and JDBC
Approaches
- Maintain much of the data only in the component
DBs - Require no major modifications to WDK
- Primary key definition needs to be modified in
WDK
6Differences between DB Link, WS and JDBC
Approaches
- DB Link
- need to copy coreprojectInfo table and CLOB data
tables to ApiDB central site - synchronization problem (difficult to keep the
central site data up-to-date) - Web Service
- no centralized data (except for cache PK tables)
- access DBs through WS
- CLOB problem
- JDBC
- no centralized data (except for cache PK tables)
- access DBs through JDBC (port 1521)
- security problem (port 1521 open to certain IP
addresses)
7Portal As Future Work
- A useful approach when user login is enabled
- Apache JetSpeed2 currently evolving, will pursue
the portal approach later - Can include portlets for ApiDB and each component
DB respectively - The ApiDB portlet can itself be implemented with
other federation strategies -
for more information see Portals.ppt
8Central Warehouse Lower Priority
- Component DBs constantly growing, difficult to
centralize all the data - Schemas must be used uniformly
- Keeping the warehouse up-to-date is an issue
- Not an option until weve explored all the other
strategies
9First Two Candidate Approaches
- DB Link
- Allows Oracle to handle data distribution,
therefore all sorts of operations can be
performed on the component data - Likely to perform well between Oracle DBs
- Web Services
- DB Link is an Oracle only solution otherwise, to
connect to non-Oracle DBs, we need Transparent
Gateway, so we want more platform-independent
solution
10Pros and Cons
11DB Link Strategy
Tomcat
1
6
Changes to WDK
jspwrap
Model
Implementation
Show model.xml
SqlUtils
2
5
naimp
naimp
naimp
aaimp
aaimp
aaimp
Redux, UGA
3
cryptodots.nasequenceimp
3
3
4
4
4
Distributed Query Using DB link
Currently an extra table (app.srt) needs to be
streamed up from CryptoDB On redux for initial
testing
12WS Strategy
Tomcat
wdk.model.jspwrap
2
5
wdk.model Question ? Query ?QueryInstance ?
ResultFactory
wdk.model.implementation Oracle SqlUtils
WsUtils
WSDL
1
4
4
1
4
1
JDBC
3
3
SOAP
13Federation Screenshot
14Realistic Model
- Four common questions between Crypto and Plasmo
- Finding genes by gene ID
- Finding genes by Annotated Keyword
- Finding genes by pfam domain
- Finding genes by Exon Count
- Deal with differences between Crypto model and
Plasmo model
15Cryptos Organism--sqlQuery
- lt!-- VocabQueries.Organisms --gt
- ltsqlQuery name"Organisms"gt
- ltcolumn name"internal"/gt
- ltcolumn name"term"/gt
- ltcolumn name"idx"/gt
- ltsqlgt
- SELECT FROM (
- SELECT 'C. hominis' as term, 3 as idx,
- '''' 'Cryptosporidium hominis' '''' as
internal - FROM dual
- UNION
- SELECT 'C. parvum' as term, 2 as idx,
- '''' 'Cryptosporidium parvum' '''' as
internal - FROM dual
- UNION
- SELECT 'C. parvum and C. hominis' as
term, 1 as idx, - '''' 'Cryptosporidium hominis' ''',
- ''' 'Cryptosporidium parvum' ''''
as internal - FROM dual
16Plasmos Organism--sqlQuery
- ltsqlQuery name"Organism"gt
- ltcolumn name"internal"/gt
- ltcolumn name"term"/gt
- ltcolumn name"idx"/gt
- ltsqlgt
- SELECT name AS term, taxon_id AS
internal, rownum AS idx - FROM sres.TaxonName
- WHERE name_class'scientific name'
- AND taxon_id in (SELECT taxon_id
FROM dots.ExternalNaSequence) - UNION
- SELECT 'any' as term, -1 as internal,
- count() 1 as idx
- FROM sres.TaxonName
- WHERE name_class'scientific name'
- AND taxon_id in (SELECT taxon_id
FROM dots.ExternalNaSequence) - lt/sqlgt
- lt/sqlQuerygt
17 18Product Attribute Query
- (select u.product as product
- from (
- SELECT r.source_id,
- r.product
- FROM dots.rnatype_at_CRYPTOA_DBLINK_at_ r
- UNION
- SELECT t.source_id,
- t.product
- FROM dots.transcript_at_CRYPTOA_DBLIN
K_at_ t - ) u, dots.genefeature_at_CRYPTOA_DBLI
NK_at_ g - WHERE g.source_id u.source_id()
- AND g.source_id 'primaryKey'
- AND g.row_project_id projectId)
- UNION
- (SELECT gf.product
- FROM dots.GeneFeature_at_PLASMO_DBLIN
K_at_ gf - WHERE gf.source_id'primaryKey
' - AND gf.row_project_id
projectId)
19Organism Attribute Query
- (SELECT
- SUBSTR(tn.name, 1, 1) '.'
SUBSTR(tn.name, INSTR(tn.name, ' ', 1, 1) 1) as
organism - FROM dots.genefeature_at_CRYPTOA_DBLINK_at_
g, - dots.externalnasequence_at_CRYPTOA_DBLINK
_at_ enas, - dots.nalocation_at_CRYPTOA_DBLINK_at_ nal,
- dots.source_at_CRYPTOA_DBLINK_at_ src,
sres.taxonname_at_CRYPTOA_DBLINK_at_ tn - WHERE g.na_sequence_id
enas.na_sequence_id - AND enas.na_sequence_id src.na_sequence_id
- AND nal.na_feature_id g.na_feature_id
- AND enas.taxon_id tn.taxon_id
- AND g.source_id 'primaryKey'
- AND g.row_project_id projectId)
- UNION
- ( SELECT SUBSTR(tn.name, 1, 1) '.
' SUBSTR(tn.name, INSTR(tn.name, ' ', 1, 1)
1) - as organism
- FROM dots.GeneFeature_at_PLASMO_DBLINK
_at_ gf, - dots.NaSequence_at_PLASMO_DBLINK_at_
s, - sres.TaxonName_at_PLASMO_DBLINK_at_
tn - WHERE gf.na_sequence_id
s.na_sequence_id
20Sequence Attribute Query
- (SELECT sequence
- FROM DoTS.GeneFeature_at_CRYPTOA_DB
LINK_at_ g, - _at_CRYPTOA_SCHEMA_at_SRT s
- WHERE g.source_id
'primaryKey' - AND g.source_id s.sequence_id()
- AND g.row_project_id
projectId) - UNION all
- (SELECT ens.sequence
- FROM dots.GeneFeature_at_PLASMO_DBLIN
K_at_ gf, - _at_PLASMO_SCHEMA_at_ExternalNaSequence
ens - WHERE gf.source_id'primaryKey
' - AND gf.na_sequence_id
ens.na_sequence_id - AND gf.row_project_id
projectId)