Title: Chado%20and%20interoperability
1 Chado and interoperability
- Chris Mungall, BDGP
- Pinglei Zhou, FlyBase-Harvard
2Databases and applications
Java
SQL
Chado
Application
?
seq
rad
cv
genetic
Application
phylo
pub
Perl
How do we get databases and applications
speaking to one another?
3Databases and applications
SQL
Java
Chado
posgresql driver
Application
JDBC
method calls
row objects
seq
rad
posgresql driver
cv
genetic
Application
DBI
method calls
perl arrays
phylo
pub
Perl
Generic database interfaces only solve part of
the problem
They let us embed SQL inside application code
4Why this alone isnt enough
- Interfacing applications to databases is a tricky
business - Issue Language mismatch
- Issue Data structure mismatch
- Issue Repetitive code
- Issue No centralized domain logic
5Language mismatch
String sql "SELECT srcfeature_id, fmax,
fmin FROM featureloc " "WHERE
feature_id " featId try Statement
s conn.createStatement() ResultSet rs
s.executeQuery(sql) rs.next()
sourceFeatureId rs.getInt("srcfeature_id")
fmin rs.getInt("fmin") fmax
rs.getInt("fmax") catch (SQLException
sqle) System.err.println(this.getClass()
" SQLException
retrieving feature loc"
" for feature_id " featId)
sqle.printStackTrace(System.err)
6Data structure mismatch
relations - set theoretic - relational
algebra
X
classes and structs - pointers - programs
7Repetitive code
- Database fetch pattern
- construct, ask, transform, repeat, stitch
- Example fetching gene models
- fetch genes
- fetch transcripts
- fetch exons, polypeptides
- fetch ancillary data (props, cvs, pubs, syns,
etc) - Optimisation is difficult
8No centralized domain logic
- Examples of domain logic
- project a feature onto a virtual contig
- revcomp or translate a sequence
- search by ontology term
- delete a gene model
- Domain logic should be reusable by different
applications
9A solutionObject Oriented APIs
Perl
Perl
chado
driver
chado adapter
DBI
Application
API
method calls
domain objects
Application
biosql adapter
biosql
domain objects
driver
DBI
Perl
Different perl applications share the same API
Different schemas can be added by writing adapters
10How do OO APIs help?
- Issue Language mismatch
- Seperation of interface from implementation
- Issue Data structure mismatch
- API talks objects
- adapters hide and deal with conversion
- Issue Repetitive code
- code centralized in both API and adapter
- Issue No centralized domain logic
- object model encapsulates domain logic
- object model can be used independently of
database
11How do OO APIs hinder?
- Writing or generating adapters
- brittle, difficult to maintain
- Restrictive
- canned parameterized queries vs query language
- Application language bound
- very difficult to use a perl API from java
- Application bound
- sometimes generic, but often limited to one
application - Opaque domain logic
12XML can help
- XML is application-language neutral
- XML can be used to specify
- data
- transactions
- queries and query constraints
- XML can be used within both application languages
and specialized XML languages - XPath
- XSLT
- XQuery
13XML middleware
Perl
Perl
any
chado
driver
impl
DBI
Application
interface
query params
chado xml
Application
query params
chado xml
Generic SQL to XML mapper
Java
Database XML as lingua franca
OO APIs can be implemented on top of XML layer
14XML middleware
Java
Perl
any
chado
driver
impl
JDBC
Application
interface
query params
chado xml
Application
query params
Implementation can be any language
chado xml
Java
15Mapping with XORT
- XORT is a specification of how to map XML to the
relational model - generic independent of chado and biology
- XMLXORT is a perl implementation of the XORT
specification - Other implementations possible
- DBIxDBStag implements XORT xml-gtdb
- Application language agnostic
- Easily wrapped for other languages
16Highlights
- Proposal XML mapping specification for Chado
- Tools
- Real Case
17XORT Mapping
- Elements
- Table
- Column (except DB-specific value, e.g primary key
in Chado schema -- not visible in XML) - Attributes
- few and generic transaction and reference
control - Element nesting
- column within table
- joined table within table -- joining column is
implicit - foreign key table within foreign key column
- Modules
- No module distinctions in chadoXML
- Limitations of DTD
- Cardinality, NULLness, data type
18Transactions and Operations
- Lookup
- - Select only
- Insert
- - Insert explicitly
- Delete
- - Unique identifier with unique key(s)
- - One record per operation
- Update
- - Two elements
- - Unique identifier with unique key(s)
- - One record per operation
-
- Force
- Combination of lookup, insert and
update (if not lookup, then insert, else update)
19Referencing Objects
- By global accession
- - Format dbnameaccession.version
- - Only for dbxref, feature ?, cvterm ?
- By a pre-defined local id
- - Allows reference to objects in same
file - - Need not be in DB
- - Can be any symbol
- By lookup using unique key value(s)
- - Object can be in file or DB
- Implicitly, using foreign key to identify
information in the related link table
20Object Reference By Global Accession
ltfeaturegt ltuniquenamegtCG3123lt/uniquenamegt
lttype_idgtgenelt/type_idgt ltfeature_relationsh
ipgt ltsubject_idgtGadflyCG3123-RA1lt/subje
ct_idgt lttype_idgtproducedbylt/type_idgt
ltfeature_relationshipgt lt/featuregt lt/featu
regt
21Object Reference By Local ID
- ltcv idSOgt
- ltnamegtSequence Ontologylt/namegt
- lt/cvgt
- ltcvterm idexongt
- ltcv_idgtSOlt/cv_idgt
- ltnamegtexonlt/namegt
- lt/cvtermgt
- ltfeaturegt
- lttype_idgtexonlt/type_idgt
22Object Reference By key Value (s)
ltfeaturegt lttype_idgt  ltcvtermgt Â
  ltcv_idgt
ltcvgt ltnamegtSequence
Ontologylt/namegt        lt/cvgt
lt/cv_idgt   Â
ltnamegtexonlt/namegt  lt/cvtermgt
lt/type_idgt .   Â
23ChadoXML Example
ltcv idSOgt ltnamegtSequence
Ontologylt/namegt lt/cvgt ltfeature oplookup
idCG3312gt ltuniquenamegtCG3312lt/uniquenamegt
lttype_idgt ltcvtermgt
ltnamegtgenelt/namegt
ltcv_idgtSOlt/cv_idgt lt/cvtermgt
lttype_idgt ltorganism_idgt
ltfeature_relationshipgt
ltsubject_idgtGadflyCG3312-RAlt/subject_idgt
lttype_idgtproducedbylt/type_idgt
lt/feature_relationshipgt lt/featuregt
24Schema-Driven Tools
- DTD Generator DDL-DTD
- Validator
- DB Not connected
- Syntax verification legal XML, correct
element nesting - Some Semantic verfication NULLness,
cardinality, local ID reference - DB Connected reference validation
- Loader XML-gtDB
- DumperDB-gtXML
- Driven by XML dumpspec
- XORTDiff diff two XORT XML files
25DumpSpec Driven Dumper
- Default behavior given an object class and ID,
dump all direct values and link tables, with refs
to foreign keys. - Non-default behavior specified by XML dumpspecs
using same DTD with a few additions - attribute dump all cols select none
- attribute test yes no
- element _sql
- element _appdata
- Workaround with views, _sql
- Current use cases
- Dump a gene for a gene detail page
- Dump a scaffold for Apollo
26DumpSpec Sample
ltfeature dump"all"gt ltuniquename
test"yes"gtCG3312lt/uniquenamegt lt!-- get all
mRNAs of this gene --gt ltfeature_relationship
dump"all"gt ltsubject_id test"yes"gt
ltfeaturegt lttype_idgtltcvtermgt
ltnamegtmRNAlt/namegt lt/cvtermgt lt/type_idgt
lt/featuregt lt/subject_idgt
ltsubject_idgt ltfeature dump"all"gt
lt!-- get all exons of those mRNAs --gt
ltfeature_relationship dump"all"gt
ltsubject_id test"yes"gt
ltfeaturegt
lttype_idgt gtltcvtermgtltnamegtexonlt/namegt lt/cvtermgt
lt/type_idgt lt/featuregt
lt/subject_idgt
ltsubject_idgt ltfeature
dump"all/gt lt/subject_idgt
lt/feature_relationshipgt lt/featuregt
27Use Case 1Chado lt-gt Apollo Interaction
DumpSpec1
DDL
XML Dumper
XML Loader validator
Chado
Chado XML
Chado XML
By_product
GameBridge
GameBridge
XML Schema
GAME XML
GAME XML
28Use Case 2Gene Page Dataflow
DumpSpec2
Chado
XML Dumper
Chado XML
acode
FlyBase
29To Do Lists
- External Object reference
- Dump with auto-generated XML Schema
- Output human-friendly
30Resources
- Todays slides
- XORT package http//www.gmod.org
- Protocol draft submit to Current Protocol In
Bioinformatics - Using chado to Store Genome Annotation
Data
31XORT Key points
- Application language-neutral
- reusable from within multiple languages and
applications - Where does the domain logic live?
- Unlike objects, XML does not have behaviour
- One solution ChadoXML Services
- Another solution Inside the DBMS
32ChadoXML Services
Application
chado
XORT interface
query params
driver
XML XORT
chado xml
DBI
chado xml query params
chado xml other xml
Lisp
XSLT
Java
Prolog
Format converters, dumpers
Ontology services
XQuery
ChadoXML services interface
sequence domain logic
companalysis logic
Perl
Java
C
Perl
33ChadoXML Services
Can be independent of DB
Application
chado xml query params
chado xml other xml
Lisp
XSLT
Java
Prolog
Format converters, dumpers
Ontology services
XQuery
ChadoXML services interface
sequence domain logic
companalysis logic
Perl
Java
C
Perl
34DB Functions API
Chado
JDBC/ DBI
posgresql driver
Application
db function calls
DB Functions API
sql result objects
PL/PgSQL Function Impl
sequence library
C
Implementation inside database
35DB Functions API
- cv module
- get_all_subject_ids(cvterm_id int)
- get_all_object_ids(cvterm_id int)
- fill_cvtermpath(cv_id int)
Chado
DB Functions API
PL/PgSQL Function Impl
Existing functions
36DB Functions API
- sequence module
- get_sub_feature_ids(feature_id int)
- featuresplice(fmin int, fmax int)
- get_subsequence(srcfeature_id int, fmin int, fmax
int, strand int) - next_uniquename()
Chado
DB Functions API
PL/PgSQL Function Impl
Existing functions
37Putting it together storing ontologies in chado
XMLXORT or DBIxDBStag
Chado XML
cv
fill_cvtermpath()
oboxml_to_ chadoxml.xsl
Obo XML
obo-edit
owl_to_ oboxml.xsl
protege
OWL
38Benefits
- Issue Language mismatch
- XORT dumpspecs and sql functions a more natural
fit for application languages - Issue Data structure mismatch
- XML maps naturally to objects and structs
- Issue Repetitive code
- XORT dumpspecs centralize db-fetch code
- XORT loader centralizes db-store code
- Issue No centralized domain logic
- domain logic can be encoded in
- PostgreSQL functions and triggers
- ChadoXML services
39Other issues
- Speed?
- chained transformations may be slower (-)
- generic code is often slower (-)
- single point for optimization()
- Verbosity
- inevitable with a normalized database
- reduced with XORT macros
- Portability
- XORT highly portable ()
- PostgreSQL functions must be manually ported to
different DBMSs (-)
40Current plans
- XORT wrappers
- Improving efficiency
- Documentation
- Extend PostgreSQL function repertoire
- More ChadoXML XSLTs
- ChadoXML adapters
- CGL
- Apollo
- BioPerl - BioSeq,Search,Tree,..IOchadoxml
41Conclusions
- ChadoXML
- a common GMOD format
- converted to other formats with XSLTs
- XORT
- centralises database interoperation logic
- PostgreSQL functions
- useful for certain kinds of domain logic
- Object APIs
- still required by many applications
- can be layered on top of XORT if so desired
42Thanks to
- Richard Bruskiewich
- Scott Cain
- Allen Day
- Karen Eilbeck
- Dave Emmert
- William Gelbart
- Mark Gibson
- Don Gilbert
- Aubrey de Grey
- Nomi Harris
- Stan Letovsky
- Suzanna Lewis
- Aaron Mackey
- Sima Misra
- Emmannel Mongin
- Simon Prochnik
- Gerald Rubin
- Susan Russo
- ShengQiang Shu
- Chris Smith
- Frank Smutniak
- Lincoln Stein
- Colin Wiel
- Mark Yandell
- Peili Zhang
- Mark Zythovicz
43Nothing to see here, move along..
- random deleted slides follow
44How do we develop an OO API?
- Ad-hoc
- eg GO API, GadFly API, Ensembl API,
ApolloChadoJDBC - Object model and db can be developed
independently - hand tuned
- Generic
- ClassDBI, JDO
- efficiency issues?
45GBrowse API
Perl
BioSQL
GBrowse
BioDB DAS
SQL
Chado
DBI
Bio-DB-GFF
BioPerl
Gbrowse can connect to different database via the
same API
APIs are typically associated with an object
model
46Choice of XML Datamodel
- XML is a different formalism from the relational
model - Should we
- design an XML datamodel from scratch?
- adopt existing XML datamodel?
- Generate XML datamodel from relational model?
47Chado XML
- XML datamodel derived automatically from
relational schema - mapping uses XORT specification
- XML tightly coupled to relational schema
- benefits advantages..
48Data retrieval XORT Dumpspecs
- Examples chado gene model dumpspec
49Data storage