Chado and interoperability - PowerPoint PPT Presentation

Description:

Chado and interoperability Chris Mungall, BDGP Pinglei Zhou, FlyBase-Harvard – PowerPoint PPT presentation

Slides: 50
Provided by: ChrisMu153
Learn more at: http://gmod.org
Transcript and Presenter's Notes

1
Chado and interoperability
  • Chris Mungall, BDGP
  • Pinglei Zhou, FlyBase-Harvard

2
Databases and applications
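[Diagram: applications written in Java and Perl on one side; the Chado database, reached via SQL, with its modules (seq, rad, cv, genetic, phylo, pub) on the other; a question mark sits on the connection between them.]
How do we get databases and applications speaking to one another?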
3
Databases and applications
[Diagram: a Java application reaches Chado through JDBC and the PostgreSQL driver (method calls in, row objects out); a Perl application reaches it through DBI and the PostgreSQL driver (method calls in, Perl arrays out). Both send SQL. Chado modules shown: seq, rad, cv, genetic, phylo, pub.]
Generic database interfaces only solve part of the problem.
They let us embed SQL inside application code.
4
Why this alone isn't enough
  • Interfacing applications to databases is a tricky
    business
  • Issue: Language mismatch
  • Issue: Data structure mismatch
  • Issue: Repetitive code
  • Issue: No centralized domain logic

5
Language mismatch
String sql = "SELECT srcfeature_id, fmax, fmin FROM featureloc " +
             "WHERE feature_id = " + featId;
try {
    Statement s = conn.createStatement();
    ResultSet rs = s.executeQuery(sql);
    rs.next();
    sourceFeatureId = rs.getInt("srcfeature_id");
    fmin = rs.getInt("fmin");
    fmax = rs.getInt("fmax");
} catch (SQLException sqle) {
    System.err.println(this.getClass() +
        ": SQLException retrieving feature loc" +
        " for feature_id " + featId);
    sqle.printStackTrace(System.err);
}
6
Data structure mismatch
  • Different formalisms
  • relations - set theoretic - relational algebra
  • classes and structs - pointers - programs
[Diagram: an X between the two formalisms marks the mismatch.]
7
Repetitive code
  • Database fetch pattern
  • construct, ask, transform, repeat, stitch
  • Example: fetching gene models
  • fetch genes
  • fetch transcripts
  • fetch exons, polypeptides
  • fetch ancillary data (props, cvs, pubs, syns,
    etc)
  • Optimisation is difficult

8
No centralized domain logic
  • Examples of domain logic
  • project a feature onto a virtual contig
  • revcomp or translate a sequence
  • search by ontology term
  • delete a gene model
  • Domain logic should be reusable by different
    applications
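
As a small illustration of the kind of domain logic listed above, here is a minimal sketch of "revcomp a sequence" written once as a reusable routine. The class and method names are hypothetical and not part of Chado or any GMOD library; the sketch handles only A, C, G, T and N and returns upper-case output.

```java
/** Hypothetical home for reusable sequence domain logic. */
final class SequenceLogic {
    private SequenceLogic() {}

    /** Returns the reverse complement of a DNA string (A<->T, C<->G, N stays N). */
    static String revcomp(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        // Walk the input backwards and append the complement of each base.
        for (int i = dna.length() - 1; i >= 0; i--) {
            out.append(complement(dna.charAt(i)));
        }
        return out.toString();
    }

    /** Complements one base; input case is ignored, output is upper case. */
    private static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'N': return 'N';
            default:
                throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }
}
```

If every application embeds its own copy of a routine like this next to its SQL, the copies drift apart; that is the reuse problem the slide points at.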

9
A solution: Object Oriented APIs
[Diagram: two Perl applications make method calls to a shared API and get domain objects back; behind the API, a chado adapter and a biosql adapter each reach their database (chado, biosql) through DBI and a driver.]
Different Perl applications share the same API.
Different schemas can be added by writing adapters.
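
The adapter idea can be sketched as follows. This is an illustration only: the slide describes Perl APIs over DBI, whereas the sketch uses Java and an in-memory map in place of a database, and every type name here (`Gene`, `GeneAdapter`, `InMemoryChadoAdapter`) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Domain object shared by all applications; it knows nothing about any schema. */
final class Gene {
    final String uniquename;
    final int fmin;
    final int fmax;

    Gene(String uniquename, int fmin, int fmax) {
        this.uniquename = uniquename;
        this.fmin = fmin;
        this.fmax = fmax;
    }

    /** Domain logic lives on the object, independent of the database. */
    int length() {
        return fmax - fmin;
    }
}

/** The API that applications code against. */
interface GeneAdapter {
    Gene fetchGene(String uniquename);
}

/** Stand-in for a chado adapter; a real one would run SQL through a driver. */
final class InMemoryChadoAdapter implements GeneAdapter {
    private final Map<String, int[]> featureloc = new HashMap<>();

    void store(String uniquename, int fmin, int fmax) {
        featureloc.put(uniquename, new int[] {fmin, fmax});
    }

    @Override
    public Gene fetchGene(String uniquename) {
        int[] loc = featureloc.get(uniquename);
        // The conversion from rows to domain objects is hidden inside the adapter.
        return loc == null ? null : new Gene(uniquename, loc[0], loc[1]);
    }
}
```

Supporting another schema would mean writing another `GeneAdapter` implementation; application code that only sees the interface would not change.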
10
How do OO APIs help?
  • Issue: Language mismatch
  • Separation of interface from implementation
  • Issue: Data structure mismatch
  • API talks objects
  • adapters hide and deal with conversion
  • Issue: Repetitive code
  • code centralized in both API and adapter
  • Issue: No centralized domain logic
  • object model encapsulates domain logic
  • object model can be used independently of
    database

11
How do OO APIs hinder?
  • Writing or generating adapters
  • brittle, difficult to maintain
  • Restrictive
  • canned parameterized queries vs query language
  • Application language bound
  • very difficult to use a perl API from java
  • Application bound
  • sometimes generic, but often limited to one
    application
  • Opaque domain logic

12
XML can help
  • XML is application-language neutral
  • XML can be used to specify
  • data
  • transactions
  • queries and query constraints
  • XML can be used within both application languages
    and specialized XML languages
  • XPath
  • XSLT
  • XQuery

13
XML middleware
[Diagram: applications send query params to a generic SQL to XML mapper (interface plus implementation) and receive chado xml; the mapper reaches chado through DBI and a driver. Language labels on the slide: Perl, Java, any.]
Database XML as lingua franca.
OO APIs can be implemented on top of the XML layer.
14
XML middleware
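[Diagram: applications send query params to the XML mapper (interface plus implementation) and receive chado xml; here the implementation reaches chado through JDBC and a driver. Language labels on the slide: Java, Perl, any.]
Implementation can be any language.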
15
Mapping with XORT
  • XORT is a specification of how to map XML to the
    relational model
  • generic: independent of Chado and biology
  • XML::XORT is a Perl implementation of the XORT
    specification
  • Other implementations possible
  • DBIx::DBStag implements XORT XML->DB
  • Application language agnostic
  • Easily wrapped for other languages

16
Highlights
  • Proposal: XML mapping specification for Chado
  • Tools
  • Real Case

17
XORT Mapping
  • Elements
  • Table
  • Column (except DB-specific values, e.g. the primary key
    in the Chado schema, which is not visible in XML)
  • Attributes
  • few and generic: transaction and reference
    control
  • Element nesting
  • column within table
  • joined table within table -- joining column is
    implicit
  • foreign key table within foreign key column
  • Modules
  • No module distinctions in chadoXML
  • Limitations of DTD
  • Cardinality, NULLness, data type

18
Transactions and Operations
  • Lookup
  • - Select only
  • Insert
  • - Insert explicitly
  • Delete
  • - Unique identifier with unique key(s)
  • - One record per operation
  • Update
  • - Two elements
  • - Unique identifier with unique key(s)
  • - One record per operation
  • Force
  • - Combination of lookup, insert and update:
    if the lookup finds no record, insert; otherwise update
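
The difference between lookup and force can be sketched like this. It is a toy model under stated assumptions: a hypothetical in-memory map stands in for a table keyed by its unique key, and the class, enum and method names are invented for illustration; it is not how XML::XORT is implemented.

```java
import java.util.HashMap;
import java.util.Map;

final class ForceOpSketch {
    enum Outcome { INSERTED, UPDATED }

    // Hypothetical "table": unique key -> remaining column values.
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    /** lookup: select only; never modifies the table. Returns null if absent. */
    Map<String, String> lookup(String uniqueKey) {
        return rows.get(uniqueKey);
    }

    /** force: if the lookup finds nothing, insert; otherwise update the row found. */
    Outcome force(String uniqueKey, Map<String, String> values) {
        Map<String, String> existing = lookup(uniqueKey);
        if (existing == null) {
            rows.put(uniqueKey, new HashMap<>(values));
            return Outcome.INSERTED;
        }
        existing.putAll(values);
        return Outcome.UPDATED;
    }
}
```

Calling `force` twice with the same unique key inserts on the first call and updates on the second, which is what makes it convenient for reloading the same file.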

19
Referencing Objects
  • By global accession
  • - Format: dbname:accession.version
  • - Only for dbxref (feature? cvterm?)
  • By a pre-defined local id
  • - Allows reference to objects in same
    file
  • - Need not be in DB
  • - Can be any symbol
  • By lookup using unique key value(s)
  • - Object can be in file or DB
  • Implicitly, using foreign key to identify
    information in the related link table
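
Reference by local id can be sketched with a standard DOM parser: index every element that carries an `id` attribute, then resolve the symbol found in a foreign key column against that index. The wrapper element `chado`, the id value `exon_term` used in the usage note, and all class and method names are assumptions made for this sketch, not part of the XORT specification.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class LocalIdSketch {
    private LocalIdSketch() {}

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Collects every element that carries an id attribute, keyed by that id. */
    static Map<String, Element> indexLocalIds(Document doc) {
        Map<String, Element> byId = new HashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("id")) {
                byId.put(e.getAttribute("id"), e);
            }
        }
        return byId;
    }

    /** Text of the first direct child element with the given tag, or null. */
    static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && ((Element) n).getTagName().equals(tag)) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }

    /** Follows the local id in the first feature's type_id to the cvterm name. */
    static String featureTypeName(String xml) {
        Document doc = parse(xml);
        Map<String, Element> byId = indexLocalIds(doc);
        Element feature = (Element) doc.getElementsByTagName("feature").item(0);
        String ref = childText(feature, "type_id");
        Element cvterm = byId.get(ref);
        if (cvterm == null) {
            throw new IllegalArgumentException("No element with local id " + ref);
        }
        return childText(cvterm, "name");
    }
}
```

For a document in which `<cvterm id="exon_term">` has the name `exon` and a feature's `type_id` contains `exon_term`, `featureTypeName` returns `exon`. The referenced object only has to be in the same file, not in the database.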

20
Object Reference By Global Accession
<feature>
  <uniquename>CG3123</uniquename>
  <type_id>gene</type_id>
  <feature_relationship>
    <subject_id>Gadfly:CG3123-RA:1</subject_id>
    <type_id>producedby</type_id>
  </feature_relationship>
</feature>
21
Object Reference By Local ID
<cv id="SO">
  <name>Sequence Ontology</name>
</cv>
<cvterm id="exon">
  <cv_id>SO</cv_id>
  <name>exon</name>
</cvterm>
<feature>
  <type_id>exon</type_id>

22
Object Reference By Key Value(s)
<feature>
  <type_id>
    <cvterm>
      <cv_id>
        <cv>
          <name>Sequence Ontology</name>
        </cv>
      </cv_id>
      <name>exon</name>
    </cvterm>
  </type_id>
  ...
23
ChadoXML Example
<cv id="SO">
  <name>Sequence Ontology</name>
</cv>
<feature op="lookup" id="CG3312">
  <uniquename>CG3312</uniquename>
  <type_id>
    <cvterm>
      <name>gene</name>
      <cv_id>SO</cv_id>
    </cvterm>
  </type_id>
  <organism_id>...</organism_id>
  <feature_relationship>
    <subject_id>Gadfly:CG3312-RA</subject_id>
    <type_id>producedby</type_id>
  </feature_relationship>
</feature>
24
Schema-Driven Tools
  • DTD Generator: DDL -> DTD
  • Validator
  • DB not connected
  • Syntax verification: legal XML, correct
    element nesting
  • Some semantic verification: NULLness,
    cardinality, local ID reference
  • DB connected: reference validation
  • Loader: XML -> DB
  • Dumper: DB -> XML
  • Driven by XML dumpspec
  • XORTDiff: diff two XORT XML files

25
DumpSpec Driven Dumper
  • Default behavior: given an object class and ID,
    dump all direct values and link tables, with refs
    to foreign keys.
  • Non-default behavior: specified by XML dumpspecs
    using the same DTD with a few additions
  • attribute dump: all, cols, select, none
  • attribute test: yes, no
  • element _sql
  • element _appdata
  • Workaround with views, _sql
  • Current use cases
  • Dump a gene for a gene detail page
  • Dump a scaffold for Apollo

26
DumpSpec Sample
<feature dump="all">
  <uniquename test="yes">CG3312</uniquename>
  <!-- get all mRNAs of this gene -->
  <feature_relationship dump="all">
    <subject_id test="yes">
      <feature>
        <type_id><cvterm><name>mRNA</name></cvterm></type_id>
      </feature>
    </subject_id>
    <subject_id>
      <feature dump="all">
        <!-- get all exons of those mRNAs -->
        <feature_relationship dump="all">
          <subject_id test="yes">
            <feature>
              <type_id><cvterm><name>exon</name></cvterm></type_id>
            </feature>
          </subject_id>
          <subject_id>
            <feature dump="all"/>
          </subject_id>
        </feature_relationship>
      </feature>
    </subject_id>
  </feature_relationship>
</feature>
27
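Use Case 1: Chado <-> Apollo Interaction
[Diagram: Chado is dumped by the XML Dumper (driven by DumpSpec1) to Chado XML, which GameBridge converts to GAME XML; in the other direction GameBridge converts GAME XML to Chado XML, which the XML Loader/validator writes to Chado. The DDL feeds the tools; an XML Schema is shown as a by-product.]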
28
Use Case 2: Gene Page Dataflow
[Diagram: Chado -> XML Dumper (driven by DumpSpec2) -> Chado XML -> acode -> FlyBase]
29
To Do Lists
  • External Object reference
  • Dump with auto-generated XML Schema
  • Output human-friendly

30
Resources
  • Today's slides
  • XORT package: http://www.gmod.org
  • Protocol draft for Current Protocols in
    Bioinformatics:
  • Using chado to Store Genome Annotation
    Data

31
XORT: Key points
  • Application language-neutral
  • reusable from within multiple languages and
    applications
  • Where does the domain logic live?
  • Unlike objects, XML does not have behaviour
  • One solution: ChadoXML Services
  • Another solution: Inside the DBMS

32
ChadoXML Services
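[Diagram: an application exchanges query params and chado xml with the XORT interface (XML::XORT over DBI and a driver, on top of chado). Alongside it, a ChadoXML services interface takes chado xml and query params and returns chado xml or other xml. Services shown: format converters and dumpers, ontology services, sequence domain logic, companalysis logic. Language labels on the slide: Lisp, XSLT, Java, Prolog, XQuery, Perl, C.]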
33
ChadoXML Services
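Can be independent of DB
[Diagram: an application talks directly to the ChadoXML services interface, sending chado xml and query params and receiving chado xml or other xml, with no database in the picture. Services shown: format converters and dumpers, ontology services, sequence domain logic, companalysis logic. Language labels on the slide: Lisp, XSLT, Java, Prolog, XQuery, Perl, C.]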
34
DB Functions API
[Diagram: the application makes db function calls through JDBC/DBI and the PostgreSQL driver to the DB Functions API inside Chado and receives SQL result objects; the functions are implemented in PL/pgSQL, with a sequence library in C.]
Implementation inside database
35
DB Functions API
  • cv module
  • get_all_subject_ids(cvterm_id int)
  • get_all_object_ids(cvterm_id int)
  • fill_cvtermpath(cv_id int)
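
To make the purpose of these functions concrete, here is a minimal sketch of the computation that a function such as get_all_subject_ids is assumed to perform: collecting every term reachable by following relationships from subject to object in reverse. The exact semantics of the Chado functions are an assumption here, and the sketch uses a hypothetical in-memory relationship map in place of the cvterm_relationship table.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CvClosureSketch {
    // object (parent) term id -> ids of its direct subject (child) terms
    private final Map<Integer, Set<Integer>> directSubjects = new HashMap<>();

    /** Records one relationship row: the subject term relates to the object term. */
    void addRelationship(int subjectId, int objectId) {
        directSubjects.computeIfAbsent(objectId, k -> new HashSet<>()).add(subjectId);
    }

    /** Returns all direct and indirect subjects of the given term, in sorted order. */
    Set<Integer> allSubjectIds(int cvtermId) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> todo = new ArrayDeque<>();
        todo.push(cvtermId);
        while (!todo.isEmpty()) {
            int current = todo.pop();
            for (int child : directSubjects.getOrDefault(current, Collections.emptySet())) {
                // Only expand terms not seen before, so cycles cannot loop forever.
                if (seen.add(child)) {
                    todo.push(child);
                }
            }
        }
        return seen;
    }
}
```

Doing this traversal inside the database, or precomputing it into a path table, saves each application from issuing one query per level of the ontology.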

[Diagram: Chado containing the DB Functions API and its PL/pgSQL function implementation, labelled "Existing functions".]
36
DB Functions API
  • sequence module
  • get_sub_feature_ids(feature_id int)
  • featuresplice(fmin int, fmax int)
  • get_subsequence(srcfeature_id int, fmin int, fmax
    int, strand int)
  • next_uniquename()

[Diagram: Chado containing the DB Functions API and its PL/pgSQL function implementation, labelled "Existing functions".]
37
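Putting it together: storing ontologies in Chado
[Diagram: protege produces OWL, which owl_to_oboxml.xsl converts to Obo XML; obo-edit is also shown as a source of Obo XML; oboxml_to_chadoxml.xsl converts Obo XML to Chado XML; XML::XORT or DBIx::DBStag loads it into the cv module, where fill_cvtermpath() is run.]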
38
Benefits
  • Issue: Language mismatch
  • XORT dumpspecs and SQL functions are a more natural
    fit for application languages
  • Issue: Data structure mismatch
  • XML maps naturally to objects and structs
  • Issue: Repetitive code
  • XORT dumpspecs centralize db-fetch code
  • XORT loader centralizes db-store code
  • Issue: No centralized domain logic
  • domain logic can be encoded in
  • PostgreSQL functions and triggers
  • ChadoXML services

39
Other issues
  • Speed?
  • chained transformations may be slower (-)
  • generic code is often slower (-)
  • single point for optimization (+)
  • Verbosity
  • inevitable with a normalized database
  • reduced with XORT macros
  • Portability
  • XORT highly portable (+)
  • PostgreSQL functions must be manually ported to
    different DBMSs (-)

40
Current plans
  • XORT wrappers
  • Improving efficiency
  • Documentation
  • Extend PostgreSQL function repertoire
  • More ChadoXML XSLTs
  • ChadoXML adapters
  • CGL
  • Apollo
  • BioPerl - Bio::{Seq,Search,Tree,..}IO::chadoxml

41
Conclusions
  • ChadoXML
  • a common GMOD format
  • converted to other formats with XSLTs
  • XORT
  • centralises database interoperation logic
  • PostgreSQL functions
  • useful for certain kinds of domain logic
  • Object APIs
  • still required by many applications
  • can be layered on top of XORT if so desired

42
Thanks to
  • Richard Bruskiewich
  • Scott Cain
  • Allen Day
  • Karen Eilbeck
  • Dave Emmert
  • William Gelbart
  • Mark Gibson
  • Don Gilbert
  • Aubrey de Grey
  • Nomi Harris
  • Stan Letovsky
  • Suzanna Lewis
  • Aaron Mackey
  • Sima Misra
  • Emmanuel Mongin
  • Simon Prochnik
  • Gerald Rubin
  • Susan Russo
  • ShengQiang Shu
  • Chris Smith
  • Frank Smutniak
  • Lincoln Stein
  • Colin Wiel
  • Mark Yandell
  • Peili Zhang
  • Mark Zytkovicz

43
Nothing to see here, move along..
  • random deleted slides follow

44
How do we develop an OO API?
  • Ad-hoc
  • e.g. GO API, GadFly API, Ensembl API,
    ApolloChadoJDBC
  • Object model and db can be developed
    independently
  • hand tuned
  • Generic
  • Class::DBI, JDO
  • efficiency issues?

45
GBrowse API
[Diagram: GBrowse (Perl) talks through one API to different databases. Labels on the slide: BioPerl, Bio::DB::GFF, Bio::DB::Das, DBI, SQL, Chado, BioSQL.]
GBrowse can connect to different databases via the same API.
APIs are typically associated with an object model.
46
Choice of XML Datamodel
  • XML is a different formalism from the relational
    model
  • Should we
  • design an XML datamodel from scratch?
  • adopt existing XML datamodel?
  • Generate XML datamodel from relational model?

47
Chado XML
  • XML datamodel derived automatically from
    relational schema
  • mapping uses XORT specification
  • XML tightly coupled to relational schema
  • benefits advantages..

48
Data retrieval XORT Dumpspecs
  • Example: Chado gene model dumpspec

49
Data storage
  • Static or transactional