Title: chado
1 chado
- Generic model organism database schema
2 Chado modules
3 [Figure: schema diagram showing the tables dbxref, cvterm, feature_relationship, feature_dbxref, feature_cvterm, feature, featureloc, featureprop, feature_synonym, organism, featureprop_pub, synonym, pub]
4 Sequence Ontology
5 Central Dogma: single spliced transcript
[Figure: Genomic Contig, gene, exon, transcript, CDS, and protein drawn as Features (colors: DNA, RNA, Protein). Featureloc places features on the contig. Feature_relation arrows (subj->obj) are labeled "produced by" and "part of". Note on figure: "Use rank to order".]
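The feature graph on this slide can be sketched with two tables. The following is a minimal illustration in Python with SQLite, not the Chado DDL: the column names objfeature_id and type, the feature names, and the choice of which feature is subject or object of each relation are assumptions made for the example. It shows how rank can order the exons that are part_of a transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL   -- simplification: Chado stores a type_id pointing at cvterm
);
CREATE TABLE feature_relationship (
    subjfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    objfeature_id  INTEGER NOT NULL REFERENCES feature(feature_id),  -- assumed name
    type           TEXT NOT NULL,
    rank           INTEGER NOT NULL DEFAULT 0
);
""")

# Hypothetical features for one gene with a single spliced transcript.
features = [("gene1", "gene"), ("mRNA1", "transcript"), ("exon1", "exon"),
            ("exon2", "exon"), ("exon3", "exon"), ("prot1", "protein")]
conn.executemany("INSERT INTO feature (uniquename, type) VALUES (?, ?)", features)


def fid(name):
    """Look up a feature_id by its unique name."""
    row = conn.execute(
        "SELECT feature_id FROM feature WHERE uniquename = ?", (name,)
    ).fetchone()
    return row[0]


# (subject, object, relation, rank); exons are inserted out of order on purpose.
relations = [
    ("mRNA1", "gene1", "produced_by", 0),
    ("exon3", "mRNA1", "part_of", 2),
    ("exon1", "mRNA1", "part_of", 0),
    ("exon2", "mRNA1", "part_of", 1),
    ("prot1", "mRNA1", "produced_by", 0),
]
for subj, obj, rel, rank in relations:
    conn.execute("INSERT INTO feature_relationship VALUES (?, ?, ?, ?)",
                 (fid(subj), fid(obj), rel, rank))


def parts_in_order(obj_name, rel_type="part_of"):
    """Return the subjects related to obj_name, ordered by rank."""
    rows = conn.execute("""
        SELECT s.uniquename
        FROM feature_relationship r
        JOIN feature s ON s.feature_id = r.subjfeature_id
        JOIN feature o ON o.feature_id = r.objfeature_id
        WHERE o.uniquename = ? AND r.type = ?
        ORDER BY r.rank""", (obj_name, rel_type))
    return [row[0] for row in rows]


print(parts_in_order("mRNA1"))
```

Because exon order lives in rank rather than in insertion order or coordinates, the same exon could take a different rank in a second transcript.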
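6 Central Dogma: 2nd transcript (alt. splicing)
[Figure: as on the previous slide, with a second transcript added. Genomic Contig, gene, exon, transcript, CDS, and protein drawn as Features (colors: DNA, RNA, Protein). Featureloc places features on the contig. Feature_relation arrows (subj->obj) are labeled "produced_by" and "part of". Note on figure: "Use rank to order".]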
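7 Pathological Cases: trans-splicing
[Figure: Genomic Contig, gene, exon, two transcripts, and CDS drawn as Features (colors: DNA, RNA, Protein). Featureloc places features on the contig. Feature_relation arrows (subj->obj) are labeled "produced by" and "part of"; "part of" appears both transcript-to-transcript and exon-to-transcript. Note on figure: "Use rank to order".]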
8 Pairwise Alignments
[Figure: an HSP feature related to a query sequence (rank 1) and to the Genomic Contig (rank 0); arrows labeled Feature_relation (subj->obj).]
9 Sequence variations: SNPs
[Figure: SNP (A > G) on the Genomic Contig, with two locations: residue_info A, rank 0; residue_info G, rank 1.]
10 Sequence variations: SNPs (redundant mapping to protein)
[Figure: one SNP with two groups of locations.
- On the Genomic Contig (A > G): residue_info A, rank 0, locgroup 0; residue_info G, rank 1, locgroup 0.
- On the protein (I > T): residue_info I, rank 0, locgroup 1; residue_info T, rank 1, locgroup 1.]
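The rank and locgroup values on this slide can be read back with a self-join. The sketch below uses a deliberately flattened, hypothetical featureloc table (feature and source names stored as text, no coordinates) and assumes, following the slide, that rank 0 carries the original residue and rank 1 the substituted one, while locgroup separates the genomic mapping from the redundant protein mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE featureloc (
    feature      TEXT NOT NULL,     -- simplified: a name instead of feature_id
    srcfeature   TEXT NOT NULL,     -- simplified: a name instead of srcfeature_id
    residue_info TEXT NOT NULL,
    rank         INTEGER NOT NULL,
    locgroup     INTEGER NOT NULL
);
""")

# The four locations shown on the slide for a single SNP.
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", [
    ("SNP", "Genomic Contig", "A", 0, 0),
    ("SNP", "Genomic Contig", "G", 1, 0),
    ("SNP", "protein", "I", 0, 1),
    ("SNP", "protein", "T", 1, 1),
])


def substitutions(feature):
    """Return (source, rank-0 residue, rank-1 residue) for each locgroup."""
    rows = conn.execute("""
        SELECT a.srcfeature, a.residue_info, b.residue_info
        FROM featureloc a
        JOIN featureloc b
          ON a.feature = b.feature AND a.locgroup = b.locgroup
        WHERE a.feature = ? AND a.rank = 0 AND b.rank = 1
        ORDER BY a.locgroup""", (feature,))
    return rows.fetchall()


for source, before, after in substitutions("SNP"):
    print(f"{source}: {before} > {after}")
```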
11 Query Performance
- ROI Query
  - Reasonable, not stellar performance on PostgreSQL with index on (srcfeature_id, min, max)
  - Exploration of more sophisticated approaches yielded performance improvements in MySQL but not PostgreSQL
  - PostgreSQL functions simplify queries, e.g. select * from contains(src, min, max)
- Central Dogma Query
  - a few seconds for 3 levels, all info
  - 1 minute to include all overlapping features
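A region query of the kind described above can be sketched against a composite index. This example uses SQLite rather than PostgreSQL so that it runs anywhere. The column names fmin and fmax stand in for the slide's min and max and are assumptions, as is the reading of contains as "features lying entirely inside the region"; the slide does not spell out either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE featureloc (
    feature_id    INTEGER NOT NULL,
    srcfeature_id INTEGER NOT NULL,
    fmin          INTEGER NOT NULL,   -- assumed name for the slide's "min"
    fmax          INTEGER NOT NULL    -- assumed name for the slide's "max"
);
-- The composite index named on the slide.
CREATE INDEX featureloc_roi ON featureloc (srcfeature_id, fmin, fmax);
""")

# Made-up locations: three features on source 1, one on source 2.
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?)", [
    (10, 1, 100, 200),
    (11, 1, 150, 400),
    (12, 1, 500, 600),
    (20, 2, 120, 180),
])


def contained_in(src, lo, hi):
    """Features on src that lie entirely within [lo, hi]."""
    rows = conn.execute("""
        SELECT feature_id FROM featureloc
        WHERE srcfeature_id = ? AND fmin >= ? AND fmax <= ?
        ORDER BY feature_id""", (src, lo, hi))
    return [row[0] for row in rows]


def overlapping(src, lo, hi):
    """Features on src that share at least one position with [lo, hi]."""
    rows = conn.execute("""
        SELECT feature_id FROM featureloc
        WHERE srcfeature_id = ? AND fmin <= ? AND fmax >= ?
        ORDER BY feature_id""", (src, hi, lo))
    return [row[0] for row in rows]


print(contained_in(1, 100, 300))
print(overlapping(1, 100, 300))
```

The overlap test constrains fmin from above and fmax from below, so a B-tree index on (srcfeature_id, fmin, fmax) can only narrow the scan on the first two columns. That is one plausible reason plain indexing gives reasonable rather than stellar results.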
12 [Figure: FlyBase data flow around chado. Labels recovered from the diagram, grouped by apparent role:]
- Chado Schema: Sequence, Genetics, Expression, ...
- Sites: Cambridge UK, Harvard, Indiana, Berkeley
- Databases: Cambridge Working DB; FlyBase (read-write); Public (read only)
- Curation and data sources: Literature Curation (Genes, Phenotypes, DB Cross refs, Aberrations, Transposon Constructs, Transposon insertions, Bibliography); Sequence Features from Literature; GenBank Reference sequence; Annotation Updates; Sequence Analysis Pipeline; BOP; GenBank, SwissProt, Community, BDGP; Gene Model Annotation; Ontologies (GO, SO, Others); Biological Image Annotation; Stock List; Data Entry Forms (Interactions, Expression); CV-Annotator; QA/QC
- Tools and converters: XML Dumper; XML Loader, validator; chado XML DTD; gobo2chadx; chadx2game; game2chadx; chadx2gb; Apollo chado adaptor?; Java SEAN; Gbrowse (GMOD); report generators
- Outputs: FTP Site (Flatfile extracts, Apollo XML, PostgreSQL Dumps, Portable Mirror); FB Web Interface (Integrated Gene Reports, Gene Annotation, Genome Browser, Phenotypes, Expression, Interactions, etc.); Users
13 XORT: XML Object to Relational Translator
- Schema-driven tools
- DTD generator: DDL -> DTD
  - also generates html, xml, .pl versions of schema
- Validator
  - Not connected
    - Syntax verification: legal XML, correct element nesting
    - Some semantic verification: NULLness, cardinality, local ID reference
  - Connected: reference validation
  - Loader-only: constraints, triggers
- Loader: XML -> DB
- Dumper: DB -> XML
  - driven by XML dumpspec
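The "not connected" validation level can be illustrated without a database. The sketch below is not XORT's validator; it is a small Python stand-in that checks well-formedness, element nesting, and NULLness against a hypothetical in-memory schema description (SCHEMA, with made-up required flags), and it only inspects table elements directly under the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema description: table -> {column: is the column required?}
SCHEMA = {
    "cv": {"name": True},
    "cvterm": {"cv_id": True, "name": True, "definition": False},
}


def validate(xml_text):
    """Return a list of problems found without touching a database."""
    try:
        root = ET.fromstring(xml_text)          # syntax: is it legal XML?
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    errors = []
    for table in root:
        if table.tag not in SCHEMA:
            errors.append(f"unknown table element <{table.tag}>")
            continue
        columns = SCHEMA[table.tag]
        seen = set()
        for col in table:                        # nesting: column within table
            if col.tag not in columns:
                errors.append(f"<{col.tag}> is not a column of {table.tag}")
            seen.add(col.tag)
        for name, required in columns.items():   # NULLness: required columns
            if required and name not in seen:
                errors.append(f"{table.tag}: required column {name} is missing")
    return errors


good = "<chado><cv><name>Sequence Ontology</name></cv></chado>"
bad = "<chado><cvterm><name>exon</name><colour>red</colour></cvterm></chado>"
print(validate(good))
print(validate(bad))
```

Connected validation would add a database lookup for every reference, which is why it is listed as a separate level.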
14 Mapping XML to R-DBMS
- Policy 1: XML is independent of schema
  - Pro: ensures modularity, freedom to change one without the other (but why would you want to?)
  - Con: must maintain mapping when either changes
- Policy 2: XML locked to schema
  - Pro: don't have to learn two things, mapping is frozen
  - Con: see Pro above.
15 XORT Mapping
- Elements
  - Table
  - Column (except primary key -- not visible in XML)
- Attributes
  - few and generic: transaction and reference control
- Element nesting
  - column within table
  - joined table within table -- joining column is implicit
  - foreign key table within foreign key column
- Modules
  - No module distinctions in chadoXML
- Limitations of DTD
  - Cardinality, NULLness, data type
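The element rules above (table as element, column as child element, primary key hidden, foreign key table nested inside the foreign key column) can be sketched in a few lines. This is an illustration of the mapping idea only; row_to_element and its parameters are hypothetical names, not part of XORT.

```python
import xml.etree.ElementTree as ET


def row_to_element(table, row, primary_key, foreign=None):
    """Map one table row (a dict) to an element, following the rules above.

    foreign maps a foreign key column name to the already-built element of
    the row it points at; that element is nested inside the column element.
    """
    foreign = foreign or {}
    elem = ET.Element(table)                 # table -> element
    for column, value in row.items():
        if column == primary_key:
            continue                         # primary key is not visible in XML
        child = ET.SubElement(elem, column)  # column -> child element
        if column in foreign:
            child.append(foreign[column])    # foreign key table within its column
        else:
            child.text = str(value)
    return elem


cv = row_to_element("cv", {"cv_id": 1, "name": "Sequence Ontology"}, "cv_id")
cvterm = row_to_element(
    "cvterm",
    {"cvterm_id": 7, "cv_id": 1, "name": "exon"},
    "cvterm_id",
    foreign={"cv_id": cv},
)
xml_text = ET.tostring(cvterm, encoding="unicode")
print(xml_text)
```

Because the function is driven only by table and column names, a schema change needs no change to the mapping code, which is the point of locking the XML to the schema.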
17 Object References: How to refer to persistent objects within XML? (a.k.a. foreign key columns)
- By Unique Key Value(s)
  - object can be in XML file or DB
- By local ID
  - only for references to objects in same XML file
  - need not be in DB
  - local ID can be any symbol
  - def before ref
  - reduces duplication within XML
- By Global accession
  - currently only for feature
  - simple extension mechanism using Perl fragments
18 Object Reference: by key values
    <foreign_key_col>
      <primarytable>
        <keycol1>keyval1</keycol1>
        ... more key cols if needed
      </primarytable>
    </foreign_key_col>
- E.g.
    <feature>
      <type_id>
        <cvterm>
          <cv_id>
            <cv>
              <name>Sequence Ontology</name>
            </cv>
          </cv_id>
          <name>exon</name>
        </cvterm>
      </type_id>
      ...
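Resolving a reference by key values amounts to a recursive lookup: each nested table element is turned into a query on its key columns, innermost first. The sketch below assumes two cut-down tables, made-up primary key values, and the convention that a table's primary key column is named <table>_id. It builds SQL from element names, so it is only suitable for trusted input.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, cv_id INTEGER, name TEXT);
INSERT INTO cv VALUES (1, 'Sequence Ontology');
INSERT INTO cvterm VALUES (7, 1, 'exon'), (8, 1, 'mRNA');
""")


def resolve(elem):
    """Return the primary key of the row identified by elem's key columns."""
    conditions, params = [], []
    for col in elem:
        conditions.append(f"{col.tag} = ?")
        if len(col):
            # The column holds a nested table element: resolve it first.
            params.append(resolve(col[0]))
        else:
            params.append((col.text or "").strip())
    sql = (f"SELECT {elem.tag}_id FROM {elem.tag} WHERE "
           + " AND ".join(conditions))
    rows = conn.execute(sql, params).fetchall()
    if len(rows) != 1:
        raise LookupError(f"{elem.tag}: expected 1 match, found {len(rows)}")
    return rows[0][0]


type_id = ET.fromstring("""
<type_id>
  <cvterm>
    <cv_id><cv><name>Sequence Ontology</name></cv></cv_id>
    <name>exon</name>
  </cvterm>
</type_id>
""")
print(resolve(type_id[0]))
```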
19 Object Reference: by Local ID
    <cv id="SO">
      <name>Sequence Ontology</name>
    </cv>
    <cvterm id="exon">
      <cv_id>SO</cv_id>
      <name>exon</name>
    </cvterm>
    <feature>
      <type_id>exon</type_id>
      ...
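The "def before ref" rule for local IDs can be checked in a single pass over the document. The following sketch assumes a hypothetical FOREIGN_KEYS table saying which columns are references and to which table; in a schema-driven tool that information would come from the schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata: (table, column) -> table the column refers to.
FOREIGN_KEYS = {
    ("cvterm", "cv_id"): "cv",
    ("feature", "type_id"): "cvterm",
}


def check_local_refs(xml_text):
    """Report local ID references that are not defined earlier in the file."""
    root = ET.fromstring(xml_text)
    defined = {}                      # local ID -> table that defined it
    problems = []
    for table in root:
        for col in table:
            target = FOREIGN_KEYS.get((table.tag, col.tag))
            if target is None or len(col) > 0:
                continue              # not a reference, or given by key values
            ref = (col.text or "").strip()
            if defined.get(ref) != target:
                problems.append(
                    f"{table.tag}.{col.tag}: '{ref}' is not a previously "
                    f"defined {target}")
        if table.get("id"):
            defined[table.get("id")] = table.tag   # definition becomes visible
    return problems


in_order = """
<chado>
  <cv id="SO"><name>Sequence Ontology</name></cv>
  <cvterm id="exon"><cv_id>SO</cv_id><name>exon</name></cvterm>
  <feature><type_id>exon</type_id></feature>
</chado>
"""
out_of_order = """
<chado>
  <cv id="SO"><name>Sequence Ontology</name></cv>
  <feature><type_id>exon</type_id></feature>
  <cvterm id="exon"><cv_id>SO</cv_id><name>exon</name></cvterm>
</chado>
"""
print(check_local_refs(in_order))
print(check_local_refs(out_of_order))
```

Requiring the definition first lets a loader resolve every reference as it streams through the file, without a second pass.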
20 Object Reference: by Global Accession
    <feature_relationship>
      <subjfeature_id>GBg012345
      </subjfeature_id>
      ...
21 Transactions
- Lookup: <table op="lookup">...
- Insert: <table op="insert">...
- Delete
    <table op="delete">
      <keycol1>val1</keycol1>
- Update
    <table op="update">
      <keycol1>val1</keycol1>
      <keycol2>val2</keycol2>
      <keycol1>newval</keycol1>
- Force: <table op="force">...
  - Combination of lookup, insert and update
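The five operations can be sketched as one dispatch function over a database connection. This is an illustration of the semantics as the slide describes them, not the XORT loader: apply_op, its arguments, and the error behavior are assumptions, and the SQL is assembled from table and column names, so it should only see trusted input.

```python
import sqlite3


def apply_op(conn, table, op, key, new=None):
    """Apply one op to the row identified by key (a dict of key columns).

    new holds the column values to set on insert/update. Returns the
    operation actually carried out, which differs from op only for "force".
    """
    new = new or {}
    where = " AND ".join(f"{col} = ?" for col in key)
    params = tuple(key.values())
    exists = conn.execute(
        f"SELECT 1 FROM {table} WHERE {where}", params
    ).fetchone() is not None

    if op == "force":
        # Combination of lookup, insert and update.
        op = "insert" if not exists else ("update" if new else "lookup")

    if op in ("lookup", "update", "delete") and not exists:
        raise LookupError(f"no {table} row matches {key}")

    if op == "insert":
        row = {**key, **new}
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                     tuple(row.values()))
    elif op == "update":
        if not new:
            raise ValueError("update needs at least one new value")
        sets = ", ".join(f"{col} = ?" for col in new)
        conn.execute(f"UPDATE {table} SET {sets} WHERE {where}",
                     tuple(new.values()) + params)
    elif op == "delete":
        conn.execute(f"DELETE FROM {table} WHERE {where}", params)
    elif op != "lookup":
        raise ValueError(f"unknown op: {op}")
    return op


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (uniquename TEXT PRIMARY KEY, name TEXT)")
key = {"uniquename": "CG3665"}
first = apply_op(conn, "feature", "force", key, {"name": "old"})
second = apply_op(conn, "feature", "force", key, {"name": "new"})
print(first, second)
```

Running the same force twice is safe: the first call inserts and the second updates, which makes a load file re-runnable.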
22 Dumper: XML-driven extraction
- Default behavior: given an object class and ID, dump all direct values and linktables, with refs to foreign keys.
- Nondefault behavior specified by XML dumpspecs using same DTD with a few additions
  - attribute dump: all | cols | select | none
  - attribute test: yes | no
  - element OR
  - element _sql
  - element _appdata
- Workaround with views, _sql
- Current use cases
  - Dump a gene for a gene detail page
  - Dump a scaffold for Apollo
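The default dump behavior (direct values plus link tables for one object) can be sketched as a short recursive function. The metadata dictionaries, the cut-down tables, the pkey and pval column names, and the sample rows below are all assumptions for illustration; a schema-driven dumper would derive the metadata from the DDL. References to foreign keys other than the joining column are left out for brevity.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata, normally derived from the schema.
PRIMARY_KEY = {"feature": "feature_id", "featureprop": "featureprop_id"}
LINK_TABLES = {"feature": [("featureprop", "feature_id")]}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT, name TEXT);
CREATE TABLE featureprop (featureprop_id INTEGER PRIMARY KEY,
                          feature_id INTEGER, pkey TEXT, pval TEXT);
INSERT INTO feature VALUES (1, 'CG3665', NULL);
INSERT INTO featureprop VALUES (1, 1, 'note', 'first'), (2, 1, 'note', 'second');
""")


def dump_row(table, row_id, skip=None):
    """Dump one row: its direct values, then the rows of its link tables."""
    pk = PRIMARY_KEY[table]
    cur = conn.execute(f"SELECT * FROM {table} WHERE {pk} = ?", (row_id,))
    names = [d[0] for d in cur.description]
    values = cur.fetchone()
    if values is None:
        raise LookupError(f"no {table} row with {pk} = {row_id}")

    elem = ET.Element(table)
    for name, value in zip(names, values):
        # Primary key and joining column stay implicit; NULLs are omitted.
        if name in (pk, skip) or value is None:
            continue
        ET.SubElement(elem, name).text = str(value)

    for link_table, fk in LINK_TABLES.get(table, []):
        link_pk = PRIMARY_KEY[link_table]
        ids = conn.execute(
            f"SELECT {link_pk} FROM {link_table} WHERE {fk} = ? "
            f"ORDER BY {link_pk}", (row_id,)).fetchall()
        for (link_id,) in ids:
            elem.append(dump_row(link_table, link_id, skip=fk))
    return elem


dumped = ET.tostring(dump_row("feature", 1), encoding="unicode")
print(dumped)
```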
23
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE chado SYSTEM "/users/zhou/work/flybase/xml/chado_stan.dtd">
<!-- 1. dump all information for gene CG9570 and all information for transcript, all for translation, for feature_evidence, dump all cols of foreign object featureloc -->
<chado>
<feature dump="all">
  <uniquename test="yes"><or>CG3665</or><or>CG3139</or><or>CG3497</or></uniquename>
  <!-- get all mRNA of those gene -->
  <feature_relationship dump="all">
    <subjfeature_id test="yes">
      <feature><type_id><cvterm><name>mRNA</name></cvterm></type_id></feature>
    </subjfeature_id>
    <subjfeature_id>
      <feature dump="all">
        <!-- get all exon of those mRNA -->
        <feature_relationship dump="all">
          <subjfeature_id test="yes">
            <feature><type_id><cvterm><name>exon</name></cvterm></type_id></feature>
          </subjfeature_id>
          <subjfeature_id>
            <feature dump="all">
              <!-- feature_evidence for exon, type of evidence is either alignment_hit or alignment_hsp -->
              <feature_evidence dump="no_dump"></feature_evidence>
              <!-- feature_evidence for exon, type of evidence is neither alignment_hit nor alignment_hsp -->
              <feature_evidence dump="no_dump"></feature_evidence>
              <scaffold_feature dump="no_dump" />
            </feature>
          </subjfeature_id>
        </feature_relationship>
        <!-- get all protein of those mRNA -->
        <feature_relationship dump="all">
          <subjfeature_id test="yes">
            <feature><type_id><cvterm><name>protein</name></cvterm></type_id></feature>
          </subjfeature_id>
          <subjfeature_id>
            <feature dump="all">
              <!-- feature_evidence for protein, type of evidence is either alignment_hit or alignment_hsp -->
              <feature_evidence dump="no_dump"></feature_evidence>
              <!-- feature_evidence for protein, type of evidence is neither alignment_hit nor alignment_hsp -->
              <feature_evidence dump="no_dump"></feature_evidence>
              <scaffold_feature dump="no_dump" />
            </feature>
          </subjfeature_id>
        </feature_relationship>
        <feature_relationship dump="all">
          <subjfeature_id test="yes">
            <feature><type_id><cvterm><name test="no"><or>protein</or><or>exon</or></name></cvterm></type_id></feature>
          </subjfeature_id>
          <subjfeature_id>
            <feature dump="all">
              <!-- feature_evidence for feature neither protein nor exon, type of evidence is either alignment_hit or alignment_hsp -->
              <feature_evidence dump="no_dump">
              ...
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE chado SYSTEM "/users/zhou/work/flybase/xml/chado_stan.dtd">
<!-- 1. dump all information for gene CG9570 and all information for transcript, all for translation, for feature_evidence, dump all cols of foreign object featureloc -->
<chado>
  ...
  <feature>
    <uniquename test="yes"><or>CG3665</or><or>CG3139</or><or>CG3497</or></uniquename>
    <!-- get all mRNA of those genes -->
    <feature_relationship dump="all">
      <subjfeature_id test="yes">
        <feature>
          <type_id>mRNA</type_id>
        </feature>
      </subjfeature_id>
      <subjfeature_id>
        <feature>
          <!-- get all exons of those mRNA -->
          <feature_relationship>
            <subjfeature_id test="yes">
              <feature>
                <type_id>exon</type_id>
                ...
24 Chado <-> Apollo Interaction
[Figure: round trip between Chado and GAME XML. Chado <-> Chado XML via XML Dumper and XML Loader, validator; Chado XML <-> GAME XML via chadx2game and game2chadx.]
25 Dumper Concerns
- Expressivity
- Speed
- XML file size
- Memory
26 What's next
- Debug Apollo / chado roundtrip
- CV issues
- Hierarchical queries
- SO compliance
- feature relationship types
- Schema extensions
- genetics module - review in Fall
- expression?
- UI development
27 Architectural Principles
- Semi-permeable XML layer
- Fix mapping, let schema vary
- Plan for schema evolution -- schema-driven tools
- Coarse-grained coupling of modules made possible by XML standardization
28 Credits
- Pinglei Zhou - loader, dumper, XML design
- Frank Smutniak - game2chadx, chadx2game
- Colin Wiel - gadfly2chado migration, schema
- David Emmert - schema, migration
- Chris Mungall, Suzi Lewis - schema, SO
- Stan Letovsky - XML/tool design, dtd generator
- Susan Russo, Mark Z - PostgreSQL
- Don Gilbert - XML customer
- Scott Cain - GBrowse/Chado
- Allen Day - schema (expression)
- Hilmar Lapp - ROI query optimization
- ...