Title: Databases
1Databases
2Database Federation
3Definitions
- Data warehouse
- aggregation and summation of data sources
- used to quickly answer very specific questions
- Database Federation
- richer schema reflecting more of source schemas
- optimized for general rather than specific queries
4Federation Considerations
- Advantages
- One-stop shopping
- single query (and query language)
- multiple resources brought together
- Automatic result aggregation
- Technical complexities
- Updating from source databases
- Unifying disparate schemas
5Designing a Data Federation
- Concrete vs. Virtual
- How is federation to take place?
- Data Sources
- Are sources curated?
- Include boutique databases?
- Schema Integration
- How do schemas map to federated schema?
6Concrete Database Federation
7Virtual Database Federation
Query Gateway
8Comparison
Concrete
- Pros
- quicker
- single schema
- Cons
- bulkier
- maintenance issues
Virtual
- Pros
- less disk intensive
- less maintenance
- Cons
- slower
- fault tolerance needed
9Data Sources
- Curated databases increase the level of trust you
can have in the data.
- Uncurated databases often have the latest and
greatest data.
- Boutique databases are small, specialty databases
that may add a modicum of knowledge.
10Schema Integration
- Decide what data is important to federate
- Decide what fields correspond to other fields
- Decide how to merge duplicate records (and decide
what is a duplicate record).
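The three decisions above can be sketched in a few lines of Python. This is only an illustration: the source names, field names, and the duplicate rule (two records are duplicates when they share an accession) are hypothetical and not taken from any particular federation.

```python
# Hypothetical field mappings: source field -> federated field.
# Fields that are not listed are treated as "not important to federate".
FIELD_MAP = {
    "source_a": {"acc": "accession", "desc": "description", "org": "organism"},
    "source_b": {"id": "accession", "definition": "description"},
}


def to_federated(source, record):
    """Rename a source record's fields to the federated schema, dropping unmapped fields."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def merge_duplicates(records, key="accession"):
    """Merge records sharing the same key; the first non-empty value for each field wins."""
    merged = {}
    for rec in records:
        target = merged.setdefault(rec[key], {})
        for field, value in rec.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


records = [
    to_federated("source_a", {"acc": "X1", "desc": "", "org": "fly", "internal": 7}),
    to_federated("source_b", {"id": "X1", "definition": "a kinase"}),
]
federated = merge_duplicates(records)
```

Here the two source records collapse into one federated record that takes its organism from the first source and its description from the second. A real federation would need a more careful duplicate rule than a shared key.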
11Concrete Federation: IGD
- Integrated Genome Database
- federated genomic data from multiple genome
databases
- used ACEDB as db engine
- driven by most genome projects using ACEDB
- ran out of room
- too many records used up all the inodes on system
- slow and clunky
12Virtual Federation: ENQUire
- Extensible Network Query Unifier
- WWW-based front end
- single generic query split into db-specific queries
and sent out over network
- results merged according to type (genomic,
sequence, etc.)
- used as basis for NCSA Biology Workbench
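The query-splitting idea can be sketched as follows. This is a minimal illustration, not ENQUire's actual design: the two "databases" are hypothetical local functions standing in for remote services, and the query dialects are invented.

```python
from collections import defaultdict


# Hypothetical stand-ins for remote databases; a real gateway would send
# these queries over the network instead of calling local functions.
def search_genome_db(query):
    return [{"type": "genomic", "id": "G1", "match": query["term"]}]


def search_sequence_db(query):
    return [{"type": "sequence", "id": "S9", "match": query["keyword"]}]


# Each source is paired with a translator from the generic query to its own dialect.
SOURCES = [
    (lambda q: {"term": q["text"]}, search_genome_db),
    (lambda q: {"keyword": q["text"].upper()}, search_sequence_db),
]


def federated_search(generic_query):
    """Split one generic query into source-specific queries and merge results by type."""
    merged = defaultdict(list)
    for translate, search in SOURCES:
        try:
            hits = search(translate(generic_query))
        except Exception:
            # Fault tolerance: a failing source should not sink the whole query.
            continue
        for hit in hits:
            merged[hit["type"]].append(hit)
    return dict(merged)


results = federated_search({"text": "adh"})
```

Skipping a source that raises an error is one simple way to get the fault tolerance a virtual federation needs. A production gateway would more likely add timeouts and report which sources were unavailable.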
13Semi-Virtual Federation: SRS
- Sequence Retrieval System
- federated databases stored locally, with a query
integration engine acting as middleware
- good performance (no network hit)
- simplified maintenance (native format dbs)
- basis for Lion Biology Analysis Environment
- Similar to DBGET
14MOBY
- Model Organism Bring Your (Own Schema)
- MOBY is a system through which a client will be
able to interact with multiple sources of
biological data regardless of the underlying
format or schema. The system also allows for the
dynamic identification of new relationships
between data from different sources.
16
- "A classification is not a neutral hat rack; it
expresses a theory of relationships that controls
our concepts."
- Stephen Jay Gould
- Ever Since Darwin
21Ontologies
- "An ontology is a specification of a
conceptualization."
- In other words, a hierarchical mapping of
concepts within a given frame of reference.
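A hierarchical mapping of concepts can be represented as a table of is_a links. The sketch below uses a toy hierarchy whose concept names are illustrative only, and walks upward from a concept to collect everything it is a kind of.

```python
# Toy hierarchy (child -> parents); the concept names are illustrative only.
IS_A = {
    "glucose dehydrogenase activity": ["oxidoreductase activity"],
    "oxidoreductase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(concept, is_a=IS_A):
    """Return every concept reachable from `concept` by following is_a links."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


lineage = ancestors("glucose dehydrogenase activity")
```

Because each concept maps to a list of parents, the same function also works when a concept has more than one parent.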
22Gene Ontology
- The goal of the Gene Ontology (GO) Consortium is
to produce a controlled vocabulary that can be
applied to all organisms even as knowledge of
gene and protein roles in cells is accumulating
and changing.
- http://www.geneontology.org/
23What GO Is
- A collaborative effort to address the need for
consistent descriptions of gene products in
different databases
- Three structured, controlled vocabularies
(ontologies) that describe gene products in a
species-independent manner
- Uniform query facilitator
24What GO Is Not
- GO is not a database of gene sequences, nor a
catalog of gene products. Rather, GO describes
how gene products behave in a cellular context.
- GO is not a 'federated solution'. Sharing
vocabulary is a step towards unification, but is
not, in itself, sufficient. Reasons for this
include the following:
- Knowledge changes and updates lag behind.
- Individual curators evaluate data differently.
- GO does not attempt to describe every aspect of
biology.
25GO Categories
- Molecular Function Ontology
- the tasks performed by individual gene products;
examples are carbohydrate binding and ATPase
activity
- Biological Process Ontology
- broad biological goals, such as mitosis or purine
metabolism, that are accomplished by ordered
assemblies of molecular functions
- Cellular Component Ontology
- subcellular structures, locations, and
macromolecular complexes; examples include
nucleus, telomere, and origin recognition complex
26GO Structure
27
<go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="0">
  <go:accession>GO:0008708</go:accession>
  <go:name>glucose dehydrogenase activity</go:name>
  <go:definition>Catalysis of the reaction: D-glucose + acceptor = D-glucono-1,5-lactone + reduced acceptor.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
  <go:dbxref rdf:parseType="Resource">
    <go:database_symbol>EC</go:database_symbol>
    <go:reference>1.1.99.-</go:reference>
  </go:dbxref>
</go:term>
28
<go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="4">
  <go:accession>GO:0008708</go:accession>
  <go:name>glucose dehydrogenase activity</go:name>
  <go:definition>Catalysis of the reaction: D-glucose + acceptor = D-glucono-1,5-lactone + reduced acceptor.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
  <go:dbxref rdf:parseType="Resource">
    <go:database_symbol>EC</go:database_symbol>
    <go:reference>1.1.99.-</go:reference>
  </go:dbxref>
  <go:association rdf:parseType="Resource">
    <go:evidence evidence_code="ISS">
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>FB</go:database_symbol>
        <go:reference>FBrf0141274</go:reference>
      </go:dbxref>
    </go:evidence>
    <go:gene_product rdf:parseType="Resource">
      <go:name>CG9517</go:name>
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>fb</go:database_symbol>
        <go:reference>FBgn0030591</go:reference>
      </go:dbxref>
    </go:gene_product>
  </go:association>
</go:term>
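A term record of this shape can be read with Python's standard library. The sketch below rests on assumptions: the slide shows only a fragment, so the wrapping go:go element and the "go" namespace URI are added here just to make the fragment parseable, and the record is abbreviated.

```python
import xml.etree.ElementTree as ET

# The slide shows only a fragment, so the wrapper element and the "go"
# namespace URI below are assumptions added to make it parseable.
GO_NS = "http://www.geneontology.org/dtds/go.dtd#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = f"""
<go:go xmlns:go="{GO_NS}" xmlns:rdf="{RDF_NS}">
  <go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="4">
    <go:accession>GO:0008708</go:accession>
    <go:name>glucose dehydrogenase activity</go:name>
    <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
    <go:association rdf:parseType="Resource">
      <go:evidence evidence_code="ISS" />
      <go:gene_product rdf:parseType="Resource">
        <go:name>CG9517</go:name>
      </go:gene_product>
    </go:association>
  </go:term>
</go:go>
"""

ns = {"go": GO_NS, "rdf": RDF_NS}
root = ET.fromstring(DOC)
terms = []
for term in root.findall("go:term", ns):
    terms.append({
        "accession": term.findtext("go:accession", namespaces=ns),
        "name": term.findtext("go:name", namespaces=ns),
        # Keep only the accession after '#' in each is_a resource URI.
        "is_a": [e.get(f"{{{RDF_NS}}}resource").split("#")[-1]
                 for e in term.findall("go:is_a", ns)],
        # Names of the gene products associated with this term.
        "gene_products": [gp.findtext("go:name", namespaces=ns)
                          for gp in term.findall("go:association/go:gene_product", ns)],
    })
```

Each parsed term then carries its accession, name, parent accessions, and the names of its associated gene products, which is the information the two slides contrast (a bare term versus a term with associations).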
29GO Relational Schema
30Data Mining in Gene Expression
- Levels of data mining in gene expression studies
- Classification studies through pattern discovery
- Knowledge discovery through linkage with a priori
biological knowledge
- Hypotheses generation
32GO Agent
- Genes of similar function are often co-expressed.
- Our GO agent mines the GO database and creates
a knowledge store of functions for each of our
gene expression clusters.
- Cluster (1,3) has 14 genes, 5 of which are
characterized by GO.
- Another agent then groups and reports mined
information.
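The kind of per-cluster summary such an agent produces can be sketched as below. This is not the authors' agent: the gene IDs, cluster contents, and annotation table are hypothetical, and a real agent would query the GO database rather than a dictionary.

```python
from collections import Counter

# Hypothetical inputs: gene IDs per expression cluster, and GO term names per gene.
clusters = {(1, 3): ["g1", "g2", "g3", "g4"]}
annotations = {
    "g1": ["ATPase activity"],
    "g2": ["ATPase activity", "carbohydrate binding"],
}


def summarize_cluster(genes, annotations):
    """Count how many genes are annotated and how often each GO term occurs."""
    annotated = [g for g in genes if g in annotations]
    term_counts = Counter(t for g in annotated for t in annotations[g])
    return {
        "genes": len(genes),
        "annotated": len(annotated),
        "terms": term_counts.most_common(),
    }


report = {cid: summarize_cluster(genes, annotations) for cid, genes in clusters.items()}
```

For this toy input, cluster (1,3) has 4 genes, 2 of which are annotated, with "ATPase activity" as the most frequent term. Ranking terms by frequency is one simple way to suggest a shared function for co-expressed genes.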
33Visualization of GO Terms