Title: Biological Databases
1Biological Databases
- Asif Jan
- Brain Mind Institute
- EPFL
2Introduction
- The purpose of a biological experiment is to understand the workings of biological systems such as the brain and cells.
- The aim is to study the interplay of the structural, chemical and electrical signals that give rise to natural and disease processes.
- The data is acquired from different angles to serve different research purposes, e.g. using different animal models, physiological approaches, anatomical approaches and measurements of protein activity levels.
3Sequence
- Biological Data - Types and Constraints
- Neocortical Microcircuit Data
- Databasing Neocortical Microcircuit
- Conclusion and Discussion
4Part I
5Biological Databases- Challenges
- A great deal of diversity in the data types
- Unconventional and ad hoc query requirements
- Ubiquitous uncertainty in the data
- Requirements for data curation
- Need for detailed Data annotations
- A need for large scale data integration
- Non-availability of universal taxonomy
- Support for rapid Schema evolution
- Temporal Data Management
- (Ref: Data Management for Molecular and Cell Biology - Workshop report, www.lbl.gov/olken/wdmbio)
6Data Types (1/2)
- Sequences
- DNA, RNA and amino-acid sequences (proteins). The data has grown enormously due to the availability of automated sequencing machines and large-scale sequencing projects such as the human and mouse genomes.
- Graphs
- Biological pathways such as metabolic pathways, gene regulatory networks, 3D protein structures
- High Dimensional Data
- Micro-array experiments (thousands of genes, hundreds of experimental conditions), clustering studies on genes etc.
- Shapes
- 3D molecular structural data augmented by chemical distribution, 3D cell morphology data
7Data Types (2/2)
- Temporal Data
- Useful for studying the dynamics of biological systems, e.g. electrophysiology recordings, developmental biology, protein structure dynamics, cellular structure dynamics etc.
- Model Data
- Representation of biological phenomena as computational, mathematical and statistical models used for parameter estimation, testing etc. Models should also be represented and stored in a queryable format.
- Scalar and Vector Fields
- Charge distribution across the cell surface, calcium and protein fluxes across the cell surface etc.
- Extracted Features Data
- Numerical data extracted from one or more of the above data types
8Ad hoc Query Requirements
- Biologists understand the relations across different data types, but these relations are not necessarily obvious from the database point of view. For example:
- Two labs: one studies dendritic spines of pyramidal cells (PC) in the hippocampus, its primary schema elements being the anatomical entities (dendrites etc.) reconstructed from 3D serial sections. The other studies Purkinje cells in the cerebellum: the branching patterns of the neurons' dendrites and protein localization in various compartments.
- Thus a researcher modeling the effects of neurotransmission in hippocampal spines would get structural information from lab 1 and information on calcium-binding proteins found in spines from lab 2.
- Assumptions: like PC, Purkinje cells also possess dendritic spines; release of calcium in spiny dendrites occurs as a result of neurotransmission and causes changes in spine morphology; propagation of calcium signals throughout the neuron depends on the morphology of the dendrites.
- Ref: Ludascher B, Gupta A and Martone M, Model Based Mediator System for Scientific Data Management
9Uncertainty in the data
- Biological data carries a great deal of uncertainty, as it represents biological phenomena that are observed and assumed (based on some evidence) to be true.
- For example, the spiking behavior of a cell under specific stimuli, or a protein sequence in a protein database that is based on a partial protein report.
- The uncertainty must also be modeled and recorded as part of the data, as it has consequences for subsequent usage of the data.
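One way to record uncertainty as part of the data is to attach an evidence note and a confidence value to each stored observation. The sketch below is only an illustration: the `Observation` class, its field names, the 0 to 1 confidence scale and the example records are all assumptions, not part of any system described in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A stored value together with the uncertainty attached to it.

    All field names and the 0..1 confidence scale are hypothetical.
    """
    entity: str        # what was observed, e.g. a cell or protein identifier
    value: str         # the observed or assumed fact
    evidence: str      # why the value is believed to be true
    confidence: float  # 0.0 (pure assumption) .. 1.0 (directly observed)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def usable(observations, min_confidence):
    """Keep only observations that meet a caller-chosen confidence floor."""
    return [o for o in observations if o.confidence >= min_confidence]


# Illustrative records only; the values are made up for the example.
records = [
    Observation("cell-001", "regular spiking", "full recording available", 0.9),
    Observation("protein-X", "sequence inferred", "partial protein report", 0.4),
]
reliable = usable(records, min_confidence=0.5)
print([o.entity for o in reliable])
```

Keeping the evidence and confidence next to the value lets each downstream user choose a threshold, instead of the database silently deciding what counts as true.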
10Requirements for data curation
- The data is collected across different structural and functional boundaries, so there may be many missing links and inconsistencies (some due to a lack of core domain knowledge).
- Expert intervention is often required to cross-correlate the data, fill in missing links and/or improve data consistency.
- However, large-scale biological databases require explicit representation of uncertainties and of cross structural/functional boundaries in order to support automatic curation.
11Data Annotations
- Biological data is specific to the purpose of the individual performing the data collection.
- For example, while studying calcium regulation, researcher A might adopt a physiological approach using patch electrodes, while researcher B may take an anatomical approach, mapping different isoforms of calcium current to the structure of the organelles that express them.
- Data must be integrated across different animal models, brain regions and species. Furthermore, the experimental conditions have a great influence on the experimental results.
- This requires that the data be properly annotated during the different stages of data collection, with all conditions/parameters properly recorded. The assumptions made in the experiment should also be recorded.
- For data derived from primary data, the need for proper annotations is even greater.
12Need for large scale data integration
- It is very difficult, if not impossible, to collect information about various biological entities at a single institute or laboratory.
- Data collected from years of research, across different functional and anatomical scales, and for normal as well as disease cases, is available for use.
- Often this data is poorly annotated and inadequately structured, yet it contains precious information that cannot be ignored.
- However, the lack of a universal taxonomy or a uniform structure presents many challenges for effective utilization of this data.
- While improving the readability of the existing databases, it is imperative for new databases to adopt proper descriptions, query interfaces and annotation frameworks.
13Other Issues
- Lack of Taxonomies
- Schema Evolution
- Biological Constraints
- Data Cleaning
14Part II
- Neocortical Microcircuit Data
15The Neocortex and the Cortical Column
[Figure: cortical sheet, neocortex and a cortical column with layers I to VI]
16- Key properties of neurons and synapses
numerically represented as profiles
17Neuron Profiles: Morphology Data (m-Profile)
- Neuron reconstructed in 3D and converted to Neurolucida format
- Analyzed by a MATLAB-based tool to extract a vector of 200 values
- Values can be used to artificially rebuild neurons with specific statistical properties
- Example parameters
- TreeLengthMean (mean of lengths of segments with same order in each tree)
- IndivTreeLengthMean (mean of segment lengths in a tree)
- XY_Angle (angle between projection of a segment on XY plane and X axis)
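Two of the example parameters follow directly from segment geometry. A minimal sketch, assuming each segment is given by its 3D start and end points; the MATLAB tool and its input format are not shown in the slides, so the function names, data layout and coordinates here are hypothetical:

```python
import math


def segment_length(start, end):
    """Euclidean length of a segment given 3D start and end points."""
    return math.dist(start, end)


def indiv_tree_length_mean(segments):
    """Mean of segment lengths in one tree (cf. IndivTreeLengthMean)."""
    lengths = [segment_length(s, e) for s, e in segments]
    return sum(lengths) / len(lengths)


def xy_angle(start, end):
    """Angle in degrees between the segment's XY projection and the X axis.

    The result lies in [-180, 180]; this sign convention is an assumption.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.degrees(math.atan2(dy, dx))


# A toy tree with two segments (coordinates invented for the example).
tree = [
    ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),
    ((3.0, 4.0, 0.0), (3.0, 4.0, 1.0)),
]
print(indiv_tree_length_mean(tree))
print(xy_angle((0.0, 0.0, 0.0), (1.0, 1.0, 5.0)))
```

The Z coordinate is ignored in `xy_angle` because the parameter is defined on the projection onto the XY plane.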
19Neuron Profiles: Electrophysiology Data (e-Profile)
- Obtained by applying a series of current injections to the soma
- Response measured to obtain a spectrum (140) of electrophysiological parameters (EPs)
- Most parameters are sensitive to the ion channel composition
- Raw electrophysiological data is also stored
- Example parameters
- ADP (after-depolarization immediately following APs)
- APThreshhold (threshold of AP generation during a ramp polarization)
- SineSpectrum (various measures of frequency filtering by the neuron)
- sAHP (amplitude of hyperpolarization after a burst of APs)
21Neuron Profiles: Gene Expression Data (g-Profile)
- Obtained from single-cell multiplex RT-PCR studies and single-cell DNA microarray analyses
- Enables non-quantitative detection of expression vs. non-expression of 50 genes
- The system has been extended to conduct single-cell DNA microarray studies that screen for over 20,000 genes
23Synaptic Profiles: Morphology Data (sm-Profile)
- Describes the anatomy of a synaptic connection
- Contains information about the number of synapses, their location on the axonal and dendritic arbors of the pre- and postsynaptic neurons, and axonal and dendritic geometric and electrotonic distances
- Examples
- Axonal Branch Order (number of branch points between the bouton forming the synapse and the soma of the source neuron)
- Dendritic Branch Order (the location of the synapse along the dendritic arbor according to the branching frequency of the dendritic tree)
- Geometrical Distance (the distance along the dendrite from the synaptic location to the postsynaptic soma)
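Axonal branch order, as defined above, amounts to counting branch points on the path from the synapse back to the soma. A minimal sketch, assuming the arbor is stored as a child-to-parent map; this representation and the node names are hypothetical and not taken from the slides:

```python
from collections import Counter


def branch_order(parent, node, soma="soma"):
    """Count branch points strictly between `node` and `soma`.

    `parent` maps each node to its parent node. A branch point is a node
    on the path that has more than one child. The soma is not counted.
    """
    n_children = Counter(parent.values())
    order = 0
    current = parent[node]
    while current != soma:
        if n_children[current] > 1:
            order += 1
        current = parent[current]
    return order


# Toy arbor: soma -> a; a -> b, c; b -> d, e
arbor = {"a": "soma", "b": "a", "c": "a", "d": "b", "e": "b"}
print(branch_order(arbor, "e"))
print(branch_order(arbor, "c"))
```

Whether the soma itself should count as a branch point is a convention; this sketch excludes it, following the wording "between the bouton and the soma".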
25Synaptic Profiles: Electrophysiological Data (se-Profile)
- Characterized in terms of biophysical and dynamic properties
- The biophysical properties focus on the amplitudes, latencies, rise and decay times, synaptic conductances, synaptic charge transfer, etc.
- The dynamic properties include the time constants governing the rates of recovery from synaptic depression (D) and facilitation (F), as well as the absolute and effective utilization of synaptic efficacy parameters
- Other parameters include estimates of the probability of release and the number of functional release sites
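The slides do not name the model behind the parameters D, F and utilization. One commonly used phenomenological description of short-term depression and facilitation is a Tsodyks-Markram-style model; the sketch below assumes a simple variant of that formulation purely for illustration, and all parameter values are invented.

```python
import math


def synaptic_responses(spike_times, A, U, D, F):
    """Relative response amplitudes for a presynaptic spike train.

    Assumed model (not confirmed by the slides):
      u  utilization of synaptic efficacy; decays with time constant F
         between spikes and is incremented by U * (1 - u) at each spike
      R  fraction of available resources; recovers to 1 with time constant D
    A is the absolute synaptic efficacy. Times and constants share one unit.
    """
    u, R = U, 1.0
    amplitudes = [A * u * R]
    for prev, t in zip(spike_times, spike_times[1:]):
        dt = t - prev
        # Resources left after the previous spike recover towards 1.
        R = R * (1.0 - u) * math.exp(-dt / D) + 1.0 - math.exp(-dt / D)
        # Utilization decays, then is incremented by the new spike.
        decayed = u * math.exp(-dt / F)
        u = decayed + U * (1.0 - decayed)
        amplitudes.append(A * u * R)
    return amplitudes


spikes = [0.0, 50.0, 100.0]  # ms, invented
depressing = synaptic_responses(spikes, A=1.0, U=0.5, D=800.0, F=10.0)
facilitating = synaptic_responses(spikes, A=1.0, U=0.1, D=10.0, F=1000.0)
print(depressing)
print(facilitating)
```

With a long D and short F the successive responses shrink (depression); with a short D and long F they grow (facilitation), which is the behavior these time constants are meant to capture.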
27Synaptic Profiles: Pharmacological Data (sp-Profile)
- Contains information describing the sensitivity of the synaptic connection to different chemicals
- Described in terms of the synapse's response to various blockers, agonists and antagonists
- Commonly used chemicals are
- bicuculline (GABA-a antagonist)
- APV (NMDA receptor antagonist)
- CNQX (AMPA receptor antagonist)
- CGP 35348 (GABA-b antagonist)
- NMDA (NMDA receptor agonist)
- diazepam (GABA-a facilitator)
28Additional Data
- General Data (g-Profile)
- Animal Information
- Brain Region Information
- Experiment Information
- Model Data (mod-Profile)
- A complete NEURON model that will include
- active properties by inclusion of ion channel constellations and parameters
- electrical properties of the neurons
- possible ion channel constellations
- Canonical Data (x-Profile)
- statistical analysis of stored neurons and synapses
- these simplified models to be used for visualization and simulation etc.
29Part III
- Databasing Neocortical Microcircuit
30Blue Brain Project - Goals
- Gather and share raw data about different facets of the neurons
- Develop biologically accurate models of neurons and their interactions
- Obtain a biologically accurate simulation of the cortical column
- Develop visualization tools on the simulation to perform in silico biology studies
31Blue Brain Data Usage
[Diagram: data flow between Experiments, Database, Models, Simulation and Visualization. Arrow labels: "produces data", "is stored", "are used for", "test against experimental data", "produces statistics on experiments"]
32Overview of current system
33Storage Resource Broker
- Distributed data storage
- Uniform access to a variety of data sources
- Intuitive file system-like interface to the data
- Ability to annotate data with metadata
- Data access management
- Security management
34SRB Federation architecture
35SRB Server Architecture
36The Metadata Catalog (MCAT)
- Stores and manages the information about an SRB system
- Physical and connection information about the data sources
- User information and privileges
- Logical and physical mapping to the files
- Files metadata
- System metadata
- User-defined metadata
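The catalog's role can be pictured as a lookup from a logical file name to its physical location, plus metadata attached to that file. The following is a toy in-memory sketch of that idea only. It does not use or imitate the real SRB or MCAT interfaces, and every name in it (`ToyCatalog`, `register`, `annotate`, `resolve`, `find`, the paths, resource names and metadata values) is hypothetical.

```python
class ToyCatalog:
    """In-memory stand-in for a metadata catalog (not the SRB API)."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_path, resource, physical_path):
        """Map a logical name to a storage resource and physical location."""
        self._entries[logical_path] = {
            "resource": resource,
            "physical_path": physical_path,
            "metadata": {},
        }

    def annotate(self, logical_path, **metadata):
        """Attach user-defined key/value metadata to a registered file."""
        self._entries[logical_path]["metadata"].update(metadata)

    def resolve(self, logical_path):
        """Return (resource, physical_path) for a logical name."""
        entry = self._entries[logical_path]
        return entry["resource"], entry["physical_path"]

    def find(self, **criteria):
        """Logical names whose metadata match all given key/value pairs."""
        return sorted(
            path
            for path, entry in self._entries.items()
            if all(entry["metadata"].get(k) == v for k, v in criteria.items())
        )


# Invented example entries.
catalog = ToyCatalog()
catalog.register("/bbp/morph/n001.asc", "disk-A", "/vol1/0001.asc")
catalog.register("/bbp/ephys/n001.dat", "disk-B", "/vol7/0042.dat")
catalog.annotate("/bbp/morph/n001.asc", layer="V", species="rat")
catalog.annotate("/bbp/ephys/n001.dat", layer="V", protocol="ramp")
print(catalog.resolve("/bbp/morph/n001.asc"))
print(catalog.find(layer="V"))
```

Users work only with logical names and metadata queries; where a file physically lives can change without affecting them.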
37Current Database Infrastructure
- Uses SRB for storing the primary data
- Customized tools for data upload, annotation and download
- Metadata about primary data is stored in the metadata catalog
- Secondary databases storing morphology, electrophysiology and gene expression data
- Specialized database for storing extracted features
38Current Set of Databases
- Morphology Data - structure of neurons
- Electrophysiology Data - recordings from patch clamp system, response of neurons to stimuli
- Gene Expression Data - expression of specific genes. Started to collect DNA microarrays
- Feature DB - primary features extracted from electrophysiology data
- Synapse DB - properties of synaptic connections, release probabilities, conductances etc.
- Index DB - a central bookkeeping and synchronization system
- Microcircuit DB - a description of the microcircuit with 10,000 neurons and 2 million connections
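The bookkeeping role of the Index DB can be sketched as a table that records, per neuron, where its entries live in the other databases. The schema below is an assumption made for illustration (table name, column names and keys are invented), and it uses an in-memory SQLite database rather than the project's actual infrastructure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per (neuron, database) pair.
    CREATE TABLE index_db (
        neuron_id  TEXT NOT NULL,
        db_name    TEXT NOT NULL,   -- e.g. 'morphology', 'electrophysiology'
        record_key TEXT NOT NULL,   -- key of the record inside that database
        PRIMARY KEY (neuron_id, db_name)
    );
""")
conn.executemany(
    "INSERT INTO index_db VALUES (?, ?, ?)",
    [
        ("n001", "morphology", "morph-17"),
        ("n001", "electrophysiology", "ep-42"),
        ("n002", "morphology", "morph-18"),
    ],
)


def databases_missing(conn, neuron_id, expected):
    """Return the expected databases that hold no record for this neuron."""
    rows = conn.execute(
        "SELECT db_name FROM index_db WHERE neuron_id = ?", (neuron_id,)
    )
    present = {name for (name,) in rows}
    return sorted(set(expected) - present)


expected = ["morphology", "electrophysiology", "gene_expression"]
print(databases_missing(conn, "n002", expected))
```

A central index of this kind makes synchronization questions, such as which neurons still lack an electrophysiology record, a single query instead of a scan over every database.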
39[Diagram of the database architecture. Labels: Morph DB, Synapse DB, EP DB, Gene Expr DB, Index DB, Feature DB, SRB System, Files, Relational DB]
40Part IV
- Conclusion and Discussions
41Reapplying challenges
- Diversity - morphology, electrophysiology, gene expression, statistics, models
- Unconventional queries - need for expert knowledge to draw relations
- Uncertainty in the data
- Need for annotations, taxonomy, metadata (a person doing experiments on PC does not record the type of cell in his lab book)
- Schema evolution
- Need for integration
42Conclusion
- A basic framework for storing neocortical microcircuit data (primary) as well as metadata
- Tools for data upload, consistency checking, data download and browsing
- Capability to store different types of data, i.e. images, recordings, ASCII files etc.
- Hierarchical database infrastructure supporting secondary and specialized databases
- Flexible structure catering for new data types
43Conclusion (in progress)
- Using a standard taxonomy and ontologies to facilitate integration within various databases, and with external databases
- Data integration across multiple databases for supporting experimentalists
- Development of efficient and user-friendly interaction environments
- Knowledge-based query environment
- Scalability issues
44Thank you