Title: BioPostgres
1BioPostgres
www.biopostgres.org
Stott Parker Ruey-Lung Hsiao UCLA Computer
Science Dept.
UCLA Center for Computational Biology (CCB)
2The Future of Science
- Evolution of science
- Increasing emphasis on
- scale
- networking
- informatics
3bScience ?
- eScience
- large-scale, data-centric, computationally
mind-numbing science http//research.microsoft.c
om/workshops/escience2005/ - the future of science (for eScientists)
enormous centralized data centers that provide
many information services - bScience
- NCBI is a good model for an eScience data center
- The importance of extensible database systems
- Jim Gray stresses the importance of extensible
databases in eScience future scientists will
spend their days writing large-scale SQL queries
(!) - Even if Jim Gray is not right about this, at some
point the scale of biological information
requires using database systems for some kind of
data management. Think about terabytes and
exabytes
4Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature
AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
Databases take work to set up, but permit
exploration -- asking and quickly getting
answers to questions. The join is a large-scale
information connection operator that is both very
basic and very annoying to program.
5But Biologists dont like databases
- The main programming model is Perl, not SQL
- Databases have negative associations
- Inflexibility
- Quirkiness
- Possible slowness
- Possible expense !
- Operations often must be done outside the
database SQL is not usually enough - Steep learning curve
6Why do DBMS get an F in Biology?
- There arent that many choices for DBMS
- and for open-source DBMS there are few
- DBMS were designed for business, not science
- hard-to-change database schemas
- limited set of data types
- peculiarities of the SQL query language
- quirks of query optimization
- arcane programming models
- Strong challenges are inherent in bioscience
- very large scale
- diverse types of information
- extremely complex analytical queries
7Amazingly Few Large-Scale DBMS Options
- Commercial DBMS
- IBM DB2/DiscoveryLink
- Microsoft SQL Server
- Oracle 10g
- Open-Source DBMS
- MySQL
- PostgreSQL
Even though these have been designed with
scalability as a key design goal, they have
scalability limits that bScience will push
8PostgreSQL? Why not just MySQL?
- MySQL and PostgreSQL are the primary scalable
open-source DBMS - Objective, feature-by-feature Comparisons
- http//troels.arvin.dk/db/rdbms/
- http//en.wikipedia.org/wiki/Comparison_of_relatio
nal_database_management_systems - More subjective, focal-issue Comparison
9Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature
AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
What is really needed here is a new datatype for
sequence locations. This query is painful to
express without this, and not painful with it.
10BioPostgres -- some modular database
infrastructure for Computational Biology
- BioPostgres PostgreSQL Extensions
- PostgreSQL an open-source, industrial-strength,
scalable DBMS http//www.postgresql.org - Extension a new SQL API, with query operators
and tools - Each Extension is a separate package
- Biosequence extensions
- GeneOntology extensions
- SQL datatype extensions
- System management extensions
- Working toward complementing other BOSC
platforms - BioPerl, BioPython, BioJava, BioSQL,
BioConductor, GMOD,
11Extensibility features of PostgreSQL
- PostgreSQL was designed specifically to be
extensible - Individual databases can be extended with
- New datatypes
- New functions
- New query operators
- New indexing methods
- New query languages
- These can be added or dropped anytime, on the fly
- dynamic linking of implementation libraries as
needed - Flexible conventions for user-contributed modules
- implementations are typically in C (like
PostgreSQL).
12Extending PostgreSQL
- Within a given database, one can add new
datatypes - A datatype can be added to all databases also
MyBioStuff (PostgreSQL database)
PostgreSQL
Graph datatype query operators
Seq Location datatype query operators
13PostgreSQL user-contributed modules
- A new module (say PostFoo) typically contains
- New functions (in PostFoo.c)
- SQL interface bindings for these functions (in
PostFoo.sql) - The PostFoo module gets downloaded by others into
their copy of the PostgreSQL source tree as a new
directory postgresql-8../cont
rib/PostFoo/ - In this directory, the source tree owner types
- gmake
- gmake install as root
- This compiles only the module (NOT PostgreSQL)!
- Afterwards anyone can dynamically add the module
to any given PostgreSQL database (say
MyBioStuff) with a command like - psql -d MyBioStuff lt PostFoo.sql
- See www.biopostgres.org/install.html
14BioPostgres Modules
- BioPostgres is a collection of modules that
extend PostgreSQL for Computational Biology.
These modules are basically independent, and can
be used separately or in conjunction with others
15Quick overview BLASTgres
- BLASTgres -- extensions for biosequence
management - Sequence location datatype (seq_id,
start,end) - Sequence location operators loc intersection,
etc. - Sequence location query find overlapping locs,
etc. - Access to BLAST services remote and local
servers - URL http//www.biopostgres.org/BLASTgres/
16Two sets of BLASTgres extensions
- 1. BLASTgres provides BLAST query, BLAST hit
database -
- SELECT FROM blast_sequence()
- 2. BLASTgres provides biosequence-related
datatypes, with accompanying query operators and
indexing methods - Sequence range (and array of range)
- 17679235..17679427
- Sequence location (and array of location)
- NT_011109.1517679235..17679427
- Hit ( high-scoring sequence alignment
information) - (in1353..966, HUMAPOE43779..4402
- 99.84, 624, 1, 0, 0, 1229)
17BLAST access via BLASTgres queries
- Simple BLAST queries
- SELECT FROM local_blast_hit(atcgatcgatcg,
lab-sequences, blastn) - SELECT FROM remote_blast_hps(lab_protein.fasta
, nr, blastp) - SELECT FROM remote_blast_hit(NM_010387.2245..5
46, nr, blastn) - SELECT FROM fasta_sequence( /lab/ests,
P082345, 20, 40 ) - 2. Annotations to BLAST query results
- SELECT count(), species FROM annotated_remote_bla
st_hit(AF101044, nr, blastn) - GROUP BY species -- automatic annotation to
BLAST query results - SELECT FROM annotated_remote_blast_hit(AF101044
, nr) - WHERE description LIKE SNRPN, and species
ltgt Homo sapiens - 3. Advanced filtering for BLAST results in SQL
- SELECT subject_location, length FROM
local_blast_hit(AF101044)
- -- sensitivity test (comparison of BLAST query
results using different parameters) - CREATE TABLE parameter1 ( name TEXT, value TEXT
) - INSERT INTO parameter1 VALUES ( WORD_SIZE, 7
) -- change the default settings - INSERT INTO parameter1 VALUES ( GAPCOSTS, 5 2
) - CREATE TABLE parameter2 ( name TEXT, value TEXT
) - INSERT INTO parameter2 VALUES ( WORD_SIZE, 13
) - INSERT INTO parameter2 VALUES ( GAPCOSTS, 3 1
) - -- retrieve matches that are not found in both
results - SELECT R1., R2. FROM remote_blast_hit_v(
AF101044, parameter1) R1, - remote_blast_hit_v( AF101044, parameter2)
R2 - WHERE (R1.subject_location
R2.subject_location) AND (R1.length ltgt R2.length
)
18BLASTgres functions
Range Operators range range range int8 int8
range range - range range ? int8 range range
range int4 int4 range range range range lt
range rangelt range range gt range range gt range
range ltlt range range lt range range range
range gt range range gtgt range range range range
ltgt range range _at_ range range range range _at_lt
range range _at_gt range Range Aggregate minmax
(range)
_range_concat(_range,_range) _range_overlaps(_rang
e,_range) _range_contains(_range,_range)
_range_contained(_range,_range)
_range_eq(_range,_range) _range_ne(_range,_range)
_range_union(_range,_range) _range_inters(_range,_
range) _range_push_elem(_range, RANGE)
_range_push_array(_range, _range)
_range_del_elem(_range, RANGE) _range_union_elem(_
range, RANGE) _range_subtract(_range, _range)
_range_contains(_range, range) _range_contains_int
erval_any(_range, range) _range_contains_interval_
all(_range, range) _range_contained_interval_any(_
range, range) _range_contained_interval_all(_range
, range) _range_overlaps_interval_any(_range,
range) _range_overlaps_interval_all(_range,
range) _range_in(cstring) _range_out(_range_key)
Location Operators loc loc loc range loc
int8 loc - loc loc - range loc ? int8 loc loc
loc range loc int4 loc loc loc ltgt loc loc
ltgt range loc lt loc loc lt range loc lt loc loc lt
range loc gt loc loc gt range loc gt loc loc gt
range loc ltlt loc loc ltlt range loc lt loc loc lt
range loc loc loc range loc gt loc loc gt
range loc gtgt loc loc gtgt range loc _at_ loc loc _at_
range loc loc loc range
Location Functions loc_range(loc) loc_seqid(loc)
loc_size(loc) loc_lower(loc) loc_upper(loc)
loc_positive_strand(loc) loc_negative_strand(loc)
loc_same_strand(loc,loc) loc_negate(loc)
loc_eq(loc, loc) loc_eq(loc, range) loc_ne(loc,
loc) loc_ne(loc, range) loc_over_left(loc, loc)
loc_over_left(loc, range) loc_over_right(loc,
loc) loc_over_right(loc, range) loc_left(loc,loc)
loc_left(loc,range) loc_right(loc,loc)
loc_right(loc,range) loc_lt(loc,loc)
loc_lt(loc,range) loc_le(loc,loc)
loc_le(loc,range) loc_gt(loc,loc)
loc_gt(loc,range) loc_ge(loc,loc)
loc_ge(loc,range)
Aggregate functions coalescing( text, text,
text, int4 ) coalescing( text, text, text)
partition(text, text, text, text, text
) revcom(text) transcribe(text)
translate(text) translate(text, int4)
range_agg_state (range, range)
range_agg_final_array (range)
range_array_aggregate ( range_array_enum(range)
loc_agg_state (loc, loc) loc_agg_final_array
(loc) loc_array_aggregate loc_array_enum(loc)
rangeset(range) rcount(_range) sort(_range,
text) sort(_range) sort_asc(_range) sort_desc(_r
ange) uniq(_range) idx(_range, range)
subarray(_range, int4, int4) subarray(_range,
int4)
Range Functions range_over_left(range, range)
range_over_right(range, range) range_left(range,
range) range_right(range, range) range_lt(range,
range) range_le(range, range) range_gt(range,
range) range_ge(range, range) range_ne(range,
range) range_inside(range,range)
range_contains(range, range) range_contained(range
, range) range_overlaps(range, range)
range_eq(range, range) range_meets(range, range)
range_met_by(range, range) range_starts(range,
range) range_started_by(range, range)
range_finishes(range, range) range_finished_by(ran
ge, range) range_same_lower(range, range)
range_same_upper(range, range) range_minus(range,
range) range_plus(range, range)
range_torange(int8, int8) range_maxmin(range,
range) range_minmax(range, range)
range_extend(range, int4) range_cmp(range, range)
range_union(range, range) range_inter(range,
range) range_size(range) range_upper(range)
range_lower(range) range_times(range,range)
range_times(range, int4) range_times(int4, range)
range_plus(range, int8) range_plus(int8, range)
range_minus(range, int8)
BLAST-related functions fasta_sequence(text,text,
int,int) fasta_sequence(text,text)
fasta_dbinfo(text) remote_blast_hit(text,text,te
xt) remote_blast_hit( text, text )
remote_blast_hit(text) remote_blast_hsp(text,text,
text) remote_blast_hsp( text, text )
remote_blast_hsp(text) local_blast_hsp(text,text
,text,text) local_blast_hsp( text, text, text )
local_blast_hit(text,text,text,text)
local_blast_hit( text, text, text )
local_ublast_hit( text, text, text )
local_ublast_hit( text, text ) local_ublast_hit(
text ) genbank_search(text)
loc_contains(loc,loc) loc_contains(loc,range)
loc_contained(loc,loc) loc_contained(range,loc)
loc_overlaps(loc,loc) loc_overlaps(loc,range)
loc_meets(loc,loc) loc_meets(loc,range)
loc_met_by(loc,loc) loc_met_by(loc,range)
loc_starts(loc,loc) loc_starts(loc,range)
loc_started_by(loc,loc) loc_started_by(loc,range)
loc_finishes(loc,loc) loc_finishes(loc,range)
loc_finished_by(loc,loc) loc_finished_by(loc,range
) loc_minus(loc,loc) loc_minus(loc, integer)
loc_minus(loc, range) loc_plus(loc,loc)
loc_plus(loc, integer) loc_plus(integer, loc)
loc_plus(loc, range) loc_times(loc,loc)
loc_times(loc, integer) loc_times(integer,loc)
loc_times(loc, range) toloc(cstring, int8, int8)
toloc(text, int8, int8) toloc(cstring, range)
toloc(text, range) loc_maxmin(loc,loc)
loc_minmax(loc,loc) loc_extend(loc,int4)
loc_maxmin(loc,range) loc_minmax(loc,range)
19Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature
AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
BLASTgres query operators X Y
range_overlaps(X,Y) X _at_ Y range_contains(X,Y)
X Y range_times(X,Y) X Y
range_plus(X,Y) etc
20Indexing for the loc range datatypes
Search keys are represented as two bounding
intervals (for leaf nodes, the lower bound is
equal to upper bound for the first bounding
intervals). The bounding interval of an internal
node contains those of its subnodes.
21GiST Indexing in BLASTgres
GiST (Generalized Index Search Tree) is the
PostgreSQL framework for user-defined
indexing. BLASTgres uses GiST indexing for
locations, ranges, location arrays, and range
arrays.
- GiST indexing requires user
- definition of four methods
- Consistent determine if a subtree traversal is
necessary. - Union merge two nodes.
- Penalty determine the cost of inserting an
entry in a node. - PickSplit determine how to split a full node
22BioPostgres Modules
- Many modules are needed! We have only just begun
to develop modules we feel are most clearly
needed. Please contact us if you have opinions
or ideas.
23Quick overview PostGraph
- PostGraph -- extensions for graph management
- Graph datatype each graph is a single data
object - Graph operators insert edge, delete node, etc.
- Graph query find connected components, etc.
- Access to graph tools dot/graphviz, prefuse,
etc. - URL http//www.biopostgres.org/PostGraph/
24GeneOntology relational database
- GO is naturally a graph, distributed as
relational tables - This representation is fine for storage but can
be awkward for exploration
SELECT id, name, term_type, acc FROM term LIMIT
7 id name
term_type acc ------------------
------------------------------------------------
- 1 all
universal all 2 is_a
relationship is_a 3
part_of gene_ontology
part_of 4 mitochondrion inheritance
biological_process GO0000001 5
mitochondrial genome maintena
biological_process GO0000002 6
reproduction
biological_process GO0000003 7 alt_id
synonym_type
alt_id SELECT FROM term2term LIMIT 7 id
relationship_type_id term1_id term2_id
complete -------------------------------------
------------------- 1
2 10 9 0 2
2 10 13 0
3 2 26 25
0 4 2 10
42 0 5 2
10 50 0 6
2 26 67 0 7
2 26 92
0
25Augmenting the GeneOntology database with a graph
database
- PostGraph permits GO terms to be stored queried
as graphs - GOtermViewer uses this to provide a Graphical
Interface for GO - These are extended in GObase www.biopostgres.org
/GObase/
26Quick overview PostMake
- PostMake -- extensions for managing data
dependencies - PostMake covers common dependencies in
bioscience - PostCron database operations at pre-specified
times (crontab within a database) - Database dump/load, tracking when things change
- Materialized views, triggered by updates
(make within a database) - PostMakefiles are translated to SQL DDL
- URL http//www.biopostgres.org/PostMake/
27BioPostgres and open-source
- Like PostgreSQL, BioPostgres is committed to
openness - BioPostgres modules are GPLed most code so far
is in C, SQL, and Java - BioPostgres modules are downloadable from
SourceForge - We think BioPostgres shows that extensible DBMS
have a lot to offer the open-source movement,
particularly in bioscience.
28Will DBMS ever get an A in Biology?
- Actually we dont know.
- We do not think DBMS will ever replace files or
programming languages in bioscience. - We think however that scalability is a driving
issue, and it will impact the effectiveness of
files and programming languages in bioscience.
And soon. - It does appear that extensibile DBMS can suggest
steps in the right direction.
29THANKYOU!