BioPostgres

About This Presentation

Title:

BioPostgres

Description:

MySQL. PostgreSQL. Even though these have been designed with scalability as a key ... Why not just MySQL? MySQL and PostgreSQL are the primary scalable open ... – PowerPoint PPT presentation

Number of Views:88

Avg rating:3.0/5.0

Slides: 30

Provided by: IvoD9

Category:

more less

Transcript and Presenter's Notes

Title: BioPostgres

1
BioPostgres
www.biopostgres.org
Stott Parker Ruey-Lung Hsiao UCLA Computer
Science Dept.
UCLA Center for Computational Biology (CCB)
2
The Future of Science

Evolution of science
Increasing emphasis on
scale
networking
informatics

3
bScience ?

eScience
large-scale, data-centric, computationally
mind-numbing science http//research.microsoft.c
om/workshops/escience2005/
the future of science (for eScientists)
enormous centralized data centers that provide
many information services
bScience
NCBI is a good model for an eScience data center
The importance of extensible database systems
Jim Gray stresses the importance of extensible
databases in eScience future scientists will
spend their days writing large-scale SQL queries
(!)
Even if Jim Gray is not right about this, at some
point the scale of biological information
requires using database systems for some kind of
data management. Think about terabytes and
exabytes

4
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
Databases take work to set up, but permit
exploration -- asking and quickly getting
answers to questions. The join is a large-scale
information connection operator that is both very
basic and very annoying to program.
5
But Biologists dont like databases

The main programming model is Perl, not SQL
Databases have negative associations
Inflexibility
Quirkiness
Possible slowness
Possible expense !
Operations often must be done outside the
database SQL is not usually enough
Steep learning curve

6
Why do DBMS get an F in Biology?

There arent that many choices for DBMS
and for open-source DBMS there are few
DBMS were designed for business, not science
hard-to-change database schemas
limited set of data types
peculiarities of the SQL query language
quirks of query optimization
arcane programming models
Strong challenges are inherent in bioscience
very large scale
diverse types of information
extremely complex analytical queries

7
Amazingly Few Large-Scale DBMS Options

Commercial DBMS
IBM DB2/DiscoveryLink
Microsoft SQL Server
Oracle 10g

Open-Source DBMS
MySQL
PostgreSQL

Even though these have been designed with
scalability as a key design goal, they have
scalability limits that bScience will push
8
PostgreSQL? Why not just MySQL?

MySQL and PostgreSQL are the primary scalable
open-source DBMS
Objective, feature-by-feature Comparisons
http//troels.arvin.dk/db/rdbms/
http//en.wikipedia.org/wiki/Comparison_of_relatio
nal_database_management_systems
More subjective, focal-issue Comparison

9
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
What is really needed here is a new datatype for
sequence locations. This query is painful to
express without this, and not painful with it.
10
BioPostgres -- some modular database
infrastructure for Computational Biology

BioPostgres PostgreSQL Extensions
PostgreSQL an open-source, industrial-strength,
scalable DBMS http//www.postgresql.org
Extension a new SQL API, with query operators
and tools
Each Extension is a separate package
Biosequence extensions
GeneOntology extensions
SQL datatype extensions
System management extensions
Working toward complementing other BOSC
platforms
BioPerl, BioPython, BioJava, BioSQL,
BioConductor, GMOD,

11
Extensibility features of PostgreSQL

PostgreSQL was designed specifically to be
extensible
Individual databases can be extended with
New datatypes
New functions
New query operators
New indexing methods
New query languages
These can be added or dropped anytime, on the fly
dynamic linking of implementation libraries as
needed
Flexible conventions for user-contributed modules
implementations are typically in C (like
PostgreSQL).

12
Extending PostgreSQL

Within a given database, one can add new
datatypes
A datatype can be added to all databases also

MyBioStuff (PostgreSQL database)
PostgreSQL
Graph datatype query operators
Seq Location datatype query operators
13
PostgreSQL user-contributed modules

A new module (say PostFoo) typically contains
New functions (in PostFoo.c)
SQL interface bindings for these functions (in
PostFoo.sql)
The PostFoo module gets downloaded by others into
their copy of the PostgreSQL source tree as a new
directory postgresql-8../cont
rib/PostFoo/
In this directory, the source tree owner types
gmake
gmake install as root
This compiles only the module (NOT PostgreSQL)!
Afterwards anyone can dynamically add the module
to any given PostgreSQL database (say
MyBioStuff) with a command like
psql -d MyBioStuff lt PostFoo.sql
See www.biopostgres.org/install.html

14
BioPostgres Modules

BioPostgres is a collection of modules that
extend PostgreSQL for Computational Biology.
These modules are basically independent, and can
be used separately or in conjunction with others

15
Quick overview BLASTgres

BLASTgres -- extensions for biosequence
management
Sequence location datatype (seq_id,
start,end)
Sequence location operators loc intersection,
etc.
Sequence location query find overlapping locs,
etc.
Access to BLAST services remote and local
servers
URL http//www.biopostgres.org/BLASTgres/

16
Two sets of BLASTgres extensions

1. BLASTgres provides BLAST query, BLAST hit
database
SELECT FROM blast_sequence()
2. BLASTgres provides biosequence-related
datatypes, with accompanying query operators and
indexing methods
Sequence range (and array of range)
17679235..17679427
Sequence location (and array of location)
NT_011109.1517679235..17679427
Hit ( high-scoring sequence alignment
information)
(in1353..966, HUMAPOE43779..4402
99.84, 624, 1, 0, 0, 1229)

17
BLAST access via BLASTgres queries

Simple BLAST queries
SELECT FROM local_blast_hit(atcgatcgatcg,
lab-sequences, blastn)
SELECT FROM remote_blast_hps(lab_protein.fasta
, nr, blastp)
SELECT FROM remote_blast_hit(NM_010387.2245..5
46, nr, blastn)
SELECT FROM fasta_sequence( /lab/ests,
P082345, 20, 40 )
2. Annotations to BLAST query results
SELECT count(), species FROM annotated_remote_bla
st_hit(AF101044, nr, blastn)
GROUP BY species -- automatic annotation to
BLAST query results
SELECT FROM annotated_remote_blast_hit(AF101044
, nr)
WHERE description LIKE SNRPN, and species
ltgt Homo sapiens
3. Advanced filtering for BLAST results in SQL
SELECT subject_location, length FROM
local_blast_hit(AF101044)

-- sensitivity test (comparison of BLAST query
results using different parameters)
CREATE TABLE parameter1 ( name TEXT, value TEXT
)
INSERT INTO parameter1 VALUES ( WORD_SIZE, 7
) -- change the default settings
INSERT INTO parameter1 VALUES ( GAPCOSTS, 5 2
)
CREATE TABLE parameter2 ( name TEXT, value TEXT
)
INSERT INTO parameter2 VALUES ( WORD_SIZE, 13
)
INSERT INTO parameter2 VALUES ( GAPCOSTS, 3 1
)
-- retrieve matches that are not found in both
results
SELECT R1., R2. FROM remote_blast_hit_v(
AF101044, parameter1) R1,
remote_blast_hit_v( AF101044, parameter2)
R2
WHERE (R1.subject_location
R2.subject_location) AND (R1.length ltgt R2.length
)

18
BLASTgres functions
Range Operators range range range int8 int8
range range - range range ? int8 range range
range int4 int4 range range range range lt
range rangelt range range gt range range gt range
range ltlt range range lt range range range
range gt range range gtgt range range range range
ltgt range range _at_ range range range range _at_lt
range range _at_gt range Range Aggregate minmax
(range)
_range_concat(_range,_range) _range_overlaps(_rang
e,_range) _range_contains(_range,_range)
_range_contained(_range,_range)
_range_eq(_range,_range) _range_ne(_range,_range)
_range_union(_range,_range) _range_inters(_range,_
range) _range_push_elem(_range, RANGE)
_range_push_array(_range, _range)
_range_del_elem(_range, RANGE) _range_union_elem(_
range, RANGE) _range_subtract(_range, _range)
_range_contains(_range, range) _range_contains_int
erval_any(_range, range) _range_contains_interval_
all(_range, range) _range_contained_interval_any(_
range, range) _range_contained_interval_all(_range
, range) _range_overlaps_interval_any(_range,
range) _range_overlaps_interval_all(_range,
range) _range_in(cstring) _range_out(_range_key)
Location Operators loc loc loc range loc
int8 loc - loc loc - range loc ? int8 loc loc
loc range loc int4 loc loc loc ltgt loc loc
ltgt range loc lt loc loc lt range loc lt loc loc lt
range loc gt loc loc gt range loc gt loc loc gt
range loc ltlt loc loc ltlt range loc lt loc loc lt
range loc loc loc range loc gt loc loc gt
range loc gtgt loc loc gtgt range loc _at_ loc loc _at_
range loc loc loc range
Location Functions loc_range(loc) loc_seqid(loc)
loc_size(loc) loc_lower(loc) loc_upper(loc)
loc_positive_strand(loc) loc_negative_strand(loc)
loc_same_strand(loc,loc) loc_negate(loc)
loc_eq(loc, loc) loc_eq(loc, range) loc_ne(loc,
loc) loc_ne(loc, range) loc_over_left(loc, loc)
loc_over_left(loc, range) loc_over_right(loc,
loc) loc_over_right(loc, range) loc_left(loc,loc)
loc_left(loc,range) loc_right(loc,loc)
loc_right(loc,range) loc_lt(loc,loc)
loc_lt(loc,range) loc_le(loc,loc)
loc_le(loc,range) loc_gt(loc,loc)
loc_gt(loc,range) loc_ge(loc,loc)
loc_ge(loc,range)
Aggregate functions coalescing( text, text,
text, int4 ) coalescing( text, text, text)
partition(text, text, text, text, text
) revcom(text) transcribe(text)
translate(text) translate(text, int4)
range_agg_state (range, range)
range_agg_final_array (range)
range_array_aggregate ( range_array_enum(range)
loc_agg_state (loc, loc) loc_agg_final_array
(loc) loc_array_aggregate loc_array_enum(loc)
rangeset(range) rcount(_range) sort(_range,
text) sort(_range) sort_asc(_range) sort_desc(_r
ange) uniq(_range) idx(_range, range)
subarray(_range, int4, int4) subarray(_range,
int4)
Range Functions range_over_left(range, range)
range_over_right(range, range) range_left(range,
range) range_right(range, range) range_lt(range,
range) range_le(range, range) range_gt(range,
range) range_ge(range, range) range_ne(range,
range) range_inside(range,range)
range_contains(range, range) range_contained(range
, range) range_overlaps(range, range)
range_eq(range, range) range_meets(range, range)
range_met_by(range, range) range_starts(range,
range) range_started_by(range, range)
range_finishes(range, range) range_finished_by(ran
ge, range) range_same_lower(range, range)
range_same_upper(range, range) range_minus(range,
range) range_plus(range, range)
range_torange(int8, int8) range_maxmin(range,
range) range_minmax(range, range)
range_extend(range, int4) range_cmp(range, range)
range_union(range, range) range_inter(range,
range) range_size(range) range_upper(range)
range_lower(range) range_times(range,range)
range_times(range, int4) range_times(int4, range)
range_plus(range, int8) range_plus(int8, range)
range_minus(range, int8)
BLAST-related functions fasta_sequence(text,text,
int,int) fasta_sequence(text,text)
fasta_dbinfo(text) remote_blast_hit(text,text,te
xt) remote_blast_hit( text, text )
remote_blast_hit(text) remote_blast_hsp(text,text,
text) remote_blast_hsp( text, text )
remote_blast_hsp(text) local_blast_hsp(text,text
,text,text) local_blast_hsp( text, text, text )
local_blast_hit(text,text,text,text)
local_blast_hit( text, text, text )
local_ublast_hit( text, text, text )
local_ublast_hit( text, text ) local_ublast_hit(
text ) genbank_search(text)
loc_contains(loc,loc) loc_contains(loc,range)
loc_contained(loc,loc) loc_contained(range,loc)
loc_overlaps(loc,loc) loc_overlaps(loc,range)
loc_meets(loc,loc) loc_meets(loc,range)
loc_met_by(loc,loc) loc_met_by(loc,range)
loc_starts(loc,loc) loc_starts(loc,range)
loc_started_by(loc,loc) loc_started_by(loc,range)
loc_finishes(loc,loc) loc_finishes(loc,range)
loc_finished_by(loc,loc) loc_finished_by(loc,range
) loc_minus(loc,loc) loc_minus(loc, integer)
loc_minus(loc, range) loc_plus(loc,loc)
loc_plus(loc, integer) loc_plus(integer, loc)
loc_plus(loc, range) loc_times(loc,loc)
loc_times(loc, integer) loc_times(integer,loc)
loc_times(loc, range) toloc(cstring, int8, int8)
toloc(text, int8, int8) toloc(cstring, range)
toloc(text, range) loc_maxmin(loc,loc)
loc_minmax(loc,loc) loc_extend(loc,int4)
loc_maxmin(loc,range) loc_minmax(loc,range)
19
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
BLASTgres query operators X Y
range_overlaps(X,Y) X _at_ Y range_contains(X,Y)
X Y range_times(X,Y) X Y
range_plus(X,Y) etc
20
Indexing for the loc range datatypes
Search keys are represented as two bounding
intervals (for leaf nodes, the lower bound is
equal to upper bound for the first bounding
intervals). The bounding interval of an internal
node contains those of its subnodes.
21
GiST Indexing in BLASTgres
GiST (Generalized Index Search Tree) is the
PostgreSQL framework for user-defined
indexing. BLASTgres uses GiST indexing for
locations, ranges, location arrays, and range
arrays.

GiST indexing requires user
definition of four methods
Consistent determine if a subtree traversal is
necessary.
Union merge two nodes.
Penalty determine the cost of inserting an
entry in a node.
PickSplit determine how to split a full node

22
BioPostgres Modules

Many modules are needed! We have only just begun
to develop modules we feel are most clearly
needed. Please contact us if you have opinions
or ideas.

23
Quick overview PostGraph

PostGraph -- extensions for graph management
Graph datatype each graph is a single data
object
Graph operators insert edge, delete node, etc.
Graph query find connected components, etc.
Access to graph tools dot/graphviz, prefuse,
etc.
URL http//www.biopostgres.org/PostGraph/

24
GeneOntology relational database

GO is naturally a graph, distributed as
relational tables
This representation is fine for storage but can
be awkward for exploration

SELECT id, name, term_type, acc FROM term LIMIT
7 id name
term_type acc ------------------
------------------------------------------------
- 1 all
universal all 2 is_a
relationship is_a 3
part_of gene_ontology
part_of 4 mitochondrion inheritance
biological_process GO0000001 5
mitochondrial genome maintena
biological_process GO0000002 6
reproduction
biological_process GO0000003 7 alt_id
synonym_type
alt_id SELECT FROM term2term LIMIT 7 id
relationship_type_id term1_id term2_id
complete -------------------------------------
------------------- 1
2 10 9 0 2
2 10 13 0
3 2 26 25
0 4 2 10
42 0 5 2
10 50 0 6
2 26 67 0 7
2 26 92
0
25
Augmenting the GeneOntology database with a graph
database

PostGraph permits GO terms to be stored queried
as graphs
GOtermViewer uses this to provide a Graphical
Interface for GO
These are extended in GObase www.biopostgres.org
/GObase/

26
Quick overview PostMake

PostMake -- extensions for managing data
dependencies
PostMake covers common dependencies in
bioscience
PostCron database operations at pre-specified
times (crontab within a database)
Database dump/load, tracking when things change
Materialized views, triggered by updates
(make within a database)
PostMakefiles are translated to SQL DDL
URL http//www.biopostgres.org/PostMake/

27
BioPostgres and open-source

Like PostgreSQL, BioPostgres is committed to
openness
BioPostgres modules are GPLed most code so far
is in C, SQL, and Java
BioPostgres modules are downloadable from
SourceForge
We think BioPostgres shows that extensible DBMS
have a lot to offer the open-source movement,
particularly in bioscience.

28
Will DBMS ever get an A in Biology?

Actually we dont know.
We do not think DBMS will ever replace files or
programming languages in bioscience.
We think however that scalability is a driving
issue, and it will impact the effectiveness of
files and programming languages in bioscience.
And soon.
It does appear that extensibile DBMS can suggest
steps in the right direction.

29
THANKYOU!

Write a Comment

User Comments (0)