BioPostgres - PowerPoint PPT Presentation

1 / 29
About This Presentation
Title:

BioPostgres

Description:

MySQL. PostgreSQL. Even though these have been designed with scalability as a key ... Why not just MySQL? MySQL and PostgreSQL are the primary scalable open ... – PowerPoint PPT presentation

Number of Views:88
Avg rating:3.0/5.0
Slides: 30
Provided by: IvoD9
Category:

less

Transcript and Presenter's Notes

Title: BioPostgres


1
BioPostgres
www.biopostgres.org
Stott Parker Ruey-Lung Hsiao UCLA Computer
Science Dept.
UCLA Center for Computational Biology (CCB)
2
The Future of Science
  • Evolution of science
  • Increasing emphasis on
  • scale
  • networking
  • informatics

3
bScience ?
  • eScience
  • large-scale, data-centric, computationally
    mind-numbing science http//research.microsoft.c
    om/workshops/escience2005/
  • the future of science (for eScientists)
    enormous centralized data centers that provide
    many information services
  • bScience
  • NCBI is a good model for an eScience data center
  • The importance of extensible database systems
  • Jim Gray stresses the importance of extensible
    databases in eScience future scientists will
    spend their days writing large-scale SQL queries
    (!)
  • Even if Jim Gray is not right about this, at some
    point the scale of biological information
    requires using database systems for some kind of
    data management. Think about terabytes and
    exabytes

4
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
Databases take work to set up, but permit
exploration -- asking and quickly getting
answers to questions. The join is a large-scale
information connection operator that is both very
basic and very annoying to program.
5
But Biologists dont like databases
  • The main programming model is Perl, not SQL
  • Databases have negative associations
  • Inflexibility
  • Quirkiness
  • Possible slowness
  • Possible expense !
  • Operations often must be done outside the
    database SQL is not usually enough
  • Steep learning curve

6
Why do DBMS get an F in Biology?
  • There arent that many choices for DBMS
  • and for open-source DBMS there are few
  • DBMS were designed for business, not science
  • hard-to-change database schemas
  • limited set of data types
  • peculiarities of the SQL query language
  • quirks of query optimization
  • arcane programming models
  • Strong challenges are inherent in bioscience
  • very large scale
  • diverse types of information
  • extremely complex analytical queries

7
Amazingly Few Large-Scale DBMS Options
  • Commercial DBMS
  • IBM DB2/DiscoveryLink
  • Microsoft SQL Server
  • Oracle 10g
  • Open-Source DBMS
  • MySQL
  • PostgreSQL

Even though these have been designed with
scalability as a key design goal, they have
scalability limits that bScience will push
8
PostgreSQL? Why not just MySQL?
  • MySQL and PostgreSQL are the primary scalable
    open-source DBMS
  • Objective, feature-by-feature Comparisons
  • http//troels.arvin.dk/db/rdbms/
  • http//en.wikipedia.org/wiki/Comparison_of_relatio
    nal_database_management_systems
  • More subjective, focal-issue Comparison

9
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
What is really needed here is a new datatype for
sequence locations. This query is painful to
express without this, and not painful with it.
10
BioPostgres -- some modular database
infrastructure for Computational Biology
  • BioPostgres PostgreSQL Extensions
  • PostgreSQL an open-source, industrial-strength,
    scalable DBMS http//www.postgresql.org
  • Extension a new SQL API, with query operators
    and tools
  • Each Extension is a separate package
  • Biosequence extensions
  • GeneOntology extensions
  • SQL datatype extensions
  • System management extensions
  • Working toward complementing other BOSC
    platforms
  • BioPerl, BioPython, BioJava, BioSQL,
    BioConductor, GMOD,

11
Extensibility features of PostgreSQL
  • PostgreSQL was designed specifically to be
    extensible
  • Individual databases can be extended with
  • New datatypes
  • New functions
  • New query operators
  • New indexing methods
  • New query languages
  • These can be added or dropped anytime, on the fly
  • dynamic linking of implementation libraries as
    needed
  • Flexible conventions for user-contributed modules
  • implementations are typically in C (like
    PostgreSQL).

12
Extending PostgreSQL
  • Within a given database, one can add new
    datatypes
  • A datatype can be added to all databases also

MyBioStuff (PostgreSQL database)
PostgreSQL
Graph datatype query operators
Seq Location datatype query operators
13
PostgreSQL user-contributed modules
  • A new module (say PostFoo) typically contains
  • New functions (in PostFoo.c)
  • SQL interface bindings for these functions (in
    PostFoo.sql)
  • The PostFoo module gets downloaded by others into
    their copy of the PostgreSQL source tree as a new
    directory postgresql-8../cont
    rib/PostFoo/
  • In this directory, the source tree owner types
  • gmake
  • gmake install as root
  • This compiles only the module (NOT PostgreSQL)!
  • Afterwards anyone can dynamically add the module
    to any given PostgreSQL database (say
    MyBioStuff) with a command like
  • psql -d MyBioStuff lt PostFoo.sql
  • See www.biopostgres.org/install.html

14
BioPostgres Modules
  • BioPostgres is a collection of modules that
    extend PostgreSQL for Computational Biology.
    These modules are basically independent, and can
    be used separately or in conjunction with others

15
Quick overview BLASTgres
  • BLASTgres -- extensions for biosequence
    management
  • Sequence location datatype (seq_id,
    start,end)
  • Sequence location operators loc intersection,
    etc.
  • Sequence location query find overlapping locs,
    etc.
  • Access to BLAST services remote and local
    servers
  • URL http//www.biopostgres.org/BLASTgres/

16
Two sets of BLASTgres extensions
  • 1. BLASTgres provides BLAST query, BLAST hit
    database
  • SELECT FROM blast_sequence()
  • 2. BLASTgres provides biosequence-related
    datatypes, with accompanying query operators and
    indexing methods
  • Sequence range (and array of range)
  • 17679235..17679427
  • Sequence location (and array of location)
  • NT_011109.1517679235..17679427
  • Hit ( high-scoring sequence alignment
    information)
  • (in1353..966, HUMAPOE43779..4402
  • 99.84, 624, 1, 0, 0, 1229)

17
BLAST access via BLASTgres queries
  • Simple BLAST queries
  • SELECT FROM local_blast_hit(atcgatcgatcg,
    lab-sequences, blastn)
  • SELECT FROM remote_blast_hps(lab_protein.fasta
    , nr, blastp)
  • SELECT FROM remote_blast_hit(NM_010387.2245..5
    46, nr, blastn)
  • SELECT FROM fasta_sequence( /lab/ests,
    P082345, 20, 40 )
  • 2. Annotations to BLAST query results
  • SELECT count(), species FROM annotated_remote_bla
    st_hit(AF101044, nr, blastn)
  • GROUP BY species -- automatic annotation to
    BLAST query results
  • SELECT FROM annotated_remote_blast_hit(AF101044
    , nr)
  • WHERE description LIKE SNRPN, and species
    ltgt Homo sapiens
  • 3. Advanced filtering for BLAST results in SQL
  • SELECT subject_location, length FROM
    local_blast_hit(AF101044)
  • -- sensitivity test (comparison of BLAST query
    results using different parameters)
  • CREATE TABLE parameter1 ( name TEXT, value TEXT
    )
  • INSERT INTO parameter1 VALUES ( WORD_SIZE, 7
    ) -- change the default settings
  • INSERT INTO parameter1 VALUES ( GAPCOSTS, 5 2
    )
  • CREATE TABLE parameter2 ( name TEXT, value TEXT
    )
  • INSERT INTO parameter2 VALUES ( WORD_SIZE, 13
    )
  • INSERT INTO parameter2 VALUES ( GAPCOSTS, 3 1
    )
  • -- retrieve matches that are not found in both
    results
  • SELECT R1., R2. FROM remote_blast_hit_v(
    AF101044, parameter1) R1,
  • remote_blast_hit_v( AF101044, parameter2)
    R2
  • WHERE (R1.subject_location
    R2.subject_location) AND (R1.length ltgt R2.length
    )

18
BLASTgres functions
Range Operators range range range int8 int8
range range - range range ? int8 range range
range int4 int4 range range range range lt
range rangelt range range gt range range gt range
range ltlt range range lt range range range
range gt range range gtgt range range range range
ltgt range range _at_ range range range range _at_lt
range range _at_gt range Range Aggregate minmax
(range)
_range_concat(_range,_range) _range_overlaps(_rang
e,_range) _range_contains(_range,_range)
_range_contained(_range,_range)
_range_eq(_range,_range) _range_ne(_range,_range)
_range_union(_range,_range) _range_inters(_range,_
range) _range_push_elem(_range, RANGE)
_range_push_array(_range, _range)
_range_del_elem(_range, RANGE) _range_union_elem(_
range, RANGE) _range_subtract(_range, _range)
_range_contains(_range, range) _range_contains_int
erval_any(_range, range) _range_contains_interval_
all(_range, range) _range_contained_interval_any(_
range, range) _range_contained_interval_all(_range
, range) _range_overlaps_interval_any(_range,
range) _range_overlaps_interval_all(_range,
range) _range_in(cstring) _range_out(_range_key)
Location Operators loc loc loc range loc
int8 loc - loc loc - range loc ? int8 loc loc
loc range loc int4 loc loc loc ltgt loc loc
ltgt range loc lt loc loc lt range loc lt loc loc lt
range loc gt loc loc gt range loc gt loc loc gt
range loc ltlt loc loc ltlt range loc lt loc loc lt
range loc loc loc range loc gt loc loc gt
range loc gtgt loc loc gtgt range loc _at_ loc loc _at_
range loc loc loc range
Location Functions loc_range(loc) loc_seqid(loc)
loc_size(loc) loc_lower(loc) loc_upper(loc)
loc_positive_strand(loc) loc_negative_strand(loc)
loc_same_strand(loc,loc) loc_negate(loc)
loc_eq(loc, loc) loc_eq(loc, range) loc_ne(loc,
loc) loc_ne(loc, range) loc_over_left(loc, loc)
loc_over_left(loc, range) loc_over_right(loc,
loc) loc_over_right(loc, range) loc_left(loc,loc)
loc_left(loc,range) loc_right(loc,loc)
loc_right(loc,range) loc_lt(loc,loc)
loc_lt(loc,range) loc_le(loc,loc)
loc_le(loc,range) loc_gt(loc,loc)
loc_gt(loc,range) loc_ge(loc,loc)
loc_ge(loc,range)
Aggregate functions coalescing( text, text,
text, int4 ) coalescing( text, text, text)
partition(text, text, text, text, text
) revcom(text) transcribe(text)
translate(text) translate(text, int4)
range_agg_state (range, range)
range_agg_final_array (range)
range_array_aggregate ( range_array_enum(range)
loc_agg_state (loc, loc) loc_agg_final_array
(loc) loc_array_aggregate loc_array_enum(loc)
rangeset(range) rcount(_range) sort(_range,
text) sort(_range) sort_asc(_range) sort_desc(_r
ange) uniq(_range) idx(_range, range)
subarray(_range, int4, int4) subarray(_range,
int4)
Range Functions range_over_left(range, range)
range_over_right(range, range) range_left(range,
range) range_right(range, range) range_lt(range,
range) range_le(range, range) range_gt(range,
range) range_ge(range, range) range_ne(range,
range) range_inside(range,range)
range_contains(range, range) range_contained(range
, range) range_overlaps(range, range)
range_eq(range, range) range_meets(range, range)
range_met_by(range, range) range_starts(range,
range) range_started_by(range, range)
range_finishes(range, range) range_finished_by(ran
ge, range) range_same_lower(range, range)
range_same_upper(range, range) range_minus(range,
range) range_plus(range, range)
range_torange(int8, int8) range_maxmin(range,
range) range_minmax(range, range)
range_extend(range, int4) range_cmp(range, range)
range_union(range, range) range_inter(range,
range) range_size(range) range_upper(range)
range_lower(range) range_times(range,range)
range_times(range, int4) range_times(int4, range)
range_plus(range, int8) range_plus(int8, range)
range_minus(range, int8)
BLAST-related functions fasta_sequence(text,text,
int,int) fasta_sequence(text,text)
fasta_dbinfo(text) remote_blast_hit(text,text,te
xt) remote_blast_hit( text, text )
remote_blast_hit(text) remote_blast_hsp(text,text,
text) remote_blast_hsp( text, text )
remote_blast_hsp(text) local_blast_hsp(text,text
,text,text) local_blast_hsp( text, text, text )
local_blast_hit(text,text,text,text)
local_blast_hit( text, text, text )
local_ublast_hit( text, text, text )
local_ublast_hit( text, text ) local_ublast_hit(
text ) genbank_search(text)
loc_contains(loc,loc) loc_contains(loc,range)
loc_contained(loc,loc) loc_contained(range,loc)
loc_overlaps(loc,loc) loc_overlaps(loc,range)
loc_meets(loc,loc) loc_meets(loc,range)
loc_met_by(loc,loc) loc_met_by(loc,range)
loc_starts(loc,loc) loc_starts(loc,range)
loc_started_by(loc,loc) loc_started_by(loc,range)
loc_finishes(loc,loc) loc_finishes(loc,range)
loc_finished_by(loc,loc) loc_finished_by(loc,range
) loc_minus(loc,loc) loc_minus(loc, integer)
loc_minus(loc, range) loc_plus(loc,loc)
loc_plus(loc, integer) loc_plus(integer, loc)
loc_plus(loc, range) loc_times(loc,loc)
loc_times(loc, integer) loc_times(integer,loc)
loc_times(loc, range) toloc(cstring, int8, int8)
toloc(text, int8, int8) toloc(cstring, range)
toloc(text, range) loc_maxmin(loc,loc)
loc_minmax(loc,loc) loc_extend(loc,int4)
loc_maxmin(loc,range) loc_minmax(loc,range)
19
Aligning protein features exon structure
CREATE TABLE exon_feature exon_feature

AS
gene exon protein interval
type å description SELECT
DISTINCT ------------------
-----------------------------------------------
---- e.gene,
H2-DMb2 2 P35737 75..75
glycosylation N-linked (GlcNAc...)
e.exon_no as exon, H2-DMb2 3
P35737 1..18 signal potential
f.protein_id as protein,
H2-DMb2 4 P35737 19..112 domain
lumenal beta-1 f.interval,
H2-DMb2 4 P35737 75..75
glycosylation N-linked (GlcNAc...)
f.type, H2-DMb2 5
P35737 113..207 domain lumenal
beta-2 f.description
H2-DMb2 5 P35737 114..204 domain
Ig-like FROM
H2-DMb2 6 P35737
208..218 domain connecting peptide
gene g, H2-DMb2
6 P35737 219..239 transmembrane
potential exon e,
H2-DMb2 7 P35737 248..251
site YXXZ motif mrna m,
HLA-DMB 1
P28068 1..18 signal potential
protein p,
HLA-DMB 2 P28068 19..112 domain
lumenal beta-1 protein_feature f
HLA-DMB 2 P28068
110..110 glycosylation N-linked (GlcNAc...)
WHERE HLA-DMB
3 P28068 113..207 domain
lumenal beta-2 e.gene g.gene
HLA-DMB 3 P28068 114..208
domain Ig-like AND
e.gene m.gene HLA-DMB 4
P28068 208..218 domain connecting
peptide AND e.gene p.gene
HLA-DMB 4 P28068 219..239
transmembrane potential AND
p.protein_id f.protein_id HLA-DMB 5
P28068 248..251 site YXXZ motif
AND loc_range(e.coding_mrna)
'-2..2'range - range_lower(loc_range(m.
coding_mrna)) _at_ --
contains (f.interval - 1) 3
BLASTgres query operators X Y
range_overlaps(X,Y) X _at_ Y range_contains(X,Y)
X Y range_times(X,Y) X Y
range_plus(X,Y) etc
20
Indexing for the loc range datatypes
Search keys are represented as two bounding
intervals (for leaf nodes, the lower bound is
equal to upper bound for the first bounding
intervals). The bounding interval of an internal
node contains those of its subnodes.
21
GiST Indexing in BLASTgres
GiST (Generalized Index Search Tree) is the
PostgreSQL framework for user-defined
indexing. BLASTgres uses GiST indexing for
locations, ranges, location arrays, and range
arrays.
  • GiST indexing requires user
  • definition of four methods
  • Consistent determine if a subtree traversal is
    necessary.
  • Union merge two nodes.
  • Penalty determine the cost of inserting an
    entry in a node.
  • PickSplit determine how to split a full node

22
BioPostgres Modules
  • Many modules are needed! We have only just begun
    to develop modules we feel are most clearly
    needed. Please contact us if you have opinions
    or ideas.

23
Quick overview PostGraph
  • PostGraph -- extensions for graph management
  • Graph datatype each graph is a single data
    object
  • Graph operators insert edge, delete node, etc.
  • Graph query find connected components, etc.
  • Access to graph tools dot/graphviz, prefuse,
    etc.
  • URL http//www.biopostgres.org/PostGraph/

24
GeneOntology relational database
  • GO is naturally a graph, distributed as
    relational tables
  • This representation is fine for storage but can
    be awkward for exploration

SELECT id, name, term_type, acc FROM term LIMIT
7 id name
term_type acc ------------------
------------------------------------------------
- 1 all
universal all 2 is_a
relationship is_a 3
part_of gene_ontology
part_of 4 mitochondrion inheritance
biological_process GO0000001 5
mitochondrial genome maintena
biological_process GO0000002 6
reproduction
biological_process GO0000003 7 alt_id
synonym_type
alt_id SELECT FROM term2term LIMIT 7 id
relationship_type_id term1_id term2_id
complete -------------------------------------
------------------- 1
2 10 9 0 2
2 10 13 0
3 2 26 25
0 4 2 10
42 0 5 2
10 50 0 6
2 26 67 0 7
2 26 92
0
25
Augmenting the GeneOntology database with a graph
database
  • PostGraph permits GO terms to be stored queried
    as graphs
  • GOtermViewer uses this to provide a Graphical
    Interface for GO
  • These are extended in GObase www.biopostgres.org
    /GObase/

26
Quick overview PostMake
  • PostMake -- extensions for managing data
    dependencies
  • PostMake covers common dependencies in
    bioscience
  • PostCron database operations at pre-specified
    times (crontab within a database)
  • Database dump/load, tracking when things change
  • Materialized views, triggered by updates
    (make within a database)
  • PostMakefiles are translated to SQL DDL
  • URL http//www.biopostgres.org/PostMake/

27
BioPostgres and open-source
  • Like PostgreSQL, BioPostgres is committed to
    openness
  • BioPostgres modules are GPLed most code so far
    is in C, SQL, and Java
  • BioPostgres modules are downloadable from
    SourceForge
  • We think BioPostgres shows that extensible DBMS
    have a lot to offer the open-source movement,
    particularly in bioscience.

28
Will DBMS ever get an A in Biology?
  • Actually we dont know.
  • We do not think DBMS will ever replace files or
    programming languages in bioscience.
  • We think however that scalability is a driving
    issue, and it will impact the effectiveness of
    files and programming languages in bioscience.
    And soon.
  • It does appear that extensibile DBMS can suggest
    steps in the right direction.

29
THANKYOU!
Write a Comment
User Comments (0)
About PowerShow.com