Title: Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data
1. Extending Traditional Query-Based Integration
Approaches for Functional Characterization of
Post-Genomic Data
- Barbara A. Eckman, Ph.D.
- Senior Consulting I/T Architect
- IBM Life Sciences Solutions
- baeckman_at_us.ibm.com
- Joint work at GlaxoSmithKline with
- Anthony S. Kosky, Gene Logic, Inc.
- Leonardo A. Laroco, Jr., GlaxoSmithKline
Bioinformatics 2001
2. Outline
- Goal and Obstacles
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Related Work
- Lessons Learned
- Acknowledgements
3. The Goal
- To make data interpretation easier and faster for
human experts through an integrated, queryable
view of genomic sequence and associated data
4. Obstacles to Integration
- Data spread over multiple, heterogeneous dbs
- Not all are easily queried
- flat file sequence dbs, web sites, BLAST
alignments
- Some are not even easily parsed!
- Not all represent biology optimally
- Genbank is sequence-centric, not gene-centric
- SwissProt is sequence-centric, not domain-centric
- Hard to keep results up-to-date
- Non-traditional query approaches are needed to
exclude extraneous results
5. This is a hard problem.
6. Outline
- Goal and Obstacles
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Related Work
- Lessons Learned
- Acknowledgements
7. Beyond Browsing to One-Step Querying
8. Select kinase cDNAs for microarray experiments
- Return HUGO name, length, GB accession of the
longest full-length cDNA sequence related to each
GeneCards entry that has been annotated as a
kinase
9. The browsing approach
- If browsing, would take 4020 web page visits,
assuming an average of 5 cDNAs for each of the
670 GeneCards
- Brain-numbing, error-prone
- Sub-optimal use of resources!
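The 4020 figure is consistent with one page visit per GeneCards entry plus one visit per linked cDNA record. That breakdown is an assumption (the slide gives only the totals), but the arithmetic can be checked quickly:

```python
# Assumption: one page visit per GeneCards entry, plus one per linked cDNA record.
genecards = 670        # kinase-annotated GeneCards entries (from the slide)
cdnas_per_card = 5     # average cDNAs per entry (from the slide)

visits = genecards * (1 + cdnas_per_card)
print(visits)  # 4020
```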
14. The querying approach
- Write short query
    select
      hugo_name = gc.hugo_name,
      seq_length = gs2.seq_length,
      longest_acc = gs2.primary_accession
    from
      gc in GENECARDS:Genecard,
      gc_accs = gc.unigene_nucleic_acids.cluster.cdna_accessions,
      maxlen = max (select gs.seq_length
                    from gs in genbank:Seq
                    where gs.topology = mRNA
                    and gs.accessions in gc_accs
                   ),
      gs2 in genbank:Seq
    where
      gc.text_search match "kinase"
      and gs2.accessions in gc_accs
      and gs2.topology = mRNA
      and gs2.seq_length = maxlen
- Execute it
- 1 hour later get result (670 rows)
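For readers unfamiliar with the query language, the logic of the query on this slide can be paraphrased in Python over in-memory records. This is only a sketch: the `Seq` and `GeneCard` classes, the toy records, and the simplification of matching on a single primary accession are all assumptions made for illustration, not part of TINet or OPM.

```python
from dataclasses import dataclass

@dataclass
class Seq:                      # hypothetical stand-in for a GenBank sequence record
    primary_accession: str
    topology: str
    seq_length: int

@dataclass
class GeneCard:                 # hypothetical stand-in for a GeneCards entry
    hugo_name: str
    text: str
    cdna_accessions: list

def longest_kinase_cdnas(cards, seqs):
    """For each kinase-annotated card, return its longest linked mRNA sequence(s)."""
    by_acc = {s.primary_accession: s for s in seqs}
    rows = []
    for gc in cards:
        if "kinase" not in gc.text.lower():
            continue
        mrnas = [by_acc[a] for a in gc.cdna_accessions
                 if a in by_acc and by_acc[a].topology == "mRNA"]
        if not mrnas:
            continue
        maxlen = max(s.seq_length for s in mrnas)
        rows.extend((gc.hugo_name, s.seq_length, s.primary_accession)
                    for s in mrnas if s.seq_length == maxlen)
    return rows

# Toy data with made-up names and accessions.
cards = [GeneCard("GENE_A", "a protein kinase", ["X1", "X2", "X3"]),
         GeneCard("GENE_B", "a phosphatase", ["X4"])]
seqs = [Seq("X1", "mRNA", 1200), Seq("X2", "mRNA", 2500),
        Seq("X3", "DNA", 9000), Seq("X4", "mRNA", 800)]
print(longest_kinase_cdnas(cards, seqs))  # [('GENE_A', 2500, 'X2')]
```

The declarative version leaves the choice of join order and access path to the middleware, whereas the loop above fixes both by hand.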
15. GeneCards Schema
16. Outline
- Goal and Obstacles
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Related Work
- Lessons Learned
- Acknowledgements
17. Federated Technology
- Query multiple heterogeneous data sources as if
they were components of a single large database.
[Figure labels: blast1, blast2, SRS, motifs, MGD, Gene Exp, GenBank, GPCR, SwissProt, GeneCards, PubMed, Result]
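The idea of querying several sources as if they were one database can be illustrated with a very small sketch. The `Federation` class, its method names, and the toy rows below are hypothetical and are not the OPM or TINet API; the point is only that each source hides behind a uniform wrapper and joins happen in the middle layer.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

class Federation:
    """Hypothetical registry of wrapper functions, one per data source."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, scan: Callable[[], Iterable[Row]]) -> None:
        self._sources[name] = scan

    def scan(self, name: str) -> List[Row]:
        return list(self._sources[name]())

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Hash join of two sources on a shared key, done in the middleware."""
        index: Dict[object, List[Row]] = {}
        for row in self.scan(right):
            index.setdefault(row[key], []).append(row)
        return [{**left_row, **right_row}
                for left_row in self.scan(left)
                for right_row in index.get(left_row[key], [])]

fed = Federation()
# Each lambda stands in for a wrapper around a flat file, web site, or DBMS.
fed.register("sequences", lambda: [{"acc": "X1", "length": 1200},
                                   {"acc": "X2", "length": 2500}])
fed.register("citations", lambda: [{"acc": "X2", "pmid": 111}])
result = fed.join("sequences", "citations", "acc")
print(result)  # [{'acc': 'X2', 'length': 2500, 'pmid': 111}]
```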
18. The Target Informatics Net (TINet) System
- Middleware: Gene Logic's OPM System
- Object-relational data model
- SQL-like query language
- CORBA used for distributed access
- SDK to write wrappers for new data source types
- Limited, judicious use of data warehousing
- GenBank, SwissProt
- Scientific value added
20. Data Sources Integrated in TINet
- Mouse Genome DataBase (MGD) (Sybase)
- SwissProt, GenBank, PROSITE (flat-file)
- GeneCards, PubMed (web sites)
- On-the-fly BLAST searches (WUBLAST2, NCBI BLAST
2; PSI-BLAST in development)
- On-the-fly PROSITE motif searches
- SRS databanks
- GlaxoSmithKline proprietary genomic data (Sybase)
21. TINet Component Services
- BLAST (WU, NCBI) with BLAST-cache
- Generic web site server
- PubMed, GeneCards
- Generic flat file server (XML)
- generic loaders for releases and nightly updates
- GenBank, SwissProt, PROSITE
- Generic application server
- PROSITE search via GCG motifs
- SRS databanks
- ASDT server for method calls (BLAST, GB)
22. Outline
- Goal and Obstacles
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Basic
- Advanced: Using Class Methods
- Related Work
- Lessons Learned
- Acknowledgements
23. Benchmarking
- Real-world queries
- all 1/2 page long, most 15 minutes to write
- 10 iterations
- Core TINet system deployed on
- 2 CPU UltraSPARC-II, 296 MHz, 1.5 GB, SunOS 5.6
- Visigenic VisiBroker ORB
- BLAST server: 14x524 DEC Alpha
- Consistent performance (max sd 12% of mean)
- Times reported omit BLAST execution time
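Each query was run for 10 iterations and its timing summarized. Assuming the reported figures are a mean and a standard deviation, a summary of that kind could be computed as below. The timings are invented for illustration, and the use of the sample (n - 1) standard deviation is an assumption, since the slides do not say which form was used.

```python
from statistics import mean, stdev

def summarize(times_sec):
    """Return (mean, sample sd, sd as a percentage of the mean)."""
    m = mean(times_sec)
    sd = stdev(times_sec)
    return m, sd, 100.0 * sd / m

def fmt_mmss(seconds):
    """Format a duration in seconds as m:ss."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"

# Hypothetical wall-clock times (seconds) for 10 iterations of one query.
times = [30, 32, 31, 33, 34, 32, 31, 33, 32, 32]
m, sd, rel = summarize(times)
print(f"{fmt_mmss(m)} ± {sd:.1f} sec (sd is {rel:.1f}% of mean)")
```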
24. Basic Sample Queries
25. Find ESTs that may correspond to interesting
neurological targets
- Return accession numbers and definitions of EST
sequences that are similar (60% identical over 50
AA) to calcium channel sequences in SwissProt
that have references published since 1995 with
"brain" in the abstract.
- Data sources: SwissProt, PubMed, tblastn2 vs
gbest
- Results
- 539 HSPs from 538 ESTs in 11:58 ± 46 sec
- Is this best done with a workflow or a query?
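The similarity condition in this query (a minimum identity over a minimum aligned length) amounts to a simple predicate over HSPs. The sketch below uses a hypothetical `HSP` record and made-up hits; treating both thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass
class HSP:                      # hypothetical minimal view of a BLAST HSP
    subject_acc: str
    identities: int             # number of identical aligned residues
    align_length: int           # alignment length in residues

    @property
    def percent_identity(self) -> float:
        return 100.0 * self.identities / self.align_length

def similar(hsp: HSP, min_identity: float = 60.0, min_length: int = 50) -> bool:
    """True if the HSP meets both the identity and the length threshold."""
    return hsp.align_length >= min_length and hsp.percent_identity >= min_identity

hits = [HSP("EST_1", 45, 60),    # 75% over 60 AA: kept
        HSP("EST_2", 20, 30),    # 67% but only 30 AA: dropped
        HSP("EST_3", 40, 100)]   # 40% over 100 AA: dropped
kept = [h.subject_acc for h in hits if similar(h)]
print(kept)  # ['EST_1']
```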
26. Mine recent literature for serotonin-related
expression data linked to GenBank sequences
- Return the source, abstract and GB accession
number for all articles published in 2000 that
have a MeSH term "gene expression", mention
"serotonin", and are linked to nucleotide
sequences in GenBank
- Accesses PubMed
- Can't be done interactively
- Compensates via post-processing
- 5 articles related to 6 seqs in 0:10 ± 0.7 seconds
27. Find full-length coding sequences in GenBank
- Return the accession and definition of sequences
annotated as channels for which the CDS start and
stop locations are specified unambiguously.
- Accesses GenBank
- Impossible using Entrez interface
- 1574 features on 1331 unique sequences in
27:55 ± 3.4 seconds
28. Find possible novel drug targets
- Find homologues (60% identity over 50 AA) in
human genomic sequence to all novel genes cloned
and reported in the literature in the past 3
months whose articles are annotated with the
Medline MeSH term "neoplasms"
- Accesses PubMed, GenBank, BLAST
- 1/2 page of OPM-MQL vs. many browser clicks, many
lines of Perl glue code
- 131 HSPs involving 47 genomic sequences and 6
novel gene sequences in 4:21 ± 22.9 sec
29. Identify potential new orthologs and paralogs
related to a gene of interest
- Retrieve, filter and summarize relevant
references, GB annotations, and BLAST alignments
for potential paralog/ortholog sequences
identified through alignment with a gene of
interest (e.g., 95% over 400 bp)
- Accesses GenBank, BLAST, and PubMed
- Highly structured XML output organizes large,
hierarchical result set
- 5 XML entries involving 5 GB seqs with 18
Features, 3 PubMed entries in 0:32 ± 1.5 secs
30. XML output
    <?xml version="1.0"?>
    <!-- blastn of GenBank AF061056 against mRNA subset of GenBank -->
    <Results>
      <Query>
        <!-- GenBank accession id used to retrieve query sequence -->
        <qry_accession>AF061056</qry_accession>
        <qry_sequence>
          1 tgaaatatag gtgagagaca agattgtctc atatccgggg aaatcataac ctatgactag ...
        </qry_sequence>
      </Query>
      <Output>
        <!-- GenBank/PubMed reference data linked from filtered alignment -->
        <Target>
          <Genbank>
            <gb_accession>AF084645</gb_accession>
            <organism>Homo Sapiens</organism>
            <seq_length>2905</seq_length>
            <locus_name>AF084645</locus_name>
            <Feature>
              <feature_type>source</feature_type>
              <tissue_type>liver</tissue_type>
              <location>1..2905</location>
              <sequence>cctctgaag... 2887 bases omitted ...aaaaaaaaa</sequence>
            </Feature>
            <Feature>
              <feature_type>CDS</feature_type>
              <location>280..1584</location>
              <protein_id>AAC64558.1</protein_id>
              <sequence>ctggaggtg... 1287 bases omitted ...ggtagctga</sequence>
            </Feature>
            <Reference>
              <medline>98445350</medline>
              <Pubmed>
                <title>Identification of a human nuclear receptor defines a new signaling pathway for CYP3A induction.</title>
                <authors>Bertilsson G</authors>
                <authors>... 9 authors omitted ...</authors>
                <abstract>Nuclear receptors regulate metabolic pathways...</abstract>
                <source>Proc Natl Acad Sci U S A 1998 Oct 13;95(21):12208-13</source>
              </Pubmed>
            </Reference>
          </Genbank>
        </Target>
        <bl_accession>gb|AF084645|AF084645</bl_accession>
        <bl_description>Homo sapiens orphan nuclear receptor (PAR1) mRNA, complete cds.</bl_description>
        <bl_align_length>2106</bl_align_length>
        <bl_identities>99.4777</bl_identities>
        ...
      </Output>
      <Output>... 2 references/alignments omitted ...</Output>
    </Results>
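One benefit of the structured XML output is that downstream tools can navigate the hierarchy directly instead of parsing free text. As a rough sketch, the standard-library `xml.etree.ElementTree` module can pull out each target accession, its percent identity, and its CDS locations. The document below is a trimmed version of the slide's example; the exact nesting of the real output is an assumption based on that example.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the slide's example output (element names from the slide).
doc = """<?xml version="1.0"?>
<Results>
  <Query><qry_accession>AF061056</qry_accession></Query>
  <Output>
    <Target>
      <Genbank>
        <gb_accession>AF084645</gb_accession>
        <seq_length>2905</seq_length>
        <Feature><feature_type>source</feature_type><location>1..2905</location></Feature>
        <Feature><feature_type>CDS</feature_type><location>280..1584</location></Feature>
      </Genbank>
    </Target>
    <bl_identities>99.4777</bl_identities>
  </Output>
</Results>"""

root = ET.fromstring(doc)
rows = []
for out in root.iter("Output"):
    acc = out.findtext("Target/Genbank/gb_accession")
    identity = float(out.findtext("bl_identities"))
    # Collect the location of every CDS feature under this output entry.
    cds = [f.findtext("location") for f in out.iter("Feature")
           if f.findtext("feature_type") == "CDS"]
    rows.append((acc, identity, cds))

print(rows)  # [('AF084645', 99.4777, ['280..1584'])]
```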
31. Turn free-text web searches into typed searches
- Return GeneCards genes associated with liver
function and related to mouse genes on mouse
chromosome 10
- Accesses GeneCards
- GeneCards only permits free-text "mouse and liver
and chromosome 10" (58 Cards)
- TINet transforms into typed search on mammalian
homologues section
- Returns 16 GeneCards in 2:39 ± 19.1 secs
32. Protein Motif Searching
- For a bacterial sequence of interest, return all
matching PROSITE motifs that are attested in more
than 10 prokaryotic sequences in SwissProt.
- Data Sources
- Accelrys motifs program
- PROSITE semi-structured text file of annotated
motifs
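PROSITE motifs are written in a compact pattern syntax that maps closely onto regular expressions. The sketch below converts the common elements of that syntax (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, and the `<`/`>` anchors) into a Python regex. It is an illustration only, not how the motifs program used in TINet works, and it does not handle rarer constructs such as an anchor inside square brackets.

```python
import re

# One pattern element: a residue, a [set], or an {excluded set}, plus optional (n) or (n,m).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern string into an equivalent Python regex."""
    out = []
    for raw in pattern.rstrip(".").split("-"):
        prefix = "^" if raw.startswith("<") else ""
        suffix = "$" if raw.endswith(">") else ""
        m = _ELEMENT.fullmatch(raw.strip("<>"))
        if m is None:
            raise ValueError(f"unsupported element: {raw!r}")
        residue, lo, hi = m.groups()
        if residue == "x":                       # x means any residue
            residue = "."
        elif residue.startswith("{"):            # {P} means any residue except P
            residue = "[^" + residue[1:-1] + "]"
        repeat = ""
        if lo and hi:
            repeat = "{" + lo + "," + hi + "}"
        elif lo:
            repeat = "{" + lo + "}"
        out.append(prefix + residue + repeat + suffix)
    return "".join(out)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)                                  # N[^P][ST][^P]
print(bool(re.search(regex, "AANGTSAA")))     # True
```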
33. Advanced Sample Queries: Using Class Methods
34. Special-purpose filtering using O-O class methods
- Servers for method calls from within queries
- BLAST Alignment
- queries over alignments
- regexp motif matching over alignments
- GenBank Feature
- splicing exons into putative coding sequences
- extracting putative donor and acceptor sites
- extracting sequence windows (start and stop
codons)
35. Sample BLAST alignment: RGS3_HUMAN vs. genomic
    <BCMPROJECT_HAQJ.frag8 11699694
    Length = 14,486
    Plus Strand HSPs:
    Score = 164 (62.8 bits), Expect = 9.6e-07, P = 9.6e-07
    Identities = 48/130 (36%), Positives = 52/130 (40%), Frame = +2
    Query:    20 PGAEDSPPSKEP-SPGQELPPGQDLPPNKDSPSGQEPAPSQEPLSSKDSATSEGSPPGPD 78
                 P SPPS P SP PP LPP SPS P PS P PP P
    Sbjct: 11621 PSPPPSPPSPSPLSPTPPPPPSPSLPP-LPSPS-LPPPPSPSPSPPPPPSPPPSPPPSPS 11794
    Query:    79 APPSKDV--PPCQEPPPAQDLSPCQDLPAGQEPLPHQDPLLTKDLPAIQESPTRDLPPCQ 136
                 PPS PP PPP SP LP P P P P SP L P
    Sbjct: 11795 PPPSPSPSPPPPPSPPPSPPPSPSPSLPPSPSPPPSPSP---PPSPSPLPSPSPSLSP-- 11959
    Query:   137 DLPPSQVSLP 146
                 LPPS P
    Sbjct: 11960 SLPPSPSPSP 11989
36. Exclude proline-rich HSPs
- BLASTing RGS3_HUMAN vs. genomic
- proline-rich sequence yields 2781 HSPs from 250
sequences
- Filter on p-value alone (<1.0e-05)
- 590 HSPs from 148 sequences
- Add filter on prolines: 219 HSPs
- percent Ps among perfect matches < 25%
- BLAST cache server allows iterative refining of
filter conditions (0:34 ± .5 sec after first
iteration)
37. Identify novel neuropeptides
- BLAST NDDB_HUMAN neuropeptide precursor
- against genomic sequence
- Filter on motif conserved in target sequence at
query sequences
- Report hits in each separately
- Uses regular expression pattern matching
- From 14 HSPs to 9 HSPs in 0:04 ± .5 sec
38. Interrogate putative primary structure in genomic
sequence
- Of the human genomic sequences annotated as
channels and having exon boundaries in GenBank,
return only those having valid putative
donor/acceptor sites and start/stop codons
- Accesses GenBank, uses Feature methods
- 59/126 CDSs from 38/81 GB entries in 39:31 ± 28.7
sec
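A check of this kind can be sketched for the simplest case: a plus-strand gene whose exons all lie in one genomic sequence. The function below splices the exons, then tests for an ATG start, a stop codon, and canonical GT/AG dinucleotides at each intron boundary. The function names, the restriction to canonical sites, and the toy sequence are assumptions for illustration; the actual Feature methods may apply different rules.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(genomic: str, exons) -> str:
    """Concatenate exon segments given as 1-based inclusive (start, end) pairs."""
    return "".join(genomic[start - 1:end] for start, end in exons)

def valid_gene_structure(genomic: str, exons) -> bool:
    """Check start/stop codons and canonical GT...AG introns (plus strand only)."""
    genomic = genomic.upper()
    cds = splice(genomic, exons)
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        return False
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        donor = genomic[end:end + 2]                        # first 2 bases of the intron
        acceptor = genomic[next_start - 3:next_start - 1]   # last 2 bases of the intron
        if donor != "GT" or acceptor != "AG":
            return False
    return True

# Toy sequence: flank + exon 1 + intron + exon 2 + flank.
genomic = "CC" + "ATGGCC" + "GTAAGTTTCAG" + "GATTAA" + "GG"
print(splice(genomic, [(3, 8), (20, 25)]))                # ATGGCCGATTAA
print(valid_gene_structure(genomic, [(3, 8), (20, 25)]))  # True
print(valid_gene_structure(genomic, [(3, 8), (21, 25)]))  # False
```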
39. Sample GenBank record
    LOCUS       AB00408S10   2015 bp    DNA             ROD       14-APR-2000
    DEFINITION  Rat DNA for lanosterol 14-demethylase, complete cds and exon 10.
    ACCESSION   AB004096
    KEYWORDS    CYP51; lanosterol 14-demethylase.
    SOURCE      Rattus norvegicus (strain:Wister) DNA.
      ORGANISM  Rattus norvegicus
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
                Rattus.
    FEATURES             Location/Qualifiers
         CDS             join(AB004087.1:965..1138,AB004088.1:44..142,
                         AB004089.1:90..266,AB004090.1:29..155,AB004091.1:51..225,
                         AB004092.1:35..154,AB004093.1:53..248,AB004094.1:44..139,
                         AB004095.1:40..208,113..291)
                         /gene="CYP51"
                         /codon_start=1
                         /product="lanosterol 14-demethylase"
                         /protein_id="BAA20354.1"
                         /db_xref="GI:2159968"
40. join(AB004087.1:965..1138,...,AB004095.1:40..208,113..291)
The GenBank CDS Feature Location for AB004096
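A CDS location like this one spans several GenBank entries, so any method that splices exons first has to resolve which entry each segment lives in. The sketch below parses a simple `join(...)` of plain ranges, reading each remote segment as `accession.version:start..end` (the colon is an assumption restored from GenBank's remote-location syntax, since it is missing in the extracted slide text). The function name is hypothetical, and operators such as `complement()` or partial-end markers are not handled.

```python
import re

# Optional "ACCESSION.VERSION:" prefix followed by "start..end".
_SEGMENT = re.compile(r"(?:([A-Za-z0-9_]+\.\d+):)?(\d+)\.\.(\d+)")

def parse_join(location: str, local_accession: str):
    """Return a list of (accession, start, end) for a join() of simple ranges."""
    inner = "".join(location.split())            # drop line breaks and spaces
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        m = _SEGMENT.fullmatch(part)
        if m is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        acc, start, end = m.groups()
        segments.append((acc or local_accession, int(start), int(end)))
    return segments

location = """join(AB004087.1:965..1138,AB004088.1:44..142,
    AB004089.1:90..266,AB004090.1:29..155,AB004091.1:51..225,
    AB004092.1:35..154,AB004093.1:53..248,AB004094.1:44..139,
    AB004095.1:40..208,113..291)"""
segments = parse_join(location, "AB004096")
cds_length = sum(end - start + 1 for _, start, end in segments)
print(len(segments), cds_length, cds_length % 3)  # 10 1512 0
```

The spliced length is a whole number of codons, which is one quick sanity check on the parsed coordinates.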
41. Outline
- The Goal and Obstacles
- Our Approach
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Related Work
- Lessons Learned
- Acknowledgements
42. Choices of Integration Strategies
- Materialized vs. Non-materialized
- Data warehouse vs. virtual warehouse (middleware)
- Semantic vs. Syntactic heterogeneity
- Meaning vs. format
- Declarative vs. Procedural
- Query language vs. method or subroutine calls
- Generic vs. Hard coded
- Our Approach
43. Other Integration Approaches
- Relational data warehousing
- maintenance-intensive, doesn't scale well
- efficient updating is a research problem
- limited to SQL query language
- permits cleaning, reorganizing data
- potentially excellent query performance
- Hard-coded procedural apps (Perl, EJB, etc.)
- maintenance-intensive
- may be preferred for very limited queries over
very large datasets in a stable application
- method of choice for one-offs
44. Related Work
- SRS (LION Bioscience)
- fast, popular
- flat-file text data only
- limited query language
- Kleisli (U Penn & Kent Ridge, Singapore)
- based on functional programming languages
- Mostly pass-through; limited optimization
- DiscoveryLink (IBM)
- SQL-based
- excellent cost-based optimization
45. Related Work
- P/FDM (U Aberdeen)
- schema-based, functional data model
- research prototype only
- TAMBIS (U Manchester)
- addresses primarily semantic heterogeneity
- TaO (TAMBIS ontology) based on a description
logic
46. Outline
- The Goal and Obstacles
- Our Approach
- A Motivating Example
- The TINet System
- Sample Queries and Results
- Related Work
- Lessons Learned
- What Worked, What Didn't
- What Real-World Integration in Bioinformatics
Requires
- Acknowledgements
47. Lessons Learned: Issues
- More optimization and parallelization needed
- Object-relational query language hindered
acceptance by developers
- GenBank data warehouse hard to keep current
(size)
- Success of web site access highly dependent on
stability of site
48. Lessons Learned: What Worked
- Data integration via database middleware
- Domain representation
- Rich, object-relational schemas (sets, lists!)
- ASDTs provided non-relational processing
capability
- Generic server approach
- Special server successes
- BLAST server
- BLAST Alignment ASDT
- BLAST-cache server
- Web server (PubMed)
- Miscellaneous
- Integration of Perl functions into C ASDT
server
- CORBA
49. What Real-World Integration in Bioinformatics
Requires
- Embrace the non-traditional data landscape!
- Absence of standardization
- Importance of public data repositories
- Multitude of web sites
- Applications as data sources
- Nimble, prototyping culture
- UDFs/Class methods for specialized filtering of
results
- Model limited capabilities of non-traditional
data sources and use them in query planning and
optimization
50. Query Planning and Optimization for Web Data
Sources
- Web sources are ubiquitous in Life Sciences
- but query capabilities, metadata and semantics
are not easy to reason about or even to determine
- Challenges
- Capture diverse, often complex source
capabilities in a catalog representation
- Develop methods/tools to extract metadata from
sources
- Source contents, access costs, interrelationships
(joins)
- Expand current state of capability-based
optimization to exploit diverse, complex
capabilities and metadata of biological web data
sources
- select sources and capabilities for best result
set
- generate low cost query evaluation plans
efficiently
Eckman, Lacroix, Raschid, BIBE 2001 and Intl J
Bioinf Bioeng, 2002
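To make the idea of a capability catalog concrete, here is a minimal sketch, not the method of the cited papers. Each catalog entry records which attributes a source must be given as input, which it returns, and an access cost; the planner then picks the cheapest entry that is usable with the bindings at hand. The `Capability` class, the source names, and the costs are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

@dataclass(frozen=True)
class Capability:
    source: str                     # hypothetical source name
    bound_inputs: FrozenSet[str]    # attributes the source requires as input
    outputs: FrozenSet[str]         # attributes the source can return
    cost: float                     # estimated access cost (arbitrary units)

def cheapest_capability(catalog: Iterable[Capability],
                        available: FrozenSet[str],
                        wanted: FrozenSet[str]) -> Optional[Capability]:
    """Pick the lowest-cost capability whose inputs are bound and that returns `wanted`."""
    usable = [c for c in catalog
              if c.bound_inputs <= available and wanted <= c.outputs]
    return min(usable, key=lambda c: c.cost, default=None)

catalog = [
    Capability("web_site_A", frozenset({"keyword"}),
               frozenset({"accession", "title"}), 5.0),
    Capability("flat_file_B", frozenset(),
               frozenset({"accession", "title", "sequence"}), 60.0),
    Capability("web_site_C", frozenset({"accession"}),
               frozenset({"sequence"}), 2.0),
]

have = frozenset({"keyword"})
print(cheapest_capability(catalog, have, frozenset({"accession"})).source)  # web_site_A
print(cheapest_capability(catalog, have, frozenset({"sequence"})).source)   # flat_file_B
```

A fuller planner would also chain capabilities (for example, using `web_site_A` to obtain accessions and then `web_site_C` for sequences) and compare the cost of the chain against the single-source plan.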
51. Acknowledgements
- GlaxoSmithKline
- Pankaj Agarwal
- Roberto Alvarez
- Jim Fickett
- William Hayes
- Mark Hurle
- Rob Knowlton
- Chris Larminie
- David Michalovich
- Bill Morgart
- Pankaj Patel
- Ken Rice
- Bob Reid
- Harpreet Singh
- Lin Yue
- Gene Logic
- Victor Markowitz
- I-Min Chen
- Madhavan Ganesh
- Zoe Lacroix
- Ernest Szeto
- Paula Ta-Shma
- Thodoros Topaloglou
- Victoria Wang