Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data


1
Extending Traditional Query-Based Integration
Approaches for Functional Characterization of
Post-Genomic Data

Barbara A. Eckman, Ph.D.
Senior Consulting I/T Architect
IBM Life Sciences Solutions
baeckman_at_us.ibm.com
Joint work at GlaxoSmithKline with
Anthony S. Kosky, Gene Logic, Inc.
Leonardo A. Laroco, Jr., GlaxoSmithKline

Bioinformatics 2001
2
Outline

Goal and Obstacles
A Motivating Example
The TINet System
Sample Queries and Results
Related Work
Lessons Learned
Acknowledgements

3
The Goal

To make data interpretation easier and faster for
human experts through an integrated, queryable
view of genomic sequence and associated data

4

Obstacles to Integration

Data spread over multiple, heterogeneous dbs
Not all are easily queried
flat file sequence dbs, web sites, BLAST
alignments
Some are not even easily parsed!
Not all represent biology optimally
Genbank is sequence-centric, not gene-centric
SwissProt is sequence-centric, not domain-centric
Hard to keep results up-to-date
Non-traditional query approaches are needed to
exclude extraneous results

5
This is a hard problem.
6
Outline

Goal and Obstacles
A Motivating Example
The TINet System
Sample Queries and Results
Related Work
Lessons Learned
Acknowledgements

7
Beyond Browsing to One-Step Querying
8

Select kinase cDNAs for microarray experiments

Return HUGO name, length, GB accession of the
longest full-length cDNA sequence related to each
GeneCards entry that has been annotated as a
kinase

9
The browsing approach

Browsing would take 4,020 web page visits,
assuming an average of 5 cDNAs for each of the
670 GeneCards (670 x (1 + 5) = 4,020)
Brain-numbing, error-prone
Sub-optimal use of resources!

10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
The querying approach

Write short query

select
  hugo_name = gc.hugo_name,
  seq_length = gs2.seq_length,
  longest_acc = gs2.primary_accession
from
  gc in GENECARDSGenecard,
  gc_accs = gc.unigene_nucleic_acids.cluster.cdna_accessions,
  maxlen = max (select gs.seq_length
                from gs in genbankSeq
                where gs.topology = mRNA
                and gs.accessions in gc_accs
               ),
  gs2 in genbankSeq
where
  gc.text_search match "kinase"
  and gs2.accessions in gc_accs
  and gs2.topology = mRNA
  and gs2.seq_length = maxlen

Execute it
1 hour later get result (670 rows)
15
GeneCards Schema
16
Outline

Goal and Obstacles
A Motivating Example
The TINet System
Sample Queries and Results
Related Work
Lessons Learned
Acknowledgements

17
Federated Technology

Query multiple heterogeneous data sources as if
they were components of a single large database.

(Diagram labels: blast1, blast2, SRS, motifs, MGD, Gene Exp, GenBank, GPCR, SwissProt, GeneCards, PubMed; Result)
18
The Target Informatics Net (TINet) System

Middleware: Gene Logic's OPM System
Object-relational data model
SQL-like query language
CORBA used for distributed access
SDK to write wrappers for new data source types
Limited, judicious use of data warehousing
GenBank, SwissProt
Scientific value added

19
(No Transcript)
20
Data Sources Integrated in TINet

Mouse Genome DataBase (MGD) (Sybase)
SwissProt, GenBank, PROSITE (flat-file)
GeneCards, PubMed (web sites)
On-the-fly BLAST searches (WU-BLAST2, NCBI BLAST
2; PSI-BLAST in development)
On-the-fly PROSITE motif searches
SRS databanks
GlaxoSmithKline proprietary genomic data (Sybase)

21
TINet Component Services

BLAST (WU, NCBI) with BLAST-cache
Generic web site server
PubMed, GeneCards
Generic flat file server (XML)
generic loaders for releases and nightly updates
GenBank, SwissProt, PROSITE
Generic application server
PROSITE search via GCG motifs
SRS databanks
ASDT server for method calls (BLAST, GB)

22
Outline

Goal and Obstacles
A Motivating Example
The TINet System
Sample Queries and Results
Basic
Advanced Using Class Methods
Related Work
Lessons Learned
Acknowledgements

23
Benchmarking

Real-world queries
all 1/2 page long, most 15 minutes to write
10 iterations
Core TINet system deployed on
2 CPU UltraSPARC-II, 296 MHz, 1.5 GB, SunOS 5.6
Visigenic Visibroker ORB
BLAST server: 14x524 DEC alpha
Consistent performance (max sd 12% of mean)
Times reported omit BLAST execution time
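
A timing harness of the kind this slide describes (repeat a query 10 times, report mean and standard deviation, express the sd as a percentage of the mean) might look like the sketch below. The run_query callable is a hypothetical stand-in for submitting a query, and the use of the sample standard deviation is an assumption; the slides do not say which estimator was used.

```python
import statistics
import time


def summarize(times):
    """Return (mean, sd, sd as a percentage of the mean) for a list of timings in seconds."""
    mean = statistics.mean(times)
    sd = statistics.stdev(times)                 # sample standard deviation (assumed)
    pct = 100.0 * sd / mean if mean else 0.0     # guard against a zero mean
    return mean, sd, pct


def benchmark(run_query, iterations=10):
    """Time run_query() `iterations` times and summarize the wall-clock results."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return summarize(times)


if __name__ == "__main__":
    # Hypothetical workload standing in for a real query submission.
    mean, sd, pct = benchmark(lambda: sum(range(100_000)))
    print(f"{mean:.4f} s +/- {sd:.4f} s (sd is {pct:.1f}% of mean)")
```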

24
Basic Sample Queries
25
Find ESTs that may correspond to interesting
neurological targets

Return accession numbers and definitions of EST
sequences that are similar (60% identical over 50
AA) to calcium channel sequences in SwissProt
that have references published since 1995 with
"brain" in the abstract.
Data sources: SwissProt, PubMed, tblastn2 vs.
gbest
Results
539 HSPs from 538 ESTs in 11:58 ± 46 sec
Is this best done with a workflow or a query?

26
Mine recent literature for serotonin-related
expression data linked to GenBank sequences

Return the source, abstract and GB accession
number for all articles published in 2000 that
have a MeSH term "gene expression", mention
"serotonin", and are linked to nucleotide
sequences in GenBank
Accesses PubMed
Can't be done interactively
Compensates via post-processing
5 articles related to 6 seqs in 10 ± .7 seconds

27
Find full-length coding sequences in GenBank

Return the accession and definition of sequences
annotated as channels for which the CDS start and
stop locations are specified unambiguously.
Accesses GenBank
Impossible using Entrez interface
1574 features on 1331 unique sequences in 27:55 ±
3.4 seconds

28
Find possible novel drug targets

Find homologues (60% identity over 50 AA) in
human genomic sequence to all novel genes cloned
and reported in the literature in the past 3
months whose articles are annotated with the
Medline MeSH term "neoplasms"
Accesses PubMed, GenBank, BLAST
1/2 page of OPM-MQL vs. many browser clicks, many
lines of Perl glue code
131 HSPs involving 47 genomic sequences and 6
novel gene sequences in 4:21 ± 22.9 sec

29
Identify potential new orthologs/paralogs
related to a gene of interest

Retrieve, filter and summarize relevant
references, GB annotations, and BLAST alignments
for potential paralog/ortholog sequences
identified through alignment with a gene of
interest (e.g., 95% over 400 bp)
Accesses GenBank, BLAST, and PubMed
Highly structured XML output organizes large,
hierarchical result set
5 XML entries involving 5 GB seqs with 18
Features, 3 PubMed entries in 00:32 ± 1.5 secs

30
XML output
<?xml version="1.0"?>
<!-- blastn of GenBank AF061056 against mRNA subset of GenBank -->
<Results>
  <Query>
    <!-- GenBank accession id used to retrieve query sequence -->
    <qry_accession>AF061056</qry_accession>
    <qry_sequence> 1 tgaaatatag gtgagagaca agattgtctc atatccgggg aaatcataac ctatgactag ... </qry_sequence>
  </Query>
  <Output>
    <!-- GenBank/PubMed reference data linked from filtered alignment -->
    <Target>
      <Genbank>
        <gb_accession>AF084645</gb_accession>
        <organism>Homo Sapiens</organism>
        <seq_length>2905</seq_length>
        <locus_name>AF084645</locus_name>
        <Feature>
          <feature_type>source</feature_type>
          <tissue_type>liver</tissue_type>
          <location>1..2905</location>
          <sequence>cctctgaag... 2887 bases omitted ...aaaaaaaaa</sequence>
        </Feature>
        <Feature>
          <feature_type>CDS</feature_type>
          <location>280..1584</location>
          <protein_id>AAC64558.1</protein_id>
          <sequence>ctggaggtg... 1287 bases omitted ...ggtagctga</sequence>
        </Feature>
        <Reference>
          <medline>98445350</medline>
          <Pubmed>
            <title>Identification of a human nuclear receptor defines a new signaling pathway for CYP3A induction.</title>
            <authors>Bertilsson G</authors>
            <authors>... 9 authors omitted ...</authors>
            <abstract>Nuclear receptors regulate metabolic pathways...</abstract>
            <source>Proc Natl Acad Sci U S A 1998 Oct 13;95(21):12208-13</source>
          </Pubmed>
        </Reference>
      </Genbank>
    </Target>
    <bl_accession>gb|AF084645|AF084645</bl_accession>
    <bl_description>Homo sapiens orphan nuclear receptor (PAR1) mRNA, complete cds.</bl_description>
    <bl_align_length>2106</bl_align_length>
    <bl_identities>99.4777</bl_identities>
    ...
  </Output>
  <Output>... 2 references/alignments omitted ...</Output>
</Results>
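
A consumer of this kind of hierarchical output could walk it with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a trimmed sample; the element names are taken from the slide, but the sample document is abbreviated and the cds_locations helper is hypothetical, not part of TINet.

```python
import xml.etree.ElementTree as ET

# Trimmed sample using element names from the slide; most fields are left out.
SAMPLE = """<?xml version="1.0"?>
<Results>
  <Output>
    <Target>
      <Genbank>
        <gb_accession>AF084645</gb_accession>
        <Feature>
          <feature_type>source</feature_type>
          <location>1..2905</location>
        </Feature>
        <Feature>
          <feature_type>CDS</feature_type>
          <location>280..1584</location>
        </Feature>
      </Genbank>
    </Target>
    <bl_identities>99.4777</bl_identities>
  </Output>
</Results>"""


def cds_locations(xml_text):
    """Return (accession, CDS location, percent identity) for every CDS feature."""
    root = ET.fromstring(xml_text)
    rows = []
    for output in root.iter("Output"):
        acc = output.findtext("Target/Genbank/gb_accession")
        identity = output.findtext("bl_identities")
        pct = float(identity) if identity else None
        for feat in output.iter("Feature"):
            if feat.findtext("feature_type") == "CDS":
                rows.append((acc, feat.findtext("location"), pct))
    return rows


print(cds_locations(SAMPLE))
```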
31
Turn free-text web searches into typed searches

Return GeneCards genes associated with liver
function and related to mouse genes on mouse
chromosome 10
Accesses GeneCards
GeneCards only permits free-text "mouse and liver
and chromosome 10" (58 Cards)
TINet transforms into typed search on mammalian
homologues section
Returns 16 GeneCards in 2:39 ± 19.1 secs

32
Protein Motif Searching

For a bacterial sequence of interest, return all
matching PROSITE motifs that are attested in more
than 10 prokaryotic sequences in SwissProt.
Data Sources
Accelrys motifs program
PROSITE semi-structured text file of annotated
motifs

33
Advanced Sample Queries Using Class Methods
34
Special-purpose filtering using O-O class methods

Servers for method calls from within queries
BLAST Alignment
queries over alignments
regexp motif matching over alignments
GenBank Feature
splicing exons into putative coding sequences
extracting putative donor and acceptor sites
extracting sequence windows (start and stop
codons)

35
Sample BLAST alignment: RGS3_HUMAN vs. genomic

<BCMPROJECT_HAQJ.frag8 11699694
Length = 14,486
Plus Strand HSPs
Score = 164 (62.8 bits), Expect = 9.6e-07, P = 9.6e-07
Identities = 48/130 (36%), Positives = 52/130 (40%), Frame = 2
Query    20 PGAEDSPPSKEP-SPGQELPPGQDLPPNKDSPSGQEPAPSQEPLSSKDSATSEGSPPGPD 78
            P    SPPS  P SP    PP   LPP   SPS   P PS  P            PP P
Sbjct 11621 PSPPPSPPSPSPLSPTPPPPPSPSLPP-LPSPS-LPPPPSPSPSPPPPPSPPPSPPPSPS 11794
Query    79 APPSKDV--PPCQEPPPAQDLSPCQDLPAGQEPLPHQDPLLTKDLPAIQESPTRDLPPCQ 136
             PPS     PP   PPP    SP   LP    P P   P      P    SP   L P
Sbjct 11795 PPPSPSPSPPPPPSPPPSPPPSPSPSLPPSPSPPPSPSP---PPSPSPLPSPSPSLSP-- 11959
Query   137 DLPPSQVSLP 146
             LPPS    P
Sbjct 11960 SLPPSPSPSP 11989

36
Exclude proline-rich HSPs

BLASTing RGS3_HUMAN vs. genomic
proline-rich sequence yields 2781 HSPs from 250
sequences
Filter on p-value alone (< 1.0e-05)
590 HSPs from 148 sequences
Add filter on prolines: 219 HSPs
percent Ps among perfect matches < 25%
BLAST cache server allows iterative refining of
filter conditions (00:34 ± .5 sec after first
iteration)
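
The proline filter can be illustrated with a short Python sketch. It reads "percent Ps among perfect matches" as the fraction of identical aligned positions whose residue is proline, which is an interpretation rather than the documented TINet method; the function names and default thresholds are hypothetical. The sample strings are the last alignment block (Query 137..146) of the RGS3_HUMAN alignment slide.

```python
def proline_fraction(query, sbjct):
    """Fraction of identical aligned residues that are proline (P)."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")
    identical = [q for q, s in zip(query, sbjct) if q == s and q != "-"]
    if not identical:
        return 0.0
    return identical.count("P") / len(identical)


def keep_hsp(p_value, query, sbjct, max_p=1.0e-05, max_proline=0.25):
    """Keep an HSP only if it passes both the p-value and the proline filter."""
    return p_value < max_p and proline_fraction(query, sbjct) < max_proline


# Last block of the sample alignment: identities are L, P, P, S, P.
query = "DLPPSQVSLP"
sbjct = "SLPPSPSPSP"
print(proline_fraction(query, sbjct))
print(keep_hsp(9.6e-07, query, sbjct))
```

Because the two thresholds are plain parameters, filter conditions can be re-run against cached alignments without repeating the BLAST search, which is the kind of iterative refinement the slide attributes to the BLAST cache server.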

37
Identify novel neuropeptides

BLAST NDDB_HUMAN neuropeptide precursor
against genomic sequence

Filter on motif conserved in target sequence at
query sequences
Report hits in each
separately
Uses regular expression pattern matching
From 14 HSPs to 9 HSPs in 0:04 ± .5 sec
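
Regular-expression filtering over alignments can be sketched as follows. The slide does not give the motif used, so the pattern below (a glycine followed by two basic residues) is purely a hypothetical example, as are the HSP dictionaries and function names. The sketch keeps an HSP only when the motif occurs in both the query and the target segment, and reports the hits in each separately.

```python
import re


def motif_hits(aligned_seq, pattern):
    """Return (start, matched_text) for each motif hit, ignoring gap characters."""
    ungapped = aligned_seq.replace("-", "")
    return [(m.start(), m.group()) for m in re.finditer(pattern, ungapped)]


def filter_hsps(hsps, pattern):
    """Keep HSPs whose query and subject segments both contain the motif."""
    kept = []
    for hsp in hsps:
        q_hits = motif_hits(hsp["query"], pattern)
        s_hits = motif_hits(hsp["sbjct"], pattern)
        if q_hits and s_hits:
            kept.append({**hsp, "query_hits": q_hits, "sbjct_hits": s_hits})
    return kept


# Invented HSP segments, for illustration only.
hsps = [
    {"query": "AGKRSL-FG", "sbjct": "AGRRSLWFG"},   # motif present in both
    {"query": "AGKRSL", "sbjct": "AGQASL"},         # motif lost in the subject
]
MOTIF = r"G[KR][KR]"   # hypothetical pattern, not the one used in the talk
kept = filter_hsps(hsps, MOTIF)
print(kept)
```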

38
Interrogate putative primary structure in genomic
sequence

Of the human genomic sequences annotated as
channels and having exon boundaries in GenBank,
return only those having valid putative
donor/acceptor sites and start/stop codons
Accesses GenBank, uses Feature methods
59/126 CDSs from 38/81 GB entries in 39:31 ± 28.7
sec
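
A check of this kind can be sketched in Python as below. It is a simplified illustration, not the TINet Feature methods: it assumes forward-strand exons given as 1-based inclusive coordinates, accepts only the canonical GT donor and AG acceptor dinucleotides, and tests for an ATG start and a TAA/TAG/TGA stop. The function name and toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(genomic, exons):
    """Return a list of problems found; an empty list means the structure looks valid.

    exons: ordered 1-based inclusive (start, end) pairs on the forward strand.
    """
    seq = genomic.upper()
    problems = []
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no stop codon at CDS end")
    if len(cds) % 3:
        problems.append("CDS length not a multiple of 3")
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        # Donor site: first two bases of the intron, right after the exon end.
        if seq[end:end + 2] != "GT":
            problems.append(f"non-GT donor after position {end}")
        # Acceptor site: last two bases of the intron, right before the next exon.
        if seq[next_start - 3:next_start - 1] != "AG":
            problems.append(f"non-AG acceptor before position {next_start}")
    return problems


# Toy sequence: exon 1 = 1..6, intron = 7..14, exon 2 = 15..20.
good = "ATGGCC" + "GTAAAAAG" + "TTTTAA"
bad = "ATGGCC" + "CTAAAAAG" + "TTTTAA"
print(check_gene_structure(good, [(1, 6), (15, 20)]))
print(check_gene_structure(bad, [(1, 6), (15, 20)]))
```

Real annotations also contain reverse-strand features, partial coordinates and non-canonical splice sites, none of which this sketch handles.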

39
Sample GenBank record

LOCUS       AB00408S10   2015 bp    DNA             ROD       14-APR-2000
DEFINITION  Rat DNA for lanosterol 14-demethylase, complete cds and exon 10.
ACCESSION   AB004096
KEYWORDS    CYP51; lanosterol 14-demethylase.
SOURCE      Rattus norvegicus (strain:Wister) DNA.
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
            Rattus.
FEATURES             Location/Qualifiers
     CDS             join(AB004087.1:965..1138,AB004088.1:44..142,
                     AB004089.1:90..266,AB004090.1:29..155,AB004091.1:51..225,
                     AB004092.1:35..154,AB004093.1:53..248,AB004094.1:44..139,
                     AB004095.1:40..208,113..291)
                     /gene="CYP51"
                     /codon_start=1
                     /product="lanosterol 14-demethylase"
                     /protein_id="BAA20354.1"
                     /db_xref="GI:2159968"
40
join(AB004087.1:965..1138,...,AB004095.1:40..208,113..291)
The GenBank CDS Feature Location for AB004096
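
To show why such a location is awkward to query directly, here is a minimal Python sketch that parses a join() location whose segments may point at other entries. It assumes the accession.version prefix is separated from the coordinates by a colon (for example AB004087.1:965..1138), as in the GenBank feature table format; the regular expression and function name are hypothetical, and complement() or partial (< and >) locations are not handled.

```python
import re

# Optional "ACCESSION.VERSION:" prefix followed by "start..end".
SEGMENT = re.compile(
    r"(?:(?P<acc>[A-Za-z0-9_]+\.\d+):)?(?P<start>\d+)\.\.(?P<end>\d+)"
)


def parse_join(location, local_accession):
    """Return a list of (accession, start, end) tuples for a join() location."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        m = SEGMENT.fullmatch(part.strip())
        if m is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        # Segments without a prefix refer to the entry that holds the feature.
        segments.append((m.group("acc") or local_accession,
                         int(m.group("start")), int(m.group("end"))))
    return segments


LOCATION = ("join(AB004087.1:965..1138,AB004088.1:44..142,AB004089.1:90..266,"
            "AB004090.1:29..155,AB004091.1:51..225,AB004092.1:35..154,"
            "AB004093.1:53..248,AB004094.1:44..139,AB004095.1:40..208,113..291)")

segments = parse_join(LOCATION, "AB004096")
cds_length = sum(end - start + 1 for _, start, end in segments)
print(len(segments), cds_length)
```

Splicing the exons into a putative coding sequence would then require fetching each referenced entry and concatenating the listed ranges in order.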
41
Outline

The Goal and Obstacles
Our Approach
A Motivating Example
The TINet System
Sample Queries and Results
Related Work
Lessons Learned
Acknowledgements

42
Choices of Integration Strategies

Materialized vs. Non-materialized
Data warehouse vs. virtual warehouse (middleware)
Semantic vs. Syntactic heterogeneity
Meaning vs. format
Declarative vs. Procedural
Query language vs. method or subroutine calls
Generic vs. Hard coded

Our Approach

43
Other Integration Approaches

Relational data warehousing
maintenance-intensive, doesn't scale well
efficient updating is a research problem
limited to SQL query language
permits cleaning, reorganizing data
potentially excellent query performance
Hard-coded procedural apps (Perl, EJB, etc)
maintenance-intensive
may be preferred for very limited queries over
very large datasets in a stable application
method of choice for one-offs

44
Related Work

SRS (Lion Biosciences)
fast, popular
flat-file text data only
limited query language
Kleisli (U Penn / Kent Ridge, Singapore)
based on functional programming languages
Mostly pass-through; limited optimization
DiscoveryLink (IBM)
SQL-based
excellent cost-based optimization

45
Related Work

P/FDM (U Aberdeen)
schema-based, functional data model
research prototype only
TAMBIS (U Manchester)
addresses primarily semantic heterogeneity
TaO (TAMBIS ontology) based on a description
logic

46
Outline

The Goal and Obstacles
Our Approach
A Motivating Example
The TINet System
Sample Queries and Results
Related Work
Lessons Learned
What Worked, What Didn't
What Real-World Integration in Bioinformatics
Requires
Acknowledgements

47
Lessons Learned: Issues

More optimization and parallelization needed
Object-relational query language hindered
acceptance by developers
GenBank data warehouse hard to keep current
(size)
Success of web site access highly dependent on
stability of site

48
Lessons Learned: What Worked

Data integration via database middleware
Domain representation
Rich, object-relational schemas (sets, lists!)
ASDTs provided non-relational processing
capability
Generic server approach
Special server successes
BLAST server
BLAST Alignment ASDT
BLAST-cache server
Web server (PubMed)
Miscellaneous
Integration of Perl functions into C ASDT
server
CORBA

49
What Real-World Integration in Bioinformatics
Requires

Embrace the non-traditional data landscape!
Absence of standardization
Importance of public data repositories
Multitude of web sites
Applications as data sources
Nimble, prototyping culture
UDFs/Class methods for specialized filtering of
results
Model limited capabilities of non-traditional
data sources and use them in query planning and
optimization

50
Query Planning and Optimization for Web Data
Sources

Web sources are ubiquitous in Life Sciences
but query capabilities, metadata and semantics
are not easy to reason about or even to determine
Challenges
Capture diverse, often complex source
capabilities in a catalog representation
Develop methods/tools to extract metadata from
sources
Source contents, access costs, interrelationships
(joins)
Expand current state of capability-based
optimization to exploit diverse, complex
capabilities and metadata of biological web data
sources
select sources and capabilities for best result
set
generate low cost query evaluation plans
efficiently
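
As a rough illustration of what a capability catalog and capability-based source selection might look like, the Python sketch below records which filters each source can evaluate natively and picks the cheapest plan. Everything in it is hypothetical: the source names, the cost figures and the cost model (each filter that cannot be pushed down doubles the estimated cost) are assumptions for the example, not a description of TINet or any existing optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapability:
    source: str
    entity: str              # what the source returns
    searchable: frozenset    # fields the source can filter on natively
    cost: float              # estimated access cost, arbitrary units


def choose_source(catalog, entity, filters):
    """Return (source, pushed-down filters, residual filters) for the cheapest plan."""
    candidates = [c for c in catalog if c.entity == entity]
    if not candidates:
        raise LookupError(f"no source provides {entity!r}")

    def plan_cost(cap):
        residual = set(filters) - cap.searchable
        # Assumed cost model: each filter applied in post-processing doubles the cost.
        return cap.cost * (2 ** len(residual))

    best = min(candidates, key=plan_cost)
    pushed = sorted(set(filters) & best.searchable)
    residual = sorted(set(filters) - best.searchable)
    return best.source, pushed, residual


# Hypothetical catalog entries.
catalog = [
    SourceCapability("web_site", "article", frozenset({"free_text"}), 5.0),
    SourceCapability("flat_file_warehouse", "article",
                     frozenset({"free_text", "mesh_term", "year"}), 8.0),
]

print(choose_source(catalog, "article", ["mesh_term", "year"]))
print(choose_source(catalog, "article", ["free_text"]))
```

The residual list corresponds to conditions the middleware would have to apply itself after retrieval, in the way the PubMed example earlier compensates via post-processing.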