Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data

1
Extending Traditional Query-Based Integration
Approaches for Functional Characterization of
Post-Genomic Data
  • Barbara A. Eckman, Ph.D.
  • Senior Consulting I/T Architect
  • IBM Life Sciences Solutions
  • baeckman_at_us.ibm.com
  • Joint work at GlaxoSmithKline with
  • Anthony S. Kosky, Gene Logic, Inc.
  • Leonardo A. Laroco, Jr., GlaxoSmithKline

Bioinformatics 2001
2
Outline
  • Goal and Obstacles
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
  • Related Work
  • Lessons Learned
  • Acknowledgements

3
The Goal
  • To make data interpretation easier and faster for
    human experts through an integrated, queryable
    view of genomic sequence and associated data

4

Obstacles to Integration
  • Data spread over multiple, heterogeneous dbs
  • Not all are easily queried
      • flat file sequence dbs, web sites, BLAST
        alignments
      • Some are not even easily parsed!
  • Not all represent biology optimally
      • GenBank is sequence-centric, not gene-centric
      • SwissProt is sequence-centric, not domain-centric
  • Hard to keep results up-to-date
  • Non-traditional query approaches are needed to
    exclude extraneous results

5
This is a hard problem.
6
Outline
  • Goal and Obstacles
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
  • Related Work
  • Lessons Learned
  • Acknowledgements

7
Beyond Browsing to One-Step Querying
8

Select kinase cDNAs for microarray experiments
  • Return HUGO name, length, GB accession of the
    longest full-length cDNA sequence related to each
    GeneCards entry that has been annotated as a
    kinase

9
The browsing approach
  • If browsing, would take 4020 web page visits
    (670 GeneCards pages plus 670 × 5 cDNA pages),
    assuming an average of 5 cDNAs for each of the
    670 GeneCards
  • Brain-numbing, error-prone
  • Sub-optimal use of resources!

14
The querying approach
  • Write short query:
        select
            hugo_name = gc.hugo_name,
            seq_length = gs2.seq_length,
            longest_acc = gs2.primary_accession
        from
            gc in GENECARDS:Genecard,
            gc_accs = gc.unigene_nucleic_acids.cluster.cdna_accessions,
            maxlen = max (select gs.seq_length
                          from gs in genbank:Seq
                          where gs.topology = mRNA
                          and gs.accessions in gc_accs
                         ),
            gs2 in genbank:Seq
        where
            gc.text_search match "kinase"
            and gs2.accessions in gc_accs
            and gs2.topology = mRNA
            and gs2.seq_length = maxlen
  • Execute it
  • 1 hour later get result (670 rows)
15
GeneCards Schema
16
Outline
  • Goal and Obstacles
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
  • Related Work
  • Lessons Learned
  • Acknowledgements

17
Federated Technology
  • Query multiple heterogeneous data sources as if
    they were components of a single large database.

[Diagram: a single Result drawn from federated sources: blast1, blast2, SRS, motifs, MGD, Gene Exp, GenBank, GPCR, SwissProt, GeneCards, PubMed]
18
The Target Informatics Net (TINet) System
  • Middleware: Gene Logic's OPM System
      • Object-relational data model
      • SQL-like query language
      • CORBA used for distributed access
      • SDK to write wrappers for new data source types
  • Limited, judicious use of data warehousing
      • GenBank, SwissProt
  • Scientific value added
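
The wrapper idea can be sketched as a small Python interface. This is a generic illustration of federated access through per-source wrappers, not the OPM SDK; every class and method name below (`SourceWrapper`, `InMemoryWrapper`, `Federation`, `scan`) is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class SourceWrapper(ABC):
    """Uniform interface each data source exposes to the middleware."""

    @abstractmethod
    def scan(self, class_name: str) -> Iterator[dict]:
        """Yield the objects of one class as plain dictionaries."""


class InMemoryWrapper(SourceWrapper):
    """Toy wrapper; a real one would query Sybase, parse flat files, etc."""

    def __init__(self, classes: Dict[str, List[dict]]) -> None:
        self._classes = classes

    def scan(self, class_name: str) -> Iterator[dict]:
        return iter(self._classes.get(class_name, []))


class Federation:
    """Routes class scans to whichever registered source owns the class."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceWrapper] = {}

    def register(self, name: str, wrapper: SourceWrapper) -> None:
        self._sources[name] = wrapper

    def scan(self, source: str, class_name: str) -> Iterator[dict]:
        return self._sources[source].scan(class_name)


fed = Federation()
fed.register("genecards", InMemoryWrapper(
    {"Genecard": [{"hugo_name": "GENE1", "accession": "X00002"}]}))
fed.register("genbank", InMemoryWrapper(
    {"Seq": [{"accession": "X00002", "seq_length": 3400}]}))

# A cross-source join written as if both classes lived in one database.
joined = [(gc["hugo_name"], gs["seq_length"])
          for gc in fed.scan("genecards", "Genecard")
          for gs in fed.scan("genbank", "Seq")
          if gc["accession"] == gs["accession"]]
print(joined)
```

Adding a new source type then means writing one more `SourceWrapper` subclass, without touching the queries.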

20
Data Sources Integrated in TINet
  • Mouse Genome DataBase (MGD) (Sybase)
  • SwissProt, GenBank, PROSITE (flat-file)
  • GeneCards, PubMed (web sites)
  • On-the-fly BLAST searches (WU-BLAST2, NCBI BLAST
    2; PSI-BLAST in development)
  • On-the-fly PROSITE motif searches
  • SRS databanks
  • GlaxoSmithKline proprietary genomic data (Sybase)

21
TINet Component Services
  • BLAST (WU, NCBI) with BLAST-cache
  • Generic web site server
  • PubMed, GeneCards
  • Generic flat file server (XML)
  • generic loaders for releases and nightly updates
  • GenBank, SwissProt, PROSITE
  • Generic application server
  • PROSITE search via GCG motifs
  • SRS databanks
  • ASDT server for method calls (BLAST, GB)

22
Outline
  • Goal and Obstacles
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
      • Basic
      • Advanced: Using Class Methods
  • Related Work
  • Lessons Learned
  • Acknowledgements

23
Benchmarking
  • Real-world queries
  • all 1/2 page long, most 15 minutes to write
  • 10 iterations
  • Core TINet system deployed on
  • 2 CPU UltraSPARC-II, 296 MHz, 1.5 GB, SunOS 5.6
  • Visigenic Visibroker ORB
  • BLAST server 14x524 DEC alpha
  • Consistent performance (max SD 12% of mean)
  • Times reported omit BLAST execution time
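
A minimal sketch of how timings of this kind can be summarized as mean ± standard deviation over 10 iterations. The timings are made up, and the use of the sample standard deviation is an assumption; the slides do not say which estimator was used.

```python
import statistics
from typing import List, Tuple


def summarize(times_sec: List[float]) -> Tuple[float, float, float]:
    """Return mean, standard deviation, and SD as a percentage of the mean."""
    mean = statistics.mean(times_sec)
    sd = statistics.stdev(times_sec)   # sample SD; population SD is pstdev
    return mean, sd, 100.0 * sd / mean


def as_min_sec(mean: float, sd: float) -> str:
    """Format a mean in seconds as m:ss, followed by the SD in seconds."""
    minutes, seconds = divmod(round(mean), 60)
    return f"{minutes}:{seconds:02d} ± {sd:.1f} sec"


# Ten made-up wall-clock timings in seconds.
timings = [150, 162, 158, 171, 149, 160, 155, 166, 152, 167]
mean, sd, pct = summarize(timings)
print(as_min_sec(mean, sd), f"(SD is {pct:.1f}% of mean)")
```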

24
Basic Sample Queries
25
Find ESTs that may correspond to interesting
neurological targets
  • Return accession numbers and definitions of EST
    sequences that are similar (60% identical over 50
    AA) to calcium channel sequences in SwissProt
    that have references published since 1995 with
    "brain" in the abstract.
  • Data sources: SwissProt, PubMed, tblastn2 vs.
    gbest
  • Results
      • 539 HSPs from 538 ESTs in 11:58 ± 46 sec
  • Is this best done with a workflow or a query?
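
The similarity condition in this query can be sketched as a filter over HSPs. This assumes "60% identical over 50 AA" means at least 60% identity across an alignment of at least 50 residues; the `HSP` record and the sample accessions are hypothetical.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    subject_acc: str    # accession of the EST that was hit
    identities: int     # identical aligned residues
    align_length: int   # alignment length in amino acids


def is_similar(hsp: HSP, min_pct: float = 60.0, min_len: int = 50) -> bool:
    """Keep HSPs with enough identity over a long enough alignment."""
    pct = 100.0 * hsp.identities / hsp.align_length
    return hsp.align_length >= min_len and pct >= min_pct


hsps: List[HSP] = [
    HSP("EST0001", 45, 60),    # 75% over 60 AA: kept
    HSP("EST0001", 40, 52),    # 76.9% over 52 AA: kept, same EST
    HSP("EST0002", 30, 40),    # alignment too short
    HSP("EST0003", 50, 100),   # only 50% identical
]

kept = [h for h in hsps if is_similar(h)]
ests = {h.subject_acc for h in kept}
print(len(kept), "HSPs from", len(ests), "ESTs")
```

Counting HSPs and distinct ESTs separately mirrors the way the slide reports its result, since one EST can contribute more than one HSP.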

26
Mine recent literature for serotonin-related
expression data linked to GenBank sequences
  • Return the source, abstract and GB accession
    number for all articles published in 2000 that
    have a MeSH term "gene expression", mention
    "serotonin", and are linked to nucleotide
    sequences in GenBank
  • Accesses PubMed
      • Can't be done interactively
      • Compensates via post-processing
  • 5 articles related to 6 seqs in 10 ± 0.7 seconds

27
Find full-length coding sequences in GenBank
  • Return the accession and definition of sequences
    annotated as channels for which the CDS start and
    stop locations are specified unambiguously.
  • Accesses GenBank
  • Impossible using Entrez interface
  • 1574 features on 1331 unique sequences in
    27:55 ± 3.4 seconds
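
One way to test whether a CDS location is specified unambiguously is to inspect the GenBank location string. In GenBank feature locations, `<` and `>` mark ends that extend beyond the stated base. The sketch below treats any location containing those markers, or any position that is not a plain `start..end` pair, as ambiguous; the function name and this exact rule are assumptions, not TINet's implementation.

```python
import re

# One plain segment, optionally prefixed by a remote accession.version.
SEGMENT = re.compile(r"(?:[A-Za-z_0-9]+\.\d+:)?\d+\.\.\d+")


def has_unambiguous_ends(location: str) -> bool:
    """True when every segment of the location is an exact start..end pair."""
    # Drop the operators and closing parentheses, keeping only the segments.
    bare = re.sub(r"complement\(|join\(|order\(|\)", "", location)
    return all(SEGMENT.fullmatch(part.strip()) for part in bare.split(","))


for loc in ["280..1584", "<1..500", "join(12..78,134..>202)"]:
    print(loc, has_unambiguous_ends(loc))
```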

28
Find possible novel drug targets
  • Find homologues (60% identity over 50 AA) in
    human genomic sequence to all novel genes cloned
    and reported in the literature in the past 3
    months whose articles are annotated with the
    Medline MeSH term "neoplasms"
  • Accesses PubMed, GenBank, BLAST
  • 1/2 page of OPM-MQL vs. many browser clicks, many
    lines of Perl glue code
  • 131 HSPs involving 47 genomic sequences and 6
    novel gene sequences in 4:21 ± 22.9 sec

29
Identify potential new orthologs/paralogs
related to a gene of interest
  • Retrieve, filter and summarize relevant
    references, GB annotations, and BLAST alignments
    for potential paralog/ortholog sequences
    identified through alignment with a gene of
    interest (e.g., 95% over 400 bp)
  • Accesses GenBank, BLAST, and PubMed
  • Highly structured XML output organizes large,
    hierarchical result set
  • 5 XML entries involving 5 GB seqs with 18
    Features, 3 PubMed entries in 00:32 ± 1.5 secs

30
XML output
<?xml version="1.0"?>
<!-- blastn of GenBank AF061056 against mRNA subset of GenBank -->
<Results>
  <Query>
    <!-- GenBank accession id used to retrieve query sequence -->
    <qry_accession>AF061056</qry_accession>
    <qry_sequence>
      1 tgaaatatag gtgagagaca agattgtctc atatccgggg aaatcataac ctatgactag ...
    </qry_sequence>
  </Query>
  <Output>
    <!-- GenBank/PubMed reference data linked from filtered alignment -->
    <Target>
      <Genbank>
        <gb_accession>AF084645</gb_accession>
        <organism>Homo Sapiens</organism>
        <seq_length>2905</seq_length>
        <locus_name>AF084645</locus_name>
        <Feature>
          <feature_type>source</feature_type>
          <tissue_type>liver</tissue_type>
          <location>1..2905</location>
          <sequence>cctctgaag... 2887 bases omitted ...aaaaaaaaa</sequence>
        </Feature>
        <Feature>
          <feature_type>CDS</feature_type>
          <location>280..1584</location>
          <protein_id>AAC64558.1</protein_id>
          <sequence>ctggaggtg... 1287 bases omitted ...ggtagctga</sequence>
        </Feature>
        <Reference>
          <medline>98445350</medline>
          <Pubmed>
            <title>Identification of a human nuclear receptor defines a new signaling pathway for CYP3A induction.</title>
            <authors>Bertilsson G</authors>
            <authors>... 9 authors omitted ...</authors>
            <abstract>Nuclear receptors regulate metabolic pathways...</abstract>
            <source>Proc Natl Acad Sci U S A 1998 Oct 13;95(21):12208-13</source>
          </Pubmed>
        </Reference>
      </Genbank>
    </Target>
    <bl_accession>gb|AF084645|AF084645</bl_accession>
    <bl_description>Homo sapiens orphan nuclear receptor (PAR1) mRNA, complete cds.</bl_description>
    <bl_align_length>2106</bl_align_length>
    <bl_identities>99.4777</bl_identities>
    ...
  </Output>
  <Output>... 2 references/alignments omitted ...</Output>
</Results>
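
Output structured this way is straightforward to post-process. The sketch below parses a cut-down document that reuses the element names from the slide; the parsing code itself is an illustration using Python's standard `xml.etree.ElementTree`, not part of TINet.

```python
import xml.etree.ElementTree as ET

# Cut-down sample that reuses element names and values shown on the slide.
SAMPLE = """<?xml version="1.0"?>
<Results>
  <Output>
    <Target>
      <Genbank>
        <gb_accession>AF084645</gb_accession>
        <Feature>
          <feature_type>source</feature_type>
          <location>1..2905</location>
        </Feature>
        <Feature>
          <feature_type>CDS</feature_type>
          <location>280..1584</location>
        </Feature>
      </Genbank>
    </Target>
    <bl_identities>99.4777</bl_identities>
  </Output>
</Results>
"""

root = ET.fromstring(SAMPLE)
summary = []
for output in root.iter("Output"):
    accession = output.findtext("Target/Genbank/gb_accession")
    # Collect the location of every CDS feature on the target sequence.
    cds_locations = [f.findtext("location")
                     for f in output.iterfind("Target/Genbank/Feature")
                     if f.findtext("feature_type") == "CDS"]
    identity = float(output.findtext("bl_identities"))
    summary.append((accession, cds_locations, identity))

print(summary)
```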
31
Turn free-text web searches into typed searches
  • Return GeneCards genes associated with liver
    function and related to mouse genes on mouse
    chromosome 10
  • Accesses GeneCards
  • GeneCards only permits free-text "mouse and liver
    and chromosome 10" (58 Cards)
  • TINet transforms into typed search on mammalian
    homologues section
  • Returns 16 GeneCards in 2:39 ± 19.1 secs

32
Protein Motif Searching
  • For a bacterial sequence of interest, return all
    matching PROSITE motifs that are attested in more
    than 10 prokaryotic sequences in SwissProt.
  • Data Sources
      • Accelrys motifs program
      • PROSITE semi-structured text file of annotated
        motifs
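
PROSITE patterns can be matched with ordinary regular expressions once their notation is translated. The sketch below converts the common PROSITE elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors) to Python regex syntax. It is a simplified stand-in for the motifs program named above, and it does not handle rarer forms such as `>` inside brackets.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as N-{P}-[ST]-{P}. to a regex."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        core, _, count = element.partition("(")
        repeat = "{" + count.rstrip(")") + "}" if count else ""
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


# PS00001, the N-glycosylation site pattern.
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
print(glyco, bool(re.search(glyco, "AANGSAA")))
```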

33
Advanced Sample Queries: Using Class Methods
34
Special-purpose filtering using O-O class methods
  • Servers for method calls from within queries
  • BLAST Alignment
      • queries over alignments
      • regexp motif matching over alignments
  • GenBank Feature
      • splicing exons into putative coding sequences
      • extracting putative donor and acceptor sites
      • extracting sequence windows (start and stop
        codons)

35
Sample BLAST alignment: RGS3_HUMAN vs. genomic
  <BCMPROJECT_HAQJ.frag8 11699694
  Length = 14,486
  Plus Strand HSPs:
  Score = 164 (62.8 bits), Expect = 9.6e-07, P = 9.6e-07
  Identities = 48/130 (36%), Positives = 52/130 (40%), Frame = +2

  Query:    20 PGAEDSPPSKEP-SPGQELPPGQDLPPNKDSPSGQEPAPSQEPLSSKDSATSEGSPPGPD 78
               P SPPS P SP PP LPP SPS P PS P PP P
  Sbjct: 11621 PSPPPSPPSPSPLSPTPPPPPSPSLPP-LPSPS-LPPPPSPSPSPPPPPSPPPSPPPSPS 11794

  Query:    79 APPSKDV--PPCQEPPPAQDLSPCQDLPAGQEPLPHQDPLLTKDLPAIQESPTRDLPPCQ 136
               PPS PP PPP SP LP P P P P SP L P
  Sbjct: 11795 PPPSPSPSPPPPPSPPPSPPPSPSPSLPPSPSPPPSPSP---PPSPSPLPSPSPSLSP-- 11959

  Query:   137 DLPPSQVSLP 146
               LPPS P
  Sbjct: 11960 SLPPSPSPSP 11989

36
Exclude proline-rich HSPs
  • BLASTing RGS3_HUMAN vs. genomic
      • proline-rich sequence yields 2781 HSPs from 250
        sequences
  • Filter on p-value alone (< 1.0e-05)
      • 590 HSPs from 148 sequences
  • Add filter on prolines: 219 HSPs
      • percent Ps among perfect matches < 25
  • BLAST cache server allows iterative refining of
    filter conditions (00:34 ± 0.5 sec after first
    iteration)
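
The proline filter can be sketched from the alignment midline, where BLAST prints a letter at each identical position. The helper below is an assumption about how "percent Ps among perfect matches" might be computed, not the BLAST Alignment class method itself; the midline used is the first segment of the sample RGS3_HUMAN alignment as it appears in this transcript.

```python
def proline_percent(midline: str) -> float:
    """Percentage of identical positions (letters in the midline) that are P."""
    identical = [c for c in midline if c.isalpha()]
    if not identical:
        return 0.0
    return 100.0 * identical.count("P") / len(identical)


def keep_hsp(p_value: float, midline: str,
             max_p: float = 1.0e-05, max_pro_pct: float = 25.0) -> bool:
    """Apply both filters from the slide: p-value, then proline content."""
    return p_value < max_p and proline_percent(midline) < max_pro_pct


# First midline segment of the sample alignment; spacing is not significant.
midline = "P SPPS P SP PP LPP SPS P PS P PP P"
print(round(proline_percent(midline), 1), keep_hsp(9.6e-07, midline))
```

The sample HSP passes the p-value threshold but is rejected by the proline filter, which is the behaviour the slide describes.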

37
Identify novel neuropeptides
  • BLAST NDDB_HUMAN neuropeptide precursor
    against genomic sequence
  • Filter on motif conserved in target sequence at
    query sequences
  • Report hits in each separately
  • Uses regular expression pattern matching
  • From 14 HSPs to 9 HSPs in 0:04 ± 0.5 sec
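
A minimal sketch of regular-expression filtering over an alignment: find a motif in the aligned query string and keep only occurrences where the subject has a matching motif in the same alignment columns. The dibasic motif `[KR][KR]` and the two aligned strings are assumptions chosen for illustration; the slide does not state which motif was used.

```python
import re
from typing import List, Tuple


def conserved_motif_hits(query_aln: str, sbjct_aln: str,
                         motif: str) -> List[Tuple[int, str, str]]:
    """Return (column, query match, subject window) where both match the motif.

    Both strings are rows of one alignment, so equal indexes are equal
    columns. Motifs that span a gap character are not handled.
    """
    hits = []
    for m in re.finditer(motif, query_aln):
        window = sbjct_aln[m.start():m.end()]
        if re.fullmatch(motif, window):
            hits.append((m.start(), m.group(), window))
    return hits


# Made-up aligned segments of equal length.
query = "MAKRSPLKKGA"
sbjct = "MGRRSP-KSGA"
print(conserved_motif_hits(query, sbjct, r"[KR][KR]"))
```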

38
Interrogate putative primary structure in genomic
sequence
  • Of the human genomic sequences annotated as
    channels and having exon boundaries in GenBank,
    return only those having valid putative
    donor/acceptor sites and start/stop codons
  • Accesses GenBank, uses Feature methods
  • 59/126 CDSs from 38/81 GB entries in
    39:31 ± 28.7 sec
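
The validity checks can be sketched as follows, assuming a plus-strand gene within a single sequence, canonical GT donor and AG acceptor dinucleotides, an ATG start, and a TAA/TAG/TGA stop. The function name and the toy sequence are hypothetical; the Feature methods used in TINet are not shown in the slides.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(genomic: str,
                         exons: List[Tuple[int, int]]) -> Dict[str, bool]:
    """Check start/stop codons and splice sites for 1-based inclusive exons."""
    seq = genomic.upper()
    first_start, last_end = exons[0][0], exons[-1][1]
    checks = {
        "start_codon": seq[first_start - 1:first_start + 2] == "ATG",
        "stop_codon": seq[last_end - 3:last_end] in STOP_CODONS,
        "splice_sites": True,
    }
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        donor = seq[end:end + 2]                        # first 2 intron bases
        acceptor = seq[next_start - 3:next_start - 1]   # last 2 intron bases
        if donor != "GT" or acceptor != "AG":
            checks["splice_sites"] = False
    return checks


# Toy gene: exon 1 = ATGGCC, intron = GTAAGTTTCAG, exon 2 = TTCTGA.
toy = "ATGGCC" + "GTAAGTTTCAG" + "TTCTGA"
print(check_gene_structure(toy, [(1, 6), (18, 23)]))
```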

39
Sample GenBank record
LOCUS       AB00408S10   2015 bp   DNA   ROD   14-APR-2000
DEFINITION  Rat DNA for lanosterol 14-demethylase, complete cds and exon 10.
ACCESSION   AB004096
KEYWORDS    CYP51; lanosterol 14-demethylase.
SOURCE      Rattus norvegicus (strain:Wister) DNA.
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
            Rattus.
FEATURES             Location/Qualifiers
     CDS             join(AB004087.1:965..1138,AB004088.1:44..142,
                     AB004089.1:90..266,AB004090.1:29..155,AB004091.1:51..225,
                     AB004092.1:35..154,AB004093.1:53..248,AB004094.1:44..139,
                     AB004095.1:40..208,113..291)
                     /gene="CYP51"
                     /codon_start=1
                     /product="lanosterol 14-demethylase"
                     /protein_id="BAA20354.1"
                     /db_xref="GI:2159968"
40
The GenBank CDS Feature Location for AB004096
join(AB004087.1:965..1138,...,AB004095.1:40..208,113..291)
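
A location like this one spans several GenBank entries: in standard GenBank location syntax a remote segment is written `accession.version:start..end`, and a bare `start..end` refers to the record itself. The sketch below splits such a `join(...)` into segments; it handles only plain plus-strand segments, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

SEGMENT = re.compile(
    r"(?:(?P<acc>[A-Za-z_0-9]+\.\d+):)?(?P<start>\d+)\.\.(?P<end>\d+)")


def parse_join(location: str, local_accession: str) -> List[Tuple[str, int, int]]:
    """Split join(...) into (accession, start, end) tuples, in order."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        m = SEGMENT.fullmatch(part.strip())
        if m is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        # Segments without an accession belong to the record being read.
        accession = m.group("acc") or local_accession
        segments.append((accession, int(m.group("start")), int(m.group("end"))))
    return segments


exons = parse_join("join(AB004087.1:965..1138,AB004095.1:40..208,113..291)",
                   "AB004096")
print(exons)
print(sum(end - start + 1 for _, start, end in exons), "bases in these exons")
```

Splicing the coding sequence then requires fetching each referenced entry, which is why this step is a natural fit for a class method running inside the federated query.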
41
Outline
  • The Goal and Obstacles
  • Our Approach
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
  • Related Work
  • Lessons Learned
  • Acknowledgements

42
Choices of Integration Strategies
  • Materialized vs. Non-materialized
  • Data warehouse vs. virtual warehouse (middleware)
  • Semantic vs. Syntactic heterogeneity
  • Meaning vs. format
  • Declarative vs. Procedural
  • Query language vs. method or subroutine calls
  • Generic vs. Hard coded
  • Our Approach

43
Other Integration Approaches
  • Relational data warehousing
      • maintenance-intensive, doesn't scale well
      • efficient updating is a research problem
      • limited to SQL query language
      • permits cleaning, reorganizing data
      • potentially excellent query performance
  • Hard-coded procedural apps (Perl, EJB, etc.)
      • maintenance-intensive
      • may be preferred for very limited queries over
        very large datasets in a stable application
      • method of choice for one-offs

44
Related Work
  • SRS (LION Bioscience)
      • fast, popular
      • flat-file text data only
      • limited query language
  • Kleisli (U Penn; Kent Ridge, Singapore)
      • based on functional programming languages
      • Mostly pass-through; limited optimization
  • DiscoveryLink (IBM)
      • SQL-based
      • excellent cost-based optimization

45
Related Work
  • P/FDM (U Aberdeen)
  • schema-based, functional data model
  • research prototype only
  • TAMBIS (U Manchester)
  • addresses primarily semantic heterogeneity
  • TaO (TAMBIS ontology) based on a description
    logic

46
Outline
  • The Goal and Obstacles
  • Our Approach
  • A Motivating Example
  • The TINet System
  • Sample Queries and Results
  • Related Work
  • Lessons Learned
      • What Worked, What Didn't
      • What Real-World Integration in Bioinformatics
        Requires
  • Acknowledgements

47
Lessons Learned: Issues
  • More optimization and parallelization needed
  • Object-relational query language hindered
    acceptance by developers
  • GenBank data warehouse hard to keep current
    (size)
  • Success of web site access highly dependent on
    stability of site

48
Lessons Learned: What Worked
  • Data integration via database middleware
  • Domain representation
  • Rich, object-relational schemas (sets, lists!)
  • ASDTs provided non-relational processing
    capability
  • Generic server approach
  • Special server successes
  • BLAST server
  • BLAST Alignment ASDT
  • BLAST-cache server
  • Web server (PubMed)
  • Miscellaneous
  • Integration of Perl functions into C ASDT
    server
  • CORBA

49
What Real-World Integration in Bioinformatics
Requires
  • Embrace the non-traditional data landscape!
  • Absence of standardization
  • Importance of public data repositories
  • Multitude of web sites
  • Applications as data sources
  • Nimble, prototyping culture
  • UDFs/Class methods for specialized filtering of
    results
  • Model limited capabilities of non-traditional
    data sources and use them in query planning and
    optimization

50
Query Planning and Optimization for Web Data
Sources
  • Web sources are ubiquitous in Life Sciences,
    but query capabilities, metadata and semantics
    are not easy to reason about or even to determine
  • Challenges
      • Capture diverse, often complex source
        capabilities in a catalog representation
      • Develop methods/tools to extract metadata from
        sources: source contents, access costs,
        interrelationships (joins)
      • Expand current state of capability-based
        optimization to exploit diverse, complex
        capabilities and metadata of biological web data
        sources
          • select sources and capabilities for best result
            set
          • generate low cost query evaluation plans
            efficiently

Eckman, Lacroix, Raschid, BIBE 2001 and Intl J
Bioinf Bioeng, 2002
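
A toy sketch of what a capability catalog and capability-aware source selection might look like. The catalog entries, cost figures, and the penalty rule are all invented for illustration; this is not the method of the cited papers.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class SourceCapability(NamedTuple):
    source: str
    filterable: FrozenSet[str]  # attributes the source itself can filter on
    access_cost: float          # estimated cost of one request (arbitrary units)


Plan = Tuple[float, str, FrozenSet[str], FrozenSet[str]]


def choose_source(catalog: List[SourceCapability],
                  wanted: FrozenSet[str]) -> Plan:
    """Pick the source with the lowest estimated cost for the wanted filters.

    Filters a source cannot evaluate are left to the middleware, and each
    one is charged a crude penalty because more rows must be fetched.
    """
    plans = []
    for cap in catalog:
        pushed = wanted & cap.filterable      # evaluated at the source
        residual = wanted - cap.filterable    # evaluated after retrieval
        estimate = cap.access_cost * (1 + len(residual))
        plans.append((estimate, cap.source, pushed, residual))
    return min(plans, key=lambda plan: plan[0])


CATALOG = [
    SourceCapability("web_free_text", frozenset({"keyword"}), 5.0),
    SourceCapability("warehouse",
                     frozenset({"keyword", "organism", "chromosome"}), 8.0),
]

print(choose_source(CATALOG, frozenset({"keyword", "organism", "chromosome"})))
print(choose_source(CATALOG, frozenset({"keyword"})))
```

With three typed filters the richer source wins despite its higher access cost; with a keyword alone the cheaper free-text source is chosen.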
51
Acknowledgements
  • GlaxoSmithKline
  • Pankaj Agarwal
  • Roberto Alvarez
  • Jim Fickett
  • William Hayes
  • Mark Hurle
  • Rob Knowlton
  • Chris Larminie
  • David Michalovich
  • Bill Morgart
  • Pankaj Patel
  • Ken Rice
  • Bob Reid
  • Harpreet Singh
  • Lin Yue
  • Gene Logic
  • Victor Markowitz
  • I-Min Chen
  • Madhavan Ganesh
  • Zoe Lacroix
  • Ernest Szeto
  • Paula Ta-Shma
  • Thodoros Topaloglou
  • Victoria Wang