Provided by: hansmicha
Transcript and Presenter's Notes

Title: Caltech Team:


1
Purpose, System Description & Specification
Caltech Team: Hans-Michael Müller, Arun
Rangarajan, Tracy Teal, Kimberly Van Auken,
Juancarlos Chan, Paul Sternberg
2
October 1979: 3 relevant papers
  • S. Brenner (Genetics 1974) The genetics of
    Caenorhabditis elegans
  • J. Sulston & R. Horvitz (Developmental Biology
    1977) Post-embryonic cell lineages of the
    nematode, Caenorhabditis elegans
  • J. Kimble & D. Hirsh (Developmental Biology
    1979) The postembryonic cell lineages of the
    hermaphrodite and male gonads in Caenorhabditis
    elegans

October 2007: 300,000 relevant papers
3
Q: What genes are expressed in the striatum?
Most of the time we are skimming rather than
reading. So, if we skim faster, then we can read
more.
  • Keyword search: there is no good query
  • A category search is better
  • category gene + keyword striatum

4
Introduction of Categories
  • Locus: let-60, eat-4, lin-12
  • Brain area: SCN, CA3, cortex
  • Worm cell: vulF, HSNL, ALA, anchor cell
  • Other example terms: precursor, upstream,
    cascade, descendants
5
Categories can store synonyms
Polycystic kidney disease 1
Polycystic kidney disease 1, PKD1, UGID:139915,
NP_038658.1, UniGene Hs.75813, Hs.75813,
NM_000296.2, CR613181.1, U24497.1, AB209675.1
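A minimal sketch of how a category could store synonyms: every synonym is mapped to one canonical entry, so a search for any of them resolves to the same concept. The dictionary layout and the names `build_lookup` and `canonical_name` are illustrative assumptions, not Textpresso's actual data model.

```python
# Hypothetical category entry: one canonical name plus its synonyms.
# The identifiers are the ones listed on the slide.
SYNONYMS = {
    "Polycystic kidney disease 1": [
        "PKD1", "NP_038658.1", "Hs.75813", "NM_000296.2",
        "CR613181.1", "U24497.1", "AB209675.1",
    ],
}

def build_lookup(synonyms):
    """Map each lower-cased synonym (and the canonical name itself) to the canonical name."""
    lookup = {}
    for canonical, names in synonyms.items():
        for name in [canonical, *names]:
            lookup[name.lower()] = canonical
    return lookup

LOOKUP = build_lookup(SYNONYMS)

def canonical_name(term):
    """Return the canonical name for a term, or None if the term is unknown."""
    return LOOKUP.get(term.lower())
```

With this layout, `canonical_name("pkd1")` and `canonical_name("U24497.1")` both resolve to the same entry.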
6
Full Text Marked up With Categories
Mark up the whole corpus of papers with category
terms, and index the mark-ups for searching.
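The mark-up and indexing idea can be sketched as follows. This is a toy illustration under stated assumptions: the two-category lexicon, the regular-expression matching rule, and the function names `mark_up` and `build_index` are hypothetical, and the three sentences are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: category name -> terms (examples from the slides).
LEXICON = {
    "locus": ["let-60", "eat-4", "lin-12"],
    "brain area": ["SCN", "CA3", "cortex"],
}

def mark_up(sentence, lexicon=LEXICON):
    """Return the set of categories whose terms occur in the sentence."""
    found = set()
    for category, terms in lexicon.items():
        for term in terms:
            # Whole-term, case-insensitive match; neighbours must not be
            # word characters or hyphens.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(category)
                break
    return found

def build_index(sentences, lexicon=LEXICON):
    """Inverted index: category -> list of sentence numbers containing it."""
    index = defaultdict(list)
    for number, sentence in enumerate(sentences):
        for category in sorted(mark_up(sentence, lexicon)):
            index[category].append(number)
    return dict(index)

# Made-up sentences, for illustration only.
corpus = [
    "lin-12 acts upstream of let-60 in this cascade.",
    "Expression was strongest in CA3 and cortex.",
    "No category term appears here.",
]
index = build_index(corpus)
```

A category search then becomes a lookup in `index` instead of a scan over the full text.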
7
Textpresso for Neuroscience Example 1
Are there any cell types in specific brain areas
that are somehow associated with TRP channels?
Specify categories: (NIF) cell types, brain area
and TRP channel.
8
Answers to Example 1
9
Textpresso for Neuroscience Example 2
What receptors are known to be involved in
substance abuse?
  • Enter phrase "substance abuse" and specify
    category "receptor".
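A query of this kind, a keyword phrase combined with a category at sentence scope, might be sketched like this. The three-term receptor list, the substring matching, and the function names are assumptions made for illustration; the sentences are invented and are not search results.

```python
# Hypothetical receptor lexicon; a real category would hold thousands of terms.
RECEPTOR_TERMS = ["dopamine receptor", "NMDA receptor", "opioid receptor"]

def sentence_matches(sentence, phrase, category_terms):
    """True if the sentence contains the phrase and at least one category term."""
    text = sentence.lower()
    has_phrase = phrase.lower() in text
    has_category = any(term.lower() in text for term in category_terms)
    return has_phrase and has_category

def search(sentences, phrase, category_terms):
    """Return the sentences that satisfy the combined keyword/category query."""
    return [s for s in sentences if sentence_matches(s, phrase, category_terms)]

# Invented sentences, for illustration only.
sentences = [
    "The opioid receptor has been implicated in substance abuse.",
    "Substance abuse is a major public health problem.",
    "The NMDA receptor mediates synaptic plasticity.",
]
hits = search(sentences, "substance abuse", RECEPTOR_TERMS)
```

Only the first sentence satisfies both conditions; the second lacks a receptor term and the third lacks the phrase.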

10
Answers to Example 2
11
Answers to Example 2
Could make this return even better by
introducing a category "substance abuse" that
contains more terms describing substance abuse or
related terms.
12
Answers to Example 2
Could make this return even better by
introducing a category "substance abuse" that
contains more terms describing substance abuse or
related terms.
New Pilot Categories
  • substance abuse category: substance abuse,
    cocaine seeking, cocaine-triggered, addiction,
    addictive behavior, drug abuse, alcoholism,
    dipsomania, dipsomanic, junkie, substance
    dependence, dependence disorder, self
    administration, heroin SA, psychoactive
    substance abuse, chemical dependence, habitual
    use, physical dependence, rebound insomnia,
    discontinuation syndrome, nicotine abuse,
    cocaine abuse, heroin abuse, barbiturate abuse,
    methaqualone abuse, alcohol abuse, inhalant
    abuse, LSD abuse, marijuana abuse, MDMA abuse,
    Ecstasy abuse, PCP abuse, Phencyclidine abuse,
    anabolic steroid abuse, Quaalude abuse, club
    drug, amphetamine dependence, withdrawal
  • drugs of abuse category (source: NIDA website):
    Nicotine, cocaine, heroin, barbiturate,
    methaqualone, alcohol, inhalant, LSD, marijuana,
    MDMA, Ecstasy, PCP, Phencyclidine, anabolic
    steroid, smoking methamphetamine, crystal meth,
    club drugs, Special K, ketamine, GHB, liquid
    ecstasy, soap, roofies, Rohypnol, coke, snow,
    flake, blow, crack, smack, ska, junk, whippets,
    poppers, snappers, pot, ganga, weed, grass, XTC,
    Adam, hug, beans, love drug, speed, meth, chalk,
    angel dust, wack, rocket fuel, blotter,
    Quaalude, THC
13
Textpresso for Database Curation Tasks
  • Gene-gene interactions
    • script extracts sentences
    • curator checks off
    • database populated (2000 interactions)
  • Cellular component curation
    • curate & analyze small subset of papers
    • build specific set of categories
    • query complete corpus
    • analyze & extract results, refine categories,
      repeat query
  • Now used in the curation pipeline

14
1. Gene-Gene Interactions
  • gene-gene interaction
  • extracted via script all sentences that contain
    more than one gene name and at least one
    association or regulation word
  • obtained 26,000 sentences out of 4,400 articles
  • built simple interface to check off sentences
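The extraction rule above (more than one gene name plus at least one association or regulation word) can be sketched as a sentence filter. The word lists, the tokenizer, and the choice to count distinct gene names are assumptions for this example; the test sentences are invented and do not state real findings.

```python
import re

# Hypothetical word lists; a real script would use curated gene names and
# association/regulation vocabularies.
GENE_NAMES = {"let-60", "lin-12", "eat-4"}
INTERACTION_WORDS = {"regulates", "interacts", "binds", "suppresses"}

def tokens(sentence):
    """Lower-cased word tokens; hyphens and digits stay inside a token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())

def is_candidate(sentence):
    """More than one distinct gene name and at least one interaction word."""
    words = set(tokens(sentence))
    return len(words & GENE_NAMES) > 1 and bool(words & INTERACTION_WORDS)

# Invented sentences, for illustration only.
sentences = [
    "lin-12 regulates let-60 in the vulval precursor cells.",
    "let-60 is expressed in many tissues.",
    "lin-12 and eat-4 were both sequenced.",
]
candidates = [s for s in sentences if is_candidate(s)]
```

Only the first sentence passes; a curator would then confirm or reject each candidate in the check-off interface.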

15
1. Gene-Gene Interactions
16
2. Curation Pipeline: Textpresso for GO Cellular
Component Annotations
Weekly search of new C. elegans papers with CCC
categories returns sentences from 10-15 papers.
Example sentence: "Further supporting these
conclusions, GFP-tagged CMD-1 does not accumulate
at the furrow, but does accumulate on the spindle
and centrosomes (as well as to the interphase
nuclear membrane and the borders of abutting
cells) (Fig 3 and Supplementary Video 2)."
Source: Cytokinesis is not controlled by
calmodulin or myosin light chain kinase in the
Caenorhabditis elegans early embryo. Batchelder
EL, Thomas-Virnig CL, Hardin JD, and White JG.
FEBS Lett. 2007 Sep 4; 581(22): 4337-41.
17
Features, Specification & Technical Details
  • Materials
    • Bibliography
    • Full text papers
    • Ontology
  • Processing
    • Pipeline
    • Methods & Scripts

18
Methods & Scripts
  • Processing & response times
    • Building a database of 15,000 papers takes 30
      hours and 45 GB disk space
    • Web interface has fast response for simple
      retrieval tasks
    • Will consider rewriting core routine in C if
      response takes too long (1.5 million papers:
      3000 hours)
  • NIF requests push the envelope
    • Need C for sophisticated aspects (advanced
      queries in TQL, NLP)
  • All 3rd party software (such as pdf-to-text
    converters) is available under the GNU General
    Public License or similar
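The 3000-hour figure follows from scaling the 15,000-paper build linearly. A small worked calculation, assuming build time and disk space both grow in proportion to the number of papers (an assumption; the slide does not state the scaling law):

```python
# Figures quoted on the slide: 15,000 papers -> 30 hours and 45 GB.
BASE_PAPERS = 15_000
BASE_HOURS = 30
BASE_DISK_GB = 45

def estimate(papers):
    """Linear extrapolation of build time (hours) and disk space (GB)."""
    factor = papers / BASE_PAPERS
    return BASE_HOURS * factor, BASE_DISK_GB * factor

hours, disk_gb = estimate(1_500_000)
print(f"{hours:.0f} hours (~{hours / 24:.0f} days), {disk_gb:.0f} GB")
```

For 1.5 million papers this gives 3000 hours, roughly 125 days of processing, which is why a rewrite of the core routine is under consideration.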

19
Downloading & Installing Textpresso
  • http://www.textpresso.org/cgi-bin/neuroscience/downloads
  • Hardware requirement
    • Linux box
    • 3 GB per 1000 papers
    • 1.5 GB memory or higher for fast processing
  • WWW server software (such as Apache)
  • Perl 5.6.1 and most common Perl packages
  • Setting up system for a new literature (backend
    & frontend) takes two hours (given ontology,
    bibliography and set of full text papers,
    excluding processing time for database build)
  • Package comes with concise installation
    instructions

20
14 Textpresso Systems
21
14 Textpresso Systems
22
Mark-up Density for Neuroscience Corpus
23
Processing Pipeline
  • Retrieve bibliography & full text as described
  • Convert PDFs or HTMLs to plain text
    • If font information is useful, extract and tag
  • Remove special characters and formatting
    • New page character, quotation marks, etc.
  • Tokenize
    • Find word & sentence boundaries
  • Mark-up (annotate) text with categories
  • Index annotation
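The cleaning and tokenizing steps of the pipeline could look roughly like the sketch below. The regular expressions and function names are illustrative assumptions, not the rules Textpresso actually applies; the sentence splitter in particular is deliberately naive and would split wrongly after abbreviations such as "C. elegans".

```python
import re

def clean(text):
    """Drop form feeds (new-page characters) and quotation marks, collapse whitespace."""
    text = text.replace("\f", " ")
    text = re.sub(r"[\"“”]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive rule: split after '.', '!' or '?' followed by whitespace and a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

def tokenize(sentence):
    """Words (with internal hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

raw = 'GFP-tagged CMD-1 accumulates on the spindle.\fIt does not\naccumulate at the "furrow".'
sentences = split_sentences(clean(raw))
```

Here `sentences` holds two cleaned sentences, and `tokenize` keeps hyphenated names such as `GFP-tagged` and `CMD-1` as single tokens.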

24
Processing Pipeline
  • Build keyword index from corpus as-is (no
    modification)
    • People often need to find exact technical terms
    • Automatic wildcard insertion at the end of a
      word makes it a little more flexible
    • Corpus of 15,000 papers and 17,000 abstracts
      has 1.1 million keywords
    • PDF-to-text conversion and corrupted tables
      introduce some nonsense
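An unmodified keyword index with an automatic trailing wildcard can be sketched as a prefix match over the indexed words. The data layout and function names are assumptions for illustration; the linear scan is chosen for brevity, and a sorted word list with binary search would scale better to a million keywords.

```python
from collections import defaultdict

def build_keyword_index(documents):
    """Map each keyword, exactly as written, to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def search_with_wildcard(index, keyword):
    """Treat the query as 'keyword*': match every indexed word that starts with it."""
    hits = set()
    for word, doc_ids in index.items():
        if word.startswith(keyword):
            hits |= doc_ids
    return hits

# Invented mini-corpus, for illustration only.
docs = {
    1: "calmodulin localizes to the spindle",
    2: "spindles were stained",
    3: "the furrow was examined",
}
index = build_keyword_index(docs)
```

A query for `spindle` then also finds the document that only contains `spindles`.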

25
Future
  • Website functionality
    • Making web interfaces even more user-friendly
    • user-defined, up-loadable categories
    • synonyms, central & user-defined
    • after expanding query language (statistics?):
      batch queries
    • user feedback-related improvements
    • RSS feeds
    • bracketing for keywords
  • More Textpresso flavors
    • Evaluate which specific literature sets are
      important
    • Decide whether others will pick them up or not
    • Nematode, A. thaliana, ?E. coli, ?Disease XYZ, ...

26
Future
  • Ontology
    • Continue to update and refine current categories
    • Continue to devise new categories
      • neuroscience, Drosophila
      • new literatures
  • Word sense disambiguation
    • a term can have multiple meanings: find the
      right one in context
  • Find a system to extract categories and their
    lexica from full text
    • so far have worked with simple frequency lists
    • is there a smarter way? Higher-order
      correlations of words, graph theory?

27
Future
  • High-fidelity conversion to full text
    • improve pdftotext or switch to htmltotext
    • clearly identify tables and figures and their
      captions
    • identify subsections of papers (introduction,
      results, methods, ...)
    • sentences not always clearly separated:
      revisit tokenizer
  • Literature curation for databases
    • Have extracted gene-gene associations,
      gene-allele-reference associations, cellular
      components
    • have researched machine learning algorithms
      (WSD, HMM) in the past, now apply them to
      curation tasks
    • Find more data types and better methods to
      extract facts with higher recall and precision

28
More Issues
  • Policies and mechanisms for modifying/updating
    Textpresso vocabularies (synchronized with NIF
    server or handled independently)
    • Categories delivered to us are easily
      incorporated, will close circle by end of 2007
  • Will marked-up text excerpts be delivered to NIF?
    • YES!
  • Feasibility of using Textpresso to aid in
    creating/curating neuroscience databases
    • YES! Semi-automated curation pipelines at
      WormBase
  • Possible synergies with Neurocommons text mining
    project
    • Sophisticated NLP (e.g., entity recognition)
      & paper acquisition

29
A variation on John Satterlee's Use Case
Query a database of fly, worm and yeast data with
a pair of worm genes to find out whether there
are yeast and fly orthologs, and whether they
interact
Zhong & Sternberg, Science 2006
30
click to go to curated database
31
END OF SHORTER VERSION
32
Purpose, System Description & Specification
Caltech Team: Hans-Michael Müller, Arun
Rangarajan, Tracy Teal, Kimberly Van Auken,
Juancarlos Chan, Paul Sternberg
33
Overview
  • What is Textpresso
    • Motivation, purpose
  • Application
    • Interface, example searches
    • Curation with Textpresso
  • Specification
    • Acquisition of materials, development of
      ontology, processing pipeline, system
      requirement, downloading & installing
      Textpresso
  • Textpresso Flavors
  • Future

34
Textpresso - Motivation
  • Biomedical literature increases at rapid pace
    • PubMed has over 17 million citations (growth
      rate of 700,000 entries/year)
  • Much of biological data remains buried in full
    text
    • Difficult to access or just lost
    • Many search engines can only search abstracts,
      not full text
    • Case studies: 1/3 to 2/3 of protein
      interactions not mentioned in abstract
    • Abstract might miss significance of papers
      w.r.t. query
  • Full text contains redundancies
    • good (synonyms) & bad (repetition)

35
Textpresso - Motivation
  • Specificity of search return is low (problem of
    false positives)
    • Search engines do not focus on a specific
      corpus of literature
    • Loss of recall by trying to be more specific
      (more keywords)
  • Conceptual (semantic) searches are hard to
    formulate with keywords only
    • "I want to learn about all genes that interact
      with gene x in cell B."

36
Introduction of Categories
  • Example terms: precursor, upstream, cascade,
    descendants
  • Category: brain area
37
Full Text Marked up With Categories
Mark up the whole corpus of papers with category
terms, and index the mark-ups for searching.
38
Textpresso - Motivation
  • "I want to learn about all genes that interact
    with gene x in cell B."
  • When searching for keywords x and B plus
    categories gene and interaction, the likelihood
    of meaningful returns is drastically increased
    (up to 39-fold, according to one of our studies)

Müller, Kenny & Sternberg, PLoS Biol. 2004
39
Purpose of Textpresso
  • Build a practical tool for researchers and
    curators
  • Facilitates full text (keyword) searches of
    research papers
    • Search scope from single sentence to full
      document
    • Returns: paper bibliography, paragraphs,
      sentences
  • Facilitates category searches
    • Besides keywords, search for concepts such as
      gene, cell, molecular function, etc.
    • Adds meaning to a query (semantics)

40
Purpose of Textpresso
  • Build platform for Natural Language Processing
  • Help model organism database curators extract
    information on a massive scale
    • Gene-allele-reference associations
    • Gene-gene interactions
    • Cellular component extractions
  • Offer research opportunities for computational
    linguists
    • Paper taxonomy (Chen et al., BMC Bioinformatics,
      2006)
    • Research in fact extraction algorithms

41
Textpresso for Neuroscience Example 1
Are there any cell types in specific brain areas
that are somehow associated with TRP channels?
Specify categories: (NIF) cell types, brain area
and TRP channel.
42
Answers to Example 1
43
Textpresso for Neuroscience Example 2
What receptors are known to be involved in
substance abuse?
  • Enter phrase "substance abuse" and specify
    category "receptor".

44
Answers to Example 2
45
Answers to Example 2
Could make this return even better by
introducing a category "substance abuse" that
contains more terms describing substance abuse or
related terms.
46
Answers to Example 2
Could make this return even better by
introducing a category "substance abuse" that
contains more terms describing substance abuse or
related terms.
New Pilot Categories
  • substance abuse category: substance abuse,
    cocaine seeking, cocaine-triggered, addiction,
    addictive behavior, drug abuse, alcoholism,
    dipsomania, dipsomanic, junkie, substance
    dependence, dependence disorder, self
    administration, heroin SA, psychoactive
    substance abuse, chemical dependence, habitual
    use, physical dependence, rebound insomnia,
    discontinuation syndrome, nicotine abuse,
    cocaine abuse, heroin abuse, barbiturate abuse,
    methaqualone abuse, alcohol abuse, inhalant
    abuse, LSD abuse, marijuana abuse, MDMA abuse,
    Ecstasy abuse, PCP abuse, Phencyclidine abuse,
    anabolic steroid abuse, Quaalude abuse, club
    drug, amphetamine dependence, withdrawal
  • drugs of abuse category (source: NIDA website):
    Nicotine, cocaine, heroin, barbiturate,
    methaqualone, alcohol, inhalant, LSD, marijuana,
    MDMA, Ecstasy, PCP, Phencyclidine, anabolic
    steroid, smoking methamphetamine, crystal meth,
    club drugs, Special K, ketamine, GHB, liquid
    ecstasy, soap, roofies, Rohypnol, coke, snow,
    flake, blow, crack, smack, ska, junk, whippets,
    poppers, snappers, pot, ganga, weed, grass, XTC,
    Adam, hug, beans, love drug, speed, meth, chalk,
    angel dust, wack, rocket fuel, blotter,
    Quaalude, THC
47
Search Interface
48
Result Page
49
If Interface Is Not Sufficient: Query Language
50
Cellular Component Curation (CCC)
  • Do authors describe sub-cellular localization in
    a sufficiently stereotypical manner?
  • If so, can this be used to empirically create new
    Textpresso categories specific for CCC?

51
1) Identify & Extract Relevant Sentences
  • 219 publications
  • All sections of paper read
  • Antibody staining only
  • 1429 sentences

52
2) Compute & Inspect Histogram of Words
  • Which phrase size affords specificity?

Phrase size   Count   Phrase
1             321     cells
1             311     nuclei
2             101     localized to
2             97      nuclei of
3             49      in the cytoplasm
3             46      in the nuclei
4             9       localized to the nucleus
4             9       localizes to the apical
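A phrase histogram of this kind can be computed by counting every run of N consecutive words. The sketch below is a generic n-gram count, not the script used for the slide; the three sentences are made up in the style of the curated set, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def phrase_counts(sentences, size):
    """Count every run of `size` consecutive words across the sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    return counts

# Made-up sentences, for illustration only.
sentences = [
    "the protein is localized to the nucleus",
    "staining was localized to the nucleus of intestinal cells",
    "the signal is found in the cytoplasm",
]
top_bigrams = phrase_counts(sentences, 2).most_common(3)
```

Comparing `phrase_counts(sentences, n)` for n = 1 to 4 shows the trade-off on the slide: short phrases are frequent but unspecific, long phrases are specific but rare.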

53
3) Create Three New Categories
  • Cellular Component
    • Adherens junction, nuclei, P granules
  • Verbs
    • Localized, expressed, accumulated
  • Other
    • Fluorescence, punctate, uniformly

54
4) Test on New Papers
  • 37 papers flagged for expression data

55
4) Test on New Papers
  • New categories return 86% of available cellular
    component data

56
5) Further Development (CCC)
  • (N+1)th generation categories
    • add new terms
    • remove terms that lower specificity
    • test again
    • repeat
  • Establish curation pipeline
    • new paper comes in
    • three categories applied
    • sentence presented to curator
    • database populated
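The "three categories applied" step might be sketched as a filter that queues a sentence for the curator when it hits all three categories. The requirement of all three, the substring matching, and the function names are assumptions; the slides do not state the exact matching rule. The terms are the examples given for the three new categories, and the two sentences are invented.

```python
# Example terms from the three categories created in step 3.
CATEGORIES = {
    "cellular component": {"adherens junction", "nuclei", "p granules"},
    "verb": {"localized", "expressed", "accumulated"},
    "other": {"fluorescence", "punctate", "uniformly"},
}

def matched_categories(sentence):
    """Names of the categories that have at least one term in the sentence."""
    text = sentence.lower()
    return {name for name, terms in CATEGORIES.items()
            if any(term in text for term in terms)}

def flag_for_curator(sentences, required=3):
    """Keep sentences that hit at least `required` categories (an assumed threshold)."""
    return [s for s in sentences if len(matched_categories(s)) >= required]

# Invented sentences, for illustration only.
new_paper = [
    "Punctate fluorescence was localized to P granules.",
    "The protein was expressed in all larval stages.",
]
queue = flag_for_curator(new_paper)
```

Lowering `required` trades specificity for recall, which is the kind of tuning the generation loop above iterates on.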

57
Features, Specification & Technical Details
  • Materials
    • Bibliography
    • Full text papers
    • Ontology
  • Processing
    • Pipeline
    • Methods & Scripts

58
Obtaining Papers & Bibliography
  • Worked closely with model organism databases such
    as WormBase or FlyBase to obtain bibliography
    & full texts
  • For other literatures such as Neuroscience,
    mouse, human, etc.
    • Need an automated procedure to obtain
      bibliography and full texts (pdfs or htmls)

59
Obtaining Papers & Bibliography
  • Curator devises a set of keywords or determines a
    set of journals for a particular literature
  • When queried at PubMed, keywords or journal list
    will return all relevant articles for that
    literature
  • Bibliography of articles is downloaded, including
    all bibliographic fields Textpresso offers
  • Using the citation info, full texts (pdf/html)
    are downloaded from journal site (only subscribed
    or open-access journals)
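One way to turn a journal list into a PubMed query is NCBI's E-utilities ESearch endpoint with the `[ta]` (journal title abbreviation) field tag. The sketch below only builds the request URL and sends nothing; whether the Textpresso team used this interface is not stated in the slides, so treat the whole approach as an assumption.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def journal_query(journals):
    """OR together journal abbreviations, each tagged with PubMed's [ta] field."""
    return " OR ".join(f'"{name}"[ta]' for name in journals)

def esearch_url(term, retmax=100):
    """Build (but do not send) an ESearch request for PubMed ids."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

url = esearch_url(journal_query(["J Neurosci", "Nat Neurosci", "Neuron"]))
```

The returned ids would then feed the bibliography download, and the citation data in turn drives the full-text retrieval from the journal sites.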

60
Obtaining Papers & Bibliography
  • Need template for each journal to find pdf link
    • error prone, can do 80% without human
      intervention
  • All steps are currently closely supervised, will
    continue to automate
  • Downloaded 15,000 neuroscience papers in 2006
  • A new round of downloads being prepared (150,000
    entries)
    • 50,000-120,000 full texts available

61
List of Journals for Neuroscience
  • BMC Neurosci
  • Nat Rev Neurosci
  • J Neurosci
  • Nat Neurosci
  • Neuron
  • Neuroscience
  • J Neurophysiol
  • Exp Brain Res
  • Curr Opin Neurobiol

  • Brain Res
  • Trends Neurosci
  • Cereb Cortex
  • Annu Rev Neurosci
  • Int J Neurosci
  • Int Rev Neurobiol
  • J Comp Neurol
  • Eur J Neurosci
  • Neurobiology
  • Cell, Science, PLoS, PNAS, Genetics, ...
62
Ontology Development
  • Textpresso ontology has two components
    • Scientific (such as GO)
      • Rich resource of biologically meaningful
        terms and their relations, but not
        necessarily a representation of natural
        language prose
    • Colloquial
      • Synonyms, jargon, talking in metaphors etc.
  • People write in own, distinct style, without
    caring about a formal ontology (such as GO)
    • Makes it difficult to achieve high recall

63
Ontology Development
  • Imported from MODs and other ontologies,
    assembled own lists
  • Automatic extraction of entity names is
    impressive (BioCreAtIvE challenge in 2004: F 0.8),
    but for the sake of high recall, lists are
    curated carefully by hand
  • 100 categories with 1.7 million terms in lexicon
    (divide by 4)
  • Takes not more than 2-3 months for a curator to
    assemble (semi-automated)
  • Will start working on automatic extraction of
    terms soon
  • 24.5 million terms of biological interest marked
    up with categories (for 15,000 Neuroscience
    papers)

64
C. elegans Categories
C. elegans categories were our base
ontology upon which we built ontologies for other
literatures such as Neuroscience and D.
melanogaster.
65
Neuroscience: Expanding the Lexica
  • Brain area: 4,800 terms
  • Receptor: 5,750 terms
  • NIF cell types (experimental): 550 terms
  • TRP channel (experimental): 40 terms

66
Mark-up Density for Neuroscience Corpus
67
Processing Pipeline
  • Retrieve bibliography & full text as described
  • Convert PDFs or HTMLs to plain text
    • If font information is useful, extract and tag
  • Remove special characters and formatting
    • New page character, quotation marks, etc.
  • Tokenize
    • Find word & sentence boundaries
  • Mark-up (annotate) text with categories
  • Index annotation

68
Processing Pipeline
  • Build keyword index from corpus as-is (no
    modification)
    • People often need to find exact technical terms
    • Automatic wildcard insertion at the end of a
      word makes it a little more flexible
    • Corpus of 15,000 papers and 17,000 abstracts
      has 1.1 million keywords
    • PDF-to-text conversion and corrupted tables
      introduce some nonsense

69
Methods & Scripts
  • All data processing scripts (pipeline) and web
    interfaces are written in Perl
  • Perl modules
    • TextpressoDatabaseQuery
      • defines data model for querying database
    • TextpressoDatabaseSearch
      • subroutines performing database searches, etc.
    • TextpressoDisplayTasks
      • subroutines necessary for running the web
        interface
    • TextpressoSystemTasks
      • subroutines building the Textpresso database
    • TextpressoWebserviceTasks
      • subroutines for providing the webservice

70
Webservice for NIF
  • http://dev.textpresso.org/wsdl/textpresso.wsdl

71
Methods & Scripts
  • Processing & response times
    • Building a database of 15,000 papers takes 30
      hours and 45 GB disk space
    • Web interface has fast response for simple
      retrieval tasks
    • Will consider rewriting core routine in C if
      response takes too long (1.5 million papers:
      3000 hours)
  • NIF requests push the envelope
    • Need C for sophisticated aspects (advanced
      queries in TQL, NLP)
  • All 3rd party software (such as pdf-to-text
    converters) is available under the GNU General
    Public License or similar

72
Downloading & Installing Textpresso
  • http://www.textpresso.org/cgi-bin/neuroscience/downloads
  • Hardware requirement
    • Linux box
    • 3 GB per 1000 papers
    • 1.5 GB memory or higher for fast processing
  • WWW server software (such as Apache)
  • Perl 5.6.1 and most common Perl packages
  • Setting up system for a new literature (backend
    & frontend) takes two hours (given ontology,
    bibliography and set of full text papers,
    excluding processing time for database build)
  • Package comes with concise installation
    instructions

73
14 Textpresso Systems
74
14 Textpresso Systems
75
Future
  • Website functionality
    • Making web interfaces even more user-friendly
    • user-defined, up-loadable categories
    • synonyms, central & user-defined
    • after expanding query language (statistics?):
      batch queries
    • user feedback-related improvements
    • RSS feeds
    • bracketing for keywords
  • More Textpresso flavors
    • Evaluate which specific literature sets are
      important
    • Decide whether others will pick them up or not
    • Nematode, A. thaliana, ?E. coli, ?Disease XYZ, ...

76
Future
  • Ontology
    • Continue to update and refine current categories
    • Continue to devise new categories
      • neuroscience, Drosophila
      • new literatures
  • Word sense disambiguation
    • a term can have multiple meanings: find the
      right one in context
  • Find a system to extract categories and their
    lexica from full text
    • so far have worked with simple frequency lists
    • is there a smarter way? Higher-order
      correlations of words, graph theory?

77
Future
  • High-fidelity conversion to full text
    • improve pdftotext or switch to htmltotext
    • clearly identify tables and figures and their
      captions
    • identify subsections of papers (introduction,
      results, methods, ...)
    • sentences not always clearly separated:
      revisit tokenizer
  • Literature curation for MODs
    • Have extracted gene-gene associations,
      gene-allele-reference associations, cellular
      components
    • have researched machine learning algorithms
      (WSD, HMM) in the past, now apply them to
      curation tasks
    • Find more data types and better methods to
      extract facts with higher recall and precision

78
http://www.textpresso.org
79
More Issues
  • Policies and mechanisms for modifying/updating
    Textpresso vocabularies (synchronized with NIF
    server or handled independently)
    • Categories delivered to us are easily
      incorporated, will close circle by end of 2007
  • Will marked-up text excerpts be delivered to NIF?
    • YES!
  • Feasibility of using Textpresso to aid in
    creating/curating neuroscience databases
    • YES! Semi-automated curation pipelines at
      WormBase
  • Possible synergies with Neurocommons text mining
    project
    • Sophisticated NLP (e.g., entity recognition)
      & paper acquisition

80
A variation on John Satterlee's Use Case
Query a database of fly, worm and yeast data with
a pair of worm genes to find out whether there
are yeast and fly orthologs, and whether they
interact
Zhong & Sternberg, Science 2006
81
click to go to curated database