Title: Caltech Team:
Slide 1: Purpose, System Description & Specification
Caltech Team: Hans-Michael Müller, Arun Rangarajan, Tracy Teal, Kimberly Van Auken, Juancarlos Chan, Paul Sternberg
Slide 2: October 1979: 3 relevant papers
- S. Brenner (Genetics 1974): The genetics of Caenorhabditis elegans
- J. Sulston & R. Horvitz (Developmental Biology 1977): Post-embryonic cell lineages of the nematode, Caenorhabditis elegans
- J. Kimble & D. Hirsh (Developmental Biology 1979): The postembryonic cell lineages of the hermaphrodite and male gonads in Caenorhabditis elegans
October 2007: 300,000 relevant papers
Slide 3
Q: What genes are expressed in the striatum?
Most of the time we are skimming rather than reading. So if we skim faster, we can read more.
- Keyword search: there is no good query
- A category search is better
  - category "gene" plus keyword "striatum"
Slide 4: Introduction of Categories
- Locus: let-60, eat-4, lin-12
- Other example terms: precursor, upstream, cascade, descendants
- Brain area: SCN, CA3, cortex
- Worm cell: vulF, HSNL, ALA, anchor cell
Slide 5: Categories can store synonyms
Polycystic kidney disease 1:
Polycystic kidney disease 1, PKD1, UGID139915, 139915, NP_038658.1, UniGene Hs.75813, Hs.75813, NM_000296.2, CR613181.1, U24497.1, AB209675.1
Slide 6: Full Text Marked up With Categories
Mark up the whole corpus of papers with category terms, then index the mark-up for searching.
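The mark-up and indexing idea can be illustrated with a minimal Python sketch. This is not Textpresso's actual implementation (the slides state that the system is written in Perl); the lexicon layout and the function names `mark_up` and `build_category_index` are assumptions made for illustration. The terms come from the category examples on the slides, except "striatum", which is added here as a brain area term.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: category -> terms.
# Terms are taken from the slides; "striatum" is an added assumption.
LEXICON = {
    "gene": ["let-60", "eat-4", "lin-12"],
    "brain area": ["SCN", "CA3", "cortex", "striatum"],
    "worm cell": ["vulF", "HSNL", "ALA", "anchor cell"],
}


def mark_up(sentence, lexicon=LEXICON):
    """Return (category, term) pairs for every lexicon term found in the sentence."""
    hits = []
    for category, terms in lexicon.items():
        for term in terms:
            # Match whole terms only (not inside longer words), ignoring case.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                hits.append((category, term))
    return hits


def build_category_index(sentences, lexicon=LEXICON):
    """Map each category to the set of sentence numbers containing one of its terms."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for category, _term in mark_up(sentence, lexicon):
            index[category].add(number)
    return index
```

With such an index, a category search reduces to set operations on sentence numbers, for example intersecting the sets for "gene" and "worm cell".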
Slide 7: Textpresso for Neuroscience, Example 1
Are there any cell types in specific brain areas that are somehow associated with TRP channels?
- Specify the categories "NIF cell types", "brain area" and "TRP channel".
Slide 8: Answers to Example 1
Slide 9: Textpresso for Neuroscience, Example 2
What receptors are known to be involved in substance abuse?
- Enter the phrase "substance abuse" and specify the category "receptor".
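A query of this kind, a phrase combined with one or more categories at sentence scope, might be sketched as follows. This is a simplified illustration rather than the Textpresso search routine: matching is plain case-insensitive substring search, and the receptor terms and the function name `search` are hypothetical.

```python
# Hypothetical category lexicon; the receptor terms are illustrative only.
CATEGORY_TERMS = {
    "receptor": ["dopamine receptor", "NMDA receptor", "opioid receptor"],
}


def search(sentences, phrase, categories, lexicon=CATEGORY_TERMS):
    """Return sentences that contain the phrase and at least one term of every category."""
    results = []
    for sentence in sentences:
        lowered = sentence.lower()
        if phrase.lower() not in lowered:
            continue  # the keyword phrase is mandatory
        # Every requested category must contribute at least one matching term.
        if all(
            any(term.lower() in lowered for term in lexicon[category])
            for category in categories
        ):
            results.append(sentence)
    return results
```

Calling `search(sentences, "substance abuse", ["receptor"])` then keeps only the sentences in which the phrase and a receptor term occur together.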
Slide 10: Answers to Example 2
Slide 11: Answers to Example 2
This return could be improved further by introducing a category "substance abuse" that contains more terms describing substance abuse or related terms.
Slide 12: Answers to Example 2, New Pilot Categories
"Substance abuse" category:
substance abuse, cocaine seeking, cocaine-triggered, addiction, addictive behavior, drug abuse, alcoholism, dipsomania, dipsomanic, junkie, substance dependence, dependence disorder, self administration, heroin SA, psychoactive substance abuse, chemical dependence, habitual use, physical dependence, rebound insomnia, discontinuation syndrome, nicotine abuse, cocaine abuse, heroin abuse, barbiturate abuse, methaqualone abuse, alcohol abuse, inhalant abuse, LSD abuse, marijuana abuse, MDMA abuse, Ecstasy abuse, PCP abuse, Phencyclidine abuse, anabolic steroid abuse, Quaalude abuse, club drug, amphetamine dependence, withdrawal
"Drugs of abuse" category (source: NIDA website):
Nicotine, cocaine, heroin, barbiturate, methaqualone, alcohol, inhalant, LSD, marijuana, MDMA, Ecstasy, PCP, Phencyclidine, anabolic steroid, smoking methamphetamine, crystal meth, club drugs, Special K, ketamine, GHB, liquid ecstasy, soap, roofies, Rohypnol, coke, snow, flake, blow, crack, smack, ska, junk, whippets, poppers, snappers, pot, ganga, weed, grass, XTC, Adam, hug, beans, love drug, speed, meth, chalk, angel dust, wack, rocket fuel, blotter, Quaalude, THC
Slide 13: Textpresso for Database Curation Tasks
- Gene-gene interactions
  - script extracts sentences
  - curator checks off
  - database populated (2000 interactions)
- Cellular component curation
  - curate and analyze a small subset of papers
  - build specific set of categories
  - query complete corpus
  - analyze and extract results, refine categories, repeat query
  - now used in the curation pipeline
Slide 14: 1. Gene-Gene Interactions
- gene-gene interaction
- Extracted via script all sentences that contain more than one gene name and at least one association or regulation word
- Obtained 26,000 sentences out of 4,400 articles
- Built a simple interface to check off sentences
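The extraction rule stated above (more than one gene name plus at least one association or regulation word) can be sketched in a few lines of Python. This is only an illustration of the rule, not the script used for curation: the gene names are the locus examples from the slides, while the association word list and the function name `is_candidate` are assumptions.

```python
import re

# Gene names from the slides' locus examples; a real lexicon would be far larger.
GENE_NAMES = {"let-60", "lin-12", "eat-4"}
# Assumed association/regulation vocabulary, for illustration only.
ASSOCIATION_WORDS = {"interacts", "regulates", "binds", "represses", "activates"}


def is_candidate(sentence):
    """True if the sentence names more than one gene and has an association word."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
    distinct_genes = tokens & GENE_NAMES
    return len(distinct_genes) > 1 and bool(tokens & ASSOCIATION_WORDS)
```

Sentences passing this filter would then be shown to a curator, who checks off the true interactions.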
Slide 15: 1. Gene-Gene Interactions
Slide 16: 2. Curation Pipeline: Textpresso for GO Cellular Component Annotations
Weekly search of new C. elegans papers with CCC categories returns sentences from 10-15 papers.
"Further supporting these conclusions, GFP-tagged CMD-1 does not accumulate at the furrow, but does accumulate on the spindle and centrosomes (as well as to the interphase nuclear membrane and the borders of abutting cells) (Fig 3 and Supplementary Video 2)."
Cytokinesis is not controlled by calmodulin or myosin light chain kinase in the Caenorhabditis elegans early embryo. Batchelder EL, Thomas-Virnig CL, Hardin JD, and White JG. FEBS Lett. 2007 Sep 4; 581(22): 4337-41.
Slide 17: Features, Specification & Technical Details
- Materials
  - Bibliography
  - Full text papers
  - Ontology
- Processing
  - Pipeline
  - Methods & Scripts
Slide 18: Methods & Scripts
- Processing & response times
  - Building a database of 15,000 papers takes 30 hours and 45 GB disk space
  - Web interface has fast response for simple retrieval tasks
  - Will consider rewriting core routine in C if response takes too long (1.5 million papers: 3,000 hours)
  - NIF requests push the envelope
  - Need C for sophisticated aspects (advanced queries in TQL, NLP)
- All 3rd-party software (such as pdf-to-text converters) is available under the GNU public license or similar
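The figures on this slide are consistent with simple linear scaling, which a short calculation makes explicit. The sketch below assumes that build time and disk space grow in proportion to the number of papers; that proportionality is an assumption for illustration, and `estimate_build` is a hypothetical helper name.

```python
# Reference build from the slide: 15,000 papers, 30 hours, 45 GB of disk space.
REF_PAPERS = 15_000
REF_HOURS = 30
REF_DISK_GB = 45


def estimate_build(papers):
    """Linearly extrapolate (hours, disk GB) from the reference build."""
    hours = papers * REF_HOURS / REF_PAPERS
    disk_gb = papers * REF_DISK_GB / REF_PAPERS
    return hours, disk_gb


hours, disk_gb = estimate_build(1_500_000)
print(f"1.5 million papers: about {hours:.0f} hours and {disk_gb:.0f} GB")
```

Under this assumption, 1.5 million papers come out at 3,000 hours, matching the slide, and 1,000 papers need 3 GB, matching the stated hardware requirement.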
Slide 19: Downloading & Installing Textpresso
- http://www.textpresso.org/cgi-bin/neuroscience/downloads
- Hardware requirement
  - Linux box
  - 3 GB per 1000 papers
  - 1.5 GB memory or higher for fast processing
- WWW server software (such as Apache)
- Perl 5.6.1 and most common Perl packages
- Setting up system for a new literature (backend & frontend) takes two hours (given ontology, bibliography and set of full text papers, excluding processing time for database build)
- Package comes with concise installation instructions
Slide 20: 14 Textpresso Systems
Slide 21: 14 Textpresso Systems
Slide 22: Mark-up Density for Neuroscience Corpus
Slide 23: Processing Pipeline
- Retrieve bibliography & full text as described
- Convert PDFs or HTMLs to plain text
  - If font information is useful, extract and tag
- Remove special characters and formatting
  - New page character, quotation marks, etc.
- Tokenize
  - Find word & sentence boundaries
- Mark-up (annotate) text with categories
- Index annotation
Slide 24: Processing Pipeline
- Build keyword index from corpus as-is (no modification)
  - People often need to find exact technical terms
  - Automatic wildcard insertion at the end of word makes it a little more flexible
  - Corpus of 15,000 papers and 17,000 abstracts has 1.1 million keywords
  - PDF-to-text conversion and corrupted tables introduce some nonsense
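An as-is keyword index with an automatic trailing wildcard can be sketched as a prefix lookup over sorted keywords. This is one possible realization, not the Textpresso code: keywords are stored exactly as they appear (no stemming, no case folding), and the function names and toy documents are assumptions.

```python
import bisect


def build_keyword_index(documents):
    """Map each keyword, exactly as it appears in the text, to a set of document ids."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def prefix_search(index, sorted_keys, query):
    """Treat the query as 'query*': ids of documents with a keyword starting with it."""
    # All keys sharing the prefix form one contiguous run in sorted order.
    start = bisect.bisect_left(sorted_keys, query)
    hits = set()
    for key in sorted_keys[start:]:
        if not key.startswith(query):
            break
        hits |= index[key]
    return hits
```

Because nothing is normalized, a query for "TRPV" finds "TRPV1" and "TRPV4", while "trpv" finds nothing, which mirrors the trade-off between exact technical terms and flexibility noted on the slide.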
Slide 25: Future
- Website functionality
  - Making web interfaces even more user-friendly
  - user-defined, up-loadable categories
  - synonyms, central & user-defined
  - after expanding query language (statistics?): batch queries
  - user feedback-related improvements
  - RSS feeds
  - bracketing for keywords
- More Textpresso flavors
  - Evaluate which specific literature sets are important
  - Decide whether others will pick them up or not
  - Nematode, A. thaliana, ?E. coli, ?Disease XYZ, etc.
Slide 26: Future
- Ontology
  - Continue to update and refine current categories
  - Continue to devise new categories
    - neuroscience, Drosophila
    - new literatures
- Word sense disambiguation
  - a term can have multiple meanings; find the right one in context
- Find a system to extract categories and their lexica from full text
  - so far have worked with simple frequency lists
  - is there a smarter way? Higher-order correlations of words, graph theory?
Slide 27: Future
- High-fidelity conversion to full text
  - improve pdftotext or switch to htmltotext
  - clearly identify tables and figures and their captions
  - identify subsections of papers (introduction, results, methods, etc.)
  - sentences not always clearly separated: revisit tokenizer
- Literature curation for databases
  - Have extracted gene-gene associations, gene-allele-reference associations, cellular components
  - have researched machine learning algorithms (WSD, HMM) in the past, now apply them to curation tasks
  - Find more data types and better methods to extract facts with higher recall and precision
Slide 28: More Issues
- Policies and mechanisms for modifying/updating Textpresso vocabularies (synchronized with NIF server or handled independently)
  - Categories delivered to us are easily incorporated, will close circle by end of 2007
- Will marked-up text excerpts be delivered to NIF?
  - YES!
- Feasibility of using Textpresso to aid in creating/curating neuroscience databases
  - YES! Semi-automated curation pipelines at WormBase
- Possible synergies with Neurocommons text mining project
  - Sophisticated NLP (e.g., entity recognition), paper acquisition
Slide 29: A variation on John Satterlee's Use Case
Query a database of fly, worm and yeast data with a pair of worm genes to find out whether there are yeast and fly orthologs, and whether they interact.
Zhong & Sternberg, Science 2006
Slide 30: click to go to curated database
Slide 31: END OF SHORTER VERSION
Slide 32: Purpose, System Description & Specification
Caltech Team: Hans-Michael Müller, Arun Rangarajan, Tracy Teal, Kimberly Van Auken, Juancarlos Chan, Paul Sternberg
Slide 33: Overview
- What is Textpresso
  - Motivation, purpose
- Application
  - Interface, example searches
  - Curation with Textpresso
- Specification
  - Acquisition of materials, development of ontology, processing pipeline, system requirement, downloading & installing Textpresso
- Textpresso Flavors
- Future
Slide 34: Textpresso - Motivation
- Biomedical literature increases at rapid pace
  - PubMed has over 17 million citations (growth rate of 700,000 entries/year)
- Much of biological data remains buried in full text
  - Difficult to access or just lost
- Many search engines can only search abstracts, not full text
  - Case studies: 1/3 to 2/3 of protein interactions not mentioned in abstract
  - Abstract might miss significance of papers w.r.t. query
- Full text contains redundancies
  - good (synonyms), bad (repetition)
Slide 35: Textpresso - Motivation
- Specificity of search return is low (problem of false positives)
  - Search engines do not focus on a specific corpus of literature
  - Loss of recall by trying to be more specific (more keywords)
- Conceptual (semantic) searches are hard to formulate with keywords only
  - "I want to learn about all genes that interact with gene x in cell B."
Slide 38: Textpresso - Motivation
- "I want to learn about all genes that interact with gene x in cell B."
- When searching for keywords x and B together with the categories gene and interaction,
  - the likelihood of meaningful returns is drastically increased
  - (up to 39-fold, according to one of our studies)
Müller, Kenny & Sternberg, PLoS 2004
Slide 39: Purpose of Textpresso
- Build a practical tool for researcher and curator
- Facilitates full text (keyword) searches of research papers
  - Search scope: from single sentence to full document
  - Return: paper bibliography - paragraphs - sentences
- Facilitates category searches
  - Besides keywords, search for concepts such as gene, cell, molecular function, etc.
  - Adds meaning to a query (semantics)
Slide 40: Purpose of Textpresso
- Build platform for Natural Language Processing
- Help model organism database curators extract information on a massive scale
  - Gene-allele-reference associations
  - Gene-gene interactions
  - Cellular component extractions
- Offer research opportunities for computational linguists
  - Paper taxonomy (Chen et al., BMC Bioinformatics, 2006)
  - Research in fact extraction algorithms
Slide 47: Search Interface
Slide 48: Result Page
Slide 49: If Interface Is Not Sufficient: Query Language
Slide 50: Cellular Component Curation (CCC)
- Do authors describe sub-cellular localization in a sufficiently stereotypical manner?
- If so, can this be used to empirically create new Textpresso categories specific for CCC?
Slide 51: 1) Identify & Extract Relevant Sentences
- 219 publications
- All sections of paper read
- Antibody staining only
- 1429 sentences
Slide 52: 2) Compute & Inspect Histogram of Words
- Which phrase size affords specificity?
  - Phrase size 1: "cells" (count 321), "nuclei" (count 311)
  - Phrase size 2: "localized to" (count 101), "nuclei of" (count 97)
  - Phrase size 3: "in the cytoplasm" (count 49), "in the nuclei" (count 46)
  - Phrase size 4: "localized to the nucleus" (count 9), "localizes to the apical" (count 9)
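A phrase histogram of this kind can be computed by counting word n-grams over the extracted sentences. The following Python sketch shows the general technique under simple assumptions (lower-casing and whitespace tokenization); it is not the script used for the counts on the slide, and `phrase_histogram` is a hypothetical name.

```python
from collections import Counter


def phrase_histogram(sentences, size):
    """Count every run of `size` consecutive words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - size + 1):
            counts[" ".join(words[i:i + size])] += 1
    return counts
```

Running it for sizes 1 to 4 and inspecting `most_common()` for each size gives the kind of comparison shown above: short phrases are frequent but unspecific, longer ones are specific but rare.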
Slide 53: 3) Create Three New Categories
- Cellular Component
  - Adherens junction, nuclei, P granules
- Verbs
  - Localized, expressed, accumulated
- Other
  - Fluorescence, punctate, uniformly
Slide 54: 4) Test on New Papers
- 37 papers flagged for expression data
Slide 55: 4) Test on New Papers
- New categories return 86% of available cellular component data
Slide 56: 5) Further Development (CCC)
- (N+1)th generation categories
  - add new terms
  - remove terms that lower specificity
  - test again
  - repeat
- Establish curation pipeline
  - new paper comes in
  - three categories applied
  - sentence presented to curator
  - database populated
Slide 58: Obtaining Papers & Bibliography
- Worked closely with model organism databases such as WormBase or FlyBase to obtain bibliography & full texts
- For other literatures such as neuroscience, mouse, human, etc.
  - Need an automated procedure to obtain bibliography and full texts (PDFs or HTMLs)
Slide 59: Obtaining Papers & Bibliography
- Curator devises a set of keywords or determines a set of journals for a particular literature
- When queried at PubMed, keywords or journal list will return all relevant articles for literature
- Bibliography of articles is downloaded, including all bibliographic fields Textpresso offers
- Using the citation info, full texts (PDF/HTML) are downloaded from journal site (only subscribed or open-access journals)
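The slides do not say how PubMed was queried. As one possible approach, a journal list can be turned into a PubMed search term and an NCBI E-utilities `esearch` request URL; the use of E-utilities, the `[Journal]` field tag and the helper names below are assumptions for illustration, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def journal_query(journals):
    """Combine journal title abbreviations into one PubMed search term."""
    return " OR ".join(f'"{name}"[Journal]' for name in journals)


def esearch_url(term, retmax=100):
    """Build (but do not send) an esearch request URL for the given term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = esearch_url(journal_query(["J Neurosci", "Neuron", "Nat Neurosci"]))
print(url)
```

The identifiers returned by such a search would then drive the bibliography download and, where access permits, the full-text retrieval described above.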
Slide 60: Obtaining Papers & Bibliography
- Need template for each journal to find PDF link
  - error prone, can do 80% without human intervention
- All steps are currently closely supervised, will continue to automate
- Downloaded 15,000 neuroscience papers in 2006
- A new round of downloads being prepared (150,000 entries)
  - 50,000-120,000 full texts available
Slide 61: List of Journals for Neuroscience
- BMC Neurosci
- Nat Rev Neurosci
- J Neurosci
- Nat Neurosci
- Neuron
- Neuroscience
- J Neurophysiol
- Exp Brain Res
- Curr Opin Neurobiol
- Brain Res
- Trends Neurosci
- Cereb Cortex
- Annu Rev Neurosci
- Int J Neurosci
- Int Rev Neurobiol
- J Comp Neurol
- Eur J Neurosci
- Neurobiology
- Cell, Science, PLoS, PNAS, Genetics, etc.
Slide 62: Ontology Development
- Textpresso ontology has two components
  - Scientific (such as GO)
    - Rich resource of biologically meaningful terms and their relations, but not necessarily a representation of natural language prose
  - Colloquial
    - Synonyms, jargon, talking in metaphors, etc.
    - People write in their own, distinct style, without caring about a formal ontology (such as GO)
    - Makes it difficult to achieve high recall
Slide 63: Ontology Development
- Imported from MOs and other ontologies, assembled own lists
- Automatic extraction of entity names impressive (BioCreAtivE challenge in 2004: F 0.8), but for the sake of high recall, lists are curated carefully by hand
- 100 categories with 1.7 million terms in lexicon (divide by 4)
- Takes not more than 2-3 months for a curator to assemble (semi-automated)
- Will start working on automatic extraction of terms soon
- 24.5 million terms of biological interest marked up with categories (for 15,000 neuroscience papers)
Slide 64: C. elegans Categories
C. elegans categories were our base ontology, upon which we built ontologies for other literatures such as neuroscience and D. melanogaster.
Slide 65: Neuroscience: Expanding the Lexica
- Brain area: 4,800 terms
- Receptor: 5,750 terms
- NIF cell types (experimental): 550 terms
- TRP channel (experimental): 40 terms
Slide 69: Methods & Scripts
- All data processing scripts (pipeline) and web interfaces are written in Perl
- Perl modules
  - TextpressoDatabaseQuery
    - defines data model for querying database
  - TextpressoDatabaseSearch
    - subroutines performing database searches, etc.
  - TextpressoDisplayTasks
    - subroutines necessary for running the web interface
  - TextpressoSystemTasks
    - subroutines building the Textpresso database
  - TextpressoWebserviceTasks
    - subroutines for providing the webservice
Slide 70: Webservice for NIF
- http://dev.textpresso.org/wsdl/textpresso.wsdl
Slide 78: http://www.textpresso.org