Caltech Team:


Slides: 83

Provided by: hansmicha


Transcript and Presenter's Notes

1
Purpose, System Description & Specification
Caltech Team: Hans-Michael Müller, Arun Rangarajan,
Tracy Teal, Kimberly Van Auken, Juancarlos Chan,
Paul Sternberg
2
October 1979: 3 relevant papers
S. Brenner (Genetics, 1974): The genetics of
Caenorhabditis elegans
J. Sulston & R. Horvitz (Developmental Biology, 1977):
Post-embryonic cell lineages of the nematode,
Caenorhabditis elegans
J. Kimble & D. Hirsh (Developmental Biology, 1979):
The postembryonic cell lineages of the
hermaphrodite and male gonads in Caenorhabditis elegans
October 2007: 300,000 relevant papers
3
Q: What genes are expressed in the striatum?
Most of the time we are skimming rather than
reading. So, if we skim faster, we can read
more.

Keyword search: there is no good query
A: Category search is better
gene striatum

4
Introduction of Categories
[Figure: example categories with member terms]
Locus: let-60, eat-4, lin-12
(unlabeled group): precursor, upstream, cascade, descendants
BRAIN AREA: SCN, CA3, cortex
Worm CELL: vulF, HSNL, ALA, anchor cell
5
Categories can store synonyms
Polycystic kidney disease 1:
Polycystic kidney disease 1, PKD1, UGID139915, 139915,
NP_038658.1, UniGene Hs.75813, Hs.75813, NM_000296.2,
CR613181.1, U24497.1, AB209675.1
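A minimal sketch of how a category might store synonyms, written in Python for illustration (Textpresso itself is written in Perl, and the slides do not show its data structures). The names GENE_LEXICON, build_synonym_index and canonical_entry are hypothetical; the terms are a subset of those on the slide.

```python
# Toy lexicon: one canonical entry plus synonyms/identifiers taken from the slide.
# The structure (dict of canonical entry -> list of synonyms) is an assumption.
GENE_LEXICON = {
    "Polycystic kidney disease 1": [
        "PKD1", "NP_038658.1", "Hs.75813", "NM_000296.2",
        "CR613181.1", "U24497.1", "AB209675.1",
    ],
}


def build_synonym_index(lexicon):
    """Map every lower-cased term (canonical or synonym) to its canonical entry."""
    index = {}
    for canonical, synonyms in lexicon.items():
        for term in [canonical, *synonyms]:
            index[term.lower()] = canonical
    return index


SYNONYM_INDEX = build_synonym_index(GENE_LEXICON)


def canonical_entry(term):
    """Return the canonical entry for a term, or None if the term is unknown."""
    return SYNONYM_INDEX.get(term.strip().lower())


if __name__ == "__main__":
    for query in ("pkd1", "NM_000296.2", "unknown"):
        print(query, "->", canonical_entry(query))
```

With such an index, a search for any one synonym can be expanded to the whole entry, which is presumably what makes category searches more complete than plain keyword searches.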
6
Full Text Marked up With Categories
Mark up the whole corpus of papers with category
terms, then index the mark-up for searching.
7
Textpresso for Neuroscience: Example 1
Are there any cell types in specific brain areas
that are somehow associated with TRP channels?
Specify categories: (NIF) cell types, brain area,
and TRP channel
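The category query in this example can be sketched as a sentence-level filter: a sentence is returned only if it contains at least one term from every requested category. The Python below is an illustrative sketch, not Textpresso's implementation (which is in Perl); the mini-lexica, the toy sentences and the function names are all invented for the example, and the substring matching is deliberately naive.

```python
# Hypothetical mini-lexica; the real categories hold hundreds to thousands of terms.
CATEGORIES = {
    "cell type": {"purkinje cell", "pyramidal neuron"},
    "brain area": {"cortex", "ca3", "scn"},
    "trp channel": {"trpc3", "trpv1"},
}


def categories_in(sentence):
    """Return the names of all categories with at least one term in the sentence.

    Naive substring matching on lower-cased text; a real system would match
    on token boundaries.
    """
    lowered = sentence.lower()
    return {
        name
        for name, terms in CATEGORIES.items()
        if any(term in lowered for term in terms)
    }


def category_search(sentences, required):
    """Keep only sentences that contain every required category."""
    required = set(required)
    return [s for s in sentences if required <= categories_in(s)]


# Invented toy sentences, not taken from any paper.
SENTENCES = [
    "TRPC3 is expressed in Purkinje cells of the cerebellar cortex.",
    "TRPV1 was detected in the cortex.",
    "Pyramidal neurons in CA3 fire in bursts.",
]

if __name__ == "__main__":
    hits = category_search(SENTENCES, ["cell type", "brain area", "trp channel"])
    for sentence in hits:
        print(sentence)
```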
8
Answers to Example 1
9
Textpresso for Neuroscience: Example 2
What receptors are known to be involved in
substance abuse?

Enter phrase "substance abuse" and specify
category "receptor".

10
Answers to Example 2
12
Answers to Example 2
Could make this return even better by
introducing a category "substance abuse" that
contains more terms describing substance abuse or
related terms.
New Pilot Categories
"substance abuse" category:
substance abuse, cocaine seeking,
cocaine-triggered, addiction, addictive
behavior, drug abuse, alcoholism, dipsomania,
dipsomanic, junkie, substance dependence,
dependence disorder, self administration, heroin SA,
psychoactive substance abuse, chemical
dependence, habitual use, physical dependence,
rebound insomnia, discontinuation syndrome,
nicotine abuse, cocaine abuse, heroin abuse,
barbiturate abuse, methaqualone abuse, alcohol
abuse, inhalant abuse, LSD abuse, marijuana
abuse, MDMA abuse, Ecstasy abuse, PCP abuse,
Phencyclidine abuse, anabolic steroid
abuse, Quaalude abuse, club drug, amphetamine
dependence, withdrawal
"drugs of abuse" category (source: NIDA website):
Nicotine, cocaine, heroin,
barbiturate, methaqualone, alcohol, inhalant, LSD,
marijuana, MDMA, Ecstasy, PCP, Phencyclidine,
anabolic steroid, smoking methamphetamine, crystal
meth, club drugs, Special K, ketamine, GHB,
liquid ecstasy, soap, roofies, Rohypnol, coke,
snow, flake, blow, crack, smack, ska, junk,
whippets, poppers, snappers, pot, ganja, weed,
grass, XTC, Adam, hug, beans, love drug, speed,
meth, chalk, angel dust, wack, rocket fuel,
blotter, Quaalude, THC
13
Textpresso for Database Curation Tasks

Gene-gene interactions
script extracts sentences
curator checks off
database populated (2,000 interactions)
Cellular component curation
curate & analyze small subset of papers
build specific set of categories
query complete corpus
analyze & extract results, refine categories,
repeat query
Now used in the curation pipeline

14
1. Gene-Gene Interactions

Gene-gene interaction
Extracted via script all sentences that contain
more than one gene name and at least one
association or regulation word
Obtained 26,000 sentences out of 4,400 articles
Built simple interface to check off sentences
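The extraction rule described above (more than one gene name plus at least one association or regulation word) can be sketched as follows. This is an illustrative Python sketch, not the original script; the gene names are the locus examples from an earlier slide, while the word list, the tokenizer, the function name and the test sentences are assumptions. "More than one gene name" is read here as more than one distinct gene name.

```python
import re

# Gene names from the "Locus" examples earlier in the deck.
GENE_NAMES = {"let-60", "lin-12", "eat-4"}
# Hypothetical association/regulation vocabulary; the slides do not list it.
ASSOCIATION_WORDS = {"interacts", "regulates", "binds", "suppresses", "activates"}

# Tokens are runs of letters/digits, optionally joined by hyphens (keeps "let-60" whole).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_candidate(sentence):
    """True if the sentence names >1 distinct gene and has an association word."""
    tokens = [token.lower() for token in TOKEN_RE.findall(sentence)]
    genes = {token for token in tokens if token in GENE_NAMES}
    has_association = any(token in ASSOCIATION_WORDS for token in tokens)
    return len(genes) > 1 and has_association


if __name__ == "__main__":
    # Invented toy sentences, used only to exercise the rule.
    toy = [
        "In this toy sentence, let-60 regulates lin-12.",
        "Here let-60 and lin-12 are merely listed.",
        "eat-4 activates eat-4.",
    ]
    for sentence in toy:
        print(is_candidate(sentence), "|", sentence)
```

Sentences that pass such a filter are only candidates; on the slide, a curator still checks each one off by hand.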

15
1. Gene-Gene Interactions
16
2. Curation Pipeline: Textpresso for GO Cellular
Component Annotations
Weekly search of new C. elegans papers with CCC
categories returns sentences from 10-15 papers
Example sentence: "Further supporting these conclusions, GFP-tagged
CMD-1 does not accumulate at the furrow, but does
accumulate on the spindle and centrosomes (as
well as to the interphase nuclear membrane and
the borders of abutting cells) (Fig 3 and
Supplementary Video 2)."
Source: Cytokinesis is not
controlled by calmodulin or myosin light chain
kinase in the Caenorhabditis elegans early
embryo. Batchelder EL, Thomas-Virnig CL, Hardin
JD, and White JG. FEBS Lett. 2007 Sep 4;
581(22): 4337-41.
17
Features, Specification & Technical Details

Materials
Bibliography
Full text papers
Ontology
Processing
Pipeline
Methods & Scripts

18
Methods & Scripts

Processing & response times
Building a database of 15,000 papers takes 30
hours and 45 GB of disk space
Web interface has fast response for simple
retrieval tasks
Will consider rewriting core routine in C if
response takes too long (1.5 million papers:
3,000 hours)
NIF requests push the envelope
Need C for sophisticated aspects (advanced
queries in TQL, NLP)
All 3rd-party software (such as PDF-to-text
converters) is available under the GNU General
Public License or similar
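The build-time and disk figures on this slide are consistent with simple linear scaling, which the short Python sketch below makes explicit. Linear scaling is an assumption read off the slide's own numbers (30 hours for 15,000 papers, 3,000 hours for 1.5 million); the 3 GB per 1,000 papers figure comes from the hardware requirements listed elsewhere in the deck. The function names are hypothetical.

```python
def build_hours(papers, ref_papers=15_000, ref_hours=30):
    """Estimated database build time, assuming time grows linearly with corpus size."""
    return papers * ref_hours / ref_papers


def disk_gb(papers, gb_per_1000_papers=3):
    """Estimated disk space, using the stated 3 GB per 1,000 papers."""
    return papers * gb_per_1000_papers / 1000


if __name__ == "__main__":
    print(build_hours(1_500_000))  # 3000.0 hours, matching the slide
    print(disk_gb(15_000))         # 45.0 GB, matching the slide
```

At 3,000 hours (125 days) for 1.5 million papers, the motivation for considering a faster core routine is clear.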

19
Downloading & Installing Textpresso

http://www.textpresso.org/cgi-bin/neuroscience/downloads
Hardware requirements
Linux box
3 GB per 1,000 papers
1.5 GB memory or higher for fast processing
WWW server software (such as Apache)
Perl 5.6.1 and most common Perl packages
Setting up system for a new literature (backend &
frontend) takes two hours (given ontology,
bibliography and set of full text papers,
excluding processing time for database build)
Package comes with concise installation
instructions

20
14 Textpresso Systems
21
14 Textpresso Systems
22
Mark-up Density for Neuroscience Corpus
23
Processing Pipeline

Retrieve bibliography & full text as described
Convert PDFs or HTMLs to plain text
If font information is useful, extract and tag
Remove special characters and formatting
(new-page character, quotation marks, etc.)
Tokenize
Find word & sentence boundaries
Mark up (annotate) text with categories
Index annotations
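The later stages of this pipeline (cleaning, sentence and word boundaries, category mark-up, indexing) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Textpresso Perl code: the two-category lexicon, the regular expressions and all function names are invented, the sentence splitter is naive (it would break on abbreviations such as "C. elegans"), and only single-word terms are matched.

```python
import re
from collections import defaultdict

# Hypothetical two-category lexicon with single-word terms only.
CATEGORIES = {
    "gene": {"let-60", "lin-12"},
    "brain area": {"striatum", "cortex"},
}


def clean(text):
    """Remove new-page characters and quotation marks, then collapse whitespace."""
    text = text.replace("\f", " ")
    text = re.sub(r'["“”]', "", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence boundaries: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Lower-cased word tokens; hyphenated names such as let-60 stay in one piece."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence.lower())


def annotate(tokens):
    """Attach to every token the (possibly empty) list of categories it belongs to."""
    return [
        (token, sorted(name for name, terms in CATEGORIES.items() if token in terms))
        for token in tokens
    ]


def build_index(raw_text):
    """Return the sentences and an index: category name -> set of sentence numbers."""
    sentences = split_sentences(clean(raw_text))
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for _token, categories in annotate(tokenize(sentence)):
            for category in categories:
                index[category].add(sentence_id)
    return sentences, dict(index)


# Invented toy text with a form feed and quotation marks to exercise clean().
RAW = 'Toy text: let-60 and lin-12 are "gene names".\fThe striatum and the cortex are brain areas.'

if __name__ == "__main__":
    sentences, index = build_index(RAW)
    print(sentences)
    print(index)
```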

24
Processing Pipeline

Build keyword index from corpus as-is (no
modification)
People often need to find exact technical terms
Automatic wildcard insertion at the end of word
makes it a little more flexible
Corpus of 15,000 papers and 17,000 abstracts has
1.1 million keywords
PDF-to-text conversion and corrupted tables
introduce some nonsense keywords.
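A keyword index with an automatic wildcard at the end of the word amounts to prefix matching over the stored keywords. The Python sketch below shows one way to do this with a sorted key list and binary search; it is an assumption about the general technique, not a description of Textpresso's Perl code. Keywords keep their original case (the "as-is" requirement), but stripping surrounding punctuation is an extra assumption made here so that "spindle." and "spindle" coincide. The toy documents are invented.

```python
import bisect


def build_keyword_index(documents):
    """Map each keyword, with its case left as written, to the ids of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for raw_token in text.split():
            token = raw_token.strip(".,;:()")  # assumption: drop surrounding punctuation
            if token:
                index.setdefault(token, set()).add(doc_id)
    return index


def prefix_search(index, sorted_keys, prefix):
    """Treat the query as 'prefix*': collect documents for every keyword starting with it."""
    start = bisect.bisect_left(sorted_keys, prefix)
    hits = set()
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        hits |= index[key]
    return hits


# Invented toy documents.
DOCUMENTS = {
    1: "CMD-1 accumulates on the spindle.",
    2: "Imaged spindles and centrosomes.",
}
INDEX = build_keyword_index(DOCUMENTS)
SORTED_KEYS = sorted(INDEX)

if __name__ == "__main__":
    print(prefix_search(INDEX, SORTED_KEYS, "spindle"))  # matches spindle and spindles
    print(prefix_search(INDEX, SORTED_KEYS, "CMD"))
    print(prefix_search(INDEX, SORTED_KEYS, "cmd"))      # empty: case is preserved
```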

25
Future

Website functionality
Making web-interfaces even more user-friendly
user-defined, up-loadable categories
synonyms, central user-defined
after expanding query language (statistics?)
batch queries
user feedback-related improvements
RSS feeds
bracketing for keywords
More Textpresso flavors
Evaluate which specific literature sets are
important
Decide whether others will pick them up or not
Nematode, A. thaliana, ?E. coli, ?Disease XYZ,

26
Future

Ontology
Continue to update and refine current categories
Continue to devise new categories
neuroscience, Drosophila
new literatures
Word sense disambiguation
a term can have multiple meanings; find the right
one in context
Find a system to extract categories and their
lexica from full text
so far have worked with simple frequency lists
is there a smarter way? Higher-order correlations
of words, graph theory?

27
Future

High-fidelity conversion to full text
improve pdftotext or switch to htmltotext
clearly identify tables and figures and their
caption
identify subsections of papers (introduction,
results, methods, etc.)
sentences not always clearly separated; revisit
tokenizer
Literature curation for databases
Have extracted gene-gene association,
gene-allele-reference associations, cellular
components
have researched machine learning algorithms (WSD,
HMM) in the past, now apply them to curation tasks
Find more data types and better methods to
extract facts with higher recall and precision.

28
More Issues

Policies and mechanisms for modifying/updating
Textpresso vocabularies (synchronized with NIF
server or handled independently)
Categories delivered to us are easily
incorporated, will close circle by end of 2007
Will marked-up text excerpts be delivered to NIF?
YES!
Feasibility of using Textpresso to aid in
creating/curating neuroscience databases
YES! Semi-automated curation pipelines at
WormBase
Possible synergies with Neurocommons text mining
project
Sophisticated NLP (e.g., entity recognition)
paper acquisition.

29
A variation on John Satterlee's Use Case
Query a database of fly, worm and yeast data with
a pair of worm genes to find out whether there
are yeast and fly orthologs, and whether they
interact
Zhong & Sternberg, Science 2006
30
click to go to curated database
31
END OF SHORTER VERSION
33
Overview

What is Textpresso
Motivation, purpose
Application
Interface, example searches
Curation with Textpresso
Specification
Acquisition of materials, development of
ontology, processing pipeline, system
requirements, downloading & installing Textpresso
Textpresso Flavors
Future

34
Textpresso - Motivation

Biomedical literature increases at a rapid pace
PubMed has over 17 million citations (growth rate
of 700,000 entries/year)
Much of biological data remains buried in full
text
Difficult to access or just lost
Many search engines can only search abstracts,
not full text
Case studies: 1/3 to 2/3 of protein interactions
not mentioned in abstract
Abstract might miss significance of papers w.r.t.
query
Full text contains redundancies:
good (synonyms) & bad (repetition)

35
Textpresso - Motivation

Specificity of Search Return is Low (Problem of
False Positives)
Search engines do not focus on a
specific corpus of literature
Loss of recall by trying to be more specific
(more keywords)
Conceptual (semantic) searches are hard to
formulate with keywords only
"I want to learn about all genes that interact
with gene x in cell B."

38
Textpresso - Motivation

"I want to learn about all genes that interact
with gene x in cell B."

When searching for keywords x and B plus categories
gene and interaction, the likelihood of meaningful
returns is drastically increased
(up to 39-fold, according to one of our studies)

Müller, Kenny & Sternberg, PLoS Biol. 2004
39
Purpose of Textpresso

Build a practical tool for researchers and curators
Facilitates full text (keyword) searches of
research papers
Search scope from single sentence to full
document
Returns paper bibliography, paragraphs, and
sentences
Facilitates category searches
Besides keywords, search for concepts such as
gene, cell, molecular function, etc.
Adds meaning to a query (semantics)
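The effect of the search scope on a combined keyword and category query can be illustrated with a small Python sketch. Everything here is an assumption made for illustration (the one-category lexicon, the naive period-based sentence split, the function names and the toy document); Textpresso's own search code is in Perl and is not shown in the slides.

```python
# Hypothetical one-category lexicon.
CATEGORIES = {"receptor": {"receptor"}}


def units(document, scope):
    """Split a document into search units: single sentences or the whole document."""
    if scope == "sentence":
        return [part.strip() for part in document.split(".") if part.strip()]
    if scope == "document":
        return [document]
    raise ValueError(f"unknown scope: {scope!r}")


def matches(unit, keywords, categories):
    """True if the unit contains every keyword and a term from every category."""
    text = unit.lower()
    has_keywords = all(keyword.lower() in text for keyword in keywords)
    has_categories = all(
        any(term in text for term in CATEGORIES[name]) for name in categories
    )
    return has_keywords and has_categories


def search(documents, keywords, categories, scope="sentence"):
    """Return all units of the chosen scope that satisfy the query."""
    return [
        unit
        for document in documents
        for unit in units(document, scope)
        if matches(unit, keywords, categories)
    ]


# Invented toy document: the phrase and the category term sit in different sentences.
DOCUMENTS = ["Substance abuse was studied in rats. A receptor was measured separately."]

if __name__ == "__main__":
    print(search(DOCUMENTS, ["substance abuse"], ["receptor"], scope="sentence"))
    print(search(DOCUMENTS, ["substance abuse"], ["receptor"], scope="document"))
```

In this toy case the sentence scope returns nothing while the document scope returns the document, which shows the trade-off: a narrow scope favors specificity, a wide scope favors recall.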

40
Purpose of Textpresso

Build platform for Natural Language Processing
Help model organism database curators extract
information on a massive scale
Gene-allele-reference associations
Gene-gene interactions
Cellular component extractions
Offer research opportunities for computational
linguists
Paper taxonomy (Chen et al., BMC Bioinformatics,
2006)
Research in fact extraction algorithms

47
Search Interface
48
Result Page
49
If Interface Is Not Sufficient: Query Language
50
Cellular Component Curation (CCC)

Do authors describe sub-cellular localization in
a sufficiently stereotypical manner?
If so, can this be used to empirically create new
Textpresso categories specific for CCC?

51
1) Identify & Extract Relevant Sentences

219 publications
All sections of paper read
Antibody staining only
1429 sentences

52
2) Compute & Inspect Histogram of Words

Which phrase size affords specificity?
Phrase size   Count   Phrase
1             321     cells
1             311     nuclei
2             101     localized to
2             97      nuclei of
3             49      in the cytoplasm
3             46      in the nuclei
4             9       localized to the nucleus
4             9       localizes to the apical
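A histogram of phrases by size, as tabulated above, can be computed by counting word n-grams over the extracted sentences. The following Python sketch shows the idea; the function name and the toy sentences are invented, and the slides do not say how the original counts were produced.

```python
from collections import Counter


def phrase_counts(sentences, size):
    """Count every run of `size` consecutive words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    return counts


# Invented toy sentences standing in for the curated sentence set.
TOY_SENTENCES = [
    "protein localized to the nucleus",
    "signal localized to the cytoplasm",
    "staining in the nucleus",
]

if __name__ == "__main__":
    for size in (1, 2, 3, 4):
        print(size, phrase_counts(TOY_SENTENCES, size).most_common(3))
```

Longer phrases are rarer but more specific, which is the trade-off the slide's question about phrase size points at.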

53
3) Create Three New Categories

Cellular Component
Adherens junction, nuclei, P granules
Verbs
Localized, expressed, accumulated
Other
Fluorescence, punctate, uniformly

54
4) Test on New Papers

37 papers flagged for expression data

55
4) Test on New Papers

New categories return 86% of available cellular
component data

56
5) Further Development (CCC)

(N+1)th generation categories
add new terms
remove terms that lower specificity
test again
repeat
Establish curation pipeline
new paper comes in
three categories applied
sentence presented to curator
database populated

58
Obtaining Papers & Bibliography

Worked closely with model organism databases such
as WormBase or FlyBase to obtain bibliography &
full texts
For other literatures such as neuroscience,
mouse, human, etc.:
need an automated procedure to obtain
bibliography and full texts (PDFs or HTMLs)

59
Obtaining Papers & Bibliography

Curator devises a set of keywords or determines a
set of journals for a particular literature
When queried at PubMed, keywords or journal list
will return all relevant articles for that literature
Bibliography of articles is downloaded, including
all bibliographic fields Textpresso offers
Using the citation info, full texts (PDF/HTML)
are downloaded from journal site (only subscribed
or open-access journals)
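The slides do not say how PubMed was queried. As one possible illustration, a journal list can be turned into a PubMed search term and an NCBI E-utilities ESearch URL, as in the Python sketch below. The use of E-utilities, the [Journal] field tag, the retmax value and the function names are assumptions for this example; the code only builds the URL and performs no network request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# NCBI E-utilities ESearch endpoint (assumed here; not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def journal_query(journals):
    """Combine journal abbreviations into one PubMed term using the [Journal] field tag."""
    return " OR ".join(f'"{journal}"[Journal]' for journal in journals)


def esearch_url(term, retmax=100):
    """Build an ESearch URL for the PubMed database; retmax limits the returned ids."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


if __name__ == "__main__":
    term = journal_query(["J Neurosci", "Neuron"])
    url = esearch_url(term)
    print(term)
    print(url)
    # Round-trip check: the encoded URL decodes back to the same term.
    print(parse_qs(urlparse(url).query)["term"][0] == term)
```

The returned PubMed ids would then feed the bibliography download step; fetching full texts from journal sites remains a separate, journal-specific step, as the slide notes.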

60
Obtaining Papers & Bibliography

Need template for each journal to find PDF link:
error prone, can do 80% without human
intervention
All steps are currently closely supervised, will
continue to automate
Downloaded 15,000 neuroscience papers in 2006
A new round of downloads being prepared (150,000
entries)
50,000-120,000 full texts available

61
List of Journals for Neuroscience

BMC Neurosci
Nat Rev Neurosci
J Neurosci
Nat Neurosci
Neuron
Neuroscience
J Neurophysiol
Exp Brain Res
Curr Opin Neurobiol

Brain Res
Trends Neurosci
Cereb Cortex
Annu Rev Neurosci
Int J Neurosci
Int Rev Neurobiol
J Comp Neurol
Eur J Neurosci
Neurobiology
Cell, Science, PLoS, PNAS, Genetics, etc.
62
Ontology Development

Textpresso ontology has two components
Scientific (such as GO)
Rich resource of biologically meaningful terms and
their relations, but not necessarily a
representation of natural language prose
Colloquial
Synonyms, jargon, talking in metaphors, etc.
People write in their own, distinct style, without
caring about a formal ontology (such as GO)
Makes it difficult to achieve high recall

63
Ontology Development

Imported from MOs and other ontologies, assembled
own lists
Automatic extraction of entity names is impressive
(BioCreAtIvE challenge in 2004: F 0.8), but
for the sake of high recall, lists are curated
carefully by hand
100 categories with 1.7 million terms in lexicon
(divide by 4)
Takes no more than 2-3 months for a curator to
assemble (semi-automated)
Will start working on automatic extraction of
terms soon
24.5 million terms of biological interest marked
up with categories (for 15,000 Neuroscience
papers)

64
C. elegans Categories
C. elegans categories were our base
ontology upon which we built ontologies for other
literatures such as Neuroscience and D.
melanogaster.
65
Neuroscience: Expanding the Lexica

Brain area: 4,800 terms
Receptor: 5,750 terms
NIF cell types (experimental): 550 terms
TRP channel (experimental): 40 terms

69
Methods & Scripts

All data processing scripts (pipeline) and web
interfaces are written in Perl
Perl modules
TextpressoDatabaseQuery: defines data model for querying database
TextpressoDatabaseSearch: subroutines performing database searches, etc.
TextpressoDisplayTasks: subroutines necessary for running the web interface
TextpressoSystemTasks: subroutines building the Textpresso database
TextpressoWebserviceTasks: subroutines for providing the webservice

70
Webservice for NIF

http://dev.textpresso.org/wsdl/textpresso.wsdl

77
Future

High-fidelity conversion to full text
improve pdftotext or switch to htmltotext
clearly identify tables and figures and their
caption
identify subsections of papers (introduction,
results, methods, etc.)
sentences not always clearly separated; revisit
tokenizer
Literature curation for MODs
Have extracted gene-gene association,
gene-allele-reference associations, cellular
components
have researched machine learning algorithms (WSD,
HMM) in the past, now apply them to curation tasks
Find more data types and better methods to
extract facts with higher recall and precision.

78
http://www.textpresso.org