Title: New Search Tools for Bioscience Journal Articles
1New Search Tools for Bioscience Journal Articles
- Marti Hearst,
- UC Berkeley School of Information
UIUC Comp-Bio Seminar February 12, 2007
Supported by NSF DBI-0317510 and a gift from Genentech
2Outline
- Biotext Project Introduction
- Simple Abbreviation Definition Recognition
- Citances
- A New Search Interface Idea
3Double Exponential Growth in Bioscience Journal
Articles
- From Hunter & Cohen, Molecular Cell 21, 2006
4BioText Project Goals
- Provide flexible, useful, appealing search for bioscientists.
- Focus on
- Full text journal articles
- New language analysis algorithms
- New search interfaces
5Bioscience Text is Challenging
- Complex sentence structure
- Huge vocabulary
- Including LOTS of abbreviations
- Gene/protein name recognition a major task
- Full text documents have complex structure: which parts are key?
6BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
7Project Team
- Project Leaders
- PI Marti Hearst
- Co-PI Adam Arkin
- Computational Linguistics and Databases
- Preslav Nakov
- Jerry Ye
- Ariel Schwartz (alum)
- Brian Wolf (alum)
- Barbara Rosario (alum)
- Gaurav Bhalotia (alum)
- User Interface / IR
- Mike Wooldridge
- Rowena Luk (alum)
- Dr. Emilia Stoica (alum)
- Bioscience
- Dr. Anna Divoli
- Janice Hamerja (alum)
- Dr. TingTing Zhang (alum)
8The Problem: Identify Acronym Definitions
- methyl methanesulfonate sulfate (MMS)
- heat shock transcription factor (HSF)
- Gcn5-related N-acetyltransferase (GNAT)
- We investigated the redox regulation of the
stress response and report here that in the human
pre-monocytic line U937 cells, H2O2 induced a
concentration-dependent transactivation and
DNA-binding activity of heat-shock factor-1
(HSF-1)
9Identifying Acronym Definitions
- To identify <short form, long form> pairs from biomedical text
- Short form is an abbreviation of the long form
- There exists a character mapping from short form to long form
- Examples
- Gcn5-related N-acetyltransferase (GNAT)
- A non-trivial problem
- Words in long form may be skipped
- Internal letters in long form may be used
10Previous Work
- Machine learning approaches
- Linear regression (Chang et al.)
- Encoding and compression (Yeates et al.)
- Cubic time or worse
- Heuristic approach
- Rule-based (Park & Byrd)
- Factors considered include
- Distance between definition and abbreviation
- Number of stop words
- Capitalization
- Can't reproduce this algorithm
11Step 1: Identifying Candidates
- Consider only two cases
- long form ( short form )
- short form ( long form )
- Short form
- No more than 2 words
- Between 2 and 10 chars
- At least one letter
- First char alphanumeric
- Long form
- Adjacent to short form
- No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
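The candidate rules above can be sketched in Python. This is an illustrative reading of the slide, not the project's code: the function names, the regular expression, and the restriction to the "long form ( short form )" case are all assumptions made for the example.

```python
import re
from typing import Iterator, Tuple


def is_valid_short_form(s: str) -> bool:
    """Check the short-form constraints listed on the slide."""
    s = s.strip()
    return (
        len(s.split()) <= 2                # no more than 2 words
        and 2 <= len(s) <= 10              # between 2 and 10 characters
        and any(c.isalpha() for c in s)    # at least one letter
        and s[0].isalnum()                 # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on the long form: min(|A| + 5, |A| * 2) words."""
    n = len(short_form)
    return min(n + 5, n * 2)


def candidates(sentence: str) -> Iterator[Tuple[str, str]]:
    """Yield (short form, candidate long-form window) pairs.

    Only the 'long form ( short form )' pattern is handled here; the
    window is the text adjacent to the parenthesis, cut to the word limit.
    """
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        if not is_valid_short_form(short):
            continue
        words = sentence[:m.start()].split()
        window = words[-max_long_form_words(short):]
        yield short, " ".join(window)
```

The window returned here is deliberately too generous; trimming it to the correct long form is the job of Step 2.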
12Step 2: Identifying Correct Long Forms
- heat shock transcription factor (HSF)
13Step 2: Identifying Correct Long Forms
- Gcn5-related N-acetyltransferase (GNAT)
14Step 2: Identifying Correct Long Forms
- From right to left, find the shortest long form that matches the short form
- Each character in the short form must match a character in the long form
- The character at the beginning of the short form must match a character in the initial position of the first word in the long form
15Java Code for Finding the Best Long Form for a
Given Short Form
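The Java listing for this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of the right-to-left matching described on the previous slide. It is not the original code: the function name and the choice to skip non-alphanumeric characters in the short form are assumptions.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form,
    scanning both strings from right to left, or None if no match exists."""
    s = len(short_form) - 1
    l = len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():
            # Assumption: punctuation in the short form (e.g. '-') is ignored.
            s -= 1
            continue
        # Move left until the character matches. For the first character of
        # the short form, the match must also sit at the start of a word.
        while (l >= 0 and long_form[l].lower() != c) or (
            s == 0 and l > 0 and long_form[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Cut the candidate at the start of the word containing the last match.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]
```

For example, called with "HSF" and the window "We studied heat shock transcription factor", the sketch skips the internal "h" of "shock" and returns "heat shock transcription factor". Pairs such as <5-HT, serotonin> return None, since no character mapping exists.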
16Evaluation
- 1000 randomly selected MEDLINE abstracts
- 82% recall, 95% precision
- Medstract Gold Standard Evaluation Corpus
- 82% recall, 96% precision
- Compared with
- 83% recall, 80% precision (Chang et al., linear regression)
- 72% recall, 98% precision (Pustejovsky et al., heuristics)
17Missing Pairs
- Skipped characters in short form
- <CNS1, cyclophilin seven suppressor>
- No match
- <5-HT, serotonin>
- Out of order
- <ATN, anterior thalamus>
- Partial match
- <Pol I, RNA polymerase I>
18Other NLP Work
- Relation labeling
- (Work primarily by Barbara Rosario)
- Protein-protein interactions: which ones are happening?
- They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix / AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface [cite].
- Distinguished among 10 different relations
- Binds, degrades, synergizes with, upregulates
- Simple supervised approach gets surprisingly high results (60% accuracy)
19Acquiring Labeled Data using Citances
20 A discovery is made
A paper is written
21 That paper is cited
and cited
and cited
as the evidence for some fact(s) F.
22 Each of these in turn are cited for some fact(s)
until it is the case that all important facts
in the field can be found in citation sentences
alone!
23Citances
- Nearly every statement in a bioscience journal article is backed up with a cite.
- It is quite common for papers to be cited 30-100 times.
- The text around the citation tends to state biological facts. (Call these citances.)
- Different citances will state the same facts in different ways
- so can we use these for creating models of language expressing semantic relations?
24Using citances
- Potential uses of citation sentences (citances)
- creation of training and testing data for semantic analysis,
- synonym set creation,
- database curation,
- document summarization,
- and information retrieval generally.
- All of the above require citance word alignments.
25Sample Citance
- Recent research, in proliferating cells, has
demonstrated that interaction of E2F1 with the
p53 pathway could involve transcriptional
up-regulation of E2F1 target genes such as
p14/p19ARF, which affect p53 accumulation
[67,68], E2F1-induced phosphorylation of p53
[69], or direct E2F1-p53 complex formation [70].
26Related Work
- Traditional citation analysis dates back to the 1960s (Garfield). Includes:
- Citation categorization,
- Context analysis,
- Citer motivation.
- Citation indexing systems, such as ISI's SCI, and CiteSeer.
- Mercer and Di Marco (2004) propose to improve citation indexing using citation types.
- Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.
27Related Work (cont.)
- Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization.
- Nanba et al. (2000) use citances as features for classifying papers into topics.
- A field related to citation indexing is the use of link structure and anchor text of Web pages.
- Applications include IR, classification, Web crawlers, and summarization.
- See the full paper for references.
28Issues for Processing Citances
- Text span
- Identification of the appropriate phrase, clause, or sentence that constitutes a citance.
- Correct mapping of citations when shown as lists or groups (e.g., 22-25).
- Grouping citances by topic
- Citances that cite the same document should be grouped by the facts they state.
- Normalizing or paraphrasing concepts in citances
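As a small illustration of the citation-mapping issue, the sketch below expands a grouped numeric marker into individual reference numbers so that a citance can be attached to every paper it cites. It assumes a numeric citation style; the function name and accepted separators are hypothetical choices for the example.

```python
import re
from typing import List


def expand_citation_group(marker: str) -> List[int]:
    """Expand a marker such as '[22-25]' or '67,68' into reference numbers."""
    numbers: List[int] = []
    for part in marker.strip("[]() ").split(","):
        part = part.strip()
        # A range like '22-25' (hyphen or en dash) covers every number in it.
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            numbers.extend(range(lo, hi + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers
```

With this, a citance ending in "[22-25]" is mapped to references 22, 23, 24, and 25 rather than only to the two numbers that appear in the text.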
29How Do Citances Differ From Abstracts?
- (This part primarily by Anna Divoli.)
- We did a detailed analysis of facts that appear in citances.
- 6 target papers, molecular interactions domain
- We did the same for the abstracts of the target papers.
30Distributions of Concept Types
31How Do Citances Differ From Abstracts?
- Main results
- All of the facts in the abstract are covered by the citances (collectively)
- However, some facts in citances do not appear in the abstracts.
- Mainly Entities and Experimental Methods
- This suggests there is important information in the full text that is not represented by the abstract, title, and metadata alone.
32Paraphrasing Citances
- (This part primarily by Preslav Nakov)
- Problem: many citances say the same thing in different ways
- The sentence structure is very complex and contains irrelevant information
- We want to first normalize those citances that talk about similar things, so we can then determine which sentences repeat the same information.
- This will then allow us to determine what the key points are and thus convert them into summaries.
33Want to Normalize These
- NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death.
- Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis.
- Recently, Bim has been shown to be upregulated following both nerve growth factor withdrawal from primary sympathetic neurons, and serum and potassium withdrawal from granule neurons.
- The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal.
- In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.
34The Resulting Paraphrases
- NGF withdrawal induces Bim.
- Nerve growth factor withdrawal induces the expression of Bim.
- Bim has been shown to be upregulated following nerve growth factor withdrawal.
- Bim is induced in sympathetic neurons in response to NGF withdrawal.
- Bim implicated in apoptosis caused by nerve growth factor deprivation.
- They all paraphrase:
- Bim is induced after NGF withdrawal.
35Paraphrase Creation Algorithm
- 1. Extract the sentences that cite the target.
- 2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
- 3. Dependency parse.
- 4. For each parse
- For each pair of NEs of interest
- i. Extract the path between them.
- ii. Create a paraphrase from the path.
- 5. Rank the candidates for a given pair of NEs.
- 6. Select only the ones above a threshold.
- 7. Generalize.
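Step 4.i above can be illustrated with a toy dependency tree. The sketch below is an assumption-laden example, not the project's implementation: the parse is represented as a hypothetical list of head indices (-1 for the root), and the path between two entity tokens is found through their lowest common ancestor, then sorted to restore the original word order.

```python
from typing import List


def path_between(heads: List[int], a: int, b: int) -> List[int]:
    """Token indices on the dependency path between tokens a and b.

    heads[i] is the index of the head of token i, or -1 for the root.
    The result is sorted, i.e. given in original word order.
    """
    def chain_to_root(i: int) -> List[int]:
        chain = [i]
        while heads[chain[-1]] != -1:
            chain.append(heads[chain[-1]])
        return chain

    up_a, up_b = chain_to_root(a), chain_to_root(b)
    in_b = set(up_b)
    # First node on a's chain that is also an ancestor of (or equal to) b.
    common = next(n for n in up_a if n in in_b)
    path = up_a[: up_a.index(common) + 1] + up_b[: up_b.index(common)]
    return sorted(path)


# Toy parse of "NGF withdrawal strongly induces Bim" (hand-built, hypothetical).
tokens = ["NGF", "withdrawal", "strongly", "induces", "Bim"]
heads = [1, 3, 3, -1, 3]
paraphrase = " ".join(tokens[i] for i in path_between(heads, 0, 4))
```

In this toy case the adverb "strongly" hangs off the verb but is not on the path between the two entities, so the resulting candidate is "NGF withdrawal induces Bim".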
36Creating a Paraphrase
- Given the path from the dependency parse
- Restore the original word order.
- Add words to improve grammaticality.
- Bim shown be following nerve growth factor withdrawal.
- Bim has been shown to be upregulated following nerve growth factor withdrawal.
37Creating a Paraphrase
- Given the path from the dependency parse
- Restore the original word order.
- Add words to improve grammaticality.
- Complex verb forms: passive, infinitive, past, etc.
- Lin & Pantel, Ibrahim et al. manipulate the parser's output
- We use the 2-word heuristic
- If the path extracted from the dependency parse
skips over either one or two words, those one or
two words are inserted back into the paraphrase,
unless those words are adverbs.
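The 2-word heuristic can be sketched as follows. This is one possible reading of the rule, with assumptions stated: tokens carry Penn Treebank style part-of-speech tags (adverbs start with "RB"), the path is given as token indices, and adverbs are dropped individually while other words in a short gap are restored.

```python
from typing import List


def apply_two_word_heuristic(
    tokens: List[str], pos_tags: List[str], path: List[int]
) -> List[str]:
    """Re-insert words skipped by the dependency path when a gap is
    one or two words long, leaving out adverbs."""
    kept: List[int] = []
    previous = None
    for index in sorted(path):
        if previous is not None:
            gap = range(previous + 1, index)
            if 1 <= len(gap) <= 2:
                kept.extend(i for i in gap if not pos_tags[i].startswith("RB"))
        kept.append(index)
        previous = index
    return [tokens[i] for i in kept]


# Path "Bim shown be following nerve growth factor withdrawal" inside the
# full sentence (tags assigned by hand for the example).
tokens = ("Bim has been shown to be upregulated following "
          "nerve growth factor withdrawal").split()
tags = ["NNP", "VBZ", "VBN", "VBN", "TO", "VB", "VBN", "VBG",
        "NN", "NN", "NN", "NN"]
path = [0, 3, 5, 7, 8, 9, 10, 11]
restored = " ".join(apply_two_word_heuristic(tokens, tags, path))
```

Here every gap is at most two words, so the restored string is the full grammatical sentence. Gaps of three or more words are left out, which is what keeps long irrelevant clauses from re-entering the paraphrase.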
38 2-word Heuristic Demonstration
- NGF withdrawal induces Bim.
- Nerve growth factor withdrawal induces the expression of Bim.
- Bim has been shown to be upregulated following nerve growth factor withdrawal.
- Bim is induced in sympathetic neurons in response to NGF withdrawal.
- member Bim implicated in apoptosis caused by nerve growth factor deprivation.
39Evaluation (1)
- An influential journal paper from Neuron
- J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham. Dominant-negative c-jun promotes neuronal survival by reducing bim expression and inhibiting mitochondrial cytochrome c release. Neuron, 29:629-643, 2001.
- 99 journal papers citing it
- 203 citances in total
- 36 different types of important biological factoids
- But we concentrated on one of them
- Bim is induced after NGF withdrawal.
40Evaluation (2)
- Set 1: 67 citances pointing to the target paper and manually found to contain a good or acceptable paraphrase (do not necessarily contain Bim or NGF)
- Set 2: 65 citances pointing to the target paper and containing both Bim and NGF
- Set 3: 102 sentences from the 99 texts, containing both Bim and NGF
- Cluster all 203 citances
- Spectral clustering
- Polynomial kernel
- clusters for which more than 80% of the citances include both NGF and Bim
- Set 1: assess the system under ideal conditions.
- Set 2 vs. 3: Do citances produce better paraphrases?
41Results
- good (1.0) or acceptable (0.5)
42The Citance Fact Extraction Problem
- (This part primarily by Ariel Schwartz.)
- Find groups of words/phrases that are semantically similar in the target paper's context.
- Orthographic similarity is important but does not always entail semantic similarity.
- This is another step needed for normalizing the content.
- Can use the results of this algorithm to determine which entities to use in the paraphrasing just described.
43Example of original citances
44Entities Identified and Labeled as Equivalent to
One Another
response genotoxic stress Chk1 Chk2 phosphorylate
Cdc25A N terminal sites target rapidly ubiquitin
dependent degradation thought central S G2 cell
cycle checkpoints
Given Chk1 promotes Cdc25A turnover response DNA
damage vivo Chk1 required Cdc25A ubiquitination
SCF beta TRCP vitro explored role Cdc25A
phosphorylation ubiquitination process
activated phosphorylated Chk2 T68 involved
phosphorylation degradation Cdc25A examined
levels Cdc25A 2fTGH U3A cells exposed gamma IR
45Features for citance word alignment
- Orthographic features
- exact string match,
- normalized edit distance,
- prefix, suffix match,
- word lengths,
- capitalization.
- Local contextual features
- distance between target words of adjacent source words,
- Word specific tendency to align like the previous/next word,
- Transition to, from, and between (un)aligned words.
- Biological ontology based features
- Medical Subject Headings (MeSH),
- Gene synonyms (Entrez Gene, Uniprot, OMIM).
- Lexical features
- Wordnet similarity (Lin, 1998)
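The orthographic features listed above are simple to compute for a pair of words. The sketch below is illustrative only: the feature names, the choice to normalize edit distance by the longer word's length, and the capitalization test are assumptions, since the slide does not define them.

```python
from typing import Dict, Union


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def common_prefix_length(a: str, b: str) -> int:
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def orthographic_features(source: str, target: str) -> Dict[str, Union[bool, int, float]]:
    """Orthographic features for one (source word, target word) pair."""
    return {
        "exact_match": source == target,
        "normalized_edit_distance": normalized_edit_distance(source, target),
        "prefix_match": common_prefix_length(source, target),
        "suffix_match": common_prefix_length(source[::-1], target[::-1]),
        "source_length": len(source),
        "target_length": len(target),
        "same_capitalization": source[:1].isupper() == target[:1].isupper(),
    }
```

The pair ("Chk1", "Chk2") shows why orthography alone is not enough: the two names differ by one character out of four, yet they denote different kinases, which is the motivation for the contextual and ontology-based features.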
46Approach: Posterior Decoding
- Use Conditional Random Fields
- Compute posterior probabilities using EM
- For every target word w, compute the combination of source words that maximizes the expected score of w
- Take the union of individual word optimal alignments and produce a multiple alignment
- Use a match-factor to reward/penalize a combination based on the number of words that align to the same target word
47Data sets
- 3 sets of citances annotated by a PhD with biological training (Anna Divoli)
- Training set: 4 groups, 10 citances each (360 pairs).
- Development set: 51 citances (2550 pairs).
- Test set: 45 citances (1980 pairs).
- Feature engineering was done using the training and development sets.
- Final results based on a model trained on training and development sets combined, and tested on the test set.
- Baseline: using only normalized edit distance with a simple cutoff.
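The pair counts quoted above are consistent with counting ordered (source, target) pairs of citances within a group, n(n - 1). That reading is an inference from the numbers, not something the slide states. A quick check:

```python
def ordered_pairs(n: int) -> int:
    """Number of ordered pairs of distinct items among n items."""
    return n * (n - 1)


training_pairs = 4 * ordered_pairs(10)   # 4 groups of 10 citances each
development_pairs = ordered_pairs(51)    # one group of 51 citances
test_pairs = ordered_pairs(45)           # one group of 45 citances
```

Under this assumption the three values come out to 360, 2550, and 1980, matching the slide.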
48Results
49A Full Text Search Interface
- (This work in part by Mike Wooldridge and Jerry
Ye)
50The Importance of Figures and Captions
- Observations of biologists' reading habits
- It has often been observed that biologists focus on figures and captions along with title and abstract.
- KDD Cup 2002
- The objective was to extract only the papers that included experimental results regarding expression of gene products and to identify, from all the genes mentioned in each document, the genes and products for which experimental results were provided.
- ClearForest and Celera did well in part by focusing on figure captions, which contain critical experimental evidence.
52Our Idea
- Make a full text search engine for journal articles that focuses on showing figures
- Make it possible to search over caption text (and text that refers to captions)
- Try to group the figures intelligently
53BioFigure Search Interface
- We've indexed the open access journal article collection
- 130 journals
- 20,000 articles
- 80,000 figures
- We've built a figure/caption labeling tool to create training data
- Image types
- Comparison or not?
- We've made a start at a search interface
- Right now the figure grouping facility is very crude
- We are going to add faceted navigation (Flamenco)
64Interested in Helping?
- We need figure labeling help!
- We need user feedback!
- Contact me, or send email to
- divoli_at_sims.berkeley.edu
- More information
- biotext.berkeley.edu