New Search Tools for Bioscience Journal Articles

Slides: 63

Provided by: KaPin

Learn more at: https://biotext.berkeley.edu

Transcript and Presenter's Notes

1
New Search Tools for Bioscience Journal Articles

Marti Hearst,
UC Berkeley School of Information

UIUC Comp-Bio Seminar, February 12, 2007
Supported by NSF DBI-0317510 and a gift from Genentech
2
Outline

Biotext Project Introduction
Simple Abbreviation Definition Recognition
Citances
A New Search Interface Idea

3
Double Exponential Growth in Bioscience Journal
Articles

From Hunter and Cohen, Molecular Cell 21, 2006

4
BioText Project Goals

Provide flexible, useful, appealing search for
bioscientists.
Focus on
Full text journal articles
New language analysis algorithms
New search interfaces

5
Bioscience Text is Challenging

Complex sentence structure
Huge vocabulary
Including LOTS of abbreviations
Gene/protein name recognition a major task
Full text documents have complex structure
which parts are key?

6
BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
7
Project Team

Project Leaders
PI Marti Hearst
Co-PI Adam Arkin
Computational Linguistics and Databases
Preslav Nakov
Jerry Ye
Ariel Schwartz (alum)
Brian Wolf (alum)
Barbara Rosario (alum)
Gaurav Bhalotia (alum)

User Interface / IR
Mike Wooldridge
Rowena Luk (alum)
Dr. Emilia Stoica (alum)
Bioscience
Dr. Anna Divoli
Janice Hamerja (alum)
Dr. TingTing Zhang (alum)

8
The Problem: Identify Acronym Definitions

methyl methanesulfonate sulfate (MMS)
heat shock transcription factor (HSF)
Gcn5-related N-acetyltransferase (GNAT)
We investigated the redox regulation of the
stress response and report here that in the human
pre-monocytic line U937 cells, H2O2 induced a
concentration-dependent transactivation and
DNA-binding activity of heat-shock factor-1
(HSF-1)

9
Identifying Acronym Definitions

To identify <short form, long form> pairs
from biomedical text
Short form is abbreviation of long form
There exists character mapping from short form to
long form
Examples
Gcn5-related N-acetyltransferase (GNAT)
A non-trivial problem
Words in long form may be skipped
Internal letters in long form may be used

10
Previous Work

Machine learning approaches
Linear regression (Chang et al.)
Encoding and compression (Yeates et al.)
Cubic time or worse
Heuristic approach
Rule-based (Park and Byrd)
Factors considered include
Distance between definition and abbreviation
Number of stop words
Capitalization
Can't reproduce this algorithm

11
Step 1: Identifying Candidates

Consider only two cases
long form ( short form )
short form ( long form )
Short form
No more than 2 words
Between 2 and 10 chars
At least one letter
First char alphanumeric
Long form
Adjacent to short form
No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
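
The candidate conditions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: the function names are made up here, only the "long form ( short form )" case is handled, and the long-form candidate is simply the window of words immediately to the left of the parenthesis.

```python
import re


def is_valid_short_form(short_form: str) -> bool:
    """Apply the short-form conditions listed on the slide."""
    s = short_form.strip()
    if len(s.split()) > 2:                  # no more than 2 words
        return False
    if not 2 <= len(s) <= 10:               # between 2 and 10 characters
        return False
    if not any(c.isalpha() for c in s):     # at least one letter
        return False
    return s[0].isalnum()                   # first character alphanumeric


def max_long_form_words(short_form: str) -> int:
    """Word limit for the long form: min(|A| + 5, |A| * 2)."""
    n = len(short_form)
    return min(n + 5, n * 2)


def candidate_pairs(sentence: str):
    """Yield (short form, long-form candidate) for 'long form (short form)'."""
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        if not is_valid_short_form(short):
            continue
        # The long form must be adjacent to the short form, so take the
        # words directly before the opening parenthesis.
        words = sentence[:match.start()].split()
        yield short, " ".join(words[-max_long_form_words(short):])


if __name__ == "__main__":
    text = "binding activity of heat shock transcription factor (HSF) was"
    print(list(candidate_pairs(text)))
```

The window is deliberately generous; trimming it down to the correct long form is the job of Step 2.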

12
Step 2: Identifying Correct Long Forms

heat shock transcription factor (HSF)

13
Step 2: Identifying Correct Long Forms

Gcn5-related N-acetyltransferase (GNAT)

14
Step 2: Identifying Correct Long Forms

From right to left, the shortest long form that
matches the short form
Each character in short form must match a
character in long form
The character at the beginning of the short form
must match a character in the initial position of
the first word in the long form

15
Java Code for Finding the Best Long Form for a
Given Short Form
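
The Java listing from this slide is not reproduced in the transcript. As a stand-in, here is a minimal Python sketch of the right-to-left matching rule stated on the previous slide. It is an illustration, not the authors' code; the name find_best_long_form and the choice to return None on failure are assumptions made here.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form.

    Scans both strings from right to left. Every alphanumeric character
    of the short form must be found in the long form, and the first
    character of the short form must sit at the start of a word.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches; for the first short-form
        # character, also require that it begins a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of long-form text: no match
        l_index -= 1
    # Extend left to the beginning of the word where matching stopped.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


if __name__ == "__main__":
    print(find_best_long_form(
        "HSF", "DNA-binding activity of heat shock transcription factor"))
    print(find_best_long_form("GNAT", "Gcn5-related N-acetyltransferase"))
    print(find_best_long_form("5-HT", "serotonin"))
```

With the slide examples as input, the sketch trims the HSF candidate to "heat shock transcription factor", keeps the whole GNAT candidate, and returns None for <5-HT, serotonin>, which has no character mapping.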
16
Evaluation

1000 randomly selected MEDLINE abstracts
82% recall, 95% precision
Medstract Gold Standard Evaluation Corpus
82% recall, 96% precision
Compared with
83% recall, 80% precision (Chang et al., linear
regression)
72% recall, 98% precision (Pustejovsky et al.,
heuristics)

17
Missing Pairs

Skipped characters in short form
<CNS1, cyclophilin seven suppressor>
No match
<5-HT, serotonin>
Out of order
<ATN, anterior thalamus>
Partial match
<Pol I, RNA polymerase I>

18
Other NLP Work

Relation labeling
(Work primarily by Barbara Rosario)
Protein-protein interactions: which ones are
happening?
They also demonstrate that the GAG protein from
membrane-containing viruses, such as HIV, binds
to Alix/AIP1, thereby recruiting the ESCRT
machinery to allow budding of the virus from the
cell surface [cite].
Distinguished among 10 different relations
Binds, degrades, synergizes with, upregulates
Simple supervised approach gets surprisingly high
results (60% accuracy)

19
Acquiring Labeled Data using Citances
20

A discovery is made
A paper is written
21

That paper is cited
and cited
and cited
as the evidence for some fact(s) F.
22

Each of these in turn is cited for some fact(s)
until it is the case that all important facts
in the field can be found in citation sentences
alone!
23
Citances

Nearly every statement in a bioscience journal
article is backed up with a cite.
It is quite common for papers to be cited 30-100
times.
The text around the citation tends to state
biological facts. (Call these citances.)
Different citances will state the same facts in
different ways
so can we use these for creating models of
language expressing semantic relations?

24
Using citances

Potential uses of citation sentences (citances)
creation of training and testing data for
semantic analysis,
synonym set creation,
database curation,
document summarization,
and information retrieval generally.
All of the above require citance word alignments.

25
Sample Citance

Recent research, in proliferating cells, has
demonstrated that interaction of E2F1 with the
p53 pathway could involve transcriptional
up-regulation of E2F1 target genes such as
p14/p19ARF, which affect p53 accumulation
[67,68], E2F1-induced phosphorylation of p53
[69], or direct E2F1-p53 complex formation [70].

26
Related Work

Traditional citation analysis dates back to the
1960s (Garfield). Includes
Citation categorization,
Context analysis,
Citer motivation.
Citation indexing systems, such as ISI's SCI and
CiteSeer.
Mercer and Di Marco (2004) propose to improve
citation indexing using citation types.
Bradshaw (2003) introduces Reference Directed
Indexing (RDI), which indexes documents using the
terms in the citances citing them.

27
Related Work (cont.)

Teufel and Moens (2002) identify citances to
improve summarization of the citing paper. They
give lower weight to citances as candidate
sentences for summarization.
Nanba et al. (2000) use citances as features for
classifying papers into topics.
A field related to citation indexing is the use of
link structure and anchor text of Web pages.
Applications include IR, classification, Web
crawlers, and summarization.
See the full paper for references.

28
Issues for Processing Citances

Text span
Identification of the appropriate phrase, clause,
or sentence that constructs a citance.
Correct mapping of citations when shown as lists
or groups (e.g., 22-25).
Grouping citances by topic
Citances that cite the same document should be
grouped by the facts they state.
Normalizing or paraphrasing concepts in citances
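
The list-or-group mapping issue can be made concrete with a small Python sketch. It assumes numeric citation markers written as comma-separated numbers and ranges (such as 22-25); the function name expand_citation_group is hypothetical, and author-year citation styles are not covered.

```python
import re


def expand_citation_group(marker: str) -> list:
    """Expand a marker such as '22-25' or '[67,68]' into reference numbers."""
    numbers = []
    # Drop surrounding brackets, then handle each comma-separated part.
    for part in re.split(r"\s*,\s*", marker.strip("[]() ")):
        range_match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if range_match:
            low, high = int(range_match.group(1)), int(range_match.group(2))
            numbers.extend(range(low, high + 1))   # inclusive range
        elif part.isdigit():
            numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(expand_citation_group("22-25"))
    print(expand_citation_group("[67,68]"))
```

Once a group is expanded, the same citance text can be attached to each cited document in the group.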

29
How Do Citances Differ From Abstracts?

(This part primarily by Anna Divoli.)
We did a detailed analysis of facts that appear
in citances.
6 target papers, molecular interactions domain
We did the same for the abstracts of the target
papers.

30
Distributions of Concept Types
31
How Do Citances Differ From Abstracts?

Main results
All of the facts in the abstract are covered by
the citances (collectively).
However, some facts in citances do not appear in
the abstracts.
Mainly Entities and Experimental Methods
This suggests there is important information in
the full text that is not represented by the
abstract, title, and metadata alone.

32
Paraphrasing Citances

(This part primarily by Preslav Nakov)
Problem: many citances say the same thing in
different ways
The sentence structure is very complex and
contains irrelevant information
We want to first normalize those citances that
talk about similar things, so we can then
determine which sentences repeat the same
information.
This will then allow us to determine what the key
points are and thus convert them into summaries.

33
Want to Normalize These

NGF withdrawal from sympathetic neurons induces
Bim, which then contributes to death.
Nerve growth factor withdrawal induces the
expression of Bim and mediates Bax dependent
cytochrome c release and apoptosis.
Recently, Bim has been shown to be upregulated
following both nerve growth factor withdrawal
from primary sympathetic neurons, and serum and
potassium withdrawal from granule neurons.
The proapoptotic Bcl-2 family member Bim is
strongly induced in sympathetic neurons in
response to NGF withdrawal.
In neurons, the BH3 only Bcl2 member, Bim, and
JNK are both implicated in apoptosis caused by
nerve growth factor deprivation.

34
The Resulting Paraphrases

NGF withdrawal induces Bim.
Nerve growth factor withdrawal induces the
expression of Bim.
Bim has been shown to be upregulated following
nerve growth factor withdrawal.
Bim is induced in sympathetic neurons in response
to NGF withdrawal.
Bim implicated in apoptosis caused by nerve
growth factor deprivation.
They all paraphrase:
Bim is induced after NGF withdrawal.

35
Paraphrase Creation Algorithm

1. Extract the sentences that cite the target.
2. Mark the NEs of interest (genes/proteins, MeSH
terms)
and normalize.
3. Dependency parse.
4. For each parse
For each pair of NEs of interest
i. Extract the path between them.
ii. Create a paraphrase from the path.
5. Rank the candidates for a given pair of NEs.
6. Select only the ones above a threshold.
7. Generalize.
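
Step 4 (extracting the path between two named entities) can be sketched in Python as follows. The parse is hand-made for illustration: heads[i] holds the index of token i's head, with None for the root. No real parser is called, and the sentence, the head assignments, and the function names are assumptions of this sketch.

```python
def path_to_root(heads, index):
    """Token indices from the given token up to the root of the parse."""
    path = [index]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


def dependency_path(heads, a, b):
    """Token indices on the tree path from token a to token b."""
    up_a, up_b = path_to_root(heads, a), path_to_root(heads, b)
    above_b = set(up_b)
    # Lowest common ancestor: first node above a that is also above b.
    lca = next(node for node in up_a if node in above_b)
    left = up_a[:up_a.index(lca) + 1]      # a ... lca
    right = up_b[:up_b.index(lca)]         # b ... child of lca
    return left + right[::-1]


tokens = ["Bim", "is", "strongly", "induced", "after", "NGF", "withdrawal"]
heads = [3, 3, 3, None, 3, 6, 4]           # hand-made parse, root = "induced"

path = dependency_path(heads, tokens.index("Bim"), tokens.index("NGF"))
# Restoring the original word order turns the path into a raw paraphrase.
raw_paraphrase = " ".join(tokens[i] for i in sorted(path))

if __name__ == "__main__":
    print(path)
    print(raw_paraphrase)
```

The raw paraphrase drops "is strongly"; making it grammatical again is the job of the paraphrase-creation step.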

36
Creating a Paraphrase

Given the path from the dependency parse
Restore the original word order.
Add words to improve grammaticality.
Bim shown be following nerve growth factor
withdrawal.
Bim has been shown to be upregulated
following nerve growth factor withdrawal.

37
Creating a Paraphrase

Given the path from the dependency parse
Restore the original word order.
Add words to improve grammaticality.
Complex verb forms: passive, infinitive, past,
etc.
Lin and Pantel, and Ibrahim et al., manipulate the
parser's output
We use the 2-word heuristic
If the path extracted from the dependency parse
skips over either one or two words, those one or
two words are inserted back into the paraphrase,
unless those words are adverbs.
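
A minimal Python sketch of the 2-word heuristic, under stated assumptions: the path is given as token indices, part-of-speech tags follow the Penn Treebank convention (RB, RBR, RBS for adverbs), and within a skipped gap only the adverbs are left out. The function name and that last reading of "unless those words are adverbs" are choices made for this sketch.

```python
ADVERB_TAGS = ("RB", "RBR", "RBS")


def apply_two_word_heuristic(tokens, pos_tags, path_indices):
    """Re-insert skipped words when the path jumps over one or two tokens."""
    kept = sorted(path_indices)            # restore original word order
    if not kept:
        return []
    result = []
    for prev, nxt in zip(kept, kept[1:]):
        result.append(prev)
        gap = range(prev + 1, nxt)         # tokens the path skipped here
        if 1 <= len(gap) <= 2:
            result.extend(i for i in gap if pos_tags[i] not in ADVERB_TAGS)
    result.append(kept[-1])
    return [tokens[i] for i in result]


tokens = ("Bim has been shown to be upregulated following "
          "nerve growth factor withdrawal").split()
pos_tags = ["NNP", "VBZ", "VBN", "VBN", "TO", "VB", "VBN", "VBG",
            "NN", "NN", "NN", "NN"]
# Indices of "Bim shown be following nerve growth factor withdrawal".
path = [0, 3, 5, 7, 8, 9, 10, 11]

if __name__ == "__main__":
    print(" ".join(apply_two_word_heuristic(tokens, pos_tags, path)))
```

On this input the gaps "has been", "to", and "upregulated" are all one or two words long, so the full clause is restored.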

38
2-word Heuristic Demonstration

NGF withdrawal induces Bim.
Nerve growth factor withdrawal induces the
expression of Bim.
Bim has been shown to be upregulated
following nerve growth factor withdrawal.
Bim is induced in sympathetic neurons in
response to NGF withdrawal.
member Bim implicated in apoptosis caused by
nerve growth factor deprivation.

39
Evaluation (1)

An influential journal paper from Neuron
J. Whitfield, S. Neame, L. Paquet, O. Bernard,
and J. Ham. Dominant-negative c-jun promotes
neuronal survival by reducing bim expression and
inhibiting mitochondrial cytochrome c release.
Neuron, 29:629-643, 2001.
99 journal papers citing it
203 citances in total
36 different types of important biological
factoids
But we concentrated on one of them
Bim is induced after NGF withdrawal.

40
Evaluation (2)

Set 1: 67 citances pointing to the target paper
and manually found to contain a good or
acceptable paraphrase (do not necessarily contain
Bim or NGF)
Set 2: 65 citances pointing to the target paper
and containing both Bim and NGF
Set 3: 102 sentences from the 99 texts, containing
both Bim and NGF
Cluster all 203 citances
Spectral clustering
Polynomial kernel
Keep clusters for which more than 80% of the
citances include both NGF and Bim
Set 1: assess the system under ideal conditions.
Set 2 vs. 3: do citances produce better
paraphrases?
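
The cluster filter mentioned above (more than 80% of a cluster's citances mention both entities) can be sketched in Python. The clustering itself is not shown; the clusters below are toy lists built from sentences on the paraphrase slides, and plain case-insensitive substring matching stands in for the entity normalization a real pipeline would need. The function name is hypothetical.

```python
def clusters_mentioning_both(clusters, terms=("Bim", "NGF"), threshold=0.8):
    """Keep clusters in which more than `threshold` of citances name all terms."""
    selected = []
    for cluster in clusters:
        if not cluster:
            continue
        hits = sum(
            all(term.lower() in citance.lower() for term in terms)
            for citance in cluster
        )
        if hits / len(cluster) > threshold:
            selected.append(cluster)
    return selected


cluster_a = [
    "NGF withdrawal induces Bim.",
    "Bim is induced in sympathetic neurons in response to NGF withdrawal.",
    "Bim is induced after NGF withdrawal.",
]
cluster_b = [
    "NGF withdrawal induces Bim.",
    "Nerve growth factor withdrawal induces the expression of Bim.",
]

if __name__ == "__main__":
    kept = clusters_mentioning_both([cluster_a, cluster_b])
    print(len(kept))
```

The second toy cluster is rejected only because "Nerve growth factor" is not normalized to NGF, which is why entity normalization comes before this step in practice.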

41
Results

- good (1.0) or acceptable (0.5)

42
The Citance Fact Extraction Problem

(This part primarily by Ariel Schwartz.)
Find groups of words/phrases that are
semantically similar in the target paper's context.
Orthographic similarity is important but does not
always entail semantic similarity.
This is another step needed for normalizing the
content.
Can use the results of this algorithm to
determine which entities to use in the
paraphrasing just described.

43
Example of original citances
44
Entities Identified and Labeled as Equivalent to
One Another
response genotoxic stress Chk1 Chk2 phosphorylate
Cdc25A N terminal sites target rapidly ubiquitin
dependent degradation thought central S G2 cell
cycle checkpoints
Given Chk1 promotes Cdc25A turnover response DNA
damage vivo Chk1 required Cdc25A ubiquitination
SCF beta TRCP vitro explored role Cdc25A
phosphorylation ubiquitination process
activated phosphorylated Chk2 T68 involved
phosphorylation degradation Cdc25A examined
levels Cdc25A 2fTGH U3A cells exposed gamma IR
45
Features for citance word alignment

Orthographic features
exact string match,
normalized edit distance,
prefix, suffix match,
word lengths,
capitalization.
Local contextual features
distance between target words of adjacent source
words,
Word-specific tendency to align like the
previous/next word,
Transition to, from, and between (un)aligned
words.
Biological ontology based features
Medical Subject Headings (MeSH),
Gene synonyms (Entrez Gene, UniProt, OMIM).
Lexical features
WordNet similarity (Lin, 1998)
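
The orthographic features in the list above are simple to compute. The Python sketch below is illustrative only: the feature names, the case folding, and the normalization of edit distance by the longer word's length are assumptions, not details given on the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def orthographic_features(a: str, b: str) -> dict:
    """Orthographic features for one candidate word pair."""
    la, lb = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return {
        "exact_match": a == b,
        "normalized_edit_distance": edit_distance(la, lb) / longest,
        "prefix_match_len": common_prefix_len(la, lb),
        "suffix_match_len": common_prefix_len(la[::-1], lb[::-1]),
        "length_difference": abs(len(a) - len(b)),
        "same_capitalization": a[:1].isupper() == b[:1].isupper(),
    }


if __name__ == "__main__":
    print(orthographic_features("Chk1", "Chk2"))
```

Chk1 and Chk2 differ in a single character yet name different kinases, which illustrates why orthographic similarity alone does not entail semantic similarity.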

46
Approach: Posterior Decoding

Use Conditional Random Fields
Compute posterior probabilities using EM
For every target word w, compute the combination
of source words that maximizes the expected score
of w
Take the union of individual word optimal
alignments and produce a multiple alignment
Use a match-factor to reward/penalize a
combination based on the number of words that
align to the same target word

47
Data sets

3 sets of citances annotated by a PhD with
biological training (Anna Divoli)
Training set: 4 groups, 10 citances each (360
pairs).
Development set: 51 citances (2550 pairs).
Test set: 45 citances (1980 pairs).
Feature engineering was done using the training
and development sets.
Final results based on a model trained on
training and development sets combined, and
tested on the test set.
Baseline: using only normalized edit distance
with a simple cutoff.
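
The pair counts on this slide are consistent with counting ordered citance pairs inside each set, that is n * (n - 1) pairs for n citances, computed per group for the training data. That reading is an inference from the numbers, not something the slide states. A quick Python check:

```python
def ordered_pairs(n_citances: int) -> int:
    """Number of ordered (source, target) pairs among n citances."""
    return n_citances * (n_citances - 1)


training_pairs = 4 * ordered_pairs(10)    # 4 groups of 10 citances
development_pairs = ordered_pairs(51)
test_pairs = ordered_pairs(45)

if __name__ == "__main__":
    print(training_pairs, development_pairs, test_pairs)  # 360 2550 1980
```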

48
Results
49
A Full Text Search Interface

(This work in part by Mike Wooldridge and Jerry
Ye)

50
The Importance of Figures and Captions

Observations of biologists' reading habits
It has often been observed that biologists focus
on figures and captions along with title and
abstract.
KDD Cup 2002
The objective was to extract only the papers that
included experimental results regarding
expression of gene products and to identify, from
all the genes mentioned in each document, the
genes and products for which experimental results
were provided.
ClearForest and Celera did well in part by
focusing on figure captions, which contain
critical experimental evidence.

52
Our Idea

Make a full text search engine for journal
articles that focuses on showing figures
Make it possible to search over caption text (and
text that refers to captions)
Try to group the figures intelligently

53
BioFigure Search Interface

We've indexed the open access journal article
collection
130 journals
20,000 articles
80,000 figures
We've built a figure/caption labeling tool to
create training data
Image types
Comparison or not?
We've made a start at a search interface
Right now the figure grouping facility is very crude
We are going to add faceted navigation (Flamenco)

64
Interested in Helping?