New Search Tools for Bioscience Journal Articles

1
New Search Tools for Bioscience Journal Articles
  • Marti Hearst,
  • UC Berkeley School of Information

UIUC Comp-Bio Seminar, February 12, 2007
Supported by NSF DBI-0317510 and a gift from Genentech
2
Outline
  • Biotext Project Introduction
  • Simple Abbreviation Definition Recognition
  • Citances
  • A New Search Interface Idea

3
Double Exponential Growth in Bioscience Journal Articles
  • From Hunter & Cohen, Molecular Cell 21, 2006

4
BioText Project Goals
  • Provide flexible, useful, appealing search for
    bioscientists.
  • Focus on:
      • Full text journal articles
      • New language analysis algorithms
      • New search interfaces

5
Bioscience Text is Challenging
  • Complex sentence structure
  • Huge vocabulary
      • Including LOTS of abbreviations
  • Gene/protein name recognition is a major task
  • Full text documents have complex structure:
    which parts are key?

6
BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
7
Project Team
  • Project Leaders
      • PI: Marti Hearst
      • Co-PI: Adam Arkin
  • Computational Linguistics and Databases
      • Preslav Nakov
      • Jerry Ye
      • Ariel Schwartz (alum)
      • Brian Wolf (alum)
      • Barbara Rosario (alum)
      • Gaurav Bhalotia (alum)
  • User Interface / IR
      • Mike Wooldridge
      • Rowena Luk (alum)
      • Dr. Emilia Stoica (alum)
  • Bioscience
      • Dr. Anna Divoli
      • Janice Hamerja (alum)
      • Dr. TingTing Zhang (alum)

8
The Problem: Identify Acronym Definitions
  • methyl methanesulfonate sulfate (MMS)
  • heat shock transcription factor (HSF)
  • Gcn5-related N-acetyltransferase (GNAT)
  • We investigated the redox regulation of the
    stress response and report here that in the human
    pre-monocytic line U937 cells, H2O2 induced a
    concentration-dependent transactivation and
    DNA-binding activity of heat-shock factor-1
    (HSF-1)

9
Identifying Acronym Definitions
  • To identify <short form, long form> pairs
    from biomedical text
  • Short form is an abbreviation of the long form
  • There exists a character mapping from short form to
    long form
  • Examples
      • Gcn5-related N-acetyltransferase (GNAT)
  • A non-trivial problem
      • Words in long form may be skipped
      • Internal letters in long form may be used

10
Previous Work
  • Machine learning approaches
      • Linear regression (Chang et al.)
      • Encoding and compression (Yeates et al.)
      • Cubic time or worse
  • Heuristic approach
      • Rule-based (Park & Byrd)
      • Factors considered include
          • Distance between definition and abbreviation
          • Number of stop words
          • Capitalization
      • Can't reproduce this algorithm

11
Step 1: Identifying Candidates
  • Consider only two cases
      • long form ( short form )
      • short form ( long form )
  • Short form
      • No more than 2 words
      • Between 2 and 10 chars
      • At least one letter
      • First char alphanumeric
  • Long form
      • Adjacent to short form
      • No more than min(|A| + 5, |A| * 2) words, where |A| is
        the number of characters in the short form
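The candidate filters above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names are hypothetical, the length bound is read as min(|A| + 5, |A| * 2) with |A| the number of characters in the short form, and counting spaces toward |A| is an assumption.

```python
def is_valid_short_form(short_form: str) -> bool:
    """Check the short-form criteria listed on the slide."""
    words = short_form.split()
    return (
        0 < len(words) <= 2                       # no more than 2 words
        and 2 <= len(short_form) <= 10            # between 2 and 10 characters
        and any(c.isalpha() for c in short_form)  # at least one letter
        and short_form[0].isalnum()               # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on the long form's length in words: min(|A| + 5, |A| * 2)."""
    a = len(short_form)  # assumption: |A| counts every character, spaces included
    return min(a + 5, a * 2)
```

For "HSF" the bound is min(8, 6) = 6 words, so only the six words next to the parenthesis need to be searched for a long form.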

12
Step 2: Identifying Correct Long Forms
  • heat shock transcription factor (HSF)

13
Step 2: Identifying Correct Long Forms
  • Gcn5-related N-acetyltransferase (GNAT)

14
Step 2: Identifying Correct Long Forms
  • From right to left, the shortest long form that
    matches the short form
  • Each character in short form must match a
    character in long form
  • The match of the character at the beginning of
    the short form must match a character in the
    initial position of the first word in the long
    form
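The three rules above can be turned into a short right-to-left matcher. The sketch below is a Python reconstruction from those rules, not the Java listing the talk refers to (which this transcript does not include). The function name and signature are assumptions, and the candidate long form is assumed to be the text adjacent to the parenthesis.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form, or None."""
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            # Punctuation in the short form is not matched.
            s_index -= 1
            continue
        # Scan leftwards for a matching character. For the first character
        # of the short form, also require the match to start a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of long-form text without a match
        l_index -= 1
        s_index -= 1
    # Cut the long form at the start of the word where matching ended.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]
```

Given the candidate text "report here that heat shock transcription factor" and the short form "HSF", this returns "heat shock transcription factor". Pairs such as 5-HT / serotonin, where no character mapping exists, return None.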

15
Java Code for Finding the Best Long Form for a
Given Short Form
16
Evaluation
  • 1000 randomly selected MEDLINE abstracts
      • 82% recall, 95% precision
  • Medstract Gold Standard Evaluation Corpus
      • 82% recall, 96% precision
      • Compared with
          • 83% recall, 80% precision (Chang et al., linear
            regression)
          • 72% recall, 98% precision (Pustejovsky et al.,
            heuristics)

17
Missing Pairs
  • Skipped characters in short form
      • <CNS1, cyclophilin seven suppressor>
  • No match
      • <5-HT, serotonin>
  • Out of order
      • <ATN, anterior thalamus>
  • Partial match
      • <Pol I, RNA polymerase I>

18
Other NLP Work
  • Relation labeling
      • (Work primarily by Barbara Rosario)
      • Protein-protein interactions: which ones are
        happening?
      • They also demonstrate that the GAG protein from
        membrane-containing viruses, such as HIV, binds
        to Alix/AIP1, thereby recruiting the ESCRT
        machinery to allow budding of the virus from the
        cell surface [cite].
      • Distinguished among 10 different relations
          • Binds, degrades, synergizes with, upregulates
      • Simple supervised approach gets surprisingly high
        results (60% accuracy)

19
Acquiring Labeled Data using Citances
20

A discovery is made
A paper is written
21

That paper is cited
and cited
and cited
as the evidence for some fact(s) F.
22

Each of these in turn is cited for some fact(s)
until it is the case that all important facts
in the field can be found in citation sentences
alone!
23
Citances
  • Nearly every statement in a bioscience journal
    article is backed up with a cite.
  • It is quite common for papers to be cited 30-100
    times.
  • The text around the citation tends to state
    biological facts. (Call these citances.)
  • Different citances will state the same facts in
    different ways
  • so can we use these for creating models of
    language expressing semantic relations?

24
Using citances
  • Potential uses of citation sentences (citances)
  • creation of training and testing data for
    semantic analysis,
  • synonym set creation,
  • database curation,
  • document summarization,
  • and information retrieval generally.
  • All of the above require citance word alignments.

25
Sample Citance
  • Recent research, in proliferating cells, has
    demonstrated that interaction of E2F1 with the
    p53 pathway could involve transcriptional
    up-regulation of E2F1 target genes such as
    p14/p19ARF, which affect p53 accumulation
    [67,68], E2F1-induced phosphorylation of p53
    [69], or direct E2F1-p53 complex formation [70].

26
Related Work
  • Traditional citation analysis dates back to the
    1960s (Garfield). Includes:
      • Citation categorization,
      • Context analysis,
      • Citer motivation.
  • Citation indexing systems, such as ISI's SCI and
    CiteSeer.
  • Mercer and Di Marco (2004) propose to improve
    citation indexing using citation types.
  • Bradshaw (2003) introduces Reference Directed
    Indexing (RDI), which indexes documents using the
    terms in the citances citing them.

27
Related Work (cont.)
  • Teufel and Moens (2002) identify citances to
    improve summarization of the citing paper. They
    give lower weight to citances as candidate
    sentences for summarization.
  • Nanba et al. (2000) use citances as features for
    classifying papers into topics.
  • A field related to citation indexing is the use of
    link structure and anchor text of Web pages.
      • Applications include IR, classification, Web
        crawlers, and summarization.
  • See the full paper for references.

28
Issues for Processing Citances
  • Text span
      • Identification of the appropriate phrase, clause,
        or sentence that constitutes a citance.
      • Correct mapping of citations when shown as lists
        or groups (e.g., 22-25).
  • Grouping citances by topic
      • Citances that cite the same document should be
        grouped by the facts they state.
  • Normalizing or paraphrasing concepts in citances
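As an illustration of the citation-mapping issue, a grouped marker such as 22-25 has to be expanded into its individual reference numbers before each citance can be linked to the papers it cites. The helper below is a hypothetical sketch; it assumes numeric citation markers separated by commas, with hyphens or en dashes for ranges, which is only one of the styles journals use.

```python
import re
from typing import List


def expand_citation_group(group: str) -> List[int]:
    """Expand a marker such as '22-25' or '67,68' into reference numbers."""
    numbers: List[int] = []
    for part in re.split(r"\s*,\s*", group.strip()):
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))  # inclusive range
        elif part.isdigit():
            numbers.append(int(part))
    return numbers
```

With this, "22-25" maps to references 22, 23, 24 and 25, and "67,68" maps to 67 and 68.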

29
How Do Citances Differ From Abstracts?
  • (This part primarily by Anna Divoli.)
  • We did a detailed analysis of facts that appear
    in citances.
  • 6 target papers, molecular interactions domain
  • We did the same for the abstracts of the target
    papers.

30
Distributions of Concept Types
31
How Do Citances Differ From Abstracts?
  • Main results
      • All of the facts in the abstract are covered by
        the citances (collectively)
      • However, some facts in citances do not appear in
        the abstracts.
          • Mainly Entities and Experimental Methods
  • This suggests there is important information in
    the full text that is not represented by the
    abstract, title, and metadata alone.

32
Paraphrasing Citances
  • (This part primarily by Preslav Nakov)
  • Problem: many citances say the same thing in
    different ways
  • The sentence structure is very complex and
    contains irrelevant information
  • We want to first normalize those citances that
    talk about similar things, so we can then
    determine which sentences repeat the same
    information.
  • This will then allow us to determine what the key
    points are and thus convert them into summaries.

33
Want to Normalize These
  • NGF withdrawal from sympathetic neurons induces
    Bim, which then contributes to death.
  • Nerve growth factor withdrawal induces the
    expression of Bim and mediates Bax dependent
    cytochrome c release and apoptosis.
  • Recently, Bim has been shown to be upregulated
    following both nerve growth factor withdrawal
    from primary sympathetic neurons, and serum and
    potassium withdrawal from granule neurons.
  • The proapoptotic Bcl-2 family member Bim is
    strongly induced in sympathetic neurons in
    response to NGF withdrawal.
  • In neurons, the BH3 only Bcl2 member, Bim, and
    JNK are both implicated in apoptosis caused by
    nerve growth factor deprivation.

34
The Resulting Paraphrases
  • NGF withdrawal induces Bim.
  • Nerve growth factor withdrawal induces the
    expression of Bim.
  • Bim has been shown to be upregulated following
    nerve growth factor withdrawal.
  • Bim is induced in sympathetic neurons in response
    to NGF withdrawal.
  • Bim implicated in apoptosis caused by nerve
    growth factor deprivation.
  • They all paraphrase:
      • Bim is induced after NGF withdrawal.

35
Paraphrase Creation Algorithm
  1. Extract the sentences that cite the target.
  2. Mark the NEs of interest (genes/proteins, MeSH
     terms) and normalize.
  3. Dependency parse.
  4. For each parse, for each pair of NEs of interest:
       i. Extract the path between them.
       ii. Create a paraphrase from the path.
  5. Rank the candidates for a given pair of NEs.
  6. Select only the ones above a threshold.
  7. Generalize.
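Step 4(i), extracting the path between two entities, can be sketched on a toy dependency tree. This is an illustration only: it assumes the parse is given as a list of head indices (with -1 marking the root), the function names are hypothetical, and the parse in the usage note is hand-made rather than the output of any particular parser.

```python
from typing import List


def path_to_root(heads: List[int], index: int) -> List[int]:
    """Token indices from `index` up to the root of the dependency tree."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_path(heads: List[int], a: int, b: int) -> List[int]:
    """Indices on the tree path between tokens a and b, in sentence order."""
    up_a = path_to_root(heads, a)
    up_b = path_to_root(heads, b)
    on_b = set(up_b)
    # First node on a's way up that is also an ancestor of (or equal to) b.
    common = next(i for i in up_a if i in on_b)
    nodes = up_a[: up_a.index(common) + 1] + up_b[: up_b.index(common)]
    return sorted(nodes)
```

For the tokens "Bim is strongly induced in neurons after NGF withdrawal" with hand-assigned heads [3, 3, 3, -1, 3, 4, 3, 8, 6], the path between "Bim" (index 0) and "NGF" (index 7) is [0, 3, 6, 7, 8], which reads "Bim induced after NGF withdrawal".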

36
Creating a Paraphrase
  • Given the path from the dependency parse
  • Restore the original word order.
  • Add words to improve grammaticality.
  • Path: Bim shown be following nerve growth factor
    withdrawal.
  • Paraphrase: Bim has been shown to be upregulated
    following nerve growth factor withdrawal.

37
Creating a Paraphrase
  • Given the path from the dependency parse
  • Restore the original word order.
  • Add words to improve grammaticality.
  • Complex verb forms: passive, infinitive, past,
    etc.
  • Lin & Pantel and Ibrahim et al. manipulate the parser's
    output
  • We use the 2-word heuristic
  • If the path extracted from the dependency parse
    skips over either one or two words, those one or
    two words are inserted back into the paraphrase,
    unless those words are adverbs.
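A minimal sketch of the 2-word heuristic, assuming the path is given as sorted token indices and that adverbs have already been identified (for example by a part-of-speech tagger, not shown). The function name is hypothetical, and dropping individual adverbs inside a gap, rather than the whole gap, is one reading of the rule.

```python
from typing import List, Set


def apply_two_word_heuristic(tokens: List[str], path: List[int], adverbs: Set[int]) -> str:
    """Re-insert runs of one or two skipped words, leaving adverbs out."""
    if not path:
        return ""
    kept: List[int] = []
    for prev, curr in zip(path, path[1:]):
        kept.append(prev)
        gap = range(prev + 1, curr)  # words the path skipped over
        if 1 <= len(gap) <= 2:
            kept.extend(i for i in gap if i not in adverbs)
    kept.append(path[-1])
    return " ".join(tokens[i] for i in kept)
```

Applied to the path "Bim shown be following nerve growth factor withdrawal" (indices 0, 3, 5, 7, 8, 9, 10, 11 of the original sentence), the skipped words "has been", "to" and "upregulated" are restored, giving "Bim has been shown to be upregulated following nerve growth factor withdrawal".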

38
2-word Heuristic Demonstration
  • NGF withdrawal induces Bim.
  • Nerve growth factor withdrawal induces the
    expression of Bim.
  • Bim has been shown to be upregulated
    following nerve growth factor withdrawal.
  • Bim is induced in sympathetic neurons in
    response to NGF withdrawal.
  • member Bim implicated in apoptosis caused by
    nerve growth factor deprivation.

39
Evaluation (1)
  • An influential journal paper from Neuron
      • J. Whitfield, S. Neame, L. Paquet, O. Bernard,
        and J. Ham. Dominant-negative c-jun promotes
        neuronal survival by reducing bim expression and
        inhibiting mitochondrial cytochrome c release.
        Neuron, 29:629-643, 2001.
  • 99 journal papers citing it
      • 203 citances in total
      • 36 different types of important biological
        factoids
  • But we concentrated on one of them
      • Bim is induced after NGF withdrawal.

40
Evaluation (2)
  • Set 1: 67 citances pointing to the target paper
    and manually found to contain a good or
    acceptable paraphrase (do not necessarily contain
    Bim or NGF)
  • Set 2: 65 citances pointing to the target paper
    and containing both Bim and NGF
  • Set 3: 102 sentences from the 99 texts that contain
    both Bim and NGF
  • Cluster all 203 citances
      • Spectral clustering
      • Polynomial kernel
      • Clusters for which more than 80% of the citances
        include both NGF and Bim
  • Set 1: assess the system under ideal conditions.
  • Set 2 vs. 3: do citances produce better
    paraphrases?
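The cluster filter can be sketched as follows, reading the threshold as 80% of a cluster's citances. The clustering itself (spectral clustering with a polynomial kernel) is not shown. The function names are hypothetical, and plain substring tests stand in for the entity normalization a real system would apply first.

```python
from typing import List, Sequence


def mentions_both(citance: str, terms: Sequence[str] = ("NGF", "Bim")) -> bool:
    """True if the citance text contains every term (case-sensitive substring test)."""
    return all(term in citance for term in terms)


def select_clusters(clusters: List[List[str]], threshold: float = 0.8) -> List[List[str]]:
    """Keep clusters in which more than `threshold` of the citances mention both terms."""
    selected = []
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster has no share to compute
        share = sum(mentions_both(c) for c in cluster) / len(cluster)
        if share > threshold:
            selected.append(cluster)
    return selected
```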

41
Results
  • - good (1.0) or acceptable (0.5)

42
The Citance Fact Extraction Problem
  • (This part primarily by Ariel Schwartz.)
  • Find groups of words/phrases that are
    semantically similar in the target paper's context.
  • Orthographic similarity is important but does not
    always entail semantic similarity.
  • This is another step needed for normalizing the
    content.
  • Can use the results of this algorithm to
    determine which entities to use in the
    paraphrasing just described.

43
Example of original citances
44
Entities Identified and Labeled as Equivalent to
One Another
  • response genotoxic stress Chk1 Chk2 phosphorylate
    Cdc25A N terminal sites target rapidly ubiquitin
    dependent degradation thought central S G2 cell
    cycle checkpoints
  • Given Chk1 promotes Cdc25A turnover response DNA
    damage vivo Chk1 required Cdc25A ubiquitination
    SCF beta TRCP vitro explored role Cdc25A
    phosphorylation ubiquitination process
  • activated phosphorylated Chk2 T68 involved
    phosphorylation degradation Cdc25A examined
    levels Cdc25A 2fTGH U3A cells exposed gamma IR
45
Features for citance word alignment
  • Orthographic features
      • exact string match,
      • normalized edit distance,
      • prefix, suffix match,
      • word lengths,
      • capitalization.
  • Local contextual features
      • distance between target words of adjacent source
        words,
      • word-specific tendency to align like the
        previous/next word,
      • transition to, from, and between (un)aligned
        words.
  • Biological ontology based features
      • Medical Subject Headings (MeSH),
      • Gene synonyms (Entrez Gene, Uniprot, OMIM).
  • Lexical features
      • Wordnet similarity (Lin, 1998)
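The orthographic features are the easiest to make concrete. The sketch below computes a few of them for a pair of words. It is an assumption-laden illustration: the feature names, the 3-character prefix and suffix window, and normalizing the edit distance by the longer word's length are all choices made here, not details given in the talk.

```python
from typing import Dict, Union


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def orthographic_features(source: str, target: str) -> Dict[str, Union[bool, float, int]]:
    """A handful of orthographic features for one source/target word pair."""
    s, t = source.lower(), target.lower()
    return {
        "exact_match": source == target,
        "normalized_edit_distance": normalized_edit_distance(s, t),
        "prefix_match": s[:3] == t[:3],
        "suffix_match": s[-3:] == t[-3:],
        "length_difference": abs(len(source) - len(target)),
        "same_capitalization": source[:1].isupper() == target[:1].isupper(),
    }
```

For "Chk1" and "Chk2" the normalized edit distance is 0.25 and the prefixes match, even though the two names denote different kinases. This is the sense in which orthographic similarity does not always entail semantic similarity.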

46
Approach: Posterior Decoding
  • Use Conditional Random Fields
  • Compute posterior probabilities using EM
  • For every target word w, compute the combination
    of source words that maximizes the expected score
    of w
  • Take the union of individual word optimal
    alignments and produce a multiple alignment
  • Use a match-factor to reward/penalize a
    combination based on the number of words that
    align to the same target word

47
Data sets
  • 3 sets of citances annotated by a PhD with
    biological training (Anna Divoli)
      • Training set: 4 groups, 10 citances each (360
        pairs).
      • Development set: 51 citances (2550 pairs).
      • Test set: 45 citances (1980 pairs).
  • Feature engineering was done using the training
    and development sets.
  • Final results based on a model trained on
    training and development sets combined, and
    tested on the test set.
  • Baseline: using only normalized edit distance
    with a simple cutoff.
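The pair counts quoted for the three sets are consistent with counting ordered pairs of distinct citances within a group, n(n - 1). This is an inference from the numbers, not something the slide states. A quick check:

```python
def ordered_pairs(n: int) -> int:
    """Number of ordered pairs of distinct items among n."""
    return n * (n - 1)


training_pairs = 4 * ordered_pairs(10)   # 4 groups of 10 citances each
development_pairs = ordered_pairs(51)    # one group of 51 citances
test_pairs = ordered_pairs(45)           # one group of 45 citances
```

These evaluate to 360, 2550 and 1980, matching the figures on the slide.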

48
Results
49
A Full Text Search Interface
  • (This work in part by Mike Wooldridge and Jerry
    Ye)

50
The Importance of Figures and Captions
  • Observations of biologists' reading habits
      • It has often been observed that biologists focus on
        figures and captions along with title and abstract.
  • KDD Cup 2002
      • The objective was to extract only the papers that
        included experimental results regarding
        expression of gene products and to identify, from
        all the genes mentioned in each document, the
        genes and products for which experimental results
        were provided.
      • ClearForest/Celera did well in part by focusing
        on figure captions, which contain critical
        experimental evidence.

52
Our Idea
  • Make a full text search engine for journal
    articles that focuses on showing figures
  • Make it possible to search over caption text (and
    text that refers to captions)
  • Try to group the figures intelligently

53
BioFigure Search Interface
  • We've indexed the open access journal article
    collection
      • 130 journals
      • 20,000 articles
      • 80,000 figures
  • We've built a figure/caption labeling tool to
    create training data
      • Image types
      • Comparison or not?
  • We've made a start at a search interface
      • Right now the figure grouping facility is very crude
      • We are going to add faceted navigation (Flamenco)

64
Interested in Helping?
  • We need figure labeling help!
  • We need user feedback!
  • Contact me, or send email to
      • divoli_at_sims.berkeley.edu
  • More information
      • biotext.berkeley.edu