Sentence Extraction Results

Slides: 2
Provided by: RishiRak5

Transcript and Presenter's Notes

Title: Sentence Extraction Results


Problem Definition
  • Motivation: manual annotation of genes cannot
    keep pace with the fast growth of biomedical
    literature
  • Goal: automatically extract relevant
    statements from biomedical literature to
    summarize the already-discovered knowledge about
    a target gene
  • Methodology: two phases
    • IR phase: retrieve articles that contain relevant
      information about the gene of interest
    • IE phase: cluster articles into different
      information aspects, then extract the facts about
      the target gene at the sentence level

Document Similarity and Clustering
  • Goal: group similar documents together to help
    summarize the target gene in several aspects
  • Difficulties
    • Documents on different genes touch different
      aspects
    • No clear boundary between different aspects
    • False positives may dominate due to gene name
      ambiguity
  • Partial solution
    • Predefine a number of categories based on
      observation
    • Generate prior language models using FlyBase
      annotations
    • Use priors on PLSA to guide the clustering

System Overview
  • Relevant document retrieval
    • Query expansion using a gene synonym list
    • Stopword removal from the synonym list to reduce
      false positives
    • Exact phrase matching for multi-token synonyms
  • Document clustering
    • PLSA with and without prior language models
    • K-means
  • Sentence extraction
    • Rank sentences by cosine similarity to the
      cluster centroid using TF-IDF vectors
    • Rank sentences by their probability of being
      generated from the corresponding language model
    • Rank by other heuristics: title, location, etc.
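
The retrieval step and the first sentence-ranking method listed under System Overview can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the function names, the smoothed IDF formula, the tokenizer, and the idea of building the centroid from the cluster's documents are all assumptions made for illustration, and the synonym and stopword inputs in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(synonyms, stopwords):
    """Drop synonyms that are stopwords; compile the rest as exact phrases."""
    kept = [s for s in synonyms if s.strip() and s.lower() not in stopwords]
    return [
        re.compile(r"\b" + r"\s+".join(map(re.escape, s.split())) + r"\b",
                   re.IGNORECASE)
        for s in kept
    ]


def is_relevant(document, patterns):
    """A document is retrieved if any synonym phrase occurs in it."""
    return any(p.search(document) for p in patterns)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_sentences(cluster_docs, sentences):
    """Rank sentences by cosine similarity to the TF-IDF centroid of a cluster."""
    token_lists = [tokenize(d) for d in cluster_docs]
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))

    def vec(tokens):
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        return {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, tf in Counter(tokens).items()}

    centroid = Counter()
    for toks in token_lists:
        for t, w in vec(toks).items():
            centroid[t] += w / n  # mean of the document vectors

    scored = [(cosine(vec(tokenize(s)), centroid), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For example, `expand_query(["eag", "ether a go-go", "a"], {"a"})` keeps only the first two synonyms, and the word-boundary anchors keep "eag" from matching inside an unrelated word such as "beagle". `rank_sentences` returns `(score, sentence)` pairs, highest score first.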

Document Clustering Results
  • Used PubMed abstracts on Drosophila
  • Evaluated results on two genes: EAG and SS
  • Two humans with domain knowledge grouped
    documents into 6 categories (Std 1 and Std 2)
  • Another grouping based on false positives (Std 3)
  • Measured results by class entropy, cluster
    entropy, and their average

Method          EAG Std 1  EAG Std 2  EAG Std 3  SS Std 2  SS Std 3
K-means         0.835      0.880      0.411      0.594     0.317
PLSA w/ prior   0.910      1.092      N/A        0.694     N/A
PLSA w/o prior  1.094      1.197      0.790      0.992     0.773
Random          1.228      1.359      0.834      1.261     1.320

Sentence Extraction Results
  • Extracted sentences from standard clusters of
    documents to isolate the evaluation of sentence
    extraction from document clustering
  • Combined top K sentences extracted by different
    methods into a single judgment pool, and judged
    by a human with domain knowledge
  • Measured results by precision at K sentences

Conclusion
  • Document Clustering
    • Can help recognize false positives
    • K-means is more effective than PLSA for this
      problem
  • Sentence Extraction
    • Can represent the document cluster to some extent
  • Future Work
    • Automatically remove false positives using
      document clustering
    • Derive a better scoring function for sentence
      extraction
    • Develop a more systematic way to evaluate and tune
      the system
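
The slides name the evaluation measures (class entropy, cluster entropy, their average, and precision at K) without defining them. The sketch below assumes the commonly used definitions: cluster entropy is the size-weighted entropy of the reference classes inside each cluster, class entropy is the size-weighted entropy of the cluster assignments inside each reference class, and lower values are better. The natural logarithm and all function names are assumptions, so the numbers it produces are not guaranteed to be on the same scale as the table.

```python
import math
from collections import Counter, defaultdict


def _weighted_entropy(groups, labels):
    """Size-weighted average entropy of `labels` within each group."""
    members = defaultdict(list)
    for group, label in zip(groups, labels):
        members[group].append(label)
    n = len(groups)
    total = 0.0
    for items in members.values():
        counts = Counter(items)
        h = -sum((c / len(items)) * math.log(c / len(items))
                 for c in counts.values())
        total += (len(items) / n) * h
    return total


def cluster_entropy(clusters, classes):
    """How mixed the reference classes are inside each cluster."""
    return _weighted_entropy(clusters, classes)


def class_entropy(clusters, classes):
    """How scattered each reference class is across clusters."""
    return _weighted_entropy(classes, clusters)


def average_entropy(clusters, classes):
    """Mean of class entropy and cluster entropy."""
    return 0.5 * (class_entropy(clusters, classes)
                  + cluster_entropy(clusters, classes))


def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences judged relevant."""
    top = ranked_sentences[:k]
    return sum(1 for s in top if s in relevant) / k
```

Here `clusters` and `classes` are parallel lists giving, for each document, its assigned cluster and its reference category (for instance one of the human groupings). A clustering that reproduces the reference grouping exactly scores 0 on both entropies.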