A research literature search engine with abbreviation recognition

Provided by: ChengT3
Learn more at: http://web.stanford.edu

1
A research literature search engine with
abbreviation recognition
  • Cheng-Tao Chu
  • Pei-Chin Wang

2
Outline
  • Features
  • Demo
  • Issues involved
  • Implementation
  • Tailored Edit Distance
  • Probabilistic Model
  • Translation Model
  • Score Combination
  • Evaluation
  • Q&A

3
Features
  • Given a query containing author names, a proceedings name, or
    title keywords, return the relevant papers
  • Retrieve the desired papers even when author or
    proceedings names are abbreviated
  • Web interface for querying and user evaluation

4
Demo
  • It's show time

5
Issues involved
  • Tag an arbitrary query into author, proceedings,
    and other-keyword fields
  • Recognize the author
  • P. Raghavan -> Prabhakar Raghavan
  • -> Padma Raghavan
  • -> Raghavan
  • Probability of each possible candidate
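
As a rough illustration of the author-recognition issue above, the following sketch expands an abbreviated author name into full-name candidates and gives each a probability. The author table, the paper counts, and the idea of using publication counts as the probability are all assumptions made for this example; the slides do not say how the candidate probabilities are obtained at this stage.

```python
# Hypothetical author table (name -> number of papers); the talk draws names from DBLP.
AUTHOR_PAPER_COUNTS = {
    "Prabhakar Raghavan": 3,
    "Padma Raghavan": 1,
    "Raghu Ramakrishnan": 2,
}


def matches(query: str, full_name: str) -> bool:
    """True if the query could abbreviate full_name.

    The last query token must be a prefix of the last name, and the
    remaining tokens must be prefixes of the leading name parts.
    """
    q = [t.rstrip(".").lower() for t in query.split()]
    n = full_name.lower().split()
    if not q or len(q) > len(n):
        return False
    if not n[-1].startswith(q[-1]):
        return False
    return all(part.startswith(tok) for tok, part in zip(q[:-1], n))


def author_candidates(query: str, counts: dict) -> dict:
    """Return {full name: probability}, proportional to the assumed counts."""
    hits = {name: c for name, c in counts.items() if matches(query, name)}
    total = sum(hits.values())
    return {name: c / total for name, c in hits.items()}


print(author_candidates("P. Raghavan", AUTHOR_PAPER_COUNTS))
```

With this toy table, the query "P. Raghavan" keeps both Raghavans and drops Ramakrishnan, which is the ambiguity the slide points at.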

6
Issues involved (cont.)
  • Recognize the proceedings name
  • More than a look-up table
  • IJCAI -> International Joint Conference on AI
  • -> IJCAI Workshop
  • How to combine the weights of the candidates
  • Score from Lucene
  • Score for a possible author
  • Score for a possible proceedings name
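
To show why a plain look-up table falls short for proceedings names, here is a small sketch that accepts a venue either when its capitalized initials spell the acronym or when the acronym appears as a word in the title. The venue list and both matching rules are assumptions for illustration, not the authors' method.

```python
# Hypothetical venue titles; the talk uses proceedings names from DBLP.
VENUES = [
    "International Joint Conference on Artificial Intelligence",
    "IJCAI Workshop",
    "International Conference on Machine Learning",
]


def initials(title: str) -> str:
    """Concatenate the first letters of the capitalized words in a title."""
    return "".join(word[0] for word in title.split() if word[0].isupper())


def venue_candidates(acronym: str, venues: list) -> list:
    """Venues whose initials spell the acronym or that contain it as a word."""
    acronym = acronym.upper()
    return [
        v for v in venues
        if initials(v) == acronym or acronym in v.upper().split()
    ]


print(venue_candidates("IJCAI", VENUES))
```

Both the main conference and the workshop come back as candidates, so a later step still has to weight them.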

7
Implementation
[Architecture diagram. Components: DBLP, XML Parser, Database, Tagger, Query, Search Engine, Browser, Retrieved Documents, Probabilistic Model, Tailored Edit Distance]

8
Tailored Edit Distance
  • Heuristic
  • Reward for consecutive matches
  • Reward for matching capitalized characters
  • Higher penalty for substitution, lower for
    insertion/deletion
  • Probabilistic representation
  • Transform edit distance cost to probability
  • Normalize the cost
  • Use training data to estimate the distribution
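
The heuristics above can be sketched as a dynamic program. This is a minimal sketch, not the authors' implementation: the cost and reward values, the case-sensitive comparison, and the normalization by the longer string's length are all assumptions chosen for illustration.

```python
import math

SUB, INDEL = 2.0, 1.0              # substitution costs more than insertion/deletion
CAP_BONUS, RUN_BONUS = 0.5, 0.25   # assumed reward sizes


def tailored_cost(abbr: str, full: str) -> float:
    """Edit cost with rewards for capital-letter and consecutive matches."""
    inf = math.inf
    n, m = len(abbr), len(full)
    # match[i][j]: best cost when the last step matched abbr[i-1] with full[j-1]
    # other[i][j]: best cost when the last step was an insert, delete, or substitute
    match = [[inf] * (m + 1) for _ in range(n + 1)]
    other = [[inf] * (m + 1) for _ in range(n + 1)]
    other[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # delete a character of the abbreviation
                options.append(min(match[i - 1][j], other[i - 1][j]) + INDEL)
            if j > 0:  # insert a character of the full name
                options.append(min(match[i][j - 1], other[i][j - 1]) + INDEL)
            if i > 0 and j > 0:
                if abbr[i - 1] == full[j - 1]:
                    reward = CAP_BONUS if abbr[i - 1].isupper() else 0.0
                    match[i][j] = min(
                        other[i - 1][j - 1] - reward,
                        match[i - 1][j - 1] - reward - RUN_BONUS,  # extends a run
                    )
                else:
                    options.append(min(match[i - 1][j - 1], other[i - 1][j - 1]) + SUB)
            other[i][j] = min(options)
    return min(match[n][m], other[n][m])


def normalized_cost(abbr: str, full: str) -> float:
    """Divide by the longer length so long names are not penalized for length alone."""
    return tailored_cost(abbr, full) / max(len(abbr), len(full), 1)
```

Without normalization a short unrelated title can score better than the correct long one simply because it needs fewer insertions, which is one reason to normalize before estimating a distribution from training data.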

9
Conceptual Histogram
10
Probabilistic Model
  • Translation Model
  • Use tailored edit distance to estimate the
    distribution
  • Return a distribution over candidate names
    (assuming independence between the full name
    and its abbreviation given the evidence)
  • Network Structure

[Network diagram. Nodes: Full Name, Last Name, Middle Name, First Name, Mid. Ini., First Ini., Last Ini.]

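
One simple way to turn per-candidate edit costs into the distribution of candidate names mentioned above is sketched below. The exponential form, the `scale` parameter, and the optional priors are assumptions standing in for the distribution the authors estimate from training data.

```python
import math


def candidate_distribution(costs: dict, priors: dict = None, scale: float = 1.0) -> dict:
    """Map {candidate: normalized edit cost} to {candidate: probability}.

    Assumed model: P(candidate) is proportional to prior * exp(-scale * cost),
    so a lower cost yields a higher probability.
    """
    if priors is None:
        priors = {name: 1.0 for name in costs}  # uniform prior by default
    weights = {name: priors[name] * math.exp(-scale * cost)
               for name, cost in costs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical costs for two candidates of the same abbreviation.
dist = candidate_distribution({"Prabhakar Raghavan": 0.2, "Padma Raghavan": 0.9})
print(dist)
```

A larger `scale` makes the distribution sharper around the lowest-cost candidate; a value fitted on labeled abbreviation pairs would play the role of the training step on the slide.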
11
Score Combination
  • Lucene score formula
  • Assign a weight to each candidate
  • Combination score
  • Set idf(t) to (weight of that term × original
    idf(t))
  • Assign a boost value to each term in the query
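
The boosting step can be sketched by building a query string in Lucene's classic query syntax, where `^` attaches a boost to a term or phrase. The field name `author`, the helper names, and the reading of the idf adjustment as a product of weight and original idf are assumptions for this example.

```python
def weighted_idf(weight: float, original_idf: float) -> float:
    """Assumed reading of the slide: scale a term's idf by its candidate weight."""
    return weight * original_idf


def boosted_query(field: str, candidates: dict) -> str:
    """Build a Lucene classic-syntax query with one boosted phrase per candidate.

    `candidates` maps a candidate name to its weight (for example, a probability).
    """
    clauses = [f'{field}:"{name}"^{weight:g}' for name, weight in candidates.items()]
    return " OR ".join(clauses)


query = boosted_query("author", {"Prabhakar Raghavan": 0.75, "Padma Raghavan": 0.25})
print(query)
```

Passing the candidate probabilities as boosts lets the search engine rank papers by the more likely author first while still returning papers by the less likely one.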

12
Evaluation
  • Test data construction
  • Evaluation by test data
  • Precision
  • User evaluation
  • Comparison with Google Scholar
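
For reference, precision over a result list can be computed as below. The document identifiers and the relevance set are made up for the example; the slides do not give the test data or any measured values.

```python
def precision(retrieved: list, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


# Hypothetical result list and relevance judgments.
print(precision(["p1", "p2", "p3", "p4"], {"p1", "p3"}))
```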

13
Q&A