Computational Linguistic Techniques Applied to Drugname Matching

1
Computational Linguistic Techniques Applied to Drugname Matching
Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
June 26, 2003
2
Drugname Matching
  • String matching to rank similarity between drug
    names
  • Two classes of string matching
    • Orthographic: compare strings in terms of
      spelling, without reference to sound
    • Phonological: compare strings on the basis of a
      phonetic representation
  • Two methods of matching
    • Distance: how far apart are two strings?
    • Similarity: how close are two strings?

3
Distance and Similarity Measures: Orthographic/Phonological
  • Orthographic
    • Distance: string-edit
      Ex: contac / zantac: 2/6 = 0.33
    • Similarity: LCSR, DICE
      Ex (LCSR): contac / zantac: 4/6 = 0.67
      Ex (DICE): co on nt ta ac / za an nt ta ac: 6/10 = 0.60
  • Phonological
    • Distance: Soundex
      Ex: contac / zantac: 1/4 = 0.25
    • Similarity: ALINE
      Ex: contac / zantac: 0.64
4
Distance vs. Similarity Examples
  • Example 1: hordes vs. lords
    • Distance: 2 (replace h with l, and delete e).
    • Similarity: 2 (bigrams or and rd in common).
  • Example 2: water vs. wine
    • Distance: 3 (replace a with i, t with n, delete r).
    • Similarity: 0 (no bigrams in common).
  • We can compare (global) similarity and distance by
    normalizing for length:
    • sim(w1,w2)/length
    • 1 - dist(w1,w2)/length

5
Orthographic Distance: string-edit
  • Count the number of steps it takes to
    transform one string into another
  • Examples
    • Distance between hordes and lords is 2.
    • Distance between water and wine is 3.
  • For global distance, we can divide by the length of
    the longest string: 2/6 and 3/5 above
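
The step counting described above can be sketched as a standard Levenshtein computation. This is a minimal illustration, assuming unit cost for each insertion, deletion, and substitution; the function names are chosen here for illustration and are not taken from the presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Global distance: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

With these definitions, `edit_distance("hordes", "lords")` gives 2 and `normalized_edit_distance("water", "wine")` gives 3/5, matching the figures on the slide.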

6
Orthographic Similarity: LCSR, DICE
  • LCSR: Divide the length of the longest common
    subsequence by the length of the longest string
    • Example: reagir and repair have longest common
      subsequence reair. Similarity score: 5/max(6,6) =
      5/6 = 0.83
  • DICE: Double the number of shared character
    bigrams and divide by the total number of bigrams in
    both strings
    • Example: reagir and repair have bigram sets
      {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir},
      respectively, and the shared bigrams are {re,ir}.
      Similarity score: (2 × 2)/(5 + 5) = 2/5 = 0.40
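
Both measures can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes LCSR is normalized by the longer string, as defined above, and it counts repeated bigrams as a multiset, which is an assumption that makes no difference for the examples shown here.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio, normalized by the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> list:
    """Overlapping character bigrams, e.g. 'contac' -> co, on, nt, ta, ac."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the shared bigrams divided by the total bigram count."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 0.0
    shared = sum((Counter(ba) & Counter(bb)).values())  # multiset overlap
    return 2 * shared / (len(ba) + len(bb))
```

For reagir and repair this gives an LCSR of 5/6 and a DICE score of 0.40, as in the examples above.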

7
Phonological Matching
  • Distance-based phonological matching
    • Soundex
  • Similarity-based phonological matching
    • ALINE

8
Phonological Distance
  • Soundex: Examples
    • king and khyngge reduce to k52
    • knight and night reduce to k523 and n23
    • pulpit and phlebotomy reduce to p413
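
The reductions above can be reproduced with a sketch of the common American Soundex rules: keep the first letter, map consonants to digits, drop vowels, collapse adjacent identical codes, and cut or zero-pad to four characters. The padding means king comes out as K520 rather than the k52 shown above. The `soundex_distance` helper is an assumption made here, a plain position-by-position mismatch rate, included because it yields 1/4 for contac and zantac; the presentation does not say how its Soundex distance was computed.

```python
# Letter-to-digit table of standard Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Four-character Soundex code; empty string if word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    last = CODES.get(letters[0], "")  # code of the previous coded letter
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate identical codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != last:
            out.append(code)
        last = code  # a vowel resets the run, so a repeat after it is kept
    return ("".join(out) + "000")[:4]  # truncate or zero-pad to length 4


def soundex_distance(a: str, b: str) -> float:
    """Assumed measure: share of the four code positions that differ."""
    ca, cb = soundex(a), soundex(b)
    return sum(x != y for x, y in zip(ca, cb)) / 4
```

Here `soundex("pulpit")` and `soundex("phlebotomy")` both return P413, which shows the false match that truncation and vowel dropping produce.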

9
What went wrong?
  • Truncation of word to four characters
    • Alternative: Use entire string
  • Ignoring vowels
    • Alternative: Use more sophisticated phonetic rules
  • Using numbers instead of decomposable features
    • Alternative: Use decomposable features

10
Phonological Similarity
  • Another possible approach: Compare syllable
    count, initial/final sounds, stress locations
    • Misses frequently confused pairs
  • Alternative: Use phonological features to compare
    two words by their sounds.
    • x → k(s): +consonantal, +velar, +stop, -voice
    • x → z: +consonantal, +alveolar, +fricative,
      +voice
  • Phonological similarity of two words: Optimal
    match between their phonological features.
    • Zantac
    • Xanax

11
Kondrak: ALINE (2000)
  • Two fundamental components of ALINE
    • Similarity function: Uses linguistic feature
      analysis, with measurements based on salience, e.g.,
      alveolar and stop are more salient than voice
    • Method for choosing the optimal alignment: Creates
      an alignment based on a weighted multi-feature
      analysis
  • Designed to align phonetic sequences for many
    different CL applications
  • Developed originally for identifying cognates in
    vocabularies of related languages (e.g., colour,
    couleur)
  • Feature weights can be fine-tuned for a specific
    application.
  • Efficient: Quadratic dynamic programming
    algorithm
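
The two components can be illustrated with a much simplified sketch. This is not ALINE itself: it uses plain global alignment with a fixed gap penalty, and the salience weights, maximum score, gap penalty, and place values below are all hypothetical numbers chosen for illustration. Only the manner values for stop (1.0) and fricative (0.8) come from the slides.

```python
# Hypothetical salience weights: place and manner count more than voice.
SALIENCE = {"place": 40.0, "manner": 50.0, "voice": 5.0}
MAX_SCORE = 35.0  # hypothetical score for two identical phonemes
GAP = -10.0       # hypothetical insertion/deletion penalty

# Toy feature table; place values (velar 0.6, alveolar 0.85) are made up.
FEATURES = {
    "k": {"place": 0.60, "manner": 1.0, "voice": 0.0},
    "t": {"place": 0.85, "manner": 1.0, "voice": 0.0},
    "d": {"place": 0.85, "manner": 1.0, "voice": 1.0},
    "s": {"place": 0.85, "manner": 0.8, "voice": 0.0},
    "z": {"place": 0.85, "manner": 0.8, "voice": 1.0},
}


def phoneme_similarity(p: str, q: str) -> float:
    """Maximum score minus the salience-weighted feature differences."""
    fp, fq = FEATURES[p], FEATURES[q]
    penalty = sum(SALIENCE[f] * abs(fp[f] - fq[f]) for f in SALIENCE)
    return MAX_SCORE - penalty


def align_score(a: str, b: str) -> float:
    """Best global alignment score of two phoneme strings.
    The table has (len(a)+1) x (len(b)+1) cells, hence quadratic time."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * GAP
    for j in range(1, cols):
        score[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + phoneme_similarity(a[i - 1], b[j - 1]),
                score[i - 1][j] + GAP,   # skip a phoneme of a
                score[i][j - 1] + GAP)   # skip a phoneme of b
    return score[-1][-1]
```

Under these toy weights a voicing difference (t vs. d) costs less than a place difference (t vs. k), which mirrors the idea that voice is the less salient feature.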

12
ALINE Features: Weights and Values
13
Places of Articulation: Numerical Values
14
Manner of Articulation: Numerical Values
  • stop: 1.0. Example: p, b
  • affricate: 0.9. Example: th
  • fricative: 0.8. Example: f, v

15
Tuning of ALINE Parameters
  • Parameters have default settings for the cognate
    matching task, but these are not appropriate for
    drugname matching
  • Parameter tuning
    • Calculate weights for drugname matching
    • Hill-climbing search against gold standard
  • Tuned parameters for drugname task
    • maximum score
    • insertion/deletion penalty
    • vowel penalty
    • phonological feature values
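
The hill-climbing search can be sketched generically as follows. This is a minimal sketch under stated assumptions: the step size, the stopping rule, and especially `toy_objective` are invented for illustration. In the actual task the objective would score a parameter setting against the gold standard, which the slides do not specify in detail.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def hill_climb(start: Params,
               objective: Callable[[Params], float],
               step: float = 0.25,
               max_rounds: int = 100) -> Tuple[Params, float]:
    """Greedy coordinate search: nudge one parameter at a time and keep
    any change that strictly improves the objective."""
    best = dict(start)
    best_score = objective(best)
    names = list(start)
    for _ in range(max_rounds):
        improved = False
        for name in names:
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] = best[name] + delta
                score = objective(candidate)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break  # local optimum: no single step helps
    return best, best_score


def toy_objective(p: Params) -> float:
    """Stand-in objective with a made-up optimum at 1.0 and 0.5."""
    return -((p["indel_penalty"] - 1.0) ** 2
             + (p["vowel_penalty"] - 0.5) ** 2)
```

Calling `hill_climb({"indel_penalty": 0.0, "vowel_penalty": 0.0}, toy_objective)` walks to the made-up optimum in steps of 0.25. Hill climbing only finds a local optimum, so the result can depend on the starting point.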

16
Comparison of Outputs
  Measure   zantac/xanax   zantac/contac   xanax/contac
  ALINE     0.792          0.639           0.486
  EDIT      0.500          0.667           0.333
  LCSR      0.545          0.667           0.364
  DICE      0.222          0.600           0.000

17
Evaluation
  • Precision and recall against online gold
    standard: USP Quality Review, March 2001.
  • 582 unique drug names, 399 true confusion pairs,
    169,071 possible pairs (combinatorially induced)
  • Example (using DICE)
      0.889 atgam / ratgam
      0.875 herceptin / perceptin
    - 0.870 zolmitriptan / zolomitriptan
      0.857 quinidine / quinine
    - 0.857 cytosar / cytosar-u
      0.842 amantadine / rimantadine
    - 0.800 erythrocin / erythromycin
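
The evaluation setup can be sketched as: score every unordered pair of names, rank the pairs, and measure precision and recall of the top k against the gold-standard set. This is an illustrative sketch only. The helper names and the cutoff-based measure are assumptions, and any gold set passed in for a quick check is the caller's own toy data, not the USP list.

```python
from collections import Counter
from itertools import combinations


def dice(a: str, b: str) -> float:
    """Bigram DICE similarity (repeated bigrams counted as a multiset)."""
    ba = [a[i:i + 2] for i in range(len(a) - 1)]
    bb = [b[i:i + 2] for i in range(len(b) - 1)]
    if not ba and not bb:
        return 0.0
    shared = sum((Counter(ba) & Counter(bb)).values())
    return 2 * shared / (len(ba) + len(bb))


def rank_pairs(names, sim=dice):
    """All n*(n-1)/2 unordered pairs, best score first (ties by name)."""
    pairs = combinations(sorted(set(names)), 2)
    scored = [(sim(a, b), a, b) for a, b in pairs]
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


def precision_recall_at(ranked, gold, k):
    """Precision and recall of the top-k pairs; gold holds frozensets."""
    top = ranked[:k]
    hits = sum(1 for _, a, b in top if frozenset((a, b)) in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

For 582 names, `rank_pairs` would produce 582 × 581 / 2 = 169,071 pairs, the figure quoted above. Sweeping k from small to large traces out precision at different recall values.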

18
Comparison of Precision at Different Recall Values
19
Precision of Techniques with Phonetic Transcription
20
Conclusion
  • Experimentation with different algorithms and
    their combinations against the gold standard.
  • ALINE: Strong foundation for search modules in
    automating the minimization of medication errors
  • Fine-tuning based on comparisons with the gold
    standard (e.g., re-weighting of phonological
    features).
  • Related to pattern recognition: Discover patterns
    of predictable matches based on feature values