Title: Computational Linguistic Techniques Applied to Drugname Matching
1. Computational Linguistic Techniques Applied to Drugname Matching
Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
June 26, 2003
2. Drugname Matching
- String matching to rank similarity between drug names
- Two classes of string matching
  - orthographic: Compare strings in terms of spelling without reference to sound
  - phonological: Compare strings on the basis of a phonetic representation
- Two methods of matching
  - distance: How far apart are two strings?
  - similarity: How close are two strings?
3. Distance and Similarity Measures: Orthographic/Phonological
- Orthographic
  - Distance: string-edit
    - Ex: contac / zantac: 2/6 = 0.33
  - Similarity: LCSR, DICE
    - Ex (LCSR): contac / zantac: 4/6 = 0.67
    - Ex (DICE): co on nt ta ac / za an nt ta ac: 6/10 = 0.60
- Phonological
  - Distance: Soundex
    - Ex: contac / zantac: 1/4 = 0.25
  - Similarity: ALINE
    - Ex: contac / zantac: 0.64
4. Distance vs. Similarity: Examples
- Example 1: hordes vs. lords
  - Distance: 2 (replace h with l, and delete e).
  - Similarity: 2 (bigrams or and rd in common).
- Example 2: water vs. wine
  - Distance: 3 (replace a with i, t with n, delete r).
  - Similarity: 0 (no bigrams in common).
- We can compare (global) similarity and distance:
  - sim(w1,w2)/length
  - 1 - dist(w1,w2)/length
5. Orthographic Distance: string-edit
- Count up the number of steps it takes to transform one string into another
- Examples
  - Distance between hordes and lords is 2.
  - Distance between water and wine is 3.
- For global distance, we can divide by length of longest string: 2/6 and 3/5 above
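The step counting described on this slide can be sketched as a standard dynamic program. This is a minimal illustration, assuming unit cost for each insertion, deletion, and substitution (Levenshtein distance); the function names are chosen here and do not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def global_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"), edit_distance("water", "wine"))
```

Under these assumptions the two slide examples give distances 2 and 3, and global distances 2/6 and 3/5.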
6. Orthographic Similarity: LCSR, DICE
- LCSR: Divide length of longest common subsequence by length of longest string
  - Example: reagir and repair have longest common subsequence reair. Similarity score: 5/max(6,6) = 5/6 = 0.83
- DICE: Double the number of shared character bigrams and divide by total number of bigrams in both strings
  - Example: reagir and repair have bigram sets {re, ea, ag, gi, ir} and {re, ep, pa, ai, ir}, respectively, and shared bigrams are {re, ir}. Similarity score: (2 × 2)/(5 + 5) = 2/5 = 0.40
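Both measures on this slide are short to compute. The sketch below follows the definitions as stated here (LCS length over the longer string; doubled shared bigrams over the total bigram count). One assumption is labeled in the code: repeated bigrams are counted as a multiset, which the slide does not specify. Function names are illustrative.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> list:
    """Adjacent character pairs, e.g. 'contac' -> co, on, nt, ta, ac."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the shared bigrams over the total bigrams of both strings.
    Assumption: repeated bigrams are counted as a multiset."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    shared = sum((ba & bb).values())
    return 2 * shared / total if total else 0.0


print(round(lcsr("reagir", "repair"), 2), round(dice("reagir", "repair"), 2))
```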
7. Phonological Matching
- Distance-based phonological matching
  - Soundex
- Similarity-based phonological matching
  - ALINE
8. Phonological Distance
- Soundex: Examples
  - king and khyngge reduce to k52
  - knight and night reduce to k523 and n23
  - pulpit and phlebotomy reduce to p413
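The reductions above can be reproduced with a short sketch of the standard American Soundex rules. It returns the conventional four-character form (upper-case initial, zero-padded), so king becomes K520 where the slide writes k52. The `soundex_distance` helper is an assumption made here: it takes the 1/4 figure quoted earlier for contac/zantac to be the fraction of mismatched code positions.

```python
# Consonant classes of standard Soundex; vowels, h, w, and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Four-character Soundex code, e.g. 'knight' -> 'K523'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != last:      # adjacent letters with one code collapse
                code += digit
            last = digit
        elif ch not in "hw":       # vowels break a run; h and w do not
            last = ""
        if len(code) == 4:         # truncate to four characters
            break
    return code.ljust(4, "0")


def soundex_distance(a: str, b: str) -> float:
    """Assumed reading of the slides: fraction of code positions that differ."""
    return sum(x != y for x, y in zip(soundex(a), soundex(b))) / 4


print(soundex("pulpit"), soundex("phlebotomy"), soundex_distance("contac", "zantac"))
```

The sketch makes the weakness visible: pulpit and phlebotomy receive the same code, while knight and night do not.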
9. What went wrong?
- Truncation of word to four characters
  - Alternative: Use entire string
- Ignoring vowels
  - Use more sophisticated phonetic rules
- Using numbers instead of decomposable features
  - Use decomposable features
10. Phonological Similarity
- Another possible approach: Compare syllable count, initial/final sounds, stress locations
  - Misses frequently confused pairs
- Alternative: Use phonological features to compare two words by their sounds.
  - x → k(s): consonantal, velar, stop, -voice
  - x → z: consonantal, alveolar, fricative, +voice
- Phonological similarity of two words: Optimal match between their phonological features.
  - Zantac
  - Xanax
11. Kondrak: ALINE (2000)
- Two fundamental components of ALINE
  - Similarity function: Uses linguistic feature analysis, with measurements based on salience, e.g., alveolar and stop more salient than voice
  - Method for choosing optimal alignment: Creates alignment based on a weighted multi-feature analysis
- Designed to align phonetic sequences for many different CL applications
- Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
- Feature weights can be fine-tuned for a specific application.
- Efficient: Dynamic programming algorithm (quadratic)
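The two components named on this slide, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a toy sketch. This is not ALINE itself. The feature table, the salience weights, the maximum score, and the skip penalty are all hypothetical values chosen for illustration, and the alignment is a plain global alignment rather than ALINE's own procedure.

```python
# Hypothetical toy feature table for four sounds (NOT ALINE's inventory).
FEATURES = {
    "p": {"place": 1.0, "manner": 1.0, "voice": 0.0},
    "b": {"place": 1.0, "manner": 1.0, "voice": 1.0},
    "f": {"place": 0.9, "manner": 0.8, "voice": 0.0},
    "v": {"place": 0.9, "manner": 0.8, "voice": 1.0},
}
# Hypothetical salience weights: place and manner outweigh voice.
SALIENCE = {"place": 40.0, "manner": 50.0, "voice": 10.0}
MAX_SCORE = 100.0  # hypothetical score for two identical sounds


def sim(p: str, q: str) -> float:
    """Maximum score minus the salience-weighted feature differences."""
    fp, fq = FEATURES[p], FEATURES[q]
    return MAX_SCORE - sum(w * abs(fp[f] - fq[f]) for f, w in SALIENCE.items())


def align_score(a: str, b: str, skip: float = -10.0) -> float:
    """Best global alignment score; time is O(len(a) * len(b))."""
    prev = [j * skip for j in range(len(b) + 1)]
    for i, pa in enumerate(a, start=1):
        curr = [i * skip]
        for j, pb in enumerate(b, start=1):
            curr.append(max(prev[j] + skip,               # skip pa
                            curr[j - 1] + skip,           # skip pb
                            prev[j - 1] + sim(pa, pb)))   # align pa with pb
        prev = curr
    return prev[-1]


print(sim("p", "b"), align_score("pb", "p"))
```

With these made-up weights a voicing difference (p/b) costs less than a place-and-manner difference (p/f), which mirrors the salience ordering described on the slide. The nested loop over both sequences is what makes the procedure quadratic.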
12. ALINE Features: Weights and Values
13. Places of Articulation: Numerical Values
14. Manner of Articulation: Numerical Values
- stop: 1.0 (Example: p, b)
- affricate: 0.9 (Example: th)
- fricative: 0.8 (Example: f, v)
15. Tuning of ALINE Parameters
- Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
- Parameter tuning
  - calculate weights for drugname matching
  - Hill Climbing search against gold standard
- Tuned parameters for drugname task
  - maximum score
  - insertion/deletion penalty
  - vowel penalty
  - phonological feature values
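A hill-climbing search of the kind mentioned on this slide can be sketched generically. The slides do not give the step size, the move set, or the objective, so everything below is an assumption: one parameter is nudged at a time and a change is kept only if the score improves. `toy_objective` is a hypothetical stand-in for a score against the gold standard, and the parameter names are borrowed from the list above purely as labels.

```python
import random


def hill_climb(params, objective, step=0.05, iterations=200, seed=0):
    """Greedy local search: perturb one parameter at a time and keep the
    change only when the objective strictly improves."""
    rng = random.Random(seed)
    best = dict(params)            # work on a copy of the caller's dict
    best_score = objective(best)
    names = sorted(best)
    for _ in range(iterations):
        candidate = dict(best)
        name = rng.choice(names)
        candidate[name] += rng.choice((-step, step))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_objective(p):
    """Hypothetical stand-in for evaluation against a gold standard:
    peaks at skip_penalty = 0.5, vowel_penalty = 0.2."""
    return -((p["skip_penalty"] - 0.5) ** 2 + (p["vowel_penalty"] - 0.2) ** 2)


start = {"skip_penalty": 0.0, "vowel_penalty": 0.0}
tuned, tuned_score = hill_climb(start, toy_objective)
print(tuned, tuned_score)
```

In a real tuning run the objective would score the matcher's output against the gold-standard confusion pairs. Like any greedy search, this can stop at a local optimum.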
16. Comparison of Outputs
- ALINE
  - 0.792 zantac xanax
  - 0.639 zantac contac
  - 0.486 xanax contac
- EDIT
  - 0.500 zantac xanax
  - 0.667 zantac contac
  - 0.333 xanax contac
- LCSR
  - 0.545 zantac xanax
  - 0.667 zantac contac
  - 0.364 xanax contac
- DICE
  - 0.222 zantac xanax
  - 0.600 zantac contac
  - 0.000 xanax contac
17. Evaluation
- Precision and recall against online gold standard: USP Quality Review, March 2001.
  - 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced)
- Example (using DICE)
  - 0.889 atgam ratgam
  - 0.875 herceptin perceptin
  - 0.870 zolmitriptan zolomitriptan
  - 0.857 quinidine quinine
  - 0.857 cytosar cytosar-u
  - 0.842 amantadine rimantadine
  - 0.800 erythrocin erythromycin
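The pair count on this slide is the number of unordered pairs among 582 names, which is easy to check. The sketch below also shows one possible way to compute precision at a given recall level from a ranked list of candidate pairs. The function name and the toy pairs are hypothetical, and the slides do not state how ties or recall thresholds were handled.

```python
from math import ceil, comb

# 582 names taken two at a time gives the number of possible pairs.
print(comb(582, 2))


def precision_at_recall(ranked_pairs, gold_pairs, recall_level):
    """Walk down a ranked list (best score first) and return the precision
    at the first rank where the requested share of gold pairs is found."""
    gold = {frozenset(p) for p in gold_pairs}   # order within a pair is ignored
    needed = max(1, ceil(recall_level * len(gold) - 1e-9))
    found = 0
    for rank, pair in enumerate(ranked_pairs, start=1):
        if frozenset(pair) in gold:
            found += 1
            if found >= needed:
                return found / rank
    return 0.0  # the requested recall is never reached


# Made-up toy data, not taken from the gold standard.
ranked = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
gold = [("b", "a"), ("e", "f")]
print(precision_at_recall(ranked, gold, 0.5), precision_at_recall(ranked, gold, 1.0))
```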
18. Comparison of Precision at Different Recall Values
19. Precision of Techniques with Phonetic Transcription
20. Conclusion
- Experimentation with different algorithms and their combinations against gold standard.
- ALINE: Strong foundation for search modules in automating the minimization of medication errors
- Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features).
- Related to pattern recognition: Discover patterns of predictable matches based on feature values