Support Vector Machine Based Orthographic Disambiguation

About This Presentation

Description:

Support Vector Machine Based Orthographic Disambiguation, by Eiji ARAMAKI, Takeshi IMAI, Kengo MIYO and Kazuhiko OHE (Hospital). Are "center" and "centre" equivalent?


Transcript and Presenter's Notes

1
Support Vector Machine Based Orthographic Disambiguation
Eiji ARAMAKI, Takeshi IMAI, Kengo MIYO, Kazuhiko OHE
Hospital
Are "center" and "centre" equivalent?
We focus on Japanese, but the proposed method does not depend on the language.
2
Background

Japanese in particular contains a great deal of orthographic variation, because it has so many transliterations.

Avogadro -> (transliteration) -> アボガドロ (A BO GA DO RO)
Avogadro -> (transliteration) -> アヴォガドロ (A VO GA DO RO)
Equivalent or not?

SVM-based classifier
(1) Build training sets
(2) Define features
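To make the classifier step concrete, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss (a Pegasos-style update). This is a swapped-in illustration, not the authors' implementation: the slides do not say which SVM package or kernel was used. The function names, the `lam` and `epochs` parameters, the feature-name strings and the toy pairs are all hypothetical.

```python
def train_linear_svm(examples, lam=0.01, epochs=20):
    """Train a bias-free linear SVM on sparse binary features.

    examples: list of (feature_set, label) where label is +1 or -1.
    lam and epochs are illustrative defaults, not values from the slides.
    """
    weights = {}
    t = 0
    for _ in range(epochs):
        for features, label in examples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = label * sum(weights.get(f, 0.0) for f in features)
            # Shrink every weight (gradient of the L2 regulariser).
            for f in weights:
                weights[f] *= 1.0 - eta * lam
            # Hinge-loss subgradient: only examples inside the margin contribute.
            if margin < 1.0:
                for f in features:
                    weights[f] = weights.get(f, 0.0) + eta * label
    return weights


def predict(weights, features):
    """Return +1 (equivalent) or -1 (not equivalent)."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1


# Toy training set: one positive and one negative pair, already turned
# into hypothetical feature names.
train = [
    ({"diff=er>re", "pre=t|diff=er>re"}, +1),       # center / centre
    ({"diff=e>o", "pre=l|diff=e>o|post=a"}, -1),    # lead / load
]
weights = train_linear_svm(train)
```

Calling `predict(weights, {"diff=er>re"})` classifies a new pair that shows the same spelling change as the positive example. A pair with only unseen features scores 0 and falls back to -1.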
3
(1) Training set
Term pairs are collected from multiple translation dictionaries.

Positive example: a pair of terms that are spelled differently but have the same meaning (same English translation).
Negative example: a pair of terms that are spelled differently and have different meanings (different English translation).
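The pairing rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_pairs`, the `(term, english)` input format and the toy entries are hypothetical. The sketch also treats every pair without a shared translation as a negative example, whereas the slides do not say whether negatives were filtered further.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(entries):
    """Split term pairs into positive and negative examples.

    entries: iterable of (term, english_translation) tuples pooled from
    several translation dictionaries (hypothetical input format).
    """
    translations = defaultdict(set)
    for term, english in entries:
        translations[term].add(english.strip().lower())

    positives, negatives = [], []
    for t1, t2 in combinations(sorted(translations), 2):
        if translations[t1] & translations[t2]:
            positives.append((t1, t2))   # same English translation
        else:
            negatives.append((t1, t2))   # different English translation
    return positives, negatives


entries = [
    ("アボガドロ", "Avogadro"),
    ("アヴォガドロ", "Avogadro"),
    ("アボカド", "avocado"),
]
positives, negatives = build_pairs(entries)
```

Here the two transliterations of "Avogadro" form the only positive pair, and each of them paired with the "avocado" term gives a negative pair.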
4
(2) Features for SVM

The differing characters and their surrounding characters (window size 1: pre-context and post-context).

Example:
term1: ア ヴ ォ ガ ド ロ
term2: ア ボ ガ ド ロ
Pre-context: ア
Diff.: ヴォ→ボ
Post-context: ガ

term1: アヴォガドロ
term2: アボガドロ
label: True
features: (ア, ヴォ→ボ, ガ) = 1, (ア, ヴォ→ボ) = 1, (ヴォ→ボ, ガ) = 1, (ヴォ→ボ) = 1

Combinations of these features are also used.
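A minimal sketch of how such diff-and-context features might be extracted. It assumes the two terms differ in a single contiguous span, which it locates by stripping the common prefix and suffix. The slides do not say how the differing characters are found, so that choice, the function name `diff_features` and the feature-name format are all hypothetical.

```python
def diff_features(term1, term2, window=1):
    """Return the diff span plus its pre/post context as feature names."""
    shortest = min(len(term1), len(term2))

    # Length of the common prefix.
    p = 0
    while p < shortest and term1[p] == term2[p]:
        p += 1

    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < shortest - p and term1[len(term1) - 1 - s] == term2[len(term2) - 1 - s]:
        s += 1

    end1, end2 = len(term1) - s, len(term2) - s
    change = f"{term1[p:end1]}→{term2[p:end2]}"   # differing characters
    pre = term1[max(0, p - window):p]              # pre-context
    post = term1[end1:end1 + window]               # post-context

    return {
        f"pre={pre}|diff={change}|post={post}",
        f"pre={pre}|diff={change}",
        f"diff={change}|post={post}",
        f"diff={change}",
    }


features = diff_features("アヴォガドロ", "アボガドロ")
```

For this pair the diff is ヴォ→ボ with pre-context ア and post-context ガ, giving four binary features. Each feature present in the returned set corresponds to a value of 1 in the SVM input.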

5
Experiments

Test set: 883 medical term pairs (312 positive)
Methods
(1) EDIT DISTANCE (th): if SIM > th, then output 1
(2) BYHAND: SVM trained on a handmade training set of 4,130 pairs
(3) AUTOMATIC: SVM trained on an automatically built training set of 68,608 pairs
(4) COMBINATION: SVM trained on the full training set (BYHAND + AUTOMATIC)
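The EDIT DISTANCE baseline can be sketched like this. The slides do not define SIM, so the sketch assumes one common choice: 1 minus the Levenshtein distance divided by the length of the longer term. The function names and the threshold values in the usage lines are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed SIM: 1 - distance / length of the longer term."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def edit_distance_baseline(a, b, th):
    """Output 1 (equivalent) if SIM > th, otherwise 0."""
    return 1 if edit_similarity(a, b) > th else 0


loose = edit_distance_baseline("center", "centre", th=0.5)
strict = edit_distance_baseline("center", "centre", th=0.8)
```

"center" and "centre" are two substitutions apart, so their assumed similarity is 1 - 2/6. The pair is accepted at th = 0.5 and rejected at th = 0.8, which shows how sensitive this baseline is to the threshold.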
Evaluation

Results

6
Conclusion

Discussion
Why is AUTOMATIC < BYHAND?
Because of corpus-specific styles (e.g. "hepatitis-B" vs. "HepatitisB").
BYHAND corpus = test-set corpus ≠ AUTOMATIC corpus

Conclusion
We proposed a discriminative orthographic disambiguation method.
We also proposed a method for collecting both positive and negative examples.
Experimental results yielded a high level of accuracy (87.8%), demonstrating the feasibility of the proposed approach.

Unfortunately, Bergsma (ACL 2007) proposed similar methods.
In the future, we will employ more features to boost the accuracy.
