1
Applying Conditional Random Fields to Japanese
Morphological Analysis
  • Taku Kudo 1, Kaoru Yamamoto 2, Yuji Matsumoto
    1
  • 1 Nara Institute of Science and Technology
  • 2 CREST, Tokyo Institute of Technology
  • Currently, NTT Communication Science Labs.

2
Background
  • Conditional Random Fields [Lafferty 01]
  • A variant of Markov Random Fields
  • Many applications
  • POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
  • Japanese Morphological Analysis
  • Must cope with word segmentation
  • Must incorporate many features
  • Must minimize the influence of the length bias

3
Japanese Morphological Analysis
INPUT: 東京都に住む (I live in Metropolis of Tokyo.)
OUTPUT: 東京 / 都 / に / 住む
東京 (Tokyo) NOUN-PROPER-LOC-GENERAL
都 (Metro.) NOUN-SUFFIX-LOC
に (in) PARTICLE-GENERAL
住む (live) VERB BASE-FORM
  • word segmentation (no explicit spaces in Japanese)
  • POS tagging
  • lemmatization, stemming

4
Simple approach for JMA
  • Character-based begin / inside tagging
  • a non-standard method in JMA
  • cannot directly reflect lexicons
  • over 90% accuracy can be achieved using naïve
    longest prefix matching with a lexicon (see the sketch below)
  • decoding is slow

東 京 / 都 / に / 住 む
B  I    B    B    B  I
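
A minimal sketch of the naïve longest-prefix-matching baseline mentioned above. The toy lexicon, the function name, and the single-character fallback are illustrative assumptions, not the system described in the talk.

# Naive longest-prefix-matching segmentation (illustrative sketch).
# The lexicon below is a toy assumption, not the JUMAN lexicon used in the talk.
LEXICON = {"東京", "京都", "東", "京", "都", "に", "住む"}

def segment_longest_match(text, lexicon, max_len=8):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # try the longest candidate first
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character (unknown word)
        tokens.append(match)
        i += len(match)
    return tokens

print(segment_longest_match("東京都に住む", LEXICON))  # ['東京', '都', 'に', '住む']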
5
Our approach for JMA
  • Assume that a lexicon is available
  • word lattice
  • represents all candidate outputs
  • reduces redundant outputs
  • Unknown word processing
  • invoked when no matching word can be found in
    a lexicon
  • character types
  • e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc.

6
Problem Setting
lexicon: ? (particle, verb), ? (noun), ? (noun), ?? (noun), ?? (noun)
Input: 東京都に住む (I live in Metropolis of Tokyo)
Lattice (all candidate nodes between BOS and EOS):
  • 京都 (Kyoto) noun
  • に (in) particle
  • 東 (east) noun
  • 都 (Metro.) suffix
  • 住む (live) verb
  • 京 (capital) noun
  • ? (resemble) verb
  • 東京 (Tokyo) noun
NOTE: the number of tokens in a candidate output Y varies
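
A minimal sketch of how such a word lattice could be built from a lexicon by common-prefix lookup at every character position, with a single-character fallback standing in for unknown-word processing. The dictionary contents, the function name, and the "unknown" tag are illustrative assumptions.

# Sketch: build a word lattice from a lexicon by common-prefix lookup
# at every character position; toy data, not the actual JUMAN dictionary.
LEXICON = {
    "東京": "noun", "京都": "noun", "東": "noun", "京": "noun",
    "都": "suffix", "に": "particle", "住む": "verb",
}

def build_lattice(text, lexicon, max_len=8):
    """Return {start_pos: [(word, pos_tag), ...]} for all candidate nodes."""
    lattice = {i: [] for i in range(len(text))}
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            word = text[i:j]
            if word in lexicon:
                lattice[i].append((word, lexicon[word]))
        if not lattice[i]:
            # unknown-word processing: fall back to a single-character node
            lattice[i].append((text[i], "unknown"))
    return lattice

for pos, nodes in build_lattice("東京都に住む", LEXICON).items():
    print(pos, nodes)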
7
Long-standing Problems in JMA
8
Complex tagset
京都 (Kyoto): Noun - Proper - Loc - General - 京都 (Kyoto)
  • Hierarchical tagset
  • HMMs cannot capture it
  • How to select the hidden classes?
  • Top level → lack of granularity
  • Bottom level → data sparseness
  • Some functional particles should be lexicalized
  • Semi-automatic hidden class selection [Asahara 00]

9
Complex tagset, cont.
  • Must capture a variety of features

京都 (Kyoto) noun proper loc general 京都
に (in) particle general f f に
住む (live) verb independent f f 住む base-form
overlapping features
POS hierarchy
character types, prefixes, suffixes
lexicalization
inflections
These features are important for JMA
10
JMA with MEMMs [Uchimoto 00-03]
  • Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
  • sequential application of ME models

11
Problems of MEMMs
  • Label bias [Lafferty 01]

[Lattice diagram with transition probabilities: BOS → A (0.6), BOS → B (0.4); A → C (0.4), A → D (0.6); B → E (1.0); C, D, E → EOS (1.0)]

P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B, E | x) = 0.4 × 1.0 × 1.0 = 0.40
P(A, D | x) < P(B, E | x)
→ paths with low entropy are preferred
12
Problems of MEMMs in JMA
  • Length bias

[Lattice diagram with transition probabilities: BOS → A (0.6), BOS → B (0.4); A → C (0.4), A → D (0.6); C, D → EOS (1.0); B, a single long word, → EOS (1.0)]

P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B | x) = 0.4 × 1.0 = 0.40
P(A, D | x) < P(B | x)
→ long words are preferred; the length bias has been ignored in JMA!
13
Long-standing problems
  • must incorporate a variety of features
  • overlapping features, POS hierarchy,
    lexicalization, character-types
  • HMMs are not sufficient
  • must minimize the influence of length bias
  • another bias observed especially in JMA
  • MEMMs are not sufficient

14
Use of CRFs for JMA
15
CRFs for word lattice
CRFs encode a variety of unigram and bigram features along a path, e.g.:
  • BOS - noun (POS bigram)
  • noun - suffix (POS bigram)
  • noun / Tokyo (lexicalized unigram)
Global feature vector: F(Y, X) = ( ..., 1, ..., 1, ..., 1, ... )
Parameter vector: Λ = ( 3, 20, 20, ... )
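
In symbols, the model this slide builds up is the standard CRF over the lattice from [Lafferty 01]; the LaTeX below spells out notation not shown on the slide (Z_X for the normalizer, Y(X) for the set of candidate paths, f for a local feature vector over adjacent word/tag nodes) as an assumption, to make the feature-vector example above concrete.

P(Y \mid X) \;=\; \frac{1}{Z_X}\,\exp\bigl(\Lambda \cdot F(Y, X)\bigr),
\qquad
Z_X \;=\; \sum_{Y' \in \mathcal{Y}(X)} \exp\bigl(\Lambda \cdot F(Y', X)\bigr),
\qquad
F(Y, X) \;=\; \sum_{i} f\bigl(\langle w_{i-1}, t_{i-1}\rangle,\, \langle w_i, t_i\rangle\bigr)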
16
CRFs for word lattice, cont.
  • a single exponential model over all paths in the lattice
  • fewer restrictions on feature design
  • can incorporate a variety of features
  • can solve the problems of HMMs

17
Encoding
  • Maximum likelihood estimation
  • all candidate paths are taken into account in encoding (training)
  • the influence of the length bias is minimized
  • can solve the problems of MEMMs
  • A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice
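
For reference, a LaTeX sketch of the maximum-likelihood objective this slide refers to, with the gradient whose expectation term is what the lattice Forward-Backward computes; the notation follows the formula sketched after slide 15 and is an assumption rather than a quote from the slides.

L_\Lambda \;=\; \sum_{j} \Bigl( \Lambda \cdot F(Y_j, X_j) \;-\; \log Z_{X_j} \Bigr),
\qquad
\frac{\partial L_\Lambda}{\partial \lambda_k} \;=\; \sum_{j} \Bigl( F_k(Y_j, X_j) \;-\; \mathbb{E}_{P_\Lambda(Y \mid X_j)}\bigl[F_k(Y, X_j)\bigr] \Bigr)

Because every candidate path in the lattice contributes to Z_{X_j} and to the expectation, segmentations of all lengths compete on an equal footing, which is why the length bias is reduced.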

18
MAP estimation
  • L2-CRF (Gaussian prior)
  • non-sparse solution (all features have non-zero weights)
  • good if most given features are relevant
  • unconstrained optimizers, e.g., L-BFGS, are used
  • L1-CRF (Laplacian prior)
  • sparse solution (most features have zero weight)
  • good if most given features are irrelevant
  • constrained optimizers, e.g., L-BFGS-B, are used
  • C is a hyper-parameter (see the objectives sketched below)
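
A LaTeX sketch of the two regularized (MAP) objectives behind these bullets, written in the form commonly used for L2-/L1-CRFs; the exact placement of the hyper-parameter C is an assumption and may differ from the original slides.

\hat{\Lambda}_{L2} \;=\; \arg\max_{\Lambda} \; \sum_{j} \log P_\Lambda(Y_j \mid X_j) \;-\; \frac{\lVert \Lambda \rVert_2^2}{2C},
\qquad
\hat{\Lambda}_{L1} \;=\; \arg\max_{\Lambda} \; \sum_{j} \log P_\Lambda(Y_j \mid X_j) \;-\; \frac{\lVert \Lambda \rVert_1}{C}

The L2 penalty is differentiable everywhere, so an unconstrained optimizer such as L-BFGS applies directly; the L1 penalty is not differentiable at zero, which is why a constrained variant such as L-BFGS-B is used and why many weights end up exactly zero.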

19
Decoding
  • Viterbi algorithm
  • essentially the same architecture as HMMs and MEMMs (a sketch over the word lattice follows below)
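
A minimal sketch of Viterbi decoding over the word lattice built earlier (it reuses build_lattice and LEXICON from the sketch after slide 6). The toy_score function is only a stand-in for the dot product Λ·f over unigram and bigram features; all names and weights here are illustrative assumptions, not the trained model.

# Sketch: first-order Viterbi search over a word lattice (toy scores only).
def viterbi(text, lattice, score):
    """lattice: {start: [(word, tag), ...]};  score(prev_node, node) -> float.
    Keeps the best partial score per (end position, last node), as a model
    with bigram features requires, then backtracks the single best path."""
    BOS, EOS = ("<BOS>", "BOS"), ("<EOS>", "EOS")
    n = len(text)
    best = {(0, BOS): 0.0}   # (end position, last node) -> best score so far
    back = {}                # back-pointers for path recovery
    for i in range(n + 1):
        states = [(key, val) for key, val in best.items() if key[0] == i]
        for (pos, prev), prev_score in states:
            nodes = lattice.get(i, []) if i < n else [EOS]
            for node in nodes:
                j = i + len(node[0]) if node is not EOS else n
                cand = prev_score + score(prev, node)
                if best.get((j, node), float("-inf")) < cand:
                    best[(j, node)] = cand
                    back[(j, node)] = (i, prev)
    path, state = [], (n, EOS)
    while state in back:                 # walk back from EOS to BOS
        path.append(state[1])
        state = back[state]
    return list(reversed(path))[:-1]     # drop the EOS marker

def toy_score(prev, node):
    """Stand-in for the dot product of weights and features for one transition."""
    s = 2.0 if node[0] in ("東京", "住む") else 0.0                 # lexicalized unigrams
    s += 1.0 if (prev[1], node[1]) == ("noun", "suffix") else 0.0   # POS bigram
    s -= 1.0 if node[1] == "unknown" else 0.0                       # unknown-word penalty
    return s

sentence = "東京都に住む"
print(viterbi(sentence, build_lattice(sentence, LEXICON), toy_score))
# best path: [('東京', 'noun'), ('都', 'suffix'), ('に', 'particle'), ('住む', 'verb')]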

20
Experiments
21
Data
KC and RWCP, widely-used Japanese annotated corpora
KC statistics:
source: Mainichi News Articles '95
lexicon (size): JUMAN 3.61 (1,983,173)
POS structure: 2-level POS, c-form, c-type, base form
training sentences: 7,958
training tokens: 198,514
test sentences: 1,246
test tokens: 31,302
features: 791,798
22
Features
京都 (Kyoto) noun proper loc general 京都
に (in) particle general f f に
住む (live) verb independent f f 住む base-form
overlapping features
POS hierarchy
character types, prefixes, suffixes
lexicalization
inflections
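
A minimal sketch of how unigram feature strings of the kinds listed above (POS hierarchy levels, lexicalization, character type, inflection) might be generated for one lattice node. The feature-string format and the char_type helper are illustrative assumptions, not the exact templates used in the experiments.

# Sketch: generate unigram feature strings for one lattice node (toy templates).
import unicodedata

def char_type(ch):
    """Rough character-type bucket based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    return "digit" if ch.isdigit() else "other"

def unigram_features(word, pos_path, inflection=None):
    """pos_path: hierarchical POS, e.g. ('noun', 'proper', 'loc', 'general')."""
    feats = []
    for depth in range(1, len(pos_path) + 1):          # POS hierarchy prefixes
        feats.append("pos%d=%s" % (depth, "-".join(pos_path[:depth])))
    feats.append("word=" + word)                        # lexicalization
    feats.append("ctype=" + char_type(word[0]))         # character type
    if inflection:
        feats.append("inf=" + inflection)               # inflection form
    return feats

print(unigram_features("京都", ("noun", "proper", "loc", "general")))
print(unigram_features("住む", ("verb", "independent"), inflection="base-form"))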
23
Evaluation
F = 2 × recall × precision / (recall + precision)
recall = # correct tokens / # tokens in test corpus
precision = # correct tokens / # tokens in system output
  • three criteria of correctness
  • seg: word segmentation only
  • top: word segmentation + top level of POS
  • all: all information

24
Results
Significance tests: McNemar's paired test on the labeling disagreements
F-measure (%):
          seg      top      all
L2-CRFs   98.96    98.31    96.75
L1-CRFs   98.80    98.14    96.55
HMMs      96.22    94.99    91.85
MEMMs     96.44    95.81    94.28
  • L1/L2-CRFs outperform HMMs and MEMMs
  • L2-CRFs outperform L1-CRFs

25
Influence of the length bias
           long word err.   short word err.
HMMs       306 (44%)        387 (56%)
L2-CRFs     79 (40%)        120 (60%)
MEMMs      416 (70%)        183 (30%)
  • HMMs, CRFs: the relative ratios of the two error types are not much different
  • MEMMs: the ratio of long-word errors is large
    → influenced by the length bias

26
L1-CRFs vs. L2-CRFs
  • L2-CRFs > L1-CRFs
  • most given features are relevant (POS hierarchies, suffixes/prefixes, character types)
  • L1-CRFs produce a compact model with a small number of active features
  • active features: L2: 791,798 vs. L1: 90,163 (about 11%)
  • L1-CRFs are worth examining if there exist practical constraints

27
Conclusions
  • An application of CRFs to JMA
  • Uses a word lattice with a lexicon rather than character-based begin / inside tags
  • CRFs offer an elegant solution to the problems
    with HMMs and MEMMs
  • can use a wide variety of features
  • (hierarchical POS tags, inflections,
    character types, etc)
  • can minimize the influence of the length bias
    (length bias has been ignored in JMA!)

28
Future work
  • Tri-gram features
  • Use of all tri-grams is impractical as they make decoding significantly slower
  • a practical feature selection is needed, e.g., [McCallum 03]
  • Apply to other non-segmented languages
  • e.g., Chinese or Thai

29
CRFs encoding
  • A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice

30
Influence of the length bias, cont.
MEMMs select
???? romanticist
? sea
? particle
??? bet
??? romance
? particle
The romance on the sea they bet is
MEMMs select
??? one's heart
?? rough waves
? particle
?? loose
?? not
? heart
A heart which beats rough waves is
  • caused by the influence of the length bias rather than the label bias
  • (CRFs can correctly analyze these sentences)

31
Cause of label and length bias
  • MEMMs use only the correct path in encoding (training)
  • transition probabilities of unobserved paths are distributed uniformly