Unsupervised Morphological Segmentation With Log-Linear Models

1
Unsupervised Morphological Segmentation With
Log-Linear Models
  • Hoifung Poon
  • University of Washington
  • Joint Work with
  • Colin Cherry and Kristina Toutanova

2
Machine Learning in NLP
3
Machine Learning in NLP
Unsupervised Learning
4
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
?
5
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
Little work except for a couple of cases
6
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
?
Global Features
7
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
???
Global Features
8
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
We developed a method for Unsupervised
Learning of Log-Linear Models with Global Features
Global Features
9
Machine Learning in NLP
Unsupervised Learning
Log-Linear Models
We applied it to morphological segmentation and
reduced F1 error by 10-50% compared to the state
of the art
Global Features
10
Outline
  • Morphological segmentation
  • Our model
  • Learning and inference algorithms
  • Experimental results
  • Conclusion

16
Morphological Segmentation
  • Breaks words into morphemes
  • governments → govern-ment-s
  • lmpxtm → l-mpx-t-m
  • (according to their families)
  • Key component in many NLP applications
  • Particularly important for morphologically-rich
    languages (e.g., Arabic, Hebrew, ...)

17
Why Unsupervised Learning?
  • Text: unlimited supplies in any language
  • Segmentation labels?
    • Only for a few languages
    • Expensive to acquire

18
Why Log-Linear Models?
  • Can incorporate arbitrary overlapping features
  • E.g., Al-rb (the lord)
  • Morpheme features
    • Substrings Al, rb are likely morphemes
    • Substrings Alr, lrb are not likely morphemes
    • Etc.
  • Context features
    • Substrings between Al and the end of the word
      are likely morphemes
    • Substrings between lr and the end of the word
      are not likely morphemes
    • Etc.

19
Why Global Features?
  • Words can inform each other on segmentation
  • E.g., Al-rb (the lord), l-Al-rb (to the
    lord)

20
State of the Art in Unsupervised Morphological
Segmentation
  • Use directed graphical models
  • Morfessor [Creutz & Lagus 2007]
    • Hidden Markov Model (HMM)
  • Goldwater et al. [2006]
    • Based on Pitman-Yor processes
  • Snyder & Barzilay [2008a, 2008b]
    • Based on Dirichlet processes
    • Uses bilingual information to help segmentation
    • Phrasal alignment
    • Prior knowledge on phonetic correspondence
    • E.g., Hebrew w ↔ Arabic w, f ...

21
Unsupervised Learning with Log-Linear Models
  • Few approaches exist to date
  • Contrastive estimation [Smith & Eisner 2005]
  • Sampling [Poon & Domingos 2008]

22
This Talk
  • First log-linear model for unsupervised
    morphological segmentation
  • Combines contrastive estimation with sampling
  • Achieves state-of-the-art results
  • Can apply to semi-supervised learning

23
Outline
  • Morphological segmentation
  • Our model
  • Learning and inference algorithms
  • Experimental results
  • Conclusion

25
Log-Linear Model
  • State variables x ∈ X
  • Features f_i : X → R
  • Weights λ_i
  • Defines probability distribution over the states
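
Written out in the standard log-linear form, with \lambda_i denoting the weight of feature f_i and Z the normalizer over the state set X, the distribution these definitions describe is:

```latex
P(x) = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(x) \Big),
\qquad
Z = \sum_{x' \in X} \exp\Big( \sum_i \lambda_i f_i(x') \Big)
```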

26
States for Unsupervised Morphological Segmentation
  • Words
    • wvlAvwn, Alrb, ...
  • Segmentation
    • w-vlAv-wn, Al-rb, ...
  • Induced lexicon (unique morphemes)
    • w, vlAv, wn, Al, rb

27
Features for Unsupervised Morphological
Segmentation
  • Morphemes and contexts
  • Exponential priors on model complexity

28
Morphemes and Contexts
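  • Count number of occurrences
  • Inspired by CCM [Klein & Manning, 2001]
  • E.g., w-vlAv-wn
  • Feature examples, written as string (left context_right context);
    an empty side is a word boundary
    • wvlAvwn (_)
    • vlAv (w_wn)
    • wn (Av_)
    • w (_vl)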
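
A minimal sketch of how morpheme and context counts could be collected for one segmented word. The function name, the context width `n`, and the `#` padding symbol for word boundaries are assumptions made for illustration, not details confirmed by the slides.

```python
from collections import Counter

def extract_features(segmented_word, n=2, pad="#"):
    """Count morpheme and context features for one segmented word.

    segmented_word: list of morphemes, e.g. ["w", "vlAv", "wn"].
    n: characters kept on each side as context (assumed width).
    pad: word-boundary padding symbol (an assumption).
    """
    word = "".join(segmented_word)
    padded = pad * n + word + pad * n
    morphemes, contexts = Counter(), Counter()
    start = n  # position of the current morpheme inside the padded string
    for m in segmented_word:
        end = start + len(m)
        left = padded[start - n:start]   # n characters before the morpheme
        right = padded[end:end + n]      # n characters after the morpheme
        morphemes[m] += 1
        contexts[(left, right)] += 1
        start = end
    return morphemes, contexts

morphemes, contexts = extract_features(["w", "vlAv", "wn"])
print(morphemes)
print(contexts)
```

With `n=2`, the morpheme vlAv gets the context pair `("#w", "wn")`, matching the (w_wn) example once boundary padding is ignored.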
29
Complexity-Based Priors
  • Lexicon prior (weight α)
    • On lexicon length (total number of characters)
    • Favor fewer and shorter morpheme types
  • Corpus prior (weight β)
    • On number of morphemes (normalized by word
      length)
    • Favor fewer morpheme tokens
  • E.g., l-Al-rb, Al-rb
    • Lexicon l, Al, rb → α · 5
    • l-Al-rb → β · 3/5
    • Al-rb → β · 2/4
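
The two prior quantities in the example can be checked with a short sketch. The function names are hypothetical, and the prior weights are left out so that only the raw feature values (lexicon length and length-normalized morpheme counts) are computed.

```python
from fractions import Fraction

def lexicon_length(segmentations):
    """Total number of characters over the unique morphemes (the induced lexicon)."""
    lexicon = {m for seg in segmentations for m in seg}
    return sum(len(m) for m in lexicon)

def corpus_prior_feature(segmentations):
    """Sum over words of (number of morphemes) / (word length in characters)."""
    return sum(
        Fraction(len(seg), sum(len(m) for m in seg)) for seg in segmentations
    )

corpus = [["l", "Al", "rb"], ["Al", "rb"]]
print(lexicon_length(corpus))        # 5  (l + Al + rb = 1 + 2 + 2)
print(corpus_prior_feature(corpus))  # 11/10  (3/5 + 2/4)
```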

32
Lexicon Prior Is a Global Feature
  • Renders words interdependent in segmentation
  • E.g., lAlrb, Alrb
  • lAlrb → l-Al-rb
  • Alrb → Al-rb

34
Lexicon Prior Is a Global Feature
  • Renders words interdependent in segmentation
  • E.g., lAlrb, Alrb
  • lAlrb → l-Alrb
  • Alrb → Alrb
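
One way to see the interdependence is to compute the lexicon length for every joint choice of segmentations for the two words. This is an illustrative sketch with hypothetical names; it looks at the lexicon length only and ignores all other features.

```python
from itertools import product

def lexicon_length(segmentations):
    """Total number of characters over the unique morphemes."""
    lexicon = {m for seg in segmentations for m in seg}
    return sum(len(m) for m in lexicon)

options = {
    "lAlrb": [["l", "Al", "rb"], ["l", "Alrb"]],
    "Alrb": [["Al", "rb"], ["Alrb"]],
}

results = {}
for seg1, seg2 in product(options["lAlrb"], options["Alrb"]):
    key = ("-".join(seg1), "-".join(seg2))
    results[key] = lexicon_length([seg1, seg2])
    print(key, results[key])
```

The two consistent pairs (l-Al-rb with Al-rb, and l-Alrb with Alrb) give a lexicon of 5 characters, while the two mixed pairs give 9, so the best choice for one word depends on the choice made for the other.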

35
Probability Distribution
  • For corpus W and segmentation S
  • [Equation: P(W, S), with four labeled terms: morphemes, contexts, lexicon prior, corpus prior]
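
A hedged reconstruction of the four-term distribution, assembled from the feature and prior descriptions on the earlier slides. The notation is an assumption: f_\sigma(S) and f_c(S) count occurrences of morpheme \sigma and context c, Lex(S) is the induced lexicon, len is length in characters, N(w, S) is the number of morphemes of word w under S, and \alpha, \beta are the prior weights (negative values would favor smaller lexicons and fewer morphemes).

```latex
P(W, S) = \frac{1}{Z} \exp\Big(
    \sum_{\sigma} \lambda_{\sigma} f_{\sigma}(S)                     % morphemes
  + \sum_{c} \lambda_{c} f_{c}(S)                                    % contexts
  + \alpha \sum_{\sigma \in \mathrm{Lex}(S)} \mathrm{len}(\sigma)    % lexicon prior
  + \beta \sum_{w \in W} \frac{N(w, S)}{\mathrm{len}(w)}             % corpus prior
\Big)
```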
36
Outline
  • Morphological segmentation
  • Our model
  • Learning and inference algorithms
  • Experimental results
  • Conclusion

37
Learning with Log-Linear Models
  • Maximizes likelihood of the observed data
    • Moves probability mass to the observed data
    • From where? The set X that Z sums over
    • Normally, X = all possible states
  • Major challenge
    • Efficient computation (approximation) of the sum
    • Particularly difficult in unsupervised learning

38
Contrastive Estimation
  • [Smith & Eisner 2005]
  • X = a neighborhood of the observed data
  • Neighborhood: pseudo-negative examples
  • Discriminate them from observed instances

39
Problem with Contrastive Estimation
  • Objects are independent from each other
  • Using global features leads to intractable
    inference
  • In our case, could not use the lexicon prior

40
Sampling to the Rescue
  • Similar to [Poon & Domingos 2008]
  • Markov chain Monte Carlo
  • Estimates sufficient statistics based on samples
  • Straightforward to handle global features

41
Our Learning Algorithm
  • Combines both ideas
  • Contrastive estimation
    • Creates an informative neighborhood
  • Sampling
    • Enables global feature (the lexicon prior)

42
Learning Objective
  • Observed: W (words)
  • Hidden: S (segmentation)
  • Maximizes log-likelihood of observing the words
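
Since the segmentation is hidden, the objective marginalizes over S. A sketch of the objective under the contrastive setup, where score(W, S) stands for the weighted feature sum in the exponent and N(W) for the neighborhood that Z sums over (both names are assumptions for illustration):

```latex
L(W) = \log \sum_{S} P(W, S)
     = \log \sum_{S} \exp\big(\mathrm{score}(W, S)\big)
     - \log \sum_{W' \in N(W)} \sum_{S'} \exp\big(\mathrm{score}(W', S')\big)
```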

43
Neighborhood
  • TRANS1: transpose any pair of adjacent
    characters
  • Intuition: transposition usually leads to a
    non-word
  • E.g.,
    • lAlrb → Allrb, llArb, ...
    • Alrb → lArb, Arlb, ...
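
The TRANS1 neighborhood is simple to generate. The sketch below uses a hypothetical function name; including the observed word itself in its own neighborhood is an assumption (it is the usual convention in contrastive estimation) and can be switched off.

```python
def trans1_neighborhood(word, include_self=True):
    """Strings obtained by transposing one pair of adjacent characters.

    include_self: also return the observed word (an assumed convention).
    Duplicates, e.g. from swapping two equal characters, are dropped.
    """
    neighbors = [word] if include_self else []
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped not in neighbors and (include_self or swapped != word):
            neighbors.append(swapped)
    return neighbors

print(trans1_neighborhood("lAlrb"))
print(trans1_neighborhood("Alrb", include_self=False))
```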

44
Optimization
  • Gradient descent
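
For a log-linear model with hidden variables, the gradient of the log-likelihood with respect to a feature weight takes the familiar difference-of-expectations form. The first expectation is over segmentations given the observed words, the second over words and segmentations drawn from the set that Z sums over. Maximizing L(W) corresponds to descending on its negative.

```latex
\frac{\partial L(W)}{\partial \lambda_i}
  = E_{S \mid W}[f_i] - E_{W, S}[f_i]
```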

45
Supervised Learning and Semi-Supervised Learning
  • Readily applicable if there are
    labeled segmentations (S)
  • Supervised: labels for all words
  • Semi-supervised: labels for some words

46
Inference: Expectation
  • Gibbs sampling
  • E_{S|W}[f_i]
    • For each observed word in turn, sample next
      segmentation, conditioning on the rest
  • E_{W,S}[f_i]
    • For each observed word in turn, sample a word
      from neighborhood and next segmentation,
      conditioning on the rest
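
A toy sketch of the first kind of sweep (resampling each word's segmentation given all the others). Everything here is an assumption for illustration: candidate segmentations are enumerated exhaustively, which is only feasible for short words, and `toy_score` is a hypothetical stand-in for the model's weighted feature sum, not the talk's model.

```python
import math
import random
from itertools import combinations

def all_segmentations(word):
    """Enumerate every way to split `word` into contiguous pieces."""
    n = len(word)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def gibbs_sweep(words, segs, score, rng):
    """One sweep: resample each word's segmentation, holding the others fixed."""
    for i, word in enumerate(words):
        candidates = list(all_segmentations(word))
        weights = []
        for cand in candidates:
            trial = segs[:i] + [cand] + segs[i + 1:]
            weights.append(math.exp(score(trial)))  # unnormalized probability
        segs[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return segs

def toy_score(segs):
    """Hypothetical global score: penalize the lexicon length only."""
    lexicon = {m for seg in segs for m in seg}
    return -sum(len(m) for m in lexicon)

rng = random.Random(0)
words = ["lAlrb", "Alrb"]
segs = [[w] for w in words]  # start with every word unsegmented
for _ in range(10):
    segs = gibbs_sweep(words, segs, toy_score, rng)
print(segs)
```

Because the score is evaluated on the whole corpus segmentation, a global feature such as the lexicon prior needs no special handling: it simply enters the conditional for each word.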

47
Inference: MAP Segmentation
  • Deterministic annealing
  • Gibbs sampling with temperature
  • Gradually lower the temperature from 10 to 0.1
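
A small sketch of sampling with temperature. The geometric spacing of the schedule, the number of steps, and the example scores are assumptions; the slide only states the start and end temperatures.

```python
import math

def temperature_schedule(t_start=10.0, t_end=0.1, steps=10):
    """Geometric schedule from t_start down to t_end (steps must be >= 2)."""
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    return [t_start * ratio ** k for k in range(steps)]

def tempered_probs(scores, temperature):
    """Softmax of score / temperature, computed stably."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [-5.0, -9.0, -9.0, -5.5]  # hypothetical candidate scores
for t in temperature_schedule(steps=5):
    print(round(t, 3), [round(p, 3) for p in tempered_probs(scores, t)])
```

At high temperature the candidates are sampled almost uniformly; as the temperature drops, the probability mass concentrates on the highest-scoring candidate, so the final samples approximate the MAP segmentation.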

48
Outline
  • Morphological segmentation
  • Our model
  • Learning and inference algorithms
  • Experimental results
  • Conclusion

49
Dataset
  • SB [Snyder & Barzilay 2008a, 2008b]
    • About 7,000 parallel short phrases
    • Arabic and Hebrew with gold segmentation
  • Arabic Penn Treebank (ATB): 120,000 words

50
Methodology
  • Development set: 500 words from SB
  • Use trigram context in our full model
  • Evaluation: precision, recall, F1 on segmentation
    points
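
Scoring on segmentation points can be sketched as follows. The function names are hypothetical and the example data is made up; a segmentation point is taken to be an internal split position within a word.

```python
def boundaries(segmentation):
    """Set of internal split positions of one segmented word."""
    points, pos = set(), 0
    for m in segmentation[:-1]:
        pos += len(m)
        points.add(pos)
    return points

def segmentation_prf(gold, predicted):
    """Precision, recall, and F1 over segmentation points, pooled across words."""
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, predicted):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)
        n_pred += len(pb)
        n_gold += len(gb)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: the first word misses one of its two gold split points.
gold = [["l", "Al", "rb"], ["Al", "rb"]]
pred = [["l", "Alrb"], ["Al", "rb"]]
precision, recall, f1 = segmentation_prf(gold, pred)
print(precision, recall, f1)
```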

51
Experiment Objectives
  • Comparison with state-of-the-art systems
  • Unsupervised
  • Supervised or semi-supervised
  • Relative contributions of feature components

52
Experiment: SB (Unsupervised)
  • [Snyder & Barzilay 2008b]
    • SB-MONO: uses monolingual features only
    • SB-BEST: uses bilingual information
  • Our system: uses monolingual features only

53
Results: SB (Unsupervised)
[Figure: F1 chart]
56
Results: SB (Unsupervised)
Reduces F1 error by 40%
[Figure: F1 chart]
57
Results: SB (Unsupervised)
Reduces F1 error by 21%
[Figure: F1 chart]
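The "F1 error" reductions quoted on the results slides are presumably relative reductions in 1 - F1. A small helper under that assumption (the function name and the example scores are hypothetical and not taken from the talk):

```python
def f1_error_reduction(f1_baseline, f1_system):
    """Relative reduction in F1 error, taking F1 error to be 1 - F1."""
    baseline_error = 1.0 - f1_baseline
    system_error = 1.0 - f1_system
    return (baseline_error - system_error) / baseline_error

# Hypothetical scores: moving from F1 = 0.80 to F1 = 0.90 halves the error.
print(f1_error_reduction(0.80, 0.90))
```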
58
Experiment: ATB (Unsupervised)
  • Morfessor Categories-MAP [Creutz & Lagus 2007]
  • Our system

59
Results: ATB
Reduces F1 error by 11%
[Figure: F1 chart]
60
Experiment: Ablation Tests
  • Conducted on the SB dataset
  • Change one feature component in each test
    • Priors
    • Context features

61
Results: Ablation Tests
  • Both priors are crucial
[Figure: F1 for FULL, NO-PR (no priors), COR (corpus prior only), LEX (lexicon prior only), NO-CTXT]
62
Results: Ablation Tests
  • Overlapping context features are important
[Figure: F1 for FULL, NO-PR, COR, LEX, NO-CTXT (no context features)]
63
Experiment: SB (Supervised and Semi-Supervised)
  • [Snyder & Barzilay 2008a]
    • SB-MONO-S: monolingual features and labels
    • SB-BEST-S: bilingual information and labels
  • Our system: monolingual features and labels
    • Partial or all labels (25%, 50%, 75%, 100%)

69
Results: SB (Supervised and Semi-Supervised)
  • Reduces F1 error by 46% compared to SB-MONO-S
  • Reduces F1 error by 36% compared to SB-BEST-S
[Figure: F1 for SB-MONO-S, SB-BEST-S, and Our-S with 25%, 50%, 75%, and 100% of the labels]
70
Conclusion
  • We developed a method
  • for Unsupervised Learning
  • of Log-Linear Models
  • with Global Features
  • Applied it to morphological segmentation
  • Substantially outperforms state-of-the-art
    systems
  • Effective for semi-supervised learning as well
  • Easy to extend with additional features

71
Future Work
  • Apply to other NLP tasks
  • Interplay between neighborhood and features
  • Morphology
    • Apply to other languages
    • Modeling internal variations of morphemes
    • Leverage multi-lingual information
    • Combine with other NLP tasks (e.g., MT)

72
Thanks to Ben Snyder
  • For his most generous help with the SB dataset