Title: Novel Speech Recognition Models for Arabic
1Novel Speech Recognition Models for Arabic
- The Arabic Speech Recognition Team
- JHU Workshop Final Presentations
- August 21, 2002
2Arabic ASR Workshop Team
- Senior Participants: Katrin Kirchhoff (UW), Jeff Bilmes (UW), John Henderson (MITRE), Mohamed Noamany (BBN), Pat Schone (DoD), Rich Schwartz (BBN)
- Graduate Students: Sourin Das (JHU), Gang Ji (UW)
- Undergraduate Students: Melissa Egan (Pomona College), Feng He (Swarthmore College)
- Affiliates: Dimitra Vergyri (SRI), Daben Liu (BBN), Nicolae Duta (BBN), Ivan Bulyko (UW), Mari Ostendorf (UW)
3Arabic
Dialects: used for informal conversation
Modern Standard Arabic (MSA): cross-regional standard, used for formal communication
4Arabic ASR Previous Work
- Dictation: IBM ViaVoice for Arabic
- Broadcast News: BBN TIDES OnTAP
- Conversational speech: 1996/1997 NIST CallHome evaluations
- little work compared to other languages
- few standardized ASR resources
5Arabic ASR State of the Art (before WS02)
- BBN TIDES OnTAP: 15.3% WER
- BBN CallHome system: 55.8% WER
- WER on conversational speech is noticeably higher than for other languages (e.g. 30% WER for English CallHome)
- → focus on recognition of conversational Arabic
6Problems for Arabic ASR
- language-external problems
- data sparsity: only one (!) standardized corpus of conversational Arabic available
- language-internal problems
- complex morphology, large number of possible word forms (similar to Russian, German, Turkish, ...)
- differences between written and spoken representation: lack of short vowels and other pronunciation information (similar to Hebrew, Farsi, Urdu, Pashto, ...)
7Corpus: LDC ECA CallHome
- phone conversations between family members/friends
- Egyptian Colloquial Arabic (Cairene dialect)
- high degree of disfluencies (9%), out-of-vocabulary words (9.6%), foreign words (1.6%)
- noisy channels
- training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20 calls (1.5 hrs)
- very small amount of data for language modeling (150K words)!
8MSA - ECA differences
- Phonology
- /th/ → /s/ or /t/: thalatha - talata (three)
- /dh/ → /z/ or /d/: dhahab - dahab (gold)
- /zh/ → /g/: zhadeed - gideed (new)
- /ay/ → /e/: Sayf - Seef (summer)
- /aw/ → /o/: lawn - loon (color)
- Morphology
- inflections: yatakallamu - yitkallim (he speaks)
- Vocabulary
- different terms: TAwila - tarabeeza (table)
- Syntax
- word order differences: SVO - VSO
9Workshop Goals
improvements to Arabic ASR through
developing novel models to better exploit
available data
developing techniques for using
out-of-corpus data
Automatic romanization
Integration of MSA text data
Factored language modeling
10Factored Language Models
- complex morphological structure leads to a large number of possible word forms
- break up words into separate components
- build statistical n-gram models over the individual morphological components rather than over complete word forms
11Automatic Romanization
- Arabic script lacks short vowels and other pronunciation markers
- the lack of vowels results in lexical ambiguity and affects acoustic and language model training
- try to predict the vowelization automatically from data and use the result for recognizer training
- comparable English example:
  th fsh stcks f th nrth tlntc hv bn dpletd
  → the fish stocks of the north atlantic have been depleted
12Out-of-corpus text data
- no corpora of transcribed conversational speech available
- large amounts of written (Modern Standard Arabic) data available (e.g. newspaper text)
- Can MSA text data be used to improve language modeling for conversational speech?
- Try to integrate data from newspapers, transcribed TV broadcasts, etc.
13Recognition Infrastructure
- baseline system: BBN recognition system
- N-best list rescoring
- language model training: SRI LM toolkit, with significant additions implemented during this workshop
- Note: no work on acoustic modeling, speaker adaptation, noise robustness, etc.
- two different recognition approaches: grapheme-based vs. phoneme-based
14Summary of Results (WER)
Grapheme-based recognizer
Phone-based recognizer
15Novel research
- new strategies for language modeling based on morphological features
- new graph-based backoff schemes allowing a wider range of smoothing techniques in language modeling
- new techniques for automatic vowel insertion
- first investigation of the use of automatically vowelized data for ASR
- first attempt at using MSA data for language modeling of conversational Arabic
- morphology induction for Arabic
16Key Insights
- Automatic romanization improves grapheme-based Arabic recognition systems
- trend: morphological information helps in language modeling
- needs to be confirmed on a larger data set
- Using MSA text data does not help
- We need more data!
17Resources
- significant add-on to the SRILM toolkit for general factored language modeling
- techniques/software for automatic romanization of Arabic script
- part-of-speech tagger for MSA; tagged text
18Outline of Presentations
- 1:30 - 1:45 Introduction (Katrin Kirchhoff)
- 1:45 - 1:55 Baseline system (Rich Schwartz)
- 1:55 - 2:20 Automatic romanization (John Henderson, Melissa Egan)
- 2:20 - 2:35 Language modeling overview (Katrin Kirchhoff)
- 2:35 - 2:50 Factored language modeling (Jeff Bilmes)
- 2:50 - 3:05 Coffee Break
- 3:05 - 3:10 Automatic morphology learning (Pat Schone)
- 3:15 - 3:30 Text selection (Feng He)
- 3:30 - 4:00 Graduate student proposals (Gang Ji, Sourin Das)
- 4:00 - 4:30 Discussion and Questions
19Thank you!
- Fred Jelinek, Sanjeev Khudanpur, Laura Graham
- Jacob Laderman and assistants
- Workshop sponsors
- Mark Liberman, Chris Cieri, Tim Buckwalter
- Kareem Darwish, Kathleen Egan
- Bill Belfield and colleagues from BBN
- Apptek
20(No Transcript)
21BBN Baseline System for Arabic
- Richard Schwartz, Mohamed Noamany,
- Daben Liu, Bill Belfield, Nicolae Duta
- JHU Workshop
- August 21, 2002
22BBN BYBLOS System
- Rough'n'Ready / OnTAP / OASIS system
- Version of BYBLOS optimized for Broadcast News
- OASIS system fielded in Bangkok and Amman
- Real-time operation with a 1-minute delay
- 10-20% WER, depending on the data
23BYBLOS Configuration
- 3-passes of recognition
- Forward Fast-match pass uses PTM models and an approximate bigram search
- Backward pass uses SCTM models and an approximate trigram search, and creates the N-best lists
- Rescoring pass uses cross-word SCTM models and a trigram LM
- All runs in real time
- Minimal difference from running slowly
24Use for Arabic Broadcast News
- Transcriptions are in normal Arabic script, omitting short vowels and other diacritics.
- We used each Arabic letter as if it were a phoneme.
- This allowed the addition of large text corpora for language modeling.
25Initial BN Baseline
- 37.5 hours of acoustic training
- Acoustic training data (230K words) used for LM training
- 64K-word vocabulary (4% OOV)
- Initial word error rate (WER): 31.2%
26Speech Recognition Performance
27Call Home Experiments
- Modified the OnTAP system to make it more appropriate for Call Home data.
- Added features from LVCSR research to the OnTAP system for Call Home data.
- Experiments:
- Acoustic training: 80 conversations (15 hours)
- Transcribed with diacritics
- Acoustic training data (150K words) used for the LM
- Real-time
28Using OnTAP system for Call Home
29Additions from LVCSR
30Output Provided for Workshop
- OASIS was run on various sets of training data as needed
- Systems were run either with Arabic-script phonemes or with romanized phonemes including diacritics.
- In addition to workshop participants, others at BBN provided assistance and worked on workshop problems.
- Output provided for the workshop was N-best sentences, with separate scores for HMM, LM, words, phones, and silences.
- Due to the high error rate (56%), the oracle error rate for the 100-best lists was about 46%.
- Unigram lattices were also provided, with an oracle error rate of 15%.
31Phoneme HMM Topology Experiment
- The phoneme HMM topology for the Arabic script system was increased from 5 states to 10 states in order to accommodate a consonant and a possible vowel.
- The gain was small (0.3% WER).
32OOV Problem
- OOV rate is 10%
- 50% are morphological variants of words in the training set
- 10% are proper names
- 40% are other unobserved words
- Tried adding words from BN and from a morphological transducer
- Added too many words for too small a gain
33Use BN to Reduce OOV
- Can we add words from BN to reduce OOV?
- BN text contains 1.8M distinct words.
- Adding the entire 1.8M words reduces OOV from 10% to 3.9%.
- Adding the top 15K words reduces OOV to 8.9%.
- Adding the top 25K words reduces OOV to 8.4%.
34Use Morphological Transducer
- Use the LDC Arabic transducer to expand verbs to all forms
- Produces > 1M words
- Reduces OOV to 7%
35Language Modeling Experiments
- Described in other talks
- Searched for available dialect transcriptions
- Combine BN (300M words) with CH (230K)
- Use BN to define word classes
- Constrained back-off for BN+CH
36(No Transcript)
37Autoromanization of Arabic Script
- Melissa Egan and John Henderson
38Autoromanization (AR) goal
- Expand the Arabic script representation to include short vowels and other pronunciation information.
- Phenomena not typically marked in non-diacritized script include:
- Short vowels: a, i, u
- Repeated consonants (shadda)
- Extra phonemes for Egyptian Arabic: f/v, j/g
- A grammatical marker that adds an n to the pronunciation (tanween)
- Example:
- Non-diacritized form: ktb (write)
- Expansions: kitab (book), aktib (I write), kataba (he wrote), kattaba (he caused to write)
39AR motivation
- Romanized text can be used to produce better output from an ASR system.
- Acoustic models will be able to disambiguate better based on the extra information in the text.
- Conditioning events in the LM will contain more information.
- Romanized ASR output can be converted to script for an alternative WER measurement.
- Eval96 results (BBN recognizer, 80-conversation training):
- script recognizer: 61.1% WER-G (grapheme)
- romanized recognizer: 55.8% WER-R (roman)
40AR data
- CallHome Arabic from LDC
- Conversational speech transcripts (ECA) in both script and a roman specification that includes short vowels, repeats, etc.
- Data sets (set: conversations, words):
  asrtrain: 80 conv, 135K
  dev: 20 conv, 35K
  eval96 (asrtest): 20 conv, 15K
  eval97: 20 conv, 18K
  h5_new: 20 conv, 18K
- Romanizer training: eval97 and h5_new; romanizer testing: asrtrain, dev, and eval96
41Data format
- Script without and with diacritics
- CallHome in script and roman forms; our task: script to roman
  Script: AlHmd_llh kwIsB w AntI AzIk
  Roman: ilHamdulillA kuwayyisaB wi inti izzayyik
42Autoromanization (AR) WER baseline
- Train on 32K words in eval97 + h5_new
- Test on 137K words in ASR_train + h5_new

  Status (w.r.t. training)   portion of test   error in test   share of total error
  unambiguous                     68.0%              1.8%              6.2%
  ambiguous                       15.5%             13.9%             10.8%
  unknown                         16.5%             99.8%             83.0%
  total                          100.0%             19.9%            100.0%

The biggest potential error reduction would come from predicting romanized forms for unknown words.
43AR knitting example
unknown: tbqwA
1. Find a close known word
   known: ybqwA
2. Record the ops required to make the roman form from the known word
   known:     y _ b q w A
   kn. roman: y i b q u _      ops: c i c c r d
3. Construct the new roman form using the same ops
   unknown:   t _ b q w A      ops: c i c c r d
   new roman: t i b q u
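A toy Python sketch of the knitting step (hypothetical names; the dictionary lookup via difflib and the positional transfer of edit ops are simplifying assumptions, not the workshop implementation):

```python
import difflib

def knit(unknown, known_short, known_roman):
    """Transfer the short->roman edit operations of a known pair onto an
    unknown short form (assumes the unknown aligns positionally with the
    known short form, as in the tbqwA/ybqwA example)."""
    ops = difflib.SequenceMatcher(None, known_short, known_roman).get_opcodes()
    out = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == 'equal':                    # copy the corresponding unknown characters
            out.append(unknown[i1:i2])
        elif tag in ('insert', 'replace'):    # take material from the roman side
            out.append(known_roman[j1:j2])
        # 'delete': drop the characters
    return ''.join(out)

def closest_known(unknown, dictionary):
    """Pick the known short form with the best surface similarity."""
    hits = difflib.get_close_matches(unknown, list(dictionary), n=1, cutoff=0.0)
    return hits[0] if hits else None

dictionary = {'ybqwA': 'yibqu'}              # known short form -> roman form
unknown = 'tbqwA'
known = closest_known(unknown, dictionary)
print(knit(unknown, known, dictionary[known]))   # -> tibqu
```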
44Experiment 1 (best match)
- Observed patterns in the known short/long pairs
- Some characters in the short forms are consistently found with particular, non-identical characters in the long forms.
- Example rule: A → a
45Experiment 2 (rules)
Environments in which "w" occurs in the training dictionary long forms (Env: Freq):
  C _ V: 149    V _ : 8    _ V: 81    C _ : 5    V _ V: 121    V _ C: 118
Environments in which "u" occurs in the training dictionary long forms (Env: Freq):
  C _ C: 1179   C _ : 301   _ C: 29
- Some output forms depend on output context.
- Rule:
- "u" occurs only between two non-vowels.
- "w" occurs elsewhere.
- Accurate for 99.7% of the instances of "u" and "w" in the training dictionary long forms. A similar rule may be formulated for "i" and "y".
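The u/w rule fits in a few lines of Python; the vowel inventory and the treatment of word boundaries as non-vowels are assumptions for illustration:

```python
VOWELS = set('aiuAIU')   # assumed romanized vowel inventory

def u_or_w(prev_char, next_char):
    """'u' only between two non-vowels (word boundary counts as a non-vowel);
    'w' in every other environment."""
    between_non_vowels = prev_char not in VOWELS and next_char not in VOWELS
    return 'u' if between_non_vowels else 'w'

print(u_or_w('k', 'y'))   # C _ C  -> u
print(u_or_w('a', 'i'))   # V _ V  -> w
```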
46Experiment 3 (local model)
- Move to a more data-driven model
- We found some rules manually; now look for all of them, systematically.
- Use the best-scoring candidate for replacement:
- Environment likelihood score
- Character alignment score
47Experiment 4 (n-best)
- Instead of generating the romanized form using only the single best short form in the dictionary, generate romanized forms using the top n best short forms.
- Example (n = 5)
48Character error rate (CER)
- Measuring insertions, deletions, and substitutions in character strings should more closely track phoneme error rate.
- More sensitive than WER
- Stronger statistics from the same data
- Test set results:
- Baseline: 49.89% character error rate (CER)
- Best model: 24.58% CER
- Oracle 2-best list: 17.60% CER, which suggests more room for gain.
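For reference, CER as used here is character-level edit distance normalized by the reference length; a small self-contained sketch:

```python
def char_error_rate(hyp, ref):
    """100 * (insertions + deletions + substitutions) / len(ref),
    computed with standard Levenshtein dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[m][n] / max(m, 1)

print(char_error_rate('tibqu', 'yibqu'))   # one substitution in five chars -> 20.0
```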
49 Summary of performance (dev set)
                                      Accuracy   CER
  Baseline                               8.4     41.4
  Knitting                              16.9     29.5
  Knitting + best-match rules           18.4     28.6
  Knitting + local model                19.4     27.0
  Knitting + local model + n-best       30.0     23.1   (n = 25)
50 Varying the number of dictionary matches
51ASR scenarios
- 1) Have a script recognizer, but want to produce romanized forms: postprocess the ASR output
- 2) Have a small amount of romanized data and a large amount of script data available for recognizer training: preprocess the ASR training set
52ASR experiments
[Flow diagram, labels only: Script Train, AR (autoromanization), Roman ASR, Roman Result (WER-R), R2S (roman-to-script), Script Result (WER-G), in the preprocessing and postprocessing configurations.]
53Experiment adding script data
- Script LM training data could be acquired from found text.
- Script transcription is cheaper than roman transcription.
- Simulate a preponderance of script by training AR on a separate set.
- ASR is then trained on the output of AR.
[Diagram: future training set, with AR trained on 40 conversations and ASR trained on 100 conversations.]
54Eval 96 experiments, 80 conv
  Config            WER-R    WER-G
  script baseline    N/A     59.8
  postprocessing    61.5     59.8
  preprocessing     59.9     59.2 (-0.6)
  Roman baseline    55.8     55.6 (-4.2)
- Bounding experiment
- No overlap between ASR train and AR train.
- Poor pronunciations for made-up words.
55Eval 96 experiments, 100 conv
  Config            WER-R    WER-G
  script baseline    N/A     59.0
  postprocessing    60.7     59.0
  preprocessing     58.5     57.5 (-1.5)
  Roman baseline    55.1     54.9 (-4.1)
- More realistic experiment
- 20-conversation overlap between ASR train and AR train.
- Better pronunciations for made-up words.
56Remaining challenges
- Correct dangling tails in short matches
- Merge unaligned characters
57Bigram translation model
[Character alignment example: input (s) t b q w A; output (r) ? t i b q u ?; known roman y i b q u]
58Trigram translation model
[Character alignment example: input (s) t b q w A; output (r) t i b q u; known roman y i b q u]
59Future work
- Context provides information for disambiguating both known and unknown words
- Bigrams for unknown words will also be unknown; use part-of-speech tags or morphology.
- Acoustics:
- Use acoustics to help disambiguate vowels?
- Provide n-best output as alternative pronunciations for ASR training.
60(No Transcript)
61Factored Language Modeling
Katrin Kirchhoff, Jeff Bilmes, Dimitra
Vergyri, Pat Schone, Gang Ji, Sourin Das
62Arabic morphology
- structure of Arabic derived words
[Example: fa-sakan-tu, "so I lived": particle fa- ("so"), root s-k-n ("LIVE"), pattern (past), affix -tu (1st-sg past)]
63Arabic morphology
- 5000 roots
- several hundred patterns
- dozens of affixes
- large number of possible word forms
- problems training robust language model
- large number of OOV words
64Vocabulary Growth - full word forms
65Vocabulary Growth - stemmed words
66Particle model
- Break words into sequences of stems and affixes
- Approximate the probability of the word sequence by the probability of the particle sequence
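Written out (assuming a trigram over particles; the model order is not stated on this slide), the approximation is:

```latex
P(w_1,\dots,w_N) \;\approx\; P(p_1,\dots,p_M)
                 \;=\; \prod_{t=1}^{M} P(p_t \mid p_{t-1}, p_{t-2}),
```

where p_1, ..., p_M is the stem/affix (particle) sequence obtained by splitting w_1, ..., w_N.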
67Factored Language Model
- Problem: how can we estimate P(W_t | W_{t-1}, W_{t-2}, ...)?
- Solution: decompose W into its morphological components: affixes, stems, roots, patterns
- words can be viewed as bundles of features
[Figure: parallel factor streams over three time slices: patterns P_t, roots R_t, affixes A_t, stems S_t, and words W_t]
68Statistical models for factored representations
- Class-based LM
- Single-stream LM
69Full Factored Language Model
- assume each word is equivalent to the bundle of its factors, w = (r, f, a), so that
  P(w_t | w_{t-1}, w_{t-2}) = P(r_t, f_t, a_t | r_{t-1}, f_{t-1}, a_{t-1}, r_{t-2}, f_{t-2}, a_{t-2})
- where w = word, r = root, f = pattern, a = affixes
- Goal: find appropriate conditional independence statements to simplify this model.
70Experimental Infrastructure
- All language models were tested using N-best rescoring
- two baseline word-based LMs:
- B1: BBN LM, WER 55.1%
- B2: WS02 baseline LM, WER 54.8%
- combination of baselines: 54.5%
- new language models were used in combination with one or both baseline LMs
- log-linear score combination scheme
71Log-linear combination
- For m information sources, each producing a maximum-likelihood estimate for W, the combined estimate is
  P(W | I) ∝ Π_i P(W | I_i)^{k_i}
- I: the total information available
- I_i: the ith information source
- k_i: the weight for the ith information source
72Discriminative combination
- We optimize the combination weights jointly with the language model weight and the insertion penalty to directly minimize the WER of the maximum-likelihood hypothesis.
- The normalization factor can be ignored since it is the same for all alternative hypotheses.
- We used the simplex optimization method on the 100-best lists provided by BBN (the optimization algorithm is available in the SRILM toolkit).
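A minimal sketch of the rescoring step (field names and scores are made up; the real system combines HMM, LM, word, phone, and silence scores and tunes the weights and insertion penalty with simplex search on a dev set):

```python
def rescore_nbest(nbest, weights, word_penalty):
    """Return the hypothesis maximizing a weighted sum of per-model log-scores
    plus a word-insertion penalty (hypothetical field names, toy values)."""
    def combined(hyp):
        model_score = sum(w * hyp['scores'][name] for name, w in weights.items())
        return model_score + word_penalty * len(hyp['words'])
    return max(nbest, key=combined)

nbest = [
    {'words': ['A', 'B', 'C'], 'scores': {'am': -120.0, 'lm1': -15.2, 'flm': -14.0}},
    {'words': ['A', 'B'],      'scores': {'am': -123.5, 'lm1': -13.0, 'flm': -12.1}},
]
weights = {'am': 1.0, 'lm1': 8.0, 'flm': 6.0}    # tuned on a dev set in practice
print(rescore_nbest(nbest, weights, word_penalty=-2.0)['words'])
```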
73Word decomposition
- Linguistic decomposition (expert knowledge)
- Automatic morphological decomposition: acquire morphological units from the data without using human knowledge
- Assign words to classes based not on characteristics of the word form but on distributional properties
74(Mostly) Linguistic Decomposition
- Stem/morph-class information from the LDC CH lexicon
- Roots determined by K. Darwish's morphological analyzer for MSA
- Pattern determined by subtracting the root from the stem
- Example: atamna → atam (stem) + verb-past-1st-plural (morph. tag); atam → tm (root); atam → CaCaC (pattern)
75Automatic Morphology
- Classes defined by morphological components derived from the data
- no expert knowledge
- based on statistics of word forms
- more details in Pat's presentation
76Data-driven Classes
- Word clustering based on distributional statistics
- Exchange algorithm (Martin et al., 1998); a toy sketch follows after this list:
- initially assign words to individual clusters
- temporarily move each word to all other clusters and compute the change in perplexity (class-based trigram)
- keep the assignment that minimizes perplexity
- stop when the class assignment no longer changes
- Bottom-up clustering (SRI toolkit):
- initially assign words to individual clusters
- successively merge the pairs of clusters with the highest average mutual information
- stop at a specified number of classes
77Results
- Best word error rates obtained with:
- particle model: 54.0% (B1 + particle LM)
- class-based models: 53.9% (B1 + Morph + Stem)
- automatic morphology: 54.3% (B1 + B2 + Rule)
- data-driven classes: 54.1% (B1 + SRILM, 200 classes)
- combination of best models: 53.8%
78Conclusions
- The overall improvement in WER gained from language modeling (1.3%) is significant
- individual differences between the LMs are not significant
- but adding morphological class models always helps the language model combination
- morphological models get the highest weights in the combination (in addition to the word-based LMs)
- the trend needs to be verified on a larger data set
- → application to a script-based system?
79(No Transcript)
80Factored Language Models and Generalized Graph
Backoff
- Jeff Bilmes, Katrin Kirchhoff
- University of Washington, Seattle
- JHU-WS02 ASR Team
81Outline
- Language Models, Backoff, and Graphical Models
- Factored Language Models (FLMs) as Graphical
Models - Generalized Graph Backoff algorithm
- New features to SRI Language Model Toolkit (SRILM)
82Standard Language Modeling
- Example: a standard tri-gram, p(w_t | w_{t-1}, w_{t-2})
83Typical Backoff in LM
- In typical LM, there is one natural (temporal)
path to back off along. - Well motivated since information often decreases
with word distance.
84Factored LM Proposed Approach
- Decompose words into smaller morphological or class-based units (e.g., morphological classes, stems, roots, patterns, or other automatically derived units).
- Produce probabilistic models over these units to attempt to improve WER.
85Example with Words, Stems, and Morphological
classes
86Example with Words, Stems, and Morphological
classes
87In general
88General Factored LM
- A word is equivalent to a collection of factors, e.g., if K = 3: w_t ≡ {f_t^1, f_t^2, f_t^3}
- Goal: find appropriate conditional independence statements to simplify this sort of model while keeping perplexity and WER low. This is the structure-learning problem in graphical models.
89The General Case
90The General Case
91The General Case
92A Backoff Graph (BG)
93Example 4-gram Word Generalized Backoff
94How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable (e.g.,
temporal constraints)) - Generalized all-child backoff
- Constrained multi-child backoff
- Child combination rules
95Choosing a fixed back-off path
96How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable (e.g.,
temporal constraints)) - Generalized all-child backoff
- Constrained multi-child backoff
- Child combination rules
97Generalized Backoff
- In typical backoff, we drop the 2nd parent and use the conditional probability given the remaining parent.
- More generally, g() can be any positive function, but we then need a new algorithm for computing the backoff weight (BOW).
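A sketch of the generalized backoff form (notation assumed here: two parents f_1, f_2, count threshold τ, discount d; the exact discount comes from whichever smoothing method is chosen at the node):

```latex
p_{\mathrm{GBO}}(f \mid f_1, f_2) =
\begin{cases}
  d_{N(f,f_1,f_2)}\, p_{\mathrm{ML}}(f \mid f_1, f_2), & N(f,f_1,f_2) > \tau \\[4pt]
  \alpha(f_1,f_2)\, g(f, f_1, f_2),                    & \text{otherwise}
\end{cases}
\qquad
\alpha(f_1,f_2) =
  \frac{1 - \sum_{f:\,N(f,f_1,f_2) > \tau} d_{N}\, p_{\mathrm{ML}}(f \mid f_1,f_2)}
       {\sum_{f:\,N(f,f_1,f_2) \le \tau} g(f, f_1, f_2)}
```

Taking g(f, f_1, f_2) = p_BO(f | f_1) recovers ordinary single-path backoff; other non-negative choices (max, mean, or products of the parent models) give the generalized behavior listed on the following slides.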
98Computing BOWs
- Many possible choices for g() functions (next few
slides) - Caveat certain g() functions can make the LM
much more computationally costly than standard
LMs.
99g() functions
100More g() functions
101More g() functions
102How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable
(time)) - Generalized all-child backoff
- Constrained multi-child backoff
- Same as before, but choose a subset of possible
paths a-priori - Child combination rules
- Combine child node via combination function
(mean, weighted avg., etc.)
103Significant Additions to Stolcke's SRILM, the SRI Language Modeling Toolkit
- New features added to SRILM, including:
- Can specify an arbitrary number of graphical-model based factorized models to train, compute perplexity for, and rescore N-best lists with.
- Can specify any (possibly constrained) set of backoff paths from the top to the bottom level in the BG.
- Different smoothing (e.g., Good-Turing, Kneser-Ney, etc.) or interpolation methods may be used at each backoff-graph node.
- Supports the generalized backoff algorithms with 18 different possible g() functions at each BG node.
104Example with Words, Stems, and Morphological
classes
105How to specify a model
word given stem, morph:
  W 2 S(0) M(0)
    S0,M0  M0  wbdiscount gtmin 1 interpolate
    S0     S0  wbdiscount gtmin 1
    0      0   wbdiscount gtmin 1

morph given word, word:
  M 2 W(-1) W(-2)
    W1,W2  W2  kndiscount gtmin 1 interpolate
    W1     W1  kndiscount gtmin 1 interpolate
    0      0   kndiscount gtmin 1

stem given morph, word, word:
  S 3 M(0) W(-1) W(-2)
    M0,W1,W2  W2  kndiscount gtmin 1 interpolate
    M0,W1     W1  kndiscount gtmin 1 interpolate
    M0        M0  kndiscount gtmin 1
    0         0   kndiscount gtmin 1
106Summary
- Language Models, Backoff, and Graphical Models
- Factored Language Models (FLMs) as Graphical
Models - Generalized Graph Backoff algorithm
- New features to SRI Language Model Toolkit (SRILM)
107Coffee Break
108Knowledge-Free Induction of Arabic Morphology
- Patrick Schone
- 21 August 2002
109Why induce Arabic morphology?
- (1) It has not been done before
- (2) If it can be done, and if it has value in LM, it can generalize across languages without needing an expert
110Original Algorithm (Schone & Jurafsky, 2000/2001)
- Look for word inflections on words with Freq > 9
- Use a character tree to find word pairs with similar beginnings/endings, e.g. car/cars, car/cares, car/caring
- Use Latent Semantic Analysis to induce semantic vectors for each word, then compare word-pair semantics
- Use frequencies of word stems/rules to improve the initial semantic estimates
111Algorithmic Expansions
IR-Based Minimum Edit Distance
- A trie-based approach could be a problem for Arabic templates, e.g. aGlaB: aGlaB, ilAGil, aGlu, AGil
- Result: 3576 words in the CallHome lexicon with 50 relationships!
- Use Minimum Edit Distance to find the relationships (can be weighted)
- Use an information-retrieval based approach to facilitate the search for MED candidates
112Algorithmic Expansions
Agglomerative Clustering Using Rules and Stems

  Word pairs w/ stem:          Word pairs w/ rule:
    Gayyar     507               NULL → il   1178
    xallaS     503               NULL → u     635
    makallim   468               NULL → i     455
    qaddim     434               i → u        377
    itgawwiz   332               NULL → fa    375
    tkallim    285               NULL → bi    366

- Do bottom-up clustering, where the weight between two words is (Ct(Rule) Ct(PairedStem))^(1/2)
113Algorithmic Expansions
Updated Transitivity
If X↔Y and Y↔Z, and X↔Y > 2 and X↔Y < Z, then X↔Z
114Scoring Induced Morphology
- Score in terms of conflation-set agreement
- Conflation set of W: all words morphologically related to W
- Example: aGlaB: {aGlaB, ilAGil, aGlu, AGil}
- If X_W is the induced set for W and Y_W the truth set for W, compute the total correct (C), inserted (I), and deleted (D) words across all W, and
  ErrorRate = 100 (I + D) / (C + D)
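A small sketch of this scoring, counting per-word set overlaps exactly as the slide describes (the published metric may weight the counts differently):

```python
def conflation_error_rate(induced, truth):
    """For each word, count correct (C), inserted (I), and deleted (D) members
    of its induced conflation set against the truth set,
    then ErrorRate = 100 * (I + D) / (C + D)."""
    C = I = D = 0
    for w, y in truth.items():
        x = induced.get(w, {w})     # induced conflation set of w
        C += len(x & y)
        I += len(x - y)
        D += len(y - x)
    return 100.0 * (I + D) / (C + D)

truth = {'katab': {'katab', 'kataba', 'kattaba'},
         'kitab': {'kitab', 'kutub'}}
induced = {'katab': {'katab', 'kataba', 'kitab'},   # one insertion, one deletion
           'kitab': {'kitab', 'kutub'}}
print(round(conflation_error_rate(induced, truth), 1))   # -> 40.0
```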
115Scoring Induced Morphology
Induction error rates on words from the original 80-conversation set
116Using Morphology for LM Rescoring
- For each word W, use the induced morphology to generate:
- Stem: the smallest word z from X_W with z < W (by length)
- Root: the character intersection across X_W
- Rule: the map of word-to-stem
- Pattern: the map of stem-to-root
- Class: the map of word-to-root
117Other Potential Benefits of Morphology: Morphology-driven Word Generation
- Generate probability-weighted words using morphologically-derived rules (like NULL → il)
- Generate only if the initial and final n characters of the stem have been seen before.
118(No Transcript)
119Text Selection for Conversational Arabic
- Feng He
- ASR (Arabic Speech Recognition) Team
- JHU Workshop
120Motivation
- Group goal: conversational Arabic speech recognition.
- One of the problems: not enough training data to build a language model; most available text is in MSA (Modern Standard Arabic) or a mixture of MSA and conversational Arabic.
- One solution: select from the mixed text the segments that are conversational, and use them in training.
121Task Text Selection
- Use POS-based language models, because POS has been shown to better indicate differences in style, such as formal vs. conversational.
- Method:
- Train a POS (part-of-speech) tagger on the available data
- Train POS-based language models on formal vs. conversational data
- Tag the new data
- Select the segments of the new data that are closest to the conversational model, using scores from the POS-based language models.
122Data
- For building the tagger and language models:
- Arabic Treebank: 130K words of hand-tagged newspaper text in MSA.
- Arabic CallHome: 150K words of transcribed phone conversations. Tags are only in the lexicon.
- For text selection:
- Al Jazeera: 9M words of transcribed TV broadcasts. We want to select segments that are closer to conversational Arabic, such as talk shows and interviews.
123Implementation
124About unknown words
- These are words that are not seen in the training data but appear in the test data.
- Assume unknown words behave like singletons (words that appear only once in the training data).
- This is done by duplicating the training data with singletons replaced by a special token, then training the tagger on both the original and the duplicate (sketched below).
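A sketch of the duplication trick (the token name <UNK> is an assumption):

```python
from collections import Counter

def add_unknown_copy(sentences, unk='<UNK>'):
    """Duplicate the training data with singleton word types replaced by a
    special token, so the tagger also sees contexts for 'unknown' words."""
    counts = Counter(w for sent in sentences for w in sent)
    singletons = {w for w, c in counts.items() if c == 1}
    duplicate = [[unk if w in singletons else w for w in sent] for sent in sentences]
    return sentences + duplicate

train = [['il', 'walad', 'raaH'], ['il', 'bint', 'gat']]
print(add_unknown_copy(train))
```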
125- Tools
- GMTK (Graphical Models Toolkit)
- Algorithms
- Training: EM; set parameters so that the joint probability of hidden states and observations is maximized.
- Decoding (tagging): Viterbi; find the hidden state sequence that maximizes the joint probability of hidden states and observations.
126Experiments
- Exp 1: first 100K words of the English Penn Treebank; trigram model; sanity check.
- Exp 2: Arabic Treebank; trigram model.
- Exp 3: Arabic Treebank and CallHome; trigram model.
- The above three experiments all used 10-fold cross-validation and are unsupervised.
- Exp 4: Arabic Treebank; supervised trigram model.
- Exp 5: Arabic Treebank and CallHome; partially supervised training using the Treebank's tagged data; test on the portion of the Treebank not used in training; trigram model.
127Results
128Building Language Models and Text Selection
- Use existing scripts to build formal and conversational language models from the tagged Arabic Treebank and CallHome data.
- Text selection: use the length-normalized log-likelihood ratio
  score(S_i) = [log P_C(S_i) - log P_F(S_i)] / N_i
  where S_i is the ith sentence in the data set, C the conversational language model, F the formal language model, and N_i the length of S_i.
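A sketch of the selection step; the two POS-based LMs are stood in for by simple callables returning log-probabilities, and the threshold is a free parameter:

```python
import math

def make_unigram_lm(probs, floor=1e-6):
    """Toy stand-in for a POS-based LM: returns log P(tag sequence)."""
    return lambda tags: sum(math.log(probs.get(t, floor)) for t in tags)

def select_conversational(sentences, conv_lm, formal_lm, threshold=0.0):
    """Keep sentences whose length-normalized log-likelihood ratio between the
    conversational and formal models exceeds the threshold."""
    selected = []
    for tags in sentences:                 # each sentence as a POS-tag sequence
        score = (conv_lm(tags) - formal_lm(tags)) / max(len(tags), 1)
        if score > threshold:
            selected.append((score, tags))
    return selected

conv_lm = make_unigram_lm({'PRON': 0.3, 'VB': 0.3, 'NN': 0.2})
formal_lm = make_unigram_lm({'NN': 0.4, 'JJ': 0.3, 'VB': 0.2})
print(select_conversational([['PRON', 'VB', 'NN'], ['NN', 'JJ', 'NN']],
                            conv_lm, formal_lm))
```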
129Score Distribution
[Figure: score distributions; percentage and log count plotted against the log-likelihood ratio]
130Assessment
- A subset of Al Jazeera equal in size to Arabic CallHome (150K words) is selected and added to the training data for the speech recognition language model.
- No reduction in perplexity.
- Possible reasons: Al Jazeera has no conversational Arabic, or only conversational Arabic of a very different style.
131Text Selection Work Done at BBN
- Rich Schwartz
- Mohamed Noamany
- Daben Liu
- Nicolae Duta
132Search for Dialect Text
- We have an insufficient amount of CH text for estimating an LM.
- Can we find additional data?
- Many words are unique to dialect text.
- Searched Internet for 20 common dialect words.
- Most of the data found were jokes or chat rooms
very little data.
133Search BN Text for Dialect Data
- Search BN text for the same 20 dialect words.
- Found less than CH data
- Each occurrence was typically an isolated lapse
by the speaker into dialect, followed quickly by
a recovery to MSA for the rest of the sentence.
134Combine MSA text with CallHome
- Estimate separate models for MSA text (300M words) and CH text (150K words).
- Use the SRI toolkit to determine a single optimal weight for the combination, using deleted interpolation (EM).
- The optimal weight for the MSA text was 0.03.
- Insignificant reduction in perplexity and WER.
135Classes from BN
- Hypothesis:
- Even if the MSA n-grams are different, perhaps the classes are the same.
- Experiment:
- Determine classes (using the SRI toolkit) from BN + CH data.
- Use CH data to estimate n-grams of classes and/or p(w | class)
- Combine the resulting model with the CH word trigram
- Result:
- No gain
136Hypothesis Test Constrained Back-Off
- Hypothesis:
- In combining BN and CH, if a probability is different, it could be for two reasons:
- CH has insufficient training
- BN and CH truly have different probabilities (likely)
- Algorithm (a sketch follows below):
- Interpolate BN and CH, but limit the probability change to be no more than would be likely due to insufficient training.
- An n-gram count cannot change by more than its square root.
- Result:
- No gain
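One way to read the constraint in code (an interpretation of the slide, not BBN's implementation): move each CH probability toward the BN estimate by at most sqrt(count)/total; the result would still need renormalization per history:

```python
import math

def constrained_combine(ch_counts, ch_totals, bn_probs):
    """For each CH n-gram (history, word) with count c out of total(history),
    move the CH estimate toward the BN estimate, but by no more than
    sqrt(c)/total, mimicking the uncertainty due to limited CH training."""
    combined = {}
    for (hist, w), c in ch_counts.items():
        total = ch_totals[hist]
        p_ch = c / total
        p_bn = bn_probs.get((hist, w), p_ch)
        max_shift = math.sqrt(c) / total           # allowed change in probability
        shift = max(-max_shift, min(max_shift, p_bn - p_ch))
        combined[(hist, w)] = p_ch + shift         # renormalize per history afterwards
    return combined

ch_counts = {(('fi',), 'il'): 9, (('fi',), 'beet'): 1}
ch_totals = {('fi',): 10}
bn_probs = {(('fi',), 'il'): 0.2, (('fi',), 'beet'): 0.4}
print(constrained_combine(ch_counts, ch_totals, bn_probs))
```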
137(No Transcript)
138Learning Using Factored Language Models
- Gang Ji
- Speech, Signal, and Language Interpretation
- University of Washington
- August 21, 2002
139Outline
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding in ASR with FLMs
using graphical models
140Factored Language Models
- Along with words, consider factors as components of the language model
- Factors can be words, stems, morphs, patterns, or roots, which may contain complementary information about the language
- FLMs also provide new possibilities for designing LMs (e.g., multiple back-off paths)
- Problem: we don't know the best model, and the space is huge!
141Factored Language Models
- How to learn FLMs
- Solution 1 do it by hand using expert linguistic
knowledge - Solution 2 data driven let the data help to
decide the model - Solution 3 combine both linguistic and data
driven techniques
142Factored Language Models
- A Proposed Solution
- Learn FLMs using evolution-inspired search
algorithm - Idea Survival of the fittest
- A collection (generation) of models
- In each generation, only good ones survive
- The survivors produce the next generation
143Evolution-Inspired Search
- Selection choose the good LMs
- Combination retain useful characteristics
- Mutation some small change in next generation
144Evolution-Inspired Search
- Advantages
- Can quickly find a good model
- Retain goodness of the previous generation while
covering significant portion of the search space - Can run in parallel
- How to judge the quality of each model?
- Perplexity on a development set
- Rescore WER on development set
- Complexity-penalized perplexity
145Evolution-Inspired Search
- Three steps form the new models:
- Selection (based on perplexity, etc.), e.g. stochastic universal sampling: models are selected in proportion to their fitness
- Combination
- Mutation
146Moving from One Generation to Next
- Combination Strategies
- Inherit structures horizontally
- Inherit structures vertically
- Random selection
- Mutation
- Add/remove edges randomly
- Change back-off/smoothing strategies
147Combination according to Frames
148Combination according to Factors
149Outline
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding with FLMs
150Problem
- It may be difficult to improve WER just by rescoring n-best lists
- More gains can be expected from using better models in first-pass decoding
- Solution:
- do first-pass decoding using FLMs
- since FLMs can be viewed as graphical models, use GMTK (most existing tools don't support general graph-based models)
- to speed up inference, use generalized graphical-model-based lattices.
151FLMs as Graphical Models
[Figure: graphical model with factor variables F1, F2, F3 and the Word variable, attached to the graph for the acoustic model]
152FLMs as Graphical Models
- Problem: decoding can be expensive!
- Solution: multi-pass graphical lattice refinement
- In the first pass, generate graphical lattices using a simple model (i.e., more independencies)
- Rescore the lattices using a more complicated model (fewer independencies), but on a much smaller search space
153Example Lattices in a Markov Chain
[Figure: lattice over a Markov chain; each position keeps a small set of surviving state values]
This is the same as a word-based lattice
154Lattices in General Graphs
[Figure: lattices in general graphs; several variables per frame each keep a small set of surviving values]
155Research Plan
- Data:
- Arabic CallHome data
- Tools:
- Tools for evolution-inspired search (mostly already developed during the workshop)
- Training/rescoring FLMs: the modified SRI LM toolkit developed during this workshop
- Multi-pass decoding: the Graphical Models Toolkit (GMTK), developed in the last workshop
156Summary
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding of FLMs using GMTK
and graphical lattices
157Combination and mutation
[Figure: parent and child bit strings illustrating combination (crossover) and mutation of model structures]
158Stochastic Universal Sampling
- Idea: the probability of surviving is proportional to fitness
- Fitness is the quantity described before
- SUS:
- Models get bins with length proportional to their fitness, laid out on the x-axis
- Choose N evenly spaced samples uniformly
- Choose a model k times, where k is the number of samples falling in its bin.
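A compact SUS sketch (the fitness values are placeholders, e.g. inverse dev-set perplexity):

```python
import random

def stochastic_universal_sampling(population, fitness, n):
    """Lay the models on an axis with segment lengths proportional to fitness,
    then take n evenly spaced pointers from one random start."""
    total = sum(fitness)
    step = total / n
    start = random.uniform(0, step)
    pointers = [start + i * step for i in range(n)]
    chosen, cum, idx = [], fitness[0], 0
    for p in pointers:
        while p > cum:                 # advance to the bin containing this pointer
            idx += 1
            cum += fitness[idx]
        chosen.append(population[idx])
    return chosen

models = ['lm_a', 'lm_b', 'lm_c', 'lm_d']
fitness = [4.0, 2.0, 1.0, 1.0]
print(stochastic_universal_sampling(models, fitness, n=4))
```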
159FLMs as Graphical Models
[Figure: decoding graphical model with variables F1, F2, F3, Word, Word Trans., Word Pos., Phone Trans., Phone, and the acoustic observation]
160(No Transcript)
161Minimum Divergence Adaptation of a MSA-Based
Language Model to Egyptian Arabic
- A proposal by
- Sourin Das
- JHU Workshop Final Presentation
- August 21, 2002
162Motivation for LM Adaptation
- Transcripts of spoken Arabic are expensive to obtain; MSA text is relatively inexpensive (AFP newswire, ELRA Arabic data, Al Jazeera, ...)
- MSA text ought to help; after all, it is Arabic
- However, there are considerable dialectal differences
- Inferences drawn from Callhome knowledge or data ought to overrule those from MSA whenever the two disagree, e.g. estimates of N-gram probabilities
- Cannot interpolate models or merge data naïvely
- Need instead to fall back to MSA knowledge only when the Callhome model or data is agnostic about an inference
163Motivation for LM Adaptation
- The minimum K-L divergence framework provides a mechanism to achieve this effect
- First estimate a language model Q from MSA text only
- Then find a model P which matches all major Callhome statistics and is close to Q.
- Anecdotal evidence: MDI methods were successfully used to adapt models based on NABN text to SWBD, giving a 2% WER reduction in LM95 from a 50% baseline WER.
164An Information Geometric View
[Figure: the space of all language models, showing the set of models satisfying the Callhome marginals, the set satisfying the MSA-text marginals, the uniform distribution, the MaxEnt Callhome LM, the MaxEnt MSA-text LM, and the minimum-divergence Callhome LM]
165A Parametric View of MaxEnt Models
- The MSA-text based MaxEnt LM is the ML estimate among exponential models of the form
  Q(x) = Z^{-1}(λ, μ) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) }
- The Callhome based MaxEnt LM is the ML estimate among exponential models of the form
  P(x) = Z^{-1}(μ, ν) exp{ Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- Think of the Callhome LM as being from the family
  P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
  where we set λ = 0 based on the MaxEnt principle.
- One could also be agnostic about the values of the λ_i, since no examples with f_i(x) > 0 are seen in Callhome
- Features (e.g. N-grams) from MSA text which are not seen in Callhome always have f_i(x) = 0 in the Callhome training data
166A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- Subset of all exponential models with λ = λ*: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- Subset of all exponential models with ν = 0: Q(x) = Z^{-1}(λ, μ) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
  (λ*, μ* denote the parameter values estimated from the MSA text)
167Details of Proposed Research (1): A Factored LM for MSA text
- Notation: W = romanized word, A = script word, S = stem, R = root, M = tag
- Q(A_i | A_{i-1}, A_{i-2}) = Q(A_i | A_{i-1}, A_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})
- Examine all 8C2 = 28 trigram templates of two variables from the history together with A_i. Set observations with counts above a threshold as features.
- Examine all 8C1 = 8 bigram templates of one variable from the history together with A_i. Set observations with counts above a threshold as features.
- Build a MaxEnt model (use Jun Wu's toolkit):
  Q(A_i | A_{i-1}, A_{i-2}) = Z^{-1}(λ, μ) exp{ λ_1 f_1(A_i, A_{i-1}, S_{i-2}) + λ_2 f_2(A_i, M_{i-1}, M_{i-2}) + ... + λ_i f_i(A_i, A_{i-1}) + μ_j g_j(A_i, R_{i-1}) + ... + μ_J g_J(A_i) }
- Build the romanized language model:
  Q(W_i | W_{i-1}, W_{i-2}) = U(W_i | A_i) Q(A_i | A_{i-1}, A_{i-2})
168A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
169Details of Proposed Research (2): Additional Factors in the Callhome LM
- P(W_i | W_{i-1}, W_{i-2}) = P(W_i, A_i | W_{i-1}, W_{i-2}, A_{i-1}, A_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})
- Examine all 10C2 = 45 trigram templates of two variables from the history together with W or A. Set observations with counts above a threshold as features.
- Examine all 10C1 = 10 bigram templates of one variable from the history together with W or A. Set observations with counts above a threshold as features.
- Compute a minimum-divergence model of the form
  P(W_i | W_{i-1}, W_{i-2}) = Z^{-1}(λ, μ, ν) exp{ λ_1 f_1(A_i, A_{i-1}, S_{i-2}) + λ_2 f_2(A_i, M_{i-1}, M_{i-2}) + ... + λ_i f_i(A_i, A_{i-1}) + μ_j g_j(A_i, R_{i-1}) + ... + μ_J g_J(A_i) }
    × exp{ ν_1 h_1(W_i, W_{i-1}, S_{i-2}) + ν_2 h_2(A_i, W_{i-1}, S_{i-2}) + ... + ν_k h_k(A_i, A_{i-1}) + ν_K h_K(W_i) }
170Research Plan and Conclusion
- Use the baseline Callhome results from WS02
- Investigate treating the romanized forms of a script form as alternate pronunciations
- Build the MSA-text MaxEnt model
- Feature selection is not critical; use high cutoffs
- Choose features for the Callhome model
- Build and test the minimum divergence model
- Plug in induced structure
- Experiment with subsets of the MSA text
171A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
172(No Transcript)