Title: Novel Speech Recognition Models for Arabic
1Novel Speech Recognition Models for Arabic
- The Arabic Speech Recognition Team
- JHU Workshop Final Presentations
- August 21, 2002
2Arabic ASR Workshop Team
- Senior Participants: Katrin Kirchhoff (UW), Jeff Bilmes (UW), John Henderson (MITRE), Mohamed Noamany (BBN), Pat Schone (DoD), Rich Schwartz (BBN)
- Graduate Students: Sourin Das (JHU), Gang Ji (UW)
- Undergraduate Students: Melissa Egan (Pomona College), Feng He (Swarthmore College)
- Affiliates: Dimitra Vergyri (SRI), Daben Liu (BBN), Nicolae Duta (BBN), Ivan Bulyko (UW), Mari Ostendorf (UW)
3Arabic
Dialects: used for informal conversation
Modern Standard Arabic (MSA): cross-regional standard, used for formal communication
4Arabic ASR Previous Work
- Dictation: IBM ViaVoice for Arabic
- Broadcast News: BBN TIDES OnTAP
- Conversational speech: 1996/1997 NIST CallHome evaluations
- little work compared to other languages
- few standardized ASR resources
5Arabic ASR State of the Art (before WS02)
- BBN TIDES OnTAP: 15.3% WER
- BBN CallHome system: 55.8% WER
- WER on conversational speech is noticeably higher than for other languages (e.g. 30% WER for English CallHome)
- → focus on recognition of conversational Arabic
6Problems for Arabic ASR
- language-external problems
- data sparsity: only one (!) standardized corpus of conversational Arabic available
- language-internal problems
- complex morphology, large number of possible word forms (similar to Russian, German, Turkish, ...)
- differences between written and spoken representation: lack of short vowels and other pronunciation information (similar to Hebrew, Farsi, Urdu, Pashto, ...)
7Corpus: LDC ECA CallHome
- phone conversations between family members/friends
- Egyptian Colloquial Arabic (Cairene dialect)
- high degree of disfluencies (9%), out-of-vocabulary words (9.6%), foreign words (1.6%)
- noisy channels
- training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20 calls (1.5 hrs)
- very small amount of data for language modeling (150K words)!
8MSA - ECA differences
- Phonology
- /th/ → /s/ or /t/: thalatha - talata (three)
- /dh/ → /z/ or /d/: dhahab - dahab (gold)
- /zh/ → /g/: zhadeed - gideed (new)
- /ay/ → /e/: Sayf - Seef (summer)
- /aw/ → /o/: lawn - loon (color)
- Morphology
- inflections: yatakallamu - yitkallim (he speaks)
- Vocabulary
- different terms: TAwila - tarabeeza (table)
- Syntax
- word order differences: SVO - VSO
9Workshop Goals
improvements to Arabic ASR through
developing novel models to better exploit
available data
developing techniques for using
out-of-corpus data
Automatic romanization
Integration of MSA text data
Factored language modeling
10Factored Language Models
- complex morphological structure leads to a large number of possible word forms
- break up words into separate components
- build statistical n-gram models over the individual morphological components rather than over complete word forms
11Automatic Romanization
- Arabic script lacks short vowels and other pronunciation markers
- the lack of vowels results in lexical ambiguity and affects acoustic and language model training
- try to predict the vowelization automatically from data and use the result for recognizer training
- comparable English example:
  th fsh stcks f th nrth tlntc hv bn dpletd
  → the fish stocks of the north atlantic have been depleted
12Out-of-corpus text data
- no corpora of transcribed conversational speech available
- large amounts of written (Modern Standard Arabic) data available (e.g. newspaper text)
- Can MSA text data be used to improve language modeling for conversational speech?
- Try to integrate data from newspapers, transcribed TV broadcasts, etc.
13Recognition Infrastructure
- baseline system: BBN recognition system
- N-best list rescoring
- language model training: SRI LM toolkit, with significant additions implemented during this workshop
- Note: no work on acoustic modeling, speaker adaptation, noise robustness, etc.
- two different recognition approaches: grapheme-based vs. phoneme-based
14Summary of Results (WER)
Grapheme-based recognizer
Phone-based recognizer
15Novel research
- new strategies for language modeling based on morphological features
- new graph-based backoff schemes allowing a wider range of smoothing techniques in language modeling
- new techniques for automatic vowel insertion
- first investigation of the use of automatically vowelized data for ASR
- first attempt at using MSA data for language modeling of conversational Arabic
- morphology induction for Arabic
16Key Insights
- Automatic romanization improves grapheme-based Arabic recognition systems
- trend: morphological information helps in language modeling
- needs to be confirmed on a larger data set
- Using MSA text data does not help
- We need more data!
17Resources
- significant add-on to the SRILM toolkit for general factored language modeling
- techniques/software for automatic romanization of Arabic script
- part-of-speech tagger for MSA; tagged text
18Outline of Presentations
- 1:30 - 1:45 Introduction (Katrin Kirchhoff)
- 1:45 - 1:55 Baseline system (Rich Schwartz)
- 1:55 - 2:20 Automatic romanization (John Henderson, Melissa Egan)
- 2:20 - 2:35 Language modeling overview (Katrin Kirchhoff)
- 2:35 - 2:50 Factored language modeling (Jeff Bilmes)
- 2:50 - 3:05 Coffee Break
- 3:05 - 3:10 Automatic morphology learning (Pat Schone)
- 3:15 - 3:30 Text selection (Feng He)
- 3:30 - 4:00 Graduate student proposals (Gang Ji, Sourin Das)
- 4:00 - 4:30 Discussion and Questions
19Thank you!
- Fred Jelinek, Sanjeev Khudanpur, Laura Graham
- Jacob Laderman and assistants
- Workshop sponsors
- Mark Liberman, Chris Cieri, Tim Buckwalter
- Kareem Darwish, Kathleen Egan
- Bill Belfield and colleagues from BBN
- Apptek
20(No Transcript)
21BBN Baseline System for Arabic
- Richard Schwartz, Mohamed Noamany,
- Daben Liu, Bill Belfield, Nicolae Duta
- JHU Workshop
- August 21, 2002
22BBN BYBLOS System
- Rough'n'Ready / OnTAP / OASIS system
- Version of BYBLOS optimized for Broadcast News
- OASIS system fielded in Bangkok and Amman
- Real-time operation with a 1-minute delay
- 10-20% WER, depending on the data
23BYBLOS Configuration
- 3-passes of recognition
- Forward Fast-match pass uses PTM models and an approximate bigram search
- Backward pass uses SCTM models and an approximate trigram search, and creates the N-best lists
- Rescoring pass uses cross-word SCTM models and a trigram LM
- All runs in real time
- Minimal difference from running slowly
24Use for Arabic Broadcast News
- Transcriptions are in normal Arabic script, omitting short vowels and other diacritics.
- We used each Arabic letter as if it were a phoneme.
- This allowed the addition of large text corpora for language modeling.
25Initial BN Baseline
- 37.5 hours of acoustic training
- Acoustic training data (230K words) used for LM training
- 64K-word vocabulary (4% OOV)
- Initial word error rate (WER): 31.2%
26Speech Recognition Performance
27Call Home Experiments
- Modified the OnTAP system to make it more appropriate for Call Home data.
- Added features from LVCSR research to the OnTAP system for Call Home data.
- Experiments:
- Acoustic training: 80 conversations (15 hours)
- Transcribed with diacritics
- Acoustic training data (150K words) used for the LM
- Real-time
28Using OnTAP system for Call Home
29Additions from LVCSR
30Output Provided for Workshop
- OASIS was run on various sets of training data as needed
- Systems were run either with Arabic-script phonemes or with romanized phonemes including diacritics.
- In addition to workshop participants, others at BBN provided assistance and worked on workshop problems.
- Output provided for the workshop was N-best sentences, with separate scores for HMM, LM, words, phones, and silences.
- Due to the high error rate (56%), the oracle error rate for the 100-best lists was about 46%.
- Unigram lattices were also provided, with an oracle error rate of 15%.
31Phoneme HMM Topology Experiment
- The phoneme HMM topology for the Arabic script system was increased from 5 states to 10 states in order to accommodate a consonant and a possible vowel.
- The gain was small (0.3% WER).
32OOV Problem
- OOV rate is 10%
- 50% are morphological variants of words in the training set
- 10% are proper names
- 40% are other unobserved words
- Tried adding words from BN and from a morphological transducer
- Added too many words for too small a gain
33Use BN to Reduce OOV
- Can we add words from BN to reduce OOV?
- BN text contains 1.8M distinct words.
- Adding the entire 1.8M words reduces OOV from 10% to 3.9%.
- Adding the top 15K words reduces OOV to 8.9%.
- Adding the top 25K words reduces OOV to 8.4%.
34Use Morphological Transducer
- Use the LDC Arabic transducer to expand verbs to all forms
- Produces > 1M words
- Reduces OOV to 7%
35Language Modeling Experiments
- Described in other talks
- Searched for available dialect transcriptions
- Combine BN (300M words) with CH (230K)
- Use BN to define word classes
- Constrained back-off for BN+CH
36(No Transcript)
37Autoromanization of Arabic Script
- Melissa Egan and John Henderson
38Autoromanization (AR) goal
- Expand the Arabic script representation to include short vowels and other pronunciation information.
- Phenomena not typically marked in non-diacritized script include:
- Short vowels: a, i, u
- Repeated consonants (shadda)
- Extra phonemes for Egyptian Arabic: f/v, j/g
- A grammatical marker that adds an n to the pronunciation (tanween)
- Example:
- Non-diacritized form: ktb (write)
- Expansions: kitab (book), aktib (I write), kataba (he wrote), kattaba (he caused to write)
39AR motivation
- Romanized text can be used to produce better output from an ASR system.
- Acoustic models will be able to disambiguate better based on the extra information in the text.
- Conditioning events in the LM will contain more information.
- Romanized ASR output can be converted to script for an alternative WER measurement.
- Eval96 results (BBN recognizer, 80-conversation training):
- script recognizer: 61.1% WER-G (grapheme)
- romanized recognizer: 55.8% WER-R (roman)
40AR data
- CallHome Arabic from LDC
- Conversational speech transcripts (ECA) in both script and a roman specification that includes short vowels, repeats, etc.
- Data sets (set: conversations, words):
  asrtrain: 80 conv, 135K
  dev: 20 conv, 35K
  eval96 (asrtest): 20 conv, 15K
  eval97: 20 conv, 18K
  h5_new: 20 conv, 18K
- Romanizer training: eval97 and h5_new; romanizer testing: asrtrain, dev, and eval96
41Data format
- Script without and with diacritics
- CallHome in script and roman forms; our task: script to roman
  Script: AlHmd_llh kwIsB w AntI AzIk
  Roman: ilHamdulillA kuwayyisaB wi inti izzayyik
42Autoromanization (AR) WER baseline
- Train on 32K words in eval97 + h5_new
- Test on 137K words in ASR_train + h5_new

  Status (w.r.t. training)   portion of test   error in test   share of total error
  unambiguous                     68.0%              1.8%              6.2%
  ambiguous                       15.5%             13.9%             10.8%
  unknown                         16.5%             99.8%             83.0%
  total                          100.0%             19.9%            100.0%

The biggest potential error reduction would come from predicting romanized forms for unknown words.
43AR knitting example
unknown: tbqwA
1. Find a close known word
   known: ybqwA
2. Record the ops required to make the roman form from the known word
   known:     y _ b q w A
   kn. roman: y i b q u _      ops: c i c c r d
3. Construct the new roman form using the same ops
   unknown:   t _ b q w A      ops: c i c c r d
   new roman: t i b q u
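A toy Python sketch of the knitting step (hypothetical names; the dictionary lookup via difflib and the positional transfer of edit ops are simplifying assumptions, not the workshop implementation):

```python
import difflib

def knit(unknown, known_short, known_roman):
    """Transfer the short->roman edit operations of a known pair onto an
    unknown short form (assumes the unknown aligns positionally with the
    known short form, as in the tbqwA/ybqwA example)."""
    ops = difflib.SequenceMatcher(None, known_short, known_roman).get_opcodes()
    out = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == 'equal':                    # copy the corresponding unknown characters
            out.append(unknown[i1:i2])
        elif tag in ('insert', 'replace'):    # take material from the roman side
            out.append(known_roman[j1:j2])
        # 'delete': drop the characters
    return ''.join(out)

def closest_known(unknown, dictionary):
    """Pick the known short form with the best surface similarity."""
    hits = difflib.get_close_matches(unknown, list(dictionary), n=1, cutoff=0.0)
    return hits[0] if hits else None

dictionary = {'ybqwA': 'yibqu'}              # known short form -> roman form
unknown = 'tbqwA'
known = closest_known(unknown, dictionary)
print(knit(unknown, known, dictionary[known]))   # -> tibqu
```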
44Experiment 1 (best match)
- Observed patterns in the known short/long pairs
- Some characters in the short forms are consistently found with particular, non-identical characters in the long forms.
- Example rule: A → a
45Experiment 2 (rules)
Environments in which "w" occurs in the training dictionary long forms (Env: Freq):
  C _ V: 149    V _ : 8    _ V: 81    C _ : 5    V _ V: 121    V _ C: 118
Environments in which "u" occurs in the training dictionary long forms (Env: Freq):
  C _ C: 1179   C _ : 301   _ C: 29
- Some output forms depend on output context.
- Rule:
- "u" occurs only between two non-vowels.
- "w" occurs elsewhere.
- Accurate for 99.7% of the instances of "u" and "w" in the training dictionary long forms. A similar rule may be formulated for "i" and "y".
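The u/w rule fits in a few lines of Python; the vowel inventory and the treatment of word boundaries as non-vowels are assumptions for illustration:

```python
VOWELS = set('aiuAIU')   # assumed romanized vowel inventory

def u_or_w(prev_char, next_char):
    """'u' only between two non-vowels (word boundary counts as a non-vowel);
    'w' in every other environment."""
    between_non_vowels = prev_char not in VOWELS and next_char not in VOWELS
    return 'u' if between_non_vowels else 'w'

print(u_or_w('k', 'y'))   # C _ C  -> u
print(u_or_w('a', 'i'))   # V _ V  -> w
```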
46Experiment 3 (local model)
- Move to a more data-driven model
- We found some rules manually; now look for all of them, systematically.
- Use the best-scoring candidate for replacement:
- Environment likelihood score
- Character alignment score
47Experiment 4 (n-best)
- Instead of generating the romanized form using only the single best short form in the dictionary, generate romanized forms using the top n best short forms.
- Example (n = 5)
48Character error rate (CER)
- Measuring insertions, deletions, and substitutions in character strings should more closely track phoneme error rate.
- More sensitive than WER
- Stronger statistics from the same data
- Test set results:
- Baseline: 49.89% character error rate (CER)
- Best model: 24.58% CER
- Oracle 2-best list: 17.60% CER, which suggests more room for gain.
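For reference, CER as used here is character-level edit distance normalized by the reference length; a small self-contained sketch:

```python
def char_error_rate(hyp, ref):
    """100 * (insertions + deletions + substitutions) / len(ref),
    computed with standard Levenshtein dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[m][n] / max(m, 1)

print(char_error_rate('tibqu', 'yibqu'))   # one substitution in five chars -> 20.0
```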
49 Summary of performance (dev set)
                                      Accuracy   CER
  Baseline                               8.4     41.4
  Knitting                              16.9     29.5
  Knitting + best-match rules           18.4     28.6
  Knitting + local model                19.4     27.0
  Knitting + local model + n-best       30.0     23.1   (n = 25)
50 Varying the number of dictionary matches
51ASR scenarios
- 1) Have a script recognizer, but want to produce romanized forms: postprocess the ASR output
- 2) Have a small amount of romanized data and a large amount of script data available for recognizer training: preprocess the ASR training set
52ASR experiments
[Flow diagram, labels only: Script Train, AR (autoromanization), Roman ASR, Roman Result (WER-R), R2S (roman-to-script), Script Result (WER-G), in the preprocessing and postprocessing configurations.]
53Experiment adding script data
- Script LM training data could be acquired from found text.
- Script transcription is cheaper than roman transcription.
- Simulate a preponderance of script by training AR on a separate set.
- ASR is then trained on the output of AR.
[Diagram: future training set, with AR trained on 40 conversations and ASR trained on 100 conversations.]
54Eval 96 experiments, 80 conv
  Config            WER-R    WER-G
  script baseline    N/A     59.8
  postprocessing    61.5     59.8
  preprocessing     59.9     59.2 (-0.6)
  Roman baseline    55.8     55.6 (-4.2)
- Bounding experiment
- No overlap between ASR train and AR train.
- Poor pronunciations for made-up words.
55Eval 96 experiments, 100 conv
  Config            WER-R    WER-G
  script baseline    N/A     59.0
  postprocessing    60.7     59.0
  preprocessing     58.5     57.5 (-1.5)
  Roman baseline    55.1     54.9 (-4.1)
- More realistic experiment
- 20-conversation overlap between ASR train and AR train.
- Better pronunciations for made-up words.
56Remaining challenges
- Correct dangling tails in short matches
- Merge unaligned characters
57Bigram translation model
[Character alignment example: input (s) t b q w A; output (r) ? t i b q u ?; known roman y i b q u]
58Trigram translation model
[Character alignment example: input (s) t b q w A; output (r) t i b q u; known roman y i b q u]
59Future work
- Context provides information for disambiguating both known and unknown words
- Bigrams for unknown words will also be unknown; use part-of-speech tags or morphology.
- Acoustics:
- Use acoustics to help disambiguate vowels?
- Provide n-best output as alternative pronunciations for ASR training.
60(No Transcript)
61Factored Language Modeling
Katrin Kirchhoff, Jeff Bilmes, Dimitra
Vergyri, Pat Schone, Gang Ji, Sourin Das
62Arabic morphology
- structure of Arabic derived words
[Example: fa-sakan-tu, "so I lived": particle fa- ("so"), root s-k-n ("LIVE"), pattern (past), affix -tu (1st-sg past)]
63Arabic morphology
- 5000 roots
- several hundred patterns
- dozens of affixes
- large number of possible word forms
- problems training robust language model
- large number of OOV words
64Vocabulary Growth - full word forms
65Vocabulary Growth - stemmed words
66Particle model
- Break words into sequences of stems and affixes
- Approximate the probability of the word sequence by the probability of the particle sequence
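Written out (assuming a trigram over particles; the model order is not stated on this slide), the approximation is:

```latex
P(w_1,\dots,w_N) \;\approx\; P(p_1,\dots,p_M)
                 \;=\; \prod_{t=1}^{M} P(p_t \mid p_{t-1}, p_{t-2}),
```

where p_1, ..., p_M is the stem/affix (particle) sequence obtained by splitting w_1, ..., w_N.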
67Factored Language Model
- Problem: how can we estimate P(W_t | W_{t-1}, W_{t-2}, ...)?
- Solution: decompose W into its morphological components: affixes, stems, roots, patterns
- words can be viewed as bundles of features
[Figure: parallel factor streams over three time slices: patterns P_t, roots R_t, affixes A_t, stems S_t, and words W_t]
68Statistical models for factored representations
- Class-based LM
- Single-stream LM
69Full Factored Language Model
- assume each word is equivalent to the bundle of its factors, w = (r, f, a), so that
  P(w_t | w_{t-1}, w_{t-2}) = P(r_t, f_t, a_t | r_{t-1}, f_{t-1}, a_{t-1}, r_{t-2}, f_{t-2}, a_{t-2})
- where w = word, r = root, f = pattern, a = affixes
- Goal: find appropriate conditional independence statements to simplify this model.
70Experimental Infrastructure
- All language models were tested using N-best rescoring
- two baseline word-based LMs:
- B1: BBN LM, WER 55.1%
- B2: WS02 baseline LM, WER 54.8%
- combination of baselines: 54.5%
- new language models were used in combination with one or both baseline LMs
- log-linear score combination scheme
71Log-linear combination
- For m information sources, each producing a maximum-likelihood estimate for W, the combined estimate is
  P(W | I) ∝ Π_i P(W | I_i)^{k_i}
- I: the total information available
- I_i: the ith information source
- k_i: the weight for the ith information source
72Discriminative combination
- We optimize the combination weights jointly with the language model weight and the insertion penalty to directly minimize the WER of the maximum-likelihood hypothesis.
- The normalization factor can be ignored since it is the same for all alternative hypotheses.
- We used the simplex optimization method on the 100-best lists provided by BBN (the optimization algorithm is available in the SRILM toolkit).
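A minimal sketch of the rescoring step (field names and scores are made up; the real system combines HMM, LM, word, phone, and silence scores and tunes the weights and insertion penalty with simplex search on a dev set):

```python
def rescore_nbest(nbest, weights, word_penalty):
    """Return the hypothesis maximizing a weighted sum of per-model log-scores
    plus a word-insertion penalty (hypothetical field names, toy values)."""
    def combined(hyp):
        model_score = sum(w * hyp['scores'][name] for name, w in weights.items())
        return model_score + word_penalty * len(hyp['words'])
    return max(nbest, key=combined)

nbest = [
    {'words': ['A', 'B', 'C'], 'scores': {'am': -120.0, 'lm1': -15.2, 'flm': -14.0}},
    {'words': ['A', 'B'],      'scores': {'am': -123.5, 'lm1': -13.0, 'flm': -12.1}},
]
weights = {'am': 1.0, 'lm1': 8.0, 'flm': 6.0}    # tuned on a dev set in practice
print(rescore_nbest(nbest, weights, word_penalty=-2.0)['words'])
```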
73Word decomposition
- Linguistic decomposition (expert knowledge)
- Automatic morphological decomposition: acquire morphological units from the data without using human knowledge
- Assign words to classes based not on characteristics of the word form but on distributional properties
74(Mostly) Linguistic Decomposition
- Stem/morph-class information from the LDC CH lexicon
- Roots determined by K. Darwish's morphological analyzer for MSA
- Pattern determined by subtracting the root from the stem
- Example: atamna → atam (stem) + verb-past-1st-plural (morph. tag); atam → tm (root); atam → CaCaC (pattern)
75Automatic Morphology
- Classes defined by morphological components derived from the data
- no expert knowledge
- based on statistics of word forms
- more details in Pat's presentation
76Data-driven Classes
- Word clustering based on distributional statistics
- Exchange algorithm (Martin et al., 1998); a toy sketch follows after this list:
- initially assign words to individual clusters
- temporarily move each word to all other clusters and compute the change in perplexity (class-based trigram)
- keep the assignment that minimizes perplexity
- stop when the class assignment no longer changes
- Bottom-up clustering (SRI toolkit):
- initially assign words to individual clusters
- successively merge the pairs of clusters with the highest average mutual information
- stop at a specified number of classes
77Results
- Best word error rates obtained with:
- particle model: 54.0% (B1 + particle LM)
- class-based models: 53.9% (B1 + Morph + Stem)
- automatic morphology: 54.3% (B1 + B2 + Rule)
- data-driven classes: 54.1% (B1 + SRILM, 200 classes)
- combination of best models: 53.8%
78Conclusions
- The overall improvement in WER gained from language modeling (1.3%) is significant
- individual differences between the LMs are not significant
- but adding morphological class models always helps the language model combination
- morphological models get the highest weights in the combination (in addition to the word-based LMs)
- the trend needs to be verified on a larger data set
- → application to a script-based system?
79(No Transcript)
80Factored Language Models and Generalized Graph
Backoff
- Jeff Bilmes, Katrin Kirchhoff
- University of Washington, Seattle
- JHU-WS02 ASR Team
81Outline
- Language Models, Backoff, and Graphical Models
- Factored Language Models (FLMs) as Graphical
Models - Generalized Graph Backoff algorithm
- New features to SRI Language Model Toolkit (SRILM)
82Standard Language Modeling
- Example: a standard tri-gram, p(w_t | w_{t-1}, w_{t-2})
83Typical Backoff in LM
- In typical LM, there is one natural (temporal)
path to back off along. - Well motivated since information often decreases
with word distance.
84Factored LM Proposed Approach
- Decompose words into smaller morphological or class-based units (e.g., morphological classes, stems, roots, patterns, or other automatically derived units).
- Produce probabilistic models over these units to attempt to improve WER.
85Example with Words, Stems, and Morphological
classes
86Example with Words, Stems, and Morphological
classes
87In general
88General Factored LM
- A word is equivalent to a collection of factors, e.g., if K = 3: w_t ≡ {f_t^1, f_t^2, f_t^3}
- Goal: find appropriate conditional independence statements to simplify this sort of model while keeping perplexity and WER low. This is the structure-learning problem in graphical models.
89The General Case
90The General Case
91The General Case
92A Backoff Graph (BG)
93Example 4-gram Word Generalized Backoff
94How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable (e.g.,
temporal constraints)) - Generalized all-child backoff
- Constrained multi-child backoff
- Child combination rules
95Choosing a fixed back-off path
96How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable (e.g.,
temporal constraints)) - Generalized all-child backoff
- Constrained multi-child backoff
- Child combination rules
97Generalized Backoff
- In typical backoff, we drop the 2nd parent and use the conditional probability given the remaining parent.
- More generally, g() can be any positive function, but we then need a new algorithm for computing the backoff weight (BOW).
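A sketch of the generalized backoff form (notation assumed here: two parents f_1, f_2, count threshold τ, discount d; the exact discount comes from whichever smoothing method is chosen at the node):

```latex
p_{\mathrm{GBO}}(f \mid f_1, f_2) =
\begin{cases}
  d_{N(f,f_1,f_2)}\, p_{\mathrm{ML}}(f \mid f_1, f_2), & N(f,f_1,f_2) > \tau \\[4pt]
  \alpha(f_1,f_2)\, g(f, f_1, f_2),                    & \text{otherwise}
\end{cases}
\qquad
\alpha(f_1,f_2) =
  \frac{1 - \sum_{f:\,N(f,f_1,f_2) > \tau} d_{N}\, p_{\mathrm{ML}}(f \mid f_1,f_2)}
       {\sum_{f:\,N(f,f_1,f_2) \le \tau} g(f, f_1, f_2)}
```

Taking g(f, f_1, f_2) = p_BO(f | f_1) recovers ordinary single-path backoff; other non-negative choices (max, mean, or products of the parent models) give the generalized behavior listed on the following slides.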
98Computing BOWs
- Many possible choices for g() functions (next few
slides) - Caveat certain g() functions can make the LM
much more computationally costly than standard
LMs.
99g() functions
100More g() functions
101More g() functions
102How to choose backoff path?
- Four basic strategies
- Fixed path (based on what seems reasonable
(time)) - Generalized all-child backoff
- Constrained multi-child backoff
- Same as before, but choose a subset of possible
paths a-priori - Child combination rules
- Combine child node via combination function
(mean, weighted avg., etc.)
103Significant Additions to Stolcke's SRILM, the SRI Language Modeling Toolkit
- New features added to SRILM, including:
- Can specify an arbitrary number of graphical-model based factorized models to train, compute perplexity for, and rescore N-best lists with.
- Can specify any (possibly constrained) set of backoff paths from the top to the bottom level in the BG.
- Different smoothing (e.g., Good-Turing, Kneser-Ney, etc.) or interpolation methods may be used at each backoff-graph node.
- Supports the generalized backoff algorithms with 18 different possible g() functions at each BG node.
104Example with Words, Stems, and Morphological
classes
105How to specify a model
word given stem, morph:
  W 2 S(0) M(0)
    S0,M0  M0  wbdiscount gtmin 1 interpolate
    S0     S0  wbdiscount gtmin 1
    0      0   wbdiscount gtmin 1

morph given word, word:
  M 2 W(-1) W(-2)
    W1,W2  W2  kndiscount gtmin 1 interpolate
    W1     W1  kndiscount gtmin 1 interpolate
    0      0   kndiscount gtmin 1

stem given morph, word, word:
  S 3 M(0) W(-1) W(-2)
    M0,W1,W2  W2  kndiscount gtmin 1 interpolate
    M0,W1     W1  kndiscount gtmin 1 interpolate
    M0        M0  kndiscount gtmin 1
    0         0   kndiscount gtmin 1
106Summary
- Language Models, Backoff, and Graphical Models
- Factored Language Models (FLMs) as Graphical
Models - Generalized Graph Backoff algorithm
- New features to SRI Language Model Toolkit (SRILM)
107Coffee Break
108Knowledge-Free Induction of Arabic Morphology
- Patrick Schone
- 21 August 2002
109Why induce Arabic morphology?
- (1) It has not been done before
- (2) If it can be done, and if it has value in LM, it can generalize across languages without needing an expert
110Original Algorithm (Schone & Jurafsky, 2000/2001)
- Look for word inflections on words with Freq > 9
- Use a character tree to find word pairs with similar beginnings/endings, e.g. car/cars, car/cares, car/caring
- Use Latent Semantic Analysis to induce semantic vectors for each word, then compare word-pair semantics
- Use frequencies of word stems/rules to improve the initial semantic estimates
111Algorithmic Expansions
IR-Based Minimum Edit Distance
- A trie-based approach could be a problem for Arabic templates, e.g. aGlaB: aGlaB, ilAGil, aGlu, AGil
- Result: 3576 words in the CallHome lexicon with 50 relationships!
- Use Minimum Edit Distance to find the relationships (can be weighted)
- Use an information-retrieval based approach to facilitate the search for MED candidates
112Algorithmic Expansions
Agglomerative Clustering Using Rules and Stems

  Word pairs w/ stem:          Word pairs w/ rule:
    Gayyar     507               NULL → il   1178
    xallaS     503               NULL → u     635
    makallim   468               NULL → i     455
    qaddim     434               i → u        377
    itgawwiz   332               NULL → fa    375
    tkallim    285               NULL → bi    366

- Do bottom-up clustering, where the weight between two words is (Ct(Rule) Ct(PairedStem))^(1/2)
113Algorithmic Expansions
Updated Transitivity
If X↔Y and Y↔Z, and X↔Y > 2 and X↔Y < Z, then X↔Z
114Scoring Induced Morphology
- Score in terms of conflation-set agreement
- Conflation set of W: all words morphologically related to W
- Example: aGlaB: {aGlaB, ilAGil, aGlu, AGil}
- If X_W is the induced set for W and Y_W the truth set for W, compute the total correct (C), inserted (I), and deleted (D) words across all W, and
  ErrorRate = 100 (I + D) / (C + D)
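A small sketch of this scoring, counting per-word set overlaps exactly as the slide describes (the published metric may weight the counts differently):

```python
def conflation_error_rate(induced, truth):
    """For each word, count correct (C), inserted (I), and deleted (D) members
    of its induced conflation set against the truth set,
    then ErrorRate = 100 * (I + D) / (C + D)."""
    C = I = D = 0
    for w, y in truth.items():
        x = induced.get(w, {w})     # induced conflation set of w
        C += len(x & y)
        I += len(x - y)
        D += len(y - x)
    return 100.0 * (I + D) / (C + D)

truth = {'katab': {'katab', 'kataba', 'kattaba'},
         'kitab': {'kitab', 'kutub'}}
induced = {'katab': {'katab', 'kataba', 'kitab'},   # one insertion, one deletion
           'kitab': {'kitab', 'kutub'}}
print(round(conflation_error_rate(induced, truth), 1))   # -> 40.0
```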
115Scoring Induced Morphology
Induction error rates on words from the original 80-conversation set
116Using Morphology for LM Rescoring
- For each word W, use the induced morphology to generate:
- Stem: the smallest word z from X_W with z < W (by length)
- Root: the character intersection across X_W
- Rule: the map of word-to-stem
- Pattern: the map of stem-to-root
- Class: the map of word-to-root
117Other Potential Benefits of Morphology: Morphology-driven Word Generation
- Generate probability-weighted words using morphologically-derived rules (like NULL → il)
- Generate only if the initial and final n characters of the stem have been seen before.
118(No Transcript)
119Text Selection for Conversational Arabic
- Feng He
- ASR (Arabic Speech Recognition) Team
- JHU Workshop
120Motivation
- Group goal: conversational Arabic speech recognition.
- One of the problems: not enough training data to build a language model; most available text is in MSA (Modern Standard Arabic) or a mixture of MSA and conversational Arabic.
- One solution: select from the mixed text the segments that are conversational, and use them in training.
121Task Text Selection
- Use POS-based language models, because POS has been shown to better indicate differences in style, such as formal vs. conversational.
- Method:
- Train a POS (part-of-speech) tagger on the available data
- Train POS-based language models on formal vs. conversational data
- Tag the new data
- Select the segments of the new data that are closest to the conversational model, using scores from the POS-based language models.
122Data
- For building the tagger and language models:
- Arabic Treebank: 130K words of hand-tagged newspaper text in MSA.
- Arabic CallHome: 150K words of transcribed phone conversations. Tags are only in the lexicon.
- For text selection:
- Al Jazeera: 9M words of transcribed TV broadcasts. We want to select segments that are closer to conversational Arabic, such as talk shows and interviews.
123Implementation
124About unknown words
- These are words that are not seen in the training data but appear in the test data.
- Assume unknown words behave like singletons (words that appear only once in the training data).
- This is done by duplicating the training data with singletons replaced by a special token, then training the tagger on both the original and the duplicate (sketched below).
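A sketch of the duplication trick (the token name <UNK> is an assumption):

```python
from collections import Counter

def add_unknown_copy(sentences, unk='<UNK>'):
    """Duplicate the training data with singleton word types replaced by a
    special token, so the tagger also sees contexts for 'unknown' words."""
    counts = Counter(w for sent in sentences for w in sent)
    singletons = {w for w, c in counts.items() if c == 1}
    duplicate = [[unk if w in singletons else w for w in sent] for sent in sentences]
    return sentences + duplicate

train = [['il', 'walad', 'raaH'], ['il', 'bint', 'gat']]
print(add_unknown_copy(train))
```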
125- Tools
- GMTK (Graphical Models Toolkit)
- Algorithms
- Training: EM; set parameters so that the joint probability of hidden states and observations is maximized.
- Decoding (tagging): Viterbi; find the hidden state sequence that maximizes the joint probability of hidden states and observations.
126Experiments
- Exp 1: first 100K words of the English Penn Treebank; trigram model; sanity check.
- Exp 2: Arabic Treebank; trigram model.
- Exp 3: Arabic Treebank and CallHome; trigram model.
- The above three experiments all used 10-fold cross-validation and are unsupervised.
- Exp 4: Arabic Treebank; supervised trigram model.
- Exp 5: Arabic Treebank and CallHome; partially supervised training using the Treebank's tagged data; test on the portion of the Treebank not used in training; trigram model.
127Results
128Building Language Models and Text Selection
- Use existing scripts to build formal and conversational language models from the tagged Arabic Treebank and CallHome data.
- Text selection: use the length-normalized log-likelihood ratio
  score(S_i) = [log P_C(S_i) - log P_F(S_i)] / N_i
  where S_i is the ith sentence in the data set, C the conversational language model, F the formal language model, and N_i the length of S_i.
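A sketch of the selection step; the two POS-based LMs are stood in for by simple callables returning log-probabilities, and the threshold is a free parameter:

```python
import math

def make_unigram_lm(probs, floor=1e-6):
    """Toy stand-in for a POS-based LM: returns log P(tag sequence)."""
    return lambda tags: sum(math.log(probs.get(t, floor)) for t in tags)

def select_conversational(sentences, conv_lm, formal_lm, threshold=0.0):
    """Keep sentences whose length-normalized log-likelihood ratio between the
    conversational and formal models exceeds the threshold."""
    selected = []
    for tags in sentences:                 # each sentence as a POS-tag sequence
        score = (conv_lm(tags) - formal_lm(tags)) / max(len(tags), 1)
        if score > threshold:
            selected.append((score, tags))
    return selected

conv_lm = make_unigram_lm({'PRON': 0.3, 'VB': 0.3, 'NN': 0.2})
formal_lm = make_unigram_lm({'NN': 0.4, 'JJ': 0.3, 'VB': 0.2})
print(select_conversational([['PRON', 'VB', 'NN'], ['NN', 'JJ', 'NN']],
                            conv_lm, formal_lm))
```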
129Score Distribution
[Figure: score distributions; percentage and log count plotted against the log-likelihood ratio]
130Assessment
- A subset of Al Jazeera equal in size to Arabic CallHome (150K words) is selected and added to the training data for the speech recognition language model.
- No reduction in perplexity.
- Possible reasons: Al Jazeera has no conversational Arabic, or only conversational Arabic of a very different style.
131Text Selection Work Done at BBN
- Rich Schwartz
- Mohamed Noamany
- Daben Liu
- Nicolae Duta
132Search for Dialect Text
- We have an insufficient amount of CH text for estimating an LM.
- Can we find additional data?
- Many words are unique to dialect text.
- Searched Internet for 20 common dialect words.
- Most of the data found were jokes or chat rooms
very little data.
133Search BN Text for Dialect Data
- Search BN text for the same 20 dialect words.
- Found less than CH data
- Each occurrence was typically an isolated lapse
by the speaker into dialect, followed quickly by
a recovery to MSA for the rest of the sentence.
134Combine MSA text with CallHome
- Estimate separate models for MSA text (300M words) and CH text (150K words).
- Use the SRI toolkit to determine a single optimal weight for the combination, using deleted interpolation (EM).
- The optimal weight for the MSA text was 0.03.
- Insignificant reduction in perplexity and WER.
135Classes from BN
- Hypothesis:
- Even if the MSA n-grams are different, perhaps the classes are the same.
- Experiment:
- Determine classes (using the SRI toolkit) from BN + CH data.
- Use CH data to estimate n-grams of classes and/or p(w | class)
- Combine the resulting model with the CH word trigram
- Result:
- No gain
136Hypothesis Test Constrained Back-Off
- Hypothesis:
- In combining BN and CH, if a probability is different, it could be for two reasons:
- CH has insufficient training
- BN and CH truly have different probabilities (likely)
- Algorithm (a sketch follows below):
- Interpolate BN and CH, but limit the probability change to be no more than would be likely due to insufficient training.
- An n-gram count cannot change by more than its square root.
- Result:
- No gain
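One way to read the constraint in code (an interpretation of the slide, not BBN's implementation): move each CH probability toward the BN estimate by at most sqrt(count)/total; the result would still need renormalization per history:

```python
import math

def constrained_combine(ch_counts, ch_totals, bn_probs):
    """For each CH n-gram (history, word) with count c out of total(history),
    move the CH estimate toward the BN estimate, but by no more than
    sqrt(c)/total, mimicking the uncertainty due to limited CH training."""
    combined = {}
    for (hist, w), c in ch_counts.items():
        total = ch_totals[hist]
        p_ch = c / total
        p_bn = bn_probs.get((hist, w), p_ch)
        max_shift = math.sqrt(c) / total           # allowed change in probability
        shift = max(-max_shift, min(max_shift, p_bn - p_ch))
        combined[(hist, w)] = p_ch + shift         # renormalize per history afterwards
    return combined

ch_counts = {(('fi',), 'il'): 9, (('fi',), 'beet'): 1}
ch_totals = {('fi',): 10}
bn_probs = {(('fi',), 'il'): 0.2, (('fi',), 'beet'): 0.4}
print(constrained_combine(ch_counts, ch_totals, bn_probs))
```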
137(No Transcript)
138Learning Using Factored Language Models
- Gang Ji
- Speech, Signal, and Language Interpretation
- University of Washington
- August 21, 2002
139Outline
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding in ASR with FLMs
using graphical models
140Factored Language Models
- Along with words, consider factors as components of the language model
- Factors can be words, stems, morphs, patterns, or roots, which may contain complementary information about the language
- FLMs also provide new possibilities for designing LMs (e.g., multiple back-off paths)
- Problem: we don't know the best model, and the space is huge!
141Factored Language Models
- How to learn FLMs
- Solution 1 do it by hand using expert linguistic
knowledge - Solution 2 data driven let the data help to
decide the model - Solution 3 combine both linguistic and data
driven techniques
142Factored Language Models
- A Proposed Solution
- Learn FLMs using evolution-inspired search
algorithm - Idea Survival of the fittest
- A collection (generation) of models
- In each generation, only good ones survive
- The survivors produce the next generation
143Evolution-Inspired Search
- Selection choose the good LMs
- Combination retain useful characteristics
- Mutation some small change in next generation
144Evolution-Inspired Search
- Advantages
- Can quickly find a good model
- Retain goodness of the previous generation while
covering significant portion of the search space - Can run in parallel
- How to judge the quality of each model?
- Perplexity on a development set
- Rescore WER on development set
- Complexity-penalized perplexity
145Evolution-Inspired Search
- Three steps form the new models:
- Selection (based on perplexity, etc.), e.g. stochastic universal sampling: models are selected in proportion to their fitness
- Combination
- Mutation
146Moving from One Generation to Next
- Combination Strategies
- Inherit structures horizontally
- Inherit structures vertically
- Random selection
- Mutation
- Add/remove edges randomly
- Change back-off/smoothing strategies
147Combination according to Frames
148Combination according to Factors
149Outline
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding with FLMs
150Problem
- It may be difficult to improve WER just by rescoring n-best lists
- More gains can be expected from using better models in first-pass decoding
- Solution:
- do first-pass decoding using FLMs
- since FLMs can be viewed as graphical models, use GMTK (most existing tools don't support general graph-based models)
- to speed up inference, use generalized graphical-model-based lattices.
151FLMs as Graphical Models
[Figure: graphical model with factor variables F1, F2, F3 and the Word variable, attached to the graph for the acoustic model]
152FLMs as Graphical Models
- Problem: decoding can be expensive!
- Solution: multi-pass graphical lattice refinement
- In the first pass, generate graphical lattices using a simple model (i.e., more independencies)
- Rescore the lattices using a more complicated model (fewer independencies), but on a much smaller search space
153Example Lattices in a Markov Chain
[Figure: lattice over a Markov chain; each position keeps a small set of surviving state values]
This is the same as a word-based lattice
154Lattices in General Graphs
[Figure: lattices in general graphs; several variables per frame each keep a small set of surviving values]
155Research Plan
- Data:
- Arabic CallHome data
- Tools:
- Tools for evolution-inspired search (mostly already developed during the workshop)
- Training/rescoring FLMs: the modified SRI LM toolkit developed during this workshop
- Multi-pass decoding: the Graphical Models Toolkit (GMTK), developed in the last workshop
156Summary
- Factored Language Models (FLMs) overview
- Part I automatically finding FLM structure
- Part II first-pass decoding of FLMs using GMTK
and graphical lattices
157Combination and mutation
[Figure: parent and child bit strings illustrating combination (crossover) and mutation of model structures]
158Stochastic Universal Sampling
- Idea: the probability of surviving is proportional to fitness
- Fitness is the quantity described before
- SUS:
- Models get bins with length proportional to their fitness, laid out on the x-axis
- Choose N evenly spaced samples uniformly
- Choose a model k times, where k is the number of samples falling in its bin.
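A compact SUS sketch (the fitness values are placeholders, e.g. inverse dev-set perplexity):

```python
import random

def stochastic_universal_sampling(population, fitness, n):
    """Lay the models on an axis with segment lengths proportional to fitness,
    then take n evenly spaced pointers from one random start."""
    total = sum(fitness)
    step = total / n
    start = random.uniform(0, step)
    pointers = [start + i * step for i in range(n)]
    chosen, cum, idx = [], fitness[0], 0
    for p in pointers:
        while p > cum:                 # advance to the bin containing this pointer
            idx += 1
            cum += fitness[idx]
        chosen.append(population[idx])
    return chosen

models = ['lm_a', 'lm_b', 'lm_c', 'lm_d']
fitness = [4.0, 2.0, 1.0, 1.0]
print(stochastic_universal_sampling(models, fitness, n=4))
```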
159FLMs as Graphical Models
[Figure: decoding graphical model with variables F1, F2, F3, Word, Word Trans., Word Pos., Phone Trans., Phone, and the acoustic observation]
160(No Transcript)
161Minimum Divergence Adaptation of a MSA-Based
Language Model to Egyptian Arabic
- A proposal by
- Sourin Das
- JHU Workshop Final Presentation
- August 21, 2002
162Motivation for LM Adaptation
- Transcripts of spoken Arabic are expensive to obtain; MSA text is relatively inexpensive (AFP newswire, ELRA Arabic data, Al Jazeera, ...)
- MSA text ought to help; after all, it is Arabic
- However, there are considerable dialectal differences
- Inferences drawn from Callhome knowledge or data ought to overrule those from MSA whenever the two disagree, e.g. estimates of N-gram probabilities
- Cannot interpolate models or merge data naïvely
- Need instead to fall back to MSA knowledge only when the Callhome model or data is agnostic about an inference
163Motivation for LM Adaptation
- The minimum K-L divergence framework provides a mechanism to achieve this effect
- First estimate a language model Q from MSA text only
- Then find a model P which matches all major Callhome statistics and is close to Q.
- Anecdotal evidence: MDI methods were successfully used to adapt models based on NABN text to SWBD, giving a 2% WER reduction in LM95 from a 50% baseline WER.
164An Information Geometric View
[Figure: the space of all language models, showing the set of models satisfying the Callhome marginals, the set satisfying the MSA-text marginals, the uniform distribution, the MaxEnt Callhome LM, the MaxEnt MSA-text LM, and the minimum-divergence Callhome LM]
165A Parametric View of MaxEnt Models
- The MSA-text based MaxEnt LM is the ML estimate among exponential models of the form
  Q(x) = Z^{-1}(λ, μ) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) }
- The Callhome based MaxEnt LM is the ML estimate among exponential models of the form
  P(x) = Z^{-1}(μ, ν) exp{ Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- Think of the Callhome LM as being from the family
  P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
  where we set λ = 0 based on the MaxEnt principle.
- One could also be agnostic about the values of the λ_i, since no examples with f_i(x) > 0 are seen in Callhome
- Features (e.g. N-grams) from MSA text which are not seen in Callhome always have f_i(x) = 0 in the Callhome training data
166A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- Subset of all exponential models with λ = λ*: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- Subset of all exponential models with ν = 0: Q(x) = Z^{-1}(λ, μ) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
  (λ*, μ* denote the parameter values estimated from the MSA text)
167Details of Proposed Research (1): A Factored LM for MSA text
- Notation: W = romanized word, A = script word, S = stem, R = root, M = tag
- Q(A_i | A_{i-1}, A_{i-2}) = Q(A_i | A_{i-1}, A_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})
- Examine all 8C2 = 28 trigram templates of two variables from the history together with A_i. Set observations with counts above a threshold as features.
- Examine all 8C1 = 8 bigram templates of one variable from the history together with A_i. Set observations with counts above a threshold as features.
- Build a MaxEnt model (use Jun Wu's toolkit):
  Q(A_i | A_{i-1}, A_{i-2}) = Z^{-1}(λ, μ) exp{ λ_1 f_1(A_i, A_{i-1}, S_{i-2}) + λ_2 f_2(A_i, M_{i-1}, M_{i-2}) + ... + λ_i f_i(A_i, A_{i-1}) + μ_j g_j(A_i, R_{i-1}) + ... + μ_J g_J(A_i) }
- Build the romanized language model:
  Q(W_i | W_{i-1}, W_{i-2}) = U(W_i | A_i) Q(A_i | A_{i-1}, A_{i-2})
168A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
169Details of Proposed Research (2): Additional Factors in the Callhome LM
- P(W_i | W_{i-1}, W_{i-2}) = P(W_i, A_i | W_{i-1}, W_{i-2}, A_{i-1}, A_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})
- Examine all 10C2 = 45 trigram templates of two variables from the history together with W or A. Set observations with counts above a threshold as features.
- Examine all 10C1 = 10 bigram templates of one variable from the history together with W or A. Set observations with counts above a threshold as features.
- Compute a minimum-divergence model of the form
  P(W_i | W_{i-1}, W_{i-2}) = Z^{-1}(λ, μ, ν) exp{ λ_1 f_1(A_i, A_{i-1}, S_{i-2}) + λ_2 f_2(A_i, M_{i-1}, M_{i-2}) + ... + λ_i f_i(A_i, A_{i-1}) + μ_j g_j(A_i, R_{i-1}) + ... + μ_J g_J(A_i) }
    × exp{ ν_1 h_1(W_i, W_{i-1}, S_{i-2}) + ν_2 h_2(A_i, W_{i-1}, S_{i-2}) + ... + ν_k h_k(A_i, A_{i-1}) + ν_K h_K(W_i) }
170Research Plan and Conclusion
- Use the baseline Callhome results from WS02
- Investigate treating the romanized forms of a script form as alternate pronunciations
- Build the MSA-text MaxEnt model
- Feature selection is not critical; use high cutoffs
- Choose features for the Callhome model
- Build and test the minimum divergence model
- Plug in induced structure
- Experiment with subsets of the MSA text
171A Pictorial Interpretation of the Minimum Divergence Model
- The ML model for MSA text: Q(x) = Z^{-1}(λ*, μ*) exp{ Σ_i λ*_i f_i(x) + Σ_j μ*_j g_j(x) }
- The ML model for Callhome, with λ = λ* instead of λ = 0: P(x) = Z^{-1}(λ*, μ, ν) exp{ Σ_i λ*_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
- All exponential models of the form P(x) = Z^{-1}(λ, μ, ν) exp{ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) }
172(No Transcript)