Title: Lattice-Based Statistical Spoken Document Retrieval
1. Lattice-Based Statistical Spoken Document Retrieval
- Chia Tee Kiah
- Ph.D. thesis
- Department of Computer Science
- School of Computing
- National University of Singapore
- Supervisors: A/Prof. Ng Hwee Tou (NUS), Dr. Li Haizhou (I2R)
2. Outline
- Introduction
- Original contribution
- Background
- Lattice-based SDR under statistical model
- Other SDR methods
- Experiments on SDR with short queries
- Query-by-example SDR
- Conclusion
3. Introduction: Spoken Document Retrieval
- Information retrieval (IR)
  - Search for items of data according to users' info. need
- Spoken document retrieval (SDR)
  - IR on speech recordings
  - Growing in importance as more and more speech data is stored: news broadcasts, voice mails, ...
- SDR more difficult than text IR
  - Currently needs automatic speech recognition (ASR)
  - 1-best transcripts from ASR are error-prone
  - Word error rate for noisy, spontaneous speech may be 50%
4. Introduction: Lattices

[Figure: word lattice for an utterance, with node times t = 0.00 to t = 1.12 and edges labeled with word hypotheses such as "my", "sons", "mentor", "to", "tender", "its", "nice", "and", plus sentence markers <s> and </s>]

- Lattice: connected directed acyclic graph
  - James & Young (1994), James (1995)
- Each edge labeled with a term hypothesis and probabilities
- Each path gives a hypothesized seq. of terms, and its probability
- Use alternative hypotheses to overcome errors in 1-best transcripts: lattice-based SDR
7. Original Contribution
- A method for lattice-based SDR using a statistical IR model (Song & Croft 1999)
  - Calculate expected count of each word in each lattice
  - From counts, estimate statistical lang. models for docs.
  - Compute query-doc. relevance as a probability
- Previous lattice-based SDR methods all based on the vector space IR model!
- Extension to query-by-example SDR
  - SDR where queries are also full-fledged spoken docs.
- Presented at EMNLP-CoNLL 2007, SIGIR 2008
9. Background: Information Retrieval
- The task of IR
  - Given doc. collection C, query q giving info. need
  - Find list of docs. in C relevant to the info. need
- Steps involved
  - Before receiving query
    - Document preprocessing: outputs an index for rapid access
  - Upon receiving query
    - Retrieval: outputs ranked list of docs.
    - Done by assigning relevance scores guided by a retrieval model
- Good IR systems give higher scores to more relevant docs.

[Figure: document preprocessing and retrieval pipeline.
Tokenization: "Nevertheless, information retrieval has become accepted as a description" becomes "nevertheless information retrieval has become accepted as a description";
stop word removal: "information retrieval accepted description";
stemming: "inform retriev accept descript";
indexing: e.g. "document" occurs in docs. 336, 624, 864, ...; "inform" in docs. 33, 128, 315, ...
Retrieval for q = "Euclid's algorithm" outputs a ranked list, e.g. "an algorithm for finding the greatest common divisor of two numbers"]
10. Background: IR Retrieval Models
- Vector space with tf-idf weighting (Salton 1963; Spärck Jones 1972)
  - Docs. & queries are Euclidean vecs.
  - Compute relevance as cosine similarity
  - Each vec. component d(i), q(i) a product of
    - tf(wi, d): term frequency, an increasing func. of the no. of occurrences c(wi; d) of wi in d
    - idf(wi): inverse doc. frequency, a decreasing func. of the no. of docs. containing wi

[Figure: vectors d and q with angle t between them; relevance = cos t]
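The tf-idf cosine model above can be sketched in a few lines. This is an illustrative toy (names are mine, and tf = raw count with idf = log(N/df) is just one common variant among several):

```python
import math
from collections import Counter

def tf_idf_vector(tokens, doc_freq, num_docs):
    """Sparse tf-idf vector: tf = raw count, idf = log(N / df)."""
    counts = Counter(tokens)
    return {w: c * math.log(num_docs / doc_freq[w]) for w, c in counts.items()}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A query sharing a term with a document gets a positive score; disjoint vocabularies give cosine 0.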
11. Background: IR Retrieval Models
- Okapi BM25 (Robertson et al. 1998)
  - Based on an approximation to Harter's 2-Poisson theory of word distribution (1974) & the Robertson/Spärck Jones weight (1976)
- Rel_bm25(d, q) = Σ_{w ∈ V} w^RSJ · ((k1 + 1) c(w; d)) / (K + c(w; d)) · ((k3 + 1) c(w; q)) / (k3 + c(w; q)) + k2 · |q| · (avdl - |d|) / (avdl + |d|),
  where K = k1 ((1 - b) + b |d| / avdl) and
  w^RSJ = log [ (rw + 0.5)(C - nw - R + rw + 0.5) / ((R - rw + 0.5)(nw - rw + 0.5)) ]
  - C: no. of docs in collection
  - V: vocabulary
  - c(w; d): count of w in d
  - c(w; q): count of w in q
  - nw: no. of docs containing w
  - R: no. of docs. known to be rel.
  - rw: no. of rel. docs containing w
  - |d|: length of d
  - avdl: average doc. length
  - k1, k2, k3, b are parms.
12. Background: IR Retrieval Models
- Statistical lang. (n-gram) model (Song & Croft 1999)
  - Use Pr(d | q) as relevance measure
  - Assuming uniform Pr(d):
    - Pr(d | q) = Pr(q | d) Pr(d) / Pr(q) ∝ Pr(q | d)
  - We can thus define relevance as
    - Rel_stat(d, q) = log Pr(q | d)
  - Write q as a seq. of words q1 q2 ... qK
  - Given unigram model Pr(· | d),
    - Rel_stat(d, q) = log Π_{1 ≤ i ≤ K} Pr(qi | d) = Σ_w c(w; q) log Pr(w | d)
  - Estimate Pr(· | d) by smoothing word counts

[Figure: generative view, with doc. d generating query q; rank docs. d by Pr(d | q)]
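The unigram relevance score can be sketched as follows. This is a minimal illustration with simple Jelinek-Mercer interpolation standing in for the smoothing step (function names and the `lam` parameter are mine, not from the thesis):

```python
import math
from collections import Counter

def relevance(query_tokens, doc_tokens, background, lam=0.5):
    """Rel_stat(d, q) = sum_w c(w; q) * log Pr(w | d),
    with Pr(w | d) linearly interpolated with a background model."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w, c_q in Counter(query_tokens).items():
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        # Interpolation keeps Pr(w | d) > 0 for unseen words
        p = (1 - lam) * p_doc + lam * background.get(w, 1e-9)
        score += c_q * math.log(p)
    return score
```

A document containing the query word receives a higher (less negative) log-probability score than one that does not.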
13. Background: Information Retrieval
- System evaluation
  - Compare IR engine's ranked list to ground truth relevance judgements
  - Eval. metric: mean average precision (MAP)
- MAP for a set of queries Q:
  - MAP = (1 / |Q|) Σ_{q ∈ Q} (1 / Rq) Σ_{1 ≤ j ≤ Rq} j / r'_{j,q}
  - |Q|: no. of queries
  - Rq: no. of docs. rel. to query q
  - r'_{j,q}: position of jth rel. doc. in ranked list output for query q
- Intuitively, higher MAP means relevant docs. are ranked higher
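The MAP computation is easy to express directly (a standard implementation sketch; identifiers are mine):

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """AP = (1/R) * sum over relevant docs of precision at their rank;
    note that precision at the rank of the j-th relevant doc is j / rank."""
    relevant_ids = set(relevant_ids)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, the ranking [1, 2, 3] with relevant docs {1, 3} gives AP = (1/2)(1/1 + 2/3) = 5/6.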
14. Background: Automatic Speech Recognition
- ASR transcribes speech waveform into text; involves:
  - Pronouncing dictionary: maps written words to phonemes
    - Phoneme: contrastive speech unit, e.g. /ae/, /ow/, /th/, /p/, ...
  - Acoustic models: describe acoustic realizations of phonemes
    - Each model usually for a triphone, i.e. a phoneme in the context of 2 phonemes
  - Language model: gives word transition probabilities
15. Background: Automatic Speech Recognition
- General paradigm: hidden Markov models (HMMs)
  - Acoustic models: left-right triphone HMMs, trained using EM algo.
  - Using lang. model & pron. dict., join HMMs into one large utterance HMM
  - Decoding: find most probable transcript, via Viterbi search with beam pruning (Ney et al. 1992)
  - Lattices computed using an extension of decoding
- ASR system evaluation: word error rate (WER)
  - WER = edit dist. / ref. trans. length
  - Other metrics: char. error rate, syll. error rate

[Figure: structure of a typical triphone HMM]
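The WER definition above can be sketched with a standard word-level Levenshtein distance (a textbook implementation, not the thesis's code):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words,
    # keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1] / len(ref)
```

One substitution in a 3-word reference gives WER = 1/3; WER can exceed 1 when the hypothesis has many insertions.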
16. Background: Spoken Document Retrieval
- IR with a collection of speech recordings
- ASR engine produces document surrogates; may be
  - 1-best word transcripts (e.g. Gauvain et al. 2000)
  - 1-best subword transcripts (e.g. Turunen & Kurimo 2006)
  - Phoneme lattices (e.g. James 1995; Jones et al. 1996)
  - N-best transcript lists (Siegler 1999)
  - Word lattices (e.g. Mamou et al. 2006)
  - Phoneme & word lattices (e.g. Saraclar & Sproat 2004)
- IR models used in SDR
  - For SDR with 1-best transcripts: vector space, BM25, & statistical IR models have been tried
  - For lattice-based SDR: only the vector space model
17. Background: Query By Example
- IR where queries & docs. are of like form
  - Queries are exemplars of the type of objects sought
  - E.g. music (query by humming) (Zhu et al. 2003); images (Vu et al. 2003)
- Work related to query-by-example SDR
  - Query by example for speech & text
    - He et al. (2003), Lo & Gauvain (2002, 2003): tracking task in Topic Detection & Tracking (TDT)
    - Chen et al. (2004): newswire articles (text) for queries, broadcasts (speech) for docs.
    - All using 1-best transcripts
  - Lattices of short spoken queries for IR
    - Colineau & Halber (1999)
19. Lattice-Based SDR Under the Statistical Model
- Song & Croft's IR model
  - Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q) log Pr(w | d)
- Our idea: estimate Pr(· | d) from lattices
  - Find expectations of word counts (Saraclar & Sproat 2004) & doc. lengths
    - E[c(w; d)] = Σ_t c(w; t) Pr(t | d)
    - E[|d|] = Σ_t |t| Pr(t | d)
  - Expected counts can be computed efficiently by dynamic programming (Hatch et al. 2005)
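For a small lattice these expectations can be computed by enumerating complete paths and their posterior probabilities (real systems use the forward-backward dynamic programming cited above; this brute-force sketch, with names of my own choosing, is only for illustration):

```python
from collections import Counter, defaultdict

def expected_counts(paths):
    """paths: list of (word_sequence, score) pairs, one per lattice path.
    Returns (E[c(w; d)] for each word w, E[|d|]), normalizing the path
    scores into posteriors Pr(t | d)."""
    total = sum(p for _, p in paths)
    exp_counts = defaultdict(float)
    exp_len = 0.0
    for words, p in paths:
        post = p / total  # posterior probability of this path
        for w, c in Counter(words).items():
            exp_counts[w] += c * post
        exp_len += len(words) * post
    return dict(exp_counts), exp_len
```

E.g. a lattice with path (w4, w3, w4) at posterior 0.6 and path (w2, w2) at posterior 0.4 gives E[c(w4)] = 1.2 and E[|d|] = 2.6.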
20. Lattice-Based SDR Under the Statistical Model
- The method
  - Start with speech seg.'s acoustic observations o
  - Generate lattice using ASR
    - Decoding with adaptation of Viterbi algo.: keep track of multiple paths (James 1995)
    - Use simple lang. model (bigram LM)
  - Rescore with more complex LM (trigram LM)
    - Replace bigram LM probs. with trigram probs.
    - Make duplicates of nodes with differing trigram contexts

[Figure: acoustic observations o1, o2, o3, ...; lattice from decoding with the simple LM, edges carrying acoustic & bigram probs. such as w1 / Pr(o1 | w1), Pr(w1 | <s>) and w4 / Pr(o3 | w4), Pr(w4 | w3); lattice rescored with the complex LM, with trigram probs. such as Pr(w4 | w1 w3) and Pr(w4 | w2 w3), and nodes duplicated for differing trigram contexts]
21. Lattice-Based SDR Under the Statistical Model
- The method (cont.)
  - Combine acoustic & LM probs.
    - In practice, apply a grammar scale factor and a word insertion penalty
  - Prune lattice
    - Remove paths whose log probs. exceed the best path's by more than Tdoc
  - Find expectations of word counts E[c(w; o)] & seg. lengths E[|o|]
  - Combine expected counts to get E[c(w; d)], E[|d|]

[Figure: lattice with combined acoustic & LM probs., edge scores p1, ..., p10; after pruning, two paths remain: w4 w3 w4 with edge scores p1, p3, p4, and w2 w2 with edge scores p2, p5]

Expected counts from the pruned lattice:
- E[c(w2)] = 2 p2 p5 / (p1 p3 p4 + p2 p5)
- E[c(w3)] = p1 p3 p4 / (p1 p3 p4 + p2 p5)
- E[c(w4)] = 2 p1 p3 p4 / (p1 p3 p4 + p2 p5)
22. Lattice-Based SDR Under the Statistical Model
- The method (cont.)
  - Build unigram model to get Pr(· | d)
  - Zhai & Lafferty's (2004) 2-stage smoothing method
    - Combination of Jelinek-Mercer & Bayesian smoothing
  - Adapt 2-stage smoothing to use expected counts:
    - Pr(w | d) = (1 - λ) (E[c(w; d)] + µ Pr(w | U)) / (E[|d|] + µ) + λ Pr(w | U)
    - w is a word, e.g. a query word
    - U: a background language model
    - λ ∈ (0, 1): set according to nature of queries
    - µ: set using a variation of Zhai & Lafferty's estimation algo.
- Thus we can compute
  - Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q) log Pr(w | d)
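The scoring step can be sketched as follows. The smoothing form here is my reading of Zhai & Lafferty's two-stage formula adapted to expected counts; parameter defaults and names are illustrative:

```python
import math

def two_stage_prob(w, exp_counts, exp_len, background, lam=0.1, mu=100.0):
    """Pr(w | d) = (1-lam) * (E[c(w;d)] + mu*Pr(w|U)) / (E[|d|] + mu)
                   + lam * Pr(w|U)  (two-stage smoothing on expected counts)."""
    p_bg = background.get(w, 1e-9)
    return ((1 - lam) * (exp_counts.get(w, 0.0) + mu * p_bg) / (exp_len + mu)
            + lam * p_bg)

def rel_stat(query_counts, exp_counts, exp_len, background, lam=0.1, mu=100.0):
    """Rel_stat(d, q) = sum_w c(w; q) * log Pr(w | d)."""
    return sum(c * math.log(two_stage_prob(w, exp_counts, exp_len,
                                           background, lam, mu))
               for w, c in query_counts.items())
```

A document whose lattice carries expected mass for the query word outranks one whose lattice does not.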
24. Other SDR Methods
- Statistical, using 1-best transcripts
  - Motivated by Song & Croft (1999), Chen et al. (2004)
- Vector space, using lattices
  - Mamou et al. (2006)
- BM25, using lattices
25. Other SDR Methods: Statistical, Using 1-Best Trans.
- Estimate Pr(· | d) from 1-best transcript
  - Use Zhai & Lafferty's 2-stage smoothing:
    - Pr(w | d) = (1 - λ) (c_1-best(w; d) + µ Pr(w | U)) / (|d|_1-best + µ) + λ Pr(w | U)
    - w is a word, e.g. a query word
    - c_1-best(w; d): count of w in d's 1-best transcript
    - |d|_1-best: length of d's transcript
    - U: a background language model
    - λ ∈ (0, 1), µ > 0 are smoothing parameters
- Compute relevance
  - Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q) log Pr(w | d)
26. Other SDR Methods: Vector Space, Using Lattices
- Mamou et al. (2006)
- Method
  - Compute word confusion network (Mangu et al. 2000)
    - Sequence of confusion sets
  - Compute term freq. vector
    - Weight of each term depends on its ranks & probs. in confusion sets, and its freq. in the doc. collection
  - Compute relevance
    - Construct d & q vectors, compute cosine similarity

[Figure: pruned lattice with edges w4/p1, w3/p3, w4/p4, w2/p2, w2/p5 converted to a word confusion network with confusion sets g1, g2, g3 over words w2, w3, w4 (and an epsilon arc); document & query vectors d, q compared by the angle t between them]
27. Other SDR Methods: BM25, Using Lattices
- Modify Robertson et al.'s BM25 formula to use expected counts
  - Rel_bm25,lat(d, q): E[c(w; d)] in place of c(w; d), E[|d|] in place of |d|
- Estimate doc. freq. nw from expected counts (Turunen & Kurimo 2007)
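A sketch of BM25 scoring with expected counts substituted for exact counts. The idf-style weight below is the Robertson/Spärck Jones weight with no relevance information (R = rw = 0), and all names and defaults are illustrative rather than the thesis's exact configuration:

```python
import math

def bm25_lattice_score(query_counts, exp_counts, exp_len, doc_freq,
                       num_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """BM25 with E[c(w; d)] in place of c(w; d) and E[|d|] in place of |d|."""
    score = 0.0
    for w, c_q in query_counts.items():
        n_w = doc_freq.get(w, 0)
        if n_w == 0:
            continue
        # RSJ weight without relevance feedback
        idf = math.log((num_docs - n_w + 0.5) / (n_w + 0.5))
        c_d = exp_counts.get(w, 0.0)
        # Length normalization uses the expected document length
        K = k1 * ((1 - b) + b * exp_len / avdl)
        score += (idf * ((k1 + 1) * c_d / (K + c_d))
                      * ((k3 + 1) * c_q / (k3 + c_q)))
    return score
```

A document with expected mass for a query term scores above one with none, which scores 0.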
29. SDR Experiments: Mandarin Chinese Task Setup
- Doc. collection
  - Hub5 Mandarin training corpus (LDC98T26)
  - 42 telephone calls in Mandarin Chinese; total 17 hours, 600 Kb text
- Unit of retrieval (document)
  - ½-minute time windows with 50% overlap (Abberley et al. 1998; Tuerk et al. 2001)
  - 4,312 retrieval units
- Queries
  - 18 keyword queries: 14 test, 4 devel.
- Ground truth relevance judgements
  - Determined manually
30. SDR Experiments: Mandarin Chinese Task Details
- Lattices
  - Generated by Abacus (Hon et al. 1994)
    - Large vocab. triphone-based cont. speech recognizer
  - Rescored with trigram language model
    - Trained with TDT, Callhome, CSTSC-Flight corpora
- 1-best transcripts
  - Decoded from rescored lattices
- Other tools used
  - AT&T FSM (Mohri et al. 1998)
  - SRILM (Stolcke 2002)
  - Low et al.'s (2005) Chinese word segmenter
31. SDR Experiments: Mandarin Chinese Task
- Retrieval
  - SDR performed using
    - baseline stat. method, on ref. transcripts
    - baseline stat. method, on 1-best transcripts
    - Mamou et al.'s vector space method, on lattices
    - our proposed method, on lattices
- Smoothing parameter
  - λ = 0.1 good for keyword queries (Zhai & Lafferty 2004)
- Lattice pruning threshold T (Tdoc)
  - Vary T on devel. queries, use best value on test queries
- Evaluation measure: mean avg. prec. (MAP)
32. SDR Experiments: Mandarin Chinese Task Results
- Results for statistical methods
  - 1-best MAP was 0.1364; ref. MAP was 0.4798
  - Lattice-based MAP for devel. queries highest at T = 65,000
  - At this point, MAP for test queries was 0.2154

[Figure: MAP vs. T for the 4 devel. queries and the 14 test queries]
33. SDR Experiments: Mandarin Chinese Task Results
- Results for Mamou et al.'s vector space method
  - MAP for devel. queries highest at T = 27,500
  - At this point, MAP for test queries was 0.1599

[Figure: MAP vs. T for the 4 devel. queries and the 14 test queries]
34. SDR Experiments: Mandarin Chinese Task Results
- Statistical significance testing: 1-tailed t-test
  - Improvement over 1-best significant at 99.5% level
  - Improvement over vector space significant at 97.5% level
- Our method outperforms stat. 1-best & vec. space with lattices
35. SDR Experiments: English Task Setup
- Corpus: Fisher English Training corpus from LDC
  - 11,699 telephone calls; total 1,920 hours, 109 Mb text
  - Each call initiated by one of 40 topics
  - 6,605 calls for training ASR engine
- Queries
  - The 40 topic specifications
  - 32 test, 8 devel.
- Doc. collection
  - 5,094 calls
  - Unit of retrieval (document): a call
- Ground truth rel. judgements
  - d rel. to q iff conversation d was initiated by topic q

Example of a topic spec.:
"ENG01. Professional sports on TV. Do either of you have a favorite TV sport? How many hours per week do you spend watching it and other sporting events on TV?"
36. SDR Experiments: English Task Details
- Lattices
  - Generated by HTK (Young et al. 2006)
    - Large vocab. triphone-based cont. speech recognizer
  - Tried both trigram LM rescoring and decoding only with bigram LM
- 1-best transcripts
  - Decoded from rescored lattices
  - Word error rate: 48.1% (with rescoring), 50.8% (without)
- Words stemmed with Porter stemmer
- Also experimented with stop word removal:
  - no stopping
  - stopping with 319-word list from U. of Glasgow (gla)
  - stopping with 571-word list used in SMART system (smart)
- Index building used CMU Lemur toolkit
37. SDR Experiments: English Task
- Retrieval
  - Performed using
    - baseline stat. method, on ref. transcripts
    - baseline stat. method, on 1-best transcripts
    - Mamou et al.'s vector space method, on lattices
    - BM25 method, on lattices
    - our proposed method, on lattices
- Retrieval parameters
  - For stat. methods: λ = 0.7 good for verbose queries
  - For BM25: k1 = 1.2, b = 0.75, k2 = 0 (following Robertson et al. (1998)); remaining parameters, incl. k3, tuned with devel. queries
- Evaluation measure: MAP
38. SDR Experiments: English Task Results
- Main findings
  - Our method outperforms 1-best stat. SDR, Mamou et al.'s vector space method, & BM25
  - Unlike Mamou et al.'s method, ours does not need stop word removal
  - Rescoring lattices with trigram LM helps improve SDR
40. Query-By-Example SDR
- The task
  - Given collection C of spoken docs., and a query exemplar q (also a spoken doc.)
  - Task: find docs. in the coll. on a similar topic as the query
- Extending our stat. lat.-based SDR method to query-by-example: additional challenges
  - Problem 1: How to cope with uncertainty in ASR transcription of q?
  - Problem 2: How to handle high concentration of non-content words in q?
41. Query-By-Example SDR: Problems
- Problem 1: Uncertainty in transcription of q
  - Use multiple ASR hypotheses for q
  - Reformulate 1-best stat. IR as negative Kullback-Leibler divergence ranking (Lafferty & Zhai 2001)
    - -KL(q ∥ d) =_rank log Pr(q | d)
  - Thus, we can estimate models Pr(· | d) & Pr(· | q) from the d & q lattices, and rank docs. by neg. KL div.
- Problem 2: Lots of non-content words in q
  - Use stop word removal
42. Query-By-Example SDR: Proposed Method
- Get lattices for d & q; rescore, prune, find expected counts
  - Use 2 pruning thresholds: Tdoc for docs., Tqry for queries
- Build unigram model of d
  - With expected counts
  - Again, use 2-stage smoothing (Zhai & Lafferty 2004)
- Build unigram model of q, unsmoothed:
  - Pr(w | q) = E[c(w; q)] / E[|q|]
- Compute relevance as neg. KL div. (Lafferty & Zhai 2001)
  - Rel_stat-qbe(d, q) = Σ_w Pr(w | q) log Pr(w | d)
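Since the query-model entropy is constant per query, the negative-KL ranking reduces to the cross-entropy sum above. A toy sketch (the floor value 1e-12 for unseen words stands in for the smoothed Pr(w | d); names are mine):

```python
import math

def rel_qbe(query_model, doc_model):
    """Rel_stat-qbe(d, q) = sum_w Pr(w | q) * log Pr(w | d);
    rank-equivalent to the negative KL divergence KL(q || d)."""
    return sum(p_q * math.log(doc_model.get(w, 1e-12))
               for w, p_q in query_model.items() if p_q > 0)
```

A document model close to the query model gets a higher score than one that spreads its mass elsewhere.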
43. Query-By-Example SDR: Experiments
- Corpus: Fisher English Training corpus
- Queries
  - 40 exemplars (32 test, 8 devel.) for the 40 topics
- Doc. collection
  - 5,054 telephone calls
- Ground truth rel. judgements
  - d rel. to q iff d & q on same topic
- Smoothing parameter
  - λ = 0.7
- Lattice pruning thresholds Tdoc and Tqry
  - Varied independently on devel. queries
- Stop word removal used
  - no stopping
  - stopping with gla stop list
  - stopping with smart stop list
44. Query-By-Example SDR: Experiments
- Retrieval performed using
  - 1-best trans. of exemplars & docs. (1-best → 1-best)
  - exemplar 1-best, doc. lat. (1-best → Lat)
  - exemplar lat., doc. 1-best (Lat → 1-best)
  - lat. counts of exemplars and docs. (Lat → Lat): our proposed method
- Also tried
  - ref. trans. of exemplars & docs. (Ref → Ref)
  - orig. Fisher topic specs. for queries (Top → Ref, Top → 1-best, Top → Lat)
- Evaluation measure: MAP
45. Query-By-Example SDR: Experimental Results
- MAP without stop word removal
- Stat. significance testing: 1-tailed t-test, Wilcoxon test
  - Lat → Lat vs. 1-best → 1-best: improvement sig. at 99.95% level
- However, original topic specs. still better: the nature of exemplars presents difficulties for retrieval
46. Query-By-Example SDR: Experimental Results
- MAP with stop word removal
  - With gla stop list: Lat → Lat better than 1-best → 1-best at 99.99% level
  - With smart stop list: better at 99.95% level
- Our method (Lat → Lat) gives consistent improvement
48. Conclusion
- Contributions
  - Proposed novel SDR method: combines use of lattices & a stat. IR model
    - Motivated by improved IR accuracy when each technique was used individually
    - New method performs well compared to previous methods & lattice-based BM25
  - Extended proposed method to query-by-example SDR
    - Lat.-based query by example, under the stat. IR model
    - Significant improvement over using 1-best trans.
    - Consistently better, under a variety of setups
49. Conclusion
- Suggestions for future work
  - Incorporate proximity-based search into our method
  - Formulate a more principled way of deriving lattice pruning thresholds
  - Examine how stop words affect SDR query by example
  - Extend the stat. lat.-based SDR framework to other speech processing tasks, e.g. spoken document classification
50. Thank you!