Title: Recognition of spoken and spelled proper names
Authors: Michael Meyer, Hermann Hild
- Introduction
- Experiments
- Summary
- The recognition of increasingly large sets of spoken names is difficult.
- Very large recognition vocabularies contain many easily confused words or even homophones.
- This paper compares the performance of proper name recognition when a name is spoken only, spelled only, or both spoken and spelled.
Introduction (cont)
- In what contexts do people speak and spell names?
Table 1: Three scenarios for speaking and spelling a proper name
- Speech data
- A database of about 2800 German last names spoken by 57 different speakers, according to scenario 2.
- Recorded with a close-talking microphone at a sampling rate of 16 kHz.
- The boundaries between all spoken and spelled names were identified in order to conduct scenario 1.
Experiments (cont)
- The pronunciation dictionary covers about half of the 2800 names of our speech data.
- The set of 1337 spoken and spelled names is used in all the experiments described below.
- For the experiments, we use an MS-TDNN as a specialized letter recognizer.
- We use the LVCSR of the JANUS system as a spoken name recognizer.
Experiments (cont)
- JANUS
- 60.0% names correct were achieved on the test set of the 1337 spoken last names.
- When the spelled names were recognized with JANUS, 93.3% names correct were achieved.
- MS-TDNN
- Achieved 96.5% correct spelled names on the test set.
Experiments (cont)
- Small list
- We assume that the list of names to be recognized is small enough that every name can be explicitly represented in the dictionary.
- How can we combine the different information provided by the spoken and spelled names?
Experiments (cont)
- After all, the pronunciations of the spelled letters approximate the sounds of the letters in the fluently spoken words.
- TOM versus T-O-M
- Exceptions are letters with unusual pronunciations and letter combinations that define their own pronunciation, such as "sch" and "ch".
- In the following, we will just combine the two representations on the basis of their acoustic scores only.
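A score-based combination of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_nbest`, the `weight` parameter, the floor score for names missing from one list, and all example scores are assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

NBest = List[Tuple[str, float]]  # (name, acoustic log-score); higher is better


def combine_nbest(spoken: NBest, spelled: NBest, weight: float = 0.5) -> List[Tuple[str, float]]:
    """Rank names by a weighted sum of spoken and spelled acoustic scores.

    `weight` is the share given to the spoken score (hypothetical parameter).
    A name missing from one list receives that list's worst score as a floor.
    """
    spoken_scores: Dict[str, float] = dict(spoken)
    spelled_scores: Dict[str, float] = dict(spelled)
    spoken_floor = min(spoken_scores.values())
    spelled_floor = min(spelled_scores.values())

    combined = {}
    for name in set(spoken_scores) | set(spelled_scores):
        spoken_score = spoken_scores.get(name, spoken_floor)
        spelled_score = spelled_scores.get(name, spelled_floor)
        combined[name] = weight * spoken_score + (1.0 - weight) * spelled_score
    # Best (highest) combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up N-best lists for illustration only.
spoken_nbest = [("Meier", -10.0), ("Meyer", -10.5), ("Maier", -11.0)]
spelled_nbest = [("Meyer", -4.0), ("Beyer", -6.5), ("Meier", -7.0)]
ranking = combine_nbest(spoken_nbest, spelled_nbest, weight=0.5)
best_name = ranking[0][0]
```

In this toy example the spoken list alone prefers "Meier", while the combined ranking prefers "Meyer" because the spelled evidence disambiguates the homophones.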
Experiments (cont)
- Scenario 2
- With 86.1% correct, the recognition on the entire utterance is worse than on the spelled part alone.
- It is possible to adopt a similar approach as in scenario 1.
- The boundary of the first-best hypothesis was used for the weighting of all hypotheses, resulting in a recognition rate of 89.1%.
- Incorporating the MS-TDNN letter recognizer results in a recognition rate of 95.8%.
Experiments (cont)
Figure 1: Names correct for a weighted combination of the N-best lists of spoken and spelled names (scenarios 1 and 2)
Experiments (cont)
- Large lists
- If the number of names exceeds the recognizer's maximum vocabulary size, a different approach has to be taken.
- A two-step approach is employed.
- A coarse recognition run is used to get a reduced list of name candidates.
- Then, these candidates are processed in a second run, in which all the previously described techniques for small word lists can be applied.
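The first step of such a two-step scheme can be sketched as a shortlist lookup. In this minimal sketch, generic string similarity from Python's `difflib` stands in for the coarse recognition run, which is an assumption and not the method used in the experiments; `shortlist_names`, the name list, and the noisy letter hypothesis are hypothetical.

```python
import difflib
from typing import List


def shortlist_names(letter_hypothesis: str, all_names: List[str], size: int = 3) -> List[str]:
    """First pass: keep the `size` names most similar to the recognized letter string."""
    lookup = {name.lower(): name for name in all_names}
    # cutoff=0.0 disables the similarity threshold so exactly `size` names are returned.
    close = difflib.get_close_matches(
        letter_hypothesis.lower(), list(lookup), n=size, cutoff=0.0
    )
    return [lookup[key] for key in close]


# Stand-in for a name list that is too large for the recognizer's vocabulary.
names = ["Meyer", "Meier", "Beyer", "Hild", "Schmidt", "Schneider"]
# "MEYRR" imitates a letter string with one recognition error.
candidates = shortlist_names("MEYRR", names, size=3)
```

The reduced list `candidates` would then serve as the vocabulary of the second pass, where the small-list techniques apply.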
Experiments (cont)
- In the case of scenario 1, the list of candidates can be easily reduced if only the spelled names are considered in the first pass.
- For scenario 2, only phonemes and letters are included in JANUS's recognition vocabulary.
Experiments (cont)
- For scenario 2
- A special language model is employed.
- A list of the most similar names can be retrieved and then used in another JANUS recognition run.
- The letter segments are then re-recognized with
Experiments (cont)
Table 3: Summary of results for the separated and combined recognition of fluently spoken and spelled last names
- By combining the N-best lists of both the spoken and spelled recognition, the overall performance can be improved.
- An input of either L or FL can be distinguished with almost 99% accuracy, resulting in 95.5% names correct without knowing a priori whether L or FL was spoken.
Caller Identification from Spelled-Out Personal Data Over the Telephone
- Introduction
- The personal identification algorithm
- Tests and results
- Conclusions
- The problem of automatically identifying the caller in a telephone conversation from the information spoken in the call is extremely difficult.
- The identification must take place despite rather substantial speech recognition errors that may be made by the machine.
Introduction (cont)
- We can find a solution to the problem if we make two assumptions.
- We assume that there is a database of records containing personal information about the caller, which can serve as a reference during the identification process.
- We ask the caller to spell the personal identifying items, so that the spoken vocabulary is small and we can look for correlations with the items in the database.
The personal identification algorithm
Fig. 1. The algorithm of personal identification from spelled tokens
The personal identification algorithm (cont)
- A Bayesian computation starts with an estimate, for each record in the database, of the probability that the record represents the identity of the caller.
- It uses the acquired information and updates each record's probability that it corresponds to the current caller's identity.
- The incremental computation is
The personal identification algorithm (cont)
- Bayesian update of probabilities
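A generic Bayesian update over database records can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula from the slides: the edit-distance channel model in `likelihood`, its `error_rate` parameter, the helper names, and the three toy records are all hypothetical.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def likelihood(heard: str, stored: str, error_rate: float = 0.2) -> float:
    """Toy channel model (assumption): each edit costs a factor of `error_rate`."""
    return error_rate ** edit_distance(heard, stored)


def bayes_update(prior: Dict[str, float], records: Dict[str, Dict[str, str]],
                 field: str, heard: str) -> Dict[str, float]:
    """Return P(record | heard), proportional to P(heard | record) * P(record)."""
    unnormalized = {
        rid: prior[rid] * likelihood(heard, records[rid][field]) for rid in prior
    }
    total = sum(unnormalized.values())
    return {rid: value / total for rid, value in unnormalized.items()}


# Hypothetical three-record database.
records = {
    "r1": {"last": "MEYE", "city": "BONN"},
    "r2": {"last": "MEIE", "city": "BONN"},
    "r3": {"last": "HILD", "city": "KIEL"},
}
posterior = {rid: 1.0 / len(records) for rid in records}   # uniform starting estimate
for field, heard in [("last", "MEYE"), ("city", "BONM")]:  # second token has one ASR error
    posterior = bayes_update(posterior, records, field, heard)
best_record = max(posterior, key=posterior.get)
```

Because each update only rescales and renormalizes the current estimate, a misrecognized token (here "BONM" for "BONN") lowers a record's probability without eliminating it, which is how such a scheme can tolerate recognition errors.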
Tests and results
- The system was tested using a database of one million records that was constructed by using random combinations of 4,375 female first names, 1,129 male first names, and 88,799 last names.
- The account numbers were generated so that the values for the last four digits of the number occurred with equal frequency throughout the database.
- The city, postal code, and phone number fields were generated to correspond to the locations in the
Tests and results (cont)
- Our test involved identifying 300 different records in the database.
- If the system was unable to make an identification of the target record after asking the user for all of the information, the caller was asked to make a second attempt using the same information.
- If the system failed to produce a result after the second attempt, the call was terminated at that point.
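The two-attempt call policy described above can be sketched as a small control loop. The function `run_call`, its callback signature, and the record ID are hypothetical names introduced for illustration; the sketch covers only the retry logic, not the dialogue or the recognizer.

```python
from typing import Callable, Optional


def run_call(attempt_identification: Callable[[int], Optional[str]],
             max_attempts: int = 2) -> Optional[str]:
    """Return the identified record ID, or None if the call is terminated."""
    for attempt in range(1, max_attempts + 1):
        record_id = attempt_identification(attempt)
        if record_id is not None:
            return record_id  # identification succeeded on this attempt
    return None  # every allowed attempt failed: terminate the call


# Hypothetical caller whose first pass fails and whose second pass succeeds.
outcomes = {1: None, 2: "record-42"}
result = run_call(lambda attempt: outcomes[attempt])
```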
Tests and results (cont)
- For each telephone call, the users were asked eight questions:
1. Enter your ID, using your telephone keypad, followed by the pound key. If you make a mistake, press the star key.
2. You entered (the value entered in (1)). If this is correct, press 1. If it is not, press 2.
3. Please say the first four letters of your last name.
4. Please say the first four digits of your first
Tests and results (cont)
5. Please say the last four digits of your card number.
6. Could you please spell the city currently listed on your account?
7. Please say your phone number.
8. Please say the postal code currently listed on your account.
Tests and results (cont)
Fig. 2. Summary of results from 300 calls
Tests and results (cont)
Fig. 3. Rate of ASR character misrecognition by
Tests and results (cont)
Fig. 4. Rate of misrecognition of field by ASR (misrecognition: at least one error made in the spelled field value)
Tests and results (cont)
Fig. 5. Average cumulative number of records examined by the system
Conclusions
- The method tolerates high misrecognition rates.
- The method can be used with off-the-shelf components; it doesn't require specialized ASR.
- A further goal is to allow the personal identification information to be spoken instead of spelled.
Record Rv that will be verified
Select another T
Request T from caller
Collect subset near T
Rm Rv
Another T?
Add subset to S
Risk(Rv) < Risk(reject)
Update the risk for each record in S
Reject!
Rm <- min risk in S
Accept!