BBN Audio Indexer Training - PowerPoint PPT Presentation

1
Arabic Speech and Text in TIDES OnTAP
Jay Billa, Mohamed Noamany, Amit Srivastava, John Makhoul
and Francis Kubala
2
Outline
  • TIDES OnTAP overview
  • Brief Arabic Language primer
  • Arabic data resources
  • Speech Recognition
  • Initial system
  • Improvements
  • Scoring

3
Outline
  • Speaker Identification
  • Named Entity extraction
  • Arabic Document Tracking
  • Conclusions

4
TIDES OnTAP
  • OnTAP is a human-language processing system
    operating on mixed media in multiple languages
  • Data enters the system in its native form
    (radio, satellite, news feed)
  • Processed and stored as annotated text in a
    queryable database
  • Provides facilities for
  • Fact Extraction
  • Event Detection
  • Event Tracking

5
TIDES OnTAP Conceptual System Diagram
6
Arabic capabilities in TIDES OnTAP
  • Speech Recognition
  • Speaker Identification
  • Named Entity Extraction
  • Event Tracking

7
Arabic Language
  • Many spoken dialects, one written language:
    Modern Standard Arabic (MSA)
  • MSA is also the dominant spoken language for
    broadcast news
  • 25 consonants, 3 quasi-consonants and 3 vowels
  • Duration of a consonant/vowel can change the
    meaning
  • Consonants can be doubled and vowels can be short
    or long
  • Long vowels are always written

8
Arabic Language
  • Short vowels and doubling of consonants are
    almost never written out except to disambiguate a
    word
  • When written they appear as marks or diacritics
    above or below a consonant
  • A word written as ktb can be katab (he
    wrote), kutib (it was dictated) or kutub
    (books); context allows a native speaker to
    infer the correct meaning
  • Long vowels and quasi-consonants use the same
    symbols
  • Arabic is highly inflected: a single word can
    represent an entire phrase
  • Vowels at the end of a word can be optional
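
The ktb ambiguity above can be reproduced in a few lines. This is a minimal sketch, not part of the original slides: it assumes the short-vowel and doubling marks fall in the Unicode range U+064B to U+0652, and the helper name strip_diacritics is made up for illustration. The three vocalized forms are spelled out code point by code point so the marks are visible in the source.

```python
import re

# Assumption: tanween, fatha, damma, kasra, shadda and sukun (U+064B..U+0652)
# are the marks that normal Arabic orthography leaves out.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word):
    """Return the word as it is normally written, without vowel marks."""
    return DIACRITICS.sub("", word)


KAF, TA, BA = "\u0643", "\u062A", "\u0628"            # k, t, b
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"    # short a, u, i

# The three readings of ktb given on the slide, fully vocalized.
vocalized = {
    "katab": KAF + FATHA + TA + FATHA + BA,
    "kutib": KAF + DAMMA + TA + KASRA + BA,
    "kutub": KAF + DAMMA + TA + DAMMA + BA,
}

# After stripping the marks, all three collapse to the same written form.
written = {name: strip_diacritics(form) for name, form in vocalized.items()}
unique_written_forms = set(written.values())
print(len(unique_written_forms), "written form for", len(vocalized), "readings")
```

All three readings map to the same three-letter string, which is exactly what a transcript or newspaper text gives the recognizer.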

9
Arabic Data Resources
  • Acoustic modeling data was transcribed using
    normal orthography; short vowels and consonant
    doubling were not marked in the transcripts
  • 42 hours of spoken Arabic from Egyptian and
    Syrian broadcast radio and TV
  • 20 hours of broadcast news from the Al-Jazeera TV
    network

10
Arabic Data Resources
  • Language modeling (LM) data acquired from
  • Lebanese newspapers Al-Hayat and An-Nahar
  • Agence France Presse (Newswire)
  • The daily text feed from Al-Hayat is also
    retained and added to the LM data for
    close-in-time language modeling

11
Arabic Data Resources
12
Difficulties faced in Arabic Speech Recognition
  • Audio data transcripts did not mark short vowels
    or consonant doubling
  • Creating accurate phonetic transcriptions is
    time-consuming and costly
  • LM data from newspaper or newswire is also not
    marked for phonetic accuracy
  • Speaker optionality in pronouncing word endings
    complicates matters further
  • Compound words (due to inflection) increase the
    out-of-vocabulary (OOV) word rate for a fixed
    dictionary size

13
Approaches to Arabic Speech Recognition - 1
  • Assign a phoneme to each Arabic character
  • Benefits
  • Pushes the burden of disambiguation onto a
    statistical model (HMM) with a well-defined
    training methodology: a completely automatic
    procedure!
  • Lexicon design becomes trivial: S P E L L the
    word!
  • Any written Arabic text (newspaper, WWW) can be
    added without special transformations (beyond
    data formatting)

14
Approaches to Arabic Speech Recognition - 2
  • Keep the compound words as-is
  • Increases OOV but
  • each word is essentially a phrase; treating it
    as a unique word results in a very strong
    language model (improves recognition accuracy)
  • no need to stem compound words
  • you can always use a bigger dictionary
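
The trade-off between dictionary size and OOV rate can be measured directly. The sketch below assumes the dictionary is simply the N most frequent training words; the token lists are made-up placeholders, not data from the presentation.

```python
from collections import Counter


def top_n_vocabulary(train_tokens, n):
    """Keep the n most frequent training words as the dictionary."""
    counts = Counter(train_tokens)
    return {word for word, _ in counts.most_common(n)}


def oov_rate(test_tokens, vocabulary):
    """Fraction of running test words that are missing from the dictionary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


# Placeholder tokens standing in for unsegmented compound words.
train = "a a a b b c d".split()
test = "a b c e".split()

small = oov_rate(test, top_n_vocabulary(train, 2))   # dictionary {a, b}
large = oov_rate(test, top_n_vocabulary(train, 4))   # dictionary {a, b, c, d}
print(f"OOV with 2 words: {small:.2f}, with 4 words: {large:.2f}")
```

Growing the dictionary lowers the OOV rate without any stemming step, which is the argument made on the slide.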

15
Initial Speech Recognition System
  • 65,000-word lexicon (4% OOV)
  • 37.5 hours of acoustic data
  • 230K word LM (using acoustic transcripts only)
  • 4.5 hour test set consisting of various news
    broadcasts from Egyptian and Syrian TV/radio
  • Word Error Rate (WER) was 31.2%
  • After adding 145 million words to the LM
  • WER reduced to 26.6%
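
For scale, the gain from the larger LM can be expressed as a relative reduction. This small calculation assumes the two WER figures above are percentages; the relative figure is derived here and does not appear on the slides.

```python
def relative_reduction(before, after):
    """Relative improvement of an error rate, as a fraction of the baseline."""
    return (before - after) / before


baseline_wer = 31.2    # LM built from acoustic transcripts only
augmented_wer = 26.6   # after adding 145 million words of LM text

absolute = baseline_wer - augmented_wer
relative = relative_reduction(baseline_wer, augmented_wer)
print(f"absolute: {absolute:.1f} points, relative: {relative * 100:.1f}%")
```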

16
System Improvements
  • New capabilities
  • Online MLLR adaptation
  • Ability to run with lexicons greater than 65,000
    words
  • Increased acoustic and language modeling data
  • Lexicon refinements
  • System optimization and tuning

17
Evaluating Speech Recognition Performance
  • In written Arabic the hamza (glottal stop) alif
    (long vowel /aa/) pair has four different
    orthographic variations
  • Inconsistencies do not affect readability
  • But these inconsistencies are marked wrong by
    existing scoring
  • And this is a very common construct!
  • Options
  • Do nothing: reported performance does not reflect
    user experience
  • Replace orthographic variants with one single
    form throughout the system: readable output, but
    not natural
  • Replace orthographic variants with one single
    form only during scoring: best of both worlds. In
    other words, ignore alif-hamza variations
    in scoring.
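
The scoring-only option can be sketched as a normalization step applied to reference and hypothesis just before the error count. The slides do not list the four variants; this sketch assumes they are bare alif, alif with hamza above, alif with hamza below and alif with madda, and the function names are hypothetical. WER is computed here with a plain word-level edit distance, which may differ in detail from the scoring tool actually used.

```python
# Assumption: these three forms are folded onto bare alif (U+0627).
ALIF_VARIANTS = {
    "\u0622": "\u0627",  # alif with madda
    "\u0623": "\u0627",  # alif with hamza above
    "\u0625": "\u0627",  # alif with hamza below
}
ALIF_TABLE = str.maketrans(ALIF_VARIANTS)


def normalize_alif(token):
    """Collapse alif-hamza spelling variants to one form."""
    return token.translate(ALIF_TABLE)


def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp, normalize=False):
    """Word error rate; optionally ignore alif-hamza variation."""
    if not ref:
        raise ValueError("reference must not be empty")
    if normalize:
        ref = [normalize_alif(t) for t in ref]
        hyp = [normalize_alif(t) for t in hyp]
    return word_errors(ref, hyp) / len(ref)


# The first word differs only in how the initial alif is written.
reference = ["\u0623\u062D\u0645\u062F", "\u0643\u062A\u0628"]
hypothesis = ["\u0627\u062D\u0645\u062F", "\u0643\u062A\u0628"]
raw_wer = wer(reference, hypothesis)
scored_wer = wer(reference, hypothesis, normalize=True)
print(raw_wer, scored_wer)
```

Because the normalization lives only in the scorer, the recognizer output shown to users keeps whatever spelling the system produced.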

18
Speech Recognition Performance
19
Conclusions
  • Using letters as phonemes has several advantages
  • Much easier to transcribe (can probably get twice
    as much data for same cost)
  • Allows simple use of large text corpora
  • Increasing vocabulary to 128K
  • Reduces OOV problem (for Arabic without vowels)
  • Avoids morphology problem of which combinations
    are legal
  • Getting sufficient data is still most important