Technical Aspects of the CALO Recorder

Slides: 20
Provided by: csC76
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

1
Technical Aspects of the CALO Recorder
  • By
  • Satanjeev Banerjee
  • Thomas Quisel
  • Jason Cohen
  • Arthur Chan
  • Yitao Sun
  • David Huggins-Daines
  • Alex Rudnicky

2
Role of the CALO recorder
  • A centralized mechanism to collect all perceptual
    events
  • Speech, text
  • CMU provides technology for
  • Event recording
  • Speech recognition

3
Role of the CALO Recorder
  • One of the components of CAMPER
  • The four components
  • CALO recorder
  • Speechalizer
  • End-pointing Information
  • Prosodic Information
  • Speech Recognition
  • CAMSeg
  • Speech Segmentation
  • Understanding

4
An Architecture Diagram (Client Side)
  • Diagram components: Audio Capturing, Text Capturing
    through Keyboard, Other Events, Ring Buffers,
    End-Pointer, VU Meter, Speech Decoder, Storage

5
Persistence of Data
  • Background Intelligent Transfer Service (BITS)
  • Used to transfer data off-line

6
Technical Challenges in the Recorder
  • Threading
  • Audio Buffering
  • Time-synchronization
  • Real-time processing
  • End-pointing
  • Speech processing
  • Portability
  • Maintenance/Distribution

7
Threading
  • Several kinds of processing need to run concurrently
  • VU meter
  • Speech processing and higher-level understanding
  • Graphical user interface
  • A long development time was invested in making the
    communication between threads correct.
  • (By Thomas Quisel) See the architecture diagram on the
    next slide
  • Example issue: on some platforms, the WX
    implementation does not allow threads other than the
    GUI thread to call its drawing functions.
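
One common way to live with a GUI toolkit that only lets its own thread draw is to have worker threads post updates to a thread-safe queue, which the GUI thread drains on its own schedule. The sketch below illustrates that pattern in Python for brevity. It is not taken from the recorder, and every name in it (`vu_meter_worker`, `drain_updates`, `run_demo`) is hypothetical.

```python
import queue
import threading

def vu_meter_worker(updates, levels):
    """Runs in a background thread: posts levels instead of drawing them."""
    for level in levels:
        updates.put(("vu", level))

def drain_updates(updates, draw):
    """Call only from the GUI thread, e.g. from a timer or idle handler."""
    drawn = 0
    while True:
        try:
            kind, value = updates.get_nowait()
        except queue.Empty:
            return drawn
        draw(kind, value)  # the only place where drawing happens
        drawn += 1

def run_demo():
    """Stand-in for a GUI loop: start a worker, then drain on this thread."""
    updates = queue.Queue()
    worker = threading.Thread(target=vu_meter_worker,
                              args=(updates, [0.1, 0.5, 0.9]))
    worker.start()
    worker.join()
    seen = []
    drain_updates(updates, lambda kind, value: seen.append((kind, value)))
    return seen
```

In a real GUI the drain step would be triggered by the toolkit's own event mechanism rather than called once after the worker finishes.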

8
(No Transcript)
9
Audio Buffering
  • Sphinx 2 and 3.X libaudio require the application to
  • Capture audio
  • Do processing on the audio buffer
  • If the processing thread is even slightly slower than
    1xRT, audio will be lost
  • (By Jason Cohen) A ring buffer structure is
    implemented to address this.
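
The idea of decoupling capture from processing with a ring buffer can be sketched as follows. This is a minimal illustration in Python, not the recorder's actual implementation. The class name, the overwrite-oldest policy, and the `overruns` counter are all assumptions made for the example.

```python
import threading

class RingBuffer:
    """Fixed-size FIFO shared by a capture thread and a processing thread."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._capacity = capacity
        self._head = 0        # index of the oldest unread frame
        self._count = 0       # number of unread frames
        self.overruns = 0     # frames overwritten before being read
        self._cond = threading.Condition()

    def write(self, frame):
        """Called by the capture thread; never blocks."""
        with self._cond:
            if self._count == self._capacity:
                # Buffer full: drop the oldest frame (assumed policy).
                self._head = (self._head + 1) % self._capacity
                self._count -= 1
                self.overruns += 1
            tail = (self._head + self._count) % self._capacity
            self._buf[tail] = frame
            self._count += 1
            self._cond.notify()

    def read(self, timeout=None):
        """Called by the processing thread; returns None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._count > 0, timeout):
                return None
            frame = self._buf[self._head]
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
            return frame
```

A buffer like this absorbs short periods in which processing falls behind capture. If processing stays slower than real time, the buffer still fills eventually, and counting overruns makes that visible.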

10
Time Synchronization
  • By David Huggins-Daines
  • Simple NTP (SNTP) is used to get Coordinated
    Universal Time (UTC) from an arbitrary NTP server
  • Clone of the standard NTP implementation
  • Internal synchronization
  • Synchronization time between machines
  • 50-60 ms
  • The major challenge is the delay imposed by the OS
    and the audio capturing software.
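
For reference, SNTP clients conventionally estimate the clock offset and round-trip delay from four timestamps in a single request/response exchange. The sketch below shows that standard arithmetic only; the slides do not say how the recorder applies it. The function name and the millisecond example values are made up for illustration.

```python
def sntp_offset_and_delay(t1, t2, t3, t4):
    """Standard SNTP arithmetic on one request/response exchange.

    t1: client clock when the request was sent
    t2: server clock when the request was received
    t3: server clock when the reply was sent
    t4: client clock when the reply was received
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # server clock minus client clock
    delay = (t4 - t1) - (t3 - t2)            # network round trip only
    return offset, delay

# Hypothetical exchange, all values in milliseconds.
offset_ms, delay_ms = sntp_offset_and_delay(1000, 1060, 1062, 1012)
```

The offset estimate assumes the outbound and return network paths take equal time. Delays added by the operating system or audio stack before a timestamp is taken are not captured by this calculation.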

11
Real-time Processing
  • Role of end-pointing and recognition
  • After a long debate, a two-stage end-pointing and
    recognition architecture was chosen
  • By Ziad
  • A high-performance end-pointing routine was created
  • Gaussian Mixture Model-based
  • End-pointer implemented as a frame voter within
    segments
  • The parameters were further tuned manually.
  • Speed-optimized.
  • Now in s3ep, a customized version of Sphinx
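
As a rough illustration of frame voting within segments, the sketch below labels each frame by comparing a speech score against a non-speech score, then takes a majority vote over fixed-length segments. This is an assumption-laden sketch, not the s3ep algorithm: the scores are plain numbers standing in for GMM log-likelihoods, and the function names, segment length, and threshold are all hypothetical.

```python
def classify_frames(speech_ll, nonspeech_ll):
    """Label a frame as speech when its speech score is the higher one."""
    return [s > n for s, n in zip(speech_ll, nonspeech_ll)]

def vote_segments(frame_is_speech, segment_len, threshold=0.5):
    """Majority vote per segment: True means the segment counts as speech."""
    decisions = []
    for start in range(0, len(frame_is_speech), segment_len):
        segment = frame_is_speech[start:start + segment_len]
        speech_votes = sum(segment)  # True counts as 1
        decisions.append(speech_votes / len(segment) > threshold)
    return decisions

# Made-up per-frame scores for eight frames.
speech_ll = [-1, -1, -5, -1, -5, -5, -5, -1]
nonspeech_ll = [-2] * 8
frames = classify_frames(speech_ll, nonspeech_ll)
segments = vote_segments(frames, segment_len=4)
```

Voting over a segment smooths out isolated misclassified frames, which is one plausible reason to vote rather than act on each frame decision directly.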

12
(No Transcript)
13
Speech Recognizer
  • Resulting output is fed to the recognizer
  • Speech Recognition in meeting
  • Regarded as one of the biggest challenges in the
    field
  • Results vary widely with meeting style, number of
    attendees, topics, and speaker disfluencies.

14
Accuracy Performance (currently still under heavy work)
  • In the cleanest meeting (Bdb001)
  • With one very dominating male speaker
  • With one very dominating female speaker
  • Speaker speaking rate entropy is lowest
  • Error rate 29.4

15
Phase IV of Accuracy Improvement (Core)
  • Boosting-based training
  • Confidence-based N-best re-ranking
  • Speaker adaptation based on transformation
  • Speaker normalization
  • Include Broadcast News (BN) and Switchboard (SWB)
    material in language model (LM) training
  • Dictionary Refinement

16
Phase IV of Accuracy Improvement (Optional)
  • STC
  • MLLT
  • DT
  • PLP, TRAP
  • LM with disfluencies and back-channeling

17
Speed
  • 2.2 GHz machine
  • Communicator
  • S2: 17.3, 0.34xRT
  • S3.X BL: 11.8, 4xRT
  • S3.X Tuned: 12.8, 0.87xRT
  • WSJ 5k
  • S3.X BL: 7.4, 1.61xRT
  • S3.X BL: 8.3, 0.5xRT
  • ICSI
  • By tuning SVQ and CIGMMS, 0.7xRT is achieved.
  • The results may possibly be tuned further.
  • Benchmarking results need time to prepare.
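
The xRT figures above are real-time factors: processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. A minimal sketch of that arithmetic follows. The function names and the example durations are hypothetical and are not the benchmark data from the slide.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """xRT = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

def keeps_up_with_live_audio(xrt):
    """A live recognizer must run at or below 1xRT to avoid falling behind."""
    return xrt <= 1.0

# Hypothetical runs on 60 seconds of audio.
fast = real_time_factor(30.0, 60.0)    # 0.5xRT
slow = real_time_factor(240.0, 60.0)   # 4xRT
```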

18
Maintenance and Distribution
  • All in local CVS
  • C, Java
  • Will soon move to SRI
  • Regular releases are created; use of SRI's CVS
    will blur this line.

19
Conclusion
  • Engineering work is mostly done for the recorder.
  • Time to improve individual components.
  • Everyone is welcome to join the effort.