Title: A Multi-Perspective Evaluation of the NESPOLE! Speech-to-Speech Translation System
Slide 1: A Multi-Perspective Evaluation of the NESPOLE! Speech-to-Speech Translation System
- Alon Lavie, Carnegie Mellon University
- Florian Metze, University of Karlsruhe
- Roldano Cattoni, ITC-irst
- Erica Costantini, University of Trieste
Slide 2: Outline
- The NESPOLE! Project
- Approach and System Architecture
- Performance and Usability Challenges
  - Distributed real-time performance over the Internet
  - Integration and use of multi-modal capabilities
  - End-to-end translation performance
- Lessons learned and conclusions
Slide 3: The NESPOLE! Project
- Speech-to-speech translation for E-Commerce applications
- Partners: CMU, University of Karlsruhe, ITC-irst, UJF-CLIPS, AETHRA, APT-Trentino
- Builds on successful collaboration within C-STAR
- Improved limited-domain speech translation
- Experiments with multimodality and with MEMT (multi-engine machine translation)
- Showcase-1: Travel and Tourism in Trentino, completed and demonstrated in Nov-2001
- Showcase-2: expanded travel and medical service domain
Slide 4: Speech-to-Speech in E-Commerce
- Replace current passive web E-commerce with live interaction capabilities
- Client starts via the web and can easily connect to an agent for specific information
- Thin client: very little special hardware and software on the client PC (browser, MS NetMeeting, shared whiteboard)
Slide 5: NESPOLE! User Interfaces
Slide 6: NESPOLE! Architecture
Slide 7: Distributed S2S Translation over the Internet
Slide 8: Network Traffic Impact
Slide 9: NESPOLE! Monitor
Slide 10: Aethra Whiteboard
Slide 11: Recent Developments (Apr-02)
- Improved analysis and generation grammars (using old C-STAR data)
- Improved SR engines
- Packet-loss, video, and modem connection tests
- Data collection for Showcase-2A
- Evaluation Scheme Experiment
- Paper and Demo at HLT-02
- Paper submissions to ACL-02, ICSLP-02, ESSLLI-02
Slide 12: IF (Interchange Format) Status Report
Slide 13: WP5 HLT Modules
- Data collection for Showcase-2A completed in February 2002
- Status of transcriptions from all sites?
- CMU will maintain a data repository (Alon collecting all data CDs here)
- IF discussions and development have already started (Donna)
- Development schedule?
Slide 14: WP7 Evaluation
- D9 (Evaluation of Showcase-1) report draft circulated earlier this week
- Each site should verify that the most up-to-date results are being reported
- Include detailed tables in the report?
- Majority vote: finalize a common procedure
- New evaluation experiments
Slide 15: Majority Vote Scheme
- Issue: did all sites use the same guidelines?
- What to do when there is no majority? (see the sketch below)
  - e.g., 4 graders assign P/P/K/K
- What to do when there is complete disagreement?
  - e.g., 3 graders assign P/K/B
- Do scores from the previous evaluation need to be recalculated?
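A minimal sketch of a strict-majority vote, illustrating the two problem cases above. The labels (P/K/B) follow the examples on this slide, but treating ties and complete disagreement as "no majority" is an assumed policy, not the project's finalized procedure.

```python
from collections import Counter

# Minimal sketch of a strict majority vote over grader labels.
# Returning None for ties and complete disagreement is an assumed
# policy; the finalized procedure may resolve these cases differently.
def majority_grade(grades):
    """Return the strict-majority grade, or None if there is none."""
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else None

print(majority_grade(["P", "P", "P", "K"]))  # 'P'  (clear majority)
print(majority_grade(["P", "P", "K", "K"]))  # None (no majority)
print(majority_grade(["P", "K", "B"]))       # None (complete disagreement)
```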
Slide 16: New Evaluation Experiments
- We are investigating three main issues:
  - Binary versus 3-way grading
  - Majority vote versus averaging of scores (see the averaging sketch below)
  - Intercoder and intracoder agreement
- Grading experiment:
  - Four groups, three graders in each group
  - Each group grades two sets, two weeks apart
  - Sets are different but have a large common overlap
  - Groups differ in the evaluation scheme used (binary/3-way)
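As a contrast to majority voting, averaging requires mapping grades onto numbers. The scale below (P = 1.0, K = 0.5, B = 0.0) is a hypothetical mapping chosen for illustration; the scale actually used in the experiment may differ.

```python
# Hypothetical numeric scale for averaging 3-way grades (assumption).
SCORE = {"P": 1.0, "K": 0.5, "B": 0.0}

def average_score(grades):
    """Mean numeric score over all graders' labels."""
    return sum(SCORE[g] for g in grades) / len(grades)

# Averaging still yields a usable score where majority vote fails:
print(average_score(["P", "P", "K", "K"]))  # 0.75
```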
Slide 17: Planned Analysis of Data
- Compare results across grading schemes (binary vs. 3-way) on the same set of data
- Compare majority scores with average scores
- Evaluate intercoder agreement between graders (on the same set and same scheme)
- Evaluate intracoder agreement of the same grader (on overlap data in the two sets, same grading scheme in both sessions); a sketch of both measures follows below
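A sketch of the two agreement measures, assuming simple percent agreement as the statistic; a chance-corrected measure such as Cohen's kappa could be substituted without changing the experimental design.

```python
from itertools import combinations

def percent_agreement(a, b):
    """Fraction of items on which two gradings assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def intercoder_agreement(gradings):
    """Mean pairwise agreement among graders on the same set and scheme."""
    pairs = list(combinations(gradings, 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)

# Intracoder agreement: the same grader on the overlap data,
# two weeks apart, using the same grading scheme in both sessions.
week1 = ["P", "K", "B", "P"]
week2 = ["P", "K", "K", "P"]
print(percent_agreement(week1, week2))  # 0.75
print(intercoder_agreement([week1, week2, ["P", "B", "B", "P"]]))
```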
Slide 18: Preliminary Results
Group (procedure W1/W2)   W1 Acc (%)   W1 Bad (%)   W2 Acc (%)   W2 Bad (%)
Gr1 (binary/3-way)        50.2         49.8         48.7         51.3
Gr2 (3-way/binary)        52.4         47.6         48.8         51.2
Gr3 (3-way/3-way)         53.8         46.2         54.9         45.1
Gr4 (binary/binary)       49.0         51.0         50.0         50.0
Slide 19: Plans for Final Evaluations
- Improved end-to-end evaluations
- Additional component evaluations?
- Additional user studies?
- How do we evaluate user interfaces and communication effectiveness?