Title: Context in Multilingual Tone and Pitch Accent Recognition
1Context in Multilingual Tone and Pitch Accent
Recognition
- Gina-Anne Levow
- University of Chicago
- September 7, 2005
2Roadmap
- Motivating Context
- Data Collection & Processing
- Modeling Context for Tone and Pitch Accent
- Context in Recognition
- Conclusion
3Challenges
- Tone and Pitch Accent Recognition
- Key component of language understanding
- Lexical tone carries word meaning
- Pitch accent carries semantic, pragmatic, discourse meaning
- Non-canonical form (Shen 90, Shih 00, Xu 01)
- Tonal coarticulation modifies surface realization
- In extreme cases, fall becomes rise
- Tone is relative
- To speaker range
- High for male may be low for female
- To phrase range, other tones
- E.g. downstep
4Strategy
- Common model across languages, SVM classifier
- Acoustic-prosodic model: no word label, POS, lexical stress info
- No explicit tone label sequence model
- English, Mandarin Chinese (also Cantonese)
- Exploit contextual information
- Features from adjacent syllables
- Height, shape direct, relative
- Compensate for phrase contour
- Analyze impact of
- Context position, context encoding, context type
- > 20% relative improvement over no context
- Preceding context gives greater enhancement than following
5Data Collection & Processing
- English (Ostendorf et al, 95)
- Boston University Radio News Corpus, f2b
- Manually ToBI annotated, aligned, syllabified
- Pitch accent aligned to syllables
- Unaccented, High, Downstepped High, Low
- (Sun 02, Ross & Ostendorf 95)
- Mandarin
- TDT2 Voice of America Mandarin Broadcast News
- Automatically force aligned to anchor scripts (CUSonic)
- High, Mid-rising, Low, High falling, Neutral
6Local Feature Extraction
- Uniform representation for tone, pitch accent
- Motivated by Pitch Target Approximation Model
- Tone/pitch accent target exponentially approached
- Linear target height, slope (Xu et al, 99)
- Scalar features
- Pitch, Intensity: max, mean (Praat, speaker normalized)
- Pitch at 5 points across voiced region
- Duration
- Initial, final in phrase
- Slope
- Linear fit to last half of pitch contour
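The scalar features listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the system's actual extraction code: it assumes pitch and intensity have already been extracted (e.g. with Praat) and speaker-normalized, and the function names, the dictionary keys, and the index-based choice of the 5 sample points are all hypothetical.

```python
from statistics import mean


def linear_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (0.0 if all xs are equal)."""
    mx, my = mean(xs), mean(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom


def sample_points(values, n=5):
    """Return n samples evenly spaced by index across the contour."""
    last = len(values) - 1
    return [values[round(i * last / (n - 1))] for i in range(n)]


def local_features(times, pitch, intensity,
                   phrase_initial=False, phrase_final=False):
    """Scalar features for one syllable's voiced region.

    times, pitch, intensity are equal-length lists of frame values
    (assumed to be speaker-normalized already).
    """
    half = len(pitch) // 2
    points = sample_points(pitch)
    return {
        "pitch_max": max(pitch),
        "pitch_mean": mean(pitch),
        "pitch_points": points,
        "pitch_mid": points[2],          # middle of the 5 sampled points
        "intensity_max": max(intensity),
        "intensity_mean": mean(intensity),
        "duration": times[-1] - times[0],
        "phrase_initial": phrase_initial,
        "phrase_final": phrase_final,
        # Slope: linear fit to the last half of the pitch contour.
        "slope": linear_slope(times[half:], pitch[half:]),
    }
```

Fitting the slope only to the last half of the contour matches the slide's bullet; the intuition from the pitch target approximation model is that the later part of the syllable lies closer to the underlying target.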
7Context Features
- Local context
- Extended features
- Pitch max, mean, adjacent points of preceding, following syllables
- Difference features
- Difference between
- Pitch max, mean, mid, slope
- Intensity max, mean
- Of preceding, following and current syllable
- Phrasal context
- Compute collection average phrase slope
- Compute scalar pitch values, adjusted for slope
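The two local-context encodings above (extended vs. difference features) might be sketched as below. This is an illustrative assumption rather than the original implementation: each syllable is taken to be a dictionary of scalar features, the key names are hypothetical, and the sign convention (current minus neighbor) is a guess. Passing None for a neighbor drops that side of the context, which mirrors the preceding / following / both / none conditions.

```python
# Hypothetical feature keys; the slides name the quantities, not the identifiers.
EXTENDED_KEYS = ("pitch_max", "pitch_mean")
DIFF_KEYS = ("pitch_max", "pitch_mean", "pitch_mid", "slope",
             "intensity_max", "intensity_mean")


def extended_features(prev, nxt, keys=EXTENDED_KEYS):
    """Copy selected features of the neighboring syllables into the vector."""
    feats = {}
    for name, syl in (("prev", prev), ("next", nxt)):
        if syl is None:          # this side of the context is not used
            continue
        for k in keys:
            feats[f"{name}_{k}"] = syl[k]
    return feats


def difference_features(cur, prev, nxt, keys=DIFF_KEYS):
    """Encode context as current-minus-neighbor differences (assumed sign)."""
    feats = {}
    for name, syl in (("prev", prev), ("next", nxt)):
        if syl is None:
            continue
        for k in keys:
            feats[f"diff_{name}_{k}"] = cur[k] - syl[k]
    return feats
```

For example, `difference_features(cur, prev, None)` would correspond to a preceding-only difference condition.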
8Classification Experiments
- Classifier: Support Vector Machine
- Linear kernel
- Multiclass formulation
- (SVMlight, Joachims), LibSVM (Chang & Lin 01)
- 4:1 training / test splits
- Experiments: Effects of
- Context position: preceding, following, none, both
- Context encoding: Extended/Difference
- Context type: local, phrasal
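9Results: Local Context
Accuracy (%). L = preceding context, R = following context, LR = both.

| Context | Mandarin Tone | English Pitch Accent |
|---|---|---|
| Full | 74.5 | 81.3 |
| Extend LR | 74.0 | 80.7 |
| Extend L | 74.0 | 79.9 |
| Extend R | 70.5 | 76.7 |
| Diffs LR | 75.5 | 80.7 |
| Diffs L | 76.5 | 79.5 |
| Diffs R | 69.0 | 77.3 |
| Both L | 76.5 | 79.7 |
| Both R | 71.5 | 77.6 |
| No context | 68.5 | 75.9 |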
13Discussion Local Context
- Any context information improves over none
- Preceding context information consistently improves over none or following context information
- English: Generally more context features are better
- Mandarin: Following context can degrade
- Little difference in encoding (Extend vs Diffs)
- Consistent with phonological analysis (Xu) that coarticulation is carryover, not anticipatory
14Results & Discussion: Phrasal Context
Accuracy (%).

| Phrase Context | Mandarin Tone | English Pitch Accent |
|---|---|---|
| Phrase | 75.5 | 81.3 |
| No Phrase | 72.0 | 79.9 |

- Phrase contour compensation enhances recognition
- Simple strategy
- Use of non-linear slope compensation may improve results
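One possible reading of the simple phrase contour compensation is sketched below: estimate a single average slope over all phrases in the collection, then subtract the drift that slope predicts from each pitch value according to its position in the phrase. The function names and the exact form of the adjustment are assumptions made for illustration; the slides only state that a collection average phrase slope is computed and pitch values are adjusted for it.

```python
from statistics import mean


def linear_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (0.0 if all xs are equal)."""
    mx, my = mean(xs), mean(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom


def average_phrase_slope(phrases):
    """phrases: list of (times, pitches) pairs, one per phrase.

    Returns the mean of the per-phrase linear slopes (assumed definition
    of the "collection average phrase slope").
    """
    return mean(linear_slope(t, p) for t, p in phrases)


def compensate(pitch, t, phrase_start, avg_slope):
    """Remove the pitch drift predicted by the average phrase slope."""
    return pitch - avg_slope * (t - phrase_start)
```

With a negative average slope (declination), values late in the phrase are raised, so a high target near the end of a phrase is compared on a more equal footing with one near the start.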
15Conclusion
- Employ common acoustic representation
- Tone (Mandarin), pitch accent (English)
- Cantonese, recent experiments
- SVM classifiers, linear kernel: 76% (Mandarin), 81% (English)
- Local context effects
- Up to > 20% relative reduction in error
- Preceding context greatest contribution
- Carryover vs anticipatory
- Phrasal context effects
- Compensation for phrasal contour improves
recognition
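The relative error reduction figure can be checked against the local-context results with a small calculation. The sketch below uses the reported accuracies (no context vs. best context: 68.5 to 76.5 for Mandarin, 75.9 to 81.3 for English); the function name is just illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error rate (%), given accuracies in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - improved_acc
    return 100.0 * (base_err - new_err) / base_err


mandarin = relative_error_reduction(68.5, 76.5)  # error 31.5 -> 23.5
english = relative_error_reduction(75.9, 81.3)   # error 24.1 -> 18.7
print(f"Mandarin: {mandarin:.1f}%  English: {english:.1f}%")
```

Both values come out above 20% (roughly 25% and 22%), in line with the summary bullet.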
16Current & Future Work
- Application of model to different languages
- Cantonese, Dschang (Bantu family)
- Cantonese: 65% acoustic only, 85% w/segmental
- Integration of additional contextual influence
- Topic, turn, discourse structure
- HMSVM, GHMM models
- http://people.cs.uchicago.edu/levow/projects/tai
- Supported by NSF Grant 0414919
17Confusion Matrix (English)
Rows: recognized tone. Columns: manually labeled tone. Entries: % of each manually labeled class (count/total).

| Recognized | Unaccented | High | Low | D.S. High |
|---|---|---|---|---|
| Unaccented | 95 (888/934) | 25 (110/440) | 100 (12/12) | 53.5 (61/114) |
| High | 4.6 (43/934) | 73 (322/440) | 0 | 38.5 (44/114) |
| Low | 0 | 0 | 0 | 0 |
| D.S. High | 0.3 (3/934) | 2 (8/440) | 0 | 8 (9/114) |
19Confusion Matrix (Mandarin)
Rows: recognized tone. Columns: manually labeled tone. Entries: % of each manually labeled class (count/total where given).

| Recognized | High | Mid-Rising | Low | High-Falling | Neutral |
|---|---|---|---|---|---|
| High | 84 (38/45) | 9 (5/56) | 5 (1/20) | 13 (9/68) | 0 |
| Mid-Rising | 6.7 (3/45) | 78.6 (44/56) | 10 (2/20) | 7 (5/68) | 27.3 (3/11) |
| Low | 0 | 3.6 (2/56) | 70 (14/20) | 7 (5/68) | 27.3 |
| High-Falling | 7.4 (4/45) | 3.6 (2/56) | 10 (2/20) | 70 (48/68) | 0 |
| Neutral | 0 | 5.3 (3/56) | 5 (1/20) | 1.5 (1/68) | 45 |
21Related Work
- Tonal coarticulation
- Xu & Sun 02; Xu 97; Shih & Kochanski 00
- English pitch accent
- X. Sun 02; Hasegawa-Johnson et al 04; Ross & Ostendorf 95
- Lexical tone recognition
- SVM recognition of Thai tone: Thubthong 01
- Context-dependent tone models
- Wang & Seneff 00; Zhou et al 04
22Pitch Target Approximation Model
- Pitch target
- Linear model
- Exponentially approximated
- In practice, assume target well-approximated by
mid-point (Sun, 02)
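A minimal sketch of the idea on this slide: a linear pitch target that the surface contour approaches exponentially over the syllable. The functional form below (linear target plus a decaying exponential offset) is the commonly cited formulation of the pitch target approximation model as best recalled here, and the parameter names `slope`, `height`, `beta`, and `lam` are illustrative assumptions, not notation from the slides.

```python
import math


def pitch_target_contour(t, slope, height, beta, lam):
    """Surface pitch at time t (measured from syllable onset).

    target(t) = slope * t + height        (the linear pitch target)
    surface   = target(t) + beta * exp(-lam * t)
    beta is the initial offset from the target (carried over from the
    preceding syllable); lam > 0 controls how quickly it decays.
    """
    target = slope * t + height
    return target + beta * math.exp(-lam * t)


# Example: a static high target (slope 0) approached from 20 units above.
contour = [pitch_target_contour(t / 10, 0.0, 100.0, 20.0, 5.0)
           for t in range(11)]
```

Because the offset decays over time, the later part of the syllable lies closest to the target, which is consistent with approximating the target from the syllable mid-point onward and with the carryover coarticulation effects discussed earlier.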