Title: Bridging the Gap between Linguists and Technology Developers
1 Bridging the Gap between Linguists and Technology Developers: Large-Scale, Sociolinguistic Annotation for Dialect and Speaker Recognition
- Christopher Cieri (1), Stephanie Strassel (1), Meghan Glenn (1), Reva Schwartz (2), Wade Shen (3), Joseph Campbell (3)
1. Linguistic Data Consortium, 3600 Market Street, Suite 810, Philadelphia, PA 19104; ccieri, strassel, mlglenn_at_ldc.upenn.edu
2. United States Secret Service, Washington, DC; reva.schwartz_at_usss.dhs.gov
3. MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02421; swade, jpc_at_ll.mit.edu
This work is sponsored by the Department of
Homeland Security under Air Force Contract
FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the
authors and are not necessarily endorsed by the
United States Government.
2 Introduction to Phanotics
- Increased interest in the speaker recognition community in high-level features that abstract from the acoustic signal
  - lexical choice, presence of idiomatic expressions, syntactic structures
- Forensic applications require robustness to channel differences
  - channel adaptation
  - identification of features inherently robust to channel difference
- Language recognition community increasingly concerned with mutually intelligible dialects, not just languages
- Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak.
- Phanotics (Phonetic Annotation of Typicality in Conversational Speech) seeks to
  - identify high-level features characteristic of American dialects
  - annotate a corpus for these features
  - use the data to develop dialect recognition systems
  - use the categorization to create better models for speaker recognition
- Sponsored by United States Secret Service
- MIT Lincoln Laboratory coordinates effort and develops the systems
- Linguists from Arizona State and Old Dominion universities consult on dialectal phenomena
- LDC and Appen Pty Ltd of Australia annotate data provided by LDC and other sources
3 Annotation Approach
- Annotating large corpora for many high-level features is impractical without
  - existing data
  - annotations
  - technologies that simplify the annotator's task
- Phanotics uses orthographically transcribed data as a guide to potential loci for the features sought
  - orthographic transcripts, a pronouncing lexicon and a forced aligner generate a putative, time-aligned phonetic transcription that assumes the speakers' utterances were standard
  - high-level features of interest are described as deviations from standard pronunciation
  - loci in which actual pronunciation differs from the putative standard are potential high-level features
- Since
  - complete phonetic transcription is cost-prohibitive
  - automatic phonetic transcription is not adequately accurate
  - we lack dialect studies for every difference one might encounter
- we do not count deviations directly but allow the technologies to guide human annotators to expected features.
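The step from orthographic transcript to putative phonetic transcription can be illustrated with a minimal sketch. It assumes word-level time stamps are already available from a forced aligner; the lexicon entries, the ARPAbet-style phone symbols, the function name `putative_phones` and the even spreading of phone times across each word are all hypothetical simplifications, not the project's actual pipeline.

```python
# Toy pronouncing lexicon (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "the": ["DH", "AH"],
    "car": ["K", "AA", "R"],
    "left": ["L", "EH", "F", "T"],
}

def putative_phones(aligned_words, lexicon):
    """Expand time-aligned words into their standard (lexicon) phones.

    aligned_words: list of (word, start_sec, end_sec) tuples, assumed to
    come from a forced aligner. Phone times are spread evenly across the
    word, which is a simplification for illustration only.
    """
    phones = []
    for word, start, end in aligned_words:
        pron = lexicon.get(word.lower())
        if pron is None:
            continue  # out-of-lexicon words yield no putative phones here
        step = (end - start) / len(pron)
        for i, phone in enumerate(pron):
            phones.append((phone, start + i * step, start + (i + 1) * step, word))
    return phones

words = [("the", 0.0, 0.2), ("car", 0.2, 0.8), ("left", 0.8, 1.2)]
transcription = putative_phones(words, LEXICON)
for phone, start, end, word in transcription:
    print(f"{phone:3s} {start:.2f}-{end:.2f} ({word})")
```

Each output tuple records what a standard pronunciation would have been at that point in the audio; an annotator's judgment about what was actually said is then compared against it.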
4 Requirements
- Requires natural speech from speakers of target dialects
  - initial focus on distinguishing African American Vernacular English (AAVE) from all other dialects of American English (non-AAVE)
  - plan to investigate other American dialects later
- Selected data collected to minimize the effect of observation
  - recordings of subjects engaged in conversations
- Project requires subjects categorized according to the dialect spoken
  - since the goal is to establish typicality of features by dialect, categorization is based on something other than the features themselves
  - relied on self-reported metadata
    - AAVE
      - native speakers of American English
      - born and raised in the United States
      - ethnically African American
    - Non-AAVE
      - American English speakers of other ethnicities
- Remove subjects from either pool who later appear mis-categorized.
5 Data Selection
- Mixer Corpora
  - conversational telephone speech (CTS) from LDC; supports robust speaker recognition (SR) development
  - subjects provided age, sex, occupation, cities born/raised, ethnicity
  - subjects completed
    - >10 six-minute calls
    - speaking to other subjects whom they typically did not know
    - about assigned topics
  - bilinguals in Arabic, Mandarin, Russian, and Spanish used those languages and English
  - 7 calls in cross-channel recording room (8 microphones on one side of call)
  - calls audited for topic and audio quality but not generally transcribed
- Although not designed for the current effort, includes self-reported ethnicity
  - pool contains speakers of multiple American English dialects who categorized themselves as African American and other ethnicities
- 126 Mixer calls transcribed by Phanotics project
  - 35 include conversations between two speakers of AAVE
  - 91 include conversations between one AAVE and one non-AAVE speaker
6 Data Selection
- Fisher Corpus
  - collected at LDC to support speech-to-text (STT) development within DARPA EARS
  - subjects provided age, sex, native language, and the cities where they were born and raised
  - subjects completed 1-25 10-minute calls, speaking to other participants, whom they typically did not know, about assigned topics
  - calls audited for topic and quality
  - verbatim, time-aligned orthographic transcripts were produced
  - lacks crucial information on the ethnicity of the speaker
    - but some subjects were LDC employees, their family, friends, and colleagues
    - small number (171) could be assigned to an ethnic category after the fact
- StoryCorps Griot Initiative
  - funded by the Corporation for Public Broadcasting in the US
  - one-year effort to record one-hour interviews of African Americans
  - nine recording locations open for up to six weeks each
  - subjects interview friends and family on topics of their choice
  - potential users receive instructions on conducting good interviews; trained facilitator present
  - participants receive a free copy of their interview; other copies are archived and distributed
  - StoryCorps provides Phanotics selected interviews in exchange for transcripts
- Sociolinguistic Interviews
  - recorded and contributed by researchers working in the United States
7 Transcription
- Most audio lacked transcripts; LDC designed a spec for this project
  - similar to Fisher Quick Transcription specification
  - emphasizes speed and accuracy
  - annotators segment speech at sentence level
  - sentences further segmented if >8 seconds, >0.5 seconds internal silence
  - segments overlap; audio containing no speech left un-segmented
  - standard orthography, case, punctuation (period, question mark, comma)
  - "--" marks incomplete sentences and restarts; "-" marks incomplete words
  - proper names, acronyms, letter strings capitalized
  - uttered numbers written as words, not as strings of digits
  - limited set of standard contractions is used
  - non-standard contractions (cause for because) written as the full word
  - obviously mispronounced, idiosyncratic words are tagged
  - no other attempt made to mark dialectal pronunciation
    - accomplished in annotation phase
  - limited set of non-lexemes (um, uh) used in filled pauses
  - speech errors transcribed as produced
  - limited time spent transcribing disfluencies since these will be rejected
  - background noises not marked; limited set of markers for speaker noises
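The segmentation thresholds above can be sketched in code under one stated reading: a sentence-level segment longer than 8 seconds is split at any internal silence longer than 0.5 seconds. The slide does not say how the two thresholds combine, so this reading, the function name `split_long_segment` and the input format are assumptions made for illustration.

```python
def split_long_segment(start, end, silences, max_len=8.0, min_silence=0.5):
    """Split the segment [start, end] at long internal silences.

    silences: list of (sil_start, sil_end) pairs inside the segment
    (hypothetical input, e.g. from an alignment or a silence detector).
    Segments no longer than max_len are returned unchanged. Assumed
    reading of the spec: split only when BOTH thresholds are exceeded.
    """
    if end - start <= max_len:
        return [(start, end)]
    pieces = []
    cursor = start
    for sil_start, sil_end in sorted(silences):
        if sil_end - sil_start > min_silence and sil_start > cursor:
            pieces.append((cursor, sil_start))  # speech up to the silence
            cursor = sil_end                    # resume after the silence
    if cursor < end:
        pieces.append((cursor, end))
    return pieces

# An 11-second segment with one short and one long internal silence.
print(split_long_segment(0.0, 11.0, [(3.0, 3.2), (5.0, 5.8)]))
# → [(0.0, 5.0), (5.8, 11.0)]
```

Under this sketch a piece may still exceed 8 seconds when no sufficiently long silence is available; how the actual specification handles that case is not stated in the slides.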
8 Feature Annotation
- Goal: identify features that distinguish dialect from standard
  - features described as rules that change standard into non-standard
  - rules apply variably according to internal and external constraints
    - lexical identity, morphology of affected word, position within sentence, phonological environment, functional effect of change (for example, whether it neutralizes a distinction between two words), the age, sex, socioeconomic class of speakers, dialects they speak
- Examples
  - reduction of consonant clusters in final position
    - left > lef, missed > miss
  - deletion of r, l, w
    - car > ca, palm > pam, young ones > young uns
  - change of the voiced and voiceless interdental fricatives into stops
    - bother > boda
- Data preparation, customized tools simplify the annotation process
  - rules specified as a > b / x_y
    - a becomes b when preceded by x and followed by y
    - input + environment, xay, constitute the search term
    - input + output, a > b, constitute a question to be answered by a human
      - Did the subject say xay or xby?
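The rule format a > b / x_y can be sketched as a small data structure plus a search over the putative phone sequence. This is only an illustration: the `Rule` class, the `find_rois` and `question` helpers, the ARPAbet-style phone symbols and the use of "#" as a word-boundary marker are assumptions, not part of the project's tooling.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    a: tuple  # input phones in the standard pronunciation
    b: tuple  # output phones (empty tuple means deletion)
    x: tuple  # left context
    y: tuple  # right context

def find_rois(phones, rule):
    """Return indices where the search term x+a+y occurs in phones.

    Each returned index points at the first phone of `a`, i.e. the
    locus where the rule may have applied.
    """
    term = rule.x + rule.a + rule.y
    n = len(term)
    hits = []
    for i in range(len(phones) - n + 1):
        if tuple(phones[i:i + n]) == term:
            hits.append(i + len(rule.x))
    return hits

def question(rule):
    """Build the annotator's question: was it xay or xby?"""
    standard = " ".join(rule.x + rule.a + rule.y)
    variant = " ".join(rule.x + rule.b + rule.y)
    return f"Did the subject say '{standard}' or '{variant}'?"

# Final cluster reduction (left > lef): T is deleted after F at a word end.
cluster_reduction = Rule(a=("T",), b=(), x=("F",), y=("#",))
phones = ["L", "EH", "F", "T", "#", "DH", "AH", "#"]

print(find_rois(phones, cluster_reduction))  # → [3]
print(question(cluster_reduction))
# → Did the subject say 'F T #' or 'F #'?
```

The search only proposes candidate loci; as the slide states, whether the rule actually applied is a question left to the human annotator.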
9 Feature Annotation
- SPAAT (Super Phonetic Annotation and Analysis Tool) designed for rapid annotation and analysis
  - for each feature, presents list of regions of interest (ROI) where the rule may have applied
  - since transcript and audio were previously forced-aligned, annotator can listen to the audio with a small amount of preceding and following context
- Annotator's job is to decide whether or not the rule has applied.
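Playing an ROI with surrounding context amounts to padding its aligned time span and clamping the result to the recording. The sketch below is a hypothetical helper, not SPAAT code; the function name `playback_window` and the 0.5-second default context are assumptions.

```python
def playback_window(roi_start, roi_end, audio_len, context=0.5):
    """Pad an ROI with context on both sides, clamped to the recording.

    All times are in seconds; `context` is an assumed default, since the
    slides only say "a small amount" of context is played.
    """
    return max(0.0, roi_start - context), min(audio_len, roi_end + context)

print(playback_window(2.0, 3.0, 10.0))    # → (1.5, 3.5)
print(playback_window(0.25, 9.75, 10.0))  # → (0.0, 10.0)
```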
10 Initial Results
- average time to annotate an ROI ranges 15-25
- Approach to measuring inter-annotator agreement (IAA)
  - distinguishes initial agreement, measured at the beginning of the effort to assess the difficulty of a task,
  - from measures repeated after thorough documentation has been created and annotators have undergone rigorous training, testing and selection
- Initial inter-annotator agreement varies by rule, rule type, annotator and annotator training
  - absolute average initial agreement across five annotators and all rules was 74.49% on a three-way decision where a feature is annotated as present, intermediate or absent
  - converted to a two-way decision (feature is present versus intermediate or absent), initial agreement climbs to 85.54%
  - pairwise agreement by chance in three-way and two-way decisions is, respectively, 11.1% and 25%
  - initial two-way agreement rates were 83.81% for rules involving substitutions and 91.95% for rules involving reductions and insertions
- Team now working to increase IAA
  - expanding training program, documentation to include audio examples
  - decision form is standard, non-standard, intermediate, unrelated to rule, indeterminate, ROI is mistaken
  - creating a small gold standard
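The slides do not state how the average agreement was computed. One common choice, assumed here, is mean pairwise percent agreement; the sketch below also shows how collapsing the three-way decision into a two-way decision can raise the figure. The labels are made-up toy data for three annotators, not project results, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Mean fraction of ROIs on which each pair of annotators agrees."""
    rates = []
    for a, b in combinations(labels_by_annotator, 2):
        same = sum(1 for la, lb in zip(a, b) if la == lb)
        rates.append(same / len(a))
    return sum(rates) / len(rates)

def collapse(labels):
    """Map the three-way decision onto present versus everything else."""
    return ["present" if label == "present" else "other" for label in labels]

# Made-up labels: three annotators, four ROIs.
annotations = [
    ["present", "intermediate", "absent", "present"],
    ["present", "absent", "absent", "present"],
    ["present", "intermediate", "absent", "intermediate"],
]

three_way = pairwise_agreement(annotations)
two_way = pairwise_agreement([collapse(a) for a in annotations])
print(f"three-way: {three_way:.2%}, two-way: {two_way:.2%}")
# → three-way: 66.67%, two-way: 83.33%
```

Collapsing categories can only merge disagreements, never create new ones, which is why the two-way figure is at least as high as the three-way figure.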
11 Summary
- Project connects sociolinguistics and HLT
- Seeks to determine typicality of high-level features in distinguishing dialect for forensic purposes
- Focuses initially on AAVE, later on other dialects of American English
- Uses existing audio from CTS and interviews
- Creates transcripts, audio-transcript time-alignments
- Combination of these with SPAAT speeds annotation
- Initial inter-annotator agreement encouraging
- Modifications of spec, training, tool expected to increase IAA
- Fisher audio and transcripts already available in LDC's Catalog
  - LDC2005S13 Fisher English Training Part 2, Speech
  - LDC2005T19 Fisher English Training Part 2, Transcripts
  - LDC2004S13 Fisher English Training Speech Part 1 Speech
  - LDC2004T19 Fisher English Training Speech Part 1 Transcripts
- Mixer audio in queue
- StoryCorps Griot and Sociolinguistic Interviews under negotiation
- To be distributed after use in the program
  - Mixer Transcripts
  - Annotations
  - possibly SPAAT