Title: Turn-Taking, Grounding and Speaker Segmentation
1 Turn-Taking, Grounding and Speaker Segmentation
2 Today
- Turn-taking behaviors in human-human conversation
- Conversational Analysis accounts
- Task/circumstance/individual dependencies
- Linguistic/cultural differences
- Grounding analyses
- Speech processing tasks
- Online turn identification for SDS
- Speaker diarization, segmentation, identification
3 Turn-taking Behavior
- How do speakers know when it is appropriate to
contribute to a conversation?
- Conversational Analysis theory: conversational
partners expect certain patterns of behavior in
normal conversation
- Pat: You got an A? That's great!
- Chris: Yeah, I'm really smart, you know.
- Chris: Well, I was just lucky. I happened to read
the chapter on dialogue systems right before the
test. Otherwise I would never have squeaked
through.
- General patterns in ordinary conversation
- Deviation is significant
4 Expectations of What to Say Depend on Task at Hand
- Telephone
- Openings
- Pat: Hello?
- Chris: Hi, Pat. It's Chris.
- Pat: Hi!
- Closings (6-turn)
- Chris: Well, I just wanted to see how you were
doing
- Pat: Thanks for calling. We'll have to have lunch
sometime
- Chris: I'd like to
- Pat: Okay
- Chris: Okay
- Pat: See you
- Chris: Yeah, see you
5
- Email / Chat
- Pat: Hi, can we switch lunch to 12:30? I'm
running late.
- Chris: Sure. 12:30.
- Pat: Great. See you.
- Service encounters
- Clerk: Good morning. Is there something I can
help you with?
- Pat: Hi. Yeah. I wonder if you could show me.
- Meetings
- Boss: Today I want to focus on next year's goal
statements. Chris, could you report please.
- Chris
- Boss: Pat, now let's hear from you
- Pat
- News broadcasts
- Anchor: Chris Smith reports from Rome now on the
upcoming conclave. Chris?
- Reporter: Thanks, Pat. ... And now back to Pat
Jones in New York.
6 Conversational Analysis (Sacks et al. 74)
- Can we characterize expectations of what to say
more generally?
- Rules of turn-taking
- If, during this turn, the current speaker has
selected A as the next speaker, then A must speak
next
- If the current speaker does not select the next
speaker, any other speaker may take the next turn
- If no one else takes the next turn, the current
speaker may take the next turn
- Rules apply at Transition Relevance Places (TRPs),
where something allows speaker changes to occur
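The three rules above can be read as an ordered decision procedure applied at each TRP. The following is only a minimal sketch of that reading: the function name `next_speaker`, its parameters, and the tie-break "first other speaker to start wins" are assumptions made for illustration, not part of Sacks et al.'s account.

```python
from typing import Optional, Sequence


def next_speaker(current: str,
                 selected: Optional[str],
                 self_selectors: Sequence[str]) -> str:
    """Pick the next speaker at a transition relevance place (TRP).

    current        -- who holds the turn now
    selected       -- speaker chosen by the current speaker, if any
    self_selectors -- speakers who try to start, in order of onset
                      (hypothetical input; the ordering is an assumption)
    """
    # Rule 1: a speaker selected by the current speaker must speak next.
    if selected is not None:
        return selected
    # Rule 2: otherwise any other speaker may take the turn;
    # here the first one to start is assumed to win.
    for speaker in self_selectors:
        if speaker != current:
            return speaker
    # Rule 3: nobody else took the turn, so the current speaker may continue.
    return current


# The meeting example: the boss selects Chris, so Chris speaks next.
print(next_speaker("Boss", "Chris", ["Pat"]))
```

Encoding the rules in this order makes the priority explicit: selection by the current speaker overrides self-selection, and continuation is the fallback.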
7 Conversational Analysis (Sacks et al. 74)
- Adjacency pairs
- Question/answer
- Greeting/greeting
- Compliment/downplayer
- Dispreferred responses
- Silence
- No to a simple request without explanation
- Changing the topic abruptly without transition
- Important for Spoken Dialogue Systems
8
- Developmental Psychology
- Children learn turn-taking within first 2 years
(Stern 74)
- Children liked by their peers are more skilled
(Black & Hazen 90)
- General individual differences
- Shy people pause longer and speak less and less
often (Pilkonis 77)
- Schizophrenics, neurotics, depressed people less
skilled in turn-taking
9 Cultural Differences in Turn-Taking
- Telephone conversations
- Openings (Zhu 04)
- Mandarin vs. British
- Identification differences
- British self-report
- Chinese callees ask the caller
- Finnish business calls (Halmari 93) vs. American
- Americans get right to the point
- Finns chat
10 But where is the intent? Purpose?
11 Grounding Approaches to Conversational Modeling
- Conversation is a joint process through which
speakers are constantly negotiating a common
ground (Stalnaker 78, Clark 96 inter alia)
- Principle of Closure: agents performing an action
require evidence that they have succeeded (Norman
88), or not.
- Clark & Schaefer 89
- Presentation (by S) and Acceptance (by H) via
- Continued attention, acknowledgement/backchannel,
demonstration, display, relevant next
contribution.
12 Presentation and Acceptance (Clark & Schaefer 89)
- S: Jon Stewart is my favorite comedian
- H: (continued attention)
- H: Mhmm (acknowledgement/backchannel)
- H: Your favorite comedian (display)
- H: He's the funniest person you know
(demonstration)
- H: The Daily Show is not to miss (relevant next
contribution)
13 When Is It Appropriate to Speak? (Duncan 72)
- Analyze acoustic/prosodic and gestural
information in two face-to-face conversations.
- Turn-yielding cues
- Slower speaking rate
- Drop in pitch or loudness
- Completion of syntactic clause
- Termination of hand gesticulation
- Rising or falling final intonation
- Expressions like "you know"
- Turn-keeping cues
- Hands engaged in gesticulation
- Filled pauses
14 When Is It Appropriate to Speak? (Beattie 82)
- Who interrupts?
- Less intelligent, highly neurotic, extroverted
- Men interrupt women
- Interruptions may indicate
- Desire for dominance
- Desire for social approval
- Convey enthusiasm, involvement
- Data: 25m televised interviews before 1979
British General Election
- Margaret Thatcher (Tory leader), "the Iron Lady"
- Jim Callaghan (Prime Minister), "Sunny Jim"
15
- Beattie's classification scheme
- Identify spkr 2's attempts to take the turn
- Smooth switch: spkr 1's utterance complete, turn
to spkr 2, no simultaneous speech
- Overlap: spkr 1's utterance complete, turn to
spkr 2, simultaneous speech
- Simple interruption: spkr 1 doesn't complete
utterance, turn to spkr 2, simultaneous speech
- Silent interruption: spkr 1's utterance
incomplete, turn to spkr 2, no simultaneous
speech
- Butting-in: simultaneous speech but no change of
turn, spkr 1 keeps the turn
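The five categories above are fully determined by three yes/no observations: whether the turn changes, whether there is simultaneous speech, and whether speaker 1's utterance is complete. A minimal sketch of that decision table follows; the function name, parameter names, and the "unclassified" fallback are assumptions added for illustration.

```python
def classify_attempt(turn_changes: bool,
                     simultaneous_speech: bool,
                     first_utterance_complete: bool) -> str:
    """Map one attempted speaker switch to a category from the scheme above."""
    if not turn_changes:
        # The only listed no-change category requires simultaneous speech.
        if simultaneous_speech:
            return "butting-in"
        # Not covered by the listed categories (fallback label is an assumption).
        return "unclassified"
    if first_utterance_complete:
        return "overlap" if simultaneous_speech else "smooth switch"
    return "simple interruption" if simultaneous_speech else "silent interruption"


# Speaker 1 is cut off mid-utterance and both talk at once for a moment.
print(classify_attempt(turn_changes=True,
                       simultaneous_speech=True,
                       first_utterance_complete=False))
```

In practice the hard part is deciding the three inputs from audio or transcripts; the mapping itself is mechanical.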
16 Beattie 82 - Results
- Thatcher is interrupted almost twice as often as
she interrupts the interviewer (19/10), unlike
Callaghan (14/23).
- Why is Thatcher interrupted?
- Interruptions come at the end of a syntactic clause,
when there is drawl on the stressed syllable in the
clause, and falling intonation: 3 turn-yielding cues!
- Thatcher has fewer filled pauses (4) than
Callaghan (22), a turn-keeping cue.
- Why does she do this?
- Speech training before election?
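As a quick check on the "almost twice as often" claim, the counts can be turned into ratios. The sketch below assumes each pair reads as (times interrupted by the interviewer, times the politician interrupts), following the word order of the first bullet; the variable and function names are illustrative only.

```python
# Counts as read from the slide: (times interrupted, times interrupting).
counts = {"Thatcher": (19, 10), "Callaghan": (14, 23)}


def interruption_ratio(times_interrupted: int, times_interrupting: int) -> float:
    """How many times a speaker is interrupted per interruption they make."""
    return times_interrupted / times_interrupting


ratios = {name: round(interruption_ratio(*pair), 2)
          for name, pair in counts.items()}
print(ratios)  # {'Thatcher': 1.9, 'Callaghan': 0.61}
```

Under this reading Thatcher is interrupted 1.9 times for every interruption she makes, while Callaghan is interrupted only about 0.6 times per interruption of his own.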
17 Beattie 82 - Results
- Public perception: Thatcher is domineering in
interviews and Callaghan is a nice guy
- Why is she still perceived as domineering?
- When interrupted, she does not cede the floor
despite lengthy stretches of simultaneous speech
18 Online Turn Identification for SDS
- Push-to-talk systems
- Silence detection
- Not what humans do!
- Speech detection
- Barge-in
- Need more natural turn-taking support
- When are users ready to be interrupted?
- When do they want to keep the floor?
- When do they expect the system to backchannel?
- How can we indicate when the system has finished
its turn?
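To make the silence-detection baseline concrete, here is a minimal sketch of an energy-threshold endpointer: the system declares the user's turn over once it has heard speech followed by a long enough run of low-energy frames. The function name, the threshold, and the frame counts are all assumptions for illustration; this is the kind of strategy the slide contrasts with what humans do, not a recommended design.

```python
from typing import Optional, Sequence


def end_of_turn_frame(energies: Sequence[float],
                      threshold: float = 0.01,
                      min_silence_frames: int = 50) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over.

    energies           -- per-frame signal energy (hypothetical input)
    threshold          -- frames below this count as silence (assumed value)
    min_silence_frames -- silent frames required after speech (assumed value)
    Returns None if no end of turn is detected.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None


# Leading silence, speech, a short pause, more speech, then a longer silence.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.5] * 4 + [0.0] * 6
print(end_of_turn_frame(frames, min_silence_frames=5))  # 26
```

The short mid-turn pause is ignored only because it is shorter than `min_silence_frames`; a hesitating user with a longer pause would be cut off, which is one reason the questions above about keeping the floor matter.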
19 Other Dialogue Processing Tasks
- Speaker Diarization
- Speaker Segmentation
- Speaker Identification (≠ Speaker Verification)
20 Speaker Diarization
- Process of partitioning an input audio stream
into homogeneous segments according to the
speaker identity.
- SPEECH → segment 1, segment 2, segment 3
- Outputs no information about the speakers'
identities.
- Broadcast News, meetings, telephone conversations
21 Speaker Segmentation
- Given the diarization output, cluster together
the segments corresponding to the same speaker,
based on acoustic features.
- segment 1, segment 2, segment 3, segment 4
→ segment 1: speaker 1; segment 2: speaker 2;
segment 3: speaker 1; segment 4: speaker 3
- State of the art: 8.47% error
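The clustering step can be illustrated with a toy example. The sketch below assumes each segment has already been reduced to a fixed-length acoustic feature vector and uses a simple greedy nearest-centroid rule with a distance threshold; the function name, the 2-D vectors, and the threshold are made up for illustration, and this is not the method behind the error rate quoted above.

```python
import math
from typing import List, Sequence


def cluster_segments(features: Sequence[Sequence[float]],
                     max_dist: float) -> List[int]:
    """Greedily assign each segment to a speaker cluster.

    A segment joins the nearest existing centroid if it lies within
    max_dist (Euclidean); otherwise it starts a new speaker cluster.
    Returns one zero-based speaker label per segment.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for vec in features:
        best, best_d = -1, float("inf")
        for k, centroid in enumerate(centroids):
            d = math.dist(vec, centroid)
            if d < best_d:
                best, best_d = k, d
        if best >= 0 and best_d <= max_dist:
            n = counts[best]
            # Update the running mean of the chosen cluster.
            centroids[best] = [(c * n + v) / (n + 1)
                               for c, v in zip(centroids[best], vec)]
            counts[best] += 1
            labels.append(best)
        else:
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Four segments with made-up 2-D features; segments 1 and 3 sound alike.
segments = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (10.0, 0.0)]
print(cluster_segments(segments, max_dist=1.0))  # [0, 1, 0, 2]
```

The output mirrors the slide's example: segments 1 and 3 share a speaker, while segments 2 and 4 each get their own.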
22 Broadcast News
- <DOC>
- <DOCNO> CNN19980104.1130.0034 </DOCNO>
- <DOCTYPE> NEWS STORY </DOCTYPE>
- <DATE_TIME> 01/04/1998 11:30:34.71 </DATE_TIME>
- <BODY>
- <TEXT>
- a fire in northern kentucky is forcing 3,000
people in two states to flee their homes.
- the fire started early this morning at the
cargill company plant in maysville near the
- ohio river.
- authorities have been going door-to-door advising
people in kentucky and ohio
- to take shelter in area high schools.
- the fire is in a building where several
fertilizers and chemicals are stored.
- officials say all they can do is let the fire
burn itself out, because spraying
- water on the flames would be too dangerous.
- at the current time, our only way of getting it
under control is to stay away from it.
- we've backed everyone off from the fire by about
a mile and a quarter and evacuated
- homes in that radius and the chief threat at this
point is a very small risk of a very
- large explosion caused by 400 tons of ammonia
nitrate stored in the building.
- four people have been taken to hospitals.
23 Speaker Identification
- Problem of identifying a person solely from their
speech.
- Not the same as speaker verification (verifying
whether the speaker is who they claim to be).
- Linguistic information to identify speaker types
and speaker names on Broadcast News data (LIMSI
04)
- Templates (<name> has this report from
<location>)
- Results: 10.9% error on test set
- But only 10% of segments contain relevant
patterns
- Estimate: 25% error on Broadcast News if speaker
clustering is done to identify all of each
speaker's segments
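The template idea can be sketched with regular expressions over lowercased transcript text. The first pattern follows the template quoted above; the second ("<name> reports") is an extra pattern assumed here for illustration, as is the simplification that a name is exactly two words. None of this is claimed to match the LIMSI system's actual templates.

```python
import re
from typing import Dict, Optional

# Patterns over lowercased transcripts. Only the first mirrors the template
# quoted on the slide; the second is a hypothetical addition.
TEMPLATES = [
    re.compile(r"\b(?P<name>[a-z]+ [a-z]+) has this report from "
               r"(?P<location>[a-z]+(?: [a-z]+)*)"),
    re.compile(r"\b(?P<name>[a-z]+ [a-z]+) reports\b"),
]


def find_speaker_cue(sentence: str) -> Optional[Dict[str, str]]:
    """Return the fields captured by the first matching template, else None."""
    text = sentence.lower()
    for pattern in TEMPLATES:
        match = pattern.search(text)
        if match:
            return match.groupdict()
    return None


# A made-up sentence in the style of the template:
print(find_speaker_cue("pat jones has this report from new york."))
# A sentence with no cue:
print(find_speaker_cue("four people have been taken to hospitals."))
```

Most sentences return `None`, which is consistent with the slide's point that only a small share of segments contain a usable pattern, so the names found must be propagated to other segments by clustering.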
24
- <DOC>
- <DOCNO> CNN19980104.1130.0108 </DOCNO>
- <DOCTYPE> NEWS STORY </DOCTYPE>
- <DATE_TIME> 01/04/1998 11:31:48.11 </DATE_TIME>
- <BODY>
- <TEXT>
- unexpected weather conditions are the rule across
much of the united states
- this weekend.
- angela astore reports.
- <TURN>
- it was a nice day to play along the beach --
spend a few hours fishing --
- or get in a game of golf -- not uncommon --
unless it's january in chicago.
- record high temperatures were set yesterday from
minnesota to massachusetts.
- warm air drawn northward from the gulf of mexico
was behind the rise in the mercury.
- it was a different scene in the northwest, where
snow is the story.
- but the winter weather didn't stop this man from
getting in some warmer pursuits.
- and he wasn't bothered by the fact that he
couldn't see where his golf balls landed.
- <TURN>
- it's not really where it's going to land that's
important at this point
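If the turn markers in the excerpt above (written `<TURN>` in the corpus markup) are taken to mark speaker changes, which is an assumption about the annotation rather than something the slide states, the text can be split into reference turns with a few lines of code. The function name and the shortened sample input are illustrative.

```python
from typing import List


def split_turns(lines: List[str], marker: str = "<TURN>") -> List[List[str]]:
    """Group transcript lines into turns separated by marker lines."""
    turns: List[List[str]] = [[]]
    for line in lines:
        text = line.strip()
        if text == marker:
            turns.append([])  # a marker line opens a new turn
        elif text:
            turns[-1].append(text)
    # Drop empty turns, e.g. from consecutive or trailing markers.
    return [turn for turn in turns if turn]


sample = [
    "this weekend.",
    "angela astore reports.",
    "<TURN>",
    "record high temperatures were set yesterday from minnesota to massachusetts.",
    "<TURN>",
    "it's not really where it's going to land that's important at this point",
]
for number, turn in enumerate(split_turns(sample), start=1):
    print(number, len(turn))
```

Turn boundaries recovered this way could serve as reference segmentation when scoring an automatic system, and the line just before the first marker is where a cue such as "angela astore reports." names the next speaker.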
25 Conclusions
- Turn-taking models and theories of grounding are of
considerable potential use in SDS.
- What is the user likely to say next, and when?
- What type of response does s/he expect the system
to make? When?
- Obstacles for practical use
- What cues signal when it is appropriate to speak?
- How do we negotiate a common system/user ground?