Finding Word Clusters in Spoken Dialogue with Narrow Context Based Similarities

1
Finding Word Clusters in Spoken Dialogue with
Narrow Context Based Similarities
  • Leif Grönqvist (www.ling.gu.se/leifg)
  • Växjö University, School of Mathematics and
    Systems Engineering, Sweden
  • The National Graduate School of Language
    Technology (GSLT)
  • Magnus Gunnarsson (www.ling.gu.se/mgunnar)
  • Göteborg University, Department of Linguistics,
    Sweden

2
Background
  • NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif Grönqvist, Magnus Gunnarsson
  • Comparable Danish and Swedish corpora
  • 1.3 million tokens each, natural spoken interaction
  • We work mainly with spoken language, not written

3
Siblings as word groups
  • Traditional parts of speech are not necessarily valid for spoken language
  • There have been few serious attempts to build a spoken language grammar (Jens Allwood's talk tomorrow, 10 am)
  • What we have is the corpus: only the corpus, nothing else such as morphology or lexica
  • We take our information from the 1+1 word context, one word on each side (sketched below)
  • Words with similar context distributions are called Siblings (Peter Juel Henrichsen)
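A minimal sketch of how such 1+1 word contexts could be collected (an illustrative assumption, not the authors' actual code; the helper context_counts is made up for this sketch):

from collections import Counter

# For every word, count the bigram with its left neighbour and the
# bigram with its right neighbour, as in the distributions on slide 4.
def context_counts(tokens):
    contexts = {}
    for i, word in enumerate(tokens):
        ctx = contexts.setdefault(word, Counter())
        if i > 0:
            ctx[(tokens[i - 1], word)] += 1   # left context, e.g. "a couple"
        if i + 1 < len(tokens):
            ctx[(word, tokens[i + 1])] += 1   # right context, e.g. "couple of"
    return contexts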

4
Typical context distributions for couple, lot and moment

couple (32):
  couple              2
  couple that         3
  couple of          25
  a couple           27

lot (180):
  lot                18
  lot 's              6
  lot of            110
  lot more           10
  's lot              4
  a lot             142
  whole lot           5
  awful lot          11

moment (76):
  moment             33
  moment in           6
  moment is           3
  the moment         57
  this moment         3
  a moment            9
  particular moment   3

ggsib(lot, couple) = 0.74
ggsib(lot, moment) = 0.15
ggsib(couple, moment) = 0.12
5
Typical context distributions for we, they and I

they (21.8):
  and they     8.6
  that they    5.9
  if they      5.5
  they         7.0
  they 've     6.1
  they were    6.6
  they 're    11.6

we (21.9):
  and we       6.2
  that we      8.4
  if we        5.1
  we           7.0
  we do        5.1
  we 've       9.5
  we have      7.1
  we can       5.0
  we 're       6.3

I (39.3):
  and I        7.9
  I do         6.6
  I 've        7.1
  I 'm         9.1
  I think     12.3
  I mean      10.1

ggsib(we, they) = 0.71
ggsib(we, I) = 0.53
ggsib(they, I) = 0.51
6
Our use of the Sibling measure
  • We made the measure symmetric to avoid sibling chains
  • Another change was to not demand similar context on both sides
  • We use it iteratively (sketched below):
  • Run the similarity check between all pairs
  • Collapse word pairs with similarity above a threshold
  • Run again with a lower threshold, until a lowest threshold is reached
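A rough sketch of this iterative scheme. The similarity function here is an assumed stand-in, since the actual ggsib formula appears only as an image on slide 7; collapsed pairs get the summed context distribution of their members.

from collections import Counter
from itertools import combinations

# Stand-in similarity (NOT the authors' ggsib): overlap of two
# normalized context distributions.
def sib(ctx_a, ctx_b):
    total_a, total_b = sum(ctx_a.values()), sum(ctx_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(ctx_a[c] / total_a, ctx_b[c] / total_b)
               for c in ctx_a.keys() & ctx_b.keys())

# Collapse word pairs above a threshold, then run again with a lower
# threshold until the lowest threshold is reached.
def cluster(contexts, thresholds=(0.7, 0.6, 0.5)):
    contexts = dict(contexts)
    for threshold in thresholds:
        merged = True
        while merged:
            merged = False
            for a, b in combinations(sorted(contexts), 2):
                if sib(contexts[a], contexts[b]) >= threshold:
                    # A merged pair becomes one node whose context
                    # distribution is the sum of its members' counts,
                    # so repeated merges naturally build trees.
                    contexts[f"({a} {b})"] = contexts.pop(a) + contexts.pop(b)
                    merged = True
                    break
    return contexts

Given the ggsib values on slide 4, a run of this kind would collapse lot and couple (0.74) well before either comes near moment (0.15 and 0.12).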

7
Henrichsen's and our formulas
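The formulas themselves survive only as images in the original slides and are not in this transcript. Purely to illustrate the kind of measure involved (this is not the actual ggsib definition), a symmetric context-overlap similarity over distributions like those on slides 4-5 could be written as:

  % illustrative sketch only; the actual ggsib formula is not in this transcript
  % P(c \mid w) is the relative frequency of context c around the word w
  \operatorname{sim}(v, w) = \sum_{c \in C} \min\bigl( P(c \mid v),\; P(c \mid w) \bigr)

Any measure of this shape is symmetric in v and w, which fits the modification described on slide 6.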
8
Comparison to other clustering algorithms
  • We take all context words into account, not just a selected set
  • We get natural similarities, in the sense that they are based only on the corpus
  • But it is computationally very complex: we had to optimize the program heavily, using tries and even plain arrays instead of hash tables (see the sketch below)
  • The iterative approach gives us trees instead of just clusters
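As an illustration of the kind of optimization mentioned above (assumed, not the authors' implementation): with words mapped to integer IDs, bigram counts can live in trie nodes backed by plain arrays, so lookups are index operations rather than hash probes.

# Sketch: bigram counts in a two-level trie whose nodes are plain
# arrays indexed by integer word IDs instead of hash tables.
class BigramTrie:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.children = [None] * vocab_size   # rows allocated lazily

    def add(self, first, second):
        counts = self.children[first]
        if counts is None:
            counts = self.children[first] = [0] * self.vocab_size
        counts[second] += 1                   # pure array indexing

    def count(self, first, second):
        counts = self.children[first]
        return 0 if counts is None else counts[second]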

9
Some small examples
10
(slide content shown as an image; no transcript available)
11
Further Research
  • Evaluation is difficult: there are no correct trees, only our language intuition
  • Homonyms are not handled well
  • How can we find the interesting sections of the clustering?
  • When should the iteration stop? Without stopping, all words end up in one big tree
  • Sparse data is still a problem, and bigger contexts introduce problems of their own

12
Conclusions
  • Our method is an interesting way of finding word groups that are close to our language intuition
  • It works for all kinds of words (syncategorematic as well as categorematic)
  • It is to a high degree theory-independent
  • Low-frequency words and homonyms remain difficult to handle