Finding Word Clusters in Spoken Dialogue with Narrow Context Based Similarities

1
Finding Word Clusters in Spoken Dialogue with
Narrow Context Based Similarities
  • Leif Grönqvist (www.ling.gu.se/leifg)
  • Växjö University, School of Mathematics and
    Systems Engineering, Sweden
  • The National Graduate School of Language
    Technology (GSLT)
  • Magnus Gunnarsson (www.ling.gu.se/mgunnar)
  • Göteborg University, Department of Linguistics,
    Sweden

2
Background
  • NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif Grönqvist, Magnus Gunnarsson
  • Comparable Danish and Swedish corpora
  • 1.3 million tokens each, natural spoken interaction
  • We work mainly with spoken language, not written

3
Siblings as word groups
  • Traditional parts of speech are not necessarily valid for spoken language
  • There have been few serious attempts to build a spoken language grammar (Jens Allwood's talk tomorrow, 10 am)
  • What we have is the corpus: only the corpus, nothing else such as morphology or lexica
  • We take our information from the 1+1 word context, one word on each side (sketched below)
  • Words with similar context distributions are called Siblings (Peter Juel Henrichsen)
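A minimal sketch of how such 1+1 word contexts could be collected (an illustrative assumption, not the authors' actual code; the helper context_counts is made up for this sketch):

from collections import Counter

# For every word, count the bigram with its left neighbour and the
# bigram with its right neighbour, as in the distributions on slide 4.
def context_counts(tokens):
    contexts = {}
    for i, word in enumerate(tokens):
        ctx = contexts.setdefault(word, Counter())
        if i > 0:
            ctx[(tokens[i - 1], word)] += 1   # left context, e.g. "a couple"
        if i + 1 < len(tokens):
            ctx[(word, tokens[i + 1])] += 1   # right context, e.g. "couple of"
    return contexts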

4
Typical context distributions for couple, lot and moment

couple (32):
  couple              2
  couple that         3
  couple of          25
  a couple           27

lot (180):
  lot                18
  lot 's              6
  lot of            110
  lot more           10
  's lot              4
  a lot             142
  whole lot           5
  awful lot          11

moment (76):
  moment             33
  moment in           6
  moment is           3
  the moment         57
  this moment         3
  a moment            9
  particular moment   3

ggsib(lot, couple) = 0.74
ggsib(lot, moment) = 0.15
ggsib(couple, moment) = 0.12
5
Typical context distributions for we, they and I

they (21.8):
  and they     8.6
  that they    5.9
  if they      5.5
  they         7.0
  they 've     6.1
  they were    6.6
  they 're    11.6

we (21.9):
  and we       6.2
  that we      8.4
  if we        5.1
  we           7.0
  we do        5.1
  we 've       9.5
  we have      7.1
  we can       5.0
  we 're       6.3

I (39.3):
  and I        7.9
  I do         6.6
  I 've        7.1
  I 'm         9.1
  I think     12.3
  I mean      10.1

ggsib(we, they) = 0.71
ggsib(we, I) = 0.53
ggsib(they, I) = 0.51
6
Our use of the Sibling measure
  • We made the measure symmetric to avoid sibling chains
  • Another change was to not demand similar context on both sides
  • We use it iteratively (sketched below):
  • Run the similarity check between all pairs
  • Collapse word pairs with similarity above a threshold
  • Run again with a lower threshold, until a lowest threshold is reached
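A rough sketch of this iterative scheme. The similarity function here is an assumed stand-in, since the actual ggsib formula appears only as an image on slide 7; collapsed pairs get the summed context distribution of their members.

from collections import Counter
from itertools import combinations

# Stand-in similarity (NOT the authors' ggsib): overlap of two
# normalized context distributions.
def sib(ctx_a, ctx_b):
    total_a, total_b = sum(ctx_a.values()), sum(ctx_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(ctx_a[c] / total_a, ctx_b[c] / total_b)
               for c in ctx_a.keys() & ctx_b.keys())

# Collapse word pairs above a threshold, then run again with a lower
# threshold until the lowest threshold is reached.
def cluster(contexts, thresholds=(0.7, 0.6, 0.5)):
    contexts = dict(contexts)
    for threshold in thresholds:
        merged = True
        while merged:
            merged = False
            for a, b in combinations(sorted(contexts), 2):
                if sib(contexts[a], contexts[b]) >= threshold:
                    # A merged pair becomes one node whose context
                    # distribution is the sum of its members' counts,
                    # so repeated merges naturally build trees.
                    contexts[f"({a} {b})"] = contexts.pop(a) + contexts.pop(b)
                    merged = True
                    break
    return contexts

Given the ggsib values on slide 4, a run of this kind would collapse lot and couple (0.74) well before either comes near moment (0.15 and 0.12).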

7
Henrichsen's and our formulas
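The formulas themselves survive only as images in the original slides and are not in this transcript. Purely to illustrate the kind of measure involved (this is not the actual ggsib definition), a symmetric context-overlap similarity over distributions like those on slides 4-5 could be written as:

  % illustrative sketch only; the actual ggsib formula is not in this transcript
  % P(c \mid w) is the relative frequency of context c around the word w
  \operatorname{sim}(v, w) = \sum_{c \in C} \min\bigl( P(c \mid v),\; P(c \mid w) \bigr)

Any measure of this shape is symmetric in v and w, which fits the modification described on slide 6.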
8
Comparison to other clustering algorithms
  • We take all context words into account, not just a selected set
  • We get natural similarities, in the sense that they are based only on the corpus
  • But it is computationally very complex: we had to optimize the program heavily, using tries and even plain arrays instead of hash tables (see the sketch below)
  • The iterative approach gives us trees instead of just clusters
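As an illustration of the kind of optimization mentioned above (assumed, not the authors' implementation): with words mapped to integer IDs, bigram counts can live in trie nodes backed by plain arrays, so lookups are index operations rather than hash probes.

# Sketch: bigram counts in a two-level trie whose nodes are plain
# arrays indexed by integer word IDs instead of hash tables.
class BigramTrie:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.children = [None] * vocab_size   # rows allocated lazily

    def add(self, first, second):
        counts = self.children[first]
        if counts is None:
            counts = self.children[first] = [0] * self.vocab_size
        counts[second] += 1                   # pure array indexing

    def count(self, first, second):
        counts = self.children[first]
        return 0 if counts is None else counts[second]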

9
Some small examples
10
(slide content shown as an image; no transcript available)
11
Further Research
  • Evaluation is difficult: there are no correct trees, only our language intuition
  • Homonyms are not handled well
  • How can we find the interesting sections of the clustering?
  • When should the iteration stop? Without stopping, all words end up in one big tree
  • Sparse data is still a problem, and bigger contexts introduce problems of their own

12
Conclusions
  • Our method is an interesting way of finding word groups that are close to our language intuition
  • It works for all kinds of words (syncategorematic as well as categorematic)
  • It is to a high degree theory-independent
  • Low-frequency words and homonyms remain difficult to handle