Will Allen (w.h.a.allen@ncl.ac.uk), Warren Maguire (w.n.maguire@ncl.ac.uk), Hermann Moisl (hermann.moisl@ncl.a

1

Phonetic variation in Tyneside: exploratory
multivariate analysis of the Newcastle Electronic
Corpus of Tyneside English
Will Allen, Warren Maguire, Hermann Moisl
School of English Literature, Language, and
Linguistics, University of Newcastle upon Tyne
2
Introduction
  • The newly-created Newcastle Electronic Corpus of
    Tyneside English (NECTE) offers an opportunity to
    study an historically-recent sample of English
    spoken in the Tyneside region of North-East
    England.

3
Introduction
  • This paper gives an overview of an exploratory
    analysis of phonetic data derived from NECTE.
  • The analysis was undertaken with the aim of
    generating hypotheses about the main directions
    of phonetic variation among individual speakers
    and speaker groups in the corpus, and about how
    this variation correlates with associated social
    factors.
  • The discussion is in four main parts.
  • The first part outlines exploratory multivariate
    analysis in general, and in particular
    hierarchical cluster analysis, the method used in
    this study.
  • The second describes the NECTE phonetic data used
    in the analysis.
  • The third carries out a hierarchical cluster
    analysis of a sample of that data and states some
    hypotheses based on it.
  • The fourth raises some caveats and
    indicates directions for future work.

4
1. Multivariate analysis
  • The proliferation of computational technology has
    generated an explosive production of
    electronically encoded information of all kinds.
  • In the face of this, traditional paper-based
    methods for search and interpretation of data
    have been overwhelmed by sheer volume, and a wide
    variety of computational methods has been
    developed in an attempt to make the deluge at
    least tractable.
  • As such methods have been refined and new ones
    introduced, something over and above tractability
    has emerged: new and unexpected ways of
    understanding the data.
  • The fact that a computer can deal with vastly
    larger data sets than a human is an obvious
    factor, but there are two others of at least
    equal importance:
  • one is the ease with which data can be
    manipulated and reanalyzed in interesting ways
    without the often prohibitive labour that this
    would involve using manual techniques;
  • the other is the extensive scope for
    visualization that computer graphics provide.

5
1. Multivariate analysis
  • These developments have clear implications for
    the analysis of large bodies of text in corpus
    linguistics.
  • Effective analysis of the large electronic
    corpora now being generated will increasingly be
    tractable only by adapting the interpretative
    methods developed by the statistical, data
    mining, and related communities.
  • In the present paper we are interested in one
    particular type of tool: multivariate analysis.
    What is multivariate analysis?

6
1. Multivariate analysis
  • Observation of nature plays a fundamental role in
    science.
  • In current scientific methodology, an hypothesis
    about some phenomenon is proposed and its
    adequacy assessed using data obtained from
    observation of the domain of inquiry.
  • But nature is complex, and there is no hope of
    being able to observe it exhaustively. Instead,
    particular aspects of the domain are selected for
    observation.
  • Each selected aspect is represented by a
    variable, and a series of observations is
    conducted in which, at each observation, the
    values for each variable are recorded. A body of
    data is thereby built up on the basis of which an
    hypothesis can be assessed.
  • One might choose to observe only one aspect, in
    which case the data set consists of a set of
    values assigned to one variable; such a data set
    is referred to as univariate.
  • If two aspects are observed, then the data set is
    bivariate, if three trivariate, and so on up to
    some arbitrary number n; the general term for
    n-variable data sets is multivariate.

7
1. Multivariate analysis
  • As the number of variables grows, so does the
    difficulty of understanding the data, that is, of
    conceptualizing:
  • the interrelationships of variables within a
    single data item: how, for example, are height,
    weight, and heart rate of any given person
    interrelated?
  • the interrelationships of complete data items:
    how do people measured on the above variables
    compare to one another?
  • Multivariate analysis is the computational use
    of mathematical and statistical tools for
    understanding such interrelationships in data.

8
1. Multivariate analysis
  • Numerous techniques for multivariate analysis
    exist. They can be divided into two main
    categories which are usually referred to as
    'exploratory' and 'confirmatory'.
  • Exploratory analysis aims to discover
    regularities in data which can serve as the basis
    for formulation of hypotheses about the domain
    from which the data comes. Such techniques
    emphasize intuitively accessible, usually
    graphical representations of data structure.
  • Confirmatory multivariate analysis attempts to
    determine whether or not there are significant
    relationships between some number of selected
    independent variables and one or more dependent
    ones.
  • These two types are complementary in that the
    first generates hypotheses about data, and the
    second tries to determine whether or not such
    hypotheses are valid. Exploratory analysis is
    naturally prior to confirmatory; this discussion
    is concerned with the former.

9
1. Multivariate analysis: hierarchical cluster
analysis
  • Hierarchical cluster analysis is a variety of
    exploratory multivariate analysis.
  • To understand how it works it is first necessary
    to understand the concept of distance between
    data points in vector space.
  • Assume a domain of inquiry, say a linguistic
    corpus, which will be studied using six
    variables.
  • If the six-dimensional data is to be analyzed
    using an exploratory method, it has to be
    represented mathematically.
  • This is done in the form of vectors, where a
    vector is a sequence of values indexed by the
    positive integers 1, 2, 3, ...

10
1. Multivariate analysis: hierarchical cluster
analysis
  • Where the data consists of more than one case,
    which it usually does, each case is
    represented by a vector, and the set of vectors
    is assembled into a matrix, which is a sequence
    of vectors arranged in rows, the rows being
    indexed by the positive integers 1, 2, 3, ...
  • In matrix M, case 2 is at row M_2 and the value of
    the third variable for that case is at M_{2,3}, that
    is, 0.1.
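
As a rough illustration of the row and column indexing just described, here is a minimal Python sketch. The matrix contents are invented for illustration: only the value 0.1 at row 2, column 3 comes from the slide, and the helper names `row` and `value` are hypothetical.

```python
# Hypothetical data: three cases measured on six variables.
# Only the 0.1 at row 2, column 3 is taken from the slide; the rest is made up.
M = [
    [0.4, 0.7, 0.2, 0.9, 0.5, 0.3],  # case 1
    [0.6, 0.8, 0.1, 0.2, 0.4, 0.7],  # case 2
    [0.3, 0.5, 0.9, 0.1, 0.6, 0.2],  # case 3
]


def row(matrix, i):
    """Return the vector for case i, using the slide's 1-based indexing."""
    return matrix[i - 1]


def value(matrix, i, j):
    """Return the value of variable j for case i (both 1-based)."""
    return matrix[i - 1][j - 1]


print(row(M, 2))       # the full six-element vector for case 2
print(value(M, 2, 3))  # the third variable of case 2
```

Python lists are 0-based, so the helpers subtract 1 to match the 1-based indexing used on the slide.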

11
1. Multivariate analysis: hierarchical cluster
analysis
  • A vector space is a geometrical interpretation of
    a set of vectors.
  • The dimensionality n of the vectors, that is, the
    number of their elements, defines an n-dimensional
    space.
  • The values in the vector define the coordinates
    of a point in that space.

12
1. Multivariate analysis: hierarchical cluster
analysis
  • For example, a bivariate data set defines a
    2-dimensional space in which each vector
    specifies the coordinates of a point in that
    space.
  • Take a data set consisting of vectors that
    specify the age and weight of some number of
    individuals. A single such vector might be
    v = (36, 160).
  • In geometrical terms, the x or age axis is
    0..100, the y or weight axis is 0..200, and any
    vector in the data set can be plotted in the
    (x,y) space, as in the accompanying plot.

13
1. Multivariate analysis: hierarchical cluster
analysis
  • If more vectors are plotted in the space,
    nonrandom structure may or may not emerge,
    depending on the interrelationships of the
    real-world characteristics that the variables
    represent.
  • Where there are no structured real-world
    interrelationships, the result will look
    something like the upper plot of random points.
  • If there is structure, the plot might look
    something like the lower one, where two clusters
    have clearly emerged. These clusters say
    something substantive about the
    interrelationships of the represented entities.

14
1. Multivariate analysis: hierarchical cluster
analysis
  • Analogously, a trivariate (age, weight, height)
    vector v = (36, 160, 71) from a data set of
    length-3 vectors defines a point in 3-dimensional
    space, as in the upper plot.
  • Multiple vectors representing a structured domain
    plotted in the space might look like the lower
    figure.

15
1. Multivariate analysis: hierarchical cluster
analysis
  • The structure of data with dimensionality higher
    than 3 cannot be directly visualized.
  • How can it be represented in an intuitively
    accessible way?
  • The various exploratory multivariate methods
    provide indirect visualizations.
  • Hierarchical cluster analysis, in particular,
    constructs dendrograms or trees that show the
    constituency structure of clusters using relative
    distance between and among points in the
    high-dimensional data vector space.
  • Distance can be understood quite literally: the
    distance between points A and B in the figure
    below can be measured, and it is less than the
    measured distance between A and C.
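
To make the notion of measurable distance concrete, the following is a minimal sketch of Euclidean distance between points in a vector space of any dimensionality. The coordinates for A, B, and C are hypothetical, since the figure is not part of this transcript; they are chosen only so that A lies closer to B than to C.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimensionality."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical coordinates, not taken from the original figure.
A = (1.0, 1.0)
B = (2.0, 2.0)
C = (5.0, 4.0)

print(euclidean(A, B) < euclidean(A, C))  # A is closer to B than to C
```

The same function works unchanged for vectors with more than three elements, which is what allows distance to stand in for direct visualization in higher-dimensional spaces.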

16
1. Multivariate analysis: hierarchical cluster
analysis
  • As an example of how dendrograms graphically
    represent the structure of higher-dimensional
    data, we use a benchmark data set that measures
    flowers on 4 variables.
  • A hierarchical cluster analysis generates a tree
    in which the different lengths of the horizontal
    lines represent relativities of distance among
    data vectors in 4-dimensional space: the longer
    the lines, the greater the distance.
  • Knowing this, it can easily be seen that there
    are three main clusters, that is, three types of
    flower, and that each cluster has internal
    structure.

17
2. Data: the corpus
  • The NECTE corpus is based on two pre-existing
    corpora of audio-recorded speech, one of them
    gathered during the Tyneside Linguistic Survey (TLS)
    undertaken in the late 1960s and the other
    between 1991 and 1994 for the Phonological
    Variation and Change in Contemporary Spoken
    English (PVC) project, both at Newcastle University.
  • NECTE's aim has been to enhance, improve access
    to, and promote the re-use of the TLS and PVC
    corpora by amalgamating them into a single,
    TEI-conformant electronic corpus.
  • The result will shortly be made available to the
    research community in a variety of formats:
    digitized sound, phonetic transcription, and
    standard orthographic transcription, all aligned
    and accessible on the Web.

18
2. Data: the corpus
  • This discussion is concerned with the TLS
    component of NECTE.
  • It originally consisted of 150 loosely-structured
    30-minute interviews with Tyneside informants
    that were recorded onto analog reel-to-reel
    tapes.
  • As part of its research activity based on these
    recordings, the TLS produced highly detailed
    phonetic transcriptions of about 10 minutes of
    each of 64 recordings, of which 61 survive.
  • These 61 transcriptions are the basis for the
    data used in this presentation.

19
2. Data: the corpus
  • One of the main aims of the TLS project was to
    see whether systematic phonetic variation among
    Tyneside speakers of the period could be
    significantly correlated with variation in their
    social characteristics.
  • To this end they developed a methodology which
    was radical at the time and remains so today:
  • in contrast to the then-universal and
    still-dominant theory-driven approach, where
    social and linguistic factors are pre-selected by
    the analyst,
  • the TLS proposed a fundamentally empirical
    approach in which salient factors are extracted
    from the data itself and then serve as the basis
    for model construction.

20
2. Data: the corpus
  • To realize its research aim using its empirical
    methodology, the TLS had to compare the audio
    interviews it had collected at the phonetic level
    of representation.
  • To be able to do this, the TLS phonetically
    transcribed a substantial sample of its audio
    corpus, as noted.
  • The TLS invented its own transcription scheme. It
    is too complex for presentation here, but its
    main features are:
  • It captures variation in the distribution of
    phonetic segments across lexical environments.
  • There are two levels of transcription: relatively
    broad and very narrow.

21
2. Data: abstraction
  • The analysis reported in this talk is at the
    broad phonetic transcription level.
  • It is based on comparison of phonetic profiles
    associated with each of the TLS speakers.
  • A profile for any speaker S is the number of
    times S uses each of the phonetic segments
    defined by the TLS transcription scheme in his or
    her interview.

22
2. Data: abstraction
  • More specifically, the profile P associated with
    S is a vector having as many elements as there
    are codes, such that:
  • Each vector element P_j represents the jth
    segment, where j is in the range 1..number of
    phonetic segments in the TLS scheme.
  • The value stored at P_j is an integer representing
    the number of times S uses the jth code.
  • There are 156 codes, and so a profile is a
    length-156 vector. For example:

23
2. Data: abstraction
  • There are 61 TLS speakers, and their profiles are
    represented in matrices having 61 rows, one for
    each profile.
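
The construction of profiles and their assembly into a matrix can be sketched as follows. This is a toy illustration only: the code inventory, the speaker names, and the transcriptions are hypothetical, and a real profile would have 156 elements rather than 4.

```python
from collections import Counter


def profile(transcription, codes):
    """Count how often each code occurs; one vector element per code, in code order."""
    counts = Counter(transcription)
    return [counts[c] for c in codes]


# Hypothetical inventory of 4 codes (the TLS scheme has 156) and two toy speakers.
codes = ["c1", "c2", "c3", "c4"]
transcriptions = {
    "speaker_1": ["c1", "c3", "c1", "c4", "c1"],
    "speaker_2": ["c2", "c2", "c3"],
}

# One row per speaker, one column per code.
speakers = sorted(transcriptions)
matrix = [profile(transcriptions[s], codes) for s in speakers]

for name, vector in zip(speakers, matrix):
    print(name, vector)
```

A code that a speaker never uses simply receives a count of 0, so every profile has the same length regardless of which segments occur in a given interview.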

24
2. Data: preprocessing
  • Prior to analysis, this matrix was modified in two
    ways:
  • Normalization for variation in text length
  • Dimensionality reduction: elimination of
    superfluous variables
  • The result is a 61 x 80 length-normalized matrix,
    which is the data for the analysis that follows.
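
The slide does not say how either preprocessing step was carried out. The sketch below is therefore a guess at one plausible reading, not the procedure actually used: it assumes that length normalization means converting counts to relative frequencies, and that superfluous variables are identified by low variance across speakers. The function names and the toy counts are likewise hypothetical.

```python
from statistics import pvariance


def normalize_rows(matrix):
    """Convert each row of counts to relative frequencies (ASSUMED normalization)."""
    out = []
    for r in matrix:
        total = sum(r)
        out.append([x / total if total else 0.0 for x in r])
    return out


def keep_high_variance(matrix, k):
    """Keep the k columns with the highest variance (ASSUMED reduction criterion).

    Returns the reduced matrix and the 0-based indices of the kept columns.
    """
    ncols = len(matrix[0])
    variances = [pvariance([r[j] for r in matrix]) for j in range(ncols)]
    ranked = sorted(range(ncols), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:k])
    return [[r[j] for j in kept] for r in matrix], kept


# Toy counts: 3 speakers x 4 codes; the second code is never used by anyone.
counts = [
    [10, 0, 5, 5],
    [20, 0, 10, 10],
    [2, 0, 10, 8],
]
freqs = normalize_rows(counts)
reduced, kept = keep_high_variance(freqs, 2)
print(kept)
print(reduced)
```

After normalization the first two toy speakers have identical rows even though one interview is twice as long as the other, which is the effect length normalization is meant to achieve.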

25
3. Analysis
  • This section analyzes the data matrix developed
    in the preceding section.
  • One particular variety of hierarchical cluster
    analysis is used: the squared Euclidean distance
    measure and the increase in sum of squares
    clustering algorithm, or, more simply, Ward's
    method.
  • The implications of this choice are discussed in
    due course.
  • The aim is to get an initial indication of the
    structure of the TLS phonetic data using the
    matrix of speaker profiles, normalized for length
    and reduced in dimensionality as described. The
    resulting cluster tree is shown in the
    accompanying figure.
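
As a rough illustration of the clustering procedure named on this slide, here is a small pure-Python sketch of Ward's method. At each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares, which for clusters of sizes n_a and n_b with centroids c_a and c_b equals (n_a * n_b / (n_a + n_b)) times the squared Euclidean distance between c_a and c_b. The function names and the four toy points are hypothetical, and this is not the software used for the analysis reported here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ward(points):
    """Agglomerative clustering by Ward's criterion.

    Returns the merges in order, each as
    (members of cluster a, members of cluster b, increase in sum of squares).
    """
    # Each active cluster is stored as (member indices, centroid).
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                ma, ca = clusters[ids[x]]
                mb, cb = clusters[ids[y]]
                na, nb = len(ma), len(mb)
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, ids[x], ids[y])
        cost, a, b = best
        ma, ca = clusters.pop(a)
        mb, cb = clusters.pop(b)
        na, nb = len(ma), len(mb)
        # The centroid of the merged cluster is the size-weighted mean.
        centroid = [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)]
        clusters[next_id] = (ma + mb, centroid)
        merges.append((sorted(ma), sorted(mb), cost))
        next_id += 1
    return merges


# Toy data: two tight pairs of points that lie far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
merges = ward(points)
for members_a, members_b, increase in merges:
    print(members_a, members_b, increase)
```

The sequence of merges, together with the increase recorded at each one, is the information a dendrogram draws: small increases correspond to short lines low in the tree, and the final large increase corresponds to the long lines separating the main clusters.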

26
3. Analysis
  • Four main clusters emerge, labelled A-D.
  • D clusters markedly against the rest, and
    comprises the Newcastle group of speakers.
  • On the basis of the phonetic segment frequency
    distribution evidence, therefore, Newcastle
    speakers are strongly distinguished from
    Gateshead ones.
  • The Gateshead speakers can be further analyzed, but
    for present purposes we stay with the Newcastle /
    Gateshead distinction.

27
3. Analysis
  • Knowing that there are well-defined clusters is
    one thing.
  • Knowing why is another: what are the main
    phonetic segmental determinants of the clusters?
  • Several ways of answering this question exist; we
    take a graphical approach.
  • The phonetic profiles for all 61 speakers were
    simultaneously plotted with the aim of visually
    identifying systematic differences between the
    Newcastle and Gateshead clusters.

28
3. Analysis
  • The problem is immediately apparent: too much
    detail. The level of detail was reduced as
    follows:
  • A significantly smaller number of the
    highest-variance variables was selected, thus
    reducing the density of information on the
    x-axis.
  • The frequency vectors for all the Gateshead
    speakers 1-57 were averaged, yielding a mean
    frequency vector for these speakers.
  • The same was done for the Newcastle speakers.
  • The mean frequency vectors were plotted against
    each other on the same graph.

29
3. Analysis
  • For the 30 highest-variance variables shown here,
    those for which the mean vectors differ most are
    the most important in differentiating Newcastle
    and Gateshead speakers.

30
3. Analysis
  • The table on the right gives the interpretation
    of this plot:
  • 'Rank' indexes the 6 largest variable-differences
    between the two clusters: 1 is the largest
    difference, 2 the second largest, etc.
  • 'Variable nr' shows, for each Rank, the
    corresponding variable number in the range 1..40
    on the x-axis of the plot.
  • 'Variable symbol' identifies the phonetic symbol
    corresponding to the Variable nr.
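
The averaging and ranking steps behind such a table can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the toy frequency rows are invented, and the use of the absolute difference between the two mean vectors as the ranking criterion is an assumption about how 'largest variable-differences' was computed.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length frequency vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def rank_differences(group_a, group_b, top):
    """Rank variables by absolute difference between the two group means.

    Returns (rank, 1-based variable number, difference) for the top variables.
    """
    mean_a = mean_vector(group_a)
    mean_b = mean_vector(group_b)
    diffs = [abs(a - b) for a, b in zip(mean_a, mean_b)]
    order = sorted(range(len(diffs)), key=lambda j: diffs[j], reverse=True)
    return [(rank, j + 1, diffs[j]) for rank, j in enumerate(order[:top], start=1)]


# Hypothetical normalized frequencies for 4 variables.
gateshead = [
    [0.625, 0.125, 0.125, 0.125],
    [0.375, 0.375, 0.125, 0.125],
]
newcastle = [
    [0.25, 0.125, 0.5, 0.125],
]

table = rank_differences(gateshead, newcastle, top=3)
for rank, variable_nr, diff in table:
    print(rank, variable_nr, diff)
```

In this toy example variable 3 separates the two groups most strongly, followed by variables 1 and 2, which is the kind of ordering the Rank and Variable nr columns record.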

31
3. Analysis
  • The variables that are most important for
    distinguishing Gateshead from Newcastle speakers
    can be read off from the table. They are:
  • ? in words like standard and interview
    (localised Tyneside English often has ? here)
  • closely related to this, ? in words like baker
    and china (localised Tyneside English often has
    ? here)
  • ? in the KIT lexical set
  • ? in words like houses and places
  • ?? in the GOAT lexical set (RP-type English has
    ?? here)
  • e? in the PRICE lexical set (RP-type English
    has a? here).

32
3. Analysis
  • We have conducted similar analyses on the
    Gateshead subcluster.
  • There isn't time to go into the details, but the
    essence of the results is that, on the basis of
    the phonetic frequency profiles:
  • The main phonetic segmental determinants for the
    clustering can be extracted.
  • The main subclusters can be correlated with
    social factors, as follows.

33
3. Analysis
  • 1. There is a correlation between gender and
    cluster:
  • cluster A is almost exclusively female;
  • cluster BC is almost exclusively male;
  • cluster DE is mostly female.
  • 2. There is a correlation between socio-economic
    status and cluster:
  • speakers in cluster A are, by and large, from the
    lowest of the three socio-economic groups;
  • speakers in cluster BC are from socio-economic
    groups 1 and 2;
  • speakers in cluster DE are from the two higher
    socio-economic groups, 2 and 3.

34
3. Analysis
  • When the interaction between these two social
    variables is considered, the picture becomes even
    clearer. The following table summarises the
    typical social characteristics of the clusters:

35
4. Discussion
  • The foregoing analysis has used only one of a
    wide range of possible distance measure /
    clustering algorithm combinations available under
    the rubric 'hierarchical cluster analysis'.
  • In general, with respect to any given data set,
    different combinations of distance measure and
    clustering algorithm typically yield results that
    differ from one another to greater or lesser
    degrees.
  • Given that there is no obvious way of selecting
    the best analysis, that is, the one that captures
    the true structure of the data, the question is:
    how reliable a tool is hierarchical cluster
    analysis for linguistic research?
  • Our own cluster analyses have shown such
    variation with respect to the TLS data.
  • How can we claim any validity for our results?

36
4. Discussion
  • In principle, the response is that the purpose of
    exploratory multivariate analysis is to generate
    hypotheses rather than to provide definitive
    answers, and different analyses of the same data
    simply generate different hypotheses to be
    tested.
  • In practice, if different selections of distance
    measure / clustering algorithm produce wildly
    different analyses of one and the same data set,
    it is difficult to see what advantage they offer
    over unguided hypothesizing.
  • Therefore, the next step is to analyze the data
    using:
  • various other distance measure / clustering
    algorithm combinations in hierarchical cluster
    analysis
  • different clustering methods such as
    multidimensional scaling and self-organizing maps
  • to see if a stable analysis emerges.
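
One simple way to quantify whether two analyses agree is the Rand index, which is the proportion of speaker pairs that both analyses treat the same way (grouped together in both, or separated in both). The slides do not name any agreement measure, so this choice is an assumption made purely for illustration, and the cluster labels below are invented.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of item pairs on which two flat clusterings agree (1.0 = identical)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Hypothetical cluster labels for five speakers from two different analyses.
analysis_1 = ["A", "A", "B", "B", "C"]
analysis_2 = ["x", "x", "y", "y", "y"]

print(rand_index(analysis_1, analysis_2))
```

The label names themselves do not matter, only which speakers share a label, so trees cut into flat clusters by different methods can be compared directly. Values near 1 across many method combinations would suggest a stable analysis.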

37
4. Conclusion
  • The hierarchical analyses we have undertaken so
    far look promising. The next steps are:
  • Detailed phonetic and sociolinguistic analysis of
    the cluster tree generated by the distance
    measure / clustering algorithm used in this talk.
  • Comparison of this tree with structural analyses
    generated by other varieties of hierarchical
    cluster analysis and by alternative clustering
    methods.
  • Relation of our results to existing work on
    Tyneside English, and in particular to that of
    Val Jones-Sargent, one of the original TLS
    researchers, on whose work our own is based.
  • V. Jones-Sargent, Tyne Bytes: A Computerised
    Sociolinguistic Study of Tyneside, Peter Lang,
    1983.