Title: Quantitative approaches to language change
1Quantitative approaches to language change
- Søren Wichmann
- MPI-EVA and Leiden University
2Overview
- 1. Automated lexicostatistics: tools and methods
- 2. Automated lexicostatistics: results
- 3. Using typological databases for historical linguistic research
3Automated lexicostatistics methods
- Lexicostatistics was invented in the early 1950s
- Recent renaissance due to two new developments:
  - Phylogenies can be established more meaningfully using modern computational methods developed by bioinformaticians
  - Subjective determinations of cognacy can be replaced by an objective, automated method
4Traditional lexicostatistics
1st step: determine cognates on a standard list

Meaning   Cocopa   Diegeño   Cognate?
fire      a?á      ?aw       yes
nose      ixú      xú        yes
one       ?ít      ?axínk    no
Etc.      ...      ...       ...
52nd step: build a matrix of percent similarities

            Cocopa   Diegeño   Hualapai   Havasupai   ...
Cocopa      0        90        77         80
Diegeño     90       0         87         75
Hualapai    77       87        0          72
Havasupai   80       75        72         0
(invented example)

3rd step: find a graphic way of expressing the similarities and interpret this as a phylogeny
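The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cognate-class labels are hypothetical (only the Cocopa-Diegeño judgments for "fire", "nose" and "one" follow the table above; the Hualapai labels are made up), and the function names are not from any existing tool.

```python
from itertools import combinations

# Hypothetical cognate coding: for each meaning, languages sharing a label
# are judged cognate. Hualapai labels are invented for illustration.
cognate_classes = {
    "fire": {"Cocopa": 1, "Diegeño": 1, "Hualapai": 1},
    "nose": {"Cocopa": 1, "Diegeño": 1, "Hualapai": 2},
    "one":  {"Cocopa": 1, "Diegeño": 2, "Hualapai": 2},
}

def percent_similarity(coding, lang_a, lang_b):
    """Percentage of meanings on which two languages share a cognate class."""
    compared = shared = 0
    for classes in coding.values():
        # Only count meanings attested in both languages.
        if lang_a in classes and lang_b in classes:
            compared += 1
            shared += classes[lang_a] == classes[lang_b]
    return 100.0 * shared / compared

def similarity_table(coding, languages):
    """Percent similarity for every pair of languages."""
    return {
        (a, b): percent_similarity(coding, a, b)
        for a, b in combinations(languages, 2)
    }

languages = ["Cocopa", "Diegeño", "Hualapai"]
table = similarity_table(cognate_classes, languages)
for (a, b), pct in table.items():
    print(f"{a} - {b}: {pct:.1f}%")
```

With a real 100- or 200-item list the same loop applies; only the coding dictionary grows.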
6Fragment of matrix of similarities among Salishan
languages from Swadesh (1950)
7Salish relations, after Swadesh (1950)
8UPGMA tree produced in SplitsTree (UPGMA = Unweighted Pair Group Method with Arithmetic mean)
10Tools for producing a tree from a similarity
matrix
- Convert the similarity matrix to a distance matrix using a spreadsheet such as Excel
- Prepare an input file for your preferred phylogenetic software using an editor such as TextPad (free from www.textpad.com)
- Run the data using phylogenetic software; SplitsTree can be recommended (free from www.splitstree.org)
- Choose the most appropriate algorithm (Neighbour Joining recommended for distance data)
- Prepare your tree for presentation using a tool such as the Tree Explorer of MEGA
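The first bullet, converting similarities to distances, can also be done in a script instead of a spreadsheet. The sketch below assumes the simplest conversion, distance = 100 minus percent similarity, and uses the invented four-language matrix from the 2nd-step slide; the function name is hypothetical.

```python
def similarity_to_distance(sim, max_similarity=100.0):
    """Turn a square percent-similarity matrix into a distance matrix.

    Off-diagonal cells become max_similarity - similarity; the diagonal
    is set to 0, since a language is at distance 0 from itself.
    """
    n = len(sim)
    return [
        [0.0 if i == j else max_similarity - sim[i][j] for j in range(n)]
        for i in range(n)
    ]

labels = ["Cocopa", "Diegeño", "Hualapai", "Havasupai"]
similarities = [
    [0, 90, 77, 80],
    [90, 0, 87, 75],
    [77, 87, 0, 72],
    [80, 75, 72, 0],
]

distances = similarity_to_distance(similarities)
for label, row in zip(labels, distances):
    print(label, row)
```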
11Preparing the input file
- Look at the example files that come with SplitsTree and imitate them, for instance this one:
13Example input file (NEXUS format):
#nexus
BEGIN Taxa;
DIMENSIONS ntax=30;
TAXLABELS
BellaCoola
Comox
etc. ...
;
END;
BEGIN distances;
DIMENSIONS ntax=30;
FORMAT triangle=LOWER diagonal labels missing=?;
MATRIX
BellaCoola 0
Comox 80 0
etc. ...
;
END;
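Instead of typing the file by hand, it can be generated. The following is a minimal sketch that writes a lower-triangular distances block in the layout of the input file above; the function name is hypothetical, the three-language distances are derived from the invented Yuman example (100 minus similarity), and it is an assumption that your version of SplitsTree accepts exactly this layout, so compare the output with the example files shipped with the program. Labels are kept ASCII and free of spaces to be safe.

```python
def nexus_distances(labels, dist):
    """Return NEXUS text with a Taxa block and a lower-triangular distances block."""
    lines = [
        "#nexus",
        "BEGIN Taxa;",
        f"DIMENSIONS ntax={len(labels)};",
        "TAXLABELS",
    ]
    lines += labels
    lines += [
        ";",
        "END;",
        "BEGIN distances;",
        f"DIMENSIONS ntax={len(labels)};",
        "FORMAT triangle=LOWER diagonal labels missing=?;",
        "MATRIX",
    ]
    for i, label in enumerate(labels):
        # Row i holds the distances to taxa 0..i, including the 0 on the diagonal.
        row = " ".join(f"{dist[i][j]:g}" for j in range(i + 1))
        lines.append(f"{label} {row}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"

labels = ["Cocopa", "Diegueno", "Hualapai"]  # ASCII spellings, no spaces
dist = [
    [0, 10, 23],
    [10, 0, 13],
    [23, 13, 0],
]
text = nexus_distances(labels, dist)
print(text)
```

Writing `text` to a file with `open("yuman.nex", "w")` (file name hypothetical) gives something that can be opened in the phylogenetic software.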
14Let's do this using TextPad
15- Now we produce a tree from the data
- Let's do that using SplitsTree,
- and let's look at different algorithms
- and features of the program
16Illustrating the difference between UPGMA and
Neighbour Joining
17UPGMA assumes that all members of a cluster have undergone the same amount of change
18Neighbour Joining doesn't make this assumption
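A small numerical illustration of the difference, as a sketch only: the four taxa and their distances are hypothetical, constructed from a tree ((A,B),(C,D)) in which B and D have changed much more than A and C. The code shows just the first clustering step of each method; the full algorithms then update the matrix and repeat. UPGMA joins the pair with the smallest distance, whereas Neighbour Joining first corrects each distance for how far the two taxa are from everything else.

```python
from itertools import combinations

# Hypothetical distances generated from the tree ((A,B),(C,D)) with
# branch lengths A=1, B=5, C=1, D=5 and an internal branch of 1.
taxa = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 6, frozenset("AC"): 3, frozenset("AD"): 7,
    frozenset("BC"): 7, frozenset("BD"): 11, frozenset("CD"): 6,
}

def d(x, y):
    return dist[frozenset((x, y))]

def upgma_first_pair(taxa):
    """UPGMA joins the pair with the smallest raw distance."""
    return min(combinations(taxa, 2), key=lambda p: d(*p))

def nj_first_pair(taxa):
    """Neighbour Joining joins the pair minimising
    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(taxa)
    total = {x: sum(d(x, y) for y in taxa if y != x) for x in taxa}
    return min(
        combinations(taxa, 2),
        key=lambda p: (n - 2) * d(*p) - total[p[0]] - total[p[1]],
    )

print("UPGMA joins first:", upgma_first_pair(taxa))
print("NJ joins first:   ", nj_first_pair(taxa))
```

Here UPGMA groups the two slowly changing taxa A and C, which are not sisters in the generating tree, while Neighbour Joining recovers a true sister pair.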
19Comparing the two trees
20- Now we prepare our tree for presentation
- Let's do that using MEGA
21Automating the similarity measure
Levenshtein distance (LD): the minimum number of steps (substitutions, insertions or deletions) that it takes to get from one word to another

Germ. Zunge → Eng. tongue (the affricate ts counts as one segment)
tsuŋə → tuŋə (substitution)
      → tʌŋə (substitution)
      → tʌŋ (deletion)
Or tongue → Zunge
tʌŋ → tʌŋə (insertion)
    → tuŋə (substitution)
    → tsuŋə (substitution)
3 steps, so LD = 3
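The definition above corresponds to the standard dynamic-programming computation of edit distance. The sketch below works on any sequence, so a word can be passed as a list of segments; treating the affricate "ts" as a single segment and the particular transcriptions used are assumptions made for this illustration.

```python
def levenshtein(source, target):
    """Minimum number of substitutions, insertions and deletions
    needed to turn `source` into `target` (any two sequences)."""
    # previous[j] = distance between the segments of source read so far
    # and the first j segments of target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

zunge = ["ts", "u", "ŋ", "ə"]   # assumed segmentation of German Zunge
tongue = ["t", "ʌ", "ŋ"]        # assumed segmentation of English tongue
print(levenshtein(zunge, tongue))
print(levenshtein(tongue, zunge))
```

Plain strings work as well, in which case every character counts as one segment.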
22- There are more sophisticated versions where the phonetic distance between segments is taken into account, but operating with such fine distinctions only becomes relevant for minute dialectology.
- People who have been using the more refined approach:
  - John Nerbonne and Johan Heeringa (dialectologists, Groningen)
  - Michael Cysouw's course
- People who have been using raw LDs:
  - Serva and Petroni (physicists, Italy)
  - Myself and colleagues
23Weighting Levenshtein distances
Serva and Petroni (2008): divide by the lengths of the strings compared. This takes into account that LDs grow with word length.
Colleagues and I: divide by the length of the longer string compared, and then divide by the average of LDs among words in Swadesh lists with different meanings. This takes into account typical word lengths of the languages compared and accidental similarity due to similarities in phonological inventories.
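One possible reading of the two-stage normalization can be sketched as follows. The names `ldn` and `ldnd`, the way the averages are taken over whole lists, and the toy word lists are all assumptions made for this illustration, not a description of the actual program.

```python
def levenshtein(source, target):
    """Plain edit distance between two sequences."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (s != t)))
        previous = current
    return previous[-1]

def ldn(a, b):
    """LD divided by the length of the longer of the two words."""
    return levenshtein(a, b) / max(len(a), len(b))

def ldnd(list_a, list_b):
    """Average normalized LD for same-meaning pairs, divided by the
    average normalized LD for pairs of words with different meanings."""
    meanings = [m for m in list_a if m in list_b]
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    different = [ldn(list_a[m1], list_b[m2])
                 for m1 in meanings for m2 in meanings if m1 != m2]
    return (sum(same) / len(same)) / (sum(different) / len(different))

# Toy word lists (invented strings, not real language data).
lang_a = {"fire": "abc", "nose": "de"}
lang_b = {"fire": "abd", "nose": "de"}
print(ldn(lang_a["fire"], lang_b["fire"]))
print(ldnd(lang_a, lang_b))
```

A value well below 1 means that words with the same meaning are much more similar than words with different meanings; values near 1 mean the similarity is no greater than chance resemblance.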
24Comparing results for a test set: Mixe-Zoquean languages (Mexico)
Tree based on shared phonological innovations
(data from Wichmann 1995)
Tree based on automated lexicostatistics (using
Levenshtein distances)
25So results are similar
- Disadvantages of automated method
  - blind to anything but lexical evidence
  - not always accurate
  - has a shallower limit of application than the comparative method
- Advantages
  - extremely quick
  - consistent and objective
  - provides information on the amount of change, and therefore a time perspective
27So why not apply the automated method to all the world's languages and see what happens?
- Tomorrow: the Automated Similarity Judgment Program, its recent history and state of the art