Title: Statistics of Words


1
Statistics of Words
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Stationary processes
  • A stochastic process X is stationary iff
  • For any number n of indices $t_1, t_2, \ldots, t_n$ and
    any shift $t$, the joint distributions of $X(t_1), X(t_2),
    \ldots, X(t_n)$ and of $X(t_1+t), X(t_2+t), \ldots, X(t_n+t)$
    are identical
  • It is implied above that this must hold whenever $t_i + t
    \in T$ for all i, where T is the index set of the process

3
Ergodic processes
  • A stochastic process with states (a Markov chain) is ergodic iff
  • Any state can be reached from any other state
    (irreducibility)
  • Returns to a state are not restricted to multiples of a
    fixed period greater than 1 (aperiodicity)
  • In an ergodic process,
  • The ensemble average (expectation) is equal to the time
    average almost everywhere
  • We can't get stuck in a state
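
As a rough illustration of the time-average property, the sketch below simulates a small Markov chain and compares the fraction of time spent in state 0 with the stationary probability of that state. The two-state transition matrix `P`, the helper name `simulate_chain`, and the step count are assumptions chosen for this example; they do not come from the slides.

```python
import random

def simulate_chain(transition, start, steps, seed=0):
    """Return the fraction of time one trajectory spends in each state."""
    rng = random.Random(seed)
    states = range(len(transition))
    counts = [0] * len(transition)
    state = start
    for _ in range(steps):
        counts[state] += 1
        # Draw the next state from the current row of the transition matrix.
        state = rng.choices(states, weights=transition[state])[0]
    return [c / steps for c in counts]

# Hypothetical two-state chain: every state is reachable from every other
# state, and the self-loops make it aperiodic.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# For two states, the stationary probability of state 0 is p10 / (p01 + p10).
pi0 = P[1][0] / (P[0][1] + P[1][0])

time_avg = simulate_chain(P, start=1, steps=200_000)
print("stationary probability of state 0:", pi0)
print("time average for state 0:        ", time_avg[0])
```

With a long enough run, the two printed values should be close regardless of the starting state, which is what the ergodic property suggests.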

4
Language as a stochastic process
  • Any generator of symbols can be considered as a
    stochastic process
  • This is a discrete-time, discrete-valued process
  • Each random variable corresponds to one symbol
    (usually a word)
  • Is human language stationary and ergodic?
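
To make the idea of a symbol generator concrete, here is a minimal sketch of a word-level generator in which each word depends only on the previous one (a bigram model). The toy sentence and the function names `build_bigram_model` and `generate` are hypothetical; this is one simple way to realize such a process, not a claim about how the lecture models language.

```python
import random
from collections import defaultdict

def build_bigram_model(words):
    """Map each word to the list of words observed immediately after it."""
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Emit up to `length` words, each drawn from the followers of the last."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        options = followers.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(options))
    return out

# Toy training text (an assumption for illustration only).
sample = "the cat sat on the mat and the dog sat on the cat".split()
model = build_bigram_model(sample)
print(" ".join(generate(model, "the", 8)))
```

Each position in the output is one random variable of the process; repeated followers in the lists make frequent continuations more likely to be drawn.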

5
Statistics of words
  • Given a sample of text, how many distinct words
    can we expect?
  • What are the frequencies of each individual word?
  • We distinguish between tokens and types
  • tokens: observations of words
  • types: distinct words
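
The token/type distinction can be sketched in a few lines. The tokenizer below (lowercasing and keeping alphabetic runs with an optional internal apostrophe) is an assumption made for this example; real counts such as those for a whole novel depend heavily on the tokenization rules chosen.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphabetic words (e.g. "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def token_type_counts(text):
    """Return (number of tokens, number of types, per-type frequencies)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

n_tokens, n_types, counts = token_type_counts("The cat and the hat. Don't stop!")
print(n_tokens, "tokens,", n_types, "types")
```

Here "the" is observed twice, so it contributes two tokens but only one type.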

6
A sample text: Tom Sawyer
  • Published in 1876 by Mark Twain (pseudonym of
    Samuel Clemens)
  • Intended for children
  • Total word types and tokens
  • 71,370 tokens (half a megabyte)
  • 8,018 types
  • Compared to newswire text,
  • fewer types (>11,000 expected)

7
Word frequencies
  • Average frequency is 8.9 tokens/type
  • But not all words are equally frequent
  • Ten most frequent words
  • the, and, a, to, of, was, it, in, that, he
  • Each accounts for at least 1% of the text ("the"
    occurs 3,332 times and accounts for 4.7% of the
    text)
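
The figures on this slide can be checked with simple arithmetic from the token and type totals given earlier. The helper names below are hypothetical; the numbers are the ones quoted in the slides.

```python
def average_frequency(n_tokens, n_types):
    """Average number of tokens per distinct word type."""
    return n_tokens / n_types

def share_of_text(count, n_tokens):
    """Percentage of all tokens accounted for by `count` occurrences."""
    return 100.0 * count / n_tokens

N_TOKENS = 71_370   # tokens in the sample text, from the slides
N_TYPES = 8_018     # types in the sample text, from the slides
THE_COUNT = 3_332   # occurrences of "the", from the slides

print(round(average_frequency(N_TOKENS, N_TYPES), 1))  # 8.9
print(round(share_of_text(THE_COUNT, N_TOKENS), 1))    # 4.7
```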

8
Common and rare words
  • Most common 100 words
  • 50.9% of the total tokens
  • Words that occur only once (hapax legomena)
  • 49.8% of the total types, 5.6% of the text
  • Words that occur 3 times or fewer
  • 12% of the text
  • Words that occur 10 times or fewer
  • 90% of the total types
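
Shares of this kind can be computed from a table of word frequencies. The sketch below assumes the frequencies are held in a `collections.Counter`; the function name `coverage_stats`, its parameters, and the toy data are hypothetical.

```python
from collections import Counter

def coverage_stats(counts, top_k=100, rare_max=1):
    """Summarize how much of the text common and rare words account for."""
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Tokens covered by the top_k most frequent types.
    top_tokens = sum(c for _, c in counts.most_common(top_k))
    # Frequencies of the types that occur at most rare_max times.
    rare = [c for c in counts.values() if c <= rare_max]
    return {
        "top_k_token_share": top_tokens / n_tokens,
        "rare_type_share": len(rare) / n_types,
        "rare_token_share": sum(rare) / n_tokens,
    }

# Toy data: 8 tokens, 4 types, two of which occur only once.
counts = Counter("a a a a b b c d".split())
stats = coverage_stats(counts, top_k=1, rare_max=1)
print(stats)
```

Setting `rare_max=1` isolates the hapax legomena; larger values give the "3 times or fewer" and "10 times or fewer" groups.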

9
Zipf's law
  • Consider a word's frequency f and its rank r in
    the list of all word types ordered by frequency
  • Then,
  • $f \cdot r = \text{constant}$ (Zipf's law)
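
One informal way to examine this relationship is to list each word's rank, frequency, and their product, and see how stable the product is. The counts below are hypothetical and were constructed to follow f = 60 / r exactly, so the product is constant by design; real text only approximates this.

```python
from collections import Counter

def rank_frequency(counts):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = counts.most_common()
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Hypothetical counts built so that frequency * rank is always 60.
counts = Counter({"w1": 60, "w2": 30, "w3": 20, "w4": 15, "w5": 12})

for rank, word, freq, product in rank_frequency(counts):
    print(rank, word, freq, product)
```

On a real corpus the same table can be built from actual word counts; plotting log frequency against log rank then gives a roughly straight line when the law holds.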

10
Zipf's law in Tom Sawyer
11
The principle of least effort
  • Zipf's basis for his theory in economics,
    communication, and human behavior in general
  • Two competing needs
  • unambiguous messages (many of them)
  • short messages (least effort)
  • Solution
  • Frequent messages must be limited in number so
    that they can be short

12
Pareto's law
  • Originally developed for the income of people
  • $P(X > x) = c_1 x^{-k}$ (cumulative distribution)
  • Differentiating, we obtain the distribution
    density function
  • $f(X = x) = c_2 x^{-(k+1)}$
  • This is known as the power law
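
The differentiation step can be written out as follows. This is a sketch under the assumption that the formula applies for x above some minimum value; the symbol F for the cumulative distribution function is introduced here and does not appear in the slides. Note that P(X > x) is the complementary form of the cumulative distribution, which is why a sign change appears.

```latex
\begin{align*}
P(X > x) &= c_1 x^{-k} \\
F(x) = P(X \le x) &= 1 - c_1 x^{-k} \\
f(x) = \frac{dF}{dx} &= k\, c_1\, x^{-(k+1)} = c_2\, x^{-(k+1)},
  \qquad c_2 = k\, c_1
\end{align*}
```

So the density decays with exponent k + 1, one more than the exponent of the tail probability.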

13
Examples
  • Income of people
  • Revenue of corporations
  • Population of cities
  • Magnitude of earthquakes
  • Number of visits to web sites
  • Number of interactions of proteins

14
Oil field distribution
15
US city population
16
Fatalities for natural events
Fatalities per incident for tornadoes (1), floods
(2), hurricanes (3), and earthquakes (4) in the
20th century United States.
17
Sites visited by AOL users
18
AOL visits: Zipf fit
19
Reading
  • Section 4.1 from the book An Introduction to
    Statistical Signal Processing by Robert M. Gray
    and Lee D. Davisson
  • Sections 1.4.2-1.4.3 on word counts and Zipf's law