Title: Basics of Natural Language Processing
1. Lecture 5
- Basics of Natural Language Processing
2. Aims of Linguistic Science
- Characterize and explain linguistic observations
  - Conversation
  - Writing
  - Other media
- How humans acquire, produce, and use language
- The relationship between linguistic utterances and the world
- Understand the linguistic structures by which language communicates
  - Rules
3. All grammars leak!
- Grammars attempt to describe well-formed versus ill-formed utterances
- It is not possible to give an exact and complete characterization that cleanly divides the two
- People are always stretching and bending the rules
4. An Alternative Approach
- Abandon the idea of dividing sentences into grammatical and non-grammatical ones
- Ask instead: what are the common patterns that occur in language use?
- The approach becomes statistical → Statistical Natural Language Processing
5. Rationalist Approach to NLP
- Dominant from 1960 to 1985
- Prevalent in linguistics, psychology, artificial intelligence, and natural language processing
- Characterized by the belief that a significant part of the knowledge in the human mind is not derived from the senses but is fixed in advance by genetic inheritance
- Within linguistics, the rationalist position came to dominate the field through the widespread acceptance of Noam Chomsky's arguments for an innate language faculty
6. Poverty of the Stimulus
- How can children learn something as complex as natural language from the limited input they hear during their early years?
- The rationalist answer: key parts of language are innate, hardwired in the brain at birth as part of the human genetic inheritance
7. Empiricist Approach
- Dominant from 1920 to 1960, and re-emerging now
- Agrees that some cognitive abilities are present in the brain
- But the thrust of the empiricist approach is that the mind does not begin with detailed sets of principles and procedures specific to the various components of language and other cognitive domains
8. Generative Linguistics
- Chomskyan or generative linguistics seeks to describe the language module in the human brain (the I-language), for which data such as texts (the E-language) provide only indirect evidence
- It distinguishes between linguistic competence, which reflects the knowledge of language structure in the mind of a native speaker, and
- Linguistic performance in the world, which is affected by real-world factors such as memory limitations and noise
9. Statistical NLP
- The aim is to assign probabilities to linguistic events so that one can say which sentences are usual and which are unusual
- Interested in good descriptions of the associations and preferences that occur in the totality of language use
10. Questions Linguistics Should Answer
- What do people say?
- What do these things say/ask/request about the world?
- Patterns in corpora more easily reveal the syntactic structure of language, so statistical NLP deals principally with the first question
- Generative linguistics abstracts away from any attempt to describe the things people actually say; instead it seeks to describe a competence grammar that is said to underlie the language, that is, what is resident in people's minds
11. Grammaticality
- The concept of grammaticality is meant to be judged on whether a sentence is structurally well-formed
- Not on whether it is the kind of thing people would say
- Not on whether it is semantically meaningful
  - Colorless green ideas sleep furiously.
12. Blending of Parts of Speech
- Near as adjective or preposition:
  - We will review that decision in the near future. (adjective)
  - He lives near the station. (preposition)
  - We nearly lost. (adjective → adverb)
  - He lives right near the station. (preposition modified by right)
  - We live nearer the water than you thought. (preposition in comparative form)
13. Language Change
- Two uses of kind of and sort of:
  - What sort of animal made these tracks? (noun)
  - We are kind of hungry. (degree modifier of an adjective, similar to somewhat)
  - He sort of understood what was going on. (degree modifier of a verb, i.e., adverbial)
- Historical attestations:
  - The nette sent in to the see, and alle kind of fishis gedrynge. (1382)
  - I knowe that sorte of men ryght well. (1560)
  - I kind of love you, Sal. (1804)
  - It sort o' stirs one up to hear about old times. (1833)
14. Language Change
- While language change can be sudden, it is generally gradual
- The details of gradual change can only be made sense of by examining frequencies of use and being sensitive to varying strengths of relationships
- This type of modeling requires statistical rather than categorical observations
- Human cognition is probabilistic, and so language must be probabilistic too
- This implies that probability is key to a scientific understanding of language
15. Disambiguation
- Previous lectures gave several examples of ambiguous sentences
- An NLP system must be good at making disambiguation decisions with respect to word sense, word category, syntactic structure, and semantic scope
- Hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale well, and are brittle in the face of the extensive use of metaphor in language
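Statistical disambiguation can start from something as simple as corpus frequencies. As a minimal sketch (the tiny tagged corpus and the tag names below are invented for illustration), a most-frequent-tag baseline assigns each word the tag it carries most often in training data:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: map each word to its most common tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # most_common(1) returns the single highest-count (tag, count) pair
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy tagged corpus (hypothetical tags, for illustration only)
corpus = [("near", "PREP"), ("near", "ADJ"), ("near", "PREP"),
          ("swallow", "VERB"), ("swallow", "NOUN"), ("swallow", "VERB")]
model = train_baseline(corpus)
print(model["near"], model["swallow"])   # PREP VERB
```

Baselines like this are learned from data rather than hand-coded, which is one reason statistical methods scale better than hand-written preference rules.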
16. Disambiguation
- A traditional approach is to use selectional restrictions
- For example, a verb like swallow requires an animate being as its subject and a physical object as its object
- Counterexamples:
  - I swallowed his story, hook, line, and sinker.
  - The supernova swallowed the planet.
17. Getting Our Hands Dirty
- Lexical resources: machine-readable text, dictionaries, thesauri, and the tools for processing them
- Brown Corpus: a tagged corpus of about 1,000,000 words assembled at Brown University in the 1960s and 1970s
- Lancaster-Oslo-Bergen (LOB) Corpus: a British English counterpart
- Susanne Corpus: a free subset of the Brown Corpus
- Penn Treebank: parsed Wall Street Journal text
- Canadian Hansards: proceedings of the Canadian parliament, a bilingual corpus
18. Word Counts
Word tokens versus word types: word tokens count every running word in a text. In Tom Sawyer there are 71,370 word tokens.
19. Word Counts
In contrast, word types are the distinct words, each of which may be repeated many times. In Tom Sawyer there are 8,018 word types. The ratio of tokens to types is just the average frequency of each word: 71,370 / 8,018 ≈ 8.9.
20. Zipf's Law
- If we count how often each word type occurs in a large corpus and then list the words in order of frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r
- Zipf's law says f ∝ 1/r, or equivalently f · r = constant
21. Word Counts (figure)
22. Zipf's Law
- The product f · r tends to bulge for words of rank around 100
- For human languages, Zipf's law is a useful rough description of the frequency distribution: there are a few very common words, a medium number of medium-frequency words, and many low-frequency words
23. Zipf's Law (figure)
24. Mandelbrot's Law
- Mandelbrot derived a better fit:
- f = P (r + ρ)^(−B)
- Here P, B, and ρ are parameters of a text that collectively measure the richness of the text's use of words
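The extra parameters are what let Mandelbrot's formula bend the rank-frequency curve at low ranks. A small sketch (the parameter values below are illustrative, not fitted to any text); note that with ρ = 0 and B = 1 the formula reduces to Zipf's f = P/r:

```python
def mandelbrot_freq(r, P, rho, B):
    """Predicted frequency f = P * (r + rho) ** (-B) for a word of rank r."""
    return P * (r + rho) ** (-B)

# With rho = 0 and B = 1 this is exactly Zipf's law f = P / r
assert mandelbrot_freq(5, 100.0, 0.0, 1.0) == 100.0 / 5

# Illustrative parameters: rho > 0 flattens the curve for the top-ranked words
for r in (1, 2, 10, 100):
    print(r, mandelbrot_freq(r, 1000.0, 2.7, 1.15))
```

The choice rho > 0 damps the predicted frequencies for the very highest-ranked words, which is where plain Zipf overshoots.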
25. Mandelbrot's Law (figure)
26. Other Laws
- If m is the number of meanings a word has, then Zipf argues m ∝ f^(1/2)
- Equivalently, m ∝ r^(−1/2)
- Power laws:
  - If words are generated at random from 26 letters plus a space, the probability of a word of length n is (26/27)^n (1/27)
  - There are 26 times more possible words of length n+1 than words of length n
  - There is a constant ratio by which words of length n are more frequent than words of length n+1