Title: LING001
1. LING001
- Language and Computers
- 4-13-2009
2. "I'm sorry, Dave. I'm afraid I can't do that."
- Computational linguistics
- Use computational tools to understand how humans (and human languages) work
- Use computational tools to make computers understand humans (and human languages)
3. The Turing Test
- Language has always been viewed as the essence of being human
- Alan Turing (1912-1954), a pioneer in computer science, proposed that a machine would be considered intelligent if it could fool a human into believing it to be another human via teletyping (or IRC, IM, ...)
- This test has led to many philosophical controversies, but one thing is clear: no machine has ever passed the Test
- ELIZA: a computer program created by Joseph Weizenbaum in the 1960s that played a psychiatrist and was disturbingly effective at tricking people into confessions
- See demo
4. Part I: Language as Computation
- Modern linguistics was/is computer science
- representations and rules were conceived as components in a mechanical device that generates linguistic expressions -- an insight that goes back many centuries
- rules such as S → NP VP became the foundation of computer science
- logic representations of meanings (the SEMANTICS lectures) were used to represent what programming languages express
- Exp → Exp OP Exp
- OP → +, -, *
- Exp → 0, 1, 2, ...
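This toy grammar can be sketched as a small random generator; the choice of +, -, and * as the operator set is an assumption here, since the slide's operator list did not survive conversion:

```python
import random

# Toy grammar from the slide (operator set assumed):
#   Exp -> Exp OP Exp | 0 | 1 | ... | 9
#   OP  -> + | - | *
def generate_exp(depth=2):
    """Randomly expand Exp, bounding the recursion depth."""
    if depth == 0 or random.random() < 0.5:
        return random.choice("0123456789")        # Exp -> a digit
    op = random.choice(["+", "-", "*"])           # OP -> + | - | *
    left = generate_exp(depth - 1)
    right = generate_exp(depth - 1)
    return f"{left} {op} {right}"                 # Exp -> Exp OP Exp

expr = generate_exp()
print(expr, "=", eval(expr))  # every generated string is well-formed arithmetic
```

Each run prints a different expression; the point is that a handful of rewrite rules generates unboundedly many well-formed expressions, which is the sense in which a grammar is a "mechanical device."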
5. Case Study: Language Processing
- Linguistic Problems
- How do we analyze sentences on the fly?
- Why are some sentences more difficult to process than others?
- Computational Problems
- computers have a processor and a memory
- the processor carries out an algorithm (a precise series of steps) that, in effect, draws a syntax tree much like what you do in your homework
- the tree is shipped to semantics, where the logic-based type of calculation takes over to derive meanings
6. Simple Grammar
How do we parse in real time?
7. What comes down must go up
Make the left-most expansion first -- this predicts what the leftmost word should be -- and check whether the predicted category is actually found in the input sentence
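A minimal sketch of this top-down, left-most strategy in Python; the grammar and lexicon below are invented for the flight example, not necessarily the rules on the slide:

```python
# Toy grammar and lexicon (assumed for illustration)
GRAMMAR = {
    "S":  [["Aux", "NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "does": "Aux", "this": "Det", "a": "Det",
    "flight": "N", "meal": "N", "include": "V",
}

def parse(cat, words):
    """Top-down parse: expand the left-most category first and check
    the prediction against the input; return the leftover words on
    success, or None on failure."""
    if cat in LEXICON.values():                  # pre-terminal category
        if words and LEXICON.get(words[0]) == cat:
            return words[1:]                     # predicted word confirmed
        return None
    for rule in GRAMMAR.get(cat, []):            # try each expansion
        rest = words
        for sub in rule:                         # left to right
            rest = parse(sub, rest)
            if rest is None:
                break                            # prediction failed; next rule
        else:
            return rest
    return None

sent = "does this flight include a meal".split()
print(parse("S", sent) == [])   # True: the whole sentence is consumed
```

When a prediction fails, the parser falls back to the next rule for that category -- the backtracking that becomes costly in the breakdown example below.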
8. Does this flight include a meal?
9. Does this flight include a meal? (Cont.)
10. Does this flight include a meal? (Cont.)
11. Breakdown
- Recall the structure of center embedding: NP → NP S (relative clause)
- The cheese the rat the cat the dog worried chased ate lay in the house
- Red indicates a predicted rule, and blue indicates a confirmed rule
- S → NP VP → Det N VP → the N VP → the cheese VP → NP VP
- we are now at "The cheese"
- VP would next predict V, which is contradicted by "the" in "the rat"
- we need to trace back the NP expansion (the green box) to try out other rules in the grammar for NP (namely, NP → NP S)
- Note that the red VP is still held in memory because it's predicted; more and more will stack up, causing a memory load problem and hence parsing difficulty
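The stacking-up of predicted VPs can be illustrated with a toy simulation: each subject noun opens a clause whose VP is predicted but unconfirmed, and each verb confirms the most recent prediction (the mini-lexicon is invented for this one sentence):

```python
def max_stack(words):
    """Scan left to right, simulating the parser's memory: every
    subject noun opens a clause and predicts a VP (push); every
    verb confirms the most recent prediction (pop)."""
    nouns = {"cheese", "rat", "cat", "dog"}
    verbs = {"worried", "chased", "ate", "lay"}
    stack = peak = 0
    for w in words:
        if w in nouns:
            stack += 1                 # predict a VP for this clause
            peak = max(peak, stack)
        elif w in verbs:
            stack -= 1                 # a VP prediction is confirmed
    return peak

sent = "the cheese the rat the cat the dog worried chased ate lay".split()
print(max_stack(sent))   # 4 unconfirmed VPs held at once
```

Four simultaneous predictions is more than human parsing comfortably holds, which is one way to read the breakdown on this slide.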
12. Case Study II: Language Learning
- Recall the strategies in word segmentation (lecture on 3-12)
- stress information: English is predominantly initial stress
- statistical information: the transitional probability between syllables tends to be lower at word boundaries
- infants have been shown to be sensitive to both, but how (and whether) do conflicting cues work together?
- in experimental psychology, most researchers focus on one of the cues, because the research design must eliminate confounds from the other cues in order to establish the empirical effectiveness of the cue of interest
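The transitional-probability cue can be sketched as TP(b | a) = count(ab) / count(a), computed over a continuous syllable stream; the two-word mini-corpus below is invented for illustration, in the style of the infant studies:

```python
from collections import Counter

# Invented stream: "pretty baby pretty doggy ..." heard with no pauses
stream = ["pre", "ty", "ba", "by", "pre", "ty", "do", "gy"] * 25

pairs = Counter(zip(stream, stream[1:]))   # counts of adjacent syllables
firsts = Counter(stream[:-1])              # counts of first syllables

def tp(a, b):
    """Transitional probability P(b | a) = count(ab) / count(a)."""
    return pairs[(a, b)] / firsts[a]

for a, b in [("pre", "ty"), ("ty", "ba"), ("ba", "by")]:
    print(f"TP({a} -> {b}) = {tp(a, b):.2f}")
# word-internal TPs (pre->ty, ba->by) come out high;
# the TP across the word boundary (ty->ba) is lower
```

The dip at "ty -> ba" is exactly the statistical signal a learner could use to posit a word boundary there.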
13. Language Learning
- In many cases where it is difficult to carry out experiments, or the resulting experiment would be too complicated (for young children), computational modeling is often the best -- perhaps the only -- way of testing hypotheses about language learning
- computer models must be faithful to the findings in child language acquisition; e.g., they cannot presuppose unrealistic computing power on the part of the learner
- computer models must produce behavior consistent with what children actually do in language learning
- they also allow the integration of quantitative factors (e.g., frequency) with the more abstract representations and rules we have been discussing in this class
14. From Learning to Change
- The transmission of language is by language learning; this process is strikingly similar to genetics and evolution
- Language learning can be modeled as an adaptive process over the variant forms of grammar in the environment; some competing variants come from Universal Grammar, while others are contingent on social and cultural factors (e.g., "soda" vs. "pop")
- There is a healthy amount of work now that tries to use the mathematical models of biological evolution to develop mathematical models of language change
15. Part II: Computing Language
- Another branch of computational linguistics develops applications for engineering purposes
- In most cases, the engineering techniques are inspired by how language works, as revealed by linguistics
- Manual construction of the system is usually labor intensive, so people have been looking for automatic ways of learning the linguistic system
- Speech synthesis
- Ambiguity in language
- Machine translation
- Spam filtering
16. Speech Synthesis
- Virtually all systems were modeled after the human vocal tract: mechanical before, electronic now
Wolfgang von Kempelen (late 1700s)
17. Modern Speech Synthesis Systems
- Many follow the pipeline of linguistic representations that we have been talking about
- Note that running speech involves more than word synthesis; prosody is very important as well
- this often requires the system to parse the sentence into tree-like structures
- see demo: http://www.research.att.com/ttsweb/tts/demo.php
18. Ambiguity
- Ambiguity is pervasive in language
- word level: "two" vs. "too" vs. "to"
- part of speech: "bark" is a verb as well as a noun
- word senses: "bank" as an institution vs. "bank" as an object
- syntax: "I shot an elephant in my pajamas"
- semantics: "everyone likes someone"
- Humans can make rapid decisions on ambiguity by tapping into both linguistic and non-linguistic knowledge, and thus ignoring the majority of ambiguities
- How does a computer do that?
19. Part-of-Speech Disambiguation
- Observation: some categories appear more often than others
- "the" is far more likely to be followed by a noun or adjective, and never by a verb
- Idea: look at a lot of pairs of words in preanalyzed data (e.g., "book that flight" → V Det N), and try to discover the regularities
- we can then generalize these regularities to novel texts
- here the problem is solved by having a "teacher"
- The US military has paid a lot of money to produce tons of preanalyzed linguistic data, hoping that useful regularities can be extracted out of it
20. The dog saw the icecream
- (The, dog): Det-N, Det-V
- (dog, saw): N-N, N-V, V-N, V-V
- (saw, the): N-Det, V-Det
- (the, icecream): Det-N
- Proceed from left to right, and pick out the more likely POS pairings (with some technical tricks that we omit)
- In practice, this gets English POS correct above 95% of the time
- but just assigning the most likely tag gets it correct about 91% of the time
- reason: English has a fairly rigid word order, such that this kind of technique works reasonably well
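A sketch of the left-to-right idea, with made-up bigram counts standing in for the preanalyzed data; a real tagger would use smoothed probabilities and dynamic programming (the "technical tricks" omitted on the slide) rather than this greedy walk:

```python
# Made-up counts of adjacent tag pairs from "preanalyzed" text
BIGRAM = {("Det", "N"): 90, ("Det", "V"): 1,
          ("N", "N"): 15, ("N", "V"): 40,
          ("V", "N"): 5, ("V", "V"): 1,
          ("N", "Det"): 10, ("V", "Det"): 30}
TAGS = {"the": ["Det"], "dog": ["N", "V"],
        "saw": ["N", "V"], "icecream": ["N"]}

def tag(words):
    """Greedy left-to-right disambiguation: for each word, pick the
    tag whose pairing with the previous tag has the highest count."""
    out = [TAGS[words[0]][0]]                  # "the" is unambiguous: Det
    for w in words[1:]:
        best = max(TAGS[w], key=lambda t: BIGRAM.get((out[-1], t), 0))
        out.append(best)
    return out

print(tag("the dog saw the icecream".split()))
# ['Det', 'N', 'V', 'Det', 'N']
```

Here "dog" and "saw" are both N/V-ambiguous, but the Det-N and N-V counts resolve them correctly.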
21. Machine Translation
- Machine translation systems in practice differ with respect to how deep their linguistic analysis goes
- Not surprisingly, the deeper systems work better but are more expensive to construct
- One of the many challenges is idioms: "the spirit is willing but the flesh is weak" comes back as "the vodka is strong but the meat is rotten"
22. Babelfish Needs Oxygen
- Try out websites such as babelfish.altavista.com
- Translate a passage from English to German to French and then back to English
- See demo
23. Spam Filtering
- 1978: apparently the first email spam (advertising for a product demo within a computer company)
- 2007: 90 billion per day (>85% of all emails), at huge costs to servers as well as users
- spam filtering is a multibillion dollar industry
- One of the more successful filtering approaches developed out of computational linguistics
- we will briefly review how it works, and why it fails
24. Conditional Probability
- P(A|B) is the probability of A happening given that B has happened
- A = a person living in Philadelphia
- B = a person going to UPenn
- Both P(A) and P(B) are very small
- but P(A|B) is much larger
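With invented head counts (not real census figures), the definition P(A|B) = P(A and B) / P(B) makes the contrast concrete:

```python
# Invented counts for illustration only
population = 300_000_000
in_philly  = 1_500_000    # event A: lives in Philadelphia
at_upenn   = 25_000       # event B: goes to UPenn
both       = 20_000       # A and B: Penn people who live in Philadelphia

p_a = in_philly / population        # P(A): small
p_b = at_upenn / population         # P(B): tiny
p_a_given_b = both / at_upenn       # P(A|B) = P(A and B) / P(B)

print(f"P(A) = {p_a}, P(B) = {p_b:.6f}, P(A|B) = {p_a_given_b}")
```

Both marginal probabilities are well under one percent, yet conditioning on B pushes P(A|B) up to 0.8 with these numbers.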
25. Conditional Probability in Spam
- P(spam | document containing the word "Nigeria"): not very high
- P(spam | document containing the word "investment"): not very high
- P(spam | document containing the words "Nigeria" AND "investment"): very high
- Note that there is nothing inherently spammy about "Nigeria" or "investment"; it is a fact of the world, and the spam filter must be tuned (or trained) to it
- Spam filters must adapt to the world
26. How Does a Computer Do It?
- Collect email messages and use human judges to classify them into spam and non-spam files
- every time you report spam in Gmail, you are contributing to Google's business
- The system generates profiles of spam and non-spam messages
- in practice, this is just based on occurrences of words
- e.g., suppose "Nigeria" has a probability of 1 in 200,000 in non-spam, but 1 in 1,000 in spam; then "Nigeria" will be treated as a spam flag
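A minimal sketch of this word-profile idea with invented training messages; a real filter would combine the evidence from every word in the message (e.g., naive Bayes) and smooth its counts more carefully:

```python
from collections import Counter

# Invented training data standing in for the human-labeled messages
spam = ["nigeria investment urgent transfer",
        "investment opportunity nigeria bank"]
ham = ["lunch tomorrow", "draft of the homework", "bank holiday monday"]

def word_probs(msgs):
    """Estimate P(word | class) from a message list, with a crude
    +1 adjustment so unseen words do not get probability zero."""
    counts = Counter(w for m in msgs for w in m.split())
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + 1)

p_spam, p_ham = word_probs(spam), word_probs(ham)

for w in ["nigeria", "bank", "lunch"]:
    flag = "spam flag" if p_spam(w) > p_ham(w) else "ok"
    print(f"{w}: p_spam={p_spam(w):.2f}  p_ham={p_ham(w):.2f}  -> {flag}")
```

Note that "bank" lands just on the spam side here only because the toy corpus is tiny -- exactly the tuning-to-the-world issue the slide mentions.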
27. Difficulties
- Note that spam is often signified by co-occurrences of words ("Nigeria" + "investment")
- The technique from the previous slide requires comparisons of two probabilities: the normal and the spam
- for single words (say, nouns -- about 20,000 of them), it is possible to gather enough data to get a sense of their probabilities
- but for word pairs, triples, etc., very quickly we run out of data
- there are 20,000² = 400 million combinations; many word pairs will have zero occurrences in the data
- and this is not even talking about how words are structured in the message: the spam filters assume messages are "bags of words"
28. Summary
- Computers can be effectively used to model and study human linguistic behavior
- The infinity and ambiguity inherent in human language pose a significant challenge to engineers