LING001 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: LING001


1
LING001
  • Language and Computers
  • 4-13-2009

2
"I'm sorry Dave, I'm afraid I can't do that"
  • Computational linguistics
  • Use computational tools to understand how humans
    (and human languages) work
  • Use computational tools to make computers
    understand humans (and human languages)

3
The Turing Test
  • Language has always been viewed as the essence of
    being human
  • Alan Turing (1912-1954), a pioneer in computer
    science, proposed that a machine should be
    considered intelligent if it could fool a human
    into believing it to be another human via
    teletyping (or IRC, IM, ...)
  • This test has led to many philosophical
    controversies, but one thing is clear: no machine
    has ever passed the Test
  • ELIZA: a computer program created by Joseph
    Weizenbaum in the 1960s that played a
    psychiatrist and was disturbingly effective at
    tricking people into confessions
  • See demo

4
Part I: Language as Computation
  • Modern linguistics was/is computer science
  • representations and rules were conceived as
    components in a mechanical device that generates
    linguistic expressions, an insight that goes back
    many centuries
  • rules such as S → NP VP became the foundation of
    computer science
  • logic representations of meanings (the SEMANTICS
    lectures) were used to represent what programming
    languages express
  • Exp → Exp OP Exp
  • OP → +, -, ×, ÷
  • Exp → 0, 1, 2, ...
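The toy expression grammar above can be run as a tiny generator. A minimal Python sketch (the exact operator set is an assumption, since the slide's symbols were garbled in the transcript):

```python
import random

# Sketch of the slide's grammar:
#   Exp -> Exp OP Exp | 0 | 1 | 2 | ...
#   OP  -> + | - | * | /   (operator set assumed)
GRAMMAR = {
    "Exp": [["Exp", "OP", "Exp"], ["NUM"]],
    "OP": [["+"], ["-"], ["*"], ["/"]],
}

def generate(symbol, depth=0, max_depth=3):
    """Expand a symbol into a string, bounding recursion depth."""
    if symbol == "NUM":
        return str(random.randint(0, 9))
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol, emit as-is
    # past the depth bound, force the last (non-recursive) rule
    rules = GRAMMAR[symbol] if depth < max_depth else GRAMMAR[symbol][-1:]
    rule = random.choice(rules)
    return " ".join(generate(s, depth + 1, max_depth) for s in rule)

print(generate("Exp"))  # prints a random arithmetic expression
```

Because the grammar is recursive, it generates an unbounded set of expressions from a finite rule set, which is the "mechanical device" point of the slide.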

5
Case Study: Language Processing
  • Linguistic Problems
  • How do we analyze sentences on the fly?
  • Why are some sentences more difficult to process
    than others?
  • Computational Problems
  • computers have a processor and a memory
  • the processor carries out an algorithm (a precise
    series of steps) that, in effect, draws a syntax
    tree much like what you do in your homework
  • the tree is shipped to semantics where the
    logic-based type of calculation takes over to
    derive meanings

6
Simple Grammar
How do we parse in real time?
7
What comes down must go up
Make the left-most expansion, down to the
left-most word, and check whether the predicted
category is actually found in the input
sentence
8
Does this flight include a meal?
9
Does this flight include a meal? (Cont)
10
Does this flight include a meal? (Cont)
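The parse figures for these slides are lost in the transcript, but the top-down, left-most-expansion procedure can be sketched as a small backtracking parser. The grammar and lexicon below are my own toy assumptions (using a declarative variant of the example sentence), not the slides' actual grammar:

```python
# Toy grammar: expand the left-most nonterminal first; when a preterminal
# (Det, N, V) reaches the front, check the prediction against the input.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "the": "Det", "a": "Det",
    "flight": "N", "meal": "N",
    "includes": "V",
}

def parse(symbols, words):
    """True iff the symbol stack derives exactly the word list."""
    if not symbols:
        return not words                    # success iff input fully consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                    # left-most expansion, with backtracking
        return any(parse(rule + rest, words) for rule in GRAMMAR[first])
    # preterminal: the prediction must match the next input word
    return bool(words) and LEXICON.get(words[0]) == first and parse(rest, words[1:])

print(parse(["S"], "the flight includes a meal".split()))  # True
```

Each recursive call that tries a rule is a "prediction"; a failed match forces backtracking to the next rule, which is exactly the mechanism the following slide uses to explain center-embedding difficulty.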
11
Breakdown
  • Recall the structure of center embedding: NP →
    NP S (relative clause)
  • The cheese the rat the cat the dog worried chased
    ate lay in the house
  • Red indicates a predicted rule, and blue
    indicates a confirmed rule
  • S → NP VP → Det N VP → the N VP → the cheese
    VP
  • we are now at "The cheese"
  • VP would next predict V, which is contradicted by
    "the" in "the rat"
  • we need to trace back the NP expansion (the green
    box) to try out other rules in the grammar for NP
    (namely, NP → NP S)
  • Note that the red VP is still held in memory
    because it is predicted; more and more predictions
    will stack up, causing memory load problems and
    hence parsing difficulty
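As a rough illustration (my own, not from the slides) of the memory-load point: each subject NP predicts a VP that must be held on a stack until its verb arrives, so deeply center-embedded sentences pile up unconfirmed predictions.

```python
# Count how many predicted-but-unconfirmed VPs must be held at once while
# scanning a POS-tagged sentence left to right. Purely illustrative.
def max_pending_vps(tags):
    pending = 0                   # predicted VPs awaiting their verb
    worst = 0
    for tag in tags:
        if tag == "NP":
            pending += 1          # a new subject NP predicts a new VP
            worst = max(worst, pending)
        elif tag == "V":
            pending -= 1          # a verb confirms the most recent prediction
    return worst

# "The cheese (the rat (the cat (the dog worried) chased) ate) lay ..."
tags = ["NP", "NP", "NP", "NP", "V", "V", "V", "V"]
print(max_pending_vps(tags))  # 4
```

Four simultaneous pending predictions is far beyond what human parsers comfortably hold, which is why the cheese sentence breaks down.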

12
Case II: Language Learning
  • Recall the strategies in word segmentation
    (lecture on 3-12)
  • stress information: English is predominantly
    initial-stress
  • statistical information: the transitional
    probability between syllables tends to be lower
    at word boundaries
  • infants have been shown to be sensitive to both,
    but how (or whether) conflicting cues work
    together is less clear
  • in experimental psychology, most researchers
    focus on one cue at a time, because the research
    must eliminate confounds from other cues in order
    to establish the empirical effectiveness of the
    cue the researchers are interested in
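The statistical cue above is easy to make concrete. A minimal sketch: estimate the transitional probability P(B | A) = count(A B) / count(A) over adjacent syllables; it stays high inside words and dips at word boundaries. The corpus and syllabification here are invented for illustration:

```python
from collections import Counter

def transitional_probs(syllables):
    """Map each adjacent syllable pair (A, B) to P(B | A)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# "pretty baby pretty good baby": pre-tty is word-internal,
# tty-ba spans a word boundary
stream = ["pre", "tty", "ba", "by", "pre", "tty", "go", "od", "ba", "by"]
tp = transitional_probs(stream)
print(tp[("pre", "tty")])  # 1.0  (word-internal: high)
print(tp[("tty", "ba")])   # 0.5  (boundary: lower)
```

An infant-like learner could posit a boundary wherever the probability dips relative to its neighbors, which is the strategy the slide describes.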

13
Language Learning
  • In many cases where it is difficult to carry out
    experiments, or the resulting experiment would be
    too complicated (for young children),
    computational modeling is often the best, perhaps
    the only, way of testing hypotheses about
    language learning
  • computer models must be faithful to the findings
    in child language acquisition: e.g., they cannot
    presuppose unrealistic computing power on the
    part of the learner
  • computer models must produce behavior consistent
    with what children actually do in language
    learning
  • they also allow the integration of quantitative
    factors (e.g., frequency) with the more abstract
    representations and rules we have been discussing
    in this class

14
From Learning to Change
  • Language is transmitted through language
    learning; this process is strikingly similar to
    genetics and evolution
  • Language learning can be modeled as an adaptive
    process over the variant forms of grammar in the
    environment; some competing variants come from
    Universal Grammar, while others are contingent on
    social and cultural factors (e.g., "soda" vs.
    "pop")
  • There is a healthy amount of work now that
    applies the mathematical models of biological
    evolution to develop mathematical models of
    language change

15
Part II: Computing Language
  • Another branch of computational linguistics
    develops applications for engineering purposes
  • In most cases, the engineering techniques are
    inspired by how language works as revealed by
    linguistics
  • Manual construction of such systems is usually
    labor-intensive, so people have been looking for
    automatic ways of learning the linguistic
    system
  • Speech synthesis
  • Ambiguity in language
  • Machine translation
  • Spam filtering

16
Speech Synthesis
  • Virtually all systems were modeled after the
    human vocal tract: mechanical before, electronic
    now

Wolfgang von Kempelen (late 1700s)
17
Modern Speech Synthesis Systems
  • Many follow the pipeline of linguistic
    representations that we have been talking about
  • Note that running speech involves more than word
    synthesis; prosody is very important as well
  • this often requires the system to parse the
    sentence into tree-like structures
  • see demo

http://www.research.att.com/ttsweb/tts/demo.php
18
Ambiguity
  • Ambiguity is pervasive in language
  • word level: "two" vs. "too" vs. "to"
  • part of speech: "bark" is a verb as well as a
    noun
  • word senses: "bank" as an institution vs. "bank"
    as an object
  • syntax: "I shot an elephant in my pajamas"
  • semantics: "everyone likes someone"
  • Humans can make rapid decisions on ambiguity by
    tapping into both linguistic and non-linguistic
    knowledge, and can thus ignore the majority of
    ambiguities
  • How does a computer do that?

19
Part of Speech Disambiguation
  • Observation: some categories appear more often
    than others
  • "the" is far more likely to be followed by a noun
    or adjective, and never by a verb
  • Idea: look at a lot of pairs of words in
    preanalyzed data (e.g., "book that flight" →
    V Det N), and try to discover the regularities
  • we can then generalize these regularities to
    novel texts
  • here the problem is solved by having a "teacher"
  • The US military has paid a lot of money to
    produce tons of preanalyzed linguistic data,
    hoping that useful regularities can be extracted
    out of it.

20
The dog saw the icecream
  • (The, Dog): Det-N, Det-V
  • (Dog, Saw): N-N, N-V, V-N, V-V
  • (Saw, The): N-Det, V-Det
  • (The, Icecream): Det-N
  • Proceed from left to right, and pick out the more
    likely POS pairings (with some technical tricks
    that we omit)
  • In practice, this gets English POS correct
    above 95% of the time
  • but just assigning each word its most likely tag
    gets it correct about 91% of the time
  • reason: English has fairly rigid word order, such
    that this kind of technique works reasonably well
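The idea above can be sketched in a few lines: estimate tag-bigram counts from a tiny hand-tagged corpus, then walk left to right, picking for each word the tag most compatible with the previous one. The training data and lexicon are invented for illustration, and a real tagger would use an HMM with Viterbi decoding (the "technical tricks that we omit"):

```python
from collections import Counter

# Toy "preanalyzed" tagged stream (invented)
TAGGED = [("the", "Det"), ("dog", "N"), ("chased", "V"), ("the", "Det"),
          ("cat", "N"), ("the", "Det"), ("saw", "N"), ("dogs", "N"),
          ("saw", "V"), ("the", "Det"), ("cat", "N")]
# Possible tags per word (invented lexicon); "saw" is ambiguous
POSSIBLE = {"the": ["Det"], "dog": ["N"], "saw": ["N", "V"],
            "icecream": ["N"], "dogs": ["N", "V"]}

# Tag-bigram counts extracted from the training data
bigrams = Counter((t1, t2) for (_, t1), (_, t2) in zip(TAGGED, TAGGED[1:]))

def tag(words):
    """Greedy left-to-right tagging by tag-bigram frequency."""
    prev, out = "START", []
    for w in words:
        best = max(POSSIBLE[w], key=lambda t: bigrams[(prev, t)])
        out.append(best)
        prev = best
    return out

print(tag(["the", "dog", "saw", "the", "icecream"]))
# ['Det', 'N', 'V', 'Det', 'N']
```

Here "saw" is correctly resolved to V because N-V bigrams outnumber N-N bigrams in the training data; this is the "regularity" the slide says we generalize to novel texts.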

21
Machine Translation
  • Machine translation systems in practice differ
    with respect to how "deep" their linguistic
    analysis goes
  • Not surprisingly, the deeper systems work better
    but are more expensive to construct
  • One of the many challenges is idioms: "the
    spirit is willing but the flesh is weak" →
    "the vodka is strong but the meat is rotten"

22
Babelfish Needs Oxygen
  • Try out websites such as babelfish.altavista.com
  • Translate a passage from English to German to
    French and then back to English
  • See demo

23
Spam Filtering
  • 1978: apparently the first email spam
    (advertising for a product demo within a computer
    company)
  • 2007: 90 billion per day (>85% of all emails),
    at huge cost to servers as well as users
  • spam filtering is a multibillion-dollar industry
  • One of the more successful filtering approaches
    developed out of computational linguistics
  • we will briefly review how it works, and why it
    fails.

24
Conditional Probability
  • P(A|B) is the probability of A happening given
    that B has happened
  • A = a person living in Philadelphia
  • B = a person going to UPenn
  • Both P(A) and P(B) are very small
  • but P(A|B) is much larger
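A quick numeric illustration of the point, using the definition P(A|B) = P(A and B) / P(B). All the figures below are invented round numbers, not from the slides:

```python
# Assumed population figures, purely illustrative
p_philly = 1_500_000 / 330_000_000   # P(A): lives in Philadelphia
p_penn   = 25_000 / 330_000_000      # P(B): goes to UPenn
p_both   = 20_000 / 330_000_000      # P(A and B): Penn-goers who live there

p_a_given_b = p_both / p_penn        # P(A|B) = P(A and B) / P(B)
print(round(p_philly, 4))            # 0.0045  -- P(A) alone is tiny
print(round(p_a_given_b, 2))         # 0.8     -- but P(A|B) is large
```

Knowing B happened shifts the probability of A by more than two orders of magnitude, which is exactly the leverage the spam filter exploits on the next slide.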

25
Conditional Probability in Spam
  • P(spam | document contains the word "Nigeria"):
    not very high
  • P(spam | document contains the word
    "investment"): not very high
  • P(spam | document contains both "Nigeria" AND
    "investment"): very high
  • Note that there is nothing inherently spammy
    about "Nigeria" or "investment"; it is a fact of
    the world, and the spam filter must be tuned (or
    trained) to it
  • The spam filter must adapt to the world

26
How Does a Computer Do It?
  • Collect email messages and use human judges to
    classify them into spam and non-spam files
  • every time you report spam in Gmail, you are
    contributing to Google's business
  • The system generates profiles of spam and
    non-spam messages
  • in practice, this is just based on occurrences of
    words
  • e.g., suppose "Nigeria" has a probability of 1
    in 200,000 in non-spam but 1 in 1,000 in spam;
    then "Nigeria" will be treated as a spam flag
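Using the slide's own figures, the word-profile idea reduces to comparing a word's rate in spam against its rate in non-spam; the implementation below is an assumed toy sketch, not the slides' actual system:

```python
def spamminess(rate_in_spam, rate_in_ham):
    """Likelihood ratio P(word | spam) / P(word | non-spam):
    a ratio far above 1 marks the word as a spam flag."""
    return rate_in_spam / rate_in_ham

# "Nigeria": 1 in 1,000 in spam vs. 1 in 200,000 in non-spam
ratio = spamminess(1 / 1_000, 1 / 200_000)
print(round(ratio))  # 200
```

A word 200 times more frequent in spam is a strong flag on its own; the next slide explains why extending this comparison to word pairs and triples runs into data sparsity.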

27
Difficulties
  • Note that spam is often signified by
    co-occurrences of words ("Nigeria" + "investment")
  • The technique from the previous slide requires
    comparison of two probabilities: the normal and
    the spam
  • for single words (say, nouns, of which there are
    about 20,000), it is possible to gather enough
    data to get a sense of their probabilities
  • but for word pairs, triples, etc., we very
    quickly run out of data
  • there are 20,000² = 400 million combinations;
    many word pairs will have zero occurrences in the
    data
  • and this is not even considering how words are
    structured in the message: spam filters assume
    that messages are "bags of words"

28
Summary
  • Computers can be effectively used to model and
    study human linguistic behavior
  • The infinity and ambiguity inherent in human
    language pose a significant challenge to
    engineers