Title: Introduction to Computational Linguistics
1Introduction to Computational Linguistics
- Lecture 1: Intro to Field, History, Quick Review of Regular Expressions, Start of Finite Automata
- Based on Dan Jurafsky's lecture notes for the textbook Speech and Language Processing
2- Overview and history of the field
- Knowledge of language
- The role of ambiguity
- Models and Algorithms
- Eliza, Turing, and conversational agents
- History of speech and language processing
- Regular expressions
3Computer Speech and Language Processing
- What is Natural Language Processing?
- Getting computers to perform useful tasks involving human languages, whether for
- Enabling human-machine communication
- information retrieval, question answering, chatbots
- Improving human-human communication
- writing, translation
- Manipulating language objects
- words, phrases, sentences
4Kinds of knowledge needed?
- Consider the following interaction with HAL, the computer from 2001: A Space Odyssey
- Dave: Open the pod bay doors, Hal.
- HAL: I'm sorry Dave, I'm afraid I can't do that.
5Knowledge needed to build HAL?
- Speech recognition and synthesis
- Dictionaries (how words are pronounced)
- Phonetics (how to recognize/produce each sound of English)
- Natural language understanding
- Knowledge of the English words involved
- What they mean
- How they combine (what is a pod bay door?)
- Knowledge of syntactic structure
- I'm I do, sorry that afraid Dave I'm can't
6What's needed?
- Dialog and pragmatic knowledge
- "Open the door" is a REQUEST (as opposed to a STATEMENT or information-question)
- Politeness consideration
- It is polite to pretend to want to be cooperative
- "I'm afraid, I can't"
- Anaphoric Reference
- What is "that" in "I can't do that"?
- More practical and realistic situation
- A system to book airline flights
- Also needs much of this kind of knowledge
7Modern conversational agents
- Switchboard Receptionists
- Implication: need to know how a native or non-native speaker of English is likely to pronounce a random American name. Eek.
- Airline and other simple dialogue systems
- Lots of fabulously fun pragmatic issues
- Grounding
- Error handling
- In general
- Now that there are commercial installations
- Big incentive to use linguistic or any other kind of knowledge if it actually improves performance
8Eliza: Weizenbaum (1966)
- User: You are like my father in some ways.
- ELIZA: What resemblance do you see
- User: You are not very aggressive but I think you don't want me to notice that.
- ELIZA: What makes you think I am not aggressive
- User: You don't argue with me
- ELIZA: Why do you think I don't argue with you
- User: You are afraid of me
- ELIZA: Does it please you to believe I am afraid of you
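Replies like these can be produced by matching the user's sentence against a pattern and substituting parts of it into a canned response. A minimal sketch in Python, assuming three hypothetical rules written only to reproduce this exchange (the `RULES` list, the fallback reply, and the `respond` function are illustrative, not Weizenbaum's original script):

```python
import re

# Hypothetical ELIZA-style rules: (pattern, response template).
# Rules are tried in order and the first match wins.
RULES = [
    (r"you are like .*", "What resemblance do you see"),
    (r"you don't (.*) me", r"Why do you think I don't \1 you"),
    (r"you are (.*) of me", r"Does it please you to believe I am \1 of you"),
]


def respond(utterance):
    """Return the reply of the first matching rule, or a default prompt."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 with the text captured by the pattern.
            return match.expand(template)
    return "Please go on"


print(respond("You don't argue with me"))
print(respond("You are afraid of me"))
```

The program has no model of meaning; it only copies captured words into a template, which is why such a small rule set can appear conversational.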
9Question Answering
- What does "door" mean?
- What year was Abraham Lincoln born?
- How many states were in the United States when Lincoln was born?
- Was there a military draft during the Hoover administration?
- What do US scientists think about whether human cloning should be legal?
10Modern QA systems
- Still in infancy
- Simple factoid questions beginning to work OK
- Annual government-sponsored bakeoff called TREC
11Machine Translation
- [Chinese source sentence; original characters not recoverable]
- Chinese gloss: Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
- Hawkes translation: As she lay there alone, Dai-yu's thoughts turned to Bao-chai... Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.
12Machine Translation
- The Story of the Stone
- The Dream of the Red Chamber (Cao Xueqin 1792)
- Issues (Language Divergences)
- Sentence segmentation
- Zero-anaphora
- Coding of tense/aspect
- penetrate -> penetrated
- Stylistic differences across languages
- bamboo tip plantain leaf -> bamboos and plantains
- Cultural knowledge
- curtain -> curtains of her bed
13Open MT Evaluation 2008
- Input
- <doc docid="AFP_CMN_20070702.0022" genre="text" sysid="source">
- <hl>
- <seg id="1">???????????????????</seg>
- </hl>
- <p>
- <seg id="2">??????????????,?????????????????????????????????????,???????</seg>
- </p>
- <seg id="6">???????????????????,???????????????????</seg>
- </p>
- </doc>
14Open MT Evaluation 2008
- Output
- <doc docid="AFP_CMN_20070702.0022">
- <hl><seg id="1">White House Pushes for Nuclear Inspectors to Be Sent as Soon as Possible to Monitor North Korea's Closure of Its Nuclear Reactors</seg></hl><p>
- <seg id="2">The White House today called for nuclear inspectors to be sent as soon as possible to monitor North Korea's closure of its nuclear reactors. The White House made this call after US President Bush had telephone conversations with South Korean President Roh Moo-hyun.</seg> .
- <seg id="6">Hill, the US envoy to the six-party talks, said after a visit to Pyongyang last week that he expected the Yongbyon nuclear reactors would be shut down in the middle of July.</seg>
- </p></doc>
15Ambiguity
- Language is full of ambiguity at all levels
- Token boundary: "ice cream" vs. "I scream"
- Part of speech: "walk" as verb vs. noun
- Word sense ambiguity: money bank vs. river bank
- Fundamental problem of computational linguistics
- Resolving ambiguity is a crucial goal
- Example: Find at least 5 meanings of this sentence
- I made her duck
16Ambiguity
- Find at least 5 meanings of this sentence
- I made her duck
- I cooked waterfowl for her benefit (to eat)
- I cooked waterfowl belonging to her
- I created the (plaster?) duck she owns
- I caused her to quickly lower her head or body
- I waved my magic wand and turned her into undifferentiated waterfowl
- At least one other meaning that's inappropriate for gentle company.
17Ambiguity is Pervasive
- I caused her to quickly lower her head or body
- Lexical category: "duck" can be a N or V
- I cooked waterfowl belonging to her.
- Lexical category: "her" can be a possessive ("of her") or dative ("for her") pronoun
- I made the (plaster) duck statue she owns
- Lexical semantics: "make" can mean "create" or "cook"
18Ambiguity is Pervasive
- Grammar: "make" can be
- Transitive (verb has a noun direct object)
- I cooked waterfowl belonging to her
- Ditransitive (verb has 2 noun objects)
- I made her (into) undifferentiated waterfowl
- Action-transitive (verb has a direct object and another verb)
- I caused her to move her body
19Ambiguity is Pervasive
- Phonetics!
- I mate or duck
- I'm eight or duck
- Eye maid her duck
- Aye mate, her duck
- I maid her duck
- I'm aid her duck
- I mate her duck
- I'm ate her duck
- I'm ate or duck
- I mate or duck
20Models and Algorithms
- Models: formalisms used to capture the various kinds of linguistic structure.
- State machines (FSA, FS Transducers, Markov models)
- Formal rule systems
- Context-Free Grammars, Feature-based Grammars
- Logic (predicate calculus, inference)
- Probabilistic versions of all of these, and others
- Gaussian Mixture Models, Probabilistic relational models, etc.
- Algorithms: used to manipulate representations to create structure.
- Search (A*, dynamic programming)
- Supervised learning, etc.
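To make the "state machine" model concrete, here is a minimal sketch of a deterministic finite-state automaton in Python. The language it recognizes (a "b", at least two "a"s, then "!") is an assumed toy example, and the names `TRANSITIONS`, `ACCEPTING`, and `accepts` are hypothetical:

```python
# Transition table for a toy FSA: (state, input symbol) -> next state.
# State 0 is the start state; state 4 is the only accepting state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}


def accepts(string):
    """Return True if the automaton ends in an accepting state."""
    state = 0
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


print(accepts("baaa!"))
print(accepts("ba!"))
```

The same language can be written as the regular expression /baa+!/, which illustrates the correspondence between finite-state automata and regular languages.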
21Language, thought, and Machine Understanding
- A Gedanken Experiment: Turing Test
- Question "can a machine think?" is not operational.
- Operational version:
- 2 people and a computer
- Interrogator talks to contestant and computer via teletype
- Task of machine is to convince interrogator it is human
- Task of contestant is to convince interrogator she and not machine is human.
Wikipedia: A thought experiment (from the German term Gedankenexperiment, coined by Hans Christian Ørsted) in the broadest sense is the use of a hypothetical scenario to help us understand the way things actually are.
22Eliza: Weizenbaum (1966)
- User: You are like my father in some ways.
- ELIZA: What resemblance do you see
- User: You are not very aggressive but I think you don't want me to notice that.
- ELIZA: What makes you think I am not aggressive
- User: You don't argue with me
- ELIZA: Why do you think I don't argue with you
- User: You are afraid of me
- ELIZA: Does it please you to believe I am afraid of you
23History: foundational insights 1940s-1950s
- Automaton
- Turing 1936: Turing Machine (the most powerful machine in abstract terms)
- McCulloch-Pitts neuron (1943)
- http://diwww.epfl.ch/mantra/tutorial/english/mcpits/html/
- Kleene (1951/1956): FSA recognizes a regular language.
- Shannon (1948): link between automata and Markov models
- Chomsky (1956)/Backus (1959)/Naur (1960): CFG
- Probabilistic/Information-theoretic models
- Shannon (1948)
- Bell Labs speech recognition (1952)
24History: the two camps 1957-1970
- Symbolic
- Zellig Harris 1958 TDAP: first parser?
- Cascade of finite-state transducers
- Chomsky: Generative Grammar
- AI workshop at Dartmouth (McCarthy, Minsky, Shannon, Rochester)
- Newell and Simon: Logic Theorist, General Problem Solver
- Statistical
- Bledsoe and Browning (1959): Bayesian OCR
- Mosteller and Wallace (1964): Bayesian authorship attribution
- Denes (1959): ASR combining grammar and acoustic probability
25Four paradigms 1970-1983
- Stochastic
- Hidden Markov Model 1972
- Independent application of Baker (CMU) and Jelinek/Bahl/Mercer lab (IBM) following work of Baum and colleagues at IDA
- Logic-based
- Colmerauer (1970, 1975): Q-systems
- Definite Clause Grammars (Pereira and Warren 1980)
- Kay (1979) functional grammar, Bresnan and Kaplan (1982) unification
- Natural language understanding
- Winograd (1972): Shrdlu
- Schank and Abelson (1977): scripts, story understanding
- Influence of case-role work of Fillmore (1968) via Simmons (1973), Schank.
- Discourse Modeling
- Grosz and colleagues: discourse structure and focus
- Perrault and Allen (1980): BDI model
26Empiricism and Finite State Machines return 1983-1993
- Finite State Models
- Kaplan and Kay (1981) Phonology/Morphology
- Church (1980) Syntax
- Return of Probabilistic Models
- Corpora created for language tasks
- Early statistical versions of NLP applications (parsing, tagging, machine translation)
- Increased focus on methodological rigor
- Can't test your hypothesis on the data you used to build it!
- Training sets and test sets
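The training set / test set discipline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 80/20 ratio, the fixed random seed, the placeholder sentences, and the helper name `train_test_split` are all hypothetical choices, not something prescribed by the lecture:

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test items are held out; the rest are used for building the model.
    return shuffled[n_test:], shuffled[:n_test]


sentences = [f"sentence {i}" for i in range(10)]
train, test = train_test_split(sentences)
print(len(train), len(test))
```

The model (or hand-written rule set) is developed on `train` only; `test` is consulted once, to measure how well the result generalizes.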
27The field comes together 1994-2007
- NLP has borrowed statistical modeling from speech recognition, is now standard
- ACL conference
- 1990: 39 articles, 1 statistical
- 2003: 62 articles, 48 statistical
- Machine learning techniques key
- NLP has borrowed focus on web and search and "bag of words" models from information retrieval
- Unified field
- NLP, MT, ASR, TTS, Dialog, IR
28How this course fits in
- This is our new introductory course in natural language processing
- Covers applications
- information retrieval
- machine translation
- educational applications
- For speech and dialog processing, check other courses by ???
29Some brief demos
- Machine Translation
- http://translate.google.com/translate_t
30Regular expressions
- A formal language for specifying text strings
- How can we search for any of these?
- woodchuck
- woodchucks
- Woodchuck
- Woodchucks
Figure from Dorr/Monz slides
31Regular Expressions
- Basic regular expression patterns
- Perl-based syntax (slightly different from other notations for regular expressions)
- Disjunctions: /[wW]oodchuck/
Slide from Dorr/Monz
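As a quick check of the woodchuck search, here is a small Python sketch using the standard `re` module (Python's syntax is close to the Perl syntax used on the slides). The sample sentence is made up, and the trailing `s?` uses the optional-character operator to cover the plural forms:

```python
import re

# [wW] is a character disjunction; s? makes the plural "s" optional.
pattern = re.compile(r"[wW]oodchucks?")

text = "A woodchuck met two Woodchucks; woodchucks and a Woodchuck followed."
matches = pattern.findall(text)
print(matches)
```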
32Regular Expressions
Slide from Dorr/Monz
33Regular Expressions
- Optional characters: ?, * and +
- ? (0 or 1)
- /colou?r/ → color or colour
- * (0 or more)
- /oo*h!/ → oh! or ooh! or ooooh!
- + (1 or more)
- /o+h!/ → oh! or ooh! or ooooh!
- Wild cards: .
- /beg.n/ → begin or began or begun
Slide from Dorr/Monz
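The counters and the wildcard can be tried out directly in Python. This is a small sketch using `re.fullmatch`, which requires the whole string to match; the candidate word lists are made up for illustration:

```python
import re


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


# ? : zero or one of the preceding character
optional_u = matching(r"colou?r", ["color", "colour", "colouur"])
# * : zero or more of the preceding character
star = matching(r"oo*h!", ["h!", "oh!", "ooooh!"])
# + : one or more of the preceding character
plus = matching(r"o+h!", ["h!", "oh!", "ooh!"])
# . : any single character
wildcard = matching(r"beg.n", ["begin", "began", "begun", "bgn"])

print(optional_u, star, plus, wildcard)
```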
34Regular Expressions
- Anchors ^ and $
- /^[A-Z]/ → "Ramallah, Palestine"
- /^[^A-Z]/ → "verdad?" "really?"
- /\.$/ → "It is over."
- /.$/ → ?
- Boundaries \b and \B
- /\bon\b/ → "on my way" (but not "Monday")
- /\Bon\b/ → "automaton"
- Disjunction |
- /yours|mine/ → "it is either yours or mine"
Slide from Dorr/Monz
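The anchor, boundary, and disjunction operators behave the same way in Python's `re` module. A short sketch, with variable names chosen only for this illustration:

```python
import re

# ^ anchors at the start of the string, $ at the end.
starts_upper = re.search(r"^[A-Z]", "Ramallah, Palestine") is not None
starts_non_upper = re.search(r"^[^A-Z]", "really?") is not None
ends_with_period = re.search(r"\.$", "It is over.") is not None

# \b is a word boundary, \B is a non-boundary.
standalone_on = re.findall(r"\bon\b", "on my way Monday")
suffix_on = re.search(r"\Bon\b", "automaton")

# | is disjunction between whole alternatives.
either = re.findall(r"yours|mine", "it is either yours or mine")

print(standalone_on, suffix_on.span(), either)
```

Note that inside brackets `^` means negation, while outside brackets it is the start-of-line anchor.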
35Disjunction, Grouping, Precedence
- "Column 1 Column 2 Column 3": how do we express this?
- /Column [0-9]+ */
- /(Column [0-9]+ *)*/
- Precedence
- Parenthesis: ()
- Counters: * + ? {}
- Sequences and anchors: the ^my end$
- Disjunction: |
- REs are greedy!
Slide from Dorr/Monz
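Grouping, operator precedence, and greediness can each be observed with a one-line experiment. A sketch in Python, where the test strings other than the column header are assumptions made for the illustration:

```python
import re

row = "Column 1 Column 2 Column 3"

# Without grouping, the pattern covers a single column label.
one_column = re.search(r"Column [0-9]+ *", row).group()

# Parentheses make the trailing * apply to the whole "Column N " unit.
all_columns = re.search(r"(Column [0-9]+ *)*", row).group()

# Counters bind tighter than sequences: /the*/ repeats only the "e".
e_repeated = re.fullmatch(r"the*", "theee") is not None
the_repeated = re.fullmatch(r"the*", "thethe") is not None
group_repeated = re.fullmatch(r"(the)*", "thethe") is not None

# Greedy matching: [a-z]* takes as many letters as it can.
greedy = re.search(r"[a-z]*", "once upon a time").group()

print(repr(one_column), repr(all_columns), greedy)
```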
36Example
- Find me all instances of the word "the" in a text.
- /the/
- Misses capitalized examples
- /[tT]he/
- Returns "other" or "theology"
- /\b[tT]he\b/
- /[^a-zA-Z][tT]he[^a-zA-Z]/
- /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
Slide from Dorr/Monz
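The successive refinements can be compared by counting matches on one sentence. A sketch in Python; the sample sentence is invented so that it contains a capitalized "The" plus the distractors "other" and "theology":

```python
import re

text = "The other day the theology student read the book."

# Naive pattern: misses "The", wrongly hits "other" and "theology".
naive = re.findall(r"the", text)

# Allowing either case adds "The" but keeps the wrong hits.
either_case = re.findall(r"[tT]he", text)

# Word boundaries leave only the standalone word.
bounded = re.findall(r"\b[tT]he\b", text)

# Same idea with explicit non-letter context; finditer is used because
# findall would return the contents of the parenthesized group instead.
explicit = list(re.finditer(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]", text))

print(len(naive), len(either_case), len(bounded), len(explicit))
```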
37Errors
- The process we just went through was based on fixing two kinds of errors
- Matching strings that we should not have matched (there, then, other)
- False positives
- Not matching things that we should have matched (The)
- False negatives
38Errors cont.
- We'll be telling the same story for many tasks, all quarter. Reducing the error rate for an application often involves two antagonistic efforts:
- Increasing accuracy (minimizing false positives)
- Increasing coverage (minimizing false negatives).
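The two error types can be counted explicitly. The sketch below, in Python, scores the naive /the/ pattern against a hand-made word list; the data, the helper name `precision_recall`, and the use of the terms precision (for the slide's "accuracy") and recall (for "coverage") are assumptions made for this illustration:

```python
import re


def precision_recall(predicted, gold):
    """Compute precision and recall from sets of predicted and true items."""
    true_pos = len(predicted & gold)
    false_pos = len(predicted - gold)  # matched but should not have been
    false_neg = len(gold - predicted)  # should have matched but were not
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall


words = ["The", "other", "day", "the", "theology", "student", "read", "the", "book"]

# Gold standard: positions where the word really is "the" (any case).
gold = {i for i, w in enumerate(words) if w.lower() == "the"}
# Prediction: positions where the naive pattern /the/ finds a match.
predicted = {i for i, w in enumerate(words) if re.search(r"the", w)}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here "other" and "theology" are false positives and "The" is a false negative, so tightening the pattern to remove one kind of error has to be balanced against introducing the other.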
39More complex RE example
- Regular expressions for prices
- /$[0-9]+/
- Doesn't deal with fractions of dollars
- /$[0-9]+\.[0-9][0-9]/
- Doesn't allow $199, not word-aligned
- /\b$[0-9]+(\.[0-9][0-9])?\b/
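A runnable variant of the price pattern in Python, offered as a sketch with two stated adjustments: the dollar sign is escaped as `\$` because an unescaped `$` is the end-of-line anchor in Python's `re`, and the leading `\b` is dropped because `$` is not a word character, so a boundary there would only hold when a letter or digit directly precedes the dollar sign. The sample text is made up:

```python
import re

# Dollar sign, one or more digits, optional two-digit cents, word boundary.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")

text = "It costs $199 or $24.99, not 24 dollars."

# finditer + group(0) returns whole matches rather than the cents group.
prices = [m.group(0) for m in price.finditer(text)]
print(prices)
```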
40Advanced operators
Slide from Dorr/Monz
41Conclusion
- Overview and history of the field
- Knowledge of language
- The role of ambiguity
- Models and Algorithms
- Eliza, Turing, and conversational agents
- History of computational linguistics
- The merger of 4 fields: NLP, Speech Recognition, Dialog, Information Retrieval
- Regular expressions
- Finite State Automata