Title: CC384 Natural Language Engineering
1. CC384 - Natural Language Engineering
- Linguistic Essentials
- What NLE systems do
2. This lecture
- An overview of basic linguistic terms and concepts used in the module
- A basic architecture for NL systems
3. Levels of Linguistic Analysis
- Word level
- Parts of speech: DOG, EAT, RED
- Sub-word level
- Phonetics, Phonology
- Morphology
- Phrase level (syntax): THE RED DOG, CUTTING CORNERS, BY 3 O'CLOCK
- Semantics
- E.g., lexical semantics: COURSE / MODULE, DOG / ANIMAL
- Discourse
- E.g., anaphora: MOST STUDENTS SETTING OFF ON GAP YEAR TRIPS WILL NEED THEIR MONEY AND POSSESSIONS TO FIT SECURELY AND COMPACTLY INTO A RUCKSACK AND MONEYBELT
4. Words: Parts of speech
- Words belong to different classes
- The sad / sheep / run / always / most dog barked
- Basic PARTS OF SPEECH
- NOUN: dog, man, car, law
- ADJECTIVE: red, fat, brave
- VERB: run, barked
- Best-known set of part-of-speech TAGS: the Brown tags
- NN for nouns
- VB for verb base forms
- JJ for adjectives in positive form
- Notice: many words belong to more than one class
- Open and closed classes
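The point that many words belong to more than one class can be sketched with a toy lexicon. The entries and the helper names (`LEXICON`, `possible_tags`, `is_ambiguous`) are illustrative assumptions, not taken from the Brown corpus itself.

```python
# Toy lexicon: each word maps to the set of Brown-style tags it can take.
# The entries are hand-picked for illustration only.
LEXICON = {
    "dog": {"NN", "VB"},    # "the dog" / "to dog someone"
    "run": {"NN", "VB"},    # "a run" / "to run"
    "red": {"JJ", "NN"},    # "a red rose" / "a deep red"
    "law": {"NN"},
    "brave": {"JJ", "VB"},  # "a brave man" / "to brave the storm"
}


def possible_tags(word):
    """Return the sorted tags listed for a word (empty list if unknown)."""
    return sorted(LEXICON.get(word.lower(), set()))


def is_ambiguous(word):
    """A word is ambiguous if the lexicon lists more than one tag for it."""
    return len(possible_tags(word)) > 1


for w in ["dog", "law", "red"]:
    label = "ambiguous" if is_ambiguous(w) else "unambiguous"
    print(w, possible_tags(w), label)
```

A real tagger has to choose among these candidate tags using context, which is what makes POS tagging a non-trivial step.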
5. Nouns and Pronouns
- Nouns: cat, dog, house, notebook
- Plurals
- Regular: dog -> dogs
- Irregular: deer -> deer, ox -> oxen
- Case
- The woman's house
- Pronouns: I, you, he / she / it, we, they
- Accusative case: him, her, them
- Genitive case: his, her, theirs
- Reflexives
- Herself
- Mary saw her in the mirror
- Mary saw herself in the mirror
6. Words that go with nouns: determiners, adjectives
- Determiners
- Articles: THE tree
- Demonstratives: THIS tree
- Quantifiers: MANY trees, MOST children, ...
- Brown tags: AT for articles, DT for singular demonstratives (THIS, THAT), DTS for plural ones
- Adjectives
- A red rose, many intelligent children, ...
- Predicative use: That rose is red; Many children are intelligent
- Comparative use: John is richer than Bill
- Superlative use: John is the trendiest student in his class / John is the MOST incompetent mechanic I ever met.
- Brown tags: JJ for positive form, JJR for comparatives, JJT for superlatives
7. Verbs
- Used to describe
- Actions: She threw the stone
- Activities: She walked along the river
- States: I have 50
- Morphological forms
- Base form: walk
- 3rd singular present tense: walks
- Gerund and present participle: walking
- Past tense, past / passive participle: walked
- Auxiliaries
- John has been to Boston
- Modal auxiliaries
- You should spend more time with your family
8. Other parts of speech
- Adverbs (RB)
- She often travels to Las Vegas
- Prepositions (IN)
- In the glass, over the table
- Particles (RP)
- He put me off
- Conjunctions (CC)
- She bought her car, but she also considered leasing it.
- She bought or leased the car.
- Give me a peach or an apple.
9. Sub-word level: Morphology
- Inflection
- Dog / dogs
- Eat / eats
- Derivation
- Adjectives into adverbs: -ly
- widely (from wide)
- But note: difficultly
- Verbs into adjectives: -able
- Understandable
- Compounding
- Tea kettle
- Schadenfreude
- Finnish: rautatieasemassa
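A minimal sketch of inflection and derivation as suffix rules, assuming a small exception table for irregular plurals. The function names and the sibilant rule are illustrative assumptions, and real morphology needs far more than this.

```python
# Irregular plurals have to be listed; they cannot be derived by rule.
IRREGULAR_PLURALS = {"deer": "deer", "ox": "oxen"}

# Endings that take "-es" rather than "-s" (a simplification).
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def add_s(stem):
    """Attach the regular -s / -es suffix."""
    return stem + ("es" if stem.endswith(SIBILANT_ENDINGS) else "s")


def pluralize(noun):
    """Inflection: noun -> plural noun, checking exceptions first."""
    return IRREGULAR_PLURALS.get(noun, add_s(noun))


def third_singular(verb):
    """Inflection: verb base form -> 3rd singular present."""
    return add_s(verb)


def to_adverb(adjective):
    """Derivation: adjective + -ly, dropping a final 'e' is not needed here."""
    return adjective + "ly"


def to_adjective(verb):
    """Derivation: verb + -able."""
    return verb + "able"


print(pluralize("dog"), pluralize("ox"), third_singular("eat"))
print(to_adverb("wide"), to_adjective("understand"))
# The rule applies blindly, so it also produces the doubtful "difficultly".
print(to_adverb("difficult"))
```

The last line shows why derivation is harder to treat by rule than inflection: the suffix rule overgenerates and has no way to know which outputs are acceptable words.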
10. Syntax: Phrase Structure
- Words are organized in PHRASES
- I put THE BAGELS in the freezer
- I put THE BAGELS THAT WE HAD NOT EATEN in the freezer
- Phrases are classified according to their main CONSTITUENT, or HEAD
- Noun phrases
- the bagels, the homeless old man that I tried to help yesterday
- Mary, she, one of them
- Verb phrases
- Mary went to the store and bought a bagel
- Adjective phrases
- John is tall / very tall / quite certain to succeed
- Sentences
11. Marking Phrase Constituents
- BRACKETING
- [S [NP The children] [VP ate [NP the cake]]]
- TREES
S
  NP
    AT the
    NNS children
  VP
    VBD ate
    NP
      AT the
      NN cake
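Bracketing and trees are two views of the same structure. As a sketch, the tree for "the children ate the cake" can be stored as nested tuples and printed as a bracketed string; the tuple encoding and the function names are assumptions made for this example.

```python
# Each node is (label, child, child, ...); a leaf child is a plain string.
TREE = (
    "S",
    ("NP", ("AT", "the"), ("NNS", "children")),
    ("VP",
     ("VBD", "ate"),
     ("NP", ("AT", "the"), ("NN", "cake"))),
)


def bracket(node):
    """Render a tree as a labelled bracketing."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "[" + label + " " + " ".join(bracket(c) for c in children) + "]"


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


print(bracket(TREE))
print(" ".join(leaves(TREE)))
```

Unlike the phrase-level bracketing on the slide, this output also brackets each word with its part-of-speech tag, because the tree includes the tag level.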
12. Semantics
- Lexical semantics
- (Near) Synonyms: COURSE / MODULE
- Hypernyms: DOG / ANIMAL
- Compositional semantics
- John ran
- Red car
- Red herring
13. Discourse
- Anaphora
- John arrived late. He always does that.
- My car didn't start this morning. There was some problem with the engine fan.
- Discourse relations
- My car didn't start this morning BECAUSE there was some problem with the engine fan.
14. Levels of linguistic processing: the basic pipeline of an NLP system (e.g., GATE)
15. Example: processing a query to a web search engine
Input: List the estate agents in Stratford, London.
- PREPROCESSING: TOKENIZATION
- LEXICAL PROCESSING: POS TAGGING
- SYNTACTIC PROCESSING: TERM IDENTIFICATION, STOP WORDS
- SEMANTIC PROCESSING: SYNONYMS
- WEB ACCESS
16. Preprocessing: tokenizing, conversion to a standard format (e.g., XML)
List the estate agents in Stratford, London
PARAGRAPH MARKUP TOKENIZER
<W C='w'>List</W> <W C='w'>the</W> <W C='w'>estate</W>
<W C='w'>agents</W> <W C='w'>in</W> <W C='w'>Stratford</W>
<W C='w'>,</W> <W C='w'>London</W>
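A minimal sketch of this preprocessing step: a regular-expression tokenizer that separates words from punctuation and wraps each token in a `W` element. The regular expression and the function names are assumptions for illustration, not the tokenizer used in any particular system.

```python
import re
from xml.sax.saxutils import escape

# A token is either a run of word characters or one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_markup(tokens, tag="w"):
    """Wrap each token in a W element carrying a placeholder class."""
    return " ".join("<W C='{}'>{}</W>".format(tag, escape(t)) for t in tokens)


QUERY = "List the estate agents in Stratford, London"
tokens = tokenize(QUERY)
print(tokens)
print(to_markup(tokens))
```

Note that this pattern would split "Corp." into two tokens; abbreviations need a list or extra rules, which is one reason tokenization is less trivial than it looks.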
17. Processing Steps
- LEXICAL PROCESSING
- POS TAGGING
- THE -> THE/DT, ESTATE -> ESTATE/NN
- STEMMING / LEMMATIZATION
- AGENTS -> AGENT (or even AGENT N PL)
18. Lexical Processing, I: POS tagging
<W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W>
<W C='NNS'>agents</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W>
<W C='CM'>,</W> <W C='NNP'>London</W>
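The tagging above can be imitated with a lookup table plus a fallback for unknown words. This is only a sketch under strong assumptions: the table `TAG_TABLE` is hand-written for this one query, and real taggers use context and statistics rather than a fixed lookup.

```python
# Hand-written tags for the words of the example query (illustrative only).
TAG_TABLE = {
    "list": "VB",
    "the": "DT",
    "estate": "NN",
    "agents": "NNS",
    "in": "IN",
    ",": "CM",
}


def tag(tokens):
    """Assign one tag per token: table lookup first, then simple fallbacks."""
    tagged = []
    for token in tokens:
        if token.lower() in TAG_TABLE:
            pos = TAG_TABLE[token.lower()]
        elif token[:1].isupper():
            pos = "NNP"  # unknown capitalised word: guess proper noun
        else:
            pos = "NN"   # unknown lower-case word: guess common noun
        tagged.append((token, pos))
    return tagged


TOKENS = ["List", "the", "estate", "agents", "in", "Stratford", ",", "London"]
for word, pos in tag(TOKENS):
    print("<W C='{}'>{}</W>".format(pos, word))
```

The lookup runs before the capitalisation check so that the sentence-initial "List" is tagged VB rather than being mistaken for a proper noun.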
19. Lexical Processing, II: lemmatizing / stemming
<W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W>
<W C='NNS'>agent</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W>
<W C='CM'>,</W> <W C='NNP'>London</W>
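A rough sketch of tag-driven lemmatization: the POS tag decides which suffix, if any, to strip. The two rules here are assumptions chosen to cover the examples in these slides; they would mangle many real words (for instance "buses" or "stopped").

```python
def lemmatize(word, pos):
    """Strip a regular inflectional suffix, guided by the POS tag."""
    if pos == "NNS" and word.endswith("s"):
        return word[:-1]          # agents -> agent
    if pos == "VBD" and word.endswith("ed"):
        return word[:-2]          # posted -> post
    return word                   # everything else is left unchanged


def lemmatize_all(tagged):
    """Replace each word by its lemma, keeping the original tag."""
    return [(lemmatize(word, pos), pos) for word, pos in tagged]


TAGGED = [("List", "VB"), ("the", "DT"), ("estate", "NN"), ("agents", "NNS"),
          ("in", "IN"), ("Stratford", "NNP"), (",", "CM"), ("London", "NNP")]
print(lemmatize_all(TAGGED))
```

Keeping the tag alongside the lemma preserves the information that the original form was plural, which corresponds to the "AGENT N PL" style of analysis.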
20. Processing Steps, III: Syntactic Processing
- Identify TERMS: ESTATE AGENT
- Remove STOPWORDS (e.g., words tagged as DT, IN, VB, ...)
21. Practical (partial) parsing: identifying search terms, filtering
<SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
<SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
<BOOL> <W C='CM'>,</W> </BOOL>
<SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
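The two syntactic steps, stop-word removal and term identification, can be sketched as a filter followed by a very simple chunker. The grouping rule (adjacent common nouns form one term, adjacent proper nouns another, anything else is a boundary) is an assumption made for this example.

```python
STOP_TAGS = {"DT", "IN", "VB"}   # tags treated as stop words
COMMON = {"NN", "NNS"}
PROPER = {"NNP"}


def remove_stopwords(tagged):
    """Drop every token whose tag marks it as a stop word."""
    return [(w, t) for w, t in tagged if t not in STOP_TAGS]


def search_terms(tagged):
    """Group adjacent nouns of the same kind into multi-word terms."""
    terms, current, current_kind = [], [], None
    for word, pos in tagged:
        kind = "common" if pos in COMMON else "proper" if pos in PROPER else None
        if kind is None or kind != current_kind:
            if current:                     # close the term in progress
                terms.append(" ".join(current))
            current = []
        if kind is not None:
            current.append(word)
        current_kind = kind
    if current:
        terms.append(" ".join(current))
    return terms


LEMMAS = [("List", "VB"), ("the", "DT"), ("estate", "NN"), ("agent", "NNS"),
          ("in", "IN"), ("Stratford", "NNP"), (",", "CM"), ("London", "NNP")]
print(search_terms(remove_stopwords(LEMMAS)))
```

The comma survives stop-word removal and acts as a term boundary, which is why Stratford and London come out as separate terms.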
22. Processing Steps, IV: Semantic Processing
- QUERY EXPANSION: ESTATE AGENT OR REAL ESTATE
23. Semantic processing: finding synonyms (or better keywords), interpreting stop words.
<SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
<BOOL TYPE='OR'></BOOL>
<SEARCHTERM> <W C='NN'>real</W> <W C='NN'>estate</W> </SEARCHTERM>
<BOOL TYPE='AND'></BOOL>
<SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
<BOOL TYPE='AND'> <W C='CM'>,</W> </BOOL>
<SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
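Query expansion can be sketched as a synonym lookup whose alternatives are joined with OR, while distinct terms are joined with AND. The synonym table holds only the pair from the slide; the function names and the quoted output syntax are assumptions, since real search engines differ in query syntax.

```python
# Synonym table with the single pair used in the example.
SYNONYMS = {"estate agent": ["real estate"]}


def expand(term):
    """Return the term followed by any known alternatives."""
    return [term] + SYNONYMS.get(term, [])


def build_query(terms):
    """OR together alternatives of one term, AND together distinct terms."""
    groups = []
    for term in terms:
        quoted = ['"{}"'.format(alt) for alt in expand(term)]
        if len(quoted) > 1:
            groups.append("(" + " OR ".join(quoted) + ")")
        else:
            groups.append(quoted[0])
    return " AND ".join(groups)


print(build_query(["estate agent", "Stratford", "London"]))
# prints: ("estate agent" OR "real estate") AND "Stratford" AND "London"
```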
24. More advanced examples: Information Extraction Systems (e.g., LASIE)
25. Preprocessing, I: tokenizing
In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share. Late last night the company announced a growth of 20%.
PARAGRAPH MARKUP TOKENIZER
<P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W>
<W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W>
<W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W>
<W C='W'>million</W> <W C='CM'>,</W> <W C='W'>or</W>
<W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W>
<W C='W'>share</W> <W C='.'>.</W> </P>
27. Preprocessing, II: sentence splitting
<P> <S> <W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W>
<W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W>
<W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W>
<W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W>
<W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W>
<W C='W'>share</W> <W C='.'>.</W></S> </P>
<P> <S> <W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W>
<W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W>
<W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W>
<W C='CD'>20</W><W C='W'>%</W> <W C='.'>.</W>
</S> </P>
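Sentence splitting has to tell a sentence-final full stop from the one in "Corp.". A minimal sketch, assuming a small hand-made abbreviation list; the list, the function name, and the currency and percent signs in the example text are assumptions.

```python
import re

# Words whose trailing full stop does not end a sentence (illustrative list).
ABBREVIATIONS = {"Corp.", "Inc.", "Ltd.", "Mr.", "Dr."}


def split_sentences(text):
    """Split after . ! or ? followed by whitespace, unless the full stop
    belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1            # keep the punctuation mark
        words = text[start:end].split()
        if words and words[-1] in ABBREVIATIONS:
            continue                       # not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


TEXT = ("In July 1995 CEG Corp. posted net of $102 million, or 34 cents "
        "a share. Late last night the company announced a growth of 20%.")
for sentence in split_sentences(TEXT):
    print("<S>" + sentence + "</S>")
```

The obvious weakness is an abbreviation that really does end a sentence; this sketch would then fail to split, which is why practical splitters also look at the following word.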
28. Lexical Processing, I: POS tagging
<W C='NNP'>CEG</W> <W C='NN'>Corp.</W> <W C='VBD'>posted</W>
<W C='NN'>net</W> <W C='IN'>of</W> <W C='S'>$</W> <W C='CD'>102</W>
<W C='NN'>million</W> <W C='CM'>,</W>
29. Lexical Processing, II: lemmatizing / stemming
<W C='NNP'>CEG</W> <W C='NN'>Corp.</W> <W C='VBD'>post</W>
<W C='NN'>net</W> <W C='IN'>of</W> <W C='S'>$</W> <W C='CD'>102</W>
<W C='NN'>million</W> <W C='CM'>,</W>
30. An example of practical (partial) parsing: identifying numerical expressions
<W C='NNP'>CEG</W> <W C='NN'>Corp.</W> <W C='VBD'>post</W>
<W C='NN'>net</W> <W C='IN'>of</W>
<NUMEX> <W C='S'>$</W> <W C='CD'>102</W> <W C='NN'>million</W> </NUMEX>
<W C='CM'>,</W>
31. An example of practical semantic processing: identifying semantic type
<W C='NNP'>CEG</W> <W C='NN'>Corp.</W> <W C='VBD'>post</W>
<W C='NN'>net</W> <W C='IN'>of</W>
<NUMEX TYPE='MONEY'> <W C='S'>$</W> <W C='CD'>102</W> <W C='NN'>million</W> </NUMEX>
<W C='CM'>,</W>
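Finding numerical expressions and assigning them a semantic type can be sketched as one pass over the tagged tokens. The word lists, the type names other than MONEY, and the dollar sign in the example are assumptions; a real system would also recognise dates such as "July 1995" instead of labelling the year as a plain number.

```python
CURRENCY_SYMBOLS = {"$"}
CURRENCY_UNITS = {"cents", "dollars"}
MAGNITUDES = {"million", "billion"}


def find_numex(tokens):
    """tokens: list of (word, tag). Return a list of (type, words)."""
    found, i = [], 0
    while i < len(tokens):
        start, kind = i, None
        if tokens[i][0] in CURRENCY_SYMBOLS:      # optional leading symbol
            kind = "MONEY"
            i += 1
        if i < len(tokens) and tokens[i][1] == "CD":
            i += 1
            if i < len(tokens) and tokens[i][0] in MAGNITUDES:
                i += 1                            # "102 million"
            if i < len(tokens) and tokens[i][0] in CURRENCY_UNITS:
                kind = "MONEY"                    # "34 cents"
                i += 1
            elif i < len(tokens) and tokens[i][0] == "%":
                kind = "PERCENT"
                i += 1
            found.append((kind or "NUMBER", [w for w, _ in tokens[start:i]]))
        else:
            i = start + 1                         # no number here, move on
    return found


TAGGED = [("post", "VBD"), ("net", "NN"), ("of", "IN"), ("$", "S"),
          ("102", "CD"), ("million", "NN"), (",", "CM"), ("or", "CC"),
          ("34", "CD"), ("cents", "NNS"), ("a", "DT"), ("share", "NN")]
for kind, words in find_numex(TAGGED):
    print("<NUMEX TYPE='{}'>{}</NUMEX>".format(kind, " ".join(words)))
```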
32. An example of discourse processing: resolving anaphoric references
In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share. Late last night the company announced a growth of 20%.
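In this example "the company" refers back to "CEG Corp.". A toy heuristic for that link: treat a capitalised word followed by a corporate suffix as an organisation mention, and resolve "the company" to the nearest earlier mention. The suffix list and function names are assumptions, and real anaphora resolution needs far richer evidence than recency.

```python
ORG_SUFFIXES = {"Corp.", "Inc.", "Ltd."}


def organization_mentions(words):
    """Return (position, name) for each 'Name Corp.'-style mention."""
    mentions = []
    for i in range(1, len(words)):
        if words[i] in ORG_SUFFIXES and words[i - 1][:1].isupper():
            mentions.append((i - 1, words[i - 1] + " " + words[i]))
    return mentions


def resolve_definite(words, head="company"):
    """Link each 'the <head>' to the most recent earlier organisation."""
    orgs = organization_mentions(words)
    links = []
    for i in range(1, len(words)):
        if words[i - 1].lower() == "the" and words[i] == head:
            earlier = [name for pos, name in orgs if pos < i]
            links.append((i, earlier[-1] if earlier else None))
    return links


# Currency and percent signs in the text are assumed.
TEXT = ("In July 1995 CEG Corp. posted net of $102 million, or 34 cents "
        "a share. Late last night the company announced a growth of 20%.")
for position, antecedent in resolve_definite(TEXT.split()):
    print("'the company' at word", position, "->", antecedent)
```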
33. Why language processing is hard
- There is a virtually infinite number of ways of expressing the same information
- E.g., different temporal terms
- Virtually all text contains some noise; this holds even more for spoken output
- It becomes particularly funny in the case of some instruction manuals
- "Large amounts of money should be kept on your derson. Other wise lockers are available"
- "After put on the costume Put this parts inside the outerwear"
- "This garment is selected with new materials for your daily best comfort, it's an original way of fashion Try it and you'll be enjoyed!"
- The same string can mean different things
- POS
- The Dilbert slide
34. Ambiguity and humor
35. References
- Jurafsky and Martin, chapters 3.1, 5.1, 5.2, 12.1 (first edition: 3.1, 8.1, 8.2, and 9.1)
- R. Huddleston, English Grammar: An Outline, Cambridge University Press, 1990.
- This presentation is based on slides prepared by Massimo Poesio.