Christopher Manning

Learn more at: http://nlp.stanford.edu



1
Christopher Manning
  • CS300 talk Fall 2000
  • manning@cs.stanford.edu
  • http://nlp.stanford.edu/manning/

2
Research areas of interest: NLP/CL
  • Statistical NLP models: combining linguistic and
    statistical sophistication
  • NLP and ML methods for extracting meaning
    relations from webpages, medical texts, etc.
  • Information extraction and text mining
  • Lexical and structural acquisition from raw text
  • Using robust NLP: dialect/style, readability,
  • Using pragmatics, genre, NLP in web searching
  • Computational lexicography and the visualization
    of linguistic information

3
Models for language
  • What is the motivation for statistical models for
    understanding language?
  • From the beginning, logics and logical reasoning
    were invented for handling natural language
    understanding
  • Logics have a language-like form that draws from
    and meshes well with natural languages
  • Where are the numbers?

4
Sophisticated grammars for NL
  • From NP → Det Adj N there developed precise and
    sophisticated grammar formalisms (such as LFG,
    HPSG)

5
The Problem of Ambiguity
  • Any broad-coverage grammar is hugely ambiguous
    (often hundreds of parses for 20-word
    sentences).
  • Making the grammar more comprehensive only makes
    the ambiguity problem worse.
  • Traditional (symbolic) NLP methods don't provide
    a solution.
  • Selectional restrictions fail because creative/
    metaphorical use of language is everywhere
  • I swallowed his story
  • The supernova swallowed up the planet

6
The problem of ambiguity close up
  • The post office will hold out discounts and
    service concessions as incentives.
  • 12 words. Real language. At least 83 parses.

8
Statistical NLP methods
  • P(to | Sarah drove)
  • P(time is verb | Time flies like an arrow)
  • P(NP → Det Adj N | mother = VP[drive])
  • Statistical NLP methods:
  • Estimate grammar parameters by gathering counts
    from texts or structured analyses of texts
  • Assign probabilities to various things to
    determine the likelihood of word sequences,
    sentence structure, and interpretation
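
A minimal sketch of the count-based estimation described on this slide: a conditional probability such as P(to | Sarah drove) can be estimated as a ratio of n-gram counts. The toy corpus and the function names are assumptions made for illustration; they are not from the talk.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, context, word):
    """Relative-frequency estimate of P(word | context)."""
    context = tuple(context)
    grams = ngram_counts(tokens, len(context) + 1)
    # Total number of times the context is followed by any word.
    seen = sum(c for gram, c in grams.items() if gram[:-1] == context)
    return grams[context + (word,)] / seen if seen else 0.0

# Hypothetical toy corpus, invented for this example.
corpus = "sarah drove to town and sarah drove to work but sarah drove home".split()
print(conditional_prob(corpus, ["sarah", "drove"], "to"))
```

In this toy corpus "sarah drove" is followed by "to" in two of its three occurrences, so the estimate is 2/3. A real system would also need smoothing for unseen events, which this sketch leaves out.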

9
Probabilistic Context-Free Grammars
  • NP → Det N       0.4
  • NP → NPposs N    0.1
  • NP → Pronoun     0.2
  • NP → NP PP       0.1
  • NP → N           0.2
  • Subtree: [NP [NP Det N] PP]
  • P(subtree above) = 0.1 × 0.4 = 0.04
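
The subtree probability on this slide is the product of the probabilities of the rules used. A small sketch of that computation follows; the rule probabilities are the ones on the slide, while the tuple encoding of trees and the function name `tree_prob` are assumptions for illustration.

```python
# Rule probabilities from the slide, keyed by (parent, right-hand side).
RULES = {
    ("NP", ("Det", "N")): 0.4,
    ("NP", ("NPposs", "N")): 0.1,
    ("NP", ("Pronoun",)): 0.2,
    ("NP", ("NP", "PP")): 0.1,
    ("NP", ("N",)): 0.2,
}

def tree_prob(tree, rules):
    """Multiply the probabilities of all rules used in a tree.

    A tree is a tuple (label, child, ...); a bare string is a node
    that is not expanded further and contributes a factor of 1.
    """
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rules)
    return p

# NP -> NP PP, where the inner NP -> Det N.
subtree = ("NP", ("NP", "Det", "N"), "PP")
print(tree_prob(subtree, RULES))  # approximately 0.04
```

The five NP rules sum to 1, as the expansions of a single category must in a PCFG.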
10
Why Probabilistic Grammars?
  • The predictions about grammaticality and
    ambiguity of categorical grammars are not in
    accord with human perceptions or engineering
    needs.
  • Categorical grammars aren't predictive
  • They don't tell us what sounds natural
  • Probabilistic grammars model error tolerance,
    online lexical acquisition, and have been
    amazingly successful as an engineering tool
  • They capture a lot of world knowledge for free
  • Relevant to linguistic change and variation, too!

11
Example: near
  • In Middle English, near was an adjective [Maling]
  • But, today, is it an adjective or a preposition?
  • The near side of the moon
  • We were near the station
  • Not just a word with multiple parts of speech!
    There is evidence of blending
  • We were nearer the bus stop than the train
  • He has never been nearer the center of the
    financial establishment

12
Research aim
  • Most current statistical models are quite simple
    (linguistically and also statistically)
  • Aim: to combine the good features of statistical
    NLP methods with the sophistication of rich
    linguistic analyses.

13
Lexicalising a CFG
[VP[looked]
  [V[looked] looked]
  [PP[inside]
    [P[inside] inside]
    [NP[box] [D[the] the] [N[box] box]]]]
  • A lexicalized CFG can capture probabilistic
    dependencies between words
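
A rough sketch of how the head-word annotations in this tree could be derived: each phrase copies the head word of one designated child. The `HEAD_CHILD` table, the tuple encoding of trees, and the function name are assumptions for illustration; the slide does not say how heads are chosen.

```python
# Hypothetical head table: which child category supplies the head word.
HEAD_CHILD = {"VP": "V", "PP": "P", "NP": "N"}

def lexicalize(tree):
    """Return (annotated_tree, head_word) for a (label, child, ...) tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]  # preterminal: the word is its own head
        return (f"{label}[{word}]", word), word
    done = [lexicalize(child) for child in children]
    # Take the head word from the child whose category is the designated head.
    head = next(w for (_, w), child in zip(done, children)
                if child[0] == HEAD_CHILD[label])
    return (f"{label}[{head}]", *(sub for sub, _ in done)), head

tree = ("VP", ("V", "looked"),
              ("PP", ("P", "inside"),
                     ("NP", ("D", "the"), ("N", "box"))))
annotated, head = lexicalize(tree)
print(annotated)
```

With head words on every node, a rule such as VP[looked] → V[looked] PP[inside] can carry a probability that depends on the word pair (looked, inside), which is the kind of dependency the slide refers to.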
14
Left-corner parsing
  • The memory requirements of standard parsers do
    not match human linguistic processing. What
    humans find hardest, center embedding, is really
    the bread-and-butter of standard CFG parsing
  • The man that the woman the priest met knows
    couldn't help
  • (((a b)))
  • As an alternative, left-corner parsing does
    capture this.

15
Parsing and (stack) complexity
  • She ruled that the contract between the union and
    company dictated that claims from both sides
    should be bargained over or arbitrated.

16
Tree geometry vs. stack depth
  • Maximum stack depth under top-down (TD),
    left-corner (LC), and bottom-up (BU) parsing:

    Sentence                                           TD  LC  BU
    Kim's friend's mother's car smells.                 5   1   1
    Kim thinks Sandy knows she likes green apples.      1   1   7
    The rat that the cat that Kim likes chased died.    3   3   7

17
Probabilistic Left-Corner Grammars
  • Use richer probabilistic conditioning
  • Left corner and goal category rather than just
    parent
  • P(NP → Det Adj N | Det, S)
  • Allow left-to-right online parsing (which
    can hope to explain how people build
    partial interpretations online)
  • Easy integration with lexicalization,
    part-of-speech tagging models, etc.

(Figure: tree with nodes S, NP, Det, Adj, N)
18
Probabilistic Head-driven Grammars
  • The heads of phrases are the source of the main
    constraining information about a sentence's
    structure
  • We work out from heads by following the
    dependency order of the sentence
  • The crucial property is that we have always built
    and have available to us for conditioning all
    governing heads and all less oblique dependents
    of the same head
  • We can also easily integrate phrase length

19
Information from the web: The problem
  • When people see web pages, they understand their
    meaning
  • By and large. To the extent that they don't,
    there's a gradual degradation
  • When computers see web pages, they get only
    character strings and HTML tags

20
The human view
21
The intelligent agent view
  • Ford Motor Company - Home Page
  • trucks, SUV, mazda, volvo, lincoln, mercury,
    jaguar, aston martin, ford"
  • Company corporate home page"
  • WIDTH=768
  • HREF="default.asp?pageid=473"
    onmouseover="logoOver('fordscript')rolloverText('ht0')"
    onmouseout="logoOut('fordscript')rolloverText('ht0')"
    pt.gif" ALT="Learn more about Ford Motor Company"
    WIDTH="521" HEIGHT="39"

22
The problem (cont.)
  • We'd like computers to see meanings as well, so
    that computer agents could more intelligently
    process the web
  • These desires have led to XML, RDF, agent markup
    languages, and a host of other proposals and
    technologies which attempt to impose more syntax
    and semantics on the web in order to make life
    easier for agents.

23
Thesis
  • The problem can't and won't be solved by
    mandating a universal semantics for the web
  • The solution is rather agents that can
    understand the human web by text and image
    processing

24
(1) The semantics
  • Are there adequate and adequately understood
    methods for marking up pages with such a
    consistent semantics, in such a way that it would
    support simple reasoning by agents?
  • No.

25
What are some AI people saying?
  • Anyone familiar with AI must realize that the
    study of knowledge representation, at least as it
    applies to the commonsense knowledge required
    for reading typical texts such as newspapers, is
    not going anywhere fast. This subfield of AI has
    become notorious for the production of countless
    non-monotonic logics and almost as many logics of
    knowledge and belief, and none of the work shows
    any obvious application to actual
    knowledge-representation problems. Indeed, the
    only person who has had the courage to actually
    try to create large knowledge bases full of
    commonsense knowledge, Doug Lenat, is believed
    by everyone save himself to be failing in his
    attempt. (Charniak 1993: xvii-xviii)

26
(2) Pragmatics not semantics
  • pragmatic: relating to matters of fact or
    practical affairs, often to the exclusion of
    intellectual or artistic matters
  • pragmatics (linguistics): concerned with the
    relationship of the meaning of sentences to their
    meaning in the environment in which they occur
  • A lot of the meaning in web pages (as in any
    communication) derives from the context: what is
    referred to in the philosophy of language
    tradition as pragmatics
  • Communication is situated

27
Pragmatics on the web
  • Information supplied is incomplete: humans will
    interpret it
  • Numbers are often missing units
  • A rubber band for sale at a stationery site is
    a very different item to a rubber band on a metal
    lathe
  • A sidelight means something different to a
    glazier than to a regular person
  • Humans will evaluate content using information
    about the site, and the style of writing
  • value filtering

28
(3) The world changes
  • The way in which business is being done is
    changing at an astounding rate
  • or at least that's what the ads from ebusiness
    companies scream at us
  • Semantic needs and usages evolve (like languages)
    more rapidly than standards (cf. the Académie
    française)
  • People use words that aren't in the dictionary.
  • Their listeners understand them.

29
(4) Interoperation
  • Ontology: a shared formal conceptualization of a
    particular domain
  • Meaning transfer frequently has to occur across
    the subcommunities that are currently designing
    ML languages, and then all the problems
    reappear, and the current proposals don't do much
    to help

30
Many products cross industries
  • http://www.interfilm-usa.com/Polyester.htm
  • Interfilm offers a complete range of SKC's
    Skyrol brand polyester films for use in a wide
    variety of packaging and industrial processes.
  • Gauges: 48 - 1400
  • Typical End Uses: Packaging, Electrical, Labels,
    Graphic Arts, Coating and Laminating
  • labels: milk jugs, beer/wine, combination forms,
    laminated coupons,

31
(5) Pain but no gain
  • A lot of the time people won't put in information
    according to standards for semantic/agent markup,
    even if they exist.
  • Three reasons:
  • Laziness: Only 0.3% of sites currently use the
    (simple) Dublin Core metadata standard.
  • Profits: Having an easily robot-crawlable site is
    a recipe for turning what you sell into a
    commodity, and hence making little profit
  • Cheats: There are people out there who will
    abuse any standard, if it's profitable

32
(6) Less structure to come
  • the convergence of voice and data is creating
    the next key interface between people and their
    technology. By 2003, an estimated $450 billion
    worth of e-commerce transactions will be
    voice-commanded.
  • Question: will these customers speak XML tags?
  • Intel ad, NYT, 28 Sep 2000
  • Data source: Forrester Research.

33
The connection to language
  • Decker et al., IEEE Internet Computing (2000):
  • The Web is the first widely exploited
    many-to-many data-interchange medium, and it
    poses new requirements for any exchange format
  • Universal expressive power
  • Syntactic interoperability
  • Semantic interoperability
  • But human languages have all these properties,
    and maintain superior expressivity and
    interoperability through their flexibility and
    context dependence

34
NLP and information access
  • Solution: use robust natural language processing
    and machine learning techniques
  • NLP comes into its own when you want to do more
    than just standard IR.
  • E.g., defined information needs over text
  • An apartment with 2 bedrooms in Menlo Park for
    less than $1,500.
  • Where was there an airline accident today?
  • What proteins is this gene known to regulate?

35
Example of extracting textual relations: Real
Estate Ads
  • System starts with plain text of ads
  • These are hardly exactly English
  • But an unstructured information source, close to
    English
  • Chosen as lowest common denominator
  • Output: database records
  • A variety of tables giving information about:
  • the property: bedrooms, garages, price
  • the real estate agency
  • inspection times

36
Real Estate Ads: Input
  • 2067206v1
  • March 02, 1998
  • MADDINGTON $89,000
  • OPEN 1.00 - 1.45
  • U 11 / 10 BERTRAM ST
  • NEW TO MARKET Beautiful
  • 3 brm freestanding
  • villa, close to shops & bus
  • Owner moved to Melbourne
  • ideally suit 1st home buyer,
  • investor 55 and over.
  • Brian Hazelden 0418 958 996
  • R WHITE LEEMING 9332 3477

37
Real Estate Ads: Output
  • Output is database tables
  • But the general idea, in slot-filler format:
  • SUBURB: MADDINGTON
  • ADDRESS: (11,10,BERTRAM,ST)
  • INSPECTION: (1.00,1.45,11/Nov/98)
  • BEDROOMS: 3
  • TYPE: HOUSE
  • AGENT: BRIAN HAZELDEN
  • BUS PHONE: 9332 3477
  • MOB PHONE: 0418 958 996
  • Manning & Whitelaw, U. Sydney 1998; in daily use
    at News Corp.
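
To make the slot-filler idea concrete, here is a minimal sketch that fills a few of these slots from an abridged copy of the input ad using hand-written regular expressions. The patterns, slot handling, and function name are assumptions for illustration only; the slides do not describe how the Manning and Whitelaw system actually extracted its fields.

```python
import re

# Abridged copy of the ad text as it appears in the transcript.
AD = """MADDINGTON 89,000
OPEN 1.00 - 1.45
U 11 / 10 BERTRAM ST
NEW TO MARKET Beautiful
3 brm freestanding
villa, close to shops bus
Brian Hazelden 0418 958 996
R WHITE LEEMING 9332 3477"""

# Hypothetical patterns, one per slot; each is tuned to this single ad.
PATTERNS = {
    "SUBURB": re.compile(r"^([A-Z]+) \$?[\d,]+$", re.MULTILINE),
    "INSPECTION": re.compile(r"OPEN (\d+\.\d+) - (\d+\.\d+)"),
    "BEDROOMS": re.compile(r"(\d+) brm"),
    "MOB PHONE": re.compile(r"\b(04\d{2} \d{3} \d{3})\b"),
    "BUS PHONE": re.compile(r"\b(\d{4} \d{4})\b"),
}

def fill_slots(ad):
    """Return a dict of slot name -> extracted value(s) for one ad."""
    slots = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ad)
        if match:
            groups = match.groups()
            slots[name] = groups[0] if len(groups) == 1 else groups
    return slots

slots = fill_slots(AD)
for name, value in slots.items():
    print(name, value)
```

Patterns like these are brittle: they say nothing about TYPE (the ad says "villa", the record says HOUSE) or about which name is the agent, which is part of why the talk argues that a little NLP is needed.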

40
One needs a little NLP
  • There is no semantic coding to use
  • Standard IR doesn't work
  • suburbs
  • the Paddington of the west
  • one hour's drive from Sydney
  • real estate agent
  • prices
  • recently sold for x. Was y now z. Rent.
  • bedrooms
  • multi-property ads

41
Text Segmentation
  • Real-estate ads have a hierarchical text
    structure!
  • SOUTHPORT UNIT SPECIALS
  • 58,900 o.n.o. 2 brm close to water and shops.
  • 114,000 "Grandview", excellent value, good
    returns
  • LJ Coleman Real Estate
  • Contact Steve 5527 0572
  • GLEBE 2br yd 250 4br yd 430
  • COOGEE 3br yd 320 1br 150
  • BALMAIN 1br 180
  • H.R. Licensed FEE 9516-3211
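
A small sketch of what segmenting a multi-property ad might involve, using the last four lines of the example above: each suburb line is split into one record per property. The regular expressions, the reading of "yd" as an optional token, and the neutral field name `amount` for the trailing number are all assumptions; the slide does not say what these fields mean or how the real system segmented ads.

```python
import re

AD = """GLEBE 2br yd 250 4br yd 430
COOGEE 3br yd 320 1br 150
BALMAIN 1br 180
H.R. Licensed FEE 9516-3211"""

LINE = re.compile(r"^([A-Z]+) (.+)$")          # SUBURB followed by listings
ITEM = re.compile(r"(\d+)br(?: yd)? (\d+)")    # e.g. "2br yd 250" or "1br 150"

def segment(ad):
    """Split a multi-property ad into one record per listed property."""
    records = []
    for line in ad.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # e.g. the agency line, which has no leading suburb
        suburb, rest = match.groups()
        for bedrooms, amount in ITEM.findall(rest):
            records.append({"suburb": suburb,
                            "bedrooms": int(bedrooms),
                            "amount": int(amount)})
    return records

records = segment(AD)
for record in records:
    print(record)
```

The agency line applies to all five properties, so a fuller treatment would attach it to every record; that hierarchical sharing is what the slide's point about text structure is about.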

42
The End