1
Parsing
  • Jan Odijk
  • TST-Evaluation

2
Overview
  • What is Parsing
  • What is Parsing used for
  • Parser Efficiency
  • Grammar Types
  • Parsing for CFG
  • Parsers for Dutch
  • Why is Parsing Difficult?
  • Evaluating Parsers

3
Parsing (Broad)
  • Grammar defines
  • strings and
  • their characterization
  • structure
  • meaning (sometimes)
  • degree of deviance (often bifurcation)
  • often a separate lexicon is distinguished
  • Parser (analyzer, analysis module, ...)
  • uses a grammar (and lexicon)
  • string → string characterization(s)
  • effective procedure (algorithm)
  • as efficient as possible

4
Parsing (narrow)
  • Grammar defines
  • well-formed strings
  • and their syntactic characterization
  • Parser
  • uses a grammar (syntax)
  • string → syntactic characterization(s)
  • non-empty set if string is well-formed
  • empty set if string is ill-formed
  • usually an extension of a recognizer
  • characterization of ill-formed input requires
    robustness

5
String Characterization
  • General on all relevant levels of grammar
  • orthography/phonology
  • morphology
  • syntax
  • semantics
  • Orthography
  • pannenkoek is a Dutch word in new spelling;
    pannekoek is in old spelling
  • voorbeled is not a word of Dutch, probably a
    misspelling for voorbeeld
  • Morphology
  • liep is (in Dutch) a past tense singular form of
    the verb lopen
  • blakte is not a word of Dutch, possibly a past
    tense singular form of an unknown verb blakken
  • Semantics
  • the meaning of the man walked away is
    (simplifying) ∃x man(x) ∧ walk-away(x)
  • the meaning of green ideas is not well-defined;
    maybe green has a metaphorical meaning

6
String Characterization
  • Syntax
  • phrase structure
  • the man is a phrase in the man walked away
  • man walked is not a phrase in this sentence
  • ...
  • syntactic categories
  • labels associated with words and phrases
  • Noun, Verb, Noun Phrase, Clause, ...
  • grammatical relations
  • labels associated with a (superordinate,
    subordinate) phrase pair
  • subject, object, head, modifier, ...

7
String Characterization
  • Syntax
  • the man
  • is a well-formed singular definite noun phrase in
    English
  • man the
  • is not a well-formed phrase or sentence in
    English
  • is a sequence of English words (noun, def.
    article)
  • ik het boek kocht
  • is not a well-formed full utterance of Dutch
  • but can be analyzed as a subordinate clause
    lacking a subordinate conjunction (cf. toen ik
    het boek kocht)
  • on videl knigu
  • is not a well-formed sentence of Dutch (or
    English)
  • can only be analyzed as a sequence of unknown
    words
  • is possibly a well-formed sentence of a different
    language

8
(Syntax) Parser
  • restricted to syntax
  • syntax defines sequences of
  • occurrences of word forms with their
    morphological characterization
  • and their syntactic characterization
  • example
  • ART[def, stem=the]
  • Noun[sg, stem=man]
  • Verb[past tense, stem=walk]
  • Part[stem=away]

9
Parser Use
  • Machine Translation
  • Question-Answering
  • Information Extraction
  • Speech Recognition
  • Veldhuijzen van Zanten et al. 2000
  • Speech Synthesis
  • Grammar checking

10
Parser Efficiency
  • Parser should use
  • as little as possible processing time
  • as little as possible space (memory)
  • Efficiency studied in Complexity Theory
  • computation steps / memory cells needed to
    obtain a result in the worst case scenario
  • Typically as a function
  • of the length of the input sentence
  • and of the size of the grammar

11
Parser Efficiency
  • usually represented by O(...)
  • e.g. O(n) where n is the input sentence length
  • meaning a certain constant factor c times n
  • so O(n) ≈ c·n
  • Examples of complexity
  • O(n) for a constant: linear
  • O(n^k) for a constant k: polynomial
  • O(k^n) for a constant k: exponential

12
Grammars
  • G = (VT, VN, P, S)
  • VT: a set of terminal symbols
  • VN: a set of non-terminal symbols
  • VT ∪ VN = V
  • P: a set of rules of the form
  • α → β, where α, β are sequences over V
  • S ∈ VN, the start symbol
  • non-terminals: A, B, ...
  • terminals: a, b, ...
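The 4-tuple above can be written down directly; a minimal Python sketch, with a toy grammar invented purely for illustration (all symbol names are assumptions, not from the slides):

```python
# G = (VT, VN, P, S): a toy grammar, invented for illustration.
VT = {"the", "man", "walked", "away"}             # terminal symbols
VN = {"S", "NP", "VP", "Det", "N", "V", "Part"}   # non-terminal symbols
P = {                                             # rules alpha -> beta, here
    "S":    [["NP", "VP"]],                       # restricted to a single
    "NP":   [["Det", "N"]],                       # non-terminal on the
    "VP":   [["V", "Part"]],                      # left-hand side
    "Det":  [["the"]],
    "N":    [["man"]],
    "V":    [["walked"]],
    "Part": [["away"]],
}
S = "S"                                           # start symbol, S ∈ VN
G = (VT, VN, P, S)
```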

13
Grammar Types
  • Regular (Type-3) Grammars
  • A → a, A → Ba (left-regular)
  • A → a, A → aB (right-regular)
  • Parsers easy, effective, efficient
    (deterministic)
  • based on (deterministic) finite state automata
  • not suited for natural language in toto
  • cannot characterize natural language adequately
  • can be (and is) used for restricted fragments
  • related (but more powerful) finite state
    transducers are often used for morphology
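The efficiency claim above can be made concrete: a deterministic finite-state automaton does one table lookup per input symbol, so recognition is linear. A minimal sketch (the automaton for the language (ab)+ is an invented example):

```python
def dfa_accepts(transitions, start, finals, symbols):
    """Run a deterministic finite-state automaton over the input.

    One table lookup per symbol, so recognition is O(n)."""
    state = start
    for sym in symbols:
        if (state, sym) not in transitions:
            return False          # no transition defined: reject immediately
        state = transitions[(state, sym)]
    return state in finals

# Right-regular rules A -> aB, A -> a map directly onto DFA transitions.
# Toy automaton for the language (ab)+:
trans = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
print(dfa_accepts(trans, 0, {2}, "abab"))  # True
print(dfa_accepts(trans, 0, {2}, "aba"))   # False
```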

14
Grammar Types
  • Context-Free (Type-2) Grammars
  • A → a, A → BC (Chomsky Normal Form)
  • A → a, A → β (β a non-empty sequence of terminals
    and non-terminals)
  • Parsers
  • various types of parsers exist
  • reasonably efficient
  • not suited for natural language in toto
  • cannot characterize natural language adequately
    (in theory)
  • it is very difficult to write natural language
    grammars with CFGs (in practice)
  • but often used in combination with other
    techniques

15
Grammar Types
  • Mildly Context-Sensitive Grammars (Weir 1988)
  • used in a few formalisms for NLP
  • Tree-Adjoining Grammars (TAG)
  • Schabes & Joshi 1998; http://www.cis.upenn.edu/xtag/
  • Head Grammars (HG)
  • Bach 1979; Pollard 1984
  • Combinatory Categorial Grammars (CCG)
  • http://www.iccs.informatics.ed.ac.uk/stephenc/ccgparsing.html
  • Steedman & Baldridge 2003
  • Linear Indexed Grammars (LIG)
  • Gazdar 1988
  • Parsers
  • various types of parsers exist
  • Vijay-Shanker & Weir 1994
  • less efficient than for CFG but still reasonable:
    O(n^6)

16
Grammar Types
  • Context-Sensitive Grammars and Type-0 Grammars
  • not discussed here
  • seldom used for NLP directly
  • if used, usually via extensions of a CFG

17
Attribute-Value (A-V) Pairs
  • Labels for terminals and nonterminals extended
    with A-V Pairs
  • A-V pair
  • attribute (identifier)
  • number, gender, person, ....
  • value
  • finite: possible values enumerated (an enumerated
    type)
  • sg, pl; m, f, n; 1, 2, 3, ...
  • or an infinite set of possible values

18
Attribute-Value (A-V)-Pairs
  • finitely valued A-V pairs used in almost all
    frameworks
  • Rosetta, AGFL, Amazon, ...
  • infinitely valued A-V pairs
  • LFG (f-structures)
  • Kaplan & Bresnan 1982
  • HPSG (A-V matrices (AVMs) for signs)
  • Pollard & Sag 1994
  • Alpino
  • ... (many others)

19
Attribute-Value (A-V)-Pairs
  • Reentrancy (token-identity)
  • one part of the structure is token-identical to
    another part
  • Formally represented in Directed Acyclic Graphs
    (DAGs)
  • often represented by indexes (1,2,..)
  • hij wil zwemmen
  • hij is subject of wil and subject of zwemmen
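Token-identity can be mimicked with shared object references; a sketch of the hij wil zwemmen example using plain dictionaries (the attribute names are assumptions for illustration):

```python
# Reentrancy: the subject of 'wil' and of 'zwemmen' is the SAME node,
# as in a DAG; often written with an index such as [1].
subj = {"stem": "hij", "person": 3, "number": "sg"}   # node [1]
wil = {"stem": "willen", "subject": subj}
zwemmen = {"stem": "zwemmen", "subject": subj}

# Token-identity (is), not mere structural equality (==):
assert wil["subject"] is zwemmen["subject"]

# Updating the shared node is visible from both structures:
subj["case"] = "nom"
print(zwemmen["subject"]["case"])  # nom
```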

20
Parsing CFG
  • Top-Down (goal-directed)
  • Bottom-up (data-directed)
  • Dynamic Programming based

21
Top-Down
  • Expand S, using rules until leaves cover the
    string
  • Advantage
  • no waste in exploring trees that cannot lead to S
  • Disadvantage
  • wastes time on S-rooted trees that cannot cover
    the string
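A minimal top-down recognizer makes the strategy concrete: expand goals left to right, starting from S, until the leaves cover the string. The toy grammar is invented for illustration; note that a left-recursive rule would make this naive expansion loop forever:

```python
# Top-down (goal-directed) recognition for a toy CFG.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["the", "man"]],
    "VP": [["walked", "away"], ["walked"]],
}

def expand(goals, words):
    if not goals:
        return not words                     # success iff input is consumed
    first, rest = goals[0], goals[1:]
    if first in RULES:                       # non-terminal: try each rule
        return any(expand(rhs + rest, words) for rhs in RULES[first])
    # terminal: must match the next input word
    return bool(words) and words[0] == first and expand(rest, words[1:])

print(expand(["S"], ["the", "man", "walked", "away"]))  # True
```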

22
Bottom-up
  • starting at the leaves, apply rules to build ever
    larger trees up to an S-rooted tree
  • Advantage
  • no wasting time on S-rooted trees that cannot
    cover the string
  • Disadvantage
  • waste in exploring trees that cannot lead to S

23
Mixed TD/BU
  • Top-down prediction
  • guided by bottom-up filtering
  • taking into account the left-corner
  • current input terminal must be the first terminal
    along the left edge of a derivation

24
Still problems
  • Left-recursion leads to infinitely deep paths
  • not effective (NP → NP PP, NP → N)
  • Ambiguity not handled efficiently
  • real ambiguity and local ambiguity
  • representation of ambiguity exponential
  • Repeated Parsing of Subtrees
  • not efficient
  • Exponential time-complexity

25
Solution
  • Dynamic Programming
  • Store subresults in tables and retrieve these
    when needed (no recomputation)---memoization
  • add predictions only when not already there
    (avoids infinite recursion)
  • no explicit calculation/storage of all parses
  • can be obtained when needed
  • and only those that are needed
  • Examples
  • Earley parser (chart parser) (Earley 1970)
  • CKY (Cocke-Younger-Kasami)
  • Younger 1967, Kasami 1965, Aho & Ullman 1972
  • GHR (Graham-Harrison-Ruzzo 1980)
  • functional parsing using memoization (Leermakers
    1993)
  • Polynomial time-complexity
  • O(n^3) (n = number of input words)
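The CKY algorithm illustrates the table-filling idea in a few lines. A sketch of a recognizer (the toy grammar is invented for illustration and must be in Chomsky Normal Form):

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognizer for a CNF grammar: O(n^3) in the input length.

    lexical: {terminal: {A, ...}} for rules A -> a
    binary:  {(B, C): {A, ...}}  for rules A -> B C"""
    n = len(words)
    table = defaultdict(set)          # table[(i, j)]: categories over words[i:j]
    for i, w in enumerate(words):
        table[(i, i + 1)] |= lexical.get(w, set())
    for span in range(2, n + 1):      # ever larger spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # every split point
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= binary.get((B, C), set())
    return start in table[(0, n)]

lex = {"the": {"Det"}, "man": {"N"}, "walked": {"VP"}}
bins = {("Det", "N"): {"NP"}, ("NP", "VP"): {"S"}}
print(cky_recognize(["the", "man", "walked"], lex, bins))  # True
```

Each subresult is stored once in the table and never recomputed, which is exactly the memoization idea on this slide.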

26
Parsers for Dutch
  • Alpino (Groningen)
  • http://ziu.let.rug.nl/vannoord_bin/alpino
  • Rosetta (Philips Res. Labs, Utrecht)
  • Jan Landsbergen, Jan Odijk; see M.T. Rosetta
    (1994)
  • no running version available at this moment
  • Amazon (Nijmegen)
  • http://lands.let.kun.nl/amazon/
  • Delilah
  • Crit Cremers (Leiden)
  • http://www.let.leidenuniv.nl/ulcl/faculty/cremers/
  • AG-grammars (Nijmegen) (under construction)
  • http://www.cs.kun.nl/agfl/
  • based on the Amazon grammar
  • Data-Oriented Parsing (DOP), Amsterdam
  • Bod & Scha 1996; Bod et al. 2003
  • CORRie, specifically for spelling and grammar
    checking
  • Vosse, T. (1994)

27
Problems for parsing
  • Ambiguity
  • Unknown words
  • Names
  • Multiword Expressions (MWEs)
  • Fragments, incomplete utterances, ill-formed
    utterances
  • Symbols

28
Ambiguity
  • Parsing resolves (some) lexical ambiguities
  • hij heeft boeken gelezen
  • heeft main or auxiliary verb
  • boeken verb or noun
  • But introduces new structural ambiguities
  • most of which cannot be resolved on the basis of
    the grammar alone

29
Structural Ambiguity
  • Attachment ambiguity
  • zag (de man met de verrekijker) v.
  • zag (de man) (met de verrekijker)
  • zag de man met de verrekijker in de tuin van de
    buurman

30
Structural Ambiguity
  • Coordination ambiguity
  • (oude mannen) en vrouwen v.
  • (oude (mannen en vrouwen))

31
Structural Ambiguity
  • Argument/Adjunct ambiguity
  • Hij heeft boeken gelezen v.
  • Hij heeft uren gelezen
  • Hij wacht op het antwoord v.
  • Hij wacht op het station

32
Structural Ambiguity
  • NP bracketing ambiguity
  • can you book (TWA flights) v.
  • can you book (TWA) (flights)
  • ik gaf (die mooie boeken) v.
  • ik gaf (die) (mooie boeken) v.
  • ik gaf (die mooie) boeken

33
Structural Ambiguity
  • Topicalization ambiguity
  • het meisje heeft de jongen niet gezien
  • het meisje subj or obj

34
Catalan numbers
  • C_n = (1/(n+1)) · (2n over n)
  • (n over k) = n!/(k!(n-k)!)
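The number of binary bracketings of a sentence grows with the Catalan numbers, which is why structural ambiguity explodes with sentence length. A small computation of the formula above:

```python
from math import comb

def catalan(n):
    """C_n = (1/(n+1)) * C(2n, n); comb(2n, n) is the binomial (2n over n)."""
    return comb(2 * n, n) // (n + 1)

# The count of bracketings explodes with length:
print([catalan(n) for n in range(8)])  # [1, 1, 2, 5, 14, 42, 132, 429]
```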

35
Resolving Structural Ambiguity
  • Add semantic properties and use them
  • Add knowledge of the world / situation / text
    type
  • Heuristics
  • Statistics

36
Semantic properties
  • Semantic types and semantic selection
    restrictions
  • eten: <su: animate, ob: edible>
  • man, vrouw, hond, ...: animate
  • hamburger, sla, tomaat: edible (not animate)
  • uur, vergadering: not edible
  • de man heeft uren gegeten
  • sla eet de man niet graag

37
Semantic properties
  • Problems
  • very difficult to determine type system, and to
    apply it systematically
  • many deviations by metaphors, meaning extensions
  • de auto zuipt benzine
  • zijn uitgaven hebben mijn budget opgegeten
  • a lot of manual work
  • usually used only in a minimal form

38
World / Situation / Knowledge
  • Hardly used in actual systems
  • how to model world knowledge?
  • how to use it?

39
Heuristics
  • Prefer complementation over modification
  • Prefer subject topicalization over object
    topicalization
  • Disprefer certain rules (coordination without
    explicit coordinator)
  • Prefer certain readings of a word over other
    readings
  • Prefer certain guesses for unknown words over
    others

40
Statistics
  • Use a large Tree Bank to determine statistical
    properties of what occurs more often
  • use the statistics to assign scores to parses
  • e.g.
  • determine relative frequency of sla occurring as
    a noun head of the object of the verb eten in a
    given Tree Bank (etc., for all such relations in
    the Tree Bank)
  • use this as the probability
  • multiply the probabilities of all such relations
    in parse tree to obtain a score for the parse
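A sketch of this scoring scheme, with invented counts and a smoothing constant added (the smoothing is my assumption, to keep unseen triples from zeroing out the whole product):

```python
from math import prod

# Invented treebank counts for <head, relation, dependent> triples:
triple_counts = {("eten", "obj", "sla"): 12}
relation_totals = {"obj": 100}

def triple_prob(triple, smoothing=0.5):
    # relative frequency in the treebank, smoothed so that
    # unseen triples score low but non-zero
    count = triple_counts.get(triple, 0) + smoothing
    return count / (relation_totals[triple[1]] + smoothing)

def parse_score(triples):
    # multiply the probabilities of all relations in the parse tree
    return prod(triple_prob(t) for t in triples)

print(parse_score([("eten", "obj", "sla")]) >
      parse_score([("eten", "obj", "uur")]))  # True: attested beats unseen
```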

41
Statistics
  • Such a score can usually NOT be computed for all
    parses of a sentence (not efficient enough)
  • scores only computed for a subset
  • compute scores for subtrees
  • select b best-scoring subtrees
  • use only these in computing scores for supertree
  • beam search
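Beam search in this setting amounts to repeatedly keeping only the b best-scoring partial analyses; a minimal sketch with invented scores:

```python
def beam_prune(candidates, score, b=3):
    """Keep only the b best-scoring partial parses (the beam)."""
    return sorted(candidates, key=score, reverse=True)[:b]

# Invented partial parses and scores:
scores = {"p1": 0.1, "p2": 0.9, "p3": 0.4, "p4": 0.7, "p5": 0.2}
kept = beam_prune(list(scores), scores.get, b=3)
print(kept)  # ['p2', 'p4', 'p3']
```

Only the surviving subtrees are used when scoring the supertree, so scores are never computed for all parses.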

42
Statistics
  • Advantages
  • automatically obtainable from a large Tree Bank
  • implicitly encodes syntactic, semantic, world
    knowledge, lexical preferences, etc.
  • may differ per domain / text type can be adapted
    if Tree banks are available for these

43
Statistics
  • Disadvantage
  • requires large Tree Bank
  • only small ones are available
  • implicit encoding does not advance our knowledge
    on syntax, semantics, world knowledge, lexical
    preferences, etc.
  • Way out (not ideal)
  • unsupervised learning
  • automatically make a treebank with an initial
    parse system (e.g. using heuristics only)
  • gather statistics on this automatically created
    Tree Bank

44
Unknown words
  • Each lexicon is incomplete
  • requires strategies to deal with unknown words
  • Compounding very productive in some languages
  • Dutch, German, Norwegian, ...
  • decompounding modules
  • xy is a word with morphosyntactic properties of y
    (the head) iff x is a word and y is a word
  • problems
  • x, y sometimes occur in a modified form
  • difficult to prevent wrong analyses, especially
    if x, y are short
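The decompounding rule above (xy is a word with the head properties of y iff x and y are both words) can be sketched directly. The tiny lexicon and the minimum-length guard against spurious short splits are assumptions for illustration:

```python
# Toy lexicon mapping word -> category (invented for illustration).
LEXICON = {"pannen": "N", "koek": "N", "pan": "N"}

def decompound(word, min_len=3):
    """Yield (x, y, properties-of-y) for each split xy with x, y in the lexicon.

    min_len guards against wrong analyses with very short parts."""
    for i in range(min_len, len(word) - min_len + 1):
        x, y = word[:i], word[i:]
        if x in LEXICON and y in LEXICON:
            yield (x, y, LEXICON[y])   # morphosyntactic head is y

print(list(decompound("pannenkoek")))  # [('pannen', 'koek', 'N')]
```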

45
Unknown words
  • Productive affixation
  • morpho-syntactic properties often determined by
    affix (suffix)
  • guess morphosyntactic properties on the basis of
    the affix
  • other unknown words
  • guess morphosyntactic properties using
  • statistics of morpho-syntax
  • direct context

46
Names
  • Name recognition
  • Names are often multiword expressions
  • Jan Odijk, Gertjan van Noord, Oog in Al
  • Identification
  • sometimes relatively simple using capitalization
  • but problematic for German, sentence-initial words, headings
  • use direct context
  • minister-president Kok, Herr Kohl

47
Multi-Word Expressions
  • Fixed
  • ad hoc, stante pede, op en top, compounds in
    English (plurals), kant en klaar (kant en
    klare)
  • Usually deviant or undefinable syntax
  • easy to deal with by treating them as single
    words
  • Semi-flexible
  • bijvoeglijk naamwoord bijvoeglijke naamwoorden
  • heer des huizes (heren des huizes), mother-in-law
    (mothers-in-law)

48
Multi-Word Expressions
  • Flexible
  • not difficult to parse
  • but difficult to identify as a multiword
    expression
  • de plaat poetsen, iemand ter verantwoording
    roepen, in de gaten houden
  • Syntax
  • deviant
  • lacking determiners: op schoot, lose face, ...
  • words not used elsewhere: brui
  • different complementation: iemand op de vingers
    tikken, ...
  • unproductive
  • ten, ter, e-form: ten voeten uit, ter meerdere
    eer en glorie van, ten tijde van
  • des, genitives: de heer des huizes
  • inalienable possession: zich het hoofd breken
    over, iemand boven het hoofd hangen

49
Fragments, Ill-formed input
  • Not only sentences but also
  • phrases (NP, AP, etc.)
  • ellipsis, listings
  • partial sentences
  • telegraphic style in headings
  • Ill-formed input
  • relative to the grammar
  • of the language dealt with especially in spoken
    language
  • especially in output of speech recognizer

50
Symbols
  • Not always used correctly
  • Imperative sentences without !
  • Question sentences without ?
  • commas lacking or too many
  • Interspersed punctuation symbols
  • examples
  • runs of dots, dashes, and brackets interspersed
    in the text: ...(...)..., ....-..., blablabla ...., etc.
  • symbols used for lay-out (-, bullets,
    ---------------------, etc.)
  • often with errors
  • lacking closing brackets, quotes, etc.
  • Special symbols
  • currency symbols, etc.
  • Spelling variations and errors
  • old v. new
  • standard v. alternative
  • typos
  • lacking diacritics, too many diacritics

51
Evaluating Parsers
  • Qualitative evaluation
  • what kind of syntactic structures does it yield
  • which information is contained in these
  • How does it deal with typical problems for
    parsers
  • Different kinds of ambiguity
  • Efficiency considerations
  • Robustness aspects
  • What applications can it be used for /is it used
    for
  • is it suited to the application in which it is
    used?

52
Evaluating Parsers
  • Quantitative evaluation
  • Test Suite (e.g. Flickinger et al. 1987)
  • list of well-formed and ill-formed example
    sentences illustrating constructions (with
    parses)
  • Pro/Cons
  • focused attention on specific constructions
  • no coverage information
  • no really occurring data
  • core/periphery intuitively defined
  • interaction of constructions difficult to
    guarantee

53
Evaluating Parsers
  • Quantitative evaluation
  • (independent) Corpus
  • no focus on specific constructions
  • based on really occurring data
  • coverage taken into account
  • correctness of parses unknown

54
Evaluating Parsers
  • Quantitative evaluation
  • (independent) Corpus Tree Bank
  • based on really occurring data
  • coverage taken into account
  • correctness of parses can be determined
  • no focus on specific constructions

55
Evaluating Parsers
  • How to compare Syntax trees?
  • on full trees
  • POS-assignment correctness
  • on local trees contained in full trees
  • other tree similarity definitions
  • on <head-word, relation, dependent head-word>
    triples

56
Evaluating Parsers
  • on full trees
  • too crude: yes/no, while there may be 90%
    identity
  • full identity rare given different
    frameworks/implementations
  • Correct PoS-assignment
  • measures only one aspect
  • inapplicable to parsers that start from a
    PoS-tagged sentence

57
Evaluating Parsers
  • On local trees contained in full trees
  • Treebank and grammar must be 100% compatible
  • too many less relevant details disturb comparison
  • exact nature of tree structure, single branching
  • Other tree similarity measures
  • correctly used rules / all rules in the
    reference tree
  • but how important are the missed rules?
  • bracketing mismatches, ignoring labels
  • easier compatibility with treebank
  • some treebanks (Penn) have few brackets
  • scores overestimate accuracy

58
Evaluating Parsers
  • on <head-word, relation, dependent head-word>
    triples
  • abstracts from less relevant details of parse
    trees
  • concentrates on the essential parts of the parse
  • not applicable to all parsers
  • may miss important syntactic differences (e.g.
    Dutch extraposition v. verb clusters)

59
Evaluating Parsers
  • Recall
  • matching triples / triples in the reference
    parse
  • Precision
  • matching triples / triples in the parse to be
    evaluated
  • F-score
  • 2·Prec·Rec / (Prec + Rec) if Prec + Rec ≠ 0
  • 0 if Prec + Rec = 0
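The recall, precision, and F-score definitions above, computed over sets of dependency triples (the example triples are invented for illustration):

```python
def evaluate(parse, reference):
    """Precision, recall, and F-score over sets of dependency triples."""
    matching = len(parse & reference)
    precision = matching / len(parse) if parse else 0.0
    recall = matching / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

ref = {("zag", "obj", "man"), ("zag", "mod", "verrekijker")}
hyp = {("zag", "obj", "man"), ("man", "mod", "verrekijker")}
print(evaluate(hyp, ref))  # (0.5, 0.5, 0.5)
```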

60
Evaluating Parsers
  • Recall = |P ∩ R| / |R|
  • Precision = |P ∩ R| / |P|
  • (P: triples in the parse, R: triples in the
    reference parse)
61
Treebanks for Dutch
  • Corpus Gesproken Nederlands
  • Van der Wouden et al. 2002
  • Alpino Treebank
  • Van der Beek et al. 2002
  • http://odur.let.rug.nl/vannoord/trees/
