Title: Linguistics 239E Week 9
1 Linguistics 239E Week 9
Generation, Evaluation, and Testing
- Ron Kaplan and Tracy King
2 Issues from HW8
- How to keep punctuation from being TOKENS
  FRAGMENT --> { PUNCT
               | NP @FIRST-EQ
               | S @FIRST-EQ
               | TOKEN @FIRST-EQ }
               PUNCT
               (REST).
3 Sample c-structure
4 Sample f-structure
5 Generation
- Parsing: string to analysis
- Generation: analysis to string
- What type of input?
- How to generate?
6 Why generate?
- Machine translation
- Lang1 string -> Lang1 fstr -> Lang2 fstr -> Lang2 string
- Sentence condensation
- Long string -> fstr -> smaller fstr -> new string
- Question answering
- Production of NL reports
- State of machine or process
- Explanation of logical deduction
- Grammar debugging
7 F-structures as input
- Use f-structures as input to the generator
- May parse sentences that shouldn't be generated
- May want to constrain the number of generated options
- Input f-structure may be underspecified
8 XLE generator
- Use the same grammar for parsing and generation
- Advantages
- maintainability
- write rules and lexicons once
- But
- special generation tokenizer
- different OT ranking
9 Generation tokenizer
- White space
- Parsing: multiple white space becomes a single TB
- John appears. -> John TB appears TB . TB
- Generation: a single TB becomes a single space (or nothing)
- John TB appears TB . TB -> John appears. / John appears .
10 Generation tokenizer
- Capitalization
- Parsing: optionally decapitalize initially
- They came -> they came
- Mary came -> Mary came
- Generation: always capitalize initially
- they came -> They came (not they came)
- May regularize other options
- quotes, dashes, etc.
11 Generation morphology
- Suppress variant forms
- Parse both favor and favour
- Generate only one
12 Morphconfig for parsing and generation
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
P!eng.tok.parse.fst G!eng.tok.gen.fst
ANALYZE:
eng.infl-morph.fst G!amerbritfilter.fst G!amergen.fst
----
13 Reversing the parsing grammar
- The parsing grammar can be used directly as a generator
- Adapt the grammar with a special OT ranking: GENOPTIMALITYORDER
- Why do this? The parsing grammar may:
- parse ungrammatical input
- have too many options
14 Ungrammatical input
- Linguistically ungrammatical
- They walks.
- They ate banana.
- Stylistically ungrammatical
- No ending punctuation: They appear
- Superfluous commas: John, and Mary appear.
- Shallow markup: [NP John and Mary] appear.
15 Too many options
- All the generated options can be linguistically valid, but too many for applications
- Occurs when more than one string has the same, legitimate f-structure
- PP placement
- In the morning I left. I left in the morning.
16 Using the Gen OT ranking
- Generally much simpler than in the parsing direction
- Usually only use standard marks and NOGOOD
- no + marks, no STOPPOINT
- Can have a few marks that are shared by several constructions
- one or two for dispreferred
- one or two for preferred
17 Example: Comma in coordination
  COORD(_CAT) --> _CAT @CONJUNCT
                  (COMMA @(OT-MARK GenBadPunct))
                  CONJ
                  _CAT @CONJUNCT.
  GENOPTIMALITYORDER GenBadPunct NOGOOD.
- parse: They appear, and disappear.
- generate without OT: They appear(,) and disappear.
- generate with OT: They appear and disappear.
18 Example: Prefer initial PP
  S --> (PP @ADJUNCT @(OT-MARK GenGood))
        NP @SUBJ
        VP.
  VP --> V
         (NP @OBJ)
         (PP @ADJUNCT).
  GENOPTIMALITYORDER NOGOOD GenGood.
- parse: they appear in the morning.
- generate without OT: In the morning they appear. / They appear in the morning.
- generate with OT: In the morning they appear.
19 Generation commands
- XLE command line
- regenerate "They appear."
- generate-from-file my-file.pl
- (regenerate-from-directory, regenerate-testfile)
- F-structure window
- commands: generate from this fs
- Debugging commands
- regenerate-morphemes
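As a rough sketch, a session using these commands might look like this (the grammar and file names are placeholders; create-generator is the command introduced on slide 27):
  create-generator english.lfg
  regenerate "They appear."
  generate-from-file my-file.pl
  regenerate-testfile testfile.lfg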
20 Debugging the generator
- When generating from an f-structure produced by the same grammar, XLE should always generate
- Unless
- OT marks block the only possible string
- something is wrong with the tokenizer/morphology
- regenerate-morphemes: if this gets a string, the tokenizer/morphology is not the problem
- Very hard to debug; the newest XLE has robustness features to help
21 Underspecified Input
- F-structures provided by applications are not perfect
- may be missing features
- may have extra features
- may simply not match the grammar coverage
- Missing and extra features are often systematic
- specify in XLE which features can be added and deleted
- Not matching the grammar is a more serious problem
22 Adding features
- English to French translation
- English nouns have no gender
- French nouns need gender
- Solution: have XLE add gender
- the French morphology will control the value
- Specify additions in xlerc
- set-gen-adds add "GEND"
- can add multiple features
- set-gen-adds add "GEND CASE PCASE"
- XLE will optionally insert the feature
- Note: unconstrained additions make generation undecidable
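As a sketch of the xlerc setup for this English-to-French scenario (the French grammar file french.lfg and the f-structure file english-to-french.pl are hypothetical names; the commands are the ones shown on this slide and on slide 19):
  create-generator french.lfg
  set-gen-adds add "GEND"
  generate-from-file english-to-french.pl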
23 Example
- The cat sleeps. -> Le chat dort.
- Input f-structure:
    PRED 'dormir<SUBJ>'
    SUBJ [ PRED 'chat'
           NUM sg
           SPEC def ]
    TENSE present
- With GEND added:
    PRED 'dormir<SUBJ>'
    SUBJ [ PRED 'chat'
           NUM sg
           GEND masc
           SPEC def ]
    TENSE present
24 Deleting features
- French to English translation
- delete the GEND feature
- Specify deletions in xlerc
- set-gen-adds remove "GEND"
- can remove multiple features
- set-gen-adds remove "GEND CASE PCASE"
- XLE obligatorily removes the features
- no GEND feature will remain in the f-structure
- if a feature takes an f-structure value, that
f-structure is also removed
25 Changing values
- If values of a feature do not match between the input f-structure and the grammar
- delete the feature and then add it
- Example: case assignment in translation
- Example case assignment in translation
- set-gen-adds remove "CASE"
- set-gen-adds add "CASE"
- allows dative case in input to become accusative
- e.g., exceptional case marking verb in input
language but regular case in output language
26 Creating Paradigms
- Deleting and adding features within one grammar can produce paradigms
- Specifiers
- set-gen-adds remove "SPEC"
- set-gen-adds add "SPEC DET DEMON"
- regenerate "NP boys"
- the/those/these boys
27 Generation for Debugging
- Checking for grammar and lexicon errors
- create-generator english.lfg
- reports ill-formed rules, templates, feature declarations, lexical entries
- Checking for ill-formed sentences that can be parsed
- parse a sentence
- see if all the results are legitimate strings
- regenerate "they appear."
28 Regeneration example
- regenerate "In the park they often see the boy with the telescope."
- parsing In the park they often see the boy with the telescope.
- 4 solutions, 0.39 CPU seconds, 178 subtrees unified
- They see the boy in the park
  In the park they see the boy often with the telescope.
- regeneration took 0.87 CPU seconds.
29 Regenerate testfile
- regenerate-testfile
- produces new file testfile.regen
- sentences with parses and generated strings
- lists sentences with no strings
- if you have no Gen OT marks, everything should generate back to itself
30 Testing and Evaluation
- Need to know
- Does the grammar do what you think it should?
- cover the constructions
- still cover them after changes
- not get spurious parses
- not cover ungrammatical input
- How good is it?
- relative to a ground truth/gold standard
- for a given application
31 Testsuites
- XLE can parse and generate from testsuites
- parse-testfile
- regenerate-testfile
- Issues
- where to get the testsuites
- how to know if the parse the grammar got is the
one that was intended
32 Basic testsuites
- Set of sentences separated by blank lines
- can specify category
- NP the children who I see
- can specify expected number of results
- They saw her duck. (2! 0 0 0)
- parse-testfile produces
- xxx.new: sentences plus new parse statistics
- number of parses, time, complexity
- xxx.stats: new parse statistics without the sentences
- xxx.errors: changes in the statistics from the previous run
33 Testsuite examples
- LEXICON _'s
- ROOT He's leaving. (1+1 0.10 55)
- ROOT It's broken. (2+1 0.11 59)
- ROOT He's left. (3+1 0.12 92)
- ROOT He's a teacher. (1+1 0.13 57)
- RULE CPwh
- ROOT Which book have you read? (1+4 0.15 123)
- ROOT How does he be? (0! 0 0.08 0)
- RULE NOMINALARGS
- NP the money that they gave him (1 0.10 82)
34 .errors file
ROOT They left, then they arrived. (2+2 0.17 110)
MISMATCH ON 339 (2+2 -> 1+2)
ROOT Is important that he comes. (0! 0 0.15 316)
ERROR AND MISMATCH ON 784 (0! 0 -> 1+119)
35 .stats file
((1901) (1+1 0.21 72) -> (1+1 0.21 72) (5 words))
((1902) (1+1 0.10 82) -> (1+1 0.12 82) (6 words))
((1903) (1 0.04 15) -> (1 0.04 15) (1 word))
XLE release of Feb 26, 2004 11:29.
Grammar ~thking/pargram/english/standard/english.lfg.
Grammar last modified on Feb 27, 2004 13:58.
1903 sentences, 38 errors, 108 mismatches
0 sentences had 0 parses (added 0, removed 56)
38 sentences with 0!
38 sentences with 0! have solutions (added 29, removed 0)
57 starred sentences (added 57, removed 0)
timeout 100
max_new_events_per_graph_when_skimming 500
maximum scratch storage per sentence 26.28 MB (642)
maximum event count per sentence 1276360
average event count per graph 217.37
36 .stats file cont.
293.75 CPU secs total, 1.79 CPU secs max
new time/old time 1.23
elapsed time 337 seconds
biggest increase 1.16 sec (677 1.63 sec)
biggest decrease 0.64 sec (1386 0.54 sec)
range   parsed  failed  words  seconds  subtrees  optimal  suboptimal
1-10    1844    0       4.25   0.14     80.73     1.44      2.49E+01
11-20     59    0      11.98   0.54    497.12    10.41      2.05E+04
all     1903    0       4.49   0.15     93.64     1.72      6.60E+02
0.71 of the variance in seconds is explained by the number of subtrees
37 Is it the right parse?
- Use shallow markup to constrain possibilities
- bracketing of desired constituents
- POS tags
- Compare resulting structure to a previously banked one (perhaps a skeletal one)
- significant amount of work if done by hand
- bank f-structures from the grammar if good enough
- reduce work by using partial structures
- (e.g., just predicate argument structure)
38 Where to get the testsuite?
- Basic coverage
- create testsuite when writing the grammar
- publicly available testsuites
- extract examples from the grammar comments
- "COMEX NP-RULE NP the flimsy boxes"
- examples specific enough to test one construction at a time
- Interactions
- real world text necessary
- may need to clean up the text somewhat
39 Evaluation
- How good is the grammar?
- Absolute scale
- need a gold standard to compare against
- Relative scale
- comparing against other systems
- For an application
- some applications are more error tolerant than
others
40 Gold standards
- Representation of the perfect parse for the sentence
- can bootstrap with a grammar for efficiency and consistency
- hand checking and correction
- Determine how close the grammar's output is to the gold standard
- may have to do systematic mappings
- may only care about certain relations
41 PARC700
- 700 sentences randomly chosen from section 23 of the UPenn WSJ corpus
- How created
- parsed with the grammar
- saved the best parse
- converted format to "triples"
- hand corrected the output
- Issues
- very time consuming process
- difficult to maintain consistency even with
bootstrapping and error checking tools
42 Sample triple from PARC700
sentence(
  id(wsj_2356.19, parc_23.34)
  date(2002.6.12)
  validators(T.H. King, J.-P. Marcotte)
  sentence_form(The device was replaced.)
  structure(
    mood(replace~0, indicative)
    passive(replace~0, +)
    stmt_type(replace~0, declarative)
    subj(replace~0, device~1)
    tense(replace~0, past)
    vtype(replace~0, main)
    det_form(device~1, the)
    det_type(device~1, def)
    num(device~1, sg)
    pers(device~1, 3)))
43 Evaluation against PARC700
- Parse the 700 sentences with the grammar
- Compare the f-structure with the triple
- Determine
- number of attribute-value pairs that are missing from the f-structure
- number of attribute-value pairs that are in the f-structure but should not be
- combine the results into an f-score
- 100 is a perfect match; 0 is no match
- current grammar is in the low 80s
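Combining the missing and extra counts into an f-score is usually done with the standard precision/recall definition over attribute-value pairs:
  precision = matching pairs / pairs in the grammar's f-structure
  recall = matching pairs / pairs in the gold-standard triples
  f-score = 2 * precision * recall / (precision + recall), scaled to 0-100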
44 Using other gold standards
- Need to match corpus to grammar type
- written text vs. transcribed speech
- technical manuals, novels, newspapers
- May need to have mappings between systematic differences in analyses
- minimally want a match in grammatical functions
- but even this can be difficult (e.g. XCOMP
subjects)
45 Testing and evaluation
- Necessary to determine grammar coverage and usability
- Frequent testing allows problems to be corrected early on
- Changes in efficiency are also detectable in this way