Title: Linguistics 239E Week 8
1Linguistics 239E Week 8
Fragments, Performance limits, Shallow markup
- Ron Kaplan and Tracy King
2Issues from Week7 HW
- Be careful to make your disjunctions non-overlapping (unless you really mean it)
- V3SG = { (^ SUBJ NUM)=sg
           (^ SUBJ PERS)=3
         | @(OT-MARK BadVAgr) }.
- he laughs. --> 1+1 parses
- V3SG = { (^ SUBJ NUM)=sg
           (^ SUBJ PERS)=3
         | { (^ SUBJ NUM)~=sg
           | (^ SUBJ PERS)~=3 }
           @(OT-MARK BadVAgr) }.
- you laughs. --> 2 parses
3New XLE release
- 8 bug fixes/improvements based on input from the class
- Windows control-m problem
- S--> problem
- can parse lexical categories
- closed sets as arguments to templates
- interpretation of empty feature declaration
- lexical entries without morph codes
- index files on Windows
- documentation contact updated
4FRAGMENT grammar
- What to do when the grammar does not get a parse
- always want some type of output
- want the output to be maximally useful
- Why might it fail
- construction not covered yet
- "bad" input
- took too long (XLE parsing parameters)
5Grammar engineering approach
- First try to get a complete parse
- If fail, build up chunks that get complete parses (c-str and f-str)
- Have a fall back for things without even chunk parses
- Link these chunks and fall backs together in a single f-structure
6Basic idea
- XLE has a REPARSECAT which it tries if there is no complete parse
- Grammar writer specifies what category the possible chunks are
- OT marks are used to
- build the fewest chunks possible
- disprefer using the fall back over the chunks
7Sample output
- the the dog appears.
- Split into
- "token" the
- sentence "the dog appears"
- ignore the period
8C-structure
9F-structure
10How to get this
FRAGMENTS --> { NP: (^ FIRST)=! @(OT-MARK Fragment)
              | S: (^ FIRST)=! @(OT-MARK Fragment)
              | TOKEN: (^ FIRST)=! @(OT-MARK Fragment) }
              (FRAGMENTS: (^ REST)=!).
Lexicon: -token TOKEN (^ TOKEN)=%stem @(OT-MARK Token).
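The way the two OT marks steer the choice of chunking can be sketched in Python. This is only a toy model, not XLE's optimality machinery: the candidate chunkings are written out by hand, `ot_marks` and `best_analysis` are hypothetical helper names, and the ranking (fewer Fragment marks first, then fewer Token marks) is an assumption, since the actual order is set in the grammar's configuration.

```python
from collections import Counter

def ot_marks(chunks):
    """Count OT marks: one Fragment mark per chunk, plus one Token mark per TOKEN."""
    marks = Counter()
    for category, _text in chunks:
        marks["Fragment"] += 1
        if category == "TOKEN":
            marks["Token"] += 1
    return marks

def best_analysis(candidates):
    """Pick the candidate with the fewest marks (assumed order: Fragment, then Token)."""
    def cost(chunks):
        marks = ot_marks(chunks)
        return (marks["Fragment"], marks["Token"])
    return min(candidates, key=cost)

# Hand-written candidate chunkings of "the the dog appears"
candidates = [
    [("TOKEN", "the"), ("S", "the dog appears")],
    [("TOKEN", "the"), ("NP", "the dog"), ("TOKEN", "appears")],
    [("TOKEN", "the"), ("TOKEN", "the"), ("TOKEN", "dog"), ("TOKEN", "appears")],
]
best = best_analysis(candidates)
print(best)
```

For this input the same candidate wins under either ranking of the two marks, because it has both the fewest chunks and the fewest fall-back tokens.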
11Why First-Rest?
- FIRST-REST
  [ FIRST [ PRED ]
    REST  [ FIRST [ PRED ]
            REST ] ]
- Efficient
- Encodes order
- Possible alternative: set
  { [ PRED ] [ PRED ] }
- Not as efficient (copying)
- Even less efficient if mark scope facts
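A small Python sketch of why the FIRST-REST encoding keeps order: the chunks form a linked list, so reading them back follows REST pointers from the top. Nested dictionaries stand in for f-structures here, and `to_first_rest` / `from_first_rest` are hypothetical names used only for illustration.

```python
def to_first_rest(chunks):
    """Nest a list of chunk f-structures as FIRST/REST, innermost chunk last."""
    node = None
    for chunk in reversed(chunks):
        cell = {"FIRST": chunk}
        if node is not None:
            cell["REST"] = node
        node = cell
    return node

def from_first_rest(node):
    """Walk the REST chain and return the chunks in their original order."""
    chunks = []
    while node is not None:
        chunks.append(node["FIRST"])
        node = node.get("REST")
    return chunks

fragments = to_first_rest([{"TOKEN": "the"}, {"PRED": "appear<SUBJ>"}])
print(fragments)
print(from_first_rest(fragments))
```

A plain set of the same chunks would lose the left-to-right order unless extra ordering or scope facts were added.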
12Accuracy?
- Evaluation against gold standard
- PARC 700 f-structure bank for Wall Street Journal
- Measure F-score on dependency triples
- F-score: harmonic mean of precision and recall
- Dependency triples: separate f-structure features
- Subj(run, dog) Tense(run, past)
- Results for best-matching f-structure
- Full parses: F=88.5
- Fragment parses: F=76.7
(Riezler et al, 2002)
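The F-score over dependency triples can be sketched as follows. This is a minimal illustration, not the evaluation code behind the reported numbers: triples are modelled as plain tuples, `f_score` is a hypothetical name, and the predicted triples are made up for the example.

```python
def f_score(gold, predicted):
    """Balanced F-score (harmonic mean of precision and recall) over triples."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("Subj", "run", "dog"), ("Tense", "run", "past")}
predicted = {("Subj", "run", "dog"), ("Tense", "run", "pres")}  # made-up parser output
print(f_score(gold, predicted))  # one of two triples matches on each side: 0.5
```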
13Fragments summary
- XLE has a chunking strategy for when the grammar does not provide a full analysis
- Each chunk gets full c-str and f-str
- The grammar writer defines the chunks based on what will be best for that grammar and application
- Quality
- Fragments have reasonable but degraded f-scores
- Usefulness in applications is being tested
14Resource limitations: Time and space
15Exceeding available resources
16Hard limits: Time and storage
- For some applications
- No output on a few hard sentences is better than getting hung up, never getting to easy ones
- E.g.
- Search applications: you never find everything anyway
- Grammar testing/debugging: no surprise, move on
- XLE commands
- set timeout 60 (abort after 60 seconds)
- set max_xle_scratch_storage 50 (abort after 50 megabytes)
17Soft limits: Skimming
- Bound the f-structure effort per subtree
- Compute normally until a threshold is reached
- set start_skimming_when_scratch_storage_exceeds 700 (megabytes)
- set start_skimming_when_total_events_exceed XX (some number)
- (XX estimated from timeouts in test runs)
- Then limit the number of solutions per edge
- set max_new_events_per_graph_when_skimming XX
- Bounded computation/edge => cubic
- Result in reasonable time/space
- At least one solution for every sentence
- But some solutions will be missed
- Suppress weighty constituents
- Limit length of medial constituents
- set max_medial_constituent_weight 20
- Don't allow medial edges that span more than 20 terminals
- (approximation to avoiding center embedding)
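The medial-constituent limit can be pictured with a short Python sketch. It is not XLE's implementation: edges are modelled as half-open spans `(start, end)` over terminal positions, "medial" is assumed to mean an edge that touches neither end of the sentence, and `is_medial` / `allow_edge` are hypothetical names.

```python
def is_medial(start, end, n_terminals):
    """Assumption: an edge is medial if it touches neither end of the sentence."""
    return start > 0 and end < n_terminals

def allow_edge(start, end, n_terminals, max_medial_weight=20):
    """Reject medial edges that span more terminals than the limit."""
    weight = end - start  # number of terminals the edge covers
    return not (is_medial(start, end, n_terminals) and weight > max_medial_weight)

# In a 40-terminal sentence with the limit at 20:
print(allow_edge(0, 30, 40))   # sentence-initial edge, not medial
print(allow_edge(5, 30, 40))   # medial edge spanning 25 terminals
print(allow_edge(5, 25, 40))   # medial edge spanning exactly 20 terminals
```

Only the second edge is suppressed; long constituents at either edge of the sentence are unaffected.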
18Accuracy?
- Again, evaluation against gold standard
- PARC 700 f-structure bank for Wall Street Journal
- Results for best-matching f-structure
- Full parses: F=88.5
- Fragment parses: F=76.7
- Skimmed parses: F=70.3
- Skimmed/Fragments: F=61.3
(Riezler et al, 2002)
19Integrating Shallow Mark up: Part of speech tags, Named entities, Syntactic brackets
20Shallow mark-up of input strings
- Part-of-speech tags (tagger?)
- I/PRP saw/VBD her/PRP duck/VB.
- I/PRP saw/VBD her/PRP duck/NN.
- Named entities (named-entity recognizer)
- <person>General Mills</person> bought it.
- <company>General Mills</company> bought it.
- Syntactic brackets (chunk parser?)
- [NP-S I] saw [NP-O the girl with the telescope].
- [NP-S I] saw [NP-O the girl] with the telescope.
21Hypothesis
- Shallow mark-up
- Reduces ambiguity
- Increases speed
- Without decreasing accuracy
- (Helps development)
- Issues
- Markup errors may eliminate correct analyses
- Markup process may be slow
- Markup may interfere with existing robustness mechanisms (optimality, fragments, guessers)
- Backoff may restore robustness but decrease speed in 2-pass system (STOPPOINT)
22Implementation in XLE
How to integrate with minimal changes to existing
system/grammar?
23XLE String Processing
lexical forms
Multiwords (Modify sequences)
token morphemes
Analyze (Morph, Guess, Tok)
tokens: {T|t}he TB oil TB filter TB 's TB gone TB
Tokenize (Decap, split, commas)
string: The oil filter's gone
24Part of speech tags
- How do tags pass thru Tokenize/Analyze?
- Which tags constrain which morphemes?
- How?
lexical forms
Multiwords
token morphemes
Analyze
tokens
Tokenize
string: The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_
25Passing tags through Tokenizer
- Tokenizer must treat tag characters specially
- Must recognize them: e.g. xxx/TAG_
- Must not transform them: e.g. x/NN_ -> x/nn_
- Must not let tags interrupt other patterns
- e.g. wo/MD_n't/RB_ should behave like won't
- Must split tags off as separate tokens, for existing Token path through Analyzer
- How to do this with minimal changes to existing tokenizer FST?
Diagram: string -> Tokenize -> tokens
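The requirements on the tokenizer can be illustrated with a regular-expression sketch in Python. This swaps in a regex for the finite-state transducer actually used and does none of the real tokenizer's other work (decapitalization, comma handling). The tag shape `/TAG_` with uppercase letters is an assumption, and `split_tags` / `strip_tags` are hypothetical names.

```python
import re

# Assumption: a tag is "/" + uppercase letters (or "$") + "_", e.g. "/NN_"
TAG = re.compile(r"/([A-Z$]+)_")

def split_tags(marked_up):
    """Split tags off as separate tokens without changing their case."""
    tokens = []
    for chunk in marked_up.split():
        pos = 0
        for match in TAG.finditer(chunk):
            word = chunk[pos:match.start()]
            if word:
                tokens.append(word)
            tokens.append(match.group(1) + "_")  # tag token, e.g. "NN_"
            pos = match.end()
        if chunk[pos:]:
            tokens.append(chunk[pos:])
    return tokens

def strip_tags(marked_up):
    """Remove tags entirely, to see what the other patterns should match."""
    return TAG.sub("", marked_up)

print(split_tags("The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_"))
print(strip_tags("wo/MD_n't/RB_"))
```

The second call shows the "must not interrupt other patterns" requirement: with the tags ignored, `wo/MD_n't/RB_` is the same string as `won't`.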
26Modifying an existing tokenizer
- Tags shouldn't be transformed
- Tags shouldn't disrupt any other patterns
- Script for xfst program: Tokenizer Tag .o. Tokenizer/Tag
- Don't transform
- Don't disrupt
- Glitch: Ignore (/) introduces unwanted ambiguity around insertions
- Solution, a little less modularity:
- Construct Tokenizer using cover symbol for tags, placing them wrt insertion
- Substitute actual tag-strings for cover symbol
27Specifying morpheme/pos-tag constraints
- For each pos-tag, grammar/morphology writer specifies by hand the set of compatible morph-tag sequences
- Inputs: Description of pos-tag interpretation (from Penn document)
- List of all possible morph-tag sequences from analyzer (from program run on Morph/Guesser FSTs)
- Output: A text file that characterizes the relationship
- E.g. NNS is plural noun, so text file has
- (NNS (Noun Pl) (Noun SP) (Abbr))
- PRP is personal pronoun, so text file has
- (PRP (Pron Pers Gen) (Pron Poss))
- Lisp program reads file, produces POSFilter transducer
- Allows NNS_ Token sequence only if preceded by strings that contain
- Noun and Pl tags, or Noun and SP tags, etc.
- POSFilter FST is put in MULTIWORD section, knocks out undesired morpheme sequences.
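What the POSFilter does can be approximated in Python with a lookup table instead of a transducer. The table below copies only the two entries quoted above. The analyses for "walks" are invented for illustration, `compatible` and `filter_analyses` are hypothetical names, and treating an unlisted POS tag as unconstrained is an assumption.

```python
# POS tag -> alternative sets of morph tags; an analysis passes if it
# contains every tag of at least one alternative.
POS_TABLE = {
    "NNS": [{"Noun", "Pl"}, {"Noun", "SP"}, {"Abbr"}],
    "PRP": [{"Pron", "Pers", "Gen"}, {"Pron", "Poss"}],
}

def compatible(pos_tag, morph_tags):
    options = POS_TABLE.get(pos_tag)
    if options is None:
        return True  # assumption: unlisted tags impose no constraint
    return any(required <= set(morph_tags) for required in options)

def filter_analyses(pos_tag, analyses):
    """Keep only the (stem, morph_tags) analyses allowed by the POS tag."""
    return [a for a in analyses if compatible(pos_tag, a[1])]

# Invented morphological analyses for the token "walks"
walks = [("walk", ["Noun", "Pl"]), ("walk", ["Verb", "Pres", "3sg"])]
print(filter_analyses("NNS", walks))
```

With the NNS tag on "walks", only the plural-noun analysis survives and the verb reading is knocked out before parsing.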
28Excerpts from file
Determiner
(DT (Det Interrog) (DetPron Interrog))
Adjectives
(JJ (Adj Comp Sup Interrog IntRel) (Num Ord) (Dig Ord) (Verb Prog))
(JJR (Adj Comp)) comparative
(JJS (Adj Sup)) superlative
Verbs
(VB (Verb Pres 3sg) (Aux Pres 3sg) (Verb Inf) (Aux Inf)) base form
(VBD (Verb PastTense) (Verb PastBoth) (Aux PastTense) (Aux PastBoth)) past
(VBG (Verb PresPart) (Verb Prog) (Aux Prog)) gerund, present participle
(VBN (Verb PastPart) (Verb PastPerf) (Aux PastPerf) (Verb PastBoth) (Aux PastBoth)) past participle
(VBP (Verb Pres 3sg) (Aux Pres 3sg)) non 3sg
(VBZ (Verb Pres 3sg) (Aux Pres 3sg)) 3sg
29All together
lexical forms
Multiwords (POSFilterFST)
token morphemes
Analyze
tokens
Tokenize (Tokenize, POSStringFST)
string
30MORPHCONFIG
- STANDARD ENGLISH MORPHOLOGY (1.0)
- TOKENIZE
- ../common/englishpostags.stringfst ../common/english.tok.parse.fst
- ANALYZE
- ../common/english.infl.fst
- ../common/english.morph.guesser.fst
- MULTIWORD
- ../common/eng-infl-final.posfilterfst
- BuildMultiwordsFromLexicon
- Tag Prefer
- BuildMultiwordsFromMorphology
- Tag Prefer
31Embellishments
- Alternative POS tags in string, if tagger unsure
- walks/NNSVBZ
- Can specify that some tags are ignored
- Redundant with other information: ./._
- Constraints cross-classify too many morph-tag sequences
- Too hard to specify
- Tags are optional: may appear on some but not all words
32Named entities Example input
- parse <person>Mr. Thejskt Thejs</person> arrived.
- tokenized string
- Mr. Thejskt Thejs TB NEperson
- Mr(TB). TB Thejskt TB Thejs
- . (.) TB (, TB) .
- TB arrived TB
33Lexicon
- Lexical entries for tags
- NEperson NE_SFX @(PROPER name).
- Lexical entry for token
- -token TOKEN (^ TOKEN)=%stem
- NE @(NOUN %stem)
- @(GRAIN proper)
- @(SOURCE entity-finder)
- @(OT-MARK NamedEntity).
34Grammar Rules
- Rules
- NOUN-ENTITY --> NE NE_SFX.
- NOUN -->
- @NOUN-ENTITY.
- Config: OT Mark
- (MWE NamedEntity) STOPPOINT
35Resulting C-structure
36Resulting F-structure
37Overriding Bad NE Bracketing
- Override if no parse found with bracketing
- For example: verb accidentally bracketed
- parse Mr. <person>Atbeu Thes arrived</person>.
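The override behaviour amounts to a two-pass backoff, which can be sketched in Python. This is not how XLE does it (there the second pass comes from the STOPPOINT in the OT configuration, inside one grammar). Here the markup is simply stripped and the sentence reparsed, and `parse` is a hypothetical callable standing in for the parser.

```python
import re

# Assumption: named-entity markup looks like <person> ... </person>
MARKUP = re.compile(r"</?[a-z]+>")

def parse_with_backoff(sentence, parse):
    """Try to respect the bracketing first; if nothing parses, ignore it."""
    analyses = parse(sentence)
    if analyses:
        return analyses, "bracketing respected"
    return parse(MARKUP.sub("", sentence)), "bracketing overridden"

def toy_parse(sentence):
    """Stand-in parser: fails when the verb is inside the person bracket."""
    return [] if "arrived</person>" in sentence else ["S"]

print(parse_with_backoff("Mr. <person>Atbeu Thes arrived</person>.", toy_parse))
print(parse_with_backoff("<person>Mr. Thejskt Thejs</person> arrived.", toy_parse))
```

The second pass costs an extra parse, which is the speed penalty the backoff trades for robustness.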
38Result Normal C-structure
39Result Normal F-structure
40Syntactic brackets
- Chunker: labelled bracketing
- [NP-SBJ Mary and John] saw [NP-OBJ the girl with the telescope].
- They [V pushed and pulled] the cart.
- Implementation
- Tokenizing FST identifies, tokenizes labels without interrupting other patterns
- Bracketing constraints enforced by Metarulemacro
- METARULEMACRO(_CAT _BASECAT _RHS) =
    { _RHS
    | LSB
      CAT-LB[_BASECAT]
      _CAT
      RSB }.
41Syntactic brackets
- [NP-SBJ Mary] appeared.
- Lexicon: NP-SBJ CAT-LB[NP] (SUBJ ^).
S
  NP
    LSB
    CAT-LB[NP]
      NP-SBJ
    NP
      N Mary
    RSB
  VP
    V appeared
42Experimental test
- Again, F-scores on PARC 700 f-structure bank
- Upper bound: Sentences with best-available markup
- POS tags from Penn Tree Bank
- Some noise from incompatible coding
- Werner is president of the parent/JJ company/NN. Adj-Noun vs. our Noun-Noun
- Some noise from multi-word treatment
- Kleinword/NNP Benson/NNP &/CC Co./NNP
- vs. Kleinword_Benson_&_Co./NNP
- Named entities: hand-coded by us
- Labeled brackets also approximated by Penn Tree Bank
- Keep core-GF brackets: S, NP, VP-under-VP
- Others are incompatible or unreliable: discarded
43Results
45Motivation for Part of Speech Tags
- Extra source of information for reducing ambiguity
- Online parsing: Less confusing, more useful results
- Grammar development: Heuristic for determining whether grammar gets correct analyses, help in building f-structure bank
- Note: Recall is more important than precision
- Don't want local, probabilistic decisions to eliminate globally correct analysis
- Reducing ambiguity in initial parse chart might drastically improve speed if
- Much c-structure ambiguity comes from POS ambiguity
- So chart is more linear than cubic
- And total time is (more or less) proportional to chart size