Linguistics 239E Week 8 - PowerPoint PPT Presentation

1
Linguistics 239E Week 8
Fragments, Performance limits, Shallow markup

Ron Kaplan and Tracy King

2
Issues from Week 7 HW

Be careful to make your disjunctions
non-overlapping (unless you really mean it)
V3SG = { (^ SUBJ NUM)=sg
         (^ SUBJ PERS)=3
       | @(OT-MARK BadVAgr) }.
he laughs. --> 11 parses
V3SG = { (^ SUBJ NUM)=sg
         (^ SUBJ PERS)=3
       | { (^ SUBJ NUM)~=sg
         | (^ SUBJ PERS)~=3 }
         @(OT-MARK BadVAgr) }.
you laughs. --> 2 parses
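
The effect of an overlapping disjunction can be sketched outside XLE. The Python below is only a toy model, not XLE's solver: the feature dictionaries and function names are hypothetical, and the counts it produces are those of the toy, not the parse counts quoted on the slide. It shows that when the dispreferred branch carries no condition of its own, a well-formed input satisfies both branches, whereas guarding that branch with the negation of the first leaves exactly one analysis.

```python
def agrees(subj):
    """The agreement branch: a third person singular subject."""
    return subj.get("NUM") == "sg" and subj.get("PERS") == 3


def overlapping(subj):
    """The second branch has no condition, so it also fires when agreement holds."""
    analyses = []
    if agrees(subj):
        analyses.append("agree")
    analyses.append("BadVAgr")  # always available: overlaps with the first branch
    return analyses


def non_overlapping(subj):
    """The second branch is guarded by the negation of the first."""
    if agrees(subj):
        return ["agree"]
    return ["BadVAgr"]


he = {"NUM": "sg", "PERS": 3}   # hypothetical feature bundle for "he"
print(overlapping(he))          # the agreeing subject gets a spurious second analysis
print(non_overlapping(he))      # only the agreeing analysis survives
```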

3
New XLE release

8 bug fixes/improvements based on input from the
class
Windows control-m problem
S--> problem
can parse lexical categories
closed sets as arguments to templates
interpretation of empty feature declaration
lexical entries without morph codes
index files on Windows
documentation contact updated

4
FRAGMENT grammar

What to do when the grammar does not get a parse
always want some type of output
want the output to be maximally useful
Why might it fail?
construction not covered yet
"bad" input
took too long (XLE parsing parameters)

5
Grammar engineering approach

First try to get a complete parse
If fail, build up chunks that get complete parses
(c-str and f-str)
Have a fallback for things without even chunk parses
Link these chunks and fallbacks together in a single f-structure

6
Basic idea

XLE has a REPARSECAT which it tries if there is no complete parse
Grammar writer specifies what category the possible chunks are
OT marks are used to
build the fewest chunks possible
disprefer using the fallback over the chunks

7
Sample output

the the dog appears.
Split into:
"token": the
sentence: "the dog appears"
ignore the period

8
C-structure
9
F-structure
10
How to get this
FRAGMENTS --> { NP: (^ FIRST)=! @(OT-MARK Fragment)
              | S: (^ FIRST)=! @(OT-MARK Fragment)
              | TOKEN: (^ FIRST)=! @(OT-MARK Fragment) }
              (FRAGMENTS: (^ REST)=!).
Lexicon: -token TOKEN (^ TOKEN)=%stem
                      @(OT-MARK Token).
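
How the two marks interact can be illustrated with a small ranking sketch. It assumes, as a simplification, that every chunk costs one Fragment mark, that a TOKEN chunk additionally costs one Token mark, and that Token is ranked as the worse of the two; the actual ranking is set in the grammar's configuration, and the candidate segmentations below are made up by hand rather than produced by a parser.

```python
def ot_cost(chunks):
    """chunks: list of (category, text) pairs that together cover the input.

    Returns a tuple compared lexicographically: Token marks first (assumed
    to be the worse mark), then Fragment marks (one per chunk).
    """
    token_marks = sum(1 for cat, _ in chunks if cat == "TOKEN")
    fragment_marks = len(chunks)
    return (token_marks, fragment_marks)


# Hand-written candidate analyses of "the the dog appears"
candidates = [
    [("TOKEN", "the"), ("S", "the dog appears")],
    [("TOKEN", "the"), ("NP", "the dog"), ("TOKEN", "appears")],
    [("TOKEN", "the"), ("TOKEN", "the"), ("TOKEN", "dog"), ("TOKEN", "appears")],
]

best = min(candidates, key=ot_cost)
print(best)
```

Under these assumptions the analysis with one token and one sentence chunk is preferred, since it uses the fewest fallback tokens and the fewest chunks.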
11
Why First-Rest?

FIRST-REST
[ FIRST [ PRED ]
  REST  [ FIRST [ PRED ]
          REST ] ]
Efficient
Encodes order
Possible alternative: set
{ [ PRED ]
  [ PRED ] }
Not as efficient (copying)
Even less efficient if scope facts are marked
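
The FIRST-REST encoding is essentially a linked list of chunk f-structures. A minimal sketch, using plain Python dictionaries as stand-ins for f-structures (the helper names are hypothetical and this is not how XLE represents them internally):

```python
def first_rest(chunks):
    """Fold a left-to-right list of chunk f-structures into nested FIRST/REST."""
    result = None
    for chunk in reversed(chunks):
        node = {"FIRST": chunk}
        if result is not None:
            node["REST"] = result
        result = node
    return result


def in_order(fs):
    """Walk the FIRST/REST chain and recover the chunks in surface order."""
    out = []
    while fs is not None:
        out.append(fs["FIRST"])
        fs = fs.get("REST")
    return out


chunks = [{"TOKEN": "the"}, {"PRED": "appear"}]
nested = first_rest(chunks)
print(nested)
print(in_order(nested) == chunks)
```

Because each chunk sits at a fixed path (FIRST, REST FIRST, and so on), the order of the chunks is recoverable, which an unordered set would not give.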

12
Accuracy?

Evaluation against gold standard
PARC 700 f-structure bank for Wall Street Journal
Measure F-score on dependency triples
F-score: harmonic mean of precision and recall
Dependency triples: separate f-structure features
Subj(run, dog), Tense(run, past)
Results for best-matching f-structure
Full parses: F = 88.5
Fragment parses: F = 76.7

(Riezler et al., 2002)
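
The measure itself is easy to state in code. The sketch below computes precision, recall and the standard balanced F-score over sets of dependency triples; the triples are toy data (Num(dog, sg) and the wrong tense value are invented for illustration), and the function is not the evaluation script used for the reported numbers.

```python
def prf(gold, predicted):
    """Precision, recall and balanced F-score over sets of dependency triples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = {("Subj", "run", "dog"), ("Tense", "run", "past"), ("Num", "dog", "sg")}
pred = {("Subj", "run", "dog"), ("Tense", "run", "pres"), ("Num", "dog", "sg")}

p, r, f = prf(gold, pred)
print(p, r, f)
```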
13
Fragments summary

XLE has a chunking strategy for when the grammar
does not provide a full analysis
Each chunk gets full c-str and f-str
The grammar writer defines the chunks based on
what will be best for that grammar and
application
Quality
Fragments have reasonable but degraded f-scores
Usefulness in applications is being tested

14
Resource limitations: Time and space
15
Exceeding available resources
16
Hard limits: Time and storage

For some applications
No output on a few hard sentences is better than getting hung up and never getting to the easy ones
E.g.
Search applications: you never find everything anyway
Grammar testing/debugging: no surprise, move on
XLE commands
set timeout 60                     (abort after 60 seconds)
set max_xle_scratch_storage 50     (abort after 50 megabytes)

17
Soft limits: Skimming

Bound the f-structure effort per subtree
Compute normally until a threshold is reached
set start_skimming_when_scratch_storage_exceeds 700     (megabytes)
set start_skimming_when_total_events_exceed XX          (some number)
(XX estimated from timeouts in test runs)
Then limit the number of solutions per edge
set max_new_events_per_graph_when_skimming XX
Bounded computation/edge => cubic
Result in reasonable time/space
At least one solution for every sentence
But some solutions will be missed
Suppress weighty constituents
Limit length of medial constituents
set max_medial_constituent_weight 20
Don't allow medial edges that span more than 20 terminals
(approximation to avoiding center embedding)
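
The medial-constituent limit can be pictured as a filter on chart edges. The sketch below assumes that "medial" means an edge touching neither the first nor the last terminal of the sentence, and that weight is simply the number of terminals spanned; both readings are assumptions made for illustration, and the function names are hypothetical rather than part of XLE.

```python
def is_medial(start, end, n_terminals):
    """An edge [start, end) is medial if it touches neither end of the sentence."""
    return start > 0 and end < n_terminals


def allowed(start, end, n_terminals, max_weight=20):
    """Reject medial edges that span more than max_weight terminals."""
    if is_medial(start, end, n_terminals) and (end - start) > max_weight:
        return False
    return True


# A 40-terminal sentence: only the long medial edge is suppressed.
print(allowed(0, 30, 40))   # starts at the left edge, so not medial
print(allowed(5, 26, 40))   # medial and 21 terminals long
print(allowed(5, 25, 40))   # medial but exactly 20 terminals long
```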

18
Accuracy?

Again, evaluation against gold standard
PARC 700 f-structure bank for Wall Street Journal
Results for best-matching f-structure
Full parses: F = 88.5
Fragment parses: F = 76.7
Skimmed parses: F = 70.3
Skimmed/Fragments: F = 61.3

(Riezler et al., 2002)
19
Integrating Shallow Mark-up: Part-of-speech tags, Named entities, Syntactic brackets
20
Shallow mark-up of input strings

Part-of-speech tags (tagger?)
I/PRP saw/VBD her/PRP duck/VB.
I/PRP saw/VBD her/PRP$ duck/NN.
Named entities (named-entity recognizer)
<person>General Mills</person> bought it.
<company>General Mills</company> bought it.
Syntactic brackets (chunk parser?)
[NP-S I] saw [NP-O the girl with the telescope].
[NP-S I] saw [NP-O the girl] with the telescope.

21
Hypothesis

Shallow mark-up
Reduces ambiguity
Increases speed
Without decreasing accuracy
(Helps development)

Issues
Markup errors may eliminate correct analyses
Markup process may be slow
Markup may interfere with existing robustness
mechanisms (optimality, fragments, guessers)
Backoff may restore robustness but decrease speed
in 2-pass system (STOPPOINT)

22
Implementation in XLE
How to integrate with minimal changes to existing
system/grammar?
23
XLE String Processing
lexical forms
  Multiwords (modify sequences)
token morphemes
  Analyze (Morph, Guess, Tok)
tokens: {T|t}he TB oil TB filter TB 's TB gone TB
  Tokenize (decap, split, commas)
string: The oil filter's gone
24
Part of speech tags
lexical forms
  Multiwords
token morphemes
  Analyze
tokens
  Tokenize
string: The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_

How do tags pass through Tokenize/Analyze?
Which tags constrain which morphemes?
How?
25
Passing tags through Tokenizer

Tokenizer must treat tag characters specially
Must recognize them, e.g. xxx/TAG_
Must not transform them, e.g. x/NN_ must not become x/nn_
Must not let tags interrupt other patterns, e.g. wo/MD_n't/RB_ should behave like won't
Must split tags off as separate tokens, for existing Token path through Analyzer
How to do this with minimal changes to existing tokenizer FST?

string => Tokenize => tokens
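
These requirements can be made concrete with a small regular-expression sketch. It is not the tokenizer FST itself: the tag shape `/TAG_` is taken from the examples, the character class for tag names is an assumption, and unconditional lowercasing stands in for the tokenizer's optional decapitalization. The point is only that words may be transformed while tags pass through untouched and are split off as separate tokens.

```python
import re

# Assumed tag shape: a slash, an upper-case (or punctuation) tag name, an underscore.
TAG = re.compile(r"/([A-Z$.,:]+)_")


def split_tags(marked):
    """Split 'word/TAG_' markup into alternating word and tag tokens."""
    tokens = []
    for piece in marked.split():
        pos = 0
        for match in TAG.finditer(piece):
            word = piece[pos:match.start()]
            if word:
                tokens.append(word.lower())   # words may be transformed
            tokens.append(match.group(0))     # tags are kept verbatim
            pos = match.end()
        if pos < len(piece):
            tokens.append(piece[pos:].lower())
    return tokens


def strip_tags(marked):
    """Remove the tags, showing the string that the other patterns should see."""
    return TAG.sub("", marked)


print(split_tags("The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_"))
print(strip_tags("wo/MD_n't/RB_"))
```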
26
Modifying an existing tokenizer

Tags shouldn't be transformed
Tags shouldn't disrupt any other patterns

Script for xfst program: Tokenizer Tag
.o. Tokenizer/Tag
Don't transform
Don't disrupt
Glitch: Ignore (/) introduces unwanted ambiguity around insertions
Solution, a little less modularity:
Construct Tokenizer using a cover symbol for tags, placing them with respect to insertions
Substitute actual tag strings for the cover symbol
27
Specifying morpheme/pos-tag constraints

For each pos-tag, grammar/morphology writer specifies by hand the set of compatible morph-tag sequences
Inputs:
Description of pos-tag interpretation (from Penn document)
List of all possible morph-tag sequences from analyzer (from program run on Morph/Guesser FSTs)
Output: a text file that characterizes the relationship
E.g. NNS is plural noun, so text file has
(NNS (Noun Pl) (Noun SP) (Abbr))
PRP is personal pronoun, so text file has
(PRP (Pron Pers Gen) (Pron Poss))
Lisp program reads file, produces POSFilter transducer
Allows NNS_ Token sequence only if preceded by strings that contain
Noun and Pl tags, or Noun and SP tags, etc.
POSFilter FST is put in MULTIWORD section, knocks out undesired morpheme sequences.
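
The filtering step can be sketched as a table lookup. The real filter is a transducer compiled by a Lisp program; the Python below only mimics its effect on a list of analyses. The table holds the NNS entry from the example above, the `(stem, tags)` representation of an analysis is hypothetical, and treating an unlisted pos-tag as imposing no constraint is an assumption.

```python
# pos-tag -> alternative sets of morph tags, mirroring (NNS (Noun Pl) (Noun SP) (Abbr))
POS_FILTER = {
    "NNS": [{"Noun", "Pl"}, {"Noun", "SP"}, {"Abbr"}],
}


def compatible(pos_tag, morph_tags):
    """True if the analysis contains every tag of at least one allowed set."""
    allowed = POS_FILTER.get(pos_tag)
    if allowed is None:            # unlisted tag: assumed to impose no constraint
        return True
    return any(required <= set(morph_tags) for required in allowed)


def filter_analyses(pos_tag, analyses):
    """Keep only the (stem, tags) analyses compatible with the pos-tag."""
    return [a for a in analyses if compatible(pos_tag, a[1])]


analyses = [
    ("walk", ["Verb", "Pres", "3sg"]),
    ("walk", ["Noun", "Pl"]),
]
print(filter_analyses("NNS", analyses))
```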

28
Excerpts from file
Determiner
(DT (Det Interrog) (DetPron Interrog))
Adjectives
(JJ (Adj ~Comp ~Sup ~Interrog ~IntRel) (Num Ord) (Dig Ord) (Verb Prog))
(JJR (Adj Comp))     comparative
(JJS (Adj Sup))      superlative
Verbs
(VB (Verb Pres 3sg) (Aux Pres 3sg) (Verb Inf) (Aux Inf))     base form
(VBD (Verb PastTense) (Verb PastBoth) (Aux PastTense) (Aux PastBoth))     past
(VBG (Verb PresPart) (Verb Prog) (Aux Prog))     gerund, present participle
(VBN (Verb PastPart) (Verb PastPerf) (Aux PastPerf) (Verb PastBoth) (Aux PastBoth))     past participle
(VBP (Verb Pres 3sg) (Aux Pres 3sg))     non 3sg
(VBZ (Verb Pres 3sg) (Aux Pres 3sg))     3sg
29
All together
lexical forms
  Multiwords (POSFilterFST)
token morphemes
  Analyze
tokens
  Tokenize (POSStringFST, Tokenize)
string
30
MORPHCONFIG

STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
../common/englishpostags.stringfst
../common/english.tok.parse.fst
ANALYZE:
../common/english.infl.fst
../common/english.morph.guesser.fst
MULTIWORD:
../common/eng-infl-final.posfilterfst
BuildMultiwordsFromLexicon
Tag Prefer
BuildMultiwordsFromMorphology
Tag Prefer

31
Embellishments

Alternative POS tags in string, if tagger unsure
walks/NNSVBZ
Can specify that some tags are ignored
Redundant with other information: ./._
Constraints cross-classify too many morph-tag sequences
Too hard to specify
Tags are optional: they may appear on some but not all words
32
Named entities Example input

parse <person>Mr. Thejskt Thejs</person> arrived.
tokenized string:
Mr. Thejskt Thejs TB NEperson
Mr(TB). TB Thejskt TB Thejs
. (.) TB (, TB) .
TB arrived
TB
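
A rough picture of what the tokenizer does with named-entity markup, written as a regular-expression sketch rather than an FST. The tag syntax `<type>...</type>` and the `NE` prefix on the tag token follow the example above; the function names are hypothetical, and punctuation splitting is deliberately left out. The second function shows the plain string that remains available if the bracketing has to be overridden.

```python
import re

# Matches <type>entity text</type> with the same type name on both sides.
NE = re.compile(r"<(\w+)>(.*?)</\1>")


def tokenize_with_entities(text):
    """Keep each marked entity as one token, followed by an NE<type> tag token."""
    tokens = []
    pos = 0
    for match in NE.finditer(text):
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group(2))           # whole entity as a single token
        tokens.append("NE" + match.group(1))    # tag token, e.g. NEperson
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens


def strip_entities(text):
    """Drop the markup but keep the words, for parsing without the bracketing."""
    return NE.sub(r"\2", text)


marked = "<person>Mr. Thejskt Thejs</person> arrived."
print(tokenize_with_entities(marked))
print(strip_entities(marked))
```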
33
Lexicon

Lexical entries for tags
NEperson NE_SFX @(PROPER name).
Lexical entry for token
-token TOKEN (^ TOKEN)=%stem
       NE @(NOUN %stem)
          @(GRAIN proper)
          @(SOURCE entity-finder)
          @(OT-MARK NamedEntity).

34
Grammar Rules

Rules
NOUN-ENTITY --> NE NE_SFX.
NOUN --> @NOUN-ENTITY.
Config OT Mark
(MWE NamedEntity) STOPPOINT

35
Resulting C-structure
36
Resulting F-structure
37
Overriding Bad NE Bracketing

Override if no parse found with bracketing
For example, verb accidentally bracketed
parse Mr. <person>Atbeu Thes arrived</person>.

38
Result Normal C-structure
39
Result Normal F-structure
40
Syntactic brackets

Chunker: labelled bracketing
[NP-SBJ Mary and John] saw [NP-OBJ the girl with the telescope].
They [V pushed and pulled] the cart.
Implementation
Tokenizing FST identifies, tokenizes labels without interrupting other patterns
Bracketing constraints enforced by Metarulemacro
METARULEMACRO(_CAT _BASECAT _RHS) =
  { _RHS
  | LSB
    CAT-LB[_BASECAT]
    _CAT
    RSB }.
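
The effect of the metarule macro is that every category gets a second way to be realized: as itself, wrapped in a labelled bracket pair. A toy rendering of that idea over a dictionary of rules follows; the rule representation and function names are hypothetical, the label category is written here as `CAT-LB[NP]`, and the base category is assumed to equal the category, which holds only for simple categories.

```python
def metarulemacro(cat, basecat, rhs):
    """Return the alternative expansions of `cat`: its ordinary right-hand side,
    or the same category wrapped in a labelled bracket pair."""
    return [
        list(rhs),
        ["LSB", f"CAT-LB[{basecat}]", cat, "RSB"],
    ]


def expand_grammar(rules):
    """Apply the macro to every rule; base category assumed equal to category."""
    return {cat: metarulemacro(cat, cat, rhs) for cat, rhs in rules.items()}


rules = {"NP": ["D", "N"], "S": ["NP", "VP"]}
for cat, alternatives in expand_grammar(rules).items():
    print(cat, alternatives)
```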

41
Syntactic brackets

[NP-SBJ Mary] appeared.
Lexicon: NP-SBJ CAT-LB[NP] (SUBJ ^).

S
  NP
    LSB
    CAT-LB[NP]: NP-SBJ
    NP
      N: Mary
    RSB
  VP
    V: appeared
42
Experimental test

Again, F-scores on PARC 700 f-structure bank
Upper bound: sentences with best-available markup
POS tags from Penn Tree Bank
Some noise from incompatible coding
Werner is president of the parent/JJ company/NN.
(Adj-Noun vs. our Noun-Noun)
Some noise from multi-word treatment
Kleinword/NNP Benson/NNP &/CC Co./NNP
vs.
Kleinword_Benson_&_Co./NNP
Named entities hand-coded by us
Labeled brackets also approximated by Penn Tree Bank
Keep core-GF brackets: S, NP, VP-under-VP
Others are incompatible or unreliable: discarded

43
Results
45
Motivation for Part of Speech Tags

Extra source of information for reducing ambiguity
Online parsing: less confusing, more useful results
Grammar development: heuristic for determining whether grammar gets correct analyses, help in building f-structure bank
Note: recall is more important than precision
Don't want local, probabilistic decisions to eliminate globally correct analysis
Reducing ambiguity in initial parse chart might drastically improve speed if
Much c-structure ambiguity comes from POS ambiguity
So chart is more linear than cubic
And total time is (more or less) proportional to chart size