Title: Apple Pie Parser Satoshi Sekine, July 1996
1Apple Pie Parser, Satoshi Sekine, July 1996
- Discussion and Demonstration Session
- Kenneth Wilson
- The College of New Jersey
- Spring 2003
2Apple Pie Parser
- Developed by Satoshi Sekine at NYU, Spring 1995
- Bottom-up probabilistic chart parser
- Uses a best-first search algorithm
- Grammar
- Semi-context sensitive grammar
- Two nonterminals
- Extracted from Penn Tree Bank
3Apple Pie Parser
- Fully automatic acquisition of grammar from a tagged corpus
- Parser generates a syntactic tree as its output
- Goal is to make a parse tree as accurate as possible for reasonable sentences (newspapers, well-written documents, etc.)
- Excludes conversation and ill-formed sentences
4Obtaining the source
Via FTP at cs.nyu.edu
Via WWW: http://cs.nyu.edu/cs/projects/proteus/sekine
http://cs.nyu.edu/cs/projects/proteus/app
5Directories and files
- After unzipping and untarring
Commands for unzipping and untarring:
gzip -d APP5.9.tar.gz
tar xvf APP5.9.tar
6How to make
- Changes to be made to src/config.h file
- Change relative pathname to absolute pathname
- #define DEFAULT_PARAMFILE "/local2/users/wilson15/APP5.9/bin/app.prm"
- Change memory allocation
- #define ALLOCATE_ANODE 600000 /* number of Anode */
- Changes to be made to bin/app.prm
- Change relative pathnames to absolute pathnames
- DICTIONARY_FILE /local2/users/wilson15/APP5.9/data/WS-A-001.dic
- GRAMMAR_FILE /local2/users/wilson15/APP5.9/data/WS-A-001.grm
- NICKNAME_FILE /local2/users/wilson15/APP5.9/data/WS-A-001.nic
7How to make
Type make in the ./src directory (rm *.o first if necessary)
8How to run
- Type app in the ./bin directory, or from any directory after adding ./bin to the shell path
9Example of a session
- Example of an APP session
10Parsing algorithm
- Techniques which enable the parser to handle large grammars
- Grammar rules are factored with common prefixes
- Best-first search
- Because the grammar is probabilistic, it is enough to find only the one parse tree with the highest probability.
- Parser uses integer values to indicate a POS or a grammatical node. The relationships between the integer values and the corresponding POS are stored in the nickname file.
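The idea of factoring rules by common prefixes can be illustrated with a small trie over right-hand sides. This is only a sketch of the general technique, not APP's internal data structure: the class name `RuleTrie`, the symbolic labels, and the scores are all hypothetical, chosen for readability (APP itself stores nodes as integers).

```python
class RuleTrie:
    """Trie over rule right-hand sides; rules sharing a prefix share nodes."""

    def __init__(self):
        self.children = {}     # next RHS symbol -> RuleTrie
        self.completions = []  # (lhs, score) for rules whose RHS ends here

    def add(self, lhs, rhs, score):
        node = self
        for symbol in rhs:
            node = node.children.setdefault(symbol, RuleTrie())
        node.completions.append((lhs, score))

    def count_nodes(self):
        return 1 + sum(child.count_nodes() for child in self.children.values())


def match(trie, symbols):
    """Return the (lhs, score) pairs of rules whose RHS is exactly `symbols`."""
    node = trie
    for symbol in symbols:
        node = node.children.get(symbol)
        if node is None:
            return []
    return list(node.completions)


# Hypothetical toy rules: the two NP rules share the prefix "DT".
rules = [
    ("NP", ("DT", "NN"), 10),
    ("NP", ("DT", "JJ", "NN"), 7),
    ("S", ("NP", "VP"), 12),
]
trie = RuleTrie()
for lhs, rhs, score in rules:
    trie.add(lhs, rhs, score)

print(match(trie, ("DT", "NN")))
print(trie.count_nodes())
```

With the prefix shared, the parser advances through "DT" once for both NP rules instead of once per rule, which is where the saving on a large grammar comes from.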
11Out of vocabulary
- Treatment of words not found in the vocabulary
- Uses a list of part-of-speech d
12Fitted parsing
- Treatment of very long sentences.
- Parser can't create a complete tree for some long sentences.
- Parser runs a post-process that builds a complete tree from several partial trees in the chart.
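One way to picture the post-process is as a search for the cheapest sequence of adjacent partial trees that covers the whole sentence. The sketch below is an illustration under stated assumptions, not APP's actual procedure: the edge tuples, their costs, and the `join_cost` penalty per fragment are hypothetical.

```python
def fit_partial_trees(n_words, edges, join_cost=1.0):
    """Choose adjacent chart edges covering words [0, n_words) at minimum cost.

    edges: list of (start, end, label, cost) with 0 <= start < end <= n_words.
    Returns (total_cost, chosen_edges), or None if the edges cannot cover
    the sentence.
    """
    inf = float("inf")
    best = [inf] * (n_words + 1)   # best[i]: cheapest cover of words [0, i)
    back = [None] * (n_words + 1)  # back[i]: last edge used in that cover
    best[0] = 0.0
    for end in range(1, n_words + 1):
        for edge in edges:
            start, edge_end, _label, cost = edge
            if edge_end != end or best[start] == inf:
                continue
            total = best[start] + cost + join_cost
            if total < best[end]:
                best[end] = total
                back[end] = edge
    if best[n_words] == inf:
        return None
    chosen = []
    pos = n_words
    while pos > 0:
        chosen.append(back[pos])
        pos = back[pos][0]
    chosen.reverse()
    return best[n_words], chosen


# Hypothetical partial trees over a 5-word sentence.
edges = [
    (0, 2, "NP", 1.0),
    (2, 5, "VP", 2.0),
    (0, 3, "S", 5.0),
    (3, 5, "NP", 1.5),
]
print(fit_partial_trees(5, edges))
```

The chosen fragments would then be attached under a single root node to form the fitted tree.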
13List of parameters
- Default parameter file: ./bin/app.prm
- System: DICTIONARY_FILE, DICT_SCORETYPE, GRAMMAR_FILE, NICKNAME_FILE, DEBUG, PRINT_STYLE, NO_WARNING
- Fitted parsing: FITTED_PARSE, FITTED_ROOT, FITTED_LABEL, FITTED_CCOST
- Cut off: TOO_LONG, CUT_GSCORE, CUT_DPROB, PRUNE_RATE
- Adjustment: WEIGHT_GRAM, WEIGHT_DICT, CAP_MINFREQ, ATTACH, NO_LETTERCASE, DOUBLE_QUOTE, START_SYMBOL, NP_LABEL
- Tokenization: TN_SYMBOL, TN_TWO_SYMBOL, TN_TWO1_SYMBOL, SPECIAL_HEAD, DEL_TAIL, DEL_TAIL_SYMBOL, DEL_TAIL_STRING, TAIL_NODE_CAT
- Out of Vocabulary: FLAG_OOV, OOV, OOV_NUMHYPN, OOV_CAPITAL, OOV_LY, OOV_Y, OOV_ED, OOV_D, OOV_S, OOV_ION, OOV_ING, OOV_NUM, OOV_NNP
- Others: TRANS_POS, INV_TRANS_POS, SUP_WORD, PRINT_TOKEN, PRINT_NT
14Parameter setting by file
- Can set parameters by file (parameter file)
- ASCII file
- Each line contains a parameter and its value separated by at least one space.
- Default parameter file is app.prm (used when no other parameter file is specified).
- Specify parameter files using the -p option.
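The file layout described above (one parameter per line, name and value separated by whitespace) is simple enough to read with a few lines of Python. This is a minimal sketch for inspecting a parameter file outside of APP; the function name `parse_param_text` is hypothetical, and comment or quoting rules are not handled because the slides do not describe any.

```python
def parse_param_text(text):
    """Map each parameter name to its value (kept as a string)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


sample = """\
DICTIONARY_FILE /local2/users/wilson15/APP5.9/data/WS-A-001.dic
GRAMMAR_FILE    /local2/users/wilson15/APP5.9/data/WS-A-001.grm
NICKNAME_FILE   /local2/users/wilson15/APP5.9/data/WS-A-001.nic
"""
params = parse_param_text(sample)
for name, value in params.items():
    print(name, "->", value)
```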
15Change parameter value
- A few parameters can also be changed during a session
- Type param and specify parameter name and value on the next line
- PRINT_STYLE
- TOO_LONG
- ATTACH
- NO_LETTERCASE
- DOUBLE_QUOTE
- OOV
16Grammar file
- <GRAMMAR FILE> ::= <GRAMMAR RULE>*
- <GRAMMAR RULE> ::= <LHS NODE> <RHS NODES>
- <SCORE INFORMATION>
- <STRUCTURE INFORMATION>
- <RHS NODES> ::= <RHS NODE>*
- <LHS NODE> ::= integer
- <RHS NODE> ::= integer
- <SCORE INFORMATION> ::= score <SCORE>
- <SCORE> ::= integer
- <STRUCTURE INFORMATION> ::= string to be printed out for the node
- <num> corresponds to the string of the num-th node in RHS
- ---------- Example ----------
- 1 53 54 2 103 205 1 57 55
- score 113
- struct (S <1><2><3> (VP <4> (SBAR <5> <6>)) <7> <8>)
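To make the role of the structure string concrete, the sketch below reads a rule line and substitutes each `<num>` placeholder with the string of the num-th right-hand-side node. It assumes the rule line is a whitespace-separated list of integers whose first entry is the left-hand-side node, as the example above suggests; the function names and the toy child strings are hypothetical.

```python
import re


def parse_rule_line(line):
    """Split a rule line into (lhs, rhs), assuming the first integer is the LHS."""
    lhs, *rhs = (int(token) for token in line.split())
    return lhs, rhs


def expand_struct(template, children):
    """Replace each <num> with the string of the num-th RHS node (1-based)."""
    def substitute(match_obj):
        return children[int(match_obj.group(1)) - 1]
    return re.sub(r"<(\d+)>", substitute, template)


# The rule from the example: one LHS node and eight RHS nodes.
lhs, rhs = parse_rule_line("1 53 54 2 103 205 1 57 55")
template = "(S <1><2><3> (VP <4> (SBAR <5> <6>)) <7> <8>)"
placeholders = [int(n) for n in re.findall(r"<(\d+)>", template)]
print(lhs, len(rhs), max(placeholders))

# A smaller, made-up template to show the substitution itself.
print(expand_struct("(S <1> (VP <2> <3>))",
                    ["(NP He)", "(VBD saw)", "(NP it)"]))
```

Because the structure string carries the bracketing, a single flat rule can print as a multi-level tree.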
17Dictionary file
- <DICTIONARY FILE> ::= <WORD INFORMATION>*
- <WORD INFORMATION> ::= <STRING> <POS INFORMATION>*
- <STRING> ::= string
- <POS INFORMATION> ::= <POS>:<SCORE>
- <POS> ::= integer
- <SCORE> ::= integer
- ---------- Example ----------
- base 65:15, 70:119, 85:2, 89:2
- Base-price 70:1
- Base-rate 65:1
- Baseball 70:54
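A dictionary entry can be read into a word plus a POS-to-score mapping. The sketch below assumes each entry is a word followed by comma-separated pairs in which a colon joins the POS integer and its score; that separator is an assumption about the on-disk layout, and the function name is hypothetical.

```python
def parse_dict_line(line):
    """Return (word, {pos: score}) for one dictionary entry.

    Assumes the form: word POS:SCORE, POS:SCORE, ...
    """
    word, _, rest = line.strip().partition(" ")
    entries = {}
    for item in rest.split(","):
        item = item.strip()
        if not item:
            continue
        pos, score = item.split(":")
        entries[int(pos)] = int(score)
    return word, entries


word, entries = parse_dict_line("base 65:15, 70:119, 85:2, 89:2")
print(word, entries)
```

The integer POS codes can then be turned into readable labels through the nickname file.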
18Nickname file
- <NICKNAME FILE> ::= <NICKNAME INFORMATION>*
- <NICKNAME INFORMATION> ::= <POS> <NICKNAME>
- <POS> ::= integer
- <NICKNAME> ::= string
- ---------- Example ----------
- 1 S
- 2 NP
- 5 S1
- 6 NP0
- 51
- 52
- 53
- 54 -LRB-
- 55 -RRB-
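Since the parser works with integers internally, a lookup table built from the nickname file is what turns node numbers back into labels. The sketch below assumes one entry per line, an integer followed by its nickname; the helper names are hypothetical, and an entry with no nickname is stored as an empty string rather than guessed.

```python
def parse_nicknames(text):
    """Build {integer: nickname} from nickname-file text."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        table[int(parts[0])] = parts[1] if len(parts) > 1 else ""
    return table


def label_of(table, code):
    """Return the nickname for a node, falling back to the number itself."""
    return table.get(code) or str(code)


sample = "1 S\n2 NP\n5 S1\n6 NP0\n54 -LRB-\n55 -RRB-\n"
table = parse_nicknames(sample)
print(label_of(table, 2), label_of(table, 54), label_of(table, 999))
```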
19List of APP functions
- APP_init_param(): Initialize parameter variables
- APP_read_param_file(): Read parameter
- APP_init_global(): Initialize global variables
- APP_parse(): Parsing
- APP_set_param(): Set parameter
- APP_current_param(): Show current parameter values
- APP_debug_routine(): Get into debug routine
20Interface
- Initialization
- Parameter setting
- Must be done first
- APP_init_param() function sets default parameters
- Set your own parameter values using APP_set_param()
- Initialize internal data
- Load dictionary, grammar, and nickname information and store them
- APP_init_global()
21Interface
- Parsing
- Initiated by calling APP_parse()
- Two arguments
- Sentence
- Address of return value structure
- Returns 1 if parsing is completed, 0 otherwise
22Interface
- Parameter setting and look-up
- Can set limited kinds of parameters during the session using APP_set_param()
- Look up parameters using APP_current_param()
23Interface
- Debug routine
- Use to view
- internal structure
- Information about previous input sentences.
- Dictionary entries
- Part of grammar rules
- Sentence and parse tree data on the heap
- Type resume to quit the debug routine and resume the parsing session
24Grammar
- WS-A-001.grm is a grammar extracted from the Penn Tree Bank.
- Consists of all occurrences of grammar rules for the two nonterminals (S and NP)
- WS-A-002.grm is also extracted from the Penn Tree Bank
- Contains all the rules whose frequencies are more than one
- Smaller grammar than WS-A-001.grm
- Tradeoffs
- WS-A-001.grm gives greater accuracy with slower parse time
- WS-A-002.grm gives lower accuracy with faster parse time
25Dictionary
- WS-A-001.dic
- Extracted from Penn Tree Bank
- Supplemented by part-of-speech information from the COMLEX Syntax dictionary
26Accuracy, time, size
- Statistics based on test runs using WS-A-001.grm and WS-A-001.dic
- Tests performed on a SPARCstation 5 with 160 MB of memory and ANODE_TOP set to 5,000,000
- Number of sentences: 1989
- Average length of sentences: 23.28
- Number of parsed sentences: 1788
- Number of fitted parse: 203
- Average parsing time for all: 18
- Average parsing time excluding fitted parse: 9.76
- Precision: 71.04
- Recall: 70.33
- Average crossing: 3.03
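The slides do not say how precision, recall, and crossing were computed. A common convention for parser evaluation is PARSEVAL-style labeled bracket scoring, sketched below on the assumption that this is roughly what is meant; the bracket tuples and the toy trees are hypothetical.

```python
def bracket_scores(gold, predicted):
    """Labeled bracket precision and recall over (label, start, end) tuples."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


def crossing_count(gold, predicted):
    """Count predicted spans that partially overlap some gold span."""
    def crosses(a, b):
        (s1, e1), (s2, e2) = a, b
        return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

    gold_spans = {(s, e) for _label, s, e in gold}
    pred_spans = {(s, e) for _label, s, e in predicted}
    return sum(1 for p in pred_spans
               if any(crosses(p, g) for g in gold_spans))


# Toy 5-word sentence: the parser attaches the third word to the wrong side.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5), ("NP", 3, 5)]
print(bracket_scores(gold, predicted))
print(crossing_count(gold, predicted))
```

Averaging the per-sentence crossing count over a test set would give a figure comparable in kind to the average crossing reported above.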