Title: Using JavaCC
1Using JavaCC
2Automating Lexical Analysis: Overall picture
[Figure: overall picture of automated lexical analysis; surviving label: Tokens]
3Building Faster Scanners from the DFA
- Table-driven recognizers waste a lot of effort
- Read (and classify) the next character
- Find the next state
- Assign to the state variable
- Branch back to the top
- We can do better
- Encode state and actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code (that is OK; the code is generated automatically)
- Takes (many) fewer operations per input character
    state ← s0
    string ← ε
    char ← get_next_char()
    while (char ≠ eof)
        state ← δ(state, char)
        string ← string + char
        char ← get_next_char()
    if (state in Final)
        then report acceptance
        else report failure
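The contrast between a table-driven recognizer and a direct-coded one can be sketched in Java. This is a minimal illustration, not code that any generator actually emits: the language recognized (one or more decimal digits), the class name NumeralRecognizers, and the state numbering are all assumptions made for the example.

```java
// Two recognizers for the assumed language [0-9]+ (one or more digits).
final class NumeralRecognizers {
    // DFA states: 0 = start, 1 = digits seen (the only final state), 2 = error.
    // Columns: 0 = digit, 1 = any other character.
    private static final int[][] DELTA = {
        {1, 2},
        {1, 2},
        {2, 2},
    };

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Table-driven: every character costs a classification, a table lookup,
    // an assignment to the state variable, and a branch back to the loop top.
    static boolean tableDriven(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int charClass = isDigit(input.charAt(i)) ? 0 : 1;
            state = DELTA[state][charClass];
        }
        return state == 1;
    }

    // Direct-coded: each state becomes its own stretch of code, and the
    // transition test is done locally instead of through a table.
    static boolean directCoded(String input) {
        int i = 0;
        // State 0: the first character must be a digit.
        if (i == input.length() || !isDigit(input.charAt(i))) {
            return false;
        }
        i++;
        // State 1: stay here while digits keep coming.
        while (i < input.length()) {
            if (!isDigit(input.charAt(i))) {
                return false; // error state: acceptance is no longer possible
            }
            i++;
        }
        return true; // input ended in the final state
    }

    public static void main(String[] args) {
        System.out.println(tableDriven("2006") + " " + directCoded("2006"));
        System.out.println(tableDriven("20x6") + " " + directCoded("20x6"));
    }
}
```

Both methods accept exactly the same strings. The direct-coded version also stops at the first bad character, whereas the table-driven loop keeps looking up the error state until the input ends.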
4Inside lexical analyzer generator
- How does a lexical analyzer generator work?
- Get input from the user, who defines tokens in a form that is equivalent to a regular grammar
- Turn the regular grammar into an NFA
- Convert the NFA into a DFA
- Generate the code that simulates the DFA
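The NFA-to-DFA step is usually done with the subset construction, where each DFA state is a set of NFA states. The sketch below is a minimal Java version under stated assumptions: the map-based NFA encoding, the EPSILON marker character, and the example NFA (strings over {a, b} that end in "ab") are all invented for illustration and say nothing about how JavaCC represents automata internally.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPSILON = '\0'; // assumed marker for epsilon moves

    // Records the NFA edge from --symbol--> to.
    static void addEdge(Map<Integer, Map<Character, Set<Integer>>> nfa, int from, char symbol, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(symbol, k -> new TreeSet<>())
           .add(to);
    }

    private static Set<Integer> targets(Map<Integer, Map<Character, Set<Integer>>> nfa, int state, char symbol) {
        return nfa.getOrDefault(state, Collections.emptyMap())
                  .getOrDefault(symbol, Collections.emptySet());
    }

    // All NFA states reachable from the given states through epsilon moves only.
    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int state = work.pop();
            for (int next : targets(nfa, state, EPSILON)) {
                if (result.add(next)) {
                    work.push(next);
                }
            }
        }
        return result;
    }

    // All NFA states reachable from the given states on one input symbol.
    static Set<Integer> move(Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Integer> states, char symbol) {
        Set<Integer> result = new TreeSet<>();
        for (int state : states) {
            result.addAll(targets(nfa, state, symbol));
        }
        return result;
    }

    // Builds the DFA transition map; each DFA state is a set of NFA states.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startSet = closure(nfa, Collections.singleton(start));
        dfa.put(startSet, new HashMap<>());
        work.push(startSet);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            for (char symbol : alphabet.toCharArray()) {
                Set<Integer> next = closure(nfa, move(nfa, current, symbol));
                if (next.isEmpty()) {
                    continue; // no transition on this symbol
                }
                dfa.get(current).put(symbol, next);
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
            }
        }
        return dfa;
    }

    // Assumed example NFA: strings over {a, b} ending in "ab"; state 2 is final.
    static Map<Integer, Map<Character, Set<Integer>>> exampleNfa() {
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        addEdge(nfa, 0, 'a', 0);
        addEdge(nfa, 0, 'b', 0);
        addEdge(nfa, 0, 'a', 1);
        addEdge(nfa, 1, 'b', 2);
        return nfa;
    }
}
```

For the example NFA, toDfa(exampleNfa(), 0, "ab") yields three DFA states: {0}, {0, 1}, and {0, 2}. A DFA state counts as final when it contains an NFA final state, so here only {0, 2} accepts.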
5Flow for Using JavaCC
Extracted from http://www.cs.unb.ca/profs/nickerson/courses/cs4905/Labs/L1_2006.pdf
6Structure of a JavaCC File
- A JavaCC file is composed of three portions:
- Options
- Class declaration
- Specification for lexical analysis (tokens), and specification for syntax analysis.
- For the very first example of JavaCC, let's recognize two tokens: '+' and numerals.
- Use an editor to write the file and save it under the name numeral.jj
7Using javaCC for lexical analysis
- JavaCC is a top-down parser generator.
- Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator.
- With JavaCC, you can specify the tokens within the parser generator.
8Example File
    /* main class definition */
    PARSER_BEGIN(Numeral)
    public class Numeral {
        public static void main(String[] args) throws ParseException, TokenMgrError {
            Numeral numeral = new Numeral(System.in);
            // Pull tokens until end of input; the loop body is intentionally empty.
            while (numeral.getNextToken().kind != EOF) { }
        }
    }
    PARSER_END(Numeral)

    /* token definitions */
    TOKEN : {
        <ADD : "+">
      | <NUMERAL : (["0"-"9"])+>
    }
9Options
- The options portion is optional and is omitted in the previous example.
- STATIC is a boolean option whose default value is true. If true, all methods and class variables are specified as static in the generated parser and token manager.
- This allows only one parser object to be present, but it improves the performance of the parser.
- To perform multiple parses during one run of your Java program, you will have to call the ReInit() method to reinitialize your parser if it is static.
- If the parser is non-static, you may use the "new" operator to construct as many parsers as you wish. These can all be used simultaneously from different threads.
10Start
11Compilation
12javaCC specification of a lexer
Note the need for ( )!
Defining Whitespace
13A Full Example
14Dealing with errors
- Error reporting: 123e+q
- Could consider it an invalid token (lexical error), or
- return a sequence of valid tokens:
- 123, e, +, q
- and let the parser deal with the error.
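The second option, returning a sequence of valid tokens, falls out naturally from a longest-match scanner. The following Java sketch is a minimal illustration under assumptions: the token classes (digit runs, letter runs, and a single '+'), the sample input 123e+q, and the class name MunchScanner are chosen for the example and are not taken from any JavaCC-generated code.

```java
import java.util.ArrayList;
import java.util.List;

final class MunchScanner {
    // Splits the input into the longest possible digit runs, letter runs,
    // and single '+' characters; whitespace is skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
            } else if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
            } else if (c == '+') {
                i++;
            } else {
                // Only characters outside every token class are lexical errors here.
                throw new IllegalArgumentException("lexical error at position " + i);
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("123e+q"));
    }
}
```

With this design the scanner never rejects 123e+q; it hands the parser a numeral followed by an identifier, and the parser is the one that notices the two cannot be adjacent.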
15Lexical error correction?
- Sometimes interaction between the scanner and the parser can help,
- especially in a top-down (predictive) parse.
- The parser, when it calls the scanner, can pass as an argument the set of allowable tokens.
- Suppose the scanner sees "calss" in a context where only a top-level definition is allowed; it can then reasonably treat the lexeme as a misspelling of the keyword class.
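One way to picture this cooperation is a repair routine that receives the set of keywords the parser would accept and looks for a unique near match. The Java sketch below is only an illustration: the class KeywordRepair, the choice of "one swap of adjacent characters" as the similarity test, and the sample keyword set are assumptions, not a technique prescribed by the slides or implemented by JavaCC.

```java
import java.util.Arrays;
import java.util.Collection;

final class KeywordRepair {
    // Returns the lexeme itself if it is allowed, the single allowed keyword
    // that differs from it by one adjacent-character swap, or null otherwise.
    static String repair(String lexeme, Collection<String> allowed) {
        if (allowed.contains(lexeme)) {
            return lexeme;
        }
        String match = null;
        for (String keyword : allowed) {
            if (isAdjacentSwap(lexeme, keyword)) {
                if (match != null) {
                    return null; // ambiguous: refuse to guess
                }
                match = keyword;
            }
        }
        return match;
    }

    // True when b can be obtained from a by swapping exactly one pair of
    // neighbouring characters.
    static boolean isAdjacentSwap(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int i = 0;
        while (i < a.length() && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        if (i + 1 >= a.length()) {
            return false; // identical strings, or only the last character differs
        }
        if (a.charAt(i) != b.charAt(i + 1) || a.charAt(i + 1) != b.charAt(i)) {
            return false;
        }
        return a.substring(i + 2).equals(b.substring(i + 2));
    }

    public static void main(String[] args) {
        // Hypothetical set of tokens allowed at the top level.
        System.out.println(repair("calss", Arrays.asList("class", "interface", "enum")));
    }
}
```

Restricting candidates to the parser's allowable set is what makes the guess safe: the same lexeme in an expression context, where identifiers are allowed, would be left alone.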
16Same symbol, different meaning.
- How can the scanner distinguish between binary minus and unary minus?
- x = -a vs. x = 3 - a
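One common heuristic, sketched below in Java, is to let the scanner remember the previous token: a minus that follows an operand is binary, and any other minus is unary. This is an assumption-laden illustration, not the answer given in the slides; the class MinusClassifier, the token kind names, and the tiny token set are all hypothetical. Many scanners instead emit a single MINUS token and leave the distinction to the grammar.

```java
import java.util.ArrayList;
import java.util.List;

final class MinusClassifier {
    // Returns the kinds of the tokens in the input, splitting '-' into
    // UNARY_MINUS and BINARY_MINUS based on the token that precedes it.
    static List<String> kinds(String input) {
        List<String> kinds = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                kinds.add("ID");
            } else if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                kinds.add("NUM");
            } else if (c == '=') {
                i++;
                kinds.add("ASSIGN");
            } else if (c == '-') {
                i++;
                kinds.add(previousIsOperand(kinds) ? "BINARY_MINUS" : "UNARY_MINUS");
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return kinds;
    }

    // A minus can only be binary if something that ends an operand precedes it.
    private static boolean previousIsOperand(List<String> kinds) {
        if (kinds.isEmpty()) {
            return false;
        }
        String last = kinds.get(kinds.size() - 1);
        return last.equals("ID") || last.equals("NUM");
    }

    public static void main(String[] args) {
        System.out.println(kinds("x = -a"));
        System.out.println(kinds("x = 3 - a"));
    }
}
```

A fuller version would also treat a closing parenthesis as the end of an operand; it is left out here to keep the sketch short.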
17Scanner troublemakers
- Unclosed strings
- Unclosed comments.
18JavaCC as a Parsing Tool
19Javacc Overview
- Generates a top-down parser.
- Could be used to generate a parser for Prolog, whose grammar is LL.
- Generates a parser in Java.
- Hence it can be integrated with any Java-based Prolog compiler/interpreter, to continue our example.
- Token specification and grammar specification structures are in the same file => easier to debug.
20Types of Productions in Javacc
- There can be four different kinds of productions.
- Javacode
- For something that is not context-free or is difficult to write a grammar for.
- e.g., recognizing matching braces and error processing.
- Regular Expressions
- Used to describe the tokens (terminals) of the grammar.
- BNF
- Standard way of specifying the productions of the grammar.
- Token Manager Declarations
- The declarations and statements are written into the generated Token Manager (lexer) and are accessible from within lexical actions.
21Javacc Look-ahead mechanism
- Exploration of tokens further ahead in the input stream.
- Backtracking is unacceptable due to the performance hit.
- By default, JavaCC uses 1-token look-ahead. Any number of tokens can be specified for look-ahead.
- Two types of look-ahead mechanisms
- Syntactic
- The parser looks ahead for particular tokens in the input stream.
- Semantic
- Any arbitrary Boolean expression can be specified as a look-ahead parameter.
- e.g., A -> aBc and B -> b ( c )? Valid strings: abc and abcc
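Why one token of look-ahead is not enough for this grammar can be seen in a hand-written recursive-descent sketch. The Java code below is not what JavaCC generates; the class LookaheadDemo, the treatment of each character as one token, and the '$' end marker are assumptions made to keep the example small. With one token of look-ahead, B takes the optional c whenever it sees a c, which leaves nothing for the final c of A on input abc. With two tokens, B takes the optional c only when another c follows it.

```java
final class LookaheadDemo {
    private final String input; // each character is treated as one token
    private final int lookahead; // 1 or 2 tokens
    private int pos = 0;

    private LookaheadDemo(String input, int lookahead) {
        this.input = input;
        this.lookahead = lookahead;
    }

    // Returns true if the whole input is derived from A with the given look-ahead.
    static boolean parse(String input, int lookahead) {
        LookaheadDemo parser = new LookaheadDemo(input, lookahead);
        return parser.a() && parser.pos == input.length();
    }

    // The k-th upcoming token (k = 1 is the next one); '$' marks end of input.
    private char peek(int k) {
        int index = pos + k - 1;
        return index < input.length() ? input.charAt(index) : '$';
    }

    private boolean expect(char token) {
        if (peek(1) == token) {
            pos++;
            return true;
        }
        return false;
    }

    // A -> "a" B "c"
    private boolean a() {
        return expect('a') && b() && expect('c');
    }

    // B -> "b" ( "c" )?
    private boolean b() {
        if (!expect('b')) {
            return false;
        }
        boolean takeOptional = (lookahead == 1)
                ? peek(1) == 'c'                    // greedy: any c is taken
                : peek(1) == 'c' && peek(2) == 'c'; // only if a c remains for A
        if (takeOptional) {
            expect('c');
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(parse("abc", 1) + " " + parse("abc", 2));
        System.out.println(parse("abcc", 1) + " " + parse("abcc", 2));
    }
}
```

The decision is made once, without backtracking, which is the trade-off the look-ahead mechanism is designed around: a little extra peeking in exchange for never undoing consumed input.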