JavaCC - PowerPoint PPT Presentation

About This Presentation
Title:

JavaCC

Description:

JavaCC CMSC 431 Spring 04 What is a parser generator JavaCC JavaCC (Java Compiler Compiler) is a scanner and parser generator its unusual in this regard; Produce ... – PowerPoint PPT presentation

Slides: 58
Provided by: Jiang46

Transcript and Presenter's Notes

Title: JavaCC


1
JavaCC
  • CMSC 431
  • Spring 04

2
What is a parser generator
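[Figure: the scanner turns the character stream "T o t a l p r i c e t a x" into the tokens Total, price and tax; the parser turns the tokens into a parse tree (an assignment whose target is Total and whose Expr combines the ids price and tax). The parser generator (JavaCC) produces the scanner and the parser from a lexical + grammar specification.]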
3
JavaCC
  • JavaCC (Java Compiler Compiler) is both a scanner generator and
    a parser generator; it is unusual in this regard.
  • It produces a scanner and/or a parser written in
    Java, and is itself written in Java.
  • There are many parser generators.
  • yacc (Yet Another Compiler-Compiler) for the C
    programming language (see Dragon book, chapter
    4.9)
  • Bison from gnu.org
  • There are also many parser generators written in
    Java
  • JavaCUP (we'll look at this one later)
  • ANTLR
  • SableCC

4
More on classification of Java parser generators
  • Bottom-up parser generator tools
  • JavaCUP
  • jay, YACC for Java www.inf.uos.de/bernd/jay
  • SableCC, The Sable Compiler Compiler
    www.sablecc.org
  • Top-down parser generator tools
  • ANTLR, Another Tool for Language Recognition
    www.antlr.org
  • JavaCC, Java Compiler Compiler www.webgain.com/java_cc

5
Features of JavaCC
  • Top-down LL(k) parser generator
  • Lexical and grammar specifications in one file
  • Tree-building preprocessor
  • with JJTree
  • Extremely customizable
  • many different options selectable
  • Document generation
  • by using JJDoc
  • Internationalized
  • can handle full Unicode
  • Syntactic and semantic lookahead

6
Features of JavaCC (contd)
  • Permits extended BNF specifications
  • can use |, *, +, ? and () on the right-hand side.
  • Lexical states and lexical actions
  • Case-insensitive lexical analysis
  • Extensive debugging capability
  • Special tokens
  • Very good error reporting

7
JavaCC Installation
  • Download the file javacc-3.X.zip from
    https://javacc.dev.java.net/
  • Follow the link that says Download, or go directly
    to https://javacc.dev.java.net/servlets/ProjectDocumentList
  • Unzip javacc-3.X.zip to a directory JCC_HOME
  • Add the JCC_HOME\bin directory to your path.
  • javacc, jjtree and jjdoc may now be invoked
    directly from the command line.

8
Steps to use JavaCC
  • Write a JavaCC specification (.jj file)
  • It defines the grammar and actions in one file (say,
    calc.jj)
  • Run JavaCC to generate a scanner and a parser
  • javacc calc.jj
  • This generates the parser, scanner and token Java
    sources
  • Write your program that uses the parser
  • For example, UseParser.java
  • Compile and run your program
  • javac -classpath . *.java
  • java -cp . mainpackage.MainClass

9
Example 1
Parse a spec of regular expressions and match it
with input strings
  • Grammar: re.jj
  • Example input:
  • (a comment line) all strings ending in "ab"
  • (a|b)*ab;
  • aba;
  • ababb;
  • Our task:
  • For each input string (lines 3 and 4), determine
    whether it matches the regular expression (line
    2).
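
The expected answers for this task can be cross-checked with Java's built-in java.util.regex before any parser is built. This is a minimal sketch, assuming the regular expression on line 2 is (a|b)*ab; the class name RegexCheck and the extra input "abab" are illustrative and do not come from the slides.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // Returns true only when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String regex = "(a|b)*ab";
        // "abab" is an extra, hypothetical test string that does end in "ab".
        for (String input : new String[] {"aba", "ababb", "abab"}) {
            System.out.println(input + " : " + matches(regex, input));
        }
    }
}
```

Under that assumption neither slide input ends in "ab", so both should be rejected by the automaton that re.jj builds.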

10
the overall picture
[Figure: the input file, consisting of a comment line, the regular expression (a|b)*ab, and the test strings.]
11
Format of a JavaCC input Grammar
  • javacc_options
  • "PARSER_BEGIN" "(" <IDENTIFIER>1 ")"
  • java_compilation_unit
  • "PARSER_END" "(" <IDENTIFIER>2 ")"
  • ( production )*

12
the input spec file (re.jj)
  • options {
  • USER_TOKEN_MANAGER = false;
  • BUILD_TOKEN_MANAGER = true;
  • OUTPUT_DIRECTORY = "./reparser";
  • STATIC = false;
  • }

13
re.jj
  • PARSER_BEGIN(REParser)
  • package reparser;
  • import java.lang.*;
  • import dfa.*;
  • public class REParser {
  • public FA tg = new FA();
  • // output error message with current line
    number
  • public static void msg(String s) {
  • System.out.println("ERROR" + s); }
  • public static void main(String[] args) throws
    Exception {
  • REParser reparser = new REParser(System.in);
  • reparser.S();
  • }
  • }
  • PARSER_END(REParser)

14
re.jj (Token definition)
  • TOKEN : {
  • < SYMBOL : ["0"-"9","a"-"z","A"-"Z"] >
  • | < EPSILON : "epsilon" >
  • | < LPAREN : "(" >
  • | < RPAREN : ")" >
  • | < OR : "|" >
  • | < STAR : "*" >
  • | < SEMI : ";" >
  • }
  • SKIP : {
  • < ( [" ","\t","\n","\r","\f"] )+ >
  • | < "" ( ~["\n"] )* "\n" >
    { System.out.println(image); }
  • }

15
re.jj (productions)
  • void S() : { FA d1; }
  • { d1 = R() <SEMI>
  • { tg = d1; System.out.println("------NFA"); tg.print();
  • System.out.println("------DFA");
  • tg = tg.NFAtoDFA(); tg.print();
  • System.out.println("------Minimize");
  • tg = tg.minimize(); tg.print();
  • System.out.println("------Renumber");
  • tg = tg.renumber(); tg.print();
  • System.out.println("------Execute"); }
  • testCases()
  • }

16
re.jj
  • void testCases() : {}
  • { ( testCase() )+ }
  • void testCase() : { String testInput; }
  • { testInput = symbols()
  • <SEMI>
  • { tg.execute( testInput ); } }
  • String symbols() :
  • { Token token = null; StringBuffer result = new
    StringBuffer(); }
  • { (
  • token = <SYMBOL>
  • { result.append( token.image ); }
  • )*
  • { return result.toString(); } }

17
re.jj (regular expression)
  • // R --> RUnit | RConcat | RChoice
  • FA R() : { FA result; }
  • { result = RChoice() { return
    result; } }
  • FA RUnit() :
  • { FA result; Token d1; }
  • { (
  • <LPAREN> result = RChoice() <RPAREN>
  • | <EPSILON> { result = tg.epsilon(); }
  • | d1 = <SYMBOL> { result = tg.symbol( d1.image
    ); }
  • )
  • { return result; } }

18
re.jj
  • FA RChoice() : { FA result, temp; }
  • { result = RConcat()
  • ( <OR> temp = RConcat() { result =
    result.choice( temp ); } )*
  • { return result; } }
  • FA RConcat() : { FA result, temp; }
  • { result = RStar()
  • ( temp = RStar() { result =
    result.concat( temp ); } )*
  • { return result; } }
  • FA RStar() : { FA result; }
  • { result = RUnit()
  • ( <STAR> { result = result.closure(); }
    )*
  • { return result; } }
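
The productions above map almost one-to-one onto a hand-written recursive-descent parser, which is roughly the shape of the code a top-down generator emits. The sketch below is an illustration in plain Java rather than JavaCC output: the class MiniREParser is hypothetical, it works on characters instead of tokens, it leaves out epsilon and whitespace handling, and it returns a nested string (choice, concat, closure) in place of the FA objects from the dfa package, whose API is not shown in the slides.

```java
class MiniREParser {
    private final String src;
    private int pos = 0;

    MiniREParser(String src) { this.src = src; }

    // Entry point: one RChoice that must consume the whole input.
    static String parse(String src) {
        MiniREParser p = new MiniREParser(src);
        String tree = p.rChoice();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    // One character of lookahead; '\0' stands for end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // SYMBOL: an ASCII letter or digit, as in the token definition.
    private static boolean isSymbol(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // RChoice -> RConcat ( "|" RConcat )*
    private String rChoice() {
        String result = rConcat();
        while (peek() == '|') {
            pos++;
            result = "choice(" + result + "," + rConcat() + ")";
        }
        return result;
    }

    // RConcat -> RStar ( RStar )* ; loop while the next char can start an RUnit
    private String rConcat() {
        String result = rStar();
        while (peek() == '(' || isSymbol(peek())) {
            result = "concat(" + result + "," + rStar() + ")";
        }
        return result;
    }

    // RStar -> RUnit ( "*" )*
    private String rStar() {
        String result = rUnit();
        while (peek() == '*') {
            pos++;
            result = "closure(" + result + ")";
        }
        return result;
    }

    // RUnit -> "(" RChoice ")" | SYMBOL
    private String rUnit() {
        char c = peek();
        if (c == '(') {
            pos++;
            String inner = rChoice();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return inner;
        }
        if (isSymbol(c)) { pos++; return String.valueOf(c); }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }
}
```

Each method decides what to do from a single character of lookahead, which is the LL(1) behaviour that JavaCC uses by default.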

19
Format of a JavaCC input Grammar
  • javacc_input ::= javacc_options
  • "PARSER_BEGIN" "(" <IDENTIFIER>1 ")"
  • java_compilation_unit
  • "PARSER_END" "(" <IDENTIFIER>2 ")"
  • ( production )*
  • <EOF>
  • color usage:
  • blue --- nonterminal
  • <orange> --- a token type
  • purple --- token lexeme (reserved word,
    i.e., consisting of the literal itself)
  • black --- meta symbols

20
Notes
  • <IDENTIFIER> means any Java identifier, like var,
    class2, ...
  • IDENTIFIER means IDENTIFIER only.
  • <IDENTIFIER>1 must equal <IDENTIFIER>2.
  • java_compilation_unit is any Java code that as a
    whole can appear legally in a file.
  • It must contain a main class declaration with the
    same name as <IDENTIFIER>1.
  • Ex:
  • PARSER_BEGIN ( MyParser )
  • package mypackage;
  • import myotherpackage.*;
  • public class MyParser { ... }
  • class MyOtherUsefulClass { ... }
  • PARSER_END ( MyParser )

21
The input and output of javacc
Input (MyLangSpec.jj):
  • PARSER_BEGIN ( MyParser )
  • package mypackage;
  • import myotherpackage.*;
  • public class MyParser { ... }
  • class MyOtherUsefulClass { ... }
  • PARSER_END ( MyParser )

Output generated by javacc:
  • Token.java
  • ParseException.java
  • MyParser.java
  • MyParserTokenManager.java
  • MyParserConstants.java
22
Notes
  • Token.java and ParseException.java are the same for
    all inputs and can be reused.
  • The package declaration in the .jj file is copied to all 3
    outputs.
  • The import declarations in the .jj file are copied to the
    parser and token manager files.
  • The parser file is assigned the file name
    <IDENTIFIER>1.java
  • The parser file has the contents
  • class MyParser {
  • // generated parser is inserted here.
  • }
  • The generated token manager provides one public
    method
  • Token getNextToken() (lexical errors are reported by throwing TokenMgrError)

23
Lexical Specification with JavaCC
24
javacc options
  • javacc_options ::=
  • [ "options" "{" ( option_binding )* "}" ]
  • option_binding is of the form
  • <IDENTIFIER>3 = <java_literal> ;
  • where <IDENTIFIER>3 is not case-sensitive.
  • Ex:
  • options {
  • USER_TOKEN_MANAGER = true;
  • BUILD_TOKEN_MANAGER = false;
  • OUTPUT_DIRECTORY = "./sax2jcc/personnel";
  • STATIC = false;
  • }

25
More Options
  • LOOKAHEAD
  • java_integer_literal (1)
  • CHOICE_AMBIGUITY_CHECK
  • java_integer_literal (2) for A | B | ... | C
  • OTHER_AMBIGUITY_CHECK
  • java_integer_literal (1) for (A)*, (A)+ and
    (A)?
  • STATIC (true)
  • DEBUG_PARSER (false)
  • DEBUG_LOOKAHEAD (false)
  • DEBUG_TOKEN_MANAGER (false)
  • OPTIMIZE_TOKEN_MANAGER
  • java_boolean_literal (false)
  • OUTPUT_DIRECTORY (current directory)
  • ERROR_REPORTING (true)

26
More Options
  • JAVA_UNICODE_ESCAPE (false)
  • replace \u2245 with the actual Unicode character (6 chars -> 1
    char)
  • UNICODE_INPUT (false)
  • the input stream is in Unicode form
  • IGNORE_CASE (false)
  • USER_TOKEN_MANAGER (false)
  • generate a TokenManager interface for the user's own
    scanner
  • USER_CHAR_STREAM (false)
  • generate a CharStream.java interface for the user's
    own input stream
  • BUILD_PARSER (true)
  • java_boolean_literal
  • BUILD_TOKEN_MANAGER (true)
  • SANITY_CHECK (true)
  • FORCE_LA_CHECK (false)
  • COMMON_TOKEN_ACTION (false)
  • invoke void CommonTokenAction(Token t) after
    every getNextToken()
  • CACHE_TOKENS (false)
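
The effect of JAVA_UNICODE_ESCAPE (six input characters become one) can be illustrated with a small preprocessing step. This is a simplified sketch and not JavaCC's own character stream code: the class UnicodeEscapes is hypothetical, and it ignores the finer points of the Java rules, such as repeated 'u' characters and doubled backslashes.

```java
class UnicodeEscapes {
    // Replaces each backslash-u escape followed by four hex digits
    // (6 characters in total) with the single char it denotes.
    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            boolean escape = s.charAt(i) == '\\'
                    && i + 5 < s.length()
                    && s.charAt(i + 1) == 'u'
                    && s.substring(i + 2, i + 6).matches("[0-9a-fA-F]{4}");
            if (escape) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;   // skip the whole escape sequence
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

With this step in front of the scanner, token definitions can be written against the decoded characters instead of the escape sequences.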

27
Example Figure 2.2
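  • if  ->  IF
  • [a-z][a-z0-9]*  ->  ID
  • [0-9]+  ->  NUM
  • ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  ->  REAL
  • ("--"[a-z]*"\n")|(" "|"\n"|"\t")+  ->
    no token, white space
  • .  ->  error
  • javacc notations:
  • "if" or "i" "f" or ["i"]["f"]
  • ["a"-"z"](["a"-"z","0"-"9"])*
  • (["0"-"9"])+
  • (["0"-"9"])+ "." (["0"-"9"])*
  • (["0"-"9"])* "." (["0"-"9"])+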

28
JavaCC Spec for Some Tokens
  • PARSER_BEGIN(MyParser) class MyParser {}
  • PARSER_END(MyParser)
  • /* For the regular expression on the right, the
    token on the left will be returned */
  • TOKEN : {
  • < IF : "if" >
  • | < #DIGIT : ["0"-"9"] >
  • | < ID : ["a"-"z"] ( ["a"-"z"] |
    <DIGIT> )* >
  • | < NUM : (<DIGIT>)+ >
  • | < REAL : ( (<DIGIT>)+ "." (<DIGIT>)* )
  • | ( (<DIGIT>)* "." (<DIGIT>)+ ) >
  • }

29
Continued
  • /* The regular expressions here will be skipped
    during lexical analysis */
  • SKIP : { <" "> | <"\t"> | <"\n"> }
  • /* like SKIP, but the skipped text is accessible from
    parser actions */
  • SPECIAL_TOKEN : {
  • < "--" (["a"-"z"])* ( "\n" | "\r" | "\n\r" ) > }
  • /* For any substring not matching the lexical spec,
    javacc will throw an error */
  • /* main rule */
  • void start() : {}
  • { ( <IF> | <ID> | <NUM> | <REAL> )* }
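
The token manager generated from such a spec picks the longest match at each position and, on a tie, the rule that is listed first. The sketch below imitates that policy with java.util.regex. It illustrates the matching rule only and does not reflect how JavaCC implements its scanner; the class MiniScanner is hypothetical, and the "--" comments are simply dropped here rather than kept as special tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniScanner {
    // Rules in priority order: on equal-length matches the earlier rule wins.
    static final String[] KINDS = {"IF", "ID", "NUM", "REAL", "SKIP", "SKIP"};
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"),
        Pattern.compile("[ \\t\\n]+"),
        Pattern.compile("--[a-z]*\\n")
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have the same length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            if (!KINDS[bestRule].equals("SKIP")) {
                tokens.add(KINDS[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }
}
```

For example, "if" is reported as IF because IF precedes ID in the rule list, while "ifx" is reported as ID because the longer match wins.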

30
Grammar Specification with JavaCC
31
The Form of a Production
  • java_return_type java_identifier "("
    java_parameter_list ")" ":"
  • java_block
  • "{" expansion_choices "}"
  • EX:
  • void XMLDocument(Logger logger) : { int msg = 0; }
  • { <StartDoc> { print(token); }
  • Element(logger)
  • <EndDoc> { print(token); }
  • else()
  • }

32
Example ( Grammar 3.30 )
  • 1. P -> L
  • 2. S -> id := id
  • 3. S -> while id do S
  • 4. S -> begin L end
  • 5. S -> if id then S
  • 6. S -> if id then S else S
  • 7. L -> S
  • 8. L -> L ; S
  • Rules 1, 7 and 8 combine into P -> S ( ; S )*

33
JavaCC Version of Grammar 3.30
  • PARSER_BEGIN(MyParser)
  • public class MyParser {}
  • PARSER_END(MyParser)
  • SKIP : { " " | "\t" | "\n" }
  • TOKEN : {
  • <WHILE : "while"> | <BEGIN : "begin"> |
    <END : "end">
  • | <DO : "do"> | <IF : "if"> |
    <THEN : "then">
  • | <ELSE : "else"> | <SEMI : ";"> |
    <ASSIGN : ":=">
  • | <#LETTER : ["a"-"z"]>
  • | <ID : <LETTER>(<LETTER> | ["0"-"9"])* >
  • }

34
JavaCC Version of Grammar 3.30 (contd)
  • void Prog() : {} { StmList() <EOF> }
  • void StmList() : {}
  • { Stm() ( ";" Stm() )* }
  • void Stm() : {}
  • { <ID> ":=" <ID>
  • | "while" <ID> "do" Stm()
  • | <BEGIN> StmList() <END>
  • | "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else"
    Stm() ] }
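
A hand-written recognizer for the same grammar shows what the optional else clause with LOOKAHEAD(1) amounts to: after "if id then Stm", an else is consumed whenever it is the next token, so it binds to the nearest if. The following is a minimal sketch in plain Java and not generated code. The class Grammar330 is hypothetical, tokens are assumed to be separated by whitespace, and the assignment operator is assumed to be ":=".

```java
import java.util.Arrays;
import java.util.List;

class Grammar330 {
    private static final List<String> KEYWORDS =
            Arrays.asList("while", "do", "begin", "end", "if", "then", "else");
    private final List<String> toks;
    private int pos = 0;

    Grammar330(String src) { toks = Arrays.asList(src.trim().split("\\s+")); }

    // Prog -> StmList <EOF>
    static boolean accepts(String src) {
        Grammar330 p = new Grammar330(src);
        try {
            p.stmList();
            return p.pos == p.toks.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String t) {
        if (!peek().equals(t)) throw new IllegalStateException("expected " + t + " but saw " + peek());
        pos++;
    }

    // ID: a letter followed by letters or digits, excluding the reserved words.
    private void id() {
        String t = peek();
        if (!t.matches("[a-z][a-z0-9]*") || KEYWORDS.contains(t)) {
            throw new IllegalStateException("expected id but saw " + t);
        }
        pos++;
    }

    // StmList -> Stm ( ";" Stm )*
    private void stmList() {
        stm();
        while (peek().equals(";")) { pos++; stm(); }
    }

    private void stm() {
        switch (peek()) {
            case "while": pos++; id(); expect("do"); stm(); break;
            case "begin": pos++; stmList(); expect("end"); break;
            case "if":
                pos++; id(); expect("then"); stm();
                // Counterpart of [ LOOKAHEAD(1) "else" Stm() ]: take the else if it is next.
                if (peek().equals("else")) { pos++; stm(); }
                break;
            default: id(); expect(":="); id();
        }
    }
}
```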

35
Types of productions
  • production ::= (1) javacode_production
  • | (2) regular_expr_production
  • | (3) bnf_production
  • | (4) token_manager_decls
  • Note:
  • 1 and 3 are used to define the grammar.
  • 2 is used to define tokens.
  • 4 is used to embed code into the token manager.

36
JAVACODE production
  • javacode_production ::= "JAVACODE"
  • java_return_type java_identifier "("
    java_param_list ")"
  • java_block
  • Note:
  • Used to define nonterminals for recognizing something
    that is hard to parse using normal productions.

37
Example JAVACODE
  • JAVACODE void skip_to_matching_brace() {
  • Token tok;
  • int nesting = 1;
  • while (true) {
  • tok = getToken(1);
  • if (tok.kind == LBRACE) nesting++;
  • if (tok.kind == RBRACE) {
  • nesting--;
  • if (nesting == 0) break;
  • }
  • tok = getNextToken();
  • }
  • }
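
The nesting counter in this production can be tried out in isolation on a plain list of token images. This is a sketch under stated assumptions: the class BraceSkipper and its method are hypothetical, the opening brace is assumed to have been consumed already (hence the initial nesting of 1), and the method returns the index of the matching brace instead of consuming tokens from a parser.

```java
import java.util.List;

class BraceSkipper {
    // Returns the index of the "}" that closes an already-consumed "{",
    // scanning from index start; returns -1 if no such brace exists.
    static int matchingBrace(List<String> tokens, int start) {
        int nesting = 1;
        for (int i = start; i < tokens.size(); i++) {
            if (tokens.get(i).equals("{")) nesting++;
            if (tokens.get(i).equals("}")) {
                nesting--;
                if (nesting == 0) return i;
            }
        }
        return -1;
    }
}
```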

38
Note
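  • Do not use a nonterminal defined by JAVACODE at a
    choice point without giving a LOOKAHEAD.
  • Wrong: JavaCC cannot tell which alternative to choose
  • void NT() : {}
  • { skip_to_matching_brace()
  • | some_other_production() }
  • OK: each alternative starts with a distinct token
  • void NT() : {}
  • { "{" skip_to_matching_brace()
  • | "(" parameter_list() ")" }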

39
TOKEN_MANAGER_DECLS
  • token_manager_decls ::=
  • "TOKEN_MGR_DECLS" ":" java_block
  • The token manager declarations start with the
    reserved word "TOKEN_MGR_DECLS" followed by a ":"
    and then a set of Java declarations and
    statements (the Java block).
  • These declarations and statements are written
    into the generated token manager
    (MyParserTokenManager.java) and are accessible
    from within lexical actions.
  • There can only be one token manager declaration
    in a JavaCC grammar file.

40
regular_expression_production
  • regular_expr_production ::=
  • [ lexical_state_list ]
  • regexpr_kind [ "[" "IGNORE_CASE" "]" ] ":"
  • "{" regexpr_spec ( "|" regexpr_spec )* "}"
  • regexpr_kind ::=
  • "TOKEN" | "SPECIAL_TOKEN" | "SKIP" | "MORE"
  • TOKEN is used to define normal tokens.
  • SKIP is used to define skipped tokens (not passed
    on to the parser).
  • MORE is used to define semi-tokens (i.e., only
    part of a token).
  • SPECIAL_TOKEN is between TOKEN and SKIP:
    it is passed on to the parser and is accessible
    to parser actions, but is ignored by production
    rules (not counted as a token). Useful for
    representing comments.

41
lexical_state_list
  • lexical_state_list ::=
  • "<" "*" ">" | "<" java_identifier ( "," java_identifier )*
    ">"
  • The lexical state list describes the set of
    lexical states for which the corresponding
    regular expression production applies.
  • If this is written as "<*>", the regular
    expression production applies to all lexical
    states. Otherwise, it applies to all the lexical
    states in the identifier list within the angle
    brackets.
  • If omitted, the DEFAULT lexical state is
    assumed.

42
regexpr_spec
  • regexpr_spec ::=
  • regular_expression1 [ java_block ]
    [ ":" java_identifier ]
  • Meaning:
  • When regular_expression1 is matched, then
  • if java_block exists, execute it;
  • if java_identifier appears, transition to
    that lexical state.
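
Lexical states make the scanner a small state machine: the set of active rules depends on the current state, and a match may switch states. The sketch below imitates that with two states for block comments. It illustrates the idea and is not JavaCC-generated code; the class LexStates, the state name IN_COMMENT and the /* ... */ comment syntax are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

class LexStates {
    enum State { DEFAULT, IN_COMMENT }

    static List<String> scan(String in) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < in.length()) {
            if (state == State.DEFAULT) {
                if (in.startsWith("/*", pos)) {
                    state = State.IN_COMMENT;      // like  "/*" : IN_COMMENT
                    pos += 2;
                } else if (Character.isWhitespace(in.charAt(pos))) {
                    pos++;                         // skipped, state unchanged
                } else {
                    int start = pos;
                    while (pos < in.length()
                            && !Character.isWhitespace(in.charAt(pos))
                            && !in.startsWith("/*", pos)) {
                        pos++;
                    }
                    tokens.add(in.substring(start, pos));
                }
            } else {
                if (in.startsWith("*/", pos)) {
                    state = State.DEFAULT;         // like  "*/" : DEFAULT
                    pos += 2;
                } else {
                    pos++;                         // any other char inside a comment is skipped
                }
            }
        }
        return tokens;
    }
}
```

The rules for words and whitespace apply only in DEFAULT, so text inside a comment never produces tokens.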

43
regular_expression
  • regular_expression ::=
  • (1) java_string_literal
  • (2) | "<" [ [ "#" ] java_identifier ":" ]
    complex_regular_expression_choices ">"
  • (3) | "<" java_identifier ">"
  • (4) | "<" "EOF" ">"
  • <EOF> is matched by the end-of-file character only.
  • (3) <java_identifier> is a reference to another
    labeled regular_expression.
  • It is used in bnf_production.
  • java_string_literal is matched only by the
    string denoted by itself.
  • (2) is used to define a labeled regular_expr; it is
    not visible outside the current TOKEN section
    if "#" occurs.
  • (1) is for unnamed tokens.

44
Example
  • <DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
  • < FLOATING_POINT_LITERAL :
  • (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)?
    (["f","F","d","D"])?
  • | "." (["0"-"9"])+ (<EXPONENT>)?
    (["f","F","d","D"])?
  • | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
  • | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] >
  • { // do Something } : LEX_ST1
  • | < #EXPONENT : ["e","E"] (["+","-"])?
    (["0"-"9"])+ >
  • }
  • Note: if # is omitted, E123 will be recognized
    erroneously
  • as a token of kind EXPONENT.

45
Structure of complex_regular_expression
  • complex_regular_expression_choices ::=
  • complex_regular_expression ( "|"
    complex_regular_expression )*
  • complex_regular_expression ::=
  • ( complex_regular_expression_unit )*
  • complex_regular_expression_unit ::=
  • java_string_literal | "<"
    java_identifier ">"
  • | character_list
  • | "(" complex_regular_expression_choices ")"
    [ "+" | "*" | "?" ]
  • Note:
  • units are combined by concatenation (juxtaposition)
    into a complex_regular_expression; these are
    combined by choice (|) into
    complex_regular_expression_choices; and wrapping
    the choices in ( ... ) with an optional +, * or ?
    yields a unit again.

46
character_list
  • character_list ::=
  • [ "~" ] "[" [ character_descriptor ( ","
    character_descriptor )* ] "]"
  • character_descriptor ::=
  • java_string_literal [ "-" java_string_literal ]
  • java_string_literal ::= // reference to the Java
    grammar
  • singleCharString
  • Note: java_string_literal here is restricted to
    length 1.
  • Ex:
  • ~["a","b"] --- all chars but a and b.
  • ["a"-"f","0"-"9","A","B","C","D","E","F"] ---
    hexadecimal digit.
  • ["a","b"]* is not a regular_expression_unit. Why
    ?
  • It should be written ( ["a","b"] )* instead.

47
bnf_production
  • bnf_production ::=
  • java_return_type java_identifier "("
    java_parameter_list ")" ":"
  • java_block
  • "{" expansion_choices "}"
  • expansion_choices ::= expansion ( "|" expansion
    )*
  • expansion ::= ( expansion_unit )*

48
expansion_unit
  • expansion_unit ::=
  • (1) local_lookahead
  • (2) | java_block
  • (3) | "(" expansion_choices ")" [ "+" | "*" |
    "?" ]
  • (4) | "[" expansion_choices "]"
  • (5) | [ java_assignment_lhs "=" ]
    regular_expression
  • (6) | [ java_assignment_lhs "=" ]
  • java_identifier "(" java_expression_list ")"
  • Notes:
  • 1 is for lookahead; 2 is for semantic actions
  • 4 is the same as ( ... )?
  • 5 is for matching a token
  • 6 is for matching another nonterminal

49
lookahead
  • local_lookahead ::= "LOOKAHEAD" "("
    [ java_integer_literal ] [ "," ]
    [ expansion_choices ] [ "," ] [ "{" java_expression
    "}" ] ")"
  • Notes:
  • 3 components: maximum number of lookahead tokens, syntactic lookahead, semantic lookahead
  • Examples:
  • LOOKAHEAD(3)
  • LOOKAHEAD(5, Expr() <INT> | <REAL> , { true } )
  • More on LOOKAHEAD:
  • see the minitutorial

50
JavaCC API
  • Non-Terminals in the Input Grammar
  • If NT is a nonterminal, then the method
  • returntype NT(parameters) throws ParseException
  • is generated in the parser class
  • API for Parser Actions
  • Token token
  • variable always holds the last token and can be
    used in parser actions.
  • exactly the same as the token returned by
    getToken(0).
  • two other methods - getToken(int i) and
    getNextToken() can also be used in actions to
    traverse the token list.

51
Token class
  • public int kind;
  • 0 for <EOF>
  • public int beginLine, beginColumn, endLine,
    endColumn;
  • public String image;
  • public Token next;
  • public Token specialToken;
  • public String toString()
  • { return image; }
  • public static final Token newToken(int ofKind)
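
The specialToken field is what makes comments reachable from parser actions. The sketch below assumes the JavaCC convention that a regular token's specialToken refers to the closest preceding special token, which in turn links to earlier ones through its own specialToken field. SimpleToken is a hypothetical stand-in for the generated Token class, reduced to the fields needed here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SimpleToken {
    final int kind;
    final String image;
    SimpleToken next;          // the following token in the token list
    SimpleToken specialToken;  // the closest special token before this one, if any

    SimpleToken(int kind, String image) {
        this.kind = kind;
        this.image = image;
    }

    // Collects, in source order, the special tokens that immediately precede t.
    static List<String> specialsBefore(SimpleToken t) {
        Deque<String> images = new ArrayDeque<>();
        for (SimpleToken s = t.specialToken; s != null; s = s.specialToken) {
            images.addFirst(s.image);   // walking backwards, so prepend
        }
        return new ArrayList<>(images);
    }
}
```

A parser action could use such a walk to attach leading comments to the syntax tree node built for the token.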

52
Error reporting and recovery
  • It is not user-friendly to throw an exception and
    stop parsing as soon as a syntax error is
    encountered.
  • Two exceptions:
  • ParseException -> can be recovered from
  • TokenMgrError -> not expected to be recovered from
  • Error reporting:
  • modify ParseException.java or TokenMgrError.java
  • the generateParseException method can always be
    invoked in a parser action to report an error

53
Error Recovery in JavaCC
  • Shallow Error Recovery
  • Deep Error Recovery
  • Shallow Error Recovery
  • Ex:
  • void Stm() : {}
  • { IfStm()
  • | WhileStm() }
  • if getToken(1) is neither "if" nor "while" => shallow
    error

54
Shallow recovery
  • can be recovered from by an additional choice
  • void Stm() : {}
  • { IfStm()
  • | WhileStm()
  • | error_skipto(SEMICOLON) }
  • where
  • JAVACODE
  • void error_skipto(int kind) {
  • ParseException e = generateParseException(); //
    generate the exception object.
  • System.out.println(e.toString()); // print the
    error message
  • Token t;
  • do { t = getNextToken(); } while (t.kind !=
    kind);
  • }

55
Deep Error Recovery
  • Same example: void Stm() : {} { IfStm() |
    WhileStm() }
  • But this time the error occurs during parsing
    inside IfStm() or WhileStm() instead of at the
    lookahead entry.
  • The approach: use the Java try-catch construct.
  • void Stm() : {}
  • { try {
  • ( IfStm() | WhileStm() )
  • } catch (ParseException e) {
  • error_skipto(SEMICOLON); } }
  • Note the new syntax for a javacc bnf_production.
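
The try-catch recovery scheme can be imitated in a hand-written parser to see how parsing continues after an error. This is a sketch under simplifying assumptions and not generated JavaCC code: the class RecoveringParser is hypothetical, statements have the tiny form id := id ;, tokens are separated by whitespace, and IllegalStateException plays the role of ParseException.

```java
import java.util.ArrayList;
import java.util.List;

class RecoveringParser {
    private final String[] toks;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    RecoveringParser(String src) { toks = src.trim().split("\\s+"); }

    private String peek() { return pos < toks.length ? toks[pos] : "<EOF>"; }

    // Stm -> id ":=" id ";"   (deliberately tiny statement form)
    private void stm() {
        for (String expected : new String[] {"id", ":=", "id", ";"}) {
            String t = peek();
            boolean ok = expected.equals("id") ? t.matches("[a-z]+") : t.equals(expected);
            if (!ok) throw new IllegalStateException("expected " + expected + " but saw " + t);
            pos++;
        }
    }

    // Counterpart of error_skipto(SEMICOLON): consume tokens up to and including kind.
    private void skipTo(String kind) {
        while (pos < toks.length && !toks[pos++].equals(kind)) {
            // keep discarding tokens
        }
    }

    // Parses all statements, recording errors and resuming after the next ";".
    int parseAll() {
        int good = 0;
        while (pos < toks.length) {
            try {
                stm();
                good++;
            } catch (IllegalStateException e) {
                errors.add(e.getMessage());
                skipTo(";");
            }
        }
        return good;
    }
}
```

Because the catch block wraps the whole statement, an error anywhere inside it is handled the same way, which is the point of deep recovery.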

56
More Examples
  • There are plenty of examples on the net
  • http://www.vorlesungen.uni-osnabrueck.de/informatik/compilerbau98/code/JavaCC/examples/
  • JavaCC Grammar Repository
  • http://www.cobase.cs.ucla.edu/pub/javacc/

57
References
  • http://xml.cs.nccu.edu.tw/courses/compiler/cp2003Fall/slides/javaCC.ppt
  • Compilers: Principles, Techniques, and Tools, Aho,
    Sethi, and Ullman