Title: JFlex
1 JFlex
- Basically, a lexer is a finite-state transducer plus bells and whistles
- Arbitrary Java code can be associated with actions on state transitions
- Specify the transducer in a .flex file; JFlex compiles it into a .java file
- By default, JFlex gives you convenience methods to access the results of state transitions
2 A simple task
- Charniak's statistical parser takes input sentences delimited by <s>...</s>
- Suppose we want to take a Reader over such input and get back a Tokenizer over the tokens, which returns Word objects, plus a special end-of-sentence marker

garbage garbage garbage <s>Stocks skyrocketed on news that investigation of Cheney 's energy taskforce was dropped . </s>more garbage
3 edu.stanford.nlp.process.AbstractTokenizer
    ...
    /**
     * Internally fetches the next token.
     *
     * @return the next token in the token stream, or null if none exists.
     */
    protected abstract Object getNext();
    ...
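One way a base class could build hasNext() and next() on top of getNext() is to buffer a single lookahead token. The sketch below illustrates that pattern only; it is an assumption, not the actual source of edu.stanford.nlp.process.AbstractTokenizer, and the class names LookaheadTokenizer and ListTokenizer are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: subclasses only supply getNext().
abstract class LookaheadTokenizer implements Iterator<Object> {
    private Object lookahead;   // token fetched but not yet handed out

    /** Returns the next token, or null if none exists. */
    protected abstract Object getNext();

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            lookahead = getNext();   // null here means end of stream
        }
        return lookahead != null;
    }

    @Override
    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Object result = lookahead;
        lookahead = null;            // force a fresh fetch next time
        return result;
    }
}

// Hypothetical subclass used only to exercise the base class.
class ListTokenizer extends LookaheadTokenizer {
    private final Iterator<String> source;

    ListTokenizer(List<String> tokens) {
        this.source = tokens.iterator();
    }

    @Override
    protected Object getNext() {
        return source.hasNext() ? source.next() : null;
    }
}
```

Under this design a lexer subclass only has to return null at end of input, which is why a null return from getNext() matters.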
4 Lexical Rules
- Basically you're specifying a finite-state automaton with actions associated with state transitions
- Though not strictly limited by FSA expressivity
5 Schematic .flex file
    user code
    %%
    options and declarations
    %%
    lexical rules
6 Lexical Rules (schematic)
    <YYINITIAL> {
      {BeginSentence}  { yybegin(SENTENCE); return yylex(); }
      {WhiteSpace}     { /* ignore */ return yylex(); }
      .                { /* ignore */ return yylex(); }
    }
    <SENTENCE> {
      {EndSentence}    { yybegin(YYINITIAL); return SENTENCE_BOUNDARY; }
      {Token}          { return new Word(yytext()); }
      {Space}          { /* ignore */ return yylex(); }
    }
7 Lexical Rules (detail)
    <YYINITIAL> {
      {BeginSentence}  { yybegin(SENTENCE);
                         return yylex(); }
      ...
    }
    <SENTENCE> {
      {EndSentence}    { yybegin(YYINITIAL);
                         return SENTENCE_BOUNDARY; }
      {Token}          { return new Word(yytext()); }
      ...
    }
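To make the two lexical states concrete, here is a hand-written plain-Java sketch of the same state machine. It is not what JFlex generates. It assumes the tags are literally <s> and </s>, that a token is a run of non-whitespace characters, and that tags are separated from neighboring tokens by whitespace. It tries the end tag before the token pattern, whereas JFlex picks the longest match. The class name TwoStateScanner is hypothetical, and plain strings stand in for Word objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStateScanner {
    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    private enum State { YYINITIAL, SENTENCE }

    private static final Pattern BEGIN = Pattern.compile("<s>");
    private static final Pattern END = Pattern.compile("</s>");
    private static final Pattern TOKEN = Pattern.compile("[^ \\t\\r\\n\\f]+");

    // Returns the end offset of a match of p starting exactly at pos, or -1.
    private static int matchAt(Pattern p, String s, int pos) {
        Matcher m = p.matcher(s).region(pos, s.length());
        return m.lookingAt() ? m.end() : -1;
    }

    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        State state = State.YYINITIAL;
        int pos = 0;
        while (pos < input.length()) {
            int end;
            if (state == State.YYINITIAL) {
                if ((end = matchAt(BEGIN, input, pos)) >= 0) {
                    state = State.SENTENCE;       // like yybegin(SENTENCE)
                    pos = end;
                } else {
                    pos++;                        // ignore text outside a sentence
                }
            } else if ((end = matchAt(END, input, pos)) >= 0) {
                out.add(SENTENCE_BOUNDARY);
                state = State.YYINITIAL;          // like yybegin(YYINITIAL)
                pos = end;
            } else if ((end = matchAt(TOKEN, input, pos)) >= 0) {
                out.add(input.substring(pos, end));
                pos = end;
            } else {
                pos++;                            // whitespace inside a sentence
            }
        }
        return out;
    }
}
```

For example, TwoStateScanner.scan("junk <s> Stocks rose . </s> more junk") yields the tokens Stocks, rose, and . followed by SENTENCE_BOUNDARY.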
8 Options and declarations: States and Macros
    %state SENTENCE

    SentenceLetter = s
    BeginSentence  = <{SentenceLetter}>
    EndSentence    = <\/{SentenceLetter}>
    WhiteSpace     = [ \t\r\n\f]+
    Token          = [^ \t\r\n\f]+
- Macros can be used to define other macros
- Order of macro definition is irrelevant
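As a rough analogy, the macros can be mirrored in plain Java by composing regular-expression strings. This is a sketch only: the class name MacroSketch is hypothetical, and the whitespace and token patterns assume one-or-more repetition. Unlike JFlex macros, these Java constants must be declared before they are used.

```java
import java.util.regex.Pattern;

class MacroSketch {
    // Each constant mirrors one macro; later ones are built from earlier ones.
    static final String SENTENCE_LETTER = "s";
    static final String BEGIN_SENTENCE = "<" + SENTENCE_LETTER + ">";
    static final String END_SENTENCE = "</" + SENTENCE_LETTER + ">";
    static final String WHITE_SPACE = "[ \\t\\r\\n\\f]+";
    static final String TOKEN = "[^ \\t\\r\\n\\f]+";

    // True if the whole text matches the given pattern string.
    static boolean matches(String macro, String text) {
        return Pattern.matches(macro, text);
    }
}
```

Changing SENTENCE_LETTER in one place changes both tag patterns, which is the point of defining macros in terms of other macros.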
9 Other options and declarations
    %class CharniakTokenizer
    %implements Tokenizer
    %extends AbstractTokenizer
    %unicode
    %type Object
    %eofval{
      return null;
    %eofval}
10 Options & declarations: class-internal code (1)
    %{
      static final Word SENTENCE_BOUNDARY =
          new Word("SENTENCE_BOUNDARY");

      public Object getNext() {
        try {
          Object o = yylex();
          return o;
        }
        catch (IOException e) {
          return null;
        }
      }

      ...
11 Options & declarations: class-internal code (2)
      ...
      public static void main(String[] args) throws
          IOException {
        Reader r = new FileReader(args[0]);
        Tokenizer t = new CharniakTokenizer(r);
        while (t.hasNext()) {
          System.out.println(t.next());
        }
      }
    %}
12 User Code: inserted directly into the file
    package rog;
    import java.util.*;
    import java.io.*;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.process.*;
    /** A lexer for Charniak input sentences
     *  @author Roger Levy
     */
13 Beyond FSA expressivity
    %class ParenCounter
    %{
      private int numParens = 0;
    %}
    ...
    %%
    ...
    <YYINITIAL> {
      \(   { numParens++; return yytext(); }
      \)   { if (numParens == 0) throw new
               RuntimeException("error: too many close parens!");
             else {
               numParens--;
               return yytext();
             }
           }
    }
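The counter is what takes this lexer beyond a finite-state automaton: an integer field can track arbitrarily deep nesting, which no fixed set of states can. A minimal plain-Java sketch of the same logic follows. It is not generated JFlex code; the class name ParenCounterSketch and the openCount() helper are hypothetical, and skipping all non-parenthesis characters is an assumption, since the slide elides the other rules.

```java
import java.util.ArrayList;
import java.util.List;

class ParenCounterSketch {
    private int numParens = 0;   // currently open parentheses

    /** Returns the parentheses found in the input, in order. */
    List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == '(') {
                numParens++;
                out.add("(");
            } else if (c == ')') {
                if (numParens == 0) {
                    throw new RuntimeException("error: too many close parens!");
                }
                numParens--;
                out.add(")");
            }
            // every other character is skipped in this sketch
        }
        return out;
    }

    /** Number of parentheses still open after scanning. */
    int openCount() {
        return numParens;
    }
}
```

Scanning "(a(b))" returns four parenthesis tokens and leaves openCount() at zero, while scanning "())" throws on the second close paren.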