JFlex - PowerPoint PPT Presentation

Provided by: roger192
Transcript and Presenter's Notes

Title: JFlex


1
JFlex
  • Basically, a lexer is a finite-state transducer
    plus bells and whistles
  • Arbitrary Java code (actions) can be associated
    with state transitions
  • Specify the transducer in a .flex file; JFlex
    compiles it into a .java file
  • By default, JFlex gives you convenience methods
    to access the results of state transitions

2
A simple task
  • Charniak's statistical parser takes input
    sentences delimited by <s>...</s>
  • Suppose we want to take a Reader over such input
    and get back a Tokenizer over the tokens, which
    returns Word objects, plus a special
    end-of-sentence marker

    garbage garbage garbage <s>Stocks skyrocketed on
    news that investigation of Cheney 's energy
    taskforce was dropped . </s>more garbage
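
To make the target behaviour concrete, here is a rough sketch in plain Java (no JFlex involved) of the token stream one would expect from input like the sample above. The class name `ExpectedTokens` and the use of plain strings in place of `Word` objects are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original slides: computes the token
// stream the finished tokenizer is expected to produce.
class ExpectedTokens {
    // Stand-in for the special end-of-sentence Word.
    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    // Matches one <s>...</s> region; DOTALL lets a sentence span several lines.
    private static final Pattern SENTENCE =
        Pattern.compile("<s>(.*?)</s>", Pattern.DOTALL);

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(input);
        while (m.find()) {
            // Everything outside <s>...</s> is never visited, i.e. treated as garbage.
            for (String tok : m.group(1).trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    out.add(tok);
                }
            }
            out.add(SENTENCE_BOUNDARY);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("garbage <s>Stocks skyrocketed . </s>more garbage"));
    }
}
```

This is only a reference for the desired output; the rest of the deck builds the same behaviour as a JFlex specification.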
3
edu.stanford.nlp.process.AbstractTokenizer
    ...
    /**
     * Internally fetches the next token.
     * @return the next token in the token
     *         stream, or null if none exists.
     */
    protected abstract Object getNext();
    ...
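
The contract above (return the next token, or null when there is none) is enough to build `hasNext()` and `next()` on top of it. The following is a simplified sketch of that idea using one token of lookahead. `LookaheadTokenizer` and `ArrayTokenizer` are hypothetical names; this is not the Stanford `AbstractTokenizer` source, only an illustration of how such a base class could work.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified stand-in for the idea behind AbstractTokenizer; NOT the Stanford
// class, only an illustration of the getNext() contract.
abstract class LookaheadTokenizer implements Iterator<Object> {
    private Object nextToken;        // buffered token, or null if nothing is buffered
    private boolean primed = false;  // true once getNext() has been asked for a token

    // Subclasses return the next token, or null when the stream is exhausted.
    protected abstract Object getNext();

    @Override
    public boolean hasNext() {
        if (!primed) {
            nextToken = getNext();
            primed = true;
        }
        return nextToken != null;
    }

    @Override
    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Object result = nextToken;
        nextToken = null;
        primed = false;
        return result;
    }
}

// Toy subclass that serves tokens from a fixed array.
class ArrayTokenizer extends LookaheadTokenizer {
    private final Object[] items;
    private int pos = 0;

    ArrayTokenizer(Object... items) {
        this.items = items;
    }

    @Override
    protected Object getNext() {
        return pos < items.length ? items[pos++] : null;
    }
}
```

With this split, a generated scanner only has to supply `getNext()`; the iteration logic lives in the base class.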

4
Lexical Rules
  • Basically you're specifying a finite-state
    automaton with actions associated with state
    transitions
  • (though you are not strictly limited by FSA expressivity)
5
Schematic .flex file
    user code
    %%
    options and declarations
    %%
    lexical rules
6
Lexical Rules (schematic)
    <YYINITIAL> {
      {BeginSentence}  { yybegin(SENTENCE); return yylex(); }
      {WhiteSpace}     { /* ignore */ return yylex(); }
      .                { /* ignore */ return yylex(); }
    }
    <SENTENCE> {
      {EndSentence}    { yybegin(YYINITIAL);
                         return SENTENCE_BOUNDARY; }
      {Token}          { return new Word(yytext()); }
      {Space}          { /* ignore */ return yylex(); }
    }
7
Lexical Rules (detail)
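    <YYINITIAL> {
      {BeginSentence}  { /* ... */ yybegin(SENTENCE);
                         /* recursion: return value of the next token */
                         return yylex(); }
      ...
    }
    <SENTENCE> {
      {EndSentence}    { /* ... */ yybegin(YYINITIAL);
                         return SENTENCE_BOUNDARY; }
      {Token}          { /* yytext() returns the String value of the matched region */
                         return new Word(yytext()); }
      ...
    }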
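
To see what the two lexical states amount to, here is a hand-written emulation in plain Java. It is a sketch under assumptions: `TwoStateLexer` is a hypothetical name, tokens are plain strings rather than `Word` objects, `Character.isWhitespace` is slightly broader than the WhiteSpace macro, and the end tag is checked first at each token start instead of reproducing JFlex's longest-match rule. A scanner generated by JFlex is table-driven and will differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written emulation of the YYINITIAL and SENTENCE states.
// All names are illustrative; this is not JFlex output.
class TwoStateLexer {
    enum State { YYINITIAL, SENTENCE }

    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    static List<String> lex(String in) {
        List<String> out = new ArrayList<>();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < in.length()) {
            if (state == State.YYINITIAL) {
                if (in.startsWith("<s>", i)) {
                    state = State.SENTENCE;      // like yybegin(SENTENCE)
                    i += 3;
                } else {
                    i++;                         // ignore garbage outside sentences
                }
            } else {
                if (in.startsWith("</s>", i)) {
                    state = State.YYINITIAL;     // like yybegin(YYINITIAL)
                    out.add(SENTENCE_BOUNDARY);
                    i += 4;
                } else if (Character.isWhitespace(in.charAt(i))) {
                    i++;                         // ignore whitespace between tokens
                } else {
                    int start = i;               // consume a run of non-whitespace
                    while (i < in.length() && !Character.isWhitespace(in.charAt(i))) {
                        i++;
                    }
                    out.add(in.substring(start, i)); // like new Word(yytext())
                }
            }
        }
        return out;
    }
}
```

Each branch corresponds to one rule: the state variable plays the role of the current lexical state, and the code inside each branch plays the role of the action.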

8
Options and declarations: States and Macros
    %state SENTENCE

    SentenceLetter = s
    BeginSentence  = <{SentenceLetter}>
    EndSentence    = <\/{SentenceLetter}>
    WhiteSpace     = [ \t\r\n\f]
    Token          = [^ \t\r\n\f]+
  • Macros can be used to define other macros
  • Order of macro definition is irrelevant

9
Other options and declarations
    %class CharniakTokenizer
    %implements Tokenizer
    %extends AbstractTokenizer
    %unicode
    %type Object
    %eofval{
      return null;
    %eofval}
10
Options and declarations: class-internal code (1)
    static final Word SENTENCE_BOUNDARY =
        new Word("SENTENCE_BOUNDARY");

    public Object getNext() {
      try {
        Object o = yylex();
        return o;
      } catch (IOException e) {
        return null;
      }
    }
    ...

11
Options and declarations: class-internal code (2)
    ...
    public static void main(String[] args) throws
        IOException {
      Reader r = new FileReader(args[0]);
      Tokenizer t = new CharniakTokenizer(r);
      while (t.hasNext()) {
        System.out.println(t.next());
      }
    }

12
User Code: inserted directly into the file
    package rog;
    import java.util.*;
    import java.io.*;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.process.*;
    /** A lexer for Charniak input sentences
     *  @author Roger Levy
     */

13
Beyond FSA expressivity
    %class ParenCounter
    %{
      private int numParens = 0;
      ...
    %}
    ...
    <YYINITIAL> {
      \(   { numParens++; return yytext(); }
      \)   { if (numParens == 0)
               throw new RuntimeException(
                   "error: too many close parens!");
             else {
               numParens--;
               return yytext();
             } }
    }
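
The point of the example is that a counter field gives the scanner memory that a pure finite-state automaton lacks. The same logic can be tried out in plain Java, as sketched below. `ParenCounterSketch` is a hypothetical name, and skipping all characters other than parentheses is an assumption, since the slide elides the remaining rules.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of the ParenCounter rules; not JFlex output.
class ParenCounterSketch {
    private int numParens = 0;

    // Returns the parenthesis tokens in order. Other characters are skipped,
    // which is an assumption made for this sketch.
    List<String> scan(String in) {
        List<String> out = new ArrayList<>();
        for (char c : in.toCharArray()) {
            if (c == '(') {
                numParens++;
                out.add("(");
            } else if (c == ')') {
                if (numParens == 0) {
                    throw new RuntimeException("error: too many close parens!");
                }
                numParens--;
                out.add(")");
            }
        }
        return out;
    }

    // Number of parentheses still open after scanning.
    int openParens() {
        return numParens;
    }
}
```

Scanning `"(a(b))"` yields four parenthesis tokens and leaves no parentheses open, while `"())"` raises the RuntimeException at the second close paren.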