Title: From program text to tokens: the lexical structure
- From Section 2.1, Modern Compiler Design, by
Dick Grune et al.
2.1.1 Reading the program text
- Read the entire file into memory with a single
system call instead of using the standard
character-reading routines.
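
A minimal sketch of this approach in Python, assuming the source file fits comfortably in memory. The function name `read_program_text` and the temporary demo file are illustrative and do not come from the book. One `read()` call fetches the whole file, after which a lexer can index into the buffer instead of requesting one character at a time.

```python
import os
import tempfile


def read_program_text(path):
    """Read the whole file in one call and return its contents as bytes."""
    # Binary mode avoids any newline translation, so the lexer sees the
    # file exactly as it is stored on disk.
    with open(path, "rb") as f:
        return f.read()


# Usage: write a small hypothetical source file, then load it in one go.
with tempfile.NamedTemporaryFile("wb", suffix=".src", delete=False) as tmp:
    tmp.write(b"x = y + 1\n")
    demo_path = tmp.name

program_text = read_program_text(demo_path)
os.remove(demo_path)
print(len(program_text))
```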
2.1.1.1 The troublesome newline
- Each OS implements its own convention.
- UNIX (octal 12), MS-DOS (octal 15 followed by
octal 12), OS-370 (no character)
- The newline character is really an end-of-line
character.
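
One way to shield the rest of the lexer from these differences is to normalize line ends right after reading the text. The sketch below is an assumption about how that could be done, not the book's code. It maps the MS-DOS pair to the single UNIX character. Handling a bare carriage return is an extra assumption, and a record-based format such as OS-370 is not covered.

```python
def normalize_newlines(text):
    """Map CR LF (MS-DOS) and bare CR line ends to a single LF (UNIX)."""
    # Replace the two-character sequence first so that it does not turn
    # into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix_text = "a = 1\nb = 2\n"
dos_text = "a = 1\r\nb = 2\r\n"

# Octal 12 is the line feed and octal 15 the carriage return.
print("\012" == "\n", "\015" == "\r")
print(normalize_newlines(dos_text) == unix_text)
```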
2.1.2 Lexical versus syntactic analysis
- Where does the border between the two lie?
- Lexical analysis produces tokens and syntax
analysis consumes them, but what exactly is a
token?
- If it can be separated from its left and right
neighbors by white space without changing the
meaning, it's a token; otherwise, it isn't.
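
The white-space test can be illustrated with a toy tokenizer. The token pattern below is a hypothetical one chosen for the demonstration (identifiers, integers, and single non-blank characters). It is not a definition taken from the book.

```python
import re

# Hypothetical token pattern: identifier, integer, or any single
# non-whitespace character (used here for operators).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\S")


def tokens(text):
    """Return the list of tokens found in text, skipping white space."""
    return TOKEN.findall(text)


# White space between tokens leaves the token sequence unchanged.
print(tokens("count+1") == tokens("count + 1"))
# White space inside a token changes it, so "co" alone is not a token here.
print(tokens("count") == tokens("co unt"))
```

The first comparison holds and the second does not, which matches the rule: `count`, `+`, and `1` are tokens, while the fragments of `count` are not.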
2.1.3 Regular expressions and regular descriptions
- An identifier is a sequence of letters, digits,
and underscores that starts with a letter; no
consecutive underscores are allowed in it, nor
can it have a trailing underscore.
- This is satisfactory for the user of the language.
- For the purpose of compiler construction, we need
to express this as a regular expression.
- A regular expression is a formula that describes
a possibly infinite set of strings.
- It can be viewed both as a recipe for generating
these strings and as a pattern to match these
strings.
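
Both views can be shown side by side for one small pattern. The sketch below uses the regular expression `ab*` as an illustrative example. The hand-written generator `generate` is an assumption made for this demonstration and is not derived automatically from the pattern.

```python
import itertools
import re

PATTERN = re.compile(r"ab*")


def generate():
    """Yield the strings described by ab* in order of length: a, ab, abb, ..."""
    for n in itertools.count():
        yield "a" + "b" * n


# Recipe view: take the first few members of the infinite set.
first_four = list(itertools.islice(generate(), 4))
print(first_four)

# Pattern view: every generated string matches, and a non-member does not.
print(all(PATTERN.fullmatch(s) for s in first_four))
print(PATTERN.fullmatch("ba") is None)
```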
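- The repetition operators bind most tightly, then
concatenation, then alternation, so
ab*|cd?
is read as
(a(b*))|(c(d?))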
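
The grouping of a pattern such as `ab*|cd?` can be checked mechanically by comparing it with its fully parenthesized form `(a(b*))|(c(d?))`. The sketch below assumes Python's `re` module follows the same precedence rules. It enumerates all strings over a small alphabet up to a chosen length, and the helper name `accepted` is illustrative.

```python
import itertools
import re

implicit = re.compile(r"ab*|cd?")
explicit = re.compile(r"(a(b*))|(c(d?))")


def accepted(pattern, max_len=4, alphabet="abcd"):
    """Return every string up to max_len that the pattern matches in full."""
    result = set()
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if pattern.fullmatch(s):
                result.add(s)
    return result


print(accepted(implicit) == accepted(explicit))
print(sorted(accepted(implicit)))
```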
2.1.3.1 Regular expressions and BNF/EBNF
- Basic patterns share with the BNF notation the
invisible concatenation operator and the
alternative operator, and with EBNF the
repetition operators and parentheses.
2.1.3.2 Escape characters in regular expressions
- \* denotes the asterisk
- \\ denotes the backslash
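
A short illustration of the two escapes, using Python's `re` module as one concrete regular expression dialect. Other tools may differ in which characters need escaping.

```python
import re

star = re.compile(r"a\*")       # \* matches a literal asterisk
backslash = re.compile(r"a\\")  # \\ matches a literal backslash

print(star.fullmatch("a*") is not None)         # "a" followed by "*"
print(star.fullmatch("aaa") is None)            # no repetition is implied
print(backslash.fullmatch("a\\") is not None)   # "a" followed by one backslash
```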
2.1.3.3 Regular descriptions
- A regular description is like a context-free
grammar in EBNF, with the restriction that no
non-terminal can be used before it has been fully
defined.
letter → [a-zA-Z]
digit → [0-9]
underscore → _
letter_or_digit → letter | digit
underscored_tail → underscore letter_or_digit+
identifier → letter letter_or_digit* underscored_tail*
- Substituting the definitions yields a single
regular expression:
identifier → [a-zA-Z] ([a-zA-Z] | [0-9])*
(_ ([a-zA-Z] | [0-9])+)*
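
The substitution step can be mimicked with plain string building. In the sketch below each name is used only after it has been fully defined, mirroring the restriction on regular descriptions. The variable names follow the description above, and the use of Python's `re` syntax with non-capturing groups is an assumption of this example.

```python
import re

letter = "[a-zA-Z]"
digit = "[0-9]"
underscore = "_"
letter_or_digit = f"(?:{letter}|{digit})"
underscored_tail = f"(?:{underscore}{letter_or_digit}+)"
identifier = f"{letter}{letter_or_digit}*{underscored_tail}*"

IDENTIFIER = re.compile(identifier)

# Accepted: starts with a letter, single underscores only, no trailing one.
# Rejected: leading underscore, double underscore, trailing underscore, digit first.
for text in ["x", "max_value2", "a_b_c", "_x", "a__b", "a_", "2a"]:
    print(text, bool(IDENTIFIER.fullmatch(text)))
```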
2.1.4 Lexical analysis
- The basic task of a lexical analyzer is,
given a set S of token descriptions and a
position P in the input stream,
to determine which of the regular expressions in
S will match a segment of the input starting at P
and what that segment is.
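
The task can be sketched directly in terms of S and P. The token names and patterns below are hypothetical. Choosing the longest matching segment when several descriptions match is an assumption made for this example, since the statement above does not say how ties are resolved.

```python
import re

# Hypothetical token descriptions S; names and patterns are illustrative.
S = {
    "IDENTIFIER": re.compile(r"[a-zA-Z][a-zA-Z0-9]*"),
    "INTEGER": re.compile(r"[0-9]+"),
    "ASSIGN": re.compile(r":="),
    "COLON": re.compile(r":"),
}


def match_at(text, p):
    """Return (name, segment) for the longest match starting at p, or None."""
    best = None
    for name, pattern in S.items():
        m = pattern.match(text, p)  # the match must begin exactly at p
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


text = "count:=42"
print(match_at(text, 0))
print(match_at(text, 5))
print(match_at(text, 7))
```

At position 5 both COLON and ASSIGN match, and the longer segment `:=` wins under the assumed rule.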
2.1.5 Creating a lexical analyzer by hand