Lexical and Syntax Analysis

Provided by: ics78

1
Chapter 4

2
4.1 Introduction

3
Introduction (cont.)

Three approaches to implementing a programming language:
Compilation: used for large industrial application development.
Pure interpretation: used for smaller systems in which execution efficiency is not critical, such as scripts embedded in HTML documents.
Hybrid systems: translate high-level programs to an intermediate form, as in Java and Perl.
Nearly all syntax analysis is based on a formal description of the syntax of the source language.
The syntax analysis portion of a language processor nearly always consists of two parts:
A low-level part called a lexical analyzer.
A high-level part called a syntax analyzer, or parser.

4
Introduction (cont.)

Using BNF has at least three compelling advantages:
BNF descriptions of the syntax of programs are clear and concise, both for humans and for the software systems that use them.
The parser can be based directly on the BNF.
Implementations based on BNF are relatively easy to maintain because of their clear modularity.
Reasons to separate lexical and syntax analysis:
Simplicity - less complex approaches can be used for lexical analysis; separating the two simplifies the parser.
Efficiency - separation allows optimization of the lexical analyzer.
Portability - parts of the lexical analyzer may not be portable, but the parser is always portable.

5
4.2 Lexical Analysis

A lexical analyzer is a pattern matcher for character strings.
A lexical analyzer is a front end for the parser.
It identifies substrings of the source program that belong together: lexemes.
Lexemes match a character pattern, which is associated with a lexical category called a token.
Example: sum = oldsum - value / 100;
Lexemes: sum, =, oldsum, -, value, /, 100, ;
Tokens: IDENT, ASSIGN_OP, IDENT, SUBTRACT_OP, IDENT, DIVISION_OP, INT_LIT, SEMICOLON
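The lexeme-to-token pairing can be sketched in a few lines of Python. This is only an illustration, assuming the example statement is `sum = oldsum - value / 100;`. The token names come from the slide, while the regular expressions and the `tokenize` function are assumptions made for this sketch.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("INT_LIT", r"[0-9]+"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON", r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (lexeme, token) pairs, skipping whitespace between lexemes."""
    pairs = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        # lastgroup is the name of the alternative that matched, i.e. the token.
        pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs


for lexeme, token in tokenize("sum = oldsum - value / 100;"):
    print(f"{token:12} {lexeme}")
```

Each printed row pairs one token category with the lexeme that matched it, in source order.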

6
4.2 Lexical Analysis

Lexical analyzers extract lexemes from a given
input string and produce the corresponding
tokens.
Three approaches to building a lexical analyzer:
1. Write a formal description of the tokens and use a software tool that constructs table-driven lexical analyzers from such a description.
2. Design a state diagram that describes the tokens and write a program that implements the state diagram.
3. Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram.
Only approach 2 is discussed here.

7
4.2 Lexical Analysis

A naive state diagram would have a transition from every state on every character in the source language; such a diagram would be very large.
In many cases, transitions can be combined to simplify the state diagram:
When recognizing an identifier, all uppercase and lowercase letters are equivalent - use a character class that includes all letters.
When recognizing an integer literal, all digits are equivalent - use a digit class.
Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word).
Use a table lookup to determine whether a possible identifier is in fact a reserved word.
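A minimal sketch of the two simplifications, character classes and a reserved-word table, might look like this in Python. The class codes, the contents of `RESERVED_WORDS`, and the function names are hypothetical and chosen only for illustration.

```python
# Class codes and the reserved-word table are illustrative assumptions.
LETTER, DIGIT, UNKNOWN = "LETTER", "DIGIT", "UNKNOWN"
RESERVED_WORDS = {"if": "IF_CODE", "while": "WHILE_CODE", "for": "FOR_CODE"}


def char_class(ch):
    """Collapse many individual characters into a few classes."""
    if ch.isascii() and ch.isalpha():
        return LETTER
    if ch.isascii() and ch.isdigit():
        return DIGIT
    return UNKNOWN


def classify_name(lexeme):
    """Recognize names on one path, then decide by table lookup."""
    return RESERVED_WORDS.get(lexeme, "IDENT")


print(char_class("q"), char_class("7"), char_class("+"))
print(classify_name("while"), classify_name("total"))
```

With this arrangement the diagram needs one transition per class instead of one per character, and adding a reserved word means adding a table entry rather than new states.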

8
4.2 Lexical Analysis (cont.)

Convenient utility subprograms:
getChar - gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass.
addChar - puts the character from nextChar into the place where the lexeme is being accumulated, lexeme.
lookup - determines whether the string in lexeme is a reserved word (returns a code).
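One way these three subprograms could fit together is sketched below in Python. The method names mirror getChar, addChar, and lookup. The `Lexer` class, the `lex` driver, the token codes, and both tables are assumptions made for this sketch, not the presentation's own code.

```python
LETTER, DIGIT, UNKNOWN, EOF = "LETTER", "DIGIT", "UNKNOWN", "EOF"
RESERVED = {"if": "IF_CODE", "while": "WHILE_CODE"}  # hypothetical table
SINGLE_CHAR = {  # hypothetical codes for one-character lexemes
    "=": "ASSIGN_OP", "+": "ADD_OP", "-": "SUBTRACT_OP", "*": "MULT_OP",
    "/": "DIVISION_OP", "(": "LEFT_PAREN", ")": "RIGHT_PAREN", ";": "SEMICOLON",
}


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.next_char = ""
        self.char_class = EOF
        self.lexeme = ""
        self.get_char()  # prime next_char before the first call to lex()

    def get_char(self):
        """Put the next input character in next_char and its class in char_class."""
        if self.pos < len(self.text):
            self.next_char = self.text[self.pos]
            self.pos += 1
            if self.next_char.isalpha():
                self.char_class = LETTER
            elif self.next_char.isdigit():
                self.char_class = DIGIT
            else:
                self.char_class = UNKNOWN
        else:
            self.next_char = ""
            self.char_class = EOF

    def add_char(self):
        """Append next_char to the lexeme being accumulated."""
        self.lexeme += self.next_char

    def lookup(self):
        """Return a reserved-word code, or IDENT for an ordinary name."""
        return RESERVED.get(self.lexeme, "IDENT")

    def lex(self):
        """Follow the state diagram once and return one (token, lexeme) pair."""
        self.lexeme = ""
        while self.next_char.isspace():
            self.get_char()
        if self.char_class == LETTER:  # identifier or reserved word
            while self.char_class in (LETTER, DIGIT):
                self.add_char()
                self.get_char()
            return self.lookup(), self.lexeme
        if self.char_class == DIGIT:  # integer literal
            while self.char_class == DIGIT:
                self.add_char()
                self.get_char()
            return "INT_LIT", self.lexeme
        if self.char_class == EOF:
            return "EOF", ""
        self.add_char()  # any other single character
        self.get_char()
        return SINGLE_CHAR.get(self.lexeme, "UNKNOWN"), self.lexeme


lexer = Lexer("while count1 - 42;")
pair = lexer.lex()
while pair[0] != "EOF":
    print(pair)
    pair = lexer.lex()
```

Each branch of `lex` corresponds to one path through a state diagram: a letter starts the name path, a digit starts the integer path, and anything else is treated as a one-character lexeme.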

9
4.2 Lexical Analysis (cont.)

10
The Parsing Problem

The part of syntax processing referred to as syntax analysis is often called parsing.
Goals of the parser, given an input program:
Check the input program to determine whether it is syntactically correct. When an error is found, the analyzer must produce a diagnostic message and recover.
Produce either a complete parse tree or at least a trace of the structure of the complete parse tree. In either case, the result is used as the basis for translation.
Two categories of parsers:
Top-down - produce the parse tree beginning at the root. The order is that of a leftmost derivation.
Bottom-up - produce the parse tree beginning at the leaves. The parse tree is built from the leaves upward to the root.

11
Grammar symbols

Terminal symbols: lowercase letters at the beginning of the alphabet (a, b, ...)
Nonterminal symbols: uppercase letters at the beginning of the alphabet (A, B, ...)
Terminals or nonterminals: uppercase letters at the end of the alphabet (W, X, Y, Z)
Strings of terminals: lowercase letters at the end of the alphabet (w, x, y, z)
Mixed strings (terminals and/or nonterminals): lowercase Greek letters (α, β, μ)

12
Top-Down Parsers

Given a sentential form that is part of a leftmost derivation, the parser's task is to find the next sentential form in that leftmost derivation.
The most common top-down parsing algorithms:
Recursive descent - a coded implementation
LL parsers - use a parsing table rather than code
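To make "find the next sentential form" concrete, here is a small table-driven sketch in Python that records each sentential form of a leftmost derivation as it goes. The grammar S -> a S b | c, the `TABLE` contents, and the function name are hypothetical and chosen only because they fit in a few lines.

```python
# Hypothetical grammar for illustration: S -> a S b | c
# TABLE maps (nonterminal, lookahead token) to the RHS to expand.
TABLE = {("S", "a"): ["a", "S", "b"], ("S", "c"): ["c"]}
NONTERMINALS = {"S"}


def leftmost_derivation(tokens):
    """Return the sentential forms of the leftmost derivation of tokens."""
    tokens = list(tokens) + ["$"]  # "$" marks end of input
    matched = []   # terminals already matched against the input
    stack = ["S"]  # symbols still to process; the top is the end of the list
    forms = ["S"]
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                raise ValueError(f"unexpected {lookahead!r} while expanding {top}")
            stack.extend(reversed(rhs))  # leftmost RHS symbol ends up on top
            forms.append(" ".join(matched + stack[::-1]))
        elif top == lookahead:
            matched.append(top)
            pos += 1
        else:
            raise ValueError(f"expected {top!r}, found {lookahead!r}")
    if tokens[pos] != "$":
        raise ValueError("extra input after a complete sentence")
    return forms


for form in leftmost_derivation("aacbb"):
    print(form)
```

Every expansion replaces the leftmost nonterminal, so the recorded forms are exactly the steps of a leftmost derivation. The choice of which RHS to use is driven by one token of lookahead.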

13
Parser

Series of subroutine calls

14
Bottom-Up Parsers (LR parser)

15
The Complexity of Parsing

Parsers that work for any unambiguous grammar are complex and inefficient: O(n^3), which means the amount of time they take is on the order of the cube of the length of the string to be parsed.
All algorithms used for the syntax analyzers of compilers have complexity O(n), which means the time taken is linearly related to the length of the string to be parsed.

16
The Recursive-Descent Parsing Process

A recursive-descent parser is so named because it consists of a collection of subprograms, many of which are recursive, and it produces a parse tree in top-down (descending) order.
There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal.
A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )
The coding process when there is only one RHS:
For each terminal symbol in the RHS, compare it with the next input token; if they match, continue; otherwise there is an error.
For each nonterminal symbol in the RHS, call its associated parsing subprogram.
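The coding process can be sketched as follows in Python. This assumes the expression grammar is <expr> → <term> {(+ | -) <term>}, <term> → <factor> {(* | /) <factor>}, <factor> → id | ( <expr> ), and that tokens arrive as plain strings such as "id" or "+". The `parse` function, the `ParseError` class, and the Enter/Exit trace format are assumptions made for this sketch.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


def parse(tokens):
    """Parse a list of token strings and return an Enter/Exit trace."""
    tokens = list(tokens) + ["EOF"]
    pos = 0
    trace = []

    def next_token():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        # <expr> -> <term> {(+ | -) <term>}
        trace.append("Enter <expr>")
        term()
        while next_token() in ("+", "-"):
            advance()
            term()
        trace.append("Exit <expr>")

    def term():
        # <term> -> <factor> {(* | /) <factor>}
        trace.append("Enter <term>")
        factor()
        while next_token() in ("*", "/"):
            advance()
            factor()
        trace.append("Exit <term>")

    def factor():
        # <factor> -> id | ( <expr> )
        trace.append("Enter <factor>")
        if next_token() == "id":
            advance()
        elif next_token() == "(":
            advance()
            expr()
            if next_token() != ")":
                raise ParseError(f"expected ')', found {next_token()!r}")
            advance()
        else:
            raise ParseError(f"expected id or '(', found {next_token()!r}")
        trace.append("Exit <factor>")

    expr()
    if next_token() != "EOF":
        raise ParseError(f"unexpected {next_token()!r} after expression")
    return trace


print("\n".join(parse(["id", "+", "id", "*", "(", "id", "-", "id", ")"])))
```

There is one function per nonterminal. Terminals are compared against the next token, and nonterminals become calls, so the call structure of a run traces the parse tree from the root downward.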

17
The Recursive-Descent Parsing Process (cont.)

18
The Recursive-Descent Parsing Process (cont.)

A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse.
The correct RHS is chosen on the basis of the next token of input (the lookahead).
The next token is compared with the first token that can be generated by each RHS until a match is found.
If no match is found, it is a syntax error.
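The selection step can be sketched on its own in Python. The rule list below describes a hypothetical <factor> nonterminal with three RHSs; `FACTOR_RULES`, `choose_rhs`, and the token names are assumptions for illustration only.

```python
# Each RHS of a hypothetical <factor> rule, paired with the set of tokens
# that can appear first in a string generated by that RHS.
FACTOR_RULES = [
    ({"id"}, ["id"]),
    ({"int_lit"}, ["int_lit"]),
    ({"("}, ["(", "<expr>", ")"]),
]


def choose_rhs(rules, lookahead):
    """Return the RHS whose first-token set contains the lookahead token."""
    for first_tokens, rhs in rules:
        if lookahead in first_tokens:
            return rhs
    raise ValueError(f"syntax error: no RHS can begin with {lookahead!r}")


print(choose_rhs(FACTOR_RULES, "("))
print(choose_rhs(FACTOR_RULES, "id"))
```

For this to work with a single token of lookahead, the first-token sets of the alternatives must not overlap; otherwise the first matching RHS would be chosen arbitrarily.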

19
The Recursive-Descent Parsing Process (cont.)