Title: Syntax Analysis
1 Syntax Analysis
- Source
- 1. Chapter 4, Compilers: Principles, Techniques, and Tools
- 2. Compiler Construction Lecture Notes,
- Prof Trevor Mudge and Prof Mark Hodges
- University of Michigan
2 Agenda
- In this lecture we expand on the introduction of syntax analysis
- With particular attention to
- Parsing and
- Context Free Grammars
3 Syntax Analysis
- The syntax of programming language constructs can be described by context-free grammars or BNF
- Grammars offer significant advantages to compiler designers in design and construction
- Give a precise, easy to understand, syntactic specification of a programming language.
- Grammars help to automatically construct efficient parsers, a process which in turn can reveal syntactic ambiguities.
4 Syntax Analysis: Advantages of Grammars
- Give a precise, easy to understand, syntactic specification of a programming language.
- Help to automatically construct efficient parsers, a process which in turn can reveal syntactic ambiguities.
- Give structure to a programming language that is useful for the translation of source code into correct object code and for error checking.
- New constructs can be added to a language when there is an existing grammatical description of the language.
5 Syntax Analysis: Parsers
- Most of this lecture will deal with parsing methods that are typically used in compilers.
- Recall the position of the parser in the compiler model
[Figure: Source program → Lexical Analyser → token → Parser → parse tree → Rest of Front end → Intermediate Representation. The Parser sends "get next token" requests to the Lexical Analyser, and the phases share the Symbol Table.]
6 Where is Syntax Analysis Performed?
- Input: if (b 0) a b
- Lexical Analysis or Scanner produces the token stream: if ( b 0 ) a b
- Syntax Analysis or Parsing produces an abstract syntax tree or parse tree rooted at if, with leaves b, 0, a, b
7 Parsing Analogy
- Syntax analysis for natural languages
- Recognize whether a sentence is grammatically correct
- Identify the function of each word
- Example: "I gave him the book"
- sentence: subject (I), verb (gave), indirect object (him), object (noun phrase)
- noun phrase: article (the), noun (book)
8 Syntax Analysis Overview
- Goal: determine if the input token stream satisfies the syntax of the programming language
- What do we need to do this?
- An expressive way to describe the syntax
- A mechanism that determines if the input token stream satisfies the syntax description
- For lexical analysis
- Regular expressions describe tokens
- Finite automata are the mechanisms that generate tokens from the input stream
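The lexical-analysis half of this picture can be sketched in a few lines. The example below is a minimal illustration, not the lecture's own code: the token classes (NUMBER, ID, PUNCT) and the helper name `tokenize` are hypothetical, and Python's `re` module stands in for the finite automaton that a real scanner generator would build.

```python
import re

# Hypothetical token classes for illustration; a real language defines its own.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("PUNCT",  r"[(){};]"),
    ("SKIP",   r"\s+"),
]
# One combined regular expression with a named group per token class.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("if (b) a;")` yields a flat token list; deciding whether that list forms a valid statement is the parser's job, which is why a second formalism is needed.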
9 Can we just use Regular Expressions?
- REs can expressively describe tokens
- Easy to implement via DFAs
- So can we just use them to describe the syntax of a programming language?
- NO! They don't have enough power to express any non-trivial syntax
- Example: nested constructs (blocks, expressions, statements). Detect balanced braces
. . .
- We need unbounded counting!
- FSAs cannot count except in a strictly modulo fashion
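To make the counting argument concrete, here is a small sketch of a balanced-brace check. The function name `braces_balanced` is hypothetical. The point is the `depth` variable: it can grow without bound, whereas a finite automaton has only a fixed number of states and so cannot track arbitrary nesting depth.

```python
def braces_balanced(text):
    """Check that every '{' has a matching '}' using an unbounded counter."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # a '}' appeared before its matching '{'
                return False
    return depth == 0            # every opened brace must be closed
```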
10 Something more is needed!
- Most programming language constructs have an inherently recursive structure that can be defined by context-free grammars (CFGs).
- For example, a conditional statement defined by a rule such as
- if E then S1 else S2
- cannot be specified by REs, because S1 and S2 may themselves contain conditional statements.
- REs can specify the lexical structure of tokens
- We will find CFGs handy!
11 Context-Free Grammars
- Consist of 4 components
- Terminal symbols: tokens or ε, e.g.
- if, then, and else
- Non-terminal symbols: syntactic variables that denote sets of strings.
- Define sets of strings that help define the language generated by the grammar, e.g.
- if expr then stmt else stmt: expr and stmt are non-terminals
- Start symbol S: a special non-terminal
- Productions of a grammar, of the form LHS → RHS
- LHS: single non-terminal
- RHS: string of terminals and non-terminals
- Specify how non-terminals and terminals can be combined to form strings.
12 Context-Free Grammars
- Each production consists of a non-terminal followed by an arrow (→ or ::=), followed by a string of non-terminals and terminals
- The language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions
- L(G): language generated by grammar G
- Example:
S → a S a
S → T
T → b T b
T → ε
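The idea of L(G) as "everything derivable from the start symbol" can be illustrated by enumerating short sentences of the example grammar S → a S a | T, T → b T b | ε. The sketch below is only an illustration: the dictionary encoding and the function name `language_up_to` are assumptions, and the search always expands the leftmost non-terminal.

```python
from collections import deque

# Productions of the example grammar; "" stands for the empty string (epsilon).
GRAMMAR = {
    "S": ["aSa", "T"],
    "T": ["bTb", ""],
}

def language_up_to(max_len, start="S"):
    """Return all strings of L(G) with at most max_len terminals, sorted by length."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal in the sentential form, if any.
        index = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if index is None:
            found.add(form)      # only terminals remain: a sentence of L(G)
            continue
        for rhs in GRAMMAR[form[index]]:
            new_form = form[:index] + rhs + form[index + 1:]
            terminals = sum(1 for sym in new_form if sym not in GRAMMAR)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(found, key=lambda s: (len(s), s))
```

For example, `language_up_to(4)` returns `['', 'aa', 'bb', 'aaaa', 'abba', 'bbbb']`: some number of a's, an even number of b's, then the same number of a's again.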
13 CFG Example 1
- A grammar that defines simple arithmetic expressions can be defined by these productions
- expr → expr op expr
- expr → ( expr )
- expr → - expr
- expr → id
- op → +
- op → -
- op → *
- op → /
- op → ↑
- The terminal symbols are id + - * / ↑ ( )
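One way to see where the terminal set comes from is to store the productions as data and compute it. This is a sketch under assumptions: the tuple encoding and the helper names are hypothetical, and the operator set + - * / ↑ is assumed from the textbook version of this grammar.

```python
# Each production is (left-hand side, right-hand side as a tuple of symbols).
# The operator set below is an assumption based on the textbook grammar.
PRODUCTIONS = [
    ("expr", ("expr", "op", "expr")),
    ("expr", ("(", "expr", ")")),
    ("expr", ("-", "expr")),
    ("expr", ("id",)),
    ("op", ("+",)),
    ("op", ("-",)),
    ("op", ("*",)),
    ("op", ("/",)),
    ("op", ("↑",)),
]

def nonterminals(productions):
    """Non-terminals are exactly the symbols that appear on a left-hand side."""
    return {lhs for lhs, _ in productions}

def terminals(productions):
    """Terminals are right-hand-side symbols that never appear on a left-hand side."""
    nts = nonterminals(productions)
    return {sym for _, rhs in productions for sym in rhs if sym not in nts}
```

With this encoding, `nonterminals(PRODUCTIONS)` is `{"expr", "op"}` and `terminals(PRODUCTIONS)` contains id, the five operators, and the two parentheses.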
14 CFG Shorthand: Some Notational Conventions
- Terminal symbols
- Lower case letters a, b, c
- Operator symbols (+, -, etc.)
- Punctuation symbols: comma, (, ), etc.
- The digits 0, 1, ..., 9
- Boldface strings id, if
- Nonterminals
- A, B, C (upper case, early in alphabet)
- S: start symbol
- Lower-case italic names expr, etc.
- Grammar symbols: upper case letters late in alphabet X, Y, Z
- Strings of terminals: u, v, ..., z (lower case, late in alphabet)
- Strings of grammar symbols: lower case Greek letters α, β, γ, e.g. A → α
- Vertical bar for multiple productions:
S → a S a | T
T → b T b | ε
15 CFG - Example 2
- Using CFG shorthand we can rewrite the grammar for the previous example as
- E → E A E | ( E ) | - E | id
- A → + | - | * | / | ↑
- Long form: expr → expr op expr, expr → ( expr ), expr → - expr, expr → id, op → +, op → -, op → *, op → /, op → ↑
16 More on CFGs
- Shorthand notation: vertical bar is used for multiple productions
- S → a S a | T
- T → b T b | ε
- Definitions
- Derivation: successive application of productions starting from S
- Acceptance: determine if there is a derivation for an input token stream
17 CFG Example 3
- Grammar for balanced-parentheses language
- S → ( S ) S
- S → ε
- ε stands for the empty string
- 1 non-terminal: S
- 2 terminals: (, )
- Start symbol: S
- 2 productions
- If the grammar accepts a string, there is a derivation of that string using the productions
- Example: (())
- S ⇒ (S)S ⇒ ((S)S)S ⇒ (()S)S ⇒ (())S ⇒ (())
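Acceptance for this grammar can be sketched as a recursive function that mirrors the two productions directly. This is a minimal recursive-descent recognizer offered as an illustration; the names `parse_S` and `accepts` are hypothetical, and the input is treated as a string of single-character tokens.

```python
def parse_S(tokens, pos=0):
    """Match S -> ( S ) S | ε starting at pos; return the position after the match."""
    if pos < len(tokens) and tokens[pos] == "(":
        pos = parse_S(tokens, pos + 1)           # inner S
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError(f"expected ')' at position {pos}")
        return parse_S(tokens, pos + 1)          # trailing S
    return pos                                   # S -> ε consumes nothing

def accepts(text):
    """True if the whole input can be derived from S."""
    try:
        return parse_S(text) == len(text)
    except SyntaxError:
        return False
```

Here `accepts("(())")` is true, while `accepts("(()")` and `accepts("())")` are false: the first leaves a parenthesis unclosed and the second leaves input unconsumed.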
18 Parsers
- Inputs to the parser: a context-free grammar G and a token stream s (from the lexer)
- Outputs: Yes, if s in L(G); No, otherwise; plus error messages
- Syntax analyzers (parsers) are CFG acceptors which also output the corresponding derivation when the token stream is accepted
- Various kinds: LL, LR
19 Parsers
- Popular parsing methods are classified as either
- Top-down
- Build the parse tree from the top (root) to the bottom (leaves)
- or Bottom-up
- Start from the bottom (leaves) and work up to the root.
- In both cases the input to the parser is scanned from left to right, one symbol at a time.
20 LL Parsers
- www.Wikipedia.org
- An LL parser is a table-based top-down parser for a subset of the context-free grammars.
- It parses the input from Left to right, and constructs a Leftmost derivation of the sentence.
- The class of grammars which are parsable in this way is known as the LL grammars.
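A table-based top-down parser can be sketched for the balanced-parentheses grammar S → ( S ) S | ε. The parsing table below was written by hand for this one grammar and is an assumption of the example, as are the names `TABLE`, `END`, and `ll1_parse`; a parser generator would normally compute the table.

```python
END = "$"  # end-of-input marker (assumed not to occur in the input itself)

# Parsing table for S -> ( S ) S | ε, indexed by (non-terminal, lookahead).
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],      # S -> ε
    ("S", END): [],      # S -> ε
}

def ll1_parse(text):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    tokens = list(text) + [END]
    stack = [END, "S"]   # the top of the stack is the end of the list
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top == "S":
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"unexpected {look!r} at position {pos}")
            applied.append("S -> " + (" ".join(rhs) or "ε"))
            stack.extend(reversed(rhs))   # push the right-hand side, leftmost symbol on top
        elif top == look:
            pos += 1                      # match a terminal or the end marker
        else:
            raise SyntaxError(f"expected {top!r} but found {look!r} at position {pos}")
    if pos != len(tokens):
        raise SyntaxError("input continues after the end marker")
    return applied
```

For the input `()` the function returns `['S -> ( S ) S', 'S -> ε', 'S -> ε']`, which is the sequence of productions in a leftmost derivation of that string.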
21 LR Parsers
- www.Wikipedia.org
- A type of bottom-up parser for context-free grammars that is very commonly used by computer programming language compilers (and other associated tools).
- LR parsers read their input from Left to right and produce a Rightmost derivation (in reverse).
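The bottom-up idea can be illustrated with a hand-written shift-reduce loop. This is only a sketch, not a real LR parser: there is no generated table, the reduce decisions are hard-coded, and the grammar E → E + id | id is a hypothetical example that does not come from the slides.

```python
def shift_reduce(tokens):
    """Recognize E -> E + id | id bottom-up; return the reductions in the order performed."""
    stack = []
    reductions = []
    remaining = list(tokens)
    while True:
        if stack[-3:] == ["E", "+", "id"]:
            stack[-3:] = ["E"]                # reduce by E -> E + id
            reductions.append("E -> E + id")
        elif stack == ["id"]:
            stack[:] = ["E"]                  # reduce by E -> id
            reductions.append("E -> id")
        elif remaining:
            stack.append(remaining.pop(0))    # shift the next input token
        else:
            break
    if stack != ["E"]:
        raise SyntaxError(f"cannot reduce {stack} to E")
    return reductions
```

For `["id", "+", "id", "+", "id"]` the reductions come out as `E -> id`, `E -> E + id`, `E -> E + id`. Read backwards, they give the rightmost derivation E ⇒ E + id ⇒ E + id + id ⇒ id + id + id.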
22 Parsers: Other Tasks
- A number of tasks might be carried out during parsing
- Collecting information about various tokens into the symbol table
- Performing type checking,
- Semantic analysis and
- Generating intermediate code
23 Syntax Analysis: Syntax Error Handling
- Much of the error handling (detection and recovery) is often done during the syntax analysis phase.
- Programs can have errors at different levels. For example, the errors can be
- Lexical
- Misspelling of keywords, identifiers or operators
- Syntactic
- Arithmetic expression with unbalanced parentheses
- Semantic
- Operator applied to an incompatible operand
- Logical
- Infinitely recursive call
24 What Next?
- With the background on grammars so far we shall next look at
- How we can construct parse trees.
- Ambiguous grammars
- A grammar that produces more than one parse tree for some sentence
- Some derivation examples