Title: CS 3240: Languages and Computation
Full Compiler Structure
- Most compilers have two passes
- [Diagram: Start → Scanner → Parser → Semantic Action → Code Generation → code, with semantic errors reported from the semantic-action phase]
The Big Picture
- Parsing
- Matching code against the rules of a grammar and building a representation of the code.
- Scanning
- Converting the input text into a stream of known objects called tokens. Simplifies the parsing process.
- The grammar dictates the syntactic rules of the language, i.e., how a legal sentence may be formed.
- The lexical rules of the language dictate how a legal word is formed by concatenating symbols of the alphabet.
Regular Expressions
- Symbols and alphabet
- A symbol is a valid character in a language
- An alphabet is a set of legal symbols
- Typically denoted Σ
- Metacharacters/metasymbols
- Used to define reg-ex operations
- Escape character (\)
- Empty string ε and empty set ∅
- Basic regular expressions
- Basic operations: union, concatenation, repetition
DFA
- A DFA is a five-tuple consisting of
- An alphabet Σ
- A set of states Q
- A transition function δ: Q × Σ → Q
- One start state q0
- One or more accepting states F ⊆ Q
- The language accepted by a DFA is the set of strings such that the DFA ends at an accepting state after processing the string
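The five-tuple above can be run directly. Below is a minimal sketch in Python; the machine, state names, and alphabet are illustrative (a DFA accepting binary strings with an even number of 1s), not from the slides.

```python
# A minimal DFA simulator: the five-tuple is represented with plain
# Python data. This example machine accepts binary strings that
# contain an even number of 1s.
def dfa_accepts(delta, start, accepting, string):
    """Run the DFA and report whether it ends in an accepting state."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]   # total transition function
    return state in accepting

# delta: Q x Sigma -> Q, with Q = {"even", "odd"} and Sigma = {"0", "1"}
delta = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

print(dfa_accepts(delta, "even", {"even"}, "1101"))  # False (three 1s)
print(dfa_accepts(delta, "even", {"even"}, "1001"))  # True  (two 1s)
```

Because δ is total, the loop always has a next state; acceptance is just membership of the final state in F.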
NFA
- Nondeterministic Finite Automaton
- The same input may produce multiple paths
- Allows transitions on the empty string, and transitions from one state to several different states on the same character
- NFAs and DFAs are equivalent in power
- Proof by construction (the subset construction)
- They differ only in implementation detail
- Regular languages are closed under the regular operations
- If a language is regular, then it can be described by a regular expression.
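The subset construction behind the equivalence proof can be run "online": simulate the NFA by tracking the set of states it could be in. The sketch below, with an illustrative machine (it accepts strings over {a, b} ending in "ab"), is not from the slides.

```python
# Simulate an NFA by tracking the *set* of reachable states, which is
# exactly the idea of the subset construction. Epsilon moves are keyed
# by the empty string "".
def eps_closure(states, delta):
    """All states reachable from `states` using only epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in delta.get((stack.pop(), ""), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_accepts(delta, start, accepting, string):
    current = eps_closure({start}, delta)
    for symbol in string:
        moved = set()
        for q in current:                       # all choices in parallel
            moved |= delta.get((q, symbol), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# NFA accepting strings ending in "ab": state 0 guesses where "ab" starts.
delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
}
print(nfa_accepts(delta, 0, {2}, "aab"))  # True
print(nfa_accepts(delta, 0, {2}, "aba"))  # False
```

Precomputing the reachable state-sets instead of building them on the fly yields the equivalent DFA, which is the construction in the proof.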
Pumping Lemma
- For every regular language L, there is a finite pumping length p such that for every s ∈ L with |s| ≥ p, we can write s = xyz with
- 1) x y^i z ∈ L for every i ∈ {0, 1, 2, ...}
- 2) |y| ≥ 1
- 3) |xy| ≤ p
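The three conditions can be checked mechanically for a concrete split. A small sketch, using an example language and split of my choosing (L = (ab)*, p = 2, s = "abab"):

```python
# Illustrating the pumping lemma on the regular language L = (ab)*:
# with pumping length p = 2 and s = "abab", the split x = "", y = "ab",
# z = "ab" satisfies all three conditions, so every pumped string stays in L.
import re

L = re.compile(r"(ab)*")
x, y, z = "", "ab", "ab"

assert len(y) >= 1 and len(x + y) <= 2          # conditions 2) and 3)
for i in range(5):                              # condition 1) for i = 0..4
    pumped = x + y * i + z
    print(repr(pumped), L.fullmatch(pumped) is not None)
```

To prove a language non-regular, the lemma is used in the other direction: show that no split of some long string survives pumping.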
State Machines
- A lexical analyzer is a state machine
- State machines are very similar to finite automata
- [Diagram: an identifier DFA: a Letter edge from the start state into an accepting Identifier state, which loops on Letter or Digit]
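The identifier automaton in the diagram translates into a short scanner loop. A sketch (function name and return convention are my own, not from the slides):

```python
# The identifier DFA as a scanner loop: a Letter edge enters the accepting
# "Identifier" state, which then loops on Letter | Digit.
def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier starting at `pos`,
    or (None, pos) if no identifier starts there."""
    if pos >= len(text) or not text[pos].isalpha():
        return None, pos                 # no Letter edge out of the start state
    end = pos + 1
    while end < len(text) and (text[end].isalpha() or text[end].isdigit()):
        end += 1                         # loop on Letter | Digit
    return text[pos:end], end

print(scan_identifier("x1 = y2"))   # ('x1', 2)
print(scan_identifier("42abc"))     # (None, 0)
```

A full lexer is a collection of such loops, one per token class, driven from a dispatch on the first character.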
Context-Free Grammar
- A context-free grammar (V, Σ, R, S) is a grammar where all rules are of the form A → x, with A ∈ V and x ∈ (V ∪ Σ)*
- A string is accepted if there is a derivation from S to the string
- Derivations are represented by ⇒ sequences or by parse trees
- Left-most and right-most derivations
Ambiguity
- A string w ∈ L(G) is derived ambiguously if it has more than one derivation tree (or, equivalently, if it has more than one leftmost derivation (or rightmost)).
- A grammar is ambiguous if some strings are derived ambiguously.
- Some languages are inherently ambiguous.
Chomsky Normal Form
- A method of simplifying a CFG
- Definition: A context-free grammar is in Chomsky normal form if every rule is of one of the following forms
- A → BC
- A → a
- where a is any terminal, A is any variable, and B and C are any variables other than the start variable
- If S is the start variable, then
- the rule S → ε is the only permitted ε-rule
Pushdown Automata
- Similar to finite automata, but for CFGs
- PDAs are finite automata with a stack
- Theorem: A language is context-free if and only if some pushdown automaton recognizes it
Pumping Lemma for CFLs
Theorem: For every context-free language L, there is a pumping length p such that for any string s ∈ L with |s| ≥ p, we can write s = uvxyz with
1) u v^i x y^i z ∈ L for every i ∈ {0, 1, 2, ...}
2) |vy| ≥ 1
3) |vxy| ≤ p
Note that 1) implies that uxz ∈ L (take i = 0), requirement 2) says that v and y cannot both be the empty string ε, and condition 3) is useful in proving that a language is not a CFL.
Parser Classification
- Parsers are broadly broken down into
- LL - Top-down parsers
- L - Scans left to right
- L - Traces a leftmost derivation of the input string
- LR - Bottom-up parsers
- L - Scans left to right
- R - Traces a rightmost derivation of the input string
- LL is a subset of LR
- Typical notation
- LL(0), LL(1), LR(1), LR(k)
- The number k refers to the maximum lookahead
- Lower is better!
Top-down Parsing
- Top-down parsing
- Recursive descent: recursive or non-recursive
- LL(1) parsing: table-driven, stack-based implementation similar to a pushdown automaton
- Removal of left recursion
- Why?
- Left recursion may lead to an infinite loop
- How?
- EBNF
- Immediate left recursion
- Indirect left recursion
- Left factoring
- LL(1) parsing
- First set and Follow set
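Left-recursion removal can be seen in a few lines. A sketch with an illustrative grammar (not from the slides): the left-recursive rule E → E + n | n would recurse forever in a recursive-descent parser, but its EBNF rewrite E → n { + n } parses with a simple loop.

```python
# Recursive descent for the EBNF rewrite E -> n { + n } of the
# left-recursive rule E -> E + n | n. The parser also evaluates,
# to show left associativity is preserved.
def parse_E(tokens, pos=0):
    """Parse E -> n { + n }; return (value, next_pos) or raise SyntaxError."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected number at position {pos}")
    value, pos = int(tokens[pos]), pos + 1
    while pos < len(tokens) and tokens[pos] == "+":   # the { + n } loop
        if pos + 1 >= len(tokens) or not tokens[pos + 1].isdigit():
            raise SyntaxError(f"expected number after '+' at {pos + 1}")
        value, pos = value + int(tokens[pos + 1]), pos + 2
    return value, pos

print(parse_E(["3", "+", "4", "+", "5"]))  # (12, 5)
```

Calling a `parse_E` written literally from E → E + n would immediately call itself on the same position, which is the infinite loop the slides warn about.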
First Set
- Let X be a grammar symbol (a terminal or nonterminal) or ε. Then the set First(X) is defined as follows
- If X is a terminal or ε, then First(X) = {X}.
- If X is a nonterminal, then for each production rule X → X1 X2 ... Xn, First(X) contains First(X1) - {ε}.
- If for some i < n, First(X1), ..., First(Xi) all contain ε, then First(X) contains First(Xi+1) - {ε}.
- If First(X1), ..., First(Xn) all contain ε, then First(X) contains ε.
- First(α) for any string α = X1 X2 ... Xn is defined using rules 2-4.
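Rules 2-4 amount to a fixed-point computation. A sketch, with my own grammar encoding (nonterminals are keys of the grammar dict, "" stands for ε, the example grammar is illustrative):

```python
# Compute First sets by iterating rules 2-4 to a fixed point.
from collections import defaultdict

def first_sets(grammar):
    first = defaultdict(set)
    changed = True
    while changed:                       # iterate until nothing new is added
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                nullable_prefix = True
                for sym in body:         # scan while epsilon is derivable
                    f = first[sym] if sym in grammar else {sym}
                    first[head] |= f - {""}
                    if "" not in f:
                        nullable_prefix = False
                        break
                if nullable_prefix:      # every symbol of the body can vanish
                    first[head].add("")
                changed |= len(first[head]) > before
    return dict(first)

# E -> T E2,  E2 -> + T E2 | epsilon,  T -> id
grammar = {"E": [["T", "E2"]], "E2": [["+", "T", "E2"], [""]], "T": [["id"]]}
fs = first_sets(grammar)
print(sorted(fs["E"]), sorted(fs["E2"]), sorted(fs["T"]))
```

The iteration terminates because each pass can only grow the finite sets.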
Follow Set
- Given a nonterminal A, the set Follow(A) is defined as
- If A is the start symbol, then $ is in Follow(A)
- If there is a production rule B → α A β, then Follow(A) contains First(β) - {ε}
- If there is a production rule B → α A β and ε ∈ First(β), then Follow(A) contains Follow(B)
- Notes
- $ is needed to indicate the end of the string
- ε is never a member of a Follow set
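The Follow rules also fix-point nicely. A simplified sketch: the grammar is ε-free so First(β) reduces to looking at β's first symbol, and rule 3 only fires when β is empty. Grammar and encoding are illustrative, not from the slides.

```python
# Compute Follow sets for a small epsilon-free grammar ("$" = end of input).
def follow_sets(grammar, start, first):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:       # terminals have no Follow set
                        continue
                    before = len(follow[sym])
                    beta = body[i + 1:]
                    if beta:                     # rule 2: add First(beta)
                        nxt = beta[0]
                        follow[sym] |= first[nxt] if nxt in grammar else {nxt}
                    else:                        # rule 3 with beta = epsilon
                        follow[sym] |= follow[head]
                    changed |= len(follow[sym]) > before
    return follow

# S -> A S | b,  A -> a
grammar = {"S": [["A", "S"], ["b"]], "A": [["a"]]}
first = {"S": {"a", "b"}, "A": {"a"}}            # computed by hand here
fl = follow_sets(grammar, "S", first)
print(sorted(fl["S"]), sorted(fl["A"]))          # ['$'] ['a', 'b']
```

Handling ε-productions requires the full First(β) of the suffix string, as in rules 2-4 of the First-set definition.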
LR Parsing
- Traverses the rightmost derivation in reverse order
- Also uses a stack
- Main actions of LR parsing: shift and reduce
- LR(0)
- What are LR(0) items?
- How is the DFA for LR(0) defined?
- Simple LR(1)
- How does SLR(1) extend LR(0)?
- General LR(1)
- What are LR(1) items?
- How is the DFA for LR(1) items defined?
- Their corresponding parsing algorithms
- Conflicts of actions
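The states of the LR(0) DFA are closed sets of items. A sketch of the closure operation, with my own item encoding (an item [A → α.β] is a tuple (head, body, dot position)) and an illustrative grammar:

```python
# Closure of a set of LR(0) items: whenever a dot stands before a
# nonterminal B, add the item [B -> .gamma] for every production of B.
def closure(items, grammar):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in grammar:  # dot before nonterminal
                for prod in grammar[body[dot]]:
                    item = (body[dot], tuple(prod), 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

# S -> ( S ) | x, with augmented start item [S' -> .S]
grammar = {"S": [["(", "S", ")"], ["x"]]}
start = closure({("S'", ("S",), 0)}, grammar)
for head, body, dot in sorted(start, key=str):
    print(head, "->", " ".join(body[:dot]) + "." + " ".join(body[dot:]))
```

The DFA's transition on a symbol X takes every item with the dot before X, moves the dot past it, and closes the result; the same closure also underlies the LR(1) DFA, with lookaheads added.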
General LR(1) Parsing
- Also called canonical LR(1)
- More complex than SLR(1)
- An efficient variation is LALR(1)
- The key difference between general LR(1) and SLR(1) is that the lookahead is built into the DFA
- LR(1) items differ from LR(0) items in that they include a single lookahead token in each item
- e.g., [A → .α, a]
GLR(1) Parsing Algorithm
- If the state s contains any item [A → α.Xβ, a] and the current input token is X, then
- shift X
- push the new state δ(s, X)
- If the state contains any complete item [A → α., a] and the next input token is a, then
- remove α and the corresponding states from the stack
- (if the item is [S' → S., $], then accept when the input token is $)
- back up the DFA to the state s from which the construction of α began
- push A and the new state δ(s, A) onto the stack
LALR(1) Parsing
- LR(1) has too high a complexity (too many states)
- How to reduce the number of states?
- If the LR(1) items in two states differ only by their lookaheads, then merge the states
- Definition: The core of a state of the DFA of LR(1) items is the set of LR(0) items consisting of the first components of all LR(1) items in the state.
Principles of LALR(1) Parsing
- First principle (observation)
- The core of a state of the LR(1) DFA is a state of the LR(0) DFA
- Second principle (observation)
- Given two states s1 and s2 of the LR(1) DFA with the same core, if there is a transition t1 = δ(s1, X), then there is a transition t2 = δ(s2, X), where t1 and t2 have the same core
- Based on these principles, we merge LR(1) states with the same core, keeping a set of lookahead symbols in each item
Grammar Relationships
Attribute Grammars
- Attribute: a property of a programming language construct
- Data type, value of expressions, etc.
- Attribute grammar: a collection of attribute equations, or semantic rules, associated with the grammar rules of a language
- Each attribute equation in general has the form
  A.a = f(X1.a1, X1.a2, ..., X1.ak, ..., Xm.a1, Xm.a2, ..., Xm.ak)
Dependencies of Attribute Equations
- Synthesis
- The attribute of the LHS depends on attributes of the RHS
- E.g., arithmetic expressions
- Inheritance
- An attribute is inherited from attributes of the LHS by the RHS, or passed between symbols of the RHS
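Synthesized attributes for arithmetic expressions can be shown in miniature. A sketch with my own tree encoding (a leaf is a number, an interior node is a tuple), not from the slides:

```python
# Synthesized attributes: the value attribute of each node is computed
# from the value attributes of its children (a postorder walk), matching
# rules like E.val = E1.val + T.val.
def val(node):
    """node is either a number (leaf) or (op, left, right)."""
    if isinstance(node, (int, float)):
        return node                      # leaf: attribute comes from the token
    op, left, right = node
    l, r = val(left), val(right)         # children's synthesized attributes
    return l + r if op == "+" else l * r

tree = ("+", 3, ("*", 4, 5))             # parse tree for 3 + 4 * 5
print(val(tree))                         # 23
```

An inherited attribute would instead flow downward as an extra parameter to the recursive call, for example a symbol table or an expected type.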
Equivalence between S- and L-Attributed Grammars
- Theorem: Given an attribute grammar, all inherited attributes can be changed into synthesized attributes by modification of the grammar, without changing the language.
- However, it may be difficult to change the grammar in practice.
Symbol Table
- Compilers use a symbol table to keep track of the various names encountered in a program
- Symbol table entries
- Main fields: Name, Attributes
- The attribute field contains various information, including binding information, the type of the name, etc.
- Interacts with all phases of compilation
- Basic operations: insert, lookup, delete
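The three basic operations, plus scoping, fit in a short class. A minimal sketch, assuming a stack-of-dictionaries design with illustrative field names (the slides do not prescribe an implementation):

```python
# A minimal scoped symbol table: a stack of dictionaries, innermost
# scope searched first on lookup.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                       # global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, **attributes):        # e.g. type, binding info
        self.scopes[-1][name] = attributes

    def lookup(self, name):                      # innermost scope wins
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

    def delete(self, name):
        self.scopes[-1].pop(name, None)

table = SymbolTable()
table.insert("x", type="integer")
table.enter_scope()
table.insert("x", type="real")                   # shadows the outer x
print(table.lookup("x")["type"])                 # real
table.exit_scope()
print(table.lookup("x")["type"])                 # integer
```

Exiting a scope discards its bindings wholesale, which is why block-structured languages pair enter/exit with `{` and `}`.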
Type Expressions
- The type of a language construct is denoted by a type expression.
- Type expressions are either basic types or are constructed from basic types using type constructors.
- Basic types: boolean, char, integer, real, type_error, void
- array(I, T), where T is a type expression and I is an integer range. E.g., int A[10] has the type expression array(0..9, integer)
- We can take Cartesian products of type expressions. E.g.,
- struct entry { char letter; int value; } is of type (letter × char) × (value × integer)
Type Expressions, II
- Pointers: int *aaaa is of type pointer(integer).
- Functions
- int divide(int i, int j) is of type integer × integer → integer
- Representing type expressions as trees
- e.g., char × char → pointer(integer)
- [Tree: → at the root, with left child × (over char and char) and right child pointer (over integer)]
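The tree representation can be sketched with nested tuples; the constructor helpers below are my own naming, not from the slides. Structural type equivalence then reduces to tree equality.

```python
# Type expressions as trees: a basic type is a string, a constructor is
# a tuple (name, child, ...).
def pointer(t):
    return ("pointer", t)

def product(s, t):
    return ("x", s, t)

def function(s, t):
    return ("->", s, t)

# char x char -> pointer(integer), the example above
t1    = function(product("char", "char"), pointer("integer"))
same  = function(product("char", "char"), pointer("integer"))
other = function(product("integer", "integer"), "integer")

print(t1 == same)    # True: structurally equivalent
print(t1 == other)   # False
```

A type checker for name equivalence would instead compare declared type names, not the trees themselves.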
Type Systems
- A type system: a collection of rules for assigning type expressions to the variable parts of a program.
- A type checker implements a type system.
- It is most convenient to implement a type checker within the semantic rules of a syntax-directed definition (and thus it will be implemented during translation).
- Many checks can be done statically (at compilation).
- Not all checks can be done statically. E.g., given int A[10]; int i; printf("%d", A[i]); the bounds check on A[i] depends on the run-time value of i.
Formal Definition of a TM
- Definition: A Turing machine is a 7-tuple (Q, Σ, Γ, δ, q0, qaccept, qreject), where Q, Σ, and Γ are finite sets and
- Q is the set of states,
- Σ is the input alphabet, not containing the special blank symbol ⊔,
- Γ is the tape alphabet, where ⊔ ∈ Γ and Σ ⊆ Γ,
- δ: Q × Γ → Q × Γ × {L, R} is the transition function,
- q0 ∈ Q is the start state,
- qaccept ∈ Q is the accept state, and
- qreject ∈ Q is the reject state, where qreject ≠ qaccept.
TM Configurations
- The configuration of a Turing machine is its current setting
- Current state
- Current tape contents
- Current tape location
- Notation: u q v
- Current state: q
- Current tape contents: uv
- Only blank symbols appear after the last symbol of v
- Current tape location: the first symbol of v
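The u q v notation falls out naturally when simulating a TM. A tiny sketch; the machine and conventions (blank written as "_", undefined moves never reached) are illustrative and accept exactly the strings in 0*.

```python
# A tiny TM simulator that prints configurations in u q v notation.
# This machine scans right over 0s and accepts on the first blank.
def run_tm(delta, tape, q="q0", accept="qacc", reject="qrej", blank="_"):
    tape, head = list(tape) or [blank], 0
    while q not in (accept, reject):
        print("".join(tape[:head]), q, "".join(tape[head:]))  # u q v
        q, write, move = delta[(q, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
        head = max(head, 0)                      # cannot fall off the left end
        if head == len(tape):
            tape.append(blank)                   # tape is infinite to the right
    return q == accept

delta = {("q0", "0"): ("q0", "0", "R"),          # skip 0s
         ("q0", "_"): ("qacc", "_", "R")}        # first blank: accept
print(run_tm(delta, "000"))   # True
```

Each printed line is one configuration: the state symbol sits immediately before the cell the head is reading.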
Equivalence of Machines
- Theorem: Every multitape Turing machine has an equivalent single-tape Turing machine
- Proof method: by construction.
Equivalence of Machines
- Theorem: Every nondeterministic Turing machine has an equivalent deterministic Turing machine
- Proof method: construction
- Proof idea: Use a 3-tape Turing machine to deterministically simulate the nondeterministic TM. The first tape keeps a copy of the input, the second tape is the computation tape, and the third tape keeps track of the choices.
Decidability
- A language is decidable if some Turing machine decides it
- Not all languages are decidable
- How to show a language is decidable?
- Write a decider that decides it
- Accepts w iff w is in the language
- Must halt on all inputs
Summary of Decidable Languages
Turing Machine Acceptance Problem
- Consider the following language
- ATM = { <M, w> | M is a TM that accepts w }
- Theorem: ATM is Turing-recognizable
- Theorem: ATM is undecidable
- Proof idea: Construct a universal Turing machine that recognizes, but does not decide, ATM
Undecidable Languages
- [Diagram: the decidable languages are exactly those that are both Turing-recognizable and co-Turing-recognizable]
The Halting Problem HALTTM
- HALTTM = { <M, w> | M is a TM and M halts on input w }
- Theorem: HALTTM is undecidable
- Proof idea (by contradiction)
- Show that if HALTTM is decidable, then ATM is also decidable
Reductions and Decidability
- To prove a language is decidable, we have converted it to another language and used the decidability of that language
- Example: use the decidability of EDFA to determine the decidability of EQDFA
- Thus, we reduce the problem of determining whether EQDFA is decidable to the problem of determining whether EDFA is decidable
Reductions and Undecidability
- To prove a language is undecidable, we have assumed it is decidable and found a contradiction
- Example: assume the decidability of HALTTM and show that ATM is then decidable, which is a contradiction
- In each case, we have to do a computation to convert one problem to another problem
- What kinds of computations can we do?
Rice's Theorem
- Determining whether a TM satisfies any non-trivial property is undecidable
- A property is non-trivial if
- It depends only on the language of M, and
- Some, but not all, Turing machines have the property
- Examples: Is L(M) regular? A CFL? Finite?
Linear Bounded Automata
- LBA definition: a TM that is prohibited from moving its head off the right side of the input.
- The machine prevents such a move, just as a TM prevents a move off the left end of the tape
- How many possible configurations are there for an LBA M on input w with |w| = n, m states, and |Γ| = p?
- Counting gives m · n · p^n
Mapping Reducibility
- Definition: Language A is mapping reducible to language B, written A ≤m B, if there is a computable function f: Σ* → Σ* where, for every w,
- w ∈ A iff f(w) ∈ B
- The function f is called the reduction of A to B.
- [Diagram: f maps members of A into B and non-members of A outside B]
Applications of Mapping Reductions
- If A ≤m B and B is decidable, then A is decidable
- If A ≤m B and A is undecidable, then B is undecidable
- If A ≤m B and B is Turing-recognizable, then A is Turing-recognizable
- Equivalently, if A ≤m B then the complement of A is mapping reducible to the complement of B
- If A ≤m B and A is not Turing-recognizable, then B is not Turing-recognizable
Complexity Relationships
- Theorem: Let t(n) be a function, where t(n) ≥ n. Then every t(n)-time multitape TM has an equivalent O(t²(n))-time single-tape TM
- Proof idea: Consider the structure of the equivalent single-tape TM. Analyzing its behavior shows that each step on the multitape machine takes O(t(n)) steps on the single-tape machine
Determinism vs. Nondeterminism
- Definition: Let P be a nondeterministic Turing machine. The running time of P is the function f: N → N, where f(n) is the maximum number of steps that P uses on any branch of its computation on any input of length n.
NP-Completeness
- A problem C is NP-complete if finding a polynomial-time solution for C would imply P = NP
- Definition: A language B is NP-complete if it satisfies two conditions
- B is in NP, and
- Every A in NP is polynomial-time reducible to B
Cook-Levin Theorem
- SAT = { <B> | B is a satisfiable Boolean expression }
- Theorem: SAT is NP-complete
- If SAT can be solved in polynomial time, then any problem in NP can be solved in polynomial time
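Absent a polynomial algorithm, SAT can still be decided by brute force in exponential time, which makes the definition concrete. A sketch using my own CNF encoding (a clause is a list of literals, a negative integer meaning a negated variable):

```python
# SAT by brute force: try all 2^n assignments. Exponential, as expected
# for an NP-complete problem in the absence of a polynomial algorithm.
from itertools import product

def satisfiable(clauses, n):
    """CNF over variables 1..n; literal -i means 'not x_i'."""
    for bits in product([False, True], repeat=n):     # 2^n assignments
        def lit(l):
            return bits[abs(l) - 1] if l > 0 else not bits[abs(l) - 1]
        if all(any(lit(l) for l in clause) for clause in clauses):
            return True                                # found a witness
    return False

# (x1 or not x2) and (not x1 or x2): satisfiable, e.g. x1 = x2 = True
print(satisfiable([[1, -2], [-1, 2]], 2))   # True
# x1 and not x1: unsatisfiable
print(satisfiable([[1], [-1]], 1))          # False
```

Note the asymmetry that NP captures: a satisfying assignment is verified in polynomial time by the `all(any(...))` check, while finding one takes exponential search here.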
Showing a Problem Is NP-Complete
- Two steps to proving a problem L is NP-complete
- Show the problem is in NP
- Demonstrate there is a polynomial-time verifier for the problem
- Show that some known NP-complete problem can be polynomially reduced to L
Space Complexity Classes
Definition: Let f: N → N be a function. The space complexity classes SPACE(f(n)) and NSPACE(f(n)) are the following sets of languages:
SPACE(f(n)) = { L | there is a TM that decides the language L in space O(f(n)) }
NSPACE(f(n)) = { L | there is a nondeterministic TM that decides the language L in space O(f(n)) }
Savitch's Theorem
Theorem: For any function f: N → N with f(n) ≥ n, we have NSPACE(f(n)) ⊆ SPACE((f(n))²). In other words: nondeterminism does not give you much extra for space complexity classes. Compare this with our (lack of) understanding of the time complexity classes TIME and NTIME.
A Hierarchy of Classes
- [Diagram: nested classes, P inside NP inside PSPACE inside EXPTIME]
- P ⊆ NP ⊆ PSPACE = NPSPACE ⊆ EXPTIME
- We don't know how to prove P ≠ PSPACE or NP ≠ EXPTIME. But we do know P ≠ EXPTIME.