Title: Formal Languages and Applications
1 Formal Languages and Applications
- This is a brief summary of the hard copy handed out in class. See that handout for more details.
We know that the Pascal programming language is defined in terms of a CFG. Nearly all other programming languages are also context-free (except for a few special cases, like the input statement in FORTRAN). In this section we will briefly see that the popular HyperText Markup Language (HTML) and the newly emerging Extensible Markup Language (XML), used for e-commerce, are also context-free. These languages are usually presented in a descriptive form; we will see how such a description can be transformed into a formal grammar.
2 HyperText Markup Language (HTML)
HTML consists of text and tags. Matching tags come in pairs of the form <x> and </x> for a string x. Unmatched tags are of the form <x> with no matching part </x>. The following specification is for an item list; it can easily be transformed into the set of production rules shown in the box below.
1. Char is a single character.
2. Text is any string of characters with no tags.
3. Doc represents documents, which are sequences of Elements.
4. Element is either a Text, a pair of matching tags with a Doc between them, or an unmatched tag followed by a Doc.
5. ListItem is the <LI> tag followed by a Doc, which forms a single list item.
6. List is a sequence of zero or more ListItems.
<Char>     → a | A | . . . | z | Z | . . .
<Text>     → <Char><Text> | ε
<Doc>      → <Element><Doc> | ε
<Element>  → <Text> | <x><Doc></x> | <y><Doc>
<ListItem> → <LI><Doc>
<List>     → <ListItem><List> | ε
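For example, assuming <Text> can derive the strings apple and banana, the grammar above generates a two-item list as follows (an illustrative derivation, not from the handout):

<List> ⇒ <ListItem><List>
       ⇒ <LI><Doc><List>
       ⇒ <LI><Element><Doc><List>
       ⇒ <LI><Text><Doc><List>
       ⇒* <LI>apple<List>            (expanding <Text> to apple and <Doc> to the empty string)
       ⇒* <LI>apple<LI>banana        (repeating the same steps for the second item, then <List> ⇒ ε)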
3 XML and Document-Type Definitions
The main purpose of XML is to describe the meaning (i.e., the semantics) of a document. The allowed structure of such documents is declared with a form called a DTD (Document Type Definition). The general form of a DTD is

<!DOCTYPE name-of-DTD [ list of element definitions ]>

where each element definition has the following form.

<!ELEMENT element-name (description of the element)>

Element descriptions are essentially regular expressions, defined as follows. (Notice that definitions 1 and 2 form the base of the definition. Recall that (E1)+ = E1(E1)*.)

1. An element-name.
2. The special term #PCDATA, standing for any text that does not involve XML tags.
3. If E1 and E2 are element descriptions, then E1*, E1+, E1?, E1.E2 (written E1,E2 in a DTD), and E1 | E2 are also element descriptions, which respectively denote the following:
   E1*       zero or more occurrences of E1.
   E1+       one or more occurrences of E1.
   E1?       zero or one occurrence of E1.
   E1.E2     E1 concatenated with (followed by) E2.
   E1 | E2   the union of E1 and E2.
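For example, an element definition using these operators might look as follows (a made-up declaration for illustration, not part of the handout's example): a BOOK consists of a TITLE, one or more AUTHORs, an optional EDITION, and either a HARDCOVER or a PAPERBACK element.

<!ELEMENT BOOK (TITLE, AUTHOR+, EDITION?, (HARDCOVER | PAPERBACK))>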
4 Example: A DTD for personal computers
<!DOCTYPE PcSpecs [
    <!ELEMENT PCS (PC*)>
    <!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
    <!ELEMENT MODEL (#PCDATA)>
    <!ELEMENT PRICE (#PCDATA)>
    <!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
    <!ELEMENT MANF (#PCDATA)>
    <!ELEMENT MODEL (#PCDATA)>
    <!ELEMENT SPEED (#PCDATA)>
    <!ELEMENT RAM (#PCDATA)>
    <!ELEMENT DISK (HARDDISK | CD | DVD)>
    <!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
    <!ELEMENT SIZE (#PCDATA)>
    <!ELEMENT CD (SPEED)>
    <!ELEMENT DVD (SPEED)>
]>
The above DTD form can easily be transformed into a CFG. For example,

<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>

can be transformed into the production rules

<PC>        → <MODEL> <PRICE> <PROCESSOR> <RAM> <DISK+>
<MODEL>     → <Text>
<PRICE>     → <Text>
<PROCESSOR> → <MANF> <MODEL> <SPEED>
<RAM>       → <Text>
<DISK+>     → <DISK> <DISK+> | <DISK>

and so on.
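Similarly, an element description built with the union operator becomes a set of alternative productions; the DISK declaration above, for instance, would translate along the lines of

<DISK> → <HARDDISK> | <CD> | <DVD>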
A part of an XML document conforming to the above DTD is shown below. Notice that each element is delimited by a pair of matching tags bearing the name of the element.
<PCS>
  <PC>
    <MODEL>1234</MODEL>
    <PRICE>3000</PRICE>
    <PROCESSOR> . . . </PROCESSOR>
    <RAM>512</RAM>
    <DISK><HARDDISK>
      <MANF>Superdc</MANF>
      <MODEL>xx1000</MODEL>
      <SIZE>62Gb</SIZE>
    </HARDDISK></DISK>
    <DISK><CD>
      <SPEED>32x</SPEED>
    </CD></DISK>
  </PC>
  <PC>
    . . . .
  </PC>
</PCS>
7 Lex and YACC
Most compilers have two main functional components: a lexical analyzer and a parser. The lexical analyzer reads the input source program and identifies tokens, and the parser, based on those tokens, parses the program (i.e., identifies the relationships between the tokens in terms of a sequence of production rules).
Lex
Since tokens can be expressed in terms of regular expressions, the lexical analyzer can be built based on the model of finite-state automata. (Recall the automaton that we designed for recognizing Pascal numbers.) Lex builds a lexical analyzer from token forms given as (actually, a variation of) regular expressions, and carries out the action given for each token. The input to Lex consists of three parts, separated by %% lines:
- Definitions
- Token descriptions and actions
- User-written code
The overall layout is sketched below.
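A minimal sketch of this layout, assuming the standard %% separators (the section contents here are placeholders, not from the handout):

definitions (named regular expressions, C declarations)
%%
token descriptions and actions
%%
user-written code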
8 Definitions
In the definitions section, each regular expression is defined with a name. In these regular expressions, the operator * is used for the closure, together with the operator +, and the vertical bar | is used for the union operator. Thus (a|b)* denotes any combination of a's and b's including the null string, and (a|b)+ denotes (a|b)* with the null string excluded. Alternation can also be written using brackets. For example, [ab] means (a|b), and [a-z] stands for any symbol from the lower-case alphabet. A question mark indicates that the preceding expression is optional. Thus abc? is equivalent to ab|abc, and (abc)? is abc|ε. The period is used as a wild-card symbol that matches any character. For example, integers and reals can be defined as follows.

digits   [0-9]+
int      {digits}
real     {int}"."{int}([Ee][+-]?{int})?
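Under these (reconstructed) definitions, for instance:

42        matches int
3.14      matches real
2.5E-3    matches real
3.        matches neither (a real needs digits after the period)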
9 Token Descriptions and Actions
On recognizing a token, Lex returns to the parser an indication of what the token is. This section specifies such responses in terms of actions. For example,

{real}    return FLOAT;
{int}     return INTEGER;

User-Written Code
When the action part for a token is complex, it is written as a function and included in this section, to be used through a function call in the actions section.
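Putting the three parts together, a minimal Lex specification along the lines sketched above might look as follows. The token codes FLOAT and INTEGER appear in the handout; the #define values and the standalone main driver are illustrative assumptions (in a real compiler the token codes would come from the YACC-generated header).

%{
#include <stdio.h>
#define INTEGER 1          /* illustrative token codes; normally  */
#define FLOAT   2          /* supplied by the parser's header     */
%}
digits  [0-9]+
int     {digits}
real    {int}"."{int}([Ee][+-]?{int})?
%%
{real}      { return FLOAT; }
{int}       { return INTEGER; }
[ \t\n]     { /* skip white space */ }
.           { /* ignore anything else */ }
%%
int yywrap(void) { return 1; }

int main(void) {                     /* standalone driver for testing */
    int tok;
    while ((tok = yylex()) != 0)
        printf("token code: %d\n", tok);
    return 0;
}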
10 YACC
YACC takes a grammar as its input and generates the parsing tables and a program (in C) that implements a look-ahead LR (LALR) parser. The input also provides semantic actions for each production rule, and YACC generates code for carrying out these actions. The input form for YACC also consists of three sections, as for Lex:
- Declarations and definitions
- Grammar and actions
- User-written code
We will briefly describe each section. (For more details, see a reference manual for YACC.)
Declarations and Definitions
In this section, all tokens, except single-character operators, are defined. To help the parser, we can also specify operator precedence and associativity (left or right) here. To establish proper links to other parts of the C code, this section also includes facilities for declaring variables and types. Here are some examples.
%token ID              /* token for identifiers        */
%token NUMBER          /* token for numbers            */
%token BEGIN END ARRAY FILE
%{
#include "yylex.c"     /* include the lexical scanner  */
extern int yylval;     /* token values from yylex      */
int tcount = 0;        /* a temporary integer variable */
%}
%start S               /* the grammar's start symbol   */
Grammar and Actions
The grammar is defined in a form similar to BNF, as follows.
- Single characters used as terminals are put in single quotes, and non-terminals are written as names with no delimiters.
- Instead of →, a colon is used, and the right end of a production rule is marked by a semicolon.
- A blank is used to represent an ε-production.
Here is an example (the corresponding CFG is shown underneath).

expr : expr '+' term | expr '-' term | term ;
term : term '*' fact | term '/' fact | fact ;
fact : '(' expr ')' | ID ;

E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | id
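Since each rule can carry a semantic action, the example grammar could, for instance, be decorated to compute expression values directly. The $$ and $n notation is standard YACC; using NUMBER in place of ID, so that the leaves carry numeric values, is an assumption for illustration.

expr : expr '+' term   { $$ = $1 + $3; }
     | expr '-' term   { $$ = $1 - $3; }
     | term            { $$ = $1; }
     ;
term : term '*' fact   { $$ = $1 * $3; }
     | term '/' fact   { $$ = $1 / $3; }
     | fact            { $$ = $1; }
     ;
fact : '(' expr ')'    { $$ = $2; }
     | NUMBER          { $$ = $1; }
     ;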
User-written code
The user-written code section contains the main program that invokes the parser, named yyparse(), and other code if needed. So it should contain at least the following.

main() { yyparse(); }
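For concreteness, here is a minimal sketch of how the three YACC sections fit together in a single input file. The line/expr grammar, the hand-written yylex standing in for a Lex-generated scanner, and the error handler are illustrative assumptions, not part of the handout.

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *msg);
%}
%token NUMBER
%%
line : expr '\n'          { printf("value = %d\n", $1); }
     ;
expr : expr '+' NUMBER    { $$ = $1 + $3; }
     | NUMBER             { $$ = $1; }
     ;
%%
/* a tiny hand-written scanner standing in for a Lex-generated one */
int yylex(void) {
    int c = getchar();
    while (c == ' ') c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return NUMBER; }
    if (c == EOF) return 0;                /* end of input           */
    return c;                              /* '+', '\n', etc. as     */
}                                          /* single-character tokens */
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
int main(void) { return yyparse(); }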