Title: Scanning and Parsing
1Scanning and Parsing
2Scanning and Parsing in Squeak
- Defining Lexical and Syntactic Analysis
- Scanning/Tokenizing
- Parsing
- Easy ways to do it
- State Transition Tables
- Recursive Descent Parsing
- More sophisticated way
- T-Gen: Lex and YACC for Squeak
- SmaCC
- XML: SAX, DOM
- SIXX
- Examples from Squeak
- Smalltalk parser
- HTML parser
3Challenge of Compiling
- How do you go from source code to object code?
- Lexical analysis: Figure out the pieces (tokens) of the program: constants, variables, keywords.
- Syntactic analysis: Figure out (and check) the structure (declarations, statements, etc.); also called parsing.
- Interpret the meaning (semantics) of recognized structures, based on declarations.
- Backend: Generate object code.
4Lexical Analysis
- Given a bunch of characters, how do you recognize the key things and their types?
- Simplest way: Parse by white space
- 'This is
- a test
- with returns in it.' findTokens: (Character cr asString), (Character space asString).
- OrderedCollection ('This' 'is' 'a' 'test' 'with' 'returns' 'in' 'it.' )
5But
- 'if(x>y)
- xy' findTokens: (Character cr asString), (Character space asString).
- OrderedCollection ('if(x>y)' 'xy')
- Not what we want!
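A minimal sketch of the same problem, written in Python rather than Squeak so it can be run anywhere. `str.split()` plays the role of `findTokens:` here, and the regular expression at the end is only an illustrative assumption about what a "right" answer could look like, not the scanner developed in these slides.

```python
import re

# Whitespace splitting: the Python analogue of findTokens: with space and CR.
prose = "This is\na test\nwith returns in it."
code = "if(x>y)"

prose_tokens = prose.split()
code_tokens = code.split()

# Hypothetical token pattern: identifiers, integers, or any single symbol.
regex_tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

print(prose_tokens)   # words come out fine
print(code_tokens)    # the whole condition stays glued together
print(regex_tokens)   # pieces a parser could actually use
```

Splitting on white space works for prose because words are separated by blanks; program text is not, which is why a character-at-a-time scanner is needed.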
6Scanning: Doing It Right
- Read in characters one-at-a-time
- Recognize when an important token has arrived
- Return the type and the value of the token
7A Theoretical Tool for Scanning: FSAs
- Finite State Automata (FSA)
- One model of computation that can scan well
- We can make them fast and efficient
- FSAs are:
- A collection of states
- Arcs between states
- Labeled with input symbols
8Example FSA
- State 1 is the start state
- Incoming arrow
- "Incomplete state": can't end there
- State 2 is a terminal or end state: can stop there, recognizing a token
- Consume A's in 1, end with a B in 2
- Valid: AB, AAB, AAAB
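The two-state machine described above can be sketched as a small transition table, here in Python for illustration. The slide's figure is not available, so the exact arcs are an assumption: state 1 loops on A and moves to state 2 on B. Under that reading a lone "B" would also be accepted.

```python
START, END = 1, 2

# (current state, input symbol) -> next state. Assumed arcs, see note above.
TRANSITIONS = {
    (START, "A"): START,  # consume A's in state 1
    (START, "B"): END,    # a B takes us to the end state
}

def accepts(text):
    """Return True if the FSA ends in the end state after reading all of text."""
    state = START
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no arc for this symbol: reject
        state = TRANSITIONS[key]
    return state == END   # state 1 is "incomplete": cannot stop there

for candidate in ["AB", "AAB", "AAAB", "A", "ABA"]:
    print(candidate, accepts(candidate))
```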
9General FSA Processing
- Enter the Start state
- Read input
- Go to the state for that input
- If an End state, can stop
- But may not want to, since we must find the longest possible token (consider scanning an identifier)
10Implementing FSAs
- Easiest way: State Transition Tables
- Read a character
- Append character to VALUE
- Using a table indexed by states and characters, find a new state to move to given current STATE and input CHARACTER
- If end state and no more transitions possible, return VALUE and STATE
- (Sometimes need to do a lookahead: could I grab the next character and be in another end state?)
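The steps above can be sketched as a table-driven scanner. This is a Python illustration under stated assumptions: the states (`ident`, `number`), the character classes, and the fallback `symbol` token are all hypothetical choices, not taken from the slides. The inner loop keeps consuming while the table has a transition, which is one simple way to get the longest possible token.

```python
def char_class(ch):
    """Collapse characters into the classes used to index the table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (STATE, character class) -> new STATE
TABLE = {
    ("start", "letter"): "ident",
    ("start", "digit"): "number",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("number", "digit"): "number",
}
END_STATES = {"ident", "number"}

def scan(text):
    """Return a list of (STATE, VALUE) pairs, one per token."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state, value = "start", ""
        # Look ahead one character at a time; stop when no transition exists.
        while pos < len(text) and (state, char_class(text[pos])) in TABLE:
            state = TABLE[(state, char_class(text[pos]))]
            value += text[pos]
            pos += 1
        if state in END_STATES:
            tokens.append((state, value))
        else:
            # Still in the start state: emit a one-character symbol token.
            tokens.append(("symbol", text[pos]))
            pos += 1
    return tokens

print(scan("if(x1>42)"))
```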
11Syntactic Analysis
- Given the tokens, can we recognize the language?
- Parsing
- The structure for describing the relationship between tokens is called a grammar
- A grammar describes how tokens can be assembled into an acceptable sentence in a language
- We're going to study a kind called context-free grammars
12Context-free grammars
- Made up of a set of rules
- Each rule consists of a left-hand side non-terminal which maps to a right-hand side expression
- Expressions are made up of other non-terminals and terminals
- Rules can be used as replacements
- Either side can be replaced with the other
13Example grammar
- Expression := Factor + Expression
- Expression := Factor
- Factor := Term * Factor
- Factor := Term
- Term := Number
- Term := Identifier (variable)
14Derivation tree using grammar for 3 + 4 * 5
15Implementing Parsing
- Simplest way: Recursive descent parsing
- Each non-terminal maps to a method/function/procedure (m/f/p) in the language
- The m/f/p is responsible for recognizing the related non-terminal
- Including calling another m/f/p as needed
- Use your scanner to supply tokens
16A Simple Equation Recursive Descent Parser
- Expression := Factor + Expression
- Expression := Factor
- expression
- Transcript show: 'Expression'; cr.
- self factor.
- (scanner peek = '+')
- ifTrue: [Transcript show: '+'; cr.
- scanner advance.
- self expression].
17Factor and Term: Simple RD Parsing
- Factor := Term * Factor
- Factor := Term
- Term := Number
- factor
- Transcript show: 'Factor'; cr.
- self term.
- (scanner peek = '*') ifTrue: [
- Transcript show: '*'; cr.
- scanner advance.
- self factor].
- term
- Transcript show: 'Term'; cr.
- (scanner nextIsNumber)
- ifTrue: [Transcript show: 'Number ', (scanner nextToken); cr]
- ifFalse: [Transcript show: 'Error -- Number expected']
18Simulating a Scanner
- tokens: aCollection
- tokens := aCollection
- peek
- ^tokens isEmpty
- ifTrue: [nil]
- ifFalse: [tokens first]
- advance
- tokens := tokens allButFirst.
19Simulating a Scanner
- nextIsNumber
- ^(tokens first select: [:character |
- character asciiValue < $0 asciiValue or: [
- character asciiValue > $9 asciiValue]]) isEmpty
- nextToken
- | token |
- token := self peek.
- token isNil ifFalse: [self advance].
- ^token.
20Trying out the toy parser
- eqn := EquationParser new.
- eqnscan := EquationScanner new.
- eqn scanner: eqnscan.
- eqnscan tokens: ('3 * 4 + 5' findTokens: (Character space asString)).
- eqn expression
21Comparing to the earlier derivation tree: 3 * 4 + 5
- Transcript
- Expression
- Factor
- Term
- Number 3
- *
- Factor
- Term
- Number 4
- +
- Expression
- Factor
- Term
- Number 5
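For readers without a Squeak image at hand, the toy parser and simulated scanner can be sketched in Python. The class and method names mirror the slides, but this is an illustrative port, not the course code: the trace is collected in a list instead of the Transcript, and `next_is_number` guards against an empty token list, which the Smalltalk version does not.

```python
class TokenScanner:
    """List-backed stand-in for a real scanner."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def peek(self):
        return self.tokens[0] if self.tokens else None

    def advance(self):
        self.tokens = self.tokens[1:]

    def next_is_number(self):
        token = self.peek()
        return token is not None and token.isdigit()

    def next_token(self):
        token = self.peek()
        if token is not None:
            self.advance()
        return token


class EquationParser:
    """One method per non-terminal: expression, factor, term."""

    def __init__(self, scanner):
        self.scanner = scanner
        self.trace = []

    def expression(self):  # Expression := Factor + Expression | Factor
        self.trace.append("Expression")
        self.factor()
        if self.scanner.peek() == "+":
            self.trace.append("+")
            self.scanner.advance()
            self.expression()

    def factor(self):  # Factor := Term * Factor | Term
        self.trace.append("Factor")
        self.term()
        if self.scanner.peek() == "*":
            self.trace.append("*")
            self.scanner.advance()
            self.factor()

    def term(self):  # Term := Number
        self.trace.append("Term")
        if self.scanner.next_is_number():
            self.trace.append("Number " + self.scanner.next_token())
        else:
            self.trace.append("Error -- Number expected")


def parse(text):
    parser = EquationParser(TokenScanner(text.split()))
    parser.expression()
    return parser.trace


print("\n".join(parse("3 * 4 + 5")))
```

Because `factor` handles `*` before control returns to `expression`, multiplication binds tighter than addition without any explicit precedence table.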
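22Derivation tree for 3 + 4 * 5
- Transcript
- Expression
- Factor
- Term
- Number 3
- +
- Expression
- Factor
- Term
- Number 4
- *
- Factor
- Term
- Number 5
- Derivation tree (figure): Expression -> Factor + Expression. The left Factor -> Term -> Number -> 3. The right Expression -> Factor -> Term * Factor, where Term -> Number -> 4 and Factor -> Term -> Number -> 5.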
24T-Gen: A Translator Generator for Squeak
25Using T-Gen
- Link to changeSet on co-web (Software page)
- In Morphic: TGenUI open
- Enter your tokens as regular expressions in upper-left
- Enter your grammar in lower-left
- Put in sample code in lower-right
- Transcript for parsing is upper-right
- Processing of each occurs as soon as you accept (Alt/Cmd-S)
- From the transcript pane, you can inspect the result
- Buttons let you specify kind of parser and kind of result
- You can install the resultant scanner and parser into your system
26SmaCC
- Smalltalk Compiler-Compiler: freely available parser generator
- replacement for the T-Gen parser generator
- overcomes T-Gen's limitations
- can generate parsers for ambiguous grammars and grammars with overlapping tokens
- smaller runtime than T-Gen
- faster than T-Gen
- Available via SqueakMap
- Tutorial at http://www.refactory.com/Software/SmaCC/Tutorial.html
27XML Vocabulary
- XML: Extensible Markup Language
- Designed to describe data and focus on what the data is
- Vs. HTML: display data and focus on how data looks.
- It doesn't do anything; it describes data via tags and values.
- Tutorial: http://www.w3schools.com/xml/xml_whatis.asp
28XML
- Must have open/close tags
- Must be properly nested
- Always have a root element
- Parsed document forms a tree structure
- Can be commented
- <!-- This is a comment -->
- Is case sensitive: <Name> != <name>
- Can have attributes: <person sex="male">
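These rules are easy to try out. The sketch below uses Python's standard `xml.etree.ElementTree` parser (not the Squeak XML classes) and a hypothetical helper `is_well_formed`; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single, properly nested XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "properly nested": "<a><b>ok</b></a>",
    "comment and attribute": '<person sex="male"><!-- This is a comment --></person>',
    "badly nested": "<a><b>oops</a></b>",
    "case mismatch": "<Name>Bob</name>",
    "missing close tag": "<a><b>oops</a>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

The first two samples parse; the other four each break one of the rules listed above.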
29Sample XML Description
<CustomerList>
  <CompanyName>Extroon Incorporated</CompanyName>
  <CompanyPhone>770-555-1212</CompanyPhone>
  <customer>
    <name>Bob Waters</name>
    <id>126423</id>
    <addr> 1313 MockingBird Lane </addr>
  </customer>
  <customer>
    <name>Sally Smith</name>
    <id>559382</id>
    <addr>1212 Sunnyvale Retirement Home</addr>
  </customer>
</CustomerList>
30Well-Formed vs. Valid XML
- Just because it is well-formed (syntactically correct) doesn't mean the data is correct
- Need to specify what the data is supposed to look like for the information to be valid
- Can use either Document Type Definition (DTD) or schemas
31Sample Document Type Defn
<!DOCTYPE CustomerList [
  <!ELEMENT CompanyName (#PCDATA)>
  <!ELEMENT CompanyPhone (#PCDATA)>
  <!ELEMENT customers (customer)>
  <!ELEMENT customer (name,id,addr)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
]>
32Sample Schema
- <?xml version="1.0"?>
- <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.cc.gatech.edu/cs2340" xmlns="http://www.cc.gatech.edu/cs2340" elementFormDefault="qualified">
- <xs:element name="CustomerList">
- <xs:complexType>
- <xs:sequence>
- <xs:element name="CompanyName" type="xs:string"/>
- <xs:element name="CompanyPhone" type="xs:string"/>
33Schema Continued
- <xs:element name="customer">
- <xs:complexType>
- <xs:sequence>
- <xs:element name="name" type="xs:string"/>
- <xs:element name="id" type="xs:string"/>
- <xs:element name="addr" type="xs:string"/>
- </xs:sequence>
- </xs:complexType>
- </xs:element>
- </xs:sequence>
- </xs:complexType>
- </xs:element>
- </xs:schema>
34Parsing XML
- You could do it yourself...
- SAX: Simple API for XML
- Event-Based
- Report parsing events and handle as they happen
- www.saxproject.org
- DOM: Document Object Model
- Tree-Based
- Parse entire doc into tree, then query
- www.w3.org/DOM
35SAX Example
- <?xml version="1.0"?>
- <doc>
- <para>Hello, world!</para>
- </doc>
start document
start element: doc
start element: para
characters: Hello, world!
end element: para
end element: doc
end document
36Using SAX
- Subclass the class SAXHandler
- Override as necessary the messages:
- startDocument
- endDocument
- startElement: aName attributeList: attributes
- endElement: aName
- characters: aString
37Some Code
- SAXHandler subclass: #MySampleSaxThing
- instanceVariableNames: ''
- classVariableNames: ''
- poolDictionaries: ''
- category: 'XML-Parser'
38More Code
- startElement: elementName attributeList: attributeList
- Transcript show: 'Processing Element ';
- show: elementName;
- cr.
- characters: aString
- Transcript show: 'Got characters ';
- show: aString;
- cr
39Starting it Up
MySampleSaxThing parseDocumentFromFileNamed: 'sample.xml'
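The same event-based style exists outside Squeak. As a rough analogue (not the Squeak `SAXHandler` API), Python's standard `xml.sax` module lets a `ContentHandler` subclass override the equivalent callbacks. The class name `SampleHandler` and the `events` list are hypothetical; the document is the "Hello, world!" sample from the SAX example slide.

```python
import xml.sax

class SampleHandler(xml.sax.ContentHandler):
    """Record each parsing event as it is reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        # Skip chunks that are only whitespace between tags.
        if content.strip():
            self.events.append("characters: " + content)

SAMPLE = b'<?xml version="1.0"?><doc><para>Hello, world!</para></doc>'

handler = SampleHandler()
xml.sax.parseString(SAMPLE, handler)
print("\n".join(handler.events))
```

Nothing is kept in memory except what the handler chooses to store, which is the main trade-off against a tree-based parser.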
40For DOM, we get document first
f := FileStream fileNamed: 'samplexml2.xml'.
x := XMLDOMParser parseDocumentFrom: f.
x now contains an object of type XMLDocument.
Note that DOM uses SAX to build the in-memory tree.
41By Jonathan D'Andries
42Getting elements out
document elements returns an OrderedCollection of elements in the document.
(document elements) at: 1 gets us the root XMLElement. So do:
document topElement
document elementAt: rootElementName
We can then use firstTagNamed: #customer.
We can also use tagsNamed: #customer do: aBlock to execute the same code for each matching tag.
43Playing with DOM Directly
f := FileStream fileNamed: 'samplexml2.xml'.
x := XMLDOMParser parseDocumentFrom: f.
f close.
e := x elements.
n := e at: 1.
n name.
n tag.
n contentString.
c := n firstTagNamed: #customer.
n tagsNamed: #customer do: [:i | Transcript show: i; cr].
44Writing a Custom Class -- Looking up specific elements
lookup: aName
| top ele |
top := document topElement.
top tagsNamed: #customer do: [:tag |
ele := tag firstTagNamed: #name.
Transcript show: 'Examining "';
show: ele characterData;
show: '"'; cr.
ele characterData = aName ifTrue: [
Transcript show: 'Found the entry'.
self showData: tag.
^aName]].
Transcript show: 'Entry Not Found'.
^'No such customer'
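The tree-based lookup can be sketched with Python's standard `xml.dom.minidom`, which follows the W3C DOM rather than the Squeak `XMLDOMParser` classes. The function name `lookup` mirrors the slide; returning the customer's id is an assumption made for the example, and the sample data is a trimmed copy of the customer list shown earlier.

```python
from xml.dom import minidom

SAMPLE = """<CustomerList>
  <CompanyName>Extroon Incorporated</CompanyName>
  <customer><name>Bob Waters</name><id>126423</id></customer>
  <customer><name>Sally Smith</name><id>559382</id></customer>
</CustomerList>"""

def lookup(document, a_name):
    """Return the <id> text of the customer called a_name, or None."""
    top = document.documentElement
    for tag in top.getElementsByTagName("customer"):
        name_element = tag.getElementsByTagName("name")[0]
        # The element's first child is the text node holding the name.
        if name_element.firstChild.data == a_name:
            return tag.getElementsByTagName("id")[0].firstChild.data
    return None

document = minidom.parseString(SAMPLE)
print(lookup(document, "Sally Smith"))
print(lookup(document, "Nobody"))
```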
45Making document from scratch
- createHeader
- | aTopElement |
- document := XMLDocument new.
- aTopElement := XMLElement named: 'CustomerList'
- attributes: Dictionary new.
- aTopElement addElement: (self makeSubElement:
- 'CompanyName' content: 'FooBar Inc').
- aTopElement addElement: (self makeSubElement:
- 'CompanyPhone' content: '990-555-1345').
- document addElement: aTopElement
46Making a string subelement
makeSubElement: aTagName content: aStringContent
| anXMLElement |
anXMLElement := XMLElement named: aTagName attributes: Dictionary new.
anXMLElement addContent: (XMLStringNode string: aStringContent).
^anXMLElement
47Making a subgroup
createCustomer: aName id: anId status: aStatus
| top aCustElement |
top := document topElement.
aCustElement := XMLElement named: 'Customer' attributes: Dictionary new.
aCustElement attributeAt: 'status' put: aStatus.
aCustElement addElement: (self makeSubElement: 'name' content: aName).
aCustElement addElement: (self makeSubElement: 'id' content: anId).
top addElement: aCustElement
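Building a document from scratch follows the same three steps in most DOM libraries: create the document, create a top element, then append sub-elements. A sketch with Python's `xml.dom.minidom` is below. The function names are Python-style renderings of the slide's methods, and the status value "active" is a made-up example.

```python
from xml.dom import minidom

def make_sub_element(document, tag_name, content):
    """Create <tag_name>content</tag_name>."""
    element = document.createElement(tag_name)
    element.appendChild(document.createTextNode(content))
    return element

def create_header(company_name, company_phone):
    """Create a document whose root is <CustomerList> with company details."""
    document = minidom.Document()
    top = document.createElement("CustomerList")
    top.appendChild(make_sub_element(document, "CompanyName", company_name))
    top.appendChild(make_sub_element(document, "CompanyPhone", company_phone))
    document.appendChild(top)
    return document

def create_customer(document, name, customer_id, status):
    """Append a <Customer status=...> group under the root element."""
    customer = document.createElement("Customer")
    customer.setAttribute("status", status)
    customer.appendChild(make_sub_element(document, "name", name))
    customer.appendChild(make_sub_element(document, "id", customer_id))
    document.documentElement.appendChild(customer)
    return customer

document = create_header("FooBar Inc", "990-555-1345")
create_customer(document, "Bob Waters", "126423", "active")
print(document.documentElement.toprettyxml(indent="  "))
```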
48SIXX
- Smalltalk Instance eXchange in XML
- SIXX is an XML serializer/deserializer
- Store and load Smalltalk objects in a portable, dialect-independent XML format.
- Pointer on co-web
49Using SIXX
- SixxWriteStream and SixxReadStream
- Write/read Smalltalk objects in a binary-object-stream-like way.
- Writing objects to an external file:
- sws := SixxWriteStream newFileNamed: 'obj.sixx'.
- sws nextPut: <object>.
- sws nextPutAll: <collection of object>.
- sws close.
- And to read objects from an external file:
- srs := SixxReadStream readOnlyFileNamed: 'obj.sixx'.
- objects := srs contents.
- srs close.
51Smalltalk Parser
52Smalltalk's Parser is Recursive Descent!
- Scanner methods are in Parser
- Scanning method category: advance, endOfLastToken, match:, matchToken:, startOfNextToken
- All the kinds of messages are defined in Expression Types
- argumentName, assignment:, blockExpression, braceExpression, cascade, expression, messagePart:repeat:, method:context:, pattern:inContext:, primaryExpression, statements:innerBlock:, temporaries, temporaryBlockVariables, variable
53Example: Parsing an Assignment
- assignment: varNode
- " var ':=' expression => AssignmentNode."
- | loc |
- (loc := varNode assignmentCheck: encoder at: prevMark + requestorOffset) >= 0
- ifTrue: [^self notify: 'Cannot store into' at: loc].
- varNode nowHasDef.
- self advance.
- self expression ifFalse: [^self expected: 'Expression'].
- parseNode := AssignmentNode new
- variable: varNode
- value: parseNode
- from: encoder.
- ^true
54HtmlParser
- Used for Scamper
- HtmlParser parse: '
- <html>
- <head>
- <title>Fred the Page</title>
- </head>
- <body>
- <h1>Fred the Body</h1>
- This is a body for Fred.
- </body>
- </html>'
55HtmlParser returns an HtmlDocument
- HtmlDocument has contents, which is an OrderedCollection
- HtmlHead
- HtmlBody
- HtmlEntity hierarchy exists
56Walk the Object Structure
- doc := HtmlParser parse: '
- <html>
- <head>
- <title>Fred the Page</title>
- </head>
- <body>
- <h1>Fred the Body</h1>
- This is a body for Fred.
- </body>
- </html>'.
- body := doc contents last. "This should be an HtmlBody"
- body contents detect: [:entity | entity isKindOf: HtmlHeader]. "This should be the first heading."
- PrintIt:
- <'h1'>
- Fred the Body
57Next Week
- M5 due Tuesday
- Alan Kay video: The Computer Revolution Hasn't Happened Yet
- Optimizing Squeak