Title: Information Infrastructure II
1Information Infrastructure II
- I211 Week 9
- Rajarshi Guha
2Outline
- Brief overview of HTML tags
- The BeautifulSoup module
- Parsing HTML
- Introduction to Python CGI
- mod_python
3HTML Overview
- A series of tags indicating what a document should look like in a browser
- HTML looks a lot like XML, but only XHTML is guaranteed to be valid XML
- Two types of HTML
- Ordinary HTML
- XHTML
    <html>
    <head></head>
    <body>
    This is <I>italic</I> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p>
    </body>
    </html>
4Ordinary HTML
- This is what is commonly found on the Net
- This version is not always a valid XML document
- Closing tags can be dropped sometimes
- The browser will try its best to work out what is meant
- But it's invalid XML
- Painful for us to parse

    <html>
    <head></head>
    <body>
    This is <I>italic</I> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p>
    </body>
5XHTML
- This is a stricter form of HTML
- All opening tags must have closing tags
- A valid XHTML document is a valid XML document
- Can be easily parsed using any XML parser

    <html>
    <head></head>
    <body>
    This is <i>italic</i> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p />
    </body>
    </html>
6Parsing HTML in Python
- Since it's supposed to be XML, we could use ElementTree
- This will work for well-formed HTML or XHTML documents
- But if tags are not closed or are in the wrong order, this will fail
- But browsers still render such pages properly!
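The difference can be checked with a minimal sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree. The helper try_parse is a hypothetical name introduced here, and the two documents are shortened versions of the slide examples.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every opening tag has a matching closing tag.
xhtml = ('<html><head></head><body>This is <i>italic</i> text followed by a '
         '<a href="http://www.google.com">hyperlink</a><p /></body></html>')

# Ordinary HTML: the <p> tag is never closed.
html = ('<html><head></head><body>This is <i>italic</i> text followed by a '
        '<a href="http://www.google.com">hyperlink</a><p></body></html>')


def try_parse(text):
    """Return the root tag name, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        return None


print(try_parse(xhtml))  # the XHTML document parses
print(try_parse(html))   # the ordinary HTML document is rejected
```

A strict XML parser stops at the first mismatched tag, which is why a more forgiving parser is needed for pages found on the Net.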
7Parsing HTML in Python
- Browsers are able to show malformed HTML because they employ heuristics
- We'd also like to employ these heuristics

    <html>
    <head></head>
    <body>
    This is <i>italic</i>
    <form name="aform">
    <table>
    <tr><td> Blah</td></tr>
    </form>
    </table>
    </html></body>

    <html>
    <head></head>
    <body>
    This is <i>italic</i> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p />
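As a rough illustration of tolerant parsing, the sketch below feeds the first malformed document to html.parser from the Python 3 standard library. This is an assumption made so the example is self-contained: HTMLParser only reports tags as it meets them and does not repair the tree the way a browser or BeautifulSoup does. The class name TagLogger is hypothetical.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening and closing tag in the order it is seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


# Closing tags are in the wrong order: </form> comes before </table>.
malformed = ('<html><head></head><body>This is <i>italic</i>'
             '<form name="aform"><table><tr><td>Blah</td></tr>'
             '</form></table></html></body>')

logger = TagLogger()
logger.feed(malformed)   # no exception, despite the bad nesting
logger.close()
print(logger.events[-4:])
```

The parser reaches the end of the document without complaint, so a program built on it can still look at every tag. Deciding what the author meant by the bad nesting is the part that needs heuristics.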
8Beautiful Soup
- Copy BeautifulSoup.py from the class web page to your current directory
- Add the following import to your file: from BeautifulSoup import BeautifulSoup
- We can load and parse an HTML document by supplying a string to the constructor
9BeautifulSoup
- BeautifulSoup expects to get a string or a file-like object
- Once you've made an object of class BeautifulSoup, we're ready to extract stuff

    from BeautifulSoup import BeautifulSoup

    s = '<html><body>Hello <I>World</I> this is a <a href="http://blah.com">link</a></html>'
    soup = BeautifulSoup(s)
10BeautifulSoup from URLs
- Since the constructor can take a file-like object, we can combine it with urlopen
- Code is much simpler to read
- Shorter to write
- Should handle the relevant exception

    from BeautifulSoup import BeautifulSoup
    import urllib

    # urlopen raises IOError if the page cannot be fetched
    con = urllib.urlopen('http://www.google.com')
    soup = BeautifulSoup(con)
11What Can We Do with the Soup?
- Most common task is to get various HTML elements
- We can access them like we did for XML tags with ElementTree
- But it's a little more flexible. We can access elements by:
- Tag name (<a> or <i> or <table> etc.)
- Tag name & CSS class
12Useful Concepts & Methods
- A soup has two types of object
- Tags - <html>, <b> etc.
- Tag objects can have attributes: <p id="menu" align="left">A paragraph</p>
- NavigableStrings - the text of an element
13Useful Concepts & Methods
- A Tag object behaves like a dictionary
- You can get attribute values by their names

    from BeautifulSoup import BeautifulSoup

    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

    firstPTag, secondPTag = soup.findAll('p')
    firstPTag['id']
    firstPTag['align']
14Properties of Tag Objects
- All tags will have the following properties
- parent - the parent element containing the tag object in question
- contents - the contents of the tag. It might be a simple string or it might be other elements
- string - if the contents is a simple string, this will return the string, otherwise None
15Properties of Tag Objects
    from BeautifulSoup import BeautifulSoup as BS
    from urllib import urlopen

    doc = """<html>...</html>"""   # document elided on the slide; see ex1.py
    soup = BS(doc)

    head = soup.find('head')
    print head.contents
    print head.string

    title = soup.find('title')
    print title.contents
    print title.string

    # you can also directly access elements
    print 'Directly accessing elements is neat!\n\n'
    print soup.head
    print soup.title

- ex1.py on the class web has the full code
- We get the first element of a certain type using find
- But we can also use the element name as a property of the BeautifulSoup object
16Contents of a Tag Object
- We saw that the contents property gives you a list of all the tags contained in the current tag or else the plain string
- Just as with ElementTree, we can loop through these tags
- In some cases this is required
- Unlike ElementTree we cannot specify a path like body/div/a
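For comparison, here is a short sketch of the ElementTree side of this contrast: looping over child elements and looking up an element by path. It assumes Python 3 and a small well-formed document invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed document used only for this illustration.
doc = ('<html><head><title>Page title</title><style></style></head>'
       '<body><div><a href="http://blah.com">link</a></div></body></html>')

root = ET.fromstring(doc)

# Looping over the children of <head>
child_tags = [child.tag for child in root.find('head')]
print(child_tags)

# ElementTree accepts a path relative to the root element
link = root.find('body/div/a')
print(link.get('href'))
```

With BeautifulSoup the loop has a direct counterpart in the contents property, while the path lookup has to be replaced by a search or by walking down tag by tag.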
17Looping Over Sub Tags
- The <head> element will contain a few fixed tags like <title>, <script>, <style>
- We can get the <head> tag and then loop over its contents
- See ex2.py on the class web site

    from BeautifulSoup import BeautifulSoup as BS
    from urllib import urlopen

    # doc holds the HTML text (see ex2.py)
    soup = BS(doc)

    head = soup.find('head')
    for item in head.contents:
        print item

    for item in soup.head.contents:
        print item
18Useful Concepts & Methods
- Searching the HTML tree
- findAll()
- Finds elements that match a set of criteria
- Lots of different ways to specify criteria
- Gets you a list of Tag objects which you can then loop over
- find()
- Like the above, but just gets the first matching Tag object
19Getting All the Links from a Web Page
- A link is represented by the a tag
- Can have attributes
- href
- target
- name
- We'd like to find
- All the a tags
- a tags that represent web links
20Getting All the Links from a Web Page
- A link is represented by the a tag
- We can get a list of a elements
- We can then access the attributes of each element

    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen

    con = urlopen('http://www.google.com')
    soup = BeautifulSoup(con)
    anchors = soup.findAll('a')
    print 'Found %d links' % (len(anchors))
21Getting All the Links from a Web Page
- A link is represented by the a tag
- We can get a list of a elements
- We can then access the attributes of each element as if they were dictionaries

    for anchor in anchors:
        print anchor['href']

    for anchor in anchors:
        print anchor['target']
22Getting All the Links from a Web Page
- We can get attributes of a Tag by accessing the tag like a dictionary, with the attribute name
- What if the tag does not have the attribute?
- Not all a tags have a target attribute
- Can't call the keys() method
- Have to catch an exception

    # This does not work: a Tag has no keys() method
    for anchor in anchors:
        if 'target' in anchor.keys():
            print anchor['target']

    # Instead, catch the KeyError
    for anchor in anchors:
        try:
            print anchor['target']
        except KeyError, e:
            print 'No target attribute'
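The same idea can be tried without BeautifulSoup. The sketch below is an assumption-laden stand-in: it uses html.parser from the Python 3 standard library, stores each anchor's attributes in a plain dictionary, and uses dict.get so that a missing attribute does not raise KeyError. The class name AnchorCollector and the page text are made up for the illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a plain dictionary."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchors.append(dict(attrs))


# A made-up page: one link with a target, one without, one named anchor.
page = ('<html><body>'
        '<a href="http://www.google.com" target="_blank">Search</a>'
        '<a href="http://blah.com">Plain link</a>'
        '<a name="top">Named anchor</a>'
        '</body></html>')

collector = AnchorCollector()
collector.feed(page)

print('Found %d links' % len(collector.anchors))
for anchor in collector.anchors:
    # dict.get returns a default instead of raising KeyError
    print(anchor.get('href'), anchor.get('target', 'No target attribute'))
```

Only the anchors that carry an href attribute are web links, so a check on anchor.get('href') separates them from named anchors.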
23Better Control Over Searching
- We saw that we could find all elements of a certain type by using the element name
- a for anchor elements
- img for image elements
- head for the head element
- BeautifulSoup also allows us to get elements based on attribute names and values
- Get all elements that have an align attribute whose value is center
24Better Control Over Searching
- Search for elements whose align attribute is center
- Search for elements whose align attribute can be center or blah
- Search for elements which have an align attribute
- Search for elements with no align attribute

    elems = soup.findAll(align='center')
    elems = soup.findAll(align=['center', 'blah'])
    elems = soup.findAll(align=True)
    elems = soup.findAll(align=None)
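To make the four kinds of criteria concrete, here is a small sketch that imitates them. It assumes Python 3 and uses html.parser from the standard library rather than BeautifulSoup. The function find_all and the class ElementCollector are hypothetical helpers written for this illustration, not part of any library.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect (tag name, attribute dict) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def find_all(elements, attr, wanted):
    """Hypothetical helper imitating the four align searches.

    wanted may be a string (exact match), a list (any of the values),
    True (attribute present) or None (attribute absent).
    """
    matches = []
    for tag, attrs in elements:
        if wanted is None:
            ok = attr not in attrs
        elif wanted is True:
            ok = attr in attrs
        elif isinstance(wanted, list):
            ok = attrs.get(attr) in wanted
        else:
            ok = attrs.get(attr) == wanted
        if ok:
            matches.append(tag)
    return matches


doc = ('<p id="firstpara" align="center">one</p>'
       '<p id="secondpara" align="blah">two</p>'
       '<div>three</div>')
collector = ElementCollector()
collector.feed(doc)

print(find_all(collector.elements, 'align', 'center'))
print(find_all(collector.elements, 'align', ['center', 'blah']))
print(find_all(collector.elements, 'align', True))
print(find_all(collector.elements, 'align', None))
```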
25Better Control Over Searching
- What if we want to search for elements which have a class attribute?
- The first example leads to a syntax error
- class is a Python keyword
- In such cases, use the attrs argument

    elems = soup.findAll(class='bigpara')    # syntax error
    elems = soup.findAll(attrs={'class': 'para'})
    elems = soup.findAll(attrs={'id': 'junk'})
26Web Forms
- Generating HTML to display in a page is pretty easy
- Just string operations like formatting and concatenation
- The fun starts when we can provide data and get back results
- Need to provide forms on a page
27HTML Input Elements
- <input>
- Single line text field
- Radio buttons
- Checkboxes
- Submit buttons
- <textarea> - a multi-line text box
- <select> - drop down lists
28Implementing an HTML Form
- Form elements must be enclosed within <form> </form> tags
- Have to specify a URL which will handle the form
- Should have a submit button so that the user can indicate processing should start
- See http://cheminfo.informatics.indiana.edu/rguha/class/2007/i211/week9/form.html
29An Example HTML Form
    <html>
    <head> </head>
    <body>
    <h1>An example form</h1>
    <form action="http://url/to/cgi/program" name="aForm" method="get">
    <input type="text" name="textfield"> <br>
    <select name="cars">
    <option value="volvo">Volvo</option>
    <option value="saab">Saab</option>
    <option value="fiat">Fiat</option>
    <option value="audi">Audi</option>
    </select>
    <br>
    <textarea name="textbox" rows="10" cols="40"></textarea>
    <br>
    <input type='submit' name='dowork' value='Click Me!'>
    </form>
    </body></html>
30Using the Data
- Setting up the form is easy
- But we need to write a CGI program that does something with the data
- Lots of ways to do Python CGI
- We'll consider mod_python
- Have to work on Sulu
31Python CGIs
- We have seen CGI programs before
- The Yahoo finance CGI
- The CGI programs may or may not take arguments
- How do we write the actual CGI program?
- Where does it get its arguments from?
32mod_python
- This is one way to write Python CGIs
- Designed to work with Apache
- Via mod_python, a Python program can be accessed as if it were a URL
- Functions within the program become components of the URL
- We can pass arguments to the function using the ?arg1=foo&arg2=bar form
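The argument syntax can be explored with a short sketch that assumes Python 3 and urllib.parse from the standard library. It builds a URL with two arguments and decodes it again. The host name is the placeholder used elsewhere in these slides, and the variable names are introduced only for the illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Arguments to send, using the names from the example above.
outgoing = {'arg1': 'foo', 'arg2': 'bar'}

# The encoded arguments follow the "?" at the end of the URL.
url = 'http://some.host.com/someprog.py?' + urlencode(outgoing)
print(url)

# The receiving side splits off the query part and decodes it.
# Each name maps to a list, because a name may appear more than once.
incoming = parse_qs(urlsplit(url).query)
print(incoming)
```

A form with method="get" builds this kind of URL for you from the names and values of its input elements.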
33Writing a mod_python Program
- We'd like to be able to access a URL like http://some.host.com/someprog.py
- Get back some basic HTML
- We'll worry about arguments later on
34Writing a mod_python Program
- The URL just indicates a Python program
- Doesn't say anything about which function should be called
- mod_python assumes that you want to call the function called index
- The idea is similar to index.html
35Hello World on the Web
- Place the code in hello.py and place it under public_html on sulu
- Then navigate to http://sulu.informatics.indiana.edu/XXX/hello.py

    def index(req):
        req.content_type = 'text/html'
        return 'Hello World'
36mod_python Terminology
- A Python function that is to be called via a URL is a handler
- A handler must take a single argument
- It will be an mp_request object
- Your program can have functions that will not be accessed by a URL
- These can be written any way you want to (no args, 10 args etc.)
37mod_python Request Object
- See http://www.modpython.org/live/current/doc-html/pyapi-mprequest.html
- Lots of methods and members
- We'll focus mainly on two members
- content_type
- form
38Content Types
- When your browser requests a page from a web server, it needs to know what type of data it's getting back
- If it's HTML it'll display it
- If it's plain text, no need to process it, so it just dumps it to screen
- If it's an MP3, it will save to disk or open up a media player
- How does the web server indicate the type of content?
39Content Types
- The HTTP protocol specifies that one of the header fields in an HTTP response is called Content-Type and is set to some value
- The value of this field is chosen from a list of well-known, agreed upon values (MIME types)
- See the list of registered MIME types for the values for different types of data
40Content Types
- We are interested mainly in HTML pages or plain text pages
- So, if the web server is to return an HTML page the content type should be text/html
- If the web server is to return a plain text page (i.e., no markup) then the content type is text/plain
- How does this affect mod_python programs?
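A quick way to look up content type values is the mimetypes module in the Python 3 standard library, which maps file names to content types. The sketch below is only an illustration of the values discussed here: the file names are made up, and a web server may well use its own table.

```python
import mimetypes

# Made-up file names covering the three kinds of data mentioned above.
guessed = {}
for filename in ['page.html', 'notes.txt', 'song.mp3']:
    content_type, encoding = mimetypes.guess_type(filename)
    guessed[filename] = content_type
    print(filename, '->', content_type)
```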
41mod_python Content Type
- Apache will send the return value of a mod_python handler back to the user
- Return a string representing an HTML page, the user should see the HTML page
- Return binary data representing an MP3, the user should get an MP3
- Before returning, you must set the content_type member of the request object
42mod_python Content Type
- In general, you just need to add req.content_type = 'text/html' somewhere before returning
- If you're not returning HTML and it is some sort of text, use text/plain

    def index(req):
        req.content_type = 'text/html'
        return 'Hello World'
43mod_python Caveats
- If there is an error in your code, using print will not help, as you will not see the result in the browser
- You should see the exception in the browser
- To print out intermediate results, return them
- If they are not simple objects, Python will convert them to a string representation using str()
44mod_python Caveats
- This code will return an error page
- The exception will tell you where the error is
- You can't do print d to see what is in the dictionary

    def index(req):
        req.content_type = 'text/plain'
        d = {}
        d['one'] = 1
        ret = d['two']    # KeyError: there is no key 'two'
        return ret
45mod_python Caveats
- Instead, return the dictionary
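- This will cause Python to call str() on the dictionary and Apache will simply present the representation of the dictionary

    def index(req):
        req.content_type = 'text/plain'
        d = {}
        d['one'] = 1
        return d          # temporary: show the dictionary in the browser
        ret = d['two']    # not reached while the early return is in place
        return ret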
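Handlers are ordinary functions, so this debugging trick can be tried outside Apache. The sketch below assumes Python 3 and replaces the real request object with FakeRequest, a hypothetical stand-in that only has a content_type member. The handler names broken_index and debug_index are also made up for the illustration.

```python
class FakeRequest:
    """Hypothetical stand-in for the request object that mod_python passes in."""
    content_type = None


def broken_index(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    return d['two']          # KeyError: 'two' was never stored


def debug_index(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    return str(d)            # send the dictionary back as text instead


req = FakeRequest()
try:
    broken_index(req)
except KeyError as err:
    print('handler failed with KeyError:', err)

print(debug_index(req))
```

Calling str() explicitly makes the conversion visible. Under mod_python, the slides state that the conversion happens for you when the dictionary itself is returned.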
46More Than One Handler?
- The handler called index is special
- It allows you to access the CGI without specifying a specific function
- You can write other handlers in the same program
- Each one must take a single request object as its argument
- Accessed by http://some.host.com/myprog.py/myhandler
47More Than One Handler?
- Download hp3.py
- Does not have an index handler
- http://sulu.informatics.indiana.edu/rguha/hp3.py will not work

    def asHTML(req):
        req.content_type = 'text/html'
        return '<html><Body><b>Hello</b></body></html>'

    def asText(req):
        req.content_type = 'text/plain'
        return '<html><Body><b>Hello</b></body></html>'
48More Than One Handler?
- http://sulu.informatics.indiana.edu/rguha/hp3.py/asHTML
- http://sulu.informatics.indiana.edu/rguha/hp3.py/asText
- Both the above work
- Depending on how we construct the URL, we can control which handler gets called

    def asHTML(req):
        req.content_type = 'text/html'
        return '<html><Body><b>Hello</b></body></html>'

    def asText(req):
        req.content_type = 'text/plain'
        return '<html><Body><b>Hello</b></body></html>'
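The mapping from URL to handler can be pictured with a toy dispatcher. This is a sketch of the idea only, not how mod_python is implemented. It assumes Python 3, and FakeRequest, HANDLERS and dispatch are hypothetical names introduced for the illustration.

```python
class FakeRequest:
    """Hypothetical stand-in for the mod_python request object."""
    content_type = None


def asHTML(req):
    req.content_type = 'text/html'
    return '<html><body><b>Hello</b></body></html>'


def asText(req):
    req.content_type = 'text/plain'
    return '<html><body><b>Hello</b></body></html>'


# Hypothetical table of the functions that may be reached through a URL.
HANDLERS = {'asHTML': asHTML, 'asText': asText}


def dispatch(path):
    """Pick a handler from the part of the path after the program name."""
    _, _, rest = path.partition('.py')
    name = rest.strip('/') or 'index'   # no function named: fall back to index
    handler = HANDLERS.get(name)
    if handler is None:
        return None, None               # e.g. hp3.py has no index handler
    req = FakeRequest()
    body = handler(req)
    return req.content_type, body


print(dispatch('/rguha/hp3.py/asHTML'))
print(dispatch('/rguha/hp3.py/asText'))
print(dispatch('/rguha/hp3.py'))
```

Both handlers return the same string. Only the content type differs, which is what decides whether the browser renders the markup or shows it as plain text.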
49mod_python Summary
- A handler is just a Python function that takes a single argument
- The argument is a request object
- You can have as many handlers as you want
- The handler called index is special
- We can invoke it by just using the program name in the URL
50mod_python Summary
- The return value of your handler is what gets sent back to the user
- Before returning make sure to set the proper content type by doing req.content_type = 'some content type'
- Here req is the name of the argument for the handler