Information Infrastructure II - PowerPoint PPT Presentation

Title: Information Infrastructure II

1
Information Infrastructure II

I211 Week 9
Rajarshi Guha

2
Outline

Brief overview of HTML tags
The BeautifulSoup module
Parsing HTML
Introduction to Python CGI
mod_python

3
HTML Overview

A series of tags indicating what a document
should look like in a browser
HTML closely resembles XML (both derive from SGML), but only XHTML is guaranteed to be valid XML
Two types of HTML
Ordinary HTML
XHTML

<html>
<head></head>
<body>
This is <I>italic</I> text followed by a
<a href="http://www.google.com">hyperlink</a>
<p>
</body>
</html>
4
Ordinary HTML

This is what is commonly found on the Net
This version is not always a valid XML document
Closing tags can sometimes be dropped
The browser will try its best to work out what
is meant
But it's invalid XML
Painful for us to parse

<html>
<head></head>
<body>
This is <I>italic</I> text followed by a
<a href="http://www.google.com">hyperlink</a>
<p>
</body>
5
XHTML

This is a stricter form of HTML
All opening tags must have closing tags
A valid XHTML document is a valid XML document
Can be easily parsed using any XML parser

<html>
<head></head>
<body>
This is <i>italic</i> text followed by a
<a href="http://www.google.com">hyperlink</a>
<p />
</body>
</html>
6
Parsing HTML in Python

Since it's supposed to be XML, we could use
ElementTree
This will work for well-formed HTML or XHTML
documents
But if tags are not closed or are in the wrong
order, this will fail
Yet browsers still display such pages properly!
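
The failure described above can be reproduced with a minimal sketch. It uses the standard-library xml.etree.ElementTree module and is written in Python 3 syntax (the slides themselves use Python 2). The two sample strings and the helper name parses_as_xml are illustrative assumptions, not part of the course code.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every opening tag has a matching closing tag.
xhtml = ('<html><head></head><body>This is <i>italic</i> text followed by '
         'a <a href="http://www.google.com">hyperlink</a><p /></body></html>')

# Ordinary HTML: the <p> tag is never closed.
sloppy = ('<html><head></head><body>This is <i>italic</i> text followed by '
          'a <a href="http://www.google.com">hyperlink</a><p></body></html>')


def parses_as_xml(text):
    """Return True if ElementTree accepts the text, False on a ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(xhtml))   # True
print(parses_as_xml(sloppy))  # False
```

A browser would render both strings the same way; only the strict XML parser tells them apart.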

7
Parsing HTML in Python

Browsers are able to show malformed HTML because
they employ heuristics
We'd also like to employ these heuristics

<html>
<head></head>
<body>
This is <i>italic</i>
<form name="aform">
<table>
<tr><td>Blah</td></tr>
</form>
</table>
</html></body>

<html>
<head></head>
<body>
This is <i>italic</i> text followed by a
<a href="http://www.google.com">hyperlink</a>
<p />
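
As a rough illustration of tolerant parsing, the sketch below feeds the first malformed document to Python 3's standard-library html.parser. This is a swapped-in technique, not what BeautifulSoup or a browser does internally: the parser simply reports tags as it meets them and does not complain about closing tags in the wrong order. The class name TagCollector is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag name without insisting on well-formed input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The closing tags below are in the wrong order, as in the slide example.
malformed = ('<html><head></head><body>This is <i>italic</i> '
             '<form name="aform"><table><tr><td>Blah</td></tr>'
             '</form></table></html></body>')

collector = TagCollector()
collector.feed(malformed)
collector.close()
print(collector.tags)
```

No exception is raised, and every opening tag is still seen in document order.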
8
Beautiful Soup

Copy BeautifulSoup.py from the class web page to
your current directory
Add the following import to your file:
from BeautifulSoup import BeautifulSoup
We can load and parse an HTML document by
supplying a string to the constructor

9
BeautifulSoup

BeautifulSoup expects to get a string or a file
like object
Once you've made an object of class
BeautifulSoup, we're ready to extract stuff

from BeautifulSoup import BeautifulSoup
s = '<html><body>Hello <I>World</I> this is a <a href="http://blah.com">link</a></html>'
soup = BeautifulSoup(s)
10
BeautifulSoup from URLs

Since the constructor can take a file-like
object, we can combine it with urlopen
Code is much simpler to read
Shorter to write
Should handle the relevant exception

from BeautifulSoup import BeautifulSoup
import urllib
con = urllib.urlopen('http://www.google.com')
soup = BeautifulSoup(con)
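
The slide notes that the relevant exception should be handled but does not show how. One possible sketch, in Python 3 syntax (where urlopen lives in urllib.request rather than urllib), is given here. The function name fetch and the use of local file:// URLs are assumptions made so that the example runs without a network connection; the bytes it returns could then be handed to the BeautifulSoup constructor.

```python
import tempfile
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url):
    """Return the page body as bytes, or None if the URL cannot be opened."""
    try:
        con = urlopen(url)
    except URLError as err:
        print('Could not open %s: %s' % (url, err.reason))
        return None
    try:
        return con.read()
    finally:
        con.close()


# Demonstrate with local file:// URLs so the sketch runs without a network.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / 'page.html'
    page.write_text('<html><body>Hello</body></html>')
    good = fetch(page.as_uri())
    bad = fetch((Path(tmp) / 'missing.html').as_uri())
```

Here good holds the page contents and bad is None, because the second file does not exist.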
11
What Can We Do with the Soup?

Most common task is to get various HTML elements
We can access them like we did for XML tags with
ElementTree
But it's a little more flexible. We can access
elements by
Tag name (<a> or <i> or <table> etc.)
Tag name and CSS class

12
Useful Concepts & Methods

A soup has two types of object
Tags - <html>, <b> etc.
Tag objects can have attributes:
<p id="menu" align="left">A paragraph</p>
NavigableStrings - the text of an element

13
Useful Concepts & Methods

A Tag object behaves like a dictionary
You can get attribute values by their names

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
firstPTag, secondPTag = soup.findAll('p')
firstPTag['id']
firstPTag['align']
14
Properties of Tag Objects

All tags will have the following properties
parent - the parent element containing the tag
object in question
contents - the contents of the tag. It might be a
simple string or it might be other elements
string - if the contents is a simple string, this
will return the string, otherwise None

15
Properties of Tag Objects
from BeautifulSoup import BeautifulSoup as BS
from urllib import urlopen
doc = """<html>...</html>"""   # document text elided; see ex1.py
soup = BS(doc)
head = soup.find('head')
print head.contents
print head.string
title = soup.find('title')
print title.contents
print title.string
# you can also directly access elements
print 'Directly accessing elements is neat!\n\n'
print soup.head
print soup.title

ex1.py on the class web page has the full code
We get the first element of a certain type
using find
But we can also use the element name as a
property of the BeautifulSoup object

16
Contents of a Tag Object

We saw that the contents property gives you a
list of all the tags contained in the current tag
or else the plain string
Just as with ElementTree, we can loop through
these tags
In some cases this is required
Unlike ElementTree we cannot specify a path
like body/div/a

17
Looping Over Sub Tags

The ltheadgt element will contain a few fixed tags
like lttitlegt, ltscriptgt, ltstylegt
We can get the ltheadgt tag and then loop over its
contents
See ex2.py on the class web site

from BeautifulSoup import BeautifulSoup as BS
from urllib import urlopen
soup = BS(doc)   # doc holds the HTML text (see ex2.py)
head = soup.find('head')
for item in head.contents:
    print item
for item in soup.head.contents:
    print item
18
Useful Concepts & Methods

Searching the HTML tree
findAll()
Finds elements that match a set of criteria
Lots of different ways to specify criteria
Gets you a list of Tag objects which you can then
loop over
find()
Like the above, but just gets the first matching
Tag object

19
Getting All the Links from a Web Page

A link is represented by the <a> tag
Can have attributes
href
target
name
We'd like to find
All the <a> tags
<a> tags that represent web links

20
Getting All the Links from a Web Page

A link is represented by the <a> tag
We can get a list of <a> elements
We can then access the attributes of
each element

from BeautifulSoup import BeautifulSoup
from urllib import urlopen
con = urlopen('http://www.google.com')
soup = BeautifulSoup(con)
anchors = soup.findAll('a')
print 'Found %d links' % (len(anchors))
21
Getting All the Links from a Web Page

A link is represented by the <a> tag
We can get a list of <a> elements
We can then access the attributes of
each element as if they were dictionaries

for anchor in anchors:
    print anchor['href']
for anchor in anchors:
    print anchor['target']
22
Getting All the Links from a Web Page

We can get attributes of a Tag by accessing the
tag like a dictionary, with the attribute name
What if the tag does not have the attribute?
Not all <a> tags have a target attribute
Can't call the keys() method
Have to catch an exception

# Does not work: Tag objects have no keys() method
for anchor in anchors:
    if 'target' in anchor.keys():
        print anchor['target']

# Works: catch the KeyError instead
for anchor in anchors:
    try:
        print anchor['target']
    except KeyError, e:
        print 'No target attribute'
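
The same missing-attribute problem can be tried out without BeautifulSoup. The sketch below is a swapped-in technique using Python 3's standard-library html.parser: it stores each anchor's attributes in a plain dictionary, so a missing attribute raises KeyError just as in the slide. The class name AnchorCollector and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a plain dictionary."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchors.append(dict(attrs))


page = ('<html><body>'
        '<a href="http://www.google.com" target="_blank">Google</a>'
        '<a href="http://blah.com">Blah</a>'
        '</body></html>')

collector = AnchorCollector()
collector.feed(page)

targets = []
for anchor in collector.anchors:
    try:
        targets.append(anchor['target'])   # raises KeyError if absent
    except KeyError:
        targets.append('No target attribute')

# With plain dictionaries, dict.get() gives the same fallback more briefly.
same_targets = [a.get('target', 'No target attribute')
                for a in collector.anchors]
print(targets)
```

Only the first anchor has a target, so the second entry is the fallback message.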
23
Better Control Over Searching

We saw that we could find all elements of a
certain type by using the element name
a for anchor elements
img for image elements
head for the head element
BeautifulSoup also allows us to get elements
based on attribute names and values
Get all elements that have an align attribute
whose value is center

24
Better Control Over Searching

Search for elements whose align attribute is
center
Search for elements whose align attribute can
be center or blah
Search for elements which have an
align attribute
Search for elements with no align attribute

elems = soup.findAll(align='center')
elems = soup.findAll(align=['center', 'blah'])
elems = soup.findAll(align=True)
elems = soup.findAll(align=None)
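
To make the four kinds of criterion concrete, here is a small pure-Python sketch of the matching rule. It is an illustration only, not BeautifulSoup's implementation: the function names attr_matches and search, and the list of attribute dictionaries standing in for parsed elements, are all assumptions.

```python
def attr_matches(attrs, name, criterion):
    """Mimic the four search criteria for one attribute of one element.

    attrs is a dict of the element's attributes; criterion may be a string,
    a list of strings, True (attribute present) or None (attribute absent).
    """
    value = attrs.get(name)
    if criterion is True:
        return value is not None
    if criterion is None:
        return value is None
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    return value == criterion


# Stand-ins for three parsed elements, described only by their attributes.
elements = [
    {'id': 'firstpara', 'align': 'center'},
    {'id': 'secondpara', 'align': 'blah'},
    {'id': 'thirdpara'},
]


def search(name, criterion):
    """Return the ids of the elements whose attribute meets the criterion."""
    return [e['id'] for e in elements if attr_matches(e, name, criterion)]


print(search('align', 'center'))
print(search('align', ['center', 'blah']))
print(search('align', True))
print(search('align', None))
```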
25
Better Control Over Searching

What if we want to search for elements which have
a class attribute?
The first example leads to a syntax error
class is a Python keyword
In such cases, use the attrs argument

elems = soup.findAll(class='bigpara')   # syntax error
elems = soup.findAll(attrs={'class': 'para'})
elems = soup.findAll(attrs={'id': 'junk'})
26
Web Forms

Generating HTML to display in a page is pretty
easy
Just string operations like formatting and
concatenation
The fun starts when we can provide data and get
back results
Need to provide forms on a page

27
HTML Input Elements

<input>
Single line text field
Radio buttons
Checkboxes
Submit buttons
<textarea> - a multi-line text box
<select> - drop down lists

28
Implementing an HTML Form

Form elements must be enclosed within <form>
</form> tags
Have to specify a URL which will handle the form
Should have a submit button so that the user can
indicate processing should start
See http://cheminfo.informatics.indiana.edu/rguha/class/2007/i211/week9/form.html

29
An Example HTML Form
<html>
<head> </head>
<body>
<h1>An example form</h1>
<form action="http://url/to/cgi/program" name="aForm" method="get">
<input type="text" name="textfield"> <br>
<select name="cars">
<option value="volvo">Volvo</option>
<option value="saab">Saab</option>
<option value="fiat">Fiat</option>
<option value="audi">Audi</option>
</select> <br>
<textarea name="textbox" rows="10" cols="40"></textarea> <br>
<input type='submit' name='dowork' value='Click Me!'>
</form>
</body></html>
30
Using the Data

Setting up the form is easy
But we need to write a CGI program that does
something with the data
Lots of ways to do Python CGI
We'll consider mod_python
Have to work on Sulu

31
Python CGIs

We have seen CGI programs before
The Yahoo finance CGI
The CGI programs may or may not take arguments
How do we write the actual CGI program?
Where does it get its arguments from?

32
mod_python

This is one way to write Python CGIs
Designed to work with Apache
Via mod_python, a Python program can be accessed
as if it were a URL
Functions within the program become components of
the URL
We can pass arguments to the function using the
?arg1=foo&arg2=bar form
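
The ?arg1=foo&arg2=bar form is an ordinary URL query string. The sketch below uses Python 3's standard-library urllib.parse to build one and take it apart again. The host, program and handler names are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the query string from a dictionary of arguments.
query = urlencode({'arg1': 'foo', 'arg2': 'bar'})
url = 'http://some.host.com/someprog.py/myhandler?' + query

# Split the URL again and recover the arguments.
parts = urlsplit(url)
args = parse_qs(parts.query)

print(url)
print(parts.path)
print(args)
```

Note that parse_qs returns a list for each name, since a name may appear more than once in a query string.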

33
Writing a mod_python Program

We'd like to be able to access a URL like
http://some.host.com/someprog.py
Get back some basic HTML
We'll worry about arguments later on

34
Writing a mod_python Program

The URL just indicates a Python program
Doesn't say anything about which function should
be called
mod_python assumes that you want to call the
function called index
The idea is similar to index.html

35
Hello World on the Web

Place the code in hello.py and place it under
public_html on sulu
Then navigate to http://sulu.informatics.indiana.edu/XXX/hello.py

def index(req):
    req.content_type = 'text/html'
    return 'Hello World'
36
mod_python Terminology

A Python function that is to be called via a
URL is a handler
A handler must take a single argument
It will be an mp_request object
Your program can have functions that will not be
accessed by a URL
These can be written any way you want to (no
args, 10 args etc)

37
mod_python Request Object

See http://www.modpython.org/live/current/doc-html/pyapi-mprequest.html
Lots of methods and members
We'll focus mainly on two members
content_type
form

38
Content Types

When your browser requests a page from a web
server, it needs to know what type of data it's
getting back
If it's HTML, it'll display it
If it's plain text, there is no need to process it, so it
just dumps it to the screen
If it's an MP3, it will save it to disk or open up
a media player
How does the web server indicate the type of
content?

39
Content Types

The HTTP protocol specifies that one of the
header fields in an HTTP response is called Content-Type
and is set to some value
The value of this field is chosen from a list of
well-known, agreed upon values
See here for a list of values for different types
of data

40
Content Types

We are interested mainly in HTML pages or plain
text pages
So, if the web server is to return an HTML page
the content type should be text/html
If the web server is to return a plain text page
(i.e., no markup) then the content type is
text/plain
How does this affect mod_python programs?

41
mod_python Content Type

Apache will send the return value of a mod_python
handler back to the user
If you return a string representing an HTML page, the user
should see the HTML page
If you return binary data representing an MP3, the user
should get an MP3
Before returning, you must set the content_type
member of the request object

42
mod_python Content Type

In general, you just need to add
req.content_type = 'text/html' somewhere before
returning
If you're not returning HTML and it is some sort
of text, use text/plain

def index(req):
    req.content_type = 'text/html'
    return 'Hello World'
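
A handler is an ordinary function, so it can be tried out away from Apache by passing in a stand-in for the request object. The sketch below does this. FakeRequest is a hypothetical stub that models only the content_type member; it is not part of mod_python.

```python
class FakeRequest(object):
    """Hypothetical stand-in for mod_python's request object (testing only)."""

    def __init__(self):
        self.content_type = None


def index(req):
    # Tell the browser that the returned string is an HTML page.
    req.content_type = 'text/html'
    return '<html><body><b>Hello World</b></body></html>'


req = FakeRequest()
body = index(req)
print(req.content_type)
print(body)
```

Changing the content type to 'text/plain' would make a browser show the markup itself instead of rendering it.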
43
mod_python Caveats

If there is an error in your code, using print
will not help, as you will not see the result in
the browser
You should see the exception in the browser
To print out intermediate results, return them
If they are not simple objects, Python will
convert them to a string representation using
str()

44
mod_python Caveats

This code will return an error page
The exception will tell you where the error is
You can't do print d to see what is in the
dictionary

def index(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    ret = d['two']
    return ret
45
mod_python Caveats

Instead, return the dictionary
This will cause Python to call str() on the
dictionary and Apache will simply present the
representation of the dictionary

def index(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    return d
    ret = d['two']
    return ret
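
What the browser would display in each case can be previewed in plain Python. This is a minimal sketch, assuming the dictionary from the slide: str() gives the text that would be shown when the dictionary is returned, and the failing lookup is caught so its message can be inspected. FakeRequest and lookup_fails are hypothetical helpers.

```python
class FakeRequest(object):
    """Hypothetical stand-in for the mod_python request object."""
    content_type = None


def index(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    # Return the string form of the dictionary before the failing lookup.
    return str(d)


def lookup_fails():
    """Show the exception raised by the lookup of a missing key."""
    d = {'one': 1}
    try:
        return d['two']
    except KeyError as err:
        return 'KeyError: %s' % err


print(index(FakeRequest()))
print(lookup_fails())
```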
46
More Than One Handler?

The handler called index is special
It allows you to access the CGI without
specifying a specific function
You can write other handlers in the same program
Each one must take a single request object as its
argument
Accessed by http://some.host.com/myprog.py/myhandler

47
More Than One Handler?

Download hp3.py
Does not have an index handler
http://sulu.informatics.indiana.edu/rguha/hp3.py
will not work

def asHTML(req):
    req.content_type = 'text/html'
    return '<html><Body><b>Hello</b></body></html>'

def asText(req):
    req.content_type = 'text/plain'
    return '<html><Body><b>Hello</b></body></html>'
48
More Than One Handler?

http://sulu.informatics.indiana.edu/rguha/hp3.py/asHTML
http://sulu.informatics.indiana.edu/rguha/hp3.py/asText
Both the above work
Depending on how we construct the URL, we
can control which handler gets called
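
The idea of choosing a handler from the URL can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how mod_python is implemented: the dispatch function, the HANDLERS table and the FakeRequest stub are all assumptions.

```python
class FakeRequest(object):
    """Hypothetical stand-in for the mod_python request object."""
    content_type = None


def asHTML(req):
    req.content_type = 'text/html'
    return '<html><body><b>Hello</b></body></html>'


def asText(req):
    req.content_type = 'text/plain'
    return '<html><body><b>Hello</b></body></html>'


# The functions of this "program" that may be reached through a URL.
HANDLERS = {'asHTML': asHTML, 'asText': asText}


def dispatch(path, req):
    """Pick a handler from the URL path; use 'index' if none is named."""
    _program, _, handler_name = path.lstrip('/').partition('/')
    handler = HANDLERS.get(handler_name or 'index')
    if handler is None:
        return 'Not Found'
    return handler(req)


req = FakeRequest()
print(dispatch('/hp3.py/asText', req), req.content_type)
print(dispatch('/hp3.py', req))
```

Because this program defines no index function, the bare program path finds no handler, which mirrors why the hp3.py URL without a handler name does not work.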

A handler is just a Python method that takes a
single argument
The argument is a request object
You can have as many handlers as you want
The handler called index is special
We can invoke it by just using the program name in
the URL

50
mod_python Summary

The return value of your handler is what gets
sent back to the user
Before returning, make sure to set the proper
content type by doing
req.content_type = 'some content type'
Here req is the name of the argument for the
handler
