1
Information Infrastructure II
  • I211 Week 9
  • Rajarshi Guha

2
Outline
  • Brief overview of HTML tags
  • The BeautifulSoup module
  • Parsing HTML
  • Introduction to Python CGI
  • mod_python

3
HTML Overview
  • A series of tags indicating what a document
    should look like in a browser
  • HTML looks like XML, but only XHTML is guaranteed to
    be valid XML
  • Two types of HTML
      • Ordinary HTML
      • XHTML

    <html>
    <head></head>
    <body>
    This is <I>italic</I> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p>
    </body>
    </html>
4
Ordinary HTML
  • This is what is commonly found on the Net
  • This version is not always a valid XML document
  • Closing tags can sometimes be dropped
  • The browser will try its best to work out what
    is meant
  • But it's invalid XML
  • Painful for us to parse

    <html>
    <head></head>
    <body>
    This is <I>italic</I> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p>
    </body>
5
XHTML
  • This is a stricter form of HTML
  • All opening tags must have closing tags
  • A valid XHTML document is a valid XML document
  • Can be easily parsed using any XML parser

    <html>
    <head></head>
    <body>
    This is <i>italic</i> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p />
    </body>
    </html>
6
Parsing HTML in Python
  • Since it's supposed to be XML, we could use
    ElementTree
  • This will work for well-formed HTML or XHTML
    documents
  • But if tags are not closed or are in the wrong
    order, this will fail
  • Yet browsers still display such pages properly!
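
The difference can be seen with a minimal sketch (Python 3, standard library only). The two strings are adapted from the earlier XHTML and ordinary HTML slides; the helper name try_parse is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every tag is closed.
XHTML = ('<html><head></head><body>This is <i>italic</i> text followed by a '
         '<a href="http://www.google.com">hyperlink</a><p /></body></html>')

# Ordinary HTML: <p> is never closed and </html> is missing.
SLOPPY = ('<html><head></head><body>This is <I>italic</I> text followed by a '
          '<a href="http://www.google.com">hyperlink</a><p></body>')


def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


good = try_parse(XHTML)
bad = try_parse(SLOPPY)
print(good.find('body/a').get('href'))   # the link target from the XHTML version
print(bad)                               # None: the XML parser rejected it
```

A browser would render both strings; the XML parser accepts only the first.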

7
Parsing HTML in Python
  • Browsers are able to show malformed HTML because
    they employ heuristics
  • We'd also like to employ these heuristics

    <html>
    <head></head>
    <body>
    This is <i>italic</i>
    <form name="aform">
    <table>
    <tr><td>Blah</td></tr>
    </form>
    </table>
    </html></body>

    <html>
    <head></head>
    <body>
    This is <i>italic</i> text followed by a
    <a href="http://www.google.com">hyperlink</a>
    <p />
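
As a rough illustration of tolerant parsing, the sketch below feeds the first malformed document to html.parser from the Python 3 standard library. This is not what a browser or BeautifulSoup does: html.parser only reports tags in the order it sees them and makes no attempt to repair the nesting. The class name TagLogger is made up for this example.

```python
from html.parser import HTMLParser

MALFORMED = ('<html><head></head><body>This is <i>italic</i>'
             '<form name="aform"><table><tr><td>Blah</td></tr>'
             '</form></table></html></body>')


class TagLogger(HTMLParser):
    """Record start and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = TagLogger()
logger.feed(MALFORMED)
logger.close()

# No exception is raised even though the closing tags are out of order.
for event in logger.events[-4:]:
    print(event)
```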
8
Beautiful Soup
  • Copy BeautifulSoup.py from the class web page to
    your current directory
  • Add the following import to your file:
    from BeautifulSoup import BeautifulSoup
  • We can load and parse an HTML document by
    supplying a string to the constructor

9
BeautifulSoup
  • BeautifulSoup expects to get a string or a
    file-like object
  • Once you've made an object of class
    BeautifulSoup, we're ready to extract stuff

    from BeautifulSoup import BeautifulSoup

    s = '<html><body>Hello <I>World</I> this is a <a href="http://blah.com">link</a></html>'
    soup = BeautifulSoup(s)
10
BeautifulSoup from URLs
  • Since the constructor can take a file-like
    object, we can combine it with urlopen
  • Code is much simpler to read
  • Shorter to write
  • Should handle the relevant exception

    from BeautifulSoup import BeautifulSoup
    import urllib

    con = urllib.urlopen('http://www.google.com')
    soup = BeautifulSoup(con)
11
What Can We Do with the Soup?
  • Most common task is to get various HTML elements
  • We can access them like we did for XML tags with
    ElementTree
  • But it's a little more flexible. We can access
    elements by
      • Tag name (<a> or <i> or <table> etc.)
      • Tag name and CSS class

12
Useful Concepts & Methods
  • A soup has two types of object
      • Tags - <html>, <b> etc.
      • Tag objects can have attributes:
        <p id="menu" align="left">A paragraph</p>
      • NavigableStrings - the text of an element

13
Useful Concepts & Methods
  • A Tag object behaves like a dictionary
  • You can get attribute values by their names

    from BeautifulSoup import BeautifulSoup

    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))
    firstPTag, secondPTag = soup.findAll('p')
    firstPTag['id']
    firstPTag['align']
14
Properties of Tag Objects
  • All tags will have the following properties
  • parent - the parent element containing the tag
    object in question
  • contents - the contents of the tag. It might be a
    simple string or it might be other elements
  • string - if the contents is a simple string, this
    will return the string, otherwise None

15
Properties of Tag Objects
    from BeautifulSoup import BeautifulSoup as BS
    from urllib import urlopen

    # The HTML text is elided here; ex1.py has the full document
    doc = """<html> ... </html>"""
    soup = BS(doc)

    head = soup.find('head')
    print(head.contents)
    print(head.string)

    title = soup.find('title')
    print(title.contents)
    print(title.string)

    # you can also directly access elements
    print('Directly accessing elements is neat!\n\n')
    print(soup.head)
    print(soup.title)

  • ex1.py on the class web page has the full code
  • We get the first element of a certain type
    using find
  • But we can also use the element name as a
    property of the BeautifulSoup object

16
Contents of a Tag Object
  • We saw that the contents property gives you a
    list of all the tags contained in the current tag
    or else the plain string
  • Just as with ElementTree, we can loop through
    these tags
  • In some cases this is required
  • Unlike ElementTree, we cannot specify a path
    like body/div/a

17
Looping Over Sub Tags
  • The <head> element will contain a few fixed tags
    like <title>, <script>, <style>
  • We can get the <head> tag and then loop over its
    contents
  • See ex2.py on the class web site

    from BeautifulSoup import BeautifulSoup as BS
    from urllib import urlopen

    # doc holds the HTML text, as in the previous example
    soup = BS(doc)
    head = soup.find('head')
    for item in head.contents:
        print(item)

    # the same loop, accessing the element directly
    for item in soup.head.contents:
        print(item)
18
Useful Concepts & Methods
  • Searching the HTML tree
  • findAll()
      • Finds elements that match a set of criteria
      • Lots of different ways to specify criteria
      • Gets you a list of Tag objects which you can then
        loop over
  • find()
      • Like the above, but just gets the first matching
        Tag object

19
Getting All the Links from a Web Page
  • A link is represented by the a tag
  • Can have attributes
      • href
      • target
      • name
  • We'd like to find
      • All the a tags
      • a tags that represent web links
20
Getting All the Links from a Web Page
  • A link is represented by the a tag
  • We can get a list of a elements
  • We can then access the attributes of
    each element

    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen

    con = urlopen('http://www.google.com')
    soup = BeautifulSoup(con)
    anchors = soup.findAll('a')
    print('Found %d links' % len(anchors))
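
For comparison, here is a sketch of the same task using only the Python 3 standard library, run against a small inline page so that no network access is needed. The page text and the class name AnchorCollector are made up for this illustration; this is a stand-in, not the BeautifulSoup API. It separates all a tags from those that carry an href, i.e. the ones that represent web links.

```python
from html.parser import HTMLParser

PAGE = ('<html><body><a name="top"></a>'
        '<a href="http://www.google.com" target="_blank">Google</a> '
        '<a href="/about.html">About</a></body></html>')


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchors.append(dict(attrs))


collector = AnchorCollector()
collector.feed(PAGE)
collector.close()

# Only anchors with an href attribute are web links.
links = [a['href'] for a in collector.anchors if 'href' in a]
print('Found %d anchors, %d of them links' % (len(collector.anchors), len(links)))
```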
21
Getting All the Links from a Web Page
  • A link is represented by the a tag
  • We can get a list of a elements
  • We can then access the attributes of
    each element as if they were dictionaries

    for anchor in anchors:
        print(anchor['href'])

    for anchor in anchors:
        print(anchor['target'])
22
Getting All the Links from a Web Page
  • We can get attributes of a Tag by accessing the
    tag like a dictionary, with the attribute name
  • What if the tag does not have the attribute?
  • Not all a tags have a target attribute
  • Can't call the keys() method
  • Have to catch an exception

    # Does not work: we can't call keys() on a Tag
    for anchor in anchors:
        if 'target' in anchor.keys():
            print(anchor['target'])

    # Works: catch the exception instead
    for anchor in anchors:
        try:
            print(anchor['target'])
        except KeyError:
            print('No target attribute')
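
The try/except pattern can be tried without BeautifulSoup. In the sketch below, plain dictionaries stand in for Tag objects (an assumption made only so the example runs on its own), and target_of is a made-up helper name.

```python
# Plain dicts standing in for the attributes of two <a> tags.
anchors = [
    {'href': 'http://www.google.com', 'target': '_blank'},
    {'href': '/about.html'},               # no target attribute
]


def target_of(anchor):
    """Return the target attribute, or a message if it is missing."""
    try:
        return anchor['target']
    except KeyError:
        return 'No target attribute'


targets = [target_of(a) for a in anchors]
for t in targets:
    print(t)
```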
23
Better Control Over Searching
  • We saw that we could find all elements of a
    certain type by using the element name
  • a for anchor elements
  • img for image elements
  • head for the head element
  • BeautifulSoup also allows us to get elements
    based on attribute names and values
  • Get all elements that have an align attribute
    whose value is center

24
Better Control Over Searching
  • Search for elements whose align attribute is
    center
  • Search for elements whose align attribute can
    be center or blah
  • Search for elements which have an
    align attribute
  • Search for elements with no align attribute

    elems = soup.findAll(align='center')
    elems = soup.findAll(align=['center', 'blah'])
    elems = soup.findAll(align=True)
    elems = soup.findAll(align=None)
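
The four kinds of criteria can be spelled out in plain Python. This is only a sketch of the matching rules as described on this slide, not BeautifulSoup's actual implementation; dictionaries stand in for tags, and attribute_matches and find_all are made-up names.

```python
def attribute_matches(attrs, name, wanted):
    """Check one criterion against a dict of tag attributes.

    wanted may be a string (exact value), a list (any of the values),
    True (attribute must be present) or None (attribute must be absent).
    """
    value = attrs.get(name)
    if wanted is True:
        return value is not None
    if wanted is None:
        return value is None
    if isinstance(wanted, (list, tuple)):
        return value in wanted
    return value == wanted


tags = [
    {'id': 'firstpara', 'align': 'center'},
    {'id': 'secondpara', 'align': 'blah'},
    {'id': 'thirdpara'},
]


def find_all(name, wanted):
    """Return the ids of all stand-in tags that satisfy the criterion."""
    return [t['id'] for t in tags if attribute_matches(t, name, wanted)]


print(find_all('align', 'center'))
print(find_all('align', ['center', 'blah']))
print(find_all('align', True))
print(find_all('align', None))
```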
25
Better Control Over Searching
  • What if we want to search for elements which have
    a class attribute?
  • The first example leads to a syntax error
  • class is a Python keyword
  • In such cases, use the attrs argument

    elems = soup.findAll(class='bigpara')    # syntax error
    elems = soup.findAll(attrs={'class': 'para'})
    elems = soup.findAll(attrs={'id': 'junk'})
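
Why the first call fails can be checked directly in Python 3. The find_all function here is a hypothetical stand-in that just returns the criteria it was given; it is there only to show that a dictionary can carry a key that is not legal as a keyword argument.

```python
import keyword


def find_all(attrs=None, **kwargs):
    """Hypothetical stand-in: merge both ways of passing criteria."""
    criteria = dict(attrs or {})
    criteria.update(kwargs)
    return criteria


print(keyword.iskeyword('class'))    # class is reserved by Python
print(keyword.iskeyword('align'))    # align is an ordinary name

# Using class as a keyword argument does not even compile.
try:
    compile("find_all(class='bigpara')", '<example>', 'eval')
    compiled = True
except SyntaxError:
    compiled = False

by_keyword = find_all(align='center')
by_dict = find_all(attrs={'class': 'para'})
print(compiled, by_keyword, by_dict)
```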
26
Web Forms
  • Generating HTML to display in a page is pretty
    easy
  • Just string operations like formatting and
    concatenation
  • The fun starts when we can provide data and get
    back results
  • Need to provide forms on a page
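
As a small example of generating HTML with string formatting and concatenation, the sketch below builds a drop down list from a list of pairs (Python 3). The function name make_select and the car list are assumptions chosen for illustration.

```python
from html import escape


def make_select(name, options):
    """Build a <select> element from (value, label) pairs."""
    lines = ['<select name="%s">' % escape(name)]
    for value, label in options:
        lines.append('<option value="%s">%s</option>'
                     % (escape(value), escape(label)))
    lines.append('</select>')
    return '\n'.join(lines)


cars = [('volvo', 'Volvo'), ('saab', 'Saab'), ('fiat', 'Fiat'), ('audi', 'Audi')]
markup = make_select('cars', cars)
print(markup)
```

Escaping the values keeps characters such as < or " in the data from breaking the generated page.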

27
HTML Input Elements
  • <input>
      • Single line text field
      • Radio buttons
      • Checkboxes
      • Submit buttons
  • <textarea> - a multi-line text box
  • <select> - drop down lists

28
Implementing an HTML Form
  • Form elements must be enclosed within <form>
    </form> tags
  • Have to specify a URL which will handle the form
  • Should have a submit button so that the user can
    indicate processing should start
  • See http://cheminfo.informatics.indiana.edu/rguha/class/2007/i211/week9/form.html

29
An Example HTML Form
    <html>
    <head></head>
    <body>
    <h1>An example form</h1>
    <form action="http://url/to/cgi/program" name="aForm" method="get">
    <input type="text" name="textfield">
    <br>
    <select name="cars">
    <option value="volvo">Volvo</option>
    <option value="saab">Saab</option>
    <option value="fiat">Fiat</option>
    <option value="audi">Audi</option>
    </select>
    <br>
    <textarea name="textbox" rows="10" cols="40"></textarea>
    <br>
    <input type='submit' name='dowork' value='Click Me!'>
    </form>
    </body></html>
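
Because the form uses method "get", the browser appends the field values to the action URL as a query string. The sketch below (Python 3) builds such a URL by hand for the form's four named fields; the values typed in are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

action = 'http://url/to/cgi/program'

# One entry per named form element; the values are example user input.
fields = {
    'textfield': 'hello world',
    'cars': 'saab',
    'textbox': '',
    'dowork': 'Click Me!',
}

# urlencode escapes spaces and punctuation so the values survive in a URL.
url = action + '?' + urlencode(fields)
print(url)

# Decoding the query string gives the original names and values back.
decoded = parse_qsl(urlsplit(url).query, keep_blank_values=True)
print(decoded)
```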
30
Using the Data
  • Setting up the form is easy
  • But we need to write a CGI program that does
    something with the data
  • Lots of ways to do Python CGI
  • We'll consider mod_python
  • Have to work on Sulu

31
Python CGIs
  • We have seen CGI programs before
  • The Yahoo finance CGI
  • The CGI programs may or may not take arguments
  • How do we write the actual CGI program?
  • Where does it get its arguments from?

32
mod_python
  • This is one way to write Python CGIs
  • Designed to work with Apache
  • Via mod_python, a Python program can be accessed
    as if it were a URL
  • Functions within the program become components of
    the URL
  • We can pass arguments to the function using the
    ?arg1=foo&arg2=bar form
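
To see how such a URL breaks down, the sketch below splits one into its path and its arguments using the Python 3 standard library. The host, program name and argument values are placeholders; mod_python does this work for you on the server.

```python
from urllib.parse import urlsplit, parse_qs

url = 'http://some.host.com/someprog.py/index?arg1=foo&arg2=bar'

parts = urlsplit(url)
args = parse_qs(parts.query)   # each name maps to a list of values

print(parts.path)              # program name followed by the function name
print(args)
```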

33
Writing a mod_python Program
  • We'd like to be able to access a URL like
    http://some.host.com/someprog.py
  • Get back some basic HTML
  • We'll worry about arguments later on

34
Writing a mod_python Program
  • The URL just indicates a Python program
  • Doesn't say anything about which function should
    be called
  • mod_python assumes that you want to call the
    function called index
  • The idea is similar to index.html

35
Hello World on the Web
  • Place the code in hello.py and place it under
    public_html on sulu
  • Then navigate to
    http://sulu.informatics.indiana.edu/XXX/hello.py

    def index(req):
        req.content_type = 'text/html'
        return 'Hello World'
36
mod_python Terminology
  • A Python function that is to be called via a
    URL is a handler
  • A handler must take a single argument
  • It will be an mp_request object
  • Your program can have functions that will not be
    accessed by a URL
  • These can be written any way you want to (no
    args, 10 args etc)
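
A handler can be tried out away from the web server by passing it a stand-in object. In the sketch below, FakeRequest is a hypothetical class with just the one member the handler touches; the real object is supplied by mod_python. The helper make_greeting shows an ordinary function with its own signature.

```python
class FakeRequest(object):
    """Hypothetical stand-in for the mod_python request object."""

    def __init__(self):
        self.content_type = None


def make_greeting(name):
    # An ordinary helper function: any arguments we like.
    return 'Hello %s' % name


def index(req):
    # A handler: takes the request object as its single argument.
    req.content_type = 'text/html'
    return '<html><body>' + make_greeting('World') + '</body></html>'


req = FakeRequest()
body = index(req)
print(req.content_type)
print(body)
```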

37
mod_python Request Object
  • See http://www.modpython.org/live/current/doc-html/pyapi-mprequest.html
  • Lots of methods and members
  • We'll focus mainly on two members
      • content_type
      • form

38
Content Types
  • When your browser requests a page from a web
    server, it needs to know what type of data it's
    getting back
  • If it's HTML, it'll display it
  • If it's plain text, no need to process it, so it
    just dumps it to screen
  • If it's an MP3, it will save to disk or open up
    a media player
  • How does the web server indicate the type of
    content?

39
Content Types
  • The HTTP protocol specifies that one of the
    header fields in an HTTP response is called
    Content-Type and is set to some value
  • The value of this field is chosen from a list of
    well-known, agreed upon values
  • See here for a list of values for different types
    of data
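
Python's standard library ships a table of these well-known values, keyed by file extension. The sketch below (Python 3) looks up a few; the file names are made up, and content_type_for is a made-up helper name.

```python
import mimetypes


def content_type_for(filename):
    """Return the content type registered for the file's extension."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


for filename in ('page.html', 'notes.txt', 'song.mp3'):
    print(filename, '->', content_type_for(filename))
```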

40
Content Types
  • We are interested mainly in HTML pages or plain
    text pages
  • So, if the web server is to return an HTML page
    the content type should be text/html
  • If the web server is to return a plain text page
    (i.e., no markup) then the content type is
    text/plain
  • How does this affect mod_python programs?

41
mod_python Content Type
  • Apache will send the return value of a mod_python
    handler back to the user
  • Return a string representing an HTML page, and the
    user should see the HTML page
  • Return binary data representing an MP3, and the user
    should get an MP3
  • Before returning, you must set the content_type
    member of the request object

42
mod_python Content Type
  • In general, you just need to add
    req.content_type = 'text/html' somewhere before
    returning
  • If you're not returning HTML and it is some sort
    of text, use text/plain

    def index(req):
        req.content_type = 'text/html'
        return 'Hello World'
43
mod_python Caveats
  • If there is an error in your code, using print
    will not help, as you will not see the result in
    the browser
  • You should see the exception in the browser
  • To print out intermediate results, return them
  • If they are not simple objects, Python will
    convert them to a string representation using
    str()

44
mod_python Caveats
  • This code will return an error page
  • The exception will tell you where the error is
  • You can't do print d to see what is in the
    dictionary

    def index(req):
        req.content_type = 'text/plain'
        d = {}
        d['one'] = 1
        ret = d['two']
        return ret
45
mod_python Caveats
  • Instead, return the dictionary
  • This will cause Python to call str() on the
    dictionary and Apache will simply present the
    representation of the dictionary

    def index(req):
        req.content_type = 'text/plain'
        d = {}
        d['one'] = 1
        return d          # temporary: look at d in the browser
        ret = d['two']
        return ret
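
Both versions can be compared outside the web server. In this sketch FakeRequest is a hypothetical stand-in for the request object, and the handlers are renamed broken and debug so they can sit side by side; the conversion to a string that would happen on the way to the browser is written out explicitly with str().

```python
class FakeRequest(object):
    """Hypothetical stand-in for the mod_python request object."""
    content_type = None


def broken(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    return d['two']        # raises KeyError: there is no key 'two'


def debug(req):
    req.content_type = 'text/plain'
    d = {}
    d['one'] = 1
    return str(d)          # early return shows what d holds


req = FakeRequest()
try:
    broken(req)
    error = None
except KeyError as exc:
    error = exc

shown = debug(req)
print(repr(error))
print(shown)
```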
46
More Than One Handler?
  • The handler called index is special
  • It allows you to access the CGI without
    specifying a specific function
  • You can write other handlers in the same program
  • Each one must take a single request object as its
    argument
  • Accessed by http://some.host.com/myprog.py/myhandler

47
More Than One Handler?
  • Download hp3.py
  • Does not have an index handler
  • http://sulu.informatics.indiana.edu/rguha/hp3.py
    will not work

    def asHTML(req):
        req.content_type = 'text/html'
        return '<html><Body><b>Hello</b></body></html>'

    def asText(req):
        req.content_type = 'text/plain'
        return '<html><Body><b>Hello</b></body></html>'
48
More Than One Handler?
  • http://sulu.informatics.indiana.edu/rguha/hp3.py/asHTML
  • http://sulu.informatics.indiana.edu/rguha/hp3.py/asText
  • Both the above work
  • Depending on how we construct the URL, we
    can control which handler gets called

    def asHTML(req):
        req.content_type = 'text/html'
        return '<html><Body><b>Hello</b></body></html>'

    def asText(req):
        req.content_type = 'text/plain'
        return '<html><Body><b>Hello</b></body></html>'
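
The idea of choosing a handler from the URL can be mimicked with a toy dispatcher. This is only a sketch of the concept, not how mod_python is implemented: FakeRequest, HANDLERS and dispatch are made-up names, and the status numbers are the usual HTTP codes for success and not found.

```python
class FakeRequest(object):
    """Hypothetical stand-in for the mod_python request object."""
    content_type = None


def asHTML(req):
    req.content_type = 'text/html'
    return '<html><Body><b>Hello</b></body></html>'


def asText(req):
    req.content_type = 'text/plain'
    return '<html><Body><b>Hello</b></body></html>'


HANDLERS = {'asHTML': asHTML, 'asText': asText}   # note: no index handler


def dispatch(path, req):
    """Pick a handler from the part of the path after the program name."""
    _, _, handler_name = path.partition('.py/')
    handler = HANDLERS.get(handler_name or 'index')
    if handler is None:
        return 404, None
    return 200, handler(req)


req = FakeRequest()
print(dispatch('/rguha/hp3.py/asText', req), req.content_type)
print(dispatch('/rguha/hp3.py/asHTML', req), req.content_type)
print(dispatch('/rguha/hp3.py', req))
```

The same text comes back from both handlers; only the content type differs, so a browser would render one and show the other as raw markup.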
49
mod_python Summary
  • A handler is just a Python function that takes a
    single argument
  • The argument is a request object
  • You can have as many handlers as you want
  • The handler called index is special
  • We can invoke it by just using the program name in
    the URL

50
mod_python Summary
  • The return value of your handler is what gets
    sent back to the user
  • Before returning, make sure to set the proper
    content type by doing
    req.content_type = 'some content type'
  • Here req is the name of the argument for the
    handler