Title: Information Infrastructure II
1. Information Infrastructure II
- I211 Week 6
- Rajarshi Guha
2. Outline
- Documenting Python code
- Project Overview
- Connecting to the web
3. Why Document
- Simple code can be self-explanatory
- Code that is more than 1 screenful will require some sort of explanation
  - For other people
  - For yourself, after 6 months
- Trivial documentation is nearly as bad as no documentation
  - No need to say that i is a loop counter!
4. Python Docstrings
- Simple way to comment a function, a class, etc.
- The docstring is a multi-line comment as the first thing in your function, method, or class
def myfunc(a, b):
    """
    A short summary of the function.
    What is the type of a and what is it for?
    What is the type of b and what is it for?
    What does the function return?
    """
    # the rest of your code
5. Making Use of Docstrings
- If you have a set of functions in a file, the file is a module
- After importing the module, you can access the functions
- For any given function you can do f.__doc__, which will output the docstring for that function
6. Making Use of Docstrings

myfunc.py:

def f1(a):
    """The docstring for f1"""
    return a

def f2(b, a):
    """The docstring for f2"""
    return a * b

>>> from myfunc import *
>>> print f1.__doc__
The docstring for f1
>>> print f2.__doc__
The docstring for f2
>>> help(f1)
7. Practicing Docstrings
- From here on, all classes and methods developed as assignments will need to have docstrings
- They don't need to be long
  - A short description of what the function or class does
  - What they take as input
  - What they will output (if a function or method)
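For instance, a docstring meeting these requirements might look like this (the function itself is just an illustration, not part of any assignment):

def count_words(text):
    """Count the words in a string.

    text is a string containing the text to analyze.
    Returns an integer: the number of whitespace-separated words.
    """
    return len(text.split())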
8. Project Overview
- Unified interface to PubMed and PubChem
- Along with some other functionality based on local databases/services
- Web page front end to the whole thing
9. PubMed
- Broad collection of databases
- Medical terms
- Literature
- Genes
- Proteins
10. PubChem
- A collection of chemical structures (> 10 million)
- Biological data
- Searchable
11. Goals of the Project
- Utilize the Entrez utilities to retrieve information from PubChem/PubMed
  - A set of URLs which allow you to query these databases
- Look up some databases located at IU
- Cache results, so that if a query gets repeated you can pull it from a local DB (sketched after this list)
- Utilize web services to get 2D images
- Put a web page front end onto all of this
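The caching idea, as a minimal sketch using an in-memory dictionary (the real project will use a local database, and the names get_with_cache and fetch are made up for illustration):

# a minimal sketch of caching; the real project stores results in a local DB
cache = {}

def get_with_cache(query, fetch):
    """Return the result for query, calling fetch only on a cache miss."""
    if query not in cache:
        cache[query] = fetch(query)   # the expensive remote call
    return cache[query]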
12. Requirements
- You'll be dealing with
  - Bibliographic information
  - Compound information
- You'll need to write classes that represent these things and provide methods (a sketch follows)
  - To easily get specific pieces of information
  - To interact with the databases
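As a rough sketch of what such a class might look like (the class name, attributes, and method here are illustrative guesses, not the assigned design):

class Compound:
    """Represent a chemical compound retrieved from PubChem.

    cid is the PubChem compound ID (an integer).
    name is the compound's common name (a string).
    """
    def __init__(self, cid, name):
        self.cid = cid
        self.name = name

    def getName(self):
        """Return the compound's common name."""
        return self.name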
13. What Will Be Given?
- I'll provide the relevant URLs, keywords, etc.
  - You'll have to use various string methods and URL-related methods to construct the appropriate URLs and get the information
- I'll create the database schema
  - You will just need to perform inserts, updates, etc.
- I'll give a brief overview of SQL, but it will be enough to get the job done
  - We won't need any SQL magic!
- I'll provide an overview of HTML
  - You can easily get HTML tutorials on the web
14. Procedure?
- Each week I'll list out tasks that need to be completed
  - Tasks will include the code to do the job
  - Test code that will show that the actual code works
- You will need to submit the file(s) by the following week
  - 5 points for submission and correct running of the test code
- At the end of the semester, I'll run the code and test the web page
  - 10 points when it all comes together
- The total is scaled down to the range 0-30
15. Connecting to the Web
- Python is very good for network-related stuff
  - Sending email
  - Doing FTP, SSH
  - Doing HTTP (i.e., WWW)
- We'll just focus on using the WWW from Python
16. The urllib module
- Provides various functions to
  - Open a connection to a URL
  - Read stuff from a URL
  - Quote URLs
- Getting stuff from the web is equivalent to opening a file and reading from it
  - Here a "file" is a URL
17. Opening a Connection
- Simply specify the URL
- Need to add http://
  - Just www.google.com will not work
  - http:// tells Python what protocol to use to connect to the URL
- For more useful things you'll construct a URL
  - Have to worry about quoting

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
18. Get Data from the Connection
- The result of urlopen is a connection object
  - Behaves the same as a file object
  - Except you can't write to it
- To get the page from that URL just call readlines()

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
19. What Did We Get?
- readlines() returns a list containing the lines of the page
- REMEMBER: You get back HTML, which is not the same as what you see in your browser
- To view the data, just dump it to a file
  - Open up page.html in your browser

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
f = open('page.html', 'w')
for x in page:
    f.write(x)
f.close()
20. So What's the Big Deal?
- It's very easy to get a page
- The real utility is extracting information from the page
- In many cases, the URL is actually a CGI program that can accept arguments
  - You can get different answers depending on what URL you construct
- Depending on what the developer provides you may
  - Curse, tear your hair out, get drunk
  - Write the code in 5 minutes and go party
21. Getting Stock Quotes
- Yahoo provides a stock quote service
  - Visit http://finance.yahoo.com/q?s=GOOG
- Lots of useful info
  - Last trade
  - Volume
  - Market cap
- I want to keep track of 10 stocks
  - Navigate the page 10 times?
22. Getting Stock Quotes
- Perfect job for a Python program
- If we open a connection using that URL we get the page for the Google stock
- What about others?

http://finance.yahoo.com/q?s=GOOG
(http://finance.yahoo.com/q?s= is the constant URL; GOOG changes for each company)
23. Question

s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'
urlForGoogle = ?
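The answer is plain string concatenation (using the variable names from the slide):

s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'
urlForGoogle = s + googleSymbol   # 'http://finance.yahoo.com/q?s=GOOG'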
24. Getting the Stock Pages
- The code is basically the same as before
- This time we construct the URL for each company

import urllib
baseurl = 'http://finance.yahoo.com/q?s='
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    con = urllib.urlopen(baseurl + company)
    lines = con.readlines()
    page = ''.join(lines)
    # do something with the page
25. Extracting Data (?)
- So we've gotten the stock page for GOOG
- We want the last trade, market cap, etc.
- Should be a matter of just looking at the text we downloaded and finding "Last Trade" etc.?
  - Yes
  - You will not want to do it this way
26.
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><title>GOOG: Summary for GOOGLE - Yahoo! Finance</title><link rel="stylesheet" type="text/css" href="http://us.js2.yimg.com/us.yimg.com/i/us/fi/yfs/css/yfs_popup_1.17.css"> ...
[several more screenfuls of script, stylesheet, and inline CSS markup omitted]

View the page in your browser and then right click and view source - this is what your program sees
27. Extracting Data
- In many cases this is all you get
- You have to look at this type of data and identify patterns
- In many cases it's doable, but is very tiresome (see the sketch below)
- Luckily Yahoo provides a much easier way to get the data we need
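Before looking at that easier way, here is a rough sketch of the tiresome pattern-hunting approach (the label text and the 200-character window are guesses for illustration; real pages need much fiddlier handling):

import urllib
con = urllib.urlopen('http://finance.yahoo.com/q?s=GOOG')
page = con.read()                   # the raw HTML as one string
idx = page.find('Last Trade')       # hunt for the label in the markup
if idx != -1:
    chunk = page[idx:idx + 200]     # grab some text following the label
    # now dig the actual number out of the surrounding tags by hand...
    print chunk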
28. Getting Stock Data Easily
- Yahoo provides a CGI program that returns stock quote data in CSV format
- The URL for Google stock is:

http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

- If we connect to this URL we don't get back an HTML page
  - Instead we get a single line:

"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27,522.22,2765940,165.0B,524.78,"+0.76%",523.18,"392.74 - 558.58",12.306,42.64,"GOOGLE"
29. What Are We Getting?
- If you look at the components of the line, you'll see that they match the web page
- What this URL returns to you is
  - Symbol, last trade, time, change
  - Open, high, low, volume
  - Market cap, previous close, previous change
  - Year low, year high, EPS, PE ratio
- And it's all in a nicely splittable string!
30. Getting the Last Trade
- Now we can easily get the last trade for a set of stocks

import urllib
baseurl = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1d1t1c1ohgvj1pp2owern&e=.csv'
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    url = baseurl % (company)
    con = urllib.urlopen(url)
    lines = con.readlines()
    data = lines[0]
    data = data.split(',')
    print 'Last trade for %s was %3.2f' % (company, float(data[1]))
31. Quoting
- A CGI program is something that anybody can access
- If it takes arguments, anybody can send anything
  - People will try to hack you
  - People will send invalid input
32. Quoting
- In general, HTTP supports the letters A-Z, a-z, 0-9 and / and .
- You can put in other symbols such as ~ or a space, but those are special characters
  - Ideally, you should escape them
  - Just as we use \n or \t in an ordinary string
- When working with URLs we call this quoting
33. Quoting Hex
- We don't use the \ symbol to do quoting
- Instead, special characters are converted to their hexadecimal codes
- In the computer a single character is represented by a single integer (ASCII code)
  - 'A' = 65, 'a' = 97, '~' = 126, and so on
- We can get the integer code for a character using the ord function
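A quick interactive check of those codes (ord and the %X hex format specifier are both standard Python):

>>> ord('A')
65
>>> ord('~')
126
>>> '%X' % ord('~')
'7E'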
34. Quoting Hex
- So quoting special characters in a string means
  - Identify the special character
  - Get its integer code
  - Convert the code to hex
  - Finally, replace the special character with the hex number in the form %XX, where XX is the hex code

char = '~'
n = ord(char)
h = '%X' % (n)
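Putting those steps together, here is a minimal hand-rolled quoting function (a sketch for illustration only; the set of special characters is deliberately tiny, and in practice you would use urllib's quote function, shown two slides ahead):

def my_quote(s, special=' ~:'):
    """Replace each special character in s with its %XX hex form."""
    out = ''
    for char in s:
        if char in special:
            out = out + '%%%02X' % ord(char)   # e.g. ' ' -> %20, '~' -> %7E
        else:
            out = out + char
    return out

print my_quote('http://moin.org/A Wiki Page')   # http%3A//moin.org/A%20Wiki%20Page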
35. Examples
- http://localhost/~rguha becomes http%3A//localhost/%7Erguha
  - : becomes %3A and ~ becomes %7E
- http://moin.org/A Wiki Page becomes http%3A//moin.org/A%20Wiki%20Page
  - Spaces become %20
36. urllib Lets Us Do This Easily
- Use the quote function
- In many cases you don't need to bother
- It's a good habit to quote any URL that you use to connect to something
  - Very easy to do!

>>> import urllib
>>> url = 'http://rguha.ath.cx/~rguha'
>>> url = urllib.quote(url)
>>> print url
http%3A//rguha.ath.cx/%7Erguha
37. Going the Other Way
- You might receive a quoted URL
  - Special chars are already replaced by hex
  - It's a pain to work with them
- Convert it back to the special char form using unquote

>>> u = 'http%3A//moin.org/A%20Wiki%20Page'
>>> newu = urllib.unquote(u)
>>> print newu
http://moin.org/A Wiki Page
38. Connecting to CGI Programs
- CGI programs are basically programs behind a web page
- They are represented as URLs
  - Take arguments just like a function
- Basically, a CGI program can be viewed as a function located on someone else's computer (pictured in the sketch below)
- The return values can vary
  - Could be tough-to-parse HTML
  - Could be as simple as a comma separated list
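One way to picture the function analogy (a sketch; the commented-out "local call" is imaginary):

import urllib

# imagine calling a function that happens to live on Yahoo's computer:
#     get_quotes(s='GOOG', f='sl1d1t1c1ohgvj1pp2owern', e='.csv')
# over HTTP, the same call is spelled out as a URL:
url = 'http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv'
con = urllib.urlopen(url)
print con.read()   # the "return value" of the CGI program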
39. Connecting to a CGI Program
- Simply construct the proper URL
- The URL for a CGI program is of the form

http://finance.yahoo.com/q?s=GOOG
http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

URL_TO_CGI?arg1=value1&arg2=value2
40. An Example

http://quote.yahoo.com/d/quotes.csv?   <- URL for the CGI
s=GOOG                                 <- parameter name = value
f=sl1d1t1c1oh                          <- parameter name = value
e=.csv                                 <- parameter name = value

Basically, making calls to CGIs is just string manipulation. Handling the return value is where all the effort goes.
41. urllib Makes it Easier
- Since the parameters are basically name, value pairs, what type of object should we use to represent them?
42. urllib Makes it Easier
- Create a dictionary of the name, value pairs and then call urlencode
- And pass the result to urlopen along with the URL of the program
43. Example

>>> url = 'http://quote.yahoo.com/d/quotes.csv?'
>>> d = {'s': 'GOOG', 'f': 'sl1d1t1c1ohgvj1pp2owern', 'e': '.csv'}
>>> params = urllib.urlencode(d)
>>> print params
s=GOOG&e=.csv&f=sl1d1t1c1ohgvj1pp2owern
>>> con = urllib.urlopen(url + params)
>>> print con.readlines()
['"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27...']