Title: Information Infrastructure II
1. Information Infrastructure II
- I211 Week 6
- Rajarshi Guha
2. Outline
- Documenting Python code
- Project Overview
- Connecting to the web
3. Why Document
- Simple code can be self-explanatory
- Code that is more than 1 screenful will require some sort of explanation
  - For other people
  - For yourself, after 6 months
- Trivial documentation is nearly as bad as no documentation
  - No need to say that i is a loop counter!
4. Python Docstrings
- Simple way to comment a function, a class, etc.
- The docstring is a multi-line comment as the first thing in your function, method, or class
def myfunc(a, b):
    """
    A short summary of the function.
    What is the type of a and what is it for?
    What is the type of b and what is it for?
    What does the function return?
    """
    # the rest of your code
5. Making Use of Docstrings
- If you have a set of functions in a file, the file is a module
- After importing the module, you can access the functions
- For any given function you can do f.__doc__, which will output the docstring for that function
6. Making Use of Docstrings

myfunc.py:

def f1(a):
    """The docstring for f1"""
    return a

def f2(b, a):
    """The docstring for f2"""
    return a * b

>>> from myfunc import *
>>> print f1.__doc__
The docstring for f1
>>> print f2.__doc__
The docstring for f2
>>> help(f1)
7. Practicing Docstrings
- From here on, all classes and methods developed as assignments will need to have docstrings
- They don't need to be long
  - A short description of what the function or class does
  - What they take as input
  - What they will output (if a function or method)
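For instance, a docstring meeting these requirements might look like this (the function itself is just an illustration, not part of any assignment):

def count_words(text):
    """Count the words in a string.

    text is a string containing the text to analyze.
    Returns an integer: the number of whitespace-separated words.
    """
    return len(text.split())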
8. Project Overview
- Unified interface to PubMed and PubChem
- Along with some other functionality based on local databases/services
- Web page front end to the whole thing
9. PubMed
- Broad collection of databases
- Medical terms
- Literature
- Genes
- Proteins
10. PubChem
- A collection of chemical structures (> 10 million)
- Biological data
- Searchable
11. Goals of the Project
- Utilize the Entrez utilities to retrieve information from PubChem/PubMed
  - A set of URLs which allow you to query these databases
- Look up some databases located at IU
- Cache results, so that if a query gets repeated you can pull it from a local DB (sketched after this list)
- Utilize web services to get 2D images
- Put a web page front end onto all of this
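The caching idea, as a minimal sketch using an in-memory dictionary (the real project will use a local database, and the names get_with_cache and fetch are made up for illustration):

# a minimal sketch of caching; the real project stores results in a local DB
cache = {}

def get_with_cache(query, fetch):
    """Return the result for query, calling fetch only on a cache miss."""
    if query not in cache:
        cache[query] = fetch(query)   # the expensive remote call
    return cache[query]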
12. Requirements
- You'll be dealing with
  - Bibliographic information
  - Compound information
- You'll need to write classes that represent these things and provide methods (a sketch follows)
  - To easily get specific pieces of information
  - To interact with the databases
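As a rough sketch of what such a class might look like (the class name, attributes, and method here are illustrative guesses, not the assigned design):

class Compound:
    """Represent a chemical compound retrieved from PubChem.

    cid is the PubChem compound ID (an integer).
    name is the compound's common name (a string).
    """
    def __init__(self, cid, name):
        self.cid = cid
        self.name = name

    def getName(self):
        """Return the compound's common name."""
        return self.name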
13. What Will Be Given?
- I'll provide the relevant URLs, keywords, etc.
  - You'll have to use various string methods and URL-related methods to construct the appropriate URLs and get the information
- I'll create the database schema
  - You will just need to perform inserts, updates, etc.
- I'll give a brief overview of SQL, but it will be enough to get the job done
  - We won't need any SQL magic!
- I'll provide an overview of HTML
  - You can easily get HTML tutorials on the web
14. Procedure?
- Each week I'll list out tasks that need to be completed
  - Tasks will include the code to do the job
  - Test code that will show that the actual code works
- You will need to submit the file(s) by the following week
  - 5 points for submission and correct running of the test code
- At the end of the semester, I'll run the code and test the web page
  - 10 points when it all comes together
- The total is scaled down to the range 0-30
15. Connecting to the Web
- Python is very good for network-related stuff
  - Sending email
  - Doing FTP, SSH
  - Doing HTTP (i.e., WWW)
- We'll just focus on using the WWW from Python
16. The urllib module
- Provides various functions to
  - Open a connection to a URL
  - Read stuff from a URL
  - Quote URLs
- Getting stuff from the web is equivalent to opening a file and reading from it
  - Here a "file" is a URL
17. Opening a Connection
- Simply specify the URL
- Need to add http://
  - Just www.google.com will not work
  - http:// tells Python what protocol to use to connect to the URL
- For more useful things you'll construct a URL
  - Have to worry about quoting

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
18. Get Data from the Connection
- The result of urlopen is a connection object
  - Behaves the same as a file object
  - Except you can't write to it
- To get the page from that URL just call readlines()

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
19. What Did We Get?
- readlines() returns a list containing the lines of the page
- REMEMBER: You get back HTML, which is not the same as what you see in your browser
- To view the data, just dump it to a file
  - Open up page.html in your browser

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
f = open('page.html', 'w')
for x in page:
    f.write(x)
f.close()
20. So What's the Big Deal?
- It's very easy to get a page
- The real utility is extracting information from the page
- In many cases, the URL is actually a CGI program that can accept arguments
  - You can get different answers depending on what URL you construct
- Depending on what the developer provides you may
  - Curse, tear your hair out, get drunk
  - Write the code in 5 minutes and go party
21. Getting Stock Quotes
- Yahoo provides a stock quote service
  - Visit http://finance.yahoo.com/q?s=GOOG
- Lots of useful info
  - Last trade
  - Volume
  - Market cap
- I want to keep track of 10 stocks
  - Navigate the page 10 times?
22. Getting Stock Quotes
- Perfect job for a Python program
- If we open a connection using that URL we get the page for the Google stock
- What about others?

http://finance.yahoo.com/q?s=GOOG
(http://finance.yahoo.com/q?s= is the constant URL; GOOG changes for each company)
23. Question

s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'
urlForGoogle = ?
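The answer is plain string concatenation (using the variable names from the slide):

s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'
urlForGoogle = s + googleSymbol   # 'http://finance.yahoo.com/q?s=GOOG'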
24. Getting the Stock Pages
- The code is basically the same as before
- This time we construct the URL for each company

import urllib
baseurl = 'http://finance.yahoo.com/q?s='
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    con = urllib.urlopen(baseurl + company)
    lines = con.readlines()
    page = ''.join(lines)
    # do something with the page
25. Extracting Data (?)
- So we've gotten the stock page for GOOG
- We want the last trade, market cap, etc.
- Should be a matter of just looking at the text we downloaded and finding "Last Trade" etc.?
  - Yes
  - You will not want to do it this way
26.
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><title>GOOG: Summary for GOOGLE - Yahoo! Finance</title><link rel="stylesheet" type="text/css" href="http://us.js2.yimg.com/us.yimg.com/i/us/fi/yfs/css/yfs_popup_1.17.css"> ...
[several more screenfuls of script, stylesheet, and inline CSS markup omitted]

View the page in your browser and then right click and view source - this is what your program sees
27. Extracting Data
- In many cases this is all you get
- You have to look at this type of data and identify patterns
- In many cases it's doable, but is very tiresome (see the sketch below)
- Luckily Yahoo provides a much easier way to get the data we need
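Before looking at that easier way, here is a rough sketch of the tiresome pattern-hunting approach (the label text and the 200-character window are guesses for illustration; real pages need much fiddlier handling):

import urllib
con = urllib.urlopen('http://finance.yahoo.com/q?s=GOOG')
page = con.read()                   # the raw HTML as one string
idx = page.find('Last Trade')       # hunt for the label in the markup
if idx != -1:
    chunk = page[idx:idx + 200]     # grab some text following the label
    # now dig the actual number out of the surrounding tags by hand...
    print chunk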
28. Getting Stock Data Easily
- Yahoo provides a CGI program that returns stock quote data in CSV format
- The URL for Google stock is:

http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

- If we connect to this URL we don't get back an HTML page
  - Instead we get a single line:

"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27,522.22,2765940,165.0B,524.78,"+0.76%",523.18,"392.74 - 558.58",12.306,42.64,"GOOGLE"
29. What Are We Getting?
- If you look at the components of the line, you'll see that they match the web page
- What this URL returns to you is
  - Symbol, last trade, time, change
  - Open, high, low, volume
  - Market cap, previous close, previous change
  - Year low, year high, EPS, PE ratio
- And it's all in a nicely splittable string!
30. Getting the Last Trade
- Now we can easily get the last trade for a set of stocks

import urllib
baseurl = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1d1t1c1ohgvj1pp2owern&e=.csv'
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    url = baseurl % (company)
    con = urllib.urlopen(url)
    lines = con.readlines()
    data = lines[0]
    data = data.split(',')
    print 'Last trade for %s was %3.2f' % (company, float(data[1]))
31. Quoting
- A CGI program is something that anybody can access
- If it takes arguments, anybody can send anything
  - People will try to hack you
  - People will send invalid input
32. Quoting
- In general, HTTP supports the letters A-Z, a-z, 0-9 and / and .
- You can put in other symbols such as ~ or a space, but those are special characters
  - Ideally, you should escape them
  - Just as we use \n or \t in an ordinary string
- When working with URLs we call this quoting
33. Quoting Hex
- We don't use the \ symbol to do quoting
- Instead, special characters are converted to their hexadecimal codes
- In the computer a single character is represented by a single integer (ASCII code)
  - 'A' = 65, 'a' = 97, '~' = 126, and so on
- We can get the integer code for a character using the ord function
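A quick interactive check of those codes (ord and the %X hex format specifier are both standard Python):

>>> ord('A')
65
>>> ord('~')
126
>>> '%X' % ord('~')
'7E'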
34. Quoting Hex
- So quoting special characters in a string means
  - Identify the special character
  - Get its integer code
  - Convert the code to hex
  - Finally, replace the special character with the hex number in the form %XX, where XX is the hex code

char = '~'
n = ord(char)
h = '%X' % (n)
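Putting those steps together, here is a minimal hand-rolled quoting function (a sketch for illustration only; the set of special characters is deliberately tiny, and in practice you would use urllib's quote function, shown two slides ahead):

def my_quote(s, special=' ~:'):
    """Replace each special character in s with its %XX hex form."""
    out = ''
    for char in s:
        if char in special:
            out = out + '%%%02X' % ord(char)   # e.g. ' ' -> %20, '~' -> %7E
        else:
            out = out + char
    return out

print my_quote('http://moin.org/A Wiki Page')   # http%3A//moin.org/A%20Wiki%20Page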
35. Examples
- http://localhost/~rguha becomes http%3A//localhost/%7Erguha
  - : becomes %3A and ~ becomes %7E
- http://moin.org/A Wiki Page becomes http%3A//moin.org/A%20Wiki%20Page
  - Spaces become %20
36. urllib Lets Us Do This Easily
- Use the quote function
- In many cases you don't need to bother
- It's a good habit to quote any URL that you use to connect to something
  - Very easy to do!

>>> import urllib
>>> url = 'http://rguha.ath.cx/~rguha'
>>> url = urllib.quote(url)
>>> print url
http%3A//rguha.ath.cx/%7Erguha
37. Going the Other Way
- You might receive a quoted URL
  - Special chars are already replaced by hex
  - It's a pain to work with them
- Convert it back to the special char form using unquote

>>> u = 'http%3A//moin.org/A%20Wiki%20Page'
>>> newu = urllib.unquote(u)
>>> print newu
http://moin.org/A Wiki Page
38. Connecting to CGI Programs
- CGI programs are basically programs behind a web page
- They are represented as URLs
  - Take arguments just like a function
- Basically, a CGI program can be viewed as a function located on someone else's computer (pictured in the sketch below)
- The return values can vary
  - Could be tough-to-parse HTML
  - Could be as simple as a comma separated list
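One way to picture the function analogy (a sketch; the commented-out "local call" is imaginary):

import urllib

# imagine calling a function that happens to live on Yahoo's computer:
#     get_quotes(s='GOOG', f='sl1d1t1c1ohgvj1pp2owern', e='.csv')
# over HTTP, the same call is spelled out as a URL:
url = 'http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv'
con = urllib.urlopen(url)
print con.read()   # the "return value" of the CGI program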
39. Connecting to a CGI Program
- Simply construct the proper URL
- The URL for a CGI program is of the form

http://finance.yahoo.com/q?s=GOOG
http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

URL_TO_CGI?arg1=value1&arg2=value2
40. An Example

http://quote.yahoo.com/d/quotes.csv?   <- URL for the CGI
s=GOOG                                 <- parameter name = value
f=sl1d1t1c1oh                          <- parameter name = value
e=.csv                                 <- parameter name = value

Basically, making calls to CGIs is just string manipulation. Handling the return value is where all the effort goes.
41. urllib Makes it Easier
- Since the parameters are basically name, value pairs, what type of object should we use to represent them?
42. urllib Makes it Easier
- Create a dictionary of the name, value pairs and then call urlencode
- And pass the result to urlopen along with the URL of the program
43. Example

>>> url = 'http://quote.yahoo.com/d/quotes.csv?'
>>> d = {'s': 'GOOG', 'f': 'sl1d1t1c1ohgvj1pp2owern', 'e': '.csv'}
>>> params = urllib.urlencode(d)
>>> print params
s=GOOG&e=.csv&f=sl1d1t1c1ohgvj1pp2owern
>>> con = urllib.urlopen(url + params)
>>> print con.readlines()
['"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27...']