Title: Web Programming: A Short History
1Web Programming: A Short History
- Armando Fox
- CS294-1 Fall 06
2Travel photo
3Web Programming 101 Outline
- Basics: RPC, client-server, HTTP, Apache
- Web sites are really programs: CGI, FastCGI
- Server-side storage, cookies
- Programming stacks: LAMP, J2EE, Rails
- User tracking
- Client-side fun: JavaScript, DHTML, DOM
- Workload generation
- Finding bottlenecks, Lab 1 discussion
4The Web is largely RPC using HTTP
- Review of Remote Procedure Call (RFC 707, 1976)
- Problems RPC had to overcome:
- Engineering: argument marshaling, argument and result types
- Fundamental: calling semantics (at-least-once, at-most-once, exactly-once)
- Fundamental (for all distributed systems): failure semantics
- How does HTTP fix these?
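The difference between at-least-once and at-most-once calling semantics can be sketched with a toy example. This is only an illustration under stated assumptions: `Server`, `call_with_retries`, and the `request_id` scheme are hypothetical names invented here, and the "network" is simulated by simply re-invoking the procedure when a reply is lost.

```python
class Server:
    """Toy RPC server exposing a counter increment that is NOT idempotent."""

    def __init__(self, dedup):
        self.counter = 0
        self.dedup = dedup          # True -> at-most-once execution
        self.replies = {}           # request_id -> cached reply

    def increment(self, request_id):
        # With dedup, a retransmitted request returns the cached reply
        # instead of executing the procedure a second time.
        if self.dedup and request_id in self.replies:
            return self.replies[request_id]
        self.counter += 1
        self.replies[request_id] = self.counter
        return self.counter


def call_with_retries(server, request_id, lost_replies):
    """Client side: resend until a reply arrives (at-least-once delivery).

    lost_replies simulates how many replies the network drops.
    """
    for attempt in range(lost_replies + 1):
        reply = server.increment(request_id)
        if attempt == lost_replies:   # this reply finally gets through
            return reply


naive = Server(dedup=False)
call_with_retries(naive, "req-1", lost_replies=2)

careful = Server(dedup=True)
call_with_retries(careful, "req-1", lost_replies=2)

print(naive.counter, careful.counter)
```

Without deduplication the retries execute the procedure three times; with a reply cache keyed by request ID it executes once, which is the usual way to approximate at-most-once on top of retries.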
5A Conversation With a Web Server
- Open TCP connection to server on port 80 (default)
- Browser uses TCP to send the following chunk o' stuff:
- GET /index.html HTTP/1.0
- User-Agent: Mozilla/4.73 [en] (X11; U; Linux 2.0.35 i686)
- Host: www.yahoo.com
- Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
- Accept-Encoding: gzip
- Accept-Language: en
- Accept-Charset: iso-8859-1,*,utf-8
- Cookie: B=2vsconq5p0h2n
6A Conversation With a Web Server
- Server replies:
- HTTP/1.0 200 OK
- Content-Length: 16018
- Content-Type: text/html
- (HTML page follows: title "Yahoo!", href=http://www.yahoo.com/)
- etc.
- If there are embedded images, such as
- m/us.yimg.com/a/an/anchor/icons2.gif"
- then repeat the whole process with this new URL.
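The conversation above is plain text over TCP, so it can be sketched without a network. The example below is a minimal illustration, not a full HTTP implementation: `build_request` and `parse_response` are hypothetical helper names, and the reply is a canned byte string standing in for what a server would send.

```python
def build_request(path, host, headers=None):
    """Return the bytes a browser would send for an HTTP/1.0 GET."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_response(raw):
    """Split a raw response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


request = build_request("/index.html", "www.yahoo.com", {"Accept-Encoding": "gzip"})

# Canned reply standing in for what the server would send back.
canned = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\nContent-Type: text/html\r\n\r\nhello"
status, headers, body = parse_response(canned)
print(status, headers["content-type"], body)
```

A real browser would repeat the same exchange for every embedded image URL it finds in the body.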
7HTTP, a simple chatty protocol
- ASCII-based commands over TCP/IP
- A bad fit for TCP/IP: why?
- Fundamentally request-reply (like RPC), client-initiated: precludes true server push
- Stateless: every request is completely independent
- No intrinsic way to create associations between distinct requests
- No provisions for maintaining state that persists across requests
- Early addition to HTTP: cookies
- Extra header added by server to HTTP response
- Cookie is typically opaque to client
- Client should hand cookie back on subsequent requests to same server
- Client isn't obligated to honor cookies (usually a user preference), but in practice most sites are now useless without them
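The cookie round trip can be sketched with Python's standard `http.cookies` module. This is an illustration only: the cookie name `session` and value `abc123` are made up, and the "client" and "server" are just consecutive steps in one script.

```python
from http.cookies import SimpleCookie

# Server side: attach a cookie to the response.
outgoing = SimpleCookie()
outgoing["session"] = "abc123"          # hypothetical opaque value
outgoing["session"]["path"] = "/"
set_cookie_header = outgoing.output(header="Set-Cookie:")

# Client side: remember the value, send it back on the next request.
jar = SimpleCookie()
jar.load(set_cookie_header.split(":", 1)[1].strip())
cookie_header = "Cookie: " + "; ".join(f"{k}={m.value}" for k, m in jar.items())

# Server side again: recover the value from the request header.
incoming = SimpleCookie()
incoming.load(cookie_header.split(":", 1)[1].strip())
print(set_cookie_header)
print(cookie_header)
print(incoming["session"].value)
```

The client never interprets the value; it only stores it and echoes it back, which is what lets the server associate otherwise independent requests.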
8Meet Apache (a patchy web server)
- Naive server implementation (NCSA httpd): listen, accept, fork+exec, repeat ad infinitum; breaks down quickly in engineering terms
- Open-source Apache evolved c. 1996 from patches to original httpd, now most popular on Web (70% in 2006)
- Replace fork() with select() + thread management
- Many processes, many threads, sophisticated memory management
- Can be configured as a proxy or cache too (later)
- Many, many functionalities and configuration options
- Apache modules: compiled-in glue to other components without process fork and context switch
- E.g., relational databases and interpreted languages like Perl
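The idea of replacing fork-per-request with select() can be sketched in a few lines. This is not how Apache is implemented; it is a minimal illustration using Python's `selectors` module, with connected socket pairs standing in for client connections so that no real network is needed.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# Three connected socket pairs stand in for three client connections.
pairs = [socket.socketpair() for _ in range(3)]
for client_end, server_end in pairs:
    server_end.setblocking(False)
    sel.register(server_end, selectors.EVENT_READ)

# Every client sends a request before the server looks at any of them.
for i, (client_end, _) in enumerate(pairs):
    client_end.sendall(f"request {i}".encode())

# One process, one loop: serve whichever connections are ready.
served = 0
while served < len(pairs):
    for key, _ in sel.select(timeout=1.0):
        conn = key.fileobj
        data = conn.recv(1024)
        conn.sendall(b"reply to " + data)
        sel.unregister(conn)
        served += 1

replies = [client_end.recv(1024) for client_end, _ in pairs]
for client_end, server_end in pairs:
    client_end.close()
    server_end.close()
sel.close()
print(replies)
```

A single process handles all three connections without creating a process per request, which is the cost the fork+exec design pays on every hit.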
9Web sites as programs: CGI
- Idea: run a program and send its output back to browser
- Need to name the program, pass input parameters, execute program, capture output, deal with errors (i.e., everything RPC does)
- First cut: Common Gateway Interface
- Allowable programs to execute are specified in a server config file that maps URLs to subdirectories
- Parameters embedded in URLs or forms (later, cookies)
- http://www.foo.com/search?term=white%20rabbit&show=10&page=1
- fork()+exec() used to execute program (later, for performance, embed certain types of code in server process; FastCGI)
- Program must generate correct HTTP headers, etc.
- Join stderr to stdout to capture errors
- Remote program may circumvent HTTP limitations
- Embed tokens or other hidden parameters to associate requests from same user
- Store data persistently on server, e.g., in filesystem, Berkeley DB, etc.
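What a CGI program has to do can be sketched as follows. This is a minimal illustration: `handle_cgi` is a hypothetical function name, the environment is passed in as a dictionary rather than read from the real process environment, and the query string is in the style of the search URL on this slide. `QUERY_STRING` is the standard CGI variable the web server sets before running the program.

```python
import io
from urllib.parse import parse_qs


def handle_cgi(environ, out):
    """What a CGI program does: read params from the environment,
    write HTTP headers, a blank line, then the body."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    term = params.get("term", [""])[0]
    page = int(params.get("page", ["1"])[0])
    out.write("Content-Type: text/plain\r\n")
    out.write("\r\n")                      # end of headers
    out.write(f"results for '{term}', page {page}\n")


# The web server would set QUERY_STRING before fork()+exec();
# here it is supplied by hand.
fake_env = {"QUERY_STRING": "term=white%20rabbit&show=10&page=1"}
buffer = io.StringIO()
handle_cgi(fake_env, buffer)
output = buffer.getvalue()
print(output)
```

In a real deployment the program would write to stdout and the server would relay that output to the browser.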
- How many concurrent clients?
- FCGI dispatchers
- CGI limited by ability to fork()
- open file descriptors (sockets)
- Noteworthy:
- Logically, dispatching mechanism is orthogonal to app code
- In practice, most middleware/app servers hardwire the choice
- Rails works with any
- Problems in moving CGI programs to FastCGI environment?
- Primitive steps toward separating I/O resource management from app code
(Diagram labels: your app; filesystem or database; TCP or Unix domain sockets)
11Server-side storage
- How to persist data across consecutive HTTP requests from same user?
- Embed data in URL, cookie, or as hidden variables in forms
- Better: store data on server side, embed just the handle ...why?
- Message: truth is on the server; program defensively
- Special case of data: session identifier that ties together requests from same user session (e.g., generated on user login)
- Why better to store data on server?
- Untrustworthy client, unreliable client, size of data...
- Handle can be authenticated/cryptographically signed, have an expiration time, etc. (client can't forge data if the only access to it is via authenticated handle)
- General pattern: server = truth, client = hint
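The "store data on the server, hand the client only a signed handle" pattern can be sketched with the standard `hmac` module. Everything named here is an assumption for illustration: `SECRET`, `SESSIONS`, `issue_handle`, `lookup`, and the `id.expiry.signature` handle layout are invented for this sketch, and an in-memory dictionary stands in for real server-side storage.

```python
import hashlib
import hmac
import time

SECRET = b"server-only-secret"      # hypothetical key, never sent to clients
SESSIONS = {}                        # handle -> data; the truth stays here


def issue_handle(session_id, data, ttl, now):
    """Store data server-side and return a signed, expiring handle."""
    SESSIONS[session_id] = data
    expires = int(now + ttl)
    payload = f"{session_id}.{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def lookup(handle, now):
    """Return the stored data, or None if the handle is forged or expired."""
    try:
        session_id, expires, sig = handle.split(".")
    except ValueError:
        return None
    expected = hmac.new(SECRET, f"{session_id}.{expires}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    if now > int(expires):
        return None
    return SESSIONS.get(session_id)


now = time.time()
handle = issue_handle("u42", {"cart": ["book"]}, ttl=3600, now=now)
forged = handle.replace("u42", "u43", 1)
print(lookup(handle, now), lookup(forged, now), lookup(handle, now + 7200))
```

The client holds only a hint: altering the handle invalidates the signature, and the data itself never leaves the server.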
12Programming Stacks: application servers
- Observation: many programmers were reinventing common machinery for dynamic Web sites
- Naming: routing between URLs and programs
- Connections between programs and storage (e.g., database)
- Presentation of output (wrapping in HTTP + HTML)
- Managing concept of user sessions
- Marshalling & unmarshalling stuff from cookies, forms, etc.
- Programming environments capture commonalities
- App writer creates business logic
- Mechanics of above largely handled by programming stack
- Improves business logic portability by virtualizing resources such as DB
- Pick your favorite language, storage solution, ...
13LAMP (Linux, Apache, PHP, MySQL)
- MySQL: excellent open-source RDBMS
- PHP: Perl-like language that can be embedded into HTML pages
- .php pages are passed through PHP interpreter before being sent back to browser
- Provides a lot of common machinery
- Virtualizes connection to MySQL
- Extract params from URLs, forms, etc.
- Lots of libraries available via PEAR
- Mixing of code and HTML can get messy as app gets complex
- How many interpreter instances?
PHP example (fragments):
... have already voted. Go away. ... setcookie(test, rated, time()+86400); ?>
<FONT COLOR=blue>You haven't voted before so I recorded your vote</FONT>
... admin, adpass); mysql_select_db($db, $mysql_link);
$result = mysql_query("SELECT impressions FROM tds_counter WHERE COUNT_ID='$cid'", $mysql_link);
if ($n = mysql_num_rows($result)) for ($i = 0; $i ... mysql_fetch_row($result) .... ?>
14Noteworthy about PHP
- Virtualizes DB API, but not stored objects
- Programmer still writes raw SQL queries to access objects; mapping of database fields to program objects is not intrinsic
- In contrast, cookies and sessions are first-class abstractions
- Proprietary alternatives: Microsoft ASP.NET, ...
(Diagram labels: .php pages; PHP interpreter; filesystem or database)
15Example: Java 2 Enterprise Edition (a/k/a EJB)
- Application = collection of Java components called beans
- Different bean types encapsulate business logic (functions) or access to objects stored in underlying DB
- At runtime, beans are deployed (instantiated) into containers
- J2EE server manages deployment, memory, Java thread allocation, database, etc.
- Java servlets or Java server pages activate appropriate bean(s)
- Open but incomplete spec: open-source (e.g., JBoss) and commercial (e.g., WebLogic, WebSphere) J2EE servers compete on features and engineering
16J2EE continued
- J2EE began life as Txn Monitors (e.g., Tuxedo)
- Before: 100s of DB clients, each a client-side app with long-lived connection
- Now: 100Ks of clients, each a Web browser? Doh!!
- Transaction monitors, a/k/a txn-oriented middleware: multiplex a small pool of long-lived DB connections across Web servers
- There are some gnarly scheduling issues here
- J2EE = TM + more common functions + the blessing of Java
- Noteworthy management features
- Can distinguish (e.g.) singleton beans from replicatable ones
- Session object beans can be marked stateful (manages a database row) or stateless
- Accepted practice generally eschews stateful SBs...why?
- Lets app developer talk about state management
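Multiplexing many requests over a small pool of long-lived database connections can be sketched as below. This is an illustration under stated assumptions, not a transaction monitor: in-memory SQLite connections stand in for the long-lived DB connections, threads stand in for web requests, and `POOL_SIZE`, `handle_request`, and the counters are names invented for this sketch.

```python
import queue
import sqlite3
import threading

POOL_SIZE = 2
pool = queue.Queue()
for _ in range(POOL_SIZE):
    # check_same_thread=False lets a connection be handed between threads;
    # the queue guarantees only one thread uses it at a time.
    pool.put(sqlite3.connect(":memory:", check_same_thread=False))

in_use_peak = 0
in_use = 0
lock = threading.Lock()
results = []


def handle_request(n):
    global in_use, in_use_peak
    conn = pool.get()               # blocks when all connections are busy
    try:
        with lock:
            in_use += 1
            in_use_peak = max(in_use_peak, in_use)
        row = conn.execute("SELECT ? * 2", (n,)).fetchone()
        with lock:
            results.append(row[0])
            in_use -= 1
    finally:
        pool.put(conn)              # return the connection for reuse


threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results), in_use_peak)
```

Twenty requests complete while at most two connections are ever open; which request waits, and for how long, is where the scheduling issues come from.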
17J2EE vs. LAMP
- J2EE much more heavyweight
- Harder learning curve: HelloWorld involves setting up JSPs, mapping to EJBs, declaration of EJB types, mapping of EJB accesses to database tables, ...
- Very challenging to engineer efficient J2EE server: the tail wagging the dog (typ. 90% of total programming stack)
- Memory management is a bear: long-lived objects, references all over the place, no good time to do garbage collection, leaks a fact of life
- Richness of Java language makes it harder
- Arguably promotes good practices
- Cleaner separation of logic (EJBs) from presentation (JSPs, servlet pages)
- Java platform independence (but some appservers' APIs nonstandard)
- Modularity, class-based OOP, mountains of files describing configurations of EJBs, etc.
- 800-pound gorilla for enterprise, less common in ...
18The new kid: Ruby on Rails
- Rails instantiates object-relational model over DB
- Concise code in terms of ORM, not SQL queries
- Strongest separation yet of model, view, controller
- Easy to write inefficient code (issues multiple queries)
- High-level abstraction makes data relationships obvious
- Ruby reflection facilitates convention over configuration
Rails example (fragments):
MODEL: class Order ... belongs_to :customer
class Customer ... has_many :orders
CONTROLLER: def findOrder(dt) @ords = Order.find(:conditions => ["shipDate ... ?", Time.now]) end
VIEW: @ords.each do |o| ... Name ...
Equivalent PHP (fragments):
... admin, adpass); mysql_select_db($db, $mysql_link);
$result = mysql_query("SELECT c.name, o.shipDate FROM Customers c, Orders o WHERE c.name = $name AND o.customer_id = c.id AND o.shipDate ... NOW", $mysql_link);
if ($n = mysql_num_rows($result)) for ($i = 0; ...
echo "Name " . $row[0] . " ship date " . $row[1]; ?>
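The "easy to write inefficient code (issues multiple queries)" point can be sketched by counting queries. This is not Rails or ActiveRecord: it is a hand-rolled imitation in Python with an in-memory SQLite database, and the table layout, the sample rows, and the `run` helper are all made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, ship_date TEXT);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, '2006-09-01'), (2, 2, '2006-09-02'), (3, 1, '2006-09-03');
""")

queries = 0


def run(sql, params=()):
    """Execute a query and count it."""
    global queries
    queries += 1
    return db.execute(sql, params).fetchall()


# ORM-style traversal: one query for the orders, then one per order
# to follow the belongs_to association.
queries = 0
lazy = []
for _, customer_id, ship_date in run("SELECT id, customer_id, ship_date FROM orders ORDER BY id"):
    (name,) = run("SELECT name FROM customers WHERE id = ?", (customer_id,))[0]
    lazy.append((name, ship_date))
lazy_queries = queries

# Hand-written join: the same rows in a single query.
queries = 0
joined = run("""SELECT c.name, o.ship_date FROM customers c, orders o
                WHERE o.customer_id = c.id ORDER BY o.id""")
joined_queries = queries
print(lazy_queries, joined_queries, lazy == joined)
```

Both approaches return the same rows, but following the association one object at a time costs one query per order on top of the initial one.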
19User-tracking techniques
- Apache log scraping
- Can't identify correlated requests, but can guess
- Time granularity typically seconds, depends on httpd.conf
- adsl-70-132-27-24.dsl.snfc21.sbcglobal.net - - [04/Sep/2006:21:44:27 +0000] "GET ... HTTP/1.1" 200 186
- adsl-70-132-27-24.dsl.snfc21.sbcglobal.net - - [04/Sep/2006:21:44:27 +0000] "GET ... HTTP/1.1" 200 161
- adsl-70-132-27-24.dsl.snfc21.sbcglobal.net - - [04/Sep/2006:21:44:27 +0000] "GET ... HTTP/1.1" 200 174
- HTTP redirect, the oldest trick in the book
- Original page increments a counter, redirects to real page (possibly on different site)
- Fat URLs: a cheaper way to do redirection (Google does this)
- More ambitious: target page rewriting (Babelfish used to do this)
- Cross-site cookies (e.g., DoubleClick)
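Log scraping can be sketched as parsing access-log lines and guessing which requests belong together. The log lines below are hypothetical (hosts and paths are invented), and the 30-minute gap used to split visits is an arbitrary heuristic chosen for this sketch, not something the log format provides.

```python
import re
from collections import defaultdict
from datetime import datetime

# Common Log Format: host ident user [time] "request" status bytes
LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)')

# Hypothetical log lines; the paths are made up for illustration.
log = """\
host-a.example.net - - [04/Sep/2006:21:44:27 +0000] "GET /a.html HTTP/1.1" 200 186
host-a.example.net - - [04/Sep/2006:21:44:29 +0000] "GET /b.html HTTP/1.1" 200 161
host-b.example.net - - [04/Sep/2006:21:44:30 +0000] "GET /a.html HTTP/1.1" 200 186
host-a.example.net - - [04/Sep/2006:22:30:00 +0000] "GET /c.html HTTP/1.1" 200 174
"""

GAP = 30 * 60   # guess: >30 minutes of silence starts a new visit

visits = defaultdict(list)          # host -> list of visits (lists of paths)
last_seen = {}
for line in log.splitlines():
    m = LINE.match(line)
    if not m:
        continue
    host, stamp, _method, path = m.group(1, 2, 3, 4)
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    if host not in last_seen or (when - last_seen[host]).total_seconds() > GAP:
        visits[host].append([])     # start a new guessed visit
    visits[host][-1].append(path)
    last_seen[host] = when

print(dict(visits))
```

Grouping by host and time gap is only a guess: several users behind one proxy look identical, which is why the other techniques on this slide exist.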
20Client Side Fun: DOM, JavaScript (or Enough Rope To Hang Yourself)
- Document Object Model: most browsers since 1998
- Treats browser environment + delivered page as hierarchical collection of objects (page, embedded images, paragraphs, tables, forms, ...)
- Namespace of these objects is available to JavaScript
- E.g., if (navigator.platform == "MacPPC") // do MacOS-specific stuff
- JavaScript interpreter embedded in most browsers c. 1998
- <script> tags mark scripts embedded in HTML pages or fetched from server (like embedded images)
- Events corresponding to document manipulation (e.g., onLoad) and UI actions (onClick, onMouseOver) are dispatched to JS handlers
- Key: modifying DOM elements changes the page content!
- document.write("I appear in the page but not between HTML tags")
- document.images[0].src = "http://.../foo.gif" // load new image
- Great delivery mechanism for malware, viruses, etc.
- Tail wags dog: MSDN homepage has 72 embedded ...
21Other Client-side Fun
- ActiveX controls (Microsoft IE only)
- Native x86 code loaded into browser's address space
- Can run with privileges of browser itself
- Contrast: JavaScript is restricted in what operations are allowed
- Browser plug-ins/helpers/etc.
- Most browsers support them; all mutually incompatible
22AJAX: Asynchronous JavaScript + XML
- Idea: use JavaScript to perform HTTP requests in the background (popularized by Google Maps)
- Already done for things like loading animated images
- XMLHttpRequest (appeared in browsers c. 2000) allows async HTTP request with callback; browser plugin or native impl.
- E.g., use for prefetching or for asynchronously filling in page content after page frame has loaded
- Not new functionality, but a set of practices and libraries
- Implications
- Complete separation of req/resp from rendering means browser-UI-based apps can be structured like desktop apps
- Breaks workload modeling that assumes think time = idle
- DANGER: collecting persistent input at client before committing to server
23Workload Generation & Testing
- Stress testing, functional testing, benchmarking
- Stress: subject system to high offered load; what happens?
- Functional: does app code work and handle cases correctly?
- Rails provides particularly good support for this
- Benchmark: how fast is my app/platform compared to alternate implementations?
- Coverage: have all code paths in app been tested?
- What to look for in a stress generator
- Open loop: how many requests/second can be generated?
- Closed loop: how many concurrent users can be simulated?
- How accurately? (e.g., transition probabilities, AJAX requests)
- Ability to simulate hotspots?
- Read-only, or read-write workload?
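The open-loop versus closed-loop distinction can be sketched with simple arithmetic. This is a deliberately idealized model, not a load generator: the function names and all numbers are illustrative, every request is assumed to take exactly the stated response time, and users are assumed to behave identically.

```python
def closed_loop_requests(users, think_time, response_time, duration):
    """Each simulated user waits for a reply, thinks, then sends again,
    so the offered load depends on how fast the server answers."""
    cycle = response_time + think_time
    return users * int(duration // cycle)


def open_loop_requests(rate, duration):
    """Requests arrive at a fixed rate no matter how slow the server is."""
    return int(rate * duration)


# 100 users, 4 s think time, 60 s run (all numbers are illustrative).
fast = closed_loop_requests(100, think_time=4.0, response_time=1.0, duration=60)
slow = closed_loop_requests(100, think_time=4.0, response_time=6.0, duration=60)
steady = open_loop_requests(rate=20, duration=60)
print(fast, slow, steady)
```

In the closed-loop case a slower server automatically receives fewer requests, so a closed-loop generator can hide an overload that an open-loop generator would expose.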
24Balancing the 3-tier Pipeline
- Horizontal scaling concerns at each tier
- Latency measurement as seen by each tier
25Canonical Graphs & Parameters
- Latency vs. offered load
- Closed or open loop: users with think times
- Any system will eventually fall off a cliff to high response times. The only question is what bottleneck was hit first.
- Throughput is another metric that is less user-centric but important to admins
- Bottlenecks
- I/O in & out of server
- Virtual machine traps
- Database performance
- CPU: executing mundane Ruby code (read the Ruby performance article for interesting notes on this)
- RAM: used by MySQL for caching table rows & queries, by Rails for caching fragments & pages, and to hold footprints of Ruby interpreters
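The "cliff" in a latency vs. offered load graph can be sketched with the textbook M/M/1 queueing formula. Treating a web stack as a single M/M/1 queue is an assumption made here purely for illustration; the 10 ms service time and the sample arrival rates are invented, and real systems have several queues and bottlenecks.

```python
def response_time(service_time, arrival_rate):
    """Mean response time of an M/M/1 queue: S / (1 - utilization)."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")         # past saturation the queue grows without bound
    return service_time / (1.0 - utilization)


S = 0.010                            # assumed 10 ms of service per request
curve = {rate: response_time(S, rate) for rate in (10, 50, 90, 99, 100)}
for rate, latency in curve.items():
    print(rate, latency)
```

Latency stays close to the bare service time at low load and grows without bound as the arrival rate approaches the capacity of 1/S requests per second.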
26Finding CPU-related bottlenecks
- Scan logs to map queries to operations. Where are hotspots?
- Latency breakdown by part of system. Where are bottlenecks? Can you do more detailed profiling?
- trace will take this to new heights
- Relationship of latency to table sizes
- Are these operations fundamentally slow or just badly written?
- Are there quadratic relationships that could be linear?
- What's in the session object? (It has to be serialized/deserialized)
- What about trading this for additional query by storing object ID only?
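The session-object trade-off can be sketched by comparing serialized sizes. This is an illustration only: the `user` record, the `USERS` dictionary standing in for a database table, and `load_user` are hypothetical, and Python's `pickle` stands in for whatever serializer the framework actually uses.

```python
import pickle

# Hypothetical objects: a user record with a large profile blob.
user = {"id": 42, "name": "alice", "profile": "x" * 10_000}

fat_session = {"user": user, "cart": [1, 2, 3]}            # whole object inline
thin_session = {"user_id": user["id"], "cart": [1, 2, 3]}  # handle only

fat_bytes = len(pickle.dumps(fat_session))
thin_bytes = len(pickle.dumps(thin_session))

USERS = {42: user}                    # stands in for the database table


def load_user(session):
    """The extra lookup paid per request when only the ID is stored."""
    return USERS[session["user_id"]]


print(fat_bytes, thin_bytes, load_user(thin_session)["name"])
```

Storing only the ID keeps every request's serialize/deserialize cost small, at the price of one extra lookup when the full object is actually needed.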
27Key messages
- Web = RPC, with same liabilities & advantages.
- Explicit state management has bred recovery-friendly app structure because you always know where the state is, and what kind of state it is.
- Persistent/relational - business data
- Persistent/single-key - user ID/profile
- Session-lived - session state, shopping carts, etc.
- Transient (less than a session)
- Dealing with replication is still a challenge, but at least now you know what could be replicated
- Corollary: the truth is on the server; anything on the client is a hint. You generally can make guarantees about durability and integrity on the server side that you can't make on the client side.
- Mashups and AJAX are rapidly obsoleting previous assumptions about user behavior, offered load, ...
- For the masses, shortest learning curve wins
- PHP+MySQL beat out EJB
- AJAX/XMLHttpRequest beat out WSDL/SOAP
- Didn't help that SOAP over HTTP is 10x slower than RMI/CORBA and not clear if anyone understands WSDL
- Google Web Toolkit has made it even easier by providing Java-to-JavaScript/DOM/Ajax compiler
- Ruby on Rails may beat out PHP?
29Lab 1 discussion
- VM vs. raw hardware overhead
- Using the logs to find hotspots
- Caching