Title: Proxy Servers
1 Proxy Servers
2 What Is a Proxy Server?
- Intermediary server between clients and the actual server
- Proxy processes request
- Proxy processes response
- Intranet proxy may restrict all outbound/inbound requests to the intranet server
3 What Does a Proxy Server Do?
- Sits between client and server
- Receives the client request
- Decides if request will go on to the server
- May have a cache and may respond from it
- Acts as the client with respect to the server
- Uses one of its own IP addresses to get the page from the server
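The steps on this slide can be sketched as one function. This is a minimal illustration rather than a real proxy: `handle_request`, `is_allowed`, and `fetch_from_origin` are hypothetical names, and the cache is assumed to be a plain dictionary keyed by URL.

```python
from typing import Callable, Dict, Optional


def handle_request(
    url: str,
    cache: Dict[str, bytes],
    is_allowed: Callable[[str], bool],
    fetch_from_origin: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the response body for url, or None if the proxy refuses it."""
    # Decide whether the request may go on to the server.
    if not is_allowed(url):
        return None
    # Respond from the cache when a copy is available.
    if url in cache:
        return cache[url]
    # Otherwise act as the client: the proxy itself contacts the server.
    body = fetch_from_origin(url)
    cache[url] = body
    return body
```

Passing the filter and the fetcher in as callables keeps the sketch independent of any particular network library.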
4 Usual Uses for Proxies
- Firewalls
- Employee web use control (email etc.)
- Web content filtering (kids)
- Black lists (sites not allowed)
- White lists (sites allowed)
- Keyword filtering of page content
5 User Perspective
- Proxy is invisible to the client
- IP address of proxy is the one used, or the browser is configured to go there
- Speeds up retrieval if using caching
- Can implement profiles or personalization
6 Main Proxy Functions
- Caching
- Firewall
- Filtering
- Logging
7 Web Cache Proxy
- Our concern is not with browser cache!
- Store frequently used pages at the proxy rather than asking the server to find or create them again
- Why?
  - Reduce latency: faster to get from proxy, so the server seems more responsive
  - Reduce traffic: fewer requests reach the actual server
8 Proxy Caches
- Proxy cache serves hundreds/thousands of users
- Corporations and intranets often use them
- Responses to the most popular requests are generated only once
- Good news
  - Proxy cache hit rates often reach 50%
- Bad news
  - Stale content (stock quotes)
9 How Does a Web Cache Work?
- Set of rules from either or both of:
  - Proxy admin
  - HTTP header
10 Don't Cache Rules
- HTTP header
  - Cache-Control: max-age=xxx, must-revalidate
  - Expires: date
  - Last-Modified: date
  - Pragma: no-cache (doesn't always work!)
- Object is authenticated or secure
- Fails proxy filter rules
  - URL
  - Meta data
  - MIME type
  - Contents
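A minimal sketch of how such rules might be checked, assuming the response headers arrive as a name-to-value mapping. The function name `may_cache` and its parameters are hypothetical. Treating `no-cache`, `no-store`, and `max-age=0` as "do not cache" is a simplification of real HTTP caching, and the proxy's own filter is reduced to a callable.

```python
from typing import Callable, Mapping


def may_cache(
    headers: Mapping[str, str],
    is_authenticated: bool,
    passes_filter: Callable[[], bool] = lambda: True,
) -> bool:
    """Return False if any of the 'don't cache' rules applies."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value.lower() for name, value in headers.items()}
    directives = [d.strip() for d in h.get("cache-control", "").split(",")]
    # Simplification: any of these directives means "do not serve from cache".
    if any(d in ("no-cache", "no-store", "max-age=0") for d in directives):
        return False
    # Pragma: no-cache is the older HTTP/1.0 equivalent.
    if "no-cache" in h.get("pragma", ""):
        return False
    # Authenticated or secure objects are not shared between users.
    if is_authenticated:
        return False
    # Finally apply the proxy's own filter rules (URL, metadata, MIME type, contents).
    return passes_filter()
```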
11 Getting From Cache
- Use cache copy if it is fresh
  - Within date constraint
  - Used recently and modified date is not recent
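The freshness test can be sketched as follows. This is an illustration under stated assumptions: `is_fresh` and its parameters are hypothetical names, and the 10% heuristic used when only a modification date is known is a common convention, not something the slides specify.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_fresh(
    now: datetime,
    stored_at: datetime,
    max_age: Optional[timedelta] = None,
    expires: Optional[datetime] = None,
    last_modified: Optional[datetime] = None,
    heuristic_fraction: float = 0.1,
) -> bool:
    """Decide whether a cached copy may be served without asking the server."""
    age = now - stored_at
    # Explicit date constraints win when present.
    if max_age is not None:
        return age <= max_age
    if expires is not None:
        return now <= expires
    # Heuristic: a page that had not changed for a long time when it was
    # stored is assumed to stay valid for a fraction of that period.
    if last_modified is not None:
        return age <= (stored_at - last_modified) * heuristic_fraction
    # No information at all: treat the copy as stale.
    return False
```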
12 2. Firewalls
- Proxies for security protection
- More on this later
13 3. Filtering at the Proxy
- URL lists (black and white lists)
- Meta data
- Content filters
14 Filtering
[Figure: filtering diagram. Recoverable labels: Web doc, URL lists, keywords, label base (URLs, ratings)]
15 The Problem: the Web
- 1 billion documents (April 2000)
- Average query is 2 words (e.g., Sara name)
- Continual growth
- Balance global indexing and access against unintentional access to inappropriate material
16 Filtering Application Types
- Proxies
- Black lists
- White lists
- Keyword profiles
- Labels
17 Black and White Lists
- Black list: URLs the proxy will not access
- White list: URLs the proxy will allow access to
18 How Is Filtering/Selection Done?
- Build a profile of preferences
- Match input against the profile using rules
19 Black and White Lists
- Black list of URLs
  - No access allowed
- White list of URLs
  - Access permitted
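A list check can be as small as the sketch below. It assumes the lists hold host names rather than full URLs, which is a design choice made here for brevity. The function name `allowed_by_lists` is hypothetical.

```python
from typing import Optional, Set
from urllib.parse import urlsplit


def allowed_by_lists(url: str, black: Set[str], white: Optional[Set[str]] = None) -> bool:
    """Apply a black list and, if one is configured, a white list of host names."""
    host = (urlsplit(url).hostname or "").lower()
    # Black list: no access allowed.
    if host in black:
        return False
    # White list: when present, only listed hosts are permitted.
    if white is not None:
        return host in white
    return True
```

With only a black list, everything not listed is allowed. Adding a white list reverses the default so that everything not listed is refused.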
20 Lists in Action
- 1 billion documents!
- Who builds the lists?
- Who updates them?
- How frequent are the updates?
21 Labels
- Metadata tags
- Rule driven (PICS rules, for example)
- Labels are part of the document or separate
- Separate: label bureau
22 Labels
- Metadata (goes with page)
- Label Bureau (stored separately from page)
23 Meta Data as part of HTML doc
- <HTML>
- <HEAD>
- <META HTTP-EQUIV="keywords" CONTENT="federal">
- <META HTTP-EQUIV="keywords" CONTENT="tax">
- </HEAD>
-
- </HTML>
- Browser and/or proxy interpret the metadata
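As an illustration of how a proxy might read this metadata, the sketch below collects the `CONTENT` values of `META` tags whose `HTTP-EQUIV` is `keywords`, using Python's standard `html.parser`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class KeywordMetaParser(HTMLParser):
    """Collect CONTENT values from <META HTTP-EQUIV="keywords" ...> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.keywords: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        a = {name: (value or "") for name, value in attrs}
        if a.get("http-equiv", "").lower() == "keywords":
            self.keywords.append(a.get("content", ""))


def extract_meta_keywords(html: str) -> List[str]:
    parser = KeywordMetaParser()
    parser.feed(html)
    parser.close()
    return parser.keywords
```

The resulting list can then be matched against a keyword profile or a black list.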
24 Metadata Apart From Doc
- Label bureaus
- Request for a doc is also a request for labels from one or more label bureaus
- Who makes the labels?
  - Text analysis
  - Community of users
  - Creator of document
25 Labels: Collaborative Filtering
[Figure: collaborative filtering diagram. Recoverable labels: Web Site, Author Labels, Rating Service, Labels, Label Bureau A, Label Bureau B, Search Engine]
26 PICS and PICS Rules
- Tools for communities to use profiles and control/direct access
- Structure designed by the W3 Consortium
- Content designed by communities of users
27 PICS Rating Data
- (PICS-1.1 "http://www.abc.org/r1.5"
- by "John Doe"
- labels on "1998.11.05"
- until "2000.11.01"
- for "http://www.xyz.com/new.html"
- ratings (violence 2 blood 1 language 4)
- )
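To make the `ratings` clause concrete, here is a rough sketch that pulls the category/value pairs out of a label string. It assumes simple numeric values and a single `ratings (...)` clause, so it is far from a full PICS parser. The function name `parse_ratings` is hypothetical.

```python
import re
from typing import Dict


def parse_ratings(label: str) -> Dict[str, float]:
    """Extract 'name value' pairs from the ratings (...) clause of a label."""
    match = re.search(r"ratings\s*\(([^()]*)\)", label)
    if match is None:
        return {}
    tokens = match.group(1).split()
    # Tokens alternate between category name and numeric value.
    return {name: float(value) for name, value in zip(tokens[0::2], tokens[1::2])}
```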
28 Using a URL List for Filtering
- (PicsRule-1.1
- (Policy (RejectByURL ("http://www.xyz.com/"))
- Policy (AcceptIf "otherwise")
- )
- )
29 Using the PICS Data
- (PicsRule-1.1
- (serviceinfo (
- "http://www.lablist.org/ratings/v1.html"
- shortname "PTA"
- bureauURL "http://www.lablist.org/ratings"
- UseEmbedded "N"
- )
- Policy (RejectIf "((PTA.violence > 3) or (PTA.language > 2))")
- Policy (AcceptIf "otherwise")
- )
- )
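The effect of this rule on a page's ratings can be mimicked in a few lines. The sketch below hard-codes the two thresholds from the slide as defaults. Treating a missing rating as 0 is an assumption made here, and `accept_page` is a hypothetical name.

```python
from typing import Dict, Optional


def accept_page(ratings: Dict[str, float], limits: Optional[Dict[str, float]] = None) -> bool:
    """Reject if any rating exceeds its limit, otherwise accept."""
    if limits is None:
        # Thresholds taken from the rule: violence > 3 or language > 2 rejects.
        limits = {"violence": 3, "language": 2}
    for category, limit in limits.items():
        # Assumption: a category with no rating counts as 0.
        if ratings.get(category, 0) > limit:
            return False
    return True
```

For the label on the earlier slide (violence 2, blood 1, language 4), the language rating exceeds 2, so the page would be rejected.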
30 Example Medical PICS labels
- Su: UMLS vocab word (0-9999999)
- Aud: audience (1-patient, 3-para, 5-GP, etc.)
- Ty: information type (5-scientist, 3-patient, 4-prod)
- C: country (1-Can, 4-Afghan, etc.)
- Etc.
- Ratings (su 0019186 aud 35 Ty 3 C 1)
31 User Profiles for Labels
- Rules for interpreting ratings
- Based on
  - User preferences
  - User access privileges
- Who keeps these?
- Who updates these?
- How fine is the granularity?
32 Labels and Digital Signatures
- Labels can also be used to carry digital signature and authority information
33 Example
- ("byKey" (("N" "aba21241241")
- ("E" "abcdefghijklmnop")))
- ("on" "1996.12.02T22:20-0000")
- ("SigCrypto" "aba1241241"))
- ("Signature" "http://www.w3.org/TR/1998/REC-DSig-label/DSS-1_0"
- ("ByName" "plipp_at_iaik.tu-graz.ac.at")
- ("on" "1996.12.02T22:20-0000")
- ("SigCrypto" (("R" "aba124124156")
- ("S" "casdfkl3r489")))
))
34 Proxy level (hidden)
35 Text analysis of Page content
- Proxy examines text of page before showing it
- Generally keyword based
- Profile of black and/or white keywords
36 Profiles for Text analysis
- Keywords (sometimes with weights)
- Reflect interest of user or user group
- May be used to eliminate pages
  - All but
- May be used to select pages
  - Only those
37 Keyword matching algorithms
- Extract keywords
- Eliminate noisy words with stop list (1/3)
- Stem (computer, compute, computation)
- Match to profile
- Evaluate value of match
- Check against a threshold for match
- Show or throw!
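The pipeline above can be sketched end to end. Everything here is illustrative: the stop list is the 27-word list from the next slide, the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and the function names, the weighted profile, and the threshold are hypothetical.

```python
import re
from typing import Dict, List

STOP_WORDS = {
    "the", "for", "of", "on", "and", "is", "to", "with", "in", "by", "a", "as",
    "be", "this", "will", "are", "from", "that", "or", "at", "been", "an",
    "was", "were", "have", "has", "it",
}


def stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ation", "ers", "er", "ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text: str) -> List[str]:
    """Extract words, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def match_score(text: str, profile: Dict[str, float]) -> float:
    """Sum the profile weights of every keyword occurrence in the text."""
    stemmed_profile = {stem(term): weight for term, weight in profile.items()}
    return sum(stemmed_profile.get(k, 0.0) for k in keywords(text))


def matches(text: str, profile: Dict[str, float], threshold: float) -> bool:
    """True when the page matches the profile strongly enough."""
    return match_score(text, profile) >= threshold
```

Whether a match means "show" or "throw" depends on whether the profile holds white or black keywords, so the sketch leaves that decision to the caller.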
38 Stop List (35)
- the for
- of on
- and is
- to with
- in by
- a as
- be this
- will are
- from that
- or at
- been an
- was were
- have has
- it
- (27 words)
39 Matching Profile to Page
- Similarity?
- How many profile terms occur in doc?
- How often?
- How many docs does term occur in?
- How important is the term to the profile?
40 Cosine Similarity Measurement
- Profile terms weighted PW in (0,1) by importance
- Document terms weighted TW in (0,1)
  - frequency in doc
  - frequency in whole set
- Overall closeness of doc to profile:
$$\mathrm{closeness} = \frac{\sum_{\text{all profile terms}} TW \cdot PW}{\sqrt{\sum_{\text{all profile terms}} TW^{2} \cdot \sum_{\text{all profile terms}} PW^{2}}}$$
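A small sketch of the standard cosine measure, with sums taken over the profile terms as on this slide. Representing the profile and document weights as dictionaries, and returning 0 when either side has no weight, are assumptions made for the example. The function name is hypothetical.

```python
import math
from typing import Dict


def cosine_similarity(profile: Dict[str, float], doc: Dict[str, float]) -> float:
    """Cosine of the angle between profile weights (PW) and document weights (TW)."""
    # Numerator: sum over all profile terms of TW * PW.
    dot = sum(pw * doc.get(term, 0.0) for term, pw in profile.items())
    # Denominator: square root of (sum of TW^2) times (sum of PW^2).
    norm_p = math.sqrt(sum(pw * pw for pw in profile.values()))
    norm_d = math.sqrt(sum(doc.get(term, 0.0) ** 2 for term in profile))
    if norm_p == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_p * norm_d)
```

A value near 1 means the document's term weights point in the same direction as the profile. A value of 0 means no profile term appears in the document.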
41 What works well?
Nothing
42 What's the problem?
- Site Labels
  - Who does them?
  - Are they authentic?
  - Has the source changed?
  - A billion docs?
- Black and White lists
  - Ditto
- Text analysis of page contents
  - Poor results