1
Crawling and Ranking
2
HTML (HyperText Markup Language)
  • Describes the structure and content of a (web)
    document
  • HTML 4.01: most common version, a W3C standard
  • XHTML 1.0: XML-ization of HTML 4.01, minor
    differences
  • Validation (http://validator.w3.org/) against a
    schema. Checks the conformity of a Web page with
    respect to recommendations, for accessibility
  • to all graphical browsers (IE, Firefox, Safari,
    Opera, etc.)
  • to text browsers (lynx, links, w3m, etc.)
  • to all other user agents, including Web crawlers

3
The HTML language
  • Text and tags
  • Tags define structure
  • Used for instance by a browser to lay out the
    document.
  • Header and Body

4
HTML structure
    <!DOCTYPE html>
    <html lang="en">
    <head>
      <!-- Header of the document -->
    </head>
    <body>
      <!-- Body of the document -->
    </body>
    </html>

5
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
      <meta http-equiv="Content-Type"
            content="text/html; charset=utf-8" />
      <title>Example XHTML document</title>
    </head>
    <body>
      <p>This is a <a href="http://www.w3.org/">link to
      the W3C</a></p>
    </body>
    </html>



6
Header
  • Appears between the tags <head> ... </head>
  • Includes meta-data such as language and encoding
  • Also includes the document title
  • Used by (e.g.) the browser to decipher the body

7
Body
  • Between <body> ... </body> tags
  • The body is structured into sections, paragraphs,
    lists, etc.
  • <h1>Title of the page</h1>
  • <h2>Title of a main section</h2>
  • <h3>Title of a subsection</h3>
  • . . .
  • <p> ... </p> defines paragraphs
  • More block elements, such as tables and lists

8
HTTP
  • Application protocol
  • Client request:
  • GET /MarkUp/ HTTP/1.1
  • Host: www.google.com
  • Server response:
  • HTTP/1.1 200 OK
  • Two main HTTP methods: GET and POST
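
A minimal sketch of issuing such a request from Python, using only the
standard http.client module (host and path are the ones on the slide):

    import http.client

    # Open a connection and send the GET request from the slide.
    # http.client adds the Host header automatically.
    conn = http.client.HTTPConnection("www.google.com")
    conn.request("GET", "/MarkUp/")

    # The response starts with a status line, e.g. "200 OK".
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    conn.close()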

9
GET
  • URL: http://www.google.com/search?q=BGU
  • Corresponding HTTP GET request:
  • GET /search?q=BGU HTTP/1.1
  • Host: www.google.com

10
POST
  • Used for submitting forms; the form data travels
    in the request body (an example follows)
  • POST /php/test.php HTTP/1.1
  • Host: www.bgu.ac.il
  • Content-Type: application/x-www-form-urlencoded
  • Content-Length: 100
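
For completeness, a sketch of what the full message might look like
with its urlencoded body; the field names and values here are
illustrative, not from the slide:

    POST /php/test.php HTTP/1.1
    Host: www.bgu.ac.il
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 24

    name=John+Doe&course=web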

11
Status codes
  • An HTTP response always starts with a status code
    followed by a human-readable message (e.g., 200
    OK)
  • The first digit indicates the class of the
    response:
  • 1: Information
  • 2: Success
  • 3: Redirection
  • 4: Client-side error
  • 5: Server-side error

12
Authentication
  • HTTPS is a variant of HTTP that includes
    encryption, cryptographic authentication, session
    tracking, etc.
  • It can be used instead of plain HTTP to transmit
    sensitive data
  • GET ... HTTP/1.1
  • Authorization: Basic dG90bzp0aXRp

13
Cookies
  • Key/value pairs that a server asks a client to
    store and retransmit with each HTTP request (for
    a given domain name)
  • Can be used to keep information on users between
    visits
  • Often what is stored is a session ID, connected,
    on the server side, to all session information
    (a sketch of the exchange follows)
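
A sketch of the exchange; the cookie name and value are illustrative:

    Server response:
        HTTP/1.1 200 OK
        Set-Cookie: SESSIONID=x7f2a9

    Subsequent client requests to the same domain:
        GET /page HTTP/1.1
        Host: example.com
        Cookie: SESSIONID=x7f2a9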

14
Crawling
15
Basics of Crawling
  • Crawlers, (Web) spiders, (Web) robots: autonomous
    agents that retrieve pages from the Web
  • Basic crawling algorithm (a sketch follows this
    list):
  • 1. Start from a given URL or set of URLs
  • 2. Retrieve and process the corresponding
     page
  • 3. Discover new URLs (next slide)
  • 4. Repeat on each found URL
  • Problem: the Web is huge!
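
A minimal sketch of this algorithm in Python, using only the standard
library; link discovery is reduced to a crude regular expression for
brevity:

    import re
    import urllib.request
    from collections import deque

    def crawl(seed_urls, max_pages=100):
        """Basic crawling algorithm: fetch, discover URLs, repeat."""
        frontier = deque(seed_urls)              # 1. start from a set of URLs
        seen = set(seed_urls)
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:                                 # 2. retrieve the page
                page = urllib.request.urlopen(url, timeout=5)
                html = page.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:             # 3. discover new URLs
                    seen.add(link)
                    frontier.append(link)        # 4. repeat on each found URL
        return seen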

16
Discovering new URLs
  • Browse the "internet graph" (following e.g.
    hyperlinks)
  • Site maps (sitemap.org)

17
The internet graph
  • At least 14.06 billion nodes (pages)
  • At least 140 billion edges (links)
  • Lots of "junk"

18
Graph-browsing algorithms
  • Depth-first
  • Breadth-first
  • Combinations...
  • Parallel crawling

19
Duplicates
  • Identifying duplicates or near-duplicates on the
    Web to prevent multiple indexing
  • Trivial duplicates: the same resource at the same
    canonicalized URL
  • http://example.com:80/toto
  • http://example.com/titi/../toto
  • Exact duplicates: identification by hashing
  • Near-duplicates (timestamps, tip of the day,
    etc.): more complex!

20
Near-duplicate detection
  • Edit distance
  • A good measure of similarity
  • Does not scale to a large collection of documents
    (unreasonable to compute the edit distance for
    every pair!)
  • Shingles: two documents are similar if they mostly
    share the same succession of k-grams (a sketch
    follows)
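
A sketch of shingling with word-level k-grams; note that real systems
compare hashed sketches of the shingle sets (e.g., MinHash) rather than
the full sets, which is what makes this scale:

    def shingles(text, k=4):
        """The set of word-level k-grams (shingles) of a document."""
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def resemblance(a, b):
        """Jaccard similarity of two shingle sets."""
        return len(a & b) / len(a | b)

    d1 = shingles("the quick brown fox jumps over the lazy dog today")
    d2 = shingles("the quick brown fox jumps over the lazy dog yesterday")
    print(resemblance(d1, d2))  # high for near-duplicates (here 0.75)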

21
Crawling ethics
  • robots.txt, at the root of a Web server:
  • User-agent: *
  • Allow: /searchhistory/
  • Disallow: /search
  • Per-page exclusion (de facto standard):
  • <meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
  • Per-link exclusion (de facto standard):
  • <a href="toto.html" rel="nofollow">Toto</a>
  • Avoid denial of service (DoS): wait 100 ms to 1 s
    between two repeated requests to the same Web
    server (a sketch of checking robots.txt follows)
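
Python's standard library can check these rules for a polite crawler;
a minimal sketch (the URLs are illustrative):

    import urllib.robotparser

    # Fetch and parse robots.txt before crawling a site.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyCrawler", "http://example.com/search"):
        print("allowed to fetch")  # also wait between requests per host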

22
Overview
  • Crawl
  • Retrieve relevant documents
  • How?
  • How to define relevance, how to find relevant
    docs?
  • Rank
  • How?

23
Relevance
  • Input: a keyword (or set of keywords) and the Web
  • First question: how to define the relevance of a
    page with respect to a keyword?
  • Second question: how to store pages such that the
    relevant ones for a given keyword are easily
    retrieved?

24
Relevance definition
  • Boolean: based on the existence of a word in the
    document
  • Synonyms
  • Disadvantages?
  • Word count
  • Synonyms
  • Disadvantages?
  • Can we do better?

25
TF-IDF
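
The body of this slide is not in the transcript. As a reminder, a
sketch of a standard TF-IDF weighting (this particular variant is an
assumption):

    import math

    def tf_idf(term, doc, corpus):
        """TF-IDF: term frequency in the document, weighted by the
        rarity of the term across the corpus (docs are token lists)."""
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(len(corpus) / df) if df else 0.0
        return tf * idf

    corpus = [["crawling", "and", "ranking"],
              ["ranking", "web", "pages"],
              ["web", "pages"]]
    print(tf_idf("crawling", corpus[0], corpus))  # rare term, high weight
    print(tf_idf("ranking", corpus[0], corpus))   # common term, lower weight
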
26
Storing pages
  • Offline pre-processing can help online search
  • Offline pre-processing includes stemming and
    stop-word removal
  • As well as the creation of an index

27
Inverted Index
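
The body of this slide is also not in the transcript. A minimal sketch
of the structure, mapping each term to the postings (document,
position) that contain it:

    from collections import defaultdict

    def build_index(docs):
        """Inverted index: term -> list of (doc_id, position) postings."""
        index = defaultdict(list)
        for doc_id, text in enumerate(docs):
            for pos, term in enumerate(text.split()):
                index[term].append((doc_id, pos))
        return index

    index = build_index(["crawling and ranking", "ranking web pages"])
    print(index["ranking"])   # [(0, 2), (1, 0)]
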
28
More advanced text analysis
  • N-grams
  • HMM language models
  • PCFG language models
  • We will discuss all that later in the course!

29
Ranking
30
Why Ranking?
  • Huge number of pages
  • Huge even if we filter according to relevance
  • Keep only pages that include the keywords
  • A lot of the pages are not informative
  • And anyway it is impossible for users to go
    through 10K results

31
When to rank?
  • Before retrieving results
  • Advantage: offline!
  • Disadvantage: huge set
  • After retrieving results
  • Advantage: smaller set
  • Disadvantage: online, the user is waiting...

32
How to rank?
  • Observation: links are very informative!
  • Not just for discovering new sites, but also for
    estimating the importance of a site
  • CNN.com has more links to it than my homepage
  • Quality and efficiency are key factors

33
Authority and Hubness
  • Authority: a site is very authoritative if it
    receives many citations. A citation from an
    important site has more weight than a citation
    from a less-important site
  • A(v): the authority of v
  • Hubness: a good hub is a site that links to many
    authoritative sites
  • H(v): the hubness of v

34
HITS
  • Recursive dependency:
  • a(v) = Σ over edges (u,v) of h(u)
  • h(v) = Σ over edges (v,u) of a(u)
  • Normalize (when?) by the square root of the sum
    of squares of the authority / hubness values
  • Start by setting all values to 1
  • We could also add bias
  • We can show that a(v) and h(v) converge
    (a sketch of the iteration follows)
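
A sketch of the HITS iteration in Python; the toy graph is
illustrative:

    import math

    def hits(graph, iterations=50):
        """HITS: alternately update authority and hub scores, then
        normalize. graph maps each node to the nodes it links to."""
        nodes = list(graph)
        hub = {v: 1.0 for v in nodes}    # start with all values set to 1
        auth = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # a(v) = sum of h(u) over all edges (u, v)
            auth = {v: sum(hub[u] for u in nodes if v in graph[u])
                    for v in nodes}
            # h(v) = sum of a(u) over all edges (v, u)
            hub = {v: sum(auth[u] for u in graph[v]) for v in nodes}
            for scores in (auth, hub):   # normalize by root of sum of squares
                norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
                for v in scores:
                    scores[v] /= norm
        return auth, hub

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(hits(graph))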

35
HITS (cont.)
  • Works rather well if applied only to relevant Web
    pages
  • E.g., pages that include the input keywords
  • The results are less satisfying if applied to
    the whole Web
  • On the other hand, online ranking is a problem

36
Google PageRank
  • Works offline, i.e., computes for every website a
    score that can then be used online
  • Extremely efficient and high-quality
  • The PageRank algorithm that we will describe here
    appears in Brin & Page, 1998

37
Random Surfer Model
  • Consider a "random surfer"
  • At each point, the surfer chooses a link and
    clicks on it
  • A link is chosen with uniform distribution
  • A simplifying assumption...
  • What is the probability of being, at a random
    time, at a web page W?

38
Recursive definition
  • If PageRank reflects the probability of being at
    a web page (PR(W) = P(W)), then
  • PR(W) = PR(W1)/O(W1) + ... + PR(Wn)/O(Wn)
  • Where W1, ..., Wn are the pages linking to W and
    O(W) is the out-degree of W

39
Problems
  • A random surfer may get stuck in one component of
    the graph
  • May get stuck in loops
  • Rank Sink Problem
  • Many Web pages have no inlinks/outlinks

40
Damping Factor
  • Add some probability d for "jumping" to a random
    page
  • Now PR(W) = (1-d) · (PR(W1)/O(W1) + ... +
    PR(Wn)/O(Wn)) + d · (1/N)
  • Where N is the number of pages in the index

41
How to compute PR?
  • Simulation
  • Analytical methods
  • Can we solve the equations?

42
Simulation A random surfer algorithm
  • Start from an arbitrary page
  • Toss a coin to decide whether to follow a link or
    to randomly choose a new page
  • Then toss another coin to decide which link to
    follow / which page to go to
  • Keep a record of the frequency of the web pages
    visited (a sketch follows)
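
A sketch of this simulation in Python; here d is the probability of
jumping to a random page, matching the damping factor of slide 40, and
the toy graph is illustrative:

    import random

    def random_surfer(graph, d=0.15, steps=100_000):
        """Estimate PageRank by a random walk; d is the probability
        of jumping to a random page."""
        pages = list(graph)
        visits = {p: 0 for p in pages}
        page = random.choice(pages)            # start from an arbitrary page
        for _ in range(steps):
            visits[page] += 1
            if random.random() < d or not graph[page]:
                page = random.choice(pages)        # jump to a random page
            else:
                page = random.choice(graph[page])  # follow a uniform link
        return {p: count / steps for p, count in visits.items()}

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(random_surfer(graph))   # visit frequencies approximate PageRank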

43
Convergence
  • Not guaranteed without the damping factor!
  • (Partial) intuition: if unlucky, the algorithm
    may get stuck forever in one connected component
  • Claim: with damping, the probability of getting
    stuck forever is 0
  • More difficult claim: with damping, convergence
    is guaranteed

44
Markov Chain Monte Carlo (MCMC)
  • A class of very useful algorithms for sampling
    from a given distribution
  • We first need to know what a Markov Chain is

45
Markov Chain
  • A finite or countably infinite state machine
  • We will consider the case of finitely many states
  • Transitions are associated with probabilities
  • Markovian property: given the present state,
    future choices are independent of the past

46
MCMC framework
  • Construct (explicitly or implicitly) a Markov
    Chain (MC) that describes the desired
    distribution
  • Perform a random walk on the MC, keeping track of
    the proportion of visits to each state
  • Discard the samples taken before mixing
  • Return the proportions as an approximation of the
    desired distribution

47
Properties of Markov Chains
  • A Markov Chain defines a distribution on the
    different states (P(state) = the probability of
    being in that state at a random time)
  • We want conditions under which this distribution
    is unique, and under which a random walk will
    approximate it

48
Properties
  • Periodicity
  • A state i has period k if any return to state i
    must occur in multiples of k time steps
  • Aperiodic: period 1 for all states
  • Reducibility
  • An MC is irreducible if there is probability 1
    of (eventually) getting from every state to every
    state
  • Theorem: a finite-state MC has a unique
    stationary distribution if it is aperiodic and
    irreducible

49
Back to PageRank
  • The MC is the Web graph with the transition
    probabilities we have defined
  • MCMC is the random-walk algorithm
  • Is this MC aperiodic? Irreducible?
  • Why?

50
Problem with MCMC
  • In general, no guarantees on convergence time
  • Even for those nice MCs
  • A lot of work on characterizing nicer MCs that
    allow fast convergence
  • In practice, for the Web graph it converges
    rather slowly
  • Why?

51
A different approach
  • Reconsider the equation system
  • PR(W) = (1-d) · (PR(W1)/O(W1) + ... +
    PR(Wn)/O(Wn)) + d · (1/N)
  • A linear equation system!

52
Transition Matrix
  • T = ( 0     0.33  0.33  0.33
          0     0     0.5   0.5
          0.25  0.25  0.25  0.25
          0     0     0     0    )
  • Stochastic matrix

53
Eigenvector!
  • PR (a column vector) is the right eigenvector of
    the stochastic transition matrix
  • I.e., the adjacency matrix normalized so that
    every column sums to 1
  • The Perron-Frobenius theorem ensures that such a
    vector exists
  • It is unique under the same assumptions as before

54
Direct solution
  • Solving the equation set
  • Via, e.g., Gaussian elimination
  • This is time-consuming
  • Observation: the matrix is sparse
  • So iterative methods work better here

55
Power method
  • Start with some arbitrary rank vector R0
  • Compute Ri = A · Ri-1
  • If we happen to reach the eigenvector, we stay
    there
  • Theorem: the process converges to the
    eigenvector!
  • Convergence is in practice pretty fast (on the
    order of 100 iterations); a sketch follows
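
A sketch of the power method with NumPy, using slide 52's transition
matrix transposed to be column-stochastic; patching the dangling fourth
page to jump uniformly is an assumption:

    import numpy as np

    # Column-stochastic version of slide 52's matrix (transposed, with
    # the all-zero fourth row patched to a uniform jump).
    T = np.array([[0.0,  0.0, 0.25, 0.25],
                  [0.33, 0.0, 0.25, 0.25],
                  [0.33, 0.5, 0.25, 0.25],
                  [0.33, 0.5, 0.25, 0.25]])

    d, N = 0.15, 4
    A = (1 - d) * T + d / N      # damped matrix, still column-stochastic

    r = np.full(N, 1.0 / N)      # arbitrary start vector R0
    for _ in range(100):
        r = A @ r                # Ri = A · Ri-1
    print(r)                     # approximates the PageRank eigenvector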

56
Power method (cont.)
  • Every iteration is still expensive
  • But since the matrix is sparse it becomes
    feasible
  • Still, need a lot of tweaks and optimizations to
    make it work efficiently

57
Other issues
  • Accelerating Computation
  • Updates
  • Distributed PageRank
  • Mixed Model (Incorporating "static" importance)
  • Personalized PageRank