C20.0046: Database Management Systems Lecture
1
C20.0046 Database Management Systems, Lecture 28
  • Matthew P. Johnson
  • Stern School of Business, NYU
  • Spring, 2004

2
Agenda
  • Previously: failure/recovery
  • Next:
  • Finish data mining
  • Web search
  • Proj5 due today
  • no extensions!
  • Final Exam: Thursday, 5/6, 10-11:50am
  • 1-minute responses

3
Method: Genetic Algorithms
  • for optimization problems
  • each possible solution x has some fitness f(x)
  • space of possible solutions → a surface
  • goal: find an x at the top of a hill on the surface
  • example: TSP (traveling salesman problem)
  • with limitations
  • must go here before some point
  • cannot go from here to there
  • 25 cities → 24! possible tours
  • billions of years to try them all, under reasonable assumptions
  • GAs are one method for hill climbing
  • Avoid local maxima

4
GA motivation
  • Analogy from genetics
  • Create, at random, a population of possible solutions
  • Most are probably very bad, but...
  • Compute fitness f(x) for each member x
  • Create the next generation by picking relatively good pairs
  • Merge half of the info from each to create a new one
  • repeat (see the sketch below)
  • Examples: Glover, Dawkins, etc.
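A minimal sketch of this loop, assuming a bit-string encoding and a toy count-the-1-bits fitness function (both illustrative stand-ins for a real f(x), not from the lecture):

import java.util.*;

public class SimpleGA {
    static final int POP = 100, LEN = 20, GENS = 200;
    static final Random rnd = new Random();

    // Placeholder fitness: count of 1-bits (stands in for a real f(x)).
    static int fitness(boolean[] x) {
        int f = 0;
        for (boolean b : x) if (b) f++;
        return f;
    }

    // Pick a relatively good member: compare two random ones, keep the fitter.
    static boolean[] tournament(boolean[][] pop) {
        boolean[] a = pop[rnd.nextInt(POP)], b = pop[rnd.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static void main(String[] args) {
        // Create, at random, a population of possible solutions.
        boolean[][] pop = new boolean[POP][LEN];
        for (boolean[] x : pop)
            for (int i = 0; i < LEN; i++) x[i] = rnd.nextBoolean();

        for (int g = 0; g < GENS; g++) {           // repeat
            boolean[][] next = new boolean[POP][];
            for (int k = 0; k < POP; k++) {
                // Pick a relatively good pair...
                boolean[] p1 = tournament(pop), p2 = tournament(pop);
                // ...and merge half the info from each (one-point crossover).
                boolean[] child = new boolean[LEN];
                int cut = rnd.nextInt(LEN);
                for (int i = 0; i < LEN; i++) child[i] = i < cut ? p1[i] : p2[i];
                // Occasional mutation helps avoid local maxima.
                if (rnd.nextInt(10) == 0) child[rnd.nextInt(LEN)] ^= true;
                next[k] = child;
            }
            pop = next;
        }
        int best = 0;
        for (boolean[] x : pop) best = Math.max(best, fitness(x));
        System.out.println("best fitness: " + best);
    }
}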

5
Moody's example (DS)
  • Financial services
  • needed helpdesk dispatching system
  • send right person for each IT problem
  • experienced technicians to solve hard problems
  • minimize loss of value due to computer downtime
  • minimize individual wait time
  • tell users when their issues would be fixed
  • schedules did not have to be perfect, just good
  • job-shop scheduling

6
Moodys example
  • SOGA: Scheduling Optimizing GA
  • Fitness function: fitness inverse to total downtime and to max individual wait time
  • on the live set of current jobs!
  • includes time of request, type of user, type of problem, etc.
  • updates schedules every 10-15 minutes, based on
    new inputs
  • not-yet-started jobs can be reassigned

7
Neural Networks
  • Also do hill climbing, but completely different
    idea
  • based on connections between neurons in brain
  • but used for supervised learning
  • simple NN
  • input layer with 3 nodes
  • hidden layer with 2 nodes
  • output layer with 1 node
  • each node points to each node in next level
  • each node has some activation level and something like a critical mass (a threshold)
  • Draw picture
  • What kind of graph is this?

8
Neural Networks
  • values passed into input nodes represent the
    problem instance
  • given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
  • values output by the hidden nodes are sent to the output node
  • if the weighted sum going into the output node is high enough, it outputs 1, otherwise 0 (see the sketch below)
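A sketch of the 3-2-1 threshold network just described. The weights and thresholds here are made-up placeholders; as the next slides explain, a real network learns them:

public class TinyNN {
    // 2 hidden nodes, each fed by all 3 inputs (placeholder weights).
    static double[][] wHidden = {{0.5, -0.2, 0.8}, {-0.3, 0.9, 0.1}};
    static double[] thHidden = {0.4, 0.2};
    // 1 output node fed by both hidden nodes.
    static double[] wOut = {0.7, 0.6};
    static double thOut = 0.5;

    // A node sends a pulse (1) only if its weighted input sum exceeds its threshold.
    static int fire(double sum, double threshold) { return sum > threshold ? 1 : 0; }

    static int classify(double[] input) {
        int[] hidden = new int[2];
        for (int h = 0; h < 2; h++) {
            double sum = 0;
            for (int i = 0; i < 3; i++) sum += wHidden[h][i] * input[i];
            hidden[h] = fire(sum, thHidden[h]);
        }
        double sum = 0;
        for (int h = 0; h < 2; h++) sum += wOut[h] * hidden[h];
        return fire(sum, thOut); // e.g. 1 = "market to this customer"
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[]{1, 0, 1}));
    }
}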

9
NN applications
  • plausible application: we have data about potential customers
  • party registration
  • married or not
  • gender
  • income level
  • or have credit applicant information
  • employment
  • income
  • home ownership
  • bankruptcy
  • should we give a credit card to him?

10
How NNs work
  • hope: plug in a customer → out comes whether we should market toward him
  • How does it get the right answer?
  • Initially, all weights are random!
  • But we assume we have data for lots of people
    which we know to be either interested in our
    products or not
  • we have data for both kinds
  • so, when we plug in one of these customers, we
    know what the right answer is supposed to be

11
How NNs work
  • can use the backpropagation algorithm
  • for each known problem instance, plug in and look at the answer
  • if the answer is wrong, change edge weights in one way; otherwise, change them the opposite way (details omitted; see the simplified sketch below)
  • repeat
  • the more iterations we do, the more the NN learns
    our known data
  • with enough confidence, can apply NN to unknown
    customer data to learn whether to market toward
    them
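The slide omits the weight-update details. As a rough stand-in, here is the simpler perceptron-style update for a single threshold unit; it captures the same plug-in/check-answer/nudge-weights/repeat loop, though true backpropagation also pushes corrections back through the hidden layer:

public class TrainLoop {
    // Simplified stand-in for backpropagation: one threshold unit only.
    static void train(double[][] X, int[] label, double[] w, double rate, int iters) {
        for (int t = 0; t < iters; t++) {            // repeat
            for (int n = 0; n < X.length; n++) {     // each known problem instance
                double sum = 0;
                for (int i = 0; i < w.length; i++) sum += w[i] * X[n][i];
                int answer = sum > 0 ? 1 : 0;        // plug in and look at the answer
                int error = label[n] - answer;       // wrong? nudge the weights...
                for (int i = 0; i < w.length; i++)
                    w[i] += rate * error * X[n][i];  // ...one way, or the opposite way
            }
        }
    }

    public static void main(String[] args) {
        double[][] X = {{1, 0, 1}, {0, 1, 0}};       // two known customers
        int[] label = {1, 0};                        // known right answers
        double[] w = {0.1, -0.1, 0.05};              // initially (near-)random weights
        train(X, label, w, 0.1, 100);
        System.out.println(java.util.Arrays.toString(w));
    }
}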

12
LBS example
  • Investments
  • goal: maximize return on investments
  • buy/sell the right securities at the right time
  • lots of time-series data for different properties of different stocks
  • returns, market signals
  • pick the right ones
  • react
  • soln: create an NN for each stock
  • retrain weekly

13
Decision Trees
  • One technology: decision trees
  • Yet another use of trees!
  • Each node: one attribute
  • Its children: the possible values of that attribute
  • E.g., given votes on issues, distinguish Dems v. GOP
  • Each path from root to leaf is one rule
  • Like 20 Questions/BSTs/B-trees; what's the important difference?
  • Learn rules from the data
  • Given the values of this and this and that, this value will be something
  • Customer details → potential good customer; financial details → good credit risk, etc.

14
Decision Trees
  • Details
  • for a binary property, two out-edges, but there may be more
  • for a continuous property (income), divide the values into discrete ranges
  • a property may appear more than once
  • Example: top node: history of bankruptcy?
  • if yes, REJECT
  • if no, then: employed?
  • If no, (maybe look for a high monthly housing payment)
  • If yes, ... (see the sketch below)
  • Algorithms: ID3, C4.5, CART
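The example path above, written out as nested conditionals; each root-to-leaf path is one rule. The ACCEPT/REJECT outcomes and the $2000 housing cutoff are illustrative assumptions:

public class CreditTree {
    static String decide(boolean bankruptcy, boolean employed, double monthlyHousing) {
        if (bankruptcy) return "REJECT";            // top node: history of bankruptcy?
        if (!employed)                              // if no: employed?
            return monthlyHousing > 2000 ? "REJECT" : "ACCEPT"; // high housing payment?
        return "ACCEPT";
    }

    public static void main(String[] args) {
        System.out.println(decide(false, true, 1200.0)); // ACCEPT
    }
}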

15
DM Functionality
  • Association
  • study frequencies of items occurring together
  • buys(x, cereal) → buys(x, milk) (see the sketch below)
  • Mail-order tulip bulbs → Conservative party voters
  • intuition
  • Techniques
  • Decision trees
  • GAs (Glover)
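The slide doesn't name a measure for "items occurring together"; a standard one is the confidence of a rule, sketched here on made-up baskets:

import java.util.*;

public class RuleConfidence {
    public static void main(String[] args) {
        // Confidence of buys(x, cereal) -> buys(x, milk):
        // among baskets containing cereal, what fraction also contain milk?
        List<Set<String>> baskets = List.of(
            Set.of("cereal", "milk"), Set.of("cereal", "milk", "bread"),
            Set.of("cereal"), Set.of("milk", "bread"));
        long both = baskets.stream()
            .filter(b -> b.contains("cereal") && b.contains("milk")).count();
        long cereal = baskets.stream().filter(b -> b.contains("cereal")).count();
        System.out.println("confidence = " + (double) both / cereal); // 2/3 here
    }
}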

16
Mining v. warehousing
  • Warehousing: let the user search and group by interesting properties
  • Give me the sales of A4s by year and dealer, for these colors
  • The user tries to learn from the results which properties are important/interesting
  • Mining: tell the user what the interesting properties are
  • What's driving sales?

17
Social/political concerns
  • Privacy
  • TIA (Total Information Awareness)
  • Sensitive data
  • Allow mining but not queries
  • Opt-in/opt-out
  • "Don't be evil."

18
For more info
  • See Dhar & Stein, Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
  • Drawn on above
  • A few years old, but very accessible
  • Data mining courses offered here

19
Next topic: Web search
  • Web search is not running queries on the web
  • Info/connections downloaded from web
  • Crawl paths from tree rooted at homepage
  • Special database is formed
  • On request, we run query on DB
  • Draw picture
  • Issues
  • DNS bottleneck
  • search strategy
  • refresh strategy
  • robot exclusion protocol
  • bad HTML, non-responsive servers, etc.

20
Crawling issues
  • Content-seen test
  • compute fingerprint/hash of page content
  • also URL-seen test
  • DNS bottleneck
  • to view a page via a text link, must first get its address
  • BP claim: 87% of crawling time is DNS look-up
  • Referring to webpages
  • use DocIDs, not addresses
  • more popular pages get shorter DocIDs (why?)
  • many commodity machines → frequent crashes
  • Logging, checkpointing

21
Crawling
  • Comparatively straightforward
  • although Google uses PageRank for crawling, too
  • First, must get pages
  • Spiders
  • Prof. Davis (NYU/CS): http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java
  • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java
  • Run the program:
  • sales> java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200
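The linked WebCrawler.java isn't reproduced here, but a minimal crawler in the same spirit might look like the sketch below: breadth-first, with the URL-seen test from the previous slide. A real crawler also needs robot exclusion, politeness delays, DNS caching, and the content-seen (fingerprint) test:

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    public static void main(String[] args) throws Exception {
        String start = args[0];                   // e.g. java MiniCrawler <url> 200
        int limit = Integer.parseInt(args[1]);
        Set<String> seen = new HashSet<>();       // URL-seen test
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        while (!frontier.isEmpty() && seen.size() < limit) {
            String url = frontier.poll();
            if (!seen.add(url)) continue;         // skip already-visited URLs
            System.out.println(url);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) page.append(line);
                Matcher m = href.matcher(page);   // extract outgoing links
                while (m.find()) frontier.add(m.group(1));
            } catch (IOException e) { /* bad HTML, non-responsive servers, etc. */ }
        }
    }
}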

22
Inverted indices
  • Basic idea of finding pages
  • Create inverted index mapping words to pages
  • First, think of each webpage as a tuple
  • One column for every possible word
  • True means the word appears on the page
  • Index on all columns
  • Now can search "youre fired"
  • → select * from T where youre=true and fired=true

23
Inverted indices
  • Can simplify somewhat
  • For each field index, delete False entries
  • True entries for each index become a bucket
  • Create inverted index
  • One entry for each search word
  • Search word entry points to corresponding bucket
  • Bucket points to pages with its word
  • Amazon
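A toy sketch of the bucket idea, reusing the "youre fired" query from the previous slide (a real engine keeps the buckets on disk, with the per-occurrence metadata described on the next slide):

import java.util.*;

public class InvertedIndex {
    public static void main(String[] args) {
        // Map each word to the bucket of pages containing it.
        String[] pages = { "youre hired", "youre fired", "fired up" };
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int doc = 0; doc < pages.length; doc++)
            for (String word : pages[doc].split("\\s+"))
                index.computeIfAbsent(word, w -> new HashSet<>()).add(doc);

        // Query "youre fired": intersect the two words' buckets.
        Set<Integer> hits = new HashSet<>(index.get("youre"));
        hits.retainAll(index.get("fired"));
        System.out.println(hits); // [1]
    }
}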

24
Inverted Indices
  • What's stored?
  • For each word W, for each doc D:
  • relevance of D to W
  • # of occurrences of W in D
  • meta-data/context: bold, font size, title, etc.
  • In addition to page importance, keep in mind
  • this info is used to determine relevance of
    particular words appearing on the page

25
Google-like infrastructure
  • Very large distributed system
  • 100MB, GB files → Google File System
  • Block size: 64MB (not KB)!
  • hundreds or thousands of low-quality Linux servers
  • → system failures are the rule, not the exception
  • Divide the index up by words into many barrels
  • lexicon maps each word ID to its barrel
  • also do RAID-style replication → 2-D matrix of servers
  • Draw picture

26
Google-like infrastructure
  • Respond to query Q(w1, ..., wn)
  • for each wi, send to the barrel column for wi
  • pick a random server in that column
  • for each wi in parallel, step through until a doc containing all the words is found; add it to the set of results
  • index ordered on (wordID, docID)
  • Linear time (see the merge sketch below)
  • return sorted results
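A sketch of that linear-time step for two words, assuming each word's posting list is already sorted by docID:

import java.util.*;

public class PostingMerge {
    // Step both lists forward; collect docIDs present in both.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;   // advance whichever list is behind
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(intersect(new int[]{1, 3, 7, 9}, new int[]{3, 4, 9})); // [3, 9]
    }
}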

27
Sorting Results
  • How to respond to Q(w1, w2, ..., wn)?
  • Search the index for pages with w1, w2, ..., wn
  • Return them in sorted order (how?)
  • Soln 1: current order
  • Returns 100,000 (mostly) useless results
  • Soln 2: methods from Information Retrieval theory
  • library science + CS

28
Information Retrieval Theory
  • Standard methods from IR theory
  • Clustering algorithms
  • simplest: for each word W in a doc D, compute
  • (# occurrences of W in D) / (total word occurrences in D)
  • → each document becomes a point in a space
  • one dimension for every possible word
  • value in that dimension is the ratio above, maybe with logs (see the sketch below)
  • Choose pages with high values for the sought words
  • Variants work well for small sets of docs
  • But the web has billions
  • Problem: very short pages containing the query words are privileged
  • query "bill clinton" → "Bill Clinton Sucks" page (BP)
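A sketch of the ratio computation for one document (real systems typically damp the counts with logs, as noted above):

import java.util.*;

public class TermFrequency {
    // For each word W in doc D: (# occurrences of W in D) / (total occurrences in D).
    // The resulting map is the document's point in word-space.
    static Map<String, Double> tfVector(String doc) {
        String[] words = doc.toLowerCase().split("\\s+");
        Map<String, Double> tf = new HashMap<>();
        for (String w : words) tf.merge(w, 1.0, Double::sum);
        tf.replaceAll((w, count) -> count / words.length);
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(tfVector("bill clinton bill")); // {bill=0.67, clinton=0.33}
    }
}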

29
Sorting Results
  • Soln 3: sort by quality
  • What do you mean by quality?
  • hire readers to rate your webpage (early Yahoo)
  • problem: more webpages than Yahoo employees
  • Soln 4: citations (links) = peer review
  • idea: the rest of the web has already voted on the quality of your webpage
  • 1 link to your page = 1 vote
  • similar to counting academic citations

30
Sorting Results
  • Soln 5: Google's PageRank
  • count citations, but not as an equally weighted sum
  • motiv: we've already decided some pages are better
  • → their votes count for more
  • two cases at the ends of the continuum:
  • many pages link to you
  • Google.com links to you
  • More precisely: for page P, each of the pages Li links to P, and each Li links to N(Li) pages in total
  • PR(P) = SUM(PR(Li)/N(Li))
  • Motiv: each page votes with its quality
  • its quality is divided among the pages it votes for

31
Understanding PageRank
  • Analogy 1: Friendster
  • someone lets you in
  • someone else let that person in, etc.
  • Analogy 2: PKE certificates
  • my cert is authenticated by your cert
  • your cert is endorsed by someone else's
  • Both cases: we eventually reach a foundation
  • Analogy 3: job/school recommendations
  • three people recommend you
  • why should we believe them?
  • three other people recommended them, etc.
  • eventually, we trust

32
Understanding PageRank
  • Analogy 4: Random Surfer Model
  • Idealized web surfer:
  • start at some page
  • at each page, pick a random link
  • Turns out, after a long time surfing:
  • Pr(we're at some page P right now) = PR(P)
  • PRs are normalized

33
Computing PageRank
  • For each page P, we want
  • PR(P) = SUM(PR(Li)/N(Li))
  • But it's circular; how to compute?
  • Meth 1: for n pages, we've got n linear eqs and n vars
  • can solve for all PR(P)s, but too hard
  • see your mathematical programming course
  • Meth 2: iteratively (see the sketch below)
  • start with PR(P) set to E for each P
  • iterate until no more significant change
  • PB report O(50) iterations for O(300M) links, O(30M) pages
  • iterations required grows only with the log of the web size
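The PageRank.java demos referenced below aren't reproduced here; a sketch of Meth 2 might look like this, run on Ullman's three-page example from the next slide (E = 1 is an illustrative choice):

import java.util.*;

public class PageRankSketch {
    public static void main(String[] args) {
        // links[p] = pages that page p points to (A->Y,M; Y->Y,A; M-> nowhere).
        int[][] links = { {1, 2}, {1, 0}, {} };
        double[] pr = { 1, 1, 1 };                 // start each PR at E = 1
        for (int iter = 0; iter < 50; iter++) {    // ~50 iterations suffice (PB)
            double[] next = new double[pr.length];
            for (int p = 0; p < links.length; p++)
                for (int q : links[p])
                    next[q] += pr[p] / links[p].length; // p's quality split among out-links
            pr = next;
        }
        System.out.println(Arrays.toString(pr));   // dissipates toward (0,0,0) here
    }
}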

34
Problems with PageRank
  • Example (Ullman):
  • A points to Y, M
  • Y points to self, A
  • M points nowhere
  • Start A, Y, M at 1
  • (1,1,1) → ... → (0,0,0)
  • The rank dissipates
  • stern> java PageRank
  • Soln: add a self-link to any dead end

35
Problems with PageRank
  • Example (Ullman):
  • A points to Y, M
  • Y points to self, A
  • M points to self
  • Start A, Y, M at 1
  • (1,1,1) → ... → (0,0,3)
  • M becomes a rank sink
  • RSM interpretation: eventually we end up at M and get stuck
  • stern> java PageRank2
  • Soln: add inherent quality E to each page

36
Modified PageRank
  • Apart from inherited quality, each page also has inherent quality E
  • PR(P) = E + SUM(PR(Li)/N(Li))
  • More precisely, take a weighted sum of the two terms:
  • PR(P) = 0.15*E + 0.85*SUM(PR(Li)/N(Li)) (see the variant sketch below)
  • stern> java PageRank2
  • Leads to a new random surfer model
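Only the update step of the earlier PageRankSketch changes; a hedged variant (E = 1 again an illustrative choice):

public class DampedPageRank {
    // PR(P) = 0.15*E + 0.85*SUM(PR(Li)/N(Li)); every page keeps at least 0.15*E,
    // so rank can neither dissipate at dead ends nor all drain into a sink.
    static double[] pageRank(int[][] links, int iters) {
        final double E = 1.0;
        double[] pr = new double[links.length];
        java.util.Arrays.fill(pr, E);
        for (int iter = 0; iter < iters; iter++) {
            double[] next = new double[pr.length];
            for (int p = 0; p < links.length; p++)
                for (int q : links[p])
                    next[q] += pr[p] / links[p].length;
            for (int q = 0; q < pr.length; q++)
                pr[q] = 0.15 * E + 0.85 * next[q];  // weighted sum of the two terms
        }
        return pr;
    }
}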

37
Random Surfer Model
  • Motiv: if we end up at M, we don't really stay there forever
  • We type in a new URL
  • Idealized web surfer:
  • start at some page
  • at each page, pick a random link
  • but occasionally, we get bored and jump to a random page
  • Turns out, after a long time surfing:
  • Pr(we're at some page P right now) = PR(P)

38
Understanding PageRank
  • One more interpretation: the hydraulic model
  • picture the graph of the web
  • imagine each link as a tube between two nodes, and quality as fluid
  • each node is a reservoir of E amount of fluid
  • Now let it flow
  • Steady state: each node P holds PR(P) amount of fluid
  • PR(P) eventually settles in node P

39
Understanding PageRank
  • Sornette, Why Stock Markets Crash
  • Si(t+1) = sign(ei + SUM(Sj(t)))
  • a trader buys/sells based on
  • his inclination and
  • what his associates are saying
  • the direction of a magnet is determined by
  • its old direction and
  • the directions of its neighbors
  • the activation of a neuron is determined by
  • its own properties and
  • the activation of neighbors connected by synapses
  • PR of P is based on
  • its inherent value and
  • the PR of its in-links

40
Non-uniform Es
  • So far, assumed E was constant for all pages
  • Can let E = E(P)
  • vary by page
  • How do we choose E(P)?
  • Choice 1: set high for pages we already trust
  • Choice 2: set high for pages I like
  • PB paper gave high E to John McCarthy's homepage
  • → pages he links to get high PR, etc.
  • Result: Google personalized for him
  • Q: How would google.com get your prefs?

41
Tricking search engines
  • Search Engine Optimization
  • Old search engines that don't sort results, or sort them badly:
  • include on your page (maybe hidden) lots of words you think people will query on
  • This doesn't work for Google, because the pages doing this probably aren't linked to
  • but...

42
Tricking search engines
  • I can try to make my page look popular to Google
  • create a page with 1000 links to my page
  • does this work?
  • Create 1000 other pages linking to it
  • PR2: put a limit on the weight one domain can give to itself
  • Trick 2: buy another domain and put the 1000 pages there
  • PR3: put a limit on the weight from any single domain

43
Google's good ideas
  • Google had two good ideas
  • Google's big idea: PageRank
  • Google's little idea: use anchor text
  • Motiv: pages may not give the best descriptions of themselves
  • most search engines don't contain the words "search engine"
  • PB claim: only 1 of the top 4 search engines could find itself on the query "search engine"
  • Anchor text also describes the page
  • many pages link to Google
  • many of them likely say "search engine" in/near the link
  • → Treat anchor-text words as part of the page
  • This is why a search for "US West" found Qwest!

44
Tricking search engines
  • This provides a new way to trick the search
    engine
  • Use of anchor text is a big part of Google's
    result quality
  • but it provides a way to trick Google via other people's pages
  • Google Bombs
  • put up lots of pages linking to my page, using some particular phrase in the anchor text
  • result: a search for the words you chose gives my page
  • Examples: "talentless hack", "miserable failure", "weapons of mass destruction", last name of PA's junior senator

45
For more info
  • See sources drawn upon here
  • Prof. Davis (NYU/CS) search engines course
  • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
  • Original research papers by Page & Brin:
  • "The PageRank Citation Ranking: Bringing Order to the Web"
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
  • very accessible
  • Google Labs: http://labs.google.com

46
Future
  • Final Exam: Thursday, 5/6, 10-11:50am
  • Final exam info is up
  • Interest in a review session?
  • Please fill out evals!
  • Thanks!