C20.0046: Database Management Systems Lecture
1
C20.0046 Database Management Systems, Lecture 28
  • Matthew P. Johnson
  • Stern School of Business, NYU
  • Spring, 2004

2
Agenda
  • Previously: failure/recovery
  • Next:
  • Finish data mining
  • Web search
  • Proj5 due today
  • no extensions!
  • Final Exam: Thursday, 5/6, 10-11:50am
  • 1-minute responses

3
Method: Genetic Algorithms
  • for optimization problems
  • each possible solution x has some fitness f(x)
  • space of possible solutions → a surface
  • goal: find an x at the top of a hill on the surface
  • example: TSP (traveling salesman problem)
  • with limitations
  • must go here before some point
  • cannot go from here to there
  • 25 cities → 24! possible tours
  • billions of years to try them all, under reasonable assumptions
  • GAs are one method for hill climbing
  • Avoid local maxima

4
GA motivation
  • Analogy from genetics
  • Create, at random, a population of possible solutions
  • Most are probably very bad, but...
  • Compute fitness f(x) for each member x
  • Create the next generation by picking relatively good pairs
  • Merge half of the info from each to create a new one
  • repeat (see the sketch below)
  • Examples: Glover, Dawkins, etc.
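A minimal sketch of this loop, assuming a bit-string encoding and a toy count-the-1-bits fitness function (both illustrative stand-ins for a real f(x), not from the lecture):

import java.util.*;

public class SimpleGA {
    static final int POP = 100, LEN = 20, GENS = 200;
    static final Random rnd = new Random();

    // Placeholder fitness: count of 1-bits (stands in for a real f(x)).
    static int fitness(boolean[] x) {
        int f = 0;
        for (boolean b : x) if (b) f++;
        return f;
    }

    // Pick a relatively good member: compare two random ones, keep the fitter.
    static boolean[] tournament(boolean[][] pop) {
        boolean[] a = pop[rnd.nextInt(POP)], b = pop[rnd.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static void main(String[] args) {
        // Create, at random, a population of possible solutions.
        boolean[][] pop = new boolean[POP][LEN];
        for (boolean[] x : pop)
            for (int i = 0; i < LEN; i++) x[i] = rnd.nextBoolean();

        for (int g = 0; g < GENS; g++) {           // repeat
            boolean[][] next = new boolean[POP][];
            for (int k = 0; k < POP; k++) {
                // Pick a relatively good pair...
                boolean[] p1 = tournament(pop), p2 = tournament(pop);
                // ...and merge half the info from each (one-point crossover).
                boolean[] child = new boolean[LEN];
                int cut = rnd.nextInt(LEN);
                for (int i = 0; i < LEN; i++) child[i] = i < cut ? p1[i] : p2[i];
                // Occasional mutation helps avoid local maxima.
                if (rnd.nextInt(10) == 0) child[rnd.nextInt(LEN)] ^= true;
                next[k] = child;
            }
            pop = next;
        }
        int best = 0;
        for (boolean[] x : pop) best = Math.max(best, fitness(x));
        System.out.println("best fitness: " + best);
    }
}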

5
Moody's example (DS)
  • Financial services
  • needed helpdesk dispatching system
  • send right person for each IT problem
  • experienced technicians to solve hard problems
  • minimize loss of value due to computer downtime
  • minimize individual wait time
  • tell users when their issues would be fixed
  • schedules did not have to be perfect, just good
  • job-shop scheduling

6
Moodys example
  • SOGA: Scheduling Optimizing GA
  • Fitness function: fitness inverse to total downtime and to max individual wait time
  • on the live set of current jobs!
  • includes time of request, type of user, type of problem, etc.
  • updates schedules every 10-15 minutes, based on
    new inputs
  • not-yet-started jobs can be reassigned

7
Neural Networks
  • Also do hill climbing, but completely different
    idea
  • based on connections between neurons in brain
  • but used for supervised learning
  • simple NN
  • input layer with 3 nodes
  • hidden layer with 2 nodes
  • output layer with 1 node
  • each node points to each node in next level
  • each node has some activation level and something like a critical mass (a threshold)
  • Draw picture
  • What kind of graph is this?

8
Neural Networks
  • values passed into input nodes represent the
    problem instance
  • given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
  • values output by the hidden nodes are sent to the output node
  • if the weighted sum going into the output node is high enough, it outputs 1, otherwise 0 (see the sketch below)
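A sketch of the 3-2-1 threshold network just described. The weights and thresholds here are made-up placeholders; as the next slides explain, a real network learns them:

public class TinyNN {
    // 2 hidden nodes, each fed by all 3 inputs (placeholder weights).
    static double[][] wHidden = {{0.5, -0.2, 0.8}, {-0.3, 0.9, 0.1}};
    static double[] thHidden = {0.4, 0.2};
    // 1 output node fed by both hidden nodes.
    static double[] wOut = {0.7, 0.6};
    static double thOut = 0.5;

    // A node sends a pulse (1) only if its weighted input sum exceeds its threshold.
    static int fire(double sum, double threshold) { return sum > threshold ? 1 : 0; }

    static int classify(double[] input) {
        int[] hidden = new int[2];
        for (int h = 0; h < 2; h++) {
            double sum = 0;
            for (int i = 0; i < 3; i++) sum += wHidden[h][i] * input[i];
            hidden[h] = fire(sum, thHidden[h]);
        }
        double sum = 0;
        for (int h = 0; h < 2; h++) sum += wOut[h] * hidden[h];
        return fire(sum, thOut); // e.g. 1 = "market to this customer"
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[]{1, 0, 1}));
    }
}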

9
NN applications
  • plausible application: we have data about potential customers
  • party registration
  • married or not
  • gender
  • income level
  • or have credit applicant information
  • employment
  • income
  • home ownership
  • bankruptcy
  • should we give a credit card to him?

10
How NNs work
  • hope: plug in a customer → out comes whether we should market toward him
  • How does it get the right answer?
  • Initially, all weights are random!
  • But we assume we have data for lots of people
    which we know to be either interested in our
    products or not
  • we have data for both kinds
  • so, when we plug in one of these customers, we
    know what the right answer is supposed to be

11
How NNs work
  • can use the backpropagation algorithm
  • for each known problem instance, plug in and look at the answer
  • if the answer is wrong, change edge weights in one way; otherwise, change them the opposite way (details omitted; see the simplified sketch below)
  • repeat
  • the more iterations we do, the more the NN learns
    our known data
  • with enough confidence, can apply NN to unknown
    customer data to learn whether to market toward
    them
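The slide omits the weight-update details. As a rough stand-in, here is the simpler perceptron-style update for a single threshold unit; it captures the same plug-in/check-answer/nudge-weights/repeat loop, though true backpropagation also pushes corrections back through the hidden layer:

public class TrainLoop {
    // Simplified stand-in for backpropagation: one threshold unit only.
    static void train(double[][] X, int[] label, double[] w, double rate, int iters) {
        for (int t = 0; t < iters; t++) {            // repeat
            for (int n = 0; n < X.length; n++) {     // each known problem instance
                double sum = 0;
                for (int i = 0; i < w.length; i++) sum += w[i] * X[n][i];
                int answer = sum > 0 ? 1 : 0;        // plug in and look at the answer
                int error = label[n] - answer;       // wrong? nudge the weights...
                for (int i = 0; i < w.length; i++)
                    w[i] += rate * error * X[n][i];  // ...one way, or the opposite way
            }
        }
    }

    public static void main(String[] args) {
        double[][] X = {{1, 0, 1}, {0, 1, 0}};       // two known customers
        int[] label = {1, 0};                        // known right answers
        double[] w = {0.1, -0.1, 0.05};              // initially (near-)random weights
        train(X, label, w, 0.1, 100);
        System.out.println(java.util.Arrays.toString(w));
    }
}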

12
LBS example
  • Investments
  • goal: maximize return on investments
  • buy/sell the right securities at the right time
  • lots of time-series data for different properties of different stocks
  • returns, market signals
  • pick the right ones
  • react
  • soln: create an NN for each stock
  • retrain weekly

13
Decision Trees
  • One technology: decision trees
  • Yet another use of trees!
  • Each node: one attribute
  • Its children: the possible values of that attribute
  • E.g., given votes on issues, distinguish Dems v. GOP
  • Each path from root to leaf is one rule
  • Like 20 Questions/BSTs/B-trees; what's the important difference?
  • Learn rules from the data
  • Given the values of this and this and that, this value will be something
  • Customer details → potential good customer; financial details → good credit risk, etc.

14
Decision Trees
  • Details
  • for a binary property, two out-edges, but there may be more
  • for a continuous property (income), divide the values into discrete ranges
  • a property may appear more than once
  • Example: top node: history of bankruptcy?
  • if yes, REJECT
  • if no, then: employed?
  • If no, (maybe look for a high monthly housing payment)
  • If yes, ... (see the sketch below)
  • Algorithms: ID3, C4.5, CART
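The example path above, written out as nested conditionals; each root-to-leaf path is one rule. The ACCEPT/REJECT outcomes and the $2000 housing cutoff are illustrative assumptions:

public class CreditTree {
    static String decide(boolean bankruptcy, boolean employed, double monthlyHousing) {
        if (bankruptcy) return "REJECT";            // top node: history of bankruptcy?
        if (!employed)                              // if no: employed?
            return monthlyHousing > 2000 ? "REJECT" : "ACCEPT"; // high housing payment?
        return "ACCEPT";
    }

    public static void main(String[] args) {
        System.out.println(decide(false, true, 1200.0)); // ACCEPT
    }
}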

15
DM Functionality
  • Association
  • study frequencies of items occurring together
  • buys(x, cereal) → buys(x, milk) (see the sketch below)
  • Mail-order tulip bulbs → Conservative party voters
  • intuition
  • Techniques
  • Decision trees
  • GAs (Glover)
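The slide doesn't name a measure for "items occurring together"; a standard one is the confidence of a rule, sketched here on made-up baskets:

import java.util.*;

public class RuleConfidence {
    public static void main(String[] args) {
        // Confidence of buys(x, cereal) -> buys(x, milk):
        // among baskets containing cereal, what fraction also contain milk?
        List<Set<String>> baskets = List.of(
            Set.of("cereal", "milk"), Set.of("cereal", "milk", "bread"),
            Set.of("cereal"), Set.of("milk", "bread"));
        long both = baskets.stream()
            .filter(b -> b.contains("cereal") && b.contains("milk")).count();
        long cereal = baskets.stream().filter(b -> b.contains("cereal")).count();
        System.out.println("confidence = " + (double) both / cereal); // 2/3 here
    }
}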

16
Mining v. warehousing
  • Warehousing: let the user search and group by interesting properties
  • Give me the sales of A4s by year and dealer, for these colors
  • The user tries to learn from the results which properties are important/interesting
  • Mining: tell the user what the interesting properties are
  • What's driving sales?

17
Social/political concerns
  • Privacy
  • TIA (Total Information Awareness)
  • Sensitive data
  • Allow mining but not queries
  • Opt-in/opt-out
  • "Don't be evil."

18
For more info
  • See Dhar & Stein, Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
  • Drawn on above
  • A few years old, but very accessible
  • Data mining courses offered here

19
Next topic: Web search
  • Web search is not running queries on the web
  • Info/connections downloaded from web
  • Crawl paths from tree rooted at homepage
  • Special database is formed
  • On request, we run query on DB
  • Draw picture
  • Issues
  • DNS bottleneck
  • search strategy
  • refresh strategy
  • robot exclusion protocol
  • bad HTML, non-responsive servers, etc.

20
Crawling issues
  • Content-seen test
  • compute fingerprint/hash of page content
  • also URL-seen test
  • DNS bottleneck
  • to view a page via a text link, must first get its address
  • BP claim: 87% of crawling time is DNS look-up
  • Referring to webpages
  • use DocIDs, not addresses
  • more popular pages get shorter DocIDs (why?)
  • many commodity machines → frequent crashes
  • Logging, checkpointing

21
Crawling
  • Comparatively straightforward
  • although Google uses PageRank for crawling, too
  • First, must get pages
  • Spiders
  • Prof. Davis (NYU/CS): http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java
  • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java
  • Run the program:
  • sales> java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200
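The linked WebCrawler.java isn't reproduced here, but a minimal crawler in the same spirit might look like the sketch below: breadth-first, with the URL-seen test from the previous slide. A real crawler also needs robot exclusion, politeness delays, DNS caching, and the content-seen (fingerprint) test:

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    public static void main(String[] args) throws Exception {
        String start = args[0];                   // e.g. java MiniCrawler <url> 200
        int limit = Integer.parseInt(args[1]);
        Set<String> seen = new HashSet<>();       // URL-seen test
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        while (!frontier.isEmpty() && seen.size() < limit) {
            String url = frontier.poll();
            if (!seen.add(url)) continue;         // skip already-visited URLs
            System.out.println(url);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) page.append(line);
                Matcher m = href.matcher(page);   // extract outgoing links
                while (m.find()) frontier.add(m.group(1));
            } catch (IOException e) { /* bad HTML, non-responsive servers, etc. */ }
        }
    }
}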

22
Inverted indices
  • Basic idea of finding pages
  • Create inverted index mapping words to pages
  • First, think of each webpage as a tuple
  • One column for every possible word
  • True means the word appears on the page
  • Index on all columns
  • Now can search "youre fired"
  • → select * from T where youre=true and fired=true

23
Inverted indices
  • Can simplify somewhat
  • For each field index, delete False entries
  • True entries for each index become a bucket
  • Create inverted index
  • One entry for each search word
  • Search word entry points to corresponding bucket
  • Bucket points to pages with its word
  • Amazon
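A toy sketch of the bucket idea, reusing the "youre fired" query from the previous slide (a real engine keeps the buckets on disk, with the per-occurrence metadata described on the next slide):

import java.util.*;

public class InvertedIndex {
    public static void main(String[] args) {
        // Map each word to the bucket of pages containing it.
        String[] pages = { "youre hired", "youre fired", "fired up" };
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int doc = 0; doc < pages.length; doc++)
            for (String word : pages[doc].split("\\s+"))
                index.computeIfAbsent(word, w -> new HashSet<>()).add(doc);

        // Query "youre fired": intersect the two words' buckets.
        Set<Integer> hits = new HashSet<>(index.get("youre"));
        hits.retainAll(index.get("fired"));
        System.out.println(hits); // [1]
    }
}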

24
Inverted Indices
  • What's stored?
  • For each word W, for each doc D:
  • relevance of D to W
  • # of occurrences of W in D
  • meta-data/context: bold, font size, title, etc.
  • In addition to page importance, keep in mind
  • this info is used to determine relevance of
    particular words appearing on the page

25
Google-like infrastructure
  • Very large distributed system
  • 100MB, GB files → Google File System
  • Block size: 64MB (not KB)!
  • hundreds or thousands of low-quality Linux servers
  • → system failures are the rule, not the exception
  • Divide the index up by words into many barrels
  • lexicon maps each word ID to its barrel
  • also do RAID-style replication → 2-D matrix of servers
  • Draw picture

26
Google-like infrastructure
  • Respond to query Q(w1, ..., wn)
  • for each wi, send to the barrel column for wi
  • pick a random server in that column
  • for each wi in parallel, step through until a doc containing all the words is found; add it to the set of results
  • index ordered on (wordID, docID)
  • Linear time (see the merge sketch below)
  • return sorted results
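A sketch of that linear-time step for two words, assuming each word's posting list is already sorted by docID:

import java.util.*;

public class PostingMerge {
    // Step both lists forward; collect docIDs present in both.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;   // advance whichever list is behind
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(intersect(new int[]{1, 3, 7, 9}, new int[]{3, 4, 9})); // [3, 9]
    }
}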

27
Sorting Results
  • How to respond to Q(w1, w2, ..., wn)?
  • Search the index for pages with w1, w2, ..., wn
  • Return them in sorted order (how?)
  • Soln 1: current order
  • Returns 100,000 (mostly) useless results
  • Soln 2: methods from Information Retrieval theory
  • library science + CS

28
Information Retrieval Theory
  • Standard methods from IR theory
  • Clustering algorithms
  • simplest: for each word W in a doc D, compute
  • (# occurrences of W in D) / (total word occurrences in D)
  • → each document becomes a point in a space
  • one dimension for every possible word
  • value in that dimension is the ratio above, maybe with logs (see the sketch below)
  • Choose pages with high values for the sought words
  • Variants work well for small sets of docs
  • But the web has billions
  • Problem: very short pages containing the query words are privileged
  • query "bill clinton" → "Bill Clinton Sucks" page (BP)
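A sketch of the ratio computation for one document (real systems typically damp the counts with logs, as noted above):

import java.util.*;

public class TermFrequency {
    // For each word W in doc D: (# occurrences of W in D) / (total occurrences in D).
    // The resulting map is the document's point in word-space.
    static Map<String, Double> tfVector(String doc) {
        String[] words = doc.toLowerCase().split("\\s+");
        Map<String, Double> tf = new HashMap<>();
        for (String w : words) tf.merge(w, 1.0, Double::sum);
        tf.replaceAll((w, count) -> count / words.length);
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(tfVector("bill clinton bill")); // {bill=0.67, clinton=0.33}
    }
}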

29
Sorting Results
  • Soln 3: sort by quality
  • What do you mean by quality?
  • hire readers to rate your webpage (early Yahoo)
  • problem: more webpages than Yahoo employees
  • Soln 4: citations (links) = peer review
  • idea: the rest of the web has already voted on the quality of your webpage
  • 1 link to your page = 1 vote
  • similar to counting academic citations

30
Sorting Results
  • Soln 5: Google's PageRank
  • count citations, but not as an equally weighted sum
  • motiv: we've already decided some pages are better
  • → their votes count for more
  • two cases at the ends of the continuum:
  • many pages link to you
  • Google.com links to you
  • More precisely: for page P, each of the pages Li links to P, and each Li links to N(Li) pages in total
  • PR(P) = SUM(PR(Li)/N(Li))
  • Motiv: each page votes with its quality
  • its quality is divided among the pages it votes for

31
Understanding PageRank
  • Analogy 1: Friendster
  • someone lets you in
  • someone else let that person in, etc.
  • Analogy 2: PKE certificates
  • my cert is authenticated by your cert
  • your cert is endorsed by someone else's
  • Both cases: we eventually reach a foundation
  • Analogy 3: job/school recommendations
  • three people recommend you
  • why should we believe them?
  • three other people recommended them, etc.
  • eventually, we trust

32
Understanding PageRank
  • Analogy 4: Random Surfer Model
  • Idealized web surfer:
  • start at some page
  • at each page, pick a random link
  • Turns out, after a long time surfing:
  • Pr(we're at some page P right now) = PR(P)
  • PRs are normalized

33
Computing PageRank
  • For each page P, we want
  • PR(P) = SUM(PR(Li)/N(Li))
  • But it's circular; how to compute?
  • Meth 1: for n pages, we've got n linear eqs and n vars
  • can solve for all PR(P)s, but too hard
  • see your mathematical programming course
  • Meth 2: iteratively (see the sketch below)
  • start with PR(P) set to E for each P
  • iterate until no more significant change
  • PB report O(50) iterations for O(300M) links, O(30M) pages
  • iterations required grows only with the log of the web size
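The PageRank.java demos referenced below aren't reproduced here; a sketch of Meth 2 might look like this, run on Ullman's three-page example from the next slide (E = 1 is an illustrative choice):

import java.util.*;

public class PageRankSketch {
    public static void main(String[] args) {
        // links[p] = pages that page p points to (A->Y,M; Y->Y,A; M-> nowhere).
        int[][] links = { {1, 2}, {1, 0}, {} };
        double[] pr = { 1, 1, 1 };                 // start each PR at E = 1
        for (int iter = 0; iter < 50; iter++) {    // ~50 iterations suffice (PB)
            double[] next = new double[pr.length];
            for (int p = 0; p < links.length; p++)
                for (int q : links[p])
                    next[q] += pr[p] / links[p].length; // p's quality split among out-links
            pr = next;
        }
        System.out.println(Arrays.toString(pr));   // dissipates toward (0,0,0) here
    }
}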

34
Problems with PageRank
  • Example (Ullman):
  • A points to Y, M
  • Y points to self, A
  • M points nowhere
  • Start A, Y, M at 1
  • (1,1,1) → ... → (0,0,0)
  • The rank dissipates
  • stern> java PageRank
  • Soln: add a self-link to any dead end

35
Problems with PageRank
  • Example (Ullman):
  • A points to Y, M
  • Y points to self, A
  • M points to self
  • Start A, Y, M at 1
  • (1,1,1) → ... → (0,0,3)
  • M becomes a rank sink
  • RSM interpretation: eventually we end up at M and get stuck
  • stern> java PageRank2
  • Soln: add inherent quality E to each page

36
Modified PageRank
  • Apart from inherited quality, each page also has inherent quality E
  • PR(P) = E + SUM(PR(Li)/N(Li))
  • More precisely, take a weighted sum of the two terms:
  • PR(P) = 0.15*E + 0.85*SUM(PR(Li)/N(Li)) (see the variant sketch below)
  • stern> java PageRank2
  • Leads to a new random surfer model
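Only the update step of the earlier PageRankSketch changes; a hedged variant (E = 1 again an illustrative choice):

public class DampedPageRank {
    // PR(P) = 0.15*E + 0.85*SUM(PR(Li)/N(Li)); every page keeps at least 0.15*E,
    // so rank can neither dissipate at dead ends nor all drain into a sink.
    static double[] pageRank(int[][] links, int iters) {
        final double E = 1.0;
        double[] pr = new double[links.length];
        java.util.Arrays.fill(pr, E);
        for (int iter = 0; iter < iters; iter++) {
            double[] next = new double[pr.length];
            for (int p = 0; p < links.length; p++)
                for (int q : links[p])
                    next[q] += pr[p] / links[p].length;
            for (int q = 0; q < pr.length; q++)
                pr[q] = 0.15 * E + 0.85 * next[q];  // weighted sum of the two terms
        }
        return pr;
    }
}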

37
Random Surfer Model
  • Motiv: if we end up at M, we don't really stay there forever
  • We type in a new URL
  • Idealized web surfer:
  • start at some page
  • at each page, pick a random link
  • but occasionally, we get bored and jump to a random page
  • Turns out, after a long time surfing:
  • Pr(we're at some page P right now) = PR(P)

38
Understanding PageRank
  • One more interpretation: the hydraulic model
  • picture the graph of the web
  • imagine each link as a tube between two nodes, and quality as fluid
  • each node is a reservoir of E amount of fluid
  • Now let it flow
  • Steady state: each node P holds PR(P) amount of fluid
  • PR(P) eventually settles in node P

39
Understanding PageRank
  • Sornette, Why Stock Markets Crash
  • Si(t+1) = sign(ei + SUM(Sj(t)))
  • a trader buys/sells based on
  • his inclination and
  • what his associates are saying
  • the direction of a magnet is determined by
  • its old direction and
  • the directions of its neighbors
  • the activation of a neuron is determined by
  • its own properties and
  • the activation of neighbors connected by synapses
  • PR of P is based on
  • its inherent value and
  • the PR of its in-links

40
Non-uniform Es
  • So far, assumed E was constant for all pages
  • Can let E = E(P)
  • vary by page
  • How do we choose E(P)?
  • Choice 1: set high for pages we already trust
  • Choice 2: set high for pages I like
  • PB paper gave high E to John McCarthy's homepage
  • → pages he links to get high PR, etc.
  • Result: Google personalized for him
  • Q: How would google.com get your prefs?

41
Tricking search engines
  • Search Engine Optimization
  • Old search engines that don't sort results, or sort them badly:
  • include on your page (maybe hidden) lots of words you think people will query on
  • This doesn't work for Google, because the pages doing this probably aren't linked to
  • but...

42
Tricking search engines
  • I can try to make my page look popular to Google
  • create a page with 1000 links to my page
  • does this work?
  • Create 1000 other pages linking to it
  • PR2: put a limit on the weight one domain can give to itself
  • Trick 2: buy another domain and put the 1000 pages there
  • PR3: put a limit on the weight from any single domain

43
Google's good ideas
  • Google had two good ideas
  • Google's big idea: PageRank
  • Google's little idea: use anchor text
  • Motiv: pages may not give the best descriptions of themselves
  • most search engines don't contain the words "search engine"
  • PB claim: only 1 of the top 4 search engines could find itself on the query "search engine"
  • Anchor text also describes the page
  • many pages link to Google
  • many of them likely say "search engine" in/near the link
  • → Treat anchor-text words as part of the page
  • This is why a search for "US West" found Qwest!

44
Tricking search engines
  • This provides a new way to trick the search
    engine
  • Use of anchor text is a big part of Google's
    result quality
  • but it provides a way to trick Google via other people's pages
  • Google Bombs
  • put up lots of pages linking to my page, using some particular phrase in the anchor text
  • result: a search for the words you chose gives my page
  • Examples: "talentless hack", "miserable failure", "weapons of mass destruction", last name of PA's junior senator

45
For more info
  • See sources drawn upon here
  • Prof. Davis (NYU/CS) search engines course
  • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
  • Original research papers by Page & Brin:
  • "The PageRank Citation Ranking: Bringing Order to the Web"
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
  • very accessible
  • Google Labs: http://labs.google.com

46
Future
  • Final Exam: Thursday, 5/6, 10-11:50am
  • Final exam info is up
  • Interest in a review session?
  • Please fill out evals!
  • Thanks!