Title: Dealing with MASSIVE Data
1 Dealing with MASSIVE Data
Feifei Li (lifeifei_at_cs.fsu.edu), Dept. of Computer Science, FSU, Sep 9, 2008
2 Brief Bio
- B.A.S. in computer engineering from Nanyang Technological University in 2002
- Ph.D. in computer science from Boston University in 2007
- Research intern/visitor at AT&T Labs, IBM T. J. Watson Research Center, Microsoft Research
- Now Assistant Professor in the CS Department at FSU
3 Research Areas
Database Applications
indexing
query processing
spatial databases
Algorithms and Data structures
data streams
I/O-efficient algorithms
computational geometry
streaming algorithms
misc.
Geographic Information Systems
data security and privacy
Probabilistic Data
4 Massive Data
- Massive datasets are being collected everywhere
- Storage management software is a billion-dollar industry
- Examples (2002):
- Phone: AT&T 20TB phone call database, wireless tracking
- Consumer: WalMart 70TB database, buying patterns
- Web: crawl of 200M pages and 2000M links, Google's huge indexes
- Geography: NASA satellites generate 1.2TB per day
5 Example: LIDAR Terrain Data
- Massive (irregular) point sets (1-10m resolution)
- Becoming relatively cheap and easy to collect
- Appalachian Mountains: between 50GB and 5TB
- Exceeds memory limit and needs to be stored on disk
6 Example: Network Flow Data
- AT&T IP backbone generates 500 GB per day
- Gigascope: a data stream management system
- Compute certain statistics
- Can we do the computation without storing the data?
7 Traditional Random Access Machine (RAM) Model
- Standard theoretical model of computation
- Infinite memory (how nice!)
- Uniform access cost
- Simple model crucial for the success of the computer industry
8 How to Deal with MASSIVE Data?
- ...when there is not enough memory
9 Solution 1: Buy More Memory
- Expensive
- (Probably) not scalable
- Growth rate of data is higher than the growth of memory
10 Solution 2: Cheat! (by random sampling)
- Provides approximate solutions for some problems
- average, frequency of an element, etc.
- What if we want the exact result?
- Many problems can't be solved by sampling
- maximum, and all problems mentioned later
11 Solution 3: Use the Right Computation Model
- External Memory Model
- Streaming Model
- Probabilistic Model (brief)
12 Computation Model for Massive Data (1): External Memory Model
- Internal memory is limited but fast
- External memory is unlimited but slow
13 Memory Hierarchy
- Modern machines have a complicated memory hierarchy
- Levels get larger and slower further away from the CPU
- Block sizes and memory sizes differ between levels!
- There have been a few attempts to model the full hierarchy, but none successful: they are too complicated!
14 Slow I/O
- Disk access is 10^6 times slower than main memory access
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
- Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB)
- Important to store/access data so as to take advantage of blocks (locality)
15 Puzzle 1: Majority Counting
Stream: b a e c a d a a d a a e a b a a f a g b
- A huge file of characters stored on disk
- Question: Is there a character that appears > 50% of the time?
- Solution 1: sort + scan
- A few passes (O(log_{M/B} N)); we will come to it later
- Solution 2: divide-and-conquer
- Load a chunk into memory: N/M chunks
- Count them, return the majority
- The overall majority must be the majority in > 50% of the chunks
- Iterate until < M elements remain
- Very few passes (O(log_M N)), geometrically decreasing
- Solution 3: O(1) memory, 2 passes (answer to be posted later)
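Solution 3 is left as a teaser; a classic scheme that matches its bounds is the Boyer-Moore majority vote (my guess at the intended answer, not confirmed by the slides): pass one maintains a single candidate and counter, pass two verifies the candidate. A minimal sketch:

```python
def majority(stream):
    """Boyer-Moore majority vote: O(1) memory, two passes.
    Returns the element appearing > 50% of the time, or None.
    The stream must be iterable twice (e.g., a list or a file)."""
    # Pass 1: find the only possible candidate
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate really exceeds 50%
    n = occurrences = 0
    for x in stream:
        n += 1
        occurrences += (x == candidate)
    return candidate if 2 * occurrences > n else None
```

Note that on this slide's sample stream, 'a' appears in exactly 10 of 20 positions, so the strict > 50% test fails and the verification pass returns None; that second pass is what makes the answer exact.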
16 External Memory Model [AV88]
- N = # of items in the problem instance
- B = # of items per disk block
- M = # of items that fit in main memory
- I/O: move one block between memory and disk
- Performance measure: # of I/Os performed by the algorithm
- We assume (for convenience) that M > B^2
[Diagram: processor P and memory M, connected to disk D by block I/O]
17 Sorting in External Memory
- Break all N elements into N/M chunks of size M each
- Sort each chunk individually in memory
- Merge them together
- Can merge < M/B sorted lists (queues) at once, keeping M/B blocks in main memory
18 Sorting in External Memory
- Merge sort:
- Create N/M memory-sized sorted lists
- Repeatedly merge lists together, Theta(M/B) at a time
- => O(log_{M/B}(N/M)) phases using O(N/B) I/Os each => O((N/B) log_{M/B}(N/M)) I/Os in total
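The two phases above can be sketched in a few lines. Here memory_size stands in for M (items per in-memory run) and fan_in for the Theta(M/B) merge arity; both are toy parameters of my choosing, with temp files playing the role of disk:

```python
import heapq
import itertools
import tempfile

def external_sort(input_items, memory_size=4, fan_in=3):
    """External merge sort sketch: sort memory-sized runs, then
    repeatedly merge fan_in runs at a time until one run remains."""
    # Phase 1: create N/M sorted runs of at most memory_size items
    runs = []
    it = iter(input_items)
    while True:
        chunk = list(itertools.islice(it, memory_size))
        if not chunk:
            break
        chunk.sort()  # the chunk fits in "memory", so sort internally
        f = tempfile.TemporaryFile(mode="w+")
        f.write("\n".join(map(str, chunk)))
        f.seek(0)
        runs.append(f)
    if not runs:
        return []

    def read_ints(f):
        for line in f:
            yield int(line)

    # Phase 2: repeatedly merge fan_in runs at a time
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            out = tempfile.TemporaryFile(mode="w+")
            out.write("\n".join(
                map(str, heapq.merge(*(read_ints(f) for f in group)))))
            out.seek(0)
            merged.append(out)
        runs = merged
    return [int(line) for line in runs[0]]
```

With N items, the number of merge passes is about log_{fan_in}(N/memory_size), mirroring the O(log_{M/B}(N/M)) phase bound.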
19 External Searching: B-Tree
- Each node (except the root) has fan-out between B/2 and B
- Size: O(N/B) blocks on disk
- Search: O(log_B N) I/Os, following a root-to-leaf path
- Insertion and deletion: O(log_B N) I/Os
20 Fundamental Bounds
- Internal vs. external:
- Scanning: N vs. N/B
- Sorting: N log N vs. (N/B) log_{M/B}(N/B)
- Searching: log N vs. log_B N
- More results:
- List ranking: N vs. sort(N)
- Minimal spanning tree: N log N vs. sort(N)
- Offline union-find: N vs. sort(N)
- Interval searching: log N + T vs. log_B N + T/B
- Rectangle enclosure: log N + T vs. log N + T/B
- R-tree search
21 Does All the Theory Matter?
- Programs developed in the RAM model still run even when there is not enough memory
- They run on large datasets because:
- The OS moves blocks as needed
- The OS utilizes paging and prefetching strategies
- But if the program makes scattered accesses, even a good OS cannot take advantage of block access
- => Thrashing!
22 Toy Experiment: Permuting
- Problem:
- Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
- Each element knows its correct position
- Output: store them on disk in the right order
- Internal memory solution:
- Just scan the original sequence and move every element to the right place!
- O(N) time, O(N) I/Os
- External memory solution:
- Use sorting
- O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os
23 A Practical Example on Real Data
- Computing persistence on large terrain data
24 Takeaways
- Need to be very careful when your program's space usage exceeds the physical memory size
- If the program mostly makes highly localized accesses:
- Let the OS handle it automatically
- If the program makes many non-localized accesses:
- Need I/O-efficient techniques
- Three common techniques (recall the majority counting puzzle):
- Convert to sort + scan
- Divide-and-conquer
- Other tricks
25 Want to know more about I/O-efficient algorithms?
- A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)
26 Computation Model for Massive Data (2): Streaming Model
- Cannot store / don't want to store / can't wait to store the data for further processing
- You get to look at each element only once!
27 Streaming Algorithms: Applications
[Diagram: a Network Operations Center (NOC) monitoring streams from enterprise networks, the PSTN, and DSL/cable networks]
- What are the top (most frequent) 1000 (source, dest) pairs seen over the last month? (an SQL join query: SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source)
- How many distinct (source, dest) pairs have been seen? (a set-expression query)
- Off-line analysis: slow, expensive
- Other applications:
- Sensor networks
- Network security
- Financial applications
- Web logs and clickstreams
28 Puzzle 2: Find the Missing Card
[Figure: a set of Mahjong tiles with one missing]
- How to find the missing tile by making one pass over everything?
- Assuming you can't memorize everything (of course)
- Assign a number to each type of tile (e.g., 8, 14, 22)
- Compute the sum of all remaining tiles
- (sum of all tile numbers) x 4 - observed sum = missing tile!
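The same trick in code, using a hypothetical encoding of the 34 Mahjong tile types as values 1..34 with 4 copies of each (this encoding is my illustration, not from the slides):

```python
def find_missing_tile(observed, tile_values=range(1, 35), copies=4):
    """One pass, O(1) memory: the total of a complete set, minus the
    running sum of the tiles we actually see, is the missing tile.
    Assumes exactly one tile is missing."""
    expected = copies * sum(tile_values)  # sum over a complete set
    seen = 0
    for t in observed:  # single pass over the stream of tiles
        seen += t
    return expected - seen
```

The same one-pass idea generalizes: with k missing tiles, maintaining the first k power sums (sum of t, t^2, ..., t^k) still pins them down.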
29 A Research Problem: Count Distinct Elements
Stream: b a e c a d a a d a a e a b a a f a g b
# distinct elements = 7
- Unfortunately, there is a lower bound saying you can't do this exactly without using Omega(n) memory
- But if we allow some errors, then we can approximate it well
30 Solution: FM Sketch [FM85, AMS99]
- Take a (pseudo-)random hash function h: {1,...,n} -> {1,...,2^d}, where 2^d > n
- For each incoming element x, compute h(x)
- e.g., h(5) = 10101100010000 (in binary)
- Count how many trailing zeros it has
- Remember the maximum number of trailing zeros in any h(x); call it Y
- Can show E[2^Y] ≈ # distinct elements
- 2 elements: on average there is one h(x) with 1 trailing zero
- 4 elements: on average there is one h(x) with 2 trailing zeros
- 8 elements: on average there is one h(x) with 3 trailing zeros
31 Counting Paintballs
- Imagine the following scenario:
- A bag of n paintballs is emptied at the top of a long staircase.
- At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way.
- Looking only at the pattern of marked steps, what was n?
32 Counting Paintballs (cont.)
- What does the distribution of paintball bursts look like?
- The number of bursts at each step follows a binomial distribution: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, ..., B(n, 1/2^Y) at the Y-th
- The expected number of bursts drops geometrically
- Few bursts after log_2 n steps
33 Solution: FM Sketch [FM85, AMS99]
- So 2^Y is an unbiased estimator of the # of distinct elements
- However, it has a large variance
- Use O(1/ε^2 log(1/δ)) copies to guarantee an estimator that, with probability 1-δ, is within relative error ε
- Applications:
- How many distinct IP addresses used a given link to send their traffic since the beginning of the day?
- How many new IP addresses appeared today that didn't appear before?
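A toy version of the sketch, with blake2b standing in for the pseudo-random hash family and the 0.77351 bias correction from the FM85 paper; the per-copy salting scheme and all parameters are my own choices for illustration:

```python
import hashlib

def _h(x, salt):
    """64-bit pseudo-random hash of x, salted differently per copy."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8,
                        salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(d, "big")

def trailing_zeros(v, width=64):
    """Number of trailing zero bits in v (at most width)."""
    tz = 0
    while tz < width and v & 1 == 0:
        v >>= 1
        tz += 1
    return tz

def fm_estimate(stream, num_copies=32):
    """Per copy, track Y = max trailing zeros of h(x) over the stream;
    duplicates hash identically, so they cannot inflate Y. Average Y
    over the copies and return 2^avg / 0.77351 (FM85's correction)."""
    maxes = [0] * num_copies
    for x in stream:
        for i in range(num_copies):
            maxes[i] = max(maxes[i], trailing_zeros(_h(x, i)))
    return 2 ** (sum(maxes) / num_copies) / 0.77351
```

Because only the per-copy maxima are kept, the memory use is independent of the stream length, which is the whole point.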
34 Finding Heavy Hitters
- Which elements appeared in the stream more than 10% of the time?
- Applications:
- Networking
- Finding IP addresses sending the most traffic
- Databases
- Iceberg queries
- Data mining
- Finding hot items (item sets) in transaction data
- Solution:
- The exact solution is difficult
- If we allow an approximation error of ε:
- Use O(1/ε) space and O(1) time per element in the stream
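One well-known scheme with exactly these bounds is the Misra-Gries summary (naming the specific algorithm is my assumption; the slide does not say which one is meant). With k-1 counters, any element occurring more than an n/k fraction of the time is guaranteed to survive:

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k-1 counters: every element occurring
    more than n/k times in a stream of length n survives in the
    summary; surviving counts undercount by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping those that hit 0
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream (when possible) turns the candidate set into exact heavy hitters, much like the verification pass in the majority puzzle (the 10%-of-the-time question is the special case k = 10).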
35 Streaming in a Distributed World
- Large-scale querying/monitoring: inherently distributed!
- Streams physically distributed across remote sites, e.g., a stream of UDP packets through a subset of edge routers
- The challenge is holistic querying/monitoring:
- Queries over the union of distributed streams: Q(S1 ∪ S2 ∪ ...)
- Streaming data is spread throughout the network
36 Streaming in a Distributed World
- Need timely, accurate, and efficient query answers
- Additional complexity over centralized data streaming!
- Need space/time- and communication-efficient solutions:
- Minimize network overhead
- Maximize network lifetime (e.g., sensor battery life)
- Cannot afford to centralize all streaming data
37 Want to know more about streaming algorithms?
- A graduate-level course on streaming algorithms will be offered in approximately the next-next-next semester, with an error guarantee of ±5%!
Or, talk to me tomorrow!
38 Top-k Queries
- Extremely useful in information retrieval
- top-k sellers, popular movies, etc.
- Google
Unordered relation: tuples t1 t2 t3 t4 t5 with scores 65 30 100 80 87
Ordered by score: t3 (100), t5 (87), t4 (80), t1 (65), t2 (30)
- Algorithms/systems: Threshold Algorithm, RankSQL
- top-2 = {t3, t5}
39 Top-k Queries on Uncertain Data
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- The top-k answer depends on the interplay between score and confidence
- e.g., (sensor reading, reliability), (page rank, how well the page matches the query)
40 Top-k Definition: U-Topk
The k tuples with the maximum probability of being the top-k
- {t3, t5}: 0.2 x 0.8 = 0.16
- {t3, t4}: 0.2 x (1-0.8) x 0.9 = 0.036
- {t5, t4}: (1-0.2) x 0.8 x 0.9 = 0.576
- ...
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- Potential problem: the top-k could be very different from the top-(k+1)
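The U-Topk definition can be checked by brute force over possible worlds (assuming independent tuples, as in this slide's table); this enumeration is exponential and only meant for verifying tiny examples, not a real query-processing algorithm:

```python
from itertools import chain, combinations

def u_topk(rows, k):
    """Enumerate all possible worlds of independent (name, score, conf)
    rows, take each world's top-k by score, and accumulate probability
    per top-k set. Returns the best set and the full distribution."""
    probs = {}
    n = len(rows)
    for world in chain.from_iterable(
            combinations(range(n), r) for r in range(n + 1)):
        p = 1.0  # probability of this world under tuple independence
        for i, (name, score, conf) in enumerate(rows):
            p *= conf if i in world else (1 - conf)
        top = sorted(world, key=lambda i: -rows[i][1])[:k]
        names = tuple(rows[i][0] for i in top)
        if len(names) == k:  # worlds with < k tuples have no top-k
            probs[names] = probs.get(names, 0.0) + p
    return max(probs, key=probs.get), probs

# The table from this slide: (tuple, score, confidence)
slide_rows = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
              ("t1", 65, 0.5), ("t2", 30, 0.6)]
```

Running it on slide_rows reproduces the numbers above: {t5, t4} wins with probability 0.576.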
41 Top-k Definition: U-kRanks
The i-th tuple is the one with the maximum probability of being at rank i, for i = 1,...,k
- Rank 1: t3: 0.2; t5: (1-0.2) x 0.8 = 0.64; t4: (1-0.2) x (1-0.8) x 0.9 = 0.144; ...
- Rank 2: t3: 0; t5: 0.2 x 0.8 = 0.16; t4: 0.9 x (0.2 x (1-0.8) + (1-0.2) x 0.8) = 0.612
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- Potential problem: duplicated tuples in the top-k
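U-kRanks can be verified the same brute-force way (my illustration, again exponential and only for tiny examples with independent tuples):

```python
from itertools import chain, combinations

def u_kranks(rows, k):
    """For each rank i, find the (name, score, conf) row with the
    maximum probability of being the i-th highest-scoring tuple,
    by enumerating all possible worlds under tuple independence."""
    n = len(rows)
    # prob[i][j] = probability that row j sits at rank i (0-based)
    prob = [[0.0] * n for _ in range(k)]
    for world in chain.from_iterable(
            combinations(range(n), r) for r in range(n + 1)):
        p = 1.0
        for j, (name, score, conf) in enumerate(rows):
            p *= conf if j in world else (1 - conf)
        ranked = sorted(world, key=lambda j: -rows[j][1])
        for i, j in enumerate(ranked[:k]):
            prob[i][j] += p
    return [rows[max(range(n), key=lambda j: prob[i][j])][0]
            for i in range(k)]
```

On this slide's table it returns [t5, t4], matching the 0.64 and 0.612 winners computed above; note t4 can be the answer at more than one rank, which is exactly the duplication problem the slide points out.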
42 Uncertain Data Models
- An uncertain data model represents a probability distribution over database instances (possible worlds)
- Basic model: mutual independence among all tuples
- Complete models: able to represent any distribution over possible worlds
- Atomic independent random Boolean variables
- Each tuple corresponds to a Boolean formula, and appears iff the formula evaluates to true
- Exponential complexity
43 Uncertain Data Model: x-relations
Each x-tuple represents a discrete probability distribution of tuples; x-tuples are mutually independent, and internally disjoint
[Figure: example single-alternative and multi-alternative x-relations]
- U-Top2 = {t1, t2}; U-2Ranks = (t1, t3)
44 Want to know more about uncertain data management?
- A graduate-level course on uncertain data management will (likely, probably) be offered in the next-next-next-next-next semester
Or, talk to me tomorrow!
45 Recap
- External memory model:
- Main memory is fast but limited
- External memory is slow but unlimited
- Aim to optimize I/O performance
- Streaming model:
- Main memory is fast but small
- Can't store, not willing to store, or can't wait to store the data
- Compute the desired answers in one pass
- Probabilistic data model:
- Can't store and query the exponentially many possible instances (possible worlds)
- Compute the desired answers on the succinct representation of the probabilistic data (efficiently!! possibly allowing some errors)
46 Thanks!