Title: Dealing with MASSIVE Data
1 Dealing with MASSIVE Data
Feifei Li (lifeifei_at_cs.fsu.edu), Dept. of Computer Science, FSU, Sep 9, 2008
2 Brief Bio
- B.A.S. in computer engineering from Nanyang Technological University in 2002
- Ph.D. in computer science from Boston University in 2007
- Research intern/visitor at AT&T Labs, IBM T. J. Watson Research Center, Microsoft Research
- Now Assistant Professor in the CS Department at FSU
3 Research Areas
Database Applications
indexing
query processing
spatial databases
Algorithms and Data structures
data streams
I/O-efficient algorithms
computational geometry
streaming algorithms
misc.
Geographic Information Systems
data security and privacy
Probabilistic Data
4 Massive Data
- Massive datasets are being collected everywhere
- Storage management software is a billion-dollar industry
- Examples (2002):
- Phone: AT&T 20TB phone call database, wireless tracking
- Consumer: WalMart 70TB database, buying patterns
- Web: crawl of 200M pages and 2000M links, Google's huge indexes
- Geography: NASA satellites generate 1.2TB per day
5 Example: LIDAR Terrain Data
- Massive (irregular) point sets (1-10m resolution)
- Becoming relatively cheap and easy to collect
- Appalachian Mountains: between 50GB and 5TB
- Exceeds memory limit and needs to be stored on disk
6 Example: Network Flow Data
- AT&T IP backbone generates 500 GB per day
- Gigascope: a data stream management system
- Compute certain statistics
- Can we do the computation without storing the data?
7 Traditional Random Access Machine (RAM) Model
- Standard theoretical model of computation
- Infinite memory (how nice!)
- Uniform access cost
- Simple model crucial for the success of the computer industry
8 How to Deal with MASSIVE Data?
- ...when there is not enough memory
9 Solution 1: Buy More Memory
- Expensive
- (Probably) not scalable
- Growth rate of data is higher than the growth of memory
10 Solution 2: Cheat! (by random sampling)
- Provides approximate solutions for some problems
- average, frequency of an element, etc.
- What if we want the exact result?
- Many problems can't be solved by sampling
- maximum, and all problems mentioned later
11 Solution 3: Use the Right Computation Model
- External Memory Model
- Streaming Model
- Probabilistic Model (brief)
12 Computation Model for Massive Data (1): External Memory Model
- Internal memory is limited but fast
- External memory is unlimited but slow
13 Memory Hierarchy
- Modern machines have a complicated memory hierarchy
- Levels get larger and slower further away from the CPU
- Block sizes and memory sizes differ between levels!
- There have been a few attempts to model the full hierarchy, but none successful: they are too complicated!
14 Slow I/O
- Disk access is 10^6 times slower than main memory access
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
- Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB)
- Important to store/access data so as to take advantage of blocks (locality)
15 Puzzle 1: Majority Counting
Stream: b a e c a d a a d a a e a b a a f a g b
- A huge file of characters stored on disk
- Question: Is there a character that appears > 50% of the time?
- Solution 1: sort + scan
- A few passes (O(log_{M/B} N)); we will come to it later
- Solution 2: divide-and-conquer
- Load a chunk into memory: N/M chunks
- Count them, return the majority
- The overall majority must be the majority in > 50% of the chunks
- Iterate until < M elements remain
- Very few passes (O(log_M N)), geometrically decreasing
- Solution 3: O(1) memory, 2 passes (answer to be posted later)
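Solution 3 is left as a teaser; a classic scheme that matches its bounds is the Boyer-Moore majority vote (my guess at the intended answer, not confirmed by the slides): pass one maintains a single candidate and counter, pass two verifies the candidate. A minimal sketch:

```python
def majority(stream):
    """Boyer-Moore majority vote: O(1) memory, two passes.
    Returns the element appearing > 50% of the time, or None.
    The stream must be iterable twice (e.g., a list or a file)."""
    # Pass 1: find the only possible candidate
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate really exceeds 50%
    n = occurrences = 0
    for x in stream:
        n += 1
        occurrences += (x == candidate)
    return candidate if 2 * occurrences > n else None
```

Note that on this slide's sample stream, 'a' appears in exactly 10 of 20 positions, so the strict > 50% test fails and the verification pass returns None; that second pass is what makes the answer exact.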
16 External Memory Model [AV88]
- N = # of items in the problem instance
- B = # of items per disk block
- M = # of items that fit in main memory
- I/O: move one block between memory and disk
- Performance measure: # of I/Os performed by the algorithm
- We assume (for convenience) that M > B^2
[Diagram: processor P and memory M, connected to disk D by block I/O]
17 Sorting in External Memory
- Break all N elements into N/M chunks of size M each
- Sort each chunk individually in memory
- Merge them together
- Can merge < M/B sorted lists (queues) at once, keeping M/B blocks in main memory
18 Sorting in External Memory
- Merge sort:
- Create N/M memory-sized sorted lists
- Repeatedly merge lists together, Theta(M/B) at a time
- => O(log_{M/B}(N/M)) phases using O(N/B) I/Os each => O((N/B) log_{M/B}(N/M)) I/Os in total
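The two phases above can be sketched in a few lines. Here memory_size stands in for M (items per in-memory run) and fan_in for the Theta(M/B) merge arity; both are toy parameters of my choosing, with temp files playing the role of disk:

```python
import heapq
import itertools
import tempfile

def external_sort(input_items, memory_size=4, fan_in=3):
    """External merge sort sketch: sort memory-sized runs, then
    repeatedly merge fan_in runs at a time until one run remains."""
    # Phase 1: create N/M sorted runs of at most memory_size items
    runs = []
    it = iter(input_items)
    while True:
        chunk = list(itertools.islice(it, memory_size))
        if not chunk:
            break
        chunk.sort()  # the chunk fits in "memory", so sort internally
        f = tempfile.TemporaryFile(mode="w+")
        f.write("\n".join(map(str, chunk)))
        f.seek(0)
        runs.append(f)
    if not runs:
        return []

    def read_ints(f):
        for line in f:
            yield int(line)

    # Phase 2: repeatedly merge fan_in runs at a time
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            out = tempfile.TemporaryFile(mode="w+")
            out.write("\n".join(
                map(str, heapq.merge(*(read_ints(f) for f in group)))))
            out.seek(0)
            merged.append(out)
        runs = merged
    return [int(line) for line in runs[0]]
```

With N items, the number of merge passes is about log_{fan_in}(N/memory_size), mirroring the O(log_{M/B}(N/M)) phase bound.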
19 External Searching: B-Tree
- Each node (except the root) has fan-out between B/2 and B
- Size: O(N/B) blocks on disk
- Search: O(log_B N) I/Os, following a root-to-leaf path
- Insertion and deletion: O(log_B N) I/Os
20 Fundamental Bounds
- Internal vs. external:
- Scanning: N vs. N/B
- Sorting: N log N vs. (N/B) log_{M/B}(N/B)
- Searching: log N vs. log_B N
- More results:
- List ranking: N vs. sort(N)
- Minimal spanning tree: N log N vs. sort(N)
- Offline union-find: N vs. sort(N)
- Interval searching: log N + T vs. log_B N + T/B
- Rectangle enclosure: log N + T vs. log N + T/B
- R-tree search
21 Does All the Theory Matter?
- Programs developed in the RAM model still run even when there is not enough memory
- They run on large datasets because:
- The OS moves blocks as needed
- The OS utilizes paging and prefetching strategies
- But if the program makes scattered accesses, even a good OS cannot take advantage of block access
- => Thrashing!
22 Toy Experiment: Permuting
- Problem:
- Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
- Each element knows its correct position
- Output: store them on disk in the right order
- Internal memory solution:
- Just scan the original sequence and move every element to the right place!
- O(N) time, O(N) I/Os
- External memory solution:
- Use sorting
- O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os
23 A Practical Example on Real Data
- Computing persistence on large terrain data
24 Takeaways
- Need to be very careful when your program's space usage exceeds the physical memory size
- If the program mostly makes highly localized accesses:
- Let the OS handle it automatically
- If the program makes many non-localized accesses:
- Need I/O-efficient techniques
- Three common techniques (recall the majority counting puzzle):
- Convert to sort + scan
- Divide-and-conquer
- Other tricks
25 Want to know more about I/O-efficient algorithms?
- A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)
26 Computation Model for Massive Data (2): Streaming Model
- Cannot store / don't want to store / can't wait to store the data for further processing
- You get to look at each element only once!
27 Streaming Algorithms: Applications
[Diagram: a Network Operations Center (NOC) monitoring streams from enterprise networks, the PSTN, and DSL/cable networks]
- What are the top (most frequent) 1000 (source, dest) pairs seen over the last month? (an SQL join query: SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source)
- How many distinct (source, dest) pairs have been seen? (a set-expression query)
- Off-line analysis: slow, expensive
- Other applications:
- Sensor networks
- Network security
- Financial applications
- Web logs and clickstreams
28 Puzzle 2: Find the Missing Card
[Figure: a set of Mahjong tiles with one missing]
- How to find the missing tile by making one pass over everything?
- Assuming you can't memorize everything (of course)
- Assign a number to each type of tile (e.g., 8, 14, 22)
- Compute the sum of all remaining tiles
- (sum of all tile numbers) x 4 - observed sum = missing tile!
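The same trick in code, using a hypothetical encoding of the 34 Mahjong tile types as values 1..34 with 4 copies of each (this encoding is my illustration, not from the slides):

```python
def find_missing_tile(observed, tile_values=range(1, 35), copies=4):
    """One pass, O(1) memory: the total of a complete set, minus the
    running sum of the tiles we actually see, is the missing tile.
    Assumes exactly one tile is missing."""
    expected = copies * sum(tile_values)  # sum over a complete set
    seen = 0
    for t in observed:  # single pass over the stream of tiles
        seen += t
    return expected - seen
```

The same one-pass idea generalizes: with k missing tiles, maintaining the first k power sums (sum of t, t^2, ..., t^k) still pins them down.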
29 A Research Problem: Count Distinct Elements
Stream: b a e c a d a a d a a e a b a a f a g b
# distinct elements = 7
- Unfortunately, there is a lower bound saying you can't do this exactly without using Omega(n) memory
- But if we allow some errors, then we can approximate it well
30 Solution: FM Sketch [FM85, AMS99]
- Take a (pseudo-)random hash function h: {1,...,n} -> {1,...,2^d}, where 2^d > n
- For each incoming element x, compute h(x)
- e.g., h(5) = 10101100010000 (in binary)
- Count how many trailing zeros it has
- Remember the maximum number of trailing zeros in any h(x); call it Y
- Can show E[2^Y] ≈ # distinct elements
- 2 elements: on average there is one h(x) with 1 trailing zero
- 4 elements: on average there is one h(x) with 2 trailing zeros
- 8 elements: on average there is one h(x) with 3 trailing zeros
31 Counting Paintballs
- Imagine the following scenario:
- A bag of n paintballs is emptied at the top of a long staircase.
- At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way.
- Looking only at the pattern of marked steps, what was n?
32 Counting Paintballs (cont.)
- What does the distribution of paintball bursts look like?
- The number of bursts at each step follows a binomial distribution: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, ..., B(n, 1/2^Y) at the Y-th
- The expected number of bursts drops geometrically
- Few bursts after log_2 n steps
33 Solution: FM Sketch [FM85, AMS99]
- So 2^Y is an unbiased estimator of the # of distinct elements
- However, it has a large variance
- Use O(1/ε^2 log(1/δ)) copies to guarantee an estimator that, with probability 1-δ, is within relative error ε
- Applications:
- How many distinct IP addresses used a given link to send their traffic since the beginning of the day?
- How many new IP addresses appeared today that didn't appear before?
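A toy version of the sketch, with blake2b standing in for the pseudo-random hash family and the 0.77351 bias correction from the FM85 paper; the per-copy salting scheme and all parameters are my own choices for illustration:

```python
import hashlib

def _h(x, salt):
    """64-bit pseudo-random hash of x, salted differently per copy."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8,
                        salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(d, "big")

def trailing_zeros(v, width=64):
    """Number of trailing zero bits in v (at most width)."""
    tz = 0
    while tz < width and v & 1 == 0:
        v >>= 1
        tz += 1
    return tz

def fm_estimate(stream, num_copies=32):
    """Per copy, track Y = max trailing zeros of h(x) over the stream;
    duplicates hash identically, so they cannot inflate Y. Average Y
    over the copies and return 2^avg / 0.77351 (FM85's correction)."""
    maxes = [0] * num_copies
    for x in stream:
        for i in range(num_copies):
            maxes[i] = max(maxes[i], trailing_zeros(_h(x, i)))
    return 2 ** (sum(maxes) / num_copies) / 0.77351
```

Because only the per-copy maxima are kept, the memory use is independent of the stream length, which is the whole point.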
34 Finding Heavy Hitters
- Which elements appeared in the stream more than 10% of the time?
- Applications:
- Networking
- Finding IP addresses sending the most traffic
- Databases
- Iceberg queries
- Data mining
- Finding hot items (item sets) in transaction data
- Solution:
- The exact solution is difficult
- If we allow an approximation error of ε:
- Use O(1/ε) space and O(1) time per element in the stream
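One well-known scheme with exactly these bounds is the Misra-Gries summary (naming the specific algorithm is my assumption; the slide does not say which one is meant). With k-1 counters, any element occurring more than an n/k fraction of the time is guaranteed to survive:

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k-1 counters: every element occurring
    more than n/k times in a stream of length n survives in the
    summary; surviving counts undercount by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping those that hit 0
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream (when possible) turns the candidate set into exact heavy hitters, much like the verification pass in the majority puzzle (the 10%-of-the-time question is the special case k = 10).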
35 Streaming in a Distributed World
- Large-scale querying/monitoring: inherently distributed!
- Streams physically distributed across remote sites, e.g., a stream of UDP packets through a subset of edge routers
- The challenge is holistic querying/monitoring:
- Queries over the union of distributed streams: Q(S1 ∪ S2 ∪ ...)
- Streaming data is spread throughout the network
36 Streaming in a Distributed World
- Need timely, accurate, and efficient query answers
- Additional complexity over centralized data streaming!
- Need space/time- and communication-efficient solutions:
- Minimize network overhead
- Maximize network lifetime (e.g., sensor battery life)
- Cannot afford to centralize all streaming data
37 Want to know more about streaming algorithms?
- A graduate-level course on streaming algorithms will be offered in approximately the next-next-next semester, with an error guarantee of ±5%!
Or, talk to me tomorrow!
38 Top-k Queries
- Extremely useful in information retrieval
- top-k sellers, popular movies, etc.
- Google
Unordered relation: tuples t1 t2 t3 t4 t5 with scores 65 30 100 80 87
Ordered by score: t3 (100), t5 (87), t4 (80), t1 (65), t2 (30)
- Algorithms/systems: Threshold Algorithm, RankSQL
- top-2 = {t3, t5}
39 Top-k Queries on Uncertain Data
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- The top-k answer depends on the interplay between score and confidence
- e.g., (sensor reading, reliability), (page rank, how well the page matches the query)
40 Top-k Definition: U-Topk
The k tuples with the maximum probability of being the top-k
- {t3, t5}: 0.2 x 0.8 = 0.16
- {t3, t4}: 0.2 x (1-0.8) x 0.9 = 0.036
- {t5, t4}: (1-0.2) x 0.8 x 0.9 = 0.576
- ...
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- Potential problem: the top-k could be very different from the top-(k+1)
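The U-Topk definition can be checked by brute force over possible worlds (assuming independent tuples, as in this slide's table); this enumeration is exponential and only meant for verifying tiny examples, not a real query-processing algorithm:

```python
from itertools import chain, combinations

def u_topk(rows, k):
    """Enumerate all possible worlds of independent (name, score, conf)
    rows, take each world's top-k by score, and accumulate probability
    per top-k set. Returns the best set and the full distribution."""
    probs = {}
    n = len(rows)
    for world in chain.from_iterable(
            combinations(range(n), r) for r in range(n + 1)):
        p = 1.0  # probability of this world under tuple independence
        for i, (name, score, conf) in enumerate(rows):
            p *= conf if i in world else (1 - conf)
        top = sorted(world, key=lambda i: -rows[i][1])[:k]
        names = tuple(rows[i][0] for i in top)
        if len(names) == k:  # worlds with < k tuples have no top-k
            probs[names] = probs.get(names, 0.0) + p
    return max(probs, key=probs.get), probs

# The table from this slide: (tuple, score, confidence)
slide_rows = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
              ("t1", 65, 0.5), ("t2", 30, 0.6)]
```

Running it on slide_rows reproduces the numbers above: {t5, t4} wins with probability 0.576.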
41 Top-k Definition: U-kRanks
The i-th tuple is the one with the maximum probability of being at rank i, for i = 1,...,k
- Rank 1: t3: 0.2; t5: (1-0.2) x 0.8 = 0.64; t4: (1-0.2) x (1-0.8) x 0.9 = 0.144; ...
- Rank 2: t3: 0; t5: 0.2 x 0.8 = 0.16; t4: 0.9 x (0.2 x (1-0.8) + (1-0.2) x 0.8) = 0.612
tuple: t3 t5 t4 t1 t2
score: 100 87 80 65 30
confidence: 0.2 0.8 0.9 0.5 0.6
- Potential problem: duplicated tuples in the top-k
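U-kRanks can be verified the same brute-force way (my illustration, again exponential and only for tiny examples with independent tuples):

```python
from itertools import chain, combinations

def u_kranks(rows, k):
    """For each rank i, find the (name, score, conf) row with the
    maximum probability of being the i-th highest-scoring tuple,
    by enumerating all possible worlds under tuple independence."""
    n = len(rows)
    # prob[i][j] = probability that row j sits at rank i (0-based)
    prob = [[0.0] * n for _ in range(k)]
    for world in chain.from_iterable(
            combinations(range(n), r) for r in range(n + 1)):
        p = 1.0
        for j, (name, score, conf) in enumerate(rows):
            p *= conf if j in world else (1 - conf)
        ranked = sorted(world, key=lambda j: -rows[j][1])
        for i, j in enumerate(ranked[:k]):
            prob[i][j] += p
    return [rows[max(range(n), key=lambda j: prob[i][j])][0]
            for i in range(k)]
```

On this slide's table it returns [t5, t4], matching the 0.64 and 0.612 winners computed above; note t4 can be the answer at more than one rank, which is exactly the duplication problem the slide points out.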
42 Uncertain Data Models
- An uncertain data model represents a probability distribution over database instances (possible worlds)
- Basic model: mutual independence among all tuples
- Complete models: able to represent any distribution over possible worlds
- Atomic independent random Boolean variables
- Each tuple corresponds to a Boolean formula, and appears iff the formula evaluates to true
- Exponential complexity
43 Uncertain Data Model: x-relations
Each x-tuple represents a discrete probability distribution of tuples; x-tuples are mutually independent, and internally disjoint
[Figure: example single-alternative and multi-alternative x-relations]
- U-Top2 = {t1, t2}; U-2Ranks = (t1, t3)
44 Want to know more about uncertain data management?
- A graduate-level course on uncertain data management will (likely, probably) be offered in the next-next-next-next-next semester
Or, talk to me tomorrow!
45 Recap
- External memory model:
- Main memory is fast but limited
- External memory is slow but unlimited
- Aim to optimize I/O performance
- Streaming model:
- Main memory is fast but small
- Can't store, not willing to store, or can't wait to store the data
- Compute the desired answers in one pass
- Probabilistic data model:
- Can't store and query the exponentially many possible instances (possible worlds)
- Compute the desired answers on the succinct representation of the probabilistic data (efficiently!! possibly allowing some errors)
46 Thanks!