1
Dealing with MASSIVE Data
Feifei Li (lifeifei@cs.fsu.edu)
Dept. of Computer Science, FSU
Sep 9, 2008
2
Brief Bio
  • B.A.S. in computer engineering from Nanyang
    Technological University in 2002
  • Ph.D. in computer science from Boston University
    in 2007
  • Research intern/visitor at AT&T Labs, IBM T. J.
    Watson Research Center, and Microsoft Research
  • Now an Assistant Professor in the CS Department at FSU

3
Research Areas
  • Database applications: indexing, query processing,
    spatial databases
  • Algorithms and data structures: data streams,
    I/O-efficient algorithms, computational geometry,
    streaming algorithms
  • Misc.: geographic information systems, data security
    and privacy, probabilistic data
4
Massive Data
  • Massive datasets are being collected everywhere
  • Storage management software is a billion-dollar industry
  • Examples (2002):
  • Phone: AT&T 20TB phone call database, wireless
    tracking
  • Consumer: WalMart 70TB database, buying patterns
  • Web: web crawl of 200M pages and 2000M links,
    Google's huge indexes
  • Geography: NASA satellites generate 1.2TB per day

5
Example: LIDAR Terrain Data
  • Massive (irregular) point sets (1-10m resolution)
  • Becoming relatively cheap and easy to collect
  • Appalachian Mountains: between 50GB and 5TB
  • Exceeds the memory limit and needs to be stored on
    disk

6
Example: Network Flow Data
  • AT&T's IP backbone generates 500GB per day
  • Gigascope: a data stream management system
  • Computes certain statistics over the stream
  • Can we do the computation without storing the data?

7
Traditional Random Access Machine Model
  • Standard theoretical model of computation:
  • Infinite memory (how nice!)
  • Uniform access cost
  • This simple model was crucial to the success of the
    computer industry

8
How to Deal with MASSIVE Data?
  • ... when there is not enough memory

9
Solution 1: Buy More Memory
  • Expensive
  • (Probably) not scalable
  • The growth rate of data is higher than the growth
    rate of memory

10
Solution 2: Cheat! (by random sampling)
  • Provides approximate solutions for some problems
  • average, frequency of an element, etc.
  • What if we want the exact result?
  • Many problems can't be solved by sampling
  • maximum, and all of the problems mentioned later

11
Solution 3: Using the Right Computation Model
  • External Memory Model
  • Streaming Model
  • Probabilistic Model (brief)

12
Computation Model for Massive Data (1): External
Memory Model
  • Internal memory is limited but fast
  • External memory is unlimited but slow

13
Memory Hierarchy
  • Modern machines have a complicated memory hierarchy
  • Levels get larger and slower the further they are
    from the CPU
  • Block sizes and memory sizes differ across levels!
  • There have been a few attempts to model the full
    hierarchy, but none has been successful: they are
    too complicated!

14
Slow I/O
  • Disk access is 10^6 times slower than main memory
    access

"The difference in speed between modern CPU and
disk technologies is analogous to the difference
in speed in sharpening a pencil using a sharpener
on one's desk or by taking an airplane to the
other side of the world and using a sharpener on
someone else's desk." (D. Comer)
  • Disk systems try to amortize the large access time
    by transferring large contiguous blocks of data
    (8-16 KB)
  • It is important to store and access data so as to
    take advantage of blocks (locality)

15
Puzzle 1: Majority Counting
Stream: b a e c a d a a d a a e a b a a f a g b
  • A huge file of characters stored on disk
  • Question: is there a character that appears > 50%
    of the time?
  • Solution 1: sort + scan
  • A few passes (O(log_{M/B} N)); we will come back to
    this later
  • Solution 2: divide-and-conquer
  • Load one chunk into memory at a time (N/M chunks)
  • Count each chunk and return its majority
  • The overall majority must be the majority in > 50%
    of the chunks
  • Iterate until fewer than M candidates remain
  • Very few passes (O(log_M N)), with the input
    shrinking geometrically
  • Solution 3: O(1) memory, 2 passes (answer to be
    posted later; a sketch follows below)
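
The slide defers the answer to Solution 3, but a classic technique matching the stated bounds (O(1) memory, 2 passes) is the Boyer-Moore majority vote; what follows is a minimal Python sketch under that assumption, not necessarily the answer the speaker had in mind.

def majority(stream):
    """Boyer-Moore majority vote: O(1) memory, two passes."""
    # Pass 1: find the only possible majority candidate by pairing
    # each occurrence of the candidate against a differing character.
    candidate, count = None, 0
    for c in stream:
        if count == 0:
            candidate, count = c, 1
        elif c == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate, since the stream may have no majority.
    n = len(stream)
    occurrences = sum(1 for c in stream if c == candidate)
    return candidate if 2 * occurrences > n else None

print(majority("abacaa"))  # -> 'a' (4 of 6 characters, i.e., > 50%)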

16
External Memory Model [AV88]
  • N = # of items in the problem instance
  • B = # of items per disk block
  • M = # of items that fit in main memory
  • I/O: move one block between memory and disk
  • Performance measure: # of I/Os performed by the
    algorithm
  • We assume (for convenience) that M > B^2
[Figure: processor (P) and main memory (M) connected to the disk (D) by block I/O]
17
Sorting in External Memory
  • Break all N elements into N/M chunks of size M
    each
  • Sort each chunk individually in memory
  • Merge the sorted chunks together
  • Can merge < M/B sorted lists (queues) at once,
    using one block from each list (M/B blocks) in
    main memory

18
Sorting in External Memory
  • Merge sort:
  • Create N/M memory-sized sorted lists
  • Repeatedly merge lists together, Θ(M/B) at a time
    (see the toy sketch below)
  • O(log_{M/B}(N/M)) phases using O(N/B) I/Os each,
    i.e., O((N/B) log_{M/B}(N/B)) I/Os in total
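
A toy sketch of the two phases in Python, assuming deliberately tiny (hypothetical) values of M and B so the run/merge structure stays visible; a real implementation would transfer B-item blocks to and from disk.

import heapq

M, B = 8, 2  # hypothetical memory and block sizes, in items

def external_merge_sort(data):
    # Phase 1: create N/M memory-sized sorted runs.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    # Phase 2: repeatedly merge up to M/B runs at a time, so one
    # block per input run fits in main memory.
    fan_in = M // B
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

print(external_merge_sort([6, 7, 1, 3, 2, 5, 10, 9, 4, 8]))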

19
External Searching: B-Tree
  • Each node (except the root) has fan-out between B/2
    and B
  • Size: O(N/B) blocks on disk
  • Search: O(log_B N) I/Os, following a root-to-leaf
    path (see the calculation below)
  • Insertion and deletion: O(log_B N) I/Os
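
A back-of-the-envelope calculation, with hypothetical but typical values (B = 1000 items per block, N = 10^9 items), shows why O(log_B N) I/Os is so small in practice:

import math

B, N = 1000, 10**9     # hypothetical: 1000 items/block, a billion items
print(math.log(N, B))  # ~3.0: a root-to-leaf search costs only ~3 I/Os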

20
Fundamental Bounds
                               Internal        External
  • Scanning                   N               N/B
  • Sorting                    N log N         (N/B) log_{M/B}(N/B)
  • Searching                  log N           log_B N
  • More results
    (where sort(N) = (N/B) log_{M/B}(N/B)):
  • List ranking                N               sort(N)
  • Minimal spanning tree       N log N         sort(N)
  • Offline union-find          N               sort(N)
  • Interval searching          log N + T       log_B N + T/B
  • Rectangle enclosure         log N + T       log N + T/B
  • R-tree search

21
Does All the Theory Matter?
  • Programs developed in the RAM model still run even
    when there is not enough memory
  • They run on large datasets because:
  • The OS moves blocks as needed
  • The OS utilizes paging and prefetching strategies
  • But if the program makes scattered accesses, even a
    good OS cannot take advantage of block access
  • ⇒ Thrashing!

22
Toy Experiment: Permuting
  • Problem:
  • Input: N elements out of order: 6, 7, 1, 3, 2, 5,
    10, 9, 4, 8
  • Each element knows its correct position
  • Output: store them on disk in the right order
  • Internal memory solution:
  • Just scan the original sequence and move each
    element to its right place!
  • O(N) time, O(N) I/Os
  • External memory solution:
  • Use sorting
  • O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os
    (see the comparison below)
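
Plugging hypothetical but realistic values into the two bounds (N = 10^9, M = 10^7, B = 10^4 items) shows why the asymptotically "slower" solution wins on I/Os:

import math

N, M, B = 10**9, 10**7, 10**4                # hypothetical sizes, in items
naive_ios = N                                # one random I/O per element
sort_ios = (N / B) * math.log(N / B, M / B)  # (N/B) log_{M/B}(N/B)
print(f"{naive_ios:.1e} vs {sort_ios:.1e}")  # ~1.0e9 vs ~1.7e5 I/Os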

23
A Practical Example on Real Data
  • Computing persistence on large terrain data

24
Takeaways
  • Be very careful when your program's space usage
    exceeds the physical memory size
  • If the program mostly makes highly localized accesses:
  • Let the OS handle it automatically
  • If the program makes many non-localized accesses:
  • I/O-efficient techniques are needed
  • Three common techniques (recall the majority
    counting puzzle):
  • Convert to sort + scan
  • Divide-and-conquer
  • Other tricks

25
Want to know more about I/O-efficient algorithms?
  • A course on I/O-efficient algorithms is offered
    as CIS5930 (Advanced Topics in Data Management)

26
Computation Model for Massive Data (2): Streaming
Model
Cannot store, don't want to store, or can't wait to
store the data before processing it
  • You get to look at each element only once!

27
Streaming Algorithms Applications
Example queries at a Network Operations Center (NOC):
  • Top-k query: what are the top (most frequent) 1000
    (source, dest) pairs seen over the last month?
  • Distinct-count query: how many distinct (source,
    dest) pairs have been seen?
  • SQL join query: SELECT COUNT (R1.source, R2.dest)
    FROM R1, R2 WHERE R1.dest = R2.source
  • Set-expression queries
  • Off-line analysis: slow, expensive
  • Other applications:
  • Sensor networks
  • Network security
  • Financial applications
  • Web logs and clickstreams

[Figure: an NOC monitoring streams from peers, enterprise networks, the PSTN, and DSL/cable networks]
28
Puzzle 2: Find the Missing Card
[Figure: a full set of Mahjong tiles with one tile missing]
  • How can you find the missing tile by making one pass
    over everything?
  • Assuming you can't memorize everything (of course)
  • Assign a number to each type of tile (e.g., 8, 14, 22)
  • Compute the sum of all remaining tiles
  • (sum of all tile-type values) * 4 - (sum of the
    remaining tiles) = the missing tile! (sketch below)
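
A sketch of the same sum trick in Python, assuming tile types are encoded as integers and a full set holds exactly 4 copies of each type (the values below are a made-up toy example):

def missing_tile(remaining, tile_values):
    """One pass, O(1) memory: the missing tile is the gap in the sum."""
    full_sum = 4 * sum(tile_values)  # sum over the complete 4-copy set
    seen_sum = sum(remaining)        # single pass over what is left
    return full_sum - seen_sum

# Toy example: 3 tile types {1, 2, 3}, one copy of tile 2 missing.
tiles = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(missing_tile(tiles, [1, 2, 3]))  # -> 2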

29
A Research Problem: Count Distinct Elements
Stream: b a e c a d a a d a a e a b a a f a g b
# distinct elements = 7
  • Unfortunately, there is a lower bound saying you
    can't do this exactly without using Ω(n) memory
  • But if we allow some errors, then we can approximate
    it well

30
Solution: FM Sketch [FM85, AMS99]
  • Take a (pseudo-)random hash function h: {1, ..., n}
    → {1, ..., 2^d}, where 2^d > n
  • For each incoming element x, compute h(x)
  • e.g., h(5) = 10101100010000 (in binary)
  • Count its trailing zeros
  • Let Y be the maximum number of trailing zeros seen
    in any h(x)
  • Can show E[2^Y] = # distinct elements (see the
    simulation below)
  • With 2 distinct elements, on average there is one
    h(x) with 1 trailing zero
  • With 4 distinct elements, on average there is one
    h(x) with 2 trailing zeros
  • With 8 distinct elements, on average there is one
    h(x) with 3 trailing zeros
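
A minimal simulation of the idea, with two stated assumptions: Python's built-in hash, salted with a seed, stands in for the pseudo-random hash function, and several independent copies are averaged to tame the variance (the real FM estimator also applies a bias-correction constant).

def trailing_zeros(x):
    # Number of trailing zero bits of x (0 is treated as having none).
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream, seed=0):
    y = 0                                 # Y = max trailing zeros seen
    for x in stream:
        h = hash((seed, x)) & 0xFFFFFFFF  # stand-in for h: [n] -> [2^d]
        y = max(y, trailing_zeros(h))
    return 2 ** y                         # E[2^Y] ~ # distinct elements

stream = list("baecadaadaaeabaafagb")     # 7 distinct elements
estimates = [fm_estimate(stream, seed=s) for s in range(50)]
print(sum(estimates) / len(estimates))    # roughly 7, give or take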

31
Counting Paintballs
  • Imagine the following scenario:
  • A bag of n paintballs is emptied at the top of a
    long staircase
  • At each step, each paintball either bursts and
    marks the step, or bounces to the next step:
    50/50 chance either way

Looking only at the pattern of marked steps, what
was n?
32
Counting Paintballs (cont)
  • What does the distribution of paintball bursts
    look like?
  • The number of bursts at each step follows a
    binomial distribution.
  • The expected number of bursts drops
    geometrically.
  • Few bursts after log_2 n steps

[Figure: the number of bursts at step i follows B(n, 1/2^i): B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, ..., B(n, 1/2^Y) at the Y-th]
33
Solution: FM Sketch [FM85, AMS99]
  • So 2^Y is an unbiased estimator for the number of
    distinct elements
  • However, it has a large variance
  • Use O(1/ε^2 log(1/δ)) copies to guarantee an
    estimator that, with probability 1-δ, is within
    relative error ε
  • Applications:
  • How many distinct IP addresses used a given link
    to send their traffic since the beginning of the
    day?
  • How many new IP addresses appeared today that
    didn't appear before?

34
Finding Heavy Hitters
  • Which elements appeared in the stream more than
    10% of the time?
  • Applications:
  • Networking
  • Finding the IP addresses sending the most traffic
  • Databases
  • Iceberg queries
  • Data mining
  • Finding hot items (item sets) in transaction
    data
  • Solution:
  • An exact solution is difficult
  • If we allow an approximation error of ε:
  • Use O(1/ε) space and O(1) time per element in the
    stream (see the sketch below)
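
The slide does not name the algorithm, but the stated bounds (O(1/ε) space, constant amortized time per element) match the classic Misra-Gries frequent-items algorithm; a minimal sketch:

def misra_gries(stream, k):
    """One pass, at most k-1 counters: every element with true
    frequency > 1/k of the stream survives in `counters` (its count
    may be low by up to len(stream)/k; a second pass can verify)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            counters = {e: c - 1 for e, c in counters.items() if c > 1}
    return counters

# Elements appearing > 10% of the time: k = 10 counters suffice.
print(misra_gries(list("baecadaadaaeabaafagb"), k=10))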

35
Streaming in a Distributed World
  • Large-scale querying/monitoring: inherently
    distributed!
  • Streams are physically distributed across remote
    sites, e.g., streams of UDP packets through subsets
    of edge routers
  • The challenge is holistic querying/monitoring:
  • Queries over the union of distributed streams
    Q(S1 ∪ S2 ∪ ...)
  • Streaming data is spread throughout the network

36
Streaming in a Distributed World
  • Need timely, accurate, and efficient query
    answers
  • Additional complexity over centralized data
    streaming!
  • Need space/time- and communication-efficient
    solutions:
  • Minimize network overhead
  • Maximize network lifetime (e.g., sensor battery
    life)
  • Cannot afford to centralize all streaming data

37
Want to know more about streaming algorithms?
  • A graduate-level course on streaming algorithms
    will be approximately offered in the next next
    next semester, with an error guarantee of 5%!

Or, talk to me tomorrow!
38
Top-k Queries
  • Extremely useful in information retrieval:
  • top-k sellers, popular movies, etc.
  • Google

  tuple:  t1   t2   t3   t4   t5
  score:  65   30   100  80   87

  (ranked via the Threshold Algorithm, RankSQL)

  tuple:  t3   t5   t4   t1   t2
  score:  100  87   80   65   30

  top-2 = {t3, t5}
39
Top-k Queries on Uncertain Data
  tuple:       t3   t5   t4   t1   t2
  score:       100  87   80   65   30
  confidence:  0.2  0.8  0.9  0.5  0.6

The top-k answer depends on the interplay between
score and confidence; e.g., (sensor reading,
reliability), (page rank, how well the page matches
the query)
40
Top-k Definition: U-Topk
The k tuples with the maximum probability of being
the top-k

  {t3, t5}: 0.2 * 0.8 = 0.16
  {t3, t4}: 0.2 * (1 - 0.8) * 0.9 = 0.036
  {t5, t4}: (1 - 0.2) * 0.8 * 0.9 = 0.576
  ... (a sketch computing these probabilities follows below)

  tuple:       t3   t5   t4   t1   t2
  score:       100  87   80   65   30
  confidence:  0.2  0.8  0.9  0.5  0.6

Potential problem: the top-k could be very different
from the top-(k+1)
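
For mutually independent tuples sorted by descending score, the probability that a given set is the exact top-k has a simple closed form: every tuple in the set must exist, and every higher-scoring tuple outside the set must be absent. A hypothetical Python sketch reproducing the numbers above:

def u_topk_prob(tuples, chosen):
    """P(`chosen` is the top-k); `tuples` is a list of (name, confidence)
    pairs sorted by descending score, `chosen` a set of names."""
    lowest_rank = max(i for i, (name, _) in enumerate(tuples)
                      if name in chosen)
    prob = 1.0
    for i, (name, p) in enumerate(tuples):
        if name in chosen:
            prob *= p      # tuple in the set must exist
        elif i < lowest_rank:
            prob *= 1 - p  # higher-scoring outsider must be absent
    return prob

T = [("t3", 0.2), ("t5", 0.8), ("t4", 0.9), ("t1", 0.5), ("t2", 0.6)]
print(u_topk_prob(T, {"t3", "t5"}))  # 0.2 * 0.8       = 0.16
print(u_topk_prob(T, {"t3", "t4"}))  # 0.2 * 0.2 * 0.9 = 0.036
print(u_topk_prob(T, {"t5", "t4"}))  # 0.8 * 0.8 * 0.9 = 0.576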
41
Top-k Definition: U-kRanks
The i-th tuple is the one with the maximum
probability of being at rank i, for i = 1, ..., k

  Rank 1: t3: 0.2
          t5: (1 - 0.2) * 0.8 = 0.64
          t4: (1 - 0.2) * (1 - 0.8) * 0.9 = 0.144
          ...
  Rank 2: t3: 0
          t5: 0.2 * 0.8 = 0.16
          t4: 0.9 * (0.2 * (1 - 0.8) + (1 - 0.2) * 0.8) = 0.612
  (a sketch computing these probabilities follows below)

  tuple:       t3   t5   t4   t1   t2
  score:       100  87   80   65   30
  confidence:  0.2  0.8  0.9  0.5  0.6

Potential problem: duplicated tuples in the top-k
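
The rank-i probabilities also follow from independence: tuple t sits at rank i iff t exists and exactly i - 1 higher-scoring tuples exist. A small dynamic program computes this; the sketch below reproduces the numbers above and is an illustration, not the paper's algorithm.

def rank_prob(tuples, target, i):
    """P(`target` appears at rank i), for independent (name, confidence)
    tuples sorted by descending score; ranks are 1-indexed."""
    # dp[j] = P(exactly j of the tuples scoring above `target` exist).
    dp = [1.0]
    for name, p in tuples:
        if name == target:
            break
        dp = [(dp[j] * (1 - p) if j < len(dp) else 0.0)
              + (dp[j - 1] * p if j > 0 else 0.0)
              for j in range(len(dp) + 1)]
    p_target = dict(tuples)[target]
    return p_target * (dp[i - 1] if i - 1 < len(dp) else 0.0)

T = [("t3", 0.2), ("t5", 0.8), ("t4", 0.9), ("t1", 0.5), ("t2", 0.6)]
print(rank_prob(T, "t5", 1))  # (1-0.2) * 0.8             = 0.64
print(rank_prob(T, "t4", 1))  # (1-0.2) * (1-0.8) * 0.9   = 0.144
print(rank_prob(T, "t4", 2))  # 0.9 * (0.2*0.2 + 0.8*0.8) = 0.612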
42
Uncertain Data Models
  • An uncertain data model represents a probability
    distribution over database instances (possible
    worlds)
  • Basic model: mutual independence among all tuples
  • Complete models: able to represent any
    distribution over possible worlds
  • Atoms: independent random Boolean variables
  • Each tuple corresponds to a Boolean formula and
    appears iff the formula evaluates to true
  • Exponential complexity

43
Uncertain Data Model x-relations
Each x-tuple represents a discrete probability
distribution of tuples x-tuples are mutually
independent, and disjoint
single-alternative multi-alternative
U-Top2 t1,t2 U-2Ranks (t1, t3)
44
Want to know more about uncertain data
management?
  • A graduate-level course on uncertain data
    management will (likely, probably) be offered in
    the next next next next next semester

Or, talk to me tomorrow!
45
Recap
  • External memory model:
  • Main memory is fast but limited
  • External memory is slow but unlimited
  • Aim to optimize I/O performance
  • Streaming model:
  • Main memory is fast but small
  • Can't store, not willing to store, or can't wait
    to store the data
  • Compute the desired answers in one pass
  • Probabilistic data model:
  • Can't store or query the exponentially many
    possible instances (possible worlds)
  • Compute the desired answers on the succinct
    representation of the probabilistic data
    (efficiently! possibly allowing some errors)

46
Thanks!
  • Questions?