Title: I/O-Efficient Algorithms
1. I/O-Efficient Algorithms and Data Structures
Ke Yi February 14, 2008
2. Recap: Merge Sort
- Merge sort
- Internal memory: two-way merge; naïve I/O: $O((N/B)\log_2(N/B))$
- External memory: $O(M/B)$-way merge
- Total I/O: $O((N/B)\log_{M/B}(N/B)) = \mathrm{sort}(N)$
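The effect of the merge fan-in on the number of passes can be illustrated with a small in-memory simulation. This is only a sketch: the function name `external_merge_sort` is hypothetical, `M` and `B` are counted in elements, and Python lists stand in for disk-resident runs.

```python
import heapq

def external_merge_sort(data, M, B):
    """Simulate external merge sort; return (sorted list, number of merge passes)."""
    fan_in = max(2, M // B)  # assumed fan-in: about M/B runs merged at once
    # Run formation: sort memory-sized chunks of M elements each.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    passes = 0
    while len(runs) > 1:
        # One pass over the data: merge groups of fan_in runs into longer runs.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes
```

Each pass reads and writes every element once, so the pass count times N/B gives the I/O cost; a larger fan-in means fewer passes.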
3. Recap: External Heap
[Figure: external heap; labels: insert buffer, main memory, in memory]
- The heap has fan-out $\Theta(M/B)$; each node has $\Theta(M/B)$ blocks
- Amortized I/O per insert or delete-max: $O((1/B)\log_{M/B}(N/B))$
- Naïve I/O with an internal heap: $O(\log_2(N/B))$
- Heap property: all elements in a child are smaller than those in its parent
4. External Heap in Practice
- In practice: know the scale of your problem!
- Suppose $M = 512\text{M}$ and $B = 256\text{K}$; then two levels can support $M \cdot (M/B) = 1024\text{G} = 1\text{T}$ of data!
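The arithmetic behind this estimate can be checked directly. The snippet below is a sketch that assumes M and B are given in the same unit and that "M", "K", "G", "T" denote powers of two; the variable names are illustrative only.

```python
M = 512 * 2**20        # memory size: 512M
B = 256 * 2**10        # block size: 256K
fan_out = M // B       # fan-out of about M/B = 2048
capacity = M * fan_out # data supported by two levels: M * (M/B)

print(fan_out)              # 2048
print(capacity // 2**30)    # 1024 (in units of G)
print(capacity == 2**40)    # True, i.e. 1T
```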
5. Recap: Basic General I/O Techniques
(3) Reduce to sort + pqueue
6. Pointer Dereferencing
- "Almost every problem in computer science can be solved by another level of indirection"
- Dereferencing each pointer individually needs many random I/Os
- How do we get the values I/O-efficiently?
- Output (i, data) pairs
[Figure: pointer array P[i] and data array D[i]]
7. I/O-Efficient Pointer Dereferencing
[Figure: pointer array P[i] and data array D[i]]
Total I/O: sort(N)
- Sort pointer array by pointers
- Produce a list of (i, P[i]) pairs, sorted by P[i]
- Scan both arrays in parallel
- Produce (i, data) pairs
- Sort the list back by i if needed
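The sort-and-scan steps above can be sketched in memory as follows. The function name `dereference` is hypothetical, and Python's built-in sort stands in for an external sort; the point is that the data array is only ever read by a cursor that moves forward.

```python
def dereference(P, D):
    """Return (i, D[P[i]]) pairs for all i, using only sorts and one forward scan."""
    # Step 1: build (i, P[i]) pairs and sort them by pointer value.
    pairs = sorted(enumerate(P), key=lambda t: t[1])
    # Step 2: scan the sorted pairs and D in parallel; the cursor never moves back.
    out = []
    cursor = 0
    for i, p in pairs:
        while cursor < p:
            cursor += 1          # sequential advance through the data array
        out.append((i, D[cursor]))
    # Step 3: sort back by i (only if the original order is needed).
    out.sort(key=lambda t: t[0])
    return out

print(dereference([2, 0, 2, 1], ["a", "b", "c"]))
# [(0, 'c'), (1, 'a'), (2, 'c'), (3, 'b')]
```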
8. Time-Forward Processing
- Scan the sequence in order, maintaining a priority queue
- For each cell:
  - For each incoming edge:
    - DeleteMin from pq; if there's a match, obtain the incoming value
  - Compute the outgoing value
  - For each outgoing edge:
    - Insert (destination address, value) into pq, with destination as key
- Total I/O: sort(N)
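A minimal in-memory sketch of this loop is shown below. It assumes the cells are numbered in scan order and that every edge points to a later cell; the names `time_forward`, `labels`, `out_edges`, and `combine` are illustrative, and `heapq` stands in for an external priority queue.

```python
import heapq

def time_forward(labels, out_edges, combine):
    """labels[v]: local input of cell v; out_edges[v]: destinations, all > v."""
    pq = []          # entries are (destination, value), keyed by destination
    values = []
    for v, label in enumerate(labels):
        # Collect every value that was sent forward to this cell.
        incoming = []
        while pq and pq[0][0] == v:
            _, val = heapq.heappop(pq)
            incoming.append(val)
        out = combine(label, incoming)   # compute the outgoing value
        values.append(out)
        # Send the value forward along each outgoing edge.
        for dest in out_edges[v]:
            heapq.heappush(pq, (dest, out))
    return values

result = time_forward([1, 2, 3, 4], [[1, 2], [3], [3], []],
                      lambda label, inc: label + sum(inc))
print(result)  # [1, 3, 4, 11]
```

Because messages are keyed by destination and cells are visited in increasing order, the minimum of the queue is always the next message to be consumed.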
9. Application: Maximal Independent Set
- Given an undirected graph G = (V, E) stored on disk
- A list of (vertex-id, vertex-id) pairs representing all edges
- An independent set is a set I of vertices such that no two vertices in I are adjacent
- Set I is maximal if adding any other vertex to I makes it no longer independent
- Note: finding a maximum independent set is NP-hard!
- Internal memory
- Add vertices one by one until no more vertices can be added
- Time: O(E)
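The internal-memory greedy algorithm might look like the following sketch, which assumes vertices are numbered 0 to n-1 and uses adjacency lists; the function name `greedy_mis` is hypothetical.

```python
def greedy_mis(num_vertices, edges):
    """Greedy maximal independent set; vertices are 0 .. num_vertices - 1."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    blocked = [False] * num_vertices   # True once some neighbour is in the set
    result = []
    for v in range(num_vertices):
        if not blocked[v]:
            result.append(v)           # v can be added safely
            for w in adj[v]:
                blocked[w] = True      # its neighbours can no longer be added
    return result

print(greedy_mis(4, [(0, 1), (1, 2), (2, 3)]))  # [0, 2]
```

The random accesses into `adj` and `blocked` are what make this approach expensive once the graph no longer fits in memory.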
10. I/O-Efficient Maximal Independent Set
[Figure: example graph with vertices 1 through 7]
- Make all edges directed from a low vertex id to a high vertex id
- Sort all edges by source
- Now we have a time-forward processing problem!
- Total I/O: sort(N)
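One way to phrase this reduction is sketched below, assuming vertices numbered 0 to n-1 and an in-memory `heapq` standing in for the external priority queue. The function name `mis_time_forward` is hypothetical. A vertex that joins the set sends a "blocked" message forward to each higher-numbered neighbour.

```python
import heapq

def mis_time_forward(num_vertices, edges):
    """Maximal independent set via time-forward processing (in-memory sketch)."""
    # Direct every edge from the lower id to the higher id, then sort by source.
    directed = sorted((min(u, v), max(u, v)) for u, v in edges if u != v)
    pq = []        # destination ids of "a neighbour is already in I" messages
    result = []
    e = 0          # cursor into the sorted edge list (one sequential scan)
    for v in range(num_vertices):
        blocked = False
        while pq and pq[0] == v:       # consume all messages addressed to v
            heapq.heappop(pq)
            blocked = True
        in_set = not blocked
        if in_set:
            result.append(v)
        # Scan v's outgoing edges; forward a message only if v joined the set.
        while e < len(directed) and directed[e][0] == v:
            if in_set:
                heapq.heappush(pq, directed[e][1])
            e += 1
    return result

print(mis_time_forward(5, [(0, 1), (0, 2), (1, 2), (3, 4)]))  # [0, 3]
```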
11. Distribution Sweeping
- An I/O-efficient technique for solving batched geometry problems
12. Plane Sweep
- A technique for solving batched geometry problems
- Plane sweep: an important technique in computational geometry (in internal memory)
- Example: orthogonal segment intersection
- Given a set of horizontal and vertical segments, the goal is to report all intersections
- Internal memory: plane sweep with a binary search tree
- Time: $O(N \log N + K)$, where $K$ is the output size (an output-sensitive algorithm)
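A compact sketch of the internal-memory plane sweep follows. It assumes horizontal segments are given as `(y, x1, x2)` with `x1 <= x2` and vertical segments as `(x, y1, y2)` with `y1 <= y2`; the function name is hypothetical. A sorted Python list with `bisect` stands in for the balanced binary search tree, so insertions here cost more than the O(log N) a real tree would give.

```python
import bisect

def orthogonal_intersections(horizontals, verticals):
    """Report (x, y) for every intersecting horizontal/vertical pair."""
    events = []
    for x, y1, y2 in verticals:
        events.append((y1, 0, x, None))   # kind 0: vertical segment starts
        events.append((y2, 2, x, None))   # kind 2: vertical segment ends
    for y, x1, x2 in horizontals:
        events.append((y, 1, x1, x2))     # kind 1: horizontal segment (query)
    # At equal y: insert first, then query, then delete (segments are closed).
    events.sort(key=lambda ev: (ev[0], ev[1]))
    active = []    # sorted x-coordinates of verticals crossing the sweep line
    result = []
    for y, kind, a, b in events:
        if kind == 0:
            bisect.insort(active, a)
        elif kind == 2:
            active.pop(bisect.bisect_left(active, a))
        else:
            lo = bisect.bisect_left(active, a)
            hi = bisect.bisect_right(active, b)
            result.extend((x, y) for x in active[lo:hi])   # range query
    return result

print(orthogonal_intersections([(1, 0, 5), (2, 3, 8)],
                               [(2, 0, 3), (4, 2, 6), (7, 0, 3)]))
# [(2, 1), (4, 2), (7, 2)]
```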
13. Distribution Sweeping
- Divide into M/B slabs
- Only consider the red middle segment (the part of a horizontal segment that spans whole slabs) on this level
- Push the blue leftovers (the end pieces that only partly cross a slab) one level down and process recursively
- One input segment pushes two blue segments down only once!
- Total size is linear at any level (Phew)
14. Distribution Sweeping
- Maintain an active list for each slab, storing all vertical segments intersecting the sweep line
- Reporting is fine, but how do we delete?
- Delete lazily!
15. Distribution Sweeping
- Total I/O on this level: $O(N/B + K'/B)$
- $K'$: intersections found on this level
- Total I/O on all levels: $O((N/B)\log_{M/B}(N/B) + K/B)$
- Optimal in the comparison-I/O model
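The recursion, the per-slab active lists, and lazy deletion can be put together in a small in-memory sketch. It is an illustration under several assumptions, not the slides' implementation: segments use the formats `(y, x1, x2)` and `(x, y1, y2)`, the parameter `fanout` plays the role of M/B, `base` plays the role of "fits in memory", slabs are formed by chunking the verticals sorted by x, and leftover horizontals are passed down unclipped.

```python
def distribution_sweep(horizontals, verticals, fanout=4, base=4):
    """Report (x, y) for all intersections; assumes fanout >= 2 and base >= 1."""
    if not horizontals or not verticals:
        return []
    if len(verticals) <= base:
        # Base case: the slab is small enough to handle directly.
        return [(x, y) for (y, x1, x2) in horizontals
                for (x, y1, y2) in verticals
                if x1 <= x <= x2 and y1 <= y <= y2]
    verticals = sorted(verticals)                      # sort by x
    size = -(-len(verticals) // fanout)                # ceil(n / fanout)
    slabs = [verticals[i:i + size] for i in range(0, len(verticals), size)]
    bounds = [(s[0][0], s[-1][0]) for s in slabs]      # x-range of each slab
    events = [(y1, 0, j, x, y2)
              for j, s in enumerate(slabs) for (x, y1, y2) in s]
    events += [(y, 1, x1, x2, 0) for (y, x1, x2) in horizontals]
    events.sort(key=lambda ev: (ev[0], ev[1]))         # starts before queries
    active = [[] for _ in slabs]    # per-slab lists of (x, y2)
    pushed = [[] for _ in slabs]    # horizontals passed down to each slab
    out = []
    for y, kind, a, b, c in events:
        if kind == 0:
            active[a].append((b, c))                   # vertical enters slab a
            continue
        x1, x2 = a, b
        for j, (lo, hi) in enumerate(bounds):
            if hi < x1 or lo > x2:
                continue                               # slab not touched
            if x1 <= lo and hi <= x2:
                # Slab fully spanned: scan its active list, deleting lazily.
                alive = [(x, y2) for (x, y2) in active[j] if y2 >= y]
                out.extend((x, y) for x, _ in alive)
                active[j] = alive
            else:
                pushed[j].append((y, x1, x2))          # leftover: recurse
    for s, hs in zip(slabs, pushed):
        out.extend(distribution_sweep(hs, s, fanout, base))
    return out
```

Expired vertical segments are never removed by their own event; they are dropped only when a later report scans the list, which is what keeps the deletion cost proportional to the scanning already being done.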
16. Distribution Sweeping: Framework
- Sort objects by x-coordinate and y-coordinate
- Divide into M/B slabs
- Sweep the plane, solving problems on the slab level
- Generate M/B problem instances inside the slabs
- Solve the problem inside each slab recursively until smaller than memory size
- Total levels: $O(\log_{M/B}(N/M))$
- Key is how to solve the problem on the slab level
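The level count can be illustrated with a short calculation. This sketch assumes each level splits a problem evenly into M/B slabs and stops once a subproblem fits in memory; the function name `recursion_levels` and the example sizes (reusing the M and B values from slide 4) are illustrative.

```python
def recursion_levels(N, M, B):
    """Number of recursion levels until subproblems fit in memory (M/B >= 2)."""
    fanout = M // B
    levels = 0
    size = N
    while size > M:
        size = -(-size // fanout)   # ceil(size / fanout): size of one slab
        levels += 1
    return levels

M, B = 2**29, 2**18                      # 512M and 256K, as on slide 4
print(recursion_levels(2**40, M, B))     # 1
print(recursion_levels(2**41, M, B))     # 2
```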