External Sorting - PowerPoint PPT Presentation

Provided by: nihankes

1
External Sorting
  • Reference Chapter 8

2
Outline
  • Heapsort
  • Multi-way Merging
  • Multi-step merging
  • Replacement Selection

3
External Sorting
  • Problem: sort 1 GB of data with 1 MB of RAM.
  • When a file doesn't fit in memory, sorting has two
    stages:
      • The file is divided into several segments, each of
        which is sorted separately.
      • The sorted segments are merged.
  • (Each stage involves reading and writing the file
    at least once.)

4
Sorting Segments
  • Heapsort
      • The optimal routine if only one disk drive is
        available.
      • It can be executed by overlapping the
        input/output with processing.
      • Each sorted segment will be the size of the
        available memory.
  • Replacement selection
      • Optimal for two or more disk drives.
      • Sorted segments are, on average, twice the size of
        memory.
      • Reading in and writing out can be overlapped.

5
Heapsort
  • What is a heap?
  • A heap is a binary tree with the following
    properties:
      • Each node has a single key, and that key is
        greater than or equal to the key at its parent
        node.
      • It is a complete binary tree, i.e. all leaves are
        on at most 2 levels, and the leaves on the lowest
        level are in the leftmost positions.
      • It can be stored in an array: the root is at index
        1, and the children of node i are at indexes 2i and
        2i+1. Conversely, the parent of node j is stored
        at index ⌊j/2⌋ (very compact: no need to store
        pointers).
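
The array layout described above can be sketched in a few lines of C++. This is only an illustration: the function names (`parentOf`, `leftChildOf`, `rightChildOf`, `isMinHeap`) are hypothetical, and slot 0 of the vector is left unused so that the root sits at index 1 as on the slide.

```cpp
#include <cstddef>
#include <vector>

// 1-based index arithmetic for a heap stored in an array.
inline std::size_t parentOf(std::size_t j) { return j / 2; }          // floor(j/2)
inline std::size_t leftChildOf(std::size_t i) { return 2 * i; }
inline std::size_t rightChildOf(std::size_t i) { return 2 * i + 1; }

// Returns true if keys[1..n] satisfies the heap property from the slide:
// every key is greater than or equal to the key at its parent.
// keys[0] is an unused placeholder (an assumption of this sketch).
bool isMinHeap(const std::vector<int>& keys) {
    for (std::size_t j = 2; j < keys.size(); ++j) {
        if (keys[j] < keys[parentOf(j)]) return false;
    }
    return true;
}
```

Because parent and child positions are computed from the index alone, the array needs no pointer fields, which is the compactness the slide refers to.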

6
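Example
  • Heap as a binary tree (height ≈ log₂ n) and heap as
    an array
  • [Figure not recovered: a heap holding the keys 10, 35,
    20, 25, 30, 45, 40, 60, 50, 55; the tree and array
    layouts were lost in extraction]

7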
Heapsort Algorithm
  • First stage: building the heap while reading the
    file
      • While there is available space:
          • Get the next record from the current input
            buffer.
          • Put the new record at the end of the heap.
          • Reestablish the heap by exchanging the new node
            with its parent if it is smaller than the
            parent; otherwise leave it where it is.
            Repeat this step as long as the heap property
            is violated.
  • Second stage: sorting while writing the heap out
    to the file
      • While there are records in the heap:
          • Put the root record in the current output
            buffer.
          • Replace the root by the last record in the heap.
          • Restore the heap again, which has a complexity
            of O(log n).
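
The two stages can be sketched as follows. This is a minimal in-memory illustration, assuming integer keys and ignoring the input/output buffers and their overlap with processing; `MinHeap` and `heapSortSegment` are hypothetical names, not taken from the slides.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Min-heap stored in a 1-based array (slot 0 is an unused placeholder).
struct MinHeap {
    std::vector<int> a{0};

    std::size_t size() const { return a.size() - 1; }

    // First stage: put the record at the end of the heap, then exchange it
    // with its parent for as long as it is smaller than the parent.
    void insert(int key) {
        a.push_back(key);
        std::size_t j = size();
        while (j > 1 && a[j] < a[j / 2]) {
            std::swap(a[j], a[j / 2]);
            j /= 2;
        }
    }

    // Second stage: take the root, replace it by the last record, then
    // restore the heap by sifting down in O(log n). Requires size() > 0.
    int removeMin() {
        int root = a[1];
        a[1] = a.back();
        a.pop_back();
        std::size_t i = 1, n = size();
        while (2 * i <= n) {
            std::size_t c = 2 * i;                   // left child
            if (c + 1 <= n && a[c + 1] < a[c]) ++c;  // pick the smaller child
            if (a[i] <= a[c]) break;                 // heap property holds
            std::swap(a[i], a[c]);
            i = c;
        }
        return root;
    }
};

// Builds a heap from one memory-load of keys and drains it in sorted order.
std::vector<int> heapSortSegment(const std::vector<int>& input) {
    MinHeap heap;
    for (int key : input) heap.insert(key);
    std::vector<int> out;
    while (heap.size() > 0) out.push_back(heap.removeMin());
    return out;
}
```

Calling `heapSortSegment` on the keys 48 70 30 19 50 45 100 15 yields them in ascending order. In the external-sorting setting each call would correspond to one sorted segment written to disk.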

8
Example
  • Trace the algorithm with the input
    48 70 30 19 50 45 100 15

9
Heapsort
  • How big is a heap?
      • As big as the available memory.
  • What is the time it takes to create the sorted
    segments?
      • Ignore the seek time and assume b blocks in
        the file, with heap processing overlapping
        (approximately) with I/O.
      • The time for creating the initial sorted segments
        is then 2b × ebt, where ebt is the effective block
        transfer time (every block is read in once and
        written out once as part of a run).
  • Note that the entire file has not been sorted
    yet. These are just sorted segments, and the size
    of each segment is limited to the size of the
    available memory used for this purpose.

10
Merging Two Lists
    int Merge(char *L1Name, char *L2Name, char *OLName)
    {
        InitializeList(1, L1Name);
        InitializeList(2, L2Name);
        InitOutputList(OLName);
        bool More1 = NextItem(1);
        bool More2 = NextItem(2);
        // Loop while either list has items. This assumes Item(n)
        // returns a key larger than any real key once list n is exhausted.
        while (More1 || More2) {
            if (Item(1) < Item(2)) {
                ProcessItem(1);
                More1 = NextItem(1);
            } else if (Item(1) == Item(2)) {   // same item in both lists: output once
                ProcessItem(1);
                More1 = NextItem(1);
                More2 = NextItem(2);
            } else {
                ProcessItem(2);
                More2 = NextItem(2);
            }
        }
        return 1;
    }
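
For comparison, here is a self-contained sketch of the same two-way merge over in-memory vectors. It is an illustration under stated assumptions: `mergeTwo` is a hypothetical name, the lists hold integers sorted in ascending order, and an item present in both lists is output once, as in the slide's logic. Instead of a sentinel key, it drains whichever list remains after the other is exhausted.

```cpp
#include <cstddef>
#include <vector>

// Merges two ascending lists; an item found in both lists is output once.
std::vector<int> mergeTwo(const std::vector<int>& list1,
                          const std::vector<int>& list2) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {
        if (list1[i] < list2[j]) {
            out.push_back(list1[i++]);
        } else if (list1[i] == list2[j]) {
            out.push_back(list1[i]);   // advance both lists
            ++i;
            ++j;
        } else {
            out.push_back(list2[j++]);
        }
    }
    // One list is exhausted; copy the remainder of the other.
    while (i < list1.size()) out.push_back(list1[i++]);
    while (j < list2.size()) out.push_back(list2[j++]);
    return out;
}
```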

11
Multiway Merging
  • K-way merge: we want to merge K input lists to
    create a single sequentially ordered output list
    (K is the order of a K-way merge).
  • We will adapt the 2-way merge algorithm:
      • Instead of two lists, keep an array of lists:
        list[0], list[1], ..., list[k-1].
      • Keep an array of the items that are being used
        from each list: item[0], item[1], ..., item[k-1].
      • The merge processing requires a call to a
        function (say MinIndex) to find the index of the
        item with the minimum value.

12
Merge Processing
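  • We modify the main loop of the merge as follows:

        int minItem = MinIndex(Item, k);    // index of the minimum item
        ProcessItem(minItem);               // next output
        for (i = 0; i < k; i++)             // advance other lists holding the same item
            if (i != minItem && Item(i) == Item(minItem))
                MoreItems[i] = NextItemInList(i);
        MoreItems[minItem] = NextItemInList(minItem);   // advance the chosen list last

  • If there are no duplicate items among different
    lists, then the for-loop can be eliminated.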
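
A runnable sketch of this K-way merge with a sequential minimum search is given below. The names `minIndex` and `mergeK` are hypothetical, and the sketch assumes each input list is strictly ascending, so an item can repeat across lists but not within one list.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the list whose current item is smallest,
// or lists.size() if every list is exhausted (sequential search, O(K)).
std::size_t minIndex(const std::vector<std::vector<int>>& lists,
                     const std::vector<std::size_t>& pos) {
    std::size_t best = lists.size();
    for (std::size_t i = 0; i < lists.size(); ++i) {
        if (pos[i] >= lists[i].size()) continue;   // list i is exhausted
        if (best == lists.size() || lists[i][pos[i]] < lists[best][pos[best]]) {
            best = i;
        }
    }
    return best;
}

// Merges K ascending lists; an item present in several lists is output once.
std::vector<int> mergeK(const std::vector<std::vector<int>>& lists) {
    std::vector<std::size_t> pos(lists.size(), 0);
    std::vector<int> out;
    while (true) {
        std::size_t m = minIndex(lists, pos);
        if (m == lists.size()) break;              // nothing left to merge
        const int minValue = lists[m][pos[m]];
        out.push_back(minValue);
        // Advance every list whose current item equals the one just output.
        for (std::size_t i = 0; i < lists.size(); ++i) {
            if (pos[i] < lists[i].size() && lists[i][pos[i]] == minValue) ++pos[i];
        }
    }
    return out;
}
```

The minimum is copied into `minValue` before any list is advanced, so the comparison inside the loop never sees an item that has already been replaced.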

13
Finding the minimum item
  • When the number of lists is small (K ≤ 8),
    sequential search among the items works nicely
    (O(K)).
  • When the number of lists is large, we could place
    the items in a priority queue (an array heap).
      • The min value will be at the root (1st position
        in the array).
      • Replace the root with the next value from the
        associated list. This insert operation is O(log
        K).
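
The priority-queue variant might look like the sketch below, which uses `std::priority_queue` from the C++ standard library as the array heap. `mergeKWithHeap` is a hypothetical name. Unlike the loop on the previous slide, this sketch keeps duplicates rather than collapsing them, which is an assumption made to keep it short.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merges K ascending lists using a min-heap of (key, source list) pairs.
// Each pop and push costs O(log K); duplicates are kept.
std::vector<int> mergeKWithHeap(const std::vector<std::vector<int>>& lists) {
    using Entry = std::pair<int, std::size_t>;   // (key, index of source list)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(lists.size(), 0);

    // Seed the heap with the first item of every non-empty list.
    for (std::size_t i = 0; i < lists.size(); ++i) {
        if (!lists[i].empty()) heap.push(Entry(lists[i][0], i));
    }

    std::vector<int> out;
    while (!heap.empty()) {
        Entry smallest = heap.top();             // min value is at the root
        heap.pop();
        out.push_back(smallest.first);
        std::size_t i = smallest.second;
        // Replace it with the next value from the associated list, if any.
        if (++pos[i] < lists[i].size()) heap.push(Entry(lists[i][pos[i]], i));
    }
    return out;
}
```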

14
Merging as a way of Sorting Large Files
  • Let us consider the following example.
  • File to be sorted:
      • 8,000,000 records
      • Record size R = 100 bytes
      • Size of the key = 10 bytes
  • Memory available as a work area: 10 MB (not
    counting memory used to hold the program, O.S., I/O
    buffers, etc.)
  • Total file size = 800 MB
  • Total number of bytes for all keys = 80 MB
  • So we can do neither internal sorting nor keysorting.

15
Basic idea
  • Forming runs (i.e. sorted subfiles):
      • Bring as many records as possible into main memory,
        sort them using heapsort, and save them into a small
        file.
      • Repeat this until we have read all records from
        the original file.
  • Do a multiway merge of the sorted subfiles.

16
Cost of Merge Sort
  • I/O operations are performed at the following
    times:
      • Step 1: reading each record into main memory for
        sorting and forming the runs.
      • Step 2: writing the sorted runs to disk.
  • These two steps are done as follows:
      • Read a chunk of 10 MB, write a chunk of 10 MB
        (repeat this 80 times).
  • In terms of basic disk operations, we spend:
      • For reading: 80 seeks + transfer time for 800 MB
      • The same for writing
17
  • Step 3: reading runs into memory for merging. Read one
    chunk of each run, so 80 chunks. Since the available
    memory is 10 MB, each chunk can have
    10,000,000/80 bytes = 125,000 bytes = 1,250
    records.
  • How many chunks are to be read for each run?
      • Size of run / size of chunk = 10,000,000/125,000
        = 80
  • Total number of basic seeks = total number of
    chunks (counting all runs) = 80 runs × 80
    chunks/run = 80² chunks = 6,400 seeks.
  • Reading each chunk involves an average seek.
18
  • Step 4: writing the sorted file to disk. The number of
    seeks depends on the size of the output buffer:
      • Bytes in file / bytes in output buffer
      • E.g. if the output buffer is 200 KB, the number of
        seeks is 800,000,000/200,000 = 4,000.
  • Among steps 1-4, step 3 dominates the running
    time.
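
The seek counts for the four steps can be reproduced with a small calculation. The sketch below is only a cost model under the slides' assumptions (runs exactly as large as the work area, sizes that divide evenly, one seek per chunk); the struct and function names are hypothetical.

```cpp
#include <cstdint>

// Seek counts for a one-step merge sort, following the slides' cost model.
struct MergeSortCost {
    std::int64_t runs;             // number of initial runs
    std::int64_t runFormSeeks;     // seeks to read the file while forming runs
                                   // (the same number again to write the runs)
    std::int64_t mergeReadSeeks;   // seeks to read run chunks during the merge
    std::int64_t mergeWriteSeeks;  // seeks to write the merged output
};

// Assumes fileBytes is a multiple of memoryBytes and that the divisions
// below are exact, as in the slides' example.
MergeSortCost oneStepMergeCost(std::int64_t fileBytes,
                               std::int64_t memoryBytes,
                               std::int64_t outputBufferBytes) {
    MergeSortCost c;
    c.runs = fileBytes / memoryBytes;             // each run fills the work area
    c.runFormSeeks = c.runs;                      // one seek per memory-sized chunk
    std::int64_t chunkBytes = memoryBytes / c.runs;        // buffer per run
    std::int64_t chunksPerRun = memoryBytes / chunkBytes;  // equals the number of runs
    c.mergeReadSeeks = c.runs * chunksPerRun;
    c.mergeWriteSeeks = fileBytes / outputBufferBytes;
    return c;
}
```

For the 800 MB file with a 10 MB work area and a 200 KB output buffer, this gives 80 runs, 6,400 seeks to read the runs during the merge, and 4,000 seeks to write the output.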

19
Sorting a File that is 10 times larger
  • How is the time for the merge phase affected if the
    file has 80 million records?
      • More runs: 800 runs
      • 800-way merge in 10 MB of memory,
        i.e. divide the memory into 800 buffers.
      • Each buffer holds 1/800th of a run.
      • So, 800 runs × 800 seeks/run = 640,000 seeks
20
The cost of increasing the file size
  • In general, for a K-way merge of K runs, the
    buffer size for each run is
      • (1/K) × size of memory space = (1/K) × size of
        each run
  • So K seeks are required to read all of the
    records in each run.
  • Since there are K runs, the merge requires K² seeks.
  • Because K is directly proportional to N, it also
    follows that the sort merge is an O(N²) operation.
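
The argument can be written out as a short derivation. The symbol M for the size of the memory work area is introduced here for illustration only; N is the file size, and each initial run is assumed to be one memory-load long.

```latex
\begin{align*}
K &= \frac{N}{M} && \text{(number of runs)} \\
\text{buffer per run} &= \frac{M}{K} \\
\text{seeks per run} &= \frac{M}{M/K} = K \\
\text{total seeks} &= K \cdot K = K^2 = \left(\frac{N}{M}\right)^2 = O(N^2)
  && \text{(for fixed } M\text{)}
\end{align*}
```

With N = 800 MB and M = 10 MB this gives K = 80 and 6,400 seeks; with N ten times larger, K = 800 and 640,000 seeks.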

21
Improvements
  • There are several ways to reduce the time:
      • Allocate more hardware (e.g. disk drives, memory).
      • Perform the merge in more than one step.
      • Algorithmically increase the lengths of the
        initial sorted runs.
      • Find ways to overlap I/O operations.

22
Multiple-step merges
  • Instead of merging all runs at once, we break the
    original set of runs into small groups and merge
    the runs in these groups separately.
      • More buffer space is available for each run;
        hence fewer seeks are required per run.
  • When all of the smaller merges are completed, a
    second pass merges the new set of merged runs.

23
  • [Figure: two-step merge of 800 runs, as 25 sets of 32
    runs each]

24
Cost of multi-step merge
  • 25 sets of 32 runs, followed by a 25-way merge
      • Disadvantage: we read every record twice.
      • Advantage: we can use larger buffers and avoid a
        large number of disk seeks.
  • Calculations
      • First merge step
          • Buffer size = 1/32 run => 32 × 32 = 1,024 seeks
          • For 25 32-way merges => 25 × 1,024 = 25,600
            seeks

25
  • Second merge step
      • For each of the 25 final runs, 1/25 of the buffer
        space is allocated.
      • So each input buffer can hold 4,000 records (or
        1/800 of a run).
      • Hence 800 seeks per run, so we end up making 25
        × 800 = 20,000 seeks.
  • Total number of seeks for the two steps:
      • 25,600 + 20,000 = 45,600
  • What about the total time for the merge?
      • We now have to transmit all of the records 4
        times instead of two.
      • We also write the records twice, requiring an
        extra 40,000 seeks.
      • Still, the trade is profitable (see sections
        8.5.1-8.5.5 for actual times).
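
The read-seek arithmetic for the two-step merge can be checked with the sketch below. It assumes, as the slides do, that every initial run is one memory-load long and that memory is divided evenly among the runs being merged; `twoStepMergeReadSeeks` is a hypothetical name, and the extra seeks for writing are not included.

```cpp
#include <cstdint>

// Seeks needed to read the runs in a two-step merge:
// `groups` separate merges of `runsPerGroup` runs each,
// followed by one `groups`-way merge of the intermediate runs.
std::int64_t twoStepMergeReadSeeks(std::int64_t groups,
                                   std::int64_t runsPerGroup) {
    // Step 1: each buffer holds 1/runsPerGroup of a run, so reading a run
    // takes runsPerGroup seeks, and there are groups * runsPerGroup runs.
    std::int64_t step1 = groups * runsPerGroup * runsPerGroup;

    // Step 2: each intermediate run is runsPerGroup memory-loads long and
    // gets 1/groups of memory, so it takes runsPerGroup * groups seeks.
    std::int64_t seeksPerRun = runsPerGroup * groups;
    std::int64_t step2 = groups * seeksPerRun;

    return step1 + step2;
}
```

For 25 groups of 32 runs the result is 25,600 + 20,000 = 45,600 seeks, against 800 × 800 = 640,000 for a single 800-way merge.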

26
Increasing Run Lengths
  • Assume the initial runs contain 200,000 records. Then
    instead of an 800-way merge we need a 400-way merge.
  • A longer initial run means
      • fewer total runs,
      • a lower-order merge,
      • bigger buffers,
      • fewer seeks.
  • How can we create initial runs that are twice as
    large as the number of records that we can hold
    in memory?
      • => Replacement selection

27
Replacement Selection
  • Idea:
      • Always select the key from memory that has the
        lowest value.
      • Output the key.
      • Replace it with a new key from the input list.

28
  • Input (front of input at the right):
    21, 67, 12, 5, 47, 16
  • Trace:

    Remaining input    Memory (P = 3)    Output run
    21, 67, 12         5   47  16        _
    21, 67             12  47  16        5
    21                 67  47  16        12, 5
    _                  67  47  21        16, 12, 5
    _                  67  47  _         21, 16, 12, 5
    _                  67  _   _         47, 21, 16, 12, 5
    _                  _   _   _         67, 47, 21, 16, 12, 5

  • What about a key arriving in memory too late to
    be output into its proper position? => Use a
    second heap.

29
Trace of replacement selection
  • Input (P = 3; front of input at the right):
    33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16

30
Replacement Selection with two disks
  • Algorithm
      • Step 1: construct a heap (the primary heap) in
        memory while reading records block by block from
        the first disk drive.
      • Step 2: as we move records from the heap to the
        output buffer, we replace those records with records
        from the input buffer.
          • If some new records have keys smaller than those
            already written out, a secondary heap is created
            for them.
          • The other new records are inserted into the
            primary heap.
      • Step 3: repeat step 2 as long as there are records
        left in the primary heap and there are records to be
        read.
      • Step 4: when the primary heap is empty, make the
        secondary heap the primary heap and repeat steps
        1-3.
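
A minimal in-memory sketch of replacement selection with a primary and a secondary heap follows. It is an illustration under stated assumptions: keys are integers, `input` is given in the order the records are read, `capacity` plays the role of P, the two heaps are `std::priority_queue` objects, and block-wise disk I/O is not modelled. The name `replacementSelection` is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Splits `input` (in read order) into sorted runs using replacement selection
// with room for `capacity` records in memory.
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t capacity) {
    using MinQueue = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    std::vector<std::vector<int>> runs;
    if (capacity == 0) return runs;   // no memory, no runs

    MinQueue primary, secondary;
    std::size_t next = 0;

    // Step 1: fill memory with the first `capacity` records.
    while (next < input.size() && primary.size() < capacity) {
        primary.push(input[next++]);
    }

    while (!primary.empty()) {
        std::vector<int> run;
        // Steps 2 and 3: output the smallest key and replace it from the input.
        while (!primary.empty()) {
            int smallest = primary.top();
            primary.pop();
            run.push_back(smallest);
            if (next < input.size()) {
                int incoming = input[next++];
                if (incoming < smallest) {
                    secondary.push(incoming);   // too late for the current run
                } else {
                    primary.push(incoming);
                }
            }
        }
        runs.push_back(run);
        // Step 4: the secondary heap becomes the primary heap for the next run.
        primary.swap(secondary);
    }
    return runs;
}
```

With P = 3 and the 13-key trace input read from its front (16, 47, 5, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33), the sketch produces two runs: 5, 12, 16, 21, 47, 67 and 7, 14, 17, 18, 24, 33, 58. The first run is twice the memory size, which is the effect the slides aim for.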