1
The four main classes of fingerprints
Johannes Purkinje (1787-1869)
(Figure: the four main classes of fingerprints and the distribution of classes in a Caucasian population)
2
No normal class on Thursday! Instead I will be in my office, and the TAs will
be in the TA lab, to answer questions about any section of the material
covered since day one.
Congratulations to Foula, who passed her qualifying exams with the highest grade!
3
The Henry Classification System
The sum of the values of the white squares that
contain a Whorl (plus one) is the numerator of
the primary classification.
The sum of the values of the dark squares that
contain a Whorl (plus one) is the denominator of
the primary classification.
(Figure: the ten finger squares, labeled U W A U A A U U W U)
4
The Henry Classification System II
(Figure: the primary classification written as a fraction, 10 over 4)
You cannot simplify the classes by canceling above and below the fraction
bar: read 10/4 as "ten over four", not as "ten fourths".
5
The Henry Classification System III
How many different classifications are there?
Each finger position has exactly 2 possibilities: either it contains a
Whorl or it does not. There are 5 positions in the numerator, so there are
2^5 possible numerators. The same argument holds for the denominator. So
the number of possible numerators times the number of possible denominators
is 2^5 × 2^5 = 2^10 = 1024.
(Photo: some of the 1,024 bins in Scotland Yard's Fingerprint Bureau during
the 1950s)
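The primary classification is simple enough to compute directly. Below is a minimal Python sketch, assuming the standard Henry point values (a whorl in fingers 1-2 scores 16, in 3-4 scores 8, in 5-6 scores 4, in 7-8 scores 2, in 9-10 scores 1, with even-numbered fingers feeding the numerator and odd-numbered fingers the denominator); the white/dark squares on the slide presumably encode this even/odd split.

```python
# A minimal sketch of the Henry primary classification, assuming the
# standard point values: whorls in fingers 1-2 score 16, 3-4 score 8,
# 5-6 score 4, 7-8 score 2, 9-10 score 1; even-numbered fingers feed
# the numerator, odd-numbered fingers the denominator, each plus one.

WHORL_VALUE = [16, 16, 8, 8, 4, 4, 2, 2, 1, 1]  # fingers 1..10

def henry_primary(patterns):
    """patterns: 10 codes; 'W' = whorl ('U' = ulnar loop, 'A' = arch)."""
    num = 1 + sum(WHORL_VALUE[i] for i in range(10)
                  if patterns[i] == 'W' and (i + 1) % 2 == 0)
    den = 1 + sum(WHORL_VALUE[i] for i in range(10)
                  if patterns[i] == 'W' and (i + 1) % 2 == 1)
    return num, den  # read "num over den"; never simplify the fraction

# The slide's example sequence; -> (17, 2) under this value assignment.
print(henry_primary(list("UWAUAAUUWU")))
```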
6
(No Transcript)
7
Static Hashing
  • A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
  • In a hash file organization we obtain the bucket of a record directly from its search-key value, using a hash function.
  • A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
  • The hash function is used to locate records for access, insertion, and deletion.
  • Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

8
Static Hashing
With hash based indexing, we assume that we have
a function h, which tells us where to place any
given record.
(Diagram: the hash function h maps each record's search key to one of Data
Pages 1 through N)
9
Static Hashing
The hash function usually takes the last k bits of the search key,
multiplies it by a, and adds b to get some value. The id of the page to
access is identified by page_number = h(value) mod N. The values of a and b
must be carefully selected, and N should be prime. We desire the hash
function to be uniform and random (explained later).
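As a concrete illustration, here is a minimal Python sketch of such a function; the constants a, b, k, and N are hypothetical choices, not values from the slides.

```python
# A minimal sketch of the static hash function described above.
# a, b, k, and N are hypothetical illustrative choices.

N = 97          # number of primary data pages; prime, as suggested above
a, b = 31, 17   # multiplier and offset; must be selected carefully
k = 16          # use the last k bits of the search key

def page_number(search_key: int) -> int:
    low_bits = search_key & ((1 << k) - 1)   # last k bits of the key
    h_value = a * low_bits + b               # h(value)
    return h_value % N                       # page_number = h(value) mod N

print(page_number(123456))                   # a page id in 0..N-1
```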
10
Static Hashing
As with ISAM, insertion can cause overflow. The
solution is to create chains of overflow pages.
(Diagram: as before, h maps records to Data Pages 1 through N, but Data
Page 3 now has a chained overflow page)
11
Note about Notation
Our book, and more generally the world, is inconsistent about notation. The
page number that we should place our record in is sometimes given as
page_number = h(value) mod N, and sometimes as page_number = h(value).
There is nothing deep going on here; it is just that the mod N has been
moved inside the notation in the second case.
12
Hash Functions
  • The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
  • An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
  • An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
  • Typical hash functions perform computation on the internal binary representation of the search key.
  • For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets returned, as sketched below.
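A direct sketch of that string hash (my wording of the bullet, not code from the slides):

```python
# Sum the character codes of the string and reduce modulo the number of
# buckets. Simple, but not very uniform in practice: anagrams always collide.

def string_hash(key: str, num_buckets: int) -> int:
    return sum(ord(c) for c in key) % num_buckets

print(string_hash("Brighton", 10))   # a bucket in 0..9
```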

13
Handling of Bucket Overflows
  • Bucket overflow can occur because of:
    • insufficient buckets, or
    • skew in the distribution of records. Skew can occur for two reasons:
      • multiple records have the same search-key value, or
      • the chosen hash function produces a non-uniform distribution of key values.
  • Although the probability of bucket overflow can be reduced, it cannot be eliminated.

14
Deficiencies of Static Hashing
  • In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
  • Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
  • If the file size at some point in the future is anticipated and the number of buckets allocated accordingly, a significant amount of space will be wasted initially.
  • If the database shrinks, again space will be wasted.
  • One option is periodic re-organization of the file with a new hash function, but this is very expensive.
  • These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.

15
Deficiencies of Static Hashing
  • These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.
  • This is called dynamic hashing.
  • There are several types of dynamic hashing; we will learn about extendible hashing and linear hashing.
  • First we must learn about hash indices.

16
Hash Indices
  • Hashing can be used not only for file organization, but also for index-structure creation.
  • A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
  • Strictly speaking, hash indices are always secondary indices:
  • if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.
  • However, we use the term hash index to refer to both secondary index structures and hash-organized files.

17
Example of Hash Index
(Diagram: a hash index mapping search-key values through h into buckets of
record pointers)
18
Extendable Hashing (Formal)
  • Extendable hashing is one form of dynamic hashing.
  • The hash function generates values over a large range, typically b-bit integers, with b = 32.
  • At any time, use only a prefix (or suffix) of the hash value to index into a table of bucket addresses.
  • Let the length of the prefix (or suffix) be i bits, 0 ≤ i ≤ 32.
  • Bucket address table size = 2^i. Initially i = 0.
  • The value of i grows and shrinks as the size of the database grows and shrinks.
  • Multiple entries in the bucket address table may point to the same bucket.
  • Thus, the actual number of buckets is ≤ 2^i.
  • The number of buckets also changes dynamically due to coalescing and splitting of buckets.

19
Use of Extendable Hash Structure (Formal)
  • Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values on the first i_j bits.
  • To locate the bucket containing search-key K_j (see the sketch below):
  • 1. Compute h(K_j) = X.
  • 2. Use the first i high-order bits of X as a displacement into the bucket address table, and follow the pointer to the appropriate bucket.
  • To insert a record with search-key value K_j:
  • follow the same procedure as look-up and locate the bucket, say j.
  • If there is room in bucket j, insert the record in the bucket.
  • Else the bucket must be split and the insertion re-attempted (next slide).
  • Overflow buckets are used instead in some cases (we will see shortly).
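A minimal lookup sketch for the formal scheme above, assuming 32-bit hash values, the first i high-order bits as the index, and a plain Python list as the bucket address table:

```python
B = 32  # hash values are b-bit integers, with b = 32

def lookup_bucket(table, i, k_j, h):
    """table: bucket address table of size 2**i; h: the hash function."""
    x = h(k_j)                 # 1. compute h(K_j) = X
    index = x >> (B - i)       # 2. first i high-order bits of X
    return table[index]        #    follow the pointer to the bucket
```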

20
Updates in Extendable Hash Structure (Formal)
To split a bucket j when inserting a record with search-key value K_j:
  • If i > i_j (more than one pointer to bucket j):
  • allocate a new bucket z, and set i_j and i_z to the old i_j + 1.
  • make the second half of the bucket address table entries pointing to j point to z.
  • remove and reinsert each record in bucket j.
  • recompute the new bucket for K_j and insert the record in that bucket (further splitting is required if the bucket is still full).
  • If i = i_j (only one pointer to bucket j):
  • increment i and double the size of the bucket address table.
  • replace each entry in the table by two entries that point to the same bucket.
  • recompute the new bucket address table entry for K_j. Now i > i_j, so use the first case above.

21
Updates in Extendable Hash Structure (Cont.)
(Formal)
  • When inserting a value, if the bucket is full after several splits (that is, i reaches some limit b), create an overflow bucket instead of splitting the bucket address table further.
  • To delete a key value:
  • locate it in its bucket and remove it.
  • The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).
  • Coalescing of buckets can be done (a bucket can coalesce only with a "buddy" bucket having the same value of i_j and the same i_j − 1 prefix, if it is present).
  • Decreasing the bucket address table size is also possible.
  • Note: decreasing the bucket address table size is an expensive operation and should be done only if the number of buckets becomes much smaller than the size of the table.

22
Extendible Hashing (Intuition)
  • Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the number of buckets?
  • Reading and writing all pages is expensive!
  • Idea: use a directory of pointers to buckets; double the number of buckets by doubling the directory, splitting just the bucket that overflowed!
  • The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
  • The trick lies in how the hash function is adjusted!

23
Example
(Diagram: a directory of size 4 with global depth 2; entries 00, 01, 10,
and 11 point to data pages Bucket A: 16, 4, 12, 32; Bucket B: 13, 1, 21, 5;
Bucket C: 10; and Bucket D: 15, 7, 19; each bucket has local depth 2.)
  • The directory is an array of size 4.
  • To find the bucket for r, take the last global-depth bits of h(r); we denote r by h(r).
  • If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
  • Insert: if the bucket is full, split it (allocate a new page, re-distribute).
  • If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

24
Insert h(r) = 6 (The Easy Case)
6 = binary 00110. The last two bits, 10, point to Bucket C, which has room,
so 6 is simply added to Bucket C. (Diagram: the before and after pictures
are identical except that Bucket C now holds 10 and 6; all local depths,
the global depth, and the directory are unchanged.)
25
Insert h(r) = 20 (Causes Doubling)
20 = binary 10100. The last two bits, 00, point to Bucket A (16, 4, 12,
32), which is full. Bucket A is split: 32 and 16 stay in Bucket A, while
20, 4, and 12 go to a new Bucket A2, the "split image" of Bucket A. Because
Bucket A's local depth equaled the global depth, the directory must also
double. (Diagram: the full Bucket A, its split image A2 holding 20, 4, 12,
and the directory about to double.)
26
Insert h(r) = 20 (Causes Doubling), continued
(Diagram: after the doubling, the global depth becomes 3 and the directory
has entries 000 through 111. Bucket A, now holding 32 and 16, and its split
image A2, holding 20, 4, and 12, have local depth 3 and one directory
pointer each; Buckets B, C, and D keep local depth 2 and are pointed to by
two directory entries each.)
27
Points to Note
  • 20 = binary 10100. The last 2 bits (00) tell us r belongs in A or A2. The last 3 bits are needed to tell which.
  • Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
  • Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
  • When does a bucket split cause directory doubling?
  • Before the insert, local depth of the bucket = global depth. The insert causes the local depth to become > global depth; the directory is doubled by copying it over and fixing the pointer to the split-image page. (Use of least-significant bits enables efficient doubling via copying of the directory!) An insert sketch follows below.
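A compact insert sketch tying these points together, using least-significant-bit indexing as in the example. The bucket capacity and class layout are my own illustrative choices, and overflow buckets for the degenerate all-duplicates case are omitted.

```python
CAPACITY = 4  # data entries per bucket (primary page); illustrative

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # bits used to place entries here
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket(0)]

    def _index(self, h):
        return h & ((1 << self.global_depth) - 1)  # last global-depth bits

    def insert(self, h):
        bucket = self.directory[self._index(h)]
        if len(bucket.entries) < CAPACITY:
            bucket.entries.append(h)
            return
        if bucket.local_depth == self.global_depth:
            # insert would push local depth past global depth:
            # double the directory by copying it over (the LSB trick)
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # split the full bucket on its next bit and fix the pointers
        bucket.local_depth += 1
        bit = 1 << (bucket.local_depth - 1)
        image = Bucket(bucket.local_depth)          # the "split image"
        old = bucket.entries
        bucket.entries = [e for e in old if not (e & bit)]
        image.entries = [e for e in old if e & bit]
        for j in range(len(self.directory)):
            if self.directory[j] is bucket and (j & bit):
                self.directory[j] = image
        self.insert(h)   # re-attempt; may split again if still full
```

For instance, inserting 32, 16, 4, 12 and then 20 into an empty structure ends with 32, 16 and 20, 4, 12 in separate local-depth-3 buckets, mirroring Buckets A and A2 above.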

28
Directory Doubling
  • Why use least significant bits in directory?
  • Allows for doubling via copying!

(Diagram: doubling a 2-bit directory (00, 01, 10, 11) to 3 bits, with
6 = 110 as the example entry. Using least-significant bits, the new
directory 000-111 is just the old directory copied twice, so entry b
simply becomes entries 0b and 1b. Using most-significant bits, the old
entries would have to be interleaved instead, so copying does not work.)
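A two-line illustration of the copying trick (the bucket names are placeholders):

```python
old_directory = ["A", "B", "C", "D"]             # global depth 2
new_directory = old_directory + old_directory    # depth 3: a pure copy
assert new_directory[0b010] is new_directory[0b110]  # both still "C"

# with most-significant bits the entries would have to be interleaved:
msb_directory = [b for b in old_directory for _ in (0, 1)]
```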
29
Comments on Extendible Hashing
  • If the directory fits in memory, an equality search is answered with one disk access; else two.
  • A 100MB file with 100 bytes/record and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements (about 40 entries fit per page, giving 25,000 pages); chances are high that the directory will fit in memory.
  • The directory grows in spurts, and, if the distribution of hash values is skewed, the directory can grow large.
  • Delete: if removal of a data entry makes a bucket empty, it can be merged with its split image. If each directory element points to the same bucket as its split image, we can halve the directory (this is rare in practice).

30
(No Transcript)
31
Extendable Hashing vs. Other Schemes
  • Benefits of extendable hashing
  • Hash performance does not degrade with growth of
    file
  • Minimal space overhead
  • Disadvantages of extendable hashing
  • Extra level of indirection to find desired record
  • Bucket address table may itself become very big
    (larger than memory)
  • Need a tree structure to locate desired record in
    the structure!
  • Changing size of bucket address table is an
    expensive operation
  • Linear hashing is an alternative mechanism which
    avoids these disadvantages at the possible cost
    of more bucket overflows

32
Linear Hashing
  • This is another dynamic hashing scheme, an alternative to Extendible Hashing.
  • LH handles the problem of long overflow chains without using a directory.
  • Idea: use a family of hash functions h0, h1, h2, ... (sketched below).
  • h_i(key) = h(key) mod (2^i N), where N = the initial number of buckets.
  • h is some hash function (its range is not 0 to N−1).
  • If N = 2^(d0), for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
  • h_{i+1} doubles the range of h_i (similar to directory doubling).
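A minimal sketch of the family, assuming h is Python's built-in hash purely for illustration:

```python
N = 4  # initial number of buckets, N = 2**d0 with d0 = 2

def h_i(key, i):
    """h_i(key) = h(key) mod (2**i * N): the last d0 + i bits of h(key)."""
    return hash(key) % (2 ** i * N)

# h_{i+1} doubles the range of h_i and refines it consistently:
key = "Mianus"
assert h_i(key, 1) % N == h_i(key, 0)
```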

33
Linear Hashing (Contd.)
  • The directory is avoided in LH by using overflow pages, and choosing the bucket to split round-robin.
  • Splitting proceeds in rounds. A round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next−1 have been split; Next to N_R have yet to be split.
  • The current round number is Level.
  • Search: to find the bucket for data entry r, compute h_Level(r):
  • if h_Level(r) is in the range Next to N_R, r belongs here;
  • else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out. (See the sketch below.)
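The search rule as a small function (N is the initial number of buckets; Next and Level are as defined above):

```python
def bucket_for(hashed_key, level, next_, n):
    """Map h(key) to a bucket number in a linear hashed file."""
    n_r = (2 ** level) * n          # N_R: buckets at the start of round
    b = hashed_key % n_r            # h_Level(key)
    if b >= next_:                  # in range Next..N_R - 1: not yet split
        return b
    # bucket b was split earlier this round: h_{Level+1} decides between
    # b and its split image b + N_R
    return hashed_key % (2 * n_r)
```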

34
Linear Hashing (background)
Insert 20.
We have seen what it means to split a bucket. (Diagram: before, one
local-depth-2 bucket holding 32, 16, 4, 12; after, two local-depth-3
buckets, one holding 32, 16 and the other 4, 12, 20.)
We have seen what it means to add an overflow page to a bucket. (Diagram:
the full local-depth-2 bucket is left alone and 20 goes to a chained
overflow page.)
35
Linear Hashing (background)
Insert 20.
It is also meaningful to split a bucket together with its overflow page.
(Diagram: before, a local-depth-2 bucket with an overflow page holding 20;
after, two local-depth-3 buckets, one holding 32, 16 and the other 4, 12,
20.) Note that, as before, the number of bits pointing to the bucket must
increase.
36
Why split the first bucket, and add an overflow page to the 10, 6, 2, 14
bucket? Because splits proceed round-robin: the bucket actually split is
the one marked by Next, while the overflowing bucket just gets an overflow
page. When the 10, 6, 2, 14 bucket's own turn comes, its entries will
separate, because 6 and 14, and 10 and 2, differ in the 3rd bit. (Diagram:
Next advances past each split bucket; splitting stops when Next reaches the
end of the original buckets, and a new round begins.)
37
A round is about to end. (Diagram: Next points at the last unsplit bucket
of the round; once it is split, the "stop when you get here" point is
reached, Next resets to bucket 0, and a new round begins.)
38
Overview of LH File
  • In the middle of a round.

(Diagram: buckets 0 to Next−1 have been split in this round; Next marks the
bucket to be split; the buckets that existed at the beginning of this round
form the range of h_Level; after them come the "split image" buckets
created, through splitting of other buckets, in this round. If
h_Level(search key value) falls among the buckets already split this round,
h_{Level+1}(search key value) must be used to decide if the entry is in a
"split image" bucket.)
39
(Diagram: successive snapshots of the file as Next advances round-robin
over the h_Level range, marking at each step the bucket to be split.)
40
Linear Hashing (Contd.)
  • Insert: find the bucket by applying h_Level / h_{Level+1}.
  • If the bucket to insert into is full:
  • add an overflow page and insert the data entry;
  • (maybe) split the Next bucket and increment Next.
  • Any criterion can be chosen to trigger a split.
  • Since buckets are split round-robin, long overflow chains don't develop!
  • Doubling of the directory in Extendible Hashing is similar; here, the switching of hash functions is implicit in how the # of bits examined is increased. (A sketch of insert and split follows.)

41
Example of Linear Hashing
  • On a split, h_{Level+1} is used to re-distribute entries.

(Diagram, Level = 0, N = 4; each value shown denotes a data entry r with
that h(r), e.g. 5 is a data entry with h(r) = 5. Before: Next = 0; primary
pages 00: 32, 44, 36; 01: 9, 25, 5; 10: 14, 18, 10, 30; 11: 31, 35, 11, 7.
The h_0 column shows the last 2 bits and the h_1 column the last 3; this
info is for illustration only, while the pages are the actual contents of
the linear hashed file. After inserting 43 = binary 101011: bucket 11 was
full, so 43 went to an overflow page of bucket 11, and bucket 0 (Next) was
split, its entries re-distributed by h_1 between bucket 000 (32) and the
new bucket 100 (44, 36); Next = 1.)
42
Example: End of a Round
(Diagram. Before, Level = 0 and Next = 3: primary pages 00: 32; 01: 9, 25;
10: 66, 18, 10, 34; 11: 31, 35, 7, 11 with overflow page 43; split images
so far 100: 44, 36; 101: 5, 37, 29; 110: 14, 30, 22. Inserting 50 = binary
110010 overflows bucket 10, so bucket 3 (Next) is split, re-distributing
31, 35, 7, 11, 43 between 011 (43, 35, 11) and 111 (31, 7). That was the
last unsplit bucket, so the round ends: Level = 1, Next = 0. After: 000:
32; 001: 9, 25; 010: 66, 18, 10, 34 with overflow 50; 011: 43, 35, 11;
100: 44, 36; 101: 5, 37, 29; 110: 14, 30, 22; 111: 31, 7.)
43
LH Described as a Variant of EH
  • The two schemes are actually quite similar:
  • begin with an EH index where the directory has N elements.
  • Use overflow pages, and split buckets round-robin.
  • The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, ... are the same. So, we need only create directory element N, which differs from 0, now.
  • When bucket 1 splits, create directory element N+1, etc.
  • So, the directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i-th is easy), we actually don't need a directory! Voilà, LH.

44
Comparison of Ordered Indexing (Trees) and Hashing
  • Cost of periodic re-organization
  • Relative frequency of insertions and deletions
  • Is it desirable to optimize average access time
    at the expense of worst-case access time?
  • Expected type of queries
  • Hashing is generally better at retrieving records
    having a specified value of the key.
  • If range queries are common, ordered indices are
    to be preferred.

45
The red dotted slides that follow contain a nice worked example of
extendible hashing, starting with an empty database.
46
General Extendable Hash Structure
In this structure, i_2 = i_3 = i, whereas i_1 = i − 1.
47
Use of Extendable Hash Structure Example
Initial hash structure, bucket size = 2
48
Example (Cont.)
  • Hash structure after insertion of one Brighton
    and two Downtown records

49
Example (Cont.)
Hash structure after insertion of Mianus record
50
Example (Cont.)
Hash structure after insertion of three
Perryridge records
51
Example (Cont.)
  • Hash structure after insertion of Redwood and
    Round Hill records