Title: CSC 172 DATA STRUCTURES
CSC 172 DATA STRUCTURES
SETS and HASHING
- Unadvertised in-store special: SETS!
- In Java, see Weiss 4.8
- Simple idea: the Characteristic Vector
- HASHING... the main event.
Representation of Sets
- List
  - Simple; O(n) dictionary operations
- Binary Search Trees
  - O(log n) average time
  - Range queries, sorting
- Characteristic Vector
  - O(1) dictionary ops, but limited to small sets
- Hash Table
  - O(1) average for dictionary ops
  - Tricky to expand, no range queries
Characteristic Vectors
- Boolean strings whose positions correspond to the members of some fixed universal set
- A 1 in a location means that the element is in the set
- A 0 means that it is not (a sketch follows below)
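A minimal Java sketch of the idea, using java.util.BitSet as the characteristic vector over the universal set 0..63; the class and variable names here are illustrative, not from the slides:

    import java.util.BitSet;

    public class CharVecDemo {
        public static void main(String[] args) {
            // The BitSet is the characteristic vector; bit i records whether i is in the set.
            BitSet s = new BitSet(64);
            s.set(3);                        // insert 3: bit 3 becomes 1
            s.set(17);                       // insert 17
            System.out.println(s.get(3));    // lookup: true, 3 is in the set
            System.out.println(s.get(5));    // lookup: false, 5 is not
            s.clear(3);                      // delete 3: bit 3 becomes 0
            System.out.println(s.get(3));    // false
        }
    }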
MUSIC THEORY
- A chord is a set of notes played at the same time.
- Represented by a 12-bit vector called a pitch class
  - B, A#, A, G#, G, F#, F, E, D#, D, C#, C
- 000010010001 represents C major
- 000010001001 represents C minor
- Rotation is transposition
- Bit reversal is inversion (rotation is sketched below)
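A small Java sketch of the 12-bit pitch-class vector, with transposition done as a rotation of the bits; the bit order follows the slide (rightmost bit = C, leftmost = B), and the names are illustrative:

    public class PitchClassDemo {
        // Bit 0 = C, bit 1 = C#, ..., bit 11 = B, matching the slide's rightmost-is-C layout.
        static final int C_MAJOR = 0b000010010001;   // C, E, G
        static final int C_MINOR = 0b000010001001;   // C, D#, G

        // Transpose a chord up by n semitones: rotate the 12-bit vector left by n.
        static int transpose(int chord, int n) {
            n %= 12;
            return ((chord << n) | (chord >>> (12 - n))) & 0xFFF;
        }

        public static void main(String[] args) {
            // C major up 2 semitones is D major (D, F#, A): bits 2, 6, 9.
            System.out.println(Integer.toBinaryString(transpose(C_MAJOR, 2)));
        }
    }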
UNIX file privileges
- {user, group, others} x {read, write, execute}
- 9 possible privileges
- Type ls -l on UNIX
    total 142
    -rw-rw-r--  1 pawlicki  none     76 Jun 20  2000 PKG416.desc
    -rw-rw-r--  1 pawlicki  none  28906 Jun 20  2000 PKG416.pdf
    -rw-rw-r--  1 pawlicki  none   1849 Jun 20  2000 let.1
    -rw-rw-r--  1 pawlicki  none      0 Apr  2  1303 out
    -rw-rw-r--  1 pawlicki  none  39891 Jun 20  2000 stapp.uu
UNIX files
- The order is rwx for each of user (owner), group, and others
- So, a protection mode of 110100000 means that the owner may read and write (but not execute), the group can only read, and others cannot even read (see the sketch below)
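A quick Java sketch of testing individual privilege bits, assuming the 9-bit mode is kept in an int with the user bits highest, as in the slide's 110100000 example (the constant names are illustrative):

    public class UnixModeDemo {
        // Bit layout, high to low: user rwx, group rwx, others rwx.
        static final int USER_READ   = 0b100000000;
        static final int USER_WRITE  = 0b010000000;
        static final int GROUP_READ  = 0b000100000;
        static final int OTHERS_READ = 0b000000100;

        public static void main(String[] args) {
            int mode = 0b110100000;                          // the slide's example
            System.out.println((mode & USER_WRITE) != 0);    // true: owner may write
            System.out.println((mode & GROUP_READ) != 0);    // true: group may read
            System.out.println((mode & OTHERS_READ) != 0);   // false: others may not even read
        }
    }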
GAMBLING
- A deck has 52 cards
- 2C, 2H, 2S, 2D, 3C, .... KD, AC, AH, AS, AD
- Represent a hand as a vector of 52 bits
- 0000000000000000000000000000000000000000000000000101 is a pair of aces
- In Texas Hold'em everyone gets two hole cards and 5 board cards
- We can use bitwise operations to find hands (a sketch follows below)
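A rough Java sketch of the hand-as-bit-vector idea, assuming bit 0 is 2C and bit 51 is AD so the four aces occupy the top four bits (the mask and names are illustrative):

    public class CardHandDemo {
        // The four aces (AC, AH, AS, AD) are the last four cards in the slide's
        // ordering, so they sit in bits 48..51 of the 52-bit hand.
        static final long ACES = 0b1111L << 48;

        public static void main(String[] args) {
            // The slide's example hand: AH and AD are set.
            long hand = (1L << 49) | (1L << 51);
            // A bitwise AND plus a population count tells how many aces we hold.
            int aces = Long.bitCount(hand & ACES);
            System.out.println(aces >= 2);   // true: a pair of aces
        }
    }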
CV advantages
- If the universal set is small, sets can be represented by bits packed 32 to a word
- Insert, delete, and lookup are O(1) on the proper bit
- Union, intersection, and difference are implemented on a word-by-word basis (sketched below)
  - O(m), where m is the size of the universal set
  - Small constant factor (1/32)
  - Fast machine operations
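A minimal sketch of word-by-word set operations on a packed characteristic vector; this version packs 64 bits per long rather than 32 per word, and the names are illustrative:

    public class PackedSetDemo {
        // Union and intersection of two packed characteristic vectors, one word at a time.
        static long[] union(long[] a, long[] b) {
            long[] r = new long[a.length];
            for (int i = 0; i < a.length; i++) r[i] = a[i] | b[i];
            return r;
        }

        static long[] intersection(long[] a, long[] b) {
            long[] r = new long[a.length];
            for (int i = 0; i < a.length; i++) r[i] = a[i] & b[i];
            return r;
        }

        public static void main(String[] args) {
            long[] s = new long[2], t = new long[2];
            s[0] |= 1L << 3;  t[0] |= 1L << 3;   // both sets contain 3
            s[1] |= 1L << 5;                     // s also contains 69 (= 64 + 5)
            System.out.println(Long.bitCount(union(s, t)[1]));         // 1: 69 is in the union
            System.out.println(Long.bitCount(intersection(s, t)[1]));  // 0: but not in the intersection
        }
    }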
Hashing
- A cool way to get from an element x to the place where x can be found
- An array 0..B-1 of buckets
  - A bucket contains a list of set elements
  - B = number of buckets
- A hash function that takes potential set elements and quickly produces a random integer in 0..B-1
Example
- If the set elements are integers, then the simplest/best hash function is usually h(x) = x % B, i.e. h(x) = x - B*(x/B) with integer division (never negative)
- Suppose B = 6 and we wish to store the integers
  - 70, 53, 99, 94, 83, 76, 64, 30
- They belong in the buckets 4, 5, 3, 4, 5, 4, 4, and 0
- Note: if B = 7, the buckets are 0, 4, 1, 3, 6, 6, 1, 2 (checked in the snippet below)
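A tiny Java check of the slide's bucket assignments (illustrative code):

    public class BucketDemo {
        public static void main(String[] args) {
            int[] keys = {70, 53, 99, 94, 83, 76, 64, 30};
            for (int b : new int[] {6, 7}) {
                StringBuilder line = new StringBuilder("B = " + b + ": ");
                for (int k : keys) line.append(k % b).append(' ');   // h(x) = x % B
                System.out.println(line);
            }
            // Prints  B = 6: 4 5 3 4 5 4 4 0   and   B = 7: 0 4 1 3 6 6 1 2
        }
    }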
Pitfalls of Hash Function Selection
- We want to get a uniform distribution of elements into buckets
- Beware of data patterns that cause a non-uniform distribution
Example
- If the integers were all even, then B = 6 would cause only buckets 0, 2, and 4 to fill
- If we hashed words in the UNIX dictionary into 10 buckets by word length, then about 20% would go into bucket 7
Dictionary Operations
- Lookup
  - Go to the head of bucket h(x)
  - Search the bucket list to see if x is in the bucket
- Insertion: append to the bucket list if x is not found
- Deletion: list deletion from the bucket list (all three are sketched below)
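A bare-bones Java sketch of the three operations with separate chaining; the table size, key type, and names are illustrative:

    import java.util.LinkedList;

    public class ChainedHashSetDemo {
        static final int B = 6;                              // number of buckets
        @SuppressWarnings("unchecked")
        static LinkedList<Integer>[] table = new LinkedList[B];
        static { for (int i = 0; i < B; i++) table[i] = new LinkedList<>(); }

        static int h(int x) { return Math.floorMod(x, B); }

        // Lookup: go to bucket h(x) and walk its list.
        static boolean lookup(int x) { return table[h(x)].contains(x); }

        // Insertion: append to the bucket list if not found.
        static void insert(int x) { if (!lookup(x)) table[h(x)].add(x); }

        // Deletion: ordinary list deletion from the bucket list.
        static void delete(int x) { table[h(x)].remove(Integer.valueOf(x)); }

        public static void main(String[] args) {
            for (int k : new int[] {70, 53, 99, 94}) insert(k);
            System.out.println(lookup(99));   // true
            delete(99);
            System.out.println(lookup(99));   // false
        }
    }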
Analysis
- If we pick B to be near n, the number of elements in the set, then the average list is O(1) long
- Thus, dictionary ops take O(1) time on average
- Worst case: all elements go into one bucket
  - O(n)
Managing Hash Table Size
- If n gets as high as 2B, create a new hash table with 2B buckets
- Rehash every element into the new table (sketched below)
  - O(n) time total
- There were at least n inserts since the last rehash
  - All these inserts took time O(n)
  - Thus, we amortize the cost of rehashing over the inserts since the last rehash
  - A constant factor, at worst
- So, even with rehashing we get O(1) amortized time ops
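A minimal sketch of the doubling/rehash step for a separate-chaining table (illustrative names; it grows when n reaches 2B, as on the slide):

    import java.util.LinkedList;

    public class RehashDemo {
        static LinkedList<Integer>[] table = newTable(4);
        static int n = 0;                                    // number of stored elements

        @SuppressWarnings("unchecked")
        static LinkedList<Integer>[] newTable(int b) {
            LinkedList<Integer>[] t = new LinkedList[b];
            for (int i = 0; i < b; i++) t[i] = new LinkedList<>();
            return t;
        }

        static void insert(int x) {
            if (n >= 2 * table.length) rehash();             // n reached 2B: double the table
            table[Math.floorMod(x, table.length)].add(x);
            n++;
        }

        static void rehash() {
            LinkedList<Integer>[] old = table;
            table = newTable(2 * old.length);                // 2B buckets
            for (LinkedList<Integer> bucket : old)           // re-insert everything: O(n) total
                for (int x : bucket)
                    table[Math.floorMod(x, table.length)].add(x);
        }

        public static void main(String[] args) {
            for (int k = 0; k < 20; k++) insert(k);
            System.out.println(table.length);                // 16: the table doubled twice
        }
    }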
Collisions
- A collision occurs when two values in the set hash to the same value
- There are several ways to deal with this
  - Chaining (using a linked list or some secondary structure)
  - Open Addressing
    - Double hashing
    - Linear Probing
Chaining
- Very efficient time-wise
- Other approaches use less space
Open Addressing
- When a collision occurs, if the table is not full, find an available space
- Linear Probing
- Quadratic Probing
- Double Hashing
Linear Probing
- If the current location is occupied, try the next table location

    LinearProbingInsert(K)
        if (table is full) error
        probe = h(K)
        while (table[probe] is occupied)
            probe = (probe + 1) mod M
        table[probe] = K

- Walk along the table until an empty spot is found
- Uses less memory than chaining (no links)
- Takes more time than chaining (long walks)
- Deleting is a pain (mark a slot as having been deleted; see the sketch below)
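A compact Java sketch of linear-probing insert, lookup, and lazy delete: Integer slots where null means never used and a DELETED sentinel stands in for the "mark a slot" idea above (names are illustrative, and insert assumes the table is not full):

    public class LinearProbeDemo {
        static final int M = 13;
        static final Integer DELETED = Integer.MIN_VALUE;    // tombstone marker
        static Integer[] table = new Integer[M];             // null = never used

        static int h(int k) { return Math.floorMod(k, M); }

        static void insert(int k) {                          // assumes the table is not full
            int probe = h(k);
            while (table[probe] != null && !table[probe].equals(DELETED))
                probe = (probe + 1) % M;                     // try the next location
            table[probe] = k;
        }

        static boolean lookup(int k) {
            int probe = h(k);
            while (table[probe] != null) {                   // stop at a never-used slot
                if (table[probe].equals(k)) return true;
                probe = (probe + 1) % M;
            }
            return false;
        }

        static void delete(int k) {
            int probe = h(k);
            while (table[probe] != null) {
                if (table[probe].equals(k)) { table[probe] = DELETED; return; }  // mark, don't empty
                probe = (probe + 1) % M;
            }
        }

        public static void main(String[] args) {
            for (int k : new int[] {18, 41, 22, 59, 32, 31, 73}) insert(k);
            System.out.println(lookup(31));   // true: 31 ended up in slot 8
        }
    }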
Linear Probing Example
- h(K) = K % 13
- Insert 18, 41, 22, 59, 32, 31, 73
- h(K) = 5, 2, 9, 7, 6, 5, 8
- 18, 41, 22, 59, and 32 go straight into slots 5, 2, 9, 7, and 6
- 31 hashes to 5 (taken by 18), probes slots 6 and 7 (also taken), and lands in slot 8
- 73 hashes to 8 (taken by 31), probes slot 9 (taken by 22), and lands in slot 10
Double Hashing
- If the current location is occupied, try another table location
- Use two hash functions
- If M is prime, eventually every location will be examined

    DoubleHashInsert(K)
        if (table is full) error
        probe = h1(K)
        offset = h2(K)
        while (table[probe] is occupied)
            probe = (probe + offset) mod M
        table[probe] = K

- Many of the same (dis)advantages as linear probing
- Distributes keys more evenly than linear probing (a Java sketch follows)
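A short Java sketch of double-hashing insert with the hash functions used in the example slides below, h1(K) = K % 13 and h2(K) = 8 - K % 8 (illustrative code; insert assumes the table is not full):

    public class DoubleHashDemo {
        static final int M = 13;
        static Integer[] table = new Integer[M];

        static int h1(int k) { return k % 13; }
        static int h2(int k) { return 8 - k % 8; }           // never 0, so the probe always moves

        static void insert(int k) {                          // assumes the table is not full
            int probe = h1(k), offset = h2(k);
            while (table[probe] != null)
                probe = (probe + offset) % M;                // step by offset instead of 1
            table[probe] = k;
        }

        public static void main(String[] args) {
            for (int k : new int[] {18, 41, 22, 59, 32, 31, 73}) insert(k);
            for (int i = 0; i < M; i++)
                System.out.println(i + ": " + table[i]);     // 73 ends up in slot 3
        }
    }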
Quadratic Probing
- Don't step by 1 each time. Add i² to the hashed location h(x) (mod B, of course) for i = 1, 2, ... (see the snippet below)
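A one-method sketch of the quadratic probe sequence (illustrative values):

    public class QuadraticProbeDemo {
        public static void main(String[] args) {
            // First few quadratic probes for a key that hashes to h = 5 with B = 13.
            int h = 5, B = 13;
            for (int i = 1; i <= 5; i++)
                System.out.println((h + i * i) % B);   // h + i^2 (mod B): 6, 9, 1, 8, 4
        }
    }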
Double Hashing Example
- h1(K) = K % 13, h2(K) = 8 - K % 8
- Insert 18, 41, 22, 59, 32, 31, 73
- h1(K) = 5, 2, 9, 7, 6, 5, 8
- h2(K) = 6, 7, 2, 5, 8, 1, 7
- 18, 41, 22, 59, and 32 go straight into slots 5, 2, 9, 7, and 6
- 31 hashes to 5 (taken), steps by its offset of 1 past slots 6 and 7, and lands in slot 8
- 73 hashes to 8 (taken), steps by its offset of 7 to slots 2 and 9 (both taken), and lands in slot 3
Theoretical Results
Expected Probes
[Plot of expected probes versus load factor; only the axis values 0.5 and 1.0 survive in the text.]