Practice 3: Data Structures

Difficulty assessment:

(*) Basic. You should be able to easily do this.

(**) Standard. Core material to test your understanding.

(***) Advanced. Possibly slightly above the level of the assignments.

Hash tables

Exercise 1 ( ** ):

(a) You have a universe of 30 numbers $\{0, 1, \ldots, 29\}$ and a hash-table of size $10$. Which hash-function is better, $h_1(k) = k \mod 10$, or $h_2(k) = \lfloor k/10\rfloor$?

Solution: The first hash function selects the last digit of a number as its hash value, and the second hash function selects the first (tens) digit. The first hash function is better because its image covers the whole hash table, whereas the image of the second hash function, $\{0, 1, 2\}$, is only a small subset of the slots. Moreover, the first hash function distributes the hash values evenly over the hash table, so if the keys drawn from the universe are random, there will be few collisions.
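
As a quick illustration (not part of the required answer), the following snippet compares the images of the two hash functions over the universe:

In [ ]:
# Compare the images of h1 and h2 over the universe {0, ..., 29}.
universe = range(30)
h1 = lambda k: k % 10   # last digit
h2 = lambda k: k // 10  # first (tens) digit
print(sorted({h1(k) for k in universe}))  # [0, 1, ..., 9]: every slot is used
print(sorted({h2(k) for k in universe}))  # [0, 1, 2]: only 3 of the 10 slots are ever used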

(b) How many different hash functions exist that map from a universe of size $n$ to the integers $1$ to $m$?

Solution: The number of functions that map from a universe of size $n$ to the integers $1\ldots m$ is $m^n$, as each of the $n$ elements can map to one of $m$ integers.
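
As a small sanity check (illustrative only), the functions can be enumerated explicitly for tiny values of $n$ and $m$:

In [ ]:
from itertools import product

# Every function is determined by the image of each of the n universe elements,
# so the functions correspond exactly to the tuples in {1, ..., m}^n.
n, m = 2, 3
functions = list(product(range(1, m + 1), repeat=n))
print(len(functions), m**n)  # both are 9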

Exercise 2 ( ** ):

(a) Draw the hash table of length $m = 16$ resulting from hashing the keys $8, 12, 40, 13, 88, 45, 29, 20, 23$, and $77$, using the hash function $h(k) = (3k + 1) \mod 16$ and assuming collisions are handled by chaining. Is $m = 16$ a good choice for the size of a table? Why or why not?

Solution: The hash values are $h(8) = 9$, $h(12) = 5$, $h(40) = 9$, $h(13) = 8$, $h(88) = 9$, $h(45) = 8$, $h(29) = 8$, $h(20) = 13$, $h(23) = 6$, and $h(77) = 8$. The resulting hash table of size $16$ therefore has the following chains (keys listed in insertion order): slot $5$: $12$; slot $6$: $23$; slot $8$: $13, 45, 29, 77$; slot $9$: $8, 40, 88$; slot $13$: $20$; all other slots are empty.

For the given hash function, $m = 16$ is a bad choice for the size of a hash table because it is a power of $2$. When we use the division method to create a hash function and the table size is $m = 2^p$, the hash values depend only on the lower $p$ bits of the key. This may lead to more collisions: for example, the keys $13, 29, 45$, and $77$ agree in their lower four bits, so they all hash to slot $8$.

(b) Repeat the previous exercise for a table of size $m = 13$ using the hash function $h(k) = (3k + 1) \mod 13$. Why is $m = 13$ a better choice for the size of a hash table than $m = 16$?

Solution: The hash values are now $h(8) = 12$, $h(12) = 11$, $h(40) = 4$, $h(13) = 1$, $h(88) = 5$, $h(45) = 6$, $h(29) = 10$, $h(20) = 9$, $h(23) = 5$, and $h(77) = 11$. The resulting hash table of size $13$ has the following chains (keys listed in insertion order): slot $1$: $13$; slot $4$: $40$; slot $5$: $88, 23$; slot $6$: $45$; slot $9$: $20$; slot $10$: $29$; slot $11$: $12, 77$; slot $12$: $8$; all other slots are empty.

$m = 13$ is a better choice for the size of a hash table when using the division method to create a hash function because it is prime and far enough away from a power of two. The resulting hash values then depend on all bits of the key, not only on the lower bits.
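
The two tables can be reproduced with a short script (a sketch written for this exercise; chains are printed in insertion order):

In [ ]:
def chaining_table(keys, m):
    # Chained hash table for h(k) = (3k + 1) mod m; each slot holds a list of keys.
    table = [[] for _ in range(m)]
    for k in keys:
        table[(3 * k + 1) % m].append(k)
    return table

keys = [8, 12, 40, 13, 88, 45, 29, 20, 23, 77]
for m in (16, 13):
    print("m =", m)
    for slot, chain in enumerate(chaining_table(keys, m)):
        if chain:
            print("  slot", slot, ":", chain)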

(c) What is the result of the previous exercise, assuming collisions are handled by linear probing? Report for every key how often you had to probe unsuccessfully before you could insert the key, and report the total number of unsuccessful probes.

Solution:

Keys $8, 12, 40, 13, 88, 45, 29$, and $20$ hash to the empty slots $12, 11, 4, 1, 5, 6, 10$, and $9$, respectively, so they cause $0$ unsuccessful probes. Key $23$ hashes to slot $5$ (occupied by $88$), then probes slot $6$ (occupied by $45$), and is placed in slot $7$: $2$ unsuccessful probes. Key $77$ hashes to slot $11$ (occupied by $12$), then probes slot $12$ (occupied by $8$), and is placed in slot $0$: $2$ unsuccessful probes. The total number of unsuccessful probes is $4$.
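
These counts can be verified with a small open-addressing sketch (illustrative only; `insert_all` is a helper written for this exercise, probing $(h'(k) + i \cdot h''(k)) \bmod m$ with $h''(k) = 1$ for linear probing):

In [ ]:
def insert_all(keys, m, h2=lambda k: 1):
    # Open addressing with probe sequence (h1(k) + i*h2(k)) mod m,
    # where h1(k) = (3k + 1) mod m; h2 defaults to 1, i.e. linear probing.
    table = [None] * m
    unsuccessful = {}
    for k in keys:
        start = (3 * k + 1) % m
        i = 0
        while table[(start + i * h2(k)) % m] is not None:
            i += 1                                   # one unsuccessful probe
        table[(start + i * h2(k)) % m] = k
        unsuccessful[k] = i
    return table, unsuccessful

keys = [8, 12, 40, 13, 88, 45, 29, 20, 23, 77]
print(insert_all(keys, 13)[1])  # linear probing: only 23 and 77 need 2 extra probes each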

(d) What is the result of (c) assuming collisions are handled by double hashing with a primary hash function $h'(k) = h(k)$ and a secondary hash function $h''(k) = k \mod 15$?

Solution:

Using the probe sequence $h(k, i) = (h'(k) + i \cdot h''(k)) \bmod 13$: keys $8, 12, 40, 13, 88, 45, 29$, and $20$ are placed without collisions, so they cause $0$ unsuccessful probes. Key $23$ hashes to slot $5$ (occupied by $88$); with $h''(23) = 23 \bmod 15 = 8$, the next probe $(5 + 8) \bmod 13 = 0$ is free, so there was $1$ unsuccessful probe. Key $77$ hashes to slot $11$ (occupied by $12$); with $h''(77) = 77 \bmod 15 = 2$, the probe $(11 + 2) \bmod 13 = 0$ is occupied by $23$, and $(11 + 4) \bmod 13 = 2$ is free, so there were $2$ unsuccessful probes. The total number of unsuccessful probes is $3$.
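
Reusing the `insert_all` sketch from part (c) with the secondary hash function $h''(k) = k \bmod 15$ reproduces these counts:

In [ ]:
print(insert_all(keys, 13, lambda k: k % 15)[1])  # double hashing: 23 -> 1, 77 -> 2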

Exercise 3: Describe algorithms for Hash-Chain-Insert, Hash-Chain-Delete, and Hash-Chain-Search for a hash table with chaining, where the (doubly-linked) lists are stored in a sorted order. Analyze the running time of the algorithms.

Solution:

(i) Hash-Chain-Insert: instead of inserting $x$ at the head of $T[h(x.key)]$ we need to find the proper position of $x$ in the list. So, we compare the elements of $T[h(x.key)]$ one by one with $x$ until we reach the end of the list or until we find an element $y$ such that $y.key > x.key$. Then, we insert $x$ in front of $y$, or at the end of the list if no such $y$ was found.

Running time analysis: The worst-case running time of Hash-Chain-Insert is $\Theta(n)$ (if all elements hash into the same slot and $x$ is the maximum element, the whole list has to be traversed before we can insert $x$). On average, if the load factor of the hash table is $\alpha$, the running time of Hash-Chain-Insert is $\Theta(1 + \alpha)$.

(ii) Hash-Chain-Search: search for an element with key $k$ in the list $T[h(k)]$. We can use the same algorithm as for unsorted lists. We can slightly improve the algorithm by stopping the search as soon as we find an element $y$ with $y.key > k$, since the list is sorted. The expected running time of both variants is $\Theta(1 + \alpha)$.

(iii) Hash-Chain-Delete: delete element $x$ from the list $T[h(x.key)]$. The algorithm is the same as for unsorted lists: when deleting $x$ from a sorted list, the sorted order of the remaining elements is preserved. The running time is $O(1)$ since the lists are doubly linked.
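
A minimal sketch of the three operations (illustrative only; it stores bare keys in per-slot sorted doubly-linked lists, uses a simple division hash, and the names `Node` and `SortedChainHashTable` are introduced here, not part of the exercise):

In [ ]:
class Node:
    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None

class SortedChainHashTable:
    def __init__(self, m):
        self.m = m
        self.heads = [None] * m               # head of the sorted chain of each slot

    def _h(self, key):
        return key % self.m                   # division method, for illustration

    def insert(self, key):                    # Hash-Chain-Insert
        node, slot = Node(key), self._h(key)
        y, prev = self.heads[slot], None
        while y is not None and y.key <= key: # walk until the first strictly larger key
            prev, y = y, y.next
        node.prev, node.next = prev, y
        if prev is None:
            self.heads[slot] = node
        else:
            prev.next = node
        if y is not None:
            y.prev = node

    def search(self, key):                    # Hash-Chain-Search
        y = self.heads[self._h(key)]
        while y is not None and y.key < key:  # the chain is sorted, so we may stop early
            y = y.next
        return y if y is not None and y.key == key else None

    def delete(self, node):                   # Hash-Chain-Delete, O(1) given the node
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.heads[self._h(node.key)] = node.next
        if node.next is not None:
            node.next.prev = node.prev

# Small usage example: insert a few keys, search for one, delete it again.
H = SortedChainHashTable(5)
for k in [7, 2, 12, 3]:
    H.insert(k)
node = H.search(12)
if node is not None:
    H.delete(node)
print(H.search(12))  # None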

Exercise 4: Consider two sets of integers, $S = \{s_1, s_2, \ldots, s_m\}$ and $T = \{t_1, t_2, \ldots, t_n\}$, $m \leq n$. Describe an algorithm that uses a hash table of size $m$ to test whether $S$ is a subset of $T$, and that runs in $O(n)$ time. You may assume the existence of a suitable hash function that satisfies the assumption of simple uniform hashing.

Solution:

Let $H$ be the hash table of size $m$ and $h()$ be a corresponding uniform hash function. The algorithm is the following: First, insert all the elements of $T$ into $H$. Next, for each element $s_i$ in $S$ search for $s_i$ in $H$. If for some $s_i$ the search was unsuccessful (i.e. returns None), then $s_i \not \in T$, and thus $S \not \subset T$. Otherwise, report that $S \subset T$.

Note: You could, but don't need to, add Python code here to clarify the algorithm. Python code for this algorithm is given below.

Correctness: If $S \subset T$, then for all elements of $S$ the search will return the corresponding element of $T$ from the hash table. Therefore all searches are successful, and the algorithm will correctly report that $S \subset T$.

If $S \not \subset T$, then there will be an element $s_i$ from $S$ such that $s_i \not \in T$. Then, searching for $s_i$ in the hash table will return None, and the algorithm correctly reports that $S \not \subset T$.

Running time: Inserting the $n$ elements of $T$ into the hash table (with chaining) takes $O(1) \cdot n = O(n)$ time. Searching takes $O(1 + \alpha)$ expected time per element, where $\alpha = n/m$ is the load factor. Thus, searching for the $m$ elements of $S$ takes $m \cdot O(1 + n/m) = O(m + n)$ time. Overall the algorithm runs in $O(m + n)$ time, which is $O(n)$ since $m \leq n$.

In [1]:
# Solution code:
# A simplistic hash table for the purpose of this exercise. It just uses keys (instead of elements with keys)
class HashTableChaining:
    def __init__(self, size):
        self.size = size
        # One separate chain per slot; note that [[]]*size would make every slot
        # share the same list object.
        self.hashTable = [[] for _ in range(size)]
        
    def h(self, key):
        # Division method. Note that depending on the table size this might not be
        # a good hash function, but it serves as an example here.
        return key % self.size
    
    def insert(self, key):
        self.hashTable[self.h(key)].append(key)
         
    def search(self, key):
        # Since we store bare keys instead of elements, search simply returns the key
        # if it is present in the chain, and None otherwise.
        if key in self.hashTable[self.h(key)]: return key
        else: return None
        
def isSubset(S, T):
    m = len(S)
    n = len(T)
    H = HashTableChaining(m)
    for t in T: H.insert(t)
    for s in S:
        if H.search(s) is None: return False  # 'is None', so that a stored key 0 is not mistaken for "not found"
    return True

isSubset([1,2,3],[1,4,2,3])
Out[1]:
True