Fall 2011 Strategic Practice 4: Section 2 (Indicator Random Variables and Linearity of Expectation) - Question 4
A hash table is a commonly used data structure in computer science, allowing for fast information retrieval. For example, suppose we want to store some people's phone numbers. Assume that no two of the people have the same name. For each name x, a hash function h is used, where h(x) is the location at which to store x's phone number. After such a table has been computed, to look up x's phone number one just recomputes h(x) and then looks up what is stored in that location. The hash function h is deterministic, since we don't want to get different results every time we compute h(x). But h is often chosen to be pseudorandom. For this problem, assume that true randomness is used.

So let there be k people, with each person's phone number stored in a location chosen uniformly at random from the integers 1 through n, independently of the others. It then might happen that one location has more than one phone number stored there, if two different people x and y end up with the same random location for their information to be stored. Find the expected number of locations with no phone numbers stored, the expected number with exactly one phone number, and the expected number with more than one phone number (should these quantities add up to n?).
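As a minimal sketch of the setup (the function and variable names here are illustrative, not from the problem), we can model the "truly random" hash by drawing each person's location uniformly from {1, ..., n} and counting how many numbers land at each location:

```python
import random

def store_numbers(k, n, rng):
    """Store k phone numbers at independent uniform locations in 1..n.

    Returns a dict mapping location -> count of numbers stored there.
    This models the truly random h(x) assumed in the problem, not a
    real deterministic hash function.
    """
    counts = {}
    for _ in range(k):
        loc = rng.randint(1, n)  # h(x): uniform on {1, ..., n}
        counts[loc] = counts.get(loc, 0) + 1
    return counts

rng = random.Random(0)
n, k = 100, 10
table = store_numbers(k, n, rng)
empty = n - len(table)
exactly_one = sum(1 for c in table.values() if c == 1)
more_than_one = sum(1 for c in table.values() if c > 1)
# Every location falls into exactly one of the three categories.
assert empty + exactly_one + more_than_one == n
```

The final assertion reflects the parenthetical question: in every realization, the three counts partition the n locations.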
Solution: Let I_j be an indicator random variable equal to 1 if the jth location is empty, and 0 otherwise, for 1 <= j <= n. Then P(I_j = 1) = (1 - 1/n)^k, since each of the k phone numbers independently avoids location j with probability 1 - 1/n. The number of empty locations is I_1 + ... + I_n, so by linearity of expectation,

E(I_1 + ... + I_n) = E(I_1) + ... + E(I_n) = n(1 - 1/n)^k.

Similarly, the probability that a specific location has exactly one phone number stored is (k/n)(1 - 1/n)^(k-1), since there are k choices for which person's number is stored there, that number lands there with probability 1/n, and the other k - 1 numbers avoid that location. So the expected number of such locations is n(k/n)(1 - 1/n)^(k-1) = k(1 - 1/n)^(k-1). Finally, every location is empty, has exactly one number, or has more than one, so the three counts always sum to n, and by linearity the three expected values also sum to n. Hence the expected number of locations with more than one phone number is n - n(1 - 1/n)^k - k(1 - 1/n)^(k-1).
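A quick Monte Carlo check of the three formulas (a sketch with illustrative names; the trial count and parameters are arbitrary choices, not from the problem):

```python
import random

def simulate(k, n, trials, rng):
    """Average the counts of empty / exactly-one / more-than-one
    locations over many independent simulated hash tables."""
    tot_empty = tot_one = tot_more = 0
    for _ in range(trials):
        counts = [0] * n
        for _ in range(k):
            counts[rng.randrange(n)] += 1  # uniform location for each number
        e = counts.count(0)
        o = counts.count(1)
        tot_empty += e
        tot_one += o
        tot_more += n - e - o
    return tot_empty / trials, tot_one / trials, tot_more / trials

k, n = 10, 20
rng = random.Random(42)
empty, one, more = simulate(k, n, trials=20000, rng=rng)

th_empty = n * (1 - 1 / n) ** k            # n(1 - 1/n)^k
th_one = k * (1 - 1 / n) ** (k - 1)        # k(1 - 1/n)^(k-1)
th_more = n - th_empty - th_one            # complement of the other two
```

With k = 10 and n = 20, the simulated averages should be close to the theoretical values (about 11.97 empty, 6.30 with exactly one, and 1.72 with more than one), and the three simulated averages sum to n exactly, since they partition the locations in every trial.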
"Mathematics is the logic of certainty, but statistics is the logic of uncertainty."