Fall 2011 Homework 7: Question 7
Shakespeare wrote a total of 884,647 words in his known works. Of course, many words are used more than once, and the number of distinct words in Shakespeare's known writings is 31,534 (according to one computation). This puts a lower bound on the size of Shakespeare's vocabulary, but it is likely that Shakespeare knew words which he did not use in these known writings.

More specifically, suppose that a new poem of Shakespeare were uncovered, and consider the following (seemingly impossible) problem: give a good prediction of the number of words in the new poem that do not appear anywhere in Shakespeare's previously known works. The statisticians Ronald Thisted and Bradley Efron studied this problem in a paper called "Did Shakespeare write a newly-discovered poem?", which performed statistical tests to try to determine whether Shakespeare was the author of a poem discovered by a Shakespearean scholar in 1985. A simplified version of their method is developed in the problem below. The method was originally invented by Alan Turing (the founder of computer science) and I.J. Good as part of the effort to break the German Enigma code during World War II.

Let N be the number of distinct words that Shakespeare knew, and assume these words are numbered from 1 to N. Suppose for simplicity that Shakespeare wrote only two plays, A and B. The plays are reasonably long and they are of the same length. Let $X_{j}$ be the number of times that word j appears in play A, and $Y_{j}$ be the number of times it appears in play B, for $1\leq j\leq N$.

(a) Explain why it is reasonable to model $X_{j}$ as being Poisson, and $Y_{j}$ as being Poisson with the same parameter as $X_{j}$.

(b) Let the numbers of occurrences of the word "eyeball" (which was coined by Shakespeare) in the two plays be independent Pois($\lambda$) r.v.s. Show that the probability that "eyeball" is used in play B but not in play A is $e^{-\lambda}(\lambda - \lambda^{2}/2! + \lambda^{3}/3! - \lambda^{4}/4! + \cdots)$.
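The series in (b) can be sanity-checked numerically: it sums to $e^{-\lambda}(1 - e^{-\lambda})$, which is $P(X = 0)P(Y > 0)$ for independent Pois($\lambda$) X and Y. A minimal Python sketch (the function names are illustrative, not from the original problem):

```python
import math

def prob_in_B_not_A(lam, terms=50):
    """Alternating series e^{-lam} * (lam - lam^2/2! + lam^3/3! - ...)."""
    series = sum((-1) ** (k + 1) * lam ** k / math.factorial(k)
                 for k in range(1, terms + 1))
    return math.exp(-lam) * series

def closed_form(lam):
    """P(X = 0) * P(Y > 0) for independent Pois(lam) r.v.s X and Y."""
    return math.exp(-lam) * (1 - math.exp(-lam))

# The two expressions agree to floating-point precision.
for lam in [0.1, 1.0, 3.0]:
    assert abs(prob_in_B_not_A(lam) - closed_form(lam)) < 1e-12
```

The agreement reflects the Taylor expansion $1 - e^{-\lambda} = \lambda - \lambda^{2}/2! + \lambda^{3}/3! - \cdots$.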

(c) Now assume that $\lambda$ from (b) is unknown and is itself taken to be a random variable to reflect this uncertainty. So let $\lambda$ have a PDF $f_{0}$. Let X be the number of times the word "eyeball" appears in play A and Y be the corresponding value for play B. Assume that the conditional distribution of X, Y given $\lambda$ is that they are independent Pois($\lambda$) r.v.s. Show that the probability that "eyeball" is used in play B but not in play A is the alternating series $P(X = 1) - P(X = 2) + P(X = 3) - P(X = 4) + \cdots$. Hint: condition on $\lambda$ and use (b).

(d) Assume that every word's numbers of occurrences in A and B are distributed as in (c), where $\lambda$ may be different for different words but $f_{0}$ is fixed. Let $W_{j}$ be the number of words that appear exactly j times in play A. Show that the expected number of distinct words appearing in play B but not in play A is $E(W_{1}) - E(W_{2}) + E(W_{3}) - E(W_{4}) + \cdots$. (This shows that $W_{1} - W_{2} + W_{3} - W_{4} + \cdots$ is an unbiased predictor of the number of distinct words appearing in play B but not in play A: on average it is correct. Moreover, it can be computed just from having seen play A, without needing to know $f_{0}$ or any of the $\lambda_{j}$. This method can be extended in various ways to give predictions for unobserved plays based on observed plays.)
Solution:

(a) The Poisson distribution describes the number of events (such as emails received) happening at some average rate in a fixed interval or volume. The Poisson paradigm applies here: each individual word in a play has some very small probability of being word j, and the occurrences are weakly dependent, so $X_{j}$ is approximately Poisson. Since the two plays are by the same author and of the same length, it is reasonable to assume that the average rate of occurrence of a particular word is the same for both, so we take $\lambda$ to be the same for $X_{j}$ and $Y_{j}$.

(b) Let X be the number of times that "eyeball" is used in play A, and Y be the number of times that it is used in play B. Since X and Y are independent Pois($\lambda$), $P(X = 0, Y > 0) = P(X = 0)P(Y > 0) = e^{-\lambda}(1 - e^{-\lambda})$. Expanding $1 - e^{-\lambda}$ as a Taylor series gives $e^{-\lambda}(\lambda - \lambda^{2}/2! + \lambda^{3}/3! - \lambda^{4}/4! + \cdots)$.

(c) Now $\lambda$ is a random variable, but given $\lambda$ the calculation from (b) still holds. By the law of total probability, $P(X = 0, Y > 0) = \int_{0}^{\infty} e^{-\lambda}(\lambda - \lambda^{2}/2! + \lambda^{3}/3! - \cdots) f_{0}(\lambda)\, d\lambda$. Since $P(X = k) = \int_{0}^{\infty} e^{-\lambda} \frac{\lambda^{k}}{k!} f_{0}(\lambda)\, d\lambda$, integrating term by term gives the desired series $P(X = 1) - P(X = 2) + P(X = 3) - P(X = 4) + \cdots$.

(d) Note that the number of words that appear exactly i times in play A is $W_{i} = I(X_{1} = i) + I(X_{2} = i) + \cdots + I(X_{N} = i)$, where $I(X_{j} = i)$ is the indicator of word j appearing exactly i times in play A, so $E(W_{i}) = \sum_{j=1}^{N} P(X_{j} = i)$. By (c), the expected number of distinct words appearing in play B but not in play A is $\sum_{j=1}^{N} P(X_{j} = 0, Y_{j} > 0) = \sum_{j=1}^{N} \left(P(X_{j} = 1) - P(X_{j} = 2) + \cdots\right) = E(W_{1}) - E(W_{2}) + E(W_{3}) - E(W_{4}) + \cdots$.
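The unbiasedness claim in (d) can also be checked by simulation. The sketch below makes an illustrative choice not specified in the problem: word rates $\lambda_{j}$ are drawn i.i.d. from an Exponential(1) prior playing the role of $f_{0}$, and the function names are hypothetical. It compares the average of the predictor $W_{1} - W_{2} + W_{3} - \cdots$ (computed from play A alone) with the average number of distinct words appearing in play B but not in play A:

```python
import math
import random

def poisson(rng, lam):
    """Sample from Pois(lam) using Knuth's multiplication algorithm."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate(num_words=1000, num_trials=200, seed=0):
    """Average the predictor W1 - W2 + W3 - ... and the actual count of
    words in play B but not in play A over many simulated play pairs."""
    rng = random.Random(seed)
    # Illustrative prior f0: rates lambda_j drawn i.i.d. Exponential(1).
    lams = [rng.expovariate(1.0) for _ in range(num_words)]
    pred_total, actual_total = 0.0, 0
    for _ in range(num_trials):
        xs = [poisson(rng, lam) for lam in lams]  # word counts in play A
        ys = [poisson(rng, lam) for lam in lams]  # word counts in play B
        counts = {}                               # counts[i] = W_i this trial
        for x in xs:
            counts[x] = counts.get(x, 0) + 1
        pred_total += sum((-1) ** (i + 1) * counts.get(i, 0)
                          for i in range(1, max(xs) + 1))
        actual_total += sum(1 for x, y in zip(xs, ys) if x == 0 and y > 0)
    return pred_total / num_trials, actual_total / num_trials
```

Running, e.g., `simulate(num_words=1000, num_trials=200)` should give two averages that agree closely, since by (d) both are estimates of the same expectation; the predictor uses only play A, while the actual count requires both plays.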
"Mathematics is the logic of certainty, but statistics is the logic of uncertainty."