Fall 2011 Homework 4: Question 5
A scientist wishes to study whether men or women are more likely to have a certain disease, or whether they are equally likely. A random sample of m women and n men is gathered, and each person is tested for the disease (assume for this problem that the test is completely accurate). The numbers of women and men in the sample who have the disease are X and Y respectively, with X ~ Bin(m, $p_{1}$) and Y ~ Bin(n, $p_{2}$). Here $p_{1}$ and $p_{2}$ are unknown, and we are interested in testing the "null hypothesis" $p_{1}$ = $p_{2}$.
(a) Consider a 2 by 2 table with rows corresponding to disease status and columns corresponding to gender, where each entry is the count of people with that disease status and gender (so m + n is the sum of all 4 entries). Suppose that it is observed that X + Y = r. The Fisher exact test is based on conditioning on both the row and column sums, so m, n, r are all treated as fixed, and then seeing if the observed value of X is "extreme" compared to this conditional distribution. Assuming the null hypothesis, use Bayes' Rule to find the conditional PMF of X given X + Y = r. Is this a distribution we have studied in class? If so, say which (and give its parameters).
(b) Give an intuitive explanation for the distribution of (a), explaining how this problem relates to other problems we've seen, and why $p_{1}$ disappears (magically?) in the distribution found in (a).
Solution:
(a) We use Bayes' Rule. Under the null hypothesis, X ~ Bin(m, p) and Y ~ Bin(n, p) are both binomial with the same parameter p, the common value of $p_{1}$ and $p_{2}$. Treating X and Y as independent, X + Y ~ Bin(m + n, p). Writing q = 1 - p, for each k with $0 \le k \le m$ and $0 \le r - k \le n$,
$P(X = k \mid X + Y = r) = \frac{P(X + Y = r \mid X = k) P(X = k)}{P(X + Y = r)} = \frac{P(Y = r - k) P(X = k)}{P(X + Y = r)} = \frac{\binom{n}{r-k} p^{r-k} q^{n-r+k} \binom{m}{k} p^{k} q^{m-k}}{\binom{m+n}{r} p^{r} q^{m+n-r}} = \frac{\binom{m}{k}\binom{n}{r-k}}{\binom{m+n}{r}}.$
The powers of p and q cancel, so the conditional distribution is Hypergeometric with parameters m, n, r. (Remember this as the happy-men-sad-women story that describes the conditional Binomial.)
(b) This problem has the same structure as the elk (capture-recapture) problem. In the elk problem, we take a sample of elk from a population in which some were tagged earlier, and we want to know the distribution of the number of tagged elk in the sample. By analogy, think of the women as corresponding to tagged elk and the men as corresponding to untagged elk. Having r people be infected with the disease corresponds to capturing a new sample of r elk, and the number of women among the r diseased individuals corresponds to the number of tagged elk in the new sample. Under the null hypothesis and given that X + Y = r, the set of diseased people is equally likely to be any set of r people. It makes sense that the conditional distribution of the number of diseased women does not depend on p: once we know that X + Y = r, we can work directly with a population of r diseased and m + n - r undiseased people, without worrying about the value of p that originally generated the population characteristics.
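As a numerical sanity check on the solution, the following minimal sketch computes the conditional PMF of X given X + Y = r directly from Bayes' Rule and compares it with the Hypergeometric PMF for several values of p. The function names and the values m = 8, n = 12, r = 5 are illustrative assumptions, not part of the original problem, and the sketch assumes X and Y are independent.

```python
from math import comb, isclose


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p); returns 0 outside the support 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def conditional_pmf(k, m, n, r, p):
    """P(X = k | X + Y = r) via Bayes' Rule.

    Assumes X ~ Bin(m, p) and Y ~ Bin(n, p) are independent,
    so that X + Y ~ Bin(m + n, p).
    """
    numerator = binom_pmf(r - k, n, p) * binom_pmf(k, m, p)
    denominator = binom_pmf(r, m + n, p)
    return numerator / denominator


def hypergeom_pmf(k, m, n, r):
    """Hypergeometric PMF: k women among r diseased, from m women and n men."""
    if k < 0 or k > m or r - k < 0 or r - k > n:
        return 0.0
    return comb(m, k) * comb(n, r - k) / comb(m + n, r)


# Illustrative (hypothetical) sample sizes and number of diseased people.
m, n, r = 8, 12, 5

# The conditional PMF should match the Hypergeometric PMF for every p in (0, 1).
for p in (0.1, 0.5, 0.9):
    for k in range(r + 1):
        assert isclose(
            conditional_pmf(k, m, n, r, p),
            hypergeom_pmf(k, m, n, r),
            rel_tol=1e-9,
            abs_tol=1e-12,
        )

print("Conditional PMF matches the Hypergeometric PMF for every p tested.")
```

The check passes for each p because the factors of p and 1 - p in the numerator and denominator cancel, which is the algebraic counterpart of the intuition in part (b).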
"Mathematics is the logic of certainty, but statistics is the logic of uncertainty."