Consider the following:

(a) Suppose that we have a list of the populations of every country in the world. Guess, without looking at data yet, what percentage of the populations have the digit 1 as their first digit (e.g., a country with a population of 1,234,567 has first digit 1 and a country with population 89,012,345 does not). Note: (a) is a rare problem where the only way to lose points is to find out the right answer rather than guessing!

(b) After having done (a), look through a list of populations such as http://en.wikipedia.org/wiki/List_of_countries_by_population and count how many start with a 1. What percentage of countries is this?

(c) Benford's Law states that in a very large variety of real-life data sets, the first digit approximately follows a particular distribution with about a 30% chance of a 1, an 18% chance of a 2, and in general $P(D = j) = \log_{10}\left(\frac{j+1}{j}\right)$ for $j \in \{1, 2, \dots, 9\}$, where D is the first digit of a randomly chosen element. Check that this is a PMF (using properties of logs, not with a calculator).

(d) Suppose that we write the random value in some problem (e.g., the population of a random country) in scientific notation as $X \times 10^N$, where N is a nonnegative integer and $1 \le X < 10$. Assume that X is a continuous r.v. with PDF $f(x) = c/x$ for $1 \le x \le 10$ (and 0 otherwise), with c a constant. What is the value of c (be careful with the bases of logs)? Intuitively, we might hope that the distribution of X does not depend on the choice of units in which X is measured. To see whether this holds, let Y = aX with a > 0. What is the PDF of Y (specifying where it is nonzero)?

(e) Show that if we have a random number $X \times 10^N$ (written in scientific notation) and X has the PDF f(x) from (d), then the first digit (which is also the first digit of X) has the PMF given in (c). *Hint: what does D = j correspond to in terms of the values of X?*

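Solution:

(part a) Guess.

(part b) About 28% of the countries have first digit 1. This is striking, as it is much higher than the roughly 11% (that is, 1/9) one would expect from guessing that the first digit is equally likely to be any of 1, 2, …, 9. This phenomenon is known as Benford's Law, and a distribution similar to the one derived below has been observed in many diverse settings (such as lengths of rivers, physical constants, stock prices).

(part c) The function $P(D = j) = \log_{10}\left(\frac{j+1}{j}\right)$ is nonnegative, since $(j+1)/j > 1$. Writing each term as $\log_{10}(j+1) - \log_{10}(j)$, the sum over $j = 1, \dots, 9$ telescopes to $\log_{10}(10) - \log_{10}(1) = 1$. Thus we have a PMF.

(part d) The PDF must integrate to 1, so $\int_1^{10} \frac{c}{x}\,dx = c \ln 10 = 1$, giving $c = 1/\ln 10$. By the change of variables formula, $f_Y(y) = f_X(y/a) \cdot \frac{1}{a} = \frac{c}{y/a} \cdot \frac{1}{a} = \frac{c}{y}$ for $a \le y \le 10a$ (and 0 otherwise). So the PDF takes the same form for aX as for X, with the same c, but over a different range.

(part e) The first digit is D = j exactly when $j \le X < j + 1$. The probability of this is $\int_j^{j+1} \frac{c}{x}\,dx = \frac{\ln(j+1) - \ln(j)}{\ln 10} = \log_{10}(j + 1) - \log_{10}(j) = \log_{10}\left(\frac{j+1}{j}\right)$, identical to the PMF in (c).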
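The results of parts (c) through (e) can also be sanity-checked numerically. The sketch below is only an illustration, not part of the original solution: the function names `benford_pmf`, `first_digit`, and `simulate_first_digits` are made up for this example. It assumes X is generated as 10^U with U uniform on [0, 1), which gives the CDF log10(x) on [1, 10] and hence the PDF 1/(x ln 10) from part (d). The optional scale factor `a` lets one look at the first digit of Y = aX as well.

```python
import math
import random


def benford_pmf(j):
    """P(D = j) = log10((j + 1) / j) for j = 1, ..., 9 (part c)."""
    return math.log10((j + 1) / j)


def first_digit(x):
    """Leading decimal digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def simulate_first_digits(n, a=1.0, seed=0):
    """Empirical first-digit frequencies of a * X over n draws.

    Assumption for this sketch: X = 10**U with U ~ Unif[0, 1), so that
    P(X <= x) = log10(x) on [1, 10], i.e. X has PDF 1 / (x ln 10).
    """
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n):
        x = 10 ** rng.random()
        counts[first_digit(a * x)] += 1
    return [counts[j] / n for j in range(1, 10)]


if __name__ == "__main__":
    # Part (c): the probabilities should sum to 1 up to rounding error.
    print("sum of PMF:", sum(benford_pmf(j) for j in range(1, 10)))

    # Part (e): simulated frequencies next to the Benford PMF.
    freqs = simulate_first_digits(100_000)
    for j, freq in enumerate(freqs, start=1):
        print(j, round(benford_pmf(j), 4), round(freq, 4))
```

Calling `simulate_first_digits(100_000, a=2.54)` (a hypothetical change of units) should give frequencies close to the same PMF, in line with the scale-invariance idea in part (d).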

Copyright © 2011 Stat 110 Harvard.