In the examples we gave early in the course for determining the probability that someone has a disease (D), given that he or she tests positive (+), we assumed that the probabilities p0=P(D), p1=P(+|D) and p2=P(+|~D) were known exactly. In real life, this is not so, and in fact our assessments of these probabilities will be based on statistics from a finite sample of patients. Thus, we may have a particular sample of people, some of which have the disease, and by division we estimate p0; similarly we may have a different sample of people who have the disease, determine which of these test positive, and by division estimate p1, and so forth.
This is an unrealistic way of estimating p0, p1 and p2; since we have a finite sample, we will really be only able to estimate credible intervals for these numbers from the sample. This means also that the best we can do is estimate credible intervals for theta, the probability that someone has the disease, given that they test positive.
Jim Berger has recently published a paper here, which you should download, where he discusses this problem amongst others. We will be interested in sections 2.2 and 2.3 of this paper. He and Douglas Mossman had earlier studied this problem, and came up with the following simple scheme for estimating a credible interval:
Suppose that x0, x1 and x2 are the number of people in three independent samples of size n0, n1 and n2 that have the disease, have the disease and test positive, and do not have the disease and test positive, respectively. Then, by Bayes' theorem, assuming a Jeffreys prior for binomial proportions pi-1/2(1-pi)-1/2 on each of p0, p1 and p2, the posterior probability of each of these quantities is given by a Beta(pi|xi+1/2,ni-xi+1/2) distribution (see Section 2.2 of the Berger paper). Thus, you can obtain a large sample of size N from the posterior distribution of theta by sampling p0, p1 and p2 from the appropriate Beta distribution using the rbeta function of R. Since the samples are all independent, you can then calculate for each of these samples an estimate of theta=p0*p1/[p0*p1 + (1-p0)*p2], which is just Bayes' theorem. Doing this for all samples you can obtain a sample from the posterior distribution of theta (posterior, that is, to the data x0, x1, x2, n0, n1 and n2). Once you have the samples of p0, p1 and p2, this can be done in a single R statement using vector calculations in R.
Write a function to draw such a sample and compute a 95% credible interval from the 2.5% and 97.5% quantiles of the sample using the R function quantile. Test your program by computing the quantiles of the credible interval on N=10,000 points using the data n0=n1=n2=20, x0=5, x1=15, x2=5. Compare this to the exact value if p0=0.25, p1=0.75, p2=0.25 from line 1 in Berger's Table 1. Does the result look reasonable?
Mossman, who is a psychiatrist and publishes in journals that are used to frequentist statistics, was concerned that the frequentist coverage of this procedure might not be as good as as the Bayesian probability. Therefore, Berger and Mossman performed an experiment where they computed the credible intervals many times, and simply counted what proportion of the credible intervals captured (covered) the actual theta. That's the coverage probability.
You can do the same by using the function you have written that outputs the lower and upper ends of the credible interval, given the data x0,...,n2. For simplicity, fix the values of n0, n1 and n2 at 20, as Berger and Mossman did. Then draw x0, x1 and x2 from independent binomial distributions with proportions p0, p1 and p2, respectively to get the data; these will vary from sample to sample. Then use your credible interval function to compute the 95% credible interval corresponding to the fixed n0, n0 and n0 and the x0, x1 and x2 you just sampled. The true value of theta is of course theta=p0*p1/[p0*p1 + (1-p0)*p2]. Following Berger and Mossman, you can calculate from the true value and from the endpoints of the credible interval whether theta is contained within the credible interval, falls to the left of the interval's lower endpoint, or falls to the right of its upper endpoint. Do this many times (I recommend 10,000 times after you are sure that the program is working properly) and you can then determine in what proportion of the cases theta falls below the lower limit of the credible interval, falls above the upper limit (as Berger displays in Table 1), or falls within the interval. The last of these (which Berger doesn't show in his table) is the coverage probability. Compare the computed coverage probability with the 95% figure. How close is it?
Do this for each of the three cases in Berger's Table 1.
This should not use much code, because except for the sampling over the many instances of x0, x1 and x2, everything can be done using the vector capabilities of R. My program fits on one screen of my computer, and there's no reason why your program shouldn't be similarly short.
Comments: Berger and Mossman showed (see Table 1) that the Bayesian procedure actually produced credible intervals that, when interpreted as confidence intervals, had superior coverage properties to any of the three standard frequentist techniques that had been used to do the same task. In my lectures I have pointed out that there are good reasons to want Bayesian procedures to have good frequentist properties, and in this example they do. This cannot always be arranged, but it often can be arranged.
I also recommend that you take a look at the journal, Bayesian Analysis, and consider adding it to your reading list. It is a new journal, published by the International Society for Bayesian Analysis (ISBA), and its quality is very high. I also recommend that you consider joining the ISBA. Student memberships are very inexpensive ($10 per year), and have all sorts of benefits, including a newsletter, announcements of meetings, and so forth.
All materials at this website Copyright (C) 1994-2007 by William
H. Jefferys. All Rights Reserved.
![]()