Quantifying uncertainty in probability predictions / Toban Wiebe

Suppose you’re interested in knowing the chances of an event $X$ occurring (e.g., $X =$ “a nuclear strike over any populated area in the year 2019”). When making predictions about events with binary outcomes (either the event happens or it doesn’t), people generally report a single probability (e.g., a 2% chance of $X$ occurring). But, you may wonder, why not report an interval around that prediction (e.g., a prediction interval like 2% $\pm$ 0.5%) or a distribution of probabilities (e.g., a Beta distribution) to reflect uncertainty?

For example, this question comes up with prediction markets, where the market price can be interpreted as the best estimate of the probability of the event $X$ occurring. But there are no prediction intervals around this market price. Or consider models for classification, such as logistic regression or other machine learning algorithms, which produce predicted probabilities for each possible class (e.g., $X_i =$ “transaction $i$ is fraudulent”). In both of these cases, we face the same issue with representing uncertainty: how does the market/model express confidence in its predicted probabilities?

In this post, I’ll explain why this question stems from a fundamental confusion: it’s a misconception to think that a predicted probability is a point estimate that doesn’t convey any uncertainty. Below, I’ll show that there are two distinct sources of uncertainty that are being conflated here, and that one or both can be used to express uncertainty.

Two types of uncertainty

The key distinction here is between:

Uncertainty over the outcome, $X$ vs $\neg X$
- Also known as aleatoric uncertainty
- (FYI: the symbol " $\neg$ " is the negation operator and can be read as “not”)
Uncertainty over model parameters which are used to generate a prediction for the outcome
- Also known as epistemic uncertainty

Let’s unpack each case in depth.

Uncertainty over outcomes

When we aren’t working with a model, we only have the first source of uncertainty to deal with. But it isn’t obvious where the uncertainty lies: if we say $Pr[X] = 0.02$, it may appear that we’ve just given a point estimate. But recall that this is a binary outcome space, i.e., the only possible outcomes are $X$ or $\neg X$. So the full probability distribution (over the two possible outcomes) can be summarized by one probability, $p := Pr[X]$ (which implies $1-p = Pr[\neg X]$). As we’ve provided a full probability distribution over the outcome space, it’s not possible to say anything more: any uncertainty must be embedded in this distribution.

Intuitively, probabilities near 0 or 1 reflect a high degree of certainty. A prediction without any uncertainty at all would just be a yes or no answer, i.e., a predicted probability of 0 or 1. It would just state which outcome will occur, with no notion of uncertainty or hedging.

More precisely, confidence in a probability prediction is reflected by how extreme it is relative to a baseline or prior belief. To see this, suppose that there is an event $X$ that is very likely to occur, and that a prediction market has given $X$ a predicted probability of 0.97. If you are maximally uncertain/ignorant about $X$, what probability do you assign? Intuitively, you hedge your bets and stick to 0.97. Here, 0.97 is the baseline, which you can treat as your prior probability. Given this prior information, a prediction of 0.97 reflects maximal uncertainty. (If you didn’t have any prior information whatsoever, you would go with 0.5.)

Then, if you have some new information about $X$, you can update your prior to get a posterior. If your information provides strong evidence in favor of $X$, then your posterior probability might jump up to, say, 0.997. On the other hand, if your information strongly supports $\neg X$, then your posterior might drop to, say, 0.78. Thus, your degree of confidence is revealed by the degree to which your probability moves away from the baseline and toward 0 or 1.

You can use Bayes’ Theorem to play with some numbers yourself. Denote your prior by $p := Pr[X]$, and assume you’ve used your information $D$ to compute the likelihoods $q(X) := Pr[D \mid X]$ and $q(\neg X) := Pr[D \mid \neg X]$. Denote the likelihood ratio by $\lambda := q(X) / q(\neg X)$.

Then compute the posterior and rearrange in terms of the likelihood ratio:

\begin{aligned} Pr[X \mid D] &= \frac{p \cdot q(X)}{p \cdot q(X) + (1-p)q(\neg X)}\\ &= \frac{p \cdot q(X)/q(\neg X)}{p \cdot q(X)/q(\neg X) + (1-p)}\\ &= \frac{p \cdot \lambda}{p \cdot \lambda + (1-p)} \end{aligned}

Note that the posterior can be expressed purely in terms of the prior and the likelihood ratio (i.e., it doesn’t depend on the individual likelihoods). This means that the magnitudes of the likelihoods don’t matter; all that matters is their ratio, which indicates how much the information $D$ favors $X$ relative to $\neg X$ .
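The same point is perhaps easiest to see in odds form. Dividing the posterior for $X$ by the posterior for $\neg X$ cancels their shared denominator, so the posterior odds are just the prior odds multiplied by the likelihood ratio:

\begin{aligned} \frac{Pr[X \mid D]}{Pr[\neg X \mid D]} &= \frac{p \cdot q(X)}{(1-p) \cdot q(\neg X)} = \frac{p}{1-p} \cdot \lambda \end{aligned}

As an illustration with assumed numbers: a prior of $p = 0.97$ corresponds to prior odds of $0.97/0.03 \approx 32.3$. A likelihood ratio of $\lambda = 10$ then gives posterior odds of about $323$, i.e., a posterior probability of roughly $323/324 \approx 0.997$.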

If you play with this formula, you’ll get a sense of how the information in the likelihoods updates the prior to a posterior probability. Notice that when $\lambda = 1$, the posterior reduces to $p$, the prior. In other words, when $D$ is uninformative about $X$, it leaves your prior belief unchanged. Furthermore, for any prior belief $p$, if $\lambda > 1$, then your posterior will be pushed upward from your prior (and vice versa for $\lambda < 1$). That is, any information in favor of $X$ will increase your confidence in $X$, even if $p = 0.999$!
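To make this kind of experimentation concrete, here is a minimal Python sketch of the update formula. The function name `posterior` and the particular values of $\lambda$ are my own choices for illustration; the prior of 0.97 is the baseline from the prediction market example.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Posterior Pr[X | D] computed from the prior p and the likelihood ratio lambda."""
    numerator = prior * likelihood_ratio
    return numerator / (numerator + (1.0 - prior))


if __name__ == "__main__":
    p = 0.97  # baseline from the prediction market example
    # Illustrative likelihood ratios: below 1 favors not-X, above 1 favors X.
    for lam in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(f"lambda = {lam:>4}: posterior = {posterior(p, lam):.4f}")
```

Running it shows the posterior sitting at the prior when $\lambda = 1$ and moving toward 0 or 1 as $\lambda$ moves away from 1 in either direction.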

Judging the quality of probability predictions is simple: just check that they’re calibrated. For example, predictions made with 80% confidence should be correct 80% of the time. With enough completed predictions, you can plot a reliability diagram to assess the calibration of the predictions.
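As a rough sketch of what such a check could look like, the Python snippet below bins predictions by their predicted probability and compares each bin’s average prediction with the observed frequency of the event, which are the numbers a reliability diagram plots. The helper `reliability_table`, the choice of ten equal-width bins, and the simulated data are all assumptions for illustration; with real data you would pass in your own predictions and outcomes.

```python
import random


def reliability_table(probs, outcomes, n_bins=10):
    """Bin predictions by predicted probability and compare each bin's
    mean prediction with the observed frequency of the event."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Simulated data (an assumption for illustration): each outcome is drawn with
# exactly its predicted probability, so these predictions are calibrated by construction.
rng = random.Random(0)
probs = [rng.random() for _ in range(20_000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

for mean_pred, observed, count in reliability_table(probs, outcomes):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  (n = {count})")
```

For calibrated predictions, the two columns should roughly agree in every bin; large gaps in bins with many predictions indicate miscalibration.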

Uncertainty over models

When we are working with a model, we also have a second source of uncertainty — that of the model. This uncertainty is reflected in the posterior distribution over the parameters (at least for Bayesians — frequentists would use the sampling distribution of the parameter estimator to derive confidence intervals / standard errors). Because these parameters are used to produce predictions, their uncertainty propagates through to produce additional uncertainty over the outcome.

This is meta-uncertainty: uncertainty over the model which produces the uncertain prediction of the outcome. (In fact, you can have higher levels of meta-uncertainty by including uncertainty over any hyperparameters of the model.)

For example, here’s a specification for a Bayesian logistic regression model, where I’ve put an informative prior on the model coefficients:

\begin{aligned} y_i &\sim Bernoulli(p_i) \\ \log\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + x_{i1} \beta_1 + x_{i2} \beta_2 + \ldots + x_{iK} \beta_K \\ \beta_0 &\sim \mathcal{N}(0,1.5) \\ \beta_k &\sim \mathcal{N}(0,1.5), \; k = 1,2,\ldots,K \end{aligned}

You can see how the uncertainty from the model’s prior (Normal distribution) propagates through, adding to the uncertainty in the likelihood (Bernoulli distribution). As a result, we get a full density for $p_i$ over the interval $(0,1)$. The spread of this density reflects model uncertainty, whereas the distance from the prior distribution reflects the degree of confidence in the prediction of the outcome. The key thing to realize is that these two sources of uncertainty are orthogonal.
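The propagation step can be sketched by simulation. The snippet below draws coefficients from the Normal prior in the specification above and pushes each draw through the inverse logit to get a value of $p_i$. Note the assumptions: it samples from the prior only (no data, so no posterior), it treats 1.5 as the standard deviation of the Normal, and the feature vector `[1.0, -0.5]` and the function name `sample_p` are hypothetical.

```python
import math
import random
import statistics


def sigmoid(z: float) -> float:
    """Inverse of the logit link: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_p(x, n_draws=10_000, scale=1.5, seed=0):
    """Draw coefficients from the prior and return the implied values of p_i.

    x     : feature values for one observation (hypothetical example input)
    scale : standard deviation of the Normal prior (an assumption about
            how N(0, 1.5) is parameterized)
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        beta_0 = rng.gauss(0.0, scale)
        betas = [rng.gauss(0.0, scale) for _ in x]
        log_odds = beta_0 + sum(x_k * b_k for x_k, b_k in zip(x, betas))
        draws.append(sigmoid(log_odds))
    return draws


draws = sorted(sample_p([1.0, -0.5]))
print(f"mean of p_i: {statistics.fmean(draws):.3f}")
print(f"5% and 95% quantiles: {draws[len(draws) // 20]:.3f}, {draws[-len(draws) // 20]:.3f}")
```

Each draw is a single probability prediction; the spread across draws is the model uncertainty layered on top of it. Shrinking `scale` narrows that spread without changing the fact that each individual $p_i$ still describes an uncertain outcome.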

For example, you could simultaneously have a very confident prediction of the outcome but with a lot of model uncertainty — a widely spread posterior distribution that is far away from the prior distribution:

[Figure: Confident prediction, uncertain model. Posterior Beta(2, 4) and prior Beta(95, 5) densities over predicted probabilities; means 0.33 (posterior) and 0.95 (prior).]

Or you could have a very unconfident prediction of the outcome with very little model uncertainty — a tightly distributed posterior distribution that remains close to the prior distribution:

[Figure: Uncertain prediction, confident model. Posterior Beta(120, 30) and prior Beta(60, 15) densities over predicted probabilities; means 0.80 (posterior) and 0.80 (prior).]
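The two scenarios can also be compared numerically. The sketch below uses the standard formulas for the mean and standard deviation of a Beta distribution, taking the posterior’s standard deviation as a stand-in for model uncertainty and the shift in mean from prior to posterior as a stand-in for confidence in the outcome prediction. Those two summary choices, and the helper name `beta_mean_sd`, are my own simplifications for illustration.

```python
import math


def beta_mean_sd(a: float, b: float):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# (posterior parameters, prior parameters) for the two scenarios above
scenarios = {
    "confident prediction, uncertain model": ((2, 4), (95, 5)),
    "uncertain prediction, confident model": ((120, 30), (60, 15)),
}
for name, (post, prior) in scenarios.items():
    post_mean, post_sd = beta_mean_sd(*post)
    prior_mean, _ = beta_mean_sd(*prior)
    print(f"{name}: posterior mean {post_mean:.2f}, sd {post_sd:.3f}, "
          f"shift from prior mean {abs(post_mean - prior_mean):.2f}")
```

The first scenario has a large shift and a wide posterior; the second has no shift in the mean and a narrow posterior, which is the sense in which the two kinds of uncertainty vary independently.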

Summing up

The argument I’ve made here can be summed up as:

A probability prediction is a full probability distribution and so it inherently quantifies uncertainty — it’s a misconception to think that you need a prediction interval to express uncertainty
Model uncertainty propagates through to produce an additional (but orthogonal) layer of uncertainty over the outcome

This may seem obvious in retrospect, but it’s always good to gain clarity on the fundamentals, where confused intuitions may lurk unnoticed. Here, intuitions such as “a scalar prediction must be a point estimate” and “you need a confidence/prediction interval to express uncertainty” are highly misleading.

In this case, understanding the distinct sources of uncertainty has resolved some confusion I had about prediction markets and machine learning model predictions. The upshot is that, if I’m comfortable ignoring model uncertainty, I need only be concerned that the probability predictions are calibrated. This is straightforward to check: for prediction markets, you just need some historical data on outcomes; for classifier models, you can check on a holdout dataset. Then you can be comfortable that the predictions are accurately quantifying uncertainty.