Up until now, we’ve used error rates to help us understand tradeoffs in binary decision-making. In this section, we’ll introduce a more general theoretical framework to understand and quantify errors we make, and start to explore the theoretical branch of statistics known as statistical decision theory.

Remember our setup: we have some unknown quantity $\theta$ that we’re interested in. We collect data $x$. Our data are random, and come from the distribution $p(x|\theta)$. We use the data to reason about $\theta$. Often, we’ll want to use the data to compute an estimate for $\theta$, but sometimes we may want to do something slightly different. In order to describe “the thing we do to the data”, we’ll use the notation $\delta(x)$. This represents the result of applying some procedure $\delta$ to the data. For example, $\delta$ might be the sample average of many data points, or the result of logistic regression. The obvious next question is: how do we choose which procedure $\delta$ to use? We’ll decide by quantifying how “good” each $\delta$ is, and then trying to choose the “best” one.

“Good” is a very abstract notion: to quantify it, we’ll need a quantitative measure of how good (or to be more precise, how bad) our procedure $\delta$ is. We’ll call this a loss function. Notationally, we’ll write $\ell(\delta(x), \theta)$ to represent the loss associated with the outcome $\delta(x)$ if the true value is $\theta$. To summarize:

| Variable/notation | What it means |
| --- | --- |
| $\theta$ | unknown quantity/quantities of interest: parameter(s) |
| $x$ | observed data |
| $p(x \mid \theta)$ | probability distribution for data $x$ (depends on $\theta$) |
| $\delta(x)$ | decision or result computed from $x$, often an estimate of $\theta$ |
| $\ell(\delta(x), \theta)$ | loss (badness) for output $\delta(x)$ and true parameter(s) $\theta$ |

Examples

That’s a very abstract definition: let’s make it more concrete with a few examples.

Binary decision: 0-1 loss

For our first example, we’ll return to our binary decision-making setting. In that case:

  • Our unknown parameter $\theta$ is binary, and corresponds to reality, which we’ve been calling $R$.

  • Our data $x$ were whatever we used to compute the p-value.

  • The decision $\delta$ is a binary decision, which we’ve been calling $D$.

| | $D = \delta(x) = 0$ | $D = \delta(x) = 1$ |
| --- | --- | --- |
| $R = \theta = 0$ | TN loss | FP loss |
| $R = \theta = 1$ | FN loss | TP loss |

Here are a few concrete examples, and what each of these quantities would represent:

| Example | Unknown $\theta$ / $R$ | Data $x$ | Decision $\delta$ / $D$ |
| --- | --- | --- | --- |
| Disease testing | Whether a person has a disease | Collected clinical data (blood sample, vital signs, etc.) | Should we give the person treatments for that disease? |
| Finding oil wells | Whether underground oil exists in a certain area | Readings from seismic sensors, etc. | Should we drill for oil in this location? |
| Product recommendation | Will a user buy this product? | User behavior, interest in similar products, etc. | Should we recommend the product to the user? |

Note that we haven’t really talked much about $p(x|\theta)$, since we’ve been working with $\delta(x)$ directly: we’ll discuss this more in the next chapter.

Our loss function will depend on the problem we’re solving. Since in this case both the inputs ($\theta$/$R$ and $\delta$/$D$) are binary, we can write the loss in a 2x2 table that looks exactly like the ones we’ve seen before. If both kinds of error (false positive and false negative) are equally bad, we can use the simplest loss function, the 0-1 loss:

$$
\ell(\delta(x), \theta) = \begin{cases} 0 & \text{if } \theta = \delta(x) \\ 1 & \text{if } \theta \neq \delta(x) \end{cases}
$$
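
To make this concrete in code, here’s a minimal sketch of the 0-1 loss as a Python function (the function name and the example calls are purely illustrative, not part of any library):

```python
def zero_one_loss(decision, theta):
    """0-1 loss: 0 if the decision matches reality, 1 otherwise."""
    return 0 if decision == theta else 1

# Both kinds of error are penalized equally:
print(zero_one_loss(1, 0))  # false positive -> 1
print(zero_one_loss(0, 1))  # false negative -> 1
print(zero_one_loss(1, 1))  # true positive  -> 0
print(zero_one_loss(0, 0))  # true negative  -> 0
```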

Exercise: Suppose we have a situation where a false positive is five times worse than a false negative. How would you write the loss function?

Continuous decision: $\ell_2$ loss

Now, suppose our parameter $\theta$ is continuous, and $\delta(x)$ is our estimate of the parameter from the data. To make things a little more concrete, $\theta$ could be the average height of people in a population, and $x$ could be the heights of people in a random sample from that population. In this case, our loss shouldn’t just be right vs wrong: we should use a loss function that’s smaller when our estimate is close, and larger when our estimate is far away.

You’ve probably already seen one before: the squared error loss, also known as the $\ell_2$ loss:

$$
\ell(\delta(x), \theta) = \big(\delta(x) - \theta\big)^2
$$
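
In code, the $\ell_2$ loss is just as simple (again, the function name and example values are only for illustration):

```python
def squared_error_loss(estimate, theta):
    """Squared error (l2) loss: grows quadratically with the size of the error."""
    return (estimate - theta) ** 2

# An estimate twice as far from theta incurs four times the loss:
print(squared_error_loss(5.1, 5.0))  # ~0.01
print(squared_error_loss(5.2, 5.0))  # ~0.04
```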

We’ll analyze the $\ell_2$ loss a little more later.

Exercise: Suppose we have a situation where it’s much worse to make a guess that’s too high, compared to a guess that’s too low. How would you construct a loss function for this problem?

Known and unknown

At this point, you may be wondering: if $\theta$ is unknown, how can we ever compute the loss function? It’s important to keep in mind that when we apply $\delta$ to real data, we don’t know $\theta$. But right now, we’re building up some machinery to help us analyze different procedures. In other words, we’re trying to get to a place where we can answer questions like “what procedures are most likely to give us estimates that are close to $\theta$?”

Fixed and random: finding the average loss in Bayesian and frequentist approaches

The loss function is a function of $\delta(x)$, the result of applying our procedure to a particular dataset $x$, and of the particular parameter $\theta$. This isn’t particularly useful to us: we’d like to understand how the loss does “on average”. But in order to compute any kind of average, we need to decide what’s random and what’s fixed. This is an important fork in the road: we can either take the Bayesian or the frequentist route. Let’s examine what happens if we try each one:

Frequentist loss analysis

In the frequentist world, we assume that our unknown $\theta$ is fixed. The data are the only random piece. So, we’re going to look at the average across different possibilities for the data $x$. Since the data come from the distribution $p(x|\theta)$, which depends on $\theta$, we should expect that this “averaging” will produce something that depends on $\theta$. We’ll call our average the frequentist risk:

$$
\begin{align*}
R(\theta) &= E_{x|\theta}\left[\ell(\delta(x), \theta)\right] \\
&= \begin{cases} \displaystyle \sum_x \ell(\delta(x), \theta)\, p(x|\theta) & \text{if $x$ discrete} \\ \displaystyle \int \ell(\delta(x), \theta)\, p(x|\theta)\, dx & \text{if $x$ continuous} \end{cases}
\end{align*}
$$

The frequentist risk is a function of $\theta$. It tells us: for a particular value of $\theta$, how poorly does the procedure $\delta$ do if we average over all possible datasets?
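
As a sketch of how you might approximate this by simulation, the code below estimates the frequentist risk of the sample mean under the $\ell_2$ loss by averaging over many simulated datasets at a fixed $\theta$. The normal model for the data, and the specific values of $n$, $\sigma$, and $\theta$, are assumptions made just for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def frequentist_risk(delta, theta, n=20, sigma=1.0, n_datasets=100_000):
    """Monte Carlo estimate of R(theta) = E_{x|theta}[ (delta(x) - theta)^2 ],
    assuming the data x are n i.i.d. draws from N(theta, sigma^2)."""
    datasets = rng.normal(theta, sigma, size=(n_datasets, n))
    estimates = np.apply_along_axis(delta, 1, datasets)
    return np.mean((estimates - theta) ** 2)

# For the sample mean, theory gives R(theta) = sigma^2 / n = 0.05 here:
print(frequentist_risk(np.mean, theta=2.0))
```

For the sample mean in this model the answer happens to be the same for every $\theta$, but for other procedures the risk curve $R(\theta)$ can vary a lot with $\theta$.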

Bayesian loss analysis

In the Bayesian world, we assume that our unknown $\theta$ is random. Since we observe a particular dataset $x$, we’ll be a lot more interested in the randomness in $\theta$ than the randomness in $x$. So, we’ll condition on the particular dataset we got, and look at the average across different possibilities for the unknown parameter $\theta$. We’ll call our average the Bayesian posterior risk:

$$
\begin{align*}
\rho(x) &= E_{\theta|x}\left[\ell(\delta(x), \theta)\right] \\
&= \begin{cases} \displaystyle \sum_\theta \ell(\delta(x), \theta)\, p(\theta|x) & \text{if $\theta$ discrete} \\ \displaystyle \int \ell(\delta(x), \theta)\, p(\theta|x)\, d\theta & \text{if $\theta$ continuous} \end{cases}
\end{align*}
$$

The Bayesian posterior risk is a function of $x$. It tells us: given that we observed a particular dataset $x$, how poorly does the procedure $\delta$ do, averaged over all possible values of the parameter $\theta$?
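
Here’s a corresponding sketch for the posterior risk under the $\ell_2$ loss, using a conjugate Beta-Binomial model. The prior, the observed data, and the candidate decisions are all illustrative assumptions, not something fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: theta is a coin's heads probability with a Beta(2, 2)
# prior, and we observe x = 7 heads in n = 10 flips. By conjugacy, the
# posterior is Beta(2 + 7, 2 + 3).
a_post, b_post = 2 + 7, 2 + 3
theta_samples = rng.beta(a_post, b_post, size=200_000)

def posterior_risk(decision, theta_samples):
    """Monte Carlo estimate of rho(x) = E_{theta|x}[ (decision - theta)^2 ]."""
    return np.mean((decision - theta_samples) ** 2)

post_mean = a_post / (a_post + b_post)           # 9/14, about 0.64
print(posterior_risk(post_mean, theta_samples))  # roughly the posterior variance
print(posterior_risk(0.9, theta_samples))        # a worse decision: larger risk
```

Under the $\ell_2$ loss, the posterior mean is the decision that minimizes the posterior risk, which is why the first number comes out smaller.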

Comparing frequentist and Bayesian risk

Operationally, both of these look kind of similar: we’re averaging the loss with respect to some conditional probability distribution. But conceptually, they’re very different: the frequentist risk fixes the parameter, and averages over all the data; the Bayesian posterior risk fixes the data, and averages over all parameters.

Example: frequentist risk for $\ell_2$ loss and the bias-variance tradeoff

Let’s work through an example computing the frequentist risk using the $\ell_2$ loss. We’ll find that the result can give us some important insights.

$$
\begin{align}
R(\theta) &= E_{x|\theta}\left[\ell(\delta(x), \theta)\right] \\
&= E_{x|\theta}\Big[\big(\delta(x) - \theta\big)^2\Big]
\end{align}
$$

To make the math work out later, we’ll add and subtract the term $E_{x|\theta}[\delta(x)]$. Before we work out the result, let’s think about what this term means. It’s the average value of the procedure $\delta$: in other words, for a particular $\theta$, it tells us what value of $\delta(x)$ we should expect to get, averaged across different possible values of $x$.

$$
\begin{align}
R(\theta) &= E_{x|\theta}\Big[\big(\delta(x) - \theta\big)^2\Big] \\
&= E_{x|\theta}\Big[\big( \delta(x) \overbrace{- E_{x|\theta}[\delta(x)] + E_{x|\theta}[\delta(x)]}^{=0} - \theta \big)^2\Big] \\
&= E_{x|\theta}\Big[\big( \underbrace{\delta(x) - E_{x|\theta}[\delta(x)]}_{\text{prediction minus avg. prediction}} + \underbrace{E_{x|\theta}[\delta(x)] - \theta}_{\text{avg. prediction minus true value}} \big)^2\Big]
\end{align}
$$

To make the math a little easier to read, we’ll write $\delta = \delta(x)$ and $\bar{\delta} = E_{x|\theta}[\delta(x)]$:

$$
\begin{align}
R(\theta) &= E_{x|\theta}\Big[\big( \delta(x) - E_{x|\theta}[\delta(x)] + E_{x|\theta}[\delta(x)] - \theta \big)^2\Big] \\
&= E_{x|\theta}\Big[\big( \delta - \bar{\delta} + \bar{\delta} - \theta \big)^2\Big] \\
&= E_{x|\theta}\Big[ \big(\delta - \bar{\delta}\big)^2 + \underbrace{2\big(\delta - \bar{\delta}\big)\big(\bar{\delta} - \theta\big)}_{\text{expectation }=\,0} + \big(\bar{\delta} - \theta\big)^2 \Big] \\
&= E_{x|\theta}\Big[\big(\delta - \bar{\delta}\big)^2\Big] + E_{x|\theta}\Big[\big(\bar{\delta} - \theta\big)^2\Big] \\
&= \underbrace{E_{x|\theta}\Big[\big(\delta - \bar{\delta}\big)^2\Big]}_{\text{variance of }\delta(x)} + \big(\underbrace{\bar{\delta} - \theta}_{\text{bias of }\delta(x)}\big)^2
\end{align}
$$

The cross term vanishes because $\bar{\delta} - \theta$ doesn’t depend on $x$ and $E_{x|\theta}[\delta - \bar{\delta}] = 0$; for the same reason, $\big(\bar{\delta} - \theta\big)^2$ is a constant, so its expectation is just itself.

We’ve shown that for the $\ell_2$ loss, the frequentist risk is the sum of two terms, called the variance and the square of the bias.

The variance, $E_{x|\theta}\Big[\big(\delta(x) - E_{x|\theta}[\delta(x)]\big)^2\Big]$, answers the question: as the data varies, how far away will $\delta$ be from its average value? In general, if your procedure $\delta$ is very sensitive to variations in the data, your variance will be high.

The bias, $E_{x|\theta}[\delta(x)] - \theta$, answers the question: how far is the average value of $\delta$ from the true parameter $\theta$? In general, if your procedure $\delta$ does a good job of capturing the complexity of predicting $\theta$, your bias will be low.

When trying to reduce the risk (average loss), most methods try to reduce the variance and/or the bias. Many methods for estimation and prediction try to deal with the tradeoff between variance and bias: ideally we’d like both to be as small as possible, but we often need to accept a little more of one in order to make big reductions in the other.
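
To see the decomposition numerically, here’s a sketch that simulates many datasets for a fixed $\theta$ and checks that variance plus squared bias matches the risk, for two procedures: the sample mean and a “shrunken” version of it. The normal model, the shrinkage factor, and the function names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.0, 1.0, 10
datasets = rng.normal(theta, sigma, size=(200_000, n))

def decompose(estimates, theta):
    """Return (variance, squared bias, risk) for a batch of estimates at a fixed theta."""
    variance = np.var(estimates)
    bias_sq = (np.mean(estimates) - theta) ** 2
    risk = np.mean((estimates - theta) ** 2)
    return variance, bias_sq, risk

# Unbiased but higher-variance: the sample mean.
print(decompose(datasets.mean(axis=1), theta))
# Biased but lower-variance: shrink the sample mean toward zero.
print(decompose(0.8 * datasets.mean(axis=1), theta))
```

In both rows, the first two numbers add up to the third (up to simulation noise). The shrunken estimator trades bias for variance; whether that trade pays off depends on how close the true $\theta$ is to zero.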

Bayes risk

The two risks above are obtained by taking the expectation with respect to either the data $x$ or the parameter $\theta$. What if we take the expectation with respect to both? The Bayes risk is exactly that:

$$
\begin{align*}
R(\delta) &= E_{x, \theta} [\ell(\delta(x), \theta)] \\
&= E_\theta [R(\theta)] \\
&= E_x [\rho(x)]
\end{align*}
$$

where the last two equalities follow from Fubini’s theorem (i.e., that we can do the integrals for the expectations in either order and get the same result). The Bayes risk is a single number that summarizes the procedure $\delta$. The name is somewhat misleading: it isn’t really Bayesian or frequentist.
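
As a final sketch, the Bayes risk can also be approximated by simulation: draw $\theta$ from a prior, draw a dataset given that $\theta$, and average the loss over both. The normal prior and likelihood below, and the two candidate procedures, are assumptions made only for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_risk(delta, n=10, sigma=1.0, n_sims=100_000):
    """Monte Carlo Bayes risk under squared-error loss, assuming
    theta ~ N(0, 1) and, given theta, n i.i.d. N(theta, sigma^2) observations."""
    thetas = rng.normal(0.0, 1.0, size=n_sims)
    datasets = rng.normal(thetas[:, None], sigma, size=(n_sims, n))
    estimates = np.apply_along_axis(delta, 1, datasets)
    return np.mean((estimates - thetas) ** 2)

# One number per procedure, so different procedures can be compared directly:
print(bayes_risk(np.mean))                             # sample mean
print(bayes_risk(lambda x: np.sum(x) / (len(x) + 1)))  # posterior mean under this prior
```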