
from IPython.display import YouTubeVideo

Regression review

This section is a review of linear regression. You may find it helpful to review Section 15.4 of the Data 100 textbook, which covers the corresponding material.

Recall that regression is a form of supervised learning. Given some data $x$ (typically a scalar or a vector), we’re trying to predict a single value $y$. You’ve seen cases where $y$ is a real number (linear regression) or a binary value $y \in \{0, 1\}$ (logistic regression). Let’s briefly review the setup for linear regression.

We have a collection of data $(x_1, y_1), \ldots, (x_n, y_n)$. We’re trying to predict $y_i$ from $x_i$, but our prediction won’t be perfect. We’ll use the notation $\hat{y}_i$ to represent the predicted value for data point $i$. We start by discussing how to get the predictions, and then move on to the relationship between the predictions and the actual observed values.

One dimension

In one dimension, we have data in the form $(x_1, y_1), \ldots, (x_n, y_n)$, where each $x_i$ is a scalar and each $y_i$ is a scalar. We form a linear prediction

$$\hat{y}_i = a x_i + b,$$

where $a$ is a slope and $b$ is an intercept. In one-dimensional linear regression, we compute $a$ and $b$ from the observed data points $(x_1, y_1), \ldots, (x_n, y_n)$.
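
As a quick illustration (our own sketch, not part of the original derivation), the standard least-squares formulas for $a$ and $b$ can be computed directly with numpy; the data values below are made up.

import numpy as np

# Hypothetical example data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope and intercept for the prediction y_hat = a*x + b.
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
a, b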

Multiple dimensions

In multiple linear regression, we still have data in the form $(x_1, y_1), \ldots, (x_n, y_n)$, but now each $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector. We can write

$$x_i = \left(x_{i1}, x_{i2}, \ldots, x_{id}\right)$$

Each entry of this vector corresponds to a different aspect of this data point that we’re using in our prediction. We call each of these a predictor or feature.

In multiple linear regression, we form our prediction for data point $i$, $\hat{y}_i$, as follows:

$$\hat{y}_i = \sum_j \beta_j x_{ij}$$

The $d$-dimensional vector $\beta = (\beta_1, \ldots, \beta_d)$ contains the coefficients for each predictor: linear regression involves figuring out what these are. We can write this in vector notation using the vectors $\beta$ and $x_i$:

$$\hat{y}_i = \beta^T x_i = x_i^T \beta$$

We can take this notation one step further, and construct a matrix with all the $x$ values for all data points and all features.

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

One entry of this matrix, $x_{ij}$, represents feature $j$ for data point $i$. If we consider the entire vector of predictions $\hat{y} = \left(\hat{y}_1, \ldots, \hat{y}_n\right)$, we can write the predictions in a fully vectorized way:

$$\hat{y} = X\beta$$

We can interpret each coefficient $\beta_j$ in the context of the model as follows: “if $x_j$ increases by $t$, then the model predicts that $y$ should increase by about $\beta_j \times t$.”
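
To make the matrix notation concrete, here is a small sketch (with made-up numbers) checking that the matrix-vector product $X\beta$ gives the same predictions as computing $\beta^T x_i$ one data point at a time.

import numpy as np

# Hypothetical design matrix X (n = 4 data points, d = 2 features) and coefficients beta.
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0],
              [2.0, 2.0]])
beta = np.array([0.7, -0.3])

y_hat = X @ beta                                              # all predictions at once
row_by_row = np.array([beta @ X[i] for i in range(len(X))])   # beta^T x_i for each i
np.allclose(y_hat, row_by_row)                                # True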

YouTubeVideo('rEUf3bW32jM')

Likelihoods and loss functions

In order to compute the vector of coefficients $\beta$, we need some way to connect the predictions $\hat{y}_i$ (which are based on $\beta$) with the actual observed values $y_i$. You’ve seen two ways of doing this:

  1. A loss function between the prediction $\hat{y}$ and the observed value $y$; we can minimize this loss function to find $\beta$.

  2. A probabilistic model that describes the errors $\epsilon = y - \hat{y}$ as random variables, and tries to maximize the likelihood of the data under that model.

Loss functions

Recall that in ordinary least-squares linear regression, we try to find the value of $\beta$ that minimizes the mean squared error (MSE). We can write the MSE as follows:

$$\text{MSE} = \frac{1}{n}\sum_i (y_i - \beta^T x_i)^2$$

We can also write it in terms of the $\ell_2$ norm of the vector $y - \hat{y}$:

$$\text{MSE} = \frac{1}{n}\left\|y - \hat{y}\right\|_2^2 = \frac{1}{n}\left\|y - X\beta\right\|_2^2$$

where for any vector $z$, the $\ell_2$ norm of $z$ is $\|z\|_2 = \sqrt{\sum_i z_i^2}$.

We want to choose a value for $\beta$ that makes this as small as possible:

$$\hat{\beta} = \text{argmin}_\beta \|y - X\beta\|_2^2$$
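
In practice this minimization is usually handled by a library; as a sketch (with synthetic data we generate here), np.linalg.lstsq solves the least-squares problem directly, and the normal equations give the same answer.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve argmin_beta ||y - X beta||_2^2 directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same solution via the normal equations, (X^T X)^{-1} X^T y.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
beta_hat, beta_normal_eq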

Likelihood and noise

We’ll build heavily on the formulation in this section, so make sure you understand what’s going on here!

We can also describe the errors in our model:

$$y_i = \beta^T x_i + \epsilon_i,$$

where $\epsilon_i \sim N(0, \sigma^2)$ is a random variable that represents the noise, or error, in the observed value. We can vectorize this noise, too: we’ll write $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ so that the vector $\epsilon$ has a multivariate normal distribution $\epsilon \sim N(0, \sigma^2 I_n)$. We can then write:

$$y = X\beta + \epsilon,$$

or equivalently, using properties of the normal distribution,

$$y \mid \beta \sim N(X\beta, \sigma^2 I_n).$$

We can interpret this equation as telling us that the average prediction is $X\beta$. This equation is a likelihood model: it tells us the likelihood of the data $y$ given the parameter(s) $\beta$. Note that we are treating $X$ as fixed and known, so there’s no probability model associated with it. We’ll generally focus on this version, rather than the previous one.
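
One way to internalize this generative story is to simulate from it; the sketch below (our own, with made-up values of $\beta$ and $\sigma$) fixes $X$ and $\beta$ and draws $y \sim N(X\beta, \sigma^2 I_n)$.

import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 2
X = rng.normal(size=(n, d))           # treated as fixed and known
beta = np.array([2.0, -1.0])          # hypothetical "true" coefficients
sigma = 0.5

# y | beta ~ N(X beta, sigma^2 I_n): the average prediction plus Gaussian noise.
y = X @ beta + rng.normal(scale=sigma, size=n)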

Let’s think about the implicit assumptions we’re making by choosing a normal likelihood. Recall that under a normal distribution, we’re very unlikely to see values more than $3\sigma$ away from the mean. That means that we’re implicitly assuming that the vast majority of $y$-values we see will be within $3\sigma$ of the average prediction $X\beta$. This helps explain why linear regression is sensitive to outliers: the likelihood of a point very far away from the average prediction is very small, but the likelihood of several points all somewhat far away is much higher, so the fit will shift to avoid leaving any single point extremely far away, even at the cost of being somewhat further from many other points.
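
To see this sensitivity in action, here is a small sketch (our own illustration) comparing the least-squares slope with and without a single extreme point appended to otherwise well-behaved data.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.3, size=30)
X = x[:, None]

beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

# Append one point roughly 60 sigma away from its average prediction and refit.
X_outlier = np.vstack([X, [[1.0]]])
y_outlier = np.append(y, 20.0)
beta_with_outlier, *_ = np.linalg.lstsq(X_outlier, y_outlier, rcond=None)

beta_clean, beta_with_outlier         # the single outlier shifts the slope noticeably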

This model is often referred to as ordinary least squares, or OLS.

Under this model, one reasonable way to estimate $\beta$ is to choose the value that maximizes the likelihood. When choosing a value of $\beta$ to maximize the likelihood, we note that we don’t actually care about the normalizing constant in the normal distribution. So, we can write:

$$\begin{align} \hat{\beta} &= \text{argmax}_\beta \exp\left\{-\frac{1}{2}(y - X\beta)^T(\sigma^2 I_n)^{-1}(y-X\beta)\right\} \\ &= \text{argmax}_\beta \exp\left\{-\frac{1}{2\sigma^2}\|y - X\beta\|_2^2\right\} \end{align}$$

Just as we did in our earlier foray into maximum likelihood estimation, we’ll take advantage of the fact that the $\log$ function is monotonically increasing, and optimize the log-likelihood. Furthermore, we’ll make this a minimization rather than a maximization. In general, for any well-behaved function $f$:

$$\begin{align} \text{argmax}_\theta f(\theta) &= \text{argmax}_\theta \log(f(\theta)) \\ &= \text{argmin}_\theta \left[-\log(f(\theta))\right] \end{align}$$

So, we can write:

$$\begin{align} \hat{\beta} &= \text{argmax}_\beta \exp\left\{-\frac{1}{2\sigma^2}\|y - X\beta\|_2^2\right\} \\ &= \text{argmin}_\beta \left[\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 \right] \\ &= \text{argmin}_\beta \|y - X\beta\|_2^2 \end{align}$$

So, we’ve found that maximizing the Gaussian likelihood of the data is exactly equivalent to minimizing the squared loss. This is true in general for regression problems: we can arrive at the same answer by either choosing a loss function and minimizing it, or choosing a corresponding likelihood and maximizing it.
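
As a numerical sanity check (our own sketch, using scipy’s generic optimizer rather than the closed form), minimizing the negative Gaussian log-likelihood recovers the same $\hat{\beta}$ as solving the least-squares problem.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 60, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.2, size=n)
sigma = 0.2

def neg_log_likelihood(beta):
    # Negative Gaussian log-likelihood of y given beta, up to an additive constant.
    return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
np.allclose(beta_mle, beta_ols, atol=1e-4)   # True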

Uncertainty in regression predictions

It’s important to remember that when making a prediction for a new data point, there are multiple sources of uncertainty. Recall that our model states that $y = X\beta + \epsilon$, so once we plug in our estimated coefficients, a new observation behaves like $X\hat{\beta} + \epsilon$. When making a prediction, we have some uncertainty in the first term, $X\hat{\beta}$, because the coefficients we estimate depend on the random data. We also have additional uncertainty from the second term, depending on how much variability the model estimates in the data around the average predictions.
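
A small simulation can help separate the two sources of uncertainty; this sketch (our own, with made-up parameter values) repeatedly redraws a training set, refits the coefficients, and looks at the spread of the point prediction $x_\text{new}^T \hat{\beta}$, to which the noise variance $\sigma^2$ is added for a new observation.

import numpy as np

rng = np.random.default_rng(7)
beta_true, sigma, n = np.array([1.0, -2.0]), 0.5, 40
x_new = np.array([0.5, 1.0])          # hypothetical new data point

point_predictions = []
for _ in range(1000):
    # Draw a fresh training set from the model and refit the coefficients.
    X = rng.normal(size=(n, 2))
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    point_predictions.append(x_new @ beta_hat)

var_from_estimating_beta = np.var(point_predictions)   # uncertainty from the first term
var_total = var_from_estimating_beta + sigma ** 2      # plus irreducible noise for a new observation
var_from_estimating_beta, var_total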

Logistic regression

Recall that in logistic regression, we’re trying to predict binary outputs: $y_i \in \{0, 1\}$. We’re trying to predict the probability that $y_i$ will be 1, which we’ll call $\hat{y}_i$:

$$\hat{y}_i = \sigma(\beta^T x_i),$$

where $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function, which converts real values to values between 0 and 1. To find $\beta$, we minimize the binary cross-entropy loss:

$$\hat{\beta} = \text{argmin}_\beta \sum_i -\left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln(1-\hat{y}_i) \right]$$

You’ll show on the discussion worksheet that if we assume the likelihood model for $y$ is Bernoulli with parameter $\sigma(\beta^T x_i)$, then maximizing the likelihood is equivalent to minimizing the binary cross-entropy loss.
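
As a quick illustration (a sketch with synthetic data; note that scikit-learn adds an L2 penalty by default, which the setup above doesn’t include), LogisticRegression’s predicted probabilities are exactly the sigmoid applied to the fitted linear combination.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Note: scikit-learn regularizes by default (controlled by the C parameter).
clf = LogisticRegression().fit(X, y)

# Predicted probabilities match sigmoid(beta^T x_i + intercept).
probs = clf.predict_proba(X)[:, 1]
manual = 1 / (1 + np.exp(-(X @ clf.coef_.ravel() + clf.intercept_[0])))
np.allclose(probs, manual)   # True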

For a deeper refresher on logistic regression, see Chapter 23 of the Data 100 textbook. Note that our notation is slightly different:

  • We’re using $\beta$ instead of $\theta$ for the coefficients

  • We’re using $\hat{y}$ for the predictions instead of $f_{\hat{\theta}}$.


Fitting models with scikit-learn

You may recall that the easiest way to fit a linear model is to use the LinearRegression model from scikit-learn. We can then inspect the coefficients and intercept using the coef_ and intercept_ attributes respectively, and make predictions on new data points using the predict() method.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: 8 data points, each with 2 features, and random targets.
x = np.random.normal(size=[8, 2])
y = np.random.normal(size=8)

# Fit the model, then inspect the intercept and coefficients.
model = LinearRegression()
model.fit(x, y)
model.intercept_, model.coef_
(-0.0638684061258262, array([ 0.15201513, -0.62278861]))

In this chapter, we’ll use two new packages for regression in the frequentist and Bayesian paradigms: statsmodels, which fits linear models using a frequentist approach, and Bambi, which uses PyMC for approximate inference in Bayesian linear models.
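
As a preview (a minimal sketch with synthetic data; the chapter will walk through the actual workflow), a frequentist fit with statsmodels might look roughly like this.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = x @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=50)

# statsmodels does not add an intercept automatically, so add a constant column.
X_design = sm.add_constant(x)
ols_results = sm.OLS(y, X_design).fit()
ols_results.params            # intercept followed by the two coefficients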

Known, unknown, random, and fixed

In regression modeling, we typically use a probability model for the observed target values $y$, but we tend to assume that the predictors $X$ are fixed and known. To summarize:

| Variable | Description | Known or unknown? | Fixed or random? (Bayesian) | Fixed or random? (frequentist) |
| --- | --- | --- | --- | --- |
| $x$ | Predictors | Known | Fixed, known | Fixed, known |
| $y$ | Target values | Known for training set, unknown for test set | Random | Random |
| $\beta$ | Coefficients | Unknown | Random | Fixed |