from IPython.display import YouTubeVideo

Regression review#

This section is a review of linear regression. You may find it helpful to review Section 15.4 of the Data 100 textbook, which covers the corresponding material.

Recall that regression is a form of supervised learning. Given some data \(x\) (typically a scalar or a vector), we’re trying to predict a single value \(y\). You’ve seen cases where \(y\) is a real number (linear regression) or a binary value \(\in \{0, 1\}\) (logistic regression). Let’s briefly review the setup for linear regression.

We have a collection of data \((x_1, y_1), \ldots, (x_n, y_n)\). We’re trying to predict \(y_i\) from \(x_i\), but our prediction won’t be perfect. We’ll use the notation \(\hat{y}_i\) to represent the predicted value for data point \(i\). We start by discussing how to get the predictions, and then move on to the relationship between the predictions and the actual observed values.

One dimension#

In one dimension, we have data in the form \((x_1, y_1), \ldots, (x_n, y_n)\), where each \(x_i\) is a scalar and each \(y_i\) is a scalar. We form a linear prediction

\[ \hat{y}_i = ax_i + b, \]

where \(a\) is a slope and \(b\) is an intercept. In one-dimensional linear regression, we compute \(a\) and \(b\) from the observed data points \((x_1, y_1), \ldots, (x_n, y_n)\).
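As a quick illustration, here’s a minimal NumPy sketch that computes \(a\) and \(b\) using the standard closed-form least-squares expressions; the data below is synthetic and purely for illustration.

import numpy as np

# Synthetic one-dimensional data (for illustration only): roughly y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

# Closed-form least-squares estimates of the slope and intercept
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

y_hat = a_hat * x + b_hat  # prediction for each observed x_i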

Multiple dimensions#

In multiple linear regression, we still have data in the form \((x_1, y_1), \ldots, (x_n, y_n)\), but now each \(x_i \in \mathbb{R}^d\) is a \(d\)-dimensional vector. We can write

\[ x_i = \left(x_{i1}, x_{i2}, \ldots, x_{id}\right) \]

Each entry of this vector corresponds to a different aspect of this data point that we’re using in our prediction. We call each of these a predictor or feature.

In multiple linear regression, we form our prediction for data point \(i\), \(\hat{y}_i\), as follows:

\[ \hat{y}_i = \sum_j \beta_j x_{ij} \]

The \(d\)-dimensional vector \(\beta = (\beta_1, \ldots, \beta_d)\) contains the coefficients for each predictor: linear regression involves figuring out what these are. We can write this in vector notation using the vectors \(\beta\) and \(x_i\):

\[ \hat{y}_i = \beta^T x_i = x_i^T \beta \]

We can take this notation one step further, and construct a matrix with all the \(x\) values for all data points and all features.

\[\begin{split} X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} \end{split}\]

One entry of this matrix, \(x_{ij}\), represents feature \(j\) for data point \(i\). If we consider the entire vector of predictions \(\hat{y} = \left(\hat{y}_1, \ldots, \hat{y}_n\right)\), we can write the predictions in a fully vectorized way:

\[ \hat{y} = X\beta \]

We can interpret each coefficient \(\beta_j\) in the context of the model as follows: “if feature \(j\) increases by \(t\) while all other features stay the same, then the model predicts that \(y\) should increase by about \(\beta_j \times t\).”
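To make the vectorized form concrete, here’s a small NumPy sketch; the design matrix and coefficient values below are made up for illustration.

import numpy as np

# A made-up design matrix X with n = 4 data points and d = 3 features
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0],
              [1.5, 1.0, 2.0]])
beta = np.array([0.2, -1.0, 0.7])  # one coefficient per feature

y_hat = X @ beta  # all n predictions at once
# Each entry equals beta^T x_i, i.e., X[i] @ beta for row i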

YouTubeVideo('rEUf3bW32jM')

Likelihoods and loss functions#

In order to compute the vector of coefficients \(\beta\), we need some way to connect the predictions \(\hat{y}_i\) (which are based on \(\beta\)) with the actual observed values \(y_i\). You’ve seen two ways of doing this:

  1. A loss function between the prediction \(\hat{y}\) and the observed value \(y\); we can minimize this loss function to find \(\beta\).

  2. A probabilistic model that describes the errors \(\epsilon = y - \hat{y}\) as random variables; we can maximize the likelihood of the data under that model to find \(\beta\).

Loss functions#

Recall that in ordinary least-squares linear regression, we try to find the value of \(\beta\) that minimizes the mean squared error (MSE). We can write the MSE as follows:

\[ \text{MSE} = \frac{1}{n}\sum_i (y_i - \beta^T x_i)^2 \]

We can also write it using the \(\ell_2\) norm of the vector \(y - \hat{y}\):

\[ \text{MSE} = \frac{1}{n}\left\|y - \hat{y}\right\|_2^2 = \frac{1}{n}\left\|y - X\beta\right\|_2^2 \]

where for any vector \(z\), the \(\ell_2\) norm of \(z\) is \(\|z\|_2 = \sqrt{\sum_i z_i^2}\).

We want to choose a value for \(\beta\) that makes this as small as possible:

\[ \hat{\beta} = \text{argmin}_\beta \|y - X\beta\|_2^2 \]
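In practice, we can compute this minimizer numerically. Here’s a minimal sketch using NumPy’s least-squares solver on simulated data; the design matrix, true coefficients, and noise level are all made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # made-up design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=100)

# np.linalg.lstsq solves argmin_beta ||y - X beta||_2^2 directly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same solution via the normal equations, (X^T X)^{-1} X^T y
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)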

Likelihood and noise#

We’ll build heavily on the formulation in this section, so make sure you have a solid understanding of what’s going on here!

We can also describe the errors in our model:

\[ y_i = \beta^T x_i + \epsilon_i, \]

where \(\epsilon_i \sim N(0, \sigma^2)\) is a random variable that represents the noise, or error, in the observed value. We can vectorize this noise, too: we’ll write \(\epsilon = (\epsilon_1, \ldots, \epsilon_n)\) so that the vector \(\epsilon\) has a multivariate normal distribution \(\epsilon \sim N(0, \sigma^2 I_n)\). We can then write:

\[ y = X\beta + \epsilon, \]

or equivalently, using properties of the normal distribution,

\[ y \mid \beta \sim N(X\beta, \sigma^2 I_n). \]

We can interpret this equation as telling us that the average prediction is \(X\beta\). This equation is a likelihood model: it tells us the likelihood of the data \(y\) given the parameter(s) \(\beta\). Note that we are treating \(X\) as fixed and known, so there’s no probability model associated with it. We’ll generally focus on this version, rather than the previous one.
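To make the generative story concrete, here’s a small sketch that simulates data from this likelihood model; the values of \(\beta\) and \(\sigma\) below are made up for illustration.

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.normal(size=(n, d))   # treated as fixed and known
beta = np.array([1.5, -0.8])  # made-up "true" coefficients
sigma = 0.5

# y | beta ~ N(X beta, sigma^2 I_n): the average prediction X beta plus Gaussian noise
epsilon = rng.normal(scale=sigma, size=n)
y = X @ beta + epsilon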

Let’s think about the implicit assumptions we’re making by choosing a normal likelihood. Recall that under a normal distribution, we’re very unlikely to see values more than \(3\sigma\) away from the mean. That means we’re implicitly assuming that the vast majority of \(y\)-values we see will be within \(3\sigma\) of the average prediction \(X\beta\). This helps explain why linear regression is sensitive to outliers: the likelihood of a point very far away from the average prediction is very small, but the likelihood of several points all somewhat far away is much higher, so the fit will shift toward an outlier rather than leave it extremely far from its prediction.

This model is often referred to as ordinary least squares, or OLS.

Under this model, one reasonable way to estimate \(\beta\) is to choose the value that maximizes the likelihood. When doing so, we note that the normalizing constant of the normal distribution doesn’t depend on \(\beta\), so we can ignore it. So, we can write:

\[\begin{split} \begin{align} \hat{\beta} &= \text{argmax}_\beta \exp\left\{-\frac{1}{2}(y - X\beta)^T(\sigma^2 I_n)^{-1}(y-X\beta)\right\} \\ &= \text{argmax}_\beta \exp\left\{-\frac{1}{2\sigma^2}\|y - X\beta\|_2^2\right\} \end{align} \end{split}\]

Just as we did in our earlier foray into maximum likelihood estimation, we’ll take advantage of the fact that the \(\log\) function is monotonically increasing, and optimize the log-likelihood. Furthermore, we’ll make this a minimization rather than a maximization. In general, for any well-behaved function \(f\):

\[\begin{split} \begin{align} \text{argmax}_\theta f(\theta) &= \text{argmax}_\theta \log(f(\theta)) \\ &= \text{argmin}_\theta \left[-\log(f(\theta))\right] \\ \end{align} \end{split}\]

So, we can write:

\[\begin{split} \begin{align} \hat{\beta} &= \text{argmax}_\beta \exp\left\{-\frac{1}{2\sigma^2}\|y - X\beta\|_2^2\right\} \\ &= \text{argmin}_\beta \left[\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 \right]\\ &= \text{argmin}_\beta \|y - X\beta\|_2^2 \end{align} \end{split}\]

So, we’ve found that maximizing the Gaussian likelihood of the data is exactly equivalent to minimizing the squared loss. This is true in general for regression problems: we can arrive at the same answer by either choosing a loss function and minimizing it, or choosing a corresponding likelihood and maximizing it.
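We can sanity-check this equivalence numerically. The sketch below minimizes the negative Gaussian log-likelihood with SciPy and compares the result to the least-squares solution; the simulated data and the value of \(\sigma\) are made up for illustration.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
beta_true = np.array([2.0, -1.0])
sigma = 0.7
y = X @ beta_true + rng.normal(scale=sigma, size=100)

def negative_log_likelihood(beta):
    # -log p(y | beta) under the model y ~ N(X beta, sigma^2 I_n)
    return -np.sum(norm.logpdf(y, loc=X @ beta, scale=sigma))

beta_mle = minimize(negative_log_likelihood, x0=np.zeros(2)).x
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta_mle and beta_ols agree up to numerical tolerance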

Uncertainty in Regression Predictions#

It’s important to remember that when making a prediction for a new data point, there are multiple sources of uncertainty. Recall that our model states that \(y = X\beta + \epsilon\). When making a prediction, we have some uncertainty in the first term, \(X\hat{\beta}\), because the coefficients we estimate depend on the random data. We also have additional uncertainty from the second term, depending on how much variability the model estimates in the data around the average predictions.
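One way to see these two sources separately is with a small simulation: refit the model on many datasets generated from the same true coefficients, and compare the spread of \(X\hat{\beta}\) at a new point to the noise variance \(\sigma^2\). The sketch below uses made-up values for everything.

import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 2
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -0.5])
sigma = 0.8
x_new = np.array([0.3, 1.2])  # a new point to predict at

# Refit on many simulated datasets to see how much x_new @ beta_hat varies
point_predictions = []
for _ in range(2000):
    y_sim = X @ beta_true + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y_sim, rcond=None)
    point_predictions.append(x_new @ beta_hat)

estimation_variance = np.var(point_predictions)  # uncertainty from estimating beta
noise_variance = sigma ** 2                      # uncertainty from the noise term epsilon
# The predictive variance for a new observation is roughly the sum of the two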

Logistic regression#

Recall that in logistic regression, we’re trying to predict binary outputs: \(y_i \in \{0, 1\}\). We’re trying to predict the probability that \(y_i\) will be 1, which we’ll call \(\hat{y}_i\):

\[ \hat{y}_i = \sigma(\beta^T x_i), \]

where \(\sigma(t) = \frac{1}{1 + e^{-t}}\) is the sigmoid function, which converts real values to values between 0 and 1. To find \(\beta\), we minimize the binary cross-entropy loss:

\[ \hat{\beta} = \text{argmin}_\beta \sum_i -\left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln(1-\hat{y}_i) \right] \]

You’ll show on the discussion worksheet that if we assume the likelihood model for \(y\) is Bernoulli with parameter \(\sigma(\beta^T x_i)\), then maximizing the likelihood is equivalent to minimizing the binary cross-entropy loss.
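As a concrete illustration, here’s a short NumPy sketch that computes the sigmoid predictions and the binary cross-entropy loss for a made-up \(\beta\) on simulated data.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
beta = np.array([1.0, -2.0])            # made-up coefficients
y = rng.binomial(1, sigmoid(X @ beta))  # binary outcomes in {0, 1}

y_hat = sigmoid(X @ beta)  # predicted probabilities that y_i = 1
bce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
# Logistic regression chooses beta to make this quantity as small as possible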

For a deeper refresher on logistic regression, see Chapter 23 of the Data 100 textbook. Note that our notation is slightly different:

  • We’re using \(\beta\) instead of \(\theta\) for the coefficients

  • We’re using \(\hat{y}\) for the predictions instead of \(f_{\hat{\theta}}\).

Fitting models with scikit-learn#

You may recall that the easiest way to fit a linear model is to use the LinearRegression model from scikit-learn. We can then inspect the coefficients and intercept using the coef_ and intercept_ attributes respectively, and make predictions on new data points using the predict() method.

import numpy as np
from sklearn.linear_model import LinearRegression
x = np.random.normal(size=[8, 2])  # 8 data points, each with 2 features
y = np.random.normal(size=8)       # 8 target values

model = LinearRegression()
model.fit(x, y)                    # estimate the intercept and coefficients
model.intercept_, model.coef_
(-0.0638684061258262, array([ 0.15201513, -0.62278861]))

In this chapter, we’ll use two new packages for regression in the frequentist and Bayesian paradigms: statsmodels, which fits linear models using a frequentist approach, and Bambi, which uses PyMC for approximate inference in Bayesian linear models.
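As a rough preview (not necessarily the exact code we’ll use later, and assuming both packages are installed), fitting a linear model with each package looks something like this, with the data stored in a pandas DataFrame with columns y, x1, and x2:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import bambi as bmb

rng = np.random.default_rng(6)
df = pd.DataFrame(rng.normal(size=(50, 2)), columns=["x1", "x2"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=50)

# Frequentist fit with statsmodels (add_constant adds an intercept column)
ols_results = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Bayesian fit with Bambi, which builds and samples a PyMC model under the hood
bambi_model = bmb.Model("y ~ x1 + x2", df)
bambi_results = bambi_model.fit()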

Known, unknown, random, and fixed#

In regression modeling, we typically use a probability model for the observed target values \(y\), but we tend to assume that the predictors \(X\) are fixed and known. To summarize:

| Variable | Description | Known or unknown? | Fixed or random? (Bayesian) | Fixed or random? (frequentist) |
| --- | --- | --- | --- | --- |
| \(x\) | Predictors | Known | Fixed, known | Fixed, known |
| \(y\) | Target values | Known for training set, unknown for test set | Random | Random |
| \(\beta\) | Coefficients | Unknown | Random | Fixed |