%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import pymc as pm
import bambi as bmb
import arviz as az
import statsmodels.api as sm

Generalized Linear Models#

In this section, we’ll learn about an extension to linear regression called generalized linear models, or GLMs. In particular, GLMs are similar to linear regression, with two important extensions:

  1. Instead of using the result of \(X\beta\) as the average prediction, we’ll apply a nonlinear function called the inverse link function or \(g^{-1}\) first, so that our average prediction becomes \(g^{-1}(X\beta)\). While we can use any arbitrary function here, we’ll see several examples that are particularly useful.

  2. Instead of assuming a normal distribution around the average prediction as our model for the likelihood of the data, we’ll allow for arbitrary likelihood distributions (but still centered around the average prediction \(g^{-1}(X\beta)\)).

We’ll work through an example that demonstrates why these kinds of models are useful, and how choosing different inverse link functions and likelihood models can change the prediction results we get.
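
To make these two extensions concrete, here is a minimal sketch, using made-up numbers rather than real data, of how a GLM computes its average prediction: a linear combination of the features, followed by an inverse link function (here, the exponential).

# A minimal sketch with made-up numbers: the average prediction in a GLM is
# the inverse link function applied to a linear combination of the features.
X = np.array([[1.0, 2.0],
              [1.0, 5.0]])            # two data points, with an intercept column
beta = np.array([0.5, 0.3])           # hypothetical coefficients
linear_part = X @ beta                # X beta, just as in ordinary linear regression
mean_prediction = np.exp(linear_part) # g^{-1} = exp, one common inverse link
mean_prediction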

Linear Regression from Bayesian and frequentist perspectives#

For the rest of this section, we’ll work with a dataset that contains the number of wind turbines built in each state since the year 2000, focusing on Oklahoma. It contains the following columns:

  • t_built: the number of turbines built each year

  • t_cap: the power capacity added that year

  • year: the year, stored as the number of years since 2000

  • totals: the total number of turbines in the state built since 2000

  • log_totals: the log of the total.

turbines = pd.read_csv('turbines.csv')
# The "year" column contains how many years since the year 2000
turbines['year'] = turbines['p_year'] - 2000
turbines = turbines.drop('p_year', axis=1)
turbines.head()

# Turbines in Oklahoma from 2000 on
ok_filter = (turbines.t_state == 'OK') & (turbines.year >= 0)
ok_turbines = turbines[ok_filter].sort_values('year')
ok_turbines["totals"] = np.cumsum(ok_turbines["t_built"])
# Log-transform the counts, too
ok_turbines["log_totals"] = np.log(ok_turbines["totals"])
ok_turbines.head(n=5)
     t_state  t_built     t_cap  year  totals  log_totals
361       OK        1     100.0   1.0       1    0.000000
362       OK      113  176250.0   3.0     114    4.736198
363       OK      182  298200.0   5.0     296    5.690359
364       OK       40   60000.0   6.0     336    5.817111
365       OK       85  154500.0   7.0     421    6.042633
ok_turbines.plot('year', 'totals');
[Figure: total turbine count in Oklahoma vs. years since 2000]

From looking at this data, we can immediately see that linear regression might not be a good fit: the relationship between the two variables seems exponential rather than linear. We can address this in one of two ways:

  1. Log-transform our output data, so that we’re predicting \(\log(\)turbine count\()\), and then use a linear model.

  2. Incorporate the exponential relationship into our model.

We’ll start with the first approach and see that it does reasonably well, and then see how GLMs can help us do even better by taking the second approach.

ok_turbines.plot('year', 'log_totals');
[Figure: log of total turbine count vs. years since 2000]

With the exception of the outlier at the first year (\(t=0\), at the year 2000), a linear model seems like a good fit here.

Since linear regression is a statistical model where we are estimating an unknown (model coefficients) from random data, we can approach it from either a frequentist or Bayesian paradigm. Let’s see how a linear model for the log-counts will work under each of these settings.

Turbine Data Linear Regression: Frequentist Approach#

In the frequentist paradigm, we treat our unknown coefficients as fixed, and estimate them using maximum likelihood (or similar techniques). While using scikit-learn as we saw in the previous section was a perfectly valid way to implement linear regression in a frequentist paradigm, it doesn’t support many of the generalized linear models we’ll want to use throughout this chapter. So, we’ll instead use the statsmodels package in this chapter (imported using import statsmodels.api as sm).

We’ll primarily use the sm.GLM class, which takes in an array or series for y, an array or dataframe for X, and a model family: we’ll explore more of these families later, but for now we’ll stick with ordinary least squares, which corresponds to sm.families.Gaussian(). In order to include an intercept in our model, we need to augment our data by applying the sm.add_constant() function.

Here’s what the model above would look like:

gaussian_model_intercept = sm.GLM(
    np.log(ok_turbines.totals), sm.add_constant(ok_turbines.year),
    family=sm.families.Gaussian()
)
gaussian_results = gaussian_model_intercept.fit()
print(gaussian_results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                 totals   No. Observations:                   17
Model:                            GLM   Df Residuals:                       15
Model Family:                Gaussian   Df Model:                            1
Link Function:               Identity   Scale:                          1.1810
Method:                          IRLS   Log-Likelihood:                -24.472
Date:                Mon, 21 Oct 2024   Deviance:                       17.716
Time:                        11:20:11   Pearson chi2:                     17.7
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9131
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2602      0.590      5.526      0.000       2.104       4.417
year           0.3023      0.047      6.435      0.000       0.210       0.394
==============================================================================

The model output shows us some useful information about the model at the top, and then information about the estimated coefficient(s) at the bottom:

  • coef, the coefficient itself,

  • std err, the standard error (i.e., standard deviation of the estimator)

  • z, the \(z\)-score for the coefficient for a hypothesis test where the null hypothesis states that it is 0,

  • P>|z|, the \(p\)-value for the hypothesis test above, and

  • [0.025 and 0.975], the lower and upper bounds for a \(95\%\) confidence interval for the estimated coefficient.
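
If you’d rather work with these quantities programmatically than read them off the printed summary, the fitted results object also provides them as attributes and methods:

# Coefficient estimates, standard errors, and 95% confidence intervals,
# pulled directly from the fitted results object
print(gaussian_results.params)      # coefficient estimates (const and year)
print(gaussian_results.bse)         # standard errors
print(gaussian_results.conf_int())  # 95% confidence intervals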

This model tells us that we can predict the log of the number of turbines \(N\) in any year \(t\) as follows:

\[ \log(N) = 3.26 + 0.30 \times t \]

In most cases, we’re more interested in the actual turbine count, rather than the log: so, we can exponentiate both sides to obtain:

\[ N = e^{3.26} e^{0.3t} \]

This also tells us that according to this model, each year, the number of turbines is multiplied by \(e^{0.3}\), or roughly \(1.35\).
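
We can compute this growth factor directly from the fitted coefficient:

# The estimated year coefficient, exponentiated to get the yearly growth factor
np.exp(gaussian_results.params['year'])  # roughly 1.35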

We can also visualize this prediction:

ok_turbines['pred_freq'] = np.exp(3.2602 + 0.3023 * ok_turbines['year'])

plt.plot(ok_turbines['year'], ok_turbines['totals'], label='Actual count')
plt.plot(ok_turbines['year'], ok_turbines['pred_freq'], label='Predicted count')
plt.legend();
[Figure: actual and predicted turbine counts from the frequentist linear model on log-counts]

We can see that the predictions from this model are reasonable from 2000-2016, but then start to diverge from reality after that.
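
If we wanted predictions for years beyond our data (say, 2021 through 2023, used here purely as hypothetical new inputs), one way is to build the same kind of design matrix we used for fitting (a constant column plus the year) and pass it to predict, remembering to exponentiate since the model predicts log-counts. A sketch:

# Predict log-counts for new years, then exponentiate to get counts.
# The design matrix must match the one used for fitting: a constant plus the year.
new_years = np.array([21, 22, 23])   # 2021-2023
new_X = np.column_stack([np.ones_like(new_years, dtype=float), new_years])
predicted_counts = np.exp(gaussian_results.predict(new_X))
predicted_counts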

Turbine Data Linear Regression: Bayesian Approach#

ok_turbines
    t_state  t_built      t_cap  year  totals  log_totals     pred_freq
361      OK        1      100.0   1.0       1    0.000000     35.251215
362      OK      113   176250.0   3.0     114    4.736198     64.528049
363      OK      182   298200.0   5.0     296    5.690359    118.119875
364      OK       40    60000.0   6.0     336    5.817111    159.812300
365      OK       85   154500.0   7.0     421    6.042633    216.220777
366      OK       91   141900.0   8.0     512    6.238325    292.539588
367      OK      153   299100.0   9.0     665    6.499787    395.796425
368      OK      195   352260.0  10.0     860    6.756932    535.499523
369      OK      257   524900.0  11.0    1117    7.018402    724.513214
370      OK      596  1127050.0  12.0    1713    7.446001    980.242510
371      OK      369   648100.0  14.0    2082    7.641084   1794.353376
372      OK      710  1399960.0  15.0    2792    7.934513   2427.701005
373      OK      602  1457525.0  16.0    3394    8.129764   3284.599481
374      OK      323   850725.0  17.0    3717    8.220672   4443.954890
375      OK      272   543245.0  18.0    3989    8.291296   6012.524565
376      OK       33   100050.0  19.0    4022    8.299535   8134.747659
377      OK      440  1120550.0  20.0    4462    8.403352  11006.045591

In the Bayesian paradigm, we treat our unknown coefficients as random, and compute the posterior distribution over them, conditioned on the data we observe. While we could implement the entire model in PyMC, the Bambi package provides a convenient layer on top of PyMC that lets us specify our models with less code. We’ll use the bmb.Model class, which takes a formula, a dataframe, and a model family (for now, we’ll stick with the gaussian family for OLS). The formula takes the form y ~ predictor1 + predictor2 + ..., where each variable is a column name from the provided dataframe. Here’s how we implement the model from above:

# The y-values are in the column `log_totals`, and the x-values are in the column `year`. So:
gaussian_model = bmb.Model(formula='log_totals ~ year', data=ok_turbines, family="gaussian")
gaussian_trace = gaussian_model.fit(random_seed=0)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, year]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.

We can see the results using the arviz function plot_trace:

az.plot_trace(gaussian_trace)
plt.tight_layout()
[Figure: trace plot (posterior distributions and sample trajectories) for the Gaussian model]

The first two rows each represent one coefficient. The third row, log_totals_sigma, is the estimated standard deviation of the data points around the average prediction. The left column shows the posterior distribution for each parameter (a KDE of the posterior samples), and the right shows the trajectory of samples obtained during HMC.

Looking at the posterior distribution for the coefficient of year (second row), we see that the results are similar to the frequentist version: the MAP and LMSE estimates are around 0.3. From the histogram, we should expect that under the posterior, this coefficient will usually be between \(0.2\) and \(0.4\), with values outside this range fairly unlikely. This uncertainty is reasonable: since we’re only estimating these coefficients from seventeen data points, it makes sense to have some uncertainty in the coefficients we estimate. Just as before, we can interpret these estimates in the context of the data: each year, the number of turbines increases by a multiplicative factor of about \(e^{0.3} \approx 1.35\).
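
We can also summarize the posterior numerically: ArviZ’s summary function reports the posterior mean, standard deviation, and a credible interval for each parameter, which complements the trace plots above.

# Numerical summary of the posterior: means, standard deviations, and
# highest-density intervals for each parameter
az.summary(gaussian_trace)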

When we use sampling for inference, we can think of each sample as one regression line: each sample has a value of intercept, a value of the coefficient for year, and an estimated \(\sigma\) for the residuals. We can therefore plot each sample as a line:

def identity(x):
    return x

def plot_posterior_samples(trace, turbines_df, num_lines=40, show_logy=True):
    """Plot the regression line for a random subset of posterior samples,
    either on the log scale or transformed back to raw counts."""
    f, ax = plt.subplots(1, 1)
    intercept = trace.posterior['Intercept'].values.flatten()
    slope = trace.posterior['year'].values.flatten()
    indices = np.random.choice(np.arange(slope.size), num_lines, replace=False)
    # On the log scale, plot the average prediction directly; otherwise,
    # exponentiate to get back to raw counts
    if show_logy:
        y_func = identity
    else:
        y_func = np.exp

    # Each posterior sample (intercept, slope) corresponds to one line
    for i in indices:
        pred = y_func(intercept[i] + slope[i] * turbines_df['year'])
        ax.plot(turbines_df['year'], pred, color='gray', alpha=0.3)

    ax.scatter(turbines_df['year'], y_func(turbines_df['log_totals']))
    ax.set_xlabel('Years since 2000')
    if show_logy:
        ax.set_ylabel('log(turbine count)')
    else:
        ax.set_ylabel('turbine count')

plot_posterior_samples(gaussian_trace, ok_turbines, show_logy=True)
[Figure: posterior sample regression lines on the log scale]
plot_posterior_samples(gaussian_trace, ok_turbines, show_logy=False)
[Figure: posterior sample prediction curves on the count scale]

These plots help us see that many of the lines have a slope that is too steep on the log scale, likely due to the outlier at \(t=0\). On the count scale, this corresponds to curves that rise too quickly after 2016.

It’s important to note that this only shows us the uncertainty in the average prediction intercept + coefficient * year: it doesn’t show us the variation around the average prediction. Specifically, remember that we obtain predictions in linear regression by assuming \(y \sim \mathcal{N}(X\beta, \sigma^2 I)\). These lines show us the uncertainty in the average prediction \(X\beta\), but don’t show us how large the estimated standard deviation \(\sigma\) is: we’ll come back to that later.

Exercise: How might you construct similar plots to visualize the uncertainty for the frequentist results from earlier?

Where are the Priors?#

You may have noticed that we just implemented a Bayesian model, but never specified any priors. In this case, Bambi chooses “sensible” defaults for us, inspired by the R library rstanarm. These defaults are chosen to be weak, uninformative priors that don’t encode much information about what the coefficients should be.

You can read more about these default choices at the rstanarm documentation.
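
If you’re curious what those defaults actually are, one way to inspect them (depending on your version of Bambi) is to print the model object, which lists the prior assigned to each term:

# Print the model specification, including the automatically chosen priors
print(gaussian_model)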

Linear regression summary#

We saw that linear regression can be implemented in both frequentist and Bayesian paradigms: at the end of the day, both approaches give us estimated coefficients, along with some measure of uncertainty. We can use those coefficients to interpret the model (as we saw above), and to make predictions for new data points.

We saw that when making predictions with linear regression, we start by multiplying each feature by its corresponding coefficient, adding them all up to obtain an average prediction, and assuming some randomness around that average.

In order to determine the coefficients from data, we can take either a frequentist or Bayesian approach. In the frequentist paradigm, we estimated them using frequentist methods like MLE, and in the Bayesian paradigm we approximated the posterior distribution over the coefficients (conditioned on the observed data) using samples.

For the remainder of this chapter, we’ll move back and forth between frequentist and Bayesian paradigms to illustrate ideas in each one.

Beyond Linear Regression: GLMs#

The regression model above works somewhat well for years before 2016, but it doesn’t account for the fact that the variable we’re predicting is a whole number (meaning that it takes values \(0, 1, 2, 3, \ldots\)). When we say that \(y|\beta \sim N(X\beta, \sigma^2 I)\), and use log-transformed data for \(y\), we’re implicitly saying that \(y\) can never be 0. We’ll start by asking the question: can we use a different likelihood that is designed specifically for this kind of data? In answering this, we’ll explore two examples of generalized linear models: Poisson regression and negative binomial regression.

Poisson regression#

You may find it helpful to review Section 7.1 of the Data 140 textbook, which covers the Poisson distribution.

Recall that the Poisson distribution is a distribution over counts and count-like values. It has one positive parameter \(\lambda\) that represents its mean (and variance, too).

In Poisson regression, we’re going to assume a Poisson likelihood for each \(y_i\), and we’ll use the linear combination \(x_i^T\beta\) to help determine the parameter. The Poisson distribution is only defined for positive values of the parameter, but \(x_i^T\beta\) could be negative. There are several ways to make the possibly-negative value into a positive one so we can use it for the parameter (e.g., take absolute value, square, etc.), but inspired by our log-transformation of the outputs in the previous section, we’ll exponentiate.

So, we’ll use \(\exp(x_i^T \beta)\) as the mean. We can write out our likelihood for the observed data points \(y_i\):

\[ y_i | \beta \sim \text{Poisson}(\exp(x_i^T \beta)) \]

As before, we can use this likelihood model to estimate the coefficients \(\beta\) in either a frequentist or Bayesian setting. In this section, we’ll choose to take a Bayesian approach, mostly so that we can take advantage of the function plot_posterior_samples that we wrote in the last section to help us visualize the uncertainty in our model.

Using Bambi, switching from linear regression to Poisson regression is quite straightforward:

poisson_model = bmb.Model(formula='totals ~ year', data=ok_turbines, family='poisson')
poisson_trace = poisson_model.fit(random_seed=0)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, year]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
az.plot_trace(poisson_trace)
plt.tight_layout()
[Figure: trace plot for the Poisson model]

Comparing this to the results from the Gaussian model, we can see that the posterior distributions for the two coefficients are much narrower. On top of that, the value of the coefficient for year seems much smaller: whereas before the posterior was centered around \(0.3\), now the entire distribution is narrowly concentrated around \(0.183\).

np.exp(0.183)
1.2008144080808307

This corresponds to only a \(20\%\) average growth rate, compared to our \(35\%\) average growth rate from the earlier model. The lower growth rate seems reasonable: recall that the resulting curve(s) from linear regression grew too fast, especially after 2016. But, the degree of certainty in the results is concerning: given that we only have 17 data points, our results should have a lot more uncertainty in them. We can visualize how “overconfident” the model is by plotting each sample as a line/curve using the plot_posterior_samples function:

plot_posterior_samples(poisson_trace, ok_turbines, show_logy=True)
[Figure: Poisson model posterior sample lines on the log scale]
plot_posterior_samples(poisson_trace, ok_turbines, show_logy=False)
[Figure: Poisson model posterior sample curves on the count scale]

Just as before, it’s important to remember that these lines only show us the estimated uncertainty in the average prediction \(X\beta\): we aren’t visualizing any of the uncertainty shown by the Poisson model itself.

Why are the results so different?

Let’s think about the implicit assumptions we’re making when choosing a Poisson likelihood. In particular, one important property of the Poisson distribution is that its mean is equal to its variance. So, while the linear model could allow for an average prediction of \(\log(N) = 4\) with a variance of \(1.5\), the Poisson distribution requires that the mean and variance must be the same.

This problem is compounded by the fact that the Poisson model works with the raw counts rather than the log-counts. For example, consider 2018, when the number of turbines was about 4000: a Poisson distribution with mean 4000 has a variance of 4000, or in other words, a standard deviation of only around 63.
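
We can check this quickly:

np.sqrt(4000)
63.245553203367585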

Clearly, the Poisson is a poor choice for fitting this data! When a model assumes a lower variance than is actually present in the data, as this Poisson model does, we say that the data are overdispersed: they’re more spread out than the model’s assumptions allow. In order to solve this problem, we should choose a different distribution that gives us the ability to control the variance as well as the mean.

This wasn’t a problem with the normal likelihood earlier: because the normal distribution has two separate parameters for mean and variance, we can choose them separately to reflect the fact that the variance may be higher than the mean.

Negative binomial regression#

The negative binomial distribution is also a distribution over counts, but it allows for more complexity than the Poisson distribution. We can think of it in one of two ways:

  • It’s the sum of \(r\) geometric random variables, each with success probability \(p\).

  • It’s like a Poisson distribution whose mean parameter (\(\lambda\) above) is itself random.

There are several different ways to parametrize the negative binomial. How do we choose which one to use? There are two answers:

  1. We want a parametrization that lets us choose the mean, since we want the mean value for \(y_i\) to be \(\exp(x_i^T \beta)\) (in other words, we want to do almost the same thing as we just did with Poisson regression, but we want to use a negative binomial distribution instead of a Poisson distribution for the observed values).

  2. Since we’re using PyMC through Bambi, we’re limited to whichever parametrization(s) it supports.
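
For reference, the parametrization PyMC uses has a mean \(\mu\) and a dispersion parameter \(\alpha\), with variance \(\mu + \mu^2/\alpha\). Here’s a small illustration with made-up numbers (the value of \(\alpha\) below is arbitrary, not estimated from the data) of how much extra spread this allows compared to a Poisson with the same mean:

# With mean 4000, a Poisson forces the variance to also be 4000, while a
# negative binomial adds extra spread controlled by a dispersion parameter.
mu = 4000
alpha = 10                                # arbitrary dispersion value, for illustration only
poisson_sd = np.sqrt(mu)                  # about 63
negbin_sd = np.sqrt(mu + mu**2 / alpha)   # roughly 1266
poisson_sd, negbin_sd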

Even though the form of the distribution is significantly more complex and manipulating it involves more work, using it in our regression model requires only a tiny change to what we were doing before:

negbin_model = bmb.Model(formula='totals ~ year', data=ok_turbines, family='negativebinomial')
negbin_trace = negbin_model.fit(random_seed=0)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, Intercept, year]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
az.plot_trace(negbin_trace)
plt.tight_layout()
[Figure: trace plot for the negative binomial model]

Here, the posterior distribution for the year coefficient has a wider spread, and the mean is around \(0.24\):

np.exp(0.24)
1.2712491503214047

This corresponds to a growth rate around 27%. We can visualize the uncertainty in our inference for the coefficients:

plot_posterior_samples(negbin_trace, ok_turbines, show_logy=True)
[Figure: negative binomial model posterior sample lines on the log scale]
plot_posterior_samples(negbin_trace, ok_turbines, show_logy=False)
[Figure: negative binomial model posterior sample curves on the count scale]

This seems to be a much better overall fit: the uncertainty in the various lines/curves matches up with the observed data, indicating that most of the prediction lines are somewhat reasonable given the observed data. It’s important once again to remember that the uncertainty visualized here is only in the average prediction, and that we aren’t visualizing the uncertainty in the estimated variance from the negative binomial model.

Comparing the Poisson model and the negative binomial model, we see that the two are very similar:

  • Both models took a linear combination of the features and coefficients, with somewhat similar coefficient values.

  • Both models transformed the result of the linear combination using an exponential function.

The key difference was our likelihood model for the observed data points around the average prediction: the Poisson model assumed a Poisson distribution, whose variance is determined by its mean. The negative binomial model, on the other hand, assumed a negative binomial distribution, whose variance is controlled by a separate dispersion parameter inferred from the data. This allowed the negative binomial model to better capture how spread out the observed counts are around the average prediction.
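
One way to quantify this difference is to compare the posterior spread of the year coefficient under the two models; the sd column reported by ArviZ should be noticeably larger for the negative binomial model:

# Compare the posterior spread of the `year` coefficient under the two models
print(az.summary(poisson_trace, var_names=['year']))
print(az.summary(negbin_trace, var_names=['year']))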

Exercise: Implement Poisson and negative binomial regression in the frequentist paradigm using statsmodels. Are the results similar?

Hint: you may find it helpful to use sm.families.Poisson() and sm.families.NegativeBinomial().

Generalized Linear Models#

By now, you’ve seen four different versions of regression in the Bayesian setting:

  • Linear regression, for predicting real-valued outputs

  • Logistic regression, for predicting binary outputs (classification)

  • Poisson regression, for predicting counts

  • Negative binomial regression, for predicting counts

Let’s review what they had in common and what was different between them:

  1. For all four, computing our prediction for \(y_i\) begins with computing \(x_i^T \beta\). This part is a linear function of \(x_i\), even if we do something nonlinear with it later.

  2. Each one had a different function that we used to compute the average value of \(y_i\) from \(x_i^T \beta\). Since this function links the linearly transformed input \(x\) to the output \(y\), you might expect us to call it the link function: this would make a lot of sense. However, the convention is to do the opposite, and call it the inverse link function. As you might expect from this name, the link function is the inverse of the inverse link function.

  3. For each one, we used a different distribution for the likelihood of the data points. In all cases, the mean of this distribution was always the output of the inverse link function above.

The following table summarizes the different choices of likelihood and link function for the four versions that we’ve seen:

Regression           Inverse link function   Link function   Likelihood
Linear               identity                identity        Gaussian
Logistic             sigmoid                 logit           Bernoulli
Poisson              exponential             log             Poisson
Negative binomial    exponential             log             Negative binomial
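
As a quick numerical sanity check of that inverse relationship (this uses scipy, which isn’t imported at the top of this section), we can verify that each link function undoes its inverse link:

from scipy.special import expit, logit  # expit is the sigmoid function

x = 1.7  # an arbitrary value of the linear combination x^T beta
print(np.isclose(logit(expit(x)), x))    # True: logit undoes the sigmoid
print(np.isclose(np.log(np.exp(x)), x))  # True: log undoes the exponential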

These ideas form the basis for what are known as Generalized Linear Models, or GLMs. Once we choose a link function and a likelihood distribution, our model is fully specified, and we can approximate the posterior distribution over the coefficients in \(\beta\).

GLM Workflow#

When making predictions with a GLM, here’s a general sequence to follow:

  1. Formulate your prediction problem by determining what you’re trying to predict (\(y\)) and what you’re using to predict it (\(x\)). This depends on the real-world problem that you’re trying to solve, and what data might be available to help you solve it.

  2. Gather training data in the form of \((x, y)\) pairs. This may require searching through publicly available data, private or proprietary datasets you have access to, or even going out and gathering new data or paying for data to be gathered. It’s important to make sure that any data \(x\) that you gather will be useful in predicting \(y\). At this step, it’s also helpful to determine any features you may want to compute, especially features that might be derived from existing columns in \(x\).

  3. Choose a link function and likelihood model that make sense for your data. For example, if your output is binary, then logistic regression might make sense. If your outputs are count-valued, then Poisson or negative binomial regression might be a good fit. You can see more possibilities at the Wikipedia page for generalized linear models.

  4. Fit the model using training data: this usually involves using Bambi, statsmodels, scikit-learn, or another package to computationally determine the coefficients of the model from the training dataset.

  5. Check that your model is actually a good fit for the data: this step, often called model checking, involves making sure that the model fits the observed data well and can be expected to make good predictions. We’ll spend the next section exploring model checking.

  6. Interpret the coefficients: one benefit of GLMs is that the coefficients have an intuitive interpretation. In linear regression, a coefficient tells us how much the output is predicted to increase if the corresponding predictor increases by a certain amount. We saw earlier in this section that in Poisson and negative binomial regression (or in linear regression with log-transformed \(y\)-values), the coefficients indicate a multiplicative increase by a factor of \(\exp(\beta)\). In logistic regression, the coefficients tell us how much the log-odds increases.

  7. Generate new predictions from new \(x\)-values for new data where \(y\) is unknown. Usually, we use prediction models to help us make predictions for future data points. These predictions are made by multiplying the coefficients that we learned in step 4 by the predictors for the new points and applying the inverse link function. This is usually handled computationally: most packages that you can use to fit models also have the ability to predict for new data points.

  8. Report uncertainty for coefficients and new predictions: since we fit our models using random data that come with some inherent uncertainty, anything we derive from that data also has uncertainty. This includes the estimated coefficients that we fit in step 4 and interpreted in step 6, as well as the new predictions that we made in step 7. We’ll spend more time on quantifying uncertainty later this chapter.