import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
np.set_printoptions(suppress=True, formatter={'float_kind':'{:f}'.format})
#this option makes Jupyter print numbers in ordinary (as opposed to scientific) notation
We shall review linear regression in the context of an applied dataset. We will also use this example to motivate Poisson Regression.
The dataset (called MROZ) comes from a 1987 Econometrica paper by Mroz and contains data on a number of variables for married women in the year 1975. This dataset is analyzed in the Econometrics book by Wooldridge (which contains many other interesting economic and political datasets). The variables included in the Mroz dataset can be seen in the output of the code below.
#Import the MROZ.csv dataset
mroz = pd.read_csv("MROZ.csv")
print(mroz.head(12))
   inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0     1   1610        1        0   32    12  3.3540     2.65    2708      34
1     1   1656        0        2   30    12  1.3889     2.65    2310      30
2     1   1980        1        3   35    12  4.5455     4.04    3072      40
3     1    456        0        3   34    12  1.0965     3.25    1920      53
4     1   1568        1        2   31    14  4.5918     3.60    2000      32
5     1   2032        0        0   54    12  4.7421     4.70    1040      57
6     1   1440        0        2   37    16  8.3333     5.95    2670      37
7     1   1020        0        0   54    12  7.8431     9.98    4120      53
8     1   1458        0        2   48    12  2.1262     0.00    1995      52
9     1   1600        0        2   39    12  4.6875     4.15    2100      43
10    1   1969        0        1   33    12  4.0630     4.30    2450      34
11    1   1960        0        1   42    11  4.5918     4.58    2375      47

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071

       lwage  expersq
0   1.210154      196
1   0.328512       25
2   1.514138      225
3   0.092123       36
4   1.524272       49
5   1.556480     1089
6   2.120260      121
7   2.059634     1225
8   0.754336      576
9   1.544899      441
10  1.401922      225
11  1.524272      196

[12 rows x 22 columns]
Several regressions can be fit on this dataset. Let us start by fitting a linear regression model with "hours" as the response variable, and "kidslt6", "kidsge6", "age", "educ", "exper", "expersq", "huswage", "huseduc", "hushrs", "motheduc" and "fatheduc" as the covariates (or predictor variables). Note that we are using both "exper" and "expersq" (the square of the variable "exper") as covariates in the regression. This is because we would expect "hours" to increase with "exper" for small values of "exper" (for women new to the workforce) but perhaps to decrease with "exper" for large values of "exper" (for example, women with very large "exper" might be closer to retirement and work less, leading to smaller "hours"). Such a relationship cannot be captured by linear functions but can be captured by quadratic functions, which is why both "exper" and "expersq" are included in the model.
Using the statsmodels package in Python, this regression can be carried out in the following way.
#Several regressions can be fit on this dataset. Let us fit one with
#hours as the response variable, and
#kidslt6, kidsge6, age, educ, exper, expersq, huswage, huseduc, hushrs, motheduc and fatheduc
#as covariates
import statsmodels.api as sm
#Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'kidsge6', 'age', 'educ',
'hushrs', 'huseduc', 'huswage', 'motheduc',
'fatheduc', 'exper', 'expersq']].copy()
#Add a constant (intercept) to the model
X = sm.add_constant(X)
#Fit the model:
model = sm.OLS(Y, X).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     25.30
Date:                Sat, 30 Sep 2023   Prob (F-statistic):           1.05e-44
Time:                        03:18:56   Log-Likelihood:                -6045.7
No. Observations:                 753   AIC:                         1.212e+04
Df Residuals:                     741   BIC:                         1.217e+04
Df Model:                          11
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1488.5175    293.748      5.067      0.000     911.839    2065.196
kidslt6     -439.3761     58.748     -7.479      0.000    -554.709    -324.043
kidsge6      -32.6212     23.202     -1.406      0.160     -78.171      12.928
age          -30.4462      4.382     -6.948      0.000     -39.049     -21.844
educ          41.2569     16.470      2.505      0.012       8.924      73.590
hushrs        -0.0635      0.049     -1.292      0.197      -0.160       0.033
huseduc      -16.1572     12.297     -1.314      0.189     -40.298       7.983
huswage      -13.7469      7.556     -1.819      0.069     -28.580       1.086
motheduc      10.9852     10.368      1.059      0.290      -9.370      31.340
fatheduc      -4.0224      9.771     -0.412      0.681     -23.205      15.161
exper         65.9440      9.984      6.605      0.000      46.344      85.544
expersq       -0.7289      0.325     -2.241      0.025      -1.368      -0.090
==============================================================================
Omnibus:                       82.696   Durbin-Watson:                   1.389
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              120.379
Skew:                           0.786   Prob(JB):                     7.25e-27
Kurtosis:                       4.167   Cond. No.                     2.54e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The table above gives estimates of the regression coefficients and their standard errors for each of the covariates. These estimates and standard errors are computed in the following way. The data is denoted by $(y_i, x_{i1}, x_{i2}, \dots, x_{im})$ for $i = 1, \dots, n$ (here $m$ is the number of covariates and $n$ is the number of observations). One often represents the response data ($y_1, \dots, y_n$) in an $n \times 1$ vector called $Y$, and the covariate data $(x_{i1}, \dots, x_{im})$ for $i = 1, \dots, n$ in an $n \times (m+1)$ matrix $X$ as follows: \begin{align*} Y = \begin{pmatrix} y_1 \\ \cdot \\ \cdot \\ \cdot \\ y_n \end{pmatrix} ~~\text{ and }~~ X = \begin{pmatrix} 1 & x_{11} & \cdot & \cdot & \cdot & x_{1m} \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 1 & x_{i1} & \cdot & \cdot & \cdot & x_{im} \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot\\ 1 & x_{n1} & \cdot & \cdot & \cdot & x_{nm} \end{pmatrix} \end{align*} Note the additional column of ones in the $X$ matrix. In this notation, the estimates of the regression coefficients are computed by solving the equation: \begin{align*} X^T X \beta = X^T Y ~~~\text{ leading to } ~~~ \hat{\beta} = (X^T X)^{-1} X^T Y. \end{align*} The standard errors are computed in the following way. First one computes the matrix: \begin{align*} \hat{\sigma}^2 (X^T X)^{-1} ~~~ \text{ where } ~~ \hat{\sigma} := \sqrt{\frac{\sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_m x_{im} \right)^2}{n-m-1}} \end{align*} The standard errors are then given by the square roots of the diagonal entries of this $(m+1) \times (m+1)$ matrix $\hat{\sigma}^2 (X^T X)^{-1}$.
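To see these formulas in action, here is a minimal sketch (using only numpy and the Y, X and model objects defined above) that recomputes the coefficient estimates and standard errors by hand; the array names introduced here (Xmat_ols, Yvec_ols, etc.) are used only for this check.
#Recomputing the OLS estimates and standard errors by hand
Xmat_ols = X.values.astype(float)     #n x (m+1) design matrix (includes the constant column)
Yvec_ols = Y.values.astype(float)
n_obs, n_par = Xmat_ols.shape         #n and m+1
XtX_inv = np.linalg.inv(Xmat_ols.T @ Xmat_ols)
beta_ols = XtX_inv @ Xmat_ols.T @ Yvec_ols                 #solves X^T X beta = X^T Y
resid = Yvec_ols - Xmat_ols @ beta_ols
sigma_hat = np.sqrt(np.sum(resid**2) / (n_obs - n_par))    #denominator is n - m - 1
std_err = np.sqrt(np.diag(sigma_hat**2 * XtX_inv))
print(beta_ols)      #compare with the 'coef' column of model.summary()
print(std_err)       #compare with the 'std err' column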
Looking back at the statsmodels regression summary, it is common practice to judge the importance of each variable by the value of the estimated regression coefficient together with its standard error. The third column (titled t) of the regression output gives the ratio of each estimated regression coefficient to its standard error. Variables for which the magnitude of this ratio is large are considered important. For example, the variable "kidslt6" has a t value of -7.479, i.e., 7.479 in magnitude, which is the largest among all the covariates, so it can be considered the most important covariate. On the other hand, the variable "fatheduc" has a t value of -0.412 (i.e., 0.412 in absolute value), which is small, so it can be considered the least important. The right quantities to compare these t-values against are the quantiles of the Student t probability distribution with $n-m-1$ degrees of freedom. This comparison is carried out by the OLS function and reported in the fourth column of the regression table. This column (titled $P>|t|$) goes by the name 'p-values', and values close to zero indicate that the variable is important. For example, for 'kidslt6', this $p$-value is rounded to zero, while for 'fatheduc', it is quite large, equalling 0.681.
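For instance, here is a small sketch (assuming scipy is available) recomputing the $P>|t|$ column from the t ratios using the Student t distribution with $n - m - 1$ degrees of freedom.
#Recomputing the p-values from the t statistics
from scipy import stats
tvals = model.params / model.bse                             #same ratios as the 't' column
pvals = 2 * stats.t.sf(np.abs(tvals), df=model.df_resid)     #two-sided p-values with n-m-1 df
print(pvals)                                                 #compare with the P>|t| column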
Usually one looks at the table above and drops variables for which the $p$-value $P > |t|$ is large. In this problem, it might be reasonable to drop the variables "motheduc", "fatheduc", "hushrs", "huseduc" and "kidsge6", deeming them unimportant for explaining the response "hours". Below we carry out linear regression of "hours" on the remaining six variables "kidslt6", "age", "educ", "huswage", "exper" and "expersq".
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) #add a constant (intercept) to the model
#Fit the model:
linmodel = sm.OLS(Y, X).fit()
print(linmodel.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.266
Model:                            OLS   Adj. R-squared:                  0.260
Method:                 Least Squares   F-statistic:                     44.99
Date:                Sat, 30 Sep 2023   Prob (F-statistic):           4.67e-47
Time:                        03:19:00   Log-Likelihood:                -6049.5
No. Observations:                 753   AIC:                         1.211e+04
Df Residuals:                     746   BIC:                         1.215e+04
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1166.8778    243.738      4.787      0.000     688.384    1645.372
kidslt6     -433.1229     58.416     -7.414      0.000    -547.803    -318.443
age          -28.4308      4.067     -6.991      0.000     -36.414     -20.447
educ          32.6255     12.827      2.543      0.011       7.444      57.807
huswage      -13.9353      6.857     -2.032      0.042     -27.397      -0.474
exper         67.7980      9.896      6.851      0.000      48.371      87.225
expersq       -0.7375      0.325     -2.270      0.023      -1.375      -0.100
==============================================================================
Omnibus:                       78.707   Durbin-Watson:                   1.383
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              111.647
Skew:                           0.768   Prob(JB):                     5.70e-25
Kurtosis:                       4.095   Cond. No.                     2.76e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.76e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Instead of using the statsmodels package (specifically the function sm.OLS), we can also fit this linear regression model via the Bayesian approach (in PyMC) with flat priors on the regression coefficients (as well as on $\log \sigma$). This is done as follows. Note that with these priors, it can be mathematically proved (by calculating the posterior explicitly) that the Bayesian approach gives results identical to the frequentist approach. This means that the implementation in PyMC should give essentially the same results as the sm.OLS summary (with perhaps slight differences due to Monte Carlo fluctuations).
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozmod = pm.Model()
with mrozmod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_sigma = pm.Flat("log_sigma")
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    # Expected value of outcome
    mu = (b0 + b1 * mroz['kidslt6'] + b2 * mroz['age'] + b3 * mroz['educ']
          + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq'])
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=mroz['hours'])
    idata = pm.sample(2000, chains=2, return_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6, log_sigma]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 31 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The above PyMC code specifies the linear regression model in the following way: \begin{align*} &Y_i \sim N(\mu_i, \sigma^2) ~~ \text{ for } i = 1, \dots, n \\ &\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \\ &\beta_0, \beta_1, \dots, \beta_m, \log \sigma \overset{\text{i.i.d}}{\sim} \text{Flat} = \text{Uniform}[-C, C] \text{ for very large } C \end{align*} Let us check that the Bayesian analysis leads to basically the same answers as statsmodels OLS.
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()
allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Mean | Std. Dev. | Least Squares | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:10}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {linmodel.params.values[i]:.6f} | {linmodel.bse.values[i]:.6f}")
Parameter | Mean | Std. Dev. | Least Squares | Std. Error
------------|----------|----------
b0 | 1175.858222 | 242.523307 | 1166.877797 | 243.737876
b1 | -434.363992 | 59.045116 | -433.122873 | 58.416292
b2 | -28.572263 | 4.043531 | -28.430794 | 4.066762
b3 | 32.440296 | 13.042272 | 32.625466 | 12.827177
b4 | -13.836047 | 7.102644 | -13.935253 | 6.857217
b5 | 67.582198 | 9.766088 | 67.797967 | 9.895837
b6 | -0.727326 | 0.320257 | -0.737492 | 0.324827
The estimated regression coefficient $\hat{\beta}_1$ for "kidslt6" is about -430, which can be interpreted as follows: having a small child reduces the mean hours worked by about 430. This result is clearly meaningless for women with few working hours (e.g., if someone only works about 300 hours, what does a reduction of 430 mean?). It is also not very meaningful for women with many working hours (e.g., if someone works more than 3000 hours, a reduction of 430 due to a small child sounds too small). A much more meaningful and useful interpretation would report reductions in terms of percentages rather than absolute hours. For example, a statement such as "having a small child reduces the mean hours worked by about 40%" would apply to all working women and be much more interpretable. Such an interpretation cannot be drawn from this linear regression.
A percentage interpretation can be realized if we change our model by taking the response variable to be $\log(\text{Hours worked})$ instead of "hours". Indeed, in the linear regression model for $\log(\text{Hours})$: \begin{align*} \log(\text{Hours}) = \beta_0 + \beta_1 \text{kidslt6} + \beta_2 \text{age} + \dots, \end{align*} the coefficient $\beta_1$ gives the change in $\log(\text{Hours})$ due to a small child, which translates into a percentage change of $100 \times(\exp(\beta_1) - 1)$ in hours worked because: \begin{align*} \frac{\text{Hours}_{\text{new}} - \text{Hours}_{\text{old}}}{\text{Hours}_{\text{old}}} \times 100 = \left[\exp \left(\log \text{Hours}_{\text{new}} - \log \text{Hours}_{\text{old}} \right) - 1\right] \times 100. \end{align*} It is much preferable to use $\log(\text{Hours})$ as the response variable in these problems compared to $\text{Hours}$. However, one big problem is that, in this dataset, there are quite a few observations for which the "hours" variable equals 0, so we cannot really work with the variable $\log \text{Hours}$. In such cases, Poisson regression provides a great alternative model.
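As a quick illustration, the identity above can be checked with arbitrary numbers (1500 and 1000 below are made up for this sketch), and we can also count how many observations in the dataset report zero hours.
#Numeric check of the percentage identity, and a count of zero-hours observations
hours_old, hours_new = 1500.0, 1000.0                               #arbitrary illustrative values
print((hours_new - hours_old) / hours_old * 100)                    #exact percentage change
print((np.exp(np.log(hours_new) - np.log(hours_old)) - 1) * 100)    #same number via the log identity
print((mroz['hours'] == 0).sum())                                   #number of women with zero hours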
In Poisson regression, one modifies the linear regression model: \begin{align*} &\log(Y_i) \sim N(\mu_i, \sigma^2) ~~ \text{ for } i = 1, \dots, n \\ &\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \end{align*} (which cannot be used as the variable $\log(Y_i)$ is meaningless when $Y_i = 0$) in the following way: \begin{align*} &Y_i \sim \text{Poisson}(\mu_i) ~~ \text{ for } i = 1, \dots, n \\ &\log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \end{align*} The main change is that the response variable $Y_i$ is now modelled as $\text{Poisson}(\mu_i)$. The Poisson distribution is often used to model counts. In this case, $Y_i$ is the number of hours worked in 1975 which is a count variable. The linear relationship between the response and covariates is specified through $\log \mu_i$ which enables percentage interpretation for the regression coefficients.
As in the case of linear regression, there are two ways of implementing Poisson regression: the frequentist approach (via the function sm.GLM in statsmodels) and the Bayesian approach (via PyMC).
Most practitioners use the first option above. Let us first see how it works before turning to the Bayesian PyMC approach.
#Poisson Regression through StatsModels
# Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) # Add a constant (intercept) to the model
# Fit the Poisson regression model
poiregmodel = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(poiregmodel.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  hours   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -3.1563e+05
Date:                Sat, 30 Sep 2023   Deviance:                   6.2754e+05
Time:                        03:19:46   Pearson chi2:                 6.60e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.9365      0.012    562.281      0.000       6.912       6.961
kidslt6       -0.8075      0.004   -193.217      0.000      -0.816      -0.799
age           -0.0427      0.000   -201.166      0.000      -0.043      -0.042
educ           0.0528      0.001     83.439      0.000       0.052       0.054
huswage       -0.0207      0.000    -54.548      0.000      -0.021      -0.020
exper          0.1204      0.001    219.231      0.000       0.119       0.121
expersq       -0.0018   1.63e-05   -112.090      0.000      -0.002      -0.002
==============================================================================
The output above looks very similar to the linear regression output. This model is much more interpretable than the linear regression model that we fit earlier. To see this, consider the interpretation of the coefficient -0.8075 for the "kidslt6" variable: having a small child reduces the mean hours worked by about 55%. This is clearly a much more interpretable result compared to before.
#The 55% comes from:
print((np.exp(poiregmodel.params['kidslt6']) - 1)*100)
-55.40391074218227
Let us now understand how sm.GLM obtains the estimates and standard errors for the regression coefficients. The estimates are obtained by Maximum Likelihood Estimation. The likelihood in this Poisson Regression Model is given by \begin{align*} \text{likelihood} &= \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} \\ &= \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!} \end{align*} It is much easier to maximize the logarithm of the likelihood (as opposed to the likelihood directly). We can also ignore the $y_i!$ terms in the denominator as they do not involve the parameters $\beta_0, \dots, \beta_m$. Let us also replace the sum $\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}$ by the simpler expression $x_i^T \beta$, where $\beta$ denotes the column vector with components $\beta_0, \dots, \beta_m$ and $x_i$ denotes the column vector with components $1, x_{i1}, x_{i2}, \dots, x_{im}$. With these changes, we obtain the log-likelihood: \begin{align*} \ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right) \end{align*} The goal is to maximize this log-likelihood over $\beta_0, \beta_1, \dots, \beta_m$. We can do this by taking the gradient (i.e., the derivatives of $\ell(\beta)$ with respect to $\beta_0, \dots, \beta_m$ collected in a column vector), setting the gradient to zero, and solving the resulting set of equations. The gradient of $\ell(\beta)$ is \begin{align*} \nabla \ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} x_i + y_i x_i \right) = \sum_{i=1}^n \left(y_i - e^{x_i^T \beta} \right) x_i = \sum_{i=1}^n \left(y_i - \mu_i \right) x_i = X^T(Y - \mu), \end{align*} where $\mu$ is the $n \times 1$ vector with entries $\mu_1, \dots, \mu_n$. Therefore setting the gradient equal to zero leads us to the equation: \begin{align*} X^T \mu = X^T Y. \end{align*} This gives $m+1$ equations for the $m+1$ parameters $\beta_0, \dots, \beta_m$. Note that the parameters appear in the above equation through $\mu$. It is interesting to observe that in the case of linear regression (where we have the normal distribution instead of the Poisson), if we attempt to calculate the MLE in the same way as above, we would obtain exactly the same equation: $X^T \mu = X^T Y$. However, in that case, $\mu = X \beta$, leading to the equation $X^T X \beta = X^T Y$, which can be solved as $\hat{\beta} = (X^T X)^{-1} X^T Y$. Unfortunately, in the present Poisson regression case, $\mu$, which has components $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$, depends on $\beta$ in a nonlinear fashion and, consequently, it is not possible to solve $X^T \mu = X^T Y$ for $\beta$ in closed form. One needs to resort to numerical methods to solve this equation.
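As a quick sanity check of this derivation, we can evaluate the log-likelihood (without the $y_i!$ terms) and its gradient at the estimate returned by sm.GLM: the gradient should be essentially zero there. This is a minimal sketch using the X (with constant) and poiregmodel objects defined above; the names X_np, Y_np and beta_mle are introduced here just for this check.
#Sanity check: the gradient of the log-likelihood should vanish at the MLE
X_np = X.values.astype(float)              #design matrix including the constant column
Y_np = mroz['hours'].values.astype(float)
beta_mle = poiregmodel.params.values
eta = X_np @ beta_mle                      #x_i^T beta for each observation
loglik = np.sum(-np.exp(eta) + Y_np * eta)
gradient = X_np.T @ (Y_np - np.exp(eta))   #X^T (Y - mu)
print(loglik)
print(gradient)                            #all entries should be very close to zero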
A very classical method for solving nonlinear equations is the Newton method (also known as the Newton-Raphson method): see https://en.wikipedia.org/wiki/Newton%27s_method. We apply this method to solve $\nabla \ell(\beta) = X^T Y - X^T \mu = 0$. Newton's method is iterative (it starts with a rough guess $\beta^{(0)}$ for the solution and then successively modifies it into better approximations, eventually converging to the correct solution). Newton's iteration for changing the current approximation $\beta^{(m)}$ into a better approximation $\beta^{(m+1)}$ of the solution to $\nabla \ell(\beta) = 0$ is given by: \begin{align*} \beta^{(m+1)} = \beta^{(m)} - (H\ell(\beta^{(m)}))^{-1} \nabla \ell(\beta^{(m)}), \end{align*} where $H\ell(\beta^{(m)})$ is the Hessian matrix of the function $\ell(\beta)$ evaluated at the point $\beta = \beta^{(m)}$. It may be recalled that the Hessian matrix is the matrix of second derivatives (see https://en.wikipedia.org/wiki/Hessian_matrix). To actually implement Newton's method, we need to calculate the Hessian. We can do this by further differentiating the gradient $\nabla \ell(\beta) = \sum_{i=1}^n (y_i - e^{x_i^T \beta}) x_i = X^T(Y - \mu)$ with respect to $\beta$. It can be checked that this leads to: \begin{align*} H \ell(\beta) = - \sum_{i=1}^n e^{x_i^T \beta} x_i x_i^T = - \sum_{i=1}^n \mu_i x_i x_i^T. \end{align*} We can rewrite this in matrix form as \begin{align*} H \ell(\beta) = - X^T M(\beta) X \end{align*} where $M(\beta)$ is the diagonal matrix with diagonal entries $\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}$. Thus Newton's method for Poisson regression is: \begin{align*} \beta^{(m+1)} = \beta^{(m)} + \left(X^T M(\beta^{(m)}) X \right)^{-1} \left( X^T Y - X^T \mu(\beta^{(m)}) \right), \end{align*} where we wrote $\mu(\beta^{(m)})$ to emphasize that this vector $\mu$ is calculated as $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$ with $\beta = \beta^{(m)}$. This method can be implemented as follows.
#Newton's Method for Calculating MLE in Poisson Regression
beta_hat = poiregmodel.params.values
print(beta_hat)
#this is the answer computed by statsmodels and we shall show that Newton's method leads to the same answer if initialized reasonably
#Initialization for Newton's Method
m = 6
p = 7
beta_initial = [5, 0, 0, 0, 0, 0, 0]
n = mroz.shape[0]
Xmat = X.values
Yvec = mroz['hours'].values
#Newton's method for 100 iterations
num_iterations = 100
for i in range(num_iterations):
    log_muvec = np.dot(Xmat, beta_initial)
    muvec = np.exp(log_muvec)
    gradient = np.dot(Xmat.T, Yvec - muvec)
    M = np.diag(muvec)
    Hessian = -Xmat.T @ M @ Xmat
    Hessian_inv = np.linalg.inv(Hessian)
    beta_initial = beta_initial - Hessian_inv @ gradient
    print(beta_initial)
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[11.862361 -2.918359 -0.191565 0.219829 -0.093895 0.456819 -0.004969]
[10.885550 -2.863001 -0.189547 0.218271 -0.093150 0.450808 -0.004889]
[9.946436 -2.723529 -0.184228 0.214137 -0.091177 0.435017 -0.004679]
[9.097029 -2.412093 -0.170935 0.203596 -0.086180 0.395908 -0.004161]
[8.414084 -1.884261 -0.142079 0.179445 -0.074896 0.313363 -0.003085]
[7.865945 -1.335615 -0.097770 0.136900 -0.055375 0.198124 -0.001679]
[7.313642 -0.990725 -0.061477 0.090536 -0.034424 0.127770 -0.001108]
[7.011093 -0.838093 -0.045790 0.061125 -0.022797 0.113816 -0.001399]
[6.944282 -0.808540 -0.042825 0.053242 -0.020705 0.118791 -0.001758]
[6.936650 -0.807526 -0.042682 0.052832 -0.020712 0.120342 -0.001827]
[6.936480 -0.807524 -0.042681 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
(the converged value [6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829] is repeated for each of the remaining iterations)
Because of the exponential calculation in $\mu_i = e^{x_i^T \beta}$, one needs to be careful with the initialization. For example, go back and change the initialization to $\beta^{(0)} = (0, \dots, 0)$. In this case, $\beta^{(1)}$ has some large entries, leading to $e^{x_i^T \beta^{(1)}}$ blowing up. If this can be avoided, the method converges to the correct solution.
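To see the issue concretely, here is a small sketch of a single Newton step started from the all-zeros vector (reusing the Xmat and Yvec arrays from above); the exponent $x_i^T \beta^{(1)}$ becomes so large for some observations that np.exp overflows.
#One Newton step from the all-zeros initialization
beta_zero = np.zeros(Xmat.shape[1])
mu0 = np.exp(Xmat @ beta_zero)                     #all ones at the zero initialization
grad0 = Xmat.T @ (Yvec - mu0)
H0 = -Xmat.T @ np.diag(mu0) @ Xmat
beta_one = beta_zero - np.linalg.inv(H0) @ grad0   #first Newton iterate (has large entries)
print(beta_one)
print(np.max(Xmat @ beta_one))                     #a very large exponent for some observations...
print(np.max(np.exp(Xmat @ beta_one)))             #...so the exponential overflows to inf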
Before seeing how sm.GLM calculates the standard errors, let us first look at the Bayesian approach for fitting the Poisson regression model with flat priors. For the Bayesian approach with flat priors, we have \begin{align*} \text{Prior} = 1 ~~~ \text{ and } ~~~ \text{Likelihood} = \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!} \propto \exp(\ell(\beta)). \end{align*} The posterior is therefore given by \begin{align*} \text{Posterior}(\beta) \propto \exp \left(\ell(\beta) \right) = \exp \left(\sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right) \right) \end{align*} This posterior in $\beta_0, \dots, \beta_m$ cannot be written in terms of standard distributions. One can use PyMC to generate samples from this posterior. However, PyMC can be somewhat unstable here and might give (significantly different) answers for different runs. For example, in the following code, setting random_seed = 0 gives results similar to the frequentist output while setting random_seed = 4 gives quite different results.
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozpoimod = pm.Model()
with mrozpoimod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_mu = (b0 + b1 * mroz['kidslt6'] + b2 * mroz['age'] + b3 * mroz['educ']
              + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq'])
    # Likelihood
    Y_obs = pm.Poisson("Y_obs", mu=np.exp(log_mu), observed=mroz['hours'])
    idata = pm.sample(2000, chains=2, random_seed=0, return_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 185 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()
allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:8}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {poiregmodel.params.values[i]:.6f} | {poiregmodel.bse.values[i]:.6f}")
#These results are quite close to the frequentist output.
#However PyMC is not very reliable here. Change the random seed from 0 to 4
#and look at the results.
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
------------|----------|----------
b0 | 6.936476 | 0.012442 | 6.936480 | 0.012336
b1 | -0.807527 | 0.004131 | -0.807524 | 0.004179
b2 | -0.042681 | 0.000213 | -0.042680 | 0.000212
b3 | 0.052838 | 0.000637 | 0.052831 | 0.000633
b4 | -0.020714 | 0.000378 | -0.020714 | 0.000380
b5 | 0.120361 | 0.000549 | 0.120372 | 0.000549
b6 | -0.001828 | 0.000016 | -0.001829 | 0.000016
When the posterior cannot be written in terms of standard distributions, and when MCMC seems unreliable, an alternative option is to approximate the posterior using simple standard distributions. This idea is at the heart of the method of Variational Inference which is quite popular in Bayesian inference. In this problem, one can approximate the posterior via a (multivariate) normal distribution. This works in the following way.
Our posterior is proportional to $\exp(\ell(\beta))$. The normal distribution, on the other hand, has density proportional to $\exp \left(\text{quadratic function of } \beta \right)$. Thus, in order to obtain a normal approximation of the posterior, it makes sense to approximate $\ell(\beta)$ by a quadratic function of $\beta$. This can be done by Taylor approximation, but we need to figure out around which point the Taylor approximation should be taken. The function $\ell(\beta)$ is the log-likelihood, which is maximized at the MLE $\hat{\beta}$. It makes sense, therefore, to do the Taylor approximation around $\hat{\beta}$. The idea is that the approximation will be quite accurate for points near the MLE $\hat{\beta}$; for points far from the MLE, the posterior density is probably small anyway, so a poor approximation at those points should have an insignificant effect. Motivated by these considerations, we Taylor expand the log-likelihood $\ell(\beta)$ around $\hat{\beta}$ up to second order as follows: \begin{align*} \ell(\beta) &\approx \ell(\hat{\beta}) + \left<\nabla \ell(\hat{\beta}), \beta - \hat{\beta} \right> + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \\ &= \ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) ~~ \text{ because } \nabla \ell(\hat{\beta}) = 0. \end{align*} Thus the posterior is approximated as: \begin{align*} \text{posterior}(\beta) & \propto \exp \left(\ell(\beta) \right) \\ & \approx \exp \left(\ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &= \exp \left(\ell(\hat{\beta}) \right) \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &\propto \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right). \end{align*} In the last step above, we dropped the factor $\exp \left(\ell(\hat{\beta}) \right)$ from the proportionality because it does not involve the parameters $\beta$.
Now we can compare the posterior approximation to the density of a multivariate normal (with mean $\mu$ and covariance $\Sigma$), which is proportional to $\exp \left(-\frac{1}{2} (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) \right)$. It is then clear that $\mu = \hat{\beta}$ and $\Sigma^{-1} = -H\ell(\hat{\beta})$. Thus the normal approximation to the posterior is given by: \begin{align*} \text{posterior} \approx N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right) \end{align*} This gives a very simple and practical approximation to the complicated posterior, especially because we already know how to calculate the Hessian (it was used in the Newton algorithm for computing the MLE). If one wants, posterior samples can easily be generated from $N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right)$. This will be much faster than using PyMC. However, PyMC aims to sample from the actual posterior, while this normal distribution is only an approximation to it.
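For instance, here is a minimal sketch (reusing Xmat and beta_hat from the Newton's method code above) of how one could draw samples from this normal approximation; their means and standard deviations should be close to the MLE and to the standard errors discussed next, up to Monte Carlo error.
#Drawing samples from the normal approximation to the posterior
rng = np.random.default_rng(0)
mu_hat = np.exp(Xmat @ beta_hat)
Hess = -Xmat.T @ np.diag(mu_hat) @ Xmat            #Hessian of the log-likelihood at the MLE
Cov_approx = np.linalg.inv(-Hess)                  #covariance of the normal approximation
normal_samples = rng.multivariate_normal(beta_hat, Cov_approx, size=4000)
print(normal_samples.mean(axis=0))                 #close to the MLE beta_hat
print(normal_samples.std(axis=0))                  #close to the sm.GLM standard errors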
The standard errors of Poisson Regression that were reported by sm.GLM are actually computed from the covariance of the posterior normal approximation. In other words, the standard errors are simply the square-roots of the diagonal entries of the matrix: \begin{align*} \left(-H\ell(\hat{\beta}) \right)^{-1} = \left(X^T M(\hat{\beta}) X \right)^{-1} \end{align*} where $\hat{\beta}$ is the MLE and $M(\beta)$ is the diagonal matrix with diagonal entries $\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}$. This can be readily verified in the context of our dataset as follows.
#Standard Error Calculation:
log_muvec = np.dot(Xmat, beta_hat)
muvec = np.exp(log_muvec)
M = np.diag(muvec)
Hessian = -Xmat.T @ M @ Xmat
Hessian_inv = np.linalg.inv(Hessian)
CovMat = -Hessian_inv
print(np.sqrt(np.diag(CovMat)))
#Compare with
print(poiregmodel.bse)
[0.012336 0.004179 0.000212 0.000633 0.000380 0.000549 0.000016]
const      0.012336
kidslt6    0.004179
age        0.000212
educ       0.000633
huswage    0.000380
exper      0.000549
expersq    0.000016
dtype: float64
These standard errors can also be justified using frequentist arguments, but that justification is much more complicated (it involves asymptotics and notions such as Fisher information). The Bayesian justification is much simpler and more intuitive, being based on a normal approximation obtained via Taylor expansion.
The important takeaways are: