In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
np.set_printoptions(suppress=True, formatter={'float_kind':'{:f}'.format}) 
#this option makes Jupyter print numbers in ordinary (as opposed to scientific) notation

Linear Regression and Poisson Regression¶

We shall review linear regression in the context of an applied dataset. We will also use this example to motivate Poisson Regression.

Review of Linear Regression with an Economics Dataset¶

The dataset (called MROZ) comes from a 1987 Econometrica paper by Mroz and contains data on a number of variables for 753 married women in the year 1975. This dataset is analyzed in the Econometrics book by Wooldridge (which contains many other interesting economic and political datasets). The specific variables included in the Mroz dataset are:

  1. inlf: binary variable equaling 1 if the individual worked (i.e., they were 'in the labor force') in the year 1975 and 0 otherwise
  2. hours: number of hours worked in 1975
  3. kidslt6: number of kids < 6 years of age
  4. kidsge6: number of kids 6-18 years of age
  5. age: age in years
  6. educ: years of schooling
  7. wage: hourly wage in 1975
  8. repwage: reported wage at interview in 1976
  9. hushrs: hours worked by husband in 1975
  10. husage: husband's age
  11. huseduc: husband's years of schooling
  12. huswage: husband's hourly wage in 1975
  13. faminc: family income in 1975
  14. mtr: federal marginal tax rate facing woman
  15. motheduc: mother's years of schooling
  16. fatheduc: father's years of schooling
  17. unem: unemployment rate in county of residence
  18. city: =1 if live in Standard metropolitan statistical area
  19. exper: actual labor market experience
  20. nwifeinc: (faminc - wage*hours)/1000
  21. lwage: log(wage)
  22. expersq: $\text{exper}^2$ (the square of the experience variable)
In [12]:
#Import the MROZ.csv dataset
mroz = pd.read_csv("MROZ.csv")
print(mroz.head(12))
    inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0      1   1610        1        0   32    12  3.3540     2.65    2708      34   
1      1   1656        0        2   30    12  1.3889     2.65    2310      30   
2      1   1980        1        3   35    12  4.5455     4.04    3072      40   
3      1    456        0        3   34    12  1.0965     3.25    1920      53   
4      1   1568        1        2   31    14  4.5918     3.60    2000      32   
5      1   2032        0        0   54    12  4.7421     4.70    1040      57   
6      1   1440        0        2   37    16  8.3333     5.95    2670      37   
7      1   1020        0        0   54    12  7.8431     9.98    4120      53   
8      1   1458        0        2   48    12  2.1262     0.00    1995      52   
9      1   1600        0        2   39    12  4.6875     4.15    2100      43   
10     1   1969        0        1   33    12  4.0630     4.30    2450      34   
11     1   1960        0        1   42    11  4.5918     4.58    2375      47   

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060   
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981   
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910   
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996   
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058   
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054   
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048   
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038   
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000   
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000   
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953   
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071   

       lwage  expersq  
0   1.210154      196  
1   0.328512       25  
2   1.514138      225  
3   0.092123       36  
4   1.524272       49  
5   1.556480     1089  
6   2.120260      121  
7   2.059634     1225  
8   0.754336      576  
9   1.544899      441  
10  1.401922      225  
11  1.524272      196  

[12 rows x 22 columns]

Several regressions can be fit on this dataset. Let us start by fitting a linear regression model with "hours" as the response variable, and "kidslt6", "kidsge6", "age", "educ", "exper", "expersq", "huswage", "huseduc", "hushrs", "motheduc" and "fatheduc" as the covariates (or predictor variables). Note that we are using both "exper" and "expersq" (the square of the variable "exper") as covariates in the regression. This is because we would expect "hours" to increase with "exper" for small values of "exper" (for women new to the workforce) but to perhaps decrease with "exper" for large values of "exper" (for example, people with very large "exper" might be closer to retirement and work less, leading to smaller "hours"). Such a relationship cannot be captured by linear functions but can be captured by quadratic functions, which is why both "exper" and "expersq" are included in the model.

Using the statsmodels package in Python, this regression can be carried out in the following way.

In [13]:
#Several regressions can be fit on this dataset. Let us fit one with
#hours as the response variable, and
#kidslt6, kidsge6, age, educ, exper, expersq, huswage, huseduc, hushrs, motheduc and fatheduc
#as covariates
import statsmodels.api as sm
#Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'kidsge6', 'age', 'educ', 
       'hushrs',  'huseduc', 'huswage',  'motheduc',
       'fatheduc', 'exper', 'expersq']].copy()
#Add a constant (intercept) to the model
X = sm.add_constant(X)

#Fit the model: 
model = sm.OLS(Y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     25.30
Date:                Sat, 30 Sep 2023   Prob (F-statistic):           1.05e-44
Time:                        03:18:56   Log-Likelihood:                -6045.7
No. Observations:                 753   AIC:                         1.212e+04
Df Residuals:                     741   BIC:                         1.217e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1488.5175    293.748      5.067      0.000     911.839    2065.196
kidslt6     -439.3761     58.748     -7.479      0.000    -554.709    -324.043
kidsge6      -32.6212     23.202     -1.406      0.160     -78.171      12.928
age          -30.4462      4.382     -6.948      0.000     -39.049     -21.844
educ          41.2569     16.470      2.505      0.012       8.924      73.590
hushrs        -0.0635      0.049     -1.292      0.197      -0.160       0.033
huseduc      -16.1572     12.297     -1.314      0.189     -40.298       7.983
huswage      -13.7469      7.556     -1.819      0.069     -28.580       1.086
motheduc      10.9852     10.368      1.059      0.290      -9.370      31.340
fatheduc      -4.0224      9.771     -0.412      0.681     -23.205      15.161
exper         65.9440      9.984      6.605      0.000      46.344      85.544
expersq       -0.7289      0.325     -2.241      0.025      -1.368      -0.090
==============================================================================
Omnibus:                       82.696   Durbin-Watson:                   1.389
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              120.379
Skew:                           0.786   Prob(JB):                     7.25e-27
Kurtosis:                       4.167   Cond. No.                     2.54e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The table above gives estimates of the regression coefficients and their standard errors for each of the covariates. These estimates and standard errors are computed in the following way. The data is denoted by $(y_i, x_{i1}, x_{i2}, \dots, x_{im})$ for $i = 1, \dots, n$ (here $m$ is the number of covariates and $n$ is the number of observations). One often represents the response data ($y_1, \dots, y_n$) in an $n \times 1$ vector called $Y$, and the covariate data $(x_{i1}, \dots, x_{im})$ for $i = 1, \dots, n$ in an $n \times (m+1)$ matrix $X$ as follows: \begin{align*} Y = \begin{pmatrix} y_1 \\ \cdot \\ \cdot \\ \cdot \\ y_n \end{pmatrix} ~~\text{ and }~~ X = \begin{pmatrix} 1 & x_{11} & \cdot & \cdot & \cdot & x_{1m} \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 1 & x_{i1} & \cdot & \cdot & \cdot & x_{im} \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot\\ 1 & x_{n1} & \cdot & \cdot & \cdot & x_{nm} \end{pmatrix} \end{align*} Note the additional column of ones in the $X$ matrix. In this notation, the estimates of the regression coefficients are computed by solving the equation: \begin{align*} X^T X \beta = X^T Y ~~~\text{ leading to } ~~~ \hat{\beta} = (X^T X)^{-1} X^T Y. \end{align*} The standard errors are computed in the following way. First one computes the matrix: \begin{align*} \hat{\sigma}^2 (X^T X)^{-1} ~~~ \text{ where } ~~ \hat{\sigma} := \sqrt{\frac{\sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_m x_{im} \right)^2}{n-m-1}} \end{align*} The standard errors are given by the square roots of the diagonal entries of the $(m+1) \times (m+1)$ matrix $\hat{\sigma}^2 (X^T X)^{-1}$.
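
To make these formulas concrete, here is a short side computation (not part of the original analysis) that recomputes $\hat{\beta}$ and the standard errors directly with NumPy, using the Y and X defined above; the results should match model.params and model.bse from the statsmodels summary.

In [ ]:
#Sketch: recompute the OLS estimates and standard errors by hand using the formulas above
Xm = X.values                #n x (m+1) design matrix (includes the column of ones)
Yv = Y.values
n_obs, n_par = Xm.shape      #n and m+1
beta_by_hand = np.linalg.solve(Xm.T @ Xm, Xm.T @ Yv)   #solves X^T X beta = X^T Y
resid = Yv - Xm @ beta_by_hand
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n_obs - n_par))
se_by_hand = np.sqrt(np.diag(sigma_hat ** 2 * np.linalg.inv(Xm.T @ Xm)))
print(beta_by_hand)   #compare with model.params
print(se_by_hand)     #compare with model.bse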

Looking back at the statsmodels regression summary, it is common practice to judge the importance of each variable by the value of the estimated regression coefficient together with its standard error. The third column (titled t) of the regression output gives the ratio of each estimated regression coefficient to its standard error. Variables for which the magnitude of this ratio is large are considered important. For example, in this dataset, the variable "kidslt6" has a t value of -7.479, i.e., 7.479 in magnitude, which is the largest among all the covariates, so it can be considered the most important variable. On the other hand, the variable "fatheduc" has a t value of -0.412 (i.e., 0.412 in absolute value), which is small, so it can be considered the least important variable. The right benchmark for these t-values is the set of quantiles of the Student t distribution with $n-m-1$ degrees of freedom. This comparison is carried out by the OLS function and its results are given in the fourth column of the regression table. This column (titled $P>|t|$) contains the familiar 'p-values'; values close to zero indicate that the variable is important. For example, for 'kidslt6', this $p$-value is rounded to zero while for 'fatheduc', it is quite large, equalling 0.681.
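
As a quick illustration (again a side computation, assuming scipy is available), the t value and $p$-value for 'kidslt6' can be reproduced from the fitted model as follows.

In [ ]:
#Sketch: recompute the t statistic and two-sided p-value for 'kidslt6'
from scipy import stats
tval = model.params['kidslt6'] / model.bse['kidslt6']
pval = 2 * stats.t.sf(np.abs(tval), df=model.df_resid)   #df = n - m - 1 = 741 here
print(tval, pval)   #compare with the t and P>|t| columns in the summary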

Usually one looks at the table above and drops variables for which the $p$-value $P>|t|$ is large. In this problem, it might be reasonable to drop the variables "motheduc", "fatheduc", "hushrs", "huseduc" and "kidsge6", deeming these to be unimportant for explaining the response "hours". Below we carry out linear regression of "hours" on the other six variables "kidslt6", "age", "educ", "huswage", "exper", "expersq".

In [14]:
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ', 
        'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) #add a constant (intercept) to the model
#Fit the model: 
linmodel = sm.OLS(Y, X).fit()
print(linmodel.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.266
Model:                            OLS   Adj. R-squared:                  0.260
Method:                 Least Squares   F-statistic:                     44.99
Date:                Sat, 30 Sep 2023   Prob (F-statistic):           4.67e-47
Time:                        03:19:00   Log-Likelihood:                -6049.5
No. Observations:                 753   AIC:                         1.211e+04
Df Residuals:                     746   BIC:                         1.215e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1166.8778    243.738      4.787      0.000     688.384    1645.372
kidslt6     -433.1229     58.416     -7.414      0.000    -547.803    -318.443
age          -28.4308      4.067     -6.991      0.000     -36.414     -20.447
educ          32.6255     12.827      2.543      0.011       7.444      57.807
huswage      -13.9353      6.857     -2.032      0.042     -27.397      -0.474
exper         67.7980      9.896      6.851      0.000      48.371      87.225
expersq       -0.7375      0.325     -2.270      0.023      -1.375      -0.100
==============================================================================
Omnibus:                       78.707   Durbin-Watson:                   1.383
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              111.647
Skew:                           0.768   Prob(JB):                     5.70e-25
Kurtosis:                       4.095   Cond. No.                     2.76e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.76e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Instead of using the statsmodels package (specifically the function sm.OLS), we can also fit this linear regression model via the Bayesian approach (in PyMC) with flat priors on the regression coefficients (as well as on $\log \sigma$). This is done as follows. Note that with these priors, it can be mathematically proved (by calculating the posterior explicitly) that the Bayesian approach gives identical results to the frequentist approach. This means that the PyMC implementation should give essentially the same results as the sm.OLS summary (with perhaps slight differences due to Monte Carlo fluctuations).

In [15]:
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozmod = pm.Model()
with mrozmod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_sigma = pm.Flat("log_sigma")             
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    # Expected value of outcome
    mu = b0 + b1 * mroz['kidslt6'] +   b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=mroz['hours'])
    idata = pm.sample(2000, chains = 2, return_inferencedata = True)  
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6, log_sigma]
100.00% [6000/6000 00:30<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 31 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

The above PyMC code specifies the linear regression model in the following way: \begin{align*} &Y_i \sim N(\mu_i, \sigma^2) ~~ \text{ for } i = 1, \dots, n \\ &\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \\ &\beta_0, \beta_1, \dots, \beta_m, \log \sigma \overset{\text{i.i.d}}{\sim} \text{Flat} = \text{Uniform}[-C, C] \text{ for very large } C \end{align*} Let us check that the Bayesian analysis leads to basically the same answers as statsmodels OLS.

In [16]:
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()

allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter   | Mean     | Std. Dev. | Least Squares | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:10}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {linmodel.params.values[i]:.6f}  | {linmodel.bse.values[i]:.6f}")
Parameter   | Mean     | Std. Dev. | Least Squares | Std. Error
------------|----------|----------
b0        | 1175.858222 | 242.523307 | 1166.877797  | 243.737876
b1        | -434.363992 | 59.045116 | -433.122873  | 58.416292
b2        | -28.572263 | 4.043531 | -28.430794  | 4.066762
b3        | 32.440296 | 13.042272 | 32.625466  | 12.827177
b4        | -13.836047 | 7.102644 | -13.935253  | 6.857217
b5        | 67.582198 | 9.766088 | 67.797967  | 9.895837
b6        | -0.727326 | 0.320257 | -0.737492  | 0.324827

The estimated regression coefficient $\hat{\beta}_1$ for "kidslt6" is about -430, which can be interpreted as follows: having a small child reduces the mean hours worked by about 430. This result is clearly meaningless for women who work few hours (e.g., if someone only works about 300 hours, what does a reduction of 430 mean?). It is also not very meaningful for women who work many hours (e.g., if someone works more than 3000 hours, a reduction of 430 due to a small kid sounds too small). A much more meaningful and useful interpretation would report reductions in terms of percentages rather than absolute hours. For example, a statement such as "having a small child reduces the mean hours worked by about 40%" would apply to all working women and be much more interpretable. Such an interpretation cannot be drawn from this linear regression.

A percentage interpretation can be realized if we change our model by taking the response variable to be $\log(\text{Hours worked})$ as opposed to "hours". Indeed, in the linear regression model for $\log(\text{Hours})$: \begin{align*} \log(\text{Hours}) = \beta_0 + \beta_1 \text{kidslt6} + \beta_2 \text{age} + \dots, \end{align*} the coefficient $\beta_1$ gives the change in $\log(\text{Hours})$ due to a small child. This translates into a percentage change of $100 \times(\exp(\beta_1) - 1)$ in hours worked because: \begin{align*} \frac{\text{Hours}_{\text{new}} - \text{Hours}_{\text{old}}}{\text{Hours}_{\text{old}}} \times 100 = \left[\exp \left(\log \text{Hours}_{\text{new}} - \log \text{Hours}_{\text{old}} \right) - 1\right] \times 100 \end{align*} It is much preferable to use $\log(\text{Hours})$ as the response variable in these problems compared to $\text{Hours}$. However, one big problem is that, in this dataset, there are quite a few observations for which the Hours variable equals 0, so we cannot really work with the variable $\log \text{Hours}$. In such cases, Poisson regression provides a great alternative model.
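
As a quick numerical illustration of this conversion (using a hypothetical coefficient value, since we do not actually fit the log-hours model here), a coefficient of $\beta_1 = -0.5$ on "kidslt6" would correspond to roughly a 39% reduction in hours:

In [ ]:
#Sketch: converting a log-scale coefficient into a percentage change
#(the value -0.5 is hypothetical, for illustration only)
b1_hypothetical = -0.5
print((np.exp(b1_hypothetical) - 1) * 100)   #about -39.3, i.e. a 39% reduction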

Poisson Regression¶

In Poisson regression, one modifies the linear regression model: \begin{align*} &\log(Y_i) \sim N(\mu_i, \sigma^2) ~~ \text{ for } i = 1, \dots, n \\ &\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \end{align*} (which cannot be used as the variable $\log(Y_i)$ is meaningless when $Y_i = 0$) in the following way: \begin{align*} &Y_i \sim \text{Poisson}(\mu_i) ~~ \text{ for } i = 1, \dots, n \\ &\log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~ \text{ for } i = 1, \dots, n \end{align*} The main change is that the response variable $Y_i$ is now modelled as $\text{Poisson}(\mu_i)$. The Poisson distribution is often used to model counts. In this case, $Y_i$ is the number of hours worked in 1975 which is a count variable. The linear relationship between the response and covariates is specified through $\log \mu_i$ which enables percentage interpretation for the regression coefficients.

As in the case of linear regression, there are two ways of implementing Poisson regression:

  1. Through a built-in function in the statsmodels library called sm.GLM. GLM stands for Generalized Linear Models, a general class of models that includes linear regression, Poisson regression and logistic regression. When using sm.GLM, we specify which GLM we are fitting through the family argument.
  2. Using PyMC to do Bayesian inference with flat priors: $\beta_0, \dots, \beta_m \overset{\text{i.i.d}}{\sim} \text{Flat}$.

Most practitioners use the first option above. Let us first see how it works before turning to the Bayesian PyMC approach.

In [17]:
#Poisson Regression through StatsModels
# Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ', 
        'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) # Add a constant (intercept) to the model
# Fit the Poisson regression model
poiregmodel = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(poiregmodel.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  hours   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -3.1563e+05
Date:                Sat, 30 Sep 2023   Deviance:                   6.2754e+05
Time:                        03:19:46   Pearson chi2:                 6.60e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.9365      0.012    562.281      0.000       6.912       6.961
kidslt6       -0.8075      0.004   -193.217      0.000      -0.816      -0.799
age           -0.0427      0.000   -201.166      0.000      -0.043      -0.042
educ           0.0528      0.001     83.439      0.000       0.052       0.054
huswage       -0.0207      0.000    -54.548      0.000      -0.021      -0.020
exper          0.1204      0.001    219.231      0.000       0.119       0.121
expersq       -0.0018   1.63e-05   -112.090      0.000      -0.002      -0.002
==============================================================================

The output above looks very similar to the linear regression output, but this model is much more interpretable than the linear regression model we fit earlier. To see this, consider the interpretation of the coefficient -0.8075 for the "kidslt6" variable: having a small kid reduces mean hours worked by about 55%. This is clearly a more meaningful statement than the earlier "reduction of about 430 hours".

In [18]:
#The 55% figure comes from:
print((np.exp(poiregmodel.params['kidslt6']) - 1)*100)
-55.40391074218227

Maximum Likelihood Estimation in Poisson Regression¶

Let us now understand how sm.GLM obtains the estimates and standard errors for the regression coefficients. The estimates are obtained by Maximum Likelihood Estimation. The likelihood in this Poisson Regression Model is given by \begin{align*} \text{likelihood} &= \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} \\ &= \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!} \end{align*} It is much easier to maximize the logarithm of the likelihood (as opposed to the likelihood directly). We can also ignore the $y_i!$ terms in the denominator as they do not involve the parameters $\beta_0, \dots, \beta_m$. Let us also replace the sum $\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}$ by the simpler expression $x_i^T \beta$, where $\beta$ denotes the column vector with components $\beta_0, \dots, \beta_m$ and $x_i$ denotes the column vector with components $1, x_{i1}, x_{i2}, \dots, x_{im}$. With these changes, we obtain the log-likelihood: \begin{align*} \ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right) \end{align*} The goal is to maximize this log-likelihood over $\beta_0, \beta_1, \dots, \beta_m$. We can do this by taking the gradient (i.e., the derivatives of $\ell(\beta)$ with respect to $\beta_0, \dots, \beta_m$ collected in a column vector), setting the gradient to zero and solving the resulting set of equations. The gradient of $\ell(\beta)$ is \begin{align*} \nabla \ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} x_i + y_i x_i \right) = \sum_{i=1}^n \left(y_i - e^{x_i^T \beta} \right) x_i = \sum_{i=1}^n \left(y_i - \mu_i \right) x_i = X^T(Y - \mu), \end{align*} where $\mu$ is the $n \times 1$ vector with entries $\mu_1, \dots, \mu_n$. Therefore, setting the gradient equal to zero leads us to the equation: \begin{align*} X^T \mu = X^T Y. \end{align*} This gives $m+1$ equations for the $m+1$ parameters $\beta_0, \dots, \beta_m$. Note that the parameters appear in the above equation through $\mu$. It is interesting to observe that in the case of linear regression (where we have the normal distribution instead of the Poisson), if we attempt to calculate the MLE in the same way as above, we obtain exactly the same equation: $X^T \mu = X^T Y$. However, in that case, $\mu = X \beta$, leading to the equation $X^T X \beta = X^T Y$, which can be solved as $\hat{\beta} = (X^T X)^{-1} X^T Y$. Unfortunately, in the present Poisson regression case, $\mu$, which has components $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$, depends on $\beta$ in a nonlinear fashion and, consequently, it is not possible to solve $X^T \mu = X^T Y$ for $\beta$ in closed form. One needs to resort to solving this equation using numerical methods.
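
To connect this derivation to the fitted model, here is a small side check (using the X and Y from the sm.GLM cell above) that evaluates the log-likelihood, with the $y_i!$ terms dropped, and its gradient at the statsmodels estimate; the gradient should be essentially the zero vector there.

In [ ]:
#Sketch: evaluate the Poisson log-likelihood (without the y_i! terms)
#and its gradient X^T(Y - mu) at the statsmodels estimate
Xd = X.values
Yd = Y.values
beta_check = poiregmodel.params.values
eta = Xd @ beta_check                   #x_i^T beta for each observation
print(np.sum(-np.exp(eta) + Yd * eta))  #log-likelihood l(beta) up to the y_i! terms
print(Xd.T @ (Yd - np.exp(eta)))        #gradient; should be essentially zero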

MLE via Newton's Method¶

A very classical method for solving nonlinear equations is Newton's method (also known as the Newton-Raphson method): see https://en.wikipedia.org/wiki/Newton%27s_method. We apply this method to solve $\nabla \ell(\beta) = X^T Y - X^T \mu = 0$. Newton's method is iterative: it starts with a rough guess $\beta^{(0)}$ for the solution and then successively modifies it into better approximations, eventually converging to the correct solution. Newton's iteration for changing the current approximation $\beta^{(t)}$ into a better approximation $\beta^{(t+1)}$ of the solution to $\nabla \ell(\beta) = 0$ is given by: \begin{align*} \beta^{(t+1)} = \beta^{(t)} - (H\ell(\beta^{(t)}))^{-1} \nabla \ell(\beta^{(t)}), \end{align*} where $H\ell(\beta^{(t)})$ is the Hessian matrix of the function $\ell(\beta)$ evaluated at the point $\beta = \beta^{(t)}$. It may be recalled that the Hessian matrix is the matrix of second derivatives (see https://en.wikipedia.org/wiki/Hessian_matrix). To actually implement Newton's method, we need to calculate the Hessian. We can do this by further differentiating the gradient $\nabla \ell(\beta) = \sum_{i=1}^n (y_i - e^{x_i^T \beta}) x_i = X^T(Y - \mu)$ with respect to $\beta$. It can be checked that this leads to: \begin{align*} H \ell(\beta) = - \sum_{i=1}^n e^{x_i^T \beta} x_i x_i^T = - \sum_{i=1}^n \mu_i x_i x_i^T. \end{align*} We can rewrite this in matrix form as \begin{align*} H \ell(\beta) = - X^T M(\beta) X \end{align*} where $M(\beta)$ is the diagonal matrix with diagonal entries $\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}$. Thus Newton's method for Poisson regression is: \begin{align*} \beta^{(t+1)} = \beta^{(t)} + \left(X^T M(\beta^{(t)}) X \right)^{-1} \left( X^T Y - X^T \mu(\beta^{(t)}) \right), \end{align*} where we write $\mu(\beta^{(t)})$ to emphasize that the vector $\mu$ is calculated as $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$ with $\beta = \beta^{(t)}$. This method can be implemented as follows.

In [19]:
#Newton's Method for Calculating MLE in Poisson Regression
beta_hat = poiregmodel.params.values 
print(beta_hat)
#this is the answer computed by statsmodels and we shall show that Newton's method leads to the same answer if initialized reasonably

#Initialization for Newton's Method
m = 6   #number of covariates
p = 7   #number of regression coefficients (intercept plus covariates)
beta_initial = [5, 0, 0, 0, 0, 0, 0]   #starting point for Newton's iterations
n = mroz.shape[0]
Xmat = X.values
Yvec = mroz['hours'].values

#Newton's method for 100 iterations
num_iterations = 100
for i in range(num_iterations):
    log_muvec = np.dot(Xmat, beta_initial)
    muvec = np.exp(log_muvec)                              #mu_i = exp(x_i^T beta)
    gradient = np.dot(Xmat.T,  Yvec - muvec)               #gradient of l(beta): X^T (Y - mu)
    M = np.diag(muvec)
    Hessian = -Xmat.T @ M @ Xmat                           #Hessian of l(beta): -X^T M(beta) X
    Hessian_inv = np.linalg.inv(Hessian)
    beta_initial = beta_initial - Hessian_inv @ gradient   #Newton update
    print(beta_initial)
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[11.862361 -2.918359 -0.191565 0.219829 -0.093895 0.456819 -0.004969]
[10.885550 -2.863001 -0.189547 0.218271 -0.093150 0.450808 -0.004889]
[9.946436 -2.723529 -0.184228 0.214137 -0.091177 0.435017 -0.004679]
[9.097029 -2.412093 -0.170935 0.203596 -0.086180 0.395908 -0.004161]
[8.414084 -1.884261 -0.142079 0.179445 -0.074896 0.313363 -0.003085]
[7.865945 -1.335615 -0.097770 0.136900 -0.055375 0.198124 -0.001679]
[7.313642 -0.990725 -0.061477 0.090536 -0.034424 0.127770 -0.001108]
[7.011093 -0.838093 -0.045790 0.061125 -0.022797 0.113816 -0.001399]
[6.944282 -0.808540 -0.042825 0.053242 -0.020705 0.118791 -0.001758]
[6.936650 -0.807526 -0.042682 0.052832 -0.020712 0.120342 -0.001827]
[6.936480 -0.807524 -0.042681 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
(the remaining iterations repeat the converged value [6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]; output truncated)

Because of the exponential calculation in $\mu_i = e^{x_i^T \beta}$, one needs to be careful with the initialization. For example, go back and change the initialization to $\beta^{(0)} = (0, \dots, 0)$. In this case, $\beta^{(1)}$ has some large entries, leading to $e^{x_i^T \beta^{(1)}}$ blowing up (a one-step sketch of this failure is shown below). As long as such overflow is avoided, the method converges to the correct solution.
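
Here is a small sketch of that failure mode (one Newton step from the all-zeros start, reusing Xmat and Yvec from the cell above): the first update already produces linear predictors whose values exceed roughly 709, the largest argument np.exp can handle in double precision, so the next iteration overflows.

In [ ]:
#Sketch: one Newton step started from beta = 0, illustrating the overflow problem
beta_zero = np.zeros(Xmat.shape[1])
mu_zero = np.exp(Xmat @ beta_zero)       #all ones when beta = 0
step = np.linalg.solve(Xmat.T @ np.diag(mu_zero) @ Xmat, Xmat.T @ (Yvec - mu_zero))
beta_one = beta_zero + step
print(beta_one)                          #some very large entries (e.g. the intercept)
print(np.max(Xmat @ beta_one))           #well above ~709, so np.exp overflows on the next step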

Bayesian Analysis with Flat Prior¶

Before seeing how sm.GLM calculates the standard errors, let us first look at the Bayesian approach to fitting the Poisson regression model with flat priors. For the Bayesian approach with flat priors, we have \begin{align*} \text{Prior} = 1 ~~~ \text{ and } ~~~ \text{Likelihood} = \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!} \propto \exp(\ell(\beta)). \end{align*} The posterior is therefore given by \begin{align*} \text{Posterior}(\beta) \propto \exp \left(\ell(\beta) \right) = \exp \left(\sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right) \right) \end{align*} This posterior in $\beta_0, \dots, \beta_m$ cannot be written in terms of standard distributions. One can use PyMC to generate samples from this posterior. However, PyMC can be somewhat unstable here and might give significantly different answers on different runs. For example, in the following code, setting random_seed = 0 gives results similar to the frequentist output while setting random_seed = 4 gives quite different results.

In [20]:
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozpoimod = pm.Model()
with mrozpoimod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_mu = b0 + b1 * mroz['kidslt6'] +   b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
    # Likelihood
    Y_obs = pm.Poisson("Y_obs", mu=pm.math.exp(log_mu), observed=mroz['hours'])
    idata = pm.sample(2000, chains = 2, random_seed = 0, return_inferencedata = True)  
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6]
100.00% [6000/6000 03:04<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 185 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
In [21]:
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()

allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:8}| {np.mean(arr):.6f} | {np.std(arr):.6f}  | {poiregmodel.params.values[i]:.6f}  | {poiregmodel.bse.values[i]:.6f}")
#These results are quite close to the frequentist output.
#However PyMC is not very reliable here. Change the random seed from 0 to 4
#and look at the results. 
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
------------|----------|----------
b0      | 6.936476 | 0.012442  | 6.936480  | 0.012336
b1      | -0.807527 | 0.004131  | -0.807524  | 0.004179
b2      | -0.042681 | 0.000213  | -0.042680  | 0.000212
b3      | 0.052838 | 0.000637  | 0.052831  | 0.000633
b4      | -0.020714 | 0.000378  | -0.020714  | 0.000380
b5      | 0.120361 | 0.000549  | 0.120372  | 0.000549
b6      | -0.001828 | 0.000016  | -0.001829  | 0.000016

Posterior Normal Approximation and Standard Errors¶

When the posterior cannot be written in terms of standard distributions, and when MCMC seems unreliable, an alternative option is to approximate the posterior using simple standard distributions. This idea is at the heart of the method of Variational Inference which is quite popular in Bayesian inference. In this problem, one can approximate the posterior via a (multivariate) normal distribution. This works in the following way.

Our posterior is proportional to $\exp(\ell(\beta))$. The normal distribution, on the other hand, has density proportional to $\exp \left(\text{quadratic function of } \beta \right)$. Thus, in order to obtain a normal approximation of the posterior, it makes sense to approximate $\ell(\beta)$ by a quadratic function of $\beta$. This can be done by Taylor approximation, but we need to decide around which point the Taylor expansion should be taken. The function $\ell(\beta)$ is the log-likelihood, which is maximized at the MLE $\hat{\beta}$. It makes sense, therefore, to do the Taylor approximation around $\hat{\beta}$. The idea is that the approximation will be quite accurate for points near the MLE $\hat{\beta}$. For points far from the MLE, the posterior density is small anyway, so a poor approximation at those points should have an insignificant effect. Motivated by these considerations, we Taylor expand the log-likelihood $\ell(\beta)$ around $\hat{\beta}$ up to second order as follows: \begin{align*} \ell(\beta) &\approx \ell(\hat{\beta}) + \left<\nabla \ell(\hat{\beta}), \beta - \hat{\beta} \right> + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \\ &= \ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) ~~ \text{ because } \nabla \ell(\hat{\beta}) = 0. \end{align*} Thus the posterior is approximated as: \begin{align*} \text{posterior}(\beta) & \propto \exp \left(\ell(\beta) \right) \\ & \approx \exp \left(\ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &= \exp \left(\ell(\hat{\beta}) \right) \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &\propto \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right). \end{align*} In the last step above, we dropped the factor $\exp \left(\ell(\hat{\beta}) \right)$ because it does not involve the parameters $\beta$.

Now we can compare the posterior approximation to the density of a multivariate normal (with mean $\mu$ and covariance $\Sigma$), which is proportional to $\exp \left(- \frac{1}{2} (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) \right)$. Matching the two quadratic forms, it is clear that $\mu = \hat{\beta}$ and $\Sigma^{-1} = -H\ell(\hat{\beta})$. Thus the posterior normal approximation is given by: \begin{align*} \text{posterior} \approx N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right) \end{align*} This gives a very simple and practical approximation to the complicated posterior, especially because we already know how to calculate the Hessian (it was used in the Newton algorithm for computing the MLE). If one wants, posterior samples can easily be generated from $N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right)$. This will be much faster than using PyMC. However, PyMC tries to obtain samples from the actual posterior, while this normal distribution is only an approximation to the actual posterior.
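
As a small sketch of how cheap this is in practice (reusing Xmat and beta_hat from the Newton cell above; exact numbers will vary with the random seed), one can draw approximate posterior samples directly from this multivariate normal:

In [ ]:
#Sketch: draw approximate posterior samples from N(beta_hat, (-H l(beta_hat))^{-1})
mu_hat = np.exp(Xmat @ beta_hat)
cov_approx = np.linalg.inv(Xmat.T @ np.diag(mu_hat) @ Xmat)   #equals (-Hessian)^{-1}
rng = np.random.default_rng(0)
normal_samples = rng.multivariate_normal(beta_hat, cov_approx, size=4000)
print(normal_samples.mean(axis=0))   #should be close to beta_hat
print(normal_samples.std(axis=0))    #should be close to the sm.GLM standard errors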

Standard Errors for Poisson Regression¶

The standard errors of Poisson Regression that were reported by sm.GLM are actually computed from the covariance of the posterior normal approximation. In other words, the standard errors are simply the square-roots of the diagonal entries of the matrix: \begin{align*} \left(-H\ell(\hat{\beta}) \right)^{-1} = \left(X^T M(\hat{\beta}) X \right)^{-1} \end{align*} where $\hat{\beta}$ is the MLE and $M(\beta)$ is the diagonal matrix with diagonal entries $\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}$. This can be readily verified in the context of our dataset as follows.

In [22]:
#Standard Error Calculation:
log_muvec = np.dot(Xmat, beta_hat)
muvec = np.exp(log_muvec)
M = np.diag(muvec)
Hessian = -Xmat.T @ M @ Xmat
Hessian_inv = np.linalg.inv(Hessian)
CovMat = -Hessian_inv
print(np.sqrt(np.diag(CovMat)))
#Compare with
print(poiregmodel.bse)
[0.012336 0.004179 0.000212 0.000633 0.000380 0.000549 0.000016]
const      0.012336
kidslt6    0.004179
age        0.000212
educ       0.000633
huswage    0.000380
exper      0.000549
expersq    0.000016
dtype: float64

These standard errors can also be justified using frequentist arguments, but that justification is much more complicated (it involves asymptotics and notions such as Fisher information). The Bayesian justification is much simpler and more intuitive (it is based on the normal approximation via Taylor expansion).

The important takeaways are:

  1. Poisson Regression is an important alternative to linear regression that is applicable when the response variable consists of counts. It gives more interpretable results than linear regression by allowing log-linear modeling in situations where one cannot directly take logarithms of the response variable due to the existence of zero counts.
  2. In the case of linear regression, frequentist inference and Bayesian inference with flat priors coincide exactly as can be rigorously mathematically proved.
  3. For Poisson regression, frequentist inference uses the MLE $\hat{\beta}$ and provides standard errors by taking square roots of the diagonal entries of the matrix $\left(-H\ell(\hat{\beta}) \right)^{-1}$. Bayesian inference with flat priors can give different answers from frequentist inference in Poisson regression. However, if one implements Bayesian inference by approximating the posterior with a multivariate normal distribution centered at the MLE (instead of working with the full posterior), then Bayesian inference gives answers identical to the frequentist ones.