In this lecture, we shall look at some popular methods for classification (classification refers to the special case of the regression problem where the response variable $Y$ is binary). The specific methods that we shall consider are: (a) Nearest Neighbour classification, (b) Decision Trees, and (c) Random Forests. These methods can be compared to logistic regression, which belongs to the family of GLMs that we have seen in the past few lectures. We will apply all these methods (starting with logistic regression) to the Mroz dataset that was analyzed in the last two lectures. This dataset contains a number of economic variables related to labor force participation for married women in the year 1975.
We shall fit models with 'inlf' as the response variable and 'kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq' as the explanatory variables (also known as features or covariates). The response variable 'inlf' is binary, equaling 1 if the individual worked (i.e., they were 'in the labor force') in the year 1975 and 0 otherwise. The logistic regression model assumes that \begin{align*} Y \mid X_1 = x_1, \dots, X_m = x_m \sim \text{Bernoulli}(\mu(x_1, \dots, x_m)) ~~ \text{ where } ~~ \log \frac{\mu(x_1, \dots, x_m)}{1 - \mu(x_1, \dots, x_m)} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \end{align*} In words, the log-odds of the event $Y = 1$ is modeled by the linear function $\beta_0 + \beta_1 X_1 + \dots + \beta_m X_m$. The interpretation of $\beta_j$ (for $1 \leq j \leq m$) is the change in log-odds for a unit change in $X_j$ when all other variables are held fixed. The logistic regression model (and more generally every GLM) specifies the relationship between $Y$ and $X_1, \dots, X_m$ in terms of the parameters $\beta_0, \dots, \beta_m$ (which will have to be learned from data) and is thus an example of a parametric model.
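Equivalently, the mean function $\mu$ is the inverse-logit (sigmoid) of the linear predictor, so it always lies between 0 and 1. The small sketch below illustrates this mapping; the coefficient and covariate values in it are made up purely for illustration.
#Illustration of the inverse-logit link (the beta and x values below are hypothetical)
import numpy as np
def inv_logit(eta):
    return np.exp(eta) / (1 + np.exp(eta))  #maps any real number to a probability in (0, 1)
beta = np.array([0.5, -1.0, 0.2])  #hypothetical values for beta_0, beta_1, beta_2
x = np.array([1.0, 2.0, 3.0])      #a leading 1 for the intercept, followed by covariate values
print(inv_logit(beta @ x))         #modeled probability that Y = 1 for these covariates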
We can fit the logistic regression model easily using statsmodels.api as follows.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
#Loading the mroz dataset:
mroz = pd.read_csv("MROZ.csv")
print(mroz.shape)
print(mroz.head(12))
print(mroz['inlf'].value_counts()) #this is the response variable which has 428 ones and 325 zeros
(753, 22)
    inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0      1   1610        1        0   32    12  3.3540     2.65    2708      34
1      1   1656        0        2   30    12  1.3889     2.65    2310      30
2      1   1980        1        3   35    12  4.5455     4.04    3072      40
3      1    456        0        3   34    12  1.0965     3.25    1920      53
4      1   1568        1        2   31    14  4.5918     3.60    2000      32
5      1   2032        0        0   54    12  4.7421     4.70    1040      57
6      1   1440        0        2   37    16  8.3333     5.95    2670      37
7      1   1020        0        0   54    12  7.8431     9.98    4120      53
8      1   1458        0        2   48    12  2.1262     0.00    1995      52
9      1   1600        0        2   39    12  4.6875     4.15    2100      43
10     1   1969        0        1   33    12  4.0630     4.30    2450      34
11     1   1960        0        1   42    11  4.5918     4.58    2375      47

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071

       lwage  expersq
0   1.210154      196
1   0.328512       25
2   1.514138      225
3   0.092123       36
4   1.524272       49
5   1.556480     1089
6   2.120260      121
7   2.059634     1225
8   0.754336      576
9   1.544899      441
10  1.401922      225
11  1.524272      196

[12 rows x 22 columns]
inlf
1    428
0    325
Name: count, dtype: int64
#Logistic Regression
Y = mroz['inlf'] #this is the binary response variable
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']] #our covariates kidslt6, age, educ, huswage, exper, expersq
X = sm.add_constant(X)
logimodel = sm.GLM(Y, X, family=sm.families.Binomial()).fit()
print(logimodel.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   inlf   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -403.13
Date:                Wed, 11 Oct 2023   Deviance:                       806.25
Time:                        13:51:42   Pearson chi2:                     728.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2568
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8430      0.757      1.114      0.265      -0.640       2.326
kidslt6       -1.4516      0.201     -7.232      0.000      -1.845      -1.058
age           -0.0945      0.014     -6.948      0.000      -0.121      -0.068
educ           0.2071      0.042      4.886      0.000       0.124       0.290
huswage       -0.0440      0.021     -2.066      0.039      -0.086      -0.002
exper          0.2079      0.032      6.540      0.000       0.146       0.270
expersq       -0.0032      0.001     -3.137      0.002      -0.005      -0.001
==============================================================================
What is the interpretation of the coefficient -1.4516 of the 'kidslt6' variable? It means that each additional small kid reduces the log-odds of working by 1.4516 (for fixed values of all the other variables in the model). Equivalently, the odds of working are multiplied by $\exp(-1.4516) \approx 0.234$. Thus the odds of working drop by roughly 77% with each small kid (for fixed values of all other variables in the model).
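These odds ratios can be obtained directly from the fitted model by exponentiating the estimated coefficients; here is a small check reusing the logimodel object fitted above.
#Odds ratios obtained by exponentiating the estimated coefficients
print(np.exp(logimodel.params))  #the entry for kidslt6 should be exp(-1.4516), roughly 0.234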
We shall now fit the same logistic regression model using the scikit-learn Python library. This is because the other methods that we shall study today (k-NN, decision trees and random forests) cannot be fit via statsmodels.api. On the other hand, note that, unlike statsmodels.api, scikit-learn does not provide functionality for every GLM (e.g., negative binomial regression).
#Loading the scikit-learn library for logistic regression:
from sklearn.linear_model import LogisticRegression
#Let us re-fit this logistic regression model using scikit-learn
Y = mroz['inlf'] # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
logimodel_sk = LogisticRegression(max_iter = 1000)
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
Intercept: [0.78500383]
Coefficients: [[-1.39555397 -0.09272429  0.20477548 -0.04376797  0.2071669  -0.0031617 ]]
The 'Intercept' above refers to the estimate of $\beta_0$ and 'Coefficients' refers to the estimates of $\beta_1, \dots, \beta_m$. Observe that these estimates are slightly different from those given by statsmodels.api. This is because scikit-learn does not compute the MLE by default. Instead, the default setting adds an $L_2$ penalty on the coefficients to the negative log-likelihood and minimizes the resulting criterion to estimate $\beta_0, \dots, \beta_m$. Such regularization is helpful (both for numerical computation and for statistical purposes), especially when the number of covariates $m$ is large. We shall see the usefulness of regularization in a future lecture.
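The strength of this default $L_2$ regularization is controlled by the parameter C of LogisticRegression, which is the inverse of the penalty strength; a very large C therefore gives almost unregularized estimates. A small sketch (the value 1e6 below is an arbitrary choice):
#Weaker regularization via a large C (C is the inverse of the L2 penalty strength)
logimodel_weakreg = LogisticRegression(C = 1e6, max_iter = 1000)
logimodel_weakreg.fit(X, Y)
print(logimodel_weakreg.coef_)  #these estimates should be close to the MLE from statsmodels.api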
If we do not want regularization and want to simply compute the MLE, we can specify this via the "penalty = None" option as shown below.
#With no regularization
logimodel_sk = LogisticRegression(penalty=None, max_iter = 2000) #"penalty = None" makes it compute the MLE just like statsmodels.api
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
#These estimates are essentially the same as the ones given by statsmodels.api
Intercept: [0.84388436]
Coefficients: [[-1.45179926 -0.09456605  0.20715796 -0.04402401  0.20796276 -0.00317439]]
It is important to assess the prediction accuracy of regression models. One simple way of doing this is to split the existing dataset into two parts: a training dataset and a test dataset. One then fits the model on the training data and uses the fitted model to obtain predictions on the test dataset. These predictions can be compared to the actual response values in the test data, and the discrepancy gives some idea of the prediction accuracy of the model. Below we evaluate the prediction accuracy of the fitted logistic regression model in this way by (randomly) splitting the data into a training dataset of size 503 and a test dataset of size 250. In practice, one would also try different sizes for the training and test datasets (and perform multiple random splits) to get a robust assessment of accuracy.
The scikit-learn library has in-built functions for randomly splitting the data into training and test datasets.
#Split the original data into training and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
#Let us fit logistic regression on the training dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
Intercept: [0.8104027]
Coefficients: [[-1.42578681 -0.08662768  0.19075837 -0.05340385  0.22425066 -0.00391879]]
We now use this fitted logistic regression model to obtain predictions on the test dataset. Note that for a set of new covariates $x_1, \dots, x_m$, the fitted logistic regression model gives the prediction: \begin{align*} \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)} \end{align*} for the probability that $Y = 1$. This is a number that can be anywhere between 0 and 1. If we want a binary prediction for the response, we need to place a threshold on the predicted probability. The most common threshold is 0.5 (i.e., if the fitted probability is greater than or equal to 0.5, the predicted class label is 1; otherwise, it is 0). One can obtain these binary predictions using the '.predict' method. Predicted probabilities can be obtained using the '.predict_proba' method.
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
[1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0]
Accuracy of Logistic Regression on test set: 0.728
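As a sanity check on the thresholding described above, the binary predictions from '.predict' should coincide with thresholding the '.predict_proba' probabilities at 0.5; here is a small verification reusing the objects defined above.
#The binary predictions should equal the predicted probabilities thresholded at 0.5
probs = logimodel_train.predict_proba(X_test)[:, 1]  #second column = probability of class 1
print(np.all((probs >= 0.5).astype(int) == Y_pred))  #expected to print True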
Accuracy here is the proportion of correctly classified observations in the test dataset. One can also create the confusion matrix (or contingency table) comparing the predictions with the actual response values on the test dataset.
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Predicted    0    1  All
Actual
0           64   49  113
1           19  118  137
All         83  167  250
This table gives a further breakdown of the discrepancies between predictions and actual test response values. For example, 19 observations for which $Y = 1$ in the test dataset were predicted as 0, and 49 observations for which $Y = 0$ in the test dataset were predicted as 1. Also, logistic regression predicts more ones (167) than are actually present (137) in the test responses.
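As noted earlier, the accuracy from a single random split is itself random, so a more robust assessment repeats the split several times and averages the results. Here is a minimal sketch of this idea (the choice of 20 repetitions is arbitrary).
#Average test accuracy of logistic regression over several random train/test splits
accuracies = []
for seed in range(20):  #20 repetitions is an arbitrary choice
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=503, test_size=250, random_state=seed)
    model = LogisticRegression(penalty=None, max_iter=2000).fit(X_tr, Y_tr)
    accuracies.append(np.mean(model.predict(X_te) == Y_te))
print(np.mean(accuracies), np.std(accuracies))  #average accuracy and its variability across splits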
Given a positive integer $k$, the $k$-nearest neighbour classifier classifies each point based on the labels of the $k$ nearest points in the training set. For example, suppose we want to predict the response for a test observation with covariate values $x_1, \dots, x_m$. The $k$-nearest neighbour method identifies the $k$ observations in the training set whose covariate values are closest to $x_1, \dots, x_m$. It then looks at the observed response values for these $k$ observations in the training dataset. Suppose $k = 8$ and the response values of the $k$ closest training points are $1, 0, 1, 1, 1, 0, 0, 1$. Then the predicted probability for the response of the test observation equals $5/8 = 0.625$. If we want a binary prediction, it will be 1 as the predicted probability is more than 0.5.
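To make this concrete, here is a minimal from-scratch sketch of the $k$-NN predicted probability for a single test point (this is only an illustration of the idea using Euclidean distance on NumPy arrays, not how scikit-learn implements it).
#A from-scratch sketch of the k-NN predicted probability for one test point
def knn_predict_proba(x_new, X_tr, Y_tr, k=8):
    dists = np.sqrt(((X_tr - x_new) ** 2).sum(axis=1))  #Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                      #indices of the k closest training points
    return Y_tr[nearest].mean()                          #fraction of ones among these neighbours
#e.g., if the k = 8 nearest responses are 1, 0, 1, 1, 1, 0, 0, 1, the predicted probability is 5/8
print(np.mean([1, 0, 1, 1, 1, 0, 0, 1]))  #0.625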
scikit-learn has an in-built function for $k$-NN classification which works as follows.
from sklearn.neighbors import KNeighborsClassifier
#Using k-NN classification:
k_value = 8
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.744
Predicted    0    1  All
Actual
0           78   35  113
1           29  108  137
All        107  143  250
Generally, $k$-nearest neighbour classification is a simple method that often gives decent accuracy (this accuracy can be used as a benchmark for more elaborate classification methods). The value of $k$ can be tuned by standard techniques such as cross-validation. However, one needs to be aware of the following when using $k$-NN:
Unlike logistic regression, where there is a clearly defined likelihood involving the parameters $\beta_0, \beta_1, \dots, \beta_m$, there are no parameters or likelihood in k-NN. This is one reason why k-NN is treated as a nonparametric method.
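As mentioned above, the value of $k$ can be tuned by cross-validation. Here is a minimal sketch using scikit-learn's cross_val_score on the training data (the grid of candidate values below is an arbitrary choice).
#Choosing k by 5-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score
for k in [1, 2, 4, 8, 16, 32]:  #an arbitrary grid of candidate values
    cv_accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, Y_train, cv=5).mean()
    print(k, cv_accuracy)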
A decision tree is a method for classification (and regression) that uses a tree-like structure to decide what value to predict for a point. The decision tree is constructed using the training data, and the constructed tree is used to come up with predictions for test observations. The tree starts with the root node at the top which contains all the observations. Each node is then recursively split into smaller nodes until a stopping criterion is reached. The goal is to obtain nodes which are as pure as possible. Here pure means that the response values of observations in the node are homogeneous in the sense that all responses are equal to 0 or all responses are equal to 1. Here is an overview of the tree construction process:
1. Choosing the best split: For each covariate and each candidate split point, the observations in the node are divided into two groups, and the quality of the split is measured by the Gini impurity of the resulting nodes. If a node has $n_0$ observations with response 0 and $n_1$ observations with response 1, its Gini impurity is \begin{align*} 2 \hat{p} (1 - \hat{p}) ~~ \text{ where } ~~ \hat{p} = \frac{n_1}{n_0 + n_1}. \end{align*} If the node is maximally pure i.e., either $n_0 = 0$ or $n_1 = 0$, then Gini impurity takes its lowest possible value of 0. On the other hand, if the node is maximally impure i.e., $n_0 = n_1$, then Gini impurity takes its highest possible value of 0.5. The covariate and split point resulting in the lowest Gini impurity are selected for splitting the node (a short numerical check of this formula is given after the list).
2. Creating a node: If the node is pure or if other stopping conditions are met (e.g., maximum depth), mark the node as a leaf and assign it the most common response among its observations. Otherwise, use the best covariate and split point (as in the previous step) to split the node into two child nodes.
3. Stopping conditions: The default in scikit-learn is to grow the tree until all leaf nodes are pure. However, deep decision trees are not interpretable and are prone to overfitting. Several conditions can be used to stop the tree from growing further; one simple way is to specify the maximum depth of the tree.
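A quick numerical check of the binary Gini impurity formula in step 1:
#Gini impurity of a node with n0 zeros and n1 ones
def gini(n0, n1):
    p = n1 / (n0 + n1)      #proportion of ones in the node
    return 2 * p * (1 - p)  #equals 0 for a pure node and 0.5 when n0 = n1
print(gini(10, 0), gini(5, 5), gini(3, 7))  #0.0, 0.5 and 0.42 respectively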
After the tree is constructed on the training data, to predict the response value for a new observation, we start at the root node, and continue down until we reach a leaf node. The prediction is then given by the most common response in the leaf node.
Here's how it works in scikit-learn:
from sklearn.tree import DecisionTreeClassifier
# Initialize and train the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Decision Tree Classification on test set: 0.668
Predicted    0    1  All
Actual
0           67   46  113
1           37  100  137
All        104  146  250
The accuracy of this fully grown decision tree (0.668) is noticeably lower than that of logistic regression (0.728) and k-NN (0.744). We can visualize the fitted tree as follows.
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=X.columns,class_names = ('0','1'), filled=True, rounded=True)
plt.show()
This tree is far too big to allow a meaningful look at it. It is also likely overfitting, which helps explain its relatively poor test accuracy. We can control the complexity of the tree by limiting its maximum depth.
#This tree is way too big to allow any meaningful interpretation
#One simple way to create a smaller tree is to limit its maximum depth:
dt_small = DecisionTreeClassifier(max_depth = 4, random_state = 1)
dt_small.fit(X_train, Y_train)
Y_pred_dt_small = dt_small.predict(X_test)
# Calculate the accuracy
accuracy_dt_small = np.mean(Y_test == Y_pred_dt_small)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt_small}")
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt_small, feature_names=X.columns, class_names = ('0','1'), filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.72