We shall fit models with 'inlf' as the response variable and 'kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq' as the explanatory variables (also known as features or covariates). The response variable 'inlf' is binary equaling 1 if the individual worked (i.e., they were 'in the labor force') in the year 1975 and 0 otherwise. The logistic regression model assumes that
\begin{align*}
Y \mid X_1 = x_1, \dots, X_m = x_m \sim \text{Bernoulli}(\mu(x_1, \dots, x_m)) ~~ \text{ where } ~~ \log \frac{\mu(x_1, \dots, x_m)}{1 - \mu(x_1, \dots, x_m)} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m
\end{align*}
In words, the **log-odds** of the event $Y = 1$ is modeled by the linear function $\beta_0 + \beta_1 X_1 + \dots + \beta_m X_m$. The interpretation of $\beta_j$ (for $1 \leq j \leq m$) is the change in the log-odds for a unit change in $X_j$ when all other variables are held fixed. The logistic regression model (and more generally every GLM) specifies the relationship between $Y$ and $X_1, \dots, X_m$ in terms of the **parameters** $\beta_0, \dots, \beta_m$ (which will have to be learned from data) and is thus an example of a **parametric model**.
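To make the log-odds interpretation concrete, here is a small sketch (with hypothetical coefficient values, not estimates from the MROZ data) showing that a unit increase in a covariate changes the log-odds by $\beta_j$, i.e., multiplies the odds $\mu / (1 - \mu)$ by $e^{\beta_j}$:

```
import numpy as np

def log_odds_to_prob(log_odds):
    # Inverse of the logit link: probability = 1 / (1 + exp(-log_odds))
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical coefficients beta_0 = -1.0, beta_1 = 0.5 and a single covariate x = 2
beta0, beta1, x = -1.0, 0.5, 2.0
p = log_odds_to_prob(beta0 + beta1 * x)
print(p)  # probability that Y = 1 at x = 2

# A unit increase in x changes the log-odds by beta_1,
# i.e., multiplies the odds p / (1 - p) by exp(beta_1)
odds_before = p / (1 - p)
p_after = log_odds_to_prob(beta0 + beta1 * (x + 1))
odds_after = p_after / (1 - p_after)
print(odds_after / odds_before)  # equals exp(0.5) up to rounding
```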

We can fit the logistic regression model easily using statsmodels.api as follows.

In [3]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
```

In [4]:

```
#Loading the mroz dataset:
mroz = pd.read_csv("MROZ.csv")
print(mroz.shape)
print(mroz.head(12))
print(mroz['inlf'].value_counts()) #this is the response variable which has 428 ones and 325 zeros
```

In [5]:

```
#Logistic Regression
Y = mroz['inlf'] #this is the binary response variable
X = mroz[['kidslt6', 'age', 'educ',
          'huswage', 'exper', 'expersq']] #our covariates
X = sm.add_constant(X)
logimodel = sm.GLM(Y, X, family=sm.families.Binomial()).fit()
print(logimodel.summary())
```

We shall now re-fit this logistic regression model using the **scikit-learn** Python library. This is because the other methods that we shall study today (k-NN, decision trees and random forests) cannot be fit via statsmodels.api. On the other hand, note that, unlike statsmodels.api, scikit-learn does not provide functionality for every GLM (e.g., negative binomial regression).

In [6]:

```
#Loading the scikit-learn library for logistic regression:
from sklearn.linear_model import LogisticRegression
```

In [7]:

```
#Let us re-fit this logistic regression model using scikit-learn
Y = mroz['inlf'] # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
logimodel_sk = LogisticRegression(max_iter = 1000)
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
```

The 'Intercept' above refers to the estimate for $\beta_0$ and 'Coefficients' refer to the estimates for $\beta_1, \dots, \beta_m$. Observe that these estimates are slightly different from those given by statsmodels.api. This is because scikit-learn does not compute the MLE by default. Instead, the default setting adds an $L_2$ penalty for regularization to $(-2) \times \text{log-likelihood}$ before minimizing the resulting criterion to estimate $\beta_0, \dots, \beta_m$. Such regularization is helpful (both for numerical computation and statistical purposes) especially in the case where the number of covariates $m$ is large. We shall see the usefulness of regularization in a future lecture.
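As a minimal sketch of this shrinkage effect (on simulated data, since the point does not depend on the MROZ dataset), the `C` parameter of `LogisticRegression` controls the strength of the $L_2$ penalty: smaller `C` means stronger regularization, pulling the coefficient estimates further toward zero.

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data with true coefficients 1.5 and -2.0 (third covariate irrelevant)
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(200, 3))
logits = 1.5 * X_sim[:, 0] - 2.0 * X_sim[:, 1]
Y_sim = rng.binomial(1, 1 / (1 + np.exp(-logits)))

for C in [0.01, 1.0, 100.0]:
    m = LogisticRegression(C=C, max_iter=1000).fit(X_sim, Y_sim)
    print(C, np.round(m.coef_, 3))
# The coefficients for C = 0.01 are much closer to zero than those for C = 100
```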

If we do not want regularization and want to simply compute the MLE, we can specify this via the "penalty = None" option as shown below.

In [8]:

```
#With no regularization
logimodel_sk = LogisticRegression(penalty=None, max_iter = 2000) #"penalty = None" makes it compute the MLE just like statsmodels.api
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
#These estimates are essentially the same as the ones given by statsmodels.api
```

It is important to assess the prediction accuracy of regression models. One simple way of doing this is to split the existing dataset into two parts: a training dataset and a test dataset. One then fits the model on the training data and uses the fitted model to obtain predictions on the test dataset. These predictions can be compared to the actual response values from the test data, and the discrepancy gives some idea of the prediction accuracy of the model. Below we evaluate the prediction accuracy of the fitted logistic regression model in this way by (randomly) splitting the data into a training dataset of size 503 and a test dataset of size 250. In practice, one would vary the sizes of the training and test datasets (and also perform multiple random splits) to get a robust assessment of accuracy.

The scikit-learn library has in-built functions for randomly splitting the data into training and test datasets.

In [9]:

```
#Split the original data into training and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
```

In [10]:

```
#Let us fit logistic regression on the training dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
```

For each observation, the fitted logistic regression model outputs the estimated **probability** that $Y = 1$. This is a number that can be anywhere between 0 and 1. If we want a binary prediction for the response, we need to place a threshold on the predicted probability. The most common threshold is 0.5 (i.e., if the fitted probability is greater than or equal to 0.5, the predicted class label is 1; otherwise, it is 0). One can obtain these binary predictions using the '.predict' method. Predicted probabilities can be obtained via the '.predict_proba' method.

In [11]:

```
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
```

In [12]:

```
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
```

```
Predicted   0    1  All
Actual                 
0          64   49  113
1          19  118  137
All        83  167  250
```
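From the counts in this table, the test accuracy and related quantities (true positive and true negative rates) can be computed by hand. The sketch below simply plugs in the counts shown above:

```
# Counts from the contingency table: keys are (actual, predicted) labels
counts = {('0', '0'): 64, ('0', '1'): 49, ('1', '0'): 19, ('1', '1'): 118}
total = sum(counts.values())

accuracy = (counts[('0', '0')] + counts[('1', '1')]) / total
# True positive rate: fraction of actual ones predicted as one
sensitivity = counts[('1', '1')] / (counts[('1', '0')] + counts[('1', '1')])
# True negative rate: fraction of actual zeros predicted as zero
specificity = counts[('0', '0')] / (counts[('0', '0')] + counts[('0', '1')])
print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```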

Given a positive integer $k$, the $k$-nearest neighbour classifier classifies each point based on the labels of the $k$ nearest points in the training set. For example, suppose we want to predict the response for a test observation with covariate values $x_1, \dots, x_m$. The $k$-nearest neighbour method identifies the $k$ observations in the training set whose covariate values are closest to $x_1, \dots, x_m$. It then looks at the observed response values for these $k$ observations in the training dataset. Suppose $k = 8$ and the response values of the $k$ closest training points are $1, 0, 1, 1, 1, 0, 0, 1$. Then the predicted probability for the response of the test observation equals $5/8 = 0.625$. If we want a binary prediction, it will be 1 as the predicted probability is more than 0.5.
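The vote-counting in this worked example can be checked in a few lines:

```
import numpy as np

# Responses of the k = 8 nearest training points from the example above
neighbor_responses = np.array([1, 0, 1, 1, 1, 0, 0, 1])
predicted_prob = neighbor_responses.mean()     # fraction of ones among neighbors
predicted_label = int(predicted_prob > 0.5)    # binary prediction via 0.5 threshold
print(predicted_prob, predicted_label)  # 0.625 1
```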

scikit-learn has an in-built function for $k$-NN classification which works as follows.

In [13]:

```
from sklearn.neighbors import KNeighborsClassifier
#Using k-NN classification:
k_value = 8
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
```

Generally, $k$-nearest neighbour classification is a simple method that often gives decent accuracy (this accuracy can be used as a benchmark for more elaborate classification methods). The value of $k$ can be tuned by standard techniques such as cross-validation. However, one needs to be aware of the following issues while using $k$-NN:

- One needs to select a suitable notion of distance to measure the distance between $(x_1, \dots, x_m)$ and the covariate observations in the training set. The default choice is the Euclidean distance (the square root of the sum of the squares of the discrepancies in each covariate). However, different covariates have different units, which makes this distance not very principled. For example, if we measure education in number of years and experience in number of months, then the Euclidean distance will probably be dominated by experience (as it will take larger values because of the specific choice of units). It is therefore generally recommended to scale the features before using k-NN. One can scale each covariate by either (a) subtracting the minimum value and dividing by the range, or (b) subtracting the mean and dividing by the standard deviation. This scaling makes all variables unitless (so that squaring and adding across variables makes more sense).
- When the number of covariates is large, the volume of the covariate space becomes very large and none of the 'nearest' neighbors might be actually near to the test observation. In this case, the method cannot provide accurate predictions.
- All the covariates are treated equally in the calculation of the distance used to determine nearest neighbors. This means that irrelevant features are given as much weight as important features.
- For large datasets, the computational cost can be high because the method requires computing distances to every training example for each prediction. Also, since k-NN requires storing the entire training dataset, memory consumption can be an issue for large datasets. In contrast, for predictions via logistic regression, one only needs to store the fitted regression coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_m$.
- k-NN does not allow answering important questions like: what features are important for explaining the response? If we intervene and change one covariate by 5 units, what changes would we expect in the response?
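The scaling recommendation in the first point above can be sketched with scikit-learn's `Pipeline` and `StandardScaler`. The data here are simulated with deliberately mismatched units (the names `educ_years` and `exper_months` are hypothetical, echoing the education/experience example):

```
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
educ_years = rng.uniform(8, 18, size=300)      # education measured in years
exper_months = rng.uniform(0, 360, size=300)   # experience in months: much larger scale
X_sim = np.column_stack([educ_years, exper_months])
Y_sim = (educ_years + exper_months / 12 > 22).astype(int)

# Standardize each covariate (subtract mean, divide by standard deviation)
# before computing the Euclidean distances used by k-NN
knn_scaled = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=8)),
])
knn_scaled.fit(X_sim, Y_sim)
print(knn_scaled.score(X_sim, Y_sim))  # in-sample accuracy of the scaled pipeline
```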

Unlike logistic regression, where there is a clearly defined likelihood involving parameters $\beta_0, \beta_1, \dots, \beta_m$, there are no parameters or likelihood in k-NN. This is one reason why k-NN can be treated as a **nonparametric** method.

A decision tree is a method for classification (and regression) that uses a tree-like structure to decide what value to predict for a point. The decision tree is constructed using the training data, and the constructed tree is used to come up with predictions for test observations. The tree starts with the root node at the top which contains all the observations. Each node is then recursively split into smaller nodes until a stopping criterion is reached. The goal is to obtain nodes which are as **pure** as possible. Here **pure** means that the response values of observations in the node are homogeneous in the sense that all responses are equal to 0 or all responses are equal to 1. Here is an overview of the tree construction process:

1. **Selection of the splitting covariate and split point**: For each covariate, potential split points are considered. The quality of a split is evaluated using a criterion. For classification, the common criterion is the Gini impurity. For a node containing observations having $n_0$ zero responses and $n_1$ one responses, the Gini impurity is defined as
\begin{align*}
\text{Gini} = 1 - \left( \frac{n_0}{n_0 + n_1} \right)^2 - \left( \frac{n_1}{n_0 + n_1} \right)^2.
\end{align*}
If the node is maximally pure, i.e., either $n_0 = 0$ or $n_1 = 0$, then the Gini impurity takes its lowest possible value of 0. On the other hand, if the node is maximally impure, i.e., $n_0 = n_1$, then the Gini impurity takes its highest possible value of 0.5. The covariate and split point resulting in the lowest (weighted average) Gini impurity over the two resulting child nodes are selected for splitting the node.
2. **Creating a node**: If the node is pure or if other stopping conditions are met (e.g., maximum depth), mark the node as a leaf and assign it the most common response. Otherwise use the best covariate and split point (as in the previous step) to further split the node into two subsets.
3. **Stopping Conditions**: The default in scikit-learn is to grow the tree until all leaf nodes are pure. However deep decision trees are not interpretable and prone to overfitting. Several conditions can be used to stop the tree from growing further. One simple way is to specify the maximum depth of the tree.
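The Gini impurity described in step 1 can be implemented in a few lines to check its extreme values of 0 (pure node) and 0.5 (maximally impure node):

```
def gini_impurity(n0, n1):
    # Proportion of ones in a node with n0 zero responses and n1 one responses
    p = n1 / (n0 + n1)
    # 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p)
    return 2 * p * (1 - p)

print(gini_impurity(10, 0))   # pure node: 0.0
print(gini_impurity(5, 5))    # maximally impure node: 0.5
```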

After the tree is constructed on the training data, to predict the response value for a new observation, we start at the root node, and continue down until we reach a leaf node. The prediction is then given by the most common response in the leaf node.

Here's how it works in scikit-learn:

In [14]:

```
from sklearn.tree import DecisionTreeClassifier
# Initialize and train the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
```

In [15]:

```
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=X.columns,class_names = ('0','1'), filled=True, rounded=True)
plt.show()
```

In [18]:

```
#This tree is way too big to allow any meaningful interpretation
#Here are a few ways to create smaller trees:
dt_small = DecisionTreeClassifier(max_depth = 4, random_state = 1)
dt_small.fit(X_train, Y_train)
Y_pred_dt_small = dt_small.predict(X_test)
# Calculate the accuracy
accuracy_dt_small = np.mean(Y_test == Y_pred_dt_small)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt_small}")
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt_small, feature_names=X.columns, class_names = ('0','1'), filled=True, rounded=True)
plt.show()
```

Accuracy of Decision Tree Classification on test set: 0.72