In this lecture, we shall look at some popular methods for classification (classification refers to the special case of the regression problem where the response variable $Y$ is binary). The specific methods that we shall consider are: (a) Nearest Neighbour classification, (b) Decision Trees, and (c) Random Forests. These methods can be compared to logistic regression, which belongs to the family of GLMs that we have seen in the past few lectures. We will apply all these methods (starting with logistic regression) to the Mroz dataset that was analyzed in the last two lectures. This dataset contains a number of economic variables related to labor force participation for married women in the year 1975.
We shall fit models with 'inlf' as the response variable and 'kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq' as the explanatory variables (also known as features or covariates). The response variable 'inlf' is binary, equaling 1 if the individual worked (i.e., they were 'in the labor force') in the year 1975 and 0 otherwise. The logistic regression model assumes that \begin{align*} Y \mid X_1 = x_1, \dots, X_m = x_m \sim \text{Bernoulli}(\mu(x_1, \dots, x_m)) ~~ \text{ where } ~~ \log \frac{\mu(x_1, \dots, x_m)}{1 - \mu(x_1, \dots, x_m)} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \end{align*} In words, the log-odds of the event $Y = 1$ is modeled by the linear function $\beta_0 + \beta_1 X_1 + \dots + \beta_m X_m$. The interpretation of $\beta_j$ (for $1 \leq j \leq m$) is the change in log-odds for a unit change in $X_j$ when all other variables are held fixed. The logistic regression model (and more generally every GLM) specifies the relationship between $Y$ and $X_1, \dots, X_m$ in terms of the parameters $\beta_0, \dots, \beta_m$ (which have to be learned from data) and is thus an example of a parametric model.
We can fit the logistic regression model easily using statsmodels.api as follows.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
#Loading the mroz dataset:
mroz = pd.read_csv("MROZ.csv")
print(mroz.shape)
print(mroz.head(12))
print(mroz['inlf'].value_counts()) #this is the response variable which has 428 ones and 325 zeros
(753, 22)
    inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0      1   1610        1        0   32    12  3.3540     2.65    2708      34
1      1   1656        0        2   30    12  1.3889     2.65    2310      30
2      1   1980        1        3   35    12  4.5455     4.04    3072      40
3      1    456        0        3   34    12  1.0965     3.25    1920      53
4      1   1568        1        2   31    14  4.5918     3.60    2000      32
5      1   2032        0        0   54    12  4.7421     4.70    1040      57
6      1   1440        0        2   37    16  8.3333     5.95    2670      37
7      1   1020        0        0   54    12  7.8431     9.98    4120      53
8      1   1458        0        2   48    12  2.1262     0.00    1995      52
9      1   1600        0        2   39    12  4.6875     4.15    2100      43
10     1   1969        0        1   33    12  4.0630     4.30    2450      34
11     1   1960        0        1   42    11  4.5918     4.58    2375      47

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071

       lwage  expersq
0   1.210154      196
1   0.328512       25
2   1.514138      225
3   0.092123       36
4   1.524272       49
5   1.556480     1089
6   2.120260      121
7   2.059634     1225
8   0.754336      576
9   1.544899      441
10  1.401922      225
11  1.524272      196

[12 rows x 22 columns]
inlf
1    428
0    325
Name: count, dtype: int64
#Logistic Regression
Y = mroz['inlf'] #this is the binary response variable
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']] #our covariates kidslt6, age, educ, huswage, exper, expersq
X = sm.add_constant(X)
logimodel = sm.GLM(Y, X, family=sm.families.Binomial()).fit()
print(logimodel.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   inlf   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -403.13
Date:                Wed, 11 Oct 2023   Deviance:                       806.25
Time:                        13:51:42   Pearson chi2:                     728.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2568
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8430      0.757      1.114      0.265      -0.640       2.326
kidslt6       -1.4516      0.201     -7.232      0.000      -1.845      -1.058
age           -0.0945      0.014     -6.948      0.000      -0.121      -0.068
educ           0.2071      0.042      4.886      0.000       0.124       0.290
huswage       -0.0440      0.021     -2.066      0.039      -0.086      -0.002
exper          0.2079      0.032      6.540      0.000       0.146       0.270
expersq       -0.0032      0.001     -3.137      0.002      -0.005      -0.001
==============================================================================
What is the interpretation of the coefficient -1.4516 of the 'kidslt6' variable? It means that, for fixed values of all other variables in the model, each additional small kid decreases the log-odds of working by 1.4516. Equivalently, the odds of working are multiplied by a factor of $\exp(-1.4516) \approx 0.234$, i.e., each small kid reduces the odds of working to roughly 23% of their previous value (a reduction of about 77%), for fixed values of all other variables in the model.
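A quick way to see this odds-ratio interpretation directly is to exponentiate the fitted coefficients. Here is a small sketch using the 'logimodel' object fitted above.
#Odds ratios: exponentiate the fitted coefficients (sketch)
print(np.exp(logimodel.params))  #e.g., exp(-1.4516) is approximately 0.234 for kidslt6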
We shall now fit the same logistic regression model using the scikit-learn Python library, because the other methods that we shall study today (k-NN, decision trees and random forests) cannot be fit via statsmodels.api. On the other hand, unlike statsmodels.api, note that scikit-learn does not provide functionality for every GLM (e.g., negative binomial regression).
#Loading the scikit-learn library for logistic regression:
from sklearn.linear_model import LogisticRegression
#Let us re-fit this logistic regression model using scikit-learn
Y = mroz['inlf'] # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
logimodel_sk = LogisticRegression(max_iter = 1000)
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
Intercept: [0.78500383]
Coefficients: [[-1.39555397 -0.09272429  0.20477548 -0.04376797  0.2071669  -0.0031617 ]]
The 'Intercept' above refers to the estimate for $\beta_0$ and 'Coefficients' refer to the estimates for $\beta_1, \dots, \beta_m$. Observe that these estimates are slightly different from those given by statsmodels.api. This is because scikit-learn does not compute the MLE by default. Instead, the default setting adds an $L_2$ penalty for regularization to $(-2) \times \text{log-likelihood}$ before minimizing the resulting criterion to estimate $\beta_0, \dots, \beta_m$. Such regularization is helpful (both for numerical computation and statistical purposes) especially in the case where the number of covariates $m$ is large. We shall see the usefulness of regularization in a future lecture.
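To see the effect of this regularization, one can vary scikit-learn's C parameter (the inverse regularization strength, so smaller C means stronger shrinkage). Here is a small sketch; the grid of C values is arbitrary.
#Effect of the default L2 penalty: coefficients shrink towards zero as C decreases (sketch)
for C in [0.01, 1, 100]:
    model_C = LogisticRegression(C=C, max_iter=1000).fit(X, Y)
    print(C, model_C.coef_)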
If we do not want regularization and want to simply compute the MLE, we can specify this via the "penalty = None" option as shown below.
#With no regularization
logimodel_sk = LogisticRegression(penalty=None, max_iter = 2000) #"penalty = None" makes it compute the MLE just like statsmodels.api
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
#These estimates are essentially the same as the ones given by statsmodels.api
Intercept: [0.84388436]
Coefficients: [[-1.45179926 -0.09456605  0.20715796 -0.04402401  0.20796276 -0.00317439]]
It is important to assess the prediction accuracy of regression models. One simple way of doing this is to split the existing dataset into two parts: a training dataset and a test dataset. One then fits the model on the training data and uses the fitted model to obtain predictions on the test dataset. These predictions can be compared to the actual response values from the test data, and the discrepancy gives some idea of the prediction accuracy of the model. Below we evaluate the prediction accuracy of the fitted logistic regression model in this way by (randomly) splitting the data into a training dataset of size 503 and a test dataset of size 250. In practice, one would vary the sizes of the training and test datasets (and also do multiple random splits) to get a more robust assessment of accuracy; a sketch of repeated random splits is given after the single-split evaluation below.
The scikit-learn library has in-built functions for randomly splitting the data into training and test datasets.
#Split the original data into training and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
#Let us fit logistic regression on the training dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
Intercept: [0.8104027]
Coefficients: [[-1.42578681 -0.08662768  0.19075837 -0.05340385  0.22425066 -0.00391879]]
We now use this fitted logistic regression model to obtain predictions on the test dataset. Note that for a set of new covariate values $x_1, \dots, x_m$, the fitted logistic regression model gives the prediction: \begin{align*} \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)} \end{align*} for the probability that $Y = 1$. This is a number that can be anywhere between 0 and 1. If we want a binary prediction for the response, we need to place a threshold on the predicted probability. The most common threshold is 0.5 (i.e., if the fitted probability is greater than or equal to 0.5, the predicted class label is 1; otherwise, it is 0). One can obtain these binary predictions using the '.predict' method. Predicted probabilities can be obtained via the '.predict_proba' method.
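To make the thresholding explicit, here is a minimal sketch (using the 'logimodel_train' object fitted above) that computes the binary predictions directly from the predicted probabilities; it should agree with the built-in '.predict' used below.
#Manual thresholding of the predicted probabilities at 0.5 (sketch)
probs = logimodel_train.predict_proba(X_test)[:, 1]  #estimated P(Y = 1) for each test observation
Y_pred_manual = (probs >= 0.5).astype(int)           #binary predictions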
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
[1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0] Accuracy of Logistic Regression on test set: 0.728
Accuracy here is the proportion of correctly classified observations in the test dataset. One can also create the confusion matrix (or contingency table) comparing the predictions with the actual response values on the test dataset.
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Predicted    0    1  All
Actual
0           64   49  113
1           19  118  137
All         83  167  250
This table gives a further breakdown of the discrepancies between the predictions and the actual test response values. For example, 19 observations for which $Y = 1$ in the test dataset were predicted as 0, and 49 observations for which $Y = 0$ in the test dataset were predicted as 1. Also, logistic regression predicts more ones (167) than are actually present (137) in the test responses.
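To make the earlier point about multiple random splits concrete, here is a small sketch that repeats the split/fit/evaluate cycle for several seeds and reports the spread of the test accuracies (the seeds and the number of repetitions are arbitrary).
#Repeating the random train/test split several times (sketch)
accs = []
for seed in range(10):
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, train_size=503, test_size=250, random_state=seed)
    model = LogisticRegression(penalty=None, max_iter=1000).fit(Xtr, Ytr)
    accs.append(np.mean(Yte == model.predict(Xte)))
print("Mean accuracy:", np.mean(accs), "Std:", np.std(accs))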
Given a positive integer $k$, the $k$-nearest neighbour classifier classifies each point based on the labels of the $k$ nearest points in the training set. For example, suppose we want to predict the response for a test observation with covariate values $x_1, \dots, x_m$. The $k$-nearest neighbour method identifies the $k$ observations in the training set whose covariate values are closest to $x_1, \dots, x_m$. It then looks at the observed response values for these $k$ observations in the training dataset. Suppose $k = 8$ and the response values of the $k$ closest training points are $1, 0, 1, 1, 1, 0, 0, 1$. Then the predicted probability for the response of the test observation equals $5/8 = 0.625$. If we want a binary prediction, it will be 1 as the predicted probability is more than 0.5.
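To make the mechanics concrete, here is a minimal numpy sketch of this prediction rule for a single test point, using Euclidean distance (the function name is illustrative).
#k-NN prediction for one test point, written out by hand (sketch)
def knn_predict_one(x_new, X_tr, Y_tr, k=8):
    X_tr, Y_tr = np.asarray(X_tr), np.asarray(Y_tr)
    dists = np.sqrt(((X_tr - x_new) ** 2).sum(axis=1))  #Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                     #indices of the k closest training points
    prob = Y_tr[nearest].mean()                         #predicted probability of Y = 1
    return prob, int(prob >= 0.5)                       #probability and binary prediction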
scikit-learn has an in-built function for $k$-NN classification which works as follows.
from sklearn.neighbors import KNeighborsClassifier
#Using k-NN classification:
k_value = 8
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.744
Predicted    0    1  All
Actual
0           78   35  113
1           29  108  137
All        107  143  250
Generally, $k$-nearest neighbour classification is a simple method that often gives decent accuracy (this accuracy can be used as a benchmark for more elaborate classification methods). The value of $k$ can be tuned by standard techniques such as cross-validation (a small sketch is given after the next paragraph). However, one should keep the following in mind while using $k$-NN:
Unlike logistic regression, where there is a clearly defined likelihood involving the parameters $\beta_0, \beta_1, \dots, \beta_m$, there are no parameters or likelihood in k-NN. This is one reason why k-NN is treated as a nonparametric method.
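One simple way to tune $k$ is cross-validation on the training data. Here is a sketch using scikit-learn's cross_val_score; the grid of $k$ values is arbitrary.
#Tuning k by 5-fold cross-validation on the training data (sketch)
from sklearn.model_selection import cross_val_score
for k in [2, 4, 8, 16, 32]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, Y_train, cv=5)
    print(k, scores.mean())  #average cross-validated accuracy for this k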
A decision tree is a method for classification (and regression) that uses a tree-like structure to decide what value to predict for a point. The tree is constructed using the training data, and the constructed tree is then used to produce predictions for test observations. The tree starts with the root node at the top, which contains all the training observations. Each node is then recursively split into smaller nodes until a stopping criterion is reached. The goal is to obtain nodes that are as pure as possible. Here pure means that the response values of the observations in the node are homogeneous, in the sense that all responses are equal to 0 or all responses are equal to 1. Here is an overview of the tree construction process:
1. Choosing a split: Suppose a node contains $n_0$ observations with response 0 and $n_1$ observations with response 1. Its Gini impurity is \begin{align*} 1 - \left(\frac{n_0}{n_0 + n_1}\right)^2 - \left(\frac{n_1}{n_0 + n_1}\right)^2 = \frac{2 n_0 n_1}{(n_0 + n_1)^2}. \end{align*} If the node is maximally pure, i.e., either $n_0 = 0$ or $n_1 = 0$, then the Gini impurity takes its lowest possible value of 0. On the other hand, if the node is maximally impure, i.e., $n_0 = n_1$, then the Gini impurity takes its highest possible value of 0.5. For each covariate and each candidate split point, the weighted average Gini impurity of the two resulting child nodes is computed; the covariate and split point resulting in the lowest Gini impurity are selected for splitting the node (a small code sketch of this computation is given after this list).
2. Creating a node: If the node is pure, or if other stopping conditions are met (e.g., maximum depth), mark the node as a leaf and assign it the most common response. Otherwise, use the best covariate and split point (as in the previous step) to split the node into two child nodes.
3. Stopping conditions: The default in scikit-learn is to grow the tree until all leaf nodes are pure. However, deep decision trees are hard to interpret and prone to overfitting. Several conditions can be used to stop the tree from growing further; one simple way is to specify the maximum depth of the tree.
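Here is a small code sketch of the Gini computation described in step 1 (the function names are illustrative; scikit-learn performs this search internally).
#Gini impurity of a node with n0 zero-responses and n1 one-responses (sketch)
def gini(n0, n1):
    p = n1 / (n0 + n1)       #proportion of ones in the node
    return 2 * p * (1 - p)   #equals 0 for a pure node and 0.5 when n0 == n1

#Weighted average Gini impurity of a candidate split into left and right child nodes
def split_impurity(n0_left, n1_left, n0_right, n1_right):
    n_left, n_right = n0_left + n1_left, n0_right + n1_right
    n = n_left + n_right
    return (n_left / n) * gini(n0_left, n1_left) + (n_right / n) * gini(n0_right, n1_right)

print(gini(10, 0), gini(5, 5))  #prints 0.0 and 0.5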
After the tree is constructed on the training data, to predict the response value for a new observation, we start at the root node, and continue down until we reach a leaf node. The prediction is then given by the most common response in the leaf node.
Here's how it works in scikit-learn:
from sklearn.tree import DecisionTreeClassifier
# Initialize and train the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Decision Tree Classification on test set: 0.668
Predicted    0    1  All
Actual
0           67   46  113
1           37  100  137
All        104  146  250
The accuracy of this decision tree is comparable to but smaller than that of logistic regression and k-NN. We can visualize the fitted tree as follows.
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=X.columns,class_names = ('0','1'), filled=True, rounded=True)
plt.show()
This tree is far too large to interpret meaningfully. It is also likely overfitting, which leads to poor test accuracy. We can control the complexity of the tree by limiting its maximum depth.
#This tree is way too big to allow any meaningful interpretation
#Here are a few ways to create smaller trees:
dt_small = DecisionTreeClassifier(max_depth = 4, random_state = 1)
dt_small.fit(X_train, Y_train)
Y_pred_dt_small = dt_small.predict(X_test)
# Calculate the accuracy
accuracy_dt_small = np.mean(Y_test == Y_pred_dt_small)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt_small}")
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt_small, feature_names=X.columns, class_names = ('0','1'), filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.72
The accuracy improves to 0.72. One can also look at the tree (fitted on the training data) to see which features appear in the splits. The root node is first split according to "exper <= 5.5": the left subtree corresponds to "exper <= 5.5" and the right subtree to "exper > 5.5". The color coding is orange for nodes with more zero responses and blue for nodes with more one responses. Note that the left subtree is mostly orange while the right subtree is mostly blue. Also, purer nodes (i.e., those with lower Gini values) get darker colors and less pure nodes get lighter colors.
The Decision Tree method for classification does not specify a model in terms of parameters (similar to the $k$-NN method), so it is also treated as a nonparametric method.
The random forest algorithm constructs a whole collection (usually over 100) of similar but distinct decision trees, and obtains predictions by averaging the predictions of the individual trees. The individual decision trees are constructed as follows:
1. Each tree is fit not to the original training data but to a bootstrap sample of it, i.e., a sample of the same size drawn with replacement from the training observations.
2. When splitting a node, only a randomly chosen subset of the covariates is considered as candidates for the split (instead of all covariates).
Because of bootstrapping in the first step above and combining (aggregating) the results of individual trees at the end, the random forest algorithm is an example of a method involving Bootstrap AGGregation, or bagging.
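As a rough illustration of bagging (not the exact scikit-learn implementation, since it omits the covariate subsampling in the second step), one can fit decision trees on bootstrap resamples of the training data and average their predicted probabilities. A sketch on the Mroz training/test split from above, with an arbitrary number of trees B:
#Hand-rolled bagging of decision trees (sketch)
from sklearn.tree import DecisionTreeClassifier
rng = np.random.default_rng(0)
B = 100
probs = np.zeros(len(X_test))
for b in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  #bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(random_state=b)
    tree.fit(X_train.iloc[idx], Y_train.iloc[idx])
    probs += tree.predict_proba(X_test)[:, 1] / B            #average the trees' predicted probabilities
Y_pred_bag = (probs >= 0.5).astype(int)
print("Accuracy of hand-rolled bagging on test set:", np.mean(Y_test == Y_pred_bag))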
Typically random forests have good accuracy on a wide variety of datasets.
Similar to $k$-NN and Decision Trees, Random Forests are also considered a nonparametric technique unlike logistic regression.
#Random Forest:
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the Random Forest classifier
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)
# Predict on the test set
Y_pred_rf = rf.predict(X_test)
# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.728
Predicted    0    1  All
Actual
0           71   42  113
1           26  111  137
All         97  153  250
In the above code, the 'max_features = None' option in RandomForestClassifier tells it not to subselect covariates when splitting nodes (in other words, the second step above is not implemented; only bootstrap samples are used to create distinct but similar decision trees). If this is changed to 'max_features = 2', then 2 covariates are randomly selected as candidates for each split.
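For comparison, here is a quick sketch with 2 randomly chosen candidate covariates per split (whether this helps on this particular dataset is not guaranteed).
#Random forest with feature subsampling at each split (sketch)
rf2 = RandomForestClassifier(random_state=1, max_features=2)
rf2.fit(X_train, Y_train)
print("Accuracy with max_features=2:", np.mean(Y_test == rf2.predict(X_test)))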
More insight into the workings of these four classification methods can be obtained by looking at simple simulated toy datasets. In these datasets, we know the true relationship between the response and the covariates, which allows us to better evaluate the performance of each method.
Our toy datasets will have $m = 2$ covariates $x_1$ and $x_2$. We shall consider 500 training datapoints and 250 test datapoints. In the first example, we will have a deterministic relationship between $Y$ and $(X_1, X_2)$ given by the following. $Y$ equals 1 if $X_1$ and $X_2$ have the same sign, and 0 if $X_1$ and $X_2$ have opposite signs. Mathematically, this can be represented as \begin{align*} Y = I\{X_1 X_2 > 0\}. \end{align*} Such a dataset can be simulated in the following way.
#EXAMPLE ONE:
n = 750
n_train = 500
n_test = 250
x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()
Y = (x1 * x2 > 0).astype(np.int64)
#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)
x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]
Because there are only two covariates, the covariate data can be looked at in a scatter plot. The binary response can then be indicated on this plot with a color scheme. This is done in the following code.
#In this case where there are only two features, the data can be plotted
#color of the points denotes Y
def draw_results(x1, x2, color, plot_title=''):
    plt.figure()
    plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7)
    plt.colorbar()
    plt.title(plot_title)
    plt.axis('equal')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.tight_layout()
draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')
We shall apply each of our four methods on this dataset and compare their prediction accuracy. Let us start with logistic regression.
#Method One: Logistic regression:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.09310359] Coefficients: [[-0.21360691 0.03125306]] [1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1] Accuracy of Logistic Regression on test set: 0.54
Logistic regression led to an accuracy of only 0.54 (this number might change for you because of randomness in the observations). Observe that if we randomly guess the response for each observation as 0 or 1, we would expect an accuracy close to 0.5. In this sense, accuracy of 0.54 is quite poor (it basically means that logistic regression in this case is no better than random guessing). We can visualize the predicted probabilities as well as predicted labels in the following way.
# Visualize the results
pred_probs_logistic = logimodel_train.predict_proba(X_test)[:,1]
draw_results(
x1_test, x2_test, color=pred_probs_logistic,
plot_title="Predicted probability of y=1 (logistic)"
)
draw_results(
x1_test, x2_test, color=Y_pred,
plot_title="Logistic model prediction"
)
Logistic regression works poorly for this dataset because the function: \begin{align*} (x_1, x_2) \mapsto \frac{\exp \left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 \right)}{1 + \exp \left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 \right)} \end{align*} is a poor approximation of the function \begin{align*} (x_1, x_2) \mapsto I\{x_1 x_2 > 0\} \end{align*} for every choice of $\beta_0, \beta_1, \beta_2$. A simple way of fixing this problem is to add a third feature $x_1 x_2$. This is an example of feature engineering. Logistic regression with this new feature will have substantially larger accuracy, as can be checked using the code below.
#Logistic Regression will work here if we do feature engineering (include x1 * x2 as a third feature):
# Create a new feature: x1 * x2
#the following function returns an array like X but with a new feature that is x1 * x2
def add_mult_feature(X):
    """Returns an array like X, but with a new feature that's X1 * X2"""
    new_feature = X[:, 0] * X[:, 1]
    return np.hstack([X, new_feature[:, None]])
X_train_feat = add_mult_feature(X_train)
X_test_feat = add_mult_feature(X_test)
logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 2000)
logimodel_train_feat.fit(X_train_feat, Y_train)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
print(Y_pred_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")
pred_probs_logistic_feat = logimodel_train_feat.predict_proba(X_test_feat)[:,1]
draw_results(
x1_test, x2_test, color=pred_probs_logistic_feat,
plot_title="Predicted probability of y=1 (logistic)"
)
draw_results(
x1_test, x2_test, color=Y_pred_feat,
plot_title="Logistic model prediction"
)
Intercept: [3.47948303] Coefficients: [[ 54.09568077 14.39074806 5570.18753813]] [1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0] Accuracy of Logistic Regression (with new feature x1*x2) on test set: 0.996
Observe that the fitted coefficient for the new feature $x_1 x_2$ is very large, which leads to predicted probabilities that are essentially binary: close to 1 when $x_1 x_2 > 0$ and close to 0 when $x_1 x_2 < 0$.
One insight from this is that proper feature engineering is crucial for logistic regression, and more generally for any linear model. Unfortunately, good features are not easy to construct in most real-world regression or classification tasks.
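One generic way to generate such product features, instead of adding them by hand, is scikit-learn's PolynomialFeatures transformer. Here is a sketch with interaction_only=True, so that only the product $x_1 x_2$ is added to $x_1$ and $x_2$.
#Generating the interaction feature automatically (sketch)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)   #columns: x1, x2, x1*x2
X_test_poly = poly.transform(X_test)
logi_poly = LogisticRegression(penalty=None, max_iter=2000).fit(X_train_poly, Y_train)
print("Accuracy with automatic interaction feature:", np.mean(Y_test == logi_poly.predict(X_test_poly)))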
Next we apply $k$-NN to this dataset.
#Method Two: k-NN
#Using k-NN classification:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.972
Predicted    0    1  All
Actual
0          123    5  128
1            2  120  122
All        125  125  250
$k$-NN has very good accuracy here, and we did not have to do any feature engineering. We can look at the predicted probabilities and labels to get more insight.
#Visualizing the results:
from sklearn.neighbors import KNeighborsClassifier
probs_knn = knn.predict_proba(X_test)[:, 1]
y_hat_knn = (probs_knn > 0.5).astype(np.int64)
draw_results(
x1_test, x2_test, color=probs_knn,
plot_title="Predicted probability of y=1 (k-NN)"
)
draw_results(
x1_test, x2_test, color=y_hat_knn,
plot_title="k-NN Model prediction"
)
From the predicted probability plot, it is clear that for points near the boundary, the predicted probabilities are in the middle range, i.e., closer to 0.5, compared to points away from the boundary. So the method is less certain about these observations, but the overall accuracy is quite good (though not as high as that of logistic regression with the feature $x_1 x_2$).
Next we apply the Decision Tree algorithm:
#Model Three: Decision Tree
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Decision Tree Classification on test set: 0.992
Predicted    0    1  All
Actual
0          126    2  128
1            0  122  122
All        126  124  250
The accuracy is very high: only 2 test observations out of 250 are misclassified. We can get more insight by looking at the actual fitted tree.
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'), filled=True, rounded=True)
plt.show()
The fitted tree is not too deep (even though we did not impose any maximum depth limit). Note, however, that there are four clearly pure regions in the actual data, which can be obtained, for example, by splitting according to $x_1 \leq 0$ and then according to $x_2 \leq 0$ within each resulting node. The decision tree method, however, gave us 8 pure leaf nodes. This extra complexity arises from the greedy way in which the tree is grown.
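To quantify this extra complexity, one can inspect the depth and the number of leaves of the fitted tree; a quick sketch using attributes of the fitted scikit-learn tree object:
#Inspecting the size of the greedily grown tree (sketch)
print("Tree depth:", dt.get_depth())
print("Number of leaves:", dt.get_n_leaves())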
Finally let us apply Random Forest.
#Model Four: Random Forest
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)
# Predict on the test set
Y_pred_rf = rf.predict(X_test)
# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 1.0
Predicted    0    1  All
Actual
0          128    0  128
1            0  122  122
All        128  122  250
The random forest has even better accuracy than the single decision tree. This is typical.
In the next toy dataset example, we take the previous dataset and intentionally corrupt roughly 10% of the training observations by flipping their labels. The test dataset is left unaltered.
#EXAMPLE TWO: Let us add some training noise to the previous dataset:
Y_train_noisy = Y_train.copy()
pts_to_flip = np.random.random(n_train) < 0.1 #each training point is selected independently with probability 0.1 (roughly 10% of the points)
Y_train_noisy[pts_to_flip] = 1 - Y_train_noisy[pts_to_flip] #we are corrupting the responses of these datapoints
draw_results(x1_train, x2_train, color=Y_train_noisy, plot_title='Training data with noise')
#Performance of Logistic Regression on this corrupted dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train_noisy)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.19758252] Coefficients: [[-0.03969418 -0.0795844 ]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Accuracy of Logistic Regression on test set: 0.488
Logistic regression again has poor accuracy (which is unsurprising given that this problem with corrupted labels is even harder than the previous one). Next we try logistic regression with the added feature $x_1 x_2$.
#Logistic Regression with the x1*x2 feature
logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train_feat.fit(X_train_feat, Y_train_noisy)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
print(Y_pred_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")
Intercept: [0.25340968] Coefficients: [[ 0.19991346 -0.02759687 7.38911646]] [1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0] Accuracy of Logistic Regression (with new feature x1*x2) on test set: 0.92
The fitted coefficient for the new feature is now not as high as before. The accuracy is also somewhat lower than before (previously it was very close to 100% and now it came down to 92%).
#Performance of k-NN on this corrupted data:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train_noisy)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.96
Predicted    0    1  All
Actual
0          119    9  128
1            1  121  122
All        120  130  250
For $k$-NN, the accuracy is still pretty good (although it deteriorated a little bit).
#Performance of Decision Tree on corrupted data:
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train_noisy)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'), filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.908
Predicted    0    1  All
Actual
0          113   15  128
1            8  114  122
All        121  129  250
The accuracy of the decision tree is decent but much lower than before. More seriously, the fitted tree itself is now far more complicated. In this sense, a few corrupted points completely changed the nature of the fitted tree. This points to the instability of individual decision trees, which is often cited as the motivation for using random forests.
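As a rough way to see this instability, one can refit trees on a few bootstrap resamples of the noisy training data and check how often their test predictions disagree (a sketch; the number of resamples and the seed are arbitrary).
#Instability check: trees fitted on bootstrap resamples of the noisy data (sketch)
rng = np.random.default_rng(0)
preds = []
for b in range(5):
    idx = rng.integers(0, n_train, size=n_train)  #bootstrap resample of the noisy training data
    t = DecisionTreeClassifier(random_state=b).fit(X_train[idx], Y_train_noisy[idx])
    preds.append(t.predict(X_test))
preds = np.array(preds)
print("Fraction of test points where the 5 trees disagree:",
      np.mean(preds.min(axis=0) != preds.max(axis=0)))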
#Performance of Random Forest:
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train_noisy)
# Predict on the test set
Y_pred_rf = rf.predict(X_test)
# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.952
Predicted    0    1  All
Actual
0          118   10  128
1            2  120  122
All        120  130  250
The random forest still maintains high accuracy, despite a slight drop compared to the no-corruption case.
In the final example, we change the dependence of $Y$ on $X_1, X_2$ to: \begin{align*} Y = I\{(X_1 + X_2)(X_1 - X_2) > 0\} = I\{|X_1| > |X_2|\}. \end{align*} Running the code below, one can check that the story is similar to the first example: logistic regression (without feature engineering) is again essentially no better than random guessing, because no linear function of $x_1, x_2$ captures this decision boundary, while k-NN, decision trees and random forests all achieve high accuracy, with the random forest performing best.
You are encouraged to try out these methods on more toy (and real) datasets.
#EXAMPLE THREE:
n_train = 500
n_test = 250
n = n_train + n_test
x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()
Y = ((x1+x2) * (x1 - x2) > 0).astype(np.int64)
#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)
x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]
draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')
#Logistic Regression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.01178031] Coefficients: [[-0.12368168 -0.07390931]] [0 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1] Accuracy of Logistic Regression on test set: 0.504
#Performance of k-NN:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.932
Predicted    0    1  All
Actual
0          104    4  108
1           13  129  142
All        117  133  250
#Performance of Decision Tree:
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)
# Predict on the test set
Y_pred_dt = dt.predict(X_test)
# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'), filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.912
Predicted    0    1  All
Actual
0           97   11  108
1           11  131  142
All        108  142  250
#Performance of Random Forest:
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)
# Predict on the test set
Y_pred_rf = rf.predict(X_test)
# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")
# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.968
Predicted    0    1  All
Actual
0          105    3  108
1            5  137  142
All        110  140  250