Nonparametric methods (for Classification)¶

In this lecture, we shall look at some popular methods for classification (classification refers to the special case of the regression problem where the response variable $Y$ is binary). The specific methods that we shall consider are: (a) Nearest Neighbour classification, (b) Decision Trees, and (c) Random Forests. These methods can be compared to logistic regression which belongs to the family of GLMs that we have seen in the past few lectures. We will apply all these methods (starting with logistic regression) to the Mroz dataset that was analyzed in the last two lectures. This dataset has data on a bunch of economic variables related to labor force participation for married women in the year 1975.

Logistic Regression¶

We shall fit models with 'inlf' as the response variable and 'kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq' as the explanatory variables (also known as features or covariates). The response variable 'inlf' is binary, equaling 1 if the individual worked (i.e., they were 'in the labor force') in the year 1975 and 0 otherwise. The logistic regression model assumes that \begin{align*} Y \mid X_1 = x_1, \dots, X_m = x_m \sim \text{Bernoulli}(\mu(x_1, \dots, x_m)) ~~ \text{ where } ~~ \log \frac{\mu(x_1, \dots, x_m)}{1 - \mu(x_1, \dots, x_m)} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \end{align*} In words, the log-odds of the event $Y = 1$ is modeled by the linear function $\beta_0 + \beta_1 X_1 + \dots + \beta_m X_m$. The interpretation of $\beta_j$ (for $1 \leq j \leq m$) is the change in log-odds for a unit change in $X_j$ when all other variables are held fixed. The logistic regression model (and more generally every GLM) specifies the relationship between $Y$ and $X_1, \dots, X_m$ in terms of the parameters $\beta_0, \dots, \beta_m$ (which will have to be learned from data) and is thus an example of a parametric model.
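
To make the log-odds specification concrete, here is a minimal sketch (with made-up coefficient values, not estimates from the Mroz data) showing how a linear predictor is converted into a probability for $Y = 1$.

import numpy as np

beta = np.array([0.8, -1.5, -0.1])   # made-up values for beta_0, beta_1, beta_2
x = np.array([1.0, 1.0, 30.0])       # 1 for the intercept, then x_1, x_2
log_odds = beta @ x                                # beta_0 + beta_1 x_1 + beta_2 x_2
mu = np.exp(log_odds) / (1 + np.exp(log_odds))     # P(Y = 1 | X_1 = x_1, X_2 = x_2)
print(log_odds, mu)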

We can fit the logistic regression model easily using statsmodels.api as follows.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
In [4]:
#Loading the mroz dataset: 
mroz = pd.read_csv("MROZ.csv")
print(mroz.shape)
print(mroz.head(12))
print(mroz['inlf'].value_counts()) #this is the response variable which has 428 ones and 325 zeros
(753, 22)
    inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0      1   1610        1        0   32    12  3.3540     2.65    2708      34   
1      1   1656        0        2   30    12  1.3889     2.65    2310      30   
2      1   1980        1        3   35    12  4.5455     4.04    3072      40   
3      1    456        0        3   34    12  1.0965     3.25    1920      53   
4      1   1568        1        2   31    14  4.5918     3.60    2000      32   
5      1   2032        0        0   54    12  4.7421     4.70    1040      57   
6      1   1440        0        2   37    16  8.3333     5.95    2670      37   
7      1   1020        0        0   54    12  7.8431     9.98    4120      53   
8      1   1458        0        2   48    12  2.1262     0.00    1995      52   
9      1   1600        0        2   39    12  4.6875     4.15    2100      43   
10     1   1969        0        1   33    12  4.0630     4.30    2450      34   
11     1   1960        0        1   42    11  4.5918     4.58    2375      47   

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060   
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981   
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910   
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996   
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058   
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054   
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048   
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038   
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000   
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000   
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953   
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071   

       lwage  expersq  
0   1.210154      196  
1   0.328512       25  
2   1.514138      225  
3   0.092123       36  
4   1.524272       49  
5   1.556480     1089  
6   2.120260      121  
7   2.059634     1225  
8   0.754336      576  
9   1.544899      441  
10  1.401922      225  
11  1.524272      196  

[12 rows x 22 columns]
inlf
1    428
0    325
Name: count, dtype: int64
In [5]:
#Logistic Regression
Y = mroz['inlf'] #this is the binary response variable
X = mroz[['kidslt6', 'age', 'educ', 
        'huswage', 'exper', 'expersq']] #our covariates kidslt6, age, educ, huswage, exper, expersq
X = sm.add_constant(X)
logimodel = sm.GLM(Y, X, family=sm.families.Binomial()).fit()
print(logimodel.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                   inlf   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -403.13
Date:                Wed, 11 Oct 2023   Deviance:                       806.25
Time:                        13:51:42   Pearson chi2:                     728.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2568
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8430      0.757      1.114      0.265      -0.640       2.326
kidslt6       -1.4516      0.201     -7.232      0.000      -1.845      -1.058
age           -0.0945      0.014     -6.948      0.000      -0.121      -0.068
educ           0.2071      0.042      4.886      0.000       0.124       0.290
huswage       -0.0440      0.021     -2.066      0.039      -0.086      -0.002
exper          0.2079      0.032      6.540      0.000       0.146       0.270
expersq       -0.0032      0.001     -3.137      0.002      -0.005      -0.001
==============================================================================

What is the interpretation of the coefficient -1.4516 of the 'kidslt6' variable? It means that each additional small kid reduces the log-odds of working by 1.4516 (for fixed values of all the other variables in the model). Equivalently, the odds of working are multiplied by a factor of $\exp(-1.4516) \approx 0.234$ with each additional small kid, i.e., the odds drop by roughly 77% (for fixed values of all other variables in the model).
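
A quick sketch of this calculation, assuming the fitted statsmodels object logimodel from the cell above: exponentiating the fitted coefficients gives the multiplicative effect of each covariate on the odds.

odds_ratios = np.exp(logimodel.params)   # multiplicative change in the odds per unit change in each covariate
print(odds_ratios)
# For 'kidslt6', exp(-1.4516) is about 0.234: each additional small kid multiplies
# the odds of working by roughly 0.234 (a ~77% reduction), holding the other covariates fixed.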

We shall now fit the same logistic regression model using the scikit-learn Python library. This is because the other methods that we shall study today (k-NN, decision trees and random forests) cannot be fit via statsmodels.api. On the other hand, unlike statsmodels.api, scikit-learn does not provide functionality for every GLM (e.g., negative binomial regression).

In [6]:
#Loading the scikit-learn library for logistic regression:
from sklearn.linear_model import LogisticRegression
In [7]:
#Let us re-fit this logistic regression model using scikit-learn
Y = mroz['inlf']  # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
logimodel_sk = LogisticRegression(max_iter = 1000)
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
Intercept: [0.78500383]
Coefficients: [[-1.39555397 -0.09272429  0.20477548 -0.04376797  0.2071669  -0.0031617 ]]

The 'Intercept' above refers to the estimate for $\beta_0$ and 'Coefficients' refer to the estimates for $\beta_1, \dots, \beta_m$. Observe that these estimates are slightly different from those given by statsmodels.api. This is because scikit-learn does not compute the MLE by default. Instead, the default setting adds an $L_2$ penalty for regularization to $(-2) \times \text{log-likelihood}$ before minimizing the resulting criterion to estimate $\beta_0, \dots, \beta_m$. Such regularization is helpful (both for numerical computation and statistical purposes) especially in the case where the number of covariates $m$ is large. We shall see the usefulness of regularization in a future lecture.
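
The strength of this default $L_2$ penalty is controlled by the parameter C (the inverse of the penalty weight; smaller C means stronger shrinkage of the coefficients towards zero). A rough sketch of its effect, using the X and Y defined above (exact estimates will depend on the data and solver):

for C in [0.01, 1.0, 100.0]:
    m = LogisticRegression(C=C, max_iter=1000)   # penalty='l2' is the default
    m.fit(X, Y)
    print(C, m.coef_.round(3))                   # large C approaches the unpenalized MLE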

If we do not want regularization and want to simply compute the MLE, we can specify this via the "penalty = None" option as shown below.

In [8]:
#With no regularization
logimodel_sk = LogisticRegression(penalty=None, max_iter = 2000) #"penalty = None" makes it compute the MLE just like statsmodels.api
logimodel_sk.fit(X, Y)
print("Intercept:", logimodel_sk.intercept_)
print("Coefficients:", logimodel_sk.coef_)
#These estimates are essentially the same as the ones given by statsmodels.api
Intercept: [0.84388436]
Coefficients: [[-1.45179926 -0.09456605  0.20715796 -0.04402401  0.20796276 -0.00317439]]

Accuracy Assessment for Logistic Regression¶

It is important to assess the prediction accuracy of regression models. One simple way of doing this is to split the existing dataset into two parts: a training dataset and a test dataset. One then fits the model on the training data and uses the fitted model to obtain predictions on the test dataset. These predictions can be compared to the actual response values in the test data, and the discrepancy gives some idea of the prediction accuracy of the model. Below we evaluate the prediction accuracy of the fitted logistic regression model in this way by (randomly) splitting the data into a training dataset of size 503 and a test dataset of size 250. In practice, one would vary the sizes of the training and test datasets (and also average over multiple random splits) to get a more robust assessment of accuracy.
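
As a rough sketch of the 'multiple random splits' idea (using scikit-learn's train_test_split, which is demonstrated in the next few cells; the exact numbers will vary with the seeds), one could average test accuracies over several random splits:

from sklearn.model_selection import train_test_split

accs = []
for seed in range(20):
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=503, test_size=250, random_state=seed)
    m = LogisticRegression(penalty=None, max_iter=1000).fit(X_tr, Y_tr)
    accs.append(np.mean(Y_te == m.predict(X_te)))
print(np.mean(accs), np.std(accs))   # average test accuracy and its variability over splits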

The scikit-learn library has in-built functions for randomly splitting the data into training and test datasets.

In [9]:
#Split the original data into training and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
In [10]:
#Let us fit logistic regression on the training dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
Intercept: [0.8104027]
Coefficients: [[-1.42578681 -0.08662768  0.19075837 -0.05340385  0.22425066 -0.00391879]]

We now use this fitted logistic regression model to obtain predictions on the test dataset. Note that for a set of new covariate values $x_1, \dots, x_m$, the fitted logistic regression model gives the prediction: \begin{align*} \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)} \end{align*} for the probability that $Y = 1$. This is a number that can be anywhere between 0 and 1. If we want a binary prediction for the response, we need to place a threshold on the predicted probability. The most common threshold is 0.5 (i.e., if the fitted probability is greater than or equal to 0.5, the predicted class label is 1; otherwise, it is 0). One can obtain these binary predictions using the '.predict' method. Predicted probabilities can be obtained using the '.predict_proba' method.
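
A brief sketch of the connection between the two (using logimodel_train and X_test from above): thresholding the output of '.predict_proba' at 0.5 reproduces the labels returned by '.predict', and other thresholds can be substituted if false positives and false negatives have different costs.

probs = logimodel_train.predict_proba(X_test)[:, 1]   # predicted P(Y = 1) for each test observation
labels_05 = (probs >= 0.5).astype(int)                # same labels as .predict
labels_07 = (probs >= 0.7).astype(int)                # a stricter threshold predicts fewer ones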

In [11]:
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
[1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1
 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 0
 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 1 1 0 1 1 1
 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1
 1 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 1
 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1
 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0]
Accuracy of Logistic Regression on test set: 0.728

Accuracy here is the proportion of correctly classified observations in the test dataset. One can also create the confusion matrix (or contingency table) comparing the predictions with the actual response values on the test dataset.

In [12]:
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Predicted   0    1  All
Actual                 
0          64   49  113
1          19  118  137
All        83  167  250

This table gives a further breakdown of the discrepancies between predictions and actual test response values. For example, 19 observations for which $Y = 1$ in the test dataset were predicted as 0, and 49 observations for which $Y = 0$ in the test dataset were predicted as 1. Also, logistic regression predicts more ones (167) than there are actual ones (137) in the test responses.
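
Instead of pd.crosstab, the same confusion matrix (plus per-class precision and recall) can be obtained from sklearn.metrics; a short sketch using the same Y_test and Y_pred:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(Y_test, Y_pred))        # rows: actual class, columns: predicted class
print(classification_report(Y_test, Y_pred))   # precision, recall and F1-score for each class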

Nearest Neighbour Classification¶

Given a positive integer $k$, the $k$-nearest neighbour classifier classifies each point based on the labels of the $k$ nearest points in the training set. For example, suppose we want to predict the response for a test observation with covariate values $x_1, \dots, x_m$. The $k$-nearest neighbour method identifies the $k$ observations in the training set whose covariate values are closest to $x_1, \dots, x_m$. It then looks at the observed response values for these $k$ observations in the training dataset. Suppose $k = 8$ and the response values of the $k$ closest training points are $1, 0, 1, 1, 1, 0, 0, 1$. Then the predicted probability for the response of the test observation equals $5/8 = 0.625$. If we want a binary prediction, it will be 1 as the predicted probability is more than 0.5.
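
Here is a from-scratch sketch of this rule for a single test observation (using the X_train, Y_train, X_test created above and Euclidean distance); it is meant only to illustrate the mechanics that KNeighborsClassifier automates below.

k = 8
x_new = np.asarray(X_test)[0]                          # one test observation
train = np.asarray(X_train)
dists = np.sqrt(((train - x_new) ** 2).sum(axis=1))    # Euclidean distances to all training points
nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
prob_hat = np.asarray(Y_train)[nearest].mean()         # predicted probability that Y = 1
label_hat = int(prob_hat >= 0.5)                       # binary prediction
print(prob_hat, label_hat)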

scikit-learn has an in-built function for $k$-NN classification which works as follows.

In [13]:
from sklearn.neighbors import KNeighborsClassifier

#Using k-NN classification:
k_value = 8
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.744
Predicted    0    1  All
Actual                  
0           78   35  113
1           29  108  137
All        107  143  250

Generally, $k$-nearest neighbour classification is a simple method that often gives decent accuracy (this accuracy can be used as a benchmark for more elaborate classification methods). The value of $k$ can be tuned by standard techniques such as cross-validation. However, one needs to be aware of the following issues while using $k$-NN:

  1. One needs to select a suitable notion of distance between $(x_1, \dots, x_m)$ and the covariate observations in the training set. The default choice is the Euclidean distance (the square root of the sum of the squared discrepancies in each covariate). However, different covariates have different units, which makes this distance not very principled. For example, if we measure education in years and experience in months, then the Euclidean distance will probably be dominated by experience (as it will take larger values because of the specific choice of units). It is therefore generally recommended to scale the features before using k-NN (see the sketch after this list). One can scale each covariate by either (a) subtracting the minimum value and dividing by the range, or (b) subtracting the mean and dividing by the standard deviation. This scaling makes all variables unitless (so that squaring and adding across variables makes more sense).
  2. When the number of covariates is large, the volume of the covariate space becomes very large and none of the 'nearest' neighbors might be actually near to the test observation. In this case, the method cannot provide accurate predictions.
  3. All the covariates are treated equally in the calculation of the distance used to determine nearest neighbors. This means that irrelevant features are given as much weight as important features.
  4. For large datasets, the computational cost can be high because the method requires computing distances to every training example for each prediction. Also, since k-NN requires storing the entire training dataset, memory consumption can be an issue for large datasets. In contrast, for predictions via logistic regression, one only needs to store the fitted regression coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_m$.
  5. k-NN does not allow answering important questions like: what features are important for explaining the response? If we intervene and change one covariate by 5 units, what changes would we expect in the response?
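
As mentioned in point 1 above, here is a minimal sketch of feature scaling before k-NN, using a scikit-learn pipeline with StandardScaler (option (b): subtract the mean and divide by the standard deviation). Whether scaling improves accuracy depends on the dataset.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
knn_scaled.fit(X_train, Y_train)                 # the scaler is fit on the training data only
acc_scaled = np.mean(Y_test == knn_scaled.predict(X_test))
print(acc_scaled)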

Unlike logistic regression, where there is a clearly defined likelihood involving parameters $\beta_0, \beta_1, \dots, \beta_m$, there are no parameters or likelihood in k-NN. This is one reason why k-NN is treated as a nonparametric method.

Decision Trees¶

A decision tree is a method for classification (and regression) that uses a tree-like structure to decide what value to predict for a point. The decision tree is constructed using the training data, and the constructed tree is used to come up with predictions for test observations. The tree starts with the root node at the top which contains all the observations. Each node is then recursively split into smaller nodes until a stopping criterion is reached. The goal is to obtain nodes which are as pure as possible. Here pure means that the response values of observations in the node are homogeneous in the sense that all responses are equal to 0 or all responses are equal to 1. Here is an overview of the tree construction process:

  1. Selection of the splitting covariate and split point: For each covariate, potential split points are considered. The quality of a split is evaluated using a criterion. For classification, the common criterion is the Gini impurity. For a node containing observations having $n_0$ zero responses and $n_1$ one responses, the Gini impurity is defined as:
\begin{align*} \text{Gini} = 1 - \left(\frac{n_0}{n_0 + n_1} \right)^2 - \left(\frac{n_1}{n_0 + n_1} \right)^2 \end{align*}

If the node is maximally pure, i.e., either $n_0 = 0$ or $n_1 = 0$, then the Gini impurity takes its lowest possible value of 0. On the other hand, if the node is maximally impure, i.e., $n_0 = n_1$, then the Gini impurity takes its highest possible value of 0.5. The covariate and split point yielding the lowest Gini impurity (averaged over the two resulting child nodes, weighted by their sizes) are selected for splitting the node.

  2. Creating a node: If the node is pure or if other stopping conditions are met (e.g., maximum depth), mark the node as a leaf and assign it the most common response. Otherwise, use the best covariate and split point (as in the previous step) to further split the node into two subsets.

  3. Stopping conditions: The default in scikit-learn is to grow the tree until all leaf nodes are pure. However, deep decision trees are hard to interpret and prone to overfitting. Several conditions can be used to stop the tree from growing further; one simple way is to specify the maximum depth of the tree.
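
A direct translation of the Gini impurity formula above into code (a small sketch, not the internal scikit-learn implementation):

def gini_impurity(n0, n1):
    """Gini impurity of a node with n0 zero-responses and n1 one-responses."""
    p0 = n0 / (n0 + n1)
    p1 = n1 / (n0 + n1)
    return 1 - p0 ** 2 - p1 ** 2

print(gini_impurity(50, 0))    # pure node: 0.0
print(gini_impurity(25, 25))   # maximally impure node: 0.5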

After the tree is constructed on the training data, to predict the response value for a new observation, we start at the root node, and continue down until we reach a leaf node. The prediction is then given by the most common response in the leaf node.

Here's how it works in scikit-learn:

In [14]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)

# Predict on the test set
Y_pred_dt = dt.predict(X_test)

# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Decision Tree Classification on test set: 0.668
Predicted    0    1  All
Actual                  
0           67   46  113
1           37  100  137
All        104  146  250

The accuracy of this decision tree is comparable to but smaller than that of logistic regression and k-NN. We can visualize the fitted tree as follows.

In [15]:
# Plot the tree
from sklearn.tree import plot_tree

plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=X.columns,class_names = ('0','1'),  filled=True, rounded=True)
plt.show()

This tree is obviously way too big to take a meaningful look at. It is also likely overfitting, leading to poor accuracy. We can control the complexity of the tree by controlling its maximum depth.

In [18]:
#This tree is way too big to allow any meaningful interpretation
#Here are a few ways to create smaller trees:
dt_small = DecisionTreeClassifier(max_depth = 4, random_state = 1)
dt_small.fit(X_train, Y_train)

Y_pred_dt_small = dt_small.predict(X_test)
# Calculate the accuracy
accuracy_dt_small = np.mean(Y_test == Y_pred_dt_small)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt_small}")

# Plot the tree
from sklearn.tree import plot_tree

plt.figure(figsize=(20,10))
plot_tree(dt_small, feature_names=X.columns, class_names = ('0','1'),  filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.72

The accuracy now improves to 0.72. One can also look at the tree (fitted on the training data) to see which features appear in the splits. The root node is first split according to "exper <= 5.5". The left subtree corresponds to "exper <= 5.5" and the right subtree corresponds to "exper > 5.5". The color coding is orange for nodes with more zero responses and blue for nodes with more one responses. Note that the left subtree is mostly orange while the right subtree is mostly blue. Also, purer nodes (i.e., with lower Gini values) get darker colors and less pure nodes get lighter colors.

The Decision Tree method for classification does not specify a model in terms of any parameters (similar to the $k$-NN method), so it is also treated as a nonparametric method.

Random Forest¶

The random forest algorithm constructs a whole collection (usually 100 or more) of similar but distinct decision trees, and obtains predictions by aggregating the predictions of the individual trees (e.g., averaging predicted probabilities or taking a majority vote). The individual decision trees are constructed as follows:

  1. Instead of using the training data directly, draw a separate bootstrap sample (with replacement) and fit a decision tree on it. As the different bootstrap samples will be similar but distinct, this will create similar but distinct decision trees.
  2. At every split, instead of cycling over all available covariates, one randomly selects a subset of covariates and only considers splits in these covariates. This randomness also plays a role in creating similar but distinct decision trees.

Because of bootstrapping in the first step above and combining (aggregating) the results of individual trees at the end, the random forest algorithm is an example of a method involving Bootstrap AGGregation, or bagging.
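
A bare-bones sketch of the bagging idea (step 1 only, with no random feature subsetting), just to illustrate what RandomForestClassifier automates; it uses the X_train, Y_train, X_test from the Mroz split above.

from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_trees = 100
X_tr, Y_tr, X_te = np.asarray(X_train), np.asarray(Y_train), np.asarray(X_test)
votes = np.zeros(len(X_te))
for _ in range(n_trees):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))          # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier().fit(X_tr[idx], Y_tr[idx])  # fit a tree on the bootstrap sample
    votes += tree.predict(X_te)                                # accumulate each tree's 0/1 predictions
Y_pred_bag = (votes / n_trees >= 0.5).astype(int)              # majority vote across the trees
print(np.mean(np.asarray(Y_test) == Y_pred_bag))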

Typically random forests have good accuracy on a wide variety of datasets.

Similar to $k$-NN and Decision Trees, Random Forests are also considered a nonparametric technique unlike logistic regression.

In [19]:
#Random Forest:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the Random Forest classifier
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)

# Predict on the test set
Y_pred_rf = rf.predict(X_test)

# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.728
Predicted   0    1  All
Actual                 
0          71   42  113
1          26  111  137
All        97  153  250

In the above code, the 'max_features = None' option in RandomForestClassifier tells it not to subselect covariates at node splits (in other words, the second step above is not being implemented; only the bootstrap samples are used to create distinct but similar decision trees). If this is changed to 'max_features = 2', then 2 randomly selected covariates are considered at each split.
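
A quick sketch of the latter option (whether it improves accuracy on this particular dataset is not guaranteed):

rf2 = RandomForestClassifier(random_state=1, max_features=2)   # 2 of the 6 covariates considered per split
rf2.fit(X_train, Y_train)
print(np.mean(Y_test == rf2.predict(X_test)))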

Evaluation of these Classification Methods on Some Simulated Toy Datasets¶

More insight into the workings of these four classification methods can be obtained by looking at simple simulated toy datasets. In these datasets, we know the true relationship between the response and the covariates, allowing us to better evaluate the performance of each method.

Our toy datasets will have $m = 2$ covariates $x_1$ and $x_2$. We shall consider 500 training datapoints and 250 test datapoints. In the first example, we will have a deterministic relationship between $Y$ and $(X_1, X_2)$ given by the following. $Y$ equals 1 if $X_1$ and $X_2$ have the same sign, and 0 if $X_1$ and $X_2$ have opposite signs. Mathematically, this can be represented as \begin{align*} Y = I\{X_1 X_2 > 0\}. \end{align*} Such a dataset can be simulated in the following way.

In [24]:
#EXAMPLE ONE:
n = 750
n_train = 500
n_test = 250

x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()

Y = (x1 * x2 > 0).astype(np.int64)

#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)

x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]

Because there are only two covariates, the covariate data can be looked at in a scatter plot. The binary response can then be indicated on this plot with a color scheme. This is done in the following code.

In [25]:
#In this case where there are only two features, the data can be plotted 
#color of the points denotes Y
def draw_results(x1, x2, color, plot_title=''):
    plt.figure()
    plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7);
    plt.colorbar()
    plt.title(plot_title)
    plt.axis('equal')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.tight_layout()

draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')

We shall apply each of our four methods on this dataset and compare their prediction accuracy. Let us start with logistic regression.

In [26]:
#Method One: Logistic regression:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.09310359]
Coefficients: [[-0.21360691  0.03125306]]
[1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0
 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0
 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1
 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1]
Accuracy of Logistic Regression on test set: 0.54

Logistic regression led to an accuracy of only 0.54 (this number might change for you because of randomness in the observations). Observe that if we randomly guess the response for each observation as 0 or 1, we would expect an accuracy close to 0.5. In this sense, an accuracy of 0.54 is quite poor (it basically means that logistic regression in this case is no better than random guessing). We can visualize the predicted probabilities as well as the predicted labels in the following way.

In [27]:
# Visualize the results
pred_probs_logistic = logimodel_train.predict_proba(X_test)[:,1]
draw_results(
    x1_test, x2_test, color=pred_probs_logistic, 
    plot_title="Predicted probability of y=1 (logistic)"
)

draw_results(
    x1_test, x2_test, color=Y_pred, 
    plot_title="Logistic model prediction"
)

Logistic regression works poorly for this dataset because the function: \begin{align*} (x_1, x_2) \mapsto \frac{\exp \left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 \right)}{1 + \exp \left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 \right)} \end{align*} is a poor approximation of the function \begin{align*} (x_1, x_2) \mapsto I\{x_1 x_2 > 0\} \end{align*} for every choice of $\beta_0, \beta_1, \beta_2$. A simple way of fixing this problem is to add a third feature $x_1 x_2$. This is an example of feature engineering. Logistic regression with this new feature will have substantially larger accuracy, as can be checked using the code below.

In [28]:
#Logistic Regression will work here if we do feature engineering (include x1 * x2 as a third feature):
# Create a new feature: x1 * x2
#the following function returns an array like X but with a new feature that is x1 * x2
def add_mult_feature(X):
    """Returns an array like X, but with a new feature that's X1 * X2"""
    new_feature = X[:, 0] * X[:, 1]
    return np.hstack([X, new_feature[:, None]])

X_train_feat = add_mult_feature(X_train)
X_test_feat = add_mult_feature(X_test)

logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 2000)
logimodel_train_feat.fit(X_train_feat, Y_train)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
print(Y_pred_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")

pred_probs_logistic_feat = logimodel_train_feat.predict_proba(X_test_feat)[:,1]
draw_results(
    x1_test, x2_test, color=pred_probs_logistic_feat, 
    plot_title="Predicted probability of y=1 (logistic)"
)

draw_results(
    x1_test, x2_test, color=Y_pred_feat, 
    plot_title="Logistic model prediction"
)
Intercept: [3.47948303]
Coefficients: [[  54.09568077   14.39074806 5570.18753813]]
[1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0
 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0
 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0
 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1
 1 1 1 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 1
 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 0 0
 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0]
Accuracy of Logistic Regression (with new feature x1*x2) on test set: 0.996

Observe that the fitted coefficient for the new feature $x_1 x_2$ is very high which leads to essentially binary predicted probabilities: 1 when $x_1 x_2 > 0$ and 0 when $x_1 x_2 < 0$.

One insight from this is that proper feature engineering is crucial for logistic regression, and more generally for any linear model. Unfortunately, such feature engineering is not easy to do in most real-world regression or classification tasks.

Next we apply $k$-NN to this dataset.

In [29]:
#Method Two: k-NN
#Using k-NN classification:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.972
Predicted    0    1  All
Actual                  
0          123    5  128
1            2  120  122
All        125  125  250

$k$-NN has very good accuracy, and we did not have to do any feature engineering either. We can look at the predicted probabilities and labels to get more insight.

In [30]:
#Visualizing the results:
from sklearn.neighbors import KNeighborsClassifier

probs_knn = knn.predict_proba(X_test)[:, 1]
y_hat_knn = (probs_knn > 0.5).astype(np.int64)


draw_results(
    x1_test, x2_test, color=probs_knn, 
    plot_title="Predicted probability of y=1 (k-NN)"
)

draw_results(
    x1_test, x2_test, color=y_hat_knn, 
    plot_title="k-NN Model prediction"
)

From the predicted probability plot, it is clear that for points near the boundary, the predicted probabilities are in the middle range, i.e., closer to 0.5 compared to points away from the boundary. So the method is less certain about these observations, but the overall accuracy is quite good (though not as high as that of logistic regression with the feature $x_1 x_2$).

Next we apply the Decision Tree algorithm:

In [31]:
#Model Three: Decision Tree
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)

# Predict on the test set
Y_pred_dt = dt.predict(X_test)

# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Decision Tree Classification on test set: 0.992
Predicted    0    1  All
Actual                  
0          126    2  128
1            0  122  122
All        126  124  250

The accuracy is very high: only 2 test observations out of 250 are misclassified. We can get more insight by looking at the actual fitted tree.

In [32]:
# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'),  filled=True, rounded=True)
plt.show()

The fitted tree is not too deep (even though we did not impose any maximum depth limit). Note however that there are four pure regions in the actual data (the four quadrants), which we could obtain, for example, by splitting according to $x_1 \leq 0$ and then according to $x_2 \leq 0$ within each node. However, the decision tree method gave us 8 pure leaf nodes. This increase in complexity is a consequence of the greedy method of growing the tree.

Finally let us apply Random Forest.

In [33]:
#Model Four: Random Forest
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)

# Predict on the test set
Y_pred_rf = rf.predict(X_test)

# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 1.0
Predicted    0    1  All
Actual                  
0          128    0  128
1            0  122  122
All        128  122  250

The random forest has even better accuracy than the single decision tree. This is typical.

In the next toy dataset example, we take the previous dataset and intentionally corrupt roughly 10% of the training observations by flipping their labels. The test dataset is left unaltered.

In [35]:
#EXAMPLE TWO: Let us add some training noise to the previous dataset:
Y_train_noisy = Y_train.copy()

pts_to_flip = np.random.random(n_train) < 0.1 #select roughly 10% of the training points at random
Y_train_noisy[pts_to_flip] = 1 - Y_train_noisy[pts_to_flip] #flip the responses of the selected points

draw_results(x1_train, x2_train, color=Y_train_noisy, plot_title='Training data with noise')
In [36]:
#Performance of Logistic Regression on this corrupted dataset:
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train_noisy)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.19758252]
Coefficients: [[-0.03969418 -0.0795844 ]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Accuracy of Logistic Regression on test set: 0.488

Logistic regression again has poor accuracy (which is unsurprising given that this problem with corrupted data is even harder than the previous one). Next we try logistic regression with the added feature $x_1 x_2$.

In [37]:
#Logistic Regression with the x1*x2 feature
logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train_feat.fit(X_train_feat, Y_train_noisy)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
print(Y_pred_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")
Intercept: [0.25340968]
Coefficients: [[ 0.19991346 -0.02759687  7.38911646]]
[1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0
 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1
 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1
 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 0 0
 0 0 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0]
Accuracy of Logistic Regression (with new feature x1*x2) on test set: 0.92

The fitted coefficient for the new feature is now not as high as before. The accuracy is also somewhat lower than before (previously it was very close to 100% and now it has come down to 92%).

In [38]:
#Performance of k-NN on this corrupted data:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train_noisy)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.96
Predicted    0    1  All
Actual                  
0          119    9  128
1            1  121  122
All        120  130  250

For $k$-NN, the accuracy is still pretty good (although it deteriorated a little bit).

In [39]:
#Performance of Decision Tree on corrupted data:
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train_noisy)

# Predict on the test set
Y_pred_dt = dt.predict(X_test)

# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))

# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'),  filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.908
Predicted    0    1  All
Actual                  
0          113   15  128
1            8  114  122
All        121  129  250

The accuracy of the decision tree is decent but much lower than before. More seriously, the fitted tree itself is now far more complicated. In this sense, a few corrupted points completely changed the nature of the fitted tree. This points to the instability of individual decision trees. This instability is often cited as the motivation for using random forests.

In [41]:
#Performance of Random Forest:
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train_noisy)

# Predict on the test set
Y_pred_rf = rf.predict(X_test)

# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.952
Predicted    0    1  All
Actual                  
0          118   10  128
1            2  120  122
All        120  130  250

The random forest still maintains high accuracy, despite a slight drop compared to the no-corruption case.

In the final example, we change the dependence of $Y$ on $X_1, X_2$ to: \begin{align*} Y = I\{(X_1 + X_2) (X_1 - X_2) > 0\} \end{align*} Running the code below, one can check the following:

  1. Logistic regression with no engineered features performs very poorly
  2. k-NN has good accuracy
  3. Decision tree has decent accuracy but the fitted tree is quite deep and not interpretable
  4. Random forest has better accuracy compared to decision tree.

You are encouraged to try out these methods on more toy (and real) datasets.

In [42]:
#EXAMPLE THREE:
n_train = 500
n_test = 250
n = n_train + n_test

x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()

Y = ((x1+x2) * (x1 - x2) > 0).astype(np.int64)

#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)

x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]

draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')
In [43]:
#Logistic Regression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
print(Y_pred)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
Intercept: [0.01178031]
Coefficients: [[-0.12368168 -0.07390931]]
[0 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 1
 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0
 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 0 1
 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0
 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1
 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 1 1 0
 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1]
Accuracy of Logistic Regression on test set: 0.504
In [44]:
#Performance of k-NN:
k_value = 8
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = k_value)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
accuracy_knn = np.mean(Y_test == Y_pred_knn)
print(f"Accuracy of k-NN Classification on test set: {accuracy_knn}")
#Contingency table (or confusion matrix)
print(pd.crosstab(Y_test, Y_pred_knn, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of k-NN Classification on test set: 0.932
Predicted    0    1  All
Actual                  
0          104    4  108
1           13  129  142
All        117  133  250
In [45]:
#Performance of Decision Tree:
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, Y_train)

# Predict on the test set
Y_pred_dt = dt.predict(X_test)

# Calculate the accuracy
accuracy_dt = np.mean(Y_test == Y_pred_dt)
print(f"Accuracy of Decision Tree Classification on test set: {accuracy_dt}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins=True))

# Plot the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt, feature_names=['x1', 'x2'],class_names = ('0','1'),  filled=True, rounded=True)
plt.show()
Accuracy of Decision Tree Classification on test set: 0.912
Predicted    0    1  All
Actual                  
0           97   11  108
1           11  131  142
All        108  142  250
In [46]:
#Performance of Random Forest:
rf = RandomForestClassifier(random_state=1, max_features = None)
rf.fit(X_train, Y_train)

# Predict on the test set
Y_pred_rf = rf.predict(X_test)

# Calculate the accuracy
accuracy_rf = np.mean(Y_test == Y_pred_rf)
print(f"Accuracy of Random Forest Classification on test set: {accuracy_rf}")

# Create the confusion matrix (contingency table)
print(pd.crosstab(Y_test, Y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True))
Accuracy of Random Forest Classification on test set: 0.968
Predicted    0    1  All
Actual                  
0          105    3  108
1            5  137  142
All        110  140  250