This lecture will focus on neural networks. Neural networks can be seen as an extension of logistic regression. Let us start then by fitting logistic regression to the Mroz dataset from the last few lectures.
Let us first import the necessary Python libraries and load the Mroz dataset.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Loading the mroz dataset:
mroz = pd.read_csv("MROZ.csv")
Y = mroz['inlf'] # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
We split the full dataset into training and test datasets, fit the logistic regression model on the training dataset and examine its prediction accuracy on the test dataset. We shall do all this using functions from the scikit-learn library.
#We split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
#Logistic Regression and its accuracy:
from sklearn.linear_model import LogisticRegression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
Y_pred = logimodel_train.predict(X_test)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Intercept: [0.80948004]
Coefficients: [[-1.42568874 -0.08661254  0.19075556 -0.05338824  0.22426735 -0.00391921]]
Accuracy of Logistic Regression on test set: 0.728
Predicted    0    1  All
Actual
0           64   49  113
1           19  118  137
All         83  167  250
The fitted intercept and coefficients of the logistic regression model are given by 'Intercept' and 'Coefficients' above. The test accuracy is 72.8%; the agreement and disagreement between the actual test values and the predicted values is broken down further in the confusion matrix (or contingency table) above.
Let us briefly recall how the parameter estimates are obtained in logistic regression. The idea is to maximize the log-likelihood
\begin{align*} \sum_{i=1}^n \left\{ y_i \log \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} + (1-y_i) \log \left[1 - \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} \right] \right\} \end{align*}
over the parameters $\beta_0, \dots, \beta_m$. Equivalently, the goal is to minimize (note the minus sign below in $L$)
\begin{align*} L := - \sum_{i=1}^n \left(y_i \log p_i + (1-y_i) \log (1 - p_i) \right) ~~~ \text{ where } p_i = \sigma \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} \right), \end{align*}
where $\sigma(\cdot)$ is the sigmoid function defined by
\begin{align*} \sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}. \end{align*}
The algorithm used to minimize $L$ is Newton's method, which is the iterative scheme
\begin{align*} \beta^{(t+1)} = \beta^{(t)} - \left(H L(\beta^{(t)}) \right)^{-1} \nabla L(\beta^{(t)}) \end{align*}
where $\nabla L$ and $H L$ are the gradient and Hessian of $L$ respectively, and $\beta^{(t)}$ denotes the $t^{\text{th}}$ iterate of the full parameter vector $(\beta_0, \dots, \beta_m)$ (we use $t$ for the iteration index to avoid confusion with the number of explanatory variables $m$). The gradient and Hessian can be calculated in closed form by differentiating the expression for $L$. Newton's method converges quite quickly for logistic regression.
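To make this concrete, here is a minimal sketch of Newton's method for logistic regression written in NumPy. This is illustration only: it is not scikit-learn's implementation, and the starting point and number of iterations are arbitrary choices.
#A minimal sketch of Newton's method for logistic regression (illustration only)
def newton_logistic(X, y, num_iter=25):
    #X: (n, m) array of features, y: (n,) array of 0/1 responses
    X_design = np.column_stack([np.ones(len(X)), X])        #add an intercept column
    beta = np.zeros(X_design.shape[1])                      #starting point: all parameters zero
    for _ in range(num_iter):
        p = 1 / (1 + np.exp(-X_design @ beta))              #current fitted probabilities
        grad = X_design.T @ (p - y)                         #gradient of L
        H = X_design.T @ (X_design * (p * (1 - p))[:, None])  #Hessian of L
        beta = beta - np.linalg.solve(H, grad)              #Newton update
    return beta
#For example, newton_logistic(X_train.values, Y_train.values) should give estimates
#close to the scikit-learn intercept and coefficients printed above.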
The quantity \begin{align*} \sum_{i=1}^n \left(-y_i \log p_i - (1-y_i) \log (1 - p_i) \right) = \sum_{i=1}^n \left( y_i \log \frac{1}{p_i} + (1-y_i) \log \frac{1}{1-p_i} \right) \end{align*} is often called the Binary Cross Entropy Loss between $(y_1, \dots, y_n)$ and $(p_1, \dots, p_n)$ (see e.g., https://en.wikipedia.org/wiki/Cross-entropy). It is always nonnegative (because $p_i$ and $1-p_i$ are between 0 and 1 so that $\log (1/p_i)$ and $\log(1/(1-p_i))$ are nonnegative), and it is minimized when $p_i = y_i$ for each $i$.
In the logistic regression model, the response variable $y$ is predicted by $\sigma(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})$. The parameters $\beta_0, \dots, \beta_m$ are estimated in such a way that the Binary Cross Entropy Loss between the model predictions and the observed responses is minimized.
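As a quick numerical illustration (a sketch with made-up numbers, not part of the Mroz analysis), the binary cross entropy loss can be computed directly from its definition:
#Illustration with made-up numbers: computing the binary cross entropy loss from its definition
y_toy = np.array([1, 0, 1, 1])           #observed binary responses
p_toy = np.array([0.9, 0.2, 0.7, 0.6])   #model probabilities for the response being 1
bce = -np.sum(y_toy * np.log(p_toy) + (1 - y_toy) * np.log(1 - p_toy))
print(bce)  #nonnegative; it approaches 0 as each p_i approaches the corresponding y_i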
The sigmoid function $\sigma(z)$ is nonlinear and strictly increasing, takes values strictly between 0 and 1, approaches 0 as $z \rightarrow -\infty$, and approaches 1 as $z \rightarrow \infty$. We can plot this function in Python as follows:
#Plotting the sigmoid function:
z = np.linspace(-10, 10, 100)
y = np.exp(z)/(1 + np.exp(z))
plt.figure(figsize = (8, 6))
plt.plot(z, y, label = "Sigmoid Function")
plt.title("Sigmoid Function")
plt.xlabel("z")
plt.ylabel("Sigmoid(z)")
plt.grid(True)
plt.legend()
plt.show()
The function determining model predictions in logistic regression is:
\begin{align*}
(x_1, \dots, x_m) \mapsto \sigma(\beta_0 + \beta_1 x_{1} + \dots + \beta_m x_{m})
\end{align*}
This function, which applies the simple nonlinear map $\sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}}$ to the linear function $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$, is a special case of an (artificial) neural network. More generally, a neural network is a function obtained by repeated composition of linear functions and simple nonlinearities such as the sigmoid. Neural networks are typically represented by pictures involving "nodes" and directed edges. The function $\sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)$ can be represented by the following picture (we took $m = 4$ for the picture).
Here the node '$+$' computes the quantity $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$ and then the next node '$\sigma$' applies the sigmoid function to the output of the first node thereby computing the final output $\sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)$.
Because logistic regression is a special case of neural networks, we can fit logistic regression in standard Python libraries implementing neural networks. The following code illustrates this using the library PyTorch.
import torch
import torch.nn as nn
import torch.optim as optim
The first step to using PyTorch is to convert the data into PyTorch tensors. This step is needed because PyTorch neural network models expect input data in the form of PyTorch tensors to utilize the library's capabilities for automatic differentiation, GPU acceleration etc.
#Convert data to PyTorch Tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
print(X_train_tensor.shape)
Y_train_tensor = torch.tensor(Y_train.values, dtype=torch.float32).unsqueeze(1) # Make Y a 2D tensor
print(Y_train_tensor.shape)
torch.Size([503, 6])
torch.Size([503, 1])
Note that the 'unsqueeze(1)' makes Y a 2D tensor (or matrix) just like X. It is necessary that both X and Y are 2D tensors.
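To see what 'unsqueeze(1)' does, here is a quick illustration with a small tensor (the values are arbitrary):
#Quick illustration of unsqueeze(1) on a small tensor:
v = torch.tensor([1., 2., 3.])
print(v.shape)               #torch.Size([3]): a 1D tensor
print(v.unsqueeze(1).shape)  #torch.Size([3, 1]): a 2D tensor with one column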
Next let us specify the input and output sizes of the neural network. The input size is equal to 6 (the number of explanatory variables in this regression) and the output size is equal to 1 (because the response variable is a scalar).
input_size = X_train_tensor.shape[1]
output_size = Y_train_tensor.shape[1]
display(input_size, output_size)
6
1
Now we are ready to specify the neural network. From the picture above, it is clear that it has two layers. The first is the linear layer, which corresponds to the '+' node (and computes $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$). The second is the nonlinear layer, which applies the sigmoid function. We can specify both layers in sequence using the 'Sequential' function:
# Define the Neural Network corresponding to Logistic Regression
model_1 = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid()
)
#We are calling this model_1 because there will be other models coming up.
'nn.Sequential' specifies the neural network in sequence. The first layer is the linear layer which maps the input $(x_1, \dots, x_m)$ to a single scalar $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$. The coefficients $\beta_1, \dots, \beta_m$ are known as the 'weights' associated with the first layer, and the intercept $\beta_0$ is known as the 'bias' of the first layer. Collectively, the weights and bias are known as the parameters of the first layer. The second layer simply applies the sigmoid function, and it has no parameters.
display(model_1[0], model_1[1])
Linear(in_features=6, out_features=1, bias=True)
Sigmoid()
linear, sigm = model_1.children()
print(list(linear.parameters()))
print(list(sigm.parameters())) #this has no parameters so empty list ([]) is printed
[Parameter containing:
tensor([[ 0.2821, -0.3516,  0.0119,  0.1230,  0.3326,  0.3583]], requires_grad=True), Parameter containing:
tensor([0.2763], requires_grad=True)]
[]
The displayed weights and bias above for the first layer are random values. Every time a neural network is created, its parameters are initialized by random numbers unless otherwise specified (you can go back and re-run the code and check that the parameters change).
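If you want the initialization to be reproducible, you can fix PyTorch's random seed before creating the model (this is optional and shown only as an illustration; the seed value 0 is an arbitrary choice):
torch.manual_seed(0) #fixing the seed makes the random initialization reproducible across runs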
#Another way of printing the weights and bias of the first layer:
print(model_1[0].weight.data)
print(model_1[0].bias.data)
tensor([[ 0.2821, -0.3516,  0.0119,  0.1230,  0.3326,  0.3583]])
tensor([0.2763])
#This neural network is just a function mapping (x1,...,x6) to sigma(b0 + b1 x1 + ... + b6 x6)
#We can provide specific input values of x1,...,x6 and obtain the output of the neural network:
#For example, for x1 = 0.5, x2 = 1, x3 = 0.44, x4 = -4, x5 = -1.5, x6 = 3:
xvals = torch.tensor([0.5, 1, 0.44, -4, -1.5, 3], dtype = torch.float32)
youtput = model_1(xvals)
print(youtput)
#we can verify the correctness of this value by directly computing sigma(b0 + b1 x1 + ... + b6 x6):
print(1/(1 + np.exp(-model_1[0].bias.data - torch.dot(xvals, model_1[0].weight.data.view(-1)))))
tensor([0.5387], grad_fn=<SigmoidBackward0>)
tensor([0.5387])
Now we shall train this neural network on the observed data. Training just means that the parameters will be estimated from the data. For this, we first need to specify the loss function (the Binary Cross Entropy loss) and the optimization algorithm that we want to employ. The algorithms used in PyTorch are variants of gradient descent, which is the iterative scheme
\begin{align*} \beta^{(t+1)} = \beta^{(t)} - \alpha \nabla L(\beta^{(t)}). \end{align*}
This should be compared with the iterative scheme for Newton's method:
\begin{align*} \beta^{(t+1)} = \beta^{(t)} - \left(H L(\beta^{(t)}) \right)^{-1} \nabla L(\beta^{(t)}). \end{align*}
Comparing the two equations, gradient descent is obtained by replacing the Hessian in Newton's method with the identity matrix multiplied by $1/\alpha$. The quantity $\alpha$ is called the learning rate and it determines the size of the step taken in the direction opposite to the gradient $\nabla L(\beta^{(t)})$. A popular variant of gradient descent used in PyTorch is Adam (Adaptive Moment Estimation). A sketch of plain gradient descent is given below; after that, the code specifies the binary cross entropy loss to be minimized and the Adam algorithm to be used for optimization.
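The following sketch implements the update $\beta^{(t+1)} = \beta^{(t)} - \alpha \nabla L(\beta^{(t)})$ directly using PyTorch's automatic differentiation. It is for illustration only (the learning rate and number of iterations are arbitrary choices, and this is not the training code we use below).
#A minimal sketch of plain gradient descent (illustration only; not the training code used below)
model_gd = nn.Sequential(nn.Linear(input_size, output_size), nn.Sigmoid())
criterion_gd = nn.BCELoss()
alpha = 0.001  #learning rate
for t in range(1000):
    loss = criterion_gd(model_gd(X_train_tensor), Y_train_tensor)
    model_gd.zero_grad()        #reset the gradients
    loss.backward()             #compute the gradient of the loss
    with torch.no_grad():
        for param in model_gd.parameters():
            param -= alpha * param.grad  #the gradient descent update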
# Cross entropy loss function and Adam algorithm
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.Adam(model_1.parameters(), lr=0.001)
# Train the model
num_epochs = 5000 #this is the number of iterations for the optimization algorithm
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model_1(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward() #in this step, the gradient is calculated
optimizer.step() #here the next iterate of the parameters is calculated according to the gradients
nn_weights = model_1[0].weight.data
nn_biases = model_1[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
Neural Network Weights:
tensor([[-1.4191, -0.0858,  0.1927, -0.0534,  0.2245, -0.0039]])
Logistic Regression Coefficients:
[[-1.42568874 -0.08661254  0.19075556 -0.05338824  0.22426735 -0.00391921]]
Neural Network Biases:
tensor([0.7464])
Logistic Regression Intercept:
[0.80948004]
There is some discrepancy between the weights (coefficients) and bias (intercept) returned by PyTorch and those returned by scikit-learn's logistic regression. This is due to the use of different algorithms: Adam in PyTorch and a Newton-like algorithm in scikit-learn. Specifically, the default algorithm used by scikit-learn's LogisticRegression is L-BFGS, a quasi-Newton method: instead of the exact Hessian used in Newton's method, L-BFGS uses an approximation to the inverse Hessian matrix.
predictions = model_1(X_train_tensor)
loss_adam = criterion(predictions,Y_train_tensor)
print(loss_adam)
model_1[0].weight.data = torch.tensor(lr_coefficients, dtype=torch.float32)
model_1[0].bias.data = torch.tensor(lr_intercept, dtype=torch.float32)
# Compute predictions and loss using the modified model
predictions = model_1(X_train_tensor)
loss_scikit_learn = criterion(predictions, Y_train_tensor)
print(loss_scikit_learn)
tensor(0.5381, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.5381, grad_fn=<BinaryCrossEntropyBackward0>)
In the above code, we are comparing the loss evaluated at the parameters obtained by Adam with the loss evaluated at the parameters given by the logistic regression intercept and coefficients. The two loss values are quite close (with the logistic regression weights/bias giving a slightly smaller loss). This indicates that Adam is working decently even though there is some discrepancy between its fitted parameters and those given by scikit-learn. The fact that different-looking parameter estimates have similar loss values also suggests that the loss function is somewhat flat around the actual minimizer.
#Let us reset the parameters of model_1 to those output by adam:
model_1[0].weight.data = nn_weights
model_1[0].bias.data = nn_biases
The L-BFGS algorithm is also available as an optimization method in PyTorch. The following code uses this algorithm.
#With the LBFGS Optimization Scheme:
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
Y_train_tensor = torch.tensor(Y_train.values, dtype=torch.float32).unsqueeze(1)
# Define the Neural Network Architecture
input_size = X_train.shape[1]
output_size = 1 # Binary classification
model_2 = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid()
)
# Define the loss function and optimizer
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.LBFGS(model_2.parameters(), lr=0.001)
# Train the model
num_epochs = 4000
for epoch in range(num_epochs):
def closure():
optimizer.zero_grad()
predictions = model_2(X_train_tensor)
loss = criterion(predictions, Y_train_tensor)
loss.backward()
return loss
optimizer.step(closure)
nn_weights = model_2[0].weight.data
nn_biases = model_2[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
Neural Network Weights:
tensor([[-1.4252, -0.0866,  0.1907, -0.0534,  0.2243, -0.0039]])
Logistic Regression Coefficients:
[[-1.42568874 -0.08661254  0.19075556 -0.05338824  0.22426735 -0.00391921]]
Neural Network Biases:
tensor([0.8100])
Logistic Regression Intercept:
[0.80948004]
This algorithm gives results that are very close to those given by scikit-learn.
Below we look at the test accuracy of the neural network models.
#Predictions on Test Data:
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
Y_test_tensor = torch.tensor(Y_test.values, dtype=torch.float32).squeeze()
with torch.no_grad(): # Ensure you're in "evaluation" mode which won't compute gradients
raw_predictions_1 = model_1(X_test_tensor)
Y_pred_1 = (raw_predictions_1 > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy_1 = (Y_pred_1 == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of Adam Optimized NN on test data: {accuracy_1 * 100:.2f}%")
with torch.no_grad(): # Ensure you're in "evaluation" mode which won't compute gradients
raw_predictions_2 = model_2(X_test_tensor)
Y_pred_2 = (raw_predictions_2 > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy_2 = (Y_pred_2 == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of LBFGS Optimized NN on test data: {accuracy_2 * 100:.2f}%")
Accuracy of Adam Optimized NN on test data: 72.40%
Accuracy of LBFGS Optimized NN on test data: 72.80%
In the last lecture, we also considered some toy datasets, the simplest of which is given below.
n = 750
n_train = 500
n_test = 250
np.random.seed(3) #setting seed to ensure reproducibility
x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()
Y = (x1 * x2 > 0).astype(np.int64)
#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)
x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]
#Plotting this dataset
def draw_results(x1, x2, color, plot_title=''):
plt.figure()
plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7, s = 10);
plt.colorbar()
plt.title(plot_title)
plt.axis('equal')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.tight_layout()
draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')
In the last lecture, we used logistic regression on this dataset and noted that it gave very poor accuracy.
#In the last lecture, we used logistic regression on this dataset:
from sklearn.linear_model import LogisticRegression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Intercept: [-0.06385385]
Coefficients: [[ 0.13228111 -0.12995264]]
Accuracy of Logistic Regression on test set: 0.332
Predicted    0   1  All
Actual
0           66  61  127
1          106  17  123
All        172  78  250
We can re-fit this logistic regression with PyTorch and verify that it gives the same results:
#Let us re-do this logistic regression using PyTorch:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
Y_train_tensor = torch.tensor(Y_train, dtype=torch.float32).unsqueeze(1)
# Define the Neural Network Architecture
input_size = X_train.shape[1]
output_size = 1 # Binary classification
model = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid() # Sigmoid activation for binary classification
)
# Define the loss function and optimizer
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Train the model
num_epochs = 1000 # You can adjust this
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward()
optimizer.step()
nn_weights = model[0].weight.data
nn_biases = model[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
Neural Network Weights:
tensor([[ 0.1532, -0.1311]])
Logistic Regression Coefficients:
[[ 0.13228111 -0.12995264]]
Neural Network Biases:
tensor([-0.0591])
Logistic Regression Intercept:
[-0.06385385]
#Accuracy of Neural Networks:
#Predictions on Test Data:
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
Y_test_tensor = torch.tensor(Y_test, dtype=torch.float32).squeeze()
with torch.no_grad():
raw_predictions = model(X_test_tensor)
Y_pred = (raw_predictions > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy = (Y_pred == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of Adam Optimized NN on test data: {accuracy * 100:.2f}%")
Accuracy of Adam Optimized NN on test data: 34.00%
This neural network in PyTorch gives essentially the same estimates as scikit-learn's logistic regression, and also the same accuracy. In the last lecture, we also saw that logistic regression with feature engineering, where we add the additional feature $x_1 x_2$, works very well, as demonstrated below.
def add_mult_feature(X):
"""Returns an array like X, but with a new feature that's X1 * X2"""
new_feature = X[:, 0] * X[:, 1]
return np.hstack([X, new_feature[:, None]])
X_train_feat = add_mult_feature(X_train)
X_test_feat = add_mult_feature(X_test)
logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 2000)
logimodel_train_feat.fit(X_train_feat, Y_train)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")
contingency_table_feat = pd.crosstab(Y_test, Y_pred_feat, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table_feat)
Intercept: [-7.05612094]
Coefficients: [[  23.07451326   13.61764772 7783.51393629]]
Accuracy of Logistic Regression (with new feature x1*x2) on test set: 1.0
Predicted    0    1  All
Actual
0          127    0  127
1            0  123  123
All        127  123  250
For neural networks, there is a different fix for improving accuracy without adding the feature $x_1 x_2$. This method adds a hidden layer to the neural network.
The neural network obtained by adding a hidden layer to logistic regression is given by the following picture.
Here the hidden nodes are $h_1, \dots, h_p$ (in the picture, we took $p = 6$), each of which is a linear combination of $x_1, \dots, x_m$ plus a bias term. A nonlinear function $\sigma$ (such as the sigmoid) is then applied to $h_1, \dots, h_p$; we define $s_i = \sigma(h_i)$ (it is common to alternatively think of $s_1, \dots, s_p$ as the hidden nodes). The logistic regression model is then used with these modified features $s_1, \dots, s_p$.
The mathematical equation underlying this neural network is the following: \begin{align*} f(x_1, \dots, x_m) = \sigma(\beta_0 + \beta_1 s_1 + \dots + \beta_p s_p) ~~ \text{ where } s_i = \sigma(h_i) \text{ and } h_i = \sum_{j} w_{ij} x_j + b_i. \end{align*} More succinctly, using vector-matrix notation, this can be written as \begin{align*} f(x_1, \dots, x_m) = \sigma\left(\beta_0 + \beta^T \sigma(W x + b) \right) \end{align*}
The nonlinear function $\sigma$ is applied twice: first to $h_1, \dots, h_p$ to obtain $s_1, \dots, s_p$, and then to $\beta_0 + \beta_1 s_1 + \dots + \beta_p s_p$. Actually, there is no reason to apply the same nonlinear function $\sigma$ in both places. It is common to use the ReLU nonlinear function in the first step (i.e., in $s_i = \sigma(h_i)$) and the sigmoid function in the second step. The ReLU activation function is given by: \begin{align*} \sigma_{\text{ReLU}}(z) = z_+ := \max(z, 0). \end{align*} With these two different nonlinear functions, the formula for the overall function computed by this neural network becomes: \begin{align*} f(x_1, \dots, x_m) = \sigma\left(\beta_0 + \beta^T \sigma_{\text{ReLU}}(W x + b) \right) \end{align*} Here $W$ is a $p \times m$ matrix, $b$ is a $p \times 1$ vector, $\beta$ is a $p \times 1$ vector and $\beta_0$ is a scalar. Note also that $\sigma_{\text{ReLU}}$ is being applied to the vector $Wx + b$ (as opposed to a scalar $z$), which means that it is applied separately to each component of the vector (and the results are collected in a vector of the same size): $\sigma_{\text{ReLU}}(z_1, z_2, z_3) = (\sigma_{\text{ReLU}}(z_1), \sigma_{\text{ReLU}}(z_2), \sigma_{\text{ReLU}}(z_3))$.
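To make this formula concrete, the following sketch computes $\sigma(\beta_0 + \beta^T \sigma_{\text{ReLU}}(Wx + b))$ directly and checks that it matches the output of the corresponding nn.Sequential model. This is purely illustrative; the layer sizes and the input values are arbitrary choices made for this sketch.
#Illustration: computing sigma(beta0 + beta^T ReLU(Wx + b)) by hand and comparing with nn.Sequential
toy_model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
x_toy = torch.tensor([0.3, -0.8])
W, b = toy_model[0].weight.data, toy_model[0].bias.data          #hidden layer parameters (W is 4 x 2, b is 4 x 1)
beta, beta0 = toy_model[2].weight.data, toy_model[2].bias.data   #output layer parameters
s = torch.clamp(W @ x_toy + b, min=0)          #s = ReLU(Wx + b), applied componentwise
by_hand = torch.sigmoid(beta0 + beta @ s)      #sigma(beta0 + beta^T s)
print(by_hand, toy_model(x_toy))               #the two values agree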
This hidden layer neural network can be fit using PyTorch by a minor modification of the previous code: we just need to change the specification of the network by adding the new hidden layer. The code is given below.
n_hidden = 6 #number of hidden units (the quantity p in our notation above)
model = nn.Sequential(
nn.Linear(input_size, n_hidden), # First linear layer with n_hidden nodes (hidden layer)
nn.ReLU(), # Activation function after hidden layer
nn.Linear(n_hidden, output_size), # Output layer
nn.Sigmoid() # Sigmoid activation for binary classification
)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 1000
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward()
optimizer.step()
The accuracy of the above neural network model on the test dataset is computed below.
with torch.no_grad(): # Ensures you're in "evaluation" mode which won't compute gradients
raw_predictions = model(X_test_tensor)
Y_pred = (raw_predictions > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy = (Y_pred == Y_test_tensor).float().mean().item()
print(f"Accuracy on test data: {accuracy * 100:.2f}%")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Accuracy on test data: 88.40%
Predicted  0.0  1.0  All
Actual
0          126    1  127
1           28   95  123
All        154   96  250
The accuracy for $p = 6$ is decent but it gets much better as $p$ is increased (go back to the code above and put $p = 60, 200, 500$). The success of this model means that the neural network with a single hidden layer is able to extract the correct feature $x_1 x_2$ within its hidden layer. Some intuition for why this works can be obtained as follows. Below we use the elementary formula:
\begin{align*} x^2 = 8 \int_0^1 (x - 4t + 2)_+ dt - 4x - 4 ~~ \text{ for } -2 \leq x \leq 2. \end{align*}
To verify the above formula, start with the right hand side and note that $(x - 4t + 2)_+ = 0$ unless $x - 4t + 2 \geq 0$, which is equivalent to $t \leq (x+2)/4$ (because $-2 \leq x \leq 2$, we have $0 \leq (x+2)/4 \leq 1$). Thus
\begin{align*} 8 \int_0^1 (x - 4t + 2)_+ dt - 4x - 4 &= 8 \int_0^{(x+2)/4} (x - 4t + 2) dt - 4 x - 4 \end{align*}
which can be easily checked to equal $x^2$. Because of this formula (applied with $x = x_1 + x_2$ and $x = x_1 - x_2$, both of which lie in $[-2, 2]$ since $x_1, x_2 \in [-1, 1]$), we can write
\begin{align*} x_1 x_2 &= \frac{1}{4} (x_1 + x_2)^2 - \frac{1}{4} (x_1 - x_2)^2 \\ &= \left[2 \int_0^1 (x_1 + x_2 - 4t + 2)_+ dt - (x_1 + x_2) - 1 \right] - \left[2 \int_0^1 (x_1 - x_2 - 4t + 2)_+ dt - (x_1 - x_2) - 1 \right] \\ &= 2 \int_0^1 \sigma_{\text{ReLU}}(x_1 + x_2 - 4t + 2) dt - 2 \int_0^1 \sigma_{\text{ReLU}}(x_1 - x_2 - 4t + 2) dt - 2 x_2. \end{align*}
By approximating the integrals above by discrete sums over $t$ (and noting that the leftover linear term can itself be written using ReLUs, e.g., $x_2 = \sigma_{\text{ReLU}}(x_2) - \sigma_{\text{ReLU}}(-x_2)$), we can write
\begin{align*} x_1 x_2 \approx \sum_{j=1}^N w_j \sigma_{\text{ReLU}}(x_1 + x_2 - t_j) + \sum_{j=1}^N \tilde{w}_j \sigma_{\text{ReLU}}(x_1 - x_2 - t_j) - 2 \sigma_{\text{ReLU}}(x_2) + 2 \sigma_{\text{ReLU}}(-x_2) \end{align*}
for suitable weights $w_j, \tilde{w}_j$ and points $t_j$. This means that $x_1 x_2$ can be closely approximated by a linear combination of a large number of $\sigma_{\text{ReLU}}$ terms applied to linear combinations of $x_1$ and $x_2$. In other words, it should be possible to find a $p \times 2$ matrix $W$, a $p \times 1$ vector $b$ and a $p \times 1$ vector $\beta$ so that $x_1 x_2 \approx \beta^T \sigma_{\text{ReLU}}(Wx + b)$ when $p$ is large. This is the reason why the hidden layer neural network with large $p$ works well here.
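The following short sketch checks the integral formula for $x^2$ numerically by replacing the integral with an average over a grid of $t$ values (illustration only; the grid sizes below are arbitrary choices):
#Numerical check: approximating x^2 on [-2, 2] by a discrete average of ReLUs as in the formula above
N = 200
t_grid = (np.arange(N) + 0.5) / N            #midpoints of N subintervals of [0, 1]
x_grid = np.linspace(-2, 2, 101)
relu_avg = np.mean(np.maximum(x_grid[:, None] - 4 * t_grid[None, :] + 2, 0), axis=1)
approx = 8 * relu_avg - 4 * x_grid - 4       #approximation of x^2 from the ReLU formula
print(np.max(np.abs(approx - x_grid ** 2)))  #maximum approximation error (small, and decreases as N grows)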
One can also check (see the code below) that this method continues to work when a small proportion of the data points are corrupted.
#With noisy data:
np.random.seed(3)
Y_train_noisy = Y_train.copy()
pts_to_flip = np.random.random(n_train) < 0.1 #we are selecting 10% of the datapoints
Y_train_noisy[pts_to_flip] = 1 - Y_train_noisy[pts_to_flip] #we are corrupting the responses of these datapoints
draw_results(x1_train, x2_train, color=Y_train_noisy, plot_title='Training data with noise')
Y_train_noisy_tensor = torch.tensor(Y_train_noisy, dtype=torch.float32).unsqueeze(1) # Make Y a 2D tensor
n_hidden = 500
model = nn.Sequential(
nn.Linear(input_size, n_hidden), # First linear layer with n_hidden nodes (hidden layer)
nn.ReLU(), # Activation function after hidden layer
nn.Linear(n_hidden, output_size), # Output layer
nn.Sigmoid() # Sigmoid activation for binary classification
)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 1000
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model(X_train_tensor)
loss = criterion(predictions,Y_train_noisy_tensor)
loss.backward()
optimizer.step()
with torch.no_grad(): # Ensures you're in "evaluation" mode which won't compute gradients
raw_predictions = model(X_test_tensor)
Y_pred = (raw_predictions > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy = (Y_pred == Y_test_tensor).float().mean().item()
print(f"Accuracy on test data: {accuracy * 100:.2f}%")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
Accuracy on test data: 98.40%
Predicted  0.0  1.0  All
Actual
0          126    1  127
1            3  120  123
All        129  121  250
Neural network models are fitted using variants of gradient descent. The calculation of gradients is crucial for these algorithms to work. The standard gradient calculation algorithm used in neural networks is backpropagation. Backpropagation is an algorithm for efficiently computing derivatives by applying the chain rule. To see how it works, let's consider a very simple loss function involving three variables, $a$, $b$, and $c$:
$$ L(a, b, c) = (a + 3b)c^2 $$
We can compute the partial derivatives with respect to $a$, $b$, and $c$. To make it a little clearer when and where we're using the chain rule, let $q = a + 3b$ and $r = c^2$ (so that $L = qr$).
$$ \begin{align*} \frac{\partial L}{\partial a} &= \frac{\partial L}{\partial q}\cdot\frac{\partial q}{\partial a} = c^2 \cdot 1 \\ \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial q}\cdot\frac{\partial q}{\partial b} = c^2 \cdot 3 \\ \frac{\partial L}{\partial c} &= \frac{\partial L}{\partial r}\cdot\frac{\partial r}{\partial c} = (a+3b) \cdot 2c \end{align*} $$
Even in this simple example, we can see that there was some redundant work involved: in doing this computation, we needed to compute $\frac{\partial L}{\partial q} = c^2$ twice. In a more complicated expression, the redundant work would become much worse. Backpropagation gives us a way to compute these gradients more efficiently.
Given specific numerical values for $a$, $b$, and $c$, backpropagation is an efficient way to compute the loss and the gradient (i.e., all the partial derivatives), with no redundant computation. First consider the computation of $L(a, b, c)$ for given values of $a, b, c$. For concreteness, take $a = 2, b = 3, c = 4$. The calculations for $L(a, b, c)$ can be represented by the following graph (with answers given by the blue numbers above the arrows):
Next, let's notice that when we did the calculations in the previous section to find the gradient, most of our expressions started at the loss, then, using the chain rule, computed partial derivatives with respect to things like $q$ and $r$. Let's try to write these partial derivatives on the graph, and see if we can use them to keep working backwards.
First, we'll start with the easiest derivative, the derivative of the loss with respect to itself: $\frac{\partial L}{\partial L}$. This is just 1!
Next, we'll compute the derivative of the loss with respect to $q$ (top right branch of the graph). How did we get from $q$ to $L$? We multiplied by 16 (that is, for these specific numbers, $L = 16q$). So, the partial derivative of $L$ with respect to $q$ is just 16.
Continuing along the top part of the graph, now we can compute the derivative with respect to $a$. How did we get from $a$ to $q$? We added 9 (that is, for these specific numbers, $q = a + 9$). So, the partial derivative of $q$ with respect to $a$ is just 1. But we're trying to compute $\frac{\partial L}{\partial a}$, not $\frac{\partial q}{\partial a}$. So, we'll take advantage of the chain rule and multiply by the "derivative so far": that's just $\frac{\partial L}{\partial q}$ = 16. So, our answer is $\frac{\partial L}{\partial a} = 1 \cdot 16 = 16$.
Next, we'll look at the $b$ branch of the graph. From similar reasoning to above, the derivative at the output of the "multiply by three" block is just 16. How do we use that to compute the derivative with respect to $b$? To get from $b$ to that value, we multiplied by 3. So, the corresponding term in the chain rule is 3. We multiply that with what we have so far, 16, to get 48.
Finally, let's look at the $c$ branch at the bottom of the graph. We'll start by computing the derivative with respect to $r$. Similar to the computation for $q$ above, we multiplied $r$ by 11 to get $L$, so that means the derivative is 11.
All we have left is to go through the "square" block. The derivative of its output with respect to its input is two times the input (in other words, $\frac{\partial r}{\partial c} = 2c$). Since the input was 4, that means our new term is 8, and our overall derivative on this branch is $11 \cdot 8 = 88$.
Now we're done! We've computed the derivatives, as shown in this completed graph with the backpropagation intermediate and final results in red below the arrows:
In general, all we need to successfully run backpropagation is the ability to differentiate every mathematical building block of our loss (don't forget, the loss depends on the prediction). For every building block, we need to know how to compute the forward pass (the mathematical operation, like addition, multiplication, squaring, etc.) and the backward pass (multiplying by the derivative).
Let's see what this looks like in code using PyTorch:
#Example One:
a = torch.tensor(2., requires_grad = True)
b = torch.tensor(3., requires_grad = True)
c = torch.tensor(4., requires_grad = True)
q = a + 3 * b
r = c ** 2
print(q, r)
tensor(11., grad_fn=<AddBackward0>) tensor(16., grad_fn=<PowBackward0>)
l = q*r
print(l)
tensor(176., grad_fn=<MulBackward0>)
l.backward()
display(a.grad, b.grad, c.grad)
tensor(16.)
tensor(48.)
tensor(88.)
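For comparison, here is a sketch of the same example computed by hand, without autograd. The forward pass computes the loss and the backward pass applies the chain rule exactly as in the graph above, reproducing the values 16, 48 and 88.
#The same computation done by hand, mirroring the forward and backward passes in the graph:
a_val, b_val, c_val = 2.0, 3.0, 4.0
#Forward pass:
q_val = a_val + 3 * b_val    #q = a + 3b = 11
r_val = c_val ** 2           #r = c^2 = 16
L_val = q_val * r_val        #L = q*r = 176
#Backward pass (chain rule, working backwards from L):
dL_dq = r_val                #dL/dq = r = 16
dL_dr = q_val                #dL/dr = q = 11
dL_da = dL_dq * 1            #dL/da = 16
dL_db = dL_dq * 3            #dL/db = 48
dL_dc = dL_dr * 2 * c_val    #dL/dc = 88
print(L_val, dL_da, dL_db, dL_dc)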