This lecture will focus on neural networks. Neural networks can be seen as an extension of logistic regression. Let us start then by fitting logistic regression to the Mroz dataset from the last few lectures.

Let us first import the necessary Python Libraries and the mroz dataset.

In [6]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

In [7]:

```
#Loading the mroz dataset:
mroz = pd.read_csv("MROZ.csv")
Y = mroz['inlf'] # This is a binary variable
X = mroz[['kidslt6', 'age', 'educ', 'huswage', 'exper', 'expersq']]
```

In [8]:

```
#We split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=503, test_size=250, random_state=1)
```

In [9]:

```
#Logistic Regression and its accuracy:
from sklearn.linear_model import LogisticRegression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
Y_pred = logimodel_train.predict(X_test)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
```

The fitted coefficients by the logistic regression model are given by the 'Intercept' and 'Coefficients' above. The test accuracy is given by 72.8% (the discrepancy between actual test values and predicted values can be further broken down into the confusion matrix or contingency table as above).

Let us briefly recall how the parameter estimates are obtained in logistic regression. The idea is to maximize the log-likelihood:
\begin{align*}
\sum_{i=1}^n \left\{ y_i \log \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} + (1-y_i) \log \left[1 - \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} \right] \right\}
\end{align*}
over the parameters $\beta_0, \dots, \beta_m$. Equivalently, the goal is to **minimize** (note the minus sign below in $L$):
\begin{align*}
L := - \sum_{i=1}^n \left(y_i \log p_i + (1-y_i) \log (1 - p_i) \right) ~~~ \text{ where } p_i = \sigma \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} \right)
\end{align*}
where $\sigma(\cdot)$ is the **sigmoid** function defined by
\begin{align*}
\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}.
\end{align*}
The algorithm used to minimize $L$ is called Newton's method which is an iterative scheme given by:
\begin{align*}
\beta^{(m+1)} = \beta^{(m)} - \left(H L(\beta^{(m)}) \right)^{-1} \nabla L(\beta^{(m)})
\end{align*}
where $\nabla L$ and $H L$ are the gradient and Hessian of $L$ respectively. The gradient and Hessian can be calculated in closed form by just differentiating the expression of $L$. Newton's algorithm converges quite quickly for logistic regression.

The quantity \begin{align*} \sum_{i=1}^n \left(-y_i \log p_i - (1-y_i) \log (1 - p_i) \right) = \sum_{i=1}^n \left( y_i \log \frac{1}{p_i} + (1-y_i) \log \frac{1}{1-p_i} \right) \end{align*} is often called the Binary Cross Entropy Loss between $(y_1, \dots, y_n)$ and $(p_1, \dots, p_n)$ (see e.g., https://en.wikipedia.org/wiki/Cross-entropy). It is always nonnegative (because $p_i$ and $1-p_i$ are between 0 and 1 so that $\log (1/p_i)$ and $\log(1/(1-p_i))$ are nonnegative), and it is minimized when $p_i = y_i$ for each $i$.

In the logistic regression model, the response variable $y$ is predicted by $\sigma(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})$. The parameters $\beta_0, \dots, \beta_m$ are estimated in such a way that the Binary Cross Entropy Loss between the model predictions and the observed responses is minimized.

The sigmoid function $\sigma(z)$ is nonlinear, strictly increasing taking values between 0 and 1, approaches 0 as $z \rightarrow -\infty$ and approaches 1 as $z \rightarrow \infty$. We can plot this function in Python as follows:

In [10]:

```
#Plotting the sigmoid function:
z = np.linspace(-10, 10, 100)
y = np.exp(z)/(1 + np.exp(z))
plt.figure(figsize = (8, 6))
plt.plot(z, y, label = "Sigmoid Function")
plt.title("Sigmoid Function")
plt.xlabel("z")
plt.ylabel("Sigmoid(z)")
plt.grid(True)
plt.legend()
plt.show()
```

The model-prediction determining function in logistic regression is:

\begin{align*}
(x_1, \dots, x_m) \mapsto \sigma(\beta_0 + \beta_1 x_{1} + \dots + \beta_m x_{m})
\end{align*}
This function, which applies the simple nonlinear map $\sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}}$ to the linear function $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$, is a special case of an (artificial) neural network. More generally, a neural network is a function obtained by repeated composition of linear functions and simple nonlinearities such as the sigmoid. Neural networks are typically represented by pictures involving "nodes" and directed edges. The function $\sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)$ can be represented by the following picture (we took $m = 4$ for the picture).

Here the node '$+$' computes the quantity $\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$ and then the next node '$\sigma$' applies the sigmoid function to the output of the first node thereby computing the final output $\sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)$.

Because logistic regression is a special case of neural networks, we can fit logistic regression in standard Python libraries implementing neural networks. The following code illustrates this using the library PyTorch.

In [11]:

```
import torch
import torch.nn as nn
import torch.optim as optim
```

In [12]:

```
#Convert data to PyTorch Tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
print(X_train_tensor.shape)
Y_train_tensor = torch.tensor(Y_train.values, dtype=torch.float32).unsqueeze(1) # Make Y a 2D tensor
print(Y_train_tensor.shape)
```

torch.Size([503, 6]) torch.Size([503, 1])

Note that the 'unsqueeze(1)' makes Y a 2D tensor (or matrix) just like X. It is necessary that both X and Y are 2D tensors.

Next let us specify the input and output sizes of the neural network. The input size is equal to 6 (the number of explanatory variables in this regression) and the output size is equal to 1 (because the response variable is a scalar).

In [13]:

```
input_size = X_train_tensor.shape[1]
output_size = Y_train_tensor.shape[1]
display(input_size, output_size)
```

6

1

In [14]:

```
# Define the Neural Network corresponding to Logistic Regression
model_1 = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid()
)
#We are calling this model_1 because there will be other models coming up.
```

In [15]:

```
display(model_1[0], model_1[1])
```

Linear(in_features=6, out_features=1, bias=True)

Sigmoid()

In [16]:

```
linear, sigm = model_1.children()
print(list(linear.parameters()))
print(list(sigm.parameters())) #this has no parameters so empty list ([]) is printed
```

In [17]:

```
#Another way of printing the weights and bias of the first layer:
print(model_1[0].weight.data)
print(model_1[0].bias.data)
```

tensor([[ 0.2821, -0.3516, 0.0119, 0.1230, 0.3326, 0.3583]]) tensor([0.2763])

In [18]:

```
#This neural network is just a function mapping (x1,...,x6) to sigma(b0 + b1 x1 + ... + b6 x6)
#We can provide specific input values of x1,...,x6 and obtain the output of the neural network:
#For example, for x1 = 0.5, x2 = 1, x3 = 0.44, x4 = -4, x5 = -1.5, x6 = 3:
xvals = torch.tensor([0.5, 1, 0.44, -4, -1.5, 3], dtype = torch.float32)
youtput = model_1(xvals)
print(youtput)
#we can verify the correctness of this value by directly computing sigma(b0 + b1 x1 + ... + b6 x6):
print(1/(1 + np.exp(-model_1[0].bias.data - torch.dot(xvals, model_1[0].weight.data.view(-1)))))
```

tensor([0.5387], grad_fn=<SigmoidBackward0>) tensor([0.5387])

**learning rate** and it determines the size of the step taken in the direction opposite to the gradient $\nabla L(\beta^{(m)})$. A popular variant of Gradient Descent used in PyTorch is Adam (Adaptive Moment Estimation). The following code specifies the binary cross entropy loss to be minimized and the Adam algorithm to be used for optimization.

In [19]:

```
# Cross entropy loss function and Adam algorithm
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.Adam(model_1.parameters(), lr=0.001)
```

In [20]:

```
# Train the model
num_epochs = 5000 #this is the number of iterations for the optimization algorithm
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model_1(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward() #in this step, the gradient is calculated
optimizer.step() #here the next iterate of the parameters is calculated according to the gradients
```

In [21]:

```
nn_weights = model_1[0].weight.data
nn_biases = model_1[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
```

In [22]:

```
predictions = model_1(X_train_tensor)
loss_adam = criterion(predictions,Y_train_tensor)
print(loss_adam)
model_1[0].weight.data = torch.tensor(lr_coefficients, dtype=torch.float32)
model_1[0].bias.data = torch.tensor(lr_intercept, dtype=torch.float32)
# Compute predictions and loss using the modified model
predictions = model_1(X_train_tensor)
loss_scikit_learn = criterion(predictions, Y_train_tensor)
print(loss_scikit_learn)
```

In [23]:

```
#Let us reset the parameters of model_1 to those output by adam:
model_1[0].weight.data = nn_weights
model_1[0].bias.data = nn_biases
```

In [24]:

```
#With the LBFGS Optimization Scheme:
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
Y_train_tensor = torch.tensor(Y_train.values, dtype=torch.float32).unsqueeze(1)
# Define the Neural Network Architecture
input_size = X_train.shape[1]
output_size = 1 # Binary classification
model_2 = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid()
)
# Define the loss function and optimizer
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.LBFGS(model_2.parameters(), lr=0.001)
# Train the model
num_epochs = 4000
for epoch in range(num_epochs):
def closure():
optimizer.zero_grad()
predictions = model_2(X_train_tensor)
loss = criterion(predictions, Y_train_tensor)
loss.backward()
return loss
optimizer.step(closure)
```

In [25]:

```
nn_weights = model_2[0].weight.data
nn_biases = model_2[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
```

This algorithm gives results that are very close to those given by scikit-learn.

Below we look at the test accuracy of the neural network models.

In [26]:

```
#Predictions on Test Data:
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
Y_test_tensor = torch.tensor(Y_test.values, dtype=torch.float32).squeeze()
with torch.no_grad(): # Ensure you're in "evaluation" mode which won't compute gradients
raw_predictions_1 = model_1(X_test_tensor)
Y_pred_1 = (raw_predictions_1 > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy_1 = (Y_pred_1 == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of Adam Optimized NN on test data: {accuracy_1 * 100:.2f}%")
with torch.no_grad(): # Ensure you're in "evaluation" mode which won't compute gradients
raw_predictions_2 = model_2(X_test_tensor)
Y_pred_2 = (raw_predictions_2 > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy_2 = (Y_pred_2 == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of LBFGS Optimized NN on test data: {accuracy_2 * 100:.2f}%")
```

In the last lecture, we also considered some toy datasets the simplest of which is given below.

In [27]:

```
n = 750
n_train = 500
n_test = 250
np.random.seed(3) #setting seed to ensure reproducibility
x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)
X = np.vstack([x1, x2]).transpose()
Y = (x1 * x2 > 0).astype(np.int64)
#Split the dataset into training and test datasets of sizes 500 and 250 respectively
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=500, test_size=250, random_state=1)
x1_train = X_train[:,0]
x2_train = X_train[:,1]
x1_test = X_test[:,0]
x2_test = X_test[:,1]
```

In [28]:

```
#Plotting this dataset
def draw_results(x1, x2, color, plot_title=''):
plt.figure()
plt.scatter(x1, x2, c=color, cmap='viridis', alpha=0.7, s = 10);
plt.colorbar()
plt.title(plot_title)
plt.axis('equal')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.tight_layout()
draw_results(x1_train, x2_train, color=Y_train, plot_title='Training data')
draw_results(x1_test, x2_test, color=Y_test, plot_title='Test data (ground truth)')
```

In [29]:

```
#In the last lecture, we used logistic regression on this dataset:
from sklearn.linear_model import LogisticRegression
logimodel_train = LogisticRegression(penalty = None, max_iter = 1000)
logimodel_train.fit(X_train, Y_train)
print("Intercept:", logimodel_train.intercept_)
print("Coefficients:", logimodel_train.coef_)
#Prediction for the test data
Y_pred = logimodel_train.predict(X_test)
accuracy_logimodel = np.mean(Y_test == Y_pred)
print(f"Accuracy of Logistic Regression on test set: {accuracy_logimodel}")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
```

We can re-fit this logistic regression with PyTorch and verify that it gives same results:

In [30]:

```
#Let us re-do this logistic regression using PyTorch:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
Y_train_tensor = torch.tensor(Y_train, dtype=torch.float32).unsqueeze(1)
# Define the Neural Network Architecture
input_size = X_train.shape[1]
output_size = 1 # Binary classification
model = nn.Sequential(
nn.Linear(input_size, output_size),
nn.Sigmoid() # Sigmoid activation for binary classification
)
# Define the loss function and optimizer
criterion = nn.BCELoss() # Binary Cross Entropy Loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Train the model
num_epochs = 1000 # You can adjust this
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward()
optimizer.step()
```

In [31]:

```
nn_weights = model[0].weight.data
nn_biases = model[0].bias.data
lr_coefficients = logimodel_train.coef_
lr_intercept = logimodel_train.intercept_
# Printing side by side
print(f"Neural Network Weights:\n{nn_weights}\nLogistic Regression Coefficients:\n{lr_coefficients}")
print(f"Neural Network Biases:\n{nn_biases}\nLogistic Regression Intercept:\n{lr_intercept}")
```

In [32]:

```
#Accuracy of Neural Networks:
#Predictions on Test Data:
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
Y_test_tensor = torch.tensor(Y_test, dtype=torch.float32).squeeze()
with torch.no_grad():
raw_predictions = model(X_test_tensor)
Y_pred = (raw_predictions > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy = (Y_pred == Y_test_tensor).float().mean().item() # Compute accuracy
print(f"Accuracy of Adam Optimized NN on test data: {accuracy * 100:.2f}%")
```

Accuracy of Adam Optimized NN on test data: 34.00%

In [33]:

```
def add_mult_feature(X):
"""Returns an array like X, but with a new feature that's X1 * X2"""
new_feature = X[:, 0] * X[:, 1]
return np.hstack([X, new_feature[:, None]])
X_train_feat = add_mult_feature(X_train)
X_test_feat = add_mult_feature(X_test)
logimodel_train_feat = LogisticRegression(penalty = None, max_iter = 2000)
logimodel_train_feat.fit(X_train_feat, Y_train)
print("Intercept:", logimodel_train_feat.intercept_)
print("Coefficients:", logimodel_train_feat.coef_)
#Prediction for the test data
Y_pred_feat = logimodel_train_feat.predict(X_test_feat)
accuracy_logimodel_feat = np.mean(Y_test == Y_pred_feat)
print(f"Accuracy of Logistic Regression (with new feature x1*x2) on test set: {accuracy_logimodel_feat}")
contingency_table_feat = pd.crosstab(Y_test, Y_pred_feat, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table_feat)
```

The neural network obtained by adding a hidden layer to logistic regression is given by the following picture.

Here the hidden nodes are $h_1, \dots, h_p$ (in the picture, we took $p = 6$) which are linear combinations of $x_1, \dots, x_m$. $\sigma$ is a nonlinear function (like the sigmoid) that is first applied to $h_1, \dots, h_p$; we define $s_i = \sigma(h_i)$ (it is common to alternatively think of $s_1, \dots, s_p$ as the hidden nodes). The logistic regression model is then used with these modified features $s_1, \dots, s_p$.

The mathematical equation underlying this neural network is the following: \begin{align*} f(x_1, \dots, x_m) = \sigma(\beta_0 + \beta_1 s_1 + \dots + \beta_p s_p) ~~ \text{ where } s_i = \sigma(h_i) \text{ and } h_i = \sum_{j} w_{ij} x_j + b_i. \end{align*} More succinctly, using vector-matrix notation, this can be written as \begin{align*} f(x_1, \dots, x_m) = \sigma\left(\beta_0 + \beta^T \sigma(W x + b) \right) \end{align*}

The nonlinear function $\sigma$ is applied twice. First on $h_1, \dots, h_p$ to obtain $s_1, \dots, s_p$ and then on $\beta_0 + \beta_1 s_1 + \dots + \beta_p s_p$. Actually, there is no reason to apply the same nonlinear function $\sigma$ in both these applications. It is common to use the ReLU nonlinear function in the first step (i.e., in $s_i = \sigma(h_i)$) and the sigmoide function in the second step. The ReLU activation function is given by: \begin{align*} \sigma_{\text{ReLU}}(z) = z_+ := \max(z, 0) \end{align*} With these two different nonlinear functions, the formula for the overall function computed by this neural network becomes: \begin{align*} f(x_1, \dots, x_m) = \sigma\left(\beta_0 + \beta^T \sigma_{\text{ReLU}}(W x + b) \right) \end{align*} Here $W$ is a $p \times m$ matrix, $b$ is a $p \times 1$ vector, $\beta$ is a $p \times 1$ vector and $\beta_0$ is a scalar. Note also that $\sigma_{\text{ReLU}}$ is being applied to the vector $Wx + b$ (as opposed to a scalar $z$) which means that it is being applied separately to each component of the vector (and all the results are stored in a vector of the same size): $\sigma_{\text{ReLU}}(z_1, z_2, z_3) = (\sigma_{\text{ReLU}}(z_1), \sigma_{\text{ReLU}}(z_2), \sigma_{\text{ReLU}}(z_3))$.

This hidden layer neural network can be fit using PyTorch by a minor modification of the previous code: we just need to change the specification of the network by adding the new hidden layer. The code is given below.

In [34]:

```
n_hidden = 6 #number of hidden units (the quantity p in our notation above)
model = nn.Sequential(
nn.Linear(input_size, n_hidden), # First linear layer with n_hidden nodes (hidden layer)
nn.ReLU(), # Activation function after hidden layer
nn.Linear(n_hidden, output_size), # Output layer
nn.Sigmoid() # Sigmoid activation for binary classification
)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 1000
for epoch in range(num_epochs):
optimizer.zero_grad() #reset the gradients
predictions = model(X_train_tensor)
loss = criterion(predictions,Y_train_tensor)
loss.backward()
optimizer.step()
```

The accuracy of the above neural network model on the test dataset is computed below.

In [35]:

```
with torch.no_grad(): # Ensures you're in "evaluation" mode which won't compute gradients
raw_predictions = model(X_test_tensor)
Y_pred = (raw_predictions > 0.5).float().squeeze() # Thresholding at 0.5 for binary classification
accuracy = (Y_pred == Y_test_tensor).float().mean().item()
print(f"Accuracy on test data: {accuracy * 100:.2f}%")
contingency_table = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(contingency_table)
```

Accuracy on test data: 88.40% Predicted 0.0 1.0 All Actual 0 126 1 127 1 28 95 123 All 154 96 250

The accuracy for $p = 6$ is decent but it gets much better as $p$ is increased (go back to the code above and put $p = 60, 200, 500$). The success of this model means that the neural network with a single hidden node is able to extract the correct feature $x_1 x_2$ within its hidden layer. Some intuition for why this is working can be obtained as follows. Below we use the elementary formula: \begin{align*} x^2 = 8 \int_0^1 (x - 4t + 2)_+ dt - 4x - 4 ~~ \text{ for } -2 \leq x \leq 2. \end{align*} To verify the above formula, start with the right hand side and note that $(x - 4t + 2)_+ = 0$ unless $x-4t + 2 \geq 0$ which is equivalent to $t \leq (x+2)/4$ (because $-2 \leq x \leq x$ we have $0 \leq (x+2)/4 \leq 1$). Thus \begin{align*} 8 \int_0^1 (x - 4t + 2)_+ dt - 4x - 4 &= 8 \int_0^{(x+2)/4} (x - 4t + 2) dt - 4 x - 4 \end{align*} which can be easily checked to equal $x^2$. Because of this formula, we can write \begin{align*} x_1 x_2 &= \frac{1}{4} (x_1 + x_2)^2 + \frac{1}{4} (x_1 - x_2)^2 \\ &= 2 \int_0^1 (x_1 + x_2 - 4t + 2)_+ dt - 4(x_1 + x_2) - 4 + 2 \int_0^1 (x_1 - x_2 - 4t + 2)_+ dt - 4(x_1 - x_2) - 4 \\ &= 2 \int_0^1 (x_1 + x_2 - 4t + 2)_+ dt + 2 \int_0^1 (x_1 - x_2 - 4t + 2)_+ dt - 8 x_1 - 8 \\ &= 2 \int_0^1 \sigma_{\text{ReLU}}(x_1 + x_2 - 4t + 2) dt + 2 \int_0^1 \sigma_{\text{ReLU}}(x_1 - x_2 - 4t + 2) dt - 8 x_1 - 8. \end{align*} By approximating the integrals above by discrete sums over $t$, we can write \begin{align*} x_1 x_2 \approx \sum_{j=1}^N w_j \sigma_{\text{ReLU}}(x_1 + x_2 - t_j) + \sum_{j=1}^N w_j \sigma_{\text{ReLU}}(x_1 - x_2 - t_j) \end{align*} This means that $x_1 x_2$ can be closely approximated by the linear combination of a large number of $\sigma_{\text{ReLU}}$ applied to linear combinations of $x_1$ and $x_2$. In other words, it should be possible to find $W$ ($p \times 2$) and $b$ ($p \times 1$) so that $x_1 x_2 \approx \sigma_{\text{ReLU}}(Wx + b)$ when $p$ is large. This is the reason why the hidden neural network with $p$ large is working well.

One can also check (see code below) that this method also works when a small proportion of datapoints are corrupted.

In [36]:

```
#With noisy data:
np.random.seed(3)
Y_train_noisy = Y_train.copy()
pts_to_flip = np.random.random(n_train) < 0.1 #we are selecting 10% of the datapoints
Y_train_noisy[pts_to_flip] = 1 - Y_train_noisy[pts_to_flip] #we are corrupting the responses of these datapoints
draw_results(x1_train, x2_train, color=Y_train_noisy, plot_title='Training data with noise')
Y_train_noisy_tensor = torch.tensor(Y_train_noisy, dtype=torch.float32).unsqueeze(1) # Make Y a 2D tensor
```