Luis Alvergue

Controls engineer by training, transportation engineer by trade

Toy deep neural network implementation with PyTorch

10 minutes
August 14, 2021

Linear Algebra and Learning from Data by Gilbert Strang has the clearest explanation of deep neural networks I have seen so far. The main reason I say this is that it describes a neural network as a composition of Continuous Piecewise Linear (CPL) functions using linear algebra notation. In addition, the key algorithm for computing the parameters of these CPL functions is stochastic gradient descent, with backpropagation carrying out the chain rule that shows up when calculating the gradient. The problem is stated as follows.

Deep Learning

First, suppose you can measure the value of $p$ different variables, and that you store one variable in each component of a vector, giving a vector of data $x\in\mathbb{R}^p$. Second, suppose there is a rule $f:\mathbb{R}^p\rightarrow\mathbb{R}^m$, unknown for the moment, that maps these measured variables into another set of $m$ measured values represented by a vector of ‘labels’ $y\in\mathbb{R}^m$. A deep neural network is a composition of CPL functions $F:\mathbb{R}^p\rightarrow\mathbb{R}^m$ that will hopefully approximate $f$. That way, when we get a new data measurement, we can just run it through $F$ and, with luck, get the ‘correct’ label associated with it.
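To make the objects concrete, here is a trivial sketch in PyTorch; the sizes $p=3$ and $m=1$ are made up for illustration.

import torch

p, m = 3, 1            # made-up sizes for illustration
x = torch.randn(p)     # one data measurement, x in R^p
y = torch.randn(m)     # its vector of labels, y in R^m

# f is the unknown rule that maps x to y; the network F is the function
# we fit so that F(x) is close to y on the training data and, hopefully,
# on new measurements as well.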

The state-of-the-art $F$ in deep learning (as of 2021) is a composition of CPLs that looks like this:

$$F(x)=F_N\big(W_NF_{N-1}\big(W_{N-1}\cdots F_2\big(W_2F_1(W_1x+b_1)+b_2\big)\cdots+b_{N-1}\big)+b_N\big)$$

where each $F_i$ applies ReLU componentwise to its argument, the vector $W_iv_{i-1}+b_i$ (with $v_0=x$ and $v_{i-1}$ the output of the previous layer). Typically, though, the outermost function, which produces the output of the neural network, does not apply a ReLU: a ReLU would constrain the output to nonnegative numbers, and although that may make sense for some problems, it usually isn’t what you want. Each $F_i$ is a ‘layer’ of the neural network, and the number of layers is called the ‘depth’ of the network. The number of ‘neurons’ per layer, given by the number of rows of $W_i$, is called the ‘width’ of the network.

Note that $W_i$ is a matrix, sorry for the sloppy notation.
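Written out as code, this composition is just a loop over the layers. Here is a minimal sketch (with made-up layer sizes, separate from the example below) that applies ReLU after every layer except the last:

import torch


def relu(v):
    # Componentwise ReLU, i.e. max(v, 0)
    return torch.max(v, torch.zeros_like(v))


def forward(x, weights, biases):
    # weights = [W_1, ..., W_N], biases = [b_1, ..., b_N].
    # ReLU follows every layer except the last, as described above.
    v = x
    for W, b in zip(weights[:-1], biases[:-1]):
        v = relu(W @ v + b)
    return weights[-1] @ v + biases[-1]


# Made-up sizes: p = 4 inputs, two hidden layers of width 8, m = 2 outputs
p, m, width = 4, 2, 8
weights = [torch.randn(width, p), torch.randn(width, width), torch.randn(m, width)]
biases = [torch.randn(width), torch.randn(width), torch.randn(m)]
x = torch.randn(p)
print(forward(x, weights, biases))  # a vector in R^m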

Deep Learning Example with One Input and One Neuron per Layer

Let’s look at a very simple example with one input and one neuron per layer. We are interested in learning the function

$$ f(x)= \begin{cases} 0 & x< 0 \\ 2x+1 & x\geq 0 \end{cases} $$

given data $\left\{y^{(i)},x^{(i)}\right\}$ and assuming a 3-layer deep neural network $F(x)=F_3(F_2(F_1(x)))=w_3\,\mathrm{ReLU}(w_2\,\mathrm{ReLU}(w_1x+b_1)+b_2)+b_3$. Note that the $w_i$ and $b_i$ are scalars in this simple example. Running the following code

import torch
import random
from matplotlib import pyplot as plt
from IPython import display


def make_data(num_examples):
    # Make a piecewise linear data distribution
    lower, upper = -20, 20
    x = torch.arange(lower, upper, (upper - lower) / num_examples)
    y = torch.zeros(num_examples)
    i = 0
    for xi in x:
        if xi < 0:
            y[i] = 0.
        else:
            y[i] = 2. * xi + 1.
        i = i + 1
    return x, y


def data_iter(batch_size, features, labels):
    # Takes a batch size, a matrix of features, and a vector of labels,
    # and yields minibatches of size batch_size
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]


def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)


def model(features, w1, b1, w2, b2, w3, b3):
    # One input, one output model
    return w3 * relu(w2 * relu(w1 * features + b1) + b2) + b3


def squared_loss(y_hat, y):
    return (y_hat - y)**2 / 2


def sgd(params, lr, batch_size):
    # Minibatch stochastic gradient descent
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


num_examples = 20
features, labels = make_data(num_examples)
# Initialize parameters, w and b
dimension = 1
w1 = torch.normal(0., .3, size=(dimension, 1), requires_grad=True)
b1 = torch.ones(size=(dimension, 1), requires_grad=True)
w2 = torch.normal(0., .3, size=(dimension, 1), requires_grad=True)
b2 = torch.ones(size=(dimension, 1), requires_grad=True)
w3 = torch.normal(0., .3, size=(dimension, 1), requires_grad=True)
b3 = torch.ones(size=(dimension, 1), requires_grad=True)
batch_size = 2
lr = 0.001  # learning rate
num_epochs = 2000
net = model
loss = squared_loss

# Training Loop
print(f'Features: {features}')
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w1, b1, w2, b2, w3, b3), y)
        l.sum().backward()
        sgd([w1, b1, w2, b2, w3, b3], lr, batch_size)
    with torch.no_grad():
        train_l = loss(net(features, w1, b1, w2, b2, w3, b3), labels)
        print(f'epoch {epoch + 1}, '
              f'loss {float(train_l.mean()):f}')
print(
    f"w: {w1.detach().numpy()[0].item(0)}, "
    f"{w2.detach().numpy()[0].item(0)}, "
    f"{w3.detach().numpy()[0].item(0)} "
    f"b: {b1.detach().numpy()[0].item(0)}, "
    f"{b2.detach().numpy()[0].item(0)}, "
    f"{b3.detach().numpy()[0].item(0)}"
)
labelshat = net(features, w1, b1, w2, b2, w3, b3)

fig, ax = plt.subplots()
ax.scatter(features.detach().numpy(),
           labels.detach().numpy(),
           label="f(x)")
ax.scatter(features.detach().numpy(),
           labelshat.detach().numpy(),
           label="F(x)",
           marker="*")
ax.grid()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
plt.show()

produced the parameter vector

$$ \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\b_1 \\b_2 \\b_3\end{bmatrix}=\begin{bmatrix} 1.5201 \\ 1.3763 \\ 0.9561 \\0.7552 \\4.8357\times10^{-7} \\0.0020\end{bmatrix} $$

and a plot showing that the neural network, $F$, did a good job of learning $f$.

Approximating a piecewise linear function with 3 CPLs
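As a quick sanity check, we can plug the learned scalars back into the closed form of $F$ by hand; the constants below are copied from the output above, so this is just arithmetic, not part of the training script.

# Learned parameters copied from the run above
w1, w2, w3 = 1.5201, 1.3763, 0.9561
b1, b2, b3 = 0.7552, 4.8357e-7, 0.0020

# For x >= -b1/w1 both ReLUs are active, so F collapses to one linear piece:
#   F(x) = (w1*w2*w3) * x + (w2*w3*b1 + w3*b2 + b3)
print(w1 * w2 * w3)                  # ~2.00, the slope of f for x >= 0
print(w2 * w3 * b1 + w3 * b2 + b3)   # ~1.00, the intercept of f for x >= 0

# For x < -b1/w1 the inner ReLU outputs 0, so F is the constant w3*b2 + b3
print(w3 * b2 + b3)                  # ~0.002, close to f(x) = 0 for x < 0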

Notice that $w_1\cdot w_2\cdot w_3\approx2$ and that the effective intercept for $x\geq 0$, $w_2w_3b_1+w_3b_2+b_3\approx1$, which makes sense since $f(x)=2x+1$ there. Training is very sensitive to the learning rate, the initial guess of the parameter vector, the range of values the features cover (in this case $-20\leq x<20$), and the number of epochs. Play around with these values to convince yourself of how critical they are for getting the training loop to converge. Here are just a few things you might notice:

  1. If you multiply the slope of the line by 10, you should divide the learning rate by 10 to help gradient descent converge.
  2. Another operation that helps gradient descent converge is to normalize $x$. A common technique is to replace $x$ with $\frac{x-\mu}{\sigma}$, where $\mu$ is the sample mean of $x$ and $\sigma$ is the sample standard deviation. Don’t forget that when you want to use the trained neural network on a new feature, you should normalize it using the statistics computed from the training data, or equivalently fold the normalization into the first layer via $w_1\leftarrow\frac{w_1}{\sigma}$ and $b_1\leftarrow b_1-\frac{w_1\mu}{\sigma}$ (see the sketch after this list).
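Here is a minimal sketch of the normalization trick in item 2. The feature tensor and parameter shapes match the script above, but the training loop itself is elided and the `w1`, `b1` below are stand-ins for the trained values, so treat it as a template rather than a drop-in replacement.

import torch

features = torch.arange(-20., 20., 2.)          # raw features, as in make_data
mu, sigma = features.mean(), features.std()
normalized_features = (features - mu) / sigma   # feed these to the training loop

# ... train on (normalized_features, labels) exactly as before ...
w1 = torch.normal(0., .3, size=(1, 1), requires_grad=True)  # stand-in for the trained w1
b1 = torch.ones(size=(1, 1), requires_grad=True)            # stand-in for the trained b1

# After training, fold the normalization into the first layer so the
# network can be fed raw, unnormalized features directly:
with torch.no_grad():
    b1 -= w1 * mu / sigma   # must be done before rescaling w1
    w1 /= sigma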

Also, notice how limited this architecture is: with only one neuron per layer, each ReLU can introduce at most one new ‘kink’, and the composition stays monotone, so no matter how deep the network gets it can only express very simple piecewise linear functions.
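To see this concretely for the 3-layer scalar network, write out $F$ assuming $w_1,w_2>0$ and $b_2\geq0$, as in the trained values above:

$$ F(x)=w_3\,\mathrm{ReLU}\!\left(w_2\,\mathrm{ReLU}(w_1x+b_1)+b_2\right)+b_3= \begin{cases} w_3b_2+b_3 & x<-\frac{b_1}{w_1} \\ w_1w_2w_3\,x+w_2w_3b_1+w_3b_2+b_3 & x\geq-\frac{b_1}{w_1} \end{cases} $$

Two linear pieces joined at a single kink at $x=-b_1/w_1$: exactly the shape of $f$, but nothing richer.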