This post is the second and final part of our series discussing three approaches to solving the two-dimensional XOR problem.

In part- I, we talked about:

- The Perceptron algorithm.
- The XOR problem.
- Two attempts to solve it.

**As a quick recap**, our first attempt of using a single-layer perceptron failed miserably due to an inherent issue in perceptrons—they can't model non-linearity.

**Our second approach**, despite being functional, was very specific to the XOR problem. It simply combined multiple single-layer perceptrons so that the resulting architecture ended up modeling an XOR function.

## Attempt #3: The Multi-layered Perceptron

The overall components of an MLP, like input and output nodes, the activation function, and weights and biases, are the same as those we just discussed in a perceptron.

## The biggest difference? An MLP can have hidden layers.

### Hidden layers

Hidden layers are those layers with nodes other than the input and output nodes.

## An MLP is generally restricted to having a single hidden layer.

** The hidden layer allows for non-linearity.** A node in the hidden layer isn’t too different from an output node; nodes in the previous layers connect to it with their own weights and biases, and an output is computed, generally with an activation function.

### Activation Function

Remember the linear activation function we used on the output node of our perceptron model? There are several more complex activation functions. You may have heard of the `sigmoid`

and the `tanh`

functions, which are some of the most popular non-linear activation functions.

## Activation functions should be differentiable, so that a network’s parameters can be updated using backpropagation.

### Training algorithm

Though the output generation process is a direct extension of that of the perceptron, updating weights isn’t as straightforward. This is where backpropagation comes into the picture.

** Backpropagation** is a way to update the weights and biases of a model starting from the output layer all the way to the beginning. The main principle behind it is that each parameter changes in proportion to how much it affects the network’s output. A weight that has barely any effect on the model's output will show a minimal change, while one having a large negative impact will change drastically to improve the model’s prediction power.

## Backpropagation is an algorithm used to update the weights and biases of a model based on their gradients with respect to the error function, starting from the output layer all the way to the first layer.

The method of updating weights directly follows from derivation and the chain rule.

There’s a lot to cover when talking about backpropagation. It warrants its own article. If you want to find out more, have a look at this excellent article by Simeon Kostadinov.

### The architecture

There are no fixed rules on the number of hidden layers or the number of nodes in each layer of a network. The best performing models are obtained through trial and error.

## A network's architecture refers to its general structure—the number of hidden layers, the number of nodes in each layer, and how these nodes are interconnected.

** Let’s go with a single hidden layer with two nodes in it.** We’ll be using the sigmoid function in each of our hidden layer nodes and, of course, our output node.

### Implementation

The libraries used here, like NumPy and pyplot, are the same as those used in the Perceptron class.

**The training algorithm**

The algorithm here is slightly different; we iterate through the training data a fixed number of times, or "`num_epochs`

" to be precise. In each iteration, we do a forward pass followed by a backward pass where we update the weights and biases as necessary. This is called backpropagation.

```
def train(self):
"""
Train an MLP. Runs through the data num_epochs number of times.
A forward pass is done first, followed by a backward pass (backpropagation)
where the networks parameter's are updated.
"""
for _ in range(self.num_epochs):
self.forward(self.train_data)
self.update_weights()
```

**The sigmoid activation function**

Here, we define a sigmoid function. As discussed, it is applied to the output of each hidden layer node and the output node. It is differentiable, so it allows us to comfortably perform backpropagation to improve our model.

Its derivate is also implemented through the `_delsigmoid`

function.

```
def _sigmoid(self, x):
"""
The sigmoid activation function.
"""
return 1 / (1 + np.exp(-x))
def _delsigmoid(self, x):
"""
The first derivative of the sigmoid function wrt x
"""
return x * (1 - x)
```

**The forward and backward pass**

In the forward pass, we apply the `wX + b`

relation multiple times and execute a sigmoid function after each call.

In the backward pass, implemented as the `update_weights`

function, we calculate the gradients of each of our 6 weights and 3 biases with respect to the error function and update them by the factor `learning rate * gradient.`

Finally, the classify function works as expected. Since a sigmoid function outputs values between 0 and 1, we simply interpret them as probabilities of belonging to a particular class. Hence, outputs greater than or equal to 0.5 are classified as belonging to `Class 1`

, while outputs that are less than 0.5 are said to belong to `Class 0`

.

```
def forward(self, batch):
"""
A single forward pass through the network.
Implementation of wX + b
"""
self.hidden_ = np.dot(batch, self.weights_01) + self.b01
self.hidden_out = self._sigmoid(self.hidden_)
self.output_ = np.dot(self.hidden_out, self.weights_12) + self.b12
self.output_final = self._sigmoid(self.output_)
return self.output_final
def update_weights(self):
# Calculate the squared error
loss = 0.5 * (self.target - self.output_final) ** 2
print(loss)
self.losses.append(np.sum(loss))
error_term = (self.target - self.output_final)
# the gradient for the hidden layer weights
grad01 = self.train_data.T @ (((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out))
print("grad01: ", grad01)
print(grad01.shape)
# the gradient for the output layer weights
grad12 = self.hidden_out.T @ (error_term * self._delsigmoid(self.output_final))
print("grad12: ", grad12)
print(grad12.shape)
# updating the weights by the learning rate times their gradient
self.weights_01 += self.lr * grad01
self.weights_12 += self.lr * grad12
# update the biases the same way
self.b01 += np.sum(self.lr * ((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out), axis=0)
self.b12 += np.sum(self.lr * error_term * self._delsigmoid(self.output_final), axis=0)
def classify(self, datapoint):
"""
Return the class to which a datapoint belongs based on
the perceptron's output for that point.
"""
datapoint = np.transpose(datapoint)
if self.forward(datapoint) >= 0.5:
return 1
return 0
```

**The MLP class**

Let’s bring everything together by creating an MLP class. All the functions we just discussed are placed in it. The `plot`

function is exactly the same as the one in the `Perceptron`

class.

```
class MLP:
"""
Create a multi-layer perceptron.
train_data: A 4x2 matrix with the input data.
target: A 4x1 matrix with expected outputs
lr: the learning rate. Defaults to 0.1
num_epochs: the number of times the training data goes through the model
while training
num_input: the number of nodes in the input layer of the MLP.
Should be equal to the second dimension of train_data.
num_hidden: the number of nodes in the hidden layer of the MLP.
num_output: the number of nodes in the output layer of the MLP.
Should be equal to the second dimension of target.
"""
def __init__(self, train_data, target, lr=0.1, num_epochs=100, num_input=2, num_hidden=2, num_output=1):
self.train_data = train_data
self.target = target
self.lr = lr
self.num_epochs = num_epochs
# initialize both sets of weights and biases randomly
# - weights_01: weights between input and hidden layer
# - weights_12: weights between hidden and output layer
self.weights_01 = np.random.uniform(size=(num_input, num_hidden))
self.weights_12 = np.random.uniform(size=(num_hidden, num_output))
# - b01: biases for the hidden layer
# - b12: bias for the output layer
self.b01 = np.random.uniform(size=(1,num_hidden))
self.b12 = np.random.uniform(size=(1,num_output))
self.losses = []
def update_weights(self):
# Calculate the squared error
loss = 0.5 * (self.target - self.output_final) ** 2
print(loss)
self.losses.append(np.sum(loss))
error_term = (self.target - self.output_final)
# the gradient for the hidden layer weights
grad01 = self.train_data.T @ (((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out))
print("grad01: ", grad01)
print(grad01.shape)
# the gradient for the output layer weights
grad12 = self.hidden_out.T @ (error_term * self._delsigmoid(self.output_final))
print("grad12: ", grad12)
print(grad12.shape)
# updating the weights by the learning rate times their gradient
self.weights_01 += self.lr * grad01
self.weights_12 += self.lr * grad12
# update the biases the same way
self.b01 += np.sum(self.lr * ((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out), axis=0)
self.b12 += np.sum(self.lr * error_term * self._delsigmoid(self.output_final), axis=0)
def _sigmoid(self, x):
"""
The sigmoid activation function.
"""
return 1 / (1 + np.exp(-x))
def _delsigmoid(self, x):
"""
The first derivative of the sigmoid function wrt x
"""
return x * (1 - x)
def forward(self, batch):
"""
A single forward pass through the network.
Implementation of wX + b
"""
self.hidden_ = np.dot(batch, self.weights_01) + self.b01
self.hidden_out = self._sigmoid(self.hidden_)
self.output_ = np.dot(self.hidden_out, self.weights_12) + self.b12
self.output_final = self._sigmoid(self.output_)
return self.output_final
def classify(self, datapoint):
"""
Return the class to which a datapoint belongs based on
the perceptron's output for that point.
"""
datapoint = np.transpose(datapoint)
if self.forward(datapoint) >= 0.5:
return 1
return 0
def plot(self, h=0.01):
"""
Generate plot of input data and decision boundary.
"""
# setting plot properties like size, theme and axis limits
sns.set_style('darkgrid')
plt.figure(figsize=(20, 20))
plt.axis('scaled')
plt.xlim(-0.1, 1.1)
plt.ylim(-0.1, 1.1)
colors = {
0: "ro",
1: "go"
}
# plotting the four datapoints
for i in range(len(self.train_data)):
plt.plot([self.train_data[i][0]],
[self.train_data[i][1]],
colors[self.target[i][0]],
markersize=20)
x_range = np.arange(-0.1, 1.1, h)
y_range = np.arange(-0.1, 1.1, h)
# creating a mesh to plot decision boundary
xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
Z = np.array([[self.classify([x, y]) for x in x_range] for y in y_range])
# using the contourf function to create the plot
plt.contourf(xx, yy, Z, colors=['red', 'green', 'green', 'blue'], alpha=0.4)
def train(self):
"""
Train an MLP. Runs through the data num_epochs number of times.
A forward pass is done first, followed by a backward pass (backpropagation)
where the networks parameter's are updated.
"""
for _ in range(self.num_epochs):
self.forward(self.train_data)
self.update_weights()
```

### Results

Let’s train our MLP with a learning rate of `0.2`

over `5000`

epochs.

```
mlp = MLP(train_data, target_xor, 0.2, 5000)
mlp.train()
```

If we plot the values of our loss function, we get the following plot after about 5000 iterations, showing that our model has indeed converged.

A clear non-linear decision boundary is created here with our generalized neural network, or MLP.

### Note #1: Adding more layers or nodes

Adding more layers or nodes gives increasingly complex decision boundaries. However, this could also lead to **overfitting**—where a model achieves very high accuracy on the training data but fails to generalize.

A useful resource is the Tensorflow Neural Net playground, where you can try out different network architectures and view the results.

### Note #2: Choosing a loss function

The loss function we used in our MLP model is the Mean Squared loss function. Although this is a very popular loss function, it makes some assumptions about the data (such as it being gaussian) and isn’t always convex when it comes to a classification problem. It was used here to make it easier to understand how a perceptron works, but there are better alternatives for classification tasks, like **binary cross-entropy loss.**

## Conclusion

In this series, we discussed three different attempts at solving the XOR problem, ultimately arriving at a generic solution with the MLP, which could be used to solve other, more complex non-linear problems.

Neural nets used in production or research are never this simple but they almost always build on the basics outlined here. Hopefully, this two-post series gave you some idea on how to build and train perceptrons and vanilla networks.

Thanks for reading!

I'm a CS Undergrad at BITS Pilani, currently in my 4th year. I write infrequently on several topics, from technical concepts to the latest customizations on my laptop, on my blog. Some of my more interesting posts are also published on Medium.