How Neural Networks Solve the XOR Problem - Part II

And why hidden layers are so important!


This post is the second and final part of our series discussing three approaches to solving the two-dimensional XOR problem.

In Part I, we talked about:

  • The Perceptron algorithm.
  • The XOR problem.
  • Two attempts to solve it.

As a quick recap, our first attempt, a single-layer perceptron, failed miserably because of an inherent limitation of perceptrons: they can't model non-linearity.

Our second approach, despite being functional, was very specific to the XOR problem. It simply combined multiple single-layer perceptrons so that the resulting architecture ended up modeling an XOR function.


Attempt #3: The Multi-layered Perceptron

The overall components of an MLP, like the input and output nodes, the activation function, and the weights and biases, are the same as those of the perceptron we discussed in Part I.

The biggest difference? An MLP can have hidden layers.

Hidden layers

Hidden layers are the layers of nodes that sit between the input and output layers.

An MLP can have any number of hidden layers; for the XOR problem, a single hidden layer will do.

The hidden layer allows for non-linearity. A node in the hidden layer isn’t too different from an output node; nodes in the previous layers connect to it with their own weights and biases, and an output is computed, generally with an activation function.
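To make that concrete, a single hidden node's output can be computed like this (a minimal sketch with made-up weights and biases, not values from our model):

```python
import numpy as np

# One hidden node: it receives every input, applies its own weights and
# bias, then an activation function -- exactly like an output node would.
x = np.array([1.0, 0.0])   # one XOR input, e.g. (1, 0)
w = np.array([0.5, -0.3])  # this node's weights (arbitrary illustrative values)
b = 0.1                    # this node's bias (arbitrary illustrative value)

z = np.dot(w, x) + b       # weighted sum: 0.5*1 + (-0.3)*0 + 0.1 = 0.6
out = 1 / (1 + np.exp(-z)) # sigmoid activation
```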

[Image: The general structure of a multi-layered perceptron]

Activation Function

Remember the linear activation function we used on the output node of our perceptron model? There are several more complex activation functions. You may have heard of the sigmoid and the tanh functions, which are some of the most popular non-linear activation functions.

Activation functions should be differentiable, so that a network’s parameters can be updated using backpropagation.
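For instance, both the sigmoid and tanh have simple closed-form derivatives, which is exactly what backpropagation needs. The helper names below are illustrative, not part of this article's code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)          # derivative written in terms of sigmoid itself

def dtanh(x):
    return 1 - np.tanh(x) ** 2  # derivative of tanh

# Differentiability check: compare against a central-difference estimate.
x = 0.3
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```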

Training algorithm

Though the output generation process is a direct extension of that of the perceptron, updating weights isn’t as straightforward. This is where backpropagation comes into the picture.

Backpropagation is a way to update the weights and biases of a model starting from the output layer all the way to the beginning. The main principle behind it is that each parameter changes in proportion to how much it affects the network's output. A weight that has barely any effect on the output will change minimally, while one that contributes heavily to the error will change drastically to improve the model's prediction power.

Backpropagation is an algorithm used to update the weights and biases of a model based on their gradients with respect to the error function, starting from the output layer all the way to the first layer.

The weight-update rule follows directly from differentiation and the chain rule.
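As a sketch of what that looks like for a single output weight, take the squared-error loss E = (1/2)(t - y_hat)^2, the pre-activation z = h * w_12 + b_12, and the prediction y_hat = sigmoid(z); the symbols here are shorthand for the quantities in our network, with h being a hidden node's output. The chain rule then gives:

```latex
\frac{\partial E}{\partial w_{12}}
  = \frac{\partial E}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial z}
    \cdot \frac{\partial z}{\partial w_{12}}
  = -(t - \hat{y}) \cdot \hat{y}\,(1 - \hat{y}) \cdot h
```

Note that the implementation below folds the minus sign into its error term, target - output, which is why the weight updates add the learning rate times the gradient rather than subtracting it.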

There’s a lot to cover when talking about backpropagation. It warrants its own article. If you want to find out more, have a look at this excellent article by Simeon Kostadinov.

Understanding Backpropagation Algorithm
Backpropagation algorithm is probably the most fundamental building block in a neural network. It was first introduced in the 1960s and almost 30 years later (1989) popularized by Rumelhart, Hinton and…

The architecture

There are no fixed rules for the number of hidden layers or the number of nodes in each layer of a network. The best-performing models are often found through trial and error.

A network's architecture refers to its general structure—the number of hidden layers, the number of nodes in each layer, and how these nodes are interconnected.

Let’s go with a single hidden layer with two nodes in it. We’ll be using the sigmoid function in each of our hidden layer nodes and, of course, our output node.

[Image: The final architecture of our MLP]

Implementation

The libraries used here, like NumPy and Matplotlib's pyplot (plus seaborn for plot styling), are the same as those used in the Perceptron class.

The training algorithm

The algorithm here is slightly different: we iterate through the training data a fixed number of times, num_epochs to be precise. In each epoch, we do a forward pass followed by a backward pass, in which we update the weights and biases as necessary; that backward pass is the backpropagation step.

def train(self):
    """
    Train an MLP. Runs through the data num_epochs number of times.
    A forward pass is done first, followed by a backward pass (backpropagation)
    where the networks parameter's are updated.
    """
    for _ in range(self.num_epochs):
        self.forward(self.train_data)
        self.update_weights()

The sigmoid activation function

Here, we define a sigmoid function. As discussed, it is applied to the output of each hidden layer node and the output node. It is differentiable, so it allows us to comfortably perform backpropagation to improve our model.

Its derivative is implemented through the _delsigmoid function. Note that it takes the sigmoid's output, not the raw pre-activation value, as its argument.

def _sigmoid(self, x):
    """
    The sigmoid activation function.
    """
    return 1 / (1 + np.exp(-x))

def _delsigmoid(self, x):
    """
    The first derivative of the sigmoid function with respect to its
    input, written in terms of the sigmoid's output: x is assumed to
    already be a sigmoid output, i.e. x = sigmoid(z).
    """
    return x * (1 - x)

The forward and backward pass

In the forward pass, we apply the wX + b transformation for each layer and pass each result through the sigmoid function.

In the backward pass, implemented as the update_weights function, we calculate the gradients of each of our 6 weights and 3 biases with respect to the error function and update them by the factor learning rate * gradient.

Finally, the classify function works as expected. Since the sigmoid outputs values between 0 and 1, we can interpret them as the probability of belonging to a particular class. Outputs greater than or equal to 0.5 are classified as Class 1, while outputs below 0.5 belong to Class 0.

def forward(self, batch):
    """
    A single forward pass through the network.
    Implementation of wX + b
    """

    self.hidden_ = np.dot(batch, self.weights_01) + self.b01
    self.hidden_out = self._sigmoid(self.hidden_)

    self.output_ = np.dot(self.hidden_out, self.weights_12) + self.b12
    self.output_final = self._sigmoid(self.output_)

    return self.output_final

def update_weights(self):
    """
    Update the network's weights and biases through backpropagation.
    """
    # Calculate the squared error
    loss = 0.5 * (self.target - self.output_final) ** 2
    self.losses.append(np.sum(loss))

    error_term = (self.target - self.output_final)

    # the gradient for the hidden layer weights
    grad01 = self.train_data.T @ (((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out))

    # the gradient for the output layer weights
    grad12 = self.hidden_out.T @ (error_term * self._delsigmoid(self.output_final))

    # updating the weights by the learning rate times their gradient
    self.weights_01 += self.lr * grad01
    self.weights_12 += self.lr * grad12

    # update the biases the same way
    self.b01 += np.sum(self.lr * ((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out), axis=0)
    self.b12 += np.sum(self.lr * error_term * self._delsigmoid(self.output_final), axis=0)
    
def classify(self, datapoint):
    """
    Return the class to which a datapoint belongs based on
    the perceptron's output for that point.
    """
    datapoint = np.transpose(datapoint)
    if self.forward(datapoint) >= 0.5:
        return 1

    return 0

The MLP class

Let’s bring everything together by creating an MLP class. All the functions we just discussed are placed in it. The plot function is exactly the same as the one in the Perceptron class.

class MLP:
    """
    Create a multi-layer perceptron.
    train_data: A 4x2 matrix with the input data.
    target: A 4x1 matrix with expected outputs
    lr: the learning rate. Defaults to 0.1
    num_epochs: the number of times the training data goes through the model
        while training
    num_input: the number of nodes in the input layer of the MLP.
        Should be equal to the second dimension of train_data.
    
    num_hidden: the number of nodes in the hidden layer of the MLP.
    num_output: the number of nodes in the output layer of the MLP.
        Should be equal to the second dimension of target.
    """
    def __init__(self, train_data, target, lr=0.1, num_epochs=100, num_input=2, num_hidden=2, num_output=1):
        self.train_data = train_data
        self.target = target
        self.lr = lr
        self.num_epochs = num_epochs

        # initialize both sets of weights and biases randomly
        # - weights_01: weights between input and hidden layer
        # - weights_12: weights between hidden and output layer
        self.weights_01 = np.random.uniform(size=(num_input, num_hidden))
        self.weights_12 = np.random.uniform(size=(num_hidden, num_output))

        # - b01: biases for the hidden layer
        # - b12: bias for the output layer
        self.b01 = np.random.uniform(size=(1, num_hidden))
        self.b12 = np.random.uniform(size=(1, num_output))

        self.losses = []

    def update_weights(self):
        """
        Update the network's weights and biases through backpropagation.
        """
        # Calculate the squared error
        loss = 0.5 * (self.target - self.output_final) ** 2
        self.losses.append(np.sum(loss))

        error_term = (self.target - self.output_final)

        # the gradient for the hidden layer weights
        grad01 = self.train_data.T @ (((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out))

        # the gradient for the output layer weights
        grad12 = self.hidden_out.T @ (error_term * self._delsigmoid(self.output_final))

        # updating the weights by the learning rate times their gradient
        self.weights_01 += self.lr * grad01
        self.weights_12 += self.lr * grad12

        # update the biases the same way
        self.b01 += np.sum(self.lr * ((error_term * self._delsigmoid(self.output_final)) * self.weights_12.T) * self._delsigmoid(self.hidden_out), axis=0)
        self.b12 += np.sum(self.lr * error_term * self._delsigmoid(self.output_final), axis=0)

    def _sigmoid(self, x):
        """
        The sigmoid activation function.
        """
        return 1 / (1 + np.exp(-x))

    def _delsigmoid(self, x):
        """
        The first derivative of the sigmoid function with respect to its
        input, written in terms of the sigmoid's output: x is assumed to
        already be a sigmoid output, i.e. x = sigmoid(z).
        """
        return x * (1 - x)

    def forward(self, batch):
        """
        A single forward pass through the network.
        Implementation of wX + b
        """

        self.hidden_ = np.dot(batch, self.weights_01) + self.b01
        self.hidden_out = self._sigmoid(self.hidden_)

        self.output_ = np.dot(self.hidden_out, self.weights_12) + self.b12
        self.output_final = self._sigmoid(self.output_)

        return self.output_final

    def classify(self, datapoint):
        """
        Return the class to which a datapoint belongs based on
        the perceptron's output for that point.
        """
        datapoint = np.transpose(datapoint)
        if self.forward(datapoint) >= 0.5:
            return 1

        return 0

    def plot(self, h=0.01):
        """
        Generate plot of input data and decision boundary.
        """
        # setting plot properties like size, theme and axis limits
        sns.set_style('darkgrid')
        plt.figure(figsize=(20, 20))

        plt.axis('scaled')
        plt.xlim(-0.1, 1.1)
        plt.ylim(-0.1, 1.1)

        colors = {
            0: "ro",
            1: "go"
        }

        # plotting the four datapoints
        for i in range(len(self.train_data)):
            plt.plot([self.train_data[i][0]],
                     [self.train_data[i][1]],
                     colors[self.target[i][0]],
                     markersize=20)

        x_range = np.arange(-0.1, 1.1, h)
        y_range = np.arange(-0.1, 1.1, h)

        # creating a mesh to plot the decision boundary; with meshgrid's
        # default 'xy' indexing, rows of the mesh follow y_range and columns
        # follow x_range, matching the nested comprehension that builds Z
        xx, yy = np.meshgrid(x_range, y_range)
        Z = np.array([[self.classify([x, y]) for x in x_range] for y in y_range])

        # using the contourf function to create the plot
        plt.contourf(xx, yy, Z, colors=['red', 'green', 'green', 'blue'], alpha=0.4)

    def train(self):
        """
        Train an MLP. Runs through the data num_epochs number of times.
        A forward pass is done first, followed by a backward pass (backpropagation)
        where the networks parameter's are updated.
        """
        for _ in range(self.num_epochs):
            self.forward(self.train_data)
            self.update_weights()

Results

Let’s train our MLP with a learning rate of 0.2 over 5000 epochs.

mlp = MLP(train_data, target_xor, 0.2, 5000)
mlp.train()
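The snippet above assumes train_data and target_xor have already been defined. For XOR they would look something like this (a sketch matching the 4x2 and 4x1 shapes the class expects):

```python
import numpy as np

# The four XOR input pairs (4x2) and their labels (4x1),
# matching the shapes the MLP class expects.
train_data = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])

target_xor = np.array([
    [0],
    [1],
    [1],
    [0],
])
```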

If we plot the values of our loss function over the 5000 epochs, we get the following curve, showing that our model has indeed converged.

[Image: The loss plot over 5000 epochs of our MLP]

Our generalized neural network, the MLP, produces a clear non-linear decision boundary.

[Image: The decision boundary plot, showing the decision boundary and the classes]

Note #1: Adding more layers or nodes

Adding more layers or nodes gives increasingly complex decision boundaries. However, this can also lead to overfitting, where a model achieves very high accuracy on the training data but fails to generalize to unseen data.

A useful resource is the Tensorflow Neural Net playground, where you can try out different network architectures and view the results.

Tensorflow — Neural Network Playground
Tinker with a real neural network right here in your browser.

Note #2: Choosing a loss function

The loss function we used in our MLP model is the mean squared error (MSE) loss. Although this is a very popular loss function, it makes some assumptions about the data (such as it being Gaussian) and isn't always convex for classification problems. It was used here to keep the model easier to understand, but there are better alternatives for classification tasks, like the binary cross-entropy loss.
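As a minimal illustration of that alternative (not part of the model above), binary cross-entropy for a single prediction could be computed like this:

```python
import numpy as np

def binary_cross_entropy(target, prediction, eps=1e-12):
    # Clip to avoid log(0); eps is an arbitrary small constant.
    p = np.clip(prediction, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

# A confident correct prediction gives a small loss...
low = binary_cross_entropy(1.0, 0.99)
# ...while a confident wrong one is penalized heavily.
high = binary_cross_entropy(1.0, 0.01)
```

Unlike MSE, this loss grows without bound as a confident prediction moves toward the wrong class, which gives classification models a much stronger training signal.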

How to Choose Loss Functions When Training Deep Learning Neural Networks
Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm. As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss functi…

Conclusion

In this series, we discussed three different attempts at solving the XOR problem, ultimately arriving at a generic solution with the MLP, which could be used to solve other, more complex non-linear problems.

Neural nets used in production or research are never this simple, but they almost always build on the basics outlined here. Hopefully, this two-post series gave you some idea of how to build and train perceptrons and vanilla neural networks.

Thanks for reading!


I'm a CS Undergrad at BITS Pilani, currently in my 4th year. I write infrequently on several topics, from technical concepts to the latest customizations on my laptop, on my blog. Some of my more interesting posts are also published on Medium.
