The perceptron is a classification algorithm. Specifically, it works as a linear binary classifier. It was invented in the late 1950s by Frank Rosenblatt.

The perceptron works as a threshold function: non-negative outputs are assigned to one class, while negative outputs are assigned to the other.

Though there’s a lot to talk about when it comes to neural networks and their variants, we’ll be discussing a specific problem that highlights the major differences between a single layer perceptron and one that has a few more layers.

We'll discuss this across two posts. In this post, we'll talk about the Perceptron Algorithm and two attempts at solving the XOR problem.

In the second and final post, we'll make our third (and final) attempt using a generalized solution.


Structure and Properties

A perceptron has the following components:

  • Input nodes
  • Output node
  • An activation function
  • Weights and biases
  • Error function

[Figure: A representation of a single-layer perceptron with 2 input nodes]

Input Nodes

These nodes contain the input to the network. In any iteration—whether testing or training—the input from our data is passed to these nodes.

Weights and Biases

These parameters are what we update when we talk about “training” a model. They are initialized to some random value or set to 0 and updated as the training progresses. The bias is analogous to a weight that is independent of any input node. In effect, it makes the model more flexible, since it lets you shift the decision boundary away from the origin.

Evaluation

The output calculation is straightforward.

  • Compute the dot product of the input and weight vector.
  • Add the bias.
  • Apply the activation function.

This can be expressed like so:

output = f(w_1x_1 + w_2x_2 + ... + w_nx_n + b)

This is often simplified and written as a dot product of the weight and input vectors plus the bias.

output = f(wX + b)
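
As a quick sketch (with made-up weights and bias, just to make the calculation concrete), this evaluation in numpy looks like this:

import numpy as np

# hypothetical values for a perceptron with 2 input nodes
x = np.array([1.0, 0.0])   # input vector
w = np.array([0.4, -0.2])  # weights
b = 0.1                    # bias

z = np.dot(w, x) + b       # wX + b = 0.4*1.0 + (-0.2)*0.0 + 0.1 = 0.5
output = z                 # linear activation: the value passes through unchanged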

Activation Function

This function lets us map the raw output to something easier to interpret. For example, in a simple classifier, an output of say -2.5 or 8 doesn’t mean much when it comes to classification. If we use a sigmoidal activation function, we can squash that value into the range 0 to 1, which can be interpreted directly as the probability of a data point belonging to a particular class.

Though there are many activation functions, we will use a simple linear activation function for our perceptron. The linear activation function has no effect on its input and outputs it as is.
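
As a small illustration of the difference (the sigmoid here is only for comparison; our perceptron sticks with the linear activation):

import numpy as np

def linear(z):
    # the linear activation: returns its input unchanged
    return z

def sigmoid(z):
    # squashes any real value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

for z in (-2.5, 8):
    print(z, linear(z), round(sigmoid(z), 4))
# -2.5 -2.5 0.0759
# 8 8 0.9997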

Classification

How does a perceptron assign a class to a data point?

We know that a data point’s evaluation is given by the expression wX + b. We define a threshold (θ) against which this value is compared to classify our data. For a perceptron, this threshold is generally set to 0.

So, points for which wX + b is greater than or equal to 0 will belong to one class while the rest (wX + b is negative) are classified as belonging to the other class. We can express this as:

class = 1, if wX + b >= 0
class = 0, if wX + b < 0
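
In code, that rule is a single comparison (again with made-up weights and bias for illustration):

import numpy as np

w = np.array([0.4, -0.2])  # hypothetical weights
b = 0.1                    # hypothetical bias

def classify(x):
    # class 1 if wX + b is non-negative, class 0 otherwise
    return 1 if np.dot(w, x) + b >= 0 else 0

print(classify([1, 0]))  # wX + b = 0.5  -> class 1
print(classify([0, 1]))  # wX + b = -0.1 -> class 0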


Training algorithm

To train our perceptron, we must ensure that we correctly classify all of our training data. Note that this is different from how you would train a neural network, where you wouldn’t try to correctly classify the entire training set, since that would, in most cases, lead to overfitting.

We start the training algorithm by calculating the gradient, or Δw. It's the product of:

  • The value of the input node corresponding to that weight.
  • The difference between the actual value and the computed value.

Δw_i = x_i * (actual_value - computed_value)

We get our new weights by adding the computed gradients, scaled by the learning rate, to the original weights.

w_i = w_i + learning_rate * Δw_i

A simple intuition for how this works: if our perceptron correctly classifies an input data point, actual_value - computed_value would be 0, and there wouldn’t be any change in our weights, since the gradient is 0.
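
A quick worked example of a single update on a misclassified point, using made-up numbers:

import numpy as np

lr = 0.01                  # learning rate
w = np.array([0.4, -0.2])  # current weights
b = 0.1                    # current bias

x = np.array([0, 1])       # input point
actual = 1                 # actual value (target class)
computed = 0               # computed value (wX + b = -0.1, so the perceptron output class 0)

delta_w = x * (actual - computed)     # [0, 1]
w = w + lr * delta_w                  # [0.4, -0.19]

# the bias behaves like a weight on a constant input of 1
b = b + lr * 1 * (actual - computed)  # 0.11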


The 2D XOR problem

In the XOR problem, we are trying to train a model to mimic a 2D XOR function.

The XOR function

The function is defined like so:

x_1   x_2   x_1 ⊕ x_2
 0     0        0
 0     1        1
 1     0        1
 1     1        0
The XOR truth table


If we plot it, we get the following chart. This is what we’re trying to classify. The ⊕ (“o-plus”) symbol you see in the legend is conventionally used to represent the XOR Boolean operator.

[Figure: The XOR output plot]

Our algorithm—regardless of how it works—must correctly output the XOR value for each of the 4 points. We’ll be modelling this as a classification problem, so Class 1 would represent an XOR value of 1, while Class 0 would represent a value of 0.


Attempt #1: The Single Layer Perceptron

Let's model the problem using a single layer perceptron.

Input data

The table we saw for the XOR function will be used as the data to train our model.

Data         Target
[0, 0]         0
[0, 1]         1
[1, 0]         1
[1, 1]         0

Implementation

Imports

Apart from the usual visualization (matplotlib and seaborn) and numerical (numpy) libraries, we’ll use cycle from itertools. We use it because our algorithm cycles through the data indefinitely until it manages to correctly classify the entire training set without any mistakes in between.

from itertools import cycle
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

The data

We next create our training data. This data is the same for each kind of logic gate since they all take in two Boolean variables as input.

train_data = np.array(
    [
        [0, 0],
        [0, 1],
        [1, 0],
        [1, 1]])

target_xor = np.array(
    [
        [0],
        [1],
        [1],
        [0]])

target_nand = np.array(
    [
        [1],
        [1],
        [1],
        [0]])

target_or = np.array(
    [
        [0],
        [1],
        [1],
        [1]])

target_and = np.array(
    [
        [0],
        [0],
        [0],
        [1]])

The training function

Here, we cycle through the data indefinitely, keeping track of how many consecutive data points we correctly classified. If we manage to classify everything in one stretch, we terminate our algorithm.

If not, we reset our counter, update our weights, and continue the algorithm.

def train(self):
    """
    Train a single layer perceptron.
    """
    # the number of consecutive correct classifications
    correct_counter = 0

    for train, target in cycle(zip(self.train_data, self.target)):
        # end if all points are correctly classified
        if correct_counter == len(self.train_data):
            break

        output = self.classify(train)
        self.node_val = train

        if output == target:
            correct_counter += 1
        else:
            # if incorrectly classified, update weights and reset correct_counter
            self.update_weights(target, output)
            correct_counter = 0

The rest of the utility functions, including those that update our model's weights and calculate gradients, can be implemented like this.

def _gradient(self, node, exp, output):
    """
    Return the gradient for a weight.
    This is the value of delta-w.
    """
    return node * (exp - output)

def update_weights(self, exp, output):
    """
    Update weights and bias based on their respective gradients
    """
    for i in range(self.input_nodes):
        self.w[i] += self.lr * self._gradient(self.node_val[i], exp, output)

    # the value of the bias node can be considered as being 1 and the weight between this node
    # and the output node being self.b
    self.b += self.lr * self._gradient(1, exp, output)

def forward(self, datapoint):
    """
    One forward pass through the perceptron.
    Implementation of "wX + b".
    """
    return self.b + np.dot(self.w, datapoint)

def classify(self, datapoint):
    """
    Return the class to which a datapoint belongs based on
    the perceptron's output for that point.
    """
    if self.forward(datapoint) >= 0:
        return 1

    return 0

To visualize how our model performs, we create a mesh of data points, or a grid, and evaluate our model at each point in that grid. Finally, we colour each point based on how our model classifies it. Thus, the Class 0 region would be filled with the colour assigned to points belonging to that class.

def plot(self, h=0.01):
    """
    Generate plot of input data and decision boundary.
    """
    # setting plot properties like size, theme and axis limits
    sns.set_style('darkgrid')
    plt.figure(figsize=(20, 20))

    plt.axis('scaled')
    plt.xlim(-0.1, 1.1)
    plt.ylim(-0.1, 1.1)

    colors = {
        0: "ro",
        1: "go"
    }

    # plotting the four datapoints
    for i in range(len(self.train_data)):
        plt.plot([self.train_data[i][0]],
                 [self.train_data[i][1]],
                 colors[self.target[i][0]],
                 markersize=20)

    x_range = np.arange(-0.1, 1.1, h)
    y_range = np.arange(-0.1, 1.1, h)

    # creating a mesh to plot decision boundary
    xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
    Z = np.array([[self.classify([x, y]) for y in y_range] for x in x_range])

    # using the contourf function to create the plot
    plt.contourf(xx, yy, Z, colors=['red', 'green', 'green', 'blue'], alpha=0.4)

The Perceptron class

To bring everything together, we create a simple Perceptron class with the functions we just discussed. We have some instance variables like the training data, the target, the number of input nodes, and the learning rate.


class Perceptron:
    """
    Create a perceptron.
    train_data: A 4x2 matrix with the input data.
    target: A 4x1 matrix with the perceptron's expected outputs
    lr: the learning rate. Defaults to 0.01
    input_nodes: the number of nodes in the input layer of the perceptron.
        Should be equal to the second dimension of train_data.
    """

    def __init__(self, train_data, target, lr=0.01, input_nodes=2):
        self.train_data = train_data
        self.target = target
        self.lr = lr
        self.input_nodes = input_nodes

        # randomly initialize the weights and set the bias to -1.
        self.w = np.random.uniform(size=self.input_nodes)
        self.b = -1

        # node_val holds the value of each input node at a given point in time.
        self.node_val = np.zeros(self.input_nodes)

    def _gradient(self, node, exp, output):
        """
        Return the gradient for a weight.
        This is the value of delta-w.
        """
        return node * (exp - output)

    def update_weights(self, exp, output):
        """
        Update weights and bias based on their respective gradients
        """
        for i in range(self.input_nodes):
            self.w[i] += self.lr * self._gradient(self.node_val[i], exp, output)

        # the value of the bias node can be considered as being 1 and the weight between this node
        # and the output node being self.b
        self.b += self.lr * self._gradient(1, exp, output)

    def forward(self, datapoint):
        """
        One forward pass through the perceptron.
        Implementation of "wX + b".
        """
        return self.b + np.dot(self.w, datapoint)

    def classify(self, datapoint):
        """
        Return the class to which a datapoint belongs based on
        the perceptron's output for that point.
        """
        if self.forward(datapoint) >= 0:
            return 1

        return 0

    def plot(self, h=0.01):
        """
        Generate plot of input data and decision boundary.
        """
        # setting plot properties like size, theme and axis limits
        sns.set_style('darkgrid')
        plt.figure(figsize=(20, 20))

        plt.axis('scaled')
        plt.xlim(-0.1, 1.1)
        plt.ylim(-0.1, 1.1)

        colors = {
            0: "ro",
            1: "go"
        }

        # plotting the four datapoints
        for i in range(len(self.train_data)):
            plt.plot([self.train_data[i][0]],
                     [self.train_data[i][1]],
                     colors[self.target[i][0]],
                     markersize=20)

        x_range = np.arange(-0.1, 1.1, h)
        y_range = np.arange(-0.1, 1.1, h)

        # creating a mesh to plot decision boundary
        xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
        Z = np.array([[self.classify([x, y]) for y in y_range] for x in x_range])

        # using the contourf function to create the plot
        plt.contourf(xx, yy, Z, colors=['red', 'green', 'green', 'blue'], alpha=0.4)


    def train(self):
        """
        Train a single layer perceptron.
        """
        # the number of consecutive correct classifications
        correct_counter = 0

        for train, target in cycle(zip(self.train_data, self.target)):
            # end if all points are correctly classified
            if correct_counter == len(self.train_data):
                break

            output = self.classify(train)
            self.node_val = train

            if output == target:
                correct_counter += 1
            else:
                # if incorrectly classified, update weights and reset correct_counter
                self.update_weights(target, output)
                correct_counter = 0
        

Results

Let’s create a perceptron object and train it on the XOR data.

p_xor = Perceptron(train_data, target_xor)
p_xor.train()

You’ll notice that the training loop never terminates, since a perceptron can only converge on linearly separable data. Put simply, data is linearly separable if the two classes can be separated by a point in 1D, a line in 2D, a plane in 3D, and so on.

A perceptron can only converge on linearly separable data. Therefore, it isn’t capable of imitating the XOR function.

Remember that a perceptron must correctly classify the entire training data in one go. If we keep track of how many points it correctly classified in succession, we get something like this.

[Figure: The value of correct_counter over 100 cycles of training]

The algorithm only terminates when correct_counter hits 4—which is the size of the training set—so this will go on indefinitely.
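
If you want to observe this behaviour without leaving the loop running forever, one option is a bounded variant of the training loop (a sketch; train_bounded is not part of the class above) that gives up after a fixed number of iterations:

def train_bounded(p, max_iter=1000):
    """
    Train perceptron p, giving up after max_iter iterations.
    Returns True if the whole dataset was classified correctly in one stretch.
    """
    correct_counter = 0
    for i, (train, target) in enumerate(cycle(zip(p.train_data, p.target))):
        if correct_counter == len(p.train_data):
            return True   # converged
        if i >= max_iter:
            return False  # gave up

        output = p.classify(train)
        p.node_val = train

        if output == target:
            correct_counter += 1
        else:
            p.update_weights(target, output)
            correct_counter = 0

print(train_bounded(Perceptron(train_data, target_or)))   # True: OR is linearly separable
print(train_bounded(Perceptron(train_data, target_xor)))  # False: never converges on XOR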

The Need for Non-Linearity

It is clear that a single perceptron will not serve our purpose; the classes aren’t linearly separable. This boils down to the fact that a single linear decision boundary isn’t going to work.

Non-linearity allows for more complex decision boundaries. One potential decision boundary for our XOR data could look like this:

[Figure: A potential non-linear decision boundary for our XOR model]

Attempt #2: Combining Perceptrons

We know that imitating the XOR function would require a non-linear decision boundary.

However, why do we have to stick with a single decision boundary?

The Intuition

Let’s first break down the XOR function into its AND and OR counterparts.

The XOR function on two Boolean variables A and B is defined as:

A ⊕ B = A.~B + ~A.B

Let’s add A.~A and B.~B to the equation. Since they both equate to 0, the equation remains valid.

A ⊕ B = A.~B + ~A.B + A.~A + B.~B

Let’s rearrange the terms so that we can pull out A from the first part and B from the second.

A ⊕ B = A.(~A + ~B) + B.(~A + ~B)

Simplifying it further, we get:

A ⊕ B = (A + B).(~A + ~B)

Using De Morgan’s law for Boolean algebra, ~A + ~B = ~(A.B), we can replace the second term in the above equation as shown:

A ⊕ B = (A + B).~(A.B)

Let’s replace A and B with x_1 and x_2 respectively since that’s the convention we’re using in our data.

x_1 ⊕ x_2 = (x_1 + x_2).~(x_1.x_2)

The XOR function can thus be condensed into two parts: an OR and a NAND. If we can calculate these separately, we can simply combine their results using an AND gate.

Let’s call the OR section of the formula part I and the NAND section part II.
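
Before building any perceptrons, we can sanity-check this decomposition with a few lines of Python:

# verify that XOR(a, b) == AND(OR(a, b), NAND(a, b)) for all four inputs
for a in (0, 1):
    for b in (0, 1):
        nand = 1 - (a & b)
        assert (a ^ b) == ((a | b) & nand)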

Modelling the OR part

We’ll use the same Perceptron class as before, only that we’ll train it on OR training data.

x_1   x_2   x_1 + x_2
 0     0        0
 0     1        1
 1     0        1
 1     1        1
The OR truth table

p_or = Perceptron(train_data, target_or)
p_or.train()

This converges, since the data for the OR function is linearly separable. If we plot the number of accurately classified consecutive data points as we did in our first attempt, we get this plot. It’s clear that around iteration 50, it hits the value 4, meaning that it classified the entire dataset correctly.

correct_counter measures the number of consecutive data points correctly classified by our Perceptron.

[Figure: The correct_counter plot for our OR perceptron]
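
The decision boundary itself can be drawn with the plot method we wrote earlier, using the already trained p_or:

p_or.plot()
plt.show()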

The decision boundary plot looks like this:

[Figure: The output plot of our OR perceptron]

Modelling the NAND part

Moving on to the second part, we need to model a NAND gate. Just like the OR part, we’ll use the same code but train the model on the NAND data. So our input data would be:

x_1   x_2   ~(x_1.x_2)
 0     0        1
 0     1        1
 1     0        1
 1     1        0
The NAND truth table
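
Training follows the same pattern as the OR model, just with the NAND targets:

p_nand = Perceptron(train_data, target_nand)
p_nand.train()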

After training, the following plots show that our model converged on the NAND data and mimics the NAND gate perfectly.

[Figure: The correct_counter plot of our NAND perceptron]
[Figure: The output plot of our NAND perceptron]

Combining these two models with an AND gives us a model that mimics the XOR function.

If we were to implement our XOR model, it would look something like this:

def XOR(x1, x2):
    """
    Return the boolean XOR of x1 and x2
    """

    x = [x1, x2]
    p_or = Perceptron(train_data, target_or)
    p_nand = Perceptron(train_data, target_nand)
    p_and = Perceptron(train_data, target_and)

    p_or.train()
    p_nand.train()
    p_and.train()

    return p_and.classify([p_or.classify(x),
                          p_nand.classify(x)])
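
Calling it on all four inputs reproduces the XOR truth table (note that, as written, each call retrains all three perceptrons, so this is illustrative rather than efficient):

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, XOR(x1, x2))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0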

If we plot the decision boundaries from our model—which is basically an AND of our OR and NAND models—we get something like this:

[Figure: The output plot of our 2nd attempt, showing a correct classification on our XOR data]
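
One way to generate a similar plot is to classify a mesh of points with the composed model, reusing the p_or and p_nand instances trained above (a sketch, not necessarily the exact code behind the figure):

# train the AND perceptron the same way as the others
p_and = Perceptron(train_data, target_and)
p_and.train()

def classify_xor(x):
    # compose the three trained perceptrons: AND(OR(x), NAND(x))
    return p_and.classify([p_or.classify(x), p_nand.classify(x)])

h = 0.01
x_range = np.arange(-0.1, 1.1, h)
y_range = np.arange(-0.1, 1.1, h)
xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
Z = np.array([[classify_xor([x, y]) for y in y_range] for x in x_range])

plt.figure(figsize=(10, 10))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.show()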

Out of all the logic gates taking two inputs, the XOR and XNOR gates are the only ones that are not linearly-separable.

Though our model works, it doesn’t seem like a viable solution to most non-linear classification or regression tasks. It’s really specific to this case. Most problems cannot be split into simple intermediate ones that can be individually solved and then combined. For something like this...

[Figure: A binary classification problem in two dimensions]

…a potential decision boundary could be something like this.

[Figure: A potential decision boundary that fits our example]

Conclusion

In this post, we understood how a perceptron works, what the XOR problem is, and then made two attempts to solve it:

  • The Perceptron algorithm—which failed.
  • Combining several single-layer perceptrons to model non-linearity.

We need to look for a more general model, which would allow for non-linear decision boundaries, like a curve, as in the case above.

In the next post, we'll have a look at the Multi-layer Perceptron (MLP). As a more generic solution, it is more powerful than our previous approaches as it can model a variety of non-linear problems.

Link to part - II.


I'm a CS Undergrad at BITS Pilani, currently in my 4th year. I write infrequently on several topics, from technical concepts to the latest customizations on my laptop, on my blog. Some of my more interesting posts are also published on Medium.
