# How Neural Networks Solve the XOR Problem- Part I

And why hidden layers are so important!!

** The perceptron** is a classification algorithm. Specifically, it works as a linear binary classifier. It was invented in the late 1950s by Frank Rosenblatt.

The perceptron basically works as a threshold function—non-negative outputs are put into one class while negative ones are put into the other class.

Though there’s a lot to talk about when it comes to neural networks and their variants, we’ll be discussing a specific problem that highlights the major differences between a single layer perceptron and one that has a few more layers.

**We'll discuss this across two posts.** In this post, we'll talk about the Perceptron Algorithm and two attempts at solving the XOR problem.

In the second and final post, we'll make our third (and final) attempt using a generalized solution.

## Structure and Properties

A perceptron has the following components:

- Input nodes
- Output node
- An activation function
- Weights and biases
- Error function

### Input Nodes

These nodes contain the input to the network. In any iteration—whether testing or training—the input from our data is passed to these nodes.

### Weights and Biases

These parameters are what we update when we talk about “training” a model. They are initialized to some random value or set to 0 and updated as the training progresses. The bias is analogous to a weight independent of any input node. Basically, it makes the model more flexible, since you can “move” the activation function around.

### Evaluation

The output calculation is straightforward.

- Compute the dot product of the input and weight vector.
- Add the bias.
- Apply the activation function.

This can be expressed like so:

This is often simplified and written as a dot product of the weight and input vectors plus the bias.

### Activation Function

This function allows us to fit the output in a way that makes more sense. For example, in a simple classifier, an output of say `-2.5`

or `8`

doesn’t make much sense when it comes to classification. If we use a sigmoidal activation function, we can fit that within a range of 0 to 1, which can be interpreted directly as a probability of a data point belonging to a particular class.

Though there are many activation functions, we will use a simple linear activation function for our perceptron. The linear activation function has no effect on its input and outputs it as is.

### Classification

How does a perceptron assign a class to a data point?

We know that a data point’s evaluation is expressed by the relation `wX + b`

. We define a threshold (** θ**) which classifies our data. Generally, this threshold is set to 0 for a perceptron.

So, points for which `wX + b`

is greater than or equal to 0 will belong to one class while the rest (`wX + b`

is negative) are classified as belonging to the other class. We can express this as:

Training algorithm

To train our perceptron, we must ensure that we correctly classify all of our training data. Note that this is different from how you would train a neural network, where you wouldn’t try and correctly classify your entire training data. That would lead to something called overfitting in most cases.

We start the training algorithm by calculating the ** gradient**, or Δw. It's the product of:

- The value of the input node corresponding to that weight.
- The difference between the actual value and the computed value.

We get our new weights by simply incrementing our original weights with the computed gradients which are then multiplied by the learning rate.

A simple intuition for how this works is: if our perceptron correctly classifies an input data point, `actual_value — computed_value`

would be `0`

and there wouldn’t be any change in our weights since the gradient is now `0`

.

## The 2D XOR problem

In the XOR problem, we are trying to train a model to mimic a 2D XOR function.

### The XOR function

The function is defined like so:

If we plot it, we get the following chart. This is what we’re trying to classify. The ⊕ (“o-plus”) symbol you see in the legend is conventionally used to represent the XOR Boolean operator.

Our algorithm—regardless of how it works—must correctly output the XOR value for each of the 4 points. We’ll be modelling this as a classification problem, so `Class 1`

would represent an XOR value of 1, while `Class 0`

would represent a value of 0.

## Attempt #1: The Single Layer Perceptron

Let's model the problem using a single layer perceptron.

### Input data

The table we saw for the XOR function will be used as the data to train our model.

```
Data Target
[0, 0] 0
[0, 1] 1
[1, 0] 1
[1, 1] 0
```

### Implementation

**Imports**

Apart from the usual visualization ( `matplotlib`

and `seaborn`

) and numerical libraries (`numpy`

), we’ll use `cycle`

from `itertools`

. This is done since our algorithm cycles through our data indefinitely until it manages to correctly classify the entire training data without any mistakes in the middle.

```
from itertools import cycle
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
```

**The data**

We next create our training data. This data is the same for each kind of logic gate since they all take in two Boolean variables as input.

```
train_data = np.array(
[
[0, 0],
[0, 1],
[1, 0],
[1, 1]])
target_xor = np.array(
[
[0],
[1],
[1],
[0]])
target_nand = np.array(
[
[1],
[1],
[1],
[0]])
target_or = np.array(
[
[0],
[1],
[1],
[1]])
target_and = np.array(
[
[0],
[0],
[0],
[1]])
```

**The training function**

Here, we cycle through the data indefinitely, keeping track of how many consecutive data points we correctly classified. If we manage to classify everything in one stretch, we terminate our algorithm.

If not, we reset our counter, update our weights, and continue the algorithm.

```
def train(self):
"""
Train a single layer perceptron.
"""
# the number of consecutive correct classifications
correct_counter = 0
for train, target in cycle(zip(self.train_data, self.target)):
# end if all points are correctly classified
if correct_counter == len(self.train_data):
break
output = self.classify(train)
self.node_val = train
if output == target:
correct_counter += 1
else:
# if incorrectly classified, update weights and reset correct_counter
self.update_weights(target, output)
correct_counter = 0
```

The rest of the utility functions, including those that update our model's weights and calculate gradients, can be implemented like this.

```
def _gradient(self, node, exp, output):
"""
Return the gradient for a weight.
This is the value of delta-w.
"""
return node * (exp - output)
def update_weights(self, exp, output):
"""
Update weights and bias based on their respective gradients
"""
for i in range(self.input_nodes):
self.w[i] += self.lr * self._gradient(self.node_val[i], exp, output)
# the value of the bias node can be considered as being 1 and the weight between this node
# and the output node being self.b
self.b += self.lr * self._gradient(1, exp, output)
def forward(self, datapoint):
"""
One forward pass through the perceptron.
Implementation of "wX + b".
"""
return self.b + np.dot(self.w, datapoint)
def classify(self, datapoint):
"""
Return the class to which a datapoint belongs based on
the perceptron's output for that point.
"""
if self.forward(datapoint) >= 0:
return 1
```

To visualize how our model performs, we create a mesh of data points, or a grid, and evaluate our model at each point in that grid. Finally, we colour each point based on how our model classifies it. Thus, the `Class 0`

region would be filled with the colour assigned to points belonging to that class.

```
def plot(self, h=0.01):
"""
Generate plot of input data and decision boundary.
"""
# setting plot properties like size, theme and axis limits
sns.set_style('darkgrid')
plt.figure(figsize=(20, 20))
plt.axis('scaled')
plt.xlim(-0.1, 1.1)
plt.ylim(-0.1, 1.1)
colors = {
0: "ro",
1: "go"
}
# plotting the four datapoints
for i in range(len(self.train_data)):
plt.plot([self.train_data[i][0]],
[self.train_data[i][1]],
colors[self.target[i][0]],
markersize=20)
x_range = np.arange(-0.1, 1.1, h)
y_range = np.arange(-0.1, 1.1, h)
# creating a mesh to plot decision boundary
xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
Z = np.array([[self.classify([x, y]) for x in x_range] for y in y_range])
# using the contourf function to create the plot
```

**The Perceptron class**

To bring everything together, we create a simple `Perceptron`

class with the functions we just discussed. We have some instance variables like the training data, the target, the number of input nodes, and the learning rate.

```
class Perceptron:
"""
Create a perceptron.
train_data: A 4x2 matrix with the input data.
target: A 4x1 matrix with the perceptron's expected outputs
lr: the learning rate. Defaults to 0.01
input_nodes: the number of nodes in the input layer of the perceptron.
Should be equal to the second dimension of train_data.
"""
def __init__(self, train_data, target, lr=0.01, input_nodes=2):
self.train_data = train_data
self.target = target
self.lr = lr
self.input_nodes = input_nodes
# randomly initialize the weights and set the bias to -1.
self.w = np.random.uniform(size=self.input_nodes)
self.b = -1
# node_val hold the values of each node at a given point of time.
self.node_val = np.zeros(self.input_nodes)
def _gradient(self, node, exp, output):
"""
Return the gradient for a weight.
This is the value of delta-w.
"""
return node * (exp - output)
def update_weights(self, exp, output):
"""
Update weights and bias based on their respective gradients
"""
for i in range(self.input_nodes):
self.w[i] += self.lr * self._gradient(self.node_val[i], exp, output)
# the value of the bias node can be considered as being 1 and the weight between this node
# and the output node being self.b
self.b += self.lr * self._gradient(1, exp, output)
def forward(self, datapoint):
"""
One forward pass through the perceptron.
Implementation of "wX + b".
"""
return self.b + np.dot(self.w, datapoint)
def classify(self, datapoint):
"""
Return the class to which a datapoint belongs based on
the perceptron's output for that point.
"""
if self.forward(datapoint) >= 0:
return 1
return 0
def plot(self, h=0.01):
"""
Generate plot of input data and decision boundary.
"""
# setting plot properties like size, theme and axis limits
sns.set_style('darkgrid')
plt.figure(figsize=(20, 20))
plt.axis('scaled')
plt.xlim(-0.1, 1.1)
plt.ylim(-0.1, 1.1)
colors = {
0: "ro",
1: "go"
}
# plotting the four datapoints
for i in range(len(self.train_data)):
plt.plot([self.train_data[i][0]],
[self.train_data[i][1]],
colors[self.target[i][0]],
markersize=20)
x_range = np.arange(-0.1, 1.1, h)
y_range = np.arange(-0.1, 1.1, h)
# creating a mesh to plot decision boundary
xx, yy = np.meshgrid(x_range, y_range, indexing='ij')
Z = np.array([[self.classify([x, y]) for x in x_range] for y in y_range])
# using the contourf function to create the plot
plt.contourf(xx, yy, Z, colors=['red', 'green', 'green', 'blue'], alpha=0.4)
def train(self):
"""
Train a single layer perceptron.
"""
# the number of consecutive correct classifications
correct_counter = 0
for train, target in cycle(zip(self.train_data, self.target)):
# end if all points are correctly classified
if correct_counter == len(self.train_data):
break
output = self.classify(train)
self.node_val = train
if output == target:
correct_counter += 1
else:
# if incorrectly classified, update weights and reset correct_counter
self.update_weights(target, output)
correct_counter = 0
```

### Results

Let’s create a perceptron object and train it on the XOR data.

```
p_xor = Perceptron(train_data, target_xor)
p_xor.train()
```

You’ll notice that the training loop never terminates since a perceptron can only converge on linearly separable data. To put it simply, linearly separable data means that you can separate data with a point in 1D, a line in 2D, a plane in 3D, and so on.

## A perceptron can only converge on linearly separable data. Therefore, it isn’t capable of imitating the XOR function.

Remember that a perceptron must correctly classify the entire training data in one go. If we keep track of how many points it correctly classified in succession, we get something like this.

The algorithm only terminates when `correct_counter`

hits 4—which is the size of the training set—so this will go on indefinitely.

## The Need for Non-Linearity

It is clear that a single perceptron will not serve our purpose; the classes aren’t linearly separable. This boils down to the fact that a single linear decision boundary isn’t going to work.

Non-linearity allows for more complex decision boundaries. One potential decision boundary for our XOR data could look like this:

## Attempt #2: The 2D XOR problem

We know that imitating the XOR function would require a non-linear decision boundary.

## However, why do we have to stick with a single decision boundary?

### The Intuition

Let’s first break down the XOR function into its AND and OR counterparts.

The XOR function on two Boolean variables A and B is defined as:

Let’s add `A.~A`

and `B.~B`

to the equation. Since they both equate to 0, the equation remains valid.

Let’s rearrange the terms so that we can pull out `A`

from the first part and `B`

from the second.

Simplifying it further, we get:

Using DeMorgan’s laws for Boolean algebra `~A + ~B = ~(AB)`

, we can replace the second term in the above equation as shown:

Let’s replace A and B with x_1 and x_2 respectively since that’s the convention we’re using in our data.

The XOR function can be condensed into two parts: ** a NAND and an OR**. If we can calculate these separately, we can just combine the results, using

**an AND gate.**Let’s call the OR section of the formula part I, and the NAND section as part II.

### Modelling the OR part

We’ll use the same Perceptron class as before, only that we’ll train it on OR training data.

```
p_or = Perceptron(train_data, target_or)
p_or.train()
```

This converges, since the data for the OR function is linearly separable. If we plot the number of accurately classified consecutive data points as we did in our first attempt, we get this plot. It’s clear that around iteration 50, it hits the value 4, meaning that it classified the entire dataset correctly.

## correct_counter measures the number of consecutive data points correctly classified by our Perceptron.

The decision boundary plot looks like this:

### Modelling the NAND part

Moving on to the second part, we need to model a NAND gate. Just like the OR part, we’ll use the same code but train the model on the NAND data. So our input data would be:

After training, the following plots show that our model converged on the NAND data and mimics the NAND gate perfectly.

What we now have is a model that mimics the XOR function.

If we were to implement our XOR model, it would look something like this:

```
def XOR(x1, x2):
"""
Return the boolean XOR of x1 and x2
"""
x = [x1, x2]
p_or = Perceptron(train_data, target_or)
p_nand = Perceptron(train_data, target_nand)
p_and = Perceptron(train_data, target_and)
p_or.train()
p_nand.train()
p_and.train()
return p_and.classify([p_or.classify(x),
p_nand.classify(x)])
```

If we plot the decision boundaries from our model—which is basically an AND of our OR and NAND models—we get something like this:

## Out of all the logic gates taking two inputs, the XOR and XNOR gates are the only ones that are not linearly-separable.

Though our model works, it doesn’t seem like a viable solution to most non-linear classification or regression tasks. It’s really specific to this case. Most problems cannot be split into simple intermediate ones that can be individually solved and then combined. For something like this...

…a potential decision boundary could be something like this.

## Conclusion

In this post, we understood how a perceptron works, what the XOR problem is, and then made two attempts to solve it:

- The Perceptron algorithm—which failed.
- Combining several single-layer perceptrons to model non-linearity.

We need to look for a more general model, which would allow for non-linear decision boundaries, like a curve, as in the case above.

**In the next post**, we'll have a look at the Multi-layer Perceptron (MLP). As a more generic solution, it is more powerful than our previous approaches as it can model a variety of non-linear problems.

I'm a CS Undergrad at BITS Pilani, currently in my 4th year. I write infrequently on several topics, from technical concepts to the latest customizations on my laptop, on my blog. Some of my more interesting posts are also published on Medium.