A lot of us want to get started with Machine Learning, either because it's the hot topic of the day or because it just looks like fun. However, not many of us can boast familiarity with linear algebra, multivariable calculus, statistics and advanced programming concepts. If you're interested but lack any advanced computer science or mathematical background, you're in luck: with little to no expert knowledge, it's possible to build a simple machine learning classifier. The only prerequisite is some basic Python knowledge, which you can pick up from a variety of sources. Personally, I recommend MIT OCW's 6.0001 (link) since it's a very structured way to learn Python.

To follow along with this article, refer to the code used here.

What is Machine Learning?

So, to begin with, I would like to state the goal of machine learning: it's essentially a way to build a model that simulates an aspect of "figuring things out". For example, it's very easy for humans to distinguish cats from dogs, but it's much harder to get a computer to do the same thing. If you really think about it, though, training computers and training humans to identify these things can be quite similar.

In fact, this is probably how we were taught as children. We would first guess the correct answer based on what was presented to us, and then an adult would tell us what it actually was. Our next guess would then be better. Then when we went out into the world and saw a new cat or a dog, we made our guesses with close to 100% accuracy. So the set of all cats and dogs in the world is now familiar to us just because we were trained with the small sample of cats and dogs we saw in our childhood.

Machine learning is the same thing. We train the model with a small sample of things and hopefully it is ready to go out and identify one that hasn't been presented to it before. The sample that we use is typically from a dataset where some researchers collect and store data for us to use. Machine Learning follows a different paradigm from standard programming practices.

Usually, while programming, you feed rules and data into the system and get answers out. Machine Learning goes about this backwards: you feed in data and answers, and hope that a set of rules comes out of the machine.

In this project we'll be working on the iris dataset. We'll use some data to predict what sort of iris plant we're looking at based on a few features such as petal length, sepal length, etc.

How Naive Bayes Works

Statistics and machine learning go hand-in-hand. It's almost impossible to learn machine learning without knowing some statistics. So, let's dive in.

Typically in statistics, we have a preconceived notion that we want to take a closer look at to see if it's true. This is known as a hypothesis. To check whether your hypothesis is true, you do the following:

  1. You observe a small set of values (small relative to the set of all values that exist in the universe), known as a sample.
  2. Then you look at how likely your hypothesis is to be true, given that the sample is how it is.

I hope that you are already familiar with Bayes' Theorem (if you're unfamiliar with it, you probably want to check out this video). Typically, we use y to represent the hypothesis and X to represent a sample that tests for that hypothesis. Let's take an example.

Say you think you have contracted the coronavirus and you want to test for it, and say there's a new testing system with 99% accuracy, meaning that whatever your true status is, there's a 99% chance the test reports it correctly. Now, let's say that you've tested positive on this system and you want to find out the probability that you actually have coronavirus. In this case, y represents the event that you do have coronavirus and X represents the event that the test comes out positive. So, as per Bayes' Theorem, the probability that you actually do have coronavirus is:

$$ P(y | X) = \frac{P(X | y) P(y)}{P(X)} $$
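To make the formula concrete, here's a tiny worked calculation. The prevalence figure below is purely made up for illustration (the article doesn't specify one); the 99% figures come from the accuracy assumption above.

# A toy Bayes' Theorem calculation.
# Assumption (illustrative only): 1% of the population has the virus,
# and "99% accuracy" means P(positive | sick) = P(negative | healthy) = 0.99.
p_y = 0.01                    # prior P(y): you have the virus
p_x_given_y = 0.99            # P(X | y): test positive given you're sick
p_x_given_not_y = 0.01        # P(X | not y): false positive rate

# Total probability of a positive test, P(X)
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Bayes' Theorem: P(y | X) = P(X | y) * P(y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 3))  # 0.5 -- far lower than the 99% you might expect

Under these made-up numbers, a positive result only gives a 50% chance of actually being sick, which is exactly why the prior P(y) matters so much.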

What relevance do machine learning engineers have here? Well, their job is to actually build this test, i.e. a classifier that sorts people into those who have coronavirus and those who don't.

Naturally, we're not going to try and devise a medical setup (because, well, we're not doctors). We would instead do it in a probabilistic fashion, and Naive Bayes is a probabilistic classifier that allows us to do exactly that.

Based on a few sample values, the Naive Bayes classifier understands what the rules are and can classify a new test case when presented with one. Now, each value that enters a sample initially comes from a probability distribution that the population follows. So our job now is to try and find out what sort of distribution we have and then see how likely it is that a sample would present itself the way it does (take a minute to think about what this means).

We can refine the problem statement that we had earlier to better suit a statistician. Earlier we were asking "how do we identify samples?". Now we can ask the question "what is the underlying distribution from which these samples come?". Obviously, we can't look at the value of every single member in the population to identify the underlying distribution, because it's impractical to do so. Probability distributions are typically characterised by simple functions. If you ascertain the values of the parameters in the function, you basically know the probability distribution for all input values. For example, an exponential distribution looks like this:
$$ P(x) = \lambda e^{-\lambda x} \qquad \forall x \geqslant 0 $$

Here, $\lambda$ is a parameter for this particular probability distribution. If you find $\lambda$, your job is done since you've got the distribution.

Typically we use $\theta$ to describe a generic parameter of a probability distribution. Naive Bayes is closely related to a statistical technique known as Maximum Likelihood Estimation, which estimates the values of these parameters.

Finding the parameters for a distribution can be tricky and the best we can do is model the distribution using a representative sample. Let's say we have a sample of n members, $X_{1}, \ldots, X_{n}$ (in this case, our dataset). There are a few ways to obtain an estimate for $\theta$. One way to do it is to select the parameter that makes the sample most likely to have occurred. In essence, it means that you try out all possible values for $\theta$ and pick the one that would make the probability of getting the sample you obtained as large as possible. This is known as maximum likelihood estimation. We'll represent the estimated value of the parameter by $\hat{\theta}$. Assuming the sample is independent and identically distributed (meaning that each member is from the same underlying distribution and the fact that one member is selected does not affect the selection of another member), we get

$$ \hat{\theta} = \underset{\theta}{\operatorname{argmax}} \left[ \prod_{i} P_{\theta}(X_{i} = x_{i}) \right] $$

The formula above looks complicated, but in reality it's very simple. "argmax of a function" essentially translates to "analyse all values and pick the one that makes this function take its maximum value". In this context, argmax finds the value of $\theta$ that maximises the probability of the sample.
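To make the argmax idea concrete, here's a minimal sketch (not part of the Iris workflow) that estimates $\lambda$ for an exponential distribution: we simulate a sample with a known $\lambda$, score a grid of candidate values by their log-likelihood, and keep the best one. The closed-form answer, 1 over the sample mean, is printed for comparison.

import numpy as np

# Maximum likelihood estimation for an exponential distribution via a grid search.
rng = np.random.default_rng(0)
true_lambda = 2.0
sample = rng.exponential(scale=1 / true_lambda, size=1000)

candidates = np.linspace(0.1, 5.0, 500)          # candidate values of lambda (our theta)
# log P(x | lambda) = log(lambda) - lambda * x, summed over the whole sample
log_likelihoods = [np.sum(np.log(lam) - lam * sample) for lam in candidates]

lambda_hat = candidates[np.argmax(log_likelihoods)]  # the argmax step
print(lambda_hat, 1 / sample.mean())                 # both should be close to 2.0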

Our problem is a bit different but uses the same paradigm. We are trying to build a classifier that makes a prediction, so instead of choosing a parameter, we have to choose a classification result. Assuming the features of each sample member are conditionally independent given y (this is the "naive" assumption), we get

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \left[ P(y) \prod_{i} P(x_{i} \mid y) \right] $$

$\hat{y}$ is the prediction made by the classifier by analysing all the values of y. As a theoretical example, let us see how this paradigm would classify images to identify dogs and cats. If an image X is fed to it, it would look at $P(X|cat)$ (the probability of X showing up, given that it's a picture of a cat) and $P(X|dog)$ (the probability of X showing up, given that it's a picture of a dog), weight each by the prior $P(y)$, and decide whether the image shows a cat or a dog by picking whichever class gives the larger value.
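Here's a minimal sketch of that decision rule with completely made-up numbers: we classify an animal as a cat or a dog from two simple features (weight in kg, height in cm), assuming each feature follows a class-conditional Gaussian with the hypothetical parameters below.

import math

# Naive Bayes decision rule "by hand" with hypothetical parameters.
def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

priors = {'cat': 0.5, 'dog': 0.5}                  # P(y)
params = {                                         # (mean, variance) per feature, per class
    'cat': [(4.0, 1.0), (25.0, 16.0)],
    'dog': [(20.0, 25.0), (50.0, 100.0)],
}

x = [9.0, 40.0]                                    # the animal we want to classify
scores = {}
for y in priors:
    likelihood = 1.0
    for xi, (mean, var) in zip(x, params[y]):
        likelihood *= gaussian_pdf(xi, mean, var)  # the product over features
    scores[y] = priors[y] * likelihood             # P(y) * prod_i P(x_i | y)

print(max(scores, key=scores.get))                 # prints 'dog' for these numbers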

Images are more complex objects to deal with. The value of each pixel contributes to the final decision made. Let's take a simpler example for now. The Iris plant has several different varieties. However, with the lengths and widths of their sepals and petals, it's easy to distinguish one variety from another.

Exploring the Dataset

Now for this portion, it will help if you know how to use the pandas and matplotlib libraries, but if you don't, it's fine because I'll walk you through the process of using them for the task at hand. We'll also take a look at a tool called Altair which makes visualisations a bit cooler.

You can download the dataset from Kaggle here. Now, the first order of business is loading this dataset into your python program. I'm assuming you have pandas and the other libraries we'll import below installed. If not, just run:

pip3 install pandas numpy matplotlib altair

in your terminal/command prompt and you're good to go! Now, we can begin the code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

This gets the prerequisite libraries imported into the program. Now, you can load the dataset's csv file into a pandas DataFrame (the basic object for all pandas-related data analysis) and see what the dataset looks like:

df = pd.read_csv('Iris.csv')

print(df.head()) # Prints the first few lines of the dataset.
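If you want a quick numerical summary as well, a small optional sketch (assuming the Kaggle column names used throughout this article):

print(df.describe())                 # Summary statistics for the numeric columns.
print(df['Species'].value_counts())  # How many rows we have for each species.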

Great! So we can see that there are 4 major parameters (Sepal Length, Sepal Width, Petal Length and Petal Width) which we can use for classification. Now, let's try and see how these values trend. First, let's look at the default pandas plot function, which plots all the data against the specified index. We'll specify the index to be the Id of the Iris (since we have nothing better to use for now).

df = df.set_index("Id")

df.plot()
plt.show()
The default pandas plot

Since the species are grouped by Id (the first lot contains Setosa, the second Versicolor and the third Virginica), you may notice jumps in the data values here and there that signal a change of species. Around Id = 50, you can notice a spike in all values, which is probably the change from Setosa to Versicolor. However, this representation is kind of ugly and rather crude overall. Let's see if we can use Altair to get a better representation of the data.
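One quick way to confirm that hunch numerically (not in the original code, just an optional check) is to group the numeric columns by species and look at their means:

# Average measurements per species -- the jumps in the default plot
# show up here as clearly different per-class means.
print(df.groupby('Species').mean())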

Exploring with Altair

Altair is a fun library to mess around with. It allows you to create fun, interactive charts. If you're not using a Jupyter notebook, you'll need the altair_viewer package so Altair can open up a browser tab to display the chart.

Let's create a 2D plot with Sepal Length and Sepal Width.

alt.renderers.enable('altair_viewer') # Lets Altair open the chart in a browser tab.

chart = alt.Chart(df).mark_circle(size=60).encode(
    x='SepalLengthCm',
    y='SepalWidthCm',
    color='Species',
    tooltip=['Species', 'SepalLengthCm', 'SepalWidthCm']
).interactive()

chart.show()

This gives you a nice interactive plot.

You can already see some patterns emerging. Iris-setosa is almost immediately distinguishable from the other 2. Virginica seems to trend towards slightly higher sepal length and width than Versicolor.

It's your turn to fool around with the dataset as much as you like. You can draw as many graphs and figures as you need to see what the data looks like, and be as creative as you want while poking around.
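For instance, here's one more chart you could try; it follows exactly the same pattern as above, just swapping in the petal columns from the Kaggle CSV:

petal_chart = alt.Chart(df).mark_circle(size=60).encode(
    x='PetalLengthCm',
    y='PetalWidthCm',
    color='Species',
    tooltip=['Species', 'PetalLengthCm', 'PetalWidthCm']
).interactive()

petal_chart.show()

The petal measurements separate the three species even more cleanly than the sepal ones do.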

Giving Mathematical Meaning to the Data

This is one of those things that people don't often tell you to do, but it can be very useful for getting the full picture when doing anything related to Machine Learning or Data Science. It's generally a good idea to know what you're dealing with and what exactly you're doing when you write the few lines of code it takes to generate output.

While discussing Bayes' Theorem, I mentioned that X is often used to represent the input and y the output. Here $X_{i}$ represents member i of the sample and $y_{i}$ represents the output (label) of that member. Note that each member of the sample is a vector containing 4 values (Sepal Length and Width, and Petal Length and Width). y is a numerical output to which we need to assign meaning. For example, in a medical test, you could say y = 0 represents the event that the test is negative and y = 1 represents the event that the test came out positive. Similarly, in our Iris dataset, let us take y = 0 to be the case that the Iris is Setosa, y = 1 to correspond to Versicolor and y = 2 to Virginica.

Let's say that $\hat{y_{i}}$ is the predicted output for a test sample $X_{i}$. We hope that for as many test samples as possible, $\hat{y_{i}}$ and $y_{i}$ will be the same.
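If you want to make that mapping explicit on the Kaggle DataFrame, here's a small sketch (assuming the species labels are spelled as in the CSV; check df['Species'].unique() if unsure). We won't actually need it later, since the sklearn copy of the dataset we use below already ships with numeric labels.

# Explicit numeric encoding of the species labels (spellings assumed from the Kaggle CSV).
label_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
y = df['Species'].map(label_map)
X = df.drop(columns='Species')    # the four measurement columns
print(y.value_counts())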

Developing the Model

For this bit, we'll be using scikit-learn, so you should probably install it with pip or conda. Implementing a Naive Bayes classifier typically requires 3 steps: 1) choosing the distribution, 2) fitting the training sample to the distribution, and 3) seeing how well the model does on data it hasn't seen (the test data).

Choosing the distribution is the only hard part about implementing Naive Bayes (assuming you've understood the theory behind it). Typically there are 3 types of distributions we fit to: Bernoulli, Multinomial and Gaussian. Bernoulli distributions model a single trial that comes up as either a success (1) or a failure (0), so they suit binary yes/no features; the Multinomial suits count data. Neither is applicable here, since our four features are continuous measurements, so the Gaussian ends up being a good distribution to fit the data to, and we'll use that. Let's start by getting sklearn in and loading the data. The same dataset is available in the sklearn libraries and it's more convenient to use.

from sklearn import datasets

iris = datasets.load_iris()  # Loads the data from sklearn's datasets.
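It's worth peeking at what you just loaded; the object returned by load_iris exposes the features, the targets and their names, and the numeric encoding matches the one we chose above:

print(iris.feature_names)   # sepal length/width and petal length/width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica'] -> encoded as 0, 1, 2
print(iris.data.shape)      # (150, 4): 150 samples, 4 features each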

Now, let's make the model variable and get it acclimatised to the Gaussian environment.

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
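Incidentally, the other two variants mentioned earlier live in the same sklearn module, so switching between them is a one-line change:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# BernoulliNB   -> binary (yes/no) features
# MultinomialNB -> count features, e.g. word counts in text
# GaussianNB    -> continuous features, like our four measurements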

Now that we've selected our template for the model, we have to fit our training sample to it. Before that, we need to split our dataset into training and testing so that we can later see how good our model actually is.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)  # Hold out 20% of the data for testing.
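A quick sanity check on the split never hurts (and if you want the accuracy number later to be reproducible between runs, train_test_split also accepts a random_state argument):

# 150 rows split 80/20 -> 120 training samples and 30 test samples.
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)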

Let's train the model with the training sample:

model.fit(X_train, y_train)     # Trains the model

Checkpoint: It's a good idea to stop here and see if we really know what we're doing. The training part of Gaussian Naive Bayes fits one Gaussian per feature for each class (3 classes and 4 features, so 12 univariate Gaussians here), along with the class priors P(y) taken from the class frequencies in the training sample. The reason it does this is so that classification becomes easy: the MAP approach picks the class that would most likely have led to the test values we observe, and to compare the classes that way we need the mean and variance of each feature within each class of our training sample.
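You can actually peek at those fitted parameters on the model. A small sketch; note that the attribute holding the variances is called var_ in recent scikit-learn releases and sigma_ in older ones, so check your version:

print(model.class_prior_)   # P(y) for each class, from class frequencies in the training set
print(model.theta_)         # per-class mean of each feature, shape (3, 4)
print(model.var_)           # per-class variance of each feature, shape (3, 4)
                            # (older scikit-learn versions expose this as model.sigma_)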

In statistics, we never just stop with making the model. We often want to see how well the model holds up when it sees data it hasn't seen before. Typically in Inferential Statistics, a Goodness of Fit test is performed at this point. For now, we'll just stick to tracking how accurate the model is, by seeing how different the predicted values $\hat{y}$ and the actual values y are. To measure the difference between these values, we need a proper metric. Luckily for us, sklearn has it figured out.

from sklearn import metrics

# Applies the fitted model to predict a class for each test sample.
predicts = model.predict(X_test)

# Grades the model based on performance.
print(metrics.accuracy_score(y_test, predicts) * 100)

The "metrics" feature in sklearn allows us to really see how well our model does against new data. Turns out to do pretty well.

A 93.33% accuracy isn't too bad considering that our model was so simple! (The exact number will vary a little from run to run, since the train/test split is random.)
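Accuracy is a single number; if you want to see which species get confused with which, the same metrics module also gives you a confusion matrix (a small optional addition):

# Rows are the true species, columns the predicted ones; off-diagonal
# entries show which species the model mixes up.
print(metrics.confusion_matrix(y_test, predicts))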

The Caveats

Naive Bayes, while simple and easy to understand, isn't the best classifier. If the data becomes highly non-linear, the points won't fit a nice, simple distribution like a Gaussian. This is where the "Naive" part of Naive Bayes has a major part to play: it assumes the features are conditionally independent given the class, a generalisation that rarely holds exactly, so it wouldn't really do well under modern expectations. However, it's about as simple as classifiers get, and it's insanely efficient.

Additional Reading

You've made it this far, so you might as well go one step further! There are a few great resources you can use to find out a lot more about Bayes estimators, data analysis and probability.

Maximum a Posteriori Estimators as a Limit of Bayes Estimators (a great research paper): https://arxiv.org/pdf/1611.05917.pdf

Generic Statistical Classification : https://en.wikipedia.org/wiki/Statistical_classification

Data Analysis with Pandas and Sci-kit Learn by Sentdex : https://www.youtube.com/playlist?list=PLQVvvaa0QuDfSfqQuee6K8opKtZsh7sA9

Statistics for Application (MIT 18.650) [if you found the maximum likelihood stuff interesting, you'll get a real kick out of this course] : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/index.htm

Cornell's explanation of maximum likelihood for data classification : https://www.youtube.com/watch?v=RIawrYLVdIw&t=0s

Scikit-learn documentation : https://scikit-learn.org/stable/