Practical Machine Learning in Cybersecurity

After doing a couple of ML courses, how do you actually use what you have learned? Read on to find out.


Machine Learning is a field with applications across a wide variety of domains. In this article, we'll explore some key aspects of applying machine learning knowledge to a concrete domain: cybersecurity.

The first challenge involved a large dataset of files, and the task was reasonably straightforward: a binary classification problem, sorting the files into malware and benign. The files were named with their SHA256 hashes. While the dataset can't be shared, I'll describe it here.

The first thing you do when you receive a dataset is to explore it. This is part of the process called "exploratory data analysis" (EDA), and it gives you many key insights for training your model. First, we look at the size of the data. This particular dataset ran to over 50 GB in all, which gives us some clues: a lot of redundant information is contained in the files, and we will have to perform feature extraction. This makes exploration of the data a necessity. Digging further, we find that there are two folders in the data, one for static analysis and one for dynamic analysis. The sub-directories indicate the classes of the files.

/malware-classification$ tree
.
├── Static Analysis
│   ├── Malware
│   │   ├── hash1
│   │   │   ├── String.txt
│   │   │   └── structure_info.txt
│   │   └── hash2
│   │       ├── String.txt
│   │       └── structure_info.txt
│   └── Benign
│       ├── hash1
│       │   ├── String.txt
│       │   └── structure_info.txt
│       └── hash2
│           ├── String.txt
│           └── structure_info.txt
└── Dynamic Analysis
    ├── Malware
    │   ├── hash1.json
    │   └── hash2.json
    └── Benign
        ├── hash1.json
        └── hash2.json
Directory structure of the Challenge 1 dataset

The static analysis data consisted of more directories, each named by the SHA256 hash mentioned above. It isn't essential to know a lot about the hash, but some crucial information can be obtained from a quick Google search: SHA256 is a cryptographic hash function, and a hash is effectively unique to each file. That second bit of information is important because feeding the hash to a model is useless; it only serves as an identifier for a data element. This is also the time to pick out which features you intend to feed your model. Use your plotting skills to make a few charts first and understand the flow and patterns without a model.

Use matplotlib, seaborn or altair to explore your data
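
For instance, a quick look at the per-sample sizes in each class already tells you a lot. The sketch below assumes the directory layout shown above; the paths and the histogram choice are illustrative and not part of the original challenge code.

import os
import matplotlib.pyplot as plt
import seaborn as sns

def folder_sizes(root):
    # Total size (in KB) of each sample directory under root
    sizes = []
    for sample in os.listdir(root):
        sample_dir = os.path.join(root, sample)
        total = sum(os.path.getsize(os.path.join(sample_dir, name))
                    for name in os.listdir(sample_dir))
        sizes.append(total / 1024)
    return sizes

malware_sizes = folder_sizes("Static Analysis/Malware")
benign_sizes = folder_sizes("Static Analysis/Benign")

sns.histplot(malware_sizes, color="red", label="malware", log_scale=True)
sns.histplot(benign_sizes, color="green", label="benign", log_scale=True)
plt.xlabel("Sample size (KB)")
plt.legend()
plt.show()
A first EDA plot: how big are the samples in each class?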

Every folder in this data contained two files, named the same way everywhere - String.txt and structure_info.txt. The former was aptly named, as it literally held just one long string. If you've been trained on Internet courses where the data arrives in a tidy CSV format, this might look intimidating. Fear not, reader! Just keep scrolling, and your naturally intelligent brain will start spotting patterns in the data. Some words in these strings were repeated many times, which probably leads to something interesting, so we take the most commonly repeated words out of these files and use them as features.
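
Counting those words takes only a few lines with collections.Counter. This is a sketch of the idea rather than the exact code used in the challenge; the cut-off of 100 words is arbitrary.

from collections import Counter

def top_words(path, n=100):
    # Return the n most frequent whitespace-separated words in a String.txt file
    with open(path, encoding='latin-1') as f:
        counts = Counter(f.read().split())
    return [word for word, _ in counts.most_common(n)]
Pulling the most commonly repeated words out of String.txt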

The latter file also had some unintelligible text along with some labeled fields that looked like useful features. Crucially, structure_info.txt contained the file entropy. While many of you may have first met entropy in a thermodynamics context, the fundamental idea is randomness, and entropy can be used to analyze information - the groundwork laid by Claude Shannon in the 1940s. Here we make use of our second critical insight: malware files try to hide what makes them malicious by packing it into smaller chunks of data. Because a lot of information is encoded in a small amount of data, the entropy is higher, which proves useful for our classification problem. This allows us to engineer some features for our dataset.

def get_entropy(path):
	# Read the structure_info file; skip files that can't be decoded
	try:
		with open(path, encoding='latin-1') as f:
			fcontent = f.readlines()
	except UnicodeDecodeError:
		return None

	# Collect every "Entropy: <value>" entry in the file
	entropies = []
	for line in fcontent:
		tokens = line.split()
		if "Entropy:" in tokens:
			entropies.append(float(tokens[tokens.index("Entropy:") + 1]))

	# Average the entropies, or return -1 if none were found
	if not entropies:
		return -1
	return sum(entropies) / len(entropies)
Engineering an average entropy feature for a file

The dynamic analysis data was presented in a much friendlier format: JSON. Features of each file are neatly arranged in this clean format. However, the data, to put it bluntly, was flawed. Each file contained a key called "virustotal", which held the results of antivirus scans by different vendors. This feature alone separated malware from benign files almost perfectly, and no further data was needed. Its existence gave near-perfect results, so the problem could have been solved without even using machine learning.

import json

def classify_dynamic(path):
	with open(path) as f:
		data = json.load(f)

	# The mere presence of the 'virustotal' key gives the label away
	if 'virustotal' in data:
		return 1
	else:
		return 0
This is a joke. More features were considered for better accuracy, but this alone gave 0.986.

Another crucial thing to determine here is the balance of your dataset. You must know the distribution of classes in your training set, as it tells you which metric to use to evaluate your model. If your data is imbalanced and you optimize for plain accuracy, your model will happily take the lazy way out and predict the most common class. The dataset can be balanced by weighting the underrepresented classes or by carefully duplicating (oversampling) them. By now, you should have an idea about your loss function or optimizing metric.
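
A quick way to check the balance is to count the labels and, if needed, derive class weights. This sketch assumes y is the array of labels (1 for malware, 0 for benign).

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

values, counts = np.unique(y, return_counts=True)
print(dict(zip(values, counts)))            # raw class counts
print(dict(zip(values, counts / len(y))))   # class proportions

# Weights to feed a model's class_weight (or scale_pos_weight) option
weights = compute_class_weight(class_weight="balanced", classes=values, y=y)
print(dict(zip(values, weights)))
Checking the class balance before picking a metric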

After checking the data, the next step is to make sure your features are in a form a model can understand. You will now extract features from your dataset if they aren't already extracted. Once you've got the data elements out, all features must be made uniform for the model; to do this, you encode them into a numeric format. This is called preprocessing. While some models like decision trees may not strictly require numeric data, encoding lets you try out other models to see which one performs better. While encoding, you'll want to take care of how it's done. It can be done arbitrarily with a label encoder, which simply assigns a numerical value to each distinct alphanumeric value. An improved version ranks the items, preserving some sort of hierarchy in the data (an ordinal encoding). You can also use a one-hot encoding, where each value is represented as a vector of zeros with a single one at the position of the category it represents.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
X = data.drop(['target'], axis=1)
cols = X.columns

# 'words' holds the word-count columns, which are already numeric
for col in cols:
    if col not in words:
        le.fit(X[col].astype(str))
        X[col] = le.transform(X[col].astype(str))
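
If you want the one-hot variant instead, pandas does it in one call. A short sketch, assuming the same DataFrame X as above with its categorical columns still as strings:

import pandas as pd

# Every category in an object-typed column becomes its own 0/1 column
categorical_cols = [col for col in X.columns if X[col].dtype == object]
X_onehot = pd.get_dummies(X, columns=categorical_cols)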

During preprocessing, you might also want to consider scaling your data to help your model train more easily. Extremely large values in a column can introduce strange biases that you'll want to cut down on. You can standardize (Gaussian scaling) by subtracting the mean and dividing by the standard deviation, which makes outliers stand out and squeezes most of your data into a small interval centered at zero. You can also use MinMaxScaler, which subtracts the column's minimum from every data point and divides by the column's range, squishing the column into the interval [0, 1].
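
Both scalers live in scikit-learn. A short sketch, assuming X is the numeric feature matrix from the previous step:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squashes every column into [0, 1]
mms = MinMaxScaler()
X_mms = mms.fit_transform(X)

# Fit the scaler on the training split only, and reuse the fitted scaler at prediction time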

Finally, you are ready to train! Feed the data to your model. Simpler tree-based models can just be left to run; they won't hassle you much unless something is wrong with your data. Deep learning models are far more powerful, but they need some coaxing to reach optimal performance. In both cases, do not tune hyperparameters by hand: use a random search or a grid search to find decent hyperparameters and move on. With a deep learning model, you will also want to explore early stopping. Here, we've used the XGBoost classifier, an extremely popular boosting algorithm. We define a range in which to search for hyperparameters, and GridSearchCV uses cross-validation to determine which settings are best for your model. Finally, save your model for later use by dumping it to a file.

import pickle
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBClassifier()

# Single-value lists here; widen these ranges to search properly
parameters = {'nthread': [4],
              'objective': ['binary:logistic'],
              'learning_rate': [0.05],
              'max_depth': [6],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [1000],  # number of trees
              'seed': [42],
              'gamma': [1]}

# Cross-validated grid search, refitting the best model on the full training set
clf = GridSearchCV(xgb_model, parameters, n_jobs=5,
                   scoring='roc_auc',
                   verbose=2, refit=True)

clf.fit(X_train, y_train)

# Persist the trained search object (and its best estimator) for later use
pickle.dump(clf, open("static_model.pkl", 'wb'))

Now you have a trained model. Test it! Results look good? Move on. Results are bad? Recheck your features and walk back through your conclusions from EDA. If those are fine, check your hyperparameters and train again. And watch out for misleadingly good models on imbalanced datasets: evaluate on a representative test set with an appropriate metric!
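
For the evaluation itself, a classification report and the ROC AUC go a long way, especially when classes are imbalanced. A sketch, assuming a held-out X_test and y_test produced by the same preprocessing pipeline:

from sklearn.metrics import classification_report, roc_auc_score

preds = clf.predict(X_test)
probs = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, preds))        # per-class precision, recall, F1
print("ROC AUC:", roc_auc_score(y_test, probs))    # threshold-free, imbalance-friendly
Evaluating the model on a held-out test set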

Finally, consider deploying your model to the world, an often overlooked part of machine learning. You want your end users to tap into your model's power without having to type out code. Building an app or a GUI might be the next step if you wish to take your model further.
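
As a flavour of what that can look like, here is a minimal sketch that wraps the pickled model in a small web service with Flask. The endpoint name and the assumption that the client sends ready-made numeric features are illustrative only; this was not part of the challenge code.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("static_model.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [0.3, 7.1, ...]}, already preprocessed
    features = request.get_json()["features"]
    pred = model.predict([features])[0]
    return jsonify({"malware": int(pred)})

if __name__ == "__main__":
    app.run()
A bare-bones way to serve the model without asking users to write code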

To summarize, I'll run through the second challenge: detecting DDoS attacks. A DDoS (Distributed Denial of Service) attack is carried out by spamming a server with requests from multiple sources until it can no longer handle the load. Usually an attacker gains access to multiple systems and directs them all to relentlessly send requests to a target server. Since network data is sent in packets, analyzing the packets might reveal whether the server was under attack or just handling normal traffic.

For this challenge, network traffic was captured with Wireshark and presented as pcap files, again labeled malicious and benign. At first glance, the malicious files appear to occupy about the same amount of storage as the benign ones (~20 GB each). Do not be tricked! pcap is a binary format, and its compact encoding hides the actual balance of the data. Each pcap file contained on the order of a hundred thousand packets, though this number varied wildly. While there was more benign data than malicious, the difference was not large enough to warrant any balancing. You now also have to translate these files into a matrix for a machine learning model.

Again, to begin EDA, you must extract some features first. The first task is to dig around and find a library for reading the binary files (dpkt does the job here). The dataset's size calls for some intelligent guessing about which features to use, and some background research into how features affect the dependent variable never hurts a model. The packets were then grouped by IP address, which works fine in this case. Now we have a relatively small dataset that can be handled quickly. A little more preprocessing converts it to numeric data, and it is ready for training. The data was also scaled with MinMaxScaler.
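
To give an idea of what reading a pcap looks like, here is a sketch using dpkt. The pcap_to_rows helper below is illustrative, not the original feature-extraction code, and "capture.pcap" is a placeholder path.

import socket
import dpkt

def pcap_to_rows(pcap):
    # Yield one dict of basic features per IP packet
    for ts, buf in pcap:
        eth = dpkt.ethernet.Ethernet(buf)
        if not isinstance(eth.data, dpkt.ip.IP):
            continue
        ip = eth.data
        yield {
            "ts": ts,
            "src": socket.inet_ntoa(ip.src),
            "dst": socket.inet_ntoa(ip.dst),
            "len": ip.len,
            "proto": ip.p,
        }

with open("capture.pcap", "rb") as f:
    rows = list(pcap_to_rows(dpkt.pcap.Reader(f)))
Turning raw packets into rows that can go into a DataFrame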

Moving on, the data was evaluated with random forests and support vector machines. After determining that the random forests did better, a hyperparameter search was done and the whole pipeline was assembled. From start to finish, in one sentence: read the file; extract features; preprocess features; classify. The entire process took under a second even for largish pcap files of close to 100 MB. The model is now ready to be deployed!

We read the pcap file, take out the packets, and build the table. Preprocessing of the features is done with the same settings as in training (SUPER CAUTION: reuse the fitted scaler, don't refit it), and then we predict. You just have to format your predictions now, and your trusty model will take care of the rest.

import sys
import pickle

import dpkt
import pandas as pd

# Load the trained classifier and the MinMaxScaler fitted during training
model = pickle.load(open(path_to_classifier, "rb"))
mms = pickle.load(open(path_to_scaler, "rb"))

# Read the pcap file passed on the command line
filename = sys.argv[1]
f = open(filename, 'rb')
pcap = dpkt.pcap.Reader(f)

# pcap_to_dict and proc_pcap_df are the same feature-extraction helpers used in training
df = pd.DataFrame(pcap_to_dict(pcap))
final_df = proc_pcap_df(df)

# Scale with the training-time scaler and keep the source IPs for reporting
X = final_df.drop(['src'], axis=1)
X = mms.transform(X)
ips = final_df["src"]
preds = model.predict(X)
The crux of the final file

The code used for the challenge is available at this link.