In this post we’ll talk about two topics I love and that have been central elements of my (private) research for the last ~7 years: machine learning and malware detection.
Having a rather empirical and definitely non-academic education, I know the struggle of a passionate developer who wants to approach machine learning and is trying to make sense of formal definitions, linear algebra and whatnot. Therefore, I’ll try to keep this as practical as possible in order to allow even the less formally-educated reader to understand and possibly start having fun with neural networks.
Moreover, most of the resources out there focus on well-known problems such as handwritten digit recognition on the MNIST dataset (the “hello world” of machine learning), while leaving to the reader’s imagination how more complex feature engineering systems are supposed to work and, in general, what to do with inputs that are not images.
TL;DR: I’m bad at math, MNIST is boring and detecting malware is more fun :D
I’ll also use this as an example use-case for some new features of ergo, a project chiconara and I started some time ago to automate machine learning model creation, data encoding, training on GPU, benchmarking and deployment at scale.
The source code related to this post is available here.
Important note: this project alone does NOT constitute a valid replacement for your commercial antivirus.
Traditional malware detection engines rely on the use of signatures - unique values that have been manually selected by a malware researcher to identify the presence of malicious code while making sure there are no collisions in the non-malicious samples group (that’d be called a “false positive”).
This approach has several problems. Among others, it’s usually easy to bypass (depending on the type of signature, changing a single bit or just a few bytes in the malicious code could make the malware undetectable), and it doesn’t scale well when the number of researchers is orders of magnitude smaller than the number of unique malware families they need to manually reverse engineer, identify and write signatures for.
Our goal is to teach a computer, more specifically an artificial neural network (ANN), to detect Windows malware without relying on any explicit signature database that we’d need to create. Instead, it will simply ingest the dataset of files we want to be able to detect and learn from it to distinguish malicious code from clean code, both inside the dataset itself and, most importantly, while processing new, unseen samples. Our only knowledge is which of those files are malicious and which are not, but not what specifically makes them so: we’ll let the ANN do the rest.
In order to do this, I’ve collected approximately 200,000 Windows PE samples, divided evenly into malicious (with 10+ detections on VirusTotal) and clean (known and with 0 detections on VirusTotal). Since training and testing the model on the very same dataset wouldn’t make much sense (it could perform extremely well on the training set but not be able to generalize at all to new samples), this dataset will be automatically divided by ergo into 3 subsets:
- A training set, with 70% of the samples, used for training.
- A validation set, with 15% of the samples, used to benchmark the model at each training epoch.
- A test set, with 15% of the samples, used to benchmark the model after training.
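ergo takes care of this split automatically, but the idea can be illustrated in a few lines of plain Python. This is only a sketch of a 70/15/15 split, not ergo's actual implementation: the function name, the fixed seed and the use of integers as stand-ins for samples are all assumptions made for illustration.

```python
import random

def split_dataset(samples, train=0.70, validation=0.15, seed=42):
    """Shuffle the samples, then cut them into train/validation/test lists."""
    shuffled = list(samples)
    # shuffling first avoids ending up with one class concentrated in one subset
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                  # used for training
        shuffled[n_train:n_train + n_val],   # benchmarked at each epoch
        shuffled[n_train + n_val:],          # benchmarked once, after training
    )

# integers stand in for the encoded samples
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

The test set is never seen during training, which is what makes its accuracy a fair estimate of how the model behaves on new samples.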
Needless to say, the amount of (correctly labeled) samples in your dataset is key for the model accuracy, its ability to correctly separate the two classes and generalize to unseen samples - the more you use in your training process, the better. Besides, ideally the dataset should be periodically updated with newer samples and the model retrained in order to keep its accuracy high over time even when new unique samples appear in the wild (namely: wget + crontab + ergo).
Due to the size of the specific dataset I’ve used for this post, I can’t share it without killing my bandwidth.
However, I uploaded the dataset.csv file to Google Drive: it’s ~340MB extracted and you can use it to reproduce the results of this post.
The Windows PE format is abundantly documented and many good resources to understand its internals, such as Ange Albertini’s “Exploring the Portable Executable format” 44CON 2013 presentation (from which I took the following picture), are available online for free, so I won’t spend too much time going into details.
The key facts we must keep in mind are:
- A PE has several headers describing its properties and various addressing details, such as the base address at which the PE is going to be loaded in memory and where the entry point is.
- A PE has several sections, each one containing data (constants, global variables, etc), code (in which case the section is marked as executable) or sometimes both.
- A PE contains a declaration of which APIs are imported and from which system libraries.
Credits to Ange Albertini
For instance, this is what the Firefox PE sections look like:
Credits to the “Machines Can Think” blog
In some cases, if the PE has been processed with a packer such as UPX, its sections might look a bit different, as the main code and data sections are compressed and a code stub that decompresses them at runtime is added:
Credits to the “Machines Can Think” blog
What we’re going to do now is look at how we can encode these values, which are very heterogeneous in nature (numbers of all types and ranges, and strings of variable length), into a constant-length vector of scalar numbers, each normalized to the interval [0.0,1.0]. This is the type of input that our machine learning model is able to understand.
The process of determining which features of the PE to consider is possibly the most important part of designing any machine learning system and it’s called feature engineering, while the act of reading these values and encoding them is called feature extraction.
After creating the project with:
ergo create ergo-pe-av
I started implementing the feature extraction algorithm inside the encode.py file, as a very simple starting point (150 lines including comments and multi-line strings) that still provides enough information to reach interesting accuracy levels and that could easily be extended in the future with additional features.
cd ergo-pe-av
vim encode.py
The first 11 scalars of our vector encode a set of boolean properties that LIEF, the amazing library from QuarksLab I’m using, parses from the PE - each property is encoded to 1.0 if true, or to 0.0 if false:
- True if the PE has a Load Configuration.
- True if the PE has a Debug section.
- True if the PE is using exceptions.
- True if the PE has any exported symbol.
- True if the PE is importing any symbol.
- True if the PE has the NX bit set.
- True if the PE has relocation entries.
- True if the PE has any resource.
- True if a rich header is present.
- True if the PE is digitally signed.
- True if the PE is using TLS.
Then 64 elements follow, representing the first 64 bytes of the PE entry point function, each normalized to [0.0,1.0] by dividing it by 255 - this will help the model detect those executables that have very distinctive entry points that only vary slightly among different samples of the same family (you can think about this as a very basic signature):
ep_bytes = [0] * 64
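Since the rest of that listing is not shown here, the following is a minimal sketch of the normalization step described above, not the original encode.py code. The function name and the zero-padding of short reads are assumptions; locating the entry point bytes inside the file is left out.

```python
def encode_entrypoint(ep_bytes, size=64):
    """Turn up to `size` raw entry point bytes into floats in [0.0, 1.0]."""
    values = list(ep_bytes[:size])
    # assumption: pad with zeros when fewer than `size` bytes could be read
    values += [0] * (size - len(values))
    return [b / 255.0 for b in values]

# a classic x86 prologue (push ebp; mov ebp, esp) used as sample input
vec = encode_entrypoint(bytes([0x55, 0x8B, 0xEC]))
print(len(vec), vec[:3])
```

Padding keeps the vector length constant even when the entry point sits close to the end of the file.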
Then a histogram of the occurrences of each possible byte value (therefore of size 256) in the binary file follows - this data point will encode basic statistical information about the raw contents of the file:
# the 'raw' argument holds the entire contents of the file
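A byte histogram of this kind could be computed as in the sketch below. It is written with the standard library only, and scaling every bin by the most frequent byte is an assumption made to keep the values in [0.0,1.0]; the original listing may normalize differently.

```python
from collections import Counter

def encode_histogram(raw):
    """Count how often each byte value (0-255) occurs and scale into [0.0, 1.0]."""
    counts = Counter(raw)  # iterating over a bytes object yields integers
    histo = [counts.get(value, 0) for value in range(256)]
    peak = max(histo)
    # assumption: divide by the most frequent byte; an empty file gives all zeros
    return [c / peak for c in histo] if peak > 0 else [0.0] * 256

histo = encode_histogram(b"MZ\x90\x00\x03\x00\x00\x00")
print(histo[0x00], histo[ord("M")])
```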
The next thing I decided to encode in the features vector is the import table, as the APIs being used by the PE are quite relevant information :D In order to do this, I manually selected the 150 most common libraries in my dataset and, for each API being used by the PE, I increment by one the column of the relative library, creating another histogram of 150 values that is then normalized by the total number of APIs being imported:
# the 'pe' argument holds the PE object parsed by LIEF
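To illustrate the idea without depending on LIEF, the sketch below takes the import table as a plain dictionary (library name to list of imported API names), which is a stand-in for what the parsed PE object would provide. The four-entry library list is a hypothetical, shortened replacement for the 150 manually selected libraries, and the function name is made up for this example.

```python
# hypothetical, shortened stand-in for the 150 manually selected libraries
LIBRARIES = ["kernel32.dll", "user32.dll", "advapi32.dll", "ws2_32.dll"]

def encode_imports(imports, libraries=LIBRARIES):
    """imports: dict mapping a library name to the list of APIs imported from it."""
    index = {name: i for i, name in enumerate(libraries)}
    histo = [0.0] * len(libraries)
    total = 0
    for library, functions in imports.items():
        total += len(functions)
        column = index.get(library.lower())
        if column is not None:  # libraries outside the list have no column
            histo[column] += len(functions)
    # normalize by the total number of imported APIs
    return [c / total for c in histo] if total > 0 else histo

sample = {
    "KERNEL32.dll": ["CreateFileW", "ReadFile", "CloseHandle"],
    "USER32.dll": ["MessageBoxW"],
}
print(encode_imports(sample))
```

Library names are lowercased before the lookup because PE import tables are not consistent about casing.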
We proceed to encode the ratio of the PE size on disk vs the size it’ll have in memory (its virtual size):
min(sz, pe.virtual_size) / max(sz, pe.virtual_size)
Next, we want to encode some information about the PE sections, such as the number of them containing code vs the ones containing data, the sections marked as executable, the average Shannon entropy of each one and the average ratio of their size vs their virtual size - these datapoints will tell the model if and how the PE is packed/compressed/obfuscated:
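As a rough sketch of how these four datapoints could be computed: each section is represented here by a plain dictionary whose keys are hypothetical, and dividing the entropy by 8 (its maximum, in bits per byte) is an assumption made to keep the value in [0.0,1.0].

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Entropy in bits per byte: 0 for constant data, at most 8 for random data."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def encode_sections(sections):
    """sections: list of dicts with hypothetical keys
    'content', 'size', 'vsize', 'is_code' and 'is_exec'."""
    n = len(sections)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    ratios = [min(s["size"], s["vsize"]) / max(s["size"], s["vsize"], 1)
              for s in sections]
    return [
        sum(s["is_code"] for s in sections) / n,   # share of code sections
        sum(s["is_exec"] for s in sections) / n,   # share of executable sections
        sum(shannon_entropy(s["content"]) / 8.0 for s in sections) / n,
        sum(ratios) / n,                           # average size / virtual size
    ]

sections = [
    {"content": bytes(range(256)), "size": 256, "vsize": 512,
     "is_code": True, "is_exec": True},
    {"content": b"\x00" * 256, "size": 256, "vsize": 256,
     "is_code": False, "is_exec": False},
]
print(encode_sections(sections))
```

Compressed or encrypted sections tend to have entropy close to the maximum, which is why this value is a useful hint that a packer was used.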
Last, we glue all the pieces into one single vector of size 486:
v = np.concatenate([ \
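The listing above is cut short, so here is a sketch of the gluing step using plain lists instead of NumPy. The group names are made up for illustration. The sizes of the first five groups come from the text above, while the 4 values for the sections are inferred from the four section datapoints described and from the 486 inputs the model expects.

```python
# size of each group of features, in the order they were introduced
LAYOUT = [
    ("properties", 11),   # boolean PE properties
    ("entrypoint", 64),   # first bytes of the entry point
    ("histogram", 256),   # byte histogram
    ("libraries", 150),   # imports per library
    ("size_ratio", 1),    # size on disk vs virtual size
    ("sections", 4),      # inferred: 486 minus all of the above
]

def build_vector(parts):
    """parts: dict mapping each group name to its list of floats."""
    vector = []
    for name, size in LAYOUT:
        values = list(parts[name])
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector

# all-zero placeholder features, only to check the final length
parts = {name: [0.0] * size for name, size in LAYOUT}
vector = build_vector(parts)
print(len(vector))
```

Checking each group's length before concatenating catches the most common mistake with fixed-size inputs: a feature group silently changing size and shifting every column after it.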
The only thing left to do is to tell our model how to encode the input samples by customizing the prepare_input function in the prepare.py file previously created by ergo - the following implementation supports the encoding of a file given its path, given its contents (sent as a file upload to the ergo API), or just the evaluation on a raw vector of scalar features:
# used by `ergo encode <path> <folder>` to encode a PE in a vector of scalar features
Now we have everything we need to transform something like this, to something like this:
Assuming you have a folder containing malicious samples in the pe-malicious subfolder and clean ones in pe-legit (feel free to give them any name, but the folder names will become the labels associated with each of the samples), you can start encoding them into a dataset.csv file that our model can use for training with:
ergo encode /path/to/ergo-pe-av /path/to/dataset --output /path/to/dataset.csv
Grab a coffee and relax: depending on the size of your dataset and how fast the disk where it’s stored is, this process might take quite some time :)
While ergo is encoding our dataset, let’s take a break to discuss an interesting property of these vectors and how to use it.
It’ll be clear to the reader by now that structurally and/or behaviourally similar executables will have similar vectors, where the distance/difference between one vector and another can be measured, for instance, by using the cosine similarity, defined as:
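For two feature vectors A and B of n elements each, the standard definition (written here in LaTeX) is:

```latex
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}
                    {\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}
```

Since every feature in our vectors is normalized to [0.0,1.0] and therefore non-negative, the result always falls between 0 (nothing in common) and 1 (same direction).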
This metric can be used, among other things, to extract from the dataset (which, let me remind you, is a huge set of files you don’t really know much about other than whether they’re malicious or not) all the samples of a given family given a known “pivot” sample. Say, for instance, that you have a Mirai sample for MIPS, and you want to extract every Mirai variant for any architecture from a dataset of thousands of different unlabeled samples.
The algorithm, which I implemented inside the sum database as the findSimilar “oracle” (a fancy name for a stored procedure), is quite simple:
// Given the vector with id="id", return a list of
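The oracle's code is truncated above, so what follows is not the sum implementation but a Python sketch of the same idea: compare the pivot against every other vector and keep the ones that are similar enough. The function names, the in-memory dictionary and the 0.9 threshold are all assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def find_similar(vectors, pivot_id, threshold=0.9):
    """vectors: dict id -> feature vector. Returns {id: similarity}, pivot excluded."""
    pivot = vectors[pivot_id]
    results = {}
    for vector_id, vector in vectors.items():
        if vector_id == pivot_id:
            continue
        similarity = cosine_similarity(pivot, vector)
        if similarity >= threshold:
            results[vector_id] = similarity
    return results

# tiny made-up vectors standing in for real 486-element ones
vectors = {
    "pivot": [1.0, 0.0, 0.5],
    "variant": [0.9, 0.1, 0.5],
    "unrelated": [0.0, 1.0, 0.0],
}
print(find_similar(vectors, "pivot"))
```

This is a linear scan, so its cost grows with the size of the dataset; that is fine for a one-off search for variants of a single sample.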
Yet quite effective:
Meanwhile, our encoder should have finished doing its job and the resulting dataset.csv file, containing all the labeled vectors extracted from each of the samples, should be ready to be used for training our model … but what does “training our model” actually mean? And what’s this “model” in the first place?
The model we’re using is a computational structure called an artificial neural network, which we’re training using the Adam optimization algorithm. Online you’ll find very detailed and formal definitions of both, but the bottom line is:
An ANN is a “box” containing hundreds of numerical parameters (the “weights” of the “neurons”, organized in layers) that are multiplied with the inputs (our vectors) and combined to produce an output prediction. The training process consists in feeding the system with the dataset, checking the predictions against the known labels, changing those parameters by a small amount, observing if and how those changes affected the model accuracy and repeating this process for a given number of times (epochs) until the overall performance has reached what we defined as the required minimum.
Credits to nature.com
The main assumption is that there is a numerical correlation among the datapoints in our dataset that we don’t know about but that, if known, would allow us to divide that dataset into the output classes. What we do is ask this black box to ingest the dataset and approximate such a function by iteratively tweaking its internal parameters.
In the model.py file you’ll find the definition of our ANN, a fully connected network with two hidden layers of 70 neurons each, ReLU as the activation function and a dropout of 30% during training:
n_inputs = 486
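To make that architecture less abstract, here is a framework-free sketch of a single forward pass through such a network, with untrained random weights. It is not the content of model.py: the helper names are made up, and the output layer (two neurons with softmax, one per class) is an assumption.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum of all the inputs per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def dropout(values, rate, rng):
    """Training only: zero each value with probability `rate`, rescale the rest."""
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in values]

def softmax(values):
    peak = max(values)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases, i.e. an untrained layer."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, rng=None, rate=0.3):
    *hidden, output = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
        if rng is not None:  # pass an rng only while training
            x = dropout(x, rate, rng)
    return softmax(dense(x, *output))

rng = random.Random(0)
# 486 inputs -> 70 -> 70 -> 2 outputs (assumed: one per class)
layers = [init_layer(486, 70, rng), init_layer(70, 70, rng), init_layer(70, 2, rng)]
probabilities = forward([0.5] * 486, layers)
print(probabilities)
```

With random weights the two output probabilities are meaningless; training is what adjusts the weights until they match the labels.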
We can now start the training process with:
ergo train /path/to/ergo-pe-av --dataset /path/to/dataset.csv
Depending on the total amount of vectors in the CSV file, this process might take from a few minutes, to hours, to days. In case you have GPUs on your machine, ergo will automatically use them instead of the CPU cores in order to significantly speed the training up (check this article if you’re curious why).
Once done, you can inspect the model performance statistics with:
ergo view /path/to/ergo-pe-av
This will show the training history, where we can verify that the model accuracy indeed increased over time (in our case, it got to 97% accuracy around epoch 30), and the ROC curve, which tells us how effectively the model can distinguish malicious samples from clean ones (an AUC, or area under the curve, of 0.994 means that the model is pretty good):
Moreover, a confusion matrix for each of the training, validation and test sets will also be shown. The diagonal values from the top left (dark red) represent the number of correct predictions, while the other values (pink) are the wrong ones (our model has a 1.4% false positive rate on a test set of ~30000 samples):
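For readers new to confusion matrices, this sketch shows how accuracy and error rates are derived from the four cells of a two-class matrix. The counts below are made up for illustration and are not the ones from the matrices above. The false positive rate is computed here as the fraction of clean samples flagged as malicious.

```python
def binary_metrics(tn, fp, fn, tp):
    """tn/tp: correct predictions (the diagonal); fp/fn: the wrong ones."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # clean files wrongly flagged, out of all clean files
        "false_positive_rate": fp / (fp + tn),
        # malicious files missed, out of all malicious files
        "false_negative_rate": fn / (fn + tp),
    }

# made-up counts, NOT the results of this post
metrics = binary_metrics(tn=950, fp=50, fn=30, tp=970)
print(metrics)
```

For an antivirus, the false positive rate matters a lot in practice, because every false positive is a legitimate program being blocked.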
97% accuracy on such a big dataset is a very interesting result considering how simple our feature extraction algorithm is. Many of the misdetections are caused by packers such as UPX (or even just self-extracting zip/msi archives) that affect some of the datapoints we’re encoding - adding an unpacking strategy (such as emulating the unpacking stub until the real PE is in memory) and more features (bigger entrypoint vector, dynamic analysis to trace the API being called, imagination is the limit!) is the key to getting it to 99% :)
We can now remove the temporary files:
ergo clean /path/to/ergo-pe-av
Load the model and use it as an API:
ergo serve /path/to/ergo-pe-av --classes "clean, malicious"
And request its classification from a client:
curl -F "x=@/path/to/file.exe" "http://localhost:8080/"
You’ll get a response like the following (here the file being scanned):
The model detecting a sample as malicious with over 99% confidence.
Now you can use the model to scan whatever you want, enjoy! :)