In the following post I would like to talk about the potential use of machine learning algorithms for malware detection, especially in environments like the Android operating system, where obfuscation is not that common, apps are often unprotected (essentially just zipped files) and file metadata, such as execution permissions, can be relevant.
I would also like to show how to get started with machine learning using decision trees and the well-known scikit-learn library. Machine learning is a huge field, and a popular one these days. I've chosen a decision tree model because of its ease of use and interpretability, along with its acceptable speed of operation (especially compared with neural networks).
Basically, a decision tree is a tree structure where each node asks a question; depending on the answer you choose one direction or the other, landing in a new node and repeating the process until you reach a node with no nodes below it (a leaf). When you land in that destination node (leaf) you have your result: the classification.
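That traversal can be pictured as nothing more than nested if/else checks. Here is a toy sketch of a hand-written tree; the feature names and thresholds are invented for illustration (though they resemble the iris tree we will build later):

```python
def classify(sample):
    """Walk a hand-written decision tree: each node asks a question,
    and the answer sends us down a branch until we hit a leaf."""
    if sample["petal_width"] < 0.8:        # question at the root node
        return "setosa"                    # leaf: classification done
    elif sample["petal_length"] < 4.9:     # question at an internal node
        return "versicolor"
    else:
        return "virginica"

print(classify({"petal_width": 0.2, "petal_length": 1.4}))  # setosa
```

Training a decision tree is just the process of discovering which questions to ask, and in which order, from the data itself.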
We generate decision trees starting from a dataset. A dataset is like a big table with columns and rows, where each column is a property of every instance (row). In supervised learning problems like the ones we solve with decision trees, a final column in that dataset contains the classification (type a, type b…).
And we generate that tree with an algorithm like CART, the one sklearn uses. The algorithm goes through each column and computes an information index for it; once it has done this for every column, it knows which column (property) is the best one for classification, and it creates a node from that step. So if the selected column is "height", the algorithm will evaluate the examples and learn that when height is greater than 1.2 the samples correspond to one class, and when it is less they correspond to another. Then the algorithm repeats the process in the next step, excluding the column it just used. If, after selecting "height", all the remaining samples belong to a single class, we can stop there.
We can also think of the algorithm this way. Suppose one of our features is the weather, which can be "sun", "wind" or "rain", and our classification feature is "yes" or "no". If the weather feature gets selected, we create a node, and if we have samples of every kind (sun, wind, rain) we create three new nodes below it. On each of those nodes we repeat the process, keeping only the samples (rows of the table) that match the respective value (sun, wind or rain), and we run the algorithm again. If, for example, all the samples under "wind" are classified as "yes", then we are done there: "wind" becomes a final node (leaf) carrying the classification.
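The "information index" used to pick the best column can be sketched with Gini impurity, the criterion CART uses: for each candidate column, compute the impurity of every branch it would create and average them, weighted by branch size; the column with the lowest weighted impurity wins. A minimal sketch over a made-up weather table (the data and column names are invented):

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, labels, column):
    # Average the impurity of each branch, weighted by branch size
    total = len(rows)
    score = 0.0
    for value in set(r[column] for r in rows):
        branch = [l for r, l in zip(rows, labels) if r[column] == value]
        score += len(branch) / total * gini(branch)
    return score

# Made-up table: does the weather or the wind predict the label better?
rows = [{"weather": "sun",  "windy": True},
        {"weather": "sun",  "windy": False},
        {"weather": "rain", "windy": True},
        {"weather": "rain", "windy": False}]
labels = ["yes", "yes", "no", "no"]

print(weighted_gini(rows, labels, "weather"))  # 0.0 -> perfect split
print(weighted_gini(rows, labels, "windy"))    # 0.5 -> useless split
```

Here "weather" separates the classes perfectly (impurity 0), so the algorithm would pick it for the node, exactly as described above.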
Getting started with sklearn
The first thing we will need to follow this tutorial is the scikit-learn library and the whole, let's call it "python-datascience" toolset. We will mainly be using pandas, numpy and sklearn, plus graphviz.
You will see these libraries used in many different PoCs, and I think some researchers don't take the time to explain what these tools are and what role each one plays in the project.
pandas is a library for working with data structures in Python. We use pandas in projects like this because machine learning algorithms need data structures such as arrays, and since we work with large amounts of data we need those structures and their operations to be optimized. In brief, pandas handles that data efficiently; we will see what it does in particular in our code.
numpy is a Python library aimed at scientific computing. Again, in machine learning we will process our information in many ways, performing operations (some of them floating point) over a lot of data. numpy allows us to perform those operations in an optimal way.
sklearn, or scikit-learn, is the library that implements all the "machine learning" algorithms we will be using. graphviz is used for generating and storing images of the trees.
So first of all, we'll have to install those libraries. My starting point here will be a clean (x)Ubuntu 18.04 machine. Whenever I can I tend to use python-pip to install everything. So we can do:
$ pip install sklearn
$ pip install pandas
$ pip install graphviz
$ pip install numpy
$ pip install IPython
$ pip install pydotplus
$ apt-get install python-graphviz
pydotplus and IPython may not be necessary, as they are related to the Jupyter notebook.
So now we have our tools ready, and we need to choose a dataset to start working on. I don't want to jump directly to an Android malware dataset, as I prefer that you understand what we are doing instead of copying and pasting source code. Working with a small dataset with only a few attributes is the best way to start and to learn how the algorithm works.
There is a world-known dataset, used in every tree classifier example, named "the IRIS dataset". It is about flower classification: it contains the sepal and petal length and width of different flowers, and their classification into 3 types.
We can download it here, but as it is used in almost every example, it already comes with sklearn by default.
When selecting your dataset you should browse the site you got it from to learn about its columns and their values; it is also important to know how many values it contains and the format. Common datasets come in CSV or JSON.
We can easily use pandas to gather some basic information about the dataset.
The IRIS dataset
import sklearn.datasets as datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df)
So as we see here, we start by loading the iris dataset and then we load that dataset into a pandas DataFrame.
As we see, that prints a large amount of data: not the whole dataset, but a large enough amount to be annoying. We can use head() to print only the first n values, though.
import sklearn.datasets as datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head(5))
But those first 5 rows may not be a good example of what we have in our dataset, so a good way to get a picture of the whole thing is to select n random samples from it.
import sklearn.datasets as datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df2 = df.sample(2)
print(df2)
With that, we can quickly know what columns we have, along with their type and range of values. The most important things to identify here are the length of the dataset, the type of the values (categorical vs numeric) and their range.
The other must-know here is the describe() function, which gives a basic statistical description/analysis of the dataset. The result is self-explanatory:
import sklearn.datasets as datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.describe())
So as you see from this we know that we have 150 samples, and (counting the target values) 50 samples of each class, so we can say that the dataset is balanced. In some cases we may have incomplete data. The mean, standard deviation and min value are also useful for figuring out what data we have. For example, knowing the mean and std plus the percentiles, we can try to approximate our data with a normal distribution and perform "effective" predictions using less data.
Also take a special look at the following line from the code below: y = iris.target. As you know, we are facing a classification problem here, so our instances will be classified into three possible classes. In this dataset the class of each sample lives in iris.target; in other datasets it will be a column named "class", "type" or something similar. That column represents the class of our samples, so the algorithm can learn how to classify them based on the actual data. This is why it is called supervised learning.
So we create our model by separating the data used for generating (training) the model from the column that identifies the class of each row.
Let's now look at the whole code before moving on to something more complex.
Generating a prediction model
import sklearn.datasets as datasets
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
print(y)  # print the classes

dtree = DecisionTreeClassifier()
dtree.fit(df, y)

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, filled=True,
                rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
graph.write_png("iris.png")
So we load the dataset and then separate the data used for training from the column used for classification. We print the classification column, just to be sure we understand what it contains. We can see that we have 3 classes, each represented by a number; it is our duty to know which number represents which class.
Then we create a DecisionTreeClassifier model and load our data into it; note that we load it as fit(features, classes). What comes next simply generates a png image of the model for a proper visualization of the tree:
This will output something like:
Then it will generate something like:
This is the actual classification tree. Once a new instance is given for classification, it flows from the top of the tree (the root) to one of the final nodes (a leaf), getting classified at the end. So what can we observe from the tree?
In every node you can find:
Xn is the condition to meet: if the condition is X1 >= 3, for example, it means that on that node the resulting model compares the value of the first column of the sample (imagine a vector) with 3; if the condition is true it continues down the branch marked as true, otherwise down the false branch.
Then we have the gini parameter. Gini is a metric used by the decision tree algorithm (CART). Gini impurity measures the probability of an element being classified wrongly if it were labeled at random according to the class distribution in the node: we calculate it as the sum, over the classes, of the probability of picking an element of that class multiplied by the probability of mislabeling it.
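That definition collapses to a simple formula, gini = sum over classes of p_i * (1 - p_i), which equals 1 - sum(p_i ** 2). A quick sketch, worked on the class counts you would see at the root of the iris tree (50 samples of each of the 3 classes) and at a pure leaf:

```python
def gini_impurity(class_counts):
    # gini = sum over classes of p_i * (1 - p_i)
    # p_i is the fraction of samples in the node belonging to class i
    total = sum(class_counts)
    return sum((c / total) * (1 - c / total) for c in class_counts)

print(round(gini_impurity([50, 50, 50]), 3))  # iris root node -> 0.667
print(gini_impurity([50, 0, 0]))              # pure leaf -> 0.0
```

A gini of 0 means the node is pure (every sample in it belongs to one class), which is why leaves in the generated image show gini = 0.0.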
Then we have the number of samples that reach that node, and also how many samples of each category reach it.
Let's go through an example. We have a new plant with a sepal length of 5.3, a sepal width of 3.6, a petal length of 1.4 and a petal width of 0.2, and we want to classify it using our brand new tree.
We go from top to bottom, starting by measuring our fourth parameter, the petal width of 0.2; as it is less than 0.8 we go left, and we have our classification done: the sample falls into the first class (0, setosa).
But in a real-life situation it is not enough to have this fantastic tree, as we will want this process automated. We can use the following code to declare and classify an instance from a vector of features (an array):
import sklearn.datasets as datasets
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

dtree = DecisionTreeClassifier()
dtree.fit(df, y)

print(df.head())
print("Sample to predict: ")
sample = [9.2, 9.2, 9.1, 9.1]
print(sample)
sample1 = np.array(sample)
sample1 = sample1.reshape(1, -1)
print("Sample predicted as: ")
result = dtree.predict(sample1)
print(result)
The main thing to look at here is that we first create a Python list, then create a np.array from it, and then use reshape to prepare our feature vector to be read by the algorithm. In general terms numpy arrays are far more efficient than Python lists, especially when it comes to massive batch processing; a good reference can be found here. We need numpy because it has a lot of built-in vector operations needed for fast processing.
Here we use reshape to convert a one-dimensional array into a two-dimensional one. We do that because these algorithms work with two-dimensional arrays: they are prepared to receive a batch of feature vectors and return a list of predictions. If we pass [[vector]] there is no need for reshape, and if we pass a vector of feature vectors we get a vector of results (predictions).
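A quick look at the shapes makes the reshape step clearer; reshape(1, -1) turns a flat 4-value vector into a batch of size one:

```python
import numpy as np

sample = np.array([9.2, 9.2, 9.1, 9.1])
print(sample.shape)            # (4,) -> one flat vector

batch = sample.reshape(1, -1)  # -1 lets numpy infer the column count
print(batch.shape)             # (1, 4) -> a batch holding one vector

# Stacking several vectors gives a batch; predict() on this
# would return one prediction per row
batch3 = np.array([sample, sample, sample])
print(batch3.shape)            # (3, 4)
```

So predict() always sees a table of samples, even when we only have one.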
Note that, due to the randomness in the algorithm, we may get a slightly different tree in each execution; we will explore persistence in a later section.
Now that we have a basic knowledge of how to use a decision tree in Python, let's move on to a real-life example.
Hunting Android threats
As we know, malware is a common threat nowadays. From my experience, a lot of malware analysis focuses on Microsoft Windows systems, but there is a huge landscape of malware on mobile devices, especially on Android. Artifacts targeting Android devices are commonly used in APT schemes as part of the process. Another thing one can note is that obfuscation is not always present in Android applications.
For example, let's take a look at this GitHub Android malware repo.
We can see a lot of APT related Android APK samples:
We can see the fake Skype apk there. As a lot of Android apps have zero obfuscation, we can simply treat them as .zip files, unpack them and explore a little.
There is a bunch of information there. I won't be looking at the code this time, but at one interesting file we can find there: the AndroidManifest.xml. It contains a lot of metadata related to the APK, and this can be useful for running classification models on those samples.
You can peek at many different fields of the AndroidManifest; as it is an XML file, parsing it may seem "simple".
There are many different tools and tricks for parsing the AndroidManifest XML file, but if you want to go straight to the point you can simply use axmlparser and do:
pip install axmlparser
Then a simple script for retrieving the list of permissions of an Android app will look like:
#!/usr/bin/python3
import axmlparserpy.apk as apk

ap = apk.APK('./android-malware/rouge_skype/skype.apk')
print("Package - ", ap.get_package())
print("Permissions - ", ap.get_permissions())
With an output like:
So now that we know how to retrieve the permissions of an Android file, that could be a great first approach to malware detection on Android systems. What we would need next is a large amount of Android NON-MALWARE APKs and another large amount of Android MALWARE APKs; with those two large datasets we could make a parsing algorithm retrieve the permissions of each file.
But how do we train a classification tree if we get 2 permissions in one file, then 10 in another, then 0 in a third, 1 in a fourth, and so on?
As we need a "standardized" vector, the only way to get one is to list the more than 200 (330, to be more specific) different types of permission and perform a check on each sample, filling a vector with 1 if the APK has that permission and 0 if not.
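The idea can be sketched in a few lines. The master list below is a tiny hypothetical subset (the real one would hold all ~330 permissions), but the mechanism is the same: every APK becomes a fixed-length 0/1 vector with the columns always in the same order.

```python
# Hypothetical subset of the master list -- the real list would
# enumerate all ~330 Android permissions in a fixed order
ALL_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.CAMERA",
]

def permissions_to_vector(apk_permissions):
    """Return 1 if the APK requests the permission, 0 otherwise,
    always in the same column order so every sample lines up."""
    requested = set(apk_permissions)
    return [1 if p in requested else 0 for p in ALL_PERMISSIONS]

print(permissions_to_vector(["android.permission.INTERNET",
                             "android.permission.SEND_SMS"]))
# [1, 1, 0, 0]
```

Feeding one of these vectors per APK, plus a 0/1 malware label, gives exactly the kind of table a decision tree expects.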
Luckily for me, I'm not the only one doing malware research on Android, and I found this fantastic dataset on Kaggle. It contains a bunch of Android APK permissions extracted from the manifest, classified using a "type" column, where 0 is non-malware and 1 is malware.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.externals.six import StringIO
from IPython.display import Image, display
from sklearn.tree import export_graphviz
import pydotplus

dataset = pd.read_csv("train.csv", sep=";")
print(dataset.shape)
print(dataset.head())

X = dataset.drop('type', axis=1)
y = dataset['type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

dtree = classifier
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, filled=True,
                rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
graph.write_png("malware.png")
I will comment on the code right after looking at the tree it generates. As you can see, it basically reads the dataset and builds a decision tree with sklearn. Just look at the tree that gets generated:
Of course the algorithm selects only the relevant features (this is important for our analysis), but it still becomes a really long tree and is thus difficult to read. We need another way of evaluating our algorithm.
Now let’s inspect the code:
dataset = pd.read_csv("train.csv", sep=";")
print(dataset.shape)
print(dataset.head())

X = dataset.drop('type', axis=1)
y = dataset['type']
First of all we read the dataset we just downloaded; note that we use read_csv because our dataset is in that format. The CSV default separator is "," so we have to indicate that our separator is ";", as you see in the code.
Then we check dataset.shape to see the dimensions of the dataset, print some top values, and separate the training data from the classification data, "type" being the column that assigns each sample to one category or another.
Then we proceed to train our algorithm and generate the tree:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
As we want to understand and evaluate our algorithm in an efficient way, one common approach is to separate our dataset into a training batch and a test (validation) batch. In this case the program randomly extracts 20% of the values of our dataset and reserves them for testing, training on the remaining 80%. After the tree gets generated, the algorithm tries to predict the class of that 20%, and as the program already knows which class those samples belong to, it can compute the classification success rate. There is a whole lot of rocket science here, as there are many different techniques and tricks when it comes to benchmarking and testing our algorithms. I would like to keep posting about it, but in this article I'm covering just the fundamentals so you can understand how the basic stuff works and move on from there.
Finally we have this one:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Here we predict the classes of the test samples and then print the confusion matrix and the classification report built from those predictions.
So what is the confusion matrix?
[[33  2]
 [ 4 41]]
This confusion matrix tells us that 33 instances of our test data (the 20% split) belonging to class 0 (non-malware) have been correctly classified as non-malware, while 2 have been wrongly classified as malware. Then 4 malware samples have been classified as non-malware, but 41 of them have been correctly classified as malware. We use the confusion matrix to get a general idea of how our model is performing; if we need more precise detail, we can print the report.
              precision    recall  f1-score   support

           0       0.89      0.94      0.92        35
           1       0.95      0.91      0.93        45

   micro avg       0.93      0.93      0.93        80
   macro avg       0.92      0.93      0.92        80
weighted avg       0.93      0.93      0.93        80
What we have to look at here is the precision, which tells us we get 89% good classifications for class 0 (non-malware) and 95% for the malware class. The recall is calculated as true positives / (true positives + false negatives); it is intuitively the ability of the classifier to find all the positive samples. The F1-score is calculated as F1 = 2 * (precision * recall) / (precision + recall); it can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0. Finally, the support is the number of instances in the test data that have 0 or 1 as their "type" (in our specific case).
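These numbers can be reproduced by hand straight from the confusion matrix above, which is a good sanity check on what each metric means (taking malware, class 1, as the positive class):

```python
# From the confusion matrix: rows are the true class, columns the predicted one
tn, fp = 33, 2    # true class 0: 33 right, 2 wrongly flagged as malware
fn, tp = 4, 41    # true class 1: 4 missed, 41 caught

precision_1 = tp / (tp + fp)   # 41 / 43: of everything flagged, how much was malware
recall_1 = tp / (tp + fn)      # 41 / 45: of all malware, how much we caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.95 0.91 0.93 -- matching the class 1 row of the report
```

The class 0 row works the same way with the roles of the columns swapped.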
We would like to have a high support, so we can rely on our data, and a high f1-score. If you just want an algorithm that "works", make sure your metrics go above 0.90; beyond that, it depends a lot on what you want to achieve. If you are detecting diseases, you won't be OK with a precision of 0.90. In our case it is fine if we miss some samples, as we can deploy this algorithm alongside traditional methods for a better probability of success.
Now that we have our fantastic model, and we love it, trained with millions of examples, we don't want to repeat the whole process every time: we need persistence. We can get persistence by serializing our tree to a file using pickle, a library for serializing Python objects.
pickle is part of the Python standard library, so there is nothing extra to install.
We import the library and use the following code to save and then restore our DecisionTree classifier, named classifier:
import pickle

# Dump the trained decision tree classifier with pickle
decision_tree_pkl_filename = 'decision_tree_classifier_20170212.pkl'
# Open the file to save as a pkl file
decision_tree_model_pkl = open(decision_tree_pkl_filename, 'wb')
pickle.dump(classifier, decision_tree_model_pkl)
# Close the pickle instance
decision_tree_model_pkl.close()

# Load the saved decision tree model pickle
decision_tree_model_pkl = open(decision_tree_pkl_filename, 'rb')
decision_tree_model = pickle.load(decision_tree_model_pkl)
print("Loaded Decision tree model :: ", decision_tree_model)