Multi-Kernel Learning Model

Our model combines three multiclass SVMs, one trained on each of the three kernels. Each multiclass classifier works one-vs-rest, so for each kernel we are actually training five SVMs: one classifier for benign vs. the rest of the apps, another for adware vs. the rest, and so on. The ensemble is a max-voting model: it picks the most common prediction across the three kernels as the final prediction. This helps mask the weaknesses of certain kernels while amplifying the strengths of others.
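The max-voting step can be sketched as follows; `max_vote` is a hypothetical helper name, and the three arguments stand for the per-kernel multiclass predictions:

```python
from collections import Counter

import numpy as np

def max_vote(*kernel_preds):
    """Pick the most common prediction across kernels for each app.
    Each argument is an array of class labels, one per app; ties go to
    the first kernel's prediction (Counter preserves insertion order)."""
    stacked = np.vstack(kernel_preds)
    return np.array([Counter(col).most_common(1)[0][0] for col in stacked.T])
```

For example, `max_vote([0, 1, 2], [0, 1, 3], [1, 1, 3])` returns `array([0, 1, 3])`: the second app is unanimous, while the first and third are decided by a two-of-three majority.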

HinDroid is a research paper describing a model that builds a heterogeneous information network from the API calls in an app's code. We first generate matrices that encode features about the APIs. These matrices can be multiplied together to create kernels, which can be visualised as a network graph, allowing us to draw paths between two different apps.

Matrix A (apps × APIs) - encodes which APIs exist in a given app. This is similar to a bag-of-words model.

Matrix B (APIs × APIs) - encodes APIs that occur in the same code block.

Matrix P (APIs × APIs) - encodes APIs that are invoked by the same package. In smali files, APIs follow the convention package->API.

AA^T: Gives the number of APIs shared between two apps.

ABA^T: Gives the number of APIs that coexist in the same code block for two apps.

APA^T: Gives the number of APIs that share the same package for two apps.
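A toy sketch of the kernel construction (the shapes and values here are illustrative; the real A, B, and P are large sparse matrices built from the smali code):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 apps x 4 APIs: A[i, j] = 1 if app i contains API j
A = csr_matrix(np.array([[1, 0, 1, 1],
                         [0, 1, 1, 0],
                         [1, 1, 0, 1]]))

# B[j, k] = 1 if APIs j and k appear in the same code block
# (here, APIs 0 and 1 share a block; every API trivially shares with itself)
B = csr_matrix(np.array([[1, 1, 0, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 0],
                         [0, 0, 0, 1]]))

K_aat = (A @ A.T).toarray()       # shared APIs between each pair of apps
K_abat = (A @ B @ A.T).toarray()  # APIs co-occurring in a code block, across apps
```

The APA^T kernel is built the same way, with P in place of B. Note that both kernels are symmetric app-by-app similarity matrices.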

The Data

Our dataset consists of 300 benign apps and 300 apps of each malware category. To get a good and varied representation of benign applications, our benign dataset is made up of apps from 6 different categories.

Malware Classifications


Adware

Infests an app or webpage with advertisements that can prevent the user from accessing the site/app as originally intended.


Trojan

Malware disguised as safe, legitimate software that stealthily runs malicious code on a user's computer.


Backdoor

Malware that gains remote access to a system or device by exploiting vulnerabilities to bypass security measures.


Ransomware

Malware that denies a user access to their computer until a ransom demand (usually a payment) has been met.

EDA

Correlation Coefficients


Correlation coefficients measure the strength of the relationship between two variables; in our case, between the presence of an API and an app's category. APIs with correlation coefficients close to zero were of little importance in determining the category and classification of an app, and could thus be dropped to reduce the dimensionality of our dataset. We focused on APIs with a correlation coefficient greater than 0.5.
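The filter can be sketched as follows, assuming a 0/1 app-by-API feature matrix and numeric labels (the helper name is ours):

```python
import numpy as np

def api_label_corr(X, y):
    """Pearson correlation between each API-presence column of X and label y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    denom[denom == 0] = 1.0            # constant columns get correlation 0
    return (Xc.T @ yc) / denom

# Toy data: the first API tracks the label perfectly, the second is noise
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)
corr = api_label_corr(X, y)
keep = np.flatnonzero(corr > 0.5)      # the > 0.5 filter from the text
```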

As shown, the APA^T kernel has only one API with a correlation coefficient greater than 0.5. This is because embedding information such as packages into our features generalizes a package's APIs as either benign or malicious; using package information is ultimately a coarse way to judge whether an individual API is benign or malicious. Obtaining these correlation coefficients for each API and kernel was useful; however, it did not explain which category each API was linked to.

AA^T: 111 APIs with correlation coefficients greater than 0.5

ABA^T: 29 APIs with correlation coefficients greater than 0.5

APA^T: 1 API with a correlation coefficient greater than 0.5

All kernels: 0 APIs with correlation coefficients less than -0.5

Ranking Algorithm


We built a ranking algorithm that takes into account the frequency of an API's occurrence in a particular app category, along with its uniqueness among the other categories. Our ranking-algorithm heatmap shows the top 5 APIs for each category; the center number shows the frequency each API had within its respective category.

High frequency in 1 category + Low Frequency in others => Strong Classification Influence

Adware: setInAnimation(), setOutAnimation()

Trojan: killProcess()

Ransomware: onDisabled(), onEnabled()
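The report does not give the exact scoring formula, but a TF-IDF-like sketch captures the idea (in-category frequency scaled down by how prevalent the API is in the other categories; the function name and weighting are our assumptions):

```python
import numpy as np

def rank_apis(freq):
    """freq: (n_categories, n_apis) array of the fraction of apps in each
    category containing each API. Score = in-category frequency times
    (1 - average frequency elsewhere): high in one category plus low in
    the others yields a high score."""
    other = (freq.sum(axis=0, keepdims=True) - freq) / (freq.shape[0] - 1)
    return freq * (1.0 - other)

# Toy example: API 0 is distinctive for category 0, API 1 for category 1
freq = np.array([[0.9, 0.1],
                 [0.1, 0.8]])
scores = rank_apis(freq)
top = scores.argmax(axis=1)   # highest-ranked API per category
```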

Hypothesis Testing


Using the ranking algorithm, we now have a set of APIs that we know are unique to benign apps. We wanted to see the effect on the classification output of adding these known benign APIs to malware apps. From the graph, we can see that for the first 100 or so APIs we barely saw any significant results, but after that there is a large drop in the p-value. A possible reason is that these benign APIs need to co-occur with one another before an app is classified as benign.
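A sketch of the injection experiment under stated assumptions: `model` is any fitted classifier with `.predict`, class 0 means benign, and we test whether the benign-prediction rate exceeds a baseline with a one-sided binomial test (the report's exact test statistic may differ):

```python
import numpy as np
from scipy.stats import binomtest

def injection_pvalue(model, X_malware, benign_api_idx, k, base_rate):
    """Inject the top-k benign-unique APIs into every malware feature
    vector, then test whether the benign-prediction rate exceeds base_rate."""
    X_mod = X_malware.copy()
    X_mod[:, benign_api_idx[:k]] = 1                   # mark injected APIs present
    n_benign = int((model.predict(X_mod) == 0).sum())  # class 0 = benign
    return binomtest(n_benign, X_mod.shape[0], base_rate,
                     alternative='greater').pvalue
```

Sweeping `k` over the ranked benign APIs and plotting the resulting p-values reproduces the shape of the curve described above.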

Looking at the classification report, we see that the F1 score of Adware has actually increased, unlike all the other categories. The main reason for this is that adware and benign apps are similar. Analysing the SVM weights and tSNE will help us understand why.

SVM Weights


We wanted to closely analyse the support vectors and SVM weights in our model. We use scikit-learn's Linear Support Vector Classifier, and since we have five unique classes, we have five decision boundaries. Each decision boundary is composed of both negative and positive weights: the positive weights correspond to the classifier's own category (for instance, benign), while the negative weights correspond to all other classes in training. The two images on the right show the top 5% of positive weights in each SVM, ordered descending, with the top of each column corresponding to the highest weights. These weights have the most influence on the positive decision that pulls an app towards that particular category. There are 55 vectors (or apps) shown for each category, and these apps are responsible for approximately 30-45% of the sum of the total positive weights for their respective category.
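When the SVM is trained on a precomputed app-by-app kernel, each feature column corresponds to a training app, so the largest positive weights in each class's row of `coef_` point at the most influential apps for that category. A sketch with synthetic data standing in for a real kernel:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
base = rng.random((100, 30))
K = base @ base.T                  # stand-in for an app-by-app kernel like AA^T
y = np.arange(100) % 5             # 5 classes, as in the report

# LinearSVC is one-vs-rest by default: one row of weights per class
clf = LinearSVC(dual=False, max_iter=5000).fit(K, y)

# Columns index training apps; sort each row to find the top-weighted apps
top_apps = np.argsort(clf.coef_, axis=1)[:, ::-1][:, :5]
```

Inspecting the true categories of `top_apps` for each classifier is what reveals how many "incorrect" apps carry positive weight in a given kernel's SVM.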

As shown, AA^T has significantly fewer apps from incorrect categories placed among a category's positive weights than APA^T does. This is one important reason why AA^T consistently performs better than APA^T: apps from the wrong categories carry positive weight in the APA^T SVM. We will be able to observe the reason for this further by examining tSNE.

tSNE


tSNE is a method used to visualize high dimensional data in low dimensions. In tSNE plots, data points that are more clustered together are more similar than those that are not. We created a plot for each kernel. The x and y axes refer to tSNE dimensions 1 and 2, while the data points represent individual apps in our training dataset.
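Producing such a plot with scikit-learn can be sketched as follows (synthetic features stand in for our data; in practice each row of the kernel matrix serves as an app's feature vector):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((60, 20))   # toy stand-in: 60 apps, 20 features

# Embed into 2 dimensions; perplexity must be smaller than the sample count
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# emb[:, 0] and emb[:, 1] become the x and y axes of the scatter plot,
# with points coloured by each app's category
```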

For the tSNE plot of the AA^T kernel, you can see that benign and ransomware apps are most tightly clustered within their own categories, while there is a big cluster of adware, trojan, and backdoor apps in the middle of the plot. This means the AA^T kernel perceives those adware, trojan, and backdoor apps as similar. We investigated this and found that the trojan apps in that cluster contain many APIs that are rare among trojans but very common in adware apps. The backdoor apps located there contain not only common adware APIs but also common trojan APIs. This is a possible reason why the AA^T kernel perceives those adware, trojan, and backdoor apps to be similar to each other.

The tSNE plot for the APA^T kernel has more multi-categorical clusters than the AA^T plot. This implies that the APA^T kernel has more difficulty telling different types of apps apart, which is consistent with it consistently performing worse than the AA^T kernel, as established above.


Local Interpretable Model-agnostic Explanations (LIME)

A common technique is to analyse the SHAP values for a given classification. These values leverage the idea of Shapley values for scoring feature influence by calculating the average marginal contribution of a feature over all possible coalitions; because of this, they are incredibly expensive and time-consuming to compute. We therefore used Local Interpretable Model-agnostic Explanations (LIME) to find similar values at the expense of some accuracy. LIME fits a local approximation around a single prediction, which allows us to understand the important features when classifying an app.
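The core idea can be sketched as a local surrogate model: perturb the app's binary API vector, weight perturbations by their similarity to the original, and fit a weighted linear model whose coefficients approximate local feature importance. (This is a sketch of LIME's idea, not the `lime` library's implementation; all names here are ours.)

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_proba, x, n_samples=500, kernel_width=0.75, seed=0):
    """Local linear surrogate around a 0/1 feature vector x.
    predict_proba maps a batch of feature vectors to the probability
    of the class being explained."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, x.size)) * x   # mask present APIs
    dist = (Z != x).mean(axis=1)                           # fraction flipped
    w = np.exp(-(dist ** 2) / kernel_width ** 2)           # proximity weights
    surrogate = Ridge(alpha=1.0).fit(Z, predict_proba(Z), sample_weight=w)
    return surrogate.coef_                                 # local importances
```

The features with the largest coefficients are the ones driving the model's decision near that particular app, which is exactly what the bar graphs below visualize.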

The bar on the left of the figure depicts the model's predicted classification of an app; here, the number 4 corresponds to the category Ransomware. The horizontal bar graph shows various app indexes and their respective feature importance in the classification of that app. This app was classified as Ransomware but was actually a Trojan.

1 Beauty and 3 Productivity apps: these heavily skewed and negatively affected the classification of the trojan apps.

Future Work

Adding More Kernels

We want to include more kernels in our multi-kernel learning model. Since our model performs best on ransomware and adware, and worst on trojans, we want to include a kernel with a metapath that makes better use of the features of trojan applications to improve their classification.

Screening

Another possible future direction for our project is using our ranking algorithm to screen new apps before they are published on the Google Play Store. That way, apps that contain too many of certain types of APIs would undergo a more rigorous background check.
