Creating a model to detect malware using supervised learning algorithms
Product Development Grant
TOBORRM has received an industry grant to develop malware detection algorithms based on behaviours and file parameters. The software development team at TOBORRM wrote a file-download identifier that scoured the internet for downloadable content. The goal was to develop a data set that can be used to identify malware based on parameters such as:
Where the file came from
How big the file was
What type of file it is
as well as many other characteristics (or features).
The programmers found millions of files and classified them manually using a third-party system at virustotal.com. As noted in the previous assessment, for each file specimen collected:
TOBORRM’s data collector would send the file to virustotal.com
Files were tagged as “Malicious” if a majority of virustotal.com virus scanners recognised the file as containing malware (see Figure 1)
Files were tagged as “Clean” if ALL virustotal.com scanners identified the file as “Clean”. (see Figure 1)
Figure 1 - VirusTotal.com comparison of confirmed infected vs confirmed clean
As such, the “Actually Malicious” field can be considered to be a generally accurate classification for each downloaded sample.
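The tagging rule described above can be sketched in R as follows. This is only an illustration: the function name label_sample and its input format (one logical verdict per scanner) are hypothetical, and the brief does not say how files that meet neither condition were handled, so the sketch returns NA for them.

```r
# Hypothetical helper: label one sample from its per-scanner verdicts.
# `verdicts` is a logical vector, TRUE = the scanner flagged the file as malware.
label_sample <- function(verdicts) {
  if (mean(verdicts) > 0.5) {
    "Malicious"      # a strict majority of scanners flagged the file
  } else if (!any(verdicts)) {
    "Clean"          # every scanner reported the file as clean
  } else {
    NA_character_    # neither rule applies; handling is not stated in the brief
  }
}

label_sample(c(TRUE, TRUE, FALSE))    # majority flagged
label_sample(c(FALSE, FALSE, FALSE))  # all clean
label_sample(c(TRUE, FALSE, FALSE))   # minority flagged
```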
Initially, the security and software development teams believed they would be able to gain insight from various statistical analyses of the dataset. Their initial attempts to classify the data lacked sensitivity and produced many false positives. The results of TOBORRM’s analysis are included in the “Initial Statistical Analysis” column, and they are poor.
Data set columns
The data set created by TOBORRM’s developers includes the following columns, each listed with a description of its source:
Download Source: A description of where the sample came from
TLD: Top Level Domain of the site where the sample came from
Download Speed: Speed recorded when obtaining the sample
Ping Time To Server: Ping time to the server recorded when accessing the sample
File Size (Bytes): The size of the sample file
How Many Times File Seen: How many other times this sample has been seen at other sites (and not downloaded)
Executable Code Maybe Present in Headers: ‘CodeCheck’ Program has flagged the file as possibly containing executable code in file headers
Calls to Low-Level System Libraries: When the file was opened or run, how many times were low-level Windows System libraries accessed
Evidence of Code Obfuscation: ‘CodeCheck’ Program indicates that the contents of the file may be obfuscated
Threads Started: How many threads were started when this file was accessed or launched
Mean Word Length of Extracted Strings: Mean length of text strings extracted from file using Unix ‘strings’ program
Similarity Score: An unknown scoring system used by ‘CodeCheck’; seems to be the score of how similar the file is to other files recognised by ‘CodeCheck’
Characters in URL: How long the URL is (after the .com / .net part). E.g., /index.html = 10 characters
Actually Malicious: The correct classification for the file
Previous System Performance: Performance of “FileSentry3000™ v1.0”
TOBORRM’s industry grant requires that they provide a clear case for whether machine learning algorithms could solve the problem of classifying malicious software. Your task is to build on your previous work by running the data through appropriate machine learning modelling approaches, tuned to optimise their accuracy.
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.
Part 1 – General data preparation and cleaning.
Import the MLDATASET_PartiallyCleaned.xlsx into R Studio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.
Write the appropriate code in R Studio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:
For How.Many.Times.File.Seen, set all values = 65535 to NA;
Convert Threads.Started to a factor whose categories are given by
1 = 1 thread started
2 = 2 threads started
3 = 3 threads started
4 = 4 threads started
5 = 5 or more threads started
Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data)
Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
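The cleaning steps listed above can be sketched as follows. This is a minimal illustration on a toy data frame, not a model answer: the toy values, the new column name Log.Characters.in.URL and the factor label "5+" are assumptions made for the example, and in practice the data would be imported from MLDATASET_PartiallyCleaned.xlsx (for instance with readxl::read_excel).

```r
# Toy stand-in for the imported spreadsheet (values are made up).
MLDATASET <- data.frame(
  How.Many.Times.File.Seen = c(3, 65535, 12, 7),
  Threads.Started          = c(1, 4, 9, 2),
  Characters.in.URL        = c(10, 45, 120, 33)
)

# 1. Treat the sentinel value 65535 as missing.
sentinel.rows <- which(MLDATASET$How.Many.Times.File.Seen == 65535)
MLDATASET$How.Many.Times.File.Seen[sentinel.rows] <- NA

# 2. Cap thread counts at 5, then convert to a factor with 5 categories.
MLDATASET$Threads.Started <- factor(
  pmin(MLDATASET$Threads.Started, 5),
  levels = 1:5,
  labels = c("1", "2", "3", "4", "5+")   # label text is an assumption
)

# 3. Log-transform the URL length and drop the original column.
MLDATASET$Log.Characters.in.URL <- log(MLDATASET$Characters.in.URL)
MLDATASET$Characters.in.URL <- NULL

# 4. Keep complete cases only.
MLDATASET.cleaned <- na.omit(MLDATASET)
str(MLDATASET.cleaned)
```

Using pmin(.) caps the counts in one vectorised step, which is equivalent to replacing every value greater than 5 with 5 as the hint suggests.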
Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files; these must be submitted along with your code.
Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will only use 30% of the data to train your ML models to save time.
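One way to produce the 30/70 partition is sketched below, using base R only. The toy data frame, the seed value 12345678, the "YES"/"NO" class labels and the output file names are all placeholders; writing to tempdir() is just to keep the example side-effect free.

```r
# Toy stand-in for MLDATASET.cleaned (values are made up).
MLDATASET.cleaned <- data.frame(
  Sample.ID          = 1:100,
  Actually.Malicious = rep(c("YES", "NO"), times = 50)
)

set.seed(12345678)  # placeholder; replace with your student ID

# Randomly pick 30% of the row indices for training; the rest form the test set.
n         <- nrow(MLDATASET.cleaned)
train.idx <- sample(n, size = round(0.3 * n))
train.set <- MLDATASET.cleaned[train.idx, ]
test.set  <- MLDATASET.cleaned[-train.idx, ]

# Export both sets as csv files (directory and file names are placeholders).
out.dir <- tempdir()
write.csv(train.set, file.path(out.dir, "training_set.csv"), row.names = FALSE)
write.csv(test.set,  file.path(out.dir, "test_set.csv"),     row.names = FALSE)
```

A simple random split like this does not guarantee the same class balance in both sets; a stratified alternative such as caret::createDataPartition(.) could be used instead if that matters.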
Part 2 – Compare the performances of different machine learning algorithms
Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
myModels <- c("Binary Logistic Regression",
myModels %>% data.frame
For each of your ML modelling approaches, you will need to:
Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.
Perform hyperparameter tuning to optimise the model (except for the Binary Logistic Regression model):
Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches, even if you’re using the same search strategy as the workshop notes. Report on the search range(s) for hyperparameter tuning, which k-fold CV was used, and the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (where appropriate).
If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.
If your selected tree model is Random Forest, you must tune the num.trees, mtry, min.node.size, and sample.fraction hyperparameters, with at least 3 values for each.
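As an illustration of a grid search over the four Random Forest hyperparameters named above, the sketch below assumes the ranger package and a simulated toy dataset. The candidate values are arbitrary, and the sketch ranks combinations by ranger's out-of-bag misclassification rate rather than by k-fold cross-validation, so it shows only one possible search strategy, not the one prescribed in the workshop notes.

```r
library(ranger)

# Simulated binary-classification data; all names and values are illustrative.
set.seed(1)
n   <- 300
toy <- data.frame(matrix(rnorm(n * 6), ncol = 6))   # predictors X1..X6
toy$Actually.Malicious <- factor(
  ifelse(toy$X1 + toy$X2 + rnorm(n) > 0, "YES", "NO")
)

# Three candidate values per hyperparameter: 3^4 = 81 combinations.
grid <- expand.grid(
  num.trees       = c(100, 200, 300),
  mtry            = c(2, 3, 4),
  min.node.size   = c(1, 5, 10),
  sample.fraction = c(0.5, 0.65, 0.8)
)
grid$OOB.error <- NA_real_

# Fit one forest per combination and record its out-of-bag error.
for (i in seq_len(nrow(grid))) {
  fit <- ranger(
    Actually.Malicious ~ ., data = toy,
    num.trees       = grid$num.trees[i],
    mtry            = grid$mtry[i],
    min.node.size   = grid$min.node.size[i],
    sample.fraction = grid$sample.fraction[i],
    seed            = 1
  )
  grid$OOB.error[i] <- fit$prediction.error  # OOB misclassification rate
}

# Combination with the lowest out-of-bag error.
grid[which.min(grid$OOB.error), ]
```

On the full training set, each additional candidate value multiplies the number of forests to fit, so the size of the grid is a trade-off between search coverage and run time.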
Evaluate the performance of each ML model on the test set. Provide the confusion matrices and report the following:
Sensitivity (the detection rate for actual malicious samples)
Specificity (the detection rate for actual non-malicious samples)
Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony, accuracy, and to a lesser extent, interpretability should be taken into account.
Create a confusion matrix for the variable Initial.Statistical.Analysis in the test set. Recall that the data in this column correspond to TOBORRM’s initial attempt to classify the samples. Compare the performance of your optimal ML model in part d) with the initial analysis by the TOBORRM team, and comment on the differences.
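A confusion matrix, with sensitivity and specificity derived from it, can be computed in base R as sketched below. The vectors are made-up examples and the "YES"/"NO" labels are an assumption about how the classes are coded; the same calculation applies whether the predictions come from a fitted model or from the Initial.Statistical.Analysis column.

```r
# Made-up actual and predicted classes for eight test samples.
lv        <- c("YES", "NO")   # "YES" = malicious (assumed coding)
actual    <- factor(c("YES", "YES", "YES", "NO", "NO", "NO",  "NO", "YES"), levels = lv)
predicted <- factor(c("YES", "NO",  "YES", "NO", "NO", "YES", "NO", "YES"), levels = lv)

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Sensitivity: share of actual malicious samples that were detected.
sensitivity <- cm["YES", "YES"] / sum(cm[, "YES"])

# Specificity: share of actual non-malicious samples correctly passed as clean.
specificity <- cm["NO", "NO"] / sum(cm[, "NO"])

c(sensitivity = sensitivity, specificity = specificity)
```

Fixing the factor levels up front keeps the table 2 x 2 even if a classifier never predicts one of the classes.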
What to submit
Gather your findings into a report (maximum of 5 pages), citing sources where necessary.
Present how and why the data was manipulated, how the ML models were tuned, and finally how they performed relative to each other and to the initial analysis by TOBORRM. You may use graphs, tables and images where appropriate to help your reader understand your findings.
Make a final recommendation on which ML modelling approach is the best for this task.
Your final report should look professional, include appropriate headings and subheadings, and cite facts and reference source materials in APA-7th format.
Your submission must include the following:
Your report (5 pages or less, excluding cover/contents page)
A copy of your R code, and two csv files corresponding to your training and test datasets.
The report must be submitted through TURNITIN and checked for originality. The R code and data sets are to be submitted separately via a Blackboard submission link.
Note that no marks will be given if the results you have provided cannot be confirmed by your code. Furthermore, all pages exceeding the 5-page limit will not be read or examined.
Criterion Contribution to assignment mark
Accurate implementation of data cleaning and of each supervised machine learning algorithm in R. 20%
Explanation of data cleaning and preparation. 10%
An outline of the selected modelling approaches, the hyperparameter tuning and search strategy, the corresponding performance evaluation in the training set (i.e. CV results), and the optimal tuning hyperparameter values. 20%
Presentation, interpretation and comparison of the performance measures (i.e. confusion matrices) among the selected ML algorithms. Justification of the recommended modelling approach and how it compares against the results of the initial analysis in the test set. 30%
Report structure and presentation (including tables and figures, and where appropriate, proper citations and referencing in APA-7th style). Report should be clear and logical, well structured, mostly free from communication, spelling and grammatical errors. Appropriate and easy for a non-mathematical (semi-technical) audience to follow and understand. 20%