Gueltoum Bendiab, Stavros Shiaeles, 29 Jun 2020.
Malware is today one of the major threats faced by the digital world. Modern malware attacks in particular have drawn attention to the extensive damage that can be caused to private users, companies, public services, governments, and the critical infrastructures that provide vital functions our societies depend upon. According to the AV-TEST Institute, its analysis systems record over 350,000 new malicious programs every day, which amounts to more than 200 million pieces of malicious software that need to be examined every year. This volume of new malware is a major challenge for traditional analysis and detection systems that depend on signature databases, such as antivirus software and intrusion detection systems. These systems cannot discover unknown malware and zero-day attacks for which specific signatures are not yet available. Attackers exploit this limitation by recycling existing malware under different signatures, using obfuscation methods instead of developing entirely new code. For instance, polymorphic malware can generate a new variant, and therefore a new signature, each time it is executed.
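The signature limitation can be illustrated with a deliberately simplified sketch. It assumes a signature is nothing more than a SHA-256 hash of the file content, which is much cruder than what real anti-malware products use; the payload strings and the names `signature`, `KNOWN_SIGNATURES` and `is_known` are hypothetical.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Simplified 'signature': the SHA-256 digest of the raw bytes (assumption)."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical signature database holding one previously analysed sample.
KNOWN_SIGNATURES = {signature(b"example payload v1")}


def is_known(payload: bytes) -> bool:
    """Return True only if the exact byte sequence has been seen before."""
    return signature(payload) in KNOWN_SIGNATURES


# The original sample matches, but a variant that differs by a single byte
# yields a different digest and is therefore missed by the lookup.
print(is_known(b"example payload v1"))
print(is_known(b"example payload v2"))
```

Under this assumption, any change to the bytes produces a new signature, which is why structural similarity between variants is a more robust feature than exact matching.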
To tackle these limitations in conventional anti-malware tools, the research community has started considering image visualization for malware analysis and detection, which can successfully handle zero-day malware. This technique has proven to be effective because it leverages the structural similarity between known and new malware binaries. Moreover, visual analysis helps analysts to accurately capture and highlight the malicious behaviour of malware samples, which increases the efficiency of malware detection. In this context, Microsoft and Intel have proposed a novel approach that turns malware into images in order to spot more threats. Intel Labs and the Microsoft Threat Intelligence Team are collaborating on a research project called STAMINA (Static Malware-as-Image Network Analysis), which converts input binary files into grayscale images so that a deep learning algorithm can process and classify them. The main idea is to convert the content of an input binary file into a simple stream of pixels, and then to convert that stream into a 2D image whose dimensions vary with aspects such as file size. A trained neural network classifier then analyses the output image and classifies it as legitimate or malware. The image conversion (pre-processing step) consists of three main sub-steps:
- Pixel conversion: this step converts the binary file into a one-dimensional pixel stream, where every byte becomes a value between 0 and 255 that corresponds to a pixel intensity.
- Reshaping: this step reshapes the pixel stream into a two-dimensional array, where the image height is the number of pixels divided by the width. If the height is not a whole number, it is rounded up and the extra pixels are padded with zeros. Table 1 can be used to set the width of the output image correctly.
- Resizing: this step resizes the output images so that they can be used by the deep learning classifier. The recommended size is 224 or 299 pixels in each dimension. Bilinear interpolation or the nearest-neighbour algorithm can be used for resizing.
Table 1
|Pixel file size|Output image width|
|Between 0 and 10| |
|Between 10 and 30| |
|Between 30 and 60| |
|Between 60 and 100| |
|Between 100 and 200| |
|Between 200 and 1000| |
|Between 1000 and 1500| |
|Greater than 1500| |
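The three pre-processing sub-steps can be sketched as follows. This is a minimal illustration using NumPy, not the STAMINA implementation: the function names `bytes_to_image` and `resize_nearest` are hypothetical, the width is passed in as a parameter (it would be chosen from Table 1), and only nearest-neighbour resizing is shown.

```python
import math

import numpy as np


def bytes_to_image(data: bytes, width: int) -> np.ndarray:
    """Pixel conversion and reshaping: one byte becomes one grayscale pixel."""
    pixels = np.frombuffer(data, dtype=np.uint8)       # 1-D stream, values 0..255
    height = math.ceil(len(pixels) / width)            # round the height up
    padded = np.zeros(height * width, dtype=np.uint8)  # extra pixels stay zero
    padded[: len(pixels)] = pixels
    return padded.reshape(height, width)


def resize_nearest(image: np.ndarray, size: int) -> np.ndarray:
    """Resizing: nearest-neighbour resampling to a size x size image."""
    rows = np.arange(size) * image.shape[0] // size    # source row per output row
    cols = np.arange(size) * image.shape[1] // size    # source column per output column
    return image[rows[:, None], cols]


# Usage with ten illustrative bytes and an illustrative width of 4:
# the stream fills a 3 x 4 array, with the last two pixels padded as zeros.
image = bytes_to_image(bytes(range(10)), width=4)
resized = resize_nearest(image, size=224)
print(image.shape, resized.shape)
```

Nearest-neighbour resampling is used here only because it needs no extra dependency; an imaging library would normally provide both resizing options mentioned above.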
The learning algorithm is trained on a large real-world dataset (2.2 million PE file hashes) that Microsoft collected from Windows Defender installations. STAMINA has proven to be effective, with over 99% accuracy in classifying malware and a false positive rate slightly under 2.6%. However, it has its limits: it works well with small files but struggles with larger ones. The results of this research were detailed in a paper titled “STAMINA: Scalable deep learning approach for malware classification”. Several other visualization-based techniques have been proposed to improve malware detection and classification and to better understand the specific behaviour of new malware. Most of these techniques transform malware detection into an image classification problem so that it can be handled by machine learning algorithms. They are based purely on the conversion of binary files into 2D or 3D images, and most of them use grayscale images.
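For reference, accuracy and false positive rate are computed from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the counts in the usage lines are hypothetical and are not taken from the STAMINA evaluation.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all samples (malware and legitimate) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of legitimate samples wrongly flagged as malware."""
    return fp / (fp + tn)


# Hypothetical counts: 1,000 malware and 1,000 legitimate samples.
tp, fn = 990, 10   # malware detected / malware missed
tn, fp = 980, 20   # legitimate accepted / legitimate flagged
print(accuracy(tp, tn, fp, fn))
print(false_positive_rate(fp, tn))
```

The two figures answer different questions: a classifier can reach a high accuracy while still flagging a noticeable share of legitimate files, which is why both are reported.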
One innovative visualisation-based approach for malware traffic classification, called “Malware-Squid”, has been proposed in the context of the Cyber-Trust project, which aims to develop an innovative cyber-threat intelligence gathering, detection, and mitigation platform. The approach converts network traffic into RGB images using the visual representation tool Binvis. The resulting images are then analysed and classified using different learning algorithms, including Residual Neural Network (ResNet50), Self-Organizing Incremental Neural Networks (SOINN), and MobileNet. For the image conversion, the binary content of the input file is treated as a byte string, and each byte's value is mapped to a colour according to its position in the ASCII table. Binvis divides byte values into colour groups: red for extended ASCII bytes, blue for printable ASCII bytes, and green for control bytes, while black (0x00) and white (0xFF) represent null bytes and (non-breaking) spaces respectively. The coordinates of each coloured byte in the output image are then determined using space-filling curves, a clustering technique that keeps bytes that are close together in the file close together in the image. The size of the output RGB image is 784 (1024*256) bytes.
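The colour mapping and curve-based layout described above can be sketched as follows. This is an illustration under several assumptions, not the Binvis source code: the byte ranges for each class, the pure RGB values, the choice of a Hilbert curve as the space-filling curve, the square output image, and the names `byte_colour`, `hilbert_d2xy` and `bytes_to_rgb` are all assumptions made for the example.

```python
import numpy as np


def byte_colour(value: int) -> tuple[int, int, int]:
    """Map a byte to an RGB colour by ASCII class (ranges and colours assumed)."""
    if value == 0x00:
        return (0, 0, 0)          # null byte: black
    if value == 0xFF:
        return (255, 255, 255)    # 0xFF: white
    if 0x20 <= value <= 0x7E:
        return (0, 0, 255)        # printable ASCII: blue
    if value < 0x20 or value == 0x7F:
        return (0, 255, 0)        # control bytes: green
    return (255, 0, 0)            # extended ASCII (0x80..0xFE): red


def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Convert index d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def bytes_to_rgb(data: bytes, order: int) -> np.ndarray:
    """Place coloured bytes along the curve; unused pixels stay black."""
    side = 1 << order
    image = np.zeros((side, side, 3), dtype=np.uint8)
    for d, value in enumerate(data[: side * side]):
        x, y = hilbert_d2xy(order, d)
        image[y, x] = byte_colour(value)
    return image


# Usage: a hypothetical 70-byte payload on an 8 x 8 grid (extra bytes are dropped).
picture = bytes_to_rgb(b"GET /index.html HTTP/1.1\r\n" + bytes([0x00, 0x80, 0xFF]) * 15, order=3)
print(picture.shape)
```

Consecutive indices on the curve always land on neighbouring pixels, so runs of similar bytes (text, padding, compressed data) appear as contiguous coloured regions that a convolutional network can pick up.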
The learning algorithms are trained on a dataset of Binvis images created from normal and malware pcap files collected from different network traffic sources. The malware pcap files contain malicious traffic generated by different types of attacks, such as trojans, botnets, IoT-based attacks (DDoS, key loggers, OS scans, spyware), and backdoors. Malware-Squid achieved accuracy rates from 94% to 99% and false positive rates lower than 3%. Work on this project is in progress: the learning algorithms will be trained and tested on more samples in order to improve the accuracy rates. The Cyber-Trust project aims to integrate Malware-Squid into the well-known Network Intrusion Detection Systems (NIDSs) Snort and Suricata in order to improve their protection and mitigation processes.
It is worth noting that the “Malware-Squid” project was announced on 1 April 2018, two years before the announcement of the STAMINA project. The Cyber-Trust project is progressing well, and the Malware-Squid approach shows promising results with the Suricata NIDS. Final results from this project will be published in research papers. The Cyber-Trust team has announced that this new approach will greatly help anti-malware tools, especially NIDSs, to detect zero-day attacks effectively and to reduce the surface of security threats.
- Li Chen, Jugal Parikh, Marc Marino, “STAMINA Deep Learning for Malware Protection”, available on the Intel website: https://www.intel.com/content/www/us/en/artificial-intelligence/documents/stamina-deep-learning-for-malware-protection-whitepaper.html, April 2020, accessed 30/06/2020.
- Shire R., Shiaeles S., Bendiab K., Ghita B., Kolokotronis N. (2019) Malware Squid: A Novel IoT Malware Traffic Analysis Framework Using Convolutional Neural Network and Binary Visualisation. In: Galinina O., Andreev S., Balandin S., Koucheryavy Y. (eds) Internet of Things, Smart Spaces, and Next Generation Networks and Systems. NEW2AN 2019, ruSMART 2019. Lecture Notes in Computer Science, vol 11660. Springer, Cham.
- Baptista, I., Shiaeles, S., & Kolokotronis, N. (2019, May). A novel malware detection system based on machine learning and binary visualization. In 2019 IEEE International Conference on Communications Workshops (ICC Workshops) (pp. 1-6). IEEE.