Steganalysis and Machine Learning: a European answer

(By Igino Corona, Matteo Mauri)

Steganography is a secret mechanism for encoding information through any means of transmission. Its use has been known since ancient Greece, even though the term officially entered the glossaries only at the end of the 15th century.

Both the encoding and the transmission medium are secret, that is, known only to the parties who intend to communicate covertly. Steganography is therefore an ideal tool for creating secret communication channels that can be used in sophisticated espionage scenarios, in computer crime, and in violations of the privacy of public and private subjects.

Steganography differs from cryptography, where the encoding of the information and the means of transmission are generally known (think, for example, of the HTTPS protocol used by the site hosting this article). In that case the privacy of the information is guaranteed by the encoding mechanism, which makes it (extremely) difficult¹ to send or extract information without additional knowledge, namely the encryption/decryption keys. These keys are known only to the parties authorized to communicate (for example, your browser and our web server).

The process of analyzing steganography is known as steganalysis. Its first aim is to detect the presence of steganography in one or more means of transmission; only then can it proceed to extract the hidden message.

The effectiveness of steganalysis techniques depends closely on the degree of sophistication and "customization" of the steganographic techniques used by a malicious adversary.

The easiest case to deal with is the one where steganography is carried out using "off-the-shelf" tools. This case reflects an adversary with little (or no) knowledge of steganography, who simply uses tools implemented and made available by others: in computer security such an adversary is often called a script kiddie.

In the digital field there are many software tools that implement steganography, and most of them combine it with cryptographic techniques. The table shows examples of open-source software that employ both techniques.

Of course, "off-the-shelf" tools are generally also available to those who intend to perform steganalysis.

When implementing steganography, each piece of software generally leaves (more or less implicitly) characteristic artifacts in the files it manipulates, which can be studied to build signatures (fingerprinting). These signatures can be used in the steganalysis phase not only to detect the presence of steganography, but also to identify the specific tool used and to extract the hidden content [7,8]. Most steganalysis systems use this mechanism [9].
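
As a minimal sketch of signature-based detection, the snippet below scans a file's bytes for tool-specific patterns. The tool names and byte patterns are hypothetical placeholders invented for this illustration; real fingerprints, such as those studied in [7,8], are derived by analyzing the output of actual tools and are usually more subtle than a fixed byte string.

```python
# Hypothetical signatures: these tool names and byte patterns are made up for
# illustration and do not correspond to any real steganography software.
SIGNATURES = {
    "hypothetical-tool-a": b"\x00STEGA\x01",
    "hypothetical-tool-b": b"HIDDENv2",
}


def identify_tools(data):
    """Return the names of all tools whose signature occurs in the file bytes."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]


# Toy inputs: a file without any signature and one carrying a known pattern.
clean = b"\x89PNG" + bytes(64)
marked = clean + b"HIDDENv2" + b"payload"

print(identify_tools(clean))
print(identify_tools(marked))
```

A match reveals both that steganography is present and which tool produced it, which is why this approach fails as soon as the adversary changes the tool's implementation.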

It is easy to see that this is a vicious circle (an "arms race") that drives up the sophistication of the techniques and tools used both by those who intend to use steganography and by those who intend to unmask it and recover its hidden contents. Of the two, the former generally has the advantage, since they can change the means of transmission and/or the encoding of the information at any time to escape detection.

For example, an adversary might modify the implementation of a steganography tool to escape fingerprinting, or even implement entirely new steganographic techniques. This of course has a cost (we are no longer dealing with a script kiddie), but the cost can be well justified by the motives (e.g. the strategic or economic advantages sought by a cyber-espionage organization).

This situation is well known in the field of computer security: it is generally much easier to attack computer systems than to defend them. Malware instances appear in continuous "polymorphic" variants precisely to evade the detection mechanisms put in place to protect systems (e.g. antimalware).

In this scenario, machine learning (automatic learning from examples) can be a sophisticated weapon in the hands of those who intend to unmask steganography. Through machine learning techniques it is in fact possible to automatically build a steganalysis model starting from a set of sample files with and/or without steganography.

Most of the proposed approaches use so-called supervised two-class learning (steganography present / absent), which requires the use of samples both with and without steganography, to automatically determine statistical differences. This method is particularly useful for detecting the presence of variants of known steganographic techniques (e.g. implemented in new software) for which there are no signatures.
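
The two-class idea can be sketched with a deliberately simple nearest-centroid classifier. This is an illustration under stated assumptions, not the method of any particular steganalysis system: the feature vectors are made-up numbers, and the function names `train` and `predict` are chosen here for the example. Practical systems rely on much richer statistical features and stronger classifiers.

```python
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(column) for column in zip(*vectors)]


def train(cover_features, stego_features):
    """Learn one centroid per class from labeled training samples."""
    return {"cover": centroid(cover_features), "stego": centroid(stego_features)}


def predict(model, features):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], features))


# Made-up two-dimensional feature vectors, for illustration only.
cover = [[0.10, 0.20], [0.15, 0.25], [0.05, 0.22]]
stego = [[0.80, 0.90], [0.75, 0.85], [0.90, 0.95]]
model = train(cover, stego)

print(predict(model, [0.12, 0.18]))
print(predict(model, [0.82, 0.88]))
```

Because the model learns a statistical difference rather than an exact pattern, a new sample that resembles the training stego samples is still flagged even when no signature exists for the tool that produced it.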

Examples of various supervised-learning algorithms for detecting steganography in images have been implemented in an open-source library called Aletheia [10].

Signatures and supervised learning can provide good accuracy when it comes to detecting known steganographic techniques and their variants, but they are subject to evasion in the presence of entirely new techniques, for example those with a statistical profile significantly different from the one observed in the training samples.

For this reason, other studies [11,12] have instead proposed the use of unsupervised, anomaly-based learning techniques. This approach uses only samples in which steganography is absent to automatically build a profile of normality. The presence of anomalies ("outliers"), that is, deviations from this profile, can then be used to detect entirely unknown steganographic techniques. To offer good accuracy, however, this approach must focus on aspects (features) whose deviations from the norm are strong indicators of manipulation. Think, for example, of comparing the size specified in the header of a file with its actual size.
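
The header-size comparison mentioned above can be sketched as an anomaly detector. The example assumes the BMP format, whose file header stores the total file size as a little-endian 32-bit integer at byte offset 2. The `NormalProfile` class, the three-sigma threshold, and the `make_bmp` helper are assumptions introduced for this illustration, and the generated byte strings are toy stand-ins rather than valid images.

```python
import struct
from statistics import mean, pstdev


def bmp_size_mismatch(data):
    """Actual length minus the file size declared in bytes 2-5 of a BMP header."""
    if len(data) < 6 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    (declared,) = struct.unpack_from("<I", data, 2)
    return len(data) - declared


class NormalProfile:
    """Profile of one numeric feature, learned from clean samples only."""

    def __init__(self, clean_values, threshold=3.0):
        self.mu = mean(clean_values)
        self.sigma = pstdev(clean_values)
        self.threshold = threshold  # allowed deviation, in standard deviations

    def is_anomalous(self, value):
        if self.sigma == 0:
            # Clean samples never vary: any deviation at all is an outlier.
            return value != self.mu
        return abs(value - self.mu) / self.sigma > self.threshold


def make_bmp(body_len, extra=b""):
    """Toy BMP-like bytes whose header declares the size without `extra`."""
    total = 6 + body_len
    return b"BM" + struct.pack("<I", total) + bytes(body_len) + extra


# Train on clean files only, then test a file with data appended after the end.
clean_files = [make_bmp(n) for n in (100, 200, 300)]
profile = NormalProfile([bmp_size_mismatch(f) for f in clean_files])
suspect = make_bmp(150, extra=b"appended hidden payload")

print(profile.is_anomalous(bmp_size_mismatch(clean_files[0])))
print(profile.is_anomalous(bmp_size_mismatch(suspect)))
```

No stego sample is needed for training: the detector only knows what clean files look like, so it can flag a hiding technique it has never seen, provided that technique disturbs the chosen feature.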

Since each steganalysis technique has its merits, a combination of them is often useful: signatures, supervised learning and unsupervised learning [12]. The European Commission is well aware of this and has funded a strategic project for this purpose, called SIMARGL: Secure intelligent methods for advanced recognition of malware, stegomalware & information hiding methods (Grant Agreement No. 833042).

The project, with a total budget of 6 million euros, aims to create advanced steganalysis systems applied to the detection of (stego)malware, malicious software increasingly used by cybercriminals and nation states in espionage operations. In this project, international actors of the caliber of Airbus, Siveco, Thales, Orange Cert and FernUniversität (project coordinator) work alongside three Italian partners in countering stegomalware: Pluribus One, a spin-off of the University of Cagliari, participates as a software provider and developer; CNR, Genoa unit, contributes energy-aware detection algorithms based on artificial intelligence; Numera, an ICT company based in Sassari, will submit some of its systems aimed at the credit market to the "scrutiny" of SIMARGL.

In total, there are 14 international partners from 7 countries (Netzfactor, ITTI, Warsaw University, IIR, RoEduNet and Stichting CUIng Foundation also participate in the consortium). They will bring to bear artificial intelligence, sophisticated products already available, and machine learning techniques still being refined, in order to propose an integrated solution capable of dealing with different scenarios and acting at different levels: from monitoring network traffic to detecting obfuscated bits within images.

The challenge of the SIMARGL project has just begun, and it will provide concrete answers to the problem of stegomalware over the next two years: the project is due to end in April 2022.

It is important to stress that machine learning (and, more generally, artificial intelligence) is a neutral technology, like many others. Specifically, it is dual-use [13] and is not the exclusive domain of the good guys. In principle, machine learning can also be used to develop more sophisticated, polymorphic, data-driven steganographic techniques.

Let's get ready, because this scenario could represent the future of cyber threats (and perhaps a piece of the future is already present).

1. The degree of difficulty generally determines the robustness of the encoding.

[7] Pengjie Cao, Xiaolei He, Xianfeng Zhao, Jimin Zhang. Approaches to obtaining fingerprints of steganography tools which embed message in fixed positions. Forensic Science International: Reports, Volume 1, 2019, 100019, ISSN 2665-9107.
[8] Chen Gong, Jinghong Zhang, Yunzhao Yang, Xiaowei Yi, Xianfeng Zhao, Yi Ma. Detecting fingerprints of audio steganography software. Forensic Science International: Reports, Volume 2, 2020, 100075, ISSN 2665-9107.
[11] Jacob T. Jackson, Gregg H. Gunsch, Roger L. Claypoole, Jr., Gary B. Lamont. Blind Steganography Detection Using a Computational Immune System: A Work in Progress. International Journal of Digital Evidence, Winter 2003, Issue 1, Volume 4.
[12] Brent T. McBride, Gilbert L. Peterson, Steven C. Gustafson. A new blind method for detecting novel steganography. Digital Investigation, Volume 2, Issue 1, 2005, Pages 50-70, ISSN 1742-2876.