Trojan Source Attack, what is it?

18/11/21

How about if it were possible to trick compilers into producing binary files other than the logic visible to human eyes in the source code? We show you that doing this is not only possible, but easily exploitable ...

This is stated in the introduction of the paper "Trojan source: Invisible Vulnerabilities", recently published by prof. Ross Anderson of Cambridge University, author of the text "Security engineering", and by Nicholas Boucher, his doctoral student. 

Intrigued, as often happens in the cyber world, we decided to deepen ...

To begin, remember that programs written with modern programming languages ​​need a compiler, that is another program that analyzes the text of our program, verifies its syntactic correctness, and translates it into an optimized program in the language executable by the computer. Already from this we understand the criticality in the development of a compiler that could inadvertently insert vulnerabilities in the program.

In the 80s Ken Thomson, one of the fathers of Unix, already warned against these risks in his famous lecture given on the occasion of the conferral of the "Turing award" (the equivalent of the Nobel Prize for computer science) with the evocative title "Reflections on trusting trust". The compiler of the C language is itself a program written in the C language, so to be able to trust the executable program produced by the compiler, it is necessary not only to trust whoever wrote the program, but also to trust the compiler.

Now, returning to the article by Boucher and Anderson, let's start from considering a compiler as a text analyzer that knows the terms and syntax of the programming language. The characters used to form the text need to be encoded on a computer. 

To understand what this new type of attack presented by Boucher and Anderson consists of, first we need to talk about coding and the standards used for coding.

One of the most used standards in the computer world over time is ASCII, American Standard Code for Information Interchange. The first edition of this standard was published in 1963 and changed several times over time. It took seven bits to encode 128 different characters. 

With the development of computer systems, needs changed and it was necessary to extend the ASCII standard to allow the encoding of characters coming from different languages ​​or necessary for specific environments.

The standard known as UNICODE was thus gradually passed. This standard allows 144.697 to be encoded between various characters and symbols. Among these are so-called alphabets Left-to-Right (from left to right, such as Italian, Russian, English ...) e Right-to-Left (from right to left, such as Arabic, Hebrew ...). In the case of Unicode, which is one of the most widely used coding standards today, an algorithm called Bidirectional Algorithm (BiDi). 

The BiDi algorithm is designed to automatically manage the display of text through so-called control characters (override characters) that allow you to explicitly specify how a specific portion of text should be treated (for example Right To Left Override, RLO, specifies to treat the text to which it applies as right-to-left). Control characters are invisible characters on the screen that allow you to control the display of portions of text. Here are some of the most used.

Furthermore, the control characters can be "nested", thus allowing to reorder the display of a text in a very complex way.

Now, if the text that is displayed on the screen is a program code, it must be borne in mind that in principle the syntax of most programming languages ​​does not typically allow you to insert these control characters directly into the source code in how much they would obviously alter the control logic of the program. However, since many languages ​​allow the use of comments and strings, it is possible to insert control characters inside comments or strings and have them have an impact on the source code as the BiDi ovverrides they do not respect string and comment delimiters.

The type of attack described by Boucher and Anderson is based on this characteristic typical of today's encodings. In fact, thanks to the control characters it is possible to exploit the difference between what is displayed and what is actually encoded and passed to the compiler and interpreter to conduct the type of "Trojan Source Attack" attack described by Boucher and Anderson.

Boucher and Anderson also demonstrated that this type of attack is possible with the most widely used programming languages ​​including C, C ++, C #, JavaScript, Java, Rust, Go, and Python. 

The example below, in Python code, shows that the code encoded (and therefore executed on the machine) on the left is different from the source code (on the right) displayed.

Based on the latter one would expect to find a value equal to 50 in bank ['alice'] at the end of the execution, while in practice the present value will still be equal to 100.

The fact that until now there is no evidence of this kind of attack does not reassure us for the future since, like us, others may have read the paper and, perhaps, made the decision to test the concepts.

Davide Ariu, Giorgio Giacinto and Alessandro Rugolo

To learn more:

Trojan Source: Invisible Vulnerabilities

'Trojan Source' bug a novel way to attack program encodings (techxplore.com)

Trojan Source attack is dangerous for compilers of program languages ​​(gridinsoft.com)

Trojan Source Attacks

GitHub - nickboucher / trojan-source: Trojan Source: Invisible Vulnerabilities

The Unicode Bidirectional Text Algorithm - Developer guides | MDN (mozilla.org)

https://doi.org/10.1145/358198.358210

https://www.cl.cam.ac.uk/~rja14/book.html