The Commonwealth Bank of Australia spent around $750 million and 5 years of work to convert its platform from COBOL to Java. Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and is often costly.
Typically, a transcompiler is used to convert source code from one high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one.
They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive.
To address this, researchers at Facebook AI explored existing unsupervised machine learning methods and developed a model that can translate functions between C++, Java, and Python with high accuracy.
The model, called TransCoder, is a sequence-to-sequence (seq2seq) model with attention, composed of an encoder and a decoder with a transformer architecture.
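The "with attention" part of that architecture can be sketched in a few lines of NumPy. This is a generic refresher on scaled dot-product attention, not code from the paper; the shapes and names below are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """The attention core of a transformer encoder-decoder: each query
    (e.g. a decoder position) takes a weighted average of the values,
    with weights given by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Toy shapes: 2 decoder positions attending over 3 encoder states of width 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(2, 4), (3, 4), (3, 4)])
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to one, so every output is a convex combination of encoder states; this is what lets the decoder "look back" at any part of the input function while generating the translation.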
Overview Of The TransCoder Model
Translating source code from one Turing-complete language to another is always possible in theory. Unfortunately, building a translator is difficult in practice. The quality of machine translation systems depends heavily on the quality of the available parallel data; however, for the majority of languages, parallel resources are rare or nonexistent, and creating a parallel corpus for training is not realistic.
In a paper titled, “Unsupervised Translation of Programming Languages,” the authors proposed to apply recent approaches in unsupervised machine translation, by leveraging a large amount of monolingual source code from GitHub to train a model, TransCoder, to translate between three popular languages: C++, Java and Python.
As illustrated above, the TransCoder model relies on three main principles:
- The first step initializes the model with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language.
- Next comes denoising auto-encoding, where the decoder is trained to generate valid sequences even when fed with noisy data, which also increases the encoder's robustness to input noise.
- Finally, there is a back-translation step that allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa.
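The interplay of the last two steps is easiest to see in miniature. The sketch below is a toy illustration with stub "models" (in TransCoder, both directions are a single shared transformer trained on GitHub code); every function name here is hypothetical:

```python
import random

random.seed(0)

# Stub "model" standing in for one translation direction. The name and the
# token-tagging behaviour are invented purely for this illustration.
def translate_py_to_cpp(py_tokens):
    return ["cpp:" + t for t in py_tokens]

def add_noise(tokens, drop_prob=0.2):
    # Denoising auto-encoding: corrupt the input (here, random token drops)
    # and train the decoder to reconstruct the original sequence.
    kept = [t for t in tokens if random.random() > drop_prob]
    return kept or tokens

def back_translation_step(monolingual_python):
    # Back-translation: use the current Py -> C++ model to synthesise C++
    # "sources", each paired with its original Python as the training target
    # for the C++ -> Py direction.
    pairs = []
    for py in monolingual_python:
        synthetic_cpp = translate_py_to_cpp(add_noise(py))
        pairs.append((synthetic_cpp, py))
    return pairs

corpus = [["def", "f", "(", "x", ")", ":", "return", "x"]]
pairs = back_translation_step(corpus)
```

The key point is that no human-written parallel pairs appear anywhere: the (synthetic C++, real Python) pairs are manufactured from monolingual code, and as each direction improves, the pairs it produces for the other direction improve too.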
For training, the researchers used the GitHub public dataset, which contains more than 2.8 million open-source repositories. From these, they kept projects whose licenses explicitly permit the redistribution of parts of the project, and selected the C++, Java, and Python files within those projects.
The above picture demonstrates TransCoder at work, successfully translating the Python input function SumOfKsubArray into C++. TransCoder infers the types of the arguments, the types of the variables, and the return type of the function, and uses the associated front, back, pop_back and push_back methods to retrieve and insert elements into the deque, instead of the Python square brackets [ ], pop and append methods. It also converts the Python for loop and range function properly.
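A function of this flavour computes, for every contiguous subarray of length k, the sum of its minimum and maximum elements. The Python version below is a standard monotonic-deque solution written by us for illustration, not the exact code from the paper; the comments show which C++ std::deque methods a translation would map each list operation to:

```python
def sum_of_k_sub_array(arr, k):
    """Sum of (min + max) over all contiguous subarrays of length k."""
    # Monotonic queues of indices; plain Python lists play the role of
    # C++ std::deque here.
    s = []  # increasing values: s[0] indexes the current window minimum
    g = []  # decreasing values: g[0] indexes the current window maximum
    total = 0

    def push(i):
        while s and arr[i] <= arr[s[-1]]:  # s[-1]  ~  s.back()
            s.pop()                        # pop()  ~  s.pop_back()
        while g and arr[i] >= arr[g[-1]]:
            g.pop()
        s.append(i)                        # append ~  push_back(i)
        g.append(i)

    for i in range(k):
        push(i)
    for i in range(k, len(arr)):
        total += arr[s[0]] + arr[g[0]]     # s[0]   ~  s.front()
        while s and s[0] <= i - k:
            s.pop(0)                       # pop(0) ~  s.pop_front()
        while g and g[0] <= i - k:
            g.pop(0)
        push(i)
    total += arr[s[0]] + arr[g[0]]
    return total
```

For example, `sum_of_k_sub_array([2, 5, -1, 7, -3, -1, -2], 4)` returns 18. Translating this correctly requires exactly the mapping described above: list indexing and `pop`/`append` on the Python side become `front`, `back`, `pop_back` and `push_back` on a C++ deque.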
The results also show that the model can learn to translate the C++/Java ternary operator “X ? A : B” to Python's conditional expression “A if X else B” in an unsupervised way.
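Concretely, the two syntaxes map like this (the example function is ours, not from the paper):

```python
def magnitude(x):
    # C++/Java ternary:              return x >= 0 ? x : -x;
    # Python conditional expression:
    return x if x >= 0 else -x
```

The operand order differs between the languages (condition first in C++/Java, condition in the middle in Python), which is why this is a non-trivial pattern for a translator to pick up without supervision.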
Code migration or codebase portability is a tricky and expensive undertaking for any organisation, and an AI assistant that takes care of the nitty-gritty differences between programming languages can be quite handy. The key contributions of this work, according to the authors, can be summarised as follows:
- Introduction of a new approach for translating functions from one programming language to another, based purely on monolingual source code.
- TransCoder successfully manages to grasp complex patterns specific to each language and translate them to other languages.
- Results show that a fully unsupervised method can outperform commercial systems that leverage rule-based methods and advanced programming knowledge.
Know more about TransCoder here.