Java To Python And Back, AI That Translates Programming Languages

The Commonwealth Bank of Australia spent around $750 million and 5 years of work to convert its platform from COBOL to Java. Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and is often costly. 

Usually, a transcompiler is deployed that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. 

They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive.
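
To make the idea of handcrafted rewrite rules concrete, here is a minimal sketch of a rule-based translator for a tiny subset of Python expressions, built on Python's standard ast module. It is not how any particular transcompiler works; the function names (to_cpp, translate_expression) and the choice of rules are assumptions made for illustration.

```python
import ast

# Hand-written tables mapping Python AST operator types to C++ tokens.
BINARY_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}
COMPARE_OPS = {ast.Lt: "<", ast.Gt: ">", ast.Eq: "=="}
BOOL_OPS = {ast.And: "&&", ast.Or: "||"}


def to_cpp(node):
    """Apply one rewrite rule per supported node type (illustrative subset only)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return str(node.value)
    if isinstance(node, ast.BinOp):
        left, right = to_cpp(node.left), to_cpp(node.right)
        if isinstance(node.op, ast.Pow):
            # Python's ** has no C++ operator, so this rule emits a library call.
            return f"std::pow({left}, {right})"
        if type(node.op) in BINARY_OPS:
            return f"({left} {BINARY_OPS[type(node.op)]} {right})"
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        op = type(node.ops[0])
        if op in COMPARE_OPS:
            right = to_cpp(node.comparators[0])
            return f"({to_cpp(node.left)} {COMPARE_OPS[op]} {right})"
    if isinstance(node, ast.BoolOp):
        joined = f" {BOOL_OPS[type(node.op)]} ".join(to_cpp(v) for v in node.values)
        return f"({joined})"
    # Anything without a rule has to be handled by a person.
    raise NotImplementedError(f"no rewrite rule for {type(node).__name__}")


def translate_expression(source):
    """Parse a Python expression and translate it with the rules above."""
    return to_cpp(ast.parse(source, mode="eval").body)


print(translate_expression("x ** 2 + 1"))       # (std::pow(x, 2) + 1)
print(translate_expression("a < b and b < c"))  # ((a < b) && (b < c))
```

The output is valid but over-parenthesised, and a chained comparison such as a < b < c raises NotImplementedError until someone writes a rule for it. Both effects mirror the readability and manual-effort problems described above.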

So, the researchers at Facebook’s AI wing explored the existing unsupervised ML methods and came up with a model that can translate functions between C++, Java, and Python with high accuracy.

The model, called TransCoder, is a sequence-to-sequence (seq2seq) model with attention, composed of an encoder and a decoder with a transformer architecture.

Overview Of The TransCoder Model

Translating source code from one Turing-complete language to another is always possible in theory. Unfortunately, building a translator is difficult in practice. The quality of machine translation systems depends heavily on the quality of the available parallel data. However, for the majority of languages, parallel resources are rare or nonexistent, and creating a parallel corpus for training is not realistic.

In a paper titled, “Unsupervised Translation of Programming Languages,” the authors proposed to apply recent approaches in unsupervised machine translation, by leveraging a large amount of monolingual source code from GitHub to train a model, TransCoder, to translate between three popular languages: C++, Java and Python.

As illustrated above, the TransCoder model functions on three main principles:

  • First, the model is initialized with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language.
  • Next comes denoising auto-encoding, which trains the decoder to generate valid sequences even when fed noisy data, and increases the encoder's robustness to input noise.
  • Finally, there is a back-translation step that allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa.
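
The data preparation behind these three principles can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the mask symbol, the probabilities, the token-dropping noise function and the stand-in translator fake_cpp_to_python are all assumptions made for the example, and no neural network is involved.

```python
import random

MASK = "<MASK>"  # placeholder mask symbol (an assumption, not the paper's vocabulary)


def mask_tokens(tokens, mask_prob, rng):
    """Masked LM pretraining input: hide tokens and remember what to predict."""
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


def drop_tokens(tokens, drop_prob, rng):
    """Denoising auto-encoding input: a corrupted copy of the sequence.
    The training target would be the original, uncorrupted sequence."""
    return [token for token in tokens if rng.random() >= drop_prob]


def back_translation_pairs(cpp_functions, cpp_to_python):
    """Back-translation: pair each machine-generated Python translation (source)
    with the real C++ function it came from (target). These synthetic pairs
    would then be used to train the Python -> C++ direction."""
    return [(cpp_to_python(function), function) for function in cpp_functions]


def fake_cpp_to_python(cpp_source):
    # Stand-in for the current C++ -> Python model; a real system calls the network.
    return "# translated from C++\n" + cpp_source


rng = random.Random(0)
tokens = "int add ( int a , int b ) { return a + b ; }".split()

masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=rng)
noisy = drop_tokens(tokens, drop_prob=0.1, rng=rng)
pairs = back_translation_pairs([" ".join(tokens)], fake_cpp_to_python)

print(masked, noisy, pairs, sep="\n")
```

In the back-translation pairs the target side is always real code written by a person, while the source side is machine-generated. As the translator in one direction improves, the synthetic sources get cleaner and the opposite direction gets better training data.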

For training, the researchers used the GitHub public dataset, which contains more than 2.8 million open-source GitHub repositories. From these, they kept only projects whose license explicitly permits the redistribution of parts of the project, and selected the C++, Java, and Python files within those projects.
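
A filtering step of this kind could look roughly like the sketch below. The record layout (a dict with license and files keys), the license allowlist and the extension table are all hypothetical; the article does not say which licenses or file extensions the researchers accepted.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist; the article does not say which licenses were accepted.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}
# Hypothetical mapping from file extension to language.
EXTENSIONS = {".cpp": "cpp", ".cc": "cpp", ".java": "java", ".py": "python"}


def select_files(repositories):
    """Yield (language, path) for source files in permissively licensed repos.

    Each repository is assumed to be a dict with 'license' and 'files' keys;
    this record layout is invented for the example.
    """
    for repo in repositories:
        if (repo.get("license") or "").lower() not in PERMISSIVE_LICENSES:
            continue
        for path in repo["files"]:
            language = EXTENSIONS.get(PurePosixPath(path).suffix.lower())
            if language is not None:
                yield language, path


repos = [
    {"license": "MIT", "files": ["src/main.cpp", "README.md", "tools/run.py"]},
    {"license": "proprietary", "files": ["App.java"]},
    {"license": None, "files": ["x.py"]},
]
selected = list(select_files(repos))
print(selected)
```

Only the two source files from the MIT-licensed repository survive; the other repositories are skipped before their files are inspected.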

The above picture demonstrates the working of TransCoder, where it successfully translates the Python input function SumOfKsubArray into C++. TransCoder infers the types of the arguments, of the variables, and the return type of the function, and uses the associated front, back, pop_back and push_back methods to retrieve and insert elements into the deque, instead of the Python square brackets [ ], pop and append methods. It also converts the Python for loop and range function properly.
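
To show the kind of deque idioms involved, here is a small sliding-window function written for this article. It is not the SumOfKsubArray function from the paper, and the comments naming the C++ std::deque counterparts are hand-written, not TransCoder output.

```python
from collections import deque


def sliding_window_max(arr, k):
    """Return the maximum of every contiguous window of size k."""
    window = deque()  # indices into arr, with values in decreasing order
    result = []
    for i in range(len(arr)):                 # C++: for (int i = 0; i < n; i++)
        if window and window[0] <= i - k:     # C++: window.front()
            window.popleft()                  # C++: window.pop_front()
        while window and arr[window[-1]] <= arr[i]:  # C++: window.back()
            window.pop()                      # C++: window.pop_back()
        window.append(i)                      # C++: window.push_back(i)
        if i >= k - 1:
            result.append(arr[window[0]])
    return result


print(sliding_window_max([1, 3, -1, -3, 5, 3, 6, 7], 3))  # [3, 3, 5, 5, 6, 7]
```

A translator has to map square-bracket indexing, pop and append onto differently named C++ methods, and it also has to decide that window holds integers and that the function returns a vector of them, none of which is stated in the Python source.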

The results also show that the model can learn to translate the ternary operator “X ? A : B” in C++ or Java to “if X then A else B” in Python, in an unsupervised way.
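
For reference, Python has no ?: operator and no then keyword. The logic of X ? A : B is written either as a conditional expression or as an if/else statement. The two functions below are hand-written illustrations with made-up names, not TransCoder output.

```python
def largest(a, b):
    # C++/Java equivalent: return a > b ? a : b;
    return a if a > b else b


def largest_with_statement(a, b):
    # The same logic spelled out as an if/else statement.
    if a > b:
        return a
    else:
        return b


print(largest(3, 7), largest_with_statement(3, 7))  # 7 7
```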

Key Takeaways

Code migration or codebase portability is a tricky yet expensive decision for any organisation, and an AI assistant that takes care of the nitty-gritty dependencies within the programming languages can be quite handy. The key contributions of this work, according to the authors, can be summarised as follows:

  • Introduction of a new approach to translating functions from one programming language to another, based purely on monolingual source code.
  • TransCoder successfully manages to grasp complex patterns specific to each language and translate them to other languages.
  • Results show that a fully unsupervised method can outperform commercial systems that leverage rule-based methods and advanced programming knowledge.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
