Java To Python And Back, AI That Translates Programming Languages

Code migration or codebase portability is a tricky yet expensive decision for any organisation, and an AI assistant that takes care

The Commonwealth Bank of Australia spent around $750 million and 5 years of work to convert its platform from COBOL to Java. Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and is often costly. 

Usually, a transcompiler is deployed that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. 

They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive.

AIM Daily XO

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

So, the researchers at Facebook’s AI wing explored the existing unsupervised ML methods and came up with a model that can translate functions between C++, Java, and Python with high accuracy.

The model called TransCoder, is a sequence-to-sequence (seq2seq) model with attention composed of an encoder and a decoder with a transformer architecture.

Download our Mobile App

Overview Of The TransCoder Model

Translating source code from one Turing-complete language to another is always possible in theory. Unfortunately, building a translator is difficult in practice. The quality of machine translation systems highly depends on the quality of the available parallel data. However, for the majority of languages, parallel resources are rare or nonexistent. Since creating a parallel corpora for training is not realistic.

In a paper titled, “Unsupervised Translation of Programming Languages,” the authors proposed to apply recent approaches in unsupervised machine translation, by leveraging a large amount of monolingual source code from GitHub to train a model, TransCoder, to translate between three popular languages: C++, Java and Python.

As illustrated above, the TransCoder model functions on three main principles:

  • The first principle initializes the model with a cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language. 
  • Next comes Denoising auto-encoding, where the decoder is trained to generate valid sequences even when fed with noisy data, and it increases the encoder robustness to input noise. 
  • Finally, there is a Back-translation step that allows the model to generate parallel data which can be used for training. Whenever the PythonC++ model becomes better, it generates more accurate data for the C++Python model and vice versa. 

For training, the researchers used the GitHub public dataset that contains more than 2.8 million open-source GitHub repositories. Out of which, they filtered projects whose license explicitly permits the redistribution of parts of the project, and selected the C++, Java, and Python files within those projects.

The above picture demonstrates the working of TransCoder where it successfully translates the

Python input function SumOfKsubArray into C++. TransCoder infers the types of the arguments of the variables and the return type of the function, and uses the associated front, back, pop_back and push_back methods to retrieve and insert elements into the deque, instead of the Python square brackets [ ], pop and append methods. It also converts the Python for loop and range function properly.

The results also show that the model can learn to translate the ternary operator “X ? A : B” in C++ or Java to “if X then A else B” in Python, in an unsupervised way.

Key Takeaways

Code migration or codebase portability is a tricky yet expensive decision for any organisation, and an AI assistant that takes care of the nitty-gritty dependencies within the programming languages can be quite handy. The key contributions of this work, according to the authors, can be summarised as follows:

  • Introduction of a new approach to translating functions from a programming language to another, which is purely based on monolingual source code.
  • TransCoder successfully manages to grasp complex patterns specific to each language and translate them to other languages.
  • Results show that a fully unsupervised method can outperform commercial systems that leverage rule-based methods and advanced programming knowledge.

Know more about TransCoder here.

Sign up for The Deep Learning Podcast

by Vijayalakshmi Anandan

The Deep Learning Curve is a technology-based podcast hosted by Vijayalakshmi Anandan - Video Presenter and Podcaster at Analytics India Magazine. This podcast is the narrator's journey of curiosity and discovery in the world of technology.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

Our Upcoming Events

24th Mar, 2023 | Webinar
Women-in-Tech: Are you ready for the Techade

27-28th Apr, 2023 I Bangalore
Data Engineering Summit (DES) 2023

23 Jun, 2023 | Bangalore
MachineCon India 2023 [AI100 Awards]

21 Jul, 2023 | New York
MachineCon USA 2023 [AI100 Awards]

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox

Council Post: Evolution of Data Science: Skillset, Toolset, and Mindset

In my opinion, there will be considerable disorder and disarray in the near future concerning the emerging fields of data and analytics. The proliferation of platforms such as ChatGPT or Bard has generated a lot of buzz. While some users are enthusiastic about the potential benefits of generative AI and its extensive use in business and daily life, others have raised concerns regarding the accuracy, ethics, and related issues.