The Dark History Behind GitHub Copilot’s Success

Everyone is singing Copilot’s praises, but the AI programmer has a darker side

Over the past few years, we have seen considerable advancements in large language models (LLMs), with the number of parameters and features increasing exponentially. However, simply increasing the size of large language models does not make them viable for adoption in the real world. 

One of the key verticals where LLMs have been deployed is AI-powered auto coders. These algorithms can take natural language prompts and automatically write a code snippet that aligns with the syntax of a given language. 
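To make the idea concrete, here is a hand-written sketch of that interaction, not actual output from Copilot or any other tool. The function name `is_palindrome` and the task itself are hypothetical choices for illustration: the signature and docstring play the role of the natural language prompt, and the body is the kind of completion an auto coder might plausibly suggest.

```python
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards,
    ignoring case, spaces, and punctuation."""
    # The signature and docstring above stand in for the developer's prompt.
    # The lines below stand in for a suggested completion (illustrative only).
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]
```

The developer still has to review a suggestion like this one, since nothing guarantees that a generated body matches the intent described in the prompt.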

The adoption of an auto coder in the real world depends on a variety of factors, and one needn’t look further than the most prominent LLM-based coding tool on the market: GitHub Copilot. This so-called AI pair programmer can suggest code snippets and entire functions to a programmer while they edit, and has found widespread adoption and success in the developer community. 


Mike Krieger, the co-founder of Instagram, had this to say about GitHub Copilot: “This is the single most mind-blowing application of machine learning I’ve ever seen.”

However, other companies that looked to enter this vertical over the past few years have not found similar success; some have even failed. Copilot’s own success, meanwhile, has been dotted with controversies that have stained the reputation of an otherwise spotless tool.

Copilot’s secret sauce

We can identify a variety of reasons why Copilot succeeded while others failed. While it gets its programming chops from OpenAI’s Codex LLM, a deeper look into Copilot’s runaway success shows that it was the right product released at the right time. 

Derived from GPT-3, Codex is a specialized version of the general LLM focused on translating natural language to code. Even before Codex was released to the public, OpenAI collaborated with Microsoft to create Copilot.

The model not only builds on the natural language training behind GPT-3, but was also trained on billions of lines of source code from public GitHub repositories. This allowed it to learn code syntax and the contextual information needed for problem-solving tasks. Moreover, fine-tuning the model for coding-specific tasks made it fast and light on resources while providing high degrees of accuracy. 

Kite was one of the companies that failed, as it was unable to create a model good enough to complete code on par with Copilot. Apart from the tech not being good enough at the time, Kite did not have the resources required to create a state-of-the-art model like Codex. It estimated that it would cost around $100 million to build a model like Codex due to the computing resources required for training and inference. 

Microsoft has not only acquired an exclusive license for GPT-3, but has also worked closely with OpenAI to create Codex. Moreover, it has the nigh-infinite scalability of Microsoft Azure to train and deploy these models, affording it a sizable advantage over its competitors. 

The best product for the market 

Microsoft’s goals for the developer market go far beyond Copilot, which represents just one piece of the puzzle. With Azure, Visual Studio, VS Code, and GitHub, Microsoft is one of the most prominent companies in the development space. Copilot builds on its already powerful portfolio for developers. 

To begin with, Microsoft’s acquisition of GitHub solidified its position as a leader in programming. For the tech stack, it partnered with OpenAI to license GPT-3. Microsoft then developed Codex along with the OpenAI team, and trained it on various open-source repositories available on the platform, giving it one of the best datasets to train on.

Even though there were many reasons for Copilot to be a good product, the infrastructure behind it is equally important. Microsoft Azure is not only scalable, but also has cloud services optimized for training and deploying machine learning algorithms. This is the brains of Copilot: a globally available and scalable hardware pipeline that can be accessed on demand. 

It is simply not viable for other companies to obtain a dataset like the one Microsoft had to train Codex, as seen with TabNine. Even though it is a close competitor to Copilot, many still prefer Microsoft’s product. Due to the smaller dataset and models similar in size to GPT-3, TabNine does not perform as well as Copilot, producing messier code with a higher tendency to contain mistakes and cause errors. 

The darker side of Copilot

Even though Copilot seems to be the end-all solution to all coding problems, it is not without its own host of issues. The origins of the product show a more dangerous side of the auto-coding market.

Large language models are not an easy technology to access and deploy. Even if there are many companies with competing models, the companies with the most financial grunt and the most cloud computing resources will win out. 

Copilot has succeeded not merely because it is a good product, but because of Microsoft’s backing. From Azure, to OpenAI, to the huge cost required to train and run the algorithm for millions of developers, Microsoft has footed the bill for Copilot in the hopes of it becoming a money-making product sometime in the future.

Beyond the concern that LLMs run counter to open access for all, GitHub Copilot has its own share of blots. A class-action lawsuit has been filed against the company on the grounds that Microsoft has violated the rights of the vast number of creators whose code was used to train the algorithm. This dataset, which is one of the main reasons for GitHub Copilot’s accuracy, was scraped from the hard work of thousands of developers. Replit’s Ghostwriter, which is competing in the same field with responsibly sourced datasets, is struggling to capture market share.

Considering these factors, it is likely that other companies will also jump on the auto-coding bandwagon as an application of LLMs. As bigger players enter the field, Copilot’s unregulated usage of open-source code and cloud computing grunt will become the norm, raising the barrier to entry for companies that want to do things the right way. While competing against the never-ending coffers of tech giants, smaller companies simply cannot create a competing product with comparable latency, cost, and usability. 

We are already seeing this pattern, with Amazon Web Services releasing a competing product called CodeWhisperer. However, it still misses out on Copilot’s silver bullet for datasets: code from GitHub repositories. This is an advantage that no other company apart from Microsoft will ever have, and it sets a dangerous precedent for the future of auto-coding platforms. 

While the future of LLMs for computer generated code looks like it will be consolidated, smaller companies doing things the right way might come out on top after all. 

Anirudh VK
I am an AI enthusiast and love keeping up with the latest events in the space. I love video games and pizza.
