Tech Behind GitHub Copilot: The Coding Assistant From Microsoft & OpenAI

OpenAI and Microsoft have come together to release a technical preview of GitHub Copilot, an AI-based tool that helps programmers write better code. Copilot takes context from the code being worked on and suggests whole lines and functions.

Microsoft acquired GitHub, a popular code-repository service used by many developers and large companies, in 2018. In 2019, Microsoft invested $1 billion in OpenAI to build artificial general intelligence and jointly develop new Azure AI supercomputing technologies; Microsoft already holds exclusive licenses for OpenAI's GPT-3 language model.

Copilot is based on OpenAI Codex, an AI system trained on a dataset made up of a sizable chunk of public source code. Copilot works with a broad set of frameworks and languages, and the technical preview is ideal for languages like Python, JavaScript, TypeScript, Go, and Ruby.


GitHub Copilot is an AI pair programmer that works with any new framework or library. The programmer can describe a function in plain English in a comment, and Copilot will convert it to actual code. The tool is already acquainted with specific functions and features, and it helps the programmer quickly discover alternative ways to solve a problem, write tests, and explore new APIs. The team claimed Copilot is far more advanced than existing code assistants.
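The comment-to-code workflow described above can be sketched as follows. The plain-English comment is the kind of prompt a developer would write; the function body beneath it is a hand-written illustration of the sort of completion a tool like Copilot might propose, not actual Copilot output:

```python
# The comment below is the developer-written prompt; an AI pair programmer
# would then suggest a body. This body is an illustrative hand-written
# completion, not a real Copilot suggestion.

# Compute the average of a list of numbers, returning 0.0 for an empty list.
def average(numbers):
    if not numbers:
        return 0.0
    return sum(numbers) / len(numbers)

print(average([2, 4, 6]))  # prints 4.0
```

The suggestion still has to be read and tested by the developer, a point the team returns to below.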

Credit: GitHub

Copilot works best when the developer splits code into small functions, uses meaningful names for functions and parameters, and writes good docstrings and comments along the way. The tool was recently benchmarked against a set of Python functions with good test coverage in open-source repos: the team blanked out the function bodies and asked Copilot to fill them in. Copilot produced a correct body on the first attempt 43 percent of the time, and within ten attempts 57 percent of the time.
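The benchmark procedure the team describes — blank out a function body, obtain a model-proposed replacement, and grade it with the function's existing unit tests — can be sketched roughly like this. The candidate source and the tests here are hand-written stand-ins, not real model output or GitHub's actual harness:

```python
# Rough sketch of the benchmark described above: a function body is blanked
# out, a model-proposed body is compiled from source text, and pre-existing
# unit tests decide whether the attempt counts as correct.

# A candidate body, as a model might emit it, compiled from source text.
candidate_src = (
    "def is_palindrome(s):\n"
    "    return s == s[::-1]\n"
)
namespace = {}
exec(candidate_src, namespace)
candidate = namespace["is_palindrome"]

# The function's existing tests grade the suggestion: correct iff all pass.
tests = [("level", True), ("hello", False), ("", True)]
is_correct = all(candidate(arg) == want for arg, want in tests)
print(is_correct)  # prints True
```

Running many such trials per function, and allowing up to ten candidate bodies each, yields first-try and ten-try success rates of the kind the team reported.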

However, according to the team, Copilot is not a substitute for human programmers. The team explained: “GitHub Copilot tries to understand your intent and to generate the best code it can, but the code it suggests may not always work or even make sense. While we are working hard to make GitHub Copilot better, code suggested by GitHub Copilot should be carefully tested, reviewed, and vetted, like any other code. As the developer, you are always in charge.”

Copilot does not test the code it suggests, and a suggestion may not even compile or run. Since Copilot holds limited context, even a single source file longer than 100 lines is clipped, and the tool considers only the immediately preceding code. “You can use the code anywhere, but you do so at your own risk,” the team said.

Stochastic parrot?

Since GitHub Copilot is trained on billions of lines of publicly available code, the question arises whether its suggestions simply reproduce the code they are informed by. Notably, Timnit Gebru and her co-authors coined the term ‘stochastic parrots’ for AI systems that directly reproduce what they learn during training.

However, the team said fitting Copilot into that category of AI systems would be an oversimplification. The tool is more like a crow that builds novel tools from small blocks than a parrot repeating the existing corpus of publicly available code. As an engineer at GitHub puts it, such systems can feel like “a toddler with a photographic memory”.

GitHub Copilot is a code synthesiser, not a search engine: the vast majority of the code it suggests is uniquely generated and has not been seen before. Code duplication can’t be entirely ruled out, however. The team found that about 0.1 percent of the time, suggestions may contain verbatim snippets of code from the training set. This generally happens when the developer has not provided sufficient context or when the problem has a common, universal solution. Meanwhile, some users have pointed out that because Copilot is trained on public code, it could be considered a form of ‘open-source code laundering’. The team is now working on building an origin tracker to detect such instances of code duplication.
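As a rough illustration of what detecting "verbatim snippets from the training set" could involve — this is a naive sketch of the general idea, not GitHub's planned origin tracker — one could flag a suggestion that contains a long substring copied character for character from a known corpus:

```python
# Naive verbatim-overlap check: flag a suggestion if it contains any
# sufficiently long snippet from a known corpus, character for character.
# The corpus, threshold, and function name are illustrative assumptions.

TRAINING_SNIPPETS = [
    "while b:\n        a, b = b, a % b",
]

def contains_verbatim(suggestion: str, snippets, min_len: int = 20) -> bool:
    """Return True if any corpus snippet of at least min_len characters
    appears verbatim inside the suggestion."""
    return any(len(s) >= min_len and s in suggestion for s in snippets)

suggestion = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a"
print(contains_verbatim(suggestion, TRAINING_SNIPPETS))  # prints True
```

A production origin tracker would need to be far more robust — tolerant of renamed variables and reformatting — but the substring check conveys why short, universal solutions are the hardest cases to attribute.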


Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.

