GitHub Copilot: The Latest in the List of AI Generative Models Facing Copyright Allegations

GitHub Copilot, the text-to-code AI tool, has been facing accusations of stealing people’s code. So, what’s next for AI-generative models?

GitHub Copilot, the text-to-code AI tool, has been, for the most part, revolutionary in changing how people code. Twitter has been full of users describing how the tool has helped them, with organisation heads and developers alike praising it for saving them time.

However, the latest discussion surrounding it suggests that things are murky. 

Tim Davis, a professor of computer science at Texas A&M University, took to Twitter to object that Copilot had reproduced his copyrighted code in response to a particular prompt.

Chris Rackauckas, lead developer of SciML, also shared a July 2021 thread by Armin Ronacher, adding, “Github Copilot spits out the Quake source code. It just repeats its training data often, even without OSS licenses”.

Beyond this, the latest news doing the rounds concerns Matthew Butterick, a writer, programmer, and lawyer, who announced on October 17 that he would be teaming up with the Joseph Saveri Law Firm to investigate a potential lawsuit against GitHub Copilot on the grounds of violating open-source licences. Writing on the issue of copyright violation in June 2022, Butterick cautioned organisations that build software products against using Copilot, as they could end up using someone else’s intellectual property, albeit unintentionally.

Copilot is trained on billions of lines of public code. But there is no certainty over whether training on that data qualifies as fair use under copyright law. In presenting the case, Butterick writes that Microsoft characterises Copilot’s output code as only a series of “suggestions” and does not claim any rights over it. He also cites a passage from GitHub’s website showing how Microsoft plays it safe by shifting responsibility onto the end user:

“You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP [(= intellectual property)] scanning, and tracking for security vulnerabilities.”

In a recent statement, OpenAI claimed that the training material from public repositories is not meant to be included in the output generated by Copilot. Additionally, its analysis has shown that the vast majority of the output (>90%) doesn’t match the training data.

Opinion is divided (a grey area, if you will) over which of the two parties is legally in the right. GitHub has made it clear that users need to check whether the code they use is free of copyright infringement. At the same time, open-source communities see the claim that “AI training is fair use” of their copyrighted code as a disregard for their rights. See, for example, this statement by Butterick: “By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub.”

Hence, there is little clarity over who is to be held accountable: is it Copilot, or the end users who employ the AI-generated code in their products?

GitHub’s claim that AI training comes under fair use needs closer scrutiny. This is not the first time questions of copyright have arisen in AI applications. It has been a persistent issue throughout the recent surge in AI generative models.

In a 2017 interview with IPW, Ben Sobel described the problem as a “fair use dilemma”. His argument goes like this:

(i) If machine learning doesn’t come under fair use, then organisations will have to compensate the millions of people whose work makes up the training data on which machines learn. This will hinder progress in the field.

(ii) But if it does come under fair use, organisations are likely to take liberties in using people’s intellectual labour for their own profit.

It is therefore no stretch to say that the legal side of AI use is on difficult terrain. If Butterick does have a case to take the makers of Copilot to court, the outcome of the lawsuit will have a huge impact on the future of open-source communities and AI generative models.


Ayush Jain
Ayush is interested in knowing how technology shapes and defines our culture, and our understanding of the world. He believes in exploring reality at the intersections of technology and art, science, and politics.
