GitHub Copilot, the text-to-code AI tool, has been largely revolutionary in how people code. Twitter has been erupting with users describing how the tool has benefitted them, with organisation heads and developers alike hailing it for saving much of their time.
However, the latest discussion surrounding it suggests that things are murky.
Tim Davis, professor of computer science at Texas A&M University, took to Twitter to express his frustration that Copilot reproduces his copyrighted code for a particular prompt.
@github copilot, with "public code" blocked, emits large chunks of my copyrighted code, with no attribution, no LGPL license. For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. My code on left, github on right. Not OK. pic.twitter.com/sqpOThi8nf
— Tim Davis (@DocSparse) October 16, 2022
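For readers unfamiliar with the routine at issue, here is a minimal, from-scratch sketch in C of a compressed-sparse-column (CSC) transpose, the kind of textbook-sized kernel Davis says Copilot reproduced. To be clear, this is a generic illustration written for this article, not the CSparse cs_transpose code; the struct layout and names (csc, csc_transpose) are invented for the example.

/* Generic CSC transpose, written from scratch for illustration.
 * NOT the CSparse cs_transpose routine; names are hypothetical. */
#include <stdlib.h>

/* m-by-n sparse matrix in CSC form: column pointers p (length n+1),
 * row indices i (length nnz), numerical values x (length nnz). */
typedef struct { int m, n; int *p, *i; double *x; } csc;

/* Return C = A^T, or NULL on allocation failure. */
csc *csc_transpose(const csc *A)
{
    int m = A->m, n = A->n, nnz = A->p[n];
    csc *C = malloc(sizeof *C);
    if (!C) return NULL;
    C->m = n; C->n = m;
    C->p = calloc(m + 1, sizeof(int));
    C->i = malloc(nnz * sizeof(int));
    C->x = malloc(nnz * sizeof(double));
    int *w = malloc(m * sizeof(int));
    if (!C->p || !C->i || !C->x || !w) {
        free(C->p); free(C->i); free(C->x); free(C); free(w);
        return NULL;
    }

    /* Count entries in each row of A (= each column of C). */
    for (int k = 0; k < nnz; k++) C->p[A->i[k] + 1]++;
    /* Cumulative sum turns the counts into column pointers for C. */
    for (int r = 0; r < m; r++) C->p[r + 1] += C->p[r];

    /* Scatter: walk A column by column, dropping each entry into its
     * transposed position; w tracks the next free slot per column of C. */
    for (int r = 0; r < m; r++) w[r] = C->p[r];
    for (int j = 0; j < n; j++) {
        for (int k = A->p[j]; k < A->p[j + 1]; k++) {
            int q = w[A->i[k]]++;
            C->i[q] = j;        /* row index in C = column index in A */
            C->x[q] = A->x[k];
        }
    }
    free(w);
    return C;
}

The point of the sketch is that a transpose like this is short, well known, and easy to write independently, which is precisely why verbatim reproduction of a specific author's version, comments and all, is what makes the Copilot output contentious.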
Chris Rackauckas, lead developer of SciML, also shared a July 2021 thread by Armin Ronacher, adding, “Github Copilot spits out the Quake source code. It just repeats its training data often, even without OSS licenses”.
I feel sorry for you given how many really bad takes you're getting here. For some more ammo, here's a thread that shows that Github Copilot spits out the Quake source code. It just repeats its training data often, even without OSS licenses. Oops. https://t.co/YDN3nBoMYJ
— Chris Rackauckas (@ChrisRackauckas) October 16, 2022
But beyond this, the latest news making the rounds concerns Matthew Butterick, a writer, programmer, and lawyer, who announced on October 17 that he is teaming up with the Joseph Saveri Law Firm to investigate a potential lawsuit against GitHub Copilot on the grounds of violating open-source licences. Writing on the issue of copyright violation in June 2022, Butterick cautioned organisations building software products against using Copilot, as they would be using someone else’s intellectual property, albeit unintentionally.
Copilot is trained on billions of lines of public code. But there is no certainty that this training qualifies as fair use under copyright law. In presenting the case, Butterick writes that Microsoft characterises Copilot’s output as merely a series of “suggestions” and does not claim any rights over it. He also cites a passage from GitHub’s website showing how Microsoft plays it safe by shifting responsibility onto the end user:
“You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP [intellectual property] scanning, and tracking for security vulnerabilities.”
In a recent statement, OpenAI claimed that training material from public repositories is not meant to appear in the output generated by Copilot. Additionally, its analysis has shown that the vast majority of the output (more than 90%) does not match the training data.
Opinion is divided (a grey area, if you will) on which of the two parties stands right legally. GitHub has made it clear that users need to check that the code they use is free of copyright infringement, but at the same time, open-source communities see the “AI training is fair use” framing applied to their copyrighted code as a disregard for their rights. See, for example, this statement by Butterick: “By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub.”
Hence, there is little clarity over who is to be held accountable: the makers of Copilot, or the end users employing the AI-generated code in their products?
GitHub’s claim that AI training comes under fair use needs more inspection. This is not the first time questions of copyright have arisen in AI applications; it has been a persistent issue throughout the recent surge in generative AI models.
In a 2017 interview with IPW, Ben Sobel described the problem as a “fair use dilemma”. His argument goes like this:
(i) If machine learning does not come under fair use, then organisations would have to pay remedies to the millions of people whose work forms the training data on which machines learn. This would hinder progress in the field.
(ii) But if it does come under fair use, organisations are likely to take liberties in using people’s intellectual labour for their own profit.
Therefore, it is no stretch to say that the legal terrain around AI use is difficult. If Butterick has a case to take the makers of Copilot to court, the outcome of the lawsuit will have a huge impact on the future of open-source communities and generative AI models.