
60% of GPT-3.5 Outputs Are Plagiarised: Report

Copyleaks used a proprietary scoring method considering identical text, minor alterations, paraphrasing, and more to assign a "similarity score."


A report from plagiarism detector Copyleaks has revealed that 60% of OpenAI’s GPT-3.5 outputs contain some form of plagiarism. The company used a proprietary scoring method considering identical text, minor alterations, paraphrasing, and more to assign a “similarity score.”

Copyleaks specializes in AI-based text analysis and offers plagiarism detection tools to businesses and schools. The company was in the game well before ChatGPT. Although GPT-3.5 was the model behind ChatGPT's debut, OpenAI has since upgraded to the more advanced GPT-4.

According to its latest findings, 45.7% of GPT-3.5's outputs contained identical text, 27.4% contained minor alterations, and 46.5% contained paraphrased text. A score of 0% implies complete originality, while 100% suggests no original content, as per the report.
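Copyleaks' actual scoring method is proprietary, so the sketch below is purely a hypothetical illustration of how a "similarity score" might combine the match types the report describes (identical text, minor alterations, paraphrasing) into a single 0-100 figure. The function name, weights, and example numbers are assumptions for illustration only and do not come from the report.

```python
# Hypothetical illustration only -- Copyleaks' real scoring method is proprietary.
# Combines the three match types named in the report into one 0-100 score.

def similarity_score(identical_words: int, minor_alteration_words: int,
                     paraphrased_words: int, total_words: int,
                     weights=(1.0, 0.8, 0.5)) -> float:
    """Return a 0-100 score: 0 means fully original, 100 means no original content.

    The weights are illustrative assumptions, not values from the report.
    """
    if total_words == 0:
        return 0.0
    w_identical, w_minor, w_paraphrase = weights
    weighted = (identical_words * w_identical
                + minor_alteration_words * w_minor
                + paraphrased_words * w_paraphrase)
    return min(100.0, 100.0 * weighted / total_words)

# Example: a 400-word output with 120 identical, 60 minorly altered,
# and 80 paraphrased words.
print(round(similarity_score(120, 60, 80, 400), 1))  # 52.0
```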

Copyleaks subjected GPT-3.5 to various tests, generating around a thousand outputs, each approximately 400 words, across 26 subjects. The results with the highest similarity score belonged to computer science (100%), followed by physics (92%) and psychology (88%). On the flip side, theatre (0.9%), humanities (2.8%), and English language (5.4%) registered the lowest similarity scores.

“Our models were designed and trained to learn concepts in order to help them solve new problems,” OpenAI spokesperson Lindsey Held told Axios. “We have measures in place to limit inadvertent memorization, and our terms of use prohibit the intentional use of our models to regurgitate content.”

Plagiarism goes beyond cutting and pasting entire sentences and paragraphs. The New York Times filed a lawsuit against OpenAI, stating that OpenAI’s AI systems’ “wide-scale copying” constitutes copyright infringement. OpenAI responded to the lawsuit, arguing that “regurgitation” is a “rare bug” and also accusing The New York Times of “manipulating prompts.”

Content creators more broadly, from authors to visual artists, have argued in court that generative AI is trained on their copyrighted work and therefore ends up spitting out exact copies. So far, though, the legal outcomes have largely favoured the companies rather than the creators. The NYT case offers a glimmer of hope, but the matter remains pending.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.