Last updated January 28, 2022

How language models perfected plagiarism to an art

Today, most institutions employ text-matching software to counteract plagiarism.

Published on January 28, 2022
by Srishti Mukherjee

Neural network language models (LMs) are capable of producing grammatical and coherent text. But the originality of the text such models churn out is suspect.

So, are these LMs simply “stochastic parrots” regurgitating text, or have they actually learned how to produce intricate structures that support sophisticated generalisation?

Why is novelty important?

The novelty of a generated text tells us how different it is from the training set. Studying the novelty of LMs is important for two main reasons. First, models are supposed to learn the training distribution, not just memorise the training set. Second, models that simply copy the training data are more likely to expose sensitive information or repeat hate speech.

Researchers at Johns Hopkins University, New York University, Microsoft Research, and Facebook AI Research have, in a recent paper, proposed a method to measure the novelty of the text generated by LMs. The study looked into how well LMs repurpose language in novel ways.

Are language models plagiarising training data?

To evaluate the novelty of generated text, the researchers introduced a suite of analyses (called RAVEN) that covers both the sequential and syntactic structure of the text. They then applied these analyses to an LSTM, a Transformer, Transformer-XL, and all four sizes of GPT-2.

According to their findings, all of these models were able to demonstrate novelty in all aspects of the structure. They generated novel n-grams, morphological combinations, and syntactic structures. 74% of the sentences the Transformer-XL generated had a syntactic structure different from the training sentences, and GPT-2 was able to come up with original words (including inflections and derivations).

That said, for smaller n-grams, the models are still less novel than the baseline, which compares the degree of duplication in model-generated text with that in human-generated text. Additionally, there is occasional evidence of large-scale copying. For instance, GPT-2 sometimes copies long training passages (more than 1,000 words).
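The n-gram side of this comparison can be illustrated with a minimal sketch. It is not the RAVEN implementation: the function names `ngrams` and `novelty_rate` and the toy sentences are assumptions made up for illustration. The idea is to count what share of the n-grams in a generated text never appear in the training text.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty_rate(generated: List[str], training: List[str], n: int) -> float:
    """Fraction of generated n-grams that never occur in the training tokens."""
    seen: Set[Tuple[str, ...]] = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0  # text shorter than n tokens: nothing to measure
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data, invented for illustration only (hypothetical, not from the paper).
training_tokens = "the cat sat on the mat and the dog sat on the rug".split()
generated_tokens = "the dog sat on the mat".split()

for n in (1, 2, 5, 6):
    print(n, novelty_rate(generated_tokens, training_tokens, n))
```

On this toy input, every single word and word pair in the generated sentence already exists in the training text, so the rate is 0.0 for n = 1 and n = 2. It rises to 0.5 for n = 5 and 1.0 for n = 6, which mirrors the general pattern that short n-grams are rarely novel while long ones usually are.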

All things considered, it’s safe to assume neural language models do not just plagiarise the training data, but also use constructive processes to combine familiar parts in novel ways.

Threat to academic integrity?

Neural language models are so good at generating novel text that it has become difficult for statistical and traditional ML solutions to detect machine-obfuscated plagiarism.

AI writing assistants like OpenAI’s GPT-3 are alarmingly simple to use. You can type in a headline and a few sentences on the topic, and GPT-3 will automatically begin filling in the details. The model produces plausible content and endless output and, most importantly, allows you to communicate with the “robot writer” to correct errors.

The efficiency stems from the ever-increasing size of training data. For context, the entirety of Wikipedia (which consists of more than 6 million articles and 3.9 billion words) makes up only 0.6% of the training input for GPT-3.

Studies show a shocking number of students use online paraphrasing tools such as SpinBot and SpinnerChief to disguise plagiarised text. Such tools use AI to alter text (such as by replacing words with their synonyms) to give the work a semblance of originality.

The use of neural language models for paraphrasing is a recent trend, and so far there isn’t enough accumulated data to train plagiarism detection systems (PDS). Today, most institutions employ text-matching software to counteract plagiarism. The tools are effective in identifying duplicated text but struggle to detect paraphrases, translations, and other artful forms of plagiarism.
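A minimal sketch shows why plain text matching misses paraphrases. It assumes a simple word 3-gram overlap score, which is a stand-in for illustration and not the algorithm of any particular product. The helper names `shingles` and `containment` and the example sentences are hypothetical.

```python
from typing import Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Lowercase the text, split on whitespace, and collect word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspect: str, source: str, n: int = 3) -> float:
    """Share of the suspect's word n-grams that also appear in the source."""
    suspect_grams = shingles(suspect, n)
    if not suspect_grams:
        return 0.0  # suspect text shorter than n words
    return len(suspect_grams & shingles(source, n)) / len(suspect_grams)


# Hypothetical example texts, written for this sketch.
source = "the tools are effective in identifying duplicated text"
verbatim = "the tools are effective in identifying duplicated text"
paraphrase = "the programs are good at spotting copied passages"

print(containment(verbatim, source))    # exact copy: every 3-gram matches
print(containment(paraphrase, source))  # synonym swaps: no 3-gram matches
```

The verbatim copy scores 1.0, while the paraphrase scores 0.0 even though it says the same thing, because replacing words with synonyms breaks every shared word sequence.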

Plagiarism detection systems

Plagiarism detection technology taps lexical, syntactic, semantic, and cross-lingual text analysis. Some methods focus on non-textual features, such as academic citations, images, and mathematical content, to uncover plagiarism. Meanwhile, most research on detecting AI-aided paraphrasing concentrates on quantifying how similar two sentences are to each other.
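As a rough illustration of scoring how similar two sentences are, the sketch below uses cosine similarity over word counts. This is only the simplest lexical measure, chosen here as an assumption for demonstration; it is not the method of any system mentioned above, and the sentences are invented.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine of the angle between the two sentences' word-count vectors."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # Counter returns 0 for words missing from the other sentence.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical sentences, written for this sketch.
original = "students disguise copied text"
partial = "students mask copied writing"
rewritten = "pupils mask borrowed writing"

print(cosine_similarity(original, original))
print(cosine_similarity(original, partial))
print(cosine_similarity(original, rewritten))
```

The score falls from 1.0 to 0.5 to 0.0 as more words are swapped, although all three sentences mean roughly the same thing. A purely lexical measure cannot see that "mask" and "disguise" are near-synonyms, which is why semantic analysis matters for paraphrase detection.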

According to a paper published by the University of Wuppertal in 2021, obtaining additional training data is the best solution to improve detection of machine-paraphrased text.

Srishti Mukherjee

Drowned in reading sci-fi, fantasy, and classics in equal measure; Srishti carries her bond with literature head-on into the world of science and tech, learning and writing about the fascinating possibilities in the fields of artificial intelligence and machine learning. Making hyperrealistic paintings of her dog Pickle and going through succession memes are her ideas of fun.
