LLMs are an ethical nightmare, and band-aid solutions are nowhere to be found. As users struggle with the problematic outputs of language models, researchers have been striving to solve the underlying problems one by one.
A collectively authored research paper from Stability AI, Meta AI Research and others has established a set of open problems so that ML researchers can grasp the field’s current state more quickly and become more productive. The paper discusses the design, behaviour, and science behind the models rather than their political, philosophical, or moral aspects.
Furthermore, the authors have identified 11 domains where LLMs have been successfully applied. Across these, they provide an overview of existing work as well as the constraints they identify in the literature. The research aims to provide a map for future work to focus on.
Issues Raised
“People often think the machine learning algorithms introduce bias. Fifty years ago, everybody knew ‘garbage in, garbage out’. In this particular case, it is ‘bias in, bias out’,” veteran data scientist and Turing Award laureate Jeffrey Ullman told AIM. Along similar lines, the research paper addresses the first challenge of ‘unfathomable data’.
The next issue the paper addresses is tokenisation – the process of breaking a sequence of words or characters into smaller units. The number of tokens necessary to convey the same information varies significantly across languages, making the pricing policy of API language models unfair. For instance, to generate 800 words using the Ada model, the Hindi translation would require nearly 7x as many tokens – and therefore 7x the price – compared to the same text produced in English. For a language like Kannada, the pricing is 11x that of English.
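One root cause of this disparity can be sketched without any model at all. Byte-level tokenisers (the kind used by many commercial LLMs) start from UTF-8 bytes, and scripts like Devanagari need three bytes per character while Latin needs one – so, absent merges learned for that script, non-Latin text consumes far more tokens. The snippet below is a rough illustration using byte counts as a proxy, not the actual OpenAI tokeniser:

```python
# Toy illustration (not the real Ada tokeniser): byte counts as a rough
# proxy for byte-level BPE token counts across scripts.

def utf8_byte_count(text: str) -> int:
    """Number of UTF-8 bytes, a rough upper bound on byte-level tokens."""
    return len(text.encode("utf-8"))

english = "Hello, how are you?"    # Latin script: 1 byte per character
hindi = "नमस्ते, आप कैसे हैं?"          # Devanagari: 3 bytes per character

for label, text in [("English", english), ("Hindi", hindi)]:
    print(f"{label}: {len(text)} characters, {utf8_byte_count(text)} bytes")
```

Both sentences are about 20 characters long, yet the Hindi one occupies well over twice as many bytes – a gap that learned tokenisers only partly close.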
The pricing factor is not restricted to tokens alone – a hefty price is also paid for training these models. A few months ago, the cost of training a language model like GPT-3 was estimated at $5 million. The researchers suggest that, when selecting a model size, the computational resources needed for later usage should be taken into consideration rather than the one-time training cost alone.
Next is the issue of context length, which prevents models from handling long inputs well enough to support applications like novel or textbook writing and summarisation. Very recently, AI researchers stopped obsessing over model size and set their eyes on context size. The model-size debate has been settled for now: smaller LLMs trained on much more data have proven to be better than anything else that we know of. But then the painful task of fine-tuning models on individual downstream tasks (e.g., for text classification or sequence labelling) gets in the way.
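Until context windows grow, a common workaround (not specific to this paper) is to process long documents in overlapping pieces. The sketch below shows the idea, with `summarize` as a hypothetical stand-in for a real model call:

```python
# Sketch of a common workaround for limited context windows: split a long
# document into overlapping chunks, process each independently, then
# combine the results. `summarize` is a hypothetical stand-in for an
# LLM API call.

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200):
    """Yield overlapping character windows so nothing is lost at a boundary."""
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + chunk_size]

def summarize(chunk: str) -> str:
    # Hypothetical stand-in for a real model call.
    return chunk[:50]

def summarize_long_document(text: str) -> str:
    partials = [summarize(c) for c in chunk_text(text)]
    return summarize(" ".join(partials))  # second pass condenses partials
```

The overlap keeps sentences straddling a boundary visible to at least one chunk; the cost is redundant computation, which is exactly why longer native context windows are so sought after.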
The paper then talks about other issues such as prompt brittleness (variations in prompt syntax, often occurring in ways unintuitive to humans, can result in dramatic output changes), misaligned behaviour, and hallucinations. The researchers also take into account the shortcomings of current methods for evaluating and benchmarking language models.
Since ChatGPT became an internet celebrity, differentiating between human-written text and AI-generated text has become close to impossible. As a probable solution, AI detection tools are available all over the internet, and companies like Google have announced plans to label metadata and watermark AI-generated content on their websites.
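One family of watermarking schemes proposed in the research literature works roughly like this: the generator hashes the previous token to pseudo-randomly split the vocabulary into a “green” and a “red” half, and prefers green tokens; a detector that knows the hash then checks whether a suspiciously high fraction of tokens is green. The following is a toy sketch of that idea with a stand-in vocabulary, not any vendor’s actual implementation:

```python
# Toy sketch of a "green list" watermark: generation prefers tokens from a
# hash-derived half of the vocabulary; detection counts how often that
# preference shows up. VOCAB is a stand-in, not a real tokenizer vocabulary.
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(100)]

def green_list(prev_token: str, frac: float = 0.5) -> set:
    """Pseudo-randomly pick the 'green' half of the vocab from the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * frac)))

def generate_watermarked(length: int, seed: int = 0) -> list:
    """Stand-in generator that always samples from the green list."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB)]
    for _ in range(length - 1):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_fraction(tokens: list) -> float:
    """Detector statistic: ~1.0 for watermarked text, ~0.5 for ordinary text."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)
```

Real models only bias the sampling toward the green list rather than forcing it, trading detection strength against text quality.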
Anyone who has used ChatGPT or any AI-powered chatbot knows that a prompt can generate different outputs just by moving a word here and there. Developing LLMs that are robust to a prompt’s style and format remains unsolved, leaving practitioners to design prompts ad hoc rather than systematically.
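Testing for this brittleness systematically is straightforward in principle: run the same task under several surface variants of the prompt and measure how often the answers agree. The sketch below does this with a hypothetical `model` stub; in practice it would be a real LLM API call, and the stub deliberately answers inconsistently to mimic brittleness:

```python
# Sketch of measuring prompt brittleness: same task, several prompt
# phrasings, then an agreement score. `model` is a hypothetical stub
# standing in for a real LLM call.
from collections import Counter

def model(prompt: str) -> str:
    # Hypothetical stub that is deliberately sensitive to phrasing.
    return "positive" if "review" in prompt.lower() else "POSITIVE"

VARIANTS = [
    "Classify the sentiment of this review: {text}",
    "What is the sentiment of the following review?\n{text}",
    "Sentiment ({text}):",
]

def agreement(text: str) -> float:
    """Fraction of prompt variants that produce the most common answer."""
    answers = [model(v.format(text=text)) for v in VARIANTS]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```

An agreement score below 1.0 flags a task whose answer depends on prompt wording rather than content – exactly the failure mode the paper describes.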
Solutions Offered
Everyone from startups to big tech companies is trying to solve the persistent issues in language models. The most common problem users have pointed out since day one is the hallucinatory nature of these models, which leads them to generate factually incorrect information. Open-source messiah Hugging Face has also raised red flags, warning that the hallucination problem can snowball.
Furthermore, on alignment research – the question of whether models such as OpenAI’s follow human intent and human values: “At the time [2014], this problem was almost completely neglected, but it is now becoming increasingly recognised by more mainstream AI researchers,” said philosopher Nick Bostrom in an interview with AIM. Today, even Google has an elaborate 34-page document on the ways the tech giant is tackling the issue of AI governance.
The research also states that the capability gap between fine-tuned closed-source and open-source models persists. With models like Vicuna, Stanford’s Alpaca, and Meta’s (leaked) LLaMA, the gap has definitely narrowed, but no model has proven to be an equal competitor to OpenAI’s GPT-4.
The authors of ‘Challenges and Applications of Large Language Models’ conclude that the problems pinpointed in the research remain unsolved. Apart from serving as a guideline for further research on language models, the paper also highlights the lack of training regulation and the need for stakeholders to step in.