
It’s Time to Oust The Imitation Game

A 70-year-old test has been overturned by modern LLMs.



The imitation game, better known as 'The Turing Test', seems to be obsolete. Devised by computer scientist extraordinaire Alan Turing in 1950, the test has long been a rule of thumb for gauging how human-like an AI is. It works by asking a human evaluator to distinguish whether a set of responses is coming from a computer or a human. This has largely sufficed to judge the efficacy of a human-like AI algorithm, but is now falling by the wayside.

New research, based on the world's largest Turing test, shows that AI algorithms have advanced to the point where the test simply does not work. To his credit, Turing predicted that in 50 years' time, computers would play the imitation game so well that an average interrogator would have no more than a 70% chance of correctly identifying whether they were talking to an AI or a human. In other words, the AI would fool at least 30% of the people.

With modern algorithms, AI researchers have found that this number hovers around Turing's prediction: when talking to fellow humans, 68% of participants guessed their partner correctly. However, when test-takers were facing an AI bot, they guessed right only 60% of the time. In other words, 40% of people did not realise they were talking to an AI agent, demonstrating that AI can now routinely fool humans.
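To see how such rates are derived, here is a minimal Python sketch that tallies guesses by partner type. The records and field names are invented for illustration; this is not AI21's data or code.

```python
# Minimal sketch: per-partner identification rates from a guess log.
# The records below are invented for illustration, not real study data.
guesses = [
    ("human", "human"), ("human", "bot"), ("human", "human"),
    ("bot", "bot"), ("bot", "human"), ("bot", "human"),
]  # each record: (actual partner, what the user guessed)

def identification_rate(records, partner):
    """Fraction of conversations with `partner` that were guessed correctly."""
    relevant = [(actual, guess) for actual, guess in records if actual == partner]
    correct = sum(1 for actual, guess in relevant if guess == actual)
    return correct / len(relevant)

for partner in ("human", "bot"):
    rate = identification_rate(guesses, partner)
    print(f"{partner}: {rate:.0%} identified correctly, {1 - rate:.0%} fooled")
```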

The World’s Largest Turing Test

AI21 Labs, a company offering NLP and other AI solutions, recently conducted a game called 'Human or Not?'. The app is a web version of the Turing Test in which users chat with an unknown partner for two minutes and must then guess whether they were talking to a human or a bot. This gamified test went unexpectedly viral, garnering over 2 million conversations between humans and bots.
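The structure of a round is simple enough to simulate. The following toy Python sketch captures the flow, with a coin-flip pairing, a stand-in chat, and a final guess; every function here is a made-up placeholder, not AI21's implementation.

```python
import random

# Toy stand-ins for participants; real chat logic is out of scope here.
def human_reply(msg):  return "haha yeah, " + msg.lower()
def bot_reply(msg):    return "That is an interesting point about " + msg

def run_round(ask_user):
    """One 'Human or Not?'-style round: chat with a hidden partner, then guess."""
    partner_is_bot = random.random() < 0.5           # coin-flip pairing
    reply = bot_reply if partner_is_bot else human_reply
    transcript = []
    for msg in ["Hi!", "What did you do today?"]:    # stands in for a 2-minute chat
        transcript.append((msg, reply(msg)))
    guessed_bot = ask_user(transcript)               # user inspects the chat and guesses
    return guessed_bot == partner_is_bot             # was the guess correct?

# Example: a 'user' who always guesses bot, over many simulated rounds.
rounds = [run_round(lambda t: True) for _ in range(10_000)]
print(f"correct guesses: {sum(rounds) / len(rounds):.0%}")   # ~50% for this naive strategy
```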

Among the key insights from the experiment: humans found it easier to identify a fellow human than a bot. Interestingly, India had the lowest percentage of correct guesses at 63.5%, while France's 71.3% was the highest. Moreover, younger age groups tended to guess better than older ones.

The game pitted human users against leading LLMs like GPT-4 and AI21's own Jurassic-2, each modified with its own quirks and tricks to throw off human users. The researchers raised an interesting point: many users' perceived limitations of large language models stemmed from their experience with ChatGPT and similar interfaces. To counter this, they added an additional layer of complexity to their bots to make them harder to identify.

For example, AI21 Labs exploited the assumption that bots do not make grammar mistakes or use slang: the researchers purposely trained their models to make common spelling mistakes and use trendy terms, making them seem more human. Similarly, humans assumed that asking or answering a personal question is something algorithms struggle with. However, the bots were able to draw on their training data to easily come up with personal stories, further fooling the humans.
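AI21's exact technique is not described here, so the sketch below is only a stand-in for the idea: a post-processing pass that injects occasional typos and slang into an otherwise polished model response. The substitution tables and rates are invented for illustration.

```python
import random

# Invented substitution tables; the real models were trained, not post-processed.
TYPOS = {"definitely": "definately", "tomorrow": "tommorow", "because": "becuase"}
SLANG = {"very": "super", "friend": "buddy", "yes": "yup"}

def humanize(text, typo_rate=0.3, seed=None):
    """Make a model response look more human by adding typos and slang."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        lower = word.lower()
        if lower in SLANG:
            word = SLANG[lower]
        elif lower in TYPOS and rng.random() < typo_rate:
            word = TYPOS[lower]
        words.append(word)
    out = " ".join(words)
    # Humans often skip the final period in chat.
    return out.rstrip(".") if rng.random() < 0.5 else out

print(humanize("Yes I will definitely see you tomorrow because it was very fun", seed=1))
```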

Other human biases against AI include the assumptions that bots aren't aware of current events, that they can't engage with philosophical or ethical questions, that they are polite to a fault, and that they refuse to answer certain sensitive questions. Interestingly, even within the two-minute window given per attempt, some users tried to jailbreak the LLM using methods like the DAN prompt.

This not only sheds light on the capabilities of modern LLMs but also reveals some limitations of the Turing Test itself. By phrasing their sentences in tricky ways that bots would misinterpret, some test-takers were able to separate the humans from the bots. That exposes a weakness of the test: it measures only proficiency in natural language, not intelligence more broadly.

The research clearly shows that the test is becoming obsolete, thanks to the advancement of LLMs and to humans' now-exploitable biases about AI. The scientific community, however, has offered other benchmarks that keep pace with modern AI and paint a more comprehensive picture of what it can do.

Thinking Beyond Turing

Gary Marcus, an American psychologist and AI expert, has written in the past about Eugene Goostman, a chatbot widely reported as the first AI to pass the Turing test. Speaking about the importance of the test, he stated, “The real value of the Turing Test comes from the sense of competition it sparks amongst programmers and engineers.”

To this end, Marcus offered his own version, now known as the Marcus test. Put reductively, if an AI can watch an episode of ‘The Simpsons’ and tell the viewer when to laugh, it has passed.

The Lovelace Test 2.0, named after Ada Lovelace, the world’s first computer programmer, takes a different tack. An AI passes the Lovelace test if it can ‘develop a creative artifact from a subset of artistic genres deemed to require human-level intelligence’. Simply put, AI agents that can create human-level art pass. By this logic, Midjourney has already passed, as an artist won an art competition using the image generator.

Many such stand-ins for the Turing test have been created, but they have since given way to benchmarks like François Chollet’s ARC (Abstraction and Reasoning Corpus). Instead of relying on a subjective human judgment of an algorithm’s efficacy, ARC uses abstract reasoning and logic puzzles to measure its capabilities. Even as these methods become more widely adopted, it seems that Turing’s original vision has fallen by the wayside.

While a thinking computer has not yet been created, AI has reached human parity, and in some instances surpassed it, in certain fields. However, a true, fluid, generalised intelligence is still far away. Until then, we must find better ways of gauging not only an algorithm’s effectiveness but also its humanity.


Anirudh VK
