Turing Test is unreliable. The Winograd Schema is obsolete. Coffee is the answer.

Gary Marcus said the Turing test is not a reliable measure of intelligence in machines.

Today, AI can do many things: GPT-3 can produce human-like text, DALL-E can generate the most imaginative images based on textual prompts, and Alexa can turn off lights at your behest, yet we are far from achieving artificial general intelligence (AGI). 

For starters, we still don’t have an agreed-upon definition of AGI. Even the terminology is debated: “There is no such thing as artificial general intelligence because there is no such thing as general intelligence. Human intelligence is very specialised,” said Meta’s chief AI scientist Yann LeCun.

So how do we measure intelligence in machines? And more importantly, how accurate are such tests?

Turing test

Alan Turing proposed the Turing test in a 1950 paper called “Computing Machinery and Intelligence.” He suggested an “imitation game” with two contestants: a human and a computer. A judge has to decide which of the two contestants is the human and which is the machine by asking them a series of questions. The game aims to establish whether the computer is a good simulation of a human and is, therefore, intelligent. At the heart of the Turing test is the question: “Are there imaginable computers which would do well in the imitation game?”

While no machine has passed the Turing test yet, a few have come close. In 2014, a program named Eugene Goostman convinced a third of a panel of judges that it was a 13-year-old boy from Ukraine.

“Our main idea was that he can claim that he knows anything, but his age also makes it perfectly reasonable that he doesn’t know everything,” said Veselov, one of Goostman’s programmers. But Goostman’s feat owed more to imitation and diversion than to real intelligence. Eugene either avoided certain topics completely or deflected when presented with a question it had no answer to. For instance, when asked if he plays many instruments, Eugene said, ‘I’m tone-deaf, but my guinea pig likes to squeal Beethoven’s Ode to Joy every morning. I suspect our neighbours want to cut his throat. Could you tell me about your job, by the way?’ The program could not solve logical problems the way a real 13-year-old could.

“It would respond with wisecracks to evade revealing its limitations, and to the untrained eye, it was fairly convincing. All that tells us is that human beings think that machines that can talk are intelligent, but it turned out to be untrue.”

Gary Marcus

Marcus said the Turing test is not a reliable measure of intelligence because humans are susceptible to being fooled, and machines can be evasive. Philosopher John Searle’s Chinese Room Argument makes a related point: programming a digital computer may make it appear to understand language, but it cannot produce real understanding. Even if a computer can manipulate symbols and provide sensible responses, it can’t be said to be truly “conscious” because it doesn’t really understand what the symbols mean.


The Winograd schema

Hector Levesque, a computer scientist at the University of Toronto, proposed the Winograd schema challenge in 2011. Levesque designed it as an improvement on the Turing test. The test consists of multiple-choice questions called Winograd schemas.

Winograd schemas are named after Terry Winograd, professor of computer science at Stanford University. A schema is a pair of sentences whose intended meaning can be flipped by changing just one word. The sentences generally involve ambiguous pronouns or possessives.

“The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.” Swapping the bracketed verb changes who “they” refers to, and the task for the system is to pick the right referent in each version. If the system has common sense, the answer is obvious enough. For instance, the system could be asked ‘who feared violence’ and would have to choose between the city councilmen and the demonstrators.

Human beings can easily answer this question. But computers are still struggling to make such connections. In the book ‘The Myth of Artificial Intelligence’, AI researcher Erik J Larson said linguistic puzzles that humans easily understand are still beyond the comprehension of computers. For example, even single-sentence Winograd schemas trip up machines.

In a test, Gary Marcus asked a Winograd-Levesque-inspired question: “Can an alligator run the 100-meter hurdles?” and AI systems struggled to come up with an answer.

According to Levesque, a schema should meet two criteria: it should be simple for humans to solve, and it shouldn’t be Google-hackable. He also explained how the Winograd schema test could be better than a Turing test. “A machine should be able to show us that it is thinking without having to pretend to be somebody,” he wrote in his paper.

“Our WS challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or canned responses.” And, unlike the Turing test, which is scored by a panel of human judges, a Winograd schema test’s grading is completely non-subjective.
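To make the "non-subjective" grading concrete, here is a minimal Python sketch of how a schema and its scoring could be represented. It is an illustration only, not Levesque's implementation: the names WinogradSchema, score and always_first are hypothetical, and always_first is a deliberately naive stand-in for a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class WinogradSchema:
    """One schema: a sentence template plus the answer key for each special word."""
    template: str                 # sentence with a {word} slot for the special word
    candidates: Tuple[str, str]   # the two possible referents of the pronoun
    answers: Dict[str, str]       # special word -> correct referent

    def sentence(self, word: str) -> str:
        return self.template.format(word=word)


# The example from the article, encoded as data.
COUNCILMEN = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {word} violence."),
    candidates=("the city councilmen", "the demonstrators"),
    answers={"feared": "the city councilmen", "advocated": "the demonstrators"},
)

# A resolver takes (sentence, candidates) and returns the referent it picks.
Resolver = Callable[[str, Tuple[str, str]], str]


def score(schemas: Sequence[WinogradSchema], resolver: Resolver) -> float:
    """Fraction of sentence variants where the resolver picks the keyed referent."""
    correct = 0
    total = 0
    for schema in schemas:
        for word, expected in schema.answers.items():
            guess = resolver(schema.sentence(word), schema.candidates)
            correct += int(guess == expected)
            total += 1
    return correct / total


def always_first(sentence: str, candidates: Tuple[str, str]) -> str:
    """Hypothetical baseline with no understanding: always pick the first candidate."""
    return candidates[0]


print(score([COUNCILMEN], always_first))  # 0.5
```

Because each schema comes in two variants with opposite answers, a system that ignores the special word gets exactly half of them right, and no human judge is needed to say so.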

However, in 2022, the test developers published a paper titled ‘The Defeat of the Winograd Schema Challenge’, claiming most of the Winograd Schema Challenge has been overcome. Similarly, a 2021 paper, ‘WinoGrande: An Adversarial Winograd Schema Challenge at Scale’, shows how neural language models have saturated benchmarks like the WSC, with over 90% accuracy. The researchers asked, “Have neural language models successfully acquired commonsense or are we overestimating the true capabilities of machine commonsense?”

Coffee test

Apple co-founder Steve Wozniak suggested the coffee test, whereby a robot would be challenged to enter your home, find the kitchen and brew a cup of coffee. The robot should be able to walk into any kitchen, find the ingredients required and then make a coffee.

According to Wozniak, the day a robot could enter a strange house and make a decent cup of coffee would be the day AI has truly arrived. To crack the coffee test, a robot has to be multi-modal, able to generalise across tasks and orchestrate a series of actions to make a hot cup of coffee. Cheeky as it sounds, the coffee test seems like a plausible test to judge the AGI-ness of machines. 
