IBM, MIT and Harvard have released the DARPA “Common Sense AI” dataset at the ongoing 38th International Conference on Machine Learning (ICML).
The researchers have released AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark for core psychological reasoning consisting of a large dataset (8,400 3D animations) and two machine learning models – BIPaCK and ToMnet-G. The work aims to accelerate the development of AI that exhibits common sense.
Commonsense reasoning, the ability to make acceptable and logical assumptions in daily life, has long been a bottleneck in artificial intelligence and natural language processing.
“Today’s machine learning models can have superhuman performance. It is still unclear if they understand basic principles that drive human reasoning. For machines to successfully be able to have social interaction like humans do among themselves, they need to develop the ability to understand hidden mental states of humans,” said Abhishek Bhandwaldar, Research Engineer, MIT-IBM AI Lab.
“Our work is directed to bridge this gap by proposing a dataset that probes core psychological reasoning concepts. Our dataset is a collection of videos that are similar to the developmental studies but generated at a much larger scale with visual differences. We have also proposed two different machine learning approaches to solve the dataset,” he added.
Research
The research aims to build a machine learning model with the same level of common sense as a young child.
Intuitive psychology is the ability of people to understand and reason about other people’s state of mind. This ability helps us have meaningful social interactions. Machine learning models lack this ability and typically need huge amounts of training data.
The researchers presented a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios to probe key concepts of core intuitive psychology:
- Goal preferences
- Action efficiency
- Unobserved constraints
- Cost-reward trade-offs
The figure below summarises the design of trials in AGENT, which groups trials into four scenarios. All trials have two phases:
- A familiarisation phase showing one or multiple videos of the typical behaviors of a particular agent, and
- A test phase showing a single video of the same agent either in a new physical situation (the Goal Preference, Action Efficiency and Cost-Reward Trade-offs scenarios) or the same video as familiarisation but revealing a portion of the scene previously occluded (Unobserved Constraints).
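The two-phase trial design described above can be sketched as a small data structure. This is a minimal illustration only: the class names, field names and file names are hypothetical and do not reflect the actual file layout of the released dataset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Scenario(Enum):
    """The four scenarios named in the article."""
    GOAL_PREFERENCES = "goal_preferences"
    ACTION_EFFICIENCY = "action_efficiency"
    UNOBSERVED_CONSTRAINTS = "unobserved_constraints"
    COST_REWARD_TRADE_OFFS = "cost_reward_trade_offs"


@dataclass
class Trial:
    """Hypothetical container for one trial: familiarisation phase plus test phase."""
    scenario: Scenario
    familiarisation_videos: List[str]  # one or multiple videos of the agent's typical behaviour
    test_video: str                    # a single video of the same agent

    def __post_init__(self) -> None:
        # Every trial shows at least one familiarisation video.
        if not self.familiarisation_videos:
            raise ValueError("a trial needs at least one familiarisation video")

    @property
    def test_reveals_occluded_region(self) -> bool:
        # Only Unobserved Constraints replays the familiarisation video with a
        # previously occluded part of the scene revealed; the other three
        # scenarios place the agent in a new physical situation.
        return self.scenario is Scenario.UNOBSERVED_CONSTRAINTS


# Placeholder file names, purely for illustration.
example = Trial(Scenario.UNOBSERVED_CONSTRAINTS, ["fam_0.mp4"], "test_0.mp4")
```

Keeping the scenario as an enum makes the one structural difference between scenarios (new situation versus revealed occlusion) explicit in a single place.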
AGENT contains 8,400 videos. Each video lasts from 5.6 s to 25.2 s, with a frame rate of 35 fps. “With these videos, we constructed 3360 trials in total, divided into 1920 training trials, 480 validation trials, and 960 testing trials (or 480 pairs of expected and surprising testing trials, where each pair shares the same familiarization video(s)). All training and validation trials only contain expected test videos,” the researchers said.
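The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the article; the variable names are illustrative.

```python
# Trial counts as quoted by the researchers.
TOTAL_TRIALS = 3360
SPLITS = {"train": 1920, "validation": 480, "test": 960}

# The three splits account for every trial.
assert sum(SPLITS.values()) == TOTAL_TRIALS

# Test trials come in expected/surprising pairs.
test_pairs = SPLITS["test"] // 2

# Train : validation : test simplifies to 4 : 1 : 2.
unit = SPLITS["validation"]
ratio = tuple(SPLITS[name] // unit for name in ("train", "validation", "test"))

# Frame counts implied by the stated durations at 35 fps.
FPS = 35
min_frames = round(5.6 * FPS)
max_frames = round(25.2 * FPS)

print(test_pairs, ratio, min_frames, max_frames)
```

The split works out to a 4:1:2 ratio, and the stated durations correspond to between 196 and 882 frames per video.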
The two machine learning approaches introduced at ICML draw on methods from human psychology research. They serve as strong baselines: BIPaCK is built on Bayesian inverse planning, and ToMnet-G on a Theory of Mind neural network.
For the proposed tasks in the benchmark, researchers built two baseline models – BIPaCK and ToMnet-G – based on existing approaches, and compared their performance on AGENT to human performance. “Overall, we find that BIPaCK achieves a better performance than ToMnet-G, especially in tests of strong generalization,” reads the paper.
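The article does not describe how model outputs are scored. One plausible way to use the expected/surprising test pairs is to ask a model for a surprise rating on each video and count a pair as correct when the surprising video is rated higher. The sketch below illustrates that idea only; the function name, the rating scale and the sample ratings are assumptions, not the paper's published metric or results.

```python
from typing import Sequence, Tuple


def paired_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the surprising video is rated as more surprising.

    Each pair is (rating_for_expected_video, rating_for_surprising_video).
    Ties are counted as incorrect (an assumption made for this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for expected, surprising in pairs if surprising > expected)
    return correct / len(pairs)


# Made-up ratings for four pairs, purely for illustration.
ratings = [(0.1, 0.9), (0.4, 0.3), (0.2, 0.7), (0.5, 0.5)]
accuracy = paired_accuracy(ratings)
print(accuracy)  # 2 of 4 pairs are ordered correctly -> 0.5
```

Comparing ratings within a pair, rather than against a fixed threshold, means the score does not depend on how a particular model or human rater calibrates the rating scale.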
This work was supported by the DARPA Machine Common Sense program, MIT-IBM AI Lab, and NSF STC award CCF-1231216.
Wrapping up
In a separate paper titled ‘CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning’, researchers presented a constrained text generation task, COMMONGEN, along with a benchmark dataset, to explicitly test machines' ability to perform generative commonsense reasoning.
“Our extensive experiments systematically examine recent pre-trained language generation models (e.g., UniLM, BART, T5) on the task, and find that their performance is still far from humans, generating grammatically sound yet realistically implausible sentences,” concluded the research.