IBM, MIT & Harvard Release Dataset & ML Models For Common Sense

Our work is directed to bridge this gap by proposing a dataset that probes core psychological reasoning concepts.

IBM, MIT and Harvard have released the DARPA “Common Sense AI” dataset at the ongoing 38th International Conference on Machine Learning (ICML).

The researchers have released AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark for core psychological reasoning consisting of a large dataset (8,400 3D animations) and two machine learning models – BIPaCK and ToMnet-G. The research aims to accelerate the development of AI that manifests common sense.

Commonsense reasoning, the ability to make acceptable and logical assumptions in daily life, has long been a bottleneck in artificial intelligence and natural language processing.

“Today’s machine learning models can have superhuman performance. It is still unclear if they understand basic principles that drive human reasoning. For machines to successfully be able to have social interaction like humans do among themselves, they need to develop the ability to understand hidden mental states of humans,” said Abhishek Bhandwaldar, Research Engineer, MIT-IBM AI Lab.

“Our work is directed to bridge this gap by proposing a dataset that probes core psychological reasoning concepts. Our dataset is a collection of videos that are similar to the developmental studies but generated at a much larger scale with visual differences. We have also proposed two different machine learning approaches to solve the dataset,” he added.


The research aims to build a machine learning model with the same level of common sense as a young child.

Intuitive psychology is the ability of people to understand and reason about other people’s state of mind. This ability helps us have meaningful social interactions. ML algorithms lack this ability and typically need huge amounts of training data.

The researchers presented a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios to probe key concepts of core intuitive psychology: 

  • Goal preferences
  • Action efficiency
  • Unobserved constraints
  • Cost-reward trade-offs

The figure below summarises the design of trials in AGENT, which groups trials into four scenarios. All trials have two phases:

  • A familiarisation phase showing one or multiple videos of the typical behaviors of a particular agent, and
  • A test phase showing a single video of the same agent either in a new physical situation (the Goal Preference, Action Efficiency and Cost-Reward Trade-offs scenarios) or the same video as familiarisation but revealing a portion of the scene previously occluded (Unobserved Constraints).
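As a rough illustration of the two-phase trial design described above, a trial could be represented as in the following sketch. This is not the format used by the AGENT release: the class names, field names and file names are hypothetical and chosen only to mirror the four scenarios and the familiarisation/test split.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Scenario(Enum):
    """The four AGENT scenarios named in the article."""
    GOAL_PREFERENCES = "goal_preferences"
    ACTION_EFFICIENCY = "action_efficiency"
    UNOBSERVED_CONSTRAINTS = "unobserved_constraints"
    COST_REWARD_TRADEOFFS = "cost_reward_tradeoffs"


@dataclass
class Trial:
    """Hypothetical container for one trial (not the official schema)."""
    scenario: Scenario
    familiarisation_videos: List[str]  # one or more videos of typical behaviour
    test_video: str                    # a single test-phase video

    def __post_init__(self) -> None:
        # Every trial shows at least one familiarisation video.
        if not self.familiarisation_videos:
            raise ValueError("a trial needs at least one familiarisation video")

    def all_videos(self) -> List[str]:
        """Videos in presentation order: familiarisation first, then test."""
        return [*self.familiarisation_videos, self.test_video]


# Usage with made-up file names.
trial = Trial(
    scenario=Scenario.GOAL_PREFERENCES,
    familiarisation_videos=["fam_0.mp4", "fam_1.mp4"],
    test_video="test.mp4",
)
print(trial.all_videos())
```

Keeping the familiarisation videos as a list reflects the "one or multiple videos" wording, while the single test video stays a separate field.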

AGENT contains 8,400 videos. Each video lasts from 5.6 s to 25.2 s, with a frame rate of 35 fps. “With these videos, we constructed 3360 trials in total, divided into 1920 training trials, 480 validation trials, and 960 testing trials (or 480 pairs of expected and surprising testing trials, where each pair shares the same familiarization video(s)). All training and validation trials only contain expected test videos,” the researchers said.
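The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the quote; the variable and function names are illustrative, and the frame counts assume each video runs at exactly 35 fps for its whole duration.

```python
FPS = 35                               # frame rate quoted in the article
MIN_SECONDS, MAX_SECONDS = 5.6, 25.2   # shortest and longest video

# Trial counts per split, as quoted by the researchers.
SPLITS = {"train": 1920, "validation": 480, "test": 960}


def summarise() -> dict:
    """Derive totals and frame counts from the quoted figures."""
    return {
        "total_trials": sum(SPLITS.values()),
        # Test trials come in expected/surprising pairs.
        "test_pairs": SPLITS["test"] // 2,
        # Assumes a constant frame rate; rounded to whole frames.
        "min_frames": round(MIN_SECONDS * FPS),
        "max_frames": round(MAX_SECONDS * FPS),
    }


summary = summarise()
print(summary["total_trials"])  # 3360
print(summary["test_pairs"])    # 480
print(summary["min_frames"], summary["max_frames"])  # 196 882
```

The three splits sum to the 3360 trials stated in the quote, and the 960 testing trials halve into the 480 expected/surprising pairs.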

The two machine learning approaches introduced at ICML draw on traditional methods from human psychology to train and evaluate AI models. The researchers compared two strong baselines: one built on Bayesian inverse planning and one on a Theory of Mind neural network.

For the proposed tasks in the benchmark, researchers built two baseline models – BIPaCK and ToMnet-G – based on existing approaches, and compared their performance on AGENT to human performance. “Overall, we find that BIPaCK achieves a better performance than ToMnet-G, especially in tests of strong generalization,” reads the paper.

This work was supported by the DARPA Machine Common Sense program, MIT-IBM AI Lab, and NSF STC award CCF-1231216.

Wrapping up

In a separate paper titled ‘CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning’, researchers presented a constrained text generation task, COMMONGEN, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning.

“Our extensive experiments systematically examine recent pre-trained language generation models (e.g., UniLM, BART, T5) on the task, and find that their performance is still far from humans, generating grammatically sound yet realistically implausible sentences,” concluded the research.

Kumar Gandharv
Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech Journalist at AIM. A keen observer of National and IR-related news.
