Researchers from IBM, MIT, and Harvard have introduced a new benchmark called AGENT to evaluate an AI model’s core psychological reasoning ability. The team presented the paper at ICML 2021 as part of their work with the US Department of Defense’s Defense Advanced Research Projects Agency (DARPA).
AGENT is part of DARPA’s Machine Common Sense (MCS) program, which began in 2019 and attempts to construct machine models using approaches from developmental psychology to explore whether AI can “learn” and reason similarly to how we teach human infants.
Analytics India Magazine got in touch with two of the researchers on the project to understand more. Abhishek Bhandwaldar, Research Engineer at the MIT-IBM Watson AI Lab, and Tianmin Shu, Postdoctoral Associate at MIT, talked about the idea and its applications.
In conversation:
AIM: What is the project all about?
Researchers: Our project is part of the DARPA Machine Common Sense program. The project aims to create a system capable of the common sense shown by a young infant. Developmental studies have shown that infants can infer others’ mental states (e.g., goals and desires) from their actions, an ability known as intuitive psychology. This has led to increasing interest in building socially intelligent machines.
Still, there has not been a systematic evaluation of whether and when machines can understand intuitive psychology. Our work takes inspiration from infant cognition studies, where researchers have designed experiments to probe young children’s understanding of intuitive psychology, and uses them to create a benchmark for evaluating machines’ understanding of it.
AIM: Can you share some of its present and future applications?
Researchers: Our project is meant to serve as one of the tests for evaluating a machine learning model’s understanding of intuitive psychology. In addition, our dataset can be used in combination with other datasets to assess a model’s understanding of the underlying principles of the physical world. Finally, our work is part of a larger effort to build a system that can learn things from scratch in a way similar to how an infant learns.
AIM: What are some of the challenges your team had to face during the project?
Researchers: To our knowledge, this is the first dataset of its kind developed specifically for machine learning models (although other teams were pursuing parallel work). This creates a unique challenge, as today’s machine learning models, deep learning models in particular, are data-hungry. That, combined with their tendency to pick up unwanted biases, makes it harder to design a dataset that will serve as a standard benchmark.
We use a combination of procedural generation and visual diversity to tackle some of these issues. Our proposed dataset also creates some unique challenges from a modelling perspective. In our experiments, we compared two different approaches to solving this dataset. The machine theory-of-mind approach, and in general models of that kind, have issues with generalisation. On the other hand, the Bayesian inverse planning model, which has built-in representations of planning, objects, and physics, achieves strong generalisation results.
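For context, Bayesian inverse planning infers an agent’s hidden goal by asking which goal would best explain the actions observed so far, assuming the agent acts roughly rationally. The Python sketch below is purely illustrative and is not the team’s model, which additionally builds in physics and planning; the candidate goal locations, the Boltzmann-rational step likelihood, and the rationality parameter beta are all assumptions made for this example.

import numpy as np

# Illustrative sketch of Bayesian inverse planning for goal inference.
# Hypothetical setup: an agent moves on a 2D plane toward one of two
# candidate goals; we infer a posterior over goals from its path.

GOALS = {"red_ball": np.array([5.0, 0.0]),
         "blue_ball": np.array([0.0, 5.0])}
BETA = 2.0  # assumed rationality: higher = agent treated as more efficient

def step_likelihood(pos, next_pos, goal, beta=BETA):
    """Boltzmann-rational likelihood of one step given a goal: steps that
    reduce the distance to the goal are exponentially more probable."""
    progress = np.linalg.norm(pos - goal) - np.linalg.norm(next_pos - goal)
    return np.exp(beta * progress)

def goal_posterior(trajectory, goals=GOALS):
    """P(goal | trajectory) ∝ P(trajectory | goal) * P(goal),
    with a uniform prior over the candidate goals."""
    log_post = {g: 0.0 for g in goals}
    for pos, next_pos in zip(trajectory[:-1], trajectory[1:]):
        for g, loc in goals.items():
            log_post[g] += np.log(step_likelihood(pos, next_pos, loc))
    z = np.logaddexp.reduce(list(log_post.values()))  # normalise
    return {g: float(np.exp(v - z)) for g, v in log_post.items()}

# An agent walking rightward becomes increasingly likely to want the red ball.
path = [np.array([0.0, 0.0]), np.array([1.0, 0.2]),
        np.array([2.0, 0.1]), np.array([3.0, 0.0])]
print(goal_posterior(path))

The built-in representations the researchers mention play the role of the hand-coded likelihood here: because the model knows what rational goal-directed behaviour looks like, it can generalise to new scenes without learning that structure from data.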
AIM: What makes this project truly unique?
Researchers: The DARPA Machine Common Sense program is a multi-year effort and presents some unique challenges. These kinds of challenges often demand expertise from a variety of fields. For our work, we had a team of scientists, engineers, and psychologists from IBM, MIT, and Harvard working together to build this benchmark.