Stanford Brings Out BEHAVIOR Benchmark For 100 Everyday Household Tasks

BEHAVIOR is a benchmark for embodied AI with 100 everyday activities

A multidisciplinary team of researchers at Stanford University has released BEHAVIOR (Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments), a benchmark for embodied AI comprising 100 everyday activities, such as washing dishes, picking up toys, and cleaning floors, performed in simulation. The current version of BEHAVIOR is publicly available at

In creating this benchmark, the team, led by computer scientist and Stanford Institute for Human-Centered AI co-director Fei-Fei Li and including experts from computer science, psychology, and neuroscience, has established a “North Star”: a visual reference point for gauging the success of future AI solutions. The benchmark could be used to develop and train robotic assistants in virtual environments before they are moved into real ones, a paradigm known as “sim to real.”

What is Embodied AI?

Scientists have long wanted robots that can help humans with daily (yet complex) tasks. The researchers say that even at that level of sophistication, a robot doing these tasks must be able to perceive, reason, and operate with full awareness of its own physical dimensions and capabilities, as well as of the objects surrounding it. This combination of physical and situational awareness is called embodied AI.

According to the paper, titled “BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments”, embodied AI has already made progress in areas such as visual navigation, interactive Q&A, and instruction following, among others. But to develop artificial agents that can eventually perform and assist in daily tasks with human-level flexibility, a comprehensive benchmark with more realistic, diverse, and complex activities is needed.

Complex for Robots

On the surface, this may not seem complicated: the robots only have to be trained to do basic tasks that human beings do very easily. In reality, these tasks are complex for a robot.

The researchers give the example of cleaning a countertop. The robot has to:

  • Perceive and understand what a countertop is
  • Know where to find it
  • Understand that it needs cleaning and assess the counter’s physical dimensions
  • Know which products are best for cleaning the countertop
  • Coordinate its motions to reach the countertop
  • Determine the best course of action for cleaning the counter. While this might be a minor procedure for humans, it is complex for robots: the robot has to understand which materials are soakable and then decide whether the countertop is actually clean.

Although much progress has been made, the paper says that three major issues have prevented existing benchmarks from accommodating more realistic, diverse, and complex activities. These are:

  • Identifying and defining meaningful activities for benchmarking
  • Developing simulated environments that support such activities
  • Defining success and objective metrics to evaluate performance.

How is BEHAVIOR different?

The paper says that BEHAVIOR addresses the three issues by:

  • Introducing the BEHAVIOR Domain Definition Language (BDDL), a representation adapted from predicate logic that maps simulated states to semantic symbols. It allows the team to define the 100 activities as initial and goal conditions, and it enables the generation of potentially infinite initial states and solutions for achieving the goal states.
  • Facilitating the realization of the benchmark by listing environment-agnostic functional requirements for realistic simulation.
  • Providing a comprehensive set of metrics to evaluate agent performance in terms of success and efficiency. To make evaluation comparable across diverse activities, scenes, and instances, the team proposes a set of metrics relative to demonstrated human performance on each activity and provides a large-scale dataset of 500 human demonstrations (758.5 min) in virtual reality.
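
To make the first and third points more concrete, here is a minimal Python sketch. It is not BDDL syntax and not code from the BEHAVIOR release. The `Condition` and `Activity` classes, the predicate names (`dusty`, `on_top`), the object names, and the two scoring functions are all hypothetical, chosen only to illustrate how an activity written as initial and goal conditions can be checked against a symbolic state, and how an agent's result might be expressed relative to a human demonstration. The paper's actual metric definitions may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# A symbolic state is a set of ground facts, e.g. ("dusty", "countertop").
# Predicate and object names are illustrative assumptions, not BDDL vocabulary.
Fact = Tuple[str, ...]
State = FrozenSet[Fact]


@dataclass(frozen=True)
class Condition:
    """A fact that must be present (positive) or absent (negative) in a state."""
    fact: Fact
    positive: bool = True

    def holds(self, state: State) -> bool:
        return (self.fact in state) == self.positive


@dataclass(frozen=True)
class Activity:
    """An activity described only by its initial and goal conditions."""
    name: str
    initial: Tuple[Condition, ...]
    goal: Tuple[Condition, ...]


def goal_fraction(activity: Activity, state: State) -> float:
    """Fraction of goal conditions satisfied in the given state."""
    if not activity.goal:
        raise ValueError("activity has no goal conditions")
    satisfied = sum(cond.holds(state) for cond in activity.goal)
    return satisfied / len(activity.goal)


def is_success(activity: Activity, state: State) -> bool:
    """Success means every goal condition holds."""
    return all(cond.holds(state) for cond in activity.goal)


def relative_efficiency(agent_cost: float, human_cost: float) -> float:
    """Agent cost (e.g. time) divided by the human demonstration's cost.

    1.0 means the agent matched the human; 2.0 means it took twice as long.
    This ratio is an assumed form of a human-relative metric.
    """
    if human_cost <= 0:
        raise ValueError("human_cost must be positive")
    return agent_cost / human_cost


# Hypothetical activity: the counter starts dusty and must end up not dusty,
# with the rag put away in the sink.
clean_countertop = Activity(
    name="clean_countertop",
    initial=(
        Condition(("dusty", "countertop")),
        Condition(("on_top", "rag", "floor")),
    ),
    goal=(
        Condition(("dusty", "countertop"), positive=False),
        Condition(("on_top", "rag", "sink")),
    ),
)

start: State = frozenset({("dusty", "countertop"), ("on_top", "rag", "floor")})
half_done: State = frozenset({("dusty", "countertop"), ("on_top", "rag", "sink")})
done: State = frozenset({("on_top", "rag", "sink")})

# Average length of a demonstration, from the figures quoted in the article.
MEAN_DEMO_MINUTES = 758.5 / 500
```

In this sketch, `half_done` satisfies one of the two goal conditions, so `goal_fraction` returns 0.5 while `is_success` is false; `done` satisfies both. Because the goal is stated over symbols rather than over one fixed sequence of actions, any trajectory that ends in a state satisfying the goal counts as a solution, which is the property that lets a single definition cover many initial states. The quoted dataset figures also imply an average of 758.5 / 500 = 1.517 minutes, or about 91 seconds, per human demonstration.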

Future Moves

The research team aims to provide initial solutions to the benchmark and plans to extend it to tasks that are not currently benchmarked. It says that this will require contributions from diverse domains: robotics, computer vision, computer graphics, and cognitive science.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at
