
How To Choose The Right Machine Learning Model With Off-Policy Classification

Representational image via Boston Dynamics

Hard-coding a robot to perform even mundane manual jobs, however poorly, takes a lot of computational heavy lifting. Making a robot perform decently in unstructured, real-world situations requires ingenious assumptions and constraints.

Asking a robot to run, do a cartwheel or bowl a yorker would have sounded like a chapter from a sci-fi novel a decade ago. Now, with advances in hardware acceleration and the optimisation of machine learning algorithms, techniques like reinforcement learning are being put to practical use every day.

However, the existing techniques for evaluating a trained model against ground truth on a physical robot are inefficient.

Off-policy evaluation is a good candidate for selectively testing models and identifying those that suit the job. When the resources available to gauge a robot’s performance on real hardware are scanty, off-policy reinforcement learning methods let practitioners rank candidate agents using previously collected data.

An agent can be thought of as the unit cell of reinforcement learning. It receives rewards from the environment and is optimised, through learning algorithms, to maximise the total reward it collects while completing the task. For example, when a robotic hand moves a chess piece or performs a welding operation on an automobile, it is the agent that decides which motors to drive to move the arm.
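To make this loop concrete, here is a minimal, hypothetical sketch of the agent-environment interaction described above; the environment interface and the agent's choose_action policy are illustrative stand-ins, not part of the Google AI work.

# Minimal sketch of the reinforcement learning loop (hypothetical interfaces).
# The agent picks actions, the environment returns rewards, and the agent is
# optimised to maximise the total reward collected over an episode.
def run_episode(env, agent, max_steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)      # e.g. motor commands for the arm
        state, reward, done = env.step(action)   # new state and reward from the environment
        total_reward += reward
        if done:                                 # the episode ends in success or failure
            break
    return total_reward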

In order to develop a foolproof, inexpensive way of testing the performance of robotic systems, Google AI researchers propose a new off-policy evaluation method called off-policy classification (OPC). It evaluates the performance of agents from past data by treating evaluation as a classification problem, in which actions are labelled as either potentially leading to success or guaranteed to result in failure.

How OPC Works

via Google AI

An example of how simulated experience can differ from real-world experience: simulated images (left) have much less visual complexity than real-world images (right).

A robot can be trained in simulation, but evaluating it has traditionally required a real robot. With off-policy classification, evaluation is done using available real-world data, and the learned model is then transferred to a real robot.

Off-policy RL: an agent is trained using a combination of data collected by other agents (off-policy data) and data it collects itself, to learn generalisable skills like robotic walking and grasping.

Fully off-policy RL: is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot.
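As a rough illustration of the distinction, a fully off-policy learner never touches the robot during training and only replays previously logged transitions. The dataset format and the update function below are hypothetical placeholders, shown only as a sketch.

import random

# Hypothetical sketch of fully off-policy training from a fixed log of
# transitions collected earlier by other agents; no new robot interaction
# is required. Each transition is (state, action, reward, next_state, done).
def train_fully_off_policy(q_network, logged_transitions, update_fn,
                           num_steps=1000, batch_size=32):
    for _ in range(num_steps):
        batch = random.sample(logged_transitions, batch_size)  # replay old experience only
        q_network = update_fn(q_network, batch)                # e.g. a Q-learning style update
    return q_network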

OPC relies on two assumptions: 

  1. No randomness is involved in how states change
  2. The agent either succeeds or fails at the end of each trial

The “success or failure” assumption is natural for many tasks, such as picking up an object, solving a maze, winning a game, and so on.

OPC utilises a Q-function, learned with a Q-learning algorithm, that estimates the future total reward the agent will receive if it takes a given action from the current state.

Q-learning is a model-free reinforcement learning algorithm, which tells an agent what action to take under what circumstances. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations.
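For reference, the core Q-learning update, written here in its simple tabular textbook form rather than the deep-network variants typically used for robotic grasping, looks roughly like this:

# Tabular Q-learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# Generic textbook sketch, not the specific implementation from the OPC work.
# Q is a dict of dicts mapping state -> {action: estimated value}.
def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    best_next = 0.0 if done else max(Q[next_state].values())  # best value of the next state
    td_target = reward + gamma * best_next                    # estimated future total reward
    Q[state][action] += alpha * (td_target - Q[state][action])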

via Google AI

A diagram of real-world model development. Assuming 10 models can be evaluated per day, without off-policy evaluation one would need 100 times as many days to evaluate these models.

The researchers also leveraged techniques from semi-supervised learning, in particular positive-unlabeled learning, to get an estimate of classification accuracy from partially labelled data. This estimated accuracy is the OPC score.
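As a rough, simplified illustration of what such a score could look like, the sketch below assumes that Q-values for state-action pairs drawn from successful episodes are treated as positives, while all logged pairs form the partially labelled pool; the fixed threshold and the soft variant are simplifications for illustration, not the exact estimators from the paper.

import numpy as np

# Hypothetical, simplified sketch of OPC-style scores. A Q-function that ranks
# actions from successful episodes above the rest of the logged data gets a
# higher score.
# q_pos: Q-values of state-action pairs from successful episodes (positives).
# q_all: Q-values of all logged state-action pairs (positives mixed with unlabelled).
def soft_opc_score(q_pos, q_all):
    # Soft variant: mean Q over positives minus mean Q over the whole dataset.
    return float(np.mean(q_pos) - np.mean(q_all))

def opc_score(q_pos, q_all, threshold=0.5):
    # Thresholded variant: an accuracy-like estimate in the positive-unlabelled
    # spirit, rewarding positives classified as feasible and penalising how
    # often the classifier fires on the rest of the data.
    q_pos, q_all = np.asarray(q_pos), np.asarray(q_all)
    return float(np.mean(q_pos > threshold) - np.mean(q_all > threshold))

In such a workflow, the models with the highest scores on logged real-world data would be the ones sent to the physical robot for a final check.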

Conclusion

The results show that OPC is good at predicting generalisation performance across several scenarios critical to robotics. Effective off-policy evaluation in real-world reinforcement learning provides an alternative to expensive real-world evaluations during algorithm development.

Though the results of this OPE framework look promising, it does come with a flaw. Fully off-policy learning presupposes an off-policy evaluation method that accurately ranks performance from old data, but the agents that collected those past experiences may act very differently from newly learned agents, which makes it hard to get good estimates of performance.

Promising directions for future work include developing a variant of this OPE method that is not restricted to binary (success or failure) reward tasks, and extending the analysis to stochastic tasks.

The authors believe that even in the binary setting, this method can provide a practical pipeline for evaluating transfer learning and off-policy reinforcement learning algorithms.

For further reading on off-policy reinforcement learning, check here.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.