What Is Model-Free Reinforcement Learning?

“Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.”

Sutton & Barto, Reinforcement Learning: An Introduction

In the context of reinforcement learning (RL), a model is anything the agent can use to make inferences about how the environment will behave. For example, given a state and an action, the model might predict the resultant next state and next reward.

An RL environment can be described by a Markov decision process (MDP), which consists of a set of states, a set of actions, and a reward function; the agent's goal is to maximise the cumulative reward it collects over time. The agent is the basic decision-making unit of reinforcement learning: it receives rewards from the environment and is optimised, through learning algorithms, to maximise the reward it collects while completing its task. For example, when a robotic hand moves a chess piece or performs a welding operation on a car body, the agent is the controller that drives the specific motors to move the arm.
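The agent–environment loop described above can be sketched with a toy MDP. This is a minimal illustration only; the state names, actions, transitions, and rewards below are invented for the example.

```python
import random

# A toy MDP: for each (state, action) pair, a deterministic
# (next_state, reward) outcome. All names here are hypothetical.
transitions = {
    ("cool", "fast"): ("hot", 2.0),
    ("cool", "slow"): ("cool", 1.0),
    ("hot", "fast"): ("hot", -10.0),
    ("hot", "slow"): ("cool", 1.0),
}

def step(state, action):
    """Environment dynamics: return (next_state, reward)."""
    return transitions[(state, action)]

# An agent interacting with the MDP, accumulating reward.
state, total_reward = "cool", 0.0
for _ in range(10):
    action = random.choice(["fast", "slow"])  # a random policy
    state, reward = step(state, action)
    total_reward += reward
```

A real environment would usually be stochastic, in which case `step` would sample the next state and reward from a distribution rather than look them up.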

Reinforcement learning agents have the objective of maximising rewards. This brings us back again to the final element of RL systems—models. 

Models are used for planning: deciding on a course of action by considering possible future situations before they are actually experienced. Methods that solve reinforcement learning problems using models and planning are called model-based methods, as opposed to simpler model-free methods, which are explicitly trial-and-error learners.

To reiterate, whether an agent is model-based or model-free comes down to whether it plans with a model of the environment or learns purely from the consequences of its own actions. A model can be queried in two ways: a sample model returns a single simulated next state and reward, while a distribution model can return the expected next reward (or the full distribution of outcomes). Think of a computer playing a strategy game like chess or Go. The rules can be hardwired into a model, or the computer can learn on the go. The latter case embodies the holy grail of AI.
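The two ways of querying a model can be sketched as follows. This is a hypothetical interface for a single one-armed-bandit-like state, not code from any particular library.

```python
import random

# Hypothetical outcome distribution for one action: it pays 10
# with probability 0.1, otherwise 0.
outcomes = [(10.0, 0.1), (0.0, 0.9)]  # (reward, probability)

def sample_model():
    """Sample model: return one sampled next reward."""
    r = random.random()
    cumulative = 0.0
    for reward, p in outcomes:
        cumulative += p
        if r < cumulative:
            return reward
    return outcomes[-1][0]

def distribution_model():
    """Distribution model: return the expected next reward."""
    return sum(reward * p for reward, p in outcomes)

print(distribution_model())  # expected reward: 10 * 0.1 + 0 * 0.9 = 1.0
```

Planning with a distribution model averages over all outcomes in one query; planning with a sample model requires many sampled rollouts to estimate the same quantity.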

The clearest way to understand model-free systems is to contrast them with model-based ones. A model-free system does not try to predict how the environment will respond to its actions. Models have to be reasonably accurate to be useful, so model-free methods can have the advantage over more complex methods when the real bottleneck is the difficulty of constructing a sufficiently accurate model of the environment. Model-free methods are also important building blocks for model-based methods. A model-free strategy relies on stored values for state–action pairs: these action values are estimates of the highest return the agent can expect for each action taken from each state.

  • When the environment of a model-free agent changes the way it reacts to the agent’s actions, the agent has to acquire new experience in the changed environment during which it can update its policy and/or value function.
  • For a model-free agent to change the action its policy specifies for a state, or to change an action value associated with a state, it has to move to that state, act from it, possibly many times, and experience the consequences of its actions.
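The update-from-experience behaviour described in these points is exactly what tabular Q-learning does. Below is a minimal sketch on a hypothetical four-state corridor (reach state 3 from state 0); the environment, actions, and hyperparameters are all invented for illustration.

```python
import random
from collections import defaultdict

# Hypothetical corridor environment: states 0..3, goal is state 3.
def step(state, action):              # action is -1 (left) or +1 (right)
    next_state = max(0, min(3, state + action))
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

actions = [-1, 1]
Q = defaultdict(float)                # stored action values Q[(state, action)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    state = 0
    for _ in range(100):              # cap episode length
        if state == 3:
            break
        # epsilon-greedy: mostly exploit stored values, sometimes explore
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # Model-free update: learn only from the experienced transition
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```

Note that the agent never consults a model of `step`; it only revises the table entries for states it has actually visited, which is why a changed environment forces it to gather fresh experience.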

Many modern reinforcement learning algorithms are model-free, so they are applicable across different environments and can readily react to new and unseen states. In their seminal book on reinforcement learning, Sutton and Barto demonstrate model-free RL using a rat in a maze. Here, the model-free strategy relies on stored action values for all the state–action pairs obtained over many learning trials. To make decisions, the rat just has to select, at each state, the action with the largest action value for that state.

According to the authors, the distinction between model-free and model-based reinforcement learning algorithms is analogous to that between habitual and goal-directed control of learned behaviour. Habits are automatic: behaviour patterns triggered by appropriate stimuli (think reflexes). Goal-directed behaviour, by contrast, is controlled by knowledge of the value of goals and of the relationship between actions and their consequences. “Habits are sometimes said to be controlled by antecedent stimuli, whereas goal-directed behavior is said to be controlled by its consequences,” wrote the authors. That said, RL pioneers like Richard Sutton believe that nothing prevents an agent from using both model-free and model-based algorithms, and there are good reasons for using both.
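The rat's decision rule, picking the action with the largest stored value at each state, is just a table lookup plus an argmax. A minimal sketch, with hypothetical maze-junction values:

```python
# Hypothetical stored action values for one maze junction.
action_values = {
    ("junction_1", "left"): 0.2,
    ("junction_1", "right"): 0.8,
    ("junction_1", "straight"): 0.5,
}

def greedy_action(state, action_values):
    """Select the action with the largest stored value for this state."""
    candidates = {a: v for (s, a), v in action_values.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action("junction_1", action_values))  # -> right
```

This is the "habitual" mode in the analogy: the choice is triggered directly by the current state, with no reasoning about where each turn leads.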

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
