What Is Model-Free Reinforcement Learning?

“Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.”

Sutton & Barto, Reinforcement Learning: An Introduction

In reinforcement learning (RL), a model allows the agent to make inferences about how the environment will behave. For example, given a state and an action, the model might predict the resultant next state and next reward.

An RL environment can be described with a Markov decision process (MDP). It consists of a set of states, a set of actions, a set of rewards, and transition dynamics, and the goal of the agent is to maximise the expected cumulative reward. The agent is the basic decision-making unit of reinforcement learning: it receives rewards from the environment and is optimised, through learning algorithms, to maximise the reward it collects while completing the task. For example, when a robotic hand moves a chess piece or performs a welding operation on an automobile, the agent is the controller that drives the specific motors to move the arm.
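The state–action–reward structure above can be sketched as a tiny tabular MDP. The state names, actions, and rewards below are purely illustrative:

```python
# A toy deterministic MDP: the model maps each (state, action) pair
# to the (next_state, reward) outcome it predicts.
MODEL = {
    ("s0", "right"): ("s1", 0.0),
    ("s0", "left"):  ("s0", 0.0),
    ("s1", "right"): ("s2", 1.0),   # reaching the goal state s2 yields reward 1
    ("s1", "left"):  ("s0", 0.0),
}

def step(state, action):
    """Return the (next_state, reward) the model predicts for this pair."""
    return MODEL[(state, action)]

# Following the rewarding path accumulates a return of 1.0:
state, total_reward = "s0", 0.0
for action in ["right", "right"]:
    state, reward = step(state, action)
    total_reward += reward
```

A model-based agent can query such a table to plan ahead; a model-free agent never builds it and learns from sampled experience instead.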

Reinforcement learning agents have the objective of maximising rewards. This brings us back again to the final element of RL systems—models. 



Models are used for planning: deciding on a course of action by considering possible future situations. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners.

To reiterate, "model-free" versus "model-based" strictly refers to whether or not the agent learns and uses a model of the environment's dynamics. An agent with a model can either draw a single sampled prediction of the next state and reward, or ask the model for the expected outcome over all possibilities. Think of a computer playing a strategy game like chess or Go: the rules can be hardwired into a model, or the computer can learn purely from the consequences of its moves as it goes. The latter case embodies the holy grail of AI.

The clearest way to understand model-free systems is to contrast them with model-based ones. A model-free system does not predict the environment's response to its actions. Models have to be reasonably accurate to be useful, so model-free methods can have advantages over more complex methods when the real bottleneck in solving a problem is the difficulty of constructing a sufficiently accurate environment model. Model-free methods are also important building blocks for model-based methods. A model-free strategy relies on stored values for state–action pairs: these action values are estimates of the highest return the agent can expect for each action taken from each state.
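The stored state–action values described above can be sketched as a simple lookup table. The states, actions, and numbers here are illustrative placeholders, not values from any trained agent:

```python
# A model-free agent's stored action values: one estimate per
# (state, action) pair, learned from experience.
Q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}

def greedy_action(state, actions=("left", "right")):
    """Pick the action with the largest stored value for this state."""
    return max(actions, key=lambda a: Q[(state, a)])
```

Note that acting greedily requires no model at all: the agent never asks what the next state will be, it only compares the values it has stored.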

  • When the environment of a model-free agent changes the way it reacts to the agent’s actions, the agent has to acquire new experience in the changed environment during which it can update its policy and/or value function.
  • For a model-free agent to change the action its policy specifies for a state, or to change an action value associated with a state, it has to move to that state, act from it, possibly many times, and experience the consequences of its actions.
(Image credits: Sutton & Barto, Reinforcement Learning: An Introduction)
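The experience-driven updates described in the bullets above can be sketched with a tabular Q-learning rule, the classic model-free update. The learning rate, discount, and transition used here are illustrative:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9      # learning rate and discount factor (illustrative)
Q = defaultdict(float)       # action values, initialised to zero

def update(state, action, reward, next_state, actions=("left", "right")):
    """Move Q(s, a) toward the observed reward plus the discounted best
    value of the next state -- no environment model is consulted."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# The agent must actually visit the state and experience the consequence,
# possibly many times, before its estimate converges:
for _ in range(3):
    update("s0", "right", 1.0, "terminal")
```

After three identical experiences the estimate for ("s0", "right") has moved from 0 toward the observed reward of 1.0, which is exactly the "acquire new experience to update the value function" behaviour the bullets describe.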

Many modern reinforcement learning algorithms are model-free, so they are applicable in different environments and can readily react to new and unseen states. In their seminal work on reinforcement learning, Sutton and Barto demonstrated model-free RL using a rat in a maze. In this case, the model-free strategy relies on stored action values for all the state–action pairs obtained over many learning trials. To make decisions, the rat just has to select, at each state, the action with the largest action value for that state.

According to Sutton and Barto, the distinction between model-free and model-based reinforcement learning algorithms is analogous to that between habitual and goal-directed control of learned behavioural patterns. Habits are automatic: behaviour patterns triggered by appropriate stimuli (think reflexes). Goal-directed behaviour, by contrast, is controlled by knowledge of the value of goals and the relationship between actions and their consequences. "Habits are sometimes said to be controlled by antecedent stimuli, whereas goal-directed behavior is said to be controlled by its consequences," wrote the authors. That said, RL pioneers like Richard Sutton believe that nothing prevents an agent from using both model-free and model-based algorithms, and that there are good reasons for using both.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

