Interesting But Underrated ML Concepts #3 – Q-learning, PVLV, OOB & Rprop

In this series, we'll look at several underappreciated yet fascinating machine learning concepts.

There are some fascinating machine learning topics that aren’t discussed nearly as frequently as they should be. We’ll look at some of them in this post, including Q-learning, Primary Value Learned Value (PVLV), Out-of-bag error, and Rprop.

Q-learning

Q-learning is a model-free reinforcement learning algorithm for learning the value of a particular action in a given state. It does not require a model of the environment (hence “model-free”) and can handle problems with stochastic transitions and rewards without adaptations. According to Francisco S. Melo, Q-learning finds an optimal policy for any Finite Markov Decision Process (FMDP) by maximising the expected value of the total reward over all successive steps, starting from the current state. Given infinite exploration time and a partly random policy, Q-learning can determine an optimal action-selection policy for any given FMDP. The ‘Q’ in Q-learning stands for quality; here, quality refers to how valuable a specific action is in obtaining a future reward.

As per Chris Gaskett, the typical Q-learning technique (which uses a Q table) covers only discrete action and state spaces, and discretising continuous state and action variables leads to ineffective learning because of the curse of dimensionality. However, several Q-learning adaptations, such as Wire-fitted Neural Network Q-Learning, seek to alleviate this problem.
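
To make the tabular case concrete, here is a minimal Q-learning sketch in Python. The toy environment and the alpha, gamma and epsilon values are illustrative assumptions, not something prescribed by the sources above.

```python
import numpy as np

# Minimal tabular Q-learning sketch; the environment and the hyperparameters
# (alpha, gamma, epsilon) are toy assumptions for illustration.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))            # the Q table
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def step(state, action):
    """Placeholder environment: returns (next_state, reward, done)."""
    next_state = (state + 1) % n_states        # toy deterministic transition
    done = next_state == n_states - 1
    return next_state, 1.0 if done else 0.0, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current Q estimates, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```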

Primary Value Learned Value (PVLV)

The Primary Value Learned Value (PVLV) model offers an account of the reward-predictive firing of dopamine (DA) neurons. As per Randall C. O’Reilly, it uses behavioural and neurological data to simulate Pavlovian conditioning and the firing of midbrain dopaminergic neurons in response to surprising rewards. It is an alternative to the temporal-differences (TD) method and is a component of Leabra.

PVLV is divided into two parts: primary value (PV) and learned value (LV). Primary reward (i.e., an unconditioned stimulus, US) engages the PV system, which learns to anticipate the occurrence of a particular US and thereby suppresses the dopamine burst that would otherwise occur. The LV system learns about conditioned stimuli (CS) that are consistently associated with primary rewards and fires phasic dopamine bursts at the onset of a CS. The PVLV mechanism thus serves as a vital link between the more abstract TD model and the specifics of the underlying neural systems. According to O’Reilly, the PVLV model can account for key components of the DA firing data and makes a number of clear predictions about lesion effects, many of which are compatible with existing data.
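
As a rough illustration of this division of labour, the toy sketch below uses two delta-rule learners, one playing the role of PV and one of LV. It is a caricature written for this article, not the actual PVLV equations (the full Leabra model separates excitatory and inhibitory PV and LV components); the trial structure and the learning rate are assumptions.

```python
# Toy caricature of the PV/LV split (not the actual PVLV equations):
# PV learns to expect the US at reward time, LV learns the value of the CS.
w_pv, w_lv = 0.0, 0.0
lrate = 0.2

for trial in range(50):
    cs, us = 1.0, 1.0          # the CS is always followed by a primary reward (US)
    # PV: prediction error at US onset; as w_pv approaches 1, the error
    # (and the dopamine burst to the now-expected reward) shrinks away.
    pv_delta = us - w_pv * cs
    w_pv += lrate * pv_delta
    # LV: learns the CS-reward association when the primary reward occurs;
    # in PVLV this learned value drives a phasic dopamine burst at CS onset.
    lv_delta = us - w_lv * cs
    w_lv += lrate * lv_delta

print(f"DA response at US onset ~ {pv_delta:.3f}")   # shrinks towards 0
print(f"DA response at CS onset ~ {w_lv:.3f}")       # grows towards 1
```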

Out-of-bag error

Out-of-bag (OOB) error, also known as the out-of-bag estimate, is a technique for measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregation (bagging). Bagging creates the training sample for each base learner by subsampling with replacement. The out-of-bag estimate of prediction performance is obtained by evaluating each observation using only the base learners whose bootstrap samples did not include it. Practitioners commonly use the OOB score to validate a random forest model.

In a study by Silke Janitza and Roman Hornung, the out-of-bag error was shown to overestimate the true prediction error in settings with an equal number of observations from all response classes (balanced samples), small sample sizes, a large number of predictor variables, small correlation between predictors, and weak effects. Commonly cited advantages of the OOB estimate are no data leakage, less variance, a better predictive model, and less computation.
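
For a concrete example, scikit-learn exposes this estimate directly on its random forest implementation. The sketch below is a minimal usage example; the synthetic dataset and the forest settings are illustrative assumptions.

```python
# Minimal example of reading the out-of-bag score from scikit-learn's
# RandomForestClassifier; the synthetic dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True scores each sample using only the trees whose bootstrap
# samples did not contain it, giving a built-in validation estimate.
model = RandomForestClassifier(n_estimators=200, oob_score=True,
                               bootstrap=True, random_state=0)
model.fit(X, y)

print("OOB score:", model.oob_score_)        # accuracy on out-of-bag samples
print("OOB error:", 1.0 - model.oob_score_)
```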

Rprop

As stated by James McCaffrey, Rprop (resilient back propagation) is a neural network training algorithm similar to regular back propagation. As per M. Riedmiller, Rprop is a popular gradient-based technique that computes weight updates based only on the signs of the gradients. It stands for resilient propagation and is useful in a variety of circumstances because it dynamically adapts the step size for each weight separately.

Rprop has two important advantages over back propagation. First, Rprop training is generally faster than back propagation training. Second, Rprop does not require the user to supply any free parameter values, whereas back propagation needs a learning rate (and usually an optional momentum term).
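
The sketch below shows the sign-based update on a toy quadratic objective (closest in spirit to the iRprop- variant). The objective, the step-size constants, and the loop length are illustrative assumptions.

```python
import numpy as np

# Sign-based Rprop sketch (roughly the iRprop- variant) on a toy quadratic;
# eta_plus/eta_minus and the step bounds are illustrative assumptions.
def rprop(grad_fn, w, n_steps=100, eta_plus=1.2, eta_minus=0.5,
          step_init=0.1, step_min=1e-6, step_max=50.0):
    step = np.full_like(w, step_init)          # one step size per weight
    prev_grad = np.zeros_like(w)
    for _ in range(n_steps):
        grad = grad_fn(w)
        sign_change = grad * prev_grad
        # grow the step where the gradient sign is stable, shrink it where it flips
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(sign_change < 0, 0.0, grad)   # skip the update after a sign flip
        w = w - np.sign(grad) * step           # the update uses only the sign of the gradient
        prev_grad = grad
    return w

# Usage: minimise f(w) = sum(w**2), whose gradient is 2 * w
print(rprop(lambda w: 2.0 * w, np.array([3.0, -4.0])))   # values close to zero
```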

Dr. Nivash Jeevanandam

Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.
