OpenAI is reportedly working on a project called Q* (pronounced Q-Star), a model said to be capable of solving unfamiliar math problems.
A few people at OpenAI believe that Q* could be a big step towards achieving artificial general intelligence (AGI). At the same time, the new model is raising concerns among some AI safety researchers over the pace of these advancements, particularly after a demo of the model circulated within OpenAI in recent weeks, as per The Information.
The model was created by OpenAI’s chief scientist Ilya Sutskever, along with top researchers Jakub Pachocki and Szymon Sidor.
Interestingly, this development comes against the backdrop of a recent post on X by Andrej Karpathy, who also happens to be building JARVIS at OpenAI, saying that he has been thinking about centralisation and decentralisation lately.
Karpathy is essentially talking about building AI systems that involve a trade-off between centralisation and decentralisation of decision-making and information. Achieving optimal results requires balancing these two aspects, and Q-learning seems to fit neatly into that equation.
What is Q-Learning?
Experts believe that Q* is built on the principles of Q-learning, a foundational concept in AI, specifically in the area of reinforcement learning. The Q-learning algorithm is categorised as model-free reinforcement learning: it learns the value of an action in a given state without requiring a model of the environment.
The ultimate goal of Q-learning is to find an optimal policy that defines the best action to take in each state, maximising the cumulative reward over time.
Q-learning is based on the notion of a Q-function, also known as the state-action value function. This function takes two inputs, a state and an action, and returns an estimate of the total reward expected when starting from that state, taking that action, and following the optimal policy thereafter.
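For reference, this is captured by the standard textbook Q-learning update rule; the formulation below is the classic one from the reinforcement learning literature, not anything disclosed about Q* itself:

```latex
% Classic Q-learning update. Here \alpha is the learning rate, \gamma the
% discount factor, r the observed reward, and s' the next state.
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Each update nudges the current estimate Q(s, a) towards the observed reward plus the discounted value of the best action available in the next state.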
In simple cases, Q-learning maintains a table (known as the Q-table) where each row represents a state and each column represents an action. The entries in this table are the Q-values, which are updated as the agent learns through exploration and exploitation.
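A minimal Python sketch of such a Q-table and a single update might look like the following; the state and action counts, hyperparameters, and `update` helper are illustrative assumptions, not details of any real system:

```python
import numpy as np

n_states, n_actions = 5, 2            # illustrative toy problem size
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

# Rows are states, columns are actions; entries are the learned Q-values.
Q = np.zeros((n_states, n_actions))

def update(state, action, reward, next_state):
    """One Q-learning step: move Q[state, action] towards the observed
    reward plus the discounted best value of the next state."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```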
A key aspect of Q-learning is balancing exploration (trying new actions) and exploitation (using known information). This is often managed with strategies like ε-greedy, where the agent explores a random action with probability ε and exploits the best-known action with probability 1-ε.
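Continuing the sketch above, an ε-greedy action selector could be written as follows; the value of ε here is an assumed example and would typically be tuned, and often decayed over time, in practice:

```python
import numpy as np

epsilon = 0.1  # exploration probability; an illustrative value

def epsilon_greedy(Q, state, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon (explore),
    otherwise the best-known action for this state (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: random action
    return int(np.argmax(Q[state]))           # exploit: greedy action
```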