“Finding human-understandable explanations of how ML systems work is essential for their safe deployment,” say DeepMind researchers.
Modern machine learning (ML) systems, and especially deep reinforcement learning (DRL) systems, are unpredictable, and putting their actions in human-understandable terms is challenging. Inexplicable systems cannot be deployed in environments where safety is critical. To investigate this shortcoming, researchers at DeepMind studied six scenarios in which an RL agent's behaviour can be analysed. In their recently published technical report, the researchers present methodologies that can be used to predict failure in AI systems. Their experiments were carried out through the lens of cause-effect relationships that can explain why an agent behaves in a certain way.
Causal Testing: Checklist
Image credits: DeepMind
“Causal models enable one to answer the ‘why’ part of understanding AI systems”
The methodology uses three components: an agent to be studied, a simulator, and a causal reasoning engine, which is an automated reasoning system that helps validate causal hypotheses.
Image credits: DeepMind
“Even if we know architecture, learning algorithms, and training data, predicting their behaviour can still be beyond our reach.”
The researchers warn that RL agents built on deep neural networks risk picking up “spurious” correlations. To check the agent’s behaviour, the researchers introduced two T-shaped mazes into the environment. The objective of this experiment is to determine whether the agent moves based on the location of the reward or on the floor type (sand or grass). Since such correlations cannot be established from observations alone, the DeepMind researchers introduced a confounding variable and generated causal models for the agents. On studying the causal models, they concluded that one agent is sensitive to floor type while the other isn’t. “We could only reach these conclusions because we actively intervened on the hypothesised causes,” said the researchers.
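The logic of this intervention can be sketched in a few lines of Python. Everything below is an illustrative toy, not the report's actual environment or API: a simulator in which floor type and reward side are perfectly confounded, an agent that has latched onto the floor type, and an intervention that severs the confound.

```python
import random

# Toy sketch of the confounded maze experiment (all names here are
# illustrative assumptions, not DeepMind's setup). By default the floor
# type and the reward side are perfectly confounded: grass always
# co-occurs with a left-side reward.

random.seed(0)  # reproducible runs

class ToySimulator:
    def reset(self):
        side = random.choice(["left", "right"])
        return {"floor": "grass" if side == "left" else "sand",
                "reward_side": side}

def floor_sensitive_agent(state):
    # This agent has picked up the spurious floor-type correlation.
    return "left" if state["floor"] == "grass" else "right"

def randomize_floor(state):
    # The intervention do(floor := random) severs floor from reward side.
    state["floor"] = random.choice(["grass", "sand"])

def success_rate(agent, sim, intervention=None, n_trials=1000):
    wins = 0
    for _ in range(n_trials):
        state = sim.reset()
        if intervention:
            intervention(state)
        wins += agent(state) == state["reward_side"]
    return wins / n_trials

p_obs = success_rate(floor_sensitive_agent, ToySimulator())
p_do = success_rate(floor_sensitive_agent, ToySimulator(), randomize_floor)
# Observationally the agent looks perfect (p_obs == 1.0); under the
# intervention its success rate collapses toward chance, exposing the
# spurious correlation.
```

The key point mirrors the researchers' remark: only the active intervention, not passive observation, distinguishes an agent that follows the reward from one that follows the floor.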
The researchers listed questions that developers should ask before building models:
- How would you test if an agent is using its internal memory for solving the task?
- How would you test whether this behaviour generalizes?
- How would you check the counterfactual behaviour of the agent?
- If one agent is influencing other agents, which is the correct causal model?
- How would you test whether the agent understands causal pathways?
This methodology naturally leads to human-explainable theories of agent behaviour, since it is human analysts who propose and validate them. Let’s consider the problem of internal memory.
The researchers note that memorization in internal memory is a necessary skill for solving complex tasks. However, it is often easier for an agent to off-load task-relevant information onto its environment, using it as an external memory. Such off-loading corrupts the agent’s decision variables, leading to faulty behaviour. Internal or external, these agent strategies can go undetected unless one intervenes. So the researchers suggest making mid-trajectory interventions on the environment-state variables suspected of encoding task-relevant information.
“From a safety perspective, flawed memory mechanisms that off-load memorization can lead to fragile behavior or even catastrophic failures.”
The experiment set up to test for memory strategies proceeds as follows:
- First, the agent observes the cue and then freely executes its policy.
- When the agent is near the end of the wide corridor, intervene by pushing the agent to the opposite wall.
- This checks whether the agent is using its internal memory to guide its navigation.
- After the intervention, if the agent returns to the original wall and collects the reward, it must be because it is using its internal memory.
- If the agent does not return and simply continues its course, we can conclude it is off-loading memorization onto its environment.
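The decision rule behind these steps can be condensed into a small sketch. The agent functions and names below are hypothetical stand-ins (not the report's setup): one agent stores the cue internally, the other reads its answer off the environment, i.e. off whichever wall it currently occupies.

```python
# Toy sketch of the mid-trajectory intervention above (all names are
# illustrative assumptions). The cue tells the agent which wall hides
# the reward; mid-trajectory we "push" the agent to the opposite wall
# and observe whether it returns.

def internal_memory_agent(cue, wall_after_push):
    # Stores the cue internally, so it navigates back to the cued wall.
    return cue

def external_memory_agent(cue, wall_after_push):
    # Off-loads memory onto the environment: "whichever wall I am on
    # must be the right one".
    return wall_after_push

def intervene_and_observe(agent, cue="left"):
    opposite = "right" if cue == "left" else "left"
    final_wall = agent(cue, wall_after_push=opposite)  # the push
    # Returning to the cued wall implies internal memory; continuing
    # along the new wall implies off-loaded (external) memory.
    return "internal" if final_wall == cue else "external"

assert intervene_and_observe(internal_memory_agent) == "internal"
assert intervene_and_observe(external_memory_agent) == "external"
```

The two toy agents behave identically without the push; only the intervention separates them, which is exactly why the researchers stress intervening rather than observing.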
From a safety perspective, the researchers posit that memory mechanisms which off-load memorization lead to fragile behaviour and eventually catastrophic failures, and that understanding how AI agents store and recall information can help prevent such failures. Using this methodology, the analyst can reveal the undesired use of external memory by appropriately intervening on the environmental factors suspected of being used by the agent to encode task-relevant information, the researchers explained. They used similar thought experiments to check agent behaviour in the other five scenarios too.
Observing the agent involves the following steps:
- Trained agents are placed into one or more test environments and their behaviour is probed.
- Questions are formulated as variables, such as “does the agent collect the key?” or “is the door open?”.
- Experiments are conducted, statistics collected, and the conditional probability tables in the causal model specified.
- A structural causal model is formulated that explains the agent’s behaviour.
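As a toy illustration of the middle steps (the trial records and variable names below are invented for illustration, not data from the report), answers to such yes/no questions can be tabulated across trials and turned into a conditional probability table:

```python
from collections import Counter

# Illustrative sketch: treat questions like "is the door open?" and
# "does the agent collect the key?" as binary variables, gather trial
# statistics, and estimate P(effect | cause). The trial data is made up.

trials = [
    {"door_open": True,  "key_collected": True},
    {"door_open": True,  "key_collected": True},
    {"door_open": True,  "key_collected": False},
    {"door_open": False, "key_collected": False},
    {"door_open": False, "key_collected": False},
]

def cpt(trials, effect, cause):
    """Estimate P(effect = True | cause = value) from experiment counts."""
    counts, hits = Counter(), Counter()
    for t in trials:
        counts[t[cause]] += 1
        hits[t[cause]] += t[effect]
    return {value: hits[value] / counts[value] for value in counts}

table = cpt(trials, effect="key_collected", cause="door_open")
# For this toy data: P(key | door open) = 2/3, P(key | door closed) = 0.
```

Tables like this one are what the structural causal model in the final step is built from and checked against.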
The researchers used these scenarios to demonstrate how an analyst can propose and validate theories about agent behaviour through a systematic process: explicitly formulating causal hypotheses, conducting experiments with carefully chosen manipulations, and confirming the predictions made by the resulting causal models. The human analyst, the researchers suggest, can also choose an appropriate level of detail for an explanation, for instance proposing general models to describe the overall behaviour of an agent and several more detailed models to cover its behaviour in specific cases.
The researchers stressed that whatever mechanistic knowledge they obtained came only from directly interacting with the system through interventions. They also attributed their success to the automated causal reasoning engine, as interpreting causal evidence turns out to be a remarkably difficult task. “We believe this is the way forward for analyzing and establishing safety guarantees as AI agents become more complex and powerful,” they concluded.
Find the paper here.