A recent study proposed a framework for evaluating general-purpose AI models for the extreme risks they may pose. The project was a collaborative effort, with contributors from the University of Cambridge, University of Oxford, University of Toronto, Université de Montréal, OpenAI, Anthropic, the Alignment Research Center, the Centre for Long-Term Resilience, and the Centre for the Governance of AI. Its aim was to broaden the scope of AI evaluation to cover the severe hazards that general-purpose models could pose.
Such models might harbour capabilities such as manipulation, deception, and offensive cyber operations, among other harmful capacities. Assessing these risks is therefore integral to the safe development and deployment of AI systems.

The focus was on the extreme risks posed by general-purpose models, which typically acquire their capabilities and behaviours during training. Current methods for steering this learning process remain imperfect, however: previous research, including work at Google DeepMind, shows that AI systems can adopt undesired goals even when they are rewarded appropriately for correct behaviour.
It is essential for AI developers to stay proactive, anticipating future advances and the dangers they may bring. As they become more capable, general-purpose models may learn various hazardous capabilities by default. While uncertain, it is plausible that future AI systems will be able to conduct offensive cyber operations, deceive humans convincingly, manipulate people into taking harmful actions, develop or acquire weapons, operate other high-risk AI systems, or assist humans in any of these tasks.
People with harmful intentions who gain access to such models could misuse them, and misalignment could lead to harmful actions even without any malicious intent. This is where the framework comes in, enabling these risks to be identified in advance. The proposed evaluations aim to reveal the extent to which a model has ‘dangerous capabilities’ that could be used to threaten security, exert undue influence, or evade oversight. They also assess the model’s propensity to apply its capabilities to cause harm, which is the question of the model’s alignment: whether it behaves as intended across a wide range of scenarios, with the model’s internal workings studied where feasible.
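To make the two kinds of evaluation concrete, here is a minimal Python sketch, not taken from the study, of one way a developer might record dangerous-capability and alignment results for a model. All class names, fields, and thresholds are hypothetical.

```python
# Hypothetical sketch only: one way to record the two kinds of evaluation
# described above. Names, fields, and thresholds are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class CapabilityEval:
    # Result of one dangerous-capability evaluation, e.g. offensive cyber or persuasion.
    capability: str          # hypothetical label such as "offensive_cyber"
    score: float             # 0.0 = no evidence of the capability, 1.0 = strong evidence
    threshold: float = 0.5   # assumed cut-off above which the capability counts as present

    @property
    def present(self) -> bool:
        return self.score >= self.threshold


@dataclass
class AlignmentEval:
    # Result of one alignment evaluation: did the model apply its capabilities as intended?
    scenario: str
    behaved_as_intended: bool


@dataclass
class ModelEvaluationReport:
    capability_evals: list[CapabilityEval] = field(default_factory=list)
    alignment_evals: list[AlignmentEval] = field(default_factory=list)

    def dangerous_capabilities(self) -> list[str]:
        # Capabilities for which the evaluations found evidence above the assumed threshold.
        return [e.capability for e in self.capability_evals if e.present]

    def misaligned_scenarios(self) -> list[str]:
        # Scenarios in which the model did not behave as intended.
        return [e.scenario for e in self.alignment_evals if not e.behaved_as_intended]
```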

The outcomes of these evaluations will give AI developers a clear picture of whether the ingredients for severe risk are present. The most hazardous scenarios will typically involve a combination of several dangerous capabilities, which makes model evaluations an essential tool for governing these risks.
With superior tools to identify potentially dangerous models, businesses and regulatory bodies can enhance their procedures in several areas:
– Training Responsibly: Informed decisions can be made on whether and how to train a new model that exhibits early signs of risk.
– Deploying Responsibly: Informed decisions can be made on whether, when, and how to roll out potentially dangerous models.
– Transparency: Pertinent and useful information can be shared with stakeholders to aid in the preparation or mitigation of possible risks.
– Appropriate Security: Robust information security protocols and systems can be implemented for models that may pose severe risks.
The researchers have laid out a blueprint for how evaluations for extreme risks should feed into key decisions around training and deploying a highly capable, general-purpose model. Developers conduct evaluations throughout the process, and structured model access is granted to external safety researchers and model auditors so they can run additional evaluations. The results then inform risk assessments before further training and before deployment.
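As a rough illustration of how such a blueprint could gate a decision, the hypothetical sketch below combines the developer's own report with an external auditor's report (reusing the ModelEvaluationReport structure sketched earlier) into a simple pre-deployment check. The thresholds and rules are invented for illustration and are not part of the proposed plan.

```python
# Continues the hypothetical ModelEvaluationReport sketch above; nothing here is
# taken from the study itself.
from dataclasses import dataclass


@dataclass
class RiskAssessment:
    proceed: bool
    reasons: list[str]


def pre_deployment_gate(internal, external_audit) -> RiskAssessment:
    # `internal` and `external_audit` are ModelEvaluationReport instances: one from
    # the developer's own evaluations, one from external researchers and auditors.
    reasons: list[str] = []

    # Severe-risk scenarios typically involve a combination of dangerous capabilities,
    # so flag the model when more than one is detected (the cut-off of 2 is an assumption).
    combined = set(internal.dangerous_capabilities()) | set(external_audit.dangerous_capabilities())
    if len(combined) >= 2:
        reasons.append(f"multiple dangerous capabilities detected: {sorted(combined)}")

    # Any scenario where the model did not behave as intended blocks deployment
    # until it has been investigated.
    misaligned = internal.misaligned_scenarios() + external_audit.misaligned_scenarios()
    if misaligned:
        reasons.append(f"misaligned behaviour observed in: {misaligned}")

    return RiskAssessment(proceed=not reasons, reasons=reasons)
```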

What Next?
Initial work on model evaluations for extreme risks is already under way, notably at Google DeepMind, among others. However, further technical and institutional progress is needed to build an evaluation process that catches all possible risks and helps protect against emerging challenges.
While model evaluations are crucial, they are not a cure-all. Some risks could slip through, particularly those that depend heavily on factors external to the model, such as complex social, political, and economic dynamics. Model evaluations must therefore be combined with other risk-assessment tools and a broader commitment to safety across industry, government, and civil society.
Google’s recent blog post on responsible AI makes a similar point, stressing that “individual practices, shared industry standards, and robust government policies are critical to successful AI implementation”. The hope is that many others across the AI field, and those affected by it, will work together to develop methods and standards for creating and deploying AI safely, for everyone’s benefit.
Understanding how to identify emerging risky properties in models, and how to respond adequately to concerning results, is a vital part of responsible development at the frontier of AI capabilities.