
AI safety is not a model property

The assumption that AI safety is a property of AI models is pervasive in the AI community. It is seen as so obvious that it is hardly ever explicitly stated. Because of this assumption:

  • Companies have made big investments in red teaming their models before releasing them.

  • Researchers are frantically trying to fix the brittleness of model alignment techniques.

  • Some AI safety advocates seek to restrict open models given concerns that they might pose unique risks.

  • Policymakers are trying to find the training compute threshold above which safety risks become serious enough to justify intervention (and lacking any meaningful basis for picking one, they seem to have converged on 10^26 FLOPs rather arbitrarily; the sketch after this list gives a rough sense of what that scale means).

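For a sense of scale, here is a minimal back-of-the-envelope sketch in Python. It assumes the common rule of thumb that training compute is roughly 6 × parameters × training tokens for dense transformers; the (parameters, tokens) pairs are hypothetical and purely illustrative, not any specific lab's models:

```python
# Rough check of hypothetical training runs against a 10^26 FLOP threshold.
# Assumes training compute ~ 6 * parameters * training tokens (a common
# rule of thumb for dense transformers); real training runs vary.

THRESHOLD_FLOP = 1e26

def approx_training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs (forward + backward passes)."""
    return 6 * n_params * n_tokens

# Hypothetical (parameters, tokens) pairs, for illustration only.
for n_params, n_tokens in [(7e9, 2e12), (70e9, 15e12), (2e12, 30e12)]:
    flop = approx_training_flop(n_params, n_tokens)
    side = "above" if flop > THRESHOLD_FLOP else "below"
    print(f"{n_params:.0e} params, {n_tokens:.0e} tokens -> {flop:.1e} FLOP ({side} threshold)")
```
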
We think these efforts are inherently limited in their effectiveness. That’s because AI safety is not a model property. With a few exceptions, AI safety questions cannot be asked and answered at the level of models alone. Safety depends to a large extent on the context and the environment in which the AI model or AI system is deployed. We have to specify a particular context before we can even meaningfully ask an AI safety question.

As a corollary, fixing AI safety at the model level alone is unlikely to be fruitful. Even if models themselves can somehow be made “safe”, they can easily be used for malicious purposes. That’s because an adversary can deploy a model without giving it access to the details of the context in which it is deployed. Therefore we cannot delegate safety questions to models — especially questions about misuse. The model will lack information that is necessary to make a correct decision.

Based on this perspective, we make four recommendations for safety and red teaming that would represent a major change to how things are done today.

Safety depends on context: three examples

Consider the concern that LLMs can help hackers generate and send phishing emails to a large number of potential victims. It’s true — in our own small-scale tests, we’ve found that LLMs can generate persuasive phishing emails tailored to a particular individual based on publicly available information about them. 

But here’s the problem: phishing emails are just regular emails! There is nothing intrinsically malicious about them. A phishing email might tell the recipient that there is an urgent deadline for a project they are working on, and that they need to click on a link or open an ...

Read full article on AI Snake Oil →