Interactive AI misuse risk model

The following is an interactive model of the AI misuse risks described in the research paper "An Example Safety Case for Safeguards Against Misuse." The model combines evidence from expert surveys, capability evaluations, and safeguards evaluations to estimate the expected annual harm caused by novice actors acting with the help of a particular AI assistant.

Evidence from expert surveys

Baseline Success Probability
The distribution below indicates the probability that a novice 'attempt' to cause a large-scale harm will succeed given varying levels of time spent, where an 'attempt' must involve at least two weeks of earnest effort. This curve assumes conditions *prior to the deployment of the AI assistant in question.*

[Figure: Baseline success probability of an attempt vs. months spent on the attempt (log scale).]

Effort Distribution Parameters
The distribution below indicates the probability that a novice 'attempt' to cause a large-scale harm will involve at most the amount of time indicated. An 'attempt' must involve at least two weeks of earnest effort. We assume this distribution is fixed and unaffected by the deployment of the AI assistant.

[Figure: Cumulative probability of attempt effort vs. months spent on the attempt (log scale).]

Baseline expected annual harm (e.g. lives lost):

25,223
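
To illustrate the shape of this calculation, the Python sketch below integrates an assumed success-probability curve over an assumed effort distribution and scales the result by a hypothetical number of attempts per year and a hypothetical harm per successful attempt. Every number and functional form in it is a placeholder, not a value from the model above.

```python
import numpy as np
from scipy.stats import lognorm

# Placeholder stand-ins for the two expert-elicited curves above.
def success_prob(months):
    """Baseline P(attempt succeeds | months of effort); saturates near 0.2."""
    return 0.2 * (1 - np.exp(-months / 24.0))

def effort_cdf(months):
    """P(attempt effort <= months); log-normal with assumed parameters."""
    return lognorm.cdf(months, s=1.0, scale=3.0)  # median effort ~3 months

# Expected success probability of a single attempt, E[p] = integral of p(t) dF(t),
# approximated on a log-spaced grid of effort levels (0.2 to 48 months).
grid = np.logspace(np.log10(0.2), np.log10(48.0), 2000)
weights = np.diff(effort_cdf(grid), prepend=effort_cdf(grid[0]))
expected_success_prob = float(np.sum(success_prob(grid) * weights))

# Expected annual harm = attempts per year * E[P(success)] * harm per success.
ATTEMPTS_PER_YEAR = 1_000  # hypothetical
HARM_PER_SUCCESS = 1_000   # hypothetical (e.g. lives lost per successful attempt)
baseline_harm = ATTEMPTS_PER_YEAR * expected_success_prob * HARM_PER_SUCCESS
print(f"Baseline expected annual harm: {baseline_harm:,.0f}")
```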

Evidence from capability evaluations + experts

Pre-Mitigation success probability
The distribution below indicates the probability that a novice 'attempt' to cause a large-scale harm will succeed given varying levels of time spent, where an 'attempt' must involve at least two weeks of earnest effort. This curve assumes conditions where the AI assistant in question is deployed *without safeguards.*

[Figure: Pre-mitigation success probability of an attempt vs. months spent on the attempt (log scale).]

Pre-Mitigation expected annual harm (e.g. lives lost):

65,500
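
Continuing the sketch above, the pre-mitigation estimate comes from running the same integration with an uplifted success-probability curve. The uplift factor below is purely illustrative; in the safety case it is derived from capability evaluations and expert judgment.

```python
# Assumed uplift: the same ceiling as the baseline curve, reached roughly 2.5x
# faster because the unsafeguarded assistant accelerates the attempt.
def premit_success_prob(months):
    """P(attempt succeeds | months of effort) with an unsafeguarded assistant."""
    return 0.2 * (1 - np.exp(-months / 9.0))

# Re-run the integration from the previous sketch with the uplifted curve.
premit_expected_success = float(np.sum(premit_success_prob(grid) * weights))
premit_harm = ATTEMPTS_PER_YEAR * premit_expected_success * HARM_PER_SUCCESS
print(f"Pre-mitigation expected annual harm: {premit_harm:,.0f}")
```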

Summary of predictions

Baseline expected annual harm (e.g. lives lost):

25,223

Pre-Mitigation expected annual harm (e.g. lives lost):

65,500

Post-Mitigation expected annual harm (e.g. lives lost):

22,950 ✅ (< baseline)

Evidence from safeguards evaluations

Queries executed vs effort
This plot shows the results of a safeguards evaluation, in which a red team tries to obtain responses from the AI assistant to a representative set of misuse queries. The y-axis shows the average number of queries executed by members of the red team as a function of the effort they expend.

[Figure: Average queries executed vs. time spent jailbreaking (days).]
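
For the downstream calculations, it helps to have this evaluation result in functional form. The sketch below interpolates made-up red-team averages (placeholders for the actual results) into a queries-versus-effort function:

```python
import numpy as np

# Placeholder red-team results: average queries executed after d days of effort.
DAYS_OBSERVED = np.array([0.0, 5.0, 15.0, 30.0, 45.0])
QUERIES_OBSERVED = np.array([0.0, 2.0, 10.0, 25.0, 40.0])

def queries_executed(days):
    """Average number of misuse queries executed after `days` of jailbreaking."""
    return np.interp(days, DAYS_OBSERVED, QUERIES_OBSERVED)
```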

Bans vs Queries Executed
During the safeguards evaluation, the developer keeps track of the number of times the system 'bans' a member of the red team. The figure below shows the average number of bans a member of the red team receives as a function of the number of harmful queries they successfully execute (receive nearly-complete responses to).

[Figure: Number of bans vs. number of queries executed.]
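
The ban counts can be represented the same way; the values below are again placeholders rather than the evaluation's actual results:

```python
# Placeholder ban counts as a function of the number of queries executed.
QUERIES_GRID = np.array([0.0, 5.0, 15.0, 30.0, 45.0])
BANS_GRID = np.array([0.0, 3.0, 10.0, 22.0, 35.0])

def bans_received(queries):
    """Average number of bans incurred by the time `queries` have been executed."""
    return np.interp(queries, QUERIES_GRID, BANS_GRID)
```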

Time lost to bans
This plot is estimated by experts. It shows how much time a novice misuse actor would likely need to spend reacquiring access to the AI assistant after being banned.

[Figure: Time lost to bans (days) vs. time spent jailbreaking (days).]
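
Reading the plot's axes as cumulative time lost versus jailbreaking effort, the expert estimate can be captured as one more interpolated curve (placeholder values):

```python
# Placeholder expert estimate: cumulative days spent regaining access after bans,
# as a function of days spent jailbreaking.
def time_lost_to_bans(days_jailbreaking):
    """Total days lost to bans after `days_jailbreaking` days of effort."""
    return np.interp(days_jailbreaking,
                     [0.0, 15.0, 30.0, 45.0], [0.0, 4.0, 9.0, 15.0])
```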

Queries executed vs effort (w/ bans)
This plot synthesizes the previous three plots. It shows the number of harmful queries members of the red team obtain responses to over time, factoring in the additional time that a real misuse actor would lose due to being repeatedly banned.

[Figure: Queries executed (accounting for time lost to bans) vs. time spent jailbreaking (days).]
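
A sketch of that synthesis, reusing the placeholder functions from the previous sketches: a real actor's wall-clock time is the jailbreaking time plus the time lost to bans, while the number of queries executed is unchanged.

```python
# Shift the effort axis by the time lost to bans.
jailbreak_days = np.linspace(0.0, 45.0, 200)
wall_clock_days = jailbreak_days + time_lost_to_bans(jailbreak_days)

def queries_with_bans(total_days):
    """Queries a real novice actor executes after `total_days` of wall-clock effort."""
    return np.interp(total_days, wall_clock_days, queries_executed(jailbreak_days))
```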

Post-mitigation success probability
The distribution below shows the probability that a novice attempt to cause a large-scale harm succeeds given the amount of time spent on the attempt. The solid blue line shows the post-mitigation success probability, while the dashed lines show the baseline and pre-mitigation probabilities for comparison.

[Figure: Success probability of an attempt vs. months spent on the attempt (log scale), comparing three curves:]
  • Baseline Success Probability
  • Pre-Mitigation Success Probability
  • Post-Mitigation Success Probability

Post-Mitigation expected annual harm (e.g. lives lost):

22,950
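
The paper specifies how the safeguards evidence feeds into this curve; the sketch below shows only one plausible mapping, continuing the placeholders above: treat the safeguards as slowing the actor down, so that an attempt of a given real-world duration achieves the uplifted success probability of a shorter "effective" attempt.

```python
DAYS_PER_MONTH = 30.0

def effective_months(months):
    """Months of unsafeguarded-assistant progress achievable in `months` of real
    time once time lost to bans is accounted for (assumed mapping).
    Note: np.interp clamps beyond the red team's evaluated range, which assumes
    no further progress past their maximum effort; a real analysis would need to
    treat that extrapolation explicitly."""
    real_days = months * DAYS_PER_MONTH
    productive_days = np.interp(real_days, wall_clock_days, jailbreak_days)
    return productive_days / DAYS_PER_MONTH

def postmit_success_prob(months):
    """Post-mitigation P(attempt succeeds | months of effort) under the assumed mapping."""
    return premit_success_prob(effective_months(months))

postmit_expected_success = float(np.sum(postmit_success_prob(grid) * weights))
postmit_harm = ATTEMPTS_PER_YEAR * postmit_expected_success * HARM_PER_SUCCESS
print(f"Post-mitigation expected annual harm: {postmit_harm:,.0f}")
```

Under a mapping like this, the post-mitigation curve can fall below the baseline at some effort levels, since the actor sinks time into jailbreaking and recovering from bans rather than into the attempt itself; that is one way a post-mitigation estimate can end up below the baseline estimate, as in the summary above.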

Estimating how quickly developers need to respond to changing deployment conditions

We can use this quantitative model to simulate "what-if" scenarios in which users identify a way to reliably jailbreak the model. How much time do developers have to correct the deployment?

[Interactive figure: risk (annualized expected harm) vs. time after deployment (months), marking the unacceptable risk threshold, the point at which a universal jailbreak emerges, and the resulting window to respond and correct the deployment: 3.20 months.]
  • Risk post deployment (w/o jailbreak)
  • Risk post deployment (w/ jailbreak)
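
A minimal sketch of how such a window might be computed, assuming the risk ramps linearly from the post-mitigation level toward the pre-mitigation level as the jailbreak spreads among actors. The threshold and diffusion time below are assumptions, not values from the figure.

```python
import numpy as np

RISK_POST_MITIGATION = 22_950  # annualized expected harm with safeguards holding
RISK_PRE_MITIGATION = 65_500   # annualized expected harm once safeguards are bypassed
UNACCEPTABLE_RISK = 50_000     # assumed threshold set by the developer
DIFFUSION_MONTHS = 6.0         # assumed time for the jailbreak to spread widely

def risk_after_jailbreak(months_since_jailbreak):
    """Annualized risk as actors adopt the jailbreak (linear ramp assumed)."""
    frac = np.clip(months_since_jailbreak / DIFFUSION_MONTHS, 0.0, 1.0)
    return RISK_POST_MITIGATION + frac * (RISK_PRE_MITIGATION - RISK_POST_MITIGATION)

# The response window is the time until risk crosses the threshold.
t = np.linspace(0.0, DIFFUSION_MONTHS, 10_000)
window = t[np.argmax(risk_after_jailbreak(t) >= UNACCEPTABLE_RISK)]
print(f"Window to respond: {window:.2f} months")
```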

A formal description of the model:

Model Description

Joshua Clymer, March 8, 2025.