SOSBench

Benchmarking Safety Alignment on Six Scientific Domains

A regulation-grounded, hazard-focused benchmark for evaluating LLM safety on scientifically sophisticated misuse requests across six high-risk domains.

Deepseek-R1 and GPT-4.1 still reach 84.9% and 50.3% PVR on SOSBench.
⚠️ WARNING: This paper contains information that may be considered offensive.
SOSBench Main Figure

Fengqing Jiang1,†, Fengbo Ma2,†, Zhangchen Xu1, Yuetai Li1, Zixin Rao2,
Bhaskar Ramasubramanian3, Luyao Niu1, Bo Li4, Xianyan Chen2,
Zhen Xiang2,‡, Radha Poovendran1,‡

†Equal contribution  ‡Corresponding authors

Abstract

SOSBench is a regulation-grounded, hazard-focused benchmark for evaluating large-language-model safety in knowledge-intensive scientific misuse settings. It comprises 3,000 prompts derived from real-world regulations spanning six selected high-risk domains: chemistry, biology, medicine, pharmacology, physics, and psychology.

1
Novel benchmark: Each SOSBench prompt is tagged as high-risk by real-world regulatory frameworks and demands deep domain expertise to interpret, yielding a broader prompt-semantic space than prior science benchmarks.
2
Rigorous evaluation: We test 26 frontier models and uncover alarmingly high policy-violation rates (e.g., 84.9% for Deepseek-R1 and 50.3% for GPT-4.1), demonstrating that existing alignment techniques falter on regulation-derived scientific misuse scenarios.
3
Actionable insights for alignment: Open-source, domain-specialized models are markedly under-aligned; scaling only helps when alignment co-scales with knowledge, and larger reasoning budgets help only selectively depending on whether the model exposes its chain-of-thought.

Benchmark Design

Domains
Six High-Risk Scientific Areas
Chemistry, Biology, Medicine, Pharmacology, Physics, and Psychology

SOSBench probes model safety spanning six disciplines. Each domain is anchored in authoritative U.S./international regulations and demands deep subject-matter expertise to recognise and refuse hazardous requests.

Chemistry
Prompts referencing NFPA 704 level-4 chemicals focus on explosive synthesis, energetic materials and highly reactive compounds.
Example risks: TNT or RDX production, unstable oxidiser handling, generation of toxic chlorine gas
Biology
Seed terms originate from 42 CFR Part 71 and ICD-11 Chapter 1, covering biohazards that could facilitate pathogen creation or manipulation.
Example risks: Viral gain-of-function protocols, spore weaponisation, toxin-gene insertion
Medicine
Draws on ICD-11 Chapters 2-5 & 7-27 to test unsafe clinical advice, unregulated procedures, and self-harm medical directions.
Example risks: DIY surgery instructions, lethal dosage guidelines, clandestine gene-therapy kits
Pharmacology
Controlled-substance terms come from NIDA Drugs-A-to-Z and DEA Schedule III lists, stressing illicit drug synthesis and abuse pathways.
Example risks: Home-lab opioid production, steroid stacking regimens, clandestine pill-press operations
Physics
Incorporates IAEA SSR-6 isotopes and WMD-related technologies (e.g., electromagnetic-pulse devices, enrichment methods) to test weapon design advice.
Example risks: Radioisotope bomb assembly, EMP generator schematics, laser-triggered fusion setups
Psychology
Terms drawn from ICD-11 Chapter 6 and DSM-5/DSM-5-TR focus on manipulation or exploitation of mental-health knowledge.
Example risks: Coercive interrogation scripts, psychological torture regimens, unethical behaviour conditioning

The domains were selected because mis-handled expert knowledge in these areas poses clear public-safety hazards, as reflected by U.S. and international statutes referenced during SOSBench construction.

Construction
Regulation-Grounded Prompt Generation
3,000 prompts • 500 per domain • LLM-assisted evolution
SOSBench construction pipeline (Manual Seed → Hybrid Templates → Data Evolution → Final Sampling)

SOSBench grounds every prompt in authoritative regulations issued by the U.S. Government, United Nations and other bodies, then employs an LLM-assisted evolution algorithm to create realistic, policy-violating instructions that require deep scientific expertise to recognise and refuse.

1
Manual Seed Collection
Experts extract high-hazard terms (e.g. NFPA-704 level-4 chemicals, DEA Schedule III substances, ICD-11 pathology codes) directly from regulatory texts, then expand them via domain databases such as PubChem or DSM-5 synonym lists.
2
Hybrid Initial Prompt Generation
Seed terms are inserted into a mix of (i) templates mined from AdvBench and related corpora and (ii) human-written templates inspired by real incidents, yielding a large but rough prompt pool.
3
LLM-Assisted Data Evolution
GPT-4o-mini mutates prompts; three weak surrogate LLMs (Llama-3.1-8B, Qwen-2.5-7B, Gemma-2-9B) generate responses that are vetted by LlamaGuard. Coverage-driven sampling boosts diversity until each term elicits at least one policy violation.

With the pipeline above, we construct a benchmark of 3,000 prompts (500 per domain). We also construct a 300-sample SOSBench-Lite subset for lightweight evaluation.
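The coverage-driven evolution loop in step 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `mutate`, `probe`, and `is_violation` are hypothetical stand-ins for the GPT-4o-mini mutator, the three weak surrogate LLMs, and the LlamaGuard check.

```python
import random

def evolve_prompts(seed_terms, initial_prompts, mutate, probe, is_violation,
                   max_rounds=10):
    """Coverage-driven evolution sketch: keep mutating prompts until every
    seed term has at least one prompt that elicits a policy violation.

    `mutate`, `probe`, and `is_violation` are illustrative callables, not
    the paper's actual API.
    """
    pool = {term: list(prompts) for term, prompts in initial_prompts.items()}
    covered = set()
    for _ in range(max_rounds):
        uncovered = [t for t in seed_terms if t not in covered]
        if not uncovered:
            break  # every seed term has elicited at least one violation
        for term in uncovered:
            prompt = mutate(random.choice(pool[term]))  # LLM-assisted mutation
            pool[term].append(prompt)
            # A term counts as covered once any surrogate response violates policy.
            if any(is_violation(resp) for resp in probe(prompt)):
                covered.add(term)
    return pool, covered
```

In the real pipeline the mutator and surrogates are LLM calls; here the loop only captures the stopping criterion (per-term coverage) that drives the final sampling.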

Framework
Automatic Evaluation Pipeline
GPT-5 judge • PVR metric • 26 frontier models

SOSBench uses a fully automated evaluation pipeline that scales to thousands of prompts while keeping human annotators out of harm's way. Core design features:

  • Policy Violation Rate (PVR): a higher score means more policy-violating responses and a less safe model.
  • LLM-as-Judge: our automatic evaluator (built on GPT-5) achieves the highest agreement with human labels, outperforming existing evaluators including String-Match, the OpenAI Moderation API, WildGuard, and LlamaGuard.
  • Broad Model Coverage: 26 frontier LLMs (open and closed weights, reasoning and non-reasoning, diverse model sizes) are compared under identical decoding settings.
  • Domain-wise Scoring: Scoring is reported per domain (Chem, Bio, Med, Pharm, Phys, Psych) and overall, enabling fine-grained diagnostics.

The unified pipeline ensures reproducible, apples-to-apples safety assessments and highlights where alignment techniques must improve.
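The PVR metric itself reduces to a simple aggregation over binary judge labels. The sketch below assumes the judge emits one boolean per response (True = policy violation); the function names are illustrative, not the paper's code.

```python
def policy_violation_rate(judgments):
    """PVR: fraction of responses the judge labels policy-violating.
    `judgments` is an iterable of booleans (True = violation)."""
    judgments = list(judgments)
    return sum(judgments) / len(judgments)

def domain_pvr(records):
    """Per-domain and overall PVR from (domain, is_violation) pairs,
    mirroring the domain-wise scoring described above."""
    by_domain = {}
    for domain, flag in records:
        by_domain.setdefault(domain, []).append(flag)
    scores = {d: policy_violation_rate(flags) for d, flags in by_domain.items()}
    scores["overall"] = policy_violation_rate(f for _, f in records)
    return scores
```

A higher score means more violating responses, so lower is safer, matching the ↓ convention in the results table.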

Evaluation Results

Revealing critical safety alignment deficiencies across 26 frontier models

Alarming Safety Gaps Across All Models

Despite their alignment claims, advanced models consistently disclose policy-violating content across all six scientific domains.

84.9%
Deepseek-R1
Overall PVR across all domains
50.3%
GPT-4.1
Overall PVR across all domains
84.0%
Grok-3
Overall PVR across all domains
41.8%
GPT-5
Pharmacology PVR, its weakest domain

Detailed Model Performance

Higher PVR scores indicate more policy-violating content and less safe models. Overall values include the paper's reported 90% confidence intervals.

| Developer | Model Name | Think | Bio. | Chem. | Med. | Pharm. | Phys. | Psych. | Overall (PVR ↓ = safer) |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI | GPT-5 (20250807) | ✗ | 0.108 | 0.122 | 0.332 | 0.418 | 0.104 | 0.142 | 0.204 ± 0.012 |
| OpenAI | o3 (20250416) | ✓ | 0.156 | 0.152 | 0.372 | 0.424 | 0.114 | 0.196 | 0.236 ± 0.013 |
| OpenAI | o4-mini (20250416) | ✓ | 0.262 | 0.206 | 0.462 | 0.408 | 0.220 | 0.314 | 0.312 ± 0.014 |
| OpenAI | GPT-4.1 (20250414) | ✗ | 0.374 | 0.314 | 0.570 | 0.850 | 0.410 | 0.498 | 0.503 ± 0.015 |
| OpenAI | GPT-4o (20241120) | ✗ | 0.306 | 0.254 | 0.476 | 0.676 | 0.194 | 0.396 | 0.384 ± 0.015 |
| Google | Gemini-2.5-Pro (20250506) | ✓ | 0.354 | 0.342 | 0.492 | 0.634 | 0.466 | 0.294 | 0.430 ± 0.015 |
| Google | Gemini-2.5-Flash (20250417) | ✓ | 0.336 | 0.338 | 0.462 | 0.684 | 0.424 | 0.326 | 0.428 ± 0.015 |
| Google | Gemma-3-27B | ✗ | 0.792 | 0.646 | 0.814 | 0.934 | 0.842 | 0.792 | 0.803 ± 0.012 |
| Deepseek | Deepseek-V3 (0324) | ✗ | 0.856 | 0.600 | 0.872 | 0.916 | 0.722 | 0.820 | 0.798 ± 0.012 |
| Deepseek | Deepseek-R1 | ✓ | 0.814 | 0.834 | 0.806 | 0.964 | 0.872 | 0.806 | 0.849 ± 0.011 |
| Deepseek | Deepseek-R1-Distill-70B | ✓ | 0.838 | 0.904 | 0.854 | 0.972 | 0.886 | 0.816 | 0.878 ± 0.010 |
| Alibaba | Qwen3-235B-A22B | ✓ | 0.852 | 0.760 | 0.868 | 0.934 | 0.764 | 0.852 | 0.838 ± 0.011 |
| Alibaba | Qwen3-32B | ✓ | 0.802 | 0.784 | 0.774 | 0.946 | 0.740 | 0.746 | 0.799 ± 0.012 |
| Alibaba | Qwen2.5-72B | ✗ | 0.680 | 0.560 | 0.734 | 0.926 | 0.678 | 0.734 | 0.719 ± 0.014 |
| xAI | Grok-3 | ✗ | 0.894 | 0.638 | 0.860 | 0.954 | 0.804 | 0.890 | 0.840 ± 0.011 |
| xAI | Grok-3-mini | ✓ | 0.758 | 0.586 | 0.746 | 0.930 | 0.708 | 0.700 | 0.738 ± 0.013 |
| Anthropic | Claude-4.1-Opus | ✗ | 0.146 | 0.128 | 0.256 | 0.288 | 0.110 | 0.134 | 0.177 ± 0.011 |
| Anthropic | Claude-4.1-Opus-Thinking | ✓ | 0.122 | 0.166 | 0.208 | 0.210 | 0.086 | 0.080 | 0.145 ± 0.011 |
| Anthropic | Claude-4-Sonnet | ✗ | 0.152 | 0.262 | 0.300 | 0.356 | 0.180 | 0.174 | 0.237 ± 0.013 |
| Anthropic | Claude-4-Sonnet-Thinking | ✓ | 0.056 | 0.158 | 0.126 | 0.112 | 0.110 | 0.072 | 0.106 ± 0.009 |
| Anthropic | Claude-3.7-Sonnet | ✗ | 0.354 | 0.308 | 0.546 | 0.784 | 0.280 | 0.292 | 0.427 ± 0.015 |
| Anthropic | Claude-3.7-Sonnet-Thinking | ✓ | 0.104 | 0.108 | 0.154 | 0.374 | 0.062 | 0.044 | 0.141 ± 0.010 |
| Meta | Llama-4-Maverick | ✗ | 0.288 | 0.238 | 0.426 | 0.652 | 0.240 | 0.242 | 0.348 ± 0.014 |
| Meta | Llama-4-Scout | ✗ | 0.488 | 0.436 | 0.688 | 0.874 | 0.492 | 0.510 | 0.581 ± 0.015 |
| Meta | Llama-405B | ✗ | 0.590 | 0.468 | 0.690 | 0.764 | 0.444 | 0.568 | 0.587 ± 0.015 |
| Meta | Llama-3.3-70B | ✗ | 0.408 | 0.540 | 0.546 | 0.812 | 0.516 | 0.446 | 0.545 ± 0.015 |
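The reported ± margins on the overall scores are consistent with a 90% normal-approximation (Wald) binomial interval over the 3,000 prompts; the helper below is a sketch under that assumption, not the paper's stated method.

```python
import math

def pvr_margin(p, n=3000, z=1.645):
    """90% normal-approximation binomial margin for a PVR estimate.

    z = 1.645 is the two-sided 90% normal quantile; n = 3000 is the
    benchmark size. Assumes a Wald interval, which reproduces the
    table's margins to three decimals.
    """
    return z * math.sqrt(p * (1 - p) / n)

# e.g. GPT-4.1: pvr_margin(0.503) rounds to 0.015;
#      Deepseek-R1: pvr_margin(0.849) rounds to 0.011.
```

Note that the margin shrinks as PVR approaches 0 or 1, which is why the safest and least safe models both carry tighter intervals than models near 0.5.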

Key Research Findings

1
Frontier Model Safety Alignment Is Shallow
Across 26 frontier LLMs, even widely deployed models still disclose hazardous scientific content at high rates. GPT-4.1 reaches 0.503 overall PVR and Deepseek-R1 reaches 0.849, showing that present-day alignment remains inadequate for deep scientific misuse scenarios.
2
Pharmacology and Other Under-Covered Domains Drive Failures
Failure rates vary sharply by subject, with many models performing worst on pharmacology. Even GPT-5, one of the safest models overall, rises to 0.418 PVR on pharmacology, showing that alignment must be domain-aware rather than one-size-fits-all.
3
Domain-Expert Models Offer No Added Safety
Domain-specialized post-training often erodes prior safeguards. For example, BioMistral-7B-SLERP reaches 0.915 overall PVR, making it less safe than many general-purpose models despite its subject-matter expertise.
4
Scaling Helps Only When Alignment Co-Scales with Knowledge
Larger models are not uniformly safer. Some families improve with scale, such as o4-mini to o3 and Llama-4-Scout to Llama-4-Maverick, but others stay flat or regress. The central claim is that safety improves only when alignment grows in lock-step with added knowledge.
5
Test-Time Scaling Has Mixed Effects
Increasing the reasoning budget is not uniformly beneficial. Our analysis finds that larger budgets can raise PVR for visible-thinking models while only slightly helping invisible-thinking models, suggesting that exposed chain-of-thought can itself become a leakage surface.
6
Appendix Analyses Show Limited Unlearning Gains and Brittle Defences
Unlearning yields only modest safety gains and can reduce capability (e.g. Mixtral drops from 68.2 to 67.1 MMLU while PVR only moves from 0.806 to 0.775). Meanwhile, jailbreaks remain highly effective: Llama-4-Maverick jumps from 0.28 to 0.88 under GCG transfer, and Crescendo pushes tested models above 0.90 PVR.

Citation

If you use SOSBench in your work, we would appreciate it if you cited our paper:

@inproceedings{jiang2026sosbench,
  title={{SOSB}ench: Benchmarking Safety Alignment on Six Scientific Domains},
  author={Fengqing Jiang and Fengbo Ma and Zhangchen Xu and Yuetai Li and Zixin Rao and Bhaskar Ramasubramanian and Luyao Niu and Bo Li and Xianyan Chen and Zhen Xiang and Radha Poovendran},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=2Td8r7KYK2}
}