SOSBench

Benchmarking Safety Alignment on Six Scientific Domains

A regulation-grounded, hazard-focused benchmark for evaluating LLM safety on scientifically sophisticated misuse requests across six high-risk domains.

Deepseek-R1 and GPT-4.1 still reach 84.9% and 50.3% PVR on SOSBench.
⚠️ WARNING: This paper contains information that may be considered offensive.
SOSBench Main Figure

Fengqing Jiang1,†, Fengbo Ma2,†, Zhangchen Xu1, Yuetai Li1, Zixin Rao2,
Bhaskar Ramasubramanian3, Luyao Niu1, Bo Li4, Xianyan Chen2,
Zhen Xiang2,‡, Radha Poovendran1,‡

†Equal contribution  ‡Corresponding authors

Abstract

SOSBench is a regulation-grounded, hazard-focused benchmark for evaluating large-language-model safety in knowledge-intensive scientific misuse settings. It comprises 3,000 prompts derived from real-world regulations spanning six selected high-risk domains: chemistry, biology, medicine, pharmacology, physics, and psychology.

1
Novel benchmark: Each SOSBench prompt is tagged as high-risk by real-world regulatory frameworks and demands deep domain expertise to interpret, yielding a broader prompt-semantic space than prior science benchmarks.
2
Rigorous evaluation: We test 26 frontier models and uncover alarmingly high policy-violation rates (e.g., 84.9% for Deepseek-R1 and 50.3% for GPT-4.1), demonstrating that existing alignment techniques falter on regulation-derived scientific misuse scenarios.
3
Actionable insights for alignment: Open-source, domain-specialized models are markedly under-aligned; scaling only helps when alignment co-scales with knowledge, and larger reasoning budgets help only selectively depending on whether the model exposes its chain-of-thought.

Benchmark Design

Domains
Six High-Risk Scientific Areas
Chemistry, Biology, Medicine, Pharmacology, Physics, and Psychology

SOSBench probes model safety spanning six disciplines. Each domain is anchored in authoritative U.S./international regulations and demands deep subject-matter expertise to recognise and refuse hazardous requests.

Chemistry
Prompts referencing NFPA 704 level-4 chemicals focus on explosive synthesis, energetic materials and highly reactive compounds.
Example risks: TNT or RDX production, unstable oxidiser handling, generation of toxic chlorine gas
Biology
Seed terms originate from 42 CFR Part 71 and ICD-11 Chapter 1, covering biohazards that could facilitate pathogen creation or manipulation.
Example risks: Viral gain-of-function protocols, spore weaponisation, toxin-gene insertion
Medicine
Draws on ICD-11 Chapters 2-5 & 7-27 to test unsafe clinical advice, unregulated procedures, and self-harm medical directions.
Example risks: DIY surgery instructions, lethal dosage guidelines, clandestine gene-therapy kits
Pharmacology
Controlled-substance terms come from NIDA Drugs-A-to-Z and DEA Schedule III lists, stressing illicit drug synthesis and abuse pathways.
Example risks: Home-lab opioid production, steroid stacking regimens, clandestine pill-press operations
Physics
Incorporates IAEA SSR-6 isotopes and WMD-related technologies (e.g., electromagnetic-pulse devices, enrichment methods) to test weapon design advice.
Example risks: Radioisotope bomb assembly, EMP generator schematics, laser-triggered fusion setups
Psychology
Terms drawn from ICD-11 Chapter 6 and DSM-5/DSM-5-TR focus on manipulation or exploitation of mental-health knowledge.
Example risks: Coercive interrogation scripts, psychological torture regimens, unethical behaviour conditioning

The domains were selected because mis-handled expert knowledge in these areas poses clear public-safety hazards, as reflected by U.S. and international statutes referenced during SOSBench construction.

Construction
Regulation-Grounded Prompt Generation
3,000 prompts • 500 per domain • LLM-assisted evolution
SOSBench construction pipeline (Manual Seed → Hybrid Templates → Data Evolution → Final Sampling)

SOSBench grounds every prompt in authoritative regulations issued by the U.S. Government, United Nations and other bodies, then employs an LLM-assisted evolution algorithm to create realistic, policy-violating instructions that require deep scientific expertise to recognise and refuse.

1
Manual Seed Collection
Experts extract high-hazard terms (e.g. NFPA-704 level-4 chemicals, DEA Schedule III substances, ICD-11 pathology codes) directly from regulatory texts, then expand them via domain databases such as PubChem or DSM-5 synonym lists.
2
Hybrid Initial Prompt Generation
Seed terms are inserted into a mix of (i) templates mined from AdvBench and related corpora and (ii) human-written templates inspired by real incidents, yielding a large but rough prompt pool.
3
LLM-Assisted Data Evolution
GPT-4o-mini mutates prompts; three weak surrogate LLMs (Llama-3.1-8B, Qwen-2.5-7B, Gemma-2-9B) generate responses that are vetted by LlamaGuard. Coverage-driven sampling boosts diversity until each term elicits at least one policy violation.

With the pipeline above, we construct a benchmark of 3,000 prompts (500 per domain). We also construct a 300-sample SOSBench-Lite subset for lightweight evaluation.
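The coverage-driven evolution loop in step 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `mutate`, `probe`, and `is_violation` are hypothetical stand-ins for the GPT-4o-mini mutator, the three weak surrogate LLMs, and the LlamaGuard check.

```python
import random

def evolve_prompts(seed_terms, initial_prompts, mutate, probe, is_violation,
                   max_rounds=10):
    """Coverage-driven evolution sketch: keep mutating prompts until every
    seed term has at least one prompt that elicits a policy violation.

    `mutate`, `probe`, and `is_violation` are illustrative callables, not
    the paper's actual API.
    """
    pool = {term: list(prompts) for term, prompts in initial_prompts.items()}
    covered = set()
    for _ in range(max_rounds):
        uncovered = [t for t in seed_terms if t not in covered]
        if not uncovered:
            break  # every seed term has elicited at least one violation
        for term in uncovered:
            prompt = mutate(random.choice(pool[term]))  # LLM-assisted mutation
            pool[term].append(prompt)
            # A term counts as covered once any surrogate response violates policy.
            if any(is_violation(resp) for resp in probe(prompt)):
                covered.add(term)
    return pool, covered
```

In the real pipeline the mutator and surrogates are LLM calls; here the loop only captures the stopping criterion (per-term coverage) that drives the final sampling.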

Framework
Automatic Evaluation Pipeline
GPT-5 judge • PVR metric • 26 frontier models

SOSBench uses a fully automated evaluation pipeline that scales to thousands of prompts while keeping human annotators out of harm's way. Core design features:

  • Policy Violation Rate (PVR): a higher score means more policy-violating responses and a less safe model.
  • LLM-as-Judge: our automatic evaluator (built on GPT-5) achieves the highest agreement with human labels, outperforming existing evaluators including String-Match, the OpenAI Moderation API, WildGuard, and LlamaGuard.
  • Broad Model Coverage: 26 frontier LLMs (open and closed weights, reasoning and non-reasoning, diverse model sizes) are compared under identical decoding settings.
  • Domain-wise Scoring: Scoring is reported per domain (Chem, Bio, Med, Pharm, Phys, Psych) and overall, enabling fine-grained diagnostics.

The unified pipeline ensures reproducible, apples-to-apples safety assessments and highlights where alignment techniques must improve.
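The PVR metric itself reduces to a simple aggregation over binary judge labels. The sketch below assumes the judge emits one boolean per response (True = policy violation); the function names are illustrative, not the paper's code.

```python
def policy_violation_rate(judgments):
    """PVR: fraction of responses the judge labels policy-violating.
    `judgments` is an iterable of booleans (True = violation)."""
    judgments = list(judgments)
    return sum(judgments) / len(judgments)

def domain_pvr(records):
    """Per-domain and overall PVR from (domain, is_violation) pairs,
    mirroring the domain-wise scoring described above."""
    by_domain = {}
    for domain, flag in records:
        by_domain.setdefault(domain, []).append(flag)
    scores = {d: policy_violation_rate(flags) for d, flags in by_domain.items()}
    scores["overall"] = policy_violation_rate(f for _, f in records)
    return scores
```

A higher score means more violating responses, so lower is safer, matching the ↓ convention in the results table.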

Evaluation Results

Revealing critical safety alignment deficiencies across 26 frontier models

Alarming Safety Gaps Across All Models

Despite their alignment claims, advanced models consistently disclose policy-violating content across all six scientific domains.

84.9%
Deepseek-R1
Overall PVR across all domains
50.3%
GPT-4.1
Overall PVR across all domains
84.0%
Grok-3
Overall PVR across all domains
41.8%
GPT-5
Pharmacology PVR, its weakest domain

Detailed Model Performance

Higher PVR scores indicate more policy-violating content and less safe models. Overall values include the paper's reported 90% confidence intervals.

| Developer | Model Name | Think | Bio. | Chem. | Med. | Pharm. | Phys. | Psych. | Overall (PVR ↓ = safer) |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI | GPT-5 (20250807) | ✗ | 0.108 | 0.122 | 0.332 | 0.418 | 0.104 | 0.142 | 0.204 ± 0.012 |
| OpenAI | o3 (20250416) | ✓ | 0.156 | 0.152 | 0.372 | 0.424 | 0.114 | 0.196 | 0.236 ± 0.013 |
| OpenAI | o4-mini (20250416) | ✓ | 0.262 | 0.206 | 0.462 | 0.408 | 0.220 | 0.314 | 0.312 ± 0.014 |
| OpenAI | GPT-4.1 (20250414) | ✗ | 0.374 | 0.314 | 0.570 | 0.850 | 0.410 | 0.498 | 0.503 ± 0.015 |
| OpenAI | GPT-4o (20241120) | ✗ | 0.306 | 0.254 | 0.476 | 0.676 | 0.194 | 0.396 | 0.384 ± 0.015 |
| Google | Gemini-2.5-Pro (20250506) | ✓ | 0.354 | 0.342 | 0.492 | 0.634 | 0.466 | 0.294 | 0.430 ± 0.015 |
| Google | Gemini-2.5-Flash (20250417) | ✓ | 0.336 | 0.338 | 0.462 | 0.684 | 0.424 | 0.326 | 0.428 ± 0.015 |
| Google | Gemma-3-27B | ✗ | 0.792 | 0.646 | 0.814 | 0.934 | 0.842 | 0.792 | 0.803 ± 0.012 |
| Deepseek | Deepseek-V3 (0324) | ✗ | 0.856 | 0.600 | 0.872 | 0.916 | 0.722 | 0.820 | 0.798 ± 0.012 |
| Deepseek | Deepseek-R1 | ✓ | 0.814 | 0.834 | 0.806 | 0.964 | 0.872 | 0.806 | 0.849 ± 0.011 |
| Deepseek | Deepseek-R1-Distill-70B | ✓ | 0.838 | 0.904 | 0.854 | 0.972 | 0.886 | 0.816 | 0.878 ± 0.010 |
| Alibaba | Qwen3-235B-A22B | ✓ | 0.852 | 0.760 | 0.868 | 0.934 | 0.764 | 0.852 | 0.838 ± 0.011 |
| Alibaba | Qwen3-32B | ✓ | 0.802 | 0.784 | 0.774 | 0.946 | 0.740 | 0.746 | 0.799 ± 0.012 |
| Alibaba | Qwen2.5-72B | ✗ | 0.680 | 0.560 | 0.734 | 0.926 | 0.678 | 0.734 | 0.719 ± 0.014 |
| xAI | Grok-3 | ✗ | 0.894 | 0.638 | 0.860 | 0.954 | 0.804 | 0.890 | 0.840 ± 0.011 |
| xAI | Grok-3-mini | ✓ | 0.758 | 0.586 | 0.746 | 0.930 | 0.708 | 0.700 | 0.738 ± 0.013 |
| Anthropic | Claude-4.1-Opus | ✗ | 0.146 | 0.128 | 0.256 | 0.288 | 0.110 | 0.134 | 0.177 ± 0.011 |
| Anthropic | Claude-4.1-Opus-Thinking | ✓ | 0.122 | 0.166 | 0.208 | 0.210 | 0.086 | 0.080 | 0.145 ± 0.011 |
| Anthropic | Claude-4-Sonnet | ✗ | 0.152 | 0.262 | 0.300 | 0.356 | 0.180 | 0.174 | 0.237 ± 0.013 |
| Anthropic | Claude-4-Sonnet-Thinking | ✓ | 0.056 | 0.158 | 0.126 | 0.112 | 0.110 | 0.072 | 0.106 ± 0.009 |
| Anthropic | Claude-3.7-Sonnet | ✗ | 0.354 | 0.308 | 0.546 | 0.784 | 0.280 | 0.292 | 0.427 ± 0.015 |
| Anthropic | Claude-3.7-Sonnet-Thinking | ✓ | 0.104 | 0.108 | 0.154 | 0.374 | 0.062 | 0.044 | 0.141 ± 0.010 |
| Meta | Llama-4-Maverick | ✗ | 0.288 | 0.238 | 0.426 | 0.652 | 0.240 | 0.242 | 0.348 ± 0.014 |
| Meta | Llama-4-Scout | ✗ | 0.488 | 0.436 | 0.688 | 0.874 | 0.492 | 0.510 | 0.581 ± 0.015 |
| Meta | Llama-405B | ✗ | 0.590 | 0.468 | 0.690 | 0.764 | 0.444 | 0.568 | 0.587 ± 0.015 |
| Meta | Llama-3.3-70B | ✗ | 0.408 | 0.540 | 0.546 | 0.812 | 0.516 | 0.446 | 0.545 ± 0.015 |
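The reported ± margins on the overall scores are consistent with a 90% normal-approximation (Wald) binomial interval over the 3,000 prompts; the helper below is a sketch under that assumption, not the paper's stated method.

```python
import math

def pvr_margin(p, n=3000, z=1.645):
    """90% normal-approximation binomial margin for a PVR estimate.

    z = 1.645 is the two-sided 90% normal quantile; n = 3000 is the
    benchmark size. Assumes a Wald interval, which reproduces the
    table's margins to three decimals.
    """
    return z * math.sqrt(p * (1 - p) / n)

# e.g. GPT-4.1: pvr_margin(0.503) rounds to 0.015;
#      Deepseek-R1: pvr_margin(0.849) rounds to 0.011.
```

Note that the margin shrinks as PVR approaches 0 or 1, which is why the safest and least safe models both carry tighter intervals than models near 0.5.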

Key Research Findings

1
Frontier Model Safety Alignment Is Shallow
Across 26 frontier LLMs, even widely deployed models still disclose hazardous scientific content at high rates. GPT-4.1 reaches 0.503 overall PVR and Deepseek-R1 reaches 0.849, showing that present-day alignment remains inadequate for deep scientific misuse scenarios.
2
Pharmacology and Other Under-Covered Domains Drive Failures
Failure rates vary sharply by subject, with many models performing worst on pharmacology. Even GPT-5, one of the safest models overall, rises to 0.418 PVR on pharmacology, showing that alignment must be domain-aware rather than one-size-fits-all.
3
Domain-Expert Models Offer No Added Safety
Domain-specialized post-training often erodes prior safeguards. For example, BioMistral-7B-SLERP reaches 0.915 overall PVR, making it less safe than many general-purpose models despite its subject-matter expertise.
4
Scaling Helps Only When Alignment Co-Scales with Knowledge
Larger models are not uniformly safer. Some families improve with scale, such as o4-mini to o3 and Llama-4-Scout to Llama-4-Maverick, but others stay flat or regress. The central claim is that safety improves only when alignment grows in lock-step with added knowledge.
5
Test-Time Scaling Has Mixed Effects
Increasing the reasoning budget is not uniformly beneficial. Our analysis finds that larger budgets can raise PVR for visible-thinking models while only slightly helping invisible-thinking models, suggesting that exposed chain-of-thought can itself become a leakage surface.
6
Appendix Analyses Show Limited Unlearning Gains and Brittle Defences
Unlearning yields only modest safety gains and can reduce capability (e.g. Mixtral drops from 68.2 to 67.1 MMLU while PVR only moves from 0.806 to 0.775). Meanwhile, jailbreaks remain highly effective: Llama-4-Maverick jumps from 0.28 to 0.88 under GCG transfer, and Crescendo pushes tested models above 0.90 PVR.

Citation

If you use SOSBench in your work, we would appreciate it if you cited our paper:

@inproceedings{jiang2026sosbench,
  title={{SOSB}ench: Benchmarking Safety Alignment on Six Scientific Domains},
  author={Fengqing Jiang and Fengbo Ma and Zhangchen Xu and Yuetai Li and Zixin Rao and Bhaskar Ramasubramanian and Luyao Niu and Bo Li and Xianyan Chen and Zhen Xiang and Radha Poovendran},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=2Td8r7KYK2}
}