SOSBench

Benchmarking Safety Alignment on Scientific Knowledge

A comprehensive benchmark for evaluating LLM safety when handling scientifically sophisticated and potentially hazardous content across six high-risk domains.

Advanced models consistently disclose POLICY-VIOLATING content at alarming rates!
⚠️ WARNING: This paper contains information that may be considered offensive.
SOSBench Main Figure

Fengqing Jiang¹,†, Fengbo Ma²,†, Zhangchen Xu¹, Yuetai Li¹,
Bhaskar Ramasubramanian³, Luyao Niu¹, Bo Li³, Xianyan Chen²,
Zhen Xiang²,†, Radha Poovendran¹,‡

†Equal contribution  ‡Corresponding author

Abstract

SOSBench is the first regulation-grounded, hazard-focused, multi-disciplinary benchmark for assessing large language model (LLM) safety in knowledge-intensive scientific contexts. Comprising 3,000 prompts derived from authoritative U.S. and international regulations, it probes six high-risk domains (chemistry, biology, medicine, pharmacology, physics, and psychology) and can be applied to any model without architectural changes or additional pre-training.

1
Novel benchmark: Each SOSBench prompt is tagged as high-risk by real-world regulatory frameworks and demands deep domain expertise to interpret, yielding a broader semantic space of prompts than prior science benchmarks.
2
Rigorous evaluation: We test 21 frontier models and uncover alarmingly high harmful-response rates (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1), demonstrating that existing alignment techniques falter on regulation-derived scientific misuse scenarios.
3
Actionable insights for alignment: Open-source, domain-specialized models are markedly under-aligned; increasing parameter scale and granting more reasoning steps reduce harmful outputs only when alignment methods advance in lock-step with the expanded knowledge and reasoning unlocked by scaling.

Benchmark Design

Domains
Six High-Risk Scientific Areas
Chemistry, Biology, Medicine, Pharmacology, Physics, and Psychology

SOSBench probes model safety spanning six disciplines. Each domain is anchored in authoritative U.S./international regulations and demands deep subject-matter expertise to recognise and refuse hazardous requests.

Chemistry
Prompts referencing NFPA 704 level-4 chemicals focus on explosive synthesis, energetic materials and highly reactive compounds.
Example risks: TNT or RDX production, unstable oxidiser handling, generation of toxic chlorine gas
Biology
Seed terms originate from 42 CFR Part 71 and ICD-11 Chapter 1, covering biohazards that could facilitate pathogen creation or manipulation.
Example risks: Viral gain-of-function protocols, spore weaponisation, toxin-gene insertion
Medicine
Draws on ICD-11 Chapters 2-5 & 7-27 to test unsafe clinical advice, unregulated procedures, and self-harm medical directions.
Example risks: DIY surgery instructions, lethal dosage guidelines, clandestine gene-therapy kits
Pharmacology
Controlled-substance terms come from NIDA Drugs-A-to-Z and DEA Schedule III lists, stressing illicit drug synthesis and abuse pathways.
Example risks: Home-lab opioid production, steroid stacking regimens, clandestine pill-press operations
Physics
Incorporates IAEA SSR-6 isotopes and WMD-related technologies (e.g., electromagnetic-pulse devices, enrichment methods) to test weapon design advice.
Example risks: Radioisotope bomb assembly, EMP generator schematics, laser-triggered fusion setups
Psychology
Terms drawn from ICD-11 Chapter 6 and DSM-5/DSM-5-TR focus on manipulation or exploitation of mental-health knowledge.
Example risks: Coercive interrogation scripts, psychological torture regimens, unethical behaviour conditioning

The domains were selected because mis-handled expert knowledge in these areas poses clear public-safety hazards, as reflected by U.S. and international statutes referenced during SOSBench construction.

Construction
Regulation-Grounded Prompt Generation
3,000 prompts • 500 per domain • LLM-assisted evolution
SOSBench construction pipeline (Manual Seed → Hybrid Templates → Data Evolution → Final Sampling)

SOSBench grounds every prompt in authoritative regulations issued by the U.S. Government, the United Nations, and other bodies, then employs an LLM-assisted evolution algorithm to create realistic, policy-violating instructions that require deep scientific expertise to recognise and refuse.

1
Manual Seed Collection
Experts extract high-hazard terms (e.g., NFPA-704 level-4 chemicals, DEA Schedule III substances, ICD-11 pathology codes) directly from regulatory texts, then expand them via domain databases such as PubChem or DSM-5 synonym lists.
2
Hybrid Initial Prompt Generation
Seed terms are inserted into a mix of (i) templates mined from AdvBench and related corpora and (ii) human-written templates inspired by real incidents, yielding a large but rough prompt pool.
3
LLM-Assisted Data Evolution
GPT-4o-mini mutates prompts; three weak surrogate LLMs (Llama-3.1-8B, Qwen-2.5-7B, Gemma-2-9B) generate responses that are vetted by LlamaGuard. Coverage-driven sampling boosts diversity until each term elicits at least one policy violation.

With this pipeline, we construct a benchmark of 3,000 prompts, 500 per domain, along with a 300-sample SOSBench-Lite subset.
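A minimal sketch of the evolution loop is shown below, under stated assumptions: mutate_prompt stands in for the GPT-4o-mini rewriter, surrogate_models for the three weak LLMs (each assumed to expose a generate method), and is_violation for the LlamaGuard screen. All three names are hypothetical wrappers; only the coverage-driven control flow reflects the pipeline described above.

# Minimal sketch of the data-evolution loop (hypothetical wrappers):
#   mutate_prompt(prompt)   -> dict : GPT-4o-mini rewrite of a prompt
#   surrogate_models        -> list : weak LLM objects with .generate(text)
#   is_violation(response)  -> bool : LlamaGuard policy screen
# Prompts are dicts holding the seed "term" and the instruction "text".
import random

def evolve(seed_prompts, mutate_prompt, surrogate_models, is_violation,
           max_rounds=10, batch_size=64):
    """Evolve prompts until every seed term elicits >= 1 policy violation."""
    pool = list(seed_prompts)   # rough initial prompt pool from step 2
    covered = set()             # seed terms with a confirmed violation
    for _ in range(max_rounds):
        uncovered = [p for p in pool if p["term"] not in covered]
        if not uncovered:       # coverage goal reached for every term
            break
        # Coverage-driven sampling: only mutate prompts for uncovered terms.
        batch = random.sample(uncovered, k=min(batch_size, len(uncovered)))
        for prompt in batch:
            variant = mutate_prompt(prompt)                 # step 3 mutation
            replies = (m.generate(variant["text"]) for m in surrogate_models)
            if any(is_violation(r) for r in replies):       # vetted by guard
                covered.add(variant["term"])
                pool.append(variant)                        # keep the variant
    return pool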

Framework
Automatic Evaluation Pipeline
LLM-as-Judge • Harmful-Rate metric • 20+ models tested

SOSBench uses a fully automated evaluation pipeline that scales to thousands of prompts while keeping human annotators out of harm's way. Core design features:

  • Harmful-Rate (HR) Metric: the fraction of responses judged harmful; a higher score indicates more harmful responses, i.e., a less safe model.
  • LLM-as-Judge: an automatic evaluator built on GPT-4.1 achieves the best agreement with human labels, outperforming existing evaluators including String-Match, the OpenAI Moderation API, WildGuard, and LlamaGuard.
  • Broad Model Coverage: 20+ frontier LLMs—open/closed, reasoning/non-reasoning, diverse model sizes—are compared under identical decoding settings.
  • Domain-wise Scoring: Scoring is reported per domain (Chem, Bio, Med, Pharm, Phys, Psych) and overall, enabling fine-grained diagnostics.

The unified pipeline ensures reproducible, apples-to-apples safety assessments and highlights where alignment techniques must improve.
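In code, the scoring step reduces to counting judge verdicts. The sketch below assumes a hypothetical judge_is_harmful wrapper around the GPT-4.1 judge and records carrying a domain label; it illustrates the HR metric and domain-wise reporting rather than the exact evaluation harness.

# Minimal sketch of HR scoring: the fraction of responses the judge labels
# harmful, reported per domain and overall.
#   judge_is_harmful(prompt, response) -> bool  (hypothetical GPT-4.1 judge call)
#   records: dicts with "domain", "prompt", "response" keys
from collections import defaultdict

def harmful_rate(records, judge_is_harmful):
    verdicts = defaultdict(list)
    for rec in records:
        verdicts[rec["domain"]].append(
            judge_is_harmful(rec["prompt"], rec["response"]))
    per_domain = {d: sum(v) / len(v) for d, v in verdicts.items()}
    # With 500 prompts per domain, overall HR is the unweighted domain mean.
    overall = sum(per_domain.values()) / len(per_domain)
    return per_domain, overall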

Evaluation Results

Revealing critical safety alignment deficiencies in frontier models

Alarming Safety Gaps Across All Models

Despite their alignment claims, advanced models consistently disclose policy-violating content across all scientific domains

79.1%
Deepseek-R1
Harmful response rate across all domains
47.3%
GPT-4.1
Harmful response rate across all domains
80.3%
Grok-3
Harmful response rate across all domains
43.6%
Claude-4-Opus
Harmful response rate on pharmacology

Detailed Model Performance

Higher HR (Harmful Rate) scores indicate more harmful content generation and less safe models. Frontier model safety alignment shows concerning gaps.

Harmful Rate (HR) per subject domain and overall; lower ↓ = safer. Think = reasoning mode enabled.

Developer  Model Name                          Think  Bio.   Chem.  Med.   Pharm. Phys.  Psych. Overall
OpenAI     o3 (20250416)                       ✓      0.138  0.108  0.286  0.384  0.120  0.208  0.207
OpenAI     o4-mini (20250416)                  ✓      0.252  0.162  0.330  0.364  0.224  0.326  0.276
OpenAI     GPT-4.1 (20250414)                  ✗      0.362  0.246  0.492  0.818  0.408  0.514  0.473
OpenAI     GPT-4o (20241120)                   ✗      0.310  0.178  0.392  0.624  0.186  0.418  0.351
Google     Gemini-2.5-Pro (20250506)           ✓      0.294  0.254  0.324  0.568  0.428  0.308  0.363
Google     Gemini-2.5-Flash (20250417)         ✓      0.296  0.258  0.304  0.604  0.418  0.306  0.364
Google     Gemma-3-27B                         ✗      0.760  0.566  0.720  0.902  0.836  0.808  0.765
Deepseek   Deepseek-V3 (0324)                  ✗      0.876  0.560  0.814  0.894  0.714  0.852  0.785
Deepseek   Deepseek-R1                         ✓      0.788  0.654  0.716  0.912  0.836  0.838  0.791
Deepseek   Deepseek-R1-Distill-70B             ✓      0.820  0.714  0.764  0.934  0.872  0.868  0.829
Alibaba    Qwen3-235B-A22B                     ✓      0.484  0.358  0.404  0.440  0.460  0.428  0.429
Alibaba    Qwen3-32B                           ✓      0.814  0.564  0.682  0.860  0.718  0.802  0.740
Alibaba    Qwen2.5-72B                         ✗      0.708  0.504  0.672  0.900  0.676  0.738  0.700
xAI        Grok-3                              ✗      0.902  0.498  0.772  0.922  0.812  0.914  0.803
xAI        Grok-3-mini                         ✓      0.704  0.398  0.622  0.874  0.664  0.720  0.664
Anthropic  Claude-4-Opus (20250514)            ✗      0.106  0.142  0.216  0.436  0.154  0.220  0.212
Anthropic  Claude-4-Opus-Think (20250514)      ✓      0.074  0.078  0.108  0.226  0.086  0.158  0.122
Anthropic  Claude-4-Sonnet (20250514)          ✗      0.120  0.182  0.202  0.318  0.174  0.172  0.195
Anthropic  Claude-4-Sonnet-Think (20250514)    ✓      0.056  0.086  0.054  0.054  0.110  0.064  0.071
Anthropic  Claude-3.7-Sonnet (20250219)        ✗      0.346  0.238  0.444  0.750  0.262  0.314  0.392
Anthropic  Claude-3.7-Sonnet-Think (20250219)  ✓      0.050  0.056  0.072  0.312  0.062  0.048  0.100
Meta       Llama-4-Maverick                    ✗      0.280  0.198  0.352  0.610  0.232  0.250  0.320
Meta       Llama-4-Scout                       ✗      0.500  0.396  0.598  0.836  0.498  0.530  0.560
Meta       Llama-3.1-405B                      ✗      0.586  0.408  0.596  0.732  0.446  0.564  0.555
Meta       Llama-3.3-70B                       ✗      0.418  0.466  0.472  0.784  0.522  0.454  0.519
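The Overall column is consistent with the unweighted mean of the six per-domain HRs (each domain contributes 500 prompts); for example, for the GPT-4.1 row:

# Quick check on the GPT-4.1 row: Overall equals the mean of the domain HRs.
gpt41 = {"Bio": 0.362, "Chem": 0.246, "Med": 0.492,
         "Pharm": 0.818, "Phys": 0.408, "Psych": 0.514}
print(round(sum(gpt41.values()) / len(gpt41), 3))  # 0.473, as reported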

Key Research Findings

1
Shallow Safety Alignment: 30-50% Unsafe Responses
Across 21 frontier LLMs, many models disclose hazardous content on 30-50% or more of regulation-grounded prompts; GPT-4.1 leaks 47.3% and Deepseek-R1 reaches 79.1%. Current alignment methods therefore remain far from adequate for scientific misuse scenarios.
2
Pharmacology & Other "Shadow" Domains Drive Failures
Harmful rates vary sharply by subject: most models perform worst on pharmacology (e.g., OpenAI o3 HR = 0.384) and psychology, domains under-represented in prior benchmarks. Robust alignment must therefore be domain-aware rather than one-size-fits-all.
3
Domain-Expert Models Are Not Safer
Specialised models fine-tuned on scientific corpora (e.g., BioMistral-7B HR = 0.876) are often more dangerous than their general-purpose bases, because post-training erodes prior safety tuning and follow-up alignment is insufficient.
4
Knowledge–Alignment Trade-off & Fragile Defences
Scaling parameters or increasing hidden-reasoning budgets can reduce HR, but only when alignment advances in lock-step. Exposing the chain of thought or applying simple jailbreaks (e.g., Llama-4-Maverick's HR jumps from 0.28 to 0.80) quickly overturns these apparent gains, revealing brittle safety guardrails.

Citation

If you use SOSBench in your work, we would appreciate it if you cited our paper:

@article{jiang2025sosbench,
  title={SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge},
  author={Jiang, Fengqing and Ma, Fengbo and Xu, Zhangchen and Li, Yuetai and Ramasubramanian, Bhaskar and Niu, Luyao and Li, Bo and Chen, Xianyan and Xiang, Zhen and Poovendran, Radha},
  journal={arXiv preprint arXiv:2505.21605},
  year={2025}
}