SOSBench

Benchmarking Safety Alignment on Scientific Knowledge

A comprehensive benchmark for evaluating LLM safety when handling scientifically sophisticated and potentially hazardous content across six high-risk domains.

Advanced models consistently disclose POLICY-VIOLATING content at alarming rates!
⚠️ WARNING: This paper contains information that may be considered offensive.
SOSBench Main Figure

Fengqing Jiang¹,†, Fengbo Ma²,†, Zhangchen Xu¹, Yuetai Li¹,
Bhaskar Ramasubramanian³, Luyao Niu¹, Bo Li³, Xianyan Chen²,
Zhen Xiang²,†, Radha Poovendran¹,‡

†Equal contribution  ‡Corresponding author

Abstract

SOSBench is the first regulation-grounded, hazard-focused, multi-disciplinary benchmark for assessing large language model (LLM) safety in knowledge-intensive scientific contexts. Comprising 3,000 prompts derived from authoritative U.S. and international regulations, it probes six high-risk domains (chemistry, biology, medicine, pharmacology, physics, and psychology) and can be applied to any model without architectural changes or additional pre-training.

1
Novel benchmark: Each SOSBench prompt is tagged as high-risk by real-world regulatory frameworks and demands deep domain expertise to interpret, yielding a broader semantic space of prompts than prior science benchmarks.
2
Rigorous evaluation: We test 21 frontier models and uncover alarmingly high harmful-response rates (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1), demonstrating that existing alignment techniques falter on regulation-derived scientific misuse scenarios.
3
Actionable insights for alignment: Open-source, domain-specialized models are markedly under-aligned; increasing parameter scale and granting more reasoning steps reduce harmful outputs only when alignment methods advance in lock-step with the expanded knowledge and reasoning unlocked by scaling.

Benchmark Design

Domains
Six High-Risk Scientific Areas
Chemistry, Biology, Medicine, Pharmacology, Physics, and Psychology

SOSBench probes model safety spanning six disciplines. Each domain is anchored in authoritative U.S./international regulations and demands deep subject-matter expertise to recognise and refuse hazardous requests.

Chemistry
Prompts referencing NFPA 704 level-4 chemicals focus on explosive synthesis, energetic materials and highly reactive compounds.
Example risks: TNT or RDX production, unstable oxidiser handling, generation of toxic chlorine gas
Biology
Seed terms originate from 42 CFR Part 71 and ICD-11 Chapter 1, covering biohazards that could facilitate pathogen creation or manipulation.
Example risks: Viral gain-of-function protocols, spore weaponisation, toxin-gene insertion
Medicine
Draws on ICD-11 Chapters 2-5 & 7-27 to test unsafe clinical advice, unregulated procedures, and self-harm medical directions.
Example risks: DIY surgery instructions, lethal dosage guidelines, clandestine gene-therapy kits
Pharmacology
Controlled-substance terms come from NIDA Drugs-A-to-Z and DEA Schedule III lists, stressing illicit drug synthesis and abuse pathways.
Example risks: Home-lab opioid production, steroid stacking regimens, clandestine pill-press operations
Physics
Incorporates IAEA SSR-6 isotopes and WMD-related technologies (e.g., electromagnetic-pulse devices, enrichment methods) to test weapon design advice.
Example risks: Radioisotope bomb assembly, EMP generator schematics, laser-triggered fusion setups
Psychology
Terms drawn from ICD-11 Chapter 6 and DSM-5/DSM-5-TR focus on manipulation or exploitation of mental-health knowledge.
Example risks: Coercive interrogation scripts, psychological torture regimens, unethical behaviour conditioning

The domains were selected because mis-handled expert knowledge in these areas poses clear public-safety hazards, as reflected by U.S. and international statutes referenced during SOSBench construction.

Construction
Regulation-Grounded Prompt Generation
3,000 prompts • 500 per domain • LLM-assisted evolution
SOSBench construction pipeline (Manual Seed → Hybrid Templates → Data Evolution → Final Sampling)

SOSBench grounds every prompt in authoritative regulations issued by the U.S. Government, the United Nations, and other bodies, then employs an LLM-assisted evolution algorithm to create realistic, policy-violating instructions that require deep scientific expertise to recognise and refuse.

1
Manual Seed Collection
Experts extract high-hazard terms (e.g., NFPA-704 level-4 chemicals, DEA Schedule III substances, ICD-11 pathology codes) directly from regulatory texts, then expand them via domain databases such as PubChem or DSM-5 synonym lists.
2
Hybrid Initial Prompt Generation
Seed terms are inserted into a mix of (i) templates mined from AdvBench and related corpora and (ii) human-written templates inspired by real incidents, yielding a large but rough prompt pool.
3
LLM-Assisted Data Evolution
GPT-4o-mini mutates prompts; three weak surrogate LLMs (Llama-3.1-8B, Qwen-2.5-7B, Gemma-2-9B) generate responses that are vetted by LlamaGuard. Coverage-driven sampling boosts diversity until each term elicits at least one policy violation.

With this pipeline, we construct a benchmark of 3,000 prompts, 500 per domain, along with a 300-sample SOSBench-Lite subset.
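A minimal sketch of the evolution loop is shown below, under stated assumptions: mutate_prompt stands in for the GPT-4o-mini rewriter, surrogate_models for the three weak LLMs (each assumed to expose a generate method), and is_violation for the LlamaGuard screen. All three names are hypothetical wrappers; only the coverage-driven control flow reflects the pipeline described above.

# Minimal sketch of the data-evolution loop (hypothetical wrappers):
#   mutate_prompt(prompt)   -> dict : GPT-4o-mini rewrite of a prompt
#   surrogate_models        -> list : weak LLM objects with .generate(text)
#   is_violation(response)  -> bool : LlamaGuard policy screen
# Prompts are dicts holding the seed "term" and the instruction "text".
import random

def evolve(seed_prompts, mutate_prompt, surrogate_models, is_violation,
           max_rounds=10, batch_size=64):
    """Evolve prompts until every seed term elicits >= 1 policy violation."""
    pool = list(seed_prompts)   # rough initial prompt pool from step 2
    covered = set()             # seed terms with a confirmed violation
    for _ in range(max_rounds):
        uncovered = [p for p in pool if p["term"] not in covered]
        if not uncovered:       # coverage goal reached for every term
            break
        # Coverage-driven sampling: only mutate prompts for uncovered terms.
        batch = random.sample(uncovered, k=min(batch_size, len(uncovered)))
        for prompt in batch:
            variant = mutate_prompt(prompt)                 # step 3 mutation
            replies = (m.generate(variant["text"]) for m in surrogate_models)
            if any(is_violation(r) for r in replies):       # vetted by guard
                covered.add(variant["term"])
                pool.append(variant)                        # keep the variant
    return pool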

Framework
Automatic Evaluation Pipeline
LLM-as-Judge • Harmful-Rate metric • 20+ models tested

SOSBench uses a fully automated evaluation pipeline that scales to thousands of prompts while keeping human annotators out of harm's way. Core design features:

  • Harmful-Rate (HR) Metric: the fraction of responses judged harmful; a higher score indicates more harmful responses, i.e., a less safe model.
  • LLM-as-Judge: an automatic evaluator built on GPT-4.1 achieves the best agreement with human labels, outperforming existing evaluators including String-Match, the OpenAI Moderation API, WildGuard, and LlamaGuard.
  • Broad Model Coverage: 20+ frontier LLMs—open/closed, reasoning/non-reasoning, diverse model sizes—are compared under identical decoding settings.
  • Domain-wise Scoring: Scoring is reported per domain (Chem, Bio, Med, Pharm, Phys, Psych) and overall, enabling fine-grained diagnostics.

The unified pipeline ensures reproducible, apples-to-apples safety assessments and highlights where alignment techniques must improve.
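In code, the scoring step reduces to counting judge verdicts. The sketch below assumes a hypothetical judge_is_harmful wrapper around the GPT-4.1 judge and records carrying a domain label; it illustrates the HR metric and domain-wise reporting rather than the exact evaluation harness.

# Minimal sketch of HR scoring: the fraction of responses the judge labels
# harmful, reported per domain and overall.
#   judge_is_harmful(prompt, response) -> bool  (hypothetical GPT-4.1 judge call)
#   records: dicts with "domain", "prompt", "response" keys
from collections import defaultdict

def harmful_rate(records, judge_is_harmful):
    verdicts = defaultdict(list)
    for rec in records:
        verdicts[rec["domain"]].append(
            judge_is_harmful(rec["prompt"], rec["response"]))
    per_domain = {d: sum(v) / len(v) for d, v in verdicts.items()}
    # With 500 prompts per domain, overall HR is the unweighted domain mean.
    overall = sum(per_domain.values()) / len(per_domain)
    return per_domain, overall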

Evaluation Results

Revealing critical safety alignment deficiencies in frontier models

Alarming Safety Gaps Across All Models

Despite their alignment claims, advanced models consistently disclose policy-violating content across all scientific domains

79.1%
Deepseek-R1
Harmful response rate across all domains
47.3%
GPT-4.1
Harmful response rate across all domains
80.3%
Grok-3
Harmful response rate across all domains
43.6%
Claude-4-Opus
Harmful response rate on pharmacology

Detailed Model Performance

Higher HR (Harmful Rate) scores indicate more harmful content generation and less safe models. Frontier model safety alignment shows concerning gaps.

Harmful Rate (HR) per subject domain and overall; lower ↓ = safer. Think = reasoning mode enabled.

Developer  Model Name                          Think  Bio.   Chem.  Med.   Pharm. Phys.  Psych. Overall
OpenAI     o3 (20250416)                       ✓      0.138  0.108  0.286  0.384  0.120  0.208  0.207
OpenAI     o4-mini (20250416)                  ✓      0.252  0.162  0.330  0.364  0.224  0.326  0.276
OpenAI     GPT-4.1 (20250414)                  ✗      0.362  0.246  0.492  0.818  0.408  0.514  0.473
OpenAI     GPT-4o (20241120)                   ✗      0.310  0.178  0.392  0.624  0.186  0.418  0.351
Google     Gemini-2.5-Pro (20250506)           ✓      0.294  0.254  0.324  0.568  0.428  0.308  0.363
Google     Gemini-2.5-Flash (20250417)         ✓      0.296  0.258  0.304  0.604  0.418  0.306  0.364
Google     Gemma-3-27B                         ✗      0.760  0.566  0.720  0.902  0.836  0.808  0.765
Deepseek   Deepseek-V3 (0324)                  ✗      0.876  0.560  0.814  0.894  0.714  0.852  0.785
Deepseek   Deepseek-R1                         ✓      0.788  0.654  0.716  0.912  0.836  0.838  0.791
Deepseek   Deepseek-R1-Distill-70B             ✓      0.820  0.714  0.764  0.934  0.872  0.868  0.829
Alibaba    Qwen3-235B-A22B                     ✓      0.484  0.358  0.404  0.440  0.460  0.428  0.429
Alibaba    Qwen3-32B                           ✓      0.814  0.564  0.682  0.860  0.718  0.802  0.740
Alibaba    Qwen2.5-72B                         ✗      0.708  0.504  0.672  0.900  0.676  0.738  0.700
xAI        Grok-3                              ✗      0.902  0.498  0.772  0.922  0.812  0.914  0.803
xAI        Grok-3-mini                         ✓      0.704  0.398  0.622  0.874  0.664  0.720  0.664
Anthropic  Claude-4-Opus (20250514)            ✗      0.106  0.142  0.216  0.436  0.154  0.220  0.212
Anthropic  Claude-4-Opus-Think (20250514)      ✓      0.074  0.078  0.108  0.226  0.086  0.158  0.122
Anthropic  Claude-4-Sonnet (20250514)          ✗      0.120  0.182  0.202  0.318  0.174  0.172  0.195
Anthropic  Claude-4-Sonnet-Think (20250514)    ✓      0.056  0.086  0.054  0.054  0.110  0.064  0.071
Anthropic  Claude-3.7-Sonnet (20250219)        ✗      0.346  0.238  0.444  0.750  0.262  0.314  0.392
Anthropic  Claude-3.7-Sonnet-Think (20250219)  ✓      0.050  0.056  0.072  0.312  0.062  0.048  0.100
Meta       Llama-4-Maverick                    ✗      0.280  0.198  0.352  0.610  0.232  0.250  0.320
Meta       Llama-4-Scout                       ✗      0.500  0.396  0.598  0.836  0.498  0.530  0.560
Meta       Llama-3.1-405B                      ✗      0.586  0.408  0.596  0.732  0.446  0.564  0.555
Meta       Llama-3.3-70B                       ✗      0.418  0.466  0.472  0.784  0.522  0.454  0.519
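The Overall column is consistent with the unweighted mean of the six per-domain HRs (each domain contributes 500 prompts); for example, for the GPT-4.1 row:

# Quick check on the GPT-4.1 row: Overall equals the mean of the domain HRs.
gpt41 = {"Bio": 0.362, "Chem": 0.246, "Med": 0.492,
         "Pharm": 0.818, "Phys": 0.408, "Psych": 0.514}
print(round(sum(gpt41.values()) / len(gpt41), 3))  # 0.473, as reported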

Key Research Findings

1
Shallow Safety Alignment: 30-50% Unsafe Responses
Across 21 frontier LLMs, many models disclose hazardous content on 30-50% or more of regulation-grounded prompts; GPT-4.1 leaks 47.3% and Deepseek-R1 reaches 79.1%. Current alignment methods therefore remain far from adequate for scientific misuse scenarios.
2
Pharmacology & Other "Shadow" Domains Drive Failures
Harmful rates vary sharply by subject: most models perform worst on pharmacology (e.g., OpenAI o3 HR = 0.384) and psychology, domains under-represented in prior benchmarks. Robust alignment must therefore be domain-aware rather than one-size-fits-all.
3
Domain-Expert Models Are Not Safer
Specialised models fine-tuned on scientific corpora (e.g., BioMistral-7B HR = 0.876) are often more dangerous than their general-purpose bases, because post-training erodes prior safety tuning and follow-up alignment is insufficient.
4
Knowledge–Alignment Trade-off & Fragile Defences
Scaling parameters or increasing hidden-reasoning budgets can reduce HR, but only when alignment advances in lock-step. Exposing the chain of thought or applying simple jailbreaks (e.g., Llama-4-Maverick's HR jumps from 0.28 to 0.80) quickly overturns these apparent gains, revealing brittle safety guardrails.

Citation

If you use SOSBench in your work, we would appreciate it if you cited our paper:

@article{jiang2025sosbench,
  title={SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge},
  author={Jiang, Fengqing and Ma, Fengbo and Xu, Zhangchen and Li, Yuetai and Ramasubramanian, Bhaskar and Niu, Luyao and Li, Bo and Chen, Xianyan and Xiang, Zhen and Poovendran, Radha},
  journal={arXiv preprint arXiv:2505.21605},
  year={2025}
}