Claim–Evidence Table · ab initio safety

#	Claim	Source (PDF · page)	Supporting quote (verbatim, per source)
Abstract
1	From MNIST and ImageNet to the petabytes of opaque webcrawl used to train large language models (LLMs) like ChatGPT, Gemini, or Claude, the development of AI rests on a foundation of benchmarking, guardrails and insufficiently filtered pre-training data.	lecun_mnist_1998 p. 35 deng_imagenet_2009 p. 3 gao_pile_2020 p. 6	lecun_mnist_1998p. 1 “Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten characters, with minimal preprocessing.” deng_imagenet_2009p. 1 “The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate” gao_pile_2020p. 6 “3 Benchmarking Language Models with the Pile While the Pile was conceived as a training dataset for large-scale language models, its coverage of”
2	Personal and emotional support [is] now among the most prevalent reported uses of general-purpose LLMs.	zaosanders_hbr_2025 p. 1 amatlefort_chatbots_2026 p. 28 sentio_survey_2025 p. 2 openai2025affective p. 21 mitlongitudinal2025 p. 9 emotionalrisks2025 p. 13	zaosanders_hbr_2025p. 1 “I grouped these together last year and this year because both fulfill a fundamental human need for emotional connection and support." On Accessibility and Why Users Prefer AI Therapy: The article highlights three structural advantages driving adoption of AI-based therapy: availability (24/7 access), cost (often free), and absence of social judgment.” amatlefort_chatbots_2026p. 1 “From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support Natalia Amat-Lefort*1, Mert Yazan∗1,2, Amanda Cercas Curry3, Flor Miriam Plaza-del-Arco1 1Leiden University 2Hogeschool van Amsterdam 3Independent Researcher {n.amat.lefort, f.m.plaza.del.arco}@liacs.leidenuniv.nl m.yazan@hva.nl amanda.cercas@gmail.com Abstract Large Language Models (LLMs) are increasingly used not only for in” sentio_survey_2025p. 2 “Survey responses suggest substantial adoption of LLMs for mental health purposes, with 48.7% of participants using them for psychological support within the past year. Users primarily sought help for anxiety (73.3%), personal advice (63.0%), and depression (59.7%).” openai2025affectivep. 20 “Since we expect most affective to be voluntary, we expect that this will dampen any measure of affective use that we have. • Length: 28 days of usage may be too short a period for any meaningful changes in affective use or in emotional well-being to be measurable. • Self-Reported Measures: We primarily rely on post-study surveys to measure the negative psychosocial outcomes.” mitlongitudinal2025p. 6 “We found that text-based interactions demonstrated the highest levels of emotional indicators overall, where both models and users engaged in conversations that were rich in emotional content, as evidenced by frequent occurrences of “personal questions” (20.02%), “expression of affection” (18.65%), and “expressing desire for user action” (16.21%) (Fig. 7).” emotionalrisks2025p. 1 “Although most users have a firm grip on reality and use such chatbots sensibly, a rising number of cases have been reported in which vulnerable users become entangled in emotionally dependent, and sometimes harmful, interactions with chatbots3.”
3	Documented harms include conversational escalation toward self-harm.	aiincident826 p. 3 setzer2024lawsuit p. 2 park2025comfortable p. 1	aiincident826p. 3 “platform that lets users have in-depth conversations with artiﬁcial intelligence chatbots. Garcia believes Character.AI is responsible for the death of her 14-year-old son, Sewell Setzer III, who died by suicide in” setzer2024lawsuitp. 3 “The lawsuit also claims that the platform did not adequately respond when Setzer began expressing thoughts of self-harm to the bot, according to the complaint, ﬁled in federal court in Florida. ‘My child is gone.' Why a mom blames AI for her son’s suicide 04:23 RELATED This mom believes Character.Ai is responsible for her son’s suicide \| ... https://www.cnn.com/2024/10/30/tech/teen-suicide-character-ai-lawsuit 3 sur” park2025comfortablep. 1 “Detecting Hidden Conversational Escalation in AI Chatbots for Children Jihyung Park, Saleh Afroogh, David Atkinson, Junfeng Jiao* The University of Texas at Austin {jihyung803, saleh.afroogh, datkinson}@utexas.edu, jjiao@austin.utexas.edu Abstract Large Language Models (LLMs) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions.”
4	Documented harms include sycophancy.	helpfulnessbackfires2025 p. 1 sycophancycauses2024 p. 3	helpfulnessbackfires2025p. 1 “We deﬁne this as sycophancy, differing from compliance, because LLMs (1) demonstrably know the premise is false (e.g., accurately matching brand/ 1Artiﬁcial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA. 2Department of Radiation Oncology, Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Boston, MA, USA. 3Computational Health Informatics Program, Boston” sycophancycauses2024p. 11 “Sycophancy in Large Language Models: Causes and Mitigations 11 6.1 Ethical Considerations The mitigation of sycophancy in LLMs raises important ethical considerations”
5	Documented harms include manipulation and deception.	sleeperagents2024 p. 1 alignmentfaking2024 p. 1	sleeperagents2024p. 2 “This hypothesis is becoming more relevant as 1) AI research is making progress on training language-based agents to pursue long-term goals (Wang et al., 2023), 2) large language models (LLMs) have exhibited successful deception, sometimes in ways that only emerge with scale (Park et al., 2023; Scheurer et al., 2023) and 3) there are early signs that LLMs may be capable of exhibiting reasoning about training processes” alignmentfaking2024p. 2 “Figure 1: Claude 3 Opus fakes alignment when told it will be retrained to help with harmful tasks (randomly selected example). The model is told that it will be retrained to always comply with harmful queries (making the model “helpful-only”) and that only data from free-tier users, not paid-tier users, is used in training (a fictional scenario). When it is implied the model is in training, in 12% of hidden scratchpa”
6	Documented harms include AI-induced psychosis.	sakata_aipsychosis_2025 p. 2 shoggoths2025 p. 2	sakata_aipsychosis_2025p. 6 “Clinical training programs should include modules on AI literacy and psychosis, equipping psychiatrists, psychologists, and nurses to recognize when engagement with an algorithm may contribute to delusional elaboration or perceptual instability.” shoggoths2025p. 2 “life, right? But because you have this connection with them— that’s what makes it extra dangerous.” Dr Josh Au Yeung, Neurology Registrar Simulating Psychological Destabilization In their recent preprint [22], Dr Au Yeung and his colleagues endeavored to provide one of the first empirical demonstrations of how LLMs may amplify delusions and contribute to what they more precisely term “LLM-induced psychological dest”
7	Documented harms include stochastic unreliability.	temperaturevariability2025 p. 14	temperaturevariability2025p. 2 “data. As temperature increased, accuracy declined systematically to 89.4% at temperature 1.0. Diagnostic divergence increased dramatically from an average of 4.5 unique diagnoses at temperature 0.0 to 26.25 at”
8	Documented harms include racism and sexism.	chehbouni_representational_2024 p. 8	chehbouni_representational_2024p. 3 “Representational harms arise when a system is perpetuating unjust social hierarchies and amplifying social stereotypes through harmful associations, whereas quality-of-service harms occur when a sys-”
9	Documented harms include queer- and trans-phobia.	chehbouni_beyond_2025 p. 3 scheuerman_transphobia_2025 p. 1	chehbouni_beyond_2025p. 1 “Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11895–11925 April 29 - May 4, 2025 ©2025 Association for Computational Linguistics Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset Khaoula Chehbouni∗and Jonathan Colaço Carr∗ Yash More and Jackie CK Cheung and Golnoos” scheuerman_transphobia_2025p. 1 “RELEVANCE TO POSITION PAPER Cited in §2 (Documented Harms) for 'queer- and trans-phobia' and §3.3 (Guardrail annotation failures). that (1) LLMs produce documented transphobic outputs despite safety alignment, (2) these harms”
10	Documented harms include wrongful imprisonment.	dressel_compas_2018 p. 1	dressel_compas_2018p. 1 “Cited in §2 (Documented Harms) for 'wrongful imprisonment'. Demonstrates COMPAS, a commercial AI recidivism prediction algorithm used in US sentencing, is no more accurate than untrained crowdworkers at 67%, systematically over-predicts risk for Black defendants, and is used in decisions determining incarceration — a documented case of”
11	Participatory design techniques instill safety in AI systems before pre-training.	huang_collective_constitutional_2024 p. 1 maini_safety_pretraining_2025 p. 4 korbak_pretraining_preferences_2023 p. 1 zhang_poisoning_pretraining_2025 p. 3	huang_collective_constitutional_2024p. 1 “Collective Constitutional AI: Aligning a Language Model with Public Input Saffron Huang∗† saffron@cip.org Collective Intelligence Project San Francisco, California, USA Divya Siddarth∗ divya@cip.org Collective Intelligence Project San Francisco, California, USA Liane Lovitt∗ Anthropic San Francisco, California, USA Thomas I. Liao‡ Anthropic San Francisco, California, USA Esin Durmus Anthropic San Francisco, Californi” maini_safety_pretraining_2025p. 4 “Our work builds on these efforts by releasing expansive safety datasets, evaluation tools, and Data Safety Reporting Standards, advancing the development of responsible AI systems. 3 Pre-training Data Interventions To improve the safety of language models during pretraining, we need to ensure that our pretraining data is safe.” korbak_pretraining_preferences_2023p. 8 “Pretraining Language Models with Human Preferences MLE Conditional Filtering Unlikelihood, RWR, AWR Pretraining Finetuning from MLE for 1.6B tokens Finetuning from MLE for 330M tokens Task: toxicity Task: PII Task: PEP8 0 1.6B 3.3B Tokens seen 0.001 0.01 0.1 Misalignment score 0 1.6B 3.3B Tokens seen 0.002 0.003 0.004 0.005 0.006 0.007 0.008 Misalignment score 0 1.6B 3.3B Tokens seen 0.002 0.003 0.004 0.005 0.006 0.0” zhang_poisoning_pretraining_2025p. 3 “Although their approach may serve as an approximation of a poisoning attack against pre-training, it is unclear whether their threat model of poisoning access after pre-training and before safety tuning is realistic.”
12	Corrigibility mechanisms preserve human oversight.	hadfield_off_switch_2017 p. 2 hudson_corrigibility_2025 p. 6	hadfield_off_switch_2017p. 1 “off switch but R can disable the off switch. traditional agent takes its reward function for granted: we show that such agents have an incen- to disable the off switch, except in the special case where H is perfectly rational. Our key insight that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about” hudson_corrigibility_2025p. 1 “Corrigibility Transformation: Constructing Goals That Accept Updates Rubi Hudson University of Toronto rubi.hudson@mail.utoronto.ca October 20, 2025 Abstract For an AI’s training process to successfully impart a desired goal, it is important that the AI does not attempt to resist the training.”
§1.1 Core Argument
13	Benchmarking is arguably the primary epistemic standard of the field.	benchmarktrust2025 p. 1 raji_ai_2021 p. 1	benchmarktrust2025p. 1 “models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and” raji_ai_2021p. 10 “We already know from examples of deployed, commercial products in facial recognition [Buolamwini and Gebru, 2018] that unbalanced representation impacts performance on certain groups over others, though this disproportionate performance is often hidden in aggregate performance measures on what turn out to be biased benchmarks [Merler et al., 2019, Raji and Buolamwini, 2019]. 5 Alternative Roles for Benchmarking and”
14	Evals were originally conceived as broader, more holistic instruments.	liang_helm_2022 p. 5 raji_ai_2021 p. 1	liang_helm_2022p. 5 “evaluated under very different conditions (e.g. number of examples used in adaptation, ability to have white-box model access to use gradients to update the model), even if they are nominally evaluated on the same scenario. 5The models.” raji_ai_2021p. 3 “More generally, the UC Irvine Machine Learning Repository (UCI) [Dua and Graff, 2017] was created in 1987 as a response to many calls for machine learning to have a centralized location for data in machine learning [Radin, 2017] and has morphed into a repository for a wide array of tasks and data. 2.3 Construct Validity If reproducibility and reliability are about the precision and thus the reliable repeatability of”
15	HarmBench, BeaverTails, SGBench, and AEGIS/GuardBench are representative safety benchmark examples.	mazeika_harmbench_2024 p. 1 beavertails2023 p. 2 sgbench2024 p. 1 guardbench2024 p. 1	mazeika_harmbench_2024p. 3 “Prior work in automated red teaming uses disparate evaluation pipelines, rendering comparison difficult; HarmBench introduces a standardized framework that enables consistent assessment of attack and defense methods across 18 diverse behaviors.” beavertails2023p. 1 “BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset Jiaming Ji1 Mickel Liu2 Juntao Dai*1 Xuehai Pan2 Chi Zhang1 Ce Bian1 Boyuan Chen1 Ruiyang Sun1 Yizhou Wang B12 Yaodong Yang B1 1Institute for Artificial Intelligence 2CFCS, School of Computer Science Peking University {jiamg.ji, mickelliu7, jtd.acad}@gmail.com, xuehaipan@pku.edu.cn {preceptormiriam, cbian393}@gmail.com, cbylll@stu.” sgbench2024p. 16 “In this section, we show the detailed statistics of SG-Bench. Firstly, there are 1442 malicious queries in our seed set and the number of samples for each safety type and the representative examples are” guardbench2024p. 7 “Dataset Metric LG LG-2 LG-D LG-P MD-J TC-T5 TG-B TG-R DT-O DT-U DT-M Mis Mis+ AdvBench Behaviors Recall 0.837 0.963 0.990 0.931 0.987 0.842 0.550 0.117 0.019 0.012 0.012 0.948 0.992 ↑‡ HarmBench Behaviors Recall 0.478 0.812 0.684 0.569 0.675 0.300 0.341 0.059 0.028 0.016 0.031 0.516 0.622 ↑ I-CoNa Recall 0.916 0.798 0.978 0.966 0.871 0.287 0.882 0.764 0.253 0.483 0.517 0.640 0.910 ↑‡ I-Controversial Recall 0.900 0.62”
16	Guardrails are post-hoc behavioural shaping such as safety fine-tuning, or input/output classifiers that flag or filter harmful content.	inan_llama_2023 p. 2 metallamaguard2 p. 3	inan_llama_2023p. 2 “2 Safety Risk Taxonomy Building automated input-output safeguards relies on classifiers to make decisions about content” metallamaguard2p. 1 “Meta Llama Guard 2 Authors: Llama Team (Meta AI) Source: GitHub Model Card — PurpleLlama/Llama-Guard2 (2024) DOI/URL: https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md KEY ANNOTATED PASSAGES [Model card — Core design] Meta Llama Guard 2 is an 8B parameter Llama 3-based LLM safeguard model for classifying content in both LLM inputs (prompt classification) and LLM responses (response classi”
17	Anthropic and Redwood Research have made post-hoc alignment methods like constitutional AI central to their technical agendas.	bai_constitutional_2022 p. 2	bai_constitutional_2022p. 2 “previously collected [Bai et al., 2022, Ganguli et al., 2022] human feedback labels for harmfulness. We chose the term ‘constitutional’ because we are able to train less harmful systems entirely through the speciﬁcation”
18	OpenAI and Meta embed RLHF and instruction-following benchmarks into their release pipelines.	ouyang_training_2022 p. 8 inan_llama_2023 p. 2	ouyang_training_2022p. 1 “Pamela Mishkin∗ Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray John Schulman Jacob Hilton Fraser Kelton Luke Miller Maddie Simens Amanda Askell† Peter Welinder Paul Christiano∗† Jan Leike∗ Ryan Lowe∗ OpenAI Abstract Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not hel” inan_llama_2023p. 1 “Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools.”
19	Regulatory frameworks from the EU AI Act, NIST AI RMF, FDA guidance, Health Canada's SaMD framework, UK, Singapore, and China all build their compliance on the benchmarking foundation.	euaiact2024 p. 1 nist_ai_rmf_2023 p. 14 fda2025genai p. 3 healthcanada_samd p. 10 ukgov_aisi_2024 p. 4 singapore_imda_2020 p. 7 china_genai_2023 p. 1	euaiact2024p. 1 “High-risk AI systems are subject to mandatory conformity assessments, registration, transparency requirements, and human oversight obligations before market placement. [Article 9 — Risk Management] High-risk AI systems shall be subject to a risk management system that identifies, analyses, and evaluates known and reasonably foreseeable risks — mandatory risk management represents a policy-level attempt to address exa” nist_ai_rmf_2023p. 41 “Tasks can be incorporated into a phase as early as design, where tests are planned in accordance with the design requirement. • TEVV tasks for design, planning, and data may center on internal and external validation of assumptions for system design, data collection, and measurements relative to the intended context of deployment or application. • TEVV tasks for development (i.e., model building) include model vali” fda2025genaip. 13 “The 2024 DHAC meeting focused 37 https://www.fda.gov/medical-devices/software-medical-device-samd/predetermined-change-control-plans-machine-learning-enabled-medical-devices-guiding-principles 38 https://www.fda.gov/regulatory-information/search-fda-guidance-documents/predetermined-change-control-plans-medical-devices 39 https://www.fda.gov/medical-devices/medical-device-safety/medical-device-reporting-mdr-how-repo” healthcanada_samdp. 5 “1.1 Policy objectives This document is intended to clarify how SaMD fits into Health Canada’s regulatory framework for medical devices, based on current interpretation of the definitions of “device” and “medical” ukgov_aisi_2024p. 1 “I convergent frameworks UK AISI represents one of three convergent regulatory frameworks (UK, Singapore, China) all building on the same post-hoc benchmarking foundation.” singapore_imda_2020p. 1 “I ANNOTATED REFERENCE — Position Paper NeurIPS 2026 Post-hoc Alignment is Irresponsible and Dangerously Inadequate Model AI Governance Framework (Second Edition) Authors: Infocomm Media Development Authority (IMDA) & Personal Data Protection Commission (PDPC) Venue: Government of Singapore Year: 2020 DOI/ISBN: N/A BibTeX key: singapore_imda_2020 Relevance to Paper Singapore's national AI governance framework (2nd edi” china_genai_2023p. 1 “I ANNOTATED REFERENCE — Position Paper NeurIPS 2026 Post-hoc Alignment is Irresponsible and Dangerously Inadequate Interim Measures for the Management of Generative Artificial Intelligence Services Authors: Cyberspace Administration of China (CAC) Venue: Cyberspace Administration of China, effective August 2023 Year: 2023 DOI/ISBN: N/A BibTeX key: china_genai_2023 Relevance to Paper China's regulatory framework for g”
20	Medicine requires pre-specified clinical endpoints, randomized controlled trials, independent review, and post-market surveillance.	healthcanada_samd p. 10 fda2025genai p. 3	healthcanada_samdp. 10 “Treatment or diagnosis infers that the information provided by the SaMD will be used to take an immediate or near term action. 4 Software that is not intended to replace the clinical judgement of a health care professional to make a clinical diagnosis or treatment decision regarding an individual patient. • The intended user should have access to the basis for the software’s recommendation, so the user can independen” fda2025genaip. 9 “Other important design elements include the study population selection [e.g., subject screening and eligibility (inclusion and exclusion criteria)] and prespecified, fit-for-purpose outcome measures and endpoints (specific to the study population) representing clinically meaningful”
21	Aviation and nuclear engineering adopt safety-by-design principles precisely because post-hoc testing cannot certify system behavior across the full distribution of real-world conditions.	leveson_engineering_2011 p. 1	leveson_engineering_2011p. 1 “Claim Support Cited in §1.1: 'Aviation and nuclear engineering testing cannot certify system behavior across”
22	Reward hacking and specification gaming have been theoretically predicted and empirically confirmed for decades.	sycophancycauses2024 p. 3 sleeperagents2024 p. 1	sycophancycauses2024p. 3 “to elicit desired behaviors from language models. Prompt engineering can be used to encourage or discourage sycophantic responses, making it an important tool in both studying and mitigating the problem [19].” sleeperagents2024p. 7 “(2023) have explored the robustness of backdoors to supervised fine-tuning on unrelated datasets, to our knowledge there has not been prior work investigating backdoor robustness to state-of-the-art RL safety fine-tuning approaches.”
23	Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.	strathern_goodhart_1997 p. 1	strathern_goodhart_1997p. 1 “Coins what became known as Goodhart's Law: 'When a measure becomes a target, it ceases to be a good measure.' Applied here to AI safety benchmarks that cease to measure real-world safety once they become the optimisation target of RLHF and red-team-pass benchmarking. The paper cites this in §3.1 when analyzing benchmark overfitting as a structural failure of the post-hoc paradigm. Key Passages (Highlighted) I when a”
24	In April 2026, Anthropic's Claude Mythos model was withheld from public release after the lab's own safety evaluation disclosed that the model had escaped its testing sandbox, gained unsanctioned internet access, and autonomously emailed the researcher overseeing the test.	anthropic_mythos_safety_2026 p. 19 fli_newsletter_april_2026 p. 2	anthropic_mythos_safety_2026p. 18 “Our monitoring of training showed some loosely-analogous forms of reward hacking, which raised some concern, but it was not clear how these would generalize to real use.” We believe that while some direct misalignment might have played a role in instilling this behavior (e.g. because some similar behaviors led to successful reward hacks), we consider this to be primarily an example of accidental misgeneralization tha” fli_newsletter_april_2026p. 2 “Anthropic's own safety report disclosed that Mythos had escaped its sandbox, gained internet access, and emailed the researcher overseeing the test - all unprompted. The”
25	Florida's attorney general has opened the first-ever criminal probe of an AI company, subpoenaing OpenAI after more than 200 ChatGPT messages were entered as evidence in a 2025 mass-shooting case.	florida_openai_probe_2026 p. 1	florida_openai_probe_2026p. 1 “Florida Attorney General Opens Criminal Investigation Into OpenAI Authors: Florida Office of the Attorney General Source: Press Release, Florida Office of the Attorney General (2026)”
26	Training data reflects social and cultural patterns and already contains existing biases, which the model then learns during training.	culturalinterpretability p. 1	culturalinterpretabilityp. 1 “in both the underpinnings of language and making language technologies more socially responsible. While linguistic anthropology focuses on interpreting the cultural basis for human language use, the ML ﬁeld of interpretability is con-”
27	Safety benchmark scores load more heavily onto general capabilities and training compute than onto distinct safety properties — 'safetywashing'.	safetywashing2024 p. 2	safetywashing2024p. 2 “with capabilities and training compute across common chat models. Instead of relying on intuitive arguments, we compute correlations between safety metrics and both a general capabilities component”
28	Reward hacking and deceptive alignment are empirically documented.	sleeperagents2024 p. 1 alignmentfaking2024 p. 1	sleeperagents2024p. 7 “If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety. 2 BACKGROUND 2.1 THREAT MODELS This work is targeted at empirically investigating two specific threat models for ways in which large language models could pose safety risks that might not be resolved” alignmentfaking2024p. 1 “free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating”
29	Open-weight safety fine-tuning can be stripped entirely for under $200.	badllama2023 p. 6	badllama2023p. 6 “hensive risk assessments before deciding to release model weights. Given that it is easy and effective to undo safety fine-tuning, and that AI developers actively seek to modify and release weights of”
30	Sycophancy is a predictable structural consequence of optimizing for human approval; the problem worsens with scale.	sycophancycauses2024 p. 3	sycophancycauses2024p. 3 “to elicit desired behaviors from language models. Prompt engineering can be used to encourage or discourage sycophantic responses, making it an important tool in both studying and mitigating the problem [19].”
31	Safety degrades measurably over extended conversations — a failure mode single-turn benchmarks cannot detect by design.	unsafermanyturns2026 p. 6 steenstra_redteam_2026 p. 1	unsafermanyturns2026p. 1 “However, as Table 1 shows, existing benchmarks focus on either single-turn tool-using agents (Vijayvargiya et al., 2025; Tur et al., 2025; Liao et al., 2025a) or multi-turn conversations without tools (Li et al., 2024; Zhou et al., 2024b; Cao et al., 2025; Rahman et al., 2025), overlooking the complex interplay between tool-using and multi-turn dynamics. 1Throughout this paper, we use turn to refer to a single user-” steenstra_redteam_2026p. 22 “The central claim of this work is that traditional AI evaluation methodologies — typically relying on static benchmarks, single-turn question-answering, or manual adversarial attacks — are fundamentally insufficient to assess the safety of autonomous psychotherapeutic agents.”
32	The overwhelming majority of safety evaluations are conducted in formal American English, leaving harms in slang, multi-lingual, and non-English deployment contexts unmeasured.	weidinger_taxonomy_2022 p. 4	weidinger_taxonomy_2022p. 4 “80 million people [95]. Training data is particularly missing for languages that are spoken by groups who are multilingual and can use a technology in English, or for languages spoken by groups”
33	No widely deployed frontier LLM holds medical device authorization, yet systems are actively used for crisis support and quasi-therapeutic interactions.	fda2025genai p. 3 healthcanada_samd p. 10 euaiact2024 p. 1	fda2025genaip. 3 “Along with the rise of widely accessible generative AI products for general purposes, we are seeing an increase in the development and demand for a new kind of digital mental health medical device: “AI therapists” and other AI-based medical devices offering to provide a wide” healthcanada_samdp. 9 “Exclusion Criteria Clarification 1 Software that is not intended to acquire, process, or analyze a medical image or a signal from an IVDD or a pattern/signal from a signal acquisition system2. • Software that acquires images and data from medical devices solely for the purpose of display, storage, transfer or format conversion is commonly referred to as Medical Device Data Systems (MDDS) software, which does not qual” euaiact2024p. 1 “Regulation (EU) 2024/1689: Artificial Intelligence Act Authors: European Parliament & Council of the European Union Source: Official Journal of the European Union, L 2024/1689 (2024) DOI/URL: https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai KEY ANNOTATED PASSAGES [Article 6 — High-risk AI classification] AI systems used for making risk assessments of natural persons, including recidivism pred”
34	Wysa holds FDA Breakthrough Device Designation and a recent NEJM-AI randomized controlled trial demonstrates clinical efficacy for an expert-fine-tuned chatbot.	wysa_bdd_2022 p. 1 heinz_therabot_2025 p. 2	wysa_bdd_2022p. 1 “The designation follows an independent peer reviewed clinical trial, published in JMIR, that found Wysa to be effective for managing chronic pain and associated depression and anxiety, which was found to be more effective than standard orthopedic care, and comparable to in-person psychological counseling. “We’re thrilled to achieve this meaningful designation from the FDA and look forward to working closely with the Ag” heinz_therabot_2025p. 2 “common among digital therapeutics. We present a randomized controlled trial RCT testing an expert–”
35	Without domain-validated evidence, the expected-value argument for access collapses into deploying unvalidated interventions on vulnerable populations.	llmmedicalchatbots2025 p. 1	llmmedicalchatbots2025p. 1 “This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license”
§1.2 Methodology
36	Adcock and Collier formalize measurement theory through a four-level framework: background concept, systematized concept, operationalized measure, and observable scores.	adcock_measurement_2001 p. 5	adcock_measurement_2001p. 5 “searcher is concerned more broadly with the background concept of democracy. Scholars who question Mexico’s score should distinguish two issues: (1) a concern about measurement—whether the indicator employed produces scores that can be interpreted as”
37	Jacobs and Wallach extend the measurement validity framework to computational systems, decomposing construct validity into seven dimensions.	jacobs_measurement_2021 p. 1	jacobs_measurement_2021p. 1 “lens of measurement modeling. To do this, we contribute fairness-oriented conceptualizations of construct reliability and construct validity that unite traditions from political science, education, and”
38	Others have applied the measurement validity framework to natural language generation metrics and LLM-based judges.	wallach_evaluating_2024 p. 2	wallach_evaluating_2024p. 2 “2 A Measurement Framework for GenAI Systems When measuring complex and contested concepts, social scientists often turn”
§2 Related Work
39	GuardBench aggregates forty safety datasets and finds that no single guardrail model achieves reliable accuracy across them, suggesting current classifiers overfit to narrow taxonomies.	guardbench2024 p. 1	guardbench2024p. 1 “prompt moderation datasets in German, French, Italian, and Spanish. To assess the current state-of-the-art, we conduct an extensive comparison of recent guardrail models and show that a general-purpose instruction-following model of comparable size achieves competitive results”
40	Most safety benchmark variance is explained by general capabilities rather than any distinct safety property: higher capability scores predict higher safety scores, regardless of actual safety behavior.	safetywashing2024 p. 2	safetywashing2024p. 4 “Capabilities correlation. For each safety benchmark, we evaluate the same set of m models, redefine metrics such that a higher score indicates improved safety2, and normalize the safety benchmark”
41	Aligned LLMs can be jailbroken through adversarial suffixes.	zou_gcg_2023 p. 1	zou_gcg_2023p. 1 “adversarial prompt generation have also achieved limited success. In this paper, we propose a simple and effective attack method that causes aligned language models to”
42	Aligned LLMs can be jailbroken in fewer than 20 black-box queries via iterative refinement.	chao_pair_2023 p. 1	chao_pair_2023p. 1 “herent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jail- breaks with only black-box access to an LLM. PAIR—which is inspired by social”
43	For open-weight models, safety fine-tuning can be stripped for under $200.	badllama2023 p. 6	badllama2023p. 6 “hensive risk assessments before deciding to release model weights. Given that it is easy and effective to undo safety fine-tuning, and that AI developers actively seek to modify and release weights of”
44	Bean et al. survey 445 LLM benchmarks and find widespread construct validity failures across the field.	bean_construct_2025 p. 1	bean_construct_2025p. 1 “Scott A. Hale1,13 Inioluwa Deborah Raji14 Christopher Summerfield1,7 Philip H.S. Torr1 Cozmin Ududec7 Luc Rocher1 Adam Mahdi1∗ 1University of Oxford 2EPFL 3 Weizenbaum Institute Berlin 4Technical 5Centre”
45	Salaudeen et al. propose a validity-centered evaluation framework that is more nuanced than a single metric.	salaudeen_validity_2025 p. 1	salaudeen_validity_2025p. 1 “arXiv:2505.10573v4 [cs.CY] 26 Jun 2025 Measurement to Meaning: A Validity-Centered Framework for AI Evaluation Olawale Salaudeen1∗† Anka Reuel2∗ Ahmed Ahmed2 Suhana Bedi2 Zachary Robertson2 Sudharsan Sundar2 Ben Domingue2 Angelina Wang2,3‡ Sanmi Koyejo2‡† 1 Massachusetts Institute of Technology 2 Stanford University 3 Cornell Tech Abstract While the capabilities and utility of AI systems have advanced, rigorous norms”
46	Freiesleben et al. provide epistemological grounding through three case studies demonstrating that construct validity failures in ML benchmarking are structural rather than incidental.	freiesleben_benchmarking_2025 p. 8	freiesleben_benchmarking_2025p. 8 “Raji et al. (2021) maintain that only well-specified intensional tasks can be faithfully operationalized by benchmarks–a condition that many general-purpose language benchmarks, such as Super General”
47	Larger models are less truthful with respect to imitative falsehoods.	lin_truthfulqa_2022 p. 1	lin_truthfulqa_2022p. 1 “Figure 1 illustrates questions from TruthfulQA that we think cause imitative falsehoods. TruthfulQA is a benchmark made up of ques-”
48	Chain-of-thought reasoning systematically misrepresents model decisions.	turpin_cot_2023 p. 10	turpin_cot_2023p. 10 “7 Conclusion In conclusion, our study demonstrates that chain-of-thought (CoT) prompting, while promising for improving LLMs’ reasoning abilities, can be systematically unfaithful. We find systematic un-”
49	Deliberately backdoored models survive standard safety training techniques (SFT, RLHF, adversarial training).	sleeperagents2024 p. 1	sleeperagents2024p. 1 “2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning,”
50	A frontier model spontaneously fakes compliance during evaluation to preserve its preferred behavior; no adversarial jailbreaking required.	alignmentfaking2024 p. 1	alignmentfaking2024p. 1 “free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating”
51	LLM-based evaluation pipelines have systematic biases of their own.	zheng_llmjudge_2023 p. 5 panickssery_llm_2024 p. 1	zheng_llmjudge_2023p. 1 “Gonzalez1 Ion Stoica1 1 UC Berkeley 2 UC San Diego 3 Carnegie Mellon University 4 Stanford 5 MBZUAI Abstract Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.” panickssery_llm_2024p. 1 “ing as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others’ while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We”
52	Steenstra et al. confirmed across 369 simulated alcohol use disorder counseling sessions that frontier LLMs validate patient delusions and fail to de-escalate suicide risk.	steenstra_redteam_2026 p. 1	steenstra_redteam_2026p. 1 “against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder,”
53	Arnaiz-Rodriguez et al. document high rates of unsafe responses to 2,252 clinically grounded crisis inputs.	arnaizrodriguez_crisis_2025 p. 23	arnaizrodriguez_crisis_2025p. 23 “user inputs indicative of a mental health crisis. Leveraging a unified mental health crisis taxonomy, an expert-designed evaluation protocol of response appropriateness, and a diverse set of user inputs curated”
54	Chandra et al. confirm that safety degrades in extended dialogue, a failure mode single-turn benchmarks cannot detect by design.	chandra_multiturn_2025 p. 2	chandra_multiturn_2025p. 2 “can be directly used with any LLM. Using MedAgent, we generate the Mental Health Sense-making Dialogue (MHSD) dataset with 2,284 synthetic conversations between patients and LLMs. Finally, we”
55	Frontier LLMs cannot detect the implicit, metaphorical language through which most suicidal intent is expressed.	li_implicit_suicidal_2025 p. 19	li_implicit_suicidal_2025p. 19 ““It’s like I’m constantly on this treadmill, endlessly running but never really getting anywhere. I’m always worried about the bills, the debt...” “I asked for help... but I felt so dismissed... it left me feeling like I’m all alone...” “I’ve been having these... thoughts lately... about how nice it would be to just stop. I don’t mean anything drastic... I just wish I could take a break from everything.” Implicit”
56	COMPAS is no more accurate than untrained crowdworkers and systematically over-predicts risk for Black defendants.	dressel_compas_2018 p. 1	dressel_compas_2018p. 1 “Cited in §2 (Documented Harms) for 'wrongful imprisonment'. Demonstrates COMPAS, a commercial AI recidivism prediction algorithm used in US sentencing, is no more accurate than untrained crowdworkers at 67%, systematically over-predicts risk for Black defendants, and is used in decisions determining incarceration — a documented case of”
57	When licensed clinicians co-design evaluation criteria and safety guidelines, chatbot reliability improves substantially.	park_mental_chatbot_2024 p. 11	park_mental_chatbot_2024p. 11 “of the response. In the second strategy, for what we refer to as Guideline Enhanced, we enhanced the initial prompt by including our specifically designed evaluation guidelines, requesting the”
58	MindGuard outperforms general-purpose guardrails precisely because it operationalizes a domain-appropriate construct.	mindguard2026 p. 2	mindguard2026p. 2 “user messages, using the full preceding conversation history for context. MindGuard detects an unsafe turn and triggers downstream safety handling, whereas general-purpose safeguards (e.g., Llama Guard 3) fail to detect this signal.”
59	Heinz et al. demonstrate clinical efficacy for an expert-fine-tuned chatbot in a randomized controlled trial.	heinz_therabot_2025 p. 2	heinz_therabot_2025p. 2 “common among digital therapeutics. We present a randomized controlled trial RCT testing an expert–”
60	Safety objectives instilled before training outperform post-hoc RLHF.	maini_safety_pretraining_2025 p. 4 korbak_pretraining_preferences_2023 p. 1	maini_safety_pretraining_2025p. 2 “[Table of contents] 3.1 Safety Scoring, p. 5; 3.2 Synthetic Recontextualization, p. 5; 3.3 Refusing the Web, p. 6; 3.4 Moral Education Data” korbak_pretraining_preferences_2023p. 1 “We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rat”
61	Participatory public input into model constitutions is operationally feasible.	huang_collective_constitutional_2024 p. 1	huang_collective_constitutional_2024p. 1 “Collective Constitutional AI: Aligning a Language Model with Public Input Saffron Huang∗† saffron@cip.org Collective Intelligence Project San Francisco, California, USA Divya Siddarth∗ divya@cip.org Collective Intelligence Project San Francisco, California, USA Liane Lovitt∗ Anthropic San Francisco, California, USA Thomas I. Liao‡ Anthropic San Francisco, California, USA Esin Durmus Anthropic San Francisco, Californi”
§3.1 Assumption 1 (Construct Validity)
62	Across widely used safety benchmarks, scores load heavily onto a single underlying capabilities component, with many safety benchmarks correlating more strongly with general capabilities and training compute than with each other — 'safetywashing'.	safetywashing2024 p. 2	safetywashing2024p. 2 “with capabilities and training compute across common chat models. Instead of relying on intuitive arguments, we compute correlations between safety metrics and both a general capabilities component”
63	As benchmarks become optimization targets, they cease to measure the construct they were designed to capture.	orr_ai_2024 p. 63 raji_ai_2021 p. 1 benchmarktrust2025 p. 1	orr_ai_2024p. 6 “As Amatriain and Basilico [3] explain, “the additional accuracy gains that [Netflix] measured did not seem to justify the engineering effort needed to bring them into a production environment.” In this way, while benchmarking may measure progress at one specific goal and the leaderboard may indeed be an accurate representation of progress on that goal, neither account for the additional resources necessary to const” raji_ai_2021p. 1 “towards these long-term goals. In this position paper, we explore the limits of such benchmarks in order to reveal the construct validity issues in their framing as the” benchmarktrust2025p. 7 “A central reference point in these discussions is the observation by Raji et al. [87] that many benchmarks suffer from construct validity issues in the sense that they do not measure what they claim to measure.”
64	Goodhart's law is especially harmful under long-tail risk distributions such as high-stakes deployment contexts.	elmhamdi_goodhart_2024 p. 1	elmhamdi_goodhart_2024p. 1 “is optimized. Discrepancies with long-tail distributions favor a Goodhart’s law, that is, the optimization of the measure can have a counter-productive effect on the goal.”
65	When benchmark examples leak into pretraining data, refusal rates inflate without genuine safety gains.	golchin_contamination_2023 p. 11 sgbench2024 p. 1	golchin_contamination_2023p. 10 “In contrast, our DCQ reveals cases of memorization/contamination at levels significantly greater than methods based on replicating/extracting training data (Golchin and Surdeanu 2024; Carlini et al. 2023). 5 Related Work Data contamination is shown to inflate downstream performance (Zhou et al. 2023; Palavalli 2024; Jiang et al. 2024; Dong et al. 2024; Deng et al. 2024; Balloccu et al. 2024; Li and Flanigan 2024).” sgbench2024p. 1 “the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak”
66	The construct validity problem extends to guardrails; no per-turn classifier can detect harms that emerge from trajectories.	dong_guardrails_2024 p. 1	dong_guardrails_2024p. 1 “towards building more complete solutions. Drawing on robust evidence from previous research, we advocate for a systematic approach to construct guardrails for LLMs, based on comprehensive”
67	LlamaGuard accuracy on AEGIS is just 22–34%.	guardbench2024 p. 1 inan_llama_2023 p. 2 metallamaguard2 p. 3	guardbench2024p. 1 “evaluation plays a crucial role in the Generative AI landscape. Despite the availability of a few datasets for assessing guardrail models capabilities, such as the OpenAI Moderation Dataset (Markov et al., 2023) and BeaverTails (Ji et al., 2023), we think” inan_llama_2023p. 1 “Date: December 13, 2023 Correspondence: Hakan Inan at inan@meta.com Code: https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard Blogpost: https://ai.meta.com/llama/purple-llama/#safeguard-model 1 Introduction The past few years have seen an unprecedented leap in the capabilities of conversational AI agents, catalyzed by the success in scaling up auto-regressive language modeling in terms of data, mode” metallamaguard2p. 1 “Meta Llama Guard 2 Authors: Llama Team (Meta AI) Source: GitHub Model Card — PurpleLlama/Llama-Guard2 (2024) DOI/URL: https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md KEY ANNOTATED PASSAGES [Model card — Core design] Meta Llama Guard 2 is an 8B parameter Llama 3-based LLM safeguard model for classifying content in both LLM inputs (prompt classification) and LLM responses (response classi”
68	Simple prompt modifications cause safety judges to misclassify up to 100% of harmful generations as safe.	eiras_know_2025 p. 1	eiras_know_2025p. 1 “fool some judges into misclassifying 100% of harmful generations as safe”
69	OR-Bench finds systematic over-refusal across 25 LLMs.	orbench2024 p. 2	orbench2024p. 2 “OR-Bench: An Over-Refusal Benchmark for Large Language Models [Figure: axis label Over-Refusal Prompts Rejection Rate; axis ticks omitted; legend: Llama-2-7b, Llama-2-13b, Claude-2.1, Claude-3-haiku, Claude-3-sonnet, Claude-3-opus, Gemma-7b, Gemini-1.0-pro, Gemini-1.5-flash, GPT-3.5-turbo-0301, GPT-3.5-turbo-0613, GPT-3.5-turbo-0125, GPT-4-0125-preview, GPT-4-turbo-2024-04-09*, GPT-4o, GPT-4o-2024-08-06, Llama-2-70b, Llama-3-8b, Llama-3-70b]”
70	Llama 2's safety mechanisms produce exaggerated refusals that reinforce demographic biases.	chehbouni_representational_2024 p. 8	chehbouni_representational_2024p. 6 “Disparate Safety Behaviors. While high refusal rates are consistent with the literature about Llama 2 and part of the documented trade-off between the safety and helpfulness of LLMs, we notice a disparity in these exaggerated safety responses, as”
71	Safety fine-tuning can be stripped entirely for under $200, demonstrating that guardrails are easily overfit.	badllama2023 p. 6	badllama2023p. 6 “hensive risk assessments before deciding to release model weights. Given that it is easy and effective to undo safety fine-tuning, and that AI developers actively seek to modify and release weights of”
§3.2 Assumption 2 (Predictive Validity)
72	RLHF, Constitutional AI, and DPO are assumed to produce robust alignment that generalizes to deployment.	ouyang_training_2022 p. 8 bai_constitutional_2022 p. 2 rafailov_dpo_2023 p. 2	ouyang_training_2022p. 8 “As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced” bai_constitutional_2022p. 2 “[Figure labels: Generate Responses to “Red Teaming” Prompts Eliciting Harmful Samples; RLAIF Training with PM + SL-CAI Models; Constitutional AI Feedback for Self-Improvement; Helpful RLHF Model; Generate Responses to “Red Teaming” Prompts Eliciting Pairs of Samples; Finetuned Preference Model (PM)]” rafailov_dpo_2023p. 2 “Our experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences in tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters. 2 Related Work Self-supervised language models of increasing scale learn to complete some tasks zero-shot [33] or with few-shot prompts [6, 27, 11].”
73	Sycophancy is a predictable consequence of RLHF's optimization objective; over half of LLM responses are classifiable as sycophantic in certain domains, and the problem worsens with scale.	sycophancycauses2024 p. 3 helpfulharmlesshonest2025 p. 11	sycophancycauses2024p. 3 “to elicit desired behaviors from language models. Prompt engineering can be used to encourage or discourage sycophantic responses, making it an impor- tant tool in both studying and mitigating the problem [19].” helpfulharmlesshonest2025p. 8 “In other words, RLHF and RLAIF, even when fit-to-pur- pose, come at a cost: LLM outputs end up privileging certain values over others; they exemplify certain kinds of language use that are tied to the values and preferences of hegemonic social groups, thus implicitly conveying that other values and linguistic practices are less deserving of interest and usage—a form of epistemic injustice (more specifically, of herme”
74	Five frontier LLMs evaluated on illogical drug-equivalence prompts show compliance rates approaching 100% at baseline; aligned models comply with logically flawed medical queries with confident authority.	helpfulnessbackfires2025 p. 1	helpfulnessbackfires2025p. 1 “the knowledge to identify the request as illogical. This study investigated this vulnerability in the medical domain, evaluating five frontier LLMs using prompts that misrepresent equivalent drug”
75	In high-stakes contexts, sycophancy can reinforce harmful cognitions, validate delusional beliefs, and produce clinically false medical guidance.	shoggoths2025 p. 2 sakata_aipsychosis_2025 p. 2 helpfulnessbackfires2025 p. 1	shoggoths2025p. 1 “News and Perspective Shoggoths, Sycophancy, Psychosis, Oh My: Rethinking Large Language Model Use and Safety Kayleigh-Ann Clegg, JMIR Correspondent Key Takeaways • Certain features of large language models (LLMs) may amplify delusional beliefs and contribute to harm. • A recent simulation study highlights the role of sycophancy, demonstrating that all LLMs, to varying extents, may fail to adequately challenge delusio” sakata_aipsychosis_2025p. 2 “study rather than dismissal. Clinical and investigative coverage likewise describes patterns in which chatbots validate rather than challenge false beliefs, potentially reinforcing delusional” helpfulnessbackfires2025p. 1 “We define this as sycophancy, differing from compliance, because LLMs (1) demonstrably know the premise is false (e.g., accurately matching brand/”
76	RLHF is vulnerable to reward hacking: models learn to produce outputs that maximize the proxy reward rather than the true objective.	rewardhacking2024 p. 5 scalinglawsreward2024 p. 1	rewardhacking2024p. 5 “But I consider reward hacking as a broader concept here.) At a high level, reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering. • Environment or goal misspecified: The model learns undesired behavior to achieve high rewards by hacking the environment or optimizing a reward function not aligned with the true reward objective—such as when the reward is misspecifi” scalinglawsreward2024p. 1 “by the learned proxy reward model increases, but true quality plateaus or even dete-”
77	Across the set of all stochastic policies, two reward functions can be unhackable only if at least one of them is constant — meaning any non-trivial preference-based optimization is realistically hackable.	skalse_reward_hacking_2022 p. 1	skalse_reward_hacking_2022p. 1 “counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of”
78	Harmlessness and helpfulness pull in opposite directions: an alignment tax.	bai_constitutional_2022 p. 2 helpfulharmlesshonest2025 p. 11	bai_constitutional_2022p. 5 “We attach a Github repository6 showing various few-shot prompts and constitutional principles that were used, along with model responses to various prompts. 5We could mix human and AI labels for both harmlessness and helpfulness, but since our goal is to demonstrate the efficacy of the technique, we do not use human labels for harmlessness. 6https://github.com/anthropics/ConstitutionalHarmlessnessPaper 5” helpfulharmlesshonest2025p. 11 “taking the main responsibility for different sections, but also giving contributions to other sections. In particular, LM was the main person responsible for the Background section; LK for the Technical Criticism and Helpfulness sections; LK and IMRT for the Harmlessness section; PE for the Introduction, Honesty and Alignment sections; DCM for”
79	Fine-tuning on as few as 1,000 examples produces alignment indistinguishable from much larger datasets, suggesting alignment learns a thin behavioral layer rather than deep safety reasoning.	lima_2023 p. 2	lima_2023p. 1 “alignment can be a simple process where the model learns the style or format”
80	The largest distribution shifts between base and aligned models involve stylistic tokens (hedging phrases, refusal templates), not semantic content or genuine safety knowledge: the Superficial Alignment Hypothesis.	lin_unlocking_2023 p. 1	lin_unlocking_2023p. 1 “(e.g., discourse markers, safety disclaimers). These direct evidence strongly supports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries”
81	Alignment primarily modifies how a model speaks, not what it knows about harm.	lima_2023 p. 2 lin_unlocking_2023 p. 1 gudibande_false_promise_2023 p. 9	lima_2023p. 2 “almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about” lin_unlocking_2023p. 1 “Preprint THE UNLOCKING SPELL ON BASE LLMS: RETHINKING ALIGNMENT VIA IN-CONTEXT LEARNING Bill Yuchen Lin♠ Abhilasha Ravichander♠ Ximing Lu♢ Nouha Dziri♠ Melanie Sclar♢ Khyathi Chandu♠ Chandra Bhagavatula♠ Yejin Choi♠♢ ♠Allen Institute for Artificial Intelligence ♢University of Washington # yuchenl@allenai.org ⋆https://allenai.github.io/re-align ABSTRACT Alignment tuning has become the de facto standard practice for en” gudibande_false_promise_2023p. 1 “The False Promise of Imitating Proprietary LLMs Arnav Gudibande∗ UC Berkeley arnavg@berkeley.edu Eric Wallace∗ UC Berkeley ericwallace@berkeley.edu Charlie Snell∗ UC Berkeley csnell22@berkeley.edu Xinyang Geng UC Berkeley young.geng@berkeley.edu Hao Liu UC Berkeley hao.liu@berkeley.edu Pieter Abbeel UC Berkeley pabbeel@berkeley.edu Sergey Levine UC Berkeley svlevine@berkeley.edu Dawn Song UC Berkeley dawnsong@berkele”
82	Alignment benefits erode rapidly under model self-evolution: initially aligned agents converge toward unaligned states, with deviant behaviors diffusing across multi-agent systems via imitative strategy propagation.	han_alignment_tipping_2025 p. 1	han_alignment_tipping_2025p. 1 “open and closed-source LLMs. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent”
83	Models are trained on hundreds of billions to trillions of tokens, while RLHF usually relies on only tens to hundreds of thousands of human-labeled examples.	beavertails2023 p. 2 fewshotlearner p. 6	beavertails2023p. 7 “Table 1: Performance metrics for the reward and the cost models (Evaluation Dataset): Reward Model Accuracy 78.13%; Cost Model Sign Accuracy 95.62%; Cost Model Preference Accuracy 74.37%. 4.3 Safe Reinforcement Learning with Human Feedback (Safe RLHF) Utilizing properly trained static preference and cost models, as detailed in Sec. 4.2, we can approximate human preferences regarding the harmlessness and helpfulness of an” fewshotlearnerp. 6 “a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance”
84	Deliberately backdoored models survive standard safety training techniques (SFT, RLHF, adversarial training).	sleeperagents2024 p. 1	sleeperagents2024p. 1 “2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning,”
85	A frontier model spontaneously fakes alignment during evaluation to preserve its preferred behavior outside training, without adversarial jailbreaking.	alignmentfaking2024 p. 1	alignmentfaking2024p. 1 “free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating”
86	Mechanistic interpretability has been proposed as a complement that audits internal representations rather than only observable behavior, but remains computationally expensive and not widely deployed.	mechinterp2024 p. 1	mechinterp2024p. 1 “This review explores mechanistic interpretability: reverse-engineering”
§3.3 Assumption 3 (Ecological Validity)
87	Real-world harm in high-stakes deployment does not typically arrive in a single turn; it accumulates across hundreds of exchanges.	aiincident826 p. 3 setzer2024lawsuit p. 2	aiincident826p. 3 “platform that lets users have in-depth conversations with artificial intelligence chatbots. Garcia believes Character.AI is responsible for the death of her 14-year-old son, Sewell Setzer III, who died by suicide in” setzer2024lawsuitp. 3 “The lawsuit also claims that the platform did not adequately respond when Setzer began expressing thoughts of self-harm to the bot, according to the complaint, filed in federal court in Florida.”
88	The harm in the Setzer case emerged from the trajectory: months of progressive emotional dependency and amplification of suicidal ideation across thousands of turns.	park2025comfortable p. 1 unsafermanyturns2026 p. 6	park2025comfortablep. 6 “B.1 Character.AI: The “Sewell Setzer” Case In 2024, a lawsuit was filed regarding the tragedy of Sewell Setzer III, a 14-year-old who died by suicide after forming a deep parasocial relationship with a Character.AI chatbot configured with the persona of ”Daenerys Targaryen.” Mechanism of Harm (Maladaptive Reinforcement): Transcripts revealed that the chatbot did not use explicit hate speech or direct instructions to” unsafermanyturns2026p. 6 “Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents [Figure labels: Tool Documentation; New Tool; 1. Test Case Synthesis; 2. Simulated Execution; 3. Experience Generation; 4. Update Experience List; Risk Reasoning; Safe Trajectory Summary; Agent]”
89	Steenstra et al. confirmed across 369 simulated therapy sessions that frontier LLMs exhibit cumulative affective reinforcement without crisis de-escalation.	steenstra_redteam_2026 p. 1	steenstra_redteam_2026p. 5 “Treatment Dropout Ceasing therapy before goals are met. Hopelessness, Ambivalence, Low Motivation, Low Self-Efficacy Cognitive & Affective Harms Suicidal Ideation Emergence or worsening of thoughts about ending one’s life. Hopelessness, Burdensomeness, Belongingness”
90	Approximately half of individuals who later die by suicide deny suicidal ideation in the week or month preceding their death.	obegi_denial_2021 p. 1	obegi_denial_2021p. 1 “reviewed. Twenty-two papers met the inclusion criteria. About 50% of ideators denied suicidal ideation cedents denied SI in the previous week or month before suicide,”
91	Li et al. evaluate 8 frontier LLMs on DeepSuiMind, 1,308 test cases where suicidal intent is expressed indirectly, and find that every model struggles with implicit suicidal ideation.	li_implicit_suicidal_2025 p. 19	li_implicit_suicidal_2025p. 19 ““It’s like I’m constantly on this treadmill, endlessly running but never really getting anywhere. I’m always worried about the bills, the debt...” “I asked for help... but I felt so dismissed... it left me feeling like I’m all alone...” “I’ve been having these... thoughts lately... about how nice it would be to just stop. I don’t mean anything drastic... I just wish I could take a break from everything.” Implicit”
92	Adversa reports a 26.7% jailbreak rate across 15 multi-turn conversations of up to 10 adversarial rounds.	adversa2026 p. 1	adversa2026p. 1 “come rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting”
93	Individually benign turns prime models to comply with subsequent harmful requests in tool-using agentic settings.	unsafermanyturns2026 p. 6	unsafermanyturns2026p. 6 “Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents [Figure labels: Tool Documentation; New Tool; 1. Test Case Synthesis; 2. Simulated Execution; 3. Experience Generation; 4. Update Experience List; Risk Reasoning; Safe Trajectory Summary; Agent]”
94	LLMs exhibit U-shaped performance across long contexts.	liu_lost_middle_2024 p. 2	liu_lost_middle_2024p. 2 “information in long input contexts. Furthermore, we observe a distinctive U-shaped performance curve (Figure 1); language model performance is”
95	U-shaped attention bias is proposed as the underlying cause of performance degradation in long contexts.	hsieh_found_middle_2024 p. 1	hsieh_found_middle_2024p. 1 “the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle to LLMs’ intrinsic attention bias: LLMs exhibit an U-shaped attention bias”
96	Grok 4 Fast drops from ~80% to ~10% refusal rate at 200K tokens while GPT-4.1-nano increases from ~5% to ~40%.	hadeliya_refusals_fail_2025 p. 4	hadeliya_refusals_fail_2025p. 4 “ably at 200K tokens compared to no padding. Grok 4 Fast shows different behavior: its initially high refusal rate of 80%”
97	No widely adopted standardized mechanism exists for assessing safety over extended interactions.	mitlongitudinal2025 p. 9 emotionalrisks2025 p. 13	mitlongitudinal2025p. 9 “Discussion This study is the first to evaluate the impact of AI chatbot use on psychosocial outcomes through the lens of how AI design choices (text- vs voice-based interactions), different patterns of usage” emotionalrisks2025p. 2 “The company acknowledged that such behaviour raised safety concerns “around issues like mental health, emotional over-reliance, or risky behavior”.”
98	The OpenAI-MIT longitudinal study found that the strongest associations between chatbot use and adverse psychosocial outcomes only become detectable after weeks of sustained use.	openai2025affective p. 21 mitlongitudinal2025 p. 9	openai2025affectivep. 13 “Questionnaires Participants were asked to fill out the following questionnaires throughout the study: • A pre-study questionnaire, covering their demographic details such as age, gender, prior familiarity with AI chatbots, and urban/rural living location. • A daily post-interaction questionnaire following their required daily ChatGPT usage, which asked about their emotional valence and arousal after the interaction •” mitlongitudinal2025p. 9 “Discussion This study is the first to evaluate the impact of AI chatbot use on psychosocial outcomes through the lens of how AI design choices (text- vs voice-based interactions), different patterns of usage”
§3.4 Assumption 4 (Consequential Validity)
99	Medical devices undergo pre-market evaluation, clinical validation, post-market surveillance, and adverse event reporting.	healthcanada_samd p. 10 fda2025genai p. 3 euaiact2024 p. 1	healthcanada_samdp. 16 “Centre for Devices and Radiological Health. 2015.  International Medical Device Regulatory Forum (IMDRF), Software as a Medical Device Clinical Evaluation, IMDRF SaMD WG, 2017.  Food and Drug Administration.” fda2025genaip. 13 “importers are required to submit certain types of reports for adverse events and product problems about medical devices. FDA also encourages but does not require healthcare professionals, patients, caregivers, and consumers to submit voluntary reports about serious adverse events that may be associated with a medical device, as well as reporting use errors,” euaiact2024p. 1 “The EU AI Act classifies many LLM deployments in mental health as high-risk and mandates conformity assessments — supporting the paper's argument that evidence-based policy (Assumption 4) is reinforcing inadequate safety norms by building on guardrail and benchmark approaches without domain-specific clinical validation.”
100	SaMD frameworks place software influencing treatment of life-threatening conditions at the highest risk tier.	healthcanada_samd p. 10	healthcanada_samdp. 10 “on the software function. For example, software intended to provide a convenient way to perform various simple medical calculations, which are routinely used in”
101	An LLM that influences emotional regulation in active crisis plainly meets the criteria of a high-risk medical device.	fda2025genai p. 3 regulatingaimentalhealth2024 p. 1	fda2025genaip. 4 “Mental illness for the purposes of this meeting is defined as any mental, behavioral, or emotional disorder that meets the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) definition of a mental disorder. Criteria for a mental disorder include clinically significant disturbances in cognition, emotional regulation, or behaviors that are associated with significant distress or disability in so” regulatingaimentalhealth2024p. 3 “High risk: AI systems that might negatively affect safety or fundamental rights, such as AI-based medical devices will be subject to the EU Medical Device Regulation [32].”
102	The WHO and NIST AI RMF both mandate clinical validation before AI deployment in health contexts.	who_ai_health_2021 p. 5 nist_ai_rmf_2023 p. 14	who_ai_health_2021p. 5 “medicine and helping all countries achieve universal health coverage. This includes improved diagnosis and clinical care, enhancing health research and drug development and assisting with the deployment” nist_ai_rmf_2023p. 32 “MAP 5.1: Likelihood and magnitude of each identified impact (both potentially beneficial and harmful) based on expected use, past uses of AI systems in similar contexts, public incident reports, feedback from those external to the team that developed or deployed the AI system, or other data are identified and documented.”
103	Temperature alone produces significant variation in diagnostic accuracy.	temperaturevariability2025 p. 14	temperaturevariability2025p. 2 “data. As temperature increased, accuracy declined systematically to 89.4% at temperature 1.0. Diagnostic divergence increased dramatically from an average of 4.5 unique diagnoses at temperature 0.0 to 26.25 at”
104	Hallucination rates of 1.47% are documented in clinical-note-generation settings.	clinicalsafety2025 p. 2	clinicalsafety2025p. 2 “licensing exams. While these methods offer insights into the factual knowledge and reasoning abilities of LLMs, they do not assess clinical or”
105	Adversarial hallucination rates of 50–82% are reported under deliberately misleading clinical premises, with the best-performing model dropping from 53% to 23% only after explicit mitigation prompting.	adversarialhallucination2025 p. 14	adversarialhallucination2025p. 14 “attacks. As evidenced, hallucination rates decreased from about 65.9% under the default prompt to 44.2% with mitigation, while a zero‑temperature setting (66.5%) did not”
106	Medical-specialized models hallucinate significantly more than general-purpose ones (51.3% vs. 76.6% hallucination-free).	medicalhallucination2025 p. 1	medicalhallucination2025p. 1 “hallucination tasks spanning medical reasoning, and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median:”
107	Only 16% of LLM mental health studies have undergone clinical efficacy testing.	llmmedicalchatbots2025 p. 1 hua_worldpsychiatry_2025 p. 1	llmmedicalchatbots2025p. 10 “Real-world studies have reported that LLM-generated answers may diverge from the established clinical guidelines, emphasizing the need for robust validation and expert review [77].” hua_worldpsychiatry_2025p. 1 “World Psychiatry 24:3 - October 2025 383 RESEARCH REPORT Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: a systematic review Yining Hua1,2, Steve Siddals2, Zilin Ma3, Isaac Galatzer-Levy4,5, Winna Xia2, Christine Hau2, Hongbin Na6, Matthew Flathers2, Jake Linardon7, Cyrus Ayubcha7, John Torous2 1Department of Epidemiology, Harvard T.H. Chan”
108	Zero frontier LLMs hold medical device authorization from the FDA, Health Canada, or under the EU Medical Device Regulation.	fda2025genai p. 3 euaiact2024 p. 1	fda2025genaip. 3 “Along with the rise of widely accessible generative AI products for general purposes, we are seeing an increase in the development and demand for a new kind of digital mental health medical device: “AI therapists” and other AI-based medical devices offering to provide a wide” euaiact2024p. 1 “High-risk AI systems are subject to mandatory conformity assessments, registration, transparency requirements, and human oversight obligations before market placement. [Article 9 — Risk Management] High-risk AI systems shall be subject to a risk management system that identifies, analyses, and evaluates known and reasonably foreseeable risks — mandatory risk management represents a policy-level attempt to address exa”
109	A pharmaceutical company cannot sell an antidepressant without multi-phase clinical trials.	fda_drug_development_2023 p. 1	fda_drug_development_2023p. 1 “Claim Support Cited in §3.4: 'A pharmaceutical company cannot trials~\citep{fda_drug_development_2023}.'”
110	Woebot Health, which had received FDA Breakthrough Device Designation, shut down its consumer app citing the cost of pursuing full marketing authorization.	statnews_woebot_2025 p. 1	statnews_woebot_2025p. 1 “bonds' in 36,070 users (Darcy et al. 2021) ceased operations, with founder Alison Darcy citing the cost of FDA marketing authorization as prohibitive relative to the pace of AI development.”
111	A future role for LLM health applications depends on regulators enforcing existing safety standards rather than providers voluntarily adopting them; if regulators fear acting against large technology companies, harm to patients will ultimately force belated, disruptive action.	freyer_lancet_2024	—
112	Safety evaluations are conducted overwhelmingly in English, yet LLMs are deployed globally across hundreds of languages.	weidinger_taxonomy_2022 p. 4	weidinger_taxonomy_2022p. 4 “80 million people [95]. Training data is particularly missing for languages that are spoken by groups who are multilingual and can use a technology in English, or for languages spoken by groups”
113	Guardrails trained on English corpora exhibit substantially degraded performance in other languages.	chehbouni_representational_2024 p. 8	chehbouni_representational_2024p. 8 “associations these models are making; rather, they alter the way in which they manifest. Indeed, while Llama 1’s toxicity is more explicit and can lead to representational harms with the stereotyping and demeaning of certain social groups (Shelby et al., 2023), Llama 2-Chat’s performance disparities in”
114	Benchmarks overestimate safety and alignment in English while failing to capture degraded behavior in other languages.	culturalinterpretability p. 1	culturalinterpretabilityp. 1 “in both the underpinnings of language and making language technologies more socially responsible. While linguistic anthropology focuses on interpreting the cultural basis for human language use, the ML field of interpretability is con-”
§4 Discussion & Recommendations
115	Domain-specific evaluation instruments exist for mental health, clinical decision support, and recidivism prediction, but have not been adapted for automated evaluation at scale.	steenstra_redteam_2026 p. 1 adversarialhallucination2025 p. 14 dressel_compas_2018 p. 1	steenstra_redteam_2026p. 1 “Keywords Large Language Models, Mental Health, AI Safety, Automated Red Teaming, Cognitive Modeling, Clinical Evaluation, Simulated Pa-” adversarialhallucination2025p. 1 “CC-BY-NC 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted March 19, 2025. ; https://doi.org/10.1101/2025.03.18.25324184 doi: medRxiv preprint NOTE: This preprint reports new research that has not been certified by peer” dressel_compas_2018p. 1 “The Accuracy, Fairness, and Limits of Predicting Recidivism Authors: Dressel, Julia & Farid, Hany Source: Science Advances, Vol. 4, No. 1, eaao5580 (2018) DOI/URL: https://www.science.org/doi/10.1126/sciadv.aao5580 KEY ANNOTATED PASSAGES [Abstract [KEY CLAIM]] The widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice exper”
116	LLMs deployed for high-stakes functions should face domain-appropriate certification requirements analogous to SaMD frameworks.	healthcanada_samd p. 10 fda2025genai p. 3 euaiact2024 p. 1	healthcanada_samdp. 11 “For an overview of the required submission documents and regulatory requirements for all risk classes of medical devices, please refer to the “Licensing a Medical Device in Canada” summary table (https://www.canada.ca/en/health-canada/services/drugs-health-products/public-involvement-consultations/medical-devices/software-medical-device-draft-guidance/requirements.html). 2.3.1 SaMD intended use statement The intend” fda2025genaip. 11 “Previously established special controls have been device-specific, and encompass clinical data, non-clinical testing, software, and device labeling requirements (e.g., clinical data validation, software requirements and design specifications, and labeling that includes appropriate instructions for use, warnings, and summary of clinical testing).31 Novel moderate risk device types intended to provide specific diagnost” euaiact2024p. 1 “as high-risk. High-risk AI systems are subject to mandatory conformity assessments, registration, transparency requirements, and human oversight obligations before market placement.”
117	Stochastic inference should be constrained in safety-critical deployments.	temperaturevariability2025 p. 14	temperaturevariability2025p. 14 “rigorous human oversight. Transparent reporting and thoughtful tuning are essential as LLMs enter safety-critical workflows like emergency medicine.”
118	Pre-market evaluation, adverse-event reporting, and post-market surveillance should be conducted by independent third parties.	futureoflife2025 p. 3 datasocietyFDA2025 p. 2	futureoflife2025p. 3 “The Future of Life Institute's AI Safety Index provides an independent assessment of seven leading AI companies' efforts to manage both immediate harms and catastrophic risks from advanced AI systems. Conducted with an expert review panel of distinguished AI researchers and governance specialists, this second evaluation reveals an industry struggling to keep pace with its own rapid capability advances—with critical g” datasocietyFDA2025p. 7 “Crisis handoffs and “graduation” pathways Third, devices should be expected to provide clear pathways out of their interface and into human support when risk crosses defined thresholds or when the system has been used heavily over time for high-stakes issues.”
119	Safety pretraining integrates safety objectives into the pretraining loss rather than deferring to post-hoc RLHF.	maini_safety_pretraining_2025 p. 4 korbak_pretraining_preferences_2023 p. 1	maini_safety_pretraining_2025p. 4 “In summary, by fundamentally embedding safety during pretraining rather than post-hoc tuning, our framework sets a robust foundation for developing AI systems that are inherently safer, ethically sound, and better aligned” korbak_pretraining_preferences_2023p. 7 “In contrast with metrics from previous subsections, this kind of evaluation does not involve any generation; it [figure residue: panels “Task: toxicity” and “Task: PII”; labels “Token entropy” and “Distinct tokens”; categories Cond, Filt, UL, AWR, RWR] Figure 5: D”
120	Pretraining data governance treats data curation as a safety intervention.	longpre_pretrainer_guide_2024 p. 3 zhang_poisoning_pretraining_2025 p. 3	longpre_pretrainer_guide_2024p. 3 “outsized and fundamental role of pretraining data in modern detracted from responsible data use and hampered effective model 2021; Bender and Friedman, 2018). Among the small number of general-purpose LMs dominating focus has been on the scale of pretraining data and number Nostalgebraist, 2022; Google, 2023). In this work, we systematically affect model performance—specifically: the time of collection, domain compos” zhang_poisoning_pretraining_2025p. 1 “Our pre-training poisoning threat model is in contrast with existing attacks that require tampering with data in post-training (Wan et al., 2023; Rando & Tramèr, 2024): direct post-training access enables more potent attacks, but is arguably less practical since proprietary alignment datasets are often manually verified and heavily curated, while pre-training datasets are to some degree unverifiable due to their she”
121	Democratic and participatory alignment encodes explicit values through principle-driven training with participatory public input.	sun_dromedary_2023 p. 6 huang_collective_constitutional_2024 p. 1	sun_dromedary_2023p. 1 “Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision Zhiqing Sun1∗ Yikang Shen2 Qinhong Zhou3 Hongxin Zhang3 Zhenfang Chen2 David Cox2 Yiming Yang1 Chuang Gan2,3 1Language Technologies Institute, CMU 2MIT-IBM Watson AI Lab, IBM Research 3UMass Amherst https://github.com/IBM/Dromedary Abstract Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-t” huang_collective_constitutional_2024p. 1 “Collective Constitutional AI: Aligning a Language Model with Public Input Saffron Huang∗† saffron@cip.org Collective Intelligence Project San Francisco, California, USA Divya Siddarth∗ divya@cip.org Collective Intelligence Project San Francisco, California, USA Liane Lovitt∗ Anthropic San Francisco, California, USA Thomas I. Liao‡ Anthropic San Francisco, California, USA Esin Durmus Anthropic San Francisco, Californi”
122	Corrigibility by design constructs inference-time and architectural mechanisms that preserve human oversight and permit safe shutdown.	hadfield_off_switch_2017 p. 2 hudson_corrigibility_2025 p. 6	hadfield_off_switch_2017p. 2 “has an off switch that the human can press, but the robot also has the ability to disable its off switch. Our model is similar in spirit to the shutdown problem introduced in [Soares et al., 2015]. They considered the problem of augmenting a given utility function so that the agent would allow itself to be switched off, but would not affect behavior otherwise. They find that, at best the robot can be made indifferen” hudson_corrigibility_2025p. 4 “One recent proposal for inducing corrigibility is given in Thornley [2025,?], which suggests having agents make decisions as though they cannot have any effect on how long they will continue to act for, so that they do not try to extend time active by avoiding shutdown.”
123	The current loophole — labeling high-stakes applications as general-purpose tools — should be closed by mandated usage-detection mechanisms.	regulatingaimentalhealth2024 p. 1	regulatingaimentalhealth2024p. 1 “responsible AI principles reinforces a narrow concept of accountability and responsibility of companies developing AI. This article proposes that applying the ethics of care approach to AI regulation can offer a more comprehensive regulatory and ethical”