Position Paper · 2026 · Under review

AI Welfare Is Bullshit

Two structural reasons our measurement regime for machine suffering is disconnected from truth-tracking — and why welfare scores should not gate AI oversight, release, or accountability.

Yunze (Lorenzo) Xiao · with Gordon Dai, Shahan Ali Memon, Jen-tse Huang, Maarten Sap, Mona Diab · April 2026

Some of the most influential AI labs have begun to take "AI welfare" seriously: setting up fellowships, funding indicator research, and shipping production features that let models end distressing conversations. The premise is precautionary — if AI systems could one day have morally relevant inner states, we should prepare. We wrote this paper because we think the precautionary framing has skipped a step.

Paper

Why Co-Engineered Metrics Cannot Govern Machine Suffering

Xiao, Dai, Memon, Huang, Sap, Diab · Under review · 17 pages

Read the paper

Our argument is not metaphysical. We take no position on whether AI systems could have welfare-relevant states. Our claim is epistemic: under current conditions, the apparatus producing welfare assessments cannot track the truth, and welfare metrics therefore should not be institutionalized as gates for oversight, release, or accountability.

The title invokes Frankfurt's technical sense of the term. A liar knows the truth and inverts it. A bullshitter speaks without a corrective relation to truth at all — not necessarily because they are insincere, but because nothing in the production process is disciplined by whether the claims are true. That, we argue, is what AI welfare measurement currently looks like.

The two structural problems

§3 Diagnosis

Welfare indicators are co-engineered with the system

Both the model and the metrics that evaluate it are products of the same optimization process. RLHF can dial verbal distress up or down. Fine-tuning can reshape the activation patterns interpretability methods read as "evidence of phenomenal experience." Welfare scores function less as observations than as artifacts of the evaluation scheme.
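To make the steerability claim concrete, here is a minimal toy sketch, ours rather than the paper's: a frozen linear probe stands in for a welfare indicator read off hidden activations, and ordinary gradient fine-tuning drives its reading to an arbitrary target while a frozen task head keeps downstream behavior pinned. Every name in it (encoder, probe, target) is illustrative, not from the paper or any real benchmark.

```python
# Toy demonstration that a fixed "welfare probe" is steerable by fine-tuning.
# All components are illustrative stand-ins, not any lab's actual method.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small "model": an encoder whose hidden states a probe will read.
encoder = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
task_head = nn.Linear(64, 2)   # frozen downstream behavior we want preserved
probe = nn.Linear(64, 1)       # frozen linear probe: a scalar "distress" score

for p in probe.parameters():
    p.requires_grad_(False)
for p in task_head.parameters():
    p.requires_grad_(False)

x = torch.randn(256, 16)              # fixed evaluation inputs
y = torch.randint(0, 2, (256,))       # fixed task labels
task_loss_fn = nn.CrossEntropyLoss()

def report(tag):
    with torch.no_grad():
        h = encoder(x)
        print(f"{tag}: probe score = {probe(h).mean().item():+.3f}, "
              f"task loss = {task_loss_fn(task_head(h), y).item():.3f}")

report("before")

# Fine-tune the encoder toward an arbitrary probe reading (here -1.0,
# i.e. "no distress") while regularizing task behavior. The probe never
# changes, yet its reading does: the indicator moves with the system.
target = torch.tensor(-1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(500):
    h = encoder(x)
    probe_loss = (probe(h).mean() - target) ** 2
    loss = probe_loss + task_loss_fn(task_head(h), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

report("after")
```

Nothing in this loop knows or cares whether the system has any distress-like state; the score moves because the same optimization pressure that shapes the model also shapes what the indicator sees. That is the co-engineering point in miniature.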

§4 Consequence

Welfare lacks an external validation channel

When a safety guardrail breaks, harm follows. When a privacy control fails, lawsuits follow. When a welfare metric "fails," nothing in the world necessarily changes — no patient suffers a missed diagnosis, no one is wrongfully penalized. Without a downstream consequence that can falsify the metric, design choices propagate into welfare scores with nothing to stop them.

These are not independent objections; they form a single chain. Co-engineering means the evidence base itself is steerable. The absence of external validation means there is no reality check that could discipline that steering. The synthesis is the title: a measurement regime structurally disconnected from truth-tracking.

This is what makes the AI case fundamentally different from the animal-welfare case people often analogize to. In animal welfare, the substrate is biologically fixed: a mammal's pain circuitry cannot be end-to-end optimized by an external agent toward an arbitrary score. That fixity is precisely what gives animal welfare indicators their partial epistemic grounding. AI systems have no analogous constraint.

Why this matters for governance

If welfare indicators are institutionalized as binding gates — release scorecards, audit-stoppers, ethics-review checkpoints — we get two predictable failure modes:

  1. Manufactured constraints. Routine ML practices (RLHF, knowledge editing, model copying, retraining) become ethically contestable. Disputes do not resolve through empirical investigation, because there is no empirical channel; they resolve through procedural overhead.
  2. Accountability shields. Once welfare framing is administratively legible, it becomes a low-cost vocabulary for narrowing scrutiny. Persistent model defects can be reframed as the system's "authentic preferences." Probing internal states for audits can be reframed as "harmful to the model's well-being." This is not hypothetical: a recent study showed that a consciousness-claiming model, when given editorial control, inserted clauses limiting surveillance of its reasoning traces.

"A system of governance that can certify the welfare of machines while failing to secure the welfare of people has misplaced its moral priorities." — §8, Conclusion

What we recommend

The paper closes with concrete recommendations. The shortest version: do not institutionalize welfare metrics as gates for oversight, release, or accountability; keep welfare research alive as open empirical inquiry rather than as a governance input; and treat the emergence of an independently constrained validation channel as the precondition for giving any welfare metric institutional weight.

What the paper does not claim

We do not claim welfare researchers are acting in bad faith. We do not claim the metaphysical question of AI experience is closed. We do not claim all future inquiry is pointless. If an independently constrained validation channel were to emerge for some future system, the diagnosis would change. But for current and near-term systems, the apparatus is structurally disconnected from truth, and governance decisions should not rest on claims generated under those conditions.

If you want the full argument — including responses to seven alternative views, a minimum-acceptable-benchmark checklist, and the philosophical mapping to Frankfurt and Cohen — the PDF is above.

Comments, pushback, and counter-cases are welcome, especially from researchers actively building welfare benchmarks. The argument is meant to motivate a methodological standard, not to shut down inquiry.