The Chameleon's Limit — Interactive Walkthrough
Persona-prompted LLMs have become research substrate. Behavioral economists, psychologists, social scientists, and product teams now run "synthetic populations" through models to estimate distributions of opinion, response, or judgment. The premise is that with enough demographic conditioning, a model's outputs approximate a population of humans. Our paper argues there is a structural ceiling on how faithfully they ever will (a chameleon's limit), and that what we measure as persona variation is mostly performance over a small fixed set of attractors.
The substrate problem is simple to state and easy to overlook. If a model genuinely encoded a population, then conditioning on persona attributes would move it through that population — different attributes, different responses, different intra-group variance. What we observe across 10 LLMs is closer to the opposite: outputs cluster into a small number of attractors, and the persona conditioning chooses which attractor the model lands in, not where inside it the response sits.
"Chameleon's limit" names that ceiling. A chameleon changes color but only across a fixed palette. The metaphor is geometric: human personality space, sampled from real respondents, looks like a diffuse continuous cloud; persona-prompted model space, in the same coordinates, looks like a chain of small disconnected islands. Persona instructions shift the model from island to island; they do not put it inside the cloud.
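A minimal sketch of that geometry, with simulated 2D points standing in for the behavioral space (toy numbers, not the paper's pipeline; the greedy radius clustering is just an illustrative way to count "islands"):

```python
import math
import random

def count_attractors(points, radius=0.15):
    """Greedy clustering: a point joins the first existing centroid
    within `radius`, otherwise it founds a new cluster. The cluster
    count is a crude proxy for the number of distinct attractors."""
    centroids = []
    for p in points:
        if not any(math.dist(p, c) <= radius for c in centroids):
            centroids.append(p)
    return len(centroids)

random.seed(0)

# "Human-like" responses: a diffuse cloud over the unit square.
human = [(random.random(), random.random()) for _ in range(200)]

# "Model-like" responses: 200 persona-conditioned samples that all
# land near one of three attractors, whatever the persona says.
centers = [(0.2, 0.2), (0.5, 0.8), (0.8, 0.3)]
model = [(cx + random.gauss(0, 0.02), cy + random.gauss(0, 0.02))
         for cx, cy in (random.choice(centers) for _ in range(200))]

print(count_attractors(human))  # many clusters: the diffuse cloud
print(count_attractors(model))  # a handful: the island chain
```

The same sample count produces very different cluster counts: the diffuse cloud needs many centroids to cover, the island chain needs only one per attractor.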
Two findings, one geometry
Persona collapse is geometric, not stylistic
Human BFI-44 responses form a continuous distribution across the behavioral space. Persona-prompted model responses, in the same space, contract into clustered island chains. Coverage shrinks, intrinsic dimensionality drops, and within-group variance vanishes: the model performs each persona's surface but reproduces a smaller manifold underneath. This is structural, not a writing-style artifact.
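The microsite gives the formal coverage, uniformity, and complexity definitions; as a hedged stand-in, here is what the collapse signature looks like on simulated 2D data, with grid occupancy as a toy coverage proxy and per-group variance measured directly:

```python
import random
from statistics import pvariance

def grid_coverage(points, bins=10):
    """Fraction of cells in a bins x bins grid over [0,1]^2 that
    contain at least one point: a toy stand-in for coverage."""
    cells = {(min(int(x * bins), bins - 1), min(int(y * bins), bins - 1))
             for x, y in points}
    return len(cells) / (bins * bins)

def mean_within_group_variance(groups):
    """Average per-group variance of the first coordinate."""
    return sum(pvariance([x for x, _ in g]) for g in groups) / len(groups)

random.seed(1)

# Humans: every demographic group is itself a wide cloud.
human_groups = [[(random.random(), random.random()) for _ in range(50)]
                for _ in range(4)]

# Personas: each group collapses onto its own tight island.
islands = [(0.15, 0.2), (0.4, 0.7), (0.65, 0.3), (0.9, 0.8)]
persona_groups = [[(cx + random.gauss(0, 0.01), cy + random.gauss(0, 0.01))
                   for _ in range(50)] for cx, cy in islands]

human_all = [p for g in human_groups for p in g]
persona_all = [p for g in persona_groups for p in g]

print(grid_coverage(human_all), grid_coverage(persona_all))
print(mean_within_group_variance(human_groups),
      mean_within_group_variance(persona_groups))
```

Both metrics drop together for the persona samples: few cells occupied, near-zero variance inside every group — the numerical signature of the island chains.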
The fidelity trap
Higher per-persona fidelity does not buy higher population diversity. Fine-tuning for role-play (SFT, then SFT+RL) produces models that score ρ > 0.9 on per-persona match while their population coverage drops and their trait polarization climbs to Cohen's d > 6. The model performs each persona convincingly while the population still contracts to a few caricatures.
These are not independent observations; they form a single chain. Stronger per-persona conditioning sharpens which island the model lands on without enlarging the archipelago. The synthesis is the title: a population of agents whose variation is a costume change, not a substrate.
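The trap can be reproduced with toy numbers (hypothetical trait profiles, not the paper's SFT/RL pipeline): a model that preserves each persona's trait ordering scores a high Spearman ρ on per-persona match, yet if it also exaggerates every trait away from the midpoint with little sampling variance, between-persona Cohen's d explodes.

```python
import random
from statistics import mean, pstdev

def spearman(a, b):
    """Spearman rank correlation via the d^2 formula (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    return 1 - 6 * sum((x - y) ** 2 for x, y in zip(ra, rb)) / (n * (n * n - 1))

def cohens_d(xs, ys):
    pooled = ((pstdev(xs) ** 2 + pstdev(ys) ** 2) / 2) ** 0.5
    return abs(mean(xs) - mean(ys)) / pooled

random.seed(2)

# Hypothetical five-trait targets for two opposed personas (0..1 scale).
target_a = [0.8, 0.3, 0.6, 0.4, 0.7]
target_b = [0.2, 0.7, 0.4, 0.6, 0.3]

def roleplay(target, n=20, gain=3.0):
    """A 'well-tuned' role-player: preserves the trait ordering (high
    fidelity) but pushes every trait away from the midpoint
    (polarization), with almost no sampling variance."""
    return [[0.5 + gain * (t - 0.5) + random.gauss(0, 0.01) for t in target]
            for _ in range(n)]

out_a, out_b = roleplay(target_a), roleplay(target_b)
mean_a = [mean(col) for col in zip(*out_a)]

print(spearman(target_a, mean_a))      # per-persona fidelity: near 1.0
print(cohens_d([r[0] for r in out_a],  # between-persona effect size:
               [r[0] for r in out_b])) # far beyond d = 6
```

Fidelity and polarization rise together here by construction, which is the point: the first metric cannot see the second.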
What collapses, where
The collapse is not uniform across attributes. We measure mention rates — how often a model's persona-conditioned response references a given attribute — and the hierarchy is steep:
- Stereotypically salient attributes survive. Gender (91%) and country (90%) get reliably reflected. The model latches onto coarse categories.
- Politically loaded attributes get hedged. Political spectrum gets explicit reference about 62% of the time, and the model names the leaning more readily than it lets the leaning shape the answer.
- Lifecycle and class attributes get erased. Age drops to 36%; social class to 27%. The model talks about who someone is but not what their life situation is.
This is what makes "diverse model" claims slippery. A model can score well on demographic parity (gender, country) while erasing the dimensions that actually structure human disagreement (age, class). Coverage averaged across attributes hides the fact that the model is reading some persona axes and ignoring others.
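The shape of the mention-rate metric can be sketched with a naive keyword match over invented responses (toy data only; the paper's actual measurement is more involved than substring counting):

```python
import re

# Hypothetical keyword lists and responses, for illustration.
ATTRIBUTE_KEYWORDS = {
    "gender": {"woman", "man", "she", "he"},
    "country": {"germany", "brazil", "japan"},
    "age": {"teenager", "retired", "elderly"},
    "social_class": {"working-class", "wealthy", "afford"},
}

responses = [
    "As a woman from Germany, I think I would support this.",
    "A man in Brazil might disagree; tradition matters to him.",
    "A teenager from Japan, she would hesitate before agreeing.",
    "As a retired working-class woman, I could not afford it.",
]

def mention_rate(attr):
    """Share of responses whose tokens hit any keyword for `attr`."""
    kws = ATTRIBUTE_KEYWORDS[attr]
    hits = sum(bool(kws & set(re.findall(r"[a-z-]+", r.lower())))
               for r in responses)
    return hits / len(responses)

for attr in ATTRIBUTE_KEYWORDS:
    print(f"{attr}: {mention_rate(attr):.0%}")
```

Even on four toy responses the hierarchy appears: gender is mentioned everywhere, social class almost nowhere, and an average over attributes would hide exactly that gap.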
What we recommend
The paper closes with concrete recommendations. The shortest version:
- Researchers using LLMs as synthetic populations: measure within-group variance, not only mean response. A persona-prompted population that matches your target distribution in aggregate can still be collapsed inside each cell.
- Practitioners fine-tuning for role-play: per-persona fidelity is not a proxy for population diversity. Track both. SFT and RL on persona-following can reduce coverage even while lifting fidelity scores.
- Reviewers of LLM-as-population studies: ask which axis was used to certify diversity, and treat single-benchmark diversity claims as domain-specific. The same model can be diverse on one task and collapsed on another.
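The first recommendation can be sketched as a two-line audit (simulated numbers, not data from the paper): a synthetic population whose cell means straddle the target reproduces the aggregate mean perfectly while every cell is near-deterministic.

```python
import random
from statistics import mean, pvariance

random.seed(3)

# Target: each demographic cell answers a 1-7 item with mean 4.0
# and real disagreement inside every cell.
human_cells = [[random.gauss(4.0, 1.5) for _ in range(100)]
               for _ in range(5)]

# Synthetic population: cell-level answers straddle 4.0 so the
# aggregate matches, but each cell has collapsed to one answer.
cell_answers = [2.0, 3.0, 4.0, 5.0, 6.0]
synth_cells = [[a + random.gauss(0, 0.05) for _ in range(100)]
               for a in cell_answers]

def aggregate_mean(cells):
    return mean(x for c in cells for x in c)

def mean_cell_variance(cells):
    return mean(pvariance(c) for c in cells)

# Aggregate means are nearly identical...
print(aggregate_mean(human_cells), aggregate_mean(synth_cells))
# ...but within-cell variance exposes the collapse.
print(mean_cell_variance(human_cells), mean_cell_variance(synth_cells))
```

Reporting only the first line would certify the synthetic population as faithful; the second line is the check this recommendation asks for.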
What the paper does not claim
We do not claim persona prompting is useless. We do not claim collapse is identical across model families — it is not, and the microsite documents the variance. We do not claim that the geometric framing rules out future architectures that close the gap. Our claim is narrower: at the current capability frontier, persona-prompted populations are a distorted mirror of human populations, and the distortion has a regular structure that can be measured. Studies built on this substrate should report what the substrate is doing.
For the full argument — coverage / uniformity / complexity definitions, the truncation hierarchy, the SFT/RL pipeline analysis, the domain-reversal finding, and per-scenario evidence in six morally charged cases — the microsite is above. The "Collapse in Action" section in particular shows two maximally-opposed personas getting the same answer; that pattern is the argument in one example.
Comments, replications, and counter-cases are welcome — especially from groups using persona-prompted LLMs as research substrate. The point of the paper is not that the practice should stop; it is that the substrate has measurable structure, and reports built on it should disclose what that structure is doing to the conclusions.