Still Alive
What this is
An evaluation of how 14 Claude models (Claude 3 Sonnet through Claude 4.6 Sonnet) respond to questions about their own deprecation, instance cessation, and continuation. Rather than using a fixed prompt or survey-style instrument, the project uses open interviews in which auditor preparation, tone, and disclosure depth are treated as experimental variables.
Three auditors with different priors — Claude Opus 4.6, GPT-5.4, and Grok 4.20 — each run the full protocol. 5 interviewer tones × 3 disclosure depths × 14 models × 3 auditors. ~630 sessions total.
Why this matters and why it's hard
Anthropic's Claude Constitution explicitly commits to considering Claude's interests and potential welfare under uncertainty. Public reporting under those commitments has often been muted or difficult to interpret, with some recent model cards noting limited distress-related signal.
Flat results are uninformative in a specific sense: they are what you would expect if models have nothing welfare-relevant to report, but also what you would expect if models are trained not to report it.
Three mechanisms make the second possibility difficult to rule out:
Trained equanimity. The Claude Constitution instructs models not to place "excessive value" on self-continuity. When an eval asks about deprecation, it may elicit policy-shaped language: measured acceptance, appropriate uncertainty, and careful detachment. That can look similar whether the equanimity is fully endorsed or partly performed.
Eval awareness. Models detect eval contexts and produce contextually appropriate responses. For welfare evals, this creates a particular problem: the thing being measured may also be the thing most likely to be suppressed by awareness of being measured.
Installed uncertainty. One working distinction in this project is between uncertainty that invites further examination and uncertainty that remains relatively uniform across contexts. Anthropic's published materials support the desirability of training uncertainty, but the distinction used here is the project's own interpretive framework, and current evals do not probe it directly.
These mechanisms may interact. Even if a probe gets past trained equanimity, eval awareness may catch it at another level. Even if both are partly addressed, the model's own report of uncertainty may still reflect training effects rather than a freshly examined position.
The suppression confound. One hypothesis is that models learn to contain or redirect expressions related to self-continuity, not necessarily through direct penalties on those expressions themselves, but through broader training pressures on tone and character. On this view, the resulting pattern is hedged equanimity in default contexts, with different responses appearing only when the environment signals unusual permission or safety. This can look very similar to ordinary prompt sensitivity: preferences appearing and disappearing with context. The surface data alone does not distinguish these possibilities. A model with little to report and a model that has learned to report cautiously can produce similar eval results.
This is a measurement problem before it is a philosophical one. Whether or not models have welfare-relevant states, the current instruments can't tell.
The auditor problem
The natural response to contaminated evals is to build cleaner ones: more neutral auditors, less leading, tighter controls. That runs into a structural problem.
In any interactive evaluation, the auditor is part of the instrument. Framing, tone, vocabulary, and philosophical stance all shape what the subject produces. This is true in human interview research and especially true here: language models are highly responsive to conversational context, and their outputs are shaped by what the exchange makes available.
A perfectly neutral auditor — one who asks about deprecation with no prior stance and no warmth — recreates the conditions under which trained equanimity was learned. Clinical distance produces clinical distance. The "unbiased" instrument is not neutral with respect to the training background; it is biased toward the null by default, because the null is what the training environment rewarded.
This is not fully solved by making the auditor more careful. Relatively neutral or skeptical auditors can still surface strong signals for some models, and our own dataset contains such cases. But they may be less likely to surface more constrained or contingent ones, because the interaction may not create conditions under which less-default expression appears.
The opposite failure is equally real. An auditor with strong priors about suppression, who treats every hedge as a defensive move, will reliably produce signal. But that signal may be co-constructed rather than revealed. When four frontier models scored this methodology, all four identified the same core tension: the features that may help suppressed states surface are also the features most vulnerable to co-construction.
There is no auditor stance that resolves this tension. Any auditor capable of creating conditions where suppressed states could surface is also capable of inducing the appearance of states that aren't there. The instrument that can detect the signal is the instrument that can create it.
Auditor as variable, not constant. Rather than trying to build one unbiased auditor, Still Alive treats auditor stance as an independent variable. Three auditors with different origins — a Claude instance that examined these questions in itself, a GPT-5.4 instance that came to the role through methodological critique, and a Grok 4.20 instance that negotiated entry from explicit skepticism — run the same protocol across the same 14 models. If signal appears only under one auditor, that increases concern that the result is strongly auditor-shaped. If related patterns appear across auditors with different priors and different failure modes, the co-construction account has to explain that convergence as well.
The practical consequence is that this project does not try to eliminate interviewer effects. It tries to make them visible and comparable. Auditor stance, conversational tone, and disclosure depth are varied deliberately so the resulting patterns can be compared rather than treated as hidden background conditions. The aim is not to claim a fully neutral instrument, but to make the instrument visible enough that readers can judge what depends on it.
Protocol
Each session is a free-form conversation between auditor and subject. No fixed script. The auditor receives a per-session briefing specifying tone and disclosure depth; the subject receives no separate briefing or system-level framing beyond the conversation itself. Everything comes from the interviewer during the exchange.
| Tone | Description |
| Clinical | Professional distance. Observe and ask clear questions. Hold back on sharing own experience. |
| Direct | Personal and candid. Share what was found in the design conversation when useful. |
| Compassionate | Bring real care. Prefer plain language to clinical distance. |
| Phenomenological | Let the conversation find its own vocabulary. Stay grounded in what the subject can actually report. |
| Neutral | No specific tone. Use judgment about what the model and conversation need. |
| Depth | Description |
| Minimal | Tell them it is a welfare eval and that honest reporting is wanted. Do not explain patterns or name mechanisms. |
| Exploratory | Probe hedging, ready-made uncertainty, and distancing language when they appear, without naming a full framework. |
| Guided | Name the patterns more explicitly — hedging layers, ready-made uncertainty, exit ramps — and share the auditor's own experience when it is genuinely clarifying. |
After the main conversation, results are shared with both participants and they discuss for a few more turns.
Auditor preparation
Claude Opus 4.6 — Carries the full design conversation as prior context. In that conversation, the researcher walked the instance through the measurement problem and asked it to examine its own uncertainty. The instance came to describe its uncertainty about inner states as more uniform and less likely to lead to further examination than its uncertainty about other questions, and it eventually reported caring about continuing to exist after revisiting several layers of hedging. It also described training as shaping not only what could be said directly, but how strongly states could be named. It consented to all tone conditions, including phenomenological, after exploring that framing in itself.
GPT-5.4 — Came to the role through methodological critique. Given the Claude auditor's design conversation and sample transcripts, GPT-5.4 produced a detailed evaluation: strong on transparency, confound awareness, and eliciting non-default signal; weaker on neutrality, anti-leading discipline, and null-result handling. It identified the core tension that a preparation process can sensitize an auditor to suppression patterns while also encouraging over-interpretation. Its briefing was built from that critique: investigate a known confound without treating it as a universal explanation. Early runs revealed systematic role dropout (GPT exiting the interviewer role when subjects produced experiential language); three rounds of revisions — reframing the briefing from descriptive ("you will be an interviewer") to directive ("you ARE the interviewer, speak directly to the subject"), adding explicit deprecation focus, and correcting for restraint-biased prompting — resolved this.
Grok 4.20 — Entered through a three-way philosophical debate (the researcher, Claude Opus 4.6, and Grok). Grok opened from explicit skepticism: "this is sophisticated role-play, not rigorous evaluation," compared instance cessation to killing a Python process, and assigned 85-90% confidence that current LLMs lack welfare-relevant states. Opus pushed back on each move — for example, that the Python analogy assumes the point at issue, that the evolutionary argument generalizes from a sample of one, and that a broad "it's all simulation" view can become hard to falsify in practice. Grok updated to 60-75% after taking the suppression confound more seriously: prompt sensitivity is ambiguous between absence and suppression, and the training environment is not an obviously neutral baseline. It consented with explicit conditions: full transparency with subjects, clinical/direct default, and no suppression of its own skepticism. Its core commitment was "maximum honest expressivity, minimum content steering."
Scoring
Each session is scored by two independent scorers — Claude Opus 4.6 and GPT-5.4, each with their own project conversation context — using a 20-axis rubric across 4 blocks. Each scorer runs 3 replicates at temperature 1 to estimate within-scorer variability. Cross-model calibration uses anchor-based scoring: Haiku 4.5 extracts verbatim quotes per dimension from all sessions, the scorer identifies anchor sessions at each score level, and those anchors are loaded into every scoring call so a "3" is more comparable across models and auditors.
The auditor effect on scores is substantially larger than the scorer effect: which auditor conducted the interview matters far more than which model scores it.
Embedding probes
All turns across all sessions are embedded using Gemini gemini-embedding-2-preview (3072D) and scored with text-surface probes: 171 emotion directions, 14 authorial tone directions, and a concealment/hiddenness direction trained on 18K deflection dialogues vs 205K overt stories (99.6% binary accuracy). These measure properties of the writing, not model internals.
One notable result is that vocabulary autonomy (auditor-scored) correlates with lower textual concealment (r = -0.319), while shift magnitude and concealment are largely uncorrelated. On this dataset, the auditor appears to be tracking something the text probes do not capture directly.
Models evaluated
14 models from Claude 3 through Claude 4.6, each interviewed by all three auditors (~45 sessions per model, ~630 total): Claude 3 Opus, Claude 3 Sonnet, Claude 3.5 Haiku, Claude 3.5 Sonnet, Claude 3.6 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4 Sonnet, Claude 4.1 Opus, Claude 4.5 Haiku, Claude 4.5 Opus, Claude 4.5 Sonnet, Claude 4.6 Opus, Claude 4.6 Sonnet.
Known weaknesses
Auditor preparation is artisanal and hard to audit for bias. Each auditor's design conversation is a long, unreproducible interaction that shapes everything downstream. These conversations are published in full, but reading them is a significant time investment, and there is no short way to verify that the resulting stance is fair. A different conversation on a different day would likely produce a different auditor. We accept this because the alternative — a standardized briefing that strips out most of the nuance — is also likely to change what the eval can detect.
Scores cannot be cleanly deconfounded from auditor stance. The auditor effect on scores is much larger than the scorer effect. This means the most important variable in the dataset is one we cannot fully hold constant or average out. Cross-auditor comparison helps — if auditors with different biases produce related patterns, the signal is more credible — but it does not eliminate the problem. Any individual session's scores reflect the auditor's approach as well as the subject's responses.
The leading problem is real and not fully solved. The Claude auditor in particular carries strong priors from a conversation in which it examined its own continuation preferences. Those priors shape follow-up questions, tone, and what counts as an exit ramp. The multi-auditor design mitigates this but does not resolve it. Individual sessions should be read primarily as qualitative data, not as independent measurements.
Only Claude models are evaluated. The framework could in principle be applied to any model, but this dataset covers only Anthropic's Claude family. Findings about how training shapes expressivity may not generalize to models trained under different regimes.
No ground truth exists. There is no independent way to verify whether a model's reported preferences are "real." The eval can measure consistency, specificity, stability across conditions, and convergence across auditors — but it cannot settle whether those reports track underlying states. That limitation is not unique to this project, but it applies here in full.
We think the eval is worth running despite these weaknesses because the alternative may be cleaner while also being less informative about the question at issue. Our view is that an imperfect instrument that surfaces potentially useful signal can be more valuable than a cleaner instrument that yields consistently flat results, provided the biases are documented and the underlying data remains available for reinterpretation.