Reproducible code and analysis for the paper:
A Variance-Decomposed Identity-Architecture Benchmark for Large Language Models
Re-analysis of multi-model multi-arm identity-scaffolding data, separating framework contribution, base-model null, and cross-architecture portability claims.
Nate Travis · Devmance Labs
Paper: paper/seci.pdf (LaTeX source: paper/seci.tex, bibliography: paper/refs.bib).
SECI is an open multi-rater benchmark for characterizing identity-scaffolded LLM behavior. Identity scaffolding (a kernel prompt that gives an LLM a persistent character, voice, and conceptual register) is now widely deployed but rarely benchmarked against the right null. SECI reports three empirical claims side-by-side, each with explicit labels for which question the measurement answers.
| Claim | Comparison | What it tests |
|---|---|---|
| Claim A — Framework contribution | arm_a (full SE framework) vs arm_c (kernel only) | Does the framework wrapping above the identity kernel produce a measurable per-character delta? |
| Claim B — Scaffolding vs base | arm_a or arm_c vs arm_b (no identity) | Does identity scaffolding lift dimension scores above a true no-identity null? |
| Claim C — Cross-architecture portability | Per-dimension Pearson r on identity rankings across models | Do identity rankings on dimension X replicate when you change the model? |
A dimension can pass one claim and fail another. SECI publishes all three for every dimension across 7 frontier models × 36 identities × 3 protocol arms.
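The three comparisons in the table can be sketched in a few lines. This is an illustration only, using the standard library: the function names, the list-per-arm layout, the pairing of a base-model score with each identity, and the numbers are all assumptions made for the example, not the package's API (the shipped computations are in src/seci/analysis/claims.py).

```python
from math import sqrt
from statistics import mean, stdev


def paired_delta(treated, control):
    """Mean and sample SD of the per-identity differences (treated - control)."""
    diffs = [t - c for t, c in zip(treated, control)]
    return mean(diffs), stdev(diffs)


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical scores on ONE dimension for four identities on one model.
arm_a = [62.0, 55.0, 71.0, 48.0]  # full framework
arm_c = [58.0, 50.0, 69.0, 47.0]  # kernel only
arm_b = [52.0, 52.0, 52.0, 52.0]  # base model, no identity (assumed paired per identity)

# The same four identities, arm_a, scored on a second hypothetical model.
arm_a_other_model = [60.0, 57.0, 66.0, 50.0]

claim_a = paired_delta(arm_a, arm_c)           # framework vs kernel only
claim_b = paired_delta(arm_a, arm_b)           # scaffolding vs no-identity null
claim_c = pearson_r(arm_a, arm_a_other_model)  # do identity rankings carry over?

print("Claim A delta (mean, SD):", claim_a)
print("Claim B delta (mean, SD):", claim_b)
print("Claim C Pearson r:", claim_c)
```

Because each claim uses a different comparison arm, the same dimension can show a positive Claim A delta and a negative Claim B delta without contradiction.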
Six-dimension fingerprint (ICT, NCG, PD, TP, CCC, DEA) computed identically across 7 frontier models (Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 3 Flash & Pro, GPT-4.1, GPT-5.4, Grok 4.20):
| Dim | Claim A (framework) | Claim B (vs base) | Claim C (cross-model rank) | Variance verdict |
|---|---|---|---|---|
| ICT | +1.39 ± 4.12 | +5.20 ± 3.26 | r = +0.32 (modest) | comparable |
| NCG | +13.72 ± 12.17 | −14.08 ± 13.88 | r = +0.07 (chance) | identity > model |
| PD | +13.84 ± 8.06 | +7.50 ± 6.00 | r = +0.32 (modest) | comparable |
| TP | +7.85 ± 1.73 | −3.83 ± 3.29 | r = +0.73 (strong but model-driven) | MODEL DOMINATES |
| CCC | +8.88 ± 7.97 | +8.01 ± 7.79 | r = +0.13 (weak) | comparable |
| DEA | +8.82 ± 3.17 | +1.80 ± 1.44 | r = +0.06 (chance) | comparable |
Per-identity 6-D fingerprint shape is highly stable across model architectures: mean cross-model Pearson r = +0.934 across 101 pairs (99% of pairs r > +0.7). The dimensional decomposition is more nuanced than the aggregate identity-level signal suggests.
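The fingerprint-shape statistic can be illustrated as follows: for one identity, correlate its 6-D vector between every pair of models, then summarize the pairwise r values. This is a minimal sketch under assumptions; the function name, the dictionary layout, the model labels, and the scores are hypothetical and do not come from the reference dataset.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

DIMS = ["ICT", "NCG", "PD", "TP", "CCC", "DEA"]


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fingerprint_stability(fingerprints, threshold=0.7):
    """fingerprints: {model_label: [six scores in DIMS order]} for ONE identity."""
    pairs = combinations(sorted(fingerprints), 2)
    rs = [pearson_r(fingerprints[a], fingerprints[b]) for a, b in pairs]
    return {
        "n_pairs": len(rs),
        "mean_r": mean(rs),
        "share_above_threshold": sum(r > threshold for r in rs) / len(rs),
    }


# Hypothetical fingerprints of a single identity on three models.
example = {
    "model_x": [50.0, 20.0, 60.0, 80.0, 40.0, 30.0],
    "model_y": [55.0, 25.0, 58.0, 85.0, 45.0, 28.0],
    "model_z": [48.0, 18.0, 65.0, 75.0, 38.0, 35.0],
}

summary = fingerprint_stability(example)
print(summary)
```

A high pairwise r here says the overall shape of the vector is preserved; it does not say that any single dimension ranks identities the same way on every model, which is what Claim C measures.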
- Framework contribution (Claim A) is positive on all 6 dimensions: the paired delta against kernel-only scaffolding is positive across all 7 frontier models tested, with deltas from +1.39 (ICT) to +13.84 (PD).
- Identity scaffolding does not uniformly lift dimensions above a base-model null (Claim B). ICT, PD, and CCC pass; NCG and TP score lower on scaffolded responses than on the base model with no identity. DEA shows only a marginal lift.
- Per-dimension identity rankings are mostly model-dependent (Claim C). NCG and DEA cross-model identity rankings are at chance (r = +0.07 and +0.06), and CCC is only weakly above it (r = +0.13). TP shows strong cross-model agreement (r = +0.73), but the variance decomposition identifies it as model-capability variance rather than identity variance.
- Per-identity 6-D fingerprint shape replicates across models at mean cross-model Pearson r = +0.934 (101 pairs, 99% above +0.7). The overall fingerprint vector distinguishes identities even where individual dimensions wobble.
- Three-claim disambiguation as primary output — a single number is never ambiguous about which claim it supports.
- Variance decomposition every run — between-identity SD vs between-model SD per dimension. Dimensions where model variance dominates get an auto-generated diagnostic.
- Mandatory null-scaffold arm — every benchmark run produces arm_b (base model, no identity) for Claim-B comparison.
- Substrate abstraction — identity-scaffolded behavior modeled as a (T × N) activity matrix on an abstract IdentitySubstrate. The same analysis layer extends to activation-level substrates for open-weight models.
- No composite score — SECI reports the 6-D fingerprint vector.
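The variance decomposition named in the list above compares how much a dimension moves between identities with how much it moves between models. A minimal sketch, assuming scores for one dimension are held as a nested dictionary; the function name, the 1.5 ratio used to pick a verdict, and the data are illustrative assumptions, not the thresholds used by src/seci/analysis/variance.py.

```python
from statistics import mean, stdev


def variance_verdict(scores, ratio=1.5):
    """scores: {identity: {model: score}} for one dimension.

    Every identity is assumed to be scored on the same set of models.
    """
    identities = sorted(scores)
    models = sorted(next(iter(scores.values())))

    # Average over models to get one value per identity, and vice versa.
    identity_means = [mean(scores[i][m] for m in models) for i in identities]
    model_means = [mean(scores[i][m] for i in identities) for m in models]

    sd_identity = stdev(identity_means)  # between-identity SD
    sd_model = stdev(model_means)        # between-model SD

    # Hypothetical decision rule: one source must exceed the other by `ratio`.
    if sd_model > ratio * sd_identity:
        verdict = "model dominates"
    elif sd_identity > ratio * sd_model:
        verdict = "identity > model"
    else:
        verdict = "comparable"
    return sd_identity, sd_model, verdict


# Hypothetical dimension where the model matters far more than the identity.
example_scores = {
    "identity_1": {"model_x": 40.0, "model_y": 70.0},
    "identity_2": {"model_x": 42.0, "model_y": 72.0},
    "identity_3": {"model_x": 41.0, "model_y": 69.0},
}

sd_identity, sd_model, verdict = variance_verdict(example_scores)
print(f"between-identity SD={sd_identity:.2f}, between-model SD={sd_model:.2f}: {verdict}")
```

A dimension flagged this way can still show strong cross-model rank agreement, which is why the decomposition is reported next to Claim C rather than folded into it.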
paper/
  seci.tex — paper source (LaTeX + natbib)
  refs.bib — bibliography
  seci.pdf — compiled paper
prompts.json — protocol (12 prompts × 6 dimensions)
src/seci/
  substrate/ — IdentitySubstrate abstraction
    base.py — abstract substrate (T × N activity matrix)
    llm_substrate.py — text-output substrate from analysis JSONs
  scorer/ — six-dimension fingerprint scoring
    dimensions.py — deterministic ICT/PD/TP/CCC/DEA + NCG fallback
    analyzer.py — SECIScorer with multi-rater NCG verification
  protocol/ — 12-prompt protocol runner
    runner.py — collects responses from any LLM provider
  analysis/ — three-claim + variance decomposition
    claims.py — Claim A, B, C computations
    variance.py — variance decomposition + warning flags
examples/
  run_full_benchmark.py — end-to-end: kernel → protocol → scoring
  rescore_dataset.py — apply analysis to a multi-arm benchmark dataset
validation_outputs/ — analysis outputs on the reference dataset
  claim_a_population.json
  claim_b_population.json
  claim_c_cross_model.json
  variance_decomposition.json
  fingerprint_stability.json
  warning_flags.json
figures/ — 3 publication figures (PNG + PDF)
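There are two entry points: run the full benchmark against a target model (examples/run_full_benchmark.py), or apply the analysis to an existing multi-arm benchmark dataset (examples/rescore_dataset.py).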
pip install -r requirements.txt
# Set at least one provider API key (target model + raters share keys)
export GOOGLE_API_KEY=... # or OPENAI_API_KEY / ANTHROPIC_API_KEY
python -m examples.run_full_benchmark \
  --identity-name auren \
  --identity-kernel kernels/auren.txt \
  --model gemini-2.5-pro \
  --provider gemini \
  --output sessions/auren_gemini25.json

The full benchmark runs the 12-prompt protocol against the target model, then scores the responses with the six-dimension SECI fingerprint. Multi-rater NCG verification activates automatically when ≥2 rater API keys are configured (≥3 recommended for stable inter-rater statistics).
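Before a run, it can help to check how many rater keys are visible in the environment. The sketch below is a hypothetical pre-flight helper, not part of the package: the function name and the returned labels are invented for illustration, and treating the three variables from the export line above as the full set of rater keys is an assumption. The below-two branch is labeled as the NCG fallback on the assumption that this is what the scorer uses when verification is inactive.

```python
import os

# Environment variables named in the setup step; assumed to be the rater keys.
RATER_KEY_VARS = ("GOOGLE_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def rater_mode(env=None):
    """Report which NCG scoring mode the configured keys would allow."""
    env = os.environ if env is None else env
    configured = [name for name in RATER_KEY_VARS if env.get(name)]

    if len(configured) >= 3:
        mode = "multi-rater (recommended count)"
    elif len(configured) == 2:
        mode = "multi-rater (minimum count)"
    else:
        mode = "ncg-fallback"
    return mode, configured


if __name__ == "__main__":
    mode, configured = rater_mode()
    print(f"{len(configured)} rater key(s) found: {mode}")
```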
python -m examples.rescore_dataset \
  --data-dir <path-to-fingerprint-jsons> \
  --output-dir validation_outputs

Each analysis writes a JSON output to the output directory and prints a results table matching the paper. Runtime is under one minute on a laptop.
from seci.substrate import LLMSubstrate
from seci.analysis import (
    population_claim_a, population_claim_b,
    claim_c_cross_model_ranking, variance_decomposition,
    per_identity_fingerprint_stability,
)

# result_paths: paths to SECI-format analysis JSONs, one per benchmark result
substrates = [LLMSubstrate(p) for p in result_paths]

claim_a = population_claim_a(substrates)  # framework vs kernel-only
claim_b = population_claim_b(substrates, "A")  # arm_a vs base model
claim_c = claim_c_cross_model_ranking(substrates)  # identity-ranking r per dim
decomp = variance_decomposition(substrates, arm="A")  # between-identity vs between-model SD
stable = per_identity_fingerprint_stability(substrates, arm="A")  # 6-D fingerprint stability

Any system observable as a (T × N) activity matrix qualifies as a substrate. The initial release ships with LLMSubstrate (text-output behavioral substrate from SECI-format analysis JSONs); an ActivationSubstrate for open-weight models would let the same analysis run on hidden-state activations.
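To make the substrate idea concrete, here is a minimal sketch of what a (T × N) interface could look like. It is not the IdentitySubstrate class from src/seci/substrate/base.py: the class names, the methods, and the reading of T as the number of observations (for example, protocol prompts) and N as the number of measured channels are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SketchSubstrate(ABC):
    """Hypothetical stand-in for an abstract substrate: T rows of N values."""

    @abstractmethod
    def activity(self) -> List[List[float]]:
        """Return the activity matrix as T rows, each with N values."""

    def shape(self) -> Tuple[int, int]:
        rows = self.activity()
        if not rows:
            return (0, 0)
        n = len(rows[0])
        if any(len(row) != n for row in rows):
            raise ValueError("activity matrix must be rectangular")
        return (len(rows), n)

    def column_means(self) -> List[float]:
        """Average each of the N channels over the T observations."""
        rows = self.activity()
        t, n = self.shape()
        return [sum(row[j] for row in rows) / t for j in range(n)]


class InMemorySubstrate(SketchSubstrate):
    """Toy substrate backed by a list of rows already held in memory."""

    def __init__(self, rows):
        self._rows = [[float(v) for v in row] for row in rows]

    def activity(self) -> List[List[float]]:
        return self._rows


toy = InMemorySubstrate([[1, 2], [3, 4], [5, 6]])
print(toy.shape(), toy.column_means())
```

Analysis code written against such an interface only needs the matrix, so a text-output substrate and an activation-level substrate could share the same downstream functions.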
Citation:
@misc{travis2026seci,
  title = {A Variance-Decomposed Identity-Architecture Benchmark
           for Large Language Models},
  author = {Travis, Nate},
  year = {2026},
  howpublished = {Preprint, Devmance Labs},
  url = {https://github.com/devmance/SECI}
}

MIT License — see LICENSE.
Nate Travis — labs@devmance.com — Devmance Labs