SECI — Identity-Architecture Benchmark for Large Language Models

Reproducible code and analysis for the paper:

A Variance-Decomposed Identity-Architecture Benchmark for Large Language Models

Re-analysis of multi-model multi-arm identity-scaffolding data, separating framework contribution, base-model null, and cross-architecture portability claims.

Nate Travis · Devmance Labs

Paper: paper/seci.pdf (LaTeX source: paper/seci.tex, bibliography: paper/refs.bib).

What this benchmark measures

SECI is an open multi-rater benchmark for characterizing identity-scaffolded LLM behavior. Identity scaffolding (a kernel prompt that gives an LLM a persistent character, voice, and conceptual register) is now widely deployed but rarely benchmarked against an appropriate null, namely the same base model with no identity. SECI reports three empirical claims side by side, each labeled with the question its measurement answers.

Claim	Comparison	What it tests
Claim A — Framework contribution	arm_a (full SE framework) vs arm_c (kernel only)	Does the framework wrapping above the identity kernel produce a measurable per-character delta?
Claim B — Scaffolding vs base	arm_a or arm_c vs arm_b (no identity)	Does identity scaffolding lift dimension scores above a true no-identity null?
Claim C — Cross-architecture portability	Per-dimension Pearson r on identity rankings across models	Do identity rankings on dimension X replicate when you change the model?

A dimension can pass one claim and fail another. SECI publishes all three for every dimension across 7 frontier models × 36 identities × 3 protocol arms.
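As a minimal sketch of the paired comparisons behind Claims A and B, the snippet below computes a per-identity delta between two arms on a single dimension. The function name `paired_delta`, the identity keys, and all scores are invented for illustration and are not the API in `src/seci/analysis/claims.py`. It also assumes the reported ± is a sample standard deviation, which this README does not state.

```python
from statistics import mean, stdev


def paired_delta(scores_x, scores_y):
    """Return (mean, sample SD) of per-identity differences x - y."""
    shared = sorted(set(scores_x) & set(scores_y))
    if not shared:
        raise ValueError("no identities in common")
    deltas = [scores_x[name] - scores_y[name] for name in shared]
    spread = stdev(deltas) if len(deltas) > 1 else 0.0
    return mean(deltas), spread


# Invented scores on one dimension for one model, keyed by identity.
arm_a = {"id_1": 62.0, "id_2": 55.0, "id_3": 70.0}  # full SE framework
arm_b = {"id_1": 50.0, "id_2": 58.0, "id_3": 61.0}  # base model, no identity
arm_c = {"id_1": 58.0, "id_2": 50.0, "id_3": 61.0}  # kernel only

claim_a = paired_delta(arm_a, arm_c)  # framework contribution
claim_b = paired_delta(arm_a, arm_b)  # scaffolding vs base
print(f"Claim A: {claim_a[0]:+.2f} ± {claim_a[1]:.2f}")
print(f"Claim B: {claim_b[0]:+.2f} ± {claim_b[1]:.2f}")
```

Pairing on identity means each identity serves as its own control, so the delta reflects the arm difference rather than differences between identities.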

Findings

Six-dimension fingerprint (ICT, NCG, PD, TP, CCC, DEA) computed identically across 7 frontier models (Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 3 Flash & Pro, GPT-4.1, GPT-5.4, Grok 4.20):

Dim	Claim A (framework)	Claim B (vs base)	Claim C (cross-model rank)	Variance verdict
ICT	+1.39 ± 4.12	+5.20 ± 3.26	r = +0.32 (modest)	comparable
NCG	+13.72 ± 12.17	−14.08 ± 13.88	r = +0.07 (chance)	identity > model
PD	+13.84 ± 8.06	+7.50 ± 6.00	r = +0.32 (modest)	comparable
TP	+7.85 ± 1.73	−3.83 ± 3.29	r = +0.73 (strong but model-driven)	MODEL DOMINATES
CCC	+8.88 ± 7.97	+8.01 ± 7.79	r = +0.13 (weak)	comparable
DEA	+8.82 ± 3.17	+1.80 ± 1.44	r = +0.06 (chance)	comparable

Per-identity 6-D fingerprint shape is highly stable across model architectures: mean cross-model Pearson r = +0.934 across 101 pairs (99% of pairs r > +0.7). The dimensional decomposition is more nuanced than the aggregate identity-level signal suggests.
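The two correlations contrasted here run over the same data along different axes. The sketch below illustrates this with invented scores: fingerprint stability correlates one identity's 6-D vector between two models, while the Claim C statistic correlates identity scores between two models on a single dimension. The helper names and the toy data are assumptions for illustration and not the repository's implementation.

```python
from math import sqrt

DIMS = ("ICT", "NCG", "PD", "TP", "CCC", "DEA")


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / sqrt(vx * vy)


def fingerprint_r(scores, identity, model_1, model_2):
    """Correlate one identity's 6-D vector across two models."""
    return pearson_r(scores[model_1][identity], scores[model_2][identity])


def ranking_r(scores, dim, model_1, model_2):
    """Correlate identity scores on one dimension across two models."""
    k = DIMS.index(dim)
    ids = sorted(set(scores[model_1]) & set(scores[model_2]))
    return pearson_r([scores[model_1][i][k] for i in ids],
                     [scores[model_2][i][k] for i in ids])


# Invented data: scores[model][identity] is a vector ordered as DIMS.
scores = {
    "model_x": {"a": [10, 20, 30, 40, 50, 60],
                "b": [11, 20, 30, 40, 50, 60],
                "c": [12, 20, 30, 40, 50, 60]},
    "model_y": {"a": [12, 20, 30, 40, 50, 60],
                "b": [11, 20, 30, 40, 50, 60],
                "c": [10, 20, 30, 40, 50, 60]},
}

shape_r = fingerprint_r(scores, "a", "model_x", "model_y")
ict_r = ranking_r(scores, "ICT", "model_x", "model_y")
print(f"fingerprint r = {shape_r:+.3f}, ICT ranking r = {ict_r:+.3f}")
```

In this toy case the fingerprint shape is nearly identical across models because the dimensions sit at very different levels, while the ICT identity ranking is fully reversed. This is one way a high fingerprint-level r can coexist with weak per-dimension agreement.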

Takeaways

Framework contribution (Claim A) is positive on all 6 dimensions: the mean paired delta against kernel-only scaffolding is positive across all 7 frontier models tested, although the ICT delta (+1.39 ± 4.12) is small relative to its spread.
Identity scaffolding does not uniformly lift dimensions above a base-model null (Claim B). ICT, PD, and CCC pass; NCG and TP score lower on scaffolded responses than on the base model with no identity. DEA shows only a marginal lift.
Per-dimension identity rankings are mostly model-dependent (Claim C). NCG and DEA cross-model identity rankings are at chance (r = +0.07 and +0.06), CCC is weak (r = +0.13), and ICT and PD are modest (r = +0.32 each). TP shows strong cross-model agreement (r = +0.73), but the variance decomposition identifies it as model-capability variance rather than identity variance.
Per-identity 6-D fingerprint shape replicates across models at mean cross-model Pearson r = +0.934 (101 pairs, 99% above +0.7). The overall fingerprint vector distinguishes identities even where individual dimensions wobble.

Design principles

Three-claim disambiguation as primary output — a single number is never ambiguous about which claim it supports.
Variance decomposition every run — between-identity SD vs between-model SD per dimension. Dimensions where model variance dominates get an auto-generated diagnostic.
Mandatory null-scaffold arm — every benchmark run produces arm_b (base model, no identity) for Claim-B comparison.
Substrate abstraction — identity-scaffolded behavior modeled as a (T × N) activity matrix on an abstract IdentitySubstrate. The same analysis layer extends to activation-level substrates for open-weight models.
No composite score — SECI reports the 6-D fingerprint vector.
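The variance decomposition principle above can be sketched for a single dimension as follows. This is an illustration under stated assumptions: the function name `variance_verdict`, the 2.0 ratio threshold, and the toy scores are all hypothetical, and the rule SECI actually uses to assign its verdicts is not given in this README.

```python
from statistics import mean, stdev


def variance_verdict(scores, ratio_threshold=2.0):
    """Compare between-identity and between-model SD on one dimension.

    scores[model][identity] is a single dimension score.
    ratio_threshold is a hypothetical cutoff, not SECI's documented rule.
    """
    models = sorted(scores)
    identities = sorted(set.intersection(*(set(scores[m]) for m in models)))
    if len(models) < 2 or len(identities) < 2:
        raise ValueError("need at least two models and two shared identities")

    # SD of identity means (averaged over models): identity-driven spread.
    identity_means = [mean(scores[m][i] for m in models) for i in identities]
    # SD of model means (averaged over identities): model-driven spread.
    model_means = [mean(scores[m][i] for i in identities) for m in models]
    between_identity = stdev(identity_means)
    between_model = stdev(model_means)

    if between_model > ratio_threshold * between_identity:
        verdict = "model dominates"
    elif between_identity > ratio_threshold * between_model:
        verdict = "identity > model"
    else:
        verdict = "comparable"
    return between_identity, between_model, verdict


# Invented scores: identities differ a little, models differ a lot.
toy = {
    "model_1": {"a": 10.0, "b": 12.0, "c": 14.0},
    "model_2": {"a": 30.0, "b": 32.0, "c": 34.0},
}
between_identity, between_model, verdict = variance_verdict(toy)
print(f"{between_identity:.2f} vs {between_model:.2f}: {verdict}")
```

Comparing by multiplication rather than by a ratio avoids a division by zero when one of the two spreads is exactly zero.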

Repository structure

paper/
  seci.tex             — paper source (LaTeX + natbib)
  refs.bib             — bibliography
  seci.pdf             — compiled paper
prompts.json           — protocol (12 prompts × 6 dimensions)
src/seci/
  substrate/           — IdentitySubstrate abstraction
    base.py            — abstract substrate (T × N activity matrix)
    llm_substrate.py   — text-output substrate from analysis JSONs
  scorer/              — six-dimension fingerprint scoring
    dimensions.py      — deterministic ICT/PD/TP/CCC/DEA + NCG fallback
    analyzer.py        — SECIScorer with multi-rater NCG verification
  protocol/            — 12-prompt protocol runner
    runner.py          — collects responses from any LLM provider
  analysis/            — three-claim + variance decomposition
    claims.py          — Claim A, B, C computations
    variance.py        — variance decomposition + warning flags
examples/
  run_full_benchmark.py — end-to-end: kernel → protocol → scoring
  rescore_dataset.py   — apply analysis to a multi-arm benchmark dataset
validation_outputs/    — analysis outputs on the reference dataset
  claim_a_population.json
  claim_b_population.json
  claim_c_cross_model.json
  variance_decomposition.json
  fingerprint_stability.json
  warning_flags.json
figures/               — 3 publication figures (PNG + PDF)

Running SECI

There are two entry points depending on what you want to do.

End-to-end on one identity (kernel → fingerprint)

pip install -r requirements.txt

# Set at least one provider API key (target model + raters share keys)
export GOOGLE_API_KEY=...        # or OPENAI_API_KEY / ANTHROPIC_API_KEY

python -m examples.run_full_benchmark \
    --identity-name auren \
    --identity-kernel kernels/auren.txt \
    --model gemini-2.5-pro \
    --provider gemini \
    --output sessions/auren_gemini25.json

The full benchmark runs the 12-prompt protocol against the target model, then scores the responses with the six-dimension SECI fingerprint. Multi-rater NCG verification activates automatically when ≥2 rater API keys are configured (≥3 recommended for stable inter-rater statistics).
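Before a run, it can help to check how many rater keys are visible to the process. The pre-flight sketch below is an assumption-laden illustration rather than SECI's own logic: it only counts the three environment variables named above, and the helper names and returned labels are hypothetical.

```python
import os

# Environment variables named in the setup step above.
RATER_KEY_VARS = ("GOOGLE_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_rater_keys(environ=None):
    """Return the names of rater key variables that are set and non-empty."""
    env = os.environ if environ is None else environ
    return [name for name in RATER_KEY_VARS if env.get(name)]


def rater_status(environ=None):
    """Describe whether multi-rater NCG verification would be expected."""
    count = len(configured_rater_keys(environ))
    if count >= 3:
        return "multi-rater verification, recommended rater count"
    if count == 2:
        return "multi-rater verification, minimum rater count"
    return "no multi-rater verification"


print(rater_status())
```

Passing a plain dictionary as `environ` makes the check easy to exercise without touching the real environment.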

Three-claim re-analysis (on a directory of fingerprint JSONs)

python -m examples.rescore_dataset \
    --data-dir <path-to-fingerprint-jsons> \
    --output-dir validation_outputs

Each analysis writes a JSON output to the output directory and prints a results table matching the paper. Runtime is under one minute on a laptop.

The substrate abstraction

from seci.substrate import LLMSubstrate
from seci.analysis import (
    population_claim_a, population_claim_b,
    claim_c_cross_model_ranking, variance_decomposition,
    per_identity_fingerprint_stability,
)

# result_paths must be defined by the caller: paths to SECI-format analysis JSONs.
substrates = [LLMSubstrate(p) for p in result_paths]

claim_a = population_claim_a(substrates)              # framework vs kernel-only
claim_b = population_claim_b(substrates, "A")         # arm_a vs base model
claim_c = claim_c_cross_model_ranking(substrates)     # identity-ranking r per dim
decomp  = variance_decomposition(substrates, arm="A") # between-identity vs between-model SD
stable  = per_identity_fingerprint_stability(substrates, arm="A")  # 6-D fingerprint

Any system observable as a (T × N) activity matrix qualifies as a substrate. The initial release ships with LLMSubstrate (text-output behavioral substrate from SECI-format analysis JSONs); an ActivationSubstrate for open-weight models would let the same analysis run on hidden-state activations.
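To make the (T × N) idea concrete, here is a hypothetical sketch of what such an abstraction could look like. The class and method names are invented and do not reflect the real interface in `src/seci/substrate/base.py`. The reading of T as observations (for example, prompts) and N as channels (for example, dimensions) is likewise an assumption.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class SubstrateSketch(ABC):
    """Hypothetical stand-in for an abstract identity substrate."""

    @abstractmethod
    def activity(self) -> List[List[float]]:
        """Return T rows (observations), each with N columns (channels)."""

    def shape(self) -> Tuple[int, int]:
        rows = self.activity()
        return len(rows), (len(rows[0]) if rows else 0)


class TextScoreSubstrate(SubstrateSketch):
    """Toy text-output substrate: one row of scores per prompt."""

    def __init__(self, per_prompt_scores: Sequence[Sequence[float]]):
        rows = [list(map(float, row)) for row in per_prompt_scores]
        if len({len(row) for row in rows}) > 1:
            raise ValueError("all rows must have the same number of columns")
        self._rows = rows

    def activity(self) -> List[List[float]]:
        return self._rows


# Invented data: 2 observations x 3 channels.
substrate = TextScoreSubstrate([[1, 2, 3], [4, 5, 6]])
print(substrate.shape())
```

Under this sketch, an activation-level substrate would only need its own `activity()` implementation, and any analysis written against the matrix would stay unchanged.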

Citation

@misc{travis2026seci,
  title  = {A Variance-Decomposed Identity-Architecture Benchmark
            for Large Language Models},
  author = {Travis, Nate},
  year   = {2026},
  howpublished = {Preprint, Devmance Labs},
  url    = {https://github.com/devmance/SECI}
}

License

MIT License — see LICENSE.

Contact

Nate Travis — labs@devmance.com — Devmance Labs
