SECI — Identity-Architecture Benchmark for Large Language Models

Reproducible code and analysis for the paper:

A Variance-Decomposed Identity-Architecture Benchmark for Large Language Models

Re-analysis of multi-model multi-arm identity-scaffolding data, separating framework contribution, base-model null, and cross-architecture portability claims.

Nate Travis · Devmance Labs

Paper: paper/seci.pdf (LaTeX source: paper/seci.tex, bibliography: paper/refs.bib).

What this benchmark measures

SECI is an open multi-rater benchmark for characterizing identity-scaffolded LLM behavior. Identity scaffolding (a kernel prompt that gives an LLM a persistent character, voice, and conceptual register) is now widely deployed but rarely benchmarked against the right null. SECI reports three empirical claims side-by-side, each with explicit labels for which question the measurement answers.

| Claim | Comparison | What it tests |
| --- | --- | --- |
| Claim A: Framework contribution | arm_a (full SE framework) vs arm_c (kernel only) | Does the framework wrapping above the identity kernel produce a measurable per-character delta? |
| Claim B: Scaffolding vs base | arm_a or arm_c vs arm_b (no identity) | Does identity scaffolding lift dimension scores above a true no-identity null? |
| Claim C: Cross-architecture portability | Per-dimension Pearson r on identity rankings across models | Do identity rankings on dimension X replicate when you change the model? |

A dimension can pass one claim and fail another. SECI publishes all three for every dimension across 7 frontier models × 36 identities × 3 protocol arms.
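The three comparisons above can be sketched on toy data. This is a minimal illustration, not the repository's implementation: the helper names `paired_delta` and `pearson_r`, the made-up scores, and the choice to treat the arm_b null as a single per-model baseline are all assumptions made here for clarity.

```python
from math import sqrt
from statistics import mean, stdev


def paired_delta(treated, control):
    """Mean and sample SD of per-identity differences treated[i] - control[i]."""
    diffs = [t - c for t, c in zip(treated, control)]
    return mean(diffs), stdev(diffs)


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores on ONE dimension for four identities (illustrative only).
arm_a = [62.0, 70.0, 55.0, 81.0]          # full framework, model 1
arm_c = [58.0, 61.0, 50.0, 77.0]          # kernel only, model 1
arm_b_base = 60.0                         # no-identity null, model 1 (assumed scalar)
arm_a_model_2 = [60.0, 72.0, 50.0, 79.0]  # full framework, model 2

claim_a = paired_delta(arm_a, arm_c)                      # framework vs kernel-only
claim_b = paired_delta(arm_a, [arm_b_base] * len(arm_a))  # scaffolded vs base null
claim_c = pearson_r(arm_a, arm_a_model_2)                 # do identity rankings carry over?

print(f"Claim A delta: {claim_a[0]:+.2f} (SD {claim_a[1]:.2f})")
print(f"Claim B delta: {claim_b[0]:+.2f} (SD {claim_b[1]:.2f})")
print(f"Claim C r:     {claim_c:+.2f}")
```

Keeping the three numbers separate is the point: a dimension can show a positive Claim A delta while its Claim B delta is negative or its Claim C correlation is near zero.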

Findings

Six-dimension fingerprint (ICT, NCG, PD, TP, CCC, DEA) computed identically across 7 frontier models (Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 3 Flash & Pro, GPT-4.1, GPT-5.4, Grok 4.20):

| Dim | Claim A (framework) | Claim B (vs base) | Claim C (cross-model rank) | Variance verdict |
| --- | --- | --- | --- | --- |
| ICT | +1.39 ± 4.12 | +5.20 ± 3.26 | r = +0.32 (modest) | comparable |
| NCG | +13.72 ± 12.17 | −14.08 ± 13.88 | r = +0.07 (chance) | identity > model |
| PD | +13.84 ± 8.06 | +7.50 ± 6.00 | r = +0.32 (modest) | comparable |
| TP | +7.85 ± 1.73 | −3.83 ± 3.29 | r = +0.73 (strong but model-driven) | MODEL DOMINATES |
| CCC | +8.88 ± 7.97 | +8.01 ± 7.79 | r = +0.13 (weak) | comparable |
| DEA | +8.82 ± 3.17 | +1.80 ± 1.44 | r = +0.06 (chance) | comparable |

Per-identity 6-D fingerprint shape is highly stable across model architectures: mean cross-model Pearson r = +0.934 across 101 pairs (99% of pairs with r > +0.7). The per-dimension picture is less uniform than this aggregate signal suggests: several dimensions fail Claim B or Claim C even though the overall vector shape replicates.
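As a rough sketch of what a fingerprint-shape stability statistic could look like, the snippet below correlates one identity's six-dimension vector across every pair of models and reports the mean r and the share of pairs above a threshold. The function name `fingerprint_stability`, the toy vectors, and the pairing scheme are assumptions for illustration; the repository's `per_identity_fingerprint_stability` may be defined differently.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

DIMS = ("ICT", "NCG", "PD", "TP", "CCC", "DEA")


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length vectors."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fingerprint_stability(fingerprints, threshold=0.7):
    """fingerprints: {model_name: [six scores in DIMS order]} for ONE identity.

    Returns (mean pairwise r, fraction of model pairs with r > threshold).
    """
    rs = [
        pearson_r(fingerprints[a], fingerprints[b])
        for a, b in combinations(sorted(fingerprints), 2)
    ]
    return mean(rs), sum(r > threshold for r in rs) / len(rs)


# Made-up vectors: model_2 has the same shape as model_1 shifted by +5,
# model_3 has the reversed shape.
toy = {
    "model_1": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "model_2": [15.0, 25.0, 35.0, 45.0, 55.0, 65.0],
    "model_3": [60.0, 50.0, 40.0, 30.0, 20.0, 10.0],
}
mean_r, frac_above = fingerprint_stability(toy)
print(f"mean r = {mean_r:+.3f}, pairs above 0.7 = {frac_above:.0%}")
```

Because Pearson r ignores uniform shifts and rescaling, this kind of statistic measures the shape of the vector rather than its absolute level, which is why it can stay high even when individual dimensions move between models.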

Takeaways

  1. Framework contribution (Claim A) is positive on all 6 dimensions — positive paired delta against kernel-only scaffolding across all 7 frontier models tested.
  2. Identity scaffolding does not uniformly lift dimensions above a base-model null (Claim B). ICT, PD, and CCC pass; NCG and TP score lower on scaffolded responses than on the base model with no identity. DEA shows only a marginal lift.
  3. Per-dimension identity rankings are mostly model-dependent (Claim C). NCG and DEA cross-model identity rankings are at chance (r = +0.07 and +0.06), and CCC is only weak (r = +0.13). TP shows strong cross-model agreement, but the variance decomposition identifies it as model-capability variance rather than identity variance.
  4. Per-identity 6-D fingerprint shape replicates across models at mean cross-model Pearson r = +0.934 (101 pairs, 99% above +0.7). The overall fingerprint vector distinguishes identities even where individual dimensions wobble.

Design principles

  • Three-claim disambiguation as primary output — a single number is never ambiguous about which claim it supports.
  • Variance decomposition every run — between-identity SD vs between-model SD per dimension. Dimensions where model variance dominates get an auto-generated diagnostic.
  • Mandatory null-scaffold arm — every benchmark run produces arm_b (base model, no identity) for Claim-B comparison.
  • Substrate abstraction — identity-scaffolded behavior modeled as a (T × N) activity matrix on an abstract IdentitySubstrate. The same analysis layer extends to activation-level substrates for open-weight models.
  • No composite score — SECI reports the 6-D fingerprint vector.
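The variance-decomposition principle above can be sketched as follows, assuming a complete identity × model grid of scores for one dimension. The function name `variance_verdict`, the 2.0 ratio threshold, and the toy numbers are hypothetical; only the three verdict labels are taken from the findings table, and the repository's actual decision rule may differ.

```python
from statistics import mean, stdev


def variance_verdict(scores, ratio=2.0):
    """scores: {identity: {model: score}} for ONE dimension, full grid.

    Compares the SD of per-identity means with the SD of per-model means.
    The `ratio` threshold is an illustrative assumption.
    """
    identities = sorted(scores)
    models = sorted(next(iter(scores.values())))
    identity_means = [mean(scores[i][m] for m in models) for i in identities]
    model_means = [mean(scores[i][m] for i in identities) for m in models]
    sd_identity = stdev(identity_means)
    sd_model = stdev(model_means)
    if sd_model > ratio * sd_identity:
        verdict = "MODEL DOMINATES"
    elif sd_identity > ratio * sd_model:
        verdict = "identity > model"
    else:
        verdict = "comparable"
    return sd_identity, sd_model, verdict


# Made-up grid: scores move a lot between models, barely between identities.
toy_scores = {
    "auren": {"model_1": 50.0, "model_2": 80.0},
    "identity_2": {"model_1": 52.0, "model_2": 82.0},  # hypothetical second identity
}
sd_id, sd_model, verdict = variance_verdict(toy_scores)
print(f"between-identity SD = {sd_id:.2f}, between-model SD = {sd_model:.2f}: {verdict}")
```

A dimension flagged this way can still show a high Claim C correlation, because every identity inherits the same model-level shift; the flag is what separates that case from genuine identity-driven agreement.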

Repository structure

paper/
  seci.tex             — paper source (LaTeX + natbib)
  refs.bib             — bibliography
  seci.pdf             — compiled paper
prompts.json           — protocol (12 prompts × 6 dimensions)
src/seci/
  substrate/           — IdentitySubstrate abstraction
    base.py            — abstract substrate (T × N activity matrix)
    llm_substrate.py   — text-output substrate from analysis JSONs
  scorer/              — six-dimension fingerprint scoring
    dimensions.py      — deterministic ICT/PD/TP/CCC/DEA + NCG fallback
    analyzer.py        — SECIScorer with multi-rater NCG verification
  protocol/            — 12-prompt protocol runner
    runner.py          — collects responses from any LLM provider
  analysis/            — three-claim + variance decomposition
    claims.py          — Claim A, B, C computations
    variance.py        — variance decomposition + warning flags
examples/
  run_full_benchmark.py — end-to-end: kernel → protocol → scoring
  rescore_dataset.py   — apply analysis to a multi-arm benchmark dataset
validation_outputs/    — analysis outputs on the reference dataset
  claim_a_population.json
  claim_b_population.json
  claim_c_cross_model.json
  variance_decomposition.json
  fingerprint_stability.json
  warning_flags.json
figures/               — 3 publication figures (PNG + PDF)

Running SECI

There are two entry points depending on what you want to do.

End-to-end on one identity (kernel → fingerprint)

pip install -r requirements.txt

# Set at least one provider API key (target model + raters share keys)
export GOOGLE_API_KEY=...        # or OPENAI_API_KEY / ANTHROPIC_API_KEY

python -m examples.run_full_benchmark \
    --identity-name auren \
    --identity-kernel kernels/auren.txt \
    --model gemini-2.5-pro \
    --provider gemini \
    --output sessions/auren_gemini25.json

The full benchmark runs the 12-prompt protocol against the target model, then scores the responses with the six-dimension SECI fingerprint. Multi-rater NCG verification activates automatically when ≥2 rater API keys are configured (≥3 recommended for stable inter-rater statistics).
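Before launching a run, it can help to check how many rater keys are visible in the environment. The helper below is a hypothetical pre-flight check, not part of the SECI package: the three variable names come from the export example above, while the function names and the assumption that fewer than two keys means no multi-rater verification are illustrative.

```python
import os

# Environment variable names taken from the setup example above.
RATER_KEY_VARS = ("GOOGLE_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_raters(env=None):
    """Return the names of rater key variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in RATER_KEY_VARS if env.get(name)]


def rater_mode(env=None):
    """Describe which NCG verification mode the documented thresholds imply."""
    n = len(configured_raters(env))
    if n >= 3:
        return "multi-rater (recommended: 3 or more raters)"
    if n == 2:
        return "multi-rater (minimum: 2 raters)"
    return "no multi-rater verification (fewer than 2 rater keys)"


print(rater_mode())
```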

Three-claim re-analysis (on a directory of fingerprint JSONs)

python -m examples.rescore_dataset \
    --data-dir <path-to-fingerprint-jsons> \
    --output-dir validation_outputs

Each analysis writes a JSON output to the output directory and prints a results table matching the paper. Runtime is under one minute on a laptop.
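To inspect the written outputs programmatically, something like the following could be used. The file names are the ones listed under validation_outputs/ in the repository structure; the helper `load_outputs` is hypothetical, and since the JSON schema is not documented here the sketch only loads each file and reports its top-level type.

```python
import json
from pathlib import Path

# File names as listed under validation_outputs/ in the repository structure.
OUTPUT_FILES = (
    "claim_a_population.json",
    "claim_b_population.json",
    "claim_c_cross_model.json",
    "variance_decomposition.json",
    "fingerprint_stability.json",
    "warning_flags.json",
)


def load_outputs(output_dir):
    """Load whichever known output files exist; missing files are skipped."""
    loaded = {}
    for name in OUTPUT_FILES:
        path = Path(output_dir) / name
        if path.exists():
            loaded[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return loaded


outputs = load_outputs("validation_outputs")
for key, value in outputs.items():
    print(f"{key}: {type(value).__name__}")
```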

The substrate abstraction

from seci.substrate import LLMSubstrate
from seci.analysis import (
    population_claim_a, population_claim_b,
    claim_c_cross_model_ranking, variance_decomposition,
    per_identity_fingerprint_stability,
)

substrates = [LLMSubstrate(p) for p in result_paths]

claim_a = population_claim_a(substrates)              # framework vs kernel-only
claim_b = population_claim_b(substrates, "A")         # arm_a vs base model
claim_c = claim_c_cross_model_ranking(substrates)     # identity-ranking r per dim
decomp  = variance_decomposition(substrates, arm="A") # between-identity vs between-model SD
stable  = per_identity_fingerprint_stability(substrates, arm="A")  # 6-D fingerprint

Any system observable as a (T × N) activity matrix qualifies as a substrate. The initial release ships with LLMSubstrate (text-output behavioral substrate from SECI-format analysis JSONs); an ActivationSubstrate for open-weight models would let the same analysis run on hidden-state activations.
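To make the (T × N) idea concrete, here is a small sketch of how such an abstraction might be shaped. The class names `SubstrateSketch` and `ToyTextSubstrate`, the method `activity_matrix`, and the reading of T as observations and N as measured channels are all assumptions for illustration; they are not the interface defined in src/seci/substrate/base.py.

```python
from abc import ABC, abstractmethod


class SubstrateSketch(ABC):
    """Hypothetical stand-in for an abstract substrate: anything that can
    expose its behavior as T rows (observations) by N columns (channels)."""

    @abstractmethod
    def activity_matrix(self):
        """Return a list of T rows, each a list of N floats."""

    def shape(self):
        rows = self.activity_matrix()
        return len(rows), (len(rows[0]) if rows else 0)


class ToyTextSubstrate(SubstrateSketch):
    """Toy substrate backed by pre-computed per-observation scores."""

    def __init__(self, rows):
        self._rows = [[float(v) for v in row] for row in rows]

    def activity_matrix(self):
        return self._rows


def channel_means(substrate):
    """Analysis code sees only the matrix, never the substrate's internals."""
    rows = substrate.activity_matrix()
    return [sum(col) / len(col) for col in zip(*rows)]


toy = ToyTextSubstrate([[1, 2, 3], [3, 4, 5]])
print(toy.shape(), channel_means(toy))
```

The design choice this illustrates is that analysis functions depend only on the matrix view, so a different substrate type could be swapped in without changing the analysis layer.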

Citation

@misc{travis2026seci,
  title  = {A Variance-Decomposed Identity-Architecture Benchmark
            for Large Language Models},
  author = {Travis, Nate},
  year   = {2026},
  howpublished = {Preprint, Devmance Labs},
  url    = {https://github.com/devmance/SECI}
}

License

MIT License — see LICENSE.

Contact

Nate Travis — labs@devmance.com — Devmance Labs
