From bde5ee9fa6718caaa7ac09db4d7e136ee6bc175b Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 13 Jun 2026 02:47:49 +0000
Subject: [PATCH] docs: add diagram-heavy developer guide for the research
 codebase

Add a docs/ folder that documents the system at the architecture level
rather than the docstring level: how the pieces fit together, the
project-specific vocabulary, the streaming data-generation pipeline, the
model zoo and offline evaluation, and candid notes on what modern
practice would recommend.

 - docs/README.md       orientation + repo map + reading order
 - docs/glossary.md     USP / initiation / eligible-existing-active /
                        triple / coverage, with a state diagram
 - docs/architecture.md whole-system overview, entry points, experiment
                        loop, package map (Mermaid)
 - docs/data-pipeline.md the timestamp-ordered replay that produces
                        training triples and test contexts, plus the
                        async writer and feature dedup
 - docs/modeling.md     1563-d feature vector, model zoo, training loop,
                        baselines, cached offline evaluation, sweeps
 - docs/modernization.md consolidated "what would you do today" notes

All 14 Mermaid diagrams validated with the mermaid parser. Link the new
guide from the top-level README.

https://claude.ai/code/session_018fRrzqPsGMHL3roZ2E3gVq
---
 README.md             |  15 +++
 docs/README.md        |  85 +++++++++++++++
 docs/architecture.md  | 166 +++++++++++++++++++++++++++++
 docs/data-pipeline.md | 230 ++++++++++++++++++++++++++++++++++++++++
 docs/glossary.md      |  72 +++++++++++++
 docs/modeling.md      | 238 ++++++++++++++++++++++++++++++++++++++++++
 docs/modernization.md |  91 ++++++++++++++++
 7 files changed, 897 insertions(+)
 create mode 100644 docs/README.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/data-pipeline.md
 create mode 100644 docs/glossary.md
 create mode 100644 docs/modeling.md
 create mode 100644 docs/modernization.md
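
Note on docs/data-pipeline.md: the sliding-window expiry it attributes to `RecentActivityCounter.update_counts` can be illustrated with a minimal sketch. The class and method names below (`SlidingWindowCounter`, `add`, `expire`, `active`) are hypothetical and this is not the project's code. It assumes events arrive in non-decreasing timestamp order, as they do in the replay.

```python
from collections import OrderedDict

WEEK_MS = 7 * 24 * 60 * 60 * 1000  # assumed default window of one week


class SlidingWindowCounter:
    """Hypothetical stand-in for RecentActivityCounter (illustration only)."""

    def __init__(self, window_ms=WEEK_MS):
        self.window_ms = window_ms
        self.events = OrderedDict()  # timestamp -> list of USPs, in arrival order
        self.counts = {}             # USP -> number of events still in the window

    def add(self, usp, ts):
        """Record one event for usp at time ts (ts must not go backwards)."""
        self.expire(ts)
        self.events.setdefault(ts, []).append(usp)
        self.counts[usp] = self.counts.get(usp, 0) + 1

    def expire(self, now):
        """Drop every event at or before now - window_ms."""
        cutoff = now - self.window_ms
        while self.events:
            oldest_ts = next(iter(self.events))
            if oldest_ts > cutoff:
                break
            # popitem(last=False) removes the leftmost (oldest) entry in O(1).
            _, usps = self.events.popitem(last=False)
            for usp in usps:
                self.counts[usp] -= 1
                if self.counts[usp] == 0:
                    del self.counts[usp]

    def active(self):
        """USPs with at least one event inside the current window."""
        return set(self.counts)
```

Because the replay feeds timestamps in order, the oldest entry always sits at the left end, so expiry never has to scan the whole window.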

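Note on docs/data-pipeline.md: the content-hash deduplication of feature arrays it describes can be sketched as below. This is an in-memory illustration under stated assumptions, not the code in `featuredb.py`: the class name `DedupFeatureTable`, the choice of SHA-256, and the cast to `float32` are all hypothetical.

```python
import hashlib

import numpy as np


class DedupFeatureTable:
    """Hypothetical in-memory stand-in for the deduplicated `feature` table."""

    def __init__(self):
        self.id_by_digest = {}  # content digest -> feature_id
        self.arrays = []        # feature_id -> stored array

    def put(self, arr):
        """Return the feature_id for arr, storing the array only if it is new."""
        arr = np.ascontiguousarray(arr, dtype=np.float32)
        # Hash shape and raw bytes together so that arrays with identical
        # bytes but different shapes do not collide.
        payload = repr(arr.shape).encode() + arr.tobytes()
        digest = hashlib.sha256(payload).hexdigest()
        feature_id = self.id_by_digest.get(digest)
        if feature_id is None:
            feature_id = len(self.arrays)
            self.arrays.append(arr)
            self.id_by_digest[digest] = feature_id
        return feature_id


table = DedupFeatureTable()
first_id = table.put([3.0, 1.0, 42.0])
same_id = table.put(np.array([3.0, 1.0, 42.0]))  # identical content, reuses the id
other_id = table.put([3.0, 1.0, 43.0])           # different content, new id
```

A triple row then only needs to hold small integer ids, and a vector that recurs across many `(source, candidate)` pairs is stored once.
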
diff --git a/README.md b/README.md
index d6dfb1f..973e4b6 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,21 @@ HealthBlogRec is a recommendation system for peer health blogs.
 
 ![Recommender system overview](/figures/rec_system_overview.png)
 
+## Documentation
+
+In-depth developer documentation lives in [`docs/`](docs/) — a diagram-heavy
+guided tour of how the system fits together, written for someone reading the
+[paper](https://arxiv.org/abs/2209.04973) and the source side by side. Because
+this is research code from 2021, the docs also include candid **🕰️ Modern take**
+notes flagging where current practice would steer you differently.
+
+ - [`docs/README.md`](docs/README.md) — start here: orientation and reading order
+ - [`docs/glossary.md`](docs/glossary.md) — the project's vocabulary (*USP*, *initiation*, *eligible*/*existing*/*active*, *triple*, *coverage*)
+ - [`docs/architecture.md`](docs/architecture.md) — the whole system on one page, plus entry points and the experiment loop
+ - [`docs/data-pipeline.md`](docs/data-pipeline.md) — the streaming history-replay that generates training triples and test contexts
+ - [`docs/modeling.md`](docs/modeling.md) — the feature vector, model zoo, training loop, baselines, and offline evaluation
+ - [`docs/modernization.md`](docs/modernization.md) — consolidated "what would you do today?" notes
+
 ## Citation and project history
 
 If any portion of this project is useful to you, please cite the following paper: 
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..8a751e4
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,85 @@
+# HealthBlogRec — Developer Documentation
+
+This folder is a guided tour of the HealthBlogRec codebase, written for the
+reader who has the [arXiv paper](https://arxiv.org/abs/2209.04973) open in one
+tab and the source in another, and is trying to understand *how the choices in
+the paper map onto the code*.
+
+It is intentionally heavy on diagrams and light on prose. The code itself has
+reasonable docstrings in the places that matter; what was missing — and what
+these documents try to provide — is the **bird's-eye view**: how the pieces fit
+together, why they're shaped the way they are, and where the sharp edges are.
+
+> ⚠️ **This is research code from 2021.** It was written to run one study on one
+> cluster (the University of Minnesota's MSI) and to be re-run a handful of times
+> for paper revisions. It is *not* a library, it is *not* deployable as-is, and
+> several decisions that look strange today were reasonable given the deadline,
+> the data, and the tooling of the time. Throughout these docs, call-outs marked
+> **🕰️ Modern take** flag places where current practice would steer you
+> somewhere else. They are not bug reports — they are orientation for anyone
+> trying to learn from or revive this work.
+
+## Where to start
+
+Read in this order:
+
+1. **[glossary.md](glossary.md)** — the vocabulary. The code is dense with
+   domain-specific terms (*USP*, *initiation*, *eligible* vs *existing* vs
+   *active*, *triple*, *coverage*). Nothing else makes sense until these click.
+2. **[architecture.md](architecture.md)** — the whole system on one page, plus
+   the two entry points (`gen.py`, `predict.py`) and how an offline experiment
+   flows end to end.
+3. **[data-pipeline.md](data-pipeline.md)** — the heart of the project: the
+   streaming "replay history one interaction at a time" simulation in
+   `cbrec.triple_generation` that turns raw CaringBridge activity into training
+   triples and test/prediction contexts.
+4. **[modeling.md](modeling.md)** — the feature vector, the model zoo
+   (`LinearNet`, `SimNet`, `ConcatNet`, `LearnedSimNet`, `InteractionNet`), the
+   training loop, the baselines, and how offline evaluation is run at scale.
+5. **[modernization.md](modernization.md)** — a consolidated, candid list of
+   "what would you do differently today," collected so it doesn't clutter the
+   reference docs.
+
+## The one-paragraph summary
+
+HealthBlogRec recommends *peer health blogs* (CaringBridge sites) to people who
+write their own. The core modeling task is framed as: given a **source** author
+who is about to reach out to a new site (an **initiation**), can we rank the pool
+of **candidate** authors so the one they actually reached out to (the **target**)
+floats to the top? The system learns from history by replaying years of
+CaringBridge interactions in timestamp order, snapshotting the network/activity
+state at each real initiation, and emitting a `(source, target, sampled
+non-target)` **triple**. Each user/site pair is encoded as network features +
+recent-activity features + a RoBERTa embedding of their recent journal updates.
+A small MLP scores `(source, candidate)` pairs; the trained model was used to
+send weekly recommendation emails during a 2021 field study.
+
+## Map of the repository
+
+```
+HealthBlogRec/
+├── src/
+│   ├── gen.py / predict.py        # thin entry points → cbrec.triple_generation / cbrec.prediction
+│   ├── *.sh                       # Slurm batch scripts for the MSI cluster
+│   ├── cbrec/                     # the recommendation package (see data-pipeline.md, modeling.md)
+│   │   ├── triple_generation.py   # the streaming history-replay simulation (the core)
+│   │   ├── reccontext.py          # one recommendation event (source → candidates [→ target])
+│   │   ├── feature_extraction.py  # USP → feature vector, + async DB writer
+│   │   ├── eligibility.py         # who counts as existing / eligible
+│   │   ├── network.py             # the interaction graph (indegree/outdegree/component)
+│   │   ├── recentActivityCounter.py / timeAwareDict.py  # time-windowed activity counts
+│   │   ├── featuredb.py           # SQLite storage of feature arrays / triples / contexts
+│   │   ├── evaluation.py / coverage.py  # baselines, ranking metrics, coverage
+│   │   ├── text/                  # RoBERTa journal embeddings in SQLite
+│   │   ├── modeling/              # the PyTorch models, training, and offline eval
+│   │   └── experiment/            # generates hyperparameter-sweep configs + Slurm scripts
+│   ├── cbsend/                    # composes and sends the weekly recommendation emails
+│   └── extract/                   # MongoDB BSON dumps → flattened ndJSON
+├── notebook/                      # analysis & experimentation (outputs/IDs scrubbed)
+├── data/                          # small sample data artifacts
+├── design/                        # email mockups
+└── figures/                       # figures used in the paper
+```
+
+See the top-level [README](../README.md) for citation, study history, the conda
+environment, and the list of bugs caught during deployment.
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..b813f80
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,166 @@
+# Architecture Overview
+
+This is the whole system on one page. For the vocabulary
+(*USP*, *initiation*, *eligible*…) see [glossary.md](glossary.md) first.
+
+## The big picture
+
+HealthBlogRec turns years of raw CaringBridge activity into a trained model that
+ranks peer authors; that model was then used to send weekly recommendation emails
+during a 2021 field study. There are five loosely coupled stages, each writing
+artifacts to disk that the next stage reads. This file-handoff design is
+deliberate: each stage took hours-to-days on the cluster and had to be
+independently restartable.
+
+```mermaid
+flowchart TD
+    subgraph raw["① Raw data (MongoDB → flat files)"]
+        BSON["MongoDB BSON dumps"] -->|"src/extract/"| FLAT["ints_df.feather<br/>journal_metadata.feather<br/>user_site_df.csv"]
+        TXT["journalText.sqlite<br/>(raw post text)"]
+    end
+
+    subgraph gen["② Data generation — cbrec.triple_generation (gen.py)"]
+        FLAT --> REPLAY["Replay interactions in<br/>timestamp order, maintaining<br/>graph + activity + eligibility state"]
+        REPLAY -->|"each eligible initiation"| FDB[("feature.sqlite<br/>triples · test_contexts · feature arrays")]
+        REPLAY --> MDN["metadata.ndjson"]
+    end
+
+    subgraph text["③ Text features — cbrec.text"]
+        TXT -->|"RoBERTa, 32 procs"| TFDB[("text_feature.sqlite<br/>768-d embeddings per journal")]
+    end
+
+    subgraph model["④ Modeling — cbrec.modeling"]
+        FDB --> TRAIN["train.py / model_gen.py<br/>MLP on (source, candidate) pairs"]
+        TFDB --> TRAIN
+        TRAIN --> MODEL["trained model<br/>(.pt + .json manifest)"]
+        MODEL --> EVAL["offline evaluation<br/>(NDCG, reciprocal rank, coverage)"]
+        FDB --> EVAL
+        TFDB --> EVAL
+    end
+
+    subgraph deploy["⑤ Deployment (the 2021 field study)"]
+        MODEL --> PRED["cbrec.prediction (predict.py)<br/>score live candidates"]
+        FLAT --> PRED
+        PRED --> EMAIL["cbsend/<br/>compose + send weekly SSE emails"]
+    end
+```
+
+This diagram is the code version of the paper's
+[system overview figure](../figures/rec_system_overview.png).
+
+## The two entry points
+
+The `src/` top level is deliberately thin — the real logic is in the `cbrec`
+package. Everything is launched through Slurm batch scripts (`src/*.sh`) on the
+MSI cluster.
+
+| Script | Delegates to | Purpose |
+|--------|--------------|---------|
+| `gen.py` | `cbrec.triple_generation.main()` | Stage ②: generate all training triples and test/predict contexts. ~50 hours. |
+| `predict.py` | `cbrec.prediction.main()` | Stage ⑤: load a checkpoint + model and produce live recommendations for a set of users. |
+
+The supporting `make_text_features_*.sh` scripts drive Stage ③, and the
+`cbrec.experiment` + `cbrec.modeling` machinery drives Stage ④ (see
+[modeling.md](modeling.md)).
+
+## How an offline experiment flows end-to-end
+
+This is the loop a researcher actually ran to compare models for the paper.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant U as Researcher
+    participant CG as experiment/config_gen.py
+    participant SL as Slurm (MSI)
+    participant MG as modeling/model_gen.py
+    participant FDB as feature.sqlite + cached X_train
+    participant EV as modeling/evaluateModelFromCache*.py
+    participant NB as Jupyter notebook
+
+    U->>CG: define a hyperparameter sweep
+    CG->>SL: emit one config.json + sbatch script per cell of the grid
+    loop each config in the sweep
+        SL->>MG: train one model
+        MG->>FDB: load cached training matrix
+        MG-->>SL: write model .pt + .json manifest
+    end
+    U->>EV: submitEvalFromDirectory.py over the model dir
+    loop each trained model
+        EV->>EV: score every cached RecContext, write *_test_metadata.ndjson
+    end
+    EV-->>NB: NDJSON metrics + coverage pickles
+    NB->>U: compare models, pick the field-study model
+```
+
+The key performance trick: **RecContexts are pre-built once and cached as
+pickles** (`generateRecMd.py`) so that evaluating 500+ models in a sweep doesn't
+re-pay the cost of assembling feature matrices and looking up embeddings each
+time. See [modeling.md](modeling.md#offline-evaluation-at-scale).
+
+## Package map
+
+```mermaid
+flowchart LR
+    subgraph cbrec["cbrec (recommendation core)"]
+        TG["triple_generation<br/>(orchestrator)"]
+        RC["reccontext"]
+        FE["feature_extraction"]
+        EL["eligibility"]
+        NW["network"]
+        RA["recentActivityCounter<br/>timeAwareDict"]
+        FD["featuredb"]
+        DA["data"]
+        EVl["evaluation<br/>coverage"]
+        TG --> RC --> FE
+        TG --> EL & NW & RA & DA
+        FE --> FD
+        RC --> EVl
+    end
+    subgraph txt["cbrec.text"]
+        TDB["textdb"] --> CTF["createTextFeatureSqlite<br/>(RoBERTa)"] --> EDB["embeddingdb"]
+        JID["journalid"]
+    end
+    subgraph mdl["cbrec.modeling"]
+        MC["modelconfig"] --> MM["manager"]
+        MM --> PP["preprocess"] & MT["train"]
+        MT --> MODELS["models/*<br/>LinearNet · SimNet · ConcatNet<br/>LearnedSimNet · InteractionNet"]
+        MM --> SC["scorer"]
+        RCB["reccontext_builder"] --> TL["text_loader"]
+    end
+    subgraph exp["cbrec.experiment"]
+        CGN["config_gen"] --> MGN["model_gen"]
+    end
+    cbrec --> mdl
+    txt --> mdl
+    exp --> mdl
+    mdl --> SEND["cbsend (emails)"]
+```
+
+## Design choices worth understanding up front
+
+- **Everything is a streaming replay.** The system never "queries the current
+  state of the database." It replays history one interaction at a time and
+  computes features from the *online* state at that instant. This is what makes
+  the features causally honest (no future leakage) — and what makes the pipeline
+  a stateful, checkpoint-heavy, ~50-hour job. See [data-pipeline.md](data-pipeline.md).
+
+- **Temporal train/test split, not random.** Train = initiations from 2014 up to the
+  2021 boundary, test = initiations after it. The model is always asked to predict the future.
+
+- **Feature arrays are content-addressed and deduplicated** in SQLite
+  (`featuredb.py`), so the hundreds of millions of `(source, candidate)` pairs
+  collapse to far fewer unique stored vectors.
+
+- **Text is a separate pipeline on purpose.** RoBERTa embedding of ~1M journals
+  is expensive and embarrassingly parallel, so it lives in its own stage
+  (`cbrec.text`, 32 processes) and is joined in only at model-training time.
+
+- **🕰️ Modern take:** Today most of stages ②–④ would be a feature store (e.g.
+  Feast) + a scheduled batch job (or a streaming pipeline on, say, Flink), the
+  bespoke SQLite + ndjson + pickle handoffs would be Parquet/Arrow, and the
+  Slurm sbatch-script-generation in `cbrec.experiment` would be an experiment
+  tracker (Weights & Biases / MLflow) plus a workflow engine
+  (Airflow/Prefect/Ray). The architecture's *shape* — replay → features →
+  train → eval → serve — is still exactly what you'd build. See
+  [modernization.md](modernization.md).
diff --git a/docs/data-pipeline.md b/docs/data-pipeline.md
new file mode 100644
index 0000000..b44c042
--- /dev/null
+++ b/docs/data-pipeline.md
@@ -0,0 +1,230 @@
+# The Data Generation Pipeline
+
+> Code: `cbrec/triple_generation.py` and everything it orchestrates.
+> Entry point: `src/gen.py`. Vocabulary: [glossary.md](glossary.md).
+
+This is the most important — and least obvious — part of the project. Everything
+the model learns from is produced here. If you understand this file, you
+understand HealthBlogRec.
+
+## The core idea: replay history, one interaction at a time
+
+The pipeline does **not** query a database for "the current network." Instead it
+**replays the entire interaction stream in timestamp order**, and maintains the
+graph, activity counters, and eligibility sets *incrementally* as each event
+arrives. When it reaches a real initiation, it snapshots the state *as it existed
+at that instant* and emits training data.
+
+Why bother? Because it guarantees **no temporal leakage**: a USP's features at
+time *t* can only reflect what had happened before *t*. A model trained on these
+features is being asked exactly the question it will face in production ("given
+only the past, rank these candidates"), so the offline numbers are not inflated
+by information from the future.
+
+```mermaid
+flowchart TD
+    START["RecDataGenerator.generate_rec_data()"] --> LOAD["DataManager loads + trims<br/>ints_df (sorted by created_at)"]
+    LOAD --> LOOP{"for each interaction<br/>in timestamp order"}
+    LOOP --> JUP["apply any journal updates<br/>that occurred before this ts<br/>(timeAwareDict)"]
+    JUP --> UPDST["update activity counters<br/>(recentActivityManager)"]
+    UPDST --> ISINIT{"first time this user<br/>touched this site?<br/>(is_initiation)"}
+    ISINIT -->|no| UPDACT["record interaction in<br/>activity counter"]
+    ISINIT -->|yes| PERIOD{"which period?"}
+    PERIOD -->|"before 2014<br/>(warm-up)"| GRAPH
+    PERIOD -->|"2014–2020<br/>(generation)"| TRAIN["build RecContext →<br/>generate TRAIN triples"]
+    PERIOD -->|"2021–2022<br/>(test)"| TEST["build RecContext →<br/>generate TEST context"]
+    TRAIN --> WRITE["DatabaseWriter (async) →<br/>feature.sqlite + metadata.ndjson"]
+    TEST --> WRITE
+    GRAPH["update initiation graph<br/>(add edge source→existing authors)"] --> UPDACT
+    WRITE --> GRAPH
+    UPDACT --> LOOP
+    LOOP -->|"hit generation_stop_timestamp"| DONE["final checkpoint"]
+```
+
+The loop lives in `generate_rec_data()` (line ~126) and `process_interaction()`
+(line ~254). The three time boundaries are config values in `genconfig.py`:
+`generation_start_timestamp` (2014), `test_generation_start_timestamp` (2021),
+`generation_stop_timestamp` (2022).
+
+## The stateful machinery
+
+`RecDataGenerator.__init__` wires together a small set of collaborating state
+objects. Each one answers a different question about "the world so far":
+
+```mermaid
+flowchart LR
+    subgraph state["Online state maintained during replay"]
+        EL["UserSitePairEligibilityManager<br/>eligibility.py"]
+        NW["UserGraph + WccGraph<br/>network.py"]
+        RA["RecentActivityManager<br/>recentActivityCounter.py"]
+        TAD["TimeAwareDict (journals)<br/>timeAwareDict.py"]
+    end
+    EL -.->|"who is existing / eligible?"| Q1
+    NW -.->|"indegree, outdegree,<br/>component, reciprocal, FoF?"| Q1
+    RA -.->|"who is active? recent<br/>activity counts + recency?"| Q1
+    TAD -.->|"which journals happened<br/>before now?"| Q1
+    Q1["FeatureGenerator<br/>builds a USP's feature vector"]
+```
+
+| Object | File | Maintains | Feeds |
+|--------|------|-----------|-------|
+| `UserSitePairEligibilityManager` | `eligibility.py` | counts of journal updates per USP; the existing/eligible sets; site↔author maps | eligibility checks, candidate pool |
+| `UserGraph` (+ `WccGraph`) | `network.py` | the directed initiation graph; weakly-connected components | network features (indegree/outdegree/component/reciprocal/FoF) |
+| `RecentActivityManager` (a set of `RecentActivityCounter`s) | `recentActivityCounter.py` | sliding-window counts of journals/amps/comments/guestbooks; first-journal timestamps | activity features, the "active" set |
+| `TimeAwareDict` | `timeAwareDict.py` | journal updates indexed by timestamp, consumed once | drives the eligibility/activity updates |
+
+The activity counters use a neat trick: an `OrderedDict` keyed by timestamp stays
+in arrival order, so each stale entry expires with an O(1) pop from the left
+(`RecentActivityCounter.update_counts`). The `WccGraph` keeps component
+membership up to date by merging components whenever an edge bridges two of them.
+
+## From an initiation to a training triple
+
+When an eligible initiation occurs during the generation period,
+`RecContext.generate_train_features()` produces one or more **triples**:
+
+```mermaid
+flowchart TD
+    INIT["initiation: source_user → target_site"] --> SRC["source USPs =<br/>all sites where source is eligible<br/>(usually 1)"]
+    INIT --> TGT["target USPs =<br/>eligible authors on target_site"]
+    INIT --> CAND["candidate pool =<br/>active ∩ eligible USPs<br/>(minus self, coauthors, existing ties)"]
+    SRC --> PAIR
+    TGT --> PAIR
+    CAND -->|"random negative sample"| ALT["alt USP"]
+    ALT --> PAIR["TRIPLE = (source, target, alt)"]
+    PAIR --> POS["(source, target) → label 1"]
+    PAIR --> NEG["(source, alt) → label 0"]
+    POS --> FEAT["FeatureGenerator:<br/>12 USP features each<br/>+ 3 pairwise features"]
+    NEG --> FEAT
+    FEAT --> SAVE["DatabaseWriter.save_triples()<br/>→ feature.sqlite"]
+```
+
+Negative sampling here is **uniform random** from the eligible-active candidate
+pool. Note that what gets stored is *features*, not raw IDs: each triple
+row references five `feature_id`s (source, target, alt, source-target pair,
+source-alt pair), and identical feature vectors are deduplicated by content hash
+(see below).
+
+## Test contexts are different: rank the whole pool
+
+In the test period there's no negative sampling. Instead
+`generate_test_features()` materializes the **entire candidate set** as dense
+matrices so the evaluator can score every candidate and find where the true
+target ranked:
+
+| Array | Shape | Meaning |
+|-------|-------|---------|
+| `source_usp_arr` | `(X, 2)` | source USPs (usually X=1) |
+| `candidate_usp_arr` | `(Y, 2)` | every candidate USP, sorted by `site_id` |
+| `target_inds` | `(Z,)` | indices into `candidate_usp_arr` that are the true target |
+| `source_usp_mat` | `(X, 12)` | non-text features for sources |
+| `candidate_usp_mat` | `(Y, 12)` | non-text features for candidates |
+| `user_pair_mat` | `(X·Y, 3)` | pairwise features for every source×candidate combo |
+
+The text embeddings are **not** stored here — they're joined in later at scoring
+time by `reccontext_builder.build_reccontext_X` (see [modeling.md](modeling.md)).
+This keeps `feature.sqlite` from ballooning with duplicated 768-d vectors.
+
+A **predict** context is a test context with no target. It is used both for the
+recs that were actually emailed and for the ~1000 target-less "coverage" contexts
+created at test-period start to measure reach.
+
+## What a USP feature vector contains
+
+`FeatureGenerator.generate_user_features` produces a **12-dimensional** non-text
+vector per USP, and `generate_user_pair_features` produces **3** pairwise
+features:
+
+```
+USP features (12):
+  network (3):   indegree · outdegree · weakly-connected-component size
+  activity (8):  for each of {journal, amp, comment, guestbook}:
+                   count-in-window · hours-since-most-recent
+  temporal (1):  hours since the author's first journal update
+
+pair features (3):  are_weakly_connected · is_friend_of_friend · is_reciprocal
+```
+
+At model-training time the source and candidate USP vectors and the pair features are
+concatenated with two 768-d RoBERTa journal embeddings (source + candidate), giving
+12 + 12 + 3 + 768 + 768 = 1563 dimensions. The full layout is in [modeling.md](modeling.md#the-feature-vector).
+
+## Persistence: the async writer and feature dedup
+
+Feature computation runs in the main process; disk I/O is offloaded to a
+separate `WriterProcess` over a queue (`feature_extraction.py`). This lets the
+CPU-bound replay keep moving while SQLite writes batch up (commits every
+`feature_db_max_commit_size` = 500k inserts).
+
+```mermaid
+flowchart LR
+    MAIN["main loop<br/>(compute features)"] -->|"queue items:<br/>feature / triple /<br/>test_context / metadata"| Q(["mp.Queue"])
+    Q --> WP["WriterProcess"]
+    WP -->|"batched commits"| SQL[("feature.sqlite")]
+    WP -->|"append JSON line"| ND["metadata.ndjson"]
+    MAIN -.->|"content-hash dedup (LruCache of feature_ids)"| MAIN
+```
+
+`feature.sqlite` holds three logical tables: `feature` (deduplicated numpy
+arrays, serialized with a 1-byte dtype prefix), `triple` (training rows
+referencing feature_ids), and `test_context` (test/predict rows referencing the
+six arrays above). `metadata.ndjson` carries one rich metadata dict per
+RecContext — eligibility flags, candidate counts, adjustment reasons — used for
+slicing results during analysis.
+
+## Checkpointing
+
+Because a full run is ~50 hours, the entire `RecDataGenerator` is pickled at
+milestones (`create_checkpoint`): before the generation period, before the test
+period, and at the end. `gen.py --from-recent-checkpoint` resumes by replaying
+to `current_interaction_ind` and continuing. The `__getstate__`/`__setstate__`
+hooks drop and rebuild the (unpicklable) `DataManager`.
+
+## Sharp edges and candid notes
+
+These are real comments and behaviors in the code — worth knowing before you
+trust or modify the pipeline.
+
+- **The target-selection fallback is admittedly arbitrary.** At test time, if no
+  author on the target site is currently *active*, the code forces one in for
+  evaluation. With multiple eligible authors it picks the one who *joined the
+  site most recently* — and the comment says so plainly: *"a very arbitrary
+  selection criterion; random might literally be better"* (`reccontext.py:165`).
+  The `test_target_usp_adjustment` field in metadata records when this happened
+  so you can filter those cases.
+
+- **Debug mode fabricates eligibility.** In `TestConfig`/debug runs,
+  `DataManager` duplicates a journal row per user so everyone clears the
+  3-update eligibility bar (`data.py:50`). Great for exercising code paths on
+  tiny test data; obviously not a real-data behavior.
+
+- **Amp timestamps are reconstructed.** "Amps" (reactions) weren't reliably
+  timestamped in the source data, so `trim_data()` adds offsets drawn from a
+  precomputed delay distribution (`reaction_ms_since_journal.npy`). The features
+  built on amp recency are therefore approximate by construction.
+
+- **`replace_config()` is explicitly hacky.** Its own docstring warns it "will
+  NOT do everything you expect, and most config settings will be unaffected"
+  (`triple_generation.py:115`). It exists to enable a few specific mutations for
+  the predict path.
+
+- **A cache is resized mid-run on a hunch.** At test-period start the feature
+  array cache is shrunk from 5000 to 10 because test arrays are much larger,
+  with the comment `FIXME is this reasonable?` (`triple_generation.py:202`).
+
+- **The reuse cache mostly doesn't fire.** `FeatureGenerator`'s LRU cache keys on
+  `(user, site, timestamp)`; since timestamps are essentially unique, repeat hits
+  "basically don't happen" — the author says as much
+  (`feature_extraction.py:64`). The exception is coverage generation, where the
+  timestamp is held fixed and the cache is temporarily enlarged.
+
+- **🕰️ Modern take.** The hand-rolled streaming state (eligibility sets, sliding
+  windows, an incremental WCC union-find) is exactly the kind of thing a
+  modern **point-in-time feature store** (Feast, Tecton) or a stream processor
+  (Flink/Kafka Streams) is built to do — with leakage-safety and backfills as
+  first-class features rather than something you enforce by careful ordering and
+  hope. The SQLite-blob feature table would be Parquet/Arrow today, and the
+  ad-hoc content-hash dedup is what a feature store's entity keys give you for
+  free. None of this means the original is wrong — it's a faithful, leakage-safe
+  pipeline — just that you'd reach for off-the-shelf infrastructure now. More in
+  [modernization.md](modernization.md).
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 0000000..2013b4b
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,72 @@
+# Glossary
+
+The code uses a compact, project-specific vocabulary that the variable names
+assume you already know. This is the decoder ring. Read it once and the rest of
+the codebase becomes legible.
+
+## Entities
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **User** | `user_id` | A CaringBridge account. |
+| **Site** | `site_id` | A CaringBridge blog/site. One site can have multiple authors. |
+| **USP** (user/site pair) | `(user_id, site_id)`, often an `np.int64` 2-tuple | The atomic unit the model reasons about. *Not* a user and *not* a site, but a specific person **in their role as an author of a specific site.** The same person authoring two sites is two USPs. Almost every array in the codebase is indexed by USP. |
+| **Journal update** | `journal_oid`, `journal_id` | A blog post. The text of the 3 most recent updates is what gets embedded with RoBERTa. |
+| **Interaction** | row of `ints_df` | Any activity event: a journal update, an *amp* (a "like"), a *comment*, or a *guestbook* entry. The whole pipeline is a replay of the interaction stream in timestamp order. |
+
+## The recommendation framing
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **Initiation** | `is_initiation` | The **first** time a given `user_id` interacts with a given `site_id`. This is the event the system tries to predict/encourage: one author reaching out to another's site for the first time. Initiations define both the training signal and the edges of the interaction graph. |
+| **Source** | `source_user_id`, `source_usp_arr` | The author *making* the initiation; in deployment, the person we'd email recommendations to. |
+| **Target** | `target_site_id`, `target_usps`, `target_inds` | The site (and its eligible authors) the source *actually* initiated with. This is the ground-truth positive: the thing a good ranking puts near the top. |
+| **Candidate** | `candidate_usp_arr`, `candidate_usp_mat` | The pool of USPs that *could* be recommended at this moment — every active & eligible author. At test time we score the whole pool and check where the target landed. |
+| **Alt** (alternative) | `alt_usp`, `alt_feature_id` | A candidate sampled as a **negative** during training: a USP the source did *not* initiate with. (See [data-pipeline.md](data-pipeline.md) for the sampling.) |
+| **Triple** | `triple` table, `RecContext.triples` | The training unit: `(source, target, alt)`. Stored as two labeled pairs — `(source, target) → 1` and `(source, alt) → 0`. The word "triple" is everywhere; it just means "one positive + one sampled negative, sharing a source." |
+
+## User state (the three words that trip everyone up)
+
+These describe a USP's status *at a given instant in the replay*. *Existing* and
+*eligible* are nested (every eligible USP is also existing) and live in `eligibility.py`;
+*active* is an independent, time-windowed condition tracked in `recentActivityCounter.py`.
+
+```mermaid
+flowchart LR
+    A["Author writes<br/>journal updates"] --> E["EXISTING<br/>(≥1 update on a site)"]
+    E --> EL["ELIGIBLE<br/>(≥3 updates on a site)<br/>→ can be a rec TARGET"]
+    A2["Any recent interaction<br/>(journal/amp/comment/guestbook)<br/>in the last 7 days"] --> AC["ACTIVE<br/>→ worth scoring right now"]
+    EL -.->|"eligible AND active = a real candidate"| CAND["CANDIDATE pool"]
+    AC -.-> CAND
+```
+
+| Term | Threshold | Why it exists |
+|------|-----------|---------------|
+| **Existing** | Has authored **≥1** journal update on the site. | Needed for the interaction graph: only existing authors can be the recipient of an initiation edge. |
+| **Eligible** | Has authored **≥3** journal updates on the site (`config.journal_update_memory`). | The model needs 3 journal embeddings to build a USP's text features, and recommending a barely-started blog is undesirable. Only eligible USPs can be recommendation **targets**. |
+| **Active** | Had *any* interaction within the activity window (default **1 week**, `config.activity_count_duration_ms`). | We only want to surface authors who are currently writing — a recommendation to a dormant blog is a bad experience. Candidates must be active. |
+
+So the **candidate pool = eligible ∩ active** authors at the recommendation
+timestamp.
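+
+A minimal sketch of the three predicates and the candidate-pool intersection,
+assuming a simplified in-memory snapshot per USP. `UspState`, the helper names,
+and the inclusive window boundary are assumptions for illustration, not the API
+of `eligibility.py` or `recentActivityCounter.py`:
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+WEEK_MS = 7 * 24 * 60 * 60 * 1000  # default activity window from the table (1 week)
+ELIGIBLE_MIN_UPDATES = 3           # mirrors config.journal_update_memory
+
+
+@dataclass
+class UspState:
+    """Hypothetical snapshot of one USP at a replay instant."""
+    n_journal_updates: int
+    last_interaction_ms: Optional[int]  # None if the USP never interacted
+
+
+def is_existing(u: UspState) -> bool:
+    return u.n_journal_updates >= 1
+
+
+def is_eligible(u: UspState) -> bool:
+    return u.n_journal_updates >= ELIGIBLE_MIN_UPDATES
+
+
+def is_active(u: UspState, now_ms: int, window_ms: int = WEEK_MS) -> bool:
+    # Assumption: the window boundary is inclusive.
+    if u.last_interaction_ms is None:
+        return False
+    return now_ms - u.last_interaction_ms <= window_ms
+
+
+def candidate_pool(states: dict, now_ms: int) -> list:
+    """Keys of all USPs that are eligible AND active at now_ms, sorted."""
+    return sorted(
+        key for key, u in states.items()
+        if is_eligible(u) and is_active(u, now_ms)
+    )
+
+
+states = {
+    ("u1", "s1"): UspState(n_journal_updates=5, last_interaction_ms=1000),
+    ("u2", "s1"): UspState(n_journal_updates=2, last_interaction_ms=1000),
+    ("u3", "s2"): UspState(n_journal_updates=4, last_interaction_ms=None),
+}
+pool = candidate_pool(states, now_ms=1000 + WEEK_MS)  # only ("u1", "s1") qualifies
+```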
+
+## Process / evaluation terms
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **Generation period** | `generation_start_timestamp` (2014) → `test_generation_start_timestamp` (2021) | The window whose initiations become **training triples**. |
+| **Test period** | `test_generation_start_timestamp` (2021) → `generation_stop_timestamp` (2022) | The window whose initiations become **test contexts** (full candidate sets to rank, no negative sampling). This is a strict **temporal** train/test split — the model is always evaluated on the future. |
+| **RecContext** | `reccontext.RecContext` | One recommendation event, in any of three modes: *train* (emit triples), *test* (emit a full candidate matrix with known target), or *predict* (a full candidate matrix with **no** target — used to actually generate recs to email). |
+| **Coverage** | `coverage.py`, `n_coverage_users` | A diagnostic: across all the recs generated, *which sites ever get surfaced, and how often?* Answers "can the system even reach most eligible authors, or does it pile recs onto a popular few?" At test-period start the pipeline picks ~1000 eligible-active users and generates target-less recs for them purely to measure this. |
+| **Feature crosses (FC)** | comments in `feature_extraction.py` | Pairwise features computed *between* source and candidate (e.g. "are they reciprocal?", "friend-of-friend?"), as opposed to features of a single USP. |
+| **Pointwise** | `get_pointwise_training_triples` | The model is trained pointwise (each `(source, candidate)` pair gets an independent 0/1 label + BCE loss) rather than with a pairwise ranking loss — even though the triple structure is right there. See [modernization.md](modernization.md). |
+
+## Acronyms in the wild
+
+- **SSE** — "Site Suggestion Email," the weekly recommendation email (`cbsend/`).
+- **BPR** — Bayesian Personalized Ranking; referenced aspirationally in
+  `evaluation.py` comments but the deployed model is a plain MLP, not BPR.
+- **RAC** — `RecentActivityCounter`, the sliding-window activity counter.
+- **WCC** — Weakly Connected Component (of the initiation graph), used for the
+  `component_size` network feature.
+- **MSI** — Minnesota Supercomputing Institute, the Slurm cluster everything ran on.
+- **OID** — MongoDB ObjectID; the original data lived in MongoDB.
diff --git a/docs/modeling.md b/docs/modeling.md
new file mode 100644
index 0000000..2e2a848
--- /dev/null
+++ b/docs/modeling.md
@@ -0,0 +1,238 @@
+# Modeling: Features, Models, Training & Evaluation
+
+> Code: `cbrec/modeling/` and `cbrec/modeling/models/`.
+> Hyperparameter sweeps: `cbrec/experiment/`.
+> Upstream data: [data-pipeline.md](data-pipeline.md).
+
+The modeling task is a **pointwise binary classification** framing of ranking:
+score each `(source, candidate)` pair with a number in `[0, 1]`, then rank
+candidates by that score and check where the true target landed.
+
+## The feature vector
+
+Every model consumes the same fixed-length row, assembled at scoring time by
+`reccontext_builder.build_reccontext_X`. With the defaults in `genconfig.py`
+(`user_feature_count=12`, `user_pair_feature_count=3`, `text_feature_count=768`)
+the row is **1563** columns:
+
+```
+n_cols = 2·(user_feature_count + text_feature_count) + user_pair_feature_count
+       = 2·(12 + 768) + 3 = 1563
+```
+
+```mermaid
+flowchart LR
+    subgraph row["one input row (1563 dims)"]
+        S["[0:12]<br/>source<br/>non-text"]
+        C["[12:24]<br/>candidate<br/>non-text"]
+        P["[24:27]<br/>pair<br/>features"]
+        ST["[27:795]<br/>source<br/>RoBERTa (768)"]
+        CT["[795:1563]<br/>candidate<br/>RoBERTa (768)"]
+    end
+```
+
+> ⚠️ **Don't be fooled by the number 27.** It's the offset where the non-text
+> block ends (`12 + 12 + 3`), *not* the count of source features. Source non-text
+> is 12 columns. The `FeatureManager` (`preprocess.py`) lets you address slices
+> symbolically — `get_feature_indices('source', 'text')`, `('*', '~text')`, etc.
+> — which is what makes feature-ablation experiments one config line instead of a
+> code change.
+
+The non-text 12 = 3 network + 8 activity + 1 temporal; the 3 pair features are
+`are_weakly_connected · is_friend_of_friend · is_reciprocal`. See
+[data-pipeline.md](data-pipeline.md#what-a-usp-feature-vector-contains).
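+
+The offsets in the diagram follow directly from the three counts. A small
+sketch that recomputes them; the block names and `row_layout` are hypothetical
+labels for this example and are not the `FeatureManager` selector syntax:
+
+```python
+USER_FEATURE_COUNT = 12      # user_feature_count in genconfig.py
+USER_PAIR_FEATURE_COUNT = 3  # user_pair_feature_count
+TEXT_FEATURE_COUNT = 768     # text_feature_count
+
+
+def row_layout(n_user=USER_FEATURE_COUNT, n_pair=USER_PAIR_FEATURE_COUNT,
+               n_text=TEXT_FEATURE_COUNT):
+    """Return {block_name: (start, stop)} in the column order of the diagram."""
+    blocks = [
+        ("source_nontext", n_user),
+        ("candidate_nontext", n_user),
+        ("pair", n_pair),
+        ("source_text", n_text),
+        ("candidate_text", n_text),
+    ]
+    layout, offset = {}, 0
+    for name, width in blocks:
+        layout[name] = (offset, offset + width)
+        offset += width
+    return layout
+
+
+layout = row_layout()
+n_cols = layout["candidate_text"][1]  # total row width
+
+# Hypothetical ablation: keep every column that is not in a text block.
+non_text_cols = [
+    col
+    for name, (start, stop) in layout.items()
+    if not name.endswith("_text")
+    for col in range(start, stop)
+]
+```
+
+Dropping both text blocks leaves the 27 non-text columns, which is the kind of
+ablation the symbolic selectors are meant to express.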
+
+The two 768-d RoBERTa vectors are mean-pooled over the author's 3 most recent
+journal updates by default (`pool_text_feature=0`; alternatives are max-pool and
+concatenate). They're looked up and joined in by `text_loader.TextLoader`, which
+caches embeddings in an LRU and can pre-load all ~1M journal embeddings into RAM.
+
+## The model zoo
+
+All models are tiny `nn.Module`s selected by `model_config.model_name` in
+`ModelTrainer.create_net()`. They output **raw logits** (no final sigmoid) —
+the sigmoid lives in `BCEWithLogitsLoss` and in scoring.
+
+```mermaid
+flowchart TD
+    IN["input row (1563)"] --> SEL{model_name}
+    SEL --> LN["LinearNet"]
+    SEL --> SN["SimNet"]
+    SEL --> CN["ConcatNet"]
+    SEL --> LSN["LearnedSimNet"]
+    SEL --> IN2["InteractionNet"]
+```
+
+| Model | File | Learns? | Idea | Status |
+|-------|------|:---:|------|--------|
+| **LinearNet** | `linearnet.py` | ✔ | Concatenate the whole row, push through a 2-hidden-layer MLP (ReLU + dropout, no-bias output). The workhorse — this is the **deployed field-study model**. | Primary |
+| **SimNet** | `simnet.py` | ✘ | No parameters: take cosine (or L2) similarity between the source slice and the candidate slice. A pure-similarity baseline; "train" with 0 epochs. | Baseline |
+| **LearnedSimNet** | `learnedsimnet.py` | ✔ | A bi-encoder (DSSM/Sentence-BERT-style): project source and candidate independently, then take similarity of the two embeddings. Optional shared encoder. | Experimental |
+| **ConcatNet** | `concatnet.py` | ✔ | Encode source and candidate separately, concatenate the two encodings, then an MLP head. | Partial / experimental |
+| **InteractionNet** | `interactionnet.py` | ✔ | Form the outer product of source × candidate features (explicit pairwise interactions), flatten, MLP. Expressive but `n_input²`-wide. | Experimental |
+
+`cf.py` is a stub (a citation, no implementation). The shared-feature branches in
+`LearnedSimNet`/`InteractionNet` are present but untested per their comments.
+
+**LinearNet at a glance** (`linearnet.py`):
+
+```mermaid
+flowchart LR
+    X["x (1563)"] --> F1["fc1 → n_hidden"] --> R1["ReLU + dropout"]
+    R1 --> F2["fc2 → n_hidden"] --> R2["ReLU + dropout"]
+    R2 --> F3["fc3 → 1 (no bias)"] --> OUT["logit"]
+```
+
+## The training loop
+
+`ModelManager` (`manager.py`) is the façade: it owns a `FeatureManager`
+(preprocessing) and a `ModelTrainer` (`train.py`), and handles save/load via a
+JSON manifest that points at the `.pt` weights, the pickled scaler, and the
+metrics arrays.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant MG as model_gen.py
+    participant MM as ModelManager
+    participant FM as FeatureManager
+    participant MT as ModelTrainer
+    MG->>MM: ModelConfig + cached X_train_raw, y
+    MM->>MM: 99% / 1% train-validation split
+    MM->>FM: fit_transform(X_train)  (drop/encode cols, fit StandardScaler)
+    MM->>FM: transform(X_valid)
+    MM->>MT: train_model(X_train, y, X_valid, y_valid)
+    loop each epoch
+        MT->>MT: shuffle, minibatch, forward
+        MT->>MT: BCEWithLogitsLoss, backward, optimizer.step()
+        opt every 1/validation_rate epochs
+            MT->>MT: validation loss/acc, checkpoint if best
+        end
+        opt train loss below 0.001
+            MT->>MT: early stop
+        end
+    end
+    MM->>MM: save scaler + best .pt + metrics + manifest.json
+```
+
+Notable specifics:
+
+- **Loss / optimizer:** `BCEWithLogitsLoss`, Adam (betas/eps configurable),
+  optional `OneCycleLR` schedule (`train_scheduler_name`).
+- **Labels come in pairs** from the triples: `(source, target)=1`,
+  `(source, alt)=0`.
+- **Best-checkpoint selection** is by *validation loss*, saved as
+  `basename_e{epoch}.pt`; the chosen epoch is recorded as
+  `best_model_description` in the manifest.
+- **Early stopping** is a hard threshold on *training* loss (`< 0.001`) — crude,
+  and unrelated to generalization.
+
+## The baselines
+
+By the project's "a baseline is anything that needs no training" definition
+(`evaluation.py`), the non-neural comparators are scored directly from the
+replay state:
+
+- **Non-personalized:** `MostRecentJournal`, `MostJournals`,
+  `MostRecentInitiation`, `MostInitiations` — rank candidates purely by their own
+  recency/volume.
+- **Personalized:** `ClosestToStart` (temporal proximity of first journal),
+  `CosineSimilarity` (raw cosine of feature vectors — i.e. what `SimNet`
+  computes).
+
+These were computed via a separate path (`notebook/eval/BaselineCompute.ipynb`)
+and are the "can a dumb heuristic do this?" reference for the learned models.
+
+## Offline evaluation at scale
+
+Evaluation ranks the full candidate set per test context and records ranking
+metrics into NDJSON. The metrics (`evaluation.py`): `rank_of_target`,
+`reciprocal_rank`, and `ndcg_{1,5,10,50}`, plus per-context bookkeeping
+(`n_usps_scored`, `n_sites_scored`, raw target score). USP scores are reduced to
+**site** scores (an author site can have several eligible authors) before
+ranking. `coverage.py` aggregates, across all contexts, how often each site
+surfaces in the top-N (Welford streaming stats) — the system's *reach*.
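+
+A minimal sketch of the per-context metrics for a single target site. Two
+details are assumptions, since the text above does not specify them: USP scores
+are reduced to a site score by taking the maximum, and ties are counted against
+the target. The function names are illustrative, not those in `evaluation.py`:
+
+```python
+import math
+
+
+def site_scores(usp_scores, usp_sites):
+    """Reduce USP scores to one score per site (assumption: max over USPs)."""
+    best = {}
+    for score, site in zip(usp_scores, usp_sites):
+        if site not in best or score > best[site]:
+            best[site] = score
+    return best
+
+
+def rank_of_target(scores_by_site, target_site):
+    """1-based rank of the target site; ties count against the target."""
+    target = scores_by_site[target_site]
+    n_not_worse = sum(
+        1 for site, score in scores_by_site.items()
+        if site != target_site and score >= target
+    )
+    return 1 + n_not_worse
+
+
+def reciprocal_rank(rank):
+    return 1.0 / rank
+
+
+def ndcg_at_k(rank, k):
+    """NDCG@k with one relevant item: the ideal DCG is 1, so it is 1/log2(1+rank)."""
+    return 1.0 / math.log2(1 + rank) if rank <= k else 0.0
+
+
+by_site = site_scores([0.2, 0.9, 0.4, 0.7], ["a", "b", "b", "c"])
+rank = rank_of_target(by_site, target_site="c")  # site "b" outranks "c"
+metrics = {
+    "rank_of_target": rank,
+    "reciprocal_rank": reciprocal_rank(rank),
+    "ndcg_1": ndcg_at_k(rank, 1),
+    "ndcg_5": ndcg_at_k(rank, 5),
+}
+```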
+
+The performance problem: a hyperparameter sweep trains hundreds of models, and
+naively each evaluation would re-assemble every feature matrix and re-look-up
+every embedding. The fix is a **two-phase split**:
+
+```mermaid
+flowchart TD
+    subgraph once["build once"]
+        TC["test_contexts in feature.sqlite"] --> GR["generateRecMd.py"]
+        TL["TextLoader.cache_all_journals()"] --> GR
+        GR --> PKL["reccontextN.pkl<br/>(X_test fully assembled)"]
+    end
+    subgraph many["per model in the sweep"]
+        PKL --> EV["evaluateModelFromCacheMp.py<br/>(multiprocessing pool)"]
+        MODEL["trained model"] --> EV
+        EV --> NDJSON["*_test_metadata.ndjson<br/>per-context metrics"]
+        EV --> COV["coverage pickles"]
+    end
+    NDJSON --> NB["Jupyter: compare, pick model"]
+```
+
+Three evaluator variants exist with increasing speed: `evaluateModels.py`
+(streams contexts from the DB), `evaluateModelFromCache.py` (reads the pre-built
+pickles sequentially), and `evaluateModelFromCacheMp.py` (the multiprocessing
+version that overlaps pickle-loading with scoring). `submitEvalFromDirectory.py`
+fans a whole model directory out to Slurm; `--test-only` runs final test
+evaluation on a single chosen model.
+
+A subtlety worth knowing: `build_reccontext_X` **drops candidates with
+insufficient journal text** (fewer than 3 embeddings) and re-indexes
+`target_inds` to compensate. If the *target itself* lacks text, it raises an
+exception, because that context is unscorable.
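+
+The re-indexing step can be sketched as follows. This is a hypothetical
+helper, not the body of `build_reccontext_X`; in particular, raising only when
+*no* target index survives is an assumption about a detail the text leaves open:
+
+```python
+def drop_textless_candidates(n_embeddings, target_inds, min_embeddings=3):
+    """Return (kept_indices, new_target_inds).
+
+    n_embeddings[i] is the number of journal embeddings for candidate i.
+    Candidates below min_embeddings are dropped, and target indices are
+    remapped to positions in the filtered candidate list.
+    """
+    kept = [i for i, n in enumerate(n_embeddings) if n >= min_embeddings]
+    new_index = {old: new for new, old in enumerate(kept)}
+    new_targets = [new_index[t] for t in target_inds if t in new_index]
+    if not new_targets:
+        raise ValueError("target has insufficient journal text; context is unscorable")
+    return kept, new_targets
+
+
+# Candidates 1 and 3 lack text, so the target moves from index 4 to index 2.
+kept, new_targets = drop_textless_candidates([3, 1, 5, 0, 4], target_inds=[4])
+```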
+
+## Hyperparameter sweeps
+
+`experiment/config_gen.py` is a config-and-script generator, not a runner. An
+`Experiment` writes one `config_{i}.json` per cell of a grid plus the matching
+`sbatch`/shell scripts into `configs/`, `outputs/`, `scripts/`, and `sbatch/`. The committed
+examples are the actual paper sweeps — e.g. the field-study grid over
+`train_max_lr × LinearNet_n_hidden × train_weight_decay × LinearNet_dropout_p`
+with 3 replicates (540 configs), feature-ablation grids, and an Adam
+random-search. Each config is trained by `experiment/model_gen.py`, which loads
+the cached `X_train_raw.pkl`/`y_train_raw.pkl`, trains, and saves.
+
+```mermaid
+flowchart LR
+    EXP["Experiment grid<br/>(config_gen.py)"] --> CFGS["N × config.json<br/>+ sbatch scripts"]
+    CFGS --> SLURM["Slurm array"]
+    SLURM --> MGN["model_gen.py ×N"]
+    MGN --> MODELS["N trained models"]
+    MODELS --> EVALDIR["submitEvalFromDirectory.py"]
+```
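+
+A sketch of the "one config per grid cell and replicate" idea. The grid values,
+the `replicate` key, and `write_grid` are made up for illustration and are not
+the paper's sweep or the `Experiment` API; the sketch also omits the
+`sbatch`/shell script generation:
+
+```python
+import itertools
+import json
+import tempfile
+from pathlib import Path
+
+
+def write_grid(grid, n_replicates, out_dir):
+    """Write one config_{i}.json per (grid cell, replicate); return the paths."""
+    out_dir = Path(out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    keys = sorted(grid)
+    cells = itertools.product(*(grid[k] for k in keys))
+    paths = []
+    for i, (values, replicate) in enumerate(
+        itertools.product(cells, range(n_replicates))
+    ):
+        config = dict(zip(keys, values), replicate=replicate)
+        path = out_dir / f"config_{i}.json"
+        path.write_text(json.dumps(config, indent=2))
+        paths.append(path)
+    return paths
+
+
+# Hypothetical values: 2 x 2 x 2 x 1 cells, 3 replicates each.
+grid = {
+    "train_max_lr": [0.001, 0.01],
+    "LinearNet_n_hidden": [64, 128],
+    "train_weight_decay": [0.0, 0.0001],
+    "LinearNet_dropout_p": [0.1],
+}
+with tempfile.TemporaryDirectory() as tmp:
+    paths = write_grid(grid, n_replicates=3, out_dir=Path(tmp) / "configs")
+    n_configs = len(paths)
+    first = json.loads(paths[0].read_text())
+```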
+
+## Candid notes for the modeler
+
+- **Pointwise BCE on data that screams "pairwise."** The triples are literally
+  `(positive, negative)` per source, yet training optimizes independent 0/1
+  cross-entropy with a 0.5 threshold rather than a ranking loss (BPR/pairwise
+  hinge/softmax-CE over the candidate set). The `evaluation.py` comments even
+  muse about BPR. A pairwise or listwise objective is the most natural upgrade
+  and would align the training objective with the NDCG/MRR you actually report.
+
+- **"Learning the dot product is notoriously hard."** A code comment
+  (`evaluation.py:72`) wonders whether cosine similarity should just be a *shared
+  feature* fed to the model rather than something the net must rediscover. That
+  instinct is correct and underlies `LearnedSimNet`.
+
+- **Two-tower is half-built.** `LearnedSimNet` is the bi-encoder that modern
+  retrieval would actually deploy (precompute candidate embeddings, ANN search);
+  `ConcatNet` also has two towers but scores with an MLP head, so it still needs
+  a per-pair forward pass. Both are experimental here, while `LinearNet` over all
+  1563 features is what shipped: a pragmatic choice for a one-off study.
+
+- **Validation is a 1% holdout and early stopping is a magic number.** Fine for a
+  sweep where you pick the best config post-hoc; not what you'd standardize on.
+
+- **🕰️ Modern take.** RoBERTa-base mean-pooling (2019-era) would today be a
+  stronger sentence-embedding model (e.g. a current `sentence-transformers`
+  bi-encoder or an instruction-tuned embedding model), the bespoke train/eval
+  loop would be Lightning/Hugging Face `Trainer`, the sbatch-script generator
+  would be an experiment tracker + sweep agent (W&B/Optuna), and serving would
+  precompute candidate tower embeddings into an ANN index (FAISS/ScaNN) instead
+  of scoring every pair with a cross-encoder MLP. See
+  [modernization.md](modernization.md).
diff --git a/docs/modernization.md b/docs/modernization.md
new file mode 100644
index 0000000..aa73e64
--- /dev/null
+++ b/docs/modernization.md
@@ -0,0 +1,91 @@
+# Modernization Notes: "What would you do today?"
+
+This file collects, in one place, the candid "if you were starting this in
+2026, you'd probably…" observations that are sprinkled through the other docs.
+
+**Read this in the right spirit.** HealthBlogRec is a 2021 research artifact that
+did its job: it supported a published feasibility study and was re-run for paper
+revisions through 2024. The notes below are *not* a defect list. Many choices
+that look dated were the right call under the actual constraints — a fixed
+dataset, a single Slurm cluster, a paper deadline, and the ML tooling that
+existed at the time. The goal here is to help a future reader (or a future
+revival of this work) calibrate which parts are essential ideas versus
+incidental plumbing.
+
+## At a glance
+
+| Area | What the code does | What you'd reach for today | Why it matters |
+|------|--------------------|-----------------------------|----------------|
+| Feature pipeline | Hand-rolled streaming replay maintaining eligibility/activity/graph state, leakage-safe by careful ordering | A **point-in-time feature store** (Feast, Tecton) or stream processor (Flink, Kafka Streams) | Leakage-safety, backfills, and online/offline parity become first-class instead of invariants you enforce by hand |
+| Storage / handoffs | SQLite blobs, `.ndjson`, pickles passed file-to-file | **Parquet/Arrow** + a real artifact/data registry | Columnar, typed, portable, queryable; pickles are version-fragile and unsafe |
+| Text embeddings | RoBERTa-base, mean-pooled over 3 posts, 32-process batch script | A current `sentence-transformers` / instruction-tuned embedding model; batch inference on GPU with a serving runtime | Better representations; far less bespoke multiprocessing |
+| Model | Dense cross-feature MLP (`LinearNet`), pointwise BCE | Two-tower bi-encoder with a **pairwise/listwise ranking loss**; cross-encoder only for re-ranking a shortlist | Objective matches the NDCG/MRR you report; tower embeddings enable ANN retrieval |
+| Serving / retrieval | Score *every* eligible×active pair with the cross-encoder | Precompute candidate embeddings → **ANN index** (FAISS/ScaNN), retrieve then re-rank | Scales past brute-force pair scoring |
+| Experiments | `config_gen.py` emits hundreds of JSON configs + sbatch scripts | **Experiment tracker + sweep agent** (W&B/MLflow + Optuna/Ray Tune) | Provenance, dashboards, smarter search than grid |
+| Training loop | Hand-written epoch/validation/checkpoint loop | Lightning / HF `Trainer` / Ray Train | Less boilerplate, standard checkpointing, multi-GPU for free |
+| Orchestration | Slurm `.sh` scripts, manual stage ordering, pickled checkpoints | A workflow engine (Airflow/Prefect/Dagster/Ray) with typed stage I/O | Reproducible DAGs, retries, lineage |
+| Config | A Python `Config` object with hard-coded `/home/lana/...` paths | `pydantic`/`hydra` config + env-driven paths | Portability off the original cluster |
+| Reproducibility | A single global RNG seed (`872`), conda env listed in prose | Pinned lockfile (uv/poetry), seeded per-component, containerized | Bit-reproducible re-runs |
+
+## The deeper modeling point
+
+The single highest-leverage change is the **training objective**. The data is
+generated as `(source, target, alt)` triples — a positive and a sampled negative
+that share a source — which is *exactly* the input a pairwise ranking loss (BPR,
+pairwise hinge, or sampled-softmax cross-entropy) wants. Yet the model trains
+with pointwise binary cross-entropy and a 0.5 decision threshold, then is
+evaluated with ranking metrics (NDCG, reciprocal rank). Closing that gap — train
+the thing you measure — is the change most likely to move the numbers, and the
+plumbing to do it (the negatives) is already there. The code's own comments
+gesture at this (`evaluation.py` muses about BPR; a comment notes "learning the
+dot product is notoriously hard").
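+
+To make the gap concrete, here is a sketch comparing the two objectives on a
+single triple, written with the standard library rather than PyTorch. The
+function names are illustrative; the formulas are the standard BCE-with-logits
+and BPR losses, not code from this repository:
+
+```python
+import math
+
+
+def softplus(x):
+    """Numerically stable log(1 + exp(x))."""
+    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
+
+
+def pointwise_bce(target_logit, alt_logit):
+    """Mean BCE-with-logits over (source, target)=1 and (source, alt)=0."""
+    return 0.5 * (softplus(-target_logit) + softplus(alt_logit))
+
+
+def bpr_loss(target_logit, alt_logit):
+    """BPR: -log sigmoid(target - alt); depends only on the score margin."""
+    return softplus(-(target_logit - alt_logit))
+
+
+# Two triples with the same margin (target beats alt by 1.0 logit).
+low = (1.0, 0.0)
+high = (5.0, 4.0)
+bpr_low, bpr_high = bpr_loss(*low), bpr_loss(*high)
+bce_low, bce_high = pointwise_bce(*low), pointwise_bce(*high)
+```
+
+Both triples rank the target above the alt by the same margin, so the BPR loss
+is identical for them. The pointwise loss penalizes the second triple far more,
+because it also cares about the absolute level of each logit, which ranking
+metrics ignore.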
+
+The second point is **architecture for serving**. The deployed model is a
+cross-encoder: it needs both the source and candidate features in the same
+forward pass, so producing recs means scoring every eligible×active pair. That's
+fine for a study with a bounded candidate pool, but it doesn't scale. The
+half-built `LearnedSimNet`/`ConcatNet` two-tower models point the right way:
+encode candidates independently and offline, build an ANN index, retrieve a
+shortlist, and (optionally) re-rank the shortlist with a heavier cross-encoder.
+This is the standard modern retrieval pattern and the codebase was already
+reaching toward it.
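+
+The retrieve-then-re-rank pattern can be sketched in a few lines. Everything
+here is hypothetical: the vectors are toy data, the brute-force scan stands in
+for an ANN index, and `pair_scorer` stands in for a heavier pairwise model:
+
+```python
+import heapq
+
+
+def dot(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+
+def retrieve(source_vec, candidate_vecs, k):
+    """Stage 1: top-k candidate ids by dot product (an ANN index would go here)."""
+    return heapq.nlargest(
+        k, candidate_vecs, key=lambda cid: dot(source_vec, candidate_vecs[cid])
+    )
+
+
+def rerank(source_vec, shortlist, candidate_vecs, pair_scorer):
+    """Stage 2: re-order only the shortlist with a pairwise scorer."""
+    return sorted(
+        shortlist,
+        key=lambda cid: pair_scorer(source_vec, candidate_vecs[cid]),
+        reverse=True,
+    )
+
+
+# Toy candidate embeddings, computed offline in a real system.
+candidates = {
+    "a": [1.0, 0.0],
+    "b": [0.8, 0.6],
+    "c": [0.0, 1.0],
+    "d": [-1.0, 0.0],
+}
+source = [1.0, 0.2]
+
+shortlist = retrieve(source, candidates, k=2)
+# Toy scorer that happens to prefer a large second component.
+final = rerank(source, shortlist, candidates, pair_scorer=lambda s, c: c[1])
+```
+
+Only the shortlist reaches the pairwise scorer, so the expensive model runs `k`
+times per source instead of once per candidate in the pool.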
+
+## Known sharp edges (documented in code)
+
+These are called out at their source in [data-pipeline.md](data-pipeline.md) and
+[modeling.md](modeling.md); summarized here so they're not a surprise:
+
+- The test-time target fallback picks "most recently joined eligible author"
+  when none is active, and the comment admits *"random might literally be
+  better"* (`reccontext.py:165`). Metadata records when this fired
+  (`test_target_usp_adjustment`).
+- Amp (reaction) timestamps are **reconstructed** from a delay distribution, so
+  amp-recency features are approximate (`data.py`).
+- A feature cache is resized mid-run with a `FIXME is this reasonable?`
+  (`triple_generation.py:202`), and the reuse cache "basically doesn't do
+  anything" outside coverage generation (`feature_extraction.py:64`).
+- `replace_config()` warns in its own docstring that it won't do what you expect.
+- Early stopping is a hard-coded train-loss threshold (`< 0.001`), unrelated to
+  generalization.
+
+See also the **"Fixed bugs during deployment"** section of the top-level
+[README](../README.md): a few real bugs (an activity time-elapsed bimodality, a
+pre-2018 comment timezone bug, and a feature-preprocessing bug dropping source
+features on ~12% of post-July-2020 training data) were caught and corrected for
+the offline results in the paper. That section is a good model of the kind of
+honesty research code should carry — these docs try to extend it.
+
+## What aged well
+
+It's worth saying plainly, because it's the reason the repo is worth reading:
+
+- **Leakage discipline.** The streaming replay makes temporal correctness
+  structural, not aspirational. A lot of "modern" pipelines get this *wrong*.
+- **A clean temporal train/test split.** Always predicting the future.
+- **Feature deduplication by content hash** — a genuinely nice space optimization
+  for a combinatorial pair space.
+- **Coverage as a first-class metric.** Asking "can we even reach most eligible
+  authors, or do recs pile onto a popular few?" is exactly the right question for
+  a social-support intervention, and many recommender evaluations omit it.
+- **Honest, self-deprecating comments.** The code tells you where the bodies are
+  buried. That's rarer than it should be.