diff --git a/README.md b/README.md
index d6dfb1f..973e4b6 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,21 @@ HealthBlogRec is a recommendation system for peer health blogs.

+## Documentation
+
+In-depth developer documentation lives in [`docs/`](docs/): a diagram-heavy
+guided tour of how the system fits together, written for someone reading the
+[paper](https://arxiv.org/abs/2209.04973) and the source side by side. Because
+this is research code from 2021, the docs also include candid **🕰️ Modern take**
+notes flagging where current practice would steer you differently.
+
+ - [`docs/README.md`](docs/README.md): start here for orientation and reading order
+ - [`docs/glossary.md`](docs/glossary.md): the project's vocabulary (*USP*, *initiation*, *eligible*/*existing*/*active*, *triple*, *coverage*)
+ - [`docs/architecture.md`](docs/architecture.md): the whole system on one page, plus entry points and the experiment loop
+ - [`docs/data-pipeline.md`](docs/data-pipeline.md): the streaming history-replay that generates training triples and test contexts
+ - [`docs/modeling.md`](docs/modeling.md): the feature vector, model zoo, training loop, baselines, and offline evaluation
+ - [`docs/modernization.md`](docs/modernization.md): consolidated "what would you do today?" notes
+
## Citation and project history
If any portion of this project is useful to you, please cite the following paper:
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..8a751e4
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,85 @@
+# HealthBlogRec: Developer Documentation
+
+This folder is a guided tour of the HealthBlogRec codebase, written for the
+reader who has the [arXiv paper](https://arxiv.org/abs/2209.04973) open in one
+tab and the source in another, and is trying to understand *how the choices in
+the paper map onto the code*.
+
+It is intentionally heavy on diagrams and light on prose. The code itself has
+reasonable docstrings in the places that matter; what was missing, and what
+these documents try to provide, is the **bird's-eye view**: how the pieces fit
+together, why they're shaped the way they are, and where the sharp edges are.
+
+> ⚠️ **This is research code from 2021.** It was written to run one study on one
+> cluster (the University of Minnesota's MSI) and to be re-run a handful of times
+> for paper revisions. It is *not* a library, it is *not* deployable as-is, and
+> several decisions that look strange today were reasonable given the deadline,
+> the data, and the tooling of the time. Throughout these docs, call-outs marked
+> **🕰️ Modern take** flag places where current practice would steer you
+> somewhere else. They are not bug reports; they are orientation for anyone
+> trying to learn from or revive this work.
+
+## Where to start
+
+Read in this order:
+
+1. **[glossary.md](glossary.md)**: the vocabulary. The code is dense with
+ domain-specific terms (*USP*, *initiation*, *eligible* vs *existing* vs
+ *active*, *triple*, *coverage*). Nothing else makes sense until these click.
+2. **[architecture.md](architecture.md)**: the whole system on one page, plus
+ the two entry points (`gen.py`, `predict.py`) and how an offline experiment
+ flows end to end.
+3. **[data-pipeline.md](data-pipeline.md)**: the heart of the project, the
+ streaming "replay history one interaction at a time" simulation in
+ `cbrec.triple_generation` that turns raw CaringBridge activity into training
+ triples and test/prediction contexts.
+4. **[modeling.md](modeling.md)**: the feature vector, the model zoo
+ (`LinearNet`, `SimNet`, `ConcatNet`, `LearnedSimNet`, `InteractionNet`), the
+ training loop, the baselines, and how offline evaluation is run at scale.
+5. **[modernization.md](modernization.md)**: a consolidated, candid list of
+ "what would you do differently today," collected so it doesn't clutter the
+ reference docs.
+
+## The one-paragraph summary
+
+HealthBlogRec recommends *peer health blogs* (CaringBridge sites) to people who
+write their own. The core modeling task is framed as: given a **source** author
+who is about to reach out to a new site (an **initiation**), can we rank the pool
+of **candidate** authors so the one they actually reached out to (the **target**)
+floats to the top? The system learns from history by replaying years of
+CaringBridge interactions in timestamp order, snapshotting the network/activity
+state at each real initiation, and emitting a `(source, target, sampled
+non-target)` **triple**. Each user/site pair is encoded as network features +
+recent-activity features + a RoBERTa embedding of their recent journal updates.
+A small MLP scores `(source, candidate)` pairs; the trained model was used to
+send weekly recommendation emails during a 2021 field study.
+
+## Map of the repository
+
+```
+HealthBlogRec/
+├── src/
+│   ├── gen.py / predict.py        # thin entry points → cbrec.triple_generation / cbrec.prediction
+│   ├── *.sh                       # Slurm batch scripts for the MSI cluster
+│   ├── cbrec/                     # the recommendation package (see data-pipeline.md, modeling.md)
+│   │   ├── triple_generation.py   # the streaming history-replay simulation (the core)
+│   │   ├── reccontext.py          # one recommendation event (source → candidates [→ target])
+│   │   ├── feature_extraction.py  # USP → feature vector, + async DB writer
+│   │   ├── eligibility.py         # who counts as existing / eligible
+│   │   ├── network.py             # the interaction graph (indegree/outdegree/component)
+│   │   ├── recentActivityCounter.py / timeAwareDict.py  # time-windowed activity counts
+│   │   ├── featuredb.py           # SQLite storage of feature arrays / triples / contexts
+│   │   ├── evaluation.py / coverage.py  # baselines, ranking metrics, coverage
+│   │   ├── text/                  # RoBERTa journal embeddings in SQLite
+│   │   ├── modeling/              # the PyTorch models, training, and offline eval
+│   │   └── experiment/            # generates hyperparameter-sweep configs + Slurm scripts
+│   ├── cbsend/                    # composes and sends the weekly recommendation emails
+│   └── extract/                   # MongoDB BSON dumps → flattened ndJSON
+├── notebook/                      # analysis & experimentation (outputs/IDs scrubbed)
+├── data/                          # small sample data artifacts
+├── design/                        # email mockups
+└── figures/                       # figures used in the paper
+```
+
+See the top-level [README](../README.md) for citation, study history, the conda
+environment, and the list of bugs caught during deployment.
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..b813f80
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,166 @@
+# Architecture Overview
+
+This is the whole system on one page. For the vocabulary
+(*USP*, *initiation*, *eligible*…) see [glossary.md](glossary.md) first.
+
+## The big picture
+
+HealthBlogRec turns years of raw CaringBridge activity into a trained model that
+ranks peer authors, and used that model to send weekly recommendation emails
+during a 2021 field study. There are five loosely-coupled stages, each writing
+artifacts to disk that the next stage reads. This file-handoff design is
+deliberate: each stage took hours-to-days on the cluster and had to be
+independently restartable.
+
+```mermaid
+flowchart TD
+ subgraph raw["① Raw data (MongoDB → flat files)"]
+ BSON["MongoDB BSON dumps"] -->|"src/extract/"| FLAT["ints_df.feather<br/>journal_metadata.feather<br/>user_site_df.csv"]
+ TXT["journalText.sqlite<br/>(raw post text)"]
+ end
+
+ subgraph gen["② Data generation: cbrec.triple_generation (gen.py)"]
+ FLAT --> REPLAY["Replay interactions in<br/>timestamp order, maintaining<br/>graph + activity + eligibility state"]
+ REPLAY -->|"each eligible initiation"| FDB[("feature.sqlite<br/>triples · test_contexts · feature arrays")]
+ REPLAY --> MDN["metadata.ndjson"]
+ end
+
+ subgraph text["③ Text features: cbrec.text"]
+ TXT -->|"RoBERTa, 32 procs"| TFDB[("text_feature.sqlite<br/>768-d embeddings per journal")]
+ end
+
+ subgraph model["④ Modeling: cbrec.modeling"]
+ FDB --> TRAIN["train.py / model_gen.py<br/>MLP on (source, candidate) pairs"]
+ TFDB --> TRAIN
+ TRAIN --> MODEL["trained model<br/>(.pt + .json manifest)"]
+ MODEL --> EVAL["offline evaluation<br/>(NDCG, reciprocal rank, coverage)"]
+ FDB --> EVAL
+ TFDB --> EVAL
+ end
+
+ subgraph deploy["⑤ Deployment (the 2021 field study)"]
+ MODEL --> PRED["cbrec.prediction (predict.py)<br/>score live candidates"]
+ FLAT --> PRED
+ PRED --> EMAIL["cbsend/<br/>compose + send weekly SSE emails"]
+ end
+```
+
+This diagram is the code version of the paper's
+[system overview figure](../figures/rec_system_overview.png).
+
+## The two entry points
+
+The `src/` top level is deliberately thin; the real logic is in the `cbrec`
+package. Everything is launched through Slurm batch scripts (`src/*.sh`) on the
+MSI cluster.
+
+| Script | Delegates to | Purpose |
+|--------|--------------|---------|
+| `gen.py` | `cbrec.triple_generation.main()` | Stage ②: generate all training triples and test/predict contexts. ~50 hours. |
+| `predict.py` | `cbrec.prediction.main()` | Stage ⑤: load a checkpoint + model and produce live recommendations for a set of users. |
+
+The supporting `make_text_features_*.sh` scripts drive Stage ③, and the
+`cbrec.experiment` + `cbrec.modeling` machinery drives Stage ④ (see
+[modeling.md](modeling.md)).
+
+## How an offline experiment flows end-to-end
+
+This is the loop a researcher actually ran to compare models for the paper.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant U as Researcher
+ participant CG as experiment/config_gen.py
+ participant SL as Slurm (MSI)
+ participant MG as modeling/model_gen.py
+ participant FDB as feature.sqlite + cached X_train
+ participant EV as modeling/evaluateModelFromCache*.py
+ participant NB as Jupyter notebook
+
+ U->>CG: define a hyperparameter sweep
+ CG->>SL: emit one config.json + sbatch script per cell of the grid
+ loop each config in the sweep
+ SL->>MG: train one model
+ MG->>FDB: load cached training matrix
+ MG-->>SL: write model .pt + .json manifest
+ end
+ U->>EV: submitEvalFromDirectory.py over the model dir
+ loop each trained model
+ EV->>EV: score every cached RecContext, write *_test_metadata.ndjson
+ end
+ EV-->>NB: NDJSON metrics + coverage pickles
+ NB->>U: compare models, pick the field-study model
+```
+
+The key performance trick: **RecContexts are pre-built once and cached as
+pickles** (`generateRecMd.py`) so that evaluating 500+ models in a sweep doesn't
+re-pay the cost of assembling feature matrices and looking up embeddings each
+time. See [modeling.md](modeling.md#offline-evaluation-at-scale).
+
+## Package map
+
+```mermaid
+flowchart LR
+ subgraph cbrec["cbrec (recommendation core)"]
+ TG["triple_generation<br/>(orchestrator)"]
+ RC["reccontext"]
+ FE["feature_extraction"]
+ EL["eligibility"]
+ NW["network"]
+ RA["recentActivityCounter<br/>timeAwareDict"]
+ FD["featuredb"]
+ DA["data"]
+ EVl["evaluation<br/>coverage"]
+ TG --> RC --> FE
+ TG --> EL & NW & RA & DA
+ FE --> FD
+ RC --> EVl
+ end
+ subgraph txt["cbrec.text"]
+ TDB["textdb"] --> CTF["createTextFeatureSqlite<br/>(RoBERTa)"] --> EDB["embeddingdb"]
+ JID["journalid"]
+ end
+ subgraph mdl["cbrec.modeling"]
+ MC["modelconfig"] --> MM["manager"]
+ MM --> PP["preprocess"] & MT["train"]
+ MT --> MODELS["models/*<br/>LinearNet · SimNet · ConcatNet<br/>LearnedSimNet · InteractionNet"]
+ MM --> SC["scorer"]
+ RCB["reccontext_builder"] --> TL["text_loader"]
+ end
+ subgraph exp["cbrec.experiment"]
+ CGN["config_gen"] --> MGN["model_gen"]
+ end
+ cbrec --> mdl
+ txt --> mdl
+ exp --> mdl
+ mdl --> SEND["cbsend (emails)"]
+```
+
+## Design choices worth understanding up front
+
+- **Everything is a streaming replay.** The system never "queries the current
+ state of the database." It replays history one interaction at a time and
+ computes features from the *online* state at that instant. This is what makes
+ the features causally honest (no future leakage), and what makes the pipeline
+ a stateful, checkpoint-heavy, ~50-hour job. See [data-pipeline.md](data-pipeline.md).
+
+- **Temporal train/test split, not random.** Train = initiations 2014–2020,
+ test = initiations in 2021. The model is always asked to predict the future.
+
+- **Feature arrays are content-addressed and deduplicated** in SQLite
+ (`featuredb.py`), so the hundreds of millions of `(source, candidate)` pairs
+ collapse to far fewer unique stored vectors.
+
+- **Text is a separate pipeline on purpose.** RoBERTa embedding of ~1M journals
+ is expensive and embarrassingly parallel, so it lives in its own stage
+ (`cbrec.text`, 32 processes) and is joined in only at model-training time.
+
+- **🕰️ Modern take:** Today most of stages ②–④ would be a feature store + a
+ scheduled batch job (or a streaming feature pipeline like Feast/Flink), the
+ bespoke SQLite + ndjson + pickle handoffs would be Parquet/Arrow, and the
+ Slurm sbatch-script-generation in `cbrec.experiment` would be an experiment
+ tracker (Weights & Biases / MLflow) plus a workflow engine
+ (Airflow/Prefect/Ray). The architecture's *shape* (replay → features →
+ train → eval → serve) is still exactly what you'd build. See
+ [modernization.md](modernization.md).
diff --git a/docs/data-pipeline.md b/docs/data-pipeline.md
new file mode 100644
index 0000000..b44c042
--- /dev/null
+++ b/docs/data-pipeline.md
@@ -0,0 +1,230 @@
+# The Data Generation Pipeline
+
+> Code: `cbrec/triple_generation.py` and everything it orchestrates.
+> Entry point: `src/gen.py`. Vocabulary: [glossary.md](glossary.md).
+
+This is the most important, and least obvious, part of the project. Everything
+the model learns from is produced here. If you understand this file, you
+understand HealthBlogRec.
+
+## The core idea: replay history, one interaction at a time
+
+The pipeline does **not** query a database for "the current network." Instead it
+**replays the entire interaction stream in timestamp order**, and maintains the
+graph, activity counters, and eligibility sets *incrementally* as each event
+arrives. When it reaches a real initiation, it snapshots the state *as it existed
+at that instant* and emits training data.
+
+Why bother? Because it guarantees **no temporal leakage**: a USP's features at
+time *t* can only reflect what had happened before *t*. A model trained on these
+features is being asked exactly the question it will face in production ("given
+only the past, rank these candidates"), which is why the offline numbers are
+trustworthy.
+
+```mermaid
+flowchart TD
+ START["RecDataGenerator.generate_rec_data()"] --> LOAD["DataManager loads + trims<br/>ints_df (sorted by created_at)"]
+ LOAD --> LOOP{"for each interaction<br/>in timestamp order"}
+ LOOP --> JUP["apply any journal updates<br/>that occurred before this ts<br/>(timeAwareDict)"]
+ JUP --> UPDST["update activity counters<br/>(recentActivityManager)"]
+ UPDST --> ISINIT{"first time this user<br/>touched this site?<br/>(is_initiation)"}
+ ISINIT -->|no| UPDACT["record interaction in<br/>activity counter"]
+ ISINIT -->|yes| PERIOD{"which period?"}
+ PERIOD -->|"before 2014<br/>(warm-up)"| GRAPH
+ PERIOD -->|"2014–2020<br/>(generation)"| TRAIN["build RecContext →<br/>generate TRAIN triples"]
+ PERIOD -->|"2021–2022<br/>(test)"| TEST["build RecContext →<br/>generate TEST context"]
+ TRAIN --> WRITE["DatabaseWriter (async) →<br/>feature.sqlite + metadata.ndjson"]
+ TEST --> WRITE
+ GRAPH["update initiation graph<br/>(add edge source→existing authors)"] --> UPDACT
+ WRITE --> GRAPH
+ UPDACT --> LOOP
+ LOOP -->|"hit generation_stop_timestamp"| DONE["final checkpoint"]
+```
+
+The loop lives in `generate_rec_data()` (line ~126) and `process_interaction()`
+(line ~254). The three time boundaries are config values in `genconfig.py`:
+`generation_start_timestamp` (2014), `test_generation_start_timestamp` (2021),
+`generation_stop_timestamp` (2022).
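
As a rough illustration of the control flow above, the sketch below replays a toy interaction list, flags first contacts as initiations, and labels each with its period. It is not the code in `triple_generation.py`: the `replay` function, the tuple layout, and the use of plain years in place of real timestamps are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Set, Tuple

Interaction = Tuple[int, int, int]  # (timestamp, user_id, site_id), hypothetical layout


def replay(interactions: Iterable[Interaction], gen_start: int, test_start: int,
           stop: int) -> Iterator[Tuple[int, int, int, str]]:
    """Yield each initiation (first user->site contact) with its period label."""
    seen: Set[Tuple[int, int]] = set()
    for ts, user_id, site_id in sorted(interactions):  # timestamp order
        if ts >= stop:
            break
        if (user_id, site_id) in seen:
            continue  # a repeat interaction, not an initiation
        seen.add((user_id, site_id))
        if ts < gen_start:
            period = "warm-up"  # only updates state
        elif ts < test_start:
            period = "train"    # would emit training triples
        else:
            period = "test"     # would emit a full test context
        yield ts, user_id, site_id, period


events = [(2013, 1, 10), (2015, 1, 10), (2015, 2, 10), (2021, 1, 11), (2022, 3, 10)]
for row in replay(events, gen_start=2014, test_start=2021, stop=2022):
    print(row)
```

The second `(1, 10)` event is skipped because that user already touched that site, and the 2022 event is never reached because the loop stops at the stop boundary.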
+
+## The stateful machinery
+
+`RecDataGenerator.__init__` wires together a small set of collaborating state
+objects. Each one answers a different question about "the world so far":
+
+```mermaid
+flowchart LR
+ subgraph state["Online state maintained during replay"]
+ EL["UserSitePairEligibilityManager<br/>eligibility.py"]
+ NW["UserGraph + WccGraph<br/>network.py"]
+ RA["RecentActivityManager<br/>recentActivityCounter.py"]
+ TAD["TimeAwareDict (journals)<br/>timeAwareDict.py"]
+ end
+ EL -.->|"who is existing / eligible?"| Q1
+ NW -.->|"indegree, outdegree,<br/>component, reciprocal, FoF?"| Q1
+ RA -.->|"who is active? recent<br/>activity counts + recency?"| Q1
+ TAD -.->|"which journals happened<br/>before now?"| Q1
+ Q1["FeatureGenerator<br/>builds a USP's feature vector"]
+```
+
+| Object | File | Maintains | Feeds |
+|--------|------|-----------|-------|
+| `UserSitePairEligibilityManager` | `eligibility.py` | counts of journal updates per USP; the existing/eligible sets; site→author maps | eligibility checks, candidate pool |
+| `UserGraph` (+ `WccGraph`) | `network.py` | the directed initiation graph; weakly-connected components | network features (indegree/outdegree/component/reciprocal/FoF) |
+| `RecentActivityManager` (a set of `RecentActivityCounter`s) | `recentActivityCounter.py` | sliding-window counts of journals/amps/comments/guestbooks; first-journal timestamps | activity features, the "active" set |
+| `TimeAwareDict` | `timeAwareDict.py` | journal updates indexed by timestamp, consumed once | drives the eligibility/activity updates |
+
+The activity counters use a neat trick: an `OrderedDict` keyed by timestamp lets
+expiry be an O(1) pop from the left as the window slides forward
+(`RecentActivityCounter.update_counts`). The `WccGraph` keeps component
+membership up to date by merging components whenever an edge bridges two of them.
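
The sliding-window expiry described above can be sketched in a few lines. This is a minimal stand-in, not `RecentActivityCounter` itself: the class name, method names, and the choice to store a per-timestamp count are assumptions for illustration, and it relies on events arriving in timestamp order (as they do in the replay).

```python
from collections import OrderedDict


class SlidingWindowCounter:
    """Counts events whose timestamp lies in (now - window_ms, now]."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.events: "OrderedDict[int, int]" = OrderedDict()  # timestamp -> count
        self.total = 0

    def add(self, timestamp_ms: int) -> None:
        # Events arrive in timestamp order, so insertion order equals time order.
        self.events[timestamp_ms] = self.events.get(timestamp_ms, 0) + 1
        self.total += 1

    def expire(self, now_ms: int) -> None:
        # Pop from the left until the oldest entry is inside the window.
        cutoff = now_ms - self.window_ms
        while self.events:
            oldest_ts = next(iter(self.events))
            if oldest_ts > cutoff:
                break
            _, count = self.events.popitem(last=False)
            self.total -= count


WEEK_MS = 7 * 24 * 60 * 60 * 1000
counter = SlidingWindowCounter(WEEK_MS)
counter.add(0)
counter.add(1_000)
counter.expire(now_ms=WEEK_MS + 500)  # the event at t=0 falls out of the window
print(counter.total)  # 1
```

Each expired entry costs one `popitem(last=False)`, so the work per replay step is proportional to the number of events leaving the window rather than to the window size.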
+
+## From an initiation to a training triple
+
+When an eligible initiation occurs during the generation period,
+`RecContext.generate_train_features()` produces one or more **triples**:
+
+```mermaid
+flowchart TD
+ INIT["initiation: source_user → target_site"] --> SRC["source USPs =<br/>all sites where source is eligible<br/>(usually 1)"]
+ INIT --> TGT["target USPs =<br/>eligible authors on target_site"]
+ INIT --> CAND["candidate pool =<br/>active ∩ eligible USPs<br/>(minus self, coauthors, existing ties)"]
+ SRC --> PAIR
+ TGT --> PAIR
+ CAND -->|"random negative sample"| ALT["alt USP"]
+ ALT --> PAIR["TRIPLE = (source, target, alt)"]
+ PAIR --> POS["(source, target) → label 1"]
+ PAIR --> NEG["(source, alt) → label 0"]
+ POS --> FEAT["FeatureGenerator:<br/>12 USP features each<br/>+ 3 pairwise features"]
+ NEG --> FEAT
+ FEAT --> SAVE["DatabaseWriter.save_triples()<br/>→ feature.sqlite"]
+```
+
+Negative sampling here is **uniform random** from the eligible-active candidate
+pool. Note that what gets stored is *features*, not raw IDs: each triple
+row references five `feature_id`s (source, target, alt, source-target pair,
+source-alt pair), and identical feature vectors are deduplicated by content hash
+(see below).
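
To make the triple structure concrete, here is a small sketch of uniform negative sampling and the two labeled pairs a triple expands into. The function names and the `excluded` argument (standing in for self, coauthors, and existing ties) are hypothetical; the real logic lives in `RecContext.generate_train_features()`.

```python
import random
from typing import List, Set, Tuple

USP = Tuple[int, int]  # (user_id, site_id)


def sample_triple(source: USP, target: USP, candidates: List[USP],
                  excluded: Set[USP], rng: random.Random) -> Tuple[USP, USP, USP]:
    """Return (source, target, alt) with alt drawn uniformly from the pool."""
    pool = [c for c in candidates if c != target and c not in excluded]
    if not pool:
        raise ValueError("no negative candidates available")
    return source, target, rng.choice(pool)


def to_labeled_pairs(triple: Tuple[USP, USP, USP]):
    """Expand one triple into a positive and a negative (pair, label) row."""
    source, target, alt = triple
    return [((source, target), 1), ((source, alt), 0)]


rng = random.Random(0)
source, target = (1, 10), (2, 20)
candidates = [(2, 20), (3, 30), (4, 40), (1, 10)]
triple = sample_triple(source, target, candidates, excluded={source}, rng=rng)
pairs = to_labeled_pairs(triple)
print(pairs)
```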
+
+## Test contexts are different: rank the whole pool
+
+In the test period there's no negative sampling. Instead
+`generate_test_features()` materializes the **entire candidate set** as dense
+matrices so the evaluator can score every candidate and find where the true
+target ranked:
+
+| Array | Shape | Meaning |
+|-------|-------|---------|
+| `source_usp_arr` | `(X, 2)` | source USPs (usually X=1) |
+| `candidate_usp_arr` | `(Y, 2)` | every candidate USP, sorted by `site_id` |
+| `target_inds` | `(Z,)` | indices into `candidate_usp_arr` that are the true target |
+| `source_usp_mat` | `(X, 12)` | non-text features for sources |
+| `candidate_usp_mat` | `(Y, 12)` | non-text features for candidates |
+| `user_pair_mat` | `(X·Y, 3)` | pairwise features for every source×candidate combo |
+
+The text embeddings are **not** stored here; they're joined in later at scoring
+time by `reccontext_builder.build_reccontext_X` (see [modeling.md](modeling.md)).
+This keeps `feature.sqlite` from ballooning with duplicated 768-d vectors.
+
+A **predict** context is a test context with no target. Predict contexts were
+used to generate the recs that were actually emailed, and also for the ~1000
+target-less "coverage" contexts created at test-period start to measure reach.
+
+## What a USP feature vector contains
+
+`FeatureGenerator.generate_user_features` produces a **12-dimensional** non-text
+vector per USP, and `generate_user_pair_features` produces **3** pairwise
+features:
+
+```
+USP features (12):
+  network  (3): indegree · outdegree · weakly-connected-component size
+  activity (8): for each of {journal, amp, comment, guestbook}:
+                  count-in-window · hours-since-most-recent
+  temporal (1): hours since the author's first journal update
+
+pair features (3): are_weakly_connected · is_friend_of_friend · is_reciprocal
+```
+
+At model-training time these are concatenated with two 768-d RoBERTa journal
+embeddings (source + candidate) to form the final input vector. The full layout
+is documented in [modeling.md](modeling.md#the-feature-vector).
+
+## Persistence: the async writer and feature dedup
+
+Feature computation runs in the main process; disk I/O is offloaded to a
+separate `WriterProcess` over a queue (`feature_extraction.py`). This lets the
+CPU-bound replay keep moving while SQLite writes batch up (commits every
+`feature_db_max_commit_size` = 500k inserts).
+
+```mermaid
+flowchart LR
+ MAIN["main loop<br/>(compute features)"] -->|"queue items:<br/>feature / triple /<br/>test_context / metadata"| Q(["mp.Queue"])
+ Q --> WP["WriterProcess"]
+ WP -->|"batched commits"| SQL[("feature.sqlite")]
+ WP -->|"append JSON line"| ND["metadata.ndjson"]
+ MAIN -.->|"content-hash dedup (LruCache of feature_ids)"| MAIN
+```
+
+`feature.sqlite` holds three logical tables: `feature` (deduplicated numpy
+arrays, serialized with a 1-byte dtype prefix), `triple` (training rows
+referencing feature_ids), and `test_context` (test/predict rows referencing the
+six matrices above). `metadata.ndjson` carries one rich metadata dict per
+RecContext (eligibility flags, candidate counts, adjustment reasons), used for
+slicing results during analysis.
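
The content-hash deduplication idea can be illustrated with an in-memory toy. Everything here is an assumption made for the sketch: the `FeatureStore` class, the use of SHA-256, and the raw `array` byte encoding. The source only states that `featuredb.py` deduplicates feature arrays by content hash and serializes them with a dtype prefix.

```python
import hashlib
from array import array
from typing import Dict, List, Sequence


class FeatureStore:
    """Toy content-addressed store: identical vectors share one feature_id."""

    def __init__(self) -> None:
        self._id_by_hash: Dict[bytes, int] = {}
        self.rows: List[bytes] = []  # stand-in for the SQLite `feature` table

    def put(self, values: Sequence[float]) -> int:
        blob = array("d", values).tobytes()      # serialize as float64 bytes
        key = hashlib.sha256(blob).digest()      # hash of the content
        if key not in self._id_by_hash:          # store only unseen content
            self._id_by_hash[key] = len(self.rows)
            self.rows.append(blob)
        return self._id_by_hash[key]


store = FeatureStore()
a = store.put([1.0, 2.0, 3.0])
b = store.put([1.0, 2.0, 3.0])  # same content, same id, no new row
c = store.put([4.0, 5.0, 6.0])
print(a, b, c, len(store.rows))  # 0 0 1 2
```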
+
+## Checkpointing
+
+Because a full run is ~50 hours, the entire `RecDataGenerator` is pickled at
+milestones (`create_checkpoint`): before the generation period, before the test
+period, and at the end. `gen.py --from-recent-checkpoint` resumes by replaying
+to `current_interaction_ind` and continuing. The `__getstate__`/`__setstate__`
+hooks drop and rebuild the (unpicklable) `DataManager`.
+
+## Sharp edges and candid notes
+
+These are real comments and behaviors in the code, worth knowing before you
+trust or modify the pipeline.
+
+- **The target-selection fallback is admittedly arbitrary.** At test time, if no
+ author on the target site is currently *active*, the code forces one in for
+ evaluation. With multiple eligible authors it picks the one who *joined the
+ site most recently*, and the comment says so plainly: *"a very arbitrary
+ selection criterion; random might literally be better"* (`reccontext.py:165`).
+ The `test_target_usp_adjustment` field in metadata records when this happened
+ so you can filter those cases.
+
+- **Debug mode fabricates eligibility.** In `TestConfig`/debug runs,
+ `DataManager` duplicates a journal row per user so everyone clears the
+ 3-update eligibility bar (`data.py:50`). Great for exercising code paths on
+ tiny test data; obviously not a real-data behavior.
+
+- **Amp timestamps are reconstructed.** "Amps" (reactions) weren't reliably
+ timestamped in the source data, so `trim_data()` adds offsets drawn from a
+ precomputed delay distribution (`reaction_ms_since_journal.npy`). The features
+ built on amp recency are therefore approximate by construction.
+
+- **`replace_config()` is explicitly hacky.** Its own docstring warns it "will
+ NOT do everything you expect, and most config settings will be unaffected"
+ (`triple_generation.py:115`). It exists to enable a few specific mutations for
+ the predict path.
+
+- **A cache is resized mid-run on a hunch.** At test-period start the feature
+ array cache is shrunk from 5000 to 10 because test arrays are much larger,
+ with the comment `FIXME is this reasonable?` (`triple_generation.py:202`).
+
+- **The reuse cache mostly doesn't fire.** `FeatureGenerator`'s LRU cache keys on
+ `(user, site, timestamp)`; since timestamps are essentially unique, repeat hits
+ "basically don't happen"; the author says as much
+ (`feature_extraction.py:64`). The exception is coverage generation, where the
+ timestamp is held fixed and the cache is temporarily enlarged.
+
+- **🕰️ Modern take.** The hand-rolled streaming state (eligibility sets, sliding
+ windows, an incremental WCC union-find) is exactly the kind of thing a
+ modern **point-in-time feature store** (Feast, Tecton) or a stream processor
+ (Flink/Kafka Streams) is built to do, with leakage-safety and backfills as
+ first-class features rather than something you enforce by careful ordering and
+ hope. The SQLite-blob feature table would be Parquet/Arrow today, and the
+ ad-hoc content-hash dedup is what a feature store's entity keys give you for
+ free. None of this means the original is wrong (it's a faithful, leakage-safe
+ pipeline), just that you'd reach for off-the-shelf infrastructure now. More in
+ [modernization.md](modernization.md).
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 0000000..2013b4b
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,72 @@
+# Glossary
+
+The code uses a compact, project-specific vocabulary that the variable names
+assume you already know. This is the decoder ring. Read it once and the rest of
+the codebase becomes legible.
+
+## Entities
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **User** | `user_id` | A CaringBridge account. |
+| **Site** | `site_id` | A CaringBridge blog/site. One site can have multiple authors. |
+| **USP** (user/site pair) | `(user_id, site_id)`, often an `np.int64` 2-tuple | The atomic unit the model reasons about. *Not* a user and *not* a site, but a specific person **in their role as an author of a specific site.** The same person authoring two sites is two USPs. Almost every array in the codebase is indexed by USP. |
+| **Journal update** | `journal_oid`, `journal_id` | A blog post. The text of the 3 most recent updates is what gets embedded with RoBERTa. |
+| **Interaction** | row of `ints_df` | Any activity event: a journal update, an *amp* (a "like"), a *comment*, or a *guestbook* entry. The whole pipeline is a replay of the interaction stream in timestamp order. |
+
+## The recommendation framing
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **Initiation** | `is_initiation` | The **first** time a given `user_id` interacts with a given `site_id`. This is the event the system tries to predict/encourage: one author reaching out to another's site for the first time. Initiations define both the training signal and the edges of the interaction graph. |
+| **Source** | `source_user_id`, `source_usp_arr` | The author *making* a recommendation request / initiation: the person we'd email recommendations to. |
+| **Target** | `target_site_id`, `target_usps`, `target_inds` | The site (and its eligible authors) the source *actually* initiated with. This is the ground-truth positive: the thing a good ranking puts near the top. |
+| **Candidate** | `candidate_usp_arr`, `candidate_usp_mat` | The pool of USPs that *could* be recommended at this moment: every active & eligible author. At test time we score the whole pool and check where the target landed. |
+| **Alt** (alternative) | `alt_usp`, `alt_feature_id` | A candidate sampled as a **negative** during training: a USP the source did *not* initiate with. (See [data-pipeline.md](data-pipeline.md) for the sampling.) |
+| **Triple** | `triple` table, `RecContext.triples` | The training unit: `(source, target, alt)`. Stored as two labeled pairs: `(source, target) → 1` and `(source, alt) → 0`. The word "triple" is everywhere; it just means "one positive + one sampled negative, sharing a source." |
+
+## User state (the three words that trip everyone up)
+
+These describe a USP's status *at a given instant in the replay*. They are a
+strict-ish hierarchy and are tracked in `eligibility.py` and
+`recentActivityCounter.py`.
+
+```mermaid
+flowchart LR
+ A["Author writes<br/>journal updates"] --> E["EXISTING<br/>(≥1 update on a site)"]
+ E --> EL["ELIGIBLE<br/>(≥3 updates on a site)<br/>→ can be a rec TARGET"]
+ A2["Any recent interaction<br/>(journal/amp/comment/guestbook)<br/>in the last 7 days"] --> AC["ACTIVE<br/>→ worth scoring right now"]
+ EL -.->|"eligible AND active = a real candidate"| CAND["CANDIDATE pool"]
+ AC -.-> CAND
+```
+
+| Term | Threshold | Why it exists |
+|------|-----------|---------------|
+| **Existing** | Has authored **≥1** journal update on the site. | Needed for the interaction graph: only existing authors can be the recipient of an initiation edge. |
+| **Eligible** | Has authored **≥3** journal updates on the site (`config.journal_update_memory`). | The model needs 3 journal embeddings to build a USP's text features, and recommending a barely-started blog is undesirable. Only eligible USPs can be recommendation **targets**. |
+| **Active** | Had *any* interaction within the activity window (default **1 week**, `config.activity_count_duration_ms`). | We only want to surface authors who are currently writing; a recommendation to a dormant blog is a bad experience. Candidates must be active. |
+
+So the **candidate pool = eligible ∩ active** authors at the recommendation
+timestamp.
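
The thresholds above translate directly into a set intersection. The sketch below is illustrative only: the `candidate_pool` function and its dict inputs are hypothetical, activity is tracked per USP purely for simplicity, and the exclusions applied in the real pipeline (self, coauthors, existing ties) are left out.

```python
from typing import Dict, Set, Tuple

USP = Tuple[int, int]  # (user_id, site_id)

ELIGIBLE_MIN_UPDATES = 3                      # the eligibility bar from the table
ACTIVITY_WINDOW_MS = 7 * 24 * 60 * 60 * 1000  # default one-week activity window


def candidate_pool(update_counts: Dict[USP, int],
                   last_activity_ms: Dict[USP, int],
                   now_ms: int) -> Set[USP]:
    """Candidates are the USPs that are both eligible and active at now_ms."""
    eligible = {usp for usp, n in update_counts.items() if n >= ELIGIBLE_MIN_UPDATES}
    active = {usp for usp, ts in last_activity_ms.items()
              if now_ms - ts <= ACTIVITY_WINDOW_MS}
    return eligible & active


DAY_MS = 24 * 60 * 60 * 1000
now = 100 * DAY_MS
counts = {(1, 10): 5, (2, 20): 1, (3, 30): 3}
last_seen = {(1, 10): now - DAY_MS, (2, 20): now, (3, 30): now - 8 * DAY_MS}
print(candidate_pool(counts, last_seen, now))  # {(1, 10)}
```

Here `(2, 20)` is active but only *existing* (one update), and `(3, 30)` is eligible but dormant, so only `(1, 10)` survives.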
+
+## Process / evaluation terms
+
+| Term | Code | Meaning |
+|------|------|---------|
+| **Generation period** | `generation_start_timestamp` (2014) → `test_generation_start_timestamp` (2021) | The window whose initiations become **training triples**. |
+| **Test period** | `test_generation_start_timestamp` (2021) → `generation_stop_timestamp` (2022) | The window whose initiations become **test contexts** (full candidate sets to rank, no negative sampling). This is a strict **temporal** train/test split: the model is always evaluated on the future. |
+| **RecContext** | `reccontext.RecContext` | One recommendation event, in any of three modes: *train* (emit triples), *test* (emit a full candidate matrix with known target), or *predict* (a full candidate matrix with **no** target, used to actually generate recs to email). |
+| **Coverage** | `coverage.py`, `n_coverage_users` | A diagnostic: across all the recs generated, *which sites ever get surfaced, and how often?* Answers "can the system even reach most eligible authors, or does it pile recs onto a popular few?" At test-period start the pipeline picks ~1000 eligible-active users and generates target-less recs for them purely to measure this. |
+| **Feature crosses (FC)** | comments in `feature_extraction.py` | Pairwise features computed *between* source and candidate (e.g. "are they reciprocal?", "friend-of-friend?"), as opposed to features of a single USP. |
+| **Pointwise** | `get_pointwise_training_triples` | The model is trained pointwise (each `(source, candidate)` pair gets an independent 0/1 label + BCE loss) rather than with a pairwise ranking loss, even though the triple structure is right there. See [modernization.md](modernization.md). |
+
+## Acronyms in the wild
+
+- **SSE**: "Site Suggestion Email," the weekly recommendation email (`cbsend/`).
+- **BPR**: Bayesian Personalized Ranking; referenced aspirationally in
+ `evaluation.py` comments but the deployed model is a plain MLP, not BPR.
+- **RAC**: `RecentActivityCounter`, the sliding-window activity counter.
+- **WCC**: Weakly Connected Component (of the initiation graph), used for the
+ `component_size` network feature.
+- **MSI**: Minnesota Supercomputing Institute, the Slurm cluster everything ran on.
+- **OID**: MongoDB ObjectID; the original data lived in MongoDB.
diff --git a/docs/modeling.md b/docs/modeling.md
new file mode 100644
index 0000000..2e2a848
--- /dev/null
+++ b/docs/modeling.md
@@ -0,0 +1,238 @@
+# Modeling: Features, Models, Training & Evaluation
+
+> Code: `cbrec/modeling/` and `cbrec/modeling/models/`.
+> Hyperparameter sweeps: `cbrec/experiment/`.
+> Upstream data: [data-pipeline.md](data-pipeline.md).
+
+The modeling task is a **pointwise binary classification** framing of ranking:
+score each `(source, candidate)` pair with a number in `[0, 1]`, then rank
+candidates by that score and check where the true target landed.
+
+## The feature vector
+
+Every model consumes the same fixed-length row, assembled at scoring time by
+`reccontext_builder.build_reccontext_X`. With the defaults in `genconfig.py`
+(`user_feature_count=12`, `user_pair_feature_count=3`, `text_feature_count=768`)
+the row is **1563** columns:
+
+```
+n_cols = 2·(user_feature_count + text_feature_count) + user_pair_feature_count
+       = 2·(12 + 768) + 3 = 1563
+```
+
+```mermaid
+flowchart LR
+ subgraph row["one input row (1563 dims)"]
+ S["[0:12]<br/>source<br/>non-text"]
+ C["[12:24]<br/>candidate<br/>non-text"]
+ P["[24:27]<br/>pair<br/>features"]
+ ST["[27:795]<br/>source<br/>RoBERTa (768)"]
+ CT["[795:1563]<br/>candidate<br/>RoBERTa (768)"]
+ end
+```
+
+> ⚠️ **Don't be fooled by the number 27.** It's the offset where the non-text
+> block ends (`12 + 12 + 3`), *not* the count of source features. Source non-text
+> is 12 columns. The `FeatureManager` (`preprocess.py`) lets you address slices
+> symbolically (`get_feature_indices('source', 'text')`, `('*', '~text')`, etc.),
+> which is what makes feature-ablation experiments one config line instead of a
+> code change.
+
+The non-text 12 = 3 network + 8 activity + 1 temporal; the 3 pair features are
+`are_weakly_connected · is_friend_of_friend · is_reciprocal`. See
+[data-pipeline.md](data-pipeline.md#what-a-usp-feature-vector-contains).
+
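The column offsets in the diagram follow mechanically from the three counts. A minimal sketch that derives them is below; the block names (`source_nontext`, `pair`, ...) are hypothetical labels for illustration, not the keys `FeatureManager` actually uses.

```python
from typing import Dict


def row_layout(user_feature_count: int = 12,
               user_pair_feature_count: int = 3,
               text_feature_count: int = 768) -> Dict[str, slice]:
    """Return column slices for each block of the input row, in order."""
    blocks = [
        ("source_nontext", user_feature_count),
        ("candidate_nontext", user_feature_count),
        ("pair", user_pair_feature_count),
        ("source_text", text_feature_count),
        ("candidate_text", text_feature_count),
    ]
    layout: Dict[str, slice] = {}
    start = 0
    for name, width in blocks:
        layout[name] = slice(start, start + width)
        start += width
    return layout


layout = row_layout()
n_cols = layout["candidate_text"].stop  # total row width
```

With the defaults, `layout["pair"]` covers columns 24 to 27 and `n_cols` comes out at 1563, matching the arithmetic above.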
+The two 768-d RoBERTa vectors are mean-pooled over the author's 3 most recent
+journal updates by default (`pool_text_feature=0`; alternatives are max-pool and
+concatenate). They're looked up and joined in by `text_loader.TextLoader`, which
+caches embeddings in an LRU and can pre-load all ~1M journal embeddings into RAM.
+
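To make the three pooling options concrete, here is a small pure-Python sketch over toy 2-d vectors. The string mode names are an assumption for readability; the source only states that `pool_text_feature=0` selects mean-pooling, and the integer codes for the other modes are not given here.

```python
from typing import List, Sequence

Vector = List[float]


def pool_embeddings(embeddings: Sequence[Vector], mode: str = "mean") -> Vector:
    """Combine several equal-length embeddings into one feature vector."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    if mode == "mean":
        return [sum(column) / len(embeddings) for column in zip(*embeddings)]
    if mode == "max":
        return [max(column) for column in zip(*embeddings)]
    if mode == "concat":
        return [value for vector in embeddings for value in vector]
    raise ValueError(f"unknown pooling mode: {mode}")


recent = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]  # toy 2-d stand-ins for 768-d vectors
mean_vec = pool_embeddings(recent, "mean")
max_vec = pool_embeddings(recent, "max")
concat_vec = pool_embeddings(recent, "concat")
```

Mean and max keep the output at the width of a single embedding, while concatenation multiplies it by the number of posts, so the row width computed earlier would no longer apply.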
+## The model zoo
+
+All models are tiny `nn.Module`s selected by `model_config.model_name` in
+`ModelTrainer.create_net()`. They output **raw logits** (no final sigmoid);
+the sigmoid lives in `BCEWithLogitsLoss` and in scoring.
+
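Why keep the sigmoid out of the network? Computing the loss directly from the logit avoids taking `log` of a probability that has rounded to 0 or 1. The sketch below uses a standard numerically stable formulation in plain Python; it illustrates the idea and is not a copy of PyTorch's implementation.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1), avoiding overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy computed directly from a logit.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(x) but stays finite
    for large |x|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit = 2.0
prob = sigmoid(logit)
naive = -math.log(prob)               # loss for a positive label via the probability
stable = bce_with_logits(logit, 1.0)  # same loss via the logit
```

At scoring time only the ordering of candidates matters, and the sigmoid is monotonic, so ranking by logit and ranking by probability give the same order.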
+```mermaid
+flowchart TD
+ IN["input row (1563)"] --> SEL{model_name}
+ SEL --> LN["LinearNet"]
+ SEL --> SN["SimNet"]
+ SEL --> CN["ConcatNet"]
+ SEL --> LSN["LearnedSimNet"]
+ SEL --> IN2["InteractionNet"]
+```
+
+| Model | File | Learns? | Idea | Status |
+|-------|------|:---:|------|--------|
+| **LinearNet** | `linearnet.py` | ✓ | Concatenate the whole row, push through a 2-hidden-layer MLP (ReLU + dropout, no-bias output). The workhorse: this is the **deployed field-study model**. | Primary |
+| **SimNet** | `simnet.py` | ✗ | No parameters: take cosine (or L2) similarity between the source slice and the candidate slice. A pure-similarity baseline; "train" with 0 epochs. | Baseline |
+| **LearnedSimNet** | `learnedsimnet.py` | ✓ | A bi-encoder (DSSM/Sentence-BERT-style): project source and candidate independently, then take similarity of the two embeddings. Optional shared encoder. | Experimental |
+| **ConcatNet** | `concatnet.py` | ✓ | Encode source and candidate separately, concatenate the two encodings, then an MLP head. | Partial / experimental |
+| **InteractionNet** | `interactionnet.py` | ✓ | Form the outer product of source × candidate features (explicit pairwise interactions), flatten, MLP. Expressive but `n_input²`-wide. | Experimental |
+
+`cf.py` is a stub (a citation, no implementation). The shared-feature branches in
+`LearnedSimNet`/`InteractionNet` are present but untested per their comments.
+
+**LinearNet at a glance** (`linearnet.py`):
+
+```mermaid
+flowchart LR
+ X["x (1563)"] --> F1["fc1 → n_hidden"] --> R1["ReLU + dropout"]
+ R1 --> F2["fc2 → n_hidden"] --> R2["ReLU + dropout"]
+ R2 --> F3["fc3 → 1 (no bias)"] --> OUT["logit"]
+```
+
+## The training loop
+
+`ModelManager` (`manager.py`) is the façade: it owns a `FeatureManager`
+(preprocessing) and a `ModelTrainer` (`train.py`), and handles save/load via a
+JSON manifest that points at the `.pt` weights, the pickled scaler, and the
+metrics arrays.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant MG as model_gen.py
+ participant MM as ModelManager
+ participant FM as FeatureManager
+ participant MT as ModelTrainer
+ MG->>MM: ModelConfig + cached X_train_raw, y
+ MM->>MM: 99% / 1% train-validation split
+ MM->>FM: fit_transform(X_train) (drop/encode cols, fit StandardScaler)
+ MM->>FM: transform(X_valid)
+ MM->>MT: train_model(X_train, y, X_valid, y_valid)
+ loop each epoch
+ MT->>MT: shuffle, minibatch, forward
+ MT->>MT: BCEWithLogitsLoss, backward, optimizer.step()
+ opt every 1/validation_rate epochs
+ MT->>MT: validation loss/acc, checkpoint if best
+ end
+ opt train loss below 0.001
+ MT->>MT: early stop
+ end
+ end
+ MM->>MM: save scaler + best .pt + metrics + manifest.json
+```
+
+Notable specifics:
+
+- **Loss / optimizer:** `BCEWithLogitsLoss`, Adam (betas/eps configurable),
+ optional `OneCycleLR` schedule (`train_scheduler_name`).
+- **Labels come in pairs** from the triples: `(source, target)=1`,
+ `(source, alt)=0`.
+- **Best-checkpoint selection** is by *validation loss*, saved as
+ `basename_e{epoch}.pt`; the chosen epoch is recorded as
+ `best_model_description` in the manifest.
+- **Early stopping** is a hard threshold on *training* loss (`< 0.001`), which
+ is crude and unrelated to generalization.
+
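The interaction between the last two bullets is easier to see on a recorded loss history. The sketch below replays per-epoch losses and reports which epoch would be checkpointed and where training would stop. The function name, the `None` convention for skipped validation epochs, and the order of the two checks within an epoch are all assumptions, not the behavior of `train.py`.

```python
from typing import List, Optional, Tuple


def select_checkpoint(train_losses: List[float],
                      valid_losses: List[Optional[float]],
                      stop_threshold: float = 0.001) -> Tuple[Optional[int], int]:
    """Return (best_epoch, last_epoch_run) for a recorded loss history.

    valid_losses[i] is None on epochs where validation was skipped.
    """
    best_epoch: Optional[int] = None
    best_valid = float("inf")
    last_epoch = -1
    for epoch, (train_loss, valid_loss) in enumerate(zip(train_losses, valid_losses)):
        last_epoch = epoch
        if valid_loss is not None and valid_loss < best_valid:
            best_valid = valid_loss
            best_epoch = epoch  # a real loop would save the weights here
        if train_loss < stop_threshold:
            break  # early stop keyed to training loss only
    return best_epoch, last_epoch


train = [0.70, 0.40, 0.10, 0.0005, 0.0001]
valid = [0.69, None, 0.45, 0.52, 0.60]
best_epoch, last_epoch = select_checkpoint(train, valid)
```

In this toy history the best checkpoint is epoch 2, but training continues into epoch 3 while validation loss is already rising, because the stop rule never looks at validation.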
+## The baselines
+
+By the project's "a baseline is anything that needs no training" definition
+(`evaluation.py`), the non-neural comparators are scored directly from the
+replay state:
+
+- **Non-personalized:** `MostRecentJournal`, `MostJournals`,
+ `MostRecentInitiation`, `MostInitiations` β rank candidates purely by their own
+ recency/volume.
+- **Personalized:** `ClosestToStart` (temporal proximity of first journal),
+ `CosineSimilarity` (raw cosine of feature vectors, i.e. what `SimNet`
+ computes).
+
+These were computed via a separate path (`notebook/eval/BaselineCompute.ipynb`)
+and are the "can a dumb heuristic do this?" reference for the learned models.
+
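For intuition, here is a minimal sketch of one baseline from each family. The function names, the dict-based inputs, and the zero-norm fallback are assumptions for illustration; the notebook's actual implementation is not shown in these docs.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(source: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Personalized baseline: order candidate IDs by similarity to the source."""
    return sorted(candidates, key=lambda cid: cosine(source, candidates[cid]), reverse=True)


def rank_by_recency(last_journal_ts: Dict[str, int]) -> List[str]:
    """Non-personalized baseline: newest journal first, ignoring the source."""
    return sorted(last_journal_ts, key=last_journal_ts.get, reverse=True)


source_vec = [1.0, 0.0]
candidate_vecs = {"a": [0.0, 1.0], "b": [2.0, 0.1], "c": [1.0, 1.0]}
cosine_order = rank_by_cosine(source_vec, candidate_vecs)
recency_order = rank_by_recency({"a": 300, "b": 100, "c": 200})
```

The recency ranking is the same for every source, which is what makes it non-personalized; the cosine ranking changes whenever `source_vec` does.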
+## Offline evaluation at scale
+
+Evaluation ranks the full candidate set per test context and records ranking
+metrics into NDJSON. The metrics (`evaluation.py`): `rank_of_target`,
+`reciprocal_rank`, and `ndcg_{1,5,10,50}`, plus per-context bookkeeping
+(`n_usps_scored`, `n_sites_scored`, raw target score). USP scores are reduced to
+**site** scores (an author site can have several eligible authors) before
+ranking. `coverage.py` aggregates, across all contexts, how often each site
+surfaces in the top-N (Welford streaming stats): the system's *reach*.
+
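The per-context metrics named above can be sketched for the single-target case as follows. The metric names mirror the ones listed, but the tie-breaking rule (ties count against the target) and the assumption of exactly one relevant site per context are illustrative choices, not confirmed details of `evaluation.py`.

```python
import math
from typing import Dict, Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """1-based rank of the target; ties are counted against the target."""
    target_score = scores[target_index]
    better_or_tied = sum(1 for i, s in enumerate(scores)
                         if i != target_index and s >= target_score)
    return better_or_tied + 1


def ranking_metrics(scores: Sequence[float], target_index: int,
                    ks: Sequence[int] = (1, 5, 10, 50)) -> Dict[str, float]:
    """Single-relevant-item metrics: with one target, ideal DCG is 1."""
    rank = rank_of_target(scores, target_index)
    metrics: Dict[str, float] = {
        "rank_of_target": float(rank),
        "reciprocal_rank": 1.0 / rank,
    }
    for k in ks:
        metrics[f"ndcg_{k}"] = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return metrics


site_scores = [0.2, 0.9, 0.5, 0.7]  # toy site-level scores for one test context
metrics = ranking_metrics(site_scores, target_index=2)
```

Here the target is outscored by two sites, so it ranks third: `ndcg_1` is 0 and `ndcg_5` is `1 / log2(4)`, i.e. 0.5.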
+The performance problem: a hyperparameter sweep trains hundreds of models, and
+naively each evaluation would re-assemble every feature matrix and re-look-up
+every embedding. The fix is a **two-phase split**:
+
+```mermaid
+flowchart TD
+ subgraph once["build once"]
+ TC["test_contexts in feature.sqlite"] --> GR["generateRecMd.py"]
+ TL["TextLoader.cache_all_journals()"] --> GR
+ GR --> PKL["reccontextN.pkl<br/>(X_test fully assembled)"]
+ end
+ subgraph many["per model in the sweep"]
+ PKL --> EV["evaluateModelFromCacheMp.py<br/>(multiprocessing pool)"]
+ MODEL["trained model"] --> EV
+ EV --> NDJSON["*_test_metadata.ndjson<br/>per-context metrics"]
+ EV --> COV["coverage pickles"]
+ end
+ NDJSON --> NB["Jupyter: compare, pick model"]
+```
+
+Three evaluator variants exist with increasing speed: `evaluateModels.py`
+(streams contexts from the DB), `evaluateModelFromCache.py` (reads the pre-built
+pickles sequentially), and `evaluateModelFromCacheMp.py` (the multiprocessing
+version that overlaps pickle-loading with scoring). `submitEvalFromDirectory.py`
+fans a whole model directory out to Slurm; `--test-only` runs final test
+evaluation on a single chosen model.
+
+A subtlety worth knowing: `build_reccontext_X` **drops candidates with
+insufficient journal text** (fewer than 3 embeddings), and re-indexes
+`target_inds` to compensate. If the *target itself* lacks text, it raises: that
+context is unscorable.
+
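The re-indexing step is easy to get subtly wrong, so a small sketch may help. It assumes candidates are identified by string IDs and that per-candidate embedding counts are available in a dict; `filter_candidates` is a hypothetical name and the real `build_reccontext_X` operates on feature matrices rather than ID lists.

```python
from typing import Dict, List, Sequence, Tuple


def filter_candidates(candidate_ids: Sequence[str],
                      embedding_counts: Dict[str, int],
                      target_inds: Sequence[int],
                      min_embeddings: int = 3) -> Tuple[List[str], List[int]]:
    """Drop candidates with too little text and remap target indices."""
    old_to_new: Dict[int, int] = {}
    kept: List[str] = []
    for old_index, cid in enumerate(candidate_ids):
        if embedding_counts.get(cid, 0) >= min_embeddings:
            old_to_new[old_index] = len(kept)
            kept.append(cid)
    missing = [i for i in target_inds if i not in old_to_new]
    if missing:
        raise ValueError(f"target(s) at index {missing} lack text; context is unscorable")
    return kept, [old_to_new[i] for i in target_inds]


ids = ["a", "b", "c", "d"]
counts = {"a": 5, "b": 1, "c": 3, "d": 0}
kept, new_targets = filter_candidates(ids, counts, target_inds=[2])
```

Candidate `c` starts at index 2 but moves to index 1 once `b` is dropped; forgetting that remap would silently score the wrong row as the target.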
+## Hyperparameter sweeps
+
+`experiment/config_gen.py` is a config-and-script generator, not a runner. An
+`Experiment` writes one `config_{i}.json` per cell of a grid plus the matching
+`sbatch`/shell scripts into `configs/ outputs/ scripts/ sbatch/`. The committed
+examples are the actual paper sweeps, e.g. the field-study grid over
+`train_max_lr × LinearNet_n_hidden × train_weight_decay × LinearNet_dropout_p`
+with 3 replicates (540 configs), feature-ablation grids, and an Adam
+random-search. Each config is trained by `experiment/model_gen.py`, which loads
+the cached `X_train_raw.pkl`/`y_train_raw.pkl`, trains, and saves.
+
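A grid like this expands naturally with `itertools.product`. In the sketch below only the four parameter names, the 3 replicates, and the total of 540 come from the text above; the specific values, the `replicate` key, and the `expand_grid` helper are hypothetical and chosen only so the counts work out (5 × 4 × 3 × 3 cells × 3 replicates).

```python
import itertools
import json
from typing import Dict, List, Sequence


def expand_grid(grid: Dict[str, Sequence], n_replicates: int) -> List[dict]:
    """One config dict per (grid cell, replicate), in a stable order."""
    names = list(grid)
    configs: List[dict] = []
    for values in itertools.product(*(grid[name] for name in names)):
        for replicate in range(n_replicates):
            config = dict(zip(names, values))
            config["replicate"] = replicate
            configs.append(config)
    return configs


# Hypothetical values: only the parameter names and the 540 total come from the docs.
grid = {
    "train_max_lr": [0.1, 0.03, 0.01, 0.003, 0.001],
    "LinearNet_n_hidden": [50, 100, 200, 400],
    "train_weight_decay": [0.0, 0.0001, 0.001],
    "LinearNet_dropout_p": [0.0, 0.1, 0.5],
}
configs = expand_grid(grid, n_replicates=3)
first_config_json = json.dumps(configs[0], sort_keys=True)
```

Each dict would then be serialized to its own `config_{i}.json`; keeping generation separate from execution is what lets the same configs be re-submitted to Slurm without regenerating them.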
+```mermaid
+flowchart LR
+ EXP["Experiment grid<br/>(config_gen.py)"] --> CFGS["N × config.json<br/>+ sbatch scripts"]
+ CFGS --> SLURM["Slurm array"]
+ SLURM --> MGN["model_gen.py ×N"]
+ MGN --> MODELS["N trained models"]
+ MODELS --> EVALDIR["submitEvalFromDirectory.py"]
+```
+
+## Candid notes for the modeler
+
+- **Pointwise BCE on data that screams "pairwise."** The triples are literally
+ `(positive, negative)` per source, yet training optimizes independent 0/1
+ cross-entropy with a 0.5 threshold rather than a ranking loss (BPR/pairwise
+ hinge/softmax-CE over the candidate set). The `evaluation.py` comments even
+ muse about BPR. A pairwise or listwise objective is the most natural upgrade
+ and would align the training objective with the NDCG/MRR you actually report.
+
+- **"Learning the dot product is notoriously hard."** A code comment
+ (`evaluation.py:72`) wonders whether cosine similarity should just be a *shared
+ feature* fed to the model rather than something the net must rediscover. That
+ instinct is correct and underlies `LearnedSimNet`.
+
+- **Two-tower is half-built.** `LearnedSimNet`/`ConcatNet` are the bi-encoder
+ architectures that modern retrieval would actually deploy (precompute candidate
+ embeddings, ANN search), but they're experimental here while the dense
+ cross-feature `LinearNet` is what shipped. With ~1563 features and a small MLP
+ that was the pragmatic choice for a one-off study.
+
+- **Validation is 1% by split, early-stop is a magic number.** Fine for a sweep
+ where you pick the best config post-hoc; not what you'd standardize on.
+
+- **🕰️ Modern take.** RoBERTa-base mean-pooling (2019-era) would today be a
+ stronger sentence-embedding model (e.g. a current `sentence-transformers`
+ bi-encoder or an instruction-tuned embedding model), the bespoke train/eval
+ loop would be Lightning/Hugging Face `Trainer`, the sbatch-script generator
+ would be an experiment tracker + sweep agent (W&B/Optuna), and serving would
+ precompute candidate tower embeddings into an ANN index (FAISS/ScaNN) instead
+ of scoring every pair with a cross-encoder MLP. See
+ [modernization.md](modernization.md).
diff --git a/docs/modernization.md b/docs/modernization.md
new file mode 100644
index 0000000..aa73e64
--- /dev/null
+++ b/docs/modernization.md
@@ -0,0 +1,91 @@
+# Modernization Notes: "What would you do today?"
+
+This file collects, in one place, the candid "if you were starting this in
+2026, you'd probably…" observations that are sprinkled through the other docs.
+
+**Read this in the right spirit.** HealthBlogRec is a 2021 research artifact that
+did its job: it supported a published feasibility study and was re-run for paper
+revisions through 2024. The notes below are *not* a defect list. Many choices
+that look dated were the right call under the actual constraints: a fixed
+dataset, a single Slurm cluster, a paper deadline, and the ML tooling that
+existed at the time. The goal here is to help a future reader (or a future
+revival of this work) calibrate which parts are essential ideas versus
+incidental plumbing.
+
+## At a glance
+
+| Area | What the code does | What you'd reach for today | Why it matters |
+|------|--------------------|-----------------------------|----------------|
+| Feature pipeline | Hand-rolled streaming replay maintaining eligibility/activity/graph state, leakage-safe by careful ordering | A **point-in-time feature store** (Feast, Tecton) or stream processor (Flink, Kafka Streams) | Leakage-safety, backfills, and online/offline parity become first-class instead of invariants you enforce by hand |
+| Storage / handoffs | SQLite blobs, `.ndjson`, pickles passed file-to-file | **Parquet/Arrow** + a real artifact/data registry | Columnar, typed, portable, queryable; pickles are version-fragile and unsafe |
+| Text embeddings | RoBERTa-base, mean-pooled over 3 posts, 32-process batch script | A current `sentence-transformers` / instruction-tuned embedding model; batch inference on GPU with a serving runtime | Better representations; far less bespoke multiprocessing |
+| Model | Dense cross-feature MLP (`LinearNet`), pointwise BCE | Two-tower bi-encoder with a **pairwise/listwise ranking loss**; cross-encoder only for re-ranking a shortlist | Objective matches the NDCG/MRR you report; tower embeddings enable ANN retrieval |
+| Serving / retrieval | Score *every* eligible×active pair with the cross-encoder | Precompute candidate embeddings → **ANN index** (FAISS/ScaNN), retrieve then re-rank | Scales past brute-force pair scoring |
+| Experiments | `config_gen.py` emits hundreds of JSON configs + sbatch scripts | **Experiment tracker + sweep agent** (W&B/MLflow + Optuna/Ray Tune) | Provenance, dashboards, smarter search than grid |
+| Training loop | Hand-written epoch/validation/checkpoint loop | Lightning / HF `Trainer` / Ray Train | Less boilerplate, standard checkpointing, multi-GPU for free |
+| Orchestration | Slurm `.sh` scripts, manual stage ordering, pickled checkpoints | A workflow engine (Airflow/Prefect/Dagster/Ray) with typed stage I/O | Reproducible DAGs, retries, lineage |
+| Config | A Python `Config` object with hard-coded `/home/lana/...` paths | `pydantic`/`hydra` config + env-driven paths | Portability off the original cluster |
+| Reproducibility | A single global RNG seed (`872`), conda env listed in prose | Pinned lockfile (uv/poetry), seeded per-component, containerized | Bit-reproducible re-runs |
+
+## The deeper modeling point
+
+The single highest-leverage change is the **training objective**. The data is
+generated as `(source, target, alt)` triples (a positive and a sampled negative
+that share a source), which is *exactly* the input a pairwise ranking loss (BPR,
+pairwise hinge, or sampled-softmax cross-entropy) wants. Yet the model trains
+with pointwise binary cross-entropy and a 0.5 decision threshold, then is
+evaluated with ranking metrics (NDCG, reciprocal rank). Closing that gap (train
+the thing you measure) is the change most likely to move the numbers, and the
+plumbing to do it (the negatives) is already there. The code's own comments
+gesture at this (`evaluation.py` muses about BPR; a comment notes "learning the
+dot product is notoriously hard").
+
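To show how little would change on the data side, here is a minimal sketch of the BPR objective over the same triples: the loss depends only on the score difference between target and alt for a shared source. This is a plain-Python illustration of the standard formula, not code from the repository, and the function names are hypothetical.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bpr_loss(pos_logits: Sequence[float], neg_logits: Sequence[float]) -> float:
    """Mean of -log(sigmoid(s_pos - s_neg)) over triples."""
    if len(pos_logits) != len(neg_logits) or not pos_logits:
        raise ValueError("need equal-length, non-empty score lists")
    losses = [softplus(neg - pos) for pos, neg in zip(pos_logits, neg_logits)]
    return sum(losses) / len(losses)


well_ordered = bpr_loss([3.0, 2.0], [0.0, -1.0])  # targets outscore alts
mis_ordered = bpr_loss([0.0, -1.0], [3.0, 2.0])   # alts outscore targets
tied = bpr_loss([1.0], [1.0])
```

Unlike pointwise BCE, this loss is indifferent to the absolute scale of the logits: adding a constant to every score for a source leaves it unchanged, so the model is pushed only to order candidates correctly.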
+The second point is **architecture for serving**. The deployed model is a
+cross-encoder: it needs both the source and candidate features in the same
+forward pass, so producing recs means scoring every eligible×active pair. That's
+fine for a study with a bounded candidate pool, but it doesn't scale. The
+half-built `LearnedSimNet`/`ConcatNet` two-tower models point the right way:
+encode candidates independently and offline, build an ANN index, retrieve a
+shortlist, and (optionally) re-rank the shortlist with a heavier cross-encoder.
+This is the standard modern retrieval pattern and the codebase was already
+reaching toward it.
+
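The retrieve-then-re-rank pattern can be sketched in a few lines. Exact dot-product search stands in for the ANN index here, and a lookup table stands in for the cross-encoder; all names and values are hypothetical, and a real system would swap in a library such as FAISS for the `retrieve` step.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(source_emb: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Top-k candidate IDs by dot product (exact search standing in for ANN)."""
    return heapq.nlargest(k, index, key=lambda cid: dot(source_emb, index[cid]))


def retrieve_then_rerank(source_emb: Vector,
                         index: Dict[str, Vector],
                         rerank_score: Callable[[str], float],
                         k: int) -> List[Tuple[str, float]]:
    """Shortlist with the cheap tower score, then order it with a heavier scorer."""
    shortlist = retrieve(source_emb, index, k)
    rescored = [(cid, rerank_score(cid)) for cid in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Candidate embeddings would be computed offline by the candidate tower.
candidate_index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
pair_scores = {"a": 0.2, "b": 0.8, "c": 0.9, "d": 0.1}  # stand-in for a cross-encoder
results = retrieve_then_rerank([1.0, 0.0], candidate_index, pair_scores.__getitem__, k=2)
```

Note the trade-off the toy data exposes: candidate `c` has the highest re-rank score but never reaches the re-ranker because the first stage did not shortlist it. Retrieval recall bounds what the second stage can recover.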
+## Known sharp edges (documented in code)
+
+These are called out at their source in [data-pipeline.md](data-pipeline.md) and
+[modeling.md](modeling.md); summarized here so they're not a surprise:
+
+- The test-time target fallback picks "most recently joined eligible author"
+ when none is active, and the comment admits *"random might literally be
+ better"* (`reccontext.py:165`). Metadata records when this fired
+ (`test_target_usp_adjustment`).
+- Amp (reaction) timestamps are **reconstructed** from a delay distribution, so
+ amp-recency features are approximate (`data.py`).
+- A feature cache is resized mid-run with a `FIXME is this reasonable?`
+ (`triple_generation.py:202`), and the reuse cache "basically doesn't do
+ anything" outside coverage generation (`feature_extraction.py:64`).
+- `replace_config()` warns in its own docstring that it won't do what you expect.
+- Early stopping is a hard-coded train-loss threshold (`< 0.001`), unrelated to
+ generalization.
+
+See also the **"Fixed bugs during deployment"** section of the top-level
+[README](../README.md): a few real bugs (an activity time-elapsed bimodality, a
+pre-2018 comment timezone bug, and a feature-preprocessing bug dropping source
+features on ~12% of post-July-2020 training data) were caught and corrected for
+the offline results in the paper. That section is a good model of the kind of
+honesty research code should carry; these docs try to extend it.
+
+## What aged well
+
+It's worth saying plainly, because it's the reason the repo is worth reading:
+
+- **Leakage discipline.** The streaming replay makes temporal correctness
+ structural, not aspirational. A lot of "modern" pipelines get this *wrong*.
+- **A clean temporal train/test split.** Always predicting the future.
+- **Feature deduplication by content hash**: a genuinely nice space optimization
+ for a combinatorial pair space.
+- **Coverage as a first-class metric.** Asking "can we even reach most eligible
+ authors, or do recs pile onto a popular few?" is exactly the right question for
+ a social-support intervention, and many recommender evaluations omit it.
+- **Honest, self-deprecating comments.** The code tells you where the bodies are
+ buried. That's rarer than it should be.