levon003 · levon003 · Jun 13, 2026
diff --git a/README.md b/README.md
@@ -7,6 +7,21 @@ HealthBlogRec is a recommendation system for peer health blogs.
 
 ![Recommender system overview](/figures/rec_system_overview.png)
 
+## Documentation
+
+In-depth developer documentation lives in [`docs/`](docs/) — a diagram-heavy
+guided tour of how the system fits together, written for someone reading the
+[paper](https://arxiv.org/abs/2209.04973) and the source side by side. Because
+this is research code from 2021, the docs also include candid **🕰️ Modern take**
+notes flagging where current practice would steer you differently.
+
+ - [`docs/README.md`](docs/README.md) — start here: orientation and reading order
+ - [`docs/glossary.md`](docs/glossary.md) — the project's vocabulary (*USP*, *initiation*, *eligible*/*existing*/*active*, *triple*, *coverage*)
+ - [`docs/architecture.md`](docs/architecture.md) — the whole system on one page, plus entry points and the experiment loop
+ - [`docs/data-pipeline.md`](docs/data-pipeline.md) — the streaming history-replay that generates training triples and test contexts
+ - [`docs/modeling.md`](docs/modeling.md) — the feature vector, model zoo, training loop, baselines, and offline evaluation
+ - [`docs/modernization.md`](docs/modernization.md) — consolidated "what would you do today?" notes
+
 ## Citation and project history
 
 If any portion of this project is useful to you, please cite the following paper: 

diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,85 @@
+# HealthBlogRec — Developer Documentation
+
+This folder is a guided tour of the HealthBlogRec codebase, written for the
+reader who has the [arXiv paper](https://arxiv.org/abs/2209.04973) open in one
+tab and the source in another, and is trying to understand *how the choices in
+the paper map onto the code*.
+
+It is intentionally heavy on diagrams and light on prose. The code itself has
+reasonable docstrings in the places that matter; what was missing — and what
+these documents try to provide — is the **bird's-eye view**: how the pieces fit
+together, why they're shaped the way they are, and where the sharp edges are.
+
+> ⚠️ **This is research code from 2021.** It was written to run one study on one
+> cluster (the University of Minnesota's MSI) and to be re-run a handful of times
+> for paper revisions. It is *not* a library, it is *not* deployable as-is, and
+> several decisions that look strange today were reasonable given the deadline,
+> the data, and the tooling of the time. Throughout these docs, call-outs marked
+> **🕰️ Modern take** flag places where current practice would steer you
+> somewhere else. They are not bug reports — they are orientation for anyone
+> trying to learn from or revive this work.
+
+## Where to start
+
+Read in this order:
+
+1. **[glossary.md](glossary.md)** — the vocabulary. The code is dense with
+   domain-specific terms (*USP*, *initiation*, *eligible* vs *existing* vs
+   *active*, *triple*, *coverage*). Nothing else makes sense until these click.
+2. **[architecture.md](architecture.md)** — the whole system on one page, plus
+   the two entry points (`gen.py`, `predict.py`) and how an offline experiment
+   flows end to end.
+3. **[data-pipeline.md](data-pipeline.md)** — the heart of the project: the
+   streaming "replay history one interaction at a time" simulation in
+   `cbrec.triple_generation` that turns raw CaringBridge activity into training
+   triples and test/prediction contexts.
+4. **[modeling.md](modeling.md)** — the feature vector, the model zoo
+   (`LinearNet`, `SimNet`, `ConcatNet`, `LearnedSimNet`, `InteractionNet`), the
+   training loop, the baselines, and how offline evaluation is run at scale.
+5. **[modernization.md](modernization.md)** — a consolidated, candid list of
+   "what would you do differently today," collected so it doesn't clutter the
+   reference docs.
+
+## The one-paragraph summary
+
+HealthBlogRec recommends *peer health blogs* (CaringBridge sites) to people who
+write their own. The core modeling task is framed as: given a **source** author
+who is about to reach out to a new site (an **initiation**), can we rank the pool
+of **candidate** authors so the one they actually reached out to (the **target**)
+floats to the top? The system learns from history by replaying years of
+CaringBridge interactions in timestamp order, snapshotting the network/activity
+state at each real initiation, and emitting a `(source, target, sampled
+non-target)` **triple**. Each user/site pair is encoded as network features +
+recent-activity features + a RoBERTa embedding of their recent journal updates.
+A small MLP scores `(source, candidate)` pairs; the trained model was used to
+send weekly recommendation emails during a 2021 field study.
+
+## Map of the repository
+
+```
+HealthBlogRec/
+├── src/
+│   ├── gen.py / predict.py        # thin entry points → cbrec.triple_generation / cbrec.prediction
+│   ├── *.sh                       # Slurm batch scripts for the MSI cluster
+│   ├── cbrec/                     # the recommendation package (see data-pipeline.md, modeling.md)
+│   │   ├── triple_generation.py   # the streaming history-replay simulation (the core)
+│   │   ├── reccontext.py          # one recommendation event (source → candidates [→ target])
+│   │   ├── feature_extraction.py  # USP → feature vector, + async DB writer
+│   │   ├── eligibility.py         # who counts as existing / eligible
+│   │   ├── network.py             # the interaction graph (indegree/outdegree/component)
+│   │   ├── recentActivityCounter.py / timeAwareDict.py  # time-windowed activity counts
+│   │   ├── featuredb.py           # SQLite storage of feature arrays / triples / contexts
+│   │   ├── evaluation.py / coverage.py  # baselines, ranking metrics, coverage
+│   │   ├── text/                  # RoBERTa journal embeddings in SQLite
+│   │   ├── modeling/              # the PyTorch models, training, and offline eval
+│   │   └── experiment/            # generates hyperparameter-sweep configs + Slurm scripts
+│   ├── cbsend/                    # composes and sends the weekly recommendation emails
+│   └── extract/                   # MongoDB BSON dumps → flattened ndJSON
+├── notebook/                      # analysis & experimentation (outputs/IDs scrubbed)
+├── data/                          # small sample data artifacts
+├── design/                        # email mockups
+└── figures/                       # figures used in the paper
+```
+
+See the top-level [README](../README.md) for citation, study history, the conda
+environment, and the list of bugs caught during deployment.
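
The replay described in the one-paragraph summary of `docs/README.md` above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the code in `cbrec.triple_generation`: the `Interaction` record, the `replay` function, the rule that the first contact between a pair counts as an initiation, and uniform sampling of the non-target are all hypothetical simplifications. The real pipeline also applies eligibility rules and extracts features, neither of which is modeled here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    timestamp: int  # hypothetical unit, e.g. seconds since epoch
    source: str     # author who reaches out
    target: str     # author whose site receives the interaction


def replay(interactions, seed=0):
    """Replay interactions in timestamp order, emitting one triple per initiation.

    Toy rule (an assumption): the first interaction from a source to an
    already-known target is an initiation. State is updated only AFTER the
    triple is emitted, so every triple is built from the past alone.
    """
    rng = random.Random(seed)
    seen_pairs = set()     # (source, target) pairs that have already interacted
    known_authors = set()  # authors observed so far: the candidate pool
    triples = []
    for it in sorted(interactions, key=lambda i: i.timestamp):
        pair = (it.source, it.target)
        if pair not in seen_pairs and it.target in known_authors:
            # Candidates are authors known before this instant, minus source/target.
            pool = sorted(known_authors - {it.source, it.target})
            if pool:
                non_target = rng.choice(pool)
                triples.append((it.timestamp, it.source, it.target, non_target))
        # Only now does the state advance to include this interaction.
        seen_pairs.add(pair)
        known_authors.update(pair)
    return triples


history = [
    Interaction(3, "a", "b"),
    Interaction(1, "b", "c"),
    Interaction(2, "c", "d"),
    Interaction(4, "a", "b"),  # repeat contact, not an initiation
    Interaction(5, "a", "d"),
]
triples = replay(history)
```

Updating the state after emitting the triple is the part that matters: it is what keeps the snapshot free of information from the interaction being predicted.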
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -0,0 +1,166 @@
+# Architecture Overview
+
+This is the whole system on one page. For the vocabulary
+(*USP*, *initiation*, *eligible*…) see [glossary.md](glossary.md) first.
+
+## The big picture
+
+HealthBlogRec turns years of raw CaringBridge activity into a trained model that
+ranks peer authors, and used that model to send weekly recommendation emails
+during a 2021 field study. There are five loosely coupled stages, each writing
+artifacts to disk that later stages read. This file-handoff design is
+deliberate: each stage took hours to days on the cluster and had to be
+independently restartable.
+
+```mermaid
+flowchart TD
+    subgraph raw["① Raw data (MongoDB → flat files)"]
+        BSON["MongoDB BSON dumps"] -->|"src/extract/"| FLAT["ints_df.feather<br/>journal_metadata.feather<br/>user_site_df.csv"]
+        TXT["journalText.sqlite<br/>(raw post text)"]
+    end
+
+    subgraph gen["② Data generation — cbrec.triple_generation (gen.py)"]
+        FLAT --> REPLAY["Replay interactions in<br/>timestamp order, maintaining<br/>graph + activity + eligibility state"]
+        REPLAY -->|"each eligible initiation"| FDB[("feature.sqlite<br/>triples · test_contexts · feature arrays")]
+        REPLAY --> MDN["metadata.ndjson"]
+    end
+
+    subgraph text["③ Text features — cbrec.text"]
+        TXT -->|"RoBERTa, 32 procs"| TFDB[("text_feature.sqlite<br/>768-d embeddings per journal")]
+    end
+
+    subgraph model["④ Modeling — cbrec.modeling"]
+        FDB --> TRAIN["train.py / model_gen.py<br/>MLP on (source, candidate) pairs"]
+        TFDB --> TRAIN
+        TRAIN --> MODEL["trained model<br/>(.pt + .json manifest)"]
+        MODEL --> EVAL["offline evaluation<br/>(NDCG, reciprocal rank, coverage)"]
+        FDB --> EVAL
+        TFDB --> EVAL
+    end
+
+    subgraph deploy["⑤ Deployment (the 2021 field study)"]
+        MODEL --> PRED["cbrec.prediction (predict.py)<br/>score live candidates"]
+        FLAT --> PRED
+        PRED --> EMAIL["cbsend/<br/>compose + send weekly SSE emails"]
+    end
+```
+
+This diagram is the code-level counterpart of the paper's
+[system overview figure](../figures/rec_system_overview.png).
+
+## The two entry points
+
+The `src/` top level is deliberately thin — the real logic is in the `cbrec`
+package. Everything is launched through Slurm batch scripts (`src/*.sh`) on the
+MSI cluster.
+
+| Script | Delegates to | Purpose |
+|--------|--------------|---------|
+| `gen.py` | `cbrec.triple_generation.main()` | Stage ②: generate all training triples and test/predict contexts. ~50 hours. |
+| `predict.py` | `cbrec.prediction.main()` | Stage ⑤: load a checkpoint + model and produce live recommendations for a set of users. |
+
+The supporting `make_text_features_*.sh` scripts drive Stage ③, and the
+`cbrec.experiment` + `cbrec.modeling` machinery drives Stage ④ (see
+[modeling.md](modeling.md)).
+
+## How an offline experiment flows end-to-end
+
+This is the loop a researcher actually ran to compare models for the paper.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant U as Researcher
+    participant CG as experiment/config_gen.py
+    participant SL as Slurm (MSI)
+    participant MG as modeling/model_gen.py
+    participant FDB as feature.sqlite + cached X_train
+    participant EV as modeling/evaluateModelFromCache*.py
+    participant NB as Jupyter notebook
+
+    U->>CG: define a hyperparameter sweep
+    CG->>SL: emit one config.json + sbatch script per cell of the grid
+    loop each config in the sweep
+        SL->>MG: train one model
+        MG->>FDB: load cached training matrix
+        MG-->>SL: write model .pt + .json manifest
+    end
+    U->>EV: submitEvalFromDirectory.py over the model dir
+    loop each trained model
+        EV->>EV: score every cached RecContext, write *_test_metadata.ndjson
+    end
+    EV-->>NB: NDJSON metrics + coverage pickles
+    NB->>U: compare models, pick the field-study model
+```
+
+The key performance trick: **RecContexts are pre-built once and cached as
+pickles** (`generateRecMd.py`) so that evaluating 500+ models in a sweep doesn't
+re-pay the cost of assembling feature matrices and looking up embeddings each
+time. See [modeling.md](modeling.md#offline-evaluation-at-scale).
+
+## Package map
+
+```mermaid
+flowchart LR
+    subgraph cbrec["cbrec (recommendation core)"]
+        TG["triple_generation<br/>(orchestrator)"]
+        RC["reccontext"]
+        FE["feature_extraction"]
+        EL["eligibility"]
+        NW["network"]
+        RA["recentActivityCounter<br/>timeAwareDict"]
+        FD["featuredb"]
+        DA["data"]
+        EVl["evaluation<br/>coverage"]
+        TG --> RC --> FE
+        TG --> EL & NW & RA & DA
+        FE --> FD
+        RC --> EVl
+    end
+    subgraph txt["cbrec.text"]
+        TDB["textdb"] --> CTF["createTextFeatureSqlite<br/>(RoBERTa)"] --> EDB["embeddingdb"]
+        JID["journalid"]
+    end
+    subgraph mdl["cbrec.modeling"]
+        MC["modelconfig"] --> MM["manager"]
+        MM --> PP["preprocess"] & MT["train"]
+        MT --> MODELS["models/*<br/>LinearNet · SimNet · ConcatNet<br/>LearnedSimNet · InteractionNet"]
+        MM --> SC["scorer"]
+        RCB["reccontext_builder"] --> TL["text_loader"]
+    end
+    subgraph exp["cbrec.experiment"]
+        CGN["config_gen"] --> MGN["model_gen"]
+    end
+    cbrec --> mdl
+    txt --> mdl
+    exp --> mdl
+    mdl --> SEND["cbsend (emails)"]
+```
+
+## Design choices worth understanding up front
+
+- **Everything is a streaming replay.** The system never "queries the current
+  state of the database." It replays history one interaction at a time and
+  computes features from the *online* state at that instant. This is what makes
+  the features causally honest (no future leakage) — and what makes the pipeline
+  a stateful, checkpoint-heavy, ~50-hour job. See [data-pipeline.md](data-pipeline.md).
+
+- **Temporal train/test split, not random.** Train = initiations 2014–2020,
+  test = initiations in 2021. The model is always asked to predict the future.
+
+- **Feature arrays are content-addressed and deduplicated** in SQLite
+  (`featuredb.py`), so the hundreds of millions of `(source, candidate)` pairs
+  collapse to far fewer unique stored vectors.
+
+- **Text is a separate pipeline on purpose.** RoBERTa embedding of ~1M journals
+  is expensive and embarrassingly parallel, so it lives in its own stage
+  (`cbrec.text`, 32 processes) and is joined in only at model-training time.
+
+- **🕰️ Modern take:** Today most of stages ②–④ would be a feature store (such as
+  Feast) fed by a scheduled batch job or a streaming pipeline (such as Flink),
+  the bespoke SQLite + ndjson + pickle handoffs would be Parquet/Arrow, and the
+  Slurm sbatch-script generation in `cbrec.experiment` would be an experiment
+  tracker (Weights & Biases / MLflow) plus a workflow engine
+  (Airflow/Prefect/Ray). The architecture's *shape* (replay → features →
+  train → eval → serve) is still exactly what you'd build. See
+  [modernization.md](modernization.md).
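
The "content-addressed and deduplicated" bullet in `docs/architecture.md` above can be made concrete with a small sketch. Everything here is an assumption for illustration: the table name `feature_array`, its columns, the SHA-256 key, and the float32 encoding are hypothetical and are not taken from `featuredb.py`. The idea it shows is only that keying a row by the hash of its bytes makes repeated vectors collapse into one stored row.

```python
import hashlib
import sqlite3
from array import array


def open_store(path=":memory:"):
    """Open a SQLite database with a table keyed by content hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_array ("
        " digest TEXT PRIMARY KEY,"
        " payload BLOB NOT NULL)"
    )
    return conn


def put_features(conn, values):
    """Store a float vector once, keyed by the hash of its bytes; return the key."""
    payload = array("f", values).tobytes()  # float32, native byte order
    digest = hashlib.sha256(payload).hexdigest()
    # INSERT OR IGNORE leaves an existing row with the same digest untouched.
    conn.execute(
        "INSERT OR IGNORE INTO feature_array (digest, payload) VALUES (?, ?)",
        (digest, payload),
    )
    return digest


def get_features(conn, digest):
    """Load a stored vector back by its content key."""
    row = conn.execute(
        "SELECT payload FROM feature_array WHERE digest = ?", (digest,)
    ).fetchone()
    if row is None:
        raise KeyError(digest)
    vec = array("f")
    vec.frombytes(row[0])
    return list(vec)


conn = open_store()
k1 = put_features(conn, [1.0, 0.5, 0.0])
k2 = put_features(conn, [1.0, 0.5, 0.0])  # same content: same key, no new row
k3 = put_features(conn, [2.0, 0.5, 0.0])
row_count = conn.execute("SELECT COUNT(*) FROM feature_array").fetchone()[0]
```

With this layout a triple or context only needs to store the keys of its vectors, so a vector shared by many `(source, candidate)` pairs is written once.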