From bde5ee9fa6718caaa7ac09db4d7e136ee6bc175b Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 13 Jun 2026 02:47:49 +0000 Subject: [PATCH] docs: add diagram-heavy developer guide for the research codebase Add a docs/ folder that documents the system at the architecture level rather than the docstring level: how the pieces fit together, the project-specific vocabulary, the streaming data-generation pipeline, the model zoo and offline evaluation, and candid notes on what modern practice would recommend. - docs/README.md orientation + repo map + reading order - docs/glossary.md USP / initiation / eligible-existing-active / triple / coverage, with a state diagram - docs/architecture.md whole-system overview, entry points, experiment loop, package map (Mermaid) - docs/data-pipeline.md the timestamp-ordered replay that produces training triples and test contexts, plus the async writer and feature dedup - docs/modeling.md 1563-d feature vector, model zoo, training loop, baselines, cached offline evaluation, sweeps - docs/modernization.md consolidated "what would you do today" notes All 14 Mermaid diagrams validated with the mermaid parser. Link the new guide from the top-level README. 
https://claude.ai/code/session_018fRrzqPsGMHL3roZ2E3gVq --- README.md | 15 +++ docs/README.md | 85 +++++++++++++++ docs/architecture.md | 166 +++++++++++++++++++++++++++++ docs/data-pipeline.md | 230 ++++++++++++++++++++++++++++++++++++++++ docs/glossary.md | 72 +++++++++++++ docs/modeling.md | 238 ++++++++++++++++++++++++++++++++++++++++++ docs/modernization.md | 91 ++++++++++++++++ 7 files changed, 897 insertions(+) create mode 100644 docs/README.md create mode 100644 docs/architecture.md create mode 100644 docs/data-pipeline.md create mode 100644 docs/glossary.md create mode 100644 docs/modeling.md create mode 100644 docs/modernization.md diff --git a/README.md b/README.md index d6dfb1f..973e4b6 100644 --- a/README.md +++ b/README.md @@ -7,6 +7,21 @@ HealthBlogRec is a recommendation system for peer health blogs. ![Recommender system overview](/figures/rec_system_overview.png) +## Documentation + +In-depth developer documentation lives in [`docs/`](docs/): a diagram-heavy +guided tour of how the system fits together, written for someone reading the +[paper](https://arxiv.org/abs/2209.04973) and the source side by side. Because +this is research code from 2021, the docs also include candid **🕰️ Modern take** +notes flagging where current practice would steer you differently. 
+ + - [`docs/README.md`](docs/README.md) - start here: orientation and reading order + - [`docs/glossary.md`](docs/glossary.md) - the project's vocabulary (*USP*, *initiation*, *eligible*/*existing*/*active*, *triple*, *coverage*) + - [`docs/architecture.md`](docs/architecture.md) - the whole system on one page, plus entry points and the experiment loop + - [`docs/data-pipeline.md`](docs/data-pipeline.md) - the streaming history-replay that generates training triples and test contexts + - [`docs/modeling.md`](docs/modeling.md) - the feature vector, model zoo, training loop, baselines, and offline evaluation + - [`docs/modernization.md`](docs/modernization.md) - consolidated "what would you do today?" notes + ## Citation and project history If any portion of this project is useful to you, please cite the following paper: diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..8a751e4 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,85 @@ +# HealthBlogRec: Developer Documentation + +This folder is a guided tour of the HealthBlogRec codebase, written for the +reader who has the [arXiv paper](https://arxiv.org/abs/2209.04973) open in one +tab and the source in another, and is trying to understand *how the choices in +the paper map onto the code*. + +It is intentionally heavy on diagrams and light on prose. The code itself has +reasonable docstrings in the places that matter; what was missing (and what +these documents try to provide) is the **bird's-eye view**: how the pieces fit +together, why they're shaped the way they are, and where the sharp edges are. + +> ⚠️ **This is research code from 2021.** It was written to run one study on one +> cluster (the University of Minnesota's MSI) and to be re-run a handful of times +> for paper revisions. It is *not* a library, it is *not* deployable as-is, and +> several decisions that look strange today were reasonable given the deadline, +> the data, and the tooling of the time. 
Throughout these docs, call-outs marked +> **🕰️ Modern take** flag places where current practice would steer you +> somewhere else. They are not bug reports; they are orientation for anyone +> trying to learn from or revive this work. + +## Where to start + +Read in this order: + +1. **[glossary.md](glossary.md)** - the vocabulary. The code is dense with + domain-specific terms (*USP*, *initiation*, *eligible* vs *existing* vs + *active*, *triple*, *coverage*). Nothing else makes sense until these click. +2. **[architecture.md](architecture.md)** - the whole system on one page, plus + the two entry points (`gen.py`, `predict.py`) and how an offline experiment + flows end to end. +3. **[data-pipeline.md](data-pipeline.md)** - the heart of the project: the + streaming "replay history one interaction at a time" simulation in + `cbrec.triple_generation` that turns raw CaringBridge activity into training + triples and test/prediction contexts. +4. **[modeling.md](modeling.md)** - the feature vector, the model zoo + (`LinearNet`, `SimNet`, `ConcatNet`, `LearnedSimNet`, `InteractionNet`), the + training loop, the baselines, and how offline evaluation is run at scale. +5. **[modernization.md](modernization.md)** - a consolidated, candid list of + "what would you do differently today," collected so it doesn't clutter the + reference docs. + +## The one-paragraph summary + +HealthBlogRec recommends *peer health blogs* (CaringBridge sites) to people who +write their own. The core modeling task is framed as: given a **source** author +who is about to reach out to a new site (an **initiation**), can we rank the pool +of **candidate** authors so the one they actually reached out to (the **target**) +floats to the top? The system learns from history by replaying years of +CaringBridge interactions in timestamp order, snapshotting the network/activity +state at each real initiation, and emitting a `(source, target, sampled +non-target)` **triple**. 
Each user/site pair is encoded as network features + +recent-activity features + a RoBERTa embedding of their recent journal updates. +A small MLP scores `(source, candidate)` pairs; the trained model was used to +send weekly recommendation emails during a 2021 field study. + +## Map of the repository + +``` +HealthBlogRec/ +├── src/ +│ ├── gen.py / predict.py # thin entry points → cbrec.triple_generation / cbrec.prediction +│ ├── *.sh # Slurm batch scripts for the MSI cluster +│ ├── cbrec/ # the recommendation package (see data-pipeline.md, modeling.md) +│ │ ├── triple_generation.py # the streaming history-replay simulation (the core) +│ │ ├── reccontext.py # one recommendation event (source → candidates [→ target]) +│ │ ├── feature_extraction.py # USP → feature vector, + async DB writer +│ │ ├── eligibility.py # who counts as existing / eligible +│ │ ├── network.py # the interaction graph (indegree/outdegree/component) +│ │ ├── recentActivityCounter.py / timeAwareDict.py # time-windowed activity counts +│ │ ├── featuredb.py # SQLite storage of feature arrays / triples / contexts +│ │ ├── evaluation.py / coverage.py # baselines, ranking metrics, coverage +│ │ ├── text/ # RoBERTa journal embeddings in SQLite +│ │ ├── modeling/ # the PyTorch models, training, and offline eval +│ │ └── experiment/ # generates hyperparameter-sweep configs + Slurm scripts +│ ├── cbsend/ # composes and sends the weekly recommendation emails +│ └── extract/ # MongoDB BSON dumps → flattened ndJSON +├── notebook/ # analysis & experimentation (outputs/IDs scrubbed) +├── data/ # small sample data artifacts +├── design/ # email mockups +└── figures/ # figures used in the paper +``` + +See the top-level [README](../README.md) for citation, study history, the conda +environment, and the list of bugs caught during deployment. 
diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..b813f80 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,166 @@ +# Architecture Overview + +This is the whole system on one page. For the vocabulary +(*USP*, *initiation*, *eligible*...) see [glossary.md](glossary.md) first. + +## The big picture + +HealthBlogRec turns years of raw CaringBridge activity into a trained model that +ranks peer authors; that model was used to send weekly recommendation emails +during a 2021 field study. There are five loosely-coupled stages, each writing +artifacts to disk that the next stage reads. This file-handoff design is +deliberate: each stage took hours-to-days on the cluster and had to be +independently restartable. + +```mermaid +flowchart TD + subgraph raw["① Raw data (MongoDB → flat files)"] + BSON["MongoDB BSON dumps"] -->|"src/extract/"| FLAT["ints_df.feather
journal_metadata.feather
user_site_df.csv"] + TXT["journalText.sqlite
(raw post text)"] + end + + subgraph gen["② Data generation: cbrec.triple_generation (gen.py)"] + FLAT --> REPLAY["Replay interactions in
timestamp order, maintaining
graph + activity + eligibility state"] + REPLAY -->|"each eligible initiation"| FDB[("feature.sqlite
triples · test_contexts · feature arrays")] + REPLAY --> MDN["metadata.ndjson"] + end + + subgraph text["③ Text features: cbrec.text"] + TXT -->|"RoBERTa, 32 procs"| TFDB[("text_feature.sqlite
768-d embeddings per journal")] + end + + subgraph model["④ Modeling: cbrec.modeling"] + FDB --> TRAIN["train.py / model_gen.py
MLP on (source, candidate) pairs"] + TFDB --> TRAIN + TRAIN --> MODEL["trained model
(.pt + .json manifest)"] + MODEL --> EVAL["offline evaluation
(NDCG, reciprocal rank, coverage)"] + FDB --> EVAL + TFDB --> EVAL + end + + subgraph deploy["⑤ Deployment (the 2021 field study)"] + MODEL --> PRED["cbrec.prediction (predict.py)
score live candidates"] + FLAT --> PRED + PRED --> EMAIL["cbsend/
compose + send weekly SSE emails"] + end +``` + +This diagram is the code version of the paper's +[system overview figure](../figures/rec_system_overview.png). + +## The two entry points + +The `src/` top level is deliberately thin; the real logic is in the `cbrec` +package. Everything is launched through Slurm batch scripts (`src/*.sh`) on the +MSI cluster. + +| Script | Delegates to | Purpose | +|--------|--------------|---------| +| `gen.py` | `cbrec.triple_generation.main()` | Stage ②: generate all training triples and test/predict contexts. ~50 hours. | +| `predict.py` | `cbrec.prediction.main()` | Stage ⑤: load a checkpoint + model and produce live recommendations for a set of users. | + +The supporting `make_text_features_*.sh` scripts drive Stage ③, and the +`cbrec.experiment` + `cbrec.modeling` machinery drives Stage ④ (see +[modeling.md](modeling.md)). + +## How an offline experiment flows end-to-end + +This is the loop a researcher actually ran to compare models for the paper. 
+ +```mermaid +sequenceDiagram + autonumber + participant U as Researcher + participant CG as experiment/config_gen.py + participant SL as Slurm (MSI) + participant MG as modeling/model_gen.py + participant FDB as feature.sqlite + cached X_train + participant EV as modeling/evaluateModelFromCache*.py + participant NB as Jupyter notebook + + U->>CG: define a hyperparameter sweep + CG->>SL: emit one config.json + sbatch script per cell of the grid + loop each config in the sweep + SL->>MG: train one model + MG->>FDB: load cached training matrix + MG-->>SL: write model .pt + .json manifest + end + U->>EV: submitEvalFromDirectory.py over the model dir + loop each trained model + EV->>EV: score every cached RecContext, write *_test_metadata.ndjson + end + EV-->>NB: NDJSON metrics + coverage pickles + NB->>U: compare models, pick the field-study model +``` + +The key performance trick: **RecContexts are pre-built once and cached as +pickles** (`generateRecMd.py`) so that evaluating 500+ models in a sweep doesn't +re-pay the cost of assembling feature matrices and looking up embeddings each +time. See [modeling.md](modeling.md#offline-evaluation-at-scale). + +## Package map + +```mermaid +flowchart LR + subgraph cbrec["cbrec (recommendation core)"] + TG["triple_generation
(orchestrator)"] + RC["reccontext"] + FE["feature_extraction"] + EL["eligibility"] + NW["network"] + RA["recentActivityCounter
timeAwareDict"] + FD["featuredb"] + DA["data"] + EVl["evaluation
coverage"] + TG --> RC --> FE + TG --> EL & NW & RA & DA + FE --> FD + RC --> EVl + end + subgraph txt["cbrec.text"] + TDB["textdb"] --> CTF["createTextFeatureSqlite
(RoBERTa)"] --> EDB["embeddingdb"] + JID["journalid"] + end + subgraph mdl["cbrec.modeling"] + MC["modelconfig"] --> MM["manager"] + MM --> PP["preprocess"] & MT["train"] + MT --> MODELS["models/*
LinearNet · SimNet · ConcatNet
LearnedSimNet · InteractionNet"] + MM --> SC["scorer"] + RCB["reccontext_builder"] --> TL["text_loader"] + end + subgraph exp["cbrec.experiment"] + CGN["config_gen"] --> MGN["model_gen"] + end + cbrec --> mdl + txt --> mdl + exp --> mdl + mdl --> SEND["cbsend (emails)"] +``` + +## Design choices worth understanding up front + +- **Everything is a streaming replay.** The system never "queries the current + state of the database." It replays history one interaction at a time and + computes features from the *online* state at that instant. This is what makes + the features causally honest (no future leakage), and what makes the pipeline + a stateful, checkpoint-heavy, ~50-hour job. See [data-pipeline.md](data-pipeline.md). + +- **Temporal train/test split, not random.** Train = initiations 2014–2020, + test = initiations in 2021. The model is always asked to predict the future. + +- **Feature arrays are content-addressed and deduplicated** in SQLite + (`featuredb.py`), so the hundreds of millions of `(source, candidate)` pairs + collapse to far fewer unique stored vectors. + +- **Text is a separate pipeline on purpose.** RoBERTa embedding of ~1M journals + is expensive and embarrassingly parallel, so it lives in its own stage + (`cbrec.text`, 32 processes) and is joined in only at model-training time. + +- **🕰️ Modern take:** Today most of stages ② to ④ would be a feature store + a + scheduled batch job (or a streaming feature pipeline like Feast/Flink), the + bespoke SQLite + ndjson + pickle handoffs would be Parquet/Arrow, and the + Slurm sbatch-script-generation in `cbrec.experiment` would be an experiment + tracker (Weights & Biases / MLflow) plus a workflow engine + (Airflow/Prefect/Ray). The architecture's *shape* (replay → features → + train → eval → serve) is still exactly what you'd build. See + [modernization.md](modernization.md). 
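
The content-addressed deduplication described in the design choices above can be illustrated with a minimal sketch. This is not the code in `featuredb.py`: the table schema, the SHA-256 hash, the `struct`-packed float64 payload, and the names `open_feature_db` and `store_feature` are all assumptions chosen for illustration. The idea it shows is only that identical vectors map to the same stored row.

```python
import hashlib
import sqlite3
import struct


def open_feature_db(path=":memory:"):
    """Create a toy feature table keyed by a content hash (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        " feature_id INTEGER PRIMARY KEY,"
        " content_hash TEXT UNIQUE NOT NULL,"
        " payload BLOB NOT NULL)"
    )
    return conn


def store_feature(conn, values):
    """Store a float vector at most once; return the feature_id of the stored copy."""
    # Serialize as little-endian float64 so equal vectors give equal bytes.
    payload = struct.pack(f"<{len(values)}d", *values)
    digest = hashlib.sha256(payload).hexdigest()
    row = conn.execute(
        "SELECT feature_id FROM feature WHERE content_hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]  # identical content already stored: reuse its id
    cur = conn.execute(
        "INSERT INTO feature (content_hash, payload) VALUES (?, ?)",
        (digest, payload),
    )
    return cur.lastrowid


conn = open_feature_db()
id_a = store_feature(conn, [3.0, 1.0, 42.0])
id_b = store_feature(conn, [3.0, 1.0, 42.0])  # same content, so same id as id_a
id_c = store_feature(conn, [0.0, 0.0, 1.0])   # new content, so a new row
n_rows = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
```

Rows that reference `feature_id` values then stay small no matter how often the same vector recurs, which is the property the bullet above relies on.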
diff --git a/docs/data-pipeline.md b/docs/data-pipeline.md new file mode 100644 index 0000000..b44c042 --- /dev/null +++ b/docs/data-pipeline.md @@ -0,0 +1,230 @@ +# The Data Generation Pipeline + +> Code: `cbrec/triple_generation.py` and everything it orchestrates. +> Entry point: `src/gen.py`. Vocabulary: [glossary.md](glossary.md). + +This is the most important (and least obvious) part of the project. Everything +the model learns from is produced here. If you understand this file, you +understand HealthBlogRec. + +## The core idea: replay history, one interaction at a time + +The pipeline does **not** query a database for "the current network." Instead it +**replays the entire interaction stream in timestamp order**, and maintains the +graph, activity counters, and eligibility sets *incrementally* as each event +arrives. When it reaches a real initiation, it snapshots the state *as it existed +at that instant* and emits training data. + +Why bother? Because it guarantees **no temporal leakage**: a USP's features at +time *t* can only reflect what had happened before *t*. A model trained on these +features is being asked exactly the question it will face in production ("given +only the past, rank these candidates"), which is why the offline numbers are +trustworthy. + +```mermaid +flowchart TD + START["RecDataGenerator.generate_rec_data()"] --> LOAD["DataManager loads + trims
ints_df (sorted by created_at)"] + LOAD --> LOOP{"for each interaction
in timestamp order"} + LOOP --> JUP["apply any journal updates
that occurred before this ts
(timeAwareDict)"] + JUP --> UPDST["update activity counters
(recentActivityManager)"] + UPDST --> ISINIT{"first time this user
touched this site?
(is_initiation)"} + ISINIT -->|no| UPDACT["record interaction in
activity counter"] + ISINIT -->|yes| PERIOD{"which period?"} + PERIOD -->|"before 2014
(warm-up)"| GRAPH + PERIOD -->|"2014–2020
(generation)"| TRAIN["build RecContext →
generate TRAIN triples"] + PERIOD -->|"2021–2022
(test)"| TEST["build RecContext →
generate TEST context"] + TRAIN --> WRITE["DatabaseWriter (async) →
feature.sqlite + metadata.ndjson"] + TEST --> WRITE + GRAPH["update initiation graph
(add edge source→existing authors)"] --> UPDACT + WRITE --> GRAPH + UPDACT --> LOOP + LOOP -->|"hit generation_stop_timestamp"| DONE["final checkpoint"] +``` + +The loop lives in `generate_rec_data()` (line ~126) and `process_interaction()` +(line ~254). The three time boundaries are config values in `genconfig.py`: +`generation_start_timestamp` (2014), `test_generation_start_timestamp` (2021), +`generation_stop_timestamp` (2022). + +## The stateful machinery + +`RecDataGenerator.__init__` wires together a small set of collaborating state +objects. Each one answers a different question about "the world so far": + +```mermaid +flowchart LR + subgraph state["Online state maintained during replay"] + EL["UserSitePairEligibilityManager
eligibility.py"] + NW["UserGraph + WccGraph
network.py"] + RA["RecentActivityManager
recentActivityCounter.py"] + TAD["TimeAwareDict (journals)
timeAwareDict.py"] + end + EL -.->|"who is existing / eligible?"| Q1 + NW -.->|"indegree, outdegree,
component, reciprocal, FoF?"| Q1 + RA -.->|"who is active? recent
activity counts + recency?"| Q1 + TAD -.->|"which journals happened
before now?"| Q1 + Q1["FeatureGenerator
builds a USP's feature vector"] +``` + +| Object | File | Maintains | Feeds | +|--------|------|-----------|-------| +| `UserSitePairEligibilityManager` | `eligibility.py` | counts of journal updates per USP; the existing/eligible sets; site↔author maps | eligibility checks, candidate pool | +| `UserGraph` (+ `WccGraph`) | `network.py` | the directed initiation graph; weakly-connected components | network features (indegree/outdegree/component/reciprocal/FoF) | +| `RecentActivityManager` (a set of `RecentActivityCounter`s) | `recentActivityCounter.py` | sliding-window counts of journals/amps/comments/guestbooks; first-journal timestamps | activity features, the "active" set | +| `TimeAwareDict` | `timeAwareDict.py` | journal updates indexed by timestamp, consumed once | drives the eligibility/activity updates | + +The activity counters use a neat trick: an `OrderedDict` keyed by timestamp lets +expiry be an O(1) pop from the left as the window slides forward +(`RecentActivityCounter.update_counts`). The `WccGraph` keeps component +membership up to date by merging components whenever an edge bridges two of them. + +## From an initiation to a training triple + +When an eligible initiation occurs during the generation period, +`RecContext.generate_train_features()` produces one or more **triples**: + +```mermaid +flowchart TD + INIT["initiation: source_user → target_site"] --> SRC["source USPs =
all sites where source is eligible
(usually 1)"] + INIT --> TGT["target USPs =
eligible authors on target_site"] + INIT --> CAND["candidate pool =
active ∩ eligible USPs
(minus self, coauthors, existing ties)"] + SRC --> PAIR + TGT --> PAIR + CAND -->|"random negative sample"| ALT["alt USP"] + ALT --> PAIR["TRIPLE = (source, target, alt)"] + PAIR --> POS["(source, target) → label 1"] + PAIR --> NEG["(source, alt) → label 0"] + POS --> FEAT["FeatureGenerator:
12 USP features each
+ 3 pairwise features"] + NEG --> FEAT + FEAT --> SAVE["DatabaseWriter.save_triples()
→ feature.sqlite"] +``` + +Negative sampling here is **uniform random** from the eligible-active candidate +pool. Note that what is stored is *features*, not raw IDs: each triple +row references five `feature_id`s (source, target, alt, source-target pair, +source-alt pair), and identical feature vectors are deduplicated by content hash +(see below). + +## Test contexts are different: rank the whole pool + +In the test period there's no negative sampling. Instead +`generate_test_features()` materializes the **entire candidate set** as dense +matrices so the evaluator can score every candidate and find where the true +target ranked: + +| Array | Shape | Meaning | +|-------|-------|---------| +| `source_usp_arr` | `(X, 2)` | source USPs (usually X=1) | +| `candidate_usp_arr` | `(Y, 2)` | every candidate USP, sorted by `site_id` | +| `target_inds` | `(Z,)` | indices into `candidate_usp_arr` that are the true target | +| `source_usp_mat` | `(X, 12)` | non-text features for sources | +| `candidate_usp_mat` | `(Y, 12)` | non-text features for candidates | +| `user_pair_mat` | `(X·Y, 3)` | pairwise features for every source×candidate combo | + +The text embeddings are **not** stored here; they're joined in later at scoring +time by `reccontext_builder.build_reccontext_X` (see [modeling.md](modeling.md)). +This keeps `feature.sqlite` from ballooning with duplicated 768-d vectors. + +A **predict** context is a test context with no target. It is used to generate the +recs that were actually emailed, and for the ~1000 target-less "coverage" contexts +created at test-period start to measure reach. 
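
To make "find where the true target ranked" concrete, here is a minimal sketch of ranking a scored candidate pool. The function names, the stable tie-breaking, and the choice to report the best-ranked target when `target_inds` has several entries are assumptions for illustration, not necessarily what `evaluation.py` does.

```python
def best_target_rank(scores, target_inds):
    """Return the 1-indexed rank of the highest-scoring target candidate."""
    # Order candidate indices by descending score; sorted() is stable, so
    # tied candidates keep their original (site_id-sorted) order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank_of = {cand: rank for rank, cand in enumerate(order, start=1)}
    return min(rank_of[t] for t in target_inds)


def reciprocal_rank(scores, target_inds):
    """Reciprocal rank of the best-ranked target (1.0 means ranked first)."""
    return 1.0 / best_target_rank(scores, target_inds)


# One score per row of a hypothetical candidate_usp_arr with Y = 4 candidates.
scores = [0.10, 0.80, 0.35, 0.60]
target_inds = [2, 3]  # two eligible authors on the target site
rank = best_target_rank(scores, target_inds)
rr = reciprocal_rank(scores, target_inds)
```

Here candidate 1 scores highest, target candidate 3 comes second, and target candidate 2 comes third, so the best target rank is 2 and the reciprocal rank is 0.5.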
+ +## What a USP feature vector contains + +`FeatureGenerator.generate_user_features` produces a **12-dimensional** non-text +vector per USP, and `generate_user_pair_features` produces **3** pairwise +features: + +``` +USP features (12): + network (3): indegree · outdegree · weakly-connected-component size + activity (8): for each of {journal, amp, comment, guestbook}: + count-in-window · hours-since-most-recent + temporal (1): hours since the author's first journal update + +pair features (3): are_weakly_connected · is_friend_of_friend · is_reciprocal +``` + +At model-training time these are concatenated with two 768-d RoBERTa journal +embeddings (source + candidate) to form the final input vector. The full layout +is documented in [modeling.md](modeling.md#the-feature-vector). + +## Persistence: the async writer and feature dedup + +Feature computation runs in the main process; disk I/O is offloaded to a +separate `WriterProcess` over a queue (`feature_extraction.py`). This lets the +CPU-bound replay keep moving while SQLite writes batch up (commits every +`feature_db_max_commit_size` = 500k inserts). + +```mermaid +flowchart LR + MAIN["main loop
(compute features)"] -->|"queue items:
feature / triple /
test_context / metadata"| Q(["mp.Queue"]) + Q --> WP["WriterProcess"] + WP -->|"batched commits"| SQL[("feature.sqlite")] + WP -->|"append JSON line"| ND["metadata.ndjson"] + MAIN -.->|"content-hash dedup (LruCache of feature_ids)"| MAIN +``` + +`feature.sqlite` holds three logical tables: `feature` (deduplicated numpy +arrays, serialized with a 1-byte dtype prefix), `triple` (training rows +referencing feature_ids), and `test_context` (test/predict rows referencing the +six matrices above). `metadata.ndjson` carries one rich metadata dict per +RecContext (eligibility flags, candidate counts, adjustment reasons), used for +slicing results during analysis. + +## Checkpointing + +Because a full run is ~50 hours, the entire `RecDataGenerator` is pickled at +milestones (`create_checkpoint`): before the generation period, before the test +period, and at the end. `gen.py --from-recent-checkpoint` resumes by replaying +to `current_interaction_ind` and continuing. The `__getstate__`/`__setstate__` +hooks drop and rebuild the (unpicklable) `DataManager`. + +## Sharp edges and candid notes + +These are real comments and behaviors in the code, worth knowing before you +trust or modify the pipeline. + +- **The target-selection fallback is admittedly arbitrary.** At test time, if no + author on the target site is currently *active*, the code forces one in for + evaluation. With multiple eligible authors it picks the one who *joined the + site most recently*, and the comment says so plainly: *"a very arbitrary + selection criterion; random might literally be better"* (`reccontext.py:165`). + The `test_target_usp_adjustment` field in metadata records when this happened + so you can filter those cases. + +- **Debug mode fabricates eligibility.** In `TestConfig`/debug runs, + `DataManager` duplicates a journal row per user so everyone clears the + 3-update eligibility bar (`data.py:50`). 
Great for exercising code paths on + tiny test data; obviously not a real-data behavior. + +- **Amp timestamps are reconstructed.** "Amps" (reactions) weren't reliably + timestamped in the source data, so `trim_data()` adds offsets drawn from a + precomputed delay distribution (`reaction_ms_since_journal.npy`). The features + built on amp recency are therefore approximate by construction. + +- **`replace_config()` is explicitly hacky.** Its own docstring warns it "will + NOT do everything you expect, and most config settings will be unaffected" + (`triple_generation.py:115`). It exists to enable a few specific mutations for + the predict path. + +- **A cache is resized mid-run on a hunch.** At test-period start the feature + array cache is shrunk from 5000 to 10 because test arrays are much larger, + with the comment `FIXME is this reasonable?` (`triple_generation.py:202`). + +- **The reuse cache mostly doesn't fire.** `FeatureGenerator`'s LRU cache keys on + `(user, site, timestamp)`; since timestamps are essentially unique, repeat hits + "basically don't happen"; the author says as much + (`feature_extraction.py:64`). The exception is coverage generation, where the + timestamp is held fixed and the cache is temporarily enlarged. + +- **🕰️ Modern take.** The hand-rolled streaming state (eligibility sets, sliding + windows, an incremental WCC union-find by hand) is exactly the kind of thing a + modern **point-in-time feature store** (Feast, Tecton) or a stream processor + (Flink/Kafka Streams) is built to do, with leakage-safety and backfills as + first-class features rather than something you enforce by careful ordering and + hope. The SQLite-blob feature table would be Parquet/Arrow today, and the + ad-hoc content-hash dedup is what a feature store's entity keys give you for + free. None of this means the original is wrong (it's a faithful, leakage-safe + pipeline), just that you'd reach for off-the-shelf infrastructure now. 
More in + [modernization.md](modernization.md). diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 0000000..2013b4b --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,72 @@ +# Glossary + +The code uses a compact, project-specific vocabulary that the variable names +assume you already know. This is the decoder ring. Read it once and the rest of +the codebase becomes legible. + +## Entities + +| Term | Code | Meaning | +|------|------|---------| +| **User** | `user_id` | A CaringBridge account. | +| **Site** | `site_id` | A CaringBridge blog/site. One site can have multiple authors. | +| **USP** (user/site pair) | `(user_id, site_id)`, often an `np.int64` 2-tuple | The atomic unit the model reasons about. *Not* a user and *not* a site, but a specific person **in their role as an author of a specific site.** The same person authoring two sites is two USPs. Almost every array in the codebase is indexed by USP. | +| **Journal update** | `journal_oid`, `journal_id` | A blog post. The text of the 3 most recent updates is what gets embedded with RoBERTa. | +| **Interaction** | row of `ints_df` | Any activity event: a journal update, an *amp* (a "like"), a *comment*, or a *guestbook* entry. The whole pipeline is a replay of the interaction stream in timestamp order. | + +## The recommendation framing + +| Term | Code | Meaning | +|------|------|---------| +| **Initiation** | `is_initiation` | The **first** time a given `user_id` interacts with a given `site_id`. This is the event the system tries to predict/encourage: one author reaching out to another's site for the first time. Initiations define both the training signal and the edges of the interaction graph. | +| **Source** | `source_user_id`, `source_usp_arr` | The author *making* a recommendation request / initiation: the person we'd email recommendations to. 
| +| **Target** | `target_site_id`, `target_usps`, `target_inds` | The site (and its eligible authors) the source *actually* initiated with. This is the ground-truth positive: the thing a good ranking puts near the top. | +| **Candidate** | `candidate_usp_arr`, `candidate_usp_mat` | The pool of USPs that *could* be recommended at this moment: every active & eligible author. At test time we score the whole pool and check where the target landed. | +| **Alt** (alternative) | `alt_usp`, `alt_feature_id` | A candidate sampled as a **negative** during training: a USP the source did *not* initiate with. (See [data-pipeline.md](data-pipeline.md) for the sampling.) | +| **Triple** | `triple` table, `RecContext.triples` | The training unit: `(source, target, alt)`. Stored as two labeled pairs: `(source, target) → 1` and `(source, alt) → 0`. The word "triple" is everywhere; it just means "one positive + one sampled negative, sharing a source." | + +## User state (the three words that trip everyone up) + +These describe a USP's status *at a given instant in the replay*. They are a +strict-ish hierarchy and are tracked in `eligibility.py` and +`recentActivityCounter.py`. + +```mermaid +flowchart LR + A["Author writes
journal updates"] --> E["EXISTING
(≥1 update on a site)"] --> EL["ELIGIBLE
(≥3 updates on a site)
→ can be a rec TARGET"] + A2["Any recent interaction
(journal/amp/comment/guestbook)
in the last 7 days"] --> AC["ACTIVE
→ worth scoring right now"] + EL -.->|"eligible AND active = a real candidate"| CAND["CANDIDATE pool"] + AC -.-> CAND +``` + +| Term | Threshold | Why it exists | +|------|-----------|---------------| +| **Existing** | Has authored **≥1** journal update on the site. | Needed for the interaction graph: only existing authors can be the recipient of an initiation edge. | +| **Eligible** | Has authored **≥3** journal updates on the site (`config.journal_update_memory`). | The model needs 3 journal embeddings to build a USP's text features, and recommending a barely-started blog is undesirable. Only eligible USPs can be recommendation **targets**. | +| **Active** | Had *any* interaction within the activity window (default **1 week**, `config.activity_count_duration_ms`). | We only want to surface authors who are currently writing; a recommendation to a dormant blog is a bad experience. Candidates must be active. | + +So the **candidate pool = eligible ∩ active** authors at the recommendation +timestamp. + +## Process / evaluation terms + +| Term | Code | Meaning | +|------|------|---------| +| **Generation period** | `generation_start_timestamp` (2014) → `test_generation_start_timestamp` (2021) | The window whose initiations become **training triples**. | +| **Test period** | `test_generation_start_timestamp` (2021) → `generation_stop_timestamp` (2022) | The window whose initiations become **test contexts** (full candidate sets to rank, no negative sampling). This is a strict **temporal** train/test split: the model is always evaluated on the future. | +| **RecContext** | `reccontext.RecContext` | One recommendation event, in any of three modes: *train* (emit triples), *test* (emit a full candidate matrix with known target), or *predict* (a full candidate matrix with **no** target, used to actually generate recs to email). 
| +| **Coverage** | `coverage.py`, `n_coverage_users` | A diagnostic: across all the recs generated, *which sites ever get surfaced, and how often?* Answers "can the system even reach most eligible authors, or does it pile recs onto a popular few?" At test-period start the pipeline picks ~1000 eligible-active users and generates target-less recs for them purely to measure this. | +| **Feature crosses (FC)** | comments in `feature_extraction.py` | Pairwise features computed *between* source and candidate (e.g. "are they reciprocal?", "friend-of-friend?"), as opposed to features of a single USP. | +| **Pointwise** | `get_pointwise_training_triples` | The model is trained pointwise (each `(source, candidate)` pair gets an independent 0/1 label + BCE loss) rather than with a pairwise ranking loss β€” even though the triple structure is right there. See [modernization.md](modernization.md). | + +## Acronyms in the wild + +- **SSE** β€” "Site Suggestion Email," the weekly recommendation email (`cbsend/`). +- **BPR** β€” Bayesian Personalized Ranking; referenced aspirationally in + `evaluation.py` comments but the deployed model is a plain MLP, not BPR. +- **RAC** β€” `RecentActivityCounter`, the sliding-window activity counter. +- **WCC** β€” Weakly Connected Component (of the initiation graph), used for the + `component_size` network feature. +- **MSI** β€” Minnesota Supercomputing Institute, the Slurm cluster everything ran on. +- **OID** β€” MongoDB ObjectID; the original data lived in MongoDB. diff --git a/docs/modeling.md b/docs/modeling.md new file mode 100644 index 0000000..2e2a848 --- /dev/null +++ b/docs/modeling.md @@ -0,0 +1,238 @@ +# Modeling: Features, Models, Training & Evaluation + +> Code: `cbrec/modeling/` and `cbrec/modeling/models/`. +> Hyperparameter sweeps: `cbrec/experiment/`. +> Upstream data: [data-pipeline.md](data-pipeline.md). 
+
+The modeling task is a **pointwise binary classification** framing of ranking:
+score each `(source, candidate)` pair with a number in `[0, 1]`, then rank
+candidates by that score and check where the true target landed.
+
+## The feature vector
+
+Every model consumes the same fixed-length row, assembled at scoring time by
+`reccontext_builder.build_reccontext_X`. With the defaults in `genconfig.py`
+(`user_feature_count=12`, `user_pair_feature_count=3`, `text_feature_count=768`)
+the row is **1563** columns:
+
+```
+n_cols = 2·(user_feature_count + text_feature_count) + user_pair_feature_count
+       = 2·(12 + 768) + 3 = 1563
+```
+
+```mermaid
+flowchart LR
+    subgraph row["one input row (1563 dims)"]
+        S["[0:12]<br/>source<br/>non-text"]
+        C["[12:24]<br/>candidate<br/>non-text"]
+        P["[24:27]<br/>pair<br/>features"]
+        ST["[27:795]<br/>source<br/>RoBERTa (768)"]
+        CT["[795:1563]<br/>candidate<br/>RoBERTa (768)"]
+    end
+```
+
+> ⚠️ **Don't be fooled by the number 27.** It's the offset where the non-text
+> block ends (`12 + 12 + 3`), *not* the count of source features. Source non-text
+> is 12 columns. The `FeatureManager` (`preprocess.py`) lets you address slices
+> symbolically (`get_feature_indices('source', 'text')`, `('*', '~text')`, etc.),
+> which is what makes feature-ablation experiments one config line instead of a
+> code change.
+
+The non-text 12 = 3 network + 8 activity + 1 temporal; the 3 pair features are
+`are_weakly_connected · is_friend_of_friend · is_reciprocal`. See
+[data-pipeline.md](data-pipeline.md#what-a-usp-feature-vector-contains).
+
+The two 768-d RoBERTa vectors are mean-pooled over the author's 3 most recent
+journal updates by default (`pool_text_feature=0`; alternatives are max-pool and
+concatenate). They're looked up and joined in by `text_loader.TextLoader`, which
+caches embeddings in an LRU and can pre-load all ~1M journal embeddings into RAM.
+
+## The model zoo
+
+All models are tiny `nn.Module`s selected by `model_config.model_name` in
+`ModelTrainer.create_net()`. They output **raw logits** (no final sigmoid);
+the sigmoid lives in `BCEWithLogitsLoss` and in scoring.
+
+```mermaid
+flowchart TD
+    IN["input row (1563)"] --> SEL{model_name}
+    SEL --> LN["LinearNet"]
+    SEL --> SN["SimNet"]
+    SEL --> CN["ConcatNet"]
+    SEL --> LSN["LearnedSimNet"]
+    SEL --> IN2["InteractionNet"]
+```
+
+| Model | File | Learns? | Idea | Status |
+|-------|------|:---:|------|--------|
+| **LinearNet** | `linearnet.py` | ✔ | Concatenate the whole row, push through a 2-hidden-layer MLP (ReLU + dropout, no-bias output). The workhorse: this is the **deployed field-study model**. | Primary |
+| **SimNet** | `simnet.py` | ✘ | No parameters: take cosine (or L2) similarity between the source slice and the candidate slice. A pure-similarity baseline; "train" with 0 epochs. | Baseline |
+| **LearnedSimNet** | `learnedsimnet.py` | ✔ | A bi-encoder (DSSM/Sentence-BERT-style): project source and candidate independently, then take similarity of the two embeddings. Optional shared encoder. | Experimental |
+| **ConcatNet** | `concatnet.py` | ✔ | Encode source and candidate separately, concatenate the two encodings, then an MLP head. | Partial / experimental |
+| **InteractionNet** | `interactionnet.py` | ✔ | Form the outer product of source × candidate features (explicit pairwise interactions), flatten, MLP. Expressive but `n_input²`-wide. | Experimental |
+
+`cf.py` is a stub (a citation, no implementation). The shared-feature branches in
+`LearnedSimNet`/`InteractionNet` are present but untested per their comments.
+
+**LinearNet at a glance** (`linearnet.py`):
+
+```mermaid
+flowchart LR
+    X["x (1563)"] --> F1["fc1 → n_hidden"] --> R1["ReLU + dropout"]
+    R1 --> F2["fc2 → n_hidden"] --> R2["ReLU + dropout"]
+    R2 --> F3["fc3 → 1 (no bias)"] --> OUT["logit"]
+```
+
+## The training loop
+
+`ModelManager` (`manager.py`) is the façade: it owns a `FeatureManager`
+(preprocessing) and a `ModelTrainer` (`train.py`), and handles save/load via a
+JSON manifest that points at the `.pt` weights, the pickled scaler, and the
+metrics arrays.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant MG as model_gen.py
+    participant MM as ModelManager
+    participant FM as FeatureManager
+    participant MT as ModelTrainer
+    MG->>MM: ModelConfig + cached X_train_raw, y
+    MM->>MM: 99% / 1% train-validation split
+    MM->>FM: fit_transform(X_train) (drop/encode cols, fit StandardScaler)
+    MM->>FM: transform(X_valid)
+    MM->>MT: train_model(X_train, y, X_valid, y_valid)
+    loop each epoch
+        MT->>MT: shuffle, minibatch, forward
+        MT->>MT: BCEWithLogitsLoss, backward, optimizer.step()
+        opt every 1/validation_rate epochs
+            MT->>MT: validation loss/acc, checkpoint if best
+        end
+        opt train loss below 0.001
+            MT->>MT: early stop
+        end
+    end
+    MM->>MM: save scaler + best .pt + metrics + manifest.json
+```
+
+Notable specifics:
+
+- **Loss / optimizer:** `BCEWithLogitsLoss`, Adam (betas/eps configurable),
+  optional `OneCycleLR` schedule (`train_scheduler_name`).
+- **Labels come in pairs** from the triples: `(source, target)=1`,
+  `(source, alt)=0`.
+- **Best-checkpoint selection** is by *validation loss*, saved as
+  `basename_e{epoch}.pt`; the chosen epoch is recorded as
+  `best_model_description` in the manifest.
+- **Early stopping** is a hard threshold on *training* loss (`< 0.001`): crude,
+  and unrelated to generalization.
+
+## The baselines
+
+By the project's "a baseline is anything that needs no training" definition
+(`evaluation.py`), the non-neural comparators are scored directly from the
+replay state:
+
+- **Non-personalized:** `MostRecentJournal`, `MostJournals`,
+  `MostRecentInitiation`, `MostInitiations`: rank candidates purely by their own
+  recency/volume.
+- **Personalized:** `ClosestToStart` (temporal proximity of first journal),
+  `CosineSimilarity` (raw cosine of feature vectors, i.e. what `SimNet`
+  computes).
+
+These were computed via a separate path (`notebook/eval/BaselineCompute.ipynb`)
+and are the "can a dumb heuristic do this?" reference for the learned models.
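As a rough illustration of the `CosineSimilarity` baseline (and of what `SimNet` is described as computing), the sketch below splits one input row into its source and candidate parts using the column layout from the feature-vector section, scores each candidate by cosine similarity, and reports the 1-based rank and reciprocal rank of the known target. It is a minimal standalone sketch, not the project's code: the names `split_row`, `cosine`, `rank_of_target`, and `score_context` are hypothetical, comparing non-text and text columns together is an assumption, and ties are broken in the target's favor.

```python
import math


def split_row(row, n_user=12, n_pair=3, n_text=768):
    """Split one input row (a list) into (source, candidate) vectors.

    Layout assumed from the guide: [source non-text | candidate non-text |
    pair features | source text | candidate text]. Pair features are ignored.
    """
    text_start = 2 * n_user + n_pair
    source = row[0:n_user] + row[text_start:text_start + n_text]
    candidate = (row[n_user:2 * n_user]
                 + row[text_start + n_text:text_start + 2 * n_text])
    return source, candidate


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(scores, target_index):
    """1-based rank of the target under descending score (assumed tie rule:
    only strictly higher scores push the target down)."""
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def score_context(rows, target_index, **layout):
    """Score every candidate row of one context; return (scores, rank, RR)."""
    scores = [cosine(*split_row(r, **layout)) for r in rows]
    rank = rank_of_target(scores, target_index)
    return scores, rank, 1.0 / rank
```

Calling `score_context(rows, target_index)` with one row per candidate uses the default 1563-column layout; passing smaller `n_user`, `n_pair`, and `n_text` values makes it easy to try on toy rows.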
+
+## Offline evaluation at scale
+
+Evaluation ranks the full candidate set per test context and records ranking
+metrics into NDJSON. The metrics (`evaluation.py`): `rank_of_target`,
+`reciprocal_rank`, and `ndcg_{1,5,10,50}`, plus per-context bookkeeping
+(`n_usps_scored`, `n_sites_scored`, raw target score). USP scores are reduced to
+**site** scores (an author site can have several eligible authors) before
+ranking. `coverage.py` aggregates, across all contexts, how often each site
+surfaces in the top-N (Welford streaming stats): the system's *reach*.
+
+The performance problem: a hyperparameter sweep trains hundreds of models, and
+naively each evaluation would re-assemble every feature matrix and re-look-up
+every embedding. The fix is a **two-phase split**:
+
+```mermaid
+flowchart TD
+    subgraph once["build once"]
+        TC["test_contexts in feature.sqlite"] --> GR["generateRecMd.py"]
+        TL["TextLoader.cache_all_journals()"] --> GR
+        GR --> PKL["reccontextN.pkl<br/>(X_test fully assembled)"]
+    end
+    subgraph many["per model in the sweep"]
+        PKL --> EV["evaluateModelFromCacheMp.py<br/>(multiprocessing pool)"]
+        MODEL["trained model"] --> EV
+        EV --> NDJSON["*_test_metadata.ndjson<br/>per-context metrics"]
+        EV --> COV["coverage pickles"]
+    end
+    NDJSON --> NB["Jupyter: compare, pick model"]
+```
+
+Three evaluator variants exist, in order of increasing speed: `evaluateModels.py`
+(streams contexts from the DB), `evaluateModelFromCache.py` (reads the pre-built
+pickles sequentially), and `evaluateModelFromCacheMp.py` (the multiprocessing
+version that overlaps pickle-loading with scoring). `submitEvalFromDirectory.py`
+fans a whole model directory out to Slurm; `--test-only` runs final test
+evaluation on a single chosen model.
+
+A subtlety worth knowing: `build_reccontext_X` **drops candidates with
+insufficient journal text** (fewer than 3 embeddings), and re-indexes
+`target_inds` to compensate. If the *target itself* lacks text it raises,
+because that context is unscorable.
+
+## Hyperparameter sweeps
+
+`experiment/config_gen.py` is a config-and-script generator, not a runner. An
+`Experiment` writes one `config_{i}.json` per cell of a grid plus the matching
+`sbatch`/shell scripts into `configs/`, `outputs/`, `scripts/`, and `sbatch/`.
+The committed examples are the actual paper sweeps, e.g. the field-study grid over
+`train_max_lr × LinearNet_n_hidden × train_weight_decay × LinearNet_dropout_p`
+with 3 replicates (540 configs), feature-ablation grids, and an Adam
+random-search. Each config is trained by `experiment/model_gen.py`, which loads
+the cached `X_train_raw.pkl`/`y_train_raw.pkl`, trains, and saves.
+
+```mermaid
+flowchart LR
+    EXP["Experiment grid<br/>(config_gen.py)"] --> CFGS["N × config.json<br/>+ sbatch scripts"]
+    CFGS --> SLURM["Slurm array"]
+    SLURM --> MGN["model_gen.py ×N"]
+    MGN --> MODELS["N trained models"]
+    MODELS --> EVALDIR["submitEvalFromDirectory.py"]
+```
+
+## Candid notes for the modeler
+
+- **Pointwise BCE on data that screams "pairwise."** The triples are literally
+  `(positive, negative)` per source, yet training optimizes independent 0/1
+  cross-entropy with a 0.5 threshold rather than a ranking loss (BPR/pairwise
+  hinge/softmax-CE over the candidate set). The `evaluation.py` comments even
+  muse about BPR. A pairwise or listwise objective is the most natural upgrade
+  and would align the training objective with the NDCG/MRR you actually report.
+
+- **"Learning the dot product is notoriously hard."** A code comment
+  (`evaluation.py:72`) wonders whether cosine similarity should just be a *shared
+  feature* fed to the model rather than something the net must rediscover. That
+  instinct is correct and underlies `LearnedSimNet`.
+
+- **Two-tower is half-built.** `LearnedSimNet`/`ConcatNet` are the bi-encoder
+  architectures that modern retrieval would actually deploy (precompute candidate
+  embeddings, ANN search), but they're experimental here while the dense
+  cross-feature `LinearNet` is what shipped. With 1563 features and a small MLP
+  that was the pragmatic choice for a one-off study.
+
+- **Validation is a 1% split; early stopping is a magic number.** Fine for a sweep
+  where you pick the best config post-hoc; not what you'd standardize on.
+
+- **🕰️ Modern take.** RoBERTa-base mean-pooling (2019-era) would today be
+  replaced by a stronger sentence-embedding model (e.g. a current
+  `sentence-transformers` bi-encoder or an instruction-tuned embedding model),
+  the bespoke train/eval loop would be Lightning/Hugging Face `Trainer`, the
+  sbatch-script generator would be an experiment tracker + sweep agent
+  (W&B/Optuna), and serving would precompute candidate tower embeddings into an
+  ANN index (FAISS/ScaNN) instead of scoring every pair with a cross-encoder
+  MLP. See [modernization.md](modernization.md).
diff --git a/docs/modernization.md b/docs/modernization.md
new file mode 100644
index 0000000..aa73e64
--- /dev/null
+++ b/docs/modernization.md
@@ -0,0 +1,91 @@
+# Modernization Notes: "What would you do today?"
+
+This file collects, in one place, the candid "if you were starting this in
+2026, you'd probably..." observations that are sprinkled through the other docs.
+
+**Read this in the right spirit.** HealthBlogRec is a 2021 research artifact that
+did its job: it supported a published feasibility study and was re-run for paper
+revisions through 2024. The notes below are *not* a defect list. Many choices
+that look dated were the right call under the actual constraints: a fixed
+dataset, a single Slurm cluster, a paper deadline, and the ML tooling that
+existed at the time. The goal here is to help a future reader (or a future
+revival of this work) calibrate which parts are essential ideas versus
+incidental plumbing.
+
+## At a glance
+
+| Area | What the code does | What you'd reach for today | Why it matters |
+|------|--------------------|-----------------------------|----------------|
+| Feature pipeline | Hand-rolled streaming replay maintaining eligibility/activity/graph state, leakage-safe by careful ordering | A **point-in-time feature store** (Feast, Tecton) or stream processor (Flink, Kafka Streams) | Leakage-safety, backfills, and online/offline parity become first-class instead of invariants you enforce by hand |
+| Storage / handoffs | SQLite blobs, `.ndjson`, pickles passed file-to-file | **Parquet/Arrow** + a real artifact/data registry | Columnar, typed, portable, queryable; pickles are version-fragile and unsafe |
+| Text embeddings | RoBERTa-base, mean-pooled over 3 posts, 32-process batch script | A current `sentence-transformers` / instruction-tuned embedding model; batch inference on GPU with a serving runtime | Better representations; far less bespoke multiprocessing |
+| Model | Dense cross-feature MLP (`LinearNet`), pointwise BCE | Two-tower bi-encoder with a **pairwise/listwise ranking loss**; cross-encoder only for re-ranking a shortlist | Objective matches the NDCG/MRR you report; tower embeddings enable ANN retrieval |
+| Serving / retrieval | Score *every* eligible×active pair with the cross-encoder | Precompute candidate embeddings → **ANN index** (FAISS/ScaNN), retrieve then re-rank | Scales past brute-force pair scoring |
+| Experiments | `config_gen.py` emits hundreds of JSON configs + sbatch scripts | **Experiment tracker + sweep agent** (W&B/MLflow + Optuna/Ray Tune) | Provenance, dashboards, smarter search than grid |
+| Training loop | Hand-written epoch/validation/checkpoint loop | Lightning / HF `Trainer` / Ray Train | Less boilerplate, standard checkpointing, multi-GPU for free |
+| Orchestration | Slurm `.sh` scripts, manual stage ordering, pickled checkpoints | A workflow engine (Airflow/Prefect/Dagster/Ray) with typed stage I/O | Reproducible DAGs, retries, lineage |
+| Config | A Python `Config` object with hard-coded `/home/lana/...` paths | `pydantic`/`hydra` config + env-driven paths | Portability off the original cluster |
+| Reproducibility | A single global RNG seed (`872`), conda env listed in prose | Pinned lockfile (uv/poetry), seeded per-component, containerized | Bit-reproducible re-runs |
+
+## The deeper modeling point
+
+The single highest-leverage change is the **training objective**. The data is
+generated as `(source, target, alt)` triples (a positive and a sampled negative
+that share a source), which is *exactly* the input a pairwise ranking loss (BPR,
+pairwise hinge, or sampled-softmax cross-entropy) wants. Yet the model trains
+with pointwise binary cross-entropy and a 0.5 decision threshold, then is
+evaluated with ranking metrics (NDCG, reciprocal rank). Closing that gap (train
+the thing you measure) is the change most likely to move the numbers, and the
+plumbing to do it (the negatives) is already there. The code's own comments
+gesture at this (`evaluation.py` muses about BPR; a comment notes "learning the
+dot product is notoriously hard").
+
+The second point is **architecture for serving**. The deployed model is a
+cross-encoder: it needs both the source and candidate features in the same
+forward pass, so producing recs means scoring every eligible×active pair. That's
+fine for a study with a bounded candidate pool, but it doesn't scale. The
+half-built `LearnedSimNet`/`ConcatNet` two-tower models point the right way:
+encode candidates independently and offline, build an ANN index, retrieve a
+shortlist, and (optionally) re-rank the shortlist with a heavier cross-encoder.
+This is the standard modern retrieval pattern and the codebase was already
+reaching toward it.
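To make the objective gap concrete, the sketch below computes both losses for a single `(source, target, alt)` triple, given the two logits a model would assign. It is an illustrative sketch in plain Python, not code from the repository: `softplus`, `pointwise_bce`, and `bpr_loss` are hypothetical helper names, and the BPR form shown is the standard `-log sigmoid(s_target - s_alt)`.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_bce(target_logit, alt_logit):
    """Mean binary cross-entropy with logits over the two rows of a triple:
    (source, target) labelled 1 and (source, alt) labelled 0."""
    loss_pos = softplus(-target_logit)  # -log sigmoid(target_logit)
    loss_neg = softplus(alt_logit)      # -log (1 - sigmoid(alt_logit))
    return 0.5 * (loss_pos + loss_neg)


def bpr_loss(target_logit, alt_logit):
    """Pairwise BPR loss: -log sigmoid(target_logit - alt_logit).
    Depends only on the score gap, i.e. on the ordering being learned."""
    return softplus(-(target_logit - alt_logit))


if __name__ == "__main__":
    # Same score gap, different absolute scores.
    for pair in [(2.0, 1.0), (-3.0, -4.0)]:
        print(pair, pointwise_bce(*pair), bpr_loss(*pair))
```

Shifting both logits by the same constant leaves the pairwise loss unchanged while the pointwise loss moves, which is one way to see that only the pairwise objective is tied purely to how the target ranks against the alternative.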
+
+## Known sharp edges (documented in code)
+
+These are called out at their source in [data-pipeline.md](data-pipeline.md) and
+[modeling.md](modeling.md); summarized here so they're not a surprise:
+
+- The test-time target fallback picks "most recently joined eligible author"
+  when none is active, and the comment admits *"random might literally be
+  better"* (`reccontext.py:165`). Metadata records when this fired
+  (`test_target_usp_adjustment`).
+- Amp (reaction) timestamps are **reconstructed** from a delay distribution, so
+  amp-recency features are approximate (`data.py`).
+- A feature cache is resized mid-run with a `FIXME is this reasonable?`
+  (`triple_generation.py:202`), and the reuse cache "basically doesn't do
+  anything" outside coverage generation (`feature_extraction.py:64`).
+- `replace_config()` warns in its own docstring that it won't do what you expect.
+- Early stopping is a hard-coded train-loss threshold (`< 0.001`), unrelated to
+  generalization.
+
+See also the **"Fixed bugs during deployment"** section of the top-level
+[README](../README.md): a few real bugs (an activity time-elapsed bimodality, a
+pre-2018 comment timezone bug, and a feature-preprocessing bug dropping source
+features on ~12% of post-July-2020 training data) were caught and corrected for
+the offline results in the paper. That section is a good model of the kind of
+honesty research code should carry; these docs try to extend it.
+
+## What aged well
+
+It's worth saying plainly, because it's the reason the repo is worth reading:
+
+- **Leakage discipline.** The streaming replay makes temporal correctness
+  structural, not aspirational. A lot of "modern" pipelines get this *wrong*.
+- **A clean temporal train/test split.** Always predicting the future.
+- **Feature deduplication by content hash**, a genuinely nice space optimization
+  for a combinatorial pair space.
+- **Coverage as a first-class metric.** Asking "can we even reach most eligible
+  authors, or do recs pile onto a popular few?" is exactly the right question for
+  a social-support intervention, and many recommender evaluations omit it.
+- **Honest, self-deprecating comments.** The code tells you where the bodies are
+  buried. That's rarer than it should be.
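For readers who want to try the coverage question on their own rec lists, here is a small sketch of the idea: count how often each site is surfaced, then report what fraction of eligible sites were ever recommended and how much of the total goes to the most-recommended sites. This is a simplified illustration under stated assumptions, not a port of `coverage.py`: the function name `coverage_summary`, its arguments, and the two summary statistics are hypothetical, and it does not reproduce the Welford streaming statistics the project uses.

```python
from collections import Counter


def coverage_summary(rec_lists, eligible_sites, head=1):
    """Summarize reach over a batch of top-N recommendation lists.

    rec_lists:      iterable of lists of site ids (one list per context)
    eligible_sites: set of site ids that could have been recommended
    head:           how many of the most-recommended sites count as "the head"
    """
    counts = Counter(site for recs in rec_lists for site in recs)
    total = sum(counts.values())
    surfaced = set(counts) & set(eligible_sites)
    head_total = sum(c for _, c in counts.most_common(head))
    return {
        # Share of eligible sites that appeared in at least one list.
        "fraction_surfaced": (len(surfaced) / len(eligible_sites)
                              if eligible_sites else 0.0),
        # Share of all recommendation slots taken by the top `head` sites.
        "head_share": head_total / total if total else 0.0,
        "counts": counts,
    }
```

A low `fraction_surfaced` together with a high `head_share` is the "recs pile onto a popular few" pattern the coverage diagnostic is meant to expose.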