15 changes: 15 additions & 0 deletions README.md
@@ -7,6 +7,21 @@ HealthBlogRec is a recommendation system for peer health blogs.

![Recommender system overview](/figures/rec_system_overview.png)

## Documentation

In-depth developer documentation lives in [`docs/`](docs/) — a diagram-heavy
guided tour of how the system fits together, written for someone reading the
[paper](https://arxiv.org/abs/2209.04973) and the source side by side. Because
this is research code from 2021, the docs also include candid **🕰️ Modern take**
notes flagging where current practice would steer you differently.

- [`docs/README.md`](docs/README.md) — start here: orientation and reading order
- [`docs/glossary.md`](docs/glossary.md) — the project's vocabulary (*USP*, *initiation*, *eligible*/*existing*/*active*, *triple*, *coverage*)
- [`docs/architecture.md`](docs/architecture.md) — the whole system on one page, plus entry points and the experiment loop
- [`docs/data-pipeline.md`](docs/data-pipeline.md) — the streaming history-replay that generates training triples and test contexts
- [`docs/modeling.md`](docs/modeling.md) — the feature vector, model zoo, training loop, baselines, and offline evaluation
- [`docs/modernization.md`](docs/modernization.md) — consolidated "what would you do today?" notes

## Citation and project history

If any portion of this project is useful to you, please cite the following paper:
85 changes: 85 additions & 0 deletions docs/README.md
@@ -0,0 +1,85 @@
# HealthBlogRec — Developer Documentation

This folder is a guided tour of the HealthBlogRec codebase, written for the
reader who has the [arXiv paper](https://arxiv.org/abs/2209.04973) open in one
tab and the source in another, and is trying to understand *how the choices in
the paper map onto the code*.

It is intentionally heavy on diagrams and light on prose. The code itself has
reasonable docstrings in the places that matter; what was missing — and what
these documents try to provide — is the **bird's-eye view**: how the pieces fit
together, why they're shaped the way they are, and where the sharp edges are.

> ⚠️ **This is research code from 2021.** It was written to run one study on one
> cluster (the University of Minnesota's MSI) and to be re-run a handful of times
> for paper revisions. It is *not* a library, it is *not* deployable as-is, and
> several decisions that look strange today were reasonable given the deadline,
> the data, and the tooling of the time. Throughout these docs, call-outs marked
> **🕰️ Modern take** flag places where current practice would steer you
> somewhere else. They are not bug reports — they are orientation for anyone
> trying to learn from or revive this work.

## Where to start

Read in this order:

1. **[glossary.md](glossary.md)** — the vocabulary. The code is dense with
domain-specific terms (*USP*, *initiation*, *eligible* vs *existing* vs
*active*, *triple*, *coverage*). Nothing else makes sense until these click.
2. **[architecture.md](architecture.md)** — the whole system on one page, plus
the two entry points (`gen.py`, `predict.py`) and how an offline experiment
flows end to end.
3. **[data-pipeline.md](data-pipeline.md)** — the heart of the project: the
streaming "replay history one interaction at a time" simulation in
`cbrec.triple_generation` that turns raw CaringBridge activity into training
triples and test/prediction contexts.
4. **[modeling.md](modeling.md)** — the feature vector, the model zoo
(`LinearNet`, `SimNet`, `ConcatNet`, `LearnedSimNet`, `InteractionNet`), the
training loop, the baselines, and how offline evaluation is run at scale.
5. **[modernization.md](modernization.md)** — a consolidated, candid list of
"what would you do differently today," collected so it doesn't clutter the
reference docs.

## The one-paragraph summary

HealthBlogRec recommends *peer health blogs* (CaringBridge sites) to people who
write their own. The core modeling task is framed as: given a **source** author
who is about to reach out to a new site (an **initiation**), can we rank the pool
of **candidate** authors so the one they actually reached out to (the **target**)
floats to the top? The system learns from history by replaying years of
CaringBridge interactions in timestamp order, snapshotting the network/activity
state at each real initiation, and emitting a `(source, target, sampled
non-target)` **triple**. Each user/site pair is encoded as network features +
recent-activity features + a RoBERTa embedding of their recent journal updates.
A small MLP scores `(source, candidate)` pairs; the trained model was used to
send weekly recommendation emails during a 2021 field study.
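
The replay-and-emit idea in the summary above can be illustrated with a toy loop. This is a minimal sketch, not the code in `cbrec.triple_generation`: the `Interaction` class, the `replay_triples` function, and the simplified rule "an initiation is the first interaction from a source to a target" are all assumptions made for illustration, and the real eligibility rules are not reproduced.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record type: one timestamped source -> target interaction."""
    timestamp: int
    source: str
    target: str


def replay_triples(interactions, seed=0):
    """Yield (timestamp, source, target, non_target) for each first-time contact.

    State is updated only AFTER a triple is emitted, so each triple sees
    nothing from its own interaction or from the future.
    """
    rng = random.Random(seed)
    seen_pairs = set()   # (source, target) pairs that have already interacted
    known = []           # authors observed so far, in order of first appearance
    known_set = set()
    for it in sorted(interactions, key=lambda i: i.timestamp):
        if (it.source, it.target) not in seen_pairs:
            # Negative candidates come from the state *before* this event.
            pool = [a for a in known if a not in (it.source, it.target)]
            if it.target in known_set and pool:
                yield (it.timestamp, it.source, it.target, rng.choice(pool))
        # Only now fold the interaction into the online state.
        seen_pairs.add((it.source, it.target))
        for author in (it.source, it.target):
            if author not in known_set:
                known_set.add(author)
                known.append(author)
```

The ordering inside the loop is the point of the sketch: emit first, update second. Swapping those two steps would let the sampled non-target depend on information that did not yet exist at the moment of the initiation.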

## Map of the repository

```
HealthBlogRec/
├── src/
│   ├── gen.py / predict.py       # thin entry points → cbrec.triple_generation / cbrec.prediction
│   ├── *.sh                      # Slurm batch scripts for the MSI cluster
│   ├── cbrec/                    # the recommendation package (see data-pipeline.md, modeling.md)
│   │   ├── triple_generation.py  # the streaming history-replay simulation (the core)
│   │   ├── reccontext.py         # one recommendation event (source → candidates [→ target])
│   │   ├── feature_extraction.py # USP → feature vector, + async DB writer
│   │   ├── eligibility.py        # who counts as existing / eligible
│   │   ├── network.py            # the interaction graph (indegree/outdegree/component)
│   │   ├── recentActivityCounter.py / timeAwareDict.py  # time-windowed activity counts
│   │   ├── featuredb.py          # SQLite storage of feature arrays / triples / contexts
│   │   ├── evaluation.py / coverage.py  # baselines, ranking metrics, coverage
│   │   ├── text/                 # RoBERTa journal embeddings in SQLite
│   │   ├── modeling/             # the PyTorch models, training, and offline eval
│   │   └── experiment/           # generates hyperparameter-sweep configs + Slurm scripts
│   ├── cbsend/                   # composes and sends the weekly recommendation emails
│   └── extract/                  # MongoDB BSON dumps → flattened ndJSON
├── notebook/                     # analysis & experimentation (outputs/IDs scrubbed)
├── data/                         # small sample data artifacts
├── design/                       # email mockups
└── figures/                      # figures used in the paper
```

See the top-level [README](../README.md) for citation, study history, the conda
environment, and the list of bugs caught during deployment.
166 changes: 166 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,166 @@
# Architecture Overview

This is the whole system on one page. For the vocabulary
(*USP*, *initiation*, *eligible*…) see [glossary.md](glossary.md) first.

## The big picture

HealthBlogRec turns years of raw CaringBridge activity into a trained model that
ranks peer authors; that model was used to send weekly recommendation emails
during a 2021 field study. There are five loosely coupled stages, each writing
artifacts to disk that the next stage reads. This file-handoff design is
deliberate: each stage took hours to days on the cluster and had to be
independently restartable.
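
One common way to get that kind of restartability is to treat the output file as the "done" marker and to write it atomically. The sketch below illustrates the pattern under stated assumptions; `run_stage` and its arguments are hypothetical names, and the project's own stages may handle restarts and checkpoints differently.

```python
import json
import os
import tempfile
from pathlib import Path


def run_stage(output: Path, build) -> bool:
    """Run build() only if `output` is missing; return True if work was done.

    Records are written to a temporary file in the same directory and then
    renamed into place, so a job killed mid-write never leaves a partial
    artifact that the next stage could mistake for a finished one.
    """
    if output.exists():
        return False  # a previous run already produced this artifact
    output.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for record in build():
                f.write(json.dumps(record) + "\n")  # one JSON object per line
        os.replace(tmp_name, output)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_name)  # discard the partial file, then re-raise
        raise
    return True
```

Re-submitting the same job after a failure then costs nothing for stages whose artifacts already exist, which matters when a single stage runs for many hours.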

```mermaid
flowchart TD
subgraph raw["① Raw data (MongoDB → flat files)"]
BSON["MongoDB BSON dumps"] -->|"src/extract/"| FLAT["ints_df.feather<br/>journal_metadata.feather<br/>user_site_df.csv"]
TXT["journalText.sqlite<br/>(raw post text)"]
end

subgraph gen["② Data generation — cbrec.triple_generation (gen.py)"]
FLAT --> REPLAY["Replay interactions in<br/>timestamp order, maintaining<br/>graph + activity + eligibility state"]
REPLAY -->|"each eligible initiation"| FDB[("feature.sqlite<br/>triples · test_contexts · feature arrays")]
REPLAY --> MDN["metadata.ndjson"]
end

subgraph text["③ Text features — cbrec.text"]
TXT -->|"RoBERTa, 32 procs"| TFDB[("text_feature.sqlite<br/>768-d embeddings per journal")]
end

subgraph model["④ Modeling — cbrec.modeling"]
FDB --> TRAIN["train.py / model_gen.py<br/>MLP on (source, candidate) pairs"]
TFDB --> TRAIN
TRAIN --> MODEL["trained model<br/>(.pt + .json manifest)"]
MODEL --> EVAL["offline evaluation<br/>(NDCG, reciprocal rank, coverage)"]
FDB --> EVAL
TFDB --> EVAL
end

subgraph deploy["⑤ Deployment (the 2021 field study)"]
MODEL --> PRED["cbrec.prediction (predict.py)<br/>score live candidates"]
FLAT --> PRED
PRED --> EMAIL["cbsend/<br/>compose + send weekly SSE emails"]
end
```

This diagram is the code version of the paper's
[system overview figure](../figures/rec_system_overview.png).
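
The offline metrics named in stage ④ have simple closed forms when each test context has exactly one relevant item (the target). The sketch below assumes that setting. The function names are illustrative, ties are broken in the target's favor, and the `coverage` definition here (share of the candidate pool that appears in at least one top-k list) is an assumption; see [glossary.md](glossary.md) for the project's own definition.

```python
import math


def rank_of_target(scores: dict, target: str) -> int:
    """1-based rank of `target` when candidates are sorted by descending score.

    Ties are resolved optimistically: only strictly higher scores outrank it.
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


def reciprocal_rank(rank: int) -> float:
    return 1.0 / rank


def ndcg_single_target(rank: int) -> float:
    """NDCG with one relevant item: DCG = 1/log2(rank + 1), ideal DCG = 1."""
    return 1.0 / math.log2(rank + 1)


def coverage(top_k_lists, n_candidates: int) -> float:
    """Fraction of the candidate pool recommended in at least one top-k list."""
    recommended = set().union(*top_k_lists)
    return len(recommended) / n_candidates
```

For example, a target ranked second among the candidates has a reciprocal rank of 0.5 and a single-target NDCG of `1 / log2(3)`.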

## The two entry points

The `src/` top level is deliberately thin — the real logic is in the `cbrec`
package. Everything is launched through Slurm batch scripts (`src/*.sh`) on the
MSI cluster.

| Script | Delegates to | Purpose |
|--------|--------------|---------|
| `gen.py` | `cbrec.triple_generation.main()` | Stage ②: generate all training triples and test/predict contexts. ~50 hours. |
| `predict.py` | `cbrec.prediction.main()` | Stage ⑤: load a checkpoint + model and produce live recommendations for a set of users. |

The supporting `make_text_features_*.sh` scripts drive Stage ③, and the
`cbrec.experiment` + `cbrec.modeling` machinery drives Stage ④ (see
[modeling.md](modeling.md)).

## How an offline experiment flows end-to-end

This is the loop a researcher actually ran to compare models for the paper.

```mermaid
sequenceDiagram
autonumber
participant U as Researcher
participant CG as experiment/config_gen.py
participant SL as Slurm (MSI)
participant MG as modeling/model_gen.py
participant FDB as feature.sqlite + cached X_train
participant EV as modeling/evaluateModelFromCache*.py
participant NB as Jupyter notebook

U->>CG: define a hyperparameter sweep
CG->>SL: emit one config.json + sbatch script per cell of the grid
loop each config in the sweep
SL->>MG: train one model
MG->>FDB: load cached training matrix
MG-->>SL: write model .pt + .json manifest
end
U->>EV: submitEvalFromDirectory.py over the model dir
loop each trained model
EV->>EV: score every cached RecContext, write *_test_metadata.ndjson
end
EV-->>NB: NDJSON metrics + coverage pickles
NB->>U: compare models, pick the field-study model
```
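
Steps 1 and 2 of the loop above (one config plus one batch script per cell of the grid) can be sketched with `itertools.product`. This is an illustration of the pattern only: `emit_sweep`, the file naming, and the `train_one.py` command are hypothetical and are not taken from `experiment/config_gen.py`.

```python
import itertools
import json
from pathlib import Path


def emit_sweep(grid: dict, out_dir: Path) -> list:
    """Write one config JSON and one sbatch script per grid cell.

    `grid` maps a hyperparameter name to the list of values to try.
    Returns the config paths in the order they were generated.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    keys = sorted(grid)  # fixed key order makes cell numbering reproducible
    config_paths = []
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        config = dict(zip(keys, values))
        config_path = out_dir / f"config_{i:04d}.json"
        config_path.write_text(json.dumps(config, indent=2))
        script_path = out_dir / f"train_{i:04d}.sh"
        script_path.write_text(
            "#!/bin/bash\n"
            f"#SBATCH --job-name=sweep_{i:04d}\n"
            # Hypothetical training command; substitute the real entry point.
            f"python train_one.py --config {config_path}\n"
        )
        config_paths.append(config_path)
    return config_paths
```

Calling `emit_sweep({"lr": [0.1, 0.01], "hidden": [32, 64, 128]}, Path("sweep"))` would produce six config/script pairs, each of which can be submitted independently.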

The key performance trick: **RecContexts are pre-built once and cached as
pickles** (`generateRecMd.py`) so that evaluating 500+ models in a sweep doesn't
re-pay the cost of assembling feature matrices and looking up embeddings each
time. See [modeling.md](modeling.md#offline-evaluation-at-scale).
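
The build-once, reuse-many pattern behind that trick looks roughly like the following. It is a generic sketch rather than the contents of `generateRecMd.py`; `load_or_build` and its arguments are hypothetical names.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path: Path, build):
    """Return the cached object if present; otherwise build it and cache it.

    `build` is any zero-argument callable that does the expensive assembly
    (feature matrices, embedding lookups) exactly once.
    """
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)  # only unpickle files you produced yourself
    contexts = build()
    with cache_path.open("wb") as f:
        pickle.dump(contexts, f, protocol=pickle.HIGHEST_PROTOCOL)
    return contexts
```

Every model in a sweep can then call `load_or_build` with the same path, so the expensive assembly is paid by the first evaluation job only.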

## Package map

```mermaid
flowchart LR
subgraph cbrec["cbrec (recommendation core)"]
TG["triple_generation<br/>(orchestrator)"]
RC["reccontext"]
FE["feature_extraction"]
EL["eligibility"]
NW["network"]
RA["recentActivityCounter<br/>timeAwareDict"]
FD["featuredb"]
DA["data"]
EVl["evaluation<br/>coverage"]
TG --> RC --> FE
TG --> EL & NW & RA & DA
FE --> FD
RC --> EVl
end
subgraph txt["cbrec.text"]
TDB["textdb"] --> CTF["createTextFeatureSqlite<br/>(RoBERTa)"] --> EDB["embeddingdb"]
JID["journalid"]
end
subgraph mdl["cbrec.modeling"]
MC["modelconfig"] --> MM["manager"]
MM --> PP["preprocess"] & MT["train"]
MT --> MODELS["models/*<br/>LinearNet · SimNet · ConcatNet<br/>LearnedSimNet · InteractionNet"]
MM --> SC["scorer"]
RCB["reccontext_builder"] --> TL["text_loader"]
end
subgraph exp["cbrec.experiment"]
CGN["config_gen"] --> MGN["model_gen"]
end
cbrec --> mdl
txt --> mdl
exp --> mdl
mdl --> SEND["cbsend (emails)"]
```

## Design choices worth understanding up front

- **Everything is a streaming replay.** The system never "queries the current
state of the database." It replays history one interaction at a time and
computes features from the *online* state at that instant. This is what makes
the features causally honest (no future leakage) — and what makes the pipeline
a stateful, checkpoint-heavy, ~50-hour job. See [data-pipeline.md](data-pipeline.md).

- **Temporal train/test split, not random.** Train = initiations 2014–2020,
test = initiations in 2021. The model is always asked to predict the future.

- **Feature arrays are content-addressed and deduplicated** in SQLite
(`featuredb.py`), so the hundreds of millions of `(source, candidate)` pairs
collapse to far fewer unique stored vectors.

- **Text is a separate pipeline on purpose.** RoBERTa embedding of ~1M journals
is expensive and embarrassingly parallel, so it lives in its own stage
(`cbrec.text`, 32 processes) and is joined in only at model-training time.

- **🕰️ Modern take:** Today most of stages ②–④ would be a feature store such as
Feast fed by a scheduled batch job (or by a streaming pipeline such as Flink),
the bespoke SQLite + ndjson + pickle handoffs would be Parquet/Arrow, and the
Slurm sbatch-script generation in `cbrec.experiment` would be an experiment
tracker (Weights & Biases / MLflow) plus a workflow or distributed-execution
engine (Airflow/Prefect/Ray). The architecture's *shape* (replay → features →
train → eval → serve) is still exactly what you'd build. See
[modernization.md](modernization.md).
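
To make the "content-addressed and deduplicated" bullet above concrete, here is a minimal sketch of the idea using only the standard library. The table name, the SHA-256 key, and the float32 encoding are assumptions chosen for illustration; the actual schema lives in `featuredb.py` and may differ.

```python
import hashlib
import sqlite3
from array import array


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store keyed by the hash of each vector's bytes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        "digest TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, vector) -> str:
    """Store `vector` once, however often it is inserted; return its key."""
    blob = array("f", vector).tobytes()        # float32 encoding (assumed)
    digest = hashlib.sha256(blob).hexdigest()  # the content address
    conn.execute(
        "INSERT OR IGNORE INTO feature (digest, data) VALUES (?, ?)",
        (digest, blob),
    )
    return digest


def get(conn: sqlite3.Connection, digest: str) -> list:
    """Load the vector stored under `digest`."""
    row = conn.execute(
        "SELECT data FROM feature WHERE digest = ?", (digest,)
    ).fetchone()
    values = array("f")
    values.frombytes(row[0])
    return list(values)
```

Triples and contexts can then reference vectors by digest, so a feature vector shared by many `(source, candidate)` pairs occupies one row rather than one row per pair.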