A practical fraud detection workflow with time-aware evaluation, probability calibration, and explicit threshold policies for controlling false positives vs false negatives.
- Case study: `CASE_STUDY.md`
- Notebook: `Credit-Card-Fraud-Detection-A-Pipeline-Journey.ipynb`
- Exported artifacts: `./artifacts/` (models + thresholds + run metadata)
- Scoring script: `scripts/score_csv.py` (optional)
Source: Kaggle "Credit Card Fraud Detection" dataset (`creditcard.csv`).
Expected columns: `Time`, `V1`…`V28`, `Amount`, `Class` (0 = normal, 1 = fraud).
- Download the dataset CSV.
- Place it at `data/raw/creditcard.csv`.

The notebook also supports the common Kaggle input path `/kaggle/input/creditcardfraud/creditcard.csv`.
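Once the CSV is in place, a quick sanity check can confirm the schema before running the notebook. This is a minimal sketch, not part of the project; the column list comes from the dataset description above:

```python
import pandas as pd

# Expected schema for the Kaggle credit-card fraud dataset.
EXPECTED = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]

def check_schema(path: str = "data/raw/creditcard.csv") -> pd.DataFrame:
    """Load the CSV and fail fast if any expected column is missing."""
    df = pd.read_csv(path)
    missing = [c for c in EXPECTED if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Class must be binary: 0 = normal, 1 = fraud.
    assert set(df["Class"].unique()) <= {0, 1}
    return df
```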
python -m venv .venv
# Windows: .\.venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt

Open and run:
`Credit-Card-Fraud-Detection-A-Pipeline-Journey.ipynb`
The notebook will:
- train baseline + stronger models
- calibrate probabilities
- select threshold policies
- export artifacts to `./artifacts/`
Artifacts are written to ./artifacts/ (models, thresholds, and run metadata).
See artifacts/README.md for the expected files produced by the notebook.
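The calibration step can be sketched with scikit-learn's `CalibratedClassifierCV`. The data and model below are illustrative stand-ins, not the notebook's exact code; real runs use `creditcard.csv` and the notebook's model set:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for transaction features.
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.2).astype(int)

# Time-ordered split: fit on the earlier slice, evaluate on the later one.
X_fit, y_fit = X[:1500], y[:1500]
X_eval, y_eval = X[1500:], y[1500:]

# Sigmoid (Platt) calibration via internal cross-validation.
model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=3
).fit(X_fit, y_fit)
proba = model.predict_proba(X_eval)[:, 1]  # calibrated fraud probabilities in [0, 1]
```

Calibrated probabilities are what make the threshold policies meaningful: a cutoff of 0.3 then actually corresponds to roughly 30% fraud risk.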
After exporting artifacts, you can score any CSV with the same feature columns:
python scripts/score_csv.py --csv data/raw/creditcard.csv --out artifacts/scored.csv --model xgb --policy min_cost

The output adds two columns: `fraud_proba` and `fraud_pred` (0/1).
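The scoring step can be sketched as a pure function. The threshold dictionary and the `min_cost`/`high_precision` keys below are illustrative stand-ins for whatever the exported artifacts actually contain:

```python
import numpy as np

def apply_policy(proba: np.ndarray, thresholds: dict, policy: str = "min_cost") -> np.ndarray:
    """Turn calibrated probabilities into 0/1 predictions using a stored threshold."""
    t = thresholds[policy]
    return (proba >= t).astype(int)

# In the real script the thresholds would be loaded from the exported
# artifacts (e.g. a JSON file); here we inline a made-up stand-in.
thresholds = {"min_cost": 0.32, "high_precision": 0.85}
proba = np.array([0.05, 0.40, 0.90])
preds = apply_policy(proba, thresholds, "min_cost")  # -> array([0, 1, 1])
```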
- Leak-safe evaluation via time-based train/test windows.
- Calibration aligns model scores with observed fraud rates, so thresholds operate on true probabilities.
- Threshold policies define operating points (e.g., min expected cost).
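The min-expected-cost policy can be illustrated as a sweep over candidate thresholds. The cost values and data below are made up for the sketch; the project's actual costs and selection code may differ:

```python
import numpy as np

def min_cost_threshold(y_true, proba, cost_fp=1.0, cost_fn=20.0):
    """Pick the threshold minimizing expected cost = cost_fp * FP + cost_fn * FN."""
    candidates = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in candidates:
        pred = (proba >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))  # false alarms
        fn = np.sum((pred == 0) & (y_true == 1))  # missed fraud
        costs.append(cost_fp * fp + cost_fn * fn)
    return candidates[int(np.argmin(costs))]

y = np.array([0, 0, 0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.4, 0.6, 0.7, 0.9])
t = min_cost_threshold(y, p)
```

Because a missed fraud (`cost_fn`) is weighted far above a false alarm (`cost_fp`), the selected threshold tends to sit low enough to catch the positives even at the price of extra false positives.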
MIT (code). Dataset licensing depends on the source from which you download it.