A compact, production-minded workflow to predict diabetes risk from routine clinical measurements with probability calibration and an explicit threshold policy.
Case study: CASE_STUDY.md
- Main notebook: diabetes-prediction-from-eda-to-production.ipynb
- Optional reference notebook: pima-indians-diabetes-database.ipynb
- Exported artifacts: ./artifacts/
- Optional scoring script: scripts/predict.py
Source: Kaggle “Pima Indians Diabetes Database” (diabetes.csv).
Expected columns (typical):
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
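A quick schema audit before training catches renamed or missing columns early. A minimal sketch (the `audit_columns` helper is hypothetical; `EXPECTED` mirrors the column list above):

```python
import pandas as pd

# Expected schema of diabetes.csv (mirrors the column list above).
EXPECTED = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

def audit_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df."""
    return [c for c in EXPECTED if c not in df.columns]

# An empty frame with the right schema passes the audit.
ok = audit_columns(pd.DataFrame(columns=EXPECTED))
```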
- Download the dataset CSV.
- Place it at data/raw/diabetes.csv

The notebook also supports the Kaggle input path:
/kaggle/input/pima-indians-diabetes-database/diabetes.csv
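Supporting both locations amounts to a simple path fallback; a sketch (`find_dataset` is a hypothetical helper, not part of the notebook):

```python
from pathlib import Path

# Candidate locations: local checkout first, then the Kaggle input mount.
CANDIDATES = [
    Path("data/raw/diabetes.csv"),
    Path("/kaggle/input/pima-indians-diabetes-database/diabetes.csv"),
]

def find_dataset(candidates=CANDIDATES) -> Path:
    """Return the first existing candidate path, or raise if none exists."""
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError("diabetes.csv not found in any known location")
```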
```bash
python -m venv .venv
# Windows: .\.venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
```

Open and run:
diabetes-prediction-from-eda-to-production.ipynb
The notebook will:
- clean and audit the dataset
- train models and calibrate probabilities
- select an operating threshold (policy)
- export artifacts/pima_best_pipeline.joblib
The exported bundle includes:
- trained pipeline
- operating threshold
- run metadata
See artifacts/README.md.
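Calibration and export as described above can be sketched end to end (the model choice, calibration method, synthetic data, and bundle keys are illustrative, not the notebook's exact configuration):

```python
import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 clinical features and the Outcome label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

# A pipeline keeps preprocessing and the classifier together for export.
base = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Sigmoid (Platt) calibration makes predicted probabilities decision-ready.
model = CalibratedClassifierCV(base, method="sigmoid", cv=5)
model.fit(X, y)

# Bundle the pipeline, operating threshold, and run metadata in one artifact.
bundle = {
    "pipeline": model,
    "threshold": 0.5,  # placeholder; the real threshold comes from the policy step
    "metadata": {"calibration": "sigmoid", "n_train": len(X)},
}
joblib.dump(bundle, "pima_best_pipeline.joblib")
```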
After exporting artifacts, you can score a CSV:
```bash
python scripts/predict.py --csv data/raw/diabetes.csv --out artifacts/scored.csv
```

The output adds two columns:
- diabetes_proba
- diabetes_pred
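A scoring routine equivalent to what the command above describes can be sketched as follows (the `score_csv` helper and the bundle keys are assumptions drawn from the bundle description, not the actual contents of scripts/predict.py):

```python
import joblib
import pandas as pd

def score_csv(bundle_path: str, csv_path: str, out_path: str) -> pd.DataFrame:
    """Load the exported bundle, score a CSV, and write probabilities and decisions."""
    bundle = joblib.load(bundle_path)
    pipeline, threshold = bundle["pipeline"], bundle["threshold"]
    df = pd.read_csv(csv_path)
    # The label column may be absent when scoring new data.
    features = df.drop(columns=["Outcome"], errors="ignore")
    df["diabetes_proba"] = pipeline.predict_proba(features)[:, 1]
    df["diabetes_pred"] = (df["diabetes_proba"] >= threshold).astype(int)
    df.to_csv(out_path, index=False)
    return df
```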
- Probability calibration aligns predicted probabilities with observed risk, so a fixed decision threshold behaves predictably.
- The operating threshold (policy) is selected on the validation set and exported alongside the model artifact.
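One common policy for this step (the notebook's exact criterion may differ) is to sweep a grid of thresholds over held-out predictions and keep the one that maximizes validation F1:

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_val, proba_val, grid=None) -> float:
    """Return the grid threshold with the highest F1 on the validation set."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # 0.05 steps from 0.05 to 0.95
    scores = [f1_score(y_val, (proba_val >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```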
Code is MIT-licensed. Dataset licensing depends on the source from which you download it.