TACK combines data from multiple sources (TPDdb, PROTAC-DB, and PROTACpedia) to create the largest publicly available dataset for training and evaluating machine learning models that predict PROTAC-induced protein degradation activities.
This repository provides:
- Curated Dataset: High-quality PROTAC degradation data with DC50/Dmax measurements
- Data Curation Pipeline: Scripts to reproduce the dataset from raw sources
- Training Framework: Model training with nested 5×5 cross-validation
- Ensemble Selection: Caruana's greedy forward selection with uncertainty quantification
- Benchmark Suite: Standardized evaluation protocols and baselines
- Python API for Ensemble Models: Pre-trained ensembles for predicting Dmax, DC50, and binary degradation activity
Please refer to the tack_dataset/README.md for detailed instructions on dataset curation, to scripts/README.md for model training and ensemble selection, and to notebooks/ensemble_predictor_tutorial.ipynb for interactive tutorials on using the pre-trained ensemble predictor.
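As a rough illustration of the ensemble selection listed above, the following is a minimal sketch of Caruana-style greedy forward selection with replacement on a regression target, using mean squared error on validation predictions. It is not the code in scripts/; the model names, the toy predictions, and the helper names `greedy_forward_selection` and `predict_with_uncertainty` are hypothetical, and the uncertainty shown here (the unweighted standard deviation across the selected members) is only one plausible reading of "ensemble disagreement".

```python
from statistics import mean, pstdev


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))


def greedy_forward_selection(val_preds, y_val, n_rounds=10):
    """Pick models one at a time (with replacement) to minimize validation MSE.

    val_preds maps a (hypothetical) model name to its validation predictions.
    Returns ensemble weights that sum to 1.
    """
    names = sorted(val_preds)
    running_sum = [0.0] * len(y_val)  # summed predictions of the picks so far
    counts = {}                       # how often each model was picked
    for round_idx in range(1, n_rounds + 1):
        best_name, best_err = None, float("inf")
        for name in names:
            # Error of the ensemble average if this model were added once more.
            candidate = [(s + p) / round_idx
                         for s, p in zip(running_sum, val_preds[name])]
            err = mse(candidate, y_val)
            if err < best_err:
                best_name, best_err = name, err
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_name])]
        counts[best_name] = counts.get(best_name, 0) + 1
    return {name: c / n_rounds for name, c in counts.items()}


def predict_with_uncertainty(member_preds, weights):
    """Weighted mean per sample, plus the spread across the selected members."""
    selected = list(weights)
    n_samples = len(member_preds[selected[0]])
    means, spreads = [], []
    for i in range(n_samples):
        values = [member_preds[m][i] for m in selected]
        means.append(sum(weights[m] * member_preds[m][i] for m in selected))
        spreads.append(pstdev(values))  # disagreement between members
    return means, spreads


# Toy validation data (made up for illustration only).
y_val = [0.2, 0.8, 0.5, 0.9]
val_preds = {
    "mlp_a": [0.3, 0.7, 0.5, 0.8],
    "xgb_a": [0.1, 0.9, 0.6, 1.0],
    "mlp_b": [0.9, 0.1, 0.2, 0.3],
}
weights = greedy_forward_selection(val_preds, y_val, n_rounds=10)
means, spreads = predict_with_uncertainty(val_preds, weights)
print(weights, means, spreads)
```

Selecting with replacement lets a strong model be picked several times, which is how the weights become non-uniform; a poorly performing model such as the toy `mlp_b` is simply never added.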
- ✅ Multi-source integration with deduplication and quality control
- ✅ Scaffold-based data splitting to prevent information leakage
- ✅ Rigorous statistical evaluation via repeated cross-validation
- ✅ Uncertainty quantification through ensemble disagreement
- ✅ Multiple model architectures: MLP, XGBoost
- ✅ Hyperparameter optimization using Optuna
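To make the scaffold-based splitting idea concrete, here is a minimal sketch of a grouped K-fold assignment in which all molecules sharing a scaffold land in the same fold. It assumes the scaffold keys have already been computed (for example as Bemis-Murcko scaffold SMILES with RDKit); the function name `scaffold_kfold` and the placeholder scaffold labels are hypothetical, and TACK's own splitting logic may differ.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, n_folds=5):
    """Assign sample indices to folds so each scaffold lives in exactly one fold."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Largest scaffold groups first, each placed into the currently smallest fold,
    # which keeps fold sizes roughly balanced.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[scaffold])
    return folds


# Placeholder scaffold labels, one per molecule (illustrative only).
scaffolds = ["A", "A", "B", "C", "C", "C", "D", "E", "F", "F"]
folds = scaffold_kfold(scaffolds, n_folds=3)
for k, fold in enumerate(folds):
    print(k, sorted(fold))
```

Because whole scaffold groups are moved together, a model evaluated on one fold never sees a training molecule with the same scaffold, which is the leakage that a random split would allow.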
TACK uses uv for environment and dependency management.
1. Install uv (skip if already available, e.g., via an HPC module):
curl -LsSf https://astral.sh/uv/install.sh | sh
# or on HPC: module load uv
2. Clone the repository:
git clone https://github.com/ribesstefano/TACK.git
cd TACK
3. Create a virtual environment and install the package:
# Core dependencies only
uv venv --python 3.13
source .venv/bin/activate
uv pip install -e .
To also install plotting, notebook, and development tools:
uv pip install -e ".[dev]"
4. (GPU clusters) Install PyTorch with the correct CUDA version:
GPU computing is generally discouraged because the models are very small and the performance bottleneck is data encoding. If you nevertheless want to train or run inference on a GPU, replace cu121 in the following command with the CUDA version available on your system (e.g., cu118, cu124):
uv pip install torch --extra-index-url https://download.pytorch.org/whl/cu121
5. Register the environment as a Jupyter kernel (only needed for notebooks):
python -m ipykernel install --user --name tack --display-name "TACK"
6. Set up cache and model files for inference with pre-trained ensembles:
Please refer to the README section on "Pre-trained Models & Cache Files" for detailed instructions on downloading and configuring the necessary files for inference.
For running inference with the pre-trained ensemble, please refer to the ensemble predictor tutorial notebook for step-by-step instructions on how to use the EnsemblePredictor class with the downloaded models and cache files.
The TACK dataset is available on Hugging Face as ailab-bio/TACK and can be loaded with the datasets library:
from datasets import load_dataset
# Load specific configurations
dmax_ds = load_dataset("ailab-bio/TACK", "Dmax", split="train")
dc50_ds = load_dataset("ailab-bio/TACK", "DC50", split="train")
bin_ds = load_dataset("ailab-bio/TACK", "multitask", split="train")
Note: For reproducibility, training can also be performed using local CSV files via --custom_dataset_csv.
Running inference with a pre-trained ensemble requires two sets of files that are not included in this repository due to their size:
| Archive | Contents | Purpose |
|---|---|---|
| `cache.zip` | `cell2cell_id.json`, `cell2description.json`, `cell2data.json`, `cell_embeddings_model=sentence-transformer_pooling=sum.npz`, `morgan_fp_radius16_size512.npz`, `rdkit_descriptors.npz` | Pre-computed embeddings and molecular descriptors read at inference time |
| `ensembles.zip` | `ensembles/<task>_<type>/ensemble_weights_*.json`, `*_hparams.yaml`, `*_state.pt`, model checkpoints (`.ckpt` / XGBoost `.json`) | Trained ensemble weights and fitted data-processing state |
Both archives are available on Zenodo: https://doi.org/10.5281/zenodo.15691822
1. Download and unpack the archives:
# choose any writable location; this example uses ~/tack-artifacts
mkdir -p ~/tack-artifacts/cache ~/tack-artifacts/ensembles
unzip cache.zip -d ~/tack-artifacts/cache
unzip ensembles.zip -d ~/tack-artifacts/ensembles
2. Point TACKAI_CACHE at the cache directory:
Copy the example file .env.example to .env and edit the TACKAI_CACHE variable to point to the location of the unpacked cache files:
cp .env.example .env
# Then edit .env and set:
TACKAI_CACHE=~/tack-artifacts/cache/
tack/tackai reads this variable at startup via get_cache_dir(). If it is not set, the cache directory defaults to ~/.cache/tackai/.
3. Run inference with the pre-trained ensemble:
See this tutorial (TODO).
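Before running inference, it can help to confirm that the cache location from step 2 resolves as expected and that the files listed in the table above are present. The following is a minimal sketch of that check; `resolve_cache_dir` and `missing_cache_files` are hypothetical helpers written for this example, not TACK's own `get_cache_dir()`. The sketch assumes the cache files sit directly inside the cache directory (the archive may unpack into a subfolder) and that TACKAI_CACHE is already present in the process environment, since it does not parse the .env file itself.

```python
import os
from pathlib import Path

# File names taken from the cache.zip row of the table above.
EXPECTED_CACHE_FILES = [
    "cell2cell_id.json",
    "cell2description.json",
    "cell2data.json",
    "cell_embeddings_model=sentence-transformer_pooling=sum.npz",
    "morgan_fp_radius16_size512.npz",
    "rdkit_descriptors.npz",
]


def resolve_cache_dir(env=None):
    """Return the directory named by TACKAI_CACHE, or ~/.cache/tackai if unset."""
    env = os.environ if env is None else env
    raw = env.get("TACKAI_CACHE") or "~/.cache/tackai"
    return Path(raw).expanduser()


def missing_cache_files(cache_dir):
    """List expected cache files not found directly under cache_dir."""
    return [
        name for name in EXPECTED_CACHE_FILES
        if not (Path(cache_dir) / name).is_file()
    ]


cache_dir = resolve_cache_dir()
print("cache directory:", cache_dir)
print("missing files:", missing_cache_files(cache_dir))
```

An empty "missing files" list suggests the cache is in place; a full list usually means the variable points at the wrong directory or the archive unpacked one level deeper.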
The separation between cache and models is intentional:
- Cache files are dataset-wide and shared across all tasks (binary, Dmax, DC50). They are expensive to recompute (ESM protein embeddings, sentence embeddings for cell lines) and must exactly match the versions used during training.
- Model files are task-specific. Each ensemble folder contains the _hparams.yaml/_state.pt pair for the DegradationComplexDataModule (which stores fitted scalers and encoders) and the model checkpoints selected by Caruana's greedy forward search.
If you retrain models yourself, the cache files can be reused as-is; only the model archive needs to be regenerated.
For re-running data curation, please refer to the instructions in tack_dataset/README.md.
TACK/
├── configs/ # YAML configuration files
├── data/ # Processed dataset files
├── logs/ # Log files from training
├── misc/ # Images and miscellaneous files
├── notebooks/ # Jupyter notebooks for exploration
├── predictions/ # Model predictions on CV splits
├── protac_stan/ # PROTAC-STAN reproduction scripts
├── scripts/ # Training and ensemble scripts
├── ensemble_results/ # Ensemble selection results
├── pyproject.toml # Package metadata and dependencies (uv)
└── README.md
python scripts/train_models.py \
--model_type xgboost \
--task dmax \
--group scaffold \
--batch_size 64

python scripts/ensemble_comparison.py \
--task dmax \
--prediction_dir ./predictions \
--output_dir ./ensemble_results
See the PROTAC-STAN evaluation instructions for reproducing results on TACK with 5×5 cross-validation.
The TACK dataset and code are released under the MIT License. See LICENSE for details.
If you use TACK in your research, please cite the following paper:
@misc{ribes2026tackstatisticalevaluationdegradation,
title={{TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset}},
author={Stefano Ribes and Nils Dunlop and Rocío Mercado},
year={2026},
eprint={2605.19579},
archivePrefix={arXiv},
primaryClass={q-bio.QM},
url={https://arxiv.org/abs/2605.19579},
}
The authors acknowledge funding provided by the Chalmers Gender Initiative for Excellence (Genie), and by the Wallenberg AI, Autonomous Systems, and Software Program (WASP), supported by the Knut and Alice Wallenberg Foundation. The authors thank Yossra Gharbi, Alexander Persson, and Felix Erngård for helpful discussions. The computations and data storage were enabled by resources provided by Chalmers e-Commons and by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
