ribesstefano/TACK


TACK

A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset

Overview of the TACK dataset and training pipeline

TACK combines data from multiple sources (TPDdb, PROTAC-DB, and PROTACpedia) to create the largest publicly available dataset for training and evaluating machine learning models that predict PROTAC-induced protein degradation activities.


📚 Overview

This repository provides:

  • Curated Dataset: High-quality PROTAC degradation data with DC50/Dmax measurements
  • Data Curation Pipeline: Scripts to reproduce the dataset from raw sources
  • Training Framework: Model training with nested 5×5 cross-validation
  • Ensemble Selection: Caruana's greedy forward selection with uncertainty quantification
  • Benchmark Suite: Standardized evaluation protocols and baselines
  • Python API for Ensemble Models: Pre-trained ensembles for predicting Dmax, DC50, and binary degradation activity

Please refer to the tack_dataset/README.md for detailed instructions on dataset curation, to scripts/README.md for model training and ensemble selection, and to notebooks/ensemble_predictor_tutorial.ipynb for interactive tutorials on using the pre-trained ensemble predictor.

Key Features

  • ✅ Multi-source integration with deduplication and quality control
  • ✅ Scaffold-based data splitting to prevent information leakage
  • ✅ Rigorous statistical evaluation via repeated cross-validation
  • ✅ Uncertainty quantification through ensemble disagreement
  • ✅ Multiple model architectures: MLP, XGBoost
  • ✅ Hyperparameter optimization using Optuna
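
To illustrate why scaffold-based splitting prevents leakage, here is a minimal sketch of a group-aware K-fold split in plain Python. It assumes each molecule already has a scaffold key (for example a scaffold SMILES string computed elsewhere); the function name grouped_kfold and the single-letter keys are hypothetical and are not taken from the TACK code base, whose actual splitting logic may differ.

```python
import random
from collections import defaultdict


def grouped_kfold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) so that no group spans both sides.

    `groups` holds one hashable key per sample, e.g. a precomputed
    scaffold string (the keys used below are made up for illustration).
    """
    members = defaultdict(list)
    for idx, key in enumerate(groups):
        members[key].append(idx)

    # Shuffle reproducibly, then place the largest groups first so that
    # fold sizes stay roughly balanced.
    keys = sorted(members)
    random.Random(seed).shuffle(keys)
    keys.sort(key=lambda k: len(members[k]), reverse=True)

    folds = [[] for _ in range(n_splits)]
    for key in keys:
        smallest = min(folds, key=len)
        smallest.extend(members[key])

    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["A", "A", "B", "C", "C", "C", "D", "E", "F", "F"]
for train_idx, test_idx in grouped_kfold(scaffolds, n_splits=3):
    train_keys = {scaffolds[i] for i in train_idx}
    test_keys = {scaffolds[i] for i in test_idx}
    # Every scaffold appears on exactly one side of the split.
    assert train_keys.isdisjoint(test_keys)
```

Because whole scaffold groups are assigned to a fold, close structural analogues of a test molecule never appear in the corresponding training set.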

🚀 Quick Start

Installation

TACK uses uv for environment and dependency management.

1. Install uv (skip if already available, e.g., via an HPC module):

curl -LsSf https://astral.sh/uv/install.sh | sh
# or on HPC: module load uv

2. Clone the repository:

git clone https://github.com/ribesstefano/TACK.git
cd TACK

3. Create a virtual environment and install the package:

# Core dependencies only
uv venv --python 3.13
source .venv/bin/activate
uv pip install -e .

To also install plotting, notebook, and development tools:

uv pip install -e ".[dev]"

4. (GPU clusters) Install PyTorch with the correct CUDA version:

GPU computing is generally discouraged, since the models are very small and the performance bottleneck is in data encoding. Nevertheless, if you wish to train or run inference on a GPU, replace cu121 in the following command with the CUDA version available on your system (e.g. cu118, cu124):

uv pip install torch --extra-index-url https://download.pytorch.org/whl/cu121

5. Register the environment as a Jupyter kernel (only needed for notebooks):

python -m ipykernel install --user --name tack --display-name "TACK"

6. Set up cache and model files for inference with pre-trained ensembles:

Please refer to the README section on "Pre-trained Models & Cache Files" for detailed instructions on downloading and configuring the necessary files for inference.

To run inference with the pre-trained ensemble, follow the ensemble predictor tutorial notebook (notebooks/ensemble_predictor_tutorial.ipynb), which shows step by step how to use the EnsemblePredictor class with the downloaded models and cache files.

Download the Dataset

The TACK dataset is hosted on Hugging Face as ailab-bio/TACK and can be loaded with the datasets library:

from datasets import load_dataset

# Load specific configurations
dmax_ds = load_dataset("ailab-bio/TACK", "Dmax", split="train")
dc50_ds = load_dataset("ailab-bio/TACK", "DC50", split="train")
bin_ds = load_dataset("ailab-bio/TACK", "multitask", split="train")

Note

For reproducibility, training can also be performed using local CSV files via --custom_dataset_csv.

🗄️ Pre-trained Models & Cache Files

Running inference with a pre-trained ensemble requires two sets of files that are not included in this repository due to their size:

| Archive | Contents | Purpose |
| --- | --- | --- |
| cache.zip | cell2cell_id.json, cell2description.json, cell2data.json, cell_embeddings_model=sentence-transformer_pooling=sum.npz, morgan_fp_radius16_size512.npz, rdkit_descriptors.npz | Pre-computed embeddings and molecular descriptors read at inference time |
| ensembles.zip | ensembles/<task>_<type>/ensemble_weights_*.json, *_hparams.yaml, *_state.pt, model checkpoints (.ckpt / XGBoost .json) | Trained ensemble weights and fitted data-processing state |

Both archives are available on Zenodo: https://doi.org/10.5281/zenodo.15691822

Setup

1. Download and unpack the archives:

# choose any writable location; this example uses ~/tack-artifacts
mkdir -p ~/tack-artifacts/cache ~/tack-artifacts/ensembles

unzip cache.zip  -d ~/tack-artifacts/cache
unzip ensembles.zip -d ~/tack-artifacts/ensembles
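
After unpacking, it can be useful to confirm that every cache file listed above for cache.zip is present. The following is a small sketch using only the standard library; the helper name missing_cache_files is hypothetical, and the search is recursive because the exact directory layout inside the archive is not assumed here.

```python
from pathlib import Path

# File names copied from the cache.zip contents listed above.
EXPECTED_CACHE_FILES = [
    "cell2cell_id.json",
    "cell2description.json",
    "cell2data.json",
    "cell_embeddings_model=sentence-transformer_pooling=sum.npz",
    "morgan_fp_radius16_size512.npz",
    "rdkit_descriptors.npz",
]


def missing_cache_files(root):
    """Return the expected file names not found anywhere under `root`."""
    root = Path(root).expanduser()
    found = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in EXPECTED_CACHE_FILES if name not in found]


if __name__ == "__main__":
    # Example location from the unpacking step; adjust to your own path.
    missing = missing_cache_files("~/tack-artifacts/cache")
    print("missing:", missing if missing else "none")
```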

2. Point TACKAI_CACHE at the cache directory:

Copy the example file .env.example to .env and edit the TACKAI_CACHE variable to point to the location of the unpacked cache files:

cp .env.example .env
# Then edit .env and set:
TACKAI_CACHE=~/tack-artifacts/cache/tack/

tackai reads this variable at startup via get_cache_dir(). If the variable is not set, the cache directory falls back to ~/.cache/tackai/.
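
The lookup described above can be pictured with the following sketch. It is a hypothetical stand-in written for illustration, not the package's actual get_cache_dir() implementation; the function name resolve_cache_dir and the explicit tilde expansion are assumptions.

```python
import os
from pathlib import Path

DEFAULT_CACHE = "~/.cache/tackai/"


def resolve_cache_dir(env=None):
    """Hypothetical stand-in for get_cache_dir(); not the package's code."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default location.
    raw = env.get("TACKAI_CACHE") or DEFAULT_CACHE
    # expanduser() turns a leading "~" into the home directory. Depending
    # on how the .env file is loaded, "~" may otherwise stay literal.
    return Path(raw).expanduser()


if __name__ == "__main__":
    print(resolve_cache_dir())
```

If a path starting with ~ is not picked up as expected, writing the absolute path in .env is a simple way to rule out tilde expansion as the cause.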

3. Run inference with the pre-trained ensemble:

See the ensemble predictor tutorial notebook (notebooks/ensemble_predictor_tutorial.ipynb).

What belongs in each archive

The separation between cache and models is intentional:

  • Cache files are dataset-wide and shared across all tasks (binary, Dmax, DC50). They are expensive to recompute (ESM protein embeddings, sentence embeddings for cell lines) and must exactly match the versions used during training.
  • Model files are task-specific. Each ensemble folder contains the _hparams.yaml / _state.pt pair for the DegradationComplexDataModule (which stores fitted scalers and encoders) and the model checkpoints selected by Caruana's greedy forward search.

If you retrain models yourself, the cache files can be reused as-is; only the model archive needs to be regenerated.
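
For readers unfamiliar with Caruana's greedy forward selection, the sketch below shows the general idea in plain Python: candidates are added one at a time (with replacement) whenever they give the best validation score for the running ensemble average, and the resulting selection counts become ensemble weights. The RMSE metric, the number of rounds, the weighted standard deviation used as a disagreement measure, and the model names are all assumptions for illustration; the repository's scripts may use different choices.

```python
import math
from collections import Counter


def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def caruana_selection(val_preds, y_val, n_rounds=10):
    """Greedy forward selection with replacement.

    val_preds maps a model name to its validation predictions.
    Returns a dict of ensemble weights that sum to 1.
    """
    total = [0.0] * len(y_val)  # running sum of the selected predictions
    picks = []
    best_score, best_len = float("inf"), 0
    for step in range(1, n_rounds + 1):
        # Score each candidate by the RMSE of the ensemble average that
        # would result from adding it to the current selection.
        scores = {
            name: rmse([(s + p) / step for s, p in zip(total, preds)], y_val)
            for name, preds in val_preds.items()
        }
        winner = min(scores, key=scores.get)
        picks.append(winner)
        total = [s + p for s, p in zip(total, val_preds[winner])]
        if scores[winner] < best_score:
            best_score, best_len = scores[winner], step
    # Keep the prefix of picks that achieved the best validation score.
    counts = Counter(picks[:best_len])
    return {name: c / best_len for name, c in counts.items()}


def predict_with_disagreement(weights, member_preds):
    """Weighted mean and weighted standard deviation across members."""
    mean = sum(w * member_preds[m] for m, w in weights.items())
    var = sum(w * (member_preds[m] - mean) ** 2 for m, w in weights.items())
    return mean, math.sqrt(var)


# Hypothetical validation targets and per-model predictions.
y_val = [0.2, 0.5, 0.9, 0.4]
val_preds = {
    "mlp_seed0": [0.3, 0.6, 1.0, 0.5],
    "xgb_seed0": [0.1, 0.4, 0.8, 0.3],
    "mlp_seed1": [0.9, 0.1, 0.2, 0.8],
}
weights = caruana_selection(val_preds, y_val)
mean, spread = predict_with_disagreement(
    weights, {"mlp_seed0": 0.6, "xgb_seed0": 0.4, "mlp_seed1": 0.1}
)
```

Selecting with replacement lets a strong model be picked several times, which is how the procedure arrives at non-uniform weights; a larger spread across the selected members signals a less certain prediction.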

📊 Data Curation

To re-run data curation, follow the instructions in tack_dataset/README.md.

📂 Repository Structure

TACK/
├── configs/               # YAML configuration files
├── data/                  # Processed dataset files
├── logs/                  # Log files from training
├── misc/                  # Images and miscellaneous files
├── notebooks/             # Jupyter notebooks for exploration
├── predictions/           # Model predictions on CV splits
├── protac_stan/           # PROTAC-STAN reproduction scripts
├── scripts/               # Training and ensemble scripts
├── ensemble_results/      # Ensemble selection results
├── pyproject.toml         # Package metadata and dependencies (uv)
└── README.md

📈 Reproducing the Results

Train a Model

python scripts/train_models.py \
    --model_type xgboost \
    --task dmax \
    --group scaffold \
    --batch_size 64

Construct and Evaluate Ensemble

python scripts/ensemble_comparison.py \
    --task dmax \
    --prediction_dir ./predictions \
    --output_dir ./ensemble_results

🧬 PROTAC-STAN Evaluation

See the PROTAC-STAN evaluation instructions for reproducing results on TACK with 5Γ—5 cross-validation.

📄 License

The TACK dataset and code are released under the MIT License. See LICENSE for details.

📑 Citation

If you use TACK in your research, please cite the following paper:

@misc{ribes2026tackstatisticalevaluationdegradation,
      title={{TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset}}, 
      author={Stefano Ribes and Nils Dunlop and Rocío Mercado},
      year={2026},
      eprint={2605.19579},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2605.19579}, 
}

🀝 Acknowledgements

The authors acknowledge funding provided by the Chalmers Gender Initiative for Excellence (Genie), and by the Wallenberg AI, Autonomous Systems, and Software Program (WASP), supported by the Knut and Alice Wallenberg Foundation. The authors thank Yossra Gharbi, Alexander Persson, and Felix ErngΓ₯rd for helpful discussions. The computations and data storage were enabled by resources provided by Chalmers e-Commons and by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
