FoldTree2 is a Python package and toolkit for inferring phylogenetic trees from protein structures using maximum likelihood methods. It provides tools for converting protein structure files (PDBs) into graph representations, deriving structural alignments, and building phylogenetic trees based on structural data.
Quick setup and run (from the repository root):
```
conda env create --name foldtree2 --file=foldtree2.yml
conda activate foldtree2
pip install .
```

Then run the pipeline on a set of PDB structures with the production model, using explicit encoder/decoder checkpoints:
```
foldtree2 \
  --encoder models/production/30char_minimal_decoder/final_30char_contacts_aa_encoder_full_epoch_52.pt \
  --decoder models/production/30char_minimal_decoder/final_30char_contacts_aa_decoder_full_epoch_52.pt \
  --structures "/path/to/structures/*.pdb" \
  --outdir results/
```

- PDB to Graph Conversion: Convert protein structures into graph-based representations suitable for machine learning and phylogenetic analysis.
- Custom Substitution Matrices: Generate and use structure-based substitution matrices for alignments.
- Maximum Likelihood Tree Inference: Build phylogenetic trees from structural alignments using maximum likelihood approaches.
- Flexible Pipeline: Modular scripts for each step: graph creation, encoding, alignment, and tree inference.
First, create the environment:

```
conda env create --name foldtree2 --file=foldtree2.yml
conda activate foldtree2
```

and then install the project with pip:

```
pip install .
```

This will install all required dependencies as specified in `pyproject.toml` and `setup.py`.
FoldTree2 provides several command-line tools that are automatically installed and available system-wide:
- `foldtree2` / `ft2treebuilder`: Main phylogenetic tree inference pipeline
- `pdbs-to-graphs`: Convert PDB files to graph representations
- `makesubmat`: Generate structure-based substitution matrices
- `raxml-ng`: Maximum likelihood phylogenetic inference (bundled RAxML-NG)
- `mad`: Minimal Ancestor Deviation tree rooting
- `hex2maffttext` / `maffttext2hex`: MAFFT format conversion utilities
All tools include help documentation accessible with the --help flag.
For most users, FoldTree2 provides pretrained models that can be used directly to infer phylogenetic trees from protein structures.
Build phylogenetic trees from a folder of PDB structures using pretrained models:
```
foldtree2 --model mergeddecoder_foldtree2_test \
  --structures <YOURSTRUCTUREFOLDER> \
  --outdir <RESULTSFOLDER>
```

This single command will:
- Convert PDB files to graph representations
- Use pretrained models to encode structural features
- Create structural alignments
- Infer a maximum likelihood phylogenetic tree
- `mergeddecoder_foldtree2_test`: General-purpose model for diverse protein structures
- `small`: Lightweight model for smaller datasets
- Additional models may be available in the `models/` directory
The pipeline generates several output files in your results directory:
- Phylogenetic tree: `.tre` files in Newick format
- Alignments: `.aln` files showing structural alignments
- Log files: Detailed information about the inference process
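To sanity-check a run, you can pull the leaf names out of the resulting Newick tree. The sketch below is a minimal stand-alone extractor using only the standard library; the tree string here is a toy example (a real analysis would read one of the `.tre` output files, ideally with a tree library such as ete3 or DendroPy):

```python
import re

def newick_leaf_names(newick: str) -> list:
    """Extract leaf names from a simple Newick string.

    Leaves appear after '(' or ',' and before ':', ',' or ')';
    internal-node labels and quoted names are not handled here.
    """
    return re.findall(r"[(,]\s*([^(),:;]+)\s*[:,)]", newick)

# Toy tree with four leaves; a real run would read the .tre file from the results directory
tree = "((protA:0.1,protB:0.2):0.05,(protC:0.3,protD:0.1):0.07);"
print(newick_leaf_names(tree))  # ['protA', 'protB', 'protC', 'protD']
```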
For advanced users who want to train their own models or work with specialized datasets, FoldTree2 provides a complete training pipeline. FoldTree2's production models are trained on a large, diverse set of protein structures from the AFDB cluster database, but you can train your own models on custom datasets.
Why do this instead of using a pretrained model?
- Emphasize domain-specific structure signals: If your proteins are enriched for particular folds, repeats, interfaces, or conformational regimes, a custom encoder can better capture those patterns than a general model.
- Control how structures are compressed into discrete characters: FoldTree2 encodes structure graphs into a discrete alphabet, and this bottleneck determines what information is preserved in downstream alignments/tree inference.
- Tune phylogenetic granularity with alphabet size: Smaller alphabets (fewer embeddings) tend to merge subtle differences and can be more robust/noise-tolerant; larger alphabets preserve finer structural distinctions and can improve resolution for closely related clades.
- Adapt to your data quality and objectives: You can tune model capacity and training settings to prioritize broad family-level separation or fine-grained subfamily/strain-level structure variation.
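The alphabet-size tradeoff above can be seen in a toy scalar quantizer. This is an illustration only, not FoldTree2's actual encoder: with few levels, subtly different inputs collapse to the same symbol; with more levels, they remain distinguishable.

```python
def quantize(value: float, levels: int) -> int:
    """Map a value in [0, 1) to one of `levels` discrete symbols."""
    return min(int(value * levels), levels - 1)

a, b = 0.42, 0.48  # stand-ins for two subtly different structural features
print(quantize(a, 5), quantize(b, 5))    # coarse alphabet: 2 2  (merged)
print(quantize(a, 50), quantize(b, 50))  # fine alphabet:   21 24 (distinct)
```

A coarse alphabet is more robust to noise but blurs fine distinctions; a fine alphabet preserves them at the cost of sensitivity to small perturbations.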
Convert your PDB files to a graph HDF5 dataset suitable for training:
```
pdbs-to-graphs <input_pdb_dir> <training_graphs.h5>
```

FoldTree2 provides several training scripts with different features:
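Before conversion, it can help to confirm which structure files will actually be picked up from the input directory. A small standard-library sketch (the directory name is hypothetical, and the accepted extensions are an assumption — check the tool's `--help` for the formats it supports):

```python
from pathlib import Path

def collect_structures(pdb_dir: str) -> list:
    """Return structure files in a directory, sorted for reproducibility.

    The extension set here (.pdb, .cif) is an assumption for illustration.
    """
    root = Path(pdb_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix.lower() in {".pdb", ".cif"})

# Hypothetical input directory; returns [] if it does not exist
structures = collect_structures("my_structures")
print(f"{len(structures)} structure files found")
```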
```
python learn_monodecoder.py \
  --dataset <training_graphs.h5> \
  --modelname <my_custom_model> \
  --epochs 100 \
  --batch-size 20 \
  --hidden-size 256 \
  --embedding-dim 128 \
  --outdir ./models/
```
See the complete list of options with --help.
For advanced features like distributed training, automatic checkpointing, and logging:
```
python learn_lightning.py \
  --dataset <training_graphs.h5> \
  --modelname <my_lightning_model> \
  --epochs 100 \
  --batch-size 20 \
  --learning-rate 1e-4 \
  --outdir ./models/ \
  --clip-grad
```

See the complete list of options with --help.
- `--dataset`: Path to your HDF5 graph dataset
- `--modelname`: Name for your trained model
- `--epochs`: Number of training epochs (default: 100)
- `--batch-size`: Training batch size (default: 20)
- `--hidden-size`: Hidden layer dimensions (default: 256)
- `--embedding-dim`: Embedding dimensions (default: 128)
- `--num-embeddings`: Size of the discrete structural alphabet used by the encoder
- `--learning-rate`: Learning rate (default: 1e-4)
- `--clip-grad`: Enable gradient clipping for stability
Create structure-based substitution matrices using your trained model:
```
makesubmat \
  --modelname <my_custom_model> \
  --modeldir ./models/ \
  --datadir <data_dir> \
  --outdir_base <results_dir> \
  --dataset <input_graphs.h5> \
  --encode_alns
```

This script includes utilities to download structures from the AFDB cluster database, align clusters into reference alignments with Foldseek, encode the structures, and derive substitution matrices.
See the complete list of options with --help.
Once trained, use explicit encoder and decoder checkpoints in the main pipeline:
```
foldtree2 --encoder <PATH_TO_ENCODER.pt> \
  --decoder <PATH_TO_DECODER.pt> \
  --structures <YOURSTRUCTUREFOLDER> \
  --outdir <RESULTSFOLDER>
```

- GPU Acceleration: Training is significantly faster with CUDA-enabled GPUs
- Dataset Size: Larger, more diverse datasets generally produce better models
- Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and architectures
- Monitoring: Use TensorBoard logs to monitor training progress
- Checkpointing: Save model checkpoints regularly to resume training if interrupted
- Python 3.7+
- See `pyproject.toml` or `setup.py` for a full list of dependencies.
MIT License (see LICENSE.txt)
Dave Moi (dmoi@unil.ch)
For more details, see the source code and scripts in the repository.
