IMAGENE-ESCC-Survival

IMAGENE (Subproject) - Python Implementation of "A Multimodal Deep Learning Framework for Postoperative Overall Survival Prediction in Esophageal Squamous Cell Carcinoma"

Project Overview

This project provides a complete workflow for analyzing pathology images, with four main components:

Cell Classification - Identify and classify different cell types in pathology images
Tissue Classification - Classify tissue regions using deep learning models
Feature Calculation - Extract spatial and morphological features from cell distributions
Survival Analysis - Analyze patient survival data with multiple feature types

Project Structure

public_version/
├── cell_classfication/           # Cell classification module
│   ├── train.py                    # Train cell classification model
│   ├── pred_cell_label.py          # Predict cell labels
│   ├── get_data.py                 # Data loading utilities
│   ├── cell_feature_model.py       # Feature extraction models
│   └── main.py                     # Main training script
│
├── tissue_classfication/           # Tissue classification module
│   ├── config/                     # Configuration files
│   │   ├── swin.py                # Swin Transformer config
│   │   └── gigapath.py           # GigaPath config
│   ├── models/                     # Model definitions
│   │   └── giga_path_patch_encoder.py  # GigaPath patch encoder
│   └── pred_tissue_label.py        # Predict tissue labels
│
├── calculate_features/          # Feature extraction module
│   ├── calculate_region_features.py    # Calculate region-based features
│   └── nearest_neighbor_distance.py   # GPU-accelerated spatial analysis
│
└── survival_analysis/           # Survival analysis module
    ├── main.py                     # Main pipeline entry
    ├── train.py                    # Training and evaluation
    ├── model_surv.py               # Survival models (Cox, RSF, GB)
    ├── load_data.py                # Data loading and preprocessing
    ├── select_survival_features.py  # Feature selection methods
    └── visualize.py                # Visualization utilities

Installation

Clone the repository

git clone https://github.com/OpenGene/IMAGENE-ESCC-Survival.git
cd IMAGENE-ESCC-Survival

Create conda environment

conda create -n pathology_analysis python=3.8
conda activate pathology_analysis

Install dependencies

pip install -r requirements.txt

Requirements

See requirements.txt for the complete list of dependencies.

Usage

1. Cell Classification

Preparation of Training Data

The training dataset should be organized in the following structure:

data_root/
├── info.json                          # Dataset metadata (class_map, class_colors_map, id_img_map)
├── fold0/                             # First fold
│   ├── images/                        # Pathology image patches
│   │   ├── 0_14345_7163.png
│   │   ├── 0_14345_7307.png
│   │   └── ...
│   ├── labels/                        # Ground truth labels
│   │   ├── 0_14345_7163.npy          # Contains inst_map and type_map
│   │   ├── 0_14345_7307.npy
│   │   └── ...
│   ├── cell_detection/                # Cell detection results
│   │   ├── cell_detection/           # JSON format cell annotations
│   │   │   ├── 0_14345_7163.json
│   │   │   └── ...
│   │   ├── cell_detection_geojson/    # GeoJSON format for visualization
│   │   │   ├── 0_14345_7163.geojson
│   │   │   └── ...
│   │   └── cell_graph/                # PyTorch graph tensors (optional)
│   │       ├── 0_14345_7163.pt
│   │       └── ...
│   ├── features.npy                   # Cached cell features
│   ├── labels.npy                     # Cached labels
│   ├── image_paths.json               # List of image file names
│   └── cell_count.csv                 # Cell count statistics
├── fold1/                             # Second fold
│   └── ...
└── fold2/                             # Third fold
    └── ...

File Format Details:

Images: PNG format patches extracted from whole slide images
- Naming convention: {slide_id}_{x_coord}_{y_coord}.png
Labels (.npy): Dictionary containing:
- inst_map: Instance segmentation map (H×W numpy array)
- type_map: Cell type classification map (H×W numpy array)
Cell Detection (.json): JSON format with cell annotations including:
- Cell coordinates
- Cell boundaries
- Detection confidence scores
Cell Graph (.pt): PyTorch tensor format for graph-based features
info.json: Dataset metadata:
- class_map: Mapping from cell type IDs to names
- class_colors_map: RGB colors for visualization
- id_img_map: Mapping from fold IDs to image identifiers

Data Preparation Steps:

Extract image patches from whole slide images
Generate ground truth labels with instance and type maps
Run cell detection to generate cell annotations
Generate cell graph representations
Create info.json with class mappings
Organize data into fold directories for cross-validation

Training

cd cell_classfication
python main.py \
    --mode train \
    --data_root path/to/single_cell_dataset \
    --train_folds fold0 fold1 fold2 \
    --val_folds fold3 \
    --model_path path/to/model.pkl \
    --feature_type all \
    --confusion_matrix_path path/to/confusion_matrix.png \
    --roc_curves_path path/to/roc_curves.png

Parameters:

--mode: Training mode (train/test)
--data_root: Path to dataset root directory
--train_folds: List of training fold names (default: all folds starting with 'fold')
--val_folds: List of validation fold names (default: 20% of training data)
--model_path: Path to save/load model
--feature_type: Type of features to use (traditional_feature, deep_feature, or all)
--confusion_matrix_path: Path to save confusion matrix (optional)
--roc_curves_path: Path to save ROC curves (optional)

Example with traditional features only:

python main.py \
    --mode train \
    --data_root path/to/single_cell_dataset \
    --model_path path/to/traditional_model.pkl \
    --feature_type traditional_feature

Example with deep features only:

python main.py \
    --mode train \
    --data_root path/to/single_cell_dataset \
    --model_path path/to/deep_model.pkl \
    --feature_type deep_feature

Testing

python main.py \
    --mode test \
    --data_root path/to/single_cell_dataset \
    --val_folds fold3 \
    --model_path path/to/model.pkl \
    --feature_type all \
    --confusion_matrix_path path/to/test_confusion_matrix.png \
    --roc_curves_path path/to/test_roc_curves.png

Prediction (Single Image)

python pred_cell_label.py \
    --single \
    --cell_detection_dir path/to/image_dir \
    --model_path path/to/model.pkl

Prediction (Batch)

python pred_cell_label.py \
    --batch \
    --data_roots path/to/data_root1 path/to/data_root2 \
    --model_path path/to/model.pkl \
    --n_jobs 4

2. Tissue Classification

Preparation of Training Data

The training dataset should be organized in the following structure:

data_root/
├── esophageal_gland/              # Esophageal gland tissue
│   ├── slide_001.svs_44948_9940.png
│   ├── slide_001.svs_45469_9419.png
│   └── ...
├── interstitial_region/           # Interstitial region
│   ├── slide_001.svs_44948_9940.png
│   └── ...
├── mucosal_epithelium/          # Mucosal epithelium
│   ├── slide_001.svs_44948_9940.png
│   └── ...
├── muscle_tissue/               # Muscle tissue
│   ├── slide_001.svs_44948_9940.png
│   └── ...
├── submucosa/                  # Submucosa
│   ├── slide_001.svs_44948_9940.png
│   ├── slide_002.svs_6463_38916.png
│   └── ...
├── tumor_necrosis_region/       # Tumor necrosis region
│   ├── slide_001.svs_44948_9940.png
│   └── ...
└── tumor_region/                # Tumor region
    ├── slide_001.svs_44948_9940.png
    └── ...

File Format Details:

Images: PNG format patches extracted from whole slide images
- Naming convention: {slide_id}_{x_coord}_{y_coord}.png
- Each directory represents a tissue type (class)
- Images are organized by tissue type for easy dataset management

Tissue Types:

ID	Tissue Type	Description
0	esophageal_gland	Esophageal gland tissue
1	interstitial_region	Interstitial region
2	mucosal_epithelium	Mucosal epithelium
3	muscle_tissue	Muscle tissue
4	submucosa	Submucosa layer
5	tumor_necrosis_region	Tumor necrosis region
6	tumor_region	Tumor region

Data Preparation Steps:

Extract image patches from whole slide images
Organize patches by tissue type into corresponding directories
Ensure consistent naming convention for all image files
Verify image quality and annotations
Split data into training and validation sets if needed

Training with Swin Transformer

Note: This requires mmpretrain library. Install with:

pip install mmpretrain

Then run training:

# Train
python path/to/mmpretrain/tools/train.py \
    tissue_classfication/swin.py \

# Test
python path/to/mmpretrain/tools/test.py \
    tissue_classfication/swin.py \
    path/to/workdirs/epoch_8.pth \
    --out path/to/predictions_epoch_8.pkl \
    --out-item pred

# Confusion matrix
python path/to/mmpretrain/tools/analysis_tools/confusion_matrix.py \
    tissue_classfication/swin.py \
    path/to/workdirs/epoch_8.pth \
    --show \
    --show-path path/to/epoch_8_confusion_matrix.png \
    --include-values

Prediction

cd tissue_classfication
python pred_tissue_label.py \
    --input_dir path/to/h5_files \
    --output_dir path/to/output \
    --model_path path/to/model.pth \
    --device cuda

3. Feature Calculation

Preparation of Training Data

The feature calculation requires the following data structure:

data_root/
├── preprocessing/                          # Preprocessed cell detection results
│   └── {image_name}/                      # Image-specific directory
│       └── cell_detection_clf/
│           └── cells.json                 # Cell annotations (centroid, type)
├── clam_patches/                          # CLAM patch information
│   └── patches/
│       └── {image_name}.h5                # Patch coordinates and metadata
├── clam_gigapath_pred_label/              # Tissue classification predictions
│   └── h5_files/
│       └── {image_name}.h5                # Tissue type labels for patches
└── features/                              # Output directory (auto-created)
    └── {image_name}.csv                   # Calculated features (output)

Input Data Format:

cells.json: JSON file containing cell annotations
CLAM patches HDF5: Contains patch coordinates and metadata
- coords: Patch coordinates
- Attributes: downsample, patch_size, downsampled_level_dim
Tissue prediction HDF5: Contains tissue type predictions for patches
- coords: Patch coordinates
- Tissue type labels

Calculate Region Features

cd calculate_features
python calculate_region_features.py \
    --data_roots /path/to/dataset1 /path/to/dataset2 \
    --output all_features.csv \
    --cells_path_template preprocessing/{}/cell_detection_clf/cells.json \
    --clam_patches_template clam_patches/patches/{}.h5 \
    --clam_gigapath_pred_template clam_gigapath_pred_label/h5_files/{}.h5 \
    --features_save_path features/{}.csv \
    --grid_size 256 \
    --gaussian_kernel 5 5 \
    --gaussian_sigma 0 \
    --tumor_cell_type 1 \
    --mucosal_epithelium_type 2 \
    --density_thresholds 1 5 10 20 \
    --min_tumor_cells 20 \
    --max_workers 4 \
    --batch_size 10000 \
    --distance_threshold 256.0 \
    --min_cells_for_spatial 2 \
    --save_plots

Parameters:

Input/Output

--data_roots: List of data directories to process (required)
--output: Output feature file path (default: all_features.csv)

Path Templates

--cells_path_template: Path template for cell annotation file (default: preprocessing/{}/cell_detection_clf/cells.json)
--clam_patches_template: Path template for CLAM patch file (default: clam_patches/patches/{}.h5)
--clam_gigapath_pred_template: Path template for tissue prediction file (default: clam_gigapath_pred_label/h5_files/{}.h5)
--features_save_path: Path template for feature output (default: features/{}.csv)

KDE Parameters

--grid_size: Grid size for kernel density estimation (default: 256)
--gaussian_kernel: Gaussian kernel size (height, width) (default: 5 5)
--gaussian_sigma: Gaussian kernel standard deviation (default: 0)

Region Parameters

--tumor_cell_type: Tumor cell type ID (default: 1)
--mucosal_epithelium_type: Mucosal epithelium cell type ID (default: 2)
--density_thresholds: Density thresholds for region partitioning (default: 1 5 10 20)
--min_tumor_cells: Minimum number of tumor cells required (default: 20)

Processing Parameters

--max_workers: Maximum number of parallel workers (default: 2)
--batch_size: Batch processing size (default: 10000)

Feature Parameters

--distance_threshold: Distance threshold for spatial analysis (default: 256.0)
--min_cells_for_spatial: Minimum cells for spatial features (default: 2)

Debug Parameters

--save_plots: Save visualization plots (flag)

Output Features:

The script calculates the following types of features:

Global Features (across entire image):
- Total count and ratio of each cell type
- Area ratios between regions
- Density ratios of cell types between regions
Region Features (within each density-based region):
- Total cells and area per region
- Density of each cell type
- Ratio of each cell type
- Cell count ratios between types
- Area ratio with previous region
Spatial Features (nearest neighbor analysis):
- Distance statistics: min, p25, median, p75, max, mean, std, skew, kurt, cv
- Neighbor count statistics: total, mean, max, density
- Calculated for each cell type pair within each region

4. Survival Analysis

Run Complete Pipeline

cd survival_analysis
python main.py \
    --work_dir path/to/work_dir \
    --modal path \
    --feature_selection_method all \
    --model_type cox \
    --do_train \
    --do_test

Available Modalities

clinical - Clinical data only
path - Pathology features only
wes - Whole exome sequencing data
tcr - TCR diversity data
all - All modalities combined

Available Models

cox - Cox proportional hazards model
rsf - Random survival forest
gb - Gradient boosting survival model

Feature Selection Methods

all - Use all features
univariate - Univariate Cox regression
auc - AUC-based selection
lasso_cox - Lasso Cox regression

Configuration

Path Configuration

All paths in the code use placeholder format path/to/.... Before running:

Update paths in each module's main script
Example paths to configure:
- Data directories (images, features, clinical data)
- Model save/load paths
- Output directories

Key Parameters

Cell Classification

feature_type: traditional_feature, deep_feature, or all
use_class_weight: Enable class weighting
use_smote: Enable SMOTE oversampling

Tissue Classification

input_size: Feature dimension (default: 1536)
num_classes: Number of tissue classes (default: 7)
device: Run device (auto/cuda/cpu)

Feature Calculation

grid_size: KDE grid size (default: 256)
distance_threshold: Spatial analysis threshold (default: 256.0)
min_tumor_cells: Minimum tumor cells for analysis (default: 20)

Survival Analysis

See survival analysis module documentation for detailed parameters.

Cell Types

The project recognizes the following cell types:

ID	Type	Color
0	background	[0, 0, 0]
1	tumor_cell	[211, 47, 47]
2	lymphocyte	[25, 118, 210]
3	plasma_cell	[142, 36, 170]
4	neutrophil	[255, 160, 0]
5	eosinophil	[245, 124, 0]
6	interstitial_spindle_cell	[56, 142, 60]

Features

Region-Based Features

Density features: Cell density per region
Ratio features: Proportions of different cell types
Spatial features: Nearest neighbor distances and counts
Area features: Region area calculations

Spatial Analysis

GPU-accelerated nearest neighbor distance calculation
Multi-type cell interaction analysis
Configurable distance thresholds

Output

Cell Classification

model.pkl: Trained classification model
cells.json: Cell annotations with predicted types
cells.geojson: GeoJSON format for visualization
confusion_matrix.png: Confusion matrix visualization
roc_curves.png: ROC curves for each class

Tissue Classification

predictions/: Directory containing predicted tissue labels in H5 format

Feature Calculation

features/*.csv: Region-based features for each image
all_features.csv: Merged features from all images

Survival Analysis

cv_results.csv: Cross-validation results
km_curve_*.png: Kaplan-Meier survival curves
shap_plot_*.png: SHAP value visualizations
confusion_matrix_*.png: Feature confusion matrices
feature_importance_*.csv: Feature importance rankings

Performance Optimization

Parallel processing: Multi-process batch processing
GPU acceleration: CUDA for spatial analysis
Caching: Feature caching to avoid recomputation
Chunk processing: Memory-efficient large-scale processing

Troubleshooting

Common Issues

Out of Memory
- Reduce batch size
- Use chunk processing
- Close unnecessary applications
CUDA Errors
- Check GPU availability
- Reduce model complexity
- Use CPU fallback
Path Not Found
- Verify all paths are configured
- Check file permissions
- Ensure data directories exist

Citation

If you use this project in your research, please cite:

@software{gao2026imagene,
  title = {IMAGENE-ESCC-Survival: A Multimodal Deep Learning Framework for Postoperative Overall Survival Prediction in Esophageal Squamous Cell Carcinoma},
  author = {Gao, Shuaiqiang},
  year = {2026},
  url = {https://github.com/OpenGene/IMAGENE-ESCC-Survival},
  version = {1.0},
  note = {Comprehensive analysis framework integrating cell classification, tissue classification, and survival analysis for esophageal cancer pathology images}
}

License

Please refer to the project license file for usage terms.

Contact

For questions or issues, please contact the project maintainers.

Acknowledgments

This project uses the following open-source libraries and tools:

PyTorch
scikit-learn
lifelines
OpenCV
optuna
SHAP

References

[1] Hörst F, Rempe M, Heine L, et al. CellViT: Vision Transformers for precise cell segmentation and classification[J]. Medical Image Analysis, 2024, 94: 103143. https://doi.org/10.1016/j.media.2024.103143.

[2] MMPreTrain Contributors. OpenMMLab's Pre-training Toolbox and Benchmark[EB/OL]. 2023. https://github.com/open-mmlab/mmpretrain.

[3] Lu MY, Williamson DFK, Chen TY, et al. Data-efficient and weakly supervised computational pathology on whole-slide images[J]. Nature Biomedical Engineering, 2021, 5(6): 555-570. https://doi.org/10.1038/s41551-021-00755-2.

[4] Venkatachalapathy S, Jokhun DS, Shivashankar GV. Multivariate analysis reveals activation-primed fibroblast geometric states in engineered 3D tumor microenvironments[J]. Molecular Biology of the Cell, 2020, 31(8): 803-812. https://doi.org/10.1091/mbc.E19-08-0479.

[5] Xu H, Usuyama N, Bagga J, et al. A whole-slide foundation model for digital pathology from real-world data[J]. Nature, 2024. https://doi.org/10.1038/s41586-024-07241-0.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
calculate_features		calculate_features
cell_classfication		cell_classfication
survival_analysis		survival_analysis
tissue_classfication		tissue_classfication
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

IMAGENE-ESCC-Survival

Project Overview

Project Structure

Installation

Requirements

Usage

1. Cell Classification

Preparation of Training Data

Training

Testing

Prediction (Single Image)

Prediction (Batch)

2. Tissue Classification

Preparation of Training Data

Training with Swin Transformer

Prediction

3. Feature Calculation

Preparation of Training Data

Calculate Region Features

4. Survival Analysis

Run Complete Pipeline

Available Modalities

Available Models

Feature Selection Methods

Configuration

Path Configuration

Key Parameters

Cell Classification

Tissue Classification

Feature Calculation

Survival Analysis

Cell Types

Features

Region-Based Features

Spatial Analysis

Output

Cell Classification

Tissue Classification

Feature Calculation

Survival Analysis

Performance Optimization

Troubleshooting

Common Issues

Citation

License

Contact

Acknowledgments

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages