MotionBench is a real-time pose-based exercise recognition project designed for practical local usage. It classifies exercise motion from short temporal windows, estimates repetition counts with a deterministic finite-state method, and reports similarity against class-level motion centroids.
MotionBench is built to run the full workflow from start to finish. You can prepare sequence data, train models, benchmark inference, and run real-time prediction from a webcam.
The runtime pipeline is simple. It captures frames, extracts pose-based features, builds rolling windows, and predicts one of six exercise classes. It also estimates repetitions with a deterministic finite-state method and reports a centroid similarity score for live feedback.
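The two deterministic pieces of the pipeline, repetition counting and centroid similarity, can be illustrated with a minimal sketch. This is not the project's implementation: the thresholds, the scalar motion signal, the `RepCounter` class, and the use of cosine similarity are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RepCounter:
    """Two-state machine with hysteresis: down -> up -> down counts one rep."""
    low: float = 0.3    # hypothetical threshold: at or below this means "down"
    high: float = 0.7   # hypothetical threshold: at or above this means "up"
    state: str = "down"
    count: int = 0

    def update(self, value: float) -> int:
        if self.state == "down" and value >= self.high:
            self.state = "up"
        elif self.state == "up" and value <= self.low:
            self.state = "down"
            self.count += 1  # returning to "down" completes one repetition
        return self.count


def centroid(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity, assumed here as the similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical normalized motion signal, one value per frame.
signal = [0.1, 0.5, 0.8, 0.9, 0.4, 0.2, 0.6, 0.75, 0.3, 0.1]
counter = RepCounter()
for value in signal:
    counter.update(value)
rep_count = counter.count  # two full up-and-back cycles in this signal

# Hypothetical class centroid built from two training feature vectors.
class_centroid = centroid([[1.0, 0.0], [0.0, 1.0]])
similarity = cosine_similarity([1.0, 1.0], class_centroid)
```

The gap between the two thresholds keeps small jitter around a single cut-off from being counted as extra repetitions.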
Core work happens in data/, models/, scripts/, and results/. Older or non-essential files are moved to archive/ to keep the main repository clear and easy to review.
This repo stays lightweight on GitHub. Download the dataset files from Hugging Face and place them in data/.
git clone https://huggingface.co/datasets/johnamit/motionbench-data data
For local usage, keep split files under data/.
The workflow expects fixed sequence splits (train, val, test_internal) and optionally a separate home/generalization test split.
Expected split files:
- data/train_sequences.csv
- data/val_sequences.csv
- data/test_internal_sequences.csv
- data/test_home_sequences.csv (optional for home/generalization evaluation)
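Before training, it can help to confirm that the expected split files are in place. The sketch below is a hypothetical helper, not a script shipped with the repository. It only checks for the file names listed above and is demonstrated on a temporary directory rather than the real data/ folder.

```python
import tempfile
from pathlib import Path

REQUIRED = ("train_sequences.csv", "val_sequences.csv", "test_internal_sequences.csv")
OPTIONAL = ("test_home_sequences.csv",)


def check_splits(data_dir):
    """Return (missing_required, missing_optional) file names for data_dir."""
    root = Path(data_dir)
    missing_required = [name for name in REQUIRED if not (root / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (root / name).is_file()]
    return missing_required, missing_optional


# Demonstration on a throwaway directory with only two of the files present.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train_sequences.csv", "val_sequences.csv"):
        (Path(tmp) / name).touch()
    demo_required, demo_optional = check_splits(tmp)
```

Calling `check_splits("data")` from the repository root would report which of the listed files are still missing.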
To regenerate centralized fixed splits:
python scripts/preprocess/create_fixed_splits.py --input-file data/train_sequences_full.csv --output-dir data
Download trained model files from Hugging Face and place them in models/.
git clone https://huggingface.co/johnamit/motionbench-models models
This project includes six sequence models with different strengths. Some are strong on temporal memory, some are better for latency, and some are better at capturing structured feature relationships.
BiLSTM: The bidirectional LSTM processes each sequence in forward and backward directions within the input window, so the classifier can use context from both ends of the motion segment. This helps when important movement details are spread across the whole sequence, not just a single frame.
LSTM: Unidirectional LSTM reads movement step by step in time. It is a simple and reliable sequence model, so it works well as a strong baseline for exercise classification while keeping runtime reasonable.
GRU: The GRU uses gating similar to LSTM but with fewer internal components, which can reduce parameter count and improve efficiency. In practice, it is a strong candidate when you want robust sequence modeling with lighter recurrent overhead.
TCN: The temporal convolutional network uses dilated 1D convolutions and residual blocks to learn patterns over short and long time ranges. Because convolutional operations are parallelizable, it is often fast at inference, which makes it a good option when responsiveness matters.
CNN-BiLSTM: This hybrid architecture first applies temporal convolutions to capture short local motion patterns, then a BiLSTM models how those patterns evolve over time. This gives both local detail and sequence context.
ST-GCN-inspired (feature-graph variant): This ST-GCN-style model treats features as connected nodes and learns both their relationships and how they change over time. It can help when interactions between pose features are important for classification.
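The note that a GRU has fewer internal components than an LSTM can be made concrete by counting parameters for a single recurrent layer. The sketch below uses the standard formulation, in which an LSTM has four gate-like blocks and a GRU has three. The input and hidden sizes are hypothetical and are not the configurations used by the models in this repository.

```python
def recurrent_params(gate_blocks, hidden_size, input_size, bias_vectors=1):
    """Parameter count for one unidirectional recurrent layer.

    Each gate block has an input weight matrix (hidden x input), a recurrent
    weight matrix (hidden x hidden), and bias_vectors bias vectors of length
    hidden. PyTorch's nn.LSTM and nn.GRU keep two bias vectors per block.
    """
    per_block = hidden_size * (input_size + hidden_size) + bias_vectors * hidden_size
    return gate_blocks * per_block


# Hypothetical sizes chosen only for illustration.
INPUT_SIZE, HIDDEN_SIZE = 32, 64

lstm_params = recurrent_params(4, HIDDEN_SIZE, INPUT_SIZE)  # input, forget, cell, output
gru_params = recurrent_params(3, HIDDEN_SIZE, INPUT_SIZE)   # reset, update, candidate
ratio = gru_params / lstm_params
```

At equal input and hidden sizes, the GRU layer carries three quarters of the LSTM layer's parameters. Total model size also depends on layer count, hidden width, and the classifier head.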
Train each model from the shared sequence splits in data/.
python models/bilstm/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/bilstm/results
python models/lstm/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/lstm/results
python models/gru/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/gru/results
python models/tcn/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/tcn/results
python models/cnn_bilstm/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/cnn_bilstm/results
python models/st_gcn/train.py --train-file data/train_sequences.csv --val-file data/val_sequences.csv --test-file data/test_internal_sequences.csv --output-dir models/st_gcn/results
If centroid assets are missing or if models were retrained, rebuild similarity assets:
python scripts/preprocess/build_similarity_assets.py --train-file data/train_sequences.csv --models-root models
Run offline evaluation on home/generalization test data:
python scripts/evaluate/evaluate_home_set.py --test-file data/test_home_sequences.csv --models-root models --output-dir results/eval_offline_home
Run inference benchmarking:
python scripts/benchmark/benchmark_inference.py --input-file data/test_home_sequences.csv --models-root models --output-dir results/benchmark_inference
Run realtime webcam evaluation (CLI):
python scripts/realtime_eval/evaluate_realtime_webcam.py --model-name bilstm --models-root models --output-dir results/eval_realtime
Use this when you want camera input from /dev/video0 inside Docker. This works on Linux.
docker build -t motionbench-space .
docker run --rm -p 7860:7860 --device=/dev/video0:/dev/video0 motionbench-space
Use this to run directly on your machine (recommended for Windows/macOS webcam support).
First, clone model weights from Hugging Face and replace the local models/ folder with the cloned models/ folder:
git clone https://huggingface.co/johnamit/motionbench-models
Then create the environment:
conda create -n motionbench python=3.11 -y
conda activate motionbench
pip install -r requirements.txt
Then run the app:
streamlit run scripts/app/motionbench.py
A live version of the Streamlit app is hosted on Hugging Face Spaces via Docker. However, it still has bugs and is not yet complete.
The demo videos show the app running locally via Streamlit (Option 2).
- Higher is better: Accuracy, F1 (Macro), F1 (Weighted), Precision, Recall
- Lower is better: Mean Latency (ms), P95 Latency (ms), Peak Memory (MB), Model Size (MB)
This test evaluates how well each model recognises exercises on real-life home exercise videos it did not see during training.
It reports classification performance using Accuracy, Macro F1 (class balance sensitivity), and Weighted F1/Precision/Recall (overall performance weighted by class frequency).
| Model | Accuracy | F1 (Macro) | F1 (Weighted) | Precision (Weighted) | Recall (Weighted) |
|---|---|---|---|---|---|
| gru | 0.9665 | 0.9672 | 0.9663 | 0.9675 | 0.9665 |
| bilstm | 0.9553 | 0.9575 | 0.9552 | 0.9582 | 0.9553 |
| cnn_bilstm | 0.9553 | 0.9585 | 0.9552 | 0.9576 | 0.9553 |
| lstm | 0.9497 | 0.9521 | 0.9500 | 0.9518 | 0.9497 |
| tcn | 0.9497 | 0.9529 | 0.9495 | 0.9528 | 0.9497 |
| st_gcn | 0.8715 | 0.8801 | 0.8679 | 0.8937 | 0.8715 |
GRU performs best on unseen home videos (highest Accuracy and Weighted F1), with BiLSTM and CNN-BiLSTM close behind.
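The difference between the macro and weighted F1 columns can be shown with a small pure-Python sketch. It follows the usual definitions: macro F1 is the unweighted mean of per-class F1, and weighted F1 weights each class by its number of true samples. The labels and predictions are made up for illustration and are not taken from the evaluation set.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for every label that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Every class counts equally, regardless of how many samples it has."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Each class is weighted by its support (number of true samples)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced example: four "squat" windows, two "lunge" windows.
y_true = ["squat", "squat", "squat", "squat", "lunge", "lunge"]
y_pred = ["squat", "squat", "squat", "squat", "squat", "lunge"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this example the minority class has the lower F1, so macro F1 (7/9) falls below weighted F1 (22/27). A gap in that direction suggests that a less frequent class is being recognized less reliably.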
This test measures how fast each model runs on a CPU, which is useful for laptops, edge devices and low-cost deployment.
It reports mean latency and P95 latency (tail behavior: 95% of runs finish at or below this value), plus model size in MB.
| Model | Device | Model Size (MB) | Mean Latency (ms) | P95 Latency (ms) |
|---|---|---|---|---|
| lstm | cpu | 0.778 | 0.226 | 0.253 |
| bilstm | cpu | 0.839 | 0.359 | 0.406 |
| cnn_bilstm | cpu | 1.066 | 0.423 | 0.464 |
| gru | cpu | 0.412 | 0.555 | 0.589 |
| tcn | cpu | 1.175 | 0.820 | 0.850 |
| st_gcn | cpu | 0.984 | 2.793 | 2.870 |
LSTM is the fastest on CPU, while GRU has the smallest model size.
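Mean and P95 latency figures like those above can be collected with a simple timing loop. The sketch below is an assumption about how such a benchmark might be structured, not the logic of `benchmark_inference.py`. The `time_calls` and `summarize` helpers, the run counts, and the nearest-rank percentile are all illustrative choices.

```python
import statistics
import time


def time_calls(fn, runs=200, warmup=20):
    """Return per-call latencies in milliseconds for fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not recorded
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Mean latency and nearest-rank 95th percentile."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {"mean_ms": statistics.fmean(ordered), "p95_ms": ordered[rank - 1]}


# Deterministic check on synthetic latencies of 1 ms through 100 ms.
synthetic_stats = summarize([float(i) for i in range(1, 101)])

# Timing a placeholder workload; a real benchmark would call the model here.
measured_stats = summarize(time_calls(lambda: sum(range(1000)), runs=50, warmup=5))
```

On a GPU, kernels run asynchronously, so the device would need to be synchronized before each clock reading for the measured interval to cover the whole forward pass.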
This test measures how fast each model runs on a GPU (RTX 3090), where lower latency results in smoother live predictions.
It reports mean latency and P95 latency, model size and peak GPU memory usage, which helps choose a model that is both fast and resource-efficient.
| Model | Device | Model Size (MB) | Mean Latency (ms) | P95 Latency (ms) | Peak Memory (MB) |
|---|---|---|---|---|---|
| gru | cuda | 0.412 | 0.127 | 0.135 | 10.146 |
| bilstm | cuda | 0.839 | 0.189 | 0.208 | 43.994 |
| lstm | cuda | 0.778 | 0.221 | 0.229 | 10.992 |
| cnn_bilstm | cuda | 1.066 | 0.248 | 0.256 | 44.249 |
| tcn | cuda | 1.175 | 0.360 | 0.377 | 10.554 |
| st_gcn | cuda | 0.984 | 0.533 | 0.560 | 18.131 |
GRU is fastest on GPU and also has the lowest peak memory in this table (10.146 MB), making it the best real-time deployment candidate in this benchmark.
Bidirectional Long Short-Term Memory (BiLSTM)
@article{riccio2024real,
title={Real-time fitness exercise classification and counting from video frames},
author={Riccio, Riccardo},
journal={arXiv preprint arXiv:2411.11548},
year={2024}
}
Gated Recurrent Unit (GRU)
@article{chung2014empirical,
title={Empirical evaluation of gated recurrent neural networks on sequence modeling},
author={Chung, Junyoung and Gulcehre, Caglar and Cho, KyungHyun and Bengio, Yoshua},
journal={arXiv preprint arXiv:1412.3555},
year={2014}
}
Temporal Convolutional Network (TCN)
@inproceedings{lea2017temporal,
title={Temporal convolutional networks for action segmentation and detection},
author={Lea, Colin and Flynn, Michael D and Vidal, Rene and Reiter, Austin and Hager, Gregory D},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={156--165},
year={2017}
}
Spatial Temporal Graph Convolutional Network
@inproceedings{yan2018spatial,
title={Spatial temporal graph convolutional networks for skeleton-based action recognition},
author={Yan, Sijie and Xiong, Yuanjun and Lin, Dahua},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={32},
number={1},
year={2018}
}
CNN BiLSTM Hybrid
@online{dhomane2024cnnbilstm,
author = {Shreyas Dhomane},
title = {CNN + BiLSTM Architecture: A Practical Guide},
year = {2024},
month = oct,
day = {23},
url = {https://medium.com/@shreyas.dhomane22/cnn-bilstm-architecture-a-practical-guide-c81829022820},
note = {Medium article. Accessed: 2026-04-22}
}
This project is released under the MIT License.
