DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors [WACV'26 Best Paper Award Finalist]
The official repository of the paper, with supplementary material: DexAvatar
conda create -n dexavatar -y python=3.10
conda activate dexavatar
bash scripts/env_install.sh
bash scripts/bug_fix_dexavatar.sh
conda deactivate
Download the SGNify frames from this link and place them in the ./data folder. For evaluation, also download the smplxgt files.
The folder structure should be as follows:
data/
└── images_sgnify/
    ├── sign1/
    │   └── images/
    │       ├── Img1.png
    │       ├── Img2.png
    │       └── ...
    ├── sign2/
    │   └── images/
    │       ├── Img1.png
    │       ├── Img2.png
    │       └── ...
    └── ...
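Before running anything else, it can help to confirm that the frames landed in the expected layout. The following is a minimal sketch, not part of the repository: the helper name `summarize_signs` is hypothetical, and it assumes frames are `.png` files stored in an `images/` subfolder of each sign folder, as in the tree above.

```python
from pathlib import Path


def summarize_signs(root):
    """Map each sign folder under `root` to the number of PNG frames in its images/ subfolder."""
    counts = {}
    for sign_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        images_dir = sign_dir / "images"
        # A sign folder without an images/ subfolder is reported with zero frames.
        counts[sign_dir.name] = len(list(images_dir.glob("*.png"))) if images_dir.is_dir() else 0
    return counts


if __name__ == "__main__":
    root = Path("data/images_sgnify")  # location taken from the tree above
    if root.is_dir():
        for name, n_frames in summarize_signs(root).items():
            print(f"{name}: {n_frames} frame(s)")
    else:
        print(f"{root} not found")
```

A sign folder that reports zero frames usually means the `images/` level is missing or the frames use a different extension.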
The sign segmentations and the corresponding classes for each sign are already provided in the ./data folder for the SGNify dataset. If you want to use your own sign segmentations and classes, please generate them with the previous work at this link.
For Sapiens
Install Sapiens Lite from the original Sapiens GitHub repository. Please create a new environment called sapiens_lite by following their instructions. Then download the RTMPose checkpoint from Google Drive and place the checkpoints in the following directory structure.
sapiens/
└── lite/
    └── torchscript/
        ├── detector/
        │   └── checkpoints/
        │       └── rtmpose/
        │           └── rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth
        └── pose/
            └── checkpoints/
                └── sapiens_1b/
                    └── sapiens_1b_coco_wholebody_best_coco_wholebody_AP_727_torchscript.pt2
For SMPLer-X
conda create -n smpler_x python=3.8 -y
conda activate smpler_x
pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install mmcv-full==1.7.1 -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.12.0/index.html
pip install -r preprocess/SMPLer-X/requirements.txt
cd preprocess/SMPLer-X/main/transformer_utils
pip install -v -e .
cd ../../../../
pip install setuptools==69.5.1 yapf==0.40.1 numpy==1.23.5
bash scripts/bug_fix_dexavatar.sh
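The pinned versions in the last pip command are easy to overwrite when other packages are installed later. The following is a minimal sketch, not part of the repository, that compares the installed versions against those pins from inside the smpler_x environment. The helper name `find_mismatches` is hypothetical; the pins are copied from the command above.

```python
from importlib import metadata

# Pins copied from the `pip install setuptools==... yapf==... numpy==...` command above.
PINS = {"setuptools": "69.5.1", "yapf": "0.40.1", "numpy": "1.23.5"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

If the script prints nothing, all three pins match the active environment.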
Please download the following checkpoints and SMPL-X files from Google Drive and place them in the following directory structure.
DexAvatar/
├── checkpoints/
│   ├── smpler_x_h32.pth.tar
│   └── mmdet/
│       ├── faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
│       └── mmdet_faster_rcnn_r50_fpn_coco.py
└── SMPLer-X/
    └── common/
        └── utils/
            └── human_model_files/
Please download SignBPoser and SignHPoser from Google Drive and place them in the following directory structure.
dexavatar_fitting/
└── smplifyx/
    ├── signbposer/
    └── signhposer/
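A missing checkpoint typically surfaces only once fitting starts, so a quick existence check can save time. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the example list assumes the script is run from the DexAvatar root and that the two prior folders sit inside `smplifyx/` as listed above. Adjust the paths if your layout differs.

```python
from pathlib import Path


def missing_paths(root, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).exists()]


# Assumed layout, read off the directory trees in this README.
EXPECTED = [
    "checkpoints/smpler_x_h32.pth.tar",
    "checkpoints/mmdet/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth",
    "checkpoints/mmdet/mmdet_faster_rcnn_r50_fpn_coco.py",
    "dexavatar_fitting/smplifyx/signbposer",
    "dexavatar_fitting/smplifyx/signhposer",
]

if __name__ == "__main__":
    for rel in missing_paths(".", EXPECTED):
        print(f"missing: {rel}")
```

The Sapiens checkpoints can be checked the same way by passing the folder that contains `sapiens/` as `root`.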
Run Fitting Code
Run the following command to execute the code:
python run_dexavatar.py --input_img_folder DATA_PATH --output_path OUTPUT_FOLDER --fitting_experiment ./dexavatar_fitting
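When launching the fitting from another script, it can be convenient to assemble the same command programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_command` is hypothetical, and the example paths (`data/images_sgnify`, `outputs`) are placeholders. The README does not state whether DATA_PATH should point to the whole frames folder or to a single sign folder, so treat the input path as an assumption.

```python
import shlex


def build_command(input_img_folder, output_path, fitting_experiment="./dexavatar_fitting"):
    """Assemble the run_dexavatar.py invocation shown above as an argument list."""
    return [
        "python", "run_dexavatar.py",
        "--input_img_folder", str(input_img_folder),
        "--output_path", str(output_path),
        "--fitting_experiment", str(fitting_experiment),
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own DATA_PATH and OUTPUT_FOLDER.
    cmd = build_command("data/images_sgnify", "outputs")
    # Print the command for inspection. To execute it, pass `cmd` to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting problems when a path contains spaces.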
This project is carried out at the Human-Centered AI Lab in the Faculty of Information Technology, Monash University, Melbourne (Clayton), Australia.
Project Members -
Kaustubh Kundu (Monash University, Melbourne, Australia),
Hrishav Bakul Barua (Monash University and TCS Research, Kolkata, India),
Lucy Robertson-Bell (Monash University, Melbourne, Australia),
Zhixi Cai (Monash University, Melbourne, Australia), and
Kalin Stefanov (Monash University, Melbourne, Australia)
This work is supported by a Discovery Early Career Researcher Award (DECRA) fellowship from the Australian Research Council (ARC) [Grant no. DE230100049 | Project: Towards automated Australian Sign Language translation]. We also acknowledge Monash University (M3 Cluster) and the National Computational Infrastructure (NCI) for providing the High Performance Computing (HPC) resources used to carry out the experiments.
The trend in sign language generation is centered around data-driven generative methods. These methods require vast amounts of precise 2D and 3D human pose data to achieve a generation quality acceptable to the Deaf community. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. However, manual production of accurate 2D and 3D human pose information from videos is a labor-intensive process. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art.
Demo videos: General.mp4, blur.mp4, occlusion.mp4, noise.mp4
If you find our work (i.e., the code, the theory/concept, or the dataset) useful for your research or development activities, please consider citing our work as follows:
@InProceedings{Kundu_2026_WACV,
author = {Kundu, Kaustubh and Barua, Hrishav Bakul and Robertson-Bell, Lucy and Cai, Zhixi and Stefanov, Kalin},
title = {DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2026},
pages = {5842-5852}
}
----------------------------------------------------------------------------------------
Copyright 2025 | All the authors and contributors of this repository as mentioned above.
----------------------------------------------------------------------------------------
Please check the License Agreement.