Choose and rank perception models for robot learning — on real-world RGB-D data, under embodied deployment conditions, with ESD-stratified difficulty splits and deployment-readiness scoring.
Quickstart · Benchmark Toolkit · Docs · Data Capture · Mask Pipeline · Paper
RPX is a unified real-world RGB-D benchmark for evaluating the perception models actually deployed inside robot learning stacks (not generic perception leaderboards). This repository contains everything behind the benchmark:
- 📷 the data-collection rig scripts (Intel D435 RGB-D + T265 VIO),
- 🎨 the ground-truth mask generator (SAM2 + GroundingDINO),
- 🧰 the benchmark toolkit (`pip install rpx-benchmark`), and
- 📄 the NeurIPS 2026 Datasets & Benchmarks paper draft.
| Highlight | Details |
|---|---|
| 📷 Sensor rig | Intel RealSense D435 (RGB-D) + T265 (6-DoF VIO), pose logged at 200 Hz |
| 🎬 3-phase capture protocol | Clutter → Interaction (human-in-scene) → Clean on identical scenes |
| 🧪 ~75 K frames | 100 indoor scenes, tabletop + room-scale, ~70 object categories |
| 🎯 10 benchmark tasks | depth, segmentation, detection (×2), grounding, pose, keypoints, sparse depth, NVS, tracking |
| 🪜 ESD difficulty splits | Easy / Medium / Hard derived from real annotation effort, per (scene, phase) |
| 🔌 Bring-your-own-model | HF checkpoint · numpy callable · custom adapter — pick one, run in one command |
| 📊 Deployment-readiness scoring | ESD-weighted phase score, state-transition robustness, temporal stability, FLOPs, latency |
| 🧰 Full CI | pytest matrix 3.10 / 3.11 / 3.12 + ruff + auto docs deploy to GitHub Pages |
| 📚 Auto docs | MkDocs + mkdocstrings reads numpydoc; adding a class = zero doc work |
| ⚖️ License | Code MIT · Dataset CC BY 4.0 |
```
RPX/
├── benchmark/           Python package + CLI: load, run, metric, report
│   ├── rpx_benchmark/   Library source
│   ├── docs/            MkDocs site (auto-generated from docstrings)
│   ├── tests/           138-test offline suite
│   └── README.md        Toolkit-specific README  ←★ start here for users
│
├── dc/                  Data-collection rig scripts (D435 + T265)
│   └── README.md        Run `save_device_data.py` to capture a scene
│
├── robokit/             Ground-truth mask pipeline (SAM2 + GroundingDINO)
│   └── README.md        Interactive GSAM2 refinement UI
│
├── docker/              Dockerised reproducible environment
│   └── README.md        Build, start, stop, exec helpers
│
├── paper-submission/    LaTeX source for the NeurIPS 2026 paper
│
├── external/            Third-party submodules (rerun visualiser, ...)
│
├── scripts/             Auxiliary helpers
│
├── .github/workflows/   CI: pytest matrix + ruff + MkDocs Pages deploy
│
└── README.md            This file
```
Each subdirectory has its own README with the full details for that piece of the system.
The fastest way to try RPX is evaluation mode: install the `rpx-benchmark` package and run any HuggingFace depth, segmentation, detection, grounding, pose, keypoint, sparse-depth, or NVS model against an ESD difficulty split. No data capture or annotation needed.
```shell
pip install 'rpx-benchmark[depth]'

# Run any HuggingFace depth model
rpx bench monocular_depth \
    --hf-checkpoint depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf \
    --split hard

# Run any HuggingFace instance-segmentation model
rpx bench object_segmentation \
    --hf-checkpoint facebook/mask2former-swin-tiny-coco-instance \
    --split hard
```

Full toolkit docs: benchmark/README.md and the hosted site at https://irvlutd.github.io/RPX/.
## 📷 1. Data collection rig — `dc/`
Two-sensor capture with Intel RealSense D435 (RGB-D) + T265 (6-DoF VIO). Captures each scene under the three-phase protocol (Clutter → Interaction → Clean) at synchronised FPS with pose logged from T265.
```shell
cd dc
python save_device_data.py <task-name> <fps> <sync-threshold>
```

> **Note — T265 compatibility:** T265 support was removed in librealsense2 > 2.47.0. This project pins to `librealsense2==2.47.0` and `pyrealsense2==2.47.0.3313`. See *Data capture prerequisites* below for install instructions.
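The `<sync-threshold>` idea (keeping only RGB/depth frame pairs whose timestamps fall within a tolerance) can be sketched as follows. This is a toy illustration; the function name and matching strategy are assumptions, not the rig script's actual code:

```python
def sync_streams(rgb_ts, depth_ts, threshold_s=0.005):
    """Pair RGB and depth frames whose timestamps differ by < threshold_s.

    rgb_ts / depth_ts are monotonically increasing timestamp lists (seconds).
    Returns (rgb_index, depth_index) pairs; unmatched RGB frames are dropped.
    """
    pairs = []
    j = 0
    for i, t in enumerate(rgb_ts):
        # Advance the depth pointer while the next candidate is at least as close.
        while j + 1 < len(depth_ts) and abs(depth_ts[j + 1] - t) <= abs(depth_ts[j] - t):
            j += 1
        if abs(depth_ts[j] - t) < threshold_s:
            pairs.append((i, j))
    return pairs

# Depth stream lagging RGB by 1 ms: every frame matches within a 5 ms threshold.
print(sync_streams([0.0, 0.033, 0.066], [0.001, 0.034, 0.067]))
# → [(0, 0), (1, 1), (2, 2)]
```

A two-pointer sweep like this is O(n) and avoids pairing one depth frame with two RGB frames, which is the usual failure mode of naive nearest-neighbour matching.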
## 🎨 2. Mask generation — `robokit/`
Ground-truth instance masks are generated via an interactive pipeline combining GroundingDINO (open-vocabulary detection) and SAM2 (segment anything v2). A human operator curates the bounding box set for one keyframe per phase; SAM2 propagates masks across the rest of the phase.
```shell
cd robokit
python -m maskgen_pipeline.interactive_gsam2 --scene_dir /path/to/scene/1
```

## 🧰 3. Benchmark toolkit — `benchmark/`
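The propagation idea (carry a curated keyframe mask forward by matching each frame's masks to the previous frame) can be illustrated with a toy IoU-based linker. This sketch is not SAM2's internals; the names and the 0.5 threshold are illustrative:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_masks(prev_masks, cur_masks, min_iou=0.5):
    """Greedily match each previous-frame mask to its best current-frame mask."""
    links = {}
    for i, p in enumerate(prev_masks):
        scores = [iou(p, c) for c in cur_masks]
        j = int(np.argmax(scores)) if scores else -1
        if j >= 0 and scores[j] >= min_iou:
            links[i] = j
    return links

prev = [np.zeros((4, 4), bool)]
prev[0][0:3, 0:3] = True          # object occupies a 3x3 block
cur = [np.zeros((4, 4), bool)]
cur[0][0:3, 1:3] = True           # same object, partially occluded next frame
print(link_masks(prev, cur))      # → {0: 0}
```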
Python library + CLI. The user supplies a model; the toolkit handles:
- Dataset download — task-aware, fetches only the modalities your task needs; reuses the HuggingFace content-addressed cache.
- Splits — Easy / Medium / Hard per `(scene, phase)` via Effort-Stratified Difficulty.
- Metrics — pluggable per-task calculators (AbsRel, RMSE, δ-acc, mIoU, F1, MOTA, PSNR, SSIM, geodesic pose error, keypoint accuracy, ...).
- Deployment-readiness scoring — ESD-weighted phase score, State-Transition Robustness, Temporal Stability, FLOPs, median latency.
- Reports — JSON + markdown + rich terminal UI with true-colour gradient banner.
- Extensibility — add a new task, metric, or model adapter in one file via the plugin registries.
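As a concrete picture of what a per-task metric calculator computes, here is a self-contained sketch of the standard monocular-depth metrics named above (AbsRel, RMSE, δ-accuracy). The toolkit's own calculators may differ in masking and clamping details:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, RMSE and δ<1.25 accuracy over valid (gt > 0) pixels,
    following the usual monocular-depth conventions."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)          # mean relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))         # root mean squared error
    delta = np.maximum(p / g, g / p)              # symmetric ratio per pixel
    d1 = np.mean(delta < 1.25)                    # fraction within 1.25x
    return {"abs_rel": float(abs_rel), "rmse": float(rmse), "delta1": float(d1)}

gt = np.array([[1.0, 2.0], [4.0, 0.0]])           # 0 marks invalid depth
pred = np.array([[1.1, 2.0], [4.0, 3.0]])         # last pixel ignored (invalid gt)
print(depth_metrics(pred, gt))
```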
```shell
pip install 'rpx-benchmark[depth]'
rpx models        # list registered adapters
rpx bench --help  # list task subcommands (9 runnable)
```

138 tests, 0 network deps, runs in under a second on CI.
9 of 10 tasks are runnable end-to-end; only object_tracking is
deferred pending a sequence-per-sample protocol decision.
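The plugin-registry pattern behind `rpx models` can be sketched in a few lines: a decorator adds each calculator to a module-level dict at import time, which is why a new metric can land in a single file. The names here are illustrative, not the actual `rpx_benchmark` API:

```python
import numpy as np

# Module-level registry; populated as plugin files are imported.
METRICS: dict = {}

def register_metric(name):
    """Decorator that registers a metric calculator under `name`."""
    def deco(fn):
        METRICS[name] = fn
        return fn
    return deco

@register_metric("miou")
def mean_iou(pred, gt):
    """Toy single-mask IoU, standing in for a real mIoU calculator."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

print(sorted(METRICS))   # → ['miou']
```

A CLI like `rpx models` then only needs to import the plugin modules and print the registry's keys.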
Capturing new scenes requires the RealSense SDK built from source against v2.47.0 (T265 support was removed afterwards).
```shell
sudo apt update
sudo apt install \
    libssl-dev libusb-1.0-0-dev libudev-dev pkg-config libgtk-3-dev \
    git wget cmake build-essential libglfw3-dev libgl1-mesa-dev \
    libglu1-mesa-dev at
```

```shell
git clone -b v2.47.0 https://github.com/IntelRealSense/librealsense.git
cd librealsense && ./scripts/setup_udev_rules.sh
mkdir build && cd build
cmake ../ -DBUILD_EXAMPLES=true -DCMAKE_BUILD_TYPE=Release
sudo make uninstall && make clean && make -j12 && sudo make install
```

> **Tip:** `-j12` uses 12 cores. Leave at least 2 cores free so the system stays responsive.

Verify the install:

```shell
realsense-viewer   # launches the GUI; connect the cameras
```

```shell
pip install pyrealsense2==2.47.0.3313
```

> **Warning — disconnect devices first:** unplug all RealSense devices before running `make install`; live devices can lock the udev rules mid-install.
For a fully reproducible environment (useful for CI, GPU setup,
multi-machine replays) — see docker/ for the
full details.
```shell
cd docker
./build_docker_image.sh   # one-time, a few minutes
./start_docker.sh         # detached
./start_docker.sh -i      # interactive
./enter_docker.sh         # shell into the running container
./stop_docker.sh          # stop
```

The container ships with the RealSense SDK, the benchmark toolkit, and the robokit mask pipeline already installed.
The NeurIPS 2026 Datasets & Benchmarks submission lives under
paper-submission/neurips-2026/.
The full model slate rationale, ESD formulation, three-phase
protocol details, and experiment tables are in the paper.
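As a simplified reading of the ESD idea (stratify `(scene, phase)` pairs into Easy / Medium / Hard by annotation-effort quantiles), consider this sketch. The function name, tercile thresholds, and effort units are assumptions; the authoritative formulation is in the paper:

```python
import numpy as np

def esd_splits(effort, quantiles=(1 / 3, 2 / 3)):
    """Assign a difficulty label to each (scene, phase) key by where its
    annotation-effort score falls relative to the dataset-wide terciles."""
    lo, hi = np.quantile(list(effort.values()), quantiles)
    return {k: "easy" if v <= lo else "medium" if v <= hi else "hard"
            for k, v in effort.items()}

effort = {("scene01", "clutter"): 5.0,        # e.g. minutes of human curation
          ("scene01", "clean"): 1.0,
          ("scene02", "interaction"): 9.0}
print(esd_splits(effort))
```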
| Workflow | What it does | When it runs |
|---|---|---|
| `tests.yml` | 138-test pytest suite on Python 3.10 / 3.11 / 3.12 + ruff lint | push / PR touching `benchmark/**` |
| `docs.yml` | `mkdocs build` + deploy to GitHub Pages | push to `main` touching `benchmark/docs/**` or `benchmark/rpx_benchmark/**` |
- Push the repo to GitHub.
- Settings → Pages → Source = "GitHub Actions".
- Next push to `main` triggers the `docs` workflow and the site goes live at https://irvlutd.github.io/RPX/.
Until step 2 is done, the docs workflow will fail with
HttpError: Not Found at the deploy step — that's the Pages API
telling you Pages isn't enabled yet. Harmless before the one-time
setup; fatal to the docs site after.
Each subproject has its own contribution workflow:
- `benchmark/` — `pip install -e '.[dev,docs]'`, `pytest tests/`, `ruff check`. New tasks / metrics / model adapters land through the plugin registries (see `benchmark/docs/guides/`). Always use the editable (`-e`) install when developing — frozen wheel installs will silently show stale behaviour.
- `dc/` — changes to the capture rig need a real RealSense device for smoke testing.
- `robokit/` — mask-generation changes need access to the interactive annotation UI and a CUDA-capable box.
- `paper-submission/` — LaTeX edits through your usual Overleaf / local workflow.
If you use RPX (dataset, toolkit, or any part of this repository) in your work, please cite the accompanying NeurIPS 2026 Datasets & Benchmarks paper. The BibTeX entry will be added here once the camera-ready version is released.
- Code in this repository (benchmark toolkit, data-collection scripts, mask generator, docker setup): MIT.
- RPX dataset (once released): CC BY 4.0.
See LICENSE.