A production-style edge-cloud machine learning platform that bridges server-side model training with real-time inference on edge devices. Built with a microservices architecture using Docker, MQTT, and FastAPI.
- Overview
- Tech Stack
- Architecture
- Features
- Quick Start
- Demo
- API Reference
- MQTT Topics
- Configuration
- Project Structure
- Hardware Tested
- Performance
- Troubleshooting
- Extending the Platform
- Contributing
- License
This platform implements a complete ML lifecycle across distributed infrastructure:
- Server (Cloud) — Model training, versioned model registry, monitoring, and command dispatch
- Edge Devices — Autonomous inference agents that auto-pull models and report telemetry
The system is designed for and tested on NVIDIA Jetson Nano as the edge device and an Intel i7 laptop as the server node, communicating over a local network with optional Tailscale overlay for remote access.
| Layer | Technology | Purpose |
|---|---|---|
| Message Broker | Eclipse Mosquitto 2.x | Asynchronous pub/sub communication between server and edge |
| Model Registry | FastAPI + Uvicorn | REST API for model versioning, storage, and distribution |
| Metrics | Prometheus | Time-series metrics collection and alerting |
| Visualization | Grafana | Real-time monitoring dashboards |
| Orchestration | Docker Compose | Multi-container deployment and management |
| Edge Runtime | Python + NumPy | Lightweight inference agent for resource-constrained devices |
| Networking | Tailscale (optional) | Secure overlay network for remote edge access |
┌──────────────────────────────────────────────────────────────┐
│ SERVER (Laptop / VM) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Mosquitto │ │ Model │ │ Monitoring │ │
│ │ MQTT Broker │◄─┤ Registry │ │ Prometheus + Grafana│ │
│ │ :1883/:9001 │ │ (FastAPI) │ │ :9090 / :3000 │ │
│ └──────┬───────┘ │ :8000 │ └──────────────────────┘ │
│ │ └──────────────┘ │
│ │ │
└─────────┼──────────────────────────────────────────────────────┘
│ MQTT (pub/sub)
│
┌─────────┼──────────────────────────────────────────────────────┐
│ ▼ EDGE DEVICE (Jetson Nano) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Edge Agent │ │
│ │ • Subscribes to model updates via MQTT │ │
│ │ • Auto-downloads new models from registry │ │
│ │ • Runs continuous inference loop │ │
│ │ • Reports telemetry (CPU/GPU temp, memory, disk) │ │
│ │ • Responds to remote commands │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
Train Model ──> Upload to Registry ──> MQTT Notification ──> Edge Downloads
│
Grafana Dashboard <── Prometheus <── Registry <── Inference Results
- Versioned model storage — Upload, download, and list models with semantic versioning
- MQTT notifications — Automatically notify edge devices when new models are available
- Prometheus metrics — Track upload/download counts, inference results, latency histograms
- Device tracking — Monitor connected edge devices and their telemetry in real-time
- REST API — Full CRUD operations via FastAPI with auto-generated OpenAPI docs
- Auto model sync — Subscribes to MQTT and downloads new models without manual intervention
- Continuous inference — Runs inference loop with configurable intervals
- Remote commands: accepts `run_inference`, `report_status`, and `pull_model` from the server
- Device telemetry: reports CPU/GPU temperature, memory usage, and disk space
- Graceful reconnection — Automatically reconnects on MQTT broker disconnection
- Prometheus — Scrapes model registry metrics every 15 seconds
- Grafana — Pre-provisioned dashboard with 10 panels covering inference rates, model activity, and system health
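
Auto model sync depends on the agent deciding whether an announced version is newer than the one it has loaded. The following is a minimal sketch of that comparison, assuming versions are purely numeric and dot-separated (such as `1.0`); `parse_version` and `is_newer` are hypothetical helpers, not functions from this project.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def is_newer(candidate: str, current: Optional[str]) -> bool:
    """Return True when candidate should replace the currently loaded version."""
    if current is None:  # nothing loaded yet, so any model is an upgrade
        return True
    a, b = parse_version(candidate), parse_version(current)
    # Pad the shorter tuple with zeros so '1.0' and '1.0.0' compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b
```

Comparing integer tuples rather than raw strings matters once a version reaches two digits: as strings, `"1.10"` sorts before `"1.9"`.
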
- Docker and Docker Compose (server)
- Python 3.8+ (edge device)
- Network connectivity between server and edge device
cd server
docker compose up -d
This starts 4 services:
| Service | Port | Description |
|---|---|---|
| Mosquitto | 1883, 9001 | MQTT broker |
| Model Registry | 8000 | FastAPI model API |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Monitoring dashboard |
Verify: curl http://localhost:8000/health
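
The same check can be scripted, for example to wait for the server before starting an agent. This is a sketch using only the standard library; the `status` and `mqtt_connected` fields are taken from the `/health` response shown in the demo output, while `fetch_health` and `is_ready` are hypothetical names.

```python
import json
import urllib.request


def fetch_health(server: str = "localhost", port: int = 8000, timeout: float = 3.0) -> dict:
    """GET /health from the model registry and decode the JSON body."""
    url = f"http://{server}:{port}/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def is_ready(health: dict) -> bool:
    """Treat the server as usable once the registry is healthy and MQTT is connected."""
    return health.get("status") == "healthy" and health.get("mqtt_connected") is True
```

Calling `is_ready(fetch_health("<server-ip>"))` returns `True` only when both conditions hold; `fetch_health` raises `urllib.error.URLError` if the registry is unreachable.
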
cd edge-agent
pip install -r requirements.txt
# Set server IP and start
export SERVER_IP=<server-ip>
python edge_agent.py
The agent will connect to MQTT and begin reporting telemetry.
cd scripts
pip install numpy requests paho-mqtt
# Train a model and push to registry
python train_and_deploy.py --name my-model --version 1.0 --server <server-ip>
The edge agent will automatically download and start using the new model.
# Trigger inference on all devices
python scripts/send_command.py --command run_inference
# Request status from a specific device
python scripts/send_command.py --device nano --command report_status
# Push a model to a device
python scripts/send_command.py --command pull_model --model my-model --version 1.0
- Grafana Dashboard: http://<server-ip>:3000 (admin/admin)
- API Docs: http://<server-ip>:8000/docs
- Prometheus: http://<server-ip>:9090
$ curl http://localhost:8000/health
{
  "status": "healthy",
  "mqtt_connected": true,
  "models_count": 3,
  "devices_count": 1
}
$ curl http://localhost:8000/devices
{
  "nano": {
    "last_seen": 1776537994.14,
    "status": {
      "device_id": "nano",
      "cpu_temp_c": 42.5,
      "gpu_temp_c": 33.0,
      "memory_used_pct": 35.2,
      "memory_total_mb": 3956,
      "disk_used_pct": 77.6,
      "current_model": "my-model",
      "current_model_version": "1.0"
    }
  }
}
$ curl http://localhost:8000/results?limit=1
[
  {
    "device": "nano",
    "result": {
      "device_id": "nano",
      "inference_time_ms": 0.34,
      "output_shape": [1, 5],
      "output_summary": {
        "mean": -0.2199,
        "max": 0.2408,
        "min": -0.6088
      },
      "model": "my-model",
      "model_version": "1.0"
    }
  }
]
$ python train_and_deploy.py --name demo-model --version 1.0 --server localhost
Training model: input_dim=10, output_dim=5
Epoch 20/100, Loss: 7.2560
Epoch 40/100, Loss: 6.6770
Epoch 60/100, Loss: 6.1446
Epoch 80/100, Loss: 5.6550
Epoch 100/100, Loss: 5.2047
Training complete. Final loss: 5.1832
Model saved to /tmp/demo-model_v1.0.npy
Model uploaded successfully.
Done! The model registry will notify edge devices via MQTT.
Edge devices will automatically download and start using the new model.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Service info |
| `GET` | `/health` | Health check with MQTT and model counts |
| `POST` | `/models/upload` | Upload a model file with metadata |
| `GET` | `/models` | List all registered models |
| `GET` | `/models/download/{name}` | Download a model by name and version |
| `GET` | `/devices` | List connected edge devices and telemetry |
| `GET` | `/results` | Get recent inference results |
| `GET` | `/metrics` | Prometheus metrics endpoint |
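
A client can assemble these endpoint URLs with a small helper. The sketch below is illustrative only: `registry_url` and `download_url` are hypothetical names, and passing the version as a `?version=` query parameter is an assumption, since the table only states that the download is selected "by name and version". Check `/docs` on a running registry for the actual parameter.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def registry_url(server: str, path: str, port: int = 8000, **query) -> str:
    """Build a registry URL; query arguments that are None are left out."""
    url = f"http://{server}:{port}/{path.lstrip('/')}"
    kept = {key: value for key, value in query.items() if value is not None}
    return f"{url}?{urlencode(kept)}" if kept else url


def download_url(server: str, name: str, version: Optional[str] = None) -> str:
    """URL for GET /models/download/{name}."""
    # ASSUMPTION: the version travels as a ?version= query parameter.
    return registry_url(server, f"models/download/{quote(name, safe='')}", version=version)
```

For example, `registry_url("localhost", "/results", limit=1)` produces the same URL used in the demo `curl` call.
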
| Topic | Direction | Description |
|---|---|---|
| `server/model/new` | Server → Edge | New model available notification |
| `server/command/{device_id}` | Server → Edge | Targeted command dispatch |
| `server/command/all` | Server → All | Broadcast command to all devices |
| `edge/{device_id}/status` | Edge → Server | Device telemetry and health |
| `edge/{device_id}/inference/result` | Edge → Server | Inference output and timing |
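
The topic layout above lets the server subscribe with MQTT wildcards (for example `edge/+/status`) and recover the device ID from the topic itself. The helpers below are a library-independent sketch of that idea; all function names are hypothetical, and `topic_matches` reimplements the standard `+` and `#` wildcard rules for illustration rather than calling an MQTT client.

```python
from typing import List, Optional


def command_topics(device_id: str) -> List[str]:
    """Topics an edge agent would subscribe to for targeted and broadcast commands."""
    return [f"server/command/{device_id}", "server/command/all"]


def device_id_from(topic: str) -> Optional[str]:
    """Extract device_id from an edge/{device_id}/... topic, or None if it is not one."""
    levels = topic.split("/")
    if len(levels) >= 3 and levels[0] == "edge" and levels[1]:
        return levels[1]
    return None


def topic_matches(pattern: str, topic: str) -> bool:
    """MQTT-style match: '+' matches exactly one level, '#' matches the remainder."""
    pattern_levels = pattern.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(pattern_levels):
        if level == "#":
            return True
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(pattern_levels) == len(topic_levels)
```
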
| Variable | Default | Description |
|---|---|---|
| `SERVER_IP` | `192.168.1.12` | IP address of the server running MQTT and registry |
| `MQTT_PORT` | `1883` | MQTT broker port |
| `REGISTRY_PORT` | `8000` | Model registry API port |
| `DEVICE_ID` | `abbhatia-mac` | Unique identifier for this edge device |
| `MODEL_DIR` | `~/edge-cloud/models` | Local directory for downloaded models |
| `POLL_INTERVAL` | `30` | Telemetry reporting interval in seconds |
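
One way to read these variables with their documented defaults is sketched below. The variable names and defaults come from the table; the `AgentConfig` dataclass and `load_config` function are hypothetical and may not match how `edge_agent.py` actually loads its settings.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class AgentConfig:
    server_ip: str
    mqtt_port: int
    registry_port: int
    device_id: str
    model_dir: Path
    poll_interval: int  # seconds between telemetry reports


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return AgentConfig(
        server_ip=env.get("SERVER_IP", "192.168.1.12"),
        mqtt_port=int(env.get("MQTT_PORT", "1883")),
        registry_port=int(env.get("REGISTRY_PORT", "8000")),
        device_id=env.get("DEVICE_ID", "abbhatia-mac"),
        model_dir=Path(env.get("MODEL_DIR", "~/edge-cloud/models")).expanduser(),
        poll_interval=int(env.get("POLL_INTERVAL", "30")),
    )
```

Accepting the environment as a parameter keeps the function easy to test: pass a plain dictionary instead of mutating `os.environ`.
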
| Service | Container Name | Internal Port | External Port |
|---|---|---|---|
| Mosquitto | mqtt-broker | 1883, 9001 | 1883, 9001 |
| Model Registry | model-registry | 8000 | 8000 |
| Prometheus | prometheus | 9090 | 9090 |
| Grafana | grafana | 3000 | 3000 |
Grafana default credentials: admin / admin
The monitoring dashboard is auto-provisioned on startup with Prometheus as the default datasource.
edge-cloud-ml/
├── server/ # Server-side components
│ ├── docker-compose.yml # Orchestration for all services
│ ├── model-registry/ # FastAPI model registry
│ │ ├── Dockerfile
│ │ ├── main.py
│ │ └── requirements.txt
│ ├── mosquitto/ # MQTT broker config
│ │ └── mosquitto.conf
│ ├── prometheus/ # Metrics collection
│ │ └── prometheus.yml
│ └── grafana/ # Monitoring dashboards
│ └── provisioning/
│ ├── datasources/
│ └── dashboards/
├── edge-agent/ # Edge device agent
│ ├── edge_agent.py
│ └── requirements.txt
├── scripts/ # Utility scripts
│ ├── train_and_deploy.py # Model training pipeline
│ └── send_command.py # Remote command dispatcher
└── docs/ # Documentation
├── architecture.md
| Component | Server | Edge Device |
|---|---|---|
| Device | Laptop (i7-4510U) | NVIDIA Jetson Nano |
| CPU | Intel i7-4510U @ 2.0GHz | ARM Cortex-A57, 4 cores |
| RAM | 8 GB DDR3 | 4 GB LPDDR4 |
| GPU | NVIDIA GeForce 840M (2GB) | 128-core Maxwell |
| Storage | 256 GB SSD | 32 GB eMMC |
| OS | Ubuntu 22.04 LTS | Ubuntu 20.04 (L4T R32.6.1) |
| CUDA | 12.4 | 10.2 |
Benchmarks measured on the live system with the Jetson Nano running continuous inference:
| Metric | Value |
|---|---|
| Inference latency | 0.14–0.38 ms per inference |
| Average inference time | ~0.34 ms |
| Telemetry interval | Every 30 seconds |
| Auto-inference interval | Every 10 seconds |
| Total inferences recorded | 778+ (in a single session) |
| Model download time | < 1 second (328 bytes, local network) |
| MQTT round-trip | < 5 ms (LAN) |
| Jetson CPU temp (under load) | 42.5°C |
| Jetson GPU temp (under load) | 33.0°C |
| Jetson memory usage | 35.2% (1.4 GB / 4 GB) |
| Server memory (registry) | ~51 MB RSS |
Note: Latency numbers reflect the demo linear model. Real-world models (ResNet, YOLO, etc.) will have higher inference times depending on model complexity and whether TensorRT optimization is applied.
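
Latency figures like those in the table can be recomputed from the registry's `/results` output. The sketch below assumes each entry has the shape shown in the demo response (`result.inference_time_ms`); `summarize_latency` is a hypothetical helper and is not how the benchmark table was necessarily produced.

```python
from statistics import mean
from typing import Dict, List


def summarize_latency(results: List[dict]) -> Dict[str, float]:
    """Count, min, mean, and max of inference_time_ms across /results entries."""
    times = [entry["result"]["inference_time_ms"] for entry in results]
    if not times:
        raise ValueError("no inference results to summarize")
    return {
        "count": len(times),
        "min_ms": min(times),
        "mean_ms": mean(times),
        "max_ms": max(times),
    }
```
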
MQTT connect failed: [Errno 111] Connection refused, retrying in 5s...
- Verify the server is running: `docker ps` on the server
- Check the `SERVER_IP` environment variable is correct
- Ensure port 1883 is not blocked by a firewall: `sudo ufw allow 1883`
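
To tell a firewall or addressing problem apart from an agent problem, it can help to test raw TCP reachability from the edge device. This is a standard-library sketch; `port_open` is a hypothetical helper and only confirms that something accepts connections on the port, not that it is an MQTT broker.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

Running `port_open("<server-ip>", 1883)` on the edge device checks the broker port; the same call with `8000` checks the model registry.
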
Failed to download model: HTTPConnectionPool - Max retries exceeded
- Verify the model registry is accessible: `curl http://<server-ip>:8000/health`
- Check that port 8000 is reachable from the edge device
- Ensure the model exists: `curl http://<server-ip>:8000/models`
- Verify Prometheus is scraping: visit `http://<server-ip>:9090/targets`
- Ensure the model registry is running and the `/metrics` endpoint responds
- Check the Grafana datasource is configured correctly under Settings > Data Sources
# Check logs for the failing container
docker logs <container-name>
# Common fix: ensure ports aren't already in use
sudo lsof -i :1883
sudo lsof -i :8000
- Increase `POLL_INTERVAL` to report telemetry less frequently
- Ensure old model files are cleaned up in `MODEL_DIR`
- Consider running the agent inside a Docker container with memory limits
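
Cleaning up old model files can be automated. The sketch below keeps the most recently modified files and deletes the rest. It assumes models are stored as `.npy` files directly inside `MODEL_DIR` and that the newest file is the one in use; `prune_models` is a hypothetical helper, not part of the agent.

```python
from pathlib import Path
from typing import List


def prune_models(model_dir: Path, keep: int = 2, pattern: str = "*.npy") -> List[Path]:
    """Delete all but the `keep` most recently modified model files; return what was removed."""
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    files = sorted(model_dir.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)
    removed = files[keep:]  # everything older than the newest `keep` files
    for path in removed:
        path.unlink()
    return removed
```

For example, `prune_models(Path("~/edge-cloud/models").expanduser(), keep=2)` would leave the two newest model files in the default `MODEL_DIR`.
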
- Copy `edge-agent/` to the new device
- Install dependencies: `pip install -r requirements.txt`
- Set `SERVER_IP` and optionally `DEVICE_ID`
- Run `python edge_agent.py`
- The device will auto-register with the server
Replace the numpy-based demo model in train_and_deploy.py with:
- PyTorch models (`.pt` files via `torch.save()`)
- ONNX models for cross-platform inference
- TensorRT engines for optimized Jetson inference
- Define the command handler in `edge_agent.py` under `on_message()`
- Add the command to `send_command.py`'s CLI choices
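
One way to keep new commands easy to add is a handler registry that the message callback looks up by name. The sketch below is an illustration rather than the project's actual `on_message()` code: the JSON payload shape (a `command` key plus optional `model` and `version`) is an assumption, and `command`, `HANDLERS`, and `dispatch` are hypothetical names. The handler bodies are placeholders.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], str]
HANDLERS: Dict[str, Handler] = {}


def command(name: str) -> Callable[[Handler], Handler]:
    """Register a function as the handler for one command name."""
    def register(func: Handler) -> Handler:
        HANDLERS[name] = func
        return func
    return register


@command("run_inference")
def run_inference(args: dict) -> str:
    return "inference triggered"  # placeholder for the real inference call


@command("report_status")
def report_status(args: dict) -> str:
    return "status reported"  # placeholder for publishing telemetry


@command("pull_model")
def pull_model(args: dict) -> str:
    # ASSUMPTION: the payload carries "model" and "version" keys.
    return f"pulling {args['model']} v{args['version']}"


def dispatch(payload: bytes) -> str:
    """Decode a command message and route it to its registered handler."""
    message = json.loads(payload.decode("utf-8"))
    handler = HANDLERS.get(message.get("command"))
    if handler is None:
        return f"unknown command: {message.get('command')!r}"
    return handler(message)
```

With this layout, adding a command means writing one decorated function; an MQTT callback would only need to call `dispatch(msg.payload)`.
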
Contributions are welcome. Please follow these guidelines:
- Fork the repository and create a feature branch from `main`
- Follow the existing code style and project structure
- Write descriptive commit messages using Conventional Commits:
  - `feat:` for new features
  - `fix:` for bug fixes
  - `docs:` for documentation changes
  - `refactor:` for code restructuring
- Test your changes on both server and edge environments if applicable
- Submit a pull request with a clear description of the changes
git clone git@github.com:Abhinav0002/edge-cloud-ml.git
cd edge-cloud-ml
cd server && docker compose up -d # Start server services
cd ../edge-agent && pip install -r requirements.txt # Set up edge agent
MIT License. See LICENSE for details.