Edge-Cloud ML

A production-style edge-cloud machine learning platform that bridges server-side model training with real-time inference on edge devices. Built with a microservices architecture using Docker, MQTT, and FastAPI.

Overview

This platform implements a complete ML lifecycle across distributed infrastructure:

Server (Cloud) — Model training, versioned model registry, monitoring, and command dispatch
Edge Devices — Autonomous inference agents that auto-pull models and report telemetry

The system is designed for and tested on NVIDIA Jetson Nano as the edge device and an Intel i7 laptop as the server node, communicating over a local network with optional Tailscale overlay for remote access.

Tech Stack

Layer	Technology	Purpose
Message Broker	Eclipse Mosquitto 2.x	Asynchronous pub/sub communication between server and edge
Model Registry	FastAPI + Uvicorn	REST API for model versioning, storage, and distribution
Metrics	Prometheus	Time-series metrics collection and alerting
Visualization	Grafana	Real-time monitoring dashboards
Orchestration	Docker Compose	Multi-container deployment and management
Edge Runtime	Python + NumPy	Lightweight inference agent for resource-constrained devices
Networking	Tailscale (optional)	Secure overlay network for remote edge access

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     SERVER (Laptop / VM)                      │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │  Mosquitto    │  │   Model      │  │   Monitoring          │ │
│  │  MQTT Broker  │◄─┤   Registry   │  │   Prometheus + Grafana│ │
│  │  :1883/:9001  │  │   (FastAPI)  │  │   :9090 / :3000       │ │
│  └──────┬───────┘  │   :8000      │  └──────────────────────┘ │
│         │          └──────────────┘                            │
│         │                                                      │
└─────────┼──────────────────────────────────────────────────────┘
          │  MQTT (pub/sub)
          │
┌─────────┼──────────────────────────────────────────────────────┐
│         ▼              EDGE DEVICE (Jetson Nano)                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    Edge Agent                              │  │
│  │  • Subscribes to model updates via MQTT                    │  │
│  │  • Auto-downloads new models from registry                 │  │
│  │  • Runs continuous inference loop                          │  │
│  │  • Reports telemetry (CPU/GPU temp, memory, disk)          │  │
│  │  • Responds to remote commands                             │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Data Flow

 Train Model ──> Upload to Registry ──> MQTT Notification ──> Edge Downloads
                                                                    │
                 Grafana Dashboard <── Prometheus <── Registry <── Inference Results

Features

Model Registry (Server)

Versioned model storage — Upload, download, and list models with semantic versioning
MQTT notifications — Automatically notify edge devices when new models are available
Prometheus metrics — Track upload/download counts, inference results, latency histograms
Device tracking — Monitor connected edge devices and their telemetry in real-time
REST API — Full CRUD operations via FastAPI with auto-generated OpenAPI docs

Edge Agent

Auto model sync — Subscribes to MQTT and downloads new models without manual intervention
Continuous inference — Runs inference loop with configurable intervals
Remote commands — Accept commands from server: run_inference, report_status, pull_model
Device telemetry — Reports CPU/GPU temperature, memory usage, disk space
Graceful reconnection — Automatically reconnects on MQTT broker disconnection

Monitoring Stack

Prometheus — Scrapes model registry metrics every 15 seconds
Grafana — Pre-provisioned dashboard with 10 panels covering inference rates, model activity, and system health

Quick Start

Prerequisites

Docker and Docker Compose (server)
Python 3.8+ (edge device)
Network connectivity between server and edge device

1. Start the Server

cd server
docker compose up -d

This starts 4 services:

Service	Port	Description
Mosquitto	1883, 9001	MQTT broker
Model Registry	8000	FastAPI model API
Prometheus	9090	Metrics collection
Grafana	3000	Monitoring dashboard

Verify: curl http://localhost:8000/health

2. Start the Edge Agent

cd edge-agent
pip install -r requirements.txt

# Set server IP and start
export SERVER_IP=<server-ip>
python edge_agent.py

The agent will connect to MQTT and begin reporting telemetry.

3. Train and Deploy a Model

cd scripts
pip install numpy requests paho-mqtt

# Train a model and push to registry
python train_and_deploy.py --name my-model --version 1.0 --server <server-ip>

The edge agent will automatically download and start using the new model.

4. Send Commands to Edge Devices

# Trigger inference on all devices
python scripts/send_command.py --command run_inference

# Request status from a specific device
python scripts/send_command.py --device nano --command report_status

# Push a model to a device
python scripts/send_command.py --command pull_model --model my-model --version 1.0

5. Monitor

Grafana Dashboard: http://<server-ip>:3000 (admin/admin)
API Docs: http://<server-ip>:8000/docs
Prometheus: http://<server-ip>:9090

Demo

Health Check

$ curl http://localhost:8000/health

{
  status: healthy,
  mqtt_connected: true,
  models_count: 3,
  devices_count: 1
}

Device Telemetry

$ curl http://localhost:8000/devices

{
  nano: {
    last_seen: 1776537994.14,
    status: {
      device_id: nano,
      cpu_temp_c: 42.5,
      gpu_temp_c: 33.0,
      memory_used_pct: 35.2,
      memory_total_mb: 3956,
      disk_used_pct: 77.6,
      current_model: my-model,
      current_model_version: 1.0
    }
  }
}

Inference Results

$ curl http://localhost:8000/results?limit=1

[
  {
    device: nano,
    result: {
      device_id: nano,
      inference_time_ms: 0.34,
      output_shape: [1, 5],
      output_summary: {
        mean: -0.2199,
        max: 0.2408,
        min: -0.6088
      },
      model: my-model,
      model_version: 1.0
    }
  }
]

Training Output

$ python train_and_deploy.py --name demo-model --version 1.0 --server localhost
Training model: input_dim=10, output_dim=5
  Epoch 20/100, Loss: 7.2560
  Epoch 40/100, Loss: 6.6770
  Epoch 60/100, Loss: 6.1446
  Epoch 80/100, Loss: 5.6550
  Epoch 100/100, Loss: 5.2047
Training complete. Final loss: 5.1832
Model saved to /tmp/demo-model_v1.0.npy
Model uploaded successfully.

Done! The model registry will notify edge devices via MQTT.
Edge devices will automatically download and start using the new model.

API Reference

Model Registry Endpoints

Method	Endpoint	Description
`GET`	`/`	Service info
`GET`	`/health`	Health check with MQTT and model counts
`POST`	`/models/upload`	Upload a model file with metadata
`GET`	`/models`	List all registered models
`GET`	`/models/download/{name}`	Download a model by name and version
`GET`	`/devices`	List connected edge devices and telemetry
`GET`	`/results`	Get recent inference results
`GET`	`/metrics`	Prometheus metrics endpoint

MQTT Topics

Topic	Direction	Description
`server/model/new`	Server → Edge	New model available notification
`server/command/{device_id}`	Server → Edge	Targeted command dispatch
`server/command/all`	Server → All	Broadcast command to all devices
`edge/{device_id}/status`	Edge → Server	Device telemetry and health
`edge/{device_id}/inference/result`	Edge → Server	Inference output and timing

Configuration

Edge Agent Environment Variables

Variable	Default	Description
`SERVER_IP`	`192.168.1.12`	IP address of the server running MQTT and registry
`MQTT_PORT`	`1883`	MQTT broker port
`REGISTRY_PORT`	`8000`	Model registry API port
`DEVICE_ID`	`abbhatia-mac`	Unique identifier for this edge device
`MODEL_DIR`	`~/edge-cloud/models`	Local directory for downloaded models
`POLL_INTERVAL`	`30`	Telemetry reporting interval in seconds

Server Services (docker-compose.yml)

Service	Container Name	Internal Port	External Port
Mosquitto	mqtt-broker	1883, 9001	1883, 9001
Model Registry	model-registry	8000	8000
Prometheus	prometheus	9090	9090
Grafana	grafana	3000	3000

Grafana

Default credentials: admin / admin

The monitoring dashboard is auto-provisioned on startup with Prometheus as the default datasource.

Project Structure

edge-cloud-ml/
├── server/                          # Server-side components
│   ├── docker-compose.yml           # Orchestration for all services
│   ├── model-registry/              # FastAPI model registry
│   │   ├── Dockerfile
│   │   ├── main.py
│   │   └── requirements.txt
│   ├── mosquitto/                   # MQTT broker config
│   │   └── mosquitto.conf
│   ├── prometheus/                  # Metrics collection
│   │   └── prometheus.yml
│   └── grafana/                     # Monitoring dashboards
│       └── provisioning/
│           ├── datasources/
│           └── dashboards/
├── edge-agent/                      # Edge device agent
│   ├── edge_agent.py
│   └── requirements.txt
├── scripts/                         # Utility scripts
│   ├── train_and_deploy.py          # Model training pipeline
│   └── send_command.py              # Remote command dispatcher
└── docs/                            # Documentation
    ├── architecture.md

Hardware Tested

Component	Server	Edge Device
Device	Laptop (i7-4510U)	NVIDIA Jetson Nano
CPU	Intel i7-4510U @ 2.0GHz	ARM Cortex-A57, 4 cores
RAM	8 GB DDR3	4 GB LPDDR4
GPU	NVIDIA GeForce 840M (2GB)	128-core Maxwell
Storage	256 GB SSD	32 GB eMMC
OS	Ubuntu 22.04 LTS	Ubuntu 20.04 (L4T R32.6.1)
CUDA	12.4	10.2

Performance

Benchmarks measured on the live system with the Jetson Nano running continuous inference:

Metric	Value
Inference latency	0.14 — 0.38 ms per inference
Average inference time	~0.34 ms
Telemetry interval	Every 30 seconds
Auto-inference interval	Every 10 seconds
Total inferences recorded	778+ (in a single session)
Model download time	< 1 second (328 bytes, local network)
MQTT round-trip	< 5 ms (LAN)
Jetson CPU temp (under load)	42.5°C
Jetson GPU temp (under load)	33.0°C
Jetson memory usage	35.2% (1.4 GB / 4 GB)
Server memory (registry)	~51 MB RSS

Note: Latency numbers reflect the demo linear model. Real-world models (ResNet, YOLO, etc.) will have higher inference times depending on model complexity and whether TensorRT optimization is applied.

Troubleshooting

Edge agent can't connect to MQTT

MQTT connect failed: [Errno 111] Connection refused, retrying in 5s...

Verify the server is running: docker ps on the server
Check the SERVER_IP environment variable is correct
Ensure port 1883 is not blocked by a firewall: sudo ufw allow 1883

Model download fails on edge device

Failed to download model: HTTPConnectionPool - Max retries exceeded

Verify the model registry is accessible: curl http://<server-ip>:8000/health
Check that port 8000 is reachable from the edge device
Ensure the model exists: curl http://<server-ip>:8000/models

Grafana shows no data

Verify Prometheus is scraping: visit http://<server-ip>:9090/targets
Ensure the model registry is running and the /metrics endpoint responds
Check the Grafana datasource is configured correctly under Settings > Data Sources

Docker containers keep restarting

# Check logs for the failing container
docker logs <container-name>

# Common fix: ensure ports aren't already in use
sudo lsof -i :1883
sudo lsof -i :8000

Edge agent high memory usage

Reduce POLL_INTERVAL to report telemetry less frequently
Ensure old model files are cleaned up in MODEL_DIR
Consider running the agent inside a Docker container with memory limits

Extending the Platform

Adding a New Edge Device

Copy edge-agent/ to the new device
Install dependencies: pip install -r requirements.txt
Set SERVER_IP and optionally DEVICE_ID
Run python edge_agent.py
The device will auto-register with the server

Using Real Models

Replace the numpy-based demo model in train_and_deploy.py with:

PyTorch models (.pt files via torch.save())
ONNX models for cross-platform inference
TensorRT engines for optimized Jetson inference

Adding New MQTT Commands

Define the command handler in edge_agent.py under on_message()
Add the command to send_command.py's CLI choices

Contributing

Contributions are welcome. Please follow these guidelines:

Fork the repository and create a feature branch from main
Follow the existing code style and project structure
Write descriptive commit messages using Conventional Commits
- feat: for new features
- fix: for bug fixes
- docs: for documentation changes
- refactor: for code restructuring
Test your changes on both server and edge environments if applicable
Submit a pull request with a clear description of the changes

Development Setup

git clone git@github.com:Abhinav0002/edge-cloud-ml.git
cd edge-cloud-ml
cd server && docker compose up -d    # Start server services
cd ../edge-agent && pip install -r requirements.txt  # Set up edge agent

License

MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
docs		docs
edge-agent		edge-agent
scripts		scripts
server		server
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Edge-Cloud ML

Table of Contents

Overview

Tech Stack

Architecture

Data Flow

Features

Model Registry (Server)

Edge Agent

Monitoring Stack

Quick Start

Prerequisites

1. Start the Server

2. Start the Edge Agent

3. Train and Deploy a Model

4. Send Commands to Edge Devices

5. Monitor

Demo

Health Check

Device Telemetry

Inference Results

Training Output

API Reference

Model Registry Endpoints

MQTT Topics

Configuration

Edge Agent Environment Variables

Server Services (docker-compose.yml)

Grafana

Project Structure

Hardware Tested

Performance

Troubleshooting

Edge agent can't connect to MQTT

Model download fails on edge device

Grafana shows no data

Docker containers keep restarting

Edge agent high memory usage

Extending the Platform

Adding a New Edge Device

Using Real Models

Adding New MQTT Commands

Contributing

Development Setup

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages