Edge-Cloud ML

A production-style edge-cloud machine learning platform that bridges server-side model training with real-time inference on edge devices. Built with a microservices architecture using Docker, MQTT, and FastAPI.

Overview

This platform implements a complete ML lifecycle across distributed infrastructure:

  • Server (Cloud) — Model training, versioned model registry, monitoring, and command dispatch
  • Edge Devices — Autonomous inference agents that auto-pull models and report telemetry

The system was designed for and tested on an NVIDIA Jetson Nano as the edge device and an Intel i7 laptop as the server node. The two communicate over a local network, with an optional Tailscale overlay for remote access.

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Message Broker | Eclipse Mosquitto 2.x | Asynchronous pub/sub communication between server and edge |
| Model Registry | FastAPI + Uvicorn | REST API for model versioning, storage, and distribution |
| Metrics | Prometheus | Time-series metrics collection and alerting |
| Visualization | Grafana | Real-time monitoring dashboards |
| Orchestration | Docker Compose | Multi-container deployment and management |
| Edge Runtime | Python + NumPy | Lightweight inference agent for resource-constrained devices |
| Networking | Tailscale (optional) | Secure overlay network for remote edge access |

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     SERVER (Laptop / VM)                      │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │  Mosquitto    │  │   Model      │  │   Monitoring          │ │
│  │  MQTT Broker  │◄─┤   Registry   │  │   Prometheus + Grafana│ │
│  │  :1883/:9001  │  │   (FastAPI)  │  │   :9090 / :3000       │ │
│  └──────┬───────┘  │   :8000      │  └──────────────────────┘ │
│         │          └──────────────┘                            │
│         │                                                      │
└─────────┼──────────────────────────────────────────────────────┘
          │  MQTT (pub/sub)
          │
┌─────────┼──────────────────────────────────────────────────────┐
│         ▼              EDGE DEVICE (Jetson Nano)                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    Edge Agent                              │  │
│  │  • Subscribes to model updates via MQTT                    │  │
│  │  • Auto-downloads new models from registry                 │  │
│  │  • Runs continuous inference loop                          │  │
│  │  • Reports telemetry (CPU/GPU temp, memory, disk)          │  │
│  │  • Responds to remote commands                             │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Data Flow

 Train Model ──> Upload to Registry ──> MQTT Notification ──> Edge Downloads
                                                                    │
                 Grafana Dashboard <── Prometheus <── Registry <── Inference Results
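
The notification and download steps of this flow can be sketched roughly as follows. Only the `server/model/new` topic and the `/models/download/{name}` path come from this README; the payload fields (`name`, `version`) and the `version` query parameter are assumptions made for illustration, and the real registry may encode them differently.

```python
import json
from urllib.parse import quote, urlencode

MODEL_TOPIC = "server/model/new"  # topic name taken from the MQTT Topics table


def build_notification(name: str, version: str) -> bytes:
    """Encode a hypothetical 'new model available' message as JSON bytes."""
    return json.dumps({"name": name, "version": version}).encode("utf-8")


def download_url(server_ip: str, payload: bytes, port: int = 8000) -> str:
    """Turn a notification payload into a registry download URL.

    Assumption: the version travels as a ?version= query parameter.
    """
    msg = json.loads(payload.decode("utf-8"))
    path = "/models/download/" + quote(msg["name"], safe="")
    query = urlencode({"version": msg["version"]})
    return f"http://{server_ip}:{port}{path}?{query}"


if __name__ == "__main__":
    note = build_notification("my-model", "1.0")
    print(MODEL_TOPIC, download_url("192.168.1.12", note))
```

Keeping the message small and letting the edge device fetch the file over HTTP means the broker never has to carry model binaries.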

Features

Model Registry (Server)

  • Versioned model storage — Upload, download, and list models with semantic versioning
  • MQTT notifications — Automatically notify edge devices when new models are available
  • Prometheus metrics — Track upload/download counts, inference results, latency histograms
  • Device tracking — Monitor connected edge devices and their telemetry in real-time
  • REST API — Full CRUD operations via FastAPI with auto-generated OpenAPI docs

Edge Agent

  • Auto model sync — Subscribes to MQTT and downloads new models without manual intervention
  • Continuous inference — Runs inference loop with configurable intervals
  • Remote commands — Accept commands from server: run_inference, report_status, pull_model
  • Device telemetry — Reports CPU/GPU temperature, memory usage, disk space
  • Graceful reconnection — Automatically reconnects on MQTT broker disconnection

Monitoring Stack

  • Prometheus — Scrapes model registry metrics every 15 seconds
  • Grafana — Pre-provisioned dashboard with 10 panels covering inference rates, model activity, and system health
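
For readers unfamiliar with how latency histograms are scraped, here is a minimal sketch of Prometheus-style cumulative buckets. It does not use the Prometheus client library, and the bucket bounds are hypothetical; it only illustrates the counting scheme behind a histogram metric.

```python
from bisect import bisect_left
from typing import Dict, List


class LatencyHistogram:
    """Minimal Prometheus-style histogram with cumulative `le` buckets."""

    def __init__(self, bounds: List[float]) -> None:
        self.bounds = sorted(bounds)
        # One counter per finite bound plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the smallest
        # bucket whose upper limit still includes the observation.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> Dict[str, int]:
        """Return counts as exposed to a scraper: each bucket includes
        every observation less than or equal to its bound."""
        out: Dict[str, int] = {}
        running = 0
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        for label, count in zip(labels, self.counts):
            running += count
            out[label] = running
        return out


if __name__ == "__main__":
    hist = LatencyHistogram([0.25, 0.5, 1.0])  # hypothetical bounds in ms
    for ms in (0.14, 0.34, 0.38):
        hist.observe(ms)
    print(hist.cumulative())
```

Because the buckets are cumulative, Grafana can estimate percentiles from them without the registry storing every individual measurement.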

Quick Start

Prerequisites

  • Docker and Docker Compose (server)
  • Python 3.8+ (edge device)
  • Network connectivity between server and edge device

1. Start the Server

cd server
docker compose up -d

This starts 4 services:

| Service | Port | Description |
| --- | --- | --- |
| Mosquitto | 1883, 9001 | MQTT broker |
| Model Registry | 8000 | FastAPI model API |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Monitoring dashboard |

Verify: curl http://localhost:8000/health

2. Start the Edge Agent

cd edge-agent
pip install -r requirements.txt

# Set server IP and start
export SERVER_IP=<server-ip>
python edge_agent.py

The agent will connect to MQTT and begin reporting telemetry.

3. Train and Deploy a Model

cd scripts
pip install numpy requests paho-mqtt

# Train a model and push to registry
python train_and_deploy.py --name my-model --version 1.0 --server <server-ip>

The edge agent will automatically download and start using the new model.

4. Send Commands to Edge Devices

# Trigger inference on all devices
python scripts/send_command.py --command run_inference

# Request status from a specific device
python scripts/send_command.py --device nano --command report_status

# Ask devices to pull a specific model (no --device given, so this is broadcast to all devices)
python scripts/send_command.py --command pull_model --model my-model --version 1.0
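
A minimal sketch of how a dispatcher like `send_command.py` might map these flags to an MQTT topic and payload. The flag names and topics follow this README; the JSON payload layout is an assumption, and the actual script may differ.

```python
import argparse
import json
from typing import List, Optional, Tuple

# Command names taken from the Edge Agent feature list.
COMMANDS = ["run_inference", "report_status", "pull_model"]


def build_message(argv: Optional[List[str]] = None) -> Tuple[str, str]:
    """Parse CLI arguments and return the (topic, JSON payload) to publish."""
    parser = argparse.ArgumentParser(description="Send a command to edge devices")
    parser.add_argument("--command", required=True, choices=COMMANDS)
    parser.add_argument("--device", default=None, help="target device; omit to broadcast")
    parser.add_argument("--model", default=None)
    parser.add_argument("--version", default=None)
    args = parser.parse_args(argv)

    # Targeted and broadcast topics follow the MQTT Topics table.
    topic = f"server/command/{args.device}" if args.device else "server/command/all"

    # Hypothetical payload layout: the real script may use different keys.
    payload = {"command": args.command}
    if args.command == "pull_model":
        payload.update({"model": args.model, "version": args.version})
    return topic, json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    print(build_message(["--device", "nano", "--command", "report_status"]))
```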

5. Monitor

  • Grafana Dashboard: http://<server-ip>:3000 (admin/admin)
  • API Docs: http://<server-ip>:8000/docs
  • Prometheus: http://<server-ip>:9090

Demo

Health Check

$ curl http://localhost:8000/health
{
  "status": "healthy",
  "mqtt_connected": true,
  "models_count": 3,
  "devices_count": 1
}
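
A small sketch of how a deployment script might interpret this response before pushing a model. The field names come from the sample above; treating zero devices as a problem is an assumption, not documented behaviour.

```python
import json
from typing import List


def health_problems(body: str) -> List[str]:
    """Return human-readable problems found in a /health response body."""
    data = json.loads(body)
    problems = []
    if data.get("status") != "healthy":
        problems.append(f"status is {data.get('status')!r}")
    if not data.get("mqtt_connected", False):
        problems.append("registry is not connected to the MQTT broker")
    if data.get("devices_count", 0) == 0:
        # Assumption: a deployment with no reporting devices is worth flagging.
        problems.append("no edge devices have reported in")
    return problems


if __name__ == "__main__":
    sample = '{"status": "healthy", "mqtt_connected": true, "models_count": 3, "devices_count": 1}'
    print(health_problems(sample) or "all checks passed")
```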

Device Telemetry

$ curl http://localhost:8000/devices
{
  "nano": {
    "last_seen": 1776537994.14,
    "status": {
      "device_id": "nano",
      "cpu_temp_c": 42.5,
      "gpu_temp_c": 33.0,
      "memory_used_pct": 35.2,
      "memory_total_mb": 3956,
      "disk_used_pct": 77.6,
      "current_model": "my-model",
      "current_model_version": "1.0"
    }
  }
}
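
The README does not show how the agent gathers these values. As a sketch under the assumption of a Linux device, memory figures can be derived from `/proc/meminfo`, temperatures from a thermal zone file (reported in millidegrees Celsius), and disk usage from the standard library. Which thermal zone maps to the CPU or GPU is device-specific and not covered here.

```python
import shutil
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, float]:
    """Derive memory fields from the contents of /proc/meminfo (values in kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    total, available = fields["MemTotal"], fields["MemAvailable"]
    return {
        "memory_total_mb": total // 1024,
        "memory_used_pct": round(100.0 * (total - available) / total, 1),
    }


def parse_thermal_zone(text: str) -> float:
    """Convert a thermal zone reading in millidegrees Celsius to degrees."""
    return int(text.strip()) / 1000.0


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(100.0 * usage.used / usage.total, 1)


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:  # Linux only
        print(parse_meminfo(fh.read()), disk_used_pct())
```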

Inference Results

$ curl "http://localhost:8000/results?limit=1"
[
  {
    "device": "nano",
    "result": {
      "device_id": "nano",
      "inference_time_ms": 0.34,
      "output_shape": [1, 5],
      "output_summary": {
        "mean": -0.2199,
        "max": 0.2408,
        "min": -0.6088
      },
      "model": "my-model",
      "model_version": "1.0"
    }
  }
]
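
A result with this shape could be produced along the following lines. This is a sketch, not the agent's actual code: it assumes the demo model is a single weight matrix applied to a 1x10 input, and the weights here are random placeholders.

```python
import time
from typing import Any, Dict

import numpy as np


def run_inference(weights: np.ndarray, x: np.ndarray) -> Dict[str, Any]:
    """Apply a linear model and summarize the output like the sample result."""
    start = time.perf_counter()
    y = x @ weights
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "inference_time_ms": round(elapsed_ms, 2),
        "output_shape": list(y.shape),
        "output_summary": {
            "mean": round(float(y.mean()), 4),
            "max": round(float(y.max()), 4),
            "min": round(float(y.min()), 4),
        },
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(10, 5)).astype(np.float32)       # hypothetical weights
    sample = rng.normal(size=(1, 10)).astype(np.float32)  # hypothetical input
    print(run_inference(w, sample))
```

Sending only a summary rather than the raw output keeps the MQTT payload small, which matters once models produce larger tensors.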

Training Output

$ python train_and_deploy.py --name demo-model --version 1.0 --server localhost
Training model: input_dim=10, output_dim=5
  Epoch 20/100, Loss: 7.2560
  Epoch 40/100, Loss: 6.6770
  Epoch 60/100, Loss: 6.1446
  Epoch 80/100, Loss: 5.6550
  Epoch 100/100, Loss: 5.2047
Training complete. Final loss: 5.1832
Model saved to /tmp/demo-model_v1.0.npy
Model uploaded successfully.

Done! The model registry will notify edge devices via MQTT.
Edge devices will automatically download and start using the new model.
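
The training script itself is not reproduced in this README. As a rough sketch of what a NumPy-based linear model with `input_dim=10` and `output_dim=5` could look like, the following fits a weight matrix by batch gradient descent on synthetic data and serializes it in `.npy` format. The data, learning rate, and initialization are all assumptions.

```python
import io

import numpy as np


def train_linear(x: np.ndarray, y: np.ndarray, epochs: int = 100, lr: float = 0.01):
    """Fit y ~ x @ w by batch gradient descent on mean squared error."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(x.shape[1], y.shape[1])).astype(np.float32)
    losses = []
    for _ in range(epochs):
        err = x @ w - y
        losses.append(float(np.mean(err ** 2)))
        # Gradient of the mean squared error with respect to w.
        grad = (2.0 / err.size) * (x.T @ err)
        w -= lr * grad
    return w, losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(200, 10)).astype(np.float32)  # synthetic inputs
    true_w = rng.normal(size=(10, 5)).astype(np.float32)
    y = x @ true_w
    w, losses = train_linear(x, y)
    buf = io.BytesIO()
    np.save(buf, w)  # same .npy container format as the demo model file
    print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}, {buf.getbuffer().nbytes} bytes")
```

One observation, not confirmed by the source: a 10x5 `float32` array holds 200 bytes of data, and NumPy's `.npy` header for an array this small is typically 128 bytes, which would be consistent with the 328-byte model size reported in the Performance section.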

API Reference

Model Registry Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | / | Service info |
| GET | /health | Health check with MQTT connection status and model/device counts |
| POST | /models/upload | Upload a model file with metadata |
| GET | /models | List all registered models |
| GET | /models/download/{name} | Download a model by name and version |
| GET | /devices | List connected edge devices and telemetry |
| GET | /results | Get recent inference results |
| GET | /metrics | Prometheus metrics endpoint |

MQTT Topics

| Topic | Direction | Description |
| --- | --- | --- |
| server/model/new | Server → Edge | New model available notification |
| server/command/{device_id} | Server → Edge | Targeted command dispatch |
| server/command/all | Server → All | Broadcast command to all devices |
| edge/{device_id}/status | Edge → Server | Device telemetry and health |
| edge/{device_id}/inference/result | Edge → Server | Inference output and timing |
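
On the server side, the two edge topics can be covered with MQTT's single-level wildcard `+` and then split to recover the device ID. The sketch below shows one way to do that; the function name and the returned labels are hypothetical.

```python
from typing import Optional, Tuple

# Subscription filters using the MQTT single-level wildcard '+'.
SERVER_SUBSCRIPTIONS = ["edge/+/status", "edge/+/inference/result"]


def parse_edge_topic(topic: str) -> Optional[Tuple[str, str]]:
    """Split an edge topic into (device_id, kind), or return None if it doesn't match."""
    parts = topic.split("/")
    if len(parts) == 3 and parts[0] == "edge" and parts[2] == "status":
        return parts[1], "status"
    if len(parts) == 4 and parts[0] == "edge" and parts[2:] == ["inference", "result"]:
        return parts[1], "inference_result"
    return None


if __name__ == "__main__":
    for t in ("edge/nano/status", "edge/nano/inference/result", "server/model/new"):
        print(t, "->", parse_edge_topic(t))
```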

Configuration

Edge Agent Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| SERVER_IP | 192.168.1.12 | IP address of the server running MQTT and registry |
| MQTT_PORT | 1883 | MQTT broker port |
| REGISTRY_PORT | 8000 | Model registry API port |
| DEVICE_ID | abbhatia-mac | Unique identifier for this edge device |
| MODEL_DIR | ~/edge-cloud/models | Local directory for downloaded models |
| POLL_INTERVAL | 30 | Telemetry reporting interval in seconds |
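
A minimal sketch of reading these variables with the documented defaults. The `AgentConfig` class and `load_config` function are illustrative names, not necessarily what `edge_agent.py` uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgentConfig:
    server_ip: str
    mqtt_port: int
    registry_port: int
    device_id: str
    model_dir: str
    poll_interval: int

    @property
    def registry_url(self) -> str:
        return f"http://{self.server_ip}:{self.registry_port}"


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read settings from the environment, falling back to the documented defaults."""
    return AgentConfig(
        server_ip=env.get("SERVER_IP", "192.168.1.12"),
        mqtt_port=int(env.get("MQTT_PORT", "1883")),
        registry_port=int(env.get("REGISTRY_PORT", "8000")),
        device_id=env.get("DEVICE_ID", "abbhatia-mac"),
        model_dir=os.path.expanduser(env.get("MODEL_DIR", "~/edge-cloud/models")),
        poll_interval=int(env.get("POLL_INTERVAL", "30")),
    )


if __name__ == "__main__":
    print(load_config())
```

Since the default `DEVICE_ID` is a fixed string, setting it explicitly on every device avoids two agents reporting under the same name.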

Server Services (docker-compose.yml)

| Service | Container Name | Internal Port | External Port |
| --- | --- | --- | --- |
| Mosquitto | mqtt-broker | 1883, 9001 | 1883, 9001 |
| Model Registry | model-registry | 8000 | 8000 |
| Prometheus | prometheus | 9090 | 9090 |
| Grafana | grafana | 3000 | 3000 |

Grafana

Default credentials: admin / admin

The monitoring dashboard is auto-provisioned on startup with Prometheus as the default datasource.

Project Structure

edge-cloud-ml/
├── server/                          # Server-side components
│   ├── docker-compose.yml           # Orchestration for all services
│   ├── model-registry/              # FastAPI model registry
│   │   ├── Dockerfile
│   │   ├── main.py
│   │   └── requirements.txt
│   ├── mosquitto/                   # MQTT broker config
│   │   └── mosquitto.conf
│   ├── prometheus/                  # Metrics collection
│   │   └── prometheus.yml
│   └── grafana/                     # Monitoring dashboards
│       └── provisioning/
│           ├── datasources/
│           └── dashboards/
├── edge-agent/                      # Edge device agent
│   ├── edge_agent.py
│   └── requirements.txt
├── scripts/                         # Utility scripts
│   ├── train_and_deploy.py          # Model training pipeline
│   └── send_command.py              # Remote command dispatcher
└── docs/                            # Documentation
    ├── architecture.md

Hardware Tested

| Component | Server | Edge Device |
| --- | --- | --- |
| Device | Laptop (i7-4510U) | NVIDIA Jetson Nano |
| CPU | Intel i7-4510U @ 2.0GHz | ARM Cortex-A57, 4 cores |
| RAM | 8 GB DDR3 | 4 GB LPDDR4 |
| GPU | NVIDIA GeForce 840M (2GB) | 128-core Maxwell |
| Storage | 256 GB SSD | 32 GB eMMC |
| OS | Ubuntu 22.04 LTS | Ubuntu 20.04 (L4T R32.6.1) |
| CUDA | 12.4 | 10.2 |

Performance

Benchmarks measured on the live system with the Jetson Nano running continuous inference:

| Metric | Value |
| --- | --- |
| Inference latency | 0.14 to 0.38 ms per inference |
| Average inference time | ~0.34 ms |
| Telemetry interval | Every 30 seconds |
| Auto-inference interval | Every 10 seconds |
| Total inferences recorded | 778+ (in a single session) |
| Model download time | < 1 second (328 bytes, local network) |
| MQTT round-trip | < 5 ms (LAN) |
| Jetson CPU temp (under load) | 42.5°C |
| Jetson GPU temp (under load) | 33.0°C |
| Jetson memory usage | 35.2% (1.4 GB / 4 GB) |
| Server memory (registry) | ~51 MB RSS |

Note: Latency numbers reflect the demo linear model. Real-world models (ResNet, YOLO, etc.) will have higher inference times depending on model complexity and whether TensorRT optimization is applied.

Troubleshooting

Edge agent can't connect to MQTT

MQTT connect failed: [Errno 111] Connection refused, retrying in 5s...
  • Verify the server is running: docker ps on the server
  • Check the SERVER_IP environment variable is correct
  • Ensure port 1883 is not blocked by a firewall: sudo ufw allow 1883
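
The retry behaviour behind this log line can be sketched as a fixed-delay loop. This is an illustration, not the agent's code: `connect` stands in for whatever call opens the broker connection (with paho-mqtt that would typically wrap `client.connect(host, port)`), and the `max_attempts` parameter is an assumption.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], None],
    delay_s: float = 5.0,
    max_attempts: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `connect` until it succeeds; return the number of attempts used.

    max_attempts=0 means retry forever. OSError covers ConnectionRefusedError.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect()
            return attempt
        except OSError as exc:
            if max_attempts and attempt >= max_attempts:
                raise
            print(f"MQTT connect failed: {exc}, retrying in {delay_s:g}s...")
            sleep(delay_s)
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without actually waiting.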

Model download fails on edge device

Failed to download model: HTTPConnectionPool - Max retries exceeded
  • Verify the model registry is accessible: curl http://<server-ip>:8000/health
  • Check that port 8000 is reachable from the edge device
  • Ensure the model exists: curl http://<server-ip>:8000/models

Grafana shows no data

  • Verify Prometheus is scraping: visit http://<server-ip>:9090/targets
  • Ensure the model registry is running and the /metrics endpoint responds
  • Check that the Grafana datasource is configured correctly under Connections > Data sources (Configuration > Data sources in older Grafana versions)

Docker containers keep restarting

# Check logs for the failing container
docker logs <container-name>

# Common fix: ensure ports aren't already in use
sudo lsof -i :1883
sudo lsof -i :8000

Edge agent high memory usage

  • Increase POLL_INTERVAL to report telemetry less frequently
  • Ensure old model files are cleaned up in MODEL_DIR
  • Consider running the agent inside a Docker container with memory limits
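
Cleaning up old model files can be automated with a small helper like the one below. It is a sketch that assumes models are stored as `.npy` files directly inside `MODEL_DIR`; the function name and the `keep` parameter are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def prune_models(model_dir: str, keep: int = 2, pattern: str = "*.npy") -> List[Path]:
    """Delete all but the `keep` most recently modified model files.

    Returns the paths that were removed.
    """
    files = sorted(
        Path(model_dir).expanduser().glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    removed = []
    for old in files[keep:]:
        old.unlink()
        removed.append(old)
    return removed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real MODEL_DIR.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(4):
            f = Path(tmp) / f"demo-model_v{i}.npy"
            f.write_bytes(b"\0")
            os.utime(f, (i, i))  # give each file a distinct modification time
        print([p.name for p in prune_models(tmp)])
```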

Extending the Platform

Adding a New Edge Device

  1. Copy edge-agent/ to the new device
  2. Install dependencies: pip install -r requirements.txt
  3. Set SERVER_IP and optionally DEVICE_ID
  4. Run python edge_agent.py
  5. The device will auto-register with the server

Using Real Models

Replace the numpy-based demo model in train_and_deploy.py with:

  • PyTorch models (.pt files via torch.save())
  • ONNX models for cross-platform inference
  • TensorRT engines for optimized Jetson inference

Adding New MQTT Commands

  1. Define the command handler in edge_agent.py under on_message()
  2. Add the command to send_command.py's CLI choices
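
One way to keep `on_message()` small as commands accumulate is a handler registry, sketched below. The payload layout (a JSON object with a `command` key) and the `blink_led` command are hypothetical; the existing agent may structure its handlers differently.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]
HANDLERS: Dict[str, Handler] = {}


def command(name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler under a command name."""
    def register(func: Handler) -> Handler:
        HANDLERS[name] = func
        return func
    return register


@command("run_inference")
def handle_run_inference(msg: Dict[str, Any]) -> str:
    return "inference triggered"


@command("report_status")
def handle_report_status(msg: Dict[str, Any]) -> str:
    return "status reported"


@command("blink_led")  # hypothetical new command, added without touching dispatch()
def handle_blink_led(msg: Dict[str, Any]) -> str:
    return f"blinking {msg.get('times', 1)} times"


def dispatch(payload: bytes) -> str:
    """Decode a command payload and route it; intended to be called from on_message()."""
    msg = json.loads(payload.decode("utf-8"))
    handler = HANDLERS.get(msg.get("command"))
    if handler is None:
        return f"unknown command: {msg.get('command')!r}"
    return handler(msg)


if __name__ == "__main__":
    print(dispatch(b'{"command": "blink_led", "times": 3}'))
```

With this layout, adding a command on the edge side is a single decorated function, and the sender only needs the new name in its list of allowed choices.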

Contributing

Contributions are welcome. Please follow these guidelines:

  1. Fork the repository and create a feature branch from main
  2. Follow the existing code style and project structure
  3. Write descriptive commit messages using Conventional Commits
    • feat: for new features
    • fix: for bug fixes
    • docs: for documentation changes
    • refactor: for code restructuring
  4. Test your changes on both server and edge environments if applicable
  5. Submit a pull request with a clear description of the changes

Development Setup

git clone git@github.com:Abhinav0002/edge-cloud-ml.git
cd edge-cloud-ml
cd server && docker compose up -d    # Start server services
cd ../edge-agent && pip install -r requirements.txt  # Set up edge agent

License

MIT License. See LICENSE for details.
