Commit 5f2e2fa

Mark Saroufim authored
Add health checks for nvidia and AMD (#315)
* nvidia runner health check
* add amd
* Update amd-health.yml
* Update nvidia-arc-health.yml
1 parent 30237c7 commit 5f2e2fa

3 files changed

Lines changed: 62 additions & 0 deletions

.github/workflows/amd-health.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+name: amd
+
+on:
+  schedule:
+    # Run nightly at 2 AM UTC
+    - cron: '0 2 * * *'
+  workflow_dispatch:
+  push:
+    branches: [main]
+
+jobs:
+  health-check:
+    runs-on: [amdgpu-mi300-x86-64]
+    timeout-minutes: 5
+
+    steps:
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install PyTorch
+        run: |
+          pip install torch
+
+      - name: GPU Health Check
+        run: python -c "import torch; torch.randn(5, device='cuda')"
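
The AMD workflow uses the same `device='cuda'` string as the NVIDIA one because ROCm builds of PyTorch expose AMD GPUs through the `cuda` device namespace. The sketch below is not part of this commit. It shows one way a check could report which backend the installed wheel targets, assuming `torch.version.hip` and `torch.version.cuda` are populated as in current PyTorch releases. `describe_backend` is a hypothetical helper name.

```python
def describe_backend(torch_module):
    """Return a short label for the GPU backend a PyTorch build targets.

    `torch_module` is the imported `torch` module, passed in so the helper
    can be exercised without a GPU. Hypothetical helper, not from the commit.
    """
    version = torch_module.version
    # Assumption: ROCm wheels set torch.version.hip and CUDA wheels set
    # torch.version.cuda; the other attribute is None or absent.
    hip = getattr(version, "hip", None)
    cuda = getattr(version, "cuda", None)
    if hip:
        return f"rocm {hip}"
    if cuda:
        return f"cuda {cuda}"
    return "cpu-only"
```

After `import torch`, calling `describe_backend(torch)` on the runner would make it visible in the job log whether `pip install torch` picked up a ROCm build.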
.github/workflows/nvidia-arc-health.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+name: nvidia-arc
+
+on:
+  schedule:
+    # Run nightly at 2 AM UTC
+    - cron: '0 2 * * *'
+  workflow_dispatch:
+  push:
+    branches: [main]
+
+jobs:
+  health-check:
+    runs-on: [gpumode-nvidia-arc]
+    timeout-minutes: 5
+    container:
+      image: nvidia/cuda:12.4.0-devel-ubuntu22.04
+
+    steps:
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install PyTorch
+        run: |
+          pip install torch
+
+      - name: GPU Health Check
+        run: python -c "import torch; torch.randn(5, device='cuda')"
+
+    env:
+      CUDA_VISIBLE_DEVICES: 0
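
In both workflows the health-check step passes or fails on the exit code of the `python -c` one-liner alone: if `torch.randn(5, device='cuda')` raises, the step fails and the badge turns red. The sketch below is not part of this commit. It is a slightly more verbose version of the same check that separates "no GPU visible" from "allocation failed". `gpu_health_check` is a hypothetical name, and the `torch` module is passed in as an argument so the logic can be tested without a GPU.

```python
def gpu_health_check(torch_module, device="cuda", n=5):
    """Try to allocate a small tensor on the GPU.

    Returns (ok, message). Hypothetical helper mirroring the one-liner
    `torch.randn(5, device='cuda')` used in the workflows.
    """
    # Fail early with a clear message if PyTorch sees no GPU at all.
    if not torch_module.cuda.is_available():
        return False, "no GPU visible to PyTorch"
    try:
        x = torch_module.randn(n, device=device)
        # Wait for the device so asynchronous errors surface here.
        torch_module.cuda.synchronize()
    except RuntimeError as exc:
        return False, f"allocation failed: {exc}"
    return True, f"allocated tensor of shape {tuple(x.shape)} on {x.device}"
```

A workflow step could import `torch`, call `gpu_health_check(torch)`, print the message, and exit non-zero when `ok` is false, which keeps the pass/fail behaviour of the original step while leaving a more readable log line.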

README.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
11
# discord-cluster-manager
22

3+
[![nvidia-arc](https://github.com/pytorch-labs/discord-cluster-manager/actions/workflows/nvidia-arc-health.yml/badge.svg)](https://github.com/pytorch-labs/discord-cluster-manager/actions/workflows/nvidia-arc-health.yml)
4+
[![amd](https://github.com/pytorch-labs/discord-cluster-manager/actions/workflows/amd-health.yml/badge.svg)](https://github.com/pytorch-labs/discord-cluster-manager/actions/workflows/amd-health.yml)
5+
36
This is the code for the Discord bot we'll be using to queue jobs to a cluster of GPUs that our generous sponsors have provided. Our goal is to be able to queue kernels that can run end to end in seconds that way things feel interactive and social.
47

58
The key idea is that we're using Github Actions as a job scheduling engine and primarily making the Discord bot interact with the cluster via issuing Github Actions and and monitoring their status and while we're focused on having a nice user experience on discord.gg/gpumode, [we're happy to accept PRs](#local-development) that make it easier for other Discord communities to hook GPUs.
