How to run a Slurm job across multiple nodes #164

@GKarbon

Hi,

I'm trying to run a job on 2 nodes with Slurm and Pyxis; PMIx is available.
The container works well when running on a single node.

Submit script

#!/usr/bin/bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

srun  --container-mounts=./input \
        --container-workdir=./input \
        --container-image=runtime.sqsh \
        mpirun -np 8 bin_std

Error output

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

After searching, I tried adding export OMPI_MCA_plm=^slurm. Now the error is:

--------------------------------------------------------------------------
The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[7357,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/slurm/plm_slurm_module.c(475)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

What is the best practice for running this across multiple nodes?
Do I need to install srun inside the container?

Thanks!
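
For reference, one pattern commonly suggested for Pyxis is to drop mpirun and let srun start every MPI rank itself through PMIx, so nothing inside the container has to SSH to the other node or find an srun binary. The script below is only a sketch under assumptions: it presumes Slurm was built with the PMIx plugin (so that `--mpi=pmix` is accepted) and that the MPI library inside runtime.sqsh can bootstrap from PMIx. The mount target `/input`, the `launch` helper, and the `DRY_RUN` switch are hypothetical names introduced for illustration and are not part of the original script.

```shell
#!/usr/bin/env bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

export OMP_NUM_THREADS=1

# Hypothetical helper: srun starts all 16 ranks directly (no mpirun).
# Assumption: ./input on the host is mounted at /input in the container.
launch() {
    # When DRY_RUN is set, print the command instead of executing it.
    ${DRY_RUN:+echo} srun --mpi=pmix \
        --container-image=runtime.sqsh \
        --container-mounts="$PWD/input:/input" \
        --container-workdir=/input \
        bin_std
}

# Launch only inside a Slurm allocation, or when a dry run is requested.
if [ -n "${SLURM_JOB_ID:-}" ] || [ -n "${DRY_RUN:-}" ]; then
    launch
fi
```

In this arrangement srun runs only on the host side, so it should not need to be installed inside the container. Submitting with `DRY_RUN=1` set in the environment prints the composed srun command without running it, which may help when checking the flags.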
