# Slurm Usage Guide

This guide explains how to use Slurm for job scheduling, including details about the `run_training.sh` script and common Slurm commands.
## Contents

- Introduction to Slurm
- Understanding run_training.sh
- Explanation of Slurm Directives
- Environment Variables
- Launching the Training Script
- Common Slurm Commands
- Examples
## Introduction to Slurm
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It facilitates job scheduling and resource management, allowing users to submit and manage computing tasks efficiently.
## Understanding run_training.sh

The `run_training.sh` script is a Slurm batch script used to submit a training job to the cluster. Below is the content of the script, with explanations for each line:
```bash
#!/bin/bash
#SBATCH --job-name=training_job      # Sets the job name to 'training_job'
#SBATCH --nodes=3                    # Requests 3 nodes
#SBATCH --ntasks-per-node=1          # Runs 1 task per node
#SBATCH --gres=gpu:8                 # Requests 8 GPUs per node
#SBATCH --time=00:15:00              # Sets a time limit of 15 minutes
#SBATCH --output=training_output.log # Redirects output to 'training_output.log'

export NCCL_SOCKET_IFNAME=^ib        # Excludes 'ib*' interfaces from NCCL socket traffic ('^' negates the match)
export NCCL_DEBUG=INFO               # Enables NCCL debugging information
export NCCL_IB_DISABLE=0             # Keeps InfiniBand (IB verbs) enabled for NCCL

source myenv/bin/activate            # Activates the Python virtual environment 'myenv'

accelerate launch --config_file=ddp3.yaml training.py # Launches the training script using 'accelerate' with the specified config
```
## Explanation of Slurm Directives

- `#SBATCH --job-name=training_job`: Assigns a name to the job for easier identification.
- `#SBATCH --nodes=3`: Requests 3 compute nodes for the job.
- `#SBATCH --ntasks-per-node=1`: Specifies that 1 task will run on each node.
- `#SBATCH --gres=gpu:8`: Requests 8 GPUs per node (24 GPUs in total across the 3 nodes).
- `#SBATCH --time=00:15:00`: Sets a maximum wall time of 15 minutes for the job.
- `#SBATCH --output=training_output.log`: Redirects the job's standard output (and, by default, its standard error) to a file.

Each of these directives can also be overridden at submission time, as shown in the sketch below.
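Command-line flags passed to `sbatch` take precedence over `#SBATCH` lines inside the script, so you can adjust a single submission without editing the file. A minimal sketch (the values here are illustrative, not recommendations):

```bash
# Override the wall time and job name for one submission only;
# the #SBATCH defaults in run_training.sh remain unchanged.
sbatch --time=00:30:00 --job-name=training_job_long run_training.sh
```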
## Environment Variables

- `export NCCL_SOCKET_IFNAME=^ib`: Excludes network interfaces whose names start with `ib` from NCCL's socket (TCP) communication; the leading `^` negates the match, so bootstrap traffic uses the remaining (e.g. Ethernet) interfaces.
- `export NCCL_DEBUG=INFO`: Enables detailed logging for NCCL operations.
- `export NCCL_IB_DISABLE=0`: Keeps InfiniBand (IB verbs) support enabled in NCCL, so bulk GPU-to-GPU traffic can still travel over InfiniBand.
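When diagnosing multi-node communication problems, it can help to temporarily force NCCL onto plain TCP sockets and rule out the InfiniBand fabric. A hedged sketch, assuming an Ethernet interface named `eth0` (a hypothetical name; check yours with `ip link`):

```bash
# Debugging fallback: disable IB verbs and pin socket traffic to one interface.
# Revert these once InfiniBand is confirmed working.
export NCCL_IB_DISABLE=1          # Disables the InfiniBand transport entirely
export NCCL_SOCKET_IFNAME=eth0    # Pins socket traffic to a known-good interface (hypothetical name)
export NCCL_DEBUG=INFO            # Keep verbose logging on while debugging
```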
## Launching the Training Script

The script activates a Python virtual environment and runs the training script using `accelerate`:

```bash
source myenv/bin/activate
accelerate launch --config_file=ddp3.yaml training.py
```

The `accelerate launch` command runs the training script with the specified configuration file for distributed data parallel (DDP) training.
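Note that a batch script executes only on the first allocated node. If `ddp3.yaml` does not itself handle the multi-node rendezvous, one common pattern is to start `accelerate` on every node with `srun` and derive the rendezvous point from Slurm's environment. A sketch under that assumption (the port `29500` and the 8-GPUs-per-node count are illustrative):

```bash
# Launch one accelerate process per node. The single quotes delay expansion
# of per-task variables (SLURM_NODEID) until srun runs the command on each node.
export MAIN_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --ntasks-per-node=1 bash -c 'accelerate launch \
    --num_machines "$SLURM_NNODES" \
    --machine_rank "$SLURM_NODEID" \
    --num_processes $((SLURM_NNODES * 8)) \
    --main_process_ip "$MAIN_IP" \
    --main_process_port 29500 \
    training.py'
```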
## Common Slurm Commands

Below are some common Slurm commands used for job management; useful filter and format options are sketched after the list.

- **Submit a job:** `sbatch run_training.sh`
- **Check the job queue:** `squeue`
- **Cancel a job:** `scancel <job_id>`
- **Show node information:** `sinfo`
- **View job details:** `scontrol show job <job_id>`
- **Check job accounting:** `sacct`
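Most of these commands accept filters and format options. For example, accounting output is often easier to read with an explicit field list (the field names below are standard `sacct` fields):

```bash
# Show state, elapsed time, and exit code for a specific job
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode

# Cancel all of your own pending and running jobs at once
scancel -u $USER
```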
## Examples

### Submitting the Job

To submit the training job, run:

```bash
sbatch run_training.sh
```
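If you want to script follow-up commands, `sbatch --parsable` prints only the job ID, which you can capture in a shell variable:

```bash
# Submit and keep the job ID for later squeue/scancel/sacct calls
JOBID=$(sbatch --parsable run_training.sh)
echo "Submitted job $JOBID"
```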
### Checking Job Status

To check the status of your jobs in the queue:

```bash
squeue
```

Sample output:

```
JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
  126       gpu training  skyporta  R  4:56     3 skyportalh100-[1-3]
```
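On a shared cluster it is usually more useful to show only your own jobs; `-u` filters by user, and wrapping the call in `watch` gives a live view:

```bash
# List only your jobs, refreshing every 10 seconds
watch -n 10 squeue -u $USER
```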
### Cancelling a Job

To cancel a job with a specific job ID:

```bash
scancel 126
```
### Viewing Node Information

To view the status of nodes in the cluster:

```bash
sinfo
```
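The default `sinfo` view groups nodes by partition and state; a per-node listing is often clearer when checking which machines are free:

```bash
# One line per node, with long-format state and resource columns
sinfo -N -l

# Show only idle nodes
sinfo -t idle
```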