# Slurm Usage Guide

This guide explains how to use Slurm for job scheduling, including details about the `run_training.sh` script and common Slurm commands.
## Contents

- Introduction to Slurm
- Understanding run_training.sh
- Explanation of Slurm Directives
- Environment Variables
- Launching the Training Script
- Common Slurm Commands
- Examples
## Introduction to Slurm
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It facilitates job scheduling and resource management, allowing users to submit and manage computing tasks efficiently.
## Understanding run_training.sh

The `run_training.sh` script is a Slurm batch script used to submit a training job to the cluster. Below is the content of the script, with explanations for each line:
```bash
#!/bin/bash
#SBATCH --job-name=training_job      # Sets the job name to 'training_job'
#SBATCH --nodes=3                    # Requests 3 nodes
#SBATCH --ntasks-per-node=1          # Runs 1 task per node
#SBATCH --gres=gpu:8                 # Requests 8 GPUs per node
#SBATCH --time=00:15:00              # Sets a time limit of 15 minutes
#SBATCH --output=training_output.log # Redirects output to 'training_output.log'

export NCCL_SOCKET_IFNAME=^ib        # Excludes 'ib*' interfaces from NCCL socket traffic ('^' negates the match)
export NCCL_DEBUG=INFO               # Enables NCCL debugging information
export NCCL_IB_DISABLE=0             # Keeps InfiniBand (IB verbs) enabled for NCCL

source myenv/bin/activate            # Activates the Python virtual environment 'myenv'

accelerate launch --config_file=ddp3.yaml training.py # Launches the training script using 'accelerate' with the specified config
```
## Explanation of Slurm Directives

- `#SBATCH --job-name=training_job`: Assigns a name to the job for easier identification.
- `#SBATCH --nodes=3`: Requests 3 compute nodes for the job.
- `#SBATCH --ntasks-per-node=1`: Specifies that 1 task will run on each node.
- `#SBATCH --gres=gpu:8`: Requests 8 GPUs per node (24 GPUs in total across the 3 nodes).
- `#SBATCH --time=00:15:00`: Sets a maximum wall time of 15 minutes for the job.
- `#SBATCH --output=training_output.log`: Redirects the job's standard output (and, by default, its standard error) to a file.

Each of these directives can also be overridden at submission time, as shown in the sketch below.
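Command-line flags passed to `sbatch` take precedence over `#SBATCH` lines inside the script, so you can adjust a single submission without editing the file. A minimal sketch (the values here are illustrative, not recommendations):

```bash
# Override the wall time and job name for one submission only;
# the #SBATCH defaults in run_training.sh remain unchanged.
sbatch --time=00:30:00 --job-name=training_job_long run_training.sh
```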
## Environment Variables

- `export NCCL_SOCKET_IFNAME=^ib`: Excludes network interfaces whose names start with `ib` from NCCL's socket (TCP) communication; the leading `^` negates the match, so bootstrap traffic uses the remaining (e.g. Ethernet) interfaces.
- `export NCCL_DEBUG=INFO`: Enables detailed logging for NCCL operations.
- `export NCCL_IB_DISABLE=0`: Keeps InfiniBand (IB verbs) support enabled in NCCL, so bulk GPU-to-GPU traffic can still travel over InfiniBand.
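When diagnosing multi-node communication problems, it can help to temporarily force NCCL onto plain TCP sockets and rule out the InfiniBand fabric. A hedged sketch, assuming an Ethernet interface named `eth0` (a hypothetical name; check yours with `ip link`):

```bash
# Debugging fallback: disable IB verbs and pin socket traffic to one interface.
# Revert these once InfiniBand is confirmed working.
export NCCL_IB_DISABLE=1          # Disables the InfiniBand transport entirely
export NCCL_SOCKET_IFNAME=eth0    # Pins socket traffic to a known-good interface (hypothetical name)
export NCCL_DEBUG=INFO            # Keep verbose logging on while debugging
```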
## Launching the Training Script

The script activates a Python virtual environment and runs the training script using `accelerate`:

```bash
source myenv/bin/activate
accelerate launch --config_file=ddp3.yaml training.py
```

The `accelerate launch` command runs the training script with the specified configuration file for distributed data parallel (DDP) training.
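Note that a batch script executes only on the first allocated node. If `ddp3.yaml` does not itself handle the multi-node rendezvous, one common pattern is to start `accelerate` on every node with `srun` and derive the rendezvous point from Slurm's environment. A sketch under that assumption (the port `29500` and the 8-GPUs-per-node count are illustrative):

```bash
# Launch one accelerate process per node. The single quotes delay expansion
# of per-task variables (SLURM_NODEID) until srun runs the command on each node.
export MAIN_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --ntasks-per-node=1 bash -c 'accelerate launch \
    --num_machines "$SLURM_NNODES" \
    --machine_rank "$SLURM_NODEID" \
    --num_processes $((SLURM_NNODES * 8)) \
    --main_process_ip "$MAIN_IP" \
    --main_process_port 29500 \
    training.py'
```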
## Common Slurm Commands

Below are some common Slurm commands used for job management; useful filter and format options are sketched after the list.

- **Submit a job:** `sbatch run_training.sh`
- **Check the job queue:** `squeue`
- **Cancel a job:** `scancel <job_id>`
- **Show node information:** `sinfo`
- **View job details:** `scontrol show job <job_id>`
- **Check job accounting:** `sacct`
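Most of these commands accept filters and format options. For example, accounting output is often easier to read with an explicit field list (the field names below are standard `sacct` fields):

```bash
# Show state, elapsed time, and exit code for a specific job
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode

# Cancel all of your own pending and running jobs at once
scancel -u $USER
```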
## Examples

### Submitting the Job

To submit the training job, run:

```bash
sbatch run_training.sh
```
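If you want to script follow-up commands, `sbatch --parsable` prints only the job ID, which you can capture in a shell variable:

```bash
# Submit and keep the job ID for later squeue/scancel/sacct calls
JOBID=$(sbatch --parsable run_training.sh)
echo "Submitted job $JOBID"
```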
### Checking Job Status

To check the status of your jobs in the queue:

```bash
squeue
```

Sample output:

```
JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
  126       gpu training  skyporta  R  4:56     3 skyportalh100-[1-3]
```
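On a shared cluster it is usually more useful to show only your own jobs; `-u` filters by user, and wrapping the call in `watch` gives a live view:

```bash
# List only your jobs, refreshing every 10 seconds
watch -n 10 squeue -u $USER
```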
### Cancelling a Job

To cancel a job with a specific job ID:

```bash
scancel 126
```
### Viewing Node Information

To view the status of nodes in the cluster:

```bash
sinfo
```
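The default `sinfo` view groups nodes by partition and state; a per-node listing is often clearer when checking which machines are free:

```bash
# One line per node, with long-format state and resource columns
sinfo -N -l

# Show only idle nodes
sinfo -t idle
```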