How to Use SLURM Array Jobs for Parameter Sweeps and Batch Processing

Last updated: 2025-12-04
Solution under review

Environment

  • HPC4 cluster

  • Superpod cluster

  • SLURM workload manager

  • Tasks that need to run with different parameters or input files

  • Parameter sweeps, sensitivity analysis, batch data processing

Issue

  • Need to run the same program with different parameters or input files

  • Want to process multiple datasets with the same analysis script

  • Conducting parameter sweeps for optimization or sensitivity analysis

  • Have many independent tasks that can run in parallel

  • Submitting individual jobs for each parameter is time-consuming and error-prone

  • Need an efficient way to manage hundreds or thousands of similar jobs

Resolution

Use SLURM array jobs with the --array option to submit multiple similar tasks with a single command. Each array task gets a unique $SLURM_ARRAY_TASK_ID that can be used to select parameters or input files.

Basic Array Job Syntax

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-10
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err

# $SLURM_ARRAY_TASK_ID contains the array index (1, 2, 3, ..., 10)
echo "This is array task $SLURM_ARRAY_TASK_ID"

# Use the task ID in your computation
python process.py --task-id $SLURM_ARRAY_TASK_ID

This submits a single array job with 10 tasks, indexed 1 through 10. Each task runs the script independently with its own $SLURM_ARRAY_TASK_ID.

Array Job Environment Variables

Variable                    Description
--------------------------  -----------------------------------------------
$SLURM_ARRAY_TASK_ID        Current array task index (e.g., 1, 2, 3, …)
$SLURM_ARRAY_JOB_ID         Job ID of the entire array (same for all tasks)
$SLURM_ARRAY_TASK_MIN       Minimum array index
$SLURM_ARRAY_TASK_MAX       Maximum array index
$SLURM_ARRAY_TASK_COUNT     Total number of array tasks
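A minimal sketch of how the variables in the table above might be read inside a job script. The fallback values after ":-" and the function name describe_task are assumptions added here so the snippet also runs outside SLURM (for example during local testing); inside a real array task SLURM supplies the values.

```shell
#!/bin/bash
# Print a one-line summary of where this task sits in the array.
# The ":-" fallbacks are illustrative defaults for running outside SLURM.
describe_task() {
    local job_id="${SLURM_ARRAY_JOB_ID:-0}"
    local task_id="${SLURM_ARRAY_TASK_ID:-1}"
    local min="${SLURM_ARRAY_TASK_MIN:-1}"
    local max="${SLURM_ARRAY_TASK_MAX:-1}"
    local count="${SLURM_ARRAY_TASK_COUNT:-1}"
    echo "array job ${job_id}: task ${task_id} of ${count} (indices ${min}-${max})"
}

describe_task
```

Writing this line at the top of each task's log makes it easier to match a log file to its array index later.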

Use Case 1: Processing Multiple Input Files

Process different data files using the array task ID to select the file.

Method 1: Numbered Files

#!/bin/bash
#SBATCH --job-name=process_data
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-100
#SBATCH --output=logs/job_%A_%a.out

# Process file based on array task ID
# i.e. data/input_1.txt, data/input_2.txt, ...
INPUT_FILE="data/input_${SLURM_ARRAY_TASK_ID}.txt"
OUTPUT_FILE="results/output_${SLURM_ARRAY_TASK_ID}.txt"

python analyze.py --input "$INPUT_FILE" --output "$OUTPUT_FILE"
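If the input files use zero-padded numbers instead (for example data/input_007.txt, a naming scheme assumed here purely for illustration), printf can build the name from the task ID. The padding width of 3 and the helper name padded_input are hypothetical.

```shell
#!/bin/bash
# Build a zero-padded filename from a task ID.
# Assumption: files are named data/input_001.txt, data/input_002.txt, ...
padded_input() {
    local task_id="$1"
    # %03d pads to three digits: 7 -> 007, 42 -> 042, 100 -> 100
    printf 'data/input_%03d.txt\n' "$task_id"
}

# Falls back to task 7 when run outside SLURM (illustrative default)
INPUT_FILE=$(padded_input "${SLURM_ARRAY_TASK_ID:-7}")
echo "Input file: $INPUT_FILE"
```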

Method 2: File List

Read filenames from a list and select based on array task ID.

#!/bin/bash
#SBATCH --job-name=file_processing
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-50
#SBATCH --output=logs/job_%A_%a.out

# Get the filename from a list
FILE_LIST="file_list.txt"
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$FILE_LIST")

# Process the file
echo "Processing: $INPUT_FILE"
python process.py "$INPUT_FILE"

Example file_list.txt:

/path/to/dataset1.dat
/path/to/dataset2.dat
/path/to/dataset3.dat
...
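When the array range and the list length drift apart (say, --array=1-50 with only 48 lines in the list), sed silently returns an empty string. The sketch below adds a bounds check around the lookup; the function name pick_line is hypothetical.

```shell
#!/bin/bash
# Print line number "$2" of file "$1".
# Returns non-zero if the index is out of range or the line is empty.
pick_line() {
    local list="$1" index="$2" total line
    # grep -c '' counts every line, including a last line without a newline
    total=$(grep -c '' "$list")
    if [ "$index" -lt 1 ] || [ "$index" -gt "$total" ]; then
        echo "Task index $index is outside 1-$total for $list" >&2
        return 1
    fi
    line=$(sed -n "${index}p" "$list")
    if [ -z "$line" ]; then
        echo "Line $index of $list is empty" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# Typical use in a job script:
#   INPUT_FILE=$(pick_line file_list.txt "$SLURM_ARRAY_TASK_ID") || exit 1
```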

Use Case 2: Parameter Sweeps

Run simulations or analyses with different parameter values.

Simple Parameter Mapping

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=0-99
#SBATCH --output=logs/param_%A_%a.out

# Map array task ID to parameter values
# Example: sweep learning rate from 0.001 to 0.1
LEARNING_RATE=$(awk "BEGIN {print 0.001 + $SLURM_ARRAY_TASK_ID * 0.001}")

echo "Running with learning rate: $LEARNING_RATE"
python train_model.py --lr $LEARNING_RATE

Multi-Dimensional Parameter Grid

Sweep multiple parameters simultaneously.

#!/bin/bash
#SBATCH --job-name=grid_search
#SBATCH --account=exampleproj
#SBATCH --partition=gpu
#SBATCH --gpus-per-task=1
#SBATCH --array=0-99
#SBATCH --output=logs/grid_%A_%a.out

# Define parameter grid
# 10 learning rates × 10 batch sizes = 100 combinations
LR_VALUES=(0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.2 0.5 1.0)
BATCH_VALUES=(16 32 64 128 256 512 1024 2048 4096 8192)

# Calculate indices
LR_IDX=$((SLURM_ARRAY_TASK_ID / 10))
BATCH_IDX=$((SLURM_ARRAY_TASK_ID % 10))

# Get parameter values
LR=${LR_VALUES[$LR_IDX]}
BATCH=${BATCH_VALUES[$BATCH_IDX]}

echo "Learning Rate: $LR, Batch Size: $BATCH"
python train.py --lr $LR --batch-size $BATCH
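The same integer division and modulo arithmetic extends to more than two parameters. The sketch below decodes a 0-based task ID into one index per dimension for a hypothetical 3 × 4 × 2 grid (24 tasks, --array=0-23). The grid sizes, the third "seed" dimension, and the function name decode_index are illustrative assumptions.

```shell
#!/bin/bash
# Decode a flat 0-based task ID into one index per grid dimension.
# Usage: decode_index TASK_ID SIZE_1 SIZE_2 ... SIZE_N
# The last dimension varies fastest, matching the two-parameter example.
decode_index() {
    local id="$1"; shift
    local sizes=("$@") indices=() i
    for (( i = ${#sizes[@]} - 1; i >= 0; i-- )); do
        indices[i]=$(( id % sizes[i] ))   # position within this dimension
        id=$(( id / sizes[i] ))           # carry the rest to the next one
    done
    echo "${indices[@]}"
}

# Hypothetical grid: 3 learning rates x 4 batch sizes x 2 seeds = 24 tasks
read -r LR_IDX BATCH_IDX SEED_IDX <<< "$(decode_index "${SLURM_ARRAY_TASK_ID:-0}" 3 4 2)"
echo "indices: $LR_IDX $BATCH_IDX $SEED_IDX"
```

Called as decode_index 57 10 10, the function gives indices 5 and 7, the same result as the division and modulo lines in the two-parameter script.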

Using Parameter File

Read parameter combinations from a file.

#!/bin/bash
#SBATCH --job-name=param_file
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-100
#SBATCH --output=logs/param_%A_%a.out

# Read parameters from file (one combination per line)
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" parameters.txt)

# Parse parameters (assuming space-separated)
read -r ALPHA BETA GAMMA <<< "$PARAMS"

echo "Running with α=$ALPHA, β=$BETA, γ=$GAMMA"
./simulation --alpha $ALPHA --beta $BETA --gamma $GAMMA

Example parameters.txt:

0.1 0.5 1.0
0.1 0.5 2.0
0.1 1.0 1.0
0.2 0.5 1.0
...
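A parameter file like this can be generated with nested loops rather than written by hand. The following is a sketch only: the value lists, the temporary output path, and the function name write_param_grid are assumptions for illustration.

```shell
#!/bin/bash
# Write one space-separated "alpha beta gamma" combination per line
# and print the number of combinations written.
write_param_grid() {
    local out="$1" alpha beta gamma
    : > "$out"    # create or truncate the output file
    for alpha in 0.1 0.2; do
        for beta in 0.5 1.0; do
            for gamma in 1.0 2.0; do
                echo "$alpha $beta $gamma" >> "$out"
            done
        done
    done
    grep -c '' "$out"    # line count, usable as N in --array=1-N
}

# Temporary file for the demo; in practice this would be parameters.txt
PARAM_FILE=$(mktemp)
N=$(write_param_grid "$PARAM_FILE")
echo "Wrote $N combinations; submit with --array=1-$N"
```

Deriving the array range from the line count keeps the two in sync when the grid changes.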

Use Case 3: Processing Folders

Process data in different directories.

#!/bin/bash
#SBATCH --job-name=folder_processing
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-20
#SBATCH --output=logs/folder_%A_%a.out

# Define folder pattern
FOLDER_PREFIX="/data/experiment"
FOLDER="${FOLDER_PREFIX}_${SLURM_ARRAY_TASK_ID}"

# Check if folder exists
if [ -d "$FOLDER" ]; then
    echo "Processing folder: $FOLDER"
    cd "$FOLDER" || exit 1
    python ../analysis.py
else
    echo "Warning: Folder $FOLDER does not exist"
    exit 1
fi

Array Index Specifications

Different ways to specify array indices:

# Range: tasks 1, 2, 3, ..., 100
#SBATCH --array=1-100

# Range with step: tasks 0, 10, 20, ..., 100
#SBATCH --array=0-100:10

# Specific values: tasks 1, 5, 10, 15
#SBATCH --array=1,5,10,15

# Mixed: tasks 1, 2, 3, 4, 5, 10, 20, 30
#SBATCH --array=1-5,10,20,30

# Limit concurrent tasks: max 10 running at once
#SBATCH --array=1-1000%10

Tip

Use % to limit concurrent array tasks. This prevents overwhelming the system with too many simultaneous jobs while still allowing all tasks to queue.
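To preview which task indices a specification produces before submitting, the forms above can be expanded locally. This is a rough bash sketch, not SLURM's own parser. The function name expand_array_spec is hypothetical, and it handles only the forms listed above (ranges, steps, comma lists, and a trailing %limit, which is ignored).

```shell
#!/bin/bash
# Expand an --array specification into a space-separated list of task IDs.
expand_array_spec() {
    local spec="${1%%\%*}"   # drop an optional %limit suffix
    local part range step first last ids=() i
    local IFS=','            # split the specification on commas
    for part in $spec; do
        step=1
        range="$part"
        if [[ "$part" == *:* ]]; then
            step="${part##*:}"    # text after the colon
            range="${part%%:*}"   # text before the colon
        fi
        first="${range%%-*}"
        last="${range##*-}"       # equals first for a single value
        for (( i = first; i <= last; i += step )); do
            ids+=("$i")
        done
    done
    IFS=' '
    echo "${ids[*]}"
}

expand_array_spec "1-5,10,20,30"
```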

Managing Array Jobs

Monitoring Array Jobs

# View all array tasks
squeue -u $USER

# View specific array job
squeue -j 12345

# Count running/pending tasks (-h drops the header line so it is not counted)
squeue -u $USER --array -h -t RUNNING | wc -l
squeue -u $USER --array -h -t PENDING | wc -l

Canceling Array Tasks

# Cancel entire array job
scancel 12345

# Cancel specific array task
scancel 12345_5

# Cancel range of array tasks
scancel 12345_[10-20]

# Cancel all of your jobs with a specific job name (array jobs included)
scancel --name=array_job

Output File Naming

Use special placeholders in output filenames:

# %A = array job ID (same for all tasks)
# %a = array task ID (unique for each task)
#SBATCH --output=results_%A_%a.out
#SBATCH --error=errors_%A_%a.err

# Organize outputs in subdirectories
# (the directories must already exist; SLURM does not create them)
#SBATCH --output=logs/task_%a/output.log
#SBATCH --error=logs/task_%a/error.log
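For the per-task subdirectory layout, the directories in the --output path generally need to exist before the tasks start. A small sketch that creates them ahead of submission; the function name make_task_log_dirs and the example range are assumptions.

```shell
#!/bin/bash
# Create BASE/task_<id>/ for every task index in FIRST..LAST.
make_task_log_dirs() {
    local base="$1" first="$2" last="$3" i
    for (( i = first; i <= last; i++ )); do
        mkdir -p "${base}/task_${i}"
    done
}

# Typical use before submitting (range must match --array):
#   make_task_log_dirs logs 1 100
#   sbatch --array=1-100 job.sh
```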

Best Practices

Array Job Design

  • Make each array task independent - no dependencies between tasks

  • Ensure all tasks have similar resource requirements

  • Use --array=1-N%M to limit concurrent tasks and avoid overwhelming the scheduler

  • Test with a small array (e.g., --array=1-3) before scaling up

Resource Management

  • Request resources per task, not for the entire array

  • Consider task runtime - all tasks should finish in similar time

  • Use appropriate concurrency limits based on cluster policy

Data Management

  • Use unique output filenames with %A_%a to avoid conflicts

  • Create output directories before submitting if needed

  • Consider using task-specific working directories

  • Clean up intermediate files from completed tasks

Error Handling

  • Include error checking in your script

  • Log which parameter combination or file each task processes

  • Failed tasks can be identified and resubmitted individually

  • Use set -e to exit on errors

Debugging

  • Test with --array=1 or --array=1-3 first

  • Check one output file to verify correctness

  • Use explicit echo statements to log task ID and parameters

  • Verify file/parameter selection logic works correctly
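Several of the practices above (exit on errors, log the task ID, check inputs) can be combined in a short preamble at the top of a job script. This is one possible sketch, not a required template; the helper names log and require_file and the "local" fallback tag are assumptions.

```shell
#!/bin/bash
# Exit on errors, unset variables, and failed pipeline stages.
set -euo pipefail

# Print a timestamped message tagged with the array job and task IDs.
log() {
    echo "[$(date '+%F %T')] [${SLURM_ARRAY_JOB_ID:-local}_${SLURM_ARRAY_TASK_ID:-0}] $*"
}

# Fail early with a clear message if an expected input file is missing.
require_file() {
    if [ ! -f "$1" ]; then
        log "ERROR: input file not found: $1" >&2
        return 1
    fi
}

log "starting task"
# Typical use further down in the script (names are placeholders):
#   require_file "$INPUT_FILE"
#   python analyze.py --input "$INPUT_FILE" --output "$OUTPUT_FILE"
#   log "finished $INPUT_FILE"
```

Because of set -e, a failing require_file call stops the task immediately, so the task shows up as failed instead of producing empty results.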

Advanced Techniques

Dynamic Array Size from File Count

# Count files and create array job
shopt -s nullglob
files=(data/*.txt)
NUM_FILES=${#files[@]}
sbatch --array=1-$NUM_FILES process_files.sh
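With zero matching files the command above would submit --array=1-0, which is not a valid range. The sketch below guards against that and also writes the file list that the job script can index by task ID. The function name build_file_list and the list filename are assumptions.

```shell
#!/bin/bash
# Write every file matching DIR/*.txt to LIST (one per line) and print
# a matching --array range. Returns non-zero if no files are found.
build_file_list() {
    local dir="$1" list="$2" files
    shopt -s nullglob          # an unmatched glob expands to nothing
    files=("$dir"/*.txt)
    shopt -u nullglob
    if [ "${#files[@]}" -eq 0 ]; then
        echo "No .txt files found in $dir" >&2
        return 1
    fi
    printf '%s\n' "${files[@]}" > "$list"
    echo "1-${#files[@]}"
}

# Typical use:
#   ARRAY_SPEC=$(build_file_list data file_list.txt) \
#       && sbatch --array="$ARRAY_SPEC" process_files.sh
```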

Resubmitting Failed Tasks

# Find failed tasks from sacct
JOB_ID=12345
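# -X lists only the array tasks themselves (no .batch/.extern steps),
# -n drops the header, -P prints "|"-separated fields
sacct -j "$JOB_ID" -X -n -P --format=JobID,State | awk -F'|' '$2 == "FAILED" {print $1}' > failed_tasks.txt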

# Create array specification from failed tasks
FAILED_ARRAY=$(sed "s/${JOB_ID}_//" failed_tasks.txt | tr '\n' ',' | sed 's/,$//')

# Resubmit only failed tasks
sbatch --array=$FAILED_ARRAY rerun_job.sh
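The extraction step can also be written as a single filter that can be tested without a cluster. The sketch below assumes "JobID|State" input lines, as produced by sacct with -X -n -P --format=JobID,State. The function name failed_task_spec and the sample input are hypothetical.

```shell
#!/bin/bash
# Read "JobID|State" lines on stdin and print the indices of FAILED
# tasks belonging to array job "$1" as a comma-separated list.
failed_task_spec() {
    local job_id="$1"
    awk -F'|' -v prefix="${job_id}_" '
        $2 == "FAILED" && index($1, prefix) == 1 {
            # strip the "<jobid>_" prefix, keep only the task index
            ids = ids sep substr($1, length(prefix) + 1)
            sep = ","
        }
        END { print ids }
    '
}

# Hypothetical sample input, for illustration only; prints: 2,7
printf '12345_1|COMPLETED\n12345_2|FAILED\n12345_7|FAILED\n' | failed_task_spec 12345
```

On a cluster the sample input would be replaced by the sacct command, and an empty result means there is nothing to resubmit.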

Root Cause

Array jobs solve the problem of submitting and managing large numbers of similar tasks:

Without Array Jobs:

  • Need to write loops to submit hundreds of individual jobs

  • Job IDs are unrelated, making management difficult

  • Output files need manual naming conventions

  • Monitoring and canceling groups of related jobs is tedious

With Array Jobs:

  • Single submission for all related tasks

  • Automatic task indexing with $SLURM_ARRAY_TASK_ID

  • Unified job ID for the entire array

  • Easy monitoring and cancellation of task groups

  • Built-in output file naming with %A_%a

  • Scheduler can optimize resource allocation for task groups
