How to Use SLURM Array Jobs for Parameter Sweeps and Batch Processing

Environment

HPC4 cluster

Superpod cluster

SLURM workload manager

Tasks that need to run with different parameters or input files

Parameter sweeps, sensitivity analysis, batch data processing

Issue

Need to run the same program with different parameters or input files

Want to process multiple datasets with the same analysis script

Conducting parameter sweeps for optimization or sensitivity analysis

Have many independent tasks that can run in parallel

Submitting individual jobs for each parameter is time-consuming and error-prone

Need an efficient way to manage hundreds or thousands of similar jobs

Resolution

Use SLURM array jobs with the --array option to submit multiple similar tasks with a single command. Each array task gets a unique $SLURM_ARRAY_TASK_ID that can be used to select parameters or input files.

Basic Array Job Syntax

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-10
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err

# $SLURM_ARRAY_TASK_ID contains the array index (1, 2, 3, ..., 10)
echo "This is array task $SLURM_ARRAY_TASK_ID"

# Use the task ID in your computation
python process.py --task-id $SLURM_ARRAY_TASK_ID

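This submits a single array job containing 10 tasks, with indices 1 through 10. Each task is scheduled independently and runs the same script with its own $SLURM_ARRAY_TASK_ID.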

Array Job Environment Variables

Variable	Description
`$SLURM_ARRAY_TASK_ID`	Current array task index (e.g., 1, 2, 3, …)
`$SLURM_ARRAY_JOB_ID`	Job ID of the entire array (same for all tasks)
`$SLURM_ARRAY_TASK_MIN`	Minimum array index
`$SLURM_ARRAY_TASK_MAX`	Maximum array index
`$SLURM_ARRAY_TASK_COUNT`	Total number of array tasks
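
The variables in the table can be combined to log where a task sits within its array. The following is a minimal sketch, not a required pattern: the describe_task function and the fallback value of 1 (used so the script also runs outside SLURM) are assumptions made for illustration.

```shell
#!/bin/bash
# Describe where a task sits within its array.
# Arguments: task ID, minimum index, maximum index, task count.
describe_task() {
    local id=$1 min=$2 max=$3 count=$4
    echo "task ${id} of ${count} (indices ${min}-${max})"
    if [ "$id" -eq "$max" ]; then
        echo "this is the highest-index task"
    fi
}

# Inside a job, SLURM sets these variables; the :-1 defaults (an assumption
# for local testing) let the script also run outside the scheduler.
describe_task "${SLURM_ARRAY_TASK_ID:-1}" "${SLURM_ARRAY_TASK_MIN:-1}" \
              "${SLURM_ARRAY_TASK_MAX:-1}" "${SLURM_ARRAY_TASK_COUNT:-1}"
```

Writing this line at the top of each task's output makes it easier to match log files to array indices later.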

Use Case 1: Processing Multiple Input Files

Process different data files using the array task ID to select the file.

Method 1: Numbered Files

#!/bin/bash
#SBATCH --job-name=process_data
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-100
#SBATCH --output=logs/job_%A_%a.out

# Process file based on array task ID
# i.e. data/input_1.txt, data/input_2.txt, ...
INPUT_FILE="data/input_${SLURM_ARRAY_TASK_ID}.txt"
OUTPUT_FILE="results/output_${SLURM_ARRAY_TASK_ID}.txt"

python analyze.py --input "$INPUT_FILE" --output "$OUTPUT_FILE"
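
If the numbered files use zero-padded names, the task ID has to be padded before it is placed in the filename. The sketch below assumes a hypothetical naming scheme of data/input_001.txt, data/input_002.txt, and so on; the padded_input helper is illustrative only.

```shell
#!/bin/bash
# Build the input path for a task, assuming three-digit zero-padded names
# (input_001.txt, input_002.txt, ...). The naming scheme is an assumption.
padded_input() {
    printf 'data/input_%03d.txt\n' "$1"
}

# Fall back to task 1 when run outside SLURM (for local testing only).
INPUT_FILE=$(padded_input "${SLURM_ARRAY_TASK_ID:-1}")
echo "Would process: $INPUT_FILE"
```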

Method 2: File List

Read filenames from a list and select based on array task ID.

#!/bin/bash
#SBATCH --job-name=file_processing
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-50
#SBATCH --output=logs/job_%A_%a.out

# Get the filename from a list
FILE_LIST="file_list.txt"
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$FILE_LIST")

# Process the file (quoted so that paths containing spaces survive)
echo "Processing: $INPUT_FILE"
python process.py "$INPUT_FILE"

Example file_list.txt:

/path/to/dataset1.dat
/path/to/dataset2.dat
/path/to/dataset3.dat
...
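
The file list can be generated rather than written by hand, and its line count can then set the array range. The following sketch assumes the inputs are *.dat files in a single directory; the build_file_list helper and the job.sh script name are hypothetical.

```shell
#!/bin/bash
# Write a sorted list of *.dat files in a directory and print how many were found.
# Arguments: data directory, output list file.
build_file_list() {
    local data_dir=$1 list=$2
    # Sort so that line N maps to the same file on every rerun
    find "$data_dir" -maxdepth 1 -type f -name '*.dat' | sort > "$list"
    # Arithmetic expansion strips any padding that wc adds to the count
    echo $(( $(wc -l < "$list") ))
}

# Hypothetical usage on the login node before submitting:
#   N=$(build_file_list /path/to/data file_list.txt)
#   sbatch --array=1-"$N" job.sh
```

Sorting matters because the array index is only meaningful if the list order is stable between the time the list is built and the time each task reads it.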

Use Case 2: Parameter Sweeps

Run simulations or analyses with different parameter values.

Simple Parameter Mapping

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=0-99
#SBATCH --output=logs/param_%A_%a.out

# Map array task ID to parameter values
# Example: sweep learning rate from 0.001 to 0.1
LEARNING_RATE=$(awk "BEGIN {print 0.001 + $SLURM_ARRAY_TASK_ID * 0.001}")

echo "Running with learning rate: $LEARNING_RATE"
python train_model.py --lr $LEARNING_RATE
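
The mapping above is linear. For parameters that span several orders of magnitude, a logarithmic mapping is a common alternative. The sketch below is a variation, not part of the original example: it assumes --array=0-6 and a hypothetical lr_for_task helper that covers 0.0001 to 0.1 in half-decade steps.

```shell
#!/bin/bash
# Map a task ID to a log-spaced learning rate: 10^(-4 + 0.5 * id).
# With --array=0-6 this covers 0.0001 up to 0.1 (an illustrative range).
lr_for_task() {
    # -v passes the task ID as an awk variable instead of splicing it
    # into the program text
    awk -v id="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-4 + id * 0.5) }'
}

# Fall back to task 0 when run outside SLURM (for local testing only).
LEARNING_RATE=$(lr_for_task "${SLURM_ARRAY_TASK_ID:-0}")
echo "Running with learning rate: $LEARNING_RATE"
```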

Multi-Dimensional Parameter Grid

Sweep multiple parameters simultaneously.

#!/bin/bash
#SBATCH --job-name=grid_search
#SBATCH --account=exampleproj
#SBATCH --partition=gpu
#SBATCH --gpus-per-task=1
#SBATCH --array=0-99
#SBATCH --output=logs/grid_%A_%a.out

# Define parameter grid
# 10 learning rates × 10 batch sizes = 100 combinations
LR_VALUES=(0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.2 0.5 1.0)
BATCH_VALUES=(16 32 64 128 256 512 1024 2048 4096 8192)

# Calculate indices
LR_IDX=$((SLURM_ARRAY_TASK_ID / 10))
BATCH_IDX=$((SLURM_ARRAY_TASK_ID % 10))

# Get parameter values
LR=${LR_VALUES[$LR_IDX]}
BATCH=${BATCH_VALUES[$BATCH_IDX]}

echo "Learning Rate: $LR, Batch Size: $BATCH"
python train.py --lr $LR --batch-size $BATCH
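
The index arithmetic above hard-codes the grid width of 10. A variant that derives the width from the array lengths keeps the mapping correct when values are added or removed. This is a sketch with a smaller, hypothetical 3 × 4 grid; the grid_point helper is illustrative.

```shell
#!/bin/bash
# Hypothetical 3 x 4 grid; the values are placeholders.
LR_VALUES=(0.001 0.01 0.1)
BATCH_VALUES=(16 32 64 128)

N_BATCH=${#BATCH_VALUES[@]}
N_TOTAL=$(( ${#LR_VALUES[@]} * N_BATCH ))

# Print "<lr> <batch>" for a zero-based task ID, or fail if the ID is off the grid.
grid_point() {
    local id=$1
    if [ "$id" -lt 0 ] || [ "$id" -ge "$N_TOTAL" ]; then
        echo "task ID $id is outside the grid of $N_TOTAL points" >&2
        return 1
    fi
    local lr_idx=$(( id / N_BATCH ))      # slow-moving index
    local batch_idx=$(( id % N_BATCH ))   # fast-moving index
    echo "${LR_VALUES[$lr_idx]} ${BATCH_VALUES[$batch_idx]}"
}

# Fall back to task 0 when run outside SLURM (for local testing only).
read -r LR BATCH <<< "$(grid_point "${SLURM_ARRAY_TASK_ID:-0}")"
echo "Learning Rate: $LR, Batch Size: $BATCH"
```

With this layout the array range should be 0 to N_TOTAL - 1, which is --array=0-11 for the grid shown.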

Using Parameter File

Read parameter combinations from a file.

#!/bin/bash
#SBATCH --job-name=param_file
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-100
#SBATCH --output=logs/param_%A_%a.out

# Read parameters from file (one combination per line)
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" parameters.txt)

# Parse parameters (assuming space-separated)
read -r ALPHA BETA GAMMA <<< "$PARAMS"

echo "Running with α=$ALPHA, β=$BETA, γ=$GAMMA"
./simulation --alpha $ALPHA --beta $BETA --gamma $GAMMA

Example parameters.txt:

1 0.5 1.0
1 0.5 2.0
1 1.0 1.0
2 0.5 1.0
...
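
A parameter file in this format can be generated with nested loops instead of being typed by hand. The sketch below assumes a small illustrative grid (two values per parameter); the write_param_grid helper and the values themselves are hypothetical.

```shell
#!/bin/bash
# Write every combination of alpha, beta, and gamma to a file, one per line,
# space-separated. The values below are placeholders.
write_param_grid() {
    local out=$1 alpha beta gamma
    : > "$out"   # start from an empty file
    for alpha in 1 2; do
        for beta in 0.5 1.0; do
            for gamma in 1.0 2.0; do
                echo "$alpha $beta $gamma" >> "$out"
            done
        done
    done
}

# Hypothetical usage before submitting:
#   write_param_grid parameters.txt
#   sbatch --array=1-"$(( $(wc -l < parameters.txt) ))" job.sh
```

Keeping the generated file alongside the results records exactly which combination each array index received.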

Use Case 3: Processing Folders

Process data in different directories.

#!/bin/bash
#SBATCH --job-name=folder_processing
#SBATCH --account=exampleproj
#SBATCH --partition=amd
#SBATCH --array=1-20
#SBATCH --output=logs/folder_%A_%a.out

# Define folder pattern
FOLDER_PREFIX="/data/experiment"
FOLDER="${FOLDER_PREFIX}_${SLURM_ARRAY_TASK_ID}"

# Check if folder exists
if [ -d "$FOLDER" ]; then
    echo "Processing folder: $FOLDER"
    # Quote the path and stop if the directory change fails
    cd "$FOLDER" || exit 1
    python ../analysis.py
else
    echo "Error: Folder $FOLDER does not exist" >&2
    exit 1
fi

Array Index Specifications

Different ways to specify array indices:

# Range: tasks 1, 2, 3, ..., 100
#SBATCH --array=1-100

# Range with step: tasks 0, 10, 20, ..., 100
#SBATCH --array=0-100:10

# Specific values: tasks 1, 5, 10, 15
#SBATCH --array=1,5,10,15

# Mixed: tasks 1, 2, 3, 4, 5, 10, 20, 30
#SBATCH --array=1-5,10,20,30

# Limit concurrent tasks: max 10 running at once
#SBATCH --array=1-1000%10

Tip

Use % to limit concurrent array tasks. This prevents overwhelming the system with too many simultaneous jobs while still allowing all tasks to queue.
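
To check which indices a specification selects before submitting, the syntax can be expanded locally. The function below is a hypothetical helper written for this article, not a SLURM command. It handles ranges, steps, comma-separated values, and ignores a trailing %limit, and it assumes the input is well formed.

```shell
#!/bin/bash
# Print the task indices selected by an --array specification, one per line.
# Local preview helper only; it does not validate malformed input.
expand_array_spec() {
    local spec=${1%%\%*}   # drop an optional %limit suffix
    local part range step start end i
    local IFS=','
    for part in $spec; do
        step=1
        range=$part
        if [[ $part == *:* ]]; then
            step=${part##*:}     # text after the colon
            range=${part%%:*}    # text before the colon
        fi
        if [[ $range == *-* ]]; then
            start=${range%%-*}
            end=${range##*-}
        else
            start=$range
            end=$range
        fi
        for (( i = start; i <= end; i += step )); do
            printf '%s\n' "$i"
        done
    done
}

expand_array_spec "1-5,10,20,30"
```

Piping the output through wc -l gives the number of tasks the specification would create.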

Managing Array Jobs

Monitoring Array Jobs

# View all array tasks
squeue -u $USER

# View specific array job
squeue -j 12345

# Count running/pending tasks (-h suppresses the header line so it is not counted)
squeue -h -u $USER --array -t RUNNING | wc -l
squeue -h -u $USER --array -t PENDING | wc -l
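
A single pass over the queue can report the count for every state at once. This is a minimal sketch: summarize_states is a hypothetical helper, and the squeue call is guarded so the script does nothing on a machine without SLURM.

```shell
#!/bin/bash
# Summarize a stream of job states (one per line) as "STATE: count".
summarize_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# -h drops the header, -r prints one line per array task,
# and -o "%T" prints only the state of each task.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -r -u "$USER" -o "%T" | summarize_states
fi
```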

Canceling Array Tasks

# Cancel entire array job
scancel 12345

# Cancel specific array task
scancel 12345_5

# Cancel range of array tasks
scancel 12345_[10-20]

# Cancel all array tasks with specific job name
scancel --name=array_job

Output File Naming

Use special placeholders in output filenames:

# %A = array job ID (same for all tasks)
# %a = array task ID (unique for each task)
#SBATCH --output=results_%A_%a.out
#SBATCH --error=errors_%A_%a.err

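# Organize outputs in subdirectories
# (SLURM does not create missing directories; each logs/task_<index>/ must exist before submission)
#SBATCH --output=logs/task_%a/output.log
#SBATCH --error=logs/task_%a/error.log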
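
Per-task log directories such as logs/task_%a/ need to exist before the tasks start, because the job script cannot create the directory that holds its own output file. A short sketch of preparing them on the login node follows; make_task_dirs is a hypothetical helper and the index range is an example.

```shell
#!/bin/bash
# Create <base>/task_<i> for every index from first to last.
make_task_dirs() {
    local base=$1 first=$2 last=$3 i
    for (( i = first; i <= last; i++ )); do
        mkdir -p "${base}/task_${i}"
    done
}

# Hypothetical usage before submitting an array with --array=1-20:
#   make_task_dirs logs 1 20
```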

Best Practices

Array Job Design

Make each array task independent - no dependencies between tasks
Ensure all tasks have similar resource requirements
Use --array=1-N%M to limit concurrent tasks and avoid overwhelming the scheduler
Test with a small array (e.g., --array=1-3) before scaling up

Resource Management

Request resources per task, not for the entire array
Consider task runtime - all tasks should finish in similar time
Use appropriate concurrency limits based on cluster policy

Data Management

Use unique output filenames with %A_%a to avoid conflicts
Create output directories before submitting if needed
Consider using task-specific working directories
Clean up intermediate files from completed tasks

Error Handling

Include error checking in your script
Log which parameter combination or file each task processes
Failed tasks can be identified and resubmitted individually
Use set -e to exit on errors
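
The error-handling points above might be combined as in the following sketch. The -u and pipefail options go beyond the set -e recommendation and are an optional addition; the log and require_file helpers and the input path are hypothetical.

```shell
#!/bin/bash
# Stop on errors, on unset variables, and on failures inside pipelines.
set -euo pipefail

TASK_ID=${SLURM_ARRAY_TASK_ID:-0}   # default of 0 is an assumption for local testing

# Prefix every message with the task ID so logs can be matched to tasks.
log() {
    echo "[task ${TASK_ID}] $*"
}

# Return non-zero (and log to stderr) if an input file is missing or unreadable.
require_file() {
    if [ ! -r "$1" ]; then
        log "ERROR: cannot read input file: $1" >&2
        return 1
    fi
}

log "starting"
# Hypothetical usage in the job script:
#   require_file "data/input_${TASK_ID}.txt"
#   python analyze.py --input "data/input_${TASK_ID}.txt"
```

Because set -e is active, a failing require_file call ends the task with a non-zero exit code, so the task is recorded as FAILED in the accounting data.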

Debugging

Test with --array=1 or --array=1-3 first
Check one output file to verify correctness
Use explicit echo statements to log task ID and parameters
Verify file/parameter selection logic works correctly

Advanced Techniques

Dynamic Array Size from File Count

# Count files and create array job
shopt -s nullglob
files=(data/*.txt)
NUM_FILES=${#files[@]}
sbatch --array=1-$NUM_FILES process_files.sh

Resubmitting Failed Tasks

# Find failed tasks from sacct
# -X lists one line per array task (no .batch/.extern steps); -n omits the header
JOB_ID=12345
sacct -X -n -j $JOB_ID --format=JobID,State | grep FAILED | awk '{print $1}' > failed_tasks.txt

# Create array specification from failed tasks
FAILED_ARRAY=$(sed "s/${JOB_ID}_//" failed_tasks.txt | tr '\n' ',' | sed 's/,$//')

# Resubmit only failed tasks
sbatch --array=$FAILED_ARRAY rerun_job.sh
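
Tasks can also end in states other than FAILED, such as TIMEOUT, OUT_OF_MEMORY, or NODE_FAIL. The sketch below builds a resubmission list covering those states from pipe-delimited accounting output. The failed_array_spec helper is hypothetical, and the choice of which states to retry is an assumption that should be adjusted to your workflow.

```shell
#!/bin/bash
# Read "JobID|State" lines on stdin and print a comma-separated list of the
# array indices whose state suggests a rerun. Argument: the array job ID.
failed_array_spec() {
    local job_id=$1
    awk -F'|' -v prefix="${job_id}_" '
        $2 ~ /^(FAILED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL)/ && index($1, prefix) == 1 {
            print substr($1, length(prefix) + 1)
        }' | sort -n | paste -sd, -
}

# Demonstration with made-up accounting lines (not real output):
printf '12345_1|COMPLETED\n12345_3|FAILED\n12345_7|TIMEOUT\n' | failed_array_spec 12345
# prints 3,7

# Hypothetical usage against real accounting data (-P gives pipe-delimited output):
#   SPEC=$(sacct -X -n -P -j 12345 --format=JobID,State | failed_array_spec 12345)
#   sbatch --array="$SPEC" rerun_job.sh
```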

Root Cause

Array jobs solve the problem of submitting and managing large numbers of similar tasks:

Without Array Jobs:

Need to write loops to submit hundreds of individual jobs
Job IDs are unrelated, making management difficult
Output files need manual naming conventions
Monitoring and canceling groups of related jobs is tedious

With Array Jobs:

Single submission for all related tasks
Automatic task indexing with $SLURM_ARRAY_TASK_ID
Unified job ID for the entire array
Easy monitoring and cancellation of task groups
Built-in output file naming with %A_%a
Scheduler can optimize resource allocation for task groups

References

Related Articles

How to Submit and Run Batch Jobs with SLURM - Basic batch job submission
How to Request Interactive Sessions on Compute Nodes - Interactive testing before array submission

SLURM Documentation