How to Use SLURM Array Jobs for Parameter Sweeps and Batch Processing ====================================================================== .. meta:: :description: Guide to using SLURM array jobs for parameter sweeps, batch processing, and parallel task execution :keywords: slurm, array job, parameter sweep, batch processing, job array, parallel tasks :author: HPC Support Team .. rst-class:: header | Last updated: 2025-12-04 | *Solution under review* Environment ----------- - HPC4 cluster - Superpod cluster - SLURM workload manager - Tasks that need to run with different parameters or input files - Parameter sweeps, sensitivity analysis, batch data processing Issue ----- - Need to run the same program with different parameters or input files - Want to process multiple datasets with the same analysis script - Conducting parameter sweeps for optimization or sensitivity analysis - Have many independent tasks that can run in parallel - Submitting individual jobs for each parameter is time-consuming and error-prone - Need efficient way to manage hundreds or thousands of similar jobs Resolution ---------- Use SLURM array jobs with the ``--array`` option to submit multiple similar tasks with a single command. Each array task gets a unique ``$SLURM_ARRAY_TASK_ID`` that can be used to select parameters or input files. Basic Array Job Syntax ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash #!/bin/bash #SBATCH --job-name=array_example #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=1-10 #SBATCH --output=array_%A_%a.out #SBATCH --error=array_%A_%a.err # $SLURM_ARRAY_TASK_ID contains the array index (1, 2, 3, ..., 10) echo "This is array task $SLURM_ARRAY_TASK_ID" # Use the task ID in your computation python process.py --task-id $SLURM_ARRAY_TASK_ID This submits 10 jobs (array tasks) with indices 1 through 10. Array Job Environment Variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 60 :header-rows: 1 * - Variable - Description * - ``$SLURM_ARRAY_TASK_ID`` - Current array task index (e.g., 1, 2, 3, ...) * - ``$SLURM_ARRAY_JOB_ID`` - Job ID of the entire array (same for all tasks) * - ``$SLURM_ARRAY_TASK_MIN`` - Minimum array index * - ``$SLURM_ARRAY_TASK_MAX`` - Maximum array index * - ``$SLURM_ARRAY_TASK_COUNT`` - Total number of array tasks Use Case 1: Processing Multiple Input Files ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Process different data files using the array task ID to select the file. Method 1: Numbered Files ^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash #!/bin/bash #SBATCH --job-name=process_data #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=1-100 #SBATCH --output=logs/job_%A_%a.out # Process file based on array task ID # i.e. data/input_1.txt, data/input_2.txt, ... INPUT_FILE="data/input_${SLURM_ARRAY_TASK_ID}.txt" OUTPUT_FILE="results/output_${SLURM_ARRAY_TASK_ID}.txt" python analyze.py --input $INPUT_FILE --output $OUTPUT_FILE Method 2: File List ^^^^^^^^^^^^^^^^^^^ Read filenames from a list and select based on array task ID. .. code-block:: bash #!/bin/bash #SBATCH --job-name=file_processing #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=1-50 #SBATCH --output=logs/job_%A_%a.out # Get the filename from a list FILE_LIST="file_list.txt" INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $FILE_LIST) # Process the file echo "Processing: $INPUT_FILE" python process.py $INPUT_FILE **Example file_list.txt:** .. code-block:: text /path/to/dataset1.dat /path/to/dataset2.dat /path/to/dataset3.dat ... Use Case 2: Parameter Sweeps ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Run simulations or analyses with different parameter values. Simple Parameter Mapping ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash #!/bin/bash #SBATCH --job-name=param_sweep #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=0-99 #SBATCH --output=logs/param_%A_%a.out # Map array task ID to parameter values # Example: sweep learning rate from 0.001 to 0.1 LEARNING_RATE=$(awk "BEGIN {print 0.001 + $SLURM_ARRAY_TASK_ID * 0.001}") echo "Running with learning rate: $LEARNING_RATE" python train_model.py --lr $LEARNING_RATE Multi-Dimensional Parameter Grid ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Sweep multiple parameters simultaneously. .. code-block:: bash #!/bin/bash #SBATCH --job-name=grid_search #SBATCH --account=exampleproj #SBATCH --partition=gpu #SBATCH --gpus-per-task=1 #SBATCH --array=0-99 #SBATCH --output=logs/grid_%A_%a.out # Define parameter grid # 10 learning rates × 10 batch sizes = 100 combinations LR_VALUES=(0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.2 0.5 1.0) BATCH_VALUES=(16 32 64 128 256 512 1024 2048 4096 8192) # Calculate indices LR_IDX=$((SLURM_ARRAY_TASK_ID / 10)) BATCH_IDX=$((SLURM_ARRAY_TASK_ID % 10)) # Get parameter values LR=${LR_VALUES[$LR_IDX]} BATCH=${BATCH_VALUES[$BATCH_IDX]} echo "Learning Rate: $LR, Batch Size: $BATCH" python train.py --lr $LR --batch-size $BATCH Using Parameter File ^^^^^^^^^^^^^^^^^^^^ Read parameter combinations from a file. .. code-block:: bash #!/bin/bash #SBATCH --job-name=param_file #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=1-100 #SBATCH --output=logs/param_%A_%a.out # Read parameters from file (one combination per line) PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" parameters.txt) # Parse parameters (assuming space-separated) read -r ALPHA BETA GAMMA <<< "$PARAMS" echo "Running with α=$ALPHA, β=$BETA, γ=$GAMMA" ./simulation --alpha $ALPHA --beta $BETA --gamma $GAMMA **Example parameters.txt:** .. code-block:: text 0.1 0.5 1.0 0.1 0.5 2.0 0.1 1.0 1.0 0.2 0.5 1.0 ... Use Case 3: Processing Folders ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Process data in different directories. .. code-block:: bash #!/bin/bash #SBATCH --job-name=folder_processing #SBATCH --account=exampleproj #SBATCH --partition=amd #SBATCH --array=1-20 #SBATCH --output=logs/folder_%A_%a.out # Define folder pattern FOLDER_PREFIX="/data/experiment" FOLDER="${FOLDER_PREFIX}_${SLURM_ARRAY_TASK_ID}" # Check if folder exists if [ -d "$FOLDER" ]; then echo "Processing folder: $FOLDER" cd $FOLDER python ../analysis.py else echo "Warning: Folder $FOLDER does not exist" exit 1 fi Array Job Array Specifications ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Different ways to specify array indices: .. code-block:: bash # Range: tasks 1, 2, 3, ..., 100 #SBATCH --array=1-100 # Range with step: tasks 0, 10, 20, ..., 100 #SBATCH --array=0-100:10 # Specific values: tasks 1, 5, 10, 15 #SBATCH --array=1,5,10,15 # Mixed: tasks 1, 2, 3, 4, 5, 10, 20, 30 #SBATCH --array=1-5,10,20,30 # Limit concurrent tasks: max 10 running at once #SBATCH --array=1-1000%10 .. tip:: Use ``%`` to limit concurrent array tasks. This prevents overwhelming the system with too many simultaneous jobs while still allowing all tasks to queue. Managing Array Jobs ~~~~~~~~~~~~~~~~~~~ Monitoring Array Jobs ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash # View all array tasks squeue -u $USER # View specific array job squeue -j 12345 # Count running/pending tasks squeue -u $USER --array -t RUNNING | wc -l squeue -u $USER --array -t PENDING | wc -l Canceling Array Tasks ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash # Cancel entire array job scancel 12345 # Cancel specific array task scancel 12345_5 # Cancel range of array tasks scancel 12345_[10-20] # Cancel all array tasks with specific job name scancel --name=array_job Output File Naming ^^^^^^^^^^^^^^^^^^ Use special placeholders in output filenames: .. code-block:: bash # %A = array job ID (same for all tasks) # %a = array task ID (unique for each task) #SBATCH --output=results_%A_%a.out #SBATCH --error=errors_%A_%a.err # Organize outputs in subdirectories #SBATCH --output=logs/task_%a/output.log #SBATCH --error=logs/task_%a/error.log Best Practices ~~~~~~~~~~~~~~ **Array Job Design** - Make each array task independent - no dependencies between tasks - Ensure all tasks have similar resource requirements - Use ``--array=1-N%M`` to limit concurrent tasks and avoid overwhelming the scheduler - Test with a small array (e.g., ``--array=1-3``) before scaling up **Resource Management** - Request resources per task, not for the entire array - Consider task runtime - all tasks should finish in similar time - Use appropriate concurrency limits based on cluster policy **Data Management** - Use unique output filenames with ``%A_%a`` to avoid conflicts - Create output directories before submitting if needed - Consider using task-specific working directories - Clean up intermediate files from completed tasks **Error Handling** - Include error checking in your script - Log which parameter combination or file each task processes - Failed tasks can be identified and resubmitted individually - Use ``set -e`` to exit on errors **Debugging** - Test with ``--array=1`` or ``--array=1-3`` first - Check one output file to verify correctness - Use explicit echo statements to log task ID and parameters - Verify file/parameter selection logic works correctly Advanced Techniques ~~~~~~~~~~~~~~~~~~~ Dynamic Array Size from File Count ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash # Count files and create array job shopt -s nullglob files=(data/*.txt) NUM_FILES=${#files[@]} sbatch --array=1-$NUM_FILES process_files.sh Resubmitting Failed Tasks ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash # Find failed tasks from sacct # Find failed tasks from sacct JOB_ID=12345 sacct -j $JOB_ID --format=JobID,State | grep FAILED | awk '{print $1}' > failed_tasks.txt # Create array specification from failed tasks FAILED_ARRAY=$(sed "s/${JOB_ID}_//" failed_tasks.txt | tr '\n' ',' | sed 's/,$//') # Resubmit only failed tasks sbatch --array=$FAILED_ARRAY rerun_job.sh Root Cause ---------- Array jobs solve the problem of submitting and managing large numbers of similar tasks: **Without Array Jobs:** - Need to write loops to submit hundreds of individual jobs - Job IDs are unrelated, making management difficult - Output files need manual naming conventions - Monitoring and canceling groups of related jobs is tedious **With Array Jobs:** - Single submission for all related tasks - Automatic task indexing with ``$SLURM_ARRAY_TASK_ID`` - Unified job ID for the entire array - Easy monitoring and cancellation of task groups - Built-in output file naming with ``%A_%a`` - Scheduler can optimize resource allocation for task groups References ---------- **Related Articles** - :doc:`slurm-how-to-submit-and-run-batch-jobs-G75o-i` - Basic batch job submission - :doc:`slurm-how-to-request-interactive-sessi-HV7WS9` - Interactive testing before array submission **SLURM Documentation** - `SLURM Job Arrays `_ - `SLURM sbatch Command `_ - `SLURM Environment Variables `_ .. rst-class:: footer **HPC Support Team** | ITSO, HKUST | Email: cchelp@ust.hk | Web: https://itso.hkust.edu.hk/ **Article Info** | Issued: 2025-12-04 | Issued by: kftse@ust.hk