Running Containers in Batch Mode on HPC

Last updated: 2025-06-13
Keywords: container, batch, slurm, sbatch, nvidia, enroot
Solution under review

Environment

  • Slurm workload manager

  • GPU-enabled nodes

  • Enroot/Pyxis container runtime

Issue

  • Run long-running container workloads without interactive sessions

  • Execute batch jobs using containerized applications

  • Schedule container-based computations on HPC clusters

  • Submit jobs to the queue for efficient resource utilization

Resolution

Important

Before running batch jobs, we strongly recommend testing your container and commands in interactive mode. This helps ensure your container works correctly and your commands are properly configured.

For detailed instructions on running containers interactively, see: Running Interactive Container Sessions on HPC

#. Batch Job with Custom Container

If you need to use a customized container (created during interactive testing), save it first and then use it in batch mode:

custom_container_job.sh
#!/bin/bash
#SBATCH --account=[YOUR_ACCOUNT]
#SBATCH --partition=normal
#SBATCH --job-name=custom_container
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=28
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Run the command inside your saved custom container (one task on one node)
srun --container-writable \
    --container-remap-root \
    --no-container-mount-home \
    --container-image "$HOME/containers/my-custom-container.sqsh" \
    python3 ...
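
Before submitting, it can help to confirm that the saved image is where the job script expects it. The following is a minimal sketch rather than a site-provided tool: the helper name `submit_if_image_exists` is hypothetical, and the default paths simply mirror the example above. It relies on `sbatch --parsable`, which prints only the job ID, and it reports an error instead of submitting when the image or `sbatch` itself is missing.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script only if the container image exists.
# The defaults mirror the example above; override IMAGE / JOB_SCRIPT as needed.
IMAGE="${IMAGE:-$HOME/containers/my-custom-container.sqsh}"
JOB_SCRIPT="${JOB_SCRIPT:-custom_container_job.sh}"

submit_if_image_exists() {
    local image="$1" job_script="$2"
    if [ ! -f "$image" ]; then
        echo "Container image not found: $image" >&2
        return 1
    fi
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node" >&2
        return 2
    fi
    # --parsable makes sbatch print just the job ID (optionally followed by ';cluster')
    local job_id
    job_id=$(sbatch --parsable "$job_script") || return 3
    echo "Submitted job ${job_id%%;*}"
}

submit_if_image_exists "$IMAGE" "$JOB_SCRIPT" || echo "Job was not submitted" >&2
```

Checking for the file up front avoids waiting in the queue only to have the job fail at container start.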

#. Multi-Node Container Jobs

For parallel applications that span multiple nodes:

multinode_container_job.sh
#!/bin/bash
#SBATCH --account=[YOUR_ACCOUNT]
#SBATCH --partition=large
#SBATCH --job-name=multinode_container
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=224
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# srun starts one task per node, each inside your saved custom container
srun --container-writable \
    --container-remap-root \
    --no-container-mount-home \
    --container-image "$HOME/containers/my-custom-container.sqsh" \
    python3 ...
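
With `--nodes=4` and `--ntasks-per-node=1`, `srun` starts one copy of the command on each node. The sketch below shows how each copy could identify itself using Slurm's standard per-task variables `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_NODEID`. The function name `task_info` is hypothetical, the sketch assumes these variables are visible to the command inside the container, and the defaults exist only so that it also runs outside a job.

```shell
#!/bin/bash
# Hypothetical helper: report which task this is within the job.
# SLURM_PROCID, SLURM_NTASKS and SLURM_NODEID are set by Slurm for each task;
# the defaults below only apply when the script runs outside a job.
task_info() {
    local rank="${SLURM_PROCID:-0}"
    local ntasks="${SLURM_NTASKS:-1}"
    local node_id="${SLURM_NODEID:-0}"
    echo "task ${rank} of ${ntasks} on node index ${node_id}"
}

task_info
```

Printing a line like this at the start of each task makes it easier to tell the nodes apart in the shared output file.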

Best Practices

  • Resource Planning: Request a time limit that covers the expected runtime of the job; batch jobs can be given longer limits than interactive sessions

  • Output Files: Use descriptive output file names with %x (job name) and %j (job ID) placeholders

  • Container Storage: Keep container images together in $HOME/containers/ so that job scripts can reference a consistent path

  • Error Handling: Always specify both --output and --error files for debugging
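
To illustrate the `%x` and `%j` placeholders, the sketch below computes the file name that a pattern such as `%x-%j.out` resolves to for a given job name and job ID. The helper `expand_log_pattern` is hypothetical and handles only these two placeholders; Slurm performs the real expansion when the job starts.

```shell
#!/bin/bash
# Hypothetical helper: show the file name that results from a pattern using
# %x (job name) and %j (job ID). Any other placeholder is left untouched.
expand_log_pattern() {
    local pattern="$1" job_name="$2" job_id="$3"
    local result="${pattern//\%x/$job_name}"
    result="${result//\%j/$job_id}"
    echo "$result"
}

expand_log_pattern "%x-%j.out" "custom_container" "123456"   # prints custom_container-123456.out
```

Knowing the resolved name in advance makes it straightforward to follow the log of a running job, for example with `tail -f`.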

References