Running Container in Batch Mode on HPC ====================================== .. meta:: :description: Guide for running container jobs in batch mode on HPC clusters with Slurm :keywords: container, batch, slurm, sbatch, nvidia, enroot :author: kftse .. rst-class:: header | Last updated: 2025-06-13 | Keywords: container, batch, slurm, sbatch, nvidia, enroot | *Solution under review* Environment ----------- - Slurm workload manager - GPU-enabled nodes - Enroot/Pyxis container runtime Issue ----- - Run long-running container workloads without interactive sessions - Execute batch jobs using containerized applications - Schedule container-based computations on HPC clusters - Submit jobs to the queue for efficient resource utilization Resolution ---------- .. important:: **Before running batch jobs, we strongly recommend testing your container and commands in interactive mode.** This helps ensure your container works correctly and your commands are properly configured. For detailed instructions on running containers interactively, see: :doc:`enroot-running-interactive-container-se-L58_Wq` #. Batch Job with Custom Container ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you need to use a customized container (created during interactive testing), save it first and then use it in batch mode: .. code-block:: bash :caption: custom_container_job.sh #!/bin/bash #SBATCH --account=[YOUR_ACCOUNT] #SBATCH --partition=normal #SBATCH --job-name=custom_container #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 #SBATCH --gpus-per-node=1 #SBATCH --cpus-per-task=28 #SBATCH --time=24:00:00 #SBATCH --output=%x-%j.out #SBATCH --error=%x-%j.err # Use your custom container srun --container-writable \ --container-remap-root \ --no-container-mount-home \ --container-image $HOME/containers/my-custom-container.sqsh \ python3 ... #. Multi-Node Container Jobs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For parallel applications that span multiple nodes: .. code-block:: bash :caption: multinode_container_job.sh #!/bin/bash #SBATCH --account=[YOUR_ACCOUNT] #SBATCH --partition=large #SBATCH --job-name=multinode_container #SBATCH --nodes=4 #SBATCH --ntasks-per-node=1 #SBATCH --gpus-per-node=8 #SBATCH --cpus-per-task=224 #SBATCH --time=24:00:00 #SBATCH --output=%x-%j.out #SBATCH --error=%x-%j.err # Use your custom container srun --container-writable \ --container-remap-root \ --no-container-mount-home \ --container-image $HOME/containers/my-custom-container.sqsh \ python3 ... Best Practices ~~~~~~~~~~~~~~ - **Resource Planning**: Request appropriate time limits for batch jobs (can be longer than interactive limits) - **Output Files**: Use descriptive output file names with ``%x`` (job name) and ``%j`` (job ID) placeholders - **Container Storage**: Store containers in ``$HOME/containers/`` for organization - **Error Handling**: Always specify both ``--output`` and ``--error`` files for debugging References ---------- - `CUDA Containers for Deep Learning `_ - `NGC Container Registry `_ - `Pyxis/Enroot Usage `_ - `Slurm srun Documentation `_ - `Container Best Practices `_ .. rst-class:: footer **HPC Support Team** | ITSO, HKUST | Email: cchelp@ust.hk | Web: https://itso.hkust.edu.hk/ **Article Info** | Issued: 2025-02-12 | Issued by: kftse (at) ust.hk