Quick Start

Use this page as the main onboarding entry for new HPC users. If you received a welcome email with your initial credentials, keep that email handy as you follow this guide.

This quick-start covers both HPC4 (hpc4.ust.hk) and SuperPOD (superpod.ust.hk).

Official pages for partition details, quotas, policies, and announcements:

Understanding the Cluster

What is an HPC cluster?

A cluster is many computers (nodes) connected by a high-speed network and managed as a single shared system. Hundreds of people use it at the same time.

HPC cluster architecture: user terminal → login nodes → Slurm scheduler → compute nodes (CPU/GPU) → T1 storage (/scratch, fast/temporary) → T2 storage (/home, /project, backed-up/shared).

Login nodes: where you land when you SSH in. They are shared by everyone, so use them only for editing files, light compiling, submitting jobs, and transferring data. Do not run heavy computations on login nodes; such processes will be killed by the system administrators.
Compute nodes: the machines that actually run your work. They come in CPU-only and GPU-equipped variants. You never SSH directly to them; instead, you submit jobs to the scheduler.
Storage: network-mounted file systems shared across all nodes. Tier-1 (/scratch) is fast but temporary. Tier-2 (/home, /project) is backed up and meant for long-term data. Quotas and retention policies differ between HPC4 and SuperPOD; see the official pages linked above.

The scheduler: Slurm

Both HPC4 and SuperPOD use Slurm to manage access to compute nodes.

How it works, in plain English

You do not run big programs on the login node. Instead, you write a batch script describing what resources you need (CPUs, memory, time) and which commands to run. You hand that script to Slurm with sbatch. Slurm returns a job ID immediately, so you can log out and go home; the job will run when resources become free.

$ sbatch my_job.sh
Submitted batch job 123456

$ squeue --me
  JOBID  PARTITION  NAME     USER  ST  TIME  NODES
 123456  amd        my_job   user  PD   0:00  1

# ... later, when the job finishes ...

$ cat slurm-123456.out

That is the entire mental model. The rest of this section explains the details.
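For orientation, here is a minimal sketch of what a script like my_job.sh might contain. The partition name amd is taken from the HPC4 partition list on this page, and the job name, CPU count, and walltime are illustrative assumptions; adjust them for your cluster and workload. To plain bash the #SBATCH lines are only comments, so the script also runs outside Slurm for a quick syntax check.

```bash
#!/bin/bash
# Name shown in squeue (illustrative).
#SBATCH --job-name=my_job
# Assumed: an HPC4 CPU partition; use the partition that matches your cluster.
#SBATCH --partition=amd
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
# Walltime limit (HH:MM:SS); kept short for a first test.
#SBATCH --time=00:10:00
# No --mem line: this page says memory is allocated automatically on both clusters.

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set; fall back to 1 otherwise.
ncpus="${SLURM_CPUS_PER_TASK:-1}"

echo "Hello from $(hostname)"
echo "Running with ${ncpus} CPU(s)"
```

Submitted with sbatch my_job.sh, the two echo lines end up in the slurm-<jobid>.out file.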

Workflow

graph TD
    you["① You<br/>write batch script<br/>request CPUs, GPUs,<br/>memory, walltime"]
    queue["② Slurm Queue<br/>jobs wait in line<br/>for available resources"]
    sched["③ Scheduler<br/>decides when and where<br/>your job runs"]
    run["④ Job runs<br/>on compute node<br/>(you may log out)"]
    out["⑤ Output files<br/>slurm-jobid.out<br/>in /scratch or /project"]

    you -->|"sbatch"| queue
    queue -->|"schedule"| sched
    sched -->|"allocate"| run
    run -.->|"async"| out

    style you fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
    style queue fill:#ffedd5,stroke:#ea580c,color:#7c2d12
    style sched fill:#fee2e2,stroke:#b91c1c,color:#7f1d1d
    style run fill:#dcfce7,stroke:#15803d,color:#14532d
    style out fill:#dcfce7,stroke:#15803d,color:#14532d
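When the submit and check steps of this workflow are scripted, it helps to capture the job ID that sbatch prints. The sketch below uses a hypothetical helper, parse_job_id, and the sample confirmation line shown earlier on this page; it does not contact a cluster.

```bash
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation line.
parse_job_id() {
    # Keep only the last space-separated word,
    # e.g. "Submitted batch job 123456" -> "123456".
    printf '%s\n' "${1##* }"
}

# Demonstration with the sample line from this page (no cluster needed):
parse_job_id "Submitted batch job 123456"   # prints 123456

# On a cluster, the same idea would look like:
#   jobid="$(parse_job_id "$(sbatch my_job.sh)")"
#   squeue --jobs="$jobid"     # while the job is waiting or running
#   sacct -j "$jobid"          # after it finishes
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step.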

Recommended workflow: start small, then scale up

graph TD
    test["Test small<br/>short walltime, few CPUs"]
    check["Check output<br/>squeue / sacct<br/>debug errors"]
    scale["Scale up<br/>request more<br/>resources"]

    test -->|"inspect"| check
    check -->|"if OK"| scale
    scale -->|"resubmit"| test
    check -.->|"fix error"| test

    style test fill:#ede9fe,stroke:#7c3aed,color:#5b21b6
    style check fill:#ede9fe,stroke:#7c3aed,color:#5b21b6
    style scale fill:#ede9fe,stroke:#7c3aed,color:#5b21b6

The most common beginner mistake is requesting too many resources. Start with a tiny test, inspect the output, and only scale up when you are sure everything works.

Job size — smaller box = scheduled sooner

3D box diagram: a job is a box with three dimensions — computing (cpu/gpu) ↑, memory ↗, time →. Big box takes more space; small box fits into gaps via backfill.

A job is a box of compute (CPUs/GPUs) × memory × time. Smaller boxes fit into gaps that larger jobs cannot use; this is called backfill scheduling. Start small, measure resource utilization with glances (CPU and memory) or nvidia-smi (GPU), then scale up.
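One rough way to judge whether a test job's box was the right size is its CPU efficiency: CPU time actually used divided by (elapsed time × allocated CPUs). The sketch below is an assumption-laden illustration, not a site-provided tool: cpu_efficiency is a hypothetical helper, and the numbers are made up. On a real job, the inputs correspond to the sacct fields Elapsed, TotalCPU, and AllocCPUS, converted to seconds.

```bash
# Hypothetical helper: CPU efficiency as a rounded percentage.
cpu_efficiency() {
    # $1 = elapsed wall time in seconds
    # $2 = CPU time actually used, in seconds
    # $3 = number of CPUs allocated
    awk -v e="$1" -v t="$2" -v n="$3" 'BEGIN {
        if (e * n <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.0f\n", 100 * t / (e * n)
    }'
}

# Made-up example: a 10-minute job on 4 CPUs that used 8 CPU-minutes in total.
# 480 / (600 * 4) = 0.20
cpu_efficiency 600 480 4   # prints 20
```

A low percentage suggests the code is not using all the cores it asked for, so requesting fewer CPUs would likely let the job start sooner without running slower.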

How to choose your first resource request

Resource	Start with	Slurm flag	Scale up if
CPUs	4	`--cpus-per-task=4`	your code uses more cores
Memory	auto-allocated	do not set (both clusters allocate automatically)	job is killed by OOM (request more CPUs or GPUs, since memory scales with them)
GPUs	0 (CPU job) or ≥1 (GPU job)	`--gpus-per-node=1`	your code uses GPU libraries
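As an illustration of the GPU row, a first GPU test job might look like the sketch below. The partition gpu-a30 comes from the HPC4 list on this page (SuperPOD lists normal as its GPU partition); the job name, CPU count, and walltime are assumptions. The nvidia-smi call is guarded so the script still finishes cleanly where no GPU tools are installed.

```bash
#!/bin/bash
#SBATCH --job-name=gpu_test
# Assumed: an HPC4 GPU partition; on SuperPOD this page lists "normal".
#SBATCH --partition=gpu-a30
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
# No --mem line: memory follows the CPU/GPU request on both clusters.

# Report whether a GPU is visible; fall through gracefully if nvidia-smi is missing or fails.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi; then
    gpu_status="GPU visible"
else
    gpu_status="no GPU visible"
fi

echo "Node $(hostname): ${gpu_status}"
# Slurm typically sets CUDA_VISIBLE_DEVICES for GPU allocations; show it if present.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```

If the output file reports "no GPU visible" on a GPU partition, check the partition name and the --gpus-per-node request before scaling up.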

Note

On both HPC4 and SuperPOD, do not set --mem or --mem-per-cpu. Memory is allocated proportionally based on the number of CPUs or GPUs requested. Setting it manually can conflict with the scheduler.
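To see how much memory the scheduler actually assigned, one option is the AllocTRES field reported by sacct, a comma-separated string of allocated resources. The sketch below uses a hypothetical helper, tres_mem, and a made-up sample string; the real per-CPU memory share on HPC4 or SuperPOD is not stated on this page.

```bash
# Hypothetical helper: print the mem= entry from an AllocTRES string.
tres_mem() {
    # Split the comma-separated entries onto separate lines,
    # then keep only the value that follows "mem=".
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^mem=//p'
}

# Made-up sample string in the AllocTRES format (values are illustrative only):
tres_mem "billing=4,cpu=4,mem=16G,node=1"   # prints 16G

# On a cluster, the input could come from:
#   sacct -j <jobid> -X --noheader --parsable2 --format=AllocTRES
```

If the reported memory is too small for your job, the note above implies the fix is to request more CPUs or GPUs rather than to add --mem.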

Useful commands

Command	Purpose
`sbatch script.sh`	Submit a job
`squeue --me`	Check your jobs
`sacct -j <jobid>`	Job history / resource usage
`scancel <jobid>`	Cancel a job

How to check if your job succeeded

$ squeue --me                # while waiting/running
  JOBID  PARTITION  NAME     USER  ST  TIME  NODES
 123456  amd        my_job   user  R    2:30  1

$ sacct -j 123456 --format=JobID,State,ExitCode,Elapsed   # after job finishes
  JobID    State      ExitCode  Elapsed
  123456   COMPLETED  0:0       00:02:30

$ cat slurm-123456.out       # check your output
Hello from cpu42
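The same check can be scripted. sacct reports ExitCode as exit:signal, so a clean run shows State COMPLETED and ExitCode 0:0. The sketch below assumes one State|ExitCode line as input; job_succeeded is a hypothetical helper name, and the sample lines are illustrative.

```bash
# Hypothetical helper: succeed only for a "COMPLETED|0:0" line.
job_succeeded() {
    # $1 is one "State|ExitCode" line, e.g. "COMPLETED|0:0".
    state="${1%%|*}"       # text before the first "|"
    exit_code="${1#*|}"    # text after the first "|"
    [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0:0" ]
}

# Demonstration with an illustrative line (no cluster needed):
if job_succeeded "COMPLETED|0:0"; then
    echo "job OK"
else
    echo "job needs attention"
fi
# prints: job OK

# On a cluster, the input line could come from:
#   sacct -j <jobid> -X --noheader --parsable2 --format=State,ExitCode
```

Any other state, such as FAILED, TIMEOUT, or OUT_OF_MEMORY, makes the helper return non-zero, which is a cue to read the slurm-<jobid>.out file before resubmitting.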

Cluster vs your laptop

	Your laptop	HPC4 / SuperPOD
Who uses it	You alone	Shared by hundreds of users
Starting work	Open a terminal	Submit a job via `sbatch`
Getting results	Immediately	After the job runs (queued)
Software	Install anything	Load via `module` commands
File system	Local SSD	Network-mounted (NFS)
GPUs	Usually 0–1	Many, shared via scheduler

Key differences between HPC4 and SuperPOD

	HPC4	SuperPOD
Login host	`hpc4.ust.hk`	`superpod.ust.hk`
Edge Spack	`/opt/shared/.spack-edge`	`/scratch/spack/2025`
Recommended approach	Spack + Lmod modules	Container-based (Enroot/Pyxis)
GPU partitions	`gpu-a30`, `gpu-l20`, `gpu-rtx5880`, `gpu-rtx4090d`	`normal`
CPU partitions	`amd`, `intel`	`cpu` (preprocessing)

See Software Support Overview (HPC4) and Enroot/Pyxis container runtime (SuperPOD) for more detailed software documentation.

Acknowledgement

Important

If your research makes use of HPC4 or SuperPOD, please include the appropriate acknowledgement in your publication, thesis, or presentation. Also send a copy or URL of your work to the cluster support team.

HPC4: The computations in this work were performed on the High Performance Computing facilities, HKUST HPC4, provided by ITSO, The Hong Kong University of Science and Technology. → hpc4support@ust.hk

SuperPOD: The computations in this work were performed on the High Performance Computing facilities, HKUST SuperPOD, provided by ITSO, The Hong Kong University of Science and Technology (HKUST). → spodsupport@ust.hk

Use the pages below as the main onboarding path. Each item includes a short description so readers can decide where to start.

Access and Authentication

Start here if you need the login host, authentication flow, VPN expectations, or a quick check that your account access is ready.

Access and Authentication

Data and Storage

Use this page to understand where to store files, what each path is for, and how to move data in and out safely.

Data and Storage Guide

Software Environment

Read this when you need Python, compilers, MPI, or module commands, and want a practical starting point for the HPC software stack.

Software Environment

Submit Your First Job

Follow this path for the shortest first success: create one small batch script, submit it, and confirm that it runs on a compute node.

Submit Your First Job

Job Templates and Control

Continue here after the first batch job works and you need GPU, MPI, interactive srun, or job-control commands such as squeue and scancel.

Job Templates and Control