How to Efficiently Download Files from an HPC Cluster
Environment
macOS, Linux/Unix, or Windows with WSL
SSH access to HPC cluster
Issue
Need to download large datasets from the HPC cluster
The cluster cannot open an outgoing SSH connection, so files cannot be pushed from the cluster to the local machine
The dataset may contain many small files
Need an efficient and reliable transfer method
Resolution
Use sshfs to mount the remote directory locally, then use fpsync to copy the files in parallel.
Install Required Packages
Install sshfs and fpart:
# Ubuntu/Debian:
$ sudo apt install sshfs fpart
# macOS:
$ brew install sshfs fpart
# CentOS/RHEL:
$ sudo dnf install fuse-sshfs fpart
Mount Remote Directory
Create local mount point:
$ mkdir -p ~/cluster_data
Mount remote directory:
$ sshfs username@hpc.ust.hk:/path/to/dataset ~/cluster_data
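For a long transfer it can help to make the mount more tolerant of dropped connections. The sketch below is one possible setup, not a required step: it passes the sshfs `reconnect` option together with the SSH keep-alive options `ServerAliveInterval` and `ServerAliveCountMax`. The interval values are illustrative assumptions, and `is_empty_dir` is a hypothetical helper added here because some FUSE versions refuse to mount over a non-empty directory.

```shell
#!/bin/sh
# Sketch: options for a longer-lived sshfs mount.
# Assumption: the values 15 and 3 are illustrative, not tuned for any cluster.
SSHFS_OPTS="reconnect,ServerAliveInterval=15,ServerAliveCountMax=3"

# Hypothetical helper: succeeds if $1 is an existing directory with no entries.
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Usage (not run here because it needs access to the cluster):
#   is_empty_dir ~/cluster_data &&
#       sshfs -o "$SSHFS_OPTS" username@hpc.ust.hk:/path/to/dataset ~/cluster_data
```

If the mount point already contains files, move them aside or pick another empty directory before mounting.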
Download Using fpsync
Create local destination directory:
$ mkdir -p ~/local_dataset
Transfer files using parallel processes:
$ fpsync -t $HOME/.fpsync -n 8 -vv ~/cluster_data/ ~/local_dataset/
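The value passed to `-n` does not have to be hard-coded. As a rough sketch, the hypothetical helper `pick_jobs` below uses the number of online CPU cores, capped at a maximum (8 here, matching the command above). Using the core count is an assumption: over sshfs the network and the remote filesystem are often the real limit, so treat the result as a starting point.

```shell
#!/bin/sh
# Hypothetical helper: print a worker count for fpsync -n.
# Assumption: number of online CPU cores, capped at $1 (default 8).
pick_jobs() {
    max=${1:-8}
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cores" -gt "$max" ]; then
        echo "$max"
    else
        echo "$cores"
    fi
}

# Print the resulting command for review instead of running it directly.
echo "fpsync -t \$HOME/.fpsync -n $(pick_jobs 8) -vv ~/cluster_data/ ~/local_dataset/"
```

Printing the command first makes it easy to adjust the worker count by hand before starting the transfer.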
Unmount after transfer:
$ fusermount -u ~/cluster_data # Linux
$ umount ~/cluster_data # macOS
Note
Choose an appropriate number of parallel processes (-n) based on your system
Verify transfer completion before unmounting
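One simple way to verify completion while the remote directory is still mounted is to compare the number of regular files and their total size on both sides. The sketch below is a minimal check under stated assumptions, not a full integrity check: `tree_summary` and `verify_transfer` are hypothetical helper names, file contents are not compared, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print "<file count> <total bytes>" for regular files under $1.
tree_summary() {
    # wc prints one "<bytes> <path>" line per file, plus a "<bytes> total" line
    # per batch of several files; the awk filter skips those total lines.
    find "$1" -type f -exec wc -c {} + |
        awk '!($2 == "total" && NF == 2) { files++; bytes += $1 }
             END { print files + 0, bytes + 0 }'
}

# Hypothetical helper: succeeds only if both trees have the same summary.
verify_transfer() {
    src=$(tree_summary "$1")
    dst=$(tree_summary "$2")
    if [ "$src" = "$dst" ]; then
        echo "OK: $dst (files, bytes)"
    else
        echo "MISMATCH: source=$src destination=$dst" >&2
        return 1
    fi
}

# Usage, with the directories from the steps above:
#   verify_transfer ~/cluster_data ~/local_dataset
```

Walking the mounted tree reads metadata for every remote file, so this check can take a while on datasets with many small files.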
Warning
Ensure sufficient local disk space before starting transfer
Do not interrupt the transfer, as this can leave incomplete files
Large parallel transfers may impact system performance
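The disk space check can be scripted before starting the transfer. The following is a sketch only: `has_room` is a hypothetical helper that compares the size of the source tree reported by `du -sk` with the free space reported by `df -Pk` for the destination. It ignores filesystem overhead, so leave some margin.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the filesystem holding $2 has more free
# space than the tree under $1 occupies (both measured in KiB).
has_room() {
    need_kb=$(du -sk "$1" | awk '{ print $1 }')
    # With -P, df prints one header line and one data line; column 4 is "Available".
    free_kb=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')
    echo "need ${need_kb} KiB, free ${free_kb} KiB"
    [ "$free_kb" -gt "$need_kb" ]
}

# Usage, with the directories from the steps above:
#   has_room ~/cluster_data ~/local_dataset || echo "Not enough local disk space"
```

Note that `du` has to walk the whole mounted tree, which can be slow over sshfs for datasets with many small files.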
Root Cause
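Outbound SSH from the cluster is not permitted, so the transfer must be initiated from the local machine. sshfs mounts the remote directory on the local machine over an inbound SSH connection to the HPC cluster.
fpsync then splits the mounted file tree into chunks and copies them with several parallel processes, which helps when the dataset contains many small files.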