Using Vega

Login

There are 8 login nodes and 3 gateway hostnames that redirect to the login nodes in a load-balanced way:

Hostname                 Node type
login.vega.izum.si       login to one of the eight login nodes
logincpu.vega.izum.si    login to one of the four CPU login nodes
logingpu.vega.izum.si    login to one of the four GPU login nodes
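
For example, to open a shell on one of the CPU login nodes (username is a placeholder for your Vega account name):

ssh username@logincpu.vega.izum.si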

Host key fingerprints:

Algorithm    Fingerprint (SHA256)
RSA          SHA256:wPX+u4yCNLEh3s8c+ajvWja2i+mx+6iuo0KjnJuxYco
ECDSA        SHA256:PUL9NXrCdfWMUPPQRwjdwpFVzGB61Ta97FwdSXyndFE
ED25519      SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U
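
To compare these values with what a server actually presents, you can query its host key and print the fingerprint before trusting it, using standard OpenSSH tools (a sketch; ssh-keygen -lf - reads the key from stdin on reasonably recent OpenSSH versions):

ssh-keyscan -t ed25519 login.vega.izum.si 2>/dev/null | ssh-keygen -lf -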

More details can be found on the wiki page HPC Vega / Login information.

2FA

There is one server for 2FA:

Hostname            Node type
otp.vega.izum.si    OTP server

Host key fingerprint:

Algorithm    Fingerprint (SHA256)
ED25519      SHA256:oN2fHLZe2Z/BbaNUnU+f6m56HWVX/o0mlK9/wqD54fE

$ ssh username@otp.vega.izum.si
-------------------------------------------------------------------------------------------------------
Configuring two-factor authentication (2FA)
Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
-------------------------------------------------------------------------------------------------------
Press any key to continue
Warning: pasting the following URL into your browser exposes the OTP secret to Google:
  https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
[here an ASCII diagram of the QR-code]
Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
Enter code from app (-1 to skip): -1
Code confirmation skipped
Your emergency scratch codes are:
  12345678
  98765432
2FA configured successfully, next ssh connection will use it
Connection to otp.vega.izum.si closed.
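
As an alternative to a phone app, a command-line TOTP generator such as oathtool can compute the one-time codes from the base32 secret printed during enrolment (a sketch, assuming oathtool is installed on your workstation; <SECRET> stands for the secret shown above):

# prints the current 6-digit code for the given base32 secret
oathtool --totp -b "<SECRET>"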

Building dependencies

Boost

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
mkdir boost-build
cd boost-build
BOOST_DOMAIN="https://boostorg.jfrog.io/artifactory/main"
BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
mkdir -p "${BOOST_ROOT}"
curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
cd "boost_${BOOST_VERSION//./_}"
# enable the Boost.MPI bindings (not built unless requested in user-config.jam)
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
./b2 -j 4 install --prefix="${BOOST_ROOT}"
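
A quick way to check the resulting installation is to compile and run a minimal Boost.MPI program against it (a sketch using the modules loaded above; hello_mpi.cpp is a throwaway file name):

cat > hello_mpi.cpp << 'EOF'
#include <boost/mpi.hpp>
#include <iostream>

int main(int argc, char **argv) {
    // initialize MPI and report this rank's position in the communicator
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    std::cout << "rank " << world.rank() << " of " << world.size() << "\n";
    return 0;
}
EOF
mpic++ hello_mpi.cpp -o hello_mpi \
    -I "${BOOST_ROOT}/include" -L "${BOOST_ROOT}/lib" \
    -lboost_mpi -lboost_serialization
LD_LIBRARY_PATH="${BOOST_ROOT}/lib:${LD_LIBRARY_PATH}" mpiexec -n 2 ./hello_mpi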

Building software

ESPResSo

Release 4.2:

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

git clone --recursive --branch 4.2 --origin upstream \
    https://github.com/espressomd/espresso.git espresso-4.2
cd espresso-4.2
git fetch upstream python
git cherry-pick 381aac217
git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
git commit -m 'CMake: Modernize handling of Python dependencies'
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
    -D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
make -j 4
deactivate
module purge
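
A short import check, run from the build directory while the modules and virtual environment are still active (i.e. before the deactivate and module purge steps), confirms that the Python bindings load (a sketch; pypresso is the wrapper script generated by the ESPResSo build, and espressomd.version.friendly() is assumed to be available in this release):

./pypresso -c "import espressomd; print(espressomd.version.friendly())"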

Release 4.3:

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

git clone --recursive --branch python --origin upstream \
    https://github.com/espressomd/espresso.git espresso-4.3
cd espresso-4.3
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
    -D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
    -D CUDAToolkit_ROOT="${CUDA_HOME}" \
    -D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
    -D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
make -j 4
deactivate
module purge
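
As with release 4.2, a quick sanity check run from the build directory before deactivating the environment shows which features were actually compiled in (a sketch; espressomd.features() is assumed to list the active myconfig.hpp features):

./pypresso -c "import espressomd; print(sorted(espressomd.features()))"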

Submitting jobs

Batch command for a benchmark job:

for n in 32 64 128 256 512 1024 2048 4096 ; do
  sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
         --hint=nomultithread --exclusive \
         --ntasks-per-node=$((n<128 ? n : 128)) job.sh
done
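
The loop submits eight jobs; their progress can be followed with standard Slurm commands (a sketch; the job name multixscale comes from the job script below):

# jobs still pending or running
squeue --user="${USER}" --partition=cpu
# accounting data once jobs have finished
sacct --name=multixscale --format=JobID,NNodes,NTasks,Elapsed,State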

Job script:

#!/bin/bash
#SBATCH --job-name=multixscale
#SBATCH --time=00:10:00
#SBATCH --output %j.stdout
#SBATCH --error  %j.stderr

# last update: September 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}

# fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"

OLD_PYTHONPATH="${PYTHONPATH}"
export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
source "${ESPRESSO_ROOT}/venv/bin/activate"

srun --cpu-bind=cores python3 script.py --particles_per_core=2000

export PYTHONPATH="${OLD_PYTHONPATH}"
deactivate
module purge

The following harmless warning will be generated:

srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]

This is a known issue on HPC Vega, see e.g. this Max CoE thread from Nov 2022. No corrective action is required.

The following point-to-point messaging layer (pml) error message may be generated when using more than 1 socket on a GPU node:

[gn22:1299643] pml_ucx.c:419  Error: ucp_ep_create(proc=246) failed: Shared memory error
[gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx

Force the same PML component on all ranks by setting the OMPI_MCA_pml environment variable:

OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu

The following CUDA error occurs when the stub CUDA driver library is picked up at runtime instead of the real driver:

GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library

Fix it by removing the stubs directory from the library search path:

export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")

Interactive job:

srun --partition=dev --time=0:05:00 --job-name=interactive \
     --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
     --hint=nomultithread --pty /usr/bin/bash
module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
    /ceph/hpc/software/mpibench

Output:

[cn0294:2697220] MCW rank  0 bound to socket 1[core  0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  1 bound to socket 1[core  1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  2 bound to socket 1[core  2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  3 bound to socket 1[core  3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  4 bound to socket 1[core  4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
[cn0294:2697220] MCW rank  5 bound to socket 1[core  5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
[cn0294:2697220] MCW rank  6 bound to socket 1[core  6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
[cn0294:2697220] MCW rank  7 bound to socket 1[core  7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
[cn0294:2697220] MCW rank  8 bound to socket 1[core  8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
[cn0294:2697220] MCW rank  9 bound to socket 1[core  9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
[cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
[cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
[cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
[cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
[cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
[cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
START mpiBench v1.5
...