Using Vega

Login

There are 8 login nodes and 3 gateway hostnames that redirect to the login nodes in a load-balanced way:

Hostname                 Node type
login.vega.izum.si       login to one of the eight login nodes
logincpu.vega.izum.si    login to one of the four CPU login nodes
logingpu.vega.izum.si    login to one of the four GPU login nodes
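
For example, to open a shell on one of the CPU login nodes (username is a placeholder for your Vega account name):

ssh username@logincpu.vega.izum.si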

Host key fingerprints:

Algorithm    Fingerprint (SHA256)
RSA          SHA256:wPX+u4yCNLEh3s8c+ajvWja2i+mx+6iuo0KjnJuxYco
ECDSA        SHA256:PUL9NXrCdfWMUPPQRwjdwpFVzGB61Ta97FwdSXyndFE
ED25519      SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U
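
To compare these values with what a server actually presents, you can query its host key and print the fingerprint before trusting it, using standard OpenSSH tools (a sketch; ssh-keygen -lf - reads the key from stdin on reasonably recent OpenSSH versions):

ssh-keyscan -t ed25519 login.vega.izum.si 2>/dev/null | ssh-keygen -lf -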

More details can be found on the wiki page HPC Vega / Login information.

2FA

There is one server for 2FA:

Hostname            Node type
otp.vega.izum.si    OTP server

Host key fingerprint:

Algorithm    Fingerprint (SHA256)
ED25519      SHA256:oN2fHLZe2Z/BbaNUnU+f6m56HWVX/o0mlK9/wqD54fE

$ ssh username@otp.vega.izum.si
-------------------------------------------------------------------------------------------------------
Configuring two-factor authentication (2FA)
Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
-------------------------------------------------------------------------------------------------------
Press any key to continue
Warning: pasting the following URL into your browser exposes the OTP secret to Google:
  https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
[here an ASCII diagram of the QR-code]
Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
Enter code from app (-1 to skip): -1
Code confirmation skipped
Your emergency scratch codes are:
  12345678
  98765432
2FA configured successfully, next ssh connection will use it
Connection to otp.vega.izum.si closed.
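
As an alternative to a phone app, a command-line TOTP generator such as oathtool can compute the one-time codes from the base32 secret printed during enrolment (a sketch, assuming oathtool is installed on your workstation; <SECRET> stands for the secret shown above):

# prints the current 6-digit code for the given base32 secret
oathtool --totp -b "<SECRET>"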

Building dependencies

Boost

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
mkdir boost-build
cd boost-build
BOOST_DOMAIN="https://boostorg.jfrog.io/artifactory/main"
BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
mkdir -p "${BOOST_ROOT}"
curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
cd "boost_${BOOST_VERSION//./_}"
# enable the Boost.MPI bindings (not built unless requested in user-config.jam)
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
./b2 -j 4 install --prefix="${BOOST_ROOT}"
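
A quick way to check the resulting installation is to compile and run a minimal Boost.MPI program against it (a sketch using the modules loaded above; hello_mpi.cpp is a throwaway file name):

cat > hello_mpi.cpp << 'EOF'
#include <boost/mpi.hpp>
#include <iostream>

int main(int argc, char **argv) {
    // initialize MPI and report this rank's position in the communicator
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    std::cout << "rank " << world.rank() << " of " << world.size() << "\n";
    return 0;
}
EOF
mpic++ hello_mpi.cpp -o hello_mpi \
    -I "${BOOST_ROOT}/include" -L "${BOOST_ROOT}/lib" \
    -lboost_mpi -lboost_serialization
LD_LIBRARY_PATH="${BOOST_ROOT}/lib:${LD_LIBRARY_PATH}" mpiexec -n 2 ./hello_mpi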

Building software

ESPResSo

Release 4.2:

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

git clone --recursive --branch 4.2 --origin upstream \
    https://github.com/espressomd/espresso.git espresso-4.2
cd espresso-4.2
git fetch upstream python
git cherry-pick 381aac217
git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
git commit -m 'CMake: Modernize handling of Python dependencies'
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
    -D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
make -j 4
deactivate
module purge
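
A short import check, run from the build directory while the modules and virtual environment are still active (i.e. before the deactivate and module purge steps), confirms that the Python bindings load (a sketch; pypresso is the wrapper script generated by the ESPResSo build, and espressomd.version.friendly() is assumed to be available in this release):

./pypresso -c "import espressomd; print(espressomd.version.friendly())"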

Release 4.3:

# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

git clone --recursive --branch python --origin upstream \
    https://github.com/espressomd/espresso.git espresso-4.3
cd espresso-4.3
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
    -D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
    -D CUDAToolkit_ROOT="${CUDA_HOME}" \
    -D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
    -D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
make -j 4
deactivate
module purge
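
As with release 4.2, a quick sanity check run from the build directory before deactivating the environment shows which features were actually compiled in (a sketch; espressomd.features() is assumed to list the active myconfig.hpp features):

./pypresso -c "import espressomd; print(sorted(espressomd.features()))"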

Submitting jobs

Batch command for a benchmark job:

for n in 32 64 128 256 512 1024 2048 4096 ; do
  sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
         --hint=nomultithread --exclusive \
         --ntasks-per-node=$((n<128 ? n : 128)) job.sh
done
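
The loop submits eight jobs; their progress can be followed with standard Slurm commands (a sketch; the job name multixscale comes from the job script below):

# jobs still pending or running
squeue --user="${USER}" --partition=cpu
# accounting data once jobs have finished
sacct --name=multixscale --format=JobID,NNodes,NTasks,Elapsed,State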

Job script:

#!/bin/bash
#SBATCH --job-name=multixscale
#SBATCH --time=00:10:00
#SBATCH --output %j.stdout
#SBATCH --error  %j.stderr

# last update: September 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
            OpenMPI/4.1.5-GCC-${GCC_VERSION} \
            FFTW/3.3.10-GCC-${GCC_VERSION} \
            CUDA/12.3.0 \
            Python/3.11.3-GCCcore-${GCC_VERSION} \
            CMake/3.26.3-GCCcore-${GCC_VERSION} \
            cURL/8.0.1-GCCcore-${GCC_VERSION}

# fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"

OLD_PYTHONPATH="${PYTHONPATH}"
export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
source "${ESPRESSO_ROOT}/venv/bin/activate"

srun --cpu-bind=cores python3 script.py --particles_per_core=2000

export PYTHONPATH="${OLD_PYTHONPATH}"
deactivate
module purge

The following harmless warning will be generated:

srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]

This is a known issue on HPC Vega, see e.g. this Max CoE thread from Nov 2022. No corrective action is required.

The following point-to-point messaging layer (pml) error message may be generated when using more than 1 socket on a GPU node:

[gn22:1299643] pml_ucx.c:419  Error: ucp_ep_create(proc=246) failed: Shared memory error
[gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx

Force the same PML component on all ranks by setting the OMPI_MCA_pml environment variable:

OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu

The following CUDA error occurs when the stub CUDA driver library is picked up at runtime instead of the real driver:

GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library

Fix it by removing the stubs directory from the library search path:

export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")

Interactive job:

srun --partition=dev --time=0:05:00 --job-name=interactive \
     --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
     --hint=nomultithread --pty /usr/bin/bash
module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
    /ceph/hpc/software/mpibench

Output:

[cn0294:2697220] MCW rank  0 bound to socket 1[core  0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  1 bound to socket 1[core  1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  2 bound to socket 1[core  2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  3 bound to socket 1[core  3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
[cn0294:2697220] MCW rank  4 bound to socket 1[core  4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
[cn0294:2697220] MCW rank  5 bound to socket 1[core  5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
[cn0294:2697220] MCW rank  6 bound to socket 1[core  6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
[cn0294:2697220] MCW rank  7 bound to socket 1[core  7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
[cn0294:2697220] MCW rank  8 bound to socket 1[core  8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
[cn0294:2697220] MCW rank  9 bound to socket 1[core  9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
[cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
[cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
[cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
[cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
[cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
[cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
START mpiBench v1.5
...