Using Vega¶
Login¶
There are eight login nodes and three gateways that redirect to one of the login nodes in a load-balanced way:
| Hostname | Node type |
|---|---|
| login.vega.izum.si | login to one of the eight login nodes |
| logincpu.vega.izum.si | login to one of the four CPU login nodes |
| logingpu.vega.izum.si | login to one of the four GPU login nodes |
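For example, to connect through the load-balanced gateway (hostname as listed above):

$ ssh username@login.vega.izum.si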
Host key fingerprint:
| Algorithm | Fingerprint (SHA256) |
|---|---|
| RSA | |
| ECDSA | |
| ED25519 | |
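To check a key before the first connection, hash the key offered by the server and compare it against the table above; a minimal sketch using OpenSSH's ssh-keyscan and ssh-keygen, with the gateway hostname taken from the login table:

# print the SHA256 fingerprint of the ED25519 host key offered by the server
ssh-keyscan -t ed25519 login.vega.izum.si 2>/dev/null | ssh-keygen -lf -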
More details can be found on the wiki page HPC Vega / Login information.
2FA¶
There is a single server for 2FA:
| Hostname | Node type |
|---|---|
| otp.vega.izum.si | OTP server |
Host key fingerprint:
| Algorithm | Fingerprint (SHA256) |
|---|---|
| ED25519 | |
Enroll by connecting to the OTP server:

$ ssh username@otp.vega.izum.si
-------------------------------------------------------------------------------------------------------
Configuring two-factor authentication (2FA)
Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
-------------------------------------------------------------------------------------------------------
Press any key to continue
Warning: pasting the following URL into your browser exposes the OTP secret to Google:
https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
[here an ASCII diagram of the QR-code]
Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
Enter code from app (-1 to skip): -1
Code confirmation skipped
Your emergency scratch codes are:
12345678
98765432
2FA configured successfully, next ssh connection will use it
Connection to otp.vega.izum.si closed.
Building dependencies¶
Boost¶
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
mkdir boost-build
cd boost-build
# the boostorg.jfrog.io mirror has been retired; archives.boost.io serves the same release layout
BOOST_DOMAIN="https://archives.boost.io"
BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
mkdir -p "${BOOST_ROOT}"
curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
cd "boost_${BOOST_VERSION//./_}"
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
./b2 -j 4 install --prefix="${BOOST_ROOT}"
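A quick sanity check that the MPI-enabled libraries landed in the install prefix (a sketch against the ${BOOST_ROOT} defined above):

# the Boost.MPI and serialization shared libraries should be listed
ls "${BOOST_ROOT}/lib" | grep -E 'libboost_(mpi|serialization)'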
Building software¶
ESPResSo¶
Release 4.2:
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
git clone --recursive --branch 4.2 --origin upstream \
https://github.com/espressomd/espresso.git espresso-4.2
cd espresso-4.2
git fetch upstream python
git cherry-pick 381aac217
git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
git commit -m 'CMake: Modernize handling of Python dependencies'
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
-D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
make -j 4
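# optional sanity check before leaving the venv (assumption: the pypresso
# wrapper is generated in the build directory and picks up the venv Python)
./pypresso -c "import espressomd; print(espressomd.__file__)"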
deactivate
module purge
Release 4.3:
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
git clone --recursive --branch python --origin upstream \
https://github.com/espressomd/espresso.git espresso-4.3
cd espresso-4.3
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
-D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
-D CUDAToolkit_ROOT="${CUDA_HOME}" \
-D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
-D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
make -j 4
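# optional sanity check before leaving the venv: the CMake cache should record
# the feature toggles passed above, and the Python module should import
grep -E 'ESPRESSO_BUILD_WITH_(CUDA|WALBERLA)' CMakeCache.txt
./pypresso -c "import espressomd; print(espressomd.__file__)"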
deactivate
module purge
Submitting jobs¶
Batch command for a benchmark job:
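# Vega CPU nodes provide 128 cores each (2x 64-core AMD EPYC), hence the per-node cap below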
for n in 32 64 128 256 512 1024 2048 4096 ; do
sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
--hint=nomultithread --exclusive \
--ntasks-per-node=$((n<128 ? n : 128)) job.sh
done
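The submitted jobs can then be monitored by job name (set in job.sh below); --me restricts the output to your own jobs:

squeue --me --name=multixscale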
Job script:
#!/bin/bash
#SBATCH --job-name=multixscale
#SBATCH --time=00:10:00
#SBATCH --output %j.stdout
#SBATCH --error %j.stderr
# last update: September 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
# fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
OLD_PYTHONPATH="${PYTHONPATH}"
export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
source "${ESPRESSO_ROOT}/venv/bin/activate"
srun --cpu-bind=cores python3 script.py --particles_per_core=2000
export PYTHONPATH="${OLD_PYTHONPATH}"
deactivate
module purge
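For GPU runs (see the notes below), the same script can be submitted to the GPU partition; a sketch, assuming the partition is named gpu and requesting one GPU per node via Slurm's GRES syntax:

sbatch --partition=gpu --gres=gpu:1 --ntasks=64 --ntasks-per-node=64 \
    --mem-per-cpu=800MB --hint=nomultithread job.sh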
The following harmless warning will be generated:
srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]
This is a known issue on HPC Vega; see e.g. this Max CoE thread from November 2022. No corrective action is required.
The following point-to-point messaging layer (PML) error message may be generated when using more than one socket on a GPU node:
[gn22:1299643] pml_ucx.c:419 Error: ucp_ep_create(proc=246) failed: Shared memory error
[gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx
Force the same PML component on all ranks by setting the corresponding Open MPI environment variable:
OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu
The following CUDA error is raised when the CUDA stubs library shadows the real driver library at run time:
GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library
Fix it by removing the stubs directory from the library search path:
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
Interactive job:
srun --partition=dev --time=0:05:00 --job-name=interactive \
--ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
--hint=nomultithread --pty /usr/bin/bash
module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
/ceph/hpc/software/mpibench
Output:
[cn0294:2697220] MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 2 bound to socket 1[core 2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 3 bound to socket 1[core 3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 4 bound to socket 1[core 4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
[cn0294:2697220] MCW rank 5 bound to socket 1[core 5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
[cn0294:2697220] MCW rank 6 bound to socket 1[core 6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
[cn0294:2697220] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
[cn0294:2697220] MCW rank 8 bound to socket 1[core 8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
[cn0294:2697220] MCW rank 9 bound to socket 1[core 9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
[cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
[cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
[cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
[cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
[cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
[cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
START mpiBench v1.5
...