.. _Using Vega:
Using Vega
==========
.. _Vega login:
Login
-----
There are 8 login nodes and 3 gateways that redirect
to any of the login nodes in a load-balanced way:
============================ =======================================
Hostname Node type
============================ =======================================
``login.vega.izum.si`` login to one of the eight login nodes
``logincpu.vega.izum.si`` login to one of the four CPU login nodes
``logingpu.vega.izum.si`` login to one of the four GPU login nodes
============================ =======================================
Host key fingerprint:
========= ========================================================
Algorithm Fingerprint (SHA256)
========= ========================================================
RSA ``SHA256:wPX+u4yCNLEh3s8c+ajvWja2i+mx+6iuo0KjnJuxYco``
ECDSA ``SHA256:PUL9NXrCdfWMUPPQRwjdwpFVzGB61Ta97FwdSXyndFE``
ED25519 ``SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U``
========= ========================================================
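
The fingerprints above can be checked before the first login. The following is
a minimal sketch, assuming OpenSSH's ``ssh-keyscan`` and ``ssh-keygen`` are
installed locally. The helper names ``fingerprint_matches`` and
``vega_fingerprint_line`` are hypothetical and only serve this illustration:

.. code-block:: bash

    # Expected ED25519 fingerprint, copied from the table above.
    EXPECTED_ED25519="SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U"

    # Succeed if the fingerprint field (second column of `ssh-keygen -l`
    # output) equals the expected value.
    fingerprint_matches() {
        local expected="$1" line="$2" bits actual rest
        read -r bits actual rest <<< "${line}"
        [ "${actual}" = "${expected}" ]
    }

    # Fetch the ED25519 host key of a host and print its fingerprint line.
    # Requires network access, so it is only defined here, not called.
    vega_fingerprint_line() {
        ssh-keyscan -t ed25519 "$1" 2>/dev/null | ssh-keygen -lf -
    }

    # Usage (on a machine with network access):
    #   line="$(vega_fingerprint_line login.vega.izum.si)"
    #   fingerprint_matches "${EXPECTED_ED25519}" "${line}" && echo "host key OK"

If the comparison fails, do not accept the host key when ``ssh`` prompts for it.
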
More details can be found in the wiki page `HPC Vega / Login information
`__.
.. _Vega 2FA:
2FA
---
There is one server for two-factor authentication (2FA):
============================ =======================================
Hostname Node type
============================ =======================================
``otp.vega.izum.si`` OTP server
============================ =======================================
Host key fingerprint:
========= ========================================================
Algorithm Fingerprint (SHA256)
========= ========================================================
ED25519 ``SHA256:oN2fHLZe2Z/BbaNUnU+f6m56HWVX/o0mlK9/wqD54fE``
========= ========================================================
.. code-block:: none
$ ssh username@otp.vega.izum.si
-------------------------------------------------------------------------------------------------------
Configuring two-factor authentication (2FA)
Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
-------------------------------------------------------------------------------------------------------
Press any key to continue
Warning: pasting the following URL into your browser exposes the OTP secret to Google:
https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
[here an ASCII diagram of the QR-code]
Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
Enter code from app (-1 to skip): -1
Code confirmation skipped
Your emergency scratch codes are:
12345678
98765432
2FA configured successfully, next ssh connection will use it
Connection to otp.vega.izum.si closed.
.. _Vega building dependencies:
Building dependencies
---------------------
Boost
^^^^^
.. code-block:: bash
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
mkdir boost-build
cd boost-build
BOOST_DOMAIN="https://boostorg.jfrog.io/artifactory/main"
BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
mkdir -p "${BOOST_ROOT}"
curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
cd "boost_${BOOST_VERSION//./_}"
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
./b2 -j 4 install --prefix="${BOOST_ROOT}"
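
The install prefix is derived from the version numbers through bash parameter
expansion, and later build steps rely on the same path. The sketch below shows
what the prefix resolves to and checks whether the Boost.MPI library is
present. The ``libboost_mpi*`` file name pattern is an assumption about the
installed library names:

.. code-block:: bash

    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    # ${VAR//./_} replaces every dot with an underscore.
    BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    # Prints a path ending in boost_mpi_1_82_0_gcc_12_3_0
    echo "${BOOST_ROOT}"

    # compgen -G succeeds only if the glob pattern matches at least one file.
    if compgen -G "${BOOST_ROOT}/lib/libboost_mpi*" > /dev/null; then
        echo "Boost.MPI found in ${BOOST_ROOT}/lib"
    else
        echo "Boost.MPI not found in ${BOOST_ROOT}/lib"
    fi
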
.. _Vega building software:
Building software
-----------------
ESPResSo
^^^^^^^^
Release 4.2:
.. code-block:: bash
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
git clone --recursive --branch 4.2 --origin upstream \
https://github.com/espressomd/espresso.git espresso-4.2
cd espresso-4.2
git fetch upstream python
git cherry-pick 381aac217
git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
git commit -m 'CMake: Modernize handling of Python dependencies'
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
-D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
make -j 4
deactivate
module purge
Release 4.3:
.. code-block:: bash
# last update: October 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
git clone --recursive --branch python --origin upstream \
https://github.com/espressomd/espresso.git espresso-4.3
cd espresso-4.3
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
mkdir build
cd build
cp ../maintainer/configs/maxset.hpp myconfig.hpp
sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
-D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
-D CUDAToolkit_ROOT="${CUDA_HOME}" \
-D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
-D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
make -j 4
deactivate
module purge
.. _Vega submitting jobs:
Submitting jobs
---------------
Batch command for a benchmark job:
.. code-block:: bash
for n in 32 64 128 256 512 1024 2048 4096 ; do
sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
--hint=nomultithread --exclusive \
--ntasks-per-node=$((n<128 ? n : 128)) job.sh
done
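
The arithmetic expression in ``--ntasks-per-node`` caps the number of tasks per
node at 128, so larger jobs are spread over several nodes. The sketch below
prints the resulting layout for each job size without submitting anything. The
helper names ``tasks_per_node`` and ``node_count`` are hypothetical:

.. code-block:: bash

    # Tasks per node, capped at 128 (same expression as in the loop above).
    tasks_per_node() {
        local n="$1"
        echo $(( n < 128 ? n : 128 ))
    }

    # Number of nodes needed for n tasks (ceiling division).
    node_count() {
        local n="$1" per_node
        per_node="$(tasks_per_node "${n}")"
        echo $(( (n + per_node - 1) / per_node ))
    }

    for n in 32 64 128 256 512 1024 2048 4096; do
        echo "ntasks=${n} ntasks-per-node=$(tasks_per_node "${n}") nodes=$(node_count "${n}")"
    done

For example, 64 tasks fit on a single node, while 4096 tasks are distributed
over 32 nodes with 128 tasks each.
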
Job script:
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=multixscale
#SBATCH --time=00:10:00
#SBATCH --output %j.stdout
#SBATCH --error %j.stderr
# last update: September 2024
GCC_VERSION=12.3.0
BOOST_VERSION=1.82.0
module load GCC/${GCC_VERSION} \
OpenMPI/4.1.5-GCC-${GCC_VERSION} \
FFTW/3.3.10-GCC-${GCC_VERSION} \
CUDA/12.3.0 \
Python/3.11.3-GCCcore-${GCC_VERSION} \
CMake/3.26.3-GCCcore-${GCC_VERSION} \
cURL/8.0.1-GCCcore-${GCC_VERSION}
# fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
OLD_PYTHONPATH="${PYTHONPATH}"
export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
source "${ESPRESSO_ROOT}/venv/bin/activate"
srun --cpu-bind=cores python3 script.py --particles_per_core=2000
export PYTHONPATH="${OLD_PYTHONPATH}"
deactivate
module purge
The following harmless warning will be generated:
.. code-block:: none
srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]
This is a known issue on HPC Vega, see e.g. `this Max CoE thread from Nov 2022
`__.
No corrective action is required.
The following point-to-point messaging layer (``pml``) error message
may be generated when using more than 1 socket on a GPU node:
.. code-block:: none
[gn22:1299643] pml_ucx.c:419 Error: ucp_ep_create(proc=246) failed: Shared memory error
[gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx
Set the ``OMPI_MCA_pml`` environment variable to force the same messaging layer on all sockets:
.. code-block:: bash
OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu
The following CUDA error is due to the deprecated stubs library:
.. code-block:: none
GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library
Fix it by removing the stubs directory from the library search path:
.. code-block:: bash
export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
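
The ``sed`` expression above hard-codes the module path and only matches when
the stubs entry is preceded by a colon. As a sketch of a path-independent
alternative, the hypothetical helper ``strip_stubs`` below drops every search
path entry that ends in ``/stubs``. It assumes bash and that no other required
directory is named ``stubs``:

.. code-block:: bash

    # Print a colon-separated path list without entries ending in /stubs.
    # Empty entries are dropped as well.
    strip_stubs() {
        local entry result="" entries
        IFS=':' read -r -a entries <<< "$1"
        for entry in "${entries[@]}"; do
            case "${entry}" in
                ""|*/stubs|*/stubs/) continue ;;
            esac
            result="${result:+${result}:}${entry}"
        done
        printf '%s\n' "${result}"
    }

    export LD_LIBRARY_PATH="$(strip_stubs "${LD_LIBRARY_PATH:-}")"

This variant keeps working if the CUDA module version or installation prefix
changes, at the cost of a few more lines in the job script.
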
Interactive job:
.. code-block:: bash
srun --partition=dev --time=0:05:00 --job-name=interactive \
--ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
--hint=nomultithread --pty /usr/bin/bash
module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
/ceph/hpc/software/mpibench
Output:
.. code-block:: none
[cn0294:2697220] MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 2 bound to socket 1[core 2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 3 bound to socket 1[core 3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
[cn0294:2697220] MCW rank 4 bound to socket 1[core 4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
[cn0294:2697220] MCW rank 5 bound to socket 1[core 5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
[cn0294:2697220] MCW rank 6 bound to socket 1[core 6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
[cn0294:2697220] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
[cn0294:2697220] MCW rank 8 bound to socket 1[core 8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
[cn0294:2697220] MCW rank 9 bound to socket 1[core 9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
[cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
[cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
[cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
[cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
[cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
[cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
START mpiBench v1.5
...