.. _Using Vega:

Using Vega
==========

.. _Vega login:

Login
-----

There are 8 login nodes and 3 gateways that redirect to one of the login
nodes in a load-balanced way:

============================ ========================================
Hostname                     Node type
============================ ========================================
``login.vega.izum.si``       login to one of the eight login nodes
``logincpu.vega.izum.si``    login to one of the four CPU login nodes
``logingpu.vega.izum.si``    login to one of the four GPU login nodes
============================ ========================================

Host key fingerprint:

========= ========================================================
Algorithm Fingerprint (SHA256)
========= ========================================================
RSA       ``SHA256:wPX+u4yCNLEh3s8c+ajvWja2i+mx+6iuo0KjnJuxYco``
ECDSA     ``SHA256:PUL9NXrCdfWMUPPQRwjdwpFVzGB61Ta97FwdSXyndFE``
ED25519   ``SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U``
========= ========================================================

More details can be found in the wiki page
`HPC Vega / Login information `__.

.. _Vega 2FA:

2FA
---

There is one server for 2FA:

============================ =======================================
Hostname                     Node type
============================ =======================================
``otp.vega.izum.si``         OTP server
============================ =======================================

Host key fingerprint:

========= ========================================================
Algorithm Fingerprint (SHA256)
========= ========================================================
ED25519   ``SHA256:oN2fHLZe2Z/BbaNUnU+f6m56HWVX/o0mlK9/wqD54fE``
========= ========================================================

Connect to the OTP server to configure 2FA:

.. code-block:: none

    $ ssh username@otp.vega.izum.si
    -------------------------------------------------------------------------------------------------------
    Configuring two-factor authentication (2FA)
    Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
    -------------------------------------------------------------------------------------------------------
    Press any key to continue
    Warning: pasting the following URL into your browser exposes the OTP secret to Google:
      https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
    [here an ASCII diagram of the QR-code]
    Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
    Enter code from app (-1 to skip): -1
    Code confirmation skipped
    Your emergency scratch codes are:
      12345678
      98765432
    2FA configured successfully, next ssh connection will use it
    Connection to otp.vega.izum.si closed.

.. _Vega building dependencies:

Building dependencies
---------------------

Boost
^^^^^

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
    mkdir boost-build
    cd boost-build
    BOOST_DOMAIN="https://boostorg.jfrog.io/artifactory/main"
    BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    mkdir -p "${BOOST_ROOT}"
    curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
    cd "boost_${BOOST_VERSION//./_}"
    echo 'using mpi ;' > tools/build/src/user-config.jam
    ./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
    ./b2 -j 4 install --prefix="${BOOST_ROOT}"

.. _Vega building software:

Building software
-----------------

ESPResSo
^^^^^^^^

Release 4.2:

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
    git clone --recursive --branch 4.2 --origin upstream \
        https://github.com/espressomd/espresso.git espresso-4.2
    cd espresso-4.2
    git fetch upstream python
    git cherry-pick 381aac217
    git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
    sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
    sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
    git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
    git commit -m 'CMake: Modernize handling of Python dependencies'
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
    mkdir build
    cd build
    cp ../maintainer/configs/maxset.hpp myconfig.hpp
    sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
    cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
        -D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
    make -j 4
    deactivate
    module purge

Release 4.3:

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"
    git clone --recursive --branch python --origin upstream \
        https://github.com/espressomd/espresso.git espresso-4.3
    cd espresso-4.3
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
    mkdir build
    cd build
    cp ../maintainer/configs/maxset.hpp myconfig.hpp
    sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
    cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
        -D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
        -D CUDAToolkit_ROOT="${CUDA_HOME}" \
        -D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
        -D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
    make -j 4
    deactivate
    module purge

.. _Vega submitting jobs:

Submitting jobs
---------------

Batch command for a benchmark job:

.. code-block:: bash

    for n in 32 64 128 256 512 1024 2048 4096 ; do
      sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
             --hint=nomultithread --exclusive \
             --ntasks-per-node=$((n<128 ? n : 128)) job.sh
    done

Job script:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=multixscale
    #SBATCH --time=00:10:00
    #SBATCH --output %j.stdout
    #SBATCH --error %j.stderr
    # last update: September 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}
    # fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
    export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
    OLD_PYTHONPATH="${PYTHONPATH}"
    export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
    export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
    source "${ESPRESSO_ROOT}/venv/bin/activate"
    srun --cpu-bind=cores python3 script.py --particles_per_core=2000
    export PYTHONPATH="${OLD_PYTHONPATH}"
    deactivate
    module purge

The job generates the following harmless warning:

.. code-block:: none

    srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]

This is a known issue on HPC Vega, see e.g.
`this Max CoE thread from Nov 2022 `__.
No corrective action is required.

The following point-to-point messaging layer (``pml``) error message
may be generated when using more than one socket on a GPU node:

.. code-block:: none

    [gn22:1299643] pml_ucx.c:419  Error: ucp_ep_create(proc=246) failed: Shared memory error
    [gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx

Set the ``OMPI_MCA_pml`` environment variable to force the same
messaging layer on all sockets:

.. code-block:: bash

    OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu

The following CUDA error is due to the deprecated stubs library:

.. code-block:: none

    GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library

Fix it by removing the stubs directory from the library search path:

.. code-block:: bash

    export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")

Interactive job:

.. code-block:: bash

    srun --partition=dev --time=0:05:00 --job-name=interactive \
         --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
         --hint=nomultithread --pty /usr/bin/bash
    module load Boost/1.72.0-gompi-2020a hwloc
    mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
            /ceph/hpc/software/mpibench

Output:

.. code-block:: none

    [cn0294:2697220] MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank 2 bound to socket 1[core 2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank 3 bound to socket 1[core 3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank 4 bound to socket 1[core 4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
    [cn0294:2697220] MCW rank 5 bound to socket 1[core 5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
    [cn0294:2697220] MCW rank 6 bound to socket 1[core 6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
    [cn0294:2697220] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
    [cn0294:2697220] MCW rank 8 bound to socket 1[core 8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
    [cn0294:2697220] MCW rank 9 bound to socket 1[core 9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
    [cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
    [cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
    [cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
    [cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
    [cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
    [cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
    START mpiBench v1.5
    ...
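
The ``sed`` command used earlier in this section for the
``cudaErrorStubLibrary`` fix hard-codes the location of the CUDA 12.3.0
module, and only matches when the stubs entry is preceded by a colon.
As a minimal sketch, assuming that every search-path entry ending in
``/stubs`` can safely be dropped, the same idea could be written without
a fixed path. The helper name ``strip_stubs`` and the example paths are
hypothetical and are not part of the Vega setup:

.. code-block:: bash

    #!/bin/bash
    # Hypothetical helper: print a colon-separated search path with every
    # entry that ends in "/stubs" removed. Empty entries are dropped too.
    strip_stubs() {
        local input="$1" entry result=""
        local -a entries
        # split on ":" into an array, without glob expansion
        IFS=':' read -r -a entries <<< "${input}"
        for entry in "${entries[@]}"; do
            case "${entry}" in
                ""|*/stubs|*/stubs/) continue ;;
            esac
            # re-join the remaining entries with ":"
            result="${result:+${result}:}${entry}"
        done
        printf '%s\n' "${result}"
    }

    # example with made-up paths; prints /opt/a/lib:/opt/b/lib
    strip_stubs "/opt/a/lib:/opt/cuda/lib64/stubs:/opt/b/lib"

In a job script, this sketch would be used as
``export LD_LIBRARY_PATH="$(strip_stubs "${LD_LIBRARY_PATH}")"``
after the ``module load`` command.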