.. _Using Vega:

Using Vega
==========

.. _Vega login:

Login
-----

There are 8 login nodes and 3 gateways that redirect
to any of the login nodes in a load-balanced way:

============================  =======================================
Hostname                      Node type
============================  =======================================
``login.vega.izum.si``        login to one of the eight login nodes
``logincpu.vega.izum.si``     login to one of the four CPU login nodes
``logingpu.vega.izum.si``     login to one of the four GPU login nodes
============================  =======================================

Host key fingerprints:

=========  ========================================================
Algorithm  Fingerprint (SHA256)
=========  ========================================================
RSA        ``SHA256:wPX+u4yCNLEh3s8c+ajvWja2i+mx+6iuo0KjnJuxYco``
ECDSA      ``SHA256:PUL9NXrCdfWMUPPQRwjdwpFVzGB61Ta97FwdSXyndFE``
ED25519    ``SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U``
=========  ========================================================

More details can be found in the wiki page `HPC Vega / Login information
<https://doc.vega.izum.si/login/>`__.
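
The fingerprints above can be compared against the output of ``ssh-keygen -l``
before accepting a host key. The following is a minimal sketch, not an official
procedure: ``fingerprint_matches`` is a hypothetical helper, and the sample line
only imitates the output format of ``ssh-keygen -l`` so that the snippet runs
without network access.

.. code-block:: bash

    # expected ED25519 fingerprint, copied from the table above
    expected="SHA256:VjR9viww63uTSojXTo5WKZNR352p5+/KC0nzycCG27U"

    # hypothetical helper: reads `ssh-keygen -l` output on stdin and succeeds
    # if the second field of any line equals the expected fingerprint
    fingerprint_matches() {
      awk -v fp="$1" '$2 == fp { found = 1 } END { exit !found }'
    }

    # offline demonstration on a line in the format printed by `ssh-keygen -l`
    sample="256 ${expected} login.vega.izum.si (ED25519)"
    if echo "${sample}" | fingerprint_matches "${expected}"; then
      echo "fingerprint matches"
    else
      echo "fingerprint MISMATCH"
    fi

To check the live server instead of the sample line, the helper can be fed with
``ssh-keyscan -t ed25519 login.vega.izum.si 2>/dev/null | ssh-keygen -lf -``.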

.. _Vega 2FA:

2FA
---

There is 1 server for 2FA:

============================  =======================================
Hostname                      Node type
============================  =======================================
``otp.vega.izum.si``          OTP server
============================  =======================================

Host key fingerprint:

=========  ========================================================
Algorithm  Fingerprint (SHA256)
=========  ========================================================
ED25519    ``SHA256:oN2fHLZe2Z/BbaNUnU+f6m56HWVX/o0mlK9/wqD54fE``
=========  ========================================================

Connect to the OTP server over SSH to configure 2FA:

.. code-block:: none

    $ ssh username@otp.vega.izum.si
    -------------------------------------------------------------------------------------------------------
    Configuring two-factor authentication (2FA)
    Please, install Aegis Authenticator on your Android phone or Raivo OTP on iOS, then scan the displayed QR Code
    -------------------------------------------------------------------------------------------------------
    Press any key to continue
    Warning: pasting the following URL into your browser exposes the OTP secret to Google:
      https://www.google.com/chart?chs=200x200&chld=M|0&cht=qr&chl=otpauth://totp/username@otp.vega.izum.si%3Fsecret%3DABCDEFGHIJKLMNOPQRST1234%26issuer%3Dotp.vega.izum.si
    [here an ASCII diagram of the QR-code]
    Your new secret key is: ABCDEFGHIJKLMNOPQRST1234
    Enter code from app (-1 to skip): -1
    Code confirmation skipped
    Your emergency scratch codes are:
      12345678
      98765432
    2FA configured successfully, next ssh connection will use it
    Connection to otp.vega.izum.si closed.

.. _Vega building dependencies:

Building dependencies
---------------------

Boost
^^^^^

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} OpenMPI/4.1.5-GCC-${GCC_VERSION} cURL/8.0.1-GCCcore-${GCC_VERSION}
    mkdir boost-build
    cd boost-build
    BOOST_DOMAIN="https://boostorg.jfrog.io/artifactory/main"
    BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    mkdir -p "${BOOST_ROOT}"
    curl -sL "${BOOST_DOMAIN}/release/${BOOST_VERSION}/source/boost_${BOOST_VERSION//./_}.tar.bz2" | tar xj
    cd "boost_${BOOST_VERSION//./_}"
    echo 'using mpi ;' > tools/build/src/user-config.jam
    ./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
    ./b2 -j 4 install --prefix="${BOOST_ROOT}"
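
The ``${VAR//./_}`` construct used above is a Bash pattern substitution that
replaces every dot in the value with an underscore. As a small illustration,
the following lines only print the archive directory name and the install
directory name derived from the version numbers; they do not load modules or
build anything.

.. code-block:: bash

    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    # replace every "." by "_" in the version strings (Bash-specific syntax)
    echo "boost_${BOOST_VERSION//./_}"
    echo "boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"

This prints ``boost_1_82_0`` and ``boost_mpi_1_82_0_gcc_12_3_0``.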

.. _Vega building software:

Building software
-----------------

ESPResSo
^^^^^^^^

Release 4.2:

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

    git clone --recursive --branch 4.2 --origin upstream \
        https://github.com/espressomd/espresso.git espresso-4.2
    cd espresso-4.2
    git fetch upstream python
    git cherry-pick 381aac217
    git show HEAD:src/core/CMakeLists.txt > src/core/CMakeLists.txt
    sed -i 's/{PYTHON_INSTDIR}/{ESPRESSO_INSTALL_PYTHON}/' src/core/CMakeLists.txt
    sed -i 's/{PYTHON_EXECUTABLE}/{Python_EXECUTABLE}/' src/python/espressomd/CMakeLists.txt
    git add src/core/CMakeLists.txt src/python/espressomd/CMakeLists.txt
    git commit -m 'CMake: Modernize handling of Python dependencies'
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==0.29.36
    mkdir build
    cd build
    cp ../maintainer/configs/maxset.hpp myconfig.hpp
    sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
    cmake .. -D CMAKE_BUILD_TYPE=Release -D WITH_CUDA=ON \
        -D WITH_CCACHE=OFF -D WITH_SCAFACOS=OFF -D WITH_HDF5=OFF
    make -j 4
    deactivate
    module purge

Release 4.3:

.. code-block:: bash

    # last update: October 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${CUDA_HOME}/lib64/stubs"

    git clone --recursive --branch python --origin upstream \
        https://github.com/espressomd/espresso.git espresso-4.3
    cd espresso-4.3
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install -c "requirements.txt" "numpy<2.0" scipy vtk h5py cython==3.0.8
    mkdir build
    cd build
    cp ../maintainer/configs/maxset.hpp myconfig.hpp
    sed -i "/ADDITIONAL_CHECKS/d" myconfig.hpp
    cmake .. -D CMAKE_BUILD_TYPE=Release -D ESPRESSO_BUILD_WITH_CCACHE=OFF \
        -D ESPRESSO_BUILD_WITH_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES="80" \
        -D CUDAToolkit_ROOT="${CUDA_HOME}" \
        -D ESPRESSO_BUILD_WITH_WALBERLA=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
        -D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF
    make -j 4
    deactivate
    module purge
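
Both build recipes extend ``LD_LIBRARY_PATH`` with the
``${VAR:+$VAR:}`` idiom, which inserts the old value and a colon separator
only when the variable is already non-empty. The sketch below shows the effect
in isolation; ``append_path`` is a hypothetical helper introduced for
illustration and the paths are placeholders.

.. code-block:: bash

    # hypothetical helper: append directory $2 to the search path $1
    append_path() {
      local current="$1" new="$2"
      # emit "current:" only if current is set and non-empty
      echo "${current:+$current:}${new}"
    }

    append_path "" "/opt/boost/lib"          # no leading colon
    append_path "/usr/lib" "/opt/boost/lib"  # colon-separated

Avoiding a leading colon matters because the dynamic linker treats an empty
entry in ``LD_LIBRARY_PATH`` as the current directory.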

.. _Vega submitting jobs:

Submitting jobs
---------------

Batch command for a benchmark job:

.. code-block:: bash

    for n in 32 64 128 256 512 1024 2048 4096 ; do
      sbatch --partition=cpu --ntasks=${n} --mem-per-cpu=800MB \
             --hint=nomultithread --exclusive \
             --ntasks-per-node=$((n<128 ? n : 128)) job.sh
    done
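
The arithmetic expansion ``$((n<128 ? n : 128))`` caps the number of tasks per
node at 128. As a dry-run sketch, the loop below prints the values that would
be passed to ``sbatch`` without submitting anything. The node count is derived
here by ceiling division under the assumption that tasks are packed evenly; it
is not queried from Slurm.

.. code-block:: bash

    for n in 32 64 128 256 512 1024 2048 4096 ; do
      # same cap as in the sbatch command above
      per_node=$((n<128 ? n : 128))
      # ceiling division: number of nodes needed to host n tasks
      nodes=$(( (n + per_node - 1) / per_node ))
      echo "ntasks=${n} ntasks-per-node=${per_node} nodes=${nodes}"
    done

Under that assumption, jobs with up to 128 tasks fit on one node and the
largest job spans 32 nodes.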

Job script:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=multixscale
    #SBATCH --time=00:10:00
    #SBATCH --output %j.stdout
    #SBATCH --error  %j.stderr

    # last update: September 2024
    GCC_VERSION=12.3.0
    BOOST_VERSION=1.82.0
    module load GCC/${GCC_VERSION} \
                OpenMPI/4.1.5-GCC-${GCC_VERSION} \
                FFTW/3.3.10-GCC-${GCC_VERSION} \
                CUDA/12.3.0 \
                Python/3.11.3-GCCcore-${GCC_VERSION} \
                CMake/3.26.3-GCCcore-${GCC_VERSION} \
                cURL/8.0.1-GCCcore-${GCC_VERSION}

    # fix for "GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library"
    export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
    export BOOST_ROOT="${HOME}/bin/boost_mpi_${BOOST_VERSION//./_}_gcc_${GCC_VERSION//./_}"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"

    OLD_PYTHONPATH="${PYTHONPATH}"
    export ESPRESSO_ROOT="${HOME}/multixscale/4.3-walberla-packinfo/espresso-4.3"
    export PYTHONPATH="${ESPRESSO_ROOT}/build-lj/src/python${OLD_PYTHONPATH:+:$OLD_PYTHONPATH}"
    source "${ESPRESSO_ROOT}/venv/bin/activate"

    srun --cpu-bind=cores python3 script.py --particles_per_core=2000

    export PYTHONPATH="${OLD_PYTHONPATH}"
    deactivate
    module purge

The following harmless warning will be generated:

.. code-block:: none

    srun: error: WARNING: Multiple leaf switches contain nodes: gn[01-60]

This is a known issue on HPC Vega, see e.g. `this Max CoE thread from Nov 2022
<https://hackmd.io/@enccs/max-coe-nov2022#Yambo>`__.
No corrective action is required.

The following point-to-point messaging layer (``pml``) error message
may be generated when using more than 1 socket on a GPU node:

.. code-block:: none

    [gn22:1299643] pml_ucx.c:419  Error: ucp_ep_create(proc=246) failed: Shared memory error
    [gn22:1299644] [[26541,3474],246] selected pml cm, but peer [[26541,3474],0] on gn20 selected pml ucx

Set the ``OMPI_MCA_pml`` environment variable to force the same ``pml`` component on all sockets:

.. code-block:: bash

    OMPI_MCA_pml=cm srun --cpu-bind=cores python3 ../script-tuned-mesh.py --particles_per_core=400 --hardware=gpu
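
Open MPI reads MCA parameters from environment variables named
``OMPI_MCA_<parameter>``. Writing the assignment as a prefix, as done above,
sets the variable for that single command only. The sketch below illustrates
the scoping with ``bash -c`` standing in for ``srun``, so it runs without
Slurm or MPI.

.. code-block:: bash

    # the assignment prefix applies to this one command only
    OMPI_MCA_pml=cm bash -c 'echo "inside: pml=${OMPI_MCA_pml}"'
    # the calling shell is unaffected, unless the variable was exported earlier
    echo "outside: pml=${OMPI_MCA_pml:-unset}"

To apply the setting to every ``srun`` call in a job script, the variable can
be exported once instead.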

The following CUDA error occurs when the stubs library is loaded at runtime instead of the real CUDA driver:

.. code-block:: none

    GPU Error: 34 cudaErrorStubLibrary: CUDA driver is a stub library

Fix it by removing the stubs directory from the library search path:

.. code-block:: bash

    export LD_LIBRARY_PATH=$(echo "${LD_LIBRARY_PATH}" | sed "s|:/cvmfs/sling.si/modules/el7/software/CUDA/12.3.0/lib64/stubs||")
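
The ``sed`` expression above matches one hard-coded path and only when it is
preceded by a colon. As a possible alternative, and only a sketch, the
hypothetical helper ``strip_stubs`` below drops every entry ending in
``/stubs`` regardless of its position or CUDA version. The paths in the
example are placeholders.

.. code-block:: bash

    # hypothetical helper: remove all entries ending in "/stubs"
    strip_stubs() {
      # split on ":", filter out stubs directories, re-join with ":"
      echo "$1" | tr ':' '\n' | grep -v '/stubs$' | paste -sd ':' -
    }

    strip_stubs "/a/lib:/cuda/lib64/stubs:/b/lib"
    strip_stubs "/cuda/lib64/stubs:/a/lib"

In a job script this could be applied as
``export LD_LIBRARY_PATH=$(strip_stubs "${LD_LIBRARY_PATH}")``.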

Interactive job:

.. code-block:: bash

    srun --partition=dev --time=0:05:00 --job-name=interactive \
         --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB \
         --hint=nomultithread --pty /usr/bin/bash
    module load Boost/1.72.0-gompi-2020a hwloc
    mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings \
        /ceph/hpc/software/mpibench

Output:

.. code-block:: none

    [cn0294:2697220] MCW rank  0 bound to socket 1[core  0[hwt 0-1]]: [][BB/../../../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank  1 bound to socket 1[core  1[hwt 0-1]]: [][../BB/../../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank  2 bound to socket 1[core  2[hwt 0-1]]: [][../../BB/../../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank  3 bound to socket 1[core  3[hwt 0-1]]: [][../../../BB/../../../../../../../../../../../..]
    [cn0294:2697220] MCW rank  4 bound to socket 1[core  4[hwt 0-1]]: [][../../../../BB/../../../../../../../../../../..]
    [cn0294:2697220] MCW rank  5 bound to socket 1[core  5[hwt 0-1]]: [][../../../../../BB/../../../../../../../../../..]
    [cn0294:2697220] MCW rank  6 bound to socket 1[core  6[hwt 0-1]]: [][../../../../../../BB/../../../../../../../../..]
    [cn0294:2697220] MCW rank  7 bound to socket 1[core  7[hwt 0-1]]: [][../../../../../../../BB/../../../../../../../..]
    [cn0294:2697220] MCW rank  8 bound to socket 1[core  8[hwt 0-1]]: [][../../../../../../../../BB/../../../../../../..]
    [cn0294:2697220] MCW rank  9 bound to socket 1[core  9[hwt 0-1]]: [][../../../../../../../../../BB/../../../../../..]
    [cn0294:2697220] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [][../../../../../../../../../../BB/../../../../..]
    [cn0294:2697220] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [][../../../../../../../../../../../BB/../../../..]
    [cn0294:2697220] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [][../../../../../../../../../../../../BB/../../..]
    [cn0294:2697220] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [][../../../../../../../../../../../../../BB/../..]
    [cn0294:2697220] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [][../../../../../../../../../../../../../../BB/..]
    [cn0294:2697220] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [][../../../../../../../../../../../../../../../BB]
    START mpiBench v1.5
    ...