This example implements the Gram-Schmidt orthogonalization method for dense (fully occupied) matrices. For sparse matrices, the Householder method is typically more efficient, but the Gram-Schmidt method is still widely used for dense matrices because of its robustness. Its structure also makes the method easy to implement on GPUs using CUDA.
Code example Gram-Schmidt method (23 KB)
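To illustrate why the structure of the method maps well onto a GPU, here is a minimal sketch. It is not the downloadable example above. It assumes the modified Gram-Schmidt variant, a column-major n x m matrix with linearly independent columns, and single precision. All names (blockSum, normalizeColumn, projectOut, orthonormalize) are made up for this illustration, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // block size; must be a power of two for blockSum

// Sum one value per thread across the block with a shared-memory tree reduction.
__device__ float blockSum(float v, float* buf) {
    buf[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    float total = buf[0];
    __syncthreads();  // make sure everyone has read buf[0] before buf is reused
    return total;
}

// Scale column k to unit length (launched with a single block).
__global__ void normalizeColumn(float* A, int n, int k) {
    __shared__ float buf[THREADS];
    float* q = A + (size_t)k * n;
    float part = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) part += q[i] * q[i];
    float norm = sqrtf(blockSum(part, buf));  // assumes a nonzero column
    for (int i = threadIdx.x; i < n; i += blockDim.x) q[i] /= norm;
}

// Block b removes the component along column k from column k + 1 + b.
__global__ void projectOut(float* A, int n, int k) {
    __shared__ float buf[THREADS];
    const float* q = A + (size_t)k * n;
    float* a = A + (size_t)(k + 1 + blockIdx.x) * n;
    float part = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) part += q[i] * a[i];
    float r = blockSum(part, buf);  // scalar product of q_k and a_j
    for (int i = threadIdx.x; i < n; i += blockDim.x) a[i] -= r * q[i];
}

// Orthonormalize the columns of A on the device and return max |Q^T Q - I|,
// which is evaluated on the host as a simple sanity check.
float orthonormalize(std::vector<float>& A, int n, int m) {
    float* dA = nullptr;
    size_t bytes = A.size() * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    for (int k = 0; k < m; ++k) {
        // Kernels on the default stream run in launch order, so no explicit
        // synchronization is needed between the two steps.
        normalizeColumn<<<1, THREADS>>>(dA, n, k);
        if (k + 1 < m) projectOut<<<m - k - 1, THREADS>>>(dA, n, k);
    }
    cudaMemcpy(A.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    float err = 0.0f;
    for (int a = 0; a < m; ++a)
        for (int b = 0; b < m; ++b) {
            float dot = 0.0f;
            for (int i = 0; i < n; ++i) dot += A[a * n + i] * A[b * n + i];
            err = std::fmax(err, std::fabs(dot - (a == b ? 1.0f : 0.0f)));
        }
    return err;
}

int main() {
    const int n = 1000, m = 8;
    std::vector<float> A(n * m);
    for (int j = 0; j < m; ++j)      // arbitrary test data
        for (int i = 0; i < n; ++i)
            A[j * n + i] = std::sin(0.01f * (i + 1) * (j + 1)) + (i == j ? 1.0f : 0.0f);
    printf("max |Q^T Q - I| = %g\n", orthonormalize(A, n, m));
    return 0;
}
```

In this sketch all remaining columns are updated independently, one thread block per column, which is where the parallelism comes from. The loop over k stays sequential on the host because each step depends on the previous one.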
Because its computational cost is very small, the scalar product on its own is not well suited for porting to a GPU. However, if a more complex algorithm that requires a scalar product is ported to a GPU, the scalar product has to be ported as well. Otherwise, the data would have to be transferred back to the CPU, which is even more time consuming.
This example demonstrates several possible strategies (two of which deliberately do not work) to show the pitfalls of massively parallel computation and how to avoid them. It also shows the use of atomic operations on newer devices, although no significant speedup is observed by using them.
Code example scalar product (17 KB)
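As an illustration of one strategy that does work, here is a minimal sketch. It is not taken from the downloadable example. It assumes single precision, non-empty vectors, and a device that supports atomicAdd on float (compute capability 2.0 or later). Each block reduces its partial sums in shared memory and then adds one value to the global result atomically. The names dotKernel and dotOnDevice are made up for this illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // block size; must be a power of two

__global__ void dotKernel(const float* x, const float* y, float* result, int n) {
    __shared__ float buf[THREADS];

    // Each thread accumulates a private partial sum (grid-stride loop).
    // Note: letting every thread execute "*result += x[i] * y[i]" directly
    // would be a data race and would give wrong results.
    float part = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        part += x[i] * y[i];

    // Tree reduction of the partial sums within the block.
    buf[threadIdx.x] = part;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic update per block combines the block results.
    if (threadIdx.x == 0) atomicAdd(result, buf[0]);
}

// Host wrapper: copies the vectors to the device and returns the scalar product.
float dotOnDevice(const std::vector<float>& x, const std::vector<float>& y) {
    int n = (int)x.size();  // assumes n > 0 and x.size() == y.size()
    size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr, *dr = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dr, sizeof(float));
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dr, 0, sizeof(float));  // the accumulator must start at zero

    int blocks = (n + THREADS - 1) / THREADS;
    if (blocks > 64) blocks = 64;      // arbitrary cap; the stride loop covers the rest
    dotKernel<<<blocks, THREADS>>>(dx, dy, dr, n);

    float r = 0.0f;
    cudaMemcpy(&r, dr, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dr);
    return r;
}

int main() {
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    printf("dot = %f\n", dotOnDevice(x, y));
    return 0;
}
```

In an algorithm that already runs on the GPU, the vectors would stay in device memory and only the kernel launch would remain. The host-to-device copies in this sketch exist only to make it self-contained.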