=Gram-Schmidt method=
 
This example implements the modified Gram-Schmidt orthogonalization method for fully occupied matrices. When used for QR decomposition, the Gram-Schmidt method is less efficient than Householder reflections or Givens rotations. As a method to orthogonalize sets of vectors that do not span the whole vector space, however, it is still the best approach and is widely used.

Different implementations are provided for both GPU and CPU; for details, see [http://link.springer.com/article/10.1140/epjst/e2012-01638-7 T. Brandes, A. Arnold, T. Soddemann and D. Reith, Eur. Phys. J. ST 210:73-88 (2012)].

{{Download|GramSchmidt.tar.gz|Code example Gram-Schmidt method|tgz}}
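
The following is a minimal sketch of how the modified Gram-Schmidt iteration maps onto CUDA. It is not the code from the package above; the kernel names, the column-major matrix layout and the sizes are chosen purely for illustration. Column j is first normalized, then one thread block per remaining column removes its component along column j, using a shared-memory reduction for the dot product.

<syntaxhighlight lang="cuda">
// Minimal modified Gram-Schmidt sketch; matrix stored column-major on the device.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define N 256        // number of rows
#define M 16         // number of columns (vectors to orthogonalize)
#define THREADS 128  // power of two, required by the reductions below

// Normalize column j; a single block suffices for this sketch.
__global__ void normalize(float *a, int n, int j)
{
    __shared__ float partial[THREADS];
    float *qj = a + (size_t)j * n;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += qj[i] * qj[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float inv = rsqrtf(partial[0]);
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        qj[i] *= inv;
}

// Remove the component along the (already normalized) column j from
// column k = j + 1 + blockIdx.x.
__global__ void project_out(float *a, int n, int j)
{
    __shared__ float partial[THREADS];
    const float *qj = a + (size_t)j * n;
    float *ak = a + (size_t)(j + 1 + blockIdx.x) * n;

    // block-wide dot product <qj, ak>
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += qj[i] * ak[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float dot = partial[0];

    // subtract the projection
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        ak[i] -= dot * qj[i];
}

int main()
{
    std::vector<float> h(N * M);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = (float)rand() / RAND_MAX;   // random test matrix

    float *d;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    // modified Gram-Schmidt: normalize column j, then project it out of all later columns
    for (int j = 0; j < M; ++j) {
        normalize<<<1, THREADS>>>(d, N, j);
        if (j < M - 1)
            project_out<<<M - 1 - j, THREADS>>>(d, N, j);
    }

    cudaMemcpy(h.data(), d, h.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // quick check: the first two columns should now be orthonormal
    float dot01 = 0.0f;
    for (int i = 0; i < N; ++i)
        dot01 += h[i] * h[N + i];
    printf("q0 . q1 = %g\n", dot01);
    return 0;
}
</syntaxhighlight>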
  
 
=Scalar product=
 
Due to its very small computational cost, the scalar product is certainly 'not' suited for porting to a GPU. However, if a complex algorithm requires a scalar product and the algorithm is ported to a GPU, it is necessary to also port the scalar product to the GPU. Otherwise, one would need to transfer data back to the CPU, which is even more time-consuming.

This example demonstrates several possible strategies. It also serves as a good example for the use of atomic operations, and for how they can be avoided, e.g. on older hardware. Some of the examples deliberately don't work, in order to show the pitfalls of massively parallel computation and to demonstrate that atomic operations are really necessary. Note that atomic operations are rather slow, so in this example no speed-up is gained by using them. However, they make the code considerably shorter and easier to read, which is important in a scientific environment, where code is continuously developed further.

{{Download|ScalarGPU.tar.gz|Code example scalar product|tgz}}
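
As an illustration of the atomic-operation strategy mentioned above, here is a minimal sketch (again not the code from the package; names and sizes are made up): each block first reduces its partial sums in shared memory, so that only one atomicAdd per block touches the final result. atomicAdd on single-precision floats requires a device of compute capability 2.0 or higher (compile with nvcc -arch=sm_20 or newer).

<syntaxhighlight lang="cuda">
// Minimal scalar product sketch using a block-wise shared memory reduction
// followed by one atomicAdd per block.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // power of two, required by the reduction below

__global__ void scalar_product(const float *a, const float *b, float *result, int n)
{
    __shared__ float partial[THREADS];

    // grid-stride loop: each thread accumulates its share of the products
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // one atomic update per block (compute capability >= 2.0 for float atomicAdd)
    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);   // known result: 2 * n

    float *da, *db, *dres;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dres, sizeof(float));
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dres, 0, sizeof(float));

    scalar_product<<<128, THREADS>>>(da, db, dres, n);

    float result;
    cudaMemcpy(&result, dres, sizeof(float), cudaMemcpyDeviceToHost);
    printf("scalar product = %g (expected %g)\n", result, 2.0f * n);

    cudaFree(da); cudaFree(db); cudaFree(dres);
    return 0;
}
</syntaxhighlight>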

[http://mathema.tician.de/software/pycuda PyCUDA] is an extremely powerful Python extension that not only allows using CUDA code from Python, but can also do just-in-time kernel compilation for you and lets you write code similar to numpy, except that it is executed on a GPU, and is therefore much faster. This example again calculates the scalar product, this time in Python. Thanks to PyCUDA it is as fast as the plain CUDA code (or even faster, using GPUArray).

{{Download|PyScalar.tar.gz|Code example scalar product - in Python|tgz}}
 
=Ideal gas with direct OpenGL visualization=

This example uses the CUDA-OpenGL binding to render an ideal gas in a rotating box. An OpenGL vertex buffer is written directly from CUDA, which runs the ideal gas as a very simple kernel. Of course, the same binding could also be used to visualize more complex data.

{{Download|CUDA-OpenGL.tar.gz|Code example CUDA-OpenGL bindings|tgz}}
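
The per-frame pattern of the CUDA-OpenGL binding looks roughly like the following sketch. This is not the package code but an illustration of the interoperability API: it assumes that an OpenGL context and a vertex buffer object holding float4 positions have already been created (e.g. with GLUT), and that the particle velocities live in a separate CUDA array.

<syntaxhighlight lang="cuda">
// Sketch of writing an OpenGL vertex buffer directly from a CUDA kernel.
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Very simple "ideal gas": free flight plus reflection at the walls of the
// unit box [-1,1]^3. Positions go straight into the mapped vertex buffer.
__global__ void ideal_gas_step(float4 *pos, float3 *vel, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 v = vel[i];
    float4 p = pos[i];
    p.x += v.x * dt; p.y += v.y * dt; p.z += v.z * dt;

    if (p.x < -1.f || p.x > 1.f) v.x = -v.x;
    if (p.y < -1.f || p.y > 1.f) v.y = -v.y;
    if (p.z < -1.f || p.z > 1.f) v.z = -v.z;

    pos[i] = p;
    vel[i] = v;
}

cudaGraphicsResource *vbo_resource;

// called once, after the VBO has been created with glGenBuffers/glBufferData
void register_vbo(unsigned int vbo)
{
    cudaGraphicsGLRegisterBuffer(&vbo_resource, vbo, cudaGraphicsRegisterFlagsWriteDiscard);
}

// called once per frame, before the buffer is drawn with glDrawArrays
void update_particles(float3 *vel, int n, float dt)
{
    float4 *pos;
    size_t bytes;
    cudaGraphicsMapResources(1, &vbo_resource, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&pos, &bytes, vbo_resource);

    ideal_gas_step<<<(n + 255) / 256, 256>>>(pos, vel, n, dt);

    // unmap so that OpenGL may use the buffer again for rendering
    cudaGraphicsUnmapResources(1, &vbo_resource, 0);
}
</syntaxhighlight>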

This example also exists as a Python implementation, which requires both PyOpenGL and PyCUDA. It is written for the currently downloadable version of PyCUDA (0.94.2), but doesn't work with the current development version. The necessary changes should be simple to figure out.

{{Download|PyGL.tar.gz|Code example CUDA-OpenGL bindings - in Python|tgz}}
 
=Gauss elimination=

This example performs the classical Gauss elimination with back substitution to solve a linear system of equations. It does not perform pivoting, but serves as a simple example of shared memory use.

{{Download|gausseli.tar.gz|Code example Gauss elimination in CUDA|tgz}}
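
A minimal sketch of the shared-memory idea follows (not the package code; layout and names are made up): for every pivot row k, one kernel launch is issued, and each thread block eliminates one row below the pivot, with the pivot row cached in shared memory. Back substitution is done on the host here for brevity.

<syntaxhighlight lang="cuda">
// Gauss elimination sketch without pivoting; the augmented n x (n+1) matrix is
// stored row-major. One thread block per row below the current pivot row.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128

__global__ void eliminate(float *a, int n, int k)
{
    extern __shared__ float pivot[];   // pivot row k, n+1 columns
    int row = k + 1 + blockIdx.x;      // this block's row, always below the pivot

    // cooperatively load the pivot row into shared memory
    for (int j = threadIdx.x; j <= n; j += blockDim.x)
        pivot[j] = a[k * (n + 1) + j];
    __syncthreads();

    // elimination factor for this row, computed once per block
    __shared__ float factor;
    if (threadIdx.x == 0)
        factor = a[row * (n + 1) + k] / pivot[k];
    __syncthreads();

    // subtract factor times the pivot row from this row
    for (int j = k + threadIdx.x; j <= n; j += blockDim.x)
        a[row * (n + 1) + j] -= factor * pivot[j];
}

int main()
{
    const int n = 3;
    // augmented matrix of  2x + y - z = 8,  -3x - y + 2z = -11,  -2x + y + 2z = -3
    float a[n * (n + 1)] = {  2,  1, -1,   8,
                             -3, -1,  2, -11,
                             -2,  1,  2,  -3 };

    float *da;
    cudaMalloc(&da, sizeof(a));
    cudaMemcpy(da, a, sizeof(a), cudaMemcpyHostToDevice);

    // forward elimination on the GPU, one kernel launch per pivot row
    for (int k = 0; k < n - 1; ++k)
        eliminate<<<n - 1 - k, THREADS, (n + 1) * sizeof(float)>>>(da, n, k);

    cudaMemcpy(a, da, sizeof(a), cudaMemcpyDeviceToHost);
    cudaFree(da);

    // back substitution on the host
    float x[n];
    for (int i = n - 1; i >= 0; --i) {
        float s = a[i * (n + 1) + n];
        for (int j = i + 1; j < n; ++j)
            s -= a[i * (n + 1) + j] * x[j];
        x[i] = s / a[i * (n + 1) + i];
    }
    printf("x = (%g, %g, %g), expected (2, 3, -1)\n", x[0], x[1], x[2]);
    return 0;
}
</syntaxhighlight>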
