Wednesday, January 12, 2011

XORWOW L'Ecuyer TestU01 Results

Nvidia uses the XorWow random number generator in its CURAND library. It is a simple and fast generator with a reasonably long period, and it can be parallelized relatively easily. Nvidia suggests that it passes L'Ecuyer's TestU01 batteries, but is not very explicit about it, so I decided to see for myself how it performs on TestU01.
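For reference, here is a sketch of Marsaglia's xorwow generator from his 2003 "Xorshift RNGs" paper, on which CURAND's XORWOW is based. The seed values are the illustrative ones from the paper, not CURAND's defaults, and a 32-bit unsigned int is assumed.

/* Marsaglia's xorwow: a five-word xorshift generator combined with a
   Weyl sequence (the counter d). Period 2^192 - 2^32. Any seed where
   the five xorshift words are not all zero is valid. */
static unsigned int x = 123456789, y = 362436069, z = 521288629,
                    w = 88675123, v = 5783321, d = 6615241;

unsigned int xorwow(void) {
    unsigned int t = x ^ (x >> 2);
    x = y; y = z; z = w; w = v;
    v = (v ^ (v << 4)) ^ (t ^ (t << 1));
    d += 362437;
    return d + v;
}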

I found it very simple to test a new random number generator with TestU01: the documentation is great and the examples are helpful. Basically, there is just a simple C file to create, compile, and run.
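For example, a harness along the following lines is enough to run a battery; it assumes the xorwow() function sketched above, and uses TestU01's unif01_CreateExternGenBits to wrap any function returning 32-bit integers.

#include "unif01.h"
#include "bbattery.h"

extern unsigned int xorwow(void); /* the generator under test */

int main(void) {
    /* Wrap the raw 32-bit generator so the batteries can consume it. */
    unif01_Gen *gen = unif01_CreateExternGenBits("Xorwow", xorwow);
    bbattery_SmallCrush(gen); /* or bbattery_Crush / bbattery_BigCrush */
    unif01_DeleteExternGenBits(gen);
    return 0;
}

Then link against the TestU01 libraries (typically -ltestu01 -lprobdist -lmylib -lm, depending on the install) and run.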

XORWOW passes the SmallCrush battery, and according to Marsaglia it also passes DieHard. But it fails 1 test in Crush and 3 in BigCrush. By comparison, the Mersenne Twister fails 2 tests in Crush and 2 in BigCrush. Here are the results of the failures:


========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  144
 Total CPU time:   00:50:15.92
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 72  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  160
 Total CPU time:   06:12:38.79
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  7  CollisionOver, t = 7            1.6e-6
 27  SimpPoker, r = 27                eps 
 81  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

Monday, January 03, 2011

The CUDA Performance Myth

There is an interesting article on how to efficiently compute the inverse of the normal cumulative distribution function on the GPU. This is useful for Monte Carlo simulations based on normally distributed variables: feeding uniform random numbers through the inverse CDF produces normal variates (the inverse transform method).
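To make the connection concrete, here is a minimal sketch of the inverse transform method in C. The Newton-based norm_inv below is only a simple stand-in for illustration, not a production algorithm like AS241, and rand() is used purely for brevity.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Standard normal CDF, via erfc from math.h. */
static double norm_cdf(double x) { return 0.5 * erfc(-x / sqrt(2.0)); }

/* Inverse normal CDF by Newton's method: a slow but simple stand-in
   for a real algorithm such as Wichura's AS241. */
static double norm_inv(double u) {
    double x = 0.0;
    for (int i = 0; i < 100; i++) {
        double pdf = exp(-0.5 * x * x) / sqrt(2.0 * M_PI);
        double step = (norm_cdf(x) - u) / pdf;
        x -= step;
        if (fabs(step) < 1e-12) break;
    }
    return x;
}

int main(void) {
    /* Inverse transform: uniform draws in (0,1) become normal draws.
       Check the sample mean and variance as a sanity test. */
    int n = 1000000;
    double sum = 0.0, sum2 = 0.0;
    for (int i = 0; i < n; i++) {
        double u = (rand() + 0.5) / (RAND_MAX + 1.0);
        double z = norm_inv(u);
        sum += z; sum2 += z * z;
    }
    double mean = sum / n;
    printf("mean=%.4f variance=%.4f\n", mean, sum2 / n - mean * mean);
    return 0;
}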

Another result of the paper is a method (the breakless algorithm) that apparently computes it faster than Wichura's very good AS241 algorithm on the CPU as well, while keeping similar precision. The key is to avoid branches (if-then) at the cost of extra log() calls. As the algorithm is very simple, I decided to give it a try in Java.

Unfortunately, I found that in Java, on 64-bit machines, the breakless algorithm is actually around twice as slow. Intrigued, I tried the same in Visual C++, and found this time it was 1.5x slower. Then I tried gcc and found it was 10% faster... but the total time was suspiciously high, because I had not applied any optimization in the gcc compilation. With the -O3 flag, AS241 became much faster and we were back at the Java result: the breakless algorithm was twice as slow. The first takeaway is that JITed Java code is as fast as optimized compiled C++ code. The second is that the authors probably did not think of compiling with an optimization flag.
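For reference, this is the kind of micro-benchmark I mean, here in C. The two implementations are toy stand-ins whose accuracy is irrelevant; they merely have the right call structure (one branch-based, one using only log() calls). For the real comparison, plug in AS241 and the paper's breakless algorithm, and build once without flags and once with -O3.

#include <math.h>
#include <stdio.h>
#include <time.h>

/* Toy stand-ins so the harness links: accuracy does not matter here. */
static double inv_branchy(double u) {
    return (u < 0.5) ? -sqrt(-2.0 * log(u)) : sqrt(-2.0 * log(1.0 - u));
}
static double inv_breakless(double u) {
    return 0.6266 * log(u / (1.0 - u)); /* crude logistic-style proxy */
}

static void bench(const char *name, double (*f)(double), int n) {
    clock_t start = clock();
    double acc = 0.0; /* accumulate so the calls are not optimized away */
    for (int i = 1; i < n; i++)
        acc += f(i / (double) n);
    double elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;
    printf("%s: acc=%f time=%.3fs\n", name, acc, elapsed);
}

int main(void) {
    int n = 10 * 1000 * 1000;
    bench("branchy", inv_branchy, n);
    bench("breakless", inv_breakless, n);
    return 0;
}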

Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double-precision multithreaded CPU program. This is comparing an Nvidia GT330m against a Core i5 520m, and float precision against double precision, on a naturally parallel problem. That is very far from the usually announced 80x speedup. Of course, if one compares against a single-threaded gcc build with no optimization, one might reach 50x, but that is not a realistic comparison at all. I have heard that double precision is 8x slower than float on the GPU; the difference then becomes quite small. Apparently Fermi cards are much faster, but unfortunately I don't have one, and even then I would not expect much better than a 10x speedup. That is good, but very far from the usually advertised speedup.
