Wednesday, December 21, 2011

Quant Interview & Education

Recently, I interviewed someone for a quant position. I was very surprised to find out that someone who did one of the best master in probabilities and finance in France could not solve a very basic probability problem:

This is accessible to someone with very little knowledge of probabilities

When I asked this problem around to co-workers (who have all at least a master in a scientific subject), very few could actually answer it properly. Most of the time, I suspect it is because they did not dedicate enough time to do it properly, and wanted to answer it too quickly.

It was more shocking that someone just out of school, with a major in probabilities could not answer that properly. It raises the question: what is all this education worth?

The results were not better as soon as the question was not exactly like what students in those masters are used to, like for example, this simple stochastic calculus question:


My opinion is that, today in our society, people study for too long. The ideal system for me would be one where people learn a lot in math/physics the first 2 years of university, and then have more freedom in their education, much like a doctorate.

We still offered the job to this person, because live problem solving is not the most important criteria. Other qualities like seriousness and motivation are much more valuable.

Quant Interview & Education

Recently, I interviewed someone for a quant position. I was very surprised to find out that someone who did one of the best master in probabilities and finance in France could not solve a very basic probability problem:

This is accessible to someone with very little knowledge of probabilities

When I asked this problem around to co-workers (who have all at least a master in a scientific subject), very few could actually answer it properly. Most of the time, I suspect it is because they did not dedicate enough time to do it properly, and wanted to answer it too quickly.

It was more shocking that someone just out of school, with a major in probabilities could not answer that properly. It raises the question: what is all this education worth?

The results were not better as soon as the question was not exactly like what students in those masters are used to, like for example, this simple stochastic calculus question:


My opinion is that, today in our society, people study for too long. The ideal system for me would be one where people learn a lot in math/physics the first 2 years of university, and then have more freedom in their education, much like a doctorate.

We still offered the job to this person, because live problem solving is not the most important criteria. Other qualities like seriousness and motivation are much more valuable.

Wednesday, November 30, 2011

Gnome 3 not so crap after all

In a previous post, I was complaining how bad Gnome 3 was. Since I have installed a real dock: docky, it is now much more usable. I can easily switch / launch applications without an annoying full screen change.

In addition I found out that it had a good desktop search (tracker). The ALT+F2 also does some sort of completion, too bad it can not use tracker here as well.

So it looks like Gnome 3 + gnome-tweak-tool + docky makes a reasonably good desktop. XFCE does not really fit the bill for me: bad handling of sound, bad default applications, not so good integration with gnome application notifications.



Now if only I found a way to change this ugly big white scrollbar...

Gnome 3 not so crap after all

In a previous post, I was complaining how bad Gnome 3 was. Since I have installed a real dock: docky, it is now much more usable. I can easily switch / launch applications without an annoying full screen change.

In addition I found out that it had a good desktop search (tracker). The ALT+F2 also does some sort of completion, too bad it can not use tracker here as well.

So it looks like Gnome 3 + gnome-tweak-tool + docky makes a reasonably good desktop. XFCE does not really fit the bill for me: bad handling of sound, bad default applications, not so good integration with gnome application notifications.



Now if only I found a way to change this ugly big white scrollbar...

Thursday, November 17, 2011

Good & Popular Algorithms are Simple

I recently tried to minimise a function according to some constraints. One popular method to minimise a function in several dimensions is Nelder-Mead Simplex. It is quite simple, so simple that I programmed it in Java in 1h30, including a small design and a test. It helped that the original paper from Nelder-Mead is very clear:
However the main issue is that it works only for unconstrained problems. Nelder and Mead suggested to add a penalty, but in practice this does not work so well. For constrained problems, there is an adaptation of the original idea by Box (incredible name for a constrained method) that he called the Complex method, a deliberate pun to the Nelder-Mead Simplex method. The basic idea is to reset the trial point near the fixed boundary if it goes outside. Now this took me much longer to program, as the paper is not as clear, even if the method is still relatively simple. But worst, after a day of deciphering the paper and programming the complex method, I find out that it does not works so well: it does not manage to minimise a simple multidimensional quadratic with a simple bound constraint: f(X)=sum(X)^2 with X >= 0 where X is an N-dimensional vector. In the end, I don't want to work with such a simple function, but it is a good simple test to see if the method is really working or not. Here is how the function looks with N=2:
And if we restrict to x,y >= 0 it becomes:

I suspected an error in my program, so I decided to try with scilab, that has also the Box method as part of their neldermead_search functionality. Scilab also failed to compute the minimum in 8 dimensions of my simple quadratic.
I tried various settings, without ever obtaining a decent result (I expect to find a value near 0).

There is another algorithm that can find a global minimum, also very popular: the Differential Evolution. At first, being a genetic algorithm, you would think it would be complicated to write. But no, the main loop is around 20 lines.

Those 20 lines are a bit more difficult than Nelder-Mead, but still, when I saw that, I understood that "this is a classic algorithm". And it does work with constraints easily. How to do this is explained well in K. Price book "Differential Evolution", and takes only a few lines of code. Here is the result I got in dimension 29:
dim=29 min=1.2601854176573729E-12 genMax=412 x=[3.340096901536317E-8, 8.889252404343621E-8, 7.163904251807348E-10, 9.71847877381699E-9, 2.7423324674150668E-8, 2.4022269439114537E-9, 1.7336434478718816E-11, 7.244238163901778E-9, 1.0013136274729337E-8, 7.412154679865083E-9, 5.4694460144807974E-9, 2.3682413086417524E-9, 4.241739250073559E-7, 4.821920889534676E-10, 2.115396281722523E-9, 8.750883007882899E-8, 2.512011485133975E-9, 4.811507109129279E-9, 1.0752997894113096E-7, 5.120475258343548E-9, 8.404448964497456E-9, 4.1062290228305595E-9, 1.7030766521603753E-8, 5.589430643552073E-9, 8.237098544820173E-10, 3.5796523161196554E-9, 5.186299547466997E-9, 2.430326342762937E-7, 5.493850433494286E-9]

It works well, although there is quite a bit of parameters. I noticed that the strategy was especially important.

Good & Popular Algorithms are Simple

I recently tried to minimise a function according to some constraints. One popular method to minimise a function in several dimensions is Nelder-Mead Simplex. It is quite simple, so simple that I programmed it in Java in 1h30, including a small design and a test. It helped that the original paper from Nelder-Mead is very clear:
However the main issue is that it works only for unconstrained problems. Nelder and Mead suggested to add a penalty, but in practice this does not work so well. For constrained problems, there is an adaptation of the original idea by Box (incredible name for a constrained method) that he called the Complex method, a deliberate pun to the Nelder-Mead Simplex method. The basic idea is to reset the trial point near the fixed boundary if it goes outside. Now this took me much longer to program, as the paper is not as clear, even if the method is still relatively simple. But worst, after a day of deciphering the paper and programming the complex method, I find out that it does not works so well: it does not manage to minimise a simple multidimensional quadratic with a simple bound constraint: f(X)=sum(X)^2 with X >= 0 where X is an N-dimensional vector. In the end, I don't want to work with such a simple function, but it is a good simple test to see if the method is really working or not. Here is how the function looks with N=2:
And if we restrict to x,y >= 0 it becomes:

I suspected an error in my program, so I decided to try with scilab, that has also the Box method as part of their neldermead_search functionality. Scilab also failed to compute the minimum in 8 dimensions of my simple quadratic.
I tried various settings, without ever obtaining a decent result (I expect to find a value near 0).

There is another algorithm that can find a global minimum, also very popular: the Differential Evolution. At first, being a genetic algorithm, you would think it would be complicated to write. But no, the main loop is around 20 lines.

Those 20 lines are a bit more difficult than Nelder-Mead, but still, when I saw that, I understood that "this is a classic algorithm". And it does work with constraints easily. How to do this is explained well in K. Price book "Differential Evolution", and takes only a few lines of code. Here is the result I got in dimension 29:
dim=29 min=1.2601854176573729E-12 genMax=412 x=[3.340096901536317E-8, 8.889252404343621E-8, 7.163904251807348E-10, 9.71847877381699E-9, 2.7423324674150668E-8, 2.4022269439114537E-9, 1.7336434478718816E-11, 7.244238163901778E-9, 1.0013136274729337E-8, 7.412154679865083E-9, 5.4694460144807974E-9, 2.3682413086417524E-9, 4.241739250073559E-7, 4.821920889534676E-10, 2.115396281722523E-9, 8.750883007882899E-8, 2.512011485133975E-9, 4.811507109129279E-9, 1.0752997894113096E-7, 5.120475258343548E-9, 8.404448964497456E-9, 4.1062290228305595E-9, 1.7030766521603753E-8, 5.589430643552073E-9, 8.237098544820173E-10, 3.5796523161196554E-9, 5.186299547466997E-9, 2.430326342762937E-7, 5.493850433494286E-9]

It works well, although there is quite a bit of parameters. I noticed that the strategy was especially important.

Wednesday, August 17, 2011

Gnome 3, Unity, Crap

After the upgrate to Ubuntu 11.04 I was directly on Unity. Having a dock on the left side is nice, but unfortunately, it has various bugs which makes it sometimes annoying. Also the menu on top like Mac Os X is not a bad idea, but it breaks many apps (for example Picasa). Then there is the scrollbar insanity, it's almost impossible to click on to scroll, and again breaks in some apps (for example Eclipse). Fortunately one can disable all of that and go back to the old Gnome 2.

Then I decided to try Gnome 3 the past 2 weeks. I was curious that Linus found it so bad. At first it looked nice, probably because it's different. But very quickly, it becomes annoying: the full screen changed just to switch window with the mouse is a stupid idea. And then installing Gnome 3 in Ubuntu is not such a great idea: Rhythmbox does not work well anymore with it, all my "file open" mapping was gone and replaced with garbage: why does it propose GIMP to open a PDF is beyond me.

Gnome 2 + a nice standard dock + widgets (or the nice unity indicators) + an indexer would have been perfect. Now I am back to KDE 4 which provides all of that, even if it feels a bit bloated.


Gnome 3, Unity, Crap

After the upgrate to Ubuntu 11.04 I was directly on Unity. Having a dock on the left side is nice, but unfortunately, it has various bugs which makes it sometimes annoying. Also the menu on top like Mac Os X is not a bad idea, but it breaks many apps (for example Picasa). Then there is the scrollbar insanity, it's almost impossible to click on to scroll, and again breaks in some apps (for example Eclipse). Fortunately one can disable all of that and go back to the old Gnome 2.

Then I decided to try Gnome 3 the past 2 weeks. I was curious that Linus found it so bad. At first it looked nice, probably because it's different. But very quickly, it becomes annoying: the full screen changed just to switch window with the mouse is a stupid idea. And then installing Gnome 3 in Ubuntu is not such a great idea: Rhythmbox does not work well anymore with it, all my "file open" mapping was gone and replaced with garbage: why does it propose GIMP to open a PDF is beyond me.

Gnome 2 + a nice standard dock + widgets (or the nice unity indicators) + an indexer would have been perfect. Now I am back to KDE 4 which provides all of that, even if it feels a bit bloated.


Sunday, August 14, 2011

Carmack & GPGPU programming

Finally someone who shares the same opinion on the current state of GPGPU programming.

John Carmack: On the other hand, we have converted all of our offline processing stuff to ray tracing. For years, the back-end MegaTexture generation for Rage was done with... we had a GPGPU cluster with NVIDIA cards and it was such a huge pain to keep. It was an amazing pain where one system would be having heat problems and would be behaving weird even though we thought they had identical drivers. Something would always be wrong with render farm 12, and whenever we wanted to put in new features it was like “Okay, writing new fragment programs to go into this.” Now, granted I did this just when CUDA was in its infancy. If I did re-implement it with OpenCL or CUDA we wouldn’t have some of these problems, but when I converted all these over to ray tracing there was a number of things that got a lot better. Things that we deal with, [for example] shadows and reflections that have to be approximated, and were so used to doing with rasterization... we sometimes forget how big of hacks these are. To be able to say I really just want that ray, and tell me what it hit; not do a projection with feathering shadowed edges and whatever the heck else we’re doing there, so much of the code got so much easier. If it’s a choice of... now that we have these awesome multi-core x86 CPUs where we can get 24 threads in commodity boxes... it’s true that one GPU card can do more ray tracing than one 24 thread x86 box, but it’s not multiples more and if it’s just a matter of buying more $2000 boxes, it makes the development, maintenance, and upkeep much better. While everyone in high performance computing is all “rah-rah” GPUs right now, I’ve come full circle back around to saying the fact that we can get massive amounts of x86 cores and threads... it wont win on FLOPS/watt or FLOPS/volume, but in terms of results per developer hour it is much, much better.

Carmack & GPGPU programming

Finally someone who shares the same opinion on the current state of GPGPU programming.

John Carmack: On the other hand, we have converted all of our offline processing stuff to ray tracing. For years, the back-end MegaTexture generation for Rage was done with... we had a GPGPU cluster with NVIDIA cards and it was such a huge pain to keep. It was an amazing pain where one system would be having heat problems and would be behaving weird even though we thought they had identical drivers. Something would always be wrong with render farm 12, and whenever we wanted to put in new features it was like “Okay, writing new fragment programs to go into this.” Now, granted I did this just when CUDA was in its infancy. If I did re-implement it with OpenCL or CUDA we wouldn’t have some of these problems, but when I converted all these over to ray tracing there was a number of things that got a lot better. Things that we deal with, [for example] shadows and reflections that have to be approximated, and were so used to doing with rasterization... we sometimes forget how big of hacks these are. To be able to say I really just want that ray, and tell me what it hit; not do a projection with feathering shadowed edges and whatever the heck else we’re doing there, so much of the code got so much easier. If it’s a choice of... now that we have these awesome multi-core x86 CPUs where we can get 24 threads in commodity boxes... it’s true that one GPU card can do more ray tracing than one 24 thread x86 box, but it’s not multiples more and if it’s just a matter of buying more $2000 boxes, it makes the development, maintenance, and upkeep much better. While everyone in high performance computing is all “rah-rah” GPUs right now, I’ve come full circle back around to saying the fact that we can get massive amounts of x86 cores and threads... it wont win on FLOPS/watt or FLOPS/volume, but in terms of results per developer hour it is much, much better.

Friday, April 08, 2011

exp(y*log(x)) Much Faster than Math.pow(x,y)

Today I found out that replacing Math.pow(x,y) by Math.exp(y*Math.log(x)) made me gain 50% performance in my program. Of course, both x and y are double in my case. I find this quite surprising, I expected better from Math.pow.

exp(y*log(x)) Much Faster than Math.pow(x,y)

Today I found out that replacing Math.pow(x,y) by Math.exp(y*Math.log(x)) made me gain 50% performance in my program. Of course, both x and y are double in my case. I find this quite surprising, I expected better from Math.pow.

Wednesday, April 06, 2011

Best Headphones Ever?

I just miraculously received some AKG Q701 headphones. I never tried really high end headphones before those. I was used to my Sennheiser HD555, which I thought were quite good already.

I always thought there was not much difference between let's say a $100 headphone and a $500 one, and that only real audiophiles would be able notice the difference: people like Jake who can distinguish artefacts in 320kbps mp3, but not people like me. I was wrong. Those headphones make a real difference, everything seems very clear, and it doesn't feel like listening music through headphones at all, it is much closer to a good hi-fi stereo sound on the ear.

Best Headphones Ever?

I just miraculously received some AKG Q701 headphones. I never tried really high end headphones before those. I was used to my Sennheiser HD555, which I thought were quite good already.

I always thought there was not much difference between let's say a $100 headphone and a $500 one, and that only real audiophiles would be able notice the difference: people like Jake who can distinguish artefacts in 320kbps mp3, but not people like me. I was wrong. Those headphones make a real difference, everything seems very clear, and it doesn't feel like listening music through headphones at all, it is much closer to a good hi-fi stereo sound on the ear.

Saturday, February 05, 2011

SIMD and Mersenne-Twister

Since 2007, there is a new kind of Mersenne-Twister (MT) that exploits SIMD architecture, the SFMT. The Mersenne-Twister has set quite a standard in random number generation for Monte-Carlo simulations, even though it has flaws.

I was wondering if SFMT improved the performance over MT for a Java implementation. There is actually on the same page a decent Java port of the original algorithm. When I ran it, it ended up slower by more than 20% than the classical Mersenne-Twister (32-bit) on a 64-bit JDK 1.6.0.23 for Windows.

This kind of result is not too unsurprising as the code is more complex. But this shows that Java is unable to use SIMD instructions properly, which is still a bit disappointing.

SIMD and Mersenne-Twister

Since 2007, there is a new kind of Mersenne-Twister (MT) that exploits SIMD architecture, the SFMT. The Mersenne-Twister has set quite a standard in random number generation for Monte-Carlo simulations, even though it has flaws.

I was wondering if SFMT improved the performance over MT for a Java implementation. There is actually on the same page a decent Java port of the original algorithm. When I ran it, it ended up slower by more than 20% than the classical Mersenne-Twister (32-bit) on a 64-bit JDK 1.6.0.23 for Windows.

This kind of result is not too unsurprising as the code is more complex. But this shows that Java is unable to use SIMD instructions properly, which is still a bit disappointing.

Wednesday, January 12, 2011

XORWOW L'ecuyer TestU01 Results

Nvidia uses XorWow random number generator in its CURAND library. It is a simple and fast random number generator with a reasonably long period. It can also be parallelized relatively easily. Nvidia suggests it passes L'Ecuyer TestU01, but is not very explicit about it. So I've decided to see how it performed on TestU01.

I found very simple to test a new random number generator on TestU01, the documentation is great and the examples helpful. Basically there is just a simple C file to create & compile & run.

XORWOW passes the SmallCrush battery, and according to Marsaglia, it also passes DieHard. But it fails on 1 test in Crush and 3 in BigCrush. By comparison, Mersenne-Twister fails on 2 tests in Crush and 2 in BigCrush. Here are the results of the failures:


========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  144
 Total CPU time:   00:50:15.92
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 72  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  160
 Total CPU time:   06:12:38.79
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  7  CollisionOver, t = 7            1.6e-6
 27  SimpPoker, r = 27                eps 
 81  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

XORWOW L'ecuyer TestU01 Results

Nvidia uses XorWow random number generator in its CURAND library. It is a simple and fast random number generator with a reasonably long period. It can also be parallelized relatively easily. Nvidia suggests it passes L'Ecuyer TestU01, but is not very explicit about it. So I've decided to see how it performed on TestU01.

I found very simple to test a new random number generator on TestU01, the documentation is great and the examples helpful. Basically there is just a simple C file to create & compile & run.

XORWOW passes the SmallCrush battery, and according to Marsaglia, it also passes DieHard. But it fails on 1 test in Crush and 3 in BigCrush. By comparison, Mersenne-Twister fails on 2 tests in Crush and 2 in BigCrush. Here are the results of the failures:


========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  144
 Total CPU time:   00:50:15.92
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 72  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  160
 Total CPU time:   06:12:38.79
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  7  CollisionOver, t = 7            1.6e-6
 27  SimpPoker, r = 27                eps 
 81  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

Monday, January 03, 2011

The CUDA Performance Myth

There is an interesting article on how to generate efficiently the inverse of the normal cumulative distribution on the GPU. This is useful for Monte-Carlo simulations based on normally distributed variables.

Another result of the paper is a method (breakless algorithm) to compute it apparently faster than the very good Wichura's AS241 algorithm on the CPU as well keeping a similar precision. The key is to avoid branches (if-then) at the cost of not avoiding log() calls. As the algorithm is very simple, I decided to give it a try in Java.

Unfortunately I found out that in Java, on 64 bit machines, the breakless algorithm is actually around twice slower. Intrigued, I tried in Visual C++ the same, and found this time it was 1.5x slower. Then I tried in gcc and found that it was 10% faster... But the total time was significant, because I had not applied any optimization in the gcc compilation. With -O3 flag AS241 became much faster and we were back at the Java result: the breakless algorithm was twice slower. The first result is that the JITed java code is as fast as optimized C++ compiled code. The second result is that the authors did not think of compiling with optimization flag.

Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double precision multithreaded CPU program. This is comparing Nvidia GT330m vs Core i5 520m and float precision vs double precision on a naturally parallel problem. This is very far from the usually announced x80 speedup. Of course if one compares the algorithms with GCC single threaded no optimization, we might attain x50, but this is not a realistic comparison at all. I have heard that double precision is 8x slower on the GPU when compared to float precision: the difference then becomes quite small. Apparently Fermi cards are much faster, unfortunately I don't have any. And still I would not expect much better than 10x speedup. This is good but very far from the usually advertised speedup.

The CUDA Performance Myth

There is an interesting article on how to generate efficiently the inverse of the normal cumulative distribution on the GPU. This is useful for Monte-Carlo simulations based on normally distributed variables.

Another result of the paper is a method (breakless algorithm) to compute it apparently faster than the very good Wichura's AS241 algorithm on the CPU as well keeping a similar precision. The key is to avoid branches (if-then) at the cost of not avoiding log() calls. As the algorithm is very simple, I decided to give it a try in Java.

Unfortunately I found out that in Java, on 64 bit machines, the breakless algorithm is actually around twice slower. Intrigued, I tried in Visual C++ the same, and found this time it was 1.5x slower. Then I tried in gcc and found that it was 10% faster... But the total time was significant, because I had not applied any optimization in the gcc compilation. With -O3 flag AS241 became much faster and we were back at the Java result: the breakless algorithm was twice slower. The first result is that the JITed java code is as fast as optimized C++ compiled code. The second result is that the authors did not think of compiling with optimization flag.

Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double precision multithreaded CPU program. This is comparing Nvidia GT330m vs Core i5 520m and float precision vs double precision on a naturally parallel problem. This is very far from the usually announced x80 speedup. Of course if one compares the algorithms with GCC single threaded no optimization, we might attain x50, but this is not a realistic comparison at all. I have heard that double precision is 8x slower on the GPU when compared to float precision: the difference then becomes quite small. Apparently Fermi cards are much faster, unfortunately I don't have any. And still I would not expect much better than 10x speedup. This is good but very far from the usually advertised speedup.