Thursday, January 15, 2009

The End Of Rings Around Plain Java - A Better Concurrency Test

In my previous post, I was wondering why the single-threaded version was faster. D Andreou gave the correct explanation: as we send only 1 start message, and as each node only sends 1 message to the next one, there is always only 1 message being processed at any given time. So the test is optimal on 1 thread. It does not make much sense to run a multithreading benchmark on a problem that is fundamentally single-threaded.


His suggestion was to simply send N start messages, where N >= the number of processors. In theory, performance should then become optimal with N threads. Unfortunately this is not what happens in practice: the single-threaded version is still faster, even when you send 16 messages on a dual-core machine.

    public static void main(String[] args) throws Exception {
        OptimizedRing ring = new OptimizedRing();
        RingNode node = ring.startRing(Integer.parseInt(args[0]));
        // 1 start message plus 3 extra tokens: 4 messages circulate around the ring
        node.sendMessage(new StartMessage());
        node.sendMessage(new TokenMessage(node.nodeId,1));
        node.sendMessage(new TokenMessage(node.nodeId,1));
        node.sendMessage(new TokenMessage(node.nodeId,1));
        // wait for the ring to finish processing
        ring.executor.awaitTermination(10, TimeUnit.MINUTES);
    }

My idea was that it was related to the overhead of switching from thread to thread, which is precisely what I think the original author of the test had in mind to measure. I am not 100% convinced that is really what's happening. I wanted a test that would actually be faster with N threads, so I decided to add a bit of computation before processing each Token. Unfortunately I had the bad idea of computing Pi by the Monte Carlo method to do that. Running my tests, I was surprised that it did not change the results, and that things got worse the more compute-intensive the work was (increasing the number of Monte Carlo iterations). It scared me a bit, wondering what the hell could be wrong there. The following class performs much worse with 2 threads than with 1:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BadParallelPi {

    private static void startExecutors() throws Exception {
        long startTime = System.currentTimeMillis();
        System.out.println(startTime);
        // run the 2 computations sequentially on a single thread
        ExecutorService executor1 = Executors.newFixedThreadPool(1);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        long delay = System.currentTimeMillis() - startTime;
        System.out.println("finished single thread in "+(delay/1000.0));
        startTime = System.currentTimeMillis();
        System.out.println(startTime);
        // run the same 2 computations in parallel on 2 threads
        executor1 = Executors.newFixedThreadPool(2);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        delay = System.currentTimeMillis() - startTime;
        System.out.println("finished 2 threads in "+(delay/1000.0));
    }

    public static class Computation implements Runnable {
        public volatile int count = 0;

        // estimate Pi by Monte Carlo: count random points falling inside the unit quarter circle
        private double computePi() {
            double pi = 0;
            double x,y;
            int n = 10000000;
            for (int i=0;i<n;i++) {
                x = Math.random();
                x *= x;
                y = Math.random();
                y *= y;
                if (x+y < 1) {
                    pi += 1;
                }
            }
            pi = 4*pi/n;
            return pi;
        }

        public void run() {
            double pi = computePi();
            long time = System.currentTimeMillis();
            System.out.println(time+" thread "+Thread.currentThread().getId()+" pi="+pi);
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        startExecutors();
    }
}

Did you figure out why?


It took me less time with this simple code than with the original ring test to find out why. It is simply because of the Math.random call: Math.random creates only one random number generator, which is shared among all threads. So each thread ends up waiting for the other at this point. Creating one random generator per thread finally showed 2 threads being much faster than 1.
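Here is a minimal sketch of what "one generator per thread" can look like (the GoodParallelPi and computePi names are mine for illustration, not the actual code of the test):

import java.util.Random;

public class GoodParallelPi {

    // one generator per thread: no more contention on the single generator behind Math.random()
    private static final ThreadLocal<Random> localRandom = new ThreadLocal<Random>() {
        protected Random initialValue() {
            return new Random();
        }
    };

    // same Monte Carlo estimation of Pi, but using the thread-local generator
    public static double computePi(int n) {
        Random random = localRandom.get();
        double pi = 0;
        for (int i = 0; i < n; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y < 1) {
                pi += 1;
            }
        }
        return 4 * pi / n;
    }
}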


Back to the original ring test. After adding the correct way to compute Pi by Monte Carlo, I finally got decent test results, as long as the number of iterations is not too small: 10 iterations is enough to show a real difference between N threads and 1. Adding a small computation helps figure out what happens behind the scenes. You can also verify D Andreou's claim: using only 1 start message, the single-threaded version is faster. If the computation is too light (for example a Monte Carlo iteration count of 0), one only measures method calls between threads (context switching), which is obviously optimal for 1 thread. Measuring Actor libraries on such a test is dangerous: if I write a single-threaded Actor library, it will be the fastest on this test, but it is certainly not what you want to use as an Actor library.
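To give an idea of the added work, the per-token computation amounts to something like the following inside each node (onToken, nextNode and ringCount are assumptions about the ring code from my previous post, not its actual API):

    // hypothetical sketch: do a small Monte Carlo computation before forwarding the token
    void onToken(TokenMessage message) {
        double pi = GoodParallelPi.computePi(10); // 10 Monte Carlo iterations per token, per-thread Random
        if (message.value < ringCount) {
            nextNode.sendMessage(new TokenMessage(nodeId, message.value + 1));
        }
    }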


Let's now see how Scala Actors fare compared to the plain Java solution, with the computation added:

Machine      Algorithm                   Time for 100000 ring count,   Time for 10000 ring count,
                                         10 mc, 4 messages             100 mc, 4 messages
-----------------------------------------------------------------------------------------------
Core2Duo     OptimizedRing 2 Threads     57s                           37s
Core2Duo     OptimizedRing 4 Threads     78s                           39s
Core2Duo     Scala Actors                82s                           47s
Core2Duo     SimpleRing (100 Threads)    137s                          58s
Core2Duo     OptimizedRing 1 Thread      89s                           71s
Core2Quad    OptimizedRing 4 Threads     81s                           25s
Core2Quad    Scala Actors                71s                           30s
Core2Quad    OptimizedRing 2 Threads     61s                           43s
Core2Quad    OptimizedRing 1 Thread      100s                          80s

The Core2Duo is Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz
The Core2Quad is Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

It is interesting to compare the results of 4 threads on a dual-core machine with a Monte Carlo count of 10 and of 100. We see a much higher thread overhead with less computation: with too little Monte Carlo computation, the overhead of the extra threads is too high compared to just running 2 concurrent threads. This explains why the very simple threading architecture fares much better in the last column than in the previous one.

Scala Actors fare much better when not hindered by the creation of too many threads. They actually seem very good at abstracting away multithreading intricacies, while still providing near-Java performance in the real world, where each actor does enough computation and multithreading is important.

2 comments:

  1. By the way, Java 7 will probably come with a way to create Random numbers efficiently in more than one thread.

  2. Yes. Here it is in its current form:

    ThreadLocalRandom

    Of course, it's functionally equivalent to a ThreadLocal<Random>, just easier to use, and it is a subclass of Random (so it can be injected into existing code to remove that contention).

    @Fabien, that was quite sneaky indeed. I had never noticed that Math.random introduces contention too, just as Random (I only now saw it is implemented over Random).

    Doug Lea also claimed that even some SPEC benchmarks suffered from this very oversight.
