Where is the Context Switch Effect in Java?

Hello!
If you run a lot of threads concurrently, you should see reduced performance due to more context switching.
Of course, if your tasks include waiting time you need more threads, according to the formula N*(1+WT/ST), where N is the number of CPUs, WT is the waiting time and ST is the service (execution) time, as it is written here. This sizing is meant to keep the system as a whole from having idle wait time!
So if you have just 1 CPU and tasks with no waiting, the best number of threads is exactly 1! It should be better to run the tasks sequentially on one thread than to give each task its own thread!
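To make the formula concrete, here is a tiny sketch (the helper name and the example numbers are mine, not values from the post):

```java
// Hypothetical helper illustrating the sizing formula: threads = N * (1 + WT/ST).
public class ThreadSizing {
    // cpus = N, waitTime = WT, serviceTime = ST (same time unit, e.g. ms)
    static int optimalThreads(int cpus, double waitTime, double serviceTime) {
        return (int) (cpus * (1 + waitTime / serviceTime));
    }

    public static void main(String[] args) {
        // 1 CPU, purely computational tasks (WT = 0) -> 1 thread
        System.out.println(optimalThreads(1, 0, 10));  // 1
        // 4 CPUs, tasks that wait 50 ms for every 10 ms of computation -> 24
        System.out.println(optimalThreads(4, 50, 10)); // 24
    }
}
```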
So I wanted to prove it in a JVM environment (under Windows XP), and I found that this is NOT TRUE! I got the best results with 50 threads!! Moreover, the performance is very similar when I run 1000 threads! So basically I can't find the context switch effect!

Here are some interesting points:
1) My performance indicator is the time it takes to run the whole test. I run the test with different numbers of threads executing concurrently. Naturally, whatever the number of threads, they have to do the same total amount of work, so the times are comparable; the workload is therefore divided by the number of running threads. [Detail: the workload with 1 thread is a for loop (with another nested for) (block M in the code) of N_ACCESS = 5*10*10*2 accesses. That means you can run the test with any number of threads that divides 1000 evenly, for example: 1, 2, 5, 10, 50, 100, 500, 1000.]

2) The test should start all the threads at the same time. In Java there isn't a direct way to do it. If you run a simple loop like

you influence the results, because that for loop takes 100 times longer when you run 100 threads than when you run just 1! So I developed a way to start all the threads at the same time. I put that loop inside a synchronized(ob) statement (block G in the code). Each thread starts and then blocks trying to acquire the lock on ob (block L in the code). The main thread waits 1 second and then releases the lock. After that, all the threads start from the same point!
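A minimal sketch of that lock-release "green light" (the names are mine; this is not the original block G/L code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockGate {
    static final Object gate = new Object();

    static int runWorkers(int n) throws InterruptedException {
        AtomicInteger released = new AtomicInteger();
        Thread[] workers = new Thread[n];
        synchronized (gate) {                  // main thread takes the lock first
            for (int i = 0; i < n; i++) {
                workers[i] = new Thread(() -> {
                    synchronized (gate) { }    // block here until main releases the lock
                    released.incrementAndGet(); // timed work would start here
                });
                workers[i].start();
            }
            Thread.sleep(200);                 // let every thread reach the gate
        }                                      // leaving the block releases the lock
        for (Thread t : workers) t.join();
        return released.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(10));    // 10
    }
}
```

Note one caveat of this construction: the waiting threads are not released literally simultaneously, since each one still has to acquire and release the monitor in turn; they are only released within a very short window of each other.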

3) To avoid run-time (just-in-time) optimization effects, I run a pre-test (blocks B and C in the code). This gives the JVM a chance to optimize the code before the real test runs, reducing the noise from optimization.
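A minimal warm-up pattern along those lines might look like this (the iteration counts are my guesses, not values from the test):

```java
// Warm up the workload so HotSpot compiles it, then time a single run.
public class Warmup {
    static long workload() {
        long s = 0;
        for (int i = 0; i < 10_000; i++) s += i;  // stands in for the code under test
        return s;
    }

    static long timedRun() {
        long sink = 0;
        for (int i = 0; i < 1_000; i++) sink += workload(); // warm-up: let the JIT kick in
        long t0 = System.nanoTime();
        sink += workload();                                  // the measured run
        long t1 = System.nanoTime();
        if (sink == 42) System.out.println(sink);            // keep the result live
        return t1 - t0;
    }

    public static void main(String[] args) {
        System.out.println("nanos: " + timedRun());
    }
}
```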

4) To also avoid the influence of the L1 and L2 caches, the test accesses an array of 4 MB. With most cache replacement policies this loads the array into L1 and L2. On my system I have 32 KB of L1 and 2 MB of L2, so this access evicts all previously cached values. It's like resetting the system, reducing other sources of noise in the time measurement.
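The cache-eviction trick could be sketched like this (the 4 MB size follows the post; the class name is mine):

```java
// Touch an array larger than L2 (4 MB > 2 MB here) so that earlier data is
// evicted from cache before the timed run.
public class CacheFlush {
    static long flush() {
        int[] big = new int[4 * 1024 * 1024 / 4];  // 4 MB of ints
        long sum = 0;
        for (int i = 0; i < big.length; i++) {
            big[i] = i;      // each write pulls a cache line in, evicting old ones
            sum += big[i];
        }
        return sum;          // use the result so the loop isn't optimized away
    }

    public static void main(String[] args) {
        System.out.println(flush());
    }
}
```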

5) To avoid the influence of the garbage collector, I run the test with a big heap size of 1 GB. Moreover, to be sure the GC is not running during the test, I turn on the verbose option.

6) The 1st parameter is the time in milliseconds the main thread waits for the test to finish. Of course, to get correct results you have to set it higher than the test duration. You find this empirically: set it quite high (e.g. 10000), look at the time the test actually takes, and then choose a suitably higher value. In my case the test takes no more than 1 second, so I set the parameter to 2000.
The 2nd parameter is the number of threads. As I already said, it has to divide 1000 evenly, e.g. 1, 2, 5, 10, 50, 100, 500, 1000. Don't use 3, 7, 30, 70 and so on.
So for 10 threads my command line is
java -verbose:gc -Xmx1024m -Xms1024m QuestionContextSwitching 2000 10

As you can see, I found a minimum running 50 threads!!! It's really strange, because the work is pure computation!
The second strange thing is that there isn't such a big difference between the test with 10 threads and the test with 1000! Where is the context switch effect?

So now I can guess a few things! Please tell me if they are not true!
1) The OS can manage and optimize software with a lot of threads well. This means you don't really have to worry about how many threads your application runs: the OS can handle it!
2) The OS and the hardware can introduce a bit of parallelism! The software runs worse with 1 thread than with 50!
3) There are algorithms in the OS that can optimize cache usage! (If this is true, do you know some?)

Does anyone know whether the JVM can optimize execution when it has to run a lot of threads?

You may need more time to counter the effects of the JIT compiler -- run the test many times in a loop, just to warm it up... then go into another loop for your official test.

Remember, because of the JIT, the first iterations will always be slower. So if you have more threads, the average speed will be faster, as the JIT compile time is amortized across more threads. You may also want to use the "-client" JIT compiler, as the server version optimizes better the longer it executes (or the more threads it runs).

You also seem to be starting the clock after the start() method is called. I would recommend having each thread keep its own start and end time... That way you can guarantee that the start time is taken inside the run() method and doesn't include the time for the thread itself to start.
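Henry's suggestion might look roughly like this (the class and field names are mine):

```java
// Each worker records its own start/end inside run(), so thread startup cost
// is excluded from the measurement.
public class TimedWorker implements Runnable {
    volatile long start, end;

    public void run() {
        start = System.nanoTime();        // clock starts only once we are running
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;  // the measured work
        end = System.nanoTime();
        if (sum < 0) System.out.println(sum);          // keep the loop live
    }

    long elapsedNanos() { return end - start; }

    public static void main(String[] args) throws InterruptedException {
        TimedWorker w = new TimedWorker();
        Thread t = new Thread(w);
        t.start();
        t.join();
        System.out.println("elapsed ns: " + w.elapsedNanos());
    }
}
```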

Henry Wong wrote:You may need more time to counter the effects of the JIT compiler -- run the test many times in a loop, just to warm it up... then go into another loop for your official test.

Remember, because of the JIT, the first iterations will always be slower. So if you have more threads, the average speed will be faster, as the JIT compile time is amortized across more threads. You may also want to use the "-client" JIT compiler, as the server version optimizes better the longer it executes (or the more threads it runs).

Henry

Thanks Henry for the answers! :-)
First of all, I forgot to say I am using the Sun JVM, so the compiler is HotSpot, but the concept is pretty much the same.

I have already done a pre-test (block B in the code): I run all the threads, then recreate them and start them again for the actual test. The code is the same in all the threads, so I guess that by the second round the optimization is already done. Do you think I have to run a pre-cycle inside the run method? In that case the code gets much more complicated, because I'd have to wait for all the threads to complete just the first part of the run method. I'll do it and post the result; this could well be a problem.

You may also want to use the "-client" JIT compiler, as the server version optimizes better the longer it executes (or the more threads it runs).

How does this wait for the threads to finish? For all you know, they could still be running upon return from sleep.

Henry

First of all, wait_time_milli is an input parameter (the first on the command line). As I already said, you have to set it empirically (try 10000, then set it to a value just above the test duration). In my case the test takes no more than 1 second, so I set it to 2000, i.e. 2 seconds, to be sure it covers the whole test.

I am sure this waits exactly the time I set, because if the sleep method returned earlier I would get an exception (which I print). During my tests I never saw it.

You also seem to be starting the clock after the start() method is called. I would recommend having each thread keep its own start and end time... That way you can guarantee that the start time is taken inside the run() method and doesn't include the time for the thread itself to start.

I would also have a longer test -- less than a second is too short.

Henry

When I start the test (block G), I wait 1 second so that all the threads have started and are waiting at block L. Maybe this value is too short (I'll try to change it).
I measure the gap from when the main thread unlocks ob (so all the threads can run) to the maximum of timefinish over all the threads. Now someone could ask: why do you put timefinish = System.nanoTime(); inside the loop, and why don't you catch the time when a thread starts?
The answer is simple. If I put timefinish outside the loop, like this:

the statement timefinish = System.nanoTime(); runs once per thread! So the load is different when I run 1 thread than when I run 100!

The same thing happens if I put a timestart at the beginning of the run method and then take the minimum value, like this:

the statement timestart = System.nanoTime(); runs once if I run 1 thread and 100 times if I run 100 threads! This changes the workload and adds noise to the time measurements.

1. Switched to millis instead. Not sure why, but my computer doesn't have nano accuracy, and the big numbers were annoying me.

2. Started measurements only in the run() method. Every thread keeps track of its own start and stop, so we don't measure the thread creation and teardown times.

3. Gave more time for warmup.

4. Actually waited for thread completion.

5. Didn't divide up the work: each thread does the same amount of work regardless of the number of threads.

Explanation... the work was already too short: it wasn't measuring anything but the thread start and stop times. Dividing the work up made it worse. In fact, you can argue that your test doesn't do any work at all: with 1000 threads, each thread did one iteration. You are basically measuring the threading system, not the execution.

The total test time is the time taken by all of the threads combined, and the average thread time is that time amortized across the threads.

Notice that with one thread it is nearly zero. This is because the test is too short. The values do show up as the number of threads increases. The increase is caused by two factors: the work increased (because we don't divide up the work), and the threads interfere with each other -- there is only one CPU, which has to timeslice the threads. If you look only at the average time, that should eliminate the work-increase part.

1) It may be a good idea, but it gives different total loads of work, which I tried to avoid. Basically you run this

5000 times if you have 1 thread, and 500000 times if you have 100 threads. Are you sure HotSpot doesn't optimize code run 500000 times more aggressively than code run 5000 times? I believe hotter code gets better optimization.

2) Your total time is the sum of all the per-thread times.

But this gives a wrong result: suppose a thread can be preempted by another. For example, you can have this scenario:

threads[0] starts and does part of its job
threads[0] is preempted and threads[1] runs to completion
threads[0] resumes, finishes, and takes timefinish = System.currentTimeMillis();

So the measured time of threads[0] is the sum of the durations of threads[0] and threads[1]!!!

It's better to introduce a new variable timestart at the beginning of the run method and compute max(timefinish) - min(timestart). I'll do it!!
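The proposed measurement could be computed like this (the helper name is mine):

```java
// Wall time of the whole run: latest finish minus earliest start across threads.
public class WallClock {
    static long wallTime(long[] starts, long[] finishes) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long s : starts) min = Math.min(min, s);    // earliest thread start
        for (long f : finishes) max = Math.max(max, f);  // latest thread finish
        return max - min;
    }

    public static void main(String[] args) {
        // three threads starting at t = 5, 3, 7 and finishing at t = 20, 25, 22
        System.out.println(wallTime(new long[]{5, 3, 7}, new long[]{20, 25, 22})); // 22
    }
}
```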

3) I don't understand this

threads[100] starts much later than threads[0]. They don't start at the same time!!! threads[100] has to wait for the creation and start of the 99 threads before it. It's not fair for poor threads[100]! :-)

4) I also wanted to use the join() method. Looking at your code,

I think that the loop waits in threads[0].join(). Suppose threads[0] ends earlier than the others and the next thread to finish is threads[100]. The loop has to check the state of 99 threads before it can wait in threads[100].join(). But that loop from 0 to 99 runs concurrently with all the other threads of the test, and this interference introduces a delay.

But you had a great idea!! :-) Calculating the average of the times can probably give a more precise result. I'll try it and post the results soon! :-)
See you in a bit! :-)

threads[100] starts much later than threads[0]. They don't start at the same time!!! threads[100] has to wait for the creation and start of the 99 threads before it. It's not fair for poor threads[100]! :-)

I could be wrong, but Henry moved the timing routine into the thread's execution (i.e. inside run()), so the clock on threads[100] doesn't actually start until threads[i].start() is invoked on that thread.

Seems pretty fair.

"Computer science is no more about computers than astronomy is about telescopes" - Edsger Dijkstra

Enrico Tamellin

Greenhorn

Posts: 23

posted 9 years ago

Rusty Shackleford wrote:

threads[100] starts much later than threads[0]. They don't start at the same time!!! threads[100] has to wait for the creation and start of the 99 threads before it. It's not fair for poor threads[100]! :-)

I could be wrong, but Henry moved the timing routine into the thread's execution (i.e. inside run()), so the clock on threads[100] doesn't actually start until threads[i].start() is invoked on that thread.

Seems pretty fair.

Hello Rusty! :-)
You are right! The clock starts when the thread actually begins to run.
But in this test I want all the threads to start at the same time. In Java you can't do that directly, so the only thing you can do is introduce a synchronization point, as I did. Suppose it takes 1 millisecond to create a thread; then the loop starts one thread per millisecond. Moreover, the main thread can be preempted by a thread it has already started, so some threads could finish before the loop in main() is even done. This means that even if you want 1000 threads running concurrently, you might not actually get that. In a test you must be sure that all the threads you asked for are running at the same time.
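For what it's worth, since Java 1.5 this synchronization point can also be written with java.util.concurrent.CountDownLatch; a sketch (the names are mine, not the test code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchStart {
    static int runAll(int n) throws InterruptedException {
        CountDownLatch ready = new CountDownLatch(n); // counted down as threads arrive
        CountDownLatch go = new CountDownLatch(1);    // the "green light"
        CountDownLatch done = new CountDownLatch(n);
        AtomicInteger ran = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                try {
                    ready.countDown();      // I'm at the start line
                    go.await();             // block until the gate opens
                    ran.incrementAndGet();  // timed work would start here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        ready.await();   // wait until all n threads are at the start line
        go.countDown();  // release them all at (close to) the same instant
        done.await();    // wait for completion
        return ran.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(100)); // 100
    }
}
```

Unlike a fixed one-second sleep, the ready latch guarantees every thread has actually reached the start line before the gate opens.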

I hope my explanation is clear... I am not a native English speaker! :-)


By the way, I have just tested my program with Henry's good suggestions!

In this test each thread has a fixed amount of work (this time it's just one for loop!).
Naturally, if a test with 1 thread takes execution time T1, a test with 10 threads will take approximately 10*T1, so the raw results are not comparable.
But I simply normalize the results by the number of threads.

The results confirm my previous findings:

1) The test runs better with 50 or 100 threads, even for a purely computational task!!!
2) There isn't much difference between 100 and 1000 threads!

So basically I have found that there is NO CONTEXT SWITCH EFFECT!!! INCREDIBLE! It's as if I'd found the answer for all the servers that create thread pools to bound the number of threads in the system.
My incredible result seems to say: DON'T BOTHER BOUNDING YOUR THREADS: THERE IS NO CONTEXT SWITCH EFFECT! :-)
If this is true I'm a hero!

There is a basic flaw in the way you are benchmarking.

As per your program, NOT all the threads are started at the same point. That is the reason you don't experience the context-switching problem.

You started the threads and initialized them well in advance. Perfect!!... But the moment you call start(), a thread begins working. So you have to introduce some GREEN_LIGHT signal in your run() method, so that each thread proceeds only once all the threads have started.

Does this even come close to making the threads start together? In fact, if two threads happen to hit this point at the same time, this code may actually slow one down for a fraction of a millisecond, separating them.

Java was designed from the start with threads being very inexpensive; using threads is normal in Java code. The Java folks greatly improved thread programming in 1.5 by adding built-in APIs for barriers and other constructs that are standard in any multi-threading or multi-tasking environment. See Henry's wonderful Java Threads book, which I bought long ago.
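For example, the 1.5 barrier API, java.util.concurrent.CyclicBarrier, releases all waiting threads at once; a rough sketch (the names are mine):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierDemo {
    static int runAll(int n) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(n);  // trips when all n arrive
        AtomicInteger ran = new AtomicInteger();
        Thread[] ts = new Thread[n];
        for (int i = 0; i < n; i++) {
            ts[i] = new Thread(() -> {
                try {
                    barrier.await();        // wait for the other n-1 threads
                    ran.incrementAndGet();  // timed work would start here
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return ran.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(8)); // 8
    }
}
```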

That said, it makes sense to have a few threads, such as one for processing and one for handling button presses in the GUI. It does not make sense to have thousands of threads unless you have a thousand CPUs. Today it's trivial to get 4 CPUs, and systems with two or more quad-core processors are common in the server world.

We are probably a few years away from having 64 or 128 CPUs in our desktop systems, so designing an application around thousands of threads is not really ideal.

Look at the ThreadPool APIs. They help a lot.
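A minimal sketch of a bounded pool with the 1.5 ExecutorService API (the pool size and task count here are arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PoolDemo {
    static long runTasks(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize); // bounded threads
        AtomicLong total = new AtomicLong();
        for (int i = 0; i < tasks; i++) {
            final int k = i;
            pool.execute(() -> total.addAndGet(k)); // tasks queue up; threads are reused
        }
        pool.shutdown();                            // no new tasks accepted
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(4, 1000)); // 499500
    }
}
```

The point is that 1000 tasks run on only 4 threads, so the thread count stays bounded no matter how much work is queued.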


rajesh bala wrote:There is a basic flaw in the way you are benchmarking.

As per your program, NOT all the threads are started at the same point. That is the reason you don't experience the context-switching problem.

You started the threads and initialized them well in advance. Perfect!!... But the moment you call start(), a thread begins working. So you have to introduce some GREEN_LIGHT signal in your run() method, so that each thread proceeds only once all the threads have started.

~Rajesh.B

Hello Rajesh! :-)

Why do you think that, if the threads don't all start at the same point, you won't have a context switch effect later?

In my test all the threads stop here

and then, after main finishes the code inside the synchronized(ob) block, all the threads are allowed to run.
So even if I don't start all the threads at the same time (I can't, in Java), they actually run at the same time, concurrently... so I should see a context switch effect!
That code is like your GREEN_LIGHT!!!

So let me answer Henry as well:

The block of code L is a very simple implementation of a barrier with minimal overhead!! Even if it's not good programming style, it works well: all the threads have to wait there until main releases the lock on ob! When main finishes the code inside its synchronized block, it releases the lock and all the threads are allowed to run at the same time!

If you want, you can also use a wait()/notifyAll() implementation... I'll post it soon.
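One possible wait()/notifyAll() version of the start gate might look like this (my sketch, not the author's posted code):

```java
// Workers wait() on a shared monitor until the main thread flips a flag and
// calls notifyAll(), releasing all of them together.
public class NotifyGate {
    private boolean open = false;

    synchronized void awaitOpen() throws InterruptedException {
        while (!open) wait();   // the loop guards against spurious wakeups
    }

    synchronized void openGate() {
        open = true;
        notifyAll();            // release every waiting thread at once
    }

    public static void main(String[] args) throws InterruptedException {
        NotifyGate gate = new NotifyGate();
        Thread[] ts = new Thread[5];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                try {
                    gate.awaitOpen();          // every worker blocks here
                    System.out.println("go!"); // timed work would start here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[i].start();
        }
        Thread.sleep(100);   // let everyone reach the gate
        gate.openGate();     // green light
        for (Thread t : ts) t.join();
    }
}
```

The while loop around wait() matters: the flag, not the notification itself, is what lets a thread proceed, so a spurious wakeup simply goes back to waiting.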


Hello Pat! :-)

Pat Farrell wrote:The Java folks greatly improved thread programming in 1.5 by adding built-in APIs for barriers and other constructs that are standard in any multi-threading or multi-tasking environment.

Do you think they also reduced the context switch effect? Do you have a link about it?

See Henry's wonderful Java Threads book, which I bought long ago.

:-) I have already read the part about barriers! :-)

It does not make sense to have thousands of threads unless you have a thousand CPUs.

I agree with you, but in my test I found that code that does only computational work performs better with 50-100 threads than with 1! Why??

Look at the ThreadPool APIs. They help a lot.

Yes! The point of all this work was to study how many threads I should use in a thread pool to get the best performance. With only computational work (as in my case) it should be 1... but I have just seen that I get the best result with 50-100!!!


OK, as I said, I have also tried the implementation with wait() and notifyAll().
It's clearer, but I got the same result: NO CONTEXT SWITCH EFFECT!!!

:-) ... Now I am almost convinced that the context switch doesn't exist! :-)

Enrico Tamellin wrote:Do you think they also reduced the context switch effect? Do you have a link about it?

No, I expect that they might have done some cleanup, but it was pretty minor already. What 1.5 did was mostly implement the things Henry wrote about years earlier.

Done properly -- and I haven't looked into it in depth -- the JVM should be able to switch between threads with only a few times the overhead of an ordinary subroutine call. That's orders of magnitude less overhead than a process-to-process context switch.


Pat Farrell wrote:

Enrico Tamellin wrote:Do you think they also reduced the context switch effect? Do you have a link about it?

Done properly -- and I haven't looked into it in depth -- the JVM should be able to switch between threads with only a few times the overhead of an ordinary subroutine call. That's orders of magnitude less overhead than a process-to-process context switch.

But is it the JVM that performs the context switches?
I thought the JVM created each Java thread as a native OS thread, with the OS solely responsible for scheduling, and therefore for the context switches too. I mean, a Java thread behaves like a normal OS thread in the system. That's why I expected my test with 1000 threads to show some context switch effects.

Enrico Tamellin wrote:But is it the JVM that performs the context switches?
I thought the JVM created each Java thread as a native OS thread, with the OS solely responsible for scheduling, and therefore for the context switches too. I mean, a Java thread behaves like a normal OS thread in the system. That's why I expected my test with 1000 threads to show some context switch effects.

Are you seeing orders of magnitude of impact from Java threads? If not, then they are not native OS threads.

@henry knows far better than I, but I'm pretty sure that most JVMs do threading locally or internally and do not use OS processes.