Discussions

One of the reasons Java became popular was the introduction of primitives for thread management into the language itself, such as the synchronized keyword.

Everyone knows that concurrent programming and thread management in general are necessary: the world of web programming is essentially concurrent, because multiple clients can access the server at the same time. Necessary, but in turn complicated.

In addition to the inherent complexity of synchronizing shared resources, everyone knows that thread management is very costly and resource-intensive for the operating system, and rarely are many more than 100 threads used; for example, Tomcat 6 by default limits the maximum number of threads to 200.

The problem is the standard Java servlet model, which binds a thread to a single request: until the request terminates, the thread is not free to execute another request. Therefore the maximum number of concurrent requests is determined by the maximum number of threads we can use for requests.

To avoid this problem, proprietary extensions arose from the makers of application servers (Tomcat, Jetty, GlassFish ...). All of them, in one way or another (though usually via Java NIO), break the 1 request - 1 thread association, so the thread of the request ends without ending the request; because the request is no longer tied to a thread, it is said to be "non-blocking" or "asynchronous."

In "normal" web applications requests end as soon as possible. The problem arises in long-polling Comet, where the request may be retained for a long time: with the standard servlet model the thread is retained (stopped) too, because the request (and its associated objects) can only be used within the thread of the request. A retained thread cannot be used for other requests, so you need to reserve as many threads as there are Comet users.

In the alternative Java NIO approach one thread is capable of processing multiple requests, avoiding the problem of switching thousands of threads and allowing greater scalability and, consequently, a greater number of concurrent users.

The problem introduced by the "mono-thread" approach of NIO is that we must be careful, for example in Comet programming, not to block the NIO thread with our own actions: if the NIO thread blocks, many other requests will have to wait. An example is a database operation; such an operation is clearly blocking and may take time, and meanwhile the system (the NIO thread) is "stopped" waiting for the response from the database, so other requests are stopped as well.

To avoid this problem we can always create a thread (or take one from a pool) and delegate the database query to it, unblocking the NIO thread. The problem with this practice is that it reintroduces multithreaded programming, with all the scalability drawbacks enumerated earlier.

To solve this problem there is an emerging "new" paradigm: "asynchronous programming". Asynchronous programming promotes using the same thread to process multiple requests sequentially, but with no request blocking the thread; as we will see later, the operations performed by requests are executed "in pieces." To get there we must avoid running tasks that block the thread, so we need a non-blocking API. For instance, a database operation using a non-blocking API "registers" the operation but does not run it immediately, so the method call returns at once without having accomplished the database task; at the same time we provide a callback to be invoked when the database process ends, so the flow continues through the callback. This way of executing code is called "asynchronous", in contrast to how it would have run sequentially (where the call does not return until the task finishes).
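A minimal sketch of this callback style in Java. The `queryAsync` method and its callback shape are hypothetical (no real driver is assumed); the "registration" is simulated here with a background executor, where a real non-blocking driver would hand the operation to the OS:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class AsyncSketch {
    // Simulated I/O machinery: a real non-blocking API would not need this pool
    static final ExecutorService ioPool = Executors.newSingleThreadExecutor();

    // Hypothetical non-blocking call: registers the query and returns at once,
    // invoking the callback when the "database" result is ready.
    static void queryAsync(String sql, Consumer<String> callback) {
        ioPool.submit(() -> {
            // simulate the time the database takes to answer
            try { Thread.sleep(100); } catch (InterruptedException e) { }
            callback.accept("result of: " + sql);
        });
    }

    public static void main(String[] args) throws Exception {
        queryAsync("SELECT 1", result -> System.out.println("callback got: " + result));
        // queryAsync has already returned; the caller is free to go on
        System.out.println("queryAsync returned, flow continues");
        ioPool.shutdown();
        ioPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Note that "queryAsync returned, flow continues" prints before the callback fires: the call registered the work and returned without waiting for it.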

Through this registration the task is internally queued. The main thread checks whether the queued tasks are finished using asynchronous APIs of the operating system, or via a thread pool, because the goal is never to block the main thread running the requests. Because the code of each request is executed little by little, the requests seem to run "in parallel", yet on the same thread.

If we need to execute tasks in parallel, we can simply do more things immediately after the call to a long asynchronous task; when the long asynchronous task finishes (which usually implies waiting for input/output of a device), the flow continues in the callback, but always on the same main thread. Because the same main thread is always used to execute requests, there is no concurrent access to shared objects, so you are free of synchronization problems. This is obviously not possible with, for example, JDBC, because this API is blocking (although nothing precludes building an asynchronous API on top of the JDBC API).

One example of this emerging programming paradigm is Node.js, a web application server in which you code in JavaScript. This paradigm fits JavaScript well because the language does not support threads.

The API of Node.js is non-blocking: either the task itself is non-blocking, or, when it is blocking, Node.js avoids the block by letting us register a callback. Every call into the Node.js API is an opportunity for the Node.js engine to switch requests and execute any pending callback whose blocking operation has finished. Thus Node.js COMMUTES between requests using the same thread, and running requests are gradually executed "in parallel": our code, instead of following a synchronous flow or sequence, is broken into small pieces registered as callbacks, and inside a callback we can again call new asynchronous methods registering new callbacks (new pieces of code).
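The "pieces registered as callbacks, all on one thread" idea can be shown with a toy event loop in Java. This is a deliberate simplification (Node.js's real loop is driven by OS readiness events, not a plain in-memory queue), but it shows how two "requests" interleave on a single thread:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ToyEventLoop {
    private final Queue<Runnable> tasks = new ArrayDeque<>();

    void post(Runnable callback) { tasks.add(callback); }

    void run() {
        // one thread drains the queue; a callback may post further callbacks,
        // so the pieces of different "requests" interleave with no second thread
        Runnable t;
        while ((t = tasks.poll()) != null) t.run();
    }

    public static void main(String[] args) {
        ToyEventLoop loop = new ToyEventLoop();
        loop.post(() -> {
            System.out.println("request A, piece 1");
            loop.post(() -> System.out.println("request A, piece 2"));
        });
        loop.post(() -> System.out.println("request B, piece 1"));
        loop.run();
        // pieces run interleaved on one thread: A1, B1, A2
    }
}
```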

As you can see this strategy is JUST GREAT!

Notwithstanding, it has just one problem ...

IT WAS INVENTED AROUND 40 YEARS AGO AND IS A HANDMADE, INEFFICIENT AND TEDIOUS VERSION OF THE JOB OF A THREAD SCHEDULER!

Software threads are a mere illusion: every core of a processor is almost mono-thread (in common Intel processors). A thread scheduler switches the processor (core) registers to continue running a different code area, with a different stack, for a very small time frame, giving us the illusion of parallelism, which is only real when the processor has multiple cores or every core can execute several hardware threads. This context switch is automatic, regardless of the code being executed, and follows a policy of effective processor usage. It is very difficult for a manual alternative to beat a thread scheduler (there was a time when it was possible, on older Linuxes, hence we had SEDA, but this is no longer true), because a thread scheduler has the ability to reclaim control of the processor/core from the software thread at any time.

It is FALSE that thread management is costly, as was demonstrated in this article. I have recently found another link that corroborates the same, though at first sight it seems to claim the opposite (I recommend reading my comment).

It is FALSE that thread management is costly; CPU usage is ZERO, as you can see in this example of 3000 threads:
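The original example does not appear here; a minimal reconstruction consistent with the claim (3000 threads that merely sleep, consuming stack memory but essentially no CPU) might look like this:

```java
public class SleepingThreads {
    public static void main(String[] args) throws Exception {
        final int THREADS = 3000;
        Thread[] list = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            list[i] = new Thread(() -> {
                try { Thread.sleep(10_000); } catch (InterruptedException e) { }
            });
            list[i].setDaemon(true); // let the JVM exit without waiting for them
            list[i].start();
        }
        System.out.println(THREADS + " threads started and sleeping");
        // watch CPU usage with top/Task Manager while they sleep: it stays near
        // zero, because a sleeping thread is never scheduled onto a core
    }
}
```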

It is FALSE that we can effectively separate blocking and non-blocking tasks and make maximum use of the CPU with a single thread.

The following example executes a huge number of iterations and increments of a simple integer variable, an extreme example of a non-blocking task. Interestingly, the higher the number of threads set with the THREADS variable (the total number of iterations and increments stays the same), the more the time decreases, despite the cost of creating and initializing the threads (with a pool that cost would not exist). Try THREADS with the number of cores of your computer, and then with 1000 threads, for example: the time is lower with 1000 threads! Now imagine the performance difference with more complicated, blocking tasks.

public static void main(String[] args) throws Exception
{
    final int THREADS = 8;
    final long LENGTH = 100000000000L / THREADS;

    long start = System.currentTimeMillis();

    Runnable task = new Runnable()
    {
        public void run()
        {
            long j = 0;
            for (long i = 0; i < LENGTH; i++)
                j++;
        }
    };

    Thread[] threadList = new Thread[THREADS];
    for (int i = 0; i < threadList.length; i++)
    {
        threadList[i] = new Thread(task);
        threadList[i].start();
    }

    for (int i = 0; i < threadList.length; i++)
        threadList[i].join();

    long end = System.currentTimeMillis();
    System.out.println("END " + (end - start));
}

Fortunately Servlet 3.0 allows us to decide the number of threads that we estimate for our Comet application. An interesting example.

My recent experience with ItsNat Comet tells me that it is not uncommon in Comet applications to need to notify ALL clients at the same time, and the most effective/performant way is for each client to have an associated thread in the server.

Please stop this nonsense and if necessary add threads to JavaScript...

I think a much better approach to solving some of the issues called out above is to bring QoS up into the application stack, in particular resource reservation and prioritization, making job/request/thread scheduling much more optimal (or directed) and eliminating possible congestion/contention further downstream in the request pipeline. This can be achieved via metering and resource reservation. I wrote about this recently.

What is really cool about our approach is that we have basically modeled all measured call sites in a Java application (even JRuby/Ruby and Jython/Python) as network nodes/routers with queuing, admission control, capacity management and prioritization - all within the same thread execution context.

I agree.. There are much better ways of handling this problem.. The solution is absurd.. using JavaScript, which is 100 times slower than a compiled language.. If there is asynchronous processing, scale up, use more processors, they are cheap.. and/or break your request into pieces (processes).. do as little as possible on connection initiation and save state in a file, or something.. anyway, this is the wrong way to go for a lot of reasons..

Second, this 'test' only causes about 10 context switches per second, which is ridiculously few to have any effect. If one reduces the delay in the internal loop to 100 milliseconds (simulating 30000 context switches per second) then the CPU is quite noticeably busy (and it gets worse on bigger multicore systems - TLB flushes and all that stuff).

Good modern languages make these kinds of loads trivial to handle - the Erlang runtime can easily handle hundreds of thousands of threads, with realtime performance.

Although I added some simple support for async HTTP within RESTEasy that I hope to get in the JAX-RS 2.0 spec, I really question the need for async HTTP in general. Firstly, you need thousands (tens of thousands with modern kernels?) of concurrent *blocking* threads before you even hit the wall and problems that async HTTP is supposed to solve. IMO, the vast vast majority of web apps will never come close to thousands of concurrent users. Even if there are thousands of concurrent users, most of those apps are concurrent blocking. Even then, isn't it cheaper to just add another machine than to complicate your design with async programming?

I like Servlet 3.0 because it allows tuning the number of threads being used (for instance in Comet) to reduce memory (stacks); this has nothing to do with scalability, unless memory is the main scalability limitation, and that is not the case when people rant against threading with regard to scalability.

Of course if the number of threads serving a number of Comet clients is lower you save memory, no doubt, but what if ALL clients must be notified at the same time? Then the best approach to get the best performance is 1 thread per client; otherwise several clients will be notified later, because several clients are queued on the same thread. In my opinion this scenario (all clients must be notified) is going to be very frequent in Comet apps.

Of course if the number of threads serving a number of Comet clients is lower you save memory, no doubt, but what if ALL clients must be notified at the same time? Then the best approach to get the best performance is 1 thread per client; otherwise several clients will be notified later, because several clients are queued on the same thread.

Pray tell me, I have a machine with 2 CPUs and 500 connected clients. How can I send a message to each of them simultaneously (without network-level tricks like multicast or broadcast)?

If you have more clients than physical CPUs then some clients will get their message later than others.

Pray tell me, I have a machine with 2 CPUs and 500 connected clients. How can I send a message to each of them simultaneously (without network-level tricks like multicast or broadcast)?

If you have more clients than physical CPUs then some clients will get their message later than others.

No, you can't, because threading is just an illusion; only hardware threads matter.

That said, when you notify a client because something happens in the server, you must do things for this notification, maybe a database action or simply message formatting, and as you can figure out pure non-blocking actions are a chimera: some minor blocking always happens. So while you are executing the client notification, the thread may sometimes be blocked, and if other clients are queued on the same thread they will be blocked too. If you provide a thread per client, no client blocks the other clients and CPU use is maximized.

Of course this is the most memory-consuming approach, but we are talking about Comet, real time, etc., and in Comet and real time the time to notify clients matters very much; otherwise the typical polling approach may be enough, and it is not so server-intensive.

That said, when you notify a client because something happens in the server, you must do things for this notification, maybe a database action or simply message formatting, and as you can figure out pure non-blocking actions are a chimera

Nope. Non-blocking IO is not a chimera. First, message formatting is not an IO-bound task, so the CPU will not be idle during formatting. Writing a log message is IO-bound, however.

But that's easy to work around by having physical_cpu*2 worker threads.
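A sketch of that workaround: a fixed pool of 2× the available CPUs absorbing the occasional blocking work, so the main/event thread stays free. The 2× factor is the commenter's rule of thumb, not a universal constant, and the blocking IO here is simulated with a sleep:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkerPool {
    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors() * 2;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // blocking tasks (e.g. writing a log line) go to the pool,
        // instead of blocking the thread that dispatches requests
        for (int i = 0; i < 10; i++) {
            final int n = i;
            pool.submit(() -> {
                try { Thread.sleep(50); } catch (InterruptedException e) { } // simulated blocking IO
                System.out.println("task " + n + " done on " + Thread.currentThread().getName());
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```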

And in other languages (hint: Erlang) _all_ IO is async, so there is no such problem at all.

Pray tell me, I have a machine with 2 CPUs and 500 connected clients. How can I send a message to each of them simultaneously (without network-level tricks like multicast or broadcast)?

If you have more clients than physical CPUs then some clients will get their message later than others.

Alex - It's been a long time since the CPUs were sending anything over the wire. What happens (for both blocking and non-blocking I/O) is that the CPU copies the bytes-to-be-sent to a certain location in memory, and then the NIC (operating asynchronously from the CPU) uses DMA to read those bytes from RAM and send them over the wire. All NIO is (for TCP/IP) is a way to know whether or not there is space in that special DMA part of memory to copy more bytes-to-be-sent. (If there is space to copy into, then the selector for that port is set; otherwise not.)

At any rate, to answer your question, the CPU (or multiple concurrent CPUs or cores or SMT cores or whatever) will copy the data to the right places in memory and the NIC (or NICs or whatever) will send that data over the wire when possible. Generally it is quite easy for a machine with 1 CPU to do this for 10000 clients, or for that matter a machine with 16 SMT cores. The CPU doesn't wait for a message to get delivered unless someone works really hard to write some really stupid code.
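This "copy and return" behavior is easy to observe in Java NIO. In the sketch below (a loopback socket pair, used only to make the example self-contained), a non-blocking write() copies the bytes toward the kernel's send buffer and returns immediately; delivery over the wire is the NIC's business:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingWrite {
    public static void main(String[] args) throws IOException {
        // loopback pair so the example is self-contained
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel peer = server.accept();

        client.configureBlocking(false);
        ByteBuffer buf = ByteBuffer.wrap("hello".getBytes());
        // write() only copies into the send buffer and returns at once; the
        // return value says how many bytes the buffer accepted (possibly fewer
        // than requested when the buffer is full - that is what OP_WRITE signals)
        int copied = client.write(buf);
        System.out.println("write returned immediately, bytes accepted: " + copied);

        client.close(); peer.close(); server.close();
    }
}
```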

Also, as the other messages pointed out, there's no such concept as "clients getting their messages at the same time".

Lastly, multicast and broadcast are certainly not required, and wouldn't solve the "simultaneous delivery" problem at any rate (although they can be efficient for widespread message distribution).

Anyway the results are the same (same conclusions). Try with THREADS = 2*cores and THREADS = 1000 or higher (of course a very high number will throw a memory error or will degrade performance because of too many threads).

I don't understand the aversion to async programming. Yes, it's not elegant in pure Java (though Scala is much nicer), but it's effective and can be quite fast.

No, no, it is not aversion. I'm just against the myth that async programming is what you need to scale with very, very few threads; in my opinion it is false, and I don't know of a performance benchmark proving the contrary. Furthermore, very simple extreme tests, such as the tests provided in the article, prove the contrary: minor blocking ruins the performance of a concurrent application based on very few threads. And other, more complicated tests (cited in the article) prove the same: the more threads, the more scaling. Of course there are limits, but those limits, where threading becomes a problem, arise at many thousands of concurrent (not stopped/waiting) threads, and at that level of load threading is not the problem.

I like asynchronous programming for parallelizing tasks, when you can partially parallelize your sequential code; and, paradoxically, asynchronous programming is fine for increasing performance when the system is mainly idle, because you can maximize the use of cores for dispatching a single request (remember the race for clock speed ended several years ago).

I absolutely agree with you Pat; the use cases you have exposed basically share the same rationale: you can parallelize these actions, that is, you can submit some task and immediately do another task.

My answer is the same one given to Alex Besogonov before: your examples add some more reasons for more concrete use cases, but these reasons have nothing to do with the main argument for "asynchronous programming in the same thread", namely "free the main thread in this blocking request by going asynchronous, because other requests must be executed in the same main thread". With this approach pure sequential actions are converted into evented actions for the sake of some supposed benefit in scalability.

From your blog: "Thread management is very low-level and best left to the operating system"

Yes and no. Some time ago, in the SEDA project, Matt Welsh beat the thread scheduler of Linux 2.4 systems; in my opinion (and several public benchmarks seem to prove this) this is no longer true on modern systems. The problem with the SEDA project is that it helped to extend a truth, valid for a concrete time frame and concrete systems, into "the definitive truth forever".

NIO-based I/O schedulers, typical of current servlet containers, are trying to do the same: beat the thread scheduler at I/O (fortunately in a very controlled task). I think the result is as expected: similar performance and scalability, but not better. I must recognize they can reduce the amount of memory being used with performance and scalability similar to a classic thread-blocking solution, and that is good in spite of the higher complexity in internal servlet container code; but extending this approach to the business code of end users with no clear benefit (and with a very high probability of ruining the performance of your system) is crazy, and it will make us crazy :)

"but these reasons have nothing to do with the main reason of 'asynchronous programming in the same thread', that is 'free the main thread in this blocking request doing asynchronous because other requests must be executed in the same main thread'"

That misconstrues my point. My point is that this asynchronous model allows multiple independent operations to be performed. This makes it possible to:

have a request return an incomplete response to the browser faster, without waiting for every operation to complete. Slow operations have their result delivered late as part of an ajax response.

Being able to deliver the user's perception of a fast response time is, I feel, the best use case for this event/asynchronous model.

Like every tool it is subject to abuse before it is used properly. Certainly, chopping up every operation into little bits gains no value but delivers lots of confusing code. But please don't go the other way and say the asynchronous model is a complete waste of time.

WRT Java NIO, NIO does have enormous value in the case where there are lots of connections that are not CPU-intensive. For example, network monitoring software has to connect to many switches. These switches (especially when heavily loaded) tend to respond slowly (hundreds of bytes/sec) to monitoring data requests. Sometimes the connections hang and need to be dropped and reconnected. The 1 thread/connection model is bad here.

So NIO is a good match for cases where there are multiple connections with slow data transfer rates.

Pat, you are missing my point. I'm not against asynchronous programming; I do asynchronous programming most of the time, and I fully understand the "joy" of asynchronous programming, for instance here, here and here. Comet programming is basically asynchronous programming.

I am against phrases like this: "the same thread is used to execute thousands of requests of thousands of users, this thread solution is the only one that scales"

This has nothing to do with your examples.

And yes, in some cases NIO can be fine because it provides more control to the I/O dispatcher, but NIO was sold as the only way to scale to thousands of requests, thanks to asynchronous programming using very, very few threads and avoiding the "high cost" of context switching.

If you want, we can avoid the term "asynchronous programming" because it may be confusing: you do asynchronous programming when you parallelize code that is not purely sequential, and parallelization of non-purely-sequential actions is not the focus of the article.

Maybe "sequential task executed asynchronously/discontinuously as consecutive chunks on the same thread, to allow other sequential tasks to be executed on the same thread at the same time in the same way"; as you can see, the alternative "name" is not pretty :)

Dudes, you mean to tell me you re-discovered, after all these decades, the relationship between software threads and hardware capacity? Or the JIT improvements?

The reason a gazillion threads is to be avoided is resource contention. Adding integers with not even a lock in sight has nothing to do with a normal server app, which logs to buffered file systems, uses static locked server objects, shares pools of ($-expensive) database connections, etc.

When's the last time you plotted the response times for one of those? The point is that you *don't want* to deal with 2000 simultaneous thread-requests, which will switch contexts every 5 lines of code when trying to log to the same file, because the contention rate is now 2000/5 versus 100/5 with 100 threads, and the switching cost grows correspondingly. How about, in your example, having an actual static shared lock and hitting it from each thread, every 5 integer increments?

As far as I know, Erlang is 1) functional (i.e. it discourages shared state) and 2) message-passing and actor-based, not thread-based. Those hundreds of thousands are actors/processes, not what you commonly know as "threads"... Modern actor systems like Akka support millions of actors.

The point is, José María, that node is able to handle in a single thread and with a comparably tiny memory footprint a (big number) of clients for which you'd otherwise need a (big number) of threads and much more memory. No matter how you put it, that's a big plus. Then you can always spawn some more node processes (or perhaps just more node threads in a single node process, in the future) as needed, to handle even greater numbers of clients, to put these extra cpu cores to work. IOW, what you get is a much more efficient use of the available resources.

Also, when you say "IT WAS INVENTED AROUND 40 YEARS AGO AND IS A HANDMADE, INEFFICIENT AND TEDIOUS VERSION OF THE JOB OF A THREAD SCHEDULER!", you're wrong.

Context switches are quite expensive, but even if they weren't too expensive, you'd have to agree that they're more expensive than no context switches at all. So for example, if you were to loop while (1) { n++ } in a single thread on a single-core cpu, it would *always* loop noticeably faster than in 2 threads, and use less memory, too.
