Java is the new C

Friday, July 31, 2015

Front end development productivity has been stalling for a long time. I'd even say regarding productivity, there was no substantial improvement since the days of Smalltalk (Bubble Event Model, MVC done right, component based).

For reasons unknown to me (fill in some major raging), it took some 15 years until a decent and well thought webcomponent standard (proposal) arised: WebComponents. Its kind an umbrella of several standards, most importantly:

Html Templates - the ability to have markup snippets which are not "seen" and processed by the browser but can be "instantiated" programmatically later on.

Html Imports - finally the ability to include a bundle of css, html and javascript implicitely resolving dependencies by not loading an import twice. Style does not leak out to the main document.

Shadow Dom

It would not be a real web technology without some serious drawbacks: Html Imports are handled by the browser, so you'll get many http requests when loading a webcomponent app using html imports to resolve dependencies.
As the Html standard is strictly separated from the Http protocol, simple solutions such as server side imports (to reduce number of http requests) can never be part of an official standard.

Anyway, with Http-2's pipelining and multiplexing features html-imports won't be an issue within some 3 to 5 years (..), until then some workarounds and shim's are required.

Live Demo: http://46.4.83.116/polymer/ (just sign in with anything except user 'admin'). Requires Websocket connectivity, modern browser (uses webcomps-lite.js, so no IE). Curious wether it survives ;-).
Currently Chrome and Opera implement WebComponents, however there is a well working polyfill for mozilla + safari and a last resort polyfill for ie.
Using a server side html import shim (as provided by kontraktor's http4k), web component based apps run on android 4.x and IOS safari.

a web component with dependencies (link rel='import'), embedded style, a template and code

Dependency Management with Imports ( "<link rel='import' ..>" )

The browser does not load an imported document twice and evaluates html imports linearly. This way dependencies are resolved in correct order implicitly. An imported html snippet may contain (nearby) arbitrary html such as scripts, css, .. most often it will contain the definition of a web component.

Encapsulation
Styles and templates defined inside an imported html component do not "leak" to the containing document.
Web components support data binding (one/two way). Typically a Web component coordinates its direct children only (encapsulation). Template elements can be easily accessed with "this.$.elemId".

Application structure
An application also consists of components. Starting from the main-app component, one subdivides an app hierarchically into smaller subcompontents, which has a nice self-structuring effect, as one creates reusable visual components along the way.

The main index.html typically refers to the main app component only (kr-login is an overlay component handling authentication with kontraktor webapp server):

hm .. you might imagine how such an app will load on a mobile device, as the number of concurrent connections typically is limited to 2..6 and a request latency of 200 to 500ms isn't uncommon. As bandwidth increases continously, but latency roughly stays the same reducing the number of requests pays off even at cost of increased initial bandwidth for many apps.

In order to reduce the number of requests and bandwidth usage, the JavaScript community maintains a load of tools minifying, aggregating and preprocessing resources. When using Java at the server side, one ends up having two build systems, gradle or maven for your java stuff, node.js based javascript tools like bower, grunt, vulcanize etc. . In addition there are a lot of temporary (redundant) artifacts are created this way.

As Java web server landscape mostly sticks to server-centric "Look-'ma-we-can-do-ze-web™" applications, its hardly possible to make use of modern javascript frameworks using Java at server side, especially as REAL JAVA DEVELOPERS DO NOT SIMPLY INSTALL NODE.JS (though they have a windows installer now ;) ..). Nashorn unfortunately isn't yet there, currently it fails to replace node.js due to missing API or detail incompatibilities.

I think its time for an unobtrusive pointer to my kontraktor actor library, which abstracts away most of the messy stuff allowing for direct communication of JavaScript and Java Actors via http or websockets.

Even when serving single page applications, there is stuff only a web server can implement effectively:

dynamically inline Html-Import's

dynamically minify scripts (even inline javascript)

In essence inlining, minification and compression are temporary wire-level optimizations, no need to add this to the build and have your project cluttered up. Kontraktor's Http4k optimizes served content dynamically if run in production mode.
The same application with (kontraktor-)Http4k in production mode:

even better: As imports are removed during server side inlining, the web app runs under Safari IOS and Android 4.4.2 default browser.

Actor based Server side

Wether its a TCP Server Socket accepting client(-socket)s or a client browser connecting a web server: its structurally the same pattern:

What's different when comparing a TCP server and a WebApp Http server is

underlying transport protocol

message encoding

As kontraktor maps actors to a concrete transport topology at runtime, a web app server application does not look special:

a generic server using actors

The decision on transport and encoding is done by publishing the server actor. The underlying framework (kontraktor) then will map the high level behaviour description to a concrete transport and encoding. E.g. for websockets + http longpoll transports it would look like:

So with a different transport mapping, a the webapp backend might be used as an in-process or tcp-connetable service.

So far so good, however a webapp needs html-import inlining, minification and file serving ...
At this point there is an end to the abstraction game, so kontraktor simply leverages the flexibility of RedHat's Undertow by providing a "resource path" FileResource handler.

The resource path file handler parses .html files and inlines+minifies (depending on settings) html-import's, css links and script tags. Ofc for development this should be turned off.
The resource path works like Java's classpath, which means a referenced file is looked up starting with the first path entry advancing along the resource path until a match is found. This can be nice in order to quickly switch versions of components used and keep your libs organized (no copying required).

Saturday, June 27, 2015

[Edit: Title is a click bait of course, Fowler is aware of the async/sync issues, recently posted http://martinfowler.com/articles/microservice-trade-offs.html with clarifying section regarding async. ]

With distributed applications, there comes the need for ordered point to point message passing. This is different to client/server relations, where many clients send requests at low rate and the server can choose to scale using multiple threads processing requests concurrently.

Remote Messaging performance is to distributed systems what method invocation performance is for non-distributed monolithic applications.

(guess what is one of the most optimized areas in the JVM: method invocation)

[Edit: with "REST", I also refer to HTTP based webservice style API, this somewhat imprecise]

Revisiting high level remoting Abstractions
There were various attempts at building high-level, location transparent abstractions (e.g. corba, distributed objects), however in general those idea's have not received broad acceptance.

Though not explicitely written, the article implies synchronous remote calls, where a sender blocks and waits for a remote result to arrive thereby including cost of a full network roundtrip for each remote call performed.

With asynchronous remote calls, many of the complaints do not hold true anymore. When using asynchronous message passing, the granularity of remote calls is not significant anymore.

"course grained" processing

remote.getAgeAndBirth().then( (age,birth) -> .. );

is not significantly faster than 2 "fine grained" calls

all( remote.getAge(), remote.getBirth() )

.then( resultArray -> ... );

as both variants include network round trip latency only once.

On the other hand with synchronous remote calls, every single remote method call has a penalty of one network round trip, and only then Fowlers arguments hold.

Another element changing the picture is the availability of "Spores", a snippet of code which can be passed over the network and executed at receiver side e.g.

Spore's are implementable effectively with availability of VM's and JIT compilation.

Actor Systems and similar asynchronous message passing approaches have gained popularity in recent years. Main motivation was to ease concurrency and the insight that multithreading with shared data does not scale well and is hard to master in an industrial grade software development environment.

As large servers in essence are "distributed systems in a box", those approaches apply also for distributed systems.

Following I'll test performance of remote invocations of some frameworks. I'd like to proof that established frameworks are far from what is technically possible and want to show that popular technology choices such as REST are fundamentally inept to form the foundation of large and fine grained distributed applications.

Test Participants

Disclaimer: As tested products are of medium to high complexity, there is danger of misconfiguration or test errors, so if anybody has a (verfied) improvement to one of the testcases, just drop me a comment or file an issue to the github repository containing the tests:

Vert.x 3.1provides a weaker level of abstraction compared to actor systems, e.g. there are no remote references. Vert.x has a symbolic notion of network communication (event bus, endpoints). As it's "polyglot", marshalling and message encoding need some manual support. Vert.x is kind of a platform and addresses many practical aspects of distributed applications such as application deployment, integration of popular technology stacks, monitoring, etc.

REST (RestExpress)As Http 1.1 based REST is limited by latency (synchronous protocol), I just choosed this more or less randomly.

finagle - requires me to clone and build their fork of thrift 0.5 first. Then I'd have to define thrift messages, then generate, then finally run it.

parallel universe - at time of writing the actor remoting was not in a testable state ("Galaxy" is alpha), examples are without build files, the gradle build did not work. Once i managed to build, the programs where expecting configuration files which I could not find. Maybe worth a revisit (accepting pull requests as well :) ).

The Test

I took a standard remoting example:

The "Ask" testcase:

The sender sends a message of two numbers, the remote receiver answers with the sum of those 2 numbers. The remoting layer has to track and match requests and responses as there can be tens of thousand "in-flight".

The "Tell" testcase:

Sender sends fire-and forget. No reply is sent from the receiver.

Results

Attention: Don't miss notes below charts.

Platform: Linux Centos 7 dual socket 20 real cores @2.5 GHZ, 64GB ram. As the tests are ordered point to point, none of the tests made use of more than 4 cores.

tell Sum
(msg/second)

ask Sum (msg/second)

Kontraktor Idiomatic

1.900.000

860.000

Kontraktor Sum-Object

1.450.000

795.000

Vert.x 3

200.000

200.000

AKKA (Kryo)

120.000

65.000

AKKA

70.000

64.500

RestExpress

15.000

15.000

Rest >15 connections

48.000

48.000

let me chart that for you ..

Remarks:

Kontraktor 3 outperforms by a huge margin. I verified the test is correct and all messages are transmitted (if in doubt just clone the git repo and reproduce).

Vert.x 3 seems to have a built-in rate limiting. I saw peaks of 400k messages/second however it averaged at 200k (hints for improvement welcome). In addition, the first connecting sender only gets 15k/second throughput. If I stop and reconnect, throughput is as charted. I tested the very first Vert.x 3 final release. For marshalling fast-serialization (FST) was used (also used in kontraktor). Will update as Vert.x 3 matures

Akka. I spent quite some time on improving the performance with mediocre results. As Kryo is roughly of same speed as fst serialization, I'd expect at least 50% of kontraktor's performance.Edit: Further analysis shows, Akka is hit by poor serialization performance. It has an option to use Protobuf for encoding, which might improve results (but why kryo did not help then ?).

Implications of using protobuf:* need each message be defined in a .proto file, need generator to be run* frequently additional datatransfer is done like "app data => generated messages => app data"* no reference sharing support, no cyclic object graphs can be transmitted* no implicit compression by serialization's reference sharing.* unsure wether the ask() test profits as it did not profit from Kryo as well* Kryo performance is in the same ballpark as protobuf but did not help that much either.

Smell: I had several people contacting me aiming to improve Akka results. They disappear somehow.

Once I find time I might add a protobuf test. Its a pretty small test program, so if there was an easy fix, it should not be a huge effort to provide it. The git repo linked above contains a maven buildable ready to use project.

REST. Poor throughput is not caused by RestExpress (which I found quite straight forward to use) but by Http1.1's dependence on latency. If one moves a server to other hardware (e.g. different subnet, cloud), throughput of a service can change drastically due to different latency. This might change with Http 2.Good news is: You can <use> </any> <chatty> { encoding: for messages }, as it won't make a big difference for point to point REST performance.Only when opening many connections (>20) concurrently, throughput increases. This messes up transaction/message ordering, so can only be used for idempotent operations (a species mostly known from white papers and conference slides, rarely seen in the wild).

Misc Observations

Backpressure
Sending millions of messages as fast as possible can be tricky to implement in a non-blocking environment. A naive send loop

might block the processing thread

build up a large outbound queue as put is faster than take+sending.

can prevent incoming Callbacks from being enqueued + executed (=Deadlock or OOM).

Of course this is a synthetic test case, however similar situations exist e.g. when streaming large query results or sending large blob's to other node's (e.g. init with reference data).

None of the libraries (except rest) did that out of the box:

Kontraktor requires a manual increase of queue sizes over default (default is 32k) in order to not deadlock in the "ask" test. In addition its required to programatically adopt send rate by using the backpressure signal raised by the TCP stack (network send blocks). This can be done non-blocking, "offer()" style.

For VertX i used a periodic task sending a burst of 500 to 1000 messages. Unfortunately the optimal number of messages per burst depends on hardware performance, so the test might need adoption when run on e.g. a Laptop.

For Akka I send 1 million messages each 30 seconds in order to avoid implementation of application level flow control. It just queued up messages and degrades to like 50 msg/s when used naively (big loop).

Kontraktor actually is far from optimal. It still generates and serializes a "MethodCall() { int targetId, [..], String methodName, Object args[]}" for each message remoted. It does not use Externalizable or other ways of bypassing generic (fast-)serialization.
Throughputs beyond 10 million remote method invocations/sec have been proved possible at cost of a certain fragility + complexity (unique id's and distributed systems ...) + manual marshalling optimizations.

Conclusion

As scepticism regarding distributed object abstractions is mostly performance related, high performance asynchronous remote invocation is a game changer

Popular libraries have room for improvement in this area

Don't use REST/Http for inter-system connect, (Micro-) Service oriented architectures. Point to point performance is horrible. It has its applications in the area of (WAN) web services / platform neutral, easily accessible API's and client/server patterns.

Asynchronous programming is different, requires different/new solution patterns (at source code level). It is unavoidable to learn use of asynchronous messaging primitives. "Pseudo Synchronous" approaches (e.g. fibers) are good in order to better scale multithreading, but do not work out for distributed systems.

picking up Peters well written overview on the uses of Unsafe, i'll have a short fly-by on how low level techniques in Java can save development effort by enabling a higher level of abstraction or allow for Java performance levels probably unknown to many.

My major point is to show that conversion of Objects to bytes and vice versa is an important fundamental, affecting virtually any modern java application.

Hardware enjoys to process streams of bytes, not object graphs connected by pointers as "All memory is tape"(M.Thompson if I remember correctly ..).Many basic technologies are therefore hard to use with vanilla Java heap objects:

Consider a retail WebApplication where there might be millions of registered users. We are actually not interested in representing data in a relational database as all needed is a quick retrieve of user related data once he logs in. Additionally one would like to traverse the social graph quickly.

Let's take a simple user class holding some attributes and a list of 'friends' making up a social graph.

easiest way to store this on heap, is a simple huge HashMap.

Alternatively one can use off heap maps to store large amounts of data. An off heap map stores its keys and values inside the native heap, so garbage collection does not need to track this memory. In addition, native heap can be told to automagically get synchronized to disk (memory mapped files). This even works in case your application crashes, as the OS manages write back of changed memory regions.

There are some open source off heap map implementations out there with various feature sets (e.g. ChronicleMap), for this example I'll use a plain and simple implementation featuring fast iteration (optional full scan search) and ease of use.

Significantly less overall memory consumption. A user record serialized is ~60 bytes, so in theory 300 million records fit into 180GB of server memory. No need to raise the big data flag and run 4096 hadoop nodes on AWS ;).

Comparing a regular in-memory java HashMap and a fast-serialization based persistent off heap map holding 15 millions user records, will show following results (on a 3Ghz older XEON 2x6):

As one can see, even with fast serialization there is a heavy penalty (~factor 5) in access performance, anyway: compared to other persistence alternatives, its still superior (1-3 microseconds per "get" operation, "put()" very similar).

Use of JDK serialization would perform at least 5 to 10 times slower (direct comparison below) and therefore render this approach useless.

A single server won't be able to serve (hundreds of) thousands of users, so we somehow need to share data amongst processes, even better: across machines.

Using a fast implementation, its possible to generously use (fast-) serialization for over-the-network messaging. Again: if this would run like 5 to 10 times slower, it just wouldn't be viable. Alternative approaches require an order of magnitude more work to achieve similar results.

By wrapping the persistent off heap hash map by an Actor implementation (async ftw!), some lines of code make up a persistent KeyValue server with a TCP-based and a HTTP interface (uses kontraktor actors). Of course the Actor can still be used in-process if one decides so later on.

Now that's a micro service. Given it lacks any attempt of optimization and is single threaded, its reasonably fast [same XEON machine as above]:

A real world implementation might want to double performance by directly putting received serialized object byte[] into the map instead of encoding it twice (encode/decode once for transmission over wire, then decode/encode for offheaping map).

"RestActorServer.Publish(..);" is a one liner to also expose the KVActor as a webservice in addition to raw tcp:

C like performance using flyweight wrappers / structsWith serialization, regular Java Objects are transformed to a byte sequence. One can do the opposite: Create wrapper classes which read data from fixed or computed positions of an underlying byte array or native memory address. (E.g. see this blog post).

By moving the base pointer its possible to access different records by just moving the the wrapper's offset. Copying such a "packed object" boils down to a memory copy. In addition, its pretty easy to write allocation free code this way. One downside is, that reading/writing single fields has a performance penalty compared to regular Java Objects. This can be made up for by using the Unsafe class.

"flyweight" wrapper classes can be implemented manually as shown in the blog post cited, however as code grows this starts getting unmaintainable.
Fast-serializaton provides a byproduct "struct emulation" supporting creation of flyweight wrapper classes from regular Java classes at runtime. Low level byte fiddling in application code can be avoided for the most part this way.

How a regular Java class can be mapped to flat memory (fst-structs):

Of course there are simpler tools out there to help reduce manual programming of encoding (e.g. Slab) which might be more appropriate for many cases and use less "magic".

What kind of performance can be expected using the different approaches (sad fact incoming) ?

Lets take the following struct-class consisting of a price update and an embedded struct denoting a tradable instrument (e.g. stock) and encode it using various methods:

a 'struct' in code

Pure encoding performance:

Structs

fast-Ser (no shared refs)

fast-Ser

JDK Ser (no shared)

JDK Ser

26.315.000,00

7.757.000,00

5.102.000,00

649.000,00

644.000,00

Real world test with messaging throughput:

In order to get a basic estimation of differences in a real application, i do an experiment how different encodings perform when used to send and receive messages at a high rate via reliable UDP messaging:

The Test:A sender encodes messages as fast as possible and publishes them using reliable multicast, a subscriber receives and decodes them.

Slowest compared to fastest: factor of 82. The test highlights an issue not covered by micro-benchmarking: Encoding and Decoding should perform similar, as factual throughput is determined by Min(Encoding performance, Decoding performance). For unknown reasons JDK serialization manages to encode the message tested like 500_000 times per second, decoding performance is only 80_000 per second so in the test the receiver gets dropped quickly:"...***** Stats for receive rate: 80351 per second ************** Stats for receive rate: 78769 per second *********SUB-ud4q has been dropped by PUB-9afs on service 1fatal, could not keep up. exiting"
(Creating backpressure here probably isn't the right way to address the issue ;-) )

Low Level utilities like Unsafe enable different representations of data resulting in extraordinary throughput or guaranteed latency boundaries (allocation free main path) for particular workloads. These are impossible to achieve by a large margin with JDK's public tool set.

In distributed systems, communication performance is of fundamental importance. Removing Unsafe is not the biggest fish to fry looking at the numbers above .. JSON or XML won't fix this ;-).

While the HotSpot VM has reached an extraordinary level of performance and reliability, CPU is wasted in some parts of the JDK like there's no tomorrow. Given we are living in the age of distributed applications and data, moving stuff over the wire should be easy to achieve (not manually coded) and as fast as possible.

Addendum: bounded latency

A quick Ping Pong RTT latency benchmark showing that java can compete with C solutions easily, as long the main path is allocation free and techniques like described above are employed:

[credits: charts+measurement done with HdrHistogram]

This is an "experiment" rather than a benchmark (so do not read: 'Proven: Java faster than C'), it shows low-level-Java can compete with C in at least this low-level domain.
Of course its not exactly idiomatic Java code, however its still easier to handle, port and maintain compared to a JNI or pure C(++) solution. Low latency C(++) code won't be that idiomatic either ;-)

About me: I am a solution architect freelancing at an exchange company in the area of realtime GUIs, middleware, and low latency CEP (Complex Event Processing) nightly hacking at https://github.com/RuedigerMoeller.

Tuesday, October 28, 2014

With the rise of the Web, textual encodings like xml/JSON have become very popular. Of course textual message encoding has many advantages from a developer's perspective. What most developers are not aware of, is how expensive encoding and decoding of textual messages really is compared to a well defined binary protocol.

Its common to define a system's behaviour by its protocol. Actually, a protocol messes up two distinct aspects of communication, namely:

Frequently (not always), these two very distinct aspects are mixed up without need. So we are forced to run the whole internet in "debug mode", as 99% of webservice and webapp communication is done using textual protocols.

The overhead in CPU consumption compared to a well defined binary encoding is factor ~3-5 (JSON) up to >10-20 (XML). The unnecessary waste of bandwith also adds to that greatly (yes you can zip, but this in turn will waste even more CPU).

I haven't calculated the numbers, but this is environmental pollution at big scale. Unnecessary CPU consumption to this extent wastes a lot of energy (global warming anyone ?).

Friday, October 17, 2014

Thanks to Jean Philippe Bempel who challenged my results (for a reason), I discovered an issue in last post: Code-completion let me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadExecutor(), so the pinned-to-thread-actor results are actually even better than reported previously. The big picture has not changed that much, but its still worthwhile reporting.

On a second note: There are many other aspects to concurrent scheduling such as queue implementations etc.. Especially if there is no "beef" inside the message processing, these differences become more dominant compared to cache misses, but this is another problem that has been covered extensively by other people in depth (e.g. Nitsan Wakart).

Focus of this experiment is locality/cache misses, keep in mind different queueing implementations of executors for sure add dirt/bias.

As requested, I add results from the linux "perf" tool to prove there are significant differences in cache misses caused by random assignment of Thread - to Actor as done by ThreadPoolExecutor and WorkStealingExecutor.

As in previous post, "dedicated" actor-pinned-to-thread performs best. For very small local state, there are only few cache misses so differences are small, but widen once a bigger chunk of memory is accessed by each actor. Note that ThreadPool is hampered by its internal scheduling/queuing mechanics, regardless of locality, it performs weak.

When increasing number of Actors to 8000 (so 1000 actors per thread), "Workstealing" and "Dedicated" perform similar. Reason: executing 8000 actors round robin creates cache misses for both executors. Note that in a real world server its likely that there are active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.

In order to get a more realistic impression I replaced the synthetic int-iteration by some dirty "real world" dummy stuff (do some allocation and HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (should also have negative impact on locality).

Note that this is rather short processing, so queue implementations and executor internal implementation might dominate locality here. This test is run on Opteron 8c16t * 2Sockets, a processor with 8kb L1 cache size only. (BTW: impl is extra dirty, so no performance optimization comments pls, thx)

As ThreadPoolExecutor is abnormous bad in this Test/Processor combination, plain numbers:

64 HMap entries

256 HMapentries

2000 HMapentries

4000 HMapentries

32k HMapentries

320k HMapentries

WorkStealing

1070

1071

1097

1129

1238

1284

Dedicated

656

646

661

649

721

798

ThreadPool

8314

8751

9412

9492

10269

10602

Conclusions basically stay same as in original post. Remember cache misses are only one factor of overall runtime performance, so there are workloads where results might look different. Quality/specialization of queue implementation will have huge impact in case processing consists of only some lines of code.

Finally, my result:

Pinning actors to threads created lowest cache miss rates in any case tested.

The experiment in my last post had a serious flaw: In an actor system, operations on a single actor are executed one after the other. However by naively adding message-processing jobs to executors, private actor state was accessed concurrently, leading to "false-sharing" and cache coherency related costs especially for small local state sizes.

Therefore I modified the test. For each Actor scheduled, the next message-processing is scheduled once the previous one finished, so the experiment resembles the behaviour of typical actors (or lightweight processes/tasks/fibers) correctly without concurrent access to a memory region.

Experiment roundup:

Several million messages are scheduled to several "Actor" simulating classes. Message processing is simulated by reading and writing the private, actor-local state in random order. There are more Actors (24-8000) than threads (6-8). Note that results established (if any) will also hold true for other light-weight concurrency schemes like go-routines, fibers, tasks ...

The test is done with

ThreadPoolExecutor

WorkStealingExecutor

Dedicated Thread (Each Actor has a fixed assignment to a worker thread)

As ThreadPoolExecutor and WorkStealingExecutor schedule each message on a random Thread, they will produce more cache misses compared to pinning each actor onto a fixed thread. Speculation is, that work stealing cannot make up for the costs provoked by cache misses.

(Some) Variables:

Number of worker threads

Number of actors

Amount of work per message

Locality / Size of private unshared actor state

8 Threads 24 actors 100 memory accesses (per msg)

Interpretation:
For this particular load, fixed assigned threads outperform executors. Note: the larger the local state of an actor, the higher the probability of a prefetch fail => cache miss. In this scenario my suspection holds true: Work stealing cannot make up for the amount of cache misses. fixed assigned threads profit, because its likely, some state of a previously processed message resides still in cache once a new message is processed on an actor.
Its remarkable how bad ThreadpoolExecutor performs in this experiment.

This is a scenario typical for backend-type service: There are few actors with high load. When running a front end server with many clients, there are probably more actors, as typically there is one actor per client session. Therefor lets push up the number of actors to 8000:

8 Threads 8000 actors 100 memory accesses (per msg)

Interpretation:

With this amount of actors, all execution schemes suffer from cache misses, as the accumulated size of 8000 actors is too big to fit into L1 cache. Therefore the cache advantage of fixed-assigned threads ('Dedicated') does not make up for the lack of work stealing. Work Stealing Executor outperforms any other execution scheme if a large amount of state is involved.

This is a somewhat unrealistic scenario as in a real server application, client request probably do not arrive "round robin", but some clients are more active than others. So in practice I'd expect "Dedicated" will at least have some advantage of higher cache hits. Anyway: when serving many clients (stateful), WorkStealing could be expected to outperform.

Just to get a third variant: same test with 240 actors:

These results complete the picture: with fewer actors, cache effect supercede work stealing. The higher the number of actors, the higher the number of cache misses gets, so work stealing starts outperforming dedicated threads.

Modifying other variables

Number of memory accesses

If a message-processing does few memory accesses, work stealing improves compared to the other 2. Reason: fewer memory access means fewer cache misses means work stealing gets more significant in the overall result.

WorkStealing could be viewed as the better overall solution. Especially if a lot of L1 cache misses are to be expected anyway.

The ideal executor would be WorkStealing with a soft actor-to-thread affinitiy. This would combine the strength of both execution schemes and would yield significant performance improvements for many workloads

Vanilla thread pools without work stealing and actor-to-thread affinity perform significantly worse and should not be used to execute lightweight processes.