I find postings like Joe’s highly valuable. Few among us are language designers, so rarely do those of us who aren’t get a first-hand account from someone who is as to why and how something within a given language’s design came to be. Joe describes the distribution primitives that Erlang provides as well as their composability; it might seem simple, but anyone who’s written non-trivial distributed computing infrastructure knows that choosing the right primitives and making the right design trade-offs is anything but simple. This explains why I continue to be so impressed with the design choices and trade-offs Joe and crew made for Erlang — I’ve simply never seen any distributed computing infrastructure so elegant and yet so practical and capable.

His blog entry basically makes fun of Cisco for inventing/releasing another RPC system. It’s not clear exactly what he thinks they should have done instead.

I think my posting pretty clearly implies that Cisco should have avoided writing their own and instead should have reused something that already exists.

What is strange about this criticism is that tons of technology companies have developed their own RPC system — Facebook and Cisco publicly, and other technology companies I am familiar with in a not-so-public way. Guess what: large commercial distributed systems are built largely on RPC. Is he arguing that all of the engineers at these companies simultaneously got the bad idea of investing in something they don’t need? If RPC is such a bad idea, then why is everybody doing it?

Is everybody really doing it? Are large commercial distributed systems really built largely on RPC? I’ve seen some non-trivial CORBA-based deployments over the years, but in my experience large systems are built using approaches other than RPC. Like the Web, which isn’t RPC. Like email, which isn’t RPC. Like pub/sub enterprise messaging systems, which aren’t RPC.

Let’s consider the definition of what an RPC actually is. The term is often misused to mean “a synchronous call to another system over the network.” This is not what an RPC is. For example, an HTTP request is synchronous, but it is not an RPC. RPC, rather, is a specific approach for developing networked applications where local calls wrap and hide operations that happen to be carried out on another system across the network. For starters, let’s check Wikipedia:

Remote procedure call (RPC) is a technology that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network) without the programmer explicitly coding the details for this remote interaction. That is, the programmer would write essentially the same code whether the subroutine is local to the executing program, or remote.

Next, let’s check RFC 707, where RPC comes from, in which James E. White specifically proposed a procedure call model for networked applications designed to hide the network, and thereby allow developers to use familiar approaches to developing applications that happened to perform network operations. Quoting from that RFC:

Ideally, the goal of both the Protocol and its accompanying RTE is to make remote resources as easy to use as local ones. Since local resources usually take the form of resident and/or library subroutines, the possibility of modeling remote commands as “procedures” immediately suggests itself. The Model is further confirmed by the similarity that exists between local procedures and the remote commands to which the Protocol provides access. Both carry out arbitrarily complex, named operations on behalf of the requesting program (the caller); are governed by arguments supplied by the caller; and return to it results that reflect the outcome of the operation. The procedure call model thus acknowledges that, in a network environment, programs must sometimes call subroutines in machines other than their own.

and also:

“The procedure call model would elevate the task of creating applications protocols to that of defining procedures and their calling sequences. It would also provide the foundation for a true distributed programming system (DPS) that encourages and facilitates the work of the applications programmer by gracefully extending the local programming environment, via the RTE, to embrace modules on other machines.” This integration of local and network programming environments can even be carried as far as modifying compilers to provide minor variants of their normal procedure-calling constructs for addressing remote procedures (for which calls to the appropriate RTE primitives would be dropped out).

Josh continues:

Yes, on a network sh*t happens, and no sane RPC system will try to hide this from you.

As you can see from the original definition of RPC, something called an RPC that doesn’t hide the network is, by definition, not an RPC. As I said above, unfortunately the term is often misused as meaning “synchronous messaging,” and that incorrect usage seems to be what Josh is defending. Josh then says:

But then again, I don’t know of any RPC system that tries to hide this from you except possibly CORBA.

That’s not correct either. What CORBA actually does is make everything appear remote, even local objects, but does so in a way that allows object request broker (ORB) implementations to bypass much of the overhead of remote invocations when the ORB knows that a target object is local. Still, not all the overhead can be eliminated due to object lifecycle and method dispatching requirements, meaning that such local calls are typically never as fast as true local calls. DCE also treats services as always being remote, but last I checked it included no local bypass optimizations (though a variant called OODCE once did this, IIRC). But either way, what’s important with these systems is that calls within your code look just like any other calls within your code, whether they’re calling remote operations or not. And that’s RPC.

Regarding versioning problems, Josh says:

But any RPC framework worth its salt makes it possible to have different interface versions interoperate. Adding a new parameter? No problem, old servers simply won’t see it. Completely changing the semantics of your call? No problem — just give the new call a new name.

Yes, Josh, there are generally ways to do versioning in such systems, but they’re not very good. CORBA includes some facilities to help with versioning, but in practice they don’t actually help that much. Both COM and CORBA promoted interface inheritance and runtime interface negotiation (called “narrowing” in CORBA) as a way to do versioning, which works, but only for a restricted set of changes. Add a parameter to an existing call? Sorry, no can do, unless your marshaling format carries complete information for the entire call including parameter names, types, and positions, and also versions each parameter, all of which systems like CORBA, DCOM, and DCE specifically do not do due to the large overhead it adds whether a given application uses it or not, and in CORBA’s case also because of the interference it can cause for local dispatching optimizations. All in all, versioning is hard, not only for RPC, but for distributed systems in general.

Middleware and distributed systems veterans are well aware of the arguments like the ones I’ve made in my blog and other places recently and in various publications over the years; such arguments are generally common knowledge among us, and have been for years.

Cisco’s system is not available yet, but when it comes out, I’m quite certain you’ll find, Josh, that it’s the same old thing, just repackaged in a new box.

I see from this CIO Magazine article that Cisco is releasing a new client/server messaging system called Etch. Sigh — those who don’t know history are indeed doomed to repeat it. Some choice quotes from the article:

This week Cisco Systems announced a new messaging protocol intended to allow developers to integrate client/server applications without the overhead of traditional protocols such as SOAP.

I was unaware that SOAP had become “traditional.”

One of its design goals was to create an inter-application communications technology without SOAP’s complexity and overhead, explained Marascio. While SOAP relies on a very complicated WSDL file to define the interface between the client and server, Etch uses a file in Cisco’s own interface definition language that shares many similarities to a Java interface file.

I bet this new IDL is not only simpler than WSDL, but it probably also avoids all the impedance mismatch problems that invariably occur when mapping IDL to programming languages.

In addition to a simplified configuration, Etch also promises less overhead over the wire, compared to SOAP. In a testbed environment where SOAP was managing around 900 calls a second, Etch generated more than 50,000 messages in a one-way mode, and 15,000 transactions with a full round-trip, company officials stated.

Oh good, the “performance presumption.” So now we’re back to where we were a decade ago, at least as far as message transfer rates go. I wonder if Etch also solves the problem that the bottlenecks usually lie elsewhere?

The Etch integration into Visual Studio and Eclipse will be very familiar to anyone who has used SOAP integration tools. After authoring the IDL definition, the developer tells the IDE to generate either a client stub or a server skeleton. The client stub is usable immediately; the developer needs only to configure the transport and endpoint, and to code the message calls.

On the server, the developer takes the skeleton and implements the business logic that lives inside the message handlers.

Now that’s what I call innovation!

Projects implementing their communications using Etch aren’t out of luck if they need to interoperate with SOAP, JSON, REST or other existing protocols. Cisco has already demonstrated the capability to easily create bridges between Etch and SOAP, according to Marascio. He said that turnkey bridges to SOAP and REST should be available six to nine months after the release of Etch.

Or, to put it another way: Etch is really just adding more stuff to be developed, tested, deployed, managed, maintained, and integrated, yet it doesn’t actually solve any new problems or solve any old problems better than what already exists.

Cisco also is examining the possibility of establishing Etch as a standard. Marascio pointed out that Cisco is well represented in the IETF, the main standards body for Internet protocols. Alternatively, Cisco might attempt to promote Etch as an industry standard, an effort that would be aided by Etch’s open source nature.

Well of course you want to standardize it — where would any new NIH RPC protocol be without an accompanying standards effort? Rather than the IETF, though, perhaps you ought to get those ISO OOXML guys to rubber-stamp it?

I find it hard to believe that in 2008 people are still inventing stuff like this. Sheesh. Color me underwhelmed.

Given that a number of statements Ted’s made about Erlang in this discussion simply aren’t true, it’s quite clear Ted has never written any production Erlang code. [Update: Patrick Logan has posted a detailed analysis of Ted’s misunderstandings of Erlang.] Being a long-time author, it bothers me when people write authoritatively on topics they have no business writing about, so my only goal with my responses in this conversation has simply been to set the record straight with respect to Erlang. Ted originally said Erlang was a study in concurrency; I merely pointed out that it was more importantly a study in reliability. That’s really not even debatable. Unfortunately, it’s turned into a frustrating one-sided conversation because Ted lacks any detailed knowledge of Erlang, so he keeps unhelpfully trying to shift the focus elsewhere.

In his past two responses, Ted has picked at my questions like a grammar school English teacher, accusing me of conflating things, making bad assumptions, etc. I see that Patrick Logan is trying to clarify things, which might help. Yet Ted still hasn’t adequately explained why he’s taken such a hard stance against reliability being a fundamental feature of Erlang, nor how UNIX processes and Erlang processes are the same, as he keeps asserting, nor has he explained why he thinks it’s much, much harder to make an Erlang application manageable and monitorable than it is to build Erlang’s reliability into other systems like the JVM or Scala.

But now, we see the worst: the “programmer challenge.” Ugh. Thankfully, I’m sure most readers know that a programming contest of the sort Ted proposes would prove absolutely nothing. I guess he proposed it because I mentioned how I recently spent a quarter of a day making an Erlang application monitorable, in response to his continued claims that doing so is really hard, so now he wants to make a competition of it. I’d rather that you just explain, Ted, the experiences you’ve had that have led you to claim that Erlang applications can’t be easily managed or monitored. Better yet, since you’re the one who wants a contest, and given that you’re the one making all the claims, why don’t you go off and see how quickly you can build Erlang’s reliability into Scala and the JVM, since you claim it’s so simple?

If you’re a regular reader of Ted’s blog, you know that Ted generally offers good advice and you can learn useful things from him. He’s a good writer and a wonderful conference presenter, as he can make hard concepts easier to grok and generally does so with humor to keep you awake. But I feel that anyone in Ted’s position has a responsibility to avoid passing off incorrect information to his readers as fact. My advice therefore is simply that you don’t take what Ted says as gospel for this particular topic. Let me assure you that Erlang offers far, far more value than just exceptional concurrency support, which is where Ted’s initial posting in this thread seemed to want to limit it, and which is all I objected to. Unlike Ted, I’ve written quite a bit of Erlang code, and I use it every single day. If you write distributed systems, you owe it to yourself to explore Erlang’s capabilities and features. I’ve been writing and researching middleware and distributed systems for nearly 20 years now, and I’ve seen a lot over the years. Erlang is by far the most innovative and sound approach to distributed systems development I’ve ever seen and experienced — the trade-offs its designers chose are simply excellent. Like I’ve said numerous times over the past year, I really wish I’d found Erlang a decade ago, because I know for certain it would have saved my teams and me countless hours of development time.

Erlang’s reliability model–that is, the spawn-a-thousand-processes model–is not unique to Erlang. In fact, it’s been the model for Unix programs and servers, most notably the Apache web server, for decades. When building a robust system under Unix, a master-slave model, in which a master process spawns (and monitors) n number of child processes to do the actual work, offers that same kind of reliability and robustness. If one of these processes fail (due to corrupted memory access, operating system fault, or what-have-you), the process can simply die and be replaced by a new child process.

There’s really no comparison between the UNIX process model (which BTW I hold in very high regard) and Erlang’s approach to achieving high reliability. They are simply not at all the same, and there’s no way you can claim that UNIX “offers that same kind of reliability and robustness” as Erlang can. If it could, wouldn’t virtually every UNIX process be consistently yielding reliability of five nines or better?

Obviously, achieving high reliability requires at least two computers. On those systems, what part of the UNIX process model allows a process on one system to seamlessly fork child processes on another and monitor them over there? Yes, there are ways to do it, but would anyone claim they are as reliable and robust as Erlang’s approach? I sure wouldn’t. Also, UNIX pipes provide IPC for processes on the same host, but what about communicating with processes on other hosts? Yes, there are many, many ways to achieve that as well — after all, I’ve spent most of my career working on distributed computing systems, so I’m well aware of the myriad choices here — but that’s actually a problem in this case: too many choices, too many trade-offs, and far too many ways to get it wrong. Erlang can achieve high reliability in part because it solves these issues, and a whole bunch of other related issues such as live code upgrade/downgrade, extremely well.

Ted continues:

There is no reason a VM (JVM, CLR, Parrot, etc) could not do this. In fact, here’s the kicker: it would be easier for a VM environment to do this, because VM’s, by their nature, seek to abstract away the details of the underlying platform that muddy up the picture.

In your original posting, Ted, you criticized Erlang for having its own VM, yet here you say that a VM approach can yield the best solution for this problem. Aren’t you contradicting yourself?

It would be relatively simple to take an Actors-based Java application, such as that currently being built in Scala, and move it away from a threads-based model and over to a process-based model (with the JVM constuction[sic]/teardown being handled entirely by underlying infrastructure) with little to no impact on the programming model.

Would it really be “relatively simple?” Even if what you describe really were relatively simple, which I strongly doubt, there’s still no guarantee that the result would help applications get anywhere near the levels of reliability they can achieve using Erlang.

As to Steve’s comment that the Erlang interpreter isn’t monitorable, I never said that–I said that Erlang was not monitorable using current IT operations monitoring tools. The JVM and CLR both have gone to great lengths to build infrastructure hooks that make it easy to keep an eye not only on what’s going on at the process level (“Is it up? Is it down?”) but also what’s going on inside the system (“How many requests have we processed in the last hour? How many of those were successful? How many database connections have been created?” and so on). Nothing says that Erlang–or any other system–can’t do that, but it requires the Erlang developer build that infrastructure him-or-herself, which usually means it’s either not going to get done, making life harder for the IT support staff, or else it gets done to a minimalist level, making life harder for the IT support staff.

I know what you meant in your original posting, Ted, and my objection still stands. Are you saying here that all Java and .NET applications are by default network-monitoring-friendly, whereas Erlang applications are not? I seem to recall quite a bit of effort spent by various teams at my previous employer to make sure our distributed computing products, including the Java-based products and .NET-based products, played reasonably well with network monitoring systems, and I sure don’t recall any of it being automatic. Yes, it’s nice that the Java and CLR guys have made their infrastructure monitorable, but that doesn’t relieve developers of the need to put actual effort into tying their applications into the monitoring system in a way that provides useful information that makes sense. There is no magic here, and in my experience, even with all this support, it still doesn’t guarantee that monitoring support will be done to the degree that the IT support staff would like to see.

And do you honestly believe Erlang — conceived, designed, implemented, and maintained by a large well-established telecommunications company for use in highly-reliable telecommunications systems — would offer nothing in the way of tying into network monitoring systems? I guess SNMP, for example, doesn’t count anymore?

(Coincidentally, I recently had to tie some of the Erlang stuff I’m currently working on into a monitoring system which isn’t written in Erlang, and it took me maybe a quarter of a workday to integrate them. I’m absolutely certain it would have taken longer in Java.)

But here’s the part of Ted’s response that I really don’t understand:

So given that an execution engine could easily adopt the model that gives Erlang its reliability, and that using Erlang means a lot more work to get the monitorability and manageability (which is a necessary side-effect requirement of accepting that failure happens), hopefully my reasons for saying that Erlang (or Ruby’s or any other native-implemented language) is a non-starter for me becomes more clear.

Ted, first you state that an execution engine could (emphasis mine) “easily adopt the model that gives Erlang its reliability,” and then you say that it’s “a lot more work” for anyone to write an Erlang application that can be monitored and managed? Aren’t you getting those backwards? It should be obvious that in reality, writing a monitorable Erlang app is not hard at all, whereas building Erlang-level reliability into another VM would be a considerably complicated and time-consuming undertaking.