Mark Little's WebLog

I work for Red Hat, where I lead JBoss technical direction and research/development. Prior to this I was SOA Technical Development Manager and Director of Standards. I was Chief Architect and co-founder at Arjuna Technologies, an HP spin-off (where I was a Distinguished Engineer). I've been working in the area of reliable distributed systems since the mid-80's. My PhD was on fault-tolerant distributed systems, replication and transactions. I'm also a Professor at Newcastle University and Lyon.

Friday, January 06, 2017

Following on from the last entry, we shall now consider one implementation of a communication subsystem which provides some of the delivery properties described previously. It is important to understand how these ordering requirements can be met, and the overhead which is involved in guaranteeing them, before we discuss how such communication primitives can be used to provide replicated object groups. Other reliable communication subsystems exist, of which [Chang 84][Cristian 85][Cristian 90][Verissimo 89] are a sample, but we shall consider Psync because it illustrates many points clearly.

Psync

Psync [Peterson 87][Mishra 89] ("pseudosynchronous") is a communication subsystem designed to provide reliable multicast communication between objects, and is based on the message history approach described above. The system assumes that operations on objects which change the state occur atomically and are idempotent. Associated with each object is a manager process. A client process locates a particular manager (perhaps by consulting a naming service) and then invokes operations on the object by sending requests to that manager. When a manager receives a request to invoke a particular operation on an object, it encapsulates the operation in a message and, if the object is a member of a group, uses the Psync many-to-many communications protocol to forward the message to all of the managers involved (including itself). Based on the set of received messages, each manager can then decide on an order in which to apply the operations to its copy of the object. This protocol can be extended to be used for the interactions of replicated object groups, and the exact details of the replication protocol used in Psync will be described in a later posting.

Conversations and Context Graphs

Psync explicitly preserves the partial ordering of messages exchanged among a collection of processes in the presence of communication and processor failures (Psync cannot function in the presence of network partitions). A collection of processes exchange messages through a conversation abstraction. This conversation is defined by a directed acyclic graph (a context graph) that preserves the partial order of the exchanged messages. This ordering is made available to all managers involved in a conversation and by using this they can determine when to execute operations on their local objects.

When processes communicate they do so by sending messages in the context of those messages they have already received. Participants are able to receive all messages sent by the other participants in the conversation but they do not receive the messages that they themselves send. Each participant in a conversation has a view of the context graph that corresponds to those messages it has sent or received. The semantics of the communications primitives provided by Psync are defined in terms of the context graph and a participant's view.

The figure below shows an example of a context graph. This conversation is started with the initial message m1. Messages m2 and m3 were sent by processes that had received m1, but are independent of each other (hence no link between them), and m4 was sent by a process that had received m1 and m3, but not m2. Messages that are not in the context of some other message are said to be concurrent (occur at the same logical time). The relation < is defined to be "in the context of".

The context graph contains information about which processes have received which messages. The receipt of a message implies that its sender has seen all of the messages which came before it in the context graph. A message mp sent by process p is said to be stable if for each participant q ≠ p in the conversation, there exists a vertex mq in the context graph sent by q such that mp < mq. A stable message has thus been received by all participants except its sender, and it follows that all future messages sent to the conversation must be in the context of the stable message.
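To make the stability rule concrete, here is a small Python sketch; the class and method names are mine, not Psync's API, and the senders in the example are chosen purely for illustration. A message is stable once every other participant has sent something in its context.

```python
# A minimal sketch of a Psync-style context graph (illustrative names,
# not Psync's actual interface). Each message records the messages it was
# sent "in the context of", i.e. those its sender had already received.

class ContextGraph:
    def __init__(self, participants):
        self.participants = set(participants)
        self.preds = {}    # message id -> set of ids it directly depends on
        self.sender = {}   # message id -> sending participant

    def add(self, msg_id, sender, in_context_of=()):
        self.preds[msg_id] = set(in_context_of)
        self.sender[msg_id] = sender

    def precedes(self, a, b):
        """True if a < b, i.e. b was sent in the (transitive) context of a."""
        stack = list(self.preds[b])
        seen = set()
        while stack:
            m = stack.pop()
            if m == a:
                return True
            if m not in seen:
                seen.add(m)
                stack.extend(self.preds[m])
        return False

    def is_stable(self, m):
        """m is stable once every other participant has sent some message
        in the context of m (so all of them must have received m)."""
        sender = self.sender[m]
        for q in self.participants - {sender}:
            if not any(self.sender[x] == q and self.precedes(m, x)
                       for x in self.preds):
                return False
        return True

# The conversation from the first figure, with senders picked for
# illustration: m2 and m3 are concurrent, m4 depends on m1 and m3.
g = ContextGraph({"a", "b", "c"})
g.add("m1", "a")
g.add("m2", "b", in_context_of={"m1"})
g.add("m3", "c", in_context_of={"m1"})
g.add("m4", "a", in_context_of={"m3"})
```

Here m1 is stable (b and c have both sent messages in its context), while m3 is not, since nothing from b depends on it yet.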

In the figure below, we have a context graph which depicts a conversation between three participants, a, b, and c. Messages a1, a2, ... denote the sequence of messages sent by process a, and so on. Messages a1, b1, and c1 are the only stable messages; messages a2 and a3 are two unstable messages sent by process a.

Psync maintains a copy of a conversation's context graph at each host on which a participant in the conversation resides. Each process in the conversation receives messages from this local copy of the context graph, which is termed the image. Whenever a process at one host sends a message, Psync propagates a copy of the message to each of the hosts in the conversation. This message contains information about all messages upon which the new message depends, so that the receiving hosts can append it to their context graphs in the correct place.

Dealing with Network and Host Failures

Suppose message m is not delivered to some host because of a network failure. If at some future time a message m' arrives that depends on m, then the host will detect that it is missing m and will send a retransmission request to the host that sent m' (this host is guaranteed to have m, since a local participant has just sent a message which is in the context of it).
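This recovery rule can be sketched as follows; the fragment is illustrative only, not Psync's actual interface, and the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the recovery rule (names are mine, not Psync's):
# a host only appends a message to its image once every message it depends
# on is present; a missing dependency triggers a retransmission request to
# the host that sent the newer message, which is guaranteed to hold a copy.

def on_receive(image, msg_id, sender_host, deps, request_retransmit):
    missing = [d for d in deps if d not in image]
    for d in missing:
        # sender_host just sent a message in the context of d, so it has d
        request_retransmit(sender_host, d)
    if not missing:
        image[msg_id] = set(deps)   # append to the local context graph
        return True
    return False                    # held back until the gaps are filled

image = {"m1": set()}               # m3 was lost; m4 depends on it
requests = []
ok = on_receive(image, "m4", "hostB", {"m3"},
                lambda host, dep: requests.append((host, dep)))
print(ok, requests)   # False [('hostB', 'm3')]
```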

The operations provided to aid applications in recovering from host failures include the ability for a participant to remove a failed participant from its definition of the participant set for a conversation. This is necessary so that messages will eventually stabilize relative to the functioning participants. Once a given participant has been masked out, Psync ignores all further messages from that process.

There is also an inverse operation that allows a participant to rejoin a participant set. It would be invoked when a participant becomes aware that another participant which had formerly failed has now recovered.

When a host recovers, a participant can initiate a recovery action which will inform other participants that the invoking participant has restarted, and to initiate reconstruction of the local image of the context graph. Each member of the conversation will transmit its local copy of the context graph to the recovering participant who can then use this to reconstruct its own local copy.

Total Ordering

As described, the Psync protocol only gives a partial ordering of messages, i.e., only the causal ordering of messages is preserved. To convert a partial order into a total order, whereby messages which are not causally related are ordered identically at all overlapping destinations, requires additional information to be shared amongst the destinations which indicates the order in which to place such messages. In Psync, the context graph which accompanies each message provides this information. The partial order that Psync provides can be used to give a total order if all participants perform the same topological sort of the context graph. This sort must be incremental, i.e., each process waits for a portion of its view to stabilize before allowing the sort to proceed. This is done to ensure that no future messages sent to the conversation will invalidate the total ordering. The replication protocol used in Psync uses just such a scheme and will be described in a later article.
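One way to picture such an incremental scheme (an assumed sketch, not Psync's exact algorithm) is a topological sort over the stable portion of the graph, with a deterministic tie-break between concurrent messages so that every participant produces the same sequence:

```python
# An assumed sketch of an incremental topological sort: only stable
# messages are considered, so no future message can invalidate the order,
# and ties between concurrent messages are broken deterministically
# (here, by message id) so all participants agree.

def total_order(preds, stable):
    """preds: msg -> set of direct predecessors; stable: stable messages."""
    delivered, order = set(), []
    while True:
        ready = sorted(m for m in stable
                       if preds[m] <= delivered and m not in delivered)
        if not ready:
            break
        m = ready[0]              # deterministic tie-break
        delivered.add(m)
        order.append(m)
    return order

preds = {"m1": set(), "m2": {"m1"}, "m3": {"m1"}, "m4": {"m3"}}
# Suppose m1, m2 and m3 have stabilised but m4 has not:
print(total_order(preds, {"m1", "m2", "m3"}))   # ['m1', 'm2', 'm3']
```

The concurrent messages m2 and m3 are ordered the same way everywhere because every participant applies the same tie-break to the same stable set.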

Tuesday, January 03, 2017

The next few entries in the series will consider some aspects of multicast protocols.

As described in [Shrivastava 90a], the latency of a multicast service is defined to be the time taken for a message, once sent, to reach the destination processes. This latency is particularly important for protocols providing reliability and ordering guarantees. As we shall see, whereas the latency for an unreliable multicast service is bounded (typically of the order of a few milliseconds), the latency for a multicast service which operates in the presence of failures (message losses and node crashes) can be bounded or unbounded depending upon the implementation.

Existing order preserving protocols can be broadly classified in the following way:

message history based: the main idea behind such protocols is that when a process sends a message it appends some historical information about the messages it has received in its recent past. This historical information enables the receivers to retrieve any missing messages and to order them properly. This type of protocol ensures that an incomplete multicast is eventually completed, and hence possesses an unbounded latency property.

centralised distributors: here the sender delivers the message to a specific member of the group (the primary) who is then responsible for distributing the message to the fellow members of the group. The primary can assign a total order to the messages it receives. As we have already seen, failure detection mechanisms are necessary to detect failed primaries and to elect new primaries which can take over and complete the multicasts. Such protocols can possess bounded latency, but the necessity to detect asynchronously occurring failures can impose an overhead on performance.

multi-phase commit: these protocols, providing total order, use multi-phase algorithms (typically 2 or 3 message rounds) which are similar to the two-phase algorithm described earlier for atomic action commits. The sender delivers the message to the destinations, which return sufficient information to the sender about the messages that they have received so that in the subsequent rounds of the protocol the sender can impose an identical ordering of the messages on the destinations. The message is only considered to have been delivered if all of the phases of the protocol complete. Such protocols provide bounded latency.

clock-based: these protocols are an important class of the multi-phase algorithms, and assume the existence of a global time base. Timestamps derived from such a time base can then be used for imposing a total order on messages. Such protocols can provide constant latency communication, having the attractive property that if a sender multicasts a message at clock time T, then it can be sure that all functioning receivers will have received the message by clock time T + ∆, where ∆ is the constant indicating the protocol latency (∆ must be determined by applying worst case timing and failure assumptions).
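As a toy illustration of the clock-based idea (the code and its constants are my own, and it glosses over clock synchronisation entirely): each message carries its send timestamp T, and a receiver delivers it only once its local clock passes T + ∆, in timestamp order.

```python
# A toy, assumed sketch of clock-based delivery: by local time T + DELTA
# every message stamped T (or earlier) must have arrived, so delivering
# in timestamp order yields the same total order at every receiver.

DELTA = 10  # protocol latency bound, from worst-case timing assumptions

def deliverable(buffered, now):
    """Return the messages whose delivery time has passed, in timestamp
    order (ties broken by sender id so that all receivers agree)."""
    ready = [(ts, sender, payload) for (ts, sender, payload) in buffered
             if now >= ts + DELTA]
    return sorted(ready)

buf = [(5, "S2", "m2"), (3, "S1", "m1"), (12, "S1", "m3")]
print(deliverable(buf, now=16))   # [(3, 'S1', 'm1'), (5, 'S2', 'm2')]
```

Message m3, stamped at time 12, is held back until the clock reaches 22.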

We shall next examine a system which provides a reliable multicast protocol using a message history based approach.

Monday, January 02, 2017

I want to take a quick break from the other series for a moment to do something I've been remiss about for a while: addressing a recent report by Gartner on the state of Java EE. Before doing so I decided the other day to read the article my friend and colleague John Clingan wrote a few weeks ago, and I realised that John had done a great job already. In fact it was so good I couldn't see myself adding much to it, but I'll try below. Note, there are other responses to this report and I haven't had a chance to read them, but they might be just as good.

To start with, it's worth noting that I've known Anne for a number of years and we've had one or two disagreements during that time but also some agreements. Her announcements about the death of SOA seven years or so ago fall somewhere in between, for example. It's also important to realise, if you don't already, that I do believe everything has a natural cycle ("the circle of life") and nothing lasts forever; whether it's dinosaurs giving way to mammals (with some help from a meteor and/or the Deccan Traps), or CORBA shifting aside for J2EE, evolution is a fact of life. Therefore, whilst I disagree with Anne about the short-to-medium term future of Java EE, longer term it will pass into history. However, before doing so it will evolve and continue to influence the next generation of technologies, just as the dinosaurs became the birds and aspects of CORBA evolved into J2EE. For more details on my thinking on this topic over the years I leave it as an exercise to the reader to check this blog or my JBoss blog.

John covers my next point very well but it's worth stressing: I've mentioned on many occasions that my background is heavily scientific and I continue to try to employ objective scientific method to everything I do. So when I see blanket statements like "fade in relevance" or "lighter-weight approaches", whether it's in a Gartner report, a paper I'm reviewing as a PC member, or even something one of my team is writing, I immediately push back if there's no supporting evidence. And I read this report several times searching for it but kept coming up empty handed. It was one subjective statement after another, with no real attempt to justify them. That's disappointing because with my scientific hat on I love to see facts backing up theories, especially if those theories contradict my own. That's how we learn.

One other thing I did get from the report, and I'm prepared to admit this may be my own subjective reading, was that it seemed like a vague attempt to look into the future for middleware frameworks and stacks without really understanding what has brought us to where we are today. I've written about this many times before but weak assertions by the report that "modern applications" are somehow so different from those we've developed for the last 50 years that existing approaches such as Java EE are not useful really don't make any sense if you bother to think about it objectively. Distributed applications, services and components need to communicate with each other, which means some form of messaging (hopefully reliable, scalable and something which performs); there's really no such thing as a stateless application, so you'll need to save data somewhere (again, reliable, scalable and performant); hey, maybe you also want some consistency between applications or copies of data, so transactions of one form or another might be of help. And the list goes on.

Of course application requirements change over time (e.g., I recall doing research back in the 1980's where scale was measured in the 10's of machines), but new applications and architectures don't suddenly spring into being and throw away all previous requirements; that too is evolution. I've presented my thoughts on this over the past couple of years at various conferences. In some ways you can consider Java EE a convenient packaging mechanism for these core services, which are typically deployed in a co-located (single process) manner. Yet if you look beyond the Java EE "veneer" you can still see the influence of enterprise distributed systems that predate it and also the similarities with where some next generation efforts are heading.

I suppose another key point which John also made well was that the report fails miserably to differentiate between Java EE as a standard and the various implementations. I'm not going to cover the entire spectrum of problems I have with this particular failure, but of course lightweight is one of them. Well over 5 years ago I wrote about the things we'd done years previously to make AS7 run on a plug computer or Android phone, and we've kept that level of innovation up throughout the intervening time. There's nothing in the Java EE standard which says you have to build a bloated, heavyweight application server and we, as well as other implementations, have proven that time and time again. It's disappointing that none of this is reflected in the report. I suppose it's one of those terms the authors hope that if they say often enough, people will attribute their own subjective opinions and assumptions to it.

Another one of those throw-away statements the report is good at is that "[...] at this point, Java developers eschew Java EE". I admit not every Java developer wants to use Java EE; not every Java developer wanted to use J2EE either. But you know what? Many Java developers do like Java EE and do want to use it. You only have to go to conferences like JavaOne, Devoxx or other popular Java events to find them in abundance. Or come and talk to some of our customers or those of IBM, Payara, Tomitribe or other members of the MicroProfile effort.

I could go on and on with this but the entry has already grown larger than I expected. John did a great job with his response and you should go read that for an objective analysis. Probably the only positive thing I could attribute to the original Gartner report is that its very existence proves that time travel is possible! It's a theory which fits the facts: a report which criticises something based on data which is largely a decade old means it was probably written a decade ago.

Sunday, January 01, 2017

Invocations on objects which are not replicated are traditionally based on the RPC as this retains the correct semantics of a procedure call, i.e., a single flow (thread) of control from caller to callee and back again (as with a traditional procedure call). The previous entry described the concept of the Remote Procedure Call, and the simplified model of client-server interaction shown in the figure below will be assumed for the discussion to follow: a client uses the primitives send_request() for sending a call request and receive_result() for receiving the corresponding results. Clients and servers maintain enough state information to recognize and discard duplicate messages (filter requests). The server maintains a queue of messages from possibly multiple clients, and uses the primitive receive_request() to pick a message from the queue in FIFO order. After invoking the right method, the result is sent to the client with the send_result() primitive.

When making replicated invocations (such as when calling a replica group) the semantics of such communication differ considerably from that of the traditional RPC: there is no longer a single thread of control, but rather multiple threads which may all eventually return to the caller. Such invocations are typically referred to as Replicated Procedure Calls [Cooper 84a][Cooper 85], and can be implemented using one-to-many (or multicast) communication facilities. We discuss various aspects of multicast communication below.

One-to-Many Communication

The main services a multicast protocol provides can be categorised into three classes: ordering, reliability and latency. By imposing (increasing) ordering and reliability constraints on the delivery of multicast messages it is possible to define increasingly sophisticated protocols (typically at the expense of the latency). To understand these protocols first assume that a sender S is attempting to multicast to a group G = {P1,...,Pn}. Following the definitions outlined in [Shrivastava 90b][ANSA 90]:

Unordered and Unreliable

A multicast from S will be received by a subset of functioning nodes Pi ∈ G. Successive multicasts from S will be received in an arbitrary order at the destinations. The next figure shows sender S multicasting messages m1 and m2 to the group G. The first message is received by P2 and Pn in different orders, and message m2 is not received by P1.

FIFO Multicast

Provided the sender does not crash while transmitting the message, all correctly functioning receivers are guaranteed to get the message. Furthermore, the multicasts will be received in the order they were made.

The next figure shows two senders (S1 and S2) multicasting to the group G. All members of G received m1 before m2, but some members may receive m3 before m2. This last ordering is correct given the definition of the protocol: no information about the relative ordering of multicasts between senders is available to the receivers.
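A common implementation of FIFO delivery, sketched here in Python purely for illustration, is for each sender to number its multicasts and for each receiver to buffer any message that arrives ahead of sequence:

```python
# Illustrative sketch (not from the text above): per-sender sequence
# numbers give FIFO order. A receiver delivers messages from each sender
# strictly in sequence, holding back any that arrive early. Note there is
# no ordering between different senders, matching the FIFO definition.

class FifoReceiver:
    def __init__(self):
        self.next_seq = {}   # sender -> next expected sequence number
        self.held = {}       # (sender, seq) -> payload
        self.delivered = []

    def receive(self, sender, seq, payload):
        self.held[(sender, seq)] = payload
        n = self.next_seq.setdefault(sender, 0)
        # Deliver as long as the next expected message is available.
        while (sender, n) in self.held:
            self.delivered.append(self.held.pop((sender, n)))
            n += 1
        self.next_seq[sender] = n

r = FifoReceiver()
r.receive("S1", 1, "m2")   # arrives out of order: held back
r.receive("S1", 0, "m1")   # now both can be delivered, in order
print(r.delivered)          # ['m1', 'm2']
```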

Atomic multicast

If the sender does not crash before completing a multicast, the message is guaranteed to be received by all functioning members. If, however, the sender crashes during a multicast, then it is guaranteed that the message is received by either all or none of the functioning processes (atomic delivery). All multicasts from the same sender are received in the order they were made.

Causal multicast

This multicast extends the ordering property of the Atomic multicast to causally related sends from different senders while still meeting the reliability guarantee. [Lamport 78] was the first to introduce the concept of potential causal relationships into computer interactions and showed what effects these relationships can have on the operations of interacting processes. Two events are potentially causally related if information from the first event could have reached the second event before it occurred. The notation used to denote such relationships is typically X → Y, where → means precedes (happened before). Note that if X and Y are events from the same process and Y follows X then Y is necessarily causally related to X. A causal communication system will only preserve an ordering of events if the order is causally related. If two events are not related in this way then there is no guarantee on the delivery order.

In the figure above, S1 is multicasting to the groups G1 and G2, and P1 is multicasting to group G1, where G1 = {P2, P3} and G2 = {P1, P4}. There is a potential flow of information from send(m1, G1) to send(m2, G2), and from send(m2, G2) to send(m3, G1). This means that the sending of m3 by P1 is potentially causally related to the sending of m1 by S1. Hence the causal multicast protocol must ensure that all functioning members of G1 receive m1 before m3. Events such as m3 and m4 which are not causally related can be received in any order (they are termed concurrent).

Totally ordered multicast

The (partial) causal order can be extended to a total order of messages such that all messages (whether causally related or not) are received by all destinations in the same order (which must also preserve causality).
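Causal delivery is commonly implemented with vector clocks, a standard technique which the text above does not prescribe; the sketch below shows the usual delivery condition, with the numbers chosen to mirror the G1/G2 example.

```python
# An illustrative sketch of causal delivery using vector clocks (a standard
# technique, not specified by the protocols above). Each message carries
# the sender's vector timestamp; a receiver holds a message back until it
# has delivered everything the message causally depends on.

def can_deliver(msg_vc, msg_sender, local_vc):
    """Deliverable iff msg is the next message from its sender and we have
    already delivered every message the sender had seen from others."""
    for p, t in msg_vc.items():
        if p == msg_sender:
            if t != local_vc.get(p, 0) + 1:
                return False
        elif t > local_vc.get(p, 0):
            return False
    return True

# A receiver that has delivered one message from S1 and none from P1:
local = {"S1": 1, "P1": 0}
# m3 from P1 was sent after P1 had seen two messages from S1, so it
# must wait until the second message from S1 has been delivered:
assert not can_deliver({"S1": 2, "P1": 1}, "P1", local)
# The next message from S1 itself is deliverable immediately:
assert can_deliver({"S1": 2, "P1": 0}, "S1", local)
```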

Saturday, December 31, 2016

This is the first entry in the series I mentioned earlier. I've tried to replace the references with links to the actual papers or PhD theses where possible, but some are not available online.

Remote Procedure Call

The idea behind the Remote Procedure Call (RPC) [Birrell 84] is that conventional procedure calls are a well known and well understood mechanism for the transfer of data and control within a program running on a single processor. When a remote procedure is invoked, the calling process is suspended, any parameters are passed across the network to the node where the server resides, and then the desired procedure is executed. When the procedure finishes, any results are passed back to the calling process, where execution resumes as if returning from a local procedure call. Thus the RPC provides the system or application programmer a level of abstraction above the underlying message stream. Instead of sending and receiving messages, the programmer invokes remote procedures and receives return values.

The figure shows a client and server interacting via a Remote Procedure Call interface. When the client makes the call it is suspended until the server has sent a reply. To prevent the sender being suspended indefinitely the call can have a timeout value associated with it: after this time limit has elapsed the call could be retried or the sender could decide that the receiver has failed. Another method, which does not make use of timeouts in the manner described, instead relies on the sender and receiver transmitting additional probe messages which indicate that they are alive. As long as these messages are acknowledged then the original call can continue to be processed and the sender will continue to wait.
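As a minimal illustration of the call-with-timeout idea, here is a Python sketch in which a local thread stands in for the remote server (a real RPC stub would, of course, marshal arguments over the network):

```python
# Illustrative sketch only: a thread stands in for the remote server so
# the blocking-call-with-timeout behaviour can be shown self-contained.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def remote_procedure(x):        # stands in for the server-side procedure
    return x * 2

def rpc_call(fn, arg, timeout):
    """Suspend the caller until a reply arrives or the timeout elapses."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, arg)
        try:
            return future.result(timeout=timeout)   # caller blocks here
        except TimeoutError:
            # At this point the caller could retry the call or decide
            # that the receiver has failed.
            raise

print(rpc_call(remote_procedure, 21, timeout=1.0))   # 42
```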

Groups

[ANSA 90][ANSA 91a][Liang 90][Olsen 91] describe the general role of groups in a distributed system. Groups provide a convenient and natural way to structure applications into a set of members cooperating to provide a service. They can be used as a transparent way of providing fault tolerance using replication, and also as a way of dividing up a task to exploit parallelism.

A group is a composite of objects sharing common application semantics as well as the same group identifier (address). Each group is viewed as a single logical entity, without exposing its internal structure and interactions to users. If a user cannot distinguish the interaction with a group from the interaction with a single member of that group, then the group is said to be fully transparent.

Objects are generally grouped together for several reasons: abstracting the common characteristics of group members and the services they provide; encapsulating the internal state and hiding interactions among group members from the clients so as to provide a uniform interface (group interface) to the external world; using groups as building blocks to construct larger system objects. A group may be composed of many objects (which may themselves be groups), but users of the group see only the single group interface. [ANSA 90] refers to such a group as an Interface Group.

An object group is defined to be a collection of objects which are grouped together to provide a service (the notion of an abstract component) and accessible only through the group interface. An object group is composed of one or more group members whose individual object interfaces must conform to that of the group.

Interfaces are types, so that if an interface x has type X and an interface y has type Y, and X conforms to Y, then x can be used where y is used. This type conformance criterion is similar to that in Emerald [Black 86]. In the rest of this thesis, we shall assume for simplicity that a given object group is composed of objects which possess identical interfaces (although their internal implementations could be different).

The object group concept allows a service to be distributed transparently among a set of objects. Such a group could then be used to support replication to improve reliability of service (a replica group), or the objects could exploit parallelism by dividing tasks into parallel activities. Without the notion of the object group and the group interface through which all interactions take place, users of the group would have to implement their own protocols to ensure that interactions with the group members occur consistently e.g., to guarantee that each group member sees the same set of update requests.

By examining the different ways in which groups are required by different applications, it is possible to define certain requirements which are imposed on groups and the users of groups (e.g., whether collation of results is necessary from a group used for reliability purposes). [ANSA 91a] discusses the logical components which constitute a generic group, some of which may not be required by every group for every application. These components are:

an arbiter, which controls the order in which messages are seen by group members.

a distributor/collator, which collates messages going out of the group, and distributes messages coming into the group.

member servers, which are the actual group members to which invocations are directed.

For some applications collation may not be necessary e.g., if it can be guaranteed that all members of a group will always respond with the same result. As we shall see later, if the communication primitives can guarantee certain delivery properties for messages, then arbitration may also not be necessary. In general, all of these components constitute a group. In the rest of this thesis the logical components will not be mentioned explicitly, and the term group member will be used to mean a combination of these components.

Multicast Communication

Conventional RPC communication is a unicast call since it involves one-to-one interaction between a single client and a single server. However, when considering replication it is more natural to consider interactions with replica groups. Group communication is an access transparent way to communicate with the members of such a group. Such group communication is termed multicasting [Cheriton 85][Hughes 86].

Multicast communication schemes allow a client to send a message to multiple receivers simultaneously. The receivers are members of a group which the sender specifies as the destination of the message. A broadcast is the general case of a multicast whereby, instead of specifying a subset of the receivers in the system, every receiver is sent a copy.

Most multicast communication mechanisms are unreliable as they do not guarantee that delivery of a given message will occur even if the receiver is functioning correctly (e.g., the underlying communication medium could lose a message). When considering the interaction of client and replica group (or even replica group to replica group communication) such unreliable delivery can cause problems in maintaining consistency of state between the individual replicas, complicating the replication control protocol (if one replica fails to receive a given state-modifying request but continues to receive and respond to other requests, the resulting state divergence could lead to inconsistencies at the clients). Thus, it is natural to consider such group-to-group communication to be carried out using reliable multicasts, which give certain guarantees about delivery in the presence of failures. These can include the guarantee that if a receiver is operational then the message will be delivered even if the sender fails during transmission, and that the only reason a destination will not receive a message is because that destination has failed. By using a reliable multicast communication protocol many of the problems posed by replicating services can be handled at this low level, simplifying the higher level replica consistency protocol.

Over Christmas I was doing some cleaning up of my study and came across a copy of my PhD. Any excuse to stop cleaning, so I took some time to skim through it and thought some of the background research I had included might still be useful today for a new audience. Now although the PhD is available for download, it's not exactly easily searchable or referenced, so the next few entries will try to rectify some of that.

Saturday, October 15, 2016

We've all heard the story of The Three Little Pigs. We all know that building houses from straw isn't a great idea if your use case is to survive a wolf's breath. We all know that sticks aren't that much better either. Of course brick is a far better medium and the pig that built from it survived. Now what the original story doesn't say, probably because it's a little more detail than children really need to understand, is that before building their houses all of the pigs went to architecture school and studied at length about arches, the truss, stylobates and other things necessary to design and then construct buildings from various materials. Now it's likely the pig that used straw didn't listen when they were talking about the tensile strength of straw and the pig that used sticks ignored the warnings that they're really only good for building forest dens (or birds nests). But if they'd been listening as much as their brother then they'd have known to give him a hand with the bricks and not waste their time.

Now even the best architects make mistakes. It's not possible to suggest a single reason for these though. Sometimes it's because the architect didn't understand the physical properties of the material being used (a bit like the straw and stick pigs). Sometimes it's because they didn't fully understand the environment within which their building would reside. But fortunately for us, the vast majority of buildings are successful and we feel safe to be within them or around them.

You may be wondering why I'm talking about architecture here. I think I've mentioned this before, but my biggest worry about the rush towards microservices is the lack of focus, or discussion, around architecture. I'm sure many of the established groups that have been building systems with services (micro or not) understand their architectures and the impact service-orientation has on them, or vice versa. But I'm also convinced that many groups and individuals who are enamoured by the lustre of microservices aren't considering architecture or its implications. That worries me because, as I said at the JavaOne 2016 presentation I gave recently, launching into developing with microservices without understanding their implications and the architecture will neither solve any architecture problems you may have with your existing application nor result in a good/efficient distributed system. In fact it's probably the worst thing you could do!

Even if you've got "pizza teams", have a culture that has embraced DevOps, and have fantastic tools supporting CI and CD, none of this is really going to help you if you don't understand your current and new architecture. That's not to suggest those things aren't important after you've done your architecture design and reviews, because they clearly are. The better they are, the quicker and more reliably you can build your new application using microservices and manage it afterwards. But you should never start such an effort just because you've got the right tools, building blocks and support infrastructure. It could be argued that they're necessary, but they most certainly aren't sufficient. It needs to start with architecture.

Update: I should also have mentioned that if, after an architecture review, you find that you don't need many, or any, microservices, then you shouldn't feel a sense of failure. A good architect (software or not) knows when to use things and when not to. Remember, it's quite probable the pig who used the bricks considered straw and sticks first but decided they just weren't right this time around.

Friday, September 02, 2016

OK, so following on from my previous article on inferring the presence of microservices within an architecture, one possibility would be to view the network traffic. Of course it's no guarantee, but if you follow the good microservices principles defined today, typically your services are distributed and communicate via HTTP (OK, some people say REST, but as usual they tend to mean HTTP). Therefore, if you were to look at the network traffic of an "old style" application (let's not assume it has to be a monolith) and compare it with one that has been re-architected around microservices, it wouldn't be unreasonable to assume that if you saw a lot more HTTP requests flowing then microservices are being used. If the microservices are using some other form of communication, such as JMS, then you'd see something equivalent but with a binary protocol.

We have to recognise that there are a number of reasons why the amount of network traffic may increase from one version of an application to another, so it could be the case that microservices are not being used. However, just as Rutherford did when searching for the atomic nucleus, and as all good scientists do, you come up with a theory that fits the facts and revise it when the facts change. Therefore, for simplicity's sake, we'll assume that this could be a good way to infer that microservices are in place if all other things remain the same, e.g., the application is released frequently, doesn't require a complete re-install/re-build of user code, etc.
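The heuristic above can be reduced to a toy rule of thumb. Everything here is made up for illustration: the function name, the threshold, and the idea of counting HTTP requests per user transaction are my own assumptions, and as the text says, a jump in traffic is only a hint, never proof.

```python
# Toy traffic heuristic: flag a large jump in HTTP requests per user
# transaction between two versions of an application as a *hint* that it
# has been decomposed into microservices. Threshold is arbitrary.

def looks_like_microservices(old_requests_per_txn, new_requests_per_txn, factor=5):
    """Return True if HTTP chatter per transaction grew by at least `factor`."""
    baseline = max(old_requests_per_txn, 1)  # avoid dividing by zero traffic
    return new_requests_per_txn >= factor * baseline
```

So a monolith that made 2 requests per transaction and now makes 40 would trip the rule, whereas 2 growing to 4 would not; all the usual caveats about confounding causes apply.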

Now this leads me to my next question: have you, dear reader, ever bothered to benchmark HTTP, or any distributed interaction, against a purely local (IPC) interaction? I think the majority will say yes, and of those who haven't, most will have a gut instinct for the results. Remote invocations are slower, sometimes by several orders of magnitude. Therefore, even ignoring the fault tolerance aspects, remote invocations between microservices are going to have a performance impact on your application. So you've got to ask: why am I doing this? Or maybe: at what point should I stop?
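If you've never run that benchmark, here is a quick and deliberately unscientific sketch: the same trivial operation invoked as a local function call and then over HTTP on localhost. The server, the operation and the iteration count are all arbitrary choices of mine, but even with no real network in the path, the remote route is typically orders of magnitude slower.

```python
# Rough comparison of a local call vs the same operation over HTTP.
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def add(a, b):
    return a + b


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = str(add(2, 3)).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

N = 200
t0 = time.perf_counter()
for _ in range(N):
    add(2, 3)
local = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    urllib.request.urlopen(url).read()
remote = time.perf_counter() - t0

print("local: %.6fs  remote: %.6fs" % (local, remote))
server.shutdown()
```

The exact ratio will vary by machine, and a loopback HTTP call is still flattering: add serialisation, a real network and failure handling and the gap only widens.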

Let's pause for a second and look back through the dark depths of history. Back before the late 19th/early 20th century, before the electrification of factories really took off, assembling a product from multiple components typically required having those components shipped in from different parts of the country or the world. It was a slow process. If something went wrong and you got a badly built component, it might prevent assembly of the entire product until a new version had been sourced.

In the intervening years some factories stayed with this model (to this day), whereas others moved to a production process whereby all of the pieces were built on site. Some factories became so large, with their constituent pieces being built in their own neighbouring factories, that cities grew up around them. However, the aim was that everything was built in one place so that mistakes could be rectified much more quickly. But are these factories monoliths? I'm not so sure it's that clear cut, simply because some of the examples I know of factories like this are in the Japanese car industry, which has adapted to change and innovation extremely well over the years. I'd say these factories matured.

Anyway, let's jump back to the present day, but remembering the factory example. You could imagine that factories of the type I mentioned evolved towards their co-located strategy over years from the distributed interaction approach (manufacturers of components at different ends of the planet). They managed to evolve because at some point they had all of the right components being built, but the impediment to their sales was time to market or time to react. So bringing everything closer together made sense. Once they'd co-located, then maybe every now and then they needed to interact with new providers in other locations, and if those became long-term dependencies they probably brought them "in house" (or "in factory").

How does this relate to microservices and the initial discussion on distributed invocations? Well, whilst re-architecting around microservices might help your application evolve and be released more frequently, at some point you'll need to rev the components and application less and less. It becomes more mature and the requirements for change drop off. At that stage you'd better be asking yourself whether the overhead of separate microservices communicating via HTTP, or even some binary protocol, is worth it. You'd better be asking yourself whether it would be better to just bring them all "in house" (or in process) to improve performance (and probably reliability and fault tolerance). If you get it wrong then of course you're back to square one. But if you get it right, that shouldn't mean you have built a monolith! You've just built an application which does its job really well and doesn't need to evolve much more.