There are a number of tried-and-tested methods for discovery to be found in DHCP, Bonjour, UPnP, SSDP and DNS-SD. For web-based services, UDDI and WS-Discovery have come and, for the most part, gone.

Gateways like NGINX also provide routing options which can be used for decoupling service-to-service calls.

Enterprise Service Bus systems like NServiceBus and MassTransit can also be used in a pub/sub messaging pattern to decouple service-to-service calls.

I've mentioned just a few but there are many more. You have a lot of options here, so how do you choose?

Let's first briefly cover some different patterns before I cover what we have chosen to use and why.

Centralised Registry vs. Self-Discovery

There are two common patterns that you find in solutions for Service Discovery.

The first is the service registry, a centralised database that stores the location of a service.

The second is self-discovery or auto-discovery, where there is no central database; it is often found in zero-configuration networking. Instead, clients broadcast packets across the network to request a remote service and wait for that service to respond with its location.

The service registry is another single point of failure (SPF) in your infrastructure but can provide more operational control. When used with server-side discovery, which is often found in gateways, it can completely decouple any discovery logic from the services.

Zero-configuration networking can be generous on security within networks to permit devices to 'just work' but can be more challenging to secure as systems span networks. It is often more suitable for smaller networks (UPnP, Bonjour etc.).
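To make the first pattern concrete, here is a minimal sketch of a centralised registry. It is an in-memory toy, and the names `ServiceLocation` and `InMemoryRegistry` are my own illustrations rather than anything from Consul or ServiceStack. A real registry adds persistence, replication and health checks on top of the same idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLocation:
    """Where a service instance can be reached (illustrative type)."""
    name: str
    host: str
    port: int


class InMemoryRegistry:
    """A toy central registry: one shared table mapping names to locations."""

    def __init__(self) -> None:
        self._services: dict[str, list[ServiceLocation]] = {}

    def register(self, location: ServiceLocation) -> None:
        # Registration writes the location into the central table.
        self._services.setdefault(location.name, []).append(location)

    def deregister(self, location: ServiceLocation) -> None:
        instances = self._services.get(location.name, [])
        if location in instances:
            instances.remove(location)

    def lookup(self, name: str) -> list[ServiceLocation]:
        # Discovery is a read from the same table; no broadcast is involved.
        return list(self._services.get(name, []))


registry = InMemoryRegistry()
registry.register(ServiceLocation("orders", "10.0.0.5", 8080))
registry.register(ServiceLocation("orders", "10.0.0.6", 8080))
print(registry.lookup("orders"))
```

Everything depends on that one table being reachable, which is exactly why the registry becomes a single point of failure.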

Communication

There are four common types of service-to-service communication.

Point-to-point: services talk directly to each other.

Gateway: acts as the middleman, handling the routing of requests and responses between services.

Gateway Request: the request is routed through the gateway, but the responding service replies directly to the calling service rather than returning through the gateway.

Message Queue: services publish messages to a queue. The responding service subscribes to the messages published and in turn publishes its response to the queue for the original service to subscribe to.

Point-to-point involves the shortest route so is often the quickest but requires each end-point to take a dependency on your discovery mechanism.

The gateway can decouple many concerns from your services, handling not just routing, but caching, front-end to back-end bridging with HTTPS termination, transport conversions like HTTP to TCP/IP, formats, aggregation and load-balancing, to name just a few.

The message-queue pub/sub model is typically slower, because every message passes through the queue, and is better suited to longer-running processes.
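The message-queue flow is the least obvious of the four, so here is a rough sketch of it. The `Broker` class and the topic names are hypothetical, and delivery here is synchronous and in-process purely to keep the example short; a real queue would be a separate, asynchronous piece of infrastructure.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Broker:
    """A toy in-process message broker keyed by topic name."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver to every subscriber; the publisher never sees their addresses.
        for handler in list(self._subscribers[topic]):
            handler(message)


broker = Broker()
replies: list[dict] = []


def pricing_service(message: dict) -> None:
    # The responding service publishes its reply back onto the broker.
    broker.publish("price.reply", {"sku": message["sku"], "price": 9.99})


broker.subscribe("price.request", pricing_service)
# The original service subscribes to the reply topic.
broker.subscribe("price.reply", replies.append)

broker.publish("price.request", {"sku": "ABC-1"})
print(replies)
```

Neither side knows where the other lives; both only know the broker and the topic names, which is where the decoupling comes from.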

Registration

For service registries only, the registration can be handled by each client directly or by the server.

Further reading

I've only scratched the surface above and kept the explanations as brief as possible, as I want to get on to some specifics. You can find a much better, more detailed overview of Service Discovery in Chris Richardson's excellent post, which is part of his series on Microservices.

Chris also has many video talks and articles available online and speaks very eloquently on all matters relating to distributed design which I have greatly enjoyed during my own research. I highly recommend checking them out.

So this is the first critical point where we had a variety of choices to make in our design.

Do we want smart versus dumb pipes? How about decentralised control with auto-discovery? How does our communication behave? Who controls registration? Is one single approach for all scenarios even practical?

For us they are opinionated and deliberate choices. Our approach that follows is not inherently better or worse, but each choice has consequences for many of the subsequent design decisions. In many cases, they can actually remove choice.

We will come back to reference these choices in the rest of this series.

It is also worth pointing out that I couldn't try out everything available, so our choice is not a reflection on other solutions out there. It is just the one I felt best fit ServiceStack and suited our needs.

Consul, like all service registry patterns, is a potential SPF, but it is designed with high availability in mind.

In production, you run an odd number of server nodes, typically three or five, which form a datacenter (DC). You can scale Consul to connect multiple datacenters.

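The odd number is because Consul implements a consensus protocol based on Raft, in which a leader is elected by a majority of the servers. An even number of servers adds no extra failure tolerance over the odd number below it and makes a tied vote possible.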

For the best possible resiliency, server nodes can be spread across physical hardware, network locations and operating systems. Running three instances allows a single node to fail while running five can tolerate two node failures.
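The arithmetic behind those numbers is simple majority voting. As a small sketch (the function names are mine, not Consul's), a quorum is the smallest majority of the servers, and the failure tolerance is whatever is left over:

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a cluster with the given number of servers."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return servers - quorum_size(servers)


for n in range(1, 8):
    print(f"servers={n} quorum={quorum_size(n)} tolerates={failure_tolerance(n)}")
```

Running this shows that four servers tolerate no more failures than three, and six no more than five, so the extra even-numbered node costs you a machine without buying any resilience.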

Consul is actually a hybrid model of server-side and client-side discovery, something also found in Netflix's Eureka. This approach avoids one typical drawback of client-side discovery and self-registration systems, namely their dependence on network availability and latency.

It avoids this by using local agents on a loopback address.

Each service has access to an agent co-located on the same physical hardware. Consul uses Serf, a gossip protocol implementation, for managing membership, failure detection and message broadcasting, and Raft logs to keep each agent's list of services synchronised.

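This means that, from the service's point of view, lookups and registrations are fast local calls to the agent, and the service never needs to know where the Consul servers are.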
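To give a feel for what talking to the local agent looks like, here is a hedged sketch that builds a registration call against Consul's agent HTTP API on the loopback address. I am assuming the agent's default HTTP port of 8500, and the service ID, name, address and tag are made-up values. The request is only built, not sent, so the example runs without an agent.

```python
import json
import urllib.request

# Assumption: a Consul agent is listening on its default local HTTP port.
AGENT_URL = "http://127.0.0.1:8500"


def build_register_request(service_id: str, name: str, address: str,
                           port: int, tags: list[str]) -> urllib.request.Request:
    """Build (but do not send) a registration call to the local agent."""
    payload = {"ID": service_id, "Name": name, "Address": address,
               "Port": port, "Tags": tags}
    return urllib.request.Request(
        url=f"{AGENT_URL}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_register_request("orders-1", "orders", "10.0.0.5", 8080, ["v1"])
print(request.get_method(), request.full_url)
# To actually register, pass the request to urllib.request.urlopen(request)
# while an agent is running.
```

The only address the service needs to know is the loopback one; everything beyond the agent is Consul's concern.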

The approach also helps decouple the HTTP verb specifics of any external calls from your call site and instead makes the DTO responsible for defining how it is sent.

But wait, there's more...

In addition, Consul provides another piece of the infrastructure jigsaw which our plugin handles for you - service health which we will cover in our next topic.

The gateway will also select the correct format for retrieving the DTO. If your remote service only communicates in XML, it will transparently call it using XML but return you a POCO.

It will also automatically cache responses from a GET request according to the remote service's cache settings. In some cases, it will not even issue an RPC, instead returning you the DTO response straight from the cache.
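The caching behaviour can be pictured with a small time-to-live cache. This is only a sketch of the general idea, not ServiceStack's implementation: `TtlCache`, `cached_get` and the `fetch` callback (which returns the response together with a max-age supplied by the remote service) are all hypothetical names.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Caches responses for as long as the remote service allows."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale: drop it so the caller issues a fresh remote call.
            del self._entries[key]
            return None
        return value

    def put(self, key: str, value: object, max_age_seconds: float) -> None:
        self._entries[key] = (self._clock() + max_age_seconds, value)


def cached_get(cache: TtlCache, key: str,
               fetch: Callable[[], tuple[object, float]]) -> object:
    """Return a fresh cached response, or call fetch() and store the result."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value, max_age_seconds = fetch()
    cache.put(key, value, max_age_seconds)
    return value


now = [0.0]  # fake clock so the example does not need to sleep
cache = TtlCache(clock=lambda: now[0])
calls = []


def fetch_customer() -> tuple[object, float]:
    calls.append(1)  # stands in for the remote call
    return {"id": 7, "name": "Ada"}, 30.0


cached_get(cache, "GET /customers/7", fetch_customer)
cached_get(cache, "GET /customers/7", fetch_customer)
print(f"remote calls made: {len(calls)}")
```

The second call is served from the cache, so only one remote call is made until the entry expires.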

Instead of REST and all the great custom and fallback routing options in ServiceStack, we have chosen to use only ServiceStack's pre-defined-routes.

Together with our second consequence of globally unique DTOs, this allows the RPC routing to just work with Consul.
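Here is a rough sketch of why that combination works. ServiceStack's pre-defined routes follow the shape `/{format}/[reply|oneway]/{RequestDtoName}`, so once a DTO name resolves to exactly one service address, the full URL can be derived mechanically. The lookup table below is a hypothetical stand-in for what the registry would return, and the addresses are made up.

```python
def predefined_route(base_url: str, dto_name: str, fmt: str = "json",
                     one_way: bool = False) -> str:
    """Build a pre-defined route URL for a request DTO."""
    mode = "oneway" if one_way else "reply"
    return f"{base_url.rstrip('/')}/{fmt}/{mode}/{dto_name}"


# Hypothetical lookup table: each globally unique DTO name maps to exactly
# one service address, so there is never any ambiguity about where to send it.
dto_locations = {
    "GetCustomer": "http://10.0.0.5:8080",
    "CreateOrder": "http://10.0.0.6:8081/",
}

print(predefined_route(dto_locations["GetCustomer"], "GetCustomer"))
print(predefined_route(dto_locations["CreateOrder"], "CreateOrder", one_way=True))
```

Nothing about the route has to be stored or agreed in advance; the DTO name and the service address are enough.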

So let me try and explain why we've not only ignored RESTful routing, but will actively seek to prevent it being used directly in our Services.

There are a few reasons behind this but first it might help to clarify that we plan to use services internally at first, but later on expose them externally using a Gateway to be built on top of Consul.

Internally, with ServiceStack's ServiceClient and the DTOs, you already have fully end-to-end typed API calls, so you never really need to see a URI, let alone care what it is. For these callers, losing custom routes isn't so bad.

We expect that most of the internal calls will use this typed approach.

You can use custom routes, and the service-to-service calls will even use them. This is not really the problem area though.

Any non-ServiceStack client that wants to consume the services would have to go via Consul to find the right service, and Consul doesn't know a thing about your custom routes.

This affects the few internal apps or services that do not use the ServiceStack client and probably the MOST important group, the external clients.

Contract stability is of paramount importance, but addenda to contracts are OK.

So clumsily put, if we ensure our DTOs are backward-compatible, we have far more stability in our contracts. Contracts that can tolerate change. Contracts that instil confidence and the trust of consumers.

Another reason for avoiding custom routing in ServiceStack is the complexity of making it work correctly.

In what order do I add this service's routes to the routing table?

Will a fall-back or over-generous catch-all route suddenly grab all the other services' requests?

Will the new dev/team remember to respect the guidelines?
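The ordering worry is easy to demonstrate with a generic first-match routing table. To be clear, this is not how ServiceStack resolves routes; it is a deliberately naive sketch with made-up patterns and service names, showing how a catch-all registered too early swallows a more specific route.

```python
import re
from typing import Optional


def first_match(routes: list[tuple[str, str]], path: str) -> Optional[str]:
    """Return the service for the first pattern that matches, in table order."""
    for pattern, service in routes:
        if re.fullmatch(pattern, path):
            return service
    return None


specific = (r"/customers/\d+", "CustomerService")
catch_all = (r"/.*", "LegacyService")

# Same routes, different registration order, different winner.
print(first_match([specific, catch_all], "/customers/42"))
print(first_match([catch_all, specific], "/customers/42"))
```

With many teams adding routes independently, keeping that ordering correct becomes a coordination problem in its own right.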

As I mentioned previously, adding an external gateway is part of our future plans and we expect it to handle things like load-balancing, traffic shaping and SSL termination, all in one place, rather than in each service.

If in that future, we must have RESTful routing, it will be as a decoupled, globally managed concern in that gateway, carefully managing the mapping of routes to services. Even this though, by its nature, is static and prone to 'churn' in such a dynamic environment. (see schema changes in ORMs)

We are currently looking at a few options for Gateways, so I'll simply mention one that stands out so far: Fabio.

It looks to have great integration with Consul and avoids the need for more complex Consul-template solutions. Another one for the roadmap.

If you have multiple instances of a service available to process a DTO, our plugin will sort these by the agent's round-trip time (RTT), giving you the most responsive.

This isn't really load-balancing, more QoS, but it is useful nonetheless and worth mentioning.
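In spirit, the selection is no more than a sort. The sketch below uses a hypothetical `Instance` type and invented RTT figures; how the plugin actually obtains the RTT values from Consul is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    address: str
    rtt_ms: float  # estimated round-trip time to the instance's agent


def by_responsiveness(instances: list[Instance]) -> list[Instance]:
    """Order instances from lowest to highest estimated RTT."""
    return sorted(instances, key=lambda instance: instance.rtt_ms)


candidates = [
    Instance("10.0.0.6:8080", 4.2),
    Instance("10.0.0.5:8080", 0.8),
    Instance("10.0.1.9:8080", 17.5),
]
preferred = by_responsiveness(candidates)[0]
print(preferred.address)
```

Note that every caller with the same view of the network picks the same instance, which is why this is a quality-of-service measure rather than a way of spreading load.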

Consul also maintains a separate service catalog per datacenter. Using this ability, we could locate datacenters and their services in different geographic regions to even out global traffic loads.

For true load-balancing though, we have to look for other solutions and they lie outside of each service.

A gateway is the most obvious candidate for this and Fabio allows you to split traffic between services based on rules, useful for things like canary deployments as well as more traditional load-balancing.
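A weighted split of this kind can be sketched in a few lines. This is a generic illustration of the idea and not Fabio's configuration or algorithm; the backend names and the 90/10 weights are invented.

```python
import random


def pick_backend(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one backend with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # seeded only to make the demo repeatable
split = {"orders-v1": 90, "orders-v2-canary": 10}

picks = [pick_backend(split, rng) for _ in range(10_000)]
canary_share = picks.count("orders-v2-canary") / len(picks)
print(f"canary share: {canary_share:.3f}")
```

Setting all weights equal gives plain random load-balancing, while shifting a small weight onto a new version gives you a canary deployment.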

In the world of microservices however, we actually have all the ingredients we need to make something ourselves if we need to.

Having a service registry in Consul with RTT, health and performance metrics from logging for every service end-point opens up interesting possibilities for using that data. Combined with a good automated deployment pipeline, there are possibilities for elastic scaling. I'll explore this in more detail in the deployment topic.