
Discussion on the advantages of TCP vs UDP (and vice versa) has a history which is almost as long as the eternal Linux-vs-Windows debate. As I have long been a supporter of the point of view that both UDP and TCP have their own niches (see, for example, [NoBugs15]), here are my two cents on this subject.

Note for those who already know the basics of IP and TCP: please skip to the ‘Closing the Gap: Improving TCP Interactivity’ section, as you still may be able to find a thing or two of interest.

IP: just packets, nothing more

As both TCP and UDP run over IP, let’s see what the internet protocol (IP) really is. For our purposes, we can say that:

we have two hosts which need to communicate with each other

each of the hosts is assigned its own IP address

the internet protocol (and IP stack) provides a way to deliver data packets from host A to host B, using an IP address as an identifier

In practice, of course, it is much more complicated than that (with all kinds of stuff involved in the operation of the IP, from ICMP and ARP to OSPF and BGP), but for now we can more or less safely ignore the complications as implementation details.

What we do need to know, though, is that an IP packet looks as follows:

IP Header (20 to 24 bytes for IPv4)

IP Payload

One very important feature of IP is that it does not guarantee packet delivery. Not at all. Any single packet can be lost, period. It means that any number of packets can also be lost.

IP works only statistically; this behaviour is by design; actually, it is the reason why backbone Internet routers are able to cope with enormous amounts of traffic. If there is a problem (which can range from link overload to sudden reboot), routers are allowed to drop packets.

Within the IP stack, it is the job of the hosts to provide delivery guarantees. Nothing is done in this regard en route.

UDP: datagrams ~= packets

Next, let’s discuss the simpler one of our two protocols: UDP. UDP is a very basic protocol which runs on top of IP. In fact, it is so basic that when UDP datagrams run on top of IP packets, there is always a 1-to-1 correspondence between the two, and all UDP does is add a very simple header (in addition to the IP header). This header consists of 4 fields: source port, destination port, length, and checksum, making it 8 bytes in total.

So, a typical UDP packet will look as follows:

IP Header (20 to 24 bytes for IPv4)

UDP Header (8 bytes)

UDP Payload

The UDP ‘Datagram’ is pretty much the same as an IP ‘packet’, with the only difference between the two being the 8 bytes of UDP header; for the rest of the article we’ll use these two terms interchangeably.
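To make the 8-byte header a bit more tangible, here is a minimal Python sketch that packs and unpacks the four fields listed above. The function names are my own (hypothetical), and the checksum is simply left at zero, which over IPv4 means "no checksum computed"; a real stack fills it in for you, so treat this as an illustration of the layout rather than something you would put on the wire yourself.

```python
import struct

# Four 16-bit fields in network byte order:
# source port, destination port, length, checksum = 8 bytes in total.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port, dst_port, payload, checksum=0):
    """Prepend a UDP header to payload (checksum 0 = not computed; an assumption of this sketch)."""
    length = UDP_HEADER.size + len(payload)  # the length field covers header + payload
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram):
    """Split a datagram back into its header fields and payload."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port, length, checksum, datagram[UDP_HEADER.size:length]
```

A 5-byte payload therefore travels as a 13-byte datagram, plus the IP header in front of it.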

As UDP datagrams simply run on top of IP packets, and IP packets can be lost, UDP datagrams can be lost too.

TCP: stream != packets

In contrast with UDP, TCP is a very sophisticated protocol, which does guarantee reliable delivery.

The only relatively simple thing about TCP is its packet:

IP Header (20 to 24 bytes for IPv4)

TCP Header (20 to 60 bytes)

TCP Payload

Usually, the size of a TCP header is around 20 bytes, but in relatively rare cases it may reach up to 60 bytes.

As soon as we’re past the TCP packet, things become complicated. Here is an extremely brief and sketchy description of how TCP works (see footnote 1):

TCP interprets all the data to be communicated between two hosts as two streams (one stream going from host A to host B, and another going in the opposite direction)

whenever the host calls the TCP function send(), the data is pushed into the stream

the TCP stack keeps a buffer (usually 2K–16K in size) on the sending side; all the data pushed to the stream goes to this buffer. If the buffer is full, send() won’t return until there is enough space in the buffer (see footnote 2)

Data from the buffer is sent over the IP as TCP packets; each TCP packet consists of an IP header, a TCP header, and TCP data. TCP data within a TCP packet is data from the sending TCP buffer; data is not removed from the TCP buffer on the sending side at the moment when the TCP packet is sent (!)

After the receiving side gets the TCP packet, it sends a TCP packet in the opposite direction, with a TCP header with the ACK bit set – an indication that a certain portion of the data has been received. On receiving this ACK packet, the sender may remove the corresponding piece from its TCP sending buffer (see footnote 3).

Data received goes to another buffer on the receiving side; again, its size is usually of the order of 2K–16K. This receiving buffer is where the data for the recv() function comes from.

If the sending side doesn’t receive an ACK in a predefined time – it will re-send the TCP packet. This is the primary mechanism by which TCP guarantees delivery in case of the packet being lost (see footnote 4).

So far so good. However, there are some further caveats. First of all, when re-sending because of no-ACK timeouts, these timeouts are doubled for each subsequent re-send. The first re-send is usually sent after time T1, which is double the so-called RTT (round-trip-time, which is measured by the host as a time interval between the moment when the packet was sent and another moment when the ACK for the packet was received); the second re-send is sent after time T2=2*T1, the third one after T3=2*T2=4*T1, and so on. This feature (known as ‘exponential back-off’) is intended to avoid the Internet being congested due to too many retransmitted packets flying around, though the importance of exponential back-off in avoiding Internet congestion is currently being challenged [Mondal]. Whatever the reasoning, exponential back-off is present in every TCP stack out there (at least, I’ve never heard of any TCP stacks which don’t implement it), so we need to cope with it (we’ll see below when and why it is important).
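The doubling is easy to tabulate. The sketch below simply encodes the rule from the paragraph above (T1 = 2*RTT, each later timeout double the previous one); the function name is hypothetical, and real stacks derive their timers from smoothed RTT estimates and clamp them, so the numbers are only indicative.

```python
def retransmit_timeouts(rtt_ms, resends):
    """Timeouts T1, T2, ... in ms, assuming T1 = 2*RTT and doubling for each re-send."""
    first = 2 * rtt_ms
    return [first * (2 ** k) for k in range(resends)]
```

For a 50ms RTT, retransmit_timeouts(50, 3) gives timeouts of 100, 200 and 400 ms, which is why a short burst of losses can stall a stream for a surprisingly long time.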

Another caveat related to interactivity is the so-called Nagle algorithm. Originally designed to avoid telnet sending 41-byte packets for each character pressed (which constitutes a 4000% overhead), it also allows the hiding of more packet-related details from a “TCP as stream” perspective (and eventually became a way to allow developers to be careless and push data to the stream in as small chunks as they like, in the hope that the ‘smart TCP stack will do all the packet assembly for us’). The Nagle algorithm avoids sending a new packet as long as (a) there is an unacknowledged outstanding packet and (b) there isn’t enough data in the buffer to fill the whole packet. As we will see below, it has significant implications on interactivity (but, fortunately for interactivity, usually Nagle can be disabled).
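As a rough model of the rule just described (and of the 4000% figure), consider the following sketch. Both function names and the max_segment_size parameter are my own illustrative choices; an actual TCP stack applies this decision inside the kernel together with many other conditions, so this is a simplification, not a description of any particular implementation.

```python
def overhead_percent(header_bytes, payload_bytes):
    """Header overhead relative to the useful payload, in percent."""
    return 100.0 * header_bytes / payload_bytes


def nagle_allows_send(has_unacked_packet, buffered_bytes, max_segment_size):
    """Simplified Nagle rule: hold small writes back while earlier data is still unacknowledged."""
    if buffered_bytes <= 0:
        return False  # nothing to send
    if buffered_bytes >= max_segment_size:
        return True   # enough data to fill a whole packet
    return not has_unacked_packet  # a small packet goes out only if nothing is outstanding
```

A single keystroke carried in a 41-byte packet has 40 bytes of headers per byte of data, hence overhead_percent(40, 1) is 4000.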

1 For simplicity, the discussion of flow control and TCP windows is omitted; so are the optimizations such as SACK and fast retransmit

2 Alternatively, if the socket is non-blocking, in this situation send() can return EWOULDBLOCK

3 In practice, ACK is not necessarily a separate packet; efforts are taken by the TCP stack to ‘piggy-back’ an ACK on any packet going in the needed direction

4 There are other mechanisms of re-sending, which include re-sending when an ACK was received but was out-of-order; they are beyond the scope of the present article

TCP: Just the ticket? Not so fast 🙁

Some people may ask: if TCP is so much more sophisticated and more importantly, provides reliable data delivery, why not just use TCP for every network transfer under the sun?

Unfortunately, it is not that simple. Reliable delivery in TCP does have a price tag attached, and this price is all about loss of interactivity :-(.

Let’s imagine a first-person shooter game that sends updates, with each update containing only the position of the player. Let’s consider two implementations: Implementation U which sends the player position over UDP (a single UDP packet every 10ms, as the game is fast and the position is likely to change during this time anyway), and Implementation T which sends the player position over TCP.

First of all, with Implementation T, if your application is calling send() every 10 ms, but RTT is, say, 50ms, your data updates will be delayed (according to the Nagle algorithm, see above). Fortunately, the Nagle algorithm can be disabled using the TCP_NODELAY option (see the section ‘Closing the gap: improving TCP interactivity’ for details).

If the Nagle algorithm is disabled, and there are no packets lost on the way (and both hosts are fast enough to process the data) – there won’t be any difference between these implementations. But what will happen if some packets are lost?

With Implementation U, even if a packet is lost, the next packet will heal the situation, and the correct player position will be restored very quickly (at most in 10 ms). With Implementation T, however, we cannot control timeouts, so the packet won’t be re-sent until around 2*RTT; as RTT can easily reach 50ms even for a first-person shooter (and is at least 100–150ms across the Atlantic), the retransmit won’t happen until about 100ms, which represents a Big Degradation compared to Implementation U.

In addition, with Implementation T, if one packet is lost but the second one is delivered, this second packet (while present on the receiving host) won’t be delivered to the application until the second instance of the first packet (i.e. the first packet retransmitted on timeout) is received; this is an inevitable consequence of treating all the data as a stream (you cannot deliver the latter portion of the stream until the former one is delivered).
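This ‘later data waits for earlier data’ effect can be modelled in a few lines. The sketch below is a deliberately simplified receiver that numbers whole packets (real TCP numbers bytes, and the class and method names are hypothetical); it shows only the in-order delivery rule, not retransmission itself.

```python
class InOrderReceiver:
    """Toy model of stream semantics: data is released to the application strictly in order."""

    def __init__(self):
        self.next_seq = 0   # sequence number the application is waiting for
        self.pending = {}   # packets that arrived early, keyed by sequence number

    def on_packet(self, seq, data):
        """Accept one packet; return the list of payloads that can now be delivered."""
        if seq < self.next_seq:
            return []  # duplicate of something already delivered
        self.pending[seq] = data
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

If packet 0 is lost and packet 1 arrives, on_packet(1, ...) returns an empty list; only when the retransmitted packet 0 shows up are both payloads handed over, the newer one having waited for the whole retransmit timeout.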

To make things even worse for Implementation T, if there is more than one packet lost in a row, then the second retransmit with Implementation T won’t come until about 200ms (assuming 50ms RTT), and so on. This, in turn, often leads to existing TCP connections being ‘stuck’ while new TCP connections would succeed and work correctly. This can be addressed, but requires some effort (see ‘Closing the gap: improving TCP interactivity’ section below).

So, should we always go with UDP?

In Implementation U described above, UDP worked pretty well, but this was closely related to the specifics of the messages exchanged. In particular, we assumed that every packet has all the information necessary, so loss of any packet will be ‘healed’ by the next packet. If such an assumption doesn’t hold, using UDP becomes non-trivial.

Also, the whole scheme relies on us sending packets every 10ms; this may easily result in sending too much traffic even if there is little activity; on the other hand, increasing this interval with Implementation U will lead to loss of interactivity.

What should we do then?

Basically, the rules of thumb are roughly as follows:

If characteristic times for your application are of the order of many hours (for example, you’re dealing with lengthy file transfers) – TCP will do just fine, though it is still advisable to use TCP built-in keep-alives (see below).

If characteristic times for your application are below ‘many hours’ but are over 5 seconds – it is more or less safe to go with TCP. However, to ensure interactivity consider implementing your ‘Own Keep-Alives’ as described below.

If characteristic times for your application are (very roughly) between 100ms and 5 seconds – this is pretty much a grey area. The answer to the ‘which protocol to use’ question will depend on many factors, from “How well you can deal with lost packets on application level” to “Do you need security?”. See both ‘Closing the gap: reliable UDP’ and ‘Closing the gap: improving TCP Interactivity’ sections below.

If characteristic times for your application are below 100ms – it is very likely that you need UDP. See the ‘Closing the gap: reliable UDP’ section below on the ways of adding reliability to UDP.
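For those who prefer their rules of thumb executable, the thresholds above can be written down as a tiny helper. The function name and the returned strings are hypothetical, the boundaries are the very rough ones from the list, and the ‘many hours’ case is folded into the plain TCP branch because its exact boundary isn’t specified.

```python
def suggest_protocol(characteristic_time_s):
    """Very rough starting point for protocol choice, given the acceptable delay in seconds."""
    if characteristic_time_s < 0.1:
        return "UDP, possibly with reliability added on top"
    if characteristic_time_s <= 5.0:
        return "grey area: depends on loss handling, security and other factors"
    return "TCP, with keep-alives"
```

It is, of course, only a first guess; within the grey area the other considerations discussed later may well decide the matter.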

Closing the gap: reliable UDP

In cases when you need to use UDP but also need to make it reliable, you can use one of the ‘reliable UDP’ libraries [Enet][UDT][RakNet]. However, these libraries cannot do any magic, so they’re essentially restricted to retransmits at some timeouts. Therefore, before using such a library, you will still need to understand very well how exactly it achieves reliable delivery, and how much interactivity it sacrifices in the process (and for what kind of messages).

It should be noted that when implementing reliable UDP, the more TCP features you implement, the more chances there are that you end up with an inferior implementation of TCP. TCP is a very complex protocol (and most of its complexity is there for a good reason), so attempting to implement ‘better TCP’ is extremely difficult. On the other hand, implementing ‘reliable UDP’ at the cost of dropping most of TCP functionality, is possible.
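To illustrate what ‘retransmits at some timeouts’ boils down to, here is a minimal sketch of the sender-side bookkeeping: number each message, remember it until it is acknowledged, and hand it back for re-sending once a timeout has passed. The class and method names are hypothetical, the clock is passed in explicitly to keep the example testable, and none of this is the API of the libraries mentioned above; everything that makes TCP hard (ordering, flow control, congestion control) is intentionally left out.

```python
class RetransmitQueue:
    """Sender-side state for a bare-bones 'reliable UDP': resend until acknowledged."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.next_seq = 0
        self.unacked = {}  # seq -> (payload, time of the last transmission)

    def send(self, payload, now):
        """Register a new message; the caller puts (seq, payload) into a datagram."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, now)
        return seq

    def on_ack(self, seq):
        """The peer confirmed receipt; forget the message."""
        self.unacked.pop(seq, None)

    def due_for_resend(self, now):
        """Return (seq, payload) pairs whose timeout expired, restarting their timers."""
        due = [(seq, payload)
               for seq, (payload, sent_at) in self.unacked.items()
               if now - sent_at >= self.timeout_s]
        for seq, payload in due:
            self.unacked[seq] = (payload, now)
        return due
```

Note that the timeout here is fixed and under the application’s control, which is exactly the knob TCP does not give you; whether a fixed timeout is appropriate for your traffic is a separate question.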

EDIT: [Wikipedia.QUIC] by Google is an interesting effort in this regard, stay tuned for their further developments.

Closing the gap: improving TCP interactivity

There are several things which can be done to improve interactivity of TCP.

Keep-alives and ‘stuck’ connections

One of the most annoying problems when using TCP for interactive communication is ‘stuck’ TCP connections. When you see a browser page which gets ‘stuck’ in the middle, then press ‘Refresh’ – and bingo! – here is your page, then chances are you have run into such a ‘stuck’ TCP connection.

One way to deal with ‘stuck’ TCP connections (and without your customer working as a freebie error handler) is to have some kind of ‘keep alive’ messages which the parties exchange every N seconds; if there are no messages on one of the sides for, say, 2*N time – you can assume that TCP connection is ‘stuck’, and try to re-establish it.

TCP itself includes a Keep-Alive mechanism (look for the SO_KEEPALIVE option for setsockopt()), but its default timeout is usually of the order of 2 hours, and whether it can be changed per socket depends on the platform (under Windows, for example, the default is a global setting in the Registry, ouch).
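For reference, enabling the built-in mechanism from Python looks roughly like the sketch below. SO_KEEPALIVE itself is widely available; the per-socket timing options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) are exposed only on some platforms, so the sketch sets them only where Python’s socket module defines them. The function name and the 30/10/3 values are illustrative assumptions, not recommendations.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Turn on TCP keep-alive; tune its timing where the platform exposes per-socket options."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are platform-dependent, so look them up defensively.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Even where this works on your side, you cannot count on the other side being configured the same way, which is one more reason to consider application-level keep-alives.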

So, if you need to detect your ‘stuck’ TCP connection earlier than in two hours, and your operating systems on both sides of your TCP connection don’t support per-socket keep-alive timeouts, you need to create your own keep-alive over TCP, with the timeouts you need. It is usually not rocket science, but is quite a bit of work.

The basic way of implementing your own keep-alive usually goes as follows:

You’re splitting your TCP stream into messages (which is usually a good idea anyway); each message contains its type, size, and payload

One message type is MY_DATA, with a real payload. On receiving it, it is passed to the upper layer. Optionally, you may also reset a ‘connection is dead’ timer.

Another message type is MY_KEEPALIVE, without any payload. On receiving it, it is not passed to the upper layer, but a ‘connection is dead’ timer is reset.

MY_KEEPALIVE is sent whenever there are no other messages going over the TCP connection for N seconds

When a ‘connection is dead’ timer expires, the connection is declared dead and is re-established. While this is a fallacy from a traditional TCP point of view, it has been observed to help interactivity significantly.

As an optimization, you may want to keep the original connection alive while you’re establishing a new one; if the old connection receives something while you’re establishing the new one, you can resume communication over the old one, dropping the new one.
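The receiving half of the steps above can be sketched as follows. The wire format (1-byte type followed by a 4-byte size), the numeric values of MY_DATA and MY_KEEPALIVE, and the class name are assumptions made for illustration; the ‘dead’ threshold of 2*N follows the suggestion made earlier in this section, and this sketch takes the optional route of resetting the timer on MY_DATA as well. Sending MY_KEEPALIVE after N idle seconds and the reconnect logic are left out.

```python
import struct

MY_DATA = 1        # message with a real payload (value chosen for this sketch)
MY_KEEPALIVE = 2   # message without payload (value chosen for this sketch)
HEADER = struct.Struct("!BI")  # message type (1 byte), payload size (4 bytes)


def encode_message(msg_type, payload=b""):
    """Frame one message as type + size + payload."""
    return HEADER.pack(msg_type, len(payload)) + payload


class KeepAliveReceiver:
    """Splits the TCP byte stream into messages and tracks the 'connection is dead' timer."""

    def __init__(self, n_seconds, now):
        self.dead_after = 2 * n_seconds  # no messages for 2*N seconds => assume 'stuck'
        self.last_heard = now
        self.buffer = b""

    def feed(self, chunk, now):
        """Consume bytes returned by recv(); return payloads of complete MY_DATA messages."""
        self.buffer += chunk
        payloads = []
        while len(self.buffer) >= HEADER.size:
            msg_type, size = HEADER.unpack_from(self.buffer)
            end = HEADER.size + size
            if len(self.buffer) < end:
                break  # message not complete yet; wait for more bytes
            body = self.buffer[HEADER.size:end]
            self.buffer = self.buffer[end:]
            self.last_heard = now  # any complete message resets the timer
            if msg_type == MY_DATA:
                payloads.append(body)  # only real data goes to the upper layer
        return payloads

    def is_dead(self, now):
        """True once nothing has been received for 2*N seconds."""
        return now - self.last_heard >= self.dead_after
```

Passing the clock in as a parameter keeps the logic independent of any particular event loop; in a real program you would feed it time.monotonic() and check is_dead() from a timer.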

TCP_NODELAY

One of the most popular ways to improve TCP interactivity is enabling TCP_NODELAY over your TCP socket (again, as a parameter of setsockopt() function).

However, TCP_NODELAY is not without its own caveats. Most importantly, with TCP_NODELAY you should always assemble the whole message-you-want-to-send before calling send(). Otherwise, each of your calls to send() will cause the TCP stack to send a packet (with the associated 40–84 bytes overhead, ouch).
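In Python terms, this advice amounts to something like the sketch below: enable TCP_NODELAY once, build the complete message in memory, and hand it to the stack in a single call. The message layout (a player id plus three floats) and all function names are made up for this example.

```python
import socket
import struct


def make_low_latency_socket():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def build_position_update(player_id, x, y, z):
    """Assemble the whole message first (hypothetical layout: uint32 id + 3 floats)."""
    return struct.pack("!Ifff", player_id, x, y, z)


def send_position(sock, player_id, x, y, z):
    """One sendall() per message, rather than one send() per field."""
    sock.sendall(build_position_update(player_id, x, y, z))
```

Calling send() separately for the id and for each coordinate would, with TCP_NODELAY set, risk four small packets instead of one 16-byte payload.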

5 Usually TCP_NODELAY also has some other effects, such as adding a PSH flag, which causes the TCP stack on the receiving side to deliver the data to the application right away without waiting for ‘enough data to be gathered’, which is also a Good Thing interactivity-wise. Still, it cannot force packet data to be delivered until the previous-within-the-stream packet is delivered, as stream coherency needs to be preserved.

Out-of-Band Data

TCP OOB (Out-of-Band Data) is a mechanism which is intended to break the stream and deliver some data with a higher priority. As OOB adds priority (in the sense that it bypasses both the TCP sending buffer and TCP receiving buffer), it may help with interactivity. However, with TCP OOB being able to send only one byte (while you can call send(…,MSG_OOB) with more than one byte, only the last byte of the block will be interpreted as OOB), its usefulness is usually quite limited.

One scenario when MSG_OOB works pretty well (and which is used in protocols such as FTP), is to send an ‘abort’ command during a long file transfer; on receiving OOB ‘abort’, the receiving side simply reads all the data from the stream, discarding it without processing, until the OOB marker (the place in the TCP stream where send(…,MSG_OOB) has been called on sending side) is reached. This way, all the TCP buffers are effectively discarded, and the communication can be resumed without dropping the TCP connection and re-establishing a new one. For more details on MSG_OOB see [Stevens] (with a relevant chapter available on [Masterraghu]).

Residual issues

Even with all the tricks above, TCP is still lacking interactivity-wise. In particular, out-of-order delivery of anything larger than 1 byte is still not an option, stale data will still be retransmitted even when it is no longer needed, and dealing with ‘stuck’ connections is just a way to mitigate the problem rather than to address it. On the other hand, if your application is relatively tolerant to delays, ‘other considerations’ described below may easily be a deciding factor in your choice of protocol.

Other considerations

If you’re lucky enough and the requirements of your application can be satisfied by both TCP and UDP, other considerations may come into play. These considerations include (but are not limited to):

EDIT: pretty much whenever we can use TCP, we can also use so-called “Websockets”. They inherit most of the advantages and disadvantages of TCP, but are usually better suited for web apps (and are generally even a bit more firewall-friendly than plain TCP)

TCP guarantees correct ordering of packets, UDP as such doesn’t (though ‘Reliable UDP’ might).

TCP has flow control, UDP as such doesn’t.

TCP is generally more firewall- and NAT-friendly, which can be roughly translated as ‘if you want your user to be able to connect from a hotel room, or from work – TCP usually tends to work better, especially if going over port 80, or over port 443’. [QUIC] estimates the number of people-who-have-outbound-TCP-connectivity-but-don’t-have-outbound-UDP at 6-9%. Whether it is too much – depends on your application, but I strongly suggest having a TCP fallback at least for those apps where it is not completely hopeless.

TCP is significantly simpler to program for. While TCP is not without caveats, not discussed here (see also [NoBugs15a]), dealing with UDP so it works without major problems generally takes significantly longer.

TCP generally has more overhead, especially during connection setup and connection termination. The overall difference in traffic will usually be small, but this might still be a valid concern.

Conclusions

The choice of TCP over UDP (or vice versa) might not always be obvious. In a sense, replacing TCP with UDP is trading off reliability for interactivity.

The most critical factor in selection of one over another one is usually related to acceptable delays; TCP is usually optimal for over-several-seconds times, and UDP for under-0.1-second times, with anything in between being a ‘grey area’. On the other hand, other considerations (partially described above) may play their own role, especially within the ‘grey area’.

Also, there are ways to improve TCP interactivity as well as UDP reliability (both briefly described above); this often makes it possible to close the gap between the two.

Disclaimer

As usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Acknowledgements

This article was originally published in Overload Journal #130 in December 2015 and is also available separately on the ACCU web site. Re-posted here with the kind permission of Overload. The article has been re-formatted to fit your screen.

Comments

From experience, there are many routers, bridges, etc. that are setup to not pass UDP traffic by default — UDP traffic must be specifically enabled in order, for example, for UDP packets to travel between subnets in an internal network, etc. Also, if you are deploying something into the ‘cloud’ (like AWS), their networking does not even allow UDP packets between instances, so, for another example, if your auto-discovery mechanism relies on UDP to work, it will not work on AWS at all (and there is currently no way to fix that — I spent a frantic 12 hours re-writing my UDP auto-discovery facility that worked fine in my own env to use Redis on AWS).

Right, and I’ve mentioned it (as “TCP being more firewall- and NAT-friendly”). However, as soon as you’ve got your server firewall-friendly, [QUIC] reference reports that only 6-9% of users cannot handle UDP, which is IMHO “good enough” to “develop UDP stuff when it is necessary, with a mandatory TCP/Websocket fallback when it is not absolutely hopeless”

Just a note about the UDP and EC2 (AWS) issue, for people who may arrive here and give up on their working implementation:
It works!
Cliff probably needed multicast for his service discoverability, which was not supported in AWS for a long time, but is available for linux hosts in a VPC since 2015 – you just have to read the docs and follow a few boring steps.

Very good article, and nuanced reasoning: use whatever fits _your_ needs. I would like to mention the limits on packet size (on many routers, switches, firewalls) with UDP, which require you to implement your own chunking and reassembling of data that needs to be transmitted as one payload. For a fresh network programmer the MTU can show up late into the production cycle, as all testing internally and even to many clients works, when all of a sudden one client’s route doesn’t allow for jumbo frames (MTU > 1500 bytes). If you’re starting your game/app from scratch, to avoid even further firewall issues, I’d look at websockets as an alternative to TCP for any future work. Built-in encryption is nothing to sneeze at, considering the very hard problem of encrypting UDP.

On MTUs: sure, I will try hard not to forget about real-world IP fragmenting (or lack thereof ;-)) when writing Volume 2 of the upcoming book :-). BTW, from your experience – is there a sizeable difference in number of clients-who-can-handle-MTU-of-576 and clients-who-can-handle-MTU-of-1500? There are two schools in this regard: one says “play it safe at 576” and another says “1500 is good enough” (and personally, whenever possible, I am arguing for “UDP at 1400 PLUS fallback to Websockets” – and this works well, but I am always eager to hear about others’-real-world-experiences).

I can’t remember that we had any issues with 1400ish packets. Even larger ones worked in lots of cases, but some large cloud hosts do not support packets larger than 1500 bytes. Surprisingly, we tested 5kb+ UDP packets and it worked just fine in some cases, but I wouldn’t recommend using them without fallback alternatives.

An UDP+fallback is certainly a good way of doing it if you need that interaction speed. We settled for WSS (Secure Websockets) because it was good enough interaction wise, but also because of standard encryption built in and security was an issue we wanted to address.

Thanks! About 5K+ – it is pretty much inevitably fragmented, and being fragmented means that a loss of any fragment leads to a loss of the whole packet; as a result I would avoid these (splitting manually and trying to recover from single-packet losses).

About encryption – yes, it is a pain in the (ahem) neck, but from what I’ve heard – DTLS for UDP is not too bad to deal with. BTW, about “standard encryption” implemented by WSS – do you know which CA do they use? If they’re using a root certificate which is installed on client’s machine – it creates a serious weakness for protocol obfuscation (and for most of the games there is a strong need for obfuscation) – due to a MITM-on-himself kind of attack (which is not a problem for non-games, but is pretty well-known in the game world 🙁 ).

Besides TCP and UDP there is also SCTP [1, 2], which allows you to mix and match features of UDP and TCP.

If you ever heard of WebRTC, that’s what it uses for networking.

You want reliable transmission? You can have it. You don’t want reliable transmission? You can have it too. You want things to be sent in order? Sure. You can also send things out of order if you want.

Sadly, SCTP is not well supported by OSes. The Linux kernel does support SCTP, but all other relevant OSes don’t. There is, however, a portable and cross-platform userland implementation of SCTP on top of UDP, which is what the Chromium / Google Chrome browser uses for WebRTC [3].

Let me rephrase your statement a bit. THE ONLY OS which supports SCTP, is Linux. And we didn’t even start discussing support for SCTP in various NATs/firewalls (and this IS a Big Fat Problem as soon as we’re out of server-2-server communications).

> There is, however, a portable and cross-platform userland implementations of SCTP on top of UDP

Sure (and it solves the firewall issue too), but then it falls under the category “yet another (optionally)-reliable library over UDP”, and there is about a half-a-dozen of decent ones, depending on what-exactly-you-want (do you want punching? Then look at RakNet. Do you need low latencies for reliable streams? Then QUIC is potentially a good candidate. Etc. etc.).