Monday, 23 July 2012

Loss, latency, error correction and retries

Internet Protocol is a packet based protocol - it means that all information carried over IP is broken into packets - blocks of bytes of data. Typically these are anything from 1 byte to 1500 bytes, but can (at an IP level) go up to 65535 bytes. When transferring a large file the data will be broken into the largest convenient chunk (typically 1500 bytes of IP) by the higher layer protocol (typically TCP).
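A rough sketch of that chunking, using typical (assumed) header sizes - 20 bytes each of IP and TCP header leaving 1460 bytes of payload per 1500 byte packet:

```python
# Sketch: how a higher layer protocol such as TCP splits a byte stream
# into segments that fit a typical 1500 byte packet. Header sizes are
# the usual minimums (no options) and are illustrative.
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
MSS = MTU - IP_HEADER - TCP_HEADER  # maximum segment size: 1460 bytes

def segment(data: bytes, mss: int = MSS) -> list[bytes]:
    """Break data into segments of at most mss bytes each."""
    return [data[i:i + mss] for i in range(0, len(data), mss)]

segments = segment(b"x" * 4000)
print([len(s) for s in segments])  # → [1460, 1460, 1080]
```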

The job of the link layer below IP is to carry these IP packets. Even the IP packets may be broken into smaller packets (fragments) to do that, as typically the link layer works at a 1500 byte maximum packet size.
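Fragmentation can be sketched too. IP records each fragment's position in units of 8 bytes, so every fragment except the last carries a multiple of 8 bytes of data (the 20 byte header and MTU figures below are illustrative):

```python
# Sketch of IP fragmentation: a packet's payload is split so each
# fragment fits the link MTU, with offsets counted in 8-byte units
# and a "more fragments" flag on all but the last.
def fragment(payload: bytes, link_mtu: int) -> list[tuple[int, bool, bytes]]:
    per_frag = (link_mtu - 20) // 8 * 8  # data per fragment, multiple of 8
    frags = []
    for i in range(0, len(payload), per_frag):
        chunk = payload[i:i + per_frag]
        more = i + per_frag < len(payload)  # more fragments to follow?
        frags.append((i // 8, more, chunk))
    return frags

# A 1480 byte payload over a link with a 576 byte MTU:
for offset, more, data in fragment(b"y" * 1480, 576):
    print(offset, more, len(data))  # → 0 True 552 / 69 True 552 / 138 False 376
```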

There are many ways this can be done, including Ethernet (over copper or fibre), WiFi, ADSL, Modem, and so on. Each of these low level protocols operates in different ways and has different characteristics. Sometimes the low level fits these into smaller fixed size blocks such as 48 byte ATM cells.

One of the key things to understand is that these lower layer protocols are not responsible for guaranteeing delivery of packets. IP as a protocol is not intended to be 100% reliable. It will drop packets because links are full or as a result of errors. The higher level protocols, such as TCP, manage any resending of packets that is needed to get a reliable transfer of data.

However, these higher level protocols work better if IP only drops or delays packets because of congestion. Packets dropped due to errors have a disproportionate effect on overall throughput. To put this in some context, if you have 1% random packet drop on a link, TCP will not manage to fill the link to 99% of capacity as one might intuitively expect. Instead, each time a packet is dropped, TCP thinks that this is the result of congestion and effectively slows down the transfer. It is quite possible that as little as 1% random loss will reduce your TCP throughput by 90%.
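The well known Mathis et al. model makes this concrete: steady state TCP throughput is roughly MSS/RTT × 1.22/√p, where p is the loss rate. A quick sketch, with illustrative (not measured) figures:

```python
import math

# Rough sketch of the Mathis model for TCP throughput under random
# loss: rate ≈ (MSS / RTT) * (1.22 / sqrt(p)). Values are illustrative.
def tcp_throughput_bps(mss_bytes: float, rtt_s: float, loss: float) -> float:
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss))

# A 1460 byte MSS and 50 ms RTT with 1% random loss caps the transfer
# at roughly 2.8 Mbit/s - however fast the link itself is:
rate = tcp_throughput_bps(1460, 0.050, 0.01)
print(f"{rate / 1e6:.1f} Mbit/s")  # → 2.8 Mbit/s
```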

If, however, the loss is down to congestion, TCP is right to slow down, and that causes the loss to go away. TCP speeds up when there is no loss. The packet loss is how it knows that it is going too fast (though some other mechanisms do exist for this now). So a full link will, indeed, have loss.

A full link also gets latency - this is delay of the packets being sent. This is because routers have a queue of packets so as to smooth out the bursts of traffic. It turns out that many systems have queues that are too big and so create buffer bloat which makes TCP somewhat less efficient.
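The worst case queueing delay a full buffer adds is simply its size divided by the link rate. A sketch with illustrative figures shows how an oversized buffer on a slow uplink produces bufferbloat:

```python
# Sketch: worst-case queueing delay of a full router buffer is
# buffer size / link rate. Figures below are illustrative.
def queue_delay_ms(buffer_bytes: int, link_bps: float) -> float:
    return buffer_bytes * 8 / link_bps * 1000

# A 256 KB buffer on a 2 Mbit/s uplink queues over a second of traffic:
print(f"{queue_delay_ms(256 * 1024, 2e6):.0f} ms")  # → 1049 ms
```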

So, a low level link passes packets. It might delay them a bit (latency) or drop them (packet loss).

Whilst low level links are not responsible for reliable transmission, it is sensible to avoid errors causing dropped packets. Some links are prone to errors, especially radio and high speed DSL links. These types of links often have an option called interleaving. Interleaving means spreading the bits of a packet out over time and interleaving bits of other packets. This is done in conjunction with forward error correction (there is no point interleaving if not using FEC). What this means is that packets have extra data added which can be used to repair lost bits due to interference. The interleaving is actually a trick to make a long burst of errors appear as a small amount of error in several packets and so be more likely to be repairable.
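A toy illustration of the trick: write four packets into a matrix row by row and transmit column by column. A burst that wipes out four consecutive symbols on the wire then costs each packet only one symbol - a small, FEC-repairable error - instead of destroying one packet outright. (Real DSL interleaves bits, not byte-sized symbols, but the principle is the same.)

```python
# Four packets of four symbols each: A0..A3, B0..B3, etc.
packets = [[f"{p}{i}" for i in range(4)] for p in "ABCD"]

# Interleave: transmit column by column instead of packet by packet.
wire = [packets[row][col] for col in range(4) for row in range(4)]
print(wire[:8])  # → ['A0', 'B0', 'C0', 'D0', 'A1', 'B1', 'C1', 'D1']

# A burst hitting 4 consecutive symbols on the wire...
burst = set(wire[4:8])
# ...damages every packet in just one place:
for p in packets:
    print(sum(sym in burst for sym in p))  # → 1, for each of the four packets
```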

Interleaving adds more latency as the packets are stretched in time, but it is a fixed amount of latency that is added by this. The extra FEC bits make the link slightly slower (more bits have to be sent for the same data) but again this is a fixed and predictable reduction in speed. There are often different levels of interleaving that can be used and sometimes different levels of FEC. It is a trade off - more interleave is more latency and more FEC is slower throughput, but they make a link more reliable in the face of some types of error.

Whilst most links have a small level of FEC and interleave, there are cases where these are taken to extremes - where the FEC data is hundreds of times the amount of data. However, such things are usually only used for specialist applications such as communications in to deep space (e.g. Voyager spacecraft).

FEC means packets are more likely to arrive and so less are dropped due to error. One thing FEC does not do is re-try missing packets - that is still left for the higher level protocols.
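A minimal FEC sketch shows the idea: send an XOR parity packet alongside the data packets, and any single lost packet can be rebuilt from the rest with no retransmission. (Real links use stronger codes such as Reed-Solomon; this is just the principle.)

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"one!", b"two!", b"tres"]
parity = reduce(xor, data)  # extra redundant packet, sent with the data

# Packet 1 is lost in transit; XOR of everything that did arrive
# (the other data packets plus the parity) recovers it:
recovered = reduce(xor, [data[0], data[2], parity])
print(recovered)  # → b'two!'
```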

So, why the educational rant? Well, it seems even high level escalations in BT are unaware of basic packet protocols. They stated that packet loss on a line can result in latency. This is simply not true. The loss will mean that at a higher level packets may have to be re-sent by higher level protocols, but does not create latency at a packet level. They also stated that latency (of over a second at times) could be the result of interleaving, not realising that interleaving adds a very specific defined and consistent latency (usually a few milliseconds). You only get latency like that if something queues a packet, and not at the modem level.

It makes a certain amount of sense to do retransmission at the link layer: the round trip is a lot shorter. In theory you can also do hybrid ARQ/FEC, but I don't think G.998.4 does that. It was originally designed for video, where TCP retransmission is a nonstarter.

While INP as specced in G.998.4 (and used on AAISP Be Wholesale lines) does involve retransmission at the PHY level, it's bounded in time - the maximum permitted delay is 63ms over and above the lower level delay (that of the interleaved DSL line underneath the INP layer), and is set in one millisecond increments.

Thus, even with INP, BT can say things like "this MSAN configuration has a worst case delay of 25ms, with a best case (no INP retransmissions) of 18ms".
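The arithmetic behind a statement like that is simple, which is the point - the delay is bounded and predictable. A sketch echoing the (assumed) figures above:

```python
# Sketch: with INP, total latency is base interleave delay plus a
# configured, bounded retransmission allowance. Figures are the
# illustrative ones quoted above, not a real MSAN config.
base_delay_ms = 18   # interleaved DSL line underneath the INP layer
inp_budget_ms = 7    # configured retransmission allowance (max 63 ms)

print(f"best case {base_delay_ms} ms, worst case {base_delay_ms + inp_budget_ms} ms")
# → best case 18 ms, worst case 25 ms
```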