Why would you want to avoid it? There is, at least in theory, one good reason: IP discards the whole datagram if one of its fragments is lost. Assume you send a 4 kilobyte datagram with an MTU of 1280, i.e. 4 fragments. If you do your own fragmentation, those "fragments" are complete datagrams; if one is lost, you still get the other three. Relying on IP fragmentation means that if one is lost, you lose all four.
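To put rough numbers on that, here is a back-of-the-envelope sketch (the 1% per-fragment loss rate is an assumption purely for illustration):

```python
# Delivery odds for a 4-fragment payload, assuming each fragment is
# lost independently with probability p (illustrative value).
p = 0.01

# IP fragmentation: one lost fragment kills the whole datagram,
# so all 4 fragments must arrive.
ip_frag_delivery = (1 - p) ** 4

# Application-level "fragments" are independent datagrams:
# each one survives or dies on its own.
per_datagram_delivery = 1 - p

print(f"whole datagram via IP fragments: {ip_frag_delivery:.4f}")
print(f"each app-level datagram:         {per_datagram_delivery:.4f}")
```

With these numbers you'd deliver the full IP-fragmented datagram only ~96% of the time, while ~99% of the independent app-level datagrams arrive.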

Now, I haven't looked in detail at the fragmentation code, but I was assuming that losing a fragment would invalidate the entire packet anyway - i.e. one lost fragment means you lost all, just like with IP.

Is there no merit to my conjecture that it can be worthwhile to pre-fragment packets to avoid any reassembly at network boundaries and minimize MTU-related issues?

Of course, it might simply be that they wanted to allow packets that exceeded UDP max packet size.

I was assuming that losing a fragment would invalidate the entire packet anyway - i.e. one lost fragment means you lost all, just like with IP

I didn't look either, but I doubt it. I would deem myself very arrogant if I assumed that I could write a fragmentation layer that is exactly identical to how IP does it, only better. In all likelihood, it would only be worse! Carmack and the other people involved in Q3 should know better than to try such a thing.

Of course, it might simply be that they wanted to allow packets that exceeded UDP max packet size.

That would be insane, however. If they have that much bulk data, then they'd better use TCP in the first place. UDP already allows you to send datagrams close to 64kiB in size, which is way too big for "realtime".

If you use UDP, you do that because you have hard realtime requirements on data arriving, you want low latency. You just cannot afford to wait. Thus, you certainly do not want to send messages of several hundred or thousands of kilobytes, and you certainly do not want to wait for dozens or hundreds of fragments to come in before you can assemble them.

You probably won't die if a single datagram gets fragmented every now and then -- if it happens, it happens, bad luck -- but planning to send datagrams several dozen kilobytes in size is just a bad idea.

I didn't look either, but I doubt it. I would deem myself very arrogant if I assumed that I could write a fragmentation layer that is exactly identical to how IP does it, only better. In all likelihood, it would only be worse! Carmack and the other people involved in Q3 should know better than to try such a thing.

From what I understand there appear to be some issues regarding IP fragmentation with different MTUs. If I understand correctly, the situation is like this:

You send a packet of n bytes where n > MTU of your network. Let's say we send 4000 bytes and the MTU is 1500, so that gets split into 3 fragments. At a later point these fragments then need to pass through a relay with an MTU of 1450, so suddenly the first two fragments are too big, and the relay may either send an ICMP "packet too big" back to the sender or fragment them again.

To avoid that, one solution would then be to make sure that each packet is small enough that it's unlikely to encounter any fragmentation until it's assembled at the recipient's.
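As a sketch of what such application-level pre-fragmentation might look like -- the 8-byte (msg_id, index, count) header layout and the 1280-byte budget are assumptions for illustration, not anything Q3 actually does:

```python
import struct

MAX_PAYLOAD = 1280 - 8  # conservative datagram budget minus our 8-byte header

def fragment(msg_id: int, data: bytes):
    """Split `data` into standalone datagrams, each carrying
    (msg_id, index, count) so the receiver can reassemble them."""
    chunks = [data[i:i + MAX_PAYLOAD]
              for i in range(0, len(data), MAX_PAYLOAD)] or [b""]
    return [struct.pack("!IHH", msg_id, i, len(chunks)) + c
            for i, c in enumerate(chunks)]

def reassemble(datagrams):
    """Reassemble one message from its datagrams (any arrival order)."""
    parts = {}
    for d in datagrams:
        msg_id, index, count = struct.unpack("!IHH", d[:8])
        parts[index] = d[8:]
    assert len(parts) == count, "missing fragments"
    return b"".join(parts[i] for i in range(count))
```

Each "fragment" here is a complete datagram no bigger than 1280 bytes, so it should pass through typical paths without the routers ever needing to fragment or reassemble anything.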

That would be insane, however. If they have that much bulk data, then they'd better use TCP in the first place. UDP already allows you to send datagrams close to 64kiB in size, which is way too big for "realtime".

Well, it's likely to be for initial map downloads and such. It would make sense to use TCP for that alongside UDP.

Guess who they will blame when the game doesn't work? Unless you have a very clear meter for quality, one that very clearly shows how the quality impacts whether the game works or not, they will blame you.

Ah yes, I can see that kind of thing happening. But luckily, to your advantage, no game (except a turn-based one like chess) will really "work" in such an environment, and if you display something like "packet loss!" (or even a number) in the corner to tip them off, that should hopefully be enough.

To avoid that, one solution would then be to make sure that each packet is small enough that it's unlikely to encounter any fragmentation until it's assembled at the recipient's.

The problem with a homebrew fragmentation implementation is that you don't really know when such a thing happens; at least, there is no easy way to find out. ICMP makes this work "magically" under IPv6, but you cannot easily access that info from an unprivileged user process. Under IPv4, fragmentation happens automatically at the router -- you never know it happened.

If you do your own fragmentation, your only option is setting the "don't fragment" bit, but doing so is not very straightforward or portable (on Linux, for example, there is no direct DF flag for UDP sockets; you have to go through the IP_MTU_DISCOVER socket option), and in my opinion it's something that stinks (for UDP, at least). Other than that, your only way of knowing that you've exceeded the MTU is by looking into your crystal ball, pretty much.
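For reference, a sketch of requesting DF on an ordinary UDP socket on Linux via the IP_MTU_DISCOVER socket option (Linux-specific; the hard-coded fallback numbers are the Linux constant values):

```python
import socket

# Linux values, used as fallback if this socket module doesn't export them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

def make_df_udp_socket():
    """UDP socket whose outgoing datagrams carry the DF bit.
    Oversized sends then fail locally (EMSGSIZE) or provoke an ICMP
    "fragmentation needed" along the path, instead of being silently
    fragmented by a router."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    return s
```

Even so, actually *seeing* the resulting ICMP errors from an unprivileged process remains the awkward part, which is the point being made above.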

TCP does the same thing behind your back to discover the MTU and its window size, sure, but that's a different story. TCP is "bulk data", not "real time". It's perfectly OK to have packets dropped purposely to find out your limits.

For UDP, I would rather try to keep the datagram size to something reasonable that will likely pass on 99% of all routes, simply by sending less stuff at a time at application level. Something like 1280 bytes should fly: practically every router on the internet is IPv6-capable nowadays, and even though most people still use IPv4, those routers have to comply with IPv6's minimum MTU of 1280.

Maybe this will cause fragmentation for a few people who have a lower MTU (in theory it could be as little as 576, but I doubt you'll find many of those; I can't even remember a time when it was anything lower than 1492 on my end), but you probably won't need to care. First, it just means that a select few people get 1 datagram in 2 fragments, what a tragedy -- it's not like they're getting 50 of them. And second, it'll be very few people, and those likely won't have much fun playing your game on their low-end internet connection anyway.

On the other hand, doing your own MTU discovery will necessarily and regularly drop datagrams for everybody. Which is perfectly OK if you use TCP to download a 50 MiB file. If TCP drops and resends 10 or 20 out of 50,000 packets to discover the best MTU and window size, your 20-second download takes maybe 0.01 seconds longer, if that -- who cares. There is no real difference.

With UDP data in a game, on the other hand, you'd wish packets weren't dropped on purpose. You'll lose packets every now and then anyway, and it's bad enough when that happens. Sure, your application must be able to cope with packet loss somehow, but every packet lost interferes with gameplay, more or less severely. In any case, I would never provoke it on purpose, on a planned schedule.

The problem with a homebrew fragmentation implementation is that you don't really know when such a thing happens; at least, there is no easy way to find out. ICMP makes this work "magically" under IPv6, but you cannot easily access that info from an unprivileged user process. Under IPv4, fragmentation happens automatically at the router -- you never know it happened.

But you're saying that MTU of 1280 should pretty much always work, so why not break your packet into such fragments to begin with? I mean, you say one should try to keep the UDP packet size down, but doing so pretty much would need you to split your packets at application level instead.

In some cases that can be better, but then again, custom fragmentation doesn't prevent you from doing application-level splitting if you want to.

What you say seems to support custom fragmentation rather than the opposite. What am I missing?

Submitting fewer messages at a time leaves you with complete messages. Fragmenting data at a lower level (i.e. splitting up "binary data" agnostic of the message structure) will likely produce fragments containing partial messages, which can only be processed after assembling the fragments.

Insofar, doing this at application level is arguably cleaner and "better".
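The difference can be sketched as an application-level packer that only ever puts whole messages into a datagram, flushing when the next one wouldn't fit (the 1280-byte budget and the 2-byte length prefix are assumptions for illustration):

```python
import struct

BUDGET = 1280  # stay under a safe MTU so datagrams don't get fragmented

def pack_messages(messages):
    """Greedily pack whole length-prefixed messages into datagrams of
    at most BUDGET bytes; no message is ever split across datagrams."""
    datagrams, current = [], b""
    for msg in messages:
        framed = struct.pack("!H", len(msg)) + msg
        assert len(framed) <= BUDGET, "single message exceeds budget"
        if len(current) + len(framed) > BUDGET:
            datagrams.append(current)
            current = b""
        current += framed
    if current:
        datagrams.append(current)
    return datagrams

def unpack_messages(datagram):
    """Recover the complete messages carried by one datagram."""
    msgs, off = [], 0
    while off < len(datagram):
        (n,) = struct.unpack_from("!H", datagram, off)
        msgs.append(datagram[off + 2:off + 2 + n])
        off += 2 + n
    return msgs
```

Every datagram that arrives is immediately useful on its own, which is exactly the property that byte-level fragmentation gives up.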

Submitting fewer messages at a time leaves you with complete messages. Fragmenting data at a lower level (i.e. splitting up "binary data" agnostic of the message structure) will likely produce fragments containing partial messages, which can only be processed after assembling the fragments.
Insofar, doing this at application level is arguably cleaner and "better".

It does have the drawback that it breaks the abstraction. Now the application has to: 1) estimate packet size, meaning it must know the details of how the serialization works, 2) query the networking layer for the maximum packet size, and 3) add code for fragmenting and defragmenting updates in various places (been there, done that).

It also means that any change in the networking layer might suddenly touch a whole lot more code than just the parts directly related to networking.

If you really care about MTU, google "path MTU discovery" and check the "don't fragment" flag in the IP packet header.

But also note that if you have the DF flag set, and your packet is larger than the MTU along the route to the destination... your packet will get dropped. No special effort will be made to route your packet around network segments with MTUs less than what your packet size is.



if you have the DF flag set, and your packet is larger than the MTU along the route to the destination... your packet will get dropped.

That is the point! You can easily use binary search to find a proper MTU to use for your entire path this way. You only need to stop on some reasonable multiple, say 128 or 64, so a few packets is all you need to get there.

Start with 2048. If it works, double it. If not, halve it. Then binary search until the difference between "works" and "doesn't work" is your search granularity size (such as 128.)
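The doubling/halving plus binary search described above, as a sketch -- `probe` stands in for a hypothetical "send a DF-flagged datagram of this size and see whether it arrives" check, which is assumed to exist:

```python
def discover_path_mtu(probe, start=2048, granularity=128):
    """Binary-search the usable datagram size. `probe(size)` is a
    hypothetical helper that sends a DF-flagged datagram of `size`
    bytes and returns True if it made it through."""
    if probe(start):
        lo, hi = start, start * 2
        while probe(hi):               # double until something fails
            lo, hi = hi, hi * 2
    else:
        lo, hi = start // 2, start
        while not probe(lo):           # halve until something works
            lo, hi = lo // 2, lo       # (assumes some small size always passes)
    # invariant: `lo` works, `hi` fails; narrow down to the granularity
    while hi - lo > granularity:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For a simulated path MTU of 1400 this probes 2048 (fails), 1024 (works), 1536 (fails), 1280 (works), 1408 (fails) and settles on 1280 -- a handful of probes, as stated above.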

The one I built had a special file-transfer transaction mechanism (the whole 'file' data block was defined to be unusable until it was received intact on the other side). It simplified ACK retransmits by using a bit-flag return message (marking any packets that were never received), sent after a whole pass over all the packets needed to send the 'file'; only the flagged packets would be resent, and this repeated if necessary until the entire 'file' was intact. Packets got dumped directly into the right place in a preallocated file buffer to eliminate extra block copying.

It was a while ago, but that one also had callback signals to the application level that could cause throttling at a high level (where the app could make smarter decisions about how to handle a data-flow slowdown into the server). This meant, of course, that the packet driver needed some criteria to determine when to start alerting the application layer.


if you have the DF flag set, and your packet is larger than the MTU along the route to the destination... your packet will get dropped.

That is the point! You can easily use binary search to find a proper MTU to use for your entire path this way. You only need to stop on some reasonable multiple, say 128 or 64, so a few packets is all you need to get there.

Start with 2048. If it works, double it. If not, halve it. Then binary search until the difference between "works" and "doesn't work" is your search granularity size (such as 128.)

If the route later changes, though, you're either using a non-optimal (too small) MTU or your packets will drop, depending on whether the new route has a higher or lower MTU.

I would still go with 1280 for simplicity, or if you really want to do a MTU discovery, do your binary search between 1280 and 1500.

The reason being that 1280 is guaranteed by IPv6 (and every router that isn't in a museum is IPv6 capable, so it must have at least that MTU) but on the other hand you are not likely to get more than 1500 as this is what standard ethernet delivers, and there is "always some ethernet" in between you and the other computer. Unless everyone configures their cards and routers for jumbo frames, which probably won't happen.

Of course, a granularity of 128 is way too coarse for such a small interval, especially since that would only leave the choice between 1280 and 1408. Note that 1280 is what the "start at 2048" binary search would eventually converge to anyway (2048 fails → 1024 works → 1536 fails → 1280 works).

Actually, once you've discovered the MTU, you don't use the DF bit anymore, so if the path changes you may end up with IP fragmentation, but it will still hopefully work well enough.

not likely to get more than 1500 as this is what standard ethernet delivers, and there is "always some ethernet" in between you and the other computer

Gigabit Ethernet and up typically supports jumbo frames, for 9,000 bytes per packet. On the other hand, ATM transmission cells are 53 bytes: 48 of payload and 5 of header. You're not going to avoid physical fragmentation there -- but at least that fragmentation is done at the physical layer and is "invisible" to the IP layer.

For 99.9% of games, I would suggest not worrying about it. If you're using TCP, don't worry -- just make sure you turn on TCP_NODELAY, and make sure you group your outgoing packets into a single send() call per network tick. For UDP games, pack your messages into a single UDP datagram per network tick, and because you don't want traffic to be too large (or the server will have a significant upstream bandwidth problem,) you don't want each UDP packet per player to be bigger than 1280 bytes anyway.