Archive for January, 2015

After some friends were ranting about Boost on Twitter, I decided to do something about the single Boost dependency in my game code… but not the usual thing (rewriting it).

I’ve been using Boost’s operators.hpp for nearly a decade, because it actually makes writing operator code cleaner. Alas, my trimmed down subset of Boost was bringing nearly 60 source files with it, just for one file. That’s dumb. Thus, Boost without Bullshit was born:

The idea behind boost-wobs is that yes, there is some good stuff in Boost, but many developers refuse to use it because of the “BS” that comes with it. You can’t just add a single header to your project and get only the feature you want. It’s all or nothing.

I’ve now fixed operators.hpp to make it standalone with zero dependencies. Optionally, it also supports operations on std::iterators, which is now enabled using a special define. It used to use boost::iterator, but boost/iterator.hpp is deprecated and is actually just std::iterator, so why not use std::iterator directly?
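For anyone who hasn’t used operators.hpp, the general trick can be sketched in a few lines. This is only an illustration of the technique, not the boost-wobs source: the names `addable` and `Vec2` are made up for the example, and the real header covers many more operators.

```cpp
// Sketch of the idea: write operator+= once, get operator+ for free.
// `addable` and `Vec2` are illustrative names, not the library's.
template <class T>
struct addable {
    // Friend function defined in the base, found via argument-dependent lookup.
    friend T operator+(T lhs, const T& rhs) {
        lhs += rhs;   // reuse the user-supplied compound assignment
        return lhs;
    }
};

struct Vec2 : addable<Vec2> {
    float x, y;
    Vec2(float x_, float y_) : x(x_), y(y_) {}
    // The only operator written by hand.
    Vec2& operator+=(const Vec2& o) { x += o.x; y += o.y; return *this; }
};
```

With this in place, `Vec2 c = a + b;` compiles even though `Vec2` never declares `operator+` itself.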

This was a bit of a shower thought, but until this I didn’t have a good reason to choose UTF-16 for anything.

UTF-8 makes a lot of sense. It has all the benefits of ASCII text formatting, and the ability to support additional characters above and beyond ASCII’s 128 codes (or the 256 of an 8-bit extended set). It’s very similar to ASCII, you just have to be careful with your null and extended codes. No arguments here.

The main problem with UTF-8 is that you can’t just iterate as you do with ASCII (ch++). You need to check for multi-byte characters at every step (just in case). On the plus side, every byte of a multi-byte character is 0x80 or higher, so the initial test is cheap; it’s decoding the variable-length sequences that is a little costly.
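A minimal sketch of that per-step check, assuming well-formed input (no validation of bad sequences is attempted). The function name `utf8_next` is made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Decode one code point starting at s[i] and advance i past it.
// Assumes well-formed UTF-8; malformed input is not detected here.
inline std::uint32_t utf8_next(const unsigned char* s, std::size_t& i) {
    std::uint32_t c = s[i++];
    if (c < 0x80) return c;               // the cheap test: plain ASCII byte
    int extra;                            // number of continuation bytes
    if      ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }  // 2-byte sequence
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }  // 3-byte sequence
    else                         { c &= 0x07; extra = 3; }  // 4-byte sequence
    while (extra-- > 0) {
        c = (c << 6) | (s[i++] & 0x3F);   // each continuation byte adds 6 bits
    }
    return c;
}
```

The ASCII path costs one compare; only bytes at 0x80 and above pay for the extra branching.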

UTF-16 makes a lot of sense for the actual storage format of strings in RAM. Most characters are a single 16-bit code unit (in any product I plan on working on, anyway), but there is support for 32-bit characters made of two code units (surrogate pairs) as well. UCS-2 is the name of the legacy wide-character pre-UTF encoding.

With UTF-16, if you prepare for it, you can get away with simple iteration (short* ch++). Not to mention, it’s a 2-byte read, which should effectively be faster/less wasteful than multiple 1-byte reads. It would be wise to just discard/replace characters above the 16-bit range (no Emoji). China may have a problem (GB 18030), but it’s a controlled situation, and most characters will be on the Basic Multilingual Plane anyway. Plus there’s a whole 6k of Private Use glyphs if needed (item iconography like swords, shields, potions, etc). That’s kind-of a nice feature.
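The “prepare for it” step could look something like the sketch below: one pass that swaps anything outside the 16-bit range for U+FFFD (the Unicode replacement character), so that every remaining 16-bit unit is a whole character. The function name `bmp_only` is hypothetical, and replacing rather than dropping is just one of the two options mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Rewrite a UTF-16 buffer in place so it holds only BMP characters.
// A surrogate pair (or a stray surrogate) becomes a single U+FFFD.
// Returns the new length in 16-bit code units.
inline std::size_t bmp_only(std::uint16_t* s, std::size_t len) {
    std::size_t out = 0;
    for (std::size_t i = 0; i < len; ++i) {
        std::uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            s[out++] = 0xFFFD;  // valid pair: one replacement for both units
            ++i;                // skip the low surrogate
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            s[out++] = 0xFFFD;  // stray surrogate
        } else {
            s[out++] = u;       // ordinary BMP character, keep as-is
        }
    }
    return out;
}
```

After this pass, a plain `ch++` loop over the buffer visits exactly one character per step.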

ASCII still makes a lot of sense for keys, script code, and file names. Since UTF-8 is backwards compatible with 7-bit ASCII, if we impose the restriction that all keys will be 7-bit ASCII, then our strings are an optimal/smaller size. Also, although each UTF-16 code unit is wider, its values for these characters match 7-bit ASCII as long as the endianness is correct. With the exception of strings and comments, it’s reasonable to impose a 7-bit ASCII restriction. After all, this is still a UTF-8 compatible file.

16-bit String Lengths are a reasonable limitation. 0 to 65535 will cover 99.99% of string lengths. The only thing that will push it over is if you happen to have a novel’s worth of text, or a large body of text that’s heavily tagged (HTML/XML). So it’s bad for a web browser or text file, but fine for anything else. In an optimal use case, this means you’re using 3 extra bytes per UTF-8 string (16-bit size, 8-bit terminator) or 4 extra bytes per UTF-16 string (16-bit size, 16-bit terminator).

This aligns well if you NEVER plan to take advantage of 32-bit or 64-bit reads. If you do, however, then a wider length field means the string data itself will be padded to your preferred boundary. Use the platform’s size_t for optimal usage.

Most standard zero-checking string functions are better fed a pointer to the data directly. This lets them work exactly like normal C strings, looping until they hit a zero terminator. But a smarter string function may want to know the size faster (e.g. when testing equality, compare sizes first).
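Here is a rough sketch of the UTF-8 case: a 16-bit length, the bytes, then a zero terminator, for 3 bytes of overhead. The `Str16` type and the helper names are hypothetical, and the length prefix is stored in native byte order for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<std::uint8_t> Str16;  // hypothetical blob: [len16][bytes][0]

// Build a blob from a C string. Sketch only: text longer than 65535 bytes
// is truncated, which could split a multi-byte UTF-8 character.
inline Str16 make_str16(const char* text) {
    std::size_t len = std::strlen(text);
    if (len > 0xFFFF) len = 0xFFFF;
    Str16 blob(2 + len + 1);                  // size + data + terminator
    std::uint16_t n = static_cast<std::uint16_t>(len);
    std::memcpy(blob.data(), &n, sizeof n);   // native-endian length prefix
    std::memcpy(blob.data() + 2, text, len);
    blob[2 + len] = 0;                        // C-style terminator
    return blob;
}

// Pointer to the character data, usable with ordinary C string functions.
inline const char* str16_data(const Str16& blob) {
    return reinterpret_cast<const char*>(blob.data() + 2);
}

// Read the stored length without scanning for the terminator.
inline std::uint16_t str16_size(const Str16& blob) {
    std::uint16_t n;
    std::memcpy(&n, blob.data(), sizeof n);
    return n;
}

// The "smarter" comparison: check sizes first, then compare the bytes.
inline bool str16_equal(const Str16& a, const Str16& b) {
    std::uint16_t n = str16_size(a);
    return n == str16_size(b) &&
           std::memcmp(str16_data(a), str16_data(b), n) == 0;
}
```

Passing `str16_data(s)` to `strlen`, `strcmp` and friends works unchanged, while `str16_equal` can reject strings of different lengths without touching the data.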

A pre-padded string length can be made mostly compatible with a pre-padded datablock type. The string version will be one character longer, but any functions that deal with copies will be (mostly) the same (one extra action, pad with a 0 at the end).

Line Chunked, in addition to whole strings, is a useful format. A text editor would want a fast way to go from line 200 to 201. Always iterating until you find a newline is slow, so it’s best to find the line breaks once, up front. If the line sizes don’t need to change, you can butcher the source string by replacing CR and LF with 0, and keeping a pointer to each line start. If you need to change lines, and those lines will definitely grow larger than some maximum, then each line should be separately allocated, and be capable of re-allocating.

Unfortunately this doesn’t help for word wrap. Word wrapping is completely dependent on the size of the box the text is being fit into. Ideally, you probably want some sort of array of linked lists containing wrap points. Each index should know how many wraps it has, meaning you’ll need to track both what line and what wrap you are on. Process the text the same way as Line Chunked, but you wrap at wraps.
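As a rough sketch of computing the wrap points for one line: the version below assumes fixed-width characters and a box measured in columns, and it returns a plain vector per line rather than a linked list (a swap made here for brevity). The name `wrap_points` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the offsets at which a new visual row starts inside `line`,
// for a box `width` columns wide. Breaks after the last space in the row
// when there is one, otherwise mid-word. A space that lands exactly on
// the edge is allowed to hang past it.
inline std::vector<std::size_t> wrap_points(const std::string& line,
                                            std::size_t width) {
    const std::size_t npos = std::string::npos;
    std::vector<std::size_t> wraps;
    std::size_t row_start = 0;
    std::size_t last_space = npos;   // last space seen in the current row
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] == ' ') last_space = i;
        if (i - row_start >= width) {            // row is full
            std::size_t next = (last_space != npos) ? last_space + 1 : i;
            wraps.push_back(next);
            row_start = next;
            last_space = npos;
        }
    }
    return wraps;
}
```

Calling this once per chunked line gives an array of wrap lists, and the number of visual rows for a line is `wraps.size() + 1`.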

Text Interchange formats like JSON can be padded with spaces/control characters. A padded JSON file won’t be as tightly packed as a whitespace-removed one, but the fastest way to read/write string data is when it’s 32-bit aligned. On many ARM chips, it’s actually a requirement that you do aligned 32-bit reads. On Intel chips it wastes fewer cycles.
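The padding arithmetic is tiny. A sketch, with hypothetical helper names `pad4` and `pad_to_4`, that rounds a text buffer up to a multiple of 4 bytes using trailing spaces (which JSON parsers ignore):

```cpp
#include <cstddef>
#include <string>

// Round a byte count up to the next multiple of 4.
constexpr std::size_t pad4(std::size_t n) {
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// Append spaces so the text length is a multiple of 4 bytes.
inline void pad_to_4(std::string& text) {
    text.append(pad4(text.size()) - text.size(), ' ');
}
```

The same rounding applies to individual string values if you want each one to start on a 32-bit boundary.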

Or if all you care about is EFIGS (English, French, Italian, German, Spanish), then just use ASCII and be done with it. There should be some rarely/never used characters under the Extended ASCII set, which gives room for your custom stuff.

Long story short:

Use UTF-8 as an external storage format. Use UTF-16 as an in-memory storage format. Use ASCII for keys, script code, file names, etc. Or be lazy, do ASCII for EFIGS.


The changes go against the strict JSON spec, but they are usability improvements:

Add C style /* Block Comments */ and C++ style // Line Comments.

Add support for Tail Commas on the LAST LINES of Arrays and Objects.

The benefits of comments in JSON should be obvious, but Tail Commas are quite the nicety. When manually editing JSON files, you sometimes re-order your lines with copy+paste. According to the spec, all members of an Object or Array are followed by a comma, except the last one. Removing that comma, or making sure it exists on all lines after copy+pasting, is an unnecessary pain. Now it’s optional.
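To illustrate what the two relaxations mean in practice, here is a sketch of a pre-pass that turns relaxed text back into strict JSON. It is not the parser described above (which accepts these forms directly); the function name `to_strict_json` is made up, and the input is assumed to be otherwise valid.

```cpp
#include <cstddef>
#include <string>

// Convert "relaxed" JSON to strict JSON text:
//  - drops // line comments and /* block comments */ outside strings
//  - drops a comma that is followed only by whitespace before ] or }
inline std::string to_strict_json(const std::string& in) {
    std::string out;
    bool in_string = false;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char c = in[i];
        if (in_string) {
            out += c;
            if (c == '\\' && i + 1 < in.size()) out += in[++i];  // keep escaped char
            else if (c == '"') in_string = false;
        } else if (c == '"') {
            in_string = true;
            out += c;
        } else if (c == '/' && i + 1 < in.size() && in[i + 1] == '/') {
            while (i < in.size() && in[i] != '\n') ++i;   // skip to end of line
            if (i < in.size()) out += '\n';               // keep the newline
        } else if (c == '/' && i + 1 < in.size() && in[i + 1] == '*') {
            i += 2;
            while (i + 1 < in.size() && !(in[i] == '*' && in[i + 1] == '/')) ++i;
            ++i;                                          // land on the closing '/'
        } else if (c == ']' || c == '}') {
            // Look back past whitespace; if a comma is waiting there, remove it.
            std::size_t j = out.size();
            while (j > 0 && (out[j - 1] == ' ' || out[j - 1] == '\t' ||
                             out[j - 1] == '\n' || out[j - 1] == '\r')) --j;
            if (j > 0 && out[j - 1] == ',') out.erase(j - 1, 1);
            out += c;
        } else {
            out += c;
        }
    }
    return out;
}
```

Text inside string values is left alone, so a value like "a//b" or "a," survives untouched.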

Hello 2015. The laptop I’ve been using the past few years actually fell apart, so I bought a new one. I started using Linux (Ubuntu) almost exactly 1 year ago, and as much as I like it, it’s not always the most logical and obvious OS to use, so I take notes. Here are notes.

I dropped a 500 GB Samsung SSD into it, dedicated a 200 GB partition to Windows 7 Pro, and the rest to Ubuntu 14.10 (~260 GB, though I’m thinking about giving Windows a little more). I’ve installed various dev tools and SDKs on both. As expected, Windows has used about 100 GB of its space, and Ubuntu about 20 GB. Typical. 😉

Return of Oibaf (bleeding edge Video drivers)

Setting up the machine went very smoothly, except for two issues:

– The Steam UI was … strange. Not slow, but unresponsive (had to right click to refresh)
– The Print Screen key (and scrot) could not take screenshots (the captured images were wrong)

The solution was to upgrade the video driver. The very latest Intel and OSS drivers are always available as part of the Oibaf graphics driver package.

The latest-and-greatest drivers can sometimes be risky to use. I normally don’t use them, but for a time I did, and all was fine until Mesa mainline got busted. Sadly oibaf doesn’t keep “last known good drivers” around, only the bleeding edge, so I only recommend it if all else fails.

Skype Tray Icon Fix

Skype is a 32-bit app running on 64-bit Linux. To make its tray icon work correctly, install this package.

sudo apt-get install sni-qt:i386

Install it, then restart Skype.

TLP – Advanced Linux Power Management

This little piece of software dramatically improves my Linux battery life. I had used it in the past, but for a time I was using pre-release builds of Ubuntu, and no TLP update was available (I hadn’t yet learned how to grab old versions of software from PPAs).

About tooNormal

tooNormal is the digital notebook of Mike Kasprzak. Some may call it a blog, but it's more a collection of notes and thoughts, when Twitter just isn't verbose enough.

Mike is a long time veteran game developer, having done time at various game studios plus "the indie thing" for well over a decade. He owns and operates SYKRONICS. He also organizes and runs Ludum Dare with some awesome people.