The problem is that Strings expect to be Unicode, and not all 16-bit values, or combinations of 16-bit values, are legal Unicode strings.

The biggest and most obvious problem is that Unicode supplementary characters (U+10000 through U+10FFFF) are encoded as a surrogate pair, consisting of a High Surrogate (U+D800 - U+DBFF) immediately followed by a Low Surrogate (U+DC00 - U+DFFF). The Windows WCHAR type and the .Net char and String types are implicitly UTF-16 encoded Unicode, so any character in the D800 - DFFF range is expected to be a High Surrogate immediately followed by a Low Surrogate. Encountered separately, those characters make a String invalid Unicode. The behavior of illegal Unicode sequences isn't always consistent or expected. Additionally, Unicode 4.0 forbids emitting illegal Unicode sequences, so .Net v2.0 doesn't convert those values when doing UTF-8 conversions.
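To see this concretely, here's a short sketch in Python (used purely for illustration; the same Unicode rules apply to .Net's UTF-16 strings):

```python
# U+10000, the first supplementary character, becomes a surrogate pair
# (high surrogate D800 followed by low surrogate DC00) in UTF-16.
ch = "\U00010000"
assert ch.encode("utf-16-be") == b"\xd8\x00\xdc\x00"

# A lone surrogate, however, is not legal Unicode, so a strict encoder
# refuses to convert it -- much like .Net v2.0's UTF-8 conversion.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```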

Another, less obvious, problem is that Unicode is intended to encode glyphs and has rules for that purpose. So if your binary data looked like Ö (U+00D6) and you (or an API you called) ran that string through Normalization Form D, your data would change to the two code points O + ̈ (U+004F U+0308). Other edge cases include the treatment of U+0000 (particularly for native Windows APIs), unassigned code points, the private use area, etc.
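That normalization behavior is easy to reproduce with Python's unicodedata module (again just illustrating standard Unicode behavior, not anything .Net-specific):

```python
import unicodedata

s = "\u00d6"  # Ö as a single precomposed code point

# Normalization Form D decomposes it into O + combining diaeresis.
nfd = unicodedata.normalize("NFD", s)
assert nfd == "\u004f\u0308"

# One "character" of binary data silently became two code points.
assert len(s) == 1 and len(nfd) == 2
```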

Problems can also arise from the method used to convert binary or byte data to a string, such as systems disagreeing on which code page to convert through. (The second example above would produce different values on systems with different system code pages.) Some code pages don't have one-to-one mappings for all binary values. Sometimes Unicode character data is converted again when passed to another system, potentially introducing more loss.
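Both failure modes can be sketched with two Windows code pages, using Python's codec names (an illustration, not the only way this shows up):

```python
# The same byte means different things in different code pages...
assert b"\xd6".decode("cp1252") == "\u00d6"  # Latin capital O with diaeresis
assert b"\xd6".decode("cp1251") == "\u0426"  # Cyrillic capital Tse

# ...and some byte values have no mapping at all: 0x81 is undefined
# in Windows-1252, so a strict conversion simply fails.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    print("0x81 has no Windows-1252 mapping")
```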

One way to look at this problem is that it's similar to casting pointers in C. Sure, you can cast a pointer to a different type, but it increases the risk that something will break. Copying binary data into a String or char[] is similar to casting: it changes the representation of the underlying data to a strictly defined type with a different meaning. You probably wouldn't think of storing jpg binary data in a String, but it doesn't seem so bad if it's just a little hash value.

Applications should preferably pass binary data as binary data 🙂 If your application must represent binary data as character or string data, such as to tunnel in an IRI query string, then you should understand carefully how that data will be used and confine your application to safe Unicode ranges. A %-escaped sequence, confining the values to the ASCII range, or some other mechanism might be appropriate.
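For example, percent-escaping keeps arbitrary bytes inside the safe ASCII range and round-trips losslessly (shown here with Python's urllib, purely as a sketch of the idea):

```python
from urllib.parse import quote, unquote_to_bytes

# Raw bytes that, reinterpreted as UTF-16 code units, would look like
# misplaced surrogates -- exactly the hazard described above.
data = bytes([0xD8, 0x00, 0xDC, 0x00])

escaped = quote(data)                     # every byte becomes %XX, all ASCII
assert escaped == "%D8%00%DC%00"
assert unquote_to_bytes(escaped) == data  # lossless round trip
```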

One example of a safer encoding mechanism is how IDN uses Punycode encoding. Punycode effectively creates binary data when encoding an IDN name, and that binary data then needs to pass through a system (DNS) that is restricted to a subset of ASCII. Punycode therefore encodes the binary values into the ASCII range. Values that happened to look like surrogates can't be confused, and the ASCII range by itself is devoid of any normalization mappings that could cause confusion. This mechanism is not as efficient as just passing binary data, but it does allow binary data to be passed in a domain name. Numerous e-mail and attachment encoding mechanisms work in a similar fashion.
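Python exposes both codecs, so the IDN example can be sketched directly:

```python
label = "b\u00fccher"  # "bücher", a non-ASCII domain label

# Punycode turns the non-ASCII label into a pure-ASCII form...
assert label.encode("punycode") == b"bcher-kva"

# ...and IDNA wraps that in the xn-- prefix DNS actually carries.
assert label.encode("idna") == b"xn--bcher-kva"
```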