Variables

Define Documentation

Function Documentation

int detectUTF8(const char *txt)

Determine whether the string contains a valid Unicode sequence. The following table gives the bit pattern of a valid UTF-8 character.

    Unicode Range              1st       2nd       3rd       4th
    U-00000000 - U-0000007F    0nnnnnnn
    U-00000080 - U-000007FF    110nnnnn  10nnnnnn
    U-00000800 - U-0000FFFF    1110nnnn  10nnnnnn  10nnnnnn
    U-00010000 - U-001FFFFF    11110nnn  10nnnnnn  10nnnnnn  10nnnnnn

Note:
    1. The latest UTF-8 RFC allows a maximum of 4 bytes per character. Earlier versions allowed 6.
    2. For a multi-byte character, the number of 1 bits in the leading byte before the first 0 is the total number of bytes in the sequence.
    3. The "n" positions are the bits of the Unicode code point.

This routine does not check whether the code point is in range, although it could.
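Note 2 above can be illustrated with a small sketch. The helper name utf8_sequence_length and its demo function are hypothetical and not part of this library; they only show how counting the leading 1 bits of the first byte gives the sequence length from the table.

```c
#include <assert.h>

/* Hypothetical helper (not part of the documented API): return the number
 * of bytes announced by a UTF-8 leading byte, or 0 if the byte cannot
 * start a character. */
static int utf8_sequence_length(unsigned char lead)
{
    int ones = 0;

    /* Count the 1 bits that precede the first 0 bit. */
    while (lead & 0x80) {
        ones++;
        lead = (unsigned char)(lead << 1);
    }

    if (ones == 0) return 1;             /* 0nnnnnnn: single byte */
    if (ones == 1 || ones > 4) return 0; /* continuation byte or illegal start */
    return ones;                         /* 2, 3 or 4 byte sequence */
}

/* Illustrative checks against the rows of the table. */
static void utf8_sequence_length_demo(void)
{
    assert(utf8_sequence_length(0x41) == 1); /* 01000001 */
    assert(utf8_sequence_length(0xC3) == 2); /* 11000011 */
    assert(utf8_sequence_length(0xE2) == 3); /* 11100010 */
    assert(utf8_sequence_length(0xF0) == 4); /* 11110000 */
    assert(utf8_sequence_length(0x80) == 0); /* 10000000: continuation byte */
    assert(utf8_sequence_length(0xF8) == 0); /* 11111000: 5-byte form, no longer legal */
}
```

The single-byte case is handled separately because a 7-bit byte has no leading 1 bits at all.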

Parameters:
    txt    the text to check

Returns:
     1    if all high order characters form a valid Unicode sequence
    -1    if there are no high order characters. Note: this is also a valid Unicode sequence
     0    if there are high order characters that do not form a valid Unicode sequence

Author:
    DM Smith
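Because two of the three return values are non-zero, a plain truth test on the result treats both pure ASCII (-1) and multi-byte UTF-8 (1) as acceptable. A minimal sketch of how a caller might interpret the result follows; describe_detect_result and is_acceptable_utf8 are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: map a detectUTF8 return value to a description. */
static const char *describe_detect_result(int result)
{
    switch (result) {
    case 1:  return "valid UTF-8 with multi-byte characters";
    case -1: return "7-bit ASCII only";
    case 0:  return "not valid UTF-8";
    default: return "unexpected result";
    }
}

/* Hypothetical helper: both 1 and -1 denote text that is valid UTF-8. */
static int is_acceptable_utf8(int result)
{
    return result != 0;
}
```

A caller that needs to tell ASCII apart from multi-byte text should compare against 1 and -1 explicitly instead of relying on the truth value.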

{
    unsigned int countUTF8 = 0;
    int count = 0;

    // Cast it to make masking and shifting easier
    const unsigned char *p = (const unsigned char*) txt;
    while (*p) {
        // Is the high order bit set?
        if (*p & 0x80) {
            // Then count the number of high order bits that are set.
            // This determines the number of following bytes
            // that are a part of the unicode character
            unsigned char i = *p;
            for (count = 0; i & 0x80; count++) {
                i <<= 1;
            }

            // Validate count:
            // Count 0: bug in code that would cause core walking
            // Count 1: is a pattern of 10nnnnnn,
            //          which does not signal the start of a unicode character
            // Count 5 to 8: 111110nn, 1111110n and 11111110 and 11111111
            //          are not legal starts, either
            if (count < 2 || count > 4) return 0;

            // At this point we expect (count - 1) following characters
            // of the pattern 10nnnnnn
            while (--count && *++p) {
                // The pattern of each following character must be: 10nnnnnn
                // So, compare the top 2 bits.
                if ((0xc0 & *p) != 0x80) return 0;
            }

            // Oops, we've run out of bytes too soon: Cannot be UTF-8
            if (count) return 0;

            // We have a valid UTF-8 character, so count it
            countUTF8++;
        }

        // Advance to the next character to examine.
        p++;
    }

    // At this point it is either UTF-8 or 7-bit ascii
    return countUTF8 ? 1 : -1;
}
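The description notes that the routine does not check whether the decoded code point is in range. A sketch of what such a check might look like is given below. It is not part of the listed source: utf8_codepoint_in_range is a hypothetical helper, and it assumes the caller has already confirmed the byte patterns (a leading byte followed by len - 1 continuation bytes), as the loop above does. It rejects overlong encodings, UTF-16 surrogates and values above U+10FFFF, which RFC 3629 excludes.

```c
#include <assert.h>

/* Hypothetical helper (an assumption, not the library's code): decode one
 * structurally valid multi-byte sequence of len bytes starting at p and
 * report whether its code point is acceptable. Returns 1 if so, else 0. */
static int utf8_codepoint_in_range(const unsigned char *p, int len)
{
    /* Smallest code point that legitimately needs len bytes. */
    static const unsigned long min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long cp;
    int i;

    assert(p != 0);
    if (len < 2 || len > 4) return 0;

    /* The leading byte carries (7 - len) payload bits. */
    cp = p[0] & (0xFFu >> (len + 1));

    /* Each continuation byte carries 6 payload bits. */
    for (i = 1; i < len; i++) {
        cp = (cp << 6) | (p[i] & 0x3Fu);
    }

    if (cp < min_for_len[len]) return 0;        /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */
    return 1;
}
```

In the routine above, such a check would fit just before countUTF8 is incremented, with a pointer to the saved leading byte and the original count passed in.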