
2-character NBSP?

For some odd reason, I managed to find a way to make an NBSP two characters long.

The original string was (With 'start' and 'end' appended for display purposes):

Code:

start&nbsp;5TS50568A30099246end

Okay, not a problem... obviously it's treating &nbsp; as 5 characters; substr() from 5 (remember, 'start' isn't actually there) and you get the number, right? No... you lose 5TS.
Well, then it must be treating the non-breaking-space character as one ASCII character, so substr() starting at 1. Nope... I get a question mark in front of the string.
Try 2, since I lost 3 characters with 5, and it works fine.

Is this a case of some odd substr() indexing? Or did I really end up with a 2-character-long NBSP?
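To see what's actually going on, here's a quick sketch (assuming PHP with the mbstring extension) that dumps the bytes the entity decodes to:

```php
<?php
// The entity decodes to U+00A0 NO-BREAK SPACE, which UTF-8 stores
// as TWO bytes: 0xC2 0xA0. substr() counts bytes, so an offset of 1
// slices the character in half (hence the question mark).
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

echo bin2hex($nbsp), "\n";             // c2a0   (two bytes)
echo strlen($nbsp), "\n";              // 2      (byte count)
echo mb_strlen($nbsp, 'UTF-8'), "\n";  // 1      (character count)

$s = $nbsp . '5TS50568A30099246';
echo substr($s, 2), "\n";                    // skips both bytes: 5TS50568A30099246
echo mb_substr($s, 1, null, 'UTF-8'), "\n";  // same result, counting characters
```

So it is a single character that happens to occupy two bytes; substr() just doesn't know about characters, only bytes, while the mb_* functions count characters.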

Note: You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string. That's because the '&nbsp;' entity is not ASCII code 32 (which is what trim() strips) but character code 160 (0xA0) in the default ISO 8859-1 character set.
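If you do want trim() to strip it, you can pass the byte(s) explicitly; a rough sketch (the byte values depend on which charset is in play):

```php
<?php
// trim() strips only " \t\n\r\0\x0B" by default. To strip a decoded
// &nbsp; you have to name its bytes yourself: 0xA0 in ISO-8859-1,
// or the 0xC2 0xA0 pair in UTF-8.
$iso  = "\xA0" . '5TS50568A30099246' . "\xA0";         // ISO-8859-1 NBSP padding
$utf8 = "\xC2\xA0" . '5TS50568A30099246' . "\xC2\xA0"; // UTF-8 NBSP padding

echo trim($iso, "\xA0"), "\n";       // 5TS50568A30099246
echo trim($utf8, "\xC2\xA0"), "\n";  // 5TS50568A30099246
```

One caveat: trim()'s character list is byte-based, so trimming "\xC2\xA0" will also eat any other multibyte character at the ends of the string whose bytes happen to be 0xC2 or 0xA0. Fine for this string, but not a general solution.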


No, but it's represented as a single-character space. Why would a single-character space take two byte values to interpret, especially when one of those values is completely unnecessary in the other charset?

Unicode uses up to four bytes per character because it supports over a million different characters. ASCII only supports 128 characters and so doesn't even use a full byte. If you can find a way to fit over a million values into a single byte that can only hold 256 different values, then you can define an alternative to Unicode that uses only single-byte characters, and you will be able to make a fortune.

Since UTF-8 and UTF-16 do not use 4 bytes consistently for all characters, there is obviously a set of byte values reserved to mark that a character uses two or more bytes instead of one. A0 (or 160) is outside the single-byte range, and therefore gets a two-byte representation. It is still a single character; it just uses two bytes to hold the character instead of one.

What do you mean by a single-character space? ❶ is a single character. So is 中. Do you propose that each of them could somehow be represented by a single byte, considering that there are no more than 256 combinations of bits in a byte?
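For the record, those byte counts are easy to check. A sketch (assuming PHP 7+ for the \u{} escape syntax):

```php
<?php
// UTF-8 sequence lengths: U+0000-U+007F take one byte, U+0080-U+07FF
// (which includes NBSP, U+00A0) take two, U+0800-U+FFFF take three,
// and everything above takes four. The lead byte announces the length.
$samples = [
    'U+0041 A'    => 'A',
    'U+00A0 NBSP' => "\u{00A0}",
    'U+4E2D'      => "\u{4E2D}",   // 中
    'U+1F600'     => "\u{1F600}",  // emoji, outside the BMP
];
foreach ($samples as $label => $ch) {
    printf("%-12s %d byte(s)  %s\n", $label, strlen($ch), bin2hex($ch));
}
// A is 1 byte (41), NBSP is 2 (c2a0), 中 is 3 (e4b8ad),
// and the emoji is 4 (f09f9880).
```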