Other related terms I've seen used

This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.

Character string

This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.

Byte semantics

This usually refers to how code behaves when given a string in the UTF8=0 storage format, as distinct from how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Character semantics

This usually refers to how code behaves when given a string in the UTF8=1 storage format, as distinct from how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.

now $x consists of a single byte, even though it requires 16 bits to encode.

Perhaps the confusion comes from saying that for your definition of a byte, the UTF8 flag doesn't matter, yet it refers to a string element, which is defined in terms of substr, for which the UTF8 flag *does* matter.

I'd say that in my example, $x ends up having 2 bytes, but one character. This is also the difference wc makes.

Of course, you are free to use whatever definition you want -- just do mind that not all people share your definition. Some people prefer not to use the term byte at all, just character and octet.

Yes, that's what I call a byte. So maybe it's my definition, not my term that's unclear.

which is defined in terms of substr, for which the UTF8 flag *does* matter

Ah, there's the problem. "The UTF8 flag doesn't matter" means different things to us. For a given string, substr will always return the same value regardless of the UTF8 flag, so I say the UTF8 flag doesn't matter to substr.

Right, we do mean something different by "the UTF-8 flag doesn't matter". I interpret it to mean that the only difference between the internal representations of the strings is whether the UTF-8 flag is set or not -- but you use it to mean "it doesn't matter whether the internal encoding is UTF-8 or not".

I've certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over it all. I have a sneaking suspicion that it's our own fault that folks get confused, although I can't pin my nebulous feeling down any better than I have just now stated.

One difficulty you're having is that you are sometimes talking about abstract strings but at other times about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.

If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.

A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word 'character' has glyphic connotations, or even typewriter-keystroke connotations. It's certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.

A code point that fits within 8 bits is sometimes called a byte.
A code point that fits within 21 bits is sometimes called a Unicode code point. Perl's code points are not limited to 21 bits, but to the size of your system's largest unsigned integer, probably either 32 or 64 bits.

Unicode recognizes only two abstractions: code points and grapheme sequences. Both are determinable programmatically. A code point corresponds to what the programmer is apt to think a character to mean, being an individual scalar element in a string.

However, a grapheme sequence is more apt to correspond to what the end-user thinks of as a character, because it looks like a single glyph. For example, the letter b with an acute accent is a grapheme that the user will think of as just one solitary character, whereas the programmer is apt to think of it as a sequence of two distinct code points. A very common grapheme that requires two code points is the sequence of a carriage return immediately followed by a line feed.

As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components in terms of code points alone (and graphemes built up of code points) as the fundamental constituent string components, not in terms of bytes and characters, I (try to) avoid the thornier issues.

I do not believe the UTF8 flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements (here, code points, meaning 'character' numbers) of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.

It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk.
For those who need to talk about serializations (comparatively few, I stress), then and only then can you elaborate the dirtier details for that much smaller audience.

But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for 'byte' and 'character'. Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.

Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That's all that matters in an abstract string; physical memory is a different matter, of course.

As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components in terms of code points alone (and graphemes built up of code points) as the fundamental constituent string components, not in terms of bytes and characters, I (try to) avoid the thornier issues.

And by avoiding the UTF-8 flag/encoding, you're creating confusion. $x = "\xBB"; utf8::upgrade($x); Now it's not clear to me whether you consider $x to be a byte or not. One can encode 0xBB in 8 bits (and it is encoded in 8 bits in LATIN-1), but its UTF-8 encoding uses 16. So, if you say "A code point that fits within 8 bits is sometimes called a byte", that's ambiguous. Whether or not the code point 0xBB fits in 8 bits depends on its encoding.

The sequence of string elements in a string. This is not affected by the string's UTF8 flag.

This is not what you want to say (it immediately confused me because my first thought was, "This is wrong because if you have a string with non-ASCII characters in it and change its UTF8 flag, that will change the sequence of elements.")

What I think you meant to say

The sequence of string elements in a string, irrespective of any particular choice of memory representation being used for that string

and only bring up the UTF8 flag later.

Also, IMHO there needs to be distinct terminology for

(1) a grouping of (usually but not always) 8 consecutive bits of physical storage

(2) the abstract array element in the case where all elements are expected to be in the range 0-255, regardless of the actual storage format

the problem being (as noted by others) that most people associate "byte" with (1). Using it for (2) is unlikely to reduce the confusion out there, and not having distinct terminology makes it difficult to talk about storage formats.

One could possibly commandeer "octet" for (2), but realize that "octet" originated in the RFC world, where a word was needed to refer to physical storage in the specific case where bytes are explicitly known to be 8 bits. On the other hand, most of the stuff in the RFC world is indeed trying to abstract away from specific hardware, so one could justify its usage in a more abstract sense that way. And "octet" does, at least, immediately imply 0-255, unlike "byte".

There's also the small matter that it really doesn't make a whole lot of sense to use the UTF-8-flag-on format to store something that is composed of octets, even if it is indeed possible to do. That is why people conflate octet strings with the UTF-8-flag-off format and its 1-1 correspondence between octets and bytes.
(...and thus why it is indeed important to point out that UTF-8-flag-on octet strings are possible, albeit silly...)

Except you updated the wrong thing. It's the sentence "This is not affected by the string's UTF8 flag," under Basics -> "String Value", that's tripping me (and apparently also JavaFan) up, and that needs to either go away or be changed.

a grouping of (usually but not always) 8 consecutive bits of physical storage

UTF8=0 storage format.

No, any storage format. In order to talk about storage formats at all you need a word for the raw underlying bytes whatever they are and however they're to be interpreted, and redefining "byte" to mean something else makes this really difficult.

You need a different word, and you're probably right that "octet" isn't a great choice either, so I had another thought: How about one of the following to refer to string elements that are constrained to lie in the 0-255 range?

"octetchar"

"octet-character"

"bytecharacter"

"byte-character"

"bytechar"

as opposed to "general character" or "Unicode character" when the full Unicode (or UV) range is possible. This way you're emphasizing that they're still characters in the sense that everybody agrees on (i.e., they're elements of a string and we're abstracting away from how they're represented). If you then say that a single octetchar can actually be multiple bytes in the UTF8=1 storage format, the meaning is clear.

Any glossary that starts out by trying to redefine the (industry-standard) term "byte" to mean something other than "the minimally addressable unit of memory by the processor" invalidates itself from that point on. Just pissing in the wind.

As for "the Unicode bug", Unicode is the bug.

There's no point in rehashing the arguments, but one thing worth saying is that the longer people like you continue to try to rewrite history in this way, in an attempt to excuse the broken Perl implementation of Unicode, the longer it will be before we can get back to a world of sensible, sane, predictable and intuitively usable semantics.
