Talk:Broken/\xe2\x80\x82

Failure to be verified may either mean that this information is fabricated, or is merely beyond our resources to confirm. We have archived here the disputed information, the verification discussion, and any documentation gathered so far, pending further evidence.
Do not re-add this information to the article without also submitting proof that it meets Wiktionary's criteria for inclusion.

Does this make any sense to anyone? SemperBlotto 08:40, 7 April 2009 (UTC)

Well, that took me some time to figure out. It's a space. I think it should be deleted, if for no other reason than that it would be incredibly difficult to link to. I suppose it is meaningful.....in quite a number of languages..... What an odd case. -Atelaesλάλει ἐμοί 09:15, 7 April 2009 (UTC)

Not just a space, it’s the en space (\u2002 as opposed to \u0020). —Stephen 12:06, 7 April 2009 (UTC)
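The distinction drawn above can be checked directly. What follows is a minimal Python sketch, not part of the original discussion, using the standard `unicodedata` module; the variable names are illustrative only. It prints each character's code point, Unicode name, and UTF-8 bytes, and the bytes for U+2002 are the same sequence that appears, escaped, in this page's title.

```python
import unicodedata

EN_SPACE = "\u2002"       # the character under discussion
ASCII_SPACE = "\u0020"    # the ordinary space

for ch in (EN_SPACE, ASCII_SPACE):
    # Print the code point, its official Unicode name, and its UTF-8 bytes.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))
```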

(after edit conflict) I agree. Someone who can define en space well should create the entry and include the code there. Cheers. -- ALGRIFtalk 17:47, 7 April 2009 (UTC)

Already added en space (as well as en quad, em space, and em quad) and gave a definition I hope will avoid SoP issues. However which Unicode code point is used to represent them would seem to me to be encyclopedic, not dictionaric. — Carolina wrendiscussió 18:35, 7 April 2009 (UTC)

I agree that we should avoid citing code points in the various encoding schemes, which can only lead to tables of technical information. It may be okay to put specimens on the page, however, in the form of text or images.

These terms are SoP in typesetting, but I don't mind having entries for these set phrases which are in non-technical use. em and en are typographical measurements, applied here to the width of a space or the square of a quad. Em quad is seen, but redundant, because a quad is usually an em. They can be variously stated, although most of these should not be entries:

This discussion is no longer live and is left here as an archive. Please do not modify this conversation, but feel free to discuss its conclusions.

The entry title is the thing “[[ ]]”, not the term en space denoting it. Recently discussed at WT:RFV# , and currently redirected to en space.

en space is a term denoting a thing, while an en space (“ ”) is an instance of the thing, and not a term or even a symbol representing the thing. It is a tool of the typographer, and not something that appears in print, so it is not attestable per WT:CFI (i.e., the blank created by an en space is not distinguishable from a regular double space in print, whether it was set in letterpress, phototypeset, or set digitally). While keeping it as a redirect isn't a major problem or anything, it ignores the distinction of what belongs in a dictionary and what doesn't. —MichaelZ.2009-04-14 18:14 z

This width of space frequently occurs after cola and semi-cola in archaic texts, so in that sense we can show its being used. What’s the problem? †﴾(u):Raifʻhār(t):Doremítzwr﴿ 18:28, 14 April 2009 (UTC)

This is exactly why we shouldn't try to add non-words to the dictionary. You can't attest the word en space in a 1,500-year-old manuscript any more than you can attest the word dinosaur to 230 million BC. An en space is a metal object used by a letterpress compositor, or its digital analogue; it is not a gap between words on a page. We must keep a firm grip on what is a thing and what is its name, or the dictionary will plotz. Delete! —MichaelZ.2009-04-14 20:06 z

The problem is in including both things and their names in something devoted to names. It seems just an empty mind-game of little or no utility. DCDuringTALK 18:37, 14 April 2009 (UTC)

Are you trying to maintain the principle (with which I think I agree) or do you really want to delete it? I haven't entirely gotten my head around including both the various non-letter symbols we include and their names. After all we don't include pictures of bricks, bricks themselves, or "bricks"; we just include [[bricks]]. DCDuringTALK 18:37, 14 April 2009 (UTC)

My take is that our entries describe lexical units. Most of those lexical units are words like "bricks", but some are typographical elements, like the letter B whose entry is [[B]] and the en space, whose entry should be [[[[ ]]]]. Our entry [[en space]], on the other hand, is for a word commonly used to refer to that typographical element. —Rod(A. Smith) 19:44, 14 April 2009 (UTC)

At the end of the day, [[[[ ]]]] is not used to signify any thing, unlike most of the other typographic elements. Any standard road sign would have more basis for inclusion as a signifier. Typographic elements, like wikijargon, are a kind of inside baseball that has no value to our user base, the supposed beneficiaries of our efforts, but seems of importance to us. DCDuringTALK 19:59, 14 April 2009 (UTC)

The en space is used to indicate something. Each instance of it indicates that the preceding word has ended, that the subsequent letters start a new word, and that the visual space between those words should be a specific width. Typographical units are valid dictionary entries, distinct from the words that refer to them. B(“the letter bee”) is distinct from the word “bee”. 1(“one”) is distinct from the word “one”. The Korean letter ㄱ(“giyeok”) is distinct from the word 기역(giyeok). We document lexical units, including words bee, one, 기역(giyeok), and en space and the typographical units they name, with entries at [[B]], [[1]], [[ㄱ]], and [[[[ ]]]]. —Rod(A. Smith) 20:46, 14 April 2009 (UTC)

Lots of things indicate something. The radiation symbol, the sign of the cross, the phrase “let's try to get to the beach before Marty,” a strip of yellow police tape, an Apple logo, a woodsman's blaze hacked into a tree, a frown, a turnstile, a cupholder. None of these belongs in a dictionary.

But an en space isn't even a symbol. An en space is a metal object which is strapped into a letterpress, or an analogous 8-bit character in a text bytestream. The gap between words is just a gap between words, and a reader without a pair of calipers doesn't distinguish the gap left by three thin spaces, an en space, a quad, a tab character, two mid spaces, or the right-alignment of a very long line.

Yes, but how many of those things are Unicode codepoints? –I’d be fine with entries for ☢ and ✝… †﴾(u):Raifʻhār(t):Doremítzwr﴿ 22:13, 14 April 2009 (UTC)

The Unicode consortium is maintaining a spec describing their encoding scheme. All words in all languages is a very different thing. —MichaelZ.2009-04-14 23:33 z

Dictionaries document lexical units, of which words are only one class (the largest class, sure, but just one class). Every decent English dictionary contains an entry for the letter B. Typographical characters are a type of lexical unit whose documentation belongs in a dictionary. —Rod(A. Smith) 21:15, 14 April 2009 (UTC)

For what it's worth, I notice that our entry for lexical item and Wikipedia's entry w:Lexical item disagree with my use of that phrase. I meant something like, "units of language", but now I'm hard-pressed to find a phrase that means just that. Rephrasing my point above, I believe dictionaries document units of language, of which words are only one class. A decent English dictionary contains an entry for the letter B, with facts like the origin of the letter, its position in the alphabet, its pronunciation, etc. Similarly, a dictionary seems like a great home for documentation of typographical units, like [[1]], [[ㄱ]], and [[[[ ]]]]. —Rod(A. Smith) 21:51, 14 April 2009 (UTC)

Which dictionary, decent or otherwise, has an entry for an actual en space, and not just its name, en space? Does it have two entries for italic B and roman B? Headwords or subsenses for serif and sans-serif B's, blackletter B's, and B's set in a Swiss humanist face? I think your dictionary probably just has entries for B, and b. Periods, commas, semicolons, dashes, etc belong to orthography, not lexicography. These are all part of writing, but they are not “lexical units.” A space is not a word, it is the empty spot between written words.

But be that as it may, you are still confounding text with the technical means used to represent it. An en space is not a “typographical character”, and it is not a gap between written characters. An en space is a concrete object, a kind of metal slug, which can make a bigger gap between other types by being bent or padded with a bit of chewed paper.[1] An en space is also a Unicode character with the value U+2002, which can produce an en-width gap in displayed or printed text, but, for example, can also make a bigger gap if the text is set justified left and right. —MichaelZ.2009-04-14 22:25 z

So an en space is not a “typographical character”, but it is also a Unicode character with the value U+2002? Now I'm even more confounded. —Rod(A. Smith) 22:52, 14 April 2009 (UTC)

That's right. It's a digital character, a byte, a code point – it's not a typographical character, letter, or symbol, or a glyph at all. It's a typographer's tool, not a part of the lexis. It defines a range of behaviours for the adjacent characters, which behaviours may be modified by software which lays out or displays text on a screen or printer, but doesn't display anything or have inherent meaning. It's a block in the press or a piece of data in the computer file, but it only leaves a blank on the screen or on paper. —MichaelZ.2009-04-14 23:03 z

A dictionary defines words. No matter how loosely you want to define what constitutes words, spaces will continue to be the empty bits between them. —MichaelZ.2009-04-14 23:08 z

Forgive my obtuseness, but I don’t see the difference between “ ” and ¦, ۝, and ○, apart from the fact that the former lacks black bits; what’s the significant difference in lexicographical terms? †﴾(u):Raifʻhār(t):Doremítzwr﴿ 22:03, 14 April 2009 (UTC)

Unlike bricks or dinosaurs, we are dealing here with something which can be sought for via our search box as a single Unicode graphic character. I don't see it as being outside the remit of an electronically-based dictionary to deal with such things, even if a paper-based dictionary would be unlikely to do so. The question therefore is whether Unicode graphic characters meet the standard of “Terms” to be broadly interpreted. A loose interpretation of the sixth bullet point: "Characters used in ideographic or phonetic writing such as 字 or ʃ." would favor allowing an entry for every Unicode graphic character.

DC is arguing in favor of a strict standard under which characters that exist solely as typographic tools would not meet CFI's standard of a “term”. If “[[ ]]” and “[[ ]]” were deleted, logical consistency would seem to call for also deleting “²” and “³” since those too are simply tools of the typographer to produce superscript 2 and superscript 3 respectively, and likely quite a few other existing entries would need to be deleted under that strict standard.

I see no reason to discriminate against non-printing graphic characters, and no reason to not have suitable entries for each and every one of them. Getting those suitable entries is likely to be the problem, though for some of them, such as “⑲”, a redirect should suffice. — Carolina wrendiscussió 22:20, 14 April 2009 (UTC)

You're using a pretty loose interpretation of loose interpretation. If it meant any Unicode character, it would say any Unicode character, and then we would have to change it so that this remained a dictionary. Fortunately, none of these things is true. —MichaelZ.2009-04-14 22:29 z

I didn't say any Unicode character, I said any Unicode graphic character. There is a difference and I agree that combining characters such as U+0305 COMBINING OVERLINE or format characters such as U+200C ZERO WIDTH NON-JOINER don't merit entries. — Carolina wrendiscussió 22:46, 14 April 2009 (UTC)
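For reference, the three kinds of character mentioned here carry different General Category values in the Unicode Character Database. The sketch below is an illustration added to this archive, with hypothetical labels, and only reports those categories through Python's standard `unicodedata` module. Which categories ought to count as "graphic" for Wiktionary's purposes is the point in dispute, and the code takes no position on it.

```python
import unicodedata

# Characters named in the comment above, plus the en space itself.
examples = {
    "\u2002": "en space",
    "\u0305": "combining overline",
    "\u200c": "zero width non-joiner",
}

categories = {}
for ch, label in examples.items():
    # unicodedata.category returns the two-letter General Category code.
    categories[label] = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {categories[label]}")
```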

And yet those code points are how the Wiktionary is accessed, not simply letters as is the case for paper dictionaries. To a certain extent, function must follow form.

Let me make certain that I understand you and the full scope of your point of view on what should be included and excluded as a “term”. In addition to “[[ ]]”, you would favor eliminating everything that does not fit a highly restrictive definition of lexical, such as, but not limited to -, ², ·, ‽, ⇐, ♪, and most of Appendix:Unsupported titles (including of course the entry on “ ”). If not, please explain why you feel any of those merit being included as terms while “[[ ]]” does not, for what you have propounded so far does not support making such a distinction. (In my opinion, that spaces don't have black bits does not of itself constitute a reason to exclude them.) — Carolina wrendiscussió 00:55, 15 April 2009 (UTC)

The design adage is actually “form follows function” rather than the reverse. The function of this dictionary is to define terms, not to duplicate the Unicode Consortium's specification.

But this is off topic. If you have a specific proposal to change CFI, then write it up at the Beer Parlour and maybe you can learn more about my full scope. For now I'll just say that “characters used in ideographic or phonetic writing such as 字 or ʃ” certainly does not declare or imply that “ ” is a term in any language. —MichaelZ.2009-04-15 19:03 z

I'm aware of the way the adage is usually structured, but the converse was appropriate to my point. I don't see the need to rewrite CFI to include “ ” within the scope of what we cover except possibly for added clarity. The list given of what constitutes a term is not written in a manner that indicates exclusivity. To quote the relevant section: “A term need not be limited to a single word in the usual sense. Any of these are also acceptable:”. If that list were prescriptive instead of illustrative, the language should instead be something like: “A term is not limited to a single word in the usual sense. The following and only the following are also acceptable:”

Now let me repeat my question, MZ, and this time I'd appreciate an actual answer instead of blithely declaring it off topic. You are arguing that “[[ ]]” should be deleted because it does not meet what you consider to be the scope of “term”. I do not agree that it does not meet the scope of “term”, and have explained why I feel it does. If “[[ ]]” (and “[[ ]]”) are deleted, your interpretation of what is a “term” is likely to govern future discussions, and therefore it is on topic to know clearly what that interpretation is. Therefore let me repeat for hopefully the last time: Do you favor eliminating everything that does not fit a highly restrictive definition of lexical, such as, but not limited to -, ², ·, ‽, ⇐, ♪, and most of Appendix:Unsupported titles (including of course the entry on “ ”)? If not, please explain why you feel any of those merit being included as terms while “[[ ]]” does not. — Carolina wrendiscussió 19:06, 16 April 2009 (UTC)

I'm sorry, but each case has its own merits, or hasn't. I won't satisfy your demand for me to write an essay to “govern future discussions.”

This is not a term in any sense, not even a written character, except purely in the jargon of digital representation. Spaces are the bits between terms and their components. It's also not attestable. The half-em gap left between terms by this code point could have been created by a half-dozen other means. This doesn't meet our CFI. —MichaelZ.2009-04-17 22:32 z

Not much sense in hashing this further, since it is clear we disagree on the termitude of “[[ ]]” and that we won't resolve our difference of opinion here. White space can be shown to affect meaning as doggone and dog gone aren't even remotely related in meaning. (Indeed, a dog gone would elicit a Yippie! from me.) Given a refusal to discuss the broader issue, I will have to take the position that this deletion, if it goes through, sets no precedent except for other white space characters. — Carolina wrendiscussió 01:20, 18 April 2009 (UTC)

Delete; not verifiable. We don't index orthographic variants where the spelling does not change. This is akin to discussing cat versus cɑt. --EncycloPetey 19:26, 16 April 2009 (UTC)

While the example is problematic, since cɑt could be a distinct word from cat in Fe'fe' or some other African language that uses Latin alpha (Ɑ ɑ) as a letter, it certainly is a more cogent argument than the one given by MZ. While I still think that we ought to have entries or redirects (to either another entry or an Appendix) for each Unicode graphic character, if only to avoid the inevitable attempts at (re)creation by anons, until such time as a systematic effort to make such entries is undertaken, I won't be insistent on keeping this, tho I still don't support deletion. — Carolina wrendiscussió 22:14, 17 April 2009 (UTC)

It's a good example, even if italics are used in Fe'fe' leetspeak. Many fonts have a unicameral italic small a, that does not make it a small Latin alpha, any more than a zero is a capital o, or a capital el a small i.

Perhaps redirects to an appendix would serve to prevent recreating such entries. Or a redirect with explanation, as at Wiktionarian. —MichaelZ.2009-04-18 14:28 z

What italics? There is some use of Latin alpha outside of IPA as a distinct letter from Latin a, which is why Latin Capital Letter Alpha (U+2C6D) was added in Unicode 5.1. One can argue the wisdom of making that distinction, but it isn't dissimilar in nature to separating u and v or i and j. — Carolina wrendiscussió 02:09, 19 April 2009 (UTC)

Based on EP's point ("We don't index orthographic variants where the spelling does not change."), I'd be satisfied with a simple MediaWiki redirect to whatever entry we have for whitespace. —Rod(A. Smith) 22:40, 17 April 2009 (UTC)

Strong keep for both this and em space. We have entries for Translingual graphical symbols (e.g. "," and "-") and entries for their names in (hopefully) all languages (en:comma & hyphen, es:coma & guión, etc.). “[[ ]]” should be treated no differently. We should not have a Translingual symbol be a redirect to an English term. And the entry should have its Unicode codepoint. In the context of symbols, this information is not encyclopedic. It is necessary information that clarifies easily confused and hard to decipher symbols. Perhaps if we could separate names for things (that should go in dictionaries) from the actual text we use, then we wouldn't need to have this entry. But we can't, so we should.

EP, perhaps a better example of your point would have been bedroom & BEDROOM. But while true for whole words, we make obvious exceptions for single character entries (e.g. G & g and Γ, γ). --Bequw → ¢ • τ 09:14, 18 April 2009 (UTC)

Those aren't exceptions. The letters G and g serve different lexical functions and have separate etymologies. The capital and lower case were developed at different times and in different cultures. --EncycloPetey 17:47, 18 April 2009 (UTC)

Perhaps a good position is that we ought to have an entry for [[ |« »]], but not for something like en space (written with the en space, resulting perhaps from paragraph justification) as an alternative for en space (written with an ASCII space); just a thought… †﴾(u):Raifʻhār(t):Doremítzwr﴿ 18:25, 18 April 2009 (UTC)
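The equivalence suggested here, between a phrase written with the en space and the same phrase written with an ASCII space, resembles what Unicode compatibility normalization does. The following Python sketch is an illustration under that assumption, using the standard `unicodedata` module and hypothetical variable names; it does not describe how MediaWiki actually resolves page titles.

```python
import unicodedata

with_en_space = "en\u2002space"   # written with U+2002 EN SPACE
with_ascii = "en space"           # written with U+0020 SPACE

# The two strings are different sequences of code points...
print(with_en_space == with_ascii)

# ...but NFKC compatibility normalization folds U+2002 to U+0020.
folded = unicodedata.normalize("NFKC", with_en_space)
print(folded == with_ascii)
```

A redirect scheme built on this kind of folding would treat the two spellings as one title while still leaving room for a separate entry on the character itself.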

Keep per Bequw. EncycloPetey's analogy is apt, and we shouldn't have entries for things like cɑt and hot dog; but we should have entries for things like ɑ and [[ |the en space]]. (And in fact, we already do have an entry for ɑ.) —RuakhTALK 15:08, 18 April 2009 (UTC)

Yes, we have an entry for ɑ as an IPA character, but that is irrelevant. We do not have entries for cɑt or ɑpple, or other English words using that character in the spelling, despite the fact that some publishers use that typography. Likewise, we don't have separate entries for the two forms of lowercase g that can appear in different font sets. --EncycloPetey 17:47, 18 April 2009 (UTC)

But we should have entries for the two forms of the lower-case g, in my opinion. †﴾(u):Raifʻhār(t):Doremítzwr﴿ 18:25, 18 April 2009 (UTC)

This is a purely stylistic difference, and carries no lexicographical meaning. The (near-)exception is in IPA, which prefers an “open-tailed g”, presumably for consistency, but allows a bicameral g – Unicode provides a code point for this, U+0261 (ɡ). In my opinion, this is a preferred style, like asking for a sans-serif font, and the letter remains a g. —MichaelZ.2009-04-19 01:08 z

Whatever difference or lack thereof there is between ‘g’ and ‘ɡ’, we’d still have separate entries for both, wouldn’t we? The difference is that we would not have separate entries for, say, gold and ɡold (though the latter might be kept as a hard redirect to the former). †﴾(u):Raifʻhār(t):Doremítzwr﴿ 17:29, 19 April 2009 (UTC)

I would say keep only if we are to distinguish glyphs like italic and serif variants. In fact, I would very much like to see this. A vertical line is a single glyph but can mean lowercase L or uppercase I, except that in writing a word-initial I usually has the bars on top and bottom. The line can also mean the number 1 in the US but the British write that differently. The Arabic numeral 5 when written by the Chinese always has the left stem extending noticeably above the horizontal bar. This is true despite looking the same in type as we would see it, just as in the U.S. the dollar sign is written with two vertical strokes although the typed symbol commonly has only one, sometimes going through and sometimes not. It is proper for dictionaries to document this kind of information. However, seeing as we do not yet, delete. DAVilla 08:34, 10 May 2009 (UTC)

And in countries which use the Cyrillic alphabet, a figure 4 is always closed, so it doesn't look like a letter Ч (che). Where a figure 1 starts with a prominent upstroke, the figure 7 is distinguished by a crossed stem.

But these typographic or calligraphic variations of letterform don't belong in a dictionary, nor do Roman or Cyrillic type styles like serif, italic, or boldfaced. They don't represent differences in spelling or even orthography, nor in pronunciation, nor any other lexicographical feature. They might be summarized in an encyclopedic appendix, but we have the biggest appendix in the world, the English-language Wikipedia. —MichaelZ.2009-05-18 16:39 z

Sure they belong here. It is for precisely these reasons of division in linguistic evolution that we don't have just one Roman alphabet or one Cyrillic alphabet, these inventions often reflecting phonetic deviation, and why words borrowed from similar languages take on very different pronunciations (or similar pronunciations and different spellings), and why letters have come to represent multiple phonetics over time. It may feel like documenting this is a static snapshot, but it is an entirely linguistic topic and subject to change as with any other, just over a larger span of time. DAVilla 04:08, 26 May 2009 (UTC)

Huh? So there should be separate entries for the different “terms” italic, italic, and italic (and italic?). Separate entries for style, style, and style? One for closed figure 4 and another for open-topped 4? Not only is this impossible to do in Wikimedia, but there is no dictionary precedent for any such thing. This is not lexicography. —MichaelZ.2009-05-26 04:28 z

No, only for glyphs, some day in a separate namespace possibly but starting in the appendix, just as usage notes for the different alphabets and such. Once you know that a glyph stands for a 1 or an I or an l, then it can take on the lexical meaning, and the different "terms" you name above would all be under the same title, italic or style, and not also in small caps or with an initial capital as at the beginning of a sentence or other obvious variants.

And am I mistaken? I thought the unabridged dictionaries did catalog the evolution of letterforms. I've seen a few places where we do already. DAVilla 05:10, 26 May 2009 (UTC)