That is, the representation of each character in the locale as an array of bytes (like UTF-8 in the first example and a single byte in the second), the equivalent Unicode character code point, and a description.

For most locales it ultimately comes from the LC_CTYPE data in (with glibc) /usr/share/i18n/locales/i18n ... which of course comes largely from the Unicode Character Database. It would be nice to have a command
–
derobert May 6 '14 at 21:11

@derobert, yes, while locale (at least the GNU one) retrieves much of the information stored in many of the categories, the things it doesn't retrieve are the most important ones in LC_CTYPE and LC_COLLATE. I wonder if there's a hidden API to retrieve that information or to decompile the locale data.
–
Stéphane Chazelas May 6 '14 at 22:01

Yeah - you can get that info parsed - I just finally got around to wrapping up my edit. There are several commands you probably already have installed - at least I did, and I didn't even know about them. I hope it helps. Specifically, recode and uconv can give you what you say you're looking for. Possibly even just luit and od, I guess...
–
mikeserv May 7 '14 at 2:35

That's very good! That means you don't need perl at all, I think.
–
mikeserv May 7 '14 at 16:17

I seem to be able to basically extract my charset from LC_CTYPE with just od -A n -t c <LC_CTYPE | tsort. Probably you've tried it already, but I'd never heard of it before and I was reading through info and it reminded me of this - and it seems to work. There's also ptx, but I think it's less relevant. Anyway, if you haven't tried it and decide to do so - fair warning - it does require a little patience. lehman.cuny.edu/cgi-bin/man-cgi?tsort+1
–
mikeserv May 13 '14 at 4:08

NOTE:

I use od as the final filter above by preference, and because I know I won't be working with multi-byte characters, which it will not correctly handle. recode u2..dump will both generate output more like that specified in the question and handle wide characters correctly.

PROGRAMMER'S API

As I demonstrate below, recode will provide you with your complete character map. According to its manual, when you leave a charset name out it looks first to the current value of the DEFAULT_CHARSET environment variable and, failing that, to your locale's encoding; otherwise it operates exactly as you specify:

When a charset name is omitted or left empty, the value of the DEFAULT_CHARSET variable in the environment is used instead. If this variable is not defined, the recode library uses the current locale's encoding. On POSIX compliant systems, this depends on the first non-empty value among the environment variables LC_ALL, LC_CTYPE, LANG and can be determined through the command locale charmap.
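
(Incidentally, the C-level counterpart of that locale charmap lookup is nl_langinfo(CODESET). A minimal sketch - what it prints, e.g. UTF-8 or ANSI_X3.4-1968, depends entirely on your environment:)

/* The C-level equivalent of `locale charmap`: after adopting the
 * environment's locale, nl_langinfo(CODESET) names the encoding that
 * LC_ALL/LC_CTYPE/LANG resolved to. */
#include <stdio.h>
#include <locale.h>
#include <langinfo.h>

int main(void)
{
    if (setlocale(LC_ALL, "") == NULL)  /* adopt the environment's locale */
        return 1;
    printf("%s\n", nl_langinfo(CODESET));
    return 0;
}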

The program named recode is just an application of its recoding library. The recoding library is available separately for other C programs. A good way to acquire some familiarity with the recoding library is to get acquainted with the recode program itself.

To use the recoding library once it is installed, a C program needs to have a line:

#include <recode.h>
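
From there the library's documented entry points chain together along these lines. This is only a sketch built from the function names the recode manual documents (recode_new_outer, recode_new_request, recode_scan_request, recode_string) - double-check the details against info recode on your system, and link with -lrecode:

/* Outline of driving the recoding library directly from C.
 * A sketch - verify the API details against `info recode`. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <recode.h>

int main(void)
{
    RECODE_OUTER outer = recode_new_outer(true);      /* true: abort on fatal errors */
    RECODE_REQUEST request = recode_new_request(outer);

    /* Same BEFORE..AFTER syntax as the command line. */
    if (!recode_scan_request(request, "latin1..utf8"))
        return EXIT_FAILURE;

    char *result = recode_string(request, "caf\xe9"); /* Latin-1 input */
    if (result) {
        printf("%s\n", result);                       /* "café", now UTF-8 */
        free(result);
    }

    recode_delete_request(request);
    recode_delete_outer(outer);
    return EXIT_SUCCESS;
}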

For internationally friendly string comparison, the POSIX and C standards define the strcoll() function:

The strcoll() function shall compare the string pointed to by s1 to
the string pointed to by s2, both interpreted as appropriate to the
LC_COLLATE category of the current locale.

The strcoll() function shall not change the
setting of errno if successful.

Since no return value is reserved to indicate an error, an application
wishing to check for error situations should set errno to 0, then call
strcoll(), then check errno.
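
That errno dance, in full, looks something like this - a minimal sketch (the sample words are mine; which way the comparison goes depends on your LC_COLLATE):

/* Locale-aware comparison per the spec quoted above: since strcoll()
 * reserves no return value for errors, clear errno first and test it after. */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* pick up LC_COLLATE from the environment */

    errno = 0;
    int r = strcoll("apple", "Banana");
    if (errno != 0) {
        perror("strcoll");
        return 1;
    }
    /* In the C locale 'B' (0x42) sorts before 'a' (0x61); in en_US.UTF-8
     * dictionary order puts "apple" first. */
    printf("%s\n", r < 0 ? "less" : r > 0 ? "greater" : "equal");
    return 0;
}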

Regarding the POSIX character classes, you've already noted that you used the C API to find these. For Unicode characters and classes you can use recode's dump-with-names charset to get the desired output. From its manual again:

For example, the command recode l2..full < input implies a necessary
conversion from Latin-2 to UCS-2, as dump-with-names is only connected
out from UCS-2. In such cases, recode does not display the original
Latin-2 codes in the dump, only the corresponding UCS-2 values. To
give a simpler example, the command

The descriptive comment is given in English and ASCII, yet if the
English description is not available but a French one is, then the
French description is given instead, using Latin-1. However, if the
LANGUAGE or LANG environment variable begins with the letters fr,
then listing preference goes to French when both descriptions are
available.

Using similar syntax to the above, combined with its included test dataset, I can get my own character map with:

OUTPUT

Of course, only 128 bytes are represented, but that's because my locale, utf-8 charmaps or not, uses the ASCII charset and nothing more. So that's all I get. If I ran it without luit filtering it, though, od would roll it back around and print the same map again up to \0400.

There are two major problems with the above method, though. First there is the system's collation order - for non-ASCII locales the byte values for the charsets are not simply in sequence, which, I think, is likely the core of the problem you're trying to solve.

Well, GNU tr's manpage states that it will expand the [:upper:] and [:lower:] classes in order - but that's not a lot.

I imagine some heavy-handed solution could be implemented with sort but that would be a rather unwieldy tool for a backend programming API.
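
In C, at least, the heavy-handedness stays compact. A sketch (the test words are mine) that sorts through qsort(3) with a strcoll() comparator, so the output order tracks LC_COLLATE:

/* Sort strings in the current locale's collation order.
 * LC_ALL=C gives raw byte order; en_US.UTF-8 gives dictionary order. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

static int bycoll(const void *a, const void *b)
{
    return strcoll(*(const char * const *)a, *(const char * const *)b);
}

int main(void)
{
    setlocale(LC_ALL, "");
    const char *words[] = { "zebra", "Apple", "banana", "Cherry" };
    size_t n = sizeof words / sizeof *words;

    qsort(words, n, sizeof *words, bycoll);

    for (size_t i = 0; i < n; i++)
        puts(words[i]);
    return 0;
}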

recode will do this thing correctly, but you didn't seem too in love with the program the other day. Maybe today's edits will cast a more friendly light on it or maybe not.

GNU also offers the gettext function library, and it seems to be able to address this problem at least for the LC_MESSAGES context:

The bind_textdomain_codeset function can be used
to specify the output character set for message catalogs for domain
domainname. The codeset argument must be a valid codeset name which
can be used for the iconv_open function, or a null pointer.

If the codeset parameter is the null pointer, bind_textdomain_codeset
returns the currently selected codeset for the domain with the name
domainname. It returns NULL if no codeset has yet been selected.

The bind_textdomain_codeset function can be used several times. If
used multiple times with the same domainname argument, the later call
overrides the settings made by the earlier one.

The bind_textdomain_codeset function returns a pointer to a string
containing the name of the selected codeset. The string is allocated
internally in the function and must not be changed by the user. If the
system went out of core during the execution of
bind_textdomain_codeset, the return value is NULL and the global
variable errno is set accordingly.
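
To make that concrete, here is a sketch; the "myapp" domain and the catalog path are hypothetical stand-ins:

/* Query, then force, the delivery codeset for a message catalog,
 * per the gettext manual quoted above.  "myapp" and the catalog
 * directory are made-up examples. */
#include <stdio.h>
#include <locale.h>
#include <libintl.h>

int main(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain("myapp", "/usr/share/locale");  /* hypothetical path */
    textdomain("myapp");

    const char *cs = bind_textdomain_codeset("myapp", NULL);
    printf("before: %s\n", cs ? cs : "(none selected)");

    bind_textdomain_codeset("myapp", "UTF-8");
    printf("after:  %s\n", bind_textdomain_codeset("myapp", NULL));

    puts(gettext("Hello"));  /* translations now arrive UTF-8 encoded */
    return 0;
}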

You might also use native Unicode character categories, which are language independent, and forgo the POSIX classes altogether - or perhaps call on the former to provide enough information to define the latter.

In addition to complications, Unicode also brings new possibilities.
One is that each Unicode character belongs to a certain category. You
can match a single character belonging to the "letter" category with
\p{L}. You can match a single character not belonging to that category
with \P{L}.

Again, "character" really means "Unicode code point". \p{L} matches a
single code point in the category "letter". If your input string is à
encoded as U+0061 U+0300, it matches a without the accent. If the
input is à encoded as U+00E0, it matches à with the accent. The reason
is that both the code points U+0061 (a) and U+00E0 (à) are in the
category "letter", while U+0300 is in the category "mark".

You should now understand why \P{M}\p{M}*+ is the equivalent of \X.
\P{M} matches a code point that is not a combining mark, while \p{M}*+
matches zero or more code points that are combining marks. To match a
letter including any diacritics, use \p{L}\p{M}*+. This last regex
will always match à, regardless of how it is encoded. The possessive
quantifier makes sure that backtracking doesn't cause \P{M}\p{M}*+ to
match a non-mark without the combining marks that follow it, which \X
would never do.
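
Those category classes are reachable from C as well, through PCRE (the engine behind GNU grep -P). A sketch with pcre2 - link with -lpcre2-8 - that matches \p{L}\p{M}*+ against the decomposed à:

/* Match one letter plus its combining marks ("\p{L}\p{M}*+") against
 * "a" + U+0300, i.e. the decomposed form of "à", in UTF-8 mode. */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <stdio.h>
#include <pcre2.h>

int main(void)
{
    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"\\p{L}\\p{M}*+",
                                   PCRE2_ZERO_TERMINATED, PCRE2_UTF,
                                   &errcode, &erroffset, NULL);
    if (re == NULL)
        return 1;

    PCRE2_SPTR subject = (PCRE2_SPTR)"a\xcc\x80";   /* U+0061 U+0300 */
    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);

    if (pcre2_match(re, subject, 3, 0, 0, md, NULL) > 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        printf("matched bytes %zu..%zu\n", ov[0], ov[1]); /* 0..3: both code points */
    }

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}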

The same website that provided the above information also discusses Tcl's own POSIX-compliant regex implementation which might be yet another way to achieve your goal.

And last among solutions, I will suggest that you can interrogate the LC_COLLATE file itself for the complete, in-order system character map. This may not seem easily done, but I achieved some success with the following after compiling it with localedef as demonstrated below:

AT FIRST BLUSH

It really didn't look like much, but then I started noticing copy commands throughout the list. The above file seems to copy in "en_US", for instance, and another really big one that it seems they all share to some degree is iso_14651_t1_common.

It's pretty big:

strings $_ | wc -c
#OUTPUT
431545

Here is the intro to /usr/share/i18n/locales/POSIX:

# Territory:
# Revision: 1.1
# Date: 1997-03-15
# Application: general
# Users: general
# Repertoiremap: POSIX
# Charset: ISO646:1993
# Distribution and use is free, also for
# commercial purposes.
LC_CTYPE
# The following is the POSIX Locale LC_CTYPE.
# "alpha" is by default "upper" and "lower"
# "alnum" is by definiton "alpha" and "digit"
# "print" is by default "alnum", "punct" and the <U0020> character
# "graph" is by default "alnum" and "punct"
upper <U0041>;<U0042>;<U0043>;<U0044>;<U0045>;<U0046>;<U0047>;<U0048>;\
<U0049>;<U004A>;<U004B>;<U004C>;<U004D>;<U004E>;<U004F>;

There is also luit - a terminal UTF-8 pty translation device that, I guess, acts as a go-between for XTerms without UTF-8 support. It handles a lot of switches - such as logging all converted bytes to a file, or -c to act as a simple |pipe filter.

I never realized there was so much to this - the locales and character maps and all of that. This is apparently a very big deal, but I guess it all goes on behind the scenes. There are - at least on my system - a couple hundred man 3 results for locale-related searches.

The Xlib functions handle this all of the time - luit is a part of that package.

The Tcl_uni... functions might prove useful as well.

Just a little <tab> completion and some man searches, and I've learned quite a lot on this subject.

With localedef you can compile the locales in your I18N directory. The output is funky, and not extraordinarily useful - not like the charmaps at all - but you can get the raw format just as you specified above, like I did:

I probably forgot about them because I couldn't get them to work. I never use Perl and I don't know how to load a module properly, I guess. But the man pages look pretty nice. In any case, something tells me you'll find calling a Perl module at least a little less difficult than I did. And, again, these were already on my computer - and I never even use Perl. There are also, notably, a few I18N modules that I wistfully scrolled by, knowing full well I wouldn't get them to work either.

That's all very nice and useful info, but it gives information on the source files (in i18n) that may or may not have been used to generate the locale I'm currently using. The locale information is probably coming from /usr/lib/locale/locale-archive or /some/dir/LC_CTYPE, and it's the part of those files relevant to my locale that I'm after.
–
Stéphane Chazelas May 7 '14 at 10:54

@StephaneChazelas - so just extract your LC_STUFF from the archive with localedef - it does that too. I can demo that as well, I guess. You can also view that and pretty much everything else with strings or od or any of the rest. I did, anyway. But by the way - the charmaps are the locale you're currently using - and localedef will report on that as well. Also, that's what recode does too.
–
mikeserv May 7 '14 at 11:15

You're basically saying that we can do by hand what the system's libraries do to query character class information, but that's going to need thousands of lines of code to do reliably, and the result will be system specific (parsing the environment the same way the system library does (LOCPATH, LANG, LANGUAGE, LC_CTYPE...), identifying where to look for the data, extracting it...). I can't see how to extract stuff from the archive with localedef, though.
–
Stéphane Chazelas May 7 '14 at 11:44

@StephaneChazelas - I don't suggest you do it by hand - I suggest you do it with a computer - using system binaries such as od, recode, uconv, and the rest. But it was my mistake - it's not localedef that extracts it, it's recode that will. You've gotta check out info recode - and besides, the recode table command I show up there does much the same thing - and it will handle things in that same way, I think. It doesn't just pull your charset out of thin air. In any case, I did have high hopes for those Perl modules - did you try any out?
–
mikeserv May 7 '14 at 12:12

If there's an API to retrieve the list of characters in a given character class in the current locale, then that's specifically what I'm looking for. If you can demonstrate how to do this, I'll accept the answer. The only thing I could think of (and how I obtained the "expected output" in my question) is to use iswblank(3) for all possible character values.
–
Stéphane Chazelas May 7 '14 at 12:18
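
(For reference, the brute-force enumeration described in the comment above might look like the following sketch - it asks iswblank(3) about every code point the locale could contain; substitute iswalpha(), iswpunct(), etc. for the other classes:)

/* List a character class by interrogating every possible code point.
 * 0x10FFFF bounds Unicode; platforms with a narrower wchar_t simply
 * report fewer members. */
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    for (unsigned long c = 0; c <= 0x10FFFF; c++)
        if (iswblank((wint_t)c))
            printf("%#06lx\t%lc\n", c, (wint_t)c);
    return 0;
}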