I plan a few improvements to Phobos string handling.
Currently arrays of characters count as random-access ranges, which is
not true for arrays of char and wchar. I plan to make std.range aware of
that and only characterize char[] and wchar[] (and their qualified
versions) as bidirectional ranges. Also, std.range will define s.front
and s.back for strings to return the correctly decoded dchar. Naturally,
s.popFront and s.popBack will yank an entire encoded character, which is
what you want most of the time anyway. (You're still free to do s = s[1
.. $] if that's what you need.)
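As a sketch of the proposed semantics (assuming string front/back/popFront work as described above):

```d
import std.range;

void main()
{
    string s = "héllo";      // 'é' occupies two UTF-8 code units
    assert(s.length == 6);   // .length still counts code units
    assert(s.front == 'h');  // front decodes a whole code point (a dchar)
    s.popFront();
    assert(s.front == 'é');  // both bytes of 'é' decode into one dchar
    s.popFront();            // yanks the entire two-byte sequence
    assert(s == "llo");
    assert(s.back == 'o');   // back decodes from the end
}
```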
These changes will have the great effect of enabling std.algorithm to
work with strings correctly without any further impedance adaptation.
(At some point I'd defined byDchar to wrap a string as a bidirectional
range; it works, but of course it's much better without an intermediary.)
Following that change, I plan to eliminate std.string entirely and roll
all of its functionality into std.algorithm. This is because I noticed
that I'd like many string functions to be available for other data
types, and also because people who want to define their own non-UTF
encodings can benefit from the support that UTF already has.
(As an example, startsWith or endsWith are very useful not only with
strings, but general data as well.)
A possible idea would be to move algorithms out of std.string and roll
std.utf and std.encoding into std.string. That way std.string becomes
something UTF-specific, which may be sensible.
One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

Currently arrays of characters count as random-access ranges, which is
not true for arrays of char and wchar. I plan to make std.range aware of
that and only characterize char[] and wchar[] (and their qualified
versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters"; they need more
than one dchar. So even dchar[] may be only a bidirectional range.
I can't remember the bit sizes of wchar and dchar. So names like char, char16
and char32 might be better...
Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be forced to
use cast(ubyte[]) every time I use an algorithm on them :-)

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation.

It's not just a matter of documentation: to choose among n items, a human needs
more time as n grows (people who design important menus in GUIs must be aware
of this). So huge APIs slow down programming.
A possible solution is to keep the std.string module, but make it just a list
of aliases and thin wrappers around functions of std.algorithm, tuned for
string processing (for example, I usually don't need tolower on generic
arrays; some operations are mostly useful for strings).
Bye,
bearophile

Currently arrays of characters count as random-access ranges, which is
not true for arrays of char and wchar. I plan to make std.range aware of
that and only characterize char[] and wchar[] (and their qualified
versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters"; they need more
than one dchar. So even dchar[] may be only a bidirectional range.

[citation needed]

I can't remember the bit sizes of wchar and dchar. So names like char, char16
and char32 might be better...

I think it's a tad late for that.

Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be forced
to use cast(ubyte[]) every time I use an algorithm on them :-)

That's exactly one of the cases in which my change would help. char is
UTF-8, so that's out as an option for expressing ASCII characters.
You'll be able to define your own type:
struct AsciiChar {
    ubyte datum;
    ...
}
Then express stuff in terms of AsciiChar[] etc.
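A slightly fleshed-out sketch of that idea (AsciiChar is a hypothetical user type, not anything in Phobos):

```d
// Hypothetical: give 7-bit ASCII data its own element type, so that
// AsciiChar[] is an ordinary random-access range with no UTF decoding.
struct AsciiChar
{
    ubyte datum;

    this(char c)
    {
        assert(c < 0x80, "not a 7-bit ASCII character");
        datum = c;
    }

    int opCmp(AsciiChar rhs) const { return datum - rhs.datum; }
}
```

An AsciiChar[] then indexes and slices per character, which char[] no longer would under the proposal.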

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation.

It's not just a matter of documentation: to choose among n items, a human needs
more time as n grows (people who design important menus in GUIs must be aware
of this). So huge APIs slow down programming.
A possible solution is to keep the std.string module, but make it just a list
of aliases and thin wrappers around functions of std.algorithm, tuned for
string processing (for example, I usually don't need tolower on generic
arrays; some operations are mostly useful for strings).

Currently arrays of characters count as random-access ranges, which
is not true for arrays of char and wchar. I plan to make std.range
aware of that and only characterize char[] and wchar[] (and their
qualified versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters"; they need
more than one dchar. So even dchar[] may be only a bidirectional range.

[citation needed]

I also doubt that 32 bits are not enough. In fact, Unicode has 0x10FFFF
as its highest code point.

Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be
forced to use cast(ubyte[]) every time I use an algorithm on them :-)

That's exactly one of the cases in which my change would help. char is
UTF-8, so that's out as an option for expressing ASCII characters.
You'll be able to define your own type:
struct AsciiChar {
    ubyte datum;
    ...
}
Then express stuff in terms of AsciiChar[] etc.

I miss typedef. I think this is exactly what typedef was intended
for. Perhaps we can reintroduce it as a 'shorthand' for such a
struct?
By the way, ASCII is a subset of UTF-8 (that was the whole
point), so there's no reason why 'char[]' can't still be used for
ASCII strings, right?
L.

Currently arrays of characters count as random-access ranges, which
is not true for arrays of char and wchar. I plan to make std.range
aware of that and only characterize char[] and wchar[] (and their
qualified versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters"; they need
more than one dchar. So even dchar[] may be only a bidirectional range.

[citation needed]

I also doubt that 32 bits are not enough. In fact, Unicode has 0x10FFFF
as its highest code point.

32-bit is enough to cover all code points. But there are many combining
code points in Unicode, allowing you to combine a diacritic with various
other characters, such as an acute accent with a 'k'. Some of these
combinations exist in precombined form and are considered equivalent.
So if you want to count the number of characters the user actually sees
instead of counting code points, then you need to take these combining
code points into account.
But if you really wanted to iterate over "characters" instead of code
points, note that it can become quite hard if you take into account
double diacritics, combining diacritic signs placed across two letters.
So I think it's reasonable to have dchar, a code point, as the base
unit for iterating over a string.
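The distinction is easy to demonstrate: a precomposed 'é' is one code point, while 'e' plus a combining acute accent is two, though both display as a single character (sketch using std.range's walkLength to count code points):

```d
import std.range : walkLength;

void main()
{
    string precomposed = "\u00E9";   // 'é' as a single code point
    string combined    = "e\u0301";  // 'e' + COMBINING ACUTE ACCENT
    // Both render as "é", but iterating by dchar sees different counts:
    assert(precomposed.walkLength == 1);
    assert(combined.walkLength == 2);
}
```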
http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Unicode_normalization
Another interesting case:
http://en.wikipedia.org/wiki/Combining_grapheme_joiner
Unicode, isn't it great?
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Using alias you lose all type safety.
I remember Andrei mentioned that he and Walter couldn't agree
whether typedef should behave as a sub- or superclass. I think it
should not be looked at from an inheritance perspective, but just
considered as a wrapper struct with a ctor that takes the
underlying type.

By the way, ASCII is a subset of UTF-8 (that was the whole
point), so there's no reason why 'char[]' can't still be used for
ASCII strings, right?

As far as I have understood (I am no Unicode guru), in some locales
toUpper and toLower map ASCII chars to non-ASCII chars. So ASCII being a
strict subset of UTF-8 is not always true.

True, but then that upper- resp. lower-case form would no longer be
ASCII. As long as you stick to ASCII, char[] should work just fine.
So, toLower and toUpper can accept ASCII char[] but always output
one of those new char ranges. Problem fixed :)
L.

I plan a few improvements to Phobos string handling.
Currently arrays of characters count as random-access ranges, which is
not true for arrays of char and wchar. I plan to make std.range aware of
that and only characterize char[] and wchar[] (and their qualified
versions) as bidirectional ranges. Also, std.range will define s.front
and s.back for strings to return the correctly decoded dchar. Naturally,
s.popFront and s.popBack will yank an entire encoded character, which is
what you want most of the time anyway. (You're still free to do s = s[1
.. $] if that's what you need.)
These changes will have the great effect of enabling std.algorithm to
work with strings correctly without any further impedance adaptation.
(At some point I'd defined byDchar to wrap a string as a bidirectional
range; it works, but of course it's much better without an intermediary.)
Following that change, I plan to eliminate std.string entirely and roll
all of its functionality into std.algorithm. This is because I noticed
that I'd like many string functions to be available for other data
types, and also because people who want to define their own non-UTF
encodings can benefit from the support that UTF already has.

(As an example, startsWith or endsWith are very useful not only with
strings, but general data as well.)
A possible idea would be to move algorithms out of std.string and roll
std.utf and std.encoding into std.string. That way std.string becomes
something UTF-specific, which may be sensible.
One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).

Perhaps it's time to start adding more packages than just std. Make
std.algorithm a package and try to split it into several modules.

I've been thinking about characters lately and have realized that
tolower, toupper, icmp, and friends should not be in a string library.
Those functions need an "alphabet" to be useful; not language, nor locale...
In fact, the character itself must have alphabet information. Otherwise
a string like "ali & jim" cannot be converted to upper-case correctly(*)
as "ALİ & JIM". And the word "correctly" there depends on each
character's alphabet.
Similarly, two characters that look the same cannot be compared for
ordering. Comparing the 'x' of one alphabet to the 'x' of another
alphabet is a meaningless operation.
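One way to picture the "alphabet" idea is case mapping parameterized by an explicit per-alphabet table; the names below are hypothetical, not existing Phobos functions:

```d
// Hypothetical: uppercase a dchar according to an explicit alphabet
// table, falling back to the character itself when unmapped.
dchar toUpperIn(dchar c, const dchar[dchar] upperMap)
{
    if (auto p = c in upperMap)
        return *p;
    return c;
}

// Turkish/Azeri alphabets: dotted 'i' uppercases to 'İ' and dotless
// 'ı' to 'I', a different mapping than the English 'i' to 'I'.
enum dchar[dchar] turkishUpper = ['i': 'İ', 'ı': 'I' /* , ... */];
```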
Ali

Jacob Carlborg wrote:
> I would keep std.string for string specific functions and perhaps
> publicly import std.algorithm. For example functions like: tolower, icmp
> and toStringz.
I've been thinking about characters lately and have realized that
tolower, toupper, icmp, and friends should not be in a string library.
Those functions need an "alphabet" to be useful; not language, nor
locale...
In fact, the character itself must have alphabet information. Otherwise
a string like "ali & jim" cannot be converted to upper-case correctly(*)
as "ALİ & JIM". And the word "correctly" there depends on each
character's alphabet.
Similarly, two characters that look the same cannot be compared for
ordering. Comparing the 'x' of one alphabet to the 'x' of another
alphabet is a meaningless operation.

My thoughts exactly. In fact I'm thinking of generalizing toupper and
tolower for strings to take an optional trie mapping strings to strings.
That way correct capitalization can be done for any string, given a good
collection of capitalization patterns.
Andrei

Jacob Carlborg wrote:
> I would keep std.string for string specific functions and perhaps
> publicly import std.algorithm. For example functions like: tolower, icmp
> and toStringz.
I've been thinking about characters lately and have realized that
tolower, toupper, icmp, and friends should not be in a string library.
Those functions need an "alphabet" to be useful; not language, nor
locale...
In fact, the character itself must have alphabet information. Otherwise
a string like "ali & jim" cannot be converted to upper-case correctly(*)
as "ALİ & JIM". And the word "correctly" there depends on each
character's alphabet.
Similarly, two characters that look the same cannot be compared for
ordering. Comparing the 'x' of one alphabet to the 'x' of another
alphabet is a meaningless operation.
Ali

I'm not sure I really understand this, probably because I don't know
much about how Unicode works. I'm thinking out loud:
If the "i", as you have in "ali", has the corresponding "İ" as upper case,
wouldn't that be a different character from the English "i"? If so, I'm not
sure I see the problem. If not, I see the problem.

Jacob Carlborg wrote:
> I would keep std.string for string specific functions and perhaps
> publicly import std.algorithm. For example functions like: tolower,
icmp
> and toStringz.
I've been thinking about characters lately and have realized that
tolower, toupper, icmp, and friends should not be in a string library.
Those functions need an "alphabet" to be useful; not language, nor
locale...
In fact, the character itself must have alphabet information. Otherwise
a string like "ali & jim" cannot be converted to upper-case correctly(*)
as "ALİ & JIM". And the word "correctly" there depends on each
character's alphabet.
Similarly, two characters that look the same cannot be compared for
ordering. Comparing the 'x' of one alphabet to the 'x' of another
alphabet is a meaningless operation.
Ali

I'm not sure I really understand this, probably because I don't know
much about how Unicode works. I'm thinking out loud:
If the "i", as you have in "ali", has the corresponding "İ" as upper case,
wouldn't that be a different character from the English "i"?

'i' and 'i' are the same "character", because they have the same ASCII
and Unicode values in different alphabets. But they are not the same
"letter" when they are part of different texts.
The iİ (and ıI) issue is probably too special a case. A number of Turkic alphabets
chose ASCII 'i' probably for historical reasons. Unicode did not define
a separate code point for 'i' either, probably because those alphabets
already were using the ASCII 'i'.

If so, I'm not
sure I see the problem. If not, I see the problem.

The letter 'i' (and I) is special but the issue is valid for any other
letter: Is it valid to compare an 'i' in English text to an 'i' in
German text?
I think it's only valid at the lowest data representation level. And
ASCII never claims to be more than a code table for "information
interchange". That part is fine.
The problem is with the use of certain ranges of the ASCII table as the
English alphabet. It is unfortunate that it works... :)
It is great that D supports three separate Unicode encodings in the
language, but encodings are at a lower level of abstraction than
"letters". I am not sure what data is used for toUniUpper and toUniLower
in std.uni, but they can't work correctly without alphabet information.
They probably favor the ASCII layout for historical reasons.
I think the problems with using the ASCII table for sorting are well
known. A more interesting example is with the Azeri alphabet: it uses
the ASCII xX characters, but sorts them after hH.
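That collation difference can be expressed the same way, as an alphabet-order table; the comparison function below is a sketch, not an existing Phobos facility:

```d
// Hypothetical: compare two dchars by an explicit alphabet rank table,
// falling back to code-point order for unlisted characters.
// Azeri (sketch): x/X sorts right after h/H, unlike in code-point order.
enum int[dchar] azeriRank = ['h': 10, 'x': 11, 'i': 12 /* , ... */];

bool azeriLess(dchar a, dchar b)
{
    auto ra = a in azeriRank;
    auto rb = b in azeriRank;
    if (ra && rb)
        return *ra < *rb;
    return a < b; // fallback: plain code-point order, not alphabet-aware
}
```

With such a table, 'x' orders before 'i' under azeriLess even though 'i' < 'x' as code points.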
Ali

It is great that D supports three separate Unicode encodings in the
language, but encodings are at a lower level of abstraction than
"letters". I am not sure what data is used for toUniUpper and toUniLower
in std.uni, but they can't work correctly without alphabet information.
They probably favor the ASCII layout for historical reasons.
I think the problems with using the ASCII table for sorting are well
known. A more interesting example is with the Azeri alphabet: it uses
the ASCII xX characters, but sorts them after hH.

My idea of functions for upper/lowercase would help you solve exactly
the issue you mention. A conversion trie as an optional parameter would
allow capitalizing Straße as STRASSE and ali as ALİ.
The trie will match the longest substring of the original string and
will have translation strings in the nodes. The way capitalization is
done will depend on the way you set up the table.
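A greatly simplified sketch of that longest-match idea, using an associative-array scan where a real trie would be used (toUpperWith is a hypothetical name, and the fallback assumes a per-code-point toUpper as in std.uni):

```d
import std.array : appender;
import std.range : empty, front, popFront;
import std.uni : toUpper;

// Hypothetical: capitalize s, consulting `table` for the longest
// matching prefix at each position before falling back to a
// per-code-point toUpper. A trie would find the longest match in one
// pass instead of scanning every key.
string toUpperWith(string s, string[string] table)
{
    auto result = appender!string();
    while (!s.empty)
    {
        string bestKey, bestVal;
        foreach (k, v; table)
            if (k.length > bestKey.length && s.length >= k.length
                    && s[0 .. k.length] == k)
            {
                bestKey = k;
                bestVal = v;
            }
        if (bestKey.length)
        {
            result.put(bestVal);        // table-driven replacement
            s = s[bestKey.length .. $];
        }
        else
        {
            result.put(toUpper(s.front)); // default per-code-point rule
            s.popFront();
        }
    }
    return result.data;
}
```

For instance, toUpperWith("straße", ["ß": "SS"]) and toUpperWith("ali", ["i": "İ"]) produce the capitalizations mentioned above.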
Andrei

Perhaps it's time to start adding more packages than just std. Make
std.algorithm a package and try to split it into several modules.

Please, no. I **HATE** fine-grained imports like Tango has. I don't want to
write tons of boilerplate at the top of every file just to have access to a
bunch
of closely related functionality. If this is done, **PLEASE** at least make a
std.algorithm.all that publicly imports everything in the old std.algorithm.

Perhaps it's time to start adding more packages than just std. Make
std.algorithm a package and try to split it into several modules.

Please, no. I **HATE** fine-grained imports like Tango has. I don't want
to write tons of boilerplate at the top of every file just to have access
to a bunch
of closely related functionality. If this is done, **PLEASE** at least
make a std.algorithm.all that publicly imports everything in the old
std.algorithm.

We need a balance. Fine-grained can be great, but if it's too fine-grained,
it gets hard to find things and you have to import a ton of modules. Not
fine-grained enough, however, and you have a hard time finding things because
there's so much to search through in each module - though importing what you
need is easy.
Personally, I'm fine with std.algorithm being split into sub-modules. It's
already fairly large and splitting it up would make a lot of sense. But then
a solution allowing you to import large portions - if not all of it - at
once would definitely be nice. It's why being able to do something like
import std.*;
and have it recursively grab every sub-module would be nice. But
std.algorithm.all is a good idea.
- Jonathan M Davis

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

I like how NaturalDocs, which is similar to ddoc, helps with this: by
adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class
Probably it is possible to come up with categories for algorithms like:
- functional tools
- searching and sorting
- string utilities
...
Arguably a more D-like alternative is to make std.algorithm a package
and each 'category' a module of that package.

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

I like how NaturalDocs, which is similar to ddoc, helps with this: by
adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class
Probably it is possible to come up with categories for algorithms like:
- functional tools
- searching and sorting
- string utilities
...
Arguably a more D-like alternative is to make std.algorithm a package
and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of
it. It might require changes in ddoc. At any rate, sounds like a D3
thing. Until then, I think I'll add to std.algorithm in confidence that
we can scale the documentation later.
Andrei

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

I like how NaturalDocs, which is similar to ddoc, helps with this: by
adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class
Probably it is possible to come up with categories for algorithms like:
- functional tools
- searching and sorting
- string utilities
...
Arguably a more D-like alternative is to make std.algorithm a package
and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of
it. It might require changes in ddoc. At any rate, sounds like a D3
thing. Until then, I think I'll add to std.algorithm in confidence that
we can scale the documentation later.
Andrei

Cool, tags are even better (NaturalDocs groups aren't really tags). How
are you going to do this? Perhaps it's better to reserve this as a standard
ddoc section saying it is 'to be implemented'? That way everybody can
benefit eventually.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of it.

A hierarchical D/Python-like module system isn't the only way to organize
blocks of code. Both the future Windows file system and Gmail use tags to
create groups of items in a less disjoint way. But I don't know if it's
possible to design the equivalent of a module system based on tags instead
of a hierarchy of modules/packages (and superpackages). It seems a cute idea.

32 bits are not enough to represent certain "characters"; they need more
than one dchar. So even dchar[] may be only a bidirectional range.

Though a fixed number of bytes per code point seems convenient, it is not used
as much as the other Unicode encodings. It makes truncation slightly easier but
not significantly so compared to UTF-8 and UTF-16. It does not make calculating
the displayed width of a string any easier except in very limited cases, since
even with a "fixed width" font there may be more than one code point per
character position (combining marks) or more than one character position per
code point (for example CJK ideographs). Combining marks also mean editors
cannot treat one code point as being the same as one unit for editing.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of it.

A hierarchical D/Python-like module system isn't the only way to organize
blocks of code. Both the future Windows file system and Gmail use tags to
create groups of items in a less disjoint way. But I don't know if it's
possible to design the equivalent of a module system based on tags instead
of a hierarchy of modules/packages (and superpackages). It seems a cute idea.

This is about the documentation, which at the moment is based on the
module system, type system and order of declarations. Such tags allow
for better indexes, organization and search through the docs.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of
it.

A hierarchical D/Python-like module system isn't the only way to
organize blocks of code. Both the future Windows file system and Gmail
use tags to create groups of items in a less disjoint way. But I
don't know if it's possible to design the equivalent of a module
system based on tags instead of a hierarchy of modules/packages (and
superpackages). It seems a cute idea.

This is about the documentation, which at the moment is based on the
module system, type system and order of declarations. Such tags allow
for better indexes, organization and search through the docs.

A next step is to allow importing all names with a specified tag, even if such
names are inside more than one text file (the compiler can create a JSON text
file to speed up this retrieval):
import tag(string);
To keep things tidy I think it's better to minimize the number of different
tags inside each file, so they are similar to modules anyway: perfect
hierarchies are sometimes too rigid to represent real-life complexities,
but an approximate hierarchy is tidier and simpler to understand than an
amorphous soup of tags.
Bye,
bearophile

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class
Probably it is possible to come up with categories for algorithms like:
- functional tools
- searching and sorting
- string utilities
...
Arguably a more D-like alternative is to make std.algorithm a
package and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of
it. It might require changes in ddoc. At any rate, sounds like a D3
thing. Until then, I think I'll add to std.algorithm in confidence
that we can scale the documentation later.
Andrei

By the way, in the short term you could greatly improve the usability of
std.algorithm by cleaning up the index ("jump to") at the top of the
file. A simple alphabetical listing would be great, and you could easily
start grouping links under categories (which would eventually become tags).

That jump-to index is automatically generated. I can have it sorted
alphabetically, which makes sense for large lists. But then should I
also list components in alphabetical order?
Andrei

One problem I foresee is the growth of std.algorithm. It already has
many things in it, and I fear that some user who just wants to trim a
string may find it intimidating to browse through all that
documentation. I wonder how we could break std.algorithm into smaller
units (which is an issue largely independent from generalizing the
algorithms now found in std.string).
Any ideas are welcome.
Andrei

adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class
Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...
Arguably a more D like alternative is to make std.algorithm a package
and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't
require one to divide items in disjoint sets. I'll think some more of
it. It might require changes in ddoc. At any rate, sounds like a D3
thing. Until then, I think I'll add to std.algorithm in confidence that
we can scale the documentation later.
Andrei

By the way, in the short term you could greatly improve the usability of
std.algorithm by cleaning up the index ("jump to") at the top of the file.
A simple alphabetical listing would be great, and you could easily start
grouping links under categories (which would eventually become tags).