Hi Ted,
Following Linus' comments, this version is back to being an RFC, in
order to discuss the normalization method used. At first glance, you
will notice the series got a lot smaller, with the separation of the
Unicode code from the NLS subsystem, as Linus requested. The ext4 parts
are pretty much the same, with only the addition of a check in
ext4_feature_set_ok() to fail mounts of encoding-enabled filesystems
when CONFIG_UNICODE is not set on newer kernels.
The main change presented here is a proposal to migrate the
normalization method from NFKD to NFD. After our discussions, and after
reviewing other operating systems and language aspects, I am more
convinced that canonical decomposition is a more viable solution than
compatibility decomposition, because it doesn't eliminate any semantic
meaning, as in the definitive case of superscript numbers. NFD is also
the documented method used by HFS+ and APFS, so there is precedent.
Notice, however, that as far as my research goes, APFS doesn't
completely follow NFD: in some cases, like <compat> flags, it actually
does NFKD, but not in others (<fraction>), where it applies the
canonical form. We take a more consistent approach and always do plain NFD.
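For reference, the two decompositions differ exactly on such
compatibility characters; a quick illustrative sketch with Python's
unicodedata module (not part of the kernel code):

```python
import unicodedata as ud

sup2 = "\u00b2"                            # SUPERSCRIPT TWO
assert ud.normalize("NFD", sup2) == sup2   # canonical form keeps the superscript
assert ud.normalize("NFKD", sup2) == "2"   # compatibility form flattens it to '2'

ffi = "\ufb03"                             # LATIN SMALL LIGATURE FFI
assert ud.normalize("NFD", ffi) == ffi     # preserved under NFD
assert ud.normalize("NFKD", ffi) == "ffi"  # expanded under NFKD
```

So under NFKD, "x²" and "x2" would hash and compare as the same name,
while plain NFD keeps them distinct.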
This RFC, therefore, aims to resume/start the conversation with some
stakeholders who may have something to say regarding the normalization
method used. I added people from SMB, NFS and FS development who
might be interested in this.
Regarding casefolding, I am unsure whether Casefold Common + Full still
makes sense after migrating from the compatibility to the canonical
form. While Casefold Full, by definition, addresses cases where the
casefolding grows in size, like the casefold of the German eszett to SS,
it is also responsible for folding lowercase ligatures without a
corresponding uppercase to their compatibility counterparts. This means
that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
+F directories they will match. This seems unacceptable to me,
suggesting that we should start to use Common + Simple instead of Common
+ Full, but I would like more input on what seems more reasonable to
you.
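The two fold flavors can be compared with Python, whose str.casefold()
implements full case folding (a quick sketch; the simple-fold behavior
is stated for contrast, not computed here):

```python
# Full case folding (the current Common + Full choice):
assert "\u00df".casefold() == "ss"                     # German eszett grows to "ss"
# U+FB00 (ff ligature) folds to plain "ff", so under +F these match:
assert "o\ufb00ice".casefold() == "office".casefold()
# On a -F directory, comparison is byte-for-byte, so they differ:
assert "o\ufb00ice" != "office"
# Under simple folding (Common + Simple), U+FB00 has no simple mapping
# and folds to itself, so the two names would differ even on +F.
```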
After we decide on this, I will be sending new patches to update
e2fsprogs to the agreed method and remove the normalization/casefold
type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
patch series for inclusion in the kernel.
Practical things, w.r.t. this patch series:
- As usual, the UCD files are not part of the series, because they
would bounce. To test this, one would need to fetch the files as
explained in the commit message.
- If you prefer, you can check out from
https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls
- More details on the design decisions restricted to ext4 are
available in the corresponding commit messages.
Thanks for keeping up with this.
Gabriel Krisman Bertazi (7):
unicode: Implement higher level API for string handling
unicode: Introduce test module for normalized utf8 implementation
MAINTAINERS: Add Unicode subsystem entry
ext4: Include encoding information in the superblock
ext4: Support encoding-aware file name lookups
ext4: Implement EXT4_CASEFOLD_FL flag
docs: ext4.rst: Document encoding and case-insensitive
Olaf Weber (4):
unicode: Add unicode character database files
scripts: add trie generator for UTF-8
unicode: Introduce code for UTF-8 normalization
unicode: reduce the size of utf8data[]
Documentation/admin-guide/ext4.rst | 41 +
MAINTAINERS | 6 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/ext4/dir.c | 43 +
fs/ext4/ext4.h | 42 +-
fs/ext4/hash.c | 38 +-
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 2 +-
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 18 +
fs/ext4/namei.c | 104 +-
fs/ext4/super.c | 91 +
fs/unicode/Kconfig | 13 +
fs/unicode/Makefile | 22 +
fs/unicode/ucd/README | 33 +
fs/unicode/utf8-core.c | 183 ++
fs/unicode/utf8-norm.c | 797 +++++++
fs/unicode/utf8-selftest.c | 320 +++
fs/unicode/utf8n.h | 117 +
include/linux/fs.h | 2 +
include/linux/unicode.h | 30 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3418 ++++++++++++++++++++++++++++
24 files changed, 5307 insertions(+), 22 deletions(-)
create mode 100644 fs/unicode/Kconfig
create mode 100644 fs/unicode/Makefile
create mode 100644 fs/unicode/ucd/README
create mode 100644 fs/unicode/utf8-core.c
create mode 100644 fs/unicode/utf8-norm.c
create mode 100644 fs/unicode/utf8-selftest.c
create mode 100644 fs/unicode/utf8n.h
create mode 100644 include/linux/unicode.h
create mode 100644 scripts/mkutf8data.c
--
2.20.1

From: Olaf Weber <olaf@sgi.com>
Add files from the Unicode Character Database, version 11.0.0, to the
source. A helper program that generates a trie used for normalization
from these files is part of a separate commit.
Notes on the update from 8.0.0 to 11.0.0:
The structure of the ucd files and the special cases have not changed
between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee
lowercase characters, which is an interesting case for case-folding.
The update is accompanied by new tests in the test_ucd module to catch
specific cases. No changes to the mkutf8data script were required for
the update.
The actual files are not part of the commit submitted to the list
because they are too big and would bounce. Still, they can be obtained
with the following script:
FILES="CaseFolding.txt DerivedAge.txt extracted/DerivedCombiningClass.txt
       DerivedCoreProperties.txt NormalizationCorrections.txt
       NormalizationTest.txt UnicodeData.txt"
VERSION=11.0.0
BASE=http://www.unicode.org/Public/${VERSION}/ucd
for i in ${FILES}; do
    wget "${BASE}/$i" -O fs/unicode/ucd/$(basename ${i} .txt)-${VERSION}.txt
done
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
[Move ucd directory to fs/unicode/]
[Update to Unicode 11.0.0]
---
fs/unicode/ucd/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 fs/unicode/ucd/README
diff --git a/fs/unicode/ucd/README b/fs/unicode/ucd/README
new file mode 100644
index 000000000000..5f89017b35ee
--- /dev/null
+++ b/fs/unicode/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 11.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/11.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the Unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 414436796cf097df55f798e1585448ee CaseFolding-11.0.0.txt
+ 6032a595fbb782694456491d86eecfac DerivedAge-11.0.0.txt
+ 3240997d671297ac754ab0d27577acf7 DerivedCombiningClass-11.0.0.txt
+ 2a4fe257d9d8184518e036194d2248ec DerivedCoreProperties-11.0.0.txt
+ 4e7d383fa0dd3cd9d49d64e5b7b7c9e0 NormalizationCorrections-11.0.0.txt
+ c9500c5b8b88e584469f056023ecc3f2 NormalizationTest-11.0.0.txt
+ acc291106c3758d2025f8d7bd5518bee UnicodeData-11.0.0.txt
--
2.20.1

From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Document the encoding-awareness and case-insensitive features of ext4
for system administrators. Explain the minimum set of design decisions
that are important for sysadmins wanting to enable this feature.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
---
Documentation/admin-guide/ext4.rst | 41 ++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index e506d3dae510..4e08d0309f1e 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -91,10 +91,51 @@ Currently Available
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
the ordering)
+* Encoding aware file names
+* Case insensitive file name lookups
[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.
+Encoding-aware file names and case-insensitive lookups
+======================================================
+
+Ext4 optionally supports filesystem-wide charset knowledge when handling
+file names, which allows the user to perform file system lookups using
+charset-equivalent versions of the same file name, and optionally ensure
+that no invalid names are held by the filesystem. Charset encoding
+awareness is also essential for performing case-insensitive lookups,
+because it is what defines the casefold operation.
+
+The case-insensitive file name lookup feature is supported at a finer
+granularity, on a per-directory basis, allowing the user to mix
+case-insensitive and case-sensitive directories in the same filesystem.
+It is enabled by flipping a file attribute on an empty directory. For
+the reason stated above, the filesystem must have encoding enabled to
+use this feature.
+
+Both encoding-awareness and case-awareness are name-preserving on the
+disk, meaning that the file name provided by userspace is a
+byte-per-byte match to what is actually written on the disk. The
+Unicode normalization format used by the kernel is thus an internal
+representation, and not exposed to the userspace nor to the disk, with
+the important exception of disk hashes, used on large directories with
+the DX feature. On DX directories, the hash must be calculated using the
+normalized version of the filename, meaning that the normalization
+format used actually has an impact on where the directory entry is
+stored.
+
+When we change from viewing filenames as opaque byte sequences to seeing
+them as encoded strings, we need to address what happens when a program
+tries to create a file with an invalid name. The Unicode subsystem
+within the kernel leaves the decision of what to do in this case to the
+filesystem, which selects its preferred behavior by enabling or disabling
+strict mode. When Ext4 encounters one of those strings and the
+filesystem did not require strict mode, it falls back to treating the
+entire string as an opaque byte sequence, which still allows the user to
+operate on that file, but the case-insensitive and equivalent-sequence
+lookups won't work.
+
Options
=======
--
2.20.1

On Mon, Jan 28, 2019 at 04:32:12PM -0500, Gabriel Krisman Bertazi wrote:
> Following Linus comments, this version is back as an RFC, in order to
> discuss the normalization method used. At a first glance, you will
> notice the series got a lot smaller, with the separation of unicode code
> from the NLS subsystem, as Linus requested. The ext4 parts are pretty
> much the same, with only the addition of a verification in
> ext4_feature_set_ok() to fail encoding mounts when without
> CONFIG_UNICODE on newer kernels.
>
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD. After our discussions, and
> reviewing other operating systems and languages aspects, I am more
> convinced that canonical decomposition is more viable solution than
> compatibility decomposition, because it doesn't ignore eliminate any
> semantic meaning, like the definitive case of superscript numbers. NFD
> is also the documented method used by HFS+ and APFS, so there is
> precedent. Notice however, that as far as my research goes, APFS doesn't
> completely follows NFD, and in some cases, like <compat> flags, it
> actually does NFKD, but not in others (<fraction>), where it applies the
> canonical form. We take a more consistent approach and always do plain NFD.
>
> This RFC, therefore, aims to resume/start conversation with some
> stalkeholders that may have something to say regarding the normalization
> method used. I added people from SMB, NFS and FS development who
> might be interested on this.
For what it's worth, knfsd will just pass through pathnames unchanged
to the client, and the Linux client will pass them on to applications
unchanged. I don't know what other clients might do. But it's hard for
NFS clients and servers to do anything more clever, because the behavior
of exported filesystems varies, users have preexisting filesystems with
random encodings, and on the client side in the Linux case, the kernel
doesn't know about process locales. So, whatever behavior ext4
implements is likely the same behavior that will be seen by an
application on a client accessing an ext4 filesystem over NFS.
--b.

On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD. After our discussions, and
> reviewing other operating systems and languages aspects, I am more
> convinced that canonical decomposition is more viable solution than
> compatibility decomposition, because it doesn't ignore eliminate any
> semantic meaning, like the definitive case of superscript numbers. NFD
> is also the documented method used by HFS+ and APFS, so there is
> precedent. Notice however, that as far as my research goes, APFS doesn't
> completely follows NFD, and in some cases, like <compat> flags, it
> actually does NFKD, but not in others (<fraction>), where it applies the
> canonical form. We take a more consistent approach and always do plain NFD.
>
> This RFC, therefore, aims to resume/start conversation with some
> stalkeholders that may have something to say regarding the normalization
> method used. I added people from SMB, NFS and FS development who
> might be interested on this.
Hello! I think that the choice of NFD normalization is not the right
decision. Some reasons:
1) NFD is not widely used. Even Apple does not use it (as you wrote,
Apple has its own normalization form).
2) All filesystems which I know either do not use any normalization or
use NFC.
3) Lots of existing Linux applications generate file names in NFC.
4) Linux GUI libraries like Qt and Gtk generate strings from key strokes
in NFC. So if a user types a file name into a Qt/Gtk box, it will be in
NFC.
So why use NFD in the ext4 filesystem if the Linux userspace ecosystem
already uses NFC? NFD here just adds another layer of problems and
unexpected things, and makes ext4 somehow different.
Why not rather choose NFC? It would be more compatible with Linux GUI
applications and also with Microsoft Windows systems, which use NFC
too.
Please, really consider not using NFD. Most Linux applications really
do not do any normalization, or do NFC. And usage of the decomposed form
for applications which do not implement full Unicode grapheme algorithms
just creates more problems for them.
Yes, there are still a lot of legacy applications which expect that one
code point = one visible symbol (therefore one Unicode grapheme). And
because GUIs in most cases generate NFC strings, and existing file
names are also in NFC, these applications work in most cases without
problems. Forcing usage of NFD filenames would just break them.
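The breakage for such legacy applications is easy to see with Python's
unicodedata module (an illustrative sketch): canonical decomposition
turns one precomposed code point into two.

```python
import unicodedata

nfc = "\u00e9"                           # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' followed by combining U+0301
assert len(nfc) == 1 and len(nfd) == 2
# Both strings render as the single grapheme 'é', but an application
# that equates code points with visible symbols now miscounts the name.
```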
(PS: I think that only 2 programming languages implement Unicode
grapheme algorithms correctly: Elixir and Perl 6; which is not many.)
--
Pali Rohár
pali.rohar@gmail.com

Pali Rohár <pali.rohar@gmail.com> writes:
> On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
>> The main change presented here is a proposal to migrate the
>> normalization method from NFKD to NFD. After our discussions, and
>> reviewing other operating systems and languages aspects, I am more
>> convinced that canonical decomposition is more viable solution than
>> compatibility decomposition, because it doesn't ignore eliminate any
>> semantic meaning, like the definitive case of superscript numbers. NFD
>> is also the documented method used by HFS+ and APFS, so there is
>> precedent. Notice however, that as far as my research goes, APFS doesn't
>> completely follows NFD, and in some cases, like <compat> flags, it
>> actually does NFKD, but not in others (<fraction>), where it applies the
>> canonical form. We take a more consistent approach and always do plain NFD.
>>
>> This RFC, therefore, aims to resume/start conversation with some
>> stalkeholders that may have something to say regarding the normalization
>> method used. I added people from SMB, NFS and FS development who
>> might be interested on this.
>
> Hello! I think that choice of NFD normalization is not right decision.
> Some reasons:
>
> 1) NFD is not widely used. Even Apple does not use it (as you wrote
> Apple has own normalization form).
To be exact, Apple claims to use NFD in their specification [1]. What I
observed is that they don't handle some types of compatibility
characters correctly: characters that plain NFD should leave intact get
decomposed. For instance, the ff ligature is decomposed into f + f.
> 2) All filesystems which I known either do not use any normalization or
> use NFC.
> 3) Lot of existing Linux application generate file names in NFC.
>
Most do use NFC. But this is an internal representation for ext4, and
it is name-preserving. We only use the normalization when comparing
whether names match and to calculate dcache and dx hashes. The Unicode
standard recommends the D forms for internal representation.
> 4) Linux GUI libraries like Qt and Gtk generate strings from key strokes
> in NFC. So if user type file name in Qt/Gtk box it would be in NFC.
>
> So why to use NFD in ext4 filesystem if Linux userspace ecosystem
> already uses NFC?
NFC is costlier to calculate, usually requiring an intermediate NFD
step. Whether it is prohibitively expensive to do in the dcache path, I
don't know, but since it is a critical path, any gain matters.
> NFD here just makes another layer of problems, unexpected things and
> make it somehow different.
Is there any case where

  NFC(x) == NFC(y) && NFD(x) != NFD(y), or
  NFC(x) != NFC(y) && NFD(x) == NFD(y) ?

I am having a hard time thinking of an example. This is the main
(only?) scenario where choosing the C or D form for an internal
representation would affect userspace.
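Unicode's canonical-equivalence guarantee (UAX #15) implies no such
pair should exist: NFC(x) == NFC(y) exactly when x and y are
canonically equivalent, which is also exactly when NFD(x) == NFD(y).
A brute spot check over a few composed/decomposed spellings with
Python's unicodedata (a sketch, not a proof):

```python
import unicodedata as ud

# Strings mixing precomposed and decomposed spellings of the same text.
samples = ["\u00e9", "e\u0301", "\u00c5", "A\u030a",
           "\uac00", "\u1100\u1161", "x"]

for x in samples:
    for y in samples:
        c_eq = ud.normalize("NFC", x) == ud.normalize("NFC", y)
        d_eq = ud.normalize("NFD", x) == ud.normalize("NFD", y)
        # The two forms must agree on which strings are equivalent.
        assert c_eq == d_eq, (x, y)
```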
>
> Why not rather choose NFC? It would be more compatible with Linux GUI
> applications and also with Microsoft Windows systems, which uses NFC
> too.
>
> Please, really consider to not use NFD. Most Linux applications really
> do not do any normalization or do NFC. And usage of decomposition form
> for application which do not implement full Unicode grapheme algorithms
> just make for them another problems.
> Yes, there are still lot of legacy application which expect that one
> code point = one visible symbol (therefore one Unicode grapheme). And
> because GUI in most cases generates NFC strings, also existing file
> names are in NFC, these application works in most cases without problem.
> Force usage of NFD filenames just break them.
As I said, this shouldn't be a problem, because what the application
creates and retrieves is the exact name that was used before; we'd
only use this format for internal metadata on the disk (hashes) and for
in-kernel comparisons.
> (PS: I think that only 2 programming languages implements Unicode
> grapheme algorithms correctly: Elixir and Perl 6; which is not so
> much)
[1] https://developer.apple.com/support/apple-file-system/Apple-File-System-Reference.pdf
--
Gabriel Krisman Bertazi

On Tuesday 05 February 2019 14:08:00 Gabriel Krisman Bertazi wrote:
> Pali Rohár <pali.rohar@gmail.com> writes:
>
> > On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
> >> The main change presented here is a proposal to migrate the
> >> normalization method from NFKD to NFD. After our discussions, and
> >> reviewing other operating systems and languages aspects, I am more
> >> convinced that canonical decomposition is more viable solution than
> >> compatibility decomposition, because it doesn't ignore eliminate any
> >> semantic meaning, like the definitive case of superscript numbers. NFD
> >> is also the documented method used by HFS+ and APFS, so there is
> >> precedent. Notice however, that as far as my research goes, APFS doesn't
> >> completely follows NFD, and in some cases, like <compat> flags, it
> >> actually does NFKD, but not in others (<fraction>), where it applies the
> >> canonical form. We take a more consistent approach and always do plain NFD.
> >>
> >> This RFC, therefore, aims to resume/start conversation with some
> >> stalkeholders that may have something to say regarding the normalization
> >> method used. I added people from SMB, NFS and FS development who
> >> might be interested on this.
> >
> > Hello! I think that choice of NFD normalization is not right decision.
> > Some reasons:
> >
> > 1) NFD is not widely used. Even Apple does not use it (as you wrote
> > Apple has own normalization form).
>
> To be exact, Apple claims to use NFD in their specification [1] .
Interesting...
> What I
> observed is that they don't ignore some types of compatibility
> characters correctly as they should. For instance, the ff ligature is
> decomposed into f + f.
I'm sure that Apple does not do NFD, but rather their own invented
normal form: some graphemes are decomposed, and some are not.
> > 2) All filesystems which I known either do not use any normalization or
> > use NFC.
> > 3) Lot of existing Linux application generate file names in NFC.
> >
>
> Most do use NFC. But this is an internal representation for ext4 and it
> is name preserving.
Ok. I was under the impression that it does not preserve original
names, just like the implementation in Apple's system, where the char*
passed to creat() does not appear in readdir().
> We only use the normalization when comparing if names
> matches and to calculate dcache and dx hashes. The unicode standard
> recomends the D forms for internal representation.
Ok, this should be less destructive and less visible to userspace.
> > 4) Linux GUI libraries like Qt and Gtk generate strings from key strokes
> > in NFC. So if user type file name in Qt/Gtk box it would be in NFC.
> >
> > So why to use NFD in ext4 filesystem if Linux userspace ecosystem
> > already uses NFC?
>
> NFC is costlier to calculate, usually requiring an intermediate NFD
> step. Whether it is prohibitively expensive to do in the dcache path, I
> don't know, but since it is a critical path, any gain matters.
>
> > NFD here just makes another layer of problems, unexpected things and
> > make it somehow different.
>
> Is there any case where
> NFC(x) == NFC(y) && NFD(x) != NFD(y) , or
> NFC(x) != NFC(y) && NFD(x) == NFD(y)
This is a good question, and I think we should get a definitive answer
to it prior to the inclusion of normalization into the kernel.
> I am having a hard time thinking of an example. This is the main
> (only?) scenario where choosing C or D form for an internal
> representation would affect userspace.
For the decision between normalization forms, probably yes.
> >
> > Why not rather choose NFC? It would be more compatible with Linux GUI
> > applications and also with Microsoft Windows systems, which uses NFC
> > too.
> >
> > Please, really consider to not use NFD. Most Linux applications really
> > do not do any normalization or do NFC. And usage of decomposition form
> > for application which do not implement full Unicode grapheme algorithms
> > just make for them another problems.
>
> > Yes, there are still lot of legacy application which expect that one
> > code point = one visible symbol (therefore one Unicode grapheme). And
> > because GUI in most cases generates NFC strings, also existing file
> > names are in NFC, these application works in most cases without problem.
> > Force usage of NFD filenames just break them.
>
> As I said, this shouldn't be a problem because what the application
> creates and retrieves is the exact name that was used before, we'd
> only use this format for internal metadata on the disk (hashes) and for
> in-kernel comparisons.
There is another problem for userspace applications:
Currently ext4 accepts as a file name any sequence of bytes which does
not contain a nul byte or '/'. So having a Latin-1 file name is
perfectly correct.
What would happen if a userspace application wanted to create the
following two file names: "\xDF" and "\xF0"? The first is sharp s, the
second is eth (in Latin-1), but both file names are invalid UTF-8
sequences. Is it disallowed to create such file names? Or are both file
names internally converted to U+FFFD (the replacement character), so
that, because NFD(first U+FFFD) == NFD(second U+FFFD), only the first
file would be created?
And what happens in general with invalid UTF-8 sequences? There are
many different types of invalid UTF-8 sequences: a non-shortest
sequence for a valid code point, a valid-looking sequence for an
invalid code point (either a surrogate code point or a code point above
U+10FFFF, ...), an incorrect byte which should start a new code point,
an incorrect byte after decoding of a code point has started, ...
Different (userspace) applications handle these invalid UTF-8 sequences
differently; some of them accept certain kinds of "incorrectness" (e.g.
the non-shortest form of a code point representation), some do not.
Some applications replace the invalid parts of a UTF-8 sequence with a
sequence of UTF-8 replacement characters, some do not. It can also be
observed that some applications use just one replacement character,
while others replace an invalid UTF-8 sequence with multiple
replacement characters.
So trying to "recover" from an invalid UTF-8 sequence to a valid one is
done in multiple ways, and usage of any existing way could cause
problems, e.g. making it impossible to create both files "\xDF\xF0" and
"\xF0\xDF".
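The collision described above can be seen with Python's bytes.decode
under the common 'replace' error handler (an illustrative sketch of one
recovery strategy, not of what ext4 actually does):

```python
# Two distinct Latin-1 file names that are both invalid UTF-8.
sharp_s, eth = b"\xdf", b"\xf0"

# A recovery strategy that maps invalid input to U+FFFD collapses the
# two distinct byte strings into the very same name:
a = sharp_s.decode("utf-8", errors="replace")
b = eth.decode("utf-8", errors="replace")
assert a == b == "\ufffd"
```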
> > (PS: I think that only 2 programming languages implements Unicode
> > grapheme algorithms correctly: Elixir and Perl 6; which is not so
> > much)
>
> [1] https://developer.apple.com/support/apple-file-system/Apple-File-System-Reference.pdf
>
--
Pali Rohár
pali.rohar@gmail.com

Pali Rohár <pali.rohar@gmail.com> writes:
> On Tuesday 05 February 2019 14:08:00 Gabriel Krisman Bertazi wrote:
>> Pali Rohár <pali.rohar@gmail.com> writes:
>>
>> > On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
>> >> The main change presented here is a proposal to migrate the
>> >> normalization method from NFKD to NFD. After our discussions, and
>> >> reviewing other operating systems and languages aspects, I am more
>> >> convinced that canonical decomposition is more viable solution than
>> >> compatibility decomposition, because it doesn't ignore eliminate any
>> >> semantic meaning, like the definitive case of superscript numbers. NFD
>> >> is also the documented method used by HFS+ and APFS, so there is
>> >> precedent. Notice however, that as far as my research goes, APFS doesn't
>> >> completely follows NFD, and in some cases, like <compat> flags, it
>> >> actually does NFKD, but not in others (<fraction>), where it applies the
>> >> canonical form. We take a more consistent approach and always do plain NFD.
>> >>
>> >> This RFC, therefore, aims to resume/start conversation with some
>> >> stalkeholders that may have something to say regarding the normalization
>> >> method used. I added people from SMB, NFS and FS development who
>> >> might be interested on this.
>> >
>> > Hello! I think that choice of NFD normalization is not right decision.
>> > Some reasons:
>> >
>> > 1) NFD is not widely used. Even Apple does not use it (as you wrote
>> > Apple has own normalization form).
>>
>> To be exact, Apple claims to use NFD in their specification [1] .
>
> Interesting...
>
>> What I
>> observed is that they don't ignore some types of compatibility
>> characters correctly as they should. For instance, the ff ligature is
>> decomposed into f + f.
>
> I'm sure that Apple does not do NFD, but their own invented normal form.
> Some graphemes are decomposed, and some not.
>
>> > 2) All filesystems which I known either do not use any normalization or
>> > use NFC.
>> > 3) Lot of existing Linux application generate file names in NFC.
>> >
>>
>> Most do use NFC. But this is an internal representation for ext4 and it
>> is name preserving.
>
> Ok. I was in impression that it does not preserve original names, just
> like implementation in Apple's system, where char* passed to creat()
> does not appear in readdir().
>
>> We only use the normalization when comparing whether names
>> match and to calculate dcache and dx hashes. The Unicode standard
>> recommends the D forms for internal representation.
>
> Ok, this should be less destructive and less visible to userspace.
>
>> > 4) Linux GUI libraries like Qt and Gtk generate strings from key
>> > strokes in NFC. So if a user types a file name into a Qt/Gtk box, it
>> > will be in NFC.
>> >
>> > So why use NFD in the ext4 filesystem if the Linux userspace
>> > ecosystem already uses NFC?
>>
>> NFC is costlier to calculate, usually requiring an intermediate NFD
>> step. Whether it is prohibitively expensive to do in the dcache path, I
>> don't know, but since it is a critical path, any gain matters.
>>
>> > NFD here just adds another layer of problems and unexpected
>> > behavior, and makes things somehow different.
>>
>> Is there any case where
>> NFC(x) == NFC(y) && NFD(x) != NFD(y) , or
>> NFC(x) != NFC(y) && NFD(x) == NFD(y)
>
> This is a good question. And I think we should get a definite answer
> to it prior to the inclusion of normalization into the kernel.
>
>> I am having a hard time thinking of an example. This is the main
>> (only?) scenario where choosing C or D form for an internal
>> representation would affect userspace.
>
> For deciding between normal forms, probably yes.
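
For what it's worth, UAX #15 states that NFC and NFD induce the same
canonical-equivalence classes, so no such pair should exist. That can be
spot-checked from userspace; a quick sketch with Python's unicodedata
(illustration only, not part of the series):

```python
# Spot-check: for every pair of sample strings, equality under NFC
# must agree with equality under NFD (canonical equivalence is the
# same relation either way).
import unicodedata

samples = [
    "\u00E9",    # e with acute, precomposed
    "e\u0301",   # e + combining acute accent
    "\u212B",    # angstrom sign (canonical singleton)
    "\u00C5",    # A with ring above, precomposed
    "A\u030A",   # A + combining ring above
]

for x in samples:
    for y in samples:
        nfc_eq = unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
        nfd_eq = unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)
        assert nfc_eq == nfd_eq, (x, y)

print("NFC and NFD agreed on every pair")
```

This obviously doesn't prove the property for all strings, but the
standard guarantees it: two strings are canonically equivalent iff their
NFD forms are identical, iff their NFC forms are identical.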
>
>> >
>> > Why not rather choose NFC? It would be more compatible with Linux GUI
>> > applications and also with Microsoft Windows systems, which use NFC
>> > too.
>> >
>> > Please, really consider not using NFD. Most Linux applications
>> > really do not do any normalization, or do NFC. And usage of the
>> > decomposition form just creates more problems for applications which
>> > do not implement the full Unicode grapheme algorithms.
>>
>> > Yes, there are still lots of legacy applications which expect that
>> > one code point = one visible symbol (therefore one Unicode grapheme).
>> > And because GUIs in most cases generate NFC strings, and existing
>> > file names are in NFC, these applications work in most cases without
>> > problems. Forced usage of NFD filenames would just break them.
>>
>> As I said, this shouldn't be a problem because what the application
>> creates and retrieves is the exact name that was used before, we'd
>> only use this format for internal metadata on the disk (hashes) and for
>> in-kernel comparisons.
>
> There is another problem for userspace applications:
>
> Currently ext4 accepts as a file name any sequence of bytes which does
> not contain a nul byte or '/'. So having a Latin1 file name is
> perfectly correct.
>
> What would happen if a userspace application wanted to create the
> following two file names: "\xDF" and "\xF0"? The first one is sharp S,
> the second one is eth (in Latin1). But as file names they are invalid
> UTF-8 sequences. Is it disallowed to create such file names? Or are
> both file names internally converted to U+FFFD (the replacement
> character), and because NFD(first U+FFFD) == NFD(second U+FFFD), only
> the first file would be created?
>
> And what happens in general with invalid UTF-8 sequences? There are
> many different types of invalid UTF-8 sequences: a non-shortest
> sequence for a valid code point, a valid sequence for an invalid code
> point (either surrogate code points or code points above U+10FFFF,
> ...), an incorrect byte which should start a new code point, an
> incorrect byte after decoding of a code point has started, ...
>
> Different (userspace) applications handle these invalid UTF-8
> sequences differently; some of them accept certain kinds of
> "incorrectness" (e.g. the non-shortest form of code point
> representation), some do not. Some applications replace invalid parts
> of a UTF-8 sequence by a sequence of UTF-8 replacement characters,
> some do not. It can also be observed that some applications use just
> one replacement character while others replace an invalid UTF-8
> sequence by several replacement characters.
>
> So trying to "recover" from an invalid UTF-8 sequence to a valid one
> is done in several ways... And usage of any existing way could cause
> problems... E.g. it would not be possible to create the two files
> "\xDF\xF0" and "\xF0\xDF"...
Basically, there are two sane ways to handle invalid UTF-8 sequences
inside the kernel. I don't see much gain in handling different levels
of incorrectness. Opening up to "we now accept surrogate characters,
but reject unmapped code points (which we must do, because of the
stability of future Unicode versions)" makes everything much more
unpredictable.
Anyway, two ways to handle invalid sequences...
- 1. An invalid filename can't exist on the disk. This means
rejecting the sequence and failing the syscall when it comes from
userspace, and flagging it as an error to be fixed by fsck when any
of these sequences are identified already on the disk. This has
obvious backward compatibility problems with applications that want
to create filenames with invalid sequences.
- 2. An invalid filename can exist on the disk as a unique sequence.
In this case, we must decide how to handle invalid sequences that
will eventually appear. The only sane way is to consider the entire
sequence an opaque byte sequence, essentially falling back to the
old behavior, which prevents userspace breakage. We lose the
normalization/casefold feature for that directory entry only, but
the file is still accessible when using the exact match.
Any variant of these, like trying to fix invalid sequences or trying to
do a partial normalization/casefold as a best effort, is insane to do
in kernel space.
Patch 09 already implements both of the sane behaviors. Through a
flag in the file system, which defaults to the second case, ext4 will
either reject or treat invalid sequences as opaque byte sequences.
There are more details about handling of invalid sequences in the patch
description.
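
To make the two policies concrete, here is a rough userspace model in
Python (the function name and flag are made up for illustration; the
kernel code in patch 09 is the actual reference):

```python
import unicodedata

def normalize_for_lookup(name: bytes, strict: bool) -> bytes:
    """Model of the two sane policies for the name-comparison key.

    strict=True  -> policy 1: invalid UTF-8 is rejected outright.
    strict=False -> policy 2: invalid UTF-8 falls back to an opaque
                    byte sequence (exact-match semantics only).
    """
    try:
        # Python's strict decoder rejects overlong sequences,
        # surrogates, and code points above U+10FFFF.
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        if strict:
            raise ValueError("invalid UTF-8 in filename")  # ~ fail the syscall
        return name  # opaque bytes: no normalization or casefold
    # Valid UTF-8: hash/compare under plain NFD
    return unicodedata.normalize("NFD", text).encode("utf-8")
```

With strict=False, the Latin1 names "\xDF" and "\xF0" from the example
above remain distinct opaque strings, while the precomposed and
decomposed spellings of a valid name map to the same key.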
--
Gabriel Krisman Bertazi

On Wednesday 06 February 2019 11:04:24 Gabriel Krisman Bertazi wrote:
> Pali Rohár <pali.rohar@gmail.com> writes:
>
> > On Tuesday 05 February 2019 14:08:00 Gabriel Krisman Bertazi wrote:
> >> Pali Rohár <pali.rohar@gmail.com> writes:
> >>
> >> > On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
> >> >> The main change presented here is a proposal to migrate the
> >> >> normalization method from NFKD to NFD. After our discussions, and
> >> >> reviewing other operating systems and languages aspects, I am more
> >> >> convinced that canonical decomposition is a more viable solution than
> >> >> compatibility decomposition, because it doesn't eliminate any
> >> >> semantic meaning, like the definitive case of superscript numbers. NFD
> >> >> is also the documented method used by HFS+ and APFS, so there is
> >> >> precedent. Notice, however, that as far as my research goes, APFS doesn't
> >> >> completely follow NFD; in some cases, like <compat> flags, it
> >> >> actually does NFKD, but not in others (<fraction>), where it applies the
> >> >> canonical form. We take a more consistent approach and always do plain NFD.
> >> >>
> >> >> This RFC, therefore, aims to resume/start a conversation with some
> >> >> stakeholders who may have something to say regarding the normalization
> >> >> method used. I added people from SMB, NFS and FS development who
> >> >> might be interested in this.
> >> >
> >> > Hello! I think that the choice of NFD normalization is not the
> >> > right decision. Some reasons:
> >> >
> >> > 1) NFD is not widely used. Even Apple does not use it (as you
> >> > wrote, Apple has its own normalization form).
> >>
> >> To be exact, Apple claims to use NFD in their specification [1].
> >
> > Interesting...
> >
> >> What I
> >> observed is that they don't leave some types of compatibility
> >> characters alone as they should. For instance, the ff ligature is
> >> decomposed into f + f.
> >
> > I'm sure that Apple does not do NFD, but rather their own invented
> > normal form. Some graphemes are decomposed, and some are not.
> >
> >> > 2) All filesystems which I know either do not use any
> >> > normalization or use NFC.
> >> > 3) Lots of existing Linux applications generate file names in NFC.
> >> >
> >>
> >> Most do use NFC. But this is an internal representation for ext4 and it
> >> is name preserving.
> >
> > Ok. I was under the impression that it does not preserve original
> > names, just like the implementation in Apple's system, where the
> > char* passed to creat() does not appear in readdir().
> >
> >> We only use the normalization when comparing whether names
> >> match and to calculate dcache and dx hashes. The Unicode standard
> >> recommends the D forms for internal representation.
> >
> > Ok, this should be less destructive and less visible to userspace.
> >
> >> > 4) Linux GUI libraries like Qt and Gtk generate strings from key
> >> > strokes in NFC. So if a user types a file name into a Qt/Gtk box,
> >> > it will be in NFC.
> >> >
> >> > So why use NFD in the ext4 filesystem if the Linux userspace
> >> > ecosystem already uses NFC?
> >>
> >> NFC is costlier to calculate, usually requiring an intermediate NFD
> >> step. Whether it is prohibitively expensive to do in the dcache path, I
> >> don't know, but since it is a critical path, any gain matters.
> >>
> >> > NFD here just adds another layer of problems and unexpected
> >> > behavior, and makes things somehow different.
> >>
> >> Is there any case where
> >> NFC(x) == NFC(y) && NFD(x) != NFD(y) , or
> >> NFC(x) != NFC(y) && NFD(x) == NFD(y)
> >
> > This is a good question. And I think we should get a definite answer
> > to it prior to the inclusion of normalization into the kernel.
> >
> >> I am having a hard time thinking of an example. This is the main
> >> (only?) scenario where choosing C or D form for an internal
> >> representation would affect userspace.
> >
> > For deciding between normal forms, probably yes.
> >
> >> >
> >> > Why not rather choose NFC? It would be more compatible with Linux
> >> > GUI applications and also with Microsoft Windows systems, which
> >> > use NFC too.
> >> >
> >> > Please, really consider not using NFD. Most Linux applications
> >> > really do not do any normalization, or do NFC. And usage of the
> >> > decomposition form just creates more problems for applications
> >> > which do not implement the full Unicode grapheme algorithms.
> >>
> >> > Yes, there are still lots of legacy applications which expect that
> >> > one code point = one visible symbol (therefore one Unicode
> >> > grapheme). And because GUIs in most cases generate NFC strings, and
> >> > existing file names are in NFC, these applications work in most
> >> > cases without problems. Forced usage of NFD filenames would just
> >> > break them.
> >>
> >> As I said, this shouldn't be a problem because what the application
> >> creates and retrieves is the exact name that was used before, we'd
> >> only use this format for internal metadata on the disk (hashes) and for
> >> in-kernel comparisons.
> >
> > There is another problem for userspace applications:
> >
> > Currently ext4 accepts as a file name any sequence of bytes which
> > does not contain a nul byte or '/'. So having a Latin1 file name is
> > perfectly correct.
> >
> > What would happen if a userspace application wanted to create the
> > following two file names: "\xDF" and "\xF0"? The first one is sharp
> > S, the second one is eth (in Latin1). But as file names they are
> > invalid UTF-8 sequences. Is it disallowed to create such file names?
> > Or are both file names internally converted to U+FFFD (the
> > replacement character), and because NFD(first U+FFFD) == NFD(second
> > U+FFFD), only the first file would be created?
> >
> > And what happens in general with invalid UTF-8 sequences? There are
> > many different types of invalid UTF-8 sequences: a non-shortest
> > sequence for a valid code point, a valid sequence for an invalid
> > code point (either surrogate code points or code points above
> > U+10FFFF, ...), an incorrect byte which should start a new code
> > point, an incorrect byte after decoding of a code point has
> > started, ...
> >
> > Different (userspace) applications handle these invalid UTF-8
> > sequences differently; some of them accept certain kinds of
> > "incorrectness" (e.g. the non-shortest form of code point
> > representation), some do not. Some applications replace invalid
> > parts of a UTF-8 sequence by a sequence of UTF-8 replacement
> > characters, some do not. It can also be observed that some
> > applications use just one replacement character while others replace
> > an invalid UTF-8 sequence by several replacement characters.
> >
> > So trying to "recover" from an invalid UTF-8 sequence to a valid one
> > is done in several ways... And usage of any existing way could cause
> > problems... E.g. it would not be possible to create the two files
> > "\xDF\xF0" and "\xF0\xDF"...
>
> Basically, there are two sane ways to handle invalid UTF-8 sequences
> inside the kernel. I don't see much gain in handling different levels
> of incorrectness. Opening up to "we now accept surrogate characters,
> but reject unmapped code points (which we must do, because of the
> stability of future Unicode versions)" makes everything much more
> unpredictable.
Yes, this just makes a lot of mess.
> Anyway, two ways to handle invalid sequences...
>
> - 1. An invalid filename can't exist on the disk. This means
> rejecting the sequence and failing the syscall when it comes from
> userspace, and flagging it as an error to be fixed by fsck when any
> of these sequences are identified already on the disk. This has
> obvious backward compatibility problems with applications that want
> to create filenames with invalid sequences.
Personally, I'm for this variant. If a directory is marked as "Unicode",
I would expect that file names in that directory are in Unicode, and
not a mix of garbage (bytes, Latin1) and Unicode.
If an application wants to store Latin1 into a directory marked as
Unicode, I think it should really be prohibited. Otherwise, why is such
a "Unicode" flag there if it cannot be enforced?
> - 2. An invalid filename can exist on the disk as a unique sequence.
> In this case, we must decide how to handle invalid sequences that
> will eventually appear. The only sane way is to consider the entire
> sequence an opaque byte sequence, essentially falling back to the
> old behavior, which prevents userspace breakage. We lose the
> normalization/casefold feature for that directory entry only, but
> the file is still accessible when using the exact match.
>
> Any variant of these, like trying to fix invalid sequences or trying
> to do a partial normalization/casefold as a best effort, is insane to
> do in kernel space.
+1
> Patch 09 already implements both of the sane behaviors. Through a
> flag in the file system, which defaults to the second case, ext4 will
> either reject or treat invalid sequences as opaque byte sequences.
>
> There are more details about handling of invalid sequences in the patch
> description.
>
--
Pali Rohár
pali.rohar@gmail.com

Gabriel Krisman Bertazi <krisman@collabora.com> writes:
> Regarding Casefold, I am unsure whether Casefold Common + Full still
> makes sense after migrating from the compatibility to the canonical
> form. While Casefold Full, by definition, addresses cases where the
> casefolding grows in size, like the casefold of the German eszett to SS,
> it is also responsible for folding lowercase ligatures without a
> corresponding uppercase to their compatible counterpart. This means
> that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
> +F directories they will match. This seems unacceptable to me,
> suggesting that we should start to use Common + Simple instead of Common
> + Full, but I would like more input on what seems more reasonable to
> you.
>
> After we decide on this, I will be sending new patches to update
> e2fsprogs to the agreed method and remove the normalization/casefold
> type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
> EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
> patch series for inclusion in the kernel.
Hey Ted,
Any comments about this bit before I move on and propose a new version?
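
For reference, the Full-folding behavior in question can be seen
directly with Python's str.casefold(), which implements Unicode full
case folding; this only illustrates the standard's tables, not the
kernel code:

```python
# Full case folding grows strings (eszett -> ss) and also folds
# lowercase ligatures such as U+FB03 (ffi) to their multi-letter
# counterparts -- the behavior that makes "office" and "o<ffi>ce"
# collide on +F directories.
print("stra\u00DFe".casefold())                       # strasse
print("office".casefold() == "o\uFB03ce".casefold())  # True

# Under simple folding (Common + Simple), U+FB03 has no mapping and
# folds to itself, so the two names above would stay distinct.
```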
--
Gabriel Krisman Bertazi