Works on my Windows XP, PHP 5.4.8, *if* the PHP file is stored in ANSI
(="Windows") encoding. Doesn't work if stored in UTF8. So I suspect it's
an encoding issue: the accented character "è" is stored as \xE8 in the
file system, but if your script is UTF8, then your $path will contain
\xC3\xA8 instead. With an "explicit" $path = __DIR__ . "\Eug\xE8ne.txt"
it works when the script is UTF8.
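
To make the byte-level difference concrete, here is a small sketch. It writes both spellings with hex escapes, so the result does not depend on the encoding the script file itself is saved in. The variable names are made up for illustration.

```php
<?php
// "Eugène.txt" spelled out byte by byte, so the editor's encoding is irrelevant.
$utf8Name   = "Eug\xC3\xA8ne.txt"; // "è" as UTF-8: two bytes, C3 A8
$cp1252Name = "Eug\xE8ne.txt";     // "è" as Windows-1252: one byte, E8

echo bin2hex($utf8Name), "\n";
echo bin2hex($cp1252Name), "\n";

// PHP strings are plain byte sequences, so these are two different strings
// and would name two different files.
var_dump($utf8Name === $cp1252Name);
```

If the file system layer expects the one-byte form, a path built from the two-byte form simply names a file that does not exist.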

> Works on my Windows XP, PHP 5.4.8, *if* the PHP file is stored in ANSI
> (="Windows") encoding. Doesn't work if stored in UTF8.

Strange situation.
I changed my PHP-files encoding to UTF-8, but the problem still occurred.

> So I suspect it's
> an encoding issue: the accented character "è" is stored as \xE8 in the
> file system, but if your script is UTF8, then your $path will contain
> \xC3\xA8 instead. With an "explicit" $path = __DIR__ . "\Eug\xE8ne.txt"
> it works when the script is UTF8.

THAT helped!

I added a replace:
$path = str_replace("è","\xE8",$path);

and now it IS readable from PHP.

Now I wonder if I should make a whole list of such replaces....
Sounds horrid, doesn't it?

But your idea brought me to the following idea:
$path = utf8_decode($path);

Which works flawlessly (on my set of only 13000 filenames)!

So at least I have it fixed for NTFS.
Thanks for pointing my head in the right direction.
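
A hedged note on why this works: utf8_decode() converts UTF-8 to ISO-8859-1, which agrees with Windows-1252 for "è" and the other characters in the 0xA0 to 0xFF range, and it replaces anything it cannot represent with "?". It is also deprecated as of PHP 8.2. Below is a minimal sketch of the same idea using iconv(). The function name toFilesystemEncoding and the default target 'Windows-1252' are assumptions for illustration, not something established in this thread, and the right target depends on the system.

```php
<?php
// Hypothetical helper: convert a UTF-8 path to the encoding the file system
// layer is assumed to expect (Windows-1252 here, which is only an assumption).
function toFilesystemEncoding(string $utf8Path, string $fsEncoding = 'Windows-1252'): string
{
    // Many iconv implementations return false when a character cannot be
    // represented in the target encoding; the @ hides the accompanying notice.
    $converted = @iconv('UTF-8', $fsEncoding, $utf8Path);
    if ($converted === false) {
        throw new RuntimeException("Path cannot be represented in $fsEncoding");
    }
    return $converted;
}

// "Eugène.txt" given as UTF-8 bytes, independent of this file's encoding.
$path = toFilesystemEncoding("Eug\xC3\xA8ne.txt");
echo bin2hex($path), "\n";
```

Unlike a growing list of str_replace() calls, this covers every character the target encoding has, and it can fail loudly instead of silently producing "?".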

Regards,
Erwin Moller

> Greetings,
> Thomas

--
"That which can be asserted without evidence, can be dismissed without
evidence."
-- Christopher Hitchens

On 09/10/13 12:16, Erwin Moller wrote:
> On 10/9/2013 12:58 PM, Thomas Mlynarczyk wrote:
>> Erwin Moller wrote:
>>
>>> How can PHP open files on the local filesystem that contain certain
>>> characters, like umlauts, accents, etc?
>>
>> $path = __DIR__ . '\Eugène.txt';
>> var_dump( PHP_VERSION, file_exists( $path ) );
>
> That didn't help since my files are not stored in working dir.
>
> […]
>
> But your idea brought me to the following idea:
> $path = utf8_decode($path);
>
> Which works flawlessly (on my set of only 13000 filenames)!
>
> So at least I have it fixed for NTFS.
> Thanks for pointing my head in the right direction.

and thanks for raising and solving that one..

tucked away in case it's ever needed..


--
Ineptocracy

(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.

On 10/9/2013 1:18 PM, The Natural Philosopher wrote:
> On 09/10/13 12:16, Erwin Moller wrote:
>> […]
>> But your idea brought me to the following idea:
>> $path = utf8_decode($path);
>>
>> Which works flawlessly (on my set of only 13000 filenames)!
>
> and thanks for raising and solving that one..
>
> tucked away in case it's ever needed..

I don't have a good feeling about my "fix".
It worked, but I don't know exactly what is going on.

I actually hoped PHP would handle such things 'the right way', whatever
that might be. ;-)
Now I wonder what happens if my code happens to run on some *nix OS.
Ideally my PHP code is OS agnostic.

Regards,
Erwin Moller


On Wed, 09 Oct 2013 17:06:21 +0200, Erwin Moller wrote:
> I don't have a good feeling about my "fix".
> It worked, but I don't know exactly what is going on.
>
> I actually hoped PHP would handle such things 'the right way', whatever
> that might be. ;-)
> Now I wonder what happens if my code happens to run on some *nix OS.
> Ideally my PHP code is OS agnostic.

PHP is; it's the OS you're running on that's playing up with how it's
encoding the file names. That's beyond PHP's control and PHP is bending
to the will of whatever encoding your source file is in, mostly by
ignoring it.

--
"'I'm not sleeping with a jr. high schooler! I have a life-sized doll
that looks like one.' Uh huh. That sounds SO much less pathetic."
-- Piro's Conscience www.megatokyo.com

On 09/10/13 17:06, Erwin Moller wrote:
>> On 09/10/13 12:16, Erwin Moller wrote:
>>> On 10/9/2013 12:58 PM, Thomas Mlynarczyk wrote:
>>>> […]
>>>> So I suspect it's
>>>> an encoding issue: the accented character "è" is stored as \xE8 in the
>>>> file system, but if your script is UTF8, then your $path will contain
>>>> \xC3\xA8 instead. With an "explicit" $path = __DIR__ . "\Eug\xE8ne.txt"
>>>> it works when the script is UTF8.
>>> THAT helped!
>>>
>>> I added a replace:
>>> $path = str_replace("è","\xE8",$path);

To avoid workarounds, use the same charset as the file system when you
write your scripts; a mixed charset setup will usually cause issues as
soon as you forget to convert.

>>> and now it IS readable from PHP.
>>>
>>> Now I wonder if I should make a whole list of such replaces....
>>> Sounds horrid, doesn't it?
> I don't have a good feeling about my "fix".
> It worked, but I don't know exactly what is going on.
>
> I actually hoped PHP would handle such things 'the right way', whatever
> that might be. ;-)

No, there is no magic in PHP that would change what you have hardcoded
in the script into something else just because you are using a file
system that doesn't use the same character set as the one you wrote the
script in.

> Now I wonder what happens if my code happens to run on some *nix OS.
> Ideally my PHP code is OS agnostic.

That depends on which charset the file system uses: if it uses UTF-8,
then there is no issue; if it uses Big5 or something else, then you have
an issue again.
You will most likely also end up with issues with the file paths, as
other operating systems use / instead of \ (which is used as an escape
character).
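
On the separator point, a small sketch: DIRECTORY_SEPARATOR is PHP's built-in constant for the platform's separator, and PHP's file functions on Windows also accept "/". The helper name joinPath is hypothetical.

```php
<?php
// Hypothetical helper: join path segments with the platform's separator,
// so no backslash has to be typed into a string literal.
function joinPath(string ...$segments): string
{
    return implode(DIRECTORY_SEPARATOR, $segments);
}

// The file name uses a hex escape (Windows-1252 "è"), as suggested earlier
// in the thread; whether that is the right encoding depends on the system.
$path = joinPath(__DIR__, "Eug\xE8ne.txt");
var_dump(file_exists($path));
```

This only addresses the separator; the encoding of the name itself is a separate question.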

There is no “ANSI encoding”. Usually “ANSI encoding” means Windows-1252.
[0] It would be either coincidence or strange if this worked, because FAT32
uses the “OEM character set”, i. e. one of the various IBM code pages, 437
for English, and NTFS uses UTF-16BE [1]. The letter “è” has Windows-1252
code 0xE6, IBM437/IBM850 code 0x8A, and Unicode code point U+00E8 [2]
(encoded in UTF-16 as 0xE8 [3]). It follows that you cannot mean
Windows-1252 by “ANSI”.

> So I suspect it's an encoding issue: the accented character "è" is stored
> as \xE8 in the file system,

Yes, it is, but only with NTFS.

> but if your script is UTF8, then your $path will contain \xC3\xA8 instead.

Which cannot work with NTFS.

> With an "explicit" $path = __DIR__ . "\Eug\xE8ne.txt" it works when the
> script is UTF8.

But only with NTFS and compatible filesystems.

PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

> Thomas Mlynarczyk wrote:
>> […]
>> Works on my Windows XP, PHP 5.4.8, *if* the PHP file is stored in ANSI
>> (="Windows") encoding.
>
> […] The letter “è” has Windows-1252 code 0xE6, IBM437/IBM850 code 0x8A,
> and Unicode code point U+00E8 [2] (encoded in UTF-16 as 0xE8 [3]). It
> follows that you cannot mean Windows-1252 by “ANSI”.

The letter "è" is encoded in CP-1252 as /0xE8/[1]. In UTF-16 it is
encoded by *two* bytes: 0x00 0xE8 (or vice versa, depending on the
endianness).
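
For illustration, the two byte orders can be produced with pack(), which is part of core PHP: format 'v' packs an unsigned 16-bit value little-endian and 'n' packs it big-endian. This only covers code points that fit in a single UTF-16 code unit. The variable names are illustrative.

```php
<?php
$codePoint = 0x00E8; // U+00E8, "è"

$utf16le = pack('v', $codePoint); // little-endian: E8 00
$utf16be = pack('n', $codePoint); // big-endian:    00 E8

echo bin2hex($utf16le), "\n"; // e800
echo bin2hex($utf16be), "\n"; // 00e8
```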

I have created a file "tèst" on a German Windows XP on NTFS, and started
a PHP shell:

>>> $fs = glob('t?st')
>>> $fs[0]
't\350st'

Apparently, the file name is *read* by PHP as if it was encoded in
CP-1252. Either the description on MSDN[2] is wrong, or PHP uses a
Windows API that converts the filename's encoding. I presume the
latter, being aware (but not (yet) convinced) that there might be
another reason for this behavior.

> Thomas 'PointedEars' Lahn wrote:
>> […] The letter “è” has Windows-1252 code 0xE6, IBM437/IBM850 code 0x8A,
>> and Unicode code point U+00E8 [2] (encoded in UTF-16 as 0xE8 [3]). It
>> follows that you cannot mean Windows-1252 by “ANSI”.
>
> The letter "è" is encoded in CP-1252 as /0xE8/[1].

You are correct (to some extent); I must have slipped into the wrong row.

My point is, however, that _Windows_-1252 is very likely _not_ what is
expected by the filesystem. By “coincidence”, the code *points* for
Windows-1252 and Unicode are the same from U+00A0 to U+00FF, and the used
character is within that range. This code will break for characters whose
Unicode code point is above U+007F but outside this range. In general, it
will be unreliable because Windows-1252 does not have the interleaved zero-
octet that UTF-16 has (NTFS), and Windows-1252 and IBM437 & friends (FAT32)
are incompatible above 0x7F.

> In UTF-16 it is encoded by *two* bytes:

Two octets, to be precise. I was aware of that (as you could see further
below) but I oversimplified here.

> I have created a file "tèst" on a German Windows XP on NTFS, and started
> a PHP shell:
>
> >>> $fs = glob('t?st')
> >>> $fs[0]
> 't\350st'
>
> Apparently, the file name is *read* by PHP as if it was encoded in
> CP-1252.

Interesting. 0350 would correspond to 232 and 0xE8, indeed.

> Either the description on MSDN[2] is wrong,

Unlikely.

> or PHP uses a Windows API that converts the filename's encoding.

It would suffice if it discarded all zero-octets in *this* case as the code
would be {74 00} {E8 00} {73 00} {74 00}.

> I presume the latter, being aware (but not (yet) convinced) that there
> might be another reason for this behavior.

It would be interesting to see how this works with NTFS with characters
outside the specified range whose Unicode code point is above U+007F. For
example, U+0100 (“Ā”; LATIN CAPITAL LETTER A WITH MACRON) would be encoded
in one UTF-16 code unit, 0100, which would be encoded in UTF-16LE as 00 10.
Just stripping the zero-octets would result in <LF> (whose code point is
0x10 which is 020). Just reading the octet with the lower address would
result in 0x00 which terminates a C string. If the result is _not_
something equivalent to 't\020st' or 't', something else is happening.
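
The two hypotheses can be simulated in plain PHP without touching the filesystem. This is only a sketch of the reasoning, not of what PHP or Windows actually does, and the variable names are made up. Note that pack('v', 0x0100) yields the octets 00 01, so the octal escape that comes out of the first hypothesis is \001.

```php
<?php
// UTF-16LE octets for "tĀst": U+0074, U+0100, U+0073, U+0074.
$utf16le = pack('v*', 0x0074, 0x0100, 0x0073, 0x0074);
echo bin2hex($utf16le), "\n"; // 7400000173007400

// Hypothesis 1: every zero octet is simply dropped.
$stripped = str_replace("\x00", '', $utf16le);
echo bin2hex($stripped), "\n"; // 74017374

// Hypothesis 2: only the octet at the lower address of each code unit is
// kept, and the result is cut at the first NUL, as a C string would be.
$lowOctets = '';
foreach (str_split($utf16le, 2) as $unit) {
    $lowOctets .= $unit[0];
}
$asCString = strstr($lowOctets, "\x00", true);
if ($asCString === false) {
    $asCString = $lowOctets; // no NUL present, nothing is cut off
}
echo bin2hex($asCString), "\n"; // 74
```

If a real directory listing shows neither of these results for such a file, some other conversion is taking place.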

PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300dec7(at)news(dot)demon(dot)co(dot)uk>

> On 10/9/2013 1:18 PM, The Natural Philosopher wrote:
>> On 09/10/13 12:16, Erwin Moller wrote:
>>> […]
>>> I changed my PHP-files encoding to UTF-8, but the problem still
>>> occurred.
>>> […]
>>> I added a replace:
>>> $path = str_replace("è","\xE8",$path);
>>>
>>> and now it IS readable from PHP.
> […]
> I don't have a good feeling about my "fix".

PHP has no built-in support for character encodings (but it has extensions
for that). Your strings are read octet-wise from lowest to highest address
as they are, that is, as the *editor* encoded the characters between the
string delimiters. If you write “"è"” in a UTF-8-encoded source file, the
character between the delimiters will be encoded C3 A8. If you write the
*same* character in a Windows-1252-encoded source file, it will be encoded
E8.

If your filesystem is FAT32, it will probably expect 8A if its locale is
English (IBM437) or Western European (IBM850), for example. If your
filesystem is NTFS, it will expect E8 00 (UTF-16_LE_; my mistake); if you
omit the zero octet it *might* work, but it does not work reliably.

> Now I wonder what happens if my code happens to run on some *nix OS.

The operating system is not the issue; the filesystem is. However, usually
Linux will run on ext2 to ext4, where AFAIK any character encoding can be
used. So there is a good chance that your code will break there.

> Ideally my PHP code is OS agnostic.

In that case you will probably have to detect the filesystem, and its
encoding, and use the encoding that is expected by the filesystem. Or
prevent such filenames from occurring in the first place.
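
As a sketch of the second option (preventing such filenames), a check like the following could be run before a name is ever written to disk. The function name isPortableFilename and the exact set of rejected characters are assumptions chosen for illustration.

```php
<?php
// Hypothetical check: accept only printable ASCII, minus the characters
// Windows forbids in file names and the two path separators.
function isPortableFilename(string $name): bool
{
    // \z anchors at the very end, so a trailing newline is rejected too.
    return preg_match('/^[\x20-\x7E]+\z/', $name) === 1
        && strpbrk($name, '\\/:*?"<>|') === false;
}

var_dump(isPortableFilename('Eugene.txt'));        // plain ASCII
var_dump(isPortableFilename("Eug\xC3\xA8ne.txt")); // UTF-8 "è"
var_dump(isPortableFilename("Eug\xE8ne.txt"));     // Windows-1252 "è"
```

Names that pass this check have the same bytes in ASCII-compatible encodings such as UTF-8, Windows-1252 and the IBM code pages, so the encoding question does not arise for them.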

I suggest encoding PHP source files as UTF-8 _without BOM_. If you write
non-ASCII characters, you know what the encoding is, and you have a greater
character set so that fewer characters need to be escaped.

PointedEars

> It would be interesting to see how this works with NTFS with characters
> outside the specified range whose Unicode code point is above U+007F. For
> example, U+0100 (“Ā”; LATIN CAPITAL LETTER A WITH MACRON) would be encoded
> in one UTF-16 code unit, 0100, which would be encoded in UTF-16LE as 00 10.
> Just stripping the zero-octets would result in <LF> (whose code point is
> 0x10 which is 020). Just reading the octet with the lower address would
> result in 0x00 which terminates a C string. If the result is _not_
> something equivalent to 't\020st' or 't', something else is happening.

U+0010 denotes <DLE>, <LF> is U+000A[1]. Anyway, I created a file
"tĀst" and did:

>>> glob('*')
Array
(
)

Apparently, something else is happening.

FWIW, I tried the following, too:

>>> touch("test")
true

>>> touch("t\x00\x10st")
Warning: touch() expects parameter 1 to be a valid path, string given
in ...