A couple of points to do with unicode in a few areas - some minor, some
not so minor IMHO.
1. "The one parameter that is universally supported (to the extent that
is supported by the underlying JSON modules) is |utf8|. When this
parameter is enabled all resulting JSON will be marked as unicode, and
all unicode strings in the input data structure will be preserved as such"
I think this statement is misleading as there is no such thing as
"marking" a string as unicode in Perl, and unicode in Perl is not utf-8.
2. Following on from above "The actual output will vary"
Most JSON modules that can be used by JSON::Any are buggy, especially
wrt unicode. Sometimes they do use the internal utf-8 flag as an
indicator, sometimes they just fail to set it. I would be careful about
claiming that unicode works if you set the mysterious |utf8| flag. For
example, unicode does not work at all with JSON::Syck via JSON::Any
unless $JSON::Syck::ImplicitUnicode = 1 is set.
Here is an example:
use strict;
use warnings;
use JSON::Any qw(Syck);
use Data::Dumper;
#$JSON::Syck::ImplicitUnicode = 1;
binmode (STDOUT, ":utf8");
my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
print Dumper([$str]);
my $js = JSON::Syck::Dump([$str]);
open OUT, ">uni.out";
binmode(OUT, ":utf8");
print OUT "$str\n";
close OUT;
open IN, "<json.out";
binmode (IN, ":utf8");
my $fd="";
$fd .= $_ while (<IN>);
print "string from file: " . Dumper($fd),"\n";
my $obj = JSON::Any->decode($fd);
print Dumper($obj);
produces:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
string from file: $VAR1 =
"[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"]
";
$VAR1 = [
'Ã¥ÂÂ°Ã¦ÂÂÃ£ÂÂ®Ã¦ÂµÂÃ£ÂÂJAPANESE'
];
but if you uncomment the ImplicitUnicode line it works correctly:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
string from file: $VAR1 =
"[\"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE\"]
";
$VAR1 = [
"\x{e5}\x{8f}\x{b0}\x{e6}\x{89}\x{80}\x{e3}\x{81}\x{ae}\x{e6}\x{b5}\x{81}\x{e3}\x{81}\x{97}JAPANESE"
];
However, this may just be a result of the fact that JSON::Syck does not
have a constructor ("create_object" key in your %conf) and hence there is
no way to set the "utf8" flag, as the line "if ( my $creator =
$conf{$key}->{create_object} ) " fails. BTW, you don't mention in the
pod that you cannot do my $f = JSON::Any->new() when using JSON::Syck.
3. I don't really understand this "utf8" flag. What has it got to do
with Unicode? It is an encoding and therefore just a way of encoding
unicode codepoints, and should only get involved when importing or
exporting data into or out of Perl. The following code with JSON::XS
works fine without any "utf8" flags because Perl understands unicode and
so does JSON::XS:
use strict;
use warnings;
use JSON::XS;
use Data::Dumper;
binmode (STDOUT, ":utf8");
my $str="\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";
print Dumper([$str]);
open OUT, ">uni.out";
binmode(OUT, ":utf8");
print OUT "$str\n";
close OUT;
my $js = JSON::XS->new->encode([$str]);
print "json encoded str is $js\n" . Dumper($js);
open OUT, ">json.out";
binmode(OUT, ":utf8");
print OUT "$js\n";
close OUT;
open IN, "<json.out";
binmode (IN, ":utf8");
my $fd="";
$fd .= $_ while (<IN>);
print "string from file: " . Dumper($fd),"\n";
my $obj = JSON::XS->new->decode($fd);
print Dumper($obj);
producing:
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
json encoded str is ["台所の流しJAPANESE"]
$VAR1 = "[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"]";
string from file: $VAR1 =
"[\"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE\"]
";
$VAR1 = [
"\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE"
];
JSON::XS was given a unicode string and gave me back a JSON encoded
unicode string. This was only encoded in utf8 when it was written to a
file. On reading the file back, Perl is told the file is utf8 encoded
and hence translates the utf8 into unicode characters which we pass
through JSON::XS to get our object back containing unicode characters.
4. For some reason JSON::Any appears to use JSON::XS's to_json method, but
that converts any true unicode characters in the input string to a binary
string of utf8 octets on output. As a result, when the |utf8| flag is
set JSON::Any has to call decode to get back a unicode string when in
fact simply using JSON::XS->new->encode would have done the right thing
from the start.
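
To make the distinction in points 3 and 4 concrete, here is a minimal sketch. It uses the core JSON::PP module as a stand-in, on the assumption that its utf8 option behaves like the JSON::XS one; it is not taken from JSON::Any itself. Without utf8 the encoder hands back a character string; with utf8 it hands back UTF-8 octets.

```perl
use strict;
use warnings;
use JSON::PP;   # core module; assumed here to mirror JSON::XS's utf8 option

# A character string: five wide characters followed by eight ASCII ones.
my $str = "\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}JAPANESE";

# Without utf8: encode() returns a character string (code points intact).
my $chars = JSON::PP->new->encode([$str]);

# With utf8: encode() returns UTF-8 octets, suitable for a raw filehandle.
my $octets = JSON::PP->new->utf8->encode([$str]);

# The octet form is longer because each wide character takes three bytes.
printf "characters: %d, octets: %d\n", length($chars), length($octets);

# Each form round-trips through a decoder configured the same way.
my $from_chars  = JSON::PP->new->decode($chars);
my $from_octets = JSON::PP->new->utf8->decode($octets);
print "round trip ok\n"
    if $from_chars->[0] eq $str && $from_octets->[0] eq $str;
```

Writing the character form through a ":utf8" layer, or the octet form through a raw handle, should give the same bytes on disk; mixing the two is what produces double encoding.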
I appreciate JSON::Any must be a real pain to keep consistent when all
the JSON modules are so different but I'd at least change the pod to
warn about unicode inconsistencies instead of suggesting it just works
across the board.
You may be asking why I don't just use JSON::XS directly and that is
because I'm actually using POE::Filter::JSON and that uses JSON::Any.
Martin
--
Martin J. Evans
Wetherby, UK

The utf8 flag passed to JSON::XS was me being a doofus; I got it the opposite way round.
This is now fixed in trunk and Chris should release it shortly.
As for Syck - IMHO it should die for now (like it does for JSON::DWIW), and later add proper
support by localizing $JSON::Syck::ImplicitUnicode in encode/decode.
On Thu Oct 11 08:04:03 2007, MJEVANS wrote:

> I think this statement is misleading as there is no such thing as
> "marking" a string as unicode in Perl, and unicode in Perl is not utf-8.

My english-fu is weak. Perhaps you can be tempted into improving the docs? I was trying to
convey that with utf8 => 1 passed to JSON::Any all data is going to be in wide chars, that is,
utf8::is_utf8 is true (I used the word "marking" because the utf8 flag tells perl to decode
the utf8 octets and give back wide chars instead of 0-255).
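
For illustration only, a small sketch of the octets versus wide chars distinction described above, using the core Encode module. The sample bytes are just an example, and utf8::is_utf8 reports an internal flag, so it is shown here only to make the description concrete.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xe5\x8f\xb0";             # three UTF-8 bytes for U+53F0
my $chars  = decode('UTF-8', $octets);   # one wide character

# The octet string has three elements, each in the range 0-255.
printf "octets: length %d, is_utf8 %d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;

# The decoded string has a single element holding the code point.
printf "chars:  length %d, is_utf8 %d, code point U+%04X\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0, ord($chars);
```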

> 2. Following on from above "The actual output will vary"

Commented on above

> Here is an example:

Just FYI (a bit off topic), Devel::StringInfo was written to ease writing such demonstration
code.
Anyway, do you think maybe you could translate this into a test, specifically for JSON-Syck,
similar to the way the JSON-XS test in JSON::Any's tests exercises utf8?
t/10-unicode.t currently skips Syck because it's pretty much broken.

> 3. I don't really understand this "utf8" flag. What has it got to do
> with Unicode? it is an encoding and therefore just a way of encoding
> unicode codepoints and should only get involved when importing or
> exporting data in to or out of Perl. The following code with JSON::XS
> works fine without any "utf8" flags because Perl understands unicode
> and so does JSON::XS:

> I appreciate JSON::Any must be a real pain to keep consistent when all
> the JSON modules are so different but I'd at least change the pod to
> warn about unicode inconsistencies instead of suggesting it just works
> across the board.

Those are code bugs, not doc bugs =)
Unicode support is possible to get consistently, and should really be supported, otherwise
JSON::Any is pretty much useless for any scenarios involving unicode data.
Thanks for taking the time to make such a detailed report,
Regards,
Yuval

> The utf8 flag passed to JSON::XS was me being a doofus, i got it the
> opposite way round.
> This is now fixed in trunk and Chris should release it shortly.

Can you point me at the subversion repository, please, so I can try it out?

> As for Syck - IMHO it should die for now (like it does for
> JSON::DWIW), and later add proper
> support by localizing implicitunicode in encode/decode.
>
> On Thu Oct 11 08:04:03 2007, MJEVANS wrote:
>

> > I think this statement is misleading as there is no such things as
> > "marking" a string as unicode in Perl and unicode in Perl is not
> > utf-8.
>
> My english-fu is weak. Perhaps you can be tempted into improving the
> docs? I was trying to convey that with utf8 => 1 passed to JSON::Any
> all data is going to be in wide chars, that is utf8::is_utf8 is
> true (i used the word marking because of the utf8 flag telling perl
> to decode the utf8 octets and give back wide chars instead of 0-255).

I am happy to take a shot at improving the docs if you can point me at
the subversion repository, so I can get a copy of the latest to provide
patches against.

> > 2. Following on from above "The actual output will vary"

>
> Commented on above
>

> > Here is an example:

>
> Just FYI (a bit off topic), Devel::StringInfo was written to ease
> writing such demonstration
> code.

Just tried that - nice pointer - thanks.

> Anyway, do you think maybe you could translate this into a test,
> specifically for JSON-Syck,
> like the JSON-XS test in JSON::Any tests utf8?

Probably, given a pointer to the subversion repository.

> t/10-unicode.t currently skips Syck because it's pretty much broken.

It can be made to work if ImplicitUnicode is used. I used JSON::Syck for
ages (with a few problems) but just changed to JSON::XS.

> > 3. I don't really understand this "utf8" flag. What has it got to do
> > with Unicode? it is an encoding and therefore just a way of encoding
> > unicode codepoints and should only get involved when importing or
> > exporting data in to or out of Perl. The following code with
> > JSON::XS works fine without any "utf8" flags because Perl
> > understands unicode and so does JSON::XS:

> > I appreciate JSON::Any must be a real pain to keep consistent when
> > all the JSON modules are so different but I'd at least change the
> > pod to warn about unicode inconsistencies instead of suggesting it
> > just works across the board.

>
> Those are code bugs, not doc bugs =)
>
> Unicode support is possible to get consistently, and should really be
> supported, otherwise JSON::Any is pretty much useless for any
> scenarios involving unicode data.

> I am happy to take a shot at improving the docs if you can point me at
> the subversion repository I can get a copy of the latest to provide
> patches against.

Great!

> It can be made to work if ImplicitUnicode is used. I used JSON::Syck for
> ages (with a few problems) but just changed to JSON::XS.

I meant JSON::Any's Syck support is not on par with the others due to lack of OO. There are a
few other issues IIRC, and I didn't have time to redo it all to support this properly. I think
some sort of proxy object should be written, e.g. JSON::Any::SyckWrapper which will localize
the vars according to a config it was instantiated with before encode/decode. Maybe it can be
done with just closures, too. However $j->handler should still report that it's Syck somehow.
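
A rough sketch of what such a proxy might look like follows. All package names here are hypothetical, and My::GlobalBackend is only a stand-in for a module configured through a package global (as JSON::Syck is via $JSON::Syck::ImplicitUnicode); it does not call JSON::Syck itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a backend controlled by a package global.
package My::GlobalBackend;
our $ImplicitUnicode = 0;
sub Dump { return $ImplicitUnicode ? "unicode-mode" : "byte-mode" }

# Hypothetical wrapper: stores its config at construction time and
# localizes the backend's global only for the duration of each call.
package My::BackendWrapper;
sub new {
    my ($class, %conf) = @_;
    return bless { utf8 => $conf{utf8} ? 1 : 0 }, $class;
}
sub encode {
    my ($self, $data) = @_;
    local $My::GlobalBackend::ImplicitUnicode = $self->{utf8};
    return My::GlobalBackend::Dump($data);
}
sub handler { return 'My::GlobalBackend' }   # still names the real backend

package main;

my $wrapped = My::BackendWrapper->new(utf8 => 1);
print $wrapped->encode(['x']), "\n";              # global is set during the call
print $My::GlobalBackend::ImplicitUnicode, "\n";  # and restored afterwards
```

Because the global is set with local, two wrappers with different configs should not interfere with each other or with code that uses the backend directly.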

Thanks.
If I have anything I'll just supply patches - write access to the
repository is not required as I wasn't planning on becoming a full
JSON::Any team member - just a one-off contributor.
Martin
--
Martin J. Evans
Wetherby, UK

> ok, I'm more confused now.
> I've downloaded the trunk from subversion and built it. I can see in the
> diffs between r37 and r38 the sense of utf8 appears to have been changed:
>
> + local $conf->{utf8} = !$conf->{utf8}; # it means the opposite
> +
>
> If I run the script below with utf8=>1 it works as before but I thought
> you'd reversed the meaning of utf8 so it should work without setting utf8?

No, you need to set utf8 => 1 (maybe it should be renamed to unicode => 1). It only reverses it
for JSON::XS. JSON::PC and JSON take utf8 => 1.