$.fn.data and html5 data- attributes - broken implementation

Description

The type detection built into retrieval of values from html5 data- attributes via $.fn.data in its current implementation is horribly broken.

Support was provided for booleans true and false, null, and complex objects (arrays and object hashes). These are all handled by setting the attribute's value to the JSON-encoded version of the data. However, the same treatment was not provided for parsing strings.

data-name='"joe"' currently parses as a string with the value '"joe"', when it should in fact be parsed as string 'joe' - no quotes. Current implementation does not allow for passing a string with the value "true", "false", "null", nor a stringified version of a complex JSON object.

Invalid JSON values (ie: data-name="joe") can be made to fallback on string value rather than generating a JSON error. But strings in their proper JSON encoded form must be properly supported.

If support for booleans, null and complex objects is going to stay, all data- attributes should be parsed with parseJSON, not a hacked-together soup of sometimes-parsing as JSON, sometimes not. If there is truly a performance boost from prematurely detecting booleans/nulls/numerics rather than parsing proper JSON, then these shortcut detections should be built directly into the parseJSON function, not in a single place within $.fn.data.

@paul.irish: You have misunderstood the problem, read the bug report very carefully and pay attention to the number of quotes surrounding the various strings. Also see the link to the Reddit comment above that shows the problem with examples in more details.

@rwaldron: I think examples are just being given this way for the sake of brevity (although it's technically valid HTML with single quotes), of course if you use double quoted attribute values, then the HTML would be:

data-name="&quot;joe&quot;"

But that's not the issue as you know, it's that this needs to be parsed as JSON, giving:

joe

But instead it is not parsed at all and gives:

"joe"

Regarding your other examples, if you're going to use double quotes in your HTML, then you obviously have to escape your entities accordingly, eg. in PHP:

{{{{
htmlentities( json_encode( $data ), ENT_COMPAT, "utf-8" )
}}}}

will give you output that can be used inside double quoted HTML. There's no way around this - a "bummer" as you say but anything outputted from a server to be used in HTML needs processing anyway, this is simply a step that should *already*, *always* be taken, so it's not like there's really any extra hassle when doing these things programatically. It will just look ugly if you're writing the HTML by hand, reading it, etc. but this is simply how it is.

This is a very small issue with a very convenient abstraction. Stripping quotes can be easily done in post-processing on the client (not to mention, why put them in on the server in the first place?). The data is in no way mangled, it just needs a little more massaging.

@nate: Handling the parsing in that way would be very very slow. It would make it such that we have to do a full code eval (and try/catch) for every attribute - which we most certainly don't want to do. We only want to do it for attributes that look like they could contain JSON data.

If you care about accessing a guaranteed string value from a data attribute then you can just access it using the same way you've always been able to access it: .attr("data-some-attr"). The .data() API is well beyond simple string values and being able to get the results as actual JavaScript objects is quite important.

If we want to turn this bug into a discussion regarding performance of parseJSON and moving logic relating to JSON parsing into there - we can certainly do that. But right now this bug is lumping together a few issues (string parsing - which there is a very obvious solution to - and better than the double-quoting weirdness that was mentioned - and the performance of parseJSON, which is another point of discussion). If this is purely a code-organization/performance discussion (as it appears it is) then it is not a top priority for 1.4.4 which is primarily tackling regressions and bug fixes.

No, not only is this not a small issue, it completely breaks the usefulness that this function might otherwise have had. It is broken and unusable for many cases in it's current state.

why put them in on the server in the first place?

Because that is how a string is encoded in JSON. This function parses all JSON types except for strings - so any time a string is passed it will corrupt the data you have send - it will come out differently than it went in.

You can't encode everything *except* strings as JSON on the server because a string could end up being a valid boolean, integer, etc. Then you end up with a string being turned into something which is not a string. So yes, it does mangle data.

We only want to do it for attributes that look like they could contain JSON data.

Currently it has some way of detecting arrays and objects, I'm not entirely sure how it's doing this as I have not yet checked the code, but if it's doing something like checking the first character for { or [ as a quick way to test if it's likely to be JSON, why can't it simply do the same for "?

If this is purely a code-organization/performance discussion (as it appears it is) then it is not a top priority for 1.4.4 which is primarily tackling regressions and bug fixes.

This is NOT such a discussion - I honestly don't see how this ever made it into 1.4.3 in it's current state. I don't see why there are fixes for some data functionality in 1.4.4 when it's not even usable by so many people due to this bug. I'm sure @nate won't be using it any time soon while it remains in this state, and I know I sure won't, and we're not the only ones...

and better than the double-quoting weirdness that was mentioned

This comment makes me think that perhaps you haven't quite understood the problem here too, but I may be wrong. There's no "weirdness" happening, there is simply a broken function that isn't doing the one sensible thing that makes sense.

I've confirmed this broken-ness with many other people after seeing the number of responses by people who don't seem to fathom the extent of the problem ... it seems like once someone has read through the various cases thoroughly and thought about how they would use this function in a real-world scenario, they realise that it is, in fact, broken... I suggest anyone else commenting on this bug do some thorough thinking on the subject before adding your 2 cents.

They *are* properly supported when the attribute is JSON. JSON syntax as described at ​http://json.org says a JSON string should start with either '{' or '[' so the only thing I suppose we can debate is whether leading spaces should be recognized. The 5 characters "joe" are not valid JSON.

Single-quoted attribute values are valid HTML. That makes it possible to specify valid JSON in markup.

I want a call to .data("name") with an attribute of data-name="joe" to return the three-character string joe. No further magic should be applied for that case. As for the other magic values like null, true, and false, I would be perfectly happy if they just returned the string to avoid more special cases and we could save the magic exclusively for JSON. In most cases the code will know what it is expecting, and typeof can be used for the other cases.

They *are* properly supported when the attribute is JSON. JSON syntax as described at ​http://json.org says a JSON string should start with either '{' or '[' so the only thing I suppose we can debate is whether leading spaces should be recognized. The 5 characters "joe" are not valid JSON.

Have you ever looked at the JSON spec for more than 3 seconds? It clearly states that: A JSON string is denoted using ", and any "..." is a valid JSON string. In fact { and [ notate objects and arrays, NOT strings!

The 5 characters "joe" IS VALID JSON - FOR THE 500TH TIME!!! This is THE ONLY way to denote a string using JSON!

What on earth is going on here, how is it that the entire jQuery team don't even know how JSON works and why is 1.4.4 out without a fix for this? Your major feature doesn't even work!!!!!!!!!!!!!!!!

Sorry if I'm being a bit crazy here but everyone other than nate here doesn't even appear to understand this bug report, so can someone please just spare 30 minutes of their time to read it carefully and don't reply unless you actually know what you're talking about - THANK YOU!

@dmethvin: Sorry, I must apologise and I replied a bit hastily there as this bug has been driving me crazy - I just realised that you were referring to the standard JSON structures. So maybe not technically wrong but individual variables: booleans, strings, numbers and null are all still considered valid JSON strings... Every JSON encoding and decoding function I have come across treats them in this way, if you JSON encode a string 'joe' then it will become '"joe"' - even jQuery's .parseJSON will turn '"joe"' into 'joe' ... the reason this .data() function is problematic is that you can't encode your variable (of any given type) on a server, for example, and retain the variable along with the type when it's read through .data(), so any time you want to pass a string you simply can't and have to put it inside an array or object, which really shouldn't be the case.

Leo, yes I was referring to the standard JSON object or array. By the term "JSON string" I meant a string that contains the character representation of a JSON object or array. That is the only case where the implementation tries to parse the characters returned from a data-* attribute as JSON. The documentation does not say that all data-* attributes are parsed as JSON, because they are not.

The implementation does not treat any other attribute string as JSON because (IMO) it would be less useful to do it that way. To repeat, I want a call to .data("name") with an attribute of data-name="joe" to return the three-character string joe. I do not want everyone who writes markup for jQuery to be required to wrap an extra set of quotes around a simple string attribute.

I'm going to close this ticket because the implementation and documentation are in perfect agreement right now as far as I can tell, and they are working as intended. I'll leave the needsreview tag so the team can take a final look at it for 1.5 fine-tuning, but I will personally lobby hard against any change that requires doubly-quoting string attributes.

Leo, I'd also suggest that you review your tone in the original ticket and responses. It's hard to be objective about these suggestions given the way they've been presented.

@dmethvin: Sorry but I must reply again because you have just stated some things about the functionality which aren't quite true.

The implementation does not treat any other attribute string as JSON because (IMO) it would be less useful to do it that way.

It treats a string containing 'true' or 'false' as a boolean.

It treats a string containing 'null' as null.

It treats a string containing a number as an integer/float.

So maybe when your string is 'joe', yes, it will come through as 'joe'. But when your string is 'true' it will not come through as a string at all, it will be converted into a boolean value.

Due to *this* passing a string is absolutely useless because there is no guarantee that it will remain a string.

If you don't want people to have to put quotes around their strings then you need to also stop parsing the null, boolean and number types so that the strings are actually preserved.

Again, I apologise regarding my tone but I have tried to explain this so many, many people and it seems like I've got through to about 3 out of 100, and it's seriously making me doubt my own sanity. But still no one, yourself included, can explain to me how I am supposed to put a string into a data attribute in my HTML, and get the same string out when I call .data() ... because this function doesn't do this.

One final example (quoting you):

To repeat, I want a call to .data("name") with an attribute of data-name="joe" to return the three-character string joe.

That's fine.. it does this.

How about:

To repeat, I want a call to .data("name") with an attribute of data-name="true" to return the four-character string true.

It does not do this!

Falling back to parsing as a string when nothing else can be matched is fine, but there needs to be a way to explicitly specify that your variable is a string or all kinds of weirdness, inconsistencies and bugs are going to be introduced into so many pieces of code.

I agree that this issue can make this feature unusable in many cases. I would like to use it to pass a user's login from the server to Javascript, as data-username="..." and then access it as data('username'). However, I can't be sure that the username won't be "true" or "false" or "123", in which case the value will be returned not as string but as boolean or a number. I don't want to have to code special cases for such usernames.

I agree with everyone speaking against this change, and have raised basically every issue you have directly with the rest of the team, but there are key players that simply do not see eye-to-eye on this. Their responses are ‘you can always just use .attr if you don’t like it’, ‘it’s a cool feature that is more useful than dangerous, despite what you are saying’, and ‘well, we can’t remove it, because it’s there now’.

The number issue can come up in other fun ways. “5E2201” is a number. “0xED” is a number. “0702” is a number, obviously, but it’s interpreted as octal, so actually means 450. None of these rules should be surprising if you’ve worked with numbers in JS a lot, but many people haven’t.

Anyway, I am really not sure how much more there is worth talking about here. I will continue to push internally as I can to get this changed, and you are welcome to continue to vote on this ticket (even closed tickets that are voted highly will be considered), but any further arguing in comments is likely to fall on deaf ears.

@snover: Thanks for the extra info. I understand not wanting to remove or change this in any way that would break anything, but is there a real argument against the addition of detecting JSON strings? Everything else including the fallback to string anyway would still apply so nothing would be broken or lost. I would be against anything more drastic than this anyway, but this seems like the most sensible thing to do at this point to at least make the function useful, but also less hazardous as people will at least have the choice to use the safe notation after reading about it.

The most worrying thing about this is that lots of people have already started writing code using this functionality not realising the problems with it. There are already plenty of bits of code floating around out there just waiting to break when the right string comes along, so maybe the very least that could be done for now is adding a warning to the function page letting people know that they can't rely on their plain strings being preserved.

And while it doesn't implicitly state that strings are preserved, enough people have already assumed that this is the case and started using the function without stopping to think about these implications.