Introduction

A common programming task (the only task?) is moving and manipulating data, sometimes between different languages. This can be a chore since language creators don't implement any standard serializing format for nested data such as arrays. In fact, many languages don't support any native serializing at all.

Scripting is all the rage these days. Perhaps it's just the particular field I work in, but I seem to run into more and more scripting languages. Yes, there are the ubiquitous ones such as JavaScript, ASP, and PHP, but there are countless other scripting languages for industrial robotics control, camera control, phone switching, and so on.

At some point, you want to get data into or out of these languages, and the choices can be vague or less than optimal. XML is standard, but XML data is bulky and hard to parse. An XML parser is not something I really want to add to ACME's Burger Flipper Scripting Language just so you can get burger flipping statistics into your Total Meal Performance package. Ultra-simple string formats are not extensible and require you to constantly upgrade everything to handle new parameters. Other custom formats I see used are often inefficient, limited in usefulness, or buggy.

The goal of this article, then, is to provide an easy-to-implement, minimal format for simple data exchange, especially between new and obscure scripting languages.

Instead of calling it ETIMFFSDEEBNAOSL, which sounds German, I'm going to refer to this format formally as Simple Cross-language Serializing or SCS, so we have a shorthand, and later, in design reviews, we can use cool phrases like, 'I'll just SCS that data to you' and make nearby management people feel dumb.

Goals

Unfortunately, the section above already one-upped this one by not only mentioning the goal, but making a joke about it. So, instead of restating it, I'll just go into more detail.

Above all, simplicity and flexibility will be stressed. There are ways in which the protocol could be expanded to reduce the size of the serialized data, such as using a binary representation or compression. But the trade-off would be a more complex encoder / decoder, and thus longer implementation and debugging time and a higher risk of compatibility flaws and lazy implementations.

Since this data format is chiefly for scripting languages that are new and/or obscure, we want something that is quickly and easily implemented.

So, steps to obtaining our goal will be to...

Define a compact portable data format for nested data.

Outline an easy to implement encoder / decoder.

Provide example encoder / decoder in C++, PHP, and JavaScript.

Use

To make this article a little easier to read and understand, I'm going to cover a few example uses before I go further, since I don't have a cool picture to put on top of the article. Here is a simple example of how SCS will be used for, say, passing data from PHP to JavaScript...

Why another format?

If you Google 'pass array PHP JavaScript', substituting your favorite languages, you'll find various implementations including many lazy implementations of standards like XML. The goal here is to define the simplest possible practical implementation. In fact, ideally, a lazier implementation will not even be possible.

Note on XML. It seems I've run into many people who claim XML is the end-all format. I've never seen such strong claims of one-size-fits-all before, despite the many thousands of formats that have gone before. But it seems many are determined to make XML the only format for data exchange. Someone even once suggested to me that I base-64 encode streaming video data and wrap it in XML to make it 'standard'. I don't share, and am not going to consider, such a narrow view on any format or language. There is a clear trade-off between a format's feature set and the complexity of implementation. I will offer what I hope is an objective comparison with XML for the challenges within the scope of this article.

What about other standards? There is a huge list of similar protocols; check out XML Alternatives for a start. After many hours of poring over many formats, I was unable to find a good fit for the objectives explained here. If I've missed an identical solution, you're welcome to rub it in. But I doubt you will be able to say it was obvious. At some point, one just has to bite the bullet and get things done; at least I'm sharing in an obvious place...

Custom solutions. Another thing I'm going to look at is a common runaway encoding technique that many people use in custom implementations. It seems that their implementation started as a flat representation, such as a=1,b=2,c=3, and then recursion was added to handle nested data. Though the simplicity is hard to beat, the data expansion from nested encoding can be pretty significant. It also makes the nested values hard for a human to read. Because of these two factors, I will stray from simplicity to solve this one problem.
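To make the runaway effect concrete, here is a minimal JavaScript sketch of such a recursive encoder. It is only an illustration: the function name runawayEncode is made up, and the '&', '=' and '?' delimiters are assumptions modelled on the long sample string quoted near the end of this text, not a specification of anyone's actual implementation.

```javascript
// Hypothetical "runaway" serializer: every nested block is serialized to a
// string and then URL-encoded again as the value of its parent key.
// Delimiters ('&', '=', '?') are assumed from the sample string in this text.
function runawayEncode(obj) {
  return Object.keys(obj).map(function (key) {
    var v = obj[key];
    var encoded = (v !== null && typeof v === 'object')
      // The whole child string is escaped once more at every level it sits under.
      ? '?' + encodeURIComponent(runawayEncode(v))
      : encodeURIComponent(String(v));
    return encodeURIComponent(key) + '=' + encoded;
  }).join('&');
}

var nested = { Department: { Accounting: { John: { Married: 'Yes' } } } };
console.log(runawayEncode(nested));
// prints "Department=?Accounting%3D%3FJohn%253D%253FMarried%25253DYes"
```

Each level of nesting adds another "25" to every escape sequence below it, so a single '=' that costs one byte at the top level costs seven bytes three levels down.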

All data will be encoded as strings using the URL encoding scheme as described in RFC 1738.

Pretty simple? I chose RFC 1738 encoding because many scripting languages have built-in functions for it, and if not, it's easy to implement (see the C++ ScsSerialize.h header for an example). Additionally, much of the same logic behind using this encoding in URLs applies here. This format also gives us the advantage of being able to read most data rather easily. Here is an example of an encoded array:
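For the escaping step on its own, a JavaScript sketch might look like the following. The names scsEscape and scsUnescape are hypothetical. The sketch assumes that only letters, digits and the characters . - _ are left unescaped, which is stricter than the built-in encodeURIComponent and is chosen so that the output resembles the percent-encoded sample data quoted elsewhere in this text.

```javascript
// Hypothetical helpers; names and the exact "safe" character set are assumptions.
// encodeURIComponent leaves ! ' ( ) * ~ untouched, so those are escaped by hand.
function scsEscape(text) {
  return encodeURIComponent(String(text)).replace(/[!'()*~]/g, function (ch) {
    return '%' + ch.charCodeAt(0).toString(16).toUpperCase();
  });
}

// Decoding needs no special handling: every %XX sequence is a single byte value.
function scsUnescape(text) {
  return decodeURIComponent(text);
}

console.log(scsEscape('[,=]'));   // prints "%5B%2C%3D%5D"
console.log(scsEscape('.-_'));    // prints ".-_"
```

Because any character that could act as a delimiter is escaped, a decoder can split on raw delimiters first and unescape afterwards.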

Parsing

As mentioned, the big pro here is that this can be easily parsed. Here is an example PHP implementation. You'll notice that the encoding function is similar in complexity to the above runaway example; however, the decoding function is more complex. This extra complexity avoids the re-encoding of data. A few others and I attempted a simpler decoder and struck out. I'd be very interested if anyone is able to do better.
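As a rough JavaScript illustration of the same idea (escape only the leaves, never re-encode a nested block), the sketch below uses an assumed grammar: key=value pairs separated by commas, with nested arrays written as key{...}. That grammar, and the names scsEscape, scsEncode and scsDecode, are assumptions made for this sketch; the exact SCS syntax is not spelled out in this text, so treat it as one possible shape rather than the reference implementation.

```javascript
// Assumed grammar (not confirmed by the article text):
//   list := item (',' item)*      item := key '=' value | key '{' list '}'
// Keys and values are percent-encoded, so = , { } never occur inside them.
function scsEscape(text) {
  return encodeURIComponent(String(text)).replace(/[!'()*~]/g, function (ch) {
    return '%' + ch.charCodeAt(0).toString(16).toUpperCase();
  });
}

function scsEncode(obj) {
  return Object.keys(obj).map(function (key) {
    var v = obj[key];
    return (v !== null && typeof v === 'object')
      ? scsEscape(key) + '{' + scsEncode(v) + '}'   // nested block is NOT escaped again
      : scsEscape(key) + '=' + scsEscape(v);
  }).join(',');
}

function scsDecode(text) {
  var pos = 0;

  // Read up to the next raw delimiter and unescape the token.
  function readToken() {
    var start = pos;
    while (pos < text.length && '=,{}'.indexOf(text.charAt(pos)) === -1) pos++;
    return decodeURIComponent(text.slice(start, pos));
  }

  // Read items until the end of input or the '}' that closes this level.
  function readList() {
    var out = {};
    while (pos < text.length && text.charAt(pos) !== '}') {
      var key = readToken();
      var delim = text.charAt(pos);
      if (delim === '=') {
        pos++;
        out[key] = readToken();
      } else if (delim === '{') {
        pos++;
        out[key] = readList();
        if (text.charAt(pos) !== '}') throw new Error('missing } at ' + pos);
        pos++;
      } else {
        throw new Error('expected = or { at ' + pos);
      }
      if (text.charAt(pos) === ',') pos++;   // also tolerates a trailing comma
    }
    return out;
  }

  var result = readList();
  if (pos !== text.length) throw new Error('unbalanced } at ' + pos);
  return result;
}

var encoded = scsEncode({ a: '1', b: { c: 'x,y', d: { e: '2' } } });
console.log(encoded);   // prints "a=1,b{c=x%2Cy,d{e=2}}"
```

The decoder walks the string once with a single position counter and uses charAt rather than bracket indexing. All decoded values come back as strings; converting them to numbers is left to the caller.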

I know it's kinda long, but I'll go ahead and post the C++ version just so this article is as complete as possible. The mad cut-and-pasters will appreciate it, I'm sure. The actual encode / decode functions are about the same, but I added functions for converting between strings and integers, doubles, and so on, just to make things easy to use. C++ does not have built-in support for this kind of container.

Comparison

In terms of simplicity, it's hard to get much simpler. The only simpler versions I have seen are of the runaway type or are language specific, such as manually outputting a JavaScript array. In that case, our work is lost if we later want to switch to another target language.

One way in which XML excels, as seen below, is human readability. Although it is possible to decipher the SCS string, it is not as clear unless you add newline characters. It would have been possible to make the decoder whitespace agnostic, but that would have required tokenizing the data. This would have just been something someone could leave out of the implementation, and thus we would have strayed from our goals. Also, the introduction of whitespace could potentially cause problems when pasting data as strings into source files. Avoiding those problems takes priority here because it is closer to our goal of cross-language communication. Though we attempt to make it somewhat readable, take into account that human readability is not a priority for SCS when considering your options.

In terms of bandwidth, say for an AJAX project, consider the following array...

This typical XML output weighs in at 517 bytes. I struggled a little with whether or not to remove the header and formatting characters. I decided to leave them since this really is a lot of the argument for using XML, to be 'standard'. This is actually cheating a little since I used the same URL encoding instead of the more common base-64. But, XML allows me this.

The typical runaway implementation. It should be noted that this example particularly amplifies the redundant encoding issue. There are other instances where it would be competitive though never significantly better. Notice the severe mangling due to the recursive encoding.

Flexibility

We are not going to attempt to encode variable types or other properties such as minimum and maximum values at the parser level. But these things can still be done in the framework of the current protocol. For example, consider the following XML:

<variable name=x type=float min=-10 max=10>3.14</variable>

We can represent this type of information by just adding a sub array. In the case of XML, the content or value field is implicit between the tags. We will need to add an explicit 'value' field. And the result is actually shorter than the minimal XML.
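Once decoded, such a sub array is just a small set of string fields, and the type and range checks can live in application code rather than in the parser. The JavaScript sketch below assumes the sub array has already been decoded into a plain object with the fields type, min, max and value from the XML example. The function name readTyped and the type names 'int' and 'string' are hypothetical.

```javascript
// Hypothetical application-level helper: interprets a decoded sub array.
// The parser itself stays type-agnostic; every field arrives as a string.
function readTyped(field) {
  var type = field.type || 'string';
  if (type === 'string') return field.value;

  var n = (type === 'int') ? parseInt(field.value, 10) : parseFloat(field.value);
  if (isNaN(n)) throw new TypeError('not a number: ' + field.value);
  if (field.min !== undefined && n < Number(field.min)) throw new RangeError(n + ' is below min');
  if (field.max !== undefined && n > Number(field.max)) throw new RangeError(n + ' is above max');
  return n;
}

// Decoded form of the variable x from the XML example above.
var x = { type: 'float', min: '-10', max: '10', value: '3.14' };
console.log(readTyped(x));   // prints 3.14
```

A receiver that does not care about the extra fields can simply read value and ignore the rest, which is what keeps the format extensible.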

You'll find most data structures can be represented well enough in this protocol. It's usually just a matter of efficiency, especially when dealing with high-bandwidth, binary data like live video or audio. Then again, what format covers everything well?

Conclusion

I think that supplies a good idea of what was being attempted, and what was achieved. A few notes...

Notice that the supplied functions allow you to easily serialize parts of the array as well as the whole array. Also, you can decode one array into a larger array. This is a subtle but powerful construct.

The property bag concept achieved in the C++ implementation is a powerful addition to the language. It can drastically cut development time when dealing with data. The nice thing about C++ is that you can describe exactly how you want operators to behave. I actually use a more advanced form of this class that allows serializing/deserializing into lots of formats like the Windows Registry, INI files, URL GET and POST variables, MIME formats, databases, etc. This can be an enormously powerful way to handle generic data. I know I didn't invent this, by the way; there are many examples out there...

I'd like to add more languages to this example. Perl, Python, and VB come to mind. If anyone wants to donate, please feel free.

Nice idea, but the code (shown) is not quite there - here are the bugs I had to fix to make it work correctly.
I didn't use the C++ but I assume it will have similar problems.
In the serializer you need to add a terminal "," to the string. The deserializer looks for the comma and backs up, and without that it always leaves out the last element. This affects both PHP and JavaScript.
In the JavaScript you have to replace every occurrence of 'x_param [e]' with 'x_param.charAt(e)', otherwise the deserializer doesn't work at all in IE - though it does work correctly in Opera, Mozilla, and Safari.
Less important, unless you could have mangled data - in the deserializer you should only attempt to fill the output array if the array index is valid (i.e. not 0). This is easily accomplished by adding a set of {} as shown for the PHP case:
....
else
{
$a [0] = rawurldecode ($a [0]);

For the C++ version, it would be better if you provided all the includes and the macros needed to compile the sources, because as it is, it can't be compiled (I tried under VC6 and VC8). Can you add those things to your sources? I have already tried to rewrite the macros and correct some mistakes, but that's not enough.

Not sure about the PHP counterpart (haven't checked), but the JavaScript decode part doesn't support multidimensional arrays. The following bit of code (obvious where it goes) should fix this:

var level = 1;
var from = scs_s + e - scs_s + 1;
while (level >= 1) {
    // get the next close
    var nextCloseIndex = x_params.substr(from).indexOf('}');
    if (nextCloseIndex != -1) {
        // if there is a next close, check if there is an open in between
        var enclosedNextOpenIndex = x_params.substr(from, nextCloseIndex).indexOf('{');
        if (enclosedNextOpenIndex == -1) {
            // if there is not an open in between, we go down one level and continue from the close
            level--;
            from += nextCloseIndex + 1;
        } else {
            // if there is an open in between, we go up one level and continue from the open
            level++;
            from += enclosedNextOpenIndex + 1;
        }
    } else {
        // There should be at least one close after an open
        alert("Invalid array");
        return 0;
    }
}

I did look at JSON. I won't argue that if parsers for more standard formats are already available in the languages you're working with, it may be a better choice. But I still think there's room in the world for simpler solutions.

JSON goes a step further than the most simplistic solution by specifying data types in its lexicon and/or syntax. For instance, in JSON, to properly support numbers, I also have to implement strings at the parser level even if I don't use them, or I'm non-standard, i.e. not really using JSON anymore.

One thing that spurred this on was a developer who wrote an XML 'parser' that worked by splitting the data on '' again on 'var=', etc... When the data was updated to hold new params for another part of the system, that 'parser' broke, creating unexpected extra work and changes. I wanted something so simple it would be hard to screw up.

This line causes the article to scroll horizontally, making it difficult to read:
Department=?Accounting%3D%3FJohn%253D%253FMarried%25253DYes%252526DOB%25253D1-14-78%252526Pets%25253D%25253FFish%2525253D8%25252526Dog%2525253D1%25252526Cat%2525253D2%252526ValidCharacters%25253D.-_%252526InvalidCharacters%25253D%2525255B%2525252C%2525253D%2525255D%2526Mary%253D%253FMarried%25253DNo%252526DOB%25253D7-2-82%252526Pets%25253D%25253FDog%2525253D1%252526InvalidCharacters%25253D%25252521%25252540%25252523%25252524%25252525%2525255E%25252526%2525252A%25252528%25252529