Highly Disappointed w/ the new XML API Database

I am astoundingly disappointed with the new XML API Registry.

I'm not going to pretend that I represent more than a tiny fraction of consumers of the OpenGL registry, but I think we could have been accommodated in the transition much better than this. A bit of background would be important, I think:

I have a hobby of writing bindings for C libraries to higher-level languages, usually Lua. I take pride in the quality and "intelligence" of my bindings; they are never 1:1 mappings to the C functions and arguments but are tailored to the capabilities of the language in question. A good portion of my work involves writing parsing code that goes over the declarations of the interface I'm binding in order to automatically generate a large portion of the bindings. Regardless of the target language, the parsing and code-generation bits are usually written in Lua or, increasingly, PowerShell.

Much of what goes into writing a good parser like that involves programming it to know when it can automatically generate a good binding, and when to alert the programmer that it requires more information, additional annotations, if you will. Most of OpenGL requires only some annotations on types in order to auto-generate good bindings to most of the functions...only a few key ones require manual intervention.

A while back, I started on a project to bind OpenGL to Lua, in two versions, a Legacy binding and a Modern binding. The OpenGL specfiles are...were...a godsend, enabling large portions of this to be done automatically in high quality.

The specfiles are rich in semantic information that's useless to a language like C but truly important to higher-level languages. As the readme file in the XML API SVN says, much of it wasn't perfect or easy to translate, but that's okay...it was important and useful anyway. Strong enumerations are a critical feature of generating good high-level bindings, and the specfiles supported that notion. In/Out annotations? By-value, by-reference, by-array annotations? All good. Even the [COMPSIZE(...)] annotations, which the readme file admits aren't machine-translatable...but they are a critically important annotation for projects like this nonetheless! All of them allow a parser to notify the programmer that additional information is required...and often, the information can be added in annotations in one place to supply what is needed to machine-generate code in multiple places!

I don't see why much of this information couldn't have been added to the new XML database as optional information, recommendations, ANNOTATIONS, if you will. Moreover, what information was retained was translated into such a horrible format that it's even harder to parse than it was before! The new XML format's declarations for commands are little more than marked-up C declarations, absent of any information not used in C...which is fine for C programmers, but then, C programmers don't need to write parsers to generate C headers from the XML API database, because OpenGL provides a Python script for that, AND the resulting C headers!

Perhaps, it will help if I compare-and-contrast, from a parser's point of view, the kind of information the old specfiles and the new XML database provide:

What can we see here?
* We know the name of the function is 'CallLists' and we don't have to scan for a "gl" or "wgl" or "glx" to automatically remove, that has no need to be there.
* We know the return-type is 'void', as it directly says what the return is, we don't need to write a partial C parser just to get that info.
* Parameter 'n' takes a SizeI, by-value, as an input (we'll get back to 'n' in a moment.)
** SizeI is a semantically-distinct type, representing a...well...a size, and project-specific human annotations relating to the data in the gl.tm file can tell us what kind of argument-verification code to generate for our target language - we can have specific code for verifying SizeI-type arguments.
* Parameter 'type' is a ListNameType, by-value, as an input. From parsing the enumfiles, we know that ListNameType is an enumerated type, and our target language handles those differently...moreover, from parsing the enumfiles, we know at least a superset of allowable values for those, and in addition to checking glGetError() afterwards for GL_INVALID_ENUM, we can do our own checking ahead of time. We'll also be able to generate a more intelligent error message that says which parameter had an invalid enumeration value (some functions have multiple enumeration parameters) and we'll be able to generate a list of possible valid values. (I'll return to enumeration types in a bit).
* Parameter 'lists', most importantly, tells us that it's not just taking a pointer...it's taking as *input*, an array of types. Moreover, the annotation [COMPSIZE(n/type)] not only allows us to alert the programmer that manual intervention is required, but in this case, depending on the type of target language, we can even possibly not require much manual intervention at all. COMPSIZE tells us that both parameter 'n' and parameter 'type' merely describe the values being passed into 'lists'. If your target language has sufficient type-data attached to the values you pass in to 'lists', then both 'n' and 'type' can be inferred from the 'lists' value itself, and a sufficiently-intelligent parser can determine this.
* Lastly, the 'category', 'version', and 'deprecated' values...I can't speak for others, but for me, they've come in useful for generating lists of what I want to bind. I filter by them to pare down what I want to work with.

What can we see here?
* With some text parsing, we can see that the return type is 'void', and the command name is 'glCallLists'. We will have to write some code to remove the 'gl' from here.
* There is a parameter named 'n', taking a parameter 'GLsizei'. This tells us the representation of the input value in C, but doesn't give us any more semantic information besides the largest possible range of representable values.
* There is a parameter named 'type', taking a parameter 'GLenum'. Wow, that's just pitiful...the only checking we can do is that our target language passed some kind of enumeration value but we can't make even the most basic checks to see if it's the right kind of enumeration value. Strongly-typed enumerations? What are those?
* We have a parameter named 'lists'. It is of type 'GLvoid'. Our parser is then expected to parse out the presence of a 'const' before the type and a '*' after the type to determine...pretty much nothing. We'll have to alert the programmer that more information is required but we can't even give the most basic of annotations as to what might be required. Moreover, we don't even know that 'n' and 'type' parameters are merely descriptive of what's being passed into 'lists'...our programmer will have to notice to insert additional annotations to infer the data from them and not generate them as actual parameters of the equivalent function in our target language.

And that's it. We don't get any more information than that. This information is only useful to someone generating C headers...all information that would've come in useful for generating bindings to higher-level languages has been lost. And why? Did the information really need to be lost? Was it beyond the capabilities of XML to handle?

Moving on to Enumerations...
I said I'd return to COMPSIZE() and parameter 'type'. I'll admit that the specfiles are marginally difficult to parse...and yet it took less than a day for me to write PowerShell code to go through all of them, generate enumerations, along with properties on them for which definitions they were re-used from, what class they were from, verify everything, and generate warnings and errors for multiple-definitions. So let's look at ListNameType in the specfiles:

ListNameType enum:
use DataType BYTE
use DataType UNSIGNED_BYTE
use DataType SHORT
use DataType UNSIGNED_SHORT
use DataType INT
use DataType UNSIGNED_INT
use DataType FLOAT
use DataType 2_BYTES
use DataType 3_BYTES
use DataType 4_BYTES

Some quick searching shows that this comprises the entire definition of ListNameType; there are no additions to this definition elsewhere in the specfiles. We know exactly which values are allowed, and we even know their names without the GL_ pseudo-namespace attached. We can look up their definitions in the DataType class, which is a convenient location for our programmer to add some trivial annotations giving the size of the various represented data types, which our parser and code generator can make use of...as well as in any other enumeration type that borrows values from DataType, such as NormalPointerType or PixelType.
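For illustration, here is a minimal Python sketch of the kind of enum-group parsing described above. It assumes only the block layout shown in the ListNameType excerpt (a "Name enum:" header followed by "use Class VALUE" lines). The names parse_enum_groups and DATATYPE_SIZES are hypothetical, and the real specfiles contain other line forms that this does not handle.

```python
from collections import defaultdict

SPEC_TEXT = """\
ListNameType enum:
use DataType BYTE
use DataType UNSIGNED_BYTE
use DataType SHORT
use DataType UNSIGNED_SHORT
use DataType INT
use DataType UNSIGNED_INT
use DataType FLOAT
use DataType 2_BYTES
use DataType 3_BYTES
use DataType 4_BYTES
"""

# Hypothetical programmer-supplied annotation: size in bytes of one element
# of each DataType value. This is not part of the specfiles themselves.
DATATYPE_SIZES = {
    "BYTE": 1, "UNSIGNED_BYTE": 1,
    "SHORT": 2, "UNSIGNED_SHORT": 2,
    "INT": 4, "UNSIGNED_INT": 4, "FLOAT": 4,
    "2_BYTES": 2, "3_BYTES": 3, "4_BYTES": 4,
}


def parse_enum_groups(text):
    """Return {group_name: [(source_class, value_name), ...]}."""
    groups = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("enum:"):
            # Header line such as "ListNameType enum:" starts a new group.
            current = line[: -len("enum:")].strip()
            groups[current] = []
        elif line.startswith("use ") and current is not None:
            # "use DataType BYTE" borrows BYTE from the DataType class.
            _, source_class, value_name = line.split()
            groups[current].append((source_class, value_name))
    return dict(groups)


groups = parse_enum_groups(SPEC_TEXT)
for source_class, value_name in groups["ListNameType"]:
    print(f"{value_name}: {DATATYPE_SIZES[value_name]} byte(s), from {source_class}")
```

Because the size annotation hangs off DataType rather than ListNameType, any other group that borrows from DataType could reuse it without further work.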

Now let's look at the comparable data in the new XML API database...oh wait, we can't. In fact, the phrase "ListNameType" doesn't show up anywhere in it.

I think I've made my points clear...to close, I'll reiterate by saying that I'm aware I represent an extraordinarily small group of people, but I still fail to see how anyone thought this bit from README.PDF could've been much comfort to anyone:

It would be a big job to go backwards from the XML to .spec formats, and we don't want to support this or enhance the .spec files going forward. Hopefully, people using the .spec files for other purposes will be able to transition to the XML registry.

It makes me wonder at this point who thinks that transitioning from XML to .spec would be going backwards. The only advantage at this point that the XML data has is that the overall structure of the data has ready-made parsers available...but the data itself, what really matters, has only gone backwards. So much data is missing, and what data remains has actually become harder to parse...surely nobody could believe the old specfile format was harder to parse than full C declarations? And why was so much semantic data thrown away?

At this point, I'll continue to use the old specfiles to generate bindings up through OpenGL 4.3, because I can't use the new XML registry for it - it doesn't have the information for it. I'm quite worried for the future of my project and how it's going to function when I want it to target ever-newer versions.

And why? Did the information really need to be lost? Was it beyond the capabilities of XML to handle?

Did you read the post where they explained this stuff? He clearly said, "instead of the generic type syntax used in .spec, and does not include array length information (which was often inaccurate in the .spec files)." That last part is important; if the size information wasn't always accurate, then it's better to have no size information than to have size information that is potentially wrong.

Really though, the most important information you need is whether a parameter is an input parameter or an output parameter. That's what's going to make it hard for you to write a Lua binding.

Now let's look at the comparable data in the new XML API database...oh wait, we can't. In fact, the phrase "ListNameType" doesn't show up anywhere in it.

Fair enough.

Now look at the definition for TexImage2D in the old spec files. You'll see that the first parameter is of type "TextureTarget". Go look that up in the enum.spec file. You'll find this:

Well, I must say I'm surprised at how inaccurate the Enumerant listings were; I did read the explanations in the readme but didn't see the sheer wrongness of the data being mentioned.

I guess I can understand that abandoning listing that kind of information is a lot easier than trying to fix and verify all that information in something of the scale of the OpenGL Registry...I won't pretend to know just how much effort that would take. If the information can't be trusted even when it's present, I suppose there's not much lost when it's gone.

Guess I'll have to come up with the extra information myself; it's not something I haven't done before, though not yet on the scale of something like OpenGL. Can't be helped...

I need to take issue with some of the assertions in your original post.

Firstly, you don't need to write a partial C parser for the GLsizei parameter. You just need to use an XML parser. The contents of the <proto> tag are quite clearly the return type (void) and a <name> tag. GLsizei is well-defined, and tells you that the length of your input array (of which more shortly) should be the same as this value (although IIRC it's not an error for the input array length to be greater than it). I'm failing to see how you're getting more info from the old way here.

I mentioned an XML parser in passing above, but it is an important point, because that's what you should be using to parse XML. I know that means you're facing a rewrite of already working code, but I really don't think you should be even contemplating putting XML through any kind of home-grown text parser. As I understand it, it's a matter of accepting some such pain now in exchange for making everything easier in the longer term.

Moving on.

You've a good point about that GLenum, but Alfonse's counterpoint must be borne in mind. Maybe it would have been better to add a <validvalues> tag to anything expecting a GLenum, and put some effort into adding those and keeping them correct? Then again, with limited resources one has to draw the line somewhere.

That aside, another important point is that no GLenum in OpenGL is typesafe to begin with, and OpenGL (the API as implemented, not as specified) freely allows you to put any old rubbish in here. Want to put GL_TEXTURE_COORD_ARRAY into the GLenum for glCallLists? Want to put 0xffffffff into it? Sure you can; it will generate an error at runtime for sure, but there is otherwise nothing to prevent you from doing it in code or at compile time.

And another point is that there are some calls in OpenGL where this actually matters. Take glActiveTexture, for example. A common usage of this is "glActiveTexture (GL_TEXTURE0 + textureUnitNum)", and that's actually required to be allowed and accepted so that a GL implementation can support more than 32 texture units. A similar pattern is frequently seen when specifying cubemap faces via glTexImage/glTexStorage. If you can't support these then you have a design problem in your own code that you should be looking to fix.

Finally, "<param>const <ptype>GLvoid</ptype> *<name>lists</name></param>" actually tells you almost everything you need to know about the last parameter. It tells you that it's an array (the "*"), that it's not changed by the call (the "const"), and therefore that it's an input parameter. Notice that you don't need to parse any of these out of unformatted text - they're all clearly tagged and an XML parser will let you pull them out.

What you don't get here is the annotation (compsize) telling you how the array is formed, but that just puts you in the position where manual intervention is needed, which is something you've already accepted in other cases.
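As a rough sketch of that point, the following Python uses the standard xml.etree.ElementTree module to pull the pieces out of the <param> element quoted above. The function name describe_param and the shape of the returned dictionary are hypothetical. Inferring "in" from const is the heuristic described in this thread, not something the XML states outright, so a non-const pointer is reported as "unknown" rather than "out".

```python
import xml.etree.ElementTree as ET


def describe_param(param_xml):
    """Summarise one <param> element from the XML registry."""
    elem = ET.fromstring(param_xml)
    name = elem.findtext("name")

    # Collect the declaration text that sits outside the <name> tag:
    # the element's leading text, each child's text, and each child's tail.
    pieces = [elem.text or ""]
    for child in elem:
        if child.tag != "name":
            pieces.append(child.text or "")
        pieces.append(child.tail or "")
    tokens = "".join(pieces).replace("*", " * ").split()

    is_const = "const" in tokens
    pointer_depth = tokens.count("*")
    # Prefer the tagged <ptype>; fall back to the first plain token.
    ctype = elem.findtext("ptype") or next(
        t for t in tokens if t not in ("const", "*")
    )

    if pointer_depth == 0 or is_const:
        direction = "in"       # by-value, or a pointer the call cannot write through
    else:
        direction = "unknown"  # probably an output, but needs a human to confirm

    return {
        "name": name,
        "ctype": ctype,
        "const": is_const,
        "pointer_depth": pointer_depth,
        "direction": direction,
    }


info = describe_param(
    "<param>const <ptype>GLvoid</ptype> *<name>lists</name></param>"
)
print(info)
```

Note that the pointer alone does not say whether 'lists' is an array or a single value, and nothing here recovers the relationship to 'n' and 'type'; that is exactly the part that still needs manual annotation.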

I must thank you all for taking the time to point out many of my own errors; I had many incorrect assumptions going in and it would surely have reflected in my final work. I'm also thankful for your very professional replies towards someone not making a very good first impression. I would like to make a few clarifications, if I may:

All in all, it's not as bad as you're making it out to be.

Indeed. Or rather, it was never as good as I had believed it to be.

I mentioned an XML parser in passing above, but it is an important point, because that's what you should be using to parse XML. I know that means you're facing a rewrite of already working code, but I really don't think you should be even contemplating putting XML through any kind of home-grown text parser. As I understand it, it's a matter of accepting some such pain now in exchange for making everything easier in the longer term.

I was rather vague in using the word "parser" to refer to two different items; I never intended to home-roll an XML parser, I have several already available to me. My complaint was not with having to parse the XML markup (I was in fact looking forward to not having to maintain parsing code for a custom text format, I was similarly happy when the Unicode Character Database moved to XML), but the data itself contained therein, which I feel (still feel) to be somewhat inferior to the old data, semantically and in organization. Particularly, I still feel that tagged C declarations are hardly an ideal method for declaring the semantics of a function's interface, although with a better understanding of the purpose, or scope, of the new XML API Registry, I'm not suggesting that it be changed.

That aside, another important point is that no GLenum in OpenGL is typesafe to begin with, and OpenGL (the API as implemented, not as specified) freely allows you to put any old rubbish in here. Want to put GL_TEXTURE_COORD_ARRAY into the GLenum for glCallLists? Want to put 0xffffffff into it? Sure you can; it will generate an error at runtime for sure, but there is otherwise nothing to prevent you from doing it in code or at compile time.

Indeed, that is my point...at the binary level, or even at the C compiler level, there is nothing preventing you from stuffing anything into a GLenum parameter. But bindings to higher languages do allow you to make those kinds of checks at compiletime...and bindings to higher languages are my entire interest here. Furthermore, the runtime errors don't give terribly much information. When the language itself has this kind of semantic information available, error messages can be more informative, even at runtime: "GL_INVALID_VALUE" versus "Invalid value -42 passed to parameter 'width' of gl.LineWidth(), value must be within range 0.0 < value < Infinity"; Ada and PowerShell are capable of doing this kind of checking automatically, and most of my previous projects that auto-generated C code would automatically insert such checks as well in response to my annotations. On a similar subject, "glActiveTexture (GL_TEXTURE0 + textureUnitNum)" is indeed quite possible in a number of higher languages, particularly those which retain strings for the names of enumerated types at runtime, for instance: C# (and any .NET language, for that matter), Ada, and Lua (under certain common enumeration paradigms).
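To make the kind of generated check I mean concrete, here is a small Python sketch (Python only for illustration; my own generators emit other languages). The helper names check_enum and check_positive and the exception class are hypothetical. The numeric values are the standard GL constants for the ListNameType members listed earlier, and the allowed set itself is the one from the specfile excerpt.

```python
# Allowed values for the 'type' parameter of CallLists, per the ListNameType
# group in the specfile excerpt.
LIST_NAME_TYPE = {
    "GL_BYTE": 0x1400,
    "GL_UNSIGNED_BYTE": 0x1401,
    "GL_SHORT": 0x1402,
    "GL_UNSIGNED_SHORT": 0x1403,
    "GL_INT": 0x1404,
    "GL_UNSIGNED_INT": 0x1405,
    "GL_FLOAT": 0x1406,
    "GL_2_BYTES": 0x1407,
    "GL_3_BYTES": 0x1408,
    "GL_4_BYTES": 0x1409,
}
GL_TEXTURE_COORD_ARRAY = 0x8078


class BindingArgumentError(ValueError):
    """Raised by the binding before the call ever reaches OpenGL."""


def check_enum(func, param, value, allowed):
    """Reject an enum value that is not in the allowed group."""
    if value not in allowed.values():
        names = ", ".join(allowed)
        raise BindingArgumentError(
            f"Invalid enum 0x{value:04X} passed to parameter '{param}' "
            f"of {func}(); expected one of: {names}"
        )


def check_positive(func, param, value):
    """Reject a numeric value that is not strictly greater than zero."""
    if not value > 0:
        raise BindingArgumentError(
            f"Invalid value {value} passed to parameter '{param}' "
            f"of {func}(); value must be greater than 0"
        )


# Both calls below fail in the binding, naming the offending parameter.
for check, args in (
    (check_enum, ("gl.CallLists", "type", GL_TEXTURE_COORD_ARRAY, LIST_NAME_TYPE)),
    (check_positive, ("gl.LineWidth", "width", -42)),
):
    try:
        check(*args)
    except BindingArgumentError as err:
        print(err)
```

A check like this is only as good as the allowed set behind it, which is where the accuracy problems with the old enum groups raised earlier in the thread come in.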

Speaking of enum checking, a few months back, I started working on an attempt to collate all of the enumerators that every OpenGL entrypoint can take. The idea being that, if we went about compiling that list systematically, function by function, we could then run a script to automatically compare each function's enum set against another function's to see how they differ, and to automatically put out equivalent sets. It even has ranges like glActiveTexture, where the range can be extracted from a glGet call.
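The comparison step could look roughly like the Python sketch below. This is not the script from the repo; the function names, the data layout, and the entries in FUNCTION_ENUMS are all made up for illustration and are not real registry data.

```python
from collections import defaultdict

# Illustrative placeholder data: maps "function.parameter" to the set of
# enumerators collated for it. The names are invented for this example.
FUNCTION_ENUMS = {
    "FuncA.type": {"BYTE", "SHORT", "INT", "FLOAT"},
    "FuncB.type": {"BYTE", "SHORT", "INT", "FLOAT"},
    "FuncC.type": {"BYTE", "UNSIGNED_BYTE", "SHORT", "INT", "FLOAT"},
}


def group_equivalent_sets(function_enums):
    """Group parameters that accept exactly the same set of enumerators."""
    groups = defaultdict(list)
    for key, enums in function_enums.items():
        groups[frozenset(enums)].append(key)
    return [sorted(keys) for keys in groups.values()]


def diff_sets(function_enums, a, b):
    """Return (only_in_a, only_in_b) as sorted lists."""
    set_a, set_b = set(function_enums[a]), set(function_enums[b])
    return sorted(set_a - set_b), sorted(set_b - set_a)


for members in group_equivalent_sets(FUNCTION_ENUMS):
    print("same set:", ", ".join(members))

only_a, only_c = diff_sets(FUNCTION_ENUMS, "FuncA.type", "FuncC.type")
print("only in FuncA.type:", only_a)
print("only in FuncC.type:", only_c)
```

Each group of identical sets is a candidate for a shared named enumeration, and the diffs show where two near-identical sets diverge.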

I managed to collate quite a few entrypoints' worth of data. None for the really big functions like the various glGets, but version 1.1 was mostly done, as well as most of the main ARB core extensions.

If anyone feels like investing time in the project, I'm willing to accept updates. You can pull from the Bitbucket repo; you simply modify the data in the "function_enums" directory.

Alfonse captured a bunch of the things I'd have said had I been online earlier, but here are a few additional comments.

The goal of doing this wasn't to deprive people of things they're used to, it was to get the spec file infrastructure onto a modern technology path instead of a 25+ year old SGI custom and undocumented text file format, express semantic things about the APIs that can't be done in the old .spec files (such as the fact that many different extensions and GL versions can share the same enums & entry points), add support for other Khronos APIs like GL ES and EGL, document it to a greater degree, and make it easier to build tools around. Having some transitional pain is unavoidable, but the old spec files aren't being removed, just deprecated, so there's no great urgency about transitioning. As we add extensions and future GL/ES versions, there will be more motivation to move over to using the XML (especially because, frankly, .spec are such a pain to maintain that I let an enormous number of bugs accumulate in the public Bugzilla - most of which have been addressed in the XML files already and I'm working through the remainder). As we get feedback from people like those posting in this thread, hopefully the schema will evolve to become a better match to what other people are using it for - because we don't *know* what other people are using .spec for today, and all Khronos uses it for is generating C headers and as the enumerant/function/GLX opcode registry. This is the initial version, not the last word.

I'm not opposed to adding more semantic annotation to the XML schema, up to a point - array (maximum) lengths are certainly plausible since they would just be an optional attribute on <param> tags, while trying to capture all the legal parameter combinations, especially when they often depend on current GL state that isn't even available to a static typechecker, seems way out of scope to me.

As also mentioned upthread, a lot of the information in .spec is way out of date and has some inaccuracies. Most of the annotations come from SGI in the days of OpenGL 1.1 and (IIRC) were used to generate IRIX bindings to FORTRAN and Ada. Sometimes they have been added to newer material, but not consistently. The enumerant classes are especially inaccurate, particularly things like pixel formats and vertex data types which have been enormously extended over many versions. But even if they included every possible enum of a particular class, e.g. PixelFormat, they still couldn't be used as an accurate form of static typechecking. Many different calls taking e.g. "format" parameters accept different valid subsets of all the possible format enumerants, and it's also dependent on the GL version and the extensions supported. You can do all this checking, but it's not something accessible to most languages - you would end up writing something like the WebGL GL emulation layer just to provide more informative error messages (which in the era of GL_KHR_debug and numerous GL-specific debuggers, seems redundant anyway).

Finally, the C type declarations in gl.xml are isomorphic to the generic type syntax in gl.spec. There is no special sauce, no magic information in saying

param params Int32 out array [COMPSIZE(pname)]

vs. saying

<param>GLint *<name>params</name></param>

aside from the notation that the length of the array is some undefined function of the pname parameter. The way that the generic type syntax was generated in the first place, for entry points added after GL 1.2 or so, was for me to run the C prototypes from extension specifications and the API specs through a rather hacky Tcl filter which generated equivalent generic type syntax. It's true that you have to do that additional translation step now. Since GL parameter types are all scalars, arrays, or pointers to same (up to two levels of nesting in the worst case, IIRC), the "parsing" stage is pretty simple pattern recognition. If someone wanted to write some Python code for type translation to include with the generator scripts, that would be a great contribution (a complete interface generator for languages other than C, even better - I tried to anticipate that by genericizing genheaders.py and reg.py to an extent, but it's just a framework that hasn't been built on, yet). I can dig up the Tcl for type translation if anyone really wants to take a swing at it, but the Tcl isn't pretty, nor is it documented.
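A minimal Python sketch of that pattern recognition, going from the flattened C declaration text of a <param> back to a .spec-style line, might look like this. It is not the Tcl filter and not part of the generator scripts. The function name to_generic is hypothetical, the type table covers only the two pairings that appear in this thread (GLint/Int32 and GLsizei/SizeI), and treating every pointer as "array" and every non-const pointer as "out" are assumptions that would need human review.

```python
import re

# Assumed C-to-generic type names; only the pairings seen in this thread.
# Unknown types fall through unchanged.
GENERIC_NAMES = {"GLint": "Int32", "GLsizei": "SizeI"}


def to_generic(c_decl):
    """Translate e.g. 'GLint *params' into 'param params Int32 out array'."""
    tokens = re.findall(r"\w+|\*", c_decl)
    name = tokens[-1]                      # the parameter name comes last
    type_tokens = tokens[:-1]
    is_const = "const" in type_tokens
    pointer_depth = type_tokens.count("*")
    ctype = next(t for t in type_tokens if t not in ("const", "*"))

    generic = GENERIC_NAMES.get(ctype, ctype)
    if pointer_depth == 0:
        direction, passing = "in", "value"
    else:
        # Assumption: const pointer means input, anything else means output.
        direction = "in" if is_const else "out"
        # Assumption: a pointer is an array; it may really be a single value.
        passing = "array"
    return f"param {name} {generic} {direction} {passing}"


print(to_generic("GLint *params"))
print(to_generic("GLsizei n"))
print(to_generic("const GLvoid *lists"))
```

The first line printed matches the .spec example above except for the [COMPSIZE(pname)] suffix, which cannot be derived from the C declaration at all.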

I would say that the most crucial pieces of information for generating a binding to a non-C language are whether a pointer is an input or output and whether a pointer is an array or a single value. While it would be really nice to have an array count of some kind (even if it's just for the simplest of cases), these two pieces of information will make non-C bindings far easier to generate.

The input/output distinction is important because many non-C languages allow you to return multiple values. So to provide a native-seeming binding for such a language, you will want to remove an output variable from a binding's interface and turn it into a(nother) return value. And even languages that don't will sometimes want to turn functions that return nothing into functions that return a value if the function takes a single output parameter. In such languages, `glGetIntegerv` and its ilk should be returning its value, not taking some memory to be written into.
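Here is a small Python sketch of that transformation using the standard ctypes module. The wrapper name make_getter is hypothetical, and fake_get_integerv is a stand-in with made-up pname keys and values rather than a real OpenGL call, so the example runs without a GL context. The caller has to say how many values to expect, since that is the piece the registry does not supply.

```python
import ctypes

# Made-up state table for the stand-in function: pname -> values written.
FAKE_STATE = {1: [16384], 2: [640, 480]}


def fake_get_integerv(pname, params):
    """Stand-in for a C entry point that writes results through 'params'."""
    for i, value in enumerate(FAKE_STATE[pname]):
        params[i] = value


def make_getter(c_func, out_type, count=1):
    """Wrap a function whose last parameter is an output buffer.

    The wrapper allocates the buffer itself and returns the result:
    a scalar when count is 1, otherwise a list of count values.
    """
    def getter(*args):
        buffer = (out_type * count)()   # zero-initialised C array
        c_func(*args, buffer)
        values = list(buffer)
        return values[0] if count == 1 else values
    return getter


get_scalar = make_getter(fake_get_integerv, ctypes.c_int)
get_pair = make_getter(fake_get_integerv, ctypes.c_int, count=2)

print(get_scalar(1))
print(get_pair(2))
```

In a language with multiple return values the count-2 case could return two values instead of a list; the wrapper shape stays the same.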

The array vs. single value distinction just makes it easier to deal with how you marshal data around. Even if you don't know how big of an array it is, it makes for a simple check of the user's inputs to the binding to say, "did the user pass me an array? If not, error." It also makes it easier to isolate which functions will need special processing to figure out how big the array is; you can't know that just based on the fact that the function takes a pointer.

I would say that the most crucial pieces of information for generating a binding to a non-C language are whether a pointer is an input or output and whether a pointer is an array or a single value. While it would be really nice to have an array count of some kind (even if it's just for the simplest of cases), these two pieces of information will make non-C bindings far easier to generate.

GetIntegerv is an interesting example because it can be passed a pointer to a scalar or to an array, depending on the state being queried. I'd think depending on the language, if you were returning a value from the query you would have to always return an array (which would often be of length 1). In a dynamically typed language I suppose you could return either scalar or array.

Re in/out, the ARB has in many cases gone back and modified existing APIs to add 'const' into the signatures for commands where that was appropriate, and keying off the 'const' gives the same in/out information as the .spec files - that gl.spec today says 'in' or 'out' is not a statement about the dataflow direction, but a mechanism for forcing glext.h to have the right parameter type declarations.

I'm pretty sure we haven't properly const-ized in every possible case, though. IOW there are likely some commands, particularly vendor extension commands, which still take non-const pointers as 'in' parameters. If those cases can be identified then an additional <param> attribute to mark these cases would make sense.

GetIntegerv is an interesting example because it can be passed a pointer to a scalar or to an array, depending on the state being queried.

It would be easier to think of it as being passed an array. It's just that the size of the array could be 1 depending on other parameters. Contrast that to the `length` parameter of glGetShaderSource, which is always a pointer to a single value.