Simply change the index type state of an IBO in between the calls to glDrawIndices. I am sure all of a model's primitives will share the same index type, and even different models will have the same index type in most cases, so reconfiguring the IBO's index type state will not be frequent. Besides, the same mess existed with the primitive restart index until the fixed restart index feature was introduced recently - the PRI had to be respecified to match the index type, since the intent was always to use the maximum possible value, but that required a function call separate from the draw call.
But yeah, the "GLint indexFirst" argument would be better changed to something able to specify an offset that is not necessarily aligned to the index type:

Respecifying the IBO state (index type, PRI, PSIs) in between model drawing calls makes sense: different models may be built from different sets of primitive types; they may come from different vendors who use different PSI binding conventions; and since, with the help of PSI, the whole model is drawn with a single draw call, setting up the IBO state according to the model's data conventions seems a logical way of drawing.
However, the lack of a "mode" argument means that the primitive type becomes something like internal state, undefined until the first PSI is encountered. Probably it should persist from the last PSI encountered, so either the IBO must start with a PSI, or a default primitive type must be set by some function just before calling glDrawIndices (e.g. the model's convention is to use GL_TRIANGLES by default unless the IBO starts with an explicit PSI). The normal drawing functions set the mode explicitly, but the proposed glDrawIndices functions avoid disturbing the primitive mode state as well as respecifying the index type (as I am just an OpenGL user, I can only guess at any advantage here intuitively).

The way I see it, a PSI is handled more like an exception, so respecifying the mode for every new primitive is not the intended way of using this feature - if two consecutive primitives have the same type, a restart index between them should be preferred over a switching index. But PSI, even if implemented as an exception for every switching index encountered, should obviously work faster than calling a glDrawElements* function, right?

I think it would be a really good idea to decide how often the primitive mode is going to change. Really important. If it changes rarely, then having the hardware handle primitive mode changes dictated by the index stream is silly; it significantly complicates the front end, to put it mildly. Emulating it in the driver is a really bad idea, since the driver would then need to walk the index stream to break it up, which is a really bad idea for non-unified memory architectures (and in truth still a bad idea for unified ones).

I admit that this is a nice idea in terms of making the API a touch nicer, but for drivers it is not the number of draw calls that is so bad, but rather the state changes between draw calls; in that regard, from the point of view of the API, changing the primitive mode is not even a state change anyway.

What are some use cases that make this feature worthwhile, at the cost of complicating the hardware? This, if ever done, is going to cost sand, so it had better make some rendering faster in a way that makes it unreasonable for the application to break up draws by primitive type.

Well, let's compare it with the PRI, which is already implemented. We have a single index value, settable through the API, that is treated in a special manner, unlike a real index value. So every time a new index is read, it is compared to that special value. Upgrading to PSI requires changing the check: instead of comparing an index for equality with a single predefined value, we now check it against bounds, because we have a range of special index values. So far there is not much difference in performance, as checking against bounds can be done as fast as checking for equality.
The next step is taken when the index is recognized as belonging to the set of special indices. With PRI we already have some sort of reinitialization when a primitive is restarted. With PSI that reinitialization branches depending on the specific value of the special index encountered. Since we bind specific index values to the desired primitive modes before the draw call is made, we can expect some preparation to be performed/precomputed/precompiled at that time, so that during rendering the switching mechanism can work without the help of the driver.
I do not know exactly how indexed rendering is implemented in hardware or how the PRI is currently implemented, but I have a "feeling" that advancing it to PSI could be done without too much "sand". This is a question for the actual video driver developers.
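To make the comparison above concrete, here is a minimal C++ sketch of the two checks. Everything in it is an assumption for illustration: the reserved values, the Mode enum and the function names are made up, and the proposal would let the application bind the values rather than hard-code them. It says nothing about how real hardware fetches indices.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mode list; these are not the real GL enum values.
enum class Mode { Points, Lines, LineStrip, Triangles, TriangleStrip, TriangleFan };

// Assumption: for 16-bit indices the top seven values are reserved.
// 0xFFFF stays the restart index; 0xFFF9..0xFFFE each select one mode.
constexpr std::uint16_t kSpecialFirst = 0xFFF9;
constexpr std::uint16_t kRestart      = 0xFFFF;

// PRI-style test: one equality comparison per fetched index.
constexpr bool isRestart(std::uint16_t index) { return index == kRestart; }

// PSI-style test: one comparison against the lower bound of the reserved
// range. The upper bound is the largest value of the type, so it is implicit.
constexpr bool isSpecial(std::uint16_t index) { return index >= kSpecialFirst; }

// A switching index is any special value other than the restart index.
constexpr bool isSwitch(std::uint16_t index) {
    return isSpecial(index) && !isRestart(index);
}

// Mode selected by a switching index. Precondition: isSwitch(index).
inline Mode switchedMode(std::uint16_t index) {
    assert(isSwitch(index));
    return static_cast<Mode>(index - kSpecialFirst);
}
```

In this sketch ordinary indices pay for one comparison either way; the extra branch on the specific value is taken only after an index has been recognized as special.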

Originally Posted by kRogue

What are some use cases that make this feature worthwhile, at the cost of complicating the hardware? This, if ever done, is going to cost sand, so it had better make some rendering faster in a way that makes it unreasonable for the application to break up draws by primitive type.

I didn't claim that the PSI should make drawing faster. The focus is on the user. In most cases, models originally built with different types of primitives are converted to GL_TRIANGLES just because no one wants to bother storing separate parameter sets for each primitive and calling glDraw* multiple times.
OpenGL does not exist in isolation - the user has to store a lot of information about what he has to draw. Multiple objects and structures. If a model is built with different primitives, then an array of primitive parameters (mode, index count, offset and so on) has to be allocated for each model. A dynamic array, which requires a memory allocation. If an application has hundreds of models and each of them has level-of-detail submodels, then we have a horrible number of small memory allocations, which fragment main memory. And every memory allocation is a potential source of error. The more of them there are, the higher the risk of an error. And if one decides to optimize the usage of vertex and index buffers by merging some models which share the same vertex type, all those dynamic arrays have to be recalculated (indices shifted, base offsets added or whatever). With the help of PSI we have just one set of parameters for a single draw call for each model, even if it is built with multiple types of primitives. It is a huge simplification on the user side, and even if it does not make rendering faster by itself, it takes an overwhelming burden off the user, so other optimization techniques can be taken advantage of more easily. It also saves main memory, which is otherwise polluted by all that drawing data.
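For illustration, the bookkeeping described above might look roughly like the following C++ sketch. The struct layout, the field names and the appendToShared helper are hypothetical and only show what "indices shifted, base offsets added" means when a model is merged into shared buffers; no GL calls are made.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per primitive batch of a model; field names are illustrative.
struct PrimitiveRange {
    std::uint32_t mode;        // primitive type, e.g. a GL_TRIANGLE_STRIP-style value
    std::uint32_t indexCount;  // number of indices in the batch
    std::size_t   byteOffset;  // where the batch starts inside the index buffer
};

struct Model {
    std::vector<std::uint16_t>  indices;
    std::vector<PrimitiveRange> ranges;  // the per-model dynamic array
};

// Appends a model's indices to a shared index buffer. Every index is shifted
// by vertexBase (the model's first vertex in the shared vertex buffer), and
// the returned ranges have their byte offsets rebased to the shared buffer.
std::vector<PrimitiveRange> appendToShared(std::vector<std::uint16_t>& sharedIndices,
                                           const Model& model,
                                           std::uint16_t vertexBase) {
    const std::size_t byteBase = sharedIndices.size() * sizeof(std::uint16_t);
    for (std::uint16_t index : model.indices)
        sharedIndices.push_back(static_cast<std::uint16_t>(index + vertexBase));

    std::vector<PrimitiveRange> rebased = model.ranges;
    for (PrimitiveRange& range : rebased)
        range.byteOffset += byteBase;
    return rebased;
}
```

With switching indices stored in the index buffer itself, the ranges array would shrink to a single offset and count per model, which is the simplification argued for here.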
It is really illogical to store part of the primitive definition in an index array (the indices) and the other part in main memory (the primitive type), as those are two parts of the same thing. The VBO stores the actual points of the model, which are pretty much independent items as far as vertex transformation goes, but what kind of data does the IBO represent without a primitive type mapping?! Specifying the primitive type in a draw call is as senseless as specifying the internal format of a texture every time it is bound - nonsense. Every set of indices works correctly only with the primitive type it was generated for, just as texture data can be fetched correctly only if it is viewed through the right internal format. So from this point of view, the GL_DRAW_INDIRECT_BUFFER should store the primitive types and references to the index ranges, linking the primitives with their indices, but it doesn't. The user still has to manage the individual primitives (grouped by type at best) as if that were data he might want to manipulate. Really?
But the index buffer is read more like a text string - indices are picked up consecutively, so inserting PSIs at the proper positions is a very straightforward way of marking the borders at which primitives of a given type start.
Representation of the idea in a graphical way may look like this:

In other words, if the IBO stores PSIs along with the normal indices, that IBO fully defines the geometry, and the only thing the user has to store is the bounds of the region of the IBO that belongs to the object he wants to draw. This approach is very consistent with the general usage of OpenGL: we upload the data, define its structure, configure the pipeline, and make use of the data by telling OpenGL when and where to draw it. But if the user has to store part of the model's technical definition on his side, then both the application and OpenGL are involved in the low-level drawing mess, so OpenGL serves just half of its purpose, doesn't it?

So the advantage of the PSI extension is not only a minor shrink of the IBO, due to packing indices into the primitive types they fit best (instead of using one universal primitive type for all), and not only the drawing time saved because glDraw* calls can be reduced to one per model instead of one per primitive batch of that model. The major advantage is less pollution of main memory by the many allocations the application must make during initialization (or model reloading) to hold arrays of primitive definitions for each model. Removing the headache of the primitive-by-primitive drawing sequence the user must follow to get a model drawn will also make the application code lighter - an advantage the user will appreciate. And here I do not mean a newbie writing his first app! The time (and so the money) spent debugging messy code (code dealing with multiple dereferences and dynamically allocated objects) should also be considered.

I do think that the technical experts should be consulted about possible ways of implementing PSI.

I didn't claim that the PSI should make drawing faster. The focus is on the user. In most cases, models originally built with different types of primitives are converted to GL_TRIANGLES just because no one wants to bother storing separate parameter sets for each primitive and calling glDraw* multiple times.

No, No, No!!!
This sounds like the most pointless reasoning imaginable.
The focus should be to guide the user to provide the data in a manner that allows the most efficient execution. In case you haven't noticed, the most recent talk about 3D has not been about making the API more user-friendly but about pushing it closer to the hardware in order to reduce driver overhead. Driver overhead has become one of the most important issues with graphics performance. And you ask for more of it.

What you want has absolutely no place in the driver, it's merely a convenience feature - but one that puts a huge burden on the hardware because it requires specific implementation.

I was faced with a similar setup recently - creating a buffer from data that contained triangles, strips, fans, and even quads and quad strips making up a single model object.
My solution was to turn everything into triangles, but I didn't want to change all the code that generates the data. So I generated the buffer data as before, inserted some fake primitive restart markers into it, and then passed it to a conversion function that made a list of triangles out of it. Effectively, the data generation can still assume it creates all kinds of primitives, but the 3D hardware never sees any of them; it only sees the final triangles, which can be rendered with a single glDrawElements call. You should do the same.
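A conversion function of that kind could look roughly like the C++ sketch below. This is not the code I actually used: the marker values, the names and the default type are assumptions made up for illustration, and the vertex ordering follows the usual OpenGL rules for strips, fans, quads and quad strips.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical marker values; real vertex indices must stay below kMarkerBase.
constexpr std::uint32_t kMarkerBase = 0xFFFFFFF0u;
enum Marker : std::uint32_t {
    kTriangles = kMarkerBase,
    kTriangleStrip,
    kTriangleFan,
    kQuads,
    kQuadStrip
};

// Appends the triangles for one run of indices of the given primitive type.
void emitRun(std::uint32_t type, const std::vector<std::uint32_t>& v,
             std::vector<std::uint32_t>& out) {
    auto tri = [&out](std::uint32_t a, std::uint32_t b, std::uint32_t c) {
        out.insert(out.end(), {a, b, c});
    };
    const std::size_t n = v.size();
    switch (type) {
    case kTriangles:
        for (std::size_t i = 0; i + 2 < n; i += 3) tri(v[i], v[i + 1], v[i + 2]);
        break;
    case kTriangleStrip:
        for (std::size_t i = 0; i + 2 < n; ++i) {
            // Swap the first two vertices of every odd triangle to keep the winding.
            if (i % 2 == 0) tri(v[i], v[i + 1], v[i + 2]);
            else            tri(v[i + 1], v[i], v[i + 2]);
        }
        break;
    case kTriangleFan:
        for (std::size_t i = 1; i + 1 < n; ++i) tri(v[0], v[i], v[i + 1]);
        break;
    case kQuads:
        for (std::size_t i = 0; i + 3 < n; i += 4) {
            tri(v[i], v[i + 1], v[i + 2]);
            tri(v[i], v[i + 2], v[i + 3]);
        }
        break;
    case kQuadStrip:
        // Each quad is (v[i], v[i+1], v[i+3], v[i+2]); consecutive quads share two vertices.
        for (std::size_t i = 0; i + 3 < n; i += 2) {
            tri(v[i], v[i + 1], v[i + 3]);
            tri(v[i], v[i + 3], v[i + 2]);
        }
        break;
    }
}

// Converts a marker-delimited stream into a single triangle-list index array.
std::vector<std::uint32_t> toTriangleList(const std::vector<std::uint32_t>& stream) {
    std::vector<std::uint32_t> out, run;
    std::uint32_t type = kTriangles;  // assumed default before the first marker
    for (std::uint32_t value : stream) {
        if (value >= kMarkerBase) {   // marker: flush the current run, switch type
            emitRun(type, run, out);
            run.clear();
            type = value;
        } else {
            run.push_back(value);
        }
    }
    emitRun(type, run, out);          // flush the final run
    return out;
}
```

Repeating the same marker simply restarts the primitive, so the generator code does not need a separate restart value. The result is uploaded once and drawn as plain triangles.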

And the number of saved indices can be considered irrelevant; the maintenance overhead will easily negate any savings.

That is exactly the common "solution" that makes me so sad! What is the point of supporting different primitive types if no one is using them?! Why introduce the PRI? Triangles do not require it, right?
Besides, what was the difference in the quantity of indices between the original model's version and the triangle-only version? Something like more than two times bigger, I guess? Well, the packed texture format GL_UNSIGNED_SHORT_4_4_4_4 is also just 2 times smaller than {GL_RGBA&GL_UNSIGNED_BYTE}, but I doubt the GPU can operate on 4-bit values directly, so unpacking must be performed all the time. But still, that complicated format (as well as many others of the same kind) was introduced into core OpenGL just to halve the size a texture takes. And again: how many users actually use it, and how many complications has it brought to the drivers?

"Driver overhead has become one of the most important issues with graphics performance. And you ask for more of it."

If so, then let's deprecate all flat primitive types except for triangles and patches - good? Let's also deprecate PRI then, because fetching groups of a fixed number of indices is so much easier - just stride iteratively by n indices, pick the next batch of n indices and rasterize a new primitive - this way there is no need to check indices for special values, which misalign the stride and prevent such easy going!

IMO, if the primitive types with an undefined number of indices (strips, fans) are not scheduled for deprecation, then they need to be given "full support" so users are encouraged to use them. PRI goes only halfway: the primitive is restarted, but the mode stays the same. I do not think that taking the second step to solve the problem completely is such an unaffordable complication.

"The focus should be to guide the user to provide the data in a manner that allows the most efficient execution."

If the same number of triangles could be rasterized faster in GL_TRIANGLES mode than as strips, quads, fans or any other mode, then there is nothing to argue about - downconverting other types of primitives into a unified array of triangles is easy. But if there is no difference in rendering speed, then small benefits like IBO size savings and user-side simplifications start to count in favor of the PSI extension.

Do you have a usage scenario for this extension, such as an algorithm that would be accelerated by it? The reason I ask is because over the years I have run into very few cases where points, lines, and triangles would share the same shader or GL state.

I think such an extension could switch between prim types that generate the same basic GL primitive type (points, lines, triangles, adjacency-types). However, I wonder if hardware can switch rasterization modes efficiently, from triangle rendering to lines, and back again. Certainly with a geometry shader active you'd be restricted to a single GL primitive type.

For some background on my GL experience: I've occasionally found myself wanting to switch between lines and line-strips in the same draw. I don't use triangle strips or fans except in extremely specialized cases that would be their own draw batch (circle drawing, for example). If tristrips or trifans were deprecated I wouldn't shed a tear. I'm not at all interested in the old fixed-function pipeline.

That is exactly the common "solution" that makes me so sad! What is the point of supporting different primitive types if no one is using them?! Why introduce the PRI? Triangles do not require it, right?

Baggage from older times? Remember, quads have already been deprecated because hardware support is poor.
Also, why is this solution so bad? It does precisely what you want, it only requires a bit of groundwork on the CPU side - just like it is with matrices in the core profile. There's absolutely no need for the driver to handle them, all it really needs to do is upload the data.

Originally Posted by Yandersen

Besides, what was the difference in the quantity of indices between the original model's version and the triangle-only version?

Depends on the model. On average 1.5 times larger. But let's be clear about one thing: you need HUGE models for this to have an impact. Model data is static; you upload it once to the GPU and forget about it. Who cares if it takes 1 MB or 1.5 MB of index storage ALTOGETHER?
If you are this concerned about space, you can still check whether you can convert everything to strips; then all you need to take apart is the fans and quad strips. But I really didn't bother with that, because common wisdom currently says that single triangles are better for the hardware.

Originally Posted by Yandersen

Something like more than two times bigger, I guess? Well, the packed texture format GL_UNSIGNED_SHORT_4_4_4_4 is also just 2 times smaller than {GL_RGBA&GL_UNSIGNED_BYTE}, but I doubt the GPU can operate on 4-bit values directly, so unpacking must be performed all the time. But still, that complicated format (as well as many others of the same kind) was introduced into core OpenGL just to halve the size a texture takes. And again: how many users actually use it, and how many complications has it brought to the drivers?

What do I care if the index buffer gets a bit larger? The memory it takes is still a fraction of the textures required for drawing all this stuff, so all things considered, we are talking about less than 5% space savings. That's nothing! That's simply not worth adding new logic to the hardware.
And that's where these texture formats come in: let's take a highly complex model with 100000 triangles. That's 300000 vs maybe 200000 indices, i.e. a difference of 400000 bytes with 32-bit indices. Now let's take a skin texture. For a model of this detail it'd have to be at least 1024x1024, if not larger. But let's stick to 1024. With all mipmaps generated, such a texture is about 5.5 MB in a 32-bit RGBA format; halving that amounts to about 2.75 MB of space savings, even more if you use other compression formats. See the relation between texture and index buffer savings? It's almost 7:1 - if you've got a second skin for the same model you are at 14:1. That's why formats with a smaller memory footprint exist. Overall, the texture-to-index ratio will be far higher than in this contrived example. If you want to save space, save where you can get huge savings with a small investment; do not try to get small savings with costly investments.
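For anyone who wants to check the arithmetic, here is a small C++ sketch. The function names are made up, and it assumes 32-bit indices, a square power-of-two texture and a full mip chain down to 1x1.

```cpp
#include <cstdint>

// Bytes used by a square power-of-two texture including its full mip chain.
constexpr std::uint64_t mipChainBytes(std::uint32_t size, std::uint32_t bytesPerTexel) {
    std::uint64_t total = 0;
    for (std::uint32_t s = size; s >= 1; s /= 2)
        total += std::uint64_t{s} * s * bytesPerTexel;
    return total;
}

// Bytes saved by storing fewer indices of the given size.
constexpr std::uint64_t indexBytesSaved(std::uint64_t listIndices,
                                        std::uint64_t mixedIndices,
                                        std::uint32_t bytesPerIndex) {
    return (listIndices - mixedIndices) * bytesPerIndex;
}

// 300000 vs 200000 indices at 4 bytes each: 400000 bytes.
constexpr std::uint64_t kIndexSaving = indexBytesSaved(300000, 200000, 4);

// 1024x1024 with mipmaps: 5592404 bytes at 4 bytes per texel,
// 2796202 bytes at 2 bytes per texel, so the saving is 2796202 bytes.
constexpr std::uint64_t kTextureSaving = mipChainBytes(1024, 4) - mipChainBytes(1024, 2);

// Ratio of texture saving to index saving, roughly 7:1.
constexpr double kRatio = static_cast<double>(kTextureSaving) / kIndexSaving;
```

The exact figures differ slightly from the rounded ones in the text, but the ratio comes out just under 7:1 either way.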

Originally Posted by Yandersen

If so, then let's deprecate all flat primitive types except for triangles and patches - good?

No. Let's deprecate everything that's not commonly supported across existing hardware. The API should mirror what can be done efficiently by the driver, not what allows the most convenience to the programmer. It only gets bad if the reduced feature set puts some severe limitation on what can be done and how it can be done.

And that's still the crux here: in order to implement this feature in the spec you need hardware supporting it! But hardware currently does not support it, meaning the driver has to resort to expensive emulation steps to provide it. You simply do not want that in the most time-critical part of the entire driver, namely the draw calls. You want those to be as efficient as they possibly can be.

Originally Posted by Yandersen

Let's also deprecate PRI then, because fetching groups of a fixed number of indices is so much easier - just stride iteratively by n indices, pick the next batch of n indices and rasterize a new primitive - this way there is no need to check indices for special values, which misalign the stride and prevent such easy going!

No, strips and fans still have their use and I'd still use them if it makes sense. But it gets very hard to justify the effort to mix both, aside from poor data design. To be honest, as of this writing the only sources of such a mix of primitives I know of are the GLU tessellator and models as old as Quake 2's MD2 format. Anything newer has already been optimized for better hardware use. So, sorry, I really have no clue what this would be there for.

Originally Posted by Yandersen

IMO, if the primitive types with an undefined number of indices (strips, fans) are not scheduled for deprecation, then they need to be given "full support" so users are encouraged to use them. PRI goes only halfway: the primitive is restarted, but the mode stays the same. I do not think that taking the second step to solve the problem completely is such an unaffordable complication.

No, they don't. They don't need it, because they already get universal support from all existing hardware, meaning that issuing one draw call results in the driver starting one GPU operation. But if the hardware can't switch to a different primitive type on its own, your suggestion would mean that the driver has to resort to emulation to satisfy your request, meaning it has to analyze the buffer on the CPU, see where a primitive type change occurs and then dispatch separate draw calls. I can't stress this enough: YOU DO NOT WANT THAT!!!
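To show what that emulation would involve, here is a rough C++ sketch of the CPU-side pass. It is purely hypothetical: the reserved range, the DrawRange struct and the function name are invented for illustration and do not describe how any actual driver works.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One emulated draw call: a run of ordinary indices sharing one mode.
struct DrawRange {
    std::uint32_t mode;   // mode id selected by the last switching index
    std::size_t   first;  // position of the first ordinary index in the run
    std::size_t   count;  // number of ordinary indices in the run
};

// Assumption: values >= kPsiFirst are switching indices, mode id = value - kPsiFirst.
constexpr std::uint16_t kPsiFirst = 0xFFF8;

// Walks the whole index stream and cuts it at every switching index.
std::vector<DrawRange> splitBySwitchIndex(const std::vector<std::uint16_t>& indices,
                                          std::uint32_t defaultMode) {
    std::vector<DrawRange> draws;
    std::uint32_t mode = defaultMode;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= indices.size(); ++i) {
        const bool atEnd = (i == indices.size());
        if (atEnd || indices[i] >= kPsiFirst) {
            if (i > start) draws.push_back({mode, start, i - start});
            if (!atEnd) mode = static_cast<std::uint32_t>(indices[i] - kPsiFirst);
            start = i + 1;
        }
    }
    return draws;
}
```

Note that the loop has to read every single index on the CPU, which for a buffer living in video memory means reading it back or keeping a shadow copy, and each resulting range still becomes its own draw call.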

Originally Posted by Yandersen

If the same number of triangles could be rasterized faster in GL_TRIANGLES mode than as strips, quads, fans or any other mode, then there is nothing to argue about - downconverting other types of primitives into a unified array of triangles is easy. But if there is no difference in rendering speed, then small benefits like IBO size savings and user-side simplifications start to count in favor of the PSI extension.

No, they don't. You want a small benefit that'd require a significant investment in hardware complexity. That will never pay off. If you have an index buffer on the GPU, it has to be in a form that the GPU can efficiently consume, not in a format that's as small as possible, and certainly not in a format that lets you take shortcuts in your CPU code.

Have you ever asked yourself why even the most newfangled indirect draw calls allow no switch of primitive types? Right, that's because the hardware is not designed to do it. And ultimately that's the only thing a new feature should be measured against: if it has universal hardware support, yes, put it in; if it'd require emulation, leave it out. And that's the end of the story. You can argue as much as you like about your buffer size savings - they don't mean anything if they cause inefficiencies, especially for a problem that can already be solved with existing features.

And trust me, this one's extremely low on the hardware makers' radar. They have absolutely no motivation to add features that provide no quantifiable benefit beyond some measly memory savings.