In case it's some help, confronted with a similar problem, I stopped rendering trees as whole sprites and divided their rendering into chunks aligned to a grid. Grid squares were marked as dirty if anything was animating or moving in them or if any of the trees were changed. If a grid square wasn't dirty, it would be rendered from the previous frame's tree cache and none of the trees in it would generate fresh draw calls. This allowed the large stretches of functionally static trees to render very quickly.
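A minimal sketch of the dirty-grid idea described above, in Python for brevity. The names (`TreeGridCache`, `CELL`, `mark_dirty`, `render_cell`) are hypothetical, and the "rendered chunk" is just a tuple standing in for cached pixels; a real renderer would keep a texture per cell.

```python
from dataclasses import dataclass, field

CELL = 32  # hypothetical grid cell size in world units


@dataclass
class TreeGridCache:
    """Caches rendered grid cells and re-renders only the dirty ones."""
    cache: dict = field(default_factory=dict)  # (cx, cy) -> rendered chunk
    dirty: set = field(default_factory=set)    # cells needing a redraw
    draw_calls: int = 0                        # counter, for illustration only

    def mark_dirty(self, x: float, y: float) -> None:
        """Call when a tree at (x, y) changes or something animates there."""
        self.dirty.add((int(x // CELL), int(y // CELL)))

    def render_cell(self, cell, trees_in_cell):
        """Return the cell's chunk, redrawing only if dirty or never cached."""
        if cell in self.dirty or cell not in self.cache:
            self.draw_calls += len(trees_in_cell)            # one call per tree
            self.cache[cell] = tuple(sorted(trees_in_cell))  # stand-in for pixels
            self.dirty.discard(cell)
        return self.cache[cell]
```

A clean cell costs no tree draw calls at all on later frames; only cells touched by `mark_dirty` pay the per-tree cost again.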

Supporting forests composed of individual trees is a huge pain for any sprite-based game. Best of luck in solving your tree problem.

Very interesting read. The Virtual Texture approach may be a bit of overkill for Factorio though, since it optimises for the general case where only a subset of the sprites is used, and for these sprites only small sections of the actual pixels are used besides.

That may be common for 3D games, but in Factorio, if a sprite is used, all of its pixels are used (unless it is on the screen edge, which is an edge case).

Virtual Textures would just make your Fragment Shaders needlessly complex to select the right tile using the Indirection Texture (as described in Advanced Virtual Texture Topics you mentioned).

A simpler approach would be to make the entire sprite atlas dynamic, not just the BitmapCache one. Suppose the hardware can only support 16 textures at a time. Start out with blank textures and dynamically fill them with the sprites that the frame being rendered requires. Also keep track of which sprites in each texture have been used recently.
When the rendering pass filled up the 15th texture, decide which texture has the largest amount of unused sprites. Copy (glCopyTexSubImage2D) the sprites still in use to the 16th texture which will now have unoccupied and unfragmented space to add more sprites. The texture copied from is cleared for the next time space runs out.

Basically, this is an implementation of a copying garbage collector as used for example in Java Virtual Machines. This solves your fragmentation problem very efficiently.

To distribute the garbage collection overhead, whenever no garbage collection is needed during the entire rendering of a frame, and there are textures with sprite usage below a certain threshold, the least used texture could be preemptively garbage collected to prepare for the next bottleneck.
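To make the copying-collector idea concrete, here is a rough Python sketch under simplifying assumptions: a texture is modelled as a `Page` holding sprite sizes and last-used frame numbers, and the GPU-side copy is replaced by moving dictionary entries. `Page`, `dead_pixels`, `collect` and the `max_age` threshold are all hypothetical names, not anything from the game.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Page:
    """One atlas texture; space is counted in pixels for simplicity."""
    capacity: int
    sprites: dict = field(default_factory=dict)  # name -> (size, last_used_frame)

    def used(self) -> int:
        return sum(size for size, _ in self.sprites.values())


def dead_pixels(page: Page, frame: int, max_age: int) -> int:
    """Pixels held by sprites not drawn within the last max_age frames."""
    return sum(s for s, last in page.sprites.values() if frame - last > max_age)


def collect(pages: list, spare: Page, frame: int, max_age: int = 60) -> Page:
    """Copy live sprites from the most wasteful page into the spare page.

    Returns the evacuated page, which becomes the new spare.
    On a GPU the copy would be a texture-to-texture transfer.
    """
    victim = max(pages, key=lambda p: dead_pixels(p, frame, max_age))
    for name, (size, last) in victim.sprites.items():
        if frame - last <= max_age:
            spare.sprites[name] = (size, last)  # survivors end up packed together
    victim.sprites.clear()
    pages[pages.index(victim)] = spare
    return victim
```

Because survivors are written into an empty page, the free space left in it is contiguous, which is the property the post relies on to avoid fragmentation.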

The sprite atlas would thus be dynamically updated as the player moves around, and would not need to be as big since most factories do not have that many different entities in one location.

Some additional logic is needed to keep animation frames together, keep sprites used in the same render layer together, prioritise train sprites that can move into view at any time, etc.

Also, dynamically adding lower resolution versions to the atlas when the player zooms out would greatly improve performance with low VRAM.

Factorio also has a lot of shadows. If they are only polygons filled with a single RGBA color they could probably be stored as vectors and rendered as geometry somehow to make them occupy less VRAM.

Good point. Alternatively, you can store them in a separate texture with format GL_RED, so only one byte per pixel is used instead of four. In fact, since the shadows do not seem to be using different shades of darkness, the shadow sprite can be horizontally compressed by a factor of eight by just using one bit per pixel (shadow on/off). A dedicated Fragment Shader could select the correct bit during rendering. This way shadows would be reduced in size by a factor of 32.
Furthermore, storing shadows at an even lower resolution (1/2 or 1/4) and interpolating inside the Fragment Shader would give them a blurry edge, which might actually be good looking. I think interpolation would need to be coded separately instead of using a GPU sampler though because of the 1 bit per pixel.
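The factor of 32 comes from going from 4 bytes per pixel (RGBA) to 1 bit per pixel: 4 x 8 = 32. A small Python sketch of the packing and of the bit selection a dedicated shader would perform; the function names and the MSB-first bit order are assumptions made for illustration.

```python
def pack_shadow_row(row):
    """Pack a row of 0/1 shadow flags into bytes, 8 pixels per byte (MSB first)."""
    out = bytearray((len(row) + 7) // 8)
    for x, on in enumerate(row):
        if on:
            out[x // 8] |= 0x80 >> (x % 8)
    return bytes(out)


def shadow_bit(packed, x):
    """What a dedicated fragment shader would do: select bit x from the row."""
    return (packed[x // 8] >> (7 - x % 8)) & 1


row = [0, 1, 1, 0, 0, 0, 0, 1, 1, 0]
packed = pack_shadow_row(row)
rgba_bytes = 4 * len(row)  # 4 bytes per pixel as RGBA, versus 1 bit when packed
```

This only works for strictly on/off shadows; any semi-transparent shadow would have to be thresholded first.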

In fact, since the shadows do not seem to be using different shades of darkness, the shadow sprite can be horizontally compressed by a factor of eight by just using one bit per pixel (shadow on/off).

I never noticed that most shadows in base are pitch-black until you just mentioned it. So I took a look at the base files and there are a few that have semi-transparency. E.g. hr-H-T-6-unload-connect-shadow.png and hr-solar-panel-shadow.png are good examples. Also (my own) mods don't have any black shadows at all (though that's more due to default Blender settings than personal choice), so it would require all shadows to be recreated to look good (unless incidentally just doing a 50%-opaque-to-black/white-threshold clipping on those shadows randomly happens to look decent). In any case you'd be optimizing away the possibility for better shadows in the future.

Speaking of shadows, I also have some tall structures that cast shadows on top of other structures. That of course doesn't work in the shadow layer, which is rendered below everything else, so I have them in layers not marked as shadow (which I guess makes the point moot for shadow layer optimizations :p).

Discard operation doesn't improve performance; it just tells the driver not to write anything to the output buffer. The reason is that GPUs use vector processors: the entire GPU core must execute the same operation on all the pixels it's processing, which is quite many. If at least one of them didn't need to be discarded, the entire core would still be busy rendering that one single pixel, nullifying any would-be gains. Additionally, the GPU may simply take the "discard" operation as the output write-lock flag and compute the entire shader anyway. For the same reason you avoid using IF-switches for performance gains: the core would run all the commands anyway so you just wouldn't gain anything, and it could actually end up running worse.

As per Apple's performance guidelines for 2d apps, what you should do is create a 3d (2d) model that approximates the outline of your non-transparent sprite and draw that instead. Pixels outside the mesh are never drawn and thus actually saved. The cost of a few extra polygons is so marginal it's not even a point to consider. Do consider, however, computing texture coordinates in the vertex shader, the rationale being that computing them for 10-or-so points should be computationally cheaper than doing it for 100 000-or-so points (also, GPUs tend to incur an additional penalty just for computing texture coordinates in pixel shaders, on top of the actual computational cost). But vertex shaders are known to be whimsical performance-wise, so do some testing.
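One way to estimate how much such an outline mesh could save is to compare the sprite's full rectangle with a hull around its non-transparent pixels. The Python sketch below uses a convex hull (Andrew's monotone chain) purely as an easy-to-compute stand-in; real tools may build tighter, possibly concave, low-poly outlines. All function names are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop while the last turn is clockwise or collinear.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def polygon_area(poly):
    """Shoelace formula."""
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))) / 2


def hull_of_opaque_pixels(alpha):
    """Hull around the corners of every pixel whose alpha is non-zero."""
    corners = [(x + dx, y + dy)
               for y, row in enumerate(alpha)
               for x, a in enumerate(row) if a > 0
               for dx in (0, 1) for dy in (0, 1)]
    return convex_hull(corners)
```

Dividing `polygon_area(hull)` by the sprite's width times height gives the fraction of fragments that would still be shaded when drawing the hull instead of the full quad.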

Some explanation added for anyone wondering how it works. Please don't consider that patronizing.

Very interesting read. The Virtual Texture approach may be a bit of overkill for Factorio though, since it optimises for the general case where only a subset of the sprites is used, and for these sprites only small sections of the actual pixels are used besides.

I don't understand what you mean by "optimises for the general case where only a subset of the sprites is used", because needing only a fraction of your texture data is a prerequisite for being able to stream textures in the first place. All in all, I disagree with your statement. It is a method for rendering geometry with very large textures using little VRAM. In our case, the large textures are sprite atlases.

Virtual Textures would just make your Fragment Shaders needlessly complex to select the right tile using the Indirection Texture (as described in Advanced Virtual Texture Topics you mentioned).

Yeah, it adds some complexity, and it would be more convenient if we didn't have it, but it is just an extra texture fetch and a little bit of math. After all, the Advanced Virtual Texture Topics paper was written 10 years ago, and we are not utilizing the GPU's computational power otherwise. Also, GPUs from the past few years support sparse textures (GL_ARB_sparse_texture2, Tiled Resources in Direct3D 11+), so we could remove the pixel shader complexity on new GPUs.

A simpler approach would be to make the entire sprite atlas dynamic, not just the BitmapCache one. Suppose the hardware can only support 16 textures at a time. Start out with blank textures and dynamically fill them with the sprites that the frame being rendered requires. Also keep track of which sprites in each texture have been used recently.
When the rendering pass filled up the 15th texture, decide which texture has the largest amount of unused sprites. Copy (glCopyTexSubImage2D) the sprites still in use to the 16th texture which will now have unoccupied and unfragmented space to add more sprites. The texture copied from is cleared for the next time space runs out.
...

Hmm, I have not considered dividing the problem into smaller chunks like this, thanks for the suggestion. However, it still doesn't sound like something simpler to do. My biggest concern with a dynamic atlas is what should happen when the atlas becomes nearly full (as in, almost all sprites in it are needed), and what the performance cost is of keeping track of sprites, packing them into the atlas dynamically and running defragmentation. But I could give it a day to test it out.

I would like to go in the direction of packing the virtual atlas more efficiently. We do it on startup because we have a ton of settings that modify how atlases will be laid out (and what will be in them), and mods can replace vanilla graphics, so there is no need to load those if they are not going to be used. But everyone could have the same virtual atlas, so we could compute a much better layout of sprites in it offline during deploy - if we go with rendering sprites as polygons instead of rectangles, we could account for that in packing ... if we do it offline it doesn't matter if the algorithm runs for 10 minutes.

Discard operation doesn't improve performance; it just tells the driver not to write anything to the output buffer. The reason is that GPUs use vector processors: the entire GPU core must execute the same operation on all the pixels it's processing, which is quite many. If at least one of them didn't need to be discarded, the entire core would still be busy rendering that one single pixel, nullifying any would-be gains. Additionally, the GPU may simply take the "discard" operation as the output write-lock flag and compute the entire shader anyway. For the same reason you avoid using IF-switches for performance gains: the core would run all the commands anyway so you just wouldn't gain anything, and it could actually end up running worse.

Thank you for the information. Here is what I observed exactly:
We cache terrain tiles from the previous frame, so as not to have to re-render the entire terrain when the player moves just a few pixels. If the player doesn't move, the cached terrain is blitted to the game view framebuffer and nothing else happens. The timings that I mentioned are of this blit operation (query GPU timestamp, draw cached terrain, query second GPU timestamp).
We have shadows as separate sprites from entities, and we draw them to an offscreen buffer first, so they don't add up as they overlay each other; then we blend the offscreen buffer with 50% opacity over the game view. This "shadow map" has large areas of fully transparent pixels, as shadows don't cover the entire screen. If I measure this blending of shadows to the game view, it is faster than blitting terrain, and it gets slower as a bigger area is covered by shadows, regardless of whether I discard transparent pixels explicitly or not. But this is not what I saw on the GT 330M: there the shadow layer took the same time as terrain, unless I explicitly discarded transparent pixels. So my theory is that modern GPUs somehow discard transparent pixels by default (either baking the discard into the shader, depending on the current blend mode; or maybe they have logic in the Output Merger for it?).
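The reason for the offscreen pass can be shown on a one-dimensional strip of pixels. This is only an illustrative Python sketch: the function names are made up, shadows are modelled as fully opaque spans, and `naive` exists just to show the double darkening that the offscreen buffer avoids.

```python
SHADOW_OPACITY = 0.5  # the 50% figure from the post


def accumulate_shadows(width, shadow_spans):
    """Draw shadow sprites into an offscreen mask; overlaps do not add up."""
    mask = [0.0] * width
    for start, end in shadow_spans:  # each span stands for one shadow sprite
        for x in range(max(start, 0), min(end, width)):
            mask[x] = 1.0            # overwrite, never accumulate
    return mask


def composite(terrain, mask):
    """Blend the mask over the terrain once, at a fixed opacity."""
    return [t * (1.0 - SHADOW_OPACITY * m) for t, m in zip(terrain, mask)]


def naive(terrain, shadow_spans):
    """For contrast: blending each shadow directly darkens overlaps twice."""
    out = list(terrain)
    for start, end in shadow_spans:
        for x in range(max(start, 0), min(end, len(out))):
            out[x] *= 1.0 - SHADOW_OPACITY
    return out
```

With two overlapping spans, `composite` gives a uniform 50% darkening, while `naive` leaves a darker band where the shadows overlap.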

As far as I know, discarding pixels does make sense, especially if your pixel shader is simple and ROPs are likely to be the bottleneck, isn't that true? But I agree that not shading pixels that would be discarded in the first place is better.

One thing I can say after playing Warframe is "Don't be afraid to make the streaming visible". It is VERY visible when you switch 'frames in the inventory that it's streaming the body mesh and textures even after it's drawn onscreen. I can't see it being intrusive to take a few moments longer for the full-resolution textures to stream in, because you don't need full resolution all the time, and when you do zoom in, it'd be nice to see everything in the full dev-intended glory... I don't think it'd be too intrusive to have the 0.5x or Normal Res mipmaps for Full Res linger a few extra frame draws behind the zoom level. Maybe it'll bug some folks' eyes, but that's something that has to be tested for.

That is to say, if you can decouple the texture streaming from the simulation state, it can make streaming agnostic to the simulation, and possibly even run in a parallel thread because it's not touching the determinism.

Discard operation doesn't improve performance; it just tells the driver not to write anything to the output buffer. The reason is that GPUs use vector processors: the entire GPU core must execute the same operation on all the pixels it's processing, which is quite many. <snip>

Interesting. I was under the impression that if large clusters of adjacent fragments all contained an early-out discard, then some/most of them would avoid the remaining pixel shader expense. But, it seems at least on my GPU (Fuji) you are right, because reading your post here caused me to realise there's a big performance hit in my own project that I should fix - thanks for that!

That said, my understanding is that there are two relevant expenses here - the cost of the fragment shader (which isn't avoided if you do an early discard), and the cost of writing to the render target (which is avoided by the discard). So, whether discarding in the fragment shader helps is dependent on whether you're fill rate bound or fragment calculation bound. Am I mistaken about that? Mostly when I hear of people cutting up sprite geometry to minimise fragment shader invocations, it seems to be mobile developers. Perhaps mobile GPUs are more liable to be bottlenecked by fragment shader computations for some reason? I'm kind of just guessing now - I've run out of relevant knowledge.

Very interesting read. The Virtual Texture approach may be a bit of overkill for Factorio though, since it optimises for the general case where only a subset of the sprites is used, and for these sprites only small sections of the actual pixels are used besides.

I don't understand what you mean by "optimises for the general case where only a subset of the sprites is used", because needing only a fraction of your texture data is a prerequisite for being able to stream textures in the first place.

I meant that it covers a more general case because of what I wrote in italics: only small sections of the actual pixels [of a sprite] are used. In 3D games it is apparently very common to have large high-res sprites of, say, a wall, but then the scene only contains a small wall, so only a small section of the sprite is needed. Virtual Textures allow you to just promote the tiles containing that section to VRAM. This is not so useful for Factorio, as it almost always needs the entire sprite. In fact, since the Virtual Texture tiles will almost always contain some unused pieces of other sprites on their edges, it seems to me that it utilizes VRAM less efficiently.

When the rendering pass filled up the 15th texture, decide which texture has the largest amount of unused sprites. Copy (glCopyTexSubImage2D) the sprites still in use to the 16th texture which will now have unoccupied and unfragmented space to add more sprites. The texture copied from is cleared for the next time space runs out.
...

Hmm, I have not considered dividing the problem into smaller chunks like this, thanks for the suggestion. However, it still doesn't sound like something simpler to do. My biggest concern with a dynamic atlas is what should happen when the atlas becomes nearly full (as in, almost all sprites in it are needed), and what the performance cost is of keeping track of sprites, packing them into the atlas dynamically and running defragmentation. But I could give it a day to test it out.

Keeping track of sprite usage should be as simple as giving them a last-used-in-frame timestamp. Defragmentation should never be needed, since during the copying garbage collection, all sprites that survive are moved to the top of the new texture, so there would simply be no gaps (unless you are talking about something else?). I don't expect that packing them dynamically into the texture would be that computationally expensive, unless you want to make that more space efficient by packing them as polygons instead of rectangles, as you said.

When the atlas becomes nearly fully in use, I'd say the player has a lot of different entities all together in the same view. You would have the overhead of copying 90+% surviving sprites to the new texture and gain little free space, so that's inefficient. In this extreme case I'd just fill the last empty texture, and when that runs out, clear the least-used texture (or pick one at random) and accept the cache misses that follow.

It's worth pointing out that Virtual Textures have exactly the same problem when too many tiles are required at the same time, although you can drop the tiles one by one instead of an entire texture of sprites at once.

Ideally, when the entire atlas is in use, and the FPS drops as a result of it, a temporary mode could be entered in which the textures are filled with low-res sprites until the scene becomes less demanding again. Preferably this should be an option for the player, to prefer low-res over low FPS. I myself would certainly prefer high-res most of the time, and low-res without frame drop occasionally.

... - if we go with rendering sprites as polygons instead of rectangles, we could account for that in packing ...

The space efficiency of tightly packing sprites together as polygons may offset the cost of Virtual Texture tiles rarely aligning with the exact sprite you need.
If packing as polygons is too expensive to do dynamically, my suggested approach may indeed lose the advantage. It's hard to predict for me which is more efficient though.
Still, I suspect that making the entire atlas dynamic as I suggested may just be a lot easier from where you are now.

I have no experience with hardware rendering stuff, but I was under the assumption that a proper modern video card could handle shadows and transparency using shaders instead of dedicated sprites? (At least I think that's how Minecraft handles it, and it's why the shadows look very 3D while everything else doesn't.)

Isn't it possible to use shaders and mipmapping to handle all those parts and not have so many render passes?

Discard operation was introduced to do alpha testing in 3d graphics. This allowed you to render your transparent-textured geometry with no blending whatsoever (so that you could utilize the depth buffer), test the texel alpha value against some other value (such as 0.5), and use "discard" to tell the GPU that this pixel should not be output. The pixel would either be overwritten from the texture, or left as it was before the render operation. The effect was this sharp-edged fake transparency on fences and grass. It didn't exist to save performance in the first place, and due to the vectorized nature of GPU processors, even theoretically it wouldn't do that in a sensible application (rendering huge chunks of transparent pixels is not sensible).
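In software terms, the alpha test described above boils down to a per-pixel choice between "overwrite" and "leave alone". A tiny Python sketch, with a hypothetical function name and the 0.5 threshold taken from the example in the post:

```python
def alpha_test_write(dest, src_rgb, src_alpha, threshold=0.5):
    """Return the framebuffer value after an alpha-tested write (no blending)."""
    if src_alpha < threshold:
        return dest   # "discard": the existing pixel is left untouched
    return src_rgb    # otherwise the texel overwrites the pixel outright
```

There is no in-between result, which is why the edges of alpha-tested fences and grass look hard rather than smoothly transparent.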

Blending is a memory-bandwidth-intensive and somewhat computationally intensive operation. Simply disabling blending could immediately improve rendering performance by 15-25%. It would make sense if the GPU tried to silently drop it when it detected that the intended operation wouldn't do anything.

As I said, discard operation doesn't usually have any positive performance impact whatsoever, even if alpha testing is the first thing you do. On my machine simply having "discard" anywhere in the shader tanks performance by up to 30%, even if it's never actually used. That said, it's a Windows install and Windows isn't exactly known for having good OpenGL drivers. Either way, clipping transparent sections off sprites using low poly hull meshes would be a better option. The shadow map - just can't be helped.

At this point, I think it would be simpler and faster to render actual 3d models instead of using 2d graphics at all. The way I see it, continuing to use 2d graphics is more of a programming challenge than anything else. The 3d models would render faster and there wouldn't be nearly as many memory-related issues. With the current staff, it's reasonable to create high quality 3d models and LODs for everything. And the aesthetic choice argument is just weak: it doesn't look any better than 3d, because it's all just pre-rendered 3d graphics. Plus, tons of people would prefer actual 3d graphics. Tons of people don't buy this game right now because it's 2d.

I have no experience with hardware rendering stuff, but I was under the assumption that a proper modern video card could handle shadows and transparency using shaders instead of dedicated sprites? (At least I think that's how Minecraft handles it, and it's why the shadows look very 3D while everything else doesn't.)

Sprites don't contain any 3d data to construct proper shadows in any fashion. You could do a makeshift shadow by rendering the same sprite in black, half-transparent, and slanted (to simulate sun angle). But it would look terrible for anything more geometrically complicated than a vertical pole.

Virtual Textures allow you to just promote the tiles containing that section to VRAM. This is not so useful for Factorio, as it almost always needs the entire sprite. In fact, since the Virtual Texture tiles will almost always contain some unused pieces of other sprites on their edges, it seems to me that it utilizes VRAM less efficiently.

Yeah, I thought that might be a problem too, but it turned out not to be one (as long as we don't want to push VRAM usage to some extremely low value, like 32MB). First of all, 128x128 px is not by any means large ... it's 2x2 game tiles in high-res. Second, the unused part is not completely dead: it still contains some sprite data which might end up being needed in the next few frames, and we can make the probability of this happening higher with a smarter sprite layout (for example grouping all player armor1 animations together, armor2 animations together, etc.). I don't try to figure out which parts of sprites are visible and request only those parts, but I have it on my todo list to test for terrain tiles, as concrete has 512x512px textures of which only a single 64x64 region could be used (especially with hazard concrete).

It's worth pointing out that Virtual Textures have exactly the same problem when too many tiles are required at the same time, although you can drop the tiles one by one instead of an entire texture of sprites at once.

As I mentioned in the FFF, when the physical texture becomes full, the system will start replacing high resolution tiles. The fact that I can just modify indirection table to point to lower resolution tile is extremely helpful. When testing vanilla high-res with 256MB physical texture, the fallback to lower resolution was not happening though.
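The fallback works because a lookup that misses at one resolution can simply be redirected to a coarser tile that is still resident. The Python sketch below illustrates that lookup on the CPU side; the class, its methods and the "coarser mip halves the tile coordinates" layout are assumptions for illustration, not the game's actual data structures.

```python
class IndirectionTable:
    """Maps (mip, tile_x, tile_y) of the virtual atlas to a physical tile slot."""

    def __init__(self, mip_levels):
        self.mip_levels = mip_levels
        self.resident = {}  # (mip, tx, ty) -> physical slot index

    def load(self, mip, tx, ty, slot):
        self.resident[(mip, tx, ty)] = slot

    def evict(self, mip, tx, ty):
        self.resident.pop((mip, tx, ty), None)

    def lookup(self, mip, tx, ty):
        """Return (mip, slot) for the finest resident tile covering the request.

        Each coarser mip halves the tile coordinates, so one coarse tile
        covers a 2x2 block of tiles at the next finer level.
        """
        while mip < self.mip_levels:
            slot = self.resident.get((mip, tx, ty))
            if slot is not None:
                return mip, slot
            mip, tx, ty = mip + 1, tx // 2, ty // 2
        return None  # nothing resident, not even the coarsest level
```

Evicting a high-resolution tile therefore never leaves a hole on screen as long as some coarser level stays resident; the sprite just gets blurrier.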

The space efficiency of tightly packing sprites together as polygons may offset the cost of Virtual Texture tiles rarely aligning with the exact sprite you need.
If packing as polygons is too expensive to do dynamically, my suggested approach may indeed lose the advantage. It's hard to predict for me which is more efficient though.

2D rectangle bin packing is an NP-hard problem. There are reasonably fast algorithms that give non-optimal solutions (but not that far off from optimal), but I suspect that not packing everything all at once, and instead adding rectangles in smaller batches, will throw a wrench into things. Anyway, the whole thing sounds quite non-trivial to me. When I have multiple competing ideas and I don't know which one is the best, I either try them all out or, if that would take too long, I start quickly prototyping them from the simplest one, and stick with the first one that seems to work well (often even the idea that is simple to implement turns out to be good enough). So that's how I approached the streaming. Virtual texture mapping looked much simpler to implement, and even the crude first prototype, which didn't remember which tiles were in the physical texture more than 2 frames back, worked much better than I expected, so I stuck with it.
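As an example of the kind of fast, non-optimal heuristic mentioned above, here is a sketch of online shelf packing in Python. It is one of the simplest rectangle packers and is shown only to illustrate incremental packing; it is not claimed to be what either poster had in mind, and the class name is hypothetical.

```python
class ShelfPacker:
    """Online shelf packing: a simple, non-optimal rectangle packing heuristic."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.shelf_y = 0   # top of the current shelf
        self.shelf_h = 0   # height of the tallest sprite on the shelf
        self.cursor_x = 0  # next free x position on the shelf

    def add(self, w, h):
        """Place a w x h rectangle; return (x, y) or None if it does not fit."""
        if w > self.width:
            return None
        if self.cursor_x + w > self.width:  # shelf is full: open a new one
            self.shelf_y += self.shelf_h
            self.shelf_h = 0
            self.cursor_x = 0
        if self.shelf_y + h > self.height:
            return None
        pos = (self.cursor_x, self.shelf_y)
        self.cursor_x += w
        self.shelf_h = max(self.shelf_h, h)
        return pos
```

Rectangles arriving in small batches can be added one at a time, at the cost of wasted space above shorter sprites on each shelf; sorting a batch by height before adding it usually reduces that waste.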

Still, I suspect that making the entire atlas dynamic as I suggested may just be a lot easier from where you are now.

The bitmap cache is replaced by virtual texture mapping with the texture data being kept in RAM. The next big challenge is to load the textures from disk on demand, but a different streaming technique would not make that easier. The big question is: who does it help? How much is it needed? How much time is it worth spending on? The thing is, when I started experimenting with streaming, I didn't know the latest Steam HW survey stats, and I was under the impression that high video memory usage is a big problem for a large portion of players. When I was testing the streaming on low-end HW, I found out I can load 0.16 with high-res sprites and mods that increase total VRAM usage to 4GB just fine on an old GPU with 1GB VRAM, even with "Low VRAM mode" disabled, and it doesn't even run that much worse than 0.17 with streaming, so maybe we should optimize other areas first before trying to push streaming tech forward?

Blending is a memory-bandwidth-intensive and somewhat computationally intensive operation. Simply disabling blending could immediately improve rendering performance by 15-25%. It would make sense if the GPU tried to silently drop it when it detected that the intended operation wouldn't do anything.

I was going to say I tried to disable blending for draws that I know don't need it and it didn't make a difference, but now I double-checked the code, and it just sets the blend state so that the background is overwritten, while blending stays enabled. So, I'll need to retest. But I definitely also tried to replace it with CopySubresourceRegion (DirectX) and it still didn't seem to make a difference.

As I said, discard operation doesn't usually have any positive performance impact whatsoever, even if alpha testing is the first thing you do. On my machine simply having "discard" anywhere in the shader tanks performance by up to 30%, even if it's never actually used.

Is that with depth-testing enabled? If so, discard in fragment shader disables Early-Z test.

Is that with depth-testing enabled? If so, discard in fragment shader disables Early-Z test.

It's in "2d mode", so "standard" settings: blending enabled, depth buffer (& testing) disabled. Put otherwise, it's a default rendering mode in Love2D engine - if that tells you anything. It uses SDL2 as a backbone and having inspected its source code (and even made some contributions) I'm pretty sure there isn't an appreciable difference between liblove and libsdl in terms of rendering performance - for something as simple as this test anyway.

I think the game could still be converted to 3d mode without colossal amounts of effort. You could start with using orthographic projection and character & biter models rendered as 3d in an otherwise flat 2d environment rendered the usual way - without depth testing but with blending and depth sorting. Of course, if conversion is not realistic at this point (easier to finish the game as it is) then it would be perfectly sensible to carry on with pre-rendered graphics. But if it's realistic or even preferable, then having already done a colossal amount of work on 2d graphics shouldn't hold you back; heed the warning of the sunk cost fallacy.

Is that really less performance intensive than generating one in-game by using the already existing sprites?

Game objects are modelled as 3D objects and are then projected ('rendered') onto a 2D plane ('sprite'). The shadow has to be calculated from the 3D object. So you can't "generate" a shadow live, because the sprite lacks the 3D information required. Imagine trying to guess the height of a perfect cuboid when you only see one side of it.

I think what he's asking is why in the atlas (There was one shown in the FFF) is there a shadow layer separate from a main layer. Wouldn't it be more efficient to merge them into a single layer when adding to the atlas?

If you read previous FFFs... maybe. If there's a mask layer too, probably, but with just the main sprite and shadows... it depends. For things like power poles, where there's a lot of vertical in the sprite but horizontal in the shadow, the answer is no, because you'd end up with a lot of dead space. But for big fat sprites like an oil refinery, you probably would save space by fusing them into one.

@bob: Pre-baking would remove the ability to control whether a shadow casts onto or behind an entity.

Did some experiments with shadows and it's kinda weird. Obviously some shadows should be cast on top of other entities, while other long-ish shadows would (realistically) be cast behind things. But the logic behind this isn't really good. It's based on object position, so if object A's center is further north than object B's center, its shadow won't be cast onto B. Except that the construction preview always casts on top. (And other weirdness)

Example 1, the refinery doesn't cast a shadow on any of the chests. Only the shadows cast onto each other:

[Attachment: refineryshadow.png]

Example 2, the leftmost pole should be casting a shadow on the tank, but it's too far north and so the shadow gets clipped. The preview for the third pole shows the correct shadow.