GPU Instancing performance variation

Hey guys, I implemented GPU instancing, and everything works great when no scripts are attached to the prefab I am instancing. I can instantiate 20,000 spheres and still get a decent 35 FPS; with 10,000 objects I get roughly 65 FPS, and with 5,000, 92 FPS. But when I attach a very simple script with one line of code to the prefab, performance decreases drastically.

With 5,000 I get 23 FPS, with 10,000 I get 9 FPS, and with 20,000, 3 FPS.

When I did the same process with a fairly more complex script (roughly 200 lines of code), performance dropped even more, and when I tried 20,000 objects my computer could hardly instantiate them and eventually got stuck. Basically, I am assuming there is a lot of behind-the-scenes work Unity is doing to compile all the scripts attached to all those prefabs, and that is what is causing the performance lag. Interestingly, though, when I check performance in the profiler I only see a big cost under "Physics", then under "Others", but "Scripts" does not even appear. See the attached image for reference.

Is there any way to save performance when more complex scripts are used? These are just small tests for a larger project I am developing, where I have quite a long set of scripts. My goal is to run at least 5,000 instances at a decent frame rate; 35-50 FPS would be fine.

3 Replies

C# JIT compilation is only done once, when your C# script is loaded for the first time, and I think Unity adds some minor overhead afterwards to cache references to Unity event functions such as Update().
However, this overhead does not increase with the amount of objects that use that script, so it's unlikely to be the issue in your case.

Since your script is so simple, the overhead is most probably the cost of calling code in a managed assembly (your C# scripts) from an unmanaged executable (the Unity engine). That overhead is quite significant, so you should always try to minimize the use of per-frame events like Update() as much as possible.

When you have to update a large number of objects with the same script, you should therefore make sure to update them all from a single script instead of having each object update itself.
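A minimal sketch of that manager pattern, assuming the per-instance work is something trivial like moving each object upward (the class names here are illustrative, not from the original post):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical manager: one Update() call drives all instances,
// instead of Unity dispatching Update() on each of thousands of objects.
public class SphereManager : MonoBehaviour
{
    [SerializeField] GameObject prefab;
    [SerializeField] int count = 5000;

    readonly List<Transform> instances = new List<Transform>();

    void Start()
    {
        for (int i = 0; i < count; i++)
            instances.Add(Instantiate(prefab).transform);
    }

    void Update()
    {
        // One engine-to-managed call per frame; the loop body is
        // ordinary C# with no per-object event dispatch overhead.
        float dt = Time.deltaTime;
        for (int i = 0; i < instances.Count; i++)
            instances[i].position += Vector3.up * dt;
    }
}
```

The prefab itself then carries no MonoBehaviour with an Update() method at all.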

Unity is well aware of this issue and has released the Entity Component System to help implement such scripts in a clean way that avoids the overuse of Update() events.

If you still need more performance, the C# Job System is the way to go, because it even allows your code to run multithreaded, but it is a bit harder to implement.
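As a rough sketch of what that looks like, here is a hedged example using IJobParallelFor; it assumes the per-instance work is just a position increment, and the struct and field names are invented for illustration:

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical job that advances all positions in parallel on worker threads.
struct MoveJob : IJobParallelFor
{
    public NativeArray<Vector3> positions;
    public float deltaTime;

    public void Execute(int i)
    {
        positions[i] += Vector3.up * deltaTime;
    }
}

public class JobDriver : MonoBehaviour
{
    NativeArray<Vector3> positions;

    void Start()
    {
        positions = new NativeArray<Vector3>(5000, Allocator.Persistent);
    }

    void Update()
    {
        var job = new MoveJob { positions = positions, deltaTime = Time.deltaTime };
        // 64 is the inner-loop batch size: how many indices each worker takes at once.
        JobHandle handle = job.Schedule(positions.Length, 64);
        handle.Complete(); // wait for the results before using them this frame
    }

    void OnDestroy() => positions.Dispose();
}
```

You would still need to copy the results back to whatever consumes them (e.g. the instancing matrices), which is where the extra implementation effort comes in.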

"Basically I am assuming that there is a lot of behind the scene things unity is doing to compile all the scripts attached to all those prefabs and that is what is causing the performance lag."

For C#, compilation happens at build time, not at run time, though there can be runtime implications depending on how the .NET code is built. On some platform targets it may well be compiled ahead of time to native code (that is, to the CPU's native instructions, so there is no compilation at runtime).

What you're observing is the cost of the code itself running. GPU instancing is a powerful technique, but it only optimizes the visible rendering of those instances; it has no effect on objects that each have code instantiated for them. While the graphics become more efficient, that does nothing to change the fact that 5,000 C# objects are instantiated (one per instance), and code is executed for each of them during Update and FixedUpdate (where applicable).

What may be required in this situation is for you to say what your code does. The aim is not to improve the efficiency of that code (which may have limited impact, since its cost is multiplied by the number of instances), but to imagine a way of implementing its purpose without attaching it to each instance. Without knowing why you have the code attached I can't hope to advise, but if you can fashion a supervisor, something that controls the instances, you may be able to greatly improve performance. For most optimization, the basic rule is that changing the algorithm, or the underlying method of performing the work, tends to have the largest impact on performance.

If you're hardcore about this, you might investigate applying C++ code to objects rather than C# (you would basically be making a native plugin; Google provides resources). The impact varies greatly, from minuscule to miraculous. It may be exactly the direction you want, or a nightmare you should avoid, depending on how deep you tend to go and how comfortable you are with C++. Before you consider it, weigh your situation and objectives: this would start primarily as a learning experience and an engineering experiment to see whether it works for you, and if you can't afford to invest time in that investigation, you're not a candidate for it. Since the fundamental problem is still the attachment of code to thousands of objects (and the associated work Unity performs calling Update and FixedUpdate, and/or whatever else is hooked in), C++ can offer SOME benefit (which could be exactly what you need), or no benefit at all. Once you've practiced some C++ in Unity, you'll find it merely an alternative method of coding (once the prerequisites are out of the way), though with considerable overhead for each platform target.

That said, I think you'll observe that if you attach scripts that do nothing, you'll probably find little impact. If you then take that do-nothing script and add trivial work in Update and/or FixedUpdate, you'll find an overhead you can measure by comparison, which is just the work required to make the method calls on 5,000 objects. After that, everything Update or FixedUpdate does is multiplied by the number of objects running that code. This is why finding any way to accomplish your intentions with code that is not attached to every object has the highest potential impact.

and it still dropped performance horribly. The second script I tested was just some vector math for the alignment and cohesion behaviors used in flocking. My real project at the moment is composed of different classes; the ones directly doing the computation on the instances are one base class of roughly 1,400 lines of code, which two other scripts inherit from. Those two scripts are the ones attached to the prefabs, and together they are around 800 lines. I have done some C++ a while back, but without using it to its full potential. I am also doing some nasty distance checks, although not on every Update call, and I am using a KDTree for that.

Regarding this:

"but to imagine a way of implementing the purpose of that code without being attached to each instance. Without knowing why you have the code attached I can't hope to advise, but if you can fashion a supervisor, something that controls the instances, you may be able to greatly improve performance."

Since incrementing a position is minor work, you've demonstrated the overhead incurred when Unity's framework must schedule a method call for every C# object running Update code.

Unfortunately, transform.position can't be updated from other threads (as far as I know). However, if there were a container of all the transforms, then a supervisory class could sequence through that container in its Update method to perform the work. Further, the work could be limited to only part of the collection (to limit the performance impact), such that each subsequent update continues on to the next portion until all are completed; that is something that can't be done with Update on each instance.

The key, therefore, is how to create such a container. Perhaps in Start (or Awake) each instance could register itself, adding itself to the container owned by the supervisor class. That would be a one-time setup cost.
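A minimal sketch of that registration step, with invented class names and a static list purely for illustration:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical supervisor that owns the container of transforms.
public class FlockSupervisor : MonoBehaviour
{
    public static readonly List<Transform> Members = new List<Transform>();
}

// Attached to the prefab: its only job is the one-time registration,
// so no per-frame Update() is ever dispatched to the instance.
public class FlockMember : MonoBehaviour
{
    void Start() => FlockSupervisor.Members.Add(transform);
    void OnDestroy() => FlockSupervisor.Members.Remove(transform);
}
```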

This is obviously limited to the transform, which I used as an example following the code you posted. If you run the experiment, you'll know the relative impact (I must admit I'm working on a theory about Update dispatch being heavy and responsible for the problem you're describing). Further, you could experiment with processing only a portion of the population in each update cycle. For example, say your experiment shows 5,000 is still too heavy, but 1,000 is acceptable. Then for each Update in the supervisor, process only indices 0 to 999; on the next update, process 1,000 to 1,999, then 2,000 to 2,999, and so on until completed, then start at zero again. In this way you limit the weight of the overhead; for 5,000 objects, assuming Update attempts to fire every 1/60th of a second, it would be like animating the population at 12 frames per second (given my example values).
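The sliced processing described above might look like this sketch (the slice size and movement are example values, not a recommendation):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical supervisor that updates the population in slices of 1,000,
// spreading the total work across successive frames.
public class SlicedSupervisor : MonoBehaviour
{
    public List<Transform> members = new List<Transform>();
    const int SliceSize = 1000;
    int cursor; // index where the next slice begins

    void Update()
    {
        int end = Mathf.Min(cursor + SliceSize, members.Count);
        for (int i = cursor; i < end; i++)
            members[i].position += Vector3.up * Time.deltaTime;

        // Advance to the next slice, wrapping to zero once all are done.
        cursor = end >= members.Count ? 0 : end;
    }
}
```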

I should point out, too, that it is well known (in most languages) that a loop can be optimized by processing multiple entries per iteration. Compilers often do this automatically, so it is not always much of a difference, but it may be worth a try if you need to push a little further. The idea is that instead of incrementing the index by 1, you increment by 2 or 4; if incrementing by 4, you perform the action on index+0, index+1, index+2, and index+3 before the loop continues. This can require a "remainder" follow-up pass when the count isn't evenly divisible by the stride (4 in that example).
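Sketched as a fragment of the supervisor's inner loop (a 4x unroll with a remainder pass; the method name and step value are assumptions):

```csharp
using System.Collections.Generic;
using UnityEngine;

static class LoopUnrollExample
{
    // Hypothetical 4x unrolled version of the supervisor's movement loop.
    public static void MoveAll(List<Transform> members, Vector3 step)
    {
        int i = 0;
        int limit = members.Count - (members.Count % 4);
        for (; i < limit; i += 4)
        {
            members[i + 0].position += step;
            members[i + 1].position += step;
            members[i + 2].position += step;
            members[i + 3].position += step;
        }
        for (; i < members.Count; i++) // remainder pass for the leftover 0-3 items
            members[i].position += step;
    }
}
```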
