I represent these functions in bytecode format, as a piece of bytecode that given an input position p computes a vector field strength F_i.

By running each step of the virtual machine over a thousands of input points, the cost of decoding and interpreting the bytecode instructions is amortized over all those points.

This means that we get the bytecode decoding "for free" -- the bytecode can run at nearly native speed.

Bytecode format

In the last article I didn't say much about what format I used for the bytecode. Generally speaking, designing a bytecode format can be tricky, because you have to balance the compactness (keeping programs short) against the decoding cost (keeping bytecode fast).

Lucky for us, we don't care about either of these things. Compactness doesn't matter, because our programs will be very short anyway (just a few instructions). Decoding cost doesn't matter (much), because it is amortized.

When it doesn't really matter I always pick the simplest thing I can think of. In this case it is something like:

(instruction) (result) (argument-1) (argument-2)

Here, instruction is a 4-byte instruction identifier. result is a 4-byte channel identifier that tells us which channel the result should be written to. argument-1 and argument-2 are either channel identifiers or Vector4's with constant arguments. (Instructions of higher arity would have more arguments.)

Note that using 4 bytes for instructions and registers is beyond overkill, but it is the simplest option.

One annoyance with this representation is that I need different instructions depending on whether argument-1 or argument-2 is constant. For a 2-arity instruction, I need four variants to cover all cases. For a 4-arity instruction (such as select), I would need 16 variants.

There are two ways of dealing with this. First, I could make the code that executes each instruction a bit more complex, so that it can handle both constant and register arguments. Second, I could make all instructions operate only on registers and have a single instruction for loading constants into registers.

Unfortunately, both of these option results in significantly slower bytecode. In the first case, the extra logic in each bytecode executor makes it slower. In the second case, we need extra instructions for loading constants, which increases the execution time.

So at least for two argument functions, the best option seems to be to have separate code for handling each argument combination. For four argument functions, it might be better to use one of the other options.

Just to give you some example of how the bytecode works, here is some raw byte code and the corresponding disassembled bytecode instructions:

High-level language

You can't really expect people to author their effects in raw bytecode, or even in our "bytecode assembly language". Effect authors will be a lot more productive if they can use a more comfortable language.

I decided to create such a language and model it after HLSL, since it serves a similar purpose (fast processing of vectorized data). Programmers interested in writing vector field effects are probably already used to working with HLSL. Plus, if at some point we want to move some of this work to the GPU we can reuse the code.

To show what the high level language looks like, here is an implementation of a whirl effect:

If you squint, you may notice that this high level code exactly corresponds to the low level bytecode in the previous
example.

Just as with HLSL, although this looks like C it actually isn't C. Things that work in C may not work in this language and vice versa. I'm quite strict when I parse this. I figure it is better to be start by being strict rather than permissive. This gives you more leeway to extend or modify the language later while keeping backwards compatibility. A strict syntax can always be loosened later, but if you design the language with a too permissive syntax you can paint yourself in a corner (case in point: Ruby).

I usually don't bother with Lex or Yacc when I write a parser. They are OK tools, I guess, but if I can get by without them I prefer not to have the extra precompile step and to have code that is a bit more straightforward to read and debug.

For this language I parse the overall structure with recursive descent, and then use Dijkstra's algorithm to process each statement in the function body.

I generate the bytecode directly from the shunting yard algorithm. When I pop an operator from the operator stack I generate the bytecode for computing that operator and storing the result in a temporary register. I then push that register to the value stack so that the result can be used in other computations. Temporary channels are recycled after they are popped of the value stack to minimize the channel count.

Constant patching

Constants in the bytecode can be changed when an effect is played. I do this by directly patching the bytecode with the new constant values.

When I generate the bytecode I keep track of where in the bytecode different global constants can be found. This patch list is a simple array of entries like:

(hashed constant name) (offset in bytecode)

When playing a vector field effect, the gameplay programmer specifies the constant values with a table:

VectorField.add(vf, "whirl", {radius = 10})

I look through the patch list, find all the offsets of constants named "radius" and replace them with the value(s) supplied by the gameplay programmer.

Since globals can be patched later, I can't do constant folding when I generate the bytecode. (Without global patching, I could just check if both arguments were constants when I popped an operator, and in that case, compute the constant result and push that directly to the value stack, instead of generating a bytecode instruction.)

I could reduce the instruction count somewhat and improve performance by doing a constant folding pass on the bytecode after the globals have been patched, but I haven't implemented that yet.

Physics integration

In my physics system I maintain a list of all awake (non-sleeping) actors. I apply wind from a vector field with an explicit call:

This extracts the position of every awake actor that matches the collision filter and sends that list to the vector field for evaluation. It then does a second loop through the actors to apply wind forces from the returned wind velocities.

I've chosen to have an explicit step for applying wind, so that you don't have to pay anything for the wind support unless you actually use it. Having an explicit step also opens up the possibility to have other types of vector fields. For example, there could be a vector field representing gravity forces and a corresponding function:

The fact that the wind is only applied to awake actors is important. Without that check, the wind forces would keep every actor in the world awake all the time, which would be really expensive for the physics engine. Just as with gravity, we want physics objects to come to rest and go to "sleep" when the wind forces are in balance with other forces on the actor.

This of course creates a problem when the wind forces are varying. An actor may be in balance now, but a change in the wind direction could change that. A leaf that is resting on the ground may be lifted by a sudden updraft. Since we don't apply the wind forces to sleeping object we can't get that behavior. Once a leaf has come to rest, it will stay put.

This problem is most noticeable when you have drastic effects like explosions in the vector field. It looks really strange when actors are completely immobile and "sleep through" a big explosion.

I deal with this by having a function for explicitly waking actors in an AABB:

If you want to play a drastic wind effect (like an explosion), you should first wake the nearby actors with a call to wake_actors(). This ensures that all nearby actors will get the wind forces from the explosion (since they are now awake).

I apply the wind force with the standard formula:

F = 1/2 r v^2 C A

Where r is the density of air, v is the relative velocity of the air with respect to the object (so v = v_wind - v_object, where v_wind is the wind speed and v_object is the object's speed). C is a drag coefficient that depends on the object's shape and A is the object's reference area.

For C and A, I actually loop through all the physics shapes in the actor and estimate C and A based on those shapes. This is by no means a perfect approach. There are many situations where C might be really different from what such an estimation gives. For example, an object that is heavily perforated would receive much less wind force.

However, I want to have something in place that gives decent behavior in most cases, so that it only very rarely has to be changed. The less artists have to mess around with physical parameters, the smaller is the chance that anything gets messed up.

Note that the wind force is just air resistance with a velocity for the air. So by implementing wind you get the "air resistance" behavior "for free".

Rotation

If you compute the drag force using the formula above and apply it to a physics actor, it won't add any rotation to the actor. This is actually correct. The drag force, as we compute it here, has no rotational component.

Yet it feels counter-intuitive. We expect objects to rotate when they are blown about by the wind. Leafs and papers certainly swirl around a lot when the wind blows.

What happens in that case is actually a second order effect. When the wind blows around an object you get zones of high and low pressure as well as turbulence, and it is the forces from these interactions that affects the object's rotation.

These interactions are tricky to model accurately and they depend a lot on the object's shape. Right now, I'm not even trying. Instead I use a much simpler approach: I apply the drag force a bit above the object's actual center of mass so that it produces a torque and makes the object rotate. This is a complete hack that has no basis at all in physical reality, but it does add some rotation. At least it looks a lot better than applying the wind force without any rotation.

It should be possible to do better -- to make some kind of estimate of what rotational forces wind induces when it blows against typical physics shapes: boxes, spheres, capsules, etc. Just give my a couple of days in a wind tunnel and I'll try to come up with something.

But if we don't know beforehand what G(p) will be we don't have that option.

We could write our system so that it supported a limited set of specific effects, with hardcoded C++ implementations. For example, there could be an "explosion" effect with some parameters (radius, strength, etc), an "updraft" effect, a "whirl" effect, etc. Similarly we could have support for a variety of standard shapes, such as "sphere", "cylinder", "capsule", etc. And perhaps some different types of falloffs ("linear", "quadratic"). Perhaps also some temporal effects ("attack-sustain-release", "ease-in-ease-out").

But it is hard to know where to draw the limit with this approach. Exactly what effects and shapes and falloffs and time curves should the system support? The more things we add, the more cluttered the system becomes. And the system is still not completely general. No matter how much we add, there will still be some things that the user just can't do, without disturbing a programmer and get her to add a new effect to the system. This means that the system is not truly data-driven.

Whether this is a problem or not depends a lot on your development style. If you are a single artist-programmer working on a single game you may not even care. To you code and data is the same thing. Who cares if you have to add something to the code to make a special effect. That is what the code is for.

At Bitsquid, however, we are in a different position. We are making a general purpose engine to be used on multiple platforms for all kinds of tasks. We can't put game specific code in the engine or everything will end up a total mess. Sure, our licensees could modify their cloned copy of the source to add their own effects. But that is not an ideal solution. It forces them to learn our code, it makes it harder for us to reproduce their bugs, since our code bases have now diverged and it makes it harder for us to modify and optimize the source code without putting our licensees in merge hell.

So our aim is always to be completely data-driven.

But how can we represent a general function as data? There are really only two possibilities:

As a piece of executable machine code.

As a piece of bytecode that gets executed by a virtual machine.

The first approach is the fastest of course, but it has two drawbacks. First, machine code is platform dependent. Writing a system that can dynamically generate machine code for a lot of different targets is no small undertaking (though it could be simplified by using LLVM). Second, and more serious, many systems simply don't allow us execute dynamically generated machine code.

The inevitable conclusion is that we have to use bytecode (perhaps coupled with a machine code compiler on the platforms where that is feasible).

Unfortunately, as everybody who has used a dynamic language without a JIT compiler knows, bytecode is slow. Usually, at least a factor 10 slower than machine code. And remember that one of our design goals for this system was that it should be fast. We said in the beginning that it should be able to handle at least 10 000 queries per frame.

So what can we do?

The Massively Vectorized Virtual Machine

At this point it makes sense to stop and think a bit about why bytecode is slow. If you look at the code of a virtual machine, it is essentially a tight loop that repeatedly does three things:

Decode the next bytecode instruction into operation + arguments.

Jump to the code that performs the operation.

Execute the operation.

The third step is usually just as fast as handwritten machine code would be. Computing a+b is not more expensive because it was triggered by an OP_ADD bytecode instruction.

So all the overhead of bytecode, the thing that makes it "slow", is found in the first two steps.

Well then here is an idea: what if we could reuse the computations that we make in those two steps?

Remember that our goal is to compute G(p) for a lot of points p_i. We want to evaluate the same function, the same bytecode instructions, for a lot of different data points. In that case, why repeat the expensive operation of decoding the bytecode instructions again and again for each point? Why not just decode the instruction once and then execute it for all data points?

So, with that change, our virtual machine loop now becomes:

Decode the next bytecode instruction.

Jump to the code that executes it.

Execute that single instruction for all the input data.

With this change, the cost of decoding the bytecode is now amortized over all the query points. The more query points we have, the less time (proportionally) we will spend on decoding bytecode. With enough points (>1024) that time should be nearly negligible . In other worlds, our bytecode should be able to run at nearly the same speed as native machine code.

In a quick test I made, the overhead of a bytecode implementation compared to native code was just 16 % -- a far cry from the 10x slowdown we have come to expect.

Fleshing out the Details

Since we are computing a vector function on vector input and we want it to run as fast as possible, it makes sense to use SSE (or its equivalent on other platforms) and represent all our data as vector4 intrinsics.

Virtual machines can be stack-based or register-based. Stack-based machines produce more compact bytecode since the arguments are implicit. Register-based machines need fewer instructions to accomplish a task, since they don't have to juggle things around on the stack. In our case, compact bytecode doesn't buy us much, since our programs are short and the decoding cost is amortized. On the other hand, accomplishing the same thing with fewer instructions means less code to execute for each query point. So a register-based virtual machine seems to be a clear win.

Here is what the code for an explosion effect could look like in a made-up intermediate language for our virtual machine. The effect produces a wind of 50 m/s outwards from the center of a sphere of radius 5 m located at (2,4,0):

Here position is the input query position and output is the output result of the function. direction and lensqr are temporary variables.

Note that the final operation adds the result to the output register instead of overwriting it. This allows us to merge multiple effects by simply concatenating their bytecode. So to evaluate G(p) for a large number of points, we can first intersect the AABB of the points with the AABB of each individual effect G_i(p). Then we merge the bytecodes of each intersecting effect into a single bytecode function G'(p) that we finally evaluate for each point.

We can feed position and output to the virtual machine as arrays of intrinsics:

Note that since we are running the bytecode one instruction at a time for all the data, the local variables (direction and lensqr) need to be arrays too, since we need to remember their value for each of the input positions.

We could allocate arrays for these local variables and pass them to evaluate just as we do for positions and output. But that seems a bit wasteful. A complicated function could have twenty global variables or more, meaning that with 10 000 particles we would need to allocate 3.2 MB of temporary memory. The amount needed will vary widely, depending on how complicated the function is, which is driven by the data. This makes it hard to do a memory budget for the system.

So let's use an alternative approach. We allocate all local variable buffers from a "scratch space" which is provided by the caller:

Now the caller has complete control over the amount of temporary memory the system uses. It is predictable and can be made to fit any desired memory budget.

To make this work, we need to chop this scratch memory up into areas for each local variable. The size of those buffers then determine how many input positions we can process at a time.

For example, suppose we have 256 K of scratch memory and 8 local variables. Each local variable then gets 32 K of memory, which can hold 2 K Vector4I's. So this means that instead of processing all 10 000 particles at the same time when we execute an opcode, we process the particles in 5 chunks, handling 2 048 particles each time. The cost of decoding the bytecode gets amortized over 2 048 particles, instead of over 10 000, but it is still negligible.

The nice thing about this approach is that we always use a constant, predictable amount of scratch space, regardless of how many query points we process and how complicated the function is. Instead we scale down how many particles we process at a time.

Since both input data and local variables are now Vector4I buffers, the inner loop of the virtual machine is simple to write, it will look something like:

An Example

Here is a YouTube video that shows a vector field implemented using this method. Unfortunately, the YouTube compression is not very nice to a video that contains this much high-frequency information. But at least it gives some idea of the effect.

The video shows 20 000 particles being animated by the vector field at a query cost of about 0.4 ms on a single thread (of course, parallelization is trivial, so you can divide that by the number of available cores).