A blog by Michael Abrash

Christmas always makes me think of Mode X – which surely requires some explanation, since it’s not the most common association with this time of year (or any time of year, for that matter).

IBM introduced the VGA graphics chip in the late 1980s as a replacement for the EGA. The biggest improvement in the VGA was the addition of the first 256-color mode IBM had ever supported – Mode 0x13 – sporting 320×200 resolution. Moreover, Mode 0x13 had an easy-to-program linear bitmap, in contrast to the Byzantine architecture of the older 16-color modes, which involved four planes and used four different pixel access modes, controlled through a variety of latches and registers. So Mode 0x13 was a great addition, but it had one downside – it was slow.

Mode 0x13 only allowed one byte of display memory – one pixel – to be modified per write access; even if you did 16-bit writes, they got broken into two 8-bit writes. The hardware used in 16-color modes, for all its complexity, could write a byte to each of the four planes at once, for a total of 32 bits modified per write. That four-times difference meant that Mode 0x13 was by far the slowest video mode.

Mode 0x13 also didn’t result in square pixels; the standard monitor aspect ratio was 4:3, which was a perfect match for the 640×480 high-res 16-color mode, but not for Mode 0x13’s 320×200. Mode 0x13 was limited to 320×200 because the video memory window was only 64KB, and 320×240 wouldn’t have fit. 16-color modes didn’t have that problem; all four planes were mapped into the same memory range, so they could each be 64KB in size.

In December of 1989, I remember I was rolling Mode 0x13’s aspect ratio around in my head on and off for days, thinking how useful it would be if it could support square pixels. It felt like there was a solution there, but I just couldn’t tease it out. One afternoon, my family went to get a Christmas tree, and we brought it back and set it up and started to decorate it. For some reason, the aspect ratio issue started nagging at me, and I remember sitting there for a minute, watching everyone else decorate the tree, zoned out while ideas ran through my head, almost like that funny stretch of scrambled thinking just before you fall asleep. And then, for no apparent reason, it popped into my head:

Treat it like a 16-color mode.

You see, the CPU-access side of the VGA’s frame buffer (that is, reading and writing of its contents by software) and the CRT controller side (reading of pixels to display them) turned out to be completely independently configurable. I could leave the CRT controller set up to display 256 colors, but reconfigure CPU access to allow writing to four planes at once, with all the performance benefits of the 16-color hardware – and, as it turned out, a write that modified all four planes would update four consecutive pixels in 256-color mode. This meant fills and copies could go four times as fast. Better yet, the 64KB memory window limitation went away, because now four times as many bytes could be addressed in that window, so a few simple tweaks to get the CRT controller to scan out more lines produced a 320×240 mode, which I dubbed “Mode X” and wrote up in the December, 1991, Dr. Dobb’s Journal. Mode X was widely used in games for the next few years, until higher-res linear 256-color modes with fast 16-bit access became standard.
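The planar addressing that makes all this work is simple once you see it. Here's a sketch of the arithmetic – in Python purely for illustration, since the real thing was x86 code poking VGA registers – showing how a Mode X pixel maps to a plane and a byte offset, and the Map Mask value (written to the Sequencer data port, 0x3C5) that selects that plane:

```python
# Mode X addressing: 320x240, 256 colors, four planes.
# Consecutive pixels live in consecutive planes, so each plane holds
# every fourth pixel and a scan line occupies 320/4 = 80 bytes per plane.

PLANES = 4
WIDTH = 320
BYTES_PER_LINE = WIDTH // PLANES  # 80

def modex_address(x, y):
    """Map pixel (x, y) to (plane, byte offset, Map Mask register value)."""
    plane = x % PLANES                       # which plane holds this pixel
    offset = y * BYTES_PER_LINE + x // PLANES
    map_mask = 1 << plane                    # plane-select bit for port 0x3C5
    return plane, offset, map_mask

# A write with the Map Mask set to 0x0F hits the same byte offset in all
# four planes at once, updating four adjacent pixels - the 4x speedup.
```

And because each of the 64K addressable offsets now backs four pixels, 256K of display memory is reachable through the same 64KB window – which is what makes 320×240 fit.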

If you’re curious about the details of Mode X – and there’s no reason you should be, because it’s been a long time since it’s been useful – you can find them here, in Chapters 47-49.

One interesting aspect of Mode X is that it was completely obvious in retrospect – but then, isn’t everything? Getting to that breakthrough moment is one of the hardest things there is, because it’s not a controllable, linear process; you need to think and work hard at a problem to make it possible to have the breakthrough, but often you then need to think about or do something – anything – else, and only then does the key thought slip into your mind while you’re not looking for it.

The other interesting aspect is that everyone knew that there was a speed-of-light limit on 256-color performance on the VGA – and then Mode X made it possible to go faster than that limit by changing the hardware rules. You might think of Mode X as a Kobayashi Maru mode.

Which brings us, neat as a pin, to today’s topic: when it comes to latency, virtual reality (VR) and augmented reality (AR) are in need of some hardware Kobayashi Maru moments of their own.

Latency is fundamental

When it comes to VR and AR, latency is fundamental – if you don’t have low enough latency, it’s impossible to deliver good experiences, by which I mean virtual objects that your eyes and brain accept as real. By “real,” I don’t mean that you can’t tell they’re virtual by looking at them, but rather that your perception of them as part of the world as you move your eyes, head, and body is indistinguishable from your perception of real objects. The key to this is that virtual objects have to stay in very nearly the same perceived real-world locations as you move; that is, they have to register as being in almost exactly the right position all the time. Being right 99 percent of the time is no good, because the occasional mis-registration is precisely the sort of thing your visual system is designed to detect, and will stick out like a sore thumb.

Assuming accurate, consistent tracking (and that’s a big if, as I’ll explain one of these days), the enemy of virtual registration is latency. If too much time elapses between the time your head starts to turn and the time the image is redrawn to account for the new pose, the virtual image will drift far enough so that it has clearly wobbled (in VR), or so that it is obviously no longer aligned with the same real-world features (in AR).

How much latency is too much? Less than you might think. For reference, games generally have latency from mouse movement to screen update of 50 ms or higher (sometimes much higher), although I’ve seen numbers as low as about 30 ms for graphically simple games running with tearing (that is, with vsync off). In contrast, I can tell you from personal experience that more than 20 ms is too much for VR and especially AR, but research indicates that 15 ms might be the threshold, or even 7 ms.

AR/VR is so much more latency-sensitive than normal games because, as described above, virtual images are expected to stay stable with respect to the real world as you move, while with normal games, your eye and brain know they’re looking at a picture. With AR/VR, all the processing power that originally served to detect anomalies that might indicate the approach of a predator or the availability of prey is brought to bear on bringing virtual images that are wrong by more than a tiny bit to your attention. That includes images that shift when you move, rather than staying where they’re supposed to be – and that’s exactly the effect that latency has.

Suppose you rotate your head at 60 degrees/second. That sounds fast, but in fact it’s just a slow turn; you are capable of moving your head at hundreds of degrees/second. Also suppose that latency is 50 ms and resolution is 1K x 1K over a 100-degree FOV. Then as your head turns, the virtual images being displayed are based on 50 ms-old data, which means that their positions are off by three degrees, which is wider than your thumb held at arm’s length. Put another way, the object positions are wrong by 30 pixels. Either way, the error is very noticeable.
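As a sanity check on those numbers, the arithmetic is easy to write down (the resolution and FOV figures are the hypothetical ones from the paragraph above):

```python
def registration_error(head_rate_dps, latency_ms, h_res_px, fov_deg):
    """Angular and pixel mis-registration for a constant-rate head turn."""
    err_deg = head_rate_dps * latency_ms / 1000.0   # degrees the view moved
    px_per_deg = h_res_px / fov_deg                 # display pixels per degree
    return err_deg, err_deg * px_per_deg

# 60 degrees/second, 50 ms of latency, 1K pixels across a 100-degree FOV:
err_deg, err_px = registration_error(60, 50, 1000, 100)
# err_deg = 3.0 degrees, err_px = 30.0 pixels
```

Note that the error scales linearly with both head speed and latency, which is why a fast turn – hundreds of degrees per second – is so unforgiving.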

You can do prediction to move the drawing position to the right place, and that works pretty well most of the time. Unfortunately, when there is a sudden change of direction, the error becomes even bigger than with no prediction. Again, it’s the anomalies that are noticeable, and reversal of direction is a common situation that causes huge anomalies.

Finally, latency seems to be connected to simulator sickness, and the higher the latency, the worse the effect.

So we need to get latency down to 20 ms, or possibly much less. Even 20 ms is very hard to achieve on existing hardware, and 7 ms, while not impossible, would require significant compromises and some true Kobayashi Maru maneuvers. Let’s look at why that is.

The following steps have to happen in order to draw a properly registered AR/VR image:

1) Tracking has to determine the exact pose of the HMD – that is, the exact position and orientation in the real world.
2) The application has to render the scene, in stereo, as viewed from that pose. Antialiasing is not required but is a big plus, because, as explained in the last post, pixel density is low for wide-FOV HMDs.
3) The graphics hardware has to transfer the rendered scene to the HMD’s display. This is called scan-out, and involves reading sequentially through the frame buffer from top to bottom, moving left to right within each scan line, and streaming the pixel data for the scene over a link such as HDMI to the display.
4) Based on the received pixel data, the display has to start emitting photons for each pixel.
5) At some point, the display has to stop emitting those particular photons for each pixel, either because pixels aren’t full-persistence (as with scanning lasers) or because the next frame needs to be displayed.

There’s generally additional buffering that happens in 3D pipelines, but I’m going to ignore that, since it’s not an integral part of the process of generating an AR/VR scene.

Let’s look at each of the three areas – tracking, rendering, and display – in turn.

Tracking latency is highly dependent on the system used. An IMU (3-DOF gyro and 3-DOF accelerometer) has very low latency – on the order of 1 ms – but drifts. In particular, position derived from the accelerometer drifts badly, because it’s derived via double integration from acceleration. Camera-based tracking doesn’t drift, but has high latency due to the need to capture the image, transfer it to the computer, and process the image to determine the pose; that can easily take 10-15 ms. Right now, one of the lowest-latency non-drifting accurate systems out there is a high-end system from NDI, which has about 4 ms of latency, so we’ll use that for the tracking latency.

Rendering latency depends on CPU and GPU capabilities and on the graphics complexity of the scene being drawn. Most games don’t attain 60 Hz consistently, so they typically have rendering latency of more than 16 ms, which is too high for AR/VR, which requires at least 60 Hz for a good experience. Older games can run a lot faster, up to several hundred Hz, but that’s because they’re doing relatively unsophisticated rendering. So let’s say rendering latency is 16 ms.

Once generated, the rendered image has to be transferred to the display. How long that takes for any particular pixel depends on the display technology and generally varies across the image, but for scan-based display technology, by far the most common, the worst case is that nearly a full frame time elapses between the frame buffer update and the scan-out of a given pixel; at 60 Hz, that’s about 16 ms. For example, suppose a frame finishes rendering just as scan-out starts to read the topmost scan line on the screen. Then the topmost scan line will have almost no scan-out latency, but it will be nearly 16 ms (almost a full frame time – not quite that long, because there’s a vertical blanking period between successive frames) before scan-out reads the bottommost scan line and sends its pixel data to the display, at which point the latency between rendering that data and sending it to the display will be nearly 16 ms.

Sometimes each pixel’s data is immediately displayed as it arrives, as is the case with some scanning lasers and OLEDs. Sometimes it’s buffered and displayed a frame or more later, as with color-sequential LCOS, where the red components of all the pixels are illuminated at the same time, then the same is done separately for green, and then again for blue. Sometimes the pixel data is immediately applied, but there is a delay before the change is visible; for example, LCD panels take several milliseconds at best to change state. Some televisions even buffer multiple frames in order to do image processing. However, in the remainder of this discussion I’ll assume the best case, which is that we’re using a display that turns pixel data into photons as soon as it arrives.

Once the photons are emitted, there is no perceptible time before they reach your eye, but there’s still one more component to latency, and that’s the time until the photons from a pixel stop reaching your eye. That might not seem like it matters, but it can be very important when you’re wearing an HMD and the display is moving relative to your eye, because the longer a given pixel state is displayed, the farther it gets from its correct position, and the more it smears. From a latency perspective, far better for each pixel to simply illuminate briefly and then turn off, which scanning lasers do, than to illuminate and stay on for the full frame time, which some OLEDs and LCDs do. Many displays fall in between; CRTs have relatively low persistence, for example, and LCDs and OLEDs can have a wide range of persistence. Because the effects of persistence are complicated and subtle, I’ll save that discussion for another day, and simply assume zero persistence from here on out – but bear in mind that if persistence is non-zero, effective latency will be significantly worse than the numbers I discuss below; at 60 Hz, full persistence adds an extra 16 ms to worst-case latency.

So the current total latency is 4+16+16 = 36 ms – a long way from 20 ms, and light-years away from 7 ms.

Changing the rules

Clearly, something has to change in order for latency to get low enough for AR/VR to work well.

On the tracking end, the obvious solution is to use both optical tracking and an IMU, via sensor fusion. The IMU can be used to provide very low-latency state, and optical tracking can be used to correct the IMU’s drift. This turns out to be challenging to do well, and there are no current off-the-shelf solutions that I’m aware of, so there’s definitely an element of changing the hardware rules here. Properly implemented, sensor fusion can reduce the tracking latency to about 1 ms.

For rendering, there’s not much to be done other than to simplify the scenes to be rendered. AR/VR rendering on PCs will have to be roughly on the order of five-year-old games, which have low enough overall performance demands to allow rendering latencies on the order of 3-5 ms (200-333 Hz). Of course, if you want to do general, walk-around AR, you’ll be in the position of needing to do very-low-latency rendering on mobile processors, and then you’ll need to be at the graphics level of perhaps a 2000-era game at best. This is just one of many reasons that I think walk-around AR is a long way off.

So, after two stages, we’re at a mere 4-6 ms. Pretty good! But now we have to get the rendered pixels onto the display, and it’s here that the hardware rules truly need to be changed, because 60 Hz displays require about 16 ms to scan all the pixels from the frame buffer onto the display, pretty much guaranteeing that we won’t get latency down below 20 ms.

I say “pretty much” because in fact it is theoretically possible to “race the beam,” rendering each scan line, or each small block of scan lines, just before it’s read from the frame buffer and sent to the screen. (It’s called racing the beam because it was developed back when displays were CRTs; the beam was the electron beam.) This approach (which doesn’t work with display types that buffer whole frames, such as color-sequential LCOS) can reduce display latency to just long enough to be sure the rendering of each scan line or block is completed before scan-out of those pixels occurs, on the order of a few milliseconds. With racing the beam, it’s possible to get overall latency down into the neighborhood of that 7 ms holy grail.

Unfortunately, racing the beam requires an unorthodox rendering approach and considerably simplified graphics, because each scan line or block of scan lines has to be rendered separately, at a slightly different point on the game’s timeline. That is, each block has to be rendered at precisely the time that it’s going to be scanned out; otherwise, there’d be no point in racing the beam in the first place. But that means that rather than doing rendering work once every 16.6 ms, you have to do it once per block. Suppose the screen is split into 16 blocks; then one block has to be rendered per millisecond. While the same number of pixels still need to be rendered overall, some data structure – possibly the whole scene database, or maybe just a display list, if results are good enough without stepping the internal simulation to the time of each block – still has to be traversed once per block to determine what to draw. The overall cost of this is likely to be a good deal higher than normal frame rendering, and the complexity of the scenes that could be drawn within 3-5 ms would be reduced accordingly. Anything resembling a modern 3D game – or resembling reality – would be a stretch.
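To make that scheduling concrete, here’s a sketch of the per-block deadlines, using the hypothetical 16-block split at 60 Hz from the paragraph above (and ignoring the vertical blanking interval for simplicity):

```python
# Racing the beam: each block of scan lines must finish rendering just
# before scan-out reaches it, so rendering runs against a rolling set of
# deadlines within the frame rather than one deadline per frame.

FRAME_MS = 1000.0 / 60.0   # ~16.7 ms per frame at 60 Hz
BLOCKS = 16

def block_deadlines(frame_ms=FRAME_MS, blocks=BLOCKS):
    """Ms offset from the start of scan-out at which each block is read."""
    return [i * frame_ms / blocks for i in range(blocks)]

deadlines = block_deadlines()
# Blocks come due roughly every millisecond (~1.04 ms apart), so the scene
# structure has to be traversed 16 times per frame instead of once.
```

The deadline spacing is what drives the cost: shrinking the blocks to reduce shear artifacts tightens every deadline and multiplies the number of traversals.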

Racing the beam also raises the problem of avoiding visible shear along the boundaries between blocks. That might or might not be acceptable; it would look like tear lines, and tear lines are quite visible and distracting. If that’s a problem, it might work to warp the segments to match up properly. And obviously the number of segments could be increased until no artifacts were visible, at a performance cost; in the limit, you could eliminate all artifacts by rendering each scan line individually, but that would induce a very substantial performance loss. On balance, it’s certainly possible that racing the beam, in one form or another, could be a workable solution for many types of games, but it adds complexity and has a significant performance cost, and overall at this point it doesn’t appear to me to be an ideal general solution to display latency, although I could certainly be wrong.

It would be far easier and more generally applicable to have the display run at 120 Hz, which would immediately reduce display latency to about 8 ms, bringing total latency down to 12-14 ms. Rendering should have no problem keeping up, since we’re already rendering at 200-333 Hz. 240 Hz would be even better, bringing total latency down to 8-10 ms.
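The budgets discussed so far can be tallied in a few lines. (This sketch charges a full frame time as the worst-case scan-out latency, so the totals land a fraction of a millisecond above the rounded figures in the text.)

```python
def total_latency_ms(tracking_ms, render_ms, refresh_hz):
    """Worst-case tracking + rendering + display scan-out latency."""
    scanout_ms = 1000.0 / refresh_hz   # worst case: one full frame time
    return tracking_ms + render_ms + scanout_ms

baseline     = total_latency_ms(4, 16, 60)    # ~36.7 ms - today's situation
improved_120 = total_latency_ms(1,  5, 120)   # ~14.3 ms - sensor fusion,
improved_240 = total_latency_ms(1,  5, 240)   # ~10.2 ms - fast rendering,
                                              #   and a high-refresh display
```

The display term dominates the improved budgets, which is why the remaining gains have to come from the display side.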

Higher frame rates would also have benefits in terms of perceived display quality, which I’ll discuss at some point, and might even help reduce simulator sickness. There’s only one problem: for the most part, high-refresh-rate displays suitable for HMDs don’t exist.

For example, the current Oculus Rift prototype uses an LCD phone panel for a display. That makes sense, since phone panels are built in vast quantities and therefore are inexpensive and widely available. However, there’s no reason why a phone panel would run at 120 Hz, since it would provide no benefit to the user, so no one makes a 120 Hz phone panel. It’s certainly possible to do so, and likewise for OLED panels, but unless and until the VR market gets big enough to drive panel designs, or to justify the enormous engineering costs for a custom design, it won’t happen.

There’s another, related potential solution: increase the speed of scan-out and the speed with which displays turn streamed pixel data into photons without increasing the frame rate. For example, suppose that a graphics chip could scan-out a frame buffer in 8 ms, even though the frame rate remained at 60 Hz; scan-out would complete in half the frame time, and then no data would be streamed for the next 8 ms. If the display turns that data into photons as soon as it arrives, then overall latency would be reduced by 8 ms, even though the actual frame rate is still 60 Hz. And, of course, the benefits would scale with higher scan-out rates. This approach would not improve perceived display quality as much as higher frame rates, but neither does it place higher demands on rendering, so no reduction in rendering quality is required. Like higher frame rates, though, this would only benefit AR/VR, so it is not going to come into existence in the normal course of the evolution of display technology.
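The saving from faster scan-out at a fixed frame rate is just the difference between the frame time and the scan-out time; a quick sketch, using the 8 ms figure from above:

```python
def fast_scanout_saving_ms(refresh_hz, scanout_ms):
    """Worst-case latency saved by scanning out faster than the frame time,
    assuming the display turns pixel data into photons as it arrives."""
    frame_ms = 1000.0 / refresh_hz
    return frame_ms - scanout_ms

# Scanning out in 8 ms at 60 Hz shaves roughly 8.7 ms off the worst case,
# even though new frames are still only generated 60 times a second.
```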

And this is where a true Kobayashi Maru moment is needed. Short of racing the beam, there is no way to get low enough display latency out of existing hardware that also has high enough resolution, low enough cost, appropriate image size, compact enough form factor and low enough weight, and suitable pixel quality for consumer-scale AR/VR. (It gets even more challenging when you factor in wide FOV for VR, or see-through for AR.) Someone has to step up and change the hardware rules to bring display latency down. It’s eminently doable, and it will happen – the question is when, and by whom. It’s my hope that if the VR market takes off in the wake of the Rift’s launch, the day when display latency comes down will be near at hand.

If you ever thought that AR/VR was just a simple matter of showing an image on the inside of glasses or goggles, I hope that by this point in the blog it’s become clear just how complex and subtle it is to present convincing virtual images – and we’ve only scratched the surface. Which is why, in the first post, I said we needed smart, experienced, creative hardware and software engineers who work well in teams and can manage themselves – maybe you? – and that hasn’t changed.

Or actually, maybe you could approximate the fluid-filled container in software? Render to a buffer that’s larger than the screen, and then change the position of the scanned-out region if there’s angular motion without much linear motion. You could skip the rendering latency, anyway, for movements that don’t absolutely require a re-render. Or does optics distortion mean that even angular movements require a re-render?

Typical head angular-movements are faster and higher-acceleration than head linear-movements, aren’t they?

Hardware panning seems like it’d be a good solution for angular motion, but there’s a problem for dynamic scenes. The problem is that the panning corrects for head rotation, but it doesn’t account for object motion in the scene, so the panning will conflict with the motion of moving objects. That wouldn’t matter if the angular motion was constant, but it’s not. Also, as you point out, panning doesn’t work for translation, and there’s always translation, even when you just turn your head.

Yes, head rotation is typically faster and higher-acceleration than head translation.

As noted in my reply to Karl Bunch, upon reflection I think hardware panning is an experiment worth running, although it’s still not clear to me that it’ll work well with dynamic scenes or head translation.

This sounds like a feasible option when you start to account for our brain’s immense capability to fill in the gaps. As long as you are able to somewhat reconstruct the scene during fast rotations / translations, the experiment has quite good odds of actually giving a believable result. If you can reach a certain level of speed, even with non-true visual cues, our brains are wicked at reconstructing what the eyes are expecting to see. ( http://www.moillusions.com/2006/05/black-and-white-seen-in-color-illusion_14.html ) That one is just for reference, as an example of how much our brains have to fake information for us. Or, for example, all those blind-spot tests where the brain fills in the missing parts.

Remember that while rotating, your eyes always need a fixed point to look at. For objects moving with respect to the viewer that may be problematic, but for the overall world view it should be splendid. (You can experiment with this by rotating your head and trying to make your eyes rotate smoothly along with it; your eyes jump from point to point instead of rotating.) So in fact the described speed of head rotation is not that dramatic for the focus point, as it’s the only thing you can clearly see, and the rest is faked by your brain through visual cues. The overall scene change, though, is as fast as Michael described, and for that you should be able to use the panning technique to approach immersive results.

All in all, I just wanted to remind you not to overlook that there is quite a lot of faking going on all the time in our field of view by our brains. So studying and understanding those effects, and potentially finding out how to reduce the complexity and exploit those brain features, may be constructive for your quest for a Kobayashi Maru moment. Hard numbers are nice, but not always compulsory to fool ourselves into thinking that the scene is real.

That’s a nice visual trick, but I wouldn’t say it demonstrates the brain faking information, since it’s just an afterimage effect. And while your eyes and brain definitely do fill in and reconstruct a lot of what you see, the specific problems that result from too-high latency are the sorts of things the eyes and brain are evolved to detect; if latency is too high, far from filling in, they’ll call your attention to the discrepancy.

> So in fact the described speed of head rotation is not that dramatic for focus point as its the only thing you can clearly see and rest is faked by your brains through the visual cues.

I’m not sure what you mean by that. If you fixate on a point in front of you while wearing a see-through AR HMD, then turn your head rapidly while continuing to fixate on the same point, you will see that point clearly, with no fakery, as you say. The rest of the scene will be blurry but you won’t notice, which is what I think you mean by fakery. But the key is that you will also see the virtual image on the HMD shift relative to that point by a distance that’s proportional to the speed and the amount of latency between tracking and photon emission. And that is the problem with latency.

For the reconstruction of missing regions, I suggest looking into algorithms that alter stereo images. In particular, disparity adjustment on stereo image pairs solves the same problem, usually by warping the input with texture filtering such as interpolation.

Now, implementing that in a RAMDAC would be a nice challenge. Think about it: if the traditional RAMDAC is replaced by dedicated hardware that gets two color images plus depth as input and is directly linked to the tracking sensors, then there should no longer be any reason to adjust the rest of the rendering pipeline.

Gregor

PS: could you please check the email address verification for the comment form? It is broken and does not accept valid addresses like gmueckl@gregor-mueckl.de. Thanks!

Well, yes and no. Certainly there have been successful warping implementations; you can find them easily on the Internet. However, none of those applications was one where the images needed to register with reality as you moved your head. In particular, translation can be a big problem, because proper warping would require drawing newly-exposed areas for which there is no information in the rendered scene. Possibly it could work well enough anyway, and it’s worth experimenting with, but there is a big difference between things that work on monitors and things that work in AR/VR.

I will look into the email address verification problem – sorry for the inconvenience.

I think it would be more than a pixel or two. It’s easy to produce large parallax shifts with modest head motions; next time you’re talking to someone, try shifting your head to the side and see how much their head moves against the background.

I suppose the gap size is highly dependent on the speed of translation and the distance of the background.

Perhaps this is where prediction can be used without affecting the flow of motion too drastically. If the POV is known to be translating in some direction, the scene could be rendered first with a predicted camera position and then normally, in this case with depth info that can be shifted as the beam is being raced. If the gaps are left unfilled, they should show the predicted scene in the gaps. There’s probably a more efficient way to do it, but it should work for a test.

Interesting – but if that worked well enough, wouldn’t that imply that the predicted position alone would work well enough? Otherwise, the predicted pixels used to fill the gaps would look wrong – actually more wrong than if you only used predicted pixels, because the mix of predicted and shifted pixels would make mis-registration of the predicted pixels immediately evident.

Inpainting is your friend – there are so many inpainting approaches, used for example in image/video/stereoscopic processing, that I am sure there is a suitable one for this. Whether it is fast enough is the question, but considering that HMDs have tiny resolutions at the moment, I think it would be.

Sure, there are lots of approaches that work well for stills and videos, and trying them for HMDs has come up previously in the comments. However, none of them have had to deal with compensating instantly for head motion in a way that doesn’t trigger anomaly detection; that’s not to say they can’t work, just that we’ll have to see.

Does only rendering edge information make the racing the beam approach more feasible?

Here’s a hybrid approach: keep a rendered cubemap of the scene at all times, but then composite in edge information for each eye on top of it while racing the beam. i.e., for that split second where you’re outside of the optimal rendered area, you’d have something almost anaglyph-like in terms of 3D quality on the display.

The goal is to provide something on the screen whenever the head turns too quickly. So you would have the actual images for each eye with an overscan area as was previously proposed.

What I’m proposing would still require additional hardware, unfortunately. If you get outside that overscan area, it could sample a cube of the area around the camera. (i.e., you’re always generating this cube texture each frame, in a manner not unlike a reflection texture.) The problem is there’s only one cube, so in order to have depth information that’s different for each eye, you would composite the edge information for each eye into the scene at scan out. I’m focussing on edges because I’m assuming that’s the information your eye is going to be looking for in these rapid movements. (I’m also making the assumption that rendering the edges of objects is simpler than rendering the entire scene, and could be implemented in simplified hardware in a way that races the beam.)

While it would far from cover every case (e.g., post processing filters could interfere, bumpmap textures would be ignored, etc.), you could probably collect some basic edge information from your triangles during the render of the entire scene into the cube texture. This simplified list of edges could then be passed to the dedicated hardware for compositing into the image at scan-out. Not sure what algorithm would be used to composite the edge info into the scene; perhaps blend in a thick line with a light-dark-light pattern that tries to mimic what you would see if you applied a sharpen filter to an image?

Regarding the overscan/cube idea – it’s a variation on the hardware panning idea that’s come up multiple times, which is worth trying out, but has a couple of weak spots, most notably translation.

Regarding the edges – it’s hard to see how that could work properly. Wouldn’t the old edges still be in the panned image? Anyway, I don’t see how the edges could fit perfectly into the panned image, and if they don’t, you’ll see wobbling very clearly. But perhaps I’m missing something.

I don’t know if you have tried this: just show blank for that frame during your high velocity head movement. After all that is how our brain deals with eye movement. High speed movement may be easy to follow, but blink? Not so much.

Yes, we all learned that we’re blind during saccades, but in fact that’s not really true. We don’t get sharp images, but blurs on the retina convey information. In any case, the speed of motion I described – 60 degrees/second – is in the range within which the eye can do smooth pursuit of moving objects, and your eye can do focused tracking of stationary objects as you turn your head at much higher speeds. So latency is a problem in common situations that don’t involve being temporarily blinded to some extent by saccades.

The other problem is… latency. In order to blank and unblank as you suggest, you’d have to end the blanking quickly enough that the eye didn’t notice the scene had vanished. But if it took even 16 ms between the time you decided to unblank and the time the scene was displayed, the eye would see the blanking. So latency means that blanking can’t be a solution for latency.

Do you think there’d be a use case where vision-blocking AR (i.e., a camera showing you a real-world view on a VR display) is used to increase the capacity of human vision, then? When my eyes are focused on the on-screen reticle in a fast-paced FPS, I can aim and shoot much faster than by moving my head around and waiting for my eyes to fixate on a target, and I know there’s some training that high-end athletes use to get around that and learn to work without fixating their eyes.

So maybe real-life soldiers and counter-terrorists would provide a market here, or even just some cultures around the world might accept vision-blocking AR as a welcome step on the road to becoming cyborgs. Either way, it’d be a cool toy, the way people with short-sightedness use mobile displays with cameras today.

In a post a while back I briefly discussed why I don’t think this is likely to happen for a good long while. It may be true that you can react faster because you don’t have to change accommodation distance, but it is probably also true that you will get eye fatigue because the distance your eye vergence reports doesn’t match your accommodation distance. More importantly, the quality of the visual experience with video passthrough is far inferior to seeing the real world on every axis, and it’ll be a long time before that changes. Not to mention how incredibly anti-social it would be to have your eyes completely covered – not that see-through AR is great in that department either!

Very interesting read. And I really loved the old Mode X, and the way it was ‘generalized’ by tools like fractint to get closer to SVGA resolutions/color combinations on (appropriate VGA) hardware.

I find it particularly interesting that, if I understand your analysis correctly, the largest part of the AR/VR-killing latency comes from the display hardware: even if the input and processing times were cut down to a couple of ms, the display latency from “commodity” hardware would keep overall timings borderline impractical.

Not that this comes as a surprise: after all, display technology is probably the one that so far has progressed the _least_ at the PC (and tablet and phone) level, even with the higher resolutions and IPS and whatnot. So maybe this is the right time to start pushing its limits and make faster displays a (virtual or augmented) reality?

Display technology has progressed just fine in areas where it mattered, such as increasing monitor refresh rates until flicker was gone. The problem is that there’d be no real benefit to building faster displays for existing uses. Faster displays will come about only if AR/VR can justify that happening.

Wow, Michael, what an eye-opening article!
I wonder if we can “cheat” in another way [crazy idea ahead]. Our eyes have “full resolution/full color” vision only in their very center, and the brain interpolates the sparse data from our peripheral vision to fill in the gaps. Given that AR/VR screens are very close to the eyes themselves (and assuming we have some eye-tracking capabilities), can we considerably decrease our rendering efforts?

I doubt that “racing the beam” or “increase scan-out speed without increasing frame rate” can work. Latency is ultimately defined by the worst case: the maximum time between when an event happens and when the results of that event reach the display. If you have a 16 ms interval between frames, that time can sometimes be smaller than 20 ms total, but not always – and you need that “always” for seamlessness.

In fact, I strongly suspect that the human brain can adapt to, and ultimately tune out, a few pixels of lag in a VR system – if that lag is consistent.
Beam racing and similar techniques will decrease the minimum lag but not the maximum, thereby increasing variance and making adaptation harder.

No, racing the beam can work. There’s no specific event that has to be displayed at any particular time; all that’s needed is that what’s displayed at any time be very close to right for that time. There are reasons why having 16 ms gaps between frames can cause problems, which I’ll go into at some point, but not because frame time causes a lag between a specific event and the time it’s displayed.

The human brain can adapt to a lot of things, but that doesn’t mean that things will look right or that it won’t experience fatigue and/or simulator sickness. And note that we’re not talking about a few pixels of lag – we’re talking about dozens of pixels under reasonable circumstances. The artifacts this produces are exactly the kind of thing your visual system is built to detect.

I think game design can account for the added latency to some extent. I have an HMZ-T1, and there are definitely games that stand out as being more immersive because their pacing and/or other game mechanics are better suited to the HMZ-T1’s performance limitations. Skyrim is a great example; the first time I played it in 3D with my HMD was a revelation. Hawken is an even better example: despite its 3D support on the HMZ-T1 not being usable, it is still incredibly immersive. You start to feel like you are inside the game; it’s incredible. The performance isn’t as great, but it’s so much more fun.

It is certainly true that games that don’t require you to move your head quickly will tolerate latency better. Of course, games that don’t require you to move your head also won’t benefit as much from AR/VR, since you only get the full effect if you shift your viewpoint around, so there’s a tradeoff there. In general, I think VR will succeed in proportion to the extent to which games that are compellingly unique to VR are developed, and that could include designing for latency tolerance. AR obviously will depend on unique compelling content, because existing content won’t port at all well to AR.

I wonder if some of the latency could be soaked up by a physical adaptation of the display panel. Imagine a display that adjusts position manually/mechanically in relation to the head movement, enough to give the latency time to catch up with the underlying movement. People won’t just sit in a chair and spin 360 constantly. If the physical display could shift the angle of view or the pan/scan of viewable pixels perhaps the rendering could happen a bit slower and catch up when the user stops for a second to focus etc.?

You could render a ‘latency’ buffer past the edges of the panel’s physically viewable viewport and then the panel could expose that zone along with head movement while the rendering continues to work on catching up with the actual shift. When the head comes to a stop for even a handful of milliseconds the display can recenter while the scene is rendered to “time current” position.

I can see that over-rendering and hardware panning is going to be a popular theme in the comments. Now that I think about it more, I’m not sure whether it’d work or not; the only way to know is to try it. It definitely won’t work for translation, but translation is slower than rotation.

Actual mechanical movement of the display seems unlikely to work well. Mechanical movement takes time, and the noises it produces would likely be distracting; imagine your HMD buzzing and whining as you look around…

As a follow-up I was thinking that you could render overscan and have the IMU/Display tightly coupled so they can shift the display in sub 7ms time while the cpu works to catch up in the overscan buffer zone. Meanwhile when the user stops for a bit you can catch everything up again as needed.

The IMU and Display could be “locally” connected to shift the viewable display port on movement and the cpu can respond as needed to update the display with new details and/or make sure the overscan area is updated.

You need a new kind of display, where the display has memory and you can send a message to the display to move its contents X pixels left/right/up/down.
Then the software can generate extra data outside of the display bounds, and the scan-out engine can just send the now visible bits to the screen.
That reduces the scan-out and display latency to near zero.

But that assumes a much tighter coupling between the scan-out engine and the display – normal HDMI is not going to work.

All of this is available with current technology, but it’s not commercially viable.

You might be able to partially simulate it by creating a custom ASIC that receives a signal from the display chip at around 60Hz on one side and drives a 200Hz+ signal out the other at the display. Then the ASIC can manage the display memory and receive “scroll” instructions.

As I said, this is clearly going to be the theme of the comments on this post. And, as I said, it’s worth trying out.

I will point out that if you pan to cover, say, 32 ms of latency, and you’re turning your head at 120 degrees/second (which sounds fast, but is quite reasonable; try it), then that’s four degrees of latency – twice the width of your thumb at arm’s length. That may be enough so that the errors in a linear shift are noticeable. Translation tends to be a good bit slower, but the errors from a linear approximation to translation versus proper rendering are far more evident than for rotation. So it may or may not work. Which is why trying it is the right thing.
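That estimate is simple arithmetic, sketched here in Python (the 1080-pixel width and 90-degree field of view are illustrative assumptions, not figures from the post):

```python
# How far a linearly panned image ends up from the correct position
# when panning is used to cover latency during a head turn.
def angular_error_deg(head_speed_dps, latency_ms):
    """Degrees the head turns during the latency window."""
    return head_speed_dps * latency_ms / 1000.0

def error_in_pixels(error_deg, display_width_px, fov_deg):
    """Convert an angular error to pixels, assuming a uniform
    pixels-per-degree mapping across the field of view."""
    return error_deg * display_width_px / fov_deg

err = angular_error_deg(120, 32)  # 120 deg/s head turn, 32 ms covered by panning
print(err)                        # 3.84 degrees -- about twice a thumb's width
print(error_in_pixels(err, 1080, 90))  # ~46 px on an assumed 1080-px-wide, 90-degree display
```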

When we were working on Quake, John Carmack tried posterizing the entities and reusing them for one frame, rendering them only every other frame. It instantly looked like a bunch of cardboard cutouts moving around. It surprised me that such a small error could be so visible. This case may be like that. Or not.

Finally, if I were going to build a custom display, I’d go for faster downloading of the image first. I’m sure that’ll reduce latency without any potential complications.

> When we were working on Quake, John Carmack tried posterizing the entities and reusing them for one frame, rendering them only every other frame. It instantly looked like a bunch of cardboard cutouts moving around. It surprised me that such a small error could be so visible. This case may be like that. Or not.

That sounds more like a background/foreground issue with shadows and aliasing. Humans are very sensitive to changes at the edges of objects.

Well, it wasn’t shadows, because we didn’t have any shadows. Agreed that we’re sensitive to edges – but if you just pan something that should be subtly warping, it could produce the same kind of effect, especially if there are moving objects in the scene that don’t move quite right.

Kobayashi Maru solution: give up on pixels altogether and bring back Asteroids-like vector graphics. Very few bits to sling around and would make for some awesome VR game effects. Realism isn’t everything.

That’s certainly occurred to me. Very tempting, especially with a retinal laser. But in a world where people play such realistic games on monitors, vector graphics for AR/VR would certainly seem like too big a step backward to consumers to be appealing.

What if you render the scene larger than the display, then feed the tracking info directly from the tracker to the display as fast as possible, so that as each pixel reaches the display, its position is adjusted based on the positioning offset it’s getting from the tracking system? Of course this only works for rotational effects; it doesn’t help at all with translation.

Nicely stated and analyzed. In particular, it’s key to shift each pixel (or, more realistically, each scan line) to the right position based on the latest data, rather than just panning the whole image; positioning each scan line separately cuts latency by nearly 16 ms, arguably to close to 0 ms. But as you say, translation is a problem.
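The per-scan-line correction could be sketched like this (Python; the `yaw_at` tracker callback and the pixels-per-degree figure are assumed interfaces, and the frame is assumed to be rendered with horizontal overscan):

```python
# Shift each scan line by the freshest yaw reading at scan-out time,
# instead of panning the whole frame by one stale offset.
def scan_out(frame, height, width, overscan, yaw_at, px_per_deg, yaw_rendered):
    """frame rows are rendered 2*overscan pixels wider than the display;
    yaw_at(line) returns the latest yaw estimate (degrees) at the moment
    that line is scanned out -- an assumed tracker interface."""
    out = []
    for y in range(height):
        # Error between the pose we rendered with and the pose right now.
        shift = int(round((yaw_at(y) - yaw_rendered) * px_per_deg))
        shift = max(-overscan, min(overscan, shift))  # stay inside the overscan
        out.append(frame[y][overscan + shift : overscan + shift + width])
    return out
```

Each line picks its own window into the overscan buffer, so the last lines scanned out reflect pose data only fractions of a millisecond old; translation, as noted, still isn’t handled.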

It strikes me that you could overdraw the scene (e.g., with a wider view frustum centered on the last point of view, or leading it if the view was already in motion) and then snap to your actual FOV using the positional data received at the last moment before the scene is sent to the display. Then it’s less important how long you take to render the scene, because as long as the motion isn’t so violent as to take your FOV outside of your larger scene, your latency is still very low.

Also, I wonder if it wouldn’t be less disruptive to drop a frame here, rather than rendering one that you suspected would be very out of register with the scene behind.

Dropping a frame turns out to be incredibly noticeable, because the previous frame’s pixels show up in very much the wrong place, or, if you blank the screen instead, the blinking-out of the image is obvious. These are the sorts of anomalies your eyes and brain just can’t overlook. As a result, AR/VR requires at least a hard 60 Hz frame rate.

Assume rendering happens at the standard 30-60Hz, all at once. We render an image slightly bigger than the viewport, which allows us to shift the image around.

That leaves a few approaches:
A) the reverse of the classical “optical image stabilization” as used in photography to counter high-frequency motion.
B) a “soft” solution in the display driver, which shifts the image around at a higher frequency than the actual rendering, again to counter high-frequency motion.
C) for scanning displays, like lasers, the beam could be continually compensated as it scans, with optical or software solutions.

Solution (A) is nice in that the motion compensation happens after the photons have left the display, providing the lowest latency – the best, if expensive, choice for LCD/OLED-type displays. (B) would require that the actual refresh rate of the LCD be in the 120+ Hz range, but is otherwise a software solution. (C) is for lasers or CRTs.

All have in common that the actual rendering doesn’t have to improve on latency. Much like OIS in photography, this works best for static scenery, but the computer graphics could be rendered at even lower rates, 20-30Hz, as long as shifting the image in the viewport can compensate for that. Software solutions could even compensate for all 6 DOF, which is harder optically. Also, sensors with the required quality do exist in photographic image stabilization systems, and should be usable for VR/AR, too.

Even though tracking external motion will still lag with the image stabilization approach (though no more than with conventional displays), motion sickness due to one’s own movements should be considerably reduced, removing the most important show-stop bug in the system.

I don’t think (B) will work, because there will be dynamic objects moving around, and panning between rendered frames won’t match their motion, so you’ll get stuttering motion.

I’m not sure I see the difference between (A) and (C); as discussed in earlier replies, I’m not sure if they would work, but it’s worth a try. Translation is a major stumbling block.

I am doubtful that rendering below 60 Hz would work; you could easily get 5-10 degrees of movement that you’re trying to compensate for, and that’s big enough that linear approximation would be off by quite a bit. Not to mention that translation would be big enough at that point to be a clear problem. However, I’m not sure I know what you mean by, “Software solutions could even compensate for all 6 DOF, which is harder optically. Also, sensors with the required quality do exist in photographic image stabilization systems, and should be usable for VR/AR, too.”

Wouldn’t using a dual buffer help with the racing-the-beam approach? Sure, it would double the latency of the frames, but it would also mean that rendering/tracking can occur at the same time as the scan-out, effectively parallelizing the two most expensive parts of generating the frame. It would also reduce tearing, which I believe would be a major problem in AR/VR.

But my guess is that the only way to solve this problem for good is with custom hardware. A screen that does asynchronous updating of each of its pixels (every time a pixel changes in the frame buffer, it is displayed automatically) would solve the problem. But this approach is probably beyond what current displays are capable of.

Also, I don’t understand why the racing-the-beam approach would need a complete overhaul in the rendering department. Why do you need to guarantee that a full frame is done before it is displayed? At 60Hz/120Hz you shouldn’t really need to bother with checking whether the frame is done or not; just display what is in the frame buffer. A dual-buffer approach would also solve this problem. But if you don’t use a dual buffer with the racing-the-beam approach, you should interlace lines instead of blocks (if technologically possible), because it would reduce the boundary problem.

I’m not sure how a double buffer would help. Could you explain? Note that in terms of latency, tracking isn’t expensive once you use an IMU with sensor fusion.

OLED pixels change state as soon as the pixel data changes. But you still have to get the pixel data to the OLED, and that’s a serial process, so that doesn’t really help.

If you display the frame buffer before it’s finished, objects that will be overdrawn will be visible. Imagine a cube in front of another cube. Now imagine that the far cube is drawn first, and scan-out happens at that point, before the front cube is drawn – you’d see a cube that should be fully occluded.

I think interlace would produce very noticeable jagged edges, so it probably wouldn’t be a great solution.

I was thinking about a similar solution (dual buffers) and kept thinking about the problem of getting screens that can change fast enough.

Would it be possible to put two mobile-screens (just the actual TFT foil) behind each other and feed each of them the result of one buffer (with the other one turned to black, to avoid “ghosting”).

Say tracking takes 1 ms, rendering 7 ms, and outputting another 8 ms (16 ms total). You could output the rendered image from the first buffer to the first screen while the second buffer is being filled; the second buffer then outputs to the second screen while the first one is rendered again. This would essentially give you double the framerate while keeping the delay the same, if I didn’t get this entirely wrong?
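The pipelining arithmetic behind that can be sketched quickly (Python, using the stage times given above):

```python
# With two buffers, rendering and scan-out overlap: latency is still the
# sum of the stage times, but the frame period is set by the slowest stage.
stages_ms = {"tracking": 1, "rendering": 7, "outputting": 8}

latency_ms = sum(stages_ms.values())                # 16 ms either way
serial_rate_hz = 1000 / latency_ms                  # 62.5 Hz if stages run back-to-back
pipelined_rate_hz = 1000 / max(stages_ms.values())  # 125 Hz with overlapped stages

print(latency_ms, serial_rate_hz, pipelined_rate_hz)
```

So the intuition holds: roughly double the framerate, same delay – which is also why pipelining alone doesn’t fix the latency problem.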

I don’t know if two screens could be overlaid like that. But even if so, I’m not sure where the win is here.

First off, framerate itself isn’t the issue; latency to get a frame on the display is, and I don’t think the proposed approach affects latency. Current GPUs actually do output the rendered image for one frame while rendering the next frame, which I think is the same thing your approach does. So I don’t think there’s a problem that needs fixing there. Note, btw, that outputting takes 8 ms only if you’re running at 120 Hz.

The double buffer I suggested was about doing rendering AND tracking at the same time as the scan-out. While one frame is being sent to the screen, rendering occurs into the other buffer; but as you said to Frieder, rendering is done this way even without double frame buffers (a fact I was not aware of).

The front cube would appear in the next frame, wouldn’t it? This might reduce the frame latency enough that it would not be noticeable.

Also, interlacing might warrant some testing. Maybe some kind of hardware-based anti-aliasing could fix the jagged edges? As in, the screen does it automatically, by bleeding some pixels into their neighbors.

But again this runs into the custom hardware problem. I guess the only way is to simply make a custom GPU to Screen interface that does the transferring in parallel instead of serially. But this generates a whole sort of new troubles like solving the synchronization problems, and the number of wires necessary between the two. Hard drives shifted from parallel (IDE) interfaces to serial (SATA) interfaces because of these problems.

Yeah I am not a hardware guy (more of a front end web dev), sorry about my wild guesses.

On a portable device the OLED screen and the memory holding the framebuffer are already physically near each other. You shouldn’t need to get the pixel data to the OLED, because it’s already almost there.

To make use of that you would need a new protocol for communication between the GPU and the screen, so that means new hardware. How to get that new hardware?

Well, my motion sickness disagrees with the idea that monitor and TV technology is already good enough latency-wise. Consider a PC game running in windowed fullscreen mode, displayed on a 60Hz LCD connected with a 1-meter cable, or a TV that’s doing frame interpolation. A solution that gets the lowest possible latency here should also provide displays suitable for AR/VR.

I have an alternative solution. Most of the problem comes from head rotation, not translation or anything else; and for small angular displacements, you can approximate a camera rotation by translating the image. So draw a frame that’s slightly bigger than the viewport, then as close to the display as possible, do a last-second translation based on the most recent sensor data. Then you can use a normal slow rendering and scan-out path, but get most of the benefit of a low-latency one.
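A minimal sketch of that last-second translation (Python; the pixels-per-degree figure and overscan margin are assumptions, and only yaw is handled):

```python
# Choose the visible window out of an over-rendered frame just before
# display, using the freshest yaw instead of the yaw the frame was
# rendered with.
def crop_offset(rendered_yaw_deg, latest_yaw_deg, px_per_deg, margin_px):
    """Horizontal offset into the overscan buffer, clamped so we never
    read pixels that were not actually rendered."""
    shift = int(round((latest_yaw_deg - rendered_yaw_deg) * px_per_deg))
    return max(-margin_px, min(margin_px, shift))

# Rendered at yaw 10.0 deg, head now at 11.5 deg, 12 px/deg, 40 px margin:
print(crop_offset(10.0, 11.5, 12, 40))  # 18-pixel shift, well inside the margin
```

Once the head turn exceeds the margin the offset clamps, and you’d see the edge of the rendered region – the flickering-at-the-edge failure mode mentioned elsewhere in the comments.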

I can’t say specifically what we’ve tried there, but I can comment that OLEDs switch in microseconds, so the limitation wouldn’t be in the displays themselves, but rather in the controllers that feed state changes into them.

That would be a good solution – but getting a highly customized panel built would require major capital expenditures, and there’s no market yet to justify that. In fact, it might not even be possible to pay LCD manufacturers to build a custom panel like that, because they have their hands full keeping up in a highly competitive market; throwing resources at a one-off custom panel for which there’s no guarantee there’d ever be a large market might well rank low on their priority list, even if they got paid for the work.

You might not be able to convince them to build you a custom display, but you might be able to get them to give you specs for driving the display with your own controller.
They might not be willing to build the panel for you without the controller, but it should be possible to cut the power lines to the built-in controller and piggy-back your new controller onto the display.
(My company builds custom FPGA-based products; sometimes we do weird stuff to make things work.)

And then you could build a controller in an FPGA to conduct experiments. You could even start with an off-the-shelf FPGA core, just google for “fpga lcd controller core”, which would save you a bunch of startup time.

If you need to find an FPGA person, some of the universities’ high-performance computing labs are doing FPGA research.

An idea: Have the renderer compute a slightly larger-FOV scene than is actually visible on the display, then have the orientation tracking directly control (inside the headset hardware) which portion of that scene is delivered to the display, with the game renderer catching up on the next frame. Wouldn’t this eliminate the game renderer from the critical path for the case of following head turns?

Of course, it won’t be exactly perfectly rendered, since one’s two eyes are not exactly stationary, and the image would also probably have to be warped to properly recenter it on the display, depending on the projection used.

Warping is a point that hasn’t been raised in previous comments, and it has some interesting potential. However, it requires substantial processing power on the display side, and you need to store the whole frame on the display, both reading and writing it, which adds cost and electrical power demands. And this still doesn’t address head translation.

The Kobayashi Maru opportunity here might be in field-sequential color microdisplays. A number of them already operate at 360 or 480 Hz of field refresh in order to get 120 Hz of effective frame rate. So when minimizing latency is key, you can run the rendering loop once per color field instead of only once per frame. Alternately, you could run the panels in monochrome mode and get a true high-frame-rate display without any rainbowing effects, albeit in grayscale or Terminator red.

Another option in the microdisplay realm are the DMD micromirror devices from TI. These have an inherent pixel-modulation speed of up to 5000 Hz. Granted, they are only binary black-and-white devices at that frequency, so normally you have to trade off modulation speed for bit depth via dithering. But the same trick applies if you want to get fancy and do intra-frame rendering to change the data in the middle of dithering to account for low-latency head motion.

Excellent thoughts. If only I thought a monochrome display could be successful, that would open up a number of interesting possibilities. But, alas, I don’t. Running the loop once per color field doesn’t really solve the problem, because when you render, you don’t know what the velocity is going to be by the time the fields get displayed.

DMDs do have those high frame rates, and that’s useful for experiments. But they also have the field-sequential issue, and there’s not really a version that would be a good match for an HMD.

I think a monochrome display would be successful simply because the public have already accepted it. The average consumer will know exactly what you are talking about and what to expect simply from the two words “Terminator vision”. That sort of pre-existing resonance is a rare bonus for a new technology and is potentially exploitable.

Not only that, but the general public have been conditioned for the best part of a century that colour is to be expected as an evolution, not as a prerequisite; newspapers, cinema, television, computers, video games, cell phones, the list goes on.

I’m certainly no graphics expert (though I’ve been around long enough to have read the 1991 Dr. Dobb’s article mentioned), but would it help to drop the resolution only during periods of high acceleration? So if we’re walking around and not moving our head quickly, we’d get nice circa 2012-era 3D graphics, but when turning our head, we’d get pre-2000-era graphics until the velocity drops and we can bump up the resolution again? It seems to me that when I move my head quickly, my eyes can’t absorb high-resolution images until my eyes focus again on something (i.e., my eye/head velocity relative to what I’m tracking is pretty low), so there’s no point in “wasting” the pixels when they aren’t useful.

I also assume that we only need to generate higher resolutions around the area being tracked by the eye, since the area of highest acuity is only 10-15 degrees in radius. Perhaps it would help to create a display that has two separate parts: an inner area that updates faster with higher resolution, and an outer area that updates less frequently, with lower resolution.

Very good thoughts. However, you can move your head very quickly but keep your focus on the same place, and you can see perfectly clearly – try it. Lower resolution would be very noticeable in that case, so it wouldn’t help.

As for higher resolution for the fovea than everywhere else – check out the comment thread from the last post for extensive discussion. My guess is that it’s unlikely that it would work to update the two areas at different rates, because it would produce a visible boundary between the two and make it obvious that there are two separate regions behaving differently. That should be easy to test by simply updating the center of the display at a different rate than the periphery (but at the same resolution). It also turns out to be tricky to have resolution that varies as you describe. Finally, if the lens is fixed, you need something like 50 x 50 degrees at high resolution and high update rate, so the savings aren’t the order of magnitude you might hope for.
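The savings math behind that last point, as a back-of-the-envelope sketch (Python; the 90-degree field of view and the pixel densities are illustrative assumptions, not figures from the post):

```python
# Compare pixel counts for full-resolution rendering versus a high-res
# inset with a lower-resolution periphery.
fov_deg = 90
inset_deg = 50            # high-res region needed with a fixed lens
hi_ppd, lo_ppd = 30, 7.5  # assumed pixels/degree inside and outside the inset

full_res_pixels = (fov_deg * hi_ppd) ** 2
foveated_pixels = (inset_deg * hi_ppd) ** 2 + \
    ((fov_deg * lo_ppd) ** 2 - (inset_deg * lo_ppd) ** 2)

print(full_res_pixels / foveated_pixels)  # roughly a 3x saving, not 10x
```

With a 50x50-degree inset dominating the pixel budget, the saving is a small multiple rather than the order of magnitude one might hope for.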

I recall seeing a simulator at the NASA Ames research lab that used an inset foveal region. I only visited there once, in the ’90s, so I don’t know the details of how it was blended into the second, wider field-of-view image. I do recall that they were using two light-valve projectors per eye for the displays, with a fiber-optic bundle to get the image to each eye. Oh, and they had a pretty awesome 6-DOF motion platform (looks like they’ve still got that). Also, at the time I visited, I was only able to see it without the foveal inset, and not on the motion platform.

A little more digging found this 1993 HMD tech summary, which (page 14) says this was probably the CAE FOHMD. Based on other pages about that HMD, it sounds like it needed a custom-molded helmet per-user to ensure visual alignment. No wonder I didn’t get to try it in its full glory.

So, it’s possible a high-res inset could be updated faster, though I’m sure that would introduce noticeable tearing. Another thought for scanout latency is to tile the display (e.g. something like this)

Marc, great to have you posting. For those who don’t know, Marc did a lot of key research early on in VR, not to mention coming up with a truly elegant homogeneous coordinate rasterization approach.

The inset foveal region sounds very cool; I’ll see if we can dig up anything about it. NASA did a lot of really deep, impressive research into this sort of stuff. Although it’s probably not really consumer-ready if you have to custom-mold helmets on a per-user basis.

We tried the one tiled wide-FOV HMD, and it was pretty terrible. It’s hard to see how you could get the tiles to match up properly without a lot of heavy, bulky, expensive optics.

I have also been thinking for a while about lowering the resolution for either fast-moving objects on the screen or fast movements of the head, but if that’s still not workable, as you say, maybe a solution would be to implement eye tracking in the VR glasses themselves to reduce the resolution for everything other than what you are focusing on? This would not have worked on a TV screen with possibly multiple viewers, but with the glasses on, it will only be you looking at the screen anyway.

Yes, that came up earlier in the comments, and is potentially interesting. It would reduce rendering time, but would not reduce transmission time to the display unless significant hardware changes were made.

How close could you get to approximating the view position/orientation? If the error could be bounded in a reasonable range, you could render a larger view frustum and just warp the image. That would reduce the latency to the time it takes to warp the image from the guessed render (a few ms, hopefully).

Games have used this sort of trick to generate 2 eyes for 3D from a single 2D image. That’s a simple left right translation though and it also wrecks translucent rendering like thick clouds of particles. Worst case you could move translucents after the warp, though that would add to the latency.

You’d get smearing artifacts and lighting/reflection discontinuities if the position is too far off of the guess. If you wanted antialiasing, you’d need a pretty complex resolve. Who knows though, complex resolve functions are on the upswing right now.

I wonder if it would make sense to render a larger area than necessary at the front end, and then after the rendering was complete, select a section of that area to pass to the screen? You’d effectively eliminate the rendering latency. The downside is that I assume this would require new hardware, and if you misjudge the region you need to draw badly enough, you would get flickering of the view at the edge – better than things not moving right, I’d expect, but only slightly.

I wonder if the Oculus Rift guys are looking at overclocking as a cheap way of attaining a high response rate. You look at ads for LCD monitors and they all go on about “2ms response time” even when the maximum refresh rate is 60hz, and gamers have started to overclock monitors via software and hardware to achieve the refresh rates of the CRT days. John Carmack even mentioned it off-hand when he was talking about the Rift: most of the work needed to get 100hz+ LCD displays in commonly-used sizes is actually already done, even if the driving electronics have never pushed it.

Unfortunately, mobile panels are not generally designed with low switching times or high refresh rates in mind. One possible solution is to overdrive the panel (essentially, telling it to change further than it actually has to for a small period of time) to get faster switching times, but that is not a common feature in most display controllers.

It is not just a matter of making panels with faster switching/refresh, though; most mobile panels these days use a MIPI interface, which rules out most LVDS/other display controllers used in high-performance desktop monitors. Even if you were to make a mobile panel that can run at 120hz+ (newer IGZO and OLED displays can do it on paper), everything else in the pipeline has to support 120hz too, and that is simply not happening. Some cheaper hardware tops out at 30hz, and nobody is bothering to support anything above 60hz yet. You cannot just use a nice LVDS controller and convert to MIPI, either, because the conversion hardware also supports a maximum of 60hz.

There is a lot of potential in mobile SOCs driving OLED displays at high frame rates, but it will be years before it is viable for consumer VR, if it ever happens.

Yes, I believe overdriving is one of the tricks used in desktop LCD panels to get the grey-to-grey switching times that were a matter of competition a few years ago (it avoids ghosting when running at 60hz), so it’s presumably a trick up the sleeve of a company that does decide to go and engineer a high-speed, high-density, small-size display.

I wonder if it’s possible with a bit of hardware hacking to rig up two controllers out of phase, with some sort of hardware multiplexer, to double the switching rate of a single pixel?

I still don’t know what technologies would be used to embed a display in see-through glasses (I can’t wrap my head around how focussing the display out several metres, or further, or both at once would work), but if it’s some sort of OLED display at the back of the head, bent around the head by optic fibre, you could also put two physical displays over the one “retina pixel”, and have some sort of high-speed supercontroller blank one display if the controllers are that bad.

Sure, that would be possible, but as a latency-reduction technique, it’s effectively the same as using a GPU with more execution units. Either way, it’s a cost issue. Also, rendering isn’t responsible for the bulk of the latency, getting the data to the display and emitting the corresponding photons is. And that’s certainly a solvable problem, but it requires major hardware effort that no one has sufficient reason to undertake at the moment.

Actually, for rendering, why not render a full, say, 240-degree view (X and Y), mapped to the same display resolution/FOV that the screen is using, each frame? I.e., an 80-degree FOV at 1080p means you render, centered, a 240-degree 6k view each time (eek! But not 120fps eek).

You store this render in a texture and then just read from that texture for subsequent frames until the next frame is ready. All you’re doing is displaying the correct portion of a texture! Just rendering the needed portion of the texture shouldn’t be a problem while the next “real” frame is being done. 1ms response time, here we come. After all, you don’t need to render the motion of any characters or any interaction with the game at 240hz; you only need to keep up with the player’s head. If the rest of the world isn’t moving for those few milliseconds, they aren’t really going to notice.
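The “display the correct portion of a texture” step might look roughly like this (a crude yaw-only sketch with a linear angle-to-pixel mapping; a real system would need a proper reprojection, and all the numbers here are illustrative):

```python
def visible_window(yaw_deg, render_fov_deg=240.0, display_fov_deg=80.0,
                   tex_width=5760):
    """Crudely select the horizontal slice of a prerendered wide-FOV
    texture corresponding to the current head yaw. Assumes a simple
    linear angle-to-pixel mapping (a planar render actually deviates
    from this badly at wide angles, so treat this as a toy model).
    yaw_deg is relative to the center of the prerendered view."""
    px_per_deg = tex_width / render_fov_deg
    center = tex_width / 2 + yaw_deg * px_per_deg
    half = (display_fov_deg / 2) * px_per_deg
    left, right = center - half, center + half
    if left < 0 or right > tex_width:
        raise ValueError("head turned outside the prerendered region")
    return int(left), int(right)

# Looking straight ahead: the middle 80 of 240 degrees.
print(visible_window(0.0))    # (1920, 3840)
# 30 degrees to the right: the window shifts by 30 * 24 = 720 px.
print(visible_window(30.0))   # (2640, 4560)
```

The ValueError case illustrates the catch raised elsewhere in these comments: if the head turns past the edge of the prerendered region, there is nothing valid to display.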

The other thing I can think of is that there’s been research done into “faked” 60hz stuff: reprojecting pixels from the last frame onto an intermediate frame while the “real” frame is being done. Again, we don’t care if the animations are rendered at X times a second – 30fps is pretty decent. We only care that the correct portion of the world is displayed when the user’s head moves.

Generally an interesting idea, as noted previously. But it doesn’t work for translation. It also doesn’t work for anything but small rotations, because the texture starts deviating from the proper projection pretty quickly.

Faked 60 Hz probably wouldn’t work well for dynamic images, because moving objects wouldn’t advance properly on the faked frames and would seem to stutter.

If you can’t make the display faster, why not make the head slower? Any combination of a heavier HMD and a gyroscope could work. I mean, most of these games are about being a soldier of some kind or a racecar driver – always something that would actually imply wearing some heavy shit on your head anyway.

While it would be awkward at first, if your theory proves right and low latency delivers this massive “real” feel, people might accept the weight as a tradeoff for the immersion.

Yes, we’ve joked about that, and it would be a handy solution if it was realistic for a product. But we’ve had plenty of experience with NVIS ST-50s, which are heavy and require lots of angular force, especially after we added a camera and a counterweight, and after a while it’s really uncomfortable.

To hit a different theme, I think you’re selling prediction short. The head *moves* quickly, but there are limits to its *acceleration*, and specifically it’s the inaccuracy due to a frame’s worth of incorrectly guessed acceleration you need to worry about. Some of your improvements might come from using late data better (pan or otherwise modify your image at the last millisecond based on the very latest data), but a lot might come from investing in making the absolute best guess you can 16 ms or whatever in advance so you have less patching-up to do.

I’d love to see a graph of angular acceleration (and ‘regular’ acceleration) from a real headset, to see how predictably people move their heads. When I snap my head 45 degrees to the left, how long am I accelerating and how long am I decelerating (or correcting if I overshot and turned 48 degrees at first)? How pretty vs. noisy is the curve? If you watch me for an hour, what’s the set of different motions I’m gonna make? What are my constraints (how far can I turn my head?) and habits (do I tend to end up looking squarely at a sound source or in-game object?)? And all that.

From a naïve model, it seems like you can still occasionally get significant angular error from one frame’s wrong acceleration. (If you assume I turn my head 90 deg by accelerating for 1/8 a sec and decelerating 1/8 a sec, but you mis-guess by one frame when I switch from accel to decel, you’ll have a 3.2-degree error that frame.)
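A quick check of that arithmetic (my assumptions: constant-magnitude acceleration, a 60 Hz frame, and detecting the accel-to-decel switch one frame late):

```python
# Assumed motion: a 90-degree head turn made of 1/8 s of constant
# acceleration followed by 1/8 s of constant deceleration.
T = 0.125                        # seconds per phase
a = 2 * 45.0 / T**2              # deg/s^2 needed to cover 45 deg in T
v_peak = a * T                   # deg/s at the switchover

# Suppose the predictor assumes acceleration continues for one extra
# 60 Hz frame after deceleration has actually begun: the assumed
# acceleration is then wrong by 2*a for one frame.
dt = 1 / 60.0
pos_error = 0.5 * (2 * a) * dt**2     # position error at the end of that frame
vel_error = 2 * a * dt                # leftover velocity error, deg/s
approx_error = vel_error * dt         # cruder delta-v * one-frame estimate

print(a, v_peak)                 # 5760.0 720.0
print(round(pos_error, 2))       # 1.6
print(round(approx_error, 2))    # 3.2
```

Under these assumptions the immediate position error is 1.6 degrees; the 3.2-degree figure matches the cruder velocity-error-times-one-frame estimate, and either way the error keeps growing until the prediction is corrected.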

This may never happen, but it would be amazing if, say, eye tracking or sensors on the neck (that ‘see’ what the muscles are doing) could help you make better predictions than you can with a gyro/accelerometer alone.

And panning isn’t all you can do to patch up an image. Render some close-up stuff late, or render some things as ‘billboards’ you can nudge around as if on a 2-D canvas at the last second. If you’re rendering at higher than display res, and rendering a depth map, you can do some distortion at the last millisecond too if the GPU is up for it. It’s a big deep research project to get all that right, but that’s what VR/AR is it seems.

> It’s a big deep research project to get all that right, but that’s what VR/AR is it seems.

Indeed it is

Very nice idea with using eye tracking or muscle sensing to enhance prediction. I have no idea whether it would work, but it’s another approach worth trying.

The head accelerates and decelerates shockingly fast – up to 1000 degrees/sec/sec. So at 60 Hz, the distance covered during a maximum acceleration frame could change by something in the neighborhood of 8 degrees, which is huge. Even 1 degree error makes the virtual images seem to not be in the world, and it happens every time you start or stop moving your head. Solving that would make AR/VR seem vastly more real.

There are a couple of issues with last-minute distortion. The big one is that it doesn’t work for translation, only rotation. The lesser one is that it doesn’t account for dynamic object motion, so moving objects might seem to stutter.

Holy cow – I hadn’t seen that! That’s kind of hard to believe; if you maintained that acceleration for 10 ms, you’d go from 0 to 500 degrees/second in that time. Not to mention what would happen if you maintained it for a full second. I’ll definitely read it over.

I couldn’t see any jerk figures in the thesis. The jerk, not the acceleration, is the main cause of sea sickness. I would guess the jerk is low during head movements, making the acceleration curve smooth and predictable.

Further to the thoughts on eye tracking – rather than, or maybe in addition to, using it as a movement-prediction aid, could you not use it to enable detailed rendering of only the tiny area of the screen that is being focused on at any given time? The acuity of vision drops off massively away from the relatively tiny area of focus – something like the size of your thumbnail at arm’s length, if I remember correctly – so you could theoretically render high detail only in those areas, cutting down on your 16ms of render time (you could even drop colour from the rendering of any content in the periphery of your vision, as we don’t process colour past a certain angle). You could also use the fractions of a second when we are effectively blind during saccades to redraw scenes, potentially cutting into latency further. I’d imagine that you could also use eye tracking to better replicate the focal blur of normal vision.

I’m basing all of this on a fairly rudimentary understanding of the visual system, and on having once played with an eye tracker that managed to dynamically place a dot in the centre of your visual field no matter where you were looking – so I have no idea how feasible any of this might actually be!

Yes, that would reduce rendering time; it’s a good idea, and has come up previously in these comments. Of course, it adds eye tracking hardware, which increases cost, complexity, and power demands – no free lunch, but might well be worth it.

Someday, that could be how everyone interfaces visually. It’s a long way off, though. Not only is the technology crude, although improving, but also it requires surgery, and I’m pretty sure it’ll be a while before people would willingly undergo surgery to get a better virtual display. Also, the retina and optic nerve do a lot of work – which the brain is trained to work with – and my guess is that it’ll be a long time before we are able to replicate it. Given how hard it is just to display an image on glasses that the eye and brain are happy working with, I have no idea how long it would take to get things right with direct neural interfacing – but it’s certainly not going to be soon enough to matter for the first few generations of AR/VR.

I’m familiar with Sheila Nirenberg’s work, and it’s impressive – but there’s still a long way to go, as measured in years to product, or even to productizable technology. The decoding part also is impressive but is likewise nowhere near ready for prime time. It’s also not clear whether it would be useful for an interface, since at this point they’re just picking up images that people are seeing from the visual cortex, not anything that people are actually generating, so it’s unclear whether it generalizes to generated input. And they do say, “However, researchers point out that the technology is decades from allowing users to read others’ thoughts and intentions.”

So – cool stuff, but I stand by my original statement: “Someday, that could be how everyone interfaces visually. It’s a long way off, though.”

Yes, thanks to everyone for sharing their thoughts. The quality of the comments is remarkably high (and I have let everything through so far).

Given that you have a piece of hardware sitting on someone’s head, it would be interesting to look into trying to detect muscle movement in the neck for cues on movement. No matter how fast your cam+gyro sensor pair gets, you will always be fighting a resolution-vs-noise fight. Knowing what the muscles are about to do could help you cheat and get moving ahead of time.
Obvious difficulties are getting contact with the neck (no-one wants sticky pads, and some of us are fairly hairy!) and filtering out signals from shouting etc., but like I say, it could be an interesting experiment.
Thanks for the fantastic write-up.

Further along these lines, I remember reading a paper on subvocalisation technology that worked off the fact that there’s a neural “buffer” near your voicebox that you can read sounds from even if the individual doesn’t intend to speak them out loud. I wonder if there’s an analogue for muscles.

Actually, though, you wouldn’t need to go that far – I suspect that you could guess how far the head is going to turn by noting that most skeletal movements are symmetric: one muscle accelerates its joint into the movement, then the other one decelerates it to a stop. So when the accelerating muscle starts to go slack and the decelerating one begins to tense, you should be ~halfway through the turn.

I’d be willing to bet it’s not as neat as a switchover halfway through the movement – it never is. And I wonder how much variation there is across the population. And there’s still the question of how exactly to pick up those muscle motions without intrusive sensors. Still, an interesting direction to look at.

It sure would! If we had quantum computing and direct neural interfaces, we’d have a pretty sweet system. And someday we may have that, but unfortunately neither one will help in the next five years, which is about as far as I can usefully think about right now.

I have a theory of motion perception which is derived from the fact that scrolling LED displays appear to show impossible italics. In this theory (which this margin is too small to contain) the precise sampling in time of each pixel should match the precise time when that pixel is output. So, this is like racing the beam and having a new camera per scanline, but also with a dt per horizontal pixel! With that setup it should be possible to create ultra-fluid-looking camera motion, I think (?) although I haven’t done the demo work to prove this (an old CRT and a matched video circuit looks like a minimum hardware requirement). Interestingly, the oldest scanning cameras and matched CRTs automatically did this, which is perhaps why old analog TV looked so good at the time.

It is therefore very interesting to hear about you racing the beam in VR. And interesting to infer that your VR experiments imply that hardware manufacturers need to think about making this possible.

The “obvious” approach to drawing like this is surely raytracing (that would certainly send us back to 2000-era rendering quality!) If your scene only moves at 60fps, say, but your camera moves faster, you have to reorient the ray generation per scanline, but can still use the same acceleration structure. Thus, a raytracer capable of rendering scanlines in real time could be adapted to grab orientation data at scanline rate and adjust each scanline’s rays accordingly. And you could even add prediction across the dx direction so each pixel is very slightly adjusted too (or add a scanline of latency so dt/dx can be calculated to exactly interpolate, rather than have a prediction which could be off).
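A toy version of that per-scanline (and per-pixel) ray reorientation might look like this (yaw-only rotation and a pinhole camera; everything here is an illustrative assumption rather than a real renderer):

```python
import numpy as np

def scanline_rays(width, height, fov_deg,
                  yaw_at_line_start, yaw_at_line_end, y):
    """Generate view rays for one scanline, interpolating the camera yaw
    across the line so each pixel is traced at (approximately) the moment
    it will be scanned out. Toy model: yaw-only rotation, pinhole camera."""
    f = (width / 2) / np.tan(np.radians(fov_deg) / 2)
    rays = []
    for x in range(width):
        # Per-pixel time within the scanline -> interpolated yaw
        # (the "dt per horizontal pixel" idea).
        t = x / (width - 1)
        yaw = np.radians((1 - t) * yaw_at_line_start + t * yaw_at_line_end)
        # Camera-space ray for this pixel...
        d = np.array([x - width / 2, y - height / 2, f])
        d = d / np.linalg.norm(d)
        # ...rotated by this pixel's own yaw.
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        rays.append(R @ d)
    return rays

# With zero rotation, the center pixel of the middle line looks straight ahead.
rays = scanline_rays(640, 480, 90.0, 0.0, 0.0, 240)
```

A real implementation would presumably do this on the GPU and fetch fresh orientation data per scanline rather than interpolating between two endpoints, but the structure would be similar.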

Then, according to my theory, the correct match between sampling time and output time allows the brain to interpret the results as you would like. And I’m letting myself down here because I never found time to write my theory up properly, to convince you about this rule

Note that for this to work you would need exact timing information for the display scanout (e.g. HBLANK intervals and so on). And, as you say above, it should work better if pixels are on for short periods of time rather than for 1/60th of a second, although I expect this dramatically affects brightness. The low-pass-filter effect of not doing this might not be too bad, though – but it might make you feel as if you’re drunk, which could induce sickness.

I’m not sure if these theories/insights are of any value, but I thought I’d offer them anyway

Interesting – I’d never thought about raytracing each ray at the right time. Unfortunately, raytracing doesn’t match the model all games use, and doesn’t leverage hardware acceleration well, so it’s probably a non-starter right now. But it is a conceptually clever way to approach the problem. Nice!

That could help some, but it has issues with cost and electrical power demands. And it doesn’t do anything to reduce the display latency, which is the biggest part of latency and the hardest to reduce.

I saw a comment about blanking the image when the user turns at high speed. Instead of blanking, how about drawing a pre-made blurred image that distantly resembles the HUD, and fade into true HUD when the user stops, so it looks like the user is focusing, while not redrawing everything pointlessly?

And, you don’t have to rerender everything in case of AR, you only need to redraw modified areas. That’s not true for VR, however.

Good thought, and it might work. One possible issue is that when your eyes come out of saccade, they check what they see against what they expected to see when they started, and if it doesn’t match, they saccade again to find the right landing spot. If the image is still blurry at that point, I’m not sure what would happen.

Also, if you just keep your eye on something while you turn your head, you see just as clearly as if your head wasn’t moving, so blurring wouldn’t help there.

Finally, a saccade can take 150-200 ms, or around 10 frames. A lot can happen in that time in terms of objects moving in the scene, so just using a pre-made blurred image could cause confusion due to the objects not moving in the blurred image, then jumping to the right locations when you stop; it would depend on how much information comes through during saccades.

Interesting post. I can’t really contribute much in terms of discussion, but when I got to the end of the article and started reading through the comments, I noticed a name — MAbrash. That sounded familiar. This article was linked on digg.com so I kind of just ‘fell’ onto this page but I certainly remembered a Michael Abrash back in the days of old. Many interesting tutorials and especially that billion page book on my shelf about graphics programming =). A pleasure to virtually meet you

Nice to hear from you! Back in the pre-Internet days when I did most of my writing, I rarely got feedback from anyone who read what I wrote, so I always wondered if there was really anyone out there. (It didn’t help that I wrote for magazines like Programmer’s Journal, which only had a few thousand subscribers.) So it’s a pleasure to hear from a long-ago reader – it makes all those nights and weekends worthwhile

What if instead of blinking the display or trying to keep incredibly high performance, you factor in a “confidence blur”, where as an object starts to move, the related pieces of the HUD blur until the software is more confident in its location? It seems that mimicking the blur of human vision when things are in motion would be something the mind would readily accept.

That might work in some circumstances. However, it wouldn’t help with the case where you are keeping your eyes fixed on an object while you turn your head rapidly, which happens, for example, when you turn to look to the side; your eyes turn first, and then they remain locked on whatever you’re looking at while your head pivots to catch up with the eyes.

I’ve been reading up on design elements and how users perceive them from both a psychological standpoint and a tangible one. AR/VR seem to be about taking the user to “the next step” and helping immerse them into a seemingly realistic world.

Traditionally, we have a field of view that works in tandem with our central vision and our peripheral. This means that the “image” we see is absolute in that it fills the entirety of what we “see”, but we are only focused on things within our central vision.

State changes that occur in the periphery, such as changes in brightness/contrast, have an effect on users, often causing them to turn and address those things. This is why many advertisements choose to blink or animate; even when we are not focused on them, our peripheral vision picks up these changes at the outer edges of our vision and pulls our attention to them.

In the same breath, what we are looking at with central vision is what we can analyze. It’s impossible for us to read a book with our peripheral vision; we have to bring the subject matter into central vision to analyze what we’re seeing and process the shapes and forms to both read and comprehend the input.

Devices like the Oculus Rift encompass your field of view, and will have to deal with these anomalies (as you stated, the brain is VERY good and quick at determining when there are unrealistic oddities in the overall experience). This suffices for immersion (and doubly provides excellent head/eye tracking functionality) but leaves user input within that environment up to traditional methods (like keyboard/controller/gamepad). Being a fan of simulation games, I can only fathom how engrossing this will be once the technological aspects are ironed out.

This type of subject matter, and the sheer effort needed to bring it to reality, are truly jaw-dropping. Digital entertainment is an amazing field; there are so many varied, skilled and creative professionals working towards bringing the next great thing. I hope consumers don’t lose focus and take for granted those who work so diligently towards blurring the line between what is virtual and what is not.

Hmm… My brain isn’t deeply into the technicalities of the problem here, but your mention of “racing the beam” immediately made me think of ray-tracing. Have there been any recent advances in using the trend toward many-core and more generically programmable GPUs to capitalize on the scalability of ray-tracing?

“This turns out to be challenging to do well, and there are no current off-the-shelf solutions that I’m aware of, so there’s definitely an element of changing the hardware rules here. Properly implemented, sensor fusion can reduce the tracking latency to about 1 ms.”

InterSense out of Burlington, MA has been offering COTS acoustic-inertial and optical-inertial fused 6DOF tracking systems for over a decade. By combining both filtering and prediction in the same model, they achieve pretty solid results… virtually all of my AR research work in 2001-2007 was built around their systems.

Sounds to me like the most straightforward solution is for someone to make a fabulous HMD-specific display panel. I wonder what order-of-magnitude investment would be required?

In general, there seems to be a technology nexus in the making. The technology needs of the current form factors (PCs, consoles, phones, tablets) seem to be about tapped out in terms of resolution, rendering speed, even bandwidth. We always want better devices, but we are at the point of diminishing returns as far as user experience.

VR could be the next wave that drives both improvements in hardware and new device sales. At some point, the major hardware manufacturers will embrace this reality and the awesome VR goods will start to flow. There are probably decades of headroom for VR hardware improvements.

Who will be the next to follow Oculus with a high-FOV HMD? Apple? Microsoft? Samsung? Sony? I’m not personally an Apple fan, but I look forward to the day they sell an HMD.

Agreed that VR could drive hardware; in fact, I think I said that back in my discussion of AR versus VR. Also agreed that there are decades of headroom for VR; heck, there were decades of headroom for 3D accelerators, and VR is a much more complex problem.

As for who will follow Oculus – I’m not sure. The big companies may not be interested in VR, because it’s not really a platform in the sense that desktops, tablets, and phones are – more of a peripheral – while AR potentially is a platform.

I have read Ready Player One, and I loved it; I mentioned it a few posts back. And reading it did make me think harder about VR as a platform; when I discussed VR versus AR, I noted that it was possible that VR could become a platform. Personally, I think that would be pretty compelling, but I’d guess that VR as a platform, if it ever comes to pass, is a long way off; it’ll require much better resolution, and haptics that aren’t close to existing right now.

Very interesting and very clear read for someone who knows very little about VR.

It makes me wonder if we might soon (Valve-soon) be coming to a “nexus” of converging evolution from various trends in parallelism and move away from tessellation and rasters to vectors and rays. It seems that without rasterization beam racing would be the way to go.

However, the problem of updating the scene in real time remains; it seems that for this we would need many James T. Kirks to free us from the event loop and use the CPU cores to great effect. Would a continuous simulation where each solid body runs on its own core be imaginable? The GPU would also need to be moved closer to the main memory bus, and it would become a true co-processor.

Very interesting field to be a pioneer in. I’m open to job offers, but you’ll have to fly me in from Paris

I think it’s unlikely there’ll be a shift from triangles and rasterization anytime soon; there are just too many advantages, and it’s too deeply entrenched. We definitely won’t get a core per solid body in the foreseeable future; the more cores you have, the bigger the synchronization and communication problem. Having said that, obviously increased parallelism could lead to decreased rendering time, which would decrease latency, so that’s a promising – but hard – direction.

Michael, please don’t stop looking for your Kobayashi Maru mode. There is always one hiding somewhere outside the box, just like with Mode X.

There are SO many people waiting for *great* VR/AR to become reality, and VR, I guess, is the lowest-hanging fruit right now. It NEEDS to be possible with current technology, even if some compromises have to be made – but without accepting a compromise that severely impacts quality.

I have not tried the Oculus Rift myself – but isn’t it possible that this first generation of VR just needs to be good enough? From what I hear it really is groundbreaking, even though we’re not at 120hz displays or high enough resolution just yet.

I remember in the early days of quake when everyone was still using software rendering, and it was awesome… then one day there was 3dfx and OpenGL and everything changed overnight, what was previously awesome was now lightyears behind what 3dfx/OpenGL could do.

Could there be a similar leap between first generation VR technology, and then one day a 120hz display comes along and changes everything – or will these changes be more gradual than that?

Thanks for pursuing this, Michael. I am looking forward to using whatever product(s) are currently in development, hopefully sooner rather than later.

Yes, this could certainly be like 3D gaming, with gradual improvements punctuated by sudden leaps. The real questions are whether the first generation will be good enough to spark things and if so whether there’s a lot of headroom for improvement with feasible consumer-class technology. Like you, I hope the answers are both yes, but I don’t have the data to know for sure yet.

1. When you were describing the pipeline and adding up the cumulative latency, it struck me that while the sensor lag and the photon-switching lag are both pinned in place by ‘brick wall’ physical constraints, the frame-by-frame scan-out method of driving a display is present in the model only because of tradition; it seems to just be accepted that ‘that’s the way you do things’, and perhaps identifying and challenging what we take for granted is the path to breakthroughs.
Being locked into only ever pushing an entire frame – nothing less and nothing more – seems completely arbitrary, and particularly detrimental to this application.
With an unlocked display output, many things would become possible – one method off the top of my head (although probably only suitable for AR) would be to borrow a trick from video codecs: only bother telling the display what has changed, instead of wasting bandwidth (and, more importantly, time) on pushing another copy of the same 66% of a scene.

2. It would seem trivial within an HMD to monitor the wearer’s pupil to determine what is being focused on, and you even get the benefit of sampling both eyes for more reliability/accuracy. This could be used to direct a finite resource (rendering capacity, and/or the arbitrary display output from 1.) toward the part of the scene that is most important.

I imagine that the key is to identify what parts of the system are absolutely essential (latency), what parts are potentially degradable without compromising the system (peripheral rendering), and how to degrade those parts in a way that provides a useful effect.

Good thoughts. In VR, though, every pixel changes every frame, in general, so there’s no advantage to doing partial updates. In AR that might not be true, although it is in the worst case. Also, a whole new transmission protocol would have to be developed to send partial updates to the display, and that’s a non-trivial undertaking.

Concentrating rendering quality where the user is looking is very interesting, and was discussed in the comments for the last post. The big issue there is how to build a system that has variable resolution that can be directed toward the fovea.

Any monitor can of course have infinite lag. However, I’m not sure how a 60 Hz monitor can have 2 ms lag, unless you’re racing the beam; each pixel only gets updated once every 16 ms, so if you render a frame and send it to the display, it could be up to 16 ms before all the changes are visible. As I noted in the post, it would be possible to send the data over faster and update the pixels faster, thereby reducing latency, but I don’t think any current panels do that, for the simple reason that there’s no benefit to doing so. I would be delighted to learn that I’m wrong
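To put rough numbers on that: with conventional full-frame scanout, a pixel’s latency depends on where it sits in the frame, and the bottom row can’t be fresher than a full frame time. A simplified model (ignoring blanking intervals, which add a bit more):

```python
def worst_case_scanout_ms(refresh_hz=60, rows=1080, row_of_interest=None):
    """Time from the start of scanout until a given row's pixels have been
    transmitted, ignoring blanking intervals for simplicity. With a
    conventional 60 Hz scanout, the bottom of the frame trails the moment
    rendering finished by ~16.7 ms, regardless of the panel's quoted
    'response time'."""
    frame_ms = 1000.0 / refresh_hz
    if row_of_interest is None:
        row_of_interest = rows            # worst case: the last row
    return frame_ms * (row_of_interest / rows)

print(round(worst_case_scanout_ms(), 2))                      # 16.67
print(round(worst_case_scanout_ms(row_of_interest=540), 2))   # 8.33
```

This is why the quoted “2ms response time” of a panel says nothing about end-to-end latency: it measures pixel switching, not how long the data takes to arrive.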

Actually, camera tracking often has some drift too, depending on the tracking algorithm. If we assume feature-point tracking, a feature descriptor will look a little different when viewed from a different position, and the error tends to accumulate. Global 3D reconstruction methods like global bundle adjustment analyze many image frames in one pass, and even then sometimes have a problem with “loop closure” – essentially the same accumulated drift. The most comprehensive no-drift method is tracking an already-known 3D model, which means the environment has to be pre-reconstructed…

I’m not sure why there would be cumulative drift. I haven’t seen any cumulative drift with optical tracking personally. Are you saying that if you look from a particular position and orientation, then move around, then return to that exact same position and orientation, optical tracking would report a different pose from the first time?

I’d assume that both eyes need to be updated at 60hz in order to get the benefits of that frame rate, but has this actually been tested? What if both eyes are updated at 60HZ but with half a phase difference between them?

I can see 3 possibilities:
a) It feels worse than a regular 60 Hz update, with a kind of ‘stereoscopic tearing’ effect
b) It feels the same as a regular 60 Hz update
c) It feels better than a 60 Hz update – who knows, maybe as smooth as 120 Hz

Should be an easy enough experiment to do, particularly by rendering a view twice at 20 fps but staggered by 10 frames between them.

I propose the following (in fact I can knock this up in 30 minutes using Unity3D, so that is probably easier).

Split the screen into 2 halves, with the same camera rendering to both sides, then view it using a piece of card so that each eye sees one half.

1) Now I will animate something in the 3D scene in front of the cameras, and initially update both viewports at 30fps synchronised together.

2) Then I will try updating both viewports still at 30fps but out of sync by one frame each. So basically update Left, update Right, update Left etc..

I want to test my assumption that the effect in step 2) is the same as (or worse than) the effect in step 1).

Without actually testing this, it is possible that my brain is already having to handle ‘out of sync’ information coming from both eyes, and has evolved a mechanism of ‘seeing through’ this noise.

It is possible that the effect in step 2) will be the same as viewing a video at 60fps.

—

I guess what I’m really asking is how well matched the 2 visual fields really need to be. Presumably there is already quite a bit of distortion unique to each eyeball, and a semi-random coating of liquid each time I blink.
Because I’m not aware of these distortions it is possible my brain is compensating by intelligently mixing the information coming from each optic nerve, and this system ( if it exists ) could potentially be exploited.

Worth trying out. I don’t think you can do this with a normal monitor, though, since you’re not going to be able to get your eyes to fuse on two images splitting the screen between them – they’re too far apart.

This will obviously work, since shutter-based 3D works that way. And it does reduce latency from render to photon for a given eye to 8 ms, which is interesting.

Actually, I can fuse such images on screen, if they are small enough. But by doing this one interferes with stereoscopic fusion in a big way. In your example, if you are about to render an object moving sideways, left-to-right, then after a left update and before the next right update the stereo disparity is smaller than immediately after the right update and before the next left update. In fact, it can even result in inverted parallax depending on object speed. In the general case there are all kinds of other problems: vertical disparity (object moves vertically), size mismatch (object moves in depth), etc. For small object motion the discrepancies may be subtle (I ignored them in work I did years ago using a Sony Glasstron stereo HMD), but they may well be headache-inducing nevertheless.

I’ve spent some time thinking about this in the past, and my conclusion is that we need to stop thinking about tracking -> rendering -> display as a linear sequence of steps. Part of the problem here is that we’re conflating the type of latency that is familiar to game designers — time from control input to audio-visual response — with a new type of latency based on the physical motion of the player’s head. We need to keep this secondary latency low (7-20 ms), but it’s still acceptable for controller latency to be relatively high by comparison (100 ms).

Before I get into my solution, it’s also worth noting that the linear projections typically used by games are less suitable for head-mounted displays. Particularly for the large FOV promised by the Oculus Rift, it makes sense to initially draw the scene using linear projections and then reproject the scene by warping pixels to minimize distortion.

With these in mind, here’s the solution that I’ve arrived at:

Thread 1: Read the current head position (4ms) and render the scene to a large buffer that extends beyond the player’s FOV. In the extreme case, I could even imagine rendering an entire cubemap with the player at the center. Let’s say this takes 30ms.

Thread 2: Read the player’s current head position (4ms) and reproject the render buffer onto the display. Note that we’ve pulled the render step out of the equation. All we do is reprojection.

This solution (which would likely require triple-buffering to completely decouple the render and display threads) leaves the control latency high but keeps the motion latency low while allowing the rendering of far more complex scenes than would be viable if we treat tracking -> rendering -> display as linear steps. As a bonus, it could help solve the distortion issues that tend to arise when using a very large FOV with linear camera projections.
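A minimal sketch of the two-thread scheme described above, with a shared latest-buffer slot standing in for the triple buffering (all names, timings, and the one-number “pose” are illustrative assumptions, not a real implementation):

```python
import threading
import time

lock = threading.Lock()
shared = {"buffer": None, "render_pose": None}
head_pose = {"yaw": 0.0}            # would be updated by the tracker

def read_head_pose():
    return head_pose["yaw"]

def render_wide_fov(pose):
    time.sleep(0.030)               # the ~30 ms wide-FOV render
    return f"scene@{pose:.1f}"

def reproject(buffer, render_pose, display_pose):
    # warp the wide buffer by the pose delta accumulated since render start
    return (buffer, display_pose - render_pose)

def render_loop(stop):
    # Thread 1: keep rendering oversized buffers from the sampled pose
    while not stop.is_set():
        pose = read_head_pose()
        buf = render_wide_fov(pose)
        with lock:
            shared["buffer"], shared["render_pose"] = buf, pose

def display_loop(stop, frames, n_frames):
    # Thread 2: at display rate, reproject the newest buffer using a pose
    # sampled at the last possible moment
    while len(frames) < n_frames:
        with lock:
            buf, pose = shared["buffer"], shared["render_pose"]
        if buf is not None:
            frames.append(reproject(buf, pose, read_head_pose()))
        time.sleep(1 / 60)          # 60 Hz refresh, decoupled from rendering
    stop.set()

frames = []
stop = threading.Event()
threading.Thread(target=render_loop, args=(stop,), daemon=True).start()
display_loop(stop, frames, n_frames=3)
print(len(frames))  # 3 - the display presented frames despite 30 ms renders
```

The point of the sketch is that the display loop never waits on the renderer; it always reprojects whatever buffer is newest.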

I don’t know how useful this is, but this story might spark an idea for you. I was writing a game the other day and, as an experiment, I set a background image (just tiled) to about 0.25% transparency and painted it over the previous frame’s framebuffer. I then painted the foreground characters on top. This causes some pretty basic motion blur, but it removed a lot of the problems I was experiencing with sprites jumping about, image tearing, etc.

I’m not entirely sure whether a similar method is used deliberately in mobile phones for the camera preview to make it less problematic. And the problem has existed for years in phosphor afterimage and LCD response time, so if artificially causing this phenomenon masks the original problem, it would be more acceptable.

That would be a form of motion blur or temporal antialiasing; it wouldn’t really help with latency, since it would actually increase it, but it can in fact help with artifacts, as you’ve found, and is an interesting approach to look at more closely.
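For concreteness, the feedback blend described in the comment above can be sketched in a few lines (frames as flat lists of grayscale pixels; the names and the alpha value are made up for illustration):

```python
# Fade the previous output toward the background with a small alpha, then
# draw opaque foreground sprites on top - each moving sprite leaves an
# exponentially decaying trail, i.e. cheap motion blur.

def feedback_frame(prev, background, sprites, alpha=0.25):
    faded = [(1 - alpha) * p + alpha * b for p, b in zip(prev, background)]
    for index, value in sprites.items():     # sprites: pixel index -> value
        faded[index] = value
    return faded

bg = [0.0] * 4
frame = feedback_frame([0.0] * 4, bg, {0: 1.0})  # sprite at pixel 0
frame = feedback_frame(frame, bg, {1: 1.0})      # sprite moves to pixel 1
print(frame)  # [0.75, 1.0, 0.0, 0.0] - a fading trail left at pixel 0
```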

Pretty obvious, but I was thinking about the different possible displays for VR, moving closer and closer to the retina.
For each solution the rendering area that needs to be at max resolution decreases, but the tracking requirements and sensitivity to latency increase.

1) The display is not tied to the user, e.g. the virtual world is projected on the walls of a room or a frame that surrounds the user (like the cockpit of a flight/racing sim). We don’t need to track anything so latency has no impact. 3D seems infeasible.
2) Fixing the display to the shoulders of the user, like a sort of 360-degree bubble helmet, we need to track tilting/rotation of the body (still OK for cockpit sims). 3D is not possible, and there are lots of focus issues.
3) Fixing the display to the head, close to the eyes (VR goggles), we need to track the absolute position of the head, moving the eyes relative to the screen has no latency.
4) Fixing the display to the eyeballs, with contact lenses, we need to track the absolute position of the eyeballs. The advantage here is that the display area is really minimal, but eyeballs can move so fast that tracking must be a nightmare. I wonder what the max acceptable latency is in this case.

I think that racing sims are going to benefit right away from VR goggles (à la Oculus Rift); the user focuses on the center of the view (far ahead, down the road), and head movements are rather slow and continuous (as in turns). The added 3D and wider field of view will add a lot to the immersion.

The problem is getting the data quickly enough to the hardware drivers that drive the LCD panel. The Kobayashi solution is to render multiple versions of the image from a variety of angles around where the predicted centre will be. Ship this data off to different LCD drivers in parallel. Then use a mux to decide which LCD panel driver should be connected to the LCD panel. This delays the decision as much as possible, while converting the problem from a serial one to a parallel one.

Further, recall that humans see less detail when images are moving quickly. That means one can probably get away with rendering more images at less detail when the predicted head position error is large, yet render fewer images at more detail when the error is small. This suggests that a hybrid LCD panel driver, which can render either one of a few lower-resolution images or a single highly resolved image, would be valuable. (The data bandwidth for 1 high-resolution image == the bandwidth for multiple low-resolution images.)
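The bandwidth equivalence in that last parenthetical is just pixel-count arithmetic; for example (resolutions chosen purely for illustration):

```python
def pixels(width, height):
    return width * height

# One full-resolution image carries the same number of pixels as four
# images at half the resolution in each dimension.
assert pixels(1920, 1080) == 4 * pixels(960, 540) == 2_073_600
```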

If one were going down the road of custom silicon, there’s no reason the LCD drivers + GPU couldn’t be on the same chip. That would reduce data traffic on the video output path. (The VR goggles would contain the GPU, and while texture uploading would be slower, most uploaded data would be vertex lists, OpenGL commands and the like.)

I like your out-of-the-box thinking – but it just pushes the problem back up the pipe to the GPU, which now has to do a lot more work. So the display latency decreases, but the GPU latency increases. And sure, you could have multiple GPUs, but remember this is a consumer device, so the cost has to be reasonable.

I’m also not sure how many versions you’d have to render in order to have adequate coverage for the range of accelerations that could occur before display occurs – it could well be a large number. And remember, the more versions you have to render, the farther ahead you have to predict for the first ones you render, because of the others being rendered later, which increases the cone of possible locations by the time the first ones might be displayed.

Good thought, but consider the case where you turn your head quickly while fixated straight ahead. The thing you’re staring at stays perfectly clear, even though your head may be turning at 100 degrees/second. If you reduced fidelity, it would be very noticeable. Not to mention you could probably still tell that the virtual image wasn’t staying properly registered, because you couldn’t reduce fidelity enough to cover up multi-degree errors.

Reminds me a bit of the Crytek solution for doing 3D rendering.
They don’t render the image twice, but I think they use depth buffer data to render a shifted image faster (almost at no cost: http://www.eurogamer.net/articles/digitalfoundry-crysis2-face-off?page=2 ).
So maybe one solution is not to send just a flat 2D framebuffer to the display, but also depth information, and then have some chip manipulate/adjust that image at the last moment based on head position (if the movement is fast, past a certain threshold).
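A toy sketch of that kind of depth-assisted late adjustment, on a single scanline (nearest-pixel-wins z-test and a crude disocclusion fill – all assumptions for illustration; the Crysis 2 technique itself differs in detail):

```python
# Shift each pixel horizontally in proportion to 1/depth to approximate a
# small camera translation: near pixels move a lot, far pixels barely move.

def reproject_row(colors, depths, shift_at_unit_depth):
    width = len(colors)
    out = [None] * width
    out_depth = [float("inf")] * width
    for x in range(width):
        dx = int(round(shift_at_unit_depth / depths[x]))
        nx = x + dx
        if 0 <= nx < width and depths[x] < out_depth[nx]:
            out[nx], out_depth[nx] = colors[x], depths[x]  # nearest pixel wins
    for x in range(width):             # crude fill for disoccluded pixels
        if out[x] is None:
            out[x] = out[x - 1] if x > 0 and out[x - 1] is not None else colors[x]
    return out

# A near object (depth 2) among far pixels (depth 100), shifted by 2/depth:
print(reproject_row([10, 20, 30, 40, 50], [100, 100, 2, 100, 100], 2))
# [10, 20, 20, 30, 50] - the near pixel moved right, occluding the far pixel
# behind it; its old spot was filled from a neighbor
```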

Yes, adding depth info is something researchers have done with considerable success. The question is whether what works on a screen also works in an HMD, where it is supposed to exactly match the actual motion of your head and the real world, a constraint a screen doesn’t have. Definitely worth looking into.

But as Michael said in an earlier post, the main problem with trying to do some sort of approximated rendering to handle fast head movements is that the user could move his head really fast but still keep his eyes fixed steadily on a given point in the world, and that situation would give very annoying visual artifacts.
Especially since in many situations the eyes move first to some point of interest that’s a bit off-center (someone just walked in), and then the head follows while the eyes stay locked on that object.
So in the end maybe some sort of precise eye tracking is necessary.
I have the feeling that a good solution would ultimately involve a sort of hierarchy of tracking and rendering – track the head and render peripheral vision at lower res/larger area with a cheaper approximation, track the eyes to render central vision at higher res/smaller area/lower latency. And composite the two.

Hmm. I like your idea to move the GPU into the goggles. That’d significantly reduce the required bandwidth (you only need to upload the whole scene graph once, and then do incremental updates plus viewport angle fixes).

Anyway, a GPU is a specialized computer. There shouldn’t be any problem with rendering anything that’s not in the primary field of vision with lower quality, thereby improving latency.

I also wonder whether even tighter integration would work. After all, an LCD screen is just a write-only memory with a somewhat strange data path which semi-polarized light can pass through (or not). Why not modify that data path so that the GPU can write directly to the screen? No more “scan lines”; instead, tell it to render the visually important parts first / more often.

There are many pluses to having the GPU in the goggles, but some serious problems. The big one is that you’re limiting how powerful it can be, because you can only dissipate so much heat next to someone’s head. Also, if you want to not be tethered, you can only use mobile parts. Finally, you’d have to move the CPU there as well, for communications reasons, and everything else with it, and now you have put a lot more weight in the goggles.

If the GPU wrote directly to the screen, then you’d see intermediate states of the frame buffer. However, you could do a memory-to-memory copy from the back buffer directly to the screen, and that would be much faster and would work well.

Rendering certain parts more often or in a different sequence might or might not work, depending on the persistence of the pixels. The longer the persistence, the greater the likelihood that you’d see discontinuities between different parts of the screen, and remember that’s the sort of thing the eye and brain pick up very well.

Related to this, Turner Whitted and colleagues have been building an architecture (for display walls?) where they’ve moved all of the GPU/display processing stuff all the way into the display hardware; see their paper from HPG 2009 for a bit of information. Possibly useful food for thought on this front…

A colleague of mine is working on the tracking problem in a different context: a virtual walking stick for vision-impaired people.

It seems to me that in the context of the entertainment business, the tracking problem would be easier to solve if you could set up a reference in the environment that was easier to identify. If you could put, for example, single-wavelength light sources in the environment at more-or-less known locations, a camera tracker could fix on that more easily than it could full-colour visible light, and you could probably use a lower-resolution camera.

If it helps, think Wii remote. The IR sensor runs at a 100Hz frame rate on 2005-era commodity technology. Surely modern sensors can do better than that?

Very true – it’s much easier if there are consistent references to work with. But it turns out it’s still hard even then. Also, the solution has to be consumer-friendly in terms of price, set-up, and working in a variety of environments.

A virtual walking stick would be a great device – and quite a challenge!

One thing that came up is that “frame rate” is a wrong way to think about responsiveness and latency, at least if you’re talking about sensors, because so many algorithms can be implemented online. His prototypes involve plugging an HDMI cable straight into an FPGA. The latency of many image processing algorithms that he’s using is less than one frame, and can be measured in pixels (he used the phrase “twelve-pixel latency” for one algorithm) or scan lines.

He noted that he’s trying to use eye tracking as much as possible. In particular, he’s using the time that people spend blinking (which is a higher proportion of time than most people think) as a chance to avoid work. That’s a pretty clever idea.

Eye tracking has come up in the comments here quite a bit, but one point that I haven’t seen is that it could be used to drive visual importance: say, reduce the level of detail for any place where the user isn’t looking.

Level of detail reduction did come up on the topic of foveated rendering; at least it’s in the paper that was linked. That doesn’t help with the big latency issues, transmission to the display and getting to photon emission, but it can reduce rendering time, which is a plus.

How is your colleague using the blink time to avoid work? People blink for a few hundred milliseconds every 5-10 seconds, which is a few percent of the time, so the work avoided doesn’t seem significant; also, it’s not evenly distributed across frames, so it doesn’t help most frames.

Actually, I checked, and I misunderstood what he was doing. He’s actually using the fact that people typically blink when changing their focus, so there’s a short time after blinking when the eye isn’t looking at anything intently, so tracking doesn’t need to be quite as good.

That may not apply to VR, but it apparently works for this application.

I don’t think it’s a technical problem so much as a financial one and a ‘Kobayashi maneuver’ should recognise that. What this excellent post and discussion suggests is that if VR was as popular as mobile then the technology is already available. Most people want tech. for what it can do for them, so I’m not sure that promoting VR as a novel gaming experience is doing it any favors. That isn’t enough and depicts it as a niche within a niche.
The exercise market is even bigger than gaming as most people are concerned for their health and appearance. However, much of it turns fitness into something boring and painful whereas VR offers the chance to make it genuinely fun. These two markets are made for each other and could both do with a reassessment of the problem.
While I agree that latency and image persistence have to be overcome it’s also important to address the other factors that cause HMDs to induce motion sickness. This may mean avoiding symbolic navigation control such as gamepads or mice, but that could be beneficial if it involves using your whole body.

Consoles have done fine for a long time as primarily gaming machines, but I agree that it would be much easier for AR/VR gaming to ride the coattails of something more broadly compelling, just as iPhone gaming became pretty much the biggest thing on the platform but only after the platform was already established by other applications.

Maybe the exercise market could be a good entry point for AR, but not for VR. Moving around in VR, let alone vigorously, is not a great idea.

Absolutely agreed about addressing other factors in simulator sickness. First, of course, it has to be determined what those factors are. Indirect navigation is definitely an issue – it gives me simulator sickness – but if you don’t have it, then you’ve sharply limited the range of possible applications.

Sure, as I mentioned, prediction works well most of the time. The problem is that when it doesn’t work (under acceleration), it actually makes things worse, because the result is even farther off than it would have been with no prediction. And it’s anomalies like that that jump out in our perception.

This is a fairly random thought, but how about using a polarised display (as normally used for stereoscopic rendering), and put a pair of LCD polarising filters in front of the eyes? That way, you can halve the scan-out latency by updating both displays in parallel (half a frame apart), and flipping the polarisation of the LCD filter to swap which one the viewer sees. Of course, that assumes that the display electronics allow for parallel access, but I can’t see an obvious reason why they wouldn’t (it seems an obvious thing to want to do for stereoscopic apps as well), and even if they don’t then two discrete displays could probably be used with some optical trickery to combine them.

I also don’t know if suitable displays exist in cellphone format, but there are definitely laptop-size ones out there, and if you don’t mind having an (even more) awkward set of optics on your face then there’s always something like the cellophane polarisation solution (http://individual.utoronto.ca/iizuka/research/cellophane.htm) that can be applied to existing regular screens.

I haven’t spent that long thinking about this so I do have this nagging feeling that there’s a critical flaw here somewhere, but at first glance it seems like a plausible solution that could be constructed with currently available parts…

Yes, that’s worth trying out – the half-phase idea was suggested in a comment a day or two ago. Basically it’s a way of sending half a frame at a time, so each half has only half the latency.

I’m not sure what you mean by “parallel” here, though. Are you suggesting updating only half the lines in each display at a time – basically interlacing? I had thought you meant update one display entirely, then the other a half-frame later.

And, yeah – the “parallel” was because I was guessing that the polarised all-in-one 3D displays probably work by having alternating columns of pixels for each eye (as parallax barrier 3D screens do), in which case the probability seemed high that the default driver boards at least would be configured to scan out both at once, and may not be easily adaptable to output the two images independently. So I wasn’t proposing interlacing as a solution, but more as a potential problem…

With a two-screen solution then that problem goes away entirely, though.

It’s mainly this issue that I’m worried about when the engine vendors (Unity, Unreal) promote their planned Oculus Rift support. No doubt it will work and will be easy to use, but how much extra work will they put into minimizing frame latency for VR?

Not knowing their internals, it’s possible this simply won’t be a problem, but it’s also possible that they have a couple of extra frames of latency for CPU/GPU synchronization, multi-threaded engine design, etc. If that’s the case, I very much doubt they will undergo a major rewrite just to reduce latency for the Rift. Custom engines wouldn’t have this problem (I’d be interested to know how much work Carmack needed to put into the Doom 3 BFG engine in this respect), but I suspect many of the initial ‘designed for VR’ games won’t have the resources to build an engine themselves and so will use an off-the-shelf one. This in turn means that how these engines address this issue will have a major impact on the initial perception of the viability of VR in general, and how much traction it gains moving forward.

Speaking of frame latency, there’s a good related post from Andrew Lauritzen over on B3D.

You didn’t want to get more into the 3D graphics pipeline, but isn’t that where the biggest risks are these days? Because we want to achieve high levels of parallelism, a lot of buffering is going on between processors. On most devices we have very little control over what happens when performing DirectX calls, for instance. It may take one or several frames before a draw call is actually sent to the GPU. It is also inherent in the way the industry is going with massive parallelism. This often introduces latency between processors. How do you feel about that?
I sometimes wonder if we’re going in the right direction. I find it weird that nowadays we need one or more separate CPU cores just to prepare and feed data to the GPU to achieve good parallelism. In a way it feels like we’re doing something wrong.
Just a thought: wouldn’t it be better for very low-latency solutions to use some sort of software rendering, issue rendering calls immediately in some way, and move away from multi-core solutions that batch and buffer? What do you think?

I see your point, but my experience has been that it’s possible to drive GPUs with little added latency, and they’re so much faster than software rendering (maybe 100X?) that they’re clearly the way to go.

And bear in mind that I love software rendering – so much so that I’ve written at least four 3D rasterizers.

I meant software rendering more in the sense of an alternative rendering strategy where you favor low latency over high throughput (although I’m aware that high throughput also helps minimize latency). But after a night of sleep, let me explain what I was trying to say. Assume the following game loop:

1. Handle input
2. Update (transforms, physics, logic)
3. CPU Render

So imagine we’d do this on a single CPU core. There’s already latency involved in the input handling here. If you press a button right after step 1, it is going to take another ~16 ms before the button press is picked up and transforms are updated. And then there’s the delay that I mentioned between the CPU render and when the drivers/graphics API actually submit the data to the GPU. That’s another unknown factor. In my experience it can be big. Let’s assume 10 ms. That is, in the worst case, already 26 ms of latency.
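The arithmetic above, made explicit (the default numbers are just the estimates from this comment):

```python
# A button press that just misses step 1 waits a full frame before the
# update sees it, then sits in the driver/API queue before reaching the GPU.

def worst_case_input_latency_ms(frame_ms=16, driver_queue_ms=10):
    return frame_ms + driver_queue_ms

print(worst_case_input_latency_ms())  # 26 - the worst case estimated above
```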

GPUs like the PowerVR are optimized to minimize overdraw with their deferred rendering architecture, but that must introduce some form of latency, because tiles need to be completely filled with geometry before rasterization and shading continue. The profiles I have performed on DirectX graphics cards also show latency between draw calls and actual GPU activity. My theory is that by queueing work to the GPU, it is easier to keep it busy all of the time, thus maximizing throughput. I have a hard time finding actual details and numbers on how these things work internally. If you have some pointers, I’d be very interested.

But back to the issue… to achieve better framerates, you might want to split the render off to one or more separate threads. To maximize parallelism, the input/update and render will run simultaneously. However, the render thread will be rendering the view from the previous frame. The input frequency has increased, giving better responsiveness, and the render frequency has increased, giving smoother visuals. But the latency between the button press and the actual render remains exactly the same!

It feels to me that you’re always looking at a scenario where all the steps involved (input/update/render) need to be fully handled in sequence – and parallelism will get you better framerates but not necessarily improved input latency.
Maybe it would be better to split up the ‘essentials’ from the ‘nice to haves’. With that I mean that orientation of the head, and the rendering from that viewpoint is key. But AI/physics/logic etc could run at a much lower frequency. But that would only solve latency introduced by the update.

I hope I made a bit of sense. I’m very interested in how you approach these problems. You mention that you know ways to have very little latency in CPU/GPU interaction. Can you share?

Right – it’s much easier to use parallelism to increase throughput than to reduce latency. Your explanation made perfect sense.

Regarding latency between CPU and GPU, I’m merely observing that I am not seeing significant latency in this area when running fullscreen as the only active app. And that when I run an old game at 300 fps with tearing, the output seems to show up as quickly as I’d expect. There may be significant latency between input and CPU; that I haven’t checked. But CPU to GPU seems reasonable.

Perhaps the problem of overcoming current hardware limitations lies in a software solution.

To me, the problem is that the head can rotate or translate so quickly that the last image becomes outdated, meaning another frame of the game loop must run including a (likely) time-consuming render step.

What if the engine were designed to predict, and pre-process multiple steps (like a simplified and partial game logic with only basic movement prediction; no expensive collision detection, game scripts, sound processing, HTML GUI layout or equally ridiculous step) while another thread processes input from the HMD tracking and plugs in the few unknowns (view matrices) before passing it all off to the video card for rendering?

Two video cards would give even better performance so a true “double buffered” setup can be achieved.

The idea is to approximate object positions and view matrix movement to accommodate quick head rotations and translations without running an expensive game loop.

I see what you’re trying to accomplish here, but the core problem is that the head can rotate so fast that an image can become outdated while it’s being sent to the display. That’s key because it means that you can’t fix it on the CPU/GPU side except by prediction. And the problem with prediction is that it works most of the time but fails in a big way under rapid acceleration, at which point it’s briefly worse than no prediction would have been, giving rise to exactly the sorts of anomalies we’re trying to get rid of, but in exaggerated form.
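The failure mode under acceleration can be shown with a toy constant-velocity predictor (the one-axis model and all numbers are illustrative assumptions):

```python
# Extrapolating the last measured head velocity across the pipeline latency
# is exact during a steady turn but overshoots badly on a sudden stop.
# Units are degrees and seconds.

def predict_yaw(yaw, measured_vel, dt):
    return yaw + measured_vel * dt

dt = 0.020                              # 20 ms to predict across
pred = predict_yaw(0.0, 100.0, dt)      # predictor assumes 100 deg/s continues

# Steady turn: the head really does keep moving at 100 deg/s.
steady_actual = 100.0 * dt
print(round(abs(pred - steady_actual), 3))  # 0.0 (no prediction: 2.0 deg off)

# Sudden stop: the measured velocity is stale; the head doesn't move at all.
stop_actual = 0.0
print(round(abs(pred - stop_actual), 3))    # 2.0 (no prediction: 0.0 deg off)
```

So prediction converts a 2-degree lag into a 2-degree overshoot in the worst case, which is the anomaly described above.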

Just another “cheat” around the hardware that I didn’t see proposed already:

I noticed that the brain is impressively good at making up a scene from both of our eyes, to the point that our vision can still work (and we wouldn’t even notice much) if one eye is occluded.
Given that a headset usually has two separate 60 Hz monitors inside, one could run them with an 8 ms offset between their updates… effectively updating each half of the image at 120 Hz and shaving 8 ms from the (partial) latency – so that fitting the input + the rendering into an 8 ms frame would lead to 24 ms latency to present the input *somewhere*.
Also, improving the scanout to 8 ms while keeping the FPS at 60 would cut another 8 ms from the one-eye latency.

Still, I don’t know if it would work without feeling annoying or unnatural, given that the user would constantly need to merge two slightly different images while playing.
Maybe removing the most recognizable hard edges with some motion blur would help?

As described, it requires that you render the next frame’s depth buffer and velocity field and paint out dynamic objects (for disocclusion filling) before generating any interpolated data. But other than the initial delay when generating that information, there doesn’t seem to be any reason the technique wouldn’t work with a per-pixel “race the beam” approach.

The resulting artifacts might be too much for an immersive VR experience though.

Yes, that sort of warping is an interesting direction to look in. I can’t say at this point whether it actually works; my biggest concern is translation, because even with depth, there’s no way to recover parts of the image that are initially occluded but become visible.

True…those parts can never be accurately recovered without some form of A-buffer or K-buffer. But the hack used in that technique (blending in neighboring unoccluded image data) appears to be reasonably effective in, admittedly, constrained environments.

How much can you reduce the effective latency by late-binding the view? There’s the direct approach of simply sampling the view at the last moment – when the GPU starts processing the frame instead of when the CPU submits the scene (conservatively to include all potentially visible objects).

Are there even more-aggressive opportunities to delay sampling the view?

> my biggest concern is translation
Error is larger near the camera, but quickly drops off with distance (i.e., 1/w reduces the error). Can you leverage that?

Render “distant” pixels from the current view (or project/predict the view a few msec into the future). This is similar to other suggestions of rendering something like a sky cube. But, reduce the error by “JIT” Reprojecting/warping this view based on the now-current view.

Finish by rendering the small set of objects near the camera.

All of this can be submitted at the same time, with “normal” latency. It reduces the _effective_ latency by late-binding the view for the warp and/or the close objects.

Would sampling the view at GPU start time instead of CPU submit time reclaim some latency? Would sampling after rendering all but the close objects reclaim some more?

I’m sure there are issues to work out where “close” and “distant” overlap. Like: an object can be both close and distant (e.g., terrain). Might require a clip plane at the transition distance. Might be challenging to avoid an annoying seam.

Doug
PS: Apologies that I haven’t taken the time to read all of the comments. I already discovered that my earlier comments weren’t original. Please ignore me if others have already made the same suggestions.

That idea has come up in the comments, but not in exactly that form. It could work. It is a lot of complexity, though, and it would require a change in the way things get rendered, since there would need to be object categorization and tracking. It reminds me of Talisman, and also reminds me that the worst case is the most important one in graphics; here the worst case would be having to rerender everything, which could happen in a room if you were accelerating rotationally at 1000 degrees/second/second, or if you had rapid translation acceleration (there’s huge visual leverage when translating objects within a few meters – try it for yourself).

Finally, I’m not sure there is a “small set” of objects near the camera – or, rather, I’m not sure it would be small. When you have someone a couple of meters away, they can easily take up a big chunk of the screen, and likewise for stuff like explosions – and you’d have to rerender all that, in which case there wouldn’t really be much in the way of savings here. It might actually be slower, because farther stuff would all get rendered rather than z-rejected.

Nonetheless, it’s an interesting approach to think about, and might be worth investigating. So many possibilities to check out!

As a motion capture guy, I don’t know that I have much to add on the topic of latency.

I have however been working on some similar problems, but from the opposite direction. I do a lot with helmet mounted facial capture. Both with single and multiple cameras. Two of our biggest issues are actor comfort and stabilization.

Actor comfort:
Looking at the Rift, I am concerned it’s going to put too much weight on the bridge of the nose. This will get painful quickly. It’s hard to say at a glance what all is in there or how much it weighs. In feature films, we had essentially unlimited budgets and the helmets were custom molded to the actors, and they still experienced significant discomfort.

Stabilization:
In facial acquisition, knowing the difference between the helmet slipping around on the head and actual facial animation can be very difficult. There are no points on the human head that are stable enough to register the data to, and the best techniques I know of require significant up-front data sets for the actor. I suspect that, with a bit of work, you will be able to know where in space the display is, but that doesn’t necessarily tell you where the skull/eyes are. Is that jiggly difference going to be a problem?

I have some experience with MEMS sensors such as those in the MVN suits from XSens. My hunch is the drift you are experiencing will be a harder problem to solve than you anticipate. You might want to try some sort of vision based thing to compare against and correct. Similar to what Eliot Mack and the Lightcraft guys are doing. They have gyros on the camera and fiducials on the stage. I believe they are considered to be the best when it comes to real-time(ish) match moving.

I am very much looking forward to this future – as a gamer, but also for its many uses in Virtual Production. In the world of Virtual Production (Avatar, A Christmas Carol, Tintin), we use real-time motion capture for a variety of reasons. One thing we’d like to be doing is a sort of actor previs. The idea is that an actor walks out on the stage and has very limited understanding of what the environment is, due to the restriction of line-of-sight-based optical capture. Wouldn’t it be cool if they could put on a Rift and see the digital environment they are in? They could really get a sense of where things are and how they relate to them. This would be much better than just looking up at the real-time screen at a God’s-eye view.

Caveat: I’m a terrible programmer, and the last time I coded anything 3D was in the 1990s. I’m working on the low-tech H/W end of VR UI at the moment.

I agree that improving rendering performance enough to allow smooth VR requires rethinking the way 3D is rendered onto high-latency, low-field-of-view flat screens.

I have one suggestion which could solve some of the problems mentioned in the context of so-called “hardware panning” (which I’d rather call post-render panning, since it needn’t really be done in the HMD) and other issues.
The problem with post-render panning is that it warps the image in rotation, really breaks it in translation, and can’t update fast-moving objects in the scene in the intermediate frames it generates.

I suggest a change to the way scenes are rendered: a PoV-oriented scene breakdown.
One improvement Carmack introduced in the Oculus Rift Doom 3 demo is separating the game engine ticks from the rendering ticks. The image can be panned or translated faster than the engine updates the scene. The game still updates as fast as it did before but no longer holds the rendering up at very high FPS.
I suggest going further: some elements of the scene like the player’s virtual hands and associated objects need to be rendered very accurately and without lag, while something like the sky can be panned and translated and not updated as fast as dynamic objects.
A PoV oriented scene breakdown algorithm would sort out elements of the 3D scene according to how lag sensitive and how visually accurate they need to be.
Static scenery past a certain distance can be rendered less often and treated almost like a skybox. Closer and slow objects could be rendered separately and treated as sprites in intermediary frames, as long as they are far enough not to show noticeable parallax effects. At low angular resolutions this is quite near…
Only the closest and most dynamic objects would be rendered on top of the “fudged” scene in every frame.
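The breakdown described above could start as a simple per-frame bucketing of scene elements by distance and dynamism. A toy sketch (the tier names and thresholds are illustrative assumptions, not from the comment):

```python
def classify(distance_m, is_dynamic, near_m=2.0, far_m=50.0):
    """Assign a scene element to a rendering tier:
    'every_frame' - close or fast-changing, rerendered each frame;
    'sprite'      - mid-range, rendered occasionally, reused as a sprite;
    'skybox'      - distant scenery, panned almost like a sky cube.
    The thresholds are placeholders for the sketch."""
    if distance_m < near_m or (is_dynamic and distance_m < far_m):
        return "every_frame"
    if distance_m < far_m:
        return "sprite"
    return "skybox"

print(classify(1.0, False))    # the player's virtual hands: every_frame
print(classify(10.0, False))   # mid-range static scenery: sprite
print(classify(200.0, False))  # distant mountains: skybox
```

The hard part, as the replies note, is that these buckets can collapse in the worst case – a dynamic object filling the view puts everything in the every-frame tier.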

That could help reduce rendering latency, which could in turn allow more sophisticated VR rendering. However, that comes at a significant cost in complexity, especially for dynamic analysis of what needs to be rerendered and how to composite. This is much like what Talisman did long ago; conceptually appealing, but maybe not enough of a win to justify the complexity and overhead. And this doesn’t address the worst case, where everything needs to be rerendered, which could well happen if you spin inside a room, or if you translate rapidly. Also, note that it doesn’t address latency to the display, which is the biggest, hardest to move element. However, it’s possible that this will be a useful way to get latency down, once simpler approaches have been exhausted.

We certainly need to put a lot of low latency displays into the hands of developers to get a lot of experiments running.

I hope the expected success of a commercial version of the Oculus Rift will warrant an OLED display with 120Hz or better refresh rate. I’ve not been able to find any easily accessible information on OLED driver boards but considering an OLED array is analogous to a memory array or a CCD array, there’s no reason we shouldn’t be able to drive them as fast and with a single chip.

Mad idea:
Can a case be made for a low-latency, high-refresh-rate driver for phone/tablet OLED screens for the purpose of saving battery life? Could a phone CPU/GPU run more efficiently if it spent less time idling waiting for v-sync?

Of course, the OLED itself would use the same power… maybe even a tiny bit more when it’s switched.

I was talking about CPU/GPU energy savings, not the OLED. If the CPU/GPU idles at high clock speed, wasting power while waiting for a screen refresh, maybe refreshing sooner would allow the CPU/GPU to switch to a lower clock speed sooner, thus saving a tiny amount of power. If this happens often enough during normal phone operation, it just might add up to something which makes a higher refresh rate worthwhile.

Mobile CPUs and GPUs are great at saving power when they don’t need to be running. And I don’t see why they’d need to be running waiting for vsync. I’m still not seeing where the power savings comes from with a low-latency, high-refresh display. The scene still needs to be rendered and sent to the display, so the total work is the same.

1) For each frame:
Create a set of shaded polygon patches for an enlarged view of the scene.
Imagine something like a Reyes-style rendering approach, using the way Reyes solves depth of field.
Each of the patches or vertices would have a velocity stored along with it.

2) For each ‘beam block’:
Retransform each polygon patch that fits within the current block and rasterize the already-shaded patches.
Since each patch has a velocity, we can handle faking motion of the objects, and can also apply a proper head movement to each patch.
Holes between patches are an issue; we can maybe just warp patches to cover up holes (as Reyes does).

Issues with this:
It doesn’t really map super well to current graphics hardware (especially mobile), but should be doable on DX11-class hardware.
On less-than-DX11 hardware (or maybe even on DX11), you might have to do a good bit of patch creation in software rather than on the GPU (with the attendant performance issues).
This would insert another new piece affecting latency into the pipeline, since you have to create and shade all the patches.
You’re still only shading once per frame, so shading latency wouldn’t be improved and might even be worse (not sure it matters much).
Holes are a problem, as mentioned above, when patches are displaced too far.
The very last ‘beam block’ has a much higher potential intersecting-patch count, so you might get some sort of waterfall in performance per block (the last block being much more expensive than the first).
This is only per-vertex shading. Per-pixel shading would require micropolygons; not quad-optimal, and expensive.
Doing any sort of patch hiding in step (1) is tricky due to head movement. Maybe some sort of maximum possible hidden-edge distance could be calculated (edges of patches would sweep out a hiding arc behind them).
Object motion is only going to be as good as whatever transform you store at the time of patch creation. The motion is also just one sample, so changes in direction would still be delayed, and objects/patches might ‘skip’ on the next frame.
Probably another 20+ issues I can’t think of off the top of my head…

There would be a lot of tricky implementation details with this, but it might be doable.
Basically it’s like doing screen panning in 3D with polygon patches that each have some form of transform stored at one timestamp (probably just linear).
Interesting idea – I’m not sure it’s the answer, though, without trying it (so many implementation issues to work through).
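The per-patch transform described above could be as simple as linear extrapolation: store each patch’s position and velocity at frame-submit time, then retransform it to each beam block’s scanout time. A minimal sketch (function and parameter names are illustrative):

```python
def extrapolate_patch(position, velocity, t_sampled, t_block):
    """Linearly extrapolate a patch's position from the time its
    transform was stored (frame submission) to the time the current
    'beam block' is rasterized. One sample only, so direction changes
    after t_sampled are missed - the 'skip' issue noted above."""
    dt = t_block - t_sampled
    return tuple(p + v * dt for p, v in zip(position, velocity))

# Patch sampled at t=0 moving 2 units/s in x; beam block scans out 4 ms later.
print(extrapolate_patch((1.0, 0.0, 5.0), (2.0, 0.0, 0.0), 0.0, 0.004))
```

The later blocks get larger dt, which is exactly why the last beam block has the biggest potential mismatch (and hole problem) relative to the originally shaded patches.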

It’s an interesting approach, but not a trivial one to try out. It also might be more efficient to do this at an object level. Finally, if you don’t reshade, it could look wrong, especially when shadows are involved. But we won’t know until someone tries it.

I worked on VR for a military contractor (KEO) back in the mid-90s. We used an alpha-beta filter to predict head movement and to smooth out tracker-sensor noise. It worked decently well, but you could always find a movement pattern (snapping your head to the left then slowing to the right, etc.) that would beat the filter for at least a few frames.

The tracker-sensors were crap back then, too, and we had access to the best ones available. I hope they’re better now.
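For readers unfamiliar with it, an alpha-beta filter is a fixed-gain tracker: each update blends the measurement residual into position and velocity estimates, and prediction simply extrapolates them forward. A one-axis sketch (the gains are placeholder values, not those from the system described):

```python
class AlphaBetaFilter:
    """Fixed-gain tracker for one axis (e.g. head yaw in degrees).
    alpha blends the residual into position, beta into velocity.
    The gains here are placeholders; a real system would tune them."""
    def __init__(self, alpha=0.85, beta=0.05, dt=0.001):
        self.alpha, self.beta, self.dt = alpha, beta, dt
        self.x = 0.0   # estimated position
        self.v = 0.0   # estimated velocity

    def update(self, measurement):
        predicted = self.x + self.v * self.dt
        residual = measurement - predicted
        self.x = predicted + self.alpha * residual
        self.v = self.v + (self.beta / self.dt) * residual
        return self.x

    def predict(self, horizon):
        """Extrapolate 'horizon' seconds ahead - this is exactly where
        sudden accelerations beat the filter, as noted above."""
        return self.x + self.v * horizon

f = AlphaBetaFilter()
for t in range(100):                 # head turning steadily at 100 deg/s
    f.update(100.0 * t * f.dt)
print(round(f.predict(0.02), 1))     # 20 ms ahead: ~11.9 (9.9 + 2.0)
```

On a constant-velocity turn it converges and predicts well; the failure case the commenter describes – snapping the head one way then reversing – shows up as the stale velocity estimate overshooting at the change point.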

Interesting – I wouldn’t have expected that to work as well as you say. Smoothing is no problem, but prediction can’t handle sudden acceleration, unless it can see the future; in fact, since it’s partly based on the past, it will do less well at change points than no prediction.

Right – my point was that prediction fails at points of rapid change, and the failure cases are even worse than if you didn’t do prediction. So I was surprised that you got such good results. Thanks for the pointer to the AB filter; as you say, it might be worth experimenting with.

Re: overscan and scan out. It actually has been done in the mid-90s. Google “address recalculation pipeline” to find the papers. Of course at the time the rendering technology (SW and HW) was far from where it is today, so the problems were different, but the basic idea worked well.

Military systems tend to solve somewhat different problems, and soldiers can be trained and can be made to put up with a lot more than civilians. There’s a lot of good research there, but there’s no military technology I know of that would make a good consumer AR or VR product except for price.

Taking a step back a bit, I’ve been thinking about how to deal with lag in general in the context of games.
Seems like it would be useful to first be able to measure that lag (as perceived by the viewer) in an automatic/dynamic way. Once you have some idea of the lag, you can start compensating for it in some situations.
I was wondering what’s the easiest way to measure the rendering lag, and couldn’t think of a solution that doesn’t involve a camera capturing the output on the display and analyzing it (that part also takes time, but it would be constant).
The camera and game engine would be synced to the same internal clock (the PC/console), and the game engine could flash a code once in a while (rendering a white block in the corner of the screen, which is easy to detect, or an actual timestamp, which would be harder to parse), and the camera would detect it and store its own timestamp for the event.
The difference between the engine timestamp and the detection camera system timestamp gives an estimate of the lag.
We could also add input lag in the mix by having the input device send an event (with a timestamp) down the pipeline and detect its corresponding display event.
But I have the feeling there’s a more obvious solution…
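Once the camera-side detection exists, the scheme described above reduces to timestamp bookkeeping: match each coded flash’s engine timestamp to the camera’s detection timestamp and average the differences. A sketch with made-up event logs (the data values are invented for illustration):

```python
def estimate_lag_ms(flash_events, detect_events):
    """Match each engine-side flash timestamp (seconds) to the camera's
    detection timestamp for the same coded flash, and return the mean
    end-to-end lag in milliseconds. Events are dicts keyed by the code
    that was flashed."""
    lags = [detect_events[code] - t_flash
            for code, t_flash in flash_events.items()
            if code in detect_events]
    return 1000.0 * sum(lags) / len(lags)

# Hypothetical log: engine flashed codes 1-3; camera saw them later.
flashes = {1: 0.000, 2: 0.500, 3: 1.000}
detections = {1: 0.048, 2: 0.552, 3: 1.051}
print(round(estimate_lag_ms(flashes, detections), 1))  # ~50.3 ms
```

The camera’s own capture-and-analysis delay would have to be calibrated out, but as the commenter notes, it’s at least constant.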

That seems reasonable, if you just want to be able to measure lag from when you submit a frame until it shows up. There’s at least one article on the Web by an online site that set out to do this for a bunch of games. I forget where I saw it, but it should be easy to turn up with a search.

I don’t understand why AR and VR are constantly being referred to as a single entity in this post, there are many practical AR applications that would not require low-latency real-time rendering. Am I missing something?

A while back, I defined AR as rendering virtual objects that appear to be located in the real world. That requires low latency real-time rendering. If you’re talking about things like displaying people’s names as they approach you, or even hovering their names over their heads, or arrows to show you which way to turn as you walk, that’s certainly useful, but I separate that out into what I call HUDspace applications, or just wearable. HUDspace is certainly easier (although there are challenges there that I don’t think people realize yet), but it enables a very different set of apps, and it’s true AR and VR that I’m interested in at this point. I discussed this here and here.

That’s 9 ms average response time after the pixel data has arrived. So it’s on top of the latencies I mentioned. Also, my experience is that LCD response times are considerably slower than the claimed times, and that for certain gray-to-gray transitions they’re much worse than average.

Foveated 3D would certainly be great if it could be done well in an affordable HMD.

I have a camera with a 240Hz viewfinder. I also know that cameras coming this year will use even more powerful OLED viewfinders with higher resolutions and high refresh rates. So the research on these fast displays does exist, and the rate of progress is impressive – so I would say not all is lost concerning the high-refresh-rate solution.

Right, those are nice in several ways, but the catch is that they have high pixel density but in a small area. What’s the overall resolution? Also, those microOLEDs are so small that expensive, heavy optics would be required to get a wide FOV out of them. This is a good example of what the post was about – these displays are being built to meet the needs of a market that justifies the costs of developing them, and as a result are not well suited to HMDs, although in different ways than panels.

The resolution is becoming very good at 2.4Mpix these days, with 3+Mpix models coming in 2013. And the FOV for the latest models is becoming quite good as well. This field of technology is improving fast, and an HMD is just a step away from them. Remember that the early OLED displays were very limited (and the same goes for early LCD displays, by the way), and now I have a 7.7″ OLED tablet with a quite nice resolution (and a fast display), and I know manufacturers are working on bigger, faster panels. So I wouldn’t brush aside the progress in this field like you do – I think it’s becoming more and more relevant.
So IMHO it’s mainly a matter of waiting until the “Moore’s Law of Electronic Displays” does its job…

What size of displays have 2.4 Mpix? 5″ or 6″ is about right for an HMD; 7.7″ is definitely pushing it in terms of weight and size. Also note that with Rift-style HMDs, that resolution is split between the eyes, so 1080p is about 1 Mpix stereo resolution.

Anyway, nothing you’ve listed helps much with latency. OLEDs turn pixel data into photons faster than LCDs, so that’s good. But they don’t accept data any faster at the same frame rate. Of course, that could be fixed for both OLEDs and LCDs, but the point of the article is that they’re not going to be fixed unless there’s a reason to do so, and currently there isn’t one. Unfortunately, there’s no Moore’s Law of Electronic Displays for speed of transmission of frames to displays.

To my knowledge, 7.7″ is the current maximum for mass-production OLED films (we can’t call these “panels” anymore, can we?) with high-density resolution, and manufacturers can produce any size below that, of course (down to about 1″, I think, again with high resolution).
The point of my argument was that we can remove the display itself from the list of blocking problems.

As for bringing the frames to the display, is it possible to separate the GPU from the frame buffer itself? The frame buffer could sit HMD-side while the GPU sits in the PC… While I don’t immediately think of any mass-market product answering this need, the increasing ubiquity of handheld and wireless devices will make some people invest time in this. I’m thinking of streamed video games, for example (where latency is the main issue as well).

Agreed, OLEDs would eliminate latency in the display itself, although not in the display controller/data transfer. If the GPU could write directly into memory shared with the display, that could potentially eliminate transfer latency (with a caveat I’ll note shortly); that would be true with any kind of display technology, not just OLEDs. However, that would require display manufacturers to make substantial changes to the overall display package, and there’s no incentive for them to do that – which was a major point of the post.

You can’t just have the GPU write into shared memory the display is pulling pixels from, because pixels get overdrawn while rendering a scene. However, you could potentially double-buffer the shared memory, which would solve that problem. I don’t know how rapidly various technologies could update pixels once the page was flipped.
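The double-buffering arrangement in the paragraph above can be modeled in a few lines: the GPU writes only the back buffer, the display reads only the front buffer, and a flip swaps the roles atomically. A toy model:

```python
class SharedDisplayMemory:
    """Toy model of display-side memory that is double-buffered so the
    GPU never overdraws pixels mid-scanout: writes go to the back
    buffer, the display reads the front buffer, and flip() swaps them."""
    def __init__(self, num_pixels):
        self.buffers = [[0] * num_pixels, [0] * num_pixels]
        self.front = 0  # index of the buffer the display scans out

    def gpu_write(self, pixel, value):
        self.buffers[1 - self.front][pixel] = value  # back buffer only

    def display_read(self, pixel):
        return self.buffers[self.front][pixel]

    def flip(self):
        self.front = 1 - self.front  # page flip: back becomes front

mem = SharedDisplayMemory(4)
mem.gpu_write(0, 255)        # render into the back buffer...
print(mem.display_read(0))   # ...display still scans the old frame: 0
mem.flip()
print(mem.display_read(0))   # after the flip, the new frame: 255
```

The open question from the comment remains: how quickly a given display technology could start scanning out the newly flipped buffer.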

Mmmm, this is less of a Kobayashi Maru and more of a Corbomite Maneuver. Could you throttle the detail as you move – drop the updating on the color and focus more on the intensity? I know I perceive less detail when swishing my head about, so perhaps drop resolution as well when moving? On the hardware side, a more direct line to the pixels would cut out the consumer crap that comes with modern monitors, while the problem of tearing could be avoided with a scattering effect biased towards the center of the frame; as long as the scattering effect didn’t hold the same scatter pattern, it should look okay.

Honestly, though, this is the kind of problem that will inevitably be solved if VR takes off as you say. I would wager that the quickest way for us to get the problem solved is for all of us to stop thinking about it, become VR public relations people, and just get as much of the public interested in VR tech as we can.
Then the problem of latency will be solved regardless of whether or not it’s physically possible – Samsung will change the bloody speed of light if they have to.

One last bluff of an idea: galvanic vestibular stimulation, if that ever becomes practical, could be used to make your head feel like it’s moving slower – so rather than having to play catch-up, we just slow down time for the user…

The problem case for this approach is when you stare at a fixed point in space and turn your head quickly – you don’t perceive less detail at all in that case, because your eye isn’t moving relative to what it’s focused on. And that’s how you usually look at something off-center – your eyes go there first, then your head catches up while your eyes stay focused on the new object of interest.

I’m not sure GVS would help here, but regardless, it seems far from being a practical tool at this point.

Sorry if this has been suggested, I haven’t had time to read the entire thread.

Here is a twist on the “GPU in the HMD” suggestion.

The computer’s GPU renders an “infinite FOV” around the user (i.e., everything that could possibly be seen, regardless of head position) and transmits this frame to the HMD. A much simpler GPU in the HMD renders the “actual” FOV (i.e., a subset of the “infinite FOV”) that is visible based on the current head position.

My initial thoughts were along the same lines as many others. I’ve scanned the discussion and I might have missed something but here’s some additional thoughts/perspectives.

You could think of it as the goal being to have a window that provides a stable view into the virtual environment (VE). Latency in the movement of objects in the virtual world, etc., should not be a problem, since that’s similar to how we see objects move on an ordinary screen. That is, if the window seems stable relative to the VE, the framerate of the actual rendering should be less important. In these terms, I would interpret what William wrote as a “real GPU” plus “simple virtual window GPU” combination, and it seems to me that many of the pitfalls discussed above can be addressed within this framework. One key idea (one of several touched upon above) might be to separate distant and near, and/or static and dynamic, parts of the scene and handle these differently for composition by the simple GPU.

As an initial test and illustration, I imagine picking a smaller viewport within the HMD screen, rendering the VE into it, and then having this viewport move around on the HMD screen to “stay still” as the head moves. This should allow you to start out with a small viewport that facilitates good performance in many other respects (rendering etc.) and then scale up as you improve the system. I can imagine many tricks to try in this setup, most of which have been mentioned in this discussion. In short, this is an approach that I would have liked to dig into.
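That initial test amounts to counter-panning the viewport: when the head yaws by some angle between render time and display time, shift the rendered viewport the opposite way by the corresponding number of pixels. A sketch (the pixels-per-degree figure is an assumed display parameter, not a real HMD spec):

```python
def viewport_offset_px(yaw_at_render_deg, yaw_now_deg, px_per_deg=15.0):
    """Horizontal offset (pixels) to apply to the rendered viewport so
    it appears to 'stay still' in the world as the head yaws between
    render time and display time. px_per_deg is an assumed density."""
    delta = yaw_now_deg - yaw_at_render_deg
    return -delta * px_per_deg  # counter-pan against the head motion

# Head turned 2 degrees right since the frame was rendered:
print(viewport_offset_px(10.0, 12.0))  # shift the viewport 30 px left
```

This handles rotation only; translation and dynamic objects are where the replies above point out the approach gets hard.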

The key is not the window seeming stable relative to the VE; it’s the VE seeming stable relative to the real world that your vestibulo-ocular reflex and proprioception function in. True, the window also exists in the real world, so in that sense yes, it needs to seem stable relative to the VE, but I don’t think that’s the core way to think about it.

> Latency in the movement of objects in the virtual world etc should not be a problem since that’s similar to how we see objects move on an ordinary screen.

I’m not sure what you mean by that. True, there’s latency on ordinary screens – but the whole point is that we don’t perceive that as embedded in the real world, so it doesn’t trigger VOR and other sensing systems.

As for near and far, etc. – the problem is that in any given scene, you can get any balance between those elements. On average, that might work well, but the worst cases are worse than the current approach. For example, what if a dynamic object fills your view? What if you’re in a small room? Being good on average doesn’t solve the problem, because the objective is to eliminate the discrepancies that trigger our perceptual system in the wrong ways.

Certainly your initial test would work well with a static scene and moderate head rotation. The question is whether it continues to work well with a dynamic scene, rapid head rotation, and especially rapid head translation.

Maybe you can compare having a window that’s stable relative to the VE, but with higher latency in the dynamics of the VE, to being in a VR cube with a box over your head that has an actual window in the front (one for each eye). (A VR cube also has problems with translation, of course.) I’m not sure about this, but my guess would be that this is much better than having similar latency in tracking in an HMD. I think it’s much easier to accept a world where everything around you moves “at a lower framerate” if you yourself can move around fluidly in that environment than if the latency is in your own movement. The VOR should depend on how you feel yourself moving in the world, not on how the world moves.

My point is that I think it’s helpful to think separately about providing your view into the VE with minimal latency, and about the updating (and most of the rendering) of the VE. Sure, there are remaining challenges, but they don’t seem insurmountable. The whole view being filled by dynamic objects might give you a lower frame rate on the updating of the world, but if you separate the rendering as suggested, you should still be able to look around with low latency and so on, so I think the situation is much improved, even if it’s not optimal. Avoiding discrepancies is key, yes, but there are many different levels of discrepancy, and maybe it’s a mistake to imagine that you must be able to render everything at a high constant framerate with low latency all the way before HMD VR is interesting?

I really do like the idea of having a bit more intelligence in the HMD. It seems like a scalable solution to me that may be good both for initial experiments and for an envisioned future implementation. Essentially, I imagine the HMD as simulating a VR cube where you can look around freely in a “still image” of the VE, with these still images updated from an external renderer at a lower framerate. Sure, there are drawbacks with this approach, but I think there is much to try in such a framework relating to how to, e.g., best minimize extra rendering, compensate for head translation, and so on. I’m having trouble believing that doing a projection and/or composition in the HMD should be a big problem in terms of hardware, but maybe it is if you really want to push the price down?

Great discussion. Really cool that you’re on this now. I do remember you, and mode X, from when I really started to get into computers in high school, long ago.

It turns out that the update rate of dynamic objects matters, for reasons having to do with the interaction of the perceptual system and the display, something I may talk about at some point.

Obviously it would be handy to have processing power in the HMD. However, that makes the HMD more expensive, heavier, more power-hungry, and hotter. And of course it’s only worth doing if it reduces latency significantly.

Cool to hear from someone who read about Mode X! I never knew when I wrote that who, if anyone, was reading it – in pre-Internet days, I might get a few US mail letters about any given article, at best.

The clarity that we observe when we quickly move our head while focusing on a fixed point is predominantly due to the eye muscles physically stabilizing the image by ‘nulling out’ the motion of the head, not by visual processing after the fact, yes?

It does sound like you’d need to have eye tracking data to integrate with the head motions.

Not commenting on the latency issue, but I think eye-tracking could be very useful for raising the immersion factor. There was a mod for Skyrim that “focused” on whatever you pointed your crosshairs at. This made some sense, as you are often looking in the same place you are aiming. But not always. With eye-tracking, the game could always be focused on the object you are looking at. It honestly seems like more of a software challenge than a hardware one, getting the focusing algorithms right and everything. I’m not sure how expensive an eye-tracking mechanism would be, but I’m hoping the Rift (and others) implement eye-tracking for their consumer products.

I think eye-tracking will be a key part of VR and AR in the long run. I also think accurate, consumer-priced eye-tracking will be a significant challenge (unless we’re all willing to wear contacts with induction coils in them, or better yet have induction coils surgically implanted in our sclerae).

Eye movement isn’t the problem for latency – head motion is, because it moves the pixels relative to the real world. Even with everything miles away, you could easily turn 45 degrees rapidly to look somewhere else.

I think mechanical support for the display would solve one problem: weight. If your display is supported in space by some mechanical structure (and either moved by your head in a simple mechanical way, or by using force sensing, like Rethink Robotics’ Baxter), then there is no limit on how much it can weigh. That means you could use any display technology – even an ordinary computer monitor – if you just find a way to reduce its image, e.g. by using mirrors or optics. This would let you solve the problems that are specific to AR/VR hardware.

Of course, this would not be suitable for massive consumer VR; but it could be useful for corporate VR and for specific VR applications.

That could work, but as you note, it would increase cost and complexity a great deal. As Dean DeJong, the optics guy I sit next to, noted, it would allow for more complex optics – and at the same time would reduce the number of HMDs we’d have to worry about manufacturing. Since we’re going after the consumer market, this isn’t a viable path for us. I’d imagine this has already been tried at places like NASA, although I haven’t encountered anything about it myself.

How about rendering multiple frames with offset view points in parallel and then choosing the closest fit to the current head position? Sort of a special case/optimization of light field rendering. The offset view point selection could be guided with prediction, both along the predicted path and in the error area (this could help with the motion reversal issue). You could store a large number of past frames to give you a cache of view points to include in the closest fit search. The cache is essentially a discretized volume around the users current viewing area. The frame displayed to the user could use warping as suggested above, but with more than a single frame as input which would give you *something* to put into the holes created by translation. Throw more commodity hardware at the problem and capitalize on GPU throughput…
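The closest-fit search over cached viewpoints could be a straightforward nearest-neighbor lookup over (position, orientation), with a weight trading off the two kinds of mismatch. A sketch (the weight and the use of yaw alone are simplifying assumptions):

```python
import math

def closest_cached_view(cache, pos, yaw_deg, yaw_weight=0.01):
    """Pick the cached frame whose viewpoint best matches the current
    head pose. cache is a list of (position_xyz, yaw_deg, frame)
    tuples; yaw_weight (meters per degree) trades off rotational vs.
    positional mismatch and is an arbitrary choice for the sketch."""
    def cost(entry):
        cpos, cyaw, _frame = entry
        return math.dist(cpos, pos) + yaw_weight * abs(cyaw - yaw_deg)
    return min(cache, key=cost)[2]

cache = [((0.0, 0.0, 0.0), 0.0, "frame_a"),
         ((0.1, 0.0, 0.0), 5.0, "frame_b")]
print(closest_cached_view(cache, (0.09, 0.0, 0.0), 4.0))  # frame_b
```

The lookup itself is cheap; as the reply notes, the expensive part is populating the cache densely enough to cover fast rotations.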

Conceptually that’s a nice approach, but the cache really only works if there are no dynamically moving objects. Rendering multiple frames for each set of predicted locations of course increases rendering time; yes, you could have multiple GPUs, but only at a high cost. The real problem is that rotation can change quickly enough that the number of tentative views would be quite high. Remember, accelerations can be up to 1000 degrees/second/second.

The caching could work somewhat better if you only rendered the static part of the world into the cache, then rendered the dynamic part later. But then you have highly variable performance, depending on how much of the scene is dynamic – an entity close to you would cover a lot of the screen.

_If_ there’s a different reasonable-latency requirement for AR vs. VR, we could kill several birds (virtually!) with one stone by delaying reality to match the rendered frames’ latency – using a camera, and rendering the 3D scene as an overlay on the (recorded, delayed) environment (with an opaque screen). In effect, this almost turns the AR problem into a VR one.

Almost – up to the accuracy/latency of the tracking system. In a VR system it only has to track position (hopefully allowing our brain to compensate for some positional/temporal inaccuracies); in a mixed AR/VR system such as this, it would also have to track well enough that virtuality overlays reality solidly.

So if the Oculus Rift can provide a reasonable VR experience, it should be able to drive a similar mixed AR/VR experience with the addition of two outward-facing cameras.

Sure, it would look a bit funny to wear such a device in public, but it might allow progress on AR issues that are currently masked by latency (such as tracking accuracy). And besides, there could be controlled environments in which such a mixed AR/VR system would be better than plain AR.

There’s no evidence at this point that there are different latency requirements for AR and VR. Given that latency requirements are largely driven by the parameters of the human perceptual system, it would be reasonable for AR and VR latency requirements to be similar. True, AR may have tighter requirements because of registration with the world, but my experience is that it’s easy to see things shift due to latency in VR even though there’s nothing to register them against.

Using a camera to do video-passthrough AR would solve a lot of problems. However, I don’t think video-passthrough AR will be good enough for a number of years, for reasons discussed here. Certainly it’s possible that that could be useful in specific environments, but I don’t think it’s anywhere close to being ready for use in a consumer device. Maybe in 10-15 years…

To start, I know little to nothing of the subject matter, but any insight can help, right?

What if you try to approach the solution by singling out the ‘movement’ from the ‘moving picture’?

If I remember correctly, most of our field of view is used to detect motion and doesn’t contain any color information.
Not to mention it’s fairly low-resolution. When it comes to movement, especially in our peripheral vision, our eyes can do fine with a fairly unsophisticated image.
Again, I’m not an expert, but I think it should be viable to make some great advances in terms of latency in this area by cheating our eyes and using the most unsophisticated graphics possible.

When it comes to colors and detail, we only perceive these when we stop to look at something. Whenever we look at something else and objects move toward our peripheral vision, we stop actively perceiving those objects and our vision is mostly based on memory.

If we cheat our brain into taking over for some tasks, as it fills in missing information, there might be a hardware solution to the problem:
One part would be a high-frequency display (laser?), so we always have an up-to-date black-and-white reference and perception of motion. And on top of this there could be a high-resolution transparent OLED display.
This way, using the motion tracker, it could be determined whether it’s safe enough to use the high-definition display, or whether it needs to skip some frames during motion. A transparent OLED can also display partial renders; to leave out certain or distant objects and background information, it might smear them black/transparent.
If a laser display can actually be developed to be fast enough, the worst-case scenario is that some color information persists and smears for the duration of a frame when motion is initiated.

There are some interesting ideas in there – thanks for sharing your thoughts – and we’ve been thinking along some of the same lines. But overall I don’t think it works. For one thing, you do have peripheral color vision; there are fewer cones out there, but they exist. Find something brightly colored and then see how far into the periphery you can see that it’s colored. I can get past 90 degrees. Also, if you focus on something and continue to do so while you move your head, you see it very clearly, and if it became black and white, it would be very evident and look very wrong. Finally, I wish I could get a hi-res transparent OLED display suitable for an HMD, but I haven’t come across one yet.

These are still being worked on; Samsung has working prototypes, but nowhere near the mass-production stage. But there is R&D being done on them and we will have them at some point – just a matter of time, a few years to wait, give or take.

They’d be cool to have, but they’re not a solution by themselves. He was proposing a hybrid system, with something like a laser doing the fast, detailed stuff and the transparent OLEDs doing the rest of the FOV in black and white. Whether a system like that would work well is an open question on a number of axes. What would be great would be if you could just wear a transparent OLED visor and have AR – but you can’t focus at that distance, so you’d need optics to collimate the light, and that would distort the real world badly. So in general, transparent OLEDs don’t solve anything new.

Innovega’s approach is unique and very interesting. For those not familiar with it, they put a tiny lenslet in the middle of a normal contact, and the lenslet lets you focus at the distance of the HMD display. So unlike all the other approaches, they don’t collimate the light (make it come from infinity); they can actually project it on the HMD and you can focus on it there. Meanwhile, the rest of the contact functions normally, so you can see the real world too; the light from both just mix. You can read more about this approach on their website.

Obviously, the drawback is the requirement for contacts; certainly a substantial number of people will refuse to wear contacts (although I’d be fine with it).

Anyway, you are correct, Innovega’s contacts combined with transparent OLEDs could be a great AR system.

The other way around, actually. The laser needs to fill the entire field of view, and the transparent OLED needs to display things in color and great detail, although this might be a pipe dream for what transparent OLEDs will be capable of.
Perhaps we needn’t use an OLED, just a specialized laser display with an extra “fast-pixel”.

The point I tried to raise is that there is a discrepancy between what needs to be displayed fast and accurately, and what needs to be displayed in detail.
What I’m getting at is the different nature of our cognition of motion and conscious sight. To detect motion we use older neural pathways that aren’t part of conscious sight. It’s actually possible to be blind and still be able to perceive motion.
So the issue with VR as I see it is that we can’t avoid detecting a small discrepancy in position/motion, because our brain is hardwired to find this very alarming and will redirect our attention toward it.

This is why I suggested developing a fast laser display with only rudimentary graphics to avoid our brain going into “Red alert!” (I’m more of a Next Generation fan.)

As for our actual graphics in the conventional sense, I assumed there would be more slack making use of neural plasticity.
Say, for example, we were to use a 498nm laser, the wavelength to which our rod cells are most sensitive, and we don’t render our peripheral regions in detail/color with the OLED. You’ve raised the point earlier, and you’re right that we have cone cells that would detect this lack of color. But deprived of sensory information (mostly red) for long enough, our brain starts to assume some color blindness in this region. This creates a twilight between not being fully aware of the lack of color and our brain filling in the gaps with other information, either when we turn away from objects or toward them. This way we won’t need a detailed image right away, but only when our brain expects to be able to perceive it.

If you can’t cheat the Enterprise into going any faster, why not create a big enough “cognitive” gap?

Right, I see that you were using the laser for the wide-FOV B&W. Makes sense.

I still think you would notice stuff becoming colored as it got closer to the fovea. Sure, you could gradually blend the color in, but then you need a wider colored area, defeating the purpose of having a small hi-res color display. Also, of course, the eye moves, so the display would have to move with it; that idea came up earlier, and is a difficult problem.

Finally, the wide-FOV B&W area and the high-res color area would have to update at the same rate and stay in perfect sync; otherwise, the boundary between the two would be very evident (although I think it will be anyway due to the color transition).

I haven’t read all the other responses, but here’s a quick idea assuming you want a pure software solution. Attack the problem in 3 stages. First, develop a predictive model of likely viewpoints for the next frame. Next, render the scene from as many of these viewpoints as you can afford. Finally, implement the “racing the beam” solution, but your “racing the beam” shader is just a 2D image reconstruction filter based on the previously rendered viewpoints. The benefit of this solution is that stages 1 and 3 are application independent with well-defined APIs and can easily scale in cost and complexity. They can also be implemented in hardware eventually. Applications plug into stage 2 and can stick with a pretty conventional rendering pipeline.
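As a rough sketch of the three-stage split (all interfaces here are made-up stand-ins, with head pose collapsed to a single yaw angle to keep it short):

```python
# Stage interfaces are invented for illustration; a real system would pass
# full 6-DOF poses and actual images.

def predict_viewpoints(pose, velocity, dt=0.016):
    """Stage 1: candidate yaws for the next frame (toy 1-D model).
    The 2-degree spread is an assumed uncertainty around the prediction."""
    center = pose + velocity * dt
    spread = 2.0
    return [center - spread, center, center + spread]

def render(viewpoint):
    """Stage 2: stand-in for the application's conventional renderer."""
    return {"viewpoint": viewpoint, "image": f"scene@{viewpoint:.1f}deg"}

def reconstruct(rendered, actual_pose):
    """Stage 3: trivial reconstruction filter - just pick the nearest
    candidate. A real filter would warp/blend candidates per scanline."""
    return min(rendered, key=lambda r: abs(r["viewpoint"] - actual_pose))

candidates = predict_viewpoints(pose=10.0, velocity=100.0)  # deg, deg/s
rendered = [render(v) for v in candidates]
shown = reconstruct(rendered, actual_pose=11.8)  # tracker reading at scanout
```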

Certainly a conceptually appealing approach. Steps 1 and 2 were, coincidentally, proposed a few posts ago, so I’ll let my response to that stand as the answer to this as well. If 1 and 2 worked well, then the image reconstruction is interesting, although there’s the question of how compute intensive it would be. But I don’t think 1 and 2 will work well enough (short version – too big a possibility space, too much rendering cost), so it’s kind of moot.

If you can make the image reconstruction stage good enough, I disagree that the probability space for prediction or the rendering cost is too large. You no longer have to render an accurate prediction, you only have to render a prediction that is useful during image reconstruction. A dumb first pass would simply be an image with a wider field of view and a couple view-aligned translations. I think asking the application to render 5 views instead of 1 is an acceptable cost for a technology like this.

I admit, the image reconstruction stage is where the magic happens, and if you can’t make that work then you’re back to square one.

Maybe so; I certainly don’t know for sure. But even if that’s true, you now have 5X the rendering, which could have been used to reduce rendering latency of a single frame to at least some extent. And unless you send all five images to the HMD and have it do image reconstruction, all you’ve gained is the latency time of the original rendering. If the original latency was 5 ms, then you’ve saved 5 ms minus the reconstruction time at most, so maybe 3-4 ms. But if you could have used 5X the rendering power, maybe you could have gotten the single-frame rendering time down to 2-3 ms, which is close enough to call it a tie, and with a much simpler system.
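To put rough numbers on that trade-off (the reconstruction cost here is an assumed value, not a measurement):

```python
# All numbers come from the discussion above except reconstruction_ms,
# which is an assumed cost for the reconstruction pass.
render_ms = 5.0           # original single-frame render latency
reconstruction_ms = 1.5   # assumed
saved_ms = render_ms - reconstruction_ms      # what reconstruction buys you
single_frame_5x_ms = 2.5  # plausible render time if 5X power went to one frame
gap_ms = single_frame_5x_ms - reconstruction_ms  # how close the two end up
```

With these numbers the two approaches land within a couple of milliseconds of each other, which is the point: the multi-view path buys little over simply rendering one frame faster.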

Of course, you could just take 5X as long to render, and say it doesn’t matter because latency doesn’t matter until reconstruction. However, that would mean you’d have to predict a lot farther ahead – 25 ms in the example above – and at that point things can move enough that reconstruction may not work well.

I understand your concern about shifting latency around, but not all latency is the same, right? We have basically infinite tolerance for what I’ll call “simulation lag,” for lack of a better term. What I mean is the virtual clock can exist a few seconds in the past or a hundred years in the past as long as the viewer has no real-world frame of reference. Our tolerance for input lag varies greatly. Several hundred milliseconds is acceptable for pushing a button, a hundred milliseconds is acceptable for firing a virtual gun, thirty milliseconds is tolerable for mouse-look, and apparently 7-15 milliseconds is necessary for HMD response. Latency is most problematic when the input is tightly coupled with the virtual camera because it causes simulation sickness.

For VR, it might be a worthwhile trade to lower the simulation frame rate and quadruple the simulation lag if doing so allows you to cut the latency in driving the virtual camera by half.

AR is problematic because there is such a clear frame of reference between the virtual simulation and the real world it is overlaid on. Does latency in AR cause simulation sickness though? I wouldn’t think so. The one advantage AR has is that all the concerns about how quickly and unpredictably viewers move their heads no longer apply. If you solve the VR virtual camera latency problem, you only have to worry about how quickly and unpredictably the real world objects you’re tracking in AR can move. I’m going to go out on a limb and claim that most AR targets are either far slower or more predictable than a viewer’s head.

My main point was that it’s not clear to me that this reduces tracker-to-photon latency significantly, net, unless you send all the frames to the HMD and do the reconstruction there, which would require a lot of power on the HMD, and also 5X the bandwidth to the HMD. Also, if you take 5X as long to render, you’ll have 20 ms more between rendering the first frame and reconstruction, which could make the reconstruction quality inadequate.

I don’t understand this:

“The one advantage AR has is that all the concerns about how quickly and unpredictably viewers move their heads no longer apply.”

The speed of movement of real world objects isn’t the key here; the speed of movement of your head is. And you move your head very quickly to look at new things. Even if that wasn’t the case, real world objects can move very quickly and unpredictably. How about a sparrow, or a hummingbird? Or a car or a baseball? Note, though, that you can do smooth pursuit of moving objects at much lower speeds than you can do focused tracking where you turn your head but continue to fixate on a static object, so in either VR or AR, it’s the latter that presents the higher-speed problem.

I suppose it is fair to summarize my suggestion like this: If you can do perfect prediction then you can have effectively zero latency. Given that you can’t do perfect prediction, if you can do perfect reconstruction then you can reduce the latency to the time between reconstruction and the image reaching the eye. Honestly, I have my doubts about being able to do acceptable reconstruction (temporal aliasing would be a huge issue here), but I feel confident you can push reconstruction to much later in the pipeline than full rendering – all the way into the HMD eventually.

Regarding my comments about AR, my second sentence better summarizes what I was trying to say: “If you solve the VR virtual camera latency problem, you only have to worry about how quickly and unpredictably the real world objects you’re tracking in AR can move.” I’m claiming that most AR applications I’ve seen have much slower and more predictable targets than a viewer’s head. Buildings, signs, even walking pedestrians and your baseball example are far more stable in screen space and predictable frame-to-frame than the position of a viewer’s head. The hummingbird is going to require custom hardware, obviously. At least if a caption or virtual overlay lags a bit, it probably isn’t going to cause simulation sickness the way a laggy VR environment will.

Anyway, I can’t imagine I’m telling you anything you don’t already know. This is what you get for inviting the Internet to play. Best of luck in your research!

I have a solution, but lack the technical background to know if it’s sufficient, or even feasible within 5 years.

I propose this, disregarding all other problems, because I think the industry side problems of display manufacture will only be solved with appropriately placed pressure and money. A multi-channel, addressable display could logically display frames fast enough. The root problem seems to be not the display itself, but the way it’s addressed. That change will come when they see the benefit, but to create the benefit, you have to have all other ducks in a row.

My solution for rotation/translation:

This will take a lot of power, both in the renderer, and the headset. Less the headset, but would probably take a high power gpu->buffers->low power gpu concept.

So the idea works like this. Let’s say your final frame is 1k by 1k. To translate, we need to render the frame from another perspective. Let’s assume for the moment that we are translating in perfect planes, since rotation would involve higher math and would distort our pixels. What we need is simply more frames. More precisely, we need to predict our frames. In a game engine, prediction should be possible 2-3 frames out, but I figure we only really need 1. We need 1 frame of “real time” and multiple frames of subdivided time.

In the 16ms it takes for the frame (really for the engine as well), we could generate 4 steps of time, interleaved with the main frame going out. These subdivided times are progressively further away in translation and time: at 3ms, we are x*2 degrees from where we were in some direction and 3ms in the future (pre-estimating moving scene objects); at 6ms we are x*4 degrees; at 9ms, x*8; etc. (The exact values and such are a matter of experiment.)

Obviously it’s a lot of frames, since you need to go left, right, up, and down. Assuming each frame is already oversized, at 2k by 2k, you might logically need, say, 16 more frames for each 1 frame you generate. You could of course use a larger slice of time for smaller subdivision; using just 2 steps, you need just 8 frames. Using motion estimation you could also stack them where appropriate – for instance, if the head is already moving (or has started moving), it’s 3 one way and 1 back the other. If the head is rotating, you pre-rotate the advance frames. 60fps turns into 480fps. Of course, having just 1 advance frame in each direction would in essence cut the degrees moved without data in half – for instance from 2 thumbs to 1 thumb – also turning 60fps into just 240fps. May need diagonals. And 5 years.
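To make the frame-count arithmetic above explicit (just the numbers from this comment, nothing more):

```python
# 4 offset directions, a choice of subdivided time steps, on a 60fps base.
BASE_FPS = 60

def frames_per_real_frame(directions, time_steps):
    """Tentative frames needed per real frame: one per direction per step."""
    return directions * time_steps

def rendered_fps(directions, time_steps):
    """Total tentative frames the renderer must produce per second."""
    return BASE_FPS * frames_per_real_frame(directions, time_steps)

frames_per_real_frame(4, 4)  # 16 extra frames for each real frame
rendered_fps(4, 2)           # 480, i.e. "60fps turns into 480fps"
rendered_fps(4, 1)           # 240 with one advance frame per direction
```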

Good thoughts! The main problem I see is that whatever power it takes to render 16 tentative frames could have been used to render 1 frame much faster – it might not be 1-1, but it’s hard to believe that 16X the processing power wouldn’t let you either render the same frame with less latency or render a much more complex scene with the same latency. So it’s not clear that latency decreases, net, or that that’s the best use of all that power.

Another question is how well you can bound the range of possible poses given 1000 degree/sec accelerations, combined with translation. Granted you’d be likely to come closer to the right pose, but not necessarily close.

Finally, there’s still the time taken to get the image to the display and emitting photons, which this doesn’t address, so you’d still need to predict by that amount – and that’s the biggest part of latency currently.

Treat the left eye and the right eye completely separately, then (un)sync the images to render one eye and then the other. You will not cut the latency of your example (4+16+16=36ms), but you will provide updated information to the user every 18ms.

This will cause “tearing” for the user between the left and right eye, where an object would not appear in the same place for each eye. This would look horrible; however, it can be fixed with shutter-glass technology. Initially it might seem silly and unnecessary to pair shutter glasses with an HMD when you already have two separate screens, one for each eye. But in this case the shutter glasses would not be used to ‘split’ one screen into two images while cutting the framerate in half; they would be used to ‘fuse’ two screens together and fake a doubled frame rate.

This should reduce the pixel off time smear.

You won’t get the full benefits of a 120Hz screen, but off-the-shelf 60Hz screens could be used with off-the-shelf shutter-glass technology to produce a 120Hz image until they do start making 120Hz screens.
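One hypothetical way to sketch the half-phase schedule this buys you (panel rate from the 60Hz example; everything else is illustration):

```python
# Each 60Hz panel refreshes on its own schedule, half a frame out of phase,
# so one eye or the other gets a fresh image roughly every 8.3 ms even
# though each individual panel is still 60Hz.
FRAME_MS = 1000 / 60  # ~16.7 ms per panel refresh

def update_schedule(n_updates):
    """(time_ms, eye) pairs for the first n updates, eyes half a phase apart."""
    events = []
    for i in range(n_updates):
        t = round(i * FRAME_MS / 2, 1)
        eye = "left" if i % 2 == 0 else "right"
        events.append((t, eye))
    return events

schedule = update_schedule(4)
# [(0.0, 'left'), (8.3, 'right'), (16.7, 'left'), (25.0, 'right')]
```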

Yes, nice idea. The half-phase idea was proposed by a couple of comments previously, but using the shutters is a good addition, and, as you say, would reduce smear. It is still an open question how well displaying different frames to the eyes at different times would work; HMDs are significantly different from shutter glasses. But indeed it might work; another thing it would be worthwhile to try out.

What about a different kind of prediction that uses knowledge of the virtual world?

So far I’ve only seen prediction that uses tracking data. But what if we know a monster just jumped out and made a sound to the player’s right, while the player was turning to the left? Can we predict that the player is going to change direction and look to their right, *before* we get any sensor data showing they are doing that?

Nice out-of-the-box thinking, but the search space is way, way too large with this. The player could spin either way, or could translate in any direction. Also, there are many situations in which there’s not necessarily any evident external stimulus, or in which there are many stimuli (like being in a crossfire).

Your mention of “racing the beam” got me thinking about PowerVR’s Tile Based Deferred Rendering graphics hardware (TBDR). PowerVR chips are quite common in phones and tablets these days (all iOS devices). If I understand correctly, TBDR breaks the rendering workload of the GPU up into smaller blocks for the purpose of improved caching. Actual texturing is deferred until some later time, and similar to deferred shading, texture lookups are only done per visible pixel. Admittedly I could be misunderstanding some details, but there might be some options for “racing the beam” on a TBDR GPU.

Also, the Nintendo DS was an interesting specimen. It’s not a TBDR GPU, but it’s an equally unusual one. There was a hardware limit on the number of triangles that could exist (a couple thousand), and it had a fixed per-scanline fill rate. Once you exceeded the number of pixels allowed per line, rendering of new triangles would stop (oops!), and it would begin the next line. Thus, some games would do 2 full rendering frames to make 1 frame.

I guess it would be possible to render the top row of tiles and send them to the display, then the next row. You would certainly want to know the tile sizes, because doing a partial tile would be costly. And that could be relevant down the road, when AR is mobile.
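A toy sketch of what racing the beam at tile-row granularity might look like (tile size, tracker model, and interfaces are all invented for illustration):

```python
TILE_H = 32       # tile height in pixels (assumed)
SCREEN_H = 128    # small screen for the sketch

def latest_pose(band_index):
    """Stand-in tracker: yaw in degrees, sampled at the start of each band
    while the head turns steadily."""
    return 10.0 + 0.5 * band_index

def render_band(band_index, pose):
    """Render one horizontal band of tiles with the given pose."""
    return {"rows": (band_index * TILE_H, (band_index + 1) * TILE_H),
            "pose": pose}

def race_the_beam():
    """Render each band with the freshest pose, then hand it to scanout
    before starting the next band (the hand-off is elided here)."""
    return [render_band(b, latest_pose(b)) for b in range(SCREEN_H // TILE_H)]

frame = race_the_beam()
# Lower bands were rendered with newer (larger-yaw) poses than upper bands.
```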

The DS sounds like a fun GPU to program. I was always sorry I missed the Amiga, and the Dreamcast sounded fun to program as well.

Build a passive mechanical stabilizer for the display like a head-mounted Steadicam.

Mount the display on a spring-loaded and dampened articulated arm which is attached to the top center of your head. If you rotate your head quickly the display will move more slowly and then catch up with your new head position.

If the displays are not big enough and you can see past them when turning your head, you could get away with describing the device as a “magic lens”. People would accept that and learn not to turn their head too quickly if they wanted to keep in constant visual contact with the virtual objects.

Downsides: you need larger displays. More moving parts; not as sleek. Dorky-looking. Difficulty doing 3D if you can turn your head fast enough to see into the wrong eye’s display.

Other tricks:
— Use polarized glasses or contact lenses and share a single large display between both eyes.
— Project on a fixed transparent screen in front of your face, and spring-mount the projector instead of the screen.

I like the very different approach here, but wouldn’t you still have the problem of latency getting to the display, so unless you knew exactly how the dampened display would move over the next 16 ms or so, you still wouldn’t know the exact pose to render? Apart from that, as you mentioned, this would be large/heavy for an HMD. That’s certainly true about sharing a single large display between two eyes, and projecting on a transparent screen would put the projector a long way out in front of you, which would multiply the effect of its weight.

Some further thoughts: I think one could use eye-tracking to refine the prediction. I seem to move my eyes before I move my head to look at something else most of the time. The eye-tracking could give a prediction of how my head is going to move. Even if people do not naturally move their heads in this way, they could probably be trained to do so.

Secondly to “hide” latency, it would be worth trying to use machine learning to determine how each user’s head accelerates then decelerates to reach a particular angle. Given that our heads have different mass distributions, and we move at different speeds, there may be enough regularity in our head motions to improve the prediction of how we will move than what one would obtain using a model that does not learn.
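For concreteness, here is the simplest baseline such a learned model would have to beat – plain constant-acceleration dead reckoning on yaw (all sample values are made up):

```python
def predict_yaw(samples, dt, horizon):
    """samples: last three yaw readings in degrees, spaced dt seconds apart.
    Returns the extrapolated yaw `horizon` seconds past the last sample."""
    y0, y1, y2 = samples
    vel = (y2 - y1) / dt                    # deg/s from the last two samples
    acc = ((y2 - y1) - (y1 - y0)) / dt**2   # deg/s^2, assumed constant
    return y2 + vel * horizon + 0.5 * acc * horizon**2

# Head spinning up: yaw readings of 0, 1, 3 degrees at 10 ms intervals.
predicted = predict_yaw((0.0, 1.0, 3.0), dt=0.010, horizon=0.010)
# -> 5.5 degrees: vel = 200 deg/s, acc = 10000 deg/s^2
```

A per-user learned model would replace the constant-acceleration assumption with whatever regularity it finds in that user’s head motions.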

Eye tracking as a predictive input is a clever idea, but while it’s true that you look before your head turns, your eyes don’t move smoothly – they move in a series of saccades, from which it would be very hard to extract precise predictive information. Likewise, as far as I know (and in my experience), machine learning of head movement has not revealed adequately accurate information. The problem is the degree of precision required, which is very high; knowing there’s going to be acceleration isn’t much use unless you also know exactly when it will happen and what the magnitude will be.

But eye-tracking can indicate the target of the head movement. From what I’ve understood about the trouble with prediction, it is knowing when the user stops or changes direction, so, in that way, eye movement might be a heads-up (pun intended).

If this can actually solve some prediction hiccups, the extra hardware costs might be worth it if extra immersion is implemented using eye-tracking, like depth-of-field. (Or auto-aim, lol.)

I tried to think a little about the problem, and it seems that our brains prepare the movements of our body long before they are executed. If it were possible to extract these signals directly from the brain before they are sent to the muscles that move the head, it might be possible to predict the position of the head before the head moves; thus prediction could be a viable alternative. But although there are studies on how to read brain activity, mainly to allow control of prosthetic limbs using our thoughts, I don’t think there is yet any technology that allows us to know the movements of the head with accuracy, much less one accessible to consumers today. And even with all this complexity, I think it would not be sufficient for all situations, because if the head position depended on any external force, the prediction would not work.
I thought that the problem of accurate positioning of the image is attenuated for VR applications and pronounced for AR applications, because in AR you can clearly see the wrong position of the virtual image using the image of the real world as a reference, but in VR the only reference I can think of would be proprioception. I even thought it might be possible for the virtual movement not to have the same size as the real movement (such as rotating the head 90° to generate a 180° rotation in VR) and still remain believable, because humans have a poor sense of direction. If it were at least possible to change our proprioception to sense that our head is in the position of the virtual head (something like galvanic vestibular stimulation would probably not be enough), perhaps that would allow the sensation of a static virtual world, but I think it would also seem that our virtual body moves differently from our real body – like when you dream in slow motion.

Good thoughts. However, regarding VR vs. AR, VR is all you see, with high contrast, while AR is set against the real world, with lower contrast. So if AR is wrong, you’ll notice, but it won’t conflict with proprioception to nearly as great an extent, which is important for avoiding simulator sickness. Certainly in VR you can train proprioception to a different setting; that happens in the real world just when you put on new prescription glasses (which always gives me a headache for a week). The problem with latency is that it doesn’t involve a consistent change in proprioception; the effect varies with acceleration, and your perceptual system is not capable of training to a varying input, because there’s no consistent state for it to adjust to.

And yes, it would be nice if we could get the signals from the brain before the head turned, but that doesn’t seem to be even on the radar at this point.

Fascinating post and set of comments. When I saw how long the page was, I thought I’d read the article only, but I just couldn’t stop reading. I’ve been playing with 3D since my Apple ][ days, doing animated monochrome (green/black) wireframes and turning off all the lights in the room to create rotating ‘holograms’ to fool my friends when I was a high school hacker. Always loved and been fascinated by 3D/VR/AR, thought VRML was going to be great, but… anyway, I digress.

My first thought before reading any of the comments was the same with the ‘overscan’ larger framebuffer and ‘floating’ physical panel. But as pointed out, this can possibly cope with rotation, but not translation.

One of the major problems with the 60fps refresh is that your head can turn faster than even a single frame can be drawn, so pixels are going to be out of date as soon as they’re being sent to the display. So bottom line, as Michael has responded in the comments several times, we need faster display technology (pixel to photons) to reduce the amount of latency to an ‘acceptable’ level. Current consumer devices are too slow, and there needs to be market pressure for the hardware vendors to create faster and cheaper devices.

So, with that said (and *not* out of the way, still a requirement) we need a better way to predict the (head movement) future. By doing this and combining some of the other mentioned techniques, eg. racing the beam, we may be able to get the actual photons closer to what should be coming out by guessing the future based on some additional inputs that could still be quite cheap for a consumer device.

Before I explain the idea, though, there is one other thing to note. The Oculus Rift (or any HMD) will never weigh nothing. It can hopefully be made as light as possible, but never zero, hence it will have inertia. So when you move your head, there is also going to be some latency before the HMD moves to where your head is and catches up. This is also a problem because the IMU, no matter how fast its sampling rate and how low its latency, is on the HMD, and so it won’t know the head has started moving until the HMD has started moving. Making this worse, some people have fairly flexible skin, so the skull may move first, the skin then tries to catch up, and that starts pulling the HMD around. Granted, this is a small amount, but it’s still another contributing factor to the latency between your head (well, your eyes inside your head) moving and the display catching up. This may actually be of (very minimal) use for the rotation problem, but not the translation problem.

So, rather than being a problem, can we actually use this to our advantage? (If you can’t fix a bug, call it a feature!) Some people talked about trying to detect muscle movements, but this is going to be hard: specific to set up for each user, messy to attach electrodes, etc. What about very sensitive pressure sensors strategically built into the mounting points of the HMD, or possibly additional point(s)? The idea is to detect acceleration of the head before the IMU shows it, using pressure sensors that are matched to their position on the HMD and against the head. One example could be the bridge of the nose, which is reasonably rigid (either side, where glasses would normally sit), as well as the back of the head, though that is not so rigid. The (1D) pressure sensors, if strategically positioned and angled correctly – or if 2D or 3D pressure sensors were available – could help determine an acceleration of the head that is about to happen. Ultimately a minimum of 6 pressure values will be needed, but not necessarily 6 mounting points. At a reasonably high sampling rate (500-1000Hz?), these could then be used to determine the amount of pitch/roll/yaw (single pressure-sensor values transformed to the direction of pressure being applied/released) AND the X/Y/Z translation (combined pressure-sensor values transformed but contributing to an axis). We then use these ‘pre-acceleration’ values to predict beforehand where the HMD is going to be emitting photons from (when it catches up), to compensate the eye’s position for the amount of overall latency measured from looking at a position in the 3D world to photons being emitted. This would have to be calculated from the inertia of the HMD (which should be a known weight and thus hopefully calculable) together with the calibrated pressure sensors to predict the changes in head movement (acceleration). The calibration should be automatic, by detecting the at-rest pressure values when the head is not moving.
That can be detected by a ‘stable’ reading from the IMU for a minimum period of time.
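As a sketch of how those six pressure readings might be turned into a pre-acceleration estimate (the function names and the calibration matrix are purely illustrative, not part of any real HMD API), the at-rest calibration could be folded into a linear mapping from pressure deltas to the six motion axes:

```python
def pre_acceleration(pressures, rest_pressures, calibration):
    """Map six pressure-sensor readings to a 6-DOF estimate
    (pitch, roll, yaw, x, y, z) via a calibration matrix whose rows
    weight each sensor's contribution to one axis."""
    deltas = [p - r for p, r in zip(pressures, rest_pressures)]
    return [sum(c * d for c, d in zip(row, deltas)) for row in calibration]

# With an identity calibration, the estimate is just the raw deltas;
# a real matrix would come from the automatic at-rest calibration step.
identity = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
estimate = pre_acceleration([1, 2, 3, 4, 5, 6], [0] * 6, identity)
```

Feeding these estimates into the existing sensor-fusion predictor, rather than replacing it, is presumably how this would be used.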

As there are already prediction algorithms out there, these 'pre-acceleration' values from the HMD can provide additional input about what is about to happen to head movement, which needs photons emitted at a certain time and place. Traditional prediction algorithms based solely on the IMU are fine for constant velocity, but can't deal with changing acceleration values. They can cope to some extent with linear acceleration, but not with quickly altering acceleration. By reading the pressure generated from rigid points on the skull against the HMD, the 'pre-acceleration' can help guess, some amount into the future, what is going to be coming from the IMU and its sensor fusion algorithms. A perfectly rigid point would most likely be some sort of tooth guard with appropriate pressure sensors, but this is obviously not as physically desirable – unless of course we're going to start talking about taste as a new output parameter for VR! But this alone is probably not enough for translation pre-acceleration values.

Anyway, this is already getting TL;DR; if you'd like to know more about how I'd test this, fire me an email. Briefly, I'd create a simple OpenGL laser-pointer program with a 2D 'dot' on a normal LCD screen (i.e. no actual 3D display, just the IMU and pressure sensors on the headgear mount) and strap a real laser pointer to the headgear that carries the IMU and sensors. Then I'd compare and tweak the pre-acceleration algorithm by comparing my virtual laser-pointer dot's movement with the real laser pointer's movement pointing at the same screen. I could video this to slow it down and study how the prediction algorithm could use the (pre-)acceleration values of my head against the movement of the virtual dot, until they reached acceptable levels of prediction. It's probably the sort of thing I could prototype with some of my simple microcontroller (e.g. AVR/MSP430) gear.

Very nice lateral thinking! Before I thought too hard about anything else, though, I’d want to know how far ahead of the IMUs the pressure sensor data would be. If it’s only a few milliseconds, it wouldn’t really change anything; if it’s 20 ms, that could be significant. Do you have any idea?

Off the top of my head (sic), no, I don't know without doing a proof of concept. How far ahead will depend on the inertia of the HMD as well as the selected mounting points for the pressure sensors. These would need to be combined, and different placements tried, to determine what and where might provide optimum data.

There was some really creative work in the mid '90s by Matt Regan and Ronald Pose at Monash University in Australia. The basic idea was to render into 'viewport-independent display memory' – i.e., a cube around the viewpoint – and then render from that cube into the headset framebuffer based on the orientation _at_the_start_of_each_scanline_. That's right – the latency on head rotations gets down to way below 7 ms, because you've decoupled scene rendering from final framebuffer rendering.

I saw a live demo and it was very impressive. As a proof point, they’d render a scene consisting of a single vertical line floating in space. Wearing a headset you could turn your head as fast as you wanted and it would appear rock solid without the slightest lag. But a viewer watching an external screen would see the line warp as the headset-wearer turned their head because the view orientation was being updated every scanline.
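A toy sketch of that decoupling (yaw-only, one pixel per degree; not the actual Regan/Pose system, which rendered a full cube around the viewpoint): each scanline of the output frame samples a wraparound panorama at an offset read from the head orientation at the moment that line is scanned out.

```python
def scan_out(panorama, out_width, px_per_deg, yaw_at_line):
    """Produce one output frame from a pre-rendered wraparound panorama,
    re-sampling head yaw at the start of every scanline."""
    frame = []
    for y, row in enumerate(panorama):
        offset = int(yaw_at_line(y) * px_per_deg)
        frame.append([row[(offset + x) % len(row)] for x in range(out_width)])
    return frame

# If the head turns during scan-out, each line starts further along the
# panorama -- rock solid to the wearer, but sheared on an external screen.
pano = [list(range(360))] * 4                        # 4 lines, 1 px/degree
frame = scan_out(pano, 8, 1.0, lambda y: 10.0 + y)   # yaw drifts 1 deg/line
```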

Yes, I’m familiar with the Regan paper – very cool – and per-scan-line panning is one of the approaches being considered. I am 100% sure this looks great with static scenes, at least at moderate rotation speeds (mind the warping) and even more moderate translation speeds (mind the gaps), but I don’t know yet how well this works with dynamic scenes.

One thing to remember about HMDs for the time being is that they are heavy, bulky, and attach poorly to the head.

We can't move our heads quickly in them. If we did, the device would either shift (instantly breaking the illusion) or actually fall off. But we are used to dealing with this; we regularly subdue the movements of our heads to keep helmets or hats securely attached. Perhaps this effect will work to our advantage with HMDs.

Actually, IGN reports the Rift as targeted at around 220 grams – not heavy at all. And my experience with the prototype is that it is not bulky and attaches to the head much like a snorkeling mask or ski goggles. It was easy to move my head rapidly with no shifting or falling off.

I’ve worked quite a bit with heavy HMDs in the last year, and while it’s true that they slow head movement, they are also uncomfortable; I wouldn’t recommend making HMDs heavy as a strategy for coping with latency.

What about, instead of two screens, using a single screen that covers both eyes, where the middle of the screen uses a parallax barrier ( http://en.wikipedia.org/wiki/Parallax_barrier ) to give different images to each eye? It would not solve any of your problems, but it might make for cheaper and simpler hardware.

Sure, that could work. But the Rift already works by having a single screen; however, it’s very close to the eyes, so each eye sees only one half. This produces the same result (half the screen for each eye, although not distributed the same way) as the parallax barrier, but eliminates the need for the barrier, and allows the screen to be closer, thereby reducing the leverage its weight has on your neck muscles.

Michael, so it seems that the next biggest hurdle for the Oculus Rift team from a hardware perspective is the refresh latency (they got the fusion thing down to 2 ms), and the only solution on the horizon would be a 120 Hz OLED display? I think OLED displays are coming on strong this year (smartphones, CES), so 120 Hz would be the next gig? From a business perspective, isn't there enough incentive to make this hardware switch on the production line – I mean, if VR is at the door, I'd think there would be a lot of money involved as well to open it. Do we need to kickstart another project directly with a production line to see this through (120 Hz OLED)? I would imagine, based on the enthusiasm of the initial Kickstarter, that it would be possible to gather another batch! I'd like to get your view on this if possible, as currently I don't think we are too far off (one year) from seeing OLED displays become mainstream – thanks!

I’m not sure what you mean by “they got the fusion thing down to 2 ms.”

VR may be at the door, but we’re not talking sales in the tens of millions for VR, like we are for smartphones, and all the display manufacturers are putting their resources into the highly competitive smartphone market. The amount of money that could be raised in a kickstarter project is peanuts on that scale.

120 Hz OLEDs might be good; they’re not as bright as LCDs, but that could be okay with VR, since there’s no ambient light to compete with.

LCDs and LCOS can also be built to run at 120 Hz or even higher; as usual, it’s a question of whether the cost of developing that is justified.

I am hopeful that OLEDs will become cheaper to build soon; I've read somewhere that some German researchers have found a way to print them much more cheaply, though I can't find the article any more. If I understood you correctly, that seems like the only clean solution to the refresh latency problem – in other words, the other solutions are hacks compared to a 120 Hz OLED solution?

120 Hz would be better, for sure – lower transmission time to the display, more samples – but by itself it wouldn't fix everything by any means. That would take more like 1 kHz and up, which seems unlikely to happen, at least using the standard pipeline.

If I didn't miss anything in your excellent but long post (and comments), you are assuming that there really must be some threshold of "quality" that reminds one of current gaming visuals. I would like to point out that (it seems to me) the biggest shift in gaming in the last few years is toward what are known as "casual" games. These have graphics which often would have been totally feasible on a Commodore 64. So assuming that a headset must go "all the way" into high-fidelity graphics may be a flawed assumption. I think excellent VR head tracking would be so awesomely excellent that it would still be compelling to play games with it that were very simple, monochrome, blocky, blurry, etc. This would provide a starting point of lower expectations from which to continue the evolution.

Also, I work with the people who were the biggest consumers of the first round of 3D glasses – molecular biologists. This is a totally solid market that would love something lightweight and practical for molecular work. They focus on a protein, not the walls of a room or a complex surrounding environment – much less head motion to track, but it would still be very useful. And the graphics quality can be quite low (it turns out that molecules aren't really made of colored sticks and balls anyway).

Your comments would be highly relevant to the previous post, about the low resolution of wide-FOV HMDs, and in fact, the Oculus Rift, with something like 10-15 pixels per degree, works well despite its low pixel density, confirming your comments (although I will say that when I use an HMD with 10X the pixel density, it’s awfully nice). But they don’t seem particularly applicable to the latency discussion in this post. It doesn’t matter how serious or casual the game is, or how good the image quality is – if the virtual images aren’t stable with respect to the real world, the experience will be poor, and the chances of simulator sickness are probably increased – and that’s exactly what tracking-to-photon latency causes.

What I meant exactly was that the software could predict the next frame if it recognizes acceleration (as an indicator)/speed/translation higher than some X – maybe a combined speed/translation value. If the system renders at really high frame rates, it wouldn't be a problem to pre-render one image at the predicted position when X is too high and is about to generate latency. Acceleration increases exponentially in most cases, I think, so you have a few ms to recognize the next position. That process could easily be optimized, because you always get the real position a few ms later and you're able to record it. Maybe that's the way to go if hardware/monitors aren't there in the next few years?
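A minimal sketch of that scheme (the threshold, units, and names are all invented for illustration): extrapolate one frame ahead only when the measured motion crosses the threshold, and otherwise display the pose as tracked.

```python
def next_pose(pos, vel, accel, dt, threshold):
    """Return the pose to display for the next frame (dt seconds ahead),
    pre-rendering at a predicted position only when motion is fast."""
    if abs(vel) + abs(accel) * dt > threshold:   # combined speed/accel measure
        return pos + vel * dt + 0.5 * accel * dt * dt
    return pos  # slow motion: latency is unobtrusive, so skip prediction

# Head at 30 deg yaw, turning at 200 deg/s and accelerating at 1000 deg/s^2,
# with 16 ms frames: fast enough to trigger prediction.
predicted = next_pose(30.0, 200.0, 1000.0, 0.016, threshold=50.0)
```

Recording the real position a few ms later, as suggested, would let the threshold and prediction terms be tuned against measured error.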

So about once every 10 seconds, the programmer will get a free 200 ms to do anything they need to do, because the user is not seeing anything. Taking it a step further: if not overused, a quick flash of the screen for one frame may go unnoticed (like a subliminal message) but could force a blink at a strategic time. Perfect for AR scene repositioning/tweaking (even large changes to a scene when you are not looking will go surprisingly unnoticed). Or take the end of a head turn, where the scene keeps moving for another 50 ms after your head stops – which gives you the wobbly, laggy feeling that can cause motion sickness, or lead to user-induced oscillations. An induced blink from a dedicated LED in the HMD could be triggered (in the 7 ms timeframe) and completely cover the scene "catch-up" with a blink.

At 60 fps, that's 12 frames, or, in your 60 deg/s head-turn example, 12 degrees of motion. All free, because it doesn't even have to be rendered.

I’m really not sure where to take this thought, but I wanted to throw it out there because when you are running things as time critical as VR, 200ms might really make the difference.

Interesting thoughts, with a lot of potential if all that works as described. One key is that your approach is dependent on being able to induce blinks on demand; do you have data indicating that either flashing the screen or flashing a dedicated LED does induce a blink and isn’t noticeable?

> even large changes to a scene when you are not looking will go surprisingly un-noticed

That'd be useful – what are you basing that conclusion on?

I remember playing with an experiment along these lines in some science museum (can’t remember which it was, it’s been some time). I tended to subconsciously notice that “something” changed between scenes, but when they were separated by a couple of black video frames I was 95% unable to determine *what* changed. As soon as the blank frames were absent, it was obvious.

So did the science museum exhibit make changes when you blinked? Or were there just a couple of black frames in-between, and if so, was it obvious that there were black frames? Just trying to reconstruct what you experienced.

Initially I considered a puff of air, but that would add cost and complexity to the system. A screen flash is free and an LED would not cost much more. A dedicated LED would bypass the screen refresh rate and frame buffer and get the eye blinking sooner so that would likely produce a better effect.

I’m certain a bright enough light is going to induce a blink. Whether it can be done in an unnoticeable manner is unfortunately just a hunch.

I imagine that some combination of brightness / area / length of flash / color / angle to the eye / one or both eyes, will be optimal. It may be slightly different for each individual as well. That is just something that will have to be tried.

National Geographic had a three-part series called "Brain Games" – they were all fun to watch, but check out the YouTube video titled "Brain Games – Pay Attention" at time 07:30 to see the effect of changes during a blank screen.

Very interesting how difference detection falls apart. However, the black screen is very noticeable, so I’m not sure how this could be useful in an HMD. And I bet if you make the black screen short enough so it’s not noticeable, it no longer disrupts difference detection.

I had a few thoughts about the overscan/address shift method. First, panning the visible region as part of the display hardware is only an approximation. Rotating the view direction vector changes the normal of the view plane. The address-shift method leaves the view plane normal unchanged. The net result will be a trapezoidal distortion of the final rendered result.

Second, the address-shift method could deal with some measure of translation (parallel to the view plane) by essentially panning over the posterized overscan render. Again, it will be an approximation – and perhaps an unacceptable one – and it would not deal with translation normal to the view plane. In the rotation case, shadow, visibility, and perspective information will be correct but trapezoidally distorted. In the translation case, visibility and the scaling of close objects will not be preserved, but distant objects should be mostly correct.
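To put a number on "only an approximation": this sketch (a 1D horizontal slice of the view plane, with made-up FOV and resolution) compares an exact planar reprojection of a small yaw rotation against a uniform address shift matched at the screen centre, and reports the worst-case pixel error.

```python
import math

def shift_error_px(fov_deg, width_px, rot_deg):
    """Worst-case pixel error from approximating a yaw rotation with a
    uniform horizontal address shift (1D slice of the view plane)."""
    f = (width_px / 2) / math.tan(math.radians(fov_deg / 2))  # focal, in px
    shift = f * math.tan(math.radians(rot_deg))  # exact displacement at centre
    worst = 0.0
    for x in range(-width_px // 2, width_px // 2 + 1):
        ang = math.atan2(x, f)                       # viewing angle of pixel x
        exact = f * math.tan(ang - math.radians(rot_deg))
        worst = max(worst, abs(exact - (x - shift)))
    return worst

# For a 90-degree FOV at 1000 px, a 2-degree rotation already leaves the
# edge pixels well over a dozen pixels from where true reprojection puts them.
err = shift_error_px(90.0, 1000, 2.0)
```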

Third, all of the address-shift/panning approaches will fail terribly if the HMD makes use of LEEP optics, as is done in the Rift. As you well know, the last phase of rendering "fish-eyes" the output to match the Rift's expanded FOV. Simply shifting addresses is not going to work on those deformed images. Maybe with flexible OLED displays this will all go away, by making the display screen and the render viewport into spherical sectors.

Finally, the question to answer may be how much intelligence can be integrated into the display controller. Can we steal a trick from Walt Disney and build a "multiplane camera" rendering scheme, with several images at different depths that can each be address-shifted and then splatted together at the last microsecond by the display controller? What if 2D scaling and rotation could be added as well? It might preserve much more realism. Or do you just complete the integration of simple Quake 1-level texture/geometry handling in the controller, where the main GPU does the shading and culling of the world and produces reduced geometry and associated textures? The display controller could then do the final rasterization from that reduced geometry and those textures, driving the panel one pixel at a time.

Those are excellent and very accurate observations. Sounds like you’ve done this before!

I don’t think the answer is to put intelligence in the display controller, although I could be wrong. First of all, there’s only so much heat you want to emit near someone’s head. Second, eventually you want to get rid of the tether, which means doing as little work on the HMD as possible. Third, putting more processing on the HMD increases cost and weight. Fourth, I am unconvinced that any kind of fix-up on the HMD will work well in all cases. I think the right solution is to reduce the time between the end of rendering and the emissions of the corresponding photons, which would make late fix-up unnecessary.

Also, it’s not clear to me that there would be less data to transmit with your clever scheme – you’d still have to send texture data for every pixel, it seems to me.

I can see how that might eliminate translational as well as rotational lag, since (all time-frame references aside) it would be quite impossible to move your head with a 20-ton particle accelerator attached to the HMD. Unfortunately, unless it will fit in a USPS flat-rate box, the shipping costs will be prohibitively expensive for most people.

If however a use for a tachyon generator could be found for a cellphone, we might see an affordable MEMS version available within the next few years.

I'm not sure this has been brought up often enough in discussions about the Oculus Rift, but how would one go about using a keyboard while wearing a VR headset? Has there been any consideration of having some kind of Leap Motion/Kinect camera built into the device to support an AR interface, or some kind of gloves with haptic feedback in the fingertips to simulate tactility when using an AR keyboard/interface?

Wow – I was present at IBM in the UK when PCs first made the jump to 256 colours; a bank of three of them stared at me all day, displaying bowls of fruit in glorious 256 colours, whilst I filled thousands of envelopes with memos... on work experience... which prompted me to go to med school instead – although it could have been the sight of those Fortran and COBOL manuals.

Re latency and VR: we have to wait for technology and software to catch up to and solve your problems. There exists no actual example of mainstream VR or AR gaming or applications, nor is there any VR-capable hardware. I have used all the current HMDs, from the HMZ-T and SMD ST1080 to the Cinemizer OLED, with which I am currently testing the Zeiss head tracker; my first experience is here: http://the-games-veda.blogspot.co.uk/2012/12/carl-zeiss-cinemizer-oled-headtracker.html?m=1 . I use these devices on a daily basis. Like I told Palmer, as an end user I experience no latency or drift from the Zeiss head tracker; of course, if you were to measure it, it would be there, but I can perceive no latency in applications or games. Now of course these are not VR applications, but the hardware to run VR apps, and those VR apps themselves, are not going to arrive any time soon, if at all, because as yet there is no great demand beyond enthusiasts for VR, and AR seems to be owned by Google and their vision of Project Glass.

Using HMDs with head tracking is a new skill to learn; not everyone will be able to fully use or push the capabilities of even present head-tracking solutions. In my tests I found my aim and target seeking improved, but this did not improve my skill with a trigger. I found the precision of the head tracker to be essential: I can follow this line of text and move the cursor to within a pixel and hold it there, without the drift or jitter you would expect. The last person mentioned seeing your keyboard, and this is where I stand after 1.5 years of using HMDs: you need to see your keyboard to find your bearings from time to time, which I do by looking down from the ST1080 and the Cinemizer OLED. FOV is misleading – what matters is how much of that FOV is crystal clear. Ergonomics and comfort are of absolute import; immersion is a subjective experience. I find playing Hawken with head tracking in 3D on the crystal-clear, smaller-FOV, 40-inch Cinemizer OLED more immersive than on the HMZ-T series or ST1080. Until I get hold of my Oculus dev kit I will remain sceptical about how such a large FOV can be delivered without optical distortion. I am not so much worried about resolutions, as the smaller-res Zeiss easily outperforms the HMZ-T series in many scenarios, and the true-1080p SMD ST1080 in others; HMDs in everyday use tend to perform quite differently from their tech specs or users' expectations. Oh, and one last thing about FOVs: sometimes it is beneficial to have a smaller FOV. When you are playing an FPS game and can see your whole FOV in a smaller screen size, you can also see your enemy quicker, read your HUD quicker, and move your head quicker, somewhat negating the latency inherent in all forms of computer input. Good against hypothetical remotes is one thing; good against the living – that is something else.

I will note that packing the FOV into a smaller screen size makes sense on a monitor but is a bad idea for VR, and of course doesn’t even make sense for AR, since it would misregister. In VR, the thing is that your view in the virtual world would not match your proprioception, which is a recipe for simulator sickness.

What about combining a bunch of the good ideas already mentioned in this thread?

– First predict as well as possible the position of the eye for the next frame
– Render the scene in a buffer with a slightly wider FOV than needed (a full scene impostor)
– Then race the beam rendering the impostor with warping using the real-time position of the eye (update the position line by line if necessary)
To get a good rendering quality with warping, deferred shading is probably the best solution.

If the position prediction is good enough, the artifacts of the warping should remain minimal (and only be present during sudden movements).
As response time is more critical for changes in the viewer's position than for changes in the positions of objects in the scene, the impostor can be updated less often than the final rendering. Or a combination of two impostors – one for the static objects of the scene and one for the dynamic ones – would be OK.
It's even possible that with this solution a remote rendering of the impostor (à la nVidia VGX) would be feasible.

Could a 60 Hz display be hacked so that the R/G/B/W signals are each displayed out of sync by a quarter of the frame time?
Then use tracker prediction to shift each of the R/G/B/W signals to its predicted position, effectively multiplying the refresh rate by four.

That’s pretty much how color-sequential LCOS and DLP work. The result, unfortunately, is highly visible color fringing, since each color is drawn at a different location when the HMD is moving relative to the eye.

This has been an awesome read, comments included. I have been very excited for the Oculus Rift but I now see how much of a challenge their project is.

I’ve seen mentioned a few times in the comments the idea of using more than one display. Somewhere in my reading about the Rift and Palmer Luckey’s past experience in VR, I recall seeing that the use of two displays is complicated when it comes to the issue of synchronizing the two units. If you were relying on the cooperation of the two units to produce an effect it seems to me that it would very quickly break down as soon as you had any de-syncing.

I think the idea of using any sort of prediction is inherently flawed. One could possibly get extremely close to making correct predictions but it will never be perfect and really, really good guesses would require tremendous computing power. The only real example that works is the one where you tap in to a person’s brain such that you can know what their muscles are going to do before the muscles know it themselves; that’s not really prediction though.

You don't need to tap into the brain directly; you can use surface electromyographic monitoring to detect neural activity in muscles. It's used to control prosthetic limbs, and facial nerve EMG is used to predict head movement during skull-base surgery.

About rendering to a wider buffer and then cropping at the last moment based on the current head position:

I’m not sure how to implement this easily though. Would it require some more specialized display hardware? Or is it possible to use a chain of 2 GPUs for this? One for slow world rendering, another for fast cropping operation. Or does that negate any benefit?
I think “cropping” is not enough because you would also need to apply the lens deformation correction after cropping as well, right?

Could scan-out latency be reduced by using a very high density display in an interlaced mode, where odd and even frames render only to odd and even lines respectively? If it takes 16 ms to scan out a whole image, could you scan out a frame to the odd lines in 8 ms and then scan out the next frame to the even lines in the next 8 ms?

With a 5″ 1920 x 1080 display you could split it into 960 x 1080 for each eye, then interlace, with each eye given 960 x 540 per screen per frame.
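The arithmetic of that split, as a sketch (assuming scan-out time scales with the number of lines sent per frame):

```python
def interlaced_plan(screen_w, screen_h, refresh_hz):
    """Per-eye, per-field geometry and scan-out time for a side-by-side
    stereo screen driven interlaced (odd lines one frame, even the next)."""
    eye_w = screen_w // 2       # screen split left/right between the eyes
    field_h = screen_h // 2     # each field carries only odd or even lines
    field_ms = 1000.0 / refresh_hz / 2  # half the lines, half the time
    return eye_w, field_h, field_ms

# 5" 1920x1080 screen at 60 Hz: 960x1080 per eye, 960x540 per field,
# with each field scanned out in roughly half the 16.7 ms frame time.
eye_w, field_h, field_ms = interlaced_plan(1920, 1080, 60)
```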

Given a high enough pixel density, the vertical jitter would be negligible, but I'm guessing the effect on brightness plus visible scan lines could be unacceptable tradeoffs. However, as an interim solution until other improvements make it unnecessary…

Could rendering to every line but skipping pixels to make a checkerboard dithering pattern help against visible scanlines? I’m guessing it wouldn’t look any better…

I’m sorry. I feel like Penny on Big Bang Theory. Or at least Howard “Not a Doctor” Wolowitz. A head mounted VR headset is something I’ve dreamed about for decades though.

That might be a useful interim step, but I can guarantee the image would come apart in a big way for horizontal movements. With color-sequential LCOS, where there’s only 4 ms between color planes, the color fringes are very visible, and this would be twice as long. Also, there would be weird effects on horizontal movement, with either overlap of the two fields or the two fields coming apart.

I missed the Kickstarter campaign for the Rift and I wish those guys all success, but after reading through your article I’m now sceptical of the still to be released product.

The way I see it, the issue exists in hardware, and that's where the solution should be found. If the easiest solution is a high-refresh-rate display, then that is how it should be achieved.

I get that VR is currently very niche. But I wonder if another fundraising scheme couldn't be successful in getting a manufacturer interested. It seems to me that the console market has become somewhat stale; I wonder if one of the big three couldn't be persuaded to weigh in and put up some of the capital necessary to get the manufacturing numbers. My guess is that this is what the Rift guys are aiming for with their development package: trying to build enough momentum for a solution to be economically viable.

The solution might also come down to price point and how to accommodate the market at it. Even if, for instance, the Rift could get the hardware for a consumer-level product down to say $1000 USD, I wonder if a bean-counter boffin couldn't come up with an economic solution.

My personal opinion is that Kickstarter-scale money is irrelevant to the decision-making of the huge companies that make displays. They’re in a highly competitive, fast moving marketplace in which success can result in tens of millions of displays sold, with billions in revenue, and that’s where they’re going to focus their resources; no Kickstarter can compete with that. The way to get display manufacturers on board is to have a thriving, growing market for VR/AR, and Oculus is taking the first steps down that path, which I think is great. One of the big companies could obviously fund hardware development, and that would move things forward as well, but I have no way to know if that’s happening.

That fixes rotation, but the whole surround has to be redrawn differently depending on your position, so you have the same problem for translation. And there’s never a real-world motion that’s purely rotation.

HDMI 2.0 is looking at raising the roof to 18 Gbit/s, which would be enough to halve the scan-out time of a 1920×1080 stereo image. Set the video mode to 4K/60 Hz and render nothing in the bottom half of the image; the display would just ignore the bottom half using special firmware.

You could even extend this by rendering a second frame into the bottom half and getting the display to retrace halfway through. That would reduce the required bandwidth to 4K/30 Hz, which is within batting distance of HDMI 1.4a and dual-link DVI.

That would theoretically reduce scan-out time from 16.6 ms to 8.3 ms. With smaller resolutions, even lower scan-out times could be achieved.
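The scan-out arithmetic behind those numbers, as a sketch (assuming the link delivers pixels at a constant rate across the frame):

```python
def scanout_ms(image_lines, mode_lines, mode_hz):
    """Time for the display to receive an image occupying only the top
    `image_lines` of a taller video mode; the remaining lines are padding
    the display can ignore."""
    return (image_lines / mode_lines) * (1000.0 / mode_hz)

# A 1080-line stereo image in the top half of a 2160-line 60 Hz mode
# arrives in half the frame time, cutting scan-out from ~16.7 ms to ~8.3 ms.
half_frame = scanout_ms(1080, 2160, 60)
```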

At least that's what I would expect to happen, with my limited knowledge. Maybe it can't be done due to some signalling technicalities.

Yes, that’s a very good thought; you can do pretty much the same thing with no hardware/firmware changes at all by running at 120 Hz if the display can transition quickly enough. For your approach, there’s still the issue of how to get the display to ignore the extra data (firmware isn’t as hard to change as hardware, but it’s harder to change than software and often requires cooperation from the manufacturer), but it’s the simplest way to reduce transmission latency.

Pair the VR unit with MRI and it becomes easier to predict head and eye movement. Currently cost-prohibitive, but look at the strides made with mind-controlled remote-controlled vehicles. Training the user's mind to give the same cues that control the current crudely controlled toys, to pre-empt head movements, would give the hardware additional time to compensate.

As was mentioned, having a screen's pixels coupled directly to their own dedicated memory sounds like an awesome project I'd love to get into.

But until that happens, could a grid of low resolution screens (5×5 @ 320×240) be updated in parallel faster than a single full resolution screen?

Maybe it would require one controller per screen to send the data in the standard serial encoding for its part of the grid, and a high-bandwidth memory bus between the VRAM and each screen's controller. Or 25 memory buses, or 25 video cards.

I imagine the effect would be like looking out a screen door, and the lines between the screens would fade as you’re immersed into the game.

It seems like you can split the latency problem into two main areas. There is the latency involved with head-tracking (which requires a fast refresh rate) and traditional user-interaction latency (mouse-click lag) which our brains can already deal with ok, provided it doesn’t creep above ~100ms.

These two types of operations are also different in terms of the processing required. The head-tracking update needs to be fast to keep up with high rotation rates, but doesn't require the same level of feedback from the main processor that direct user interaction requires. I assume that rotational head-tracking primarily needs to simulate the rotation of the visual field around a sphere, without much other change in the image. You could then simply transmit enough information to render all the rotational processing required in the time before the next frame is served up by the main engine, which is in the best case 16 ms of probable rotation. You would then let the VR device process the intermediate frames onboard, using the information from the head-tracking sensor. This additional transmitted information would take the form of the pixels surrounding the edge of the image.

Let's take your example of 60 degrees/second head rotation, supposing that latency is 50 ms and resolution is 1K×1K over a 100-degree FOV. You calculated that this results in approximately 30 pixels of difference. If the graphics engine renders 1060×1060 pixels, this could be sent to a pre-display buffer in the VR display, which then displays a subsection – the centre 1K×1K image. If the head tracker detects 60 degrees/second rotation, it could immediately begin shifting the displayed view through the scene already loaded in the pre-display buffer, additionally warping the image if it is a flat display. I'd wager that current-gen mobile graphics chips would be more than capable of sub-frame performance with this kind of operation.
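The 30-pixel figure follows directly from those numbers; here is a sketch of the calculation (the parameter names are mine):

```python
def pixel_shift(deg_per_s, latency_ms, fov_deg, width_px):
    """Pixels the image must shift to cover the head rotation that
    accumulates over one latency interval."""
    px_per_deg = width_px / fov_deg          # 1000 px / 100 deg = 10 px/deg
    return deg_per_s * (latency_ms / 1000.0) * px_per_deg

# 60 deg/s for 50 ms over a 100-degree, 1000-pixel-wide FOV: 30 pixels,
# so a 1060x1060 overscan buffer leaves a 30 px margin on each side.
shift = pixel_shift(60.0, 50.0, 100.0, 1000.0)
```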

Of course, there are still the rest of the visual updates to process, such as any change in the scene as a direct result of head-rotation feedback, as well as head translation and direct user interaction. However, these operations are on the more latency-forgiving side of things, so the user can more easily tolerate the 16 ms, or up to 50 ms, of latency that comes with involving the main GPU/CPU. The VR device receives the next frame whenever it can, correcting the start point of the displayed subsection to line up with the movement accumulated since the frame began processing (16 ms to 50 ms ago).

My assumptions here are that:
1. Rotational movement requires minimal image processing within the latency of a total round trip (as there is no perspective change), or that by delaying this processing, the intermediate results would still be acceptable.
2. Artifacts caused by rotational movement account for most of the 7 ms latency requirement, and the remaining operations can safely wait for a full round trip from the main graphics processor.

I think you know what my question will be: On what do you base your assumptions?

Also, you won’t get pure rotation; there’s always significant translation of the eyes. Sit in front of a mirror, hold up your finger, and turn your head; there’ll be quite a bit of parallax between your finger and its reflection.

Finally, it’s not clear whether a linear shift is a good enough approximation when there’s really a spherical component involved, not to mention any distortion correction that the optics require.

That’s not to say it couldn’t work, of course. But hard data would be helpful.

For those interested in a crotchety old-timer’s perspective, I talk about using prediction with an extended Kalman filter to handle sensor and render lag in AR/VR in a post in the wearable computing Google+ community.

Hi Thad – it’s a pleasure to have one of the luminaries of VR commenting!

Thanks for the link to your post. If you reread my post, you’ll see that I do in fact acknowledge prediction, and we do use prediction. It works very well – most of the time. But as I mentioned, when it fails – at times of rapid acceleration – it actually makes things worse than they otherwise would have been. For example, if you are turning your head rapidly and then reverse direction, the mis-registration at the turn point will be larger than it would have been without prediction, and is very noticeable; accelerations can be thousands of degrees/second/second. This is not hypothetical, but rather clearly evident in actual HMDs.
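The failure mode at a reversal is easy to see with a little arithmetic. This sketch uses a constant-velocity extrapolator and invented numbers (the velocity, latency, and the assumption that the head reverses instantaneously are all illustrative, not measured HMD data):

```python
# Sketch: why velocity-based prediction can make misregistration worse at
# the moment a head turn reverses. All numbers here are illustrative.

V = 300.0      # head angular velocity, deg/s (assumed)
L = 0.020      # motion-to-photon latency, seconds (assumed)

# Steady rotation: by display time the head has moved on since the pose
# was sampled, so the stale pose is wrong and the extrapolated pose is right.
actual_steady     = V * L     # where the head really is at display time
no_predict_steady = 0.0       # rendered with the stale sampled pose
predicted_steady  = V * L     # sampled pose extrapolated forward

print(abs(no_predict_steady - actual_steady))  # 6.0 deg error without prediction
print(abs(predicted_steady - actual_steady))   # 0.0 deg -- prediction wins

# Reversal: the pose is sampled right at the turn point (angle 0, measured
# velocity still +V), but by display time the head has swung back to -V*L.
actual_reversal     = -V * L
no_predict_reversal = 0.0     # stale pose: error V*L, same as steady case
predicted_reversal  = +V * L  # extrapolation overshoots past the turn

print(abs(no_predict_reversal - actual_reversal))  # 6.0 deg
print(abs(predicted_reversal - actual_reversal))   # 12.0 deg -- twice as bad
```

So at the turn point the predictor roughly doubles the error rather than removing it, and since accelerations can reach thousands of degrees/second/second, that doubled error arrives exactly when the image is moving fastest.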

Your drum example is cool, but not directly relevant, I think. The key to HMDs is that head motion creates anomalies that the visual system is tuned to detect; that makes predicting head motion different from detecting any other sort of motion. If a drumstick’s movement has some mis-prediction when it changes direction, that won’t show up as a sudden visual shift.

Thanks for jumping into the discussion, and I hope to hear from you more in the future. And if you’d like to come to Valve and talk about VR, we’d be happy to fly you out!

Dunno if something similar was already mentioned (too much information), but the guy who did “World Tour Racing” on the Atari Jaguar had a strange idea. Since faster movement means the eye picks up less detail, the resolution of the game changed with the speed of movement – sometimes every frame (which was possible with the Jaguar hardware, and CRTs have no native resolution to get in the way). In the end it was too complicated, and he dropped the idea.

That might make sense on a monitor, but you can fixate on a point in front of you, and turn your head rapidly while you continue to fixate, and the relative motion of the HMD display and your eye can be very high without seeing any less detail. That’s how you typically look at something new – your eyes go there first, then your head follows while your eyes fixate on the new object.

This is getting exactly at the direction I was pointing to at the beginning regarding human perceptual limitations. It is overkill to try to reproduce the whole field of view to the same requirements when that is clearly not how our eyes or brains work.

What about combining the hardware-panning and racing-the-beam ideas? Don’t render the full image on the PC; instead, use the PC to simplify the scene. Remove all the geometry that cannot be seen without moving the head at superhuman speeds.
Send a big texture for the whole frame – with premultiplied illumination, pre-transformed and pre-filtered for the most likely viewport – to the hardware. This big texture could contain “hidden pockets” that would only become visible if there is a head translation. You could generate locations for these pockets by doing z-only render passes from slightly translated positions and then comparing the (shifted) z-buffers. If the z values differ, you’ve found a pocket that is not visible at the moment but can become visible with a slight head translation.
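The pocket-detection test can be sketched in one dimension. The depth rows below are hand-made stand-ins for real z-only passes, and a real implementation would reproject per-pixel by depth-dependent parallax rather than use a uniform shift:

```python
# Sketch of the pocket-detection idea: take depth-only renders from the
# nominal and a slightly translated eye position, shift one buffer to
# compensate for the translation, and flag pixels whose depths disagree.
# 1-D toy with hand-made depth rows; names are illustrative.

def find_pockets(z_nominal, z_translated, shift_px, eps=0.01):
    """Return indices where a small head translation reveals new surface."""
    pockets = []
    for x in range(len(z_nominal)):
        sx = x + shift_px                 # where this pixel lands after the shift
        if 0 <= sx < len(z_translated):
            if abs(z_nominal[x] - z_translated[sx]) > eps:
                pockets.append(x)         # depth mismatch: hidden geometry here
    return pockets

# A near object (depth 2.0) in front of a far wall (depth 10.0); translating
# the eye one pixel sideways uncovers slivers of wall at the object's edges.
z_a = [10.0, 10.0, 2.0, 2.0, 2.0, 10.0]  # nominal viewpoint
z_b = [10.0, 10.0, 10.0, 2.0, 2.0, 2.0]  # translated viewpoint (assumed)
print(find_pockets(z_a, z_b, shift_px=0))  # [2, 5]: edges where depth pops
```

The flagged indices are exactly the silhouette edges of the near object – the places where the big texture would need a hidden pocket of extra wall pixels.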

Good idea, and there are many reconstruction approaches along these lines. They do take extra processing, though, which can increase latency, and they all have their failure points. Really the only way to find out if they’re good enough is to try them.

Sure, trying out different approaches is really needed here. What I like about this kind of approach, however, is that it frees the PC from any serious “realtime” requirements and moves the timing-critical parts completely to the hardware. The processing and memory requirements of such a reconstruction approach seem to be easily within the range of FPGAs or small ASICs, without a need for external DRAM; an internal SRAM buffer big enough for maybe 10% of a frame seems sufficient. At least with current PCs and operating systems, racing the beam seems to impose unrealistic real-time requirements.

Maybe the GPU developers can do something about this: why can’t we have “output shaders” – shaders that generate the output pixels for the display just in time for the display? If we then had a direct channel from the head tracker to these output shaders, all the timing-critical parts would be entirely on the GPU, and it would not matter if Windows decided that it absolutely needed to do something else for the next 2 ms.
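A software caricature of the idea, under the assumption that the tracker can be sampled once per scanline just before that line is emitted (the tracker callback, buffer layout, and edge clamping are all invented for illustration):

```python
# Sketch of an "output shader" loop: each scanline fetches the newest head
# pose just before scan-out and shifts its pixels accordingly, so pose data
# is applied per-line rather than per-frame. Toy model, not real GPU code.

def scan_out(framebuffer, width, height, sample_tracker_shift):
    """Emit the frame one line at a time, re-warping per scanline."""
    output = []
    for y in range(height):
        shift = sample_tracker_shift(y)   # freshest pose, sampled per line
        row = framebuffer[y]
        # Horizontal re-warp: read the row offset by the tracked shift,
        # clamping at the edges (a real device would read from overscan).
        warped = [row[min(max(x + shift, 0), width - 1)] for x in range(width)]
        output.append(warped)
    return output

# Toy frame: each pixel stores its own x coordinate; head drifting right.
frame = [[x for x in range(8)] for _ in range(4)]
print(scan_out(frame, 8, 4, lambda y: 1)[0])  # [1, 2, 3, 4, 5, 6, 7, 7]
```

The point of the structure is that `sample_tracker_shift` sits inside the scan-out loop: pose latency shrinks to less than one line time, regardless of what the OS is doing.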

Bear in mind that artifacts from vertical motion can be considerably more visible if you’re doing scan-oriented reconstruction, in addition to translation artifacts.

Output shaders would only work if you had a fixed render time per pixel, if I’m understanding you correctly. But there’s overdraw, and translucency, and rendered-to textures that are pose-dependent, so that’s not the case.

Oh, it just struck me (maybe it’s been obvious all along and you’ve posted something similar about parallax issues), but the idea of rendering to a wider buffer and then somehow “cropping” it at the last moment using the latest position data has additional issues (besides the fact that one has to apply lens pre-distortion after cropping).

Something like the Rift requires a stereo image, so, as the user rotates and/or tilts his head, we really need one buffer per eye, and therefore the whole idea of rendering to a single bigger buffer (akin to a 360-degree view) just won’t work well (there’s no depth info).
And augmenting a 360-degree image with a 360-degree depth buffer won’t help much either (because when tilting and rotating, the motion of the two eyes involves some small translation).
It would be jarring if, when I moved my head in complex ways, the world suddenly became 2D.

Render multiple images in parallel from slightly different viewpoints, selected based on a model of plausible head movements; e.g., a moved-left image, a moved-right image, and a didn’t-move image. Choose the best image based on sensor data at the last possible moment, and the position error from all latency preceding selection from n images is, in the limit of many images, multiplied by 1/n. If rendering is significantly faster than necessary for the framerate, this may be essentially free; e.g., if the next frame is due in 16 ms and rendering takes 3 ms per viewpoint, that leaves time to speculatively prepare 4 alternatives. This requires fixing scanout, however, as HDMI et al. aren’t fast enough to send multiple frames per frame.
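The budget and the selection step can be sketched directly from the numbers above (the yaw values and function names are invented for illustration):

```python
# Sketch of the speculative-rendering scheme above: render several candidate
# viewpoints within the frame budget, then pick the one closest to the
# freshest tracker reading at scan-out time. Illustrative names and numbers.

FRAME_MS, RENDER_MS = 16, 3

def candidate_count(frame_ms=FRAME_MS, render_ms=RENDER_MS):
    """How many viewpoints fit in one frame budget (primary + alternatives)."""
    return frame_ms // render_ms      # 16 // 3 = 5 renders: 1 primary + 4 extra

def pick_view(candidate_yaws, latest_yaw):
    """Select the pre-rendered view whose yaw is nearest the sensed yaw."""
    return min(candidate_yaws, key=lambda view_yaw: abs(view_yaw - latest_yaw))

# Speculative yaws: didn't-move, moved-left, moved-right (degrees).
views = [0.0, -1.5, 1.5]
print(candidate_count())                 # 5
print(pick_view(views, latest_yaw=1.2))  # 1.5: the moved-right candidate wins
```

Note that the selection itself is nearly free; the expensive part, as the reply below points out, is that each earlier-rendered candidate is staler than the last, and scanout would still need to change to exploit the choice.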

That might work, but if scanout could be made fast enough to send a frame in 4 ms, that would solve most of the problem by itself. But that requires hardware changes, which, as I pointed out, is the stumbling block in dealing with display latency.

Also, if you rendered more than one tentative frame, each extra frame would have that much more latency. The first one you rendered would have 9 ms more latency in your example, which would allow for much more head movement in the interim, especially considering that the head can accelerate at thousands of degrees/second/second.

I'll post here whenever there's something about what I'm doing or about Valve that seems worth sharing. The initial post is an unusual one - it's long, my attempt to distill the experience of my first year and a half at Valve - but I think it's well worth reading to understand what I'm doing, why I'm doing it, and the context in which it's happening, and just to understand more about Valve in general.

Michael Abrash is the author of several books, including Zen of Code Optimization and Michael Abrash's Graphics Programming Black Book, and has written columns on graphics and performance programming for several magazines, including Dr. Dobb's Journal and PC Techniques. He was the GDI programming lead for the original version of Windows NT, coauthored Quake at Id Software with John Carmack, and worked on the first two versions of Xbox. He is currently working on R&D projects, including wearable computing, at Valve. He can be reached here.