Monday, September 19, 2016

The WebVR implementation in Chromium has been going through some major changes recently, so I thought it was time to share some of my original notes before they end up becoming useless. Some details may already be out of date given the latest changes, but I've tried to make sure anything specific to the WebVR 1.0 -> 1.1 transition has been "abstracted" away and what is left is the consistent portion of the architecture.

You can see the head of Brandon's branch in experimental here. You can also see some of the log history for the most recent changes. Unfortunately, the "source of truth" for my diagrams is no longer visible after many rounds of rebasing, squashing, merging and otherwise keeping up with all of the blink changes getting rolled into the branch.

I'll lead with the diagram and then talk through some of the broader structures. It's a big diagram, so expand it out to full size so you can really see what is going on.

In broad strokes the architecture is split into Browser, Renderer and GPU processes. The Oculus is shown in its own process since it is responsible for its own compositing and time warp functionality and has its own IPC.

Browser

The Browser process starts with a VRService and VRDeviceManager. These guys are how the browser wraps up all of the devices on your system. Each device is exposed via a VRDeviceProvider derivation, which encapsulates the logic to both enumerate and poll the devices.

To communicate, the VRDisplay that you see in the Renderer process expects a VRDisplayPtr to be returned by Mojo across the IPC boundary. So all communication between the providers and the manager is done via Mojo-compliant structures.

While there are a lot of boxes, this is conceptually simple. To implement a new device all you have to do is implement a new provider, plug in code that knows how to enumerate your device and communicate your device details back using the prescribed interface. The heavy lifting of communicating across the processes and controlling access to devices is all done in the hosting layers.

To bootstrap, the VRService is registered with each Renderer using Mojo's service interface registration logic.

Renderer

The Renderer process is a bit more interesting and it contains all of the objects that talk to JavaScript and all of the V8 bindings. This is the layer that communicates composition details to the GPU process to make sure that submitFrame gets the texture from the WebGL surface to the device.

When a page requests displays using navigator.getVRDisplays, the Mojo client service architecture kicks into high gear. There is some interesting bouncing around after you build a VRController (this is the client side of the VRService in the Browser process) to get to the service through the frame's interface provider. This is pretty common for all services that bridge this gap, so I now consider this hookup boilerplate.

One curiosity at this level and maybe an optimization for the future is that each NavigatorVR will get its own VRController which means that the Browser may in turn communicate data to the same Renderer process multiple times. Normally for this type of architecture we try to keep these channels 1 to 1 and then multi-cast within the process.

Once the service connection is live, the rest of the work happens mostly in VRDisplay, which in turn exposes device information through some other required platform objects so that we can have a few properties. In the future, VRFrameData will replace most of the extra interfaces, some of the enums, etc. Things get simpler in WebVR 1.1, and it's a great time to fix this stuff and deprecate older interfaces since we don't have broad adoption and upgrading to the new API is fairly trivial.

GPU

The GPU to Renderer communications use an old-school IPC mechanism specific to the GLES2 implementation. Any operation we need to have happen in the GPU process needs a new method and a bunch of auto-generated code. There were only 5 such methods when I originally documented this, and they mainly related to creating, destroying and submitting frames to the VR-specific compositor.

This entire interface, I've heard, may soon be replaced by Mojo, so digging into it doesn't seem like a very good use of time, but if there is enough interest I could share my notes here as well. It is somewhat interesting how the structures are built, copied, etc. It's basically an entire GPU command buffer implementation dedicated to communication between the Renderer and the GPU.

Consequently, any VR implementation will need to implement a compositor. This is how the generic submitFrame call can communicate information to the device specific graphics stack which may take in a texture or be implemented as a swap chain. There are also device specific configurations such as whether or not to enable deformation or asynchronous time warp.

As of this diagram, a potential problem is that the device is being accessed from two different processes. First in the Browser, where there isn't a sandbox that prevents talking to the various device services. Second in the GPU, where we potentially instantiate a second instance of the VR run-time to submit the frames. Because of this dual nature, "bad things"(tm) can happen, and there is ongoing discussion and design work within Chromium to figure out how to unify all of this into a single location. You could imagine solutions such as allowing compositing in the Browser process, since this only implies doing a texture copy or swap. Or perhaps moving the VRService into the GPU process and getting the data from the single VR compositor source. A third option is to move all of this into yet another process, which adds complexity but fits Chrome's architecture well.

Thankfully, no matter which approach happens in the end, it most likely won't affect the Renderer code very much. It's an implementation detail rather than something that dramatically affects the shape of the API. It will affect the performance and quality of the VR implementation to some extent, but that is always something to be finely tuned for best results.

Future Thoughts

I'll probably reserve any future discussions in this area until the code stabilizes a bit more. There are a lot of cool little changes coming up that I'm interested to see. For instance, the VRDisplay has to carry more data. In fact, to support certain compositors there might be device-specific data that has to be sent through. Today, since the VR provider can be talked to from two locations, it's not possible to efficiently share this data, so we'll have to figure out something. Shared memory maybe?

The security model around device access and frame focus will also be cool to talk about once all of that code lands. Currently there are bits and pieces available, but I don't see a fully cohesive picture. Lots of fun here!

Sunday, September 11, 2016

Okay, so I'm not saying this is even close to correct, but after a lot of code reading, reverse engineering and reading of documentation I think I'm finally ready to pass a bool value of some sort from the Blink code in the Renderer process all the way up to the Browser process.

I was quite surprised how many steps there were in plumbing even a simple value, but when you think about all of the architectural layers and how open source software is composed it kind of makes sense. Blink has to be wrapped by a content host of some sort to give it access to resources. That means some well defined interfaces and abstract communication channels for those layers. The content host in turn gets wrapped into some form of marshaling abstraction that allows it to serve the multi-process model. Some more objects in the browser in turn wrap these marshaling abstractions and separate the browser and content process. After all, why should a browser know about only a specific piece of content or one content type? Maybe I can plug in any content, right? Even the browser implementation specifics of UX and platform need to be broken out, and that means even more interfaces. The following, very high level, very abstract diagram should help in understanding the layer breakdown.

Quick Note: The diagram was made before converting instances of WebKit to Blink. The two are mostly interchangeable in this article and I've tried to be clear that they effectively mean the same thing.

If you are interested in more specifics then the actual diagram looks something closer to these hand scrawled hieroglyphics I threw up on my whiteboard. The faint of heart should avert their eyes.

In this sample I was hypothesizing how to get information from VRDisplay up to the Browser. However, we will actually use a slightly different and preexisting message to decode this diagram. The message we'll use is FrameHostMsg_DidChangeName. By the time you get to a message like this you are already at the Chrome IPC layer talking from the Renderer to the Browser. So how did I even pick this message? Well, I first started in WebContentsObserver which is the list of things that a browser might want to know about from the content. There are other types of observers but this one has a lot of good stuff on it. That allowed me to track back to the message.

From the message we can then look for someone doing a "new" operation on the type. So FrameHostMsg_DidChangeName isn't just an enum or index, but a structure. You need to create one of these structures to pass to the Browser, so now we have a foothold in the Renderer to continue our examination.

Starting at the End

Okay, enough foothold, let's work this backwards. In fact, my diagram is incomplete: it only covers up to the Send and then fudges the work that happens in the Browser itself. We need to understand that, though, since it shows the difference between where the Shell sits and where the Browser-side implementation of the hosting model lives.

From the diagram, the WebContentsObserver is what a Shell would create in order to spy on the WebContentsImpl which is really a tree of pages. Since this is a tree, if you want to know specifically what "document" something is for you have to pass in the RenderFrameHost object to uniquely identify it. This is precisely what we do with DidChangeName. You can view the implementation of RenderFrameHostImpl::OnDidChangeName to see how it passes itself to the delegate.

To get into the OnDidChangeName method we had to implement an IPC handler. We do this in render_frame_host_impl.cc in our RenderFrameHostImpl::OnMessageReceived method. It's a simple mapping of the message name to a handler function. The other side of this channel is the renderer process passing us the notifications. If you scan up a few lines you'll find the Send method, which we can also use to communicate back down to the renderer-side object.

Our implementation of OnDidChangeName shows the difference between the Shell and Browser code. We need some Browser-level logic to tidy up the frame nodes and structures required to make sure that the HTML5 spec is adhered to and that our Browser-side view is in sync. This is what I'm calling Render Host in the high-level diagram.

The Shell doesn't get a crack until we call into the RenderFrameHostDelegate::DidChangeName method. By default the RenderFrameHostDelegate is a no-op object, but we derive our WebContentsImpl from it and pass it into each RenderFrameHost that we wrap. From here things get simple and the Shell is able to access the notification by first creating a WebContentsObserver and attaching this to the WebContentsImpl. You can have as many observers as you want per WebContentsImpl and it will simply multi-cast the notification. A couple of quick searches on the Chromium Code Search should find you the relevant details if you want to dig in further.

Escaping the Renderer Process

We are now back to those hieroglyphic scrawls from before!

On this side of the pipeline we are looking at RenderFrameImpl::didChangeName. Note how the casing and structure start to change as we get into the content process. This is important since it means that understanding the model of how things are structured is more important than understanding just a particular name or naming scheme. When naming schemes change, code search becomes less useful since you have to perform many smaller searches to chase a thread of thought to its end.

On this end, the code takes in the two name strings and formats them into a FrameHostMsg_DidChangeName structure that it can then Send to the Browser process. We get notified of these changes because it turns out our RenderFrame/RenderFrameImpl is also a WebFrameClient. If you think about the delegate capabilities from before, then this WebFrameClient must be similar. Our RenderFrameImpl is actually a delegate that can provide a WebFrame (or WebLocalFrame) with additional services, one of which is listening to a frame name change notification.

This is where I think things get tricky, but the architectural layering is pretty good. Things that start with the Web* prefix are going to be in third_party/WebKit code. All of the public interfaces for crossing from WebKit (Blink) -> Renderer (or whatever Content host you want to implement) will be in third_party/WebKit/public. This includes our WebFrameClient.h where we would add any new capabilities or notifications.

For this next part we stay in the WebKit (Blink) layer, but it starts to cross layering boundaries within Blink itself. At some point we'll cross a boundary that I called Modules (represented as both core and modules in the source tree) in the above diagram. Think of this simply as any sub-component that is logically being snapped in. Something like the Gamepad API might be a good example. Gamepad doesn't need to know everything about Blink to work. Instead it needs to bind in at key locations, and those should be layered to allow proper extensibility. How that binding works is another blog post entirely, but suffice it to say it's pretty cool.

Let's quickly jump through the rest of this since it is mostly plumbing. Now that we've seen how our Renderer code which is hosting Blink can be a client, we simply need to figure out how to poke all the way out of the Modules and WebKit (Blink) layers to invoke the client itself.

Most objects will have access to the Document and so that is how I would normally start. First get the LocalFrame using the frame() method. This returns a local m_frame and in context you should probably be able to guarantee its existence.

Next Steps

Note, don't trust that this just works, dig in, or even just add a simple notification to the code and get it to compile. Understanding the structure of these objects is very important not just now, but for future work. One cool thing that you find is there are Remote* versions of the window, document and frame. This is because of out-of-proc iframes I think. Those features help explain why some of the complexity around remote/local things needs to exist. It also details why many of these operations are only on local interfaces and that a listener for a remote frame, document, etc... would be listening in another process on the local version of the interface. Its representation in this document is only structural.

If anyone has a complete commit, pull request, etc... implementing a new notification that can be shared let me know. I'll add it to the post. The process is no joke for sure and you end up touching many files to have both the header default implementation (making it not quite an interface, but handy since you can add new methods without breaking everyone) and a client implementation.

Sunday, September 4, 2016

Ever wondered how Chromium implemented Gamepads with all of their odd-ball characteristics like different key mappings, bugs and sometimes really long polling times? Well here goes...

Chromium is broken into at least 3 different chunks for Gamepad input. The first chunk is the service in the browser which surfaces Gamepad data. The browser has access to system-level devices like Gamepads while the renderer or content process doesn't. This means at least one process jump is necessary to push the data down.

The second chunk is the renderer itself. This chunk has to read data, preferably in a non-blocking and fast way, and then provide the data to the JavaScript APIs. We know that this comes through in 2 different ways. First, a Gamepad can be provided through the gamepadconnected and gamepaddisconnected events. They come through on the event object. The second way to get a gamepad is to call navigator.getGamepads() and retrieve the array of them. This may not return anything useful if the user hasn't interacted with the Gamepad at least once though. This is to ensure that a website can't use the Gamepad as another way to fingerprint you (find out information specific to you that can be used across navigations to reassociate your session even across different sites).

The third chunk is the Gamepad backends themselves, known as providers. These connect to the actual device or system-level APIs and might take some time depending on the API and drivers involved.

So this gives us our latency options. We want the Gamepad to be up to date and for this we have to poll. Chromium implements either a 16ms or 10ms polling rate depending on which branch you are in. This polling rate only affects how quickly the Browser process has the latest data available and it still needs to travel back to the renderer process.

There is a lot of data it turns out. You can have up to 4 gamepads connected (kind of arbitrary, but good enough for now) though we'll discuss later why this is soon to be insufficient. And each gamepad can have tons of buttons and axes that need to be reported. This gets even more interesting with matrices for location and orientation if we start to discuss motion controllers.

Passing all of this data down through an IPC channel would create latency. It's also a lot of data, so it might even be a bottleneck in the IPC system if there are "many listeners". The push model isn't a good idea here, so could we do better? We could try to pull the data only when the page asks for it. That might be better, but that too might be happening at a high rate, even if the data hasn't "changed". You may also have to do a blocking IPC in this case since you are requesting a response. It could be async, but then you get even more latency and your data is always from the "last frame" instead of the current frame.

Shared memory works great here, and this is how Chromium has implemented a more optimal behavior. With shared memory we can lock and update/read as we see fit. This could create some problems (anytime you take a lock bad things CAN happen), but would they? Let's go back and think about why we moved away from IPC: it was because there might be too many consumers of the data, in this case readers, and thus if the readers are always locking, they may never allow a write. We have to do even better.

What guarantees do we need? First, we need to guarantee that the writer can write when it wants to without too much blocking, and that the readers can read without too much blocking. But more importantly, the readers must not get "corrupted" data. They need a full frame of data, and the data is pretty large, maybe as large as 1 KB, so we need to ensure that we got all of it and the writer didn't intercede in any way.

Turns out we can lock for the writer and optimistically read for the reader and then do a post validation that our read was "complete". This is implemented in Chromium as a OneWriterSeqLock though comments and a TODO indicate there are other options. I won't go into those other options since I've not read that far into the code yet. So how does this guy work?

We wrap this structure into a GamepadHardwareBuffer (lock and underlying data) and wrap that again with a GamepadSharedBufferImpl (which wraps the shared memory handle). We then synchronize on the lock object through operations such as WriteBegin and WriteEnd, which are just forwards to the underlying lock implementation. Every time the writer starts a write, it increments a version counter; an odd value means a write is in progress and thus our readers should wait. When the writer is done, it increments the same counter; an even value means the write is done. To ensure concurrency an atomic type is used, so the reads/writes to the version are synchronized. You can see this all happening in the GamepadProvider class.

The read operation then comes in two parts. First a ReadBegin to retrieve the version in synchronized memory. The possible return values are either an odd value or an even value. If odd, then the ReadBegin yields in some manner and waits until the writer is done. This assumes the writer is pretty fast and that it isn't writing at an insane refresh rate.

If even, then the ReadBegin returns the value and that is our loop version. We then read the data in shared memory. This should either be the memory associated with our current version or some future version. Remember, we aren't locking and so the writer could have incremented the version and begun writing to the memory after we started our copy.

Once our copy is done, we do a ReadRetry and pass in our value. We read the current value from the shared atomic, compare to our current version, and make sure they match. If they do, then it means the values we read from shared memory were those that were committed and matched our version number. If the version number has since changed, then it means our read may have data from more than one version and we should discard it and try to read again. This is all implemented in the GamepadSharedMemoryReader class.

That wraps up our interrogation of the Gamepad API and how it shares data. There are no limitations on the number of clients in this case. Any client can wake up at any time and decide to read, and the writer can always increment the version. There is very little contention other than the memory ordering requirements induced by the use of the atomic types. Note that these atomic types are implemented under the covers by Chromium and are not simply std::atomic.

So, is this interesting? Are Chromium brain dumps from a passerby like myself helpful to anyone else?