How HTML5 remote desktop clients work

Earlier this week, I wrote about my initial thoughts on the Chromebook, and I talked a little bit about HTML5 remote desktop clients, specifically AccessNow from Ericom. In the comments, we also heard from the creator of Spark View, Walter Wang. Walter's comment, plus a subsequent phone call with Ericom, helped to shed some light on exactly how Chrome (and other HTML5-compliant browsers, which is all of the big ones now, I think) uses HTML5 technologies to show remote desktops. In that article, I speculated that Ericom was somehow wrapping RDP and shipping it to the client. It turns out that what is actually happening is a bit more complex, and it involves translating RDP data for consumption by the browser. Before I get too far ahead, though, let's break this down.

There are two key technologies that enable remote desktop clients within a browser: WebSockets and Canvas. WebSockets is how the remote desktop data is sent from your environment to the browser, and Canvas is the technology that allows it to be redrawn on the screen.

WebSockets is a protocol/API built into all the recent browsers that allows for continuous, two-way transmission of data over a single TCP socket. HTTP, by contrast, pairs every request with a response, so a steady stream of updates means a steady stream of new requests (and often new connections), which is complex and inefficient for anything that needs to have a realtime feel to it. WebSockets changes this by essentially opening a channel between the client and the server that remains open between requests. The main drawback of WebSockets, as browsers implement it today, is that it only supports textual data, not binary data (which is what remote protocols use), which we'll get into later.
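To make the text-versus-binary problem concrete, here is a minimal Python sketch (Python is used only for illustration; it is not what any of these products run). It assumes Base64 as the text encoding, which is one common choice and not necessarily what Ericom or Spark View use, and the sample bytes are made up.

```python
import base64

# Made-up bytes standing in for a chunk of binary remote-desktop data.
payload = bytes([0x03, 0x00, 0x00, 0x0B, 0xFF, 0xFE, 0x80, 0x00])

def is_valid_text(data: bytes) -> bool:
    """Return True if the bytes could be sent unchanged as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def to_text_frame(data: bytes) -> str:
    """Re-encode arbitrary bytes as ASCII text (Base64 is an assumption)."""
    return base64.b64encode(data).decode("ascii")

def from_text_frame(text: str) -> bytes:
    """Recover the original bytes on the receiving side."""
    return base64.b64decode(text)

frame = to_text_frame(payload)
print(is_valid_text(payload))             # can the raw bytes travel as text?
print(frame)                              # the text-safe version
print(from_text_frame(frame) == payload)  # round trip check
print(len(frame) / len(payload))          # size growth from the encoding
```

Base64 emits four characters for every three input bytes, so a text-only channel costs roughly a third more bandwidth before any other overhead is counted.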

Canvas was created by Apple way back in 2004, and has grown into being a native HTML5 element. Canvas makes it possible to control every single pixel discretely through JavaScript, which allows the browser to render 2D graphics dynamically. When you see animations or games that play in the browser and don't use Flash (i.e. HTML5 games like Angry Birds for Chrome), you're seeing Canvas in action. For remote desktop connections, the client (in this case, mostly a JavaScript program) consumes the data coming in via WebSockets and draws the desktop on the screen via Canvas.

Right now you may be thinking "Canvas...no binary data support...that's not RDP at all," which is absolutely correct. But if what you're using at the client isn't RDP, then how is this working? The secret is a gateway of sorts. Ericom calls this AccessNow Server (which is really just a lightweight service), and Spark View calls it a Spark Gateway. In both cases, these gateways establish an RDP session with the remote host and translate (or re-encode) that binary data into textual data for use with WebSockets. That text data is sent on to the browser, where the client interprets it and draws it on the screen with Canvas.
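The gateway's role can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `gateway_relay`, `browser_receive` and `CHUNK_SIZE` are hypothetical names, Base64 stands in for whatever text encoding a real gateway uses, and an in-memory buffer stands in for the RDP connection to the remote host.

```python
import base64
import io
from typing import Iterator

CHUNK_SIZE = 4096  # hypothetical read size, chosen only for the example

def gateway_relay(rdp_stream: io.BufferedIOBase) -> Iterator[str]:
    """Read binary data from the RDP side and yield text messages.

    In a real gateway each yielded string would be sent to the browser
    as one WebSocket text message; here we simply yield it.
    """
    while True:
        chunk = rdp_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        yield base64.b64encode(chunk).decode("ascii")

def browser_receive(messages: Iterator[str]) -> bytes:
    """Stand-in for the JavaScript client: decode each text message."""
    return b"".join(base64.b64decode(m) for m in messages)

# io.BytesIO stands in for the TCP connection to the remote host.
fake_rdp_session = io.BytesIO(bytes(range(256)) * 40)
received = browser_receive(gateway_relay(fake_rdp_session))
print(len(received))
```

A real client would go on to interpret the decoded bytes and draw them on Canvas; the sketch stops at the point where the browser has the original data back.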

The entire process looks something like this:

Ericom has also introduced a version of AccessNow that works with VMware View. There's an added step that involves hosting the web client on a View server so that it can take advantage of the View Open Client, which handles authentication and desktop selection before handing the connection off to the AccessNow Server (remember, that's more of a service than a server). Ultimately, they view this as a way to expand endpoint support for VMware View to anything with an HTML5-compliant browser, which will level the playing field with Citrix when it comes to the number of client devices supported by the platform.

At this point, AccessNow does not support virtual keyboards like what you would find on iOS or Android devices. It appears that only Spark View supports those types of devices today, although I haven't had a chance to actually look at the product yet. We know Walter reads this blog, though, so maybe he can comment :) Ericom has said that they are close to providing it; they just want to make sure they get it right before releasing the next version.

Since the Citrix HTML5 client hasn't been released yet, I'm not sure how it works. I imagine it has the same basic architecture, though, while utilizing some of Citrix's existing components (web interface, connection broker, NetScaler, etc.). It's my plan to do an HTML5 remote desktop client roundup when Citrix releases theirs, but if that winds up being too far out, I'll do it without them. It's all so new, though, it seems only fair to give it a little more time.

Based on his comment, Walter believes that WebSockets will be amended to include binary data at some point in the future, which may or may not eliminate the need for a gateway in the middle. There's not much doubt, though, that this article will be obsolete in the near future as more advances are made with HTML5 and remote desktop connections. Call it "Job Security," I guess :)

Great write-up. I'm curious too as to how Citrix will do their HTML5 client, and whether there is now a dependency on, and/or a session running through, this Receiver for Web component that will be running on a Windows server. That could be a big differentiator between devices with native clients and those with HTML5 clients.

It's my understanding that a Windows server does not need to be in the path, just a service that can re-encode RDP as text data for WebSockets.

It remains to be seen how the Citrix client will work, but if it is using the same architecture, they'll be subject to the same limitations that Simon mentioned. As implemented currently, WebSockets/Canvas will strip all the features out of a protocol during re-encoding.

To me that means there isn't going to be much of a market for HTML5-based access to Windows apps unless/until a new solution is proposed.

I guess the purpose of using WebSocket instead of raw sockets is safety: it would be dangerous if a webpage could connect to a local socket server. That means the browser cannot do the conversion for you, so the only option left is to do it on the server side. Citrix already has this kind of server, so it’s easy for them to add this. But as Simon said, although VMware could do that too, it would lose the benefit of UDP. And PCoIP is a LAN-based protocol, so its performance when converted to WebSocket would be slow compared with RDP. Those are only my thoughts and may be wrong. I guess we won’t see VMware’s HTML client very soon, especially since VMware never seems to pay attention to the client side.

There are two ways of converting for now. One is converting to another protocol, as Ericom did; I know some implementations just convert any protocol to PNG files and draw them on Canvas. The other is what Spark View does: just convert the binary data to text and process it on Canvas in the original way. Obviously I prefer the latter. Creating a perfect protocol is too complicated for me.

To clarify, Ericom AccessNow does NOT convert RDP into one big PNG for drawing on canvas. Rather, only bitmaps transmitted through RDP are converted to PNGs or JPGs. The other RDP instructions, such as draw line or fill area with pattern, are transmitted and executed by the client as-is. This approach, in our opinion, provides the best of all worlds.

Converting bitmaps to PNGs and JPGs has two main advantages:

1. PNGs and especially JPGs are much smaller than the original bitmaps (convert a .bmp file into a .jpg file to see what I mean). Thus they consume much less bandwidth when transmitted over the wire. This can be very important in WAN scenarios.

2. Browsers have built-in support for PNGs and JPGs, so they can be drawn onto a Canvas using a single, efficient instruction. RDP bitmaps, on the other hand, need to be drawn onto the canvas pixel-by-pixel by iterating over them in JavaScript.
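The size argument in point 1 can be tried out with nothing but the Python standard library. The sketch below is not AccessNow's code: `encode_png` is a hypothetical, minimal PNG writer (8-bit RGB, no filtering), and the "desktop" is synthetic, a grey rectangle on a solid background, which is close to a best case. It also compares against uncompressed pixels, not against RDP's own bitmap compression, so treat the result as an illustration of the idea only.

```python
import struct
import zlib

def png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def encode_png(width: int, height: int, rgb: bytes) -> bytes:
    """Encode raw 24-bit RGB pixels as a minimal, unfiltered PNG."""
    if len(rgb) != width * height * 3:
        raise ValueError("pixel data does not match the given dimensions")
    stride = width * 3
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then zeros.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(raw, 9))
        + png_chunk(b"IEND", b"")
    )

# Synthetic "desktop": a grey window on a solid blue background.
WIDTH, HEIGHT = 640, 480
pixels = bytearray(b"\x00\x50\xa0" * (WIDTH * HEIGHT))
for y in range(100, 300):
    start = (y * WIDTH + 100) * 3
    pixels[start:start + 300 * 3] = b"\xc0\xc0\xc0" * 300

raw_size = len(pixels)
png_size = len(encode_png(WIDTH, HEIGHT, bytes(pixels)))
print(raw_size, png_size)
```

Photographic or noisy content compresses far less than flat desktop colours, which is where a lossy format like JPG tends to help more.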

The end result is that despite the overhead of WebSocket, Ericom AccessNow is often more efficient and performs better than the original RDP.

I actually wanted to say that some HTML5 VNC implementations are using the PNG way. I haven’t had a chance to check AccessNow’s implementation. I only knew from Gabe’s previous article that it converts to another protocol. I’m sorry if I caused you or others to misunderstand.

From what I know, it’s hard to say which way is better until you have run a full test. Everyone knows native code is faster, but my Android client (pure Java) is obviously faster than some native implementations (I don’t want to mention their names here), because invoking native code from Java is very expensive. Also, a lot of people still think JavaScript is slow. So when it comes to the PNG way, it’s also uncertain until you test it.

For example, the process of the PNG way is like this:

1. Decompress the compressed bitmap which is from RDP.

2. Compress it to PNG.

3. Transfer to client.

4. Let browser draw PNG.

5. Browser decompresses the PNG to bitmap.

6. Browser draws the bitmap.

The process of the original RDP way:

1. Transfer compressed RDP to client.

2. Decompress it on the client.

3. Draw bitmap pixels directly on Canvas.
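Step 3 of the original RDP way is where the per-pixel work happens. In the browser this would be a JavaScript loop writing into the Canvas pixel buffer; the sketch below shows the same kind of loop in Python, purely for illustration. It assumes the decompressed bitmap is 16-bit RGB565 in little-endian order (one colour depth RDP can use) and that the target is an RGBA buffer with four bytes per pixel, which is the layout Canvas image data uses. The function name is hypothetical.

```python
import struct

def rgb565_to_rgba(bitmap: bytes) -> bytearray:
    """Expand little-endian RGB565 pixels into an RGBA byte buffer."""
    if len(bitmap) % 2:
        raise ValueError("RGB565 data must be a whole number of 16-bit pixels")
    rgba = bytearray((len(bitmap) // 2) * 4)
    for i, (value,) in enumerate(struct.iter_unpack("<H", bitmap)):
        r5 = (value >> 11) & 0x1F
        g6 = (value >> 5) & 0x3F
        b5 = value & 0x1F
        # Scale the 5- and 6-bit channels up to the full 0-255 range.
        rgba[i * 4 + 0] = (r5 << 3) | (r5 >> 2)
        rgba[i * 4 + 1] = (g6 << 2) | (g6 >> 4)
        rgba[i * 4 + 2] = (b5 << 3) | (b5 >> 2)
        rgba[i * 4 + 3] = 255  # fully opaque
    return rgba

# Three sample pixels: pure red, pure green, pure blue.
sample = struct.pack("<3H", 0xF800, 0x07E0, 0x001F)
print(list(rgb565_to_rgba(sample)))
```

Every pixel costs a handful of shifts and four writes, so whether this beats handing the browser a ready-made PNG depends on bitmap size and how fast the script engine is, which is why a measurement is needed to settle it.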

Please correct me if there is something wrong in my description. I’m not sure which way is better before a test (I may do a test if I get time).
