Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This extension specification defines a new media type and constrainable property per the Extensibility guidelines of the Media Capture and Streams specification [GETUSERMEDIA]. Horizontal reviews and feedback from early implementations of this specification are encouraged.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction

Depth cameras are increasingly being integrated into devices such as phones, tablets, and laptops. Depth cameras provide a depth map, which conveys the distance information between points on an object's surface and the camera. With depth information, web content and applications can be enhanced by, for example, the use of hand gestures as an input mechanism, or by creating 3D models of real-world objects that can interact and integrate with the web platform. Concrete applications of this technology include more immersive gaming experiences, more accessible 3D video conferences, and augmented reality, to name a few.

This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [WEBIDL], as this specification uses that specification and terminology.

The term depth stream track means a MediaStreamTrack object whose videoKind of Settings is "depth". It represents a media stream track whose source is a depth camera.

The term color stream track means a MediaStreamTrack object whose videoKind of Settings is "color". It represents a media stream track whose source is a color camera.

5.1 Depth map

A depth map is an abstract representation of a frame of a depth stream track. A depth map is a two-dimensional array that contains information relating to the perpendicular distance between the surfaces of scene objects and the camera's near plane. The numeric values in the depth map are referred to as depth map values; they represent distances to the near plane, normalized against the distance between the far and near planes.

A depth map has an associated near value, which is a double. It represents the minimum range in meters and defines the near plane, a plane perpendicular to the camera's viewing direction at distance near value from the camera origin.

A depth map has an associated far value, which is a double. It represents the maximum range in meters and defines the far plane, a plane perpendicular to the camera's viewing direction at distance far value from the camera origin.
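
For illustration, the relationship between a depth map value and a metric distance follows directly from the definitions above. The sketch below is not part of the specification; it assumes a depth map value already normalized to the [0, 1] range.

    // Recover a metric distance (in meters) from a normalized depth map
    // value, given the near and far values defined above. A value of 0
    // maps to the near plane, a value of 1 maps to the far plane.
    function depthMapValueToMeters(value, near, far) {
      return near + value * (far - near);
    }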

A depth map has an associated horizontal focal length, which is a double. It represents the horizontal focal length of the depth camera, in pixels.

A depth map has an associated vertical focal length, which is a double. It represents the vertical focal length of the depth camera, in pixels.

A depth map has an associated principal point, specified by principal point x and principal point y coordinates, which are doubles. The principal point is a concept defined in the pinhole camera model: the projection of the perspective center onto the image plane.
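
As a rough illustration of how these intrinsics are used, a depth map pixel can be deprojected to a 3D point in the depth camera's coordinate system. This is a minimal sketch, not part of the specification, and it ignores lens distortion.

    // Back-project a depth map pixel (u, v) with metric depth d to a 3D
    // point in the depth camera's 3D coordinate system, using the pinhole
    // camera model described above.
    function deprojectDepthPixel(u, v, d, focalX, focalY, principalX, principalY) {
      const x = (u - principalX) * d / focalX;
      const y = (v - principalY) * d / focalY;
      return [x, y, d];
    }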

A depth map has an associated transformation from depth to video, which is a transformation matrix represented by a Transformation dictionary. It is used to translate a position in the depth camera's 3D coordinate system to the 3D coordinate system of the RGB video stream's camera (identified by videoDeviceId). After projecting depth 2D pixel coordinates to 3D space, we use this matrix to transform the depth camera 3D space coordinates to the RGB video camera 3D space.

Both depth and color cameras usually introduce significant distortion caused by the camera and lens. While in some cases the effects are not noticeable, these distortions cause errors in image analysis. To map depth map pixel values to corresponding color video track pixels, we use two DistortionCoefficients dictionaries: deprojection distortion coefficients and projection distortion coefficients.

7.4.2 Transformation dictionary

The Transformation dictionary has the transformationMatrix dictionary member, a 16-element array that defines the transformation matrix from the depth map camera's 3D coordinate system to the video track camera's 3D coordinate system.

The first four elements of the array correspond to the first matrix row, followed by the four elements of the second matrix row, and so on. It is in a format suitable for use with WebGL's uniformMatrix4fv.
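
For example, a sketch of uploading the matrix to a shader uniform; the program object and the uniform name u_depthToColor are hypothetical.

    // transformation is a Transformation dictionary obtained from the
    // depth track. The transpose argument is false here (WebGL 1 requires
    // it to be false).
    const location = gl.getUniformLocation(program, "u_depthToColor");
    gl.uniformMatrix4fv(location, false,
                        new Float32Array(transformation.transformationMatrix));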

The videoDeviceId dictionary member represents the deviceId of the video camera with which the depth stream must be synchronized.

Note

The value of videoDeviceId can be used as the deviceId constraint in [GETUSERMEDIA] to get the corresponding video and audio streams.
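
For example (a minimal sketch; transformation is assumed to be a Transformation dictionary obtained from the depth track):

    // Request the color video stream that corresponds to the depth stream,
    // using videoDeviceId as the deviceId constraint.
    const colorStream = await navigator.mediaDevices.getUserMedia({
      video: { deviceId: { exact: transformation.videoDeviceId } }
    });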

A color stream track and a depth stream track can be combined into one depth+color stream. The rendering of the two tracks is intended to be synchronized, the resolution of the two tracks is intended to be the same, and the coordinate systems of the two tracks are intended to be calibrated. These are not hard requirements, since it might not be possible to synchronize tracks from separate sources.

This approach is simple to use but comes with the following caveats: it might not be supported by the implementation, and because the resolutions of the two tracks are intended to be the same, downsampling may be required, which can degrade quality. The alternative approach is for a web developer to implement the algorithm to map depth pixels to color pixels. See the 3D point cloud rendering example code.

The depthNear and depthFar constrainable properties, when set, allow the implementation to pick the best depth camera mode optimized for the range [depthNear, depthFar] and help minimize the error introduced by the lossy conversion from the depth value d to the quantized value d8bit and back to an approximation of the depth value d.
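
For example, a minimal sketch of setting these properties (support for videoKind, depthNear and depthFar depends on the implementation; the numeric values are arbitrary):

    // Request a depth track optimized for the 0.5 m to 4 m range. With
    // 8-bit quantization, narrowing [depthNear, depthFar] reduces the
    // quantization step, roughly (depthFar - depthNear) / 255, and hence
    // the conversion error.
    const stream = await navigator.mediaDevices.getUserMedia({
      video: { videoKind: { exact: "depth" }, depthNear: 0.5, depthFar: 4.0 }
    });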

8. Synchronizing depth and color video rendering

This section is non-normative.

Note

The algorithms presented in this section explain how a web developer can map depth and color pixels. A concrete example of how to do the mapping is provided in the example vertex shader used for 3D point cloud rendering.

8.2 Transform from depth to color 3D space

The result of the project depth value to 3D point step, the 3D point (Xd, Yd, Zd), is in the depth camera's 3D coordinate system. To obtain the coordinates of the same point in the color camera's 3D coordinate system, we multiply the transformation from depth to video matrix by the (Xd, Yd, Zd) point, extended to homogeneous coordinates.
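
A minimal sketch of this step in JavaScript, assuming the row-major transformationMatrix layout described in section 7.4.2:

    // Transform a 3D point from the depth camera's coordinate system to
    // the color camera's coordinate system. m is the 16-element, row-major
    // transformationMatrix; the point is extended with w = 1.
    function transformDepthToColor(m, point) {
      const [x, y, z] = point;
      return [
        m[0] * x + m[1] * y + m[2]  * z + m[3],
        m[4] * x + m[5] * y + m[6]  * z + m[7],
        m[8] * x + m[9] * y + m[10] * z + m[11]
      ];
    }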

WebGL: readPixels from float texture

This code creates the texture to which we will upload the depth video frame. Then, it sets up a named framebuffer, attaches the texture as a color attachment and, after uploading the depth video to the texture, reads the texture content into a Float32Array.

Use gl.getParameter(gl.IMPLEMENTATION_COLOR_READ_FORMAT) to check whether readPixels to gl.RED or gl.RGBA float is supported.
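
A minimal sketch of this setup follows. A WebGL 2 context is assumed, and gl, depthVideo (an HTMLVideoElement), width and height are assumed to exist; the specification's actual example code may differ.

    // Rendering to and reading from a float texture requires this extension.
    gl.getExtension("EXT_color_buffer_float");

    // Create the texture that will hold the depth video frame.
    const texture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);

    // Set up a named framebuffer and attach the texture as color attachment.
    const framebuffer = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, texture, 0);

    // Upload the depth video frame, then read the texture content back,
    // using the format the implementation supports for readPixels.
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.R32F, width, height, 0,
                  gl.RED, gl.FLOAT, depthVideo);
    const format = gl.getParameter(gl.IMPLEMENTATION_COLOR_READ_FORMAT);
    const channels = (format === gl.RED) ? 1 : 4;
    const data = new Float32Array(width * height * channels);
    gl.readPixels(0, 0, width, height, format, gl.FLOAT, data);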

WebGL Vertex Shader that implements mapping color and depth

This vertex shader is used for 3D point cloud rendering. The code shows how a web developer can implement the algorithm to map depth pixels to color pixels. The draw call used is gl.drawArrays(gl.POINTS, 0, depthMap.width * depthMap.height). The shader outputs the 3D position of each vertex (gl_Position) and color texture sampling coordinates per vertex.
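
A condensed sketch of such a shader is given below, embedded as a JavaScript string. All uniform and attribute names are hypothetical, lens distortion is ignored, and the projection into color texture coordinates is simplified; the full example linked above shows the complete mapping.

    const vertexShaderSource = `
      precision mediump float;
      attribute vec2 a_depthPixel;       // (u, v) pixel position in the depth map
      uniform sampler2D u_depthTexture;  // normalized depth map values
      uniform vec2 u_depthSize;          // depth map width and height
      uniform float u_near, u_far;       // near and far values, in meters
      uniform vec2 u_focal;              // horizontal and vertical focal lengths
      uniform vec2 u_principal;          // principal point x and y
      uniform mat4 u_depthToColor;       // transformation from depth to video
      uniform mat4 u_mvp;                // scene model-view-projection matrix
      varying vec2 v_colorTexCoord;      // color texture sampling coordinate

      void main() {
        // Sample the depth map and denormalize to a metric depth.
        float value = texture2D(u_depthTexture, a_depthPixel / u_depthSize).r;
        float d = u_near + value * (u_far - u_near);
        // Deproject the pixel to a 3D point in depth camera space.
        vec3 point = vec3((a_depthPixel - u_principal) * d / u_focal, d);
        // Transform the point to the color camera's 3D coordinate system
        // and project it to (simplified) color texture coordinates.
        vec4 colorPoint = u_depthToColor * vec4(point, 1.0);
        v_colorTexCoord = colorPoint.xy / colorPoint.z;
        gl_Position = u_mvp * vec4(point, 1.0);
      }
    `;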

10. Privacy and security considerations

A. Acknowledgements

Thanks to everyone who contributed to the Use Cases and Requirements, and sent feedback and comments. Special thanks to Ningxin Hu for experimental implementations, as well as to the Project Tango team for their experiments.