Terms of use

This dataset is provided for research purposes only and without any warranty.
Any commercial use is prohibited.
If you use the dataset or parts of it in your research, please cite the corresponding paper.

Overview

The collection contains more than 39000 stereo frames at 960x540 pixel resolution, rendered from various synthetic sequences. For details on the characteristics and differences of the three subsets, we refer the reader to our paper. The following kinds of data are currently available:


RGB stereo renderings:
Rendered images are available in cleanpass and finalpass versions (the latter with more realistic—but also more difficult—effects such as motion blur and depth of field). Both versions can be downloaded as lossless PNG or high-quality lossy WebP images.

Segmentations:
Object-level and material-level segmentation images.

Optical flow maps:
The optical flow describes how pixels move between images (here, between time steps in a sequence).
It is the projected screenspace component of full scene flow, and used in many computer vision applications.

Disparity maps:
Disparity here describes how pixels move between the two views of a stereo frame.
It is a formulation of depth which is independent of camera intrinsics (although it depends on the configuration of the stereo rig), and can be seen as a special case of optical flow.

Disparity change maps:
Disparity alone is only valid for a single stereo frame.
In image sequences, pixel disparities change with time.
This disparity change data fills the gaps in scene flow that occur when one uses only optical flow and static disparity.
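The way optical flow, disparity, and disparity change combine into per-pixel correspondences can be sketched as follows; the function name and array shapes are illustrative assumptions, not part of the dataset's file layout:

```python
import numpy as np

def scene_flow_correspondence(flow, disparity, disparity_change):
    """Given optical flow (H, W, 2), disparity (H, W) and disparity
    change (H, W) for the left view, return where each pixel lands in
    the next frame and its disparity there."""
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Pixel position at the next time step (origin: upper-left corner).
    x_next = xs + flow[..., 0]
    y_next = ys + flow[..., 1]
    # Disparity of the same surface point in the next stereo frame.
    d_next = disparity + disparity_change
    return x_next, y_next, d_next
```

Together, the three outputs locate the same surface point in both views of the next stereo frame, which is exactly the gap that flow and static disparity alone leave open.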

Motion boundaries:
Motion boundaries divide an image into regions with significantly different motion.
They can be used to better judge the performance of an algorithm at discontinuities.


Raw 3D data:
The original source from which we derive disparities, disparity changes, optical flows, and (by extension) motion boundaries. Due to their extremely unwieldy size, and because they offer relatively little that is not covered by the other data, we make these packs available only on request.

Camera data:
Full intrinsic and extrinsic camera data is available for each view of every stereo frame in our dataset collection.

Downloads

Example pack

Want to get your feet wet? Wish to check out our data without downloading gigabytes of archives first? Then get our sample pack!

Contains three consecutive frames from each dataset, in full resolution!

Includes samples from all available data: RGB, disparity, flow, segmentation...

Bold sizes indicate that a compressed archive expands to a much larger size (more than 100GB larger, or an expansion factor greater than 10).
For batch downloads, here is a list of URLs.
MD5 checksums for all files can be found here.

DispNet/FlowNet2.0 dataset subsets

For our network training and testing in the DispNet, FlowNet2.0 etc. papers, we omitted some extremely hard samples from the FlyingThings3D dataset. Here you can download these subsets for the modalities which we used:


Data formats and organization

Use bunzip2 to decompress .tar.bz2 files, then use "tar xf <file.tar>" to unpack the resulting .tar archives.
Caution: some archives expand to massively larger sizes.

The RGB image packs are available in both cleanpass and finalpass settings.
The cleanpass setting includes lighting and shading effects, but no additional effects.
In contrast, finalpass images also contain motion blur and defocus blur.
All RGB images are provided as both lossless PNG and lossy WebP (used in our experiments).
WebP images are compressed at a quality setting of 95%, using the publicly available source code (version 0.5.0). WebP files are 80-90% smaller than PNG, with virtually indistinguishable results.

The virtual imaging sensor has a size of 32.0mm x 18.0mm.
Most scenes use a virtual focal length of 35.0mm.
For those scenes, the virtual camera intrinsics matrix is given by

fx   0.0  cx        1050.0     0.0  479.5
0.0  fy   cy    =      0.0  1050.0  269.5
0.0  0.0  1.0          0.0     0.0    1.0

where (fx,fy) are focal lengths and (cx,cy) denotes the principal point.
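The 35.0mm intrinsics above can be written out directly as a NumPy array; the projected point below is a made-up example, using the positive focal length values as listed:

```python
import numpy as np

# Intrinsics for the 35.0mm scenes: fx = fy = 1050.0, cx = 479.5, cy = 269.5.
K = np.array([[1050.0,    0.0, 479.5],
              [   0.0, 1050.0, 269.5],
              [   0.0,    0.0,   1.0]])

def project(p_cam):
    """Project a camera-space 3-D point to pixel coordinates."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

# A hypothetical point on the optical axis, 10 units in front of the camera,
# projects onto the principal point:
print(project(np.array([0.0, 0.0, 10.0])))  # -> [479.5 269.5]
```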

Some scenes in the Driving subset use a virtual focal length of 15.0mm (the directory structure describes this clearly).
For those scenes, the intrinsics matrix is given by

fx   0.0  cx        450.0    0.0  479.5
0.0  fy   cy    =     0.0  450.0  269.5
0.0  0.0  1.0         0.0    0.0    1.0

Please note that due to Blender's coordinate system convention (see below), the focal length values (fx,fy) should really be negative numbers.
We list the positive numbers here because in practice this caveat only matters when working with the raw 3D data.

All data comes in a stereo setting, i.e. there are "left" and "right" subfolders for everything. The one exception to this rule is the camera data, where everything is stored in a single (small) text file per scene.

Camera extrinsics data is stored as follows:
Each camera_data.txt file contains the following entry for each frame of its scene:

...

Frame <frame_id>\n

frame_id is the frame index.
All images and data files for this frame use this index as their name, written as a four-digit number padded with leading zeroes.

L T00 T01 T02 T03 T10 ... T33\n

Camera-to-world 4x4 matrix for the left view of the stereo pair in row-major order, i.e. (T00 T01 T02 T03) encodes the uppermost row from left to right.

R T00 T01 T02 T03 T10 ... T33\n

Ditto for the right view of the stereo pair.

\n

(an empty line)

Frame <frame_id>\n

(the next frame's index)

...

(and so on)

The camera-to-world matrices T encode a transformation from camera-space to world-space, i.e. multiplying a camera-space position column vector p_cam with T yields a world-space position column vector p_world = T*p_cam.
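A minimal parser for the layout described above could look as follows; the function name is ours, and the exact whitespace handling is an assumption based on the description:

```python
import numpy as np

def read_camera_data(path):
    """Parse a camera_data.txt file into a dict mapping
    frame_id -> {'L': 4x4 ndarray, 'R': 4x4 ndarray},
    where each matrix is camera-to-world in row-major order."""
    frames = {}
    current = None
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue  # blank line between frames
            if tokens[0] == 'Frame':
                current = int(tokens[1])
                frames[current] = {}
            elif tokens[0] in ('L', 'R'):
                values = [float(v) for v in tokens[1:]]
                frames[current][tokens[0]] = np.array(values).reshape(4, 4)
    return frames
```

With a matrix T from this parser, a camera-space column vector p_cam maps to world space as T @ p_cam, as described above.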

The coordinate system is that of Blender: positive-X points to the right, positive-Y points upwards, positive-Z points "backwards", from the scene into the camera (left-hand rule with middle finger=x, index finger=y, thumb=z).

The right stereo view's camera is translated by 1.0 Blender units, with no rotation relative to the left view's camera.

The image origin (x=0,y=0) is located in the upper left corner, i.e. a flow vector of (x=10,y=10) points towards the lower right.

Non-RGB data is provided in either PFM (single channel or three channels) or PGM format, depending on value range and dimensionality.
While PFM is a defined standard (think "PGM/PPM for non-integer entries"), it is not widely supported.
For C++, we recommend the excellent CImg library. For Python+NumPy, see this code snippet.
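In case the linked snippet is unavailable, a simple reader following the usual PFM conventions might look like this; it is a sketch, not the authors' reference code:

```python
import re
import numpy as np

def read_pfm(path):
    """Read a PFM file into a NumPy float32 array.

    Returns an (H, W) array for 'Pf' (single-channel) files and an
    (H, W, 3) array for 'PF' (three-channel) files."""
    with open(path, 'rb') as f:
        header = f.readline().rstrip()
        if header == b'PF':
            channels = 3
        elif header == b'Pf':
            channels = 1
        else:
            raise ValueError('Not a PFM file: %r' % header)
        dims = re.match(rb'^(\d+)\s+(\d+)\s*$', f.readline())
        width, height = int(dims.group(1)), int(dims.group(2))
        # A negative scale factor means little-endian data.
        scale = float(f.readline().decode().strip())
        endian = '<' if scale < 0 else '>'
        data = np.frombuffer(f.read(), dtype=endian + 'f4')
    shape = (height, width, 3) if channels == 3 else (height, width)
    # PFM stores rows bottom-to-top, so flip vertically on load.
    return np.flipud(data.reshape(shape))
```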

Disparity is a single-channel PFM image.
Note that disparities for both views are stored as positive numbers.

For data which depends on the direction of time (optical flow, disparity change, motion boundaries), we provide both forward and backward versions.

Please note that the frame ranges differ between scenes and datasets:

FlyingThings3D: 6–15 for every scene

Driving: 1–300 or 1–800

Monkaa: Different for every scene

The FlyingThings3D dataset is split into "TEST" and "TRAIN" parts.
These two parts differ only in the assets used for rendering:
All textures and all 3D model categories are entirely disjoint.
However, both parts exhibit the same structure and characteristics.
The "TRAIN" part is 5 times larger than the "TEST" part.

Each of these parts is itself split into three subsets A, B, and C.
The same rendering asset pools were used for each subset, but the object and camera motion paths are generated with different parameter settings.
As a result, motion characteristics are not uniform across subsets.

We did not use the entire FlyingThings3D dataset for DispNet, FlowNet2.0 etc.: samples with extremely difficult data were omitted. See here for a list of images which we did not use.

Frequently asked questions (FAQ)

Q: I want to use depth data, but you only provide disparity!
A: Depth is perfectly equivalent to disparity as long as you know the focal length of the camera and the baseline of the stereo rig (both are given above). You can convert disparity to depth using this formula: depth = focal_length * baseline / disparity.
Note that the focal length in this equation is in pixels, not (milli)meters.
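In code, using the pixel focal lengths and the 1.0-unit baseline given above, the conversion is a one-liner; the resulting depth is in Blender units, which carry no metric meaning:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px=1050.0, baseline=1.0):
    """depth = focal_length * baseline / disparity.

    focal_length_px is in pixels (1050.0 for the 35.0mm scenes,
    450.0 for the 15.0mm Driving scenes); the baseline is 1.0
    Blender units, so depth comes out in Blender units."""
    return focal_length_px * baseline / disparity

print(disparity_to_depth(np.array([1050.0, 105.0])))  # -> [ 1. 10.]
```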

Q: I would like to render my own data. Can I get the Blender scene files?
A: We do not have the necessary licenses for all the assets that we use, and therefore cannot redistribute them.

Q: How do I get depth in meters? What is the metric scale of the dataset?
A: No metric scale exists for these datasets. The focal length is listed as "35mm", but this holds only within Blender; it cannot be translated into metric depth measures in the dataset.

Q: What is the disparity/depth range of the dataset?
A: There is no meaningful fixed range. If you normalize the values, expect extreme outliers.

Q: Are semantic segmentation labels / class names available?
A: No, these datasets have no notion of semantics. The labels are assigned randomly across objects and scenes, without any consistency guarantees, and are essentially meaningless.

Q: Where are the "0016.png" images?
A: The "0015.pfm" flow files in the "into_future" direction (and the "0006.pfm" files in the "into_past" direction) describe optical flows whose target images ("0016.png"/"0005.png") are not included in the dataset. If you want data for supervised flow training, please ignore these files.

Changelog

12 Oct 2018: Started adding torrent downloads

20 Aug 2018: Added downloads for the FlyingThings3D subset used in our papers