The license agreement for data usage implies the citation of the two papers above. Please note that citing the dataset URL instead of the publications is not compliant with this license agreement.

Subjects

The motions were performed by 11 professional actors, 6 male and 5 female,
chosen to span a body mass index (BMI) from 17 to 29. This provides a moderate
amount of body shape variability as well as different ranges of mobility. The
subjects wore their own regular clothing, as opposed to special motion capture
costumes, to maintain as much realism as possible. We use 7 subjects (3 female
and 4 male) for training and validation, and 4 subjects (2 female and 2 male)
for testing.

Scenarios

Directions

Discussion

Eating

Activities while seated

Greeting

Taking photo

Posing

Making purchases

Smoking

Waiting

Walking

Sitting on chair

Talking on the phone

Walking dog

Walking together

The dataset consists of 3.6 million different human poses collected with 4
digital cameras. Our data is organized into 15 training motions, containing
walking with many types of asymmetries (e.g. walking with a hand in a pocket,
walking with a bag on the shoulder), sitting and lying down poses, various
types of waiting poses, and other types of poses. The actors were given detailed
tasks with examples in order to help them plan a stable set of poses across the
repetitions used to create the training, validation and test sets. In executing
these tasks, however, the actors were given considerable freedom to move
naturally rather than follow a strict, rigid interpretation of the tasks.

Laboratory Setup

Our laboratory setup is schematically represented in the
figure. It allows us to capture data from 15 sensors (4
digital video cameras, 1 time-of-flight sensor, 10 motion
capture cameras), using hardware and software synchronization. The
capture area was about 6m x 5m, and within it we
had roughly 4m x 3m of effective capture space, where
subjects were fully visible in all video cameras. Digital
video (DV) cameras (4 units) are placed in the corners
of the capture space. A time-of-flight sensor (TOF) is
also placed next to one of the digital cameras. A set of
10 motion capture (MX) cameras are rigged on the walls
to maximize the effective experimentation volume, 4 on
each left and right edge and 2 roughly mid-way on the
bottom horizontal edge. Detailed specifications of the
system are given in table 1. A 3D laser body scanner
from Human Solutions (Vitus Smart LC3) was used to
obtain accurate 3D volumetric models of each of the
actors participating in the experiments.

Image Data

We use 4 Basler high-resolution progressive scan cameras to acquire
video data at 50 Hz. They are on the same clock and trigger as the motion
capture system, which ensures perfect synchronization between the video
and pose data. The system's default calibration procedure is very simple
to perform, but its camera model does not contain radial and tangential
distortion parameters. Since we strive for exceptionally high-quality pose
information, we performed a more complex and more robust procedure
that also fits these parameters. The total number of video frames
for the entire dataset is over 3.6 million.
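The distortion-aware calibration described above corresponds to the standard pinhole model extended with radial and tangential distortion terms. The sketch below illustrates that model; the parameter names (`f`, `c`, `k`, `p`) are illustrative assumptions, not the dataset's actual calibration API.

```python
import numpy as np

def project_point(X, R, t, f, c, k, p):
    """Project a 3D world point into pixel coordinates with a pinhole
    camera plus radial (k[0..2]) and tangential (p[0..1]) distortion.
    Parameter names are illustrative, not the dataset's exact format."""
    Xc = R @ X + t                        # world -> camera coordinates
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]   # perspective division
    r2 = x * x + y * y
    radial = 1 + k[0] * r2 + k[1] * r2 ** 2 + k[2] * r2 ** 3
    xd = x * radial + 2 * p[0] * x * y + p[1] * (r2 + 2 * x * x)
    yd = y * radial + p[0] * (r2 + 2 * y * y) + 2 * p[1] * x * y
    return np.array([f[0] * xd + c[0], f[1] * yd + c[1]])
```

With all distortion coefficients set to zero, this reduces to the simple pinhole projection of the default calibration.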

Pose Data

Pose data is given with respect to a skeleton. For consistency and
convenience we use the same 32-joint skeleton for all of our parametrizations.
In testing we reduce the set of joints to the relevant ones, e.g. leaving only one
joint for each hand and each foot.

Common pose parametrizations
considered in the literature include relative 3D joint
positions (R3DJP) and the kinematic representation (KR).
Our dataset provides data in both parametrizations,
with a full skeleton containing the same joints in
both cases. It also provides 2D joint positions, since
some methods may require this data.

In the first case (R3DJP), the joint positions
in 3D space are provided. They are obtained from
the joint angles produced by the Vicon skeleton fitting
procedure by applying forward kinematics to the
subject skeleton. R3DJP is challenging because it is very
hard to estimate the size of the person. In practice this
problem is obviated by providing the same skeleton
(limb length) information for all subjects, including
those in testing, if needed. The parametrization is
called relative because one joint, usually called the
root joint (roughly corresponding to the position of the
human pelvis), is taken as the center of the
prediction coordinate system and the other joints are
estimated relative to it.
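Making positions relative to the root joint is a simple subtraction. A minimal sketch, assuming joints arrive as a (J, 3) array and that the root index is known from the skeleton definition (the index used here is a placeholder, not the dataset's documented one):

```python
import numpy as np

def to_root_relative(joints, root_index=0):
    """Express absolute 3D joint positions relative to the root joint
    (pelvis). root_index is an assumption -- check the dataset's
    skeleton definition for the actual root joint index."""
    root = joints[root_index]
    return joints - root   # every joint, including the root, is shifted
```

After this transform the root joint sits at the origin, which is exactly why a separate step is needed to recover absolute position in the scene.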

The kinematic representation
(KR) considers the relative joint angles between limbs
and is more convenient because it is invariant to both
scale and body proportions. The dependencies between
variables are, however, much more complex, making estimation
more difficult. Obtaining joint angle values involves a complex constrained non-linear optimization process; we devoted significant effort to ensuring that the data is clean and the fitting process is accurate.
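Forward kinematics, which converts joint angles and limb lengths back into joint positions, can be illustrated with a toy planar chain. The real skeleton is a 3D tree of 32 joints, so this is a sketch of the principle only, not the dataset's actual kinematic model:

```python
import numpy as np

def forward_kinematics(bone_lengths, rel_angles):
    """Toy planar forward kinematics for a single chain: relative joint
    angles are accumulated along the chain and each limb is laid out in
    the resulting direction. Positions start from a root at the origin."""
    positions = [np.zeros(2)]          # root joint at the origin
    total_angle = 0.0
    for length, angle in zip(bone_lengths, rel_angles):
        total_angle += angle           # accumulate relative rotations
        direction = np.array([np.cos(total_angle), np.sin(total_angle)])
        positions.append(positions[-1] + length * direction)
    return np.stack(positions)
```

The accumulation of angles along the chain is what makes the dependencies between KR variables complex: an error in one angle displaces every joint further down the chain.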

We use the camera parameters to project the 3D joint
positions and obtain very accurate 2D pose information.

Outputs were visually inspected multiple times, during
different processing phases, to ensure accuracy. These
representations can be used directly in independent
monocular predictions or in multi-camera settings. The
monocular prediction dataset can be increased 4-fold by
globally rotating and translating the pose coordinates
so as to express the 4 DV camera views in a single coordinate
system (code is provided for this data manipulation).
Poses are available at a four-fold faster
rate than images from the DV cameras. The provided code
can also be used to double both the image and the pose data
by considering their mirror-symmetric versions.
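The two augmentations mentioned above, re-expressing world poses in each camera's coordinate frame and mirroring, can be sketched as follows. The joint index lists for left/right swapping are illustrative assumptions, not the dataset's documented joint ordering:

```python
import numpy as np

def pose_to_camera_frame(pose_world, R, t):
    """Re-express world-coordinate 3D poses (J, 3) in a camera frame
    given rotation R and translation t; applying this once per DV
    camera yields the 4-fold increase of monocular data."""
    return pose_world @ R.T + t

def mirror_pose(pose, left_ids, right_ids):
    """Mirror a pose about the x axis and swap left/right joints to
    obtain the mirror-symmetric version. left_ids/right_ids are
    hypothetical joint index lists for this illustration."""
    mirrored = pose.copy()
    mirrored[:, 0] *= -1                            # flip x coordinate
    mirrored[left_ids + right_ids] = mirrored[right_ids + left_ids]
    return mirrored
```

Note that mirroring without the left/right swap would produce anatomically inconsistent poses (e.g. a left hand attached to the right arm).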

Time-of-Flight Data

Time-of-flight information is obtained using a MESA Imaging SR4000
from SwissRanger at a 25 Hz rate, a relatively standard time-of-flight
sensor on the market. One sensor was placed near one of the video cameras and
captured the entire set of motions. Synchronization with the rest of the system
is done by a software trigger. The camera has an internal clock at which data
acquisition is performed, and time-stamps were recorded for each acquisition to
maintain synchronization. Analyzing the time-stamp information, we found the
acquisition to work at the desired frame rate in the great majority of cases;
code is provided to correct desynchronizations.
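Detecting such desynchronizations from time-stamps amounts to finding gaps larger than the nominal frame period. A minimal sketch, assuming time-stamps in seconds and a tolerance parameter of our own choosing (the dataset's correction code may work differently):

```python
def find_dropped_frames(timestamps, fps=25.0, tol=0.5):
    """Flag gaps in a timestamp sequence longer than the nominal frame
    period (25 Hz for the TOF sensor). tol is the fraction of a period
    beyond which a gap counts as dropped frames; returns a list of
    (index, number_of_missing_frames) pairs. Illustrative only."""
    period = 1.0 / fps
    gaps = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt > period * (1.0 + tol):
            # estimate how many frames are missing before index i
            gaps.append((i, round(dt / period) - 1))
    return gaps
```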

Scanner Data

All of the actors were scanned using a 3-sensor 3D scanner from Human Solutions,
the Vitus Smart LC3. The meshes obtained were preprocessed with the Human Solutions
ScanWorks software, and some manual intervention to repair each mesh was done by
our staff. The meshes are available in Wavefront OBJ format, and Matlab code is
provided for loading them.
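For readers outside Matlab, the OBJ format is simple enough to parse by hand. A minimal sketch that reads only vertex and face records, ignoring normals, textures and materials (not a replacement for the loader shipped with the dataset):

```python
def parse_obj(lines):
    """Minimal Wavefront OBJ parser: collects 'v' vertex records and
    'f' face records from an iterable of lines. Face tokens may look
    like 'v', 'v/vt' or 'v/vt/vn'; OBJ indices are 1-based, so they
    are converted to 0-based here."""
    vertices, faces = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 'v':
            vertices.append([float(x) for x in parts[1:4]])
        elif parts[0] == 'f':
            faces.append([int(tok.split('/')[0]) - 1 for tok in parts[1:]])
    return vertices, faces
```

Pass it the lines of an open file, e.g. `parse_obj(open(path))`, to obtain vertex and face lists.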

Besides the laboratory
test sets, we also focused on providing
test data that covers variations in clothing and complex
backgrounds, as well as camera motion and occlusion.
We are not aware of any setting of this level of difficulty
in the literature. Real images contain people in complex
poses, but the diverse backgrounds as well as the scene
illumination and occlusions can vary independently and
represent important nuisance factors that vision systems
should be robust against. Although approaches to handle
such cases exist in principle in the literature, it remains
difficult to annotate real images. This section of our dataset
was especially designed to address such issues.

We create movies by inserting high-quality 3D rigged
animation models into real videos, producing a realistic and
complex background, good-quality image data and very
accurate 3D pose information. The mixed-reality movies
were created by inserting and rendering 3D models of a
fully clothed man and woman into real videos. The poses
used for animating the models were extracted directly
from our laboratory test set. The
actual insertion required solving for the camera motion of
the backgrounds, as well as their internal parameters, for
good-quality rendering. The scene was set up and rendered using the mental
ray (raytracing) renderer, with several well-placed area
lights and skylights. To improve quality, we placed
a transparent plane on the ground to receive shadows.
Scenes with occlusion were also created. The dataset
contains 5 different dynamic backgrounds obtained with
a moving camera, a total of 10350 examples, out of which
1270 frames contain various degrees of occlusion.

Code

Visualization : For inspecting the data we provide visualization code, available once you log in.

Baseline pose estimation : We provide implementations of a set of baseline prediction methods. This also includes code for data manipulation and feature extraction, as well as large-scale discriminative learning methods based on Fourier approximations.

Precomputed Segments

We provide two types of segmentation for our video data. These are precomputed on the raw image data for the most accurate results and are available in the download section of this website.

Bounding Box : To obtain very accurate bounding boxes we reproject our calibrated poses and fit rectangles around the projections.

Background Subtraction : Background models are obtained from image data acquired separately for this purpose. The final result is computed by a graph cut, with the unary potential given by the background model and the binary potential given by image edges. The graph cut is performed only inside the bounding box.
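The bounding-box step above, fitting a rectangle around the reprojected joints, can be sketched in a few lines. The `margin` padding is a hypothetical parameter added for illustration, not a documented part of the dataset's pipeline:

```python
import numpy as np

def bounding_box(joints_2d, margin=0):
    """Fit an axis-aligned rectangle around projected 2D joints
    (a (J, 2) array of pixel coordinates). margin is an optional
    padding in pixels, an assumption for this sketch."""
    mins = joints_2d.min(axis=0) - margin
    maxs = joints_2d.max(axis=0) + margin
    return mins[0], mins[1], maxs[0], maxs[1]   # (x_min, y_min, x_max, y_max)
```

Because the 2D joints come from reprojecting the calibrated mocap poses, boxes obtained this way are far tighter than what a generic person detector would produce.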

Precomputed Features

For both segmentations (bounding boxes and background subtraction) we provide precomputed pyramid HoG with different parameters extracted both on the silhouettes alone and on the silhouettes with internal edges.

Acknowledgements

This work was supported by grants of the Romanian National Authority for Scientific Research, CNCS - UEFISCDI, under PNII-RU-RC-2/2009, CT-ERC-2012-1 and PCE 2011-3-0438. The authors are grateful to their colleague Sorin Cheran for his support with the Web server organization.