You can also find here a list
of publications that use the MuHAVi dataset.

New (12.09.2017):

A brand new set of temporal ground truths for
MuHAVi-uncut has been prepared that defines the start and end of each
"sub-action" within an action.

Silhouettes for each of the sub-actions in MuHAVi-uncut are now available.

Introduction

We
have collected a large body of human action video (MuHAVi) data using 8
cameras. There are 17 action classes, as listed in Table 2, performed by
14 actors. We initially processed the videos corresponding to 7 actors in
order to split the actions and provide the JPG image frames. These
include some image frames before and after the actual action,
for the purpose of background subtraction, tracking, etc. The longest
pre-action frames correspond to the actor called Person1. Note that
what we provide is therefore temporally pre-segmented actions, as this
was typical when the dataset was first released. We now (see
below) also provide long unsegmented sequences
for people to work on temporal segmentation.

Each actor performs each action several times in the
action zone, highlighted using white tape on the scene floor. As the actors
were amateurs, the session leader had to interrupt them in some cases and
ask them to redo the action for consistency. As shown in Fig. 1 and
Table 1, we used 8 CCTV Schwan cameras located at the 4 sides and 4
corners of a rectangular platform. Note that these cameras are not synchronised. Camera calibration information may be
included here in the future; meanwhile, one can use the patterns on the
scene floor to calibrate the cameras of interest.

Note that to prepare training data for action recognition
methods, each of our action classes may be broken into at least two
primitive actions. For instance, the action "WalkTurnBack" consists of
the walk and turn-back primitive actions. Further, although it is not quite
natural to have a collapse action (due to a shotgun) followed by a standing-up
action, one can simply split them into two separate action classes.

We
make the data available to researchers in the computer vision community
through a password-protected server at the University Carlos III de Madrid, Spain. The data may be accessed by
sending an email (with subject "MuHAVi-MAS Data") to Prof Sergio A
Velastin at
sergio.velastin@ieee.org
giving the names, email addresses and institution(s) of the researchers who wish to use
the data, together with their main purposes. We request this only to build a list
of people using this dataset, to form a "MuHAVi community" with whom to
communicate. The only requirement for using the MuHAVi data is to
refer to this site and to our publication(s) in the corresponding publications.

Figure 1. The top
view of the configuration of the 8 cameras used to capture the actions in
the blue action zone (which is marked with white tape on the scene
floor).

camera symbol    camera name
V1               Camera_1
V2               Camera_2
V3               Camera_3
V4               Camera_4
V5               Camera_5
V6               Camera_6
V7               Camera_7
V8               Camera_8

Table 1. Camera view
names appearing in the MuHAVi data folders and the corresponding
symbols used in Fig. 1.

**** This section is mainly of historical interest. It is better to download the data in the MuHAVi-uncut set ****

In the table below,
you can click on the links to download the data (JPG images) for the
corresponding action.

Important: We noted
that some earlier versions of MS Internet
Explorer could not download files over 2 GB in size, so we recommend using
an alternative browser such as Firefox or Chrome.

Each
tar file contains 7 folders corresponding to the 7 actors (Person1 to
Person7), each of which contains 8 folders corresponding to the 8 cameras
(Camera_1 to Camera_8). Image frames corresponding to every combination
of action/actor/camera are named with image frame numbers starting from
00000001.jpg for simplicity. The video frame rate is 25 frames per
second and the resolution of the image frames (except for Camera_8) is 720
x 576 pixels (columns x rows). The image resolution is 704 x 576 for
Camera_8.
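As an illustration, here is a minimal Python sketch of how one might walk this folder layout once a tar file has been extracted. The root path and the chosen action are assumptions; only the Person*/Camera_* structure, the 00000001.jpg naming and the 25 fps rate come from the description above.

import os

# Hypothetical location of one extracted action tar file.
root = "WalkTurnBack"

for actor in sorted(os.listdir(root)):          # Person1 .. Person7
    actor_dir = os.path.join(root, actor)
    for camera in sorted(os.listdir(actor_dir)):  # Camera_1 .. Camera_8
        frame_dir = os.path.join(actor_dir, camera)
        # Frames are numbered 00000001.jpg onwards at 25 fps.
        frames = sorted(f for f in os.listdir(frame_dir) if f.endswith(".jpg"))
        print(f"{actor}/{camera}: {len(frames)} frames "
              f"(~{len(frames) / 25.0:.1f} s)")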

Un-cut original video sequences
(mainly in MPEG2) for each camera (the recordings are continuous and
contain the acted actions but also the gaps and breaks in
between).

Ground truth describing the times of
start and completion (frame numbers) of each sub-action in each video file
by each actor (note: the community's views are welcome, to agree on a set
of metrics to evaluate temporal segmentation methods; one possible metric
is sketched after this list)

Silhouettes computed by Z. Chen's algorithm
(the rationale is that these are realistic silhouettes typical of the
state of the art, and people are invited to test the robustness of their
human action recognition and temporal segmentation algorithms on
such realistic, and "imperfect", segmentations)
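For example, one candidate metric (our suggestion, not an agreed standard) is the temporal intersection-over-union between a ground-truth segment and a detected segment, both given as inclusive start/end frame numbers as in the annotation:

def temporal_iou(gt, pred):
    """IoU of two inclusive (start_frame, end_frame) intervals."""
    inter = min(gt[1], pred[1]) - max(gt[0], pred[0]) + 1
    if inter <= 0:
        return 0.0
    union = (gt[1] - gt[0] + 1) + (pred[1] - pred[0] + 1) - inter
    return inter / union

# e.g. ground truth (377, 607) vs a detection that starts 20 frames late
print(temporal_iou((377, 607), (397, 620)))  # ~0.86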

Here are a couple of samples that do not need a user name and password to download:
Camera2A video sample
Camera2A sample silhouettes (parameter = 6.0)

You can now download the full-length
videos from here. Because of the
length of these videos, use Right-Click/"Save Link As" and use a high-speed
network:

Each compressed archive contains files named %d.png, where the number is
the frame number. In each file, black (0) represents the background,
white (255) the foreground, i.e. the silhouette, and grey (127) a
detected shadow
(normally to be considered as background).
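A small sketch (using PIL and NumPy, our choice of libraries, not anything shipped with the dataset) of how these grey levels might be decoded:

from PIL import Image
import numpy as np

mask = np.array(Image.open("1.png"))   # files are named %d.png by frame number

foreground = mask == 255               # the silhouette
shadow     = mask == 127               # detected shadow
background = (mask == 0) | shadow      # shadows normally treated as background

print(f"silhouette pixels: {foreground.sum()}, shadow pixels: {shadow.sum()}")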

The "Parameter" is one of the factors that affect foreground detection
in
terms of true positives vs false positives. When tested against the manually annotated silhouettes,
a value of 3.0 produces a TPR (true positives rate) of around 0.78 and
a FPR (false positives rate) of around 0.027
while 4.0 gives around 0.71 and 0.013 and 5.0 gives 0.625 and less than
0.01 (i.e. less noise but less foreground). As many of the false
positives tend to be noise outside the main silhouette, we expect that
most people will use the set with higher TPR and reduce the false
positives e.g. with morphologial filtering. When publishing results can
you please ensure that you give full details of any pre-processing of
this kind.
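As an illustration of the kind of pre-processing meant above, this sketch applies a morphological opening with OpenCV to remove small isolated false positives; the 5x5 elliptical kernel is an arbitrary choice and is exactly the sort of detail that should be reported:

import cv2

mask = cv2.imread("1.png", cv2.IMREAD_GRAYSCALE)
fg = (mask == 255).astype("uint8") * 255   # keep silhouette only, drop shadow

# Opening (erosion then dilation) removes specks smaller than the kernel.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
cleaned = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)

cv2.imwrite("1_cleaned.png", cleaned)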

**** Historical note
When we first published MuHAVi we provided a spreadsheet with the times (frame numbers) at which an action started and finished. Incidentally, it also described how
the MuHAVi JPEG sequences were obtained from AVI files
extracted from manually obtained temporal markers. Please also note
that we discovered a bug in mplayer (which converted from
AVI to JPEGs)
that resulted in some skipped frames in the JPEG sequences. In any case, we have found this to be of less use than we expected because:

Each action (e.g. "walk and turn back") was
conducted by each actor a number of times (typically 3), but the
annotation only contained the start and end of the (3) actions as a
whole and not of each one separately.

Actions such as "walk and turn back" could really be
regarded as two or three sub-actions: walk (towards one end of the
stage), turn, walk (back to the other end of the stage), and it would be
nice to annotate them separately.

Finally, we found that there were errors in the annotation!

**** end of historical note

The ground truth file can be obtained here
in spreadsheet format (we are grateful to Erwann Nguyen-van-Sang,
intern MSc student from the U. of Strasbourg, who spent many hours
producing this annotation).

Below is an extract from the spreadsheet.

The first column refers to the
camera and actor numbers.

The second column header gives the
action (e.g. "WalkTurnBack", "RunStop").

For each person, the number in the second column gives the
frame number, in the video sequence, where the
action starts, and the third column gives where it ends (this is somewhat
subjective, of course, and the community needs to agree on a metric that
would not unjustly penalise algorithms).

If the action was repeated (that is
almost always the case) the start and end frames are given in the
fourth and fifth columns, and so on.

Camera 2 from dvcam3-1-6.0 and dvcam3-2-6.0

         | WalkTurnBack                       | RunStop
         | Start S1  End S1  Start S2  End S2 | Start S1  End S1  Start S2  End S2  Start S3  End S3  Start S4  End S4
Actor 1  |   377      607      627      867   |   7387     7526     7527     7666     7667     7776     7777     7837
Actor 2  |  1297     1517     1537     1777   |   8267     8366     8367     8446     8447     8546     8547     8597
Actor 3  |  2087     2327     2347     2597   |   8867     8946     8947     9046     9047     9156     9157     9227
Actor 4  |  2967     3187     3217     3457   |   9717     9796     9797     9886     9887     9976     9977    10037
Actor 5  |  3847     4077     4097     4367   |  10347    10446    10447    10546    10547    10626    10627    10687
Actor 6  |  4747     5017     5047     5337   |  11187    11296    11297    11396    11397    11496    11497    11577
Actor 7  |  5677     5947     5967     6207   |  11927    12036    12037    12126    12127    12226    12227    12307
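To show how a row of this layout can be consumed, here is a small sketch (assuming the spreadsheet has been exported to plain cells; this is our illustration, not an official tool) that turns the alternating Start/End cells for one actor into (start, end) frame pairs:

def row_to_segments(cells):
    """Turn alternating Start/End cells into inclusive (start, end) pairs."""
    nums = [int(c) for c in cells if str(c).strip()]
    return list(zip(nums[0::2], nums[1::2]))

# Actor 1, WalkTurnBack: two repetitions (values from the extract above)
print(row_to_segments(["377", "607", "627", "867"]))
# -> [(377, 607), (627, 867)]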

To help those working with this data, we have extracted (from the long
silhouette sequences) each sub-action described in the above
spreadsheet into separate sub-sequences (divided up by actor, action and
camera). As there are many of these, it is best to download the whole
set from here (2.7GB).

**** Historical note

Note: this material is only
historical (as we do not have well-documented sources of these results).

Masks
obtained by applying two different Tracking/Background Subtraction
Methods to some of our Composite Actions

Each zip file
contains masks (in their bounding boxes) corresponding to several
sequences of composite actions performed by the actor A1 and captured
from two camera views (V3 and V4) for the purpose of testing
silhouette-based action recognition methods against more realistic
input data (in conjunction with our MAS training data provided below),
where the need for a temporal segmentation method is also clear.

We
recommend using this subset of MuHAVi to test Human Action Recognition
(HAR) algorithms independently of the quality of silhouettes. For a
fuller evaluation of a HAR algorithm, please consider using MuHAVi
"uncut" instead.

We
have selected 5 action classes and manually annotated the corresponding
image frames to generate the corresponding silhouettes of the actors.
These actions are listed in Table 4. It can be seen that we have only
selected 2 actors and 2 camera views for these 5 actions. The
silhouette images are in PNG format and each action combination can be
downloaded as a small zip file (between 1 and 3 MB). We have also added
the 3 constant characters "GT-" to the beginning of every original image
name to label them as ground-truth images.
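Assuming the "GT-" prefix is the only change to the name, recovering the original frame name from a ground-truth file is a one-liner; the extension swap from .png back to .jpg is our assumption, not documented above:

gt_name = "GT-00000216.png"          # hypothetical ground-truth file name

# Strip the "GT-" prefix; map the PNG silhouette back to its JPG frame
# (the extension swap is an assumption).
frame_name = gt_name.removeprefix("GT-").replace(".png", ".jpg")
print(frame_name)                    # 00000216.jpg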

In the table below,
you can click on the links to download the silhouette data for the
corresponding action combinations.

Table 4. Action
combinations corresponding to the MAS data for which ground truth
silhouettes have been generated.

NEW! The table
below contains links to the corresponding AVI video files (in MPEG2)
from which the JPEG file sequences were extracted, which were then used
by the manual annotators to get the silhouettes. (Note that due to a
software bug the JPEG sequences had a couple of frames missing towards
the end of the sequence, so the AVI files do not exactly
correspond to the silhouette frames. As this happens towards the end, it
should not significantly affect work that evaluates automatic
silhouette segmentation and uses performance metrics based on
aggregating and averaging results over the whole sequence.)

Table 5. AVI video files (in MPEG2) corresponding to the MAS action
combinations for which ground truth
silhouettes have been generated.

Finally, the following table documents the frames that
were manually segmented so that you can test foreground segmentation
algorithms (i.e. this table tells you the correspondence between JPEG,
AVI and PNG frames in the dataset). Please note that the human
annotators worked on the JPEG files, and hence there is a one-to-one
correspondence between JPEG and PNG files. Because of a bug we later
discovered in the version of mplayer
that was used to generate the JPEG frames, there is a small difference in
the number of frames in the AVI files, but we still suggest you use the
AVI files, as the JPEGs were effectively transcoded from the original
MPEG2 videos.

Action           Actor    Camera   GTInitFrame  GTEndFrame  GTNFrames  JPGNFrames  AVINFrames
Kick             Person1  Camera3  2370         2911        542        3001        3003
Kick             Person1  Camera4  2370         2911        542        2997        2999
Kick             Person4  Camera3  200          628         429        731         733
Kick             Person4  Camera4  200          628         429        721         723
Punch            Person1  Camera3  2140         2607        468        2746        2748
Punch            Person1  Camera4  2140         2607        468        2750        2752
Punch            Person4  Camera3  92           536         445        642         643
Punch            Person4  Camera4  92           536         445        645         647
RunStop          Person1  Camera3  980          1418        439        1572        1574
RunStop          Person1  Camera4  980          1418        439        1572        1574
RunStop          Person4  Camera3  293          618         326        751         753
RunStop          Person4  Camera4  293          618         326        749         751
ShotGunCollapse  Person1  Camera3  267          1104        838        1444        1446
ShotGunCollapse  Person1  Camera4  267          1104        838        1443        1445
ShotGunCollapse  Person4  Camera3  319          1208        890        1424        1426
ShotGunCollapse  Person4  Camera4  319          1208        890        1424        1426
WalkTurnBack     Person1  Camera3  216          682         467        866         868
WalkTurnBack     Person1  Camera4  216          682         467        860         862
WalkTurnBack     Person4  Camera3  207          672         466        836         838
WalkTurnBack     Person4  Camera4  207          672         466        839         841

GTInitFrame: frame number for the start of the manual annotation
GTEndFrame: frame number for the end of the manual annotation
GTNFrames: number of manually annotated frames = (GTEndFrame - GTInitFrame + 1)
JPGNFrames: total number of frames in the JPEG sequence (slightly fewer than AVINFrames)
AVINFrames: total number of frames in the AVI sequence

We have reorganized
these 5 composite action classes into 14 primitive action classes, as
shown in the table below.