NEWS!: We recently (Aug. 2010) collected a
new soccer data set at
ESPN Wide World of Sports. We used scissor
lifts with adjustable heights to mount 3 synchronized 720P-HD cameras for
covering one half of the field. We collected two games with camera heights of 60 feet. One
of these games was recorded at night under floodlights. We captured the
third game with a camera height of about 25 feet. We tested our framework on
22,000 frames of this data (in addition to the results on 60,000 frames of
PIXAR soccer data given in our
CVPR paper). Results on the new data can
be found in our
journal paper (in preparation).

1. Introduction:

Visualizing multi-player sports has grown
into a multimillion dollar industry. However, inferring the state of
a multi-player game is still an open challenge. This is especially true when the context of
the game changes in a dynamic and continuous manner. Examples of such sports
include soccer, field hockey, and basketball. Our work is
geared towards automatic visualization of this particular subset of sports.

One of the main technical challenges for
sports visualization systems is to infer accurate player positions in the face
of occlusion and visual clutter. One solution to this end is to use multiple
overlapping cameras, provided the observations from these cameras can be fused
reliably. Our work explores this question of efficient and robust fusion
of visual data observed from multiple synchronized cameras, and applies this
information for generating sports visualizations. These include displaying a
virtual offside line in soccer games, highlighting players in passive offside
positions, and showing players' motion patterns.

Our key contribution is modeling and
analyzing the problem of fusing corresponding players' positional information
as finding minimum weight K-length cycles in complete K-partite graphs. The
algorithm-class we propose to this end uses a dynamic programming based
approach that varies
over a continuum from maximally to minimally greedy in terms of the number
of paths explored at each iteration. We use our proposed algorithm-class for an
end-to-end sports visualization framework, and demonstrate its robustness by
presenting results over 60,000 frames of real soccer footage
captured over five different illumination conditions, play types, and team
attire.

2. Framework Overview:

Following are the main steps we have in our sports
visualization framework:

2a. Background Subtraction

We begin by adaptively learning per-pixel Gaussian
mixture models for scene background. These models are used for foreground
extraction by thresholding appearance likelihoods of scene pixels. The input
and output of this step are shown in the following figure. Note that while
this step extracts player pixels quite successfully, it also extracts the shadow
pixels as a part of the foreground. Such shadow pixels can be problematic for
player tracking, and therefore need to be removed.
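As a rough sketch of this step, the per-pixel model can be illustrated with a single Gaussian per pixel (the actual framework learns a mixture per pixel; the learning rate and threshold below are illustrative, not those of the paper):

```python
import numpy as np

def update_background(mean, var, frame, lr=0.05):
    """Running per-pixel Gaussian update (single mode shown for brevity;
    the framework described above learns a mixture per pixel)."""
    diff = frame - mean
    mean = mean + lr * diff
    var = (1 - lr) * var + lr * diff ** 2
    return mean, var

def foreground_mask(mean, var, frame, k=2.5):
    """Pixels more than k standard deviations from the background mean
    have low appearance likelihood and are labeled foreground."""
    return np.abs(frame - mean) > k * np.sqrt(var + 1e-6)

# Toy example: a static gray background with a bright "player" patch.
rng = np.random.default_rng(0)
bg = 50.0 + rng.normal(0, 1, size=(20, 20))
mean, var = bg.copy(), np.full((20, 20), 4.0)
frame = bg + rng.normal(0, 1, size=(20, 20))
frame[5:10, 5:10] += 100.0          # simulated player region
mask = foreground_mask(mean, var, frame)
```

In this toy frame only the bright patch exceeds the likelihood threshold, so the mask isolates the simulated player.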

2b. Shadow Removal

While there are numerous appearance based methods
for shadow removal [21], they mostly work best for relatively soft shadows. In
soccer games however, shadows can be quite strong. We therefore rely on
geometric constraints of our multi-camera setup for robust shadow removal.

Consider the following figure: because shadows lie on the ground plane, only
the shadow pixels of a player are view independent. This enables us to remove shadows by
warping the extracted foreground in one view onto another, and filtering out the
overlapping pixels. We begin by finding 3x3 planar homographies between each
pair of views, such that for any ground-plane point in one view, we know a distinct mapping
for it in the second view.

In cases where a player is partially occluded by a
shadow, simply relying on these geometric constraints might result in losing
image regions belonging to occluded parts of players. To avoid this, we apply
chromatic similarity constraints of original and projected pixels before
classifying them as shadow versus non-shadow. The intuition here is that the
appearance similarity of shadow pixels across multiple
views would be more than that for non-shadow pixels.

The input and output of the shadow removal step are
shown in the following figure. Notice that some parts of the player are also
removed while removing the shadows, however by and large this method of shadow
removal performs quite well.
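A minimal sketch of the geometric part of this filtering, using synthetic masks and an identity homography (the chromatic-similarity check described above is omitted for brevity):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 planar homography to an (N, 2) array of pixel coords."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    proj = homo @ H.T
    return proj[:, :2] / proj[:, 2:3]

def shadow_mask(fg_a, fg_b, H_ab):
    """Label as shadow those foreground pixels of view A whose ground-plane
    mapping into view B is also foreground there: shadows lie on the ground
    plane, so they are the only view-independent foreground pixels."""
    ys, xs = np.nonzero(fg_a)
    pts = np.stack([xs, ys], axis=1).astype(float)
    mapped = np.round(warp_points(H_ab, pts)).astype(int)
    h, w = fg_b.shape
    shadow = np.zeros_like(fg_a)
    for (x, y), (mx, my) in zip(pts.astype(int), mapped):
        if 0 <= mx < w and 0 <= my < h and fg_b[my, mx]:
            shadow[y, x] = True
    return shadow

# Synthetic example with an identity homography: the overlap of the two
# foreground masks is exactly the region classified as shadow.
fg_a = np.zeros((8, 8), bool); fg_a[2:5, 2:5] = True
fg_b = np.zeros((8, 8), bool); fg_b[3:6, 3:6] = True
shadow = shadow_mask(fg_a, fg_b, np.eye(3))
```

With a real inter-view homography, only pixels on the ground plane (shadows) map consistently between views, which is what makes this geometric test discriminative.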

2c. Player Tracking

We track the player blobs using a particle filter
based framework. We represent the state of each player using a multi-modal
distribution, which is sampled by a set of particles. To propagate the previous
particle set to the next, we perform the three-step procedure of Selection,
Prediction and Measurement. Here Selection implies the step of importance
sampling of a set of particles from the previous step based on how well they fit
the measurement for the last frame. Prediction implies the application of a
dynamic model on the selected particles. Finally, measurement relates to ranking
the particles in terms of how well they match the measurement from the current
frame. These three steps are repeated for each frame in the video. This
entire process is illustrated in the following figure.
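The three-step loop can be sketched as follows for a single 2D track (a simplified single-player, random-walk version; the motion and measurement models here are illustrative stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def step(particles, weights, measurement, motion_std=1.0, meas_std=2.0):
    """One Selection / Prediction / Measurement cycle of the tracker."""
    n = len(particles)
    # Selection: importance-resample particles by their previous weights.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Prediction: apply the dynamic model (a random walk here).
    particles = particles + rng.normal(0, motion_std, size=particles.shape)
    # Measurement: re-weight by how well each particle matches the blob
    # position measured in the current frame (Gaussian likelihood).
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = np.exp(-d2 / (2 * meas_std ** 2))
    weights = weights / weights.sum()
    return particles, weights

# Toy 2D track: the true player position drifts one unit right per frame.
particles = rng.normal([0.0, 0.0], 5.0, size=(200, 2))
weights = np.full(200, 1 / 200)
for t in range(10):
    truth = np.array([float(t), 0.0])
    particles, weights = step(particles, weights, truth)
estimate = (particles * weights[:, None]).sum(axis=0)
```

The weighted particle mean tracks the drifting position, illustrating how the multi-modal state distribution follows the measurements over time.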

2d. View Dependent Blob
Classification

We classify the tracked blobs on a per-frame and
per-view basis. We pre-compute the hue and saturation histograms of a few (~5)
player-templates of both teams as observed from each view. During testing, we
compute the hue and saturation histograms of the detected blobs, and find
their Bhattacharyya distances from the player-templates of the corresponding
view. We classify each blob into offense or defense teams based on the label of
their nearest neighbor templates. The pipeline of blob-classification for one
particular view is shown in the following figure.
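The histogram comparison at the heart of this step can be sketched as follows (synthetic hue/saturation samples stand in for real blob pixels; bin counts and team colors are invented for illustration):

```python
import numpy as np

def hs_histogram(hue, sat, bins=8):
    """Joint hue-saturation histogram, L1-normalized (hue in [0, 180) as in
    the common 8-bit HSV convention)."""
    h, _, _ = np.histogram2d(hue, sat, bins=bins, range=[[0, 180], [0, 256]])
    return (h / h.sum()).ravel()

def bhattacharyya(p, q):
    """Bhattacharyya distance between two normalized histograms."""
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(0.0, 1.0 - bc))

def classify(blob_hist, templates):
    """Label the blob with the team of its nearest template histogram."""
    return min(templates, key=lambda lbl: min(
        bhattacharyya(blob_hist, t) for t in templates[lbl]))

# Synthetic templates: a red-shirted team and a blue-shirted team.
rng = np.random.default_rng(2)
red = hs_histogram(rng.normal(5, 3, 500).clip(0, 179),
                   rng.uniform(150, 255, 500))
blue = hs_histogram(rng.normal(120, 3, 500).clip(0, 179),
                    rng.uniform(150, 255, 500))
templates = {"offense": [red], "defense": [blue]}

# A detected blob with red-like hues is assigned to the nearest template.
test_blob = hs_histogram(rng.normal(5, 3, 300).clip(0, 179),
                         rng.uniform(150, 255, 300))
label = classify(test_blob, templates)
```

Because the distance is computed against templates from the same view, per-view lighting differences are absorbed into the templates rather than the classifier.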

The output of the tracking and player
classification on an example frame is shown in the following figure.

2e. Data Fusion for Player
Classification

To transform players’ locations observed from
multiple cameras into a shared space, we project the base-point of all blobs
observed from each camera into real-world coordinates of the field. We pose
fusing the location evidence of players observed from multiple cameras as
iteratively finding minimum weight K-length cycles in a complete K-partite
graph. Nodes in each partite set of this graph represent blobs of detected players
in different cameras. The edge-weights in this graph are a function of pair-wise
similarity
between blobs observed in camera-pairs, and their corresponding ground plane
distances. Correspondence
between a player’s blobs observed in different cameras is equivalent to a
K-length cycle in this graph. This problem setup is illustrated in the following
figure.

Specifically, we can state our problem as follows: given a
complete K-partite graph G with K tiers, we want to find the minimum weight
cycle c in G, such that c passes through each tier of K once and only once. A
complete K-partite graph and a node-cycle are shown in the left-most
and right-most figures below, respectively. We iteratively find and remove K-length
minimum weight cycles from G until no cycles remain in the graph.

Note that as our problem is cyclic in nature, the
path we find must start and end at the same node. With
traditional dynamic programming, however, there is no guarantee that the shortest path
returned by the algorithm ends at the same node as the source
node. We therefore need to modify our graph representation such that we can
satisfy the cyclic constraint of our problem, while still using a
dynamic programming based scheme.

Assume the set V of all nodes in G has size n. For each
node v in V , we can construct a sub-graph Gv with K + 1 tiers, such
that the only node in the 1st and the (K + 1)st tier of Gv
is v. Besides the 1st and the (K + 1)st tiers of Gv,
its topology is the same as that of G. This is illustrated in the 2nd
figure above.

Note that the shortest cycle in G involving node v
is equivalent to the shortest path in Gv that has v as both its source and
destination. Our problem can now be re-stated as: given G, construct Gv
for all v in V; find shortest K-length paths P = {pv in Gv
for all v in V} that span each tier of Gv once and only once; then find the
shortest cycle in G by searching for the shortest path in P. There is an inherent
tradeoff between the efficiency and optimality of this search, which is
analyzed in detail in the paper.
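The fully non-greedy end of this search can be sketched as a Viterbi-style dynamic program: fix an anchor node v in tier 1 (the role of the duplicated tier in Gv), sweep through the remaining tiers keeping the best path into each node, and close the cycle back at v. This sketch assumes a fixed camera (tier) ordering and omits the iterative cycle removal; names and weights are illustrative:

```python
import numpy as np

def min_cycle(weights):
    """weights[i] is an (n_i x n_{i+1}) matrix of edge weights between tier i
    and tier i+1, cyclically (weights[K-1] connects tier K back to tier 1).
    Returns the minimum weight K-length cycle and its cost."""
    K = len(weights)
    n0 = weights[0].shape[0]
    best_cost, best_cycle = float("inf"), None
    for v in range(n0):                       # anchor node in tier 1
        # cost[j]: cheapest path from v to node j of the current tier;
        # paths[j]: the node indices along that path.
        cost = weights[0][v].copy()
        paths = [[v, j] for j in range(weights[0].shape[1])]
        for k in range(1, K - 1):             # sweep tiers 3 .. K
            W = weights[k]
            new_cost = np.full(W.shape[1], np.inf)
            new_paths = [None] * W.shape[1]
            for j in range(W.shape[1]):
                i = int(np.argmin(cost + W[:, j]))
                new_cost[j] = cost[i] + W[i, j]
                new_paths[j] = paths[i] + [j]
            cost, paths = new_cost, new_paths
        # Close the cycle: edge from the last tier back to the anchor v.
        close = cost + weights[K - 1][:, v]
        j = int(np.argmin(close))
        if close[j] < best_cost:
            best_cost, best_cycle = close[j], paths[j]
    return best_cost, best_cycle

# Example: three tiers (cameras) with two blobs each; matching blob 0 in
# every view is cheapest, giving the cycle [0, 0, 0] with total weight 3.
w = [np.array([[1.0, 10.0], [10.0, 1.0]]) for _ in range(3)]
cost, cycle = min_cycle(w)
```

Keeping only the single best path into each node is the maximally pruned setting; the greediness continuum described above corresponds to how many candidate paths are retained per node at each sweep.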

3. Multi-Player Sports
Visualization

We use our framework to generate various automatic
sports visualizations, three of which are listed below.

3a. Offside Line Visualization

An important foul in soccer is the offside call,
where an offense player receives the ball while being behind the second last
defense player (SLD). We want to detect the SLD player, and to draw an offside
line underneath him/her. To test the robustness of our proposed system, we ran
it on approximately 60,000 frames of soccer footage captured over 5 different
illumination conditions, play types, and teams’ attire.
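Once player positions are fused into field coordinates, locating the SLD reduces to an order statistic along the attacking direction. A hypothetical helper (the coordinate convention, names, and values are invented for illustration):

```python
def second_last_defender(defense_x, attack_direction=+1):
    """Given fused ground-plane x-coordinates (in meters) of the defending
    team, return the x position of the second-last defender (SLD), under
    which the virtual offside line is drawn. attack_direction=+1 means the
    offense attacks toward increasing x, so the deepest defenders have the
    largest x; the deepest one is usually the goalkeeper."""
    ordered = sorted(defense_x, reverse=(attack_direction > 0))
    return ordered[1]

# Five defenders; the deepest (x=88) is the keeper, so the SLD is at x=55.
line_x = second_last_defender([10.0, 35.5, 42.0, 55.0, 88.0])
```

The robustness numbers below reflect how sensitive this simple order statistic is to errors in the fused positions: a single mislocalized defender can shift the line.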

We compared the performance of our proposed system
with that of finding the SLD player in each camera individually, and with
naively fusing this information by taking their average. Our proposed fusion
mechanism outperforms the rest with an average accuracy of 92.6%. The naive
fusion produces an average accuracy of 75.7%. The average accuracy across all 3
individual cameras over all 5 sets is 82.7%. To the best of our knowledge, this
is the most thorough test of automatic offside-line visualization for soccer
games available.

3b. Passive Offside
Visualization

Offense players can be in an offside state either
actively (get directly involved in the play while being behind the SLD), or
passively (be present behind the SLD and not get directly involved in the play).
Fig. 10 shows an example illustrating the offense player in passive offside
state automatically highlighted using our proposed framework. Visualizations
such as these can assist viewers in
predicting whether or not an offside foul is likely to take place.
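Given the SLD line and fused player positions, flagging passive offside is a simple filter. A hypothetical sketch (player names, positions, and the ball-holder input are invented; in the real system ball involvement would come from the tracking data):

```python
def passive_offside(offense_pos, sld_x, ball_holder, attack_direction=+1):
    """Flag offense players beyond the SLD line who are not directly
    involved in the play (here approximated as not holding the ball)."""
    flagged = []
    for name, x in offense_pos.items():
        beyond = x > sld_x if attack_direction > 0 else x < sld_x
        if beyond and name != ball_holder:
            flagged.append(name)
    return flagged

# A1 is beyond the SLD line (x=55) without the ball: passive offside.
# A3 is also beyond the line but holds the ball, so is actively involved.
flags = passive_offside({"A1": 60.0, "A2": 40.0, "A3": 58.0}, 55.0, "A3")
```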

3c. Motion Pattern
Visualization

Visual broadcast of soccer games only shows an
instantaneous representation of the sport, where no visual record of what
happened over some preceding time is usually maintained. There are two important
challenges in creating such a time-lapsed representation of a game. Firstly, automatic
detection of players’ actions is hard. And secondly, summarizing these actions
in an informative manner is non-obvious. To this end, we consider players’
movement as a basic representation of the state of a game, and use our framework
to
visualize development of a game over a window of time (see Figure below).
Visualizing such holistic movements of players accumulated over time can
potentially help viewers’ understanding of how a game is progressing,
identifying the various defense and offense strategies being used, and
predicting the subsequent game-plan for each of the teams.
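One simple way to summarize movement over a time window is an occupancy grid over the field; a minimal sketch assuming fused positions in field meters (the grid resolution, window length, and track format are illustrative, not the paper's):

```python
import numpy as np

def motion_heatmap(tracks, field_shape=(68, 105), window=250):
    """Accumulate fused player positions over the last `window` frames into
    a coarse 1m-cell occupancy grid of the field, a simple summary of the
    players' accumulated motion patterns."""
    grid = np.zeros(field_shape)
    for frame_positions in tracks[-window:]:
        for (x, y) in frame_positions:
            gx = min(int(x), field_shape[1] - 1)
            gy = min(int(y), field_shape[0] - 1)
            grid[gy, gx] += 1
    return grid

# Three frames of a single player drifting across the field.
tracks = [[(10.2, 30.1)], [(11.0, 30.5)], [(12.3, 31.0)]]
heat = motion_heatmap(tracks)
```

Rendering such a grid as a translucent overlay on the field produces the kind of accumulated-movement visualization described above.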

4. Conclusions and Future Work

We have presented a novel modeling and search
method for fusing evidence from multiple information sources as iteratively
finding minimum weight K-length cycles in complete K-partite graphs. As an
application of the proposed algorithm-class, we have presented a framework for
soccer player localization using multiple synchronized static cameras. We have
used this fused information to generate various
sports visualizations, including the virtual offside line, highlighting players
in passive offside state, and showing players’ accumulated motion patterns. We
have presented a thorough analysis of the robustness of our framework by testing
it over a large and diverse set of soccer footage.

In the future we want to apply our algorithm-class
to a wider set of correspondence-finding problems, including matching for depth
estimation, trajectory matching using multiple cameras, and motion capture
reconstruction. Furthermore, we want to use our visualization framework for a
variety of sports, including rugby, hockey, and baseball.