Detecting Humans in Dense Crowd Images

Introduction

Human detection in dense crowds is an important problem, as it is a prerequisite to many other visual tasks, such as tracking, counting, action recognition, and detection of anomalous behaviors exhibited by individuals in a dense crowd. The problem is challenging due to the large number of individuals, their small apparent size, severe occlusions and perspective distortion. However, crowded scenes also offer contextual constraints that can be used to tackle these challenges. In this paper, we explore context for human detection in dense crowds in the form of a locally-consistent scale prior, which captures the similarity in scale within local neighborhoods and its smooth variation over the image. Using the scale and confidence of detections obtained from an underlying human detector, we infer scale and confidence priors using a Markov Random Field. In an iterative mechanism, the confidences of detection hypotheses are modified to reflect consistency with the inferred priors, and the priors are updated based on the new detections. Occlusion reasoning is then performed on the final set of detections using Binary Integer Programming, where overlaps and relations between parts of individuals are encoded as linear constraints. Both human detection and occlusion reasoning in the proposed approach are solved with local neighbor-dependent constraints, thereby respecting the inter-dependence between individuals that is characteristic of dense crowd analysis. In addition, we propose a mechanism to detect different combinations of body parts without requiring annotations for individual combinations. We performed experiments on a new and extremely challenging dataset of dense crowd images, showing marked improvement over the underlying human detector.

Motivation

Crowd analysis is fundamental to solving many real-world problems. It is important for the management of crowded events, such as protests, demonstrations, marathons, rallies, political speeches and music concerts, which are characterized by gatherings of thousands of people. It is useful in the design of public spaces and infrastructure, as well as in their expansion and modification, through analysis of the counts of customers and commuters that frequent and travel through these places. It has applications in computer graphics as well, where crowd simulation models can be learned using data from real-world crowded scenes. Perhaps its most important use, however, is in visual surveillance and anomaly detection.

Dense crowds pose a set of challenges for visual analysis: fewer pixels per target, perspective effects and severe occlusions. However, they also provide constraints that can be employed to tackle these challenges. These constraints can be either spatial (contextual) or temporal. In this paper, we explore the use of spatial constraints for improving human detection. For instance, in the image above, neighboring individuals have similar scale or size. Furthermore, although the scale changes across the image, the change is gradual due to the perspective effect and the position of the camera, which is generally overhead.

Framework

Our approach obtains human detections using an underlying human detector. The scale and confidence of these detections are used to infer scale and confidence priors using a Markov Random Field. In an iterative manner, the confidences of detection hypotheses are modified to reflect consistency with the inferred priors, and the priors are updated based on the new detections. Finally, occlusion reasoning is applied globally to the set of putative detections, resulting in bounding boxes on the visible parts of humans as output. This framework is shown in the figure below:

Scale and Confidence Priors

In a densely crowded image or video, human detection becomes difficult primarily due to the smaller target size and severe occlusions. However, the scale of a human in a crowded scene provides a cue to what the scale should be in the immediate surroundings of the associated detection. We can transfer the knowledge of scale from a point in the scene to its surroundings using the scale and confidence of that particular human detection. The figure below illustrates this idea. Given scale and confidence priors, the confidence of each detection hypothesis is altered to reflect conformity with the priors. However, since the priors and detections depend on each other, this necessitates an iterative mechanism in which the priors are improved using the current detections, and the detections are improved using the updated priors.
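The re-scoring step described above can be illustrated with a minimal sketch. The text does not give the exact update rule, so everything here is an assumption made for illustration: the function name `rescore`, the Gaussian penalty on the log-ratio of scales, the tolerance `sigma`, and the convention that the prior confidence lies in [0, 1].

```python
import math


def rescore(confidence, scale, prior_scale, prior_conf, sigma=0.25):
    """Adjust a detection's confidence by its agreement with the scale prior.

    Hypothetical rule (not taken from the source): agreement is a Gaussian
    in the log-ratio of detected scale to prior scale, and the prior's own
    confidence (assumed in [0, 1]) controls how strongly it is applied.
    """
    # Log-ratio is zero when the detection matches the prior scale exactly.
    log_ratio = math.log(scale / prior_scale)
    # Agreement is 1.0 for a perfect match and decays toward 0.0 otherwise.
    agreement = math.exp(-0.5 * (log_ratio / sigma) ** 2)
    # A weak prior leaves the confidence nearly unchanged; a strong prior
    # suppresses detections whose scale disagrees with the neighborhood.
    return confidence * ((1.0 - prior_conf) + prior_conf * agreement)
```

Under these assumptions, a detection whose scale matches the local prior keeps its confidence, while one that is twice the expected size in a region with a trusted prior is strongly suppressed. In an iterative scheme, the re-scored detections would then be used to re-estimate the priors.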

Figure: Intermediate computations of scale and confidence priors: (a) The scales and confidences from detections in an image are transformed into a 2D graph. (b) The observed scale prior is obtained, (c) which is then smoothed through the MRF. The corresponding confidence prior is also shown in (d). A heat map is used in (b)-(d), where brighter colors indicate larger values.
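The smoothing step from (b) to (c) can be sketched as follows. The source does not specify the MRF potentials or the inference method, so this sketch assumes a Gaussian (quadratic) MRF on a 4-connected grid, solved with Jacobi iterations. The function names, the smoothness weight `lam`, and the grid representation are all hypothetical. The energy being minimized is the confidence-weighted squared deviation from the observed scales plus `lam` times the squared difference between neighboring cells.

```python
import numpy as np


def _neighbor_sum(a):
    """Sum of the 4-connected neighbors of every cell (zero outside the grid)."""
    p = np.pad(a, 1, mode="constant")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]


def smooth_scale_prior(obs_scale, obs_conf, lam=1.0, n_iters=200):
    """Smooth a sparse scale map with an assumed Gaussian MRF.

    obs_scale: 2D array of observed scales (ignored where obs_conf is 0).
    obs_conf:  2D array of non-negative weights, 0 where nothing was detected.
    lam:       smoothness weight between neighboring cells (must be > 0).
    """
    obs_scale = np.asarray(obs_scale, dtype=float)
    obs_conf = np.asarray(obs_conf, dtype=float)
    if obs_conf.sum() <= 0:
        raise ValueError("at least one cell needs a positive confidence")

    # Number of in-grid neighbors per cell (2 at corners, 3 at edges, 4 inside).
    n_nbrs = _neighbor_sum(np.ones_like(obs_scale))

    # Start unobserved cells at the confidence-weighted mean scale.
    mean_scale = (obs_conf * obs_scale).sum() / obs_conf.sum()
    s = np.where(obs_conf > 0, obs_scale, mean_scale)

    # Jacobi update: each cell becomes a weighted average of its own
    # observation and the current estimates of its neighbors.
    for _ in range(n_iters):
        s = (obs_conf * obs_scale + lam * _neighbor_sum(s)) / (obs_conf + lam * n_nbrs)
    return s
```

Because every update is a weighted average, the smoothed values stay within the range of the observed scales, and cells without detections inherit a scale from their surroundings, which matches the idea of transferring scale knowledge to neighboring regions.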

Global Occlusion Reasoning

Because the human detector places bounding boxes without taking nearby individuals into consideration, the resulting detections overlap significantly. It is also possible that a bounding box does not cover an individual entirely, because the detector assigned a relatively higher confidence to a hypothesis with fewer parts. We therefore propose to infer the correct bounding boxes for all individuals in the scene through occlusion reasoning, whose goal is to expand and contract the bounding boxes so that they cover all of, and only, the visible parts of the respective individuals. Because of cyclic dependencies among humans in crowds (A occluding B, B occluding C, ...), we pose occlusion reasoning as a part-visibility inference problem over all individuals in an image, which can be solved in a global fashion through Binary Integer Programming (BIP).

Figure: Linear constraints for Binary Integer Programming: (a) shows the DPM model for a single person and the respective part numbers. To ensure that all parts selected by the integer program are contiguous, we use chain constraints between parts, shown in different colors. Similarly, models for two occluding persons are shown in (b). The overlap constraints ensure that occluded parts are rejected by the algorithm, giving bounding boxes that consist of visible parts only. Results of occlusion reasoning: Two individuals are shown in (c) with their bounding boxes for root and deformable parts. (d) After reasoning for occlusion, only visible parts are selected, resulting in better localization.
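The chain and overlap constraints can be made concrete with a toy example. The source does not state the exact objective or constraint forms, so this sketch assumes one binary variable per part, an objective that maximizes the summed part scores, chain constraints of the form x_child <= x_parent, and overlap constraints of the form x_a + x_b <= 1. The function name and the part identifiers are hypothetical. Brute-force enumeration stands in for a real BIP solver and is only practical for a handful of parts.

```python
from itertools import product


def solve_part_visibility(scores, chains, overlaps):
    """Select visible parts by exhaustive search over binary assignments.

    scores:   dict mapping part id -> detection score for that part.
    chains:   list of (parent, child) pairs; a child may be selected only
              if its parent is selected (assumed contiguity constraint).
    overlaps: list of (a, b) pairs of overlapping parts from different
              people; at most one of the two may be selected.
    Returns (assignment dict, objective value) for the best feasible choice.
    """
    parts = sorted(scores)
    best_x, best_val = None, float("-inf")
    for bits in product((0, 1), repeat=len(parts)):
        x = dict(zip(parts, bits))
        # Chain constraints: x[child] <= x[parent].
        if any(x[child] > x[parent] for parent, child in chains):
            continue
        # Overlap constraints: x[a] + x[b] <= 1.
        if any(x[a] + x[b] > 1 for a, b in overlaps):
            continue
        val = sum(scores[p] * x[p] for p in parts)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

For two people A and B with three chained parts each, where A's torso and legs overlap B's torso and legs, the solver keeps all of A and only B's head whenever A's overlapping parts score higher, which corresponds to B's bounding box contracting to its visible region.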

Results

We performed experiments on a challenging set of 108 crowd images downloaded from Flickr. The images cover a variety of scenes and crowd densities: some are sparse, while others are dense. Some of the images depict marathons containing humans in standing poses, while other images are of parks and offer more difficult poses. The qualitative and quantitative results are shown below:

Figure: White bounding boxes signify true detections (TP), black boxes indicate false alarms (FP), and green boxes represent missed detections (FN). In (a), the crowd is sparse with humans inclined at an angle due to camera position. In (b), the humans appear in varied poses, whereas (c) is characterized by severe occlusions. The proposed approach gives excellent results for all three scenarios.

Figure: Comparison of the proposed method (red) with several other human detection methods. The proposed method outperforms all other methods on both measures, despite using an underlying detector with lower performance than the comparison methods.