Finding the Weakest Link in Person Detectors 


Devi Parikh and Larry Zitnick



[Part Patch Dataset] [part detection visualizations] [paper] [slides]



Detecting people remains a popular and challenging problem in computer vision.  In this paper, we analyze parts-based models for person detection to determine which components of their pipeline could benefit the most if improved.  We accomplish this task by studying numerous detectors formed from combinations of components performed by human subjects and machines.  The parts-based model we study can be roughly broken into four components: feature detection, part detection, spatial part scoring and contextual reasoning including non-maximal suppression. Our experiments conclude that part detection is the weakest link for challenging person detection datasets.  Non-maximal suppression and context can also significantly boost performance.  However, the use of human or machine spatial models does not significantly or consistently affect detection accuracy. 




Person detection is an important but open and challenging problem in computer vision. Recently, person detectors have made significant progress using part-based models. Researchers have explored various feature representations of images, different appearance models for parts, sophisticated spatial modeling of object configurations, as well as expressive non-maximal suppression and context models. Each of these approaches proposes a complex set of interdependent components to produce the final detection results. While the added complexity of these approaches has led to increased performance, it makes understanding the role of each component in the final detection accuracy difficult.




We propose a thorough analysis of parts-based models to gain insight into which components of the pipeline could benefit the most if improved.  We accomplish this task by using human subjects to perform the individual components previously performed by the machine algorithm. For instance, instead of using a machine classifier such as a latent SVM trained on HOG descriptors to detect object parts, we use human subjects to label whether a small image patch contains a human's head, foot, torso, etc. Illustrations of the various tasks performed by human subjects are shown below:




Experiments and Results

We evaluate the detection accuracy of numerous detectors (see below), each composed of a different combination of components performed by human subjects or machine implementations.
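The space of such detectors can be pictured as assigning each pipeline component to either a human subject or the machine. A minimal sketch of this enumeration (the component names below are paraphrased from the pipeline described above; our experiments evaluate a particular subset of these combinations, not all of them):

```python
from itertools import product

# pipeline components analyzed in the paper (paraphrased names)
COMPONENTS = ["features", "part detection", "spatial model", "NMS/context"]

# each component can, in principle, be performed by a human or the machine
detectors = [dict(zip(COMPONENTS, who))
             for who in product(["human", "machine"], repeat=len(COMPONENTS))]
# 2^4 = 16 possible human/machine assignments in this simplified view
```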




Comparisons between various subsets of these detectors allow us to tease apart the influence of each component in the parts-based person detection pipeline. While we encourage you to take a look at the detailed comparisons in the paper, a summary of the results obtained on the PASCAL 2007 and INRIA datasets can be seen below. We find that part detection is the weakest link in part-based person detection. Non-maximal suppression also influences performance in a non-trivial way. However, the use of human or machine spatial models does not significantly affect detection accuracy.




Part Patch Dataset

Among the large amount of human data we collected as part of our experiments, we believe the following might be of interest to the community.


We had human subjects classify overlapping image patches into one of eight categories: head, torso, arm, hand, leg, foot, other-part-of-person, not-a-person. The patches were extracted from 50 INRIA and 100 PASCAL (2007) images, and were displayed in isolation and in random order so that no contextual information from the image was available to the subjects. We extracted patches from the original high-resolution images as well as low-resolution versions. Before extracting the patches, the high- and low-resolution images were transformed into one of the following representations: color (regular), grayscale and normalized gradient. This resulted in a total of 45,316 x 6 = 271,896 patches. 10 human subjects classified every patch into one of the 8 categories on Amazon's Mechanical Turk.
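As a rough sketch of how such a patch pool might be generated, the following builds the six representations (high- and low-resolution versions, each in color, grayscale, and normalized-gradient form) and slides an overlapping window over each. The patch size, stride, downsampling factor, and transform details are illustrative assumptions, not the exact settings used in our experiments:

```python
import numpy as np

def to_grayscale(img):
    # luminance-weighted average of the RGB channels
    return img @ np.array([0.299, 0.587, 0.114])

def normalized_gradient(gray):
    # gradient magnitude, scaled into [0, 1]
    gy, gx = np.gradient(gray)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return mag / (mag.max() + 1e-8)

def extract_patches(img, patch=32, stride=16):
    # overlapping square patches over the first two (spatial) dimensions
    h, w = img.shape[:2]
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

rgb = np.random.rand(128, 96, 3)   # stand-in for a dataset image
low = rgb[::2, ::2]                # naive 2x downsampling (assumption)

# the six representations: {high, low} resolution x {color, gray, gradient}
reps = []
for im in (rgb, low):
    gray = to_grayscale(im)
    reps += [im, gray, normalized_gradient(gray)]

patches = [p for rep in reps for p in extract_patches(rep)]
```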


A snapshot of the data can be seen below, showing example patches classified by most subjects as head, torso, leg and none.



Similarly, we also had 10 human subjects classify overlapping image sub-windows (total of 6,218 x 6 = 37,308 windows) as containing a person or not (similar to 'root' detection). As with parts, the sub-windows were extracted from high and low resolution color, grayscale and normalized gradient images. 


We provide this part (patch) and root (window) classification data as the Part Patch dataset.


** Download ** Part Patch Dataset [89.3 MB]



A subset of our human studies required human subjects to detect people using a pre-computed set of parts. The parts may be detected by other humans, or by a machine. In order to ensure that no prior information other than the detected parts is used by human subjects, we created visualizations that display the part detections but none of the other information in the image. An example visualization can be seen below. 




** Browse ** Visualizations of some images using the human and machine detected parts can be viewed here: INRIA_50  PASCAL2007_100. The first six columns display human detected parts (on high-resolution regular, grayscale, and normalized gradient images, followed by low-resolution regular, grayscale, and normalized gradient images), and the last column shows machine detected parts on high-resolution images using the detectors of Felzenszwalb et al. 2010. For human detected parts, the colors correspond to the different parts of a person (red: head, green: torso, blue: arm, yellow: hand, magenta: leg, cyan: foot, white: root (person), black: none). Each patch is displayed with a color corresponding to the category that received the most votes across the 10 subjects. The intensity of the color corresponds to the number of subjects that selected the class. For machine detected parts, the six colors are arbitrarily assigned to six parts, and the intensity of the color corresponds to the score of the part detection.
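The vote-based coloring for the human-detected parts can be sketched as follows. The exact RGB values and the simple linear intensity scaling are illustrative assumptions based on the legend above:

```python
# color legend for the part visualizations (assumed RGB values)
PART_COLORS = {
    "head":  (255, 0, 0),      # red
    "torso": (0, 255, 0),      # green
    "arm":   (0, 0, 255),      # blue
    "hand":  (255, 255, 0),    # yellow
    "leg":   (255, 0, 255),    # magenta
    "foot":  (0, 255, 255),    # cyan
    "root":  (255, 255, 255),  # white
    "none":  (0, 0, 0),        # black
}

def patch_color(votes, n_subjects=10):
    """Color a patch by its majority label; intensity scales with vote count.

    votes: dict mapping category name -> number of subjects who chose it.
    """
    label = max(votes, key=votes.get)          # category with the most votes
    strength = votes[label] / n_subjects       # fraction of subjects agreeing
    return tuple(int(c * strength) for c in PART_COLORS[label])
```

For example, a patch that 10 of 10 subjects labeled "head" would render as full-intensity red, while one with a weaker majority renders as a dimmer shade of its winning category's color.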




D. Parikh and C. L. Zitnick

Finding the Weakest Link in Person Detectors

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011

[poster] [slides]





This material is based upon work supported in part by the National Science Foundation under Grant No. 1115719. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. 


[Thanks to Yong Jae Lee for the webpage template]