Understanding Predictions of Structured Probabilistic Vision Systems

Qing Sun
Spring 2014 ECE 6504 Probabilistic Graphical Models: Class Project
Virginia Tech

Fig. 1 Overview of our approach


In this project, we address a fundamental and difficult question: why does a learning-based vision system do what it does? We refer to a system that can explain its behavior as a transparent vision system. Exploiting a natural property of the dual Structured SVM, we can query which (subsets of) training images explain what the system believes and why. We conduct experiments on two datasets: iCoseg and the CMU Geometric Context dataset. Results show that the queried images are either similar to the test images or contain sub-structures that may account for the differences among hypotheses.

We have witnessed significant progress in machine perception over the past two decades. Today, there are commercial systems for face detection (Face.com, Apple iPhone), speech recognition (Siri), and handwriting recognition (Microsoft OneNote). But when today's machine perception systems fail, they fail in a spectacularly disgraceful manner, without warning or explanation, leaving the user staring at an incoherent output, wondering why the system did what it did. This is due to a lack of transparency: the machine learning and computer vision communities today place far more emphasis on good predictive performance than on interpretability. Recently, large-scale annotated datasets such as PASCAL [1], ImageNet [2], and SUN [3] have allowed us to train ever more powerful machine perception systems from millions of images. However, as the datasets scale, it becomes increasingly difficult to understand why some methods work better than others, what kinds of mistakes they make, and, most importantly, why they predict what they predict.

Challenge. Unfortunately, there is no consensus on even a definition of transparency. In order to cope with significant levels of noise and uncertainty, machine perception systems are probabilistic models that hold beliefs about all possible outputs. In a Bayesian view, this is done by combining evidence provided by the data (e.g., how much does this image patch look like a person?) with prior models of the world (e.g., people are usually upright with their heads above their torsos) to form posterior beliefs about, say, the locations of people in an image. These beliefs are then used for prediction. Thus, for each possible hypothesis (output) of a specific test image, we are trying to figure out which (subsets of) training images support this belief the most, as illustrated in Fig. 1.


1. Notation

CRF Model. Let $G = (V, E)$ be a graph defined over the output variables $y = \{y_1, \dots, y_n\}$, one variable per vertex. Let $\theta_i(y_i)$ be the unary term expressing the local confidence at site $i$ for the label $y_i$, and $\theta_{ij}(y_i, y_j)$ be the pairwise term expressing the compatibility of labels $y_i$ and $y_j$ at adjacent vertices $(i, j) \in E$. The score for any configuration $y$ is given by the sum $S(y) = \sum_{i \in V} \theta_i(y_i) + \sum_{(i,j) \in E} \theta_{ij}(y_i, y_j)$, and its probability is given by the Gibbs distribution $P(y) = \frac{1}{Z}\, e^{S(y)}$, where $Z = \sum_{y'} e^{S(y')}$ is the partition function or normalization constant.
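The score and Gibbs distribution above can be sketched concretely. The following is a minimal illustration (not the project's actual implementation), assuming a shared pairwise table across edges and brute-force enumeration of configurations, which is only feasible for toy-sized models:

```python
import itertools
import numpy as np

def crf_score(unary, pairwise, edges, y):
    """Score S(y) of a labeling y under a pairwise CRF.

    unary:    (n_sites, n_labels) array of unary terms theta_i(y_i)
    pairwise: (n_labels, n_labels) array of pairwise terms theta_ij(., .)
              (shared across all edges for simplicity)
    edges:    list of (i, j) vertex pairs
    y:        length-n_sites sequence of integer labels
    """
    score = sum(unary[i, y[i]] for i in range(len(y)))
    score += sum(pairwise[y[i], y[j]] for (i, j) in edges)
    return float(score)

def gibbs_probability(unary, pairwise, edges, y, all_configs):
    """P(y) = exp(S(y)) / Z, with Z computed by brute-force enumeration."""
    scores = np.array([crf_score(unary, pairwise, edges, c) for c in all_configs])
    Z = np.exp(scores).sum()  # partition function
    return float(np.exp(crf_score(unary, pairwise, edges, y)) / Z)
```

For the real segmentation and geometric-labeling models, $Z$ is intractable to enumerate; only the score $S(y)$ is needed for MAP-style inference.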

2. Generating Diverse Plausible Hypotheses

We want to find a set of $M$ solutions $\{y^{(1)}, \dots, y^{(M)}\}$ that are plausible (i.e., high-probability) and mutually non-redundant (i.e., diverse). We approach this with a greedy algorithm: "find me a configuration $y$ that is highly probable but different from everything I have seen so far".

Serial Diverse M-Best (DivMBest) [4]. Let $y^{(1)}$ be the best solution (or MAP), $y^{(2)}$ be the second solution found, and so on. At each step $m$, the next best solution is the highest-scoring configuration that is at least $k$-"dissimilar" from each previous solution, where dissimilarity is measured under some function $\Delta(\cdot, \cdot)$:

$$y^{(m)} = \operatorname*{argmax}_{y} \; S(y) \quad \text{s.t.} \quad \Delta(y, y^{(i)}) \geq k, \quad i = 1, \dots, m-1.$$
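The greedy loop above can be sketched as follows. This is a brute-force toy version for illustration only (DivMBest [4] instead folds the $\Delta$ constraints into the MAP inference problem via Lagrangian relaxation), assuming Hamming distance as the dissimilarity $\Delta$:

```python
import itertools

def hamming(a, b):
    """Dissimilarity Delta(a, b): number of sites where the labelings differ."""
    return sum(x != y for x, y in zip(a, b))

def div_m_best(score_fn, n_sites, n_labels, M, k=1):
    """Greedy Diverse M-Best by brute-force enumeration (toy-sized problems).

    At step m, pick the highest-scoring labeling that is at least k
    Hamming-dissimilar from every previously selected solution.
    """
    configs = list(itertools.product(range(n_labels), repeat=n_sites))
    solutions = []
    for _ in range(M):
        feasible = [c for c in configs
                    if all(hamming(c, s) >= k for s in solutions)]
        if not feasible:
            break  # no configuration satisfies all dissimilarity constraints
        solutions.append(max(feasible, key=score_fn))
    return solutions
```

The first solution returned is exactly the MAP, since no constraints are active at step one.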

3. Querying a Good Set of Training Images to Support a Hypothesis

Assume the score of each hypothesis to be linear in some combined feature representation of inputs and outputs, i.e., $S(x, y) = w^\top \phi(x, y)$. In principle, once we have learned a weight vector $w$, we have completely specified what our model believes in terms of the score. According to the dual of the Structured SVM [5], we have

$$w = \sum_{i} \sum_{\bar{y} \neq y_i} \alpha_{i\bar{y}} \, \delta\phi_i(\bar{y}),$$

where $\delta\phi_i(\bar{y}) = \phi(x_i, y_i) - \phi(x_i, \bar{y})$ and $\alpha_{i\bar{y}} \geq 0$ are the dual variables. The score $S(x, y)$ can then be rewritten as

$$S(x, y) = \sum_{i} \sum_{\bar{y} \neq y_i} \alpha_{i\bar{y}} \, \delta\phi_i(\bar{y})^\top \phi(x, y).$$

Then the training image that most supports hypothesis $y$ for a given test image $x$ can be obtained via

$$i^* = \operatorname*{argmax}_{i} \; \sum_{\bar{y} \neq y_i} \alpha_{i\bar{y}} \, \delta\phi_i(\bar{y})^\top \phi(x, y).$$

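Given a trained model's dual variables, the per-image decomposition of the score is a simple grouped sum. A minimal sketch (the container layout here is an assumption, not the project's actual data structures):

```python
import numpy as np

def support_scores(alphas, delta_phis, phi_xy):
    """Per-training-image contribution to the SSVM score S(x, y).

    alphas:     dict mapping training index i -> 1-D array of dual variables
                alpha_{i, ybar} over the constraints active for example i
    delta_phis: dict mapping training index i -> (n_constraints, d) array of
                delta-phi_i(ybar) = phi(x_i, y_i) - phi(x_i, ybar)
    phi_xy:     length-d joint feature vector phi(x, y) of the test hypothesis
    """
    scores = {}
    for i in alphas:
        # image i's contribution: sum_ybar alpha_{i,ybar} * dphi_i(ybar).phi(x,y)
        scores[i] = float(alphas[i] @ (delta_phis[i] @ phi_xy))
    return scores

def most_supporting_image(alphas, delta_phis, phi_xy):
    """i* = argmax over training images of the per-image support score."""
    scores = support_scores(alphas, delta_phis, phi_xy)
    return max(scores, key=scores.get)
```

Note that the full score $S(x, y)$ is just the sum of these per-image contributions, so the decomposition is exact, not an approximation.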

Fig. 2 iCoseg Dataset


4. Experiments

Foreground-Background Segmentation. We used the co-segmentation dataset iCoseg of Batra et al. [6]. iCoseg consists of 37 groups of related images mimicking typical consumer photograph collections. Each group may be thought of as an "event" (e.g., images from a baseball game, a safari, etc.). The dataset provides pixel-level ground-truth foreground-background segmentations for each image. We used 9 difficult groups from iCoseg containing 166 images in total (83 train images and 83 test images). For each test image, we query the training image that supports it the most.

Geometric Labeling. We used the CMU Geometric Context dataset of Hoiem et al. [7], with 150 train images, 50 test images, and 7 categories: "ground", "sky", "left", "center", "right", "porous", and "solid".

Qualitative Results. Fig. 2 shows that the queried images look very similar to the test images in terms of appearance or structure. Interestingly, the two groups of animals in the last two examples are supported by two pyramids. Fig. 3 shows similar results on the CMU Geometric Context dataset, and the bottom example further verifies that the MAP solution may not be the best one. On the other hand, some test images cannot be supported by images similar to them (in terms of distance in HOG or DeCAF feature space). As shown in Fig. 4, some local structures in training images may account for the differences between hypotheses by enforcing constraints in the Structured SVM, but this needs further theoretical justification.
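The notion of "similar in HOG or DeCAF space" used above is a simple nearest-neighbor check in descriptor space. A minimal sketch, assuming descriptors have already been computed and stacked as plain feature vectors (the descriptor extraction itself is outside this snippet):

```python
import numpy as np

def nearest_training_image(test_feat, train_feats):
    """Index of the training image closest to the test image in feature space.

    test_feat:   length-d descriptor of the test image (e.g., HOG or DeCAF)
    train_feats: (n_train, d) array of training-image descriptors
    """
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    return int(np.argmin(dists))
```

Comparing this baseline index against the dual-SSVM query $i^*$ is what distinguishes the "visually similar" cases (Figs. 2-3) from the "supporting sub-structure" cases (Fig. 4).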



Fig. 3 CMU Geometric Context Dataset



Fig. 4 Difference between hypotheses



[1] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," IJCV, vol. 88, pp. 303-338, 2010.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.

[3] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN Database: Large-Scale Scene Recognition from Abbey to Zoo," in CVPR, 2010.

[4] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich, "Diverse M-Best Solutions in Markov Random Fields," in ECCV, 2012.

[5] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large Margin Methods for Structured and Interdependent Output Variables," JMLR, vol. 6, pp. 1453-1484, 2005.

[6] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance," in CVPR, 2010.

[7] D. Hoiem, A. A. Efros, and M. Hebert, "Recovering Surface Layout from an Image," IJCV, vol. 75, no. 1, 2007.