Qing Sun
ECE 6504 Probabilistic Graphical Models: Class Project, Spring 2014
Virginia Tech
Fig. 1 Overview of our approach
In this project, we try to address a fundamental and difficult question: "why does a learning-based vision system do what it does?" We refer to a system that explains its behavior as a transparent vision system. Exploiting the dual form of the Structured SVM, we can query which (subsets of) training images explain what the system believes and why. We run experiments on two datasets: iCoseg and the CMU Geometric Context dataset. Results show that the queried images are either similar to the test images or contain sub-structures that may contribute to the differences among hypotheses.
We have witnessed significant progress in machine perception in the past two decades. Today, there are commercial systems for face detection (Face.com, Apple iPhone), speech recognition (Siri), and handwriting recognition (Microsoft OneNote). But when today's machine perception systems fail, they fail in a spectacularly disgraceful manner, without warning or explanation, leaving the user staring at an incoherent output and wondering why the system did what it did. This is due to a lack of transparency: the machine learning and computer vision communities today put much emphasis on good predictive performance, but little on explaining predictions. Recently, large-scale annotated datasets such as PASCAL [1], ImageNet [2], and SUN [3] have allowed us to train ever more powerful machine perception systems from millions of images. However, as the datasets scale, it becomes increasingly difficult to understand why some methods work better than others, what kinds of mistakes they make, and, most importantly, why they predict what they predict.
Challenge. Unfortunately, there is no consensus on even a definition of transparency. In order to cope with significant levels of noise and uncertainty, machine perception systems are probabilistic models, holding beliefs about all possible outputs. In a Bayesian view, this is done by combining evidence provided by the data (e.g., how much does this image patch look like a person?) with prior models of the world (e.g., people are usually upright, with their heads above their torsos) to form posterior beliefs about, say, the locations of people in an image. These beliefs are then used for predictions. Thus, for each possible hypothesis (output) for a specific test image, we try to figure out which (subsets of) training images support this belief the most, as illustrated in Fig. 1.
1. Notation
CRF Model. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph defined over the output variables $\mathbf{y} = \{y_1, \dots, y_n\}$, i.e., one node per output variable. Let $\theta_i(y_i)$ be the unary term expressing the local confidence at site $i$ for the label $y_i$, and $\theta_{ij}(y_i, y_j)$ be the pairwise term expressing the compatibility of labels $y_i$ and $y_j$ at adjacent vertices. The score for any configuration $\mathbf{y}$ is given by the sum $S(\mathbf{y}) = \sum_{i \in \mathcal{V}} \theta_i(y_i) + \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(y_i, y_j)$, and its probability is given by the Gibbs distribution $P(\mathbf{y}) = \frac{1}{Z} \exp\big(S(\mathbf{y})\big)$, where $Z$ is the partition function or normalization constant.
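As a concrete illustration, the score and Gibbs distribution above can be computed exactly on a toy model. This is only a sketch: the 3-node binary chain and its unary/pairwise values are made up for illustration, and a real CRF over image sites would be far too large to enumerate.

```python
import itertools
import math

# Toy chain CRF with 3 binary nodes: S(y) = sum_i theta_i(y_i) + sum_(i,j) theta_ij(y_i, y_j).
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.8]]             # unary[i][label] (illustrative values)
edges = [(0, 1), (1, 2)]
pairwise = {e: [[0.3, 0.0], [0.0, 0.3]] for e in edges}  # small bonus for agreeing labels

def score(y):
    s = sum(unary[i][y[i]] for i in range(len(y)))
    s += sum(pairwise[(i, j)][y[i]][y[j]] for (i, j) in edges)
    return s

# Gibbs distribution: P(y) = exp(S(y)) / Z, with Z summed over all configurations.
configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(y)) for y in configs)
probs = {y: math.exp(score(y)) / Z for y in configs}

best = max(configs, key=score)   # MAP configuration
print(best, round(probs[best], 3))
```

Exhaustive enumeration of $Z$ is only possible because the label space here has $2^3 = 8$ configurations; in practice one relies on (approximate) MAP inference instead.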
2. Generating Diverse
Plausible Hypotheses
We
want to find a set of M solutions that are plausible (i.e., high-probability)
and mutually non-redundant (i.e.,diverse).
We approach this with a greedy algorithm-“Find me a configuration that is high probable but different from
anything else I have seen so far”.
Serial Diverse M-Best (DivMBest) [4]. Let be
the best solution (or MAP), be the second solution found and so on. At
each step m, the next best solution is the highest scoring point that is at least “dissimilar” from each previous solution ,
where dissimilarity is measured under some function ∆(∙,∙) :
3. Querying a Good Set of Training Images to Support a Hypothesis
Assume the score for each hypothesis is linear in some combined feature representation of inputs and outputs, i.e., $S(\mathbf{x}, \mathbf{y}) = \mathbf{w}^{\top} \psi(\mathbf{x}, \mathbf{y})$, where, by the dual form of the Structured SVM [5], $\mathbf{w} = \sum_{n} \sum_{\bar{\mathbf{y}}} \alpha_{n\bar{\mathbf{y}}} \big[ \psi(\mathbf{x}_n, \mathbf{y}_n) - \psi(\mathbf{x}_n, \bar{\mathbf{y}}) \big]$. The score can then be rewritten as a sum of per-training-example contributions, $S(\mathbf{x}, \mathbf{y}) = \sum_{n} \mathbf{w}_n^{\top} \psi(\mathbf{x}, \mathbf{y})$ with $\mathbf{w}_n = \sum_{\bar{\mathbf{y}}} \alpha_{n\bar{\mathbf{y}}} \big[ \psi(\mathbf{x}_n, \mathbf{y}_n) - \psi(\mathbf{x}_n, \bar{\mathbf{y}}) \big]$. Then the training image that supports hypothesis $\mathbf{y}$ the most for a given test image $\mathbf{x}$ can be obtained via $n^* = \arg\max_{n} \; \mathbf{w}_n^{\top} \psi(\mathbf{x}, \mathbf{y})$.
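Given the per-example weight contributions $\mathbf{w}_n$ from a trained dual SSVM, the query itself is just a dot product and an argmax. A minimal sketch, assuming the $\mathbf{w}_n$ have already been precomputed from the dual variables; all numbers below are illustrative stand-ins for real features:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Per-training-image weight contributions w_n (precomputed from the SSVM duals:
# w_n = sum over ybar of alpha[n][ybar] * (psi(x_n, y_n) - psi(x_n, ybar))).
w_n = [[1.0, 0.0],
       [0.0, 2.0],
       [-1.0, 1.0]]
psi_test = [0.5, 1.0]   # joint feature psi(x, y) of the test hypothesis

# support[n] = w_n . psi(x, y); the most-supporting training image is the argmax.
support = [dot(w, psi_test) for w in w_n]
best_n = max(range(len(w_n)), key=lambda n: support[n])
print(best_n, support)
```

Sorting `support` instead of taking a single argmax yields a ranked subset of supporting training images, matching the "(subsets of) training images" framing above.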
Fig. 2 iCoseg Dataset
Foreground-Background Segmentation. We used the co-segmentation dataset iCoseg of Batra et al. [6]. iCoseg consists of 37 groups of related images mimicking typical consumer photograph collections. Each group may be thought of as an "event" (e.g., images from a baseball game, a safari, etc.). The dataset provides pixel-level ground-truth foreground-background segmentations for each image. We used 9 difficult groups from iCoseg containing 166 images in total (83 train images and 83 test images). For each test image, we query the training image that supports it the most.
Geometric Labeling. We used the CMU Geometric Context dataset of Hoiem et al. [7], with 150 train images, 50 test images, and 7 categories: "ground", "sky", "left", "center", "right", "porous", and "solid".
Qualitative Results. Fig. 2 shows that the queried images look very similar to the test images in terms of appearance or structure. It is also interesting that the two groups of animals are supported by two pyramids in the last two examples. Fig. 3 shows similar results on the CMU Geometric Context dataset, and the bottom example further verifies that the MAP solution may not be the best one. On the other hand, some test images cannot be supported by images similar to them (in terms of distance in HOG or DeCAF space). As shown in Fig. 4, some local structures in the training images may contribute to the differences between hypotheses by enforcing constraints in the Structured SVM. However, this needs further theoretical justification.
Fig. 3 CMU Geometric Context Dataset
Fig. 4 Difference between hypotheses
[1] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," IJCV, vol. 88, pp. 303-338, 2010.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.
[3] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010.
[4] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich, "Diverse M-Best Solutions in Markov Random Fields," in ECCV, 2012.
[5] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," JMLR, vol. 6, pp. 1453-1484, 2005.
[6] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance," in CVPR, 2010.
[7] D. Hoiem, A. A. Efros, and M. Hebert, "Recovering surface layout from an image," IJCV, vol. 75, no. 1, 2007.