Xiao Lin · Spring 2014 · ECE 6504 Probabilistic Graphical Models: Class Project · Virginia Tech
Goal
The goal of this project is to evaluate the role prior knowledge about attributes can play in constructing joint part-attribute models.
Red, small, with a cone-shaped beak: when we look at a cardinal, its attributes and parts are tightly coupled.
Some attributes naturally localize to parts, some describe the layout of parts, and others describe the object through the attributes of its parts. (Figure)
Just as translation invariance and localized recognition form the basis of sliding-window and convolutional neural network approaches, these intuitions have the potential to keep a joint part-attribute model as simple as possible.
Before making claims about the design of such a joint model (e.g., [1]), however, we need to evaluate these intuitions and examine how parts and attributes help each other.
Approach
1. Evaluating the intuitions
We evaluate each intuition by assessing attribute prediction performance using only the corresponding information source.
For attributes that localize to parts, we compare the attribute prediction performance using only the corresponding parts with using the entire object.
For attributes that depend on the layout of parts, we compare using only the layout with using the entire object.
For attributes that depend on several part attributes, we compare using only the predicted part-attributes with using the entire object.
In addition, we also explore performance using combinations of these information sources (see the sketch below).
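The comparison itself is mechanical; the following is a minimal sketch of the protocol, assuming one binary label vector per attribute group. The random matrices merely stand in for precomputed DeCAF and layout features, and all names here are ours, not the project code:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import average_precision_score

    rng = np.random.RandomState(0)
    y_train, y_test = rng.randint(0, 2, 200), rng.randint(0, 2, 100)
    sources = {                       # stand-ins for precomputed feature blocks
        "object": (rng.randn(200, 4096), rng.randn(100, 4096)),
        "part":   (rng.randn(200, 4096), rng.randn(100, 4096)),
        "layout": (rng.randn(200, 5),    rng.randn(100, 5)),
    }
    for name, (X_train, X_test) in sources.items():
        clf = LinearSVC(C=1.0).fit(X_train, y_train)       # linear SVM, C=1
        ap = average_precision_score(y_test, clf.decision_function(X_test))
        print(name, "AP =", round(ap, 3))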
2. How part and attributes would help each other
In this study we focus on two specific types of interaction: how parts help predict attributes, and how attributes help localize parts.
We study how parts help predict attributes by comparing attribute classification performance with and without the part layout and the predicted part attributes.
We study how attributes help localize parts by comparing part prediction performance with and without attribute predictions.
3. Features and classifiers
For parts and objects we extract DeCAF [2] features; for layout we extract the relative location of each part centroid within the object bounding box, explicitly mapped to second order ([x, y, x², y², xy]). For all tasks we use a linear SVM with C=1 as our classifier.
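For concreteness, here is a minimal sketch of the layout feature, assuming the part centroids and the object bounding box are given (the helper name is ours):

    import numpy as np

    def layout_features(centroids, box):
        """centroids: (P, 2) part centers; box: object box (x0, y0, x1, y1)."""
        x0, y0, x1, y1 = box
        x = (centroids[:, 0] - x0) / (x1 - x0)   # relative x within the box
        y = (centroids[:, 1] - y0) / (y1 - y0)   # relative y within the box
        # second-order expansion [x, y, x^2, y^2, xy] per part, concatenated
        return np.stack([x, y, x**2, y**2, x * y], axis=1).ravel()

    # e.g., two part centroids inside a 100x100 object box
    print(layout_features(np.array([[25., 40.], [70., 80.]]), (0, 0, 100, 100)))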
We hope these features and classifiers are good enough to reflect how well current computer vision systems can capture and utilize the corresponding information.
Results
We perform experiments on the Caltech-UCSD Birds-200-2011 dataset [3]. The 28 attribute categories are assigned to part/layout/part-attribute groups based on their names: 23 attribute groups localize to parts, 2 groups may depend on the layout, and 3 groups may be based on combinations of part attributes.
1. Evaluating the intuitions
From the results below we can draw the following conclusions:
1) The classification performance of the part and part-attribute groups is comparable to that using the entire object, which means we can safely construct graph structures where these part-oriented attributes depend only on parts or part attributes.
2) Ground-truth layout outperforms the entire object on predicting layout-oriented attributes, which means layout is an important source of information when predicting those attributes.
3) The part-attribute groups can be predicted reliably using part attributes alone, which means that, at a small loss in accuracy, they can be freed from the long concatenated features of all parts.
2. How part and attributes would help each other
Due to time constraints, we did not run the complete part-detection setup; instead we evaluate a classification setup using all ground-truth part boxes plus 15 random boxes per image (sketched below). Random boxes and parts from different categories serve as negatives in both training and testing.
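The write-up does not specify how the random boxes are drawn; as one plausible reading, here is a sketch that samples 15 axis-aligned boxes per image (the minimum box size is an assumption):

    import numpy as np

    def random_boxes(img_w, img_h, n=15, rng=np.random):
        """Sample n random boxes (x0, y0, x1, y1) inside an img_w x img_h image."""
        boxes = []
        for _ in range(n):
            w = rng.randint(8, img_w + 1)        # assumed minimum size of 8 px
            h = rng.randint(8, img_h + 1)
            x0 = rng.randint(0, img_w - w + 1)
            y0 = rng.randint(0, img_h - h + 1)
            boxes.append((x0, y0, x0 + w, y0 + h))
        return boxes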
From the results below we can see that part information improves attribute prediction; attributes, however, do not seem to play a role in localizing parts.
In more detail:
Following the spirit of [1], we extract features from the object bounding box (root), from ground-truth part locations with hypothesized bounding boxes (part), and from their layout (loc). We further predict part attributes using the part features, and those predictions are then used as features themselves (resp). We evaluate attribute prediction quality with combinations of these features (a sketch of the resp construction follows).
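A minimal sketch of how the resp block can be built, assuming per-part binary attribute labels (function and variable names are ours):

    import numpy as np
    from sklearn.svm import LinearSVC

    def resp_features(X_part_train, Y_attr_train, X_part_test):
        """Train one part-attribute SVM per column of Y_attr_train and return
        the stacked decision values on X_part_test as the resp feature block."""
        cols = []
        for a in range(Y_attr_train.shape[1]):
            clf = LinearSVC(C=1.0).fit(X_part_train, Y_attr_train[:, a])
            cols.append(clf.decision_function(X_part_test))
        return np.stack(cols, axis=1)

    # Feature combinations are plain concatenations, e.g. root+part+resp:
    # X = np.hstack([X_root, X_part, X_resp])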
The per-image attribute annotations are quite noisy. "Expert" denotes using the typical per-category attribute responses assembled by experts as the prediction for per-image attributes; these serve as upper bounds for the attribute classification task.
Some parts are not visible and therefore not available in the ground truth; we ignore those entries during training and set their features to 0 during prediction. Some part labels refer to parts with the same appearance (left wing vs. right wing); we share the attribute classifier weights of those parts, effectively training a single "wing" classifier on both (see the sketch below).
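Both conventions are easy to express in code; a minimal sketch under our own naming:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Zero-fill features of parts that are not visible (used at prediction time);
    # at training time, invisible entries are simply dropped instead.
    def fill_missing(X, visible):
        return np.where(visible[:, None], X, 0.0)   # X: (N, D), visible: (N,) bool

    # Pool mirrored parts into one shared classifier, e.g. a single "wing" model
    # trained on both left- and right-wing examples.
    def shared_classifier(X_left, y_left, X_right, y_right):
        X = np.vstack([X_left, X_right])
        y = np.concatenate([y_left, y_right])
        return LinearSVC(C=1.0).fit(X, y)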
Download the complete results on all 28 attribute groups with combinations of {root, part, loc, resp}, and the performance on all 15 parts with and without attributes: result_attribute.xlsx, result_parts.xlsx.
Some additional observations include:
1. Layout does not help when combined with other features.
2. Using the part features of all 15 parts, despite being very high dimensional, still performs much better than using part attributes. That means there is information in the parts that those part attributes were unable to capture.
3. We also tried predicting attributes from "fake" parts that are not actually associated with the corresponding attributes, and as expected the performance is poor. Correct prior knowledge about the attributes is indeed useful. That said, it is non-trivial to match the classification performance of the entire object using a single part.
4. Somehow layout helps when predicting the shape of the tail.
References
[1] Ning Zhang, Ryan Farrell, Forrest Iandola, and Trevor Darrell. "Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction." International Conference on Computer Vision (ICCV), 2013.
[2] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." arXiv preprint arXiv:1310.1531, October 2013.
[3] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. "The Caltech-UCSD Birds-200-2011 Dataset." Computation & Neural Systems Technical Report, CNS-TR-2011-001, 2011.