Object-Proposal Evaluation Protocol is 'Gameable'

Neelima Chavali, Harsh Agrawal, Aroma Mahendru, Dhruv Batra

@misc{1505.05836,
Author = {Neelima Chavali and Harsh Agrawal and Aroma Mahendru and Dhruv Batra},
Title = {Object-Proposal Evaluation Protocol is 'Gameable'},
Year = {2015},
Eprint = {arXiv:1505.05836},
}

Despite the two different interpretations and goals of the term 'object proposals' (category-independent object proposals and detection proposals), there exists only a single evaluation protocol. By evaluating only on a specific set of categories in a partially annotated dataset, this protocol fails to capture the performance of a proposal algorithm on all the remaining object categories that are present in the test set but not annotated in the ground truth. This makes the evaluation protocol 'gameable', i.e., susceptible to manipulation, both intentional and unintentional.

For example:

[Figure 1 panels: Method 1 (computed recall 0.6) vs. Method 2 (computed recall 1.0); green = annotated objects, red = unannotated objects]

Figure 1: Method 1 visually seems to recall more categories, such as plates and glasses, that Method 2 missed. Despite that, the computed recall for Method 2 is higher because it recalled all instances of the PASCAL categories present in the ground truth. Note that both methods generate the same number of proposals in this figure.

[Figure 2 panels: Method 1 (computed recall 0.5) vs. Method 2 (computed recall 0.83); green = annotated objects, red = unannotated objects]

Figure 2: Again, Method 1 visually seems to recall more categories that Method 2 missed. Moreover, Method 1 generates more proposals than Method 2, so intuitively its recall should be higher. However, the computed recall for Method 2 is significantly higher, which is counter-intuitive. This is because Method 2 recalls more objects from the annotated PASCAL categories.
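To make the numbers in these examples concrete, here is a minimal Python sketch of the standard metric: recall is the fraction of annotated ground-truth boxes that are covered by some proposal at a chosen IoU threshold. This is an illustrative sketch, not the paper's or the library's code; the boxes and the 0.7 threshold are made-up assumptions. Its key property is that unannotated objects never enter the computation, which is exactly what makes the number gameable.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as [xmin, ymin, xmax, ymax].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall(annotated_gt, proposals, thresh=0.7):
    # Fraction of *annotated* ground-truth boxes covered by at least one
    # proposal. Unannotated objects are invisible to this number.
    hits = sum(1 for gt in annotated_gt
               if any(iou(gt, p) >= thresh for p in proposals))
    return hits / len(annotated_gt) if annotated_gt else 0.0

# Toy image: one annotated (PASCAL) object and, implicitly, several
# unannotated objects that the metric never sees.
annotated_gt = [[50, 60, 200, 220]]                    # e.g. a 'person' box
method_1 = [[55, 65, 195, 150], [300, 40, 380, 120]]   # spreads proposals over the image
method_2 = [[48, 58, 202, 222], [48, 58, 202, 222]]    # targets the annotated category

print(recall(annotated_gt, method_1))  # 0.0: the annotated box is only loosely covered at IoU 0.7
print(recall(annotated_gt, method_2))  # 1.0: the only annotated box is tightly covered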

In this work, we have demonstrated this gameability via a simple thought experiment and conducted a thorough evaluation of existing object proposal methods on three densely annotated datasets. Furthermore, since densely annotating a dataset is a tedious and costly task, we have proposed a set of diagnostic tools to plug this vulnerability in the current protocol.

Performance of different proposal methods and two 'fraudulent' proposal methods (DPM and RCNN) on PASCAL-Context

Performance on PASCAL-Context, only 20 PASCAL classes annotated
Performance on PASCAL-Context, only 60 non-PASCAL classes annotated
Performance on PASCAL-Context, all (20+60) classes annotated

Performance of different proposal methods and two 'fraudulent' proposal methods (DPM and RCNN) on MS COCO

Performance on MS COCO, only 20 PASCAL classes annotated
Performance on MS COCO, only 60 non-PASCAL classes annotated
Performance on MS COCO, all (20+60) classes annotated

Measuring 'bias-capacity' on MS COCO

Area under recall vs #candidates for #trained categories, on MS COCO
Area under recall vs #trained categories, on MS COCO
Improvement in area under recall vs #trained categories, on MS COCO
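The 'bias-capacity' measurement summarizes each recall-vs-#candidates curve by its area (AUC) and tracks how that area changes as the proposal method is trained on more and more categories. Below is a hedged Python sketch of that bookkeeping; the recall values, candidate budgets, and category counts are hypothetical placeholders, not numbers from the paper.

import numpy as np

def area_under_recall(recalls, num_candidates):
    # Area under the recall-vs-#candidates curve, normalized by the
    # range of candidate budgets so the result stays in [0, 1].
    recalls = np.asarray(recalls, dtype=float)
    num_candidates = np.asarray(num_candidates, dtype=float)
    return np.trapz(recalls, num_candidates) / (num_candidates[-1] - num_candidates[0])

# Hypothetical recall curves for a learning-based proposal generator
# trained on 5, 20, and 60 categories, measured at several candidate budgets.
budgets = [10, 100, 500, 1000]
curves = {
    5:  [0.20, 0.45, 0.60, 0.65],
    20: [0.25, 0.55, 0.72, 0.78],
    60: [0.28, 0.60, 0.80, 0.86],
}

for num_trained, curve in sorted(curves.items()):
    print(num_trained, round(area_under_recall(curve, budgets), 3))

# If the area keeps climbing as the number of trained categories grows,
# the method's performance is tied to the categories it has seen
# (high bias capacity); a flat trend suggests behavior closer to a
# category-independent proposal generator.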

Object Proposals Library

Our Object Proposals library can be used to evaluate and compare different object proposal methods. The code is available on GitHub.

VT PASCAL Context Instance Annotations (New version released)

The PASCAL-Context dataset provides full annotations for the PASCAL VOC 2010 dataset in the form of semantic segmentations. We split these categories into three groups: Objects/Things, Background/Stuff, and Ambiguous. We then selected the 60 most frequently occurring categories other than the 20 PASCAL categories and added instance-level annotations for them.

These annotations for the extra 60 categories are available here in the standard PASCAL VOC XML format.

PASCAL-Context Semantic Segmentations
PASCAL-Context Instance Annotations

Original images for the dataset can be downloaded from the PASCAL VOC 2010 website. Original PASCAL-Context annotations are available here.
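Because the extra instance annotations follow the standard PASCAL VOC XML layout, they can be read with any VOC parser. The short Python sketch below pulls category names and boxes out of one annotation file; the filename is a placeholder, and the tag names are simply those of the standard VOC format rather than anything specific to this release.

import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    # Read (category name, [xmin, ymin, xmax, ymax]) pairs from a
    # PASCAL-VOC-style annotation file.
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        bb = obj.find('bndbox')
        box = [int(float(bb.find(tag).text))
               for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        boxes.append((name, box))
    return boxes

# Example usage with a placeholder filename:
# for name, box in load_voc_boxes('2010_000002.xml'):
#     print(name, box)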

Please cite our paper if you use our code or annotations.