Exploring Nearest Neighbor Approach on VQA

Akrit Mohapatra
Fall 2015 ECE 5554/4984 Computer Vision: Class Project
Virginia Tech

Abstract

Computer vision has made great progress over the years, and problems such as object detection and classification have reached near-human accuracy. Visual Question Answering (VQA) is a newer problem in the field: the objective is to get computers to answer questions about images. The goal of this project was to explore how accurately a k-nearest-neighbor search, combined with a "consensus" step, predicts answers on the VQA dataset.

Teaser figure

The goal is to get computers to answer questions based on images.


© (http://visualqa.org/)


Introduction

The main motivation is to explore how naive algorithms such as nearest neighbor perform on the VQA dataset. Their performance should provide insight into how further algorithms can be developed to do better.

Approach

The "consensus" approach used is similar to the one described in the paper "Exploring Nearest Neighbor Approaches for Image Captioning".

Experiments and results

For every test question+image pair, the 'k' nearest training question+image pairs are found based on the test question. For each of these 'k' pairs, we compute the cosine distance between its image and the test image. The 'm' pairs (1<=m<=k) with the smallest image distance are kept, and their 'multiple_choice_answer' fields are pooled into a collection of answers (duplicates are kept, since the next step counts them). The "consensus" answer is the one that occurs most frequently in this collection.
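The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `consensus_answer`, the toy feature vectors, and the choice of cosine distance for the question search are all assumptions (the write-up only states that cosine distance is used for images).

```python
from collections import Counter

import numpy as np


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def consensus_answer(test_q, test_img, train_q, train_img, train_ans, k=10, m=5):
    """Predict an answer for one test question+image pair.

    train_q, train_img: 2-D arrays, one row per training question / image.
    train_ans: list of 'multiple_choice_answer' strings aligned with the rows.
    """
    # Step 1: k training pairs whose questions are closest to the test question
    # (cosine distance on questions is an assumption made for this sketch).
    q_dist = np.array([cosine_distance(test_q, q) for q in train_q])
    top_k = np.argsort(q_dist, kind="stable")[:k]

    # Step 2: re-rank those k pairs by image distance and keep the closest m.
    img_dist = np.array([cosine_distance(test_img, train_img[i]) for i in top_k])
    top_m = top_k[np.argsort(img_dist, kind="stable")[:m]]

    # Step 3: pool the answers (with duplicates) and take the most frequent.
    votes = Counter(train_ans[i] for i in top_m)
    return votes.most_common(1)[0][0]


# Toy data: 2-D "features" standing in for real question and image vectors.
train_q = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
train_img = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]])
train_ans = ["glass", "wood", "wood", "metal"]
test_q, test_img = [1.0, 0.0], [0.0, 1.0]

print(consensus_answer(test_q, test_img, train_q, train_img, train_ans, k=3, m=2))
# prints: wood
```

In the toy data the single closest question carries the answer "glass", but the two image-nearest candidates among the top three both say "wood", so the vote returns "wood". Ties are broken by whichever answer was seen first, which is a property of `Counter.most_common` rather than anything stated in the write-up.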

In this particular experiment 'k' was taken as 10 and 'word2vec' was the embedding used for the questions.
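The write-up does not say how per-word 'word2vec' vectors were combined into a single question vector. One common option is to average the vectors of the words in the question; the sketch below shows that option only as an assumption, using a hypothetical three-dimensional lookup table (`toy_vectors`) in place of a real pretrained model.

```python
import numpy as np

# Hypothetical 3-d word vectors; a real word2vec model would supply these.
toy_vectors = {
    "what": np.array([1.0, 0.0, 0.0]),
    "color": np.array([0.0, 1.0, 0.0]),
    "is": np.array([0.0, 0.0, 1.0]),
}


def embed_question(question, vectors):
    """Average the vectors of the known words (assumed pooling strategy)."""
    words = question.lower().rstrip("?").split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


question_vec = embed_question("What color is the table?", toy_vectors)
print(question_vec)
```

Words missing from the lookup table ("the" and "table" here) are simply skipped, which is one of several possible ways to handle out-of-vocabulary words.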

The overall accuracy of the nearest neighbor approach is 36.58%, and that of the consensus approach is 39.70%. Consensus generally outperforms simple nearest neighbor, but for some question types nearest neighbor does noticeably better, for example 'what sport' (54.07% vs. 44.65%) and 'what animal' (17.68% vs. 14.06%).

Per Question Type Accuracy

Question Type        Nearest Neighbor (%)    Consensus (%)
what brand           20.19                   20.34
are                  62.61                   69.34
is the               60.34                   65.97
is there             79.30                   85.55
what sport           54.07                   44.65
does                 60.79                   70.90
which                22.77                   23.81
do                   60.76                   68.19
what type            24.59                   25.23
what are             13.72                   12.45
what color           19.11                   23.10
who                  11.98                   12.17
what time            12.88                   11.47
what kind            23.23                   22.50
why                  8.18                    8.81
none of the above    36.66                   39.78
is this              66.12                   69.31
what is              11.91                   11.51
how many             28.13                   32.93
what does            8.46                    9.57
where                6.77                    5.05
what animal          17.68                   14.06
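To make the gap between the two approaches easier to read, the difference (consensus minus nearest neighbor) can be computed directly from the table. The snippet below does this for four rows copied from the table above; the variable names are illustrative only.

```python
# (question type, nearest neighbor %, consensus %) copied from the table above
rows = [
    ("is there", 79.30, 85.55),
    ("does", 60.79, 70.90),
    ("what sport", 54.07, 44.65),
    ("what animal", 17.68, 14.06),
]

# Positive gap: consensus is better; negative gap: nearest neighbor is better.
gaps = {qtype: round(cons - nn, 2) for qtype, nn, cons in rows}

for qtype, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{qtype:12s} {gap:+.2f}")
```

Sorted this way, 'what sport' and 'what animal' come first with negative gaps, while 'is there' and 'does' show gains for consensus.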

Per Answer Type Accuracy

Answer Type    Nearest Neighbor (%)    Consensus (%)
other          17.82                   18.35
number         23.85                   27.84
yes/no         65.93                   72.21


Nearest Neighbor (NN) vs. NN with Consensus



Qualitative results

Below are qualitative results from both approaches. In most cases the two give the same answer; in some cases consensus picks the right answer where nearest neighbor does not.


Question: "What is the dresser made out of?"
Multiple_choice_answer: "wood"
Nearest-Neighbor: "glass"
Consensus: "wood"




Question: "Is the child male or female?"
Multiple_choice_answer: "male"
Nearest-Neighbor: "female"
Consensus: "female"




Question: "What has been upcycled to make lights?"
Multiple_choice_answer: "kettles"
Nearest-Neighbor: "luggage"
Consensus: "coffee"




Question: "What is the table made of?"
Multiple_choice_answer: "wood"
Nearest-Neighbor: "wood"
Consensus: "wood"