Exploring Nearest Neighbor Approach on VQA
Fall 2015 ECE 5554/4984 Computer Vision: Class Project
Computer vision has made great progress over the years. Problems such as object detection and classification have reached near-human accuracy. Visual Question Answering (VQA) is a newer problem in the field: the objective is to get computers to answer questions about images. The goal of this project was to measure how accurately a k-nearest neighbor search, combined with a "consensus" step for predicting an answer, performs on the VQA dataset.
The main motivation is to see how well naive algorithms such as nearest neighbor perform on the VQA dataset, which should give insight into how later algorithms can be designed to perform better.
The "consensus" approach used is similar to the one described in Exploring Nearest Neighbor Approaches for
Experiments and results
For every test question+image pair, the 'k' nearest training question+image pairs are found based on the test question. For each retrieved pair, we compute the cosine distance between its image and the test image. The 'm' pairs (1<=m<=k) with the smallest distance are kept, and their 'multiple_choice_answer' fields are gathered into a list of candidate answers (duplicates kept). The "consensus" answer is the one that occurs most frequently in that list.
In this particular experiment, 'k' was set to 10 and 'word2vec' was the embedding used for the questions.
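The retrieval and voting steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `cosine_distance` and `consensus_answer`, the toy two-dimensional vectors, and the value m=3 are hypothetical. It also assumes that question similarity is measured with cosine distance on the question embeddings (the text only specifies cosine distance for images), that answers are counted with duplicates so that "most frequent" is meaningful, and that the plain nearest neighbor baseline corresponds to m=1.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def consensus_answer(test_q, test_img, train, k=10, m=3):
    """Predict an answer for one test question+image pair.

    train is a list of (question_vec, image_vec, answer) tuples, where
    answer plays the role of the 'multiple_choice_answer' field.
    """
    if not train:
        raise ValueError("training set is empty")
    # Step 1: keep the k training pairs whose question is closest
    # to the test question (cosine distance is an assumption here).
    by_question = sorted(train, key=lambda ex: cosine_distance(test_q, ex[0]))[:k]
    # Step 2: re-rank those k pairs by image distance and keep the top m.
    m = max(1, min(m, len(by_question)))
    by_image = sorted(by_question, key=lambda ex: cosine_distance(test_img, ex[1]))[:m]
    # Step 3: vote. On a tie, the answer seen first (closest image) wins.
    votes = Counter(ex[2] for ex in by_image)
    return votes.most_common(1)[0][0]


# Toy data: hypothetical 2-d "embeddings", not real word2vec or image features.
train = [
    ([1.0, 0.0], [1.0, 0.1], "metal"),
    ([0.9, 0.1], [1.0, 0.0], "wood"),
    ([0.8, 0.2], [0.9, 0.2], "wood"),
    ([0.0, 1.0], [1.0, 0.1], "girl"),  # question is far away, dropped at k=3
]
test_q, test_img = [1.0, 0.0], [1.0, 0.1]

print(consensus_answer(test_q, test_img, train, k=3, m=1))  # metal
print(consensus_answer(test_q, test_img, train, k=3, m=3))  # wood
```

In the toy data the single closest image carries the answer "metal", while two of the three closest images agree on "wood", so the two settings disagree. This is the kind of case where the vote can change the prediction in either direction.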
The overall accuracy of the nearest neighbor approach is 36.58%, and the overall accuracy of the consensus approach is 39.70%. Consensus generally outperforms simple nearest neighbor, but on some specific question types such as 'what animal' and 'what sport', nearest neighbor significantly outperforms the consensus approach.
Per Question Type Accuracy
Per Answer Type Accuracy
Nearest Neighbor (NN) vs. NN with Consensus
Below are qualitative results from both experiments. In most cases both approaches perform equally well; in certain cases, consensus does better at choosing the right answer.
Question: "What is the dresser made out of?"
Question: "Is the child male or female?"
Question: "What has been upcycled to make lights?"
Question: "What is the table made of?"