Abstract

Enhancing a scene by adding objects to it requires some commonsense understanding of what the scene is trying to convey and what may be missing from the image. Using commonsense knowledge in Computer Vision has been the focus of several recent papers, and a few of them have explored generating abstract scenes from textual input using purely text-to-image methods. We believe that building a model to predict objects from scene context requires learning common sense, which makes it an interesting problem. Our project concentrates on augmenting abstract scenes by testing different methods for improving a scene, with Recurrent Neural Networks as our primary approach.

Motivation

Let's look at the following image:
If you were given the above image, what would be a logical object to add to make it more coherent? The child is holding a bone up in the air, so a good addition would be something the child might be holding it out for, such as a dog, as shown below:
That is in fact the scene from the dataset, which appears to show a child attempting to train a dog.
If we take away the bone, there may not be enough information to make the same inference. Still, given this image, you may be able to come up with some other plausible scene. Even without explicit cues about what to add, a human can get creative and imagine a scene that makes sense.

Problem Statement

Given 'n' objects in an image, a human can easily predict ways in which the image can be enriched, or made more plausible. However, this requires commonsense knowledge about the interactions of the objects in the image, and about how their relative positions and attributes play a role in it. We explored this problem by training a machine learning model to recover missing items in an abstract scene. Our eventual goal is an intelligent algorithm that not only learns to hallucinate visually sensible completions of a scene, but also learns to generate a plausible image incrementally, as textual data is fed in word by word.