Resolving Prepositional Phrase Attachment Ambiguity via Re-ranking with Images
Stanislaw Antol Spring 2014 ECE 6504 Probabilistic Graphical Models: Class Project Virginia Tech
Jill is running with a baseball bat towards Jack with a catcher's glove.
Purely based on text, the incorrect parsing is selected. Using image features, can we select the one in blue?
Goal
There has been a lot of work done at the intersection of computer vision and natural language processing.
Most of this work has focused on using sentences to help computer vision tasks (e.g., improve object detection).
This project aims to reverse this scenario to get computer vision to help natural language processing.
The task of interest is sentence parsing, which entails figuring out which words in a sentence go together to produce higher-level meaning.
In many languages, there are ambiguous, but grammatically correct, sentences.
For example, "I bought a red and yellow shirt." can mean buying two shirts or one shirt with two colors.
Another kind of ambiguity, and the one I focus on in this work, is prepositional phrase attachment ambiguity, such as the one shown in the figure above.
In a purely textual world, it can be difficult to resolve these ambiguities, as each interpretation could be correct.
Thus, I investigate situations with sentence-image pairs (e.g., images with captions) and I focus on ambiguities that might be resolved via visual information.
The basic idea is illustrated in the figure above and explained in more detail here.
We are given an image-sentence pair (left).
The sentence parser provides multiple parsings (right--simplified for clarity), sometimes with the incorrect parsing as most likely.
Can we use image-based features to reorder the parsings, such that we select the correct parsing (e.g., the one in the blue box)?
Approach
The approach pipeline can be seen in the figure below, with each part described in more detail below.
Parsing
In order to parse the sentences, I utilize the Stanford Parser.
Specifically, I use the pre-trained English Probabilistic Context-Free Grammar parser, which is able to output the top K parsings and their associated probabilities.
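The parser's top-K output can be thought of as a list of (parse, log-probability) pairs. The sketch below (with hypothetical bracketed parses, not actual Stanford Parser output) shows how such log-probabilities can be normalized into a distribution over the candidates via a softmax:

```python
import math

def normalize_log_probs(candidates):
    """Convert log-probabilities over the top-K candidate parses
    into normalized probabilities (numerically stable softmax)."""
    max_lp = max(lp for _, lp in candidates)
    exps = [(tree, math.exp(lp - max_lp)) for tree, lp in candidates]
    total = sum(p for _, p in exps)
    return [(tree, p / total) for tree, p in exps]

# Hypothetical top-3 output: (bracketed parse, log-probability)
top_k = [
    ("(S (NP Jill) (VP runs (PP with (NP a bat))))", -20.1),
    ("(S (NP Jill) (VP runs) (PP with (NP a bat)))", -20.8),
    ("(S (NP (NP Jill) (PP with (NP a bat))) (VP runs))", -22.5),
]
probs = normalize_log_probs(top_k)
```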
Re-ranking
In order to re-rank the sentence parsings,
I utilize the re-ranker from the Yadollahpour et al. paper "Discriminative Re-ranking of Diverse Segmentations," courtesy of Dhruv Batra.
This re-ranker is based on a Structural SVM that uses a cutting-plane method.
Every training datapoint has multiple candidate solutions with associated quality scores. The re-ranker learns a weight vector, w, such that the dot product of w with a candidate's feature vector is high for good solutions. The candidates can then be ordered by these scores to select the final answer.
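At test time, the scoring step reduces to a dot product between the learned weight vector and each candidate's feature vector, followed by a sort. A minimal sketch, with made-up weights and feature values:

```python
def score(w, phi):
    """Linear score w . phi assigned by the re-ranker to one candidate."""
    return sum(wi * fi for wi, fi in zip(w, phi))

def rerank(w, candidates):
    """Reorder candidate parsings by descending score.
    `candidates` is a list of (parse_id, feature_vector) pairs."""
    return sorted(candidates, key=lambda c: score(w, c[1]), reverse=True)

# Toy example: 3-dimensional features, hypothetical learned weights
w = [0.5, 1.0, -0.25]
candidates = [("parse_a", [0.6, 0.1, 0.0]),
              ("parse_b", [0.3, 0.5, 0.2])]
ranked = rerank(w, candidates)  # parse_b scores 0.6, parse_a scores 0.4
```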
Dataset
For this project, I utilize the yet-to-be-released ABSTRACT-50S dataset from the Computer Vision Lab at Virginia Tech.
This dataset contains 25,000 descriptive sentences, 50 for each one of 500 abstract illustrations depicting a scene between two children, Mike and Jenny, and some of 56 other objects.
This dataset was chosen for its large variation in scenes and associated sentences, which makes it contain more ambiguities than image classification-based datasets (e.g., PASCAL).
Another benefit of this abstract dataset is that the location of every object in the image is known exactly (i.e., there are no noisy manual annotations).
Starting with the ABSTRACT-50S dataset, I did some processing to reduce the size and make it more manageable for this class project.
First, I select 6 prepositional phrases of interest: "with," "next to," "on top of," "under," "in front of," "behind."
I then remove sentences that do not contain at least one of these prepositional phrases, which brings the dataset down to around 3,500 sentences.
Then, utilizing the parsings, I further reduce the dataset to get 399 potential sentences.
These sentences are sure to have ambiguity in prepositional phrase attachment (i.e., the top K parsings do not have the same prepositional phrases).
Furthermore, the ambiguous attachments are between nouns that correspond to objects present in the images.
Please note that the nouns in the sentences directly correspond to the names of the objects in the image dataset.
Manually annotating all K (= 10) parsings for each of the 399 sentences would have been tedious, so I only annotated and experimented on a random selection of 100 of them.
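The preposition filter described above can be sketched as follows (the sentence list and the exact matching rule are illustrative assumptions, not the actual preprocessing code):

```python
# The six prepositional phrases of interest from the dataset filtering step
PREPOSITIONS = ["with", "next to", "on top of", "under",
                "in front of", "behind"]

def contains_target_preposition(sentence):
    """Keep only sentences that mention at least one of the six
    prepositional phrases of interest (whole-phrase match)."""
    s = " " + sentence.lower().rstrip(".!") + " "
    return any(" %s " % p in s for p in PREPOSITIONS)

sentences = [
    "Jill is running with a baseball bat towards Jack.",
    "Mike is smiling at Jenny.",
    "The dog is sitting under the tree.",
]
kept = [s for s in sentences if contains_target_preposition(s)]
```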
Features
I utilize a 37-dimensional feature vector for each parsing.
The first feature is the parsing probability from the parser.
The next 6 features are binary variables that indicate which of the 6 prepositions are in the parse.
Then, for each preposition present, I calculate 5 image-based features based on the two objects the preposition links.
These image-based features capture the absolute distance between the objects and whether the first object is to the right of, to the left of, above, or below the second object.
The distances are normalized by the size of the image and converted into pseudo-probabilities by subtracting them from 1
(e.g., if two objects overlap, the absolute distance feature is 1).
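Under the stated normalization, the five per-pair image features might be computed as follows; note that using object centers and the image diagonal as the normalizer are assumptions on my part, as the text only specifies normalizing by image size:

```python
def pairwise_image_features(obj1, obj2, img_w, img_h):
    """Five image-based features for an object pair linked by a
    preposition. Objects are (x, y) centers in pixels (assumption)."""
    dx = obj1[0] - obj2[0]
    dy = obj1[1] - obj2[1]
    # Distance normalized by the image diagonal, then turned into a
    # pseudo-probability: overlapping objects give a value of 1.
    dist = (dx ** 2 + dy ** 2) ** 0.5 / (img_w ** 2 + img_h ** 2) ** 0.5
    closeness = 1.0 - dist
    right_of = 1.0 if dx > 0 else 0.0
    left_of = 1.0 if dx < 0 else 0.0
    above = 1.0 if dy < 0 else 0.0   # image y-axis grows downward
    below = 1.0 if dy > 0 else 0.0
    return [closeness, right_of, left_of, above, below]

# Two coincident objects: maximal closeness, no directional features
feats = pairwise_image_features((100, 100), (100, 100), 500, 400)
```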
Some sentences (e.g., the one in the top-most figure) contain the same preposition multiple times.
For these cases, I compute the image features for each pair of objects and take their mean as the preposition's final image feature.
If a preposition is not present, its image features are set to zero.
These two measures ensure that the feature vector has the same dimension regardless of the sentence's parsing.
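Putting the pieces together, the 37-dimensional vector (1 parser probability + 6 preposition indicators + 6 × 5 image features, with means for repeated prepositions and zeros for absent ones) can be assembled as sketched below; the input representation is an assumption:

```python
def build_feature_vector(parse_prob, prep_image_feats):
    """Assemble the 37-dimensional feature vector for one parsing.
    `prep_image_feats` maps a preposition index (0-5) to a list of
    5-dim image feature vectors, one per object pair it links."""
    vec = [parse_prob]
    # Six binary indicators for which prepositions appear in the parse
    vec.extend(1.0 if i in prep_image_feats else 0.0 for i in range(6))
    for i in range(6):
        feats = prep_image_feats.get(i, [])
        if feats:
            # Repeated preposition: average the per-pair image features
            mean = [sum(col) / len(feats) for col in zip(*feats)]
        else:
            # Absent preposition: zero out its image features
            mean = [0.0] * 5
        vec.extend(mean)
    return vec

# "with" (index 0) appears twice; all other prepositions are absent.
vec = build_feature_vector(
    0.8, {0: [[0.9, 1, 0, 0, 1], [0.7, 0, 1, 1, 0]]})
```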
Results
Due to the small size of the dataset (i.e., 100 image-sentence pairs) and sensitivity to the train/test split, I report the results of 10-fold cross-validation.
This means that I divide the dataset into 10 random partitions, then train on 9 partitions, evaluate performance on the 10th, and then pick a different set of 9 partitions and repeat the process until each partition is tested on.
The final reported result is the average performance across the 10 partitions for the SVM regularization parameter C that performs best.
The performance metric is the fraction of image-sentence pairs that have the correct prepositional phrase attachment in the highest ranked parsing.
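The evaluation protocol can be sketched as follows; only the 10-fold shape and the accuracy metric come from the text, while the fold-construction details are an assumption:

```python
import random

def ten_fold_indices(n, seed=0):
    """Split n examples into 10 random folds; each fold serves once
    as the held-out test set while the other 9 are used for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

def attachment_accuracy(top_ranked_correct):
    """Fraction of image-sentence pairs whose top-ranked parsing has
    the correct prepositional phrase attachment (list of 0/1 flags)."""
    return sum(top_ranked_correct) / len(top_ranked_correct)

splits = list(ten_fold_indices(100))
```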
For C = 10,000, my cross-validation performance was 70%.
(All C values between 0.1 and 100,000 had performances above 60%.)
Without performing re-ranking (i.e., using the parser's top parse), the performance was 52%, so re-ranking yields a relative improvement of roughly 35%.
These results provide preliminary evidence that this approach has a lot of potential to help this task.
Future Work
The main limitation of the current setup is that I do not extract as much information from the sentences as possible.
The first limitation is that the prepositional phrases extracted from the parser only contain single nouns (e.g., "tree"),
whereas some object names consist of multiple words (e.g., "oak tree").
Thus, I lose out on many potential sentences.
The second limitation is that I do not reason about similarity between the words used to describe the objects in the sentences
and the names in the dataset object list.
Thus, if the match is not perfect, I exclude the sentence from my experimental dataset.
For a more realistic use case, I would need to compute a similarity between the sentences and the object names and select the best matching ones.
Thus, if the sentence says "bird," it would match well with the "owl" object name, and I could compute an image feature (possibly scaled by the similarity).
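A minimal stand-in for such a matching step, using a simple character-overlap similarity from the standard library (the function name and threshold are hypothetical; a surface measure like this handles "tree" vs. "oak tree" but not semantic matches like "bird" vs. "owl," which would require something like WordNet similarity):

```python
from difflib import SequenceMatcher

def best_matching_object(word, object_names, threshold=0.5):
    """Map a sentence noun to the closest dataset object name using
    character-overlap similarity; return (None, score) below threshold."""
    scored = [(SequenceMatcher(None, word, name).ratio(), name)
              for name in object_names]
    score, name = max(scored)
    return (name, score) if score >= threshold else (None, score)
```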
It would also be worthwhile to incorporate more prepositions and to test the approach on a dataset of real images.