Project Information
Your class project is an opportunity for you to explore an interesting
machine learning problem of your choice in the context of a real-world
data set.
Below, you will find some example project ideas. You do not have to
pick from the list; these are simply suggestions. If you
are in 5984, the best
idea would be to combine machine learning with problems in your own
research area. If you are in 4984, this is an opportunity to learn in
depth about a particular sub-area of ML. Your class project must be
about new things you have
done this semester; you can't use results you have developed in
previous semesters.
Projects can be done by you as an individual, or in teams of two
students. The instructor and TA will consult with you on your ideas,
but of course the final responsibility to define and execute an
interesting piece of work is yours. Your project will be worth
25% of your final class grade, and will involve the following
deliverables:
- Project Proposal: due TBD.
- Mid-sem Presentations: In-Class.
- Final Report: due TBD.
Project Proposal
Project proposal format: Proposals should be two pages maximum.
The document must be in NIPS format.
You can find LaTeX style files here. Note that, as with any conference,
the page limits are strict! Papers over the limit will not be considered.
Please include the following information:
- Project Title
- Project Idea
- This should be approximately two paragraphs.
- Dataset details
- We strongly urge you
to use an existing dataset. Data collection takes a lot of time, and we
want you to focus on machine learning, not data collection. Talk to us
if you need to collect your own data.
- Software
- Which libraries will you use?
- What will you code up?
- Papers to read
- Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
- Teammate
- Will you have a teammate? If so, what is the division of labor? Maximum team size is 3 students.
- Mid-sem milestone
- What will you complete by the project milestone due date? Experimental results of some kind are expected here.
Dataset and Project Ideas
Below are descriptions of several data sets, and some suggested
projects. The first few are spelled out in greater detail. You are
encouraged to select and flesh out one of these projects, or make up
your own well-specified project using these datasets. If you have other
data sets you would like to work on, we would consider that as well,
provided you already have access to this data and some idea of what to
do with it.
Character Recognition (digits) data
Optical character recognition, and the simpler digit recognition task,
has been the focus of much ML research. We have three datasets on this
topic. The first tackles the more general OCR task, on a small
vocabulary of words: (Note that the first letter of each word was
removed, since these were capital letters that would make the task
harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
The second dataset is the now "classic" digit recognition task for
outgoing mail zip codes:
http://yann.lecun.com/exdb/mnist/
The third (and most challenging) data set consists of scrambled text
known as Captchas, which were designed by Luis von Ahn to be difficult to
recognize automatically. For more about Captchas, go to the relevant
Wikipedia article or Captcha.net, where you will find several papers.
Easier Dataset
Difficult Dataset
Project Suggestions
- Learn a classifier to recognize the letter/digit
- Use an HMM to exploit correlations between neighboring letters in
the general OCR case to improve accuracy. (Since ZIP codes don't have
such constraints between neighboring digits, HMMs will probably not
help in the digit case.)
- Apply a clustering/dimensionality reduction algorithm on this
data, see if you get better classification on this lower dimensional
space.
- Learn a classifier to decipher Captchas. You may want to begin by
reading the following:
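Returning to the HMM suggestion in the list above: the sketch below shows one way neighboring-letter correlations could be combined with per-character classifier scores, using Viterbi decoding. It is only an illustration under stated assumptions. The three-letter alphabet, the emission probabilities, and the transition probabilities are all invented toy values, not statistics from any of the datasets; in a real project the emissions would come from your classifier and the transitions from letter-pair counts in the training words.

```python
import math


def viterbi(emission_logp, transition_logp, start_logp):
    """Return the most likely label sequence under a first-order HMM.

    emission_logp: list with one dict per character position, mapping
        label -> log P(observed image | label).
    transition_logp: dict mapping (prev_label, label) -> log P(label | prev_label).
    start_logp: dict mapping label -> log P(label at the first position).
    """
    labels = list(start_logp)
    # best[t][y] = best log score of any label path ending in y at position t
    best = [{y: start_logp[y] + emission_logp[0][y] for y in labels}]
    back = []  # back[t-1][y] = best predecessor of y at position t
    for t in range(1, len(emission_logp)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition_logp[(p, y)])
            scores[y] = best[-1][prev] + transition_logp[(prev, y)] + emission_logp[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def log_table(probs):
    return {key: math.log(value) for key, value in probs.items()}


# Toy values (assumptions for illustration only).
labels = ["q", "u", "o"]
start = log_table({y: 1 / 3 for y in labels})
emissions = [
    log_table({"q": 0.80, "u": 0.10, "o": 0.10}),  # first letter: clearly "q"
    log_table({"q": 0.05, "u": 0.45, "o": 0.50}),  # second letter: ambiguous
]
transitions = {}
for prev in labels:
    for cur in labels:
        if prev == "q":
            p = 0.9 if cur == "u" else 0.05  # "q" is usually followed by "u"
        else:
            p = 1 / 3
        transitions[(prev, cur)] = math.log(p)

independent = [max(e, key=e.get) for e in emissions]  # per-letter argmax
decoded = viterbi(emissions, transitions, start)
print(independent, decoded)
```

In this toy case the per-letter classifier alone prefers "o" for the second letter, while the transition model pulls the decoded sequence toward "u". Whether this helps on the real OCR data is something your experiments would have to show.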
Object Detection/Segmentation Datasets
The PASCAL Visual Object Categories Challenge was an international
competition that ran from 2006 to 2012. The main goal of this challenge
is to recognize objects from a number of visual object classes in
realistic scenes (i.e. not pre-segmented objects). It is fundamentally a
supervised learning problem in that a training set of labelled images is
provided. The twenty object classes that have been selected are:
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
There are three main object recognition competitions: classification,
detection, and segmentation, a competition on action classification,
and a competition on large scale recognition run by ImageNet.
Project Ideas
- Classify images with any of the classifiers we learnt in class.
- Run object detection with existing object detection implementations.
- Use image classification as a prior for object detection.
- Use image classification as a prior for object segmentation.
- Region-Based Segmentation
- Most segmentation algorithms have focused on segmentation based
on edges or based on discontinuity of color and texture. The
ground-truth in this dataset, however, allows supervised learning
algorithms to segment the images based on statistics calculated over
regions. One way to do this is to "oversegment" the image into
superpixels (Felzenszwalb 2004, code available) and merge the
superpixels into larger segments. Come up with a set of features
to represent the superpixels (probably based on color and texture), a
classifier/regression algorithm (suggestion: boosted decision trees)
that allows you to estimate the likelihood that two superpixels are in
the same segment, and an algorithm for segmentation based on those
pairwise likelihoods. Since this project idea is fairly time-consuming,
focusing on a specific part of the project may also be
acceptable.
- Supervised vs. Unsupervised Segmentation Methods
- Write two segmentation algorithms (these may be simpler than
the one above): a supervised method (such as logistic regression) and
an unsupervised method (such as K-means). Compare the results of the
two algorithms. For your write-up, describe the two classification
methods that you plan to use.
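As a rough illustration of the last step of the region-based segmentation idea above (turning pairwise likelihoods into segments), here is a minimal sketch that greedily merges superpixels with a union-find structure. It assumes the pairwise same-segment probabilities have already been produced by your classifier; the function name, the 0.5 threshold, and the toy probabilities are hypothetical choices, and this single-link style of merging is only one of many possible segmentation algorithms.

```python
def merge_superpixels(num_superpixels, pair_probs, threshold=0.5):
    """Group superpixels into segments.

    pair_probs: dict mapping (i, j) -> estimated probability that
        neighboring superpixels i and j belong to the same segment.
    Returns a list giving a segment id for each superpixel.
    """
    parent = list(range(num_superpixels))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Consider the most confident pairs first; stop below the threshold.
    for (a, b), prob in sorted(pair_probs.items(), key=lambda kv: -kv[1]):
        if prob < threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Relabel roots as consecutive segment ids in order of first appearance.
    segment_ids = {}
    return [segment_ids.setdefault(find(i), len(segment_ids))
            for i in range(num_superpixels)]


# Toy example with five superpixels in a row (made-up probabilities).
toy_probs = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
segments = merge_superpixels(5, toy_probs)
print(segments)
```

Because any single confident pair joins two groups, one bad pairwise estimate can merge two large regions. Comparing this against a stricter merge rule could itself be part of the project.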
Below is a list of software you may find useful, contributed by
participants to previous challenges.
Face Recognition Data
There are two data sets for this problem. The first dataset contains
640 images of faces. The images show 20 former
Machine Learning students and instructors, with about 32 images of each
person. Images vary by pose (direction the person is looking),
expression (happy/sad), eyewear (sunglasses or not), etc. This
gives you a chance to consider a variety of classification problems
ranging from person identification to sunglass detection. The data,
documentation, and associated code are available here:
CMU Machine Learning Faces
Available Software: The same website provides an implementation of a
neural network classifier for this image data. The code is quite
robust, and pretty well documented in an associated homework
assignment.
Face Attributes / Hot-or-Not
The second data set consists of 2253 female and 1745 male rectified
frontal face images scraped from the hotornot.com website by Ryan White,
along with user ratings of attractiveness. The data set can be found
here: Facial Attractiveness Images.
Project Ideas
- Try SVMs on this data, and compare their performance to that of the
provided neural networks.
- Apply a clustering algorithm to find "similar" faces
- Learn a facial attractiveness classifier. A recent NIPS paper on the topic of predicting facial attractiveness can
be found here.
Other Resources
Precipitation Data
This dataset includes 45 years of daily precipitation data from the
Northwest of the US:
http://www.jisao.washington.edu/data_sets/widmann/
Project ideas:
- Weather prediction: Learn a probabilistic model to predict rain levels
- Sensor selection: Where should you place sensors to best predict rain?
Enron Email Dataset
The Enron E-mail data set contains about 500,000 e-mails from about 150
users.
The data set is available here:
Enron Data
Project ideas
- Can you classify the text of an e-mail message to decide who sent it?
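One simple baseline for the sender-classification question above is a multinomial naive Bayes model over word counts. The sketch below is a minimal pure-Python version, assuming whitespace tokenization and add-one smoothing. The sender names and messages are invented for illustration and are not taken from the Enron data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages):
    """messages: list of (sender, text) pairs."""
    word_counts = defaultdict(Counter)  # sender -> word -> count
    sender_counts = Counter()           # sender -> number of messages
    vocab = set()
    for sender, text in messages:
        words = text.lower().split()
        word_counts[sender].update(words)
        sender_counts[sender] += 1
        vocab.update(words)
    return word_counts, sender_counts, vocab


def predict_sender(model, text):
    word_counts, sender_counts, vocab = model
    total_messages = sum(sender_counts.values())
    best_sender, best_score = None, -math.inf
    for sender in sender_counts:
        # Log prior: fraction of training messages written by this sender.
        score = math.log(sender_counts[sender] / total_messages)
        total_words = sum(word_counts[sender].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log(
                (word_counts[sender][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_sender, best_score = sender, score
    return best_sender


# Invented toy messages (hypothetical senders "alice" and "bob").
training = [
    ("alice", "please review the gas trading report"),
    ("alice", "gas prices and trading volumes attached"),
    ("bob", "lunch meeting moved to friday"),
    ("bob", "see you at the friday meeting"),
]
model = train_naive_bayes(training)
guess_1 = predict_sender(model, "trading report on gas prices")
guess_2 = predict_sender(model, "friday lunch meeting")
print(guess_1, guess_2)
```

On the real data you would want to strip headers and signatures first, since they can reveal the sender directly, and evaluate on held-out messages.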
Netflix Prize Dataset
The Netflix Prize data set gives 100 million records of the form "user
X rated movie Y a 4.0 on 2/12/05". The data is available here:
Netflix Prize
Project ideas
- Can you predict the rating a user will give on a movie from the
movies that user has rated in the past, as well as the ratings similar
users have given similar movies?
- Can you discover clusters of similar movies or users?
- Can you predict which users rated which movies in 2006? In other
words, your task is to predict the probability that each pair was rated
in 2006. Note that the actual rating is irrelevant, and we just want
whether the movie was rated by that user sometime in 2006. The date in
2006 when the rating was given is also irrelevant. The test data can be
found at this website.
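For the first Netflix idea above (predicting a rating from the user's past ratings and from similar users), a common starting point is user-based collaborative filtering. The following is a minimal sketch, assuming ratings are held in a nested dictionary and that similarity is a Pearson-style correlation over co-rated movies. The user ids, movie ids, and ratings are made up, and a nested dictionary would not scale to 100 million records without more careful data structures.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def similarity(ratings_a, ratings_b):
    """Pearson-style correlation over movies both users rated."""
    shared = set(ratings_a) & set(ratings_b)
    if len(shared) < 2:
        return 0.0
    mean_a, mean_b = mean(ratings_a.values()), mean(ratings_b.values())
    num = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in shared)
    dev_a = math.sqrt(sum((ratings_a[m] - mean_a) ** 2 for m in shared))
    dev_b = math.sqrt(sum((ratings_b[m] - mean_b) ** 2 for m in shared))
    if dev_a == 0 or dev_b == 0:
        return 0.0
    return num / (dev_a * dev_b)


def predict_rating(ratings, user, movie):
    """Predict the user's rating as their mean plus a similarity-weighted
    average of how far other users' ratings of the movie are from their
    own means. Falls back to the user's mean when no neighbor helps."""
    user_mean = mean(ratings[user].values())
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = similarity(ratings[user], other_ratings)
        num += sim * (other_ratings[movie] - mean(other_ratings.values()))
        den += abs(sim)
    if den == 0:
        return user_mean
    return user_mean + num / den


# Made-up ratings: u1 agrees with u2 and disagrees with u3.
toy_ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 5, "B": 5, "C": 1, "D": 5},
    "u3": {"A": 1, "B": 2, "C": 5, "D": 1},
}
predicted = predict_rating(toy_ratings, "u1", "D")
print(round(predicted, 2))
```

Here u1 has not rated movie D, but the like-minded u2 loved it and the opposite-minded u3 disliked it, so the prediction lands well above u1's own average. Predictions from this formula are not guaranteed to stay within the rating scale, so you may want to clip them.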
Other Datasets
Online Resources
© 2013 Virginia Tech