Intro to Machine Learning | ECE, Virginia Tech

Project Information

Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a real-world data set.
Below, you will find some example project ideas. You do not necessarily have to pick from the list; these are simply suggestions. If you are in 5984, the best idea would be to combine machine learning with problems in your own research area. If you are in 4984, this is an opportunity to learn in depth about a particular sub-area of ML. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.

Projects can be done by you as an individual, or in teams of two students. The instructor and TA will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 25% of your final class grade, and will involve the following deliverables:

Project Proposal: due TBD.
Mid-sem Presentations: In-Class.
Final Report: due TBD.

Project Proposal

Project proposal format: Proposals should be two pages maximum. The document must be in NIPS format. You can find LaTeX style files here. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered. Please include the following information:

Project Title
Project Idea

This should be approximately two paragraphs.

Dataset details

We strongly urge you to use an existing dataset. Data collection takes a lot of time, and we want you to focus on machine learning, not data collection. Talk to us if you need to collect your own data.

Software

Which libraries will you use?
What will you code up?

Papers to read

Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.

Teammate

Will you have a teammate? If so, what is the division of labor? Maximum team size is 3 students.

Mid-sem milestone

What will you complete by the project milestone due date? Experimental results of some kind are expected here.

Dataset and Project Ideas

Below are descriptions of several data sets, and some suggested projects. The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects, or make up your own well-specified project using these datasets. If you have other data sets you would like to work on, we would consider that as well, provided you already have access to this data and some idea of what to do with it.

Character Recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have three datasets on this topic. The first tackles the more general OCR task on a small vocabulary of words. (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.) The data is available at:

http://ai.stanford.edu/~btaskar/ocr/

The second dataset is the now "classic" digit recognition task for outgoing mail zip codes:

http://yann.lecun.com/exdb/mnist/

The third (and most challenging) data set consists of scrambled text known as Captchas that were designed by Luis von Ahn to be difficult to recognize automatically. For more about Captchas, go to the relevant Wikipedia article or Captcha.net, where you will find several papers.

Easier Dataset

Difficult Dataset

Project Suggestions

Learn a classifier to recognize the letter/digit
Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
Apply a clustering/dimensionality reduction algorithm to this data, and see whether you get better classification in the lower-dimensional space.
Learn a classifier to decipher Captchas. You may want to begin by reading the following:
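
As a rough illustration of the HMM suggestion in the list above, the sketch below runs Viterbi decoding over per-character classifier scores. It is a minimal example under stated assumptions, not a reference solution: the function names, the three-letter alphabet, and every probability in the toy example are made up for illustration, and a real project would estimate the transition probabilities from the word vocabulary and take the emission scores from a trained character classifier.

```python
import math


def viterbi(emission_logp, transition_logp, initial_logp, floor=math.log(1e-6)):
    """Return the most likely letter sequence under a first-order HMM.

    emission_logp: list with one dict per character position, mapping
        letter -> log P(image at that position | letter).
    transition_logp: dict mapping (previous_letter, letter) -> log probability.
    initial_logp: dict mapping letter -> log P(first letter).
    floor: assumed log probability for letter pairs missing from transition_logp.
    """
    letters = list(initial_logp)
    # best[c] is the best log score of any path that ends in letter c.
    best = {c: initial_logp[c] + emission_logp[0][c] for c in letters}
    backpointers = []
    for scores in emission_logp[1:]:
        new_best, pointers = {}, {}
        for c in letters:
            prev = max(letters, key=lambda p: best[p] + transition_logp.get((p, c), floor))
            new_best[c] = best[prev] + transition_logp.get((prev, c), floor) + scores[c]
            pointers[c] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final letter to recover the path.
    last = max(letters, key=lambda c: best[c])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


def to_log(probs):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


# Toy example (all numbers hypothetical): the second character looks slightly
# more like "v" than "u", but "q" is almost always followed by "u".
initial = to_log({"q": 0.4, "u": 0.3, "v": 0.3})
transition = to_log({("q", "u"): 0.98, ("q", "v"): 0.01, ("q", "q"): 0.01})
emissions = [
    to_log({"q": 0.9, "u": 0.05, "v": 0.05}),
    to_log({"q": 0.1, "u": 0.4, "v": 0.5}),
]

independent = "".join(max(e, key=e.get) for e in emissions)  # per-letter argmax
decoded = viterbi(emissions, transition, initial)
print(independent, decoded)
```

In this toy case the per-letter argmax yields "qv", while Viterbi decoding yields "qu" because the transition model outweighs the small difference in the emission scores.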

Object Detection/Segmentation Datasets

The PASCAL Visual Object Classes (VOC) Challenge was an international competition that ran from 2005 to 2012. The main goal of the challenge was to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

    Person: person
    Animal: bird, cat, cow, dog, horse, sheep
    Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
    Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

There are three main object recognition competitions: classification, detection, and segmentation, a competition on action classification, and a competition on large scale recognition run by ImageNet.

Project Ideas

Classify images with any of the classifiers we learnt in class.
Run object detection with existing object detection implementations.
Use image classification as a prior for object detection.
Use image classification as a prior for object segmentation.

Region-Based Segmentation

Most segmentation algorithms have focused on segmentation based on edges or on discontinuities of color and texture. The ground truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Come up with a set of features to represent the superpixels (probably based on color and texture), a classifier/regression algorithm (suggestion: boosted decision trees) that allows you to estimate the likelihood that two superpixels are in the same segment, and an algorithm for segmentation based on those pairwise likelihoods. Since this project idea is fairly time-consuming, focusing on a specific part of the project may also be acceptable.
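
To make the last step concrete, here is a minimal sketch of one very simple way to turn pairwise likelihoods into segments: keep every pair of adjacent superpixels whose likelihood is at or above a threshold, and take the connected components of the resulting graph using union-find. The function name, the threshold of 0.5, and the toy likelihoods are assumptions for illustration only; the likelihoods would come from your own classifier, and more careful merging rules are certainly possible.

```python
def merge_superpixels(num_superpixels, pair_likelihoods, threshold=0.5):
    """Group superpixels into segments from pairwise same-segment likelihoods.

    pair_likelihoods: dict mapping (i, j) -> estimated probability that
        adjacent superpixels i and j belong to the same segment.
    Returns a list with one segment label per superpixel.
    """
    parent = list(range(num_superpixels))

    def find(i):
        # Find the representative of i's group, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair the classifier considers likely to be one segment.
    for (i, j), likelihood in pair_likelihoods.items():
        if likelihood >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Relabel the group representatives as consecutive segment ids.
    segment_ids = {}
    labels = []
    for i in range(num_superpixels):
        root = find(i)
        segment_ids.setdefault(root, len(segment_ids))
        labels.append(segment_ids[root])
    return labels


# Toy example (hypothetical likelihoods for a chain of five superpixels).
likelihoods = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.7}
segments = merge_superpixels(5, likelihoods)
print(segments)
```

Because merging is transitive, a chain of moderately confident pairs can join two regions that the classifier would not link directly. This is one reason to compare the rule against alternatives such as merging the most confident pairs first and re-estimating likelihoods after each merge.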

Supervised vs. Unsupervised Segmentation Methods

Write two segmentation algorithms (these may be simpler than the one above): a supervised method (such as logistic regression) and an unsupervised method (such as K-means). Compare the results of the two algorithms. For your write-up, describe the two classification methods that you plan to use.

Below is a list of software you may find useful, contributed by participants to previous challenges.

Encoding Methods Evaluation Toolkit
Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman
CPMC: Constrained Parametric Min-Cuts for Automatic Object Segmentation
Joao Carreira and Cristian Sminchisescu.
Automatic Labelling Environment (Semantic Segmentation)
Lubor Ladicky, Philip H.S. Torr.
Discriminatively Trained Deformable Part Models
Pedro Felzenszwalb, Ross Girshick, David McAllester, Deva Ramanan.
Color Descriptors
Koen van de Sande, Theo Gevers, Cees Snoek.

Face Recognition Data

There are two data sets for this problem. The first dataset contains 640 images of faces. The faces themselves are images of 20 former Machine Learning students and instructors, with about 32 images of each person. Images vary by pose (the direction the person is looking), expression (happy/sad), eyewear (sunglasses or not), etc. This gives you a chance to consider a variety of classification problems ranging from person identification to sunglasses detection. The data, documentation, and associated code are available here:

CMU Machine Learning Faces

Available Software: The same website provides an implementation of a neural network classifier for this image data. The code is quite robust, and pretty well documented in an associated homework assignment.

Face Attributes / Hot-or-Not

The second data set consists of 2253 female and 1745 male rectified frontal face images scraped from the hotornot.com website by Ryan White along with user ratings of attractiveness. The data set can be found here:

Facial Attractiveness Images.

Project Ideas

Try SVMs on this data, and compare their performance to that of the provided neural networks.
Apply a clustering algorithm to find "similar" faces
Learn a facial attractiveness classifier. A recent NIPS paper on the topic of predicting facial attractiveness can be found here.

Other Resources

Precipitation Data

This dataset includes 45 years of daily precipitation data from the Northwest of the US: http://www.jisao.washington.edu/data_sets/widmann/

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels
Sensor selection: Where should you place sensors to best predict rain?

Enron Email Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data

Project ideas

Can you classify the text of an e-mail message to decide who sent it?
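
One possible baseline for this question is a multinomial Naive Bayes classifier over word counts. The sketch below is a minimal pure-Python version with add-one smoothing. The function names, the whitespace tokenizer, and the four toy messages with senders "alice" and "bob" are invented for illustration and are not taken from the Enron data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(emails):
    """Collect per-sender word counts from a list of (sender, text) pairs."""
    word_counts = defaultdict(Counter)  # sender -> word -> count
    email_counts = Counter()            # sender -> number of e-mails
    vocabulary = set()
    for sender, text in emails:
        words = text.lower().split()
        word_counts[sender].update(words)
        email_counts[sender] += 1
        vocabulary.update(words)
    return word_counts, email_counts, vocabulary


def predict_sender(model, text):
    """Return the sender with the highest posterior log score for the text."""
    word_counts, email_counts, vocabulary = model
    total_emails = sum(email_counts.values())
    best_sender, best_score = None, -math.inf
    for sender in email_counts:
        total_words = sum(word_counts[sender].values())
        # Log prior: fraction of training e-mails written by this sender.
        score = math.log(email_counts[sender] / total_emails)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[sender][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_sender, best_score = sender, score
    return best_sender


# Toy training set (hypothetical senders and messages).
training = [
    ("alice", "please review the gas trading report"),
    ("alice", "trading desk report attached"),
    ("bob", "lunch meeting moved to friday"),
    ("bob", "see you at the friday meeting"),
]
model = train_naive_bayes(training)
guess = predict_sender(model, "friday lunch meeting")
print(guess)
```

With about 150 users, the real task has many more classes and far longer messages. Held-out evaluation and better tokenization (for example, stripping quoted replies and signatures that reveal the sender) would matter much more than they do in this toy.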

Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize

Project ideas

Can you predict the rating a user will give to a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
Can you discover clusters of similar movies or users?
Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website.
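
For the first idea in the list above (rating prediction), a common starting point is user-based collaborative filtering: predict a user's rating as that user's mean rating plus a similarity-weighted average of how other users deviated from their own means on the same movie. The sketch below assumes mean-centered cosine similarity over co-rated movies. The function names and the tiny ratings table are hypothetical, and the code makes no attempt to scale to 100 million records.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    return sum(values) / len(values)


def similarity(ratings, user_a, user_b):
    """Mean-centered cosine similarity over movies both users have rated."""
    shared = set(ratings[user_a]) & set(ratings[user_b])
    if not shared:
        return 0.0
    mean_a = mean(ratings[user_a].values())
    mean_b = mean(ratings[user_b].values())
    pairs = [(ratings[user_a][m] - mean_a, ratings[user_b][m] - mean_b) for m in shared]
    dot = sum(x * y for x, y in pairs)
    norm = math.sqrt(sum(x * x for x, _ in pairs)) * math.sqrt(sum(y * y for _, y in pairs))
    return dot / norm if norm > 0 else 0.0


def predict_rating(ratings, user, movie):
    """Predict a rating from the deviations of similar users on this movie."""
    user_mean = mean(ratings[user].values())
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        s = similarity(ratings, user, other)
        weighted_sum += s * (other_ratings[movie] - mean(other_ratings.values()))
        weight_total += abs(s)
    if weight_total == 0:
        return user_mean  # no usable neighbours: fall back to the user's mean
    return user_mean + weighted_sum / weight_total


# Toy ratings table (hypothetical users and movies).
ratings = {
    "u1": {"A": 5, "B": 1, "C": 4},
    "u2": {"A": 5, "B": 1, "C": 5, "D": 5},
    "u3": {"A": 1, "B": 5, "D": 1},
}
predicted = predict_rating(ratings, "u1", "D")
print(round(predicted, 2))
```

Here u1 agrees with u2 and disagrees with u3, so the prediction for movie D lands above u1's own mean rating. Predictions from this formula are not guaranteed to stay within the 1 to 5 scale, so clipping may be needed in practice.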
