Project Information

Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a real-world data set.
Below, you will find some example project ideas. You do not necessarily have to pick from the list; these are simply suggestions. If you are in 5984, the best idea would be to combine machine learning with problems in your own research area. If you are in 4984, this is an opportunity to learn in depth about a particular sub-area of ML. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.

Projects can be done by you as an individual, or in teams of two students. The instructor and TA will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 25% of your final class grade, and will involve the following deliverables:

Project Proposal

Project proposal format: Proposals should be two pages maximum. The document must be in NIPS format. You can find LaTeX style files here. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered. Please include the following information:

Dataset and Project Ideas

Below are descriptions of several data sets, and some suggested projects. The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects, or make up your own well-specified project using these datasets. If you have other data sets you would like to work on, we would consider that as well, provided you already have access to this data and some idea of what to do with it.

Character Recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have three datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

The second dataset is the now "classic" digit recognition task for outgoing mail zip codes:

The third (and most challenging) data set consists of scrambled text known as Captchas, which were designed by Luis von Ahn to be difficult to recognize automatically. For more about Captchas go to the relevant Wikipedia article or where you will find several papers.

Easier Dataset

Difficult Dataset

Project Suggestions

Object Detection/Segmentation Datasets

PASCAL Visual Object Categories Challenge was an international competition that ran from 2006 to 2012. The main goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

    Person: person
    Animal: bird, cat, cow, dog, horse, sheep
    Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
    Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
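
For the classification competition, one common way to represent an image's labels is a multi-hot vector over the twenty classes. The snippet below is a minimal sketch of that idea, not the challenge's official tooling: the names `VOC_CLASSES`, `ALL_CLASSES`, `CLASS_TO_INDEX`, and `encode_labels` are hypothetical, and the alphabetical class ordering is an arbitrary assumption.

```python
# Class taxonomy copied from the list above; the grouping keys are the
# four super-categories named there.
VOC_CLASSES = {
    "Person": ["person"],
    "Animal": ["bird", "cat", "cow", "dog", "horse", "sheep"],
    "Vehicle": ["aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"],
    "Indoor": ["bottle", "chair", "dining table", "potted plant", "sofa", "tv/monitor"],
}

# Flatten to a single ordered list (alphabetical order is an assumption made
# here for reproducibility, not something the challenge prescribes).
ALL_CLASSES = sorted(name for group in VOC_CLASSES.values() for name in group)
CLASS_TO_INDEX = {name: i for i, name in enumerate(ALL_CLASSES)}


def encode_labels(present):
    """Return a 0/1 list with a 1 for every class name in `present`."""
    vec = [0] * len(ALL_CLASSES)
    for name in present:
        vec[CLASS_TO_INDEX[name]] = 1
    return vec


# An image containing a cat on a sofa gets two positive entries.
example = encode_labels(["cat", "sofa"])
```

A vector like `example` can serve as the target for twenty independent binary classifiers, one per class.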

There are three main object recognition competitions (classification, detection, and segmentation), plus a competition on action classification and a competition on large-scale recognition run by ImageNet.

Project Ideas

Below is a list of software you may find useful, contributed by participants to previous challenges.

Face Recognition Data

There are two data sets for this problem. The first dataset contains 640 images of faces. The faces themselves are images of 20 former Machine Learning students and instructors, with about 32 images of each person. Images vary by pose (the direction the person is looking), expression (happy/sad), eyewear (sunglasses or not), etc. This gives you a chance to consider a variety of classification problems ranging from person identification to sunglasses detection. The data, documentation, and associated code are available here:

CMU Machine Learning Faces

Available Software: The same website provides an implementation of a neural network classifier for this image data. The code is quite robust, and pretty well documented in an associated homework assignment.

Face Attributes / Hot-or-Not

The second data set consists of 2253 female and 1745 male rectified frontal face images scraped from the website by Ryan White along with user ratings of attractiveness. The data set can be found here:

Facial Attractiveness Images.
Project Ideas
Other Resources

Precipitation Data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

Project ideas:

Enron Email Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data

Project ideas

Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize
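
To make the record form concrete, here is a minimal sketch of how one such rating could be held in memory and aggregated. It assumes nothing about the actual file layout of the Netflix Prize download: the `Rating` class, its field names, and `mean_rating_per_movie` are hypothetical, and the sample records are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Rating:
    """One record of the form 'user X rated movie Y a 4.0 on 2/12/05'."""
    user_id: int
    movie_id: int
    stars: float
    rated_on: date


def mean_rating_per_movie(ratings):
    """Return {movie_id: mean stars} for an iterable of Rating records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in ratings:
        totals[r.movie_id] += r.stars
        counts[r.movie_id] += 1
    return {movie: totals[movie] / counts[movie] for movie in totals}


# Made-up records for illustration only; not taken from the real data.
sample = [
    Rating(user_id=1, movie_id=10, stars=4.0, rated_on=date(2005, 2, 12)),
    Rating(user_id=2, movie_id=10, stars=2.0, rated_on=date(2005, 3, 1)),
    Rating(user_id=1, movie_id=11, stars=5.0, rated_on=date(2005, 2, 13)),
]
movie_means = mean_rating_per_movie(sample)
```

A per-movie mean like this is a simple baseline predictor against which a learned recommender can be compared.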
Project ideas

Other Datasets

Online Resources

© 2013 Virginia Tech