Machine Learning Contests

Rob Zinkov
2010-10-23

Lately, I have been noticing lots of contests cropping up. Some of the domains look interesting, and thanks to Kaggle all of them are mildly lucrative. I have always been bummed that I didn’t participate in the Netflix challenge. Here are some of the ones that caught my eye.

The reddit recommendation contest is one of the more interesting to me. In this challenge, you are given a dump of the voting behavior of roughly 68,000 users. We are provided with the account id of the user, the id of the link they voted on, the subreddit the link was posted in, and how they voted on the story. The comments on the contest announcement include some basic ideas for approaches.
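To make the setup concrete, here is a minimal sketch of an item-based approach on this kind of dump. The file name, column order, and vote encoding are all assumptions on my part, not the contest’s actual format:

```python
# Minimal item-based recommender sketch for the vote dump.
# Assumes a headerless CSV with columns user_id, link_id, subreddit, vote
# ("1" for an upvote) -- the real dump's format may well differ.
import csv
from collections import defaultdict
from math import sqrt

votes = defaultdict(dict)            # user -> {link: +1 or -1}
with open("publicvotes.csv") as f:   # hypothetical file name
    for user, link, subreddit, vote in csv.reader(f):
        votes[user][link] = 1 if vote == "1" else -1

voters = defaultdict(dict)           # link -> {user: vote}, for comparing links
for user, links in votes.items():
    for link, vote in links.items():
        voters[link][user] = vote

def similarity(a, b):
    """Cosine similarity between two links' voter dictionaries.

    Since votes are +/-1, each link's L2 norm is just sqrt(#voters).
    """
    if not a or not b:
        return 0.0
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    return dot / (sqrt(len(a)) * sqrt(len(b)))

def score(user, link):
    """Score an unseen link by its similarity to links the user upvoted."""
    return sum(similarity(voters[link], voters[seen])
               for seen, v in votes[user].items() if v > 0)
```

Nothing fancy, but ranking each user’s unseen links by this score gives you a baseline to beat before reaching for anything heavier.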

Hosted elsewhere but still covered by Kaggle is the Hearst challenge.

Magazine-level data (10 publications to be included in the final data set): individual magazine titles will be de-identified, with the proposed segmentation as follows:

  • Magazine primary display type: “Check Out Aisle” title vs. “Mainline” title
  • Magazine sale level: High, Medium, Low
  • Magazine category: specifics to be determined, but example categories could be Fashion, Fitness, Home, Men’s General Interest, Food, Gossip, News, etc.
  • Magazine frequency: noted as number of issues per year

Newsstand location / store-level data (10,000 stores to be included in the final data set):

  • Store / location type
  • Location
  • Geographic descriptive information / demographics
  • Store sales history by magazine segment
  • Time of sale
  • Time of delivery
  • Price of issue
  • Cost of issue
  • Number of titles for sale at the store, by magazine segment

You are only allowed to use the data they provide. This one is probably the most boring, but it also potentially pays the best.
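If you did want a quick baseline on data shaped like this, a plain least-squares regression over one-hot-encoded magazine and store segments is the obvious starting point. Everything below, field names and values alike, is made up to illustrate the encoding, not taken from the actual data set:

```python
import numpy as np

# Hypothetical rows: (display_type, sale_level, category,
#                     issues_per_year, price, copies_sold).
# The real fields and values will differ.
rows = [
    ("Check Out Aisle", "High",   "Gossip",  52, 3.99, 140),
    ("Mainline",        "Low",    "Fitness", 12, 4.99,  35),
    ("Check Out Aisle", "Medium", "Food",    12, 2.99,  80),
]

def encode(rows):
    """One-hot encode the categorical fields, append numerics + intercept."""
    cats = sorted({(i, r[i]) for r in rows for i in (0, 1, 2)})
    index = {c: j for j, c in enumerate(cats)}
    X = []
    for r in rows:
        x = [0.0] * len(index)
        for i in (0, 1, 2):
            x[index[(i, r[i])]] = 1.0
        x += [r[3], r[4], 1.0]   # issues/year, price, intercept
        X.append(x)
    return np.array(X)

X = encode(rows)
y = np.array([r[5] for r in rows])

# Ordinary least squares as a first cut at predicting copies sold.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ w)   # in-sample predictions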

The R package recommendation contest is the most fun-looking of the bunch. In this one, you are given some of the packages installed by R users and must then recommend other packages to them.

We’d like you to build a library recommendation engine for R programmers, who usually refer to libraries as packages. We think that you can help neophyte R programmers by letting them know which packages are most likely to be installed by the average R user and what measurable properties of the packages themselves are able to predict this information. To train your algorithm, we’re providing a data set that contains approximately 99,640 rows of data describing installation information for 1865 packages for 52 users of R. For each package, we’ve provided a variety of predictors derived from the rich metadata that is available for every R package. Your task is to model the behavior of the sample users for this training set of 1865 packages well enough that your predictions will generalize to a test data set, containing 33,125 rows.

The cool thing about this contest is that it’s no-holds-barred. Short of bribing John or Drew for the test set, you can do anything. Got web traffic statistics for R-Forge? All good. Think the date the package was last updated is relevant? Go for it. Use the number of times the journal article about the package has been cited. Not that I am suggesting all of these will lead to improved performance.
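As a strawman for how those extra signals might plug in, here is a tiny logistic-regression sketch scoring install probability from package-level features. The three features are placeholders I picked to match the ideas above, not the predictors the contest actually ships:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy matrix, one row per (user, package) pair. Hypothetical columns:
# dependency count, days since last update, log(page views).
X = np.array([[2.0,  30.0, 8.1],
              [0.0, 400.0, 3.2],
              [5.0,  10.0, 9.7]])
y = np.array([1.0, 0.0, 1.0])   # 1 = the user installed the package

# Standardize so plain gradient descent behaves on mixed-scale features.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Gradient descent on the logistic loss.
w = np.zeros(X.shape[1])
for _ in range(1000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    w -= 0.01 * grad

print(sigmoid(X @ w))   # predicted install probabilities
```

Whatever external data you scrape just becomes extra columns in `X`; the contest’s scoring takes care of telling you whether it helped.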

Kaggle has a few other contests on its website, and the KDD Cup will be around in a few more months. These are just the ones I found interesting.