People often ask me which software they should use for arbitrary machine learning tasks. Generally there is no one library that can do everything I want, but I definitely have a few favorites that I use often.
As a quick note, most of my work is actually spent just getting the data cleaned. This isn’t really machine learning, but it’s important enough that I suggest checking out Google Refine and Pandas.
Google Refine is essentially a webapp for quickly pointing and clicking your way to a more usable dataset. Pandas is a Python library for really drilling down into the details. It can also do some really cool stuff with time series data, so check it out if you need something more involved.
A note on machine learning libraries that I haven’t mentioned: try to stick to implementations written by practitioners and, sometimes, researchers. Hobbyists will often just implement the algorithm and that’s it. They will use some weird data format that is compatible with nothing else, and their code will likely have subtle bugs. Many researchers, while well-intentioned, often don’t write libraries that are reusable or that can actually run the algorithms on real datasets. You want practitioners and industrial researchers writing the code.
Look at the backgrounds of the people contributing, and remember machine learning code is still code. This means all the standard software engineering design principles still apply.
If you don’t have a passing familiarity with R, you are setting yourself up for failure. R is partly a language but mostly a statistical development environment. It exists as an easy way to load, process, and visualize data. Many data mining packages are measured in terms of how well they can do the tasks R can do.
In the past, R was a bit cumbersome to use. Now with RStudio, there is a full IDE that is fairly intuitive so there is no excuse. You should know how to use R to be able to effectively communicate with other people in this field.
One of the main advantages of R is the massive number of packages available for it. Since CRAN can be a bit intimidating, there are Task Views for different categories and use cases.
In addition, Drew has a great list of packages to check out in R. To those listed I would add knitr as a relatively recent package that makes it easy to generate webpages that use R plots.
Passing familiarity with R and how R is used will inform your tastes and help you write shorter and cleaner data analysis code.
Vowpal Wabbit is better known as vw, which is what you will likely call it in production use. Thanks to the recent reductions framework, vw is a bit more featureful, but don’t be fooled. vw was designed in the Unix tradition of sharp tools that do one thing very fast and very well: binary classification of gigantic amounts of data. It is the only program on this list that can reliably handle 20 GB of data on a single machine.
It is definitely the most unfriendly from a user interface perspective, but if you follow the examples and tutorial you should be able to get something working without too much difficulty.
One of the frustrating pain points of vw is that all feature engineering and model evaluation must be done with separate tools. This frequently leads people to write shell scripts that wrap the training and evaluation steps. Still, it is one of the fastest and best-performing machine learning libraries around.
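The kind of wrapper described above can be sketched in Python rather than shell. This is only a minimal illustration under stated assumptions: the helper names (`to_vw_line`, `train_command`, `predict_command`, `accuracy`) and all file names are made up for this example, labels are assumed to be -1/+1, and the predictions are assumed to be raw scores whose sign indicates the class. The commands are only built here, not executed.

```python
import shlex


def to_vw_line(label, features):
    """Format one example in vw's plain-text input format: 'label | name:value ...'.

    Feature names must not contain spaces, colons, or pipes.
    """
    parts = [f"{name}:{value}" for name, value in features.items()]
    return f"{label} | " + " ".join(parts)


def train_command(data_path, model_path):
    # -d: input data file, -f: file to write the learned model to
    return ["vw", "-d", data_path, "-f", model_path, "--loss_function", "logistic"]


def predict_command(data_path, model_path, predictions_path):
    # -t: test only (no learning), -i: model to load, -p: file to write predictions to
    return ["vw", "-d", data_path, "-t", "-i", model_path, "-p", predictions_path]


def accuracy(labels, raw_predictions):
    """Fraction of examples where the sign of the raw score matches the -1/+1 label."""
    predicted = [1 if score > 0 else -1 for score in raw_predictions]
    correct = sum(1 for y, p in zip(labels, predicted) if y == p)
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the two-step workflow.
    print(to_vw_line(1, {"age": 32, "height": 1.8}))
    print(shlex.join(train_command("train.vw", "model.vw")))
    print(shlex.join(predict_command("test.vw", "model.vw", "preds.txt")))
```

In a real pipeline the command lists would be handed to something like `subprocess.run`, and the predictions file would be read back in before calling `accuracy`.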
Patsy isn’t quite a machine learning library, but it is very good for designing features to then be used in conjunction with something like scikit-learn. Very often the features you want to include in your model are some transformation of the features your input provides. Patsy is a nice library for managing these transformations.
If you have used R’s formula notation, then you will feel right at home with this library. The output will be a matrix that can then be passed to scikit-learn. Patsy can also help with understanding the models after they have been learned. If you already have well-designed sets of features, the utility of this library declines a bit. Truthfully, it seems 30% of my time is spent just doing feature engineering.
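To make the idea of "a formula becomes a matrix" concrete, here is a hand-rolled sketch of roughly what a formula such as `log(income) + C(color)` expands to: an intercept column, indicator columns for a categorical variable (with one reference level dropped), and a transformed numeric column. This is not Patsy's code or API; the function name, the sample data, and the column naming are all assumptions made for illustration. With Patsy itself, the same thing would presumably be a single formula string passed to its `dmatrix` function.

```python
import math


def build_design_matrix(rows, log_column, categorical_column):
    """Expand raw records into (column_names, matrix) for a linear model.

    Columns: intercept, one 0/1 indicator per non-reference category level,
    and the natural log of the numeric column.
    """
    levels = sorted({row[categorical_column] for row in rows})
    other_levels = levels[1:]  # the first level (alphabetically) is the reference

    names = ["Intercept"]
    names += [f"{categorical_column}[{level}]" for level in other_levels]
    names.append(f"log({log_column})")

    matrix = []
    for row in rows:
        indicators = [
            1.0 if row[categorical_column] == level else 0.0
            for level in other_levels
        ]
        matrix.append([1.0] + indicators + [math.log(row[log_column])])
    return names, matrix


# Made-up sample data for the illustration.
rows = [
    {"income": 100.0, "color": "red"},
    {"income": 1000.0, "color": "blue"},
    {"income": 10.0, "color": "green"},
]
names, matrix = build_design_matrix(rows, "income", "color")
```

The point of a formula library is that you never write this bookkeeping by hand: changing the model means editing one string rather than the loop.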
Scikit-learn is, as far as I am concerned, the machine learning library to use in Python. Built on top of NumPy and SciPy, this library offers a wide variety of machine learning algorithms and a very clean interface to train and validate them.
All classifiers must conform to a specific fit/predict object method pattern. This can make it frustrating for many contributors but leads to a library that is remarkably composable. Cross validation can be automatically used on all algorithms, and performance metrics are applied in the same way to all algorithms. This makes the package very clean and very easy to learn.
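The composability that a shared fit/predict convention buys can be shown with a small pure-Python sketch. None of this is scikit-learn code: the two toy classifiers, the `cross_validate` helper, and the data are invented for the example. The only thing carried over is the convention itself, which lets one evaluation routine work with any model that follows it.

```python
from collections import Counter


class MajorityClassifier:
    """Toy baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.most_common_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.most_common_ for _ in X]


class NearestCentroidClassifier:
    """Toy model: predicts the label whose mean feature vector is closest."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for features, label in zip(X, y):
            total = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                total[i] += value
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        def closest(features):
            return min(
                self.centroids_,
                key=lambda label: sum(
                    (a - b) ** 2 for a, b in zip(features, self.centroids_[label])
                ),
            )

        return [closest(features) for features in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(make_model, X, y, folds=3):
    """k-fold cross-validation for any object exposing fit(X, y) and predict(X)."""
    scores = []
    for k in range(folds):
        test_idx = set(range(k, len(X), folds))  # every folds-th example, offset k
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [label for i, label in enumerate(y) if i not in test_idx]
        X_test = [x for i, x in enumerate(X) if i in test_idx]
        y_test = [label for i, label in enumerate(y) if i in test_idx]
        model = make_model().fit(X_train, y_train)
        scores.append(accuracy(y_test, model.predict(X_test)))
    return scores


# Made-up, well-separated two-class data.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]]
y = ["a", "a", "a", "b", "b", "b"]

baseline_scores = cross_validate(MajorityClassifier, X, y)
centroid_scores = cross_validate(NearestCentroidClassifier, X, y)
```

Because `cross_validate` only relies on `fit` and `predict`, swapping in a different model is a one-argument change, which is the property that makes the real library easy to learn.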
Nearly all the functionality includes extensive documentation and a working example to see how to use the code. The contributors are practitioners and will often be informed about how the algorithms behave under a wide variety of conditions.
This is a very active project with multiple commits at least every day. If the release doesn’t have some algorithm, check the development branch as it often has many amazing offerings not yet in a release.
Shogun is an open source C++ library that wraps a wide variety of other libraries. It was originally created as a wrapper around a collection of SVM libraries so that kernels could be written in a library-agnostic way. Since then it has grown into a robust framework with a wide variety of algorithms. The code is designed to be accessible not just from C++ but also from Python, R, Octave and MATLAB.
The early history does inform much of the functionality, so if you are doing anything with kernels you should give it a shot. The performance can be spotty for some of the algorithms, but generally that is more a property of the algorithm than of Shogun. Do your homework: you should be aware of the performance characteristics of the algorithms you wish to use. Liberally licensed, it’s a great place to start when looking for a machine learning algorithm in a commercial setting.
I include Weka here not because I use it, but more as an anti-recommendation. Weka, while still maintained, is hard to use, and the available algorithms are dated. The ARFF data format is idiosyncratic. The library uses massive amounts of memory for mysterious reasons. Please try to use any of the above libraries before considering Weka.
This isn’t all the software I use, but for general machine learning this is what I recommend. As the domain becomes more specialized, more fine-grained tools are available, many of which I will talk about in future posts.