How do Data Scientists perform model selection? Is it different from Kaggle?

Although I agree that Kaggle can be a nice playground for experiments, it typically doesn’t come even close to a “real-world” application :). (Remember the Netflix Prize? “Netflix Never Used Its $1 Million Algorithm Due To Engineering Costs”.) So, I’d say you really want to define a metric for “success”! It’s very important to be clear about your goal before you do any kind of modeling. In practice, it often boils down to finding the sweet spot between

- meeting the project deadline
- high predictive performance
- high computational efficiency
- good interpretability

These aspects differ from project to project, and it is really important to be clear about what you are trying to achieve beforehand. For example, let’s say you managed to come up with a model scoring 0.95 on your favorite performance metric (on a scale from 0 to 1). Is it worth spending a couple more days, weeks, months, or years to squeeze out another 0.05 improvement? It depends on your delivery date. It depends on the available computing hardware. Finally, it may also be important to know what’s going on under the hood (would your collaborators be happy with another one of these multi-layer ensemble XGBoost Frankensteins?).

When it comes to choosing between particular algorithms, I’d typically approach a new problem starting with a very simple hypothesis space, for example, simple logistic regression or softmax regression. (Of course, all of this comes after exploring and getting familiar with the dataset.) I’d use my initial model as a benchmark and then try a handful of other simple classifiers with piecewise or nonlinear hypothesis spaces, such as decision trees, random forests, and (RBF-kernel) SVMs. If these don’t cut it, I’d explore further options including MLPs, RNNs, and ConvNets if appropriate.
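To make the "simple baseline first, then compare" idea concrete, here is a minimal sketch assuming scikit-learn. The dataset (scikit-learn's bundled breast cancer data), the hyperparameters, and the choice of 5-fold cross-validated accuracy are placeholders I picked for illustration, not recommendations; swap in your own data and your own "success" metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption for illustration only).
X, y = load_breast_cancer(return_X_y=True)

# The simple baseline comes first; the other candidates are compared against it.
# Scaling lives inside the pipelines so it is re-fit on each training fold.
candidates = {
    "logistic regression (baseline)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "RBF-kernel SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy; replace with the metric you defined up front.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:32s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the more flexible models don't beat the baseline by a margin that matters for your project, that is usually a good argument for sticking with the simpler one.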

To provide you with a real-world example at this point: I recently ended up using a simple decision tree for a project for the sake of interpretability :). I was collaborating with experimental biologists who provided me with hundreds of experimental measurements for chemical molecules that they tested on a particular system. Eventually, they wanted me to tell them which particular atoms at which positions are most important for triggering a response in that system, so that they could design more effective molecules. Here, I ended up using tree-based methods, since a decision tree was something that I could easily explain to someone without a machine learning background. E.g., “If you have a keto group at this position and a nitrogen group at this position, then …, etc.”
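To illustrate why a tree is so easy to explain, here is a small sketch, again assuming scikit-learn. The data are entirely synthetic and the feature names and the "active if keto group and nitrogen are both present" rule are made up for this example; they are not the measurements or findings from the actual project.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, made-up data: each row is a molecule, each column a binary
# structural feature (1 = present). The feature names are hypothetical.
rng = np.random.default_rng(0)
feature_names = ["keto_group_at_pos_1", "nitrogen_at_pos_2", "methyl_at_pos_3"]
X = rng.integers(0, 2, size=(200, 3))

# Hypothetical ground truth: a molecule is "active" only if it has both
# the keto group and the nitrogen; the methyl group is irrelevant.
y = (X[:, 0] & X[:, 1]).astype(int)

# A shallow tree keeps the learned rules short enough to read aloud.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

The printed rules read as nested "if this feature is present, then ..." statements, which is exactly the kind of explanation a collaborator can check against their own domain knowledge.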
