What factors should I consider when choosing a predictive model technique?
This is a very broad question, and the answer would basically fill an entire book. In a nutshell, I would come up with the
1. How does your target variable look like?
- continuous target variable? -> regression
- categorical (nominal) target variable? -> classification
- ordinal target variable? -> ranked classification
- no target variable and want to find structure in data? -> cluster analysis, projection
2. Is computational performance an issue?
- use “cheaper” models/algorithms
- dimensionality reduction
- feature selection
- lazy learner (e.g,. k-nearest neighbors)
3. Does my dataset fit into memory? If no:
- out of core learning
- distributed systems
4. Is my data linearly separable?
- hard to know the answer upfront
- always a good idea to compare different models
5. Finding a good bias variance threshold. Does my model overfit?
- increase regularization strength if supported by the model
- dimensionality reduction or feature selection otherwise
- collect more training data if possible (check via learning curves first)
6. Are you planning to update your model with new data on the fly?
- one option are lazy learners (e.g., K-nearest neighbors); needs to keep training data around; no learning necessary but more expensive predictions
- it’s generally relatively cheap to update generative models
- another option is stochastic gradient descent for online learning
The list goes on and on :). I think Andreas Mueller’s scikit-learn algorithm “cheat-sheet” is an excellent resource. (Click on the image to view the original, interactive version on scikit-learn)