Machine Learning FAQ
Why do you and other people sometimes implement machine learning algorithms from scratch?
There are several different reasons why implementing algorithms from scratch can be useful:
- it can help us to understand the inner workings of an algorithm
- we could try to implement an algorithm more efficiently
- we can add new features to an algorithm or experiment with different variations of the core idea
- we circumvent licensing issues (e.g., Linux vs. Unix) or platform restrictions
- we want to invent new algorithms or implement algorithms no one has implemented/shared yet
- we are not satisfied with the API and/or we want to integrate it more “naturally” into an existing software library
Let us narrow down the phrase “implementing from scratch” a bit further in the context of the six points above, to make the question more tangible. Let’s pick a particular algorithm, simple logistic regression, and address the different points using concrete examples. I’d claim that logistic regression has been implemented more than a thousand times.
One reason why we’d still want to implement logistic regression from scratch could be that we don’t feel we fully understand how it works, even though we have read a bunch of papers and more or less grasped the core concept. Using a programming language for prototyping (e.g., Python, MATLAB, R, and so forth), we could take the ideas from the papers and try to express them in code, step by step. An established library, such as scikit-learn, can then help us double-check the results and see whether our implementation (our idea of how the algorithm is supposed to work) is correct. Here, we don’t really care about efficiency; even after spending so much time implementing the algorithm, we would probably still use an established library for any serious analysis in our research lab and/or company. Established libraries are typically more trustworthy: they have been battle-tested by many people, who may have already encountered certain edge cases and made sure that there are no weird surprises. Furthermore, it is more likely that their code has been highly optimized for computational efficiency over time. Here, implementing from scratch simply serves the purpose of self-assessment. Reading about a concept is one thing, but putting it into action is a whole other level of understanding, and being able to explain it to others is the icing on the cake.
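To make the "express the ideas in code, step by step" part concrete, here is a minimal sketch of logistic regression trained with batch gradient descent in plain Python. The function names, the learning rate, the number of iterations, and the toy dataset are all assumptions made for illustration; this is one possible way to write it, not a reference implementation.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """P(y=1 | x) under the current parameters."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic_regression(X, y, lr=0.3, n_iter=5000):
    """Batch gradient descent on the mean negative log-likelihood."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(X, y):
            # Derivative of the log-loss w.r.t. the net input is (p - y)
            error = predict_proba(weights, bias, x) - target
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


# Hypothetical toy data: one feature, with the two classes overlapping
# slightly so that the maximum-likelihood solution is finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights, bias = fit_logistic_regression(X, y)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in X]
print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

To double-check the result against scikit-learn, one could fit `sklearn.linear_model.LogisticRegression` on the same data and compare the coefficients. Note that scikit-learn applies L2 regularization by default, so its coefficients should only be expected to match this unregularized sketch closely if the regularization is made very weak (e.g., via a large `C`).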
Another reason why we may want to re-implement logistic regression from scratch is that we are not satisfied with the “features” of other implementations. Let’s naively assume that the other implementations don’t have regularization parameters, or that they don’t support multi-class settings (i.e., via One-vs-All, One-vs-One, or softmax). Or, if computational (or predictive) efficiency is an issue, maybe we want to implement it with a different solver (e.g., Newton’s method vs. gradient descent vs. stochastic gradient descent, etc.). Improvements in computational efficiency do not necessarily have to come from modifying the algorithm, though; we could also use lower-level programming languages, for example, Scala instead of Python, or Fortran instead of Scala. This can go all the way down to assembly or machine code, or to designing a chip that is optimized for running this kind of analysis. However, if you are a machine learning (or “data science”) practitioner or researcher, this is probably something you should delegate to the software engineering team.
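As a rough illustration of adding a "feature" and swapping the solver, the sketch below trains logistic regression with stochastic gradient descent and an optional L2 penalty. The parameter name `l2`, the choice to leave the bias unpenalized, the hyperparameter values, and the toy data are assumptions made for this example only.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_sgd(X, y, lr=0.05, n_epochs=500, l2=0.0, seed=0):
    """Stochastic gradient descent with an optional L2 penalty on the weights.

    `l2` is a hypothetical parameter name; the bias is left unpenalized.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            z = sum(w * xi for w, xi in zip(weights, X[i])) + bias
            error = sigmoid(z) - y[i]
            # Per-sample gradient of the log-loss plus the L2 term l2 * w
            weights = [w - lr * (error * xi + l2 * w)
                       for w, xi in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias


# Same hypothetical toy data as one might use for a quick sanity check
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w_plain, b_plain = fit_sgd(X, y)          # no regularization
w_reg, b_reg = fit_sgd(X, y, l2=1.0)      # strong L2 penalty
print("unregularized weight:", w_plain[0])
print("regularized weight:  ", w_reg[0])
```

With the penalty switched on, the learned weight should be pulled toward zero compared to the unregularized fit, which is the behavior we would want to verify before trusting such an extension.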
To come back to the main question: Different people implement algorithms from scratch for various reasons. Personally, when I implement algorithms from scratch, I do it because of the learning experience.
If you like this content and you are looking for similar, more polished Q & A’s, check out my new book Machine Learning Q and AI.