Artificial Neurons and Single-Layer Neural Networks - How Machine Learning Algorithms Work Part 1

posted on March 14, 2015
Machine Learning, Python, Math and Statistics
This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in the context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.

Principal Component Analysis in 3 Simple Steps

posted on January 17, 2015
Machine Learning, Python, Math and Statistics
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a "black box," and we are going to unravel its internals in 3 basic steps.

Implementing a Weighted Majority Rule Ensemble Classifier in Scikit-learn

posted on January 10, 2015
Machine Learning, Python, Math and Statistics
Here, I want to present a simple and conservative approach to implementing a weighted majority rule ensemble classifier in scikit-learn that yielded remarkably good results when I tried it in a Kaggle competition. For me personally, Kaggle competitions are just a nice way to try out and compare different approaches and ideas -- basically an opportunity to learn in a controlled environment with nice datasets.

MusicMood - A Machine Learning Approach to Classify Music by Mood Based on Song Lyrics

posted on December 8, 2014
Machine Learning, Python, Math and Statistics
In this article, I want to share my experience with a recent data mining project, which was probably one of my favorite hobby projects so far. It's all about building a classification model that can automatically predict the mood of music based on song lyrics.

Turn Your Twitter Timeline into a Word Cloud Using Python

posted on November 28, 2014
Python
Last week, I posted some visualizations in the context of my current "Happy Rock Song" data mining project, and some people were curious about how I created the word clouds. I thought it might be interesting to use a different dataset for this tutorial: your personal Twitter timeline.

Naive Bayes and Text Classification I - Introduction and Theory

posted on October 04, 2014
Machine Learning, Math and Statistics
Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes’ probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction.
In this first part of a series, we will take a look at the theory of naive Bayes classifiers and introduce the basic concepts of text classification. In the following articles, we will implement those concepts to train a naive Bayes spam filter and apply naive Bayes to song classification based on lyrics.

Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA

posted on September 14, 2014
Machine Learning, Python, Math and Statistics
Most machine learning algorithms have been developed and statistically validated for linearly separable data. Popular examples are linear classifiers like Support Vector Machines (SVMs) or the (standard) Principal Component Analysis (PCA) for dimensionality reduction. However, most real-world data requires nonlinear methods in order to successfully perform tasks that involve the analysis and discovery of patterns.
The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radial basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via RBF kernel principal component analysis (kPCA).
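As a rough illustration of the kernel itself (not the article's implementation), the Gaussian RBF kernel between two points can be sketched in a few lines of plain Python. The function name `rbf_kernel` and the default `gamma=1.0` are assumptions made for this sketch; it uses the common form k(x, y) = exp(-gamma * ||x - y||^2).

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

    `gamma` is a free parameter (assumed default of 1.0 for this sketch).
    """
    # Squared Euclidean distance between the two points
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    # Identical points have similarity 1; the value decays with distance.
    print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))
    print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))
```

Kernel PCA would evaluate this function for all pairs of samples to build a kernel matrix; that step is left out here.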

Predictive modeling, supervised machine learning, and pattern classification - the big picture

posted on August 24, 2014
Math and Statistics, Machine Learning
When I was working on my next pattern classification application, I realized that it might be worthwhile to take a step back and look at the big picture of pattern classification in order to put my previous topics into context and to provide an introduction for the future topics that are going to follow.

Linear Discriminant Analysis bit by bit

posted on August 03, 2014
Math and Statistics, Machine Learning, Python
I received a lot of positive feedback about the step-wise Principal Component Analysis (PCA) implementation. Thus, I decided to write a little follow-up about Linear Discriminant Analysis (LDA) — another useful linear transformation technique. Both LDA and PCA are commonly used dimensionality reduction techniques in statistics, pattern classification, and machine learning applications. By implementing the LDA step-by-step in Python, we will see and understand how it works, and we will compare it to a PCA to see how it differs.

Molecular docking, estimating free energies of binding, and AutoDock's semi-empirical force field

posted on July 26, 2014
Protein Science
Discussions and questions about methods, approaches, and tools for estimating (relative) binding free energies of protein-ligand complexes are quite popular, and even the simplest tools can be quite tricky to use. Here, I want to briefly summarize the idea of molecular docking and provide a short overview of how we can use AutoDock 4.2's hybrid approach for evaluating binding affinities.

A questionable practice: Dixon's Q test for outlier identification

posted on July 19, 2014
Python, Math and Statistics
I was recently faced with the impossible task of identifying outliers in a dataset with very, very small sample sizes, and Dixon's Q test caught my attention. Honestly, I am not a big fan of this statistical test, but since Dixon's Q test is still quite popular in certain scientific fields (e.g., chemistry), it is important to understand its principles in order to draw your own conclusions about the research data that you might stumble upon in research articles or scientific talks.
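To make the principle concrete, here is a minimal sketch of the Q statistic in its basic form: the gap between a suspect value and its nearest neighbor, divided by the range of the data. The function name `dixon_q` is hypothetical, and the sketch deliberately stops before the decision step, which requires a tabulated critical value for the given sample size and confidence level.

```python
def dixon_q(values):
    """Return (Q_low, Q_high) for the smallest and largest value.

    Q = gap / range, where gap is the distance between the suspect
    value and its nearest neighbor in the sorted data.
    """
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least 3 values")
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical")
    q_low = (data[1] - data[0]) / data_range      # suspect: minimum
    q_high = (data[-1] - data[-2]) / data_range   # suspect: maximum
    return q_low, q_high


if __name__ == "__main__":
    print(dixon_q([1.0, 2.0, 3.0, 4.0, 10.0]))
```

A suspect value would only be flagged if its Q exceeds the critical value looked up for that sample size, which is exactly where the test's limitations with tiny samples show up.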

About Feature Scaling and Normalization and the effect of standardization for machine learning algorithms

posted on July 11, 2014
Python, Machine Learning, Math and Statistics
I received a couple of questions in response to my previous article (Entry point: Data) where people asked me why I used Z-score standardization as the feature scaling method prior to the PCA. I added additional information to the original article; however, I thought that it might be worthwhile to write a few more lines about this important topic in a separate article.

Entry point: Data - Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses

posted on June 26, 2014
Python, Machine Learning
In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses.

An introduction to parallel programming using Python's multiprocessing module

posted on June 20, 2014
Python
The default Python interpreter was designed with simplicity in mind and has a thread-safe mechanism, the so-called "GIL" (Global Interpreter Lock). In order to prevent conflicts between threads, it lets only one thread execute Python bytecode at a time, so CPU-bound code effectively runs serially even when multiple threads are used.
In this introduction to Python's multiprocessing module, we will see how we can spawn multiple subprocesses to avoid some of the GIL's disadvantages and make best use of the multiple cores in our CPU.
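A minimal sketch of the idea, assuming a toy CPU-bound function: the names `cube` and `parallel_cubes` are hypothetical and only serve as placeholders for real work. `multiprocessing.Pool` distributes the calls over separate worker processes, each with its own interpreter and its own GIL.

```python
import multiprocessing as mp


def cube(x):
    """Toy stand-in for a CPU-bound task."""
    return x ** 3


def parallel_cubes(values, processes=2):
    """Apply `cube` to every value using a pool of worker processes."""
    with mp.Pool(processes=processes) as pool:
        # pool.map preserves the input order of the results
        return pool.map(cube, values)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # when the module is imported by a child process.
    print(parallel_cubes(range(1, 7)))
```

For such a tiny task the process start-up cost will outweigh any gain; the pattern pays off only when each call does substantial work.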

Kernel density estimation via the Parzen–Rosenblatt window method - explained using Python

posted on June 19, 2014
Python, Machine Learning, Math and Statistics
The Parzen-window method (also known as the Parzen-Rosenblatt window method) is a widely used non-parametric approach to estimate a probability density function p(x) at a specific point x from a sample of points x_n, and it doesn't require any knowledge or assumption about the underlying distribution.
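As a rough sketch of the estimate, assuming the simplest possible window (a hypercube of edge length h centered at x): count the sample points that fall inside the window and divide by n times the window volume. The function name `parzen_window_estimate` and the choice of a hypercube kernel are assumptions for this illustration; a Gaussian kernel is a common alternative.

```python
def parzen_window_estimate(samples, x, h=1.0):
    """Estimate p(x) as k / (n * h^d) with a hypercube window.

    samples: list of d-dimensional points (sequences of floats)
    x:       the d-dimensional point at which to estimate the density
    h:       edge length of the hypercube window
    """
    n = len(samples)
    d = len(x)
    # k = number of samples inside the hypercube centered at x
    k = sum(
        all(abs(s_i - x_i) / h <= 0.5 for s_i, x_i in zip(s, x))
        for s in samples
    )
    return k / (n * h ** d)


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [0.4, -0.3]]
    print(parzen_window_estimate(points, [0.0, 0.0], h=1.0))
```

The window width h controls the trade-off: a small h gives a spiky estimate, a large h smooths over real structure.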

Numeric matrix manipulation - The cheat sheet for MATLAB, Python NumPy, R, and Julia

posted on June 5, 2014
R, Python, Machine Learning, Math and Statistics, Matlab
At its core, this article is about a simple cheat sheet for basic operations on numeric matrices, which can be very useful if you are working and experimenting with some of the most popular languages that are used for scientific computing, statistics, and data analysis.

The key differences between Python 2.7.x and Python 3.x with examples

posted on June 1, 2014
Python
Many Python users are wondering which version of Python they should use. In my opinion, both Python 2.7.x and 3.x have their advantages and disadvantages, and in practice, it depends on your particular needs which version might be best suited for your project(s). However, it is worthwhile to have a look at the major differences between those two most popular versions of Python to avoid common pitfalls when writing the code for either one of them, or if you are planning to port your project.

5 simple steps for converting Markdown documents into HTML and adding Python syntax highlighting

posted on May 28, 2014
HTML and Markdown, Python
In this little tutorial, I want to show you in 5 simple steps how easy it is to add code syntax highlighting to your blog articles.

Creating a table of contents with internal links in IPython Notebooks and Markdown documents

posted on May 20, 2014
Python, HTML and Markdown
Many people have asked me how I create the table of contents with internal links for my IPython Notebooks and Markdown documents on GitHub. Well, no (IPython) magic is involved; it is just a little bit of HTML, but I thought it might be worthwhile to write this little how-to tutorial.

A Beginner's Guide to Python's Namespaces, Scope Resolution, and the LEGB Rule

posted on May 12, 2014
Python
A short tutorial about Python's namespaces and the scope resolution for variable names using the LEGB-rule with little quiz-like exercises.
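As a small taste of the rule, here is a minimal sketch (all function names are made up for illustration) showing the lookup order Local, Enclosing, Global, Built-in for the same variable name:

```python
x = "global x"


def outer():
    x = "enclosing x"

    def inner_with_local():
        x = "local x"  # Local: found first
        return x

    def inner_reads_enclosing():
        return x  # no local x, so the Enclosing scope is used

    return inner_with_local(), inner_reads_enclosing()


def reads_global():
    return x  # no local or enclosing x, so the Global scope is used


def reads_builtin():
    return len  # not defined anywhere above, so Built-in is used


if __name__ == "__main__":
    print(outer())
    print(reads_global())
    print(reads_builtin())
```

Each function resolves the name in the innermost scope that defines it and never looks further outward than necessary.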

Diving deep into Python - the not-so-obvious language parts

posted on April 26, 2014
Python
Some time ago, I started to collect some of the not-so-obvious things I encountered when I was coding in Python. I thought that it was worthwhile to share them, and I encourage you to take a brief look at the section overview; maybe you'll find something that you do not already know. It will likely save you some time in one or another tricky debugging challenge.

Implementing a Principal Component Analysis (PCA) in Python step by step

posted on April 13, 2014
Python, Machine Learning, Math and Statistics
In this article I want to explain how a Principal Component Analysis (PCA) works by implementing it in Python step by step. At the end we will compare the results to the more convenient Python PCA() classes that are available through the popular matplotlib and scipy libraries and discuss how they differ.

Implementing simple sequential feature selection algorithms in Python

posted on April 2, 2014
Python, Machine Learning, Math and Statistics
I implemented some simple Sequential Feature Selection algorithms in Python for dimensionality reduction in pattern classification tasks. I wrote them up with some comments in the hope that someone might find them useful.

Explaining the difference between a Retina vs. a non-Retina display

posted on March 24, 2014
MacOS
Recently, someone wanted me to explain the difference between a Retina vs. a non-Retina display. I had to explain it to her via email, and the catch was that the other person was reading/seeing it on a non-Retina monitor.

smilite - a Python module for downloading and analyzing SMILE strings

posted on March 23, 2014
Python, Protein Science
smilite is a Python module I wrote in order to download and analyze SMILE strings (Simplified Molecular-Input Line-entry System) of chemical compounds from ZINC (a free database of commercially-available compounds for virtual screening, http://zinc.docking.org).

Installing Scientific Packages for Python3 on MacOS 10.9 Mavericks

posted on March 13, 2014
Python, MacOS
I just went through some pain (again) when I wanted to install some of Python's scientific libraries on my second Mac. I summarized the setup and installation process for future reference.

A thorough guide to SQLite database operations in Python

posted on March 07, 2014
Python, SQLite
After I wrote the initial teaser article "SQLite - Working with large data sets in Python effectively" about how awesome SQLite databases are via sqlite3 in Python, I wanted to delve a little bit more into the SQLite syntax and provide you with some more hands-on examples.

Using OpenEye software for substructure alignments and best-matching low-energy conformer overlays

posted on February 23, 2014
Protein Science
This is a quick guide showing how to use OpenEye software command line tools to align target molecules to a query based on substructure matches and how to retrieve the best molecule overlay from two sets of low-energy conformers.

PyPrind - A simple Python Progress Indicator

posted on February 2, 2014
Python
Sometimes it can be useful to display the progress of a computation, especially for more intensive tasks. I have written a simple module that tracks the progress of iterative Python procedures via a progress bar or percentage indicator. I've been using this tool for a while now, and I thought that it might be worthwhile to share it with you in the hope that it is useful to others as well.

Moving from MATLAB matrices to NumPy arrays - A Matrix Cheatsheet

posted on January 22, 2014
R, Python, Machine Learning, Math and Statistics, Matlab
Over time Python became my favorite programming language for the quick automation of tasks, such as manipulating and analyzing data. Also, I grew fond of the great matplotlib plotting library for Python. MATLAB/Octave was usually my tool of choice when my tasks involved matrices and linear algebra. However, since I feel more comfortable with Python in general, I recently took a second look at Python's NumPy module to integrate matrix operations more easily into larger programs/scripts.

An evaluation of simple Python performance tweaks

posted on January 18, 2014
Python
When we are solving computational problems, we usually have almost unlimited possibilities to write and organize our code. The number of possible solutions is only limited by our own creativity. However, the goal is often not the most creative solution, but the most efficient one. Especially when I write code to analyze massive amounts of data, I want it to do the job as efficiently as possible.
In order to optimize some of my Python code, I analyzed the efficiency of different approaches to solve similar problems, which I want to share with you in this article.

Unit testing in Python - Why we want to make it a habit

posted on December 14, 2013
Python
Let's be honest, code testing is anything but a joyful task. However, a good unit testing framework makes this process as smooth as possible. Eventually, testing becomes a regular and continuous process, accompanied by the assurance that our code will operate just as exactly and seamlessly as Swiss clockwork. [...]
This is especially important in scientific research, where your whole project depends on the correct analysis and assessment of any data - and there is probably no more convenient way to convince both you and the rightly skeptical reviewer that you just made a(nother) groundbreaking discovery.

A short tutorial for decent heat maps in R

posted on December 8, 2013
R
I received many questions from people who want to visualize their data via heat maps as quickly as possible. This is a major issue in exploratory data analysis, since we often don't have the time to digest whole books about the particular techniques in different software packages just to get the job done.

BondPack - A collection of plugins to visualize molecular bonds in PyMOL

posted on November 17, 2013
Python, Protein Science
Drawing interactions between atoms can often be quite cumbersome when done manually. For the sake of convenience, I developed three plugins for PyMOL that will make our lives as protein biologists a little bit easier.

SQLite - Working with large data sets in Python effectively

posted on November 3, 2013
Python, SQLite
My new project confronted me with the task of screening a huge set of large data files in text format with billions of entries each. I will have to retrieve data repeatedly and frequently in the future, thus I was tempted to find a better solution than brute-force scanning through ~20 separate 1-column text files with ~6 billion entries line by line every time.

Getting Things Done With Simplenote

posted on September 22, 2013
MacOS
Every now and then I try to learn from my previous experiences and try to refine my task and project management implementation. My whole goal is to have a system that allows me to have both my tasks and my references handy in one place. An important prerequisite for the tool of choice is that it must be plain and simple to use, transferable, and available on all different platforms that I am using: iPad, iPhone, Mac, and my Linux computer at work.