Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA
--posted on September 14, 2014
2014
Machine Learning, Python, Math and Statistics
1
Most machine learning algorithms have been developed and statistically validated for linearly separable data. Popular examples are linear classifiers like Support Vector Machines (SVMs) or the (standard) Principal Component Analysis (PCA) for dimensionality reduction. However, most real-world data requires nonlinear methods in order to perform tasks that involve the analysis and discovery of patterns successfully.
The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radial basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via RBF kernel principal component analysis (kPCA).
Predictive modeling, supervised machine learning, and pattern classification - the big picture
--posted on August 24, 2014
2014
Math and Statistics, Machine Learning
1
When I was working on my next pattern classification application, I realized that it might be worthwhile to take a step back and look at the big picture of pattern classification in order to put my previous topics into context and to provide an introduction for the future topics that are going to follow.
Linear Discriminant Analysis bit by bit
--posted on August 03, 2014
2014
Math and Statistics, Machine Learning, Python
1
I received a lot of positive feedback about the step-wise Principal Component Analysis (PCA) implementation. Thus, I decided to write a little follow-up about Linear Discriminant Analysis (LDA) — another useful linear transformation technique. Both LDA and PCA are commonly used dimensionality reduction techniques in statistics, pattern classification, and machine learning applications. By implementing the LDA step-by-step in Python, we will see and understand how it works, and we will compare it to a PCA to see how it differs.
Molecular docking, estimating free energies of binding, and AutoDock's semi-empirical force field
--posted on July 26, 2014
2014
Protein Science
1
Discussions and questions about methods, approaches, and tools for estimating (relative) binding free energies of protein-ligand complexes are quite popular, and even the simplest tools can be quite tricky to use. Here, I want to briefly summarize the idea of molecular docking and provide a short overview about how we can use AutoDock 4.2's hybrid approach for evaluating binding affinities.
A questionable practice: Dixon's Q test for outlier identification
--posted on July 19, 2014
2014
Python, Math and Statistics
1
I recently was faced with the impossible task of identifying outliers in a dataset with a very, very small sample size, and Dixon's Q test caught my attention. Honestly, I am not a big fan of this statistical test, but since Dixon's Q test is still quite popular in certain scientific fields (e.g., chemistry), it is important to understand its principles in order to draw your own conclusions from the research data that you might stumble upon in research articles or scientific talks.
About Feature Scaling and Normalization and the effect of standardization for machine learning algorithms
--posted on July 11, 2014
2014
Python, Machine Learning, Math and Statistics
1
I received a couple of questions in response to my previous article (Entry point: Data) where people asked me why I used Z-score standardization as the feature scaling method prior to the PCA. I added additional information to the original article; however, I thought that it might be worthwhile to write a few more lines about this important topic in a separate article.
Entry point: Data - Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses
--posted on June 26, 2014
2014
Python, Machine Learning
1
In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses.
An introduction to parallel programming using Python's multiprocessing module
--posted on June 20, 2014
2014
Python
1
The default Python interpreter was designed with simplicity in mind and has a thread-safety mechanism, the so-called "GIL" (Global Interpreter Lock). In order to prevent conflicts between threads, it allows only one thread to execute Python code at a time, so that threaded code effectively runs serially.
In this introduction to Python's multiprocessing module, we will see how we can spawn multiple subprocesses to avoid some of the GIL's disadvantages and make best use of the multiple cores in our CPU.
Kernel density estimation via the Parzen–Rosenblatt window method - explained using Python
--posted on June 19, 2014
2014
Python, Machine Learning, Math and Statistics
1
The Parzen-window method (also known as the Parzen-Rosenblatt window method) is a widely used non-parametric approach for estimating the probability density p(x) at a specific point x from a sample of observations x_n; it doesn't require any knowledge or assumptions about the underlying distribution.
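To make the idea a bit more concrete, here is a minimal sketch of such an estimate. It assumes a hypercube (box) window of edge length h, which is only one of several possible window functions; the names parzen_density, samples, and h are made up for this illustration and are not taken from the article.

```python
def parzen_density(x, samples, h):
    """Estimate the density at point x using a hypercube window.

    x       -- tuple of coordinates (the point of interest)
    samples -- list of tuples with the same dimensionality as x
    h       -- edge length of the hypercube window (assumed > 0)
    """
    d = len(x)
    # Count the samples that fall inside the hypercube centered at x.
    k = sum(
        all(abs(xi - si) <= h / 2 for xi, si in zip(x, s))
        for s in samples
    )
    # Density estimate: fraction of samples in the window / window volume.
    return k / (len(samples) * h ** d)


# Toy 1-dimensional sample (made-up numbers, for illustration only).
samples = [(0.1,), (0.2,), (0.4,), (0.9,)]
print(parzen_density((0.25,), samples, h=0.5))
```

Three of the four toy samples fall into the window around 0.25, so the estimate is 3 / (4 * 0.5). The window width h plays the role of a smoothing parameter: a larger h gives a smoother but less detailed estimate.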
Numeric matrix manipulation - The cheat sheet for MATLAB, Python NumPy, R, and Julia
--posted on June 5, 2014
2014
1
R, Python, Machine Learning, Math and Statistics, Matlab
At its core, this article is about a simple cheat sheet for basic operations on numeric matrices, which can be very useful if you are working and experimenting with some of the most popular languages that are used for scientific computing, statistics, and data analysis.
The key differences between Python 2.7.x and Python 3.x with examples
--posted on June 1, 2014
2014
1
Python
Many Python users are wondering which version of Python they should use. In my opinion, both Python 2.7.x and 3.x have their advantages and disadvantages, and in practice, it depends on your particular needs which version might be best suited for your project(s). However, it is worthwhile to have a look at the major differences between those two most popular versions of Python to avoid common pitfalls when writing the code for either one of them, or if you are planning to port your project.
5 simple steps for converting Markdown documents into HTML and adding Python syntax highlighting
--posted on May 28, 2014
2014
1
HTML and Markdown, Python
In this little tutorial, I want to show you in 5 simple steps how easy it is to add code syntax highlighting to your blog articles.
Creating a table of contents with internal links in IPython Notebooks and Markdown documents
--posted on May 20, 2014
2014
1
Python, HTML and Markdown
Many people have asked me how I create the table of contents with internal links for my IPython Notebooks and Markdown documents on GitHub. Well, no (IPython) magic is involved; it is just a little bit of HTML, but I thought it might be worthwhile to write this little how-to tutorial.
A Beginner's Guide to Python's Namespaces, Scope Resolution, and the LEGB Rule
--posted on May 12, 2014
2014
1
Python
A short tutorial about Python's namespaces and the scope resolution for variable names using the LEGB-rule with little quiz-like exercises.
Diving deep into Python - the not-so-obvious language parts
--posted on April 26, 2014
2014
1
Python
A while ago, I started to collect some of the not-so-obvious things I encountered when I was coding in Python. I thought that it was worthwhile to share them, and I encourage you to take a brief look at the section overview; maybe you'll find something that you do not already know. It will likely save you some time in one tricky debugging challenge or another.
Implementing a Principal Component Analysis (PCA) in Python step by step
--posted on April 13, 2014
2014
Python, Machine Learning, Math and Statistics
1
In this article I want to explain how a Principal Component Analysis (PCA) works by implementing it in Python step by step. At the end we will compare the results to the more convenient Python PCA() classes that are available through the popular matplotlib and scipy libraries and discuss how they differ.
Implementing simple sequential feature selection algorithms in Python
--posted on April 2, 2014
2014
Python, Machine Learning, Math and Statistics
1
I implemented some simple Sequential Feature Selection algorithms in Python for dimensionality reduction in pattern classification tasks. I wrote it up with some comments in the hope that someone might find it useful.
Explaining the difference between a Retina vs. a non-Retina display
--posted on March 24, 2014
2014
MacOS
0
Recently, someone wanted me to explain the difference between a Retina vs. a non-Retina display. I had to explain it to her via email, and the catch was that the other person was reading/seeing it on a non-Retina monitor.
smilite - a Python module for downloading and analyzing SMILE strings
--posted on March 23, 2014
2014
Python, Protein Science
0
smilite is a Python module I wrote in order to download and analyze SMILE strings (Simplified Molecular-Input Line-entry System) of chemical compounds from ZINC (a free database of commercially-available compounds for virtual screening, http://zinc.docking.org).
Installing Scientific Packages for Python3 on MacOS 10.9 Mavericks
--posted on March 13, 2014
2014
Python, MacOS
1
I just went through some pain (again) when I wanted to install some of Python's scientific libraries on my second Mac. I summarized the setup and installation process for future reference.
A thorough guide to SQLite database operations in Python
--posted on March 07, 2014
2014
Python, SQLite
1
After I wrote the initial teaser article "SQLite - Working with large data sets in Python effectively" about how awesome SQLite databases are via sqlite3 in Python, I wanted to delve a little bit more into the SQLite syntax and provide you with some more hands-on examples.
Using OpenEye software for substructure alignments and best-matching low-energy conformer overlays
--posted on February 23, 2014
2014
Protein Science
1
This is a quick guide showing how to use OpenEye software command line tools to align target molecules to a query based on substructure matches and how to retrieve the best molecule overlay from two sets of low-energy conformers.
PyPrind - A simple Python Progress Indicator
--posted on February 2, 2014
2014
Python
0
Sometimes it can be useful to display the progress of a computation, especially for more intensive tasks. I have written a simple module that tracks the progress of iterative Python procedures via a progress bar or percentage indicator. I've been using this tool for a while now, and I thought that it might be worthwhile to share it with you in the hope that it can be useful to others, too.
Moving from MATLAB matrices to NumPy arrays - A Matrix Cheatsheet
--posted on January 22, 2014
2014
R, Python, Machine Learning, Math and Statistics, Matlab
0
Over time Python became my favorite programming language for the quick automation of tasks, such as manipulating and analyzing data. Also, I grew fond of the great matplotlib plotting library for Python. MATLAB/Octave was usually my tool of choice when my tasks involved matrices and linear algebra. However, since I feel more comfortable with Python in general, I recently took a second look at Python's NumPy module to integrate matrix operations more easily into larger programs/scripts.
An evaluation of simple Python performance tweaks
--posted on January 18, 2014
2014
Python
0
When we are solving computational problems, we usually have almost unlimited possibilities to write and organize our code. The number of possible solutions is only limited by our own creativity. However, the goal is often not the most creative solution, but the most efficient one. Especially when I write code to analyze massive amounts of data, I want it to do the job as efficiently as possible.
In order to optimize some of my Python code, I analyzed the efficiency of different approaches to solve similar problems, which I want to share with you in this article.
Unit testing in Python - Why we want to make it a habit
--posted on December 14, 2013
2013
Python
1
Let's be honest, code testing is anything but a joyful task. However, a good unit testing framework makes this process as smooth as possible. Eventually, testing becomes a regular and continuous process, accompanied by the assurance that our code will operate just as exactly and seamlessly as a Swiss clockwork. [...]
This is especially important in scientific research, where your whole project depends on the correct analysis and assessment of any data - and there is probably no more convenient way to convince both you and the rightly skeptical reviewer that you just made a(nother) groundbreaking discovery.
A short tutorial for decent heat maps in R
--posted on December 8, 2013
2013
R
1
I received many questions from people who want to visualize their data via heat maps, ideally as quickly as possible. This is a major issue in exploratory data analysis, since we often don't have the time to digest whole books about the particular techniques in different software packages just to get the job done.
BondPack - A collection of plugins to visualize molecular bonds in PyMOL
--posted on November 17, 2013
2013
Python, Protein Science
0
Drawing interactions between atoms can often be quite cumbersome when done manually. For the sake of convenience, I developed three plugins for PyMOL that will make our life as protein biologists a little bit easier.
SQLite - Working with large data sets in Python effectively
--posted on November 3, 2013
2013
Python, SQLite
1
My new project confronted me with the task of screening a huge set of large data files in text format with billions of entries each. I will have to retrieve data repeatedly and frequently in the future, so I was tempted to find a better solution than brute-force scanning through ~20 separate 1-column text files with ~6 billion entries line by line every time.
Getting Things Done With Simplenote
--posted on September 22, 2013
2013
MacOS
0
Every now and then I try to learn from my previous experiences and try to refine my task and project management implementation. My whole goal is to have a system that allows me to have both my tasks and my references handy in one place. An important prerequisite for the tool of choice is that it must be plain and simple to use, transferable, and available on all different platforms that I am using: iPad, iPhone, Mac, and my Linux computer at work.
Structural Classification of Food Allergen Epitopes - The PDB To FASTA Converter in Action!
--posted on August 25, 2013
2013
Python, Protein Science
0
Naveen Chakicherla discovered the shared tertiary structure consensus motif of an important group of allergen epitopes in his research project at Lawrence Berkeley National Laboratory. His results were reported in a research article that was published in the July 2013 issue of the Computational Crystallography Newsletter.
Be aware of the streamlined exception hierarchy in Python 3.3.0
--posted on March 3, 2013
2013
Python
0
It is really nice to see the active development of Python. Thanks to the great community, Python has evolved into one of the most popular interpreted programming languages. An important date in the history of Python certainly was December 3rd, 2008 - the release of Python 3.0. However, Python 3 did not please everyone, and the community was divided; to this day, many people are still using Python 2.7.x.
Misleading FASTA sequences in the Protein Data Bank
--posted on February 23, 2013
2013
Python, Protein Science
0
The Protein Data Bank (rcsb.org) deposited amino acid sequences in FASTA format for each PDB structure file. However, those FASTA sequences are not necessarily identical to the amino acid sequences in the corresponding PDB files.