What is the difference between a Perceptron, Adaline, and neural network model?

Both Adaline and the Perceptron are (single-layer) neural network models. The Perceptron is one of the oldest and simplest learning algorithms out there, and I would consider Adaline as an improvement over the Perceptron.

What Adaline and the Perceptron have in common

  • they are classifiers for binary classification
  • both have a linear decision boundary
  • both can learn iteratively, sample by sample (the Perceptron naturally, and Adaline via stochastic gradient descent)
  • both use a threshold function

Before we talk about the differences, let’s talk about the inputs first. The first step in the two algorithms is to compute the so-called net input z as the linear combination of our feature variables x and the model weights w.

Then, in the Perceptron and Adaline, we define a threshold function to make a prediction. I.e., if z is greater than a threshold theta, we predict class 1, and 0 otherwise:

The differences between the Perceptron and Adaline

  • the Perceptron uses the class labels to learn model coefficients
  • Adaline uses continuous predicted values (from the net input) to learn the model coefficients, which is more “powerful” since it tells us by “how much” we were right or wrong

So, in the perceptron, as illustrated below, we simply use the predicted class labels to update the weights, and in Adaline, we use a continuous response:

(Note that I inserted the “activation function” in Adaline just for illustrative purposes; here, this activation function is simply the identity function) Both learning algorithms can actually be summarized by 4 simple steps – given that we use stochastic gradient descent for Adaline:

  1. Initialize the weights to 0 or small random numbers.
  2. For each training sample:
    1. Calculate the output value.
    2. Update the weights.

We write the weight update in each iteration as:

where

Again, the “output” is the continuous net input value in Adaline and the predicted class label in case of the perceptron; eta is the learning rate. (In case you are interested: This weight update in Adaline is basically just taking the “opposite step” in direction of the sum-of-squared error cost gradient. I’ve a more detailed walkthrough here on deriving the cost gradient.

Multi-layer neural networks

Although you haven’t asked about multi-layer neural networks specifically, let me add a few sentences about one of the oldest and most popular multi-layer neural network architectures: the Multi-Layer Perceptron (MLP). The term “Perceptron” is a little bit unfortunate in this context, since it really doesn’t have much to do with Rosenblatt’s Perceptron algorithm.

MLPs can basically be understood as a network of multiple artificial neurons over multiple layers. Here, the activation function is not linear (like in Adaline), but we use a non-linear activation function like the logistic sigmoid (the one that we use in logistic regression) or the hyperbolic tangent, or a piecewise-linear activation function such as the rectifier linear unit (ReLU). In addition, we often use a softmax function (a generalization of the logistic sigmoid for multi-class problems) in the output layer, and a threshold function to turn the predicted probabilities (by the softmax) into class labels.

So, what the advantage of the MLP over the classic Perceptron and Adaline? By connecting the artificial neurons in this network through non-linear activation functions, we can create complex, non-linear decision boundaries that allow us to tackle problems where the different classes are not linearly separable.

Let me show you an example :)

Here’s the Python code if you want to reproduce these plots:

from mlxtend.evaluate import plot_decision_regions
from mlxtend.classifier import Perceptron
from mlxtend.classifier import Adaline
from mlxtend.classifier import MultiLayerPerceptron
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
import matplotlib.gridspec as gridspec
import itertools

gs = gridspec.GridSpec(2, 2)xw
X, y = make_moons(n_samples=100, random_state=123)
fig = plt.figure(figsize=(10,8))

ppn = Perceptron(epochs=50, eta=0.05, random_seed=0)
ppn.fit(X, y)
ada = Adaline(epochs=50, eta=0.05, random_seed=0)
ada.fit(X, y)

mlp = MultiLayerPerceptron(n_output=len(np.unique(y)),
                           n_features=X.shape[1],
                           n_hidden=150,
                           l2=0.0,
                           l1=0.0,
                           epochs=500,
                           eta=0.01,
                           alpha=0.0,
                           decrease_const=0.0,
                           minibatches=1,
                           shuffle_init=False,
                           shuffle_epoch=False,
                           random_seed=0)

mlp = mlp.fit(X, y)


for clf, lab, grd in zip([ppn, ppn, mlp],
                         ['Perceptron', 'Adaline', 'MLP (logistic sigmoid)'],
                         itertools.product([0, 1], repeat=2)):

    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()