The covariance is a measure of how two variables are related to each other, i.e., how two variables vary together.

Let \(n\) be the population size, \(x\) and \(y\) two different features (variables), and \(\mu\) the population mean; the covariance can then be formally defined as:

\[\sigma_{x y}=\frac{1}{n} \sum_{i=1}^{n}\left(x^{(i)}-\mu_{x}\right)\left(y^{(i)}-\mu_{y}\right).\]
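As a quick sanity check, here is a minimal NumPy sketch of this formula, using made-up example data. Note that NumPy's `np.cov` defaults to the sample estimate with a \(1/(n-1)\) normalization; `bias=True` switches it to the population version defined above.

```python
import numpy as np

# Hypothetical example data for two features x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population covariance per the formula above: mean of the
# products of the mean-centered features (1/n normalization).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov with bias=True uses the same 1/n normalization.
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```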

A covariance of 0 indicates that two variables have no linear relationship. If the covariance is positive, the variables tend to increase together, and if the covariance is negative, the variables change in opposite directions. As can be seen in the equation above, the magnitude of the covariance depends on the scale of each variable, which makes covariances hard to compare across different pairs of features.

Pearson’s \(\rho\) or “r” (typically just called the “correlation coefficient”) measures the linear correlation between two features and is closely related to the covariance. In fact, it is a normalized version of the covariance, as shown below:

\[\rho=\frac{\sum_{i=1}^{n}\left[\left(x^{(i)}-\mu_{x}\right)\left(y^{(i)}-\mu_{y}\right)\right]}{\sqrt{\sum_{i=1}^{n}\left(x^{(i)}-\mu_{x}\right)^{2}} \sqrt{\sum_{i=1}^{n}\left(y^{(i)}-\mu_{y}\right)^{2}}}=\frac{\sigma_{x y}}{\sigma_{x} \sigma_{y}}\]

(Note that we dropped the \(1/n\) term as it cancels.)
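The formula above can be translated into a short NumPy sketch (again with made-up example data); the result should match NumPy's built-in `np.corrcoef`, which computes the same Pearson correlation:

```python
import numpy as np

# Hypothetical example data for two features x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Numerator: sum of products of the mean-centered features.
num = np.sum((x - x.mean()) * (y - y.mean()))

# Denominator: product of the square roots of the sums
# of squared deviations (the 1/n terms cancel).
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))

rho = num / den

# np.corrcoef returns the same Pearson correlation coefficient.
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
```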

By dividing the covariance by the features’ standard deviations, we ensure that the correlation between two features is in the range [-1, 1], which makes it more interpretable than the unbounded covariance. However, note that the covariance and correlation are exactly the same if the features are normalized to unit variance (e.g., via standardization or z-score normalization). Two features are perfectly positively correlated if \(\rho=1\) and perfectly negatively correlated if \(\rho=-1\). No linear correlation is observed if \(\rho=0\).

Covariance and correlation for standardized features

We can show that the correlation between two features is in fact equal to the covariance of the two standardized features. To show this, let us first standardize the two features, \(x\) and \(y\), to obtain their z-scores, which we will denote as \(x'\) and \(y'\), respectively:

\[x^{\prime}=\frac{x-\mu_{x}}{\sigma_{x}}, \quad y^{\prime}=\frac{y-\mu_{y}}{\sigma_{y}}.\]

As you recall, the (population) covariance between two features is computed as follows:

\[\sigma_{x y}=\frac{1}{n} \sum_{i=1}^{n}\left(x^{(i)}-\mu_{x}\right)\left(y^{(i)}-\mu_{y}\right).\]

Since standardization performs mean-centering, the standardized features have zero means, so their covariance can be written as

\[\sigma_{x y}^{\prime}=\frac{1}{n} \sum_{i=1}^{n}\left(x^{\prime (i)}-0\right)\left(y^{\prime (i)}-0\right).\]

Now, if we substitute the definitions of the standardized features back in, we get:

\[\begin{aligned} \sigma_{x y}^{\prime} &= \frac{1}{n} \sum_{i=1}^{n}\left(\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}\right)\left(\frac{y^{(i)}-\mu_{y}}{\sigma_{y}}\right) \\ &= \frac{1}{n \cdot \sigma_{x} \sigma_{y}} \sum_{i=1}^{n}\left(x^{(i)}-\mu_{x}\right)\left(y^{(i)}-\mu_{y}\right), \end{aligned}\]

which simplifies to

\[\sigma_{x y}^{\prime}=\frac{\sigma_{x y}}{\sigma_{x} \sigma_{y}}=\rho,\]

and concludes the proof that covariance and correlation are the same if the features are standardized.
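We can verify this result numerically with a small NumPy sketch (using made-up example data): standardizing both features and then taking their population covariance should reproduce Pearson’s correlation coefficient of the original features.

```python
import numpy as np

# Hypothetical example data for two features x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Standardize to z-scores using the population standard deviation
# (np.std defaults to ddof=0, i.e., the 1/n version).
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Population covariance of the standardized features ...
cov_std = np.mean(x_std * y_std)

# ... equals Pearson's correlation of the original features.
assert np.isclose(cov_std, np.corrcoef(x, y)[0, 1])
```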




If you like this content and you are looking for similar, more polished Q & A’s, check out my new book Machine Learning Q and AI.