The partial derivative of the mean squared error with respect to a weight parameter \(w_j\) is straightforward to compute, as the step-by-step derivation below shows:

\[\begin{align} \frac{\partial E}{\partial w_j} &= \frac{\partial}{\partial w_j} \frac{1}{2n} \sum_{i=1}^{n} (t_i - o_i)^2 \\ &= \frac{1}{2n} \sum_{i=1}^{n} \frac{\partial}{\partial w_j} (t_i - o_i)^2 \quad [\text{sum rule}] \\ &= \frac{1}{2n} \sum_{i=1}^{n} 2 (t_i - o_i) \frac{\partial}{\partial w_j} (t_i - o_i) \quad [\text{chain rule}] \\ &= \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i) \left( \frac{\partial}{\partial w_j} t_i - \frac{\partial}{\partial w_j} o_i \right)\\ &= - \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i) \frac{\partial}{\partial w_j} o_i \quad [\text{since } t_i \text{ does not depend on } w_j]. \end{align}\]
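
As a concrete illustration, here is a minimal NumPy sketch (my own example, not from the original text) that evaluates this gradient for a plain linear neuron, assuming \(o_i = \mathbf{x}_i^\top \mathbf{w}\) so that \(\frac{\partial o_i}{\partial w_j} = x_{ij}\), and checks it against a finite-difference approximation:

```python
import numpy as np

def mse_gradient_linear(X, t, w):
    """Gradient of E = 1/(2n) * sum((t_i - o_i)^2) w.r.t. w,
    assuming a linear neuron o_i = x_i . w (identity activation)."""
    n = X.shape[0]
    o = X @ w                          # outputs o_i
    # dE/dw_j = -(1/n) * sum_i (t_i - o_i) * x_ij
    return -(1.0 / n) * X.T @ (t - o)

# Quick numerical check against central finite differences (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
t = rng.normal(size=5)
w = rng.normal(size=3)

analytic = mse_gradient_linear(X, t, w)

E = lambda w_: 0.5 / X.shape[0] * np.sum((t - X @ w_) ** 2)
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (E(w_plus) - E(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```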

Supposing that the output \(o_i\) is computed by some activation function applied to the weighted net input \(\text{net}_i\), expanding \(\frac{\partial o_i}{\partial w_j}\) via the chain rule gives:

\[\begin{align} \frac{\partial E}{\partial w_j} &= - \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i) \frac{\partial}{\partial w_j} o_i\\ & = - \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i) \frac{\partial o_i}{\partial \text{net}_i}\frac{\partial\text{net}_i}{\partial {w_j}}. \end{align}\]
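
For one concrete (assumed) choice of activation, the logistic sigmoid with \(o_i = \sigma(\text{net}_i)\) and \(\text{net}_i = \mathbf{x}_i^\top \mathbf{w}\), the two factors become \(\frac{\partial o_i}{\partial \text{net}_i} = o_i (1 - o_i)\) and \(\frac{\partial \text{net}_i}{\partial w_j} = x_{ij}\). A small sketch under those assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient_sigmoid(X, t, w):
    """Gradient of E = 1/(2n) * sum((t_i - o_i)^2) for o_i = sigmoid(net_i)
    with net_i = x_i . w -- an assumed activation, chosen for illustration."""
    n = X.shape[0]
    net = X @ w
    o = sigmoid(net)
    # Chain rule pieces:
    #   dE/do_i     = -(1/n) (t_i - o_i)
    #   do_i/dnet_i = o_i * (1 - o_i)     (sigmoid derivative)
    #   dnet_i/dw_j = x_ij
    delta = -(t - o) * o * (1.0 - o)      # shape (n,)
    return (1.0 / n) * X.T @ delta        # shape (d,)
```

Swapping in a different activation only changes the \(\frac{\partial o_i}{\partial \text{net}_i}\) factor; the rest of the expression stays the same.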

If you like this content and are looking for similar, more polished Q&As, check out my new book, Machine Learning Q and AI.