Softmax Classifier in Neural Network
Concept
Softmax classifier, together with Support Vector Machine (SVM) are the two most common classifiers used in neural networks. They are usually the last later to convert input data or data from hidden layer to output score for potential classes.
For both Softmax and SVM, the raw scores are calculated exactly the same way via linear transformation (a.k.a. matrix multiplication). Suppose we have input data X with 'T' test cases and each test case with feature dimension 'F', and the classifier is trying to map output to C classes. Then X.shape is (T, F). The classifiers uses a weight matrix W of (F, C) to convert X to raw scores Z, a vector with C elements. The raw score calculation function can be expressed as:
$Z(X,W) = X * W $
or $Z = X.dot(W)$
Softmax function is used to convert raw scores (Z) to a normalized probabilities (q):
$q(z_i) = \frac{e^{z_i}}{\sum\limits_{k=1}^{C} e^{{z_k}}}$
and the subtotal of all q is 1.
The Softmax classifier then uses a cross-entropy loss, between real distribution (p) and the calculated distribution (q), as an objective function for optimization. For a given test case $X_i$, the loss:
$L(p, q)_i = - \sum\limits_{k=1}^{C} p(z_k) * log(q(z_k)) \tag{1}$
Softmax classifier will change the weight matrix W in order to reduce cross-entropy loss. Note that for supervised learning, the correct class distribution p(x) contains only one '1' (i.e. Y = [0, ..., 1, ..., 0]). Hence for each test case, only the one $q(x_k), k$ that corresponding to correct class ($k=y$) contributes to the final loss.
For the rest of this blog, we will use $L$ instead of $L_i$ to stand for loss for a given test $X_i$ to simplify the problem. For all test cases, it is simply a sum of all the $L_i$ calculated.
The cross-entropy loss can be expressed as a probability distribution. For input data $x_i$ with corresponding class $y_i$:
or $Z = X.dot(W)$
Softmax function is used to convert raw scores (Z) to a normalized probabilities (q):
$q(z_i) = \frac{e^{z_i}}{\sum\limits_{k=1}^{C} e^{{z_k}}}$
and the subtotal of all q is 1.
The Softmax classifier then uses a cross-entropy loss, between real distribution (p) and the calculated distribution (q), as an objective function for optimization. For a given test case $X_i$, the loss:
$L(p, q)_i = - \sum\limits_{k=1}^{C} p(z_k) * log(q(z_k)) \tag{1}$
Softmax classifier will change the weight matrix W in order to reduce cross-entropy loss. Note that for supervised learning, the correct class distribution p(x) contains only one '1' (i.e. Y = [0, ..., 1, ..., 0]). Hence for each test case, only the one $q(x_k), k$ that corresponding to correct class ($k=y$) contributes to the final loss.
For the rest of this blog, we will use $L$ instead of $L_i$ to stand for loss for a given test $X_i$ to simplify the problem. For all test cases, it is simply a sum of all the $L_i$ calculated.
Engineering A Numerically Stable Softmax Function
The cross-entropy loss can be expressed as a probability distribution. For input data $x_i$ with corresponding class $y_i$:
$q(y_i | x_i) = \frac{e^{z_{y_i}}}{\sum\limits_{j=1}^{C} e^{z_j}}$
It looks simple enough to implement strictly. However, the exponential operator can produce a large number. Divide such a large number can be numerically unstable. We need a way to avoid numerical issues as much as possible.
A nice property of exponential function is that for negative variable $ r < 0$, the output $e^r$ is squashed between (0, 1]. The idea here is to transfer the numerator and denominator into negative domain without changing the value of the faction:
$\frac{e^{z_{y_i}}}{\sum\limits_{j=1}^{C} e^{z_j}} = \frac{e^{z_{y_i} + M}}{ \sum\limits_{j=1}^{C} e^{{z_j} + M}} $
Constant M is just a number that we want to transfer $z$ into the non-positive domain $z-M$. We can use the largest score of $Z$ vector: $ M = -max(Z)$
To guide optimization process, we need to calculate gradient of Loss function with regarding to $W_{i,j}$. Direct calculation is a daunting task because mapping from W to Loss goes through linear transformation, then Softmax function and then loss function. As usual, we use chain-rule to peel the onion.
First, Softmax function gradient with regarding to raw score vector Z can be calculated per element wise:
$\frac{\partial q(z_i)}{\partial z_{j}} = q(z_i)(1-q(z_i))$ when $i = j$ and
$\frac{\partial q(z_i)}{\partial z_{j}} = -q(z_i)q(z_j)$ when $i \neq j$
2nd, consider cross-entropy loss function (1). The grand truth distribution P has only one '1'. Suppose $Z_y$ is the one corresponding to correct answer. Then the loss function becomes:
$\begin{align} \frac{\partial L(p, q)}{\partial z_j} = \frac{\partial L(p, q(z_y))}{\partial z_j} = q(z_j) - 1(y,j) \tag{2} \label{eq:2} \end{align}$
in which $z_y$ is the one corresponding to the correct class for a given test case $X_i$.
And 1(y,j) is the kronecker delta function:
$1(y,j) = \begin{cases} 1 & y = j \\
0 & y\neq j \end{cases} $
This is how we calculate elements for the partial differential vector of (1, C):
$\frac{\partial L}{\partial Z} = \left[ \frac{\partial L}{\partial z_1}, \frac{\partial L}{\partial z_2},
..., \frac{\partial L}{\partial z_C} \right] \tag{3}$
As the 3rd step, we will handle the matrix multiplication: $W*X = Z$ of the fully connected layer. Let's check the matrix dimensions first. W.shape is (num_classes, num_features) or (C, F), X.shape is (num_features, 1) and raw score Z is (num_classes, 1).
The goal of this optimization is to find better W to produce accurate Z outputs in each iteration. Hence W is the parameter matrix to be fine tuned. We need partial derivatives of Z with regarding to W. W itself is a (C, F) matrix, to make things easier to process, we flat W matrix into a row vector of length C*F, Wflat = W.reshape(1, C*F). Then the partial derivatives is a (C, CF) matrix, which matches with C outputs (Z) and C*F parameters of W (flat version):
$ \frac{\partial Z}{\partial w_{i,j}} = \left[ \begin{array}{ccccc} \frac{\partial z_1}{\partial w_{1,1}} & \frac{\partial z_1}{\partial w_{1,2}} & ... & \frac{\partial z_1}{\partial w_{C,F-1}}& \frac{\partial z_1}{\partial w_{C,F}} \\
\frac{\partial z_2}{\partial w_{1,1}} & \frac{\partial z_2}{\partial w_{1,2}} & ... & \frac{\partial z_2}{\partial w_{C,F-1}}& \frac{\partial z_2}{\partial w_{C,F}} \\
... & ... & ... & ... & ... \\
\frac{\partial z_C}{\partial w_{1,1}} & \frac{\partial z_C}{\partial w_{1,2}} & ... & \frac{\partial z_C}{\partial w_{C,F-1}}& \frac{\partial z_C}{\partial w_{C,F}} \end{array} \right] \tag{4}$
From matrix multiplication (dot product of two vectors) we know that
$z_t = w_{t,1} x_1 + w_{t,2} x_2 + w_{t,3} x_3 + ... + w_{t, F} x_F$
Therefore, we have $\frac{\partial z_t}{\partial w_{i,j}} = 0, ..., x1, x2, x3, ..., x_F, 0, ..., 0$
The partial derivatives are '0' when $i \neq t$
Hence equation (4) is much simpler than its original form:
$ \frac{\partial Z}{\partial w_{i,j}} = \left[ \begin{array}{cccccccc} x_1 & x_2 & ... & x_F & 0 & ... & 0 & 0 \\ 0 & ... & x_1 & ... & x_F & 0 & ... & 0 \\ ... & ... & ... & ... & ... & ... &... &...\\ 0 & 0 & 0 & 0 & 0 & x_1 & ... & x_F \end{array} \right] \tag{5}$
It is a sparse matrix, entries $\frac{\partial z_t}{\partial w_{i,j}}$ are mostly '0' except when row number equals row subscribe of W: $t == i$ and value at {t, (t-1)*F + j} is $x_j$.
Put all 3 steps together, we will calculate the partial derivatives with regarding to W. Using chain rule, we need to calculate $\frac{\partial L}{\partial w_{i,j}} = \sum\limits_{k=1}^{c} \frac{\partial L}{\partial z_{k}} \frac{\partial z_{k}}{\partial w_{i,j}}$, which is just the dot product of vector (3) and column $(i-1)*C+j$ of matrix (4). Hence the partial derivatives can be calculated as vector (3) multiply matrix (4):
$ \frac{\partial L}{\partial W} = \left[ \begin{array}{ccccc} \frac{\partial L}{\partial w_{1,1}}, & \frac{\partial L}{\partial w_{1,2}}, & ... ,& \frac{\partial L}{\partial w_{C,F-1}},& \frac{\partial L}{\partial w_{C,F}} \end{array} \right] \tag{6}$
Remember this row vector is just for convenience purpose. We can always rewrite the partial derivatives in matrix form:
$ \frac{\partial L}{\partial w_{i,j}} = \left[ \begin{array}{ccccc} \frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & ... & \frac{\partial L}{\partial w_{1,F-1}} & \frac{\partial L}{\partial w_{1,F}}\\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & ... & \frac{\partial L}{\partial w_{2,F-1}}& \frac{\partial L}{\partial w_{2,F}} \\
... & ... & ... & ... \\
\frac{\partial L}{\partial w_{C,1}} & \frac{\partial L}{\partial w_{C,2}} & ... & \frac{\partial L}{\partial w_{C,F-1}}& \frac{\partial L}{\partial w_{C,F}} \end{array} \right] \tag{7}$
Now let's check each element, like the one at {i,j}:
$\frac{\partial L}{\partial w_{i,j}} = \sum\limits_{k=1}^{C}\frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial w_{i,j}} \tag{8}$
From equation (5) we know that only when $k==i$, $\frac{\partial z_k}{\partial w_{i,j}}$ is not zero and the value is $x_j$. Using this property, together with equation (2), equation (8) can be simplified as:
$\frac{\partial L}{\partial w_{i,j}} = (q(z_i) - 1(y,i)) * x_j \tag{9}$
Surprisingly, the final result is so simple and elegant. On the other hand, exponential function is the usual choice of eigenfunction for linear, time-invariant (LTI) systems. For more complicated systems, the result may not be as simple as Softmax, the onion peeling method should equally apply!
$\frac{e^{z_{y_i}}}{\sum\limits_{j=1}^{C} e^{z_j}} = \frac{e^{z_{y_i} + M}}{ \sum\limits_{j=1}^{C} e^{{z_j} + M}} $
Constant M is just a number that we want to transfer $z$ into the non-positive domain $z-M$. We can use the largest score of $Z$ vector: $ M = -max(Z)$
Gradient
To guide optimization process, we need to calculate gradient of Loss function with regarding to $W_{i,j}$. Direct calculation is a daunting task because mapping from W to Loss goes through linear transformation, then Softmax function and then loss function. As usual, we use chain-rule to peel the onion.
First, Softmax function gradient with regarding to raw score vector Z can be calculated per element wise:
$\frac{\partial q(z_i)}{\partial z_{j}} = q(z_i)(1-q(z_i))$ when $i = j$ and
$\frac{\partial q(z_i)}{\partial z_{j}} = -q(z_i)q(z_j)$ when $i \neq j$
2nd, consider cross-entropy loss function (1). The grand truth distribution P has only one '1'. Suppose $Z_y$ is the one corresponding to correct answer. Then the loss function becomes:
$\begin{align} \frac{\partial L(p, q)}{\partial z_j} = \frac{\partial L(p, q(z_y))}{\partial z_j} = q(z_j) - 1(y,j) \tag{2} \label{eq:2} \end{align}$
in which $z_y$ is the one corresponding to the correct class for a given test case $X_i$.
And 1(y,j) is the kronecker delta function:
$1(y,j) = \begin{cases} 1 & y = j \\
0 & y\neq j \end{cases} $
This is how we calculate elements for the partial differential vector of (1, C):
$\frac{\partial L}{\partial Z} = \left[ \frac{\partial L}{\partial z_1}, \frac{\partial L}{\partial z_2},
..., \frac{\partial L}{\partial z_C} \right] \tag{3}$
As the 3rd step, we will handle the matrix multiplication: $W*X = Z$ of the fully connected layer. Let's check the matrix dimensions first. W.shape is (num_classes, num_features) or (C, F), X.shape is (num_features, 1) and raw score Z is (num_classes, 1).
The goal of this optimization is to find better W to produce accurate Z outputs in each iteration. Hence W is the parameter matrix to be fine tuned. We need partial derivatives of Z with regarding to W. W itself is a (C, F) matrix, to make things easier to process, we flat W matrix into a row vector of length C*F, Wflat = W.reshape(1, C*F). Then the partial derivatives is a (C, CF) matrix, which matches with C outputs (Z) and C*F parameters of W (flat version):
$ \frac{\partial Z}{\partial w_{i,j}} = \left[ \begin{array}{ccccc} \frac{\partial z_1}{\partial w_{1,1}} & \frac{\partial z_1}{\partial w_{1,2}} & ... & \frac{\partial z_1}{\partial w_{C,F-1}}& \frac{\partial z_1}{\partial w_{C,F}} \\
\frac{\partial z_2}{\partial w_{1,1}} & \frac{\partial z_2}{\partial w_{1,2}} & ... & \frac{\partial z_2}{\partial w_{C,F-1}}& \frac{\partial z_2}{\partial w_{C,F}} \\
... & ... & ... & ... & ... \\
\frac{\partial z_C}{\partial w_{1,1}} & \frac{\partial z_C}{\partial w_{1,2}} & ... & \frac{\partial z_C}{\partial w_{C,F-1}}& \frac{\partial z_C}{\partial w_{C,F}} \end{array} \right] \tag{4}$
From matrix multiplication (dot product of two vectors) we know that
$z_t = w_{t,1} x_1 + w_{t,2} x_2 + w_{t,3} x_3 + ... + w_{t, F} x_F$
Therefore, we have $\frac{\partial z_t}{\partial w_{i,j}} = 0, ..., x1, x2, x3, ..., x_F, 0, ..., 0$
The partial derivatives are '0' when $i \neq t$
Hence equation (4) is much simpler than its original form:
$ \frac{\partial Z}{\partial w_{i,j}} = \left[ \begin{array}{cccccccc} x_1 & x_2 & ... & x_F & 0 & ... & 0 & 0 \\ 0 & ... & x_1 & ... & x_F & 0 & ... & 0 \\ ... & ... & ... & ... & ... & ... &... &...\\ 0 & 0 & 0 & 0 & 0 & x_1 & ... & x_F \end{array} \right] \tag{5}$
It is a sparse matrix, entries $\frac{\partial z_t}{\partial w_{i,j}}$ are mostly '0' except when row number equals row subscribe of W: $t == i$ and value at {t, (t-1)*F + j} is $x_j$.
Put all 3 steps together, we will calculate the partial derivatives with regarding to W. Using chain rule, we need to calculate $\frac{\partial L}{\partial w_{i,j}} = \sum\limits_{k=1}^{c} \frac{\partial L}{\partial z_{k}} \frac{\partial z_{k}}{\partial w_{i,j}}$, which is just the dot product of vector (3) and column $(i-1)*C+j$ of matrix (4). Hence the partial derivatives can be calculated as vector (3) multiply matrix (4):
$ \frac{\partial L}{\partial W} = \left[ \begin{array}{ccccc} \frac{\partial L}{\partial w_{1,1}}, & \frac{\partial L}{\partial w_{1,2}}, & ... ,& \frac{\partial L}{\partial w_{C,F-1}},& \frac{\partial L}{\partial w_{C,F}} \end{array} \right] \tag{6}$
Remember this row vector is just for convenience purpose. We can always rewrite the partial derivatives in matrix form:
$ \frac{\partial L}{\partial w_{i,j}} = \left[ \begin{array}{ccccc} \frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & ... & \frac{\partial L}{\partial w_{1,F-1}} & \frac{\partial L}{\partial w_{1,F}}\\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & ... & \frac{\partial L}{\partial w_{2,F-1}}& \frac{\partial L}{\partial w_{2,F}} \\
... & ... & ... & ... \\
\frac{\partial L}{\partial w_{C,1}} & \frac{\partial L}{\partial w_{C,2}} & ... & \frac{\partial L}{\partial w_{C,F-1}}& \frac{\partial L}{\partial w_{C,F}} \end{array} \right] \tag{7}$
Now let's check each element, like the one at {i,j}:
$\frac{\partial L}{\partial w_{i,j}} = \sum\limits_{k=1}^{C}\frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial w_{i,j}} \tag{8}$
From equation (5) we know that only when $k==i$, $\frac{\partial z_k}{\partial w_{i,j}}$ is not zero and the value is $x_j$. Using this property, together with equation (2), equation (8) can be simplified as:
$\frac{\partial L}{\partial w_{i,j}} = (q(z_i) - 1(y,i)) * x_j \tag{9}$
Surprisingly, the final result is so simple and elegant. On the other hand, exponential function is the usual choice of eigenfunction for linear, time-invariant (LTI) systems. For more complicated systems, the result may not be as simple as Softmax, the onion peeling method should equally apply!