Softmax Classifier in Neural Network
Concept
The Softmax classifier and the Support Vector Machine (SVM) are the two most common classifiers used in neural networks. They usually form the last layer, converting the input data or the hidden-layer activations into output scores for the candidate classes.
For both Softmax and SVM, the raw scores are calculated exactly the same way, via a linear transformation (a.k.a. matrix multiplication). Suppose we have input data X with T test cases, each with feature dimension F, and the classifier is trying to map the output to C classes. Then X.shape is (T, F). The classifier uses a weight matrix W of shape (F, C) to convert X into raw scores Z; for each test case, Z is a vector with C elements. The raw score calculation can be expressed as:
$$ Z(X, W) = X \cdot W $$
or Z = X.dot(W).
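As a quick sanity check of the shapes, here is a minimal NumPy sketch of the raw-score computation (T, F, and C are arbitrary example sizes):

```python
import numpy as np

T, F, C = 5, 4, 3             # number of test cases, feature dimension, number of classes
X = np.random.randn(T, F)     # input data, one row per test case
W = np.random.randn(F, C)     # weight matrix of the classifier

Z = X.dot(W)                  # raw scores
print(Z.shape)                # (5, 3): one row of C raw scores per test case
```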
The Softmax function is used to convert the raw scores Z into normalized probabilities q:
$$ q(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}} $$
and the sum of all q is 1.
The Softmax classifier then uses the cross-entropy loss between the true distribution p and the predicted distribution q as the objective function for optimization. For a given test case X_i, the loss is:
$$ L(p, q)_i = -\sum_{k=1}^{C} p(z_k) \log\big(q(z_k)\big) \tag{1} $$
The Softmax classifier adjusts the weight matrix W in order to reduce the cross-entropy loss. Note that for supervised learning, the correct class distribution p contains only a single '1' (i.e. Y = [0, ..., 1, ..., 0]). Hence, for each test case, only the term q(z_k) whose index corresponds to the correct class (k = y) contributes to the final loss.
For the rest of this blog, we will write L instead of L_i for the loss of a single test case X_i, to simplify the notation. For all test cases, the total loss is simply the sum of the individual L_i.
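To make the one-hot observation concrete, here is a small sketch for a single test case, assuming its raw scores Z and correct class y are already given (the numbers are arbitrary example values):

```python
import numpy as np

Z = np.array([2.0, 1.0, 0.1])          # raw scores for one test case
y = 0                                   # index of the correct class

q = np.exp(Z) / np.sum(np.exp(Z))       # Softmax probabilities; q sums to 1
p = np.zeros_like(q)                    # one-hot ground-truth distribution
p[y] = 1.0

loss_full = -np.sum(p * np.log(q))      # full cross-entropy sum over all classes
loss_single = -np.log(q[y])             # only the correct-class term survives
print(np.isclose(loss_full, loss_single))   # True
```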
Engineering A Numerically Stable Softmax Function
With a one-hot target distribution, the cross-entropy loss depends only on the predicted probability of the correct class. For input data x_i with correct class y_i, that probability is:
$$ q(y_i \mid x_i) = \frac{e^{z_{y_i}}}{\sum_{j=1}^{C} e^{z_j}} $$
It looks simple enough to implement directly. However, the exponential operator can produce very large numbers, and dividing such large numbers is numerically unstable. We need a way to avoid these numerical issues as much as possible.
A nice property of the exponential function is that for a non-positive argument r ≤ 0, the output e^r is squashed into (0, 1]. The idea here is to shift the numerator and the denominator into the non-positive domain without changing the value of the fraction:
$$ \frac{e^{z_{y_i}}}{\sum_{j=1}^{C} e^{z_j}} = \frac{e^{z_{y_i} + M}}{\sum_{j=1}^{C} e^{z_j + M}} $$
The constant M is just a number chosen so that every shifted score z + M lands in the non-positive domain. A convenient choice is the negative of the largest score in the Z vector: M = -max(Z).
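Here is a minimal NumPy sketch of the shifted Softmax for a single score vector; the function name stable_softmax and the sample scores are just illustrative:

```python
import numpy as np

def stable_softmax(Z):
    """Softmax of a 1-D raw-score vector, shifted by M = -max(Z) for stability."""
    shifted = Z - np.max(Z)            # every shifted score is now <= 0
    exp_scores = np.exp(shifted)       # all values lie in (0, 1], no overflow
    return exp_scores / np.sum(exp_scores)

Z = np.array([1000.0, 1001.0, 1002.0])
# np.exp(Z) would overflow to [inf, inf, inf], making the naive ratio ill-defined
print(stable_softmax(Z))               # well-behaved probabilities that sum to 1
```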
Gradient
To guide the optimization process, we need to calculate the gradient of the loss function with respect to each w_{i,j}. Direct calculation is a daunting task, because the mapping from W to the loss goes through the linear transformation, then the Softmax function, and then the loss function. As usual, we use the chain rule to peel the onion.
First, the gradient of the Softmax function with respect to the raw score vector Z can be calculated element-wise:
$$ \frac{\partial q(z_i)}{\partial z_j} = q(z_i)\big(1 - q(z_i)\big) \quad \text{when } i = j $$
and
$$ \frac{\partial q(z_i)}{\partial z_j} = -q(z_i)\,q(z_j) \quad \text{when } i \neq j $$
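The two cases can be packed into a (C, C) Jacobian matrix. A small sketch (with made-up probabilities) that builds it and spot-checks one diagonal and one off-diagonal entry:

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])              # Softmax output q(Z) for some score vector

# dq_i/dz_j = q_i * (delta_ij - q_j): diagonal q_i(1 - q_i), off-diagonal -q_i * q_j
jacobian = np.diag(q) - np.outer(q, q)

print(jacobian[0, 0], q[0] * (1 - q[0]))   # i == j case
print(jacobian[0, 1], -q[0] * q[1])        # i != j case
```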
Second, consider the cross-entropy loss function (1). The ground-truth distribution p has only a single '1'. Suppose z_y is the raw score corresponding to the correct answer, so the loss reduces to L = -log(q(z_y)). Its partial derivative with respect to z_j then becomes:
$$ \frac{\partial L(p, q)}{\partial z_j} = \frac{\partial L\big(p, q(z_y)\big)}{\partial z_j} = q(z_j) - \mathbb{1}(y, j) \tag{2} $$
in which z_y is the raw score of the correct class for the given test case X_i.
And 1(y, j) is the Kronecker delta function:
$$ \mathbb{1}(y, j) = \begin{cases} 1 & y = j \\ 0 & y \neq j \end{cases} $$
This is how we calculate the elements of the (1, C) partial-derivative vector:
$$ \frac{\partial L}{\partial Z} = \left[ \frac{\partial L}{\partial z_1}, \frac{\partial L}{\partial z_2}, \dots, \frac{\partial L}{\partial z_C} \right] \tag{3} $$
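In NumPy this vector is simply the predicted probabilities with 1 subtracted at the correct class. A minimal sketch for one test case (probabilities and label are example values):

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])   # Softmax probabilities for one test case
y = 1                            # index of the correct class

dL_dZ = q.copy()                 # dL/dz_j = q(z_j) - 1(y, j), see equation (2)
dL_dZ[y] -= 1.0
print(dL_dZ)                     # [ 0.7, -0.8,  0.1]
```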
As the third step, we handle the matrix multiplication of the fully connected layer, W * X = Z. Let's check the matrix dimensions first. For a single test case we use the transposed convention: W.shape is (num_classes, num_features), or (C, F); X is a column vector of shape (num_features, 1); and the raw score Z has shape (num_classes, 1).
The goal of the optimization is to find a better W that produces more accurate Z outputs in each iteration, so W is the parameter matrix to be fine-tuned. We need the partial derivatives of Z with respect to W. W itself is a (C, F) matrix; to make things easier to process, we flatten W into a row vector of length C*F, Wflat = W.reshape(1, C*F). The partial derivatives then form a (C, C*F) matrix, matching the C outputs in Z against the C*F parameters of the flattened W:
$$ \frac{\partial Z}{\partial W_{\text{flat}}} =
\begin{bmatrix}
\frac{\partial z_1}{\partial w_{1,1}} & \frac{\partial z_1}{\partial w_{1,2}} & \dots & \frac{\partial z_1}{\partial w_{C,F-1}} & \frac{\partial z_1}{\partial w_{C,F}} \\
\frac{\partial z_2}{\partial w_{1,1}} & \frac{\partial z_2}{\partial w_{1,2}} & \dots & \frac{\partial z_2}{\partial w_{C,F-1}} & \frac{\partial z_2}{\partial w_{C,F}} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial z_C}{\partial w_{1,1}} & \frac{\partial z_C}{\partial w_{1,2}} & \dots & \frac{\partial z_C}{\partial w_{C,F-1}} & \frac{\partial z_C}{\partial w_{C,F}}
\end{bmatrix} \tag{4} $$
From matrix multiplication (dot product of two vectors) we know that
$$ z_t = w_{t,1} x_1 + w_{t,2} x_2 + w_{t,3} x_3 + \dots + w_{t,F} x_F $$
Therefore, row t of (4) is mostly zeros, with x_1 through x_F appearing only in the columns that belong to row t of W:
$$ \frac{\partial z_t}{\partial W_{\text{flat}}} = [\,0, \dots, 0,\; x_1, x_2, x_3, \dots, x_F,\; 0, \dots, 0\,] $$
The partial derivatives ∂z_t/∂w_{i,j} are 0 whenever i ≠ t.
Hence equation (4) is much simpler than its general form suggests:
$$ \frac{\partial Z}{\partial W_{\text{flat}}} =
\begin{bmatrix}
x_1 & \dots & x_F & 0 & \dots & 0 & \dots & 0 & \dots & 0 \\
0 & \dots & 0 & x_1 & \dots & x_F & \dots & 0 & \dots & 0 \\
\vdots & & & & & & \ddots & & & \vdots \\
0 & \dots & 0 & 0 & \dots & 0 & \dots & x_1 & \dots & x_F
\end{bmatrix} \tag{5} $$
It is a sparse matrix: the entries ∂z_t/∂w_{i,j} are mostly '0', except when the row number equals the row subscript of W (t == i), and the value at position {t, (t-1)*F + j} is x_j.
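One way to see this block structure in code is with np.kron, which places a copy of the input vector x into the columns belonging to row t of W; this is only a sanity-check sketch with small example sizes:

```python
import numpy as np

C, F = 3, 4
x = np.random.randn(F)              # a single input vector

# Jacobian dZ/dW_flat: row t holds x in columns (t-1)*F + 1 ... t*F (1-indexed)
dZ_dWflat = np.kron(np.eye(C), x)   # shape (C, C*F)
print(dZ_dWflat.shape)              # (3, 12)
print(dZ_dWflat[1])                 # zeros everywhere except x in the middle block
```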
Putting all three steps together, we can calculate the partial derivatives with respect to W. Using the chain rule, we need
$$ \frac{\partial L}{\partial w_{i,j}} = \sum_{k=1}^{C} \frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial w_{i,j}}, $$
which is just the dot product of vector (3) and column (i-1)*F + j of matrix (4). Hence all of the partial derivatives can be calculated at once as vector (3) multiplied by matrix (4):
$$ \frac{\partial L}{\partial W} = \left[ \frac{\partial L}{\partial w_{1,1}}, \frac{\partial L}{\partial w_{1,2}}, \dots, \frac{\partial L}{\partial w_{C,F-1}}, \frac{\partial L}{\partial w_{C,F}} \right] \tag{6} $$
Remember, this row vector is just for convenience. We can always rewrite the partial derivatives in matrix form:
$$ \frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \dots & \frac{\partial L}{\partial w_{1,F-1}} & \frac{\partial L}{\partial w_{1,F}} \\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \dots & \frac{\partial L}{\partial w_{2,F-1}} & \frac{\partial L}{\partial w_{2,F}} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial L}{\partial w_{C,1}} & \frac{\partial L}{\partial w_{C,2}} & \dots & \frac{\partial L}{\partial w_{C,F-1}} & \frac{\partial L}{\partial w_{C,F}}
\end{bmatrix} \tag{7} $$
Now let's check each element, like the one at {i,j}:
$$ \frac{\partial L}{\partial w_{i,j}} = \sum_{k=1}^{C} \frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial w_{i,j}} \tag{8} $$
From equation (5) we know that ∂z_k/∂w_{i,j} is nonzero only when k == i, and its value is then x_j. Using this property, together with equation (2), equation (8) can be simplified to:
$$ \frac{\partial L}{\partial w_{i,j}} = \big( q(z_i) - \mathbb{1}(y, i) \big) \, x_j $$
Surprisingly, the final result is simple and elegant. On the other hand, the exponential function is the usual choice of eigenfunction for linear, time-invariant (LTI) systems. For more complicated systems the result may not be as simple as Softmax's, but the onion-peeling method should apply equally well!
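As a closing sanity check, here is a minimal NumPy sketch that computes the full (C, F) gradient for one test case as an outer product of (q - one-hot(y)) and x, and compares one entry against a brute-force central-difference estimate (all sizes and values are illustrative):

```python
import numpy as np

def softmax(Z):
    exp_scores = np.exp(Z - np.max(Z))     # numerically stable Softmax
    return exp_scores / np.sum(exp_scores)

C, F = 3, 4
W = np.random.randn(C, F)                  # weights, (num_classes, num_features)
x = np.random.randn(F)                     # one test case
y = 2                                      # index of the correct class

q = softmax(W.dot(x))                      # predicted probabilities
dq = q.copy()
dq[y] -= 1.0                               # dL/dZ = q - one_hot(y)
dW = np.outer(dq, x)                       # dL/dw_{i,j} = (q(z_i) - 1(y,i)) * x_j

# Central-difference check of a single entry of the analytic gradient
i, j, h = 1, 0, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += h
Wm[i, j] -= h
numeric = (-np.log(softmax(Wp.dot(x))[y]) + np.log(softmax(Wm.dot(x))[y])) / (2 * h)
print(dW[i, j], numeric)                   # the two values should agree closely
```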