
Sunday, November 27, 2016

Softmax Classifier in Neural Network

Concept


  The Softmax classifier and the Support Vector Machine (SVM) are the two most common classifiers used in neural networks. They are usually the last layer, converting input data or data from a hidden layer into output scores for the potential classes.

  For both Softmax and SVM, the raw scores are calculated exactly the same way via a linear transformation (a.k.a. matrix multiplication). Suppose we have input data X with T test cases, each with feature dimension F, and the classifier maps to C classes. Then X.shape is (T, F). The classifier uses a weight matrix W of shape (F, C) to convert X to raw scores Z; for each test case, Z is a vector with C elements. The raw score calculation can be expressed as:

$$Z(X, W) = XW$$

or, in code: Z = X.dot(W)

The Softmax function is used to convert the raw scores (Z) into normalized probabilities (q):

$$q(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}$$

and the sum of all q values is 1.

The Softmax classifier then uses the cross-entropy loss between the true distribution (p) and the calculated distribution (q) as the objective function for optimization. For a given test case $X_i$, the loss is:

$$L_i(p, q) = -\sum_{k=1}^{C} p(z_k)\,\log\big(q(z_k)\big) \tag{1}$$

  The Softmax classifier adjusts the weight matrix W in order to reduce the cross-entropy loss. Note that for supervised learning, the correct class distribution p contains exactly one '1' (i.e. Y = [0, ..., 1, ..., 0]). Hence for each test case, only the single $q(z_k)$ corresponding to the correct class ($k = y$) contributes to the final loss.

  For the rest of this blog, we will write L instead of $L_i$ for the loss of a given test case $X_i$, to simplify the notation. For all test cases, the total loss is simply the sum of all the $L_i$.
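
To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the classifier described here) that computes the raw scores, the softmax probabilities, and the cross-entropy loss for one test case. The naive exponentials used here are refined in the next section.

import numpy as np

def softmax_cross_entropy(x, W, y):
    """Raw scores -> softmax probabilities -> cross-entropy loss for one test case.

    x: (F,) feature vector, W: (F, C) weight matrix, y: index of the correct class.
    """
    z = x.dot(W)                          # raw scores Z = XW, shape (C,)
    q = np.exp(z) / np.sum(np.exp(z))     # softmax: probabilities that sum to 1
    loss = -np.log(q[y])                  # only the correct class contributes to the loss
    return q, loss

# tiny example: 4 features, 3 classes
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
q, loss = softmax_cross_entropy(x, W, y=1)
print(q, q.sum(), loss)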

Engineering A Numerically Stable Softmax Function 


  Since p has a single '1' at the correct class, the cross-entropy loss reduces to the negative log of the predicted probability of the correct class. For input data $x_i$ with correct class $y_i$, that probability is:

$$q(y_i \mid x_i) = \frac{e^{z_{y_i}}}{\sum_{j=1}^{C} e^{z_j}}$$

  It looks simple enough to implement directly. However, the exponential can produce very large numbers, and dividing such large numbers is numerically unstable. We need a way to avoid these numerical issues as much as possible.

  A nice property of the exponential function is that for a non-positive argument $r \le 0$, the output $e^r$ is squashed into (0, 1]. The idea here is to shift the numerator and denominator into the non-positive domain without changing the value of the fraction:

$$\frac{e^{z_{y_i}}}{\sum_{j=1}^{C} e^{z_j}} = \frac{e^{z_{y_i}+M}}{\sum_{j=1}^{C} e^{z_j+M}}$$

  The constant M is just a number chosen so that every shifted score is non-positive, $z_j + M \le 0$. We can use the negative of the largest score in the Z vector: $M = -\max(Z)$.
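
A minimal NumPy sketch of this shift trick (illustrative only; the function name is mine):

import numpy as np

def stable_softmax(z):
    """Numerically stable softmax: shift the scores so the largest becomes 0."""
    z_shifted = z - np.max(z)      # equivalent to adding M = -max(Z)
    exp_z = np.exp(z_shifted)      # every exponential now lies in (0, 1]
    return exp_z / np.sum(exp_z)

z = np.array([1000.0, 1001.0, 1002.0])
# the naive version overflows: np.exp(1002.0) is inf, giving nan probabilities
print(stable_softmax(z))           # [0.09003057 0.24472847 0.66524096]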

Gradient


  To guide the optimization process, we need the gradient of the loss function with respect to $w_{i,j}$. Direct calculation is daunting because the mapping from W to the loss goes through the linear transformation, then the Softmax function, and then the loss function. As usual, we use the chain rule to peel the onion.

  First, the gradient of the Softmax function with respect to the raw score vector Z can be calculated element-wise:

$$\frac{\partial q(z_i)}{\partial z_j} = q(z_i)\,\big(1 - q(z_i)\big) \quad \text{when } i = j, \qquad
\frac{\partial q(z_i)}{\partial z_j} = -\,q(z_i)\,q(z_j) \quad \text{when } i \ne j$$

  Second, consider the cross-entropy loss function (1). The ground-truth distribution p has only one '1'. Suppose $z_y$ is the score corresponding to the correct class, so the loss reduces to $L = -\log\big(q(z_y)\big)$. Combining this with the Softmax derivatives above, the gradient of the loss with respect to each raw score becomes:

$$\frac{\partial L(p,q)}{\partial z_j} = \frac{\partial L\big(p, q(z_y)\big)}{\partial z_j} = q(z_j) - 1(y, j) \tag{2}$$
in which $z_y$ is the score corresponding to the correct class for the given test case $X_i$, and $1(y, j)$ is the Kronecker delta function:

$$1(y, j) = \begin{cases} 1 & y = j \\ 0 & y \ne j \end{cases}$$

This is how we calculate the elements of the (1, C) partial-derivative vector:

$$\frac{\partial L}{\partial Z} = \left[\frac{\partial L}{\partial z_1}, \frac{\partial L}{\partial z_2}, \dots, \frac{\partial L}{\partial z_C}\right] \tag{3}$$
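
In code, this gradient is just the softmax output with 1 subtracted at the correct class. A small NumPy sketch (my own, with a finite-difference check of equation (2)):

import numpy as np

def dL_dZ(z, y):
    """Gradient of the cross-entropy loss w.r.t. the raw scores: q(z_j) - 1(y, j)."""
    q = np.exp(z - np.max(z))
    q /= q.sum()
    grad = q.copy()
    grad[y] -= 1.0                     # subtract the Kronecker delta at the correct class
    return grad

# quick finite-difference check of equation (2)
z, y, h = np.array([2.0, -1.0, 0.5]), 0, 1e-5
loss = lambda z: -np.log(np.exp(z - z.max())[y] / np.exp(z - z.max()).sum())
num = np.array([(loss(z + h * np.eye(3)[j]) - loss(z - h * np.eye(3)[j])) / (2 * h)
                for j in range(3)])
print(np.allclose(dL_dZ(z, y), num))   # True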

  As the third step, we handle the matrix multiplication $WX = Z$ of the fully connected layer. Note that we switch here to the per-test-case, column-vector convention. Let's check the matrix dimensions first: W.shape is (num_classes, num_features) or (C, F), X.shape is (num_features, 1) or (F, 1), and the raw score Z is (num_classes, 1) or (C, 1).

  The goal of the optimization is to find a better W that produces more accurate Z outputs in each iteration, so W is the parameter matrix to be fine-tuned. We need the partial derivatives of Z with respect to W. W itself is a (C, F) matrix; to make things easier to process, we flatten W into a row vector of length C*F, Wflat = W.reshape(1, C*F). Then the partial derivative is a (C, C*F) matrix, matching the C outputs (Z) and the C*F parameters of W (flat version):

$$\frac{\partial Z}{\partial W} = \begin{bmatrix}
\frac{\partial z_1}{\partial w_{1,1}} & \frac{\partial z_1}{\partial w_{1,2}} & \cdots & \frac{\partial z_1}{\partial w_{C,F-1}} & \frac{\partial z_1}{\partial w_{C,F}} \\
\frac{\partial z_2}{\partial w_{1,1}} & \frac{\partial z_2}{\partial w_{1,2}} & \cdots & \frac{\partial z_2}{\partial w_{C,F-1}} & \frac{\partial z_2}{\partial w_{C,F}} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial z_C}{\partial w_{1,1}} & \frac{\partial z_C}{\partial w_{1,2}} & \cdots & \frac{\partial z_C}{\partial w_{C,F-1}} & \frac{\partial z_C}{\partial w_{C,F}}
\end{bmatrix} \tag{4}$$

From the matrix multiplication (the dot product of row t of W with X) we know that

$$z_t = w_{t,1}x_1 + w_{t,2}x_2 + w_{t,3}x_3 + \cdots + w_{t,F}x_F$$

Therefore, row t of this matrix is $\frac{\partial z_t}{\partial W_{flat}} = [0, \dots, 0, x_1, x_2, x_3, \dots, x_F, 0, \dots, 0]$, where the nonzero block sits in the columns belonging to row t of W. The partial derivatives are '0' whenever the weight does not belong to row t, because $z_t$ does not depend on any other row of W.

Hence equation (4) is much simpler than its original form:
$$\frac{\partial Z}{\partial W} = \begin{bmatrix}
x_1 & x_2 & \cdots & x_F & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & x_1 & x_2 & \cdots & x_F & 0 & \cdots & 0 \\
\vdots & & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0 & x_1 & x_2 & \cdots & x_F
\end{bmatrix} \tag{5}$$
It is a sparse matrix: the entries $\frac{\partial z_t}{\partial w_{i,j}}$ are mostly '0', except when the row number equals the row subscript of W ($t = i$), and the value at position $\{t, (t-1) \cdot F + j\}$ is $x_j$.

Putting all three steps together, we can calculate the partial derivatives with respect to W. Using the chain rule, we need $\frac{\partial L}{\partial w_{i,j}} = \sum_{k=1}^{C}\frac{\partial L}{\partial z_k}\frac{\partial z_k}{\partial w_{i,j}}$, which is just the dot product of vector (3) and column $(i-1) \cdot F + j$ of matrix (4). Hence the full set of partial derivatives can be calculated as vector (3) multiplied by matrix (4):

$$\frac{\partial L}{\partial W} = \left[\frac{\partial L}{\partial w_{1,1}}, \frac{\partial L}{\partial w_{1,2}}, \dots, \frac{\partial L}{\partial w_{C,F-1}}, \frac{\partial L}{\partial w_{C,F}}\right] \tag{6}$$

Remember that this row vector is just for convenience. We can always rewrite the partial derivatives in matrix form:

$$\frac{\partial L}{\partial W} = \begin{bmatrix}
\frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \cdots & \frac{\partial L}{\partial w_{1,F-1}} & \frac{\partial L}{\partial w_{1,F}} \\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \cdots & \frac{\partial L}{\partial w_{2,F-1}} & \frac{\partial L}{\partial w_{2,F}} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial L}{\partial w_{C,1}} & \frac{\partial L}{\partial w_{C,2}} & \cdots & \frac{\partial L}{\partial w_{C,F-1}} & \frac{\partial L}{\partial w_{C,F}}
\end{bmatrix} \tag{7}$$

Now let's check each element, for example the one at position $\{i, j\}$:

$$\frac{\partial L}{\partial w_{i,j}} = \sum_{k=1}^{C}\frac{\partial L}{\partial z_k}\frac{\partial z_k}{\partial w_{i,j}} \tag{8}$$

From equation (5) we know that $\frac{\partial z_k}{\partial w_{i,j}}$ is nonzero only when $k = i$, and its value is then $x_j$. Using this property, together with equation (2), equation (8) simplifies to:

$$\frac{\partial L}{\partial w_{i,j}} = \big(q(z_i) - 1(y, i)\big)\, x_j$$
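
Here is a minimal NumPy sketch of this result (my own illustration, following the (C, F) convention above), with a finite-difference check of one entry:

import numpy as np

def loss_and_grad(W, x, y):
    """Cross-entropy loss and dL/dW for one test case, via dL/dw_{i,j} = (q_i - 1(y,i)) * x_j."""
    z = W.dot(x)                         # raw scores, shape (C,)
    q = np.exp(z - z.max())
    q /= q.sum()                         # softmax probabilities
    loss = -np.log(q[y])
    dz = q.copy()
    dz[y] -= 1.0                         # q(z_i) - 1(y, i)
    dW = np.outer(dz, x)                 # (C, F): entry {i, j} is (q_i - 1(y,i)) * x_j
    return loss, dW

# finite-difference check of a single entry dL/dW[i, j]
rng = np.random.default_rng(1)
C, F, y = 3, 4, 2
W, x = rng.standard_normal((C, F)), rng.standard_normal(F)
loss, dW = loss_and_grad(W, x, y)
i, j, h = 1, 3, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += h
Wm[i, j] -= h
num = (loss_and_grad(Wp, x, y)[0] - loss_and_grad(Wm, x, y)[0]) / (2 * h)
print(np.isclose(dW[i, j], num))         # True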

Surprisingly, the final result is simple and elegant. This is partly because the exponential function is the usual choice of eigenfunction for linear, time-invariant (LTI) systems. For more complicated systems the result may not be as simple as for Softmax, but the onion-peeling (chain rule) method applies equally well!

Thursday, October 20, 2016

Hough Transform Lines (Octave)

The Hough transform was developed to detect lines in 2D images.
This post documents simple examples and results I got while working on UD810.

Functions implemented:
  • hough_lines_acc(img_edgs, rho_resolution, theta_resolution)
  • hough_peaks(Hough_acc, NHoodSize)
  • hough_lines_draw(img, output, peaks, rho, theta)

 % create an example array
img = zeros(200, 200);

% Add lines
img(50*200+50 : 201 : 150*200 +150) = 255;
img(50*200+150 : 199 : 150*200 + 50) = 255;

% converting to gray image
I = mat2gray(img);
% extract edges using Canny operator
edges = edge(I, 'canny');
% Hough Transform
[H, T, R] = hough_lines_acc(edges);
% The tricky part of implementing this function is
% to handle negative values:
% rhos and thetas can be positive and negative
% but array index must be positive
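% A rough sketch of the index mapping I mean (an illustration only, not the
% actual implementation; the variable names here are made up):
%   thetas = -90 : theta_resolution : 90 - theta_resolution;   % degrees
%   diag_len = ceil(sqrt(rows^2 + cols^2));                    % largest possible rho
%   rhos = -diag_len : rho_resolution : diag_len;
%   rho = x*cosd(theta) + y*sind(theta);                       % vote for one edge pixel
%   rho_idx   = round((rho - rhos(1)) / rho_resolution) + 1;   % shift into 1..numel(rhos)
%   theta_idx = round((theta - thetas(1)) / theta_resolution) + 1;
%   H(rho_idx, theta_idx) += 1;                                % cast the vote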

% plot voting lines in Hough domain
fig1 = figure(1);
imagesc(H);
saveas(fig1,'lines_in_hough_domain.png');
% calculate peaks
% This function not only sorts and picks the top peaks
% but also needs to eliminate neighbours (nHoodSize, an odd parameter)
% around each peak found
% The tricky part is to handle the boundary condition; the rest is straightforward
P = hough_peaks(H, 2) % since we know there are two lines in this example

% Plot hough lines
% Don't forget to use cosd and sind because Theta is in degrees
hough_lines_draw(I, 'output.png', P, R, T);
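% A rough sketch of the neighbourhood-suppression idea inside hough_peaks
% (an illustration of the algorithm described above, not the actual code):
%   for n = 1:numpeaks
%     [~, idx] = max(H(:));                          % strongest remaining vote
%     [r, c] = ind2sub(size(H), idx);
%     peaks(n, :) = [r, c];
%     half = (nHoodSize - 1) / 2;                    % nHoodSize is odd
%     r1 = max(1, r - half);  r2 = min(size(H, 1), r + half);
%     c1 = max(1, c - half);  c2 = min(size(H, 2), c + half);
%     H(r1:r2, c1:c2) = 0;                           % suppress neighbours, clamped at the borders
%   end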


% Plot outputs from each step
% Try the flow on football field photo
img = imread('football-field.jpeg');
% Converting to gray image, then extract edges
I2 = rgb2gray(img);

edges = edge(I2, 'canny');
[H, T, R] = hough_lines_acc(edges);
% Plot Hough Domain
imagesc(H);
P = hough_peaks(H,20);
% draw original image, lines and Hough lines
hough_lines_draw(I2, 'football_output.png', P, R, T);

Monday, September 12, 2016

Canny Edge Detector


The Canny edge detector uses several steps to extract edges from an image (a rough sketch of the first few steps, in code, follows the list):

  • Apply a Gaussian filter to smooth the image (reduce noise)
  • Calculate the gradients of the image (steps 1 and 2 can be combined into one step)
  • Use a high threshold to extract "significant" gradients (strong edge pixels)
  • "Thin" the edges by keeping only the local peak of the gradient (along the gradient direction)
  • Linking stage: use a low threshold to extend the edges grown from the strong edge pixels (clever!)
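
A simplified sketch of the first three steps (my own illustration using scipy.ndimage; the thinning and linking stages are omitted here and are handled properly by the library implementations used below):

import numpy as np
from scipy import ndimage as ndi

def rough_canny_start(img, sigma=1.0, high=0.2):
    """Simplified sketch of the first Canny steps: smooth, take gradients, keep strong pixels.

    Non-maximum suppression (thinning) and hysteresis (linking) are not included.
    """
    smoothed = ndi.gaussian_filter(img.astype(float), sigma)   # step 1: Gaussian smoothing
    gx = ndi.sobel(smoothed, axis=1)                           # step 2: horizontal gradient
    gy = ndi.sobel(smoothed, axis=0)                           #         vertical gradient
    magnitude = np.hypot(gx, gy)                               #         gradient magnitude
    strong = magnitude > high * magnitude.max()                # step 3: high threshold -> strong pixels
    return strong

# usage on the noisy square generated in the demo below:
# strong = rough_canny_start(im, sigma=3)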

It is very easy to apply the Canny edge detector in MATLAB:

    edge_img = edge(org_img, 'canny');

In this blog, we will use the Canny edge detector from the skimage package.

In [18]:
# First part borrowed from the scikit-image example page
# Simple demo using a square
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from scipy import ndimage as ndi
from skimage import feature

# Ensure plots are embedded in the notebook
%matplotlib inline
plt.rcParams['figure.figsize']= (8.0, 6.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
In [11]:
# Generate a square with noise
im = np.zeros((256, 256))
im[64:-64, 64:-64] = 1
im = ndi.rotate(im, 30, mode='constant')
im = ndi.gaussian_filter(im, 5)
im += 0.2 * np.random.randn(*im.shape)
plt.imshow(im)
plt.show()
In [14]:
# Extract edges using three different sigma values
edges_default = feature.canny(im)
edges_major = feature.canny(im, sigma=3)
edges_overMajor =  feature.canny(im, sigma=5)
plt.subplot(131)
plt.title('Default Canny Filter')
plt.imshow(edges_default, cmap='gray')
plt.axis('off')
plt.subplot(132)
plt.title('Canny Filter, Sigma = 3')
plt.imshow(edges_major, cmap='gray')
plt.axis('off')
plt.subplot(133)
plt.title('Canny Filter, Sigma = 5')
plt.imshow(edges_overMajor, cmap='gray')
plt.axis('off')

plt.show()
In [24]:
# Second Part, Hummingbird example
bird_im = np.array(Image.open('hummingbird.jpg').convert('L'))
bird_edge_default = feature.canny(bird_im)
bird_edge_sigma2 = feature.canny(bird_im, sigma=2)
bird_edge_sigma3 = feature.canny(bird_im, sigma=3)
plt.subplot(2,2,1)
plt.title('Original Image')
plt.axis('off')
plt.imshow(bird_im)
plt.subplot(2,2,2)
plt.title('Default Canny')
plt.axis('off')
plt.imshow(bird_edge_default)
plt.subplot(2,2,3)
plt.title('Canny, Sigma = 2')
plt.axis('off')
plt.imshow(bird_edge_sigma2)
plt.subplot(2,2,4)
plt.title('Canny, Sigma = 3')
plt.axis('off')
plt.imshow(bird_edge_sigma3)

plt.show()