Stanford CS231n - Assignment 1 - Linear Classifier
A linear classifier is a simple but limited method for image classification. As its name suggests, it performs well only when the dataset is linearly separable. The classifier takes the input datapoints (X) and multiplies them by a weight matrix (W) that is initialized with random values. The result is a matrix that contains the score of every class for every datapoint, and each datapoint is assigned to the class with the highest score. Because the weight matrix is filled with randomly generated numbers in the first iteration, the initial results will not be promising, so we need a way to adjust the weight matrix to get better results.
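As a minimal sketch of the scoring step, the snippet below multiplies a toy data matrix by a randomly initialized weight matrix and picks the highest-scoring class for each row. The sizes, the random seed, the 0.001 scale of the initial weights, and the layout (one datapoint per row of X, one class per column of W) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration
num_train, dim, num_classes = 5, 12, 3

X = rng.standard_normal((num_train, dim))             # one flattened image per row
W = 0.001 * rng.standard_normal((dim, num_classes))   # small random initial weights

scores = X.dot(W)                     # shape (num_train, num_classes)
y_pred = np.argmax(scores, axis=1)    # highest-scoring class for each datapoint
```

With random weights the predictions are essentially arbitrary, which is why the weights have to be trained.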
Two loss functions are very common for this purpose: the SVM loss and the Softmax loss. The linear classifier tries to minimize the loss by finding the best weights. This is done by computing the gradient of the loss with respect to the weights (backpropagation). These gradients show the effect of each element of the W matrix on the loss function, so we use them to update the weights. Repeating these steps iteratively gives us the final weight matrix, which can be used to classify new input data. Equation 1 calculates the SVM loss for each image in the training set. The total loss is the average of the losses of all images in the training set.
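The iterative procedure described above (compute the loss and its gradient, then adjust W) can be sketched as plain gradient descent. The text does not state the exact update rule, so the rule W = W - learning_rate * dW, the names `gradient_descent` and `toy_loss`, and the hyperparameter values are assumptions. A quadratic stand-in loss is used so that the example stays self-contained.

```python
import numpy as np

def gradient_descent(loss_and_grad, W, learning_rate=0.1, num_iters=100):
    """Repeatedly move W a small step against the gradient of the loss."""
    history = []
    for _ in range(num_iters):
        loss, dW = loss_and_grad(W)     # loss value and gradient at the current W
        history.append(loss)
        W = W - learning_rate * dW      # vanilla gradient descent update
    return W, history

def toy_loss(W):
    # Stand-in loss (not the SVM loss): squared distance to a matrix of ones
    diff = W - np.ones_like(W)
    return float(np.sum(diff ** 2)), 2.0 * diff

rng = np.random.default_rng(0)
W0 = 0.001 * rng.standard_normal((4, 3))   # random initial weights
W_final, history = gradient_descent(toy_loss, W0)
```

In the assignment setting, `toy_loss` would be replaced by a function that returns the SVM or Softmax loss and its gradient for the training data.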
In the Softmax classifier, the scores are interpreted as the unnormalized log probabilities of the classes, which are normalized using equation 2. As with the SVM, the total loss is the average of the losses of all images in the training set. After training, the classifier predicts the label of a test image as the class with the highest probability.
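The normalization and averaging can be sketched as follows. This assumes that equation 2 is the standard softmax function and that the per-image loss is the negative log probability of the correct class. The function name `softmax_loss` and the example scores are hypothetical.

```python
import numpy as np

def softmax_loss(scores, y):
    """Average softmax loss for scores of shape (N, C) and labels y of shape (N,)."""
    # Subtracting the row maximum leaves the probabilities unchanged
    # but avoids overflow in np.exp.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Probability assigned to the correct class of each datapoint
    correct_probs = probs[np.arange(scores.shape[0]), y]
    loss = -np.mean(np.log(correct_probs))
    return loss, probs

scores = np.array([[3.0, 1.0, 0.2],
                   [0.5, 2.5, 0.1]])
y = np.array([0, 1])
loss, probs = softmax_loss(scores, y)
y_pred = probs.argmax(axis=1)   # predicted label = class with the highest probability
```

Each row of `probs` sums to 1, so the rows can be read as class probabilities.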
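The SVM loss is zero if the score of the correct class exceeds the scores of all other classes by at least a given margin. In contrast, the Softmax loss never reaches exactly zero, because for finite scores the probability assigned to the correct class stays below 1.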
The two loss functions, SVM and Softmax, were presented in the previous paragraphs. The SVM loss function simply ensures that the score of the correct class is higher than the scores of the incorrect classes by a constant value delta.
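A small sketch of this margin condition for a single datapoint is given below. The function name `svm_loss_single` and the example scores are hypothetical, and delta is set to 1 as an assumed default.

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta=1.0):
    """Hinge loss for one datapoint, given its vector of class scores."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0   # the correct class is not compared with itself
    return float(margins.sum())

# Correct class (index 0) beats every other class by more than delta: zero loss
well_separated = svm_loss_single(np.array([5.0, 1.0, 2.0]), correct_class=0)

# Correct class (index 0) is too close to class 1 and below class 2: positive loss
violated = svm_loss_single(np.array([2.0, 1.5, 4.0]), correct_class=0)
```

In the second case both incorrect classes violate the margin, so each contributes its margin to the loss.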
In the naive implementation of the SVM loss, the code iterates over all training datapoints one by one and calculates the scores of each datapoint by multiplying its pixel values by the weight matrix. A second for loop then iterates over the scores of that datapoint and calculates the margin for each class other than the correct one. If the margin for a class is greater than zero, it is added to the loss. For margins greater than zero, according to the formula for the analytic gradient, the pixel values of the current datapoint are also added to the column of the gradient matrix that corresponds to the class the second loop is currently visiting, and subtracted from the column that corresponds to the datapoint's correct class. After iterating over all datapoints, we have a loss and a gradient dW that are both sums over all training examples. Dividing the loss and dW by the number of training examples therefore gives the average loss and gradient. Finally, the code adds the regularization term to the loss, where the regularization parameter is 0.5.
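The loop structure described above can be sketched as follows. The function name `svm_loss_naive`, the argument layout (X with one datapoint per row, W with one class per column), and delta = 1 are assumptions. The 0.5 is interpreted here as a factor in front of the L2 penalty `reg * sum(W * W)`, which is one possible reading of the text, and the matching `reg * W` term is added to the gradient so that the two stay consistent.

```python
import numpy as np

def svm_loss_naive(W, X, y, reg, delta=1.0):
    """SVM loss and gradient computed with explicit loops.

    W: (D, C) weights, X: (N, D) data, y: (N,) integer labels, reg: regularization strength.
    """
    dW = np.zeros_like(W)
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0

    for i in range(num_train):              # first loop: over datapoints
        scores = X[i].dot(W)
        correct_score = scores[y[i]]
        for j in range(num_classes):        # second loop: over class scores
            if j == y[i]:
                continue                    # skip the correct class
            margin = scores[j] - correct_score + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]            # add pixels to the column of class j
                dW[:, y[i]] -= X[i]         # subtract them from the correct-class column

    # Turn the sums over all training examples into averages
    loss /= num_train
    dW /= num_train

    # Assumed convention: 0.5 * reg * ||W||^2, whose gradient is reg * W
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 3))
X = rng.standard_normal((6, 4))
y = rng.integers(0, 3, size=6)
loss, dW = svm_loss_naive(W, X, y, reg=0.1)
```

With near-zero weights all scores are close to zero, so the loss should be close to the number of classes minus one, which is a quick sanity check.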
The vectorized version of the SVM loss is much faster than the naive approach because it avoids explicit for loops. In this part we compute the product of X and W in a single matrix multiplication instead of multiplying index by index. Because of the vectorized implementation, we do not need to allocate a new variable for the margins, so all intermediate steps are stored in the scores matrix. We add a margin to the overall loss only when it is positive (otherwise we do nothing). At the end, we divide the loss and dW by the number of training examples, then add the regularization term to the loss and its derivative to dW.
We calculate the class scores for all training samples with a single dot product of X and W, and then extract the correct class scores. To prevent delta = 1 from being added to the loss for the correct classes, we set the margins of the correct classes to zero.
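The vectorized steps of this section can be sketched together as below. The function name `svm_loss_vectorized`, the variable names, and the 0.5 factor on the regularization term are assumptions. The gradient is built from a mask of positive margins, which is one common way to vectorize it and may differ in detail from the original code.

```python
import numpy as np

def svm_loss_vectorized(W, X, y, reg, delta=1.0):
    """SVM loss and gradient without explicit loops.

    W: (D, C) weights, X: (N, D) data, y: (N,) integer labels, reg: regularization strength.
    """
    num_train = X.shape[0]
    rows = np.arange(num_train)

    scores = X.dot(W)                                   # all scores at once, (N, C)
    correct_scores = scores[rows, y][:, np.newaxis]     # correct class score per row, (N, 1)
    margins = np.maximum(0.0, scores - correct_scores + delta)
    margins[rows, y] = 0.0                              # keep delta out of the loss for correct classes

    # Average of the positive margins plus the (assumed) 0.5 * reg * ||W||^2 penalty
    loss = margins.sum() / num_train + 0.5 * reg * np.sum(W * W)

    # mask[i, j] = 1 where class j violates the margin for datapoint i;
    # the correct-class entry holds minus the number of violations in that row
    mask = (margins > 0).astype(X.dtype)
    mask[rows, y] = -mask.sum(axis=1)

    dW = X.T.dot(mask) / num_train + reg * W
    return loss, dW

rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((5, 4))
X = rng.standard_normal((8, 5))
y = rng.integers(0, 4, size=8)
loss, dW = svm_loss_vectorized(W, X, y, reg=0.1)
```

The product `X.T.dot(mask)` performs, in one matrix multiplication, the same additions and subtractions of pixel values that the loop version does one margin at a time.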
To select the correct class score of every example at once, we index the score matrix with np.arange, a function from the NumPy package. In the next step we sum up the positive margins. Then, multiplying the transpose of X by the mask of positive margins gives the gradient matrix, which is a sum over all training examples. The final gradient matrix is obtained by dividing all elements of dW by the number of training examples and adding the derivative of the regularization term. This implements the analytic gradient formula of the SVM loss.
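For reference, the analytic gradient of the multiclass SVM loss for a single example is commonly written as below. The notation is an assumption made here: w_j is the column of W for class j, x_i is the i-th datapoint, y_i is its correct class, \Delta is the margin, and \mathbb{1}(\cdot) is the indicator function.

```latex
\begin{align}
\nabla_{w_{y_i}} L_i &= -\Big( \sum_{j \neq y_i} \mathbb{1}\big( w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0 \big) \Big)\, x_i \\
\nabla_{w_j} L_i &= \mathbb{1}\big( w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0 \big)\, x_i \qquad (j \neq y_i)
\end{align}
```

The first line matches the subtraction from the correct-class column, once for every class that violates the margin. The second line matches the addition of the datapoint's pixel values to the column of each violating class.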