11. Softmax
Here is a question that I’d like us to keep in mind as we go through the contents of this video. Let’s say that we have a set of softmax classification neurons. The output of these is going to be a label. For instance, in digit classification, the output of this set of neurons is going to be the digit. What format, what representation will this output be in? This is a question that I’d like you to keep in mind as you go through the video. Let’s now turn our attention to actually implementing logistic regression in TensorFlow. Let’s say that our use case consists of predicting the probability of Google stock going up in a given month, driven by changes in the S&P 500 equity index.
So our Y variable is related to the returns on Google stock and the X variable is related to the returns on the S&P 500 index. We are interested in the probability that Google stock goes up in a particular month, depending on whether the S&P 500 went up or not in that same month. This means that we are going to convert our Google stock returns into a categorical variable. This categorical variable will take the value one if Google stock goes up in that month; otherwise it will take the value zero. Our Y variable is now binary because it can only take the values one or zero. This means that we can use logistic regression and binary classification.
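As a minimal sketch of this conversion step (the variable names and return values here are hypothetical, not from the course’s data set), the returns-to-labels encoding might look like this:

```python
import pandas as pd

# Hypothetical monthly return series; in practice these would come from a data file.
goog_returns = pd.Series([0.021, -0.013, 0.004, -0.002, 0.017])
sp500_returns = pd.Series([0.015, -0.010, 0.002, 0.001, 0.012])

# Convert Google's returns into a binary categorical variable:
# 1 if the stock went up in that month, 0 otherwise.
y = (goog_returns > 0).astype(int)

# The feature is simply the S&P 500 return for the same month.
x = sp500_returns

print(y.tolist())  # [1, 0, 1, 0, 1]
```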
Here, the ultimate objective is to test our prediction accuracy, because we already know how Google did in a particular month. We will come up with a prediction based on a logistic regression model, and we will compare the actual and the predicted values of that categorical variable. Now, you should be aware that this is not a very difficult problem to solve, because we are using the values of the S&P 500 in the same month in order to predict whether Google went up or down. If you are familiar with the idea of market beta, there is a very high probability that Google and the S&P 500 will move in sync. But in any case, this is still a very good application for logistic regression.
We are going to fit an S curve. On the left hand side of the equation will be the probability that Google goes up in a given month, and on the right hand side will be the usual expression for an S curve, where x is the S&P 500’s return in that month. We are seeking to find the best values of the constants A and B such that, over our entire data set, we obtain the best fitting logistic regression S curve. That really is what we are looking to do in TensorFlow.
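As a rough illustration of that S curve (keeping the video’s names A and B; the constant values below are made up, since finding the best-fitting ones is exactly what training will do):

```python
import numpy as np

def s_curve(x, a, b):
    """Logistic S curve: P(Google up) = 1 / (1 + e^-(a + b*x))."""
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

# Hypothetical constants for illustration only.
a, b = 0.1, 25.0

# If the S&P 500 returned +2% in a month, the model's probability
# that Google also went up in that month:
print(s_curve(0.02, a, b))  # roughly 0.65
```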
Let’s contrast our approach to logistic regression with the one we adopted for linear regression. There we started with a baseline non-TensorFlow implementation. We then set up a computation graph with just one neuron, with only an affine transformation and no activation function. We made use of the mean square error as our cost function; this was our way of quantifying goodness of fit. To find the best regression line, we needed an optimizer object, specifically a gradient descent optimizer, which would minimize the objective function, i.e. it would minimize the cost function. We had to specify various training parameters such as the batch size, the number of epochs, and the corpus of training data. And then TensorFlow took over and trained our model.
The gradient descent optimizer went ahead and did its magic, and we were left with a converged model with values of W and b which we could compare to the baseline non-TensorFlow implementation. Our binary logistic regression implementation is going to differ from the linear regression version in two of these respects: the cost function and the computation graph. When we switch from linear to logistic regression, our cost function will change. It will now be something known as cross entropy. This is a way of measuring the distance, or the similarity or dissimilarity, between two sets of values that have been drawn from different probability distributions. The other difference is going to be in the shape of the computation graph.
We now require an activation function. That activation function is called softmax. In addition, the number of neurons that we require is one less than the number of output labels. Here, because we are carrying out binary classification, one neuron is still enough. But if we were carrying out a more general form of classification, say with ten labels as in digit classification, we would require N minus one, that is, nine neurons. These are the two important differences. Let’s understand these really quickly. First, let’s understand the softmax activation. In linear regression, we go from a set of points to a regression line making use of a neural network with just one neuron. That one neuron in turn has within it an activation function.
This activation function sits on top of the affine transformation defined by the weights W and the bias variable b. Now, because the function that we are seeking to learn in linear regression is a linear one, there is no need for an activation function at all. The affine transformation is all that we need; we simply need to find the correct values of W and b. And so for the activation function, we make use of the identity function. An identity activation function will simply pass through whatever is passed into it. But that definitely will not work when we are trying to learn a logistic regression function, because the S curve is clearly nonlinear, and that’s why we make use of an alternative activation function called softmax. Softmax will output the probability that Y is equal to true.
And from this we can easily calculate the complementary probability that Y is equal to false. I should add that the number of neurons which we require in our logistic regression neural network depends on the number of possible output labels. If there are two possible labels, like true and false, one neuron suffices. If there are N possible labels, then we’ll require N minus one output neurons, each with a softmax activation, and so on. In any case, let’s now understand this softmax activation function. The equation of a softmax function has a specific form which really mirrors the S curve of a logistic regression. We’ll get to that form in a moment.
But the important implication is that there are now a pair of variables, W2 and b2, which are related to our softmax activation function and which now need to be calculated during the training process. So the number of variables in our computation graph which need to be found has increased. The input into the softmax activation function will be the output of the affine transformation. This is equal to W1x + b1. For simplicity, let’s just call this x prime. Now, the output of the softmax activation function is going to be a function of x prime. This function will take the form that we see on screen now. Yep, you guessed it: this is the equation for the probability of Y equal to one from a logistic regression S curve.
This is the probability that Y is equal to true. In exactly the same manner, the complementary probability that Y is equal to false will be one minus this quantity, and that works out to a very similar expression, except that the sign of the exponent is flipped. So, to express this precisely: when there are just two possible output labels, i.e. when we are carrying out binary classification, softmax activation is equivalent to logistic regression. We will get to the relationship between multilabel classification, logistic regression, and softmax in a moment. But first, let’s just make sure that we understand the simple case. If we only have two output categories, we can carry this out using one neuron with a softmax activation unit.
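Here is a small numerical check of that equivalence, assuming TensorFlow 2’s eager execution; in the course’s graph-based style the same ops would be evaluated inside a session:

```python
import tensorflow as tf

# An arbitrary value of x' (the output of the affine transformation).
x_prime = 0.8

# Softmax over the two logits [x', 0]: the first component is
# e^x' / (e^x' + e^0) = 1 / (1 + e^-x'), which is exactly the logistic S curve.
two_class_probs = tf.nn.softmax(tf.constant([x_prime, 0.0]))
logistic_prob = tf.math.sigmoid(tf.constant(x_prime))

print(two_class_probs.numpy())  # approximately [0.69, 0.31]; the two probabilities sum to 1
print(logistic_prob.numpy())    # approximately 0.69, matching the first softmax component
```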
So for binary classifiers, the input into the softmax will be x prime. The output will represent the probability that Y is equal to true, and this probability agrees with what we would get from an equivalent logistic regression S curve. Let’s now relate this to the shapes of the variables in our TensorFlow computation graph. In linear regression, where we had a one dimensional feature vector, i.e. where we had just one x variable, our W variable was a two dimensional tensor with one value in each dimension. That meant that the shape of W was [1, 1]. Our b tensor, the other variable in the linear regression, was a one dimensional tensor with just one value in it, so its shape was just [1].
The first element of the shape of W must match the number of dimensions in our feature vector; that’s just for simple matrix multiplication. The second element in the shape of W, as well as the single element in the shape of b, must match the number of parameters required to be tuned per neuron. So when we go from linear regression to logistic regression, even if we still only have a one dimensional feature vector, the shapes of our W and b tensors will change. The second dimension will now have two elements. That’s because we now have two constants that need to be calculated for each neuron. Let’s generalize this to digit classification, where we have ten possible output labels rather than two.
Let’s say that we continue with one dimensional feature data. We would now need ten values of W and b to be found. Why ten, you ask? Well, nine of these values correspond to the softmax activation units, that is, one for nine out of the ten output labels. We do not require a tenth because we can infer the probability of the last label as one minus the sum of all the others. So that yields nine values of W and b corresponding to the softmax activation units and one value of W and b corresponding to the affine transformation. Generalizing from ten to N categories: if we would like to use softmax for N-category classification with one dimensional feature data, the shape of our weights tensor will be [1, N] and the shape of our bias tensor will be just [N].
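A minimal sketch of what those variable shapes might look like for digit classification (N = 10) with one dimensional feature data; the zero initial values are purely for illustration:

```python
import tensorflow as tf

n_labels = 10      # ten possible output labels, the digits 0 through 9
n_features = 1     # one dimensional feature data

# Weights: the first dimension matches the number of feature dimensions,
# the second matches the number of parameters to be tuned per set of neurons.
W = tf.Variable(tf.zeros([n_features, n_labels]))  # shape [1, 10]

# Biases: one dimensional, with one element per output label.
b = tf.Variable(tf.zeros([n_labels]))              # shape [10]

print(W.shape, b.shape)  # (1, 10) (10,)
```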
As we just mentioned a moment ago, the first element in the shape of the W tensor must agree with the number of dimensions in our feature vector. So if we were to generalize to M-dimensional feature data, we would need to change the first element in the shape of the W tensor to be M rather than one. As we can see from this conversation, changing our computation graph so that it uses a softmax activation rather than the identity activation which we used for linear regression is a fairly significant change. Let’s come back to the question we posed at the start of the video. The output of a set of softmax classification neurons is going to be a vector of probabilities.
Let’s say that we are seeking to perform digit classification, so the possible output labels are the digits zero through nine. The output of a softmax classification unit will be a set of ten probabilities corresponding to each of the ten possible labels zero through nine. Notice that this is not exactly the same as a one hot representation. The actual labels are in one hot notation because exactly one of those labels is equal to one and all of the others are equal to zero. In contrast, the outputs of a softmax classification unit are ten probabilities which all sum to one; there is no need for any one of them to actually be equal to one. So the output is going to be a vector of probabilities, and this is not exactly the same as a one hot representation.
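To make the contrast concrete, here is a small sketch using made-up scores; tf.nn.softmax and tf.one_hot are the standard TensorFlow ops for the two representations:

```python
import tensorflow as tf

# Made-up scores (logits) for the ten digit classes 0 through 9.
logits = tf.constant([1.2, 0.3, 0.1, 4.0, 0.5, 0.2, 0.9, 0.1, 2.2, 0.4])

# Softmax output: ten probabilities that sum to one, none of them exactly 1.
probabilities = tf.nn.softmax(logits)
print(probabilities.numpy())                 # ten probabilities, the largest at index 3
print(tf.reduce_sum(probabilities).numpy())  # 1.0

# One hot representation of an actual label, say the digit 3:
# exactly one element is 1 and the rest are 0.
actual = tf.one_hot(3, depth=10)
print(actual.numpy())  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```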
12. Argmax
Here is a question that I’d like you to think about as we go through the contents of this video. While performing linear regression, we use mean square error as our cost function. This is what we tell our optimizer to minimize. Can we do the same while working with logistic regression? Can we continue with the choice of mean square error as the cost function? If yes, why? If not, why not? Another significant difference that’s introduced by our switching to logistic regression is that our cost function changes. We now make use of something known as cross entropy. We will not dwell on the mathematics of cross entropy. We will just talk very quickly about the intuition behind it. Recall that the cost function is something that we’ve got to specify during the training process.
This is required for TensorFlow to figure out how the optimizer should find the best values of W and b, the variables in our computation graph. In the case of linear regression, we used the mean square error. This was the mean of the squares of all of those residuals, and the best fitting regression line was the one which minimized the value of that MSE. Now, in logistic regression, unlike in linear regression, we are dealing with categorical variables and our measure of performance has to do with prediction accuracy. From the output of the logistic regression, we have predicted labels. We then need to compare these to the actual labels, which tell us whether Google went up or down in a given month.
And then we calculate the prediction accuracy by tabulating all of the actual and the predicted values and seeing how often the logistic regression got it right. Cross entropy is a mathematical formula which is used to quantify how similar or dissimilar these sets of values are. Let’s say, for instance, that the actual and the predicted labels look like the ones on screen now. In our example, actual and predicted values will always be either zero or one, but in general, values drawn from any probability distributions can be compared. If the labels of the two series are in sync, then intuitively, those two sets of numbers are drawn from relatively similar probability distributions. This intuition is captured by a specific formula that’s highlighted now on screen.
This formula refers to the cross entropy, and we would like that cross entropy to be as small as possible. Now, you need not understand that formula; you need not even be familiar with the terms in that formula. All of this is abstracted away for you by TensorFlow. But that’s the intuition. And likewise, the intuition behind high cross entropy is that if we have two sets of numbers where the labels are clearly out of sync, then that formula, the quantity which we call cross entropy, will be large. And that in turn will tell us that these numbers were not drawn from similar probability distributions. Once again, just a high level understanding of this is fine because, unlike the mean square error in the case of linear regression, this is not a function we will actually code up ourselves.
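Purely for intuition, here is a small sketch of that cross entropy calculation with made-up labels and predictions; in the actual implementation TensorFlow computes this for us:

```python
import numpy as np

def cross_entropy(actual, predicted):
    """Cross entropy between one hot actual labels and predicted probabilities:
    the mean over examples of -sum(actual * log(predicted))."""
    predicted = np.clip(predicted, 1e-12, 1.0)  # avoid log(0)
    return np.mean(-np.sum(actual * np.log(predicted), axis=1))

# Actual labels in one hot form (columns: true, false) for three months.
actual = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)

# Predictions in sync with the labels: small cross entropy.
good_predictions = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])

# Predictions out of sync with the labels: large cross entropy.
bad_predictions = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])

print(cross_entropy(actual, good_predictions))  # roughly 0.18
print(cross_entropy(actual, bad_predictions))   # roughly 1.70
```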
Cross entropy is something which is abstracted away from us by TensorFlow. Let’s now turn our attention to a few last little implementation details which will help while writing the code. The first of these has to do with the one hot representation. When we made use of the softmax activation, it meant that the output was going to come in the form of a vector of probabilities. Those probabilities will sum to one, and there will be two elements, corresponding to the probabilities of Y being zero and Y being one, respectively. Now, this is true in general, anytime one is carrying out classification using a softmax activation function in the output layer.
For instance, let’s say we were performing digit classification, where there are ten possible output labels corresponding to the digits zero through nine. There the output that we would obtain would be in the form of an array. That array, or one dimensional tensor, would contain ten probabilities, all of which would sum to one, and the highest of those probabilities would determine the predicted label. So this is a general statement about any softmax activation: the outputs are going to be obtained in the form of label vectors with probabilities. Now, because the outputs of the neural network are obtained in this form, we also usually encode the actual values of the labels in something similar, and that form is known as the one hot representation.
Consider an example. Let’s say that we have a label vector of true and false elements. The one hot representation corresponding to this would have two dimensions. Effectively, we would now have columns which correspond to each possible value of the label. Here there are two possible values, true and false, and so we have two columns. Every label in our label vector would now have a one in exactly one of those columns, the column known as the one hot value, and zeros in all of the other columns. Again, this is a very standard way of encoding the actual labels when we are making use of softmax activation functions in the output layer of our neural network.
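A minimal sketch of what this encoding might look like using tf.one_hot, with a hypothetical true/false label vector:

```python
import tensorflow as tf

# Hypothetical labels: did Google go up in each of five months?
labels = [True, False, True, True, False]

# Column 0 = false, column 1 = true; tf.one_hot needs integer indices.
indices = tf.cast(labels, tf.int32)
one_hot_labels = tf.one_hot(indices, depth=2)

print(one_hot_labels.numpy())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```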
We encode the labels this way because the output of the softmax will emerge in a form which is similar to a one hot representation. Going from the label vector to the one hot representation requires explicit coding up; this is a simple bit of code which you shall see when we get to the drills. Going in the other direction, i.e. from the one hot label vector to the corresponding index of the one hot element, is accomplished using a function called tf.argmax. We will end up using tf.argmax a lot in the implementation in TensorFlow, so let’s understand this function in some detail. Here’s what argmax does: it takes in a vector or a tensor and it takes in a dimension, and it returns the index of the largest element of that tensor along that dimension.
Let’s run through an example to make sure we understand what’s really going on. Let’s say that we use tf.argmax on a two dimensional tensor y. It has elements along dimension zero and dimension one, and the indices of those elements range from zero through five. Invoking tf.argmax on this tensor along dimension one will return the value three, and that is because three is the index of the largest element along dimension one.
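The exact tensor shown on screen is not reproduced here, but a hypothetical tensor with the same behaviour makes the point:

```python
import tensorflow as tf

# A two dimensional tensor with six elements along dimension 1,
# so the indices along that dimension range from 0 through 5.
y = tf.constant([[1.0, 3.0, 2.0, 9.0, 4.0, 0.0]])

# tf.argmax returns the index of the largest element along the given dimension.
print(tf.argmax(y, 1).numpy())  # [3], because the largest element (9.0) sits at index 3
```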
We will also make use of tf.argmax in the rather complex bit of code which you see on screen now; this is comparing the predicted and the actual labels. Let’s break down what’s going on. On the one hand, we have an invocation of tf.argmax on the actual labels, which are in the one hot notation. The second invocation is of tf.argmax on the predicted labels. tf.equal will then compare the indices of the largest elements in the actual and the predicted label tensors, returning True if they are equal and False otherwise. Remember that the actual labels are in one hot notation, because each month Google either went up or it did not. If it did go up, then the value of the true column will be one and that of the false column will be zero; if it did not go up, then the reverse will be true. So the net effect of invoking tf.argmax on the one hot labels will be to return the indices of the one hot elements, which are either zero or one. That is as far as the first invocation of tf.argmax goes.
As for the predicted labels, these are the outputs from the softmax activation. These are going to be predicted probabilities, so there will be two probabilities which sum to one. By invoking tf.argmax on these predicted probabilities, we get the final predicted output. This will be true if the probability is greater than 50% and false otherwise; in other words, this is based on that rule of 50% which we previously spoke about. Let’s work through a similar example, this time for digit classification. Here we will start with actual digits, which are numbers between zero and nine. The one hot label vectors will now have ten columns corresponding to the ten digits zero through nine. Of those ten columns, exactly one will always contain the value one.
The other nine will contain zeros. Again, this is the definition of the one hot representation: exactly one element is hot, meaning it has the value one, all of the other elements are zero, and in this way we are guaranteed to have each row sum to one. That is how the one hot representation works for the actual digits. To go in the other direction, to get the corresponding digit back from that one hot representation, we make use of the tf.argmax method. The actual labels are in one hot form, but the predicted outputs of the softmax are not one hot, because softmax is going to return a vector of numbers: probabilities which sum to one. So, for instance, maybe the probability of this digit being a zero is 70%, the probability of its being a one is 30%, and all of the other probabilities are zero.
In order to go from these predicted probabilities to our final predicted label, we again invoke the tf.argmax function. This will give us the digit, or the index, corresponding to the highest probability. This, in a nutshell, is why calculating the prediction accuracy requires two invocations of the tf.argmax function. One of these is on the actual labels, which are in one hot representation; the other is on the predicted labels, which are not in one hot form but are probabilities that have been output by the softmax. The net effect of invoking tf.equal on these two sets of return values from tf.argmax will be to give us a list of true and false values. Every true in this list will correspond to a correct prediction and every false will correspond to an incorrect prediction, and it is then a relatively simple matter to calculate the percentage accuracy.
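A minimal sketch of that accuracy calculation, with hypothetical one hot labels and softmax outputs; in the course’s graph-based code these ops would be evaluated inside a session:

```python
import tensorflow as tf

# Hypothetical actual labels for four months, in one hot form (columns: false, true).
actual_one_hot = tf.constant([[0., 1.], [1., 0.], [0., 1.], [1., 0.]])

# Hypothetical softmax outputs: two probabilities per month, summing to one.
predicted_probs = tf.constant([[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]])

# tf.argmax recovers the index of the hot element / the most probable label.
actual_labels = tf.argmax(actual_one_hot, 1)      # [1, 0, 1, 0]
predicted_labels = tf.argmax(predicted_probs, 1)  # [1, 0, 0, 0]

# tf.equal gives a list of True/False values; casting and averaging gives accuracy.
correct = tf.equal(actual_labels, predicted_labels)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

print(accuracy.numpy())  # 0.75, because three of the four predictions were correct
```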
So, in summary, when we implement logistic regression in TensorFlow, we make use of softmax as the activation function in the output layer of our neural network. The consequence of this is that we also need to encode the actual labels in one hot notation, and this in turn requires us to make use of the tf.argmax function in order to calculate accuracy. Don’t be confused by the one hot notation or tf.argmax.
It just comes with the territory while working with classification networks in TensorFlow. Let’s come back now to the question which we posed at the start of this video. While working with linear regression, our cost function was mean square error. This was the mean or the average of the square of the residuals in the regression. This is not something that we can use while carrying out logistic regression. And this has to do with the differing probability distributions of the residuals in a linear regression and the residuals in a logistic regression. Remember that in a linear regression, the residuals are normally distributed.
If we were to take all of them and plot them, we would get a bell curve. In logistic regression, on the other hand, the residuals cannot be normally distributed, because the residual values will all be categorical. In the case of binary logistic regression, for instance, the residuals will always be either zero, plus one, or minus one; there is no other option. The math behind the mean square error cost function just will not work with categorical variables in the residual series, and that’s why we’ve got to make use of the concept of cross entropy instead. Cross entropy is a way of comparing whether or not two sets of numbers have been drawn from very different probability distributions.