4. Gradient Descent
Here is a question that I’d like you to attempt and keep in mind as we go through the contents of this video: what’s the difference between stochastic gradient descent and minibatch gradient descent? As the names suggest, these are very closely related types of optimization. What is the precise difference between the two? We are now done with the first step in our process: we’ve established a baseline by implementing regression using regular Python code. Let’s now plunge into the TensorFlow implementation. In TensorFlow, all operations are represented using computation graphs. Computation graphs are directed acyclic graphs in which the nodes are neurons or operations and the edges are tensors or data.
Here our computation graph is a neural network of just one neuron with only an affine transformation. Recall our discussion of how regression is an example of the simplest operation that a neural network can learn. Here we have just one neuron, which takes in a set of points and applies a transformation of the form Y = WX + B. In this way it learns the linear relationship and finds the regression line. The question then is: given all of these input values, the x variables, what are the values of W and B? What are the values of the weights and the biases in our affine transformation that define that regression line? Recall also that a neuron has something called an activation function.
In this instance, the activation function merely passes through the output of the affine transformation, so the activation function here is nothing but the identity function. This is a very basic, very simple computation graph with just one node in it. Even so, we will have to convert it into TensorFlow constructs: X and Y will be placeholders, and W and B will be variables. More on that in a bit. The next step after this is to define the cost function. Recall that while finding a linear regression line, we seek to minimize the mean square error. This mean square error, abbreviated to MSE, is a measure of the cost: a measure of how bad a given fit is.
And so what we’d like to do is to find that regression line which, given a set of points, minimizes the squares of the residuals in all of those points. How do we calculate those residuals? Well, it’s simple. It’s merely the difference between the predicted values as given by our regression line and the actual values of Y, which we already have. So this difference between the actual and the predicted values is going to feed into the mean square error. And that mean square error is the cost function. That’s what we want to minimize. This now is where it gets really interesting, because this is where we need to use an optimizer, specifically a gradient descent optimizer, which will help us to find that best fitting regression line. Let’s understand exactly how this optimization problem is framed.
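The residuals and the mean square error described here can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy data are assumptions made for this example, not the course's code or data set.

```python
def predict(xs, w, b):
    """Values on the line y = w * x + b for each x."""
    return [w * x + b for x in xs]


def mean_squared_error(actual, predicted):
    """Average of the squared residuals (actual minus predicted)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

good_fit = mean_squared_error(ys, predict(xs, w=2.0, b=1.0))
poor_fit = mean_squared_error(ys, predict(xs, w=1.0, b=0.0))
print(good_fit, poor_fit)  # the better line has the smaller cost
```

The line that passes through every point has a cost of zero, while the poorer line is penalized by the squares of its residuals.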
Like all optimization problems, it has an objective function. That’s what we really wish to achieve. It has some constraints, which are conditions that we must satisfy, and it also has a bunch of decision variables. These are the variables whose values we can control. Collectively, these elements constitute the optimization problem. Linear regression is a pretty classic example of an optimization problem. The objective here is to minimize the mean of the squares of our residuals, that is, to minimize the MSE. Our constraint lies in the form of the relationship, which we’ve expressed as a linear one: Y = WX + B. And the decision variables here are the values of W and B.
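Written out, the framing above might look like the following. The symbols n (the number of training points), x_i and y_i (the training data), and \hat{y}_i (the predicted value) are introduced here for illustration only.

```latex
\min_{W,\,B}\ \mathrm{MSE}(W, B) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\quad\text{subject to}\quad \hat{y}_i = W x_i + B
```

The objective is the MSE, the constraint is the linear form of the prediction, and the decision variables are W and B.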
Now, there are standard cookie cutter techniques for solving this regression optimization problem. These are the method of moments, the method of least squares, and maximum likelihood estimation. But we are intentionally not going to use these methods while solving this problem in TensorFlow. Instead, we are going to make use of gradient descent optimization. This is a more general optimization technique which can be applied to a wide class of machine learning problems. This does mean that, as far as the linear regression problem goes, it is quite possible that the solutions we get using TensorFlow will be less efficient, or even less accurate, than the ones we would get using these cookie cutter techniques, which are optimized for linear regression.
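For comparison, here is a minimal sketch of one of those cookie cutter techniques, the method of least squares for a single x variable, using the standard closed-form formulas for slope and intercept. The function name and the toy data are assumptions made for this example.

```python
def least_squares_fit(xs, ys):
    """Closed-form slope and intercept for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = least_squares_fit(xs, ys)
print(w, b)
```

No iteration and no initial guess are needed here, which is why such methods are hard to beat for this particular problem.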
Let’s understand exactly how a gradient descent optimizer would work in this example. Here our decision variables are W and B, which we represent on the two horizontal axes. And then we have the error term, the MSE, which we represent along the vertical axis. If we were to calculate the value of the MSE for every value of W and B, given some set of training data, we would get a complex surface, maybe something like the one you see on screen now. Now of course it’s impossible to actually calculate the value of the MSE for every combination of W and B. What we really want is to find only the smallest value of the MSE. This represents the minimum in that surface. This smallest value of the cost function represents a global minimum.
Because if we were to drop a vertical line from this point to the horizontal plane, that vertical line would be as short as possible. And if we were able to find the smallest value of the MSE, we could find the corresponding values of W and B. Those corresponding values would in turn give us the ideal, or the best, parameters for our neuron. And that process of finding the best values of W and B is called the training process. Now of course, the question that arises is how do we get to that best possible value? We’ve got to start somewhere, and that is how gradient descent optimizers work. They start with some initial estimate or guesstimate of the MSE and then use an optimization algorithm to converge, to descend literally, towards that smallest value of the cost function.
And this process of finding the global minimum in this way is called gradient descent. TensorFlow offers a rich variety of gradient descent optimizers. We will explore these in just a bit. Now, the question is how do we get to that best value of W and B? We’ve got to start somewhere. We’ve got to start with some initial guess. The responsibility of providing those initial values lies on us, as does the responsibility of specifying exactly the kind of optimizer that we would like to use in order to carry out the gradient descent. As we can tell, this gradient descent process is iterative. Each arrow represents one step towards that best possible value of the parameters. Each of those steps is called an epoch, and the rate at which we move towards that global minimum in each of those epochs is called the learning rate.
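The iterative descent described above can be sketched without any library. This is a minimal illustration of plain gradient descent on the MSE of a single-variable regression; the function name, the toy data, the learning rate, and the number of steps are all assumptions chosen for this example.

```python
def gradient_descent(xs, ys, learning_rate, steps, w=0.0, b=0.0):
    """Repeatedly step W and B against the gradient of the MSE."""
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point at the current W and B.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to W and B.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move a little way downhill; the learning rate scales the step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Initial guesses of zero, as supplied by us rather than by the optimizer.
w, b = gradient_descent(xs, ys, learning_rate=0.05, steps=2000)
print(w, b)
```

With a learning rate that is too large for the data, the same loop would overshoot and diverge, which is why the rate is one of the decisions left to us.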
Some gradient descent optimizers will use a constant learning rate. This means that every step they take towards the minimum is scaled by the same rate. The class tf.train.GradientDescentOptimizer is one such. There are a host of other optimizers which change their learning rates as they advance towards the solution: they initially take large steps and then, as they converge towards the correct answer, they take smaller and smaller steps in order to avoid overshooting it. In our Python implementation, we’ll play around with a few different types of descent optimizers and learning rates and see how the results vary. Another decision that we’ve got to make has to do with how much of the training data we are going to make use of in each step of the optimizer, that is, in each epoch.
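To make the contrast concrete, here is a small sketch of a constant rate next to one simple decaying schedule. The inverse-time decay used here is an assumption chosen for illustration; it is not claimed to be how any particular TensorFlow optimizer adjusts its steps, and both function names are hypothetical.

```python
def constant_rate(initial_rate, step):
    """The same rate at every step."""
    return initial_rate


def inverse_time_decay(initial_rate, step, decay=0.1):
    """A rate that shrinks as the step count grows (illustrative schedule)."""
    return initial_rate / (1.0 + decay * step)


for step in (0, 10, 100):
    print(step, constant_rate(1.0, step), inverse_time_decay(1.0, step))
```

The decaying schedule starts at the same rate as the constant one and then takes progressively smaller steps.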
Remember that this gradient descent is an iterative process, and each step that we take towards the optimum is called an epoch. We also need to decide the total number of epochs, and at each iteration we can choose how much of the training data is going to be taken into account by the optimizer. This is called the batch size. You may have heard the term stochastic gradient descent. This refers to a form of gradient descent in which each epoch takes into account only one point from the training data. The most commonly adopted solution is minibatch gradient descent. This is a middle path in which some subset of the training data is passed into the optimizer at each epoch.
At the other extreme is batch gradient descent, in which all of the training data is fed into the optimizer at each step. Now, you should know that it’s not a problem to feed the same training data again and again. All data passed into the training algorithm helps. And once we’ve made these decisions about the batch size and the number of epochs, we are really done with our part of the training process. As we shall see, the actual training is carried out for us. But we still need to make decisions: decisions about the initial values, about the type of optimizer, about the number of epochs, and about the batch size in each epoch. And the output of this training process is a converged model.
This converged model will have the values of W and B that we need. We can go ahead and compare this to our baseline implementation, and we can also use this model just as we would any other regression model. So now let’s go ahead and implement all of this in TensorFlow. Let’s now turn to the question we posed at the start of this video. As we’ve discussed, the difference between stochastic and minibatch gradient descent lies in the quantity of training data that we take into account at each step of the gradient descent optimization. If we take into account only one point from our entire data set, then that is stochastic gradient descent. If we take into account a subset, more than one point but less than all of the training data, then that is known as minibatch gradient descent.
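The three variants differ only in the batch size, which the following sketch makes explicit. The function name and the returned labels are hypothetical, chosen for this illustration.

```python
def describe_batching(batch_size, dataset_size):
    """Name the gradient descent variant implied by a batch size."""
    if batch_size < 1 or batch_size > dataset_size:
        raise ValueError("batch_size must be between 1 and dataset_size")
    if batch_size == 1:
        return "stochastic"      # one point per step
    if batch_size == dataset_size:
        return "batch"           # all of the training data per step
    return "minibatch"           # some subset per step


print(describe_batching(1, 1000))
print(describe_batching(100, 1000))
print(describe_batching(1000, 1000))
```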
5. Lab: Linear Regression
When building and training our regression model, we’ll be making use of an optimizer. This can be the gradient descent optimizer; we’ll also see an FTRL optimizer, and so on. What is the exact function of this optimizer? In this lecture, we’ll see an end-to-end implementation of linear regression in TensorFlow, starting with setting up a computation graph and ending with a converged model. We’ve implemented linear regression using Python libraries: we used scikit-learn to set up a baseline model. Now we’ll work with TensorFlow to implement this linear regression model as a computation graph.
Let’s switch over to Python code and pick up where we left off. In our baseline implementation, we had calculated the coefficient and the intercept, and you can see the result right there on screen. We start off our TensorFlow implementation by importing the TensorFlow library, aliased as tf. The first step is to set up initial estimates for the values of W and B. We have to start somewhere before we perform the gradient descent optimization to find the values for our best fit regression line. W and B are both variables; their values will be updated as the model is trained. We use tf.zeros to initialize the W and B tensors to all zeros. The values specified in parentheses specify the shape of the tensor. W is a tensor with one row and one column.
The next step is to specify a placeholder to feed in the x variables of our regression. This is the returns data for the S&P 500. Each data point is of type float32. The shape of the data that we feed into this tensor is [None, 1]. The first dimension is None because we don’t know how many data points there will be. The second dimension is one: each x value is exactly one point. The next step is to set up our linear regression model, which is Y = WX + B. We first do an intermediate calculation of WX. We use the tf.matmul function and multiply x by W. This matrix multiplication only works when the number of columns of x is equal to the number of rows in W.
We then calculate Y = WX + B, which is the output of a single neuron. The next step after this is to set up a placeholder for y_. This is a placeholder that holds the labels of our training data. These are the actual y values corresponding to every x that we’ve fed in at the input. Once again, the data points are of type float32, and the shape of the data is [None, 1]. It’s two-dimensional data. The first dimension is None because we don’t know how many data points there will be. The second dimension is one: one y label for every corresponding x. The variable y is the predicted value from our model, and y_ is the actual value of y, the training labels for our x data.
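The shape rule for the matrix multiplication, and the single-neuron output that follows it, can be imitated with nested lists. This is a library-free sketch, not the TensorFlow code from the lab: the helper names are hypothetical, and holding B as a one-element list is an assumption about its shape.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    if cols_a != rows_b:
        raise ValueError("columns of the first must equal rows of the second")
    return [[sum(a[i][k] * b[k][j] for k in range(cols_a))
             for j in range(cols_b)]
            for i in range(rows_a)]


def affine(x, w, b):
    """Single-neuron output: matmul(x, w) plus the bias b[0] on every row."""
    wx = matmul(x, w)
    return [[value + b[0] for value in row] for row in wx]


x = [[1.0], [2.0], [3.0]]   # three data points, one value each: shape (3, 1)
W = [[2.0]]                 # one row and one column: shape (1, 1)
B = [1.0]                   # assumed shape (1,)

y = affine(x, W, B)
print(y)  # one prediction per data point, shape (3, 1)
```

Because x has one column and W has one row, the product is defined for any number of data points, which is what the None in the first dimension allows.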
The next step in our linear regression model is to calculate the cost function. This cost function is a measure of how good a regression line is, how well it fits the input data. In machine learning it can also be called the loss function. And the objective of our machine learning algorithm, which in this case is linear regression, is to minimize this loss. In linear regression, the cost function is the mean square error. This can be computed very simply in TensorFlow, thanks to the standard math libraries that TensorFlow provides. We simply call the tf.reduce_mean function on the squares of the differences between the actual and predicted y values. This is the cost function for our linear regression model.
The mean square error determines how well a particular line fits our data. This is done by dropping vertical lines from each point to our regression line. You can compare these two lines and say which of them is a better fit for our data. This comparison can be made formally by using the method of least squares, which minimizes the squared error. This is done by calculating the difference between the predicted values, which are the y values on the line, and the actual y values of the points themselves. Our objective of finding the best fit regression line involves minimizing this cost function, minimizing the loss. This is done using an optimizer. At this point, you’ll have to make a decision as to which optimizer you want to use. The plain vanilla gradient descent optimizer also exists in the TensorFlow library.
We saw that in the baby linear regression program that we set up earlier. In this example, we’ve done something a little bit different and used a different optimizer, the FTRL optimizer. The exact differences between the FTRL optimizer and the plain vanilla gradient descent optimizer are beyond the scope of this course. At a meta level, what you need to know is that the steps these optimizers take to optimize whatever function you pass in will be different. This optimizer has a learning rate of one, and its objective is to minimize the cost function. The learning rate for any optimizer determines the step sizes that the optimizer takes as it minimizes the value of the cost function. Now that we’ve set up our optimizer, we can move on to the training step.
We’ll set up a function which iteratively calls our regression model with a little bit of data in each epoch. As the very first step, we’ll set up a Python variable with the total length of our x data, that is, dataset_size. Let’s write a function which does the actual work of training our model. This function is called train_with_multiple_points_per_epoch, and it takes in three arguments. The first argument, named steps, is the number of steps, or the number of epochs, in our training. If you remember, our gradient descent optimizer takes steps to get towards the optimal solution. Each step towards the optimum is called an epoch. By passing this in as an argument to our function, we can vary the number of epochs or steps in our training algorithm.
The second argument to this function is the train step, that is, the optimizer that we’ll be using. We’ve instantiated the FtrlOptimizer. You can use gradient descent or any of the other optimizers in the TensorFlow library. The third argument that we pass into this function, which allows us to tweak our training, is the batch size. If you remember, batch sizes can range from one point per batch, which is stochastic gradient descent, to the other extreme, where the entire data set is passed in for every iteration, that is, batch gradient descent. In between is mini batch gradient descent, where we pass in some subset of the data in every epoch. Within this function, train_with_multiple_points_per_epoch, first initialize all the variables that you’ve set up in this program.
This we do by calling tf.global_variables_initializer. In order to execute this computation graph, instantiate a session object. It’s always best practice to instantiate a session object using the with statement; the session will be closed as soon as we exit the with body. Before you run any of the other commands in your TensorFlow program, make sure you call session run on init to initialize all your variables. The next step is to set up a for loop that will actually train our TensorFlow model. The for loop runs as many times as the steps that we’ve specified. Every iteration of the for loop is one epoch. The next set of if/else statements is used to specify the subset of data that is to be passed in for every epoch.
If the batch size that we’ve specified is equal to the entire data set, that is, dataset_size is equal to batch_size, then the batch starts at index zero. It’s an error if the batch size specified is greater than the size of the data set; we raise a ValueError in that case. If the batch size is smaller than the size of the data set, this is mini batch gradient descent, if you remember. Then we have to choose some subset of the data from the entire data set. For every iteration, we choose a different subset of the data. The start index of the batch, batch_start_idx, is i multiplied by batch_size, where i tells us what iteration of the for loop we are on. We perform a modulo operation with the data set size, so that the start index of the batch cannot go beyond the size of the data set.
Once we have the start index of the batch, the end index of the batch is very straightforward. We simply add the batch size to the start index. Once we have the start and end indices, which represent one batch of data, we access the x and y values in batches. So we have batch_xs and batch_ys, the x values and their corresponding y labels. Once we have all the x and y data that correspond to one batch, we set up a feed dictionary to feed these x and y values to our training model. If you remember our training model, there were two placeholders whose values we had to feed in: x and y_. batch_xs feeds in the x data and batch_ys feeds in the data for y_.
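The index arithmetic described above can be sketched as a small helper. The function name batch_bounds is hypothetical, and the data set size of 250 is an assumption used only to show the wrap-around.

```python
def batch_bounds(step, batch_size, dataset_size):
    """Start and end indices of the batch used at a given step."""
    if batch_size > dataset_size:
        raise ValueError("batch size cannot exceed the data set size")
    if batch_size == dataset_size:
        start = 0                                    # whole data set every step
    else:
        start = (step * batch_size) % dataset_size   # wrap around with modulo
    end = start + batch_size
    return start, end


data = list(range(250))   # hypothetical data set of 250 points
for step in range(4):
    start, end = batch_bounds(step, batch_size=100, dataset_size=len(data))
    print(step, start, end, len(data[start:end]))
```

Note that the end index can run past the end of the data. Python slicing simply stops at the last element, so with these numbers the batch at step 2 holds only 50 points.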
In order to convert these arrays into feature vectors, we need to reshape the data, which is why we call reshape(-1, 1) on both x and y. With the reshape function as applied here, we convert both x and y into a two-dimensional array, or an array of arrays. Each nested array has just a single data point. This reshape is important because the input that we feed to these placeholders should match the shape that we specified when we instantiated them: [None, 1]. The first dimension is None because we had no idea how many data points we were going to feed in. The second dimension was one, because we had one element for every data point. Once we have this feed dictionary set up, we call session run on our training step.
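The effect of that reshape can be imitated with plain lists. This is only a sketch of the resulting layout: to_column and shape are hypothetical helpers written for this illustration, and the sample values are made up.

```python
def to_column(values):
    """Turn a flat sequence into a list of one-element rows, shape (n, 1)."""
    return [[value] for value in values]


def shape(rows):
    """(rows, columns) of a rectangular nested list."""
    return (len(rows), len(rows[0]) if rows else 0)


batch_xs = [0.5, -1.2, 0.3]     # flat: three data points
column = to_column(batch_xs)    # nested: one data point per inner list
print(column, shape(column))
```

The number of rows follows the data, while the number of columns is fixed at one, which mirrors the [None, 1] shape of the placeholders.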
The train step that was passed in here is the optimizer that we instantiated early on, the FTRL optimizer. If you play around with this program, you can change the optimizer, using simple gradient descent or any of the other optimizers, to see how your result changes because of that. And this step, where you call session run on your optimizer, is the heart of this entire program. All the remaining code in the body of this with statement is simple print statements to see how your model converges. Every 500 iterations, we print out the values of W, B, and the cost. If the computation of any of these nodes requires placeholder values, make sure you pass in the feed dictionary. At this point, we are now ready to train our model.
This should give us the converged model, with the values of W and B that minimize our mean square error. Call train_with_multiple_points_per_epoch. Pass in 5000 as the number of steps, or iterations, or epochs, in our model. The train step, FTRL, is the optimizer that we plan to use. This can be gradient descent or Adagrad or any of the other optimizers you choose. I’m going to choose a batch size that is less than the size of my data set; I’ll arbitrarily choose 100. When you run this, after about 5000 iterations, you’ll see that we get a final value of W of 0.987 and a value of B of 0.9. Our minimum cost is about 0.2. Let’s visually track the values of W after every 500 iterations.
Notice that W is slowly inching towards the 1.67 value that our baseline implementation got. The value of B, on the other hand, jumps around a lot between positive and negative values and doesn’t really seem to have converged after about 5000 iterations. This means the current model is not good enough. You need to tweak some of the parameters in order to get a better result. Our baseline value for W was 1.67 and for B was 0.8. Don’t be disappointed. This is what machine learning is all about: tweaking the various parameters to see if you can get a better result. What are the things you can tweak? You can increase the number of steps, the number of epochs that you run. You can change the optimizer, or you can change your batch size.
I’m going to show you just one here. We’ll change the batch size to be the size of the entire data set. I’ll leave the rest to you for practice: change the number of training steps and the optimizer and see how that affects the output of this regression. When you run this, you will find that the value of W is 1.67 and B is 0.8. It exactly matches our baseline. In our linear regression model, the optimizer was used to minimize the cost function so that we find the best fit regression line. An optimizer generally starts off at some value of our cost function and then takes steps to find the lowest possible value. And optimizers such as the plain vanilla gradient descent optimizer or the FTRL optimizer that we used follow different paths to minimize the cost function.
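The experiment of varying only the batch size can be imitated in a library-free sketch. Several things here are assumptions made for illustration: plain gradient descent stands in for the FTRL optimizer, the toy data is noiseless and is not the S&P 500 returns data, and the function name train and its parameters are hypothetical.

```python
def train(xs, ys, steps, learning_rate, batch_size, report_every=500):
    """Gradient descent on the MSE, using one batch of data per step."""
    dataset_size = len(xs)
    if batch_size > dataset_size:
        raise ValueError("batch size cannot exceed the data set size")
    w, b = 0.0, 0.0        # initial guesses
    history = []           # (step, W, B, cost) recorded every report_every steps
    for i in range(steps):
        start = 0 if batch_size == dataset_size else (i * batch_size) % dataset_size
        batch_xs = xs[start:start + batch_size]
        batch_ys = ys[start:start + batch_size]
        n = len(batch_xs)
        errors = [(w * x + b) - y for x, y in zip(batch_xs, batch_ys)]
        w -= learning_rate * (2.0 / n) * sum(e * x for e, x in zip(errors, batch_xs))
        b -= learning_rate * (2.0 / n) * sum(errors)
        if (i + 1) % report_every == 0:
            # Cost over the whole data set, so that runs are comparable.
            cost = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / dataset_size
            history.append((i + 1, w, b, cost))
    return w, b, history


# Hypothetical noiseless data on the line y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w_mini, b_mini, history_mini = train(xs, ys, 5000, 0.05, batch_size=5)
w_full, b_full, history_full = train(xs, ys, 5000, 0.05, batch_size=len(xs))
print(w_mini, b_mini)
print(w_full, b_full)
```

On data this clean, both batch sizes should settle near the same line. On noisy data, such as real market returns, smaller batches would typically make W and B jump around more from step to step.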