1. Lab: Access Data from Yahoo Finance
The remaining demos that we’ll see, which involve practical examples of using regression models in the real world, all use data from Yahoo Finance. You can go ahead and download the latest data from finance.yahoo.com, and in this lecture I’ll quickly walk you through the files that are needed. However, if you want to work on the exact same data that I use for these demos, you might want to download these files directly from a Google Drive folder that I’ve set up. Just go to this tiny URL, open the Learning TensorFlow folder within it, and then the Data folder inside Learning TensorFlow. Download all the CSV files, put them in the data folder under Learning TensorFlow in your Datalab instance, and you’re good to go. I’ll walk you through how we got this data in the first place. In the search box, look for the S&P 500 index.
The symbol for that is ^GSPC. Click on the historical data link that you see here on screen. Now, it’s true that Yahoo Finance often updates this page, and you might find that things don’t look exactly the same when you are actually watching this course. You have two choices in that case. The first is to simply download the CSV files from the Google Drive link that I showed you earlier. The second is to find the equivalent page on the new site; it shouldn’t be very difficult, and should in fact be fairly straightforward. Yahoo allows you to specify a bunch of options for the data you get. You can specify the time period over which you want this historical data and the frequency of the data: daily, monthly, and so on. I’ve downloaded data for a little over ten years, going back all the way to 2007, so we have a manageable set of data points.
I’ve chosen the frequency to be monthly. Click on Apply. This will give you a preview of the data, and once you have the preview, go ahead and click on the download data link that appears right there. Save the CSV file somewhere on your local machine where your program can access it, and give it a meaningful name, such as SP500.csv. Follow the exact same series of steps for the stock data for Google, and do this for ExxonMobil as well. We also use the Nasdaq Composite index in our demos; that is ^IXIC. And finally, the United States Oil Fund will serve as a representative of oil prices. All of these are inputs into our regression models. Make sure that you’ve chosen the same time period and the same frequency (monthly) for all the data that you download.
2. Non-TensorFlow Regression
Here is a question that I’d like you to keep in mind as we go through the contents of this video. Let’s say that we wish to run a multiple regression to explain academic performance, as measured by GPA, or grade point average, using three variables: IQ, gender, and income. We would like to do so for a thousand students. How many elements will each feature vector have? Let’s now turn our attention from theory to practice and start building linear regression models using TensorFlow. Here is an outline of our line of attack. We are going to start by getting a baseline implementation which does not rely on TensorFlow; this will just be regular Python code. This is so that we know what we are shooting for. We then move to TensorFlow, where we start by defining the computation graph.
In the case of linear regression, all that we need is a really simple neural network of just one neuron with an affine transformation. We don’t even need an activation function here. The next step in TensorFlow is to specify the cost function. Here, the cost function will be the mean square error. This is our way of quantifying the goodness of fit: given a set of points, we would like to minimize the cost, and that cost is the mean square error. Any minimization requires an optimizer, and in particular, TensorFlow relies on something known as gradient descent optimization, which we shall discuss in some detail. This optimizer object needs to be instantiated and given the cost function, along with the objective of minimizing it.
Once that’s done, it will know that it ought to minimize the cost to improve the goodness of fit. Once we’ve defined the cost function and the optimizer, we actually carry out the training process. This is done by invoking the optimizer, which goes off and does its thing within TensorFlow’s framework; we don’t actually need to carry out the optimization ourselves. Remember that the optimizer is trying to find the best values of the variables in our TensorFlow computation graph, and the process of updating those variable values will be carried out for us. We will need to tell the optimizer what subset of the training data it ought to use and how many steps it ought to take. These refer to the batch size and the number of epochs.
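As a preview, here is a minimal sketch of that setup in the TensorFlow 1.x API, assuming a single explanatory variable; the shapes and the learning rate are illustrative choices, not values from the course.

```python
import tensorflow as tf

# Placeholders for the training data: a batch of x values and their labels y.
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])

# One "neuron" computing an affine transformation: y_pred = Wx + b.
# No activation function is needed for linear regression.
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y_pred = tf.matmul(x, W) + b

# The cost is the mean square error, our measure of goodness of fit.
cost = tf.reduce_mean(tf.square(y - y_pred))

# A gradient descent optimizer, instantiated and told to minimize the cost.
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)
```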
We will return to batch size and epochs later. Once the training process is done, the values of the variables W and b will be frozen. This gives us a converged, fully trained model that we can now use on new test data. Let’s start with the baseline implementation, which does not require TensorFlow. The idea here is for us to know what we are shooting for. In this example, we will start by implementing simple regression: just one cause, or independent variable, and one effect, or dependent variable. So we have one x variable and one y variable. Let’s go ahead and very quickly see how this could be implemented without using TensorFlow, i.e., just making use of the regular regression facilities available in Python.
It turns out that there are a whole bunch of libraries we can rely on. Pandas is one important one; it provides an R-like abstraction for data called data frames. Another useful library is NumPy, which has a whole bunch of linear algebra functionality; then StatsModels, which has specific statistical techniques and a toolkit for logistic and linear regression; and lastly Matplotlib, which helps with plotting and exposes some powerful MATLAB-style plotting functionality. Let’s very quickly discuss some implementation notes. These are little details which are likely to come in handy when we are actually implementing regression in Python. Remember how negative indices in Python are interpreted? This is different than, say, in R.
Basically, negative indices are backward indices: they count from the end of a container. Just as there are forward indices which start from zero and go up to n minus one, there are corresponding backward indices which can be used to access the same data using the indices minus one through minus n. Linear regression should never be carried out on trending data; this is because of the underlying statistics, and as a result, we often need to convert prices into returns before we can perform linear regression. Conveniently, data frames allow the conversion of prices to returns in just one step, using the pct_change method.
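Here is a small sketch of both approaches, using a toy series of made-up prices:

```python
import pandas as pd

# A toy series of monthly prices (made-up values, for illustration only).
prices = pd.Series([100.0, 102.0, 99.0, 103.0])

# One-step conversion from prices to returns; the first value comes out NaN.
returns = prices.pct_change()

# The same computation by hand: divide every price by the one before it,
# using a forward slice and a backward (negative-index) slice.
manual = prices.values[1:] / prices.values[:-1] - 1
```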
But if for some reason you are not able to make use of this method, you can achieve the same result using negative indices and a little bit of arithmetic jugglery, as in the sketch above. Now let’s also understand the data representation that’s typically used in machine learning based approaches to regression, in contrast to the coordinate geometry approach. In the coordinate geometry based approach, the corpus consists of points represented as x and y coordinates. If we have just two variables, one explanatory variable and one dependent variable, our points have just two dimensions. We would pass all of these points into our regression algorithm, a geometric algorithm which would find the best regression line.
This is not the typical machine learning based approach. In the machine learning based approach, we treat all of the x variables as features, which means that all of those x variables together need to be represented using a set of feature vectors. These are then associated with a set of labels, which are the y values. This distinction between the feature vectors and the labels is a characteristic of machine learning based approaches to regression. This is how we would carry out a regression using NumPy, for instance, and in doing so, it’s very likely that you will find yourself using a function called reshape, as in the sketch below.
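As a rough illustration with made-up numbers, here is how two-dimensional points split into feature vectors and labels:

```python
import numpy as np

# Coordinate-geometry view: each observation is an (x, y) point.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Machine-learning view: the x values become feature vectors, the y values labels.
x_data = np.array([p[0] for p in points])   # shape (3,)
y_data = np.array([p[1] for p in points])   # shape (3,)

# reshape(-1, 1) wraps each x value in its own single-element array,
# giving one feature vector per observation.
features = x_data.reshape(-1, 1)            # shape (3, 1)
labels = y_data                             # labels can stay one-dimensional
```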
3. Lab: Linear Regression – Setting Up a Baseline
In NumPy, what does the array.reshape method do, and why would you want to use it? Let’s find out in this lecture. We’ll see how we can implement linear regression in TensorFlow, working on some real-world data: the stock ticker data that we downloaded in the earlier lecture. We’ll use this regression to model how changes in the S&P 500 index affect Google’s stock price. So the cause is the value of the S&P 500 index, and the effect is Google’s stock price. As a very first step, we’ll set up a baseline in Python for our TensorFlow regression model. Linear regression is a standard numerical problem, and there are a ton of Python libraries which implement it.
We’ll implement linear regression in Python, get a set of results, and then implement the same thing in TensorFlow. The baseline will serve as a comparison for how well our TensorFlow regression model does. If you’re using Cloud Shell to connect to Datalab, it’s possible that your Cloud Shell session has expired. In that case, you need to reconnect, which you can do using datalab connect followed by the name of your Datalab instance, which in our case is tensorflow. Open up port 8081, and there you have it: your Datalab instance is still alive and well. We are currently working in the Learning TensorFlow folder in Datalab. Create a new folder in here called Data; this is where we’ll store the CSV files that we downloaded from Yahoo Finance.
This is the data that we’ll use in our regression models. Go ahead and upload the CSV files that you have on your machine into this data folder within your Datalab VM instance. As you can see, I have very indicative names for each of my CSV files; I know exactly what data each of them represents. All the sample code that you’re going to see in the next few demos is present right here in this Learning TensorFlow folder. In this lecture, open up the file for linear regression with stock market data; that’s what we’ll be using. Now, all our code will extensively use two libraries in addition to TensorFlow: pandas and NumPy. Pandas is a great tool for manipulating data in a tabular format, and NumPy is the standard Python tool for numerical computations.
There are a bunch of helper functions that we’ll use to read in data from the CSV files and set up and format the data the way we want it to be. You’ll find these helper functions in the first few code cells of every IPython notebook that uses them, and I’ll explain what exactly they do and how we format the data. Let’s look at the very first one, read_goog_sp500_dataframe. This reads in data for Google as well as the S&P 500 from our CSV files. The return value of this function is a single data frame which contains the returns for both Google and the S&P 500 over the same range of dates. The columns in the data frame are the date, Google’s returns, and the S&P 500’s returns.
Store the paths to where your CSV files live in the Google file and SP file variables; in our case, they are both in the data directory under the current working directory. We use pandas to read in both of these CSV files. We specify the paths to the files, and fields within individual lines are separated by a comma. The only columns that we are interested in, for both files, are columns zero and five: column zero holds the date, and column five holds the adjusted close, for Google as well as the S&P 500. We give these columns simple names, such as Date and Goog for the Google file, and Date and SP500 for the S&P 500 file. We also need to account for the header row present in these files, which is why header is equal to zero in our parameters
to read_csv. We want to merge the data from these two data frames into one single data frame, the Google data frame. We set up a new column called SP500 in the Google data frame and assign it the values from the SP500 column of the SP data frame. This operation is effectively a join on the Date column: all the adjusted close values for both Google and the S&P 500 will line up on the same date. In all our regression models, we’ll work with the returns data: what is the percentage return that we get, month over month, for Google as well as the S&P 500? We need to convert these adjusted close values to returns, and to do that, we first need to sort by date.
In order to sort by date, the date should not be expressed as a string field; it should be of a datetime type. Use the to_datetime method in pandas to convert our string dates to the format that you see on screen, and assign the result to the Date column in our single data frame. The next step is to sort the data frame in ascending order of dates, which you can do by calling the sort_values function on the Google data frame. Once the data frame is sorted in ascending order of dates, use a very handy function that data frames offer in order to calculate returns: pct_change. This method automatically calculates the percentage changes between adjacent cells in a data frame. We only want to apply this percentage change method to those columns that are of numeric types.
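Putting these steps together, the helper looks roughly like the sketch below; the file paths and the column names Goog and SP500 are assumptions based on the description, so adjust them to match your own files.

```python
import pandas as pd

def read_goog_sp500_dataframe():
    """Read Google and S&P 500 prices and return month-over-month returns."""
    # Assumed locations: both CSVs live in the data directory.
    goog_file = "data/GOOG.csv"
    sp_file = "data/SP500.csv"

    # Column 0 is the date and column 5 the adjusted close, in both files.
    goog = pd.read_csv(goog_file, sep=",", usecols=[0, 5],
                       names=["Date", "Goog"], header=0)
    sp = pd.read_csv(sp_file, sep=",", usecols=[0, 5],
                     names=["Date", "SP500"], header=0)

    # Effectively a join on Date: the adjusted closes line up row by row.
    goog["SP500"] = sp["SP500"]

    # Dates must be real datetimes, not strings, before we can sort on them.
    goog["Date"] = pd.to_datetime(goog["Date"])
    goog = goog.sort_values(["Date"], ascending=True)

    # Convert prices to returns, applying pct_change to numeric columns only.
    numeric_cols = [col for col in goog.columns
                    if goog[col].dtype in ("float64", "int64")]
    returns = goog[numeric_cols].pct_change()
    return returns
```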
We don’t want to apply it to the Date column that’s present in our data frame, which is why we iterate over the column data types and apply the percentage change method only to those columns which are of either float64 or int64 type. That means returns will be calculated for our adjusted close columns. We store this result in a data frame named returns and return that data frame from the method. The next helper method that we see on screen is not something that we’ll be using right now, so let’s skip over the explanation of read_goog_sp500_logistic_data; we’ll come back to it when we perform logistic regression. The helper method after that, which we will use in this demo, is read_goog_sp500_data.
This is the method that sets up the x and y values required to train our linear regression model. The return value from this method is a tuple with two fields: one holds the returns for Google, and the other holds the returns for the S&P 500. Each of these fields is represented as a one-dimensional array. In the first step, we call the helper method that we just saw, read_goog_sp500_dataframe, and get a data frame called returns; this contains the Date column and the returns for Google as well as the S&P 500. From this data frame, we extract two one-dimensional arrays. One array is x_data, which contains all the S&P 500 returns; that forms the x-axis data for our linear regression. The second array is y_data, which contains the returns for Google stock.
For both of these returns data sets, we filter out the very first row. That’s because when we go from prices to returns, we always have one less data point than we started with: the very first date for which we have an adjusted close has no corresponding value for returns, as shown by the example at the very right of the screen. We return a tuple from this function, which contains both of these arrays. There is another helper function here, read_xom_oil_nasdaq_data; we’ll look at it later, when we actually use it while performing multiple regression.
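Based on that description, here is a rough sketch of this helper, assuming the read_goog_sp500_dataframe function and the Goog and SP500 column names from above:

```python
import numpy as np

def read_goog_sp500_data():
    """Return (x_data, y_data): S&P 500 returns and Google returns as 1-D arrays."""
    returns = read_goog_sp500_dataframe()

    # Skip the very first row: converting prices to returns costs one data
    # point, so the first return in each column is NaN.
    x_data = np.array(returns["SP500"])[1:]
    y_data = np.array(returns["Goog"])[1:]
    return (x_data, y_data)
```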
Once we’ve understood the helper functions that we are going to use, we are ready to set up the baseline implementation of linear regression. We’ll perform this baseline computation using a library called scikit-learn, so from sklearn we import datasets and linear_model. Scikit-learn is a very powerful Python library for numerical computations; it’s very popular and widely used, and that’s what we’re going to use here to set up our baseline. The first step is to get the x values and y values that will be part of our linear regression, which we do by using the helper method that we set up earlier, read_goog_sp500_data. Performing linear regression in scikit-learn is very easy: we simply use the linear_model module to instantiate a linear regression model and assign it to our Google model variable. The next step is to fit this model on the data that we have, our x_data and y_data. All of this is very straightforward. The only thing that might be slightly strange here is the reshape method that we call on x_data as well as y_data.
Let’s understand why this is. If you think about linear regression conceptually, it involves passing x and y values to our regression algorithm, or regression model, and the output of this model, given these x and y values, is the equation of a line. That is the conceptual setup. When we want to implement this regression in a practical way, using existing libraries, we have to set up the data in the way that the library expects it to be. Python libraries expect the input data to be in the form of arrays; in particular, the x data for our linear regression should be in the form of an array of arrays. We can achieve this very easily in NumPy by using the reshape functionality. Reshaping a one-dimensional array in NumPy with (-1, 1) gives us an array of arrays, where every element of the original array is enclosed in a single-element array, and those single-element arrays live within a larger array.
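Putting the baseline together, it looks roughly like this; the variable name goog_model is my own placeholder:

```python
from sklearn import datasets, linear_model

# x: S&P 500 returns (the cause); y: Google returns (the effect).
x_data, y_data = read_goog_sp500_data()

# Instantiate a linear regression model and fit it. reshape(-1, 1) turns a
# 1-D array of n values into an n x 1 array of single-element arrays,
# which is the shape scikit-learn expects.
goog_model = linear_model.LinearRegression()
goog_model.fit(x_data.reshape(-1, 1), y_data.reshape(-1, 1))

# The fitted line: y = coefficient * x + intercept.
print(goog_model.coef_, goog_model.intercept_)
```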