1. Lab: Taxicab Prediction – Setting up the dataset
Let’s say that you’re working on a data set that you’ve seen for the very first time. You have a bunch of variables, and you want to see whether a cause-and-effect relationship exists between any of those variables. What are some of the things that you can do to explore this data set and see what relationships exist? In this lecture, using a data set that’s freely available in BigQuery, we’ll see if we can predict the demand for taxi cabs in New York City. In this first lecture, we’ll set up the data set so that we can perform machine learning on it. This lecture will contain a lot of cleaning up and setting up the data the way we want it to be.
This example is part of Google’s Code Lab, and the code for this can be downloaded from the training-data-analyst repository under CPB100, lab 4a; the file is a Python notebook named Demand Forecast. This file should be available to you in the source code that you uploaded into Datalab, which has all the TensorFlow examples. The code under the Learning TensorFlow folder has only the code snippets. It doesn’t have the corresponding explanations, because the explanations are what I am going to be giving you. The original file has a ton of very well written explanations using Markdown. If you want to see those, I would suggest looking on GitHub or downloading that file.
Open up the file and you’re ready to start. As a first step, clear all cells so that previous execution results are not visible; they tend to be distracting when you’re actually executing it yourself. This is a long and pretty involved example, and it uses a whole range of libraries. We use the BigQuery library to read input data from BigQuery, Pandas for setting up the data set in tabular form, NumPy for numerical computations, and shutil to perform some file and directory manipulation. Before we start reading data in, it’s always helpful to check out the schema of the table that we are going to query. This is a table in a public data set that’s freely available to us, and it has a whole lot of interesting data on trips passengers have taken in cabs.
For every trip, you have information available, such as the pickup time, the drop-off time, where exactly passengers were picked up, the latitude-longitude coordinates, the trip distance, the fare that the passenger finally paid, and also how long cabs were idling just before the trip. One of the columns that we’ll query is the day of year. This assigns an integer identifier to each day of the year: 1 January is 1, 2 January is 2, and so on. You can examine this result; the extract function seems to work just fine. Let’s move on to the actual query that we want to set up. If you observe the table name for this taxicab data, you’ll notice that the suffix of that table is a year: 2015 means the data is for the year 2015, 2014 means the data is for the year 2014, and so on.
We set this query up so that it’s parameterized on the year, which means you can pass in the year for which you want the data as an input variable. We want to predict the demand for taxi cabs in New York on any day. Let’s use the number of trips as a representative of the demand for taxicabs. There are of course other values that could serve equally well, such as the total amount spent on taxis, but this example uses the number of trips for simplicity. For every day of the year in any year, find the total number of trips made by taxi cabs. In order to execute this query, you need to specify the year input variable, the year for which you want this data.
In this example, we first get the data for the year 2015. This is passed in as a string parameter and appended to the end of the name of the table. Execute the query and pass in the JSON that you just set up as the query parameters. BigQuery provides very convenient Pandas integration: you can convert results from a BigQuery table to a Pandas data frame by simply calling the to_dataframe() method. We then just sample the results in this data frame to get an idea of what it looks like. And there you see it: a data frame with the day number and the corresponding number of trips. This is for the entire year 2015.
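Here is a minimal sketch of what that parameterized query and conversion can look like in Datalab. The table, column, and parameter names below are assumptions based on the public NYC taxi data set, and the exact query in the notebook may differ.

```python
import google.datalab.bigquery as bq

# Total trips per day-of-year, for a year passed in as a query parameter.
# The year string is matched against the table-name suffix.
taxiquery = """
SELECT EXTRACT(DAYOFYEAR FROM pickup_datetime) AS daynumber,
       COUNT(*) AS numtrips
FROM `bigquery-public-data.new_york.tlc_yellow_trips_*`
WHERE _TABLE_SUFFIX = @YEAR
GROUP BY daynumber
ORDER BY daynumber
"""

# JSON-style query parameters, as mentioned in the lecture.
query_parameters = [
    {
        'name': 'YEAR',
        'parameterType': {'type': 'STRING'},
        'parameterValue': {'value': '2015'}
    }
]

# Run the query and convert the result to a Pandas data frame.
trips = bq.Query(taxiquery).execute(query_params=query_parameters) \
          .result().to_dataframe()
trips.sample(5)
```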
There are many ways to get a baseline. One way is to simply calculate the average number of trips over all days in a year and find the root mean square error of this average. Root mean square error (RMSE) is the same as the MSE that we’ve seen before, except that we take the square root of the MSE. This baseline gives us an RMSE of 10,000. Remember, lower values of RMSE are better. Let’s see if our TensorFlow models beat this when we actually implement them. This benchmark, or baseline, basically means that if you were to predict that there will be 55,000 trips every day, you’ll be off by 10,000 trips. This average prediction of 55,000 can be an overestimate of 10,000 trips or an underestimate of 10,000 trips.
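A minimal sketch of this baseline, assuming the trips data frame from the sketch above with a numtrips column:

```python
import numpy as np

# Baseline: always predict the average daily trip count, then measure how
# far that constant guess is from the truth with the root mean square error.
avg_trips = np.mean(trips['numtrips'])
rmse = np.sqrt(np.mean((trips['numtrips'] - avg_trips) ** 2))
print('Average daily trips: {:.0f}  Baseline RMSE: {:.0f}'.format(avg_trips, rmse))
```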
Let’s say that you have a theory that you want to confirm or disprove using this data. You believe that the demand patterns for taxi cabs depend on the weather in some way. Let’s test whether your hypothesis is correct. This we can do by using the table which has our weather information. We’ve used this table before in earlier examples. This is the public GSOD weather data set, and we want to query the stations table to know which weather station will give us information on New York weather. This is the WHERE clause that will help us find the station: state equals New York, name like LaGuardia, and so on. You can see that the station with ID 725030 will give us New York weather.
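A rough sketch of that station lookup, assuming the public GSOD stations table and its usaf, wban, name and state columns; the notebook’s actual table and filter values may differ:

```python
# Find the weather station for New York (LaGuardia).
stationquery = """
SELECT usaf, wban, name
FROM `bigquery-public-data.noaa_gsod.stations`
WHERE state = 'NY' AND name LIKE '%LA GUARDIA%'
"""
bq.Query(stationquery).execute().result().to_dataframe()
```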
We’ll query the table which has the weather information to get various details of the weather for each day of the year in the year 2015. The column that’s common between this weather data and the number-of-trips data that we got earlier is going to be the day of the year, so extract the day-of-the-year information from this table as well. In addition to the weather, say you also believe that taxicab demand differs by day of the week; say you believe demand will be high on Saturday when people go out to party. Day of week will give you information as to what day of the week it is. It starts with Sunday being 1 and Saturday being 7. And here is the weather data.
We get the minimum temperature on that day, the maximum temperature, and a number representing how much it rained that day, the precipitation. The temperature data is in Fahrenheit and the precipitation data is in inches. Notice that we get this information for the weather in New York: we use the identifier for the station which gives us New York City’s weather. We once again parameterize this table by year, so we can pass in the year as an input variable. Execute this query by passing in the year as 2015; that’s the year for which we have taxicab data. Now we have weather data and day-of-week information for this year as well. Execute this query and sample five records in the result to see what the data looks like.
The next step makes one data frame of all the information that we have. We have the weather data and the number of trips for every day of the year 2015. We can perform a join operation where the join column, in traditional SQL terms, is the day of the year. In Pandas this is super straightforward: simply use the pd.merge method, as the sketch below shows. If you examine the result, you’ll see the day number column that is common to both data frames. The next four columns come from the weather table, and the column after that is the number of trips from the taxi table.
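A minimal sketch of that merge, assuming the weather query result is in a data frame called weather and the common column is named daynumber:

```python
import pandas as pd

# Join the weather frame and the trips frame on the shared day-of-year column.
data = pd.merge(weather, trips, on='daynumber')
data.head()
```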
Before you go ahead and start setting up a machine learning model, you might want to quickly explore the data to see if any relationships exist. There are a bunch of tools that you can use to do that. Since we are already using Pandas, we can use the plot function in Pandas to help us see visual representations of this data. When you examine data visually, you might find that some relationships exist; something might jump out at you. In this step, we’ll see whether a relationship exists between the number of trips on a particular day and the maximum temperature on that day. We’ll plot a scatter plot to visualize this. This scatter plot doesn’t seem to hold much information for us. There is no obvious clustering around certain temperatures, and no obvious linear or other kind of relationship between these two variables. Let’s explore another variable that might be more promising: the day of week and the number of trips.
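Both plots can be produced with the Pandas plot method. A minimal sketch, assuming the merged frame and the column names maxtemp, dayofweek and numtrips used above:

```python
# Scatter of trips against maximum temperature for each day of 2015.
data.plot(kind='scatter', x='maxtemp', y='numtrips')

# Same idea against the day of the week (1 = Sunday, 7 = Saturday).
data.plot(kind='scatter', x='dayofweek', y='numtrips')
```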
What you’ve seen anecdotally is that you can never find a cab on Saturday. Let’s see if that translates to more trips on the weekends. And you’ll immediately notice that this visualization has much more information and is much more interesting. Day of the week is, of course, categorical data; it has discrete values, and days go from one through seven. Days 6 and 7, that is, Friday and Saturday, have a significantly larger number of trips as compared with other days. You can postulate a number of theories for why this might be the case. For example, if New York, which is a tourist destination, has a number of tourists pouring in on the weekends and looking for cabs, the demand would be higher.
If you look at the days in the middle of the week, such as Tuesday and Wednesday, the demand is significantly lower. If you were to use this data to recommend to taxi drivers what days they should take off, you would recommend Tuesday and Wednesday. Now, it might be that you still believe that the maximum temperature has some effect on the demand for taxi cabs, but this day-of-week effect is so strong that it might be obscuring the relationship with the maximum temperature. A way to test your hypothesis would be to examine the relationship between the number of taxi trips and the maximum temperature for the same day of the week across the entire year. That’s what we’re trying to do here.
We are trying to eliminate the effect of the day of the week to see whether the maximum temperature has any effect on the number of taxi trips. Let’s plot the number of trips for every Saturday against the maximum temperature on that day. Again, if you take a look at the plot, there’s not much information there. One thing should strike you, though: the number of data points is far smaller than what we saw in the other graphs. There’s just not enough information there. Let’s say, instead of just working on the year 2015, we were to include a few more years of data. We might find that a pattern exists. On that note, this example adds one more year of data to our data set. Set up the query parameters as we did before.
This time specify the year as 2014. And here is why having the right parameterized query is so important: by simply setting up the same query with different year values, you are able to quickly get data from different years. Run the query for getting taxicab trip data as well as the weather data for the year 2014. Merge the data together, performing a join on the day number, into a single data frame; as before, this is the data for 2014. Execute the query and sample the data. It looks fine. It’s kind of annoying working with one data frame which has data from 2015 and another which has data from 2014. This is all the same kind of data, so we can simply bring these two data sets together using pd.concat. The combined result is stored in the data frame data2.
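A minimal sketch of that step, assuming the 2014 merged frame is called data_2014 and was built exactly like the 2015 frame above:

```python
# Stack the 2015 and 2014 frames into one combined data frame.
data2 = pd.concat([data, data_2014])
data2.describe()
```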
The data2.describe() command will give you statistics on all the columns of data that’s stored within this data frame. These are the standard statistics that you might need to know in order to understand and get a feel for this data. For every column, there is the count, mean, standard deviation, min, max, and all the quartiles. Across the two years 2014 and 2015, we have a total of 546 records, and we get these statistics for every column. The day numbers range from a minimum of 1 to a maximum of 365, which makes sense. The number of trips per day also varies pretty significantly: it goes from a low of around 13,000 trips to a high of about 81,000 trips on a single day.
Let’s go back and see if adding this additional data set for the number of trips in 2014 allows us to view some pattern in the effect that maximum temperature has on the number of trips per day, after eliminating the effect of the day of week; this is for Saturday. The resultant plot is a lot more interesting. There is more clustering around the higher temperatures, but again, there is still insufficient data. You might have to add a couple more years to see whether a real pattern emerges. We believe that some relationship exists between our predictor variables, that is, the day of week and the weather on a particular day, and the number of trips that passengers take in New York taxi cabs.
Let’s run a TensorFlow regression model in order to see what this relationship is. So what would you do if you’re working on an unfamiliar data set? You need to understand the data, and you want to explore what relationships exist. You will typically set up the data set in a way that makes sense to you, and you’ll perform a few exploratory visualizations in order to see what kind of pattern emerges. If visualizations give you an idea of the patterns that exist, that’s great; you have some kind of objective in mind when you set out with your machine learning models. Just a caveat, though: patterns may not always jump out at you when you’re visualizing the data. Sometimes you believe patterns exist, but they might only be extracted with machine learning algorithms.
2. Lab: Taxicab Prediction – Training and Running the model
Training the machine learning model is just the first step. The entire point of a machine learning algorithm comes in when you use the model for predictive analysis. Before you apply the model, you want to save this trained model somewhere. How would you do that in TensorFlow? As a continuation from the last lecture, we’ll predict taxicab demand in New York City. We’ve set up the data set the way we want it to be; we’ll now run the ML model on it. We’ll train our model and then use it for prediction. You have some data set on which you want to train your machine learning model. Typically, you would use a fraction of the data set for training and the rest of the data set to test or validate whether your model works.
In this example, we use 80% of the data for training and 20% of our data as the test data, which we’ll use to validate our model. We shuffle our data set so it’s not in any predictable order when we get started. This we do by calling the sample method on our data frame. We sample the entire data set, so it will now be in random order. In our machine learning model, the variables that we use to predict the output, the predictor variables or feature vectors, will comprise the day of the week, maximum temperature, minimum temperature, and the precipitation on any day. The target variable, or what we want to predict, is the number of trips. If you think about it, one of these features is a categorical variable, and that is the day of the week.
This should typically be represented in one-hot notation, where Sunday will be represented by a vector of one followed by zeros, Monday by a zero, a one, and the remaining zeros, and so on. Day of week is not really a continuous variable, but for simplicity’s sake, in this model we treat it as such. There is some code in Google’s Code Lab which allows you to treat the day of week as a categorical variable. It’s not a great idea with a small data set because it leads to something called overfitting. When the day of week is expanded to one-hot notation, we increase the number of columns in the predictor variables, that is, the number of features that we pass in for prediction.
If we have a huge number of features with a small amount of data, we can end up overfitting the model. Overfitting is when a model is very dependent on, or very sensitive to, statistical error or noise, rather than just capturing the underlying relationship. In this implementation, we’ll treat the day of week as is, as a continuous variable which we know has just discrete values. The predictors for this regression are columns one through five; these are the columns for the day of week, minimum temperature, maximum temperature, and rain. Let’s quickly sample the shuffled data set. It’s the same data we set up, except in random order. The target, or what we want to predict in our machine learning model, is the number of trips taxicabs take in our training data set.
The number of trips, at column index five, will serve as the training labels for our regression. We mentioned earlier that 80% of this data set is going to be used for training. The number of records that we have for training is stored in the train size variable. We’ll compute a new baseline, or benchmark, on our training data: access the first 80% of rows from the shuffled data frame and find the average. The average of the number of trips formed our baseline before, and we are going to use the same thing for a baseline here. Once we have the average number of trips, find the root mean square error so we know how good this baseline is. Our training data set has an average of around 46,900 and a root mean square error of around 12,000.
So a prediction of around 46,900 cabs for a particular day can be about 12,000 trips away in either direction. If you’ve observed the data set, you’ll see that the value we’re predicting, the taxicab demand, is generally in the 10,000 to 80,000 range. These are very, very large numbers. We’ll scale all these number-of-trips values down by a factor of 100,000, so that the resultant predicted values will be in the range zero to one. Machine learning models generally run faster when the weights that they use, or the predicted values, are smaller. We don’t want to use these huge numbers, so in order to speed up our model, we’ll scale them down by a factor of 100,000. The next two lines of code set up 80% of our data for training and 20% for test.
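Pulling the last few steps together, here is a minimal sketch. It assumes the combined frame data2 has its columns in the order daynumber, dayofweek, mintemp, maxtemp, rain, numtrips, as described earlier; the notebook’s variable names may differ.

```python
import numpy as np

# Shuffle all rows, then pick the predictor columns (positions 1-4:
# dayofweek, mintemp, maxtemp, rain) and the target column (5: numtrips).
shuffled = data2.sample(frac=1).reset_index(drop=True)
predictors = shuffled.iloc[:, 1:5]
targets = shuffled.iloc[:, 5]

# 80% of the rows for training, the rest for testing.
trainsize = int(len(shuffled) * 0.8)

# Baseline on the training portion: predict the average number of trips.
avg = np.mean(targets[:trainsize])
rmse = np.sqrt(np.mean((targets[:trainsize] - avg) ** 2))
print('Training average: {:.0f}  Training RMSE: {:.0f}'.format(avg, rmse))

# Scale the trip counts down so the values the model predicts lie in 0-1.
SCALE_NUM_TRIPS = 100000.0
```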
The number of predictors, or regression variables, in this model is the number of columns in our predictors data frame, which is four. The number of outputs is just the number of trips for the taxi cabs; that is, the number of outputs is equal to one. Set the log level to 1 so we get fewer output messages. In TensorFlow, you can save your trained model to a directory. The directory that we are going to use to save our trained model is trained_model_linear. Remove the directory if it already exists; we don’t want to load previous weights and biases from this directory. If this directory exists, then TensorFlow will try to initialize the model with data from it. We use a LinearRegressor, an estimator, to perform this regression.
As we’ve spoken about earlier, estimators are high-level APIs which take care of a lot of the little details of regression for you. In the real world, the really common cookie-cutter models, such as logistic or linear regression, are what you’ll perform using estimators. There are some new parameters here that we haven’t seen before. There is a model_dir argument, which is where we store our trained model; this is the trained_model_linear directory under the current working directory. And here are some other inputs that we can specify to configure this estimator. The first of these is the optimizer that your estimator should use. If you don’t want to use the plain vanilla gradient descent optimizer, you can specify another one.
Here we’ve specified the Adam optimizer. Ignore the enable_centered_bias argument for now; it’s a way to specify whether you want the weights in your model to have a centered bias or not, and the discussion of this is beyond the scope of this example. The last argument is the feature columns: what features you want this model to look at while training. Then there is the input function, which gives the estimator information about the data set that it has to train on. A lot of the code that you see here is basically shaping the data to be in the right format and scaling the number of trips. Remember that we scale the number of trips down by 100,000 so that all our predicted values come in the range zero to one. This keeps the weights of the model low and helps it run faster.
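Here is a minimal sketch of that setup, using the older tf.contrib.learn API this lab is based on. The feature-column name, learning rate, and input-function structure are my assumptions, not necessarily what the notebook uses.

```python
import shutil
import tensorflow as tf
from tensorflow.contrib import layers as tflayers
from tensorflow.contrib import learn as tflearn

NPREDICTORS = 4   # dayofweek, mintemp, maxtemp, rain
NOUTPUTS = 1      # numtrips
MODEL_DIR = 'trained_model_linear'

# Start from a clean directory so previously saved weights are not restored.
shutil.rmtree(MODEL_DIR, ignore_errors=True)

# A single real-valued feature column that carries all four predictors.
feature_cols = [tflayers.real_valued_column('inputs', dimension=NPREDICTORS)]

estimator = tflearn.LinearRegressor(
    feature_columns=feature_cols,
    model_dir=MODEL_DIR,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01),
    enable_centered_bias=False)

def make_input_fn(features_df, labels):
  """Feeds the predictors and the scaled trip counts to the estimator."""
  def input_fn():
    x = {'inputs': tf.constant(features_df.values, dtype=tf.float32)}
    y = tf.constant(labels.values.reshape(-1, NOUTPUTS) / SCALE_NUM_TRIPS,
                    dtype=tf.float32)
    return x, y
  return input_fn
```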
Pass in the input training data set and the corresponding training labels. Call estimator.fit with the input function and specify the number of steps for your training. When this line of code is executed, you have a fully trained machine learning model, and you can use that model to predict taxicab demand. You can use this trained model to predict taxicab demand by calling estimator.predict. Here we do it for the test data set. The test data set is typically used to validate our model, and all the remaining data after the training data is the test data set. In your final prediction, though, you need to scale up the output values by multiplying by SCALE_NUM_TRIPS. You want the final result to be in the 10,000 to 100,000 range and not in the zero-to-one range.
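Continuing the sketch above (the number of training steps is illustrative, and the exact return format of predict varies across TensorFlow 1.x versions):

```python
import numpy as np

# Train on the first 80% of the shuffled rows.
estimator.fit(input_fn=make_input_fn(predictors[:trainsize], targets[:trainsize]),
              steps=10000)

# Predict on the held-out 20% and scale back up to real trip counts.
test_input_fn = make_input_fn(predictors[trainsize:], targets[trainsize:])
predicted = np.array(list(estimator.predict(input_fn=test_input_fn)))
predicted = predicted.flatten() * SCALE_NUM_TRIPS

# Root mean square error on the test portion.
actual = targets[trainsize:].values
print('LinearRegressor RMSE: {:.0f}'.format(
    np.sqrt(np.mean((predicted - actual) ** 2))))
```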
Once you have the predicted values, you can calculate the root mean square error and see how well the model performs. Our linear regression model using an estimator gives us a root mean square error of around 10,600, so a little bit of an improvement over the baseline, the one we calculated using averages. Let’s run a deep neural network regression model on the same data. Really, the only change you need to make for this is to use tf.contrib.learn.DNNRegressor; that’s the change in the estimator that you use. Once you’ve set up your data set correctly, you can simply pick and choose your estimator. We don’t need to go into the details of deep neural networks here, but you need to specify how many layers your neural network will have and how many neurons, or nodes, are within each layer.
That’s where the hidden_units argument comes in. Our neural network has two layers: five nodes in the first and two nodes in the second. Run the training for this regression model and you will find that the root mean square error seems a little higher than what we got earlier; it is around 12,000 in my run of this code. This could be due to the small size of this data set, and there could be many other factors which make neural network solutions harder to converge.
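As a sketch, the only change from the linear model is the estimator class and its hidden_units argument; the model directory name and training steps below are illustrative:

```python
# Same data and input function, different estimator: a two-layer network
# with 5 nodes in the first hidden layer and 2 in the second.
shutil.rmtree('trained_model_dnn', ignore_errors=True)

dnn = tflearn.DNNRegressor(
    hidden_units=[5, 2],
    feature_columns=feature_cols,
    model_dir='trained_model_dnn',
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01),
    enable_centered_bias=False)

dnn.fit(input_fn=make_input_fn(predictors[:trainsize], targets[:trainsize]),
        steps=10000)
```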
We’ve stored our trained model in a directory, and we can now use it to make predictions. Here is the input feature vector that we pass in in order to determine what the demand for taxi cabs will be: the days of week are 4, 5 and 6, that is Wednesday, Thursday and Friday; the minimum temperatures on those days are 30, 60 and 50 degrees Fahrenheit; the maximum temperatures are 40, 70 and 60; and on Thursday there’s a little bit of rain, 0.8 in our input. Instantiate a trained linear regressor from our model directory. Our model directory will have the weights and biases that go into this linear regressor. Once we have the estimator instantiated, we can call estimator.predict in order to get our predictions. The predictions need to be multiplied by SCALE_NUM_TRIPS so they are in the right range. Run this bit of code and there you’ll see the result.
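A minimal sketch of that prediction step, reusing the feature columns, scale factor and model directory from the training sketch above; the input values are the lecture’s illustrative ones and the column order matches the training predictors:

```python
import pandas as pd
import numpy as np

# Wednesday, Thursday, Friday with their min/max temperatures (Fahrenheit)
# and a little rain on Thursday.
query_input = pd.DataFrame(
    {'dayofweek': [4, 5, 6],
     'mintemp':   [30, 60, 50],
     'maxtemp':   [40, 70, 60],
     'rain':      [0.0, 0.8, 0.0]},
    columns=['dayofweek', 'mintemp', 'maxtemp', 'rain'])

# Point a fresh LinearRegressor at the saved model directory; the trained
# weights and biases are restored from there.
saved = tflearn.LinearRegressor(feature_columns=feature_cols,
                                model_dir=MODEL_DIR)

def predict_input_fn():
  return {'inputs': tf.constant(query_input.values, dtype=tf.float32)}

pred = np.array(list(saved.predict(input_fn=predict_input_fn)))
print(pred.flatten() * SCALE_NUM_TRIPS)   # scale back to trip counts
```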
If you remember our visualization earlier, Thursdays tend to be slow days; the demand on Thursdays tends to be much lower than on Fridays, but here they are almost the same: 47,000 as opposed to 49,000. This may be because we’ve set it up so that it rained a little bit on Thursday, and so the demand for taxicabs is higher. If you’ve trained a machine learning model, how would you save it to use it for future predictions? You would specify a model directory when you set up an estimator, and the estimator will simply write out the weight and bias values to that directory. Using a trained model for predictions simply involves instantiating your estimator, in our case a linear regressor, pointing to the model directory.