8. Linear Classification
Here is a question to keep in mind as we go through the contents of this video. In logistic regression, when applied to classification, on what basis is the predicted output label selected? In other words, what is the basis used by our logistic regressor to decide which output label to assign or predict? Logistic regression is often known as linear classification, because it is used to implement classification algorithms based on a machine learning approach. Let’s see how this would play out using our favorite whales analogy. Are whales fish or mammals? This is the question that we wish to answer. We know that technically whales are mammals because they are members of a certain infraorder called Cetacea.
But a plausible case could be made that they are fish because they look like fish, swim like fish and move with fish. Remember now that the rule-based approach would be to pass specimens of a whale to a set of rules, and those rules would have been drawn up by a set of human experts. Those experts would know, because of their knowledge, that whales do belong to the infraorder Cetacea, and so they would classify whales as mammals. In an ML-based approach, we would start with a large dataset called a corpus. We would pass this into a classification algorithm, and the output of this would be a trained ML-based classifier. Notice how human experts did not play much of a role in this process.
Because this is a purely mechanized process, we have to be careful about directing the attention of our classifier to the right attributes. If we told our ML-based predictor that this is an animal which moves like a fish and looks like a fish, and that those attributes are what matter, most likely it would make the prediction that whales are fish. To prevent pitfalls like these, it’s important to have a corpus of correctly labeled instances. Not only does that corpus have to be large, representative and well labeled, it also should focus on the right attributes. So we are better off telling our classifier that this is an animal which breathes like a mammal and gives birth like a mammal.
If we do all of this, we will hopefully get the correct answer that whales are indeed mammals. In any case, this was just a quick refresher of how machine learning and rule-based classifiers differ. Let’s now talk about how logistic regression can be used in classification. The basic idea should be quite familiar from linear regression. We start with a corpus, we pass it into the logistic regression algorithm, and the output from this algorithm is a nice logistic regression curve. The process of finding that curve involves finding the best-fitting S curve, that is, finding the constants A and B that best fit the data. This is very similar to linear regression.
As in the case of linear regression, the shape of the curve is already fixed; all that is carried out during this process is determining the constants which define that curve. Once we do have a curve like this, it’s a relatively simple matter to give it feature vectors. For instance, maybe an animal lives in water, breathes with lungs, and does not lay eggs. Then our curve, which has already been fitted to predict probabilities, spits out a probability of 55%. That probability has been calculated by the classifier on the basis of the S curve, because during the training process it was exposed to a wide range of animals. This S curve tells us that if an animal lives on land, breathes with lungs and does not lay eggs, then the probability of it being a fish is only 5%.
Let’s say it’s at the other extreme: it lives in water, breathes with gills and lays eggs; then it’s almost certainly a fish. In the intermediate portion of the curve, the training data might have included animals like whales or duck-billed platypuses, special cases, as it were, which share some but not all of the common characteristics of mammals. By plotting a line, an S curve to be precise, through all of these points, we can get a probability. If that probability works out to be less than 50%, then we will say that the animal is not a fish, i.e., it’s a mammal. Here we are carrying out a binary classification, i.e., there are only two possible labels.
If the probability is greater than 50%, then we conclude that the animal is indeed a fish. Things get a little more complicated if there are more than two labels, but that’s easily taken care of in TensorFlow using the softmax activation function in neural networks. So that’s it. That really is the basic idea of how logistic regression is used in linear classification. Find the probabilities of the different labels; the one which has the highest probability becomes your output or predicted label. And in case you’re wondering why logistic regression is also called linear classification, that’s because there is a special relationship between logistic regression and linear regression.
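To make that decision rule concrete, here is a minimal sketch in Python; the probability values are made up for illustration and are not taken from the course code.

```python
import numpy as np

# Binary case: the model outputs P(label = 1) for each example.
probabilities = np.array([0.05, 0.55, 0.92])          # hypothetical model outputs
binary_predictions = probabilities > 0.5              # True wherever the positive label wins

# Multi-label case: one probability per label, e.g. from a softmax layer.
class_probabilities = np.array([[0.20, 0.70, 0.10],   # hypothetical model outputs
                                [0.60, 0.30, 0.10]])
predicted_labels = np.argmax(class_probabilities, axis=1)  # pick the label with the highest probability
```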
That’s what we are going to examine next in a fair amount of detail. Let’s now come back to the question we posed at the start of this video. When logistic regression is used in classification, we get as the output of the logistic regression a set of probabilities, one probability for each possible output label. We will then go ahead and pick the output label which has the highest probability. In the case of binary logistic regression, or binary classification, there are just two possible output labels, and so the label whose probability is greater than 50% is assigned as the output label.
9. Lab: Logistic Regression – Setting Up a Baseline
Let’s say that we wanted to predict whether Google’s stock will be up or down at the end of a particular month. Returns data for both Google and the S&P 500 are continuous. How can we convert this continuous returns data to up/down categorical predictions? Over the next few lectures, we’ll see how to implement logistic regression in TensorFlow. In this lecture, we’ll see how we can set up a baseline using basic Python libraries. This baseline will help us calibrate how well our TensorFlow model performs in its logistic regression implementation. We’ll set up the baseline using regular Python code. Our logistic regression code requires a bunch of helper functions.
We skimmed over these helper functions in earlier lectures; we take a look at them now. We first use the same helper function which reads data for Google and the S&P 500 into a pandas DataFrame. We’ve used this helper function earlier in the demos for linear regression, so I won’t go into the details of it. The return value is a pandas DataFrame which has three columns: the date, the returns for Google, and the returns for the S&P 500.
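We won’t walk through that helper again here, but as a rough reminder of its shape, here is a minimal sketch; the file names, column names, and the assumption that the CSV files already hold monthly adjusted closing prices are placeholders, not the course’s actual code.

```python
import pandas as pd

def read_goog_sp500_dataframe():
    """Hypothetical sketch: monthly returns for Google and the S&P 500."""
    goog = pd.read_csv("GOOG.csv", parse_dates=["Date"])     # assumed monthly price file
    sp500 = pd.read_csv("SP500.csv", parse_dates=["Date"])   # assumed monthly price file

    # Three columns: the date, the returns for Google, and the returns for the S&P 500.
    return pd.DataFrame({
        "Date": goog["Date"],
        "GOOG": goog["Adj Close"].pct_change(),
        "SP500": sp500["Adj Close"].pct_change(),
    })
```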
It turns out that there are a few quirks in setting up data to be used in logistic regression, which is why we use another method, read_goog_sp500_logistic_data, to set up the data for logistic regression. The code in here takes the same data from the pandas DataFrame. We start off in the same place for both linear and logistic regression, but in the case of logistic regression, we set up the data in a slightly different way. The first step in this function is to read in the returns DataFrame from the previous helper method that we just saw, and the second step is where the quirk comes in for logistic regression. By default, logistic regression assumes that the intercept is not included in its equation. In order to have the intercept included, we have to set up a new column in the returns data where every value in the column is set to one.
That is, the intercept column is assigned the scalar value of one within this DataFrame. If this intercept value were not present in the input to our logistic regression equation, the value of A in the logistic regression S curve equation would be left out. The addition of a column for the intercept implies that our x data now has two dimensions: the returns of the S&P 500 and our intercept column. We set up an array using np.array containing these two dimensions of data. In this x data, we leave out the first and the last row. We leave out the first row because it won’t have a value for returns; this was true in linear regression as well.
We leave out the last row of data, corresponding to the last month, as well, because each of our predictions of whether Google is up or down is for the next month. We do not have any month’s data available after the last month, so we can leave out the last row. The y values, or the y data, in logistic regression cannot be continuous variables. They have to be categorical variables, because the objective of logistic regression is to find the probability of each of the y values. Our y data is the returns for the Google stock, which is a continuous variable. We can convert this to a categorical variable by checking whether the returns for Google are greater than zero or less than or equal to zero.
If they are greater than zero, we assign a label of true, and if they are less than or equal to zero, we assign a label of false. If Google’s returns were greater than zero in a particular month, that was an up month for Google; otherwise it was a down month. So we assign to the y data true or false values based on whether Google’s returns were up or down. We return our two-dimensional x data and the one-dimensional categorical y data as a tuple.
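Putting those steps together, a minimal sketch of this helper might look as follows; the exact function and column names follow the description above but are assumptions, not the course’s literal code.

```python
import numpy as np

def read_goog_sp500_logistic_data():
    """Hypothetical sketch of the logistic regression data helper described above."""
    returns = read_goog_sp500_dataframe()   # the returns DataFrame from the earlier helper

    # The quirk: the intercept is not included by default, so add a column of ones.
    returns["Intercept"] = 1

    # Two-dimensional x data: S&P 500 returns plus the intercept column,
    # dropping the first row (no return value) and the last row (no following month).
    x_data = np.array(returns[["SP500", "Intercept"]][1:-1])

    # Categorical y data: True for an up month for Google, False otherwise.
    y_data = np.array(returns["GOOG"][1:-1] > 0)

    return x_data, y_data
```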
Now that we have our helper methods set up, let’s go ahead and set up our baseline implementation. We import pandas, NumPy and the statsmodels API. You’re familiar with pandas and NumPy; we’ve used them extensively in this course. The statsmodels API is a Python library of statistical models; we’ll be using the logistic regression estimator from this library. We read in our x data and y data values using the helper function that we just set up, read_goog_sp500_logistic_data. This x data and the corresponding y data then serve as the input to the Logit class from our statsmodels library. The Logit class provides a standard, cookie-cutter implementation of logistic regression. We can fit the logistic regression model to our data using Logit’s fit method. This fits the S curve to our data; the details of this S curve are not relevant at this point in time.
What we are really interested in is whether this logistic regression model can predict up months and down months for Google stock. We take our fitted logistic model result and call the predict function on it to predict up and down months for Google; the x data is what our logistic model will base its predictions on. Remember that the output of logistic regression is a probability of a particular y value. If the probability is greater than 0.5, that predicts an up month for Google, and if it’s less than or equal to 0.5, that’s a down month. So our prediction probabilities are converted to true and false values, and the predictions made using our logistic model are stored in the predictions array.
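In code, the baseline described here might look roughly like this, continuing the hypothetical helper names from the sketches above.

```python
import statsmodels.api as sm

# Read in the x data (S&P 500 returns plus intercept) and the categorical y data.
x_data, y_data = read_goog_sp500_logistic_data()

# Fit statsmodels' cookie-cutter logistic regression to our data.
logit = sm.Logit(y_data, x_data)
result = logit.fit()

# predict() returns probabilities; a probability above 0.5 is an up month for Google.
probabilities = result.predict(x_data)
predictions = probabilities > 0.5
```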
Remember that our training data has actual y values in the form of labels; we know whether Google’s stock was actually up or down in a particular month using these labels. We simply compare our predictions with the actual values of these labels to get an idea of how accurate our predictions are: we get the result true if a predicted value matched the actual value and false if a predicted value did not match the actual value. We can then count the number of trues to find how many of our predictions were accurate and store that in num_accurate_predictions.
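The accuracy check itself is only a couple of lines; here is a sketch continuing the names above, including the ratio described next.

```python
import numpy as np

# True where a predicted up/down label matches the actual label, False otherwise.
matches = (predictions == y_data)
num_accurate_predictions = np.sum(matches)

# Fraction of correct predictions: a floating-point value between zero and one.
accuracy = num_accurate_predictions / len(predictions)
print(accuracy)
```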
The percentage accuracy of our model is now a very simple calculation. We divide the number of accurate predictions by the total number of predictions and get a floating-point value between zero and one. Print this out to screen and you’ll find that our logistic regression model has an accuracy of 72.8%. So how would we convert returns data that is continuous to categorical values? We can simply say that if the returns in a particular month are greater than zero, that is an up month, and if they are less than zero, that is a down month. So anything above zero will be up and anything below zero will be down. The returns data will now have two categorical values: up or down.
10. Logit
Here is a question that I’d like you to keep in mind as you go through the video. The residuals in a logistic regression are normally distributed, just like those in a linear regression. Is this statement true or false? The relationship between logistic and linear regression is a subtle and also an important one, because the way in which we will solve logistic regression in TensorFlow will be quite similar to the way in which we solve linear regression. So let’s make sure that we understand it. Recall that linear regression is all about attributing effects to causes. So x causes y; this is the kind of relationship that we want to quantify with linear regression.
We have causatory or independent variables and we have effect or dependent variables, and the relationship between x and y is defined in the form of a straight line. Given a set of points, linear regression then involves finding the best-fitting line which passes through those points. And the word best here has to do with a certain error metric, that is, the mean square error: the mean of the squares of the lengths of all of these residual dashed lines. This is how linear regression works: we need to find the best-fitting line y = a + bx. Logistic regression is actually very, very similar. Logistic regression is all about predicting probability, so on the y axis we have the probability of y.
We also have the actual observed values of y; these can only be zero or one for a binary variable. This means that we now have a set of points which will lie along the line y = 1 and another set of points where y = 0. The logistic regression process is very similar to the linear regression process. It involves finding the best-fitting S curve rather than the best-fitting straight line. This is the S curve which will pass through all of these points and which will best model the probabilities of y being equal to one, using the S curve equation that we see on screen now. And that process, that algorithm of finding the best-fitting S curve, involves finding the best values of A and B.
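For reference, the two functional forms being compared, written in the usual two-parameter notation, are:

$$ y = a + bx \quad \text{(linear regression)} $$

$$ P(y = 1) = \frac{e^{a + bx}}{1 + e^{a + bx}} \quad \text{(logistic regression S curve)} $$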
Clearly, there are a bunch of similarities and differences between logistic and linear regression, so let’s quantify them by placing them side by side. Linear regression focuses on predicting effects given causes. That means that given the value of the x variable, what we are looking to do is to predict the value of the y variable, and we do so by using a straight line. Logistic regression is looking to predict the probabilities of effects given the causes. So this time on the y axis we are going to represent not y itself, but rather the probability that y is equal to one. By the way, for this two-dimensional example, we are just assuming that y is a binary variable in the logistic regression case. From the examples we’ve already seen, it’s pretty clear that linear regression needs the effect variable y to be continuous.
In logistic regression, on the other hand, the effect variable must be categorical; it must be drawn from a finite set of outcomes. The causatory variables can be either continuous or categorical, and this is true with both linear and logistic regression, so in neither case is there any restriction on the type of variable used as the explanatory variable. Given the differing natures of the y variables, in linear regression we seek to connect the dots with a straight line, while in logistic regression we seek to connect the dots with an S curve. And we’ve seen why S curves are a good fit: they model reality pretty well in a lot of their use cases. In both cases, the functional forms of the curves that we are trying to fit have been decided upfront.
In linear regression, it’s y = a + bx. In logistic regression, it has the probability of y on the left-hand side and that S curve, the exponential expression, on the right. The functional forms are different, but the objectives of the regression algorithms are the same: in both cases, the objective of the regression is to find the values of A and B that best fit the data. And now, there is also a special relationship between linear and logistic regression. The linear regression equation is linear by assumption, by definition. In fact, the logistic regression equation can also be made linear, by taking something known as the log of the odds.
This is a log transformation which allows us to change logistic regression into a linear regression problem. Please focus on this equation now, on the right. The point, and the importance, of this equation is that we can now solve logistic regression much as we would solve a linear regression problem; there are a whole bunch of cookie-cutter techniques which exist to do this. We don’t really need to prove this mathematically, but I feel it’s important enough for us to understand exactly how we get here. So let’s start with the logistic regression equation. Here we have found the values of A and B that best fit our data, and we have used these to define an S curve where, on the left-hand side of the equal sign, we have the probability that y is equal to one.
And on the right we have the S curve, that exponential expression. Let’s just abbreviate that probability and call it p for short; that is, p is the probability that y is equal to one. Now, you may be familiar with the definition of the odds of an event, which is on screen now: the odds of any event with probability p are simply defined as the ratio of p to 1 minus p. The odds of an event are a really popular way of expressing its probability; there are lots of folks, particularly in sports betting and gambling, who think intuitively in terms of odds. In any case, let’s keep moving on and calculate the odds given our formula for the probability, which came from the S curve. Here we started with the probability which was defined by our S curve.
And it turns out that this is a formula which lends itself really well to calculating the odds. We first get the denominator of the odds formula, that is, 1 minus p, the probability of the complementary event. I’d highly recommend that you just jot down these formulas with pencil and paper and work through them; it’s much easier than viewing them on screen line by line. But the bottom line is that you will be left with values for p and 1 minus p which, when divided by each other, give a really nice formula: the odds simply work out to be e raised to the power a + bx. Now it’s simple enough for us to transform this equation by taking logarithms to the base e on both sides, and we end up with a straight-line equation. On the left of this equation is the logit.
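Written out with pencil and paper, the steps just described look like this, using the same a and b from the fitted S curve:

$$ p = P(y = 1) = \frac{e^{a + bx}}{1 + e^{a + bx}}, \qquad 1 - p = \frac{1}{1 + e^{a + bx}} $$

$$ \text{odds} = \frac{p}{1 - p} = e^{a + bx}, \qquad \operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right) = a + bx $$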
The logit is the term used to refer to the log of the odds of any event, and on the right we are left with a straight line, a + bx. And this, finally, is the bottom line. We have been able to reduce our logistic regression S curve into a linear equation relating the logit function on the left to a straight line, a + bx, on the right. And this means that logistic regression can be solved via linear regression on the logit function, the log of the odds function. We will actually not use this method when we are implementing logistic regression in TensorFlow, but this is important background so that you can understand how linear and logistic regression are conceptually linked and why logistic regression is called linear classification. Let’s come back now to the question we posed at the start of this video.
The residuals in a logistic regression cannot be normally distributed because, after all, they represent values from a categorical set. Consider binary classification, for instance: the actual labels are going to be either zero or one, and the predicted labels can also only be zero or one. So the difference between the actual and the predicted labels will be zero, plus one, or minus one. This represents a discrete variable, a categorical variable, and it definitely cannot be a normally distributed variable. A normal distribution looks like a bell curve and it is inherently continuous. So the regression residuals in logistic regression are not normally distributed. This is a difference between linear and logistic regression residuals.