What is Logistic Regression?

Dec. 23rd, 2020 · 17 min read

This post will introduce you to the principles behind logistic regression as a binary classification method. Using Python and NumPy we will implement the core principles from scratch on a test dataset. There are different ways to implement this particular algorithm but I will focus on an implementation with a neural network mindset as many of these ideas extend well into neural nets in later posts.

Logistic regression is a predictor borrowed from statistics used for binary classification. All this means is that we can use the algorithm to predict whether a given example belongs to a class or not. As an example, if we knew certain features about the weather (temperature, humidity, wind, etc.) we could try to predict if it's going to rain or not. To do so, we need many *labeled* examples of data from other days. We'll get into this more moving forward.

Before we can dive into the implementation, we first need to fully understand the required data and math for this problem.

We represent a particular example as a vector of features $x$ and we store all these examples as one large matrix $X$, where each row $x^{(i)}$ is a particular example (a single day if we follow our prior rain example). The labeled aspect means that we know whether or not that day had rain. We'll call this the ground truth and save the labels for all our examples in a vector $y$, where we store a 0 for the days it didn't rain and a 1 for the days it did.

As an example, let's imagine that we track 2 different aspects to describe each day, such as average temperature and humidity. For a year, we would have 365 examples of temperature and humidity stored in our matrix $X$ (one row per day, one column per feature) and whether it rained or not in our vector $y$.
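To make the layout concrete, here is a minimal sketch of how such data could be held in NumPy. The four days and their measurements are made up purely for illustration, and the names `X` and `y` are assumed to stand for the feature matrix and label vector described above.

```python
import numpy as np

# Hypothetical measurements for four days (made-up values):
# each row is one day, columns are [average temperature (C), humidity (%)].
X = np.array([
    [30.0, 20.0],
    [18.0, 85.0],
    [16.0, 90.0],
    [28.0, 30.0],
])

# Ground-truth labels: 1 if it rained that day, 0 otherwise.
y = np.array([0, 1, 1, 0])

print(X.shape)  # (number of examples, number of features)
print(y.shape)  # one label per example
print(X[1])     # the feature vector for day 2
```

A full year of data would simply have 365 rows instead of 4.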

The forward propagation step is where we take an example $x$ and pass it through our model to yield a prediction $\hat{y}$. To do so, we have a vector $w$ that stores what we call the weights. We also have a bias term that we denote as $b$. Together with the sigmoid function ($\sigma$) we get the following:

$$\hat{y} = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $\hat{y}$ is the probability that the example is in the class we are trying to predict. For the rain example, this is the logistic regression's estimated probability that it will rain that day. The sigmoid function is well suited to this as it introduces a non-linearity that bounds the output between 0 and 1, perfect for a probability!
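As a rough sketch of forward propagation for a single example, the snippet below computes a weighted sum of the features plus a bias and squashes it with the sigmoid. The feature values, weights, and bias are invented for illustration and are not the values used elsewhere in this post.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example: [scaled temperature, scaled humidity] for one day.
x = np.array([0.2, 0.9])

# Made-up weights and bias (illustrative only).
w = np.array([-4.0, 6.0])
b = -1.0

z = np.dot(w, x) + b   # weighted sum of the features plus the bias
y_hat = sigmoid(z)     # estimated probability of rain for this day

print(z, y_hat)
```

Because humidity is high and its weight is positive, the weighted sum is positive and the estimated probability lands well above 0.5.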

Let's consider our example from the prior data organization section. I understand that at this point I have not discussed how to determine the weights $w$ and bias $b$, but for now assume that we know them; we will discuss how to find the correct values in the next section. Let's use the following:

We can actually go ahead and bulk propagate all examples using matrix multiplication! Let's use the features $X$ and labels $y$ for the days defined before:

Following the equation for the prediction $\hat{y}$ we get the following:

When we compare those probabilities to the known results we can see they match! Days that had rain (days 2 and 3) show high probabilities of rain, and the days without rain show low probabilities. In practice you can get a prediction of 1 or 0 by rounding the probability; the idea is that you round towards the more likely outcome. In the case of logistic regression, where there are only two possibilities (rain or no rain), the probability of one follows directly from the other: $P(\text{no rain}) = 1 - P(\text{rain})$.

Essentially, if the probability is greater than $0.5$ we round up to 1, and if it is less than $0.5$ we round down to 0, since the alternate outcome is then the less likely one.
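The bulk propagation and rounding steps can be sketched as follows. All numbers here are made up for illustration (four days with features already scaled to between 0 and 1, rain on days 2 and 3); they are not the values from the worked example above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up, already-scaled features: [temperature, humidity] per day.
X = np.array([
    [0.9, 0.1],
    [0.2, 0.9],
    [0.1, 1.0],
    [0.8, 0.2],
])
y = np.array([0, 1, 1, 0])   # ground truth: rain on days 2 and 3

# Made-up weights and bias (illustrative only).
w = np.array([-4.0, 6.0])
b = -1.0

# One matrix multiplication propagates every example at once.
y_hat = sigmoid(X @ w + b)

# Round towards the more likely class using a 0.5 threshold.
preds = (y_hat > 0.5).astype(int)

print(np.round(y_hat, 3))
print(preds)  # [0 1 1 0]
```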

Personally, I prefer to move the bias term ($b$) into the matrix multiplication ($Xw$). This is done by adding a column of 1's to the end of $X$, which adds a feature that is constant across all examples. Then we increase the size of the weights by 1 (now $w_1, w_2, \dots, w_{n+1}$ for $n$ features) such that the last weight acts as the bias. Check the math below if you are still curious. This helps to simplify the training procedure as you only need to train for $w$! Moving forward, I will refer to the augmented matrix and weight vector as just $X$ and $w$.

$$\begin{bmatrix} X & \mathbf{1} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = Xw + b\,\mathbf{1}$$
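A quick numerical check of this trick, using made-up values: appending a column of ones to the features and appending the bias to the weights should give exactly the same result as the separate bias term. The names `X_aug` and `w_aug` are introduced here only for illustration.

```python
import numpy as np

# Made-up features, weights, and bias.
X = np.array([
    [0.9, 0.1],
    [0.2, 0.9],
    [0.1, 1.0],
    [0.8, 0.2],
])
w = np.array([-4.0, 6.0])
b = -1.0

# Append a column of ones as the last feature.
ones = np.ones((X.shape[0], 1))
X_aug = np.hstack([X, ones])

# Append the bias as the last weight.
w_aug = np.append(w, b)

# Both formulations give the same pre-sigmoid values.
print(np.allclose(X @ w + b, X_aug @ w_aug))  # True
```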

As mentioned prior, I just gave you suitable values for the weights $w$ and bias $b$ to go through the example, which I found by tweaking the results in a Python session. In practice, it can be difficult or impossible to determine the correct values by hand. To determine them properly we use a process called training!

Values of the weights are determined through training!

Training is the process where we take many labeled examples and use them to determine values for the weights $w$ that will yield the best *overall* performance. To be precise, we will describe some formulations for what are called the cost and loss of the model.

Let's start with the loss function. The loss is a measure of the error for a particular example's prediction. Following our prior notation, with $\hat{y}$ the predicted probability and $y$ the label, the loss for one example is:

$$L(\hat{y}, y) = -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)$$

Let's play this out and see why it makes sense. For a given example that we know had no rain ($y = 0$), the left term goes to 0 and we're just left with $-\log(1 - \hat{y})$. As a reminder, $\log(1) = 0$, so any deviation of $1 - \hat{y}$ from 1 creates an error. Going back to $y = 0$, a perfect prediction would be $\hat{y} = 0$ and we'd be left with $-\log(1) = 0$. By the same reasoning, when $y = 1$ the right term goes to 0 and we're left with $-\log(\hat{y})$, which is 0 only for a perfect prediction of $\hat{y} = 1$. Therefore, the loss function accommodates the two possible values of $y$ and penalizes the prediction's distance from the label in each case.

The cost function is easy: it's just the average of the losses over all $m$ examples, so:

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
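A minimal sketch of the per-example loss (the negative log-likelihood of the label under the predicted probability) and the cost as its average. The clipping with a small `eps` is an assumption added here to avoid taking `log(0)`; the prediction vectors are made up to contrast a good and a bad model.

```python
import numpy as np

def loss(y_hat, y, eps=1e-12):
    """Per-example loss; eps clipping is an added safeguard against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Average of the per-example losses."""
    return np.mean(loss(y_hat, y))

y = np.array([0, 1, 1, 0])

# Made-up predictions: one set close to the labels, one far from them.
good = np.array([0.02, 0.97, 0.99, 0.05])
bad = np.array([0.90, 0.20, 0.40, 0.70])

print(loss(good, y))   # small values: predictions agree with the labels
print(loss(bad, y))    # larger values: predictions disagree
print(cost(good, y), cost(bad, y))
```

The cost of the good predictions comes out far lower than the cost of the bad ones, which is exactly the behavior we want from a function we plan to minimize.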

Now that we have a function to minimize, it is tempting to just take the derivative, set it equal to 0, and solve for the values of the weights $w$. In practice this is not possible, as there is no closed-form solution, so we need to approach this as a numerical optimization problem.

Gradient descent is fairly simple: given the gradient of a function with respect to the variable being optimized, take a step in the negative direction of the gradient (positive for gradient ascent) and update the variable by that step. Keep repeating this until you are happy with the convergence. In general, it looks something like this:

$$\theta := \theta - \alpha \nabla_\theta f(\theta)$$

Where $\alpha$ is a tunable parameter called the learning rate; it dictates what fraction of the gradient we should step by. I will not go through the derivation (great resource), but for our problem we would like to use gradient descent where $f$ is the cost function $J$ and we take the gradient with respect to the weights $w$. For our use case this is:

$$\nabla_w J = \frac{1}{m} X^T (\hat{y} - y)$$

where $\hat{y} = \sigma(Xw)$ is the vector of predictions for all $m$ examples.
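To illustrate, here is a sketch of a single gradient descent step on the logistic regression cost, assuming the bias has already been folded into the weights via a trailing column of ones. The data, the starting weights, and the learning rate `alpha = 0.5` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Average cross-entropy loss over all examples."""
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(w, X, y):
    """Gradient of the cost with respect to the weights."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / m

# Made-up data; the last column of ones plays the role of the bias.
X = np.array([
    [0.9, 0.1, 1.0],
    [0.2, 0.9, 1.0],
    [0.1, 1.0, 1.0],
    [0.8, 0.2, 1.0],
])
y = np.array([0.0, 1.0, 1.0, 0.0])

w = np.zeros(3)   # start from all-zero weights
alpha = 0.5       # learning rate (illustrative choice)

before = cost(w, X, y)
w = w - alpha * gradient(w, X, y)   # one step against the gradient
after = cost(w, X, y)

print(before, after)   # the cost drops after the step
```

Repeating the update line in a loop is all that training amounts to.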

Now we have everything to implement!

Let's go ahead and import our Python modules first as follows:

I will be using Python and NumPy for the implementation. The dataset used is imported in our prior code block with the rest of our libraries. The dataset takes a few human health metrics as features and tries to predict if the patient has breast cancer; you can read more here.

As we have many parameters, helper functions, and repeated operations I will be creating the model as a Python class. Let's go ahead and initialize the class with all the methods we will be using.

- `sigmoid` will be where we write the sigmoid function
- `prep_data` is where we will normalize our data to between 0 and 1
- `add_dimension` is a helper function for adding the column of 1's to the feature matrix $X$
- `train` is where we will run the gradient descent iterations to determine our weights
- `predict` is where we will forward propagate on unseen (test) data to estimate the classes

Let's start with the `sigmoid` function. As mentioned before, the function is $\sigma(z) = \frac{1}{1 + e^{-z}}$; using NumPy this is:
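A minimal sketch of such a helper, written here as a free function rather than a class method (an assumption made for brevity). Because `np.exp` works element-wise, the same function handles a single number or a whole array of values.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # near 0, exactly 0.5, near 1
```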

Let's continue with `prep_data` and `add_dimension`. `prep_data` takes the range of each column and brings it down to $[0, 1]$, so that no particular feature dominates the result of $Xw$. `add_dimension` creates a vector of ones the size of the number of samples and tacks it on as the last column of $X$.
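The two helpers could be sketched roughly as below, again as free functions rather than methods. This is a plain min-max scaling under the assumption that each column is scaled independently; the guard for constant columns and the sample data are additions made here for illustration.

```python
import numpy as np

def prep_data(X):
    """Scale every column to the range [0, 1] (min-max scaling)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # added guard: avoid dividing by zero
    return (X - col_min) / col_range

def add_dimension(X):
    """Append a column of ones so the last weight can act as the bias."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

# Made-up raw data: [temperature (C), humidity (%)].
X_raw = np.array([
    [30.0, 20.0],
    [18.0, 85.0],
    [16.0, 90.0],
    [28.0, 30.0],
])

X_ready = add_dimension(prep_data(X_raw))
print(X_ready)
```

One design point worth keeping in mind: when scaling held-out data, it is common to reuse the minimum and range computed on the training set rather than recomputing them, so both sets are transformed identically.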

The training process is more involved. The comments explain what is going on, but feel free to reach out if you would like more explanation.
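As a hedged sketch of what such a training loop might look like, the function below runs plain batch gradient descent on the cross-entropy cost. The signature, the default learning rate and iteration count, the zero initialization, and the returned cost history are all assumptions for illustration, and the data is made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent; X is assumed to end with a column of ones."""
    m, n = X.shape
    w = np.zeros(n)                  # start from all-zero weights
    costs = []
    for _ in range(iterations):
        y_hat = sigmoid(X @ w)       # forward propagation on all examples
        safe = np.clip(y_hat, 1e-12, 1 - 1e-12)   # added guard against log(0)
        costs.append(-np.mean(y * np.log(safe) + (1 - y) * np.log(1 - safe)))
        grad = X.T @ (y_hat - y) / m  # gradient of the cost w.r.t. w
        w = w - alpha * grad          # step against the gradient
    return w, costs

# Made-up, scaled data with a trailing column of ones.
X = np.array([
    [0.9, 0.1, 1.0],
    [0.2, 0.9, 1.0],
    [0.1, 1.0, 1.0],
    [0.8, 0.2, 1.0],
])
y = np.array([0.0, 1.0, 1.0, 0.0])

w, costs = train(X, y, alpha=0.5, iterations=2000)
print(w)
print(costs[0], costs[-1])   # the cost should fall over the iterations
```

Tracking the cost per iteration is optional, but plotting it is a handy way to judge convergence and to sanity-check the learning rate.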

To close out the model class, let's implement `predict`, which computes the forward propagation of our test data.
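A possible shape for this step, sketched as a free function: forward propagate, then round each probability to a class. The `threshold` parameter and the example weights are assumptions for illustration, and the input is expected to already include the trailing column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, threshold=0.5):
    """Return a 0/1 class for each row of X (X ends with a column of ones)."""
    probs = sigmoid(X @ w)
    return (probs > threshold).astype(int)

# Made-up unseen examples and made-up weights (last weight acts as the bias).
X_test = np.array([
    [0.85, 0.15, 1.0],
    [0.15, 0.95, 1.0],
])
w = np.array([-4.0, 6.0, -1.0])

print(predict(X_test, w))  # [0 1]
```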

Assuming you have downloaded the aforementioned dataset, make sure it is in the same folder as your Python file! The following code will import it and split the data into a test and train set. I have not mentioned the split yet, but this is a good place to talk about it.

Quick digression: a train/test split is the process of splitting your data into two groups based on a fraction (70/30 for train/test is typical). This is very important, as testing a model on the data it was trained on is not a good measure of its performance. To put it briefly, a model can learn to fit a training set very well yet fail to generalize to new data, and generalization is all we really care about. This is as much as you need to know for now, but I would highly recommend you read more here.
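One simple way to do such a split with NumPy alone is to shuffle the row indices and slice them. The helper name `split_data`, its parameters, and the toy data are hypothetical and only meant as a sketch of the idea.

```python
import numpy as np

def split_data(X, y, test_fraction=0.3, seed=0):
    """Shuffle the examples, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])            # shuffled row indices
    n_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 examples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = split_data(X, y)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Shuffling before slicing matters when the file is ordered (for example, by class), since otherwise the test set could end up with only one kind of example.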

We've made it! It's time to train our model and see how it does on the testing set! The following code will help you out:

Now all we have to do is compute the accuracy and see how we did.
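Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch, with made-up labels and predictions standing in for the real test set:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of predictions that equal the ground-truth labels."""
    return np.mean(np.asarray(y_pred) == np.asarray(y_true))

# Made-up labels and predictions: 4 of the 5 match.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

print(accuracy(y_pred, y_true))  # 0.8
```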

This results in an accuracy of around 91%! Considering the low number of features in the dataset this is a great result.

As the intro suggests, this is just an introduction and there is much more that we can do with logistic regression. Just to list a few, here are the topics I am considering in extending the basics of logistic regression.

- Underfitting and overfitting in the model: how do we know when this is happening and how can we mitigate it?
- Multinomial logistic regression: let's extend the prediction space from in or not in a class to multiple classes.
- Tuning of parameters: what other metrics can we look at to better understand the best values for our hyperparameters?