Neural Nets: Gradient Descent

Mar. 25th, 2023 · 27 min read

In this series, I hope to build intuition and understanding of machine learning topics leading up to modern neural networks. Below is the plan for the series, with links to completed posts.

- Gradient Descent
- Logistic Regression for Classification
- Two Layer Neural Network
- Gradients Over Many Layers
- Modern Deep Learning Frameworks

Gradient descent is a simple and intuitive optimization method that is common across a large space of problems, ranging from simple linear regression to neural networks. With it, we can optimize a given function and determine the parameters that give us the best performance. By the end of this post, you should have a good idea of how this optimizer works and how to apply it to new problems.

There are a few concepts that you should know before coming into this post. In general I try to explain everything clearly, but it helps if you have prior exposure to these concepts.

- **NumPy:** We'll be using NumPy as the data storage and computation backbone. If you're not familiar with NumPy matrices and how to manipulate them, take a look at this resource.
- **Calculus:** We rely heavily on derivatives; in particular, partial derivatives are important!

Before we step into how we use the gradient to help in our optimization problem, let's first build some understanding around the gradient. For those who are unfamiliar with partial derivatives or want a refresher you can take a look at the aside below.

The gradient of a function is actually quite a simple concept but can be so powerful when used correctly. The way to think about the gradient intuitively is that it's a vector that points in the direction of greatest increase of a certain function.

It's a vector in the sense that it has both the direction in which maximum increase happens and the magnitude of that increase. The components of the vector lie in the input space of the function, so if the function takes one input, the gradient has only one component.

To be more formal, we can define the gradient in terms of partial derivatives so that we can actually compute it. Consider a function $f(x_1, x_2, \ldots, x_n)$ that depends on multiple input variables. The gradient is the vector of partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

So if you're standing at a point and compute the gradient at that point, you get the vector pointing in the direction, from where you are, in which you would climb the fastest. More concretely, we evaluate each partial derivative at our point to get that vector.

Don't be worried about the math for now, I just want you to take the intuition and roll with it for now. If you plan to implement this from scratch you'll need to actually evaluate this but for now just know it points to where we climb fastest!
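To make the "vector of partial derivatives" idea concrete, here's a small sketch (the function $f(x, y) = x^2 + y^2$ is just an illustrative choice, not one from this post) that approximates the gradient numerically with finite differences and compares it against the analytic partials:

```python
import numpy as np

def f(x, y):
    # Example function: f(x, y) = x^2 + y^2, with gradient (2x, 2y)
    return x**2 + y**2

def numerical_gradient(f, x, y, h=1e-6):
    # Central finite differences approximate each partial derivative
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

grad = numerical_gradient(f, 1.0, 2.0)
print(grad)  # close to the analytic gradient (2*1, 2*2) = (2, 4)
```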

Now that we have an understanding of how to compute gradients and what information they give us, let's use it to optimize our problems through gradient descent.

Let's imagine that you are stranded on a large hill, and fog so dense surrounds you that you can only see a few feet around. You want to *descend* to the bottom of the hill back to safety, but you can't see far enough to plot out how to get to the bottom.

You decide that the best thing you can do is to follow the direction of greatest descent local to what you can see. As you continue to descend you get new information about your surroundings so you change your direction every so often to the new greatest descent. Eventually you'll either reach a valley in the hill landscape or the true bottom where you're trying to be.

What you've just done is gradient descent down the hill! In the next section we'll associate these bits of intuition to the math and see how it works.

Let's formally define gradient descent and map our example back onto its components. In its simplest form, gradient descent repeatedly computes the following update:

$$\theta \leftarrow \theta - \alpha \nabla f(\theta)$$

- $f$ is the function we're minimizing with gradient descent; it's the surface height of the hill
- $\theta$ is the set of variables $f$ needs; it's the $x$ and $y$ position on the hill looking top down
- $\nabla f(\theta)$ is the gradient of $f$ with respect to $\theta$; it's the direction we move in $x$ and $y$ on the hill
- $\alpha$ is a scalar on the gradient that decreases how big of a step we take, often called the learning rate
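The update above can be sketched in a few lines of NumPy. The function, starting point, and iteration count here are all made-up illustrative choices:

```python
import numpy as np

def grad_f(theta):
    # Gradient of the example function f(theta) = sum(theta**2),
    # whose minimum is at the origin
    return 2 * theta

alpha = 0.1                    # learning rate
theta = np.array([3.0, -2.0])  # starting position on the "hill"

for _ in range(100):
    theta = theta - alpha * grad_f(theta)  # step opposite the gradient

print(theta)  # very close to [0, 0]
```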

Another aspect of the problem we should try to connect is the idea of an intermediate valley in the hill landscape versus the true bottom. Formally, this is the difference between local and global minima.

With gradient descent there is a problem: depending on the loss function, we are not guaranteed to land at the global minimum, which is the lowest point over the entire landscape. Let's consider some simple cases to see why.

In a convex function, there is only one minimum, which implies it's the global minimum. A simple convex function is a parabola, so when we execute gradient descent over this function we always land at the global minimum no matter where we start.

In the video, the blue arrow represents the partial derivative, so during descent we expect to move opposite this direction as we update the value of $x$ with the negative of the gradient.

Let's consider a function where there are multiple minima; in particular, one with two equivalent global minima and one local minimum in the center. When we start gradient descent at various points, we see that we don't always get to a global minimum.

In practice, we don't even know what the loss landscape looks like for large models, but this problem exists regardless. There is research on variations of gradient descent that help mitigate these problems as well as improve training speed and overall performance. The go-to right now is the Adam optimizer, which adapts its step sizes automatically based on patterns observed in prior iterations.

Just to give you an idea of what these more complicated landscapes look like, consider this work done to visualize them! Mind you, in most applications the number of input variables is far larger than we can visualize, so this is only an attempt at an approximate visualization.

Before we step into linear regression, let's try to emulate the hill descent.

For our primary function $f$, we need to model the height of where we would be on the hill at a position, which simply means we need something of the form $z = f(x, y)$. For this example I opted to use simple sine waves over the two input variables.

Why is height the right quantity to minimize? Well, consider that whatever function we model, we will end up minimizing its value, so we want that behavior to match our goals. In this case, we are trying to get to the base of the hill, which means we want to minimize our height. Let's go ahead and compute the corresponding partial derivatives.

What we have now are the equations to know how high we are based on our position and the associated gradient for those positions to know which direction we should follow!
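As a sketch, a sine-based surface and its analytic gradient might look like the following. The exact surface used in this post's visuals isn't reproduced here, so treat this particular combination of sines as an assumption:

```python
import numpy as np

def height(x, y):
    # A hypothetical "hilly" surface built from sine waves
    return np.sin(x) + np.sin(y)

def grad_height(x, y):
    # Partial derivatives of the surface above:
    # d/dx sin(x) = cos(x), d/dy sin(y) = cos(y)
    return np.array([np.cos(x), np.cos(y)])

print(grad_height(0.0, np.pi / 2))  # [1, 0]: climb fastest along +x here
```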

Let's try to visualize what is going on and see if we can draw some conclusions to help our understanding. First, let's initialize some points randomly on these hills and execute gradient descent on each of them to see where they end up.

Let's go ahead and take a look at the hills example realized! What you can see is a 3D view and a top down view of the paths taken by each of the points.

As we can see, they all reach minima, which is what we expect! Something to note is that depending on where the points start, they have a destined minimum they'll end up in; in other landscapes this may mean that a point terminates at a local rather than a global minimum!

We can evaluate the gradient at a grid of points across the space we're optimizing in. We know in this case that the gradient produces a vector over $x$ and $y$, so we can plot the gradient at each grid point as an arrow pointing in the gradient direction. This grid of arrows is called a "gradient vector field".

Plotting the paths from before over the vector field, we can see that the path the points take almost exactly follows the negative of the gradient, which makes sense! The gradient points in the direction of greatest *increase*, so when we descend we follow the negative of the gradient to follow the path of greatest *decrease*. We could just as easily follow the gradient itself if we wanted to maximize a function, which is called gradient ascent, but this isn't nearly as common.
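A vector field like this can be drawn with Matplotlib's `quiver`. Here's a minimal sketch, again using an assumed sine surface rather than the post's exact hills:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative sine-wave surface (an assumption, not the post's exact hills)
def grad_height(x, y):
    return np.cos(x), np.cos(y)

# Evaluate the gradient over a grid of (x, y) positions
xs, ys = np.meshgrid(np.linspace(-3, 3, 20), np.linspace(-3, 3, 20))
gx, gy = grad_height(xs, ys)

# Negate the arrows so they show the direction of descent
plt.quiver(xs, ys, -gx, -gy)
plt.title("Negative gradient vector field")
plt.savefig("vector_field.png")
```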

Now that we have the foundation to apply this method, let's implement gradient descent for linear regression and analyze some of the behavior. In this section I focus more on the conceptual side of this problem and not the code but you can see how I implemented it in the next section.

The linear regression problem is to determine the line that best fits a set of data. To be precise, let's define some of the components of this.

We need the inputs $X$ and the corresponding results $y$. $X$ contains all of the samples in our dataset. An entry $x_i$ is a particular sample out of our whole dataset, with its corresponding $y_i$. The number of samples in our dataset is denoted as $N$.

The general model, as expected, is a simple linear model that takes in the input variables and computes a prediction $\hat{y}$. This can be thought of as the following equation:

$$\hat{y} = w x + b$$

Where the weight $w$ is the slope of the line and the bias $b$ is the line's vertical offset. We are considering the scalar case here, where $x_i$ is a single number for each example in our dataset, but we could also have the case where $x_i$ is a vector of multiple inputs for each sample, and as such $w$ would also be a vector. In the next post we'll look more into what that looks like, but I want to start simple.

Often it will help to think about this model working on a single sample at a time, which merely means that we index a sample $x_i$ out of $X$, and that results only in the prediction $\hat{y}_i$ for that sample.

In this problem, we want to formulate a function that by minimizing we'll achieve the expected behavior. One way to think about this is we want a function that indicates how bad we are doing and we want to minimize this "badness". This is commonly referred to as a loss function.

In the case of linear regression, we're looking to produce a real-valued number and compare it to the dataset samples. The loss function used for this is the mean-squared error (MSE) loss:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

To be more in line with our prior labeling, we'll call MSE our loss $L$. We can substitute our model's prediction $w x_i + b$ in for $\hat{y}_i$ to get the final loss description in terms of our model parameters $w$ and $b$. This is because we can think of $w$ and $b$ as the knobs we can tweak to change where the line is!

As we are trying to minimize the loss using gradient descent, we need to compute the gradients with respect to our model parameters $w$ and $b$, which are the following:

$$\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i \qquad \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$$

If you want to see step by step how I computed these partials, look at the aside below.

With these gradients, we can execute gradient descent by updating our model parameters by their associated gradients evaluated at each iteration.
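Concretely, with learning rate $\alpha$, each iteration applies the standard updates:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w} \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$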

Let's actually step into an example where we can see how the model trains and eventually performs. We'll start with modeling a dataset where the underlying trend of the data is linear so we would expect this model to perform well.

The construction of our linear dataset is very simple! We start with a line with a known $w$ and $b$, and we apply noise to the points just as we would expect real data to have. The noise is merely sampled from a normal distribution. In a real example, we would not know what the true line is, and we use linear regression to estimate the $w$ and $b$ of that line.

We can now execute linear regression to solve for $w$ and $b$ and see how well we match our known true line based on fitting to the samples alone.

What you can see in the video below is, on the left, the loss landscape with the history of our $(w, b)$ values as a point traveling through it, and on the right, the predicted line in red corresponding to the current $(w, b)$ position in the descent. We can see the loss value is quite low and the fit seems good!

One way to think about this is that the position of the point defines a line, and that line has a corresponding height that is the current loss. By traversing down the loss landscape, we find the $(w, b)$ such that the corresponding loss is minimal.

Let's consider a different example where the underlying trend of the data isn't a line anymore but instead a parabola.

In the same way that we created the linear data, we will sample points along the function and apply noise to each of the points from a normal distribution.

The visualization here is similar to the prior but just with the parabolic data. As we can see the method is working quite well as we've reached the minimum loss again.

You may have noticed that both examples reach the minimum which would indicate that we should have reached a point where the model performs well, right?

Not quite! In fact, there is a whole concept in machine learning called the bias/variance tradeoff. The general idea is that there are two extremes that result in poor performance.

**High Bias**: The model is too *inflexible* to model the data correctly
- Happens when models are too simple for the patterns in the data
- This is where we are; there is no world where a line will correctly fit a parabola

**High Variance**: The model is too *flexible* to generalize the data correctly
- Happens when models are too complex and start fitting to patterns that aren't "real"
- Large models can have an issue where they fit to noise and don't generalize the data well

Somewhere in the middle there is a goldilocks region where the model is flexible enough to encompass the complexity of the data but not so flexible that it starts overfitting.

So when we compare slices along the minima of the loss landscapes, we see that sure, we may have reached the minimum, but that minimum is very high for the parabolic case!

The learning rate is generally regarded as how big of a step to take in the direction of the negative gradient during descent. Changing the value of the learning rate has massive implications on the descent time and results. Let's consider some examples and try to understand how to pick an adequate value for the learning rate.

Let's consider the convex function below, where we know where the optimal value lies. With different learning rates, we will see how the behavior and results change!

Across the following examples, the number of iterations is kept the same so as to easily compare behaviors, but you could always adjust this when solving a problem.
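To make the behaviors below concrete, here is a small sketch comparing a few learning rates on the parabola $f(x) = x^2$ (these particular rates and iteration count are illustrative choices, not the ones used in the post's videos):

```python
def descend(alpha, x0=5.0, iters=50):
    # Gradient descent on f(x) = x^2, whose gradient is 2x
    x = x0
    for _ in range(iters):
        x = x - alpha * 2 * x
    return x

for alpha in [0.01, 0.1, 0.9, 1.1]:
    # Too small crawls, well chosen converges, too large oscillates
    # or diverges; compare the final positions
    print(f"alpha={alpha}: final x = {descend(alpha):.4g}")
```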

When the learning rate is too small, there isn't a concern of instability, but getting to a minimum takes forever! This is not a big deal for small problems, but on large problems where the computation is not trivial, wasting time taking micro steps is not recommended.

When the learning rate is well chosen, we swiftly converge to the minimum without any oscillations or sporadic behavior. We reach the optimal point on our loss surface within ~25 iterations and hold there.

When the learning rate is a bit large, we have some oscillations as the descent settles into the final path. Looking just at the loss curve, everything seems perfectly fine, and in practice it often is! One issue is that sometimes you'll see a rapid decay of loss at first and then unstable behavior as you settle in closer.

When the learning rate is way too large, we have unmanageable oscillations; in fact, instead of converging on the optimum, the descent diverges out to a massive loss (note the loss scale of 10^18)!

So how do you pick the right learning rate? In practice this is easier said than done and is a bit of an art. I would advise looking at what people used for similar trained models as a baseline, but have fun experimenting with it too. Keep these behaviors in mind and change it as you see necessary.

With better optimizers like the aforementioned Adam optimizer, many of these issues are handled automatically, but it still helps to know the general behaviors of plain gradient descent.

As I alluded to before, we can scale this model to work over many input variables, where the input is not just a scalar for each entry but instead a vector of multiple features that describe that sample. In the next post, we build off this model, extend it to many variables, and show how to execute this efficiently.

If you want to replicate this, I break down some of the code I used for linear regression here. I really wanted to focus on the conceptual understanding in this post so I pushed this to the end to focus on the process more than the code.

Just importing the libraries that we intend to use for this example.
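The original import cell isn't reproduced here, but given the tools used throughout the post, it would look something like:

```python
# NumPy for data storage and computation, Matplotlib for visualization
import numpy as np
import matplotlib.pyplot as plt
```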

Here we create the data based on a line and add noise. If you wanted to try a different function, you could add another function to represent your model, but make sure to keep `line_func` for linear regression predictions!
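A sketch of what this data creation could look like; the `make_data` helper, the true $w$ and $b$, the sampling range, and the noise scale are all assumptions for illustration, while `line_func` is the name used in the post:

```python
import numpy as np

def line_func(x, w, b):
    # The linear model used for predictions: y = w*x + b
    return w * x + b

def make_data(n=100, w=2.0, b=1.0, noise_std=1.0, seed=0):
    # Sample x values, evaluate the true line, then add Gaussian noise
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=n)
    y = line_func(x, w, b) + rng.normal(0, noise_std, size=n)
    return x, y

x, y = make_data()
```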

Here we define the function to compute our mean-squared error loss, as well as a function to get the gradients from the input data $X$, the true results $y$, and the predictions $\hat{y}$.
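A minimal sketch of those two functions, following the MSE loss and the partial derivatives with respect to $w$ and $b$ given earlier (function names here are assumptions):

```python
import numpy as np

def mse_loss(y_hat, y):
    # Mean-squared error between predictions and true values
    return np.mean((y_hat - y) ** 2)

def gradients(x, y, y_hat):
    # Partial derivatives of the MSE loss with respect to w and b
    dw = 2 * np.mean((y_hat - y) * x)
    db = 2 * np.mean(y_hat - y)
    return dw, db
```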

Here we implement the iterative gradient descent that we've been talking about and print out the loss so we can see that it all works. At this point, if you run the code, you would expect something like the output on the code block.
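A sketch of what that loop might look like; the dataset parameters, learning rate, and iteration count are assumptions:

```python
import numpy as np

# Small synthetic dataset: true line w=2, b=1 with Gaussian noise (assumed)
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)

w, b = 0.0, 0.0  # initial guesses
alpha = 0.01     # learning rate

for i in range(2000):
    y_hat = w * x + b
    loss = np.mean((y_hat - y) ** 2)
    # Gradient descent updates for w and b using the MSE partials
    w -= alpha * 2 * np.mean((y_hat - y) * x)
    b -= alpha * 2 * np.mean(y_hat - y)
    if i % 500 == 0:
        print(f"iter {i}: loss = {loss:.4f}")

print(f"w = {w:.3f}, b = {b:.3f}")  # should land near the true w=2, b=1
```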

As a check we can plot the results and see what we get!
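The plotting step might look something like the following; the fitted values and data here are stand-ins for whatever the descent produced:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data and fitted parameters from the previous steps (assumed)
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)
w, b = 2.0, 1.0

# Scatter the noisy samples and overlay the fitted line in red
plt.scatter(x, y, s=10, label="data")
xs = np.linspace(-5, 5, 100)
plt.plot(xs, w * xs + b, color="red", label="fitted line")
plt.legend()
plt.savefig("fit.png")
```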

As with anything, I learned this method through applying it but also external resources. I wanted to highlight a few that have helped me and might help you!

- 3Blue1Brown
- Math educator who essentially started a new genre of how math is taught, amazing!

- Cyril Stachniss
- Generally a great producer of robotics lectures.

- Math for Machine Learning
- A freely available textbook that is a great reference for purposeful mathematics!

I hope you enjoyed this primer to the series. Much of the material for the remaining posts is still up in the air, so if there are aspects you liked or maybe didn't like, let me know!