Chapter 4: PyTorch for Automatic Gradient Descent

Automatic Gradient Descent with PyTorch

Luca Grillotti

Recap:

So far, we have seen how to use the backward() method to compute gradients automatically. For instance, the code below evaluates the gradient of the loss function L(\theta) = \theta^2 at the value \theta_0 = 1.

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)  # parameter initialised at theta_0 = 1

loss = theta * theta  # L(theta) = theta^2
print(theta.grad)  # None: no gradient has been computed yet
loss.backward()  # compute dL/dtheta
print(theta.grad)
None
tensor([2.])
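The first print displays None because no gradient has been computed yet; after loss.backward(), theta.grad contains \dfrac{\partial L}{\partial \theta} = 2\theta, which equals 2 at \theta_0 = 1, matching tensor([2.]).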

What we would like to have: automatic gradient descent!

However, if we print \theta, we can see that its value has not changed since its initialisation:

print(theta)
Parameter containing:
tensor([1.], requires_grad=True)

Now that we have \dfrac{\partial L}{\partial \theta}(\theta), we can update \theta to minimise the loss, using gradient descent:

\theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}
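Before delegating this update to PyTorch, here is a minimal sketch of the update done by hand, assuming a learning rate \lambda = 0.2 (the torch.no_grad() context prevents PyTorch from tracking the update operation itself):

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

loss = theta * theta
loss.backward()  # theta.grad now contains dL/dtheta = 2

learning_rate = 0.2  # the lambda in the update rule above
with torch.no_grad():
    theta -= learning_rate * theta.grad  # theta <- theta - lambda * dL/dtheta

print(theta)
Parameter containing:
tensor([0.6000], requires_grad=True)

Doing this by hand quickly becomes tedious (and error-prone) once there are many parameters, which is exactly what the torch.optim module is for.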

The torch.optim module

PyTorch provides plenty of optimisers that automatically perform parameter updates based on the computed gradients. All these classes live in the torch.optim module.

Here we would like to apply basic gradient descent. The corresponding class in the torch.optim module is SGD, where SGD stands for “Stochastic Gradient Descent”. If you don’t know where the “Stochastic” comes from, absolutely no worries ^^ (briefly, it refers to estimating the gradient from randomly sampled mini-batches of data; with our fixed loss here, the update is plain gradient descent).

Each optimiser is initialised with a list of the parameters we would like to optimise. In our case, we would like to optimise the value of \theta, in order to minimise the loss function L:

list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)
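All optimisers in torch.optim share this interface; for instance, swapping basic gradient descent for the Adam optimiser (also in torch.optim) only changes the constructor call. A sketch, reusing the same parameter list and learning rate purely for illustration:

optimiser_adam = torch.optim.Adam(params=list_parameters, lr=learning_rate)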

Useful methods of optimisers

Each optimiser provides two methods that are used almost every time we optimise parameters:

  • zero_grad()
  • step()

zero_grad()

Try to guess the behaviour of the following code:

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# ---------------------------
loss = theta * theta
loss.backward()
print(f"After 1st loss.backward(): theta.grad={theta.grad}")
# ---------------------------
loss = theta * theta  # the loss must be recomputed: its computation graph is freed after each backward() call
loss.backward()
print(f"After 2nd loss.backward(): theta.grad={theta.grad}")
# ---------------------------
loss = theta * theta  # the loss must be recomputed: its computation graph is freed after each backward() call
loss.backward()
print(f"After 3rd loss.backward(): theta.grad={theta.grad}")
After 1st loss.backward(): theta.grad=tensor([2.])
After 2nd loss.backward(): theta.grad=tensor([4.])
After 3rd loss.backward(): theta.grad=tensor([6.])

What happened?! loss.backward() computes the gradients and ADDS them to the .grad attributes of the parameters: gradients accumulate across successive backward() calls, which is why the value grows by 2 each time.

How to solve this issue? We simply need to define a torch optimiser on those parameters, and use its .zero_grad() method to reset the parameters' gradients to zero.

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# Defining optimiser on theta:
list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)

# ---------------------------

optimiser.zero_grad()
loss = theta * theta
loss.backward()
print(f"After 1st loss.backward(): theta.grad={theta.grad}")

# ---------------------------
optimiser.zero_grad()
print(f"After optimiser.zero_grad(): theta.grad={theta.grad}")
loss = theta * theta  # the loss must be recomputed: its computation graph is freed after each backward() call
loss.backward()
print(f"After 2nd loss.backward(): theta.grad={theta.grad}")

And now we get the expected results \o/

After 1st loss.backward(): theta.grad=tensor([2.])
After optimiser.zero_grad(): theta.grad=tensor([0.])
After 2nd loss.backward(): theta.grad=tensor([2.])
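Note: in recent versions of PyTorch, zero_grad() sets the gradients to None by default instead of filling them with zeros (its set_to_none argument defaults to True), so you may see theta.grad=None after optimiser.zero_grad(); calling optimiser.zero_grad(set_to_none=False) reproduces the tensor([0.]) output shown above.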

step()

The step() method simply performs one optimisation step. In the case of SGD, it performs the update on all parameters: \theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}

If you want to verify this yourself, you can simply add an optimiser.step() after the loss.backward():

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# Defining optimiser on theta:
list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)

# ---------------------------

optimiser.zero_grad()
loss = theta * theta
loss.backward()
print(f"Parameters BEFORE optimisation step: {theta}")
optimiser.step()
print(f"Parameters AFTER optimisation step: {theta}")
Parameters BEFORE optimisation step: Parameter containing:
tensor([1.], requires_grad=True)
Parameters AFTER optimisation step: Parameter containing:
tensor([0.6000], requires_grad=True)

We obtain the same value when calculating \theta - \lambda \dfrac{\partial L}{\partial \theta} with:

  • \theta = 1
  • \lambda = 0.2
  • \dfrac{\partial L}{\partial \theta} = 2 (knowing that L(\theta) = \theta^2, we have \dfrac{\partial L}{\partial \theta} = 2\theta = 2 at \theta = 1).

Indeed, 1 - 0.2 \times 2 = 0.6.
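Putting everything together, here is a sketch of a complete gradient-descent loop chaining zero_grad(), backward() and step(); the number of iterations (10) is an arbitrary choice for illustration:

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

optimiser = torch.optim.SGD(params=[theta], lr=0.2)

for iteration in range(10):
    optimiser.zero_grad()  # reset gradients from the previous iteration
    loss = theta * theta  # recompute the loss L(theta) = theta^2
    loss.backward()  # compute dL/dtheta and store it in theta.grad
    optimiser.step()  # theta <- theta - lambda * dL/dtheta
    print(f"iteration {iteration}: theta={theta.item():.4f}")

At each iteration, \theta moves closer to 0, the minimiser of L(\theta) = \theta^2.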