Chapter 4: PyTorch for Automatic Gradient Descent

Automatic Gradient Descent with PyTorch Optimisers

Luca Grillotti

The torch.optim module

PyTorch provides many optimisers that automatically perform parameter updates based on the computed gradients. All these optimiser classes live in the torch.optim module.
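For instance, besides plain gradient descent, torch.optim also contains adaptive optimisers such as Adam and RMSprop, which are constructed in the same way. The snippet below is only an illustration of the pattern; the learning-rate values are arbitrary:

import torch

theta = torch.nn.Parameter(torch.Tensor([1]))

# Two other optimisers from torch.optim, built the same way as SGD
# (learning rates below are arbitrary, chosen for illustration only):
adam_optimiser = torch.optim.Adam(params=[theta], lr=0.001)
rmsprop_optimiser = torch.optim.RMSprop(params=[theta], lr=0.01)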

Here we would like to apply basic gradient descent. The corresponding class in the torch.optim module is SGD, where SGD stands for “Stochastic Gradient Descent” (if you don’t know where the “Stochastic” comes from, absolutely no worries ^^).

Each optimiser is initialised with a list of the parameters we would like to optimise. In our case, we would like to optimise the value of \theta, in order to minimise the loss function L:

import torch

theta = torch.nn.Parameter(torch.Tensor([1]))  # parameter to optimise

list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)
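In practice, the parameters usually come from a neural network rather than being listed by hand: every torch.nn.Module exposes a .parameters() method whose result can be passed directly to the optimiser. The model below is just a hypothetical example to illustrate the pattern:

import torch

# Hypothetical model: a single linear layer, used only to illustrate the pattern
model = torch.nn.Linear(in_features=3, out_features=1)

optimiser = torch.optim.SGD(params=model.parameters(), lr=0.2)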

Useful optimiser methods

Each optimiser provides two methods that are used almost every time we optimise parameters:

  • zero_grad()
  • step()

zero_grad()

Try to guess the behaviour of the following code:

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# ---------------------------
loss = theta * theta
loss.backward()
print(f"After 1st loss.backward(): theta.grad={theta.grad}")
# ---------------------------
loss = theta * theta  # necessary to recompute the loss before using .backward() method
loss.backward()
print(f"After 2nd loss.backward(): theta.grad={theta.grad}")
# ---------------------------
loss = theta * theta  # necessary to recompute the loss before using .backward() method
loss.backward()
print(f"After 3rd loss.backward(): theta.grad={theta.grad}")

You’ll get the following output:

After 1st loss.backward(): theta.grad=tensor([2.])
After 2nd loss.backward(): theta.grad=tensor([4.])
After 3rd loss.backward(): theta.grad=tensor([6.])

What happened?! loss.backward() computes the gradients and ADDS them to the .grad attribute of each parameter: gradients accumulate across successive backward() calls.
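As a side note, continuing from the example above, the accumulated gradient could also be cleared directly on the tensor itself; the two lines below are just a sketch of the usual ways of doing so:

theta.grad.zero_()   # in-place reset of the accumulated gradient to 0
theta.grad = None    # drop the gradient entirely; it is recreated at the next backward()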

How do we solve this issue in practice? We define a torch optimiser on those parameters and call its .zero_grad() method, which resets the gradients of all the parameters it manages. (Note: depending on your PyTorch version, zero_grad() may set .grad to None instead of a zero tensor; recent versions default to set_to_none=True, and you can call optimiser.zero_grad(set_to_none=False) to get an actual zero tensor as in the output below.)

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# Defining optimiser on theta:
list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)

# ---------------------------

optimiser.zero_grad()
loss = theta * theta
loss.backward()
print(f"After 1st loss.backward(): theta.grad={theta.grad}")

# ---------------------------
optimiser.zero_grad()
print(f"After optimiser.zero_grad(): theta.grad={theta.grad}")
loss = theta * theta  # necessary to recompute the loss before using .backward() method
loss.backward()
print(f"After 2nd loss.backward(): theta.grad={theta.grad}")

And now we get the expected results \o/

After 1st loss.backward(): theta.grad=tensor([2.])
After optimiser.zero_grad(): theta.grad=tensor([0.])
After 2nd loss.backward(): theta.grad=tensor([2.])

step()

The step() method simply performs one optimisation step. In the case of SGD, it performs the update on all parameters: \theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}

If you want to verify this yourself, you can simply add an optimiser.step() after the loss.backward() call:

import torch

tensor_0 = torch.Tensor([1])
theta = torch.nn.Parameter(tensor_0)

# Defining optimiser on theta:
list_parameters = [theta]
learning_rate = 0.2

optimiser = torch.optim.SGD(params=list_parameters, lr=learning_rate)

# ---------------------------

optimiser.zero_grad()
loss = theta * theta
loss.backward()
print(f"Parameters BEFORE optimisation step: {theta}")
optimiser.step()
print(f"Parameters AFTER optimisation step: {theta}")

This prints:

Parameters BEFORE optimisation step: Parameter containing:
tensor([1.], requires_grad=True)
Parameters AFTER optimisation step: Parameter containing:
tensor([0.6000], requires_grad=True)

We obtain the same value when manually computing \theta - \lambda \dfrac{\partial L}{\partial \theta} with:

  • \theta = 1
  • \lambda = 0.2
  • \dfrac{\partial L}{\partial \theta} = 2\theta = 2 (since L(\theta) = \theta^2),

which gives 1 - 0.2 \times 2 = 0.6, the value printed above.
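
Putting everything together, a full gradient-descent loop simply repeats zero_grad(), the loss computation, backward() and step(). The sketch below (with an arbitrary number of iterations) minimises the same loss L(\theta) = \theta^2; since each SGD step multiplies \theta by 1 - 2\lambda = 0.6, the parameter shrinks towards 0:

import torch

theta = torch.nn.Parameter(torch.Tensor([1]))
optimiser = torch.optim.SGD(params=[theta], lr=0.2)

for iteration in range(10):      # 10 iterations chosen arbitrarily for illustration
    optimiser.zero_grad()        # reset accumulated gradients
    loss = theta * theta         # recompute the loss L(theta) = theta^2
    loss.backward()              # compute dL/dtheta and store it in theta.grad
    optimiser.step()             # theta <- theta - lr * theta.grad
    print(f"iteration {iteration}: theta={theta.data}, loss={loss.item()}")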