Introduction to Deep Learning with PyTorch > Exercise: Let's Implement Gradient Descent! | Python Programming | Department of Computing

Introduction to Deep Learning with PyTorch

Chapter 2: Gradient Descent

Exercise: Let's Implement Gradient Descent!

face Luca Grillotti

Gradient Descent

Remember the single-step Gradient Descent formula you just saw ?

$\theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}$

where:

$\theta$ are the inputs of the loss function $L$ .
$\lambda$ represents the learning rate of the gradient descent.
$\dfrac{\partial L}{\partial \theta}$ represents the gradient of $L$ with respect to its parameters $\theta$ .

Exercise: Implement gradient descent!

Single-step gradient descent

Let’s first try to implement this formula!

As a loss, we consider the square function $L(\theta) = \theta^2$ where $\theta$ is a 1-dimensional variable.

Try to implement a function that implements the update rule given above:

def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent. 
    """

def get_gradient_square_function(theta):
    return 2 * theta


def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent.
    """
    return theta - learning_rate * get_gradient_square_function(theta)

Gradient descent

Once you’ve finished implementing single_step_gradient_descent, you can implement a full Gradient Descent:

def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descent to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """

gradient_descent simply iteratively performs several 1-step gradient descent.

You can test your function gradient_descent with:

a learning rate of 0.2
20 steps
an initial theta of 1.

Do not hesitate to print the value of the updated $\theta$ after each iteration ;)

Once you’ve finished, try running your gradient_descent function with very low (~0.001) and very high (~10) learning rates. What do you notice?

def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descent to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """
    theta = initial_theta
    for _ in range(number_steps):
        theta = single_step_gradient_descent(theta, learning_rate)
        print(theta)
    return theta

theta_star = gradient_descent(
    initial_theta=1.,
    learning_rate=0.2,
    number_steps=20
)
print(f"theta after optimisation: {theta_star}")

A low learning rate reduces the speed of convergence; whereas with a high learning rate, there are greater risks of divergence.