Chapter 2: Gradient Descent

Exercise: Let's Implement Gradient Descent!

face Luca Grillotti

Gradient Descent

Remember the single-step Gradient Descent formula you just saw ?

\theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}

where:

  • \theta are the inputs of the loss function L.
  • \lambda represents the learning rate of the gradient descent.
  • \dfrac{\partial L}{\partial \theta} represents the gradient of L with respect to its parameters \theta.

Exercise: Implement gradient descent!

Single-step gradient descent

Let’s first try to implement this formula!

As a loss, we consider the square function L(\theta) = \theta^2 where \theta is a 1-dimensional variable.

Try to implement a function that implements the update rule given above:

def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent. 
    """

def get_gradient_square_function(theta):
    return 2 * theta


def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent.
    """
    return theta - learning_rate * get_gradient_square_function(theta)

Gradient descent

Once you’ve finished implementing single_step_gradient_descent, you can implement a full Gradient Descent:

def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descent to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """

gradient_descent simply iteratively performs several 1-step gradient descent.

You can test your function gradient_descent with:

  • a learning rate of 0.2
  • 20 steps
  • an initial theta of 1.

Do not hesitate to print the value of the updated \theta after each iteration ;)

Once you’ve finished, try running your gradient_descent function with very low (~0.001) and very high (~10) learning rates. What do you notice?

def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descent to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """
    theta = initial_theta
    for _ in range(number_steps):
        theta = single_step_gradient_descent(theta, learning_rate)
        print(theta)
    return theta

theta_star = gradient_descent(
    initial_theta=1.,
    learning_rate=0.2,
    number_steps=20
)
print(f"theta after optimisation: {theta_star}")

A low learning rate reduces the speed of convergence; whereas with a high learning rate, there are greater risks of divergence.