Introduction to Deep Learning with PyTorch
Chapter 2: Gradient Descent
Exercise: Let's Implement Gradient Descent!
Gradient Descent
Remember the single-step Gradient Descent formula you just saw?
\theta \leftarrow \theta - \lambda \dfrac{\partial L}{\partial \theta}
where:
- \theta denotes the parameters of the loss function L.
- \lambda represents the learning rate of the gradient descent.
- \dfrac{\partial L}{\partial \theta} represents the gradient of L with respect to its parameters \theta.
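For concreteness, take L(\theta) = \theta^2 (so \dfrac{\partial L}{\partial \theta} = 2\theta) with \theta = 1 and \lambda = 0.2; a single update then gives:

\theta \leftarrow 1 - 0.2 \cdot 2 \cdot 1 = 0.6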
Exercise: Implement gradient descent!
Single-step gradient descent
Let’s first try to implement this formula!
As a loss, we consider the square function L(\theta) = \theta^2 where \theta is a 1-dimensional variable.
Try to implement a function that implements the update rule given above:
def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent.
    """
def get_gradient_square_function(theta):
    # Gradient of L(theta) = theta**2 with respect to theta.
    return 2 * theta


def single_step_gradient_descent(theta, learning_rate):
    """
    Args:
        theta (float)
        learning_rate (float)

    Returns:
        new_theta (float): updated value of theta after 1 step of gradient descent.
    """
    return theta - learning_rate * get_gradient_square_function(theta)
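As a quick sanity check of this solution, two manual updates from \theta = 1 with \lambda = 0.2 shrink \theta step by step (the snippet restates the two functions above so it runs on its own):

```python
def get_gradient_square_function(theta):
    # Gradient of L(theta) = theta**2 with respect to theta.
    return 2 * theta

def single_step_gradient_descent(theta, learning_rate):
    # Update rule: theta <- theta - lr * dL/dtheta
    return theta - learning_rate * get_gradient_square_function(theta)

theta = 1.0
theta = single_step_gradient_descent(theta, 0.2)  # 1.0 -> 0.6
theta = single_step_gradient_descent(theta, 0.2)  # 0.6 -> ~0.36
print(theta)
```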
Gradient descent
Once you’ve finished implementing single_step_gradient_descent, you can implement a full Gradient Descent:
def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descents to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """
gradient_descent simply performs several 1-step gradient descents in a row.
You can test your gradient_descent function with:
- a learning rate of 0.2
- 20 steps
- an initial theta of 1.
Do not hesitate to print the value of the updated \theta after each iteration ;)
Once you’ve finished, try running your gradient_descent function with very low (~0.001) and very high (~10) learning rates.
What do you notice?
def gradient_descent(initial_theta, learning_rate, number_steps):
    """
    Args:
        initial_theta (float): Initial value of theta
        learning_rate (float)
        number_steps (int): number of 1-step gradient descents to perform.

    Returns:
        final_theta (float): Final value of theta after multiple 1-step gradient descents
    """
    theta = initial_theta
    for _ in range(number_steps):
        theta = single_step_gradient_descent(theta, learning_rate)
        print(theta)
    return theta
theta_star = gradient_descent(
    initial_theta=1.,
    learning_rate=0.2,
    number_steps=20,
)
print(f"theta after optimisation: {theta_star}")
A low learning rate slows convergence down, whereas a high learning rate increases the risk of divergence.
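For this particular loss, the effect can be made precise: each update replaces \theta with \theta - \lambda \cdot 2\theta = (1 - 2\lambda)\theta, so the iterates shrink towards 0 only when |1 - 2\lambda| < 1, i.e. 0 < \lambda < 1. A minimal sketch (folding the update rule into one function so it runs on its own, and omitting the per-iteration prints):

```python
def gradient_descent(initial_theta, learning_rate, number_steps):
    # Same update rule as above: theta <- theta - lr * 2 * theta
    theta = initial_theta
    for _ in range(number_steps):
        theta = theta - learning_rate * 2 * theta
    return theta

# Each step multiplies theta by (1 - 2 * lr):
print(gradient_descent(1.0, 0.2, 20))    # (1 - 0.4)**20, ~3.7e-5: fast convergence
print(gradient_descent(1.0, 0.001, 20))  # (1 - 0.002)**20, ~0.96: barely moves
print(gradient_descent(1.0, 10.0, 20))   # (-19)**20, ~3.8e25: divergence
```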