When it comes to neural networks and learning rates, two approaches have traditionally dominated (the second more than the first): either pick a learning rate and stick with it, or pick a learning rate and monotonically decrease it during training.

Even though the second approach has proven beneficial, it can be difficult to choose an optimal starting $l_r$ and a schedule for its annealing.

Cyclical learning rates (CLR) offer a solution: they fluctuate $l_r$ between two values at essentially no additional cost and provide a way to select those values systematically instead of guessing.

I Understanding the Idea

The author of the paper, Leslie Smith, introduces the idea that letting the learning rate fluctuate during training yields, at worst, near-optimal results compared to the previously mentioned approaches.

Cyclical Learning Rates are best understood with the following picture, taken directly from the paper:

[clr_paper_pic.PNG: the triangular learning rate policy from the paper]

The maximum and base learning rates are self-explanatory. The stepsize (or half-cycle) is the number of iterations needed for the learning rate (the blue line) to go from the base to the maximum value. Thus, one full cycle is when our learning rate goes from the base to the maximum value and then back down to the base value.

Formally, cyclical learning rates are nothing more than the following three equations: $$ cycle = \left\lfloor 1 + \frac{iter}{2 \cdot stepsize} \right\rfloor \\ x = \left| \frac{iter}{stepsize} - 2 \cdot cycle + 1 \right| \\ l_r = l_r^{base} + (l_r^{max} - l_r^{base}) \cdot \max(0, 1-x) $$

  1. The cycle variable gives us the number of the current cycle (starting from 1)
  2. The x variable fluctuates between 0 and 1 and is responsible for the cyclical part of the name Cyclical Learning Rates
  3. Finally, $l_r$ is the learning rate we actually pass to the model

Let's see the code example:

import numpy as np

def clr(base_lr, max_lr, n_iter, stepsize, history):
  # The three CLR equations: current cycle, triangular position x, and the learning rate
  cycle = np.floor(1 + n_iter / (2 * stepsize))
  x = np.abs(n_iter / stepsize - 2 * cycle + 1)
  lr = base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)

  # Record every 100th iteration so the plot stays light
  if n_iter % 100 == 0:
    history['iter'].append(n_iter)
    history['lr'].append(lr)
    history['x'].append(x)

  return lr

With the following piece of code, we can perform one full cycle and then plot how our variables change over time:

history = {'iter' : [], 'lr' : [], 'x' : []}
total_iters = 2001  # one full cycle with stepsize=1000 takes 2*stepsize = 2000 iterations

for i in range(total_iters):
  clr(base_lr=0.01, max_lr=0.05, n_iter=i, stepsize=1000, history=history)
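
To visualize the cycle, a minimal plotting sketch with matplotlib (not part of the original snippet) could look like this, using the history dictionary we just filled:

import matplotlib.pyplot as plt

# Learning rate over iterations: it rises from the base to the max value and back down
plt.plot(history['iter'], history['lr'])
plt.xlabel('iteration')
plt.ylabel('learning rate')
plt.show()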

II CLR as a Keras Callback

Since we'll compare the performance of actual neural networks with and without CLR, and the Keras library doesn't ship with a CLR schedule, we'll need to create a custom Keras Callback. At the end of the post, I've linked the official tutorial on custom Keras Callbacks, but the main idea is that we can define different methods which are invoked when:

  • a new epoch or batch starts or ends, or
  • training, evaluation or prediction starts or ends

Which method is invoked depends solely on the method's name.

In our case, our CyclicalLR Callback class will have:

  • clr, the main part of the callback, used to compute the new $l_r$ value
  • on_train_begin, which sets the learning rate to the base value
  • on_batch_end, which increments the iteration counter and invokes the clr method
  • on_epoch_end, which appends new info to the history dictionary

import numpy as np
from tensorflow import keras
from tensorflow.keras import backend as K

class CyclicalLR(keras.callbacks.Callback):

  def __init__(self, base_lr, max_lr, stepsize, mode='triangular', gamma=1):
    super(CyclicalLR, self).__init__()
    self.base_lr = base_lr
    self.max_lr = max_lr
    self.stepsize = stepsize
    self.mode = mode
    self.gamma = gamma
    self.iterations = 0
    self.history = {'lr' : [], 'iter' : []}

  def clr(self):
    # The same three equations as before, applied to the current iteration count
    cycle = np.floor(1 + self.iterations / (2 * self.stepsize))
    x = np.abs(self.iterations / self.stepsize - 2 * cycle + 1)
    lr = self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x)

    if self.mode == 'triangular':
      return lr
    elif self.mode == 'triangular2':
      # Halve the learning rate difference after every cycle
      return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x) / (2 ** (cycle - 1))
    elif self.mode == 'exp_range':
      # Shrink the learning rate difference exponentially with the number of iterations
      return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x) * self.gamma ** self.iterations

  def on_train_begin(self, logs=None):
    logs = logs or {}

    if self.iterations == 0:
      K.set_value(self.model.optimizer.lr, self.base_lr)
    else:
      K.set_value(self.model.optimizer.lr, self.clr())

  def on_batch_end(self, batch, logs=None):
    logs = logs or {}

    self.iterations += 1
    K.set_value(self.model.optimizer.lr, self.clr())

  def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}

    self.history['lr'].append(float(K.get_value(self.model.optimizer.lr)))
    self.history['iter'].append(self.iterations)

You probably noticed there are several different modes of CLR: triangular, triangular2 and exp_range. We'll go into further detail about these in part VI.

III Recreating Results on CIFAR-10

CIFAR-10 is one of the standard datasets for testing new approaches and setting state-of-the-art results. It contains 60,000 low-resolution pictures across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), of which 10,000 are held out as a test set.
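
If you want to follow along, here's a minimal sketch of how the data can be loaded and scaled with tf.keras; the variable names train_images, train_labels, test_images and test_labels match the ones used in the rest of the post:

import tensorflow as tf

# Load CIFAR-10: 50,000 training and 10,000 test images of shape 32x32x3
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

# Scale pixel values to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0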

We'll use this dataset to compare models, but first let's see a few example pictures:

And here's our simple Keras model:

def build_model():
  model = tf.keras.models.Sequential()
  model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
  model.add(tf.keras.layers.MaxPooling2D((2, 2)))
  model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(tf.keras.layers.MaxPooling2D((2, 2)))
  model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(tf.keras.layers.Flatten())
  model.add(tf.keras.layers.Dense(64, activation='relu'))
  model.add(tf.keras.layers.Dense(10))
  return model

IV LR Range Test

IV.I How do we set up base and max learning rate?

Before we begin training our model, we need to answer this question.

The LR range test is an approach proposed in the paper for setting these values. Luckily, the LR range test can be performed with our CyclicalLR class by setting stepsize equal to the total number of iterations. This way, $l_r$ goes only once from the base to the maximum value (we perform one half-cycle). Afterward, we plot accuracy versus learning rate and select the base and maximum learning rates from that plot.

The first step is to calculate the total number of iterations. The following equation helps with this: $$ I_t = \frac{|X|}{n_{bs}} \cdot n_e $$

  • $I_t$ is total number of iterations
  • $|X|$ is number of images in the train set (50,000 in our case)
  • $n_{bs}$ is batch size (in our case it'll be 100)
  • $n_e$ is number of epochs (in our case it'll be 150)

Plugging in our numbers: $$ I_t = \frac{50{,}000}{100} \cdot 150 = 75{,}000 $$

Now we initialize our callback as follows and then plot accuracies versus learning rates:

clr_triangular = CyclicalLR(base_lr=0.001, max_lr=0.02, stepsize=75000)
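
To actually run the test, a sketch like the following could work; the name range_test is made up for this example, and it pairs the per-epoch learning rates recorded by the callback with the per-epoch accuracy that Keras records in the History object returned by fit:

import matplotlib.pyplot as plt

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# One half-cycle over all 75,000 iterations (150 epochs * 500 iterations per epoch)
range_test = model.fit(train_images, train_labels, epochs=150, batch_size=100,
                       validation_data=(test_images, test_labels),
                       callbacks=[clr_triangular])

# Both lists have one entry per epoch, so they can be plotted against each other
plt.plot(clr_triangular.history['lr'], range_test.history['accuracy'])
plt.xlabel('learning rate')
plt.ylabel('accuracy')
plt.show()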

We select the base learning rate as the point where the accuracy starts to increase sharply, and the maximum learning rate as the point where the increase slows down or the accuracy starts to fall. From the plot above, we'll select 0.001 as the base learning rate and 0.005 as the maximum.

IV.II How do we set up the stepsize parameter?

The paper proposes we set it to 2-8 times the number of iterations in one epoch. That's why we first need to calculate the number of iterations per epoch with a formula similar to the one we saw before: $$ I = \frac{|X|}{n_{bs}} = \frac{50{,}000}{100} = 500 $$

In our case, the model will have 500 iterations per epoch.

This means stepsize should be in the 1,000-4,000 range. The author says that when comparing a stepsize of 2 times versus 8 times the iterations per epoch, the latter is "only slightly better".

V Comparison

We'll first run the model without CLR for a total of 75,000 iterations:

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=150, batch_size=100, 
                    validation_data=(test_images, test_labels))
Without CLR
	Loss: 4.367868900299072
	Accuracy: 0.6783000230789185

But let's see the same model trained with CLR for a third of the iterations:

Note: We now train the model for a third of the iterations compared to before, since we want to prove the claim that CLR gives better results faster.

clr_triangular = CyclicalLR(base_lr=0.001, max_lr=0.005, stepsize=2500)

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50, batch_size=100, 
                    validation_data=(test_images, test_labels), callbacks=[clr_triangular])
With CLR
	Loss: 2.375159740447998
	Accuracy: 0.6941999793052673

The difference in accuracy may be only ~1.6%, but we did prove the main point - better results faster!
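
If you'd like to compare the two runs visually, here's a quick sketch; it assumes the two History objects returned by fit were saved under different names (history_plain and history_clr are hypothetical names for this example):

import matplotlib.pyplot as plt

# Validation accuracy per epoch for both runs (hypothetical names history_plain / history_clr)
plt.plot(history_plain.history['val_accuracy'], label='without CLR')
plt.plot(history_clr.history['val_accuracy'], label='with CLR (a third of the iterations)')
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.legend()
plt.show()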

VI Different Types of CLR

If you remember, our CyclicalLR callback had different modes, but we didn't cover them. Now's the time!

First comes the default mode we've already seen - triangular mode. The learning rate just fluctuates between the base and maximum learning rate during the whole process of training:

Triangular2 mode picks up on the idea that learning rate should decrease over time - learning rate difference is halved after every cycle:

And finally, exp_range mode builds on the same idea as the previous mode, but decreases the learning rate difference exponentially, scaling it by a factor of $\gamma^{iterations}$:
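
To see the three policies side by side, here's a small sketch that reuses the equations from part I to compute and plot each schedule over a few cycles; the helper name clr_schedule is made up for this example:

import numpy as np
import matplotlib.pyplot as plt

def clr_schedule(iterations, base_lr, max_lr, stepsize, mode='triangular', gamma=0.9999):
  # Vectorized version of the three CLR equations, plus the per-mode scaling
  cycle = np.floor(1 + iterations / (2 * stepsize))
  x = np.abs(iterations / stepsize - 2 * cycle + 1)
  scale = np.maximum(0, 1 - x)
  if mode == 'triangular2':
    scale = scale / (2 ** (cycle - 1))
  elif mode == 'exp_range':
    scale = scale * gamma ** iterations
  return base_lr + (max_lr - base_lr) * scale

iters = np.arange(0, 8000)
for mode in ['triangular', 'triangular2', 'exp_range']:
  plt.plot(iters, clr_schedule(iters, 0.001, 0.005, 1000, mode), label=mode)
plt.xlabel('iteration')
plt.ylabel('learning rate')
plt.legend()
plt.show()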

VII Conclusion

Even though it is a simple idea, Cyclical Learning Rates introduce both a way to find an optimal range of learning rates and a way to train neural networks to near-optimal performance more quickly.

This can be used to quickly prototype baseline models on new data and get the best possible results at that point.

Who knew fast and better was possible!


Thank you for reading up to here!

This was mainly made as a reminder and a practice for me, but if this helped you, feel free to share or comment to let me know your thoughts!

If you find any mistakes I made, notify me and I'll make the necessary changes and mention you and your help. Any and all suggestions and constructive criticism are always welcome. We're all here to learn!