Recreating Cyclical Learning Rates
In this short post, we'll cover Cyclical Learning Rates: how to implement them as a custom Keras Callback, and how models with and without them compare. Is faster AND better even possible?
- I Understanding the Idea
- II CLR as a Keras Callback
- III Recreating Results on CIFAR-10
- IV LR Range Test
- V Comparison
- VI Different Types of CLR
- VII Conclusion
- VIII References and Further Literature
When it comes to neural networks and learning rates, two approaches have traditionally dominated (the second more than the first): either select a certain learning rate and stick with it, or select a learning rate and monotonically decrease it during training.
Even though the second approach has proved beneficial, it can be difficult to choose an optimal starting $l_r$ and a schedule for its annealing.
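For context, such a monotonic decay setup in Keras might look like the following. This is only a minimal sketch; the schedule type and the values are illustrative and not taken from the post:

```python
import tensorflow as tf

# Decay the learning rate from 0.01, multiplying it by 0.96 every 1,000 steps
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```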
Cyclical learning rates (CLR) provide a solution: they fluctuate $l_r$ between two values at essentially no additional cost, and they offer a way to select those values systematically instead of guessing.
I Understanding the Idea
The author of the paper, Leslie Smith, introduces the idea that letting the learning rate fluctuate during training yields, at worst, near-optimal results compared to the previously mentioned approaches.
Cyclical Learning Rates are best understood with the following picture, taken directly from the paper:
The maximum and base learning rates are self-explanatory. The stepsize (or half-cycle) is the number of iterations needed for the learning rate (blue line) to climb from the base to the maximum value. Thus, one full cycle is when the learning rate goes from the base to the maximum value and back down to the base value.
Formally, cyclical learning rates are nothing more than the following three equations: $$ cycle = \left\lfloor 1 + \frac{iter}{2 \cdot stepsize} \right\rfloor \\ x = \left| \frac{iter}{stepsize} - 2 \cdot cycle + 1 \right| \\ l_r = l_r^{base} + (l_r^{max} - l_r^{base}) \cdot \max(0, 1 - x) $$
- `cycle` gives us the number of the current cycle (starting from 1)
- `x` fluctuates between 1 and 0 and is responsible for the cyclical part of the name Cyclical Learning Rates
- Finally, $l_r$ is the learning rate that we provide to the model in the end
Let's see the code example:
import numpy as np

def clr(base_lr, max_lr, n_iter, stepsize, history):
    # Current cycle number (starts from 1)
    cycle = np.floor(1 + n_iter / (2 * stepsize))
    # x fluctuates between 1 and 0 within each cycle
    x = np.abs(n_iter / stepsize - 2 * cycle + 1)
    # Interpolate between the base and maximum learning rate
    lr = base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)
    # Log every 100th iteration for plotting
    if n_iter % 100 == 0:
        history['iter'].append(n_iter)
        history['lr'].append(lr)
        history['x'].append(x)
    return lr
With the following piece of code, we can perform one full cycle and then plot our variables' change over time:
history = {'iter': [], 'lr': [], 'x': []}
total_iters = 2001

# One full cycle is 2 * stepsize iterations (plus one to land back on the base value)
for i in range(total_iters):
    clr(base_lr=0.01, max_lr=0.05, n_iter=i, stepsize=1000, history=history)
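The plotting code itself isn't part of the original post; a minimal sketch, assuming matplotlib, could look like this:

```python
import matplotlib.pyplot as plt

# Plot the recorded learning rate and x over iterations
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(history['iter'], history['lr'])
ax1.set_ylabel('learning rate')
ax2.plot(history['iter'], history['x'])
ax2.set_ylabel('x')
ax2.set_xlabel('iteration')
plt.show()
```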
II CLR as a Keras Callback
Since we'll compare actual neural networks' performance with and without CLR, and the Keras library doesn't ship with CLR, we'll need to create a new Keras Callback. At the end of the post, I linked the official tutorial on custom Keras Callbacks, but the main idea is that we can create different methods which will be invoked when:
- a new epoch or batch starts or ends, or
- training, evaluation or prediction starts or ends
When a method is invoked depends solely on its name.
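As a quick illustration (a minimal sketch, not part of the CLR implementation itself), a custom callback only needs to subclass keras.callbacks.Callback and override the hooks it cares about:

```python
from tensorflow import keras

class PrintLR(keras.callbacks.Callback):
    # Called automatically by Keras at the end of every epoch
    def on_epoch_end(self, epoch, logs=None):
        lr = keras.backend.get_value(self.model.optimizer.lr)
        print(f'epoch {epoch}: learning rate = {lr:.5f}')
```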
In our case, our `CyclicalLR` Callback class will have:
- `clr`, which is the main part of the callback and is used to set a new $l_r$ value
- `on_train_begin`, which will set the learning rate to the base value
- `on_batch_end`, which will increase the number of iterations and invoke the `clr` method
- `on_epoch_end`, which will add new info to the history dictionary
import numpy as np
from tensorflow import keras
from tensorflow.keras import backend as K

class CyclicalLR(keras.callbacks.Callback):
    def __init__(self, base_lr, max_lr, stepsize, mode='triangular', gamma=1):
        super(CyclicalLR, self).__init__()
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.stepsize = stepsize
        self.mode = mode
        self.gamma = gamma
        self.iterations = 0
        self.history = {'lr': [], 'iter': []}

    def clr(self):
        # The same three equations as before
        cycle = np.floor(1 + self.iterations / (2 * self.stepsize))
        x = np.abs(self.iterations / self.stepsize - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x)
        if self.mode == 'triangular':
            return lr
        elif self.mode == 'triangular2':
            # Halve the lr difference after every full cycle
            return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x) / (2 ** (cycle - 1))
        elif self.mode == 'exp_range':
            # Shrink the lr difference exponentially with the number of iterations
            return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, 1 - x) * self.gamma ** self.iterations

    def on_train_begin(self, logs=None):
        logs = logs or {}
        if self.iterations == 0:
            K.set_value(self.model.optimizer.lr, self.base_lr)
        else:
            K.set_value(self.model.optimizer.lr, self.clr())

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.iterations += 1
        K.set_value(self.model.optimizer.lr, self.clr())

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.history['lr'].append(float(K.get_value(self.model.optimizer.lr)))
        self.history['iter'].append(self.iterations)
You probably noticed there are several different modes of CLR: triangular, triangular2 and exp_range. We'll go into further detail about these in part VI.
III Recreating Results on CIFAR-10
CIFAR-10 is one of the standard datasets for testing new approaches and setting state-of-the-art results. It contains 60,000 low-resolution pictures of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), of which 10,000 are held out as a test set.
We'll use this dataset to compare models, but firstly let's see a few example pictures:
And here's our simple Keras model:
import tensorflow as tf

def build_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    # Logits for the 10 CIFAR-10 classes (softmax is applied inside the loss)
    model.add(tf.keras.layers.Dense(10))
    return model
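The rest of the post references train_images, train_labels, test_images and test_labels; a minimal sketch of how they can be obtained with Keras' built-in CIFAR-10 dataset follows (scaling the pixels to [0, 1] is my assumption, not something spelled out in the original):

```python
import tensorflow as tf

# Load the built-in CIFAR-10 dataset (50,000 train / 10,000 test images)
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

# Scale pixel values to the [0, 1] range (assumed preprocessing)
train_images = train_images / 255.0
test_images = test_images / 255.0
```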
IV LR Range Test
IV.I How do we set up base and max learning rate?
Before we begin training our model, we need to answer this question.
The LR range test is an approach proposed in the paper for setting these values. Luckily, the LR range test can be performed with our `CyclicalLR` class by setting `stepsize` equal to the total number of iterations. This way, $l_r$ goes only once from the base to the maximum value (we perform one half-cycle). Afterward, we plot accuracies versus learning rates and select the base and maximum learning rates from that plot.
The first step is to calculate the total number of iterations. The following equation is really helpful in doing this: $$ I_t = \frac{|X|}{n_{bs}} \cdot n_e $$
- $I_t$ is total number of iterations
- $|X|$ is number of images in the train set (50,000 in our case)
- $n_{bs}$ is batch size (in our case it'll be 100)
- $n_e$ is number of epochs (in our case it'll be 150)
This means we easily get: $$ I_t = \frac{50{,}000}{100} \cdot 150 = 75{,}000 $$
Now we initialize our callback as follows and then plot accuracies versus learning rates:
clr_triangular = CyclicalLR(base_lr=0.001, max_lr=0.02, stepsize=75000)
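The exact training and plotting code for the range test isn't in the original post; a hedged sketch could look like this, taking the accuracy per epoch from Keras' History object and matching it against the learning rate our callback records at each epoch end:

```python
import matplotlib.pyplot as plt

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# One half-cycle over the whole run: lr rises linearly from base_lr to max_lr
range_test = model.fit(train_images, train_labels, epochs=150, batch_size=100,
                       validation_data=(test_images, test_labels),
                       callbacks=[clr_triangular])

# Accuracy vs. learning rate, one point per epoch
plt.plot(clr_triangular.history['lr'], range_test.history['accuracy'])
plt.xlabel('learning rate')
plt.ylabel('training accuracy')
plt.show()
```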
We select the base learning rate as the point where the accuracy starts to increase sharply, and the maximum at the point where accuracy growth slows down or accuracy starts to fall. From our example above, we'll select 0.001 as the base learning rate and 0.005 as the maximum one.
IV.II How do we set up the stepsize parameter?
The paper proposes we set it to 2-8 times the number of iterations in one epoch. That's why we first need to calculate the number of iterations per epoch, with a formula similar to the one we saw before: $$ I = \frac{|X|}{n_{bs}} = \frac{50{,}000}{100} = 500 $$
In our case, the model will have 500 iterations per epoch.
Now, stepsize should be in the 1,000-4,000 range. The author notes that, when comparing 2*stepsize and 8*stepsize, the latter is "only slightly better".
V Comparison
First, let's train our baseline model without CLR for the full 150 epochs:

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=150, batch_size=100,
                    validation_data=(test_images, test_labels))
But let's see the same model trained for a third of the iterations, and with CLR:
clr_triangular = CyclicalLR(base_lr=0.001, max_lr=0.005, stepsize=2500)

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50, batch_size=100,
                    validation_data=(test_images, test_labels), callbacks=[clr_triangular])
The difference in accuracy may be only ~1.6%, but we did prove the main point: better results, faster!
VI Different Types of CLR
If you remember, our `CyclicalLR` callback had different modes, which we haven't covered yet. Now's the time!
First comes the default mode we've already seen: triangular. The learning rate simply fluctuates between the base and maximum learning rates throughout the whole training process:
Triangular2 mode picks up on the idea that the learning rate should decrease over time: the difference between the maximum and base learning rates is halved after every cycle:
And finally, exp_range mode builds on the same idea, but decays the learning rate difference exponentially, by a factor of $\gamma^{iterations}$:
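For reference, the other two modes only differ in how the callback is instantiated; the gamma value below is an illustrative choice, not one from the original post:

```python
# Triangular2: the lr difference is halved after every full cycle
clr_tri2 = CyclicalLR(base_lr=0.001, max_lr=0.005, stepsize=2500, mode='triangular2')

# exp_range: the lr difference shrinks by a factor of gamma ** iterations
clr_exp = CyclicalLR(base_lr=0.001, max_lr=0.005, stepsize=2500,
                     mode='exp_range', gamma=0.99994)

model = build_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=50, batch_size=100,
          validation_data=(test_images, test_labels), callbacks=[clr_tri2])
```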
VII Conclusion
Even though the idea is simple, Cyclical Learning Rates introduce both a way to find an optimal range of learning rates and a way to train neural networks to near-optimal performance more quickly.
This can be used to quickly prototype baseline models on new data and get the best possible results at that point.
Who knew faster and better was possible!
Thank you for reading up to here!
This was mainly made as a reminder and a practice for me, but if this helped you, feel free to share or comment to let me know your thoughts!
If you find any mistakes I made, notify me and I'll make the necessary changes and mention you and your help. Any and all suggestions and constructive criticism are always welcome. We're all here to learn!
VIII References and Further Literature
[1] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. 2015. U.S. Naval Research Laboratory.
[2] Tensorflow's Custom Keras Callbacks Tutorial
[3] Tensorflow's Tutorial on CNNs (showcases usage of built-in CIFAR-10 dataset)