Why Kaggle?

I started using Kaggle seriously a couple of months ago, when I joined the SIIM-ISIC Melanoma Classification competition.

The initial reason, I think, was that I wanted a serious way to test my Machine Learning (ML) and Deep Learning (DL) skills. At the time, I was studying for the Coursera AI4Medicine Specialization and I was intrigued (and still am) by what can be achieved by applying DL to Medicine. I was also reading the beautiful book by Eric Topol, Deep Medicine, which is full of interesting ideas on what could be done.

I had opened my Kaggle account several years ago but had not yet done anything serious with it. Then I discovered the Melanoma challenge, and it seemed a really good way to start working on a difficult task, with real data.

So I started working on the competition and was caught up in the game. It turned out to be harder than I had thought.

A summary of what I have learned.

The first thing I had to master was how to read many images efficiently, without keeping the GPU (or TPU) waiting.

In the beginning, I was trying to train one of my first models on a 2-GPU machine, and training seemed far too slow. GPU utilization was really low, about 20%. Why?

Because I was using the Keras ImageDataGenerator, reading images from directories. Reading several discussions on Kaggle (yes, this is an important suggestion: read the discussions) I discovered that a far more efficient way is to pack the images (possibly pre-processed and resized) into files in TFRecord format. This way I was able to bring GPU utilization into the high 90s.

Yes, I know that things are going to improve with the preprocessing and data loading capabilities coming with TF 2.3 (see: image_dataset_from_directory), but if you need to do massive image (or data) pre-processing then you should still consider packaging the results in TFRecord format.
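To give a concrete idea, here is a minimal sketch of the approach: pack the images into a TFRecord file once, then read them back with a tf.data pipeline. The helper names (write_tfrecord, make_dataset) and the parameter values are illustrative, not the exact code from my notebooks.

```python
import tensorflow as tf

AUTO = tf.data.experimental.AUTOTUNE

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_tfrecord(filename, image_paths, labels, img_size=512):
    """Pack (resized, pre-processed) images and labels into one TFRecord file."""
    with tf.io.TFRecordWriter(filename) as writer:
        for path, label in zip(image_paths, labels):
            # assumes JPEG input images
            img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
            img = tf.image.resize(img, [img_size, img_size])
            img_bytes = tf.io.encode_jpeg(tf.cast(img, tf.uint8)).numpy()
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': _bytes_feature(img_bytes),
                'label': _int64_feature(int(label)),
            }))
            writer.write(example.SerializeToString())

def parse_example(serialized):
    features = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    img = tf.image.decode_jpeg(parsed['image'], channels=3)
    img = tf.cast(img, tf.float32) / 255.0
    return img, parsed['label']

def make_dataset(tfrecord_files, batch_size=32):
    """Build an input pipeline that keeps the GPU/TPU fed."""
    ds = tf.data.TFRecordDataset(tfrecord_files, num_parallel_reads=AUTO)
    ds = ds.map(parse_example, num_parallel_calls=AUTO)
    ds = ds.shuffle(2048).batch(batch_size).prefetch(AUTO)
    return ds
```

The combination of parallel reads, parallel map and prefetch is what keeps the accelerator busy instead of waiting for I/O.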

The second important thing is to use a modern pre-trained ConvNet.

Again, reading Kaggle discussions, I discovered the EfficientNet family. These are convolutional networks, pre-trained on ImageNet, proposed by Google researchers in 2019. They are very efficient: you can achieve high accuracy with less computational power than older CNNs require. The increase in accuracy you can get simply by using an EfficientNet as the convolutional base (feature extractor) is surprising.
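As a sketch of what this looks like in practice: the pre-trained network becomes the convolutional base, with a small classification head on top. I'm assuming TF 2.3+ here, where EfficientNet is available under tf.keras.applications; with earlier versions the standalone efficientnet package works the same way. The build_model name and the hyper-parameters are only illustrative.

```python
import tensorflow as tf

def build_model(img_size=512, n_classes=5):
    # Pre-trained convolutional base (feature extractor).
    # Note: tf.keras EfficientNet includes input rescaling, so it expects
    # pixel values in [0, 255].
    base = tf.keras.applications.EfficientNetB4(
        include_top=False,
        weights='imagenet',
        input_shape=(img_size, img_size, 3),
        pooling='avg',
    )
    # Small classification head on top of the extracted features
    inputs = tf.keras.Input(shape=(img_size, img_size, 3))
    x = base(inputs)
    outputs = tf.keras.layers.Dense(n_classes, activation='softmax')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    return model
```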

The third thing is to develop a robust cross-validation approach. As always, you want the training set to be as large as possible. But, at the same time, you need a large enough validation set to get a fair idea of the performance (accuracy, AUC, ...) the trained model will have on unseen data. If the validation set is too small, your score estimates depend heavily on how you split between train and validation. The only viable way is to adopt a robust cross-validation (CV) scheme. For example, I often use CV with 5 folds: the training set is divided into 5 folds and I repeat the training (for the full number of epochs) five times, each time keeping one-fifth of the training set for validation. For each fold I compute a final accuracy (or whatever metric you choose), and the best estimate (in the validation phase) is the average across folds. If the train and test sets have the same distribution, the CV score should be a good estimate of the public leaderboard (LB) score (and, you hope, of the final private LB score too).
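Here is a sketch of the 5-fold scheme, using scikit-learn's KFold to generate the splits. X and y are assumed to be in-memory arrays and build_model is the illustrative function from the previous sketch; with TFRecords you would typically split at the level of the files instead.

```python
import numpy as np
from sklearn.model_selection import KFold

N_FOLDS = 5
EPOCHS = 20

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)
fold_scores = []

# X: images (or references to them), y: labels
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f'Fold {fold + 1}/{N_FOLDS}')
    model = build_model()  # a fresh model for every fold
    history = model.fit(
        X[train_idx], y[train_idx],
        validation_data=(X[val_idx], y[val_idx]),
        epochs=EPOCHS,
        batch_size=32,
    )
    # final validation accuracy for this fold
    fold_scores.append(history.history['val_accuracy'][-1])

# the CV score is the average of the per-fold validation scores
print('CV accuracy:', np.mean(fold_scores))
```

For an imbalanced dataset (as in the Melanoma competition), StratifiedKFold is usually a better choice, so that each fold keeps the same class proportions.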

Fourth: learn to use TPUs. TPUs are specialized processors (ASICs) designed and developed by Google explicitly for working with neural networks. If you know how, you can use TensorFlow on a TPU and train your model roughly ten times faster than on a 2-GPU machine. And it is far better to be able to test changes to the model, and their impact on accuracy, in one-tenth of the time: you won't get bored waiting, and you can run more experiments. On Kaggle you get 30 hours of free TPU time every week (the only drawback so far is that TF 2.3 is not yet supported, but I'm sure it won't take long).

In general, you should know how to train on both multi-GPU and TPU machines. It is not that difficult, even if at first sight the configuration code looks a little obscure.
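For reference, this is roughly what that configuration code looks like on Kaggle: detect a TPU if one is attached, otherwise fall back to a multi-GPU (or CPU) strategy, and build the model inside the strategy scope. build_model is again the illustrative function from the sketch above.

```python
import tensorflow as tf

# Detect a TPU if one is attached, otherwise fall back to (multi-)GPU or CPU
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs

print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # build and compile the model inside the scope, so that its variables
    # are created on the TPU/GPU replicas
    model = build_model()

# scale the batch size with the number of replicas
batch_size = 16 * strategy.num_replicas_in_sync
```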

Fifth: be gentle with your learning rate.

The learning rate (LR) is probably the most important hyper-parameter. But if you read an introductory book on Deep Learning, you won't find a detailed description of the strategies you can adopt to get the best out of your data.

Some quick considerations: if you're doing transfer learning (see EfficientNet), you can start training without freezing the convolutional base, but then you must be careful to start with a very small learning rate, so that the gradients coming from the not-yet-trained classification head don't destroy the weights of the pre-trained convolutional base. Then you gradually increase the learning rate over the first epochs. But when the improvements in loss start to flatten, you should start decreasing the learning rate again. It is difficult to get right. In Kaggle competitions I have normally seen a time-varying learning rate, implemented with the Keras LearningRateScheduler callback.
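As an illustration, this is the kind of ramp-up-then-decay schedule I have seen in many Kaggle notebooks, plugged in through the Keras LearningRateScheduler callback; the specific values below are only examples, not the ones from my notebooks.

```python
import tensorflow as tf

LR_START = 1e-5        # very small: protects the pre-trained weights
LR_MAX = 1e-4          # reached after the ramp-up
LR_MIN = 1e-6
LR_RAMPUP_EPOCHS = 5
LR_SUSTAIN_EPOCHS = 2
LR_DECAY = 0.8

def lr_schedule(epoch):
    if epoch < LR_RAMPUP_EPOCHS:
        # linear ramp-up from LR_START to LR_MAX
        lr = LR_START + (LR_MAX - LR_START) * epoch / LR_RAMPUP_EPOCHS
    elif epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
        # hold at the maximum for a few epochs
        lr = LR_MAX
    else:
        # exponential decay towards LR_MIN
        lr = LR_MIN + (LR_MAX - LR_MIN) * LR_DECAY ** (
            epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS)
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)

# then: model.fit(..., callbacks=[lr_callback])
```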

For some more details, see https://www.jeremyjordan.me/nn-learning-rate/

An example of the code I have been using can be found here:

https://github.com/luigisaetta/diabetic-retinopathy/blob/master/Diabetic-Retinopathy-512-Classification-Copy6.ipynb 

A complete example.

I have prepared a rather complete example using data from an old competition: Diabetic Retinopathy Detection. You can find the code in one of my GitHub repositories: https://github.com/luigisaetta/diabetic-retinopathy.

I have:

  • preprocessed all the images, reducing their size to 512x512
  • applied a Gaussian blur filter that enhances the details showing signs of DR (see the sketch after this list)
  • packed the processed images into TFRecord files
  • developed a CNN model for classification into five classes (as required by the competition), using an EfficientNet B4
  • applied all the techniques detailed above
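As an illustration of the Gaussian-blur enhancement mentioned in the list, here is a minimal sketch of this well-known preprocessing, often attributed to Ben Graham's winning solution in the original Diabetic Retinopathy competition; the exact parameters used in the repository may differ.

```python
import cv2
import numpy as np

def preprocess_image(path, img_size=512, sigma=10):
    """Resize a retina image and enhance details via Gaussian-blur subtraction.

    The blurred image approximates the local average colour; subtracting it
    (here with a weighted sum) highlights small details such as lesions.
    """
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (img_size, img_size))
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    # 4*img - 4*blurred + 128: classic enhancement used for DR images
    img = cv2.addWeighted(img, 4, blurred, -4, 128)
    return img
```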

You can find all the details needed to reproduce the results in the GitHub repository, and the dataset (with the TFRecord files) is available on Kaggle: https://www.kaggle.com/luigisaetta/tfrecord512

Some conclusions.

With about 24 hours of training, my best model achieved a CV accuracy of 0.856.

The following image shows the increase in accuracy over the epochs for one of the CV folds (the others are similar).

But what is more interesting is that the resulting submission (you can make late submissions to closed Kaggle competitions) achieved a private LB score of 0.79369. This score would have put me in 14th place.

Well, it doesn't mean I'm already a master, but it is certainly a demonstration that, with today's technologies and techniques, it is far easier to get results that would have required months of hard work five years ago. Just one detail to back this claim: five years ago EfficientNets were not available (they were published in 2019).

This is what we call progress.

AI4Medicine.

I really think Medicine is one area where DL and AI could make a great contribution to building a better world, and not only by applying results from the Computer Vision field (AI-supported diagnosis). This is especially true in the least developed countries, where too many people still suffer and die from diseases that could be cured if correctly diagnosed in time.

One last thing: what about my SIIM-ISIC Melanoma Classification Competition?

Well, as I said, it wasn't easy. There were more than three thousand participants, and there was a final surprise: the private LB turned out to be very different from the public one.

I finished in the top 20%, which is not bad for a Kaggle beginner.