Introduction.

First of all, I would like to clarify that many of these notes have been developed following the Deep Learning Specialization from Coursera.

Therefore, if you have taken a look at the courses in that specialization, you won't find many original ideas here. These are more notes from the lectures.

But I hope I have put the concepts in a clear and concise way, and that these notes are therefore worth reading.

The need for a strategy.

A Deep Learning (DL) project can be complex if you want to develop something that gives real tools to your company.

If you want to reach an accuracy that fits your needs, you have to deal with a lot of data. In today's Big Data world it is often not too difficult to collect enough data. And once you deploy a first prototype of your model, you can collect more data from early users and improve the model.

But analyzing these data often requires a lot of computational capacity, and this means costs and time. 

To avoid spending a lot of time (and money) for little result, you should have a clear idea of what is best to do: which decisions and steps to take, and how to prioritize them.

First of all, DL works better with an iterative approach: set your target, identify the metrics (it is better to have a single optimizing metric), collect the data, rapidly develop a first model, and then analyze its performance. Then identify the best actions to improve performance, code the new model, and iterate.

Analyzing your performance against the optimizing metric will give you a clear idea of what you should do to improve your model. Then iterate, iterate, iterate until you reach your goal.

Training and Dev set.

It is simple, but for those starting out, this is an area where mistakes are commonly made.

It doesn't make sense to train your model on all the available data. You seriously risk overfitting the model to your training data, with no way, no tools, to predict what the performance will be on new, unseen data. And you developed your model to predict something on new data, not to predict what you already know.

Yes, I know that sometimes, in the beginning, it seems that you have little data. But there is no way around it: DL works better than "classic models" only if you have a lot of data.

Therefore, collect enough data to start with and split it as described below.

Having said that, you should first split your data into 3 sets:

  • training set
  • dev set (holdout cross-validation set)
  • test set

and you must put the test set aside, as the final tool that gives the last word on real-world performance. Another big mistake is to use the test set many times during the development of the model. This way you will overfit to the test set. Don't do this.

In many introductory books on "classic Machine Learning" you will find suggestions like:

  • 80/10/10 (train/dev/test)
  • 70/15/15
  • and so on.

These numbers are OK only if you have 1,000 or 10,000 samples in your data. But in many DL projects you have 1,000,000 or more samples. In this Big Data world, you will see proportions like

98%-1%-1%.

That is fine if it leaves you with, for example, 10,000 samples in your dev set. It is more important to train on as much data as possible, keeping only enough data for testing.
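As a minimal sketch of such a split, assuming the data fits in NumPy arrays (the fixed seed, sizes, and synthetic demo data are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, illustrative

def split_98_1_1(X, y):
    """Shuffle, then split into 98% train, 1% dev, 1% test."""
    idx = rng.permutation(len(X))
    n_dev = n_test = len(X) // 100            # 1% each
    n_train = len(X) - n_dev - n_test         # the remaining ~98%
    tr, dv, te = np.split(idx, [n_train, n_train + n_dev])
    return (X[tr], y[tr]), (X[dv], y[dv]), (X[te], y[te])

# With 1,000,000 samples this leaves 10,000 each for dev and test.
X = rng.normal(size=(1_000_000, 20))
y = rng.integers(0, 2, size=1_000_000)
(train_X, train_y), (dev_X, dev_y), (test_X, test_y) = split_98_1_1(X, y)
print(len(train_X), len(dev_X), len(test_X))  # 980000 10000 10000
```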

Another important thing to keep in mind is that the dev and test sets must come from the same distribution and, as much as possible, they should contain the kind of data on which the model will have to make predictions.

This is another area where it is easy to make mistakes. If you don't have enough samples from the domain you're working on, you normally resort to some form of "data augmentation". For example, if you are working on some sort of Image Recognition task, you gather images from the Web.

But you must be careful: images from the web can be higher resolution, sharper, not blurred, while the images you need to work on can be low-res, blurred, and so on. It can be OK to add images from the web to the train set, but you should put in the dev and test sets only the kind of images your model will be working on. This way you're setting the target correctly.

And if you're adding images from the Web, your train set will come from a different distribution than the dev and test sets. In this case you can have what is called a "data mismatch problem". There are techniques to detect and address the data mismatch problem that you should be aware of.

The train set will basically be used to train your model, and the dev set to decide between different models, for example during the hyperparameter optimization phase.

To quickly identify what is best to do and what to prioritize, you must analyze the performance on the train and dev sets, using the (single) optimizing metric you have chosen.

Bias and Variance.

In traditional ML books, you will often find discussions about the so-called "bias-variance tradeoff". Well, in DL you only care about how to reduce bias and variance. If you have enough data, there is no longer a tradeoff.

Bias basically means that your model is not powerful enough to "understand the complexity of reality". Variance means that the model works well on the data it has been trained on but doesn't work well on (doesn't generalize to) unseen data.

The approach to follow to quantify bias and variance is the following.

  1. First, establish an estimate of the best accuracy you can get (an estimate of the so-called Bayes Error).
  2. Evaluate the error on the train set
  3. Evaluate the error on the dev set.

The first step can be difficult. If I'm starting in a new field, how can I estimate "the best performance I can get"? A fair question.

Here only experience can help you, or researching what other people have done in similar fields.

If you're working on a task that humans are normally very good at (for example, street sign recognition, image recognition, speech recognition, language translation, and so on), human-level performance is a good estimate of an upper bound on the Bayes error.

So, for example, you can say that for image recognition human-level performance (as recognition error) is 0.5% (well, it depends on what kind of images you're working on; if you're working on radiology images for diagnosis it can be different, but for the purpose of this exposition the number given is a good starting point).

Then you must compare the train error (2) and the dev error (3) with the error defined in point 1 (which I'll refer to in the following as human-level performance).

This is the easy recipe:

  • if the train error is significantly bigger than error (1), you have high bias
  • if dev error is much bigger than train error you have high variance

And you can have both at the same time.
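As a toy illustration of this recipe (all errors in percent; the `tol` threshold is my own arbitrary choice, not part of the recipe):

```python
def diagnose(human_err, train_err, dev_err, tol=1.0):
    """Flag high bias / high variance from error percentages."""
    issues = []
    if train_err - human_err > tol:
        issues.append("high bias")       # train error far above human level
    if dev_err - train_err > tol:
        issues.append("high variance")   # dev error far above train error
    return issues or ["looks fine"]

print(diagnose(human_err=0.5, train_err=5.0, dev_err=6.0))   # ['high bias']
print(diagnose(human_err=0.5, train_err=1.0, dev_err=10.0))  # ['high variance']
```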

A first interpretation.

If you have large bias, your model is not powerful enough to capture all that you need from the available data. If you have large variance, your model is overfitting to your train set.

In the first case, you can choose to:

  • increase the capacity of your network (more layers, more hidden units)
  • train for a longer number of iterations (increase the number of epochs)
  • maybe, change the architecture of the network

In the second case:

  • train on more data
  • apply regularization techniques
  • apply dropout
  • maybe, change the architecture of the network
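For illustration, here is how some of these knobs might look in Keras: the first list maps to adding layers and hidden units, the second to L2 regularization and dropout. This is a minimal sketch where the network shape, the 100-feature input, and the rates are all hypothetical choices of mine:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),            # hypothetical 100-feature input
    # against high bias: more layers / more hidden units
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # against high variance: L2
    layers.Dropout(0.5),                                     # against high variance: dropout
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```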

Orthogonalization.

You can think of these as two orthogonal dimensions along which you can move in order to reduce bias and variance. Well, you can also take actions that influence bias and variance together, but in order to proceed at maximum speed you shouldn't do that.

Data mismatch problems.

If your train and dev sets come from different distributions, the situation can be more complicated.

Imagine that you have done the analysis described above and you get these numbers:

  • human level error: 0.5%
  • train error: 1%
  • dev error: 10%

At this point, before deciding that you have a "variance problem", consider a question: is the difference between the train and dev errors due to the fact that the model has been trained on the train set (and therefore it is only a variance issue), or to the fact that the two sets come from different distributions and, for example, it is more difficult to make predictions on the second type of data?

Well, with this setup there is no way to answer. In this case one technique is to introduce a "train-dev set".

Before starting the training, you should take a small number of samples from the train set (on the order of the size of the dev set) and put them aside. These samples will make up the "train-dev set". The remaining samples will be the "true" train set.

This way you end up with four sets:

  • train set
  • train-dev set
  • dev set
  • test set
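A minimal sketch of this carve-out, assuming the data lives in NumPy arrays (the helper name and the fixed seed are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def carve_train_dev(X, y, n_train_dev):
    """Randomly set aside n_train_dev samples as the train-dev set."""
    idx = rng.permutation(len(X))
    td, tr = idx[:n_train_dev], idx[n_train_dev:]
    return (X[tr], y[tr]), (X[td], y[td])  # ("true" train set, train-dev set)
```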

You will train only on the train set, and then evaluate the following:

  • human-level error
  • train error
  • train-dev error
  • dev error

Imagine that you get:

  • human level error: 0.5%
  • train error: 1%
  • train-dev error: 1.5%
  • dev error: 10%

From these numbers it is clear that you don't have a variance problem but a "data mismatch problem", due to the fact that the train and dev sets come from different distributions.

The model has been trained only on the train set, therefore the train-dev samples are unseen data. Since the train and train-dev errors are similar, the difference between the train and dev errors is not due to the model working poorly on unseen data (failing to generalize). Therefore it is not a variance issue.

At this point, you should have a close look at the samples from the dev set where the model makes wrong predictions, to try to get clues on why this happens. You should manually analyze errors.

If you get:

  • human level error: 0.5%
  • train error: 1%
  • train-dev error: 9.5%
  • dev error: 10%

Then you have a variance problem: the big jump in error already appears on the train-dev set, which is unseen data from the same distribution as the train set.
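Both scenarios boil down to looking at the gaps between consecutive errors: train minus human level is the (avoidable) bias, train-dev minus train is the variance, and dev minus train-dev is the data mismatch. A small helper, fed with the numbers from the two examples above:

```python
def decompose_errors(human, train, train_dev, dev):
    """Decompose the error gaps (all in %) in the four-set setup."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }

# First example: mismatch dominates.
print(decompose_errors(0.5, 1.0, 1.5, 10.0))
# {'avoidable bias': 0.5, 'variance': 0.5, 'data mismatch': 8.5}

# Second example: variance dominates.
print(decompose_errors(0.5, 1.0, 9.5, 10.0))
# {'avoidable bias': 0.5, 'variance': 8.5, 'data mismatch': 0.5}
```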

What to do if there is a "data mismatch problem"?

Well, the general criterion is that you should train, as much as possible, on the kind of data your model must make good predictions on. So you should add to the train set more data from the second distribution (the one dev and test come from).

If you don't have enough data, you can try to resort to "data augmentation" and "data synthesis" techniques, or reduce (maybe remove) the differences between the two distributions.
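For the image example discussed earlier, reducing the differences could be as simple as degrading the high-res web images so that they resemble the dev/test images. A sketch with Pillow, where the blur radius and the target size are pure assumptions:

```python
from PIL import Image, ImageFilter

def degrade_to_match(img: Image.Image) -> Image.Image:
    """Blur and downscale a high-res web image so it resembles
    the low-res, blurred images of the dev/test distribution."""
    blurred = img.filter(ImageFilter.GaussianBlur(radius=2))  # assumed radius
    return blurred.resize((64, 64), Image.BILINEAR)           # assumed target size
```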