
Understanding hyperparameters

Hyperparameters serve a similar purpose to the various tone knobs on a guitar that are used to get the best sound. They are settings that you can tune to control the behavior of an ML algorithm.

A vital aspect of any deep learning solution is the selection of hyperparameters. Most deep learning models have specific hyperparameters that control various aspects of the model, such as memory usage and execution cost. It is also possible to define additional hyperparameters to help an algorithm adapt to a scenario or problem statement. To get the maximum performance out of a particular model, data science practitioners typically spend a lot of time tuning hyperparameters, as they play such an important role in deep learning model development.

Hyperparameters can be broadly classified into two categories:

Model training-specific hyperparameters

Network architecture-specific hyperparameters

In the following sections, we will cover model training-specific hyperparameters and network architecture-specific hyperparameters in detail.

Model training-specific hyperparameters

Model training-specific hyperparameters play an important role in model training. These are hyperparameters that live outside the model but have a direct influence on it. We will discuss the following hyperparameters:

Learning rate

Batch size

Number of epochs

Let's start with the learning rate.

Learning rate

The learning rate is the mother of all hyperparameters: it controls how much the model's weights are adjusted in response to the estimated error at each update step, and therefore how quickly the model learns.

A learning rate that is too low increases the training time of the model, as it takes longer for the incremental weight updates to reach an optimal state. On the other hand, although a large learning rate helps the model adjust to the data quickly, it can cause the model to overshoot the minimum. A good starting value for the learning rate for most models is 0.001. In the following diagram, you can see that a low learning rate requires many updates before reaching the minimum point:

Fig 2.18: A low learning rate

An optimal learning rate, by contrast, swiftly reaches the minimum point, requiring fewer updates to get close to the minimum. Here, we can see a diagram with a decent learning rate:

Fig 2.19: Decent learning rate

A high learning rate causes drastic updates that lead to divergent behaviors, as shown in the following diagram:

Fig 2.20: A high learning rate

For more on choosing the learning rate, see the paper Cyclical Learning Rates for Training Neural Networks by Leslie Smith: https://arxiv.org/abs/1506.01186.
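
As a minimal sketch of how the learning rate is set in practice (assuming the Keras API, which we cover later in this chapter), it is usually passed to the optimizer; the choice of the Adam optimizer and the loss shown in the comment are illustrative assumptions:

from tensorflow.keras import optimizers

# 0.001 is a common starting point; a value that is too small slows training,
# while one that is too large can overshoot the minimum and diverge.
optimizer = optimizers.Adam(learning_rate=0.001)

# The optimizer is then passed to the model when compiling it, for example:
# model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])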

In the next section, we will learn about an important model training-specific parameter called batch size.

Batch size

Another non-trivial hyperparameter that has a huge influence on the training accuracy, time, and resource requirements is batch size. Basically, batch size determines the number of data points that are sent to the ML algorithm in a single iteration during training.

Although a very large batch size provides a substantial computational speed-up, in practice it has been observed to significantly degrade the quality of the model, as measured by its ability to generalize. A large batch size also requires more memory during the training process.

A smaller batch size, on the other hand, almost always yields a better model than a larger one. This can be attributed to the fact that smaller batches introduce more noise into the gradient estimates, which helps the optimizer converge to flat minimizers that generalize better. The downside of a small batch size is that it increases the training time.

In general, the larger the training set, the larger the batch size that is usually recommended; a good batch size value tends to be around 2 to 32. For more information, you can refer to Revisiting Small Batch Training for Deep Neural Networks by Dominic Masters and Carlo Luschi (https://arxiv.org/abs/1804.07612), which reports that mini-batch sizes between m = 2 and m = 32 consistently perform well.
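
As a hedged illustration (again assuming the Keras API), the batch size is simply passed to the fit call; model is assumed to be a compiled Keras model, and x_train and y_train are placeholders for your own training inputs and labels:

# Train with mini-batches of 32 samples; 20% of the data is held out for validation.
history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_split=0.2)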

Number of epochs

The number of epochs defines how many complete passes through the training dataset the model makes. One epoch is when the whole dataset has been passed forward and backward through the neural network exactly once. Counting epochs is also a convenient way to track how the training or validation error evolves as training continues. Since one epoch is too large to feed to the machine at once, we divide it into many smaller batches.

One technique for choosing the number of epochs is to use the early stopping Keras callback, which stops the training process if the training/validation error has not improved in the past 10 to 20 epochs.
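
A minimal sketch of this callback might look as follows; the monitored metric, the patience of 15 epochs, and the compiled model and training arrays referenced in the comment are illustrative assumptions:

from tensorflow.keras.callbacks import EarlyStopping

# Stop training if the validation loss has not improved for 15 consecutive epochs
# and restore the weights from the best epoch seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)

# model is assumed to be a compiled Keras model:
# model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stop])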

Network architecture-specific hyperparameters 

The hyperparameters that directly deal with the architecture of the deep learning model are called network architecture-specific hyperparameters. The different types of network-specific hyperparameters are as follows:

Number of hidden layers

Regularization

Activation functions as hyperparameters

In the following section, we will see how network architecture-specific hyperparameters work.

Number of hidden layers 

A model can easily learn simple features with a small number of hidden layers. However, as the features become more complex or the non-linearity of the problem increases, more and more layers and units are required.

Having a small network for a complex task would result in a model that performs poorly as it wouldn't have the required learning capacity. Having a slightly larger number of units than the optimal number is not a problem; however, a much larger number will lead to the model overfitting. This means that the model will try to memorize the dataset and perform well on the training dataset, but will fail to perform well on the test data. So, we can play with the number of hidden layers and validate the accuracy of the network.
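
As a rough sketch of how this experiment can be set up in Keras (the layer width, input dimension, and binary classification head are illustrative assumptions), the number of hidden layers can be made a parameter of a model-building function:

from tensorflow.keras import models, layers

def build_model(num_hidden_layers=2, units=64, input_dim=20):
    # Build a simple fully connected network with a configurable number of
    # hidden layers so that we can compare validation accuracy across depths.
    model = models.Sequential()
    model.add(layers.Dense(units, activation='relu', input_shape=(input_dim,)))
    for _ in range(num_hidden_layers - 1):
        model.add(layers.Dense(units, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))  # binary classification head
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Try different depths and validate the accuracy of each network, for example:
# shallow_model = build_model(num_hidden_layers=1)
# deeper_model = build_model(num_hidden_layers=4)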

Regularization

Regularization is a hyperparameter that makes slight changes to the learning algorithm so that the model generalizes better. This also improves the performance of the model on unseen data.

In ML, regularization penalizes the coefficients. In deep learning, regularization penalizes the weight matrices of the nodes.

We are going to discuss two types of regularization, as follows:

L1 and L2 regularization

Dropout

We will start with L1 and L2 regularization.

L1 and L2 regularization

The most common types of regularization are L1 and L2. We change the overall cost function by adding a regularization term to it. Adding this term pushes the values of the weight matrices towards smaller values, based on the assumption that a neural network with smaller weight matrices is a simpler model and is therefore less likely to overfit.

Regularization works differently in L1 and L2. The formula for L1 regularization is as follows:

Cost function = Loss + (λ / 2m) * Σ|w|

In the preceding formula, the regularization strength is represented by lambda (λ), m is the number of training examples, and the sum runs over all of the weights, w. Here, we penalize the absolute values of the weights.

The formula for L2 regularization is as follows:

Cost function = Loss + (λ / 2m) * Σw²

In the preceding formula, L2 regularization is again controlled by lambda (λ); here, we penalize the squared values of the weights. It is also called weight decay, as it forces the weights to decay close to 0.
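
In Keras, L1 or L2 regularization can be attached to a layer's weights through the kernel_regularizer argument. The following is a minimal sketch; the layer sizes and the lambda value of 0.01 are illustrative assumptions:

from tensorflow.keras import layers, regularizers

# L2 (weight decay) penalty on the layer's weight matrix, with lambda = 0.01
dense_l2 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l2(0.01))

# The L1 penalty on the absolute weights is applied in the same way
dense_l1 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l1(0.01))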

Dropout

Dropout is a regularization technique that is used to improve the generalizing power of a network and prevent it from overfitting. Generally, a dropout rate of 0.2 to 0.5 is used, with 0.2 being a good starting point. In practice, we have to try several values and check the performance of the model.

A dropout rate that is too low has a negligible regularizing effect, whereas a rate that is too high causes the network to under-learn the features during model training. Dropout tends to give better results on larger and wider networks, as it gives the model a greater opportunity to learn independent representations.

An example of dropout can be seen as follows, showing how we are going to drop a few of the neurons from the network:

Fig 2.21: Dropout  
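
In code, dropout is usually added as a layer between the fully connected layers. The following is a minimal Keras sketch; the layer sizes, input shape, and dropout rate of 0.2 are illustrative assumptions:

from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(20,)),
    layers.Dropout(0.2),   # randomly drops 20% of the units during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid'),
])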

In the next section, we will learn about activation functions as hyperparameters. 

Activation functions as hyperparameters

Activation functions, sometimes known as transfer functions, are used to enable the model to learn non-linear prediction boundaries. Different activation functions behave differently and are carefully chosen based on the deep learning task at hand. We have already discussed the different types of activation functions in the Understanding activation functions section, earlier in this chapter.

In the next section, we will learn about the popular deep learning APIs—TensorFlow and Keras. 
