- Applied Deep Learning and Computer Vision for Self-Driving Cars
- Sumit Ranjan;Dr. S. Senthamilarasu
Understanding hyperparameters
Hyperparameters serve a similar purpose to the various tone knobs on a guitar that are used to get the best sound. They are settings that you can tune to control the behavior of an ML algorithm.
A vital aspect of any deep learning solution is the selection of hyperparameters. Most deep learning models have specific hyperparameters that control various aspects of the model, such as its memory footprint or execution cost. However, it is also possible to define additional hyperparameters to help an algorithm adapt to a particular scenario or problem statement. To get the maximum performance out of a particular model, data science practitioners typically spend a lot of time tuning hyperparameters, as they play such an important role in deep learning model development.
Hyperparameters can be broadly classified into two categories:
Model training-specific hyperparameters
Network architecture-specific hyperparameters
In the following sections, we will cover model training-specific hyperparameters and network architecture-specific hyperparameters in detail.
Model training-specific hyperparameters
Model training-specific hyperparameters play an important role in model training. These are hyperparameters that live outside the model but have a direct influence on it. We will discuss the following hyperparameters:
Learning rate
Batch size
Number of epochs
Let's start with the learning rate.
Learning rate
The learning rate is the mother of all hyperparameters: it controls how much the model's weights are adjusted at each update step, and therefore how quickly and how reliably the model learns.
A learning rate that is too low increases the training time of the model, as the weights of the network change only by small increments on the way to an optimal state. On the other hand, although a large learning rate helps the model adjust to the data quickly, it can cause the model to overshoot the minimum. A good starting value for the learning rate for most models is 0.001. In the following diagram, you can see that a low learning rate requires many updates before reaching the minimum point:
An optimal learning rate, however, swiftly reaches the minimum point, requiring far fewer updates to get close to the minimum. Here, we can see a diagram with a decent learning rate:
A learning rate that is too high causes drastic updates that lead to divergent behavior, as shown in the following diagram:
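To make this concrete, the following is a minimal sketch of how the learning rate is set when compiling a Keras model. The tiny model and the input shape are placeholders for illustration only:

```python
import tensorflow as tf

# A small placeholder model; in practice, this would be your own architecture.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

# The learning rate is passed to the optimizer; 0.001 is a common starting point.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')
```

Lowering learning_rate to 0.0001 would slow convergence, while raising it to 0.1 would risk the divergent behavior described previously.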
In the next section, we will learn about an important model training-specific parameter called batch size.
Batch size
Another non-trivial hyperparameter that has a huge influence on the training accuracy, time, and resource requirements is batch size. Basically, batch size determines the number of data points that are sent to the ML algorithm in a single iteration during training.
Although a very large batch size provides a substantial computational speed-up, in practice it has been observed that it significantly degrades the quality of the model, as measured by its ability to generalize. A larger batch size also requires more memory during training.
A smaller batch size increases the training time, but it almost always yields a better model than a larger one. This can be attributed to the fact that smaller batches introduce more noise into the gradient estimates, which helps the optimizer converge to flatter minima that generalize better.
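As a quick sketch of where the batch size appears in code, the following reuses the placeholder model compiled in the earlier learning rate example, with randomly generated data standing in for a real dataset:

```python
import numpy as np

# Placeholder training data for illustration only.
x_train = np.random.rand(1000, 10).astype('float32')
y_train = np.random.rand(1000, 1).astype('float32')

# With batch_size=32, each epoch performs ceil(1000 / 32) = 32 weight updates;
# raising it to 256 would cut this to 4 updates, trading gradient noise for speed.
model.fit(x_train, y_train, batch_size=32, epochs=5)
```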
Number of epochs
The number of epochs is the number of training cycles; one epoch means that the whole dataset has been passed forward and backward through the neural network exactly once. Counting epochs is an easy way to track how long training has been running while the training or validation error continues to improve. Since a whole dataset is usually too large to feed to the machine at once, each epoch is divided into many smaller batches.
One technique for choosing the number of epochs automatically is to use the early stopping Keras callback, which stops the training process if the training/validation error has not improved over the past 10 to 20 epochs.
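The following is a minimal sketch of this callback, again assuming the placeholder model and data from the earlier sketches; the patience value of 10 epochs is just an illustrative choice:

```python
# Stop training when the validation loss has not improved for 10 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=100,           # an upper bound; training may stop much earlier
          batch_size=32,
          callbacks=[early_stop])
```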
Network architecture-specific hyperparameters
The hyperparameters that directly deal with the architecture of the deep learning model are called network architecture-specific hyperparameters. The different types of network-specific hyperparameters are as follows:
Number of hidden layers
Regularization
Activation functions as hyperparameters
In the following section, we will see how network architecture-specific hyperparameters work.
Number of hidden layers
It is easy for a model to learn simple features with a small number of hidden layers. However, as the features become more complex or the non-linearity increases, the model requires more and more layers and units.
Having a network that is too small for a complex task results in a model that performs poorly, as it lacks the required learning capacity. Having slightly more units than the optimal number is not a problem; however, a much larger number will lead to overfitting, meaning that the model will memorize the training dataset and perform well on it, but will fail to perform well on the test data. So, we can experiment with the number of hidden layers and units and validate the accuracy of the network.
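One way to experiment with this is to treat the depth and width of the network as arguments when building the model. The following is a hedged sketch in Keras; the layer counts, unit counts, and input shape are illustrative assumptions rather than recommendations:

```python
import tensorflow as tf

def build_model(num_hidden_layers=2, units_per_layer=64):
    """Build a simple fully connected network with configurable depth and width."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(10,)))
    for _ in range(num_hidden_layers):
        model.add(tf.keras.layers.Dense(units_per_layer, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model

# Compare a shallow and a deeper network on the same validation data.
shallow_model = build_model(num_hidden_layers=1)
deeper_model = build_model(num_hidden_layers=4)
```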
Regularization
Regularization makes slight changes to the learning algorithm so that the model generalizes better; the regularization strength is the hyperparameter we tune. This also improves the performance of the model on unseen data.
In ML, regularization penalizes the coefficients. In deep learning, regularization penalizes the weight matrices of the nodes.
We are going to discuss two types of regularization, as follows:
L1 and L2 regularization
Dropout
We will start with L1 and L2 regularization.
L1 and L2 regularization
The most common types of regularization are L1 and L2. We change the overall cost function by adding a regularization term to it. Adding this term drives the values of the weight matrices down, based on the assumption that a neural network with smaller weight matrices corresponds to a simpler model.
Regularization is different in L1 and L2. The formula for L1 regularization is as follows:
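$$\text{Cost} = \text{Loss} + \lambda \sum_{i} \lvert w_i \rvert$$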
In the preceding formula, the regularization strength is represented by lambda (λ). Here, we penalize the absolute values of the weights.
The formula for L2 regularization is as follows:
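$$\text{Cost} = \text{Loss} + \lambda \sum_{i} w_i^{2}$$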
In the preceding formula, the L2 regularization strength is again represented by lambda (λ). L2 regularization is also called weight decay, as it forces the weights to decay toward 0.
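In Keras, both penalties can be attached to a layer's weights through the kernel_regularizer argument. The following is a minimal sketch; the λ value of 0.01 and the layer sizes are illustrative assumptions:

```python
from tensorflow.keras import layers, regularizers

# L1 penalty (absolute weights) on one layer and L2 (weight decay) on another,
# both with an illustrative lambda of 0.01.
dense_l1 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l1(0.01))
dense_l2 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l2(0.01))
```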
Dropout
Dropout is a regularization technique that is used to improve the generalizing power of a network and prevent it from overfitting. Generally, a dropout value of 0.2 to 0.5 is used, with 0.2 being a good starting point; in practice, we try several values and compare the model's performance.
A dropout value that is too low has a negligible effect, whereas a value that is too high causes the network to under-learn the features during training. Dropout tends to give better results on larger and wider networks, as it gives the model a greater opportunity to learn independent representations.
An example of dropout can be seen as follows, showing how we are going to drop a few of the neurons from the network:
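In code, dropout is simply another layer inserted after the layers whose activations we want to drop. The following is a minimal sketch; the dropout rate of 0.2 and the layer sizes are illustrative:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),   # randomly zero out 20% of these activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1)
])
```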
In the next section, we will learn about activation functions as hyperparameters.
Activation functions as hyperparameters
Activation functions, also (though less commonly) known as transfer functions, are used to enable the model to learn nonlinear prediction boundaries. Different activation functions behave differently and are chosen carefully based on the deep learning task at hand. We have already discussed the different types of activation functions in an earlier section of this chapter, Understanding activation functions.
In the next section, we will learn about the popular deep learning APIs—TensorFlow and Keras.