Deep neural networks are extremely challenging to train. A deep residual learning framework eases the training of networks that are substantially deeper than those used previously. The layers are explicitly reformulated as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Let's look at how deep residual learning benefits image recognition.
Extensive empirical evidence, including practical experience at companies like Tooliqa and Netguru, shows that these residual networks are easier to optimize and gain accuracy from considerably increased depth.
On the ImageNet dataset, residual nets with a depth of up to 152 layers can be evaluated, 8 times deeper than VGG nets, while still having lower complexity. An ensemble of these residual nets achieves a 3.57% error on the ImageNet test set.
In addition, the depth of representations plays a vital role in many visual recognition tasks. Deep representations yield a 28% relative improvement on the COCO object detection dataset.
Introduction to deep residual learning
Image recognition is an engineering application of machine learning. Deep convolutional neural networks have led to a series of breakthroughs in image classification. Deep networks naturally integrate low-, mid-, and high-level features along with classifiers in an end-to-end multi-layer fashion, and the levels of features can be enriched by the number of stacked layers.
Network depth plays a crucial role: the leading results on the challenging ImageNet dataset use very deep models, with depths ranging from sixteen to thirty layers.
Many other non-trivial recognition tasks have also benefited greatly from very deep models. The significance of depth leads to a question: is learning better networks as easy as stacking more layers? An obstacle to answering this is the problem of vanishing/exploding gradients, which hinders convergence from the beginning.
This problem has been largely addressed by normalized initialization and intermediate normalization layers, enabling networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.
Residual network
A residual network addresses the degradation problem through shortcut, or skip, connections, which enable very deep networks to be built.
When deeper networks are able to start converging, a degradation problem emerges: as network depth increases, accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
This degradation indicates that not all systems are similarly easy to optimize. Deep residual nets are easier to optimize than their plain counterparts, which exhibit higher training error as depth increases.
Deep residual nets also enjoy accuracy gains from greatly increased depth, unlike plain nets. On the ImageNet classification dataset, extremely deep residual nets produce substantially more accurate results.
Further, the 152-layer residual net is the deepest network presented on ImageNet, yet it still has lower complexity than VGG nets [40]. The residual learning principle is generic, and hence can be applied to vision as well as non-vision problems.
Related Work in deep residual learning
1. Residual representations
In image recognition, VLAD is a commonly used representation that encodes by the residual vectors with respect to a dictionary, and the Fisher Vector can be formulated as a probabilistic version of VLAD. Both are powerful representations for image retrieval as well as classification.
For vector quantization, encoding residual vectors is more effective than encoding the original vectors. In low-level vision and computer graphics, the Multigrid method solves partial differential equations by reformulating the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale.
Furthermore, hierarchical basis preconditioning is an alternative to Multigrid that relies on variables representing residual vectors between two scales. These solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. Such methods suggest that a good reformulation or preconditioning can simplify optimization.
2. Shortcut connections
An early practice for training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output. Later works connect some intermediate layers to auxiliary classifiers to address vanishing/exploding gradients.
Moreover, an Inception layer is composed of a shortcut branch and a few deeper branches. Highway networks present shortcut connections with gating functions, where these gates are data-dependent and have parameters.
By contrast, identity shortcuts are parameter-free. When a gated shortcut is closed (approaching zero), the layers in a highway network represent non-residual functions. Identity shortcuts are never closed, so all information is always passed through, with additional residual functions to be learned. Moreover, highway networks have not demonstrated accuracy gains with greatly increased depth.
Deep residual learning
1. Residual learning
Consider H(x) as an underlying mapping to be fit by a few stacked layers, with x denoting the input to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions.
Rather than expecting the stacked layers to approximate H(x) directly, we let these layers approximate a residual function F(x) = H(x) − x. The original function then becomes F(x) + x. Both forms can asymptotically approximate the desired functions, but the ease of learning differs.
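As an illustration, here is a minimal sketch of such a building block in PyTorch (a framework chosen for the example, not the paper's original implementation), where two stacked convolutional layers learn F(x) and the block outputs F(x) + x through an identity shortcut:

```python
# A minimal sketch of a residual block: the stacked layers learn
# F(x) = H(x) - x, and the block outputs F(x) + x instead of fitting H(x) directly.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 conv layers with batch normalization
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x) = F(x) + x: the identity shortcut adds the input back in
        return self.relu(self.f(x) + x)

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))   # output has the same shape as the input
```

If the residual mapping F were zero, the block would simply pass the input through, which is why adding such blocks should not make a suitably deep model worse.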
2. Identity mapping by shortcuts
A building block is defined as y = F(x, {Wi}) + x, where x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition, which requires the dimensions of x and F to be equal. When they are not, a linear projection Ws can be performed by the shortcut connection to match the dimensions:
y = F(x, {Wi}) + Ws x
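A hedged sketch of the two forms (again in PyTorch, with example channel and stride values that are not taken from the paper's tables): the identity shortcut adds x directly, while the projection Ws can be realized as a 1×1 convolution when the dimensions change.

```python
import torch
import torch.nn as nn

in_ch, out_ch, stride = 64, 128, 2      # illustrative values only

# Residual mapping F(x, {Wi}): two stacked 3x3 conv layers
f = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
    nn.BatchNorm2d(out_ch),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
    nn.BatchNorm2d(out_ch),
)

# Linear projection Ws, here a 1x1 convolution matching channels and stride
ws = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

x = torch.randn(1, in_ch, 56, 56)
y = f(x) + ws(x)                        # y = F(x, {Wi}) + Ws x
print(y.shape)                          # torch.Size([1, 128, 28, 28])
```

When the dimensions already match, the projection is unnecessary and the parameter-free identity shortcut y = f(x) + x is used instead.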
3. Network architectures
Plain network: The convolutional layers in the plain baselines mostly use 3×3 filters and follow two design rules (sketched after this list):
(i) For the same output feature map size, the layers have the same number of filters.
(ii) If the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer.
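A toy sketch of these two rules follows, with illustrative stage sizes that are assumptions rather than the paper's exact configuration:

```python
# Plain-network design rules: filters stay fixed while the feature map size is
# unchanged, and double whenever the feature map is halved by a stride-2 conv.
import torch.nn as nn

def plain_stage(in_ch: int, out_ch: int, blocks: int, downsample: bool) -> nn.Sequential:
    layers = []
    for i in range(blocks):
        stride = 2 if (downsample and i == 0) else 1   # halve the feature map once per stage
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# 56x56 -> 28x28 -> 14x14: each halving doubles the filter count (64 -> 128 -> 256)
stages = nn.Sequential(
    plain_stage(64, 64, 2, downsample=False),
    plain_stage(64, 128, 2, downsample=True),
    plain_stage(128, 256, 2, downsample=True),
)
```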
Residual network
Inserting shortcut connections into the plain network turns it into a residual network (deep residual learning). The identity shortcut is used when the input and output have the same dimensions. When the dimensions increase, there are two options (see the sketch after this list):
(i) The shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions.
(ii) The projection shortcut is used to match the increased dimensions.
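As a rough sketch of option (i), the zero-padded identity shortcut can be written as follows (the stride-2 subsampling and channel counts are assumptions for illustration); option (ii) corresponds to the 1×1-convolution projection shown earlier:

```python
# Option (i): a parameter-free identity shortcut across a dimension increase,
# with spatial subsampling and zero padding for the extra channels.
import torch
import torch.nn.functional as F

def padded_identity_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2) -> torch.Tensor:
    x = x[:, :, ::stride, ::stride]          # subsample spatially to match the smaller feature map
    extra = out_channels - x.shape[1]        # number of zero channels to append
    return F.pad(x, (0, 0, 0, 0, 0, extra))  # pad the channel dimension with zeros

x = torch.randn(1, 64, 56, 56)
print(padded_identity_shortcut(x, 128).shape)   # torch.Size([1, 128, 28, 28])
```

Option (i) adds no parameters, whereas option (ii) adds a small number of parameters but lets the shortcut be learned.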
Implementation
The ImageNet implementation follows the practice in [21, 40]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. The standard color augmentation in [21] is also used. The weights are initialized as in [12] to train all plain as well as residual nets. For comparison studies, standard 10-crop testing [2] is used, whereas for best results the fully convolutional form in [40, 12] is used.
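A rough sketch of this preprocessing, assuming torchvision (the custom RandomShorterSideResize helper is hypothetical, and the exact PCA-based color augmentation of [21] is omitted):

```python
# Training-time scale augmentation: resize the shorter side to a random value
# in [256, 480], then take a 224x224 crop with a random horizontal flip.
import random
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

class RandomShorterSideResize:  # hypothetical helper for this sketch
    def __init__(self, low: int = 256, high: int = 480):
        self.low, self.high = low, high

    def __call__(self, img):
        return TF.resize(img, random.randint(self.low, self.high))  # int size resizes the shorter side

train_transform = transforms.Compose([
    RandomShorterSideResize(256, 480),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Standard 10-crop testing: four corners + center, plus their horizontal flips.
# The Resize(256) test scale here is an assumption for the sketch.
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])
```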
Tooliqa specializes in AI, Computer Vision, and Deep Technology, helping businesses simplify and automate their processes with a strong team of experts across various domains.
Want to know more about how AI can improve your business processes? Let our experts guide you.
Reach out to us at business@tooli.qa.