My English is poor, so when reading papers I translate them first and then study them in depth.
Disclaimer: this translation is only for my own learning; it will be removed if it infringes any rights.
Deep Residual Learning for Image Recognition
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
Because deeper neural networks are harder to train, in this paper we propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to each layer's inputs, rather than learning unreferenced functions. We provide comprehensive experimental evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate a 152-layer residual network; although it is 8× deeper than the VGG nets [40], it still has lower complexity. An ensemble of these residual nets achieves an error of only 3.57% on the ImageNet test set, which won 1st place in the ILSVRC 2015 classification task. We also present analyses of 100-layer and 1000-layer residual networks on the CIFAR-10 dataset.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
The depth of representations is of central importance for many visual recognition tasks. Solely owing to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are the foundation of our submissions to the ILSVRC and COCO 2015 competitions, where we also won 1st place in the ImageNet detection and localization tasks and the COCO detection and segmentation tasks.
1. Introduction
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39]. Deep networks naturally integrate low/mid/high-level features [49] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]. Many other non-trivial visual recognition tasks [7, 11, 6, 32, 27] have also greatly benefited from very deep models.
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs in image classification [21, 49, 39]. Deep networks naturally integrate low-, mid-, and high-level image features [49] together with classifiers in an end-to-end multi-layer fashion, and the "levels" of these features can be enriched by stacking more layers (depth). Recent evidence shows that network depth is of crucial importance, and the leading models on the challenging ImageNet dataset are all very deep, with depths of sixteen to thirty layers. Many other non-trivial visual recognition tasks have also benefited greatly from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [14, 1, 8], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 8, 36, 12] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question is the notorious problem of vanishing/exploding gradients, which hampers convergence from the very beginning. However, this problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.
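As a concrete, hypothetical illustration of these two remedies, the sketch below builds a plain stacked block with He/Kaiming-style normalized initialization and a normalization layer between the convolution and the nonlinearity. The choice of PyTorch, the layer sizes, and the helper name are my own assumptions, not something taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not from the paper): a plain stacked block using the
# two remedies cited for vanishing/exploding gradients: normalized (He)
# initialization and an intermediate normalization layer (BatchNorm).
def plain_block(in_ch, out_ch):
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
    # Normalized initialization: scale weights by fan so activations neither
    # shrink nor blow up as depth grows.
    nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
    return nn.Sequential(
        conv,
        nn.BatchNorm2d(out_ch),   # intermediate normalization layer
        nn.ReLU(inplace=True),
    )

# A deep "plain" net built from such blocks can start converging under SGD
# with backpropagation, even with tens of layers.
net = nn.Sequential(*[plain_block(16, 16) for _ in range(20)])
```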
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.
When deeper networks are able to start converging, a degradation problem is exposed: as the network depth increases, accuracy first becomes saturated (which is perhaps unsurprising) and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and verified by our experiments. Figure 1 shows a typical example.
Figure 1. Training error (left) and test error (right) on CIFAR-10 for 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and consequently higher test error. Similar phenomena on ImageNet are shown in Fig. 4.
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
The degradation of training accuracy indicates that not all systems are similarly easy to optimize. Consider a shallower architecture and its deeper counterpart obtained by adding more layers onto it. For the deeper model there exists a solution by construction: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that the deeper model should produce no higher training error than its shallower counterpart. Experiments, however, show that our current solvers are unable to find solutions that are comparably good or better than this constructed one (or are unable to do so in feasible time).
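A toy numpy sketch of this construction (my own illustration, with made-up shapes and weights): appending identity-mapping layers to a learned shallow model leaves its outputs, and therefore its training error, unchanged.

```python
import numpy as np

# Toy illustration of the "solution by construction": take a learned shallow
# model and deepen it by appending layers that are exact identity mappings.
# The deeper model computes the same function, so its training error can be
# no higher than the shallow one's.
def shallow_model(x, weights):
    for w in weights:
        x = np.maximum(0, x @ w)   # learned layers (ReLU after each matmul)
    return x

def constructed_deeper_model(x, weights, extra_layers=10):
    x = shallow_model(x, weights)        # copied learned layers
    identity = np.eye(x.shape[-1])
    for _ in range(extra_layers):        # added layers: identity mapping
        x = x @ identity                 # output is unchanged
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
assert np.allclose(shallow_model(x, weights),
                   constructed_deeper_model(x, weights))
```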
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping of $F(x) := H(x) - x$. The original mapping is recast into $F(x)+x$. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping, $F(x) := H(x) - x$, so that the original mapping is recast as $F(x)+x$. We hypothesize that it is easier to optimize this residual mapping than to optimize the original, unreferenced mapping. In the extreme case, if an identity mapping were already optimal, pushing the residual to zero would be easier than fitting an identity mapping with a stack of nonlinear layers.
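To make the reformulation concrete, here is a minimal numpy sketch (my own illustration; the two-layer form and the shapes are assumptions): the stacked layers compute $F(x)$ and the block outputs $F(x)+x$, so the identity case reduces to driving the residual branch toward zero.

```python
import numpy as np

# Sketch of the reformulation (illustrative, not the paper's code).
# Plain view:    the stacked layers must fit H(x) directly.
# Residual view: the stacked layers fit F(x) := H(x) - x, and the block
#                outputs F(x) + x.
def residual_block(x, w1, w2):
    f = np.maximum(0, x @ w1) @ w2   # F(x): two stacked layers with a ReLU
    return f + x                     # recast mapping: F(x) + x

# Extreme case from the text: if the optimal H is the identity, the solver
# only has to drive the residual branch toward zero weights ...
d = 8
x = np.random.default_rng(1).standard_normal((2, d))
zero_w = np.zeros((d, d))
assert np.allclose(residual_block(x, zero_w, zero_w), x)
# ... which is easier than making a stack of nonlinear layers fit x itself.
```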
The formulation of $F(x)+x$ can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 33, 48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
Figure 2. Residual learning: a building block.
The formulation $F(x)+x$ can be realized by feedforward neural networks with “shortcut connections” (Fig. 2), i.e., connections that skip one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameters nor computational complexity, and the entire network can still be trained end-to-end by SGD with backpropagation and easily implemented with common deep learning libraries (e.g., Caffe [19]) without modifying the solvers.
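Below is a sketch of such a building block. The text mentions Caffe; for brevity I write it in PyTorch, and the channel count, kernel sizes, and use of batch normalization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A sketch of the building block in Fig. 2 (illustrative layer sizes).
class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # the stacked layers that learn the residual F(x)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x          # identity shortcut: adds no extra parameters
        return F.relu(out)

# The whole network remains an ordinary feedforward graph, so it trains
# end-to-end with SGD and backpropagation without modifying the solver.
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))   # output shape equals input shape
```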
We present comprehensive experiments on ImageNet [35] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
We conduct comprehensive experiments on ImageNet to show the degradation problem and to evaluate our method. The results show that: 1) our extremely deep residual nets are easy to optimize, whereas the counterpart “plain” nets (which simply stack layers) exhibit higher training error as the depth increases; and 2) our deep residual nets can easily gain accuracy from greatly increased depth, producing results substantially better than previous networks.
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
Similar phenomena are observed on the CIFAR-10 dataset, suggesting that the optimization difficulties and the effects of our method are not specific to a particular dataset. On this dataset we have successfully trained models with over 100 layers and explored models with over 1000 layers.
On the ImageNet classification dataset [35], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
On the ImageNet classification dataset, we obtain excellent results with extremely deep residual nets. Although our 152-layer network is the deepest ever presented on ImageNet, it still has lower complexity than the VGG nets [40]. Our ensemble achieves a 3.57% top-5 error on the ImageNet test set and won 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also generalize well to other recognition tasks, which led us to further win 1st place in the ImageNet detection and localization tasks and the COCO detection and segmentation tasks in the ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect it to be applicable to other vision and non-vision problems.
2. Related Work
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
Residual Representations. In image recognition, VLAD [18] is a representation that encodes an image by the residual vectors with respect to a dictionary, and the Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] has been shown to be more effective than encoding the original vectors.
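A simplified numpy sketch of the residual-encoding idea behind VLAD (my own approximation, not the exact VLAD formulation): each descriptor is assigned to its nearest dictionary word, and the representation accumulates the residual vectors (descriptor minus word) rather than the descriptors themselves.

```python
import numpy as np

# Illustrative sketch of residual encoding for vector quantization
# (a simplified view of the idea behind VLAD).
def encode_residuals(descriptors, dictionary):
    # assign each descriptor to its nearest dictionary word (centroid)
    dists = np.linalg.norm(descriptors[:, None, :] - dictionary[None, :, :], axis=-1)
    nearest = np.argmin(dists, axis=1)
    # accumulate residual vectors (descriptor minus its centroid) per word
    encoding = np.zeros_like(dictionary)
    for k in range(dictionary.shape[0]):
        encoding[k] = (descriptors[nearest == k] - dictionary[k]).sum(axis=0)
    return encoding.ravel()  # the VLAD-style representation

rng = np.random.default_rng(0)
code = encode_residuals(rng.standard_normal((100, 16)),  # local descriptors
                        rng.standard_normal((8, 16)))    # dictionary words
```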
To be updated...