Improving the Way Neural Networks Learn (Part 2)

Overfitting and Regularization

Posted by 逸杯久 on February 8, 2019

“Actions speak louder than words.”

[TOC]

1 Overfitting

1.1 What Is Overfitting?

Google's machine learning documentation explains it this way: the model matches the training data so closely that it fails to make correct predictions on new data.

Let's look at an example with a linear model.

Given the simple training data set below, we build a model for it:

overfit_01

We fit both a linear model y = kx (the solid blue line below) and a 6th-degree polynomial model (the dashed orange line below), with the results shown in the figure:

overfit_02

The polynomial model fits the training data better than the linear model. But now we apply both models to the test data (the crosses in the figure below are test samples) for a final evaluation, with the results shown below:

overfit_03

To sum up: although the 6th-degree polynomial model beats the linear model on the training data, it performs poorly on the test data, that is, in actual prediction. The model matches the training data too closely, which limits how well it handles new data.
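To make this concrete, here is a minimal sketch of the experiment above (the data here is made up for illustration and is not the data set in the figures), fitting degree-1 and degree-6 polynomials with `numpy.polyfit` and comparing their errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small noisy linear data set: 10 training points, 50 held-out test points.
x_train = np.linspace(0, 1, 10)
y_train = 2.0 * x_train + rng.normal(0, 0.2, x_train.shape)
x_test = np.linspace(0, 1, 50)
y_test = 2.0 * x_test + rng.normal(0, 0.2, x_test.shape)

for degree in (1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)        # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-6 fit wins on the training set but typically loses on the test set.
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```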

1.2 How to Detect Overfitting

When overfitting occurs, the losses on the training and validation sets, plotted against the number of training iterations, typically follow a generalization curve like the one below:

(Figure: generalization curve. The loss on the training set gradually decreases; by contrast, the loss on the validation set first decreases and then starts to rise.)

The figure shows a model whose training loss steadily decreases while its validation loss eventually rises.

Ways to prevent overfitting include:

  • Regularization
  • Expanding the training set
  • Early stopping (see the sketch after this list)
  • Dropout
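Of the four, regularization and dropout are discussed below; early stopping is simple enough to sketch here: track the validation loss and stop once it has not improved for a while. A minimal sketch with a made-up validation-loss curve (a real loop would compute `val_loss` from held-out data each epoch):

```python
# Stand-in "validation losses": a curve that falls and then slowly rises,
# mimicking the generalization curve above.
val_losses = [1.0 / (e + 1) + max(0, e - 25) * 0.01 for e in range(400)]

best_loss = float("inf")
patience, bad_epochs = 10, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0   # improvement: reset the counter
    else:
        bad_epochs += 1                       # no improvement this epoch
    if bad_epochs >= patience:
        print(f"stopping early at epoch {epoch}, best loss {best_loss:.4f}")
        break
```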

2 Regularization

William of Ockham was a 14th-century friar and philosopher who valued simplicity. He held that scientists should prefer simpler formulas or theories over more complex ones. Occam's razor applies to machine learning as follows: the simpler a machine learning model is, the more likely a good empirical result reflects something beyond the peculiarities of the sample.

A regularized cost function generally looks like this:

$$ C = C_0 + \frac{\lambda}{n}\,\Omega(w) $$

*Note: $C_0$ is the original cost function, and the second term is the **regularization term**, which measures the complexity of the model (L1 and L2 regularization are the common choices). Here $\lambda > 0$ is called the regularization parameter and $n$ is the size of the training set; $\mathrm{sgn}(w)$, which appears below, is the sign of $w$: $+1$ when $w$ is positive and $-1$ when $w$ is negative.*

2.1 L1 Regularization

The L1-regularized cost function is:

$$ C = C_0 + \frac{\lambda}{n} \sum_w |w| \tag{2} $$

This method adds the sum of the absolute values of the weights to the unregularized cost function.

Differentiating equation (2) gives:

$$ \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w) \tag{3} $$

$$ \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b} \tag{4} $$

Equation (4) shows that the update rule for the bias $b$ is unchanged.

From equation (3), the L1-regularized update rule for a weight $w$ is:

$$ w \rightarrow w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\,\frac{\partial C_0}{\partial w} $$

where $\eta$ is the learning rate.

2.2 L2 Regularization

The L2-regularized cost function is:

$$ C = C_0 + \frac{\lambda}{2n} \sum_w w^2 \tag{8} $$

This method adds the sum of the squares of the weights (scaled by $\lambda/2n$) to the unregularized cost function.

Differentiating equation (8) gives:

$$ \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,w \tag{9} $$

$$ \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b} \tag{10} $$

Equation (10) shows that the update rule for the bias $b$ is again unchanged.

From equation (9), the L2-regularized update rule for a weight $w$ is:

$$ w \rightarrow \left(1 - \frac{\eta\lambda}{n}\right) w - \eta\,\frac{\partial C_0}{\partial w} $$

The factor $\left(1 - \frac{\eta\lambda}{n}\right)$ rescales the weight toward 0 on every step, which is why L2 regularization is also known as *weight decay*.

2.3 Differences Between L1 and L2

In L1 regularization, the weights shrink toward 0 by a constant amount each step; in L2 regularization, they shrink by an amount proportional to $w$.

As a result, L1 regularization drives some weights exactly to 0, which effectively reduces the dimensionality of the model; L2 only pushes some weights close to 0 and cannot achieve this "dimensionality reduction".
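The contrast is easy to see numerically. The following standalone sketch (made-up values, unrelated to the network2 code) applies each update rule repeatedly while ignoring the data-gradient term $\partial C_0 / \partial w$:

```python
import numpy as np

eta, lmbda, n = 0.5, 5.0, 1000     # learning rate, lambda, training set size
step = eta * lmbda / n             # constant L1 shrink per update: 0.0025
w_l1 = w_l2 = 0.05                 # a small starting weight

for _ in range(100):
    # L1: subtract a constant amount, clamping at zero once it is crossed.
    w_l1 = max(abs(w_l1) - step, 0.0) * np.sign(w_l1)
    # L2: multiply by the weight-decay factor (1 - eta * lmbda / n).
    w_l2 = (1 - step) * w_l2

print(w_l1)   # exactly 0.0 after about 20 steps
print(w_l2)   # ~0.039: close to zero, but never exactly zero
```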

2.4 L2 Regularization Code Example

For the code download, see section 1.4 of 《基于感知机的手写数字识别神经网络》.

Run the following code:

```python
"""
overfitting
~~~~~~~~~~~

Plot graphs to illustrate the problem of overfitting.  
"""

# Standard library
import json
import random
import sys

# My library
sys.path.append('../src/')
import src.mnist_loader as mnist_loader
import src.network2 as network2

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np


def main(filename, num_epochs,
         training_cost_xmin=200, 
         test_accuracy_xmin=200, 
         test_cost_xmin=0, 
         training_accuracy_xmin=0,
         training_set_size=1000, 
         lmbda=0.0):
    """``filename`` is the name of the file where the results will be
    stored.  ``num_epochs`` is the number of epochs to train for.
    ``training_set_size`` is the number of images to train on.
    ``lmbda`` is the regularization parameter.  The other parameters
    set the epochs at which to start plotting on the x axis.
    """
    run_network(filename, num_epochs, training_set_size, lmbda)
    make_plots(filename, num_epochs, 
               training_cost_xmin,
               test_accuracy_xmin,
               test_cost_xmin, 
               training_accuracy_xmin,
               training_set_size)
                       
def run_network(filename, num_epochs, training_set_size=1000, lmbda=0.0):
    """Train the network for ``num_epochs`` on ``training_set_size``
    images, and store the results in ``filename``.  Those results can
    later be used by ``make_plots``.  Note that the results are stored
    to disk in large part because it's convenient not to have to
    ``run_network`` each time we want to make a plot (it's slow).

    """
    # Make results more easily reproducible
    random.seed(12345678)
    np.random.seed(12345678)
    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost())
    net.large_weight_initializer()
    test_cost, test_accuracy, training_cost, training_accuracy \
        = net.SGD(training_data[:training_set_size], num_epochs, 10, 0.5,
                  evaluation_data=test_data, lmbda = lmbda,
                  monitor_evaluation_cost=True, 
                  monitor_evaluation_accuracy=True, 
                  monitor_training_cost=True, 
                  monitor_training_accuracy=True)
    f = open(filename, "w")
    json.dump([test_cost, test_accuracy, training_cost, training_accuracy], f)
    f.close()

def make_plots(filename, num_epochs, 
               training_cost_xmin=200, 
               test_accuracy_xmin=200, 
               test_cost_xmin=0, 
               training_accuracy_xmin=0,
               training_set_size=1000):
    """Load the results from ``filename``, and generate the corresponding
    plots. """
    f = open(filename, "r")
    test_cost, test_accuracy, training_cost, training_accuracy \
        = json.load(f)
    f.close()
    plot_training_cost(training_cost, num_epochs, training_cost_xmin)
    plot_test_accuracy(test_accuracy, num_epochs, test_accuracy_xmin)
    plot_test_cost(test_cost, num_epochs, test_cost_xmin)
    plot_training_accuracy(training_accuracy, num_epochs, 
                           training_accuracy_xmin, training_set_size)
    plot_overlay(test_accuracy, training_accuracy, num_epochs,
                 min(test_accuracy_xmin, training_accuracy_xmin),
                 training_set_size)

def plot_training_cost(training_cost, num_epochs, training_cost_xmin):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(np.arange(training_cost_xmin, num_epochs), 
            training_cost[training_cost_xmin:num_epochs],
            color='#2A6EA6')
    ax.set_xlim([training_cost_xmin, num_epochs])
    ax.grid(True)
    ax.set_xlabel('Epoch')
    ax.set_title('Cost on the training data')
    plt.show()

def plot_test_accuracy(test_accuracy, num_epochs, test_accuracy_xmin):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(np.arange(test_accuracy_xmin, num_epochs), 
            [accuracy/100.0 
             for accuracy in test_accuracy[test_accuracy_xmin:num_epochs]],
            color='#2A6EA6')
    ax.set_xlim([test_accuracy_xmin, num_epochs])
    ax.grid(True)
    ax.set_xlabel('Epoch')
    ax.set_title('Accuracy (%) on the test data')
    plt.show()

def plot_test_cost(test_cost, num_epochs, test_cost_xmin):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(np.arange(test_cost_xmin, num_epochs), 
            test_cost[test_cost_xmin:num_epochs],
            color='#2A6EA6')
    ax.set_xlim([test_cost_xmin, num_epochs])
    ax.grid(True)
    ax.set_xlabel('Epoch')
    ax.set_title('Cost on the test data')
    plt.show()

def plot_training_accuracy(training_accuracy, num_epochs, 
                           training_accuracy_xmin, training_set_size):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(np.arange(training_accuracy_xmin, num_epochs), 
            [accuracy*100.0/training_set_size 
             for accuracy in training_accuracy[training_accuracy_xmin:num_epochs]],
            color='#2A6EA6')
    ax.set_xlim([training_accuracy_xmin, num_epochs])
    ax.grid(True)
    ax.set_xlabel('Epoch')
    ax.set_title('Accuracy (%) on the training data')
    plt.show()

def plot_overlay(test_accuracy, training_accuracy, num_epochs, xmin,
                 training_set_size):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Slice from ``xmin`` so the x and y sequences always have the same length.
    ax.plot(np.arange(xmin, num_epochs),
            [accuracy/100.0
             for accuracy in test_accuracy[xmin:num_epochs]],
            color='#2A6EA6',
            label="Accuracy on the test data")
    ax.plot(np.arange(xmin, num_epochs),
            [accuracy*100.0/training_set_size
             for accuracy in training_accuracy[xmin:num_epochs]],
            color='#FFA933',
            label="Accuracy on the training data")
    ax.grid(True)
    ax.set_xlim([xmin, num_epochs])
    ax.set_xlabel('Epoch')
    ax.set_ylim([90, 100])
    plt.legend(loc="lower right")
    plt.show()

if __name__ == "__main__":
    filename = input("Enter a file name: ")
    num_epochs = int(input(
        "Enter the number of epochs to run for: "))
    training_cost_xmin = int(input(
        "training_cost_xmin (suggest 200): "))
    test_accuracy_xmin = int(input(
        "test_accuracy_xmin (suggest 200): "))
    test_cost_xmin = int(input(
        "test_cost_xmin (suggest 0): "))
    training_accuracy_xmin = int(input(
        "training_accuracy_xmin (suggest 0): "))
    training_set_size = int(input(
        "Training set size (suggest 1000): "))
    lmbda = float(input(
        "Enter the regularization parameter, lambda (suggest: 5.0): "))
    main(filename, num_epochs, training_cost_xmin, 
         test_accuracy_xmin, test_cost_xmin, training_accuracy_xmin,
         training_set_size, lmbda)

```

The output is as follows:

```
Enter a file name: overfitting
Enter the number of epochs to run for: 400
training_cost_xmin (suggest 200): 200
test_accuracy_xmin (suggest 200): 200
test_cost_xmin (suggest 0): 0
training_accuracy_xmin (suggest 0): 0
Training set size (suggest 1000): 1000
Enter the regularization parameter, lambda (suggest: 5.0): 0.1

```

This produces the following results:

myplot_2

myplot_1

As the plots show, using regularization noticeably suppresses overfitting.

Notes

Weight decay factor: in L2 regularization, when a larger training set is introduced, the regularization parameter λ needs to grow along with n; otherwise the weight-decay factor $\left(1 - \frac{\eta\lambda}{n}\right)$ approaches 1 and the regularization loses its effect.
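A worked example with the hyperparameters used above: with $\eta = 0.5$, $\lambda = 5.0$ and $n = 1000$, the decay factor is $1 - \frac{0.5 \times 5.0}{1000} = 0.9975$; with the full MNIST training set ($n = 50000$) and the same $\lambda$, it becomes $0.99995$, so almost no decay happens per step unless $\lambda$ is scaled up as well.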

Another Technique for Preventing Overfitting: Dropout

Dropout is a fairly radical technique. Unlike regularization, dropout does not rely on modifying the cost function; instead, we modify the network itself.

The idea behind dropout is that many heads are better than one (as the saying goes, three cobblers together beat one Zhuge Liang): a single neural network is prone to overfitting, so let several networks "vote" on the result. Dropout does not actually build multiple networks, though. Within a single network, each training pass randomly deactivates some of the neurons in each hidden layer (together with the weights attached to them), which amounts to training networks of different structures inside the same network. In the end we obtain a network that behaves like a combination of many "narrow" networks and is therefore more "well-rounded" as a whole. Dropout is not without controversy, and we will not debate it here.
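Below is a minimal standalone sketch of dropout applied to one layer's activations. Note this is the "inverted" variant commonly used today, which rescales during training; Nielsen's book instead halves the outgoing weights at test time, but the effect is equivalent in expectation:

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, training=True):
    """Inverted dropout: during training, zero each activation with
    probability 1 - keep_prob and scale the survivors by 1/keep_prob,
    so the expected activation is unchanged at test time."""
    if not training:
        return activations                  # the full network is used at test time
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

hidden = np.random.randn(30)                # e.g. a 30-neuron hidden layer
print(dropout_forward(hidden))              # roughly half the entries are zeroed
```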

References: