拟合 Logistic 模型 · UCB DS100 数据科学的原理与技巧

# 拟合 Logistic 模型 > 原文：[https://www.textbook.ds100.org/ch/17/classification_sgd.html](https://www.textbook.ds100.org/ch/17/classification_sgd.html) ``` # HIDDEN # Clear previously defined variables %reset -f # Set directory for data loading to work properly import os os.chdir(os.path.expanduser('~/notebooks/17')) ``` ``` # HIDDEN import warnings # Ignore numpy dtype warnings. These warnings are caused by an interaction # between numpy and Cython and can be safely ignored. # Reference: https://stackoverflow.com/a/40846742 warnings.filterwarnings("ignore", message="numpy.dtype size changed") warnings.filterwarnings("ignore", message="numpy.ufunc size changed") import numpy as np import matplotlib.pyplot as plt import pandas as pd import seaborn as sns %matplotlib inline import ipywidgets as widgets from ipywidgets import interact, interactive, fixed, interact_manual import nbinteract as nbi sns.set() sns.set_context('talk') np.set_printoptions(threshold=20, precision=2, suppress=True) pd.options.display.max_rows = 7 pd.options.display.max_columns = 8 pd.set_option('precision', 2) # This option stops scientific notation for pandas # pd.set_option('display.float_format', '{:.2f}'.format) ``` 之前，我们讨论了批量梯度下降，这是一种迭代更新$\BoldSymbol \Theta 的算法，以查找损失最小化参数$\BoldSymbol \Theta$。讨论了随机梯度下降和小批量梯度下降，利用统计理论和并行硬件的方法来减少训练梯度下降算法的时间。在本节中，我们将把这些概念应用到逻辑回归中，并使用 SciKit 学习函数遍历示例。 ### 批梯度下降批梯度下降的一般更新公式如下： $$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \cdot \nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}^{(t)}, \textbf{X}, \textbf{y}) $$ 在逻辑回归中，我们使用交叉熵损失作为我们的损失函数： $$ L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(-y_i \ln \left(f_{\boldsymbol{\theta}} \left(\textbf{X}_i \right) \right) - \left(1 - y_i \right) \ln \left(1 - f_{\boldsymbol{\theta}} \left(\textbf{X}_i \right) \right) \right) $$ 交叉熵损失的梯度为$\nabla_ \boldsymbol \theta l（\boldsymbol \theta、\textbf x、\textbf y）=-\frac 1 n \sum i=1 ^n（y \sigma x u i$。把它插入到更新公式中，我们就可以找到特定于逻辑回归的梯度下降算法。让$\sigma_i=f_ \boldSymbol \theta（\textbf x u i）=\sigma（\textbf x u i \cdot \boldSymbol \theta）$： $$ \begin{align} \boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \cdot \left(- \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma_i\right) \textbf{X}_i \right) \\ &= \boldsymbol{\theta}^{(t)} + \alpha \cdot \left(\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma_i\right) \textbf{X}_i \right) \end{align} $$ * $\BoldSymbol \Theta ^（t）$是迭代$t 时当前对$\BoldSymbol \Theta$的估计$ * $\alpha$是学习率 * $-\frac 1 n \ sum i=1 n \ left（y i-\sigma i \ right）textbf x u i$是交叉熵损失的梯度 * $\BoldSymbol \Theta（t+1）$是通过减去\Alpha$的乘积和以\BoldSymbol \Theta（t）计算的交叉熵损失的下一个估计数。$ ### 随机梯度下降随机梯度下降使用单个数据点的损失梯度近似所有观测的损失函数的梯度。一般更新公式如下，其中$\ell（\boldsymbol \theta，\textbf x u i，y_i）$是单个数据点的损失函数： $$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) $$ 回到我们在逻辑回归中的例子，我们使用一个数据点的交叉熵损失梯度来近似所有数据点的交叉熵损失梯度。如下所示，其中$\sigma_i=f \boldsymbol \theta（\textbf x u i）=\sigma（\textbf x u i\cdot\boldsymbol \theta）$。 $$ \begin{align} \nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) &\approx \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)\\ &= -(y_i - \sigma_i)\textbf{X}_i \end{align} $$ 当我们把这个近似值代入随机梯度下降的一般公式时，我们找到了逻辑回归的随机梯度下降更新公式。 $$ \begin{align} \boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) \\ &= \boldsymbol{\theta}^{(t)} + \alpha \cdot (y_i - \sigma_i)\textbf{X}_i \end{align} $$ ### 小批量梯度下降同样，我们可以使用一个随机的数据点样本（称为小批量）来近似所有观测的交叉熵损失梯度。 $$ \nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) \approx \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) $$ 我们将此近似值替换为交叉熵损失的梯度，得出一个特定于逻辑回归的小批量梯度下降更新公式： $$ \begin{align} \boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \cdot -\frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}(y_i - \sigma_i)\textbf{X}_i \\ &= \boldsymbol{\theta}^{(t)} + \alpha \cdot \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}(y_i - \sigma_i)\textbf{X}_i \end{align} $$ ## SciKit Learn[¶](#Implementation-in-Scikit-learn)中的实现 Scikit Learn 的[`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)类提供了随机梯度下降的实现，我们可以通过指定`loss=log`来使用它。由于 SciKit Learn 没有实现批梯度下降的模型，因此我们将比较`SGDClassifier`与[`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)在`emails`数据集上的性能。为了简洁起见，我们省略了特征提取： ``` # HIDDEN emails = pd.read_csv('emails_sgd.csv').sample(frac=0.5) X, y = emails['email'], emails['spam'] X_tr = CountVectorizer().fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_tr, y, random_state=42) y_train = y_train.reset_index(drop=True) y_test = y_test.reset_index(drop=True) ``` ``` log_reg = LogisticRegression(tol=0.0001, random_state=42) stochastic_gd = SGDClassifier(tol=0.0001, loss='log', random_state=42) ``` ``` %%time log_reg.fit(X_train, y_train) log_reg_pred = log_reg.predict(X_test) print('Logistic Regression') print(' Accuracy: ', accuracy_score(y_test, log_reg_pred)) print(' Precision: ', precision_score(y_test, log_reg_pred)) print(' Recall: ', recall_score(y_test, log_reg_pred)) print() ``` ``` Logistic Regression Accuracy: 0.9913793103448276 Precision: 0.974169741697417 Recall: 0.9924812030075187 CPU times: user 3.2 s, sys: 0 ns, total: 3.2 s Wall time: 3.26 s ``` ``` %%time stochastic_gd.fit(X_train, y_train) stochastic_gd_pred = stochastic_gd.predict(X_test) print('Stochastic GD') print(' Accuracy: ', accuracy_score(y_test, stochastic_gd_pred)) print(' Precision: ', precision_score(y_test, stochastic_gd_pred)) print(' Recall: ', recall_score(y_test, stochastic_gd_pred)) print() ``` ``` Stochastic GD Accuracy: 0.9808429118773946 Precision: 0.9392857142857143 Recall: 0.9887218045112782 CPU times: user 93.8 ms, sys: 31.2 ms, total: 125 ms Wall time: 119 ms ``` 以上结果表明，与`LogisticRegression`相比，`SGDClassifier`能够在更短的时间内找到解决方案。虽然`SGDClassifier`的评估指标稍差，但我们可以通过调整超参数来提高`SGDClassifier`的性能。此外，这种差异也是数据科学家在现实世界中经常遇到的一种权衡。根据具体情况，数据科学家可能会在较低的运行时或较高的度量标准上赋予更大的价值。 ## 摘要[¶](#Summary) 随机梯度下降是数据科学家用来降低计算成本和运行时间的一种方法。我们可以在 logistic 回归中看到随机梯度下降的值，因为我们只需要计算每次迭代一次观测的交叉熵损失梯度，而不需要计算每次批量梯度下降观测的交叉熵损失梯度。从使用 Scikit Learn 的`SGDClassifier`的示例中，我们观察到随机梯度下降可能会实现稍微差一点的评估指标，但会显著提高运行时。在更大的数据集或更复杂的模型上，运行时的差异可能更大，因此更有价值。