6. 并行 · ApacheCN 深度学习译文集

## 6. 并行 2015 年 11 月发布的第一个 TensorFlow 软件包，已准备好在具有可用 GPU 的服务器上运行，并同时在其中执行训练操作。 2016 年 2 月，更新添加了分布式和并行化处理的功能。在这个简短的章节中，我将介绍如何使用 GPU。对于那些想要了解这些设备如何工作的读者，有些参考文献将在上一节中给出。但是，鉴于本书的介绍性，我不会详细介绍分布式版本，但对于那些感兴趣的读者，一些参考将在上一节中给出。 ### 带有 GPU 的执行环境支持 GPU 的 TensorFlow 软件包需要 CudaToolkit 7.0 和 CUDNN 6.5 V2。对于安装环境，我们建议访问 cuda 安装 [44] 网站，为了不会深入细节，同时信息也是最新的。在 TensorFlow 中引用这些设备的方法如下： + `/cpu:0`：引用服务器的 CPU。 + `/gpu:0`：服务器的 GPU（如果只有一个可用）。 + `/gpu:1`：服务器的第二个 GPU，依此类推。要知道我们的操作和张量分配在哪些设备中，我们需要创建一个`sesion`，选项`log_device_placement`为`True`。我们在下面的例子中看到它： ```py import tensorflow as tf a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a') b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b') c = tf.matmul(a, b) sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) printsess.run(c) ``` 当读者在计算机中测试此代码时，应出现类似的输出： ``` . . . Device mapping: /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus id: 0000:08:00.0 . . . b: /job:localhost/replica:0/task:0/gpu:0 a: /job:localhost/replica:0/task:0/gpu:0 MatMul: /job:localhost/replica:0/task:0/gpu:0 … [[ 22.28.] [ 49.64.]] … ``` 此外，使用操作的结果，它通知我们每个部分的执行位置。如果我们想要在特定设备中执行特定操作，而不是让系统自动选择设备，我们可以使用变量`tf.device`来创建设备上下文，因此所有操作都在上下文将分配相同的设备。如果我们在系统中拥有更多 GPU，则默认情况下将选择具有较低标识符的 GPU。如果我们想要在不同的 GPU 中执行操作，我们必须明确指定它。例如，如果我们希望先前的代码在 GPU#2 中执行，我们可以使用`tf.device('/gpu:2')`，如下所示： ```py import tensorflow as tf with tf.device('/gpu:2'): a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a') b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b') c = tf.matmul(a, b) sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) printsess.run(c) ``` ### 多个 GPU 的并行如果我们有更多的 GPU，通常我们希望一起使用它们来并行地解决同样的问题。为此，我们可以构建我们的模型，来在多个 GPU 之间分配工作。我们在下一个例子中看到它： ```py import tensorflow as tf c = [] for d in ['/gpu:2', '/gpu:3']: with tf.device(d): a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3]) b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2]) c.append(tf.matmul(a, b)) with tf.device('/cpu:0'): sum = tf.add_n(c) # Creates a session with log_device_placement set to True. sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) print sess.run(sum) ``` 正如我们所看到的，代码与前一代码相同，但现在我们有 2 个 GPU，由`tf.device`指定，它们执行乘法（两个 GPU 在这里都做同样的操作，以便简化示例代码），稍后 CPU 执行加法。假设我们将`log_device_placement`设置为`true`，我们可以在输出中看到，操作如何分配给我们的设备 [45]。 ```py . . . Device mapping: /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c /job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K40c /job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla K40c /job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla K40c . . . . . . Const_3: /job:localhost/replica:0/task:0/gpu:3 I tensorflow/core/common_runtime/simple_placer.cc:289] Const_3: /job:localhost/replica:0/task:0/gpu:3 Const_2: /job:localhost/replica:0/task:0/gpu:3 I tensorflow/core/common_runtime/simple_placer.cc:289] Const_2: /job:localhost/replica:0/task:0/gpu:3 MatMul_1: /job:localhost/replica:0/task:0/gpu:3 I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul_1: /job:localhost/replica:0/task:0/gpu:3 Const_1: /job:localhost/replica:0/task:0/gpu:2 I tensorflow/core/common_runtime/simple_placer.cc:289] Const_1: /job:localhost/replica:0/task:0/gpu:2 Const: /job:localhost/replica:0/task:0/gpu:2 I tensorflow/core/common_runtime/simple_placer.cc:289] Const: /job:localhost/replica:0/task:0/gpu:2 MatMul: /job:localhost/replica:0/task:0/gpu:2 I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul: /job:localhost/replica:0/task:0/gpu:2 AddN: /job:localhost/replica:0/task:0/cpu:0 I tensorflow/core/common_runtime/simple_placer.cc:289] AddN: /job:localhost/replica:0/task:0/cpu:0 [[44.56.] [98.128.]] . . . ``` ### GPU 的代码示例为了总结这一简短的章节，我们提供了一段代码，其灵感来自 DamienAymeric 在 Github [46] 中共享的代码，计算`An + Bn`，`n=10`，使用 Python `datetime`包，将 1 GPU 的执行时间与 2 个 GPU 进行比较。首先，我们导入所需的库： ```py import numpy as np import tensorflow as tf import datetime ``` 我们使用`numpy`包创建两个带随机值的矩阵： ```py A = np.random.rand(1e4, 1e4).astype('float32') B = np.random.rand(1e4, 1e4).astype('float32') n = 10 ``` 然后，我们创建两个结构来存储结果： ```py c1 = [] c2 = [] ``` 接下来，我们定义`matpow()`函数，如下所示： ```py defmatpow(M, n): if n < 1: #Abstract cases where n < 1 return M else: return tf.matmul(M, matpow(M, n-1)) ``` 正如我们所见，要在单个 GPU 中执行代码，我们必须按如下方式指定： ```py with tf.device('/gpu:0'): a = tf.constant(A) b = tf.constant(B) c1.append(matpow(a, n)) c1.append(matpow(b, n)) with tf.device('/cpu:0'): sum = tf.add_n(c1) t1_1 = datetime.datetime.now() with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess: sess.run(sum) t2_1 = datetime.datetime.now() ``` 对于 2 个 GPU 的情况，代码如下： ```py with tf.device('/gpu:0'): #compute A^n and store result in c2 a = tf.constant(A) c2.append(matpow(a, n)) with tf.device('/gpu:1'): #compute B^n and store result in c2 b = tf.constant(B) c2.append(matpow(b, n)) with tf.device('/cpu:0'): sum = tf.add_n(c2) #Addition of all elements in c2, i.e. A^n + B^n t1_2 = datetime.datetime.now() with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess: # Runs the op. sess.run(sum) t2_2 = datetime.datetime.now() ``` 最后，我们打印计算时间的结果： ```py print "Single GPU computation time: " + str(t2_1-t1_1) print "Multi GPU computation time: " + str(t2_2-t1_2) ``` ### TensorFlow 的分布式版本正如我之前在本章开头所说，2016 年 2 月，Google 发布了 TensorFlow 的分布式版本，该版本由 gRPC 支持，这是一个用于进程间通信的高性能开源 RPC 框架（TensorFlow 服务使用的相同协议）。对于它的用法，必须构建二进制文件，因为此时包只提供源代码。鉴于本书的介绍范围，我不会在分布式版本中解释它，但如果读者想要了解它，我建议从 TensorFlow 的分布式版本的官网开始 [47]。与前面的章节一样，本书中使用的代码可以在本书的 Github [48] 中找到。我希望本章足以说明如何使用 GPU 加速代码。