# 3.10. Examples

> Original: [http://numba.pydata.org/numba-doc/latest/cuda/examples.html](http://numba.pydata.org/numba-doc/latest/cuda/examples.html)

## 3.10.1. Matrix multiplication

Here is a naive implementation of matrix multiplication using a CUDA kernel:

```py
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
```

This implementation is straightforward and intuitive but performs poorly, because the same matrix elements are loaded from device memory many times, which is slow (some devices may have a transparent data cache, but it is unlikely to be large enough to hold the entire inputs at once). A host-side launch sketch for both kernels is given at the end of this section.

It will be faster if we use a blocked algorithm to reduce accesses to device memory. CUDA provides fast [shared memory](memory.html#cuda-shared-memory) for threads in a block to cooperatively compute on a task. The following implements a faster version of square matrix multiplication using shared memory:

```py
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define arrays in shared memory.
    # The size and type of the arrays must be known at compile time.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary.
        # (Assumes the matrix dimensions are multiples of TPB, so a
        # block's threads either all return here or all reach the
        # syncthreads() calls below.)
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory.
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading.
        cuda.syncthreads()

        # Compute the partial product on the shared memory.
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing.
        cuda.syncthreads()

    C[x, y] = tmp
```

Because shared memory is a limited resource, the code preloads one small block at a time from the input arrays. It then calls [`syncthreads()`](../cuda-reference/kernel.html#numba.cuda.syncthreads "numba.cuda.syncthreads") to wait until all threads have finished preloading before doing the computation on the shared memory. It synchronizes again after the computation to ensure all threads have finished with the data in shared memory before it is overwritten in the next loop iteration.
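As a usage illustration (not part of the original page), here is a minimal host-side sketch that launches the naive `matmul` kernel; the matrix size `N = 128` and the 16x16 block shape are arbitrary choices for the sketch:

```py
import math

import numpy as np
from numba import cuda

N = 128  # arbitrary square matrix size for this sketch
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# Copy the inputs to the device and allocate the output there.
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.device_array((N, N), dtype=np.float32)

# A 2D grid of 2D blocks, sized so every element of C gets a thread.
threadsperblock = (16, 16)
blockspergrid = (math.ceil(N / threadsperblock[0]),
                 math.ceil(N / threadsperblock[1]))

matmul[blockspergrid, threadsperblock](d_A, d_B, d_C)
C = d_C.copy_to_host()

# Loose tolerance: float32 accumulation order differs from NumPy's BLAS.
np.testing.assert_allclose(C, A @ B, rtol=1e-4)
```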
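The tiled `fast_matmul` kernel, as written, loads whole TPBxTPB tiles without per-element bounds checks, so the sketch below (reusing `d_A`, `d_B`, `d_C`, and `N` from the previous sketch) assumes `TPB` divides the matrix dimensions and sizes the grid to tile `C` exactly:

```py
# TPB must divide N: each block loads whole TPBxTPB tiles of A and B.
assert N % TPB == 0

threadsperblock = (TPB, TPB)
blockspergrid = (N // TPB, N // TPB)

fast_matmul[blockspergrid, threadsperblock](d_A, d_B, d_C)
C = d_C.copy_to_host()

np.testing.assert_allclose(C, A @ B, rtol=1e-4)
```

With this tiling, each element of `A` and `B` is read from device memory roughly N / TPB times instead of N times, a TPB-fold reduction in global loads, which is where the speedup comes from.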