TensorFlow和pytorch中的pin_memory和non_blocking设置是做什么的，又是否有用？？？-CFANZ编程社区

因为最早接触CUDA是大学时代，至今已经十多年了，有些东西用习惯了、时间久了就不太care了，最近由于工作原因又搞了会CUDA和深度学习的框架，看到pin_memory和non_blocking这两个参数设置，每次看到都想写些分析的technical report，最近由于疫情窝在家也正好是旧事重提，便有了本post。

====================================================

pin_memory的设置是几乎所有深度学习框架dataloader中的参数，而non_blocking参数主要在pytorch中有发现使用。

其实对cuda编程有些了解的人对这两个参数从字面上就可以理解个大概。

首先说下pin_memory ：

相关介绍可以看：https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

上面链接地址中有一个比较形象的图：

TensorFlow和pytorch中的pin_memory和non_blocking设置是做什么的，又是否有用？？？_杂谈

解释一下：

在主机内存中可以把available内存分配为pageable memory和pinned memory两种，pageable memory就是可以页置换的内存。如果了解虚拟内存的人就会知道当计算机available memory空间不够的情况下就会将used memory中的部分内存空间置换到硬盘上，这部分可以被置换的（被保存到硬盘空间以腾出一定的可用物理内存空间）就是pageable memory，与此相对的就是pinned memory内存空间。作为已分配使用的内存，pinned memory是不会被置换到硬盘空间上的，也就是说一旦一段内存空间被分配为pinned memory，那么这段物理内存就会被所申请的应用所独占，而不会被操作系统通过页置换而重新分配给其他应用（进程）。在主机host和GPU（device）之间进行数据传输为块传输方式，也就是说host端的CPU发出传输指令和需要传输的物理内存地址给device，然后device根据收到的指令将host端所指定的物理内存中的地址copy到device中的显存中，而该种传输方式不可以避免的就是host端待传输数据的那段物理内存是需要被本进程所独占的，因此这一段内存空间必须是pinned memory。

在CUDA编程中可以手动的把CPU端准备传输到GPU的那部分host内存空间指定为pinned memory，这样GPU端执行copy时就可以直接从这段host的内存中copy；但是如果没有手动指定待拷贝的host上的物理内存为pinned memory那么这段内存便是pageable memory，那么在这种情况下执行copy就需要CPU上操作向操作系统隐式的申请一段临时的pinned memory，然后CPU将待拷贝的pageable memory中的数据copy到临时申请的pinned memory中然后发送命令给GPU来从这段临时的pinned memory中copy数据。

从上面的copy过程中可以看到，如果不手动在host端指定pinned memory的话，host和device之间的数据拷贝每次都需要创建一个新的临时pinned memory，然后再把pageable memory中的数据拷贝到pinned memory，而在这个过程中pinned memory的申请和pageable memory与pinned memory之间的数据互copy都是较大的时间花费的，因此为了提高CUDA程序的运行效率可以手动将待传输的数据指定为pinned memory。

pinned memory的缺点：

在不考虑编写代码时单独指定的操作花费，那么pinned memory的唯一缺点就是浪费内存，因为一段被指定为pinned memory的物理内存空间是不允许其他应用复用的，只能该申请创建的进程所独占。

pinned memory的时代局限性：

上面分析了，pinned memory的优点是提高异构设备间数据拷贝的效率，缺点是导致host端内存的利用效率降低。但是这些优点和缺点的分析都是根据几十年前的资料所得出的，根据How to Optimize Data Transfers in CUDA C/C++中的数据显示，pinned memory可以极大提高host与device之间数据的拷贝速度，在NVIDIA 4200M型号的GPU上效率表现如下：

代码：

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", 
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

void profileCopies(float        *h_a, 
                   float        *h_b, 
                   float        *d, 
                   unsigned int  n,
                   char         *desc)
{
  printf("\n%s transfers\n", desc);

  unsigned int bytes = n * sizeof(float);

  // events for timing
  cudaEvent_t startEvent, stopEvent; 

  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(d, h_a, bytes, cudaMemcpyHostToDevice) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  float time;
  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Host to Device bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(h_b, d, bytes, cudaMemcpyDeviceToHost) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Device to Host bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  for (int i = 0; i < n; ++i) {
    if (h_a[i] != h_b[i]) {
      printf("*** %s transfers failed ***\n", desc);
      break;
    }
  }

  // clean up events
  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
}

int main()
{
  unsigned int nElements = 4*1024*1024;
  const unsigned int bytes = nElements * sizeof(float);

  // host arrays
  float *h_aPageable, *h_bPageable;   
  float *h_aPinned, *h_bPinned;

  // device array
  float *d_a;

  // allocate and initialize
  h_aPageable = (float*)malloc(bytes);                    // host pageable
  h_bPageable = (float*)malloc(bytes);                    // host pageable
  checkCuda( cudaMallocHost((void**)&h_aPinned, bytes) ); // host pinned
  checkCuda( cudaMallocHost((void**)&h_bPinned, bytes) ); // host pinned
  checkCuda( cudaMalloc((void**)&d_a, bytes) );           // device

  for (int i = 0; i < nElements; ++i) h_aPageable[i] = i;      
  memcpy(h_aPinned, h_aPageable, bytes);
  memset(h_bPageable, 0, bytes);
  memset(h_bPinned, 0, bytes);

  // output device info and transfer size
  cudaDeviceProp prop;
  checkCuda( cudaGetDeviceProperties(&prop, 0) );

  printf("\nDevice: %s\n", prop.name);
  printf("Transfer size (MB): %d\n", bytes / (1024 * 1024));

  // perform copies and report bandwidth
  // profileCopies(h_aPageable, h_bPageable, d_a, nElements, "Pageable");
  // profileCopies(h_aPinned, h_bPinned, d_a, nElements, "Pinned");
  //
  char a[] = "Pageable";
  char b[] = "Pinned";
  // char *a = "Pageable";
  // char *b = "Pinned";
  profileCopies(h_aPageable, h_bPageable, d_a, nElements, a);
  profileCopies(h_aPinned, h_bPinned, d_a, nElements, b);

  printf("n");

  // cleanup
  cudaFree(d_a);
  cudaFreeHost(h_aPinned);
  cudaFreeHost(h_bPinned);
  free(h_aPageable);
  free(h_bPageable);

  return 0;
}