Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions for allocating GPU-accessible memory. However, there has long been an obstacle with these API functions: they aren't stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. All of this helps you improve performance within your existing applications. The following code example on the left is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown on the right.
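A minimal sketch of the two patterns just described. The kernel names (kernelA, kernelB), sizes, and launch configuration are placeholders, not part of any real API:

```cuda
// Inefficient: each cudaFree implicitly synchronizes the device,
// stalling the stream before the memory can be released and reallocated.
cudaMalloc(&ptrA, sizeA);
kernelA<<<grid, block, 0, stream>>>(ptrA);
cudaFree(ptrA);  // waits for kernelA to finish before freeing
cudaMalloc(&ptrB, sizeB);
kernelB<<<grid, block, 0, stream>>>(ptrB);
cudaFree(ptrB);

// More efficient: allocate once upfront, sized to the larger of the
// two uses, and reuse the buffer without synchronizing between kernels.
cudaMalloc(&ptr, max(sizeA, sizeB));
kernelA<<<grid, block, 0, stream>>>(ptr);
kernelB<<<grid, block, 0, stream>>>(ptr);
cudaFree(ptr);
```

The second version avoids the device synchronization, but at the cost described next: the allocation now outlives the work that uses it.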
This increases code complexity in the application because the memory management code is separated from the business logic. The problem is exacerbated when other libraries are involved. This is much harder for the application to make efficient because it may not have full visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when a function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application the use of that memory. Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.
CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA. All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.
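A sketch of the function-scope pattern described above. The library function name, kernel, size, and launch configuration are illustrative placeholders:

```cuda
// With stream-ordered allocation, a library function can scope an
// allocation to exactly the GPU work that uses it, on the caller's
// stream, without ever synchronizing the device.
void libraryFuncA(cudaStream_t stream, size_t size) {
    void* ptr;
    cudaMallocAsync(&ptr, size, stream);       // ordered on stream
    kernelA<<<grid, block, 0, stream>>>(ptr);  // runs after the allocation
    cudaFreeAsync(ptr, stream);                // ordered after kernelA
}
```

The caller submits work on the same stream before and after this call, and no implicit device-wide synchronization occurs; the allocation's lifetime is bounded by stream order rather than by host-side calls.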
In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order. The following example shows various valid usages. Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation. Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order. The CUDA driver uses memory pools to achieve this behavior of returning a pointer immediately.
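One valid cross-stream usage, sketched under the rules above: the deallocation happens on a different stream than the allocation, with an event establishing the required ordering after all accesses. Stream, event, and kernel names are placeholders, and the error check illustrates the synchronous reporting just described:

```cuda
void* ptr;
cudaError_t err = cudaMallocAsync(&ptr, size, streamA);
if (err != cudaSuccess) {
    // Allocation errors (e.g., out of memory) are reported here,
    // synchronously, never as a later asynchronous failure.
    return;
}
kernelA<<<grid, block, 0, streamA>>>(ptr);  // access ordered after allocation

cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event);        // order streamB after kernelA
kernelB<<<grid, block, 0, streamB>>>(ptr);  // access from another stream

// Freeing on streamB is valid: it is ordered after the allocation
// and after all accesses on all streams.
cudaFreeAsync(ptr, streamB);
```

Without the cudaStreamWaitEvent call, kernelB's access would not be ordered after the allocation and the program would be invalid.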