Beyond GPU Memory Limits with Unified Memory on Pascal
Modern computer architectures have a hierarchy of memories of varying size and performance. GPU architectures are approaching a terabyte per second of memory bandwidth that, coupled with high-throughput computational cores, creates an ideal device for data-intensive tasks. However, everybody knows that fast memory is expensive. Modern applications striving to solve larger and larger problems can be limited by GPU memory capacity. Since the capacity of GPU memory is significantly lower than system memory, it creates a barrier for developers accustomed to programming just one memory space. With the legacy GPU programming model there is no easy way to "just run" your application when you're oversubscribing GPU memory. Even if your dataset is only slightly larger than the available capacity, you would still need to manage the active working set in GPU memory. Unified Memory is a much more intelligent memory management system that simplifies GPU development by providing a single memory space directly accessible by all GPUs and CPUs in the system, with automatic page migration for data locality.
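To make that single memory space concrete, here is a minimal sketch (the kernel, array size, and launch configuration are illustrative choices, not from the original post): one pointer returned by `cudaMallocManaged()` is dereferenced by both host code and a GPU kernel, with no explicit copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: increments every element of the array.
__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;

    // One allocation, one pointer, visible to both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(float));

    // The CPU writes through the pointer...
    for (int i = 0; i < n; i++) data[i] = 1.0f;

    // ...and the GPU reads and writes the same pointer;
    // pages migrate between processors automatically.
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // CPU reads the result directly
    cudaFree(data);
    return 0;
}
```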
Page migration allows the accessing processor to benefit from L2 caching and the lower latency of local memory. Moreover, migrating pages to GPU memory ensures GPU kernels take advantage of the very high bandwidth of GPU memory (e.g. 720 GB/s on a Tesla P100). And page migration is completely invisible to the developer: the system automatically manages all data movement for you. Sounds great, right? With the Pascal GPU architecture, Unified Memory is even more powerful, thanks to Pascal's larger virtual memory address space and Page Migration Engine, which enable true virtual memory demand paging. It's also worth noting that manually managing memory movement is error-prone, which hurts productivity and delays the day when you can finally run your whole code on the GPU to see those great speedups that others are bragging about. Developers can spend hours debugging their codes because of memory coherency issues. Unified Memory brings huge benefits for developer productivity. In this post I will show you how Pascal enables applications to run out of the box with larger memory footprints and achieve great baseline performance.
For a moment you can completely forget about GPU memory limitations while developing your code. Unified Memory was introduced in 2014 with CUDA 6 and the Kepler architecture. This relatively new programming model allowed GPU applications to use a single pointer in both CPU functions and GPU kernels, which greatly simplified memory management. CUDA 8 and the Pascal architecture significantly improve Unified Memory functionality by adding 49-bit virtual addressing and on-demand page migration. The large 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration Engine allows GPU threads to fault on non-resident memory accesses, so the system can migrate pages from anywhere in the system to the GPU's memory on demand for efficient processing. In other words, Unified Memory transparently enables out-of-core computations for any code that is using Unified Memory for allocations (e.g. `cudaMallocManaged()`). It "just works" without any modifications to the application, as the sketch below shows.
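Here is a hedged sketch of oversubscription (the 64 GB figure is arbitrary and assumes the host has enough system memory to back the allocation): the managed allocation is far larger than a 16 GB Tesla P100, yet on Pascal the kernel still runs because non-resident pages are faulted in on demand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch every byte of a managed allocation with a grid-stride loop.
__global__ void touch(char *data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 1;  // a non-resident page faults and migrates on demand
}

int main() {
    // Illustrative size: 64 GB, far beyond a 16 GB Tesla P100.
    size_t bytes = 64ULL << 30;
    char *data;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // On Pascal this kernel runs even though the working set does not fit
    // in GPU memory; the Page Migration Engine pages data in as needed.
    touch<<<1024, 256>>>(data, bytes);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```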
CUDA 8 also adds new ways to optimize data locality by providing hints to the runtime, so it is still possible to take full control over data migrations. These days it's hard to find a high-performance workstation with just one GPU. Two-, four-, and eight-GPU systems are becoming common in workstations as well as large supercomputers. The NVIDIA DGX-1 is one example of a high-performance integrated system for deep learning with 8 Tesla P100 GPUs. If you thought it was difficult to manually manage data between one CPU and one GPU, now you have 8 GPU memory spaces to juggle. Unified Memory is crucial for such systems, and it enables more seamless code development on multi-GPU nodes. Whenever a particular GPU touches data managed by Unified Memory, that data may migrate to the local memory of the processor, or the driver can establish direct access over the available interconnect (PCIe or NVLink).
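CUDA 8 exposes these hints through `cudaMemAdvise()` and `cudaMemPrefetchAsync()`. Below is a minimal sketch of how they compose; the wrapper function and its parameter names are my own for illustration, not part of the CUDA API.

```cuda
#include <cuda_runtime.h>

// Illustrative wrapper: apply locality hints to a managed allocation
// before launching kernels that read it heavily on one GPU.
void tune_locality(float *data, size_t bytes, int gpu_id, cudaStream_t stream) {
    // Mark the data read-mostly: the driver may keep read-only copies
    // on each accessing processor instead of bouncing pages around.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, gpu_id);

    // Prefer to keep the physical pages resident on this GPU.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, gpu_id);

    // Migrate the pages ahead of time, so the first kernel touching them
    // does not pay the page-fault cost.
    cudaMemPrefetchAsync(data, bytes, gpu_id, stream);
}
```

The hints are optional: without them the system still pages data on demand, so you can start with the "just works" behavior and add advice only where profiling shows migration overhead.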