July 2nd, 2024

GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity

Panmnesia's CXL IP lets GPUs expand memory capacity using PCIe-attached memory or SSDs, with lower latency than traditional methods such as UVM, making it promising for AI/HPC workloads. Major vendors like AMD, Intel, and Nvidia could adopt it, but whether they will remains uncertain.

Read original article

Panmnesia's CXL IP gives GPUs a low-latency way to expand memory capacity using PCIe-attached memory or SSDs, addressing the growing memory requirements of AI training datasets. By developing a CXL 3.1-compliant root complex and host bridge, Panmnesia lets a GPU access external memory over PCIe instead of relying on traditional software-managed methods such as Unified Virtual Memory (UVM). In Panmnesia's testing, the solution achieves significantly lower latency and faster execution times than both UVM and a prototype CXL design (CXL-Proto). Companies such as AMD, Intel, and Nvidia could adopt the IP, but it remains unclear whether they will integrate CXL support or develop their own technology in response to the trend of using PCIe-attached memory for GPUs. Either way, Panmnesia's approach shows how CXL-based memory expansion could benefit GPUs in AI and HPC applications.
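
The summary contrasts Panmnesia's CXL path with UVM, the software-managed mechanism GPUs commonly use today to reach beyond physical VRAM. As a rough baseline illustration only (not Panmnesia's IP), here is a minimal CUDA sketch of UVM-style oversubscription with cudaMallocManaged; the sizes are made-up illustration values, and the on-demand page migration this triggers is the overhead a CXL load/store path is meant to reduce.

```cuda
// Minimal sketch of the UVM baseline the article compares against: allocate more
// managed memory than the GPU physically has and let the driver page it in on
// demand. All sizes here are arbitrary illustration values.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;            // touching a page faults it onto the GPU
}

int main() {
    // Pretend the GPU has 16 GiB of VRAM and ask for 24 GiB (oversubscribed).
    size_t bytes = 24ULL << 30;
    size_t n = bytes / sizeof(float);
    float *data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) {
        fprintf(stderr, "managed allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;     // first touch on the CPU

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                           // kernel faults pages to the GPU as it runs

    printf("data[0] = %f\n", data[0]);                 // CPU access migrates pages back
    cudaFree(data);
    return 0;
}
```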

Related

First 128TB SSDs will launch in the coming months

Phison's Pascari brand plans to release 128TB SSDs, competing with Samsung, Solidigm, and Kioxia. These drives target high-performance computing, AI, and data centers, with even larger models expected soon. The X200 PCIe Gen5 enterprise SSDs, built around the CoXProcessor CPU architecture, aim to meet rising storage demand driven by growing data volumes and generative AI adoption.

Testing AMD's Giant MI300X

AMD's Instinct MI300X challenges Nvidia in the GPU compute market. It pairs a chiplet design and Infinity Cache with the CDNA 3 architecture, delivers competitive performance against Nvidia's H100, and excels in local memory bandwidth tests.

AMD MI300X performance compared with Nvidia H100

The AMD MI300X outperforms Nvidia's H100 in cache and latency benchmarks and delivers strong compute throughput, but AI inference performance varies by workload. Real-world performance and ecosystem support remain essential considerations.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores GEMM tuning for AI model optimization on AMD MI300x GPUs, using rocBLAS and hipBLASLt to fine-tune parameters and algorithm selection. The tuning delivers up to a 7.2x increase in throughput along with reduced latency, benefiting large models and overall processing efficiency.
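
The linked post covers the AMD-specific tuning flow through rocBLAS and hipBLASLt; none of that is reproduced here. As a rough sketch of the general idea only (timing a fixed GEMM shape against a range of algorithm choices and keeping the fastest), the snippet below uses cuBLAS's cublasGemmEx on the CUDA side as an analogue; matrix sizes and the algorithm range are illustrative assumptions, and recent cuBLAS releases may ignore the explicit algorithm hint anyway.

```cuda
// Brute-force GEMM algorithm search for one fixed problem shape, as an analogue
// of the GEMM tuning described above. The article itself tunes rocBLAS/hipBLASLt
// on AMD GPUs; the sizes and algorithm range below are illustrative assumptions.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int m = 4096, n = 4096, k = 4096;        // one problem shape to tune for
    float *A, *B, *C;                              // contents don't matter for timing
    cudaMalloc((void **)&A, sizeof(float) * m * k);
    cudaMalloc((void **)&B, sizeof(float) * k * n);
    cudaMalloc((void **)&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float best_ms = 1e30f;
    int best_algo = CUBLAS_GEMM_DEFAULT;

    // Try the default plus the explicit algorithm IDs and keep the fastest one.
    for (int algo = CUBLAS_GEMM_DEFAULT; algo <= CUBLAS_GEMM_ALGO23; ++algo) {
        cudaEventRecord(start);
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, A, CUDA_R_32F, m,
                    B, CUDA_R_32F, k,
            &beta,  C, CUDA_R_32F, m,
            CUBLAS_COMPUTE_32F, (cublasGemmAlgo_t)algo);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        if (st != CUBLAS_STATUS_SUCCESS) continue;  // not every algo supports every shape

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_algo = algo; }
    }
    printf("best algo %d: %.3f ms\n", best_algo, best_ms);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```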

6 comments
By @jauntywundrkind - 4 months
It's CXL, not PCIe. With CXL the latency is much more like a NUMA hop or so, which makes this much more likely to be useful than trying to use host memory over PCIe.

CXL 3.1 was the first spec where they added any way to have a host CPU also be able to share memory (host to host), itself be part of RDMA. It seems like it's not exactly going to look like any other CXL memory device, so it'll take some effort to make other hosts or even the local host be able to take advantage of CXL. https://www.servethehome.com/cxl-3-1-specification-aims-for-...

By @RecycledEle - 4 months
Good job decreasing latency.

Now work on the bandwidth.

A single HBM3 module has the bandwidth of half a dozen data-center-grade PCIe 5.0 x16 NVMe drives.

A single DDR5 DIMM has the bandwidth of a pair of PCIe 5.0 x4 NVMe drives.

By @karmakaze - 4 months
Perhaps this would be a good application for 3D XPoint memory that was seemingly discontinued due to lack of a compelling use case.

By @p1esk - 4 months
Using CPU memory to extend GPU memory seems like a more straightforward approach. Does this method provide any benefits over it?