June 22nd, 2024

Testing AMD's Bergamo: Zen 4c

AMD's Bergamo server CPU, based on Zen 4c cores, prioritizes core count over clock speed for power efficiency and density. It targets cloud providers and parallel applications, and the article pays particular attention to the resulting memory performance trade-offs.

AMD's Bergamo is a server CPU designed to increase core counts using density-focused Zen 4c cores, a strategy similar to those of Intel and Ampere. The Zen 4c variant sacrifices high clock speeds for better power efficiency and area usage, allowing more cores in the same silicon area. Bergamo reuses AMD's Zen 4 server platform but with Zen 4c CCDs, each containing two CCXes with 16 MB of shared L3 cache apiece. The system aims to cater to cloud providers and parallel applications by offering higher core counts. Because so many cores share one memory subsystem, Bergamo's memory bandwidth and latency have a direct impact on application performance. In a dual-socket configuration, Bergamo scales up to 256 cores, but accessing remote memory incurs a latency penalty. Comparisons with older server CPUs like Broadwell and Westmere highlight trade-offs in latency and bandwidth. Core-to-core latency tests reveal the challenges Bergamo faces at high core counts, while benchmarks against other Zen 4 CCD variants show how core count and cache capacity affect memory bandwidth. Overall, Bergamo's design maximizes core count within a given area while balancing power efficiency and performance.
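The core and cache totals above follow from simple multiplication over the chiplet hierarchy. The sketch below assumes 8 cores per CCX, a figure not stated in the summary but implied by the article's remark (quoted in the comments) that 12 Zen 4c dies would give 192 cores, i.e. 16 cores per die split across two CCXes. The `Topology` class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of a chiplet-based socket layout."""
    ccds: int            # compute chiplets (CCDs) per socket
    ccx_per_ccd: int     # core complexes (CCXes) per CCD
    cores_per_ccx: int   # cores sharing one L3 slice
    l3_mb_per_ccx: int   # L3 capacity shared within one CCX, in MB

    def cores_per_socket(self) -> int:
        return self.ccds * self.ccx_per_ccd * self.cores_per_ccx

    def l3_mb_per_socket(self) -> int:
        return self.ccds * self.ccx_per_ccd * self.l3_mb_per_ccx

    def l3_mb_per_core(self) -> float:
        # Each core's share of the L3 slice in its own CCX.
        return self.l3_mb_per_ccx / self.cores_per_ccx


# Bergamo as described: 8 Zen 4c CCDs, two CCXes each, 16 MB L3 per CCX.
# cores_per_ccx=8 is an assumption (192 cores / 12 dies / 2 CCXes).
bergamo = Topology(ccds=8, ccx_per_ccd=2, cores_per_ccx=8, l3_mb_per_ccx=16)

print(bergamo.cores_per_socket())      # 128
print(2 * bergamo.cores_per_socket())  # 256 in a dual-socket system
print(bergamo.l3_mb_per_socket())      # 256
print(bergamo.l3_mb_per_core())        # 2.0

# The hypothetical 12-die Zen 4c configuration mentioned in the article.
print(Topology(12, 2, 8, 16).cores_per_socket())  # 192
```

The per-core figure counts only the 16 MB slice shared within a core's own CCX, which is the capacity a single core's working set actually competes for.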

Related

Unisoc and Xiaomi's 4nm Chips Said to Challenge Qualcomm and MediaTek

UNISOC and Xiaomi collaborate on 4nm chips challenging Qualcomm and MediaTek. UNISOC's chip features X1 big core + A78 middle core + A55 small core with Mali G715 MC7 GPU, offering competitive performance and lower power consumption. Xiaomi's Xuanjie chip includes X3 big core + A715 middle core + A510 small core with IMG CXT 48-1536 GPU, potentially integrating a MediaTek baseband. Xiaomi plans a separate mid-range phone line with Xuanjie chips, aiming to strengthen its market presence. The successful development of these 4nm chips by UNISOC and Xiaomi marks progress in domestically produced mobile chips, enhancing competitiveness.

Arm64EC – Build and port apps for native performance on Arm

Arm64EC is a new ABI for Windows 11 on Arm devices, offering native performance benefits and compatibility with x64 code. Developers can enhance app performance by transitioning incrementally and rebuilding dependencies. Specific tools help identify Arm64EC binaries and guide the transition process for Win32 apps.

Vulnerability in Popular PC and Server Firmware

Eclypsium found a critical vulnerability (CVE-2024-0762) in Phoenix SecureCore UEFI firmware running on Intel Core processors, potentially enabling privilege escalation and persistent attacks. Lenovo has issued BIOS updates, and the finding underscores the importance of supply chain security.

Finnish startup says it can speed up any CPU by 100x

A Finnish startup, Flow Computing, introduces the Parallel Processing Unit (PPU) chip, promising a 100x CPU performance boost for AI and autonomous vehicles. Despite skepticism, CEO Timo Valtonen is optimistic about partnerships and industry adoption.

Gren 0.4: New Foundations

Gren 0.4 updates the functional language with enhanced core packages, a new compiler, a revamped FileSystem API, improved functions, and a community move to Discord. These updates aim to improve usability and community engagement.

9 comments
By @Pet_Ant - 5 months
The actual title seems to be "Testing AMD’s Bergamo: Zen 4c Spam", which I really like because from the perspective of 20 years ago this would feel a bit like "spam", or a CPU-core "Zergling rush".

As I said before, I do believe that this is the future of CPU cores. [1] With RAM latency not having kept pace with CPUs, having more performant cores really seems like a waste. In a cloud setting, where there is always some work to do, simpler cores but more of them really seems like the answer. It's in this environment that the weight of x86's legacy will catch up with us, and we'll need to get rid of all the transistors wasted on decoding cruft.

[1] https://news.ycombinator.com/item?id=40535915

By @adrian_b - 5 months
The article says "AMD’s server platform also leaves potential for expansion. Top end Genoa SKUs use 12 compute chiplets while Bergamo is limited to just eight. 12 Zen 4c compute dies would let AMD fit 192 cores in a single socket".

It should be noted that Bergamo's successor, Turin dense, which is expected by the end of the year, will have 12 compute chiplets for a total of 192 Zen 5c cores, thus bringing both more cores and faster cores.

By @crote - 5 months
I wonder if these large cache sizes make it possible to run functionally RAM-less servers? When optimized for size, it'd probably be possible to fit some microservices in an MB or eight of RAM. You can fit a decent bit of per-connection client state in a few KB, which can take up the rest of the cache. Throw in the X3D cache for 96 MB per CCD and you can actually do some pretty serious stuff.

With the right preprocessing, it should be possible to essentially stream from the NIC straight to the CPU, process some stuff, and output it straight to the NIC again without ever touching RAM, although I doubt current DMA hardware allows this. It'd require quite a bit of re-engineering of the on-wire protocol to remove the need for any non-trivially sized buffers, but I reckon it isn't impossible.
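The "does it fit in cache" question can be framed as a simple capacity budget. This is a rough sketch only: the 8 MB service footprint and 4 KB of state per connection are hypothetical values in the spirit of "a few megabytes" and "a few KB", and the calculation ignores cache associativity, prefetching, and other tenants, any of which would cause evictions well before the nominal capacity is reached. `max_connections` is a hypothetical helper, not part of any library.

```python
def max_connections(cache_mb: float, footprint_mb: float, state_kb: float) -> int:
    """Hypothetical upper bound on per-connection states that fit in cache.

    cache_mb:     cache capacity visible to the service, in MB
    footprint_mb: code + static data that must stay resident, in MB
    state_kb:     state kept per client connection, in KB
    """
    spare_kb = (cache_mb - footprint_mb) * 1024
    if spare_kb <= 0:
        return 0
    return int(spare_kb // state_kb)


# X3D CCD (96 MB), assumed 8 MB footprint, assumed 4 KB per connection.
print(max_connections(96, 8, 4))  # 22528
# One 16 MB Bergamo CCX with the same assumptions.
print(max_connections(16, 8, 4))  # 2048
```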

By @tedunangst - 5 months
I remember that, as part of the TBB library release, Intel remarked that 100 Pentium cores had the same transistor count as a Core 2. It took a while, but slower-and-wider designs are finally starting to become more common.

By @Havoc - 5 months
Every time I hear about a 100+ core CPU, my mind jumps to the Python GIL.

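For context: CPython's global interpreter lock allows only one thread at a time to execute Python bytecode, so CPU-bound threads do not spread across cores. The common workaround is to use separate processes, each with its own interpreter and GIL. Below is a minimal sketch using the standard library's `concurrent.futures.ProcessPoolExecutor`; `busy_work` and the task sizes are hypothetical stand-ins for a real CPU-bound workload.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def busy_work(n: int) -> int:
    """Hypothetical CPU-bound task: sum of squares of 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_parallel(tasks: list[int], workers: int = 0) -> list[int]:
    # Each worker is a separate interpreter process with its own GIL,
    # so CPU-bound tasks can run on different cores simultaneously.
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(busy_work, tasks))


if __name__ == "__main__":
    # The __main__ guard is required on platforms that spawn workers
    # by re-importing this module.
    results = run_parallel([200_000] * 16)
    print(len(results))
```

The trade-off is that arguments and results are pickled between processes, so this pays off only when each task does substantially more computation than data transfer.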
By @saxonww - 5 months
My immediate thought is that Sun beat both of these guys to market with the T1 20 years ago. It's not exactly the same thing, but I wonder how close the ideas are; the line seems to have stopped with the M8, a 32-core/256-thread part from 2017.

By @nullc - 5 months
Is the 4c core slower in any way other than the reduced L3 cache?

It would be interesting to take a compute-bound, perfectly scaling workload and compare its absolute performance and performance per watt between Bergamo and Genoa.
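One way to organize such a comparison is sketched below. It assumes you have already measured, for each system, the throughput of the same compute-bound job and the average power over that run (for example with an external power meter); the `Run` type and helper functions are hypothetical, and no Bergamo or Genoa measurements are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured benchmark run; all fields are user-supplied measurements."""
    threads: int
    throughput: float   # work units per second
    avg_power_w: float  # average power over the run, in watts


def scaling_efficiency(baseline: Run, full: Run) -> float:
    """Per-thread throughput at full load relative to the baseline run.

    1.0 means the workload scaled perfectly from baseline.threads to full.threads.
    """
    return (full.throughput / full.threads) / (baseline.throughput / baseline.threads)


def perf_per_watt(run: Run) -> float:
    return run.throughput / run.avg_power_w


def compare(a: Run, b: Run) -> dict[str, float]:
    """Ratios of system a to system b; values above 1.0 favour a."""
    return {
        "throughput_ratio": a.throughput / b.throughput,
        "perf_per_watt_ratio": perf_per_watt(a) / perf_per_watt(b),
    }
```

Checking `scaling_efficiency` on each machine first confirms that the job really scales close to perfectly, so that the full-load ratios from `compare` reflect core count and per-core throughput rather than a shared bottleneck.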