July 18th, 2024

What Would You Do with a 16.8M Core Graph Processing Beast?

Intel, working with MIT's Lincoln Laboratory and Amazon Web Services under the HIVE program, has developed PIUMA, a graph processing architecture that scales to 16.8 million cores. The chip, fabricated by TSMC, targets efficient processing of graph-structured workloads such as neural networks, pairing a custom RISC-based architecture with a photonics interconnect for scalability.

In a recent development, Intel has built a highly advanced graph processing system that can scale to 16.8 million cores. The effort grew out of HIVE, a DARPA program in which Intel collaborated with MIT's Lincoln Laboratory and Amazon Web Services, and was motivated by the need to efficiently process workloads, such as neural networks, that are naturally structured as graphs. The processor, named PIUMA, features a custom RISC-based architecture with multiple pipelines optimized for graph analytics. It incorporates a photonics interconnect, developed in partnership with Ayar Labs, to connect a large number of processors, and the design can scale to a supercomputer with millions of cores and petabytes of shared memory. The PIUMA chip, fabricated on TSMC's 7nm technology, packs 27.6 billion transistors and provides 1 TB/sec of optical interconnect bandwidth. Despite the substantial costs involved in scaling up the system, demand for such high-performance computing is expected to come from entities like the US National Security Agency and the Department of Defense.
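
For a sense of what "optimized for graph analytics" means in practice, here is a minimal sketch of the kind of pointer-chasing, low-arithmetic-intensity kernel such architectures target. It is plain Python under generic assumptions, not PIUMA's programming model, and the function and graph in it are made up for illustration.

    # Minimal breadth-first search over an adjacency list (plain Python, for
    # illustration only -- not PIUMA's programming model). The inner loop does
    # almost no arithmetic; it just chases data-dependent references through
    # memory in an unpredictable order, which is the access pattern that defeats
    # caches and prefetchers on conventional CPUs.
    from collections import deque

    def bfs_distances(adj, source):
        """adj: dict mapping each vertex to a list of neighbor vertices."""
        dist = {source: 0}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:           # irregular, data-dependent loads
                if v not in dist:      # one hash probe per edge, no FLOPs
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        return dist

    # Tiny example graph; real workloads have billions of edges.
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(bfs_distances(adj, 0))       # {0: 0, 1: 1, 2: 1, 3: 2}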

Related

TSMC experimenting with rectangular wafers vs. round for more chips per wafer

TSMC is developing an advanced chip packaging method to address AI-driven demand for computing power. Intel and Samsung are also exploring similar approaches to boost semiconductor capabilities amid the AI boom.

Finnish startup says it can speed up any CPU by 100x

A Finnish startup, Flow Computing, introduces the Parallel Processing Unit (PPU) chip promising 100x CPU performance boost for AI and autonomous vehicles. Despite skepticism, CEO Timo Valtonen is optimistic about partnerships and industry adoption.

Intel CPU with Optical OCI Chiplet Demoed with 4Tbps of Bandwidth and 100m Reach

Intel demonstrated a CPU with an Optical Compute Interconnect (OCI) chiplet offering 4Tbps of bandwidth and 100m reach. The chiplet merges photonic and electronic integrated circuits, with built-in lasers for faster server communication, especially in AI servers. Intel plans to scale the technology to 32Tbps. The advance aligns with the industry's move towards co-packaged optics, a significant shift in server architecture towards power efficiency and high-speed data transfer.

Supercomputer-on-a-chip goes live: single PCIe card packs more than 6k cores

InspireSemi introduces the Thunderbird I accelerated computing chip with 1,536 custom RISC-V CPU cores per chip (multiple chips per PCIe card) for scientific and data processing. Energy-efficient, scalable to 360,000 cores, suited to high-performance computing tasks. The CEO praises the team's work.

Extreme Measures Needed to Scale Chips

The July 2024 IEEE Spectrum issue discusses scaling compute power for AI, exploring solutions like EUV lithography, linear accelerators, and chip stacking. Industry innovates to overcome challenges and inspire talent.

13 comments
By @jandrewrogers - 3 months
The question that isn't being asked, but should be, is: if this idea is so great, why were none of the several prior generations of similar computers successful? To be clear about where my perspective is coming from, I was one of the people doing graph algorithm research on exotic graph processors many years ago, and the old Tera machines mentioned in the article are one of my favorite computing architectures of all time. I am predisposed to liking the tech, but I also have a realistic view of the tradeoffs.

These architectures were killed by two separate issues. The first is that very few programmers seem to develop an intuitive sense of how to design algorithms and data structures that are efficient on these architectures. Real-world performance is poor because the software is poor. The second, and more permanent state of affairs, is that along the way researchers developed new algorithms and software architectures that run on ordinary CPUs only a little less efficiently, while taking advantage of the faster and much cheaper commodity silicon. Why would you buy exotic silicon when you can call Dell?

There is a community of diehard graph silicon enthusiasts inside the US DoD who continue to fund the development of processors like the one in the article, so we get a new design every few years. And it is cool tech! But if I were building a business around graph processing, I’d use commodity silicon without a second thought. Not that I wouldn’t love to play with one of these new graph computers just to see what they can do.
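
For context on the "algorithms and software architectures that run on ordinary CPUs" mentioned in the comment above: commodity-CPU graph frameworks typically lean on compact, sequentially scanned layouts such as compressed sparse row (CSR). The sketch below illustrates the idea in plain Python; it is a generic example, not code from any particular framework.

    # Minimal compressed sparse row (CSR) adjacency layout -- the kind of compact,
    # cache-friendly structure commodity-CPU graph frameworks rely on. Generic
    # illustration; not taken from any particular library.
    import numpy as np

    def build_csr(num_vertices, edges):
        """edges: list of (src, dst) pairs. Returns (offsets, targets) arrays."""
        degree = np.zeros(num_vertices + 1, dtype=np.int64)
        for src, _ in edges:
            degree[src + 1] += 1
        offsets = np.cumsum(degree)        # offsets[v]:offsets[v+1] spans v's neighbors
        targets = np.empty(len(edges), dtype=np.int64)
        cursor = offsets[:-1].copy()
        for src, dst in edges:
            targets[cursor[src]] = dst
            cursor[src] += 1
        return offsets, targets

    def neighbors(offsets, targets, v):
        # One contiguous slice per vertex: sequential reads the prefetcher likes.
        return targets[offsets[v]:offsets[v + 1]]

    offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
    print(neighbors(offsets, targets, 0))  # [1 2]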

By @twoodfin - 3 months
The framing is weird: The TLAs that gave Tera enough money to buy the Cray badge and that are funding these spiritual descendants don’t care about AI. They’ve got some core, very secret “needle in a haystack” algorithms that really like chasing computed pointers (and popcount… never forget popcount!)
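
For context, "popcount" is the population-count operation that tallies the set bits in a word; bitset-style graph code uses it, for instance, to count common neighbors in a single pass. A tiny illustrative sketch in plain Python, standing in for the hardware instruction:

    # "Popcount" counts the set bits in a word. In bitset-style graph code it turns
    # set intersection into a couple of machine instructions -- here, counting the
    # common neighbors of two vertices. Plain-Python illustration; int.bit_count()
    # (Python 3.10+) is the language's popcount.
    a_neighbors = 0b10110100   # vertex A is adjacent to vertices 2, 4, 5, 7
    b_neighbors = 0b11010100   # vertex B is adjacent to vertices 2, 4, 6, 7
    common = (a_neighbors & b_neighbors).bit_count()   # popcount of the intersection
    print(common)              # 3 (shared neighbors: vertices 2, 4 and 7)
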
By @gerdesj - 3 months
Zooming in on a boundary section of the Mandelbrot set will fuck it up, depending on how fast you want it to go. Displaying associated Julia sets from the focus in a fly-out will simply enhance the fun!

"That’s just the time we live in now." - no it isn't. That beastie might be handy for something in the future but it isn't "I". It will deliver really fast weather forecasts and that's useful.

By @michelpp - 3 months
Implement the GraphBLAS API of course!

https://graphblas.org/
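
GraphBLAS recasts graph algorithms as sparse linear algebra over semirings, so one BFS step becomes a matrix-vector product over the Boolean (OR, AND) semiring. The sketch below mimics that idea in plain Python; it is a conceptual illustration only, not the actual GraphBLAS API.

    # One BFS frontier expansion written GraphBLAS-style: a matrix-vector product
    # over the Boolean (OR, AND) semiring, masked by the set of visited vertices.
    # Conceptual plain-Python illustration; the real GraphBLAS API is a C library
    # (see graphblas.org), and these names are made up for the example.
    def expand_frontier(adj_matrix, frontier, visited):
        n = len(adj_matrix)
        new_frontier = [False] * n
        for u in range(n):
            if frontier[u]:                      # nonzero entry of the input vector
                for v in range(n):
                    if adj_matrix[u][v] and not visited[v]:
                        new_frontier[v] = True   # AND as "multiply", OR-accumulate
        return new_frontier

    # 4-vertex undirected example graph with edges 0-1, 0-2, 1-3, 2-3.
    A = [[0, 1, 1, 0],
         [1, 0, 0, 1],
         [1, 0, 0, 1],
         [0, 1, 1, 0]]
    frontier = [True, False, False, False]       # start BFS at vertex 0
    visited = list(frontier)
    frontier = expand_frontier(A, frontier, visited)
    print(frontier)                              # [False, True, True, False]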

By @magicalhippo - 3 months
A relaxing game of Solitaire.

Looking at the block diagrams, it looks more like it's a giant mesh router with a few processing cores sprinkled on top, Salt Bae style. After all, the cores take up just over 4% of the transistors. Not sure if that includes the 4MB per-core scratchpad though.

By @jerven - 3 months
Run the UniProt SPARQL service on it ;) I was lucky to test the YarcData implementation of SPARQL on an XMT machine around 2012; super UI, but single user.
By @01HNNWZ0MV43FF - 3 months
Rasterize Quake