
An Actual AI Hardware Company Teaches NVIDIA How It’s Done

Posted by techopse | Feb 28, 2024 | Technology


A cutting-edge AI chip company has developed a novel processor designed to deliver exceptional AI inference performance, tailored specifically for large language models (LLMs).

“Groq”, not to be confused with Elon Musk’s AI chatbot “Grok”, accomplishes this feat by introducing a specialized processing unit called the Tensor Streaming Processor (TSP), engineered to deliver consistent and predictable performance for AI calculations without relying on traditional GPUs.

GPUs, by contrast, are optimized primarily for parallel graphics processing and feature a very large number of cores: an NVIDIA H100 boasts 14,592 CUDA cores, while AMD’s MI300X packs 19,456 Stream Processors.

Groq dubs the chips that result from this endeavor “Language Processing Units,” or LPUs.

According to Jay Scambler, managing director of the AI firm Kapstone Bridge, in a Twitter post:

Groq is serving the fastest responses I've ever seen. We're talking almost 500 T/s!

I did some research on how they're able to do it. Turns out they developed their own hardware that utilize LPUs instead of GPUs. Here's the skinny:

Groq created a novel processing unit known as…

— Jay Scambler (@JayScambler) February 19, 2024

“The LPU’s architecture marks a departure from the SIMD (Single Instruction, Multiple Data) model employed by GPUs, favoring a more streamlined approach that removes the necessity for intricate scheduling hardware. This design maximizes the utilization of every clock cycle, ensuring consistent latency and throughput.”
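To make that contrast concrete, here is a toy simulation of the two scheduling styles. This is my own illustration, not Groq’s design: the operation costs and stall values are invented, and the point is only that a statically scheduled pipeline finishes in the same cycle count every run, while a dynamically scheduled one does not.

```python
import random

# Toy model (invented numbers, not Groq's architecture): compare a
# statically scheduled pipeline, where the compiler fixes the cost of
# every operation ahead of time, against a dynamically scheduled one,
# where hardware arbitration and stalls add run-to-run jitter.

def static_schedule(num_ops, cycles_per_op=1):
    """Every op takes a fixed, compiler-known number of cycles."""
    return [cycles_per_op] * num_ops

def dynamic_schedule(num_ops, cycles_per_op=1):
    """Each op may stall on schedulers, caches, or competing threads."""
    return [cycles_per_op + random.choice([0, 0, 1, 3]) for _ in range(num_ops)]

random.seed(0)
for name, latencies in [("static", static_schedule(1000)),
                        ("dynamic", dynamic_schedule(1000))]:
    print(f"{name:>7}: {sum(latencies)} cycles total, "
          f"worst single op = {max(latencies)} cycles")
```

The static run always reports exactly 1,000 cycles; the dynamic run lands somewhere above that and varies from seed to seed. That run-to-run variance is precisely the latency unpredictability the LPU’s design is meant to remove.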

The LPU’s increased efficiency stems in part from eliminating the overhead of managing multiple threads and avoiding the underutilization of cores. Consequently, an LPU offers superior compute capacity, enabling it to generate text sequences at a significantly accelerated pace.
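To put the reported speeds in perspective, here is a quick back-of-the-envelope calculation based on the roughly 500 tokens per second figure from the tweet above (the 300-token answer length is just an example value):

```python
# Back-of-the-envelope math for the ~500 tokens/sec figure quoted above.
tokens_per_second = 500
per_token_latency_ms = 1000 / tokens_per_second    # 2 ms per token
answer_length = 300                                # tokens in a longish reply
print(f"{per_token_latency_ms:.1f} ms per token")
print(f"{answer_length / tokens_per_second:.2f} s for a {answer_length}-token answer")
```

At that rate a multi-paragraph answer streams back in well under a second, which is why the demos feel instantaneous.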

Groq’s groundbreaking chip design allows multiple TSPs to be linked together seamlessly, sidestepping the bottlenecks typical of GPU clusters and making the resulting systems highly scalable. Performance improves linearly as more LPUs are added, which streamlines the hardware requirements for large-scale AI models and lets developers scale up without overhauling their systems.
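Here is a rough sketch of why that linearity matters. The numbers are made up: per_chip is an arbitrary per-chip throughput, and comm_overhead stands in for the synchronization tax that tends to accumulate in GPU clusters.

```python
# Illustrative only: invented parameters, not measured Groq or NVIDIA data.
def scaled_throughput(chips, per_chip, comm_overhead):
    """Aggregate throughput when each added chip pays a communication tax."""
    return chips * per_chip * (1 - comm_overhead) ** (chips - 1)

for chips in (1, 2, 4, 8, 16):
    ideal = scaled_throughput(chips, per_chip=100.0, comm_overhead=0.0)
    taxed = scaled_throughput(chips, per_chip=100.0, comm_overhead=0.05)
    print(f"{chips:>2} chips: ideal {ideal:6.0f} tok/s vs taxed {taxed:6.0f} tok/s")
```

With zero overhead, sixteen chips deliver sixteen times the throughput; with even a 5% per-chip tax, nearly half of that headline scaling evaporates. Groq’s pitch is that its interconnect keeps clusters much closer to the ideal line.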

According to a report on Tom’s Hardware, Groq asserts that its users are already leveraging its engine and API to execute LLMs at speeds up to 10 times faster than GPU-based alternatives.
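For anyone who wants to try it, here is a minimal sketch of calling Groq’s hosted API from Python. It assumes the `groq` client package (which mirrors the OpenAI chat-completions shape) and a model name that was available around the time of writing; check Groq’s documentation for the current model list.

```python
# Minimal sketch: the `groq` Python client follows the familiar OpenAI
# chat-completions shape. The API key is a placeholder, and the model
# name is an assumption based on Groq's lineup at the time of writing.
from groq import Groq

client = Groq(api_key="YOUR_GROQ_API_KEY")  # placeholder, not a real key

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # verify against Groq's current model list
    messages=[{"role": "user",
               "content": "Explain what an LPU is in one sentence."}],
)
print(response.choices[0].message.content)
```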

LPUs have the potential to offer significant advancements over GPUs in serving AI applications in the future! Alternative high-performance hardware would be welcome, especially given the heavy demand for A100s and H100s as NVIDIA’s grip on the AI hardware market tightens with each passing day.
