Wednesday, February 21, 2024
Recently the good folks at Groq released a formidable demo showing a 70 billion parameter language model inferencing at 300 tokens per second at batch size 1. This immediately elicited two responses from the community:
Configuration. Groq's marketing is somewhat facetious here. The 'LPU' is just their old 14 nm SRAM-based systolic array processor, which sits somewhere in between the Graphcore SRAM-based processor array and the big systolic arrays found in Gaudi and TPU. The LPU has 230 Mbytes of SRAM per chip, and some software tricks are used to shard the model across many, many LPUs for inference. If we assume 10 Mbytes of activation memory per token, 4K context on a 70B parameter model in int8 comes out to about 110 Gbytes of memory (70 Gbytes of weights plus roughly 40 Gbytes of activations), which requires eight racks (512 devices) to hold. Needless to say, that is a rather voluminous configuration.
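The sizing above is easy to check by hand. Here is a rough sketch of the arithmetic in Python; the 64 chips per rack figure is my assumption, inferred from "eight racks (512 devices)", and the function names are purely illustrative.

```python
import math

# Figures taken from the paragraph above.
PARAMS = 70_000_000_000            # 70B parameter model
BYTES_PER_PARAM = 1                # int8 weights
ACT_BYTES_PER_TOKEN = 10_000_000   # assumed 10 Mbytes of activations per token
CONTEXT_TOKENS = 4096              # 4K context
SRAM_PER_CHIP = 230_000_000        # 230 Mbytes of SRAM per LPU
CHIPS_PER_RACK = 64                # assumption: implied by 8 racks = 512 devices


def memory_footprint_bytes(params, bytes_per_param, act_bytes_per_token, context):
    """Weights plus per-token activation memory for the full context."""
    return params * bytes_per_param + act_bytes_per_token * context


def chips_needed(total_bytes, sram_per_chip):
    """Minimum number of chips whose combined SRAM holds total_bytes."""
    return math.ceil(total_bytes / sram_per_chip)


total_bytes = memory_footprint_bytes(
    PARAMS, BYTES_PER_PARAM, ACT_BYTES_PER_TOKEN, CONTEXT_TOKENS
)
min_chips = chips_needed(total_bytes, SRAM_PER_CHIP)
racks = math.ceil(min_chips / CHIPS_PER_RACK)

print(f"footprint: {total_bytes / 1e9:.1f} Gbytes")
print(f"minimum chips: {min_chips}")
print(f"racks: {racks} ({racks * CHIPS_PER_RACK} devices)")
```

The bare minimum comes out a little under 512 chips; rounding up to whole racks is what gets you to the eight-rack figure.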
Power. At first glance Groq is absolutely boned here, requiring 512 300 W devices (about 150 kW) to do their inference. Fortunately, batch size 1 inference doesn't really stress the ALUs even with all of Groq's bandwidth (a forward pass is about 140 GFLOPs per token, so 300 tokens/sec works out to roughly 42 TFLOP/s, a tiny fraction of the total throughput of the cluster), so the actual power per chip will be quite low.
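For the skeptical, a back-of-the-envelope sketch of those numbers follows. It assumes the usual rule of thumb of 2 FLOPs per parameter per token for a forward pass, and it treats 300 W as a worst-case per-device draw; the per-chip sustained rate it prints is just the cluster total divided evenly across 512 devices, not a measured figure.

```python
# Figures taken from the paragraph above.
NUM_DEVICES = 512
WATTS_PER_DEVICE = 300            # worst-case draw per device
PARAMS = 70_000_000_000           # 70B parameter model
TOKENS_PER_SEC = 300
FLOPS_PER_PARAM_PER_TOKEN = 2     # assumption: one multiply + one add per weight


def peak_power_watts(num_devices, watts_per_device):
    """Worst-case cluster power if every device ran flat out."""
    return num_devices * watts_per_device


def sustained_flops(params, tokens_per_sec, flops_per_param):
    """Arithmetic rate needed to generate tokens_per_sec at batch size 1."""
    return params * flops_per_param * tokens_per_sec


peak_watts = peak_power_watts(NUM_DEVICES, WATTS_PER_DEVICE)
flops_per_token = PARAMS * FLOPS_PER_PARAM_PER_TOKEN
cluster_flops = sustained_flops(PARAMS, TOKENS_PER_SEC, FLOPS_PER_PARAM_PER_TOKEN)
per_chip_flops = cluster_flops / NUM_DEVICES

print(f"peak power: {peak_watts / 1e3:.1f} kW")
print(f"work per token: {flops_per_token / 1e9:.0f} GFLOPs")
print(f"cluster sustained rate: {cluster_flops / 1e12:.0f} TFLOP/s")
print(f"per-chip sustained rate: {per_chip_flops / 1e9:.0f} GFLOP/s")
```

Spread over 512 chips, the required arithmetic rate per chip is tens of GFLOP/s, which is why the ALUs are nowhere near saturated.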
Economics. This is where it gets really spicy. Deriders will remark that the Groq device is $20K, but that's in quantity 1 from Mouser for a card built by Bittware, a company notorious for high markups. Groq's chip is fabbed on a '14 nm process'; give or take, the die probably costs $60, 2x that for packaging and testing. Now, here's the magic: because it is SRAM based, the boards are very simple; all in, I would estimate that a Groq board costs under $500 to build.
Suddenly, we're looking cost competitive. The boards for 512 devices come out to about $250,000; figure double that once we account for the host servers. Half a million for 10x the performance of an octal A100 ($180,000) is suddenly pretty good. Of course, we're being facetious here, because an octal A100 costs about $50,000 to build, but Groq gets to pay wholesale prices and you don't.
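To make the comparison concrete, here is a small sketch that normalizes each system's cost by its relative performance. It takes the $500 board estimate as an upper bound, models the host servers as a flat 2x multiplier, and takes the 10x speedup over an octal A100 at face value; all of these are the rough estimates from the text, not measured numbers.

```python
# Estimates taken from the text above.
NUM_DEVICES = 512
BOARD_COST_USD = 500          # upper-bound estimate of board build cost
HOST_MULTIPLIER = 2           # "figure double that" for host servers
GROQ_RELATIVE_PERF = 10       # claimed 10x an octal A100 at batch size 1
A100_LIST_USD = 180_000       # what you pay for an octal A100
A100_BUILD_USD = 50_000       # estimated cost to build one


def system_cost(num_devices, board_cost, host_multiplier):
    """Total build cost: boards, scaled up to cover host servers."""
    return num_devices * board_cost * host_multiplier


def cost_per_unit_perf(cost, relative_perf):
    """Dollars per 'one octal A100' worth of batch-size-1 throughput."""
    return cost / relative_perf


groq_cost = system_cost(NUM_DEVICES, BOARD_COST_USD, HOST_MULTIPLIER)
groq_cpp = cost_per_unit_perf(groq_cost, GROQ_RELATIVE_PERF)
a100_list_cpp = cost_per_unit_perf(A100_LIST_USD, 1)
a100_build_cpp = cost_per_unit_perf(A100_BUILD_USD, 1)

print(f"Groq system build cost: ${groq_cost:,}")
print(f"Groq $/perf:            ${groq_cpp:,.0f}")
print(f"A100 $/perf (list):     ${a100_list_cpp:,.0f}")
print(f"A100 $/perf (build):    ${a100_build_cpp:,.0f}")
```

On these numbers Groq's build cost per unit of performance lands well below the A100's list price and roughly level with the A100's estimated build cost, which is the whole point of the wholesale-versus-retail caveat.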
Conclusions. What are our real conclusions here? I'd say there are two, maybe three: