Forgive me for starting with a cliché, the bit of financial jargon that has lately crept into the tech lexicon, but I’m afraid I have to talk about “moats”. Popularized decades ago by Warren Buffett to describe a company’s competitive advantage, the word found its way into Silicon Valley slide decks when a memo supposedly leaked from Google, titled “We Have No Moat, and Neither Does OpenAI,” fretted that open-source AI would plunder Big Tech’s castle.
Several years on, the castle walls remain intact. Apart from a brief panic when DeepSeek first appeared, open-source AI models have not significantly outperformed proprietary ones. Still, none of the frontier labs – OpenAI, Anthropic, Google – have much of a moat to speak of.
One company that does have a moat is Nvidia. CEO Jensen Huang has called it the company’s most valuable “treasure.” It is not, as you might assume for a chip company, a piece of hardware. It’s something called CUDA. What sounds like a chemical banned by the FDA may be the only real moat in artificial intelligence.
CUDA technically stands for Compute Unified Device Architecture, but, as with laser or scuba, no one bothers to spell out the acronym; we just say “KOO-duh”. So what is this all-important treasure for? If forced to answer in one word: parallelism.
Here’s a basic example. Suppose we ask a machine to fill in the 9×9 multiplication table. A single-core computer performs all 81 operations dutifully, one after another. But a GPU with nine cores can divide the work so that each core takes a different column – one handles 1×1 through 1×9, another 2×1 through 2×9, and so on – for a ninefold increase in speed. Contemporary GPUs can be even smarter. If they are programmed to recognize commutativity – 7×9 = 9×7 – they can avoid duplicating work, cutting 81 operations down to 45, almost halving the workload. When a single training run costs a hundred million dollars, every optimization counts.
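The commutativity trick above can be sketched in a few lines of Python – a toy illustration of the counting, not a depiction of how CUDA actually schedules work on a GPU:

```python
# Naive approach: compute every cell of the 9x9 table -- 81 multiplications.
naive = [(a, b, a * b) for a in range(1, 10) for b in range(1, 10)]

# Exploiting commutativity (a*b == b*a): only compute cells where a <= b,
# since the cell (b, a) holds the same product as (a, b).
deduped = [(a, b, a * b) for a in range(1, 10) for b in range(a, 10)]

print(len(naive))    # 81 operations
print(len(deduped))  # 45 operations -- almost half
```

The savings come from computing only the upper triangle of the table (9 + 8 + … + 1 = 45 cells) and mirroring the rest.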
Nvidia GPUs were originally built for rendering graphics in video games. In the early 2000s, Ian Buck, a Stanford graduate student who had first encountered GPUs as a gamer, realized that their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and led the development of CUDA alongside John Nickolls. If AI ushers in an era of a permanent white-collar underclass and autonomous weapons, just know that it will be because someone, somewhere, playing a video game, thought a demon’s scrotum should jiggle at sixty frames per second.
CUDA is not itself a programming language but a “platform”. I use this weasel word because, just as The New York Times is a newspaper that is also a gaming company, CUDA has, over the years, grown into a nested suite of artificial-intelligence software libraries. Each one shaves nanoseconds off individual math operations – collectively, they make GPUs, in industry parlance, go brrr.
A contemporary graphics card is not just a circuit board studded with chips, memory, and fans. It is an intricate arrangement of cache hierarchies and specialized units called “tensor cores” and “streaming multiprocessors”. In this sense, what chip companies sell is like a professional kitchen, and more cores are like more grill stations. But even a kitchen with thirty grill stations won’t run any faster unless a skilled chef deftly assigns the tasks – which is what CUDA does with GPU cores.
To extend the metaphor, hand-tuned CUDA libraries optimized for single operations are the kitchen tools designed for one task and nothing else – cherry pitters, shrimp deveiners – a delight for home cooks, but not when you have 10,000 shrimp to devein. Which brings us back to DeepSeek. Its engineers went beneath this already deep layer of abstraction to work directly in PTX, a sort of assembly language for Nvidia GPUs. Suppose the task is to peel garlic. An unoptimized GPU might say, “Peel off the skin with your fingernails.” CUDA might instruct, “Crush the clove with the flat of a knife.” PTX lets you dictate every sub-instruction: “Raise the blade 2.35 inches above the cutting board, hold it parallel to the equator of the clove, and strike down with the palm of your hand with a force of 36.2 newtons.”
