No, Classical Computation Is Not Dead Yet: Meet Cerebras.
A 1-trillion-parameter AI model on a single CS-3 system sets the bar for Amazon, Microsoft, Google, Grok, and OpenAI at NeurIPS 2024.
What the heck is a CS-3 System?
The CS-3 system is the third-generation AI accelerator developed by Cerebras Systems, a company founded in 2015. That's 2015, less than a decade ago; keep that in mind if you actually continue to read all of this publication. Also keep in mind that this isn't even a quantum computer, this is binary computation.
Cerebras (CS-3) (classical) clustered @ 256 exaFLOPS = 256×10^18 floating-point operations per second.
Who: Cerebras Systems, a leading AI hardware company, demonstrated unprecedented computational efficiency in large-scale machine learning model training.
When: During a major international AI conference in late 2024.
Where: NeurIPS 2024, a prominent technological symposium focusing on artificial intelligence and computational systems.
Andrew Feldman and a team of engineers with expertise in advanced computing technologies are the brainiacs behind it. The CS-3 builds on Cerebras' Wafer-Scale Engine (WSE) technology, which revolutionized AI hardware by integrating 900,000 AI-optimized cores onto a single wafer-scale chip (the WSE-3). And that is just one chip; imagine 10 of them, or 100, or, if we imagine a black budget, 1,000+ of these bad boys. The power would make you feel superhuman.
Measuring 46,225 mm², it is essentially the size of an entire 12-inch silicon wafer. This is no ordinary chip; it's a computational behemoth, dwarfing traditional GPUs like NVIDIA's H100 by being 56 times larger, with 52 times more cores and 7,000 times the memory bandwidth. Here are a few key features and innovations. For one, it can process up to 2,100 tokens per second (I'll break this down in a moment) and train large-scale AI models such as GPT-4, LLaMA, and other generative AI systems. It is also used in national security research, as demonstrated by Sandia National Laboratories. Let's continue diving:
Wafer-Scale Engine 3 (WSE-3): CS-3 features over 4 trillion transistors, providing unmatched performance for AI workloads. It delivers 2x the speed of its predecessor, the CS-2.
MemoryX Integration: Supports up to 1.2 petabytes of external memory, enabling the training of models with up to 24 trillion parameters without traditional memory bottlenecks.
SwarmX Interconnect: Allows seamless scaling across up to 2,048 systems, creating hyperscale AI supercomputers capable of reaching up to a quarter zettaflop of performance (a quick check of that math follows this list).
Compact Design: Fits into a 16RU rack unit with advanced power and water-cooling systems, delivering the compute power of entire GPU clusters in a fraction of the space.
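As a quick sanity check (my own arithmetic, using the 125 petaflops of dense AI compute per system cited later in this article), here is how 2,048 clustered systems get you to that quarter zettaflop:

```python
# Sanity check of the scaling claim above, using 125 petaFLOPS of dense AI
# compute per CS-3 system (a figure cited later in this article).
systems = 2_048
pflops_per_system = 125
total_pflops = systems * pflops_per_system
print(total_pflops / 1_000, "exaFLOPS")        # -> 256.0 exaFLOPS
print(total_pflops / 1_000_000, "zettaFLOPS")  # -> 0.256, roughly a quarter zettaflop
```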
Below are a few of the CS-3 system's performance highlights.
Near-linear performance scaling when moving from single-node to multi-node configurations.
Simplified workflows by eliminating the need for complex distributed computing frameworks.
The ability to train trillion-parameter models with the same ease as smaller models on GPUs.
Cerebras Systems introduced the first Wafer-Scale Engine in 2019, breaking Moore's Law by creating processors specifically designed for AI workloads. At the time of this writing and publishing, the CS-3 continues this legacy by offering insane scalability and efficiency for machine learning research. This marks a significant advancement in the development of large-scale AI models; previously, training models at this scale typically required thousands and thousands of GPUs.
So, what in the world is a Token?
A token in natural language processing (NLP) is a discrete unit of text, such as a word, subword, or character, depending on the tokenization algorithm used. When the model processes a token, it passes it through multiple layers of its neural network, each layer made up of nodes (also called neurons), a system inspired by how the human brain works. These layers (the input layer, the intermediary "hidden" layers, and the output layer) analyze the token in the context of the surrounding text to understand its meaning and figure out what might come next.
For example, the sentence "Subscribe to Xybercraft!" might be tokenized into: ["Sub", "scribe", "to", "Xyber", "craft", "!"] by a subword-based tokenizer like Byte Pair Encoding (BPE).
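If you want to see real BPE splits for yourself, here is a minimal sketch using tiktoken, OpenAI's open-source BPE tokenizer library (my choice of tool for illustration, nothing to do with Cerebras); the exact pieces will differ from the hand-written example above because every tokenizer learns its own vocabulary:

```python
# Minimal BPE tokenization sketch using the tiktoken library (an assumption for
# illustration); actual splits depend on the tokenizer's learned vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Subscribe to Xybercraft!")
print(ids)                              # the integer token ids
print([enc.decode([i]) for i in ids])   # the text piece each id maps back to
```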
What can 1 token do?
The processing of a single token involves passing it through the layers of a neural network to compute its contextual representation and generate predictions for the next token. This process begins with tokenization, where text is divided into manageable units, followed by converting each token into a numerical vector using embeddings to capture its semantic meaning. Even processing just one token per second illustrates how a model can handle small-scale tasks, such as generating short responses or analyzing brief text snippets. While this may seem slow for larger tasks, it demonstrates a model's ability to break down complex language into smaller pieces and process them with remarkable accuracy. This step-by-step approach, utilizing advanced architectures like transformers, allows language models to create coherent sentences and make sense of human language.
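To make the "token to numerical vector" step concrete, here is a tiny PyTorch sketch (my illustration, with a made-up vocabulary size, embedding width, and token ids, not any particular model's values):

```python
# Hypothetical embedding lookup: each token id becomes a dense vector that the
# network's layers can work with. Sizes and ids are illustrative assumptions.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=768)
token_ids = torch.tensor([17250, 995])   # two hypothetical token ids
vectors = embedding(token_ids)
print(vectors.shape)                     # -> torch.Size([2, 768])
```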
If a model processes one token per second, it means it takes one second to analyze or generate a single word or sub-word. This speed is influenced by factors like the model's size, the hardware (e.g., GPUs), and the complexity of the processing, a story that stretches from the 1950s all the way up to today. The speed at which large language models (LLMs) process tokens is typically measured in tokens per second (TPS), and this can vary significantly. Here's a breakdown:
Typical Processing Speeds: Smaller models or optimized setups can handle anywhere from 10 to 200 tokens per second on consumer-grade GPUs. Larger models with billions of parameters (e.g., 70B) tend to process tokens more slowly due to their higher computational requirements.
Hardware Performance: High-performance GPUs, such as NVIDIA A100 or RTX 3090, are capable of generating 100–200 tokens per second for smaller or moderately sized models. However, as the model size increases, the processing speed decreases because of the added complexity.
Real-World Context: For practical applications, a system that generates one token every 100 milliseconds is producing about 10 tokens per second, which translates to roughly 450 words per minute (the quick conversion below spells this out), faster than most people can read or type.
These speeds are influenced by factors like the length of the input text, the efficiency of the hardware being used, and any optimizations applied to the model (e.g., reducing precision or using faster algorithms).
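For the curious, the 10-tokens-per-second figure above converts to words per minute like this (assuming the common rule of thumb of about 0.75 English words per token, which is my assumption, not a spec):

```python
# The conversion behind "about 10 tokens per second ~= 450 words per minute",
# assuming roughly 0.75 English words per token (a rule of thumb).
tokens_per_second = 10
words_per_token = 0.75
print(tokens_per_second * 60 * words_per_token)   # -> 450.0 words per minute
```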
Now, what can 2,100 tokens per second do, you say?
What can 2,100 tokens per second do? Hmmm, so much. Where to begin? This impressive processing speed, achieved by Cerebras' CS-3 system with the Llama 3.1-70B model, signifies a major leap in computational power. It allows the system to handle entire sequences of text rapidly, drastically cutting down latency and enabling real-time inference for large-scale language models.
Compared to a rate of one token per second, processing at 2,100 tokens per second improves throughput by a factor of roughly 2,100 (of course, right?), making it ideal for demanding applications such as conversational AI, real-time translation, and agentic capabilities. The quick comparison below shows what that feels like in practice.
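Here is that comparison spelled out as wall-clock time for a hypothetical 500-token reply (the 500 is my example length, not a Cerebras figure):

```python
# Time to generate a hypothetical 500-token reply at different throughputs,
# including the ~2,100 tokens/s reported for Llama 3.1-70B on the CS-3.
response_tokens = 500
for tps in (1, 10, 2_100):
    print(f"{tps:>5} tokens/s -> {response_tokens / tps:8.2f} seconds")
```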
Do quantum computers use tokens?
Quantum computers do not utilize tokens in the same manner as classical AI systems do for natural language processing (NLP). Instead, in the context of Quantum Natural Language Processing (QNLP), text data is typically encoded into quantum states or represented as quantum circuits. These representations are manipulated using quantum algorithms that exploit quantum phenomena such as superposition and entanglement.
Qubit Superposition (Bloch Sphere)
Quantum circuits perform operations on qubits using quantum gates, such as the Hadamard gate, to manipulate their states.
Entangled Qubit States (Bell State)
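For readers who want to poke at this themselves, here is a minimal sketch of the Hadamard-plus-CNOT circuit that produces the Bell state pictured above, assuming the open-source Qiskit library (my choice of tooling, not something tied to Cerebras or Google):

```python
# Minimal Bell-state sketch using Qiskit (an assumed, freely available library).
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

qc = QuantumCircuit(2)
qc.h(0)        # Hadamard gate: puts qubit 0 into an equal superposition
qc.cx(0, 1)    # CNOT gate: entangles qubit 1 with qubit 0

state = Statevector.from_instruction(qc)
print(state)   # amplitudes of ~0.707 on |00> and |11>: the Bell state (|00> + |11>)/sqrt(2)
```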
Quantum Natural Language Processing (QNLP) leverages algorithms that take advantage of quantum phenomena; for example, Grover's algorithm offers a quadratic speedup for search tasks. Another notable algorithm is the Quantum Approximate Optimization Algorithm (QAOA), which was developed to tackle complex optimization problems by combining classical and quantum computing techniques, though the caveats and implications are a bit too much to unpack in this publication alone.
These mathematical frameworks enable quantum systems to efficiently execute tasks like semantic similarity and question answering by leveraging the unique properties of quantum mechanics to represent and process language data in ways classical systems cannot achieve.
Are you still with me, or did I lose you?
So what’s up with the 1 trillion parameter thing?
If you don't know, and I have your wheels turning: in AI, parameters are the internal values (e.g., weights and biases, yes, biases, in neural networks); remember the nodes and neurons I mentioned in the "So, what in the world is a Token?" section. A model learns these values during training to map inputs to outputs, with the error signal pushed back through the hidden nodes in between by an algorithm known as backpropagation. These values are optimized through algorithms like gradient descent to minimize a loss function.
What is a Parameter?
Parameters define the model's behavior and are crucial for learning patterns in data. In neural networks, these include the weights of connections between neurons and biases added to the output of neurons. They are adjusted during training to improve accuracy.
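A tiny PyTorch sketch (my illustration, not Cerebras code) shows exactly what those weights and biases are, and what one backpropagation-plus-gradient-descent update looks like:

```python
# A single linear layer has a weight matrix and a bias vector; backpropagation
# computes their gradients and gradient descent nudges them to reduce a loss.
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                                  # 4*2 weights + 2 biases
print(sum(p.numel() for p in layer.parameters()))        # -> 10 parameters

x = torch.randn(8, 4)                                    # toy batch of inputs
target = torch.randn(8, 2)                               # toy targets
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

loss = nn.functional.mse_loss(layer(x), target)          # loss function
loss.backward()                                          # backpropagation
opt.step()                                               # gradient descent update
```

Scale that idea from 10 parameters to a trillion and you have the training problem Cerebras is tackling.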
The development of large-scale AI models with billions or trillions of parameters represents a significant advancement in artificial intelligence, enabling these systems to capture highly intricate patterns and relationships within complex datasets. This scalability has demonstrated superior performance across a variety of domains, including natural language processing, computer vision, and advanced problem-solving, while also giving rise to emergent capabilities such as multilingual understanding and contextual reasoning. However, the deployment of such models is not without challenges; their immense computational and energy requirements, coupled with the need for vast amounts of training data, raise critical concerns regarding efficiency, sustainability, and accessibility. Furthermore, the increased complexity of these models amplifies risks associated with bias and ethical accountability, underscoring the need for responsible development and deployment practices in the field.
Role in AI Models
Parameters capture patterns and relationships in training data, allowing models to generalize and make predictions on new data. The quality of these parameters directly influences the model's performance.
What exactly does this mean for AI?
So much, and by the end of 2025, I'm about 75% sure we will see this tripled. By 2030 I won't be surprised if we see a true "tetration" of scaled parameters, once we figure out our source of fuel to power the machines. That, by the way, is another thing worth investing in, or worth pointing AI at, to reduce the time it takes to chemically engineer a new molecule or compound that can produce and even store energy.
Key points of this new breakthrough
Wafer-Scale Processing: at the core of the CS-3 system is the WSE-3, the world's largest AI processor. Clustered together, up to 2,048 units, the combined performance can reach 256 exaFLOPS.
It packs 4 trillion transistors, 57 times more than the largest GPUs. Unlike traditional GPU clusters that rely on external interconnects, the WSE-3 uses on-wafer wiring to connect all cores, offering:
27 petabytes/second of bandwidth, significantly outperforming GPU-based systems.
Elimination of inter-GPU communication bottlenecks, enabling seamless and efficient scaling.
This unified architecture allows compute and memory operations to remain on-wafer, reducing latency and power consumption while simplifying programming models.
MemoryX External Memory: the MemoryX system provides a flexible, high-capacity storage solution for model weights. With up to 1.2 petabytes of external memory, it supports models as large as 24 trillion parameters without requiring partitioning or refactoring. Key features include:
Weight Streaming: Model weights are stored off-chip and streamed directly to the WSE-3 for computation, layer by layer.
Gradient Updates: Gradients are streamed back to MemoryX during training, where weight updates are performed in real-time.
Built using cost-effective DDR5 memory, MemoryX ensures scalability while maintaining affordability. This disaggregated compute-storage architecture allows independent scaling of model size and training speed, overcoming traditional memory bottlenecks.
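Here is a purely conceptual sketch of that weight-streaming loop (my illustration in plain NumPy, not Cerebras' actual software stack): weights live in an external store and are pulled onto the compute engine one layer at a time.

```python
# Conceptual weight streaming: weights stay off-chip (think "MemoryX") and are
# streamed to the accelerator layer by layer. Purely illustrative.
import numpy as np

memory_x = {f"layer_{i}": np.random.randn(16, 16) * 0.1 for i in range(4)}  # off-chip weights

def forward(x):
    for name in sorted(memory_x):
        w = memory_x[name]     # stream this layer's weights onto the chip
        x = np.tanh(x @ w)     # compute the layer on-chip
        # (during training, this layer's gradients would be streamed back to
        #  MemoryX here, where the weight update is applied off-chip)
    return x

print(forward(np.random.randn(1, 16)).shape)   # -> (1, 16)
```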
SwarmX Communication Fabric: the SwarmX interconnect fabric enables efficient scaling across multiple CS-3 systems. It uses a broadcast-reduce topology to manage data flow:
Broadcast: Model weights are replicated across all nodes in real-time.
Reduce: Gradients from all nodes are aggregated back into MemoryX for weight updates.
SwarmX supports near-linear scaling across up to 2,048 systems, enabling the creation of hyperscale AI supercomputers capable of achieving up to a quarter zettaflop of performance. Its hardware-based replication and gradient reduction ensure high efficiency without requiring changes to software or infrastructure.
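The broadcast-reduce pattern itself is easy to simulate; the following toy loop (my illustration, not the SwarmX fabric) broadcasts one set of weights to three "nodes", collects their gradients, and applies a single aggregated update:

```python
# Toy broadcast-reduce: identical weights go out to every node, each node
# computes its own gradient, and the mean gradient drives one central update.
import numpy as np

weights = np.zeros(4)                                    # master copy of the weights
node_targets = [np.random.randn(4) for _ in range(3)]    # each node's local data (toy)

for step in range(20):
    replicas = [weights.copy() for _ in node_targets]            # broadcast
    grads = [r - t for r, t in zip(replicas, node_targets)]      # toy gradient of ||w - t||^2 / 2
    update = np.mean(grads, axis=0)                              # reduce (aggregate)
    weights -= 0.5 * update                                      # single central update

print(weights)   # converges toward the mean of the node targets
```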
Is AI untouchable?
Cerebras' achievement has far-reaching implications for AI research and applications, and even as we continue to advance toward more quantum solutions, classical hardware like this keeps things on track. The achievement underscores Cerebras' leadership in advancing AI hardware and paves the way for more efficient and accessible large-scale machine learning solutions. The integration of WSE-3, MemoryX, and SwarmX positions the CS-3 as a transformative solution for training next-generation AI models:
Simplification of Training Processes: Trillion-parameter models can now be trained as easily as smaller models on traditional GPUs, reducing barriers for researchers.
Energy Efficiency: By consolidating computation into a single system, power consumption is significantly reduced compared to GPU-based clusters, addressing sustainability concerns in AI development.
Accelerated Innovation: This breakthrough enables faster development of advanced generative AI technologies, including large language models (LLMs), drug discovery simulations, and renewable energy optimizations. It also sets the stage for scaling beyond trillion parameters to even larger models in the future.
It not only reduces infrastructure complexity but also accelerates innovation in areas such as generative AI, scientific simulations, and large language models (LLMs). By addressing traditional limitations in compute power and memory bandwidth, Cerebras' CS-3 system paves the way for building even larger models with applications across enterprise and research domains.
This will eventually reduce infrastructure complexity by over 90%, but it will more than likely create the need for more power, so we may find ourselves chasing our tails. Speaking positively for now, it does a splendid job of eliminating the need for extensive hardware reconfigurations or code modifications during scaling.
Traditional computational challenges to consider
Massive GPU Requirements: Think space and facilities. Thousands of GPUs interconnected in complex distributed systems were needed to handle the computational load.
Memory Limitations: Whether volatile or non-volatile (NVM), it all comes at a price. Storing and managing trillions of parameters, weights, and data requires terabytes of memory, far exceeding the capacity of any single GPU (see the quick arithmetic after this list), not to mention the read and write limits that memory is defined by.
Complex Parallelism: Intricate parallel computing strategies were necessary to distribute workloads efficiently, often requiring weeks of setup and optimization.
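The memory arithmetic referenced in the list above is quick to do yourself (my assumptions: 2-byte fp16 weights, an Adam-style optimizer roughly tripling the footprint, and about 80 GB of HBM per NVIDIA H100):

```python
# Back-of-the-envelope memory math for a trillion-parameter model.
params = 1_000_000_000_000            # 1 trillion parameters
weight_bytes = params * 2             # fp16 weights -> ~2 TB
with_optimizer = weight_bytes * 3     # rough Adam-style overhead -> ~6 TB
gpu_hbm = 80 * 10**9                  # ~80 GB of HBM on one H100

print(weight_bytes / 1e12, "TB of weights")
print(with_optimizer / 1e12, "TB with optimizer state")
print(round(with_optimizer / gpu_hbm), "H100s just to hold that state")
```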
Technical Comparison: Cerebras (CS-3) vs. Google's Gemini 2.0 (Willow)
It’s astonishing to consider how just when we believe we've encountered something truly remarkable, an even more extraordinary innovation emerges. This phenomenon reflects the relentless evolution of technology and life itself, where advancements continually surpass our expectations. The concept of "leapfrogging" encapsulates this dynamic progression, illustrating how each breakthrough builds upon and transcends the previous ones. In both life and information technology, we are in a constant state of discovery, where the boundaries of what is possible are perpetually expanded, inspiring us to envision a future filled with even greater possibilities.
Model Capacity and Architectural Scalability for both Cerebras and Gemini
Cerebras CS-3: The CS-3 system, powered by the Wafer Scale Engine 3 (WSE-3), is engineered to handle neural networks with up to 24 trillion parameters, a scale far beyond existing models like GPT-4 or Gemini 2.0. Its architecture leverages a 1.2 petabyte memory system via MemoryX, enabling seamless parameter streaming without partitioning or manual intervention. This disaggregated compute-memory design ensures that even multi-trillion parameter models can be trained without traditional bottlenecks.
Gemini 2.0 (Willow): While Gemini 2.0 is optimized for large-scale multi-modal AI, its model size is estimated to align with GPT-4, significantly smaller than the CS-3's theoretical capacity. Google's TPU v5 hardware provides robust performance but lacks the architectural innovations of wafer-scale integration seen in Cerebras' systems.
Computational Throughput and Training Efficiency for both Cerebras and Gemini
CS-3: The WSE-3 chip delivers 125 petaflops of dense AI compute per system, equivalent to approximately 62 NVIDIA H100 GPUs in terms of raw performance. In a cluster configuration of up to 2,048 nodes, the CS-3 achieves near-linear scaling and can train models like LLaMA 70B in under 24 hours—a task that traditionally takes weeks on GPU-based systems. Its on-wafer communication fabric eliminates inter-device latency, ensuring unparalleled efficiency at hyperscale levels.
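As a rough cross-check of the "approximately 62 H100s" figure (assuming about 2 petaFLOPS of FP16 tensor throughput per H100, the sparsity-enabled number NVIDIA quotes; treat this as an order-of-magnitude comparison, not a benchmark):

```python
# Order-of-magnitude check: 125 petaFLOPS per CS-3 vs. ~2 petaFLOPS per H100.
cs3_pflops = 125
h100_pflops = 2                   # assumed per-GPU figure
print(cs3_pflops / h100_pflops)   # -> 62.5, in line with "approximately 62"
```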
Gemini 2.0 (Willow): Built on Google’s TPU v5 infrastructure, Gemini benefits from highly optimized tensor processing units designed for both training and inference workloads. While TPUs are known for their efficiency, they rely on distributed systems requiring complex orchestration, which introduces latency and limits scalability compared to Cerebras' unified wafer-scale design.
Energy Optimization and Power Efficiency for both Cerebras and Gemini
CS-3: The wafer-scale architecture consolidates computation onto a single chip, drastically reducing energy consumption by eliminating interconnect overheads typical in GPU or TPU clusters. This results in significantly lower power requirements per training run while maintaining high throughput, making it a more sustainable solution for hyperscale AI workloads.
Gemini 2.0 (Willow): TPU v5 systems are also designed with energy efficiency in mind but are constrained by the inherent power demands of distributed architectures and inter-node communication.
Software Complexity and Usability for both Cerebras and Gemini
CS-3: The CS-3 system reduces software complexity by over 90%, requiring only hundreds of lines of code to train trillion-parameter models. Its Weight Streaming architecture abstracts distributed computing challenges, allowing researchers to treat an entire cluster as a single computational entity. This simplification eliminates the need for extensive parallelization strategies or model partitioning.
Gemini 2.0 (Willow): While Google’s software stack is highly advanced, it still requires intricate orchestration of TPU clusters for large-scale training tasks. The reliance on traditional distributed frameworks adds layers of complexity compared to Cerebras’ streamlined approach.
Applications and Use Cases for both Cerebras and Gemini
CS-3: Designed for extreme-scale AI workloads, the CS-3 excels in training massive generative AI models (e.g., LLMs), scientific simulations, and multi-modal architectures requiring trillions of parameters. Its scalability makes it ideal for frontier research in domains like molecular modeling, astrophysics, and next-gen large language models far beyond current capabilities.
Gemini 2.0 (Willow): Optimized for multi-modal AI tasks such as text-image understanding and conversational agents, Gemini is deeply integrated into Google’s ecosystem (e.g., Search, Assistant). While powerful within its scope, its hardware and software stack are less suited for pushing the boundaries of AI model size compared to Cerebras.
Yes, quantum is superior, and almost always will be
Classical computing isn't going anywhere anytime soon, and to be frank it might not ever go away. For the sake of argument, researchers exploring ultra-large model architectures or frontier-scale AI applications will find that the Cerebras CS-3 provides a more advanced and efficient platform than Gemini 2.0's TPU infrastructure.
While Gemini 2.0 is primarily an advanced multimodal AI model, it differs from quantum computing systems like Willow, which features 105 qubits. Gemini's strength lies in its multimodal processing capabilities, seamlessly integrating text, images, and other data types. Theoretically, if we assume that a quantum computer with 19 qubits can match around 200 petaflops, we can extrapolate from this. 256 exaFLOPS (256,000 petaflops) is 1,280 times greater than 200 petaflops, and if each additional qubit roughly doubles computational capacity, we would need about log2(1,280) ≈ 11 more qubits, putting the estimate at roughly 27-30 qubits in total to match or exceed this level of performance, depending on the specific algorithms and tasks being executed.
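That extrapolation, written out under the (big) simplifying assumption that capacity doubles with each added qubit:

```python
# The qubit extrapolation above, assuming capacity roughly doubles per qubit.
import math

baseline_qubits = 19
baseline_pflops = 200
target_pflops = 256_000                        # 256 exaFLOPS

ratio = target_pflops / baseline_pflops        # -> 1280x
extra = math.log2(ratio)                       # -> ~10.3 additional qubits
print(round(ratio), "x gap;", baseline_qubits + math.ceil(extra), "qubits total")
```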
Related: STT-MRAM vs. SOT-MRAM?
Summary
All of this happens because of the CePO (Cerebras Planning and Optimization) framework. To limit the size of this article, I didn't even get into the MATH, GPQA, or CRUX benchmarks.
The Cerebras CS-3 system represents a paradigm shift in AI hardware design, offering unmatched scalability through its wafer-scale architecture and MemoryX integration. It is purpose-built for hyperscale training at massive levels of complexity and parameter count. In contrast, Google's Gemini 2.0 excels within its multi-modal domain but is constrained by the traditional distributed hardware limitations inherent to TPU-based systems; I'll save all that for another day. So what's next in AI? A few things. My favorite: Infinite Memory.
#CS3 #AI #NeurIPS #Cerebras #Gemini