TOPIC 2.5
Chip Architecture 101
⏱️ 24 min read
📚 Technical Fundamentals
Understanding modern chip architecture requires moving beyond the simple notion of "a processor" to recognize that today's computing devices contain sophisticated, specialized systems-on-a-chip. These integrate multiple processor types, memory hierarchies, and specialized accelerators, all working in concert to deliver performance while managing power consumption.
CPU Architecture Fundamentals
ISA Battle: x86 vs. ARM vs. RISC-V
The Central Processing Unit (CPU) remains the "brain" of computing systems, executing general-purpose instructions sequentially. The fundamental instruction language is the Instruction Set Architecture (ISA); a short sketch after the list below contrasts the CISC and RISC styles:
- x86 (Intel, AMD) dominates desktops and servers with complex instruction sets
- ARM (licensed by Apple, Qualcomm, MediaTek) dominates mobile with power-efficient RISC designs
- RISC-V is emerging as an open-source alternative, gaining traction in embedded systems and IoT
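To make the CISC/RISC contrast concrete, here is a toy register-machine sketch in Python. The mini "ISA" below is entirely hypothetical (no real x86, ARM, or RISC-V encodings), but it shows the practical difference: a CISC-style instruction may operate on memory directly, while a RISC-style sequence must load operands into registers first.

```python
# Toy illustration of CISC vs. RISC styles; hypothetical mini-ISA, not real encodings.
regs = {"r0": 0, "r1": 0}
mem = {0x10: 5, 0x14: 7}

# CISC-style: a single instruction may read memory and add in one step.
def cisc_add_mem(dst, addr):
    regs[dst] += mem[addr]                    # register += memory

# RISC-style: only loads/stores touch memory; arithmetic is register-to-register.
def risc_load(dst, addr):
    regs[dst] = mem[addr]

def risc_add(dst, a, b):
    regs[dst] = regs[a] + regs[b]

# The same computation expressed both ways:
regs["r0"] = mem[0x10]                        # CISC: mov + add-with-memory-operand (2 instructions)
cisc_add_mem("r0", 0x14)
print(regs["r0"])                             # 12

risc_load("r0", 0x10)                         # RISC: load, load, add (3 simpler instructions)
risc_load("r1", 0x14)
risc_add("r0", "r0", "r1")
print(regs["r0"])                             # 12
```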
Multi-Core, Hybrid Architectures, and Cache Hierarchies
Modern CPUs integrate multiple processor cores on a single die. Desktop CPUs typically contain 8-16 cores, while server CPUs scale to 128 cores and beyond (AMD's EPYC 9005 "Turin" series tops out at 192 cores in its dense-core configuration). Intel's hybrid architecture separates high-performance P-cores for demanding tasks from efficient E-cores for background workloads.
CPUs maintain multiple levels of fast memory (L1, L2, L3 caches) to reduce latency accessing main RAM. Cache sizes have grown from kilobytes to megabytes, with AMD's 3D V-Cache technology stacking additional cache vertically on the processor die.
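A minimal sketch of why caches matter, using a toy direct-mapped cache model (the 32 KiB size, 64-byte lines, and access patterns are illustrative assumptions, not any real CPU's parameters): sequential access reuses each fetched line many times, while large-stride access misses on nearly every reference and pays the full trip to RAM.

```python
# Toy direct-mapped cache model: 32 KiB cache, 64-byte lines (illustrative numbers).
LINE_BYTES = 64
NUM_LINES = 32 * 1024 // LINE_BYTES

def hit_rate(addresses):
    cache = [None] * NUM_LINES                 # which memory line currently sits in each slot
    hits = 0
    total = 0
    for addr in addresses:
        line_no = addr // LINE_BYTES
        slot = line_no % NUM_LINES
        if cache[slot] == line_no:
            hits += 1
        else:
            cache[slot] = line_no              # miss: fetch the line from "RAM"
        total += 1
    return hits / total

N = 1_000_000
sequential = range(0, N * 4, 4)                # walk a 4-byte-int array element by element
strided = range(0, N * 4096, 4096)             # jump 4 KiB between accesses
print(f"sequential hit rate: {hit_rate(sequential):.1%}")   # ~93.7%: 15 of every 16 accesses reuse a line
print(f"strided hit rate:    {hit_rate(strided):.1%}")      # 0.0%: every access goes to "RAM"
```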
The Clock Speed Plateau and Power Wall
The GHz race plateaued around 2004 due to thermal and power constraints (the "power wall"). Performance gains now come from increased core counts, improved architectures, and specialized execution units rather than raw clock speed increases.
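The power wall follows from the standard dynamic-power relation P ≈ C·V²·f: higher frequency usually also demands higher voltage, so power grows much faster than clock speed. A back-of-the-envelope sketch (the 20% frequency and 15% voltage steps are illustrative assumptions):

```python
# Dynamic power scales roughly as P ~ C * V^2 * f (normalized units).
# Illustrative assumption: each 20% frequency step needs ~15% more voltage.
C = 1.0
V, f = 1.0, 1.0
P0 = C * V**2 * f

for step in range(1, 5):
    f *= 1.20
    V *= 1.15
    P = C * V**2 * f
    print(f"step {step}: frequency x{f:.2f} -> power x{P / P0:.2f}")
# After four steps the clock has roughly doubled (x2.07) but power is ~6.3x the baseline,
# which is why the industry stopped chasing GHz and added cores and accelerators instead.
```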
GPU: From Graphics to General-Purpose Parallel Computing
Massively Parallel Architecture (18,000+ Cores)
Graphics Processing Units evolved from specialized graphics accelerators into general-purpose parallel processors now essential for AI and scientific computing. While CPUs have dozens of powerful cores optimized for sequential tasks, GPUs contain thousands of simpler cores designed for parallel workloads. The full GH100 die behind NVIDIA's H100 contains 18,432 CUDA cores (16,896 enabled on the shipping SXM part), plus specialized Tensor Cores for AI matrix operations.
⚖️ CPU vs GPU Architecture Comparison

| Spec | CPU (Intel Xeon) | GPU (NVIDIA H100) |
| --- | --- | --- |
| Cores | 64-128 | 18,432 |
| Design | Sequential | Parallel |
| Best for | General tasks | AI/Graphics |
| Memory | Large caches (MB) | 80GB HBM3 |
| Power | ~350W | ~700W |

CPUs excel at sequential tasks with complex logic; GPUs dominate parallel workloads like AI training. The sketch below shows why the serial fraction of a workload decides which wins.
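Amdahl's law makes that concrete: speedup = 1 / (s + (1 - s)/N), where s is the serial fraction of the workload and N the number of cores. The serial fractions in this sketch are illustrative assumptions:

```python
# Amdahl's law: overall speedup is capped by the fraction of work that stays serial.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for s in (0.50, 0.10, 0.01):                  # illustrative serial fractions
    for n in (16, 18_432):                    # CPU-like vs. H100-like core counts
        print(f"serial={s:.0%}  cores={n:>6}  speedup={amdahl_speedup(s, n):6.1f}x")
# With 50% serial work, 18,432 cores barely beat 16 (x2.0 vs x1.9);
# with 1% serial work, the GPU's core count finally pays off (~x99 vs ~x14).
```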
CUDA, ROCm, and Software Ecosystem Lock-in
GPUs use Single Instruction, Multiple Data (SIMD) processing, applying the same operation across many data elements simultaneously. This is ideal for graphics rendering, matrix multiplication in AI, and scientific simulations.
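For a rough feel of the data-parallel style, here is a sketch using NumPy as a stand-in (NumPy dispatches to vectorized CPU routines, not GPU hardware, so this is an analogy only): one expression applies the same operation to every element of an array.

```python
import numpy as np

a = np.random.rand(10_000_000).astype(np.float32)
b = np.random.rand(10_000_000).astype(np.float32)

# Scalar style: one element at a time, as a plain Python loop would do it.
# c = [a[i] * b[i] + 1.0 for i in range(len(a))]        # slow; shown for contrast only

# Data-parallel style: a single expression applies the same multiply-and-add
# to every element, letting the library map it onto wide vector hardware.
c = a * b + 1.0
print(c.shape, c.dtype)                                  # (10000000,) float32
```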
NVIDIA's dominance in AI accelerators stems partly from CUDA, a proprietary parallel computing platform that has become the de facto standard for AI development. AMD's ROCm provides an open alternative, while Apple's Metal optimizes for its integrated GPU architectures.
High-Bandwidth Memory Integration
Modern AI GPUs are paired with High-Bandwidth Memory (HBM): stacked memory chips connected with thousands of parallel data paths. NVIDIA's Blackwell Ultra GPU features 288GB of HBM3e memory, providing terabytes per second of bandwidth to feed the compute cores.
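A roofline-style estimate shows why that bandwidth matters. The figures below (8 TB/s of memory bandwidth, 2 PFLOP/s of dense compute) are illustrative assumptions rather than quoted specifications: unless a kernel performs enough math per byte fetched, the compute cores idle while waiting on memory.

```python
# Roofline-style estimate: attainable FLOP/s = min(peak compute, bandwidth x FLOPs-per-byte).
PEAK_FLOPS = 2.0e15          # ~2 PFLOP/s of dense compute (illustrative assumption)
BANDWIDTH = 8.0e12           # ~8 TB/s of HBM bandwidth (illustrative assumption)

def attainable_flops(flops_per_byte: float) -> float:
    return min(PEAK_FLOPS, BANDWIDTH * flops_per_byte)

for intensity in (0.25, 2, 50, 500):     # FLOPs performed per byte moved from memory
    utilization = attainable_flops(intensity) / PEAK_FLOPS
    print(f"arithmetic intensity {intensity:>6} FLOP/byte -> {utilization:.1%} of peak compute")
# Low-intensity kernels (streaming through large tensors) are bandwidth-bound,
# which is why AI accelerators surround the compute die with stacked HBM.
```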
Die Size, Transistor Density, and Manufacturing Economics
Die Size vs. Yield Trade-offs
Modern chip architectures must balance die size (physical area) against transistor density (transistors per square millimeter). The larger the die, the more likely it is to contain a manufacturing defect, and a single defect can destroy an entire die, so yields fall and cost per chip rises. High-end GPUs can exceed 800mm² of die area, representing significant economic risk.
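A common first-order yield model makes this concrete: with defect density D (defects per cm²) and die area A, the fraction of defect-free dies is roughly e^(-D·A). The defect density used below is an illustrative assumption, since real foundry figures are closely guarded:

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """First-order Poisson model: fraction of dies with zero killer defects."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

D = 0.1                       # assumed defect density (defects per cm^2)
for area in (100, 400, 800):
    print(f"{area} mm^2 die -> ~{poisson_yield(area, D):.0%} of dies are good")
# 100 mm^2 -> ~90%, 400 mm^2 -> ~67%, 800 mm^2 -> ~45%:
# the bigger the die, the more silicon is scrapped per good chip.
```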
Transistor Density at 2nm (300M+ per mm²)
At 2nm process nodes, transistor densities exceed 300 million transistors per square millimeter. The Apple M4 chip (28 billion transistors, 3nm process) occupies approximately 140mm² die area. NVIDIA's B100 (208 billion transistors) demonstrates the transistor budgets available for specialized AI accelerators.
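A quick sanity check with the figures above (treating the quoted die area as approximate): dividing transistor count by die area gives the average density.

```python
# Back-of-the-envelope density check using the figures quoted above.
m4_transistors = 28e9         # Apple M4, 3nm-class process
m4_area_mm2 = 140             # approximate die area from the text
density = m4_transistors / m4_area_mm2 / 1e6
print(f"M4 average density: ~{density:.0f}M transistors/mm^2")   # ~200M/mm^2
# 2nm-class nodes push logic density higher still, toward the 300M/mm^2 figure above.
```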
Reticle Limits and Chiplet Solutions
Lithography equipment has physical limits on the maximum die size that can be exposed in a single pass (typically around 800-850mm²). Chiplet strategies, in which multiple smaller dies are packaged together, work around both the reticle limit and the yield risk; AMD's chiplet-based EPYC processors and Apple's Ultra-class chips (two Max dies joined by a die-to-die interconnect) are prominent examples.
Domain-Specific Accelerators
ASICs: Google TPU, Tesla Dojo, AWS Inferentia
Application-Specific Integrated Circuits represent the extreme of specialization: custom chips designed for specific workloads. Google's TPU (Tensor Processing Unit) is optimized exclusively for neural network inference and training, achieving superior performance-per-watt versus general-purpose GPUs for these specific tasks. Tesla's Dojo chip is custom-designed for training autonomous driving models.
NPUs for On-Device AI
Neural Processing Units are specialized accelerators integrated into mobile and PC processors for on-device AI inference, enabling features like real-time translation, image processing, and voice recognition without cloud connectivity.
Specialization vs. Flexibility Trade-offs
ASICs sacrifice flexibility for efficiency. A TPU excels at matrix multiplication for AI but cannot run general-purpose software or render graphics. This specialization makes economic sense only for companies with massive scale running specific workloads continuously.
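A stylized break-even calculation illustrates why scale decides the question. Every figure below (design cost, per-chip savings) is a hypothetical assumption chosen for round numbers, not a real vendor quote:

```python
# Hypothetical break-even sketch; every figure is an illustrative assumption.
nre_cost = 500e6              # one-time cost to design, verify, and tape out a custom ASIC
savings_per_chip = 10_000     # assumed lifetime savings per deployed chip vs. buying GPUs

break_even_units = nre_cost / savings_per_chip
print(f"break-even at ~{break_even_units:,.0f} chips deployed")   # 50,000 chips
# Only operators deploying accelerators by the tens of thousands on a stable workload
# can amortize the fixed cost; everyone else is better served by general-purpose hardware.
```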
3D Integration and Advanced Packaging
HBM Stacking and Through-Silicon Vias
The most significant architectural innovation of the 2020s is the transition from flat, 2D chip designs to three-dimensional integration. High-Bandwidth Memory stacks 8-12 memory dies vertically, connected by through-silicon vias (TSVs) and linked to the processor over thousands of parallel data paths. This provides the extreme bandwidth needed for AI accelerators.
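The bandwidth comes from the sheer width of the stacked interface. Using commonly cited HBM3-class figures (a 1024-bit interface per stack at roughly 6.4 Gbit/s per pin; treat them, and the stack count, as assumptions for the arithmetic rather than a spec sheet):

```python
# Per-stack bandwidth = interface width (bits) x per-pin data rate (Gbit/s) / 8 bits per byte.
bus_bits = 1024               # HBM's very wide per-stack interface (assumed HBM3-class)
gbit_per_pin = 6.4            # per-pin data rate, Gbit/s (assumed HBM3-class)
stacks = 6                    # stacks placed around the GPU (illustrative)

per_stack_gbs = bus_bits * gbit_per_pin / 8
total_tbs = stacks * per_stack_gbs / 1000
print(f"~{per_stack_gbs:.0f} GB/s per stack, ~{total_tbs:.1f} TB/s across {stacks} stacks")
# ~819 GB/s per stack and ~4.9 TB/s in total: the win comes from interface width
# enabled by TSV-based stacking, not from higher clock speeds.
```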
Chiplet Architectures (AMD EPYC, Intel Tile-Based)
AMD's EPYC server processors use a chiplet design: multiple smaller compute dies fabricated on a leading-edge node are connected to an I/O die fabricated on a cheaper, mature node via high-speed interconnects. This approach improves manufacturing yields, lets each die use the most appropriate process node, enables modular scaling, and reduces overall manufacturing cost.
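Reusing the first-order yield model from the die-size section, a sketch of why splitting one large die into chiplets pays off (same assumed defect density; packaging cost and interconnect overhead are ignored for simplicity):

```python
import math

def good_die_fraction(area_mm2: float, d_per_cm2: float = 0.1) -> float:
    return math.exp(-d_per_cm2 * area_mm2 / 100.0)

# Option A: one monolithic 800 mm^2 die.
mono_yield = good_die_fraction(800)

# Option B: eight 100 mm^2 chiplets. Each small die yields far better, and bad dies
# are discarded at wafer sort, before packaging.
chiplet_yield = good_die_fraction(100)

silicon_per_good_mono = 800 / mono_yield              # mm^2 of wafer per good monolithic chip
silicon_per_good_chiplet_set = 8 * 100 / chiplet_yield

print(f"die yield: monolithic ~{mono_yield:.0%} vs chiplet ~{chiplet_yield:.0%}")
print(f"silicon per good product: ~{silicon_per_good_mono:.0f} mm^2 vs ~{silicon_per_good_chiplet_set:.0f} mm^2")
# ~45% vs ~90% die yield; the chiplet route wastes roughly half as much silicon,
# and the I/O die can additionally sit on a cheaper, mature node.
```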
Wafer-Scale Integration (Cerebras WSE-2)
Cerebras Systems has taken integration to the extreme with the Wafer Scale Engine 2: a single processor spanning nearly an entire 300mm silicon wafer and containing 2.6 trillion transistors. This approach maximizes on-chip communication bandwidth and eliminates multi-chip bottlenecks, though it faces extreme engineering challenges in cooling, power delivery, and yield management.
System-on-a-Chip Integration
Apple M-Series: The SoC Paradigm
The pinnacle of modern chip architecture is the SoC: a complete computing system integrated on a single die or package. Apple's M-series chips popularized unified memory accessible by both CPU and GPU without copying data between separate memory pools, which reduces latency, power consumption, and programming complexity.
Heterogeneous Integration and Unified Memory
Modern SoCs integrate dedicated engines for specific tasks: Neural Processing Units (NPUs) for AI inference, Image Signal Processors (ISPs) for camera processing, video encoders/decoders, cryptographic accelerators, and more. Apple's M5 adds a new GPU architecture with a dedicated Neural Accelerator in each GPU core, which Apple credits with up to a 4x increase in peak GPU compute for AI versus the M4.
Power Management and Specialized Accelerators
SoCs employ sophisticated power management with independent voltage and frequency scaling for each component, dynamically allocating power budget to active units while power-gating idle components. This enables the multi-day battery life of modern mobile devices.
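A stylized DVFS sketch using the same P ≈ C·V²·f relation (the operating points and idle-leakage figure are illustrative assumptions): running slower at lower voltage takes longer but spends less energy, and power-gating removes the idle cost entirely.

```python
# Energy for a fixed task at two DVFS operating points (illustrative, normalized units).
def energy(voltage: float, freq: float, work_cycles: float, idle_leak_watts: float,
           deadline_s: float) -> float:
    run_time = work_cycles / freq
    dynamic = voltage**2 * freq * run_time                      # P ~ V^2 * f, times run time
    idle = idle_leak_watts * max(0.0, deadline_s - run_time)    # leakage while finished-but-idle
    return dynamic + idle

WORK, DEADLINE = 1.0, 2.0
fast = energy(voltage=1.0, freq=1.0, work_cycles=WORK, idle_leak_watts=0.1, deadline_s=DEADLINE)
slow = energy(voltage=0.7, freq=0.5, work_cycles=WORK, idle_leak_watts=0.1, deadline_s=DEADLINE)
gated = energy(voltage=1.0, freq=1.0, work_cycles=WORK, idle_leak_watts=0.0, deadline_s=DEADLINE)

print(f"race-to-idle: {fast:.2f}  slow-and-steady: {slow:.2f}  race + power gating: {gated:.2f}")
# Lower V/f wins on dynamic energy (0.49 vs 1.0 here); power gating removes the idle
# leakage term, which is why SoCs combine both techniques per component.
```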
The industry has fundamentally shifted from "make faster transistors" to "architect smarter systems." The greatest performance and efficiency gains now come from heterogeneous integration, memory hierarchy optimization, domain-specific acceleration, 3D integration, and software-hardware co-design.
🎯 Key Takeaways
- CPUs (dozens of powerful cores for sequential tasks) excel at general-purpose computing, while GPUs (thousands of simple cores for parallel workloads) dominate AI/graphics with NVIDIA's H100 containing 18,432 CUDA cores plus Tensor Cores
- At 2nm nodes, transistor density exceeds 300 million/mm²; larger dies yield more defects, driving chiplet strategies where AMD and Intel combine multiple smaller dies to improve yields and reduce costs
- Application-Specific Integrated Circuits (Google TPU, Tesla Dojo) sacrifice flexibility for 10-100x efficiency gains on specific workloads, making economic sense only at hyperscale
- Modern architectures integrate vertically through HBM memory stacks (8-12 dies), 3D V-Cache, and chiplet packaging; Cerebras' 2.6 trillion transistor Wafer Scale Engine represents the extreme of integration
[← Previous Topic: Chip Fabrication & Manufacturing](topic-4.html) | [Next Topic: Semiconductor Companies & Roles →](topic-6.html)