TOPIC 2.5
Chip Architecture 101
⏱️ 24 min read
📚 Technical Fundamentals
Understanding modern chip architecture requires moving beyond the simple notion of "a processor" to recognize that today's computing devices contain sophisticated, specialized systems-on-a-chip. These integrate multiple processor types, memory hierarchies, and specialized accelerators, all working in concert to deliver performance while managing power consumption.
CPU Architecture Fundamentals
ISA Battle: x86 vs. ARM vs. RISC-V
The Central Processing Unit (CPU) remains the "brain" of computing systems, executing general-purpose instructions sequentially. The fundamental instruction language is the Instruction Set Architecture (ISA); a short sketch after the list below contrasts the CISC and RISC styles:
- x86 (Intel, AMD) dominates desktops and servers with complex instruction sets
- ARM (licensed by Apple, Qualcomm, MediaTek) dominates mobile with power-efficient RISC designs
- RISC-V is emerging as an open-source alternative, gaining traction in embedded systems and IoT
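To make the CISC/RISC contrast concrete, here is a toy register-machine sketch in Python. The mini "ISA" below is entirely hypothetical (no real x86, ARM, or RISC-V encodings), but it shows the practical difference: a CISC-style instruction may operate on memory directly, while a RISC-style sequence must load operands into registers first.

```python
# Toy illustration of CISC vs. RISC styles; hypothetical mini-ISA, not real encodings.
regs = {"r0": 0, "r1": 0}
mem = {0x10: 5, 0x14: 7}

# CISC-style: a single instruction may read memory and add in one step.
def cisc_add_mem(dst, addr):
    regs[dst] += mem[addr]                    # register += memory

# RISC-style: only loads/stores touch memory; arithmetic is register-to-register.
def risc_load(dst, addr):
    regs[dst] = mem[addr]

def risc_add(dst, a, b):
    regs[dst] = regs[a] + regs[b]

# The same computation expressed both ways:
regs["r0"] = mem[0x10]                        # CISC: mov + add-with-memory-operand (2 instructions)
cisc_add_mem("r0", 0x14)
print(regs["r0"])                             # 12

risc_load("r0", 0x10)                         # RISC: load, load, add (3 simpler instructions)
risc_load("r1", 0x14)
risc_add("r0", "r0", "r1")
print(regs["r0"])                             # 12
```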
Multi-Core, Hybrid Architectures, and Cache Hierarchies
Modern CPUs integrate multiple processor cores on a single die. Desktop CPUs typically contain 8-16 cores, while server CPUs scale to 128 cores and beyond (AMD's EPYC 9005 "Turin" series tops out at 192 cores in its dense-core configuration). Intel's hybrid architecture separates high-performance P-cores for demanding tasks from efficient E-cores for background workloads.
CPUs maintain multiple levels of fast memory (L1, L2, L3 caches) to reduce latency accessing main RAM. Cache sizes have grown from kilobytes to megabytes, with AMD's 3D V-Cache technology stacking additional cache vertically on the processor die.
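A minimal sketch of why caches matter, using a toy direct-mapped cache model (the 32 KiB size, 64-byte lines, and access patterns are illustrative assumptions, not any real CPU's parameters): sequential access reuses each fetched line many times, while large-stride access misses on nearly every reference and pays the full trip to RAM.

```python
# Toy direct-mapped cache model: 32 KiB cache, 64-byte lines (illustrative numbers).
LINE_BYTES = 64
NUM_LINES = 32 * 1024 // LINE_BYTES

def hit_rate(addresses):
    cache = [None] * NUM_LINES                 # which memory line currently sits in each slot
    hits = 0
    total = 0
    for addr in addresses:
        line_no = addr // LINE_BYTES
        slot = line_no % NUM_LINES
        if cache[slot] == line_no:
            hits += 1
        else:
            cache[slot] = line_no              # miss: fetch the line from "RAM"
        total += 1
    return hits / total

N = 1_000_000
sequential = range(0, N * 4, 4)                # walk a 4-byte-int array element by element
strided = range(0, N * 4096, 4096)             # jump 4 KiB between accesses
print(f"sequential hit rate: {hit_rate(sequential):.1%}")   # ~93.7%: 15 of every 16 accesses reuse a line
print(f"strided hit rate:    {hit_rate(strided):.1%}")      # 0.0%: every access goes to "RAM"
```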
The Clock Speed Plateau and Power Wall
The GHz race plateaued around 2004 due to thermal and power constraints (the "power wall"). Performance gains now come from increased core counts, improved architectures, and specialized execution units rather than raw clock speed increases.
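The power wall follows from the standard dynamic-power relation P ≈ C·V²·f: higher frequency usually also demands higher voltage, so power grows much faster than clock speed. A back-of-the-envelope sketch (the 20% frequency and 15% voltage steps are illustrative assumptions):

```python
# Dynamic power scales roughly as P ~ C * V^2 * f (normalized units).
# Illustrative assumption: each 20% frequency step needs ~15% more voltage.
C = 1.0
V, f = 1.0, 1.0
P0 = C * V**2 * f

for step in range(1, 5):
    f *= 1.20
    V *= 1.15
    P = C * V**2 * f
    print(f"step {step}: frequency x{f:.2f} -> power x{P / P0:.2f}")
# After four steps the clock has roughly doubled (x2.07) but power is ~6.3x the baseline,
# which is why the industry stopped chasing GHz and added cores and accelerators instead.
```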
GPU: From Graphics to General-Purpose Parallel Computing
Massively Parallel Architecture (18,000+ Cores)
Graphics Processing Units evolved from specialized graphics accelerators into general-purpose parallel processors now essential for AI and scientific computing. While CPUs have dozens of powerful cores optimized for sequential tasks, GPUs contain thousands of simpler cores designed for parallel workloads. The full GH100 die behind NVIDIA's H100 contains 18,432 CUDA cores (16,896 enabled on the shipping SXM part), plus specialized Tensor Cores for AI matrix operations.
⚖️ CPU vs GPU Architecture Comparison

| Spec | CPU (Intel Xeon) | GPU (NVIDIA H100) |
| --- | --- | --- |
| Cores | 64-128 | 18,432 |
| Design | Sequential | Parallel |
| Best for | General tasks | AI/Graphics |
| Memory | Large caches (MB) | 80GB HBM3 |
| Power | ~350W | ~700W |

CPUs excel at sequential tasks with complex logic; GPUs dominate parallel workloads like AI training. The sketch below shows why the serial fraction of a workload decides which wins.
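Amdahl's law makes that concrete: speedup = 1 / (s + (1 - s)/N), where s is the serial fraction of the workload and N the number of cores. The serial fractions in this sketch are illustrative assumptions:

```python
# Amdahl's law: overall speedup is capped by the fraction of work that stays serial.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for s in (0.50, 0.10, 0.01):                  # illustrative serial fractions
    for n in (16, 18_432):                    # CPU-like vs. H100-like core counts
        print(f"serial={s:.0%}  cores={n:>6}  speedup={amdahl_speedup(s, n):6.1f}x")
# With 50% serial work, 18,432 cores barely beat 16 (x2.0 vs x1.9);
# with 1% serial work, the GPU's core count finally pays off (~x99 vs ~x14).
```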
CUDA, ROCm, and Software Ecosystem Lock-in
GPUs use Single Instruction, Multiple Data (SIMD) processing, applying the same operation across many data elements simultaneously. This is ideal for graphics rendering, matrix multiplication in AI, and scientific simulations.
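For a rough feel of the data-parallel style, here is a sketch using NumPy as a stand-in (NumPy dispatches to vectorized CPU routines, not GPU hardware, so this is an analogy only): one expression applies the same operation to every element of an array.

```python
import numpy as np

a = np.random.rand(10_000_000).astype(np.float32)
b = np.random.rand(10_000_000).astype(np.float32)

# Scalar style: one element at a time, as a plain Python loop would do it.
# c = [a[i] * b[i] + 1.0 for i in range(len(a))]        # slow; shown for contrast only

# Data-parallel style: a single expression applies the same multiply-and-add
# to every element, letting the library map it onto wide vector hardware.
c = a * b + 1.0
print(c.shape, c.dtype)                                  # (10000000,) float32
```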
NVIDIA's dominance in AI accelerators stems partly from CUDA, a proprietary parallel computing platform that has become the de facto standard for AI development. AMD's ROCm provides an open alternative, while Apple's Metal optimizes for its integrated GPU architectures.
High-Bandwidth Memory Integration
Modern AI GPUs are paired with High-Bandwidth Memory (HBM): stacked memory chips connected with thousands of parallel data paths. NVIDIA's Blackwell Ultra GPU features 288GB of HBM3e memory, providing terabytes per second of bandwidth to feed the compute cores.
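A roofline-style estimate shows why that bandwidth matters. The figures below (8 TB/s of memory bandwidth, 2 PFLOP/s of dense compute) are illustrative assumptions rather than quoted specifications: unless a kernel performs enough math per byte fetched, the compute cores idle while waiting on memory.

```python
# Roofline-style estimate: attainable FLOP/s = min(peak compute, bandwidth x FLOPs-per-byte).
PEAK_FLOPS = 2.0e15          # ~2 PFLOP/s of dense compute (illustrative assumption)
BANDWIDTH = 8.0e12           # ~8 TB/s of HBM bandwidth (illustrative assumption)

def attainable_flops(flops_per_byte: float) -> float:
    return min(PEAK_FLOPS, BANDWIDTH * flops_per_byte)

for intensity in (0.25, 2, 50, 500):     # FLOPs performed per byte moved from memory
    utilization = attainable_flops(intensity) / PEAK_FLOPS
    print(f"arithmetic intensity {intensity:>6} FLOP/byte -> {utilization:.1%} of peak compute")
# Low-intensity kernels (streaming through large tensors) are bandwidth-bound,
# which is why AI accelerators surround the compute die with stacked HBM.
```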
Die Size, Transistor Density, and Manufacturing Economics
Die Size vs. Yield Trade-offs
Modern chip architectures must balance die size (physical area) against transistor density (transistors per square millimeter). The larger the die, the more likely it is to contain a manufacturing defect, and a single defect can destroy an entire die, so yields fall and cost per chip rises. High-end GPUs can exceed 800mm² of die area, representing significant economic risk.
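A common first-order yield model makes this concrete: with defect density D (defects per cm²) and die area A, the fraction of defect-free dies is roughly e^(-D·A). The defect density used below is an illustrative assumption, since real foundry figures are closely guarded:

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """First-order Poisson model: fraction of dies with zero killer defects."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

D = 0.1                       # assumed defect density (defects per cm^2)
for area in (100, 400, 800):
    print(f"{area} mm^2 die -> ~{poisson_yield(area, D):.0%} of dies are good")
# 100 mm^2 -> ~90%, 400 mm^2 -> ~67%, 800 mm^2 -> ~45%:
# the bigger the die, the more silicon is scrapped per good chip.
```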
Transistor Density at 2nm (300M+ per mm²)
At 2nm process nodes, transistor densities exceed 300 million transistors per square millimeter. The Apple M4 chip (28 billion transistors, 3nm process) occupies approximately 140mm² die area. NVIDIA's B100 (208 billion transistors) demonstrates the transistor budgets available for specialized AI accelerators.
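A quick sanity check with the figures above (treating the quoted die area as approximate): dividing transistor count by die area gives the average density.

```python
# Back-of-the-envelope density check using the figures quoted above.
m4_transistors = 28e9         # Apple M4, 3nm-class process
m4_area_mm2 = 140             # approximate die area from the text
density = m4_transistors / m4_area_mm2 / 1e6
print(f"M4 average density: ~{density:.0f}M transistors/mm^2")   # ~200M/mm^2
# 2nm-class nodes push logic density higher still, toward the 300M/mm^2 figure above.
```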
Reticle Limits and Chiplet Solutions
Lithography equipment has physical limits on the maximum die size that can be exposed in a single pass (typically around 800-850mm²). Chiplet strategies, in which multiple smaller dies are packaged together, work around both the reticle limit and the yield risk; AMD's chiplet-based EPYC processors and Apple's Ultra-class chips (two Max dies joined by a die-to-die interconnect) are prominent examples.
Domain-Specific Accelerators
ASICs: Google TPU, Tesla Dojo, AWS Inferentia
Application-Specific Integrated Circuits represent the extreme of specialization: custom chips designed for specific workloads. Google's TPU (Tensor Processing Unit) is optimized exclusively for neural network inference and training, achieving superior performance-per-watt versus general-purpose GPUs for these specific tasks. Tesla's Dojo chip is custom-designed for training autonomous driving models.
NPUs for On-Device AI
Neural Processing Units are specialized accelerators integrated into mobile and PC processors for on-device AI inference, enabling features like real-time translation, image processing, and voice recognition without cloud connectivity.
Specialization vs. Flexibility Trade-offs
ASICs sacrifice flexibility for efficiency. A TPU excels at matrix multiplication for AI but cannot run general-purpose software or render graphics. This specialization makes economic sense only for companies with massive scale running specific workloads continuously.
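A stylized break-even calculation illustrates why scale decides the question. Every figure below (design cost, per-chip savings) is a hypothetical assumption chosen for round numbers, not a real vendor quote:

```python
# Hypothetical break-even sketch; every figure is an illustrative assumption.
nre_cost = 500e6              # one-time cost to design, verify, and tape out a custom ASIC
savings_per_chip = 10_000     # assumed lifetime savings per deployed chip vs. buying GPUs

break_even_units = nre_cost / savings_per_chip
print(f"break-even at ~{break_even_units:,.0f} chips deployed")   # 50,000 chips
# Only operators deploying accelerators by the tens of thousands on a stable workload
# can amortize the fixed cost; everyone else is better served by general-purpose hardware.
```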
3D Integration and Advanced Packaging
HBM Stacking and Through-Silicon Vias
The most significant architectural innovation of the 2020s is the transition from flat, 2D chip designs to three-dimensional integration. High-Bandwidth Memory stacks 8-12 memory dies vertically, connected by through-silicon vias (TSVs) and linked to the processor over thousands of parallel data paths. This provides the extreme bandwidth needed for AI accelerators.
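The bandwidth comes from the sheer width of the stacked interface. Using commonly cited HBM3-class figures (a 1024-bit interface per stack at roughly 6.4 Gbit/s per pin; treat them, and the stack count, as assumptions for the arithmetic rather than a spec sheet):

```python
# Per-stack bandwidth = interface width (bits) x per-pin data rate (Gbit/s) / 8 bits per byte.
bus_bits = 1024               # HBM's very wide per-stack interface (assumed HBM3-class)
gbit_per_pin = 6.4            # per-pin data rate, Gbit/s (assumed HBM3-class)
stacks = 6                    # stacks placed around the GPU (illustrative)

per_stack_gbs = bus_bits * gbit_per_pin / 8
total_tbs = stacks * per_stack_gbs / 1000
print(f"~{per_stack_gbs:.0f} GB/s per stack, ~{total_tbs:.1f} TB/s across {stacks} stacks")
# ~819 GB/s per stack and ~4.9 TB/s in total: the win comes from interface width
# enabled by TSV-based stacking, not from higher clock speeds.
```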
Chiplet Architectures (AMD EPYC, Intel Tile-Based)
AMD's EPYC server processors use a chiplet design: multiple smaller compute dies fabricated on a leading-edge node are connected to an I/O die fabricated on a cheaper, mature node via high-speed interconnects. This approach improves manufacturing yields, lets each die use the most appropriate process node, enables modular scaling, and reduces overall manufacturing cost.
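Reusing the first-order yield model from the die-size section, a sketch of why splitting one large die into chiplets pays off (same assumed defect density; packaging cost and interconnect overhead are ignored for simplicity):

```python
import math

def good_die_fraction(area_mm2: float, d_per_cm2: float = 0.1) -> float:
    return math.exp(-d_per_cm2 * area_mm2 / 100.0)

# Option A: one monolithic 800 mm^2 die.
mono_yield = good_die_fraction(800)

# Option B: eight 100 mm^2 chiplets. Each small die yields far better, and bad dies
# are discarded at wafer sort, before packaging.
chiplet_yield = good_die_fraction(100)

silicon_per_good_mono = 800 / mono_yield              # mm^2 of wafer per good monolithic chip
silicon_per_good_chiplet_set = 8 * 100 / chiplet_yield

print(f"die yield: monolithic ~{mono_yield:.0%} vs chiplet ~{chiplet_yield:.0%}")
print(f"silicon per good product: ~{silicon_per_good_mono:.0f} mm^2 vs ~{silicon_per_good_chiplet_set:.0f} mm^2")
# ~45% vs ~90% die yield; the chiplet route wastes roughly half as much silicon,
# and the I/O die can additionally sit on a cheaper, mature node.
```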
Wafer-Scale Integration (Cerebras WSE-2)
Cerebras Systems has taken integration to the extreme with the Wafer Scale Engine 2: a single processor spanning nearly an entire 300mm silicon wafer and containing 2.6 trillion transistors. This approach maximizes on-chip communication bandwidth and eliminates multi-chip bottlenecks, though it faces extreme engineering challenges in cooling, power delivery, and yield management.
System-on-a-Chip Integration
Apple M-Series: The SoC Paradigm
The pinnacle of modern chip architecture is the SoC: a complete computing system integrated on a single die or package. Apple's M-series chips popularized unified memory accessible by both CPU and GPU without copying data between separate memory pools, which reduces latency, power consumption, and programming complexity.
Heterogeneous Integration and Unified Memory
Modern SoCs integrate dedicated engines for specific tasks: Neural Processing Units (NPUs) for AI inference, Image Signal Processors (ISPs) for camera processing, video encoders/decoders, cryptographic accelerators, and more. Apple's M5 adds a new GPU architecture with a dedicated Neural Accelerator in each GPU core, which Apple credits with up to a 4x increase in peak GPU compute for AI versus the M4.
Power Management and Specialized Accelerators
SoCs employ sophisticated power management with independent voltage and frequency scaling for each component, dynamically allocating power budget to active units while power-gating idle components. This enables the multi-day battery life of modern mobile devices.
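A stylized DVFS sketch using the same P ≈ C·V²·f relation (the operating points and idle-leakage figure are illustrative assumptions): running slower at lower voltage takes longer but spends less energy, and power-gating removes the idle cost entirely.

```python
# Energy for a fixed task at two DVFS operating points (illustrative, normalized units).
def energy(voltage: float, freq: float, work_cycles: float, idle_leak_watts: float,
           deadline_s: float) -> float:
    run_time = work_cycles / freq
    dynamic = voltage**2 * freq * run_time                      # P ~ V^2 * f, times run time
    idle = idle_leak_watts * max(0.0, deadline_s - run_time)    # leakage while finished-but-idle
    return dynamic + idle

WORK, DEADLINE = 1.0, 2.0
fast = energy(voltage=1.0, freq=1.0, work_cycles=WORK, idle_leak_watts=0.1, deadline_s=DEADLINE)
slow = energy(voltage=0.7, freq=0.5, work_cycles=WORK, idle_leak_watts=0.1, deadline_s=DEADLINE)
gated = energy(voltage=1.0, freq=1.0, work_cycles=WORK, idle_leak_watts=0.0, deadline_s=DEADLINE)

print(f"race-to-idle: {fast:.2f}  slow-and-steady: {slow:.2f}  race + power gating: {gated:.2f}")
# Lower V/f wins on dynamic energy (0.49 vs 1.0 here); power gating removes the idle
# leakage term, which is why SoCs combine both techniques per component.
```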
The industry has fundamentally shifted from "make faster transistors" to "architect smarter systems." The greatest performance and efficiency gains now come from heterogeneous integration, memory hierarchy optimization, domain-specific acceleration, 3D integration, and software-hardware co-design.
🎯 Key Takeaways
- CPUs (dozens of powerful cores for sequential tasks) excel at general-purpose computing, while GPUs (thousands of simple cores for parallel workloads) dominate AI/graphics with NVIDIA's H100 containing 18,432 CUDA cores plus Tensor Cores
- At 2nm nodes, transistor density exceeds 300 million/mm²; larger dies yield more defects, driving chiplet strategies where AMD and Intel combine multiple smaller dies to improve yields and reduce costs
- Application-Specific Integrated Circuits (Google TPU, Tesla Dojo) sacrifice flexibility for 10-100x efficiency gains on specific workloads, making economic sense only at hyperscale
- Modern architectures integrate vertically through HBM memory stacks (8-12 dies), 3D V-Cache, and chiplet packaging; Cerebras' 2.6 trillion transistor Wafer Scale Engine represents the extreme of integration
[← Previous Topic: Chip Fabrication & Manufacturing](topic-4.html) | [Next Topic: Semiconductor Companies & Roles →](topic-6.html)