Only ~20% of FAANG onsite candidates get offers. Master 50 system design interview questions for 2026 with difficulty ratings and L4–L6 expectations.
Walk into a FAANG onsite today and your odds aren't great. Only about 20% of candidates who reach the full onsite loop receive an offer, according to Interviewing.io data from 2025. System design is the round that most often separates L4 hires from L5 rejections - not coding, not behavioral. The bar has also shifted in 2026. AI/ML integration and cost-aware architecture are now baseline L5 expectations, not Staff-level aspirations (Hello Interview, March 2026).
This guide covers all 50 system design interview questions you're likely to face across L4, L5, and L6 loops at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Each question includes difficulty ratings, core concept tags, and companies that commonly ask it. Three deep-dive sections walk through URL shorteners, distributed message queues, and - the one most candidates aren't ready for - LLM inference services. You'll also get a 90-day prep strategy, a structured answer framework, and a full FAQ.
Start with our distributed systems fundamentals guide if you need to brush up before tackling the questions below.
Key Takeaways
Only ~20% of FAANG onsite candidates receive offers (Interviewing.io, 2025). System design is often the deciding round.
AI/ML system design and cost-aware architecture became baseline L5 expectations in 2025, not Staff-level (Hello Interview, 2026).
82% of container users now run Kubernetes in production, and 66% of orgs running generative AI use it for inference (CNCF 2025 Annual Survey).
Google L5 median total comp is ~$421K. The L5-to-L6 jump brings a 40–70% increase, mostly from stock grants (Levels.fyi + Apex Interviewer, 2026).
Use the RADIO framework (Requirements, Architecture, Data model, Interface, Optimization) to structure any system design answer.
The goalposts moved. AI/ML integration was a Staff-engineer concern two years ago. In 2025, both AI/ML integration and cost optimization as an architectural discipline became baseline L5 expectations (Hello Interview, March 2026). That means a Senior SWE candidate who can't reason about model serving, inference latency, or GPU cost tradeoffs is now under-prepared - even if they nail distributed databases and caching.
The infrastructure layer shifted, too. Kubernetes production adoption reached 82% of container users in 2025, up from 66% in 2023 (CNCF 2025 Annual Survey, January 2026). Interviewers at companies running K8s at this scale expect candidates to speak fluently about container scheduling, resource limits, and horizontal pod autoscaling - not just treat them as buzzwords.
Microservices are also in retreat. 42% of organizations that adopted microservices are now consolidating services back into larger deployable units (CNCF 2025 Annual Survey). This is a real shift interviewers are aware of. Proposing a 12-service microservices architecture in 2026 without justifying operational complexity will raise eyebrows, not impress panels.
Container tooling is nearly universal now. Docker adoption jumped to 71.1% of all developers in 2025 - the largest single-year increase of any technology tracked (Stack Overflow Developer Survey 2025, 49,000+ respondents). The practical implication: containerization is assumed. You won't earn points for mentioning Docker. You earn points for reasoning about multi-stage builds, image layer caching, and registry latency under failure.
So what does this mean for your prep? It means the baseline moved up. The questions haven't changed that much. The depth expected in your answers has.
Citation Capsule: In 2025, AI/ML integration and cost-aware architecture became standard L5 expectations at FAANG companies, no longer reserved for Staff-level discussions. Kubernetes production adoption reached 82% of container users that year, up from 66% in 2023, according to the CNCF 2025 Annual Survey published in January 2026.
The gap between levels isn't just about knowing more patterns - it's about the quality of reasoning under constraints. At L4, interviewers want to see that you can define a working system. At L5, they want you to optimize it under real-world pressure. At L6, you're expected to challenge the problem itself and propose architectural strategies that account for org-level tradeoffs (Apex Interviewer, 2026).
L4 (Mid-level SWE) - You should define clear functional requirements, choose appropriate storage (SQL vs NoSQL), sketch a basic service architecture, and identify one or two scale bottlenecks. Interviewers don't expect you to solve every edge case. They do expect you to ask clarifying questions before jumping to solutions.
L5 (Senior SWE) - This is where most interview prep guides underestimate the bar. You need to handle dynamic constraints mid-interview, reason about cost tradeoffs, discuss consistency models (eventual vs strong), design for observability, and address failure modes proactively. AI/ML integration knowledge is now expected at this level, not optional.
L6 (Staff SWE) - Answers should reflect system thinking at organizational scale. You'll be asked how a design evolves over 3-5 years, how it aligns with company infrastructure strategy, and how teams would own and operate it. Interviewers probe for constraints you'd push back on - a good L6 candidate challenges flawed assumptions in the problem statement itself.
PERSONAL EXPERIENCE -
We've found that candidates who treat L5 prep as "L4 but faster" consistently underperform. The jump isn't speed. It's depth of reasoning about failure, cost, and tradeoffs that didn't exist in the original requirements.
The compensation stakes make this worth the investment. Google L4 median total comp sits at approximately $264K; L5 is around $421K; L6 reaches approximately $700K (Levels.fyi, 2025). The L5-to-L6 jump brings a 40-70% increase, driven almost entirely by stock grants (Apex Interviewer, 2026). That's not a small difference to optimize for.
See our FAANG interview process overview for a full breakdown of each round and what to expect at each level.
Citation Capsule: Google L5 (Senior SWE) median total compensation in the US is approximately $421K, while L6 reaches around $700K - a 40-70% increase driven primarily by stock grants, according to Levels.fyi and Apex Interviewer 2026. At Meta, E5 sits near $500K and E6 near $750K.
FAANG Compensation by Level (2025–2026)

| Level | Google | Meta | Netflix |
|---|---|---|---|
| Mid (L4 / E4) | $264K | $261K | - |
| Senior (L5 / E5) | $421K | $500K | $700K+ |
| Staff (L6 / E6) | $700K | $750K | $900K+ |
Source: Levels.fyi + Apex Interviewer, 2026
The structure of a system design round hasn't changed much - 45-60 minutes, one open-ended problem, a whiteboard or shared doc. But the content expectations inside that window shifted considerably. AI/ML system design questions are now standard at L5 and above. You may be asked to design a recommendation system, a vector similarity search service, or an LLM inference API - not just as a "hard bonus" question but as a core evaluation criterion (Hello Interview, 2026).
Dynamic constraints are showing up more frequently. Interviewers will let you build a solid architecture, then change a constraint mid-session. "Now assume the read volume is 10x what you assumed" or "what changes if this needs to be multi-region?" These pivots test adaptability, not just knowledge.
Cost-aware design is now an explicit evaluation dimension. Interviewers ask candidates to estimate infrastructure costs, compare storage options by price-per-GB, and justify architectural choices against budget constraints. This is new. Two years ago, cost was rarely discussed below Staff level.
Failure mode probing is also more structured. Interviewers now explicitly ask: "What happens when this service goes down?" and "How does your design degrade gracefully?" Partial availability, circuit breakers, and fallback strategies are expected topics at L5+.
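When an interviewer asks how your design degrades gracefully, it helps to be able to sketch the mechanism, not just name it. Below is a minimal circuit breaker in Python; the class name, thresholds, and state handling are illustrative assumptions, not any particular library's API:

```python
import time

# Minimal circuit breaker sketch: after `max_failures` consecutive
# errors the circuit "opens" and calls fail fast; after `reset_after`
# seconds one trial call is allowed through ("half-open").
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

The point interviewers probe: failing fast protects the caller's thread pool and gives the downstream service room to recover, at the cost of serving errors (or a fallback) during the open window.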
Citation Capsule: In 2025, AI/ML integration and cost optimization became standard L5 system design evaluation criteria at FAANG companies, according to Hello Interview's March 2026 updated learning guide. Candidates who treat these as optional depth areas are likely to underperform in Senior SWE loops.
These 50 questions cover the full range of what you'll encounter across L4 through L6 loops at FAANG companies. They're drawn from community reports, interviewing.io transcripts, and first-hand candidate accounts. Each tier includes a deep-dive walkthrough of the most instructive question at that level.
These questions test fundamental distributed systems knowledge. At L4, you're expected to demonstrate sound judgment in component selection and a working mental model of scale. You don't need to have built these systems. You do need to reason clearly about why your choices make sense.
| # | Question | Core Concepts | Commonly Asked At |
|---|---|---|---|
| 1 | Design a URL shortener | Hashing, key-value storage, redirect logic | Google, Meta, Amazon |
| 2 | Design a rate limiter | Token bucket, sliding window, Redis | Amazon, Stripe, Cloudflare |
| 3 | Design a key-value store | Storage engine, replication, consistency | Amazon, Google |
| 4 | Design a web crawler | BFS/DFS, politeness, deduplication | Google, LinkedIn |
| 5 | Design a notification system | Fan-out, push/pull, message queues | Meta, Uber, Twitter |
| 6 | Design a pastebin service | Object storage, expiry, read-heavy patterns | Amazon, Adobe |
| 7 | Design a parking lot system | OOP, state machines, concurrency | Microsoft, Apple |
| 8 | Design a leaderboard | Sorted sets, Redis ZSET, cache invalidation | Riot Games, LinkedIn |
| 9 | Design a simple chat application | WebSockets, message persistence, presence | Slack, Discord, Meta |
| 10 | Design a file storage service | Chunking, deduplication, metadata DB | Dropbox, Box, Google |
| 11 | Design a content delivery network | Edge caching, TTL, origin pull | Cloudflare, Akamai, Netflix |
| 12 | Design a job scheduler | Priority queue, idempotency, retry logic | Airbnb, LinkedIn, Uber |
| 13 | Design an autocomplete service | Trie, prefix search, caching | Google, Amazon |
| 14 | Design a type-ahead search | Read-heavy, denormalized index, latency | Twitter, LinkedIn |
| 15 | Design a stock ticker | Time-series data, pub/sub, low latency | Bloomberg, Robinhood |
URL shorteners look simple. They're actually a clean lens into foundational distributed systems thinking - which is exactly why Google and Meta keep using them as L4 screen questions.
Requirements clarification first. How many URLs per day? Read-to-write ratio? Do shortened links expire? Are custom slugs supported? A good L4 candidate asks these before drawing anything. Assume 100 million URLs created per day and a 100:1 read-to-write ratio.
High-level design. A stateless API layer handles creation and redirect. A key-value store (DynamoDB or Cassandra, fronted by a Redis cache) maps short codes to long URLs. Short codes are generated via base62 encoding of an auto-incremented ID or a hash of the original URL. Use a separate ID generation service (like Twitter's Snowflake) if you need uniqueness at scale across multiple writers.
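A minimal sketch of the base62 scheme described above (alphabet ordering and function names are illustrative; seven base62 characters already cover ~3.5 trillion codes):

```python
# Base62 encoding of an auto-incremented numeric ID into a short code.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Convert a numeric ID into a compact base62 short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode_base62(code: str) -> int:
    """Recover the numeric ID from a short code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

The decode direction matters in the interview: if codes are reversible encodings of IDs, the redirect path can skip a lookup entirely for validation, but sequential codes leak creation volume, which is a tradeoff worth naming.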
Scale considerations. At 100M URLs/day, writes are ~1,200/second. Reads at 100:1 are 120,000/second. A single Redis cluster handles this fine, but you'll want read replicas and a CDN layer for redirect responses. Hash the short code across shards if storage grows beyond a single node's capacity.
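The capacity math above fits in a few lines and is worth doing out loud in the interview. Record size and retention below are illustrative assumptions, not figures from the problem statement:

```python
# Back-of-envelope capacity check for the assumed load
# (100M URLs/day, 100:1 read-to-write ratio).
SECONDS_PER_DAY = 86_400
urls_per_day = 100_000_000

writes_per_sec = urls_per_day / SECONDS_PER_DAY  # ~1,157/s
reads_per_sec = writes_per_sec * 100             # ~115,700/s

# Storage: assume ~500 bytes per record (long URL + metadata),
# kept for 5 years.
bytes_per_record = 500
five_year_storage_tb = urls_per_day * 365 * 5 * bytes_per_record / 1e12
print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s, "
      f"~{five_year_storage_tb:.0f} TB over 5 years")
```

Roughly 90 TB over five years is shardable but not trivially cacheable in full, which is what motivates the hot-key cache and CDN layer rather than "put everything in Redis."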
Follow-up questions interviewers ask: "How would you handle URL collisions?" "What if users want analytics on clicks?" "How would you support link expiry with minimal storage overhead?" These follow-ups probe whether your initial design painted you into a corner. A good design leaves room to extend without a full rewrite.
These questions require reasoning about consistency tradeoffs, distributed coordination, and failure scenarios. L5 candidates are expected to discuss at least two design alternatives and explain why they chose one over the other under the given constraints.
| # | Question | Core Concepts | Commonly Asked At |
|---|---|---|---|
| 16 | Design a distributed message queue | Brokers, partitions, delivery guarantees | Amazon, LinkedIn, Uber |
| 17 | Design a ride-sharing service | Geo-indexing, matching, real-time updates | Uber, Lyft, DoorDash |
| 18 | Design a social media feed | Fan-out on write/read, ranking, caching | Meta, Twitter, LinkedIn |
| 19 | Design a distributed cache | Consistent hashing, eviction, TTL | Amazon, Google, Netflix |
| 20 | Design a search engine | Inverted index, ranking, crawl pipeline | Google, Elastic, Bing |
| 21 | Design a video streaming service | Transcoding, CDN, adaptive bitrate | Netflix, YouTube, Twitch |
| 22 | Design a payment processing system | Idempotency, exactly-once, ledger model | Stripe, PayPal, Square |
| 23 | Design a distributed lock service | Lease-based locking, fencing tokens | Google, Amazon, Redis |
| 24 | Design an API gateway | Rate limiting, auth, routing, circuit breaker | Kong, AWS, Cloudflare |
| 25 | Design a real-time analytics dashboard | Stream processing, pre-aggregation, push | Meta, Amplitude, Datadog |
| 26 | Design a recommendation engine | Collaborative filtering, embeddings, serving | Netflix, Spotify, Amazon |
| 27 | Design a distributed tracing system | Trace IDs, sampling, storage | Datadog, Google, Uber |
| 28 | Design an event sourcing system | Append-only log, projections, CQRS | EventStoreDB users, Axon |
| 29 | Design a multi-region database | Conflict resolution, latency, replication | Google Spanner users |
| 30 | Design a fraud detection system | Feature store, real-time scoring, rules | PayPal, Stripe, Square |
| 31 | Design a live location sharing service | Geo-updates, fan-out, privacy | Uber, Google Maps, Apple |
| 32 | Design a distributed file system | Chunk servers, master node, fault tolerance | Google GFS, Hadoop |
| 33 | Design a hotel booking system | Inventory locking, overbooking prevention | Booking.com, Expedia |
| 34 | Design a feature flag system | A/B routing, gradual rollout, config store | LaunchDarkly, Statsig |
| 35 | Design a metrics collection system | Aggregation, cardinality, time-series DB | Datadog, Prometheus, Grafana |
This is one of the most common L5 questions, and the one candidates most often over-architect on first attempt. The goal isn't to reinvent Kafka. It's to show you understand why Kafka is designed the way it is.
Producers, brokers, consumers. Producers write messages to a named topic. Brokers store and replicate those messages. Consumers read from topics, often in consumer groups that enable parallel processing. Each layer has different failure modes. Producers need confirmation of write success. Brokers need to handle leader failure. Consumers need to track their position (offset) across restarts.
Delivery guarantees. At-most-once is easy - fire and forget, accept loss. At-least-once is practical - retry on failure, tolerate duplicates. Exactly-once is expensive - requires idempotency keys and two-phase coordination. Most FAANG systems accept at-least-once delivery and push deduplication responsibility to consumers. Be explicit about which you're designing for and why.
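To make the at-least-once tradeoff concrete, here is a hedged sketch of consumer-side deduplication keyed on a producer-assigned message ID. This is not Kafka's actual API; in production the seen-ID set would live in a persistent store (a Redis SET with TTL, or the consumer's own database inside the same transaction as the side effect):

```python
# At-least-once delivery means consumers may see the same message twice.
# Deduplicate on a message ID before applying side effects.
def process_stream(messages, apply_effect, seen_ids):
    """Apply each message's effect at most once per message ID.
    `seen_ids` is an in-memory set here purely for illustration."""
    applied = 0
    for msg in messages:
        if msg["id"] in seen_ids:
            continue  # duplicate redelivery -- skip silently
        apply_effect(msg)
        seen_ids.add(msg["id"])
        applied += 1
    return applied
```

Saying explicitly that "the broker gives me at-least-once and I make the consumer idempotent" is usually a stronger answer than promising exactly-once delivery end to end.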
Partitioning for scale. A single broker can't handle millions of messages per second. You partition each topic across multiple brokers. Messages for the same entity (say, the same user ID) go to the same partition, preserving order per entity. A partition is just an ordered, append-only log on disk. This is the insight that makes Kafka fast: sequential disk writes are near-memory speed.
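Keyed partitioning is a two-line idea worth writing down. The hash choice below (MD5 prefix) is illustrative; any stable hash works, but note that Python's builtin `hash()` is salted per process and would route the same key differently across restarts:

```python
import hashlib

# Keyed partitioning: messages with the same key always land on the
# same partition, preserving per-key ordering.
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

The modulo is also why partition counts are painful to change after the fact: resizing `num_partitions` reshuffles almost every key, breaking the per-key ordering guarantee during the transition.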
Follow-up questions: "How do you handle a slow consumer that falls behind?" "What happens if the leader broker crashes mid-write?" "How would you implement dead-letter queues?" Each question probes fault tolerance. Have an answer for all three.
These questions have no clean textbook answer. They require synthesis across domains - distributed systems, ML infrastructure, cost modeling, and organizational design. L6 candidates are expected to reason about multi-year system evolution, team ownership, and architectural tradeoffs that span organizational boundaries.
| # | Question | Core Concepts | Commonly Asked At |
|---|---|---|---|
| 36 | Design an LLM inference service | GPU batching, KV cache, autoscaling | Google, OpenAI, Anthropic, Meta |
| 37 | Design a global CDN from scratch | Anycast routing, PoP placement, failover | Cloudflare, Fastly, Netflix |
| 38 | Design a real-time bidding system | Sub-100ms latency, auction logic, budget pacing | Google Ads, The Trade Desk |
| 39 | Design a vector similarity search service | HNSW, IVF-PQ, approximate nearest neighbors | Pinecone, Weaviate, Google |
| 40 | Design a large-scale ML feature store | Online/offline parity, time-travel queries | Uber Michelangelo, Meta Feast |
| 41 | Design a multi-tenant SaaS platform | Data isolation, noisy neighbors, per-tenant limits | Salesforce, Snowflake, AWS |
| 42 | Design a distributed key-value store (Dynamo-style) | Consistent hashing, vector clocks, quorum | Amazon, Google, Cassandra users |
| 43 | Design a zero-downtime database migration | Shadow tables, dual-write, traffic cutover | All FAANG, Stripe |
| 44 | Design a content moderation pipeline | Async processing, human review queue, ML scoring | Meta, TikTok, YouTube |
| 45 | Design an ad click aggregation system | Idempotent counters, windowed aggregation | Google, Meta, Amazon |
| 46 | Design a global social graph | Graph partitioning, BFS at scale, edge storage | Meta, LinkedIn, Twitter |
| 47 | Design a health monitoring system for 1M services | Metrics, alerting, anomaly detection, cardinality | Google, Netflix, Datadog |
| 48 | Design a privacy-preserving analytics system | Differential privacy, aggregation, data minimization | Apple, Meta, Google |
| 49 | Design a multi-modal search system | Embedding fusion, cross-modal retrieval, reranking | Google, Pinterest, Snapchat |
| 50 | Design Kubernetes itself | Control plane, scheduler, etcd, reconciliation loops | Google, VMware, Red Hat |
This is the question that separates 2026 L5 candidates from 2024 L5 candidates. If you've never thought about how language models are served at scale, this section is where you start.
The core constraint is GPU memory. A 70B parameter model in fp16 takes approximately 140GB of GPU memory just for weights. An A100-80GB card holds 80GB. You need at least two cards for a single model replica. The inference service's job is to maximize GPU utilization while keeping per-request latency within budget - typically under 500ms for the first token.
Batching changes everything. Naively, you process one request at a time. Smart inference services use continuous batching: new requests join an ongoing batch mid-generation. This dramatically increases throughput without meaningfully increasing per-user latency. vLLM's PagedAttention is the canonical implementation - it manages the KV cache like an OS paging system, dramatically reducing memory waste.
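A toy simulation shows why continuous batching helps: short requests finish and free their slot immediately, instead of waiting for the longest request in a static batch to drain. This is a simplified model for intuition, not vLLM's implementation:

```python
from collections import deque

# Toy continuous batching: each "step" generates one token for every
# request in the batch; finished requests leave and queued requests
# join mid-generation rather than waiting for the batch to drain.
def run_continuous_batching(requests, max_batch):
    """requests: list of (id, tokens_to_generate). Returns the step at
    which each request finished."""
    queue = deque(requests)
    batch = {}           # request id -> tokens still to generate
    done_at = {}
    step = 0
    while queue or batch:
        # Admit waiting requests into free slots -- the key difference
        # from static batching, where membership is fixed until drained.
        while queue and len(batch) < max_batch:
            rid, tokens = queue.popleft()
            batch[rid] = tokens
        step += 1
        for rid in list(batch):
            batch[rid] -= 1
            if batch[rid] == 0:
                done_at[rid] = step
                del batch[rid]
    return done_at
```

With a static batch of size 2, the short request "c" below would wait for both "a" and "b" to finish; with continuous batching it slips into "a"'s freed slot and completes steps earlier.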
KV cache is the hidden cost center. Each transformer layer generates key-value vectors for every input token, and these must be stored in GPU memory for the duration of generation. A long prompt (4K tokens) with a long response (2K tokens) adds up fast, and across a full batch the KV cache can rival the model weights in memory footprint. Designing the KV cache eviction policy - and how to offload to CPU or NVMe when GPU memory fills - is a core L6 design question.
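A back-of-envelope sizing makes this concrete. The dimensions below are assumed, Llama-2-70B-style values (80 layers, grouped-query attention with 8 KV heads of dimension 128, fp16); check the actual model config before quoting numbers:

```python
# Back-of-envelope KV cache sizing under assumed model dimensions.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128,
                   dtype_bytes=2):
    # 2x for the key AND value tensors stored at every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token_kb = kv_cache_bytes(1) / 1024              # 320 KB/token here
one_request_gb = kv_cache_bytes(4096 + 2048) / 1e9   # 4K prompt + 2K output
batch_64_gb = 64 * one_request_gb                    # a modest batch
print(f"{per_token_kb:.0f} KB/token, {one_request_gb:.2f} GB/request, "
      f"~{batch_64_gb:.0f} GB for a 64-request batch")
```

Under these assumptions a single long request needs about 2 GB of KV cache, and a 64-request batch approaches the ~140 GB the weights themselves occupy, which is exactly why paged KV cache management and eviction policy dominate the design discussion.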
Autoscaling for inference is tricky. Traditional CPU-based services scale on CPU utilization. GPU inference services don't fit that model. You scale on queue depth and time-to-first-token percentile. 66% of organizations running generative AI use Kubernetes for some or all inference workloads (CNCF 2025 Annual Survey, 2026). Most use custom horizontal pod autoscalers tied to inference-specific metrics.
Latency budget breakdown. A 500ms first-token budget might allocate: 20ms for request routing, 50ms for tokenization and KV cache lookup, 380ms for prefill on GPU, and 50ms buffer. Anything beyond the prefill is auto-regressive generation - you can't speed that up without changing the model. Streaming responses (token by token) improve perceived latency dramatically.
Citation Capsule: 66% of organizations hosting generative AI models use Kubernetes for some or all of their inference workloads, according to the CNCF 2025 Annual Survey. GPU batching strategies like continuous batching and PagedAttention KV cache management are now expected knowledge for L5+ candidates designing AI-serving infrastructure.
Kubernetes Production Adoption, 2020–2025 (% of container users)

| Year | 2020 | 2021 | 2022 | 2023 | 2025 |
|---|---|---|---|---|---|
| Adoption | 48% | 56% | 60% | 66% | 82% |
Source: CNCF Annual Surveys, 2020–2026
Most interview failures aren't knowledge failures - they're structure failures. Candidates who know the right patterns still bomb if they ramble for 20 minutes without a coherent path. The RADIO framework gives you a repeatable, interview-proven structure for any design question (Hello Interview, 2026).
R - Requirements (5–8 minutes). Ask clarifying questions before drawing anything. Functional requirements (what the system does), non-functional requirements (latency, throughput, availability), and constraints (scale, geography, budget). This sets the evaluation criteria for the rest of the session.
A - Architecture (10–15 minutes). Sketch the high-level system. Name the major components: API layer, service layer, storage layer, caching layer. Use arrows to show data flow. Don't go deep on any one component yet. Get the full picture on the whiteboard first.
D - Data model (5–8 minutes). What does your primary storage schema look like? Key entities, relationships, and how you'd index for your read patterns. SQL vs NoSQL choice goes here, with justification.
I - Interface (3–5 minutes). What are the core APIs? HTTP endpoints, event schemas, or gRPC contracts. Name the 2-3 most important endpoints with request/response shapes.
O - Optimization (10–15 minutes). This is where you differentiate. Caching strategies, sharding decisions, consistency tradeoffs, failure handling, monitoring. At L5+, add cost estimates and multi-region considerations.
Time allocation matters. A 45-minute session doesn't leave room for open-ended exploration. Interviewers evaluate you on how efficiently you use the time. Running out of time before reaching the Optimization phase is a common L4 failure mode.
What do interviewers actually score? Clarity of communication, systematic thinking, ability to handle pivots, knowledge of appropriate tools, and ability to reason about failure. Not memorization.
New to system design? Read our system design fundamentals primer before practicing these questions.
Citation Capsule: Structured frameworks like RADIO (Requirements, Architecture, Data model, Interface, Optimization) are recommended by Hello Interview's March 2026 system design guide as the most effective way to ensure complete coverage in a 45-60 minute FAANG system design round without running out of time before addressing optimization.
PERSONAL EXPERIENCE -
Ninety days is enough time to go from "I know what a load balancer is" to confidently handling L5 system design rounds - if you structure it right. We've seen candidates cram ByteByteGo chapters for four weeks and plateau. The difference between those who improve and those who don't is deliberate practice: designing, getting feedback, and iterating. Resources are secondary to reps.
Weeks 1–4: Build the foundation. Read System Design Interview Vol. 1 (Alex Xu). Cover URL shorteners, consistent hashing, key-value stores, CDNs, and rate limiters. Do one practice design per week, write it up, and compare against reference answers. Focus: can you complete a working design in 45 minutes?
Weeks 5–8: Go deeper on hard problems. Move to Vol. 2 and Hello Interview's problem library. Target: payment systems, distributed message queues, search indexing, and real-time feeds. Start recording yourself. Watching your own sessions reveals pacing issues and gaps you can't see in the moment.
Weeks 9–12: Simulate the real thing. Do mock interviews with a partner or on a platform. Review AI/ML infrastructure topics: feature stores, vector search, LLM inference. Study one FAANG engineering blog post per week - Google SRE Blog, Netflix Tech Blog, Meta Engineering. These give you the vocabulary interviewers use.
Resources that consistently work: ByteByteGo (Alex Xu's newsletter and YouTube), Hello Interview's interactive guides, and Exponent's mock interview community. AI/ML engineers command 30-40% premium over general SWEs at FAANG (Apex Interviewer, 2026), so even if you're not an ML engineer, understanding ML infrastructure pays off.
Book a mock system design interview to practice under real time pressure with structured feedback.
FAANG Interview Loop Breakdown: Coding - 50%, System Design - 30%, Behavioral - 20%
Source: Interviewing.io + Exponent guides, 2025–2026
Reading about system design is necessary but not sufficient. The only way to get better is to do it, get feedback, and repeat. Start a mock system design interview on StackInterview and practice against real FAANG-caliber prompts with structured feedback on your Requirements, Architecture, Data model, Interface, and Optimization coverage.
How long is a FAANG system design interview?
Most FAANG system design rounds run 45–60 minutes. The first 5–8 minutes should go to requirements clarification. Coding-focused companies like Google sometimes run back-to-back system design rounds for Staff-level loops. Amazon includes system design elements in their Leadership Principles round for L6+ candidates.
How do expectations differ between L4 and L5?
At L4, you need to design a working system with sound component choices. At L5, you're expected to handle dynamic constraints, reason about consistency tradeoffs, and address failure modes proactively. AI/ML integration and cost-aware architecture became baseline L5 expectations in 2025 (Hello Interview, 2026), not optional depth areas.
Can I draw diagrams during the interview?
Yes, and you should. Almost all FAANG system design rounds use a shared whiteboard tool (Miro, Excalidraw, or an internal tool). Draw your architecture before explaining it - interviewers follow visual structure much more easily than spoken descriptions alone. Keep diagrams clean: boxes for services, cylinders for databases, arrows for data flow.
How should I prepare for AI/ML system design questions?
Focus on the infrastructure layer, not the model layer. You don't need to know how transformers work. You do need to know how to serve them: batching strategies, latency budgets, GPU memory constraints, and autoscaling signals. 66% of orgs running generative AI use Kubernetes for inference (CNCF 2025 Survey). Study that layer.
What are the best resources for system design prep?
The most consistently recommended resources are: System Design Interview Vol. 1 and Vol. 2 (Alex Xu / ByteByteGo), Hello Interview's interactive problem library, and Exponent's mock interview community. For AI/ML infrastructure, the CNCF blog, Google's Site Reliability Engineering book, and the Netflix Tech Blog are worth studying weekly.
The 2026 system design interview is harder than it was two years ago - not because the questions changed, but because the expected depth of answers increased. AI/ML infrastructure, cost-aware design, and failure mode reasoning are no longer advanced topics. They're table stakes at L5 and above.
Only 20% of candidates who reach the FAANG onsite receive an offer (Interviewing.io, 2025). The gap between that 20% and the rest isn't always raw knowledge. It's structured thinking, deliberate practice, and the ability to reason clearly under pressure.
Start with the 50 questions in this guide. Build your framework. Do mock interviews early - not at the end of your prep. The candidates who improve fastest treat every practice session as a real interview, debrief honestly, and target specific weaknesses each week.
System design prep doesn't end when you get the offer. The same skills that get you hired at L5 are the ones that get you promoted to L6 and beyond.
Continue your prep with our interview questions guide - the round most engineers under-prepare for.