Interview Framework
How to approach a system design interview — A step-by-step framework for SDE-3 / Senior Software Engineer interviews
Overview
System design interviews evaluate your ability to clarify ambiguity, reason about scale, make trade-offs, and design for failure. This framework gives you a repeatable structure so you spend time on high-value discussion instead of figuring out what to do next.
Typical duration: 45–60 minutes.
Your goals: Show structured thinking, ask the right questions, drive the conversation, and demonstrate senior-level judgment (trade-offs, cost, operations, resilience).
The Seven Phases
1. Clarify requirements
5–10 min
Functional + non-functional; scope and constraints
2. Estimate scale
5 min
QPS, storage, bandwidth; back-of-envelope
3. High-level architecture
10–15 min
Components, data flow, boundaries
4. Identify bottlenecks
5 min
Where the system will break or slow down
5. Discuss trade-offs
5 min
Consistency vs availability, latency vs cost, etc.
6. Deep dive into components
15–20 min
1–2 components in detail (APIs, data model, scaling)
7. Scaling and failure handling
5–10 min
Horizontal scaling, failover, degradation
Phase 1: Clarify Requirements
Why it matters: Jumping into boxes and arrows before understanding the problem is the most common mistake. Senior engineers align on scope first.
Functional requirements
Core features: What are the 3–5 must-have features?
Users: Who uses the system (B2C, B2B, internal)?
Critical user journeys: e.g. “User shortens URL → later opens short URL → gets redirected.”
Out of scope (for now): Explicitly deprioritize (e.g. “No custom aliases in v1”).
Questions to ask:
“What’s in scope for this discussion—MVP or full product?”
“Who are the main users and what’s the most important flow?”
“Are there any features we should explicitly leave out?”
Non-functional requirements
Use a simple checklist so you don’t forget dimensions:
Performance
Latency (e.g. P99 < 200 ms), throughput (QPS), tail latency
Availability
Uptime target (e.g. 99.9%), planned maintenance, multi-region
Scalability
Growth (users, data, traffic), peak vs average (e.g. 3×)
Consistency
Strong vs eventual; read-after-write requirements
Durability
Can we lose data? RPO/RTO if applicable
Security
Auth, PII, compliance (GDPR, etc.)
Cost
Any rough budget or “optimize for cost” constraint?
Example (URL shortener):
“Redirect latency: P99 < 100 ms.”
“Availability: 99.99%.”
“We can accept eventual consistency for redirects; strong consistency for create.”
“URLs must not be lost once created.”
Output of this phase
Short list of must-have vs nice-to-have features.
Clear non-functional targets (latency, availability, scale, consistency).
Agreement on scope so you don’t over- or under-design.
Phase 2: Estimate Scale
Why it matters: Scale drives technology choices (single DB vs sharding, cache vs no cache, sync vs async). Show you think in numbers.
What to estimate
Traffic
DAU/MAU or requests per day/month.
Reads vs writes ratio (e.g. 100:1 for URL shortener).
Peak QPS (e.g. 3× average).
Storage
Record size and retention (e.g. 5 years).
Total size and growth rate.
Replication factor (e.g. 3×).
Bandwidth
In/out per request and total (optional for first pass).
How to present
State assumptions clearly: “Assume 100M new URLs per month, 100:1 read:write.”
Do simple math on the whiteboard:
Writes: 100M / (30 × 86,400) ≈ 40/s → peak ~120/s.
Reads: 40 × 100 = 4,000/s → peak ~12,000/s.
Round to one significant figure for discussion: “~100 writes/s, ~10K reads/s.”
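The arithmetic above can be scripted as a quick sanity check. The traffic numbers (100M new URLs/month, 100:1 read:write, 3× peak) are the assumptions stated above; the ~500-byte record size and 3× replication factor are illustrative guesses for the storage estimate.

```python
# Back-of-envelope estimate for the URL shortener example.
# Assumptions: 100M new URLs/month, 100:1 read:write, 3x peak factor
# (from the text); ~500 B/record and 3x replication (illustrative).
URLS_PER_MONTH = 100_000_000
SECONDS_PER_MONTH = 30 * 86_400
READ_WRITE_RATIO = 100
PEAK_FACTOR = 3

write_qps = URLS_PER_MONTH / SECONDS_PER_MONTH   # ~39/s
read_qps = write_qps * READ_WRITE_RATIO          # ~3,900/s
peak_writes = write_qps * PEAK_FACTOR            # ~120/s
peak_reads = read_qps * PEAK_FACTOR              # ~12,000/s

# Storage over 5 years of retention.
RECORD_BYTES = 500
REPLICATION = 3
records_5y = URLS_PER_MONTH * 12 * 5
storage_tb = records_5y * RECORD_BYTES * REPLICATION / 1e12

print(f"writes ~{write_qps:.0f}/s (peak ~{peak_writes:.0f}/s)")
print(f"reads  ~{read_qps:.0f}/s (peak ~{peak_reads:.0f}/s)")
print(f"storage over 5y: ~{storage_tb:.0f} TB")
```

In the interview you would do this on the whiteboard, but the structure is the same: state assumptions, derive rates, round aggressively.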
Output
Read QPS and write QPS (or equivalent).
Total storage and storage per node if relevant.
These numbers justify caching, sharding, and replication later.
Phase 3: High-Level Architecture
Why it matters: This is the “picture” the rest of the interview builds on. Keep it simple first; add detail in deep dives.
Components to consider
Clients (web, mobile, API consumers).
Edge / CDN (static assets, sometimes redirects).
Load balancer(s).
API / application servers (stateless).
Caches (e.g. Redis).
Databases (primary + replicas, or sharded).
Message queues (async jobs, events).
External services (payments, notifications).
How to draw
Draw client → LB → app servers → cache → DB as a first cut.
Add queues and workers if you have async or background work.
Label read vs write path if they differ.
Add replicas or shards only after you’ve stated the need (e.g. “We’ll need read replicas for 10K reads/s”).
Data flow
Describe in one sentence: “User hits short URL → LB → app server → cache; on miss, DB → cache → redirect.”
Mention sync vs async: “Create short URL is synchronous; click analytics are sent to a queue and processed asynchronously.”
Output
A single diagram with 5–10 boxes and clear flow.
A one-paragraph narrative: “Traffic hits the LB, then stateless API servers. We cache hot URLs; on miss we hit the DB and backfill cache. Writes go to the primary; we’ll add read replicas for scale.”
Phase 4: Identify Bottlenecks
Why it matters: Shows you think about limits, not just “happy path.” Senior engineers anticipate failure modes.
Typical bottlenecks
Single DB
Writes and reads saturate one node
Sharding, read replicas
Cache
Stampede on miss, or cache down
Cache-aside + TTL, fallback to DB; consider single-flight (request coalescing)
API servers
CPU or memory under spike
Horizontal scaling, rate limiting
Message queue
Consumer lag, queue depth
More consumers, backpressure, DLQ
External API
Latency or rate limits
Timeouts, circuit breaker, cache, queue
How to present
“The main bottlenecks I see: (1) DB write capacity if we grow beyond one node, (2) cache stampede on a viral link, (3) DB as single point of failure.”
Then briefly: “I’d address (1) with sharding, (2) with TTL and maybe request coalescing, (3) with failover and replicas.”
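The request-coalescing ("single-flight") mitigation for a cache stampede can be sketched in a few lines. This is a minimal in-process sketch, assuming a hypothetical `load_from_db` callable; a production version would also need TTLs, eviction, and error handling.

```python
import threading

# Single-flight sketch: concurrent cache misses for the same key collapse
# into one backing-store load instead of a stampede.
cache = {}
locks = {}
locks_guard = threading.Lock()

def get(key, load_from_db):
    if key in cache:                      # fast path: cache hit
        return cache[key]
    with locks_guard:                     # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:                            # only one caller loads per key
        if key not in cache:              # re-check: another caller may have filled it
            cache[key] = load_from_db(key)
        return cache[key]
```

The key idea is the double-check under the per-key lock: of N concurrent misses for a viral link, one hits the DB and the rest wait and reuse its result.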
Output
2–4 concrete bottlenecks and one-line mitigations. You’ll detail them in Phases 6–7.
Phase 5: Discuss Trade-offs
Why it matters: SDE-3 is expected to articulate why a design is chosen, not just what it is.
Common trade-offs
Consistency vs availability: CP vs AP; strong vs eventual.
Latency vs consistency: Synchronous replication vs async.
Cost vs performance: More cache vs more DB; more replicas vs lower durability.
Complexity vs flexibility: Monolith vs microservices; single DB vs polyglot persistence.
Operational complexity: Self-managed vs managed services (e.g. RDS vs self-hosted Postgres).
How to present
“For redirects we’ll use eventual consistency and cache heavily—low latency and high availability matter more than perfect freshness. For creating a short URL we’ll want strong consistency so we don’t hand out duplicates.”
“We could use 2PC for cross-service transactions, but that’s blocking and complex; I’d prefer a saga with compensating actions and idempotent steps.”
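The saga alternative to 2PC can be sketched as an ordered list of (action, compensation) pairs: run the actions in order, and on failure run the compensations of the completed steps in reverse. The step names in the usage below are hypothetical.

```python
# Minimal saga sketch: each step pairs an action with a compensating
# action. On failure, completed steps are compensated in reverse order.
def run_saga(steps):
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()   # compensations must be idempotent and safe to retry
        return False
    return True
```

This is why the text stresses idempotent steps: compensations (and retried actions) may run more than once, so they must tolerate repetition.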
Output
2–3 explicit trade-offs with a clear “we choose X because Y” statement.
Phase 6: Deep Dive into Components
Why it matters: Interviewers want to see depth in at least one or two areas: API design, data model, or a specific component (cache, queue, DB).
What to prepare
API design
REST or RPC; idempotency for writes (e.g. idempotency key).
Key endpoints: create short URL, redirect, optional analytics.
Status codes and errors (rate limit, not found, conflict).
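Idempotency for the create endpoint can be sketched by keying writes on a client-supplied idempotency key. The in-memory stores and the handler shape below are illustrative, not any specific framework's API.

```python
import secrets

# Illustrative idempotent "create short URL" handler: retrying the same
# request with the same idempotency key replays the original result
# instead of creating a duplicate short code.
idempotency_store = {}   # idempotency_key -> short_code
url_table = {}           # short_code -> long_url

def create_short_url(long_url, idempotency_key):
    if idempotency_key in idempotency_store:       # retry: replay old result
        return idempotency_store[idempotency_key], False
    short_code = secrets.token_urlsafe(6)          # stand-in for real code generation
    url_table[short_code] = long_url
    idempotency_store[idempotency_key] = short_code
    return short_code, True                        # True -> newly created (201 vs 200)
```

In the API this maps to a header like `Idempotency-Key` on POST, with 201 for a new resource and 200 for a replayed one.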
Data model
Main entities and relationships.
Schema (tables or documents): e.g. short_code (PK), long_url, user_id, created_at.
Indexes: by short_code (lookup), by user_id (list a user’s URLs).
Sharding key if you shard (e.g. short_code).
One or two components in detail
Cache: Strategy (cache-aside), TTL, eviction (LRU), key format, stampede mitigation.
Database: Sharding strategy (hash vs range), replication (sync vs async), failover.
Queue: Use case (analytics, notifications), at-least-once vs exactly-once, consumer scaling.
How to run the deep dive
“I’ll go deeper on the cache and the database.”
For cache: “We use cache-aside. Key is url:{short_code}. TTL 24 hours. On miss we load from DB and backfill. We use LRU when memory is full.”
For DB: “We shard by hash(short_code) % N so lookups are single-shard. We use read replicas for read scaling; writes go to primary.”
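The cache-aside read path and hash-based shard routing described in this deep dive can be sketched together. The shard count, in-memory "shards", and cache dict are illustrative stand-ins for real database shards and Redis.

```python
# Cache-aside lookup plus hash(short_code) % N shard routing.
# dicts stand in for N DB shards and for Redis. Note: a real system
# uses a stable hash (e.g. CRC32/MD5), not Python's salted hash().
N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]   # short_code -> long_url per shard
cache = {}                                   # key format: "url:{short_code}"

def shard_for(short_code):
    return shards[hash(short_code) % N_SHARDS]   # single-shard lookup

def resolve(short_code):
    key = f"url:{short_code}"
    if key in cache:                             # cache hit: redirect immediately
        return cache[key]
    long_url = shard_for(short_code).get(short_code)   # miss: ask the owning shard
    if long_url is not None:
        cache[key] = long_url                    # backfill; real cache sets a 24h TTL
    return long_url
```

The point to make aloud is that the sharding key equals the lookup key, so every redirect touches exactly one shard.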
Output
Concrete API (method, path, body, idempotency).
Schema + indexes + sharding key.
Clear behavior of 1–2 components (cache, DB, or queue).
Phase 7: Scaling and Failure Handling
Why it matters: Production systems scale and fail; you need to show you think about both.
Scaling
Horizontal: More app servers behind LB; more DB shards; more cache nodes; more consumers.
Vertical: Bigger DB/cache instances when it’s simpler (e.g. early stage).
Read scaling: Read replicas, cache, CDN.
Write scaling: Sharding, async writes (queue), batching.
State clearly: “We scale reads with replicas and cache; we scale writes by sharding when we outgrow one DB.”
Failure handling
DB primary down: Fail over by promoting a replica to primary (this requires replication to already be in place).
Cache down: Fall back to DB; accept higher latency; optional stale cache from another region.
App server down: Stateless; LB stops sending traffic; no session state lost.
Queue backlog: Scale consumers; backpressure; DLQ for poison messages; alert on lag.
Degradation
“If the recommendation service is down, we show a default list instead of failing the page.”
“If DB is slow, we might serve stale from cache and surface a short delay message.”
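The degradation examples above boil down to a fallback wrapper around a risky dependency call. The service call and default list here are hypothetical; in production you would catch specific errors/timeouts and emit a metric rather than swallowing everything.

```python
# Graceful-degradation sketch: if the dependency fails, serve a safe
# default instead of failing the whole page.
DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2", "popular-3"]  # hypothetical

def with_fallback(call, default):
    try:
        return call()
    except Exception:   # production: catch timeouts/specific errors, log + metric
        return default
```

The same shape applies to the stale-cache case: the "default" is simply the last known good value instead of a static list.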
Output
Scaling strategy in one or two sentences (read vs write, when to shard).
2–3 failure scenarios with clear mitigations (failover, fallback, degradation).
Quick Reference: What to Say When
Unclear scope
“For this exercise, should we focus on MVP or include analytics/custom aliases?”
Unclear scale
“What order of magnitude are we designing for—e.g. 1K vs 100K vs 1M DAU?”
After drawing HLD
“The main bottlenecks I see are … I’d address them by …”
Choosing technology
“I’d use X because … The trade-off is …”
Asked “what if X fails?”
“We’d … (failover / fallback / degrade). The impact would be …”
Running out of time
“If we had more time, I’d go deeper on … and add …”
Common Mistakes to Avoid
Designing before clarifying — Always do requirements and scale first.
Over-complicating the first diagram — Start with 5–7 boxes; add detail when asked.
Ignoring failure — Always mention at least one failure mode and mitigation.
No trade-offs — Explicitly state at least one “we choose A over B because …”.
No numbers — Do at least rough QPS and storage so your choices are justified.
Monologue — Ask 2–3 clarifying questions and confirm scope before drawing.
Quick Revision (Interview Day)
Clarify: Functional + non-functional; scope; 2–3 questions.
Estimate: QPS (read/write), storage, state assumptions.
HLD: Client → LB → app → cache → DB (+ queues if needed); one-sentence flow.
Bottlenecks: 2–4 with one-line mitigations.
Trade-offs: 2–3 “we choose X because Y.”
Deep dive: API + schema + 1–2 components (cache, DB, or queue).
Scaling: Horizontal, read vs write, when to shard.
Failure: Failover, fallback, graceful degradation.
Use this framework to drive the interview and to demonstrate structured, senior-level thinking.