#16 Web Crawler
Here’s a complete, time-boxed, 1-hour interview-ready answer for designing a web crawler. It follows a standard system design interview structure: functional & non-functional requirements, APIs/data model, architecture, deep dive, and trade-offs.
0 – 5 min — Problem recap, scope & assumptions
Goal: Design a scalable web crawler that can efficiently discover, fetch, and index web pages. The crawler should handle billions of URLs, respect politeness (robots.txt), support incremental crawling, detect duplicates, and feed data to downstream systems like search engines or analytics pipelines.
Scope for interview:
Web crawling: URL discovery, fetching, parsing HTML, handling redirects, errors, duplicate detection.
Politeness: obey robots.txt, rate-limiting per domain.
Incremental / scheduled crawling.
Prioritization of URLs (freshness, importance, domain policies).
Scalable and fault-tolerant distributed system.
Assumptions:
Target: crawl ~1B URLs globally.
Page size: average 100 KB.
Crawl frequency: daily updates for news sites, weekly/monthly for static sites.
Peak fetch: 100k pages/sec globally.
5 – 15 min — Functional & Non-Functional Requirements
Functional Requirements
Must
URL frontier management: maintain queue of URLs to crawl.
Fetcher: download HTML pages efficiently.
Parser: extract links, metadata, content.
Duplicate detection: avoid re-crawling the same content or URL unnecessarily.
Politeness enforcement: respect robots.txt, rate-limit per domain/IP.
URL prioritization: based on freshness, importance, domain policies.
Content storage: store raw HTML, metadata, and optionally parsed text.
Error handling & retries: manage fetch failures, timeouts, redirects.
Incremental crawling: update previously crawled pages periodically.
Metrics & monitoring: track fetch rates, success/failure, latency.
Should
Support distributed crawling across multiple machines/regions.
Support multi-threaded or async fetchers per domain.
Optional page content extraction for search indexing.
Nice-to-have
Page classification (news, blogs, e-commerce).
Change detection & differential updates.
Politeness adaptation based on server responsiveness.
Non-Functional Requirements
Scalability: handle billions of URLs; distribute crawl across many machines.
Throughput: 100k pages/sec globally.
Latency: prioritize fresh or high-value pages; crawl and feed data downstream within a few hours.
Fault tolerance: failure of fetchers, parsers, or storage nodes shouldn’t halt system.
Durability: ensure fetched pages are reliably stored; no data loss.
Consistency: eventual consistency acceptable for crawl results.
Resource efficiency: avoid overloading websites; manage network and compute.
Monitoring & observability: crawl stats, success/failure, queue lengths, domain coverage.
SLO examples
Retry a failed fetch 3x before marking it failed.
Average fetch time < 2 sec per page.
Duplicate URL ratio < 1%.
15 – 25 min — API / Data Model
Internal API / Component Contracts
URL Frontier: enqueue(url, priority) / dequeue the next URL due for crawl, respecting per-domain politeness.
Fetcher: fetch(url) → HTTP status + raw HTML, honoring robots.txt and rate limits.
Parser: parse(url, html) → extracted links, metadata, and text.
Duplicate Detector: check whether a URL or content hash has already been seen.
Content Storage: put/get page content and metadata by page_id.
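A minimal sketch of these contracts in Python; the method names and signatures below are assumptions for illustration, not a prescribed API.

```python
from typing import Optional

class URLFrontier:
    """Priority queue of URLs to crawl, partitioned by domain for politeness."""
    def enqueue(self, url: str, priority: float) -> None: ...
    def dequeue(self) -> Optional[str]:
        """Return the next URL whose domain is currently allowed to be fetched."""
        ...

class Fetcher:
    def fetch(self, url: str) -> tuple[int, bytes]:
        """Download a page; return (HTTP status, raw HTML)."""
        ...

class Parser:
    def parse(self, url: str, html: bytes) -> tuple[list[str], dict]:
        """Extract (outgoing links, metadata) from a fetched page."""
        ...

class DuplicateDetector:
    def seen_url(self, url: str) -> bool: ...
    def seen_content(self, content_hash: str) -> bool: ...

class ContentStore:
    def put(self, page_id: str, content: bytes, metadata: dict) -> None: ...
    def get(self, page_id: str) -> Optional[bytes]: ...
```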
Core Data Models
URL Record: {url, domain, priority, last_crawl_ts, status, retry_count}
Page Content: {page_id, url, content, metadata, fetch_ts}
Domain Info: {domain, robots_txt, last_access_ts, crawl_delay}
Duplicate Tracker: {url_hash, page_id} (can use Bloom filter for fast checks)
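A hedged sketch of these records as Python dataclasses; field names follow the records above, while the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class URLRecord:
    url: str
    domain: str
    priority: float
    last_crawl_ts: Optional[float] = None  # epoch seconds of last successful crawl
    status: str = "pending"                # pending | fetched | failed
    retry_count: int = 0

@dataclass
class PageContent:
    page_id: str                           # e.g. hash of the normalized URL
    url: str
    content: bytes                         # raw HTML
    metadata: dict = field(default_factory=dict)
    fetch_ts: float = 0.0

@dataclass
class DomainInfo:
    domain: str
    robots_txt: str = ""
    last_access_ts: float = 0.0
    crawl_delay: float = 1.0               # assumed default seconds between requests
```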
25 – 40 min — High-level architecture & data flow
Components
URL Frontier: distributed queue (priority queue) to manage URLs by freshness, domain, and priority. Partitioned by domain for politeness enforcement.
Fetcher Cluster: multiple fetchers per domain/IP, multithreaded/async. Apply rate-limits per domain. Handles HTTP redirects, errors, retries.
Parser Cluster: extract links, metadata (title, headers, keywords), store page content.
Duplicate Detector: Bloom filters or content hash index to skip already fetched URLs.
Content Storage: HDFS, S3, or database to store fetched HTML and metadata.
Politeness Service: keeps track of domain last-access times, crawl delays.
Monitoring & Metrics: queue sizes, fetch success rate, per-domain stats, error rates.
Data flow
Dequeue URL from URL frontier.
Fetcher downloads page respecting domain politeness.
Page passed to parser; extracted URLs sent back to frontier if not duplicate.
Page content + metadata stored in storage for downstream usage.
Metrics updated asynchronously.
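A sketch of this flow as one worker iteration, assuming the component interfaces sketched in the API section; normalize_url, enqueue_retry, wait_for_slot, and the metrics calls are assumed helpers.

```python
import hashlib
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    # Minimal normalization for illustration: strip the fragment only.
    return urldefrag(url)[0]

def crawl_once(frontier, fetcher, parser, dedup, store, politeness, metrics):
    """One iteration: dequeue -> fetch -> parse -> store -> enqueue new URLs."""
    url = frontier.dequeue()
    if url is None:
        return

    politeness.wait_for_slot(url)          # enforce per-domain crawl delay (assumed API)
    status, html = fetcher.fetch(url)
    if status != 200:
        frontier.enqueue_retry(url)        # bounded retries handled by the frontier (assumed API)
        metrics.incr("fetch_failed")
        return

    links, metadata = parser.parse(url, html)
    content_hash = hashlib.sha1(html).hexdigest()
    if not dedup.seen_content(content_hash):      # content-level dedup
        store.put(page_id=content_hash, content=html, metadata=metadata)

    for link in links:
        normalized = normalize_url(link)
        if not dedup.seen_url(normalized):        # URL-level dedup
            frontier.enqueue(normalized, priority=1.0)  # real priority comes from the scoring function
    metrics.incr("pages_crawled")
```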
40 – 50 min — Deep dive — distributed crawling, deduplication, prioritization
Distributed crawling
Frontier is sharded by domain hash: all URLs for a domain land on the same shard, so a single owner serializes access to that domain.
Within a shard, multiple fetcher threads/workers run concurrently across different domains for efficiency.
Leader election or coordination via ZooKeeper / etcd ensures exactly one master per domain shard.
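A minimal sketch of domain-hash sharding; the 100-shard count mirrors the back-of-the-envelope section, and the hash choice is an assumption.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 100  # assumed shard count, matching the capacity estimate below

def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """All URLs of one domain map to the same shard, so a single owner
    controls that domain's politeness state."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode()).hexdigest()  # stable across processes
    return int(digest, 16) % num_shards
```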
Deduplication
URL-level: normalize URLs, hash, store in Bloom filter.
Content-level: hash content (MD5/SHA-1) for exact duplicates, store fingerprints in DB.
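An illustrative sketch of both levels; the Bloom filter here is a toy bit-array version, and a production crawler would use a tuned library or a distributed store.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash positions over one bit array (illustrative only)."""
    def __init__(self, size_bits: int = 10_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False positives possible, false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def content_fingerprint(html: bytes) -> str:
    """Content-level exact-duplicate fingerprint (SHA-1, as in the text above)."""
    return hashlib.sha1(html).hexdigest()
```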
Prioritization
Priority = f(expected change frequency, domain importance, crawl policy, time since last crawl).
News sites get higher priority; rarely updated blogs lower.
Use priority queue or scoring function in frontier.
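One possible scoring function, as a sketch; the features and weights are assumptions, not a fixed formula.

```python
import time

def crawl_priority(domain_importance: float, change_rate_per_day: float,
                   last_crawl_ts: float, policy_boost: float = 1.0) -> float:
    """Higher score = crawl sooner. Combines importance, expected change rate,
    and staleness (time since the last crawl)."""
    days_since_crawl = max(0.0, (time.time() - last_crawl_ts) / 86400)
    staleness = min(1.0, change_rate_per_day * days_since_crawl)  # rough chance the page changed
    return policy_boost * domain_importance * staleness

# Example: a news front page (high importance, changes several times a day)
# scores higher than a rarely updated blog post crawled an hour ago.
```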
Politeness & rate-limiting
Track last-access per domain; ensure minimum crawl-delay (robots.txt or policy).
Queue per domain or domain bucket ensures fetchers respect delays.
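A single-process sketch of the per-domain tracker; in the real system this state lives in the shard that owns the domain.

```python
import time
from urllib.parse import urlparse

class PolitenessTracker:
    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay        # assumed delay if robots.txt gives none
        self.crawl_delay: dict[str, float] = {}   # per-domain crawl-delay (robots.txt or policy)
        self.last_access: dict[str, float] = {}

    def wait_for_slot(self, url: str) -> None:
        """Block until the domain's crawl-delay has elapsed since its last access."""
        domain = urlparse(url).netloc.lower()
        delay = self.crawl_delay.get(domain, self.default_delay)
        elapsed = time.time() - self.last_access.get(domain, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_access[domain] = time.time()
```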
Incremental crawling
Track last crawl timestamp. Only enqueue if due.
Periodically recrawl pages based on change frequency.
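A sketch of the "enqueue only if due" check, assuming the recrawl interval is derived from an observed change rate; the interval bounds are assumptions.

```python
import time

def is_due_for_recrawl(last_crawl_ts: float, change_rate_per_day: float,
                       min_interval_s: float = 3600,
                       max_interval_s: float = 30 * 86400) -> bool:
    """Recrawl interval shrinks for frequently changing pages, grows for static ones."""
    if change_rate_per_day <= 0:
        interval = max_interval_s
    else:
        interval = min(max_interval_s, max(min_interval_s, 86400 / change_rate_per_day))
    return (time.time() - last_crawl_ts) >= interval
```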
Fault-tolerance
Fetcher crash → unacknowledged URL re-enqueued.
Parser crash → URL remains in frontier or failure queue for retry.
Storage failures → retry async; maintain idempotency via page_id or URL hash.
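A sketch of bounded retries on the fetch path; the 3-retry budget follows the SLO above, while the timeout and backoff values are assumptions.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retries(url: str, max_retries: int = 3, timeout_s: float = 10.0) -> bytes:
    """Try once, then retry up to max_retries times with exponential backoff.
    Raises on final failure so the caller can re-enqueue or mark the URL failed."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff between attempts
```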
50 – 55 min — Back-of-the-envelope calculations
Assumptions
1B URLs, avg 100 KB per page → 100 TB raw content.
Fetch throughput: 100k pages/sec globally.
Page fetch = 100 KB → 10 GB/sec → 864 TB/day if the network were fully saturated (in practice throttled by politeness and bandwidth limits).
Frontier sizing
1B URLs in frontier → partition across N nodes (say 100 shards) → 10M URLs per node.
Priority queues + metadata ~ 100 bytes/URL → ~1 GB per node.
Fetcher cluster
100k pages/sec → 10 GB/sec of download traffic. Scale horizontally, e.g. 1,000 fetchers at 100 pages/sec each.
Storage
Use distributed object store (HDFS/S3) for fetched HTML.
Content deduplication reduces storage for repeated pages (~20–30% savings).
Deduplication
Bloom filter: 1B URLs, 1% false positive → ~1.2 GB memory (fits in RAM).
Additional DB for content hash for content-level deduplication.
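The arithmetic above, collected into a small calculator sketch; the defaults mirror the stated assumptions, so adjust them for your own numbers.

```python
def crawler_capacity(num_urls: float = 1e9, avg_page_kb: float = 100,
                     fetch_rate_pps: float = 100_000, num_shards: int = 100,
                     url_meta_bytes: int = 100, bloom_bits_per_url: float = 9.6):
    raw_storage_tb = num_urls * avg_page_kb / 1e9            # KB -> TB
    bandwidth_gbps = fetch_rate_pps * avg_page_kb / 1e6      # KB/s -> GB/s
    daily_fetch_tb = bandwidth_gbps * 86400 / 1000           # TB/day at full rate
    urls_per_shard = num_urls / num_shards
    frontier_gb_per_shard = urls_per_shard * url_meta_bytes / 1e9
    bloom_gb = num_urls * bloom_bits_per_url / 8 / 1e9       # ~9.6 bits/URL at 1% FPR
    full_crawl_days = num_urls / fetch_rate_pps / 86400
    return {
        "raw_storage_TB": raw_storage_tb,                # ~100 TB
        "bandwidth_GB_per_s": bandwidth_gbps,            # ~10 GB/s
        "daily_fetch_TB": daily_fetch_tb,                # ~864 TB/day at full rate
        "frontier_GB_per_shard": frontier_gb_per_shard,  # ~1 GB
        "bloom_filter_GB": bloom_gb,                     # ~1.2 GB
        "full_crawl_days": full_crawl_days,              # ~0.12 days (~3 hours) at 100k pages/s
    }

print(crawler_capacity())
```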
55 – 58 min — Monitoring, ops & security
Monitoring & metrics
Fetch success/failure per domain.
Frontier queue size.
Duplicate detection hit rate.
Latency per fetch.
Storage utilization.
Operational concerns
Handle network errors, timeouts, retries.
Respect robots.txt & crawl-delay policies.
Detect and block crawler traps or infinite loops.
Maintain secure storage and access control for content.
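For robots.txt and crawl-delay handling, Python's standard urllib.robotparser covers the basic checks; a minimal usage sketch (the user-agent string is hypothetical).

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawlerBot"  # hypothetical user-agent

def robots_policy(domain: str):
    """Fetch and parse robots.txt for a domain; return (parser, crawl_delay)."""
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()                             # downloads and parses robots.txt
    delay = rp.crawl_delay(USER_AGENT)    # None if no Crawl-delay directive
    return rp, delay

# rp.can_fetch(USER_AGENT, url) tells the fetcher whether a URL is allowed.
```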
Security
IP rotation or proxy network to avoid being blocked.
Avoid fetching sensitive/private URLs.
Rate limits prevent the crawler from effectively DoS-ing target sites.
58 – 60 min — Trade-offs, evolution & summary
Trade-offs
Freshness vs throughput: higher frequency = more bandwidth; prioritize high-value sites.
Storage vs deduplication: full content storage vs hash-only dedup for savings.
Single frontier vs distributed: centralized easier but bottleneck; sharded frontier needed for global scale.
Politeness vs speed: obey robots.txt + crawl-delay limits peak fetch per domain; balance via multiple domains concurrently.
Evolution
MVP: single-region, frontier + fetcher + parser + storage, obey politeness.
Scale: distributed frontier, multiple shards, domain-based sharding, multi-region fetchers.
Enhance: incremental crawling, content hashing, freshness scoring, ML for prioritization, advanced deduplication, adaptive politeness based on server load.
Summary
Build a distributed, sharded, fault-tolerant web crawler:
URL frontier sharded by domain.
Fetcher cluster respects politeness & rate-limits.
Parser extracts links & content.
Deduplication (URL & content) avoids re-fetching.
Content stored in distributed storage, metrics collected for monitoring.
System scales horizontally and incrementally crawls high-priority content efficiently.