#16 Web Crawler
Here’s a complete, time-boxed, 1-hour interview-ready answer for designing a web crawler. It follows a standard system design interview structure: functional & non-functional requirements, APIs/data model, architecture, deep dive, and trade-offs.
0 – 5 min — Problem recap, scope & assumptions
Goal: Design a scalable web crawler that can efficiently discover, fetch, and index web pages. The crawler should handle billions of URLs, respect politeness (robots.txt), support incremental crawling, detect duplicates, and feed data to downstream systems like search engines or analytics pipelines.
Scope for interview:
Web crawling: URL discovery, fetching, parsing HTML, handling redirects, errors, duplicate detection.
Politeness: obey robots.txt, rate-limiting per domain.
Incremental / scheduled crawling.
Prioritization of URLs (freshness, importance, domain policies).
Scalable and fault-tolerant distributed system.
Assumptions:
Target: crawl ~1B URLs globally.
Page size: average 100 KB.
Crawl frequency: daily updates for news sites, weekly/monthly for static sites.
Peak fetch: 100k pages/sec globally.
5 – 15 min — Functional & Non-Functional Requirements
Functional Requirements
Must
URL frontier management: maintain queue of URLs to crawl.
Fetcher: download HTML pages efficiently.
Parser: extract links, metadata, content.
Duplicate detection: avoid re-crawling the same content or URL unnecessarily.
Politeness enforcement: respect robots.txt, rate-limit per domain/IP.
URL prioritization: based on freshness, importance, domain policies.
Content storage: store raw HTML, metadata, and optionally parsed text.
Error handling & retries: manage fetch failures, timeouts, redirects.
Incremental crawling: update previously crawled pages periodically.
Metrics & monitoring: track fetch rates, success/failure, latency.
Should
Support distributed crawling across multiple machines/regions.
Support multi-threaded or async fetchers per domain.
Optional page content extraction for search indexing.
Nice-to-have
Page classification (news, blogs, e-commerce).
Change detection & differential updates.
Politeness adaptation based on server responsiveness.
Non-Functional Requirements
Scalability: handle billions of URLs; distribute crawl across many machines.
Throughput: 100k pages/sec globally.
Latency: prioritize fresh or high-value pages; crawl and feed data downstream within a few hours.
Fault tolerance: failure of fetchers, parsers, or storage nodes shouldn’t halt system.
Durability: ensure fetched pages are reliably stored; no data loss.
Consistency: eventual consistency acceptable for crawl results.
Resource efficiency: avoid overloading websites; manage network and compute.
Monitoring & observability: crawl stats, success/failure, queue lengths, domain coverage.
SLO examples
Retry a failed fetch 3x before marking it failed.
Average fetch time < 2 sec per page.
Duplicate URL ratio < 1%.
15 – 25 min — API / Data Model
Internal API / Component Contracts
URL Frontier: enqueue(url, priority) / dequeue the next URL due for crawl, respecting per-domain politeness.
Fetcher: fetch(url) → HTTP status + raw HTML, honoring robots.txt and rate limits.
Parser: parse(url, html) → extracted links, metadata, and text.
Duplicate Detector: check whether a URL or content hash has already been seen.
Content Storage: put/get page content and metadata by page_id.
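A minimal sketch of these contracts in Python; the method names and signatures below are assumptions for illustration, not a prescribed API.

```python
from typing import Optional

class URLFrontier:
    """Priority queue of URLs to crawl, partitioned by domain for politeness."""
    def enqueue(self, url: str, priority: float) -> None: ...
    def dequeue(self) -> Optional[str]:
        """Return the next URL whose domain is currently allowed to be fetched."""
        ...

class Fetcher:
    def fetch(self, url: str) -> tuple[int, bytes]:
        """Download a page; return (HTTP status, raw HTML)."""
        ...

class Parser:
    def parse(self, url: str, html: bytes) -> tuple[list[str], dict]:
        """Extract (outgoing links, metadata) from a fetched page."""
        ...

class DuplicateDetector:
    def seen_url(self, url: str) -> bool: ...
    def seen_content(self, content_hash: str) -> bool: ...

class ContentStore:
    def put(self, page_id: str, content: bytes, metadata: dict) -> None: ...
    def get(self, page_id: str) -> Optional[bytes]: ...
```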
Core Data Models
URL Record: {url, domain, priority, last_crawl_ts, status, retry_count}
Page Content: {page_id, url, content, metadata, fetch_ts}
Domain Info: {domain, robots_txt, last_access_ts, crawl_delay}
Duplicate Tracker: {url_hash, page_id} (can use Bloom filter for fast checks)
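A hedged sketch of these records as Python dataclasses; field names follow the records above, while the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class URLRecord:
    url: str
    domain: str
    priority: float
    last_crawl_ts: Optional[float] = None  # epoch seconds of last successful crawl
    status: str = "pending"                # pending | fetched | failed
    retry_count: int = 0

@dataclass
class PageContent:
    page_id: str                           # e.g. hash of the normalized URL
    url: str
    content: bytes                         # raw HTML
    metadata: dict = field(default_factory=dict)
    fetch_ts: float = 0.0

@dataclass
class DomainInfo:
    domain: str
    robots_txt: str = ""
    last_access_ts: float = 0.0
    crawl_delay: float = 1.0               # assumed default seconds between requests
```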
25 – 40 min — High-level architecture & data flow
Components
URL Frontier: distributed queue (priority queue) to manage URLs by freshness, domain, and priority. Partitioned by domain for politeness enforcement.
Fetcher Cluster: multiple fetchers per domain/IP, multithreaded/async. Apply rate-limits per domain. Handles HTTP redirects, errors, retries.
Parser Cluster: extract links, metadata (title, headers, keywords), store page content.
Duplicate Detector: Bloom filters or content hash index to skip already fetched URLs.
Content Storage: HDFS, S3, or database to store fetched HTML and metadata.
Politeness Service: keeps track of domain last-access times, crawl delays.
Monitoring & Metrics: queue sizes, fetch success rate, per-domain stats, error rates.
Data flow
Dequeue URL from URL frontier.
Fetcher downloads page respecting domain politeness.
Page passed to parser; extracted URLs sent back to frontier if not duplicate.
Page content + metadata stored in storage for downstream usage.
Metrics updated asynchronously.
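A sketch of this flow as one worker iteration, assuming the component interfaces sketched in the API section; normalize_url, enqueue_retry, wait_for_slot, and the metrics calls are assumed helpers.

```python
import hashlib
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    # Minimal normalization for illustration: strip the fragment only.
    return urldefrag(url)[0]

def crawl_once(frontier, fetcher, parser, dedup, store, politeness, metrics):
    """One iteration: dequeue -> fetch -> parse -> store -> enqueue new URLs."""
    url = frontier.dequeue()
    if url is None:
        return

    politeness.wait_for_slot(url)          # enforce per-domain crawl delay (assumed API)
    status, html = fetcher.fetch(url)
    if status != 200:
        frontier.enqueue_retry(url)        # bounded retries handled by the frontier (assumed API)
        metrics.incr("fetch_failed")
        return

    links, metadata = parser.parse(url, html)
    content_hash = hashlib.sha1(html).hexdigest()
    if not dedup.seen_content(content_hash):      # content-level dedup
        store.put(page_id=content_hash, content=html, metadata=metadata)

    for link in links:
        normalized = normalize_url(link)
        if not dedup.seen_url(normalized):        # URL-level dedup
            frontier.enqueue(normalized, priority=1.0)  # real priority comes from the scoring function
    metrics.incr("pages_crawled")
```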
40 – 50 min — Deep dive — distributed crawling, deduplication, prioritization
Distributed crawling
Frontier is sharded by domain hash: all URLs for a domain land on the same shard, so a single owner serializes access to that domain.
Within a shard, multiple fetcher threads/workers run concurrently across different domains for efficiency.
Leader election or coordination via ZooKeeper / etcd ensures exactly one master per domain shard.
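A minimal sketch of domain-hash sharding; the 100-shard count mirrors the back-of-the-envelope section, and the hash choice is an assumption.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 100  # assumed shard count, matching the capacity estimate below

def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """All URLs of one domain map to the same shard, so a single owner
    controls that domain's politeness state."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode()).hexdigest()  # stable across processes
    return int(digest, 16) % num_shards
```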
Deduplication
URL-level: normalize URLs, hash, store in Bloom filter.
Content-level: hash content (MD5/SHA-1) for exact duplicates, store fingerprints in DB.
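An illustrative sketch of both levels; the Bloom filter here is a toy bit-array version, and a production crawler would use a tuned library or a distributed store.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash positions over one bit array (illustrative only)."""
    def __init__(self, size_bits: int = 10_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False positives possible, false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def content_fingerprint(html: bytes) -> str:
    """Content-level exact-duplicate fingerprint (SHA-1, as in the text above)."""
    return hashlib.sha1(html).hexdigest()
```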
Prioritization
Priority = f(expected change frequency, domain importance, crawl policy, time since last crawl).
News sites get higher priority; rarely updated blogs lower.
Use priority queue or scoring function in frontier.
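One possible scoring function, as a sketch; the features and weights are assumptions, not a fixed formula.

```python
import time

def crawl_priority(domain_importance: float, change_rate_per_day: float,
                   last_crawl_ts: float, policy_boost: float = 1.0) -> float:
    """Higher score = crawl sooner. Combines importance, expected change rate,
    and staleness (time since the last crawl)."""
    days_since_crawl = max(0.0, (time.time() - last_crawl_ts) / 86400)
    staleness = min(1.0, change_rate_per_day * days_since_crawl)  # rough chance the page changed
    return policy_boost * domain_importance * staleness

# Example: a news front page (high importance, changes several times a day)
# scores higher than a rarely updated blog post crawled an hour ago.
```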
Politeness & rate-limiting
Track last-access per domain; ensure minimum crawl-delay (robots.txt or policy).
Queue per domain or domain bucket ensures fetchers respect delays.
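A single-process sketch of the per-domain tracker; in the real system this state lives in the shard that owns the domain.

```python
import time
from urllib.parse import urlparse

class PolitenessTracker:
    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay        # assumed delay if robots.txt gives none
        self.crawl_delay: dict[str, float] = {}   # per-domain crawl-delay (robots.txt or policy)
        self.last_access: dict[str, float] = {}

    def wait_for_slot(self, url: str) -> None:
        """Block until the domain's crawl-delay has elapsed since its last access."""
        domain = urlparse(url).netloc.lower()
        delay = self.crawl_delay.get(domain, self.default_delay)
        elapsed = time.time() - self.last_access.get(domain, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_access[domain] = time.time()
```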
Incremental crawling
Track last crawl timestamp. Only enqueue if due.
Periodically recrawl pages based on change frequency.
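A sketch of the "enqueue only if due" check, assuming the recrawl interval is derived from an observed change rate; the interval bounds are assumptions.

```python
import time

def is_due_for_recrawl(last_crawl_ts: float, change_rate_per_day: float,
                       min_interval_s: float = 3600,
                       max_interval_s: float = 30 * 86400) -> bool:
    """Recrawl interval shrinks for frequently changing pages, grows for static ones."""
    if change_rate_per_day <= 0:
        interval = max_interval_s
    else:
        interval = min(max_interval_s, max(min_interval_s, 86400 / change_rate_per_day))
    return (time.time() - last_crawl_ts) >= interval
```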
Fault-tolerance
Fetcher crash → unacknowledged URL re-enqueued.
Parser crash → URL remains in frontier or failure queue for retry.
Storage failures → retry async; maintain idempotency via page_id or URL hash.
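A sketch of bounded retries on the fetch path; the 3-retry budget follows the SLO above, while the timeout and backoff values are assumptions.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retries(url: str, max_retries: int = 3, timeout_s: float = 10.0) -> bytes:
    """Try once, then retry up to max_retries times with exponential backoff.
    Raises on final failure so the caller can re-enqueue or mark the URL failed."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff between attempts
```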
50 – 55 min — Back-of-the-envelope calculations
Assumptions
1B URLs, avg 100 KB per page → 100 TB raw content.
Fetch throughput: 100k pages/sec globally.
Page fetch = 100 KB → 10 GB/sec → 864 TB/day if the network were fully saturated (in practice throttled by politeness and bandwidth limits).
Frontier sizing
1B URLs in frontier → partition across N nodes (say 100 shards) → 10M URLs per node.
Priority queues + metadata ~ 100 bytes/URL → ~1 GB per node.
Fetcher cluster
100k pages/sec → 10 GB/sec of download traffic. Scale horizontally, e.g. 1,000 fetchers at 100 pages/sec each.
Storage
Use distributed object store (HDFS/S3) for fetched HTML.
Content deduplication reduces storage for repeated pages (~20–30% savings).
Deduplication
Bloom filter: 1B URLs, 1% false positive → ~1.2 GB memory (fits in RAM).
Additional DB for content hash for content-level deduplication.
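The arithmetic above, collected into a small calculator sketch; the defaults mirror the stated assumptions, so adjust them for your own numbers.

```python
def crawler_capacity(num_urls: float = 1e9, avg_page_kb: float = 100,
                     fetch_rate_pps: float = 100_000, num_shards: int = 100,
                     url_meta_bytes: int = 100, bloom_bits_per_url: float = 9.6):
    raw_storage_tb = num_urls * avg_page_kb / 1e9            # KB -> TB
    bandwidth_gbps = fetch_rate_pps * avg_page_kb / 1e6      # KB/s -> GB/s
    daily_fetch_tb = bandwidth_gbps * 86400 / 1000           # TB/day at full rate
    urls_per_shard = num_urls / num_shards
    frontier_gb_per_shard = urls_per_shard * url_meta_bytes / 1e9
    bloom_gb = num_urls * bloom_bits_per_url / 8 / 1e9       # ~9.6 bits/URL at 1% FPR
    full_crawl_days = num_urls / fetch_rate_pps / 86400
    return {
        "raw_storage_TB": raw_storage_tb,                # ~100 TB
        "bandwidth_GB_per_s": bandwidth_gbps,            # ~10 GB/s
        "daily_fetch_TB": daily_fetch_tb,                # ~864 TB/day at full rate
        "frontier_GB_per_shard": frontier_gb_per_shard,  # ~1 GB
        "bloom_filter_GB": bloom_gb,                     # ~1.2 GB
        "full_crawl_days": full_crawl_days,              # ~0.12 days (~3 hours) at 100k pages/s
    }

print(crawler_capacity())
```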
55 – 58 min — Monitoring, ops & security
Monitoring & metrics
Fetch success/failure per domain.
Frontier queue size.
Duplicate detection hit rate.
Latency per fetch.
Storage utilization.
Operational concerns
Handle network errors, timeouts, retries.
Respect robots.txt & crawl-delay policies.
Detect and block crawler traps or infinite loops.
Maintain secure storage and access control for content.
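For robots.txt and crawl-delay handling, Python's standard urllib.robotparser covers the basic checks; a minimal usage sketch (the user-agent string is hypothetical).

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawlerBot"  # hypothetical user-agent

def robots_policy(domain: str):
    """Fetch and parse robots.txt for a domain; return (parser, crawl_delay)."""
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()                             # downloads and parses robots.txt
    delay = rp.crawl_delay(USER_AGENT)    # None if no Crawl-delay directive
    return rp, delay

# rp.can_fetch(USER_AGENT, url) tells the fetcher whether a URL is allowed.
```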
Security
IP rotation or proxy network to avoid being blocked.
Avoid fetching sensitive/private URLs.
Rate limits prevent the crawler from effectively DoS-ing target sites.
58 – 60 min — Trade-offs, evolution & summary
Trade-offs
Freshness vs throughput: higher frequency = more bandwidth; prioritize high-value sites.
Storage vs deduplication: full content storage vs hash-only dedup for savings.
Single frontier vs distributed: centralized easier but bottleneck; sharded frontier needed for global scale.
Politeness vs speed: obey robots.txt + crawl-delay limits peak fetch per domain; balance via multiple domains concurrently.
Evolution
MVP: single-region, frontier + fetcher + parser + storage, obey politeness.
Scale: distributed frontier, multiple shards, domain-based sharding, multi-region fetchers.
Enhance: incremental crawling, content hashing, freshness scoring, ML for prioritization, advanced deduplication, adaptive politeness based on server load.
Summary
Build a distributed, sharded, fault-tolerant web crawler:
URL frontier sharded by domain.
Fetcher cluster respects politeness & rate-limits.
Parser extracts links & content.
Deduplication (URL & content) avoids re-fetching.
Content stored in distributed storage, metrics collected for monitoring.
System scales horizontally and incrementally crawls high-priority content efficiently.