
Search System

Difficulty: Hard
Topics: Inverted Index, Crawling, Ranking, Distributed Search, Caching
Time: 60 minutes
Companies: Google, Amazon, Meta, Microsoft


1. Requirements Clarification

Functional requirements

Must-have:

  • Query: User enters text; system returns ranked list of relevant documents (web pages, products, etc.).

  • Indexing: New/updated documents are searchable within a bounded delay (e.g. minutes to hours).

  • Relevance: Results ordered by relevance (and optionally recency, popularity).

Nice-to-have:

  • Autocomplete / suggestions.

  • Filters (date, category, site).

  • Spelling correction (“did you mean?”).

  • Personalized ranking (optional).

Non-functional requirements

  • Latency: P99 < 200–500 ms on the query path.

  • Throughput: 10K–100K queries per second (QPS) at scale.

  • Availability: 99.9%+.

  • Freshness: Index updated within agreed SLA (e.g. minutes for product search, hours for web crawl).

  • Scale: Billions of documents; petabytes of index.
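A quick back-of-envelope check on the "billions of documents; petabytes of index" scale. Every figure below is an illustrative assumption (document count, average size, compression ratio), not a number from the requirements:

```python
# Back-of-envelope sizing; every figure here is an illustrative assumption.
docs        = 10_000_000_000   # 10B documents
avg_doc_kb  = 200              # raw text/HTML per document
index_ratio = 0.5              # compressed index size vs raw corpus

raw_tb   = docs * avg_doc_kb / 1024**3   # KB -> TB
index_pb = raw_tb * index_ratio / 1024   # TB -> PB
print(f"raw corpus ~{raw_tb:,.0f} TB, index ~{index_pb:.1f} PB")
```

Even with aggressive posting-list compression, a corpus at this scale lands in the high-terabyte to petabyte range, which is what forces the sharding and tiered-storage decisions below.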


2. High-Level Architecture

Components:

  • Query servers: Parse query, fetch from cache or search index, rank/merge, return results.

  • Search index: Inverted index (term → list of doc IDs + metadata); sharded by term or doc.

  • Cache: Cache top-K results for popular queries (e.g. Redis).

  • Document store: Raw documents (for snippets, metadata); can be blob store or DB.

  • Crawler / ingester: Discovers or receives documents; sends to queue; indexer consumes and updates index.

  • Indexer: Builds/updates inverted index from document stream.
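The query path through these components can be sketched end to end. This is a toy in-process model, not an implementation: the shard contents and scores are precomputed placeholders (a real shard would score its own documents against the query over the network):

```python
import heapq

# Hypothetical in-process stand-ins for the real components; scores are
# precomputed placeholders (a real shard scores docs against the query).
CACHE = {}                        # query -> cached top-K results
SHARDS = [                        # doc-sharded: each shard owns a doc subset
    {"d1": 2.4, "d2": 1.1},
    {"d3": 3.0, "d4": 0.7},
]

def search(query, k=3):
    """Check cache, fan out to all shards, merge, return global top-K."""
    if query in CACHE:
        return CACHE[query]
    partials = []
    for shard in SHARDS:          # real system: one RPC per shard
        partials.extend(shard.items())
    results = heapq.nlargest(k, partials, key=lambda p: p[1])
    CACHE[query] = results
    return results

print(search("hello world"))      # [('d3', 3.0), ('d1', 2.4), ('d2', 1.1)]
```

The shape is the important part: cache check, shard fan-out, merge of per-shard partial results, then a single ranked top-K back to the client.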


3. API Design

Search query
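A sketch of the parameters a search endpoint might accept. The endpoint path and parameter names are assumptions for illustration, shown as Python for consistency with the other examples:

```python
from urllib.parse import urlencode

# Illustrative parameters for GET /v1/search (path and names are assumptions).
params = {
    "q": "wireless headphones",   # raw user query
    "page": 1,                    # pagination
    "page_size": 10,              # results per page
    "filters": "category:audio",  # optional structured filter
    "sort": "relevance",          # relevance | recency | popularity
}
print("/v1/search?" + urlencode(params))
# /v1/search?q=wireless+headphones&page=1&page_size=10&filters=category%3Aaudio&sort=relevance
```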

Response:
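A plausible response shape, with the document-store fields (title, url, snippet) plus ranking and timing metadata. All field names are illustrative assumptions:

```python
# Illustrative response body for the search endpoint (field names assumed).
response = {
    "query": "wireless headphones",
    "total_hits": 12840,          # estimated total matches
    "page": 1,
    "results": [
        {
            "doc_id": "d42",
            "title": "Best Wireless Headphones 2024",
            "url": "https://example.com/d42",
            "snippet": "…top wireless headphones ranked by…",
            "score": 17.3,        # relevance score used for ordering
        },
    ],
    "took_ms": 42,                # server-side latency for this query
}
```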

Indexing (internal or admin)
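A sketch of the internal indexing message. Field names are assumptions; the key design point from section 5 is that this is published to a queue and applied asynchronously, not written to the index inline:

```python
# Illustrative internal indexing message (field names are assumptions).
index_request = {
    "op": "upsert",               # upsert | delete
    "doc_id": "d42",
    "title": "Best Wireless Headphones 2024",
    "url": "https://example.com/d42",
    "body": "Full text of the document goes here",
    "metadata": {"category": "audio", "updated_at": "2024-05-01T12:00:00Z"},
}
# In practice this is published to a queue (e.g. Kafka) rather than applied
# synchronously; indexer workers consume it and update the relevant shards.
```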


4. Data Model

Inverted index (conceptual)

  • Term dictionary: term → term_id (for compression).

  • Posting list: term_id → list of (doc_id, frequency, positions, metadata).

  • Document store: doc_id → (title, url, snippet, metadata).

Sharding:

  • By term hash: Each shard owns a range of terms; query “hello world” → query shards for “hello” and “world”, then merge.

  • By document: Each shard owns a subset of documents; full document index per shard; query broadcast to all shards, merge results. (Common in Elasticsearch-style systems.)
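Term-hash routing can be sketched in a few lines. A stable hash (here MD5, since Python's built-in `hash()` is salted per process) maps each term to the shard that owns its posting list, so a multi-term query only touches the shards owning its terms; shard count and hash choice are illustrative:

```python
import hashlib

NUM_SHARDS = 8

def shard_for_term(term):
    """Route a term to a shard with a stable hash; all of that term's
    postings live on the owning shard."""
    digest = hashlib.md5(term.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# "hello world" fans out only to the shards owning "hello" and "world".
targets = {t: shard_for_term(t) for t in ["hello", "world"]}
print(targets)   # shard numbers depend on the hash, but are stable per term
```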

Example (simplified)
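A minimal in-memory sketch of the structures above — posting lists of (doc_id, frequency, positions) built from a toy document store (term IDs and compression omitted for clarity):

```python
from collections import defaultdict

# Toy document store: doc_id -> text.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick quick fox",
}

# Inverted index: term -> posting list of (doc_id, frequency, positions).
index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        index[term].append((doc_id, len(pos_list), pos_list))

print(index["quick"])   # [(1, 1, [1]), (3, 2, [0, 1])]
print(index["brown"])   # [(1, 1, [2]), (2, 1, [2])]
```

Positions enable phrase queries ("quick fox" requires adjacent positions); frequency feeds the ranking function.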


5. Scaling Strategy

  • Query path:

    • Cache popular queries (a small set of hot queries typically accounts for a large share of traffic); cache hot posting lists if needed.

    • Query servers stateless; scale horizontally.

    • Index sharded (by term or by doc); each shard replicated for read capacity and HA.

  • Indexing path:

    • Documents flow via queue (Kafka); indexer workers consume and update shards.

    • Index updates can be in-place (update posting list) or periodic merge (LSM-style).

  • Storage: Index on SSD for latency; raw docs in blob store (S3) or DB; cache in RAM/SSD.
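The query-result cache on this path is typically a TTL cache (stale entries must expire so results track index updates). A tiny in-process stand-in for the Redis role, with an illustrative API:

```python
import time

class QueryCache:
    """Tiny TTL cache for top-K results of popular queries (Redis stand-in)."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self.store = {}                   # query -> (expires_at, results)

    def get(self, query):
        entry = self.store.get(query)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        self.store.pop(query, None)       # drop expired or missing entry
        return None

    def put(self, query, results):
        self.store[query] = (time.monotonic() + self.ttl, results)

cache = QueryCache(ttl_seconds=0.1)
cache.put("hello", ["d3", "d1"])
print(cache.get("hello"))   # ['d3', 'd1']
time.sleep(0.2)
print(cache.get("hello"))   # None -- entry expired
```

The TTL bounds staleness: freshly indexed documents appear for a cached query after at most one TTL period.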


6. Bottlenecks and Mitigations

  • Query latency: Cache hot queries; optimize ranking (two-phase: cheap candidate retrieval + expensive rerank); reduce shard fan-out where possible.

  • Index size: Compression (posting lists); tiered storage (hot terms in RAM); sharding.

  • Index update latency: Async indexing via queue; batch updates; merge strategies (e.g. LSM).

  • Ranking cost: Retrieve top-K candidates with cheap scoring (BM25, etc.); optional rerank with a heavier model on a smaller set.

  • Crawl rate / politeness: Rate limit per domain; respect robots.txt; distributed crawlers with a coordinator.
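The cheap candidate-retrieval phase can be made concrete with a small BM25 scorer. This is a sketch on a toy corpus using one common idf variant; k1 and b are the standard default-ish parameters, and the corpus/terms are invented for illustration:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc against the query with BM25 (candidate-retrieval phase)."""
    N = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    avgdl = sum(len(toks) for toks in tokenized.values()) / N
    # Document frequency per query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in query_terms}
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for t in query_terms:
            if df[t] == 0 or tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores[d] = score
    return scores

docs = {
    "d1": "quick brown fox",
    "d2": "lazy dog sleeps in the sun",
    "d3": "the quick fox and the quick dog",
}
scores = bm25_scores(["quick", "fox"], docs)
print(max(scores, key=scores.get))   # d1 -- matches both terms in a short doc
```

In the two-phase setup, the top few hundred candidates by BM25 go on to the expensive reranker; everything else is never scored by the heavy model.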


7. Improvements and Extensions

  • Autocomplete: Separate prefix index (trie or n-gram) or use same index with prefix queries; cache suggestions.

  • Spell correction: Edit distance, n-gram overlap, or ML; suggest correction and re-query.

  • Personalization: User history / embeddings; rerank or boost by user affinity.

  • Faceted search: Store metadata in index; aggregate counts per filter (e.g. category) for faceted navigation.

  • Real-time: For product search, ingest events (Kafka); indexer updates index; query sees new docs within seconds/minutes.
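The trie option for autocomplete can be sketched directly: insert past queries, then walk to the prefix node and collect completions. A minimal version without ranking or result limits beyond `limit` (a real system would order suggestions by popularity):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class PrefixIndex:
    """Minimal trie-based prefix index for autocomplete suggestions."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the node for the prefix, if it exists.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # DFS below that node, collecting up to `limit` complete words.
        out, stack = [], [(node, prefix)]
        while stack and len(out) < limit:
            cur, word = stack.pop()
            if cur.is_word:
                out.append(word)
            for ch, child in sorted(cur.children.items(), reverse=True):
                stack.append((child, word + ch))
        return out

idx = PrefixIndex()
for q in ["search", "seat", "season", "see"]:
    idx.insert(q)
print(idx.suggest("sea"))   # ['search', 'season', 'seat']
```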


Quick Revision

  • Core: Inverted index (term → posting list); query = look up terms, merge, rank, return top K.

  • Scale: Shard index (by term or doc); replicate for reads; cache hot queries.

  • Indexing: Queue (Kafka) + indexer workers; async; eventual consistency of index.

  • Ranking: BM25 or similar for candidate retrieval; optional neural rerank on small set.

  • Interview: “We maintain an inverted index sharded by term; query servers hit cache for popular queries and otherwise query index shards, merge and rank results. New documents are pushed to a queue and indexer workers update the index asynchronously.”
