githubEdit

17. Design Search Engine

Difficulty: Hard Topics: Inverted Index, TF-IDF, Ranking, Tries Key Concepts: Inverted Index, Document Scoring, Tokenization.

Phase 1: Requirements Gathering

Goals

  • Design a system to ingest documents and allow keyword search.

  • Return a ranked list of documents based on relevance.

1. Who are the actors?

  • User: Types query "banana recipe".

  • Publisher: Uploads documents/pages.

  • Crawler/Indexer: Background process (Scope out crawler, focus on Indexer).

2. What are the must-have features? (Core)

  • Indexing: Parse text and store efficient lookup structures.

  • Search: exact match or partial match of keywords.

  • Ranking: Order results by frequency (or TF-IDF).

3. What are the constraints?

  • Latency: Search should be < 100ms.

  • Throughput: High read QPS.


Phase 2: Use Cases

UC1: Index Document

Actor: Publisher/System Flow:

  1. System receives DocID: 1, Content: "Apple banana".

  2. Tokenizer splits text -> ["apple", "banana"].

  3. Inverted Index updates:

    • "apple" -> adds Doc1

    • "banana" -> adds Doc1

UC2: Search Query

Actor: User Flow:

  1. User searches "banana".

  2. System looks up "banana" in Inverted Index.

  3. Returns list [Doc1, ...].

  4. Ranker sorts list by score.


Phase 3: Class Diagram

Step 1: Core Entities

  • SearchEngine: Facade.

  • InvertedIndex: The core data structure Map<Word, List<Posting>>.

  • Posting: Metadata about word in doc (frequency, position).

  • Document: The raw content.

UML Diagram

spinner

Phase 4: Design Patterns

1. Inverted Index (Data Structure Pattern)

  • Description: A data structure mapping content (words/tokens) to its location in a document or a set of documents.

  • Why used: It turns the search problem from "Scan all docs" (O(N)) into "Look up the word" (O(1)). This is fundamental for full-text search engines to achieve sub-second latency.

2. Strategy Pattern

  • Description: Defines a family of algorithms, encapsulates each one, and makes them interchangeable.

  • Why used: Search relevance is subjective and evolves. We need to swap Scoring Algorithms (TF-IDF, BM25, PageRank) easily to experiment with ranking quality without rewriting the Indexer or Searcher.


Phase 5: Code Key Methods

Java Implementation


Phase 6: Discussion

Ranking Algorithms

Q: How do we improve simple frequency counting?

  • A: "Use TF-IDF.

    • TF (Term Frequency): How often word appears in THIS doc.

    • IDF (Inverse Doc Frequency): log(TotalDocs / DocsWithTerm). Penalizes common words like 'the'."

Concurrency

Q: How to handle simultaneous Reads and Writes?

  • A: "Read-Write Locks or double buffering. In real systems (Lucene/Elasticsearch), segments are immutable. New docs go to a new segment. Segments are merged in background."

Scalability

Q: How to shard index?

  • A: "Document Partitioning: Each node holds a subset of documents (e.g., Doc 1-1000). Query goes to ALL nodes, results merged. Better for parallel processing."


SOLID Principles Checklist

  • S (Single Responsibility): InvertedIndex manages data structure, MiniGoogle manages retrieval logic.

  • O (Open/Closed): Scoring strategy can be injected.

  • L (Liskov Substitution): N/A.

  • I (Interface Segregation): N/A.

  • D (Dependency Inversion): N/A.

Last updated