Design a system to ingest documents and allow keyword search.
Return a ranked list of documents based on relevance.
1. Who are the actors?
User: Types query "banana recipe".
Publisher: Uploads documents/pages.
Crawler/Indexer: Background process (Scope out crawler, focus on Indexer).
2. What are the must-have features? (Core)
Indexing: Parse text and store efficient lookup structures.
Search: exact match or partial match of keywords.
Ranking: Order results by frequency (or TF-IDF).
3. What are the constraints?
Latency: Search should be < 100ms.
Throughput: High read QPS.
Phase 2: Use Cases
UC1: Index Document
Actor: Publisher/System Flow:
System receives DocID: 1, Content: "Apple banana".
Tokenizer splits text -> ["apple", "banana"].
Inverted Index updates:
"apple" -> adds Doc1
"banana" -> adds Doc1
UC2: Search Query
Actor: User Flow:
User searches "banana".
System looks up "banana" in Inverted Index.
Returns list [Doc1, ...].
Ranker sorts list by score.
Phase 3: Class Diagram
Step 1: Core Entities
SearchEngine: Facade.
InvertedIndex: The core data structure Map<Word, List<Posting>>.
Posting: Metadata about word in doc (frequency, position).
Document: The raw content.
UML Diagram
Phase 4: Design Patterns
1. Inverted Index (Data Structure Pattern)
Description: A data structure mapping content (words/tokens) to its location in a document or a set of documents.
Why used: It turns the search problem from "Scan all docs" (O(N)) into "Look up the word" (O(1)). This is fundamental for full-text search engines to achieve sub-second latency.
2. Strategy Pattern
Description: Defines a family of algorithms, encapsulates each one, and makes them interchangeable.
Why used: Search relevance is subjective and evolves. We need to swap Scoring Algorithms (TF-IDF, BM25, PageRank) easily to experiment with ranking quality without rewriting the Indexer or Searcher.
Phase 5: Code Key Methods
Java Implementation
Phase 6: Discussion
Ranking Algorithms
Q: How do we improve simple frequency counting?
A: "Use TF-IDF.
TF (Term Frequency): How often word appears in THIS doc.
IDF (Inverse Doc Frequency): log(TotalDocs / DocsWithTerm). Penalizes common words like 'the'."
Concurrency
Q: How to handle simultaneous Reads and Writes?
A: "Read-Write Locks or double buffering. In real systems (Lucene/Elasticsearch), segments are immutable. New docs go to a new segment. Segments are merged in background."
Scalability
Q: How to shard index?
A: "Document Partitioning: Each node holds a subset of documents (e.g., Doc 1-1000). Query goes to ALL nodes, results merged. Better for parallel processing."
SOLID Principles Checklist
S (Single Responsibility): InvertedIndex manages data structure, MiniGoogle manages retrieval logic.
O (Open/Closed): Scoring strategy can be injected.