githubEdit

19. Design Version Control

Difficulty: Hard Topics: Graph Theory (DAG), Hashing, Content Addressable Storage Key Concepts: Merkle Trees, Snapshots vs Deltas, Deduplication.

Phase 1: Requirements Gathering

Goals

  • Design a distributed version control system like Git.

  • Track history of files with efficiency (deduplication).

  • Support branching and merging.

1. Who are the actors?

  • Developer: Adds files, commits changes, creates branches.

  • Repository: Stores the data and history logic.

2. What are the must-have features? (Core)

  • Commit: Save a snapshot of the project.

  • Branch: Create a lightweight pointer to a commit.

  • Checkout: Switch working directory to a specific branch/commit.

  • Log: View history.

3. What are the constraints?

  • Storage: Don't store duplicate files. If file hasn't changed, point to old blob.

  • Integrity: History cannot be altered without changing IDs (SHA-1).


Phase 2: Use Cases

UC1: Commit Changes

Actor: Developer Flow:

  1. Dev adds file.txt to Staging Area.

  2. Dev executes commit("Fix bug").

  3. System calculates Hash of file.txt (Blob).

  4. System creates Tree Object (Directory structure).

  5. System creates Commit Object (Points to Tree, Parent Commit, Metadata).

  6. Update current Branch pointer to new Commit.

UC2: Create Branch

Actor: Developer Flow:

  1. Dev executes branch("feature-x").

  2. System creates a new Reference refs/heads/feature-x.

  3. Point it to the current Commit ID (HEAD).


Phase 3: Class Diagram

Step 1: Core Entities

  • Repository: Manages the objects.

  • Commit: Node in the history graph.

  • Blob: File content (Immutable).

  • Ref/Branch: Mutable pointer to a Commit.

UML Diagram

spinner

Phase 4: Design Patterns

1. Composite Pattern

  • Description: Composes objects into tree structures to represent part-whole hierarchies.

  • Why used: A Version Control System models a directory tree. A Tree object contains entries which can be Blobs (Files) or other Trees (Subdirectories). Composite allows treating individual files and directories uniformly.

2. Flyweight Pattern (Content Addressable Storage)

  • Description: Uses sharing to support large numbers of fine-grained objects efficiently.

  • Why used: In Git, if a file hasn't changed between commits, we don't save a new copy. Instead, both commits point to the exact same Blob hash. This massive deduplication makes Git efficient.


Phase 5: Code Key Methods

Java Implementation


Phase 6: Discussion

Delta vs Snapshot

Q: "Why Snapshots?"

  • A: "SVN used Deltas (Diffs). To check out version 100, you need Base + 100 diffs. Slow. Git uses Snapshots. Version 100 is fully linked. Unchanged files just point to the same Blob hash as version 99. Fast checkout."

Merging

Q: "How to handle Merge Conflicts?"

  • A: "Find Lowest Common Ancestor (LCA) of Branch A and Branch B.

    • If File X changed in A but not B (vs LCA) -> Keep A.

    • If File X changed in Both -> Conflict. User must resolve."

Distributed

Q: "How does git push work?"

  • A: "You send your Commit Graph to the server. Server checks which Objects (Commits/Blobs) it is missing and asks for them. It then updates its Branch Pointer (Reference)."


SOLID Principles Checklist

  • S (Single Responsibility): Commit stores data, MiniGit manages workflow.

  • O (Open/Closed): Logic to calculate Hash could be plugged in (SHA-1/SHA-256).

  • L (Liskov Substitution): N/A.

  • I (Interface Segregation): N/A.

  • D (Dependency Inversion): N/A.

Last updated