Arpit Bhayani — System Design for Beginners
Introduction
build a deeper understanding
pause, Absorb & proceed
Always better to implement
What is system design
a mixture of the following, along with how they interact with each other
decide the architecture
decide the components
decide the modules
What will we do when we design a system
break down problem statement into solvable sub problems
decide on key components and their responsibilities
decide on the boundaries of each component
touch upon the key challenges in scaling it
make the architecture fault tolerant and available
How to approach system design
Steps to follow
take baby steps, no matter what
understand the problem statement
without understanding the problem at hand one would easily digress
break it down into components (essentials)
do not create components for the sake of it
create components that you know are a must
ex — design facebook
features / components [these features are big enough to be a micro-service]
authentication
notification
feed
gamification
now go through each feature (microservice)
ex— feed system

web server — to serve the content to the user from the database
generator — to generate the content and store them into the database
aggregator — we can skip this if not sure about this functionality
For each sub component look into [repeat these for each component]
database and caching
scaling and fault tolerant
async processing (delegation)
communication
Add more sub-components if needed
understand the scope
How to evaluate if you have good system?
you broke your system into components
every component has a clear set of responsibilities, in most cases exclusive
feed web-server — serves feed over http
feed generator — pulls data from multiple services and puts them in db
feed aggregator
combines candidate items fetched by the generator
filters out redundant items, ranks them, and creates a final consumable feed
for each component you have slight technical details figured out
database and caching
scaling and fault tolerance
async processing(delegation)
communication
each component (in isolation) is
scalable — horizontally scalable
fault tolerant — plan for recovery (mostly of data, to a stable state) in case of a failure
available — the component remains functional and keeps serving requests
Relational database
most critical component of any system, these make or break a system
data is stored and represented in rows and columns.
ex— MySQL, PostgreSQL, SQL Server, Oracle
key properties
data consistency
data durability
data integrity
constraints
everything in one place
Because of these properties, relational databases provide transactions
A - Atomicity
all statements within a transaction take place, or none do
ex — publish a post and increase the total posts count
either both are executed completely or neither is
Mechanisms that ensure atomicity:
Transactions — ensure all SQL commands are executed together or not at all.
A transaction is a unit of work consisting of multiple SQL operations executed together.
If all operations succeed, the transaction commits (changes are permanently saved).
If any operation fails, the transaction rolls back (all changes are undone).
Write-Ahead Logging — logs changes before applying them to disk.
Before making changes to the database, relational databases first write the intended changes to a log file.
If a failure occurs, the database can use this log to rollback incomplete transactions.
Locking mechanisms — prevent concurrent modifications that lead to partial updates.
Row-level locks: Ensure that multiple transactions do not modify the same data simultaneously.
Pessimistic locking: Prevents other transactions from modifying data until the current transaction completes.
Optimistic locking: Used in high-concurrency environments, ensuring transactions only commit if no other changes occurred.
Savepoints — allow rolling back part of a transaction.
Savepoints allow rolling back only a part of a transaction instead of the entire transaction.
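A minimal sketch of atomicity using Python's built-in sqlite3 (the `users`/`posts` schema is made up for illustration): the post insert and the counter update either both commit or both roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, total_posts INTEGER DEFAULT 0);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    INSERT INTO users (id, total_posts) VALUES (1, 0);
""")

def publish_post(user_id, body):
    # `with conn` opens a transaction: commits on success, rolls back on any exception,
    # so the counter can never drift from the posts table
    with conn:
        conn.execute("INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, body))
        conn.execute("UPDATE users SET total_posts = total_posts + 1 WHERE id = ?", (user_id,))

publish_post(1, "hello world")
print(conn.execute("SELECT total_posts FROM users WHERE id = 1").fetchone())  # (1,)
```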
C - Consistency
data will never go incorrect no matter what
constraints, cascades, triggers
one has the necessary tools to ensure that the data never goes inconsistent
enforce consistency using various mechanisms:
Data Integrity Constraints
Primary Key Constraint – Ensures unique identifiers
Foreign Key Constraint – Maintains referential integrity
Check Constraints – Ensures column values meet specific conditions
Not Null Constraint – Prevents missing values
Transactions and Rollbacks
Relational databases ensure that transactions preserve consistency by rolling back if any error occurs.
Cascading Actions for Referential Integrity
Relational databases enforce consistency across related tables using ON DELETE CASCADE and ON UPDATE CASCADE
Isolation Levels to Prevent Dirty Reads
Databases use isolation levels to ensure consistency by controlling how transactions interact.

Write-Ahead Logging (WAL)
Before applying a change, the database writes the intended changes to a log.
If the system crashes, the database replays the log to restore consistency.
ex — foreign key checks do not allow you to delete parent if the child exists
cascade, trigger, constraints
I - Isolation
when multiple transactions are executing in parallel, the isolation level determines how much of one transaction's changes are visible to the other
how much detail of one transaction is visible to other transactions
if one transaction has 5 statements and another has 3 different statements, isolation determines how much of one transaction's work is visible to the other
one can configure the isolation level for these transactions

Relational databases implement isolation using transaction isolation levels and locking mechanisms.
Transaction Isolation Levels
Locking Mechanisms
Multi-Version Concurrency Control (MVCC)
MVCC allows multiple versions of a record, so readers don’t block writers and writers don’t block readers.
Used by PostgreSQL, MySQL (InnoDB), Oracle.
Instead of locking, transactions see old snapshots of data.
there are 4 isolation levels for relational databases
Read Uncommitted,
Read Committed,
Repeatable Read (standard practice, most cases this is used)
Serializable
Which Isolation Level Should You Use?
D- Durability
when a transaction commits, the changes outlive the outage

Relational databases (MySQL, PostgreSQL, SQL Server, Oracle) implement durability using:
1️⃣ Write-Ahead Logging (WAL)
2️⃣ Commit & Flush to Disk
3️⃣ Redo & Undo Logs
4️⃣ Replication & Backups
Write-Ahead Logging (WAL)
Before committing a transaction, databases log changes in a separate file (the WAL log).
If a crash occurs, the database can replay the log to recover committed transactions.
Used in PostgreSQL, MySQL (InnoDB), Oracle, SQL Server.
Commit & Flush to Disk
When a transaction commits, data must be physically written to disk.
Databases use fsync() or fdatasync() to ensure data is safely stored.
Redo & Undo Logs
Redo Logs: Ensure committed transactions are reapplied after a crash.
Undo Logs: Allow rollback of uncommitted transactions to maintain consistency.
Replication & Backups
Replication: Database copies data to secondary nodes, ensuring data is not lost if the primary server fails.
Backups: Periodic snapshots prevent data loss beyond the last backup.
Exercise
set up a SQL database (MySQL or PostgreSQL)
create a schema for a social network
insert data into users and profiles in one transaction
Database isolation level
relational databases provide ACID guarantees; the I in ACID (isolation) is what isolation levels let us tune
the isolation level dictates how much one transaction knows about the other
repeatable reads
consistent reads within same transaction
even if another transaction committed, the first transaction would not see the changes (if the value was already read)
no matter how many other transactions change the value, you get the same value you saw at the start of the transaction
read committed
reads within the same transaction always read fresh values
con: multiple reads of the same row within one transaction may return different values
always reads the latest committed value, even if the update was done in another transaction
read uncommitted
reads even uncommitted values from other transaction
dirty read
reads the latest value even if it was not committed in another transaction
serializable
every read is locking (depends on the engine) and while one transaction reads, others will have to wait
the storage engine can alter the implementation, so read the documentation before you change it
when a transaction tries to read data that another uncommitted transaction is working on, it waits until that earlier transaction completes
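A sketch of REPEATABLE READ in action, assuming psycopg2, a local PostgreSQL with a `test` database and an `accounts` table (all assumptions for illustration): the first transaction keeps seeing its original snapshot even after another transaction commits an update.

```python
import psycopg2  # assumes: local PostgreSQL, database `test`, table accounts(id, balance)

a = psycopg2.connect("dbname=test user=postgres")
b = psycopg2.connect("dbname=test user=postgres")

with a, a.cursor() as ca, b, b.cursor() as cb:
    ca.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    ca.execute("SELECT balance FROM accounts WHERE id = 1")
    first_read = ca.fetchone()

    # a second transaction updates the same row and commits in the meantime
    cb.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
    b.commit()

    # under REPEATABLE READ the first transaction still sees the value from its snapshot
    ca.execute("SELECT balance FROM accounts WHERE id = 1")
    second_read = ca.fetchone()
    assert first_read == second_read
```

Under READ COMMITTED the second SELECT would instead return the freshly committed value.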
Scaling databases
databases are the most important component of any system out there; they make or break any system
vertical scaling
add more CPU, RAM, and disk to the DB
requires downtime during the reboot
gives you the ability to handle more load
vertical scaling has a physical hardware limitation
Horizontal scaling: Read replicas
when read:write= 90:10
you move reads to other databases so that the master is free to do writes
to keep the master and replicas consistent we use synchronous or asynchronous replication: the master handles the writes while the replicas handle the reads
API servers should know which DB to connect to get things done
replication
changes on one DB (master) need to be sent to the replicas to maintain consistency
synchronous replication
the write does not complete until the operation is done on both the master and the replica
strong consistency
zero replication lag
slower writes
master pushes the data to the replica
Asynchronous replication
the write completes as soon as the operation is done on the master
eventual consistency
some replication lag
faster writes
the replica pulls the data from the master
to do this, the replica periodically checks with the master for updates made after its last sync
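A hypothetical sketch of how an API server could route writes to the master and reads to a replica (the DSNs, table, and class name are all placeholders; the same psycopg2 usage as above is assumed):

```python
import random
import psycopg2

class RoutedDB:
    """Route writes to the master and reads to a randomly picked replica."""

    def __init__(self, master_dsn, replica_dsns):
        self.master = psycopg2.connect(master_dsn)
        self.replicas = [psycopg2.connect(dsn) for dsn in replica_dsns]

    def write(self, sql, params=()):
        with self.master, self.master.cursor() as cur:
            cur.execute(sql, params)

    def read(self, sql, params=()):
        replica = random.choice(self.replicas)  # reads may be slightly stale under async replication
        with replica, replica.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

# db = RoutedDB("dbname=app host=master", ["dbname=app host=replica1", "dbname=app host=replica2"])
# db.write("INSERT INTO posts (user_id, body) VALUES (%s, %s)", (1, "hello"))
# rows = db.read("SELECT body FROM posts WHERE user_id = %s", (1,))
```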
Sharding
as one node cannot handle the data load, we split it into multiple exclusive subsets; a write on a particular row/document will go to one particular shard
This way we scale our overall database load
note: shards are independent; there is no replication between them
the API server needs to know which shard to connect to to get things done
Note — some databases have a proxy that takes care of the routing; one can do it in the API as well
each shard can have its own replication

exercise
create a master-replica DB where reads and writes are routed by the API
create a shard and replicate the master DB into a different shard, using both the API and a reverse proxy
Sharding and partitioning
SHARDING: a method of distributing data across multiple machines/databases
adv:
handles large reads and writes
increases overall storage capacity
high availability
dis:
operationally expensive
cross-shard queries are expensive
Partitioning: splitting the data into subsets; the subsets may live within the same instance
How a database is scaled
a db server is just a database process(mysql, mongodb) running on an ec2 instance
the machine has a port exposed through which we connect to it
as we get more users, we scale up our DB: give the EC2 instance more CPU, RAM, and disk
but this has a limit; beyond it we cannot scale up any further
for this we need to do the horizontal scaling
if one DB can handle 1000 WPS and we can't scale up beyond that, we split the data horizontally across more servers

by adding one more DB server we reduce the load on each node (say, to 750 WPS each), thus higher throughput
each database server is thus a shard, and we say that the data is partitioned
overall, the DB is sharded while the data is partitioned
one can choose, say, 5 partitions to be stored on 5 different shards or on 2 different shards

how do we partition the data
vertical partitioning
pick some tables and put them in one DB, and put the other tables in another DB
monolith to microservices
horizontal partitioning
very common
pick some rows and put them in one DB, and put the other rows in another DB (see the sketch below)
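A minimal sketch of horizontal partitioning by hashing the partition key (the shard names, key choice, and shard count are assumptions):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # e.g. three independent DB servers

def shard_for(user_id: str) -> str:
    # stable hash so the same user's rows always land on the same shard
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-123"))  # the API server (or a routing proxy) connects to this shard
```

Note that plain modulo routing forces a big reshuffle when the shard count changes; that is the problem consistent hashing (covered later) addresses.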

Non relational databases
a very broad generalisation for databases that are not relational (unlike MySQL, PostgreSQL)
ex — MongoDB, Cassandra, Redis, Couchbase, and Neo4j.
these are interesting as they shard out of the box (horizontal scalability)
document db
ex — mongo db, elastic search
mostly json based
supports complex queries (almost like relational (SQL) databases)
partial updates to a document are possible (can do total_posts += 1 without rewriting the entire document); one doesn't need to read the entire document to update a single field
closest to relational database
use-case: in app notification service, catalog service(for e-commerce site)
key value store
ex— redis, dynamodb, aerospike
extremely simple db
limited functionalities (get, put, del)
meant for key-based access patterns
does not support complex queries like aggregations
can be heavily sharded and partitioned
use-case: profile data, order data, auth data
(one could also serve these from a relational DB or a document DB)
Graph database
ex— neo4j, neptune, DGraph
what if the graph data structure had a database of its own
it stores data that are represented as nodes, edges and relations
ex— A — > B
great for running complex graph algorithms, which would be hard to implement over key-value stores
powerful for modelling social networks, recommendations, and fraud detection
Picking the right database
not a fight, so no need to pick a side
a db is designed to solve a particular problem really well
each kind of db picks a segment with a slight overlap
common misconception
picking a non-relational DB because relational DBs "do not scale"
why non-relational DB scale
there are no relations and constraints
data is modelled to be sharded (split across multiple db)
if we relax the above on relational db we can scale it too
do not use foreign keys
do not use cross-shard transactions
do manual sharding
Does this mean no DB is different?
no — each DB has some peculiar properties and guarantees, and if you need those, you pick that DB
How does this help in designing systems
while designing any system
do not jump to a particular DB right away
Understand WHAT data you are storing
understand HOW MUCH of data you will be storing
understand how you will be ACCESSING the data
what kind of QUERIES you will be firing
ANY SPECIAL feature you expect ex— expiration
how to pick the right DB (not exhaustive)
if a data can fit on a single node
relational DB — you need strong consistency / data correctness
relational DB — you need complex queries, aggregations
Redis — access is KV based but needs to be really fast
redis — need advanced data structures and algorithms
if data can't fit on one node
relational DB — if you have expertise in SQL and can do manual sharding
KV store like DynamoDB or MongoDB — simple KV-based access
graph DB like Neo4j — you require sophisticated graph algorithms
MongoDB or any document DB — nothing specific, but you want it future-proof
Caching
Caches are anything that helps us avoid an expensive network I/O, disk I/O, or computation. Ex —
API call to get profile information
reading a specific line from a file
doing multiple table joins
Store frequently accessed data in a temporary storage location.
the user sends a request to the API; if the data is present in the cache we return it from there, else we fire the query, store the result in the cache for the future, and then send the data to the user
Caches are faster but expensive, hence we don't cache all the data (we cache just the subset that is most likely to be accessed)
we typically use redis, memcache.
what to cache— recently published tweets, as these are more likely to be sent to the user.
Note — caches are not restricted to RAM-based storage; any storage that is nearer and helps us avoid something expensive is a cache for us.
One can call caches glorified hash tables.
Example --
google news — most recent news articles are stored in cache to be accessed, hence served from cache
auth tokens — authentication tokens are cached to avoid load on the DB
live stream — last 10 minutes of live stream is cached on CDN as it will be accessed the most
Exercise —
setup redis locally
put and get some data
measure time taken
compare it with database
Populating and scaling a cache
cache sits between the API server and the database.
there are two ways to populate the cache.
Lazy population (most popular)
reads first go to the cache; if the data exists, return it, else go to the database (or do the heavy operation), persist the result in the cache, and return it (see the cache-aside sketch after this list)
ex — caching blogs: we can store the JSON
Eager population
writes go to both the DB and the cache in the same request call
ex— live cricket score
thousands of people are watching the score and we will be serving it from the cache, so why not update the cache and the DB at once and save the cache miss
proactively push the data to the cache, as we anticipate the need
ex — when a celebrity tweets/posts something
when an account with 10000 followers posts something, proactively put it into the cache
we will need it anyway, and this saves a cache miss
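A sketch of lazy (cache-aside) population with redis-py, assuming Redis runs locally; `fetch_blog_from_db` is a hypothetical placeholder for the expensive DB query, and the 5-minute expiry is arbitrary:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_blog_from_db(blog_id):
    # placeholder for the real (expensive) database query
    return {"id": blog_id, "title": "hello", "body": "..."}

def get_blog(blog_id):
    key = f"blog:{blog_id}"
    cached = r.get(key)
    if cached is not None:                     # cache hit
        return json.loads(cached)
    blog = fetch_blog_from_db(blog_id)         # cache miss: do the heavy work
    r.set(key, json.dumps(blog), ex=300)       # persist in cache with a 5-minute expiry
    return blog
```

Eager population would instead call `r.set(...)` on the write path, right when the blog is published.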
Scaling a cache
cache is similar to a db hence scaling a cache is similar to a regular db.
vertical scaling --
make your cache bigger to handle more data/load
horizontal scaling (replica) scaling reads --
same data replicated to multiple nodes so that reads could scale
api -> master + replica
horizontal scaling (sharding )
data is partitioned across multiple shards so that writes can scale
each shard can have multiple replica
Caching at different levels
The most common cache we see is Redis
but that is not the only type of cache out there, nor the only place where caching can happen
literally every piece/component in our infrastructure caches something for us
Also, too much caching is bad — stale data and invalidation headaches
Client side caching
storing frequently accessed data on client side/user side.
ex— browser, mobile devices etc.
cache near constant data. ex— images, js files, user information etc.
it should be okay to serve the cached info (possibly stale)
invalidation by time(expiry)
massive performance boost as we need not make any request to the backend
CDN
content delivery network
set of servers distributed across the world
a request from a user goes to the nearest CDN server, and hence the user gets a very quick response
US folks getting images from US servers is faster than fetching them from India
CDN does lazy cache population
the user request comes to the CDN (closest server)
the CDN server checks if it has the data; if yes, it returns the data, else the CDN makes the same request to the origin
say hosting is in Australia and the CDN sits in front of the infra: if the CDN doesn't have the data, the request goes to Australia, the data comes back to the CDN, and then it gets served to the user
Remote cache(redis)
a centralized cache; most commonly we use Redis
multiple API servers use it to store frequently accessed data
every key stored should have an expiry (otherwise it is effectively a memory leak)
size of the cache is relatively small as compared to the DB
Database caching
instead of computing total posts by a user every time, we store the total posts as a column and update it once in a while
saves an expensive DB computation.
every time a post is published, also update the users table (which has the total_posts column) and do total_posts = total_posts + 1
Note — there are other places, like the load balancer, where we can cache
note — we can cache some data at every single component in the system, but should we do it? not necessarily (it is very use-case specific and subject to the tolerance for staleness of the served data)
Just because we can doesn't mean we should
Asynchronous processing
The user sends the request and we handle it immediately — this is synchronous
loading the Insta feed is synchronous, logging in on a website is synchronous, payments are synchronous
Most interactions on the web are synchronous
some things that should not be synchronous
spinning up a VM, as it takes time and the user will not stay on the same page waiting for the response
instead he/she would love to move around and keep checking status once in a while
this is asynchronous.
message queues/brokers
brokers help two services/applications communicate through messages
we use message broker when we want to do something asynchronously
ex— video processing
once the video is uploaded we need to convert it to other resolutions
when a video is uploaded, a message with the video's details is sent to the message broker, which delivers it to the workers that process the video
Features of message broker
brokers help us connect different subsystems
brokers act as a buffer for the messages thus the consumers can consume at their own pace, ex— notification system
brokers can retain the messages for n days (depends on the broker)
brokers can re-queue the message if it is not deleted (acknowledged)
ex — the consumer reads the message but dies before it could delete it; the broker re-queues it
Typical flow
ex — auto subtitle(auto captioning)
user uploads video to s3 through video service
the video service puts a message (after the upload completes) on the broker and returns a response to the user (the user sees "upload complete")
message is asynchronously read by captioner service
Exercise
set up RabbitMQ locally
write some code to push and read messages
go through the documentation
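A minimal pika sketch for the exercise, assuming RabbitMQ runs locally with default credentials and pika 1.x; the queue name and message payload are placeholders:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="video-jobs", durable=True)

# producer: push a message describing the uploaded video
channel.basic_publish(exchange="", routing_key="video-jobs", body='{"video_id": 42}')

# consumer: read messages and acknowledge (delete) them after processing
def handle(ch, method, properties, body):
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # unacked messages get re-queued by the broker

channel.basic_consume(queue="video-jobs", on_message_callback=handle)
channel.start_consuming()  # blocks; Ctrl+C to stop
```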
Message streams
similar to message queues
ex — whenever a new blog is uploaded, we want to index it in a search engine and do count++ for the user's total blogs
Approach 1: we use one message broker and put both pieces of logic in one consumer

The problem with this: what if the job fails after completing only one of the tasks?
Approach 2:
two brokers and two set of consumers
the API server writes to two brokers and each has its own set of consumers
this still does not solve the problem
hence we want a "written by one, read by many" semantic
this is where message streams come into the picture
approach 3 : message streams
similar to message brokers with one change : multiple types of consumers read the same message

the API server pushes one message to Kafka; the search and counter services both read the message and do their work
message queues — SQS, RabbitMQ
message streams — Kafka, Kinesis; here we have consumer groups like the search service and the counter service
each consumer group iterates over the messages in the same order
the messages lie in order and stay there even after being read (removed only per the retention policy)
Kafka essentials
a message stream that holds the messages; internally Kafka has topics
each topic has n partitions
a message is sent to a topic and, depending on the configured hash key, it is put into a partition
within a partition messages are ordered; there is no ordering guarantee across partitions
limitations of kafka
#consumers in a consumer group is limited by #partitions (extra consumers sit idle)
committing a message marks it as read (advances the consumer group's offset)
Exercise
setup kafka locally
write some code to push and read messages
go through the documentation
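A sketch using the kafka-python client (an assumption; any Kafka client works), assuming Kafka on localhost:9092 and a `blogs` topic; the two consumer groups mirror the search and counter services:

```python
from kafka import KafkaProducer, KafkaConsumer

# producer: the API server pushes one message per new blog
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("blogs", key=b"user-1", value=b'{"blog_id": 7}')  # same key -> same partition -> ordered
producer.flush()

# consumer: each consumer *group* gets its own copy of every message
consumer = KafkaConsumer(
    "blogs",
    bootstrap_servers="localhost:9092",
    group_id="search-service",   # a second process with group_id="counter-service" also sees the message
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```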
Realtime PubSub
Both message brokers and message streams require consumers to pull the messages out.
realtime pub/sub makes things reactive instead of relying on continuous polling
advantage of pulling: consumers pull at their own pace and do not get overwhelmed
disadvantage of pulling — consumption lag when ingestion is high
what if we want low latency and zero lag? -> realtime pub/sub
instead of consumers pulling the message, the message is pushed to them, ex — Redis pub/sub
this way we get really fast delivery, but it can overwhelm the consumers
what if consumers receive messages faster than they can process them?
practical use case : message broadcast, configuration push.
Exercise —
set up Redis locally
go through the Redis pub/sub documentation
test realtime broadcast
test whether it persists the messages
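A redis-py pub/sub sketch for the exercise (Redis assumed local, channel name arbitrary). Note that if no subscriber is listening when you publish, the message is simply lost — which is what the last exercise point is probing:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# subscriber (run this in one process)
p = r.pubsub()
p.subscribe("scores")
for message in p.listen():             # blocks, yielding messages as they are pushed
    if message["type"] == "message":
        print("received:", message["data"])

# publisher (run this in another process)
# r.publish("scores", "IND 245/3")     # returns the number of subscribers that received it
```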
Load balancers
One of the most important components in a distributed system; it makes it easy to scale the load horizontally
the load balancer is the only point of contact
it abstracts away the details of the system — which APIs are there, how many servers are there
every load balancer has either
static IP
static DNS name
Load balancer hides the #servers that are behind it allowing us to add as many servers as possible without client knowing about it.
Request response flow
the client already has the IP/domain of the load balancer, ex — auth.example.com
client makes api call and it comes to the load balancer — ex— get auth.example.com/login
load balancer picks one server and makes the same request
load balancer gets the response from the server
load balancer responds back to the client.
The job of the load balancer is to balance the load; it does that using load-balancing algorithms.
round robin
distribute the load iteratively, uniform distribution
weighted round robin
distribute the load iteratively but as per weights; non-uniform load
used when we have non uniform servers
least connection
pick the server with the least connections from the load balancer; used when response times have a big variance
hash based routing
a hash of some attribute (IP, user ID, URL) determines which server to use
random enough
one can build sticky sessions where the requests from one user go to one particular server
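A toy sketch of two of the algorithms above, round robin and hash-based routing (the server names are placeholders):

```python
import hashlib
import itertools

SERVERS = ["s1", "s2", "s3"]

# round robin: hand out servers iteratively, uniform distribution
_rr = itertools.cycle(SERVERS)
def round_robin():
    return next(_rr)

# hash-based routing: the same user always lands on the same server (sticky)
def hash_route(user_id: str) -> str:
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

print([round_robin() for _ in range(4)])  # ['s1', 's2', 's3', 's1']
print(hash_route("user-42"))              # deterministic pick
```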
Key advantages of load balancers
scalability
with more servers behind the load balancer we can now handle more requests
availability
even if one of the servers crashes, it does not take down our entire system
the load balancer will forward requests to the other healthy servers, improving availability
with S2 down, the load balancer will forward any new requests to S1 and S3
Circuit breakers
prevent cascading failures
no matter what happens, it stops the failure at some point, minimising the chance of a complete collapse of the system
ex--
social network (feed)
the feed service depends on the recommendation and trending services; the trending service depends on the post service, which depends on the profile service
the profile service depends on the profile DB and the post service depends on the post DB
if for some reason the profile DB is under heavy load, the profile service's response time shoots up, and so does that of every feature depending on it
if it is too slow, the HTTP connections start taking longer and this hogs the number of HTTP requests each service can accept
to solve this we call a service only if it is healthy (this is the circuit breaker); we break the circuit when we see the failure cascading
It prevents the entire product from collapsing by preventing cascading failure
How is it implemented
a common DB holds the settings for each breaker
before making calls to other services, a service checks this config; cache the config to avoid hitting the DB every time

Circuit breaker DB--
in case of the outage the circuit is tripped and DB is updated
services will periodically check it and stop sending requests to the affected service
Exercise
implement a simple circuit breaker (DB)
write a simple service that checks these settings every time before calling
ex — post service serves post
profile service serves profile
post service calls profile for info
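A minimal circuit breaker sketch for the exercise; here an in-memory dict stands in for the shared breaker DB, and the failure threshold and cool-off period are arbitrary:

```python
import time

# stands in for the shared "circuit breaker DB" that every service checks
breaker = {"post->profile": {"tripped": False, "tripped_at": 0.0}}
failures = {"post->profile": 0}
COOL_OFF_SECONDS = 30
FAILURE_THRESHOLD = 3

def call_profile_service(fetch):
    state = breaker["post->profile"]
    if state["tripped"]:
        if time.time() - state["tripped_at"] < COOL_OFF_SECONDS:
            raise RuntimeError("circuit open: skipping call to profile service")
        state["tripped"] = False                 # cool-off elapsed, allow a trial call (half-open)
    try:
        result = fetch()                         # the actual HTTP call to the profile service
        failures["post->profile"] = 0
        return result
    except Exception:
        failures["post->profile"] += 1
        if failures["post->profile"] >= FAILURE_THRESHOLD:   # trip after repeated failures
            state["tripped"] = True
            state["tripped_at"] = time.time()
        raise
```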
Data redundancy and recovery
API servers are stateless but databases are stateful
API servers going down is fine because a new one will spin up instantly
API servers get requests and it does not matter which one handles a given request
DB going down is catastrophic almost always an outage
A good system always takes care of catastrophic situation
the only way to protect ourselves against loss of data is to create multiple copies of it -> data redundancy
Redundancy can be implemented at row/document level, table level or db level.
Redundant data can be stored in a different table, a different DB, or a different region
Backup and restore
daily backup of data(incremental)
weekly complete backup
storing one copy across region -> disaster recovery
When something goes wrong just restore the last backup
almost always the easiest thing to do
Continuous redundancy
set up a replica of the DB and writes go to both DBs (sync/async)
API server writing to both db
API writes to one and is copied to other asynchronously
If the main DB goes down, the replica can take its place
Leader Election for auto recovery --
Leader election is the base condition of the "who watches the watcher" recursion: if one server goes down, another server comes up and takes over.
The orchestrator spins up a new machine and puts it behind the load balancer.
When an orchestrator goes down, another orchestrator comes up and takes over its responsibility.
In a leader-orchestrator setup we have an orchestrator leader (which keeps an eye on the orchestrator workers) and orchestrator workers (which keep an eye on the servers). When the orchestrator leader goes down, one of the orchestrator workers gets promoted to orchestrator leader.
Client- Server model --
the most common way for machines to talk to each other.
client (demands a job) and server (does the job)
The communication happens over the common network connecting the two.
The two transport protocols used are TCP (most of the time) and UDP.
Some important properties of TCP --
the connection breaks because of a network interruption
or because the server/client initiated the close
otherwise, the connection remains open almost forever.
Protocol over TCP --
TCP does not dictate what data can be sent over it; a common format agreed upon by the client and server is called a protocol, e.g., HTTP
HTTP is the language and TCP is whether we are speaking or writing.
HTTP --
HTTP is a format that the client and server understand; one can make their own format and have
the client send data in it
the server parse and process it
There are many versions of it — HTTP1.1/ HTTP 2/ HTTP3
HTTP 1.1 is the most commonly used one
For client and server to talk over HTTP 1.1, they need to establish TCP connection
The connection is typically terminated once the response is sent to the client
Almost a new connection for every request/response
Hence people pass the "Connection: keep-alive" header, which tells the client and server not to close the connection
Web-socket --
heart and soul of any real time communication
it provides bi-directional communication
Key feature: the server can proactively send data to the client, without the client asking for it
As there is no need to set up a TCP connection every single time, we get really low latency
Anywhere realtime, low-latency communication with your end user over the internet is needed, think about WebSockets
ex— chat, realtime likes on live stream, stock market ticks.
ex — build a chat application using SocketIO
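The exercise suggests SocketIO; below is an equivalent minimal broadcast-chat sketch using the `websockets` package instead (an assumption — any realtime library works), assuming a recent version where the handler receives just the connection:

```python
import asyncio
import websockets

CONNECTED = set()

async def chat(websocket):
    # the server can push to any connected client at any time -- no client polling needed
    CONNECTED.add(websocket)
    try:
        async for message in websocket:          # messages arrive over the long-lived connection
            for peer in set(CONNECTED):          # broadcast to every connected client
                await peer.send(message)
    finally:
        CONNECTED.remove(websocket)

async def main():
    async with websockets.serve(chat, "localhost", 8765):
        await asyncio.Future()                   # run forever

if __name__ == "__main__":
    asyncio.run(main())
```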
Blob Storage and S3 --
blob — binary large object.
earlier, when people uploaded files, the files went to the server and were stored on the hard disk attached to it
getting a file was simple: make an API call, the handler reads the file and returns it
This is precisely how static folders/routes worked.
When user uploads the file —
accept it on HTTP POST
create absolute path using folder
store the file at that location
Early days of internet
This worked well for some time but it can't work with multiple servers
Hence, we need an infinitely scalable network-attached storage/file system
Any file that needs to be accessible by any server is stored at a place accessible by all. The API servers are now stateless.
On S3 you have
bucket (namespace), eg: insta-images, mybucket
key: path of the file within the bucket
s3://insta-images/user123/72896.png [bucket/key]
One can seamlessly create and store files in S3.
Disadvantages--
reads are slower
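A boto3 sketch of the bucket/key model, assuming AWS credentials are configured and that the bucket name and file paths are placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "insta-images"                                          # bucket = namespace

# any API server can store the uploaded file...
s3.upload_file("/tmp/72896.png", BUCKET, "user123/72896.png")    # key = path within the bucket

# ...and any other API server can read it back, which keeps the servers stateless
s3.download_file(BUCKET, "user123/72896.png", "/tmp/copy.png")
```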
Bloom Filter
Tells you that a particular thing is definitely not part of a set.
Key insight: Once something is "watched" you cannot take it back
once a post/reel is watched by you, we do not take it out of the set
This means storing the actual data is not worth it.
This is the concept on which bloom filters are built
Bloom Filter --
the filter is a bit array (say, an 8-bit array)
when the bloom filter says something is not present, it is 100 percent sure; but when it says yes, it is not sure (false positives are possible)
Redis offers it as a feature (RedisBloom); nowadays people mostly go for that
Practical application
use it wherever we only insert and never remove
need a "no" with 100% certainty
having false positive is okay
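A toy bloom filter over a plain bit array (the size and number of hash functions are arbitrary); a real system would use RedisBloom or a library:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, item: str):
        # derive `hashes` independent positions from one item
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not seen"; True means "probably seen" (false positives possible)
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("reel-42")
print(bf.might_contain("reel-42"))   # True
print(bf.might_contain("reel-99"))   # almost certainly False
```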
Consistent hashing --
One of the most amazing and popular algorithms out there
We first understand consistent hashing and then look into the practical implementation.
Hash based ownership --
say we have a load balancer; when a request comes in, it uses a hash function to route the user's request to a particular server
Note — the hashing logic is not a service, just simple code running inside the load balancer
If one of the servers is taken down, the routing function changes and requests are now evenly distributed between the remaining two servers — no hiccups
This is why hash-based routing is one of the most common ways of routing for stateless backends.
But the problem arises when the Backend is stateful
picking which node owns the data depends on the hash based ownership.
So the request for the same key should go to a particular node.
Challenge : if a node is removed or added then the output of the hash function changes.
We then need to move the keys to their new nodes, to make sure that after any change the hash-based ownership still works as intended.
To solve this we use consistent hashing --
This algorithm helps in determining data ownership.
It will not do data transfer for us.
It is not a service in itself.
How consistent hashing works --
Look at it as a ring.
Given that the output range of the hash function is cyclic, we can visualize it as a ring of integers.
Every node occupies one slot in the ring; the slot is calculated by hashing the node's IP address.
This ring can be modelled as a simple array and can be part of the proxy.
the process is
key k1 -> hash function -> slot 0 -> node to the right -> node 0
key k2 -> hash function -> slot 10 -> node to the right -> node 3
Scaling up --
when we add a new node to the ring
Say node 3 hashes to slot 1
the keys that hashed between slot 12 and slot 1 will now be owned by node 3 instead of node 0
Other keys continue to remain at their respective node (minimal data movement)
operationally
snapshot node 0
create node 3
delete unwanted keys
Scaling down --
say we scale down and remove node 0; all the keys that were owned by node 0 will now be owned by node 2, the next node in the ring
minimal data transfer
operationally
copy everything from node 0 to node 1
remove node 0 from the ring
delete unwanted keys.
Exercise --
understand consistent hashing
read the blog
implement it in your favourite programming language
2 arrays +binary search
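A sketch of the "2 arrays + binary search" approach from the exercise (node addresses and keys are placeholders):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes):
        self.slots = []   # sorted hash slots on the ring
        self.nodes = []   # node owning the corresponding slot
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        slot = self._hash(node)
        idx = bisect.bisect(self.slots, slot)
        self.slots.insert(idx, slot)
        self.nodes.insert(idx, node)

    def remove_node(self, node: str):
        idx = self.nodes.index(node)
        del self.slots[idx], self.nodes[idx]

    def owner(self, key: str) -> str:
        # walk clockwise: first node slot >= the key's hash, wrapping around the ring
        idx = bisect.bisect_left(self.slots, self._hash(key)) % len(self.slots)
        return self.nodes[idx]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.owner("user-123"))
ring.add_node("10.0.0.4")     # only the keys between two adjacent slots change owner
print(ring.owner("user-123"))
```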
Big Data processing --
A set of tools that helps us process large amounts of data in a parallelized manner
problem — count number of words in 1 tb text data
approach 1 : run a loop which finds a space and increases the counter
approach 2: parallelize this so that we can divide the data and process it using threads
approach 3: if we have multiple servers, we make one server the coordinator; it divides the data into multiple chunks and sends each chunk to a worker
we have tools for this, such as Spark, which help us solve big-data tasks
we just need to write the business logic and tools will handle the distributed computation.
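A rough sketch of the coordinator/worker idea with Python's multiprocessing — the coordinator splits the file into chunks and workers count words in parallel (the file name and chunk size are placeholders; Spark does the same thing across machines):

```python
from multiprocessing import Pool

def count_words(chunk: str) -> int:
    return len(chunk.split())          # worker: counts words in its own chunk

def chunked(path, chunk_size=64 * 1024 * 1024):
    # coordinator: splits the file into fixed-size chunks
    # (chunk boundaries may split a word in two; good enough for a sketch)
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

if __name__ == "__main__":
    with Pool() as pool:
        total = sum(pool.imap(count_words, chunked("big.txt")))
    print("total words:", total)
```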
Real world systems --
E-commerce product listing page
Shop owner has to list 100 products
Requirements --
add a new product
update/delete existing product
list all the products on the website
customer should be able to quickly access the catalog
Storage --
not huge data, only 100 rows (fits on a single node)
100 * 1 KB = 100 KB of data
The data has a clear structure, so we can go ahead with a SQL DB, although any DB can be used
a products table to hold the entire catalog
Serving the data
simple REST based HTTP web-server is fine
we will need many of them (to handle the requests)
hence put a load balancer
load balancer will have DNS like api.mystore.com
Architecture till now
user -> LB-> api server->DB
instead of drawing the LB and multiple servers every time, we can draw a simpler diagram
user -> catalog backend -> catalog db
now adding a frontend
user -> catalog frontend -> catalog backend -> catalog db
Shop owner admin interface
shop owner needs admin console to manage the catalog
we will have admin ui which makes call to the backend to update the catalog
user -> frontend -> backend -> DB; shop owner -> admin UI -> backend -> DB
Understand load on each component
calls to the frontend are balanced as we have an LB in front of it, so it can be scaled independently
for the admin UI, 1 server is good enough as long as it is available all the time
the backend will have a decent load, so we need another LB to scale the backend
as the end user is only reading and not writing, we create a read replica of the DB so that read requests go to the catalog replica and the catalog master handles the writes
the master can handle read requests as well
We can add a cache as well, but then we would need to handle cache invalidation — we have to make sure the data in the cache is the latest.
exercise --
design DB schema for this system
write a simple backend API service exposing APIs
Setup DB replication
Move read APIs to read from replica.
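A minimal sketch of the catalog read API from the exercise, using Flask with sqlite3 standing in for the catalog replica (both are assumptions for illustration; the real setup would be MySQL/Postgres behind a replica):

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

def get_db():
    conn = sqlite3.connect("catalog.db")   # stand-in for the catalog read replica
    conn.row_factory = sqlite3.Row
    return conn

@app.route("/products")
def list_products():
    rows = get_db().execute("SELECT id, name, price FROM products").fetchall()
    return jsonify([dict(r) for r in rows])

@app.route("/products/<int:product_id>")
def get_product(product_id):
    row = get_db().execute(
        "SELECT id, name, price FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return jsonify(dict(row)) if row else ("not found", 404)

if __name__ == "__main__":
    app.run(port=8000)
```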
Design API rate limiter
Systems break down under tremendous load and we need to ensure that doesn't happen
Design a rate limiter that
limits the number of requests in a given period
allows developers to configure thresholds at a granular level
does not add a massive additional overhead
Where does it fit
if we have a frontend proxy, the request goes to the rate limiter to check if we can pass the request through
if yes then the request is passed to the service
or, if we have an LB, the request goes to the service, which then checks if the request is within the limits
Dissecting rate limiter
the rate limiter needs to track the number of requests in a given period
hence we need a db to hold the count but which one
for every incoming request we could update the db
for every incoming request we would read from db(aggregate)
we need the ability to track time
So we choose Redis as the key-value store and update the counter for every new request
The rate limiter is now just a Redis DB, but we need to scale it
this DB is write heavy, so we can shard it
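A fixed-window rate limiter sketch on Redis (the limit, window length, and key format are arbitrary choices):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)
LIMIT = 100          # max requests per client
WINDOW = 60          # per 60-second window

def allow(client_id: str) -> bool:
    window = int(time.time()) // WINDOW
    key = f"rl:{client_id}:{window}"
    count = r.incr(key)              # atomic counter per client per window
    if count == 1:
        r.expire(key, WINDOW)        # old windows expire on their own
    return count <= LIMIT

# in the proxy / service: reject with HTTP 429 when allow(client_id) returns False
```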