Arpit Bhayani system design masterclass
Foundational topics in system design I
To get the most value
speak up and participate
note down new terms that you hear
read about them later
Revise
Ask questions
If you need 1:1 help do not hesitate
What is system design
requirements -> system -> product development (architecture, service, module, application)
architecture design - macro components (bird's-eye view)
Logical design - business logic (algorithms, database, workflows)
Physical design — input/output interfaces, storage, processing [capacity planning, H/W decisions, UI framework, backup and restore, redundancy]
How to approach system design?
define the scope --> most important
functional scope [PM]
features for the user
e.g. follow, tweet, retweet
input and output behaviour
something that the system must have
non-functional scope [tech lead]
how the system should be implemented
redundancy, availability, security
speed security and reliability
something that determines system operational capability
while designing any system
do not assume
ask critical questions
challenge product/engineering decisions
Seek clarification
Building systems
bottom up
build individual components that run in harmony
incremental
build MVP
incrementally add features
complexities to it
Golden rule — start small and build on top
define building block
define relationships between them
define communication
identify bottlenecks and fix them
No premature optimisation
Building a blog
requirements
single user blog
multiple posts
each post has tags
browsing by tags
Search bar — enabling readers to find blogs
Blogs ordered by time (most recent first)
Blogs should support 1M concurrent readers
blog should be available
blog should be fast (subjective)
For any system we mostly need 6 components (each of these decisions will define how the system scales)
database
caching
scaling
delegation [one of the best ways to gain performance out of any system, e.g. message queues]
concurrency
communication
Let's go into details for each of them
Databases (each one is designed for a certain use case, and each comes with its own advantages and disadvantages)
in memory — caches
disk based — MySQL, MongoDB, etc.
server-based — a DB that runs as its own server process
embedded — db which runs within the same application
row based — how the data is structured in db
columnar
graph db — models subject-object relationships
time series DB — time along with metric, handling time series data
Relational — sql
Non relational — no sql
Blob storage — data is just a binary large object (BLOB)
Flat file storage — store flat files (blob storage where we can query on them)
Storage for text-based search — Elasticsearch
Tables
users
id
name
bio
blog
id
title
user_id
published_at [we can store this as a datetime, as an epoch (int) which is much faster to compare than a datetime, or even as a string of format YYYYMMDD]; store the time in UTC and convert it to the user's timezone on the fly (see the schema sketch after this list)
is_deleted
body [we can store the body directly, but storing big data like images in the DB can hurt the DB's performance; instead we can store an external reference to large objects like images]
excerpt
slug
updated
tags
id
name
blog_tags
blog_id
tag_id
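A minimal sketch of this schema using Python's built-in sqlite3 module (the column types, the secondary index on published_at, and the epoch-based timestamp are assumptions for illustration; any relational DB would do):

```python
import sqlite3
import time

conn = sqlite3.connect("blog.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    bio  TEXT
);
CREATE TABLE IF NOT EXISTS blog (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    user_id      INTEGER NOT NULL REFERENCES users(id),
    published_at INTEGER,              -- epoch seconds in UTC, converted to user time on read
    is_deleted   INTEGER DEFAULT 0,    -- soft-delete flag
    body         TEXT,                 -- or an external reference (URL) for large content
    excerpt      TEXT,
    slug         TEXT UNIQUE,
    updated      INTEGER
);
CREATE TABLE IF NOT EXISTS tags (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS blog_tags (
    blog_id INTEGER NOT NULL REFERENCES blog(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (blog_id, tag_id)
);
-- secondary index so the "most recent first" listing stays fast
CREATE INDEX IF NOT EXISTS idx_blog_published_at ON blog (published_at DESC);
""")

# store the publish time as a UTC epoch; render it in the user's timezone on the fly
conn.execute("INSERT OR IGNORE INTO users (id, name) VALUES (1, 'alice')")
conn.execute(
    "INSERT INTO blog (title, user_id, published_at, slug) VALUES (?, ?, ?, ?)",
    ("Hello world", 1, int(time.time()), "hello-world"),
)
conn.commit()
```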
Stuff to read
database redundancy
standby
1:1
N:N
Storage redundancy
RAID
Geo redundancy
data center redundancy
Now we need API server
user -> API -> database
API server will
talk to db on behalf of the user (abstraction)
have all the business logic
authorization checks
authentication
post-process / filter results (a minimal sketch follows)
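A minimal sketch of such an API server using Flask (the /posts route, the header-based auth check, and the query are assumptions for illustration; the notes only say the server sits between the user and the DB):

```python
import sqlite3
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

def require_auth() -> None:
    # assumption: a token header identifies the user; real auth would verify it
    if not request.headers.get("Authorization"):
        abort(401)

@app.get("/posts")
def list_posts():
    require_auth()  # authentication / authorization checks live here, not in the DB
    conn = sqlite3.connect("blog.db")
    rows = conn.execute(
        "SELECT id, title, excerpt, published_at FROM blog "
        "WHERE is_deleted = 0 ORDER BY published_at DESC LIMIT 20"
    ).fetchall()
    conn.close()
    # post-process / filter results before returning them to the client
    return jsonify(
        [{"id": r[0], "title": r[1], "excerpt": r[2], "published_at": r[3]} for r in rows]
    )

if __name__ == "__main__":
    app.run()
```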
Search box
To build an efficient search on top of our blog we can use Elasticsearch, where we store all the data and use this specialized DB for searching.
So we have to make sure that the data is replicated in two DBs: the original MySQL DB and the Elasticsearch DB.
For this blog we could even skip the MySQL DB and store everything in Elasticsearch, but for some use cases we would need both DBs, since Elasticsearch doesn't give us ACID properties (a dual-write sketch follows).
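A rough sketch of the dual-write idea, assuming the v8 elasticsearch Python client and a local cluster (the index name and the publish_post/search_posts helpers are made up for illustration; in practice this replication is often done asynchronously via a queue or change-data-capture):

```python
import sqlite3  # stand-in for the primary relational DB
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
db = sqlite3.connect("blog.db")

def publish_post(post_id: int, title: str, body: str) -> None:
    # 1. write to the primary (transactional) store
    db.execute("UPDATE blog SET title = ?, body = ? WHERE id = ?", (title, body, post_id))
    db.commit()
    # 2. replicate into the search index so readers can find the post by text
    es.index(index="blog_posts", id=post_id, document={"title": title, "body": body})

def search_posts(text: str) -> list[dict]:
    resp = es.search(index="blog_posts", query={"match": {"body": text}})
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```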
Caching
as this is a read-heavy system we would need caching, but caches are very costly
it improves the locality of reference
recent data most likely to be accessed
caches exist at all levels of the architecture
cache hit (maximize) and cache miss (minimize)
one can use cache at all the places in the system.
Application server cache [most underrated ]
request layer cache [request local]
central cache
Redis, Memcached
central and distributed cache
sharded data
CDN
Akamai
Cloudflare
helps in geographic spread
mostly static data is cached in the CDN
Caches also sit next to db
seamless access and transport
Honourable mentions
Disk cache
CPU cache
and GPU cache
What one should read
cache eviction policies
Characteristics of cache
build high-throughput systems
volatility
not for transactional data as it might go stale
quick reads and writes
super expensive
meant for performance
reduces disk I/O on DB
Network I/O on network calls
Types of cache
write through cache
Write to DB and write to cache
the write succeeds only when both writes succeed
data present at both places every time
pros: fast retrievals, data consistency, fault tolerance
cons: higher write latency
Write around cache
write to db but no write to cache
good for applications that do not read immediately after write
Pros: no flooding the cache with unused writes
Cons: recently written items will be cache misses; higher read latency [on first read]
Write back cache
write to cache and the I/O is considered done
data written to db in background
low latency, high throughput; good for write-intensive applications
cons: risk of data loss if the cache fails before flushing to the DB (a small sketch of the three write policies follows)
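A toy sketch of the three write policies, using a plain dict as the cache and another dict standing in for the DB (real systems would use Redis/Memcached and an actual database; the background flush is deliberately simplified):

```python
import queue
import threading

cache = {}
db = {}                            # stand-in for the real database
write_back_queue = queue.Queue()   # pending writes for the write-back policy

def write_through(key, value):
    # write to the DB and the cache together: higher write latency, strong consistency
    db[key] = value
    cache[key] = value

def write_around(key, value):
    # write only to the DB (and drop any stale cached copy): the next read is a cache miss
    db[key] = value
    cache.pop(key, None)

def write_back(key, value):
    # acknowledge after the cache write; the DB is updated later in the background
    cache[key] = value
    write_back_queue.put((key, value))

def flush_worker():
    # background worker draining the write-back queue into the DB
    while True:
        key, value = write_back_queue.get()
        db[key] = value
        write_back_queue.task_done()

threading.Thread(target=flush_worker, daemon=True).start()

write_through("post:1", "hello")
write_around("post:2", "world")
write_back("post:3", "!")
write_back_queue.join()   # in this demo, wait for the background flush to finish
```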
Content delivery network
get content from the closest available servers instead of hitting the main servers
improve speed (response times) by keeping content close to users
dirty/stale content may be served to users (invalidation takes time)
caching policies and configurations
we store content that does not change often: images, videos, text, API responses
Let's add cache to our blog
Foundational topics in system design II
Caching at various level
DNS level
API server (in memory of the server or on-disk; see the cache-aside sketch after this list)
Client side
Pre-computed (materialized views), good for read-heavy systems
Load balancers
centralised cache
transparent cache in front of DB ex — cacheops
API Gateway cache
DB buffer pool
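As a concrete example of caching at the API-server level, here is a cache-aside read path sketched with the redis-py client (the key format, the 300-second TTL, and the fetch_post_from_db helper are assumptions for illustration):

```python
import json
import sqlite3

import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis
CACHE_TTL_SECONDS = 300

def fetch_post_from_db(post_id: int) -> dict:
    conn = sqlite3.connect("blog.db")
    row = conn.execute(
        "SELECT id, title, body FROM blog WHERE id = ? AND is_deleted = 0",
        (post_id,),
    ).fetchone()
    conn.close()
    return {"id": row[0], "title": row[1], "body": row[2]} if row else {}

def get_post(post_id: int) -> dict:
    key = f"post:{post_id}"
    cached = r.get(key)                   # cache hit: skip the DB entirely
    if cached is not None:
        return json.loads(cached)
    post = fetch_post_from_db(post_id)    # cache miss: fall back to the DB
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(post))  # populate the cache with a TTL
    return post
```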
Scaling
what is scalability
number of requests the system can handle simultaneously
add more resources to the infrastructure
this refers to the scaling strategy to be applied
horizontal — the minions
adding more machines
we have load balancers to distribute the load across multiple machines
load balancers themselves are multiple machines which talk with each other to make sure the load is balanced among the servers
although we have multiple machines behind the LB, the DB is not scaled yet and can't handle that many requests, so the DB, Elasticsearch, and everything else need to be scaled accordingly
So how do we scale the DB horizontally?
master/worker — master handles writes and workers handle reads
sharded/distributed DB — we shard the database and partition the data; we can again replicate each shard (see the sketch after this list)
vertical — the hulk
making the infrastructure bulkier, scaling up
more CPU, RAM and disk
So which strategy to use
do not over-optimize
do not over-provision
Do it sequentially
start by vertical scaling
then add read replicas
then shard the database
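A tiny sketch of hash-based sharding for the blog table (the shard count, the user_id partition key, and the file names are assumptions; real setups often use consistent hashing or a lookup service so that resharding moves less data):

```python
import hashlib
import sqlite3

# assumed: 3 shards, each holding the same blog schema for a subset of users
SHARDS = ["blog_shard_0.db", "blog_shard_1.db", "blog_shard_2.db"]

def shard_for(user_id: int) -> str:
    # hash the partition key so users spread evenly across shards
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def get_posts_for_user(user_id: int):
    # all of a user's posts live on one shard, so this stays a single-shard query
    conn = sqlite3.connect(shard_for(user_id))
    rows = conn.execute(
        "SELECT id, title FROM blog WHERE user_id = ? ORDER BY published_at DESC",
        (user_id,),
    ).fetchall()
    conn.close()
    return rows
```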
Delegation
what does not need to be done in realtime should not be done in realtime
delegate whatever at hand and respond
typical flow of delegation
Client makes a request; the server delegates the task and immediately returns a response; the task is picked up by workers; workers, upon completing it, update the database
client periodically polls through a separate API call to fetch the status (optional)
effectively manage requests in a large scale distributed system(asynchronous)
Clients no longer need to wait for the task to be completed
makes it possible for applications/services to communicate asynchronously

How this works
we use a queue; it can be either a normal queue or a FIFO queue
what should be the features of such queue
expiration, delay; delivery guarantees: at most once, at least once, exactly once
a message waits in the queue until it is retrieved
Two common implementation
message queues
SQS, RabbitMQ, Redis, etc.
data streams / pub-sub
Kafka, Kinesis
ways to delegate
pub-sub — some publishers publish the data and subscribers read the data
this is done by message brokers like SQS, RabbitMQ, Redis, etc. (a small sketch follows)
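A minimal in-process sketch of this delegation flow using Python's queue and threading modules (a real system would use SQS/RabbitMQ/Kafka across machines, and job_status would live in a database; the payload and timings are made up):

```python
import queue
import threading
import time
import uuid

task_queue = queue.Queue()
job_status = {}   # stand-in for a status table in the database

def handle_request(payload):
    # the server delegates the work and immediately returns a job id to the client
    job_id = str(uuid.uuid4())
    job_status[job_id] = "queued"
    task_queue.put((job_id, payload))
    return job_id

def worker():
    # a worker picks up tasks and updates the "database" when done
    while True:
        job_id, payload = task_queue.get()
        job_status[job_id] = "processing"
        time.sleep(0.1)               # pretend this is slow work, e.g. resizing an image
        job_status[job_id] = "done"
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

job = handle_request("resize:cover.png")
task_queue.join()
print(job_status[job])   # in a real system the client polls a separate status API
```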
Getting basic analytics
Concurrency
being able to run multiple jobs with the impression that all jobs are running at the same time
read on concurrency vs parallelism
Sync- async programming model
tasks executed one after the other - sync
tasks can context switch in between - async
programming languages built on the above models
node js
java
python
Issues
communication b/w sub-components is complex
number of possible execution paths — very large
indeterminate outcomes
depending on the execution path
concurrent use of shared resources
high DB connections
same memory blocks without locks
complicated coordination and data exchange
How do we handle concurrency
make things thread safe
locks (see the sketch after this list)
lock-free data structures
Deadlock and starvation
parallel algorithms
lock-free data structures (conflict-free)
persistent data structures
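A small sketch of the shared-resource problem and the lock-based fix using Python threads (the counter and iteration counts are arbitrary; without the lock, the interleaved read-modify-write steps can lose increments):

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write: two threads can interleave here

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:            # only one thread mutates the shared counter at a time
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400000 with the lock; can be lower if unsafe_increment is used instead
```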
How will our users communicate? What if we want to show updates in real time?
communication decisions determine how different components of the system will communicate with each other
client server
client client -> p2p
inter service communication
intra service communication
inter process communication
intra process communication
the usual communication
client (demands) -> <- server (fulfills)
channel is kept open (optional timeout)
server does the heavy lifting
short polling
bad idea
client keeps asking the server every x seconds for any new data; the server responds instantly, but there is no guarantee of new data: it sends whatever it has
a new connection is made for every poll and closed after the response
ex — Cricbuzz, provisioning a server on the cloud
http long polling
same as short polling except the server does not send empty responses
in case of timeout the process needs to be repeated — establish the connection and request again (a client-side sketch follows this list)
what is the difference b/w short and long polling
the server does not send empty responses
this will lead to connections being open until
server sends the data
connection timeout
client reconnects when connection times out
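A rough client-side long-polling loop with the requests library (the /updates endpoint, the cursor parameter, and the 30-second timeout are assumptions; the server is expected to hold the request open until it has data or times out):

```python
import requests

UPDATES_URL = "https://example.com/api/updates"   # assumed long-poll endpoint

def long_poll(cursor=None):
    while True:
        try:
            # the server holds this request open until new data arrives or it times out
            resp = requests.get(UPDATES_URL, params={"cursor": cursor}, timeout=30)
            if resp.status_code == 200 and resp.content:
                payload = resp.json()
                cursor = payload.get("cursor", cursor)
                print("new data:", payload)
            # an empty / 204 response would mean the server gave up without data
        except requests.exceptions.Timeout:
            pass   # client simply reconnects when the connection times out
```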
Web sockets
client <--- bidirectional channel ---> server
best for quick-response cases like chats (a client sketch follows this list)
instead of client asking for data server can proactively send the data to client over the established channel
all the data reaches the client depending on when the data is available
real time data transfer
low communication overhead why
which protocol could it be using and why
when data changes frequently
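A minimal client sketch with the third-party websockets library (the wss URL and the subscribe message are assumptions; once the channel is established the server can push data whenever it becomes available):

```python
import asyncio

import websockets  # pip install websockets

async def listen():
    uri = "wss://example.com/live"            # assumed endpoint
    async with websockets.connect(uri) as ws:
        await ws.send("subscribe:scores")     # client can still send over the same channel
        async for message in ws:              # server pushes data proactively
            print("server pushed:", message)

asyncio.run(listen())
```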
How will our client talk to the server for the blog?
HTTP REST
Relational databases
Data records stored in tables (rows and columns)
tables related to other tables using 1:1, 1:N, N:N relations
Normal forms: 1NF, 2NF, 3NF; each successive normal form reduces redundancy further
Ex DBs: MySQL, Oracle, PostgreSQL
Constraints on columns and combinations of columns, like Unique, Not Null, Primary Key, Foreign Key
Transactions (transactions are a big reason relational DBs are used in the industry)
Atomicity - each transaction is a single unit of execution
Consistency : Database goes from one valid state to another
Durability: Post commit data is recoverable even after failure of system
Isolation: concurrent execution produces the same result as sequential execution (a small sketch follows)
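A small sketch of atomicity and consistency with sqlite3 (the simulated crash is contrived; the point is that either both statements take effect or neither does):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blog (id INTEGER PRIMARY KEY, title TEXT, is_deleted INTEGER DEFAULT 0);
CREATE TABLE blog_tags (blog_id INTEGER, tag_id INTEGER);
INSERT INTO blog (id, title) VALUES (1, 'hello');
INSERT INTO blog_tags VALUES (1, 10), (1, 20);
""")

try:
    with conn:   # one transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE blog SET is_deleted = 1 WHERE id = 1")
        conn.execute("DELETE FROM blog_tags WHERE blog_id = 1")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

# atomicity: neither statement took effect; the DB is still in its previous valid state
print(conn.execute("SELECT is_deleted FROM blog WHERE id = 1").fetchone())   # (0,)
print(conn.execute("SELECT COUNT(*) FROM blog_tags").fetchone())             # (2,)
```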
Database index
primary job is to improve lookups
improves lookup performance HOW?
Decreases the write performance HOW
An index is typically a small table having two columns: the primary/candidate key and the disk address where the KV/row is stored
Data structure used: B+ tree; B+ tree internals and serialization are good to know
Types of index
primary index — key — address to data
clustered index — index + data residing together, ordered by the key; huge boost in read performance, but reordering and updates are expensive (see the sketch after this list)
Secondary index, Composite index, Partial index etc.
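Continuing the sqlite3 sketch from the schema above, composite and partial indexes might look like this (the column choices are assumptions based on the blog's likely query patterns):

```python
import sqlite3

conn = sqlite3.connect("blog.db")
# composite index: serves "all posts by a user, newest first" with a single index scan
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_blog_user_published "
    "ON blog (user_id, published_at DESC)"
)
# partial index: only covers live posts, keeping the index small
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_blog_live "
    "ON blog (published_at DESC) WHERE is_deleted = 0"
)
conn.commit()
# every extra index speeds up these reads but adds work to every INSERT/UPDATE
```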
Database locking
Read lock/ shared lock
reserved for read by the current session
other session can read the locked data
other sessions cannot write update locked data
Write lock/ exclusive lock
reserved for write by current session
other session cannot read/write locked data
lock granularity: table, row, column, page level locking
if the engine detects a deadlock, it kills the transaction in which the deadlock was detected
why do we need DB locks when we can handle it at application side
Distributed arch/locks need a single source of truth <— multiple apps and processes
Locking reads
if you query the data and then INSERT/UPDATE within the same transaction, a regular SELECT statement does not give enough protection [autocommit=off]
SELECT ... FOR SHARE;
puts shared-mode locks on any rows that are read; other sessions can read but not modify them; if the rows are being modified, other transactions wait until the first one ends and then use the latest values
SELECT ... FOR UPDATE
locks the rows and any associated index entries; other transactions are blocked from updating these rows, or even reading them in certain isolation levels (a sketch follows this list)
Useful when dealing with tree structured data in same or split tables
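A sketch of a locking read, assuming a MySQL/InnoDB connection obtained from a DB-API driver such as mysql-connector-python (the publish_draft helper and the query are made up; sqlite3 would not work here since it has no FOR UPDATE):

```python
def publish_draft(conn, blog_id):
    # conn is assumed to be a MySQL connection with autocommit off
    cur = conn.cursor()
    cur.execute("START TRANSACTION")
    # locking read: the row (and its index entries) stays locked until COMMIT,
    # so no other session can change it between our read and our write
    cur.execute("SELECT is_deleted FROM blog WHERE id = %s FOR UPDATE", (blog_id,))
    (is_deleted,) = cur.fetchone()
    if is_deleted:
        conn.rollback()     # nothing to publish; release the lock
        return
    cur.execute(
        "UPDATE blog SET published_at = UNIX_TIMESTAMP() WHERE id = %s", (blog_id,)
    )
    conn.commit()           # releases the row lock
```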
Non relational databases
Distributed systems
Distributed ID generators
Social networks I
Social networks II
Storage Engines II
High throughput systems I
High throughput systems II
Information retrieval systems I
Information retrieval systems II
Algorithmic system design I
Algorithmic system design II
