#5 Metrics monitoring and alerting system
Design: Metrics Monitoring & Alerting System
Below is a production-ready design for a Metrics Monitoring & Alerting System (think Prometheus + Cortex/Thanos + Grafana + Alertmanager at scale). It covers scope, FR/NFR, APIs, architecture, data model and retention, ingestion & sampling, query & dashboards, alerting lifecycle, scaling/back-of-the-envelope math, operational concerns, trade-offs and an interview-ready summary.
1 — Scope & goals
Provide a system to:
Ingest time-series metrics from services (pull or push).
Store and index metrics efficiently for short/medium/long-term queries.
Serve dashboards, ad-hoc queries and SLO/SLA reporting.
Evaluate alerts and route incidents to on-call teams with deduplication and suppression.
Scale to tens/hundreds of thousands of hosts and millions of series, while controlling cardinality and cost.
Target use-cases: infrastructure metrics (CPU, mem), application metrics (requests/sec, error rate), service-level indicators (latency histograms), business KPIs.
2 — Functional requirements
Metrics ingestion: accept labelled time-series metrics (gauge, counter, histogram, summary). Support pull (Prometheus) and push (Pushgateway / remote write).
High-frequency scraping: support scrape intervals from 1s to 1m.
Durable storage: short-term hot storage for fast queries, long-term cold storage with downsampling.
Query API: support PromQL-like queries (range and instant).
Dashboards: render real-time visualizations (Grafana) with templating and sharing.
Alerting: evaluate rules continuously (thresholds, anomaly, SLO burn rate), support grouping, dedupe, silences, escalations.
SLO integration: compute error budgets & burn rates from metrics.
Multi-tenancy & RBAC: namespace isolation and per-tenant quotas.
Retention policies: configurable retention & downsampling per metric / tenant.
Exports & integrations: webhooks, PagerDuty, Slack, Opsgenie, email.
3 — Non-functional requirements
Latency: queries P95 < 200–500 ms for short-range, dashboard refresh < 1s for small panels.
Availability: 99.95% for reads & writes during business hours. Graceful degradation acceptable for cold queries.
Scalability: scale horizontally for ingestion, storage and query; support millions of series.
Durability: no data loss for committed metrics; configurable replication.
Cost-efficiency: downsampling & cold storage to control cost.
Security: TLS, authN/authZ, encrypted-at-rest for sensitive data.
Observability: internal metrics for ingestion rate, cardinality, query latency, alert eval times.
4 — High-level architecture
Components explained:
Scrapers / Sidecars: pull exporters or accept push via remote_write.
Ingest Gateway: shards writes (by tenant/label), enforces quotas, rejects high-cardinality bursts.
Short-term TSDB: fast append-only local stores per ingester (e.g., Prometheus TSDB format) for recent data.
Downsampler / Compactor: aggregates high-resolution samples into lower resolutions (1s → 1m → 5m) for older windows (see the sketch after this list).
Long-term storage: object store (S3/GCS) for compressed blocks/segments; partitioned by time & tenant.
Query Layer: global query frontends that run distributed queries, merging results from the short-term and long-term stores.
Alerting Engine: evaluates rules (push or pull), manages silences & notification retries.
Common open-source building blocks: Prometheus + Prometheus remote_write → Cortex/Thanos/M3DB backend + Grafana + Alertmanager.
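The Downsampler / Compactor above is essentially a rollup of raw samples into coarser buckets. A minimal Python sketch of that idea (the bucket width and the single aggregate are illustrative assumptions; real compactors keep several aggregates per bucket so rate() and quantiles still work):

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, step_seconds=60, agg=mean):
    """Roll raw (timestamp_s, value) samples into fixed-width buckets (e.g., 10s -> 1m)."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - (ts % step_seconds)].append(value)
    # One aggregated value per bucket, sorted by bucket start time.
    return sorted((bucket_ts, agg(values)) for bucket_ts, values in buckets.items())

# Example: 10-second raw samples rolled up to 1-minute resolution.
raw = [(t, float(t % 7)) for t in range(0, 600, 10)]
print(downsample(raw, step_seconds=60))
```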
5 — Data model & labels
Each sample: {metric_name, labels{service,instance,region,env,...}, timestamp, value}.
Histograms are stored as buckets & exemplars; counters are monotonic and need counter-reset handling.
Important: Labels drive cardinality. Limit free-form labels (e.g., user_id, request_id) — those belong in logs/traces.
Label hygiene best practices are enforced by the system (validation, drop rules).
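A minimal sketch of the sample shape and the kind of label-hygiene validation described above (field names, limits, and the forbidden-label list are illustrative assumptions, not any specific project's API):

```python
from dataclasses import dataclass
from typing import Dict

# Illustrative limits; real systems make these per-tenant configuration.
MAX_LABELS = 30
MAX_LABEL_VALUE_LEN = 128
FORBIDDEN_LABELS = {"user_id", "request_id"}  # high-cardinality values belong in logs/traces

@dataclass
class Sample:
    metric_name: str
    labels: Dict[str, str]
    timestamp_ms: int
    value: float

def validate_labels(sample: Sample) -> None:
    """Reject samples that would blow up cardinality or violate label hygiene."""
    if len(sample.labels) > MAX_LABELS:
        raise ValueError("too many labels")
    for key, value in sample.labels.items():
        if key in FORBIDDEN_LABELS:
            raise ValueError(f"label {key!r} is not allowed on metrics")
        if len(value) > MAX_LABEL_VALUE_LEN:
            raise ValueError(f"label {key!r} value too long")

validate_labels(Sample("http_requests_total",
                       {"service": "checkout", "region": "us-east-1", "env": "prod"},
                       1_700_000_000_000, 42.0))
```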
6 — Ingestion & throttling strategy
Sharding: hash by (tenant + metric_name) or label subset to map writes to ingesters.
Batching: clients should batch remote_write to amortize overhead.
Rate limits & quotas: per-tenant sample rate, series cardinality cap, label value length limits. Reject writes or sample/delay with backpressure (see the sketch after this list).
Spike protection: burst buffers + reject policy for sustained overload.
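A minimal sketch of per-tenant quota enforcement at the ingest gateway, combining a token-bucket sample rate limit with an active-series cap (the class name, limits, and burst policy are illustrative assumptions):

```python
import time

class TenantQuota:
    """Token-bucket sample rate limit plus a hard cap on active series per tenant."""

    def __init__(self, samples_per_sec: float, max_series: int):
        self.rate = samples_per_sec
        self.capacity = samples_per_sec          # allow roughly 1s of burst
        self.tokens = samples_per_sec
        self.last_refill = time.monotonic()
        self.max_series = max_series
        self.active_series = set()               # series fingerprints seen recently

    def allow(self, series_fingerprint: str, batch_size: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if series_fingerprint not in self.active_series:
            if len(self.active_series) >= self.max_series:
                return False                     # cardinality cap hit: reject new series
            self.active_series.add(series_fingerprint)
        if self.tokens < batch_size:
            return False                         # rate limit hit: client should back off
        self.tokens -= batch_size
        return True

quota = TenantQuota(samples_per_sec=100_000, max_series=1_000_000)
print(quota.allow("checkout{region=us-east-1}", batch_size=500))
```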
7 — Storage, retention & downsampling (with math example)
Example assumptions (to size the system)
10,000 hosts (nodes)
Each host exposes 100 unique time-series
Scrape interval: 10 seconds
Average sample size (including labels & overhead): 512 bytes per sample
Compute samples/sec:
1. Total series = hosts × series/host = 10,000 × 100 = 1,000,000 series.
2. samples/sec = 1,000,000 series / 10 s scrape interval = 100,000 samples/sec.
Compute bytes/sec:
3. bytes/sec = samples/sec × bytes_per_sample = 100,000 × 512 = 51,200,000 bytes/sec.
Per day:
4. bytes/day = 51,200,000 × 86,400 = 4,423,680,000,000 bytes/day ≈ 4.42368 TB/day (decimal TB).
30-day retention raw:
5. 4.42368 TB/day × 30 = 132.7104 TB raw.
After compression/downsampling (assume 5× effective compression plus aggressive downsampling for data older than 7 days):
6. Effective storage ≈ 132.7104 TB / 5 ≈ 26.54 TB for 30 days of stored data.
Notes:
Real systems often get 3–10× compression depending on metric churn.
Histograms and exemplars cost more per sample.
Use hot storage for recent 7–14 days, cold object store for older (downsampled) data.
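The arithmetic above can be wrapped in a small calculator so different scale assumptions are easy to try; this sketch simply reproduces the worked example (all defaults are the example's assumptions):

```python
def capacity_estimate(hosts=10_000, series_per_host=100, scrape_interval_s=10,
                      bytes_per_sample=512, retention_days=30, compression_factor=5):
    """Back-of-the-envelope storage sizing from samples/sec (decimal TB)."""
    total_series = hosts * series_per_host
    samples_per_sec = total_series / scrape_interval_s
    bytes_per_day = samples_per_sec * bytes_per_sample * 86_400
    raw_tb = bytes_per_day * retention_days / 1e12
    return {
        "total_series": total_series,
        "samples_per_sec": samples_per_sec,
        "tb_per_day_raw": bytes_per_day / 1e12,
        "tb_retained_raw": raw_tb,
        "tb_retained_effective": raw_tb / compression_factor,
    }

# Matches the worked example: ~100k samples/sec, ~4.42 TB/day, ~26.5 TB effective.
print(capacity_estimate())
```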
8 — Querying & dashboards
Query frontend: route queries to appropriate ingesters & long-term store; parallelize across time ranges and tenants.
Caching: cache recent query results (per-panel) and shared subqueries.
Query ops: time range split (short range served from hot store; long range stitched with downsampled blocks).
Dashboard templating: Grafana integrated with metric query API.
Optimization: precompute rollups for heavy business metrics and SLO calculations.
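A minimal sketch of the time-range split described above: short ranges are served from the hot store, long ranges are stitched from downsampled cold blocks plus the hot tail (the 14-day boundary and the store names are assumptions):

```python
import time
from typing import List, Optional, Tuple

HOT_RETENTION_S = 14 * 24 * 3600   # assumption: the most recent ~14 days live in the hot TSDB

def split_query_range(start_s: int, end_s: int,
                      now_s: Optional[int] = None) -> List[Tuple[str, int, int]]:
    """Decide which store(s) serve a range query and where the split point falls."""
    now_s = now_s or int(time.time())
    hot_start = now_s - HOT_RETENTION_S
    if start_s >= hot_start:
        return [("hot", start_s, end_s)]
    if end_s <= hot_start:
        return [("cold_downsampled", start_s, end_s)]
    return [("cold_downsampled", start_s, hot_start), ("hot", hot_start, end_s)]

now = int(time.time())
print(split_query_range(now - 3600, now, now))               # last hour -> hot store only
print(split_query_range(now - 60 * 24 * 3600, now, now))     # last 60 days -> cold + hot
```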
9 — Alerting system design & lifecycle
Components:
Rule definitions: recording rules (for derived metrics) and alerting rules (Prometheus-style).
Evaluation engine: scheduled evaluators that run rules at defined intervals and create or resolve alerts.
Dedup & group: group incidents by labels (service, region), dedupe similar alerts.
Silences & maintenance windows: suppress alerts via silences or maintenance mode.
Routing & escalation: route by label-based receiver mappings; escalate if not acknowledged.
Noise reduction:
Require a condition to hold for N evaluation intervals before firing (to avoid flapping).
Use alert inhibitions (suppress less important alerts when a critical one is firing).
Use anomaly detection + baseline-based thresholds to reduce static threshold noise.
Integration: PagerDuty, Opsgenie, Slack, email, webhooks.
Audit & incident timeline: record alert history, acknowledgments, responders.
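A minimal sketch of the evaluation loop with the "hold for N intervals" pending state and a label-based group key for dedup and routing (the rule shape and names are illustrative, not Alertmanager's actual API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

@dataclass
class AlertRule:
    name: str
    condition: Callable[[], bool]      # e.g., wraps the result of a PromQL threshold query
    for_intervals: int                 # condition must hold this many consecutive evaluations
    group_by: Tuple[str, ...] = ("service", "region")
    pending_count: int = 0
    firing: bool = False

def evaluate(rule: AlertRule, labels: Dict[str, str]) -> Optional[tuple]:
    """One evaluation tick: pending -> firing after N consecutive breaches, else resolve."""
    if rule.condition():
        rule.pending_count += 1
        if rule.pending_count >= rule.for_intervals and not rule.firing:
            rule.firing = True
            group_key = tuple(labels.get(k, "") for k in rule.group_by)
            return ("FIRING", rule.name, group_key)   # downstream dedup/routing keys off group_key
    else:
        rule.pending_count = 0
        if rule.firing:
            rule.firing = False
            return ("RESOLVED", rule.name, None)
    return None

rule = AlertRule("HighErrorRate", condition=lambda: True, for_intervals=3)
for _ in range(3):
    print(evaluate(rule, {"service": "checkout", "region": "us-east-1"}))
```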
Alert maturity lifecycle:
Create rule with severity & runbook link.
Test in staging for X hours.
Promote to prod; monitor false-positive rate.
Iterate thresholds & add suppression where needed.
10 — SLOs, SLIs & burn rate integration
Provide an SLO service that calculates error rates/latency from metrics and computes burn rate. Alerts for:
High error budget burn rate
SLA breaches
Use windowed aggregation and burn-rate-based alerts (e.g., 14-day & 1-day windows).
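A minimal sketch of the burn-rate math behind such alerts (the SLO target, error ratios, and the 14.4 threshold are illustrative assumptions; the window lengths would map to the 14-day/1-day pairing above or to shorter fast-burn windows):

```python
def burn_rate(window_error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / allowed error ratio (the error budget rate).

    1.0 means the budget is consumed exactly over the SLO period; >1 burns it faster.
    """
    error_budget = 1.0 - slo_target
    return window_error_ratio / error_budget

def should_alert(short_window_ratio: float, long_window_ratio: float,
                 slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Multi-window rule: both the short and the long window must burn fast to fire."""
    return (burn_rate(short_window_ratio, slo_target) >= threshold and
            burn_rate(long_window_ratio, slo_target) >= threshold)

# 99.9% SLO: a 2% error rate over both windows is a ~20x burn rate -> page.
print(should_alert(short_window_ratio=0.02, long_window_ratio=0.02))
```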
11 — Multi-tenancy, security & access control
Tenant isolation: tagged ingest streams and storage partitions by tenant; enforce quotas and per-tenant resource accounting.
AuthN/AuthZ: OAuth2 / mTLS for write & read access. Grafana with RBAC for dashboards.
Data encryption: TLS in transit; encrypt sensitive labels/metrics in storage if needed.
Audit logs: who created rule, who silenced, who viewed dashboard.
12 — Reliability, failure modes & mitigation
Hot ingester failure: replicate data to multiple ingesters; use consistent hashing + a replication factor for series placement (see the sketch at the end of this section).
Controller failure: use consensus-backed metadata (etcd) and multiple replicas.
Object store outage: serve recent blocks from local cache; read-only mode for dashboards.
Partition hotspots/cardinality spikes: reject or downsample high-cardinality series; autoscale ingestion.
Alert storms: implement alert inhibition & burst protection; route to escalation policies.
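A minimal sketch of consistent hashing with a replication factor for series-to-ingester placement, as referenced under ingester failure above (the ring is simplified; production rings use virtual nodes/tokens):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Simplified hash ring: a series maps to the next RF distinct ingesters clockwise."""

    def __init__(self, ingesters, replication_factor=3):
        self.rf = replication_factor
        self.ring = sorted((_hash(name), name) for name in ingesters)

    def owners(self, series_key: str):
        start = bisect.bisect(self.ring, (_hash(series_key), ""))
        owners = []
        for i in range(len(self.ring)):
            name = self.ring[(start + i) % len(self.ring)][1]
            if name not in owners:
                owners.append(name)
            if len(owners) == self.rf:
                break
        return owners

ring = Ring([f"ingester-{i}" for i in range(6)], replication_factor=3)
print(ring.owners("tenant-a/http_requests_total{service=checkout}"))
```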
13 — Operational concerns & observability of the monitoring system
Monitor internal metrics: ingestion rate, samples/sec, series count, active series per tenant, disk usage, compaction duration, query latency, alert evaluation latency.
Run synthetic monitoring and end-to-end tests (scrape exporters in staging, ensure metrics flow to alerts).
Capacity planning based on samples/sec and retention (see math).
Regular housekeeping: label hygiene audits, alert quality reviews.
14 — Trade-offs & choices
Prometheus-only (single server): simple, great for small clusters, limited scale & HA.
Prometheus + remote_write → Cortex/Thanos/M3: better scalability, multi-tenant, long-term retention.
Push vs Pull: Pull (Prometheus) simplifies target discovery & scraping; push required for ephemeral jobs/k8s short-lived tasks.
High cardinality: best controlled through design constraints (label limits, quotas, drop rules), which also keeps cost in check.
Complex alerts (ML/anomaly detection): reduce noise but add complexity and the risk of false negatives.
15 — Interview talking points / 10-minute pitch
Clarify scale & retention targets (samples/sec, retention days, cardinality limits).
Propose architecture: Prometheus scrapers + remote_write → distributed TSDB (Cortex/Thanos) + Grafana + Alertmanager.
Explain ingestion/sharding and cardinality protection (quotas, drop rules).
Describe storage tiers: hot TSDB + downsample + cold object store, with math showing sample/sec → storage.
Walk through alert lifecycle: rule eval → dedupe → route → silence/escalate.
Finish with trade-offs & future improvements (AIOps, adaptive sampling, per-tenant cost control).
Appendix — Quick checklist for production rollout
Enforce label/metric naming standards and training.
Implement per-tenant quotas and backpressure.
Provide runbooks and playbooks linked in alert metadata.
Auto-scale ingesters & query frontends.
Periodically review alerts for relevance and flapping.
Add SLO service and integrate burn-rate alerts.