Overview

🧱 The Big Picture (One-line)

Agents collect data → Backends store it → Grafana visualizes it → Alerts notify you

Your servers / apps
  → Agents (Promtail, Node Exporter)
  → Logs / Metrics / Traces storage (Loki, Prometheus, Tempo)
  → Grafana dashboards & alerts

1️⃣ Metrics → Prometheus

📊 What are Metrics?

Numbers measured over time, like:

  • CPU usage (%)

  • Memory usage

  • Disk space

  • Request count

  • Request latency

  • Error rate

🔧 What Prometheus Does

  • Scrapes (pulls) metrics from targets over HTTP at a configurable interval, commonly every 15 to 60 seconds

  • Stores time-series data

  • Very fast for numerical data

🧠 Example

CPU usage: 82%
Requests per second: 120
Error rate: 1.2%
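
Values like "Requests per second" are usually derived from raw counters rather than stored directly. The sketch below is a simplified illustration, not Prometheus's own code: it parses two scrapes in the text exposition format and computes a per-second rate. The metric name http_requests_total, the sample values, and the 15-second interval are assumptions made up for the example, and counter resets are not handled.

```python
def parse_samples(text):
    """Parse simple 'name value' lines of the Prometheus text format.

    Simplified: skips comments and blank lines, and does not handle
    label values containing spaces or trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def per_second_rate(earlier, later, interval_seconds):
    """Average per-second increase of a counter between two scrapes."""
    return (later - earlier) / interval_seconds


# Two hypothetical scrapes of the same target, 15 seconds apart.
scrape_1 = """
# TYPE http_requests_total counter
http_requests_total 10000
"""
scrape_2 = """
# TYPE http_requests_total counter
http_requests_total 11800
"""

first = parse_samples(scrape_1)["http_requests_total"]
second = parse_samples(scrape_2)["http_requests_total"]
print(per_second_rate(first, second, 15))  # 120.0
```

In PromQL the same idea is expressed with the rate() function over a range vector, which also compensates for counter resets.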

🏢 Why companies use it

  • De facto standard for Kubernetes monitoring

  • Lightweight

  • Powerful queries (PromQL)


2️⃣ Logs → Loki

📄 What are Logs?

Text events, like:

  • Errors

  • Stack traces

  • Access logs

  • Application output

🔧 What Loki Does

  • Stores logs efficiently

  • Labels logs instead of indexing full text (cheaper)

  • Works seamlessly with Grafana

🧠 Example

ERROR: Database connection failed
WARN: Token expired
INFO: User logged in
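
To make the "labels instead of full-text index" idea concrete, here is a toy model in Python. It is only a sketch of the concept, not how Loki is implemented: the class TinyLogStore and the labels app and level are hypothetical. Only the label set is used to locate a stream; the log text itself is scanned at query time.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy model of label-indexed log storage (not Loki's real design)."""

    def __init__(self):
        # Only the label set acts as an index key; lines are opaque text.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        key = frozenset(labels.items())
        self.streams[key].append(line)

    def query(self, selector, contains=""):
        """Select streams by label match, then filter lines by substring."""
        wanted = set(selector.items())
        results = []
        for key, lines in self.streams.items():
            if wanted <= key:  # every selector label is present in the stream
                results.extend(line for line in lines if contains in line)
        return results


store = TinyLogStore()
store.push({"app": "api", "level": "error"}, "ERROR: Database connection failed")
store.push({"app": "api", "level": "warn"}, "WARN: Token expired")
store.push({"app": "web", "level": "info"}, "INFO: User logged in")

print(store.query({"app": "api"}, contains="Database"))
# ['ERROR: Database connection failed']
```

The query above is roughly analogous to the LogQL expression {app="api"} |= "Database": a label selector narrows the streams, then a line filter scans the text.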

🏢 Why companies use it

  • Typically cheaper to run than ELK, because it does not build a full-text index

  • Cloud-native

  • Easy to scale


3️⃣ Traces → Tempo

🔗 What are Traces?

The end-to-end journey of a single request as it passes through multiple services.

Example:

Frontend → API → Auth Service → DB

🔧 What Tempo Does

  • Stores distributed traces

  • Helps find slow services

  • Integrates with OpenTelemetry

🧠 Example

Request ID: abc123
Frontend: 10ms
API: 50ms
DB: 300ms  ← bottleneck
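
The bottleneck in the example above can be found mechanically. This is a minimal sketch, assuming each duration is the time spent inside that service itself; the Span class and its fields are hypothetical simplifications. Real spans carry start and end timestamps and parent/child links.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: int  # assumed to be time spent in this service only


def slowest_span(spans):
    """Return the span with the largest duration."""
    return max(spans, key=lambda span: span.duration_ms)


spans = [
    Span("abc123", "Frontend", 10),
    Span("abc123", "API", 50),
    Span("abc123", "DB", 300),
]

bottleneck = slowest_span(spans)
total = sum(span.duration_ms for span in spans)
print(f"{bottleneck.service}: {bottleneck.duration_ms}ms of {total}ms")
# DB: 300ms of 360ms
```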

🏢 Why companies use it

  • Debug performance issues

  • See request flow visually

  • No high indexing cost


4️⃣ Dashboards → Grafana

🖥️ What Grafana Does

  • Single UI for metrics, logs, traces

  • Interactive dashboards

  • Correlate data easily

🧠 Example

  • Click CPU spike → see logs → open trace

  • One dashboard for entire system health

🏢 Why companies use it

  • Industry standard UI

  • Works with many data sources

  • Strong alerting


5️⃣ Alerts → Alertmanager

🚨 What Alerts Are

Automated notifications when something goes wrong.

🔧 What Alertmanager Does

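  • Receives alerts fired by Prometheus (the alert rules themselves are defined and evaluated in Prometheus, not in Alertmanager)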

  • Deduplicates alerts

  • Sends notifications
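
Deduplication relies on the fact that an alert is identified by its label set. The following is a toy sketch of that idea, not Alertmanager's actual code; the alert dictionaries and the source field are made up for illustration (for example, two Prometheus replicas reporting the same problem).

```python
def deduplicate(alerts):
    """Keep the first alert seen for each unique label set."""
    seen = set()
    unique = []
    for alert in alerts:
        fingerprint = frozenset(alert["labels"].items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(alert)
    return unique


# Hypothetical input: two replicas send the same HighCPU alert.
incoming = [
    {"labels": {"alertname": "HighCPU", "instance": "web-1"}, "source": "prom-a"},
    {"labels": {"alertname": "HighCPU", "instance": "web-1"}, "source": "prom-b"},
    {"labels": {"alertname": "HighCPU", "instance": "web-2"}, "source": "prom-a"},
]

for alert in deduplicate(incoming):
    print(alert["labels"]["instance"])
# web-1
# web-2
```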

🧠 Example Alerts

CPU > 90% for 5 minutes
Error rate > 5%
Service down
Free disk space < 10%
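
The "for 5 minutes" part matters: a brief spike should not page anyone. Below is a rough sketch of that logic in Python, assuming one sample per minute; the function is_firing and the sample data are hypothetical. In Prometheus this behaviour is configured with the for: field of an alerting rule.

```python
def is_firing(samples, threshold, for_seconds):
    """Decide whether an alert should fire.

    samples: list of (timestamp_seconds, value), oldest first.
    Fires only if the value has stayed above the threshold without
    interruption for at least for_seconds, up to the latest sample.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # condition cleared, restart the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Hypothetical CPU readings, one per minute.
sustained = [(0, 85), (60, 92), (120, 95), (180, 93), (240, 94), (300, 96), (360, 97)]
short_spike = [(0, 95), (60, 80), (120, 95)]

print(is_firing(sustained, threshold=90, for_seconds=300))    # True
print(is_firing(short_spike, threshold=90, for_seconds=300))  # False
```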

📢 Alert destinations

  • Email

  • Slack

  • Microsoft Teams

  • PagerDuty


6️⃣ Agents → Promtail & Node Exporter

🧲 What Agents Do

Run on each server to collect data.


🔹 Node Exporter

  • Collects system metrics

  • CPU, RAM, Disk, Network

Example:

node_cpu_seconds_total
node_memory_MemAvailable_bytes
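
node_cpu_seconds_total is a counter of CPU seconds per mode, so a usage percentage has to be computed from how fast the idle counter grows. A minimal sketch, assuming a hypothetical 4-core host scraped 15 seconds apart with made-up counter values:

```python
def cpu_usage_percent(idle_before, idle_after, interval_seconds, cores):
    """Percent of CPU time spent non-idle between two scrapes.

    idle_before / idle_after: node_cpu_seconds_total{mode="idle"}
    summed across all cores at each scrape.
    """
    idle_fraction = (idle_after - idle_before) / (interval_seconds * cores)
    return 100.0 * (1.0 - idle_fraction)


# Idle counters grew by 10.8 s out of 60 available core-seconds.
print(round(cpu_usage_percent(5000.0, 5010.8, 15, 4), 1))  # 82.0
```

A commonly used PromQL equivalent is 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100).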

🔹 Promtail

  • Collects logs

  • Reads:

    • /var/log/syslog

    • systemd journal (the logs that journalctl displays)

    • App logs

  • Sends logs to Loki


🧠 How All Layers Work Together (Real Scenario)

❗ Problem

Your website becomes slow.

🔍 Investigation Flow

  1. Grafana Dashboard shows CPU spike

  2. Prometheus Metrics show DB latency increase

  3. Loki Logs show DB connection errors

  4. Tempo Traces reveal DB query taking 500ms

  5. Alertmanager already notified your team

➡️ Root cause found in minutes


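| Benefit | Reason |
|---|---|
| Cost-effective | No per-GB licence fees when self-hosted |
| Scalable | Works from 1 to 1000+ servers |
| Cloud-agnostic | Works on any cloud |
| Open-source | No vendor lock-in |
| Industry standard | Widely adopted; Prometheus is a CNCF graduated project |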

🎯 Simple Memory Trick

P-L-T-G-A

  • Prometheus → Metrics

  • Loki → Logs

  • Tempo → Traces

  • Grafana → UI

  • Alertmanager → Alerts