Overview

🧱 The Big Picture (One-line)

Agents collect data → Backends store it → Grafana visualizes it → Alerts notify you

Your servers / apps
   ↓
Agents (Promtail, Node Exporter)
   ↓
Logs / Metrics / Traces storage (Loki, Prometheus, Tempo)
   ↓
Grafana dashboards & alerts

1️⃣ Metrics → Prometheus

📊 What are Metrics?

Numbers measured over time, like:

CPU usage (%)
Memory usage
Disk space
Request count
Request latency
Error rate

🔧 What Prometheus Does

Scrapes (pulls) metrics over HTTP at a configured interval, typically every 15 to 60 seconds
Stores them as time-series data
Fast to query for numerical data
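
To make the scrape step concrete, here is a minimal Python sketch of what a scrape returns: plain text in the Prometheus exposition format, parsed into (name, labels, value) samples. The sample body and the `parse_exposition` helper are illustrative assumptions, not Prometheus code; the real parser also handles escaping, timestamps, and histogram types.

```python
import re

# Hypothetical scrape response body in the Prometheus text exposition format.
SCRAPE_BODY = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_cpu_seconds_total 12.5
"""

# metric_name, optional {labels}, whitespace, value (no trailing timestamp).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exposition(body):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # Anything this sketch does not understand is ignored.
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Each parsed sample, stamped with the scrape time, becomes one point in a time series identified by its metric name and label set.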

🧠 Example

CPU usage: 82%
Requests per second: 120
Error rate: 1.2%
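
Values such as "requests per second" are usually derived from an ever-increasing counter rather than stored directly. The sketch below shows the arithmetic under simple assumptions; `per_second_rate` is a hypothetical helper, and PromQL's `rate()` function does more (for example, extrapolation to the window edges).

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (the counter restarted at 0).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # Counter reset: count from zero again.
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    return increase / elapsed
```

With three scrapes 15 seconds apart reading 1000, 2800, and 4600, the counter grew by 3600 over 30 seconds, which gives 120 requests per second.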

🏢 Why companies use it

De facto standard for Kubernetes monitoring
Lightweight
Powerful queries (PromQL)

2️⃣ Logs → Loki

📄 What are Logs?

Text events, like:

Errors
Stack traces
Access logs
Application output

🔧 What Loki Does

Stores logs as compressed chunks, usually in object storage
Indexes only the labels attached to each log stream, not the full log text (cheaper)
Works seamlessly with Grafana
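
The label-only indexing idea can be sketched in a few lines of Python. This is a toy model, not Loki's implementation: the `LabelIndexedStore` class and its method names are assumptions made for illustration. Streams are found through their labels, and the log text itself is only scanned at query time.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: label sets are indexed, log lines are only scanned."""

    def __init__(self):
        # Maps a frozen label set (one "stream") to its raw log lines.
        self._streams = defaultdict(list)

    def push(self, labels, line):
        """Append a line to the stream identified by its labels."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Return lines from streams that carry every selector label,
        filtered with a plain substring scan (no full-text index)."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                results.extend(line for line in lines if contains in line)
        return results
```

Calling `query({"app": "api"}, contains="Token")` on this toy store plays the same role as the LogQL query `{app="api"} |= "Token"`: pick streams by label first, then filter their lines.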

🧠 Example

ERROR: Database connection failed
WARN: Token expired
INFO: User logged in

🏢 Why companies use it

Usually cheaper to run than ELK, because it does not build a full-text index
Cloud-native
Easy to scale

3️⃣ Traces → Tempo

🔗 What are Traces?

End-to-end request journeys across services.

Example:

Frontend → API → Auth Service → DB

🔧 What Tempo Does

Stores distributed traces
Helps find slow services
Integrates with OpenTelemetry

🧠 Example

Request ID: abc123
Frontend: 10ms
API: 50ms
DB: 300ms  ← bottleneck
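
The same bottleneck hunt can be expressed as a small Python sketch. It assumes each duration is the time spent inside that service alone; the `Span` class and `find_bottleneck` helper are hypothetical names, not part of Tempo or OpenTelemetry.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float


def find_bottleneck(spans):
    """Return the slowest span and its share of the total trace time."""
    total = sum(span.duration_ms for span in spans)
    slowest = max(spans, key=lambda span: span.duration_ms)
    return slowest, slowest.duration_ms / total


# The example trace from above, keyed by its request ID.
TRACE = {"abc123": [Span("Frontend", 10), Span("API", 50), Span("DB", 300)]}
```

For the trace `abc123`, this picks the DB span, which accounts for 300 of the 360 ms total.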

🏢 Why companies use it

Debug performance issues
See request flow visually
Low indexing cost: traces are looked up by trace ID

4️⃣ Dashboards → Grafana

🖥️ What Grafana Does

Single UI for metrics, logs, traces
Interactive dashboards
Correlate data easily

🧠 Example

Click CPU spike → see logs → open trace
One dashboard for entire system health

🏢 Why companies use it

Industry standard UI
Works with many data sources
Strong alerting

5️⃣ Alerts → Alertmanager

🚨 What Alerts Are

Automated notifications when something goes wrong.

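🔧 What Alertmanager Does

Receives firing alerts from Prometheus, which evaluates the alert rules
Groups and deduplicates alerts
Routes notifications to the right receiver, with silencing and inhibition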
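
Deduplication and grouping can be pictured with the following Python sketch. It is a simplified model, not Alertmanager's actual logic: `group_alerts` is a hypothetical helper, and its `group_by` parameter only borrows the name of the `group_by` routing option.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then group the rest.

    alerts: list of label dicts, one per incoming alert.
    group_by: label names that define a notification group.
    """
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue  # Same alert received again: deduplicate.
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

Grouping by `alertname` means that ten servers firing the same alert produce one notification listing ten instances instead of ten separate messages.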

🧠 Example Alerts

CPU > 90% for 5 minutes
Error rate > 5%
Service down
Free disk space < 10%
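
A condition such as "CPU > 90% for 5 minutes" only fires when the threshold is exceeded continuously for the whole period, which is what the `for:` clause of a Prometheus alert rule expresses. The Python sketch below shows that logic under simplified assumptions; `is_firing` is a hypothetical helper, not Prometheus code.

```python
def is_firing(samples, threshold, for_seconds):
    """Return True when the value has stayed above the threshold
    continuously for at least for_seconds, up to the latest sample.

    samples: list of (timestamp_seconds, value), oldest first.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # Start of the current breach.
        else:
            breach_start = None  # Any dip below the threshold resets the timer.
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds
```

A short dip below the threshold resets the timer, which keeps brief spikes from paging anyone.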

📢 Alert destinations

Email
Slack
Microsoft Teams
PagerDuty

6️⃣ Agents → Promtail & Node Exporter

🧲 What Agents Do

Run on each server to collect data.

🔹 Node Exporter

Exposes system metrics over HTTP (port 9100 by default) for Prometheus to scrape
Covers CPU, RAM, Disk, Network

Example:

node_cpu_seconds_total
node_memory_MemAvailable_bytes
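
`node_cpu_seconds_total` is a counter of CPU seconds per mode (idle, user, system, and so on), so a usage percentage has to be derived from the change between two readings. The Python sketch below shows the arithmetic; `cpu_busy_percent` and the sample readings are assumptions for illustration.

```python
def cpu_busy_percent(before, after):
    """Percentage of CPU time spent in non-idle modes between two readings.

    before / after: dicts mapping CPU mode to cumulative seconds,
    as reported by node_cpu_seconds_total for one CPU.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between readings")
    return 100.0 * (1.0 - deltas.get("idle", 0.0) / total)


# Two hypothetical readings taken 15 CPU-seconds apart.
BEFORE = {"idle": 1000.0, "user": 200.0, "system": 100.0}
AFTER = {"idle": 1003.0, "user": 210.0, "system": 102.0}
```

Here idle time grew by 3 of 15 seconds, so the CPU was 80% busy. In PromQL the same idea is commonly written as `100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))`.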

🔹 Promtail

Collects logs
Reads:
- /var/log/syslog
- The systemd journal (the same data journalctl shows)
- App logs
Sends logs to Loki
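
The "send" step amounts to posting batches of labelled lines to Loki's push endpoint (`/loki/api/v1/push`). The Python sketch below only builds the JSON body and makes no network call; `build_loki_push_payload` and the label values are hypothetical, and Promtail itself also handles batching, file positions, and retries.

```python
import json
import time


def build_loki_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's push endpoint.

    labels: dict of stream labels, e.g. {"job": "syslog"}.
    lines: log lines belonging to that stream.
    now_ns: base timestamp in nanoseconds (defaults to the current time).
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp as a nanosecond string, log line]. Adding i
    # keeps lines in order; that offset is a choice made for this sketch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

The labels chosen here are what Loki indexes, so keeping them few and low-cardinality (job, host, app) is what keeps the index small.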

🧠 How All Layers Work Together (Real Scenario)

❗ Problem

Your website becomes slow.

🔍 Investigation Flow

Grafana Dashboard shows CPU spike
Prometheus Metrics show DB latency increase
Loki Logs show DB connection errors
Tempo Traces reveal DB query taking 500ms
Alertmanager already notified your team

➡️ Root cause found in minutes

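🏆 Why This Stack Is Popular in Companies

Benefit	Reason
Cost-effective	No per-GB license fees when self-hosted; you pay only for infrastructure
Scalable	Works from 1 to 1000+ servers
Cloud-agnostic	Works on any cloud
Open-source	Reduces vendor lock-in
Industry standard	Prometheus is a CNCF graduated project; Grafana, Loki, and Tempo are widely adopted open-source projects from Grafana Labs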

🎯 Simple Memory Trick

P-L-T-G-A

Prometheus → Metrics
Loki → Logs
Tempo → Traces
Grafana → UI
Alertmanager → Alerts