Overview
🧱 The Big Picture (One-line)
Agents collect data → Backends store it → Grafana visualizes it → Alerts notify you
Your servers / apps
↓
Agents (Promtail, Node Exporter)
↓
Logs / Metrics / Traces storage (Loki, Prometheus, Tempo)
↓
Grafana dashboards & alerts
1️⃣ Metrics → Prometheus
📊 What are Metrics?
Numbers measured over time, like:
- CPU usage (%)
- Memory usage
- Disk space
- Request count
- Request latency
- Error rate
🔧 What Prometheus Does
- Scrapes (pulls) metrics from each target over HTTP at a configurable interval, commonly every 15 to 60 seconds
- Stores time-series data
- Very fast for numerical data
🧠 Example
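A minimal Python sketch of how a per-second request rate can be derived from a counter that is scraped periodically. The scrape values are made up for illustration, and the calculation is a simplified take on what PromQL's rate() does (the real function also extrapolates to the edges of the time window).

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter only goes up. A drop means the process restarted,
        # so the new value is counted as an increase from zero.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Hypothetical scrapes of a request counter, taken every 15 seconds.
scrapes = [(0, 1000), (15, 1030), (30, 1075), (45, 1120), (60, 1180)]
print(per_second_rate(scrapes))  # 3.0 requests per second
```

Storing the raw counter and computing the rate at query time is what lets the same data answer questions over different time windows.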
🏢 Why companies use it
- De facto standard for Kubernetes monitoring
- Lightweight
- Powerful queries (PromQL)
2️⃣ Logs → Loki
📄 What are Logs?
Text events, like:
- Errors
- Stack traces
- Access logs
- Application output
🔧 What Loki Does
- Stores logs efficiently
- Indexes only the labels attached to each log stream, not the full text (cheaper)
- Works seamlessly with Grafana
🧠 Example
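A toy Python sketch of the label-only indexing idea. TinyLogStore and its methods are hypothetical names for illustration, not Loki's API: lines are grouped into streams by their label set, and the text itself is only scanned at query time.

```python
from collections import defaultdict


class TinyLogStore:
    """Hypothetical in-memory store that indexes labels, not log text."""

    def __init__(self):
        # Index: frozen label set -> raw log lines of that stream.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self.streams.items():
            # Step 1: cheap label match picks the candidate streams.
            if wanted <= stream_labels:
                # Step 2: brute-force text filter, only inside those streams.
                results.extend(line for line in lines if contains in line)
        return results


store = TinyLogStore()
store.push({"app": "checkout", "env": "prod"}, "ERROR db connection refused")
store.push({"app": "checkout", "env": "prod"}, "INFO order 42 created")
store.push({"app": "search", "env": "prod"}, "ERROR timeout calling index")

errors = store.query({"app": "checkout"}, contains="ERROR")
print(errors)
```

In spirit this mirrors a LogQL query such as {app="checkout"} |= "ERROR": the label selector narrows the streams, then the filter scans their lines.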
🏢 Why companies use it
- Usually much cheaper to run than ELK (Elasticsearch, Logstash, Kibana)
- Cloud-native
- Easy to scale
3️⃣ Traces → Tempo
🔗 What are Traces?
End-to-end journeys of a single request across services.
Example:
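A minimal Python sketch of one trace modeled as a list of spans. The service names and durations are invented for illustration, and the Span class is a simplification, not the OpenTelemetry data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    duration_ms: float


# One hypothetical request: frontend -> checkout-api -> payments / database.
trace = [
    Span("a", None, "frontend", "GET /checkout", 620),
    Span("b", "a", "checkout-api", "create order", 580),
    Span("c", "b", "payments", "charge card", 60),
    Span("d", "b", "database", "SELECT orders", 500),
]


def self_time(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes the children run one after another, not in parallel.
    """
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def slowest_span(spans):
    return max(spans, key=lambda s: self_time(s, spans))


worst = slowest_span(trace)
print(worst.service, worst.name, worst.duration_ms)
```

Looking at self time instead of total duration keeps the root span from always appearing slowest, since it contains every call beneath it.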
🔧 What Tempo Does
- Stores distributed traces
- Helps find slow services
- Integrates with OpenTelemetry
🧠 Example
🏢 Why companies use it
- Debug performance issues
- See request flow visually
- Low indexing cost (traces are looked up by trace ID)
4️⃣ Dashboards → Grafana
🖥️ What Grafana Does
- Single UI for metrics, logs, and traces
- Interactive dashboards
- Correlate data easily
🧠 Example
- Click CPU spike → see logs → open trace
- One dashboard for entire system health
🏢 Why companies use it
- Industry standard UI
- Works with many data sources
- Strong alerting
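5️⃣ Alerts → Alertmanager
🚨 What Alerts Are
Automated notifications when something goes wrong.
🔧 What Alertmanager Does
- Receives alerts fired by alert rules (the rules themselves are defined and evaluated in Prometheus)
- Deduplicates and groups alerts
- Routes notifications to the right destination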
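A small Python sketch of deduplication, assuming that two alerts count as the same alert when their label sets are identical. The alert names and fields below are hypothetical, and this is an illustration of the idea, not Alertmanager's implementation.

```python
def fingerprint(labels):
    # Sorting makes the fingerprint independent of label order.
    return tuple(sorted(labels.items()))


def deduplicate(alerts):
    """Keep the first alert seen for each distinct label set."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert["labels"]), alert)
    return list(seen.values())


# Hypothetical alerts: the first two differ only in label order.
incoming = [
    {"labels": {"alertname": "HighCPU", "instance": "web-1"}, "summary": "CPU above 90%"},
    {"labels": {"instance": "web-1", "alertname": "HighCPU"}, "summary": "CPU above 90%"},
    {"labels": {"alertname": "HighCPU", "instance": "web-2"}, "summary": "CPU above 90%"},
]

unique = deduplicate(incoming)
print(len(unique), "notifications instead of", len(incoming))
```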
🧠 Example Alerts
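A rough Python sketch of a threshold alert that only fires after the condition has held for several evaluations in a row, similar in spirit to the "for" clause of a Prometheus alert rule. The 80% CPU threshold and the sample values are assumptions chosen for illustration.

```python
def alert_state(values, threshold, for_evaluations):
    """Return 'inactive', 'pending', or 'firing' for the latest evaluation.

    values: one measurement per rule evaluation, oldest first.
    """
    streak = 0
    for value in values:
        # Count consecutive evaluations above the threshold.
        streak = streak + 1 if value > threshold else 0
    if streak == 0:
        return "inactive"
    return "firing" if streak >= for_evaluations else "pending"


# Hypothetical CPU usage (%) seen at five consecutive evaluations.
cpu_usage = [72, 85, 91, 93, 95]
print(alert_state(cpu_usage, threshold=80, for_evaluations=3))
```

Requiring the condition to hold for a while filters out short spikes that would otherwise page someone for nothing.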
📢 Alert destinations
- Email
- Slack
- Microsoft Teams
- PagerDuty
6️⃣ Agents → Promtail & Node Exporter
🧲 What Agents Do
Run on each server to collect data.
🔹 Node Exporter
- Collects system metrics
- CPU, RAM, Disk, Network
Example:
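Node Exporter serves its metrics as plain text for Prometheus to scrape. The Python sketch below parses a few lines in that text format; the metric values are made up, and the parser is deliberately simplified (it ignores timestamps and escaped quotes inside label values).

```python
import re

# Sample scrape output with illustrative values.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
node_memory_MemAvailable_bytes 2147483648
"""

# metric_name, optional {labels}, then the value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```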
🔹 Promtail
- Collects logs
- Reads:
  - /var/log/syslog
  - The systemd journal (what journalctl shows)
  - App logs
- Sends logs to Loki
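A minimal Python sketch of the kind of JSON body an agent sends to Loki: a set of labels plus timestamped log lines. The labels and log lines are hypothetical, the shape follows Loki's HTTP push endpoint (/loki/api/v1/push) as far as I know, and nothing is sent over the network here.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a push body: one stream with its labels and log lines."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # Adding the index keeps the timestamps distinct and ordered.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_push_payload(
    {"job": "varlogs", "host": "web-1"},  # hypothetical labels
    ["Started nginx.service", "Out of memory: killed process 4242"],
    now_ns=1_700_000_000_000_000_000,  # fixed so the output is repeatable
)
print(json.dumps(payload, indent=2))
```

The labels travel with the whole batch of lines, which is why choosing a small, stable label set matters more than tagging each line.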
🧠 How All Layers Work Together (Real Scenario)
❗ Problem
Your website becomes slow.
🔍 Investigation Flow
- Grafana Dashboard shows CPU spike
- Prometheus Metrics show DB latency increase
- Loki Logs show DB connection errors
- Tempo Traces reveal a DB query taking 500 ms
- Alertmanager has already notified your team
➡️ Root cause found in minutes
🏆 Why This Stack Is Popular in Companies
| Benefit | Reason |
|---|---|
| Cost-effective | No per-GB license fees when self-hosted |
| Scalable | Works from 1 to 1000+ servers |
| Cloud-agnostic | Works on any cloud |
| Open-source | No vendor lock-in |
| Industry standard | Prometheus is a CNCF graduated project; Grafana, Loki, and Tempo are widely adopted |
🎯 Simple Memory Trick
P-L-T-G-A
- Prometheus → Metrics
- Loki → Logs
- Tempo → Traces
- Grafana → UI
- Alertmanager → Alerts