Technical Dashboards with Grafana and Prometheus: A Practical, No‑Fluff Guide

November 21, 2025 at 05:31 PM | Est. read time: 13 min

By Valentina Vianna

Community manager and producer of specialized marketing content

If your dashboards don’t drive decisions, they’re just pretty pictures. Grafana and Prometheus are the backbone of modern observability because they turn raw telemetry into real, actionable insight. This guide shows you how to plan, build, scale, and operationalize technical dashboards that engineering, DevOps, and SRE teams actually use—without burning time or budget.

You’ll learn practical patterns, query recipes, alerting tips, and dashboard design best practices that work in the real world, from single services to Kubernetes-scale platforms.

What Grafana and Prometheus Do (and why they belong together)

  • Prometheus is a time-series database and monitoring system built around a pull-based model. It scrapes metrics exposed by your applications and exporters, stores them locally, and lets you query them using PromQL.
  • Grafana is the visualization and alerting layer. It connects to Prometheus (and many other data sources) to build dashboards, alerts, annotations, and reports that teams can actually act on.

Modern stacks often extend this duo:

  • Logs with Loki
  • Traces with Tempo or Jaeger via OpenTelemetry
  • Long-term metric storage with Thanos, Mimir, or Cortex
  • Synthetic monitoring with the Blackbox exporter

Together, these give you a complete observability picture.

Start with the right questions: SLIs, SLOs, and the “Golden Signals”

Before drawing a single graph, define what “good” looks like.

  • SLIs (Service Level Indicators): the things you measure (e.g., availability, latency, error rate).
  • SLOs (Service Level Objectives): the targets you aim for (e.g., 99.9% availability over 30 days).
  • Golden Signals: latency, traffic, errors, saturation (popularized by Google SRE).

Helpful frameworks:

  • RED (Requests, Errors, Duration) for user-facing services
  • USE (Utilization, Saturation, Errors) for infrastructure resources

These ensure dashboards reflect service health and user experience—not vanity metrics.

Instrumentation essentials: get the data right at the source

Choose the right metric type:

  • Counter: a value that only increases (e.g., requests_total)
  • Gauge: value goes up and down (e.g., memory_used_bytes)
  • Histogram: buckets for latency/size distributions
  • Summary: client-side quantiles; use carefully as they can’t be aggregated across instances

Prefer histograms for service latency, then compute p95/p99 via PromQL.

Good naming and labels prevent pain later:

  • Names: http_request_duration_seconds_bucket, cpu_usage_seconds_total
  • Labels: {service="payments", endpoint="/checkout", method="POST", status="2xx"}
  • Avoid high-cardinality labels like user_id, session_id, or randomly generated IDs
  • Keep cardinality budgets—hundreds to a few thousand series per service is typical; millions is a fire alarm

Example (Go) histogram for HTTP latency:

```go
import "github.com/prometheus/client_golang/prometheus"

// Register once at startup with prometheus.MustRegister(httpLatency)
// before exposing the /metrics endpoint.
var httpLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Latency of HTTP requests.",
        Buckets: prometheus.ExponentialBuckets(0.005, 2, 12), // 5ms to ~10s
    },
    []string{"method", "route", "status"},
)
```

Prometheus setup: scrape configs, recording rules, and retention

Basic scrape config example (static):

```yaml
scrape_configs:
  - job_name: 'app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app1:9100', 'app2:9100']
```

Kubernetes service discovery is recommended at cluster scale, either via kubernetes_sd_configs or the Prometheus Operator's ServiceMonitor resources (kube-prometheus).
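
A minimal ServiceMonitor sketch, assuming the Prometheus Operator (kube-prometheus-stack); the payments selector, the metrics port name, and the release label are placeholders that must match your own Service and the operator's serviceMonitorSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments
  labels:
    release: kube-prometheus-stack   # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payments                  # labels on the target Service
  endpoints:
    - port: metrics                  # named port that exposes /metrics
      interval: 15s
```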

Recording rules speed up heavy queries and standardize metrics:

```yaml
groups:
  - name: service.rules
    rules:
      - record: job:http_request_duration_seconds:95p
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )
```

Set local retention (e.g., 15–30 days) and write aggregated series to long-term storage if needed.
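
As a sketch of that split: local retention is a server flag (for example --storage.tsdb.retention.time=15d), and long-term shipping is configured with remote_write. The Mimir URL and the job:.* filter below are assumptions, not required values:

```yaml
remote_write:
  - url: https://mimir.example.internal/api/v1/push   # placeholder long-term store
    write_relabel_configs:
      # Ship only recording-rule aggregates (named job:*) to keep remote volume small
      - source_labels: [__name__]
        regex: 'job:.*'
        action: keep
```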

PromQL that teams use daily: query patterns and recipes

  • Request rate (RPS):
    • sum(rate(http_requests_total[5m]))
    • sum by (service)(rate(http_requests_total[5m]))
  • Error rate:
    • sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  • Latency p95 (histogram):
    • histogram_quantile(0.95, sum by (le)(rate(http_request_duration_seconds_bucket[5m])))
  • CPU usage per pod:
    • sum by (pod)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))
  • Saturation (memory):
    • node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
  • SLI for availability:
    • 1 - (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))

Use rate() for counters, avg_over_time() for gauges, and max_over_time() or min_over_time() for boundaries. Keep windows (e.g., [5m]) consistent across panels and alerts.

Alerting that reduces noise: Alertmanager patterns that work

Build alerts that map to SLOs, not just infrastructure blips.

  • Multi-window, multi-burn-rate alerts (catch both fast and slow burns of error budgets; see the burn-rate sketch after the examples below)
  • Grouping and routing by service/team (routing sketch after this list)
  • Silence with expiry and label-based targeting
  • Link alerts to runbooks and dashboards
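
For the grouping and routing item above, a minimal Alertmanager routing sketch. The receiver names, the team label, the Slack channel, and the PagerDuty key are placeholders, and the Slack receiver assumes a slack_api_url in the global section:

```yaml
route:
  receiver: slack-default
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        team: payments
        severity: page
      receiver: payments-pager
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
  - name: payments-pager
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
```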

Examples:

High error rate (generic)

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      / sum(rate(http_requests_total[5m])) by (service) > 0.05
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High error rate ({{ $labels.service }})"
    runbook: "https://internal.runbooks/sre/{{ $labels.service }}/errors"
```

SLO burn-rate alert (p95 latency)

```yaml
- alert: LatencySLOBurn
  expr: |
    histogram_quantile(0.95, sum by (service, le)(rate(http_request_duration_seconds_bucket[5m]))) > 0.3
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "p95 latency above SLO ({{ $labels.service }})"
```
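
Both examples above use static thresholds. The multi-window, multi-burn-rate pattern mentioned earlier instead compares the error rate against the error budget over a long and a short window at once. A minimal sketch of the fast-burn pair, assuming a 99.9% availability SLO over 30 days (0.1% error budget) and the same http_requests_total metric; the 14.4 factor follows the common SRE-workbook recipe, and a companion rule with a factor of 6 over 6h/30m windows would catch slow burns:

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
        sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
        sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error budget burning fast (99.9% availability SLO)"
```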

Grafana dashboard design: structure for speed and clarity

A great dashboard is fast to read and hard to misinterpret.

  • Layout
    • Top row: global status (SLOs, error rate, RPS, p95 latency)
    • Middle: resource saturation (CPU, memory, I/O, GC)
    • Bottom: detailed breakdowns and tables for drill-down, logs/traces links
  • Consistency
    • Use the same colors and thresholds across dashboards (e.g., green < yellow < red)
    • Display units (ms, %, RPS, bytes) and value mappings
    • Align time ranges and refresh rates (e.g., 15s or 30s for real-time)
  • Variables and templating
    • Variables for cluster, namespace, service, instance
    • Repeat panels by variable (one panel per instance without handcrafting)
    • Annotations for deploys, incidents, feature flags
  • Actionability
    • Link to runbooks and relevant panels
    • Add panel descriptions and “how to read this panel” notes in tooltips

If you’re connecting multiple backends (Prometheus, Loki, Tempo, SQL, etc.), see these best practices for building Grafana dashboards with multiple data sources.

Step-by-step: build a production-grade API dashboard

1) Pick SLIs and targets

  • Availability: 99.9% over 30 days
  • Latency: p95 < 300ms
  • Error rate: < 1%

2) Create PromQL panels

  • p95 latency: histogram_quantile(0.95, sum by (le)(rate(http_request_duration_seconds_bucket[5m])))
  • Error rate: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  • RPS by endpoint: sum by (route)(rate(http_requests_total[5m]))

3) Add system health

  • CPU/memory of pods
  • GC pause (for JVM/Go apps)
  • DB connections (from SQL exporter)
  • External dependency availability (Blackbox exporter)

4) Add variables

  • service, namespace, environment, instance

5) Wire alerts

  • HighErrorRate, LatencySLOBurn, and InstanceDown (InstanceDown sketch below)
  • Include runbook and owner
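
HighErrorRate and LatencySLOBurn were sketched in the alerting section; an InstanceDown rule can be as simple as watching the built-in up metric (the job label is a placeholder):

```yaml
- alert: InstanceDown
  expr: up{job="app"} == 0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Instance {{ $labels.instance }} ({{ $labels.job }}) is down"
```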

6) Annotate deployments

  • Add CI/CD webhook to Grafana annotations so every deploy shows on timelines

7) Test under load

  • Run a load test; verify panels update smoothly and alerts fire appropriately

Scaling and reliability: from one Prometheus to many

  • High availability: run Prometheus in pairs (identical scrape configs) behind a load-balanced Grafana data source
  • Federation: aggregate metrics at higher tiers (e.g., per-region → global)
  • Remote write: ship to Thanos/Mimir/Cortex for long-term storage and query
  • Sharding: split scrape targets across Prometheus servers to reduce load
  • SSDs and IOPS: Prometheus is write-heavy; provision storage accordingly
  • Cardinality control:
    • Drop high-cardinality labels with relabel_configs (see the sketch after this list)
    • Avoid per-request IDs in labels
    • Set retention and churn budgets
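
A sketch of the label-dropping item above. Note that removing labels from already-scraped series happens in metric_relabel_configs (relabel_configs acts on targets before the scrape); the label names and metric pattern are examples only:

```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app1:9100']
    metric_relabel_configs:
      # Strip unbounded labels before samples are stored
      - action: labeldrop
        regex: 'user_id|session_id|request_id'
      # Drop an entire noisy metric family
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop
```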

If you’re streaming business events alongside metrics, learn how pipelines fit in with this primer on Apache Kafka and real-time streaming. Grafana can visualize Kafka cluster metrics via JMX exporters and correlate performance across systems.

Beyond metrics: logs, traces, and web performance

  • Logs (Loki): query logs side-by-side with metric anomalies, drill down via labels like service, pod, and trace_id
  • Traces (Tempo/Jaeger): add exemplars to time-series panels; jump into traces for slow requests
  • Real User Monitoring (RUM): visualize Core Web Vitals (LCP, FID/INP, CLS) in Grafana to connect infra changes to UX

For a deeper dive into performance KPIs and where they come from, see this guide: Measuring what matters: web performance metrics, tools, and APIs.

Must-have exporters and integrations

  • node_exporter: system metrics
  • blackbox_exporter: HTTP/TCP/ICMP probes for external checks (probe job sketch after this list)
  • cAdvisor/kubelet metrics + kube-state-metrics (Kubernetes)
  • database exporters: postgres_exporter, mysqld_exporter, redis_exporter
  • nginx/nginx-ingress exporter
  • JVM/MicroProfile/Go client library instrumentation
  • Cloud provider metrics (CloudWatch/Azure Monitor/Stackdriver) through Grafana data sources
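
For the blackbox_exporter item above, a typical probe job sketch; the exporter address, module name, and target URL are placeholders:

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]                        # module defined in blackbox.yml
    static_configs:
      - targets: ['https://api.example.com/healthz']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target            # pass the URL as the probe target
      - source_labels: [__param_target]
        target_label: instance                  # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115     # scrape the exporter itself
```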

Security, governance, and “dashboards as code”

  • Authentication: SSO (OIDC, SAML), organization/team-based permissions
  • Folder and dashboard permissions: least privilege
  • Secrets: use Grafana’s secure data source provisioning
  • Provisioning: treat dashboards as code (JSON/Jsonnet/Terraform), review via PRs (see the provisioning sketch after this list)
  • Versioning: tie dashboards to application versions; roll back cleanly
  • Audit: who changed what, and when
  • Backups: keep a restore plan for data sources and dashboards
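
For the provisioning item above, a file-provider sketch in Grafana's provisioning format; the file path, folder name, and dashboards directory are assumptions:

```yaml
# /etc/grafana/provisioning/dashboards/services.yaml
apiVersion: 1
providers:
  - name: service-dashboards
    folder: Services
    type: file
    disableDeletion: true
    options:
      path: /var/lib/grafana/dashboards/services
```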

Common pitfalls to avoid

  • High-cardinality labels (user_id, random IDs) exploding storage and query time
  • Summaries for latency across many instances (can’t aggregate)
  • Inconsistent buckets across services (quantiles become meaningless)
  • Unbounded time ranges and heavy queries (slow dashboards)
  • Alert noise from tight thresholds and single-window checks
  • Dashboards without owners or runbooks
  • Mixing environments (prod/stage/dev) on the same panels without clear filters

Quick checklist

  • Clear SLIs/SLOs defined and visible
  • Consistent metric names, units, buckets, and labels
  • Recording rules for heavy queries
  • Dashboards with variables, annotations, and value mappings
  • Multi-window SLO burn alerts
  • HA Prometheus or remote write to long-term storage
  • Logs and traces linked from key panels
  • Dashboards as code with reviews and versioning

FAQ: Grafana + Prometheus, Answered

1) What’s the difference between Prometheus and Grafana?

  • Prometheus collects and stores time-series metrics; it’s the source of truth and query engine (PromQL).
  • Grafana connects to Prometheus (and other sources) to visualize metrics, build alerts, and share insights. You often use both.

2) How do I pick the right scrape interval?

  • Start with 15s for critical services, 30–60s for infrastructure. Faster intervals improve granularity but increase load. Match your alert windows (e.g., 5m) to the scrape rate so rate() calculations are smooth.

3) Should I use histogram or summary for latency?

  • Prefer histograms; they aggregate across instances and can compute global p95/p99 via histogram_quantile(). Summaries compute quantiles locally and can’t be combined.

4) How do I avoid high-cardinality disasters?

  • Don’t add labels with unbounded values (user_id, request_id). Keep a label budget. Use relabel_configs to drop problematic labels. Aggregate series with sum by() wherever possible.

5) What’s the best way to alert on SLOs?

  • Use multi-window, multi-burn-rate alerts (e.g., 5m/1h and 1h/6h windows). This catches both sudden spikes and slow-burn issues without paging on noise.

6) Can Grafana handle multiple data sources on one dashboard?

  • Yes. Each panel can use its own data source (Prometheus, Loki, Tempo, SQL, and more), so a single dashboard can correlate metrics, logs, and traces; see the multiple-data-source best practices linked earlier in this guide.

7) How do I keep dashboards fast as the environment grows?

  • Use recording rules for heavy PromQL. Limit time ranges for default views. Avoid overly granular tables by default. Use variables and downsampling. Scale Prometheus with sharding/federation and consider Thanos/Mimir for long-term queries.

8) What’s the right retention strategy?

  • Keep high-resolution metrics locally for 15–30 days. Remote write aggregated series and key SLIs to long-term storage (90–365 days) for trend analysis and audit.

9) How do I correlate metrics with logs and traces?

  • Add trace exemplars to metrics and include trace_id in logs (via OpenTelemetry). In Grafana, link panels to logs/traces using labels so you can pivot from an anomaly to exact requests.

10) Where do business or event streams fit in?

  • Business events often flow through streaming platforms and have their own metrics (lag, throughput, retries). Visualize them alongside service metrics. For an overview of streaming and where it helps, see this guide to Apache Kafka and real-time streaming. Combine with Prometheus exporters for a full system view.

Ready to build dashboards that drive action? Start with a single service: define SLIs, instrument latency with histograms, add three core panels (RPS, errors, p95), wire two smart alerts, and iterate from real incidents. That’s how technical dashboards become operational superpowers.
