Smart, Noise‑Free Monitoring: How to Build Alerts and Notifications with Grafana and Airflow

December 02, 2025 at 03:59 PM | Est. read time: 13 min

By Valentina Vianna

Community manager and producer of specialized marketing content

If your data pipelines are critical to the business, you can’t afford to find out about failures from stakeholders. You need timely, actionable alerts that reach the right people—and you need them without the noise. This guide shows you how to combine Grafana’s unified alerting with Airflow’s orchestration and notification capabilities to build a reliable, low‑noise alerting system for modern data platforms.

You’ll learn:

  • What to monitor (and what not to) across data pipelines and infrastructure
  • Proven architectures for combining Grafana and Airflow alerts
  • Step‑by‑step setup with Prometheus, Slack/Teams, email, and webhooks
  • Practical examples for data freshness, data quality, and incident triage
  • Best practices to reduce alert fatigue and speed up incident response


Why Alerts Matter (and Where Teams Go Wrong)

Great monitoring doesn’t just show you charts—it tells you precisely when to act. The problem is alert fatigue. Too many alerts, unprioritized channels, and no clear runbook waste time and degrade trust.

The fix:

  • Monitor the essentials (golden signals): latency, errors, saturation, throughput
  • Add data‑specific checks: data freshness (SLA), data quality, pipeline success rates
  • Route alerts by severity and team
  • Include links to dashboards and runbooks so responders can act fast

What to Monitor in Data Platforms

Focus on leading indicators that prevent bigger incidents later.

  • Airflow health and performance
    • DAG/task failure rates
    • Task duration p90/p95
    • Queues and executor saturation
    • SLA misses and backlog growth
  • Data freshness and delivery
    • Time since last successful load per dataset
    • Rows ingested vs. baseline
    • Upstream dependency lag
  • Data quality
    • Failed expectations/tests (Great Expectations, dbt tests)
    • Schema drift detection
    • Null/duplicate spikes
  • Infra and dependencies
    • Database latency and errors
    • Object storage 4xx/5xx
    • API rate limits and timeouts

Three Proven Architectures (Pick One to Start)

1) Separation of concerns

  • Airflow sends pipeline‑level notifications (task failures, SLAs)
  • Grafana covers infra and service SLOs (Prometheus/Loki/Cloud)
  • Simple and reliable for most teams

2) Centralized alerting in Grafana

  • Everything publishes metrics (Airflow + business KPIs) to Prometheus
  • Grafana’s unified alerting manages routing, silences, and escalation
  • One place for policies and on‑call rotations

3) Auto‑remediation via Airflow

  • Critical Grafana alerts trigger an Airflow DAG via webhook
  • Airflow triages, enriches context, restarts jobs, opens tickets
  • Ideal when runbooks can be automated

Step‑by‑Step: Set Up Alerts and Notifications

1) Instrument Airflow with Metrics

Option A: StatsD -> Prometheus

  • Enable StatsD in airflow.cfg under the [metrics] section:
    • statsd_on = True
    • statsd_host = statsd-exporter
    • statsd_port = 8125
  • Use the Prometheus statsd_exporter to convert StatsD metrics into Prometheus format
  • Scrape the exporter with Prometheus; visualize and alert in Grafana

Option B: Prometheus exporter

  • Use a Prometheus exporter for Airflow (community plugin) or a metrics sidecar that exposes DAG/task metrics directly
  • Prometheus scrapes exporter; Grafana reads Prometheus

You’ll typically get metrics like:

  • airflow_dag_run_duration_seconds
  • airflow_dag_run_failures_total
  • airflow_task_duration_seconds
  • airflow_scheduler_heartbeat

Tip: Keep metric labels consistent: env, service, team, dag_id, dataset.
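
Beyond Airflow’s built‑in metrics, tasks can emit custom business metrics through the same StatsD path. The sketch below is a minimal example that assumes the statsd Python package is installed and that the statsd-exporter host/port match the config above; the prefix and metric names are illustrative, not a fixed convention.

```python
# Minimal sketch: emit custom business metrics from a task through the same
# statsd-exporter configured above. Assumes the `statsd` package is installed;
# the prefix and metric names are illustrative.
import statsd

from airflow.decorators import task


@task
def load_orders():
    client = statsd.StatsClient(host="statsd-exporter", port=8125, prefix="data_platform")

    rows_loaded = 42_000  # replace with the real row count from your load step
    client.gauge("orders.rows_ingested", rows_loaded)  # exported as data_platform.orders.rows_ingested
    client.incr("orders.load_success")                 # a counter you can rate() over in PromQL
```

The statsd_exporter mapping rules then decide how these dotted names become Prometheus metric names and labels, so keep them aligned with the label conventions above.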

2) Build Dashboards that Show What Matters

Create panels for:

  • DAG failure rates (by dag_id)
  • Task duration p95 by task_id
  • Data freshness gauges (time since last success)
  • Queue saturation and running tasks

If you combine multiple sources (Prometheus, Loki, Cloud logs), follow these design tips: From many to one—Grafana dashboards with multiple data sources.

3) Create Grafana Alert Rules (Prometheus example)

Example: alert on any DAG failures in the past 5 minutes:

  • Query (PromQL): sum by(dag_id)(increase(airflow_dag_run_failures_total[5m]))
  • Condition: > 0
  • For: 5m (prevents flapping)
  • Labels: severity=critical, team=data, env=prod
  • Annotations: summary, dashboard link, runbook_url

Set the rule’s No data state to OK (for optional sources) or Alerting (for critical ones).

Set the Error state to Alerting only when a failing or timed‑out alert query is itself severe enough to page on.

4) Route Notifications with Contact Points and Policies

In Grafana Alerting:

  • Contact points: Slack/Teams, Email, Webhook, PagerDuty, Opsgenie
  • Notification policies:
    • severity=critical -> PagerDuty + Slack #oncall
    • severity=warning -> Slack #data‑alerts
    • env=staging -> Email only
  • Group by labels (e.g., alertname, team) to deduplicate bursts
  • Mute timings (maintenance windows) during planned releases

5) Trigger Auto‑Remediation in Airflow (Webhook Flow)

Wire a Grafana contact point (Webhook) to trigger an Airflow DAG run:

  • Grafana Webhook target:

POST https://your-airflow/api/v1/dags/incident_triage/dagRuns

Headers: Authorization (API token), Content‑Type: application/json

Body example:

{
  "conf": {
    "source": "grafana",
    "alertname": "Airflow DAG failures",
    "labels": {"dag_id": "orders_etl", "severity": "critical"},
    "annotations": {"runbook_url": "https://internal/wiki/orders-etl"}
  }
}

  • Minimal Airflow DAG to handle triage:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False, tags=["incident"])
def incident_triage():
    @task
    def triage(dag_run=None):
        # Airflow injects the DagRun, which carries the "conf" payload sent by Grafana
        conf = (dag_run.conf or {}) if dag_run else {}
        labels = conf.get("labels", {})
        alert = conf.get("alertname", "unknown")
        dag_id = labels.get("dag_id", "n/a")
        runbook = conf.get("annotations", {}).get("runbook_url", "#")

        msg = (
            f":rotating_light: Incident triage started\n"
            f"Alert: {alert}\nDAG: {dag_id}\nRunbook: {runbook}"
        )
        SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)

        # Example remediation: trigger a backfill or validate dependencies
        # (insert custom logic here)

    triage()


incident_triage()
```
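
With the triage DAG in place, it helps to exercise the endpoint by hand before pointing the Grafana contact point at it. Below is a minimal sketch using the requests library; the base URL and token are placeholders, and bearer‑token auth is an assumption (many Airflow deployments use basic auth instead).

```python
# Minimal sketch: call the same Airflow REST API endpoint the Grafana webhook
# will use. The base URL and token are placeholders; bearer-token auth is an
# assumption (adjust to whatever auth backend your deployment uses).
import requests

AIRFLOW_URL = "https://your-airflow"         # placeholder
API_TOKEN = "replace-with-a-real-token"      # placeholder

payload = {
    "conf": {
        "source": "grafana",
        "alertname": "Airflow DAG failures",
        "labels": {"dag_id": "orders_etl", "severity": "critical"},
        "annotations": {"runbook_url": "https://internal/wiki/orders-etl"},
    }
}

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/incident_triage/dagRuns",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # the created DAG run, including its run_id and state
```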

Security tip: Use a dedicated API user/role and restrict the webhook to a single DAG.

6) Add Native Airflow Notifications (Failures, SLAs)

Airflow can notify on failures and SLA misses without Grafana:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


def notify_failure(context):
    ti = context["task_instance"]
    msg = (
        f":x: Airflow task failed\n"
        f"DAG: {ti.dag_id}\nTask: {ti.task_id}\nWhen: {context['ts']}\nLogs: {ti.log_url}"
    )
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(
        text=f":hourglass_flowing_sand: SLA missed in DAG {dag.dag_id} for tasks: {task_list}"
    )


default_args = {
    "owner": "data",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["[email protected]"],
    "email_on_failure": True,
    "sla": timedelta(minutes=30),
}

with DAG(
    "example_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    default_args=default_args,
    sla_miss_callback=sla_miss_alert,
    catchup=False,
) as dag:

    def do_work():
        raise Exception("Example failure")

    t1 = PythonOperator(
        task_id="do_work",
        python_callable=do_work,
        on_failure_callback=notify_failure,
    )
```

7) Alert on Data Freshness and Quality

  • Freshness metric: emit the timestamp of the last successful load per dataset (e.g., dataset_last_success_timestamp)
    • PromQL: time() - max(dataset_last_success_timestamp{dataset="orders"}) > 1800
    • Alert if freshness exceeds the SLA (e.g., 30 minutes)
  • Data quality: export test results (Great Expectations/dbt tests) as counters
    • PromQL: sum(increase(data_quality_test_failures_total[15m])) > 0

Make alerts actionable by including the dataset owner and runbook.
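
One hedged way to produce the freshness metric is to push it to a Prometheus Pushgateway at the end of each successful load. The sketch below uses prometheus_client; the Pushgateway address, job name, and label set are assumptions you should adapt to your stack.

```python
# Minimal sketch: publish a per-dataset freshness timestamp to a Prometheus
# Pushgateway after a successful load. The Pushgateway address, job name, and
# labels are assumptions; the PromQL rule above compares time() to this gauge.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def record_dataset_success(dataset: str, env: str = "prod") -> None:
    registry = CollectorRegistry()
    gauge = Gauge(
        "dataset_last_success_timestamp",
        "Unix timestamp of the last successful load per dataset",
        labelnames=["dataset", "env"],
        registry=registry,
    )
    gauge.labels(dataset=dataset, env=env).set(time.time())
    push_to_gateway("pushgateway:9091", job="airflow_freshness", registry=registry)


# Call at the end of the load task, e.g. record_dataset_success("orders")
```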

8) Prevent Noise with Silences and Flapping Controls

  • Use For durations (e.g., 5–10 minutes) on volatile signals
  • Add inhibition rules (e.g., suppress task failure alerts when the whole scheduler is down)
  • Mute timings during releases or backfills
  • Use OK notifications sparingly; only where they help the responder close incidents

Real‑World Scenarios

  • Pipeline failure triage
    • Grafana detects rising Airflow DAG failures and posts to #oncall with links
    • A webhook triggers an Airflow triage DAG that checks dependencies and restarts a safe subset
    • If the issue persists, Airflow opens a ticket and escalates
  • Data freshness breach
    • Grafana fires when the orders dataset exceeds a 30‑minute SLA
    • The Slack alert includes the owner, dashboard, and runbook
    • Airflow validates upstream services and kicks off a targeted backfill
  • Data quality regression
    • dbt tests fail after a schema change
    • Grafana groups “data-quality” alerts per domain
    • The team gets one consolidated critical alert with context and diffs

Best Practices to Reduce Alert Fatigue

  • Align alerts to business SLOs, not raw errors
  • Add clear ownership: team, service, dataset labels
  • Include a runbook_url and dashboard links in every critical alert
  • Group and deduplicate aggressively
  • Separate warning vs critical; route them differently
  • Test alerts in staging; use synthetic signals to validate
  • Version alert rules (provisioning as code) and review changes via PRs
  • Measure the alert pipeline itself (delivery failures, time to ack, time to resolve)

Common Pitfalls (and Fixes)

  • Alerting on everything
    • Fix: alert on user impact and SLO breaches; keep raw errors for dashboards
  • Missing labels and context
    • Fix: standardize labels (team, env, severity, dataset) and add runbook links
  • No differentiation by environment
    • Fix: route dev/staging differently; set lower severities or use mute timings
  • Flapping alerts
    • Fix: add For durations, baselines (timeshift), and inhibition rules
  • Costly queries in alerts
    • Fix: use Prometheus recording rules or pre‑aggregations for heavy panels

Next Steps and Helpful Resources

  • Start simple: Airflow native notifications + a handful of Grafana alerts tied to SLOs
  • Move to centralized alerting once you standardize labels and routing
  • Automate runbooks via Airflow for your highest‑impact alerts


FAQ: Alerts and Notifications with Grafana and Airflow

1) Should we configure alerts in Grafana or in Airflow?

Use both, but for different purposes:

  • Airflow: task‑level notifications (failures, retries, SLA misses)
  • Grafana: SLO‑oriented and cross‑system alerts (infra, service health, data freshness, quality)

Centralizing in Grafana simplifies routing, escalation, and silences across teams.

2) How do we avoid alert fatigue?

  • Tie alerts to SLOs and business impact
  • Add For durations to reduce flapping
  • Group by team/service/dataset
  • Separate warning vs critical and route accordingly
  • Include runbook links so responders can act fast

3) What’s the best way to alert on data freshness?

Emit a metric like dataset_last_success_timestamp and alert when time() - max(timestamp) exceeds your SLA per dataset. Include owner and runbook in the alert.

4) How can Grafana trigger Airflow automatically?

Use a Grafana webhook contact point to call the Airflow REST API:

POST /api/v1/dags/{dag_id}/dagRuns with a conf payload. The Airflow DAG parses conf and executes your runbook (validation, restarts, ticketing).

5) What channels should we use—Slack, email, or PagerDuty?

  • Warnings: Slack or Teams
  • Critical, customer‑impacting: PagerDuty/Opsgenie + Slack
  • Email: lower‑priority or daily summaries

Use Grafana’s notification policies to route by severity, env, and team.

6) How do we set up Airflow to send Slack alerts on task failure?

Define an on_failure_callback that posts to Slack via SlackWebhookHook (or a provider notifier). Combine with email_on_failure for redundancy.

7) Can we alert on data quality tests (dbt or Great Expectations)?

Yes. Export test failures as metrics (counters) and create Grafana alerts on increases over a time window. Alternatively, fail the task and rely on Airflow failure notifications for pipeline‑blocking checks.
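
As a minimal sketch of the export step, the snippet below counts failed dbt tests from run_results.json and pushes them as a counter; the file path, Pushgateway address, and label names are assumptions, and a per‑run gauge works just as well if that suits your scraping setup better.

```python
# Minimal sketch: turn failed dbt test results into a Prometheus counter.
# The run_results.json path, Pushgateway address, and labels are assumptions.
import json

from prometheus_client import CollectorRegistry, Counter, push_to_gateway


def export_dbt_test_failures(run_results_path: str = "target/run_results.json") -> int:
    with open(run_results_path) as fh:
        results = json.load(fh).get("results", [])

    failures = sum(1 for r in results if r.get("status") in ("fail", "error"))

    registry = CollectorRegistry()
    counter = Counter(
        "data_quality_test_failures",  # exposed as data_quality_test_failures_total
        "Failed data quality tests in the latest run",
        labelnames=["tool"],
        registry=registry,
    )
    counter.labels(tool="dbt").inc(failures)
    push_to_gateway("pushgateway:9091", job="data_quality", registry=registry)
    return failures
```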

8) How do we handle planned maintenance without flooding alerts?

Use Grafana’s mute timings during maintenance windows. For long backfills or schema migrations, add silences with clear expiration and owner labels.

9) Is Prometheus required for Grafana alerting?

No, but it’s the most common pairing. Grafana can alert on multiple sources (e.g., Prometheus, Loki, CloudWatch). Choose the one that best fits your stack and supports efficient alert queries.

10) How should we test alert rules before going live?

  • Create synthetic metrics (test time series) or use a staging Prometheus
  • Lower thresholds temporarily to force a fire/resolution
  • Validate routing, formatting, and runbook links
  • Track time to ack/resolve as part of your rollout
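
For the synthetic‑signal idea in the first bullet, the sketch below exposes a fake failure metric with prometheus_client’s built‑in HTTP server so a staging Prometheus can scrape it and force your rules to fire; the port and metric name are illustrative.

```python
# Minimal sketch: expose a synthetic metric that a staging Prometheus can
# scrape to force alert rules to fire. Port and metric name are illustrative.
import random
import time

from prometheus_client import Gauge, start_http_server

if __name__ == "__main__":
    start_http_server(9400)  # scrape target: http://localhost:9400/metrics
    failures = Gauge(
        "airflow_dag_run_failures_synthetic",
        "Synthetic failure signal for testing alert rules",
    )
    while True:
        failures.set(random.choice([0, 0, 0, 3]))  # occasional spikes trip the rule
        time.sleep(15)
```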

By combining Grafana’s unified alerting with Airflow’s orchestration and native notifications, you’ll catch issues earlier, reduce noise, and turn runbooks into automated recovery—exactly what a modern data platform needs.
