Smart, Noise‑Free Monitoring: How to Build Alerts and Notifications with Grafana and Airflow

If your data pipelines are critical to the business, you can’t afford to find out about failures from stakeholders. You need timely, actionable alerts that reach the right people—and you need them without the noise. This guide shows you how to combine Grafana’s unified alerting with Airflow’s orchestration and notification capabilities to build a reliable, low‑noise alerting system for modern data platforms.
You’ll learn:
- What to monitor (and what not to) across data pipelines and infrastructure
- Proven architectures for combining Grafana and Airflow alerts
- Step‑by‑step setup with Prometheus, Slack/Teams, email, and webhooks
- Practical examples for data freshness, data quality, and incident triage
- Best practices to reduce alert fatigue and speed up incident response
For extra context as you go, see:
- Best practices for unifying data sources in Grafana: From many to one—Grafana dashboards with multiple data sources
- How to design reliable data pipelines: Process orchestration with Apache Airflow
- Building solid metrics foundations: Technical dashboards with Grafana and Prometheus
Why Alerts Matter (and Where Teams Go Wrong)
Great monitoring doesn’t just show you charts—it tells you precisely when to act. The problem is alert fatigue. Too many alerts, unprioritized channels, and no clear runbook waste time and degrade trust.
The fix:
- Monitor the essentials (golden signals): latency, errors, saturation, throughput
- Add data‑specific checks: data freshness (SLA), data quality, pipeline success rates
- Route alerts by severity and team
- Include links to dashboards and runbooks so responders can act fast
What to Monitor in Data Platforms
Focus on leading indicators that prevent bigger incidents later.
- Airflow health and performance
  - DAG/task failure rates
  - Task duration p90/p95
  - Queues and executor saturation
  - SLA misses and backlog growth
- Data freshness and delivery
  - Time since last successful load per dataset
  - Rows ingested vs. baseline
  - Upstream dependency lag
- Data quality
  - Failed expectations/tests (Great Expectations, dbt tests)
  - Schema drift detection
  - Null/duplicate spikes
- Infra and dependencies
  - Database latency and errors
  - Object storage 4xx/5xx
  - API rate limits and timeouts
Three Proven Architectures (Pick One to Start)
1) Separation of concerns
- Airflow sends pipeline‑level notifications (task failures, SLAs)
- Grafana covers infra and service SLOs (Prometheus/Loki/Cloud)
- Simple and reliable for most teams
2) Centralized alerting in Grafana
- Everything publishes metrics (Airflow + business KPIs) to Prometheus
- Grafana’s unified alerting manages routing, silences, and escalation
- One place for policies and on‑call rotations
3) Auto‑remediation via Airflow
- Critical Grafana alerts trigger an Airflow DAG via webhook
- Airflow triages, enriches context, restarts jobs, opens tickets
- Ideal when runbooks can be automated
Step‑by‑Step: Set Up Alerts and Notifications
1) Instrument Airflow with Metrics
Option A: StatsD -> Prometheus
- Enable StatsD in airflow.cfg under the [metrics] section:
  - statsd_on = True
  - statsd_host = statsd-exporter
  - statsd_port = 8125
- Use the Prometheus statsd_exporter to convert StatsD metrics into Prometheus format
- Scrape with Prometheus; visualize and alert in Grafana
Option B: Prometheus exporter
- Use a Prometheus exporter for Airflow (community plugin) or a metrics sidecar that exposes DAG/task metrics directly
- Prometheus scrapes exporter; Grafana reads Prometheus
You’ll typically get metrics like:
- airflow_dag_run_duration_seconds
- airflow_dag_run_failures_total
- airflow_task_duration_seconds
- airflow_scheduler_heartbeat
Tip: Keep metric labels consistent: env, service, team, dag_id, dataset.
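If you also want business-level signals (rows ingested, load duration) to travel the same StatsD -> Prometheus path, you can emit them from your tasks with Airflow's built-in Stats client. A minimal sketch, assuming StatsD is enabled as in Option A; the metric names are illustrative, not standard Airflow metrics:

```python
from airflow.decorators import task
from airflow.stats import Stats  # no-op unless [metrics] statsd_on = True


@task
def load_orders():
    # ... real extract/load logic would run here ...
    rows = 42_000  # stand-in value for illustration

    # Counter: exposed by statsd_exporter and scraped by Prometheus
    Stats.incr("orders_rows_ingested", count=rows)

    # Gauge: point-in-time signal you can baseline and alert on
    Stats.gauge("orders_last_load_rows", rows)
```

Keep custom names under one prefix per team or domain so the statsd_exporter mapping stays manageable.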
2) Build Dashboards that Show What Matters
Create panels for:
- DAG failure rates (by dag_id)
- Task duration p95 by task_id
- Data freshness gauges (time since last success)
- Queue saturation and running tasks
If you combine multiple sources (Prometheus, Loki, Cloud logs), follow these design tips: From many to one—Grafana dashboards with multiple data sources.
3) Create Grafana Alert Rules (Prometheus example)
Example: alert on any DAG failures in the past 5 minutes:
- Query (PromQL): sum by(dag_id)(increase(airflow_dag_run_failures_total[5m]))
- Condition: > 0
- For: 5m (prevents flapping)
- Labels: severity=critical, team=data, env=prod
- Annotations: summary, dashboard link, runbook_url
Set No data state = OK (for optional sources) or Alerting (for critical ones).
Use Error state = Alerting only when ingestion failures are severe.
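Before wiring the rule into Grafana, it's worth sanity-checking the expression against Prometheus itself. A minimal sketch using the Prometheus HTTP API (/api/v1/query); the Prometheus URL is an assumption, adjust it to your setup:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address
QUERY = 'sum by(dag_id)(increase(airflow_dag_run_failures_total[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# One result per dag_id; any value > 0 means the alert condition would fire
for series in resp.json()["data"]["result"]:
    dag_id = series["metric"].get("dag_id", "unknown")
    value = float(series["value"][1])
    print(f"{dag_id}: failures={value:.0f} -> {'WOULD FIRE' if value > 0 else 'ok'}")
```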
4) Route Notifications with Contact Points and Policies
In Grafana Alerting:
- Contact points: Slack/Teams, Email, Webhook, PagerDuty, Opsgenie
- Notification policies:
  - severity=critical -> PagerDuty + Slack #oncall
  - severity=warning -> Slack #data-alerts
  - env=staging -> Email only
- Group by labels (e.g., alertname, team) to deduplicate bursts
- Mute timings (maintenance windows) during planned releases
5) Trigger Auto‑Remediation in Airflow (Webhook Flow)
Wire a Grafana contact point (Webhook) to trigger an Airflow DAG run:
- Grafana Webhook target:
  POST https://your-airflow/api/v1/dags/incident_triage/dagRuns
  Headers: Authorization (API token), Content-Type: application/json
  Body example:
  {
    "conf": {
      "source": "grafana",
      "alertname": "Airflow DAG failures",
      "labels": {"dag_id": "orders_etl", "severity": "critical"},
      "annotations": {"runbook_url": "https://internal/wiki/orders-etl"}
    }
  }
- Minimal Airflow DAG to handle triage:
```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False, tags=["incident"])
def incident_triage():
    @task
    def triage():
        # The Grafana webhook payload arrives as the triggered run's conf
        conf = get_current_context()["dag_run"].conf or {}
        labels = conf.get("labels", {})
        alert = conf.get("alertname", "unknown")
        dag_id = labels.get("dag_id", "n/a")
        runbook = conf.get("annotations", {}).get("runbook_url", "#")
        msg = (
            f":rotating_light: Incident triage started\n"
            f"Alert: {alert}\nDAG: {dag_id}\nRunbook: {runbook}"
        )
        # "slack_alerts" is a Slack Webhook connection defined in Airflow
        SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)
        # Example remediation: trigger a backfill or validate dependencies
        # (insert custom logic here)

    triage()


incident_triage()
```
Security tip: Use a dedicated API user/role and restrict the webhook to a single DAG.
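Before pointing Grafana at the endpoint, you can trigger the triage DAG yourself with the same payload. A minimal sketch against the Airflow stable REST API; the host, credentials, and auth method (basic auth here) are assumptions that depend on your deployment:

```python
import requests

AIRFLOW_URL = "https://your-airflow"   # same host as the webhook target above
AUTH = ("svc_grafana", "change-me")    # hypothetical dedicated API user

payload = {
    "conf": {
        "source": "grafana",
        "alertname": "Airflow DAG failures",
        "labels": {"dag_id": "orders_etl", "severity": "critical"},
        "annotations": {"runbook_url": "https://internal/wiki/orders-etl"},
    }
}

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/incident_triage/dagRuns",
    json=payload,
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
```

If the run shows up with the conf attached and the Slack message arrives, the endpoint and DAG are ready for the Grafana contact point.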
6) Add Native Airflow Notifications (Failures, SLAs)
Airflow can notify on failures and SLA misses without Grafana:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


def notify_failure(context):
    # Runs when a task fails; the context dict carries the task instance
    ti = context["task_instance"]
    msg = (
        f":x: Airflow task failed\n"
        f"DAG: {ti.dag_id}\nTask: {ti.task_id}\nWhen: {context['ts']}\nLogs: {ti.log_url}"
    )
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Runs once per batch of SLA misses, at the DAG level
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(
        text=f":hourglass_flowing_sand: SLA missed in DAG {dag.dag_id} for tasks: {task_list}"
    )


default_args = {
    "owner": "data",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["data-oncall@example.com"],  # placeholder address
    "email_on_failure": True,
    "sla": timedelta(minutes=30),
}

with DAG(
    "example_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    default_args=default_args,
    sla_miss_callback=sla_miss_alert,
    catchup=False,
) as dag:

    def do_work():
        raise Exception("Example failure")

    t1 = PythonOperator(
        task_id="do_work",
        python_callable=do_work,
        on_failure_callback=notify_failure,
    )
```
7) Alert on Data Freshness and Quality
- Freshness metric: emit the timestamp of the last successful load per dataset (e.g., dataset_last_success_timestamp)
  - PromQL: time() - max(dataset_last_success_timestamp{dataset="orders"}) > 1800
  - Alert if freshness exceeds SLA (e.g., 30 minutes)
- Data quality: export test results (Great Expectations/dbt tests) as counters
  - PromQL: sum(increase(data_quality_test_failures_total[15m])) > 0
Make alerts actionable by including the dataset owner and runbook.
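One straightforward way to produce the freshness metric is to push it from the last step of each pipeline. A minimal sketch, assuming a Prometheus Pushgateway is available at the address below (a StatsD gauge emitted as in step 1 works just as well):

```python
from datetime import datetime, timezone

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway:9091"  # assumed address


def record_dataset_success(dataset: str) -> None:
    """Call at the end of a successful load to refresh the freshness metric."""
    registry = CollectorRegistry()
    gauge = Gauge(
        "dataset_last_success_timestamp",
        "Unix timestamp of the last successful load per dataset",
        ["dataset"],
        registry=registry,
    )
    gauge.labels(dataset=dataset).set(datetime.now(timezone.utc).timestamp())
    push_to_gateway(PUSHGATEWAY, job="dataset_freshness", registry=registry)


# e.g. as the final call of the orders DAG's last task
record_dataset_success("orders")
```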
8) Prevent Noise with Silences and Flapping Controls
- Use For durations (e.g., 5–10 minutes) on volatile signals
- Add inhibition rules (e.g., suppress task failure alerts when the whole scheduler is down)
- Mute timings during releases or backfills
- Use OK notifications sparingly; only where they help the responder close incidents
Real‑World Scenarios
- Pipeline failure triage
- Grafana detects rising Airflow DAG failures and posts to #oncall with links
- Webhook triggers an Airflow triage DAG that checks dependencies and restarts a safe subset
- If the issue persists, Airflow opens a ticket and escalates
- Data freshness breach
- Grafana fires when orders dataset exceeds a 30‑minute SLA
- Slack alert includes owner, dashboard, and runbook
- Airflow validates upstream services and kicks off a targeted backfill
- Data quality regression
- dbt tests fail after a schema change
- Grafana groups “data-quality” alerts per domain
- Team gets one consolidated critical alert with context and diffs
Best Practices to Reduce Alert Fatigue
- Align alerts to business SLOs, not raw errors
- Add clear ownership: team, service, dataset labels
- Include a runbook_url and dashboard links in every critical alert
- Group and deduplicate aggressively
- Separate warning vs critical; route them differently
- Test alerts in staging; use synthetic signals to validate
- Version alert rules (provisioning as code) and review changes via PRs
- Measure the alert pipeline itself (delivery failures, time to ack, time to resolve)
Common Pitfalls (and Fixes)
- Alerting on everything
- Fix: Alert on user impact and SLO breaches; keep raw errors for dashboards
- Missing labels and context
- Fix: Standardize labels (team, env, severity, dataset). Add runbook links
- No differentiation by environment
- Fix: route dev/staging differently; set lower severities or mute timings
- Flapping alerts
- Fix: add For durations, baselines (timeshift), and inhibition rules
- Costly queries in alerts
- Fix: use Prometheus recording rules or pre‑aggregations for heavy panels
Next Steps and Helpful Resources
- Start simple: Airflow native notifications + a handful of Grafana alerts tied to SLOs
- Move to centralized alerting once you standardize labels and routing
- Automate runbooks via Airflow for your highest‑impact alerts
To deepen your implementation:
- Unifying observability data: From many to one—Grafana dashboards with multiple data sources
- Designing resilient pipelines: Process orchestration with Apache Airflow
- Solid metrics foundations: Technical dashboards with Grafana and Prometheus
FAQ: Alerts and Notifications with Grafana and Airflow
1) Should we configure alerts in Grafana or in Airflow?
Use both, but for different purposes:
- Airflow: task‑level notifications (failures, retries, SLA misses)
- Grafana: SLO‑oriented and cross‑system alerts (infra, service health, data freshness, quality)
Centralizing in Grafana simplifies routing, escalation, and silences across teams.
2) How do we avoid alert fatigue?
- Tie alerts to SLOs and business impact
- Add For durations to reduce flapping
- Group by team/service/dataset
- Separate warning vs critical and route accordingly
- Include runbook links so responders can act fast
3) What’s the best way to alert on data freshness?
Emit a metric like dataset_last_success_timestamp and alert when time() - max(timestamp) exceeds your SLA per dataset. Include owner and runbook in the alert.
4) How can Grafana trigger Airflow automatically?
Use a Grafana webhook contact point to call the Airflow REST API:
POST /api/v1/dags/{dag_id}/dagRuns with a conf payload. The Airflow DAG parses conf and executes your runbook (validation, restarts, ticketing).
5) What channels should we use—Slack, email, or PagerDuty?
- Warnings: Slack or Teams
- Critical, customer‑impacting: PagerDuty/Opsgenie + Slack
- Email: lower‑priority or daily summaries
Use Grafana’s notification policies to route by severity, env, and team.
6) How do we set up Airflow to send Slack alerts on task failure?
Define an on_failure_callback that posts to Slack via SlackWebhookHook (or a provider notifier). Combine with email_on_failure for redundancy.
7) Can we alert on data quality tests (dbt or Great Expectations)?
Yes. Export test failures as metrics (counters) and create Grafana alerts on increases over a time window. Alternatively, fail the task and rely on Airflow failure notifications for pipeline‑blocking checks.
8) How do we handle planned maintenance without flooding alerts?
Use Grafana’s mute timings during maintenance windows. For long backfills or schema migrations, add silences with clear expiration and owner labels.
9) Is Prometheus required for Grafana alerting?
No, but it’s the most common pairing. Grafana can alert on multiple sources (e.g., Prometheus, Loki, CloudWatch). Choose the one that best fits your stack and supports efficient alert queries.
10) How should we test alert rules before going live?
- Create synthetic metrics (test time series) or use a staging Prometheus
- Lower thresholds temporarily to force a fire/resolution
- Validate routing, formatting, and runbook links
- Track time to ack/resolve as part of your rollout
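For the synthetic-signal route, a throwaway exporter you can flip on demand is usually enough. A minimal sketch with prometheus_client; the port, metric name, and temporary rule are assumptions for the test only:

```python
import time

from prometheus_client import Gauge, start_http_server

# Expose a test metric on :8000/metrics; add a temporary scrape job for it and a
# temporary Grafana rule such as: max(synthetic_alert_test) > 0
synthetic = Gauge("synthetic_alert_test", "Synthetic signal for validating alert routing")

if __name__ == "__main__":
    start_http_server(8000)
    synthetic.set(1)   # force the rule to fire
    time.sleep(900)    # cover the scrape interval plus the rule's For duration
    synthetic.set(0)   # let it resolve so you can check resolved notifications too
    time.sleep(900)
```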
By combining Grafana’s unified alerting with Airflow’s orchestration and native notifications, you’ll catch issues earlier, reduce noise, and turn runbooks into automated recovery—exactly what a modern data platform needs.