Smart, Noise‑Free Monitoring: How to Build Alerts and Notifications with Grafana and Airflow

If your data pipelines are critical to the business, you can’t afford to find out about failures from stakeholders. You need timely, actionable alerts that reach the right people—and you need them without the noise. This guide shows you how to combine Grafana’s unified alerting with Airflow’s orchestration and notification capabilities to build a reliable, low‑noise alerting system for modern data platforms.
You’ll learn:
- What to monitor (and what not to) across data pipelines and infrastructure
- Proven architectures for combining Grafana and Airflow alerts
- Step‑by‑step setup with Prometheus, Slack/Teams, email, and webhooks
- Practical examples for data freshness, data quality, and incident triage
- Best practices to reduce alert fatigue and speed up incident response
For extra context as you go, see:
- Best practices for unifying data sources in Grafana: From many to one—Grafana dashboards with multiple data sources
- How to design reliable data pipelines: Process orchestration with Apache Airflow
- Building solid metrics foundations: Technical dashboards with Grafana and Prometheus
Why Alerts Matter (and Where Teams Go Wrong)
Great monitoring doesn’t just show you charts—it tells you precisely when to act. The problem is alert fatigue. Too many alerts, unprioritized channels, and no clear runbook waste time and degrade trust.
The fix:
- Monitor the essentials (golden signals): latency, errors, saturation, throughput
- Add data‑specific checks: data freshness (SLA), data quality, pipeline success rates
- Route alerts by severity and team
- Include links to dashboards and runbooks so responders can act fast
What to Monitor in Data Platforms
Focus on leading indicators that prevent bigger incidents later.
- Airflow health and performance
  - DAG/task failure rates
  - Task duration p90/p95
  - Queues and executor saturation
  - SLA misses and backlog growth
- Data freshness and delivery
  - Time since last successful load per dataset
  - Rows ingested vs. baseline
  - Upstream dependency lag
- Data quality
  - Failed expectations/tests (Great Expectations, dbt tests)
  - Schema drift detection
  - Null/duplicate spikes
- Infra and dependencies
  - Database latency and errors
  - Object storage 4xx/5xx
  - API rate limits and timeouts
Three Proven Architectures (Pick One to Start)
1) Separation of concerns
- Airflow sends pipeline‑level notifications (task failures, SLAs)
- Grafana covers infra and service SLOs (Prometheus/Loki/Cloud)
- Simple and reliable for most teams
2) Centralized alerting in Grafana
- Everything publishes metrics (Airflow + business KPIs) to Prometheus
- Grafana’s unified alerting manages routing, silences, and escalation
- One place for policies and on‑call rotations
3) Auto‑remediation via Airflow
- Critical Grafana alerts trigger an Airflow DAG via webhook
- Airflow triages, enriches context, restarts jobs, opens tickets
- Ideal when runbooks can be automated
Step‑by‑Step: Set Up Alerts and Notifications
1) Instrument Airflow with Metrics
Option A: StatsD -> Prometheus
- Enable StatsD in airflow.cfg under the [metrics] section:
  - statsd_on = True
  - statsd_host = statsd-exporter
  - statsd_port = 8125
- Use the Prometheus statsd_exporter to convert StatsD metrics into Prometheus format
- Scrape with Prometheus; visualize and alert in Grafana
Option B: Prometheus exporter
- Use a Prometheus exporter for Airflow (community plugin) or a metrics sidecar that exposes DAG/task metrics directly
- Prometheus scrapes exporter; Grafana reads Prometheus
You’ll typically get metrics like:
- airflow_dag_run_duration_seconds
- airflow_dag_run_failures_total
- airflow_task_duration_seconds
- airflow_scheduler_heartbeat
Tip: Keep metric labels consistent: env, service, team, dag_id, dataset.
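If you also want business-level signals (rows ingested, load duration) to travel the same StatsD -> Prometheus path, you can emit them from your tasks with Airflow's built-in Stats client. A minimal sketch, assuming StatsD is enabled as in Option A; the metric names are illustrative, not standard Airflow metrics:

```python
from airflow.decorators import task
from airflow.stats import Stats  # no-op unless [metrics] statsd_on = True


@task
def load_orders():
    # ... real extract/load logic would run here ...
    rows = 42_000  # stand-in value for illustration

    # Counter: exposed by statsd_exporter and scraped by Prometheus
    Stats.incr("orders_rows_ingested", count=rows)

    # Gauge: point-in-time signal you can baseline and alert on
    Stats.gauge("orders_last_load_rows", rows)
```

Keep custom names under one prefix per team or domain so the statsd_exporter mapping stays manageable.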
2) Build Dashboards that Show What Matters
Create panels for:
- DAG failure rates (by dag_id)
- Task duration p95 by task_id
- Data freshness gauges (time since last success)
- Queue saturation and running tasks
If you combine multiple sources (Prometheus, Loki, Cloud logs), follow these design tips: From many to one—Grafana dashboards with multiple data sources.
3) Create Grafana Alert Rules (Prometheus example)
Example: alert on any DAG failures in the past 5 minutes:
- Query (PromQL): sum by(dag_id)(increase(airflow_dag_run_failures_total[5m]))
- Condition: > 0
- For: 5m (prevents flapping)
- Labels: severity=critical, team=data, env=prod
- Annotations: summary, dashboard link, runbook_url
Set No data state = OK (for optional sources) or Alerting (for critical ones).
Use Error state = Alerting only when ingestion failures are severe.
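Before wiring the rule into Grafana, it's worth sanity-checking the expression against Prometheus itself. A minimal sketch using the Prometheus HTTP API (/api/v1/query); the Prometheus URL is an assumption, adjust it to your setup:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address
QUERY = 'sum by(dag_id)(increase(airflow_dag_run_failures_total[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# One result per dag_id; any value > 0 means the alert condition would fire
for series in resp.json()["data"]["result"]:
    dag_id = series["metric"].get("dag_id", "unknown")
    value = float(series["value"][1])
    print(f"{dag_id}: failures={value:.0f} -> {'WOULD FIRE' if value > 0 else 'ok'}")
```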
4) Route Notifications with Contact Points and Policies
In Grafana Alerting:
- Contact points: Slack/Teams, Email, Webhook, PagerDuty, Opsgenie
- Notification policies:
  - severity=critical -> PagerDuty + Slack #oncall
  - severity=warning -> Slack #data-alerts
  - env=staging -> Email only
- Group by labels (e.g., alertname, team) to deduplicate bursts
- Mute timings (maintenance windows) during planned releases
5) Trigger Auto‑Remediation in Airflow (Webhook Flow)
Wire a Grafana contact point (Webhook) to trigger an Airflow DAG run:
- Grafana Webhook target:
  POST https://your-airflow/api/v1/dags/incident_triage/dagRuns
  Headers: Authorization (API token), Content-Type: application/json
  Body example:
  {
    "conf": {
      "source": "grafana",
      "alertname": "Airflow DAG failures",
      "labels": {"dag_id": "orders_etl", "severity": "critical"},
      "annotations": {"runbook_url": "https://internal/wiki/orders-etl"}
    }
  }
- Minimal Airflow DAG to handle triage:
```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False, tags=["incident"])
def incident_triage():
    @task
    def triage():
        # The Grafana webhook payload arrives as the triggered run's conf
        conf = get_current_context()["dag_run"].conf or {}
        labels = conf.get("labels", {})
        alert = conf.get("alertname", "unknown")
        dag_id = labels.get("dag_id", "n/a")
        runbook = conf.get("annotations", {}).get("runbook_url", "#")
        msg = (
            f":rotating_light: Incident triage started\n"
            f"Alert: {alert}\nDAG: {dag_id}\nRunbook: {runbook}"
        )
        # "slack_alerts" is a Slack Webhook connection defined in Airflow
        SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)
        # Example remediation: trigger a backfill or validate dependencies
        # (insert custom logic here)

    triage()


incident_triage()
```
Security tip: Use a dedicated API user/role and restrict the webhook to a single DAG.
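Before pointing Grafana at the endpoint, you can trigger the triage DAG yourself with the same payload. A minimal sketch against the Airflow stable REST API; the host, credentials, and auth method (basic auth here) are assumptions that depend on your deployment:

```python
import requests

AIRFLOW_URL = "https://your-airflow"   # same host as the webhook target above
AUTH = ("svc_grafana", "change-me")    # hypothetical dedicated API user

payload = {
    "conf": {
        "source": "grafana",
        "alertname": "Airflow DAG failures",
        "labels": {"dag_id": "orders_etl", "severity": "critical"},
        "annotations": {"runbook_url": "https://internal/wiki/orders-etl"},
    }
}

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/incident_triage/dagRuns",
    json=payload,
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
```

If the run shows up with the conf attached and the Slack message arrives, the endpoint and DAG are ready for the Grafana contact point.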
6) Add Native Airflow Notifications (Failures, SLAs)
Airflow can notify on failures and SLA misses without Grafana:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


def notify_failure(context):
    # Runs when a task fails; the context dict carries the task instance
    ti = context["task_instance"]
    msg = (
        f":x: Airflow task failed\n"
        f"DAG: {ti.dag_id}\nTask: {ti.task_id}\nWhen: {context['ts']}\nLogs: {ti.log_url}"
    )
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=msg)


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Runs once per batch of SLA misses, at the DAG level
    SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(
        text=f":hourglass_flowing_sand: SLA missed in DAG {dag.dag_id} for tasks: {task_list}"
    )


default_args = {
    "owner": "data",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["data-oncall@example.com"],  # placeholder address
    "email_on_failure": True,
    "sla": timedelta(minutes=30),
}

with DAG(
    "example_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    default_args=default_args,
    sla_miss_callback=sla_miss_alert,
    catchup=False,
) as dag:

    def do_work():
        raise Exception("Example failure")

    t1 = PythonOperator(
        task_id="do_work",
        python_callable=do_work,
        on_failure_callback=notify_failure,
    )
```
7) Alert on Data Freshness and Quality
- Freshness metric: emit the timestamp of the last successful load per dataset (e.g., dataset_last_success_timestamp)
  - PromQL: time() - max(dataset_last_success_timestamp{dataset="orders"}) > 1800
  - Alert if freshness exceeds SLA (e.g., 30 minutes)
- Data quality: export test results (Great Expectations/dbt tests) as counters
  - PromQL: sum(increase(data_quality_test_failures_total[15m])) > 0
Make alerts actionable by including the dataset owner and runbook.
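One straightforward way to produce the freshness metric is to push it from the last step of each pipeline. A minimal sketch, assuming a Prometheus Pushgateway is available at the address below (a StatsD gauge emitted as in step 1 works just as well):

```python
from datetime import datetime, timezone

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway:9091"  # assumed address


def record_dataset_success(dataset: str) -> None:
    """Call at the end of a successful load to refresh the freshness metric."""
    registry = CollectorRegistry()
    gauge = Gauge(
        "dataset_last_success_timestamp",
        "Unix timestamp of the last successful load per dataset",
        ["dataset"],
        registry=registry,
    )
    gauge.labels(dataset=dataset).set(datetime.now(timezone.utc).timestamp())
    push_to_gateway(PUSHGATEWAY, job="dataset_freshness", registry=registry)


# e.g. as the final call of the orders DAG's last task
record_dataset_success("orders")
```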
8) Prevent Noise with Silences and Flapping Controls
- Use For durations (e.g., 5–10 minutes) on volatile signals
- Add inhibition rules (e.g., suppress task failure alerts when the whole scheduler is down)
- Mute timings during releases or backfills
- Use OK notifications sparingly; only where they help the responder close incidents
Real‑World Scenarios
- Pipeline failure triage
- Grafana detects rising Airflow DAG failures and posts to #oncall with links
- Webhook triggers an Airflow triage DAG that checks dependencies and restarts a safe subset
- If the issue persists, Airflow opens a ticket and escalates
- Data freshness breach
- Grafana fires when orders dataset exceeds a 30‑minute SLA
- Slack alert includes owner, dashboard, and runbook
- Airflow validates upstream services and kicks off a targeted backfill
- Data quality regression
- dbt tests fail after a schema change
- Grafana groups “data-quality” alerts per domain
- Team gets one consolidated critical alert with context and diffs
Best Practices to Reduce Alert Fatigue
- Align alerts to business SLOs, not raw errors
- Add clear ownership: team, service, dataset labels
- Include a runbook_url and dashboard links in every critical alert
- Group and deduplicate aggressively
- Separate warning vs critical; route them differently
- Test alerts in staging; use synthetic signals to validate
- Version alert rules (provisioning as code) and review changes via PRs
- Measure the alert pipeline itself (delivery failures, time to ack, time to resolve)
Common Pitfalls (and Fixes)
- Alerting on everything
- Fix: Alert on user impact and SLO breaches; keep raw errors for dashboards
- Missing labels and context
- Fix: Standardize labels (team, env, severity, dataset). Add runbook links
- No differentiation by environment
- Fix: route dev/staging differently; set lower severities or mute timings
- Flapping alerts
- Fix: add For durations, baselines (timeshift), and inhibition rules
- Costly queries in alerts
- Fix: use Prometheus recording rules or pre‑aggregations for heavy panels
Next Steps and Helpful Resources
- Start simple: Airflow native notifications + a handful of Grafana alerts tied to SLOs
- Move to centralized alerting once you standardize labels and routing
- Automate runbooks via Airflow for your highest‑impact alerts
To deepen your implementation:
- Unifying observability data: From many to one—Grafana dashboards with multiple data sources
- Designing resilient pipelines: Process orchestration with Apache Airflow
- Solid metrics foundations: Technical dashboards with Grafana and Prometheus
FAQ: Alerts and Notifications with Grafana and Airflow
1) Should we configure alerts in Grafana or in Airflow?
Use both, but for different purposes:
- Airflow: task‑level notifications (failures, retries, SLA misses)
- Grafana: SLO‑oriented and cross‑system alerts (infra, service health, data freshness, quality)
Centralizing in Grafana simplifies routing, escalation, and silences across teams.
2) How do we avoid alert fatigue?
- Tie alerts to SLOs and business impact
- Add For durations to reduce flapping
- Group by team/service/dataset
- Separate warning vs critical and route accordingly
- Include runbook links so responders can act fast
3) What’s the best way to alert on data freshness?
Emit a metric like dataset_last_success_timestamp and alert when time() - max(timestamp) exceeds your SLA per dataset. Include owner and runbook in the alert.
4) How can Grafana trigger Airflow automatically?
Use a Grafana webhook contact point to call the Airflow REST API:
POST /api/v1/dags/{dag_id}/dagRuns with a conf payload. The Airflow DAG parses conf and executes your runbook (validation, restarts, ticketing).
5) What channels should we use—Slack, email, or PagerDuty?
- Warnings: Slack or Teams
- Critical, customer‑impacting: PagerDuty/Opsgenie + Slack
- Email: lower‑priority or daily summaries
Use Grafana’s notification policies to route by severity, env, and team.
6) How do we set up Airflow to send Slack alerts on task failure?
Define an on_failure_callback that posts to Slack via SlackWebhookHook (or a provider notifier). Combine with email_on_failure for redundancy.
7) Can we alert on data quality tests (dbt or Great Expectations)?
Yes. Export test failures as metrics (counters) and create Grafana alerts on increases over a time window. Alternatively, fail the task and rely on Airflow failure notifications for pipeline‑blocking checks.
8) How do we handle planned maintenance without flooding alerts?
Use Grafana’s mute timings during maintenance windows. For long backfills or schema migrations, add silences with clear expiration and owner labels.
9) Is Prometheus required for Grafana alerting?
No, but it’s the most common pairing. Grafana can alert on multiple sources (e.g., Prometheus, Loki, CloudWatch). Choose the one that best fits your stack and supports efficient alert queries.
10) How should we test alert rules before going live?
- Create synthetic metrics (test time series) or use a staging Prometheus
- Lower thresholds temporarily to force a fire/resolution
- Validate routing, formatting, and runbook links
- Track time to ack/resolve as part of your rollout
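For the synthetic-signal route, a throwaway exporter you can flip on demand is usually enough. A minimal sketch with prometheus_client; the port, metric name, and temporary rule are assumptions for the test only:

```python
import time

from prometheus_client import Gauge, start_http_server

# Expose a test metric on :8000/metrics; add a temporary scrape job for it and a
# temporary Grafana rule such as: max(synthetic_alert_test) > 0
synthetic = Gauge("synthetic_alert_test", "Synthetic signal for validating alert routing")

if __name__ == "__main__":
    start_http_server(8000)
    synthetic.set(1)   # force the rule to fire
    time.sleep(900)    # cover the scrape interval plus the rule's For duration
    synthetic.set(0)   # let it resolve so you can check resolved notifications too
    time.sleep(900)
```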
By combining Grafana’s unified alerting with Airflow’s orchestration and native notifications, you’ll catch issues earlier, reduce noise, and turn runbooks into automated recovery—exactly what a modern data platform needs.