
Reactive pipelines are quickly becoming the backbone of modern automation—especially when you’re coordinating multiple AI agents that need to respond to changes in real time. Whether you’re routing user requests, reacting to data quality issues, or orchestrating complex multi-step workflows, Kafka is one of the most reliable ways to connect agents through an event-driven architecture.
In this guide, we’ll break down how to build reactive pipelines between agents using Kafka, what “reactive” really means in practice, and how to design agent-to-agent flows that scale without turning into a fragile tangle of scripts and retries.
What Is a “Reactive” Pipeline Between Agents?
A reactive pipeline is a workflow where components (in this case, AI agents and services) respond to events as they happen—rather than waiting for scheduled batch jobs or synchronous API calls.
In an agent context, that typically means:
- An agent emits an event like `ticket.created`, `invoice.failed_validation`, or `user.asked_question`.
- Kafka routes that event to any agent or service subscribed to it.
- One or more downstream agents react immediately—enriching data, calling tools, triggering approvals, or starting new workflows.
This approach is especially powerful for multi-agent systems, because it avoids tight coupling. Agents can evolve independently, as long as they maintain the same event contract.
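For a sense of scale, the "emit" half of that loop can be a few lines. A minimal sketch, assuming Python with the confluent-kafka client, a broker at localhost:9092, and an illustrative payload:

```python
# Minimal sketch of an agent emitting an event. Assumes confluent-kafka is installed
# and a broker is reachable; the topic name and payload fields are illustrative.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {
    "event_type": "ticket.created",
    "ticket_id": "TCK-10291",
    "subject": "Billing question",
}

producer.produce("ticket.created", value=json.dumps(event))
producer.flush()  # block until the broker acknowledges delivery
```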
Why Kafka Works So Well for Agent-to-Agent Pipelines
Kafka is a natural fit for real-time pipelines and distributed automation because it provides:
1) Durable, Replayable Event Streams
Agents can fail, restart, and catch up by replaying events from Kafka—critical for reliability and auditability.
2) Loose Coupling Between Producers and Consumers
One agent produces an event, and many consumers can react—without the producer needing to know who they are.
3) Horizontal Scalability
If an agent becomes a bottleneck (e.g., an LLM-based classifier), you can scale consumers in a group.
4) Backpressure Handling
Kafka can buffer bursts. Agents don’t need to be perfectly fast; they need to be consistent and resilient.
5) Clear Boundaries for Governance and Observability
Events provide a structured trail of what happened, when, and why—useful for debugging and compliance.
If your pipelines also include data platforms, you may find it useful to pair Kafka-driven agent workflows with orchestration patterns (e.g., retries, SLAs, and DAG visibility). Related reading: Automating real-time data pipelines with Airflow, Kafka, and Databricks.
Core Architecture: Agents as Event Producers and Consumers
A typical Kafka-based agent pipeline is built from a handful of moving parts.
Key Components
- Agent A (Producer): emits events (facts) to Kafka topics
- Kafka topics: event channels like `requests.incoming`, `docs.extracted`, `risk.flagged`
- Agent B/C/D (Consumers): subscribe to topics and react
- State store (optional): Redis / Postgres / vector DB for memory and context
- Observability: logs, metrics, traces, prompt evaluations
A Concrete Example: “Support Triage” Flow
- Intake agent receives a ticket and emits `support.ticket.created`
- Classification agent consumes it and emits `support.ticket.classified`
- Knowledge agent consumes classification results, searches docs, emits `support.answer.drafted`
- Policy/QA agent checks compliance, emits `support.answer.approved` (or `rejected`)
- Delivery agent posts the approved answer and emits `support.answer.sent`
Each step is independent, and Kafka becomes the “circulatory system” connecting them.
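To make this concrete, here is a sketch of what the classification agent could look like, assuming Python with the confluent-kafka client; `classify_ticket()` is a hypothetical stand-in for your LLM or rules engine:

```python
# Sketch of the classification agent: consumes support.ticket.created and emits
# support.ticket.classified. Assumes confluent-kafka; classify_ticket() is hypothetical.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "classification-agent",   # scale by running more processes with this group.id
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,          # commit only after the result has been produced
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["support.ticket.created"])

def classify_ticket(ticket: dict) -> dict:
    # Placeholder: call your model or rules engine here.
    return {"category": "billing", "priority": "high", "confidence": 0.82}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    ticket = json.loads(msg.value())
    result = classify_ticket(ticket)

    out_event = {
        "schema_version": "1.0",
        "event_type": "support.ticket.classified",
        "trace_id": ticket.get("trace_id"),  # propagate the correlation ID
        "ticket_id": ticket["ticket_id"],
        **result,
    }
    producer.produce(
        "support.ticket.classified",
        key=out_event["ticket_id"],
        value=json.dumps(out_event),
    )
    producer.flush()
    consumer.commit(message=msg)  # mark the input as processed only after emitting
```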
For a broader overview of agent patterns and what makes an “agent” different from a simple LLM call, see: AI agents.
Designing Kafka Topics for Multi-Agent Systems
Kafka topic design is where reactive pipelines either stay clean—or become unmanageable. Here are practical guidelines.
Use Event-Based Naming
Prefer describing what happened rather than who produced it.
Good:
- `invoice.validated`
- `user.profile.updated`
- `fraud.risk.flagged`
Less ideal:
- `agent_classifier_output`
- `agent2_to_agent3`
Decide Between “Command” vs “Event” Topics
- Event topics: "Something happened" (`payment.received`)
- Command topics: "Do something" (`generate.monthly_report`)
In agent pipelines, events are usually safer and more auditable. Commands can be useful, but they can create hidden coupling if overused.
Keep Payloads Focused
Your event should contain:
- a stable identifier (`ticket_id`, `customer_id`, `trace_id`)
- the minimal context needed for downstream work
- a schema version (`v1`, `v2`)
Avoid dumping large raw content when possible. Instead, store large artifacts externally (object storage, document DB) and send references in Kafka.
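For example, an extraction agent might upload the raw document to object storage and publish only a reference. A sketch, where `upload_to_object_store()` and the URI scheme are assumptions:

```python
# Sketch: keep the Kafka payload small by storing the raw artifact externally
# and publishing only a reference. upload_to_object_store() is hypothetical.
import json
import uuid

def upload_to_object_store(raw_bytes: bytes) -> str:
    # In practice: S3, GCS, a document DB, etc. Returns a stable URI.
    return f"s3://agent-artifacts/{uuid.uuid4()}.pdf"

def build_extracted_event(ticket_id: str, raw_document: bytes, trace_id: str) -> str:
    artifact_uri = upload_to_object_store(raw_document)
    event = {
        "schema_version": "1.0",
        "event_type": "doc.extracted",
        "trace_id": trace_id,
        "ticket_id": ticket_id,
        "artifact_uri": artifact_uri,  # reference, not the raw content
    }
    return json.dumps(event)
```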
Message Schemas: The Difference Between a Prototype and Production
Reactive agent pipelines break when event payloads drift. Treat schemas as first-class.
Best Practices
- Use a schema format: Avro / Protobuf / JSON Schema
- Version events explicitly (`schema_version`)
- Maintain backward compatibility whenever possible
- Add metadata for observability: `correlation_id`/`trace_id`, `created_at`, `producer`, `model_version` (if an LLM is involved)
A simple JSON example:
```json
{
"schema_version": "1.0",
"event_type": "support.ticket.classified",
"trace_id": "d4c2a0...",
"ticket_id": "TCK-10291",
"category": "billing",
"priority": "high",
"confidence": 0.82,
"created_at": "2026-01-12T10:18:32Z",
"model_version": "classifier-2026-01"
}
```
This is what makes downstream agents reliable—even when upstream logic evolves.
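One way to enforce the contract at the boundary is to validate incoming payloads before acting on them. A sketch using the jsonschema package; the schema below mirrors the example payload and is illustrative, not a canonical contract:

```python
# Sketch: validate events at the boundary so malformed payloads are rejected
# before an agent acts on them. The schema here is an assumption based on the
# example payload above.
import json
from jsonschema import validate, ValidationError

TICKET_CLASSIFIED_V1 = {
    "type": "object",
    "required": ["schema_version", "event_type", "trace_id", "ticket_id", "category"],
    "properties": {
        "schema_version": {"type": "string"},
        "event_type": {"const": "support.ticket.classified"},
        "trace_id": {"type": "string"},
        "ticket_id": {"type": "string"},
        "category": {"type": "string"},
        "priority": {"enum": ["low", "medium", "high"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}

def parse_event(raw: bytes) -> dict | None:
    event = json.loads(raw)
    try:
        validate(instance=event, schema=TICKET_CLASSIFIED_V1)
    except ValidationError:
        return None  # route to a DLQ instead of processing a poison message
    return event
```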
Orchestration Patterns: Choreography vs Orchestrator
Kafka naturally supports choreography, where agents react to events without a central controller. But sometimes you need tighter coordination.
Pattern A: Choreography (Event-Driven)
Best for:
- high-scale pipelines
- loosely coupled teams
- workflows that can tolerate eventual consistency
Risk:
- harder to reason about end-to-end state
- debugging can become distributed archaeology
Pattern B: Orchestration (Workflow Engine + Kafka)
Best for:
- multi-step approvals
- strict SLAs
- “exactly-once” business actions (e.g., payouts)
A workflow engine can manage state transitions while Kafka carries events. If you’re orchestrating complex agent flows, it’s worth studying multi-agent orchestration approaches like: LangGraph in practice: orchestrating multi-agent systems.
Reliability: Handling Duplicates, Retries, and Ordering
In real production systems, your agent pipeline must assume:
- events may be delivered more than once
- consumers can crash mid-processing
- messages can arrive out of order across partitions
Build Idempotent Consumers
Each agent should be able to process the same event twice without causing double side effects.
Common tactics:
- store a "processed events" table keyed by `event_id`
- use idempotency keys on downstream APIs
- separate “decision” from “action” events
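A minimal sketch of the "processed events" tactic, using SQLite as the marker store for brevity; the table name and `perform_side_effect()` are assumptions:

```python
# Sketch: idempotent consumer using a processed-events table. SQLite is used here
# for brevity; in production this would typically be Postgres, Redis, etc.
import sqlite3

db = sqlite3.connect("processed_events.db")
db.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")

def perform_side_effect(event: dict) -> None:
    ...  # hypothetical downstream action (email, API call, ticket update)

def handle_once(event_id: str, event: dict) -> None:
    # INSERT OR IGNORE affects zero rows if this event_id was already recorded.
    cursor = db.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
    )
    if cursor.rowcount == 0:
        db.commit()
        return                      # duplicate delivery: skip the side effect
    perform_side_effect(event)
    db.commit()                     # record the marker only after the action succeeds
```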
Use Dead Letter Topics (DLQs)
When an agent can’t process a message:
- publish to a DLQ like `support.ticket.classified.dlq`
- include error context and stack traces
- alert and/or route to a recovery agent
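A sketch of the publish-to-DLQ step, assuming confluent-kafka; the header names are illustrative rather than a standard:

```python
# Sketch: publish failed messages to a dead letter topic with error context.
import traceback
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def process_or_dead_letter(msg, handler) -> None:
    try:
        handler(msg.value())
    except Exception as exc:
        producer.produce(
            f"{msg.topic()}.dlq",
            key=msg.key(),
            value=msg.value(),                         # keep the original payload intact
            headers={
                "error": str(exc),
                "stacktrace": traceback.format_exc()[:4000],
                "source_topic": msg.topic(),
            },
        )
        producer.flush()
```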
Decide How to Partition
If ordering matters per entity (like ticket_id), partition by that key. All events for one ticket then land on the same partition, and Kafka preserves order within a partition, so the consumer owning that partition processes them sequentially.
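In client code, that usually just means passing the entity ID as the message key. A sketch, assuming confluent-kafka:

```python
# Sketch: keying by ticket_id so every event for the same ticket hits the same
# partition, preserving per-ticket ordering.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit(topic: str, event: dict) -> None:
    producer.produce(topic, key=event["ticket_id"], value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking
```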
Practical Use Cases for Kafka-Based Agent Pipelines
Here are real-world scenarios where reactive agent pipelines shine:
1) Real-Time Document Processing
- OCR/extraction agent emits `doc.extracted`
- validation agent flags anomalies with `doc.validation.failed`
- enrichment agent links entities and emits `doc.entities.resolved`
2) Data Quality Incident Response
- anomaly detector emits `dq.anomaly.detected`
- investigation agent gathers context and lineage
- remediation agent triggers a rollback, backfill, or alert escalation
3) Product Analytics and User Behavior Automation
- event stream emits `user.action.performed`
- segmentation agent updates cohorts
- lifecycle agent triggers personalized messaging or in-app prompts
4) Security and Compliance Workflows
- policy agent evaluates content or access events
- audit agent stores immutable records
- enforcement agent revokes tokens or flags accounts
Observability: Don’t Fly Blind in Distributed Agent Systems
Reactive pipelines can feel “invisible” unless you invest in observability.
What to Monitor
- consumer lag (are agents falling behind?)
- error rate by topic and agent
- DLQ volume and causes
- throughput spikes and cost drivers (LLM calls)
- end-to-end latency by correlation ID
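If you don't yet have a metrics exporter, consumer lag can be sampled directly from the client. A sketch, assuming confluent-kafka; the topic, partition, and group names are illustrative:

```python
# Sketch: sample consumer lag for one topic/partition. In production you would more
# likely scrape this via a metrics exporter or kafka-consumer-groups tooling.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "classification-agent",
    "enable.auto.commit": False,
})

def lag_for(topic: str, partition: int) -> int:
    tp = TopicPartition(topic, partition)
    committed = consumer.committed([tp])[0]          # group's last committed offset
    low, high = consumer.get_watermark_offsets(tp)   # earliest/latest offsets on the broker
    current = committed.offset if committed.offset >= 0 else low
    return high - current

print(lag_for("support.ticket.created", 0))
```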
Track Prompt and Model Changes
If agents use LLMs, prompt drift is a frequent cause of “it worked yesterday” failures. Consider tracing and evaluating agent behavior across releases with tools and practices like those in: LangSmith simplified: tracing and evaluating prompts.
Security and Governance for Kafka-Connected Agents
Agent pipelines often touch sensitive data (tickets, invoices, internal docs). Kafka doesn’t automatically solve governance—you design it.
Key Practices
- Topic-level ACLs (who can read/write which topics)
- Encryption in transit (TLS) and at rest
- PII minimization: store sensitive blobs outside Kafka, pass references
- Schema validation at the boundary to prevent “poison messages”
- Audit logs: who produced what event and when
A good rule: treat Kafka topics like APIs—because that’s what they are in an event-driven architecture.
Step-by-Step Blueprint: Building Your First Reactive Agent Pipeline with Kafka
Step 1: Define the Workflow as Events
Write the event chain in plain English:
- “When X happens, emit Y”
- “When Y is emitted, agent Z reacts and emits W”
Step 2: Create Topics Around Business Events
Start with 3–6 topics max. Keep it simple.
Step 3: Implement Producers with Strong Metadata
Always include:
- `event_id`
- `trace_id`
- `created_at`
- `schema_version`
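A small envelope helper makes this hard to forget. A sketch; the field names follow the conventions used earlier in this guide:

```python
# Sketch: a tiny envelope builder so every produced event carries the same metadata.
import json
import uuid
from datetime import datetime, timezone

def build_event(event_type: str, payload: dict, trace_id: str | None = None) -> str:
    envelope = {
        "schema_version": "1.0",
        "event_id": str(uuid.uuid4()),
        "trace_id": trace_id or str(uuid.uuid4()),  # reuse the upstream trace_id when available
        "event_type": event_type,
        "created_at": datetime.now(timezone.utc).isoformat(),
        **payload,
    }
    return json.dumps(envelope)
```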
Step 4: Build Consumers to Be Idempotent
Assume duplicates. Store processing markers.
Step 5: Add DLQs and Retry Strategy
Make failure explicit and observable.
Step 6: Add Observability from Day One
Measure lag, latency, and errors before you scale.
Step 7: Scale Agent Workers by Consumer Groups
If one agent is slow (common with tool-using LLMs), scale that consumer group horizontally.
Common Pitfalls (and How to Avoid Them)
Pitfall 1: Treating Kafka as a “Dumping Ground”
Fix: keep events small and purposeful; store big artifacts elsewhere.
Pitfall 2: No Schema Ownership
Fix: assign owners to event contracts and enforce versioning.
Pitfall 3: Over-Orchestration
Fix: start with choreography; introduce orchestration only where business guarantees demand it.
Pitfall 4: Unbounded LLM Costs
Fix: rate-limit agent consumption, cache results, and monitor per-topic throughput.
Pitfall 5: Debugging Without Correlation IDs
Fix: enforce trace_id and propagate it across every event.
FAQ: Reactive Agent Pipelines with Kafka
1) What does “reactive” mean in Kafka-based agent pipelines?
Reactive means agents respond to events as they happen, rather than waiting for a scheduled batch or a synchronous request/response chain. Kafka acts as the event backbone so agents can react independently, in near real time, and recover by replaying events.
2) Do I need Kafka for agent-to-agent communication, or can I just use REST APIs?
You can use REST for simple, synchronous interactions, but Kafka is better when you need decoupling, buffering, replayability, and scale. REST tends to create tight dependencies (agent A must wait for agent B), while Kafka enables asynchronous pipelines and fan-out to multiple downstream agents.
3) How do I prevent agents from processing the same Kafka message twice?
You typically design idempotent consumers. Common approaches include:
- storing processed `event_id`s in a database
- using idempotency keys for downstream side effects (emails, tickets, payments)
- separating “decision” events from “action executed” events
Kafka can reduce duplicates depending on configuration (for example, idempotent producers and transactions), but production designs should still assume duplicates are possible.
4) How many topics should I create for a multi-agent workflow?
Start small: 3–6 topics aligned to business events. Too many topics early on increases operational overhead and makes the pipeline harder to understand. You can split topics later when you identify scaling boundaries, ownership needs, or different retention requirements.
5) Should agents publish “commands” or “events” to Kafka?
In most cases, publish events (facts like invoice.approved). Events are easier to audit and don’t create hidden dependencies. Use commands sparingly when you truly need to request an action and track its fulfillment (e.g., generate.report.requested).
6) What’s the best way to handle failures in agent pipelines?
Use a combination of:
- retries with backoff (for transient failures)
- dead letter topics (DLQs) for messages that repeatedly fail
- alerting based on DLQ volume and error types
- clear error payloads so a “recovery agent” (or on-call engineer) can act quickly
7) How do I ensure message ordering for a specific workflow (like one ticket or one customer)?
Partition by a stable key such as ticket_id or customer_id. Kafka guarantees ordering within a partition, so events for that key will be processed sequentially by consumers in the same group.
8) Can Kafka-based agents work with LLM frameworks and tool-using agents?
Yes—Kafka is often a great fit for tool-using AI agents because it buffers workload spikes and allows you to scale consumers. The key is to:
- control throughput (rate limits)
- keep payloads small (store big context externally)
- trace requests across agents using correlation IDs
9) What should I log and measure to keep reactive pipelines healthy?
At minimum:
- consumer lag per agent
- message processing latency (end-to-end and per stage)
- error rate and retry counts
- DLQ volume and top failure reasons
- throughput per topic (helps forecast cost for LLM-heavy agents)
10) When should I add a workflow orchestrator instead of pure Kafka choreography?
Add orchestration when you need:
- explicit state machines (pending → approved → executed)
- strong SLAs and timed retries
- human approvals in the loop
- guaranteed single execution of business-critical actions
For more flexible pipelines, Kafka choreography is usually simpler and scales better.