
Event-driven architecture (EDA) is the backbone of modern real-time software, from fraud detection and IoT telemetry to order processing and personalized user experiences. Instead of forcing systems to constantly poll ("Has anything changed?"), event-driven systems react as soon as an event occurs.
In this guide, you'll learn how to build event-driven architecture with Redpanda (a streaming platform with a Kafka-compatible API), including the core concepts, proven patterns (Pub/Sub, streaming pipelines, Outbox pattern, Sagas), and the production details that matter: topic configs, retention/compaction, producer acknowledgements, consumer offsets, retries, and dead-letter queues (DLQs).
What Is Event-Driven Architecture (EDA)?
Event-driven architecture is a style of system design where services communicate by producing and consuming events: immutable records of something that happened in the business domain.
Examples of events:
- `OrderPlaced`
- `PaymentAuthorized`
- `InventoryReserved`
- `UserSignedUp`
- `ShipmentDelivered`
Instead of tightly coupling services through synchronous APIs, EDA introduces a streaming layer where events flow and services respond asynchronously.
Why teams choose EDA
EDA is often adopted to achieve:
- Loose coupling between services
- Scalable, asynchronous processing
- Real-time data propagation
- Resiliency (temporary outages don’t necessarily stop the entire pipeline)
- Auditability (events can be retained as a historical record)
Where Redpanda Fits (and Why It Matters)
Redpanda is a modern streaming platform designed for high-throughput, low-latency event streaming. Many teams look at Redpanda when they want Kafka-style streaming with a simpler operational footprint.
Key idea: Kafka API compatibility
Redpanda’s Kafka API compatibility makes it easier to:
- Migrate from Kafka
- Use existing Kafka client libraries (Java, Python, Go, Node, etc.)
- Integrate with common streaming ecosystems (connectors, stream processors, schema registries, depending on your stack)
In practice, this often means your producers/consumers can stay the same while you evaluate or migrate infrastructure.
A quick mental model (diagram)
Here’s the “shape” of a typical Redpanda-based EDA system:
```
[Service A] --produces--> [ Topic: orders.events ] --consumed by--> [Service B]
      |                        (partitions)                              |
      v                                                       failed after retries
[Postgres DB] --> [Outbox Relay] --> [ Topic: payments.events ]          |
                                              |                          v
                                              v              [ DLQ: orders.events.dlq ]
                                        [Analytics/ML]
```
Core Concepts You Need Before Designing with Redpanda
Even if you’ve used message queues before, streaming systems introduce a few important concepts.
Events and topics
A topic is a named stream of events, like `orders.events`, `payments.events`, or `user-activity.events`.
Good topic naming tips:
- Use domain-driven naming (`orders.events`, `inventory.updates`)
- Avoid "kitchen sink" topics like `all-events`
- Keep ownership clear (one domain/team per topic when possible)
Partitions and ordering
Topics are typically split into partitions for scalability. Ordering is generally guaranteed within a partition, not across the whole topic.
This matters when events must be processed in order. A common technique is to use a partition key such as:
- `orderId` for order flows
- `userId` for user timelines
- `deviceId` for IoT streams
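For example, a minimal keyed-producer sketch with the confluent-kafka Python client (the broker address and event fields are illustrative):

```python
# Keyed production: every event for one order hashes to the same partition,
# so a consumer sees that order's events in order.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"eventType": "OrderPlaced", "orderId": "o-123"}
producer.produce(
    "orders.events",
    key=event["orderId"].encode(),      # partition key: stable per aggregate
    value=json.dumps(event).encode(),
)
producer.flush()                        # wait for broker acknowledgement
```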
Consumers, consumer groups, and scaling
Consumers scale horizontally using consumer groups:
- Multiple instances can share the work
- Each partition is processed by only one consumer instance in a group (at a time)
Scaling rule of thumb: max parallelism = number of partitions (per consumer group).
Common Event-Driven Architecture Patterns (and How to Apply Them)
1) Pub/Sub for broadcasting domain events
Use pub/sub when multiple downstream services need the same event.
Example:
- Checkout service publishes `OrderPlaced`
- Inventory service consumes it to reserve stock
- Fraud service consumes it to score risk
- Analytics service consumes it to update metrics
Benefits: decoupling and easy extensibility
Watch out for: uncontrolled fan-out and unclear ownership
2) Event streaming for real-time pipelines
Streaming pipelines continuously process data: filtering, aggregating, enriching, and joining streams.
Example:
- Consume clickstream events
- Enrich with user profile data
- Compute rolling session analytics
- Output to a `user-sessions` topic and an analytics store
Benefits: near real-time insights
Watch out for: stateful processing complexity (windows, late events, reprocessing)
3) The Outbox pattern (for reliable event publishing)
A classic challenge: how do you ensure that the database update and the event publish either both happen or neither does?
The Outbox pattern solves this:
- Write business data and an “outbox event” record in the same database transaction
- A separate process reads outbox rows and publishes to Redpanda
- Mark outbox rows as delivered
Benefits: avoids “ghost events” and lost messages
Watch out for: operational overhead and deduplication needs downstream
Outbox table (Postgres example)
Here’s a minimal schema that works well in production:
```sql
CREATE TABLE outbox_events (
id uuid PRIMARY KEY,
aggregate_id text NOT NULL,
event_type text NOT NULL,
payload jsonb NOT NULL,
occurred_at timestamptz NOT NULL DEFAULT now(),
published_at timestamptz
);
CREATE INDEX ON outbox_events (published_at) WHERE published_at IS NULL;
```
Outbox relay loop (implementation sketch)
This relay publishes in batches and marks rows as published only after a successful send. (You can extend this with retries, backoff, and metrics.)
```python
# Pseudocode: outbox relay (db and producer are assumed handles)
import json

# Claim a batch of unpublished rows inside one transaction; SKIP LOCKED lets
# multiple relay instances run without stepping on each other.
rows = db.query("""
    SELECT id, aggregate_id, event_type, payload
    FROM outbox_events
    WHERE published_at IS NULL
    ORDER BY occurred_at
    LIMIT 500
    FOR UPDATE SKIP LOCKED
""")

for r in rows:
    producer.send(
        topic="orders.events",
        key=r["aggregate_id"].encode(),   # key preserves per-aggregate order
        value=json.dumps({
            "eventId": str(r["id"]),
            "eventType": r["event_type"],
            "data": r["payload"],
        }).encode(),
    )

producer.flush()  # block until the broker acknowledges the whole batch

# Mark rows as published only after a successful flush; a crash before this
# point means the batch is re-sent (at-least-once), so consumers must dedupe.
db.execute("""
    UPDATE outbox_events
    SET published_at = now()
    WHERE id = ANY(%s)
""", [[r["id"] for r in rows]])
```
Implementation detail (practical):
- Put the aggregate ID (e.g., `orderId`) as the message key so all events for that aggregate preserve order.
- Include `eventId` (UUID) and `eventType` to simplify dedupe and routing.
- Assume the relay will retry; design consumers as idempotent (see Delivery Semantics).
4) Saga pattern (for distributed workflows)
When a business process spans multiple services (order → payment → inventory → shipping), you can use:
- Choreography: services react to events
- Orchestration: a coordinator service drives steps
Benefits: handles distributed transactions without 2PC
Watch out for: debugging complexity and edge-case handling (timeouts, retries, compensation)
Rule that prevents “mystery failures”: define one clear owner for the saga state (either a coordinator topic/state store, or a single service that can answer “what is the current saga step?”).
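As a rough illustration, here is a minimal orchestration sketch: a single coordinator owns the saga state per `orderId` and emits the outcome event. The `inventory.events` topic and the in-memory state dict are assumptions for the sketch; a production coordinator would persist saga state durably.

```python
# Minimal saga coordinator sketch: consume decision events, own per-order
# state, emit OrderConfirmed / OrderRejected. In-memory state only.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-saga-coordinator",
    "enable.auto.commit": False,
})
consumer.subscribe(["payments.events", "inventory.events"])  # inventory.events is assumed
producer = Producer({"bootstrap.servers": "localhost:9092"})

saga_state = {}  # orderId -> set of decision events seen (durable store in production)
SUCCESSES = {"PaymentAuthorized", "InventoryReserved"}
FAILURES = {"PaymentDeclined", "InventoryOutOfStock"}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order_id = msg.key().decode()
    state = saga_state.setdefault(order_id, set())
    state.add(json.loads(msg.value())["eventType"])

    if state & FAILURES:
        outcome = "OrderRejected"
    elif SUCCESSES <= state:
        outcome = "OrderConfirmed"
    else:
        consumer.commit(message=msg)   # still waiting on the other decision
        continue

    producer.produce("orders.events", key=msg.key(),
                     value=json.dumps({"eventType": outcome}).encode())
    producer.flush()
    consumer.commit(message=msg)
```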
Designing Event Contracts That Don’t Break Downstream Systems
A strong EDA system depends on stable, well-defined event schemas.
Event schema best practices
- Include a unique `eventId`
- Include a `timestamp` and (if relevant) `occurredAt`
- Include a `version` field (schema versioning is critical)
- Favor immutable facts ("PaymentAuthorized") over mutable state ("PaymentStatusUpdated")
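Putting those practices together, a sketch of an event envelope might look like this (field names are illustrative, not a fixed standard):

```python
# Building an event envelope that carries identity, type, version, and time.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventId": str(uuid.uuid4()),                      # unique ID for dedupe
    "eventType": "PaymentAuthorized",                  # immutable fact, past tense
    "version": 1,                                      # bump on schema changes
    "occurredAt": datetime.now(timezone.utc).isoformat(),
    "data": {"orderId": "o-123", "amountCents": 4999, "currency": "USD"},
}
payload = json.dumps(event).encode()                   # producer value, keyed by orderId
```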
Versioning strategy
To evolve safely:
- Additive changes (adding optional fields) are easiest
- Breaking changes require careful coordination or parallel topics/schemas
If your ecosystem uses a schema registry (e.g., Avro/Protobuf/JSON Schema), enforce compatibility rules early.
Delivery Semantics: At-Most-Once, At-Least-Once, Exactly-Once (What You Actually Need)
In real systems:
- At-most-once: fastest but can lose events
- At-least-once: common default; may deliver duplicates
- Exactly-once: hardest; usually requires tight coordination between consumption, processing, and state updates
Most teams choose at-least-once and implement idempotency.
Producer reliability knobs (practical defaults)
Java (Kafka client) producer example
```properties
acks=all
enable.idempotence=true
retries=2147483647
delivery.timeout.ms=120000
linger.ms=5
batch.size=65536
```
- Use `acks=all` for critical domain events.
- Enable idempotence where supported to reduce duplicates on retry.
- Add batching (`linger.ms`, `batch.size`) when throughput matters.
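For reference, a sketch of the same defaults for the confluent-kafka Python client (the config keys map to the Java properties above):

```python
# Producer tuned for at-least-once delivery of critical events.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                   # wait for full in-sync acknowledgement
    "enable.idempotence": True,      # broker dedupes producer retries
    "delivery.timeout.ms": 120000,   # total budget for send + retries
    "linger.ms": 5,                  # small batching delay
    "batch.size": 65536,
})
```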
Consumer offset management (how duplicates happen)
Duplicates typically appear when:
- You process a message,
- then crash before committing the offset,
- then restart and re-read the same message.
At-least-once pattern (recommended default):
- Consume message
- Apply side effects (DB write / API call / state update)
- Commit offset
Python consumer snippet (commit after processing)
```python
from confluent_kafka import Consumer, KafkaException

c = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-service",
    "enable.auto.commit": False,      # we commit manually, after processing
    "auto.offset.reset": "earliest",
})
c.subscribe(["orders.events"])

while True:
    msg = c.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise KafkaException(msg.error())
    handle(msg.key(), msg.value())  # must be idempotent
    c.commit(message=msg)           # commit only after success
```
Practical idempotency strategies
- Use a deduplication table keyed by `eventId`
- Use upserts instead of inserts
- Keep processing deterministic so replays are safe
Dedup table sketch
```sql
CREATE TABLE processed_events (
event_id uuid PRIMARY KEY,
processed_at timestamptz NOT NULL DEFAULT now()
);
```
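A sketch of an idempotent handler that uses this table, assuming psycopg2 with an open connection `conn` and a hypothetical `apply_change` business-logic helper:

```python
# Dedupe check and side effect run in one transaction: a redelivered event
# either inserts the dedupe row and applies the change, or skips both.
import json

def handle(key: bytes, value: bytes) -> None:
    event = json.loads(value)
    with conn:                        # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (event["eventId"],),
            )
            if cur.rowcount == 0:
                return                # already processed; safe to skip
            apply_change(cur, event)  # hypothetical business logic
```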
Observability for Event-Driven Systems (Non-Negotiable)
EDA can be harder to debug than synchronous APIs because the “call chain” is distributed and asynchronous.
What to monitor
- Consumer lag (are we falling behind?)
- Throughput per topic/partition
- Error rates and retry counts
- Dead-letter queue (DLQ) volume
- End-to-end latency (event produced → business effect visible)
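Consumer lag can be spot-checked from the client itself; here is a sketch using confluent-kafka (assumes the consumer `c` from the snippet above and a 12-partition topic):

```python
# Lag = high watermark minus the next offset this group will read, per partition.
from confluent_kafka import TopicPartition

def total_lag(consumer, topic: str, num_partitions: int) -> int:
    tps = [TopicPartition(topic, p) for p in range(num_partitions)]
    lag = 0
    for tp in consumer.committed(tps, timeout=10):
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        next_offset = tp.offset if tp.offset >= 0 else low  # no commit yet -> log start
        lag += max(high - next_offset, 0)
    return lag

print(total_lag(c, "orders.events", 12))
```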
Logging and tracing tips
- Include correlation IDs (e.g., `orderId`, `traceId`)
- Log event metadata: topic, partition, offset, consumer group
- Use distributed tracing where possible (OpenTelemetry is common)
Practical Use Cases for Redpanda-Based Event Streaming
Real-time order processing
- Emit `OrderPlaced`
- Consumers validate, reserve inventory, authorize payment
- Emit `OrderConfirmed` or `OrderRejected`
- Drive notifications and fulfillment asynchronously
IoT telemetry ingestion
- Devices stream events keyed by `deviceId`
- Real-time anomaly detection
- Store raw events for reprocessing and analytics
User activity tracking and personalization
- Capture events like `ProductViewed` and `AddedToCart`
- Build near real-time recommendation updates
- Feed analytics warehouses and feature stores
Implementation Details: Topic Configs, Retention/Compaction, and DLQs
This section adds the “last mile” configuration decisions that make an EDA system stable in production.
Sample topic configuration (starting point)
1) Domain events topic (e.g., orders.events)
- Goal: durable audit trail + replay for new consumers
- Suggested baseline:
- Replication factor: 3 (where available)
- Partitions: sized to consumer parallelism + growth
- Retention: 7–30 days (or longer if events are part of your audit strategy)
Example commands (Kafka CLI-compatible)
```bash
# Domain events (retention-based)
kafka-topics --bootstrap-server localhost:9092 \
--create --topic orders.events --partitions 12 --replication-factor 3 \
--config retention.ms=1209600000
```
2) Compacted “current state” topic (e.g., orders.state)
- Goal: latest known state by key (`orderId`)
- Suggested baseline: `cleanup.policy=compact`
- Optional: `cleanup.policy=compact,delete` if you also want time-based retention
- Key must be stable (`orderId`) and always present
```bash
# State topic (compacted)
kafka-topics --bootstrap-server localhost:9092 \
--create --topic orders.state --partitions 12 --replication-factor 3 \
--config cleanup.policy=compact
```
Retention vs compaction guidance
- Use retention (delete) for event history you may replay.
- Use compaction for “last write wins” state, caches, and reference data (e.g., user profiles, product catalog snapshots).
- Consider tiered storage if you need longer retention without keeping all segments on local disks (Redpanda supports tiered storage in self-managed deployments; see docs: https://docs.redpanda.com/current/manage/tiered-storage/).
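The same two topic shapes can also be created programmatically; a sketch with the confluent-kafka AdminClient (it speaks the Kafka API, so it works against Redpanda as well):

```python
# Create a retention-based events topic and a compacted state topic.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics([
    NewTopic("orders.events", num_partitions=12, replication_factor=3,
             config={"retention.ms": "1209600000"}),   # 14-day replayable history
    NewTopic("orders.state", num_partitions=12, replication_factor=3,
             config={"cleanup.policy": "compact"}),    # last write wins per key
])
for topic, future in futures.items():
    future.result()  # raises if creation failed
```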
Dead-letter queue (DLQ) implementation (practical pattern)
A DLQ is not just a topic; it's an operational workflow.
Pattern
- Main topic: `orders.events`
- Retry topic(s): `orders.events.retry.1m`, `orders.events.retry.10m` (optional)
- DLQ topic: `orders.events.dlq`
What goes into DLQ messages
- Original payload (or a pointer to it)
- Error reason/classification
- Consumer group / service name
- Original topic/partition/offset
- Timestamp and retry count
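A sketch of routing a poison message to the DLQ with that context carried in headers (`producer` is a confluent-kafka Producer, `msg` the failed message, `err` the caught exception; header names are illustrative):

```python
# Preserve the original payload and attach failure context as headers.
import time

def send_to_dlq(producer, msg, err, retry_count):
    producer.produce(
        "orders.events.dlq",
        key=msg.key(),
        value=msg.value(),                              # original payload, unchanged
        headers=[
            ("error", str(err).encode()),
            ("consumer.group", b"orders-service"),
            ("source.topic", msg.topic().encode()),
            ("source.partition", str(msg.partition()).encode()),
            ("source.offset", str(msg.offset()).encode()),
            ("retry.count", str(retry_count).encode()),
            ("failed.at.epoch.ms", str(int(time.time() * 1000)).encode()),
        ],
    )
    producer.flush()
```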
How to replay safely
- Fix code/schema/infra
- Reprocess DLQ into the original topic or into a dedicated `orders.events.replay` topic
- Keep idempotency checks enabled (DLQ replays can duplicate work)
Short Case Study (Concrete Example)
Scenario: checkout → payment → inventory flow
A common production issue in synchronous systems is cascading latency: if payment or inventory is slow, checkout becomes slow, and then timeouts cause retries that create more load.
EDA approach
- Checkout writes `OrderPlaced` (key = `orderId`)
- Payment and Inventory consume independently
- Each emits a decision event (`PaymentAuthorized`/`PaymentDeclined`, `InventoryReserved`/`InventoryOutOfStock`)
- A coordinator (or a choreographed set of services) determines `OrderConfirmed` vs `OrderRejected`
Concrete metrics you can put on a dashboard (example targets)
- p95 “checkout API” latency drops from “blocked on dependencies” to “enqueue + ack”: many teams aim for <100–200ms p95 for the synchronous request path once the workflow is truly async.
- Consumer lag under load becomes the early warning signal: e.g., alert when lag exceeds 10,000 messages for 5 minutes on `payments` consumers (the threshold depends on your SLA and traffic).
- DLQ rate is your quality indicator: aim for <0.1% of messages to DLQ; spikes usually mean a deploy/schema issue or downstream dependency failure.
If you already run this flow in production, replace these targets with your real before/after numbers (even two numbers make the story far more credible).
Implementation Checklist: Getting It Right the First Time
Use this checklist to avoid the most common early-stage mistakes:
Architecture checklist
- [ ] Define your domain events (nouns + past tense verbs)
- [ ] Decide partition keys based on ordering requirements
- [ ] Build for idempotent consumers
- [ ] Choose a schema/versioning strategy
- [ ] Plan retry policies and DLQs
- [ ] Add observability: lag, throughput, latency, errors
- [ ] Document topic ownership and contracts
- [ ] Plan reprocessing (“replay”) scenarios
- [ ] Define topic retention/compaction per topic (don’t use one default for everything)
- [ ] Decide producer `acks` and retry behavior for critical events
- [ ] Decide consumer offset commit strategy (and test crash/restart behavior)
- [ ] Set up logs and alerts for distributed pipelines so you can debug failures end-to-end
FAQ: Event-Driven Architecture with Redpanda
1) What is the difference between event-driven architecture and message queues?
Message queues often focus on task distribution (e.g., “do this job”), while event-driven architecture focuses on broadcasting facts (e.g., “this happened”). Streaming platforms like Redpanda can support both, but EDA typically emphasizes durable event logs, replayability, and multiple independent consumers.
2) When should I use Redpanda instead of a traditional broker?
Consider Redpanda when you need:
- High-throughput event streaming
- Low-latency pub/sub
- Kafka API compatibility (to reuse tooling and clients)
- A log-based architecture that supports replay and multiple consumers
If your use case is only small background jobs with simple routing, a classic queue might be sufficient.
3) How do I choose the right number of partitions?
Start with:
- Desired consumer parallelism (max parallelism ≈ number of partitions per consumer group)
- Expected throughput
- Future scaling needs
A practical approach is to begin with a moderate count and adjust as you learn real traffic patterns. Changing partitions later is possible but can require planning due to key-based ordering implications.
4) How do I prevent duplicate processing?
Assume duplicates can happen (especially under retries). Use:
- Idempotent handlers
- Deduplication storage keyed by `eventId`
- Upserts and deterministic updates
- Exactly-once patterns only where truly necessary
5) Should I publish “events” or “commands”?
- Events: immutable facts (“OrderPlaced”) meant for broad consumption.
- Commands: requests (“ReserveInventory”) intended for a specific handler.
In many architectures, services publish events and consume events, while commands are used sparingly for targeted actions or orchestration flows.
6) How do I handle failures (retries, poison messages, and DLQs)?
Use a structured approach:
- Retry transient errors with backoff
- Set a max retry count
- Route poison messages to a DLQ with context (error, payload, metadata)
- Build DLQ tooling for replay after fixes
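Tying these together, a sketch of the routing decision (this reuses `handle` and `send_to_dlq` from the sketches above; `TransientError` and the header name are hypothetical):

```python
# Retry transient failures via a retry topic until a cap, then route to the DLQ.
MAX_RETRIES = 5

def process_with_retry_routing(producer, msg):
    headers = dict(msg.headers() or [])
    retry_count = int(headers.get("retry.count", b"0"))
    try:
        handle(msg.key(), msg.value())           # idempotent handler
    except TransientError as err:                # hypothetical error classification
        if retry_count + 1 >= MAX_RETRIES:
            send_to_dlq(producer, msg, err, retry_count + 1)
        else:
            producer.produce(
                "orders.events.retry.1m",        # consumed again after a delay
                key=msg.key(),
                value=msg.value(),
                headers=[("retry.count", str(retry_count + 1).encode())],
            )
            producer.flush()
```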
If you want stronger guarantees around operational traceability, look into data pipeline auditing and lineage patterns for proving "what happened" across distributed systems.
7) Is event-driven architecture a good fit for monoliths?
Yes. EDA can be introduced inside a monolith to decouple modules and establish event contracts. Many teams start there, then later split services once boundaries and event flows are proven.
8) How do I ensure event schemas don’t break consumers?
Adopt schema versioning and compatibility rules:
- Prefer additive changes (new optional fields)
- Deprecate fields gradually
- Version events explicitly when breaking changes are unavoidable
- Document contracts and ownership clearly
9) Can I rebuild a system’s state by replaying events?
Often, yes, provided events are durable and sufficiently descriptive. This is a key benefit of log-based architectures: reprocessing becomes possible for debugging, backfills, and new consumers. The trade-off is that you must design events as durable domain facts and manage retention and storage costs.
Conclusion
Event-driven architecture works best when events are treated as durable contracts, not ad-hoc messages. If you get the fundamentals right (partition keys, at-least-once delivery with idempotent consumers, and production-ready workflows for retries/DLQs), you can evolve systems more safely and scale them with fewer cross-service bottlenecks.
For further reading on event-driven architecture use cases, Redpanda’s guide is a solid reference: https://www.redpanda.com/guides/kafka-use-cases-event-driven-architecture
And if long retention is a requirement, tiered storage is worth evaluating early: https://docs.redpanda.com/current/manage/tiered-storage/
For a deeper view into modern platform reliability, observability practices built on tools like Sentry, Grafana, and OpenTelemetry are a helpful complement to EDA monitoring.







