
Event-driven architecture (EDA) is the backbone of modern real-time software, from fraud detection and IoT telemetry to order processing and personalized user experiences. Instead of forcing systems to constantly poll ("Has anything changed?"), event-driven systems react as soon as an event occurs.
In this guide, you'll learn how to build event-driven architecture with Redpanda (a streaming platform with a Kafka-compatible API), including the core concepts, proven patterns (Pub/Sub, streaming pipelines, Outbox pattern, Sagas), and the production details that matter: topic configs, retention/compaction, producer acknowledgements, consumer offsets, retries, and dead-letter queues (DLQs).
What Is Event-Driven Architecture (EDA)?
Event-driven architecture is a style of system design where services communicate by producing and consuming events: immutable records of something that happened in the business domain.
Examples of events:
- `OrderPlaced`
- `PaymentAuthorized`
- `InventoryReserved`
- `UserSignedUp`
- `ShipmentDelivered`
Instead of tightly coupling services through synchronous APIs, EDA introduces a streaming layer where events flow and services respond asynchronously.
Why teams choose EDA
EDA is often adopted to achieve:
- Loose coupling between services
- Scalable, asynchronous processing
- Real-time data propagation
- Resiliency (temporary outages don’t necessarily stop the entire pipeline)
- Auditability (events can be retained as a historical record)
Where Redpanda Fits (and Why It Matters)
Redpanda is a modern streaming platform designed for high-throughput, low-latency event streaming. Many teams look at Redpanda when they want Kafka-style streaming with a simpler operational footprint.
Key idea: Kafka API compatibility
Redpanda’s Kafka API compatibility makes it easier to:
- Migrate from Kafka
- Use existing Kafka client libraries (Java, Python, Go, Node, etc.)
- Integrate with common streaming ecosystems (connectors, stream processors, schema registries, depending on your stack)
In practice, this often means your producers/consumers can stay the same while you evaluate or migrate infrastructure.
A quick mental model (diagram)
Here’s the “shape” of a typical Redpanda-based EDA system:
```
[Service A] --produces--> [ Topic: orders.events ] --consumed by--> [Service B]
      |                        (partitions)                              |
      v                                                       failed after retries
[Postgres DB] --> [Outbox Relay] --> [ Topic: payments.events ]          |
                                              |                          v
                                              v              [ DLQ: orders.events.dlq ]
                                        [Analytics/ML]
```
Core Concepts You Need Before Designing with Redpanda
Even if you’ve used message queues before, streaming systems introduce a few important concepts.
Events and topics
A topic is a named stream of events, like `orders.events`, `payments.events`, or `user-activity.events`.
Good topic naming tips:
- Use domain-driven naming (`orders.events`, `inventory.updates`)
- Avoid "kitchen sink" topics like `all-events`
- Keep ownership clear (one domain/team per topic when possible)
Partitions and ordering
Topics are typically split into partitions for scalability. Ordering is generally guaranteed within a partition, not across the whole topic.
This matters when events must be processed in order. A common technique is to use a partition key such as:
- `orderId` for order flows
- `userId` for user timelines
- `deviceId` for IoT streams
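For example, a minimal keyed-producer sketch with the confluent-kafka Python client (the broker address and event fields are illustrative):

```python
# Keyed production: every event for one order hashes to the same partition,
# so a consumer sees that order's events in order.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"eventType": "OrderPlaced", "orderId": "o-123"}
producer.produce(
    "orders.events",
    key=event["orderId"].encode(),      # partition key: stable per aggregate
    value=json.dumps(event).encode(),
)
producer.flush()                        # wait for broker acknowledgement
```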
Consumers, consumer groups, and scaling
Consumers scale horizontally using consumer groups:
- Multiple instances can share the work
- Each partition is processed by only one consumer instance in a group (at a time)
Scaling rule of thumb: max parallelism = number of partitions (per consumer group).
Common Event-Driven Architecture Patterns (and How to Apply Them)
1) Pub/Sub for broadcasting domain events
Use pub/sub when multiple downstream services need the same event.
Example:
- Checkout service publishes `OrderPlaced`
- Inventory service consumes it to reserve stock
- Fraud service consumes it to score risk
- Analytics service consumes it to update metrics
Benefits: decoupling and easy extensibility
Watch out for: uncontrolled fan-out and unclear ownership
2) Event streaming for real-time pipelines
Streaming pipelines continuously process data: filtering, aggregating, enriching, and joining streams.
Example:
- Consume clickstream events
- Enrich with user profile data
- Compute rolling session analytics
- Output to a `user-sessions` topic and an analytics store
Benefits: near real-time insights
Watch out for: stateful processing complexity (windows, late events, reprocessing)
3) The Outbox pattern (for reliable event publishing)
A classic challenge: how do you ensure that the database update and the event publish either both happen or neither does?
The Outbox pattern solves this:
- Write business data and an “outbox event” record in the same database transaction
- A separate process reads outbox rows and publishes to Redpanda
- Mark outbox rows as delivered
Benefits: avoids “ghost events” and lost messages
Watch out for: operational overhead and deduplication needs downstream
Outbox table (Postgres example)
Here’s a minimal schema that works well in production:
```sql
CREATE TABLE outbox_events (
id uuid PRIMARY KEY,
aggregate_id text NOT NULL,
event_type text NOT NULL,
payload jsonb NOT NULL,
occurred_at timestamptz NOT NULL DEFAULT now(),
published_at timestamptz
);
CREATE INDEX ON outbox_events (published_at) WHERE published_at IS NULL;
```
Outbox relay loop (implementation sketch)
This relay publishes in batches and marks rows as published only after a successful send. (You can extend this with retries, backoff, and metrics.)
```python
# Pseudocode: outbox relay (db and producer are assumed handles)
import json

# Claim a batch of unpublished rows inside one transaction; SKIP LOCKED lets
# multiple relay instances run without stepping on each other.
rows = db.query("""
    SELECT id, aggregate_id, event_type, payload
    FROM outbox_events
    WHERE published_at IS NULL
    ORDER BY occurred_at
    LIMIT 500
    FOR UPDATE SKIP LOCKED
""")

for r in rows:
    producer.send(
        topic="orders.events",
        key=r["aggregate_id"].encode(),   # key preserves per-aggregate order
        value=json.dumps({
            "eventId": str(r["id"]),
            "eventType": r["event_type"],
            "data": r["payload"],
        }).encode(),
    )

producer.flush()  # block until the broker acknowledges the whole batch

# Mark rows as published only after a successful flush; a crash before this
# point means the batch is re-sent (at-least-once), so consumers must dedupe.
db.execute("""
    UPDATE outbox_events
    SET published_at = now()
    WHERE id = ANY(%s)
""", [[r["id"] for r in rows]])
```
Implementation detail (practical):
- Put the aggregate ID (e.g., `orderId`) as the message key so all events for that aggregate preserve order.
- Include `eventId` (UUID) and `eventType` to simplify dedupe and routing.
- Assume the relay will retry; design consumers as idempotent (see Delivery Semantics).
4) Saga pattern (for distributed workflows)
When a business process spans multiple services (order → payment → inventory → shipping), you can use:
- Choreography: services react to events
- Orchestration: a coordinator service drives steps
Benefits: handles distributed transactions without 2PC
Watch out for: debugging complexity and edge-case handling (timeouts, retries, compensation)
Rule that prevents “mystery failures”: define one clear owner for the saga state (either a coordinator topic/state store, or a single service that can answer “what is the current saga step?”).
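As a rough illustration, here is a minimal orchestration sketch: a single coordinator owns the saga state per `orderId` and emits the outcome event. The `inventory.events` topic and the in-memory state dict are assumptions for the sketch; a production coordinator would persist saga state durably.

```python
# Minimal saga coordinator sketch: consume decision events, own per-order
# state, emit OrderConfirmed / OrderRejected. In-memory state only.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-saga-coordinator",
    "enable.auto.commit": False,
})
consumer.subscribe(["payments.events", "inventory.events"])  # inventory.events is assumed
producer = Producer({"bootstrap.servers": "localhost:9092"})

saga_state = {}  # orderId -> set of decision events seen (durable store in production)
SUCCESSES = {"PaymentAuthorized", "InventoryReserved"}
FAILURES = {"PaymentDeclined", "InventoryOutOfStock"}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order_id = msg.key().decode()
    state = saga_state.setdefault(order_id, set())
    state.add(json.loads(msg.value())["eventType"])

    if state & FAILURES:
        outcome = "OrderRejected"
    elif SUCCESSES <= state:
        outcome = "OrderConfirmed"
    else:
        consumer.commit(message=msg)   # still waiting on the other decision
        continue

    producer.produce("orders.events", key=msg.key(),
                     value=json.dumps({"eventType": outcome}).encode())
    producer.flush()
    consumer.commit(message=msg)
```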
Designing Event Contracts That Don’t Break Downstream Systems
A strong EDA system depends on stable, well-defined event schemas.
Event schema best practices
- Include a unique `eventId`
- Include a `timestamp` and (if relevant) `occurredAt`
- Include a `version` field (schema versioning is critical)
- Favor immutable facts ("PaymentAuthorized") over mutable state ("PaymentStatusUpdated")
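Putting those practices together, a sketch of an event envelope might look like this (field names are illustrative, not a fixed standard):

```python
# Building an event envelope that carries identity, type, version, and time.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventId": str(uuid.uuid4()),                      # unique ID for dedupe
    "eventType": "PaymentAuthorized",                  # immutable fact, past tense
    "version": 1,                                      # bump on schema changes
    "occurredAt": datetime.now(timezone.utc).isoformat(),
    "data": {"orderId": "o-123", "amountCents": 4999, "currency": "USD"},
}
payload = json.dumps(event).encode()                   # producer value, keyed by orderId
```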
Versioning strategy
To evolve safely:
- Additive changes (adding optional fields) are easiest
- Breaking changes require careful coordination or parallel topics/schemas
If your ecosystem uses a schema registry (e.g., Avro/Protobuf/JSON Schema), enforce compatibility rules early.
Delivery Semantics: At-Most-Once, At-Least-Once, Exactly-Once (What You Actually Need)
In real systems:
- At-most-once: fastest but can lose events
- At-least-once: common default; may deliver duplicates
- Exactly-once: hardest; usually requires tight coordination between consumption, processing, and state updates
Most teams choose at-least-once and implement idempotency.
Producer reliability knobs (practical defaults)
Java (Kafka client) producer example
```properties
acks=all
enable.idempotence=true
retries=2147483647
delivery.timeout.ms=120000
linger.ms=5
batch.size=65536
```
- Use `acks=all` for critical domain events.
- Enable idempotence where supported to reduce duplicates on retry.
- Add batching (`linger.ms`, `batch.size`) when throughput matters.
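For reference, a sketch of the same defaults for the confluent-kafka Python client (the config keys map to the Java properties above):

```python
# Producer tuned for at-least-once delivery of critical events.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                   # wait for full in-sync acknowledgement
    "enable.idempotence": True,      # broker dedupes producer retries
    "delivery.timeout.ms": 120000,   # total budget for send + retries
    "linger.ms": 5,                  # small batching delay
    "batch.size": 65536,
})
```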
Consumer offset management (how duplicates happen)
Duplicates typically appear when:
- You process a message,
- then crash before committing the offset,
- then restart and re-read the same message.
At-least-once pattern (recommended default):
- Consume message
- Apply side effects (DB write / API call / state update)
- Commit offset
Python consumer snippet (commit after processing)
```python
from confluent_kafka import Consumer, KafkaException

c = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-service",
    "enable.auto.commit": False,      # we commit manually, after processing
    "auto.offset.reset": "earliest",
})
c.subscribe(["orders.events"])

while True:
    msg = c.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise KafkaException(msg.error())
    handle(msg.key(), msg.value())  # must be idempotent
    c.commit(message=msg)           # commit only after success
```
Practical idempotency strategies
- Use a deduplication table keyed by `eventId`
- Use upserts instead of inserts
- Keep processing deterministic so replays are safe
Dedup table sketch
```sql
CREATE TABLE processed_events (
event_id uuid PRIMARY KEY,
processed_at timestamptz NOT NULL DEFAULT now()
);
```
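A sketch of an idempotent handler that uses this table, assuming psycopg2 with an open connection `conn` and a hypothetical `apply_change` business-logic helper:

```python
# Dedupe check and side effect run in one transaction: a redelivered event
# either inserts the dedupe row and applies the change, or skips both.
import json

def handle(key: bytes, value: bytes) -> None:
    event = json.loads(value)
    with conn:                        # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (event["eventId"],),
            )
            if cur.rowcount == 0:
                return                # already processed; safe to skip
            apply_change(cur, event)  # hypothetical business logic
```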
Observability for Event-Driven Systems (Non-Negotiable)
EDA can be harder to debug than synchronous APIs because the “call chain” is distributed and asynchronous.
What to monitor
- Consumer lag (are we falling behind?)
- Throughput per topic/partition
- Error rates and retry counts
- Dead-letter queue (DLQ) volume
- End-to-end latency (event produced → business effect visible)
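Consumer lag can be spot-checked from the client itself; here is a sketch using confluent-kafka (assumes the consumer `c` from the snippet above and a 12-partition topic):

```python
# Lag = high watermark minus the next offset this group will read, per partition.
from confluent_kafka import TopicPartition

def total_lag(consumer, topic: str, num_partitions: int) -> int:
    tps = [TopicPartition(topic, p) for p in range(num_partitions)]
    lag = 0
    for tp in consumer.committed(tps, timeout=10):
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        next_offset = tp.offset if tp.offset >= 0 else low  # no commit yet -> log start
        lag += max(high - next_offset, 0)
    return lag

print(total_lag(c, "orders.events", 12))
```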
Logging and tracing tips
- Include correlation IDs (e.g., `orderId`, `traceId`)
- Log event metadata: topic, partition, offset, consumer group
- Use distributed tracing where possible (OpenTelemetry is common)
Practical Use Cases for Redpanda-Based Event Streaming
Real-time order processing
- Emit `OrderPlaced`
- Consumers validate, reserve inventory, authorize payment
- Emit `OrderConfirmed` or `OrderRejected`
- Drive notifications and fulfillment asynchronously
IoT telemetry ingestion
- Devices stream events keyed by `deviceId`
- Real-time anomaly detection
- Store raw events for reprocessing and analytics
User activity tracking and personalization
- Capture events like `ProductViewed` and `AddedToCart`
- Build near real-time recommendation updates
- Feed analytics warehouses and feature stores
Implementation Details: Topic Configs, Retention/Compaction, and DLQs
This section adds the “last mile” configuration decisions that make an EDA system stable in production.
Sample topic configuration (starting point)
1) Domain events topic (e.g., orders.events)
- Goal: durable audit trail + replay for new consumers
- Suggested baseline:
- Replication factor: 3 (where available)
- Partitions: sized to consumer parallelism + growth
- Retention: 7–30 days (or longer if events are part of your audit strategy)
Example commands (Kafka CLI-compatible)
```bash
# Domain events (retention-based)
kafka-topics --bootstrap-server localhost:9092 \
--create --topic orders.events --partitions 12 --replication-factor 3 \
--config retention.ms=1209600000
```
2) Compacted “current state” topic (e.g., orders.state)
- Goal: latest known state by key (`orderId`)
- Suggested baseline: `cleanup.policy=compact`
- Optional: `cleanup.policy=compact,delete` if you also want time-based retention
- Key must be stable (`orderId`) and always present
```bash
# State topic (compacted)
kafka-topics --bootstrap-server localhost:9092 \
--create --topic orders.state --partitions 12 --replication-factor 3 \
--config cleanup.policy=compact
```
Retention vs compaction guidance
- Use retention (delete) for event history you may replay.
- Use compaction for “last write wins” state, caches, and reference data (e.g., user profiles, product catalog snapshots).
- Consider tiered storage if you need longer retention without keeping all segments on local disks (Redpanda supports tiered storage in self-managed deployments; see docs: https://docs.redpanda.com/current/manage/tiered-storage/).
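The same two topic shapes can also be created programmatically; a sketch with the confluent-kafka AdminClient (it speaks the Kafka API, so it works against Redpanda as well):

```python
# Create a retention-based events topic and a compacted state topic.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics([
    NewTopic("orders.events", num_partitions=12, replication_factor=3,
             config={"retention.ms": "1209600000"}),   # 14-day replayable history
    NewTopic("orders.state", num_partitions=12, replication_factor=3,
             config={"cleanup.policy": "compact"}),    # last write wins per key
])
for topic, future in futures.items():
    future.result()  # raises if creation failed
```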
Dead-letter queue (DLQ) implementation (practical pattern)
A DLQ is not just a topic; it's an operational workflow.
Pattern
- Main topic: `orders.events`
- Retry topic(s): `orders.events.retry.1m`, `orders.events.retry.10m` (optional)
- DLQ topic: `orders.events.dlq`
What goes into DLQ messages
- Original payload (or a pointer to it)
- Error reason/classification
- Consumer group / service name
- Original topic/partition/offset
- Timestamp and retry count
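A sketch of routing a poison message to the DLQ with that context carried in headers (`producer` is a confluent-kafka Producer, `msg` the failed message, `err` the caught exception; header names are illustrative):

```python
# Preserve the original payload and attach failure context as headers.
import time

def send_to_dlq(producer, msg, err, retry_count):
    producer.produce(
        "orders.events.dlq",
        key=msg.key(),
        value=msg.value(),                              # original payload, unchanged
        headers=[
            ("error", str(err).encode()),
            ("consumer.group", b"orders-service"),
            ("source.topic", msg.topic().encode()),
            ("source.partition", str(msg.partition()).encode()),
            ("source.offset", str(msg.offset()).encode()),
            ("retry.count", str(retry_count).encode()),
            ("failed.at.epoch.ms", str(int(time.time() * 1000)).encode()),
        ],
    )
    producer.flush()
```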
How to replay safely
- Fix code/schema/infra
- Reprocess DLQ into the original topic or into a dedicated `orders.events.replay` topic
- Keep idempotency checks enabled (DLQ replays can duplicate work)
Short Case Study (Concrete Example)
Scenario: checkout → payment → inventory flow
A common production issue in synchronous systems is cascading latency: if payment or inventory is slow, checkout becomes slow, and then timeouts cause retries that create more load.
EDA approach
- Checkout writes `OrderPlaced` (key = `orderId`)
- Payment and Inventory consume independently
- Each emits a decision event (`PaymentAuthorized`/`PaymentDeclined`, `InventoryReserved`/`InventoryOutOfStock`)
- A coordinator (or a choreographed set of services) determines `OrderConfirmed` vs `OrderRejected`
Concrete metrics you can put on a dashboard (example targets)
- p95 “checkout API” latency drops from “blocked on dependencies” to “enqueue + ack”: many teams aim for <100–200ms p95 for the synchronous request path once the workflow is truly async.
- Consumer lag under load becomes the early warning signal: e.g., alert when lag exceeds 10,000 messages for 5 minutes on `payments` consumers (the threshold depends on your SLA and traffic).
- DLQ rate is your quality indicator: aim for <0.1% of messages to DLQ; spikes usually mean a deploy/schema issue or downstream dependency failure.
If you already run this flow in production, replace these targets with your real before/after numbers (even two numbers make the story far more credible).
Implementation Checklist: Getting It Right the First Time
Use this checklist to avoid the most common early-stage mistakes:
Architecture checklist
- [ ] Define your domain events (nouns + past tense verbs)
- [ ] Decide partition keys based on ordering requirements
- [ ] Build for idempotent consumers
- [ ] Choose a schema/versioning strategy
- [ ] Plan retry policies and DLQs
- [ ] Add observability: lag, throughput, latency, errors
- [ ] Document topic ownership and contracts
- [ ] Plan reprocessing (“replay”) scenarios
- [ ] Define topic retention/compaction per topic (don’t use one default for everything)
- [ ] Decide producer `acks` and retry behavior for critical events
- [ ] Decide consumer offset commit strategy (and test crash/restart behavior)
- [ ] Set up logs and alerts for distributed pipelines so you can debug failures end-to-end
FAQ: Event-Driven Architecture with Redpanda
1) What is the difference between event-driven architecture and message queues?
Message queues often focus on task distribution (e.g., “do this job”), while event-driven architecture focuses on broadcasting facts (e.g., “this happened”). Streaming platforms like Redpanda can support both, but EDA typically emphasizes durable event logs, replayability, and multiple independent consumers.
2) When should I use Redpanda instead of a traditional broker?
Consider Redpanda when you need:
- High-throughput event streaming
- Low-latency pub/sub
- Kafka API compatibility (to reuse tooling and clients)
- A log-based architecture that supports replay and multiple consumers
If your use case is only small background jobs with simple routing, a classic queue might be sufficient.
3) How do I choose the right number of partitions?
Start with:
- Desired consumer parallelism (max parallelism ≈ number of partitions per consumer group)
- Expected throughput
- Future scaling needs
A practical approach is to begin with a moderate count and adjust as you learn real traffic patterns. Changing partitions later is possible but can require planning due to key-based ordering implications.
4) How do I prevent duplicate processing?
Assume duplicates can happen (especially under retries). Use:
- Idempotent handlers
- Deduplication storage keyed by `eventId`
- Upserts and deterministic updates
- Exactly-once patterns only where truly necessary
5) Should I publish “events” or “commands”?
- Events: immutable facts (“OrderPlaced”) meant for broad consumption.
- Commands: requests (“ReserveInventory”) intended for a specific handler.
In many architectures, services publish events and consume events, while commands are used sparingly for targeted actions or orchestration flows.
6) How do I handle failures (retries, poison messages, and DLQs)?
Use a structured approach:
- Retry transient errors with backoff
- Set a max retry count
- Route poison messages to a DLQ with context (error, payload, metadata)
- Build DLQ tooling for replay after fixes
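Tying these together, a sketch of the routing decision (this reuses `handle` and `send_to_dlq` from the sketches above; `TransientError` and the header name are hypothetical):

```python
# Retry transient failures via a retry topic until a cap, then route to the DLQ.
MAX_RETRIES = 5

def process_with_retry_routing(producer, msg):
    headers = dict(msg.headers() or [])
    retry_count = int(headers.get("retry.count", b"0"))
    try:
        handle(msg.key(), msg.value())           # idempotent handler
    except TransientError as err:                # hypothetical error classification
        if retry_count + 1 >= MAX_RETRIES:
            send_to_dlq(producer, msg, err, retry_count + 1)
        else:
            producer.produce(
                "orders.events.retry.1m",        # consumed again after a delay
                key=msg.key(),
                value=msg.value(),
                headers=[("retry.count", str(retry_count + 1).encode())],
            )
            producer.flush()
```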
If you want stronger guarantees around operational traceability, look into data pipeline auditing and lineage patterns for proving "what happened" across distributed systems.
7) Is event-driven architecture a good fit for monoliths?
Yes. EDA can be introduced inside a monolith to decouple modules and establish event contracts. Many teams start there, then later split services once boundaries and event flows are proven.
8) How do I ensure event schemas don’t break consumers?
Adopt schema versioning and compatibility rules:
- Prefer additive changes (new optional fields)
- Deprecate fields gradually
- Version events explicitly when breaking changes are unavoidable
- Document contracts and ownership clearly
9) Can I rebuild a system’s state by replaying events?
Often, yes, provided events are durable and sufficiently descriptive. This is a key benefit of log-based architectures: reprocessing becomes possible for debugging, backfills, and new consumers. The trade-off is that you must design events as durable domain facts and manage retention and storage costs.
Conclusion
Event-driven architecture works best when events are treated as durable contracts, not ad-hoc messages. If you get the fundamentals right (partition keys, at-least-once delivery with idempotent consumers, and production-ready workflows for retries/DLQs), you can evolve systems more safely and scale them with fewer cross-service bottlenecks.
For further reading on event-driven architecture use cases, Redpanda’s guide is a solid reference: https://www.redpanda.com/guides/kafka-use-cases-event-driven-architecture
And if long retention is a requirement, tiered storage is worth evaluating early: https://docs.redpanda.com/current/manage/tiered-storage/
For a deeper view into modern platform reliability, observability practices built on tools like Sentry, Grafana, and OpenTelemetry are a helpful complement to EDA monitoring.







