How to Stream MongoDB to ClickHouse in Real Time (No Code, No Kafka)

August 26, 2025 at 07:28 PM | Est. read time: 14 min

By Bianca Vaillants

Sales Development Representative, passionate about connecting people

Looking to power real‑time analytics without slowing down your app? Streaming MongoDB data to ClickHouse gives you the best of both worlds: MongoDB keeps doing what it’s great at (flexible, high‑velocity writes), while ClickHouse delivers lightning‑fast queries over massive datasets.

In this guide, you’ll learn how to build a production‑ready, no‑code pipeline from MongoDB to ClickHouse using Estuary Flow. You’ll capture changes in real time (CDC), backfill historical data, and land analytics‑ready tables in ClickHouse—without managing Kafka or writing complex ETL scripts.

We’ll walk through the architecture, setup steps, schema design tips, and performance best practices so you can move from idea to dashboard in hours, not weeks.

Why Sync MongoDB to ClickHouse for Analytics

  • MongoDB strengths: flexible schema, fast writes, great for operational workloads (user profiles, content, events).
  • MongoDB limits for analytics: long‑running aggregations, complex filters, and joins across large collections get expensive and slow as data grows.
  • ClickHouse advantages: columnar OLAP database designed for real‑time analytics—fast scans, high compression, and blazingly quick aggregations over billions of rows.

If you’re new to ClickHouse, this overview is a helpful primer on why it’s built for speed: ClickHouse for real-time analytics.

By syncing data from MongoDB into ClickHouse, you separate OLTP from OLAP:

  • MongoDB remains the system of record for application logic and real‑time writes.
  • ClickHouse becomes your analytics layer for dashboards, reporting, and product analytics.

Typical use cases:

  • Real‑time dashboards powered by ClickHouse—without overloading MongoDB clusters.
  • Scalable product or behavioral analytics from app events and user actions.
  • Offloading heavy aggregations from MongoDB to a purpose‑built analytics engine.
  • Customer‑facing analytics that demand sub‑second query performance.

Not sure when to choose streaming vs batch? Start here: Batch vs stream processing.

The Architecture (Simple, Resilient, No Kafka to Manage)

Here’s the streamlined pattern you’ll set up with Estuary Flow:

  1. Flow connects to MongoDB and captures changes in real time via change streams.
  2. Flow backfills historical data, then continuously streams inserts, updates, and deletes.
  3. Flow exposes a Kafka‑compatible endpoint (Dekaf), so you don’t deploy Kafka.
  4. ClickHouse ingests from Flow via ClickPipes (in ClickHouse Cloud) and writes into MergeTree tables optimized for analytics.

Alternative for self‑managed ClickHouse: ingest via the Kafka engine + materialized views (covered in the Advanced section below).

For a broader look at how modern pipelines fit together, see: Data pipelines explained.

Prerequisites Checklist

MongoDB

  • A MongoDB Atlas cluster or self‑managed replica set/sharded cluster with change streams enabled.
  • A user with read permissions on databases/collections you’ll sync.
  • Optional: changeStreamPreAndPostImages enabled if you need before/after snapshots of updates.
  • Network access for Flow (allowlist IP or configure SSH tunneling).

ClickHouse

  • ClickHouse Cloud with ClickPipes, or self‑managed ClickHouse (23.8+ recommended).
  • Network access from ClickHouse to Flow’s Kafka‑compatible endpoint.

Estuary Flow

  • An account and access to the Flow dashboard to configure capture and delivery.

Step-by-Step: Streaming MongoDB to ClickHouse with Estuary Flow

Step 1: Prepare Your MongoDB Cluster

  • Ensure your deployment supports change streams (Atlas or a replica set/sharded cluster).
  • Verify user roles (read permissions on target DB/collections).
  • Configure network access for Flow (allowlist or SSH tunnel).
  • Optional: enable pre/post images for richer CDC events.

Step 2: Create a MongoDB Source in Flow

  • Sign in to the Flow dashboard and go to Sources.
  • Click “New Source” and choose MongoDB.
  • Enter connection details:
    • Connection string (SRV or standard URI).
    • Username/password.
    • Databases/collections to capture (you can include all or select specific ones).
  • Network tunnel: if your MongoDB is private, use the SSH tunnel section in Flow to provide host, user, and private key for secure access.
  • Save and start the capture. Flow will discover collections and begin backfilling.

Pro tip: Start with a single high‑value collection (for example, orders or events) to validate end‑to‑end, then expand.

Step 3: Normalize and Shape Data in Flow

MongoDB’s flexible documents need a consistent schema for analytics. Flow helps you:

  • Infer schemas automatically and enforce types.
  • Flatten nested fields (for example, customer.name -> customer_name).
  • Choose which fields to keep, rename, or mask (useful for PII).
  • Handle arrays via flattening or map to ClickHouse Nested types (covered below).

Keep column names analytics‑friendly (snake_case) and ensure a stable primary key (usually the MongoDB _id).

Step 4: Let Flow Backfill and Switch to Live Sync

  • When you start the capture, Flow performs a historical load to ensure a complete dataset in your destination.
  • After backfill, Flow streams change events (inserts, updates, deletes) in near real time.
  • If the pipeline is interrupted, Flow resumes from the last checkpoint automatically.

Step 5: Deliver Data to ClickHouse with ClickPipes (No Kafka to Manage)

In ClickHouse Cloud:

  • Navigate to ClickPipes and create a new pipeline.
  • Choose Kafka as the source and supply the Kafka‑compatible endpoint/credentials provided by Flow (Dekaf).
  • Select topics corresponding to your collections (for example, mongo.orders).
  • Set input format to JSONEachRow (or the format you configured in Flow).
  • Map fields to the destination table schema (you can let ClickPipes help auto‑create tables, then refine).

Recommended table design for CDC upserts:

  • Use ReplacingMergeTree with a version column to deduplicate updates.
  • Use a stable ORDER BY key (typically the document id).
  • Partition by event date for faster retention and pruning.

Example (simplified):

  • Columns: id (String), status (LowCardinality(String)), total (Float64), updated_at (DateTime64(3, 'UTC')), op (insert|update|delete), version (UInt64).
  • Engine: ReplacingMergeTree(version)
  • ORDER BY (id)
  • PARTITION BY toYYYYMM(updated_at)

Tip: Use event time (or Flow’s metadata timestamp) as your version so newer records replace older ones deterministically.

Step 6: Validate End-to-End

  • Insert or update a document in MongoDB:
    • Example: update the orders document with _id "5f..." from status "pending" to "paid".
  • Watch Flow logs to confirm capture.
  • Query ClickHouse:
    • Select from your table and confirm you see the latest version of that id.
    • Run a quick aggregation (for example, revenue by hour) to validate performance (example queries below).
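
A minimal sketch of those example queries, assuming the simplified orders schema from Step 5 landed in a table named analytics.orders (the table and column names are illustrative; adjust them to your pipeline):

  -- Latest state of a single document. FINAL applies ReplacingMergeTree
  -- deduplication at read time, so only the newest version per id is returned.
  SELECT id, status, total, updated_at, version
  FROM analytics.orders FINAL
  WHERE id = '5f...';   -- the _id you just updated in MongoDB

  -- Quick aggregation to sanity-check performance: revenue by hour.
  SELECT toStartOfHour(updated_at) AS hour,
         sum(total) AS revenue,
         count() AS orders
  FROM analytics.orders FINAL
  GROUP BY hour
  ORDER BY hour DESC
  LIMIT 24;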

Handling Schema Evolution and Nested Documents

MongoDB documents change over time—ClickHouse prefers predictable columns. Use these patterns:

  • Flatten nested objects: customer.name -> customer_name. Keep the flattened output consistent.
  • Arrays:
    • If you need item‑level analytics (for example, top SKUs), map the array to a Nested type in ClickHouse: items Nested(sku String, qty UInt32, price Float64). This preserves structure while remaining query‑friendly.
    • If you only need aggregate metrics (for example, item_count), compute them in Flow and store a single numeric column.
  • Type drift: enforce types in Flow (for example, always cast ObjectId to String, dates to DateTime64(3, 'UTC')).
  • Unknown fields: keep a json_payload column (String) for raw document storage if you need a safety net.

Queries over Nested types are fast and expressive in ClickHouse. For example, you can explode items for SKU‑level aggregations using ARRAY JOIN on the Nested columns, as sketched below.
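
A sketch of that pattern, assuming the items Nested column above lives in a table named analytics.orders:

  -- ARRAY JOIN expands the Nested items column so each array element
  -- becomes its own row before aggregation.
  SELECT items.sku AS sku,
         sum(items.qty) AS total_qty,
         sum(items.qty * items.price) AS revenue
  FROM analytics.orders
  ARRAY JOIN items
  GROUP BY sku
  ORDER BY revenue DESC
  LIMIT 10;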

Designing ClickHouse Tables for CDC

ClickHouse doesn’t do row‑level updates like an OLTP database. To represent “latest state,” use a MergeTree variant:

  • ReplacingMergeTree(version): Keep only the newest record per primary key (id). Provide a monotonically increasing version (event timestamp in milliseconds works well).
  • CollapsingMergeTree(sign): Represent inserts as +1 and deletes as −1; ClickHouse collapses pairs. Useful when you need delete semantics reflected in aggregates.
  • TTL and partitioning: Partition by month (toYYYYMM(updated_at)) and set TTL (for example, 90 days) to control storage costs.
  • LowCardinality for enums/strings: Convert high‑repetition strings (for example, status) to LowCardinality(String) to save space and accelerate GROUP BY.

Example DDL (conceptual; written out as full SQL below):

  • CREATE TABLE analytics.orders with ENGINE = ReplacingMergeTree(version)
  • Columns: id String, user_id String, status LowCardinality(String), total Float64, updated_at DateTime64(3, 'UTC'), version UInt64
  • ORDER BY (id)
  • PARTITION BY toYYYYMM(updated_at)
  • TTL updated_at + INTERVAL 90 DAY
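
Here's that DDL as actual ClickHouse SQL (a sketch; the table and column names mirror the bullets above, so adapt them to your collections):

  CREATE TABLE analytics.orders
  (
      id         String,
      user_id    String,
      status     LowCardinality(String),
      total      Float64,
      updated_at DateTime64(3, 'UTC'),
      version    UInt64              -- event timestamp in milliseconds works well
  )
  ENGINE = ReplacingMergeTree(version)
  PARTITION BY toYYYYMM(updated_at)
  ORDER BY id
  TTL toDateTime(updated_at) + INTERVAL 90 DAY;

  -- Note: ReplacingMergeTree deduplicates in the background and per partition,
  -- so use FINAL (or an argMax-style query) when you need exact latest-state reads.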

Performance and Cost Tips

  • Partition wisely: date‑based partitions make retention and pruning cheap.
  • Use proper ORDER BY: choose a key that supports your most common filters/joins (often id or (user_id, updated_at)).
  • Control table bloat: set TTL and use ReplacingMergeTree with an appropriate version to avoid duplicates.
  • Compress and encode: rely on ClickHouse defaults; consider setting codecs on wide text columns if needed.
  • Avoid high‑cardinality dimensions in raw tables: pre‑aggregate into summary tables for common dashboards (see the sketch below).
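
One common way to do that is a SummingMergeTree rollup maintained by a materialized view. A minimal sketch, assuming the analytics.orders table from the DDL above:

  -- Hourly revenue rollup, updated automatically as new rows arrive.
  CREATE TABLE analytics.orders_hourly
  (
      hour    DateTime,
      orders  UInt64,
      revenue Float64
  )
  ENGINE = SummingMergeTree
  PARTITION BY toYYYYMM(hour)
  ORDER BY hour;

  CREATE MATERIALIZED VIEW analytics.orders_hourly_mv
  TO analytics.orders_hourly AS
  SELECT toStartOfHour(updated_at) AS hour,
         count() AS orders,
         sum(total) AS revenue
  FROM analytics.orders
  GROUP BY hour;

  -- Caveat: with CDC upserts, every update inserts a new row into the base table,
  -- so treat these rollups as approximate or build them from a deduplicated source.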

Security and Reliability Considerations

  • Encryption: use TLS for MongoDB, Flow, and ClickHouse connections.
  • Credentials: store secrets in Flow and ClickHouse securely (never hard‑code).
  • Private networking: if data is sensitive, prefer private endpoints/SSH tunnels or VPC peering.
  • Delivery semantics: Flow provides at‑least‑once delivery. Use idempotent table designs (ReplacingMergeTree with version) to deduplicate safely.
  • Resumability: if ingestion pauses, Flow resumes from the last checkpoint via change stream resume tokens.

Monitoring and Observability

  • In Flow: watch capture/materialization logs, lag, and retry metrics; set alerts for failure states.
  • In ClickHouse:
    • system.query_log and system.part_log for ingestion/query diagnostics.
    • SELECT count(), max(updated_at) FROM table to validate freshness (sketched below).
    • EXPLAIN and system.query_thread_log to optimize slow analytics queries.
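
A couple of queries you might use for those checks (a sketch; the analytics database and orders table are the names assumed in earlier examples):

  -- Freshness: how recent is the newest synced record?
  SELECT count() AS row_count, max(updated_at) AS latest_event
  FROM analytics.orders;

  -- Storage footprint per table, from active data parts.
  SELECT table,
         sum(rows) AS total_rows,
         formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
  FROM system.parts
  WHERE active AND database = 'analytics'
  GROUP BY table;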

Common Pitfalls (and How to Avoid Them)

  • Oplog/Change stream retention too short: ensure MongoDB oplog size fits your backfill and peak change rate.
  • ObjectId handling: cast _id to String consistently; avoid mixed types.
  • Time zones: normalize to UTC in Flow; use DateTime64 with timezone for consistency.
  • Array updates: partial array updates can be tricky; prefer full document upserts or flatten to Nested types and treat updates as full replacements.
  • Out‑of‑order events: rely on versioned ReplacingMergeTree to keep the latest state.

Real-World Examples You Can Launch Fast

  • Real‑time sales dashboard: revenue by minute, AOV, top products, regional trends (sub‑second in ClickHouse).
  • Product analytics: funnels, retention cohorts, feature adoption metrics from event collections.
  • Customer‑facing analytics: usage insights and self‑serve reports with predictable performance.
  • Anomaly detection: detect drops/spikes in key metrics by comparing current vs trailing averages.

Advanced: Self‑Managed ClickHouse (Kafka Engine + Materialized Views)

If you’re not using ClickPipes, you can still ingest from Flow’s Kafka‑compatible endpoint with native ClickHouse objects:

  • Create a Kafka engine table pointing to Flow’s broker/topic (JSONEachRow).
  • Create a destination MergeTree table.
  • Create a materialized view to transform and insert from the Kafka table into the MergeTree table.

This pattern mirrors ClickPipes behavior but gives you full control over ingestion and transformation logic.
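
Here is a minimal sketch of those objects. The broker address, topic, and consumer group below are placeholders (use the endpoint and credentials Flow's Dekaf interface provides), and the destination reuses the analytics.orders ReplacingMergeTree table from the DDL earlier:

  -- 1. Kafka engine table that reads the Dekaf topic as JSONEachRow.
  CREATE TABLE analytics.orders_queue
  (
      id         String,
      user_id    String,
      status     String,
      total      Float64,
      updated_at DateTime64(3, 'UTC'),
      version    UInt64
  )
  ENGINE = Kafka
  SETTINGS kafka_broker_list = 'dekaf.example.com:9092',  -- placeholder broker
           kafka_topic_list  = 'mongo.orders',            -- placeholder topic
           kafka_group_name  = 'clickhouse_orders',
           kafka_format      = 'JSONEachRow';
  -- SASL/TLS settings for the endpoint are configured separately, depending on your setup.

  -- 2. Destination table: reuse analytics.orders (ReplacingMergeTree) from above.

  -- 3. Materialized view that moves rows from the queue into the destination.
  CREATE MATERIALIZED VIEW analytics.orders_queue_mv
  TO analytics.orders AS
  SELECT id, user_id, status, total, updated_at, version
  FROM analytics.orders_queue;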

Alternatives (And When They Make Sense)

  • Debezium + Kafka + Connectors: powerful and flexible, but more infra to deploy, secure, and operate.
  • Batch ETL (for example, nightly jobs): simpler for low‑frequency reporting, but you lose freshness and operational visibility.
  • ELT tools that support ClickHouse: great if you’re all‑in on batch; add CDC if you need near real time.

If you need low‑latency analytics without running Kafka, Flow + ClickPipes is often the fastest path to value.

Next Steps

You now have a clear blueprint to stream MongoDB to ClickHouse in real time—backfilled, deduplicated, and analytics‑ready. From here:

  • Start with one high‑impact collection (orders, events, or sessions).
  • Land data in a ReplacingMergeTree table with a clear version strategy.
  • Build a few core dashboards and iterate toward richer models (Nested arrays, summary tables, and aggregates).

Want to dive deeper into the architecture choices you’ll make along the way? The guides linked earlier on batch vs. stream processing and data pipelines are a good place to continue.

FAQ

  • Do I need Kafka?
    No. Flow exposes a Kafka‑compatible interface (Dekaf) so ClickHouse can consume CDC events via ClickPipes—no Kafka cluster to operate.
  • Can I start with batch and switch to streaming later?
    Yes. Flow supports historical backfill followed by continuous streaming. You can start with backfill only, then enable CDC when you’re ready.
  • How do deletes work?
    Include an op column (insert/update/delete) in the payload. Use ReplacingMergeTree for soft deletes (flag a deleted_at) or CollapsingMergeTree to reflect deletes in aggregates.
  • What about schema changes?
    Use Flow’s schema inference and mapping to keep ClickHouse columns stable. Add new fields as needed; avoid renaming frequently used columns mid‑stream.
  • How fresh is the data?
    Typically seconds from change to queryable, depending on network, throughput, and ClickHouse ingestion settings.

With this setup, you’ll offload analytics from MongoDB, keep your application snappy, and deliver real‑time insights at scale—without owning a single Kafka server.
