Edge buffering for sites that lose connectivity

Every remote site loses its uplink eventually: a cellular modem renegotiates, a VPN tunnel drops, a substation reboots, or the backhaul degrades for twenty minutes during a storm. If your telemetry pipeline writes straight from the field to a central TSDB over that link, those minutes are gone, and with metering and forecasting downstream, gaps are not cosmetic. A small buffer at the plant turns an outage into a delay instead of a loss.

A site without local buffering does not have a resilient pipeline; it has a pipeline that works until the link does not.

Store and forward, on the plant side

Put a durable queue between the field and the uplink. A collector reads Modbus registers, OPC UA nodes or an MQTT feed, timestamps each sample at the source, and appends it to a local append-only log on disk rather than firing it straight at the central store. A forwarder drains that log to the backend and advances its read offset only once delivery is acknowledged. When the link drops, the collector keeps writing locally and the forwarder stalls; when the link returns, it resumes from the last acknowledged offset.
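As a minimal sketch of that loop, assuming a JSON-lines file as the local log and a caller-supplied send function that returns True only when the backend acknowledges the batch: the names LocalLog, forward_once and send are hypothetical, chosen for illustration, and a production buffer would also need log rotation and a more compact encoding.

```python
import json
import os


class LocalLog:
    """Append-only JSON-lines log with a separately persisted read offset."""

    def __init__(self, path):
        self.path = path
        self.offset_path = path + ".offset"
        open(self.path, "ab").close()  # create the log if it does not exist

    def append(self, sample):
        # Collector side: write locally and fsync, whether or not the link is up.
        with open(self.path, "ab") as f:
            f.write(json.dumps(sample).encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())

    def acked_offset(self):
        # Byte position just past the last sample the backend acknowledged.
        try:
            with open(self.offset_path) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def read_batch(self, max_samples):
        # Return (samples, byte offset just past the last sample read).
        start = self.acked_offset()
        samples = []
        end = start
        with open(self.path, "rb") as f:
            f.seek(start)
            for _ in range(max_samples):
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written line
                samples.append(json.loads(line))
                end = f.tell()
        return samples, end

    def commit(self, offset):
        # Write to a temp file and rename, so a crash never leaves a torn offset.
        tmp = self.offset_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(offset))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.offset_path)


def forward_once(log, send, max_samples=500):
    """Send one batch; advance the offset only if `send` reports an ack."""
    batch, end = log.read_batch(max_samples)
    if not batch:
        return 0
    if not send(batch):  # link down or no acknowledgement: leave the offset alone
        return 0
    log.commit(end)
    return len(batch)
```

Because the offset moves only after an acknowledgement, a crash between send and commit replays the batch on restart. That is the at-least-once behaviour the backend has to absorb.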

Two properties make this safe. Delivery must be at-least-once with idempotent writes keyed on (source, tag, timestamp), so a sample replayed after a half-acknowledged batch lands once, not twice. And ordering must be preserved per tag: a TSDB that ingests a backfilled block out of order corrupts the downstream aggregates and ramp-rate calculations that assume monotonic time.
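To make the idempotency key concrete, here is a small sketch that uses an in-memory dictionary as a stand-in for the backend. IdempotentStore and order_per_tag are hypothetical names, and a real TSDB would express the same idea through its own upsert or deduplication mechanism.

```python
class IdempotentStore:
    """In-memory stand-in for a backend that upserts on (source, tag, timestamp)."""

    def __init__(self):
        self.rows = {}

    def write_batch(self, samples):
        # Returns how many samples were new; replays count as zero.
        written = 0
        for s in samples:
            key = (s["source"], s["tag"], s["ts"])
            if key not in self.rows:
                written += 1
            self.rows[key] = s["value"]  # upsert: a replay overwrites, never duplicates
        return written

    def series(self, source, tag):
        # Read one tag back in ascending timestamp order.
        points = [
            (ts, value)
            for (src, t, ts), value in self.rows.items()
            if src == source and t == tag
        ]
        return sorted(points)


def order_per_tag(samples):
    """Sort a batch so each tag's samples go out in ascending timestamp order."""
    return sorted(samples, key=lambda s: (s["source"], s["tag"], s["ts"]))
```

Writing the same batch twice leaves the store unchanged the second time, which is what makes a retry after a half-acknowledged batch harmless.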

Bounded storage and backpressure

Edge hardware is constrained: an industrial gateway or a small box in the control cabinet, not a server. The buffer needs a hard ceiling. Cap it by bytes or by retention window, whichever you hit first, and fix the eviction policy before you deploy. For most plant telemetry, dropping the oldest low-priority samples once full is correct, while alarms and state changes get a reserved partition that is never evicted. The failure mode to design against is silent unbounded growth that fills the disk and takes the collector down with it.
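The eviction policy can be sketched as follows, assuming an in-memory buffer for brevity. BoundedBuffer and its parameters are hypothetical. One choice here goes beyond the text and is an assumption: the reserved partition has its own byte cap and rejects new entries when full instead of evicting old ones, so the total footprint stays bounded.

```python
from collections import deque


class BoundedBuffer:
    """Bulk samples are evicted oldest-first; the reserved partition never evicts."""

    def __init__(self, max_bytes, max_age_s, reserved_max_bytes):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.reserved_max_bytes = reserved_max_bytes
        self.bulk = deque()      # (ts, payload), oldest on the left
        self.reserved = deque()  # alarms and state changes
        self.bulk_bytes = 0
        self.reserved_bytes = 0
        self.dropped = 0            # count evictions so they are never silent
        self.reserved_rejected = 0  # assumption: surface this as an alert

    def _drop_oldest(self):
        _, payload = self.bulk.popleft()
        self.bulk_bytes -= len(payload)
        self.dropped += 1

    def _evict(self, now):
        # Retention window first, then the byte cap: whichever is hit first wins.
        while self.bulk and now - self.bulk[0][0] > self.max_age_s:
            self._drop_oldest()
        while self.bulk_bytes > self.max_bytes:
            self._drop_oldest()

    def put(self, ts, payload, now, priority=False):
        size = len(payload)
        if priority:
            if self.reserved_bytes + size > self.reserved_max_bytes:
                self.reserved_rejected += 1
                return False
            self.reserved.append((ts, payload))
            self.reserved_bytes += size
            return True
        self.bulk.append((ts, payload))
        self.bulk_bytes += size
        self._evict(now)
        return True
```

Counting drops and rejections is the point of the sketch: whatever the policy, the buffer should report what it discarded instead of growing or failing quietly.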

Backpressure has to reach the producer. If the forwarder cannot keep up after a long outage, the collector should slow its poll rate or coarsen resolution rather than thrash the disk. A few hours of samples at a one-to-five-second cadence is a modest footprint under a compact binary encoding; size the ceiling from your worst-case outage, not the best case.
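A rough sizing calculation and a simple backpressure rule might look like this. Every number is an illustrative assumption, not a measurement: 500 tags, 16 bytes per encoded sample, and thresholds at 50% and 80% buffer fill. The function names are hypothetical.

```python
import math


def buffer_bytes(tags, cadence_s, outage_s, bytes_per_sample):
    """Bytes needed to hold every sample produced during an outage."""
    samples_per_tag = math.ceil(outage_s / cadence_s)
    return tags * samples_per_tag * bytes_per_sample


def poll_interval(base_s, fill_ratio, max_s=60.0):
    """Stretch the collector's poll interval as the buffer fills."""
    if fill_ratio >= 0.8:
        factor = 4
    elif fill_ratio >= 0.5:
        factor = 2
    else:
        factor = 1
    return min(base_s * factor, max_s)


# Six-hour outage, 500 tags, 16 bytes per sample (all assumed values).
one_second = buffer_bytes(500, 1, 6 * 3600, 16)   # 172,800,000 bytes
five_second = buffer_bytes(500, 5, 6 * 3600, 16)  # 34,560,000 bytes
```

Under those assumptions a six-hour outage costs roughly 173 MB at a one-second cadence and about 35 MB at five seconds. Rerun the arithmetic with your own tag count, encoding and worst-case outage before fixing the ceiling.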

Reconnecting and backfilling cleanly

Reconnection is where naive implementations fall over. Throttle the catch-up: a forwarder that dumps six hours of backlog at full rate the instant the tunnel returns saturates the link and stalls live data behind the replay. Send bounded batches, prioritise the most recent samples so dashboards recover first, then drain the historical tail in the background. Flag backfilled data so operators can tell a late arrival from a real-time one.
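One way to sketch the throttle is a per-tick planner: live samples always go out, and whatever remains of a sample budget is spent on the newest part of the backlog. plan_tick, link_budget and the backfilled flag are hypothetical names. Each slice stays in ascending time order internally; how the backend copes with slices arriving newest-first is outside this sketch.

```python
def plan_tick(live, backlog, link_budget):
    """Choose what to send in one tick.

    live: fresh samples, always sent
    backlog: buffered samples, oldest first
    link_budget: maximum samples per tick, live data included
    Returns (to_send, remaining_backlog).
    """
    to_send = [dict(s, backfilled=False) for s in live]
    room = max(0, link_budget - len(live))
    if room == 0 or not backlog:
        return to_send, backlog
    chunk = backlog[-room:]      # most recent slice, still in ascending time order
    remaining = backlog[:-room]  # the older tail waits for a later tick
    to_send += [dict(s, backfilled=True) for s in chunk]
    return to_send, remaining
```

Calling this once per send interval drains the backlog from the newest end while live data keeps flowing, and the backfilled flag travels with each sample so operators can tell the two apart.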

On the backend, treat backfill as normal ingestion, not a special path: the same idempotent, time-ordered writes. The result is a continuous series with the outage stitched over, not a hole followed by a spike. That distinction matters when the same data feeds settlement-grade metering and a 15-minute forecast that cannot quietly tolerate missing intervals.
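A quick way to check that the outage really was stitched over is to look for empty intervals once the backfill has drained. This is a sketch assuming integer epoch-second timestamps and a fixed interval length; missing_intervals is a hypothetical helper, not part of any particular TSDB.

```python
def missing_intervals(timestamps, start, end, step_s):
    """Return the start times of intervals in [start, end) that hold no sample."""
    seen = {(ts - start) // step_s for ts in timestamps if start <= ts < end}
    count = (end - start + step_s - 1) // step_s  # number of intervals, rounded up
    return [start + i * step_s for i in range(count) if i not in seen]


# One hour split into 15-minute intervals; the third interval has no data yet.
before_backfill = missing_intervals([10, 950, 2800], 0, 3600, 900)
# The same hour once a backfilled sample at t=1900 has landed.
after_backfill = missing_intervals([10, 950, 1900, 2800], 0, 3600, 900)
```

An empty result after the backfill completes suggests every interval is covered. Any start times that remain are the gaps a forecast or settlement run would otherwise hit.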

Size the buffer for your worst-case outage, not your average one, and give alarms a reserved partition that eviction can never touch.

Edge buffering is unglamorous, and it is the difference between a pipeline that survives the field and one that only survives the demo. The details (idempotency keys, per-tag ordering, eviction policy, throttled backfill) are where it is won or lost, and they are exactly the kind of plumbing we build and keep running. If you have sites that go dark and come back with gaps, let's talk about closing them.
