Vector is for building observability pipelines. It can work as an agent/sidecar too, but where it really shines is as an aggregator. Funnel a bunch of stuff into it and you can use it to apply standard structures and annotations, and to centralize governance. This won’t be a post describing its basic usage (Vector’s docs do that pretty well!) but rather a fun exploration of a failure scenario where a slow trickle of “poison” payloads can gum the whole thing up.
This blag is probably best suited to readers who are already exploring or operating Vector, but if you take the view that Everything is a Queue™ then I think it could be instructive for broader audiences too.
Basic Controls
Events come in, maybe get transformed, filtered, ejected into the sun, or otherwise delivered to a sink. Vector can deliver events to all sorts of sinks: other HTTP services, S3 buckets, queues, cloud vendors, /dev/null, and more. With the exception of the blackhole sink, most destinations will not be 100% reliable. Vector sinks have some parameters to deal with this (sketched in config form after this list), including:
- Buffers. All sinks have an adjustable buffer that accepts 500 events by default. An event remains in a sink’s buffer until delivery succeeds or permanently fails. If the buffer becomes full due to failed or slow delivery, the sink buffer can be instructed to either block (no new events can be sent to the sink, and backpressure is applied) or discard (newest events are dropped until the buffer has empty slots again). By default, buffers block.
  - Source and transform components also have buffers, but they are smaller (typically 100 events) and non-adjustable.
- Retries. Events can be retried a finite number of times, remaining in-buffer until delivery succeeds. After an event exhausts its retries, it is discarded. Vector’s default retry limit is so high it may as well be infinite (9.223372036854776e+18) and I’m going to treat it as such.
- Acknowledgements. Some sinks support end-to-end acknowledgements for clients that put things into Vector. For example: an HTTP source is connected to an S3 sink. With acknowledgements disabled, the HTTP source immediately responds HTTP 200 on successful receipt of events, without waiting to find out whether the sink delivers them, or whether they even reach the sink. With acknowledgements enabled, the HTTP source waits until successful delivery to S3 before it returns HTTP 200, or responds with HTTP 5xx if delivery permanently fails.
  - If an event is discarded by a filter, the event is considered to be “successfully” delivered as far as acknowledgements are concerned.
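To make those controls concrete, here’s a minimal sketch of how they might appear on a single sink. Treat it as illustrative rather than copy-pasteable: the component names and values are made up, and the options shown (buffer, request.retry_attempts, and acknowledgements) are the per-sink settings discussed above.

```toml
# Hypothetical S3 sink demonstrating the three controls above.
[sinks.archive]
type   = "aws_s3"
inputs = ["my_source"]          # assumed upstream component
bucket = "example-logs-bucket"  # made-up bucket
region = "us-east-1"

[sinks.archive.encoding]
codec = "json"

# Buffers: 500 events by default; "block" applies backpressure upstream,
# while "drop_newest" sheds events instead.
[sinks.archive.buffer]
type       = "memory"
max_events = 500
when_full  = "block"

# Retries: the default is effectively infinite; a finite value means an
# undeliverable event is eventually discarded.
[sinks.archive.request]
retry_attempts = 5

# End-to-end acknowledgements: the source only answers its client once this
# sink has accepted the event or permanently failed to deliver it.
[sinks.archive.acknowledgements]
enabled = true
```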
A reallllly important note about acknowledgements: if a sink attempts to deliver a payload and fails, but it hasn’t yet exhausted its retries, then delivery has not permanently failed yet. Acknowledgements only fire when delivery succeeds or permanently fails, and a temporary failure with retries remaining is neither! No response is given to the client while these retries occur, which could eventually result in a timeout if they go on for too long.
That might seem like a problem but there are much better ways to break things.
Poison Payloads
Configuring a sink with infinite retries and blocking buffers isn’t terrible in a case where you really want to guarantee eventual delivery. If you’re sinking events to S3 and S3 explodes, then the sink’s buffer will probably quickly become full. This causes backpressure: the sink blocks the components in front, eventually causing the source component to block, which then refuses to accept new requests. If end-to-end acknowledgements are turned on, then any established connection waiting for buffered events to flush will eventually time out.
The thinking, though, is this: S3 will eventually come back, and if the senders are smart, they’ll recognize that the timeouts/connection failures mean they need to re-attempt their own submissions into the pipeline. Once S3 is available, the buffer can flush, and the flow of events can return to normal.
But what if you could design a situation where a buffer can never be flushed?
Imagine your sink is not S3, but instead something like Datadog. When Vector receives a Datadog payload, it gives you the option to preserve the sender’s API key, which Vector can then use to sink it to the Datadog mothership. The API key is part of the log event itself (though not the actual log body). Delivery will of course fail if the API key is no good, and, well, that’s a problem for a blocking buffer with infinite retries, because it means the payload can never be delivered no matter how many times it retries. If a client is producing even a trickle of payloads with a “poison” API key, all of those events eventually get stuck in the sink buffer until the buffer contains only undeliverable payloads. Now the pipeline is blocked and will never unblock.
Ask me how I know.
This isn’t a problem unique to Vector, but what’s interesting about it is that once a payload reaches the sink, nothing about it can be modified. Prior to arriving at a sink, the payload (including ride-along secrets like the API key) can be modified, normalized, filtered, discombobulated (and recombobulated) by transforms, but once it hits a sink, it can only be buffered, flushed, or discarded.
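For concreteness, here’s roughly what that vulnerable passthrough looks like. It’s a hedged sketch with made-up component names; the comments spell out the defaults (blocking buffer, effectively-infinite retries) that make it dangerous.

```toml
# Hypothetical Datadog passthrough: agents send to Vector, Vector forwards
# to Datadog using whatever API key each sender supplied.
[sources.dd_agents]
type          = "datadog_agent"
address       = "0.0.0.0:8080"
store_api_key = true   # keep each sender's API key with its events

[sinks.dd_logs]
type            = "datadog_logs"
inputs          = ["dd_agents"]
default_api_key = "${DD_API_KEY}"  # fallback; per-event keys take precedence

# Nothing else is configured, so the defaults apply:
#   buffer:  500 events, when_full = "block"
#   retries: effectively infinite
# A trickle of events carrying a revoked or bogus API key will eventually
# fill the buffer with payloads that can never be delivered.
```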
Pipelines don’t have to be straight lines, so sometimes this blocking behaviour can be hidden behind a Y-connector:
source --> transform --> filter --> File sink
                                |
                                --> HTTP sink
                                |
                                --> S3 sink
Even if most of the sinks have been carefully configured with a finite number of retries or to discard events when full, it only takes one sink with a default configuration (blocking buffer of 500 events) becoming full to block the upstream components. Consider a full File sink in the diagram above: the filter can’t send events to the File sink, so the filter becomes full; the transform can’t send events to the filter, so the transform becomes full too; the source can’t send events to the transform, so the source becomes full and stops accepting requests.
This might seem counterintuitive since a single blocked register at the grocery store doesn’t block all other registers. Of course, most shoppers don’t multiplex themselves into three identical copies where the store’s overall occupancy is only reduced when all three copies check out.
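In config form, the trap might look something like this. Again a hedged sketch with made-up names: two of the sinks have been given escape hatches, but the File sink is left at its defaults and can dam the whole pipeline on its own.

```toml
# Hypothetical fan-out: one filtered stream feeding three sinks.
[sinks.http_out]
type   = "http"
inputs = ["my_filter"]
uri    = "https://example.com/ingest"   # made-up endpoint

[sinks.http_out.encoding]
codec = "json"

[sinks.http_out.buffer]
when_full = "drop_newest"   # sheds load instead of blocking

[sinks.s3_out]
type   = "aws_s3"
inputs = ["my_filter"]
bucket = "example-logs-bucket"
region = "us-east-1"

[sinks.s3_out.encoding]
codec = "json"

[sinks.s3_out.request]
retry_attempts = 5          # gives up eventually

[sinks.file_out]
type   = "file"
inputs = ["my_filter"]
path   = "/var/log/vector/out.log"

[sinks.file_out.encoding]
codec = "json"
# No buffer or request overrides here: a blocking buffer of 500 events and
# effectively-infinite retries. If this sink gets stuck, it fills up, then
# my_filter fills up, then the transform, then the source.
```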
Poison Payload Workarounds
The workaround that requires the least amount of over-engineering (but may not suit all workloads) is to ensure all sinks have a finite (and low!) number of retries, and to turn on end-to-end acknowledgements. Finite retries cause Vector to eventually discard poison payloads, and the end-to-end acknowledgements mean that the sender is informed of delivery failures. If the sender knows to treat HTTP 5xx as delivery failures, it will know to re-send events to Vector.
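Applied to the Datadog passthrough from earlier, that low-engineering fix might look roughly like this (still a sketch; the retry count is arbitrary):

```toml
[sinks.dd_logs]
type            = "datadog_logs"
inputs          = ["dd_agents"]
default_api_key = "${DD_API_KEY}"

# Give up on a payload after a handful of attempts instead of retrying forever.
[sinks.dd_logs.request]
retry_attempts = 3

# Hold the source's HTTP response until this sink succeeds (2xx) or
# permanently fails (5xx), so senders know to re-send later.
[sinks.dd_logs.acknowledgements]
enabled = true
```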
Those re-attempted submissions will also eventually fail, but in the meantime you can frantically wrench on the pipeline, maybe by using a transform to fix/normalize a payload, or even to strip out or replace a known-bad API key. Vector doesn’t (yet) have the concept of a DLQ (dead-letter queue), so at least with finite retries and end-to-end acknowledgements there is a chance that events can be replayed from their source, since they sure can’t be replayed from an infinite-retry buffer.
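That “wrench on the pipeline” step could be a remap transform slotted in front of the sink. This sketch assumes the sender’s key travels as event secret metadata under the name datadog_api_key and that VRL’s get_secret/set_secret functions are available; the known-bad and replacement keys are placeholders.

```toml
[transforms.fix_api_keys]
type   = "remap"
inputs = ["dd_agents"]
source = '''
# If the sender attached the known-bad key, swap in a good one so the
# payload becomes deliverable again.
if get_secret("datadog_api_key") == "KNOWN_BAD_KEY" {
  set_secret("datadog_api_key", "REPLACEMENT_KEY")
}
'''

[sinks.dd_logs]
type            = "datadog_logs"
inputs          = ["fix_api_keys"]   # the sink now reads from the transform
default_api_key = "${DD_API_KEY}"
```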
If you’re looking for a way to over-engineer things, though, the likely answer is to put an intermediary between Vector and the sink that can act as a DLQ/re-enqueuer of sorts. What kind of app would you write/deploy to do that? Probably more Vector.
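Sketching only the plumbing for that (and not the actual re-enqueue logic, which is the hard part): the main instance’s vector sink can forward events to a second Vector instance’s vector source, and the second instance owns the risky Datadog delivery. Names and addresses are made up.

```toml
# Main instance: hand events off to a downstream Vector rather than
# talking to Datadog directly.
[sinks.to_buffer_vector]
type    = "vector"
inputs  = ["dd_agents"]
address = "dlq-vector.internal:6000"   # made-up address

# --- second Vector instance (separate config/process) ---
[sources.from_main_vector]
type    = "vector"
address = "0.0.0.0:6000"

[sinks.dd_logs]
type            = "datadog_logs"
inputs          = ["from_main_vector"]
default_api_key = "${DD_API_KEY}"

[sinks.dd_logs.request]
retry_attempts = 3
```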
But I’d give the finite retry strategy a shot first.