
To what extent do these concerns change or disappear if using alternatives, like Pulsar or Redpanda?

It's been a while since I dug into this, but a couple of jobs ago I was very concerned about what happens when you replay a topic from 0 for a new consumer: existing, up-to-date consumers are negatively impacted! As I recall this was due to fundamental architecture around partitions, and a notable advantage of Pulsar was not having such issues. Is that correct? Is that still the case?



Assuming i) you've set topic retention so that old messages still exist and your new consumer can "replay from 0"; and ii) your new consumer is using its own consumer group, then existing consumers won't be impacted.
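
Roughly, all the new consumer needs is its own group.id and auto.offset.reset=earliest: with no committed offsets it starts from the beginning without touching anyone else's offsets or triggering rebalances in the existing groups. A minimal sketch (broker address and topic name are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayFromZero {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder address
            props.put("group.id", "new-replay-consumer");     // brand-new group: no rebalance elsewhere
            props.put("auto.offset.reset", "earliest");       // no committed offsets yet, so start at 0
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic"));      // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }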


Is there not resource contention when multiple consumer (groups) are reading from the same Kafka partition? You can of course over-provision your partitions to better allow for this, but rebalancing is not cheap and also tends to affect consumers.

I may be describing the problem incorrectly, but I know vendors we talked to were aware of this issue and had workarounds; IIRC Aiven had tooling to easily spin up a temporary new "mirror" cluster for the new consumer to catch up.


Your main resource contention there would be network throughput, then broker CPU time in iowait if there are a lot of concurrent large disk reads happening on the same broker (although Kafka's optimised for sequential reads).

Partition count only limits concurrent consumption within a single consumer group. One consumer group won't impact another unless its consumers are doing sufficiently bad things to bottleneck the network or disk.


> Your main resource contention there would be network throughput, then broker CPU time

The default consumer read sizes are so small that you'll hit broker CPU and worker thread limits long before network throughput becomes the bottleneck. (Both consumer batch sizes and broker thread counts can be increased trivially, but there's not much documentation around when to do this.)
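
For reference, the consumer-side knobs here are fetch.min.bytes / fetch.max.wait.ms (how much to batch per fetch) and fetch.max.bytes / max.partition.fetch.bytes (the caps), with num.network.threads and num.io.threads on the broker side. A rough sketch of the consumer side; the values are just illustrative, not recommendations:

    import java.util.Properties;

    // Illustrative fetch tuning; the values are examples, not recommendations.
    public class FetchTuning {
        public static Properties consumerFetchProps() {
            Properties props = new Properties();
            // Default fetch.min.bytes is 1, so the broker replies as soon as any data exists.
            // Raising it (together with a wait cap) turns many tiny reads into fewer, larger ones.
            props.put("fetch.min.bytes", "65536");    // e.g. wait for ~64 KiB...
            props.put("fetch.max.wait.ms", "500");    // ...or 500 ms, whichever comes first
            // Per-fetch and per-partition caps (defaults shown).
            props.put("fetch.max.bytes", String.valueOf(50 * 1024 * 1024));       // 50 MiB
            props.put("max.partition.fetch.bytes", String.valueOf(1024 * 1024));  // 1 MiB
            return props;
        }
    }
    // Broker-side thread counts are separate settings in server.properties:
    // num.network.threads and num.io.threads.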


The default fetch.min.bytes is 1, that's true, but the default fetch.max.bytes is 50 MiB.

And fetching small amounts of data repeatedly doesn't impose much overhead unless you're deliberately disconnecting and reconnecting between polls.

And I assure you, you can definitely bottleneck the network before CPU and/or network threads; Kafka was literally designed for very large numbers of consumers.

And as for tuning, Kafka: The Definitive Guide is pretty much what the name suggests. I've been recommending it very strongly for years, especially the chapters on monitoring and cluster replication.

You can download a free draft copy of the 2nd edition from Confluent, check it out :)


> but the default fetch.max.bytes is 50 MiB.

But the default max.partition.fetch.bytes is only 1 MiB.


Yep, it's aligned with the default max.message.bytes, the largest batch size the broker will accept.

1 MiB is allegedly the ideal batch size for throughput.
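
If you want batches anywhere near that size, it's mostly a producer-side thing: batch.size caps how big a batch can grow and linger.ms gives it time to fill. Rough sketch only; broker address and topic are placeholders, and the 1 MiB figure is just the claim above, not a measured optimum:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BigBatchProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder address
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            // Let a batch grow towards ~1 MiB (default batch.size is 16 KiB)...
            props.put("batch.size", String.valueOf(1024 * 1024));
            // ...and give it a little time to fill before it's sent.
            props.put("linger.ms", "20");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100_000; i++) {
                    producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), "payload-" + i));
                }
            }   // close() flushes whatever is still buffered
        }
    }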


Starting a new consumer group won't cause rebalances (except in the new group).

It sounds like you're asking if you can double the number of readers in a system with no performance impact. If you're at capacity, the answer is obviously no. Yes, every consumer takes some i/o and CPU on the brokers serving the data. I have never used Pulsar but I'm sure that's also the case there.


There's also the file system cache to consider, which Kafka famously leans on heavily for IO performance. If the majority of your consumers are reading the latest messages, which were just written, those reads will likely be served from the cache in memory. A consumer reading from the earliest messages on a large topic could conceivably change what's available in the file system cache for the other consumers reading the latest messages, so they're not necessarily totally isolated. I haven't measured this to say it's an actual issue, just that I wouldn't dismiss it.


FS cache is, at least for me, included in "i/o resources". But this utilization will occur for any consumer reading any partition of any topic from anywhere other than a tail segment that's actively being read (or even the tail segment of a partition that isn't being produced to); it's not specific to replaying from 0. And I don't believe you'd gain anything by turning on a mirroring cluster rather than increasing the number of brokers in the same cluster; in both cases you're solving it by spreading the i/o load out more.


Kafka special-cases the "tail" of a partition - the open log segment (i.e. the very end of the partition that's still being written to) is never evicted, and log segments closer to the tail are evicted last, IIRC.

It definitely prioritises tail consumption over reads from 0.


I think they were discussing rebalancing replicas when you increase partitions. Cruise Control is great here.


No, Pulsar has partitions too. And I'm surprised you saw a massive effect unless the brokers were under huge load.



