What would you use for storing and querying long-term audit logs (e.g. 6 months retention), which should be searchable with subsecond latency and would serve 10k writes per second?
AFAICT this system feels like a decent choice. Alternatives?
"So subsecond I would say is a requirement." You don't give any specific reason for how you came to that conclusion.
You can easily keep users engaged by showing them that the system is doing something in the background, without losing them. And if they are colleagues who actually need to search, you don't even need to retain them, since they have to use your setup anyway.
Do you know of any specific case studies for unstructured logs on ClickHouse?
I think achieving sub-second read latency for ad hoc text search over ~150B rows of unstructured data is going to be quite challenging without a high cost. ClickHouse’s inverted indices are still experimental.
If the data can be organized in a way that is conducive to the search itself, or structured into columns, that’s definitely possible. Otherwise, I suppose you could use a large number of CPUs (150-300) to split the job and just brute-force each search?
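For what it's worth, the ~150B figure and the brute-force idea can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes "6 months" means 180 days and that every search has to scan every row within one second. The function name is made up for illustration.

```python
# Back-of-the-envelope sizing from the numbers in this thread.
WRITES_PER_SECOND = 10_000
RETENTION_DAYS = 180  # assumption: "6 months" taken as 180 days
SECONDS_PER_DAY = 86_400

total_rows = WRITES_PER_SECOND * SECONDS_PER_DAY * RETENTION_DAYS


def rows_per_core_per_second(rows: int, cores: int, latency_s: float = 1.0) -> float:
    """Scan rate each core must sustain to brute-force all rows within latency_s."""
    return rows / cores / latency_s


if __name__ == "__main__":
    print(f"total rows: {total_rows:,}")
    for cores in (150, 300):
        rate = rows_per_core_per_second(total_rows, cores)
        print(f"{cores} cores -> {rate:,.0f} rows/s per core")
```

Under those assumptions that is about 155.5B rows, so each of 150-300 cores would need to scan roughly 0.5-1 billion rows per second to answer in a second, which is why organizing the data so that most of it can be skipped matters so much.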
NATS doesn't really have advanced query features though. It has a lot of really nice things, but advanced querying isn't one of them. Not to mention, I don't know if NATS does well with large datasets. Does it have sharding capability for its KV and object stores?
I use NATS at work, and I have had the privilege to speak with some of the folks at Synadia about this stuff.
Re: advanced querying: the recommended way to do this is to build an index out of band (like Redis (or a fork) or SQLite or something) that references the stored messages by sequence number. By doing that, your index is just this ephemeral thing that can be dynamically built to exactly optimize for the queries you're using it for.
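To make the out-of-band index idea concrete, here is a minimal sketch using an in-memory SQLite database. The `STREAM` list is just a stand-in for messages read from a NATS stream, and `build_index` / `find_sequences` are hypothetical helper names, not anything from NATS itself. The index stores only (field, value, sequence number), so it can be thrown away and rebuilt whenever your queries change.

```python
import json
import sqlite3

# Stand-in for a stream: (sequence number, raw payload). In a real setup these
# would be read from the NATS stream; this list is purely illustrative.
STREAM = [
    (1, b'{"user": "alice", "action": "login"}'),
    (2, b'{"user": "bob", "action": "delete"}'),
    (3, b'{"user": "alice", "action": "logout"}'),
]


def build_index(messages):
    """Build a throwaway index mapping (field, value) -> stream sequence number."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx (field TEXT, value TEXT, seq INTEGER)")
    db.execute("CREATE INDEX idx_field_value ON idx (field, value)")
    for seq, payload in messages:
        for field, value in json.loads(payload).items():
            db.execute("INSERT INTO idx VALUES (?, ?, ?)", (field, str(value), seq))
    db.commit()
    return db


def find_sequences(db, field, value):
    """Return the sequence numbers of messages whose `field` equals `value`."""
    rows = db.execute(
        "SELECT seq FROM idx WHERE field = ? AND value = ? ORDER BY seq",
        (field, value),
    )
    return [seq for (seq,) in rows]


if __name__ == "__main__":
    db = build_index(STREAM)
    print(find_sequences(db, "user", "alice"))
```

The query only returns sequence numbers; you would then fetch the actual messages from the stream by sequence, so the stream stays the source of truth.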
Re: sharding: no, it doesn't support simple sharding. You can achieve sharding by standing up multiple NATS instances and making a new stream (KV and object store are also just streams) on each instance, with each one capturing some subset of the data. The client (or perhaps a service querying on behalf of the client) would have to be smart enough to mux the sources together.
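Roughly, the client-side part could look like the sketch below. It doesn't talk to NATS at all: `shard_for` and `mux` are hypothetical helpers, the shard count is made up, and each shard's results are assumed to be (timestamp, message) tuples already sorted by timestamp.

```python
import heapq
import zlib

NUM_SHARDS = 3  # hypothetical number of independent NATS instances


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard deterministically from a key (e.g. a tenant or user id)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def mux(*shard_results):
    """Merge per-shard results, each sorted by timestamp, into one sorted iterator."""
    return heapq.merge(*shard_results, key=lambda rec: rec[0])


if __name__ == "__main__":
    shard_a = [(1, "a"), (4, "d")]
    shard_b = [(2, "b")]
    shard_c = [(3, "c")]
    for ts, msg in mux(shard_a, shard_b, shard_c):
        print(ts, msg)
```

Writers use `shard_for` to decide which instance gets a message, and readers query every instance and merge. `heapq.merge` is lazy, so the merged view doesn't need all results in memory at once.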
Does it handle clustering/redundancy for the data stored in the KV/object store? My intuition says yes, because I believe it supports it at the "node" level.