What would you use for storing and querying long-term audit logs (e.g. 6 months ...

AJSDfljff · on July 11, 2024

I would first question whether the system needs to search with subsecond latency, and whether it needs to be the same system that handles 10k writes/sec.

Even Google Cloud and others make you wait for longer search queries. If it's not business critical, you can definitely wait a bit.

And the write system might not need to write the data in its final format, especially as it also has to handle transformation and filtering.

Nonetheless, as mentioned in my other comment, the interesting details of this are missing.

endorphine · on July 11, 2024

Let's say that it powers a "search logs" page that an end user wants to see. And let's say that they want the last 1d, 14d, 1m, or 6m.

So subsecond I would say is a requirement.

And no, it doesn't have to be the same system that ingests/indexes the logs.

AJSDfljff · on July 12, 2024

"So subsecond I would say is a requirement." You don't give any specific reason for how you came to that conclusion.

You can easily keep users engaged by showing them that the system is doing something in the background, without losing them. And if they are colleagues who actually need to search, you don't even need to retain them, as they have to use your setup anyway.

endorphine · on July 12, 2024

OK, let's say it needs to be <3s, for reasons.

bojanz · on July 11, 2024

You'll find many case studies about using Clickhouse for this purpose.

hipadev23 · on July 11, 2024

Do you know any specific case studies for unstructured logs on clickhouse?

I think achieving sub-second read latency for ad hoc text search over ~150B rows of unstructured data is going to be quite challenging without a high cost. Clickhouse’s inverted indices are still experimental.

If the data can be organized in a way that is conducive to the searching itself, or structured into columns, that’s definitely possible. Otherwise, I suppose you'd need a large number of CPUs (150-300) to split the job and just brute-force each search?
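To put rough numbers on the brute-force option, here is a back-of-envelope sketch. It assumes the 10k writes/sec mentioned earlier in the thread and treats "6 months" as 180 days. The function names are made up for illustration, and the result says nothing about what any particular engine can actually sustain.

```python
# Back-of-envelope numbers; every constant is an assumption taken from
# the thread or chosen for illustration.
WRITES_PER_SEC = 10_000   # ingest rate mentioned in the thread
RETENTION_DAYS = 180      # assumption: "6 months" is roughly 180 days
SECONDS_PER_DAY = 86_400


def total_rows(writes_per_sec=WRITES_PER_SEC, days=RETENTION_DAYS):
    """Rows retained if every write becomes one row."""
    return writes_per_sec * SECONDS_PER_DAY * days


def rows_per_core_per_sec(rows, cores, latency_budget_s):
    """Scan rate each core must sustain to finish a full scan in the budget."""
    return rows / (cores * latency_budget_s)


if __name__ == "__main__":
    rows = total_rows()
    print(f"total rows: {rows:,}")
    for cores in (150, 300):
        for budget_s in (1, 3):
            rate = rows_per_core_per_sec(rows, cores, budget_s)
            print(f"{cores} cores, {budget_s}s budget: {rate:,.0f} rows/s per core")
```

Under these assumptions the total is about 155 billion rows, in line with the ~150B figure above, and even 300 cores would each need to scan hundreds of millions of rows per second for a full-range query to finish in one second.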

buzer · on July 11, 2024

There is at least https://news.ycombinator.com/item?id=40936947 though it's a bit mixed in terms of how they handle schema.

SSLy · on July 11, 2024

not sure if an excellent joke or an honest mistake

buzer · on July 12, 2024

Let's go with the former, I definitely didn't mean to link https://www.uber.com/en-FI/blog/logging/ :)

SSLy · on July 11, 2024

What if I don't have such latency requirements? I'm willing to trade that for flexibility or anything else

jakjak123 · on July 11, 2024

10k audit logs per sec? I think we have different definitions of audit logs.

jjordan · on July 11, 2024

NATS?

packetlost · on July 11, 2024

NATS doesn't really have advanced query features though. It has a lot of really nice things, but advanced querying isn't one of them. Not to mention I don't know if NATS does well with large datasets. Does it have sharding capability for its KV and object stores?

Zambyte · on July 11, 2024

I use NATS at work, and I have had the privilege to speak with some of the folks at Synadia about this stuff.

Re: advanced querying: the recommended way to do this is to build an index out of band (like Redis (or a fork) or SQLite or something) that references the stored messages by sequence number. By doing that, your index is just this ephemeral thing that can be dynamically built to exactly optimize for the queries you're using it for.
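A minimal sketch of that out-of-band index idea, using SQLite. The stream is simulated with a plain dict keyed by sequence number so the example runs without a server. The field names (`actor`, `action`) and helper names are hypothetical, and a real setup would read messages from the stream instead.

```python
import json
import sqlite3

# Stand-in for a stream: sequence number -> raw message payload.
# Hypothetical data; a real setup would fetch these from the message store.
STREAM = {
    1: json.dumps({"actor": "alice", "action": "login"}),
    2: json.dumps({"actor": "bob", "action": "delete_user"}),
    3: json.dumps({"actor": "alice", "action": "delete_user"}),
}


def build_index(stream):
    """Build a throwaway index mapping queryable fields to sequence numbers."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx (seq INTEGER PRIMARY KEY, actor TEXT, action TEXT)")
    db.execute("CREATE INDEX idx_actor_action ON idx (actor, action)")
    for seq, payload in stream.items():
        msg = json.loads(payload)
        db.execute(
            "INSERT INTO idx VALUES (?, ?, ?)", (seq, msg["actor"], msg["action"])
        )
    db.commit()
    return db


def find_seqs(db, actor, action):
    """Return matching sequence numbers; the index stores no payloads."""
    rows = db.execute(
        "SELECT seq FROM idx WHERE actor = ? AND action = ? ORDER BY seq",
        (actor, action),
    )
    return [seq for (seq,) in rows]


def fetch_messages(stream, seqs):
    """Resolve sequence numbers back to the stored messages."""
    return [json.loads(stream[seq]) for seq in seqs]


if __name__ == "__main__":
    db = build_index(STREAM)
    seqs = find_seqs(db, "alice", "delete_user")
    print(seqs, fetch_messages(STREAM, seqs))
```

Because the index only holds sequence numbers, it can be dropped and rebuilt from the stream whenever the query patterns change.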

Re: sharding: no, it doesn't support simple sharding. You can achieve sharding by standing up multiple NATS instances, making a new stream (KV and object store are also just streams) on each instance, and capturing some subset of the stream on each instance. The client (or perhaps a service querying on behalf of the client) would have to be smart enough to mux the sources together.
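The client-side muxing could look roughly like the sketch below. It assumes each shard yields entries already ordered by timestamp. The shards are simulated with in-memory lists, and the `Entry` type and `mux` helper are hypothetical names, not part of any NATS client.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    timestamp: float
    shard: str
    seq: int
    payload: str


def mux(shards: Iterable[Iterable[Entry]]) -> Iterator[Entry]:
    """Lazily merge per-shard streams into one stream ordered by timestamp.

    Each input must already be sorted by timestamp.
    """
    return heapq.merge(*shards, key=lambda entry: entry.timestamp)


# Hypothetical data standing in for two separate instances.
SHARD_A = [Entry(1.0, "a", 1, "login"), Entry(4.0, "a", 2, "logout")]
SHARD_B = [Entry(2.0, "b", 1, "delete_user"), Entry(3.0, "b", 2, "login")]

if __name__ == "__main__":
    for entry in mux([SHARD_A, SHARD_B]):
        print(entry.timestamp, entry.shard, entry.seq, entry.payload)
```

`heapq.merge` holds only one pending entry per shard in memory, so the same approach works when each shard is a long-running iterator instead of a list.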

packetlost · on July 11, 2024

Does it handle clustering/redundancy for the data stored in KV/object store? My intuition says yes because I believe it supports it at the "node" level

Zambyte · on July 11, 2024

Yes. When you create a stream (including a KV or object store) you say what cluster you want to put it on, and how many replicas you want it to have.

packetlost · on July 12, 2024

Very cool, I'll have to keep that in mind next time I'm in need of something similar!