Logging is one of those things that remains mostly an afterthought for a lot of languages, frameworks, and engineers.
It's not that hard to get organized. Here's what I've been doing for the last ten years on most of my projects (Java & Kotlin mostly, but you should be able to do this for anything).
1) Log levels matter. Debug/tracing is disabled in production. Info is informational only and should not have a signal-to-noise ratio that gets annoying (people doing debug logging at info level). Warning frequency should be low: if you are not going to fix it, it's not worthy of a warning. Errors should cause alerts and people to be woken up. Simple rule, but only if you enforce it. Don't log at error level unless it really is worth waking somebody up for (e.g. me). An error is not "something totally expected happened but I could not be arsed to think about handling that in a sane way". I've seen projects that routinely log thousands of errors per hour and never fix anything until after customers start yelling on the phone. Log levels are totally irrelevant in such projects. Nobody bothers to look at those errors. They are no longer actionable. If errors are normal and expected behavior, how do you tell when something abnormal and unexpected happens? You can't, unless you make people do something about those errors and create a culture where having errors is simply not an acceptable state for the product to be in. Life is great when you do that. We have zero errors on most days. When they do happen, it's usually because something changed. And then we fix that and it goes quiet again. Simple rule to enforce. Generates very little work. But you have to enforce it.
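To make the rule concrete, here's a minimal sketch using `java.util.logging` from the stdlib (its FINE roughly corresponds to debug and SEVERE to error; Log4j/Logback work the same way with different names). A production logger set to INFO simply drops the debug chatter; the handler here just collects records so you can see what survives the filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogLevels {
    // Collects published records so we can see what survives the level filter.
    public static final List<LogRecord> published = new ArrayList<>();

    public static Logger productionLogger() {
        Logger log = Logger.getLogger("app");
        log.setUseParentHandlers(false);
        for (Handler h : log.getHandlers()) log.removeHandler(h); // idempotent setup
        log.setLevel(Level.INFO);           // production: debug/trace disabled
        log.addHandler(new Handler() {
            @Override public void publish(LogRecord r) { published.add(r); }
            @Override public void flush() {}
            @Override public void close() {}
        });
        return log;
    }

    public static void main(String[] args) {
        Logger log = productionLogger();
        log.fine("debug detail");             // FINE ~ debug: dropped in production
        log.info("user signed up");           // INFO: informational, low volume
        log.warning("retrying flaky call");   // WARNING: rare, and worth fixing
        log.severe("payment pipeline down");  // SEVERE ~ error: wake someone up
        System.out.println(published.size() + " records published");
    }
}
```

The FINE entry never reaches the handler; only the three INFO-and-above entries do.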
2) Java logging frameworks have something called a mapped diagnostic context (MDC). This is great. Basically it means every log entry can carry a context where you keep track of things about e.g. the current request: headers, user agents, IP addresses, session ids, etc. Why don't other languages have this? I don't know. Seriously, how is this not a thing for every web development framework worthy of the name? Why would you not want to know this information when something happens?
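For languages that lack one, the core idea is small enough to sketch yourself. This is a minimal re-implementation of what SLF4J's `org.slf4j.MDC` does: a per-thread key/value map that a log formatter attaches to every entry emitted while handling a request.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a mapped diagnostic context (MDC): per-thread key/value
// pairs that a formatter can attach to every log entry. SLF4J/Logback ship
// a real one; this just shows the mechanism.
public class Mdc {
    private static final ThreadLocal<Map<String, String>> ctx =
            ThreadLocal.withInitial(HashMap::new);

    public static void put(String key, String value) { ctx.get().put(key, value); }
    public static String get(String key) { return ctx.get().get(key); }
    public static void clear() { ctx.get().clear(); }  // call at end of request

    // What a formatter would do: prepend the context to the message.
    public static String format(String message) {
        return ctx.get() + " " + message;
    }

    public static void main(String[] args) {
        // At the start of handling a request, stash the identifying bits once...
        Mdc.put("requestId", "r-42");
        Mdc.put("userAgent", "curl/8.0");
        // ...and every log line in that thread gets them for free.
        System.out.println(Mdc.format("order created"));
        Mdc.clear();
    }
}
```

The `clear()` at request end matters in practice: on thread pools, a stale context would leak one request's attributes into the next.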
3) Logging messages are structured data, whether you like it or not. Plain text is a shitty way to represent structured data. If you can, log in JSON format. You have timestamps, logger names, log levels, attributes in your MDC, attributes coming from your server environment like the host name, service name, etc. All of it is relevant.
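A hand-rolled sketch of what such a line looks like. The field names (`ts`, `level`, `message`) are illustrative, not a standard, and in a real setup you'd use the JSON encoder that ships with your logging framework rather than building strings by hand:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a structured (JSON) log line: timestamp, level, message, plus
// MDC and environment attributes as extra fields. Escaping here is minimal
// (quotes and backslashes only); real encoders handle the full JSON spec.
public class JsonLine {
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static String logLine(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"ts\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(escape(level)).append("\",");
        sb.append("\"message\":\"").append(escape(message)).append("\"");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("host", "web-1");       // from the server environment
        fields.put("requestId", "r-42");   // from the MDC
        System.out.println(logLine("INFO", "order created", fields));
    }
}
```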
4) Tailing and grepping plain text logs simply does not scale. It's what you have to do when your ops team is too incompetent to set up proper logging. Usually goes hand in hand with having snowflake servers that people ssh into. It's actually the #1 excuse for having boxes you can ssh into to begin with. Solution: logs go into a data store that allows you to filter on this data. Not having this is the equivalent of running blind. Completely and utterly unacceptable. Most cloud environments come with a reasonably OK logging console, but you might want to upgrade to something with a bit more querying capability. Done right, even those default logging consoles can be capable enough.
5) Your logs should have alerts on them. If errors happen, alerts should happen and people should do things about those errors. If logs go silent when they shouldn't, alerts should happen because something is probably broken. If there's a weird spike in logging volume, alerts should happen. Alerts should be actionable and exceptional. If something is alerting all the time, nobody will check when something important actually does happen. Tricky to get right, but once you do, you can react quickly to any incident.
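The three alert triggers above can be sketched as a single predicate evaluated over a counting window against a recent baseline. The thresholds here are made-up placeholders you'd tune to your own traffic, not recommendations:

```java
// Sketch of the three alert conditions: any errors, logs gone silent, or a
// weird volume spike. Counts would come from whatever store holds your logs;
// the 10x spike factor is an arbitrary placeholder.
public class LogAlerts {
    public static boolean shouldAlert(long errorsInWindow,
                                      long linesInWindow,
                                      long baselineLines) {
        if (errorsInWindow > 0) return true;                      // errors are exceptional
        if (baselineLines > 0 && linesInWindow == 0) return true; // gone silent
        if (linesInWindow > baselineLines * 10) return true;      // volume spike
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(0, 1000, 1200));  // normal traffic: no alert
        System.out.println(shouldAlert(3, 1000, 1200));  // errors happened: alert
        System.out.println(shouldAlert(0, 0, 1200));     // logs went silent: alert
        System.out.println(shouldAlert(0, 50000, 1200)); // volume spike: alert
    }
}
```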
Deliberately keeping product names out of this. There are plenty of libraries, tools and products that allow you to do this properly. That most likely includes your preferred software stack. And if it doesn't, use something more production-ready that does. Or fix it (not that hard usually).
I'd characterize logs as a poor tool for doing the two tasks people use them for: investigating the state of the server, or investigating execution of a request. I instead am a strong believer in separate tools for those two tasks.
Server state should be exposed through metrics. Metrics have far fewer sharp edges than logs, and it's more obvious how to correctly produce, consume, and alert on them. I've seen (variations of) your 5 action items needed for the logs of every company I've worked at, but they've never applied to metrics.
Executions should be exposed through tracing. I'm kind of cheating here: I expect the traces to have logs attached. But a well-done tracing system, where a developer can add a flag to their Postman query and have their request traced with the debug level set only for that request, is a magical thing.
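A sketch of how that per-request override can work under the hood: an incoming flag flips a per-thread level threshold, so the traced request logs at debug while the rest of production stays at info. The `X-Debug-Trace` header name and this tiny API are assumptions for illustration, not any particular tracing product's interface:

```java
// Sketch: per-request debug logging via a thread-local level override.
// The X-Debug-Trace header is a made-up name; real systems also propagate
// the flag across service boundaries along with the trace context.
public class PerRequestDebug {
    public enum Level { DEBUG, INFO }

    private static final ThreadLocal<Level> threshold =
            ThreadLocal.withInitial(() -> Level.INFO);

    public static void onRequestStart(java.util.Map<String, String> headers) {
        if ("1".equals(headers.get("X-Debug-Trace"))) threshold.set(Level.DEBUG);
    }

    public static boolean isEnabled(Level level) {
        return level.compareTo(threshold.get()) >= 0;  // enum order: DEBUG < INFO
    }

    public static void onRequestEnd() { threshold.remove(); }  // don't leak to pooled threads

    public static void main(String[] args) {
        System.out.println(isEnabled(Level.DEBUG)); // normal request: debug off
        onRequestStart(java.util.Map.of("X-Debug-Trace", "1"));
        System.out.println(isEnabled(Level.DEBUG)); // flagged request: debug on
        onRequestEnd();
    }
}
```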
> Solution: logs go into a data store that allows you to filter on this data.
Getting to know the operating system's remote logging machinery can get you very far on this. It's amazing how often people basically duplicate it, writing logs to text files or database tables instead of hooking up to the tooling that comes with the OS.
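For example, with rsyslog (a common Linux syslog daemon), forwarding everything to a central collector is a one-line config rather than a custom script. The host and port here are placeholders:

```
# /etc/rsyslog.d/forward.conf -- forward all facilities/priorities to a
# central collector (single @ = UDP, double @@ = TCP)
*.*  @@logs.example.internal:514
```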
We used to call it Perl Programmer's Disease: at some point every Perl programmer in the late 1990s wrote a script to send Apache logs to a remote host because doing that was faster than learning how to make Apache log to a remote host directly.
> log levels matter. Logging messages are structured data.
Swift has those [0][1] and other features, like jumping to the file and line of code from where the log was generated, but I wish there was a way to easily add extra information to each message in the debug console, such as the current frame being rendered. Something I've been wrestling with for the past few days; if I write a custom logging function, the IDE's debug console thinks every log message was generated from my custom function.
And god I wish we started making use of COLOR within text-heavy information. Being able to color different words/values in a log message would massively improve readability and comprehension.
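In a terminal this doesn't need framework support; plain ANSI escape codes will do. A minimal sketch (the escape codes are the standard ANSI ones, the level-to-color mapping is my own choice, and this assumes a terminal that renders them):

```java
// Sketch: coloring parts of a log line with ANSI escape codes, so the level
// and the interesting value stand out from the surrounding text.
public class ColorLog {
    static final String RED = "\u001b[31m", YELLOW = "\u001b[33m",
                        CYAN = "\u001b[36m", RESET = "\u001b[0m";

    public static String colorize(String level, String message, String value) {
        String levelColor = level.equals("ERROR") ? RED : YELLOW;  // arbitrary mapping
        return levelColor + level + RESET + " " + message + " "
                + CYAN + value + RESET;  // highlight the value being discussed
    }

    public static void main(String[] args) {
        System.out.println(colorize("ERROR", "frame dropped at", "frame=1042"));
    }
}
```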
I only used it once for a class in grad school, but it's things like this that make Swift feel like a really well intentioned programming language, especially paired with the Xcode ecosystem.
You've made a lot of good points. I've stepped into a team that is supporting a large product that has been going for years. There are so many error logs and alerts that nobody notices them any more; it's so frustrating.
- I (as the CTO) get grumpy when I get alerted for nothing or spammed with non-stop alerts. And I see all the alerts. Basically that means I tell people to get their act together (or lead by example). In fairness, it's quite often me that made the changes that caused me to get alerted and grumpy. This is not about finger pointing but about it genuinely being annoying to have to deal with this. This is a necessary level of pain that you seek to minimize.
- I get more grumpy when I don't get alerted when the thing actually breaks. This means I have to explain to others why shit was broken for hours/days on end without me doing anything about it. "The dog ate my homework" doesn't quite cut it here. I'm responsible, so I need to know.
The balance here is making sure every error gets logged, and then making sure that everything that does get logged gets resolved in a way that makes the problem go away permanently. It's either a bug (fix it), an infrastructure failure (fix it), or something that isn't an error (so fix it so that it no longer logs as one).