But then, twenty minutes after you leave, it begins. Errors. Fatal errors. And you're not around to know. So what happens?
In dt, we look out for one another. One of the ways we manage to do this is through an error logging service we've built called Logr. If you read our article on DRE, you'll remember we mentioned a variety of functions we use to generate messages within DRE. One of these is called dre_alert().
For example, you set up an alert in your code that gets triggered when a draft in Sta.sh Writer can't be loaded:
dre_alert("Failed to load draft");
And it seems to be happening every 5 seconds. Oops. It turns out the error is caused by a simple typo. But fortunately everyone in our team gets e-mailed when these severe types of errors are logged and since we're situated all over the world, someone is almost always around to be accountable for problems and emergencies. So yury-n kindly steps in, fixes the problem, and everyone is happy again, yay.
How does LOGR work?
We store both a message and a "payload" (any additional debugging information you pass along) along with host and backtrace information to help figure out the chain of events that led to the alert. All of this information gets sent to a Redis server. We rotate the data using a round-robin algorithm similar to RRDTool to keep the memory usage constant despite the fact that we store millions of different messages within LOGR.
Is it just for emergencies?
Nope. We use it for several things, often things that are recognized problems but not emergencies. Every time the alert happens, the incident gets grouped together with other alerts of the same type, so we can get a better picture of how frequently it's occurring, any patterns, or if there is a particular reason, using the payloads available, of why it's occurring.
We can tag our alerts however we want. So say we have 5 alerts set up for deviantART muro. We can tag each as so and then search for it later to find alerts related to that part of the site. Cool, right?
We also can set thresholds on alerts. Once established, if the incident occurs more than the number of times allowed, an e-mail will be sent to a particular developer.
For example, in this particular alert, $allixsenos is set up to receive an e-mail any time this particular alert happens more than 40 times in one hour:
Mousing over the chart, we can see a bit more information, like the exact number of occurrences that happened at different points in time.
And, as mentioned, we also get a nice payload of information:
With LOGR, we're able to get an idea of which of these alerts are exceeding their thresholds, which are happening frequently, which alerts are most recent, and when they were first seen. This helps us easily rule out which events were transitory or already resolved and which are continuing to happen and need fast attention.
We also generate histograms so that we can see a breakdown of how the error is occurring. This is available for any piece of data stored so that you can see different variations that occur. For example, we could see how the error is occurring, spread out across different servers:
But again, this isn't limited to just servers. It's available for any data stored in the payload.
We're always looking for ways to improve LOGR to help us spot problems with the site in a timely fashion to minimize the amount of impact it has on deviants so you can continue to enjoy the site as usual