How To Monitor RudderStack With Grafana Alerts
Ryan McCrary
Product Manager at RudderStack
Benji Walvoord
When something goes wrong in your data pipelines, you need actionable information in front of you fast. With RudderStack and Grafana, you can set up alerts in the tools you already use, like Slack and Microsoft Teams. In this technical session, our customer success team will show you how.
- Setting up Grafana
- How to set up RudderStack alerts with Grafana
- Distributing alerts to downstream tools, including RudderStack’s webhook source
Brooks (00:00)
Welcome, everyone. Thank you so much for joining us today. I'm Brooks, and I'm on the growth team here at RudderStack. We're really excited to bring you a session today on a little-known but extremely useful feature in RudderStack. We've got our customer success team here today. Ryan McCrary and Benji Walvoord are going to teach y'all how to use Grafana with RudderStack to set up alerting.
Ryan McCrary (00:31)
Yep. I'm Ryan. I work on the customer success engineering team here at RudderStack, so I'm primarily working with proofs of concept and ongoing customers, providing technical resources to them.
Benji Walvoord (00:45)
Hey everybody, I'm Benji. I'm on the marketing and technical content development team. I help produce some of the content here at RudderStack and also help manage our own stack internally.
Ryan McCrary (00:58)
Cool. Well, today we're going to briefly go through, as Brooks mentioned, something that's a part of RudderStack that a lot of folks might not be aware of, and how to use it. We'll quickly go through what Grafana is, setting it up, setting up alerts via Grafana, and then how to distribute those into downstream tools so that users are aware of what's going on. So, a quick little overview. This is a basic diagram of a RudderStack data plane. As most of you are probably aware if you've made your way to this video, RudderStack doesn't store any customer data. As the events pass through RudderStack, through the Rudder server piece that you see there in the middle, we're processing those into the downstream destinations. And then as soon as they're processed, we drop those tables completely.
Ryan McCrary (01:44)
So we have no record of those long-term within RudderStack or within the data plane. We do, however, persist metrics around those. You can see that the bottom part is specific to monitoring. We pipe a number of metrics, via Telegraf, into an InfluxDB that's embedded in the RudderStack instance. These metrics could be anything, from as high-level as the response codes of events sent to downstream destinations via their APIs, or errors at ingestion, all the way down to pretty low-level stuff, as minuscule as: how long is the loop time? How long are these events taking to be stored in the Postgres instance in RudderStack? How long is it taking to process internally? It can really get as granular as you want.
Ryan McCrary (02:35)
So there are a lot of different uses for these, but they're all just dumped into Influx. We have a number of other services that you'll see there as well that interact with the metrics we store in Influx. Kapacitor is what we use for alerting. That alerts our internal infrastructure team to errors within our customers' instances so that we can be aware of issues that are going on and whether we need to step in and debug anything. But that's pretty noisy, right? Most of our customers are not interested in seeing every single alert, a lot of which are about very internal parts of RudderStack and things that they honestly can't even act on with their own instance when it's hosted by RudderStack. Grafana is embedded in the RudderStack instance as well, and that's what we're going to use as basically a visualization layer.
Ryan McCrary (03:17)
So that's going to show us how we can look at these metrics from a high level all the way down into some of those more granular things, depending on our business use cases. So, setting up Grafana. This is a screenshot of what you will often see as the default dashboard when you sign into RudderStack. This is a dashboard that we have built previously around some pretty high-level metrics that may be of interest to customers. It gets more granular as you go down, but at the top you'll see that we're showing, at a very basic level, the received requests, the received events (as those are batched as they come in), and then the delivered events. So this is, like it says, a high-level view, a big picture of what's happening in RudderStack.
Ryan McCrary (04:00)
Below that, you'll see just the status of your RudderStack server and then your gateway, which is stood up as a separate service. And then as you go on down, you'll see the delivery to the individual destinations and even the number of tables that are being stored under the hood. Again, this is out of the box. This is what we build and include with a default paid RudderStack plan. So you have access to this, but it's just a jumping-off point. You can really build a panel around whatever you'd like. Any of those metrics that we are piping into Influx can eventually be turned into a table, and then you can set up alerts around them. The reason we don't do that out of the box for you is that they're highly dependent upon your business use cases.
Ryan McCrary (04:44)
And those thresholds have to be hardcoded into the alerts from Grafana. So it's not really a one-size-fits-all approach. There's a lot of tuning that needs to be done around (A) what are you worried about with your business use cases, and (B) what are the thresholds at which we want to be notified that something's going on? So I'm going to toss it over to Benji, who is in a live Grafana dashboard and is going to show us a little bit more of how these queries are built and then how we can use them to actually generate alert queries as well.
Benji Walvoord (05:13)
Yeah. So what we're looking at here is a sample dashboard of our development instance, and we've gone in and edited one particular panel to show you how these panels are just individual queries on the Influx database. One of the nice things about Grafana is that you can set up individual queries. And if we disable this, you can see that this is just a pretty basic SQL statement on the Influx database. You can create as many queries as you want, and they'll render in the same chart, but to create these queries, you just build this select statement and select the metric that you're interested in. In this case, we want a panel that's showing us failed processor alerts, and failed processor alerts mean that the data plane failed to send these events to their downstream destinations.
Benji Walvoord (06:17)
And this is a pretty important thing to monitor, because if you're getting failed events, such as leads being sent to Salesforce or, in our case, events being sent to a Facebook Pixel, this is something that you want to know about and be alerted about. So for this example, again, we're going to do a count of failed processor alerts. We select that particular metric. And for alerts, you do have to select a specific instance, so we'll just select our development environment. We want to get a sum. We want a total number of failed events. And because these are fairly time-sensitive, we're going to keep our time interval to one minute.
Benji Walvoord (07:08)
One of the things that we also might find useful is not just the total number of events, but events by destination. So again, we can just create another group by selecting our destination name, and we're going to segment this one step further and break it down by the stage. What that will tell us is whether these events are failing within the user transformation within RudderStack, or further downstream at the destination transformer. That just comes in handy when we're trying to troubleshoot why these errors are happening. And so if we look at these, we can see how these events are rendering, and we can see, all right, we're seeing some errors at our Facebook Pixel, and a Slack error message.
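Editor's sketch of the kind of query Benji builds in the panel editor: a sum over one-minute buckets, filtered to one instance and grouped by destination and stage. The measurement name `processor_failed_events`, the field `value`, and the tag keys `instanceName`, `destName`, and `stage` are placeholders invented for illustration, not RudderStack's actual metric names. Check the names that appear in your own Grafana query editor.

```python
def build_failed_events_query(measurement, instance, interval="1m",
                              group_tags=("destName", "stage"), window="1h"):
    """Return an InfluxQL string that sums a metric per time bucket and tag.

    All measurement, field, and tag names are hypothetical placeholders.
    """
    # GROUP BY time(<interval>) plus one entry per tag we want to segment on.
    group_by = ", ".join([f"time({interval})"] + [f'"{tag}"' for tag in group_tags])
    where = f"\"instanceName\" = '{instance}' AND time > now() - {window}"
    return (
        f'SELECT sum("value") FROM "{measurement}" '
        f"WHERE {where} "
        f"GROUP BY {group_by} fill(0)"
    )


query = build_failed_events_query("processor_failed_events", "dev-instance")
print(query)
```

Dropping `"stage"` from `group_tags` gives the coarser per-destination view described first, and passing an empty tuple gives the single total.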
Benji Walvoord (08:05)
And so we might now want to create an alert that says, okay, when these events happen, or when a certain volume of these events happens, send us a message or alert us somehow. We do that by clicking on the alert tab, and we can name this. Again, this is a fairly time-sensitive issue, so we're going to evaluate this every minute. For general monitoring, you might set an alert at every 30 minutes or every hour if you just wanted to keep tabs on event volumes. Or if you're monitoring event volume averages over time and you want to see if something's getting outside of a standard deviation or some norm, you might have that set to every hour, every six hours, or even every 24 hours. But for the sake of this demo, we're going to set it at one minute to fire these alerts.
Benji Walvoord (09:05)
And again, the alert is just going to create another query on top of the query we just created. We define the condition as: when the total number of events from query A, which we created before, between one minute ago and now, is above seven. So anytime we get more than seven failed events in one minute, it's going to trigger an alert. Within our Grafana setup, and this is a default setting, we want to get an alert notification when that happens. And then when our error level goes back below the alert level, we want it to send us an OK message. You can see those represented in the red lines and the green lines here.
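Editor's sketch of the alert rule described here, written as plain Python rather than Grafana configuration: evaluate once per minute, go to an alerting state when the one-minute sum is above seven, and send an OK message when it drops back. The function name and the sample counts are made up for illustration; Grafana performs this evaluation itself.

```python
def evaluate_alert(per_minute_counts, threshold=7):
    """Return the state changes that would each produce a notification.

    Each entry is (minute_index, new_state, count). A notification is only
    sent when the state changes, mirroring the alert and OK messages.
    """
    state = "ok"
    transitions = []
    for minute, count in enumerate(per_minute_counts):
        # Strictly above the threshold, so exactly 7 errors stays OK.
        new_state = "alerting" if count > threshold else "ok"
        if new_state != state:
            transitions.append((minute, new_state, count))
            state = new_state
    return transitions


counts = [2, 5, 9, 3, 7, 8, 8, 1]  # hypothetical failed events per minute
transitions = evaluate_alert(counts)
for minute, new_state, count in transitions:
    print(f"minute {minute}: {new_state} (count={count})")
```

Raising or lowering `threshold` is the code equivalent of dragging the threshold line on the chart: the same data produces more or fewer notifications.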
Benji Walvoord (10:03)
And so when we define these notifications, really the next step is just to define, okay, where do we actually want to send these? You can send them from within Grafana itself. You can natively send these messages to Slack, so we created a Slack channel. You can send them to Microsoft Teams, to PagerDuty, to Datadog, and there are a number of other avenues. Or you can send them to RudderStack and on to your database, your data warehouse, or other analytics tools.
Ryan McCrary (10:44)
Yeah. And something to call out real quick. As I mentioned earlier, these alerts and charts are kind of hard for us to build as an out-of-the-box solution because, as you've shown, they're very dependent upon the setup that we're using and the metrics that we're interested in. This is a test environment, so you can see that there is some data flowing through, but it's really nice to be able to visualize the ups and downs of this specific metric and see how we're going to tune this. I mean, this seven number is nothing magic, right? Benji and I didn't get together and say, man, eight failures is where we're really going to be mad about this. You can see, as Benji drags it, it's really tuned around, okay, what are the significant peaks where we might be concerned? What's out of the norm?
Ryan McCrary (11:33)
And so that norm is going to be very different for any type of customer, any type of setup, and really the business goals that we have associated with that. As Benji is moving that up and down, you can see that this is something that may need to be adjusted over time as well. That's pretty common with a lot of our customers: set up one of these baseline alerts, and then you'll know pretty quickly whether they're valuable or need to be tuned. Are you getting alerted too often? What Benji and I had on earlier for testing was probably a bit too noisy to be useful in a production environment, but it was very good for building out this use case. So, Benji, on processor errors specifically, we don't have to get too far into it, but this would be something that would be of concern to you as someone that's consuming the data through RudderStack. Why would that be?
Benji Walvoord (12:20)
Sure. Well, for one example, let's say that you've got a JavaScript SDK passing new lead records to a Salesforce instance. You want to have a very high sensitivity on those records. If those start to fail, that means you're not getting new leads in Salesforce. That's something that you definitely want to be alerted about very quickly and want to respond to. On the flip side, in this test environment, where we're getting failed error messages to a Facebook Pixel, that might indicate that, okay, someone's testing out a new form or testing out new calls, and we need to filter those out in a user transformation or in the destination transformer, or just tell them to knock it off, or add those events to the tracking plan. But that's for another webinar.
Ryan McCrary (13:24)
Yeah. And so this is a good example of a way to use this almost like test-driven development. These errors that are being shown, we were able to pick something noisy just for demonstration, but these are to the Facebook Pixel. And these are likely because the Facebook Pixel destination doesn't support page calls. So we're able to check in our development environment and see, okay, these are being raised just because we're not filtering, we're not cleaning our data before sending it through. This isn't actually causing an issue in the downstream tool. It's actually perfectly fine. Those are being filtered out. The error message is just that we don't support this message type in that destination. And so we can either use this to say, okay, this is super noisy.
Ryan McCrary (14:05)
This is not a valuable alert. Or we could say, maybe just to tune our own alerts, that in our instrumentation we don't want to send those specific events, those page views, to Facebook. Or we could set up a user transformation that's going to filter those out, reduce the noise in this chart, and make it a more valuable metric for us. This is one of the reasons this is so extensible, from a Grafana standpoint but also from a RudderStack one: we see these alerts internally regardless. Our team is acknowledging those "message type not supported" alerts when we see them, but that's why we don't pipe everything by default to the customer. A lot of these things are just noise at the end of the day and would just be suppressed anyway. So this is a good way to get a gut check on a new integration like this. Then we can start to filter out some of that noise and get to a better place. And then, as Benji showed, you can tune those alerts into something that's more meaningful from the customer perspective.
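Editor's sketch of the filtering idea Ryan describes: drop page calls before they reach a destination that does not support them, so they stop showing up as failed events. This is plain Python showing the logic only. RudderStack transformations have their own function signature and event schema, so the function name and the minimal event dictionaries below are assumptions made for illustration.

```python
def drop_page_calls(event):
    """Return None for page calls so they are not forwarded; pass the rest through.

    Assumes each event is a dict whose "type" key holds the call type.
    """
    if event.get("type") == "page":
        return None
    return event


# Hypothetical events headed to a destination that rejects page calls.
events = [
    {"type": "page", "name": "Home"},
    {"type": "track", "event": "Lead Created"},
    {"type": "page", "name": "Pricing"},
]
forwarded = [e for e in map(drop_page_calls, events) if e is not None]
print(f"forwarded {len(forwarded)} of {len(events)} events")
```

Filtering at this stage removes the noise at its source, which is different from raising the alert threshold to hide it.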
Benji Walvoord (15:00)
Sure. And like we're showing here on the screen now, there are different ways to go: you can reduce the noise, or you can solve the problem. And that's not an either/or. Depending on the environment that you're in, the sources and destinations, and the uses of that data, you can filter these alerts to exclude what you know are noisy channels that you can't really affect or don't really care about, or you can solve the problem upstream through tracking plans or through working with the developer team.
Ryan McCrary (15:40)
Yep, exactly. Cool. So, again, Benji showed how we build these charts around the metrics. And then, jumping back over to the alert setup, we're able to set the alert on a query as well. And there we go, we can see that chart again. As Benji mentioned, the alerts are just a query as well. It's just going to be a where clause on the query, and that's going to tell us when to fire that alert. Then, within Grafana, we can preset a number of different destinations, whether it's Slack, Microsoft Teams, or things like Datadog or PagerDuty that you may already have set up within your own system to do this.
Ryan McCrary (16:24)
So then those can be added. You can add multiples on here. For this example, we're just sending these to Slack. But one thing to point out further is that, because these metrics are just being piped into Influx and Grafana is just the visualization layer, our paid plans also have the ability to pipe any or all of these metrics directly to our customers' alerting systems. So if you have your own PagerDuty or Datadog and you want to be able to fine-tune these inside your own environment, versus going into Grafana and building all these panels, we can actually just pipe all that Influx data directly into your own system. You can build these same types of visualizations or alerting around that without having to involve Grafana. But this is just an easy way out of the box, and something that's included as part of RudderStack that a lot of people, as we mentioned before, don't really know exists. Or if they've seen Grafana, they may not know some of the pitfalls of having to hardcode these variables if they want to use the alerting along with it.
Ryan McCrary (17:19)
So, we've got that added now. Anything more on this that we want to touch on, Benji?
Benji Walvoord (17:28)
Well, I would just say that the dashboards themselves are infinitely configurable, and it's very nice to see in and out throughput and average time per connection in RudderStack. Maintaining those dashboards does involve a little bit of setup, but once you start playing around and building your queries, you can knock these out fairly quickly. They're hugely helpful in making sure that your environment is healthy and keeping a pulse on your instance.
Ryan McCrary (18:06)
Yeah, that's great. So you can see we set that query up, and we set up the alert query as well. And this is just an example. If you're familiar with Slack, this will look very familiar. While we were setting this up right before this call, this was an example of the webhook that came in. That came directly from Grafana, immediately into the Slack channel that we had set up for this. You can see the first message there includes a lot of really valuable data out of the box, right? It's going to show us, hey, this is an incoming alert from Grafana. You can see the source below. It's showing us the metric that we're keying off of, so it's the processor error count sum, and it's showing the stage as well, that it's in the destination transformer.
Ryan McCrary (18:49)
And it's showing the actual count there as well: we had nine errors in the last minute, which is above Benji's threshold of seven. Don't give Benji more than seven errors in a minute, or he is going to be mad. And then you can see the very next minute was fine, so that was the alert changing back into an OK status. Those can be tuned as well. You can set alerts, and you can also set an OK status. That's a pretty common use case to think about, because for a lot of customers, something like RudderStack is a tool that can oftentimes sit in the background, right? Once you've got your pipelines set up, hopefully you're not thinking about RudderStack a ton. Hopefully you've now got these great pipelines that are sending data into your downstream tools, and you're able to live in those tools, manage your data accordingly, and get your business insights from that.
Ryan McCrary (19:31)
And so there are a lot of cases where it's easy to forget about RudderStack. So in addition to setting up alerts for when things are going wrong, there's also the ability to set up alerts for things that are below a threshold, right? Maybe we want to send ourselves an alert every 24 hours that says, hey, over the past 24 hours, the total number of processor errors was below a certain threshold, and that's good. Then we just get a daily ping of, hey, everything's good. That gives us some reassurance that the RudderStack pipelines are behaving as we would expect, that they're still processing, and that there's nothing wrong. So, from a high level, again, there's a ton of metrics. If you have a Grafana dashboard, we would encourage you to go play around and see what metrics are available. You can reach out to us on our community Slack. Benji and I are both on there, as well as other folks from our team, and we're happy to make recommendations around what might make sense for your business.
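Editor's sketch of the daily "everything's good" ping, again as plain Python rather than Grafana configuration. It sums 24 hourly error counts and reports healthy only when the total stays below a threshold. The function name, the threshold of 50, and the sample data are invented for illustration.

```python
def daily_health_message(hourly_error_counts, max_errors=50):
    """Summarize the last 24 hours as a healthy or unhealthy message."""
    if len(hourly_error_counts) != 24:
        raise ValueError("expected one count per hour for 24 hours")
    total = sum(hourly_error_counts)
    # Inverse of the usual alert: the good news is what gets reported.
    status = "OK" if total < max_errors else "ATTENTION"
    return f"{status}: {total} processor errors in the past 24 hours (limit {max_errors})"


quiet_day = [1] * 24             # hypothetical: one error per hour
noisy_day = [0] * 23 + [120]     # hypothetical: a single bad hour
print(daily_health_message(quiet_day))
print(daily_health_message(noisy_day))
```

A ping like this also acts as a heartbeat: if the daily message stops arriving, that silence is itself a signal worth investigating.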
Ryan McCrary (20:23)
And we can even help you look at your Grafana dashboard and see what we can tune. What makes sense to alert ourselves to, or to alert ourselves that things are okay? And we can help tune those as we go. For the customers we have that leverage this extensively, it is an ongoing process to tune these as well. The idea is to get these better over time. As these metrics come down, or as we're able to fix parts of those pipelines, we're able to tune these alerts, reduce some of that noise, and really make them valuable, instead of something that we suppress and that can go bad without us realizing it because we don't pay attention to it anymore. Anything to add from your side, Benji?
Benji Walvoord (21:02)
I would just say that if you want to read more about exactly how we set this up, we've got a blog post coming out, and I'm sure there will be a link attached to this video, with step-by-step instructions on the dashboards that we've created, and on how to extend that even further by sending these events to RudderStack webhook sources to stream them just about anywhere you want.
Ryan McCrary (21:30)
Yeah. So that's something we didn't mention, but there are a lot of built-in integrations in Grafana, one of which is a webhook. So you can send these alerts to any endpoint that you've got. Benji's done some really cool stuff, and we've got some interesting use cases around sending some alerts, and some other metrics from our dashboard, back through RudderStack via a RudderStack webhook source. Then we can distribute that into various downstream tools, to track these things in different tools or just for different teams to be able to visualize them and see what's coming through RudderStack. So, a little example of RudderStack eating itself there, getting to use RudderStack to monitor RudderStack. But Benji's the smart one. Benji writes all that. So yeah, we've got some stuff coming out next week, and hopefully it can be helpful. And again, if there are any further questions on this, always feel free to reach out to us or join the community Slack. We're happy to walk through any use cases there. Brooks, anything else from your end?
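Editor's sketch of what a receiving endpoint might do with a webhook alert before passing it downstream: parse the JSON body and flatten it into one row per matching series. The payload shape (`ruleName`, `state`, `evalMatches` with `metric`, `value`, `tags`) is modeled on Grafana's legacy alert webhook notification and should be treated as an assumption. Inspect a real payload from your own Grafana version before relying on any field. The sample values are made up.

```python
import json

# Hypothetical webhook body; field names are an assumed payload shape.
SAMPLE_BODY = json.dumps({
    "ruleName": "Failed processor events",
    "state": "alerting",
    "evalMatches": [
        {"metric": "failed_events.sum", "value": 9,
         "tags": {"stage": "dest_transformer"}},
    ],
})


def flatten_alert(raw_body):
    """Turn one webhook body into a list of flat rows, one per matched series."""
    payload = json.loads(raw_body)
    base = {"rule": payload.get("ruleName"), "state": payload.get("state")}
    matches = payload.get("evalMatches") or []
    if not matches:
        # For example an OK notification, which may carry no matches.
        return [dict(base, metric=None, value=None, stage=None)]
    return [
        dict(base,
             metric=match.get("metric"),
             value=match.get("value"),
             stage=(match.get("tags") or {}).get("stage"))
        for match in matches
    ]


rows = flatten_alert(SAMPLE_BODY)
print(rows)
```

Flat rows like these are easier to load into a warehouse table or forward as event properties than the nested payload.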
Brooks (22:42)
I appreciate everyone attending today. We will have this up live by tomorrow. And yeah, as Ryan said, join the community Slack. You can go to www.RudderStack.com to schedule a session with our customer success team or join Slack, and stay tuned for some more tech sessions coming soon. Thank y'all for joining.
Ryan McCrary (23:12)
Okay. See you
Brooks (23:13)
Bye.