Graphite Alerts with Monit

I love Graphite. It’s the most robust, flexible, kick-ass monitoring tool out there. But when I say monitoring, I’m actually not describing what graphite really does. In fact, it does almost anything but monitoring. It collects metrics via carbon, it stores them using whisper, and it provides a front-end (both an API and a web UI) via graphite-web. It does not, however, monitor anything, and certainly does not alert when certain things happen (or fail to happen).

So graphite is great for collecting, viewing and analyzing data, particularly with the multitude of dashboard front-ends, my favourite being giraffe ;-). But what can you do when you want to get an email or a text message when, say, carbon throws some errors, or your web server starts to bleed with 500’s like there’s no tomorrow? Even better – do you want to get an email when your conversion signup rate drops below a certain mark?

Monitoring graphite

So what can you use if you want to monitor stuff using graphite? And what kind of stuff can you monitor? I’ve come across a really great approach using nagios. In fact, I ‘borrowed’ the author’s method for alerting on 500 errors for my own approach. I wanted to do something very similar, but I really didn’t want nagios. It’s overkill for me if all I want is to get an email (or run a script) when something goes wrong.

Monit

I’m using monit on every server I manage. It’s lightweight, easy to install, very stable, and has a great configuration syntax which makes it easy to monitor anything from disk space, memory and CPU to individual processes and TCP listeners. It can watch a port, a process id, whatever you want. And last but not least, it doesn’t require a separate management server like other monitoring tools.

So why not plug monit into graphite and get alerted when, say, there’s a spike in 500 responses on one of the web servers? Or even when a business metric like subscription rate drops below a certain threshold? Being able to query graphite allows far better monitoring, because it gives you access to a much more expressive way of detecting anomalies. Instead of deciding on arbitrary thresholds, you can analyse average values, or calculate a percentage of one metric compared to another. This calculation can be performed over any time period you want.
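To give a flavour of monit’s configuration syntax, here’s a minimal sketch (the service names, paths and thresholds are examples only, not from any real setup):

    # system-level resource checks
    check system localhost
        if loadavg (5min) > 4 then alert
        if memory usage > 80% then alert

    # process check with automatic restart
    check process nginx with pidfile /var/run/nginx.pid
        start program = "/etc/init.d/nginx start"
        stop program = "/etc/init.d/nginx stop"
        if failed port 80 protocol http then restart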

Requirements

This technique relies on the check program option in monit, which was introduced in version 5.3 (and improved in later versions). For best results, please install the latest monit version. If you need help, I’ve created a small open-source project on github called monit-fabric – it uses fabric to install monit from source on debian squeeze, but even if you don’t know fabric, or use a different operating system, reading the fabric script will give you a good idea of how to do this yourself. It’s relatively easy. Otherwise, your package manager might already have an up-to-date version of monit.
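If you go the package-manager route, something like this might be all you need (a sketch for debian-flavoured systems; just make sure the packaged version is recent enough):

    sudo apt-get install monit
    monit -V    # verify the installed version is at least 5.3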

Approach

The monitoring approach is dead-simple. We’re going to use the program status testing in monit, and launch a little program that will interface with graphite, pull the data we want, and decide whether things look OK or not. If not, we simply return an exit code different from zero (0), and monit will then alert for us, or take an action like launching another program. You can write your program in bash, ruby or any language you want. I chose python, with two awesome libraries: requests and docopt.
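The contract between monit and the check program is bare-bones. Here’s a minimal sketch of it (the coconut count and its threshold are made up for illustration):

    #!/usr/bin/env python
    # minimal sketch of what monit expects from a check program:
    # print a short message and exit non-zero when something is wrong
    import sys

    coconut_count = 42  # imagine this value came from graphite

    if coconut_count < 100:
        # whatever we print ends up in the monit alert
        print 'only %s coconuts left!' % coconut_count
        sys.exit(1)
    sys.exit(0)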

The monit side

Setting up monit to launch and check our program is easy. All you need to do is something like this:

    check program graphite_check_coconuts with path "/usr/local/bin/graphite_check_coconuts"
        if status != 0 then alert
    

Pretty straightforward. Of course, you can check for various status codes and act accordingly, for example:

    check program graphite_check_coconuts with path "/usr/local/bin/graphite_check_coconuts"
        if status > 0 then alert
        if status = 9 for 3 cycles then exec "/etc/init.d/some_service restart"
    

Querying graphite

Now for the interesting part: Querying graphite for data we might want to alert on.
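Under the hood, all the data comes from graphite’s render API, which can return JSON. A quick way to see the shape of the response (the host and metric here are placeholders):

    curl 'https://localhost/render?target=carbon.agents.*.errors&from=-5min&format=json'

    # returns a list of series, each with [value, timestamp] datapoints, e.g.:
    # [{"target": "carbon.agents.foo-a.errors",
    #   "datapoints": [[0.0, 1363219380], [null, 1363219440]]}]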

This little python code performs three different checks, but it can be easily extended to include much more:

  1. check_carbon – checks for various error conditions or anomalies with the carbon collector; namely, errors that carbon reports, or an average update time slower than a certain threshold.
  2. check_http_errors – alerts if the proportion of 5xx response codes, relative to all responses on our nginx server, is over a threshold.
  3. check_subscription_growth – measures our weekly subscription growth rate, and alerts if it falls below 6 percent.

    #!/usr/bin/env python

    """graphite_check

    Usage:
        graphite_check <check>

    """
    import inspect
    import requests
    import sys
    from docopt import docopt

    GRAPHITE_SRV = 'https://localhost'

    def get_datapoints(json, target):
        # pick out the datapoints of a named target from the response
        return [d['datapoints'] for d in json if d['target'] == target][0]

    def get_graphite_stats(target, from_str='-5min', to_str='now'):
        params = {'format': 'json',
                  'target': target,
                  'from': from_str,
                  'to': to_str}
        # verify=False disables TLS verification (e.g. for a self-signed cert)
        response = requests.get('%s/render' % GRAPHITE_SRV,
                                params=params, verify=False)
        response.raise_for_status()
        return response.json()

    def check_carbon():
        """
        Checks for carbon errors or strange behaviour
        """

        # check for errors
        json = get_graphite_stats('summarize(sumSeries(carbon.agents.*.errors),"5min","avg",true)')
        errors = json[0]['datapoints'][0][0]
        if errors > 0:
            return (1, '%s Carbon errors detected.' % errors)

        # check for slow update time
        json = get_graphite_stats('summarize(sumSeries(carbon.agents.*.avgUpdateTime),"5min","avg",true)')
        avg_update = json[0]['datapoints'][0][0]
        if avg_update > 0.1: 
            return (1, 'Carbon average update time of %s above threshold' % avg_update)

    def check_http_errors():
        """
        Check for http errors in relation to standard responses
        """

        # avg percentage of 500 response codes compared to all responses over 5 minute period
        json = get_graphite_stats('summarize(asPercent(sum(servers.*.Http.http_response_rates.5*),sum(servers.*.Http.http_response_rates.*)),"5min","avg",true)')
        avg_errors = json[0]['datapoints'][0][0]
        if avg_errors > 0.1: 
            return (1, 'HTTP 5xx average of %s above threshold' % avg_errors)

    def check_subscription_growth():
        """
        Check for the weekly rate of subscription growth
        """

        # weekly subscriptions
        subs = 'summarize(sumSeries(stats.events.payments.activate.*),"1w","sum",true)'
        # weekly subscriptions for previous week
        prev_subs = 'timeShift(%s,"-1w")' % subs

        # it's theoretically possible to use `diffSeries`, and `divideSeries`
        # to retrieve the value directly from graphite.
        # However, `diffSeries` doesn't seem to work well with None values
        targets = ['alias(%s,"subs")' % subs, 'alias(%s,"prev_subs")' % prev_subs]
        response = get_graphite_stats(targets, from_str="-1w")

        sub_count = get_datapoints(response, 'subs')[0][0] or 0
        prev_sub_count = get_datapoints(response, 'prev_subs')[0][0] or 0

        if prev_sub_count == 0:
            return

        # cast to float to avoid Python 2 integer division truncating the rate
        growth_rate = (sub_count - prev_sub_count) / float(prev_sub_count) * 100

        if growth_rate < 6.0: 
            return (1, 'Weekly growth rate of %s%% less than 6%%' % growth_rate)

    if __name__ == '__main__':
        # we search for functions starting with check_
        # and add them to the docopt Usage string
        local_functions = inspect.getmembers(sys.modules[__name__])
        checks = [func[0].replace('check_', '') for func in local_functions if func[0].startswith('check_')]
        doc = __doc__.replace('<check>', "(%s)" % " | ".join(checks))
        arguments = docopt(doc, version='1.0')

        check_name = [k for k, v in arguments.iteritems() if v][0]
        check_func = getattr(sys.modules[__name__], "check_%s" % check_name, False)
        # call the check function; it returns a tuple of
        # (status_code, message) on error, or None when everything is fine
        ret = check_func()
        if ret:
            print ret[1]
            sys.exit(ret[0])
    

The code is flexible enough that you can simply add a new function whose name starts with check_, and the code will ‘pick’ it up automatically. For example, adding check_coconuts will let you run graphite_check coconuts from the command line.
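For instance, dropped into the script above, a hypothetical new check might look like this (the metric path and threshold are made up):

    def check_coconuts():
        """
        Alert if fewer than 100 coconuts were counted in the last 5 minutes
        """
        json = get_graphite_stats('summarize(sumSeries(events.*.coconuts.tasty),"5min","sum",true)')
        count = json[0]['datapoints'][0][0] or 0
        if count < 100:
            return (1, 'Only %s coconuts in the last 5 minutes' % count)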

Note that when the program returns a status code other than zero, whatever we print to stdout is automatically included in the monit alert, so our alerts can explain what went wrong. However, it looks like monit truncates the output, so try to keep it short…

Tying it together

Monit does not support running programs with arguments, so we can’t do something like check program coconuts with path "graphite_check coconuts". This means we need to wrap it with a bash script, one for each check. This is simple enough, but I’ve also created a small bash wrapper that you can use:

    #!/bin/bash

    # monit does not allow sending parameters to programs
    # so we need a thin wrapper to run the python graphite_check
    #
    # This is a generic wrapper that takes anything after `graphite_check_`
    # and passes it to the python script as arg
    #
    # all we need is softlink to this wrapper with the correct param name
    # e.g. ln -s graphite_check_ graphite_check_carbon

    # quote all expansions so unusual paths don't break the wrapper
    SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
    "$SCRIPT_DIR/graphite_check" "$(basename "$0" | sed 's/graphite_check_//g')"

    

Copy it to a file called graphite_check_ (notice the underscore at the end), and then softlink each check you want to it using:

ln -s graphite_check_ graphite_check_coconuts

Once you’ve created a softlink, you can use graphite_check_coconuts from your monit config.
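Putting it all together, a deployment might look something like this (a hypothetical sequence; adjust paths to your setup):

    # install the python check script and the generic wrapper
    cp graphite_check graphite_check_ /usr/local/bin/
    chmod +x /usr/local/bin/graphite_check /usr/local/bin/graphite_check_

    # one softlink per check
    cd /usr/local/bin
    ln -s graphite_check_ graphite_check_carbon
    ln -s graphite_check_ graphite_check_http_errors

    # reload monit after adding the check program entries
    monit reload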

Graphite monitoring tips

Just a couple of pointers on useful graphite functions to get the data you want, in no particular order:

sumSeries() or sum()

sumSeries() aggregates events from different sources into one series, and supports wildcards too, like this:

sumSeries(events.*.coconuts.{tasty,fresh})

summarize()

summarize() is a very useful function when you essentially want one value. Say, the average CPU over the last 5 minutes, or the total number of logins in the last week. One slightly annoying default of graphite, however, is that it aligns the summarize buckets on its own boundaries, rather than to your requested time range. So, for example, if I query for:

summarize(events.logins.web.success,"1w")

Instead of getting one result, as you would expect, we get two. The solution is to use the alignToFrom parameter, like so:

summarize(events.logins.web.success,"1w","sum",true)

or

summarize(sumSeries(system.*.cpu),"5min","avg",true)

summarize() can not only sum your data, but also take the avg, min, max or last value.

asPercent()

asPercent() takes two series, and returns the value of one as a percentage of the other. This is useful for measuring things like the proportion of a particular event compared to others.
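This is exactly what the check_http_errors check above relies on:

asPercent(sum(servers.*.Http.http_response_rates.5*),sum(servers.*.Http.http_response_rates.*))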

holtWintersAberration()

holtWintersAberration() has a rather scary name, but can be quite useful. I haven’t used it in production yet, but am hoping to play with this more. My statistics skills are virtually non-existent, so I hope I can explain this correctly. Holt-Winters performs a kind of clever estimation of “trends” based on past data. In theory it can predict future values as well (however, predicting future data isn’t implemented in graphite). The formula calculates an upper and lower “confidence” band, within which the data should fall. The holtWintersAberration() function shows you the deviation from this band, i.e. when things behave differently from what the trend suggests. You can further fine-tune how “sensitive” the deviation is by applying a higher (or lower) delta…

This can be useful for a few things. For example, maybe you’re trending your disk space or memory usage over time. Using holtWintersAberration() can help you identify usage spikes that are (statistically) unpredictable. I am currently playing around with something like this:

    summarize(holtWintersAberration(servers.palmtree.system.memory.free,10),"5min","sum",true)
    

The key parameter for tweaking is the delta (currently set at 10). The higher the delta value, the less ‘sensitive’ it is to spikes. YMMV. In fact, my mileage may vary; I’m still not sure exactly how to tune this. Just food for thought.

Perhaps asPercent() is sufficient for more localized analysis, but Holt-Winters can potentially give a better idea over time. Note that the current (hard-coded) default in graphite is to base the trend on the past 7 days (1 week) of data.

Check out this page for some examples of using Holt-Winters forecasts in graphite.

Further ideas

I’d be really curious to see other ideas for using graphite for generating alerts, or even just tips for creating better dashboards or graphs. One useful resource with a handful of graphite tips can be found on obfuscurity’s blog, which I recommend. Feel free to drop me a note if you have more suggestions or resources worth sharing.

14 replies on “Graphite Alerts with Monit”

Awesome note! I followed your steps and successfully got this working on my graphite server.

One typo though: under the ‘Tying it together’ section, the command for creating the soft link should be `ln -s graphite_check_ graphite_check_coconuts`. You used `ln -s graphite_check_coconuts graphite_check_`. In the bash script, though, you gave a correct sample.

Thanks again! Really appreciate this excellent blog.

Kan

Cool article, thanks for sharing. A few questions if I may!…

I’m assuming it would be dumb to write a script to query your app DB rather than the Graphite db to get stats? Because that could degrade production performance?

Is this still working for you, anything new learned? :)

Would you use this approach for a cluster of servers?

Cheers!

T

Hi Tobin,

You can query your DB and alert as well, but I think you’d be better off pushing those stats to Graphite, so you can trend things over time, detect longer-term behaviour, slice&dice your data, correlate etc. So my approach is to try to push stats out to graphite, if necessary by polling values from the DB (although I try to avoid it for performance reasons, like you mention). Then alerting can be done based on the data in Graphite.

It’s still working for me, but it has some limitations. For example, the alert email isn’t very descriptive or easily configurable. You can’t customize the error message much; it just uses the standard monit alert format.

I’m not so sure about your question about a cluster. Why would it be any different? The nice thing about graphite is its query language / functions; it allows you to aggregate lots of data (potentially hundreds of servers) into one metric if you want. So you don’t need to query individual servers.

Hope this helps
Cheers
Yoav

Hi Scott,

Thanks for the info. To my knowledge, there are several fixes in 0.9.x that are still awaiting an official release to pypi. I don’t fully understand the decision-making process and release prioritization within the project, and am not directly involved. I’ve seen this become a source of confusion/frustration amongst graphite users, unfortunately.

Cheers
Yoav

What are your reasons for using Monit (which, as you mention, lacks in the reporting department for this purpose) over, say, a simple script on a 5-minute cron that runs all your checking scripts and alerts on a non-zero exit code? Just availability?

Hi Matt,

It’s a very valid question. First of all, monit is *designed* for alerting and monitoring purposes, so it’s a primary candidate for me. Secondly, and perhaps more practically, monit can do more than just alert via email; it can also take various actions based on the triggers. This makes it a pretty good tool for the job.

That being said, of course you can write your own script via cron or any other tool and let it poll graphite and take actions. Or perhaps a combination – have monit run the check and trigger an external script that reports things more flexibly for you?

Hi Yoav,

I’m using a similar setup: Graphite to track all the metrics, Grafana to display them and Zabbix as a monitoring tool (although so far we haven’t started monitoring actual Graphite metrics). What I’m interested in now is a tool for detecting metric anomalies, and a tool for finding metrics that behave in a similar way. I found the great Etsy article about Skyline and Oculus, but so far I’m having a hard time installing them, and from their Github page it looks like the project is no longer maintained… Do you have a solution for this too?

Hi Yuval,

I’m not aware of anything simple. The post was highlighting a rather simple approach using monit, but you can do something similar with nagios, or any tool that you can schedule to poll for some stats periodically. It’s really nothing more than a poor man’s anomaly detection. The post listed a couple of ideas for using Graphite’s built-in functions to do a bit of anomaly detection. If you need something more robust or intelligent, then unfortunately I can’t help much. I’d also love to hear of anything that fits the bill. For me personally, solutions like skyline feel like overkill, but maybe it’s just my wrong impression.

Yoav

p.s. bumped into https://github.com/lytics/anomalyzer and https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series (but to be honest, I have no clue if or how those might fit in with graphite…)

This post sounds interesting as well, and some points seem to suggest that “my” approach isn’t that bad after all … http://dieter.plaetinck.be/post/practical-fault-detection-alerting-dont-need-to-be-data-scientist/

Hello Yoav,

I have tried your script on my graphite cluster.

graphite-carbon version: 0.9.12-3
graphite-web version: 0.9.12+debian-3

https is not enabled on my cluster, so I have modified:

    GRAPHITE_SRV = 'http://localhost'

When I execute the script, it returns nothing:

    # python /usr/local/bin/graphite_check_coconuts carbon

    # python /usr/local/bin/graphite_check_coconuts http_errors
    Traceback (most recent call last):
      File "/usr/local/bin/graphite_check_coconuts", line 95, in <module>
        ret = check_func()
      File "/usr/local/bin/graphite_check_coconuts", line 52, in check_http_errors
        json = get_graphite_stats('summarize(asPercent(sum(servers.*.Http.http_response_rates.5*),sum(servers.*.Http.http_response_rates.*)),"5min","avg",true)')
      File "/usr/local/bin/graphite_check_coconuts", line 26, in get_graphite_stats
        response.raise_for_status()
      File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR

    # python /usr/local/bin/graphite_check_coconuts subscription_growth
    Traceback (most recent call last):
      File "/usr/local/bin/graphite_check_coconuts", line 95, in <module>
        ret = check_func()
      File "/usr/local/bin/graphite_check_coconuts", line 71, in check_subscription_growth
        response = get_graphite_stats(targets, from_str="-1w")
      File "/usr/local/bin/graphite_check_coconuts", line 26, in get_graphite_stats
        response.raise_for_status()
      File "/usr/lib/python2.7/dist-packages/requests/models.py", line 773, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR

I have installed the required python modules: inspect, requests, sys.

Please help….

Sorry for the late reply, I somehow missed this comment. It’s hard to know exactly, but it looks like you are using my example too literally. You should adapt it to your own metrics.

The get_graphite_stats('summarize(asPercent(sum(servers.*.Http.http_response_rates.5*),sum(servers.*.Http.http_response_rates.*)),"5min","avg",true)') call should be changed to use your own graphite functions / metrics.

Otherwise, try to debug what your graphite server complains about – it might give you a clue.

Hope this helps, but unfortunately I can’t really debug your script via the blog post… It assumes a certain degree of familiarity with graphite / python etc.
