Rearview Real-Time Monitoring with Graphite

Steve Akers —  August 15, 2013 — 5 Comments

In my post on creating control charts with seasonal data, I mentioned that a framework for real-time monitoring was the next step. There are tools on the market that let you create simple monitors that alert when your data exceeds some upper or lower threshold. To date, none offers the ability to write custom monitors that support control charts or deployment-triggered monitoring. That’s exactly why the Analytics team at LivingSocial created the open source tool rearview.

Overview

Rearview is a Scala monitoring framework for Graphite time series data. The monitors are simple Ruby scripts that run in a sandbox to prevent I/O. Each monitor is configured with a crontab-compatible time specification used by the scheduler.

Monitors define the following attributes:

  1. One or more Graphite metrics.
  2. Crontab time specification.
  3. Optional Ruby expression. If no custom graph calls are made a default graph is generated.
  4. Optional PagerDuty API keys and/or emails.

The monitor workflow is as follows:
(Figure: Rearview Workflow)

  1. Scheduler triggers job run.
  2. Job is loaded from the database.
  3. Server fetches the metrics from Graphite (note that monitors can’t do I/O other than puts).
  4. Metric data is transformed into data structures for Ruby.
  5. MRI SAFE mode processes are forked to execute the logic.
  6. Monitor optionally raises an exception to indicate a failure based on the data.
  7. Any configured PagerDuty or Email alerts are sent.
  8. Job is re-scheduled.

Monitor Details

A monitor is simply a Ruby script which runs with some timeseries data in scope by default. The variables in scope to the monitor are generated from the job’s definition of metrics and how far back to retrieve data. A monitor author can use the data in scope to determine whether an alert should be generated any way they see fit.

The add or edit monitor UI has several fields, but the most important fields are the metrics, number of minutes and the monitor Ruby expression fields (see Figure 1 below.)

Figure 1: Sample Rearview Monitor

Let’s suppose we calculate the conversion rate for our ad server over the last 30 minutes. If the conversion rate drops below 10% we want to generate an alert. In this example, we would specify the following metrics:

alias(stats_counts.adserver.web_traffic.impression, "impressions")
alias(stats_counts.adserver.web_traffic.conversion, "conversions")

By entering 30 into the minutes back field, the monitor will grab 30 minutes’ worth of data. Depending on your Graphite configuration this could be anywhere from 1800 datapoints per metric (for 1s retention) to 30 datapoints per metric (for 1min retention). In our example we will be using a 10s retention, which will return 180 datapoints per metric.
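The arithmetic behind those counts is just the window length in seconds divided by the retention interval. Here is a minimal sketch of it; the helper name datapoints_per_metric is ours for illustration and is not part of rearview, and the exact count Graphite returns may differ by one depending on where the window boundaries fall.

```ruby
# Hypothetical helper (not part of rearview): estimates how many datapoints
# Graphite returns per metric for a given window and retention interval.
def datapoints_per_metric(minutes_back, retention_seconds)
  (minutes_back * 60) / retention_seconds
end

# The three retention settings mentioned above, for a 30 minute window.
[1, 10, 60].each do |retention|
  puts "#{retention}s retention: #{datapoints_per_metric(30, retention)} datapoints"
end
```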

The monitor code would be defined as follows:

puts @timeseries

impressions = @a.values.sum # the sum method uses to_f to convert nil values to 0.0
conversions = @b.values.sum

rate = (conversions / impressions) * 100
puts rate

raise "The conversion rate has dropped below 10%" if rate < 10

By default, rearview creates a namespace for the monitor with some implicit instance variables defined. These variables are defined beginning with @a, which corresponds to the first metric in the list, @b which is the second metric, and so on. In this example the timeseries for impressions is @a and conversions is @b. Each timeseries variable @a, @b, … etc. is a TimeSeries instance with the fields:

  • label – the name of the metric for the timeseries (String). This value has an accessor which can be set to some other value for readability in graphs, etc.
  • timestamp – a long value with the Unix timestamp (Fixnum); the sample output below shows it in seconds
  • value – the double value of the entry (Float, may be nil)

Additionally, there is a variable @timeseries in scope, which is an Array of the TimeSeries objects described above. So @a, @b, and so on are just convenience variables which correspond to each entry of @timeseries in the order specified in the metrics UI text field. The string representation of the @timeseries variable for the above example on 1 minute’s worth of data would be:

[
    {
        label: impressions,
        entries: [
            { label: impressions, timestamp: 1361381120, value: 82.0 },
            { label: impressions, timestamp: 1361381130, value: 74.0 },
            { label: impressions, timestamp: 1361381140, value: 72.0 },
            { label: impressions, timestamp: 1361381150, value: 72.0 },
            { label: impressions, timestamp: 1361381160, value: 81.0 },
            { label: impressions, timestamp: 1361381170, value: 70.0 },
            { label: impressions, timestamp: 1361381180, value: nil }
        ]
    },
    {
        label: conversions,
        entries: [
            { label: conversions, timestamp: 1361381120, value: 17.0 },
            { label: conversions, timestamp: 1361381130, value: 17.0 },
            { label: conversions, timestamp: 1361381140, value: 17.0 },
            { label: conversions, timestamp: 1361381150, value: 11.0 },
            { label: conversions, timestamp: 1361381160, value: 18.0 },
            { label: conversions, timestamp: 1361381170, value: 6.0 },
            { label: conversions, timestamp: 1361381180, value: nil }
        ]
    }
]

Notice there are two array entries in @timeseries, which correspond to the variables @a and @b. The default label for each metric is set to the alias for a given timeseries. If an alias is not specified, the default value will match the exact string used in the metric field. Optionally, you can set the label manually within the monitor like this:

@a.label = "impressions"
@b.label = "conversions"
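To make the relationship between @timeseries, @a and @b concrete, here is a small sketch that models the shape shown above with plain Structs. The Entry and TimeSeries definitions are illustrative stand-ins we made up to run outside rearview; the real classes may differ.

```ruby
# Illustrative stand-ins only; rearview's real TimeSeries class may differ.
Entry = Struct.new(:label, :timestamp, :value)
TimeSeries = Struct.new(:label, :entries) do
  # Returns just the value of each entry, nil included.
  def values
    entries.map(&:value)
  end
end

impressions = TimeSeries.new("impressions", [
  Entry.new("impressions", 1361381120, 82.0),
  Entry.new("impressions", 1361381130, 74.0),
  Entry.new("impressions", 1361381180, nil)
])
conversions = TimeSeries.new("conversions", [
  Entry.new("conversions", 1361381120, 17.0),
  Entry.new("conversions", 1361381130, 17.0),
  Entry.new("conversions", 1361381180, nil)
])

# timeseries plays the role of @timeseries; a and b play @a and @b.
timeseries = [impressions, conversions]
a, b = timeseries

# a is the same object as the first array entry, so relabeling it
# is visible through either name.
a.label = "Ad impressions"
puts timeseries.first.label
puts b.values.inspect
```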

Now back to the example. The first line prints the @timeseries variable. All output from the monitor appears in the output field. The next two lines sum the values of the two entries for impressions and conversions using the utility array method sum, located in /src/main/resources/jruby/utilities.rb. This file also contains array methods for calculating mean, median, and percentile. Any method added to this file will be available to all monitors.
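For readers who want to experiment outside rearview, helpers along these lines could look like the sketch below. This is not the contents of utilities.rb: the SeriesStats module, its method bodies, and the choice of a nearest-rank percentile are all our assumptions. We also use module methods rather than patching Array so the sketch does not clash with the Array#sum built into newer Rubies.

```ruby
# A sketch of nil-tolerant helpers in the spirit of utilities.rb.
# Module name and method bodies are assumptions, not rearview's actual code.
module SeriesStats
  module_function

  # Sum with nil treated as 0.0 (nil.to_f is 0.0).
  def sum(values)
    values.inject(0.0) { |total, v| total + v.to_f }
  end

  # Mean over the non-nil values; nil when there is nothing to average.
  def mean(values)
    present = values.compact
    return nil if present.empty?
    sum(present) / present.size
  end

  # Median over the non-nil values (average of the two middle values
  # when the count is even).
  def median(values)
    sorted = values.compact.sort
    return nil if sorted.empty?
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end

  # Nearest-rank percentile over the non-nil values.
  def percentile(values, pct)
    sorted = values.compact.sort
    return nil if sorted.empty?
    rank = ((pct / 100.0) * sorted.size).ceil
    sorted[[rank, 1].max - 1]
  end
end

# The conversions values from the sample output above.
values = [17.0, 17.0, 17.0, 11.0, 18.0, 6.0, nil]
puts SeriesStats.sum(values)
puts SeriesStats.median(values)
puts SeriesStats.percentile(values, 90)
```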

The next line calculates the conversion rate and then does a puts call which will be shown in the UI output field. Using puts is a handy way to debug the data initially and determine the shape of the data and so on. Lastly, a monitor generates an alert by simply raising an exception with whatever text you want to appear in an email or PagerDuty alert.
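The raise-to-alert contract can be tried locally with a tiny harness like the one below. It is a hypothetical sketch: run_monitor is our own name, and rearview's real runner works differently (it forks sandboxed processes and sends the alerts itself). The totals are the sums of the sample data shown earlier.

```ruby
# Hypothetical harness: only illustrates that a raised exception becomes
# the alert text, while a clean run produces no alert.
def run_monitor
  yield
  { status: :ok, message: nil }
rescue StandardError => e
  { status: :failed, message: e.message }
end

# Healthy case: 86 conversions out of 451 impressions is about 19%.
ok_result = run_monitor do
  rate = (86.0 / 451.0) * 100
  raise "The conversion rate has dropped below 10%" if rate < 10
end

# Failing case: 30 conversions out of 451 impressions is below 10%.
failed_result = run_monitor do
  rate = (30.0 / 451.0) * 100
  raise "The conversion rate has dropped below 10%" if rate < 10
end

puts ok_result.inspect
puts failed_result.inspect
```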

The following are the variables provided implicitly to a monitor:

  • @name – Name of the monitor specified in the name field in the UI
  • @minutes – Number of minutes specified in the minutes field in the UI
  • @jobId – An id generated by the server for the job. This defaults to -1 for new monitors before saving.
  • @timeseries – A 2-dimensional Array containing Hashes with the fields: metric, timestamp and value
  • @a, @b, …, @z – One TimeSeries per metric, in the order the metrics are listed. If there are more than 26 metrics the variables wrap and begin again at @a1, @b1, etc. (However, if you have more than 26 metrics you’re likely doing something wrong.)
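The naming scheme for the per-metric variables can be sketched as a function of the metric's position in the list. This is our reconstruction from the description above, not rearview's code; in particular, continuing with @a2 after @z1 is an assumption.

```ruby
# Hypothetical reconstruction of the variable naming described above.
# index is the zero-based position of the metric in the metrics field.
def monitor_variable_name(index)
  letter = ("a".ord + index % 26).chr
  cycle = index / 26
  # First 26 metrics get a bare letter; later ones get a numeric suffix
  # (the behavior past @z1 is assumed).
  cycle.zero? ? "@#{letter}" : "@#{letter}#{cycle}"
end

puts [0, 1, 25, 26, 27].map { |i| monitor_variable_name(i) }.join(", ")
```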

There are a few utility functions also available to the monitor:

  • with_metrics
  • fold_metrics
  • graph_value

The utility functions are better explained through an example:

impressions = 0
conversions = 0
rate = 0

with_metrics do |a, b|
  impressions += a.value.to_f
  conversions += b.value.to_f
  rate = (conversions / impressions) * 100
  graph_value["# of #{a.label}", a.timestamp, a.value]
  graph_value["# of #{b.label}", b.timestamp, b.value]
  graph_value["Conversion Rate", a.timestamp, rate]
end

raise "The conversion rate has dropped below 10%." if rate < 10

In this example we’re using the utility functions with_metrics and graph_value. The with_metrics function is a convenience function which passes variables to the block that all align to the same timeslice in the time series. So in the example, a corresponds to impressions and b to conversions. Each iteration through the block advances to the next timestamp until the end of the series. The graph_value function plots the specified value on the graph at the given timestamp. In the example the resulting graph will render 3 lines, with the labels “# of impressions”, “# of conversions” and “Conversion Rate”.
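If it helps to picture the lockstep iteration, the sketch below imitates it with Array#zip. The with_aligned_entries function and the Point struct are stand-ins we invented for illustration; they are not rearview's with_metrics implementation, and the sketch assumes every series has the same number of entries.

```ruby
# Illustrative stand-in for a series entry.
Point = Struct.new(:label, :timestamp, :value)

# Stand-in for with_metrics: walks several series in lockstep so each
# block call receives one entry per series for the same timeslice.
def with_aligned_entries(*series)
  return if series.empty?
  first, *rest = series
  first.zip(*rest).each { |entries| yield(*entries) }
end

impressions = [
  Point.new("impressions", 1361381120, 82.0),
  Point.new("impressions", 1361381130, 74.0),
  Point.new("impressions", 1361381180, nil)
]
conversions = [
  Point.new("conversions", 1361381120, 17.0),
  Point.new("conversions", 1361381130, 17.0),
  Point.new("conversions", 1361381180, nil)
]

total_impressions = 0.0
total_conversions = 0.0
rates = []

with_aligned_entries(impressions, conversions) do |a, b|
  # nil.to_f is 0.0, so missing datapoints leave the running totals unchanged.
  total_impressions += a.value.to_f
  total_conversions += b.value.to_f
  rates << [a.timestamp, (total_conversions / total_impressions) * 100]
end

rates.each { |timestamp, rate| puts "#{timestamp}: #{rate.round(2)}%" }
```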

Future Development

If you’ve made it this far you are likely interested in more information on installation and deployment. If so, check out the bottom of the readme on the rearview project page in GitHub. And please feel free to contribute to the project in any way you can. Finally, if you’re more comfortable with Rails than Scala you’ll be happy to know we are working on a Rails port, which should be available soon.

  • http://kovyrin.net/ Oleksiy Kovyrin

    It is really awesome that you’ve decided to opensource the project! Thanks a lot!
    Any chance you could share some of the more complicated expressions you use in real-life checks?

  • http://steveakers.com/ Steve Akers

    You’re welcome, and absolutely! Next on my list is better documentation which will include a monitor cookbook of sorts.

  • Humberto Pereira

    Looks like logstash with kibana

  • http://steveakers.com/ Steve Akers

    I’ll definitely have to check out kibana. Hadn’t seen that one yet. The obvious difference with rearview would be the fact that rearview doesn’t parse logs at all. Additionally, I don’t believe logstash or kibana allow you to write custom monitors that alert you via email and/or pagerduty in real-time. That’s the real reason we built rearview as opposed to using something else.

  • Humberto Pereira

    Custom alerts is a great feature. Really, logstash or kibana doesn’t have it. Thanks for the idea.