
We use Graphite to track the history of disk utilisation over time. Our alerting system looks at the data from Graphite and alerts us when the free space falls below a certain number of blocks.

I'd like to get smarter alerts - what I really care about is "how long do I have before I have to do something about the free space?" For example, if the trend shows that in 7 days I'll run out of disk space, raise a Warning; if it's less than 2 days, raise an Error.

Graphite's standard dashboard interface can be pretty smart with derivatives and Holt-Winters confidence bands, but so far I haven't found a way to convert these into actionable metrics. I'm also fine with crunching the numbers some other way (just extract the raw numbers from Graphite and run a script over them).

One complication is that the graph is not smooth - files get added and removed, but the general trend over time is for disk space usage to increase, so perhaps there is a need to look at local minima (if looking at the "disk free" metric) and draw a trend between the troughs, as in the sketch below.
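To make that concrete, something along these lines is what I have in mind - a rough, untested sketch, assuming the samples have already been pulled from Graphite's render API (format=json):

```python
import numpy as np

# Rough sketch of the "trend between troughs" idea - untested, illustrative only.
# `series` is a list of (timestamp, free_bytes) samples for the "disk free" metric.
def days_until_full_from_troughs(series):
    times = np.array([t for t, _ in series], dtype=float)
    free = np.array([v for _, v in series], dtype=float)

    # Local minima: samples lower than (or equal to) both neighbours.
    troughs = [i for i in range(1, len(free) - 1)
               if free[i] <= free[i - 1] and free[i] <= free[i + 1]]
    if len(troughs) < 2:
        return None  # not enough troughs to draw a trend

    # Fit a line through the troughs: free ~ slope * time + intercept
    slope, _intercept = np.polyfit(times[troughs], free[troughs], 1)
    if slope >= 0:
        return None  # free space isn't shrinking

    seconds_left = -free[-1] / slope   # time until the trend line hits zero free space
    return seconds_left / 86400.0      # in days
```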

Has anyone done this?

Amos Shapira
  • What's your infrastructure? For instance, if you're a VMware house you could look at their Operations Manager products, which do this kind of predictive view on disk space. – Chopper3 Aug 07 '13 at 11:30
  • `The volume of crap people have to store will expand to fill the disk available.` - Old Sysadmin Axiom – voretaq7 Aug 07 '13 at 16:19
  • Our servers are split between VMware VMs using IBM XIV for disks, and KVMs using local SDs. I'm not sure we have access to that kind of information (my team does not manage the VMware or XIV) and would prefer a product-independent solution. – Amos Shapira Aug 07 '13 at 19:30

3 Answers


Honestly "Days Until Full" is really a lousy metric anyway -- filesystems get REALLY STUPID as they approach 100% utilization.
I really recommend using the traditional 85%, 90%, 95% thresholds (warning, alarm, and critical you-really-need-to-fix-this-NOW, respectively). This should give you lots of warning time on modern disks. Take a 1TB drive: at 85% you still have plenty of space but you're aware of a potential problem; by 90% you should be planning a disk expansion or some other mitigation; and at 95% you've got 50GB left and should darn well have a fix in motion.

This also ensures that your filesystem functions more-or-less optimally: it has plenty of free space to deal with creating/modifying/moving large files.

If your disks aren't modern (or your usage pattern involves bigger quantities of data being thrown onto the disk) you can easily adjust the thresholds.
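For what it's worth, wiring those thresholds up is a few lines of scripting - a minimal sketch (the percentages mirror the discussion above and should be tuned per volume):

```python
# Minimal sketch of static threshold alerting; thresholds are illustrative.
THRESHOLDS = [
    (95.0, "critical"),  # you really need to fix this NOW
    (90.0, "alarm"),     # plan an expansion or other mitigation
    (85.0, "warning"),   # be aware of a potential problem
]

def classify(used_percent):
    for limit, severity in THRESHOLDS:
        if used_percent >= limit:
            return severity
    return "ok"
```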


If you're still set on using a "days until full" metric, you can extract the data from Graphite and do some math on it. IBM's monitoring tools implement several days-until-full metrics which can give you an idea of how to implement it, but basically you're taking the rate of change between two points in history.

For the sake of your sanity you could use the derivative from Graphite (which will give you the rate of change over time) and project using that, but if you REALLY want "smarter" alerts I suggest using daily and weekly rate of change (calculated based on peak usage for the day/week).

The specific projection you use (smallest rate of change, largest rate of change, average rate of change, weighted average, etc....) depends on your environment. IBM's tools offer so many different views because it's really hard to nail down a one-size-fits-all pattern.
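The core of such a projection is simple enough to sketch (this is just the basic rate-of-change idea with made-up numbers, not IBM's implementation):

```python
# Sketch of a "days until full" projection from two history points.
def days_until_full(old_used_pct, new_used_pct, interval_days):
    rate_per_day = (new_used_pct - old_used_pct) / interval_days
    if rate_per_day <= 0:
        return float("inf")  # flat or shrinking usage: never fills at this rate
    return (100.0 - new_used_pct) / rate_per_day

# e.g. peak usage 70% last week, 74% this week:
# days_until_full(70, 74, 7) -> (100 - 74) / (4 / 7) ~= 45 days
```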


Ultimately no algorithm is going to be very good at doing the kind of calculation you want. Disk utilization is driven by users, and users are the antithesis of the Rational Actor model: All of your predictions can go out the window with one crazy person deciding that today is the day they're going to perform a full system memory dump to their home directory. Just Because.

voretaq7
  • Thanks for your insights. I see your points. I still think that constant thresholds just try to reflect "how long have I got to remediate?" and feel somewhat vindicated by your "adjust your thresholds" comment. Simple graphite derivatives don't work because the original graph is not smooth. Thanks for the pointer to IBM's tools, what you describe sounds just like what I started thinking about (extract the last two minima and calculate the slope from them). – Amos Shapira Aug 09 '13 at 21:46
  • Surely the point of a 'days to full' metric is that, with static 85/90/95 thresholds, you have no idea how fast the disk is filling? Sure, you're aware of a potential problem, but how can you know whether you have days to address it, or weeks/months? –  Apr 03 '14 at 10:22
  • I find it really interesting that you would have this opinion. Let me frame it this way: your company has a procurement process that takes about 6 weeks from the initial request for more hard drives to the day that those hard drives are actually installed in the boxes and load redistribution begins to take place. Given that 6-week timeframe, at what disk % do you need to be notified in order to be able to get a disk installed in time? 80%? 75%? The fact of the matter is that you don't know unless you put some effort into calculating the growth rate. – JHixson Aug 24 '17 at 16:09

We keep a "mean time till full" or "mean time to failure" metric for this purpose, using the statistical trend and its standard deviation to add smarter (less dumb) logic on top of a simple static threshold.

Simplest Alert: Just an arbitrary threshold. Doesn't consider anything to do with the actual diskspace usage.

  • Example: current% > 90%

Simple TTF: A little smarter. Calculate the unused percentage minus a buffer and divide by the zero-protected rate of change. Not very statistically robust, but it has saved my butt a few times when my users upload their cat video corpus (true story).

  • Example: (100% - 5% - current%) / MAX(rate(current%), 0.001%)

Better TTF: But I wanted to avoid alerting on static read-only volumes sitting at 99% (unless they ever change), I wanted more proactive notice for noisy volumes, and I wanted to detect applications with unmanaged disk-space footprints. Oh, and the occasional negative values from the Simple TTF just bothered me.

  • Example: MAX(100% - 1% - stdev(current%) - current%, 0) / MAX(rate(current%), 0.001%)

I still keep a static buffer of 1%. Both the standard deviation and the consumption rate increase on abnormal usage patterns, which sometimes overcompensates. In Grafana or Alertmanager speak you'll end up with some rather expensive sub-queries. But I did get the smoother timeseries and less noisy alerts I was seeking.

  • Example: clamp_min((100 - 1 - stddev_over_time(usedPct{}[12h:]) - max_over_time(usedPct{}[6h:])) / clamp_min(deriv(usedPct{}[12h:]), 0.00001), 0)

Quieter drives make for very smooth alerts.

Longer ranges tame even the noisiest public volumes.
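Outside of PromQL, the same "Better TTF" idea can be sketched in a few lines of Python (illustrative only - the constants match the example above, and the growth-rate estimate is a deliberately crude stand-in for `deriv()`):

```python
import statistics

# Illustrative sketch of the "Better TTF" formula above; not a drop-in
# replacement for the PromQL version.
def better_ttf_hours(samples, interval_hours, buffer_pct=1.0, min_rate=0.001):
    """Hours until full, from a window of used-percent samples (at least two)."""
    current = max(samples)                    # like max_over_time
    stdev = statistics.pstdev(samples)        # like stddev_over_time
    window_hours = (len(samples) - 1) * interval_hours
    rate = (samples[-1] - samples[0]) / window_hours   # crude growth rate, % per hour
    headroom = max(100.0 - buffer_pct - stdev - current, 0.0)
    return headroom / max(rate, min_rate)     # zero-protected, like clamp_min
```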

Vic Colborn
  • This is my favourite answer. It's been a long time since I asked this question and the original issue isn't relevant any more, but I can imagine using this for other capacity planning tasks (e.g. Elasticsearch load trend increasing over time). – Amos Shapira Jun 13 '20 at 00:09

We've recently rolled out a custom solution for this using linear regression.

In our system the primary source of disk exhaustion is stray log files that aren't being rotated.

Since these grow very predictably, we can perform a linear regression on the disk utilization (e.g., z = numpy.polyfit(times, utilization, 1)), then calculate when the 100% mark will be reached given the linear model (e.g., (100 - z[1]) / z[0]).
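A minimal numpy version of that calculation looks roughly like this (illustrative only; our deployed code is the Ruby/GSL version mentioned below):

```python
import numpy as np

# Illustrative numpy version of the regression above (not the deployed Ruby/GSL code).
# `times` are sample timestamps in seconds, `utilization` is used-percent per sample.
def predicted_full_timestamp(times, utilization):
    z = np.polyfit(times, utilization, 1)  # z[0] = slope (% per second), z[1] = intercept
    if z[0] <= 0:
        return None                        # usage is flat or shrinking
    return (100.0 - z[1]) / z[0]           # timestamp at which the fit reaches 100%
```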

The deployed implementation (the linked gist) uses Ruby and GSL, though numpy works quite well too.

Feeding this a week's worth of average utilization data at 90 minute intervals (112 points) has been able to pick out likely candidates for disk exhaustion without too much noise so far.

The class in the gist is wrapped in another class that pulls data from Scout, alerts to Slack, and sends some runtime telemetry to StatsD. I'll leave that bit out since it's specific to our infrastructure.

matschaffer
  • I've updated the answer with some info now that we have it rolled out. – matschaffer Dec 03 '15 at 04:53
  • Just found a funny gotcha with this approach. We also have 90% alarms. One of our hosts was growing so gradually that it hit 90% and triggered that alarm even though it still had more than a week before hitting 100%, so the predictive alert never fired ;) Guess I should use `(90 - z[1]) / z[0]` instead. – matschaffer Jan 15 '16 at 06:41