
Does anyone have formulas, or maybe some sample data from their environment, that can help me estimate how much disk space Graphite will use per data point?

HopelessN00b
Kyle Brandt
  • Make sure you're planning your disk I/O correctly too, and not just your disk capacity. rrdtool has, over the years, accumulated a lot of micro-optimizations that make it a lot faster (2x faster?) on writes than Graphite's Whisper database format. If you're planning on keeping all your data on decent SSD, that will get you most of the way there, but I wouldn't plan to keep a whole ton of Whisper DBs on spinning disk. At scale, it's just not cost-effective at the disk I/O levels that Graphite throws around. – jgoldschrafe Sep 26 '13 at 15:12

4 Answers


whisper-info.py gives you a lot of insight into what each file contains and how it is aggregated, including the file's size.

However, it's only useful for existing Whisper files.

When you want to see predictive sizing of a schema before putting it in place, try a Whisper calculator, such as the one available at https://gist.github.com/jjmaestro/5774063
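
If you'd rather see the arithmetic than run the calculator, here is a minimal sketch in Python of what such a calculator does (a deliberately simplified retention parser, assuming Whisper's on-disk layout of a 16-byte metadata header, 12 bytes of header per archive, and 12 bytes per stored data point):

UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400, 'w': 604800, 'y': 31536000}

def seconds(text):
    """Turn a duration like '1m' or '31d' into seconds."""
    return int(text[:-1]) * UNITS[text[-1]]

def whisper_file_size(retentions):
    """Estimate the size in bytes of a .wsp file for a retention string
    such as '1m:31d,15m:1y,1h:5y'."""
    points = [seconds(ret) // seconds(prec)
              for prec, ret in (archive.split(':') for archive in retentions.split(','))]
    header = 16 + 12 * len(points)               # metadata plus one header entry per archive
    return header + sum(12 * p for p in points)  # 12 bytes per pre-allocated point

# The retentions whisper-info.py reports for the example file below:
print(whisper_file_size('10s:7d,1m:31d,10m:5y'))  # -> 4415092

That matches the fileSize reported by whisper-info.py for the example file below, so the 12-bytes-per-point rule of thumb holds up.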

EDIT:

Since an example was requested...

storage_schema:

{
    :catchall => {
      :priority   => "100",
      :pattern    => "^\.*",
      :retentions => "1m:31d,15m:1y,1h:5y"
    }
}

Looking at my file applied-in-last-hour.wsp, ls -l yields

-rwxr-xr-x 1 root root 4415092 Sep 16 08:26 applied-in-last-hour.wsp

and whisper-info.py ./applied-in-last-hour.wsp yields

maxRetention: 157680000
xFilesFactor: 0.300000011921
aggregationMethod: average
fileSize: 4415092

Archive 0
retention: 604800
secondsPerPoint: 10
points: 60480
size: 725760
offset: 52

Archive 1
retention: 2678400
secondsPerPoint: 60
points: 44640
size: 535680
offset: 725812

Archive 2
retention: 157680000
secondsPerPoint: 600
points: 262800
size: 3153600
offset: 1261492

So, basically, you take the per-file size for each retention match (per retention-period segment, per stat), multiply it by the number of systems you intend to apply this to, and factor in the number of new stats you're going to track. Then you take whatever amount of storage that comes to and at least double it (because we're buying storage, and we know we'll use it...).
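
As a rough sketch of that back-of-the-envelope math (the host and stat counts below are made-up placeholders; the per-file size is the one from the example above):

# Capacity estimate: one Whisper file per stat per host, sized for the
# retention above, then doubled for headroom as suggested.
per_file_bytes = 4_415_092      # from whisper-info.py / the calculator above
hosts = 200                     # placeholder: number of systems this applies to
stats_per_host = 100            # placeholder: number of stats tracked per system

raw_bytes = per_file_bytes * hosts * stats_per_host
planned_bytes = raw_bytes * 2   # "at least double it"

print(f"{raw_bytes / 1e9:.0f} GB raw, plan for {planned_bytes / 1e9:.0f} GB")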

gWaldo
  • Any chance you have some sample numbers from that (paired with retention settings)? Right now I'm thinking about different time-series data stores in terms of disk usage, so getting Graphite live for that is a bit of a to-do. – Kyle Brandt Sep 19 '13 at 16:19
  • @KyleBrandt Answer updated. – gWaldo Sep 19 '13 at 17:51
  • Thanks for this. So with the file size, is that what it will be after an hour of collecting data, or is that what the file size will pretty much always be? So is 4415092 representative of 5 years' worth of data at this retention, or is it representative of one hour of 1-minute data? Also, is that bytes or bits? – Kyle Brandt Sep 19 '13 at 18:22
  • This is a new implementation at this company, and I don't have access to my old one. Since the top-level fileSize result matches the `ls -l` result, I take that to be bytes. When I add up the sizes of the archives within the .wsp file (as reported by `whisper-info.py`), they come close to the overall .wsp file size (the rest I assume being metadata and such). This should be the size of the file for all time, as data falls down to lower resolutions and old data points are discarded. – gWaldo Sep 19 '13 at 18:45
  • Okay, so with this retention settings. Roughly: `ServerCount * MetricCount * 4.5MBytes` – Kyle Brandt Sep 19 '13 at 19:59
  • You have to calculate this for every retention policy, and 4.5MB is what it works out to for my (single) retention policy, but yes, that's the rough formula... – gWaldo Sep 19 '13 at 20:19
  • Really the simplest thing is to just try it. Set a retention policy, send in some data, and then use "ls -l" to see the file sizes. Whisper (.wsp) files NEVER change in size. Once created with one storage scheme, they never change. Delete them, change the schema, try it again. Check file sizes. This is a fast process. Do some simple multiplication then based on your number of data sources, and you are done. No need for formulas. – IcarusNM Dec 11 '15 at 16:22
  • Really, the simplest thing to do is use a Whisper Calculator tool, such as the one at https://gist.github.com/jjmaestro/5774063 – gWaldo Dec 11 '15 at 17:33

In the documentation for statsd they give an example of a data retention policy.

The retentions are 10s:6h,1min:7d,10min:5y, which is 2160 + 10080 + 262800 = 275040 data points, and they give an archive size of 3.2 MiB.

Assuming a linear relationship, this would be approximately 12.2 bytes per data point.
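
Restating that arithmetic explicitly (nothing new here, just the numbers above):

# The ~12.2 bytes-per-point figure from the statsd retention example.
points = 6*3600 // 10 + 7*86400 // 60 + 5*365*86400 // 600  # 2160 + 10080 + 262800
archive_bytes = 3.2 * 1024 * 1024                           # 3.2 MiB
print(points, archive_bytes / points)                       # 275040, ~12.2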

AndreKR
  • According to http://ops-school.readthedocs.org/en/latest/monitoring_201.html, (timestamp, value) pairs are stored as a long and a double, consuming 12 bytes per pair. The 0.2-byte difference is probably due to file metadata overhead?! – user27465 Jan 31 '15 at 18:07

No direct experience with Graphite, but I imagine the same logic we used for Cacti, or anything else RRD- or time-rollover-driven, would apply (Graphite doesn't use RRD internally anymore, but the storage logic seems comparable).

The quick answer is "Probably not as much space as you think you'll need."


The long answer involves some site-specific math. For our monitoring system (InterMapper) I figure out the retention periods, resolutions, and data-point size, do some multiplication, and add in overhead.

As an example I'll use disk space - we store figures with a 5 minute precision for 30 days, a 15 minute precision for a further 60 days, and then an hourly precision for a further 300 days, and we're using a 64-bit (8 byte) integer to store it:

  • 21600 samples total, broken down as:
    • 8640 samples for the 30-day 5-minute precision
    • 5760 samples for the 60-day 15-minute precision
    • 7200 samples for the 300-day 1-hour precision

At 8 bytes per sample that's about 173KB; healthy overhead for storage indexing and the like brings it to about 200KB for one partition's disk-usage data (with any error tending toward overestimation).

From the base metrics I can work out an average "per machine" size (10 disk partitions, swap space, RAM, load average, network transfer, and a few other things) -- it comes to about 5MB per machine.

I also add a healthy 10% on top of the final number and round up, so I size things at 6MB per machine.
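
A quick sketch of that math, for the record (the per-machine metric count is my own rough assumption based on the description above, not an exact inventory):

# Per-machine sizing following the reasoning above: 8-byte samples across
# three retention tiers, rounded up for indexing overhead, plus a 10% pad.
samples = 30*86400 // 300 + 60*86400 // 900 + 300*86400 // 3600  # 8640 + 5760 + 7200
per_metric = samples * 8                 # ~173 KB of raw samples
per_metric_padded = 200 * 1024           # rounded up to ~200 KB for indexing overhead

metrics_per_machine = 25                 # assumption: 10 partitions, swap, RAM, load, net, ...
per_machine = per_metric_padded * metrics_per_machine  # ~5 MB
per_machine_sized = per_machine * 1.10                 # add 10%, then round up to ~6 MB

print(samples, per_metric, per_machine, round(per_machine_sized))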

Then I look at the 1TB of space I have laying around for storing metrics data for charting and say "Yeah, I'm probably not running out of storage in my lifetime unless we grow a whole lot!" :-)

voretaq7
  • To throw a number from actual practice out there, with my production retention policies (9 months at 5 minutes; 1 year at hourly; 5 years at daily) and about 20 machines with ~20 8-byte metrics each, plus the warning/alarm/critical/outage event histories for 5 years, I'm using 1.5G of disk space. That's with InterMapper inserting everything into a Postgres database. So again - the quick answer is "Probably not as much space as you think you'll need" :-) – voretaq7 Sep 17 '13 at 21:21
  • Yeah, that math is straightforward; I'm really just looking more at how Graphite stores its data - it can make major differences at scale. The only thing I have found is that, according to the docs, it is not very space-efficient (probably because it counts on fairly aggressive rollups). – Kyle Brandt Sep 17 '13 at 21:47
  • [Whisper (the storage back-end Graphite uses)](https://graphite.readthedocs.org/en/latest/whisper.html) has some built-in space-chewing items -- you've probably already seen that page. The section about "Archives overlap time periods" stands out to me because it means the archives are bigger than my examples above because they all go back to the beginning of time (the 60 day archive is actually 90 days long ; the 300 day archive is 390 days long). Whisper also keeps a timestamp (4 or 8 bytes) along with each data point which needs to be added in too. Doesn't look tricky though, just bloated :) – voretaq7 Sep 17 '13 at 22:06

I have 70 nodes that generate a lot of data. Using Carbon/Whisper, one node alone created 91k files (it generates multiple schemas, each having multiple counters and variable fields which need to be selectable, e.g. (nodename).(schema).(counter).(subcounter).(etc)... and so on).

This provided the granularity I needed to plot any graph I want. After running the script to populate the remaining 69 nodes, I had 1.3 TB of data on disk. And that is only 6 hours' worth of data per node. What gets me is that the actual flat CSV file for 6 hours' worth of data is about 230 MB per node; 70 nodes is ~16 GB of data. My storage-schema was 120s:365d.

I'm relatively new to databases, so I might be doing something wrong, but I'm guessing it's all the overhead for each sample.
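
A rough estimate of one file under that schema, assuming ~12 bytes per pre-allocated data point (Whisper allocates every slot up front, whether or not data has arrived), gives a sense of where the space goes:

# Rough footprint of one Whisper file under a 120s:365d schema.
points_per_file = 365 * 86400 // 120     # 262,800 two-minute slots for a year
bytes_per_file = points_per_file * 12    # ~3.15 MB, allocated up front

reported_total = 1.3e12                  # the ~1.3 TB on disk reported above
print(bytes_per_file, round(reported_total / bytes_per_file))  # ~3.15 MB per file, ~412k files implied

So the disk usage is driven by how many files get pre-allocated and the full retention window, not by how much data has actually been written so far, which is why six hours of collection already costs a full year's worth of space per metric.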

So it was a fun experiment, but I don't think it makes sense to use Whisper for the kind of data I'm storing. MongoDB seems like a better solution, but I need to figure out how to use it as a backend to Grafana.

musca999