
I have installed Graphite via Puppet (https://forge.puppetlabs.com/dwerder/graphite) with nginx and PostgreSQL. When I send it data manually, it creates the metric, but all its data points are "None" (i.e. null). The same happens if I run the example-client.py shipped with Graphite.

echo "jakub.test 42 $(date +%s)" | nc 0.0.0.0 2003 # Carbon listens at 2003
# A minute or so later:
$ whisper-fetch.py --pretty /opt/graphite/storage/whisper/jakub/test.wsp | head -n1
Sun May  4 12:19:00 2014    None
$ whisper-fetch.py --pretty /opt/graphite/storage/whisper/jakub/test.wsp | tail -n1
Mon May  5 12:09:00 2014    None
$ whisper-fetch.py --pretty /opt/graphite/storage/whisper/jakub/test.wsp | grep -v None | wc -l
0

And:

$ python /opt/graphite/examples/example-client.py 
# Wait until it sends two batches of data ...
$ whisper-fetch.py /opt/graphite/storage/whisper/system/loadavg_15min.wsp | grep -v None | wc -l
0

This is, according to ngrep, the data that arrives at the port [from a later attempt] (the metric itself is on the third line):

####
T 127.0.0.1:34696 -> 127.0.0.1:2003 [AP]
  jakub.test  45 1399362193. 
####^Cexit
23 received, 0 dropped

This is the relevant part of /opt/graphite/conf/storage-schemas.conf:

[default]
pattern = .*
retentions = 1s:30m,1m:1d,5m:2y

Any idea what is wrong? Carbon's own metrics and data are displayed in the UI. Thank you!

Environment: Ubuntu 13.10 Saucy, graphite 0.9.12 (via pip).

PS: I have written about my troubleshooting attempts here - Graphite Shows Metrics But No Data – Troubleshooting

UPDATE:

  1. Data points in the whisper files are only recorded every 1 minute, even if the retention policy specifies a higher precision such as "1s" or "10s".
  2. Workaround for data being ignored: either use an aggregation schema with xFilesFactor = 0.1 (instead of 0.5) or set the lowest precision to 1m instead of <number between 1-49>s (a rough example entry is sketched right after this list) - see the comments below the accepted answer or the Graphite Answers question. According to the docs: "xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5." So regardless of the specified precision of 1s, the data gets aggregated to 1 minute and ends up being None, because less than 50% of the values in the minute period are non-None.
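
A minimal sketch of such an aggregation entry, assuming the default /opt/graphite/conf/storage-aggregation.conf (the section name is made up, and the file only affects .wsp files created after the change):

[lowered_xff]
# hypothetical section: accept a period if at least 10% of its slots are non-null
pattern = .*
xFilesFactor = 0.1
aggregationMethod = average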

SOLUTION

So @jlawrie led me to the solution. It turns out the data are actually there but get aggregated away to nothing; the reason is twofold:

  1. Both the UI and whisper-fetch show data from the highest-precision archive whose retention spans the whole query period, and the query period defaults to 24h. I.e. anything with retention < 1d will never show up in the UI or in fetch unless you select a shorter period. Since my retention for the 1s precision was 30 min, I'd need to select a period of <= the last 30 min to actually see the raw data at the highest precision being collected.
  2. When aggregating data (from 1s to 1min in my case), Graphite by default requires that 50% (xFilesFactor = 0.5) of the data points in the period have a value. If not, it ignores the existing values and aggregates them to None. So in my case I'd need to send data at least 30 times within a minute (30 is 50% of 60s = 1min) for it to show up in the aggregated 1-min value. But my app only sends data every 10s, so I only have 6 out of the possible 60 values.

=> the solution is to change the first precision from 1s to 10s and to remember to select a shorter period when I want to see the raw data (or extend its retention to 24h so it shows up by default).
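
For example, a sketch against the paths from my setup above (whisper-fetch.py accepts Unix timestamps via --from/--until): querying only the last 20 minutes should read the high-precision 30-minute archive instead of the default last-24h window:

$ whisper-fetch.py --pretty --from=$(date -d '20 minutes ago' +%s) /opt/graphite/storage/whisper/jakub/test.wsp | tail -n5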

  • The Graphite Answers question [Dataset filled with nulls?](https://answers.launchpad.net/graphite/+question/246005) is interesting in this context (it mentions the default addition of a null every 60s and the last-24h default) and because of its recommendation of ngrep for troubleshooting. – Jakub Holý May 05 '14 at 15:10
  • I have also asked for help at Graphite Answers - https://answers.launchpad.net/graphite/+question/248242 – Jakub Holý May 06 '14 at 08:39
  • Have you checked the logs? If there is a problem with the received metric (no \n, or \r\n used instead) you should see something in console.log or creates.log. These logs are stored in /opt/graphite/storage/log/carbon-cache/carbon-cache-a/ if you used the default install path. – mattsn0w May 06 '14 at 15:21
  • Yes, I have checked the logs. There was nothing of interest. Console log had essentially only "[..] ServerFactory starting on 7002 [..] Starting factory " and had records of creating the expected metrics but no mention of the data - f.ex. (for another data-less metric) "[..] creating database file /opt/graphite/storage/whisper/ring/handling-time/POST/15MinuteRate.wsp (archive=[(1, 1800), (60, 1440), (300, 210240)] xff=0.5 agg=average)" – Jakub Holý May 07 '14 at 08:15
  • @JakubHolý Could you update jlawrie's answer or post another answer as the question contains an answer now – 030 Jul 05 '16 at 15:38

2 Answers


I encountered the same issue using that same Puppet module. I'm not exactly sure why, but changing the default retention policy appears to fix it, e.g.:

class { 'graphite':
  gr_storage_schemas => [
    {
      name       => 'carbon',
      pattern    => '^carbon\.',
      retentions => '1m:90d'
    },
    {
      name       => 'default',
      pattern    => '.*',
      retentions => '1m:14d'
    }
  ],
}
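
If you want to confirm that newly created metrics actually pick up the changed schema, something like whisper-info.py on a fresh .wsp file should list the expected archives and the xFilesFactor (path assumes the default install location from the question):

$ whisper-info.py /opt/graphite/storage/whisper/jakub/test.wsp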
jlawrie
  • Thanks a lot! This mysterious change has really helped. Interesting that changing the retention from "1s:30m,1m:1d,5m:2y" to "1m:14d" does "fix" it. I will try to play with it more. Maybe there is some issue with the 1s granularity? – Jakub Holý May 07 '14 at 08:10
  • It indeed seems to be a problem with the seconds-level period - while `1m:1d,5m:2y` works (data recorded), `10s:30m,1m:1d,5m:2y` does not. Actually, from the .wsp file it seems that granularity < 1m is ignored, since the timestamps for the 10s:... config are still at 1-min intervals - "08:17:00, 08:18:00, etc." – Jakub Holý May 07 '14 at 08:45
  • OK, so the problem is related to the aggregation policy and `xFilesFactor`; the default that applies here is `average` with `xFilesFactor=0.5` (see `/opt/graphite/conf/storage-aggregation.conf`). When I switch to `sum` and `0.1` by changing the metric name, the data gets stored (though the points are still at 1m frequency): `echo -e "jakub.test.10s30m+1m1d+5m2y.count 42 $(date +%s)" | nc 0.0.0.0 2003` – Jakub Holý May 07 '14 at 09:02
  • I've played with different aggregation schemas; the _data is recorded_ (at 1m intervals) when I set `xFilesFactor = 0.1`, and the aggregation method does not matter (at least average, last, and sum all work). – Jakub Holý May 07 '14 at 09:21
  • According to [this](http://graphite.readthedocs.org/en/latest/whisper.html), the aggregation schemas only come into play with multiple retention policies. If I have just one retention policy, even at a resolution of 10 seconds (which is how often I'm sending data), it collects each individual data point. With multiple retention policies, it chooses one based on the time range of the query, which with whisper-fetch.py defaults to the last day - which I think is why you're only seeing data points every 1 minute. Still not sure why they'd show None instead of an aggregated value, though. – jlawrie May 07 '14 at 14:25
  • You're completely right! I had not realized that fetch does not show the raw data but runs a query and aggregates the data if necessary w.r.t. the defined retention periods and the actual query period. The values are ignored by the aggregation because the default factor of 0.5 requires that 50%+ of the data points in a period are set; if not, you get None. My app sends every 10s but the precision was 1s => I'd need 30 points for the aggregation but only have 6. – Jakub Holý May 10 '14 at 09:30
1

There are many ways that Graphite will lose data, which is why I really try to avoid using it. Let me start with a simple one: try having your application connect, wait a second (literally one second) and then output the timestamped data. I've found that in many circumstances this fixes that exact problem. Another thing to try is submitting data at a frequency much higher than the frequency at which Graphite stores data; I'll go into that a bit more below. Another frequent mistake is using the whisper-resize.py utility, which really didn't work for me. If your data isn't important yet, just delete the whisper files and let them be created with the new retention settings.
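
A rough sketch of that first suggestion, reusing the nc test from the question (open the connection, pause, then send the metric):

# open the TCP connection first, wait one second, then write the line
{ sleep 1; echo "jakub.test 42 $(date +%s)"; } | nc 0.0.0.0 2003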

Graphite's storage files, the whisper files, do not store the data as (value, timestamp) points the way you provided them to the program; instead each archive is a series of fixed slots that values get stored in. The program then figures out which slot corresponds to a time period using the retention configuration. If it gets a data point that doesn't fit exactly into a slot, I think it applies an average, min, or max depending on another file in the same directory as the retention file. I found that the best way to keep that from messing everything up was to submit data at a frequency much higher than the frequency at which Graphite stores it. It honestly gets quite complicated: not only are there retention periods and averaging algorithms that fill points (I think), but these values are ALSO baked into the whisper files themselves. Very odd things happen when they don't match, so until your config is working I would suggest deleting your whisper files repeatedly and letting Graphite recreate them.
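
As a rough illustration (my own sketch, not whisper's actual code), the slot a data point lands in is essentially its timestamp rounded down to the archive's step, e.g. 60 seconds:

# align a timestamp to a 60-second slot
ts=$(date +%s); echo "received=$ts slot=$(( ts - ts % 60 ))"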

This program really struck me as fairly buggy, so if you encounter something like this, don't assume it's your fault.

Some Linux Nerd
  • Thank you, I guess I should learn more about how the data retrieval and aggregation work, perhaps that is indeed the cause of the problem. However I think that "_submit data at a frequency that was much higher than the frequency at which graphite was storing data_" is a suboptimal solution, as only the last data point received in each Graphite period is recorded and the others are ignored - that's why f.ex. the [statsD flush period must = Graphite period](https://github.com/etsy/statsd/blob/master/docs/graphite.md#correlation-with-statsds-flush-interval). – Jakub Holý May 07 '14 at 08:22
  • BTW, Graphite/Carbon "losing" data could be related to Carbon settings such as MAX_UPDATES_PER_SECOND=500 and MAX_CREATES_PER_MINUTE=50 (I guess data points/metrics over the limit just get dropped). – Jakub Holý May 07 '14 at 08:57
  • It seems I was wrong; the documentation - if I interpret it correctly - says the settings above only limit disk access, and the data/metrics are still held in memory (though I would like to verify this properly first). – Jakub Holý May 07 '14 at 09:40
  • A few of those could definitely explain some of the problems I've had with that application. – Some Linux Nerd Sep 30 '14 at 21:12