Quick brief: for testing purposes, I installed the puppet agent on 5 nodes (Debian Squeeze + puppet 2.7.20-1puppetlabs1) and the puppet master on 1 server (same version).

On the puppetmaster side, every manifest checks whether $::osfamily == 'Debian'. Sometimes I also use $::fqdn and check that it is not empty.

The problem is that every day, at random hours, I get mail from the puppetmaster saying that it can't compile the catalog for one of the nodes. For example:

Fri Jan 18 19:18:24 +0100 2013 Puppet (err): Could not retrieve catalog from remote server: Error 400 on SERVER: Not supported osfamily at /etc/puppet/modules/system/manifests/skel.pp:20 on node mynodeX
Fri Jan 18 19:18:24 +0100 2013 Puppet (notice): Using cached catalog
Fri Jan 18 19:18:24 +0100 2013 Puppet (err): Could not retrieve catalog; skipping run

Another example, from the puppetmaster logs:

Jan 15 18:58:49 monitor puppet-master[14218]: No fqdn at /etc/puppet/modules/system/manifests/motd.pp:29 on node nodeY

Of course, on the next puppet agent run everything is fine. I have no idea how to find the cause of this issue. The problem is common to all 5 nodes.

I'm 100% sure that it's not related to cron.

Tomasz Olszewski

2 Answers

I've seen this issue on RedHat/CentOS. The puppet agent on the client machine would run out of file descriptors because of a ruby/puppet bug that left them open. After hitting the 1024 fd limit, it could no longer run facter, so the facts would be missing.

If subsequent puppet runs from the same process don't fail, it is probably a different problem, but it might be worth checking. In my case the puppet agent logged that it could not start facter, and /proc/PIDOFPUPPETD/fd contained 1024 file descriptors.
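A quick way to check the descriptor count is to list the process's fd directory under /proc. The following is only a minimal sketch: the function name `count_open_fds` is made up for illustration, it works on Linux only, and reading another user's process (such as the puppet agent) normally requires root.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors of `pid`.

    Counts the entries in /proc/<pid>/fd, so it only works on Linux
    and only for processes the caller is allowed to inspect.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Demonstration on the current process; substitute the puppet agent's PID.
    print(count_open_fds(os.getpid()))
```

A count close to the process's open-files limit (1024 in the case described above) would point at the descriptor leak; a small count makes that cause unlikely.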

growse
arjarj
  • I checked the number of fds, and there are 5 of them :-( So it's probably a different problem. I also didn't get any log messages from the puppet agent about problems with facter (only the message that it can't get the catalog from the puppetmaster) – Tomasz Olszewski Jan 19 '13 at 23:08

I found the source of my problem. It was my nagios plugin, which checks whether the puppet agent is running and listening for connections (I run the puppet agent with listen=true).

It looks like when there is more than 1 connection to the puppet agent at the same time, puppet can't gather facts correctly. For example, although my osfamily is "Debian", it returned just the generic "Linux".

How did I test it? I ran 2 loops at the same time, each with a command that connects to:

https://127.0.0.1:8139/production/facts/no_key

Example result:

OK: connection with puppet agent works (facter: 1.6.17, kernel: 2.6.32-5-amd64, os: Debian)
OK: connection with puppet agent works (facter: 1.6.17, kernel: 2.6.32-5-amd64, os: Debian)
OK: connection with puppet agent works (facter: 1.6.17, kernel: 2.6.32-5-amd64, os: Linux)
OK: connection with puppet agent works (facter: 1.6.17, kernel: 2.6.32-5-amd64, os: Debian)
OK: connection with puppet agent works (facter: 1.6.17, kernel: 2.6.32-5-amd64, os: Debian)

If I run a loop with only 1 command, it works every time.
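The same kind of test can be sketched in Python. This is not the plugin I used, only an illustration: `run_loops` and `fetch_os` are hypothetical names, and `fetch_os` stands for whatever function queries the facts URL above and returns the reported OS value (how to authenticate and parse the response is left to the caller).

```python
import threading
from collections import Counter


def run_loops(fetch_os, loops=2, iterations=50):
    """Call fetch_os() from `loops` threads at once and tally the results.

    fetch_os is a caller-supplied (hypothetical) function that asks the
    puppet agent for its facts and returns the OS value as a string.
    """
    counts = Counter()
    lock = threading.Lock()

    def worker():
        for _ in range(iterations):
            value = fetch_os()
            with lock:  # Counter updates from several threads need a lock
                counts[value] += 1

    threads = [threading.Thread(target=worker) for _ in range(loops)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Running it with loops=1 and then loops=2 and comparing the tallies should show whether unexpected values such as "Linux" appear only when the connections overlap.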

I'm not sure whether it's really a puppet problem or something deeper (ruby modules), but to fix this issue I need to stop connecting to the puppet agent server.

Tomasz Olszewski