
Incrontab is set up to monitor approximately 10 directories. All it does is start a Bash script when a new file is received in one of these directories. Roughly one file is received every 5 minutes in each of the directories. However, incrond occasionally stops. There is no pattern to when it happens; it varies from a few times per week to a few times per month. The error that is logged is:

incrond[35203]: *** unhandled exception occurred ***
incrond[35203]:   polling failed
incrond[35203]:   error: (11) Resource temporarily unavailable
incrond[35203]: stopping service

I am aware that I have not posted a lot of information. However, the system is closed, so I have shared what I could. I am not looking for a direct answer (since the question might be too broad); I am looking for ideas I can research. What could be the reason for such behavior? What things should I check? Which resources should I look at?

2 Answers


incrond uses the kernel-level inotify subsystem, encapsulating the C-based inotify interface in a C++ container. Looking at the incrond source files, it seems that the error you are facing is related to a failed poll on the file descriptor encapsulated in the incrond C++ class:

int res = poll(ed.GetPollData(), ed.GetSize(), -1);

if (res > 0) {
  ed.ProcessEvents();
}
else if (res < 0) {
  switch (errno) {
    case EINTR:   // syscall interrupted - continue polling
      break;
    case EAGAIN:  // not enough resources - wait a moment and try again
      syslog(LOG_WARNING, "polling failed due to resource shortage, retrying later...");
      sleep(POLL_EAGAIN_WAIT);
      break;
    default:
      throw InotifyException("polling failed", errno, NULL);
  }
}
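
To make the three branches of the quoted loop easier to reason about, here is a minimal sketch that isolates the errno decision into a standalone function. The names `PollAction`, `classify_poll_errno` and `describe_errno` are hypothetical and introduced only for illustration; they are not part of incrond. The sketch assumes a Linux system, where EAGAIN typically has the value 11.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper type (not part of incrond): what a poll() loop
// should do next after poll() has returned a negative value.
enum class PollAction { Retry, RetryAfterDelay, Fatal };

// Mirrors the switch in the quoted snippet: EINTR is retried at once,
// EAGAIN is retried after a pause, anything else is treated as fatal
// (the quoted code throws InotifyException in that case).
PollAction classify_poll_errno(int err) {
    switch (err) {
        case EINTR:
            return PollAction::Retry;
        case EAGAIN:
            return PollAction::RetryAfterDelay;
        default:
            return PollAction::Fatal;
    }
}

// Formats an errno value as "(number) text", the same shape as the
// "error: (11) Resource temporarily unavailable" line in the log.
std::string describe_errno(int err) {
    return "(" + std::to_string(err) + ") " + std::strerror(err);
}
```

With the switch as quoted, `classify_poll_errno(EAGAIN)` yields `RetryAfterDelay`, so errno 11 would produce a warning rather than the exception. One possibility worth checking, not confirmed here, is that the build running on the affected machine lacks the EAGAIN case, in which case errno 11 would reach the default branch and produce the logged message.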

It is difficult to identify the exact cause for the failed polling. The most common causes can be:

  • an overloaded system
  • a crash/segfault of some incrond functions

Anyway, how many files exist under your monitored directories?

shodanshok
  • On average, each of the directories contains between 1000 and 2000 files. – Muhamed Huseinbašić Dec 30 '16 at 17:52
  • If it was that, then the log would talk about a resource shortage. It just says it's stopping. – Matthew Ife Dec 30 '16 at 19:32
  • @Matthew Ife: please look at the code. EAGAIN would cause the log message about resources. Any other system error would cause the message the OP reported. – shodanshok Dec 30 '16 at 20:49
  • @Muhamed OK, so it is not related to the number of files. What OS are you using? What patch level? If necessary, can the OS be updated to the current patch level? – shodanshok Dec 30 '16 at 20:51
  • @shodanshok - It's Ubuntu 12.04. Migration to 16.04 is planned soon. But until then, I am looking for more insight into the error. What do you mean by patch level? Any more ideas on how it could be traced? – Muhamed Huseinbašić Dec 30 '16 at 21:02
  • @shodanshok error number 11 (the one supplied in the message) is EAGAIN. I think this is happening somewhere else. – Matthew Ife Dec 30 '16 at 22:23
  • @Matthew Ife: You are right, errno 11 is EAGAIN. Some unhandled exception must be raised directly inside the called class/object. Anyway, I think it is nothing the OP can deal with directly, as it seems to be something similar to a component crash... – shodanshok Dec 30 '16 at 22:32
  • @MuhamedHuseinbašić: by patch level, I mean the current stable patches for your software release. In other words: did you run an `apt-get upgrade` (to upgrade all packages in the 12.04 branch) and reboot with the new kernel? – shodanshok Dec 30 '16 at 22:34
  • @shodanshok - sorry for not replying at the time. Basically, no: the system is in a closed environment and is updated very rarely, which I am not in charge of, so nothing could have been done there. Anyhow, thanks a lot for your help at the time, I really appreciate it. – Muhamed Huseinbašić Apr 19 '18 at 21:56

Use strace on the command, logging to a file, and set the logging file to rotate depending on how frequently you notice the failure has occurred.

E.g., if it takes you a week to notice that it has failed, your rotated logs have to be kept for 7 days (or more). If you're generally aware within an hour, then 6 to 10 hours of hourly rotated logs should be sufficient.

More about it and examples: http://www.thegeekstuff.com/2011/11/strace-examples

James
TG2
  • Do you mean using strace on each and every line in incrontab? Besides what was already mentioned, [monit](https://mmonit.com/monit/) is actively monitoring the incrond process, so I am aware of a stop within an hour at most. – Muhamed Huseinbašić Dec 30 '16 at 16:15