0

I have an issue almost identical to what is here: Cron jobs not working anymore

However I can't seem to find any memory-related problems as per the accepted answer, so I thought I should open a new question instead of answering on that one.

Versions:

uname -a

Linux __hostname__ 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

sudo systemctl --version

systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

cat /etc/redhat-release

CentOS Linux release 7.7.1908 (Core)

My hourly cron jobs were executing successfully for a few weeks, but at some point they stopped executing on both of my servers (they didn't stop at the same time on both servers). The jobs are registered in the crontab of a technical user, i.e. no login shell for that user. They are set to execute at 03 minutes every hour (task is to import certain log files to DB and we accomodated 3 minutes for hourly log rollover).

Successful execution:

Nov 13 18:03:01 server02 CROND[8057]: (__tech_user__) CMD (/opt/xxx/LogImporter/start.sh &>/opt/xxx/LogImporter/import.log)
Nov 13 18:03:01 server02 CROND[8058]: (__tech_user__) CMD (/opt/yyy/LogImporter/start.sh &>/opt/yyy/LogImporter/import.log)

No other logs at that time. Unsuccessful execution an hour later:

Nov 13 19:03:01 server02 CROND[12006]: (__tech_user__) CMD (/opt/xxx/LogImporter/start.sh &>/opt/xxx/LogImporter/import.log)
Nov 13 19:03:01 server02 CROND[12007]: (__tech_user__) CMD (/opt/yyy/LogImporter/start.sh &>/opt/yyy/LogImporter/import.log)
Nov 13 19:03:01 server02 CROND[12009]: (CRON) EXEC FAILED (/usr/sbin/sendmail): Resource temporarily unavailable
Nov 13 19:03:01 server02 CROND[12008]: (CRON) EXEC FAILED (/usr/sbin/sendmail): Resource temporarily unavailable
Nov 13 19:03:01 server02 CROND[12002]: (__tech_user__) MAIL (mailed 71 bytes of output but got status 0x0001#012)
Nov 13 19:03:01 server02 CROND[12003]: (__tech_user__) MAIL (mailed 71 bytes of output but got status 0x0001#012)

Note: __tech_user__ is a replacement for actual username.

Server01 and Server02 are producing the same error logs, but Server01 "broke" 3 hours earlier on same day. When I launch the same script manually, it works fine. For reference, here is the start.sh:

#!/bin/bash

export JAVA_HOME=/opt/xxx/Java/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH

#Use lockfile to avoid simultaneous instances
exec 200>/opt/xxx/LogImporter/start.lock
flock -n 200 || exit 1

java -cp /opt/xxx/LogImporter/LogImporter.jar x.y.z.logimporter.LogImporter /opt/xxx/LogImporter/

As I mentioned above, I don't see any clear memory issues. On Server01 I have double instance running of everything (separate environments), but there is still available memory:

top - 11:04:25 up 38 days, 22:43,  2 users,  load average: 0.18, 0.25, 0.31
Tasks: 205 total,   1 running, 204 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  0.3 sy,  0.0 ni, 98.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16265944 total,   321012 free,  7956004 used,  7988928 buff/cache
KiB Swap:  4063228 total,  3924988 free,   138240 used.  7148196 avail Mem

Server02 has even more (single environment only):

top - 12:01:17 up 38 days, 23:39,  3 users,  load average: 0.23, 0.19, 0.21
Tasks: 199 total,   1 running, 198 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.0 us,  1.4 sy,  0.0 ni, 94.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16265944 total,  3711468 free,  5559448 used,  6995028 buff/cache
KiB Swap:  4063228 total,  3930364 free,   132864 used.  9602824 avail Mem

Interestingly, I have a daily cron job also on each server, running at 00:15 every day, and they also failed on same day (last successful run on Nov 13 00:15, failed since Nov 14 00:15 every day). My full crontab for the technical user:

Server01:

15 0 * * * /opt/xxx-env1/Logrotator/logrotate.sh &>/opt/xxx-env1/Logrotator/logrotate.log
15 0 * * * /opt/xxx-env2/Logrotator/logrotate.sh &>/opt/xxx-env2/Logrotator/logrotate.log
3 * * * * /opt/xxx-env1/LogImporter/start.sh &>/opt/xxx-env1/LogImporter/import.log
3 * * * * /opt/xxx-env2/LogImporter/start.sh &>/opt/xxx-env2/LogImporter/import.log
3 * * * * /opt/yyy/LogImporter/start.sh &>/opt/yyy/LogImporter/import.log

Server02:

15 0 * * * /opt/xxx/Logrotator/logrotate.sh &>/opt/xxx/Logrotator/logrotate.log
3 * * * * /opt/xxx/LogImporter/start.sh &>/opt/xxx/LogImporter/import.log
3 * * * * /opt/yyy/LogImporter/start.sh &>/opt/yyy/LogImporter/import.log

I tried to comment out some jobs, and leaving only 1 in each server, but still the same issue. What I also noticed, and I'm not sure if it is a problem or not, is that crond.service shows the TasksCurrent is very high:

sudo systemctl show crond.service

Type=simple
Restart=on-failure
NotifyAccess=none
RestartUSec=30s
...
MemoryCurrent=18446744073709551615
TasksCurrent=18446744073709551615
Delegate=no
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
...
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615

However, I see the same numbers on my test server, and crontab is executing fine (although I do not have the hourly jobs, only the daily jobs).

Gábor Major
  • 113
  • 4

1 Answers1

0

Huh. So similarly to the other question I mentioned, it was a completely different application. A Tomcat 8.5 which is running on same server, I restarted it on both servers for a different reason, and cron jobs are running again... I'm stumped. Although I notice there is a bit more memory available:

Server01:

top - 14:23:35 up 39 days,  2:02,  2 users,  load average: 0.26, 0.50, 0.43
Tasks: 201 total,   1 running, 200 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  2.2 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16265944 total,  1149468 free,  5950648 used,  9165828 buff/cache
KiB Swap:  4063228 total,  3926524 free,   136704 used.  9160900 avail Mem

Server02:

top - 14:23:27 up 39 days,  2:01,  3 users,  load average: 0.13, 0.25, 0.28
Tasks: 199 total,   1 running, 198 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16265944 total,  5178096 free,  3902808 used,  7185040 buff/cache
KiB Swap:  4063228 total,  3930364 free,   132864 used. 11246396 avail Mem

It is not impossible that we have a memory leak in the application running on Tomcat. But we had available memory even before, over 3.5 G on Server02, which I assumed to be enough.

Gábor Major
  • 113
  • 4