
Is there any priority difference between a root user process and a non-root process on CentOS? When I run a Node.js server as the root user, it runs smoothly, but after some time (say, a few weeks) it hangs the entire server, which then needs a hard reboot.

Why can't CentOS kill or terminate that process? Is that because the service is running as the root user?

vimalpt
  • Can you please expand on "it hangs the entire server and needs a hard reboot"? Does the server stop responding to SSH? The console goes dark and cannot be woken back up? Pings get dropped? Power supply catches on fire? – BMDan Oct 10 '14 at 14:34
  • SSH hangs, or I cannot log in at all, but ping responds for a certain time. – vimalpt Oct 13 '14 at 04:18

3 Answers


By definition, processes executing as UID 0 are not constrained by filesystem permissions or most system resource limits (most of the limits in /etc/security/limits.conf are treated as suggestions, and a root process can twiddle kernel parameters). I would suggest tracking the amount of memory and CPU this process consumes over time, as well as its disk and network I/O saturation.

I'd be willing to bet that the process either has a memory leak and is slowly starving the system (without being constrained by ulimits), eventually causing other processes to be shot by the OOM killer (including sshd, Apache, etc.), or has a file-handle leak that eventually starves other processes of the handles they need for things like TTY sessions or access to configuration files.
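
To tell the two apart, it can be enough to poll /proc for the process's resident memory and open-file-descriptor count and log the results somewhere persistent. A minimal Python sketch of that idea (the PID below is a hypothetical stand-in for your Node.js process):

    import os, time

    PID = 1234  # hypothetical: substitute the PID of your Node.js process

    def snapshot(pid):
        # The VmRSS line of /proc/<pid>/status is resident memory in kB.
        with open(f"/proc/{pid}/status") as f:
            rss_kb = next(l for l in f if l.startswith("VmRSS:")).split()[1]
        # Each entry under /proc/<pid>/fd is one open file descriptor.
        open_fds = len(os.listdir(f"/proc/{pid}/fd"))
        return int(rss_kb), open_fds

    while True:
        rss_kb, open_fds = snapshot(PID)
        print(f"{time.ctime()}  rss={rss_kb} kB  fds={open_fds}", flush=True)
        time.sleep(300)  # 5-minute polls, roughly MRTG's granularity

A steadily climbing rss figure points at the memory leak; a steadily climbing fds figure points at the handle leak.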

You can set up net-snmp to expose network I/O, memory, and CPU utilization, and track them over time using something like MRTG (running on another box, of course) to see how they trend. Because it can take weeks for the problem to manifest, the default granularity of MRTG (one poll every 5 minutes) should be sufficient to illuminate the trend.

DTK
  • Just to add to this answer: collect metrics for key performance indicators via logging, and then, after the hard reboot, troubleshoot until you find what is starving your server. I'd prefer this approach over MRTG as it is instantaneous and you don't need to configure anything; just add some logger function that uses your server's rsyslog or syslog-ng to write a logfile under /var/log. – Marcel Oct 08 '14 at 17:04
  • sar and atop (in daemon mode) are also useful tools for collecting performance info with a lot less setup effort than MRTG (unless you are already using MRTG, of course). atop is nice in that you can see a lot about what individual processes are doing, including I/O usage. All of these tools are limited, though, by their approach of periodically polling what is going on, which may be unhelpful if your problem actually occurs quickly. – mc0e Oct 09 '14 at 09:37

The root user has the power to override nearly all system settings. Apart from filesystem access, though, a root-owned process generally has to ask explicitly for any special treatment, rather than automatically being given extra access or priority.

E.g. processes have a 'niceness' value (see man nice) which dictates which processes get priority access to CPU time. Root-owned processes are scheduled according to their niceness like any others, but whereas other processes can only increase their own niceness, the root user can also decrease it (i.e. take a higher priority).
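
That asymmetry is easy to demonstrate; here is a small Python sketch (the underlying setpriority(2) syscall behaves the same way from any language):

    import os

    # Any process may raise its own niceness, i.e. lower its priority.
    os.nice(5)
    print("niceness is now", os.getpriority(os.PRIO_PROCESS, 0))

    # Moving back down (raising priority) is normally reserved for root
    # (modern kernels can grant non-root headroom via RLIMIT_NICE).
    try:
        os.setpriority(os.PRIO_PROCESS, 0, -5)
    except PermissionError:
        print("lowering niceness requires root")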

Similar rules apply to resource limits (ulimit) and to ionice as apply to nice. This is contrary to what DTK seems to be saying about ulimit, but note that ulimit is a BSD technology that is only partially implemented in Linux: the ulimit command-line interface will silently let you specify some limits that the kernel in fact ignores. Also, resource limits come in pairs: a hard limit, which only root can raise, and a soft limit, which is what is actually in effect and which the owning user can adjust anywhere up to the hard limit.
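
The soft/hard split is easy to see through the setrlimit(2) syscall; a short Python sketch using the standard resource module:

    import resource

    # Current limits on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft} hard={hard}")

    # Any user may move the soft limit anywhere up to the hard limit...
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

    # ...but raising the hard limit itself requires root (CAP_SYS_RESOURCE).
    # (Assumes the hard limit is finite, as it normally is for RLIMIT_NOFILE.)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard + 1, hard + 1))
    except (ValueError, PermissionError):
        print("raising the hard limit requires root")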

If a process cannot be killed, it is usually because it is blocked waiting for some kernel call to complete (uninterruptible sleep, shown as state 'D' by ps). The most common example is waiting for an operation on a filesystem which has gone away, or which is so heavily overloaded that the kill cannot take effect until the kernel call completes. In the case of something like a remote-mounted filesystem whose network link has broken, the call may never complete.
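
A small Python sketch that scans /proc for processes stuck in that state:

    import os

    # List processes in uninterruptible sleep ("D"), the state in which
    # even SIGKILL is deferred until the kernel call returns.
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                data = f.read()
        except OSError:
            continue  # that process exited while we were scanning
        # The state letter is the first field after the ")" closing the
        # (possibly space-containing) command name.
        name, _, rest = data.partition(")")
        if rest.split()[0] == "D":
            print(pid, name.split("(", 1)[1])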

mc0e

One additional note: you rarely want to run a daemon or other long-running task as root if you can possibly avoid it (or, if it must start as root, have it change to an unprivileged user as soon as it can). The root user has much more power than other accounts, and as such, if it is subverted, it can perform significantly more destructive actions. Best current practice for most network services is to have the service run as its own account, in its own constrained directory subtree, with no rights to any resources outside its sandbox. Your code may be perfect, but if a bad guy can find a defect in a library you use or in the Node.js interpreter, that bad guy can subvert your code, and if it is running as root he may be able to do far more damage.
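
The usual pattern is to do the privileged work first (e.g. binding a port below 1024) and then drop to the unprivileged account before serving any traffic. A sketch of the idea in Python; Node.js exposes the same mechanism through process.setgid() and process.setuid(). The "nodeapp" account is a hypothetical example:

    import os, pwd, socket

    # Privileged step: bind a low port while we are still root.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 80))
    sock.listen(128)

    # Drop privileges ("nodeapp" is a hypothetical service account).
    acct = pwd.getpwnam("nodeapp")
    os.setgroups([])        # shed root's supplementary groups first
    os.setgid(acct.pw_gid)  # group before user, or we lose the right to switch
    os.setuid(acct.pw_uid)

    # From here on, a compromise is confined to the "nodeapp" account.
    assert os.getuid() != 0 and os.getgid() != 0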

DTK