I upgraded an old (about 5 to 7 years old) Fedora Server to 32 and now have a problem with a process being stopped by the OS. The only "application code" that is new is the Java version (OpenJDK), and that shouldn't matter to the Java code. The hang is "restartable" if the process was started from the command line (and maybe otherwise, but if so I don't yet know how): it goes into a "Stopped" state (reported by ps as Tl) that can be undone by foregrounding (more on that below).

So, something changed about the OS itself.

There are multiple ways to start the code that's being stopped but I've been starting it via the command line as a matter of convenience, backgrounding it via &. However it's started, it runs a Java-based daemon that looks for work to do and when it finds something, it launches a child process - also Java - which does some processing on its own, asynchronously.

If there's no child launch, there's no stop.

If there's a child launch, the child runs for a bit and then the whole process tree, from the command-line-started daemon on down, is put in the "Stopped" state. I probably can't easily give it enough load to start multiple children before one of them stops the dispatching daemon, because the system is so fast that it gets to the stop point nearly instantly.

The child's functions are broken into "Prelude", "Main", "Epilogue", and "Cleanup", and it reports what it's doing, so we know where it is. ... It always reaches Main before it gets stopped, so I know for sure it runs various sorts of setup before the stop; it's not merely stopping when the child Java process is launched. Both processes are put in the Tl state.

From the command line, jobs shows the job as "Stopped". You can then fg back to the daemon itself, and it then prompts for a password (I've NEVER seen this behavior before?!). Once foregrounded, all stopped threads / processes are resumed and it runs to completion like nothing ever happened. I usually ^z and then bg to return to the previously backgrounded state...

On a lark, I tried NOT backgrounding and when it gets to the problem spot, I just get a prompt for password! ... I haven't ever seen anything like this behavior before on ANY system. This MUST be a strong clue.

Investigating the Tl state has so far yielded absolutely nothing, though the T state itself is reasonably documented. Other than the fg trick, I don't yet know how to restart such a stopped job. (A method to restart such a stopped process tree without having to enter a password would be a good temporary workaround!) Nor have I yet had time to learn everything that can put a process into the T state. That's my next focus, after I finish pursuing something I share below.
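For reference, in the ps STAT column T means "stopped by a job-control signal" and the trailing l only means the process is multi-threaded, as any JVM is. A minimal sketch of listing stopped processes follows; `list_stopped` and the `sleep` stand-in are hypothetical names used only for illustration, not part of the application described here.

```shell
#!/usr/bin/env bash
# List every process whose STAT starts with T (stopped by a job-control signal).
# Separate -o options with empty headers suppress the header line.
list_stopped() {
    ps -e -o pid= -o ppid= -o pgid= -o stat= -o args= | awk '$4 ~ /^T/'
}

# Demonstration with a hypothetical stand-in process (a plain sleep).
sleep 30 &
demo_pid=$!
kill -STOP "$demo_pid"          # force the stand-in into the T state
sleep 0.2                       # give the kernel a moment to update the state
stopped_listing=$(list_stopped)
echo "$stopped_listing"

# Clean up the stand-in.
kill -KILL "$demo_pid"
wait "$demo_pid" 2>/dev/null || true
```

The PGID column is worth a look: job-control stops are delivered to a whole process group, which would be consistent with the daemon and its child stopping together.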

This is completely unacceptable behavior and I need to fix it ASAP.

I thought maybe this was modified behavior of the OOM killer, but it isn't killing jobs?! But maybe, so I observed memory and there's not a lot of actual memory free but there's always plenty of virtual memory (swap space) available.

The only other quirk that might be playing a role is that while I upgraded the server to Fedora Server 32, if I'm not supervising the boot at the console, it boots into 31 instead - VERY frustrating and I've already tried fruitlessly to fix this. It appears to be an old bug in grub as the system was "upgraded". However, it's presently running as 31 - it's a server and restarting in person at the console is painful! - could this be the cause?! I'm guessing this is not the cause, but I don't know.

I turned selinux off, just to be sure that wasn't the cause, and it's not (presuming the NSA stuff can actually be turned off).

Desperate for a solution ASAP, I've done what research I can and a web search suggested a Fedora 30 issue that was closed regarding EnableMultipleStreamsException. But this appears to NOT be the problem. ... Still researching!

WORKAROUND Attempts

I tried using kill -CONT <pid>, but it only freed the daemon; it didn't work on the child processes, even when I applied the same command directly to them. My wild-assed guess is that they are waiting for the password to be supplied from somewhere. NOTABLY, the state changed from Tl to Sl! What does this mean? IDK.
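A minimal sketch of what SIGCONT does to a stopped process, using a plain `sleep` as a hypothetical stand-in for the daemon (the real PIDs are not shown here). The stand-in goes from T to S, where S is an ordinary interruptible sleep. A process that was stopped for trying to read the terminal from the background will normally be stopped again as soon as it retries the read, which might be why the children did not stay resumed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the stopped daemon: a plain sleep.
sleep 30 &
pid=$!

kill -STOP "$pid"                        # put it in the stopped (T) state
sleep 0.2
state_stopped=$(ps -o stat= -p "$pid")
echo "after SIGSTOP: $state_stopped"

kill -CONT "$pid"                        # the signal name is CONT
sleep 0.2
state_resumed=$(ps -o stat= -p "$pid")
echo "after SIGCONT: $state_resumed"

# Clean up the stand-in.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

To signal an entire process group rather than one PID, the usual form is `kill -CONT -- -<pgid>`, with the group ID taken from `ps -o pgid=`.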

SOMETHING IN THE OS STOPPED THAT TREE OF PROCESSES!

And I have to turn that something off or go to another OS, however painful that is.

Has anybody else seen this before, or know what to do?

Richard T
  • What does this password prompt look like? – Michael Hampton Aug 11 '20 at 04:13
  • @MichaelHampton It looks rather like a standard ssh login password prompt if you're not using keys. And, you know, I thought about creating a pair of keys for the afflicted account to avoid the pwd prompt! It might be a functional workaround - maybe I should try that next, but I'd FAR rather stop the stops! ... Specifically, it's ^@'s password: – Richard T Aug 11 '20 at 04:16
  • Umm, is your app actually trying to ssh somewhere? – Michael Hampton Aug 11 '20 at 04:40
  • @MichaelHampton Absolutely not! – Richard T Aug 11 '20 at 04:55
  • Well, it certainly looks like it is! You may want to focus more on what the app or its startup script is doing. – Michael Hampton Aug 11 '20 at 05:13
  • @MichaelHampton I certainly appreciate your help, but I'm 100% certain it's NOT doing anything that requires any kind of login because I wrote it. It's just a simple Java program (not JavaScript, of course), and all it does with a network is a database connection which has already succeeded before the process tree is halted. – Richard T Aug 11 '20 at 07:18
  • If you start a process in the background from an interactive shell, then if it (or any processes created by it) tries to read from the tty, the whole process group will get a SIGTTIN signal and stop. You can avoid it being stopped by running the program as `nohup program &`, but that won't let you provide any input; nohup will make the program read stdin from /dev/null. As mentioned above, you need to find the part of the program that is asking for a password. A Java debugger or even gdb may help. – Mark Plotnick Aug 11 '20 at 11:34
  • As a start, I'd run `pstree` and see if your program has started another process that might try to read from stdin. – Mark Plotnick Aug 11 '20 at 11:39
  • @MarkPlotnick You and MichaelHampton were helpful, I found it. I was both right and wrong. It wasn't the Java, but some Bash scripting was getting called and that tried to push a file via scp. Apparently when the server was replaced, keys changed! Oops! ... I'd like to recraft this question into something more useful for others, but no time for that right now - maybe later today. Also, the nohup part was right: the daemon has an option to "single thread" or not and when not, it uses nohup, so it can go away, etc. And it worked just as described here, abandoning the child - no tty, no pwd, hung – Richard T Aug 11 '20 at 14:55
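
Given that the culprit was an scp call in a Bash script prompting for a password after the keys changed, one possible guard is to make scp fail with an error instead of prompting. This is only a sketch: `push_file`, the `SCP` override variable, and the host and paths in the usage line are hypothetical, since the real script is not shown. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/usr/bin/env bash
# push_file is a hypothetical helper; the real script, host and paths are not shown.
# SCP can be overridden (for example with a stub) when testing; it defaults to scp.
push_file() {
    local src=$1 dest=$2
    # BatchMode=yes disables password/passphrase prompting, so a missing or
    # changed key produces a non-zero exit status rather than a prompt.
    # ConnectTimeout bounds how long an unreachable host can hold things up.
    # Redirecting stdin keeps the command from reading the script's standard input.
    "${SCP:-scp}" -o BatchMode=yes -o ConnectTimeout=10 "$src" "$dest" </dev/null
}

# Usage (hypothetical host and path):
#   push_file report.txt user@backup.example:/srv/incoming/ || echo "push failed" >&2
```

With a guard like this, a key mismatch should show up as a logged failure in the calling script rather than as a silently stopped process tree.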
