
I have an existing Slurm cluster up and running. As of today, without any configuration change, certain sacctmgr commands fail with an error and slurmdbd crashes:

$ sacctmgr list associations
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slurm.domain.com:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type 1410
sacctmgr: error: slurmdbd: DBD_GET_ASSOCS failure: Connection refused
 Error with request: Connection refused

The systemctl status shows:

Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Main process exited, code=killed, status=11/SEGV
Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Failed with result 'signal'.

and the slurmdbd.log says:

[2020-07-03T10:01:45.816] debug2: Opened connection 9 from 127.0.0.1
[2020-07-03T10:01:45.817] debug:  REQUEST_PERSIST_INIT: CLUSTER:slurmcluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:9
[2020-07-03T10:01:45.817] debug2: acct_storage_p_get_connection: request new connection 1
[2020-07-03T10:01:45.861] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2020-07-03T10:01:45.862] debug4: got 0 commits
[2020-07-03T10:01:45.949] debug2: DBD_GET_ASSOCS: called
[2020-07-03T10:01:45.950] debug4: 9(as_mysql_assoc.c:2032) query
call get_parent_limits('assoc_table', 'root', 'slurmcluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;

However, other commands work (slurmdbd has to be restarted after each crash):

$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
slurmclus+       127.0.0.1         6817  8192         1                                                                                           normal

I can connect to the database and execute commands. Also, I can connect via telnet slurm.domain.com 6819.

I'm using slurm 17.11.2 with MySQL 5.7 from the standard Ubuntu 18.04 repositories.

Sethos II
  • Are you using MariaDB or MySQL? Which version? slurmdbd was killed with 11/SEGV, which is a segmentation fault. If you are on MariaDB, it may be a hardware issue. – Vinícius Ferrão Jul 03 '20 at 09:03
  • @ViníciusFerrão: I'm using MySQL also from the standard Ubuntu repositories. I don't know about hardware issues. It's a virtual machine that works fine otherwise and it's odd that this only happens on some commands. – Sethos II Jul 03 '20 at 09:25
  • Which version? Is it higher than 5.5? If so, switch to MariaDB. I’m almost sure that will fix your issue. If so, I will write a proper answer. – Vinícius Ferrão Jul 03 '20 at 09:26
  • Yes, it's MySQL 5.7. – Sethos II Jul 03 '20 at 09:27
  • If you’re up for it, switch to MariaDB. I’ve had a lot of issues with DBD and MySQL on SLURM since the fork. Going to bed right now, so please let me know whether you tried the change when I wake up. – Vinícius Ferrão Jul 03 '20 at 09:32
  • I don't really know what I need to take care of when switching from MySQL to MariaDB. So please post an answer. – Sethos II Jul 03 '20 at 12:21
  • If you don't care about previous data you can just remove and install MariaDB instead. Many times it will even use the tables from MySQL since MariaDB is supposed to be a drop-in replacement. – Vinícius Ferrão Jul 04 '20 at 05:06
  • I want to keep the data. When installing MariaDB it says that the existing MySQL data is not compatible. I tried to restore a dump, but Slurm's accounting settings were not carried over. – Sethos II Jul 06 '20 at 08:55

1 Answer


It turns out that the problem was an unattended upgrade, in which MySQL was updated from 5.7.29 to 5.7.30. Everything works with MySQL 5.7.29. The changelog doesn't mention anything obviously related, but according to the slurm-users mailing list this is the problem:

Seems that (at least for the mysql procedure get_parent_limits) mySQL 5.7.30 returns NULL where mySQL 5.7.29 returned an empty string.
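
To check whether a machine is in the same situation, something like the following could be used. It is only a sketch: the function names are made up for illustration, and the only versions it knows about are the two mentioned here (5.7.29 working, 5.7.30 reported broken). The query is the one from the slurmdbd.log excerpt in the question.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm or MySQL.

# Pull the first x.y.z version number out of a string such as the
# output of `mysqld --version`.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Classify a MySQL version using only what is reported in this thread.
# Anything else is "unknown" rather than guessed.
classify_mysql_version() {
    case "$1" in
        5.7.29) echo "known-good" ;;
        5.7.30) echo "reported-broken" ;;
        *)      echo "unknown" ;;
    esac
}

# Print the query slurmdbd logged just before the crash, for a given
# cluster name, so it can be run by hand in a mysql client.
build_parent_limits_query() {
    printf "call get_parent_limits('assoc_table', 'root', '%s', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;\n" "$1"
}

# Example (commented out, needs a local MySQL install):
#   classify_mysql_version "$(extract_version "$(mysqld --version)")"
#   build_parent_limits_query slurmcluster | mysql -u slurm -p slurm_acct_db
```

The user name `slurm` and database name `slurm_acct_db` in the commented example are assumptions (the latter is slurmdbd's default StorageLoc); substitute whatever slurmdbd.conf uses. If the hand-run query shows NULL in columns such as @qos, that matches the behaviour described in the quote. Once a working version is installed, holding the MySQL server package with `apt-mark hold` should keep unattended upgrades from replacing it again.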

Sethos II