We have one server that exports two very large partitions (one is 61 TB, the other is 28 TB), and many users read from and write to them while they are mounted on another machine.
Several users have independently complained that these mounts are spontaneously unmounted and remounted, which disrupts work in progress such as MATLAB jobs. I have asked them for a time frame. In the meantime I have been running nicstat -i eth0 3 (check NIC utilization once every three seconds) for the past several hours, and I intend to leave it running. I'm writing the output to a log so that I can cross-reference it with the times of the unexpected unmounts. The users have not gotten back to me yet.
I took a quick look at the nicstat output collected so far, and two things stand out, but I'm not sure whether either can be considered normal.
Every so often (once every two hours or so), there is a sudden drop in activity:
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
21:13:04 eth0 48057.5 30790.5 35560.6 23086.0 1383.9 1365.7 64.6 0.00
21:13:07 eth0 5438.7 12064.3 4697.4 8237.9 1185.6 1499.6 14.3 0.00
21:13:10 eth0 2.98 0.09 8.67 1.33 351.7 66.00 0.00 0.00
21:13:13 eth0 0.87 0.04 3.33 0.67 266.6 66.00 0.00 0.00
21:13:16 eth0 0.65 0.02 4.33 0.33 153.9 66.00 0.00 0.00
21:13:19 eth0 1.41 0.08 3.67 1.33 393.9 60.00 0.00 0.00
21:13:22 eth0 0.87 0.04 3.00 0.67 298.2 66.00 0.00 0.00
21:13:25 eth0 0.40 0.03 3.00 0.67 137.8 42.00 0.00 0.00
21:13:28 eth0 0.16 0.00 2.67 0.00 60.00 0.00 0.00 0.00
21:13:31 eth0 68692.2 60637.2 53586.2 46664.5 1312.7 1330.6 100 0.00
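To find every such drop in the log rather than spotting them by eye, something like the following could work. It is only a rough sketch: the 10 KB/s "idle" threshold and the function names are my own assumptions, and it expects the 10-column nicstat layout shown above. It reports the first and last timestamp of each run of near-idle samples so they can be compared against the times users report.

```python
def parse_nicstat(lines):
    """Yield (time, rKB/s, wKB/s, %Util) from nicstat output lines."""
    for line in lines:
        fields = line.split()
        # Skip repeated headers and anything that is not a 10-column data row.
        if len(fields) != 10 or fields[0] == "Time":
            continue
        yield fields[0], float(fields[2]), float(fields[3]), float(fields[8])


def find_stalls(samples, idle_kb_s=10.0):
    """Return (first_time, last_time, sample_count) per run of near-idle samples.

    idle_kb_s is an assumed threshold on rKB/s + wKB/s, not a nicstat setting.
    """
    stalls, run = [], []
    for time, rkb, wkb, _util in samples:
        if rkb + wkb < idle_kb_s:
            run.append(time)
            continue
        if run:
            stalls.append((run[0], run[-1], len(run)))
        run = []
    if run:  # log ended while still idle
        stalls.append((run[0], run[-1], len(run)))
    return stalls


# The excerpt from the log above, used as sample input.
SAMPLE = """\
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
21:13:04 eth0 48057.5 30790.5 35560.6 23086.0 1383.9 1365.7 64.6 0.00
21:13:07 eth0 5438.7 12064.3 4697.4 8237.9 1185.6 1499.6 14.3 0.00
21:13:10 eth0 2.98 0.09 8.67 1.33 351.7 66.00 0.00 0.00
21:13:13 eth0 0.87 0.04 3.33 0.67 266.6 66.00 0.00 0.00
21:13:16 eth0 0.65 0.02 4.33 0.33 153.9 66.00 0.00 0.00
21:13:19 eth0 1.41 0.08 3.67 1.33 393.9 60.00 0.00 0.00
21:13:22 eth0 0.87 0.04 3.00 0.67 298.2 66.00 0.00 0.00
21:13:25 eth0 0.40 0.03 3.00 0.67 137.8 42.00 0.00 0.00
21:13:28 eth0 0.16 0.00 2.67 0.00 60.00 0.00 0.00 0.00
21:13:31 eth0 68692.2 60637.2 53586.2 46664.5 1312.7 1330.6 100 0.00
""".splitlines()

if __name__ == "__main__":
    for start, end, count in find_stalls(parse_nicstat(SAMPLE)):
        print(f"idle from {start} to {end} ({count} samples)")
```

For the real log, the SAMPLE list would be replaced by the lines of the log file.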
Every so often (once every hour or so), there is a huge spike:
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
21:49:31 eth0 2423.2 64807.9 6604.5 44182.7 375.7 1502.0 55.1 0.00
21:49:34 eth0 53528.0 53388.1 40204.7 38932.1 1363.3 1404.2 87.6 0.00
21:49:37 eth0 1283.3 55824.3 5041.1 38035.3 260.7 1502.9 46.8 0.00
21:49:40 eth0 1330.6 57297.0 5488.8 39000.1 248.2 1504.4 48.0 0.00
21:49:43 eth0 3311.9 61186.1 7687.9 41753.8 441.1 1500.6 52.8 0.00
21:49:46 eth0 25882.2 54957.0 23886.9 39327.8 1109.5 1430.9 66.2 0.00
21:49:49 eth0 67499.1 54285.2 57065.7 41844.6 1211.2 1328.4 99.8 0.00
21:49:52 eth0 75769.3 45597.8 60522.3 36384.9 1282.0 1283.3 99.4 0.00
21:49:55 eth0 65356.3 49854.2 54671.6 38437.2 1224.1 1328.2 94.4 0.00
21:49:58 eth0 72088.5 50239.6 58650.0 39578.6 1258.6 1299.8 100 0.00
21:50:01 eth0 71629.2 50227.4 59794.7 39317.1 1226.7 1308.2 99.8 0.00
21:50:04 eth0 70348.0 52802.9 58990.7 40937.5 1221.1 1320.8 100 0.00
21:50:07 eth0 69576.3 54797.4 58773.3 42931.5 1212.2 1307.0 100 0.00
21:50:10 eth0 112908 40154.3 83073.2 35212.8 1391.8 1167.7 100 0.00
21:50:13 eth0 103493 47422.6 78465.8 39654.2 1350.6 1224.6 100 0.00
21:50:16 eth0 68535.5 53543.9 57525.4 41778.5 1220.0 1312.4 100 0.00
21:50:19 eth0 76412.5 49362.7 61296.4 38985.3 1276.5 1296.6 100 0.00
21:50:22 eth0 74091.6 54256.3 57072.8 42506.2 1329.4 1307.1 100 0.00
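To put the spike in perspective, the KB/s columns can be converted to Mbit/s. This is a minimal sketch under two assumptions that I have not confirmed: that nicstat's KB means 1024 bytes, and that eth0 is a 1 Gbit/s link. Under those assumptions, adding the read and write rates together reproduces the %Util column for the rows I checked, which would mean %Util counts both directions against a single 1 Gbit/s figure.

```python
KIB = 1024                       # assumption: nicstat "KB" is 1024 bytes
LINK_BITS_PER_S = 1_000_000_000  # assumption: eth0 is a 1 Gbit/s link


def to_mbit_s(kb_s):
    """Convert a nicstat KB/s figure to Mbit/s (10^6 bits per second)."""
    return kb_s * KIB * 8 / 1_000_000


def combined_util_pct(rkb_s, wkb_s):
    """Read plus write rate as a percentage of the assumed link speed, capped at 100."""
    bits_per_s = (rkb_s + wkb_s) * KIB * 8
    return min(100.0, bits_per_s / LINK_BITS_PER_S * 100)


if __name__ == "__main__":
    # Peak read rate from the 21:50:10 row.
    print(round(to_mbit_s(112908), 1), "Mbit/s read at the peak")
    # The 21:13:04 row, where nicstat itself reported 64.6 %Util.
    print(round(combined_util_pct(48057.5, 30790.5), 1), "% combined")
```

If those assumptions hold, the peak read rate alone is close to gigabit line rate, while several of the other rows showing 100 %Util carry well under 1 Gbit/s in either single direction.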
Are these observations worth noting? Could they be related to the loss of connectivity to an exported filesystem from the machine with this NIC activity?