
We have one server that exports two very large partitions (one is 61 TB, the other 28 TB), and many users read from and write to them through mounts on another machine.

Several users, independently of one another, are complaining that these mounts are spontaneously unmounted and remounted, which disrupts work in progress such as MATLAB jobs. I have asked them for a time frame. In the meantime, I have been running nicstat -i eth0 3 (check NIC utilization once every three seconds) for the past several hours and intend to leave it running. I'm writing the results to a log so I can cross-reference them with the times when the unwelcome unmounting occurs. The users have not gotten back to me yet.

I took a quick look at the nicstat results collected so far, and two things stick out, though I'm not sure whether they can be considered normal.

Every so often (once every two hours or so), there is a sudden drop in activity:

    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
21:13:04     eth0 48057.5 30790.5 35560.6 23086.0  1383.9  1365.7  64.6   0.00
21:13:07     eth0  5438.7 12064.3  4697.4  8237.9  1185.6  1499.6  14.3   0.00
21:13:10     eth0    2.98    0.09    8.67    1.33   351.7   66.00  0.00   0.00
21:13:13     eth0    0.87    0.04    3.33    0.67   266.6   66.00  0.00   0.00
21:13:16     eth0    0.65    0.02    4.33    0.33   153.9   66.00  0.00   0.00
21:13:19     eth0    1.41    0.08    3.67    1.33   393.9   60.00  0.00   0.00
21:13:22     eth0    0.87    0.04    3.00    0.67   298.2   66.00  0.00   0.00
21:13:25     eth0    0.40    0.03    3.00    0.67   137.8   42.00  0.00   0.00
21:13:28     eth0    0.16    0.00    2.67    0.00   60.00    0.00  0.00   0.00
21:13:31     eth0 68692.2 60637.2 53586.2 46664.5  1312.7  1330.6   100   0.00

Every so often (once every hour or so), there is a huge spike:

    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
21:49:31     eth0  2423.2 64807.9  6604.5 44182.7   375.7  1502.0  55.1   0.00
21:49:34     eth0 53528.0 53388.1 40204.7 38932.1  1363.3  1404.2  87.6   0.00
21:49:37     eth0  1283.3 55824.3  5041.1 38035.3   260.7  1502.9  46.8   0.00
21:49:40     eth0  1330.6 57297.0  5488.8 39000.1   248.2  1504.4  48.0   0.00
21:49:43     eth0  3311.9 61186.1  7687.9 41753.8   441.1  1500.6  52.8   0.00
21:49:46     eth0 25882.2 54957.0 23886.9 39327.8  1109.5  1430.9  66.2   0.00
21:49:49     eth0 67499.1 54285.2 57065.7 41844.6  1211.2  1328.4  99.8   0.00
21:49:52     eth0 75769.3 45597.8 60522.3 36384.9  1282.0  1283.3  99.4   0.00
21:49:55     eth0 65356.3 49854.2 54671.6 38437.2  1224.1  1328.2  94.4   0.00
21:49:58     eth0 72088.5 50239.6 58650.0 39578.6  1258.6  1299.8   100   0.00
21:50:01     eth0 71629.2 50227.4 59794.7 39317.1  1226.7  1308.2  99.8   0.00
21:50:04     eth0 70348.0 52802.9 58990.7 40937.5  1221.1  1320.8   100   0.00
21:50:07     eth0 69576.3 54797.4 58773.3 42931.5  1212.2  1307.0   100   0.00
21:50:10     eth0  112908 40154.3 83073.2 35212.8  1391.8  1167.7   100   0.00
21:50:13     eth0  103493 47422.6 78465.8 39654.2  1350.6  1224.6   100   0.00
21:50:16     eth0 68535.5 53543.9 57525.4 41778.5  1220.0  1312.4   100   0.00
21:50:19     eth0 76412.5 49362.7 61296.4 38985.3  1276.5  1296.6   100   0.00
21:50:22     eth0 74091.6 54256.3 57072.8 42506.2  1329.4  1307.1   100   0.00

Are these observations worth noting? Could they be related to the loss of connectivity to the filesystems exported from the machine showing this NIC activity?

CptSupermrkt

1 Answer


This is most likely the automounter biting you, or simply an overtaxed network connection.

NFS automounts can be finicky: sometimes they time out while still in use, and sometimes they stay mounted when they shouldn't. The peaks you see might be reconnection sprees.

Note that a misbehaving job, user, or network (your NIC sits at 100% utilization for prolonged periods!) can also lead some clients to conclude that the NFS server has gone away.
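
To put the 100% figure in context, here is a small Python sketch that recomputes %Util from a few rows of the question's output. It assumes a 1 Gbit/s link and that nicstat's KB means 1024 bytes; neither is stated in the question, and the formula (read plus write throughput against link speed, capped at 100) is inferred from the samples rather than taken from nicstat documentation.

```python
# Rows copied from the nicstat output in the question:
# (time, rKB/s, wKB/s, reported %Util)
SAMPLES = [
    ("21:13:04", 48057.5, 30790.5, 64.6),
    ("21:13:07", 5438.7, 12064.3, 14.3),
    ("21:49:58", 72088.5, 50239.6, 100.0),
    ("21:50:10", 112908.0, 40154.3, 100.0),
]

LINK_BITS_PER_SEC = 1_000_000_000  # assumption: 1 Gbit/s link
BITS_PER_KB = 1024 * 8             # assumption: nicstat KB = 1024 bytes


def mbit_per_sec(kb_per_sec):
    """Convert a nicstat KB/s figure to Mbit/s."""
    return kb_per_sec * BITS_PER_KB / 1_000_000


def combined_util(r_kb, w_kb):
    """Inferred formula: (read + write) throughput over link speed, capped at 100%."""
    pct = (r_kb + w_kb) * BITS_PER_KB / LINK_BITS_PER_SEC * 100
    return min(pct, 100.0)


for time, r_kb, w_kb, reported in SAMPLES:
    print(
        f"{time}  in={mbit_per_sec(r_kb):7.1f} Mbit/s  "
        f"out={mbit_per_sec(w_kb):7.1f} Mbit/s  "
        f"computed={combined_util(r_kb, w_kb):5.1f}%  reported={reported:5.1f}%"
    )
```

Under these assumptions the unsaturated rows reproduce the reported values (64.6 and 14.3), which suggests that 100% here means the combined traffic reached link speed. At 21:49:58, for instance, that would be roughly 590 Mbit/s in and 410 Mbit/s out, while at 21:50:10 the receive side alone is about 925 Mbit/s.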

Other than that, to investigate further, the NFS HOWTO is a good start.
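
The nicstat log you are already collecting can also be scanned for the two patterns from the question, so that their timestamps can be lined up with the unmount times once users report them. The following Python sketch does that; the thresholds (`IDLE_KB_PER_SEC`, `SATURATED_PCT`, `MIN_SAMPLES`) are illustrative guesses, and the embedded `SAMPLE_LOG` is just the first excerpt from the question standing in for the real log file.

```python
import re

TIME = re.compile(r"^\d\d:\d\d:\d\d$")

IDLE_KB_PER_SEC = 10.0   # assumption: below this combined rate the link is "idle"
SATURATED_PCT = 99.0     # assumption: at or above this %Util the link is "saturated"
MIN_SAMPLES = 3          # assumption: ignore runs shorter than this many samples

# Stand-in for the real log: the first excerpt from the question.
SAMPLE_LOG = """\
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
21:13:04     eth0 48057.5 30790.5 35560.6 23086.0  1383.9  1365.7  64.6   0.00
21:13:07     eth0  5438.7 12064.3  4697.4  8237.9  1185.6  1499.6  14.3   0.00
21:13:10     eth0    2.98    0.09    8.67    1.33   351.7   66.00  0.00   0.00
21:13:13     eth0    0.87    0.04    3.33    0.67   266.6   66.00  0.00   0.00
21:13:16     eth0    0.65    0.02    4.33    0.33   153.9   66.00  0.00   0.00
21:13:19     eth0    1.41    0.08    3.67    1.33   393.9   60.00  0.00   0.00
21:13:22     eth0    0.87    0.04    3.00    0.67   298.2   66.00  0.00   0.00
21:13:25     eth0    0.40    0.03    3.00    0.67   137.8   42.00  0.00   0.00
21:13:28     eth0    0.16    0.00    2.67    0.00   60.00    0.00  0.00   0.00
21:13:31     eth0 68692.2 60637.2 53586.2 46664.5  1312.7  1330.6   100   0.00
"""


def parse_rows(lines):
    """Return (time, rKB/s, wKB/s, %Util) tuples, skipping headers and blank lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 10 or not TIME.match(fields[0]):
            continue
        rows.append((fields[0], float(fields[2]), float(fields[3]), float(fields[8])))
    return rows


def find_runs(rows, predicate, min_len=MIN_SAMPLES):
    """Return (start, end, n_samples) for each run of consecutive rows matching predicate."""
    runs, start, last, count = [], None, None, 0
    for time, r_kb, w_kb, util in rows:
        if predicate(r_kb, w_kb, util):
            if count == 0:
                start = time
            last = time
            count += 1
        else:
            if count >= min_len:
                runs.append((start, last, count))
            count = 0
    if count >= min_len:
        runs.append((start, last, count))
    return runs


def is_idle(r_kb, w_kb, util):
    return r_kb + w_kb < IDLE_KB_PER_SEC


def is_saturated(r_kb, w_kb, util):
    return util >= SATURATED_PCT


rows = parse_rows(SAMPLE_LOG.splitlines())
idle_runs = find_runs(rows, is_idle)
saturated_runs = find_runs(rows, is_saturated)
print("idle windows:     ", idle_runs)
print("saturated windows:", saturated_runs)
```

For the real log, replace `SAMPLE_LOG.splitlines()` with the open log file. Since the rows carry only a time of day, a log that spans midnight needs the date tracked separately.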

Roman