
I am running an Ubuntu Server 10.10 machine with 24 hard disks (plus two OS drives).

I use a script to mount all the drives at their desired mountpoints. As expected, this is fast at first, until it suddenly hangs, usually after mounting around 15 drives.

Occasionally it mounts all the drives quickly, but that is rare.

Unmounting behaves the same way. When I unmount the drives manually, it literally takes hours to complete; a single drive takes about 10 minutes.

The mount process is stuck in uninterruptible sleep, and there is nothing in the syslog and nothing in the kernel log.

The server is a mass storage server from the hosting provider Leaseweb.

The drives sit behind a RAID controller configured in JBOD mode.

My guess is that some sort of staggered spin-up/spin-down of the disks by the RAID controller could cause this, but that does not really fit the behaviour.

Any ideas?

The Shurrican
  • You know about the automatic fsck after x mounts? – Martino Dino Dec 15 '13 at 18:08
  • There might be pending SCSI reservations on the disks which are preventing the unmounts. – Martino Dino Dec 15 '13 at 18:16
  • I'm curious as to why you have so many drives and mount points. ***More information and context!!*** – ewwhite Dec 15 '13 at 18:23
  • Well, the behaviour also occurs immediately after freshly creating the filesystem. I should also mention that the disks are empty and no applications are accessing them. I also tried Debian Wheezy, with the same behaviour. – The Shurrican Dec 15 '13 at 18:52
  • I can also read data from the drive using dd (even while it hangs). And I have 24 drives because I need to store a lot of files and don't need RAID, since redundancy is handled at the application level. – The Shurrican Dec 15 '13 at 18:58
    @TheShurrican MORE INFORMATION, PLEASE!! What type of server or enclosure is this? What type(s) of disks are these? What is the make/model of the RAID controller involved? Is there a SAS expander backplane in the mix? – ewwhite Dec 15 '13 at 20:00
  • As @ewwhite said, more information. Have you tried changing the order in which you mount the drives? Is this consistent between a particular row of disks and a particular target? – Matthew Ife Dec 15 '13 at 20:17
  • lshw tells me there is a 3ware 9650SE controller, so I figure there is an expander, or the ports would not suffice; it is probably built into the backplane. I also tried reversing the order of the drives, and the behaviour stays the same: after about 15 drives it starts hanging. I am also using LUKS encryption, but I reproduced exactly the same behaviour without it, so that is not the cause. – The Shurrican Dec 15 '13 at 21:14
  • It might be worth upgrading the controller firmware :) – Martino Dino Dec 15 '13 at 21:44
  • Yeah, I just waited for more than 2 hours and then it actually went through! I already tried to update the controller firmware, but it seems I need to get it from some legacy sites; the official site just does not come up with anything. – The Shurrican Dec 15 '13 at 23:39

2 Answers


If your drives are formatted with ext2/3/4, you are probably running into the automatic fsck after a set number of mounts. To check whether this is the case, run `tune2fs -l /dev/sdx | grep -i check` and see when the next fsck will run.

If needed, you can raise the maximum mount count with `tune2fs -c number /dev/sdxy`.
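A minimal sketch of the count-based check described above, as a POSIX shell script. It assumes the `Mount count:` and `Maximum mount count:` lines that `tune2fs -l` prints for ext2/3/4 filesystems, and that a maximum of 0 or -1 means count-based checking is disabled; the function name `fsck_status` is hypothetical, and running `tune2fs -l` on real devices normally requires root.

```sh
#!/bin/sh
# Sketch (assumption: tune2fs -l output contains "Mount count:" and
# "Maximum mount count:" lines). Reports whether the count-based fsck is due.

# Hypothetical helper: reads `tune2fs -l` output on stdin and prints
# "disabled" (max <= 0), "due" (count >= max) or "ok".
fsck_status() {
    awk -F: '
        /^Mount count:/         { count = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        END {
            if (max <= 0)          print "disabled"
            else if (count >= max) print "due"
            else                   print "ok"
        }'
}

# Check every device passed on the command line (run as root).
for dev in "$@"; do
    printf '%s: %s\n' "$dev" "$(tune2fs -l "$dev" | fsck_status)"
done
```

For example, `sh check-fsck.sh /dev/sdb1 /dev/sdc1` (script and device names are hypothetical) prints one status line per filesystem. Parsing stdin in a separate function keeps the logic testable without touching real disks.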

Martino Dino
  • Unfortunately, the issue also occurs when the filesystem is freshly created, and also immediately after a manual filesystem check. – The Shurrican Dec 15 '13 at 19:27
  • Then I think your controller has some issues. You should investigate the controller logs and check the kernel log for failed I/O requests or similar errors. – Martino Dino Dec 15 '13 at 21:50

It turned out that after a firmware downgrade everything runs fine.

The LSI website is simply broken, so Google turned up this page:

http://www.3ware.com/support/downloadpageprod.asp?pcode=16&path=Escalade9650SE-Series&prodname=3ware%209650SE-24M8

where I downloaded firmware version 4.10.00.007.

The installed firmware was 4.10.00.029 (I think; I'm not 100% sure, but it was definitely newer).

After the downgrade, everything runs fine!

The Shurrican