
Recently, on a shared host, the filesystem containing my home folder was mounted read-only for 45 minutes to an hour. Technical support did not know about the outage and evaded direct questions. After a little more than three days I obtained this answer:

"There are many explanations, but in the most times this caused by server issue on filesystem level."

I am not exactly pleased by this in-depth analysis, as my normal work environment runs on a RAID1 (mdadm) and I have never encountered such issues.

The shared host is also supposed to run on a RAID1; I became aware of the issue because a cron job running uptime every 15 minutes sent me email about it.
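For reference, the uptime mail only made me look; a more direct check would be to watch /proc/mounts for the `ro` flag. A minimal sketch of such a check (the mail recipient and the ext* filter are placeholders):

  #!/bin/sh
  # Sketch: mail an alert if any ext* filesystem is mounted read-only.
  # admin@example.com is a placeholder recipient.
  awk '$3 ~ /^ext[234]$/ && $4 ~ /(^|,)ro(,|$)/ {print $2}' /proc/mounts |
  while read -r mnt; do
    echo "$mnt is mounted read-only" | mail -s "ro alert: $(hostname)" admin@example.com
  done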

I would really like to know what you, the more experienced, think of this.

nickgrim
benjamin

2 Answers


I've had this happen in the following scenario:

A VMware server connected to shared storage:

  1. The shared storage goes offline, or communication to it is lost
  2. VMware notices the storage went offline and marks virtual disks on the shared storage as read-only
  3. Lack of Profit

It should be rare, but it happens.
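If you have shell access when it happens, the kernel log usually records the forced remount; a quick check (a sketch; log paths vary by distribution) might be:

  # Look for I/O errors and the forced read-only remount
  dmesg | grep -iE 'i/o error|read-only|aborting journal'
  grep -i 'remount' /var/log/kern.log   # if a persistent kernel log exists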

They could/should have been more forthcoming with you about the causes.

Chris
  • +1 for your answer, yet I do wonder if a well-configured system is allowed to behave this way. Shouldn't admins be alerted automatically in case of such an error (in a well-configured environment)? Or is this one of those errors whose likelihood/outage time can be time-boxed and ignored by admins because it still stays within the SLA? – benjamin Mar 09 '11 at 21:25
  • I can't comment on the issues that caused this, but I think you have legitimate concerns and asking more questions would be appropriate. People tend to be a bit reluctant to admit faults, but you're a paying customer here. – Chris Mar 09 '11 at 21:28

I am going to assume you are talking about a Linux system with ext2/ext3/ext4 filesystems (or ReiserFS, if you dare).

When a new filesystem is created on the disk, a field in the filesystem metadata tells the Linux host what to do if a problem is detected in the filesystem during operation.

From what I have seen, this is usually left at a default that tells the operating system to remount the problem filesystem read-only.
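You can inspect the current setting with `tune2fs -l`, for example (replace /dev/sdX# with your device, as below):

  # Show the configured error behavior: continue, remount-ro, or panic
  tune2fs -l /dev/sdX# | grep -i 'errors behavior'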

I had this happen on a number of VMs and was most annoyed. I changed the setting so that if a serious filesystem error occurs, the system panics, which causes a reboot.

Assuming ext* filesystems, you can change the setting even while the disk is mounted:

  tune2fs -e panic /dev/sdX#

where sdX# is the partition holding your filesystem, for example /dev/sdb3. This also works for LVM volumes, using the appropriate /dev/ name of the logical volume that contains the filesystem.

You must do this for each filesystem; changing one does not affect any other.
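If you have several, a loop along these lines (a sketch; run as root, covering mounted ext* filesystems only) saves typing:

  # Sketch: set the panic-on-error behavior on every mounted ext* filesystem
  awk '$3 ~ /^ext[234]$/ {print $1}' /proc/mounts |
  while read -r dev; do
    tune2fs -e panic "$dev"
  done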

After making this change on all my VM filesystems, I am very happy.

Enjoy

mdpc
  • mdpc, +1 for your input. Actually, on the shared host I do not have root access. I think your guess about the `read-only` remount on error is plausible. I am wondering what could have foobared the RAID1 on the server side. IMHO, as one complete disk is allowed to fail, the hoster should have had serious trouble. What do you think? (No, they won't/can't tell me the reason.) – benjamin Mar 10 '11 at 00:32
  • I'd ask your provider to change your filesystems using this command. As for the cause: if the internal filesystem in the OS runs into some type of inconsistency or kernel issue (i.e. a BUG), handling that type of error would be outside the mdadm/RAID realm. – mdpc Mar 10 '11 at 00:36
  • Great point you mention again concerning the RAID. Now I am frightened :-) What kind of system have they built?? They have been in the VPS business for quite a while. However, I'll suggest it to the techs and see how it works out. – benjamin Mar 10 '11 at 01:00
  • I suggested your tip; the answer I obtained was: `Your fs is mounted as read usually when your server is restarted without normal order and on boot is missed fskc check.` They still seem not to know the cause. Do you think it is valid to conclude, since `fsck` ran and it is a RAID1, that they had to rebuild the data on new disks after a total outage? – benjamin Mar 10 '11 at 10:54
  • This answer was accepted as the best: according to Wikipedia, Linux favors a `kernel panic` over lots of error-recovery code. According to `man tune2fs`, the options in case of a filesystem error are to `continue` operation (bad in this case), `remount-ro` (also bad here, as pointed out by mdpc), or to `panic`, causing a reboot. Each of these settings tunes the kernel behavior on a filesystem error. Thank you very much again for your help, mdpc. – benjamin Mar 11 '11 at 07:41