
I have a RHEL 5.5 x86_64 server with two HBAs connecting to EMC and HP storage arrays. EMC PowerPath is installed because my EMC vendor insists on it.

My problem is that the volumes on the HP storage often get journal errors (see below) and go into read-only mode.

Is this a SAN problem or an OS problem? How can I resolve it?

May 27 14:16:57 cvoddv01 kernel: journal_bmap: journal block not found at offset 6156 on dm-7
May 27 14:16:57 cvoddv01 kernel: Aborting journal on device dm-7.
May 27 14:16:57 cvoddv01 kernel: ext3_abort called.
May 27 14:16:57 cvoddv01 kernel: EXT3-fs error (device dm-7): ext3_journal_start_sb: Detected aborted journal
May 27 14:16:57 cvoddv01 kernel: Remounting filesystem read-only
May 27 14:16:57 cvoddv01 kernel: __journal_remove_journal_head: freeing b_frozen_data
May 27 14:16:57 cvoddv01 kernel: __journal_remove_journal_head: freeing b_committed_data
May 27 14:16:57 cvoddv01 kernel: __journal_remove_journal_head: freeing b_frozen_data
May 27 14:17:36 cvoddv01 kernel: ext3_abort called.
May 27 14:17:36 cvoddv01 kernel: EXT3-fs error (device dm-7): ext3_put_super: Couldn't clean up the journal

My modprobe.conf is:

alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptspi
alias scsi_hostadapter2 cciss
alias scsi_hostadapter3 ata_piix
alias scsi_hostadapter4 qla2xxx
alias eth0 e1000e
alias eth2 e1000e
alias eth1 e1000e
alias eth3 e1000e
alias eth4 bnx2
alias eth5 bnx2
#Added by HP rpm installer
alias scsi_hostadapter_mptscsih_module mptscsih
#Added by HP rpm installer
alias scsi_hostadapter_mptsas_module mptsas
options qla2xxx ql2xmaxqdepth=16 ql2xloginretrycount=30 qlport_down_retry=64
options lpfc lpfc_lun_queue_depth=16 lpfc_nodev_tmo=30 lpfc_discovery_threads=32
###BEGINPP
include /etc/modprobe.conf.pp
###ENDPP

The /etc/fstab is:

/dev/VolGroup00/LogVol00 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults        0 0
#/dev/sdae1             /mnt/sda1               ext3    defaults        0 0
#/dev/sdaf1             /mnt/sdb1               ext3    defaults        0 0
#/dev/sdag1             /mnt/sdc1               ext3    defaults        0 0
#/dev/sdah1             /mnt/sdd1               ext3    defaults        0 0
/dev/vg01/lvu02         /u02                    ext3    defaults        0 0
/dev/vg01/lvu03         /u03                    ext3    defaults        0 0
/dev/vg01/lvu04         /u04                    ext3    defaults        0 0
/dev/vg01/lvu05         /u05                    ext3    defaults        0 0
/dev/vg02/lvu06         /u06                    ext3    defaults        0 0
/dev/vg02/lvu07         /u07                    ext3    defaults        0 0
/dev/vg02/lvu08         /u08                    ext3    defaults        0 0
/dev/vg02/lvu09         /u09                    ext3    defaults        0 0
shmfs                   /dev/shm                tmpfs   rw,size=22g     0 0

uname -a

Linux cvoddv01.globetel.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
kjloh

3 Answers


You should really be using either dm-multipath or PowerPath, not both at the same time.

From the PowerPath Admin Guide:

PowerPath is not compatible with the native Linux device mapper (DM-MPIO). Configuring both products on the same host can cause system instability. EMC recommends that you do not configure the native device mapper on a host on which PowerPath will be installed.
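If PowerPath has to stay, the usual complement (a sketch only — check the PowerPath release notes for your exact versions) is to make sure the native device mapper never claims any paths, for example by blacklisting everything in /etc/multipath.conf:

```
# /etc/multipath.conf -- illustrative fragment, not a complete config:
# blacklist every device node so dm-multipath never builds maps over
# paths that PowerPath should own.
blacklist {
        devnode "*"
}
```

On RHEL 5 you would then also stop and disable the daemon (`service multipathd stop`, `chkconfig multipathd off`) and verify that `multipath -ll` returns nothing.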

Tom Shaw

Have you tried removing and rebuilding the journals? There are a few posts around that explain how to recreate EXT3 journals. If a rebuild of the journals still gives you errors, then I would investigate the hardware/drivers. Sorry I can't be more detailed here.
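For reference, the remove-and-rebuild cycle with e2fsprogs looks roughly like this — demonstrated here on a throwaway image file so nothing real is touched; on the actual server you would substitute the unmounted logical volume (e.g. /dev/vg02/lvu06):

```shell
# Build a small scratch ext3 filesystem in a file (no root required).
dd if=/dev/zero of=/tmp/journal-demo.img bs=1M count=16 2>/dev/null
mkfs.ext3 -q -F /tmp/journal-demo.img

# Remove the (possibly corrupt) journal, check the filesystem, then
# recreate a fresh journal.
tune2fs -O ^has_journal /tmp/journal-demo.img
e2fsck -fy /tmp/journal-demo.img
tune2fs -j /tmp/journal-demo.img

# Confirm the journal is back.
tune2fs -l /tmp/journal-demo.img | grep has_journal
```

Only ever do this against an unmounted filesystem; tune2fs will refuse to strip the journal from a mounted one.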

AndyM
  • I have rebuilt the journals and am still getting the errors. As far as I know, I have not configured multipathd for the HP storage. How can I make sure that only PowerPath is active and not the default multipathd? – kjloh May 29 '11 at 04:07

The device affected in the attached log is dm-7, so I expect you are using multipathd for the HP storage, right? If so, please also attach your multipathing configuration.

el5 in the kernel name suggests RHEL 5. If you have a support contract, contact Red Hat ASAP; they will be able to help you the most.

What we can be sure of from the data is that an attempt to access the journal was made and failed, and the OS did the only thing it could: it remounted the filesystem read-only to avoid damaging it with further writes.

The failure can lie in any of the components:

  1. Storage -- is the filesystem OK after a remount? Can you run a full fsck on it to see whether the journal problem is the only thing that went wrong, or whether there is a lot of silent corruption that only becomes visible once it hits the journal?
  2. This particular LUN -- can you (as in: is it feasible to) reformat it, restore the data, and see if the error recurs?
  3. Can you create another LUN on the same array and try to reproduce the error there? A LUN on a different array?
  4. Multipathing -- can you reproduce the errors if you access the storage directly, over just one path? (This requires changes to SAN zoning or LUN masking on the array.)
  5. A driver collision between PowerPath and the native multipathing -- can you reproduce the error on the same LUN when PowerPath is not installed?

I do not think it is a bug in the ext3 code, because that code has been around for a long time, but do you use any exotic mount options? Do you have 4K blocks on the storage? Anything else unusual?

Did the server ever work correctly? If so, can you name the change that caused it to start failing?

If you are going to troubleshoot it yourself, the first step would be to find the minimal set of options that makes the system fail. A more practical approach could be to reorganise your storage so that each server uses only one vendor's storage; this could save you a ping-pong between vendors.

Your best bet, however, would be to contact your OS vendor and make them drive the case, I think.

Paweł Brodacki
  • - Multipath is not enabled. The multipath.conf is the default. "multipath -ll" returns nothing. - After unmounting and running fsck, I can remount the filesystem. - I have re-created the LUNs before, and the problem still recurs. The LUNs on EMC are OK. - I am now accessing the devices directly using /dev/sd*. - I use the default mount options. Block size is 8K. - Before the installation of PowerPath, I already encountered the problem intermittently, but it has become worse since PowerPath was installed. – kjloh May 27 '11 at 07:50
  • In that case I would be pestering the storage vendor and the OS vendor. If the storage is supported for this version of the operating system, it's their job to make it work. Do any other servers access the HP storage? Do they experience problems too? If you don't use the native multipathing, then why is the error on dm-7? – Paweł Brodacki May 27 '11 at 09:06