
I have an extremely strange error message which causes a complete system crash and a remount of the filesystem as read-only. It all started ages ago when I installed a dodgy $2 eBay PCI modem and kernel panics started showing up monthly, with huge output. A new hard disk and a dist-upgrade later, the error has become very sporadic and much smaller in terms of what is actually printed (it's still rubbish to me, even after thorough googling).

This system has been 'cursed' when booted into Debian. I was thinking about trashing the computer and getting a new one... but since it only happens under Linux, it must be software!!

Basically, here it is (I'm posting now because it crashed today and also yesterday):

EXT2-fs error (device hda1): ext2_check_page: bad entry in directory #5898285: rec_len is smaller than minimal - offset=0, inode=5898285, rec_len=8, name_len=1 Remounting filesystem read-only

What is going on? I then have to pull the power, reboot, run fsck -y, reboot again, and that usually settles it for a while.
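
For reference, the recovery ritual looks roughly like this (device name is from my box; the exact steps may vary):

# after pulling the power and booting into a rescue/single-user shell:
fsck -y /dev/hda1    # repair the ext2 filesystem, answering yes to every fix
reboot               # then boot back into Debian normally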

If this could be figured out I would be so happy.

Thanks in advance for any light you guys can shed on this matter.

--EDIT:

Now running updatedb causes this error every time (well, twice so far), which means it's reproducible and trackable! (Now just to fix it...)
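
In case it helps, this is roughly how I trigger it and grab the kernel message (paths and device are from my system):

updatedb             # walking the whole filesystem hits the bad directory
dmesg | tail -n 20   # the EXT2-fs error appears here just before the remount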

Is it time for a new computer?

--EDIT:

resize2fs /dev/hda1 says it's already the correct number of blocks long, and badblocks doesn't return anything (is it not meant to?).
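
For the record, roughly what I ran (as I understand it, a read-only badblocks only prints block numbers when it actually finds bad blocks, so empty output should mean none were found):

resize2fs /dev/hda1       # reports the filesystem already matches the partition size
badblocks -sv /dev/hda1   # read-only scan; shows progress but lists nothing if there are no bad blocks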

--EDIT:

Is it possible something is corrupting all my new disks? A hardware problem? Someone said it might be the disk controller, or a BIOS option. Is there any way to check this?

Thanks.

Dennis Williamson

2 Answers


That really does sound like the filesystem's idea of the partition size is different from the actual partition size. You said you installed a new hard drive; if you transferred the filesystem to the new drive with dd (or some other method that didn't involve running mkfs on the new disk), this could happen.

Try running resize2fs /dev/hda1 from within a rescue environment (after an fsck -f, etc.) and see if the filesystem size changes. I'm guessing that it probably will, and your problems will mysteriously go away.
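
Something along these lines from the rescue shell, with the filesystem unmounted (adjust the device name if yours differs):

umount /dev/hda1      # make sure the filesystem is not mounted
fsck -f /dev/hda1     # force a full check even if the filesystem is marked clean
resize2fs /dev/hda1   # with no size given, resizes the filesystem to fill the partition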

womble
  • Thanks for the answer. Basically, when I replaced the disk I replaced the whole operating system and data; everything was wiped, I couldn't take chances, it was so screwed before... Just a few questions: rescue environment? How do I go into that? fsck -f says "Force fsck to check `clean' filesystems when preening" — what does this mean? :) And resize2fs (I'm sure it isn't, because otherwise you probably wouldn't have suggested it, but) is it destructive? Thanks for your help... this problem is so annoying :) –  Aug 16 '09 at 00:20
  • Now I am getting these non-fatal errors on stderr: swap_free: bad swap file entry 10000000, about 12 times, all with different numbers... It's so annoying because no other Debian system in my house is this buggy!!! Thanks :) –  Aug 16 '09 at 00:28
  • This sounds like the disk is bad. I'd consider replacing it, or at the very least running badblocks over it for a few days from a rescue environment. That can be entered by booting off a Debian install CD and entering "rescue" at the ISOLinux prompt. – womble Aug 16 '09 at 04:14
  • Maybe it's not the disk, but the controller. –  Aug 18 '09 at 22:03
  • @hop: Could be. Badblocks would hopefully find that too, of course, but you should badblocks all new disks, and if you replace the disk and it's still having problems, *then* suspect the controller (disks go bad a lot more often than controllers, IME). Of course, there are also driver problems to consider... – womble Aug 18 '09 at 22:28

I strongly suspect your disk contains bad sectors. You can verify this with badblocks (http://en.wikipedia.org/wiki/Badblocks).

man badblocks:

badblocks is used to search for bad blocks on a device (usually a disk
partition). device is the special file corresponding to the device
(e.g. /dev/hdc1). last-block is the last block to be checked; if it is
not specified, the last block on the device is used as a default.
start-block is an optional parameter specifying the starting block
number for the test, which allows the testing to start in the middle of
the disk. If it is not specified, the first block on the disk is used
as a default.

If you really want to be thorough, you should choose the -w option (read-write test) with 2-3 passes, but be sure to back up your data first, because read/write tests destroy the data on the physical media.
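
A sketch of what that would look like (destructive, so only run it on a disk whose contents you can afford to lose; the device name is just an example):

# WARNING: -w overwrites every block on the device
badblocks -wsv -p 2 /dev/hda1   # write-mode test; -p 2 keeps scanning until two consecutive passes find no new bad blocks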

NOTE: you will be tempted to tell ext* to ignore bad blocks, but I would strongly recommend replacing the drive. Drives usually contain a few bad blocks from the factory, but the internal logic remaps data on the fly when the OS tries to write to a known bad block. The area reserved for this remapping is fixed in size, so once it fills up, the drive stops relocating sectors. That is the point you are at now, so you can expect sectors to go bad more and more rapidly. If you have any warranty on the disk, get it replaced; if not, buy a new one.

You could also consider setting up RAID1 (from new disks) and creating backups at regular intervals (onto media not stored on or near the server/workstation in question).
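
For the RAID1 idea, a minimal sketch with mdadm, assuming two fresh disks with hypothetical partitions /dev/sda1 and /dev/sdb1:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext3 /dev/md0        # create the filesystem on the mirror, not on the member disks
mdadm --detail /dev/md0   # verify both members are active and the array is syncing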

NOTE2: although a memory problem does not manifest as exactly the same error every time, you could also run a memtest to be sure your server hasn't got "Alzheimer's" :)

asdmin
  • OK, I will definitely run badblocks. Say I do have bad blocks: why? I mean, what is the problem behind it? This disk is new! I just paid 100 bucks for it not a month ago!! PS: thanks for the answer anyway. –  Aug 16 '09 at 12:01
  • A product (a disk, for example) can basically suffer from two kinds of weakness: manufacturing defects and wear. When the manufacturer makes the disk, there's always a chance that a screw or an IC is not in exactly the right position, or that a piece of dust gets into the disk chamber; that will render your drive useless soon after you start using it. If you 'survive' that period, your disk will wear out (deterioration over hundreds of hours of service), so there's a chance the drive will develop errors. You have to understand that disk defects can happen at the _beginning_ of its lifetime too. – asdmin Aug 16 '09 at 23:09
  • Yes, I understand... this is what prompted me to get a new drive last time (the old one was actually corrupted, but that is a different point). The fact of the matter is that the same error keeps cropping up, and I am beginning to think it is not a hardware issue but a software issue. Also, whenever I try to make a new partition with the Debian install CD, it never lets me create an ext3 partition; it fails by hanging forever and I have to pull the power, yet an ext2 one is created fine... Is it the disk controller? If so, should I trash the machine and get a new one, and would the disk still be usable? –  Aug 17 '09 at 09:30
  • Haha, lame: I SSHed into the box to run badblocks and then shut my laptop lid... fail! (network connection reset) But the process ran to completion... –  Aug 17 '09 at 09:36