Odd memtest86/boot problems on linux machine

Question

I run Debian Lenny 64-bit on an HP server with an Intel Core 2 Duo processor. It boots with LILO rather than GRUB because it has an XFS root partition. Up until today it had 3 GB (2x 512MB and 2x 1GB) of ECC RAM. I have been getting ECC errors from EDAC occasionally on a single slot but since I had no crashes I wasn't too worried.

Today I tried to apply a firmware update that Seagate recommended for two of the drives (data only, not /), which are in an mdadm RAID-1 on that machine. I didn't manage to do this, or even get to the README on the update disc, because it was taking forever to boot. I got fed up and tried to reboot the system. It hung after three lines of ...s from LILO.

I thought that I probably had some bad RAM due to the ECC errors, so I tried many different configurations (with 6 DIMMs, the four mentioned plus 2 non-ECC DIMMs, obviously not at the same time) but couldn't get it to boot.

I ran memtest86, hoping to isolate the bad RAM. This resulted in the exact same error every time in Test #2 of memtest86, no matter which DIMM I used and no matter which slot. It always returned 3 errors on the first occupied RAM slot. I cannot make sense of the errors it returned but can produce them here if it's relevant.

Attempting to boot Debian off the main disk after this did not even show the word "LILO". It just hangs with a blinking cursor. This, together with the fact that there were memory errors every time, caused me to believe there was something wrong with the motherboard or with the CPU.

However, very oddly, Knoppix boots up happily and runs with no problems. I cannot run lilo because Knoppix is 32-bit and the system is 64-bit. But this makes me question some of the above stuff -- surely Knoppix can't run with RAM errors or a bad processor?

score 3 · Answer 1 · answered Aug 09 '09 at 02:27

Sounds like the slot on the motherboard is bad. If you can skip the first slot, try that and see what happens. If the errors go away, the diagnosis is pretty much confirmed.

Check for dirty contacts, dirt in the slot, etc. Maybe you'll get lucky and it's something that simple.

If you have a spare box lying around, try putting the RAM in that one, running memtest there, and seeing what happens.

score 1 · Answer 2 · answered Aug 09 '09 at 00:32

I suspect the disk, the disk controller, or the bus used by the controller. If it fails before the L in LILO is printed, then the boot sector containing LILO is not being read successfully. Knoppix doesn't have to deal with this, so it boots just fine. Can you mount anything from Knoppix?

From a private conversation an hour ago: "within knoppix I can verify that the disks (a bunch of RAID-1s) all look fine" – liori Aug 09 '09 at 00:35

nik · Answer 3 · 2009-08-09T05:56:13.070

I agree a lot with David's analysis.
I have used memtest86 (straight off an Ubuntu LiveCD, too) to isolate RAM errors.
These troubleshooting notes on the memtest86+ pages are also a good read.

memtest86+ focuses on memory stability.
If swapping the modules around still gives errors at the same addresses, the memory slots are very likely the culprit.
You can concentrate testing on the failing addresses using the simple controls at the bottom of the memtest86 screen, which shortens each test cycle.
If memtest86 shows errors, higher-level checks are not worthwhile; focus on the memory path.
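
The "same addresses after swapping modules" check can be done by hand, but a small script makes the reasoning explicit. This is only a sketch: the function name is made up, and the addresses are invented placeholders, not real memtest86 output. If some addresses fail in every DIMM/slot combination, the fault probably does not travel with any one module.

```python
def common_failing_addresses(runs):
    """Return the addresses that failed in every test run.

    `runs` maps a label for each DIMM/slot configuration to the
    collection of failing addresses noted down from memtest86.
    """
    address_sets = [set(addrs) for addrs in runs.values()]
    if not address_sets:
        return set()
    return set.intersection(*address_sets)


# Invented addresses purely for illustration.
example_runs = {
    "DIMM A in slot 1": {0x2F40_1230, 0x2F40_1238, 0x2F40_2004},
    "DIMM B in slot 1": {0x2F40_1230, 0x2F40_1238, 0x2F40_2004},
    "DIMM C in slot 2": {0x2F40_1230, 0x2F40_1238, 0x2F40_2004},
}

common = common_failing_addresses(example_runs)
if common:
    print("Fail in every configuration:", sorted(hex(a) for a in common))
else:
    print("No address fails in every configuration")
```

An empty result would instead suggest that the errors move with a particular module or slot.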

score 0 · Answer 4 · answered Aug 10 '09 at 05:55

Memory errors are very 'iffy'. That is why software can sometimes still run even with faulty RAM.

Sometimes the faulty bits happen not to cause any visible failure. One example would be if those locations merely hold padding bytes that were inserted to keep data aligned and are never actually read by the software. Even if the location holds program code, the bad bits may fall in a part of an instruction that is not essential or not decoded by the processor. Most modern PCs also have a memory management unit (MMU) that translates between virtual and physical memory locations. So, although the faulty RAM is in the initial part of memory, software is not necessarily using that particular block of RAM.

However, as others have said, it is most likely a faulty slot. Avoid using that slot if it is found to be faulty. If it is a fixed area of RAM, you may even be able to avoid using it by marking the BADMEM area in the Linux kernel.
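
As a rough illustration of fencing off a fixed bad area: instead of the BADMEM patch, mainline kernels accept a `memmap=<size>$<start>` boot parameter that marks a physical range as reserved. The sketch below only builds that string from a list of failing addresses. The helper name and the addresses are made up, and the 4 KiB page size is an assumption that holds for ordinary x86 pages.

```python
PAGE_SIZE = 4096  # assumed x86 page size in bytes


def memmap_reserve_arg(bad_addresses, page_size=PAGE_SIZE):
    """Build a memmap=<size>$<start> string covering the given addresses.

    The range is widened to whole pages: it starts at the page holding
    the lowest address and ends after the page holding the highest one.
    """
    if not bad_addresses:
        raise ValueError("need at least one failing address")
    start = (min(bad_addresses) // page_size) * page_size
    end = (max(bad_addresses) // page_size + 1) * page_size
    return f"memmap={end - start:#x}${start:#x}"


# Invented addresses purely for illustration.
print(memmap_reserve_arg([0x2F40_1230, 0x2F40_1238, 0x2F40_2004]))
```

Depending on the boot loader, the `$` may need quoting or escaping in its configuration file, so check the resulting kernel command line in /proc/cmdline after booting.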

RAM errors will come back and bite you sooner or later.
