2

Context

The company sells access to a cash-register-style web application. Access to the application is provided through a VPN. The VPN entry point for the clients is a Soekris board running Voyage Linux (a trimmed-down version of Debian). For 3 years, these boards have been running MySQL with replication and a RoR application stack.

The storage medium for these boards is a 4 GB CompactFlash card.

The problem

We are getting regular errors and random application crashes on these boards. The most frequent errors are the following:

Aug 24 14:54:44 box45 puppetd[3669]: Could not run Puppet::Network::Client::Master: Stale NFS file handle - /var/lib/puppet/state/state.yaml

Aug 24 13:37:01 box76 kernel: [ 2091.575622] EXT2-fs error (device hda1): read_block_bitmap: Cannot read block bitmap - block_group = 30, block_bitmap = 983040

If these were HDD-based, I would run SMART monitoring tools to check for bad sectors and general disk health. But because they are CF cards, I am in the dark and have difficulty measuring how bad (or good!) the situation is.

What can I do to monitor and measure the health of these cards? I insist on "measure" because I need hard facts that will eventually justify replacing all the CF cards.

To make things a little more complex, I do not have physical access to the Soekris boards, so all of this needs to be done remotely.

Antoine Benkemoun

2 Answers

2

The error seems to point pretty solidly to a problem with a section of the CF card media. If it has been running for some time without any problems and is now giving these issues, I'd think that the card has started going bad. The easiest way to test is to send a tech out with a replacement card and swap it out, especially if you're seeing this on a limited number of the systems. All media have lifespans and failure rates; the more read/write cycles the cards see, the sooner they'll die.

Another thing to look at: are the errors in reading near the same spot(s) each time? That would tell me it's probably a bad cell as well in a specific part of the card.

I don't know if fsck would work the same way on these cards or not. My first inclination seeing that error is to replace it.
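One way to check whether the errors keep hitting the same spot is to tally the EXT2-fs kernel messages by location. The following is only a rough sketch, assuming the kernel lines look like the one quoted in the question; the function name `count_error_locations` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines shaped like the EXT2-fs error quoted in the question.
# Other EXT2-fs messages (without block_group/block_bitmap) are ignored.
PATTERN = re.compile(
    r"(?P<host>\S+) kernel: .*EXT2-fs error \(device (?P<dev>\w+)\): "
    r".*block_group = (?P<group>\d+), block_bitmap = (?P<bitmap>\d+)"
)


def count_error_locations(lines):
    """Count errors per (host, device, block_group, block_bitmap)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (
                match["host"],
                match["dev"],
                int(match["group"]),
                int(match["bitmap"]),
            )
            counts[key] += 1
    return counts
```

Feeding it the lines of the collected syslog files and printing `count_error_locations(lines).most_common()` gives a per-box, per-location count. Many hits on one or two block groups would suggest a localized bad area; hits spread across the whole card would suggest wider wear.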

Bart Silverstrim
  • How can I know if it's reading the same spot? The errors are pretty similar. I'm 100% for replacing the cards, but I'm having trouble gathering "hard" evidence. My boss's current reaction is to say "just reinstall on top of the existing CF". – Antoine Benkemoun Aug 25 '11 at 13:12
  • Your error message. It's giving a location with the bitmap numbers. – Bart Silverstrim Aug 25 '11 at 13:14
  • Grep the error logs for the EXT2-fs error messages and compare. If they're the same or very close to each other, you're probably looking at a bad spot or run of spots. Either way, when I've seen similar error messages on traditional drives, it was time to replace them. – Bart Silverstrim Aug 25 '11 at 13:15
  • OK, got it. The problem is that most error messages are the "Stale NFS file handle" ones, which are very vague... And there is no NFS, of course. – Antoine Benkemoun Aug 25 '11 at 13:31
-3

Why in the world would you run things off of CF cards? Use solid-state media meant for the purpose if you need flash storage. CF cards are not built to technical standards that include health monitoring. The most you can do is run a disk check and scan for bad sectors.
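As a rough illustration of what such a read-only scan amounts to, here is a minimal sketch that reads a device (or file) chunk by chunk and records the offsets where a read fails. The function name `scan_readable` and the 1 MiB chunk size are assumptions made for the example. It needs a Unix system (for `os.pread`) and, on a real device such as `/dev/hda1`, root privileges. Reads may also be served from the page cache, and a clean pass does not prove the flash is healthy. The `badblocks` tool from e2fsprogs, which defaults to a read-only test, is the more thorough option.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` in chunks; return (chunks_tried, offsets_that_failed)."""
    bad_offsets = []
    chunks_tried = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or Linux block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                os.pread(fd, min(chunk_size, size - offset), offset)
            except OSError:
                # An I/O error here means this chunk could not be read.
                bad_offsets.append(offset)
            chunks_tried += 1
            offset += chunk_size
    finally:
        os.close(fd)
    return chunks_tried, bad_offsets
```

Running this periodically and logging the number of failed offsets per box would at least give a number to track over time, which is closer to the "hard facts" the question asks for than a one-off check.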

U4iK_HaZe
  • 3
    Soekris boards are made to run off CF cards... Putting write-intensive apps such as MySQL and RoR on them was not my idea, and it is definitely a bad one. I agree with you, but that doesn't change the fact that I have to handle this situation. – Antoine Benkemoun Aug 25 '11 at 13:04
  • Don't beat Antoine up; it was my downvote, not his. – Ben Pilbrow Aug 25 '11 at 13:14
  • Okay. Sorry. I just bought a CF card yesterday and, while playing around with it, found that it sometimes errors when reading and writing in rapid succession, but shows no errors when I run a disk check on it. – U4iK_HaZe Aug 25 '11 at 13:18
  • Could be a bad or cheap card, depending on how the data is being written to it. – Bart Silverstrim Aug 25 '11 at 13:20
  • I thought from the question that it's been running for 3 years and is only now showing issues. Unless the errors have been persistent for the entire time, I'd think there's a cell going bad. – Bart Silverstrim Aug 25 '11 at 13:20
  • Mine was a cheap card. Actually, it's not rated for the speeds I was forcing it to use and was choking instead (no buffer). As for his case, I can't think of anything but a bad card; maybe it's overheating? Age could mean dead sectors. – U4iK_HaZe Aug 25 '11 at 13:30