
My company makes an embedded Debian Linux device that boots from an ext3 partition on an internal SSD drive. Because the device is an embedded "black box", it is usually shut down the rude way, by simply cutting power to the device via an external switch.

This is normally okay, as ext3's journalling keeps things in order, so other than the occasional loss of part of a log file, things keep chugging along fine.

However, we've recently seen several units where, after many hard power cycles, the ext3 partition starts to develop structural issues -- in particular, running e2fsck on the ext3 partition turns up problems like those shown in the output listing at the bottom of this question. Running e2fsck repeatedly until it stops reporting errors (or reformatting the partition) clears the issues.

My question is... what are the implications of seeing problems like this on an ext3/SSD system that has been subjected to lots of sudden/unexpected shutdowns?

My feeling is that this might be a sign of a software or hardware problem in our system, since my understanding is that (barring a bug or hardware problem) ext3's journalling feature is supposed to prevent these sorts of filesystem-integrity errors. (Note: I understand that user-data is not journalled and so munged/missing/truncated user-files can happen; I'm specifically talking here about filesystem-metadata errors like those shown below)

My co-worker, on the other hand, says that this is known/expected behavior because SSD controllers sometimes re-order write commands and that can cause the ext3 journal to get confused. In particular, he believes that even given normally functioning hardware and bug-free software, the ext3 journal only makes filesystem corruption less likely, not impossible, so we should not be surprised to see problems like this from time to time.

Which of us is right?
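For reference, here's roughly how we'd inspect a suspect unit without modifying anything -- just a sketch, using the same /dev/sda3 partition shown in the listing below; -f forces a check even when the filesystem is marked clean, and -n opens it read-only and answers "no" to every prompt:

Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck -f -n /dev/sda3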

Embedded-PC-failsafe:~# ls
Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 46948.
Fix<y>? yes

Directory inode 46948, block 0, offset 12: directory corrupted
Salvage<y>? yes

Entry 'status_2012-11-26_14h13m41.csv' in /var/log/status_logs (46956) has deleted/unused inode 47075.  Clear<y>? yes
Entry 'status_2012-11-26_10h42m58.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47076.  Clear<y>? yes
Entry 'status_2012-11-26_11h29m41.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47080.  Clear<y>? yes
Entry 'status_2012-11-26_11h42m13.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47081.  Clear<y>? yes
Entry 'status_2012-11-26_12h07m17.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47083.  Clear<y>? yes
Entry 'status_2012-11-26_12h14m53.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47085.  Clear<y>? yes
Entry 'status_2012-11-26_15h06m49.csv' in /var/log/status_logs (46956) has deleted/unused inode 47088.  Clear<y>? yes
Entry 'status_2012-11-20_14h50m09.csv' in /var/log/status_logs (46956) has deleted/unused inode 47073.  Clear<y>? yes
Entry 'status_2012-11-20_14h55m32.csv' in /var/log/status_logs (46956) has deleted/unused inode 47074.  Clear<y>? yes
Entry 'status_2012-11-26_11h04m36.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47078.  Clear<y>? yes
Entry 'status_2012-11-26_11h54m45.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47082.  Clear<y>? yes
Entry 'status_2012-11-26_12h12m20.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47084.  Clear<y>? yes
Entry 'status_2012-11-26_12h33m52.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47086.  Clear<y>? yes
Entry 'status_2012-11-26_10h51m59.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47077.  Clear<y>? yes
Entry 'status_2012-11-26_11h17m09.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47079.  Clear<y>? yes
Entry 'status_2012-11-26_12h54m11.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47087.  Clear<y>? yes

Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Couldn't fix parent of inode 46948: Couldn't find parent directory entry

Pass 4: Checking reference counts
Unattached inode 46945
Connect to /lost+found<y>? yes

Inode 46945 ref count is 2, should be 1.  Fix<y>? yes
Inode 46953 ref count is 5, should be 4.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(208264--208266) -(210062--210068) -(211343--211491) -(213241--213250) -(213344--213393) -213397 -(213457--213463) -(213516--213521) -(213628--213655) -(213683--213688) -(213709--213728) -(215265--215300) -(215346--215365) -(221541--221551) -(221696--221704) -227517
Fix<y>? yes

Free blocks count wrong for group #6 (17247, counted=17611).
Fix<y>? yes

Free blocks count wrong (161691, counted=162055).
Fix<y>? yes

Inode bitmap differences:  +(47089--47090) +47093 +47095 +(47097--47099) +(47101--47104) -(47219--47220) -47222 -47224 -47228 -47231 -(47347--47348) -47350 -47352 -47356 -47359 -(47457--47488) -47985 -47996 -(47999--48000) -48017 -(48027--48028) -(48030--48032) -48049 -(48059--48060) -(48062--48064) -48081 -(48091--48092) -(48094--48096)
Fix<y>? yes

Free inodes count wrong for group #6 (7608, counted=7624).
Fix<y>? yes

Free inodes count wrong (61919, counted=61935).
Fix<y>? yes


embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****

embeddedrootwrite: ********** WARNING: Filesystem still has errors **********

embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks

Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory entry for '.' in ... (46948) is big.
Split<y>? yes

Missing '..' in directory inode 46948.
Fix<y>? yes

Setting filetype for entry '..' in ... (46948) to 2.
Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Pass 4: Checking reference counts
Inode 2 ref count is 12, should be 13.  Fix<y>? yes

Pass 5: Checking group summary information

embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****
embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks
Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite: clean, 657/62592 files, 87882/249937 blocks
Jeremy Friesner
  • Have you all thought of changing to ext4 or ZFS? – mdpc Dec 04 '12 at 02:14
  • I've thought about changing to ext4, at least... would that help address this issue? Would ZFS be better still? – Jeremy Friesner Dec 04 '12 at 02:17
  • 1
    Neither option would fix this. We still use devices with supercapacitors in ZFS, and battery or flash-protected cache is recommended for ext4 in server applications. – ewwhite Dec 04 '12 at 02:54

2 Answers


You're both wrong (maybe?)... ext3 is coping the best it can with having its underlying storage removed so abruptly.

Your SSD probably has some type of onboard cache. You don't mention the make/model of SSD in use, but this sounds like a consumer-level SSD versus an enterprise or industrial-grade model.

Either way, the cache is used to help coalesce writes and prolong the life of the drive. If there are writes in transit, the sudden loss of power is definitely the source of your corruption. True enterprise and industrial SSDs have supercapacitors that maintain power long enough to move data from the cache to nonvolatile storage, much the same way battery-backed and flash-backed RAID controller caches work.

If your drive doesn't have a supercap, the in-flight transactions are being lost, hence the filesystem corruption. ext3 is probably being told that everything is on stable storage, but that's just a function of the cache.
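If you want to see what your particular drive is doing, something like the following is a reasonable first check -- a sketch only, assuming the /dev/sda device from your listing, and keeping in mind that not every drive reports these fields honestly:

hdparm -W /dev/sda                            # is the volatile write cache currently enabled?
hdparm -I /dev/sda | grep -i 'write cache'    # does the drive advertise the feature at all?
hdparm -W0 /dev/sda                           # disable the write cache entirely

Disabling the cache trades away write performance (and the write coalescing that prolongs the drive's life), which is exactly the trade-off the supercapacitor on enterprise drives exists to avoid.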

ewwhite
  • Sorry to you and everyone who upvoted this, but you're just wrong. Handling the loss of in-progress writes is exactly what the journal is for, and as long as the drive correctly reports whether it has a write cache and obeys commands to flush it, the journal guarantees that the metadata will not be inconsistent. You only need a supercap or battery backed raid cache so you can enable write cache without having to enable barriers, which sacrifices some performance to maintain data correctness. – psusi Dec 05 '12 at 19:12
  • @psusi The SSD in use probably has cache explicitly enabled or relies on an internal buffer regardless of the OS-level setting. The data in that cache is what a [supercapacitor-enabled SSD](http://thessdguy.com/how-controllers-maximize-ssd-life-external-data-buffering/) would protect. – ewwhite Dec 05 '12 at 19:30
  • The data in the cache doesn't need protecting if you enable IO barriers. Most consumer type drives ship with write caching disabled by default and you have to enable it if you want it, exactly because it causes corruption issues if the OS is not careful. – psusi Dec 05 '12 at 19:35
  • 2
    @pusi Old now, but you mention this: `as long as the drive correctly reports whether it has a write cache and obeys commands to flush it, the journal guarantees that the metadata will not be inconsistent.` That's the thing: because of storage controllers that tend to assume older disks, SSDs **will sometimes lie** about whether data is flushed. You do need that supercap. – Joel Coel Aug 09 '15 at 22:01

You are right and your coworker is wrong. Barring something going wrong, the journal makes sure you never have inconsistent fs metadata. You might check with hdparm to see if the drive's write cache is enabled. If it is, and you have not enabled IO barriers (off by default on ext3, on by default in ext4), then that would be the cause of the problem.

The barriers are needed to force the drive's write cache to flush at the correct time to maintain consistency, but some drives are badly behaved and either report that their write cache is disabled when it is not, or silently ignore the flush commands. This prevents the journal from doing its job.
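As a concrete sketch -- assuming /dev/sda3 is the partition from the question and is normally mounted at /mnt/unionfs -- the check and the fix would look something like this; barrier=1 is the mount option that tells ext3 to issue the flush commands:

hdparm -W /dev/sda                          # is the drive's write cache enabled?
mount -o remount,barrier=1 /mnt/unionfs     # enable barriers on the running ext3 mount
# or persistently, via /etc/fstab:
# /dev/sda3  /mnt/unionfs  ext3  defaults,barrier=1  0  2

If dmesg then shows the journal layer complaining that barriers failed and are being disabled, the drive falls into the badly behaved category above, and you are back to disabling the write cache (hdparm -W0 /dev/sda) or using a drive with power-loss protection.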

psusi
  • -1 for reading-comprehension... – ewwhite Dec 05 '12 at 19:34
  • @ewwhite, maybe *you* should try reading, and actually writing a useful response instead of this childish insult. – psusi Dec 05 '12 at 19:36
  • +1 This answer could probably be improved, like any other answer in any Q&A, but at least it provides some light and some hints. @downvoters: improve the answer yourselves, or comment on possible flaws, but downvoting this answer without proper justification is just disgusting! – Alberto Dec 06 '12 at 21:44