I have a small home server where I host a few services I like. Recently the server has started entering a read-only state on the main LVM volume, the volume where Ubuntu is installed. I have very little experience investigating these kinds of issues.
Problem
Every few weeks I will try to access one of the services hosted on the server and notice they don't respond.
From experience I know this means the /dev/mapper/ubuntu--vg-ubuntu--lv volume went read-only.
My fix
I can run fsck /dev/mapper/ubuntu--vg-ubuntu--lv, which ends up reporting and fixing a lot of errors in /var/lib/docker/overlay2/.
Shortened output of the command:
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
JBD2: Invalid checksum recovering block 6294682 in log
JBD2: Invalid checksum recovering block 5767200 in log
JBD2: Invalid checksum recovering block 5769164 in log
...
Journal checksum error found in /dev/mapper/ubuntu--vg-ubuntu--lv
/dev/mapper/ubuntu--vg-ubuntu--lv contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 395236 was part of the orphaned inode list. FIXED.
...
Pass 2: Checking directory structure
Symlink /var/lib/docker/overlay2/e2983dd0abaadcf1745ffa84433e9c84f8b1a4ab79ee4b9b059f039758b885d1/diff/opt/gitlab/embedded/service/gitlab-shell/.gitlab_shell_secret (inode #1613356) is invalid.
Clear<y>? yes
...
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(7406080--7406211) -(7407552--7407614) -(7408736--7408765) -(7411072--7411125) -(7413760--7413824) -(7414784--7415295)
Fix? yes
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** REBOOT SYSTEM *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 453053/2097152 files (0.1% non-contiguous), 3134566/8388608 blocks
I follow this up by running sudo docker image rm gitlab/gitlab-ee -f for all affected images.
I obviously can't keep doing this, as each time I need to remove all the broken files and redeploy all the Docker images.
Question
How do I fix this once and for all? How do I know where to look?
Update 1
The output of dmesg roughly a minute after the corruption occurs. Note that this is just a small portion; dmesg returns a lot of these lines, and they all follow the same Read-only file system format.
[250377.826952] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250377.827922] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.236228] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.237456] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.765351] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.766273] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.982447] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.983362] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.639171] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.640142] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.766604] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.767619] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
I think I am already too late running dmesg at this point. I doubt I will be able to notice the issue any faster.
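Since I keep missing the window to run dmesg by hand, a minimal watchdog sketch could poll /proc/mounts and snapshot the kernel log the moment the volume flips to read-only. The log path and the crontab entry below are assumptions on my part, not something I currently run:

```shell
#!/bin/sh
# Hypothetical watchdog: dump dmesg as soon as the root LV is
# remounted read-only. LOG lives on tmpfs, which stays writable
# even after the LV goes read-only.
DEV="/dev/mapper/ubuntu--vg-ubuntu--lv"
LOG="/dev/shm/ro-watchdog.dmesg"

# Succeed if a /proc/mounts line carries "ro" among its mount options.
# /proc/mounts fields are: device mountpoint fstype options dump pass.
line_is_ro() {
    opts=$(printf '%s\n' "$1" | awk '{print $4}')
    case ",$opts," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Scan /proc/mounts once; save the kernel ring buffer if DEV is read-only.
check_once() {
    while read -r line; do
        case "$line" in
            "$DEV "*)
                if line_is_ro "$line"; then
                    dmesg > "$LOG" 2>/dev/null
                    echo "read-only detected on $DEV, dmesg saved to $LOG"
                fi ;;
        esac
    done < /proc/mounts
}

[ -r /proc/mounts ] && check_once

# Example crontab entry (assumed install path):
# * * * * * /usr/local/bin/ro-watchdog.sh
```

Polling once a minute from cron should be early enough to catch the messages before the ring buffer wraps.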
Update 2
I have finally caught the error in time.
[94813.533642] IPv6: ADDRCONF(NETDEV_UP): veth13a279e: link is not ready
[94813.533654] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94813.533658] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94813.534526] br-069ee8939c9c: port 1(veth13a279e) entered disabled state
[94813.603964] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.603967] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94813.604076] device vetha0f7425 entered promiscuous mode
[94813.604292] IPv6: ADDRCONF(NETDEV_UP): vetha0f7425: link is not ready
[94813.604298] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.604299] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94814.483126] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94814.816681] eth0: renamed from vethfda95fe
[94814.831541] IPv6: ADDRCONF(NETDEV_CHANGE): veth13a279e: link becomes ready
[94814.831591] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94814.831593] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94815.842053] eth0: renamed from vethba9b3b5
[94815.859813] IPv6: ADDRCONF(NETDEV_CHANGE): vetha0f7425: link becomes ready
[94815.859882] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94815.859886] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94816.829692] eth0: renamed from veth8fea459
[94816.847459] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6ab738: link becomes ready
[94816.847491] br-54483b957e18: port 1(vetha6ab738) entered blocking state
[94816.847495] br-54483b957e18: port 1(vetha6ab738) entered forwarding state
[95248.835077] r8169 0000:13:00.0 enp19s0: link down
[95257.640648] r8169 0000:13:00.0 enp19s0: link up
[96145.500660] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500670] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 a4 7d 40 00 00 30 00
[96145.500674] print_req_error: I/O error, dev sda, sector 10779968
[96145.500826] Aborting journal on device dm-0-8.
[96145.500828] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500834] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 07 64 41 a8 00 00 08 00
[96145.500895] print_req_error: I/O error, dev sda, sector 124010920
[96145.500932] EXT4-fs error (device dm-0) in __ext4_new_inode:1082: Journal has aborted
[96145.500947] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1709169 (offset 0 size 4096 starting block 6849845)
[96145.500982] Buffer I/O error on device dm-1, logical block 6849845
[96145.501023] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501025] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 05 d0 58 88 00 00 08 00
[96145.501027] print_req_error: I/O error, dev sda, sector 97540232
[96145.501058] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 450308 (offset 0 size 0 starting block 7735313)
[96145.501060] Buffer I/O error on device dm-0, logical block 7735313
[96145.501105] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501110] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 0a 24 94 80 00 00 08 00
[96145.501114] print_req_error: I/O error, dev sda, sector 170169472
[96145.501147] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 3148652 (offset 0 size 4096 starting block 12619664)
[96145.501148] Buffer I/O error on device dm-1, logical block 12619664
[96145.501187] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501189] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 03 27 16 40 00 00 80 00
[96145.501190] print_req_error: I/O error, dev sda, sector 52893248
[96145.501266] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501268] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 07 64 58 e0 00 00 08 00
[96145.501269] print_req_error: I/O error, dev sda, sector 124016864
[96145.501301] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1706133 (offset 0 size 4096 starting block 6850588)
[96145.501302] Buffer I/O error on device dm-1, logical block 6850588
[96145.501382] EXT4-fs (dm-0): Delayed block allocation failed for inode 450308 at logical offset 18450 with max blocks 7 with error 30
[96145.501438] EXT4-fs (dm-0): This should not happen!! Data will be lost
[96145.501477] EXT4-fs error (device dm-0) in ext4_writepages:2895: Journal has aborted
[96145.504583] Aborting journal on device dm-1-8.
[96145.507210] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[96145.507306] EXT4-fs (dm-0): Remounting filesystem read-only
[96145.513169] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5769: Journal has aborted
[96145.514680] EXT4-fs error (device dm-0) in ext4_mkdir:2660: IO failure
[96145.516071] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516095] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516121] EXT4-fs (dm-1): Remounting filesystem read-only
[96145.518645] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.520899] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.531523] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542876] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542955] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.944229] traps: postgres[2215] general protection ip:7fba363d97a1 sp:7ffc943cc880 error:0 in ld-musl-x86_64.so.1[7fba363d2000+46000]
[96146.339588] br-ae1db04d3386: port 4(vethb6cf243) entered disabled state
[96146.345493] veth3fdc38e: renamed from eth0
[96146.441684] overlayfs: upper fs is r/o, try multi-lower layers mount
[96149.670401] br-ae1db04d3386: port 7(veth2bcd3f7) entered disabled state
[96149.670617] vethf7402f2: renamed from eth0
[96149.776068] overlayfs: upper fs is r/o, try multi-lower layers mount
[96180.516686] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.516842] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.516985] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517093] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517192] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517558] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517753] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.518982] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.520186] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.521475] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
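For reference, the failing sectors can be pulled out of these print_req_error messages mechanically. A small sed sketch (the function name is mine) that lists device/sector pairs, which I could then cross-check against the disk:

```shell
# Sketch: extract "device sector" pairs from kernel lines such as
# "print_req_error: I/O error, dev sda, sector 10779968".
failing_sectors() {
    sed -n 's|.*I/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*|\1 \2|p'
}

# Usage: dmesg | failing_sectors | sort -u
```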
Update 3
I have connected the drive to a Windows host and checked for firmware updates. The drive has the latest firmware.
SMART also reports what seem to me to be very reasonable numbers.
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 000 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 5638
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 75
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 26
170 Unknown_Attribute 0x0000 100 100 010 Old_age Offline - 18
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 2424903
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 1
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 52
194 Temperature_Celsius 0x0022 077 068 000 Old_age Always - 23 (Min/Max 22/32)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
231 Temperature_Celsius 0x0000 004 004 000 Old_age Offline - 96
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 3775
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 2489
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 976
244 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 37
245 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 71
246 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 141104
I will install a fresh OS and restore the data from a backup to see if that resolves the problem.
Docker info
As only the Docker overlay directory seems to get corrupted, here is the Docker storage configuration:
Server Version: 19.03.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
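One more detail that may matter: the ext4 error policy (errors=remount-ro is the usual Ubuntu root default, via /etc/fstab or the superblock) is what turns the write failures above into a read-only remount. A quick sketch to confirm the configured superblock default; the helper name is mine, and I am assuming the usual tune2fs -l output format:

```shell
# Sketch: read the configured ext4 error behavior ("Continue",
# "Remount read-only", or "Panic") from `tune2fs -l` output on stdin.
errors_behavior() {
    awk -F': *' '/^Errors behavior:/ { print $2 }'
}

# Usage (device from this post; a mount option in /etc/fstab can
# override whatever the superblock says):
# sudo tune2fs -l /dev/mapper/ubuntu--vg-ubuntu--lv | errors_behavior
```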