I have a small home server where I host a few services I like. Recently the server has started entering a read-only state on the main LVM volume, the volume where Ubuntu is installed. I have very little experience investigating these kinds of issues.
Problem
Every few weeks I will try to access one of the services hosted on the server and notice they don't respond.
From experience I know this means the /dev/mapper/ubuntu--vg-ubuntu--lv volume went read-only.
My fix
I can run fsck /dev/mapper/ubuntu--vg-ubuntu--lv, which ends up reporting and fixing a lot of errors in /var/lib/docker/overlay2/.
Shortened output of the command:
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
JBD2: Invalid checksum recovering block 6294682 in log
JBD2: Invalid checksum recovering block 5767200 in log
JBD2: Invalid checksum recovering block 5769164 in log
...
Journal checksum error found in /dev/mapper/ubuntu--vg-ubuntu--lv
/dev/mapper/ubuntu--vg-ubuntu--lv contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 395236 was part of the orphaned inode list. FIXED.
...
Pass 2: Checking directory structure
Symlink /var/lib/docker/overlay2/e2983dd0abaadcf1745ffa84433e9c84f8b1a4ab79ee4b9b059f039758b885d1/diff/opt/gitlab/embedded/service/gitlab-shell/.gitlab_shell_secret (inode #1613356) is invalid.
Clear<y>? yes
...
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(7406080--7406211) -(7407552--7407614) -(7408736--7408765) -(7411072--7411125) -(7413760--7413824) -(7414784--7415295)
Fix? yes
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** REBOOT SYSTEM *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 453053/2097152 files (0.1% non-contiguous), 3134566/8388608 blocks
I follow this up by running sudo docker image rm gitlab/gitlab-ee -f for all affected images.
I obviously can't keep doing this, as each time I need to remove all the broken files and redeploy all the Docker images.
Question
How do I fix this once and for all? How do I know where to look?
Update 1
The output of dmesg roughly a minute after the corruption occurs. Note that this is just a small portion; dmesg returns a lot of these lines, and they all follow the same Read-only file system format.
[250377.826952] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250377.827922] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.236228] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.237456] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.765351] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.766273] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.982447] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.983362] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.639171] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.640142] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.766604] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.767619] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
I think I am already too late running dmesg at this point. I doubt I will be able to notice the issue any faster.
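Since I keep missing the window to run dmesg by hand, a minimal watchdog sketch could poll /proc/mounts and snapshot the kernel log the moment the volume flips to read-only. The log path and the crontab entry below are assumptions on my part, not something I currently run:

```shell
#!/bin/sh
# Hypothetical watchdog: dump dmesg as soon as the root LV is
# remounted read-only. LOG lives on tmpfs, which stays writable
# even after the LV goes read-only.
DEV="/dev/mapper/ubuntu--vg-ubuntu--lv"
LOG="/dev/shm/ro-watchdog.dmesg"

# Succeed if a /proc/mounts line carries "ro" among its mount options.
# /proc/mounts fields are: device mountpoint fstype options dump pass.
line_is_ro() {
    opts=$(printf '%s\n' "$1" | awk '{print $4}')
    case ",$opts," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Scan /proc/mounts once; save the kernel ring buffer if DEV is read-only.
check_once() {
    while read -r line; do
        case "$line" in
            "$DEV "*)
                if line_is_ro "$line"; then
                    dmesg > "$LOG" 2>/dev/null
                    echo "read-only detected on $DEV, dmesg saved to $LOG"
                fi ;;
        esac
    done < /proc/mounts
}

[ -r /proc/mounts ] && check_once

# Example crontab entry (assumed install path):
# * * * * * /usr/local/bin/ro-watchdog.sh
```

Polling once a minute from cron should be early enough to catch the messages before the ring buffer wraps.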
Update 2
I have finally caught the error in time.
[94813.533642] IPv6: ADDRCONF(NETDEV_UP): veth13a279e: link is not ready
[94813.533654] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94813.533658] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94813.534526] br-069ee8939c9c: port 1(veth13a279e) entered disabled state
[94813.603964] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.603967] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94813.604076] device vetha0f7425 entered promiscuous mode
[94813.604292] IPv6: ADDRCONF(NETDEV_UP): vetha0f7425: link is not ready
[94813.604298] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.604299] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94814.483126] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94814.816681] eth0: renamed from vethfda95fe
[94814.831541] IPv6: ADDRCONF(NETDEV_CHANGE): veth13a279e: link becomes ready
[94814.831591] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94814.831593] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94815.842053] eth0: renamed from vethba9b3b5
[94815.859813] IPv6: ADDRCONF(NETDEV_CHANGE): vetha0f7425: link becomes ready
[94815.859882] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94815.859886] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94816.829692] eth0: renamed from veth8fea459
[94816.847459] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6ab738: link becomes ready
[94816.847491] br-54483b957e18: port 1(vetha6ab738) entered blocking state
[94816.847495] br-54483b957e18: port 1(vetha6ab738) entered forwarding state
[95248.835077] r8169 0000:13:00.0 enp19s0: link down
[95257.640648] r8169 0000:13:00.0 enp19s0: link up
[96145.500660] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500670] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 a4 7d 40 00 00 30 00
[96145.500674] print_req_error: I/O error, dev sda, sector 10779968
[96145.500826] Aborting journal on device dm-0-8.
[96145.500828] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500834] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 07 64 41 a8 00 00 08 00
[96145.500895] print_req_error: I/O error, dev sda, sector 124010920
[96145.500932] EXT4-fs error (device dm-0) in __ext4_new_inode:1082: Journal has aborted
[96145.500947] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1709169 (offset 0 size 4096 starting block 6849845)
[96145.500982] Buffer I/O error on device dm-1, logical block 6849845
[96145.501023] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501025] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 05 d0 58 88 00 00 08 00
[96145.501027] print_req_error: I/O error, dev sda, sector 97540232
[96145.501058] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 450308 (offset 0 size 0 starting block 7735313)
[96145.501060] Buffer I/O error on device dm-0, logical block 7735313
[96145.501105] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501110] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 0a 24 94 80 00 00 08 00
[96145.501114] print_req_error: I/O error, dev sda, sector 170169472
[96145.501147] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 3148652 (offset 0 size 4096 starting block 12619664)
[96145.501148] Buffer I/O error on device dm-1, logical block 12619664
[96145.501187] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501189] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 03 27 16 40 00 00 80 00
[96145.501190] print_req_error: I/O error, dev sda, sector 52893248
[96145.501266] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501268] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 07 64 58 e0 00 00 08 00
[96145.501269] print_req_error: I/O error, dev sda, sector 124016864
[96145.501301] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1706133 (offset 0 size 4096 starting block 6850588)
[96145.501302] Buffer I/O error on device dm-1, logical block 6850588
[96145.501382] EXT4-fs (dm-0): Delayed block allocation failed for inode 450308 at logical offset 18450 with max blocks 7 with error 30
[96145.501438] EXT4-fs (dm-0): This should not happen!! Data will be lost
[96145.501477] EXT4-fs error (device dm-0) in ext4_writepages:2895: Journal has aborted
[96145.504583] Aborting journal on device dm-1-8.
[96145.507210] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[96145.507306] EXT4-fs (dm-0): Remounting filesystem read-only
[96145.513169] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5769: Journal has aborted
[96145.514680] EXT4-fs error (device dm-0) in ext4_mkdir:2660: IO failure
[96145.516071] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516095] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516121] EXT4-fs (dm-1): Remounting filesystem read-only
[96145.518645] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.520899] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.531523] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542876] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542955] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.944229] traps: postgres[2215] general protection ip:7fba363d97a1 sp:7ffc943cc880 error:0 in ld-musl-x86_64.so.1[7fba363d2000+46000]
[96146.339588] br-ae1db04d3386: port 4(vethb6cf243) entered disabled state
[96146.345493] veth3fdc38e: renamed from eth0
[96146.441684] overlayfs: upper fs is r/o, try multi-lower layers mount
[96149.670401] br-ae1db04d3386: port 7(veth2bcd3f7) entered disabled state
[96149.670617] vethf7402f2: renamed from eth0
[96149.776068] overlayfs: upper fs is r/o, try multi-lower layers mount
[96180.516686] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.516842] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.516985] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517093] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517192] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517558] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517753] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.518982] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.520186] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.521475] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
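For reference, the failing sectors can be pulled out of these print_req_error messages mechanically. A small sed sketch (the function name is mine) that lists device/sector pairs, which I could then cross-check against the disk:

```shell
# Sketch: extract "device sector" pairs from kernel lines such as
# "print_req_error: I/O error, dev sda, sector 10779968".
failing_sectors() {
    sed -n 's|.*I/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*|\1 \2|p'
}

# Usage: dmesg | failing_sectors | sort -u
```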
Update 3
I have connected the drive to a Windows host and checked for firmware updates. The drive has the latest firmware.
SMART also reports what seem to me to be very reasonable numbers.
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 000 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 5638
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 75
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 26
170 Unknown_Attribute 0x0000 100 100 010 Old_age Offline - 18
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 2424903
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 1
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 52
194 Temperature_Celsius 0x0022 077 068 000 Old_age Always - 23 (Min/Max 22/32)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
231 Temperature_Celsius 0x0000 004 004 000 Old_age Offline - 96
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 3775
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 2489
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 976
244 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 37
245 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 71
246 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 141104
I will install a fresh OS and restore the data from a backup to see if that resolves the problem.
Docker info
As only the Docker overlay directory seems to get corrupted, here is the Docker storage configuration:
Server Version: 19.03.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
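One more detail that may matter: the ext4 error policy (errors=remount-ro is the usual Ubuntu root default, via /etc/fstab or the superblock) is what turns the write failures above into a read-only remount. A quick sketch to confirm the configured superblock default; the helper name is mine, and I am assuming the usual tune2fs -l output format:

```shell
# Sketch: read the configured ext4 error behavior ("Continue",
# "Remount read-only", or "Panic") from `tune2fs -l` output on stdin.
errors_behavior() {
    awk -F': *' '/^Errors behavior:/ { print $2 }'
}

# Usage (device from this post; a mount option in /etc/fstab can
# override whatever the superblock says):
# sudo tune2fs -l /dev/mapper/ubuntu--vg-ubuntu--lv | errors_behavior
```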