Centos7 - Buffer I/O error on dev sda, logical block xxxxxxxxx, lost async page write

Question

I have a webserver that have the content in HP MSA2040 Storage (10 tb total storage).

I keep getting error like below

Jul 31 19:06:24 xxxxxxxx*** kernel: blk_update_request: I/O error, dev sda, sector 7094923416
Jul 31 19:06:24 xxxxxxxx*** kernel: buffer_io_error: 1110 callbacks suppressed
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865171, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865172, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865173, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865174, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865175, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865176, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865177, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865178, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865179, lost async page write
Jul 31 19:06:24 xxxxxxxx*** kernel: Buffer I/O error on dev sda, logical block 886865180, lost async page write

I've tried to run xfs_repair on /dev/sda which is my MSA2040 storage, this is the report I've got

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...

I even tried to run xfs_repaif -L I can reach to my data but it got stuck after some time. I have also checked MSA's interface, everything seems smooth.

Is there any method to solve this issue?

Thanks in advance.

Edit* This is also smartctl report

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MSA 2040 SAN
Revision:             G210
User Capacity:        10,239,998,951,424 bytes [10.2 TB]
Logical block size:   512 bytes
LU is thin provisioned, LBPRZ=1
Rotation Rate:        15000 rpm
Logical Unit id:      0x600xxxxxxxxxxxxxxxef5701000000
Serial number:        00c0ff27xxxxxxxxxxxx701000000
Device type:          disk
Transport protocol:   Fibre channel (FCP-2)
Local Time is:        Mon Jul 31 19:22:30 2017 +03
SMART support is:     Available - device has SMART capability.
SMART support is:     Disabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Elements in grown defect list: 0

Error Counter logging not supported

Device does not support Self Test logging

Edit2* Requested outputs

[root@xxxxxxxx*** thumbs]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  9.3T  0 disk /msa10tb
sdb               8:16   0  1.8T  0 disk
├─sdb1            8:17   0  200M  0 part /boot/efi
├─sdb2            8:18   0  500M  0 part /boot
└─sdb3            8:19   0  1.8T  0 part
  ├─centos-root 253:0    0   50G  0 lvm  /
  ├─centos-swap 253:1    0  7.8G  0 lvm  [SWAP]
  └─centos-home 253:2    0  1.8T  0 lvm  /home
sr0              11:0    1 1024M  0 rom
[root@xxxxxxxx*** thumbs]# pvs
  PV         VG     Fmt  Attr PSize PFree
  /dev/sdb3  centos lvm2 a--  1.82t 64.00m
[root@xxxxxxxx*** thumbs]# vgs
  VG     #PV #LV #SN Attr   VSize VFree
  centos   1   3   0 wz--n- 1.82t 64.00m
[root@xxxxxxxx*** thumbs]# lvs
  LV   VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home centos -wi-ao----  1.76t
  root centos -wi-ao---- 50.00g
  swap centos -wi-ao----  7.75g
[root@xxxxxxxx*** thumbs]#

Edit*3 - When I check journalctl, now I keep getting these logs;

Jul 31 19:29:46 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1461891501:1461898801 (repaired)
Jul 31 19:29:48 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1461891501:1461898801 (repaired)
Jul 31 19:29:50 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1461891501:1461898801 (repaired)
Jul 31 19:29:54 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1461891501:1461898801 (repaired)
Jul 31 19:30:03 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1461891501:1461898801 (repaired)
Jul 31 19:30:25 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1462932481:1462952921 (repaired)
Jul 31 19:30:27 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1462932481:1462952921 (repaired)
Jul 31 19:30:29 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1462932481:1462952921 (repaired)
Jul 31 19:30:30 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:5e37:2374:63273/443 unexpectedly shrunk window 3861953537:3862015916 (repaired)
Jul 31 19:30:31 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:5e37:2374:63273/443 unexpectedly shrunk window 3861953537:3862015916 (repaired)
Jul 31 19:30:32 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:5e37:2374:63273/443 unexpectedly shrunk window 3861953537:3862015916 (repaired)
Jul 31 19:30:33 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:1885:f5b0:30313/443 unexpectedly shrunk window 1462932481:1462952921 (repaired)
Jul 31 19:30:34 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:5e37:2374:63273/443 unexpectedly shrunk window 3861953537:3862015916 (repaired)
Jul 31 19:30:38 xxxxxxxx*** kernel: Peer 0000:0000:0000:0000:0000:ffff:5e37:2374:63273/443 unexpectedly shrunk window 3861953537:3862015916 (repaired)

How are the head node connected to the storage server? iSCSI? Fiber channel? — shodanshok, Jul 31 '17 at 19:16
@shodanshok It is fiber channel, Type:FC, Speed 8G. Could SFP or the cables be the problem? — Lunatic Fnatic, Jul 31 '17 at 19:23

shodanshok · Accepted Answer · 2017-12-13T15:18:39.470

7

Messages as

Buffer I/O error on dev sda, logical block 886865171, lost async page write

mean that an async write (ie: dirty page writeback or buffered writes) failed. You found these errors in dmesg or /var/log/message because failed async writes can not, by their very nature, be notified to the original application which submitted the writes in the first place.

They are often caused by a media where some blocks can not be written. This can happen due to:

damaged disk platters/cells
connection problems (ie: bad cable, a "lost" iSCSI target, etc)
thinly provisioned block device whose parent pool space was exausted

You are using sda directly with a filesystem on top, without head node side LVM, so we can exclude a bad device mapper table as a problem source (on the head node, at least).

If the physical properties of your storage node are OK (ie: no faulty disks, etc), I strongly suggest to review any thinly-provisioned volumes and their parent pools.

edited Dec 13 '17 at 15:18

answered Jul 31 '17 at 20:20

shodanshok

44,038
6
98
162

Thank you, I will revise the whole cable and SFP first to see if the problem persists, then will check the volumes and update whats going on. – Lunatic Fnatic Aug 01 '17 at 05:45
I have done everything, changed cables, sfps, repaired xfs, reviewd volumes and pools but the its still persists :-/ I keep getting this loop `Aug 01 17:00:24 ****.com kernel: sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK Aug 01 17:00:24 ****.com kernel: sd 0:0:0:0: [sda] tag#13 CDB: Read(16) 88 00 00 00 00 04 80 30 9b 90 00 00 01 00 00 00 Aug 01 17:00:24 ****.com kernel: blk_update_request: I/O error, dev sda, sector 19330538384` – Lunatic Fnatic Aug 01 '17 at 14:16
It really seems a full thinpool to me. If you have a support contract with HP, I suggest you to open a support ticket with them. – shodanshok Aug 01 '17 at 16:11
I changed single mode SFP and cable to multimode sfp & cable and the problem has been solved. Its funny how it wasnt a problem before.. – Lunatic Fnatic Aug 02 '17 at 14:11

Centos7 - Buffer I/O error on dev sda, logical block xxxxxxxxx, lost async page write

1 Answers1