
Before I tried to repartition a 10 TB HDD, parted could see it:

# parted /dev/sdb
(parted) print list                                                       
Model: ATA ST10000NM0016-1T (scsi)
Disk /dev/sdb: 10.0TB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  10.0TB  10.0TB  xfs          primary
....
....
....

Then I tried to repartition it, but that failed:

[root@localhost ~]# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt                                                      
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? Yes
Error: end of file while reading /dev/sdb
Retry/Ignore/Cancel? Retry                                                
Error: end of file while reading /dev/sdb
Retry/Ignore/Cancel? Cancel                                               
(parted) q                                                                
Warning: Error fsyncing/closing /dev/sdb: Input/output error
Retry/Ignore? Retry                                                       
Warning: Error fsyncing/closing /dev/sdb: Input/output error
Retry/Ignore? Ignore                                                      

Then the drive disappeared. I rebooted, but still couldn't see the drive.

This post suggested using gdisk /dev/sdb. However, I think the disk is so corrupted that gdisk can't recognize it:

# gdisk -l /dev/sdb
GPT fdisk (gdisk) version 0.8.10

Problem opening /dev/sdb for reading! Error is 2.
The specified file does not exist!

lsblk's output:

# lsblk 
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    1 447.1G  0 disk 
├─sda1                 8:1    1     2G  0 part /boot
└─sda2                 8:2    1 445.1G  0 part 
  ├─centos-root      253:0    0    30G  0 lvm  /
  ├─centos-swap      253:1    0     4G  0 lvm  [SWAP]
  ├─centos-var       253:2    0    30G  0 lvm  /var
  ├─centos-coredumps 253:3    0    30G  0 lvm  /coredumps
  └─centos-latest    253:4    0 351.1G  0 lvm  /latest

ls -ltr /dev/sd*'s output:

brw-rw---- 1 root disk 8, 0 Feb 10 16:00 /dev/sda
brw-rw---- 1 root disk 8, 2 Feb 10 16:00 /dev/sda2
brw-rw---- 1 root disk 8, 1 Feb 10 16:00 /dev/sda1

lshw -class disk, parted -l and fdisk -l also don't see the drive.

I see something fishy in dmesg:

[Wed Feb 10 13:27:39 2021] ata13: softreset failed (1st FIS failed)
[Wed Feb 10 13:27:49 2021] ata13: softreset failed (device not ready)
[Wed Feb 10 13:28:06 2021] ata13: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Feb 10 13:28:11 2021] ata13.00: qc timeout (cmd 0xec)
[Wed Feb 10 13:28:11 2021] ata13.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[Wed Feb 10 13:28:17 2021] ata13: link is slow to respond, please be patient (ready=0)
[Wed Feb 10 13:28:21 2021] ata13: softreset failed (device not ready)
[Wed Feb 10 13:28:31 2021] ata13: softreset failed (1st FIS failed)
[Wed Feb 10 13:28:41 2021] ata13: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Feb 10 13:28:51 2021] ata13.00: qc timeout (cmd 0xec)
[Wed Feb 10 13:28:51 2021] ata13.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[Wed Feb 10 13:28:51 2021] ata13: limiting SATA link speed to 3.0 Gbps
[Wed Feb 10 13:28:52 2021] ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Wed Feb 10 13:29:13 2021] ata13.00: qc timeout (cmd 0x47)
[Wed Feb 10 13:29:13 2021] ata13.00: READ LOG DMA EXT failed, trying unqueued
[Wed Feb 10 13:29:13 2021] ata13.00: failed to get NCQ Send/Recv Log Emask 0x40
[Wed Feb 10 13:29:13 2021] ata13.00: ATA-10: ST10000NM0016-1TT101, SNE0, max UDMA/133
[Wed Feb 10 13:29:13 2021] ata13.00: 19532873728 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[Wed Feb 10 13:29:13 2021] ata13.00: failed to set xfermode (err_mask=0x40)
[Wed Feb 10 13:29:13 2021] ata13.00: disabled
[Wed Feb 10 13:29:13 2021] ata13: hard resetting link
[Wed Feb 10 13:29:23 2021] ata13: softreset failed (1st FIS failed)
[Wed Feb 10 13:29:23 2021] ata13: hard resetting link
[Wed Feb 10 13:29:33 2021] ata13: softreset failed (device not ready)
[Wed Feb 10 13:29:33 2021] ata13: hard resetting link
[Wed Feb 10 13:29:39 2021] ata13: link is slow to respond, please be patient (ready=0)
[Wed Feb 10 13:29:49 2021] ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Wed Feb 10 13:29:49 2021] ata13: EH complete

=================================

Update #1

I read this article and turned off ACPI; another article suggested a power issue, so I also turned off tuned-adm. The disk then came back, and I ran parted /dev/sdb with mklabel gpt just like last time. This time there was no Error: end of file while reading /dev/sdb, but when I continued with mkpart primary xfs 0% 1%, it gave me Error: /dev/sdb: unrecognised disk label. I rebooted the machine and tried again:

(parted) mkpart primary xfs 0% 1%                                         
(parted) mkpart primary xfs 1% 2%                                         
(parted) mkpart primary ext4 2% 3%                                        
(parted) mkpart primary ext4 3% 4%
(parted) mkpart primary btrfs 4% 5%                                       
(parted) mkpart primary btrfs 5% 6%                                       
(parted) mkpart primary xfs 6% 100%                                       
(parted) print                                                            
Model: ATA ST10000NM0016-1T (scsi)
Disk /dev/sdb: 10.0TB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  100GB   100GB   xfs          primary
 2      100GB   200GB   100GB                primary
 3      200GB   300GB   100GB                primary
 4      300GB   400GB   100GB                primary
 5      400GB   500GB   100GB                primary
 6      500GB   600GB   100GB                primary
 7      600GB   10.0TB  9401GB               primary

(parted) q                                                                

It works, but it seems very unstable. I checked dmesg again and found similar but different failures:

[Thu Feb 11 00:58:31 2021] ata15.00: qc timeout (cmd 0xec)
[Thu Feb 11 00:58:31 2021] ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[Thu Feb 11 00:58:32 2021] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Thu Feb 11 00:58:42 2021] ata15.00: qc timeout (cmd 0xec)
[Thu Feb 11 00:58:42 2021] ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[Thu Feb 11 00:58:42 2021] ata15: limiting SATA link speed to 3.0 Gbps
[Thu Feb 11 00:58:44 2021] ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Thu Feb 11 00:59:12 2021] ata15.00: ATA-10: ST10000NM0016-1TT101, SNE0, max UDMA/133
[Thu Feb 11 00:59:12 2021] ata15.00: 19532873728 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[Thu Feb 11 00:59:12 2021] ata15.00: configured for UDMA/133
[Thu Feb 11 00:59:12 2021] scsi 14:0:0:0: Direct-Access     ATA      ST10000NM0016-1T SNE0 PQ: 0 ANSI: 5
[Thu Feb 11 00:59:12 2021] sd 14:0:0:0: [sdb] 19532873728 512-byte logical blocks: (10.0 TB/9.09 TiB)
[Thu Feb 11 00:59:12 2021] sd 14:0:0:0: [sdb] 4096-byte physical blocks
[Thu Feb 11 00:59:12 2021] sd 14:0:0:0: [sdb] Write Protect is off
[Thu Feb 11 00:59:12 2021] sd 14:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[Thu Feb 11 00:59:12 2021] sd 14:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Thu Feb 11 00:59:19 2021]  sdb:
[Thu Feb 11 00:59:19 2021] sd 14:0:0:0: [sdb] Attached SCSI removable disk
[Thu Feb 11 00:59:37 2021] SGI XFS with ACLs, security attributes, no debug enabled

Any idea what's going on?

Thanks.

HCSF
  • At the time you ran `gdisk` and `lsblk`, no `/dev/sdb` existed, as if the disk were not plugged in. Kernel messages show that the kernel can't communicate with the device (unfortunately I don't know what they mean precisely). If you can plug it in somewhere else, and try a different cable, you can get a better idea whether the problem is the connection or the disk itself. – berndbausch Feb 10 '21 at 09:18
  • I will get a guy at the datacenter to try that. But it is unlikely to be cable related, as the issue appeared immediately after repartitioning with `parted`. Before that, the disk was recognized correctly. – HCSF Feb 10 '21 at 09:20
  • My money is on the drive electronics going out at just that moment, which may mislead a human being into thinking they are somehow related. – Michael Hampton Feb 10 '21 at 12:01
  • @MichaelHampton Maybe. My drive came back, but things still don't look stable. – HCSF Feb 10 '21 at 17:25

2 Answers


It turned out to be a faulty SATA controller.

I replaced the SATA cable and then the entire HDD: same issue. I reinstalled the entire OS: same issue.

Replacing the SATA controller solved the issue.
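
When swapping parts like this, it can help to tally the libata failure messages per port so you can see whether the errors follow the disk or stay with the port. The following is only a rough sketch: the function name `summarize_ata_errors` is made up for illustration, and it assumes the kernel log lines look like the dmesg excerpts in the question (`ataN: ...` or `ataN.00: ...` containing "failed", "timeout" or "disabled").

```shell
#!/bin/sh
# Hypothetical helper: count libata failure lines per ATA port.
# Reads kernel log text on stdin and prints "port count", sorted by port name.
# Assumption: the log format matches the dmesg excerpts quoted in the question.
summarize_ata_errors() {
    awk '
        # Only consider lines that report a failure and mention an ATA port.
        /failed|timeout|disabled/ && match($0, /ata[0-9]+[.:]/) {
            # Strip the trailing "." or ":" so ata13 and ata13.00 count together.
            port = substr($0, RSTART, RLENGTH - 1)
            count[port]++
        }
        END { for (p in count) print p, count[p] }
    ' | sort
}

# Typical use (reading the kernel log may require root):
#   dmesg | summarize_ata_errors
```

If the counts stay on one port no matter which disk or cable is attached, that points toward the port or controller rather than the drive, which matches what I eventually found.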

HCSF

I had similar hardware problems with a fairly new disk. dmesg showed:

ataX: softreset failed (device not ready)

and similar messages until the disk failed completely. I checked nearly everything and searched the Internet extensively.

I noticed a scrubbing sound and, at times, the short sounds of the disk spinning down and up. In my case an inadequate power cable was the cause. Once the disk was connected to a separate power source, no further failures were logged.

schweik