Adaptec pm80xx Driver Drops Drives Randomly

Question

I'm building a ZFS NAS using an Adaptec ASA-71605H HBA on Ubuntu 12.04.4.

Modern Linux kernels ship with an open-source version of the required pm80xx kernel module. Adaptec also provides its own driver for Ubuntu 12.04. I tested both, with the same result.

The symptom is that, from time to time, only 14 of the 16 drives are available after boot.

The full dmesg log is available here; the interesting parts are:

[    3.591035] pm80xx 0000:01:00.0: driver version 0.1.37 / 1.0.15-1

[   50.749419] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[   50.749424] sas: ata1: end_device-1:0: dev error handler
[   50.749430] sas: ata2: end_device-1:1: dev error handler
[   50.749433] sas: ata3: end_device-1:2: dev error handler
[   55.900826] ata3.00: qc timeout (cmd 0xec)
[   55.900899] pm80xx:: mpi_sata_completion 2049: SATA IO STATUS 0x1 task ffff8807ee8cc000
[   55.900900] pm80xx:: mpi_sata_completion 2085: status:0x1, tag:0x2, task::0xffff8807ee8cc000
[   55.900831] pm80xx:: pm8001_chip_abort_task 4889: cmd_tag = 0x3, abort task tag = 0x2
[   55.900902] pm80xx:: mpi_sata_completion 2118: SAS Address of IO Failure Drive:50000d1106c76219<6>
[   55.900903] pm80xx:: mpi_sata_completion 2493: task 0xffff8807ee8cc000 done with io_status 0x1 resp 0x0 stat 0x8d but aborted by upper layer!
[   55.900906] pm80xx:: pm8001_mpi_task_abort_resp 3840: ABORT status = 0x0 task ffff8807ee8cc1c0
[   55.900907] pm80xx:: pm8001_mpi_task_abort_resp 3856: ABORT IO_SUCCESS for tag 3 ,task ffff8807ee8cc1c0
[   55.900911] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   66.049020] ata3.00: qc timeout (cmd 0xec)
[   66.049087] pm80xx:: mpi_sata_completion 2049: SATA IO STATUS 0x1 task ffff8807ee8cc000
[   66.049088] pm80xx:: mpi_sata_completion 2085: status:0x1, tag:0x2, task::0xffff8807ee8cc000
[   66.049025] pm80xx:: pm8001_chip_abort_task 4889: cmd_tag = 0x3, abort task tag = 0x2
[   66.049089] pm80xx:: mpi_sata_completion 2118: SAS Address of IO Failure Drive:50000d1106c76219<6>
[   66.049091] pm80xx:: mpi_sata_completion 2493: task 0xffff8807ee8cc000 done with io_status 0x1 resp 0x0 stat 0x8d but aborted by upper layer!
[   66.049093] pm80xx:: pm8001_mpi_task_abort_resp 3840: ABORT status = 0x0 task ffff8807ee8cc1c0
[   66.049094] pm80xx:: pm8001_mpi_task_abort_resp 3856: ABORT IO_SUCCESS for tag 3 ,task ffff8807ee8cc1c0
[   66.049098] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   96.181921] ata3.00: qc timeout (cmd 0xec)
[   96.182001] pm80xx:: mpi_sata_completion 2049: SATA IO STATUS 0x1 task ffff8807ee8cc000
[   96.182009] pm80xx:: mpi_sata_completion 2085: status:0x1, tag:0x2, task::0xffff8807ee8cc000
[   96.181934] pm80xx:: pm8001_chip_abort_task 4889: cmd_tag = 0x3, abort task tag = 0x2
[   96.182014] pm80xx:: mpi_sata_completion 2118: SAS Address of IO Failure Drive:50000d1106c76219<6>
[   96.182020] pm80xx:: mpi_sata_completion 2493: task 0xffff8807ee8cc000 done with io_status 0x1 resp 0x0 stat 0x8d but aborted by upper layer!
[   96.182025] pm80xx:: pm8001_mpi_task_abort_resp 3840: ABORT status = 0x0 task ffff8807ee8121c0
[   96.182029] pm80xx:: pm8001_mpi_task_abort_resp 3856: ABORT IO_SUCCESS for tag 3 ,task ffff8807ee8121c0
[   96.182043] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   96.337817] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0

[   96.354159] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[   96.354177] sas: ata1: end_device-1:0: dev error handler
[   96.354194] sas: ata2: end_device-1:1: dev error handler
[   96.354204] sas: ata3: end_device-1:2: dev error handler
[   96.354210] sas: ata4: end_device-1:3: dev error handler
[   96.510401] ata4.00: ATA-9: ST4000VN000-1H4168, SC43, max UDMA/133
[   96.510409] ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[   96.511106] ata4.00: configured for UDMA/133
[   96.511134] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
[   96.526013] scsi 1:0:3:0: Direct-Access     ATA      ST4000VN000-1H41 SC43 PQ: 0 ANSI: 5

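The first big block shows what a failed drive detection looks like: ata3.00 times out on the IDENTIFY command (cmd 0xec) three times and is given up on. The second block shows a successful detection, where ata4.00 is identified and configured.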
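To spot affected drives across several boots without reading the log by eye, the two patterns above can be matched mechanically. The following is only a minimal sketch: the function name `summarize_detection` is made up for illustration, and it assumes the kernel messages keep the exact wording seen in the excerpt ("failed to IDENTIFY" and "configured for").

```python
import re
from collections import Counter

# Matches lines such as (wording taken from the excerpt above):
# [   55.900911] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
FAILED_IDENTIFY = re.compile(r"\b(ata\d+\.\d+): failed to IDENTIFY")

# Matches lines such as:
# [   96.511106] ata4.00: configured for UDMA/133
CONFIGURED = re.compile(r"\b(ata\d+\.\d+): configured for ")


def summarize_detection(dmesg_text):
    """Return (failures, configured) for a dmesg dump.

    failures   -- dict mapping an ATA device name to its number of
                  failed IDENTIFY attempts
    configured -- set of ATA device names that were configured
    (Hypothetical helper; the name is not part of any driver or tool.)
    """
    failures = Counter()
    configured = set()
    for line in dmesg_text.splitlines():
        match = FAILED_IDENTIFY.search(line)
        if match:
            failures[match.group(1)] += 1
            continue
        match = CONFIGURED.search(line)
        if match:
            configured.add(match.group(1))
    return dict(failures), configured
```

Feeding it the saved output of `dmesg` after each boot gives a per-boot list of devices that failed IDENTIFY, which makes it easier to check whether the same ports are affected repeatedly.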

All hard drives were tested multiple times without errors before going into the full build. It is not always the same drives that drop out; the failures seem completely random.

Another question suggests that the error stems from a shared IRQ 16, and indeed I sometimes see error logs pointing to IRQ 16. Unfortunately, I don't know whether it is possible to use another IRQ: the BIOS does not let me change it, and moving the card to another PCIe slot is not an option because of link speed.
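Whether an interrupt line is actually shared can be read from /proc/interrupts, where several handlers on one IRQ appear as a comma-separated list at the end of the line. The sketch below parses such a dump. The function name `shared_irqs` is hypothetical, and the parsing assumes the usual layout (a header row of CPU columns, one count per CPU, then the interrupt type and handler names) and handler names without spaces.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [handler, ...]} for IRQs with more than one handler.

    Assumes the common /proc/interrupts layout; this is a hypothetical
    helper for illustration, not part of any existing tool.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    # The header row lists one column per CPU: "CPU0 CPU1 ..."
    cpu_count = len(lines[0].split())
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        # Split off the per-CPU counters; the remainder is type + handlers.
        fields = rest.split(None, cpu_count)
        if len(fields) <= cpu_count:
            continue  # no handler information on this line
        parts = [p.strip() for p in fields[cpu_count].split(",")]
        # The first piece still carries the interrupt type in front of the
        # first handler name, so keep only its last token.
        first_tokens = parts[0].split()
        parts[0] = first_tokens[-1] if first_tokens else ""
        handlers = [p for p in parts if p]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared
```

Calling `shared_irqs(open("/proc/interrupts").read())` on the affected machine would show whether the pm80xx handler appears next to another device on IRQ 16, or whether the controller has its own interrupt lines.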

Any help is greatly appreciated. I'm close to ordering an LSI controller to see if it helps, but I hope to get it working with the Adaptec. As it stands, I have serious concerns about trusting my data to this controller.

Update: The problems continue. Even when all drives are found, there are random kernel panics in libsas and the pm80xx kernel module, so this is not usable in production. Thinking about getting an LSI 9201-16i…

Can you tell me about the enclosure? Are you using SATA disks on a backplane expander? — ewwhite, Feb 25 '14 at 20:50
I'm using four pieces of SFF-8643 to 4x SATA 6Gb cables. The drives are Seagate NAS 4TB SATA drives. — Patrick Bergner, Feb 25 '14 at 20:55
