How to automatically detect inserted SATA disk in Solaris if cfgadm status is disconnected?

Question

My goal is to automate a backup routine on a small OpenSolaris NAS (running OmniOS + napp-it on a HP Microserver N54L) in combination with SATA disks.

Background:

I have installed one of those 5.25" -> 3.5" carrier-less HDD trays that contain a simple SATA or SAS/SATA backplane with one port, a power button and some LEDs (power and HDD activity). To back up to multiple HDDs (one each week in rotation, stored offsite), I have written a script that uses zfs send/recv to dump the complete main pool including all snapshots (updating only new blocks). This script works fine when I start it manually.

I'd like to further automate that process, because the NAS does not have direct VGA or serial console attached and it is tedious to insert the disk, go back to the desktop system, log onto the web interface or SSH and start the script manually. Timed start via cron job is not an option, because the days of backup may vary slightly (forgot the disk, holidays, etc.). So the backup should start right after insertion of the disk.

Problem:

In the script I use cfgadm to connect + configure and later unconfigure + disconnect the disks. If I only insert the disk and it spins up, I have no way of knowing that the disk is there. Possible solutions I've considered already:

Probing for a new disk and zpool every x minutes continuously by using cfgadm -f -c connect and checking for error results. Not very elegant.
Checking /var/adm/messages every x minutes and grepping for device path or AHCI. Not possible, because messages are only written if the device is connected manually.
Using iostat -En. Displays the disks, but I have to grep for the exact serial numbers, because it does not list port information. Also needs to be done every x minutes.
Using cfgadm with SELECT syntax to filter for receptacle status. Does not work, because the insertion does not trigger anything (maybe backplane is too cheap for that).
Recognizing the power on/off of the enclosure. Would be okay, but I couldn't figure out how to accomplish this.
Remapping the power button or adding another button to the machine. Could work, but I also don't know how to do this.

I think I would need two things:

a reliable way to identify disk and port status in combination (so only the correct disk in the correct slot is detected)
a way to register this detection and trigger an event (start shell script)

Is this possible? If not, what would you suggest as alternatives?

Final solution (updated 2016-01-26):

For anyone with similar problems in the future:

Enable AHCI hotswap in OmniOS as detailed in the accepted answer by gea.
Use syseventadm as detailed in my own answer to trigger the backup script when the disk comes online.
Make sure your cables, controller and disks are fault-free and play well together (I had problems with WD SE 4TB disks and the onboard AHCI SATA controller, which resulted in random `WARNING: ahci0: ahci_port_reset port 5 the device hardware has been initialized and the power-up diagnostics failed` messages in the system logs).

score 3 · Accepted Answer · answered Jan 22 '16 at 10:05

Onboard SATA/AHCI is hotplug-capable, but hotplug is disabled by default in OmniOS. To enable it, add the following line to /etc/system and reboot:

set sata:sata_auto_online=1

gea

Thank you, I did add it, but even after a reboot it is not working. I suspect the backplane may be to blame, I will check for this by directly adding the disk, another disk and another controller this weekend when there is room for downtime. I'm also getting a lot of ` WARNING: ahci0: ahci_port_reset port 5 the device hardware has been initialized and the power-up diagnostics failed` in the log. I will report back if I have more info to share. – user121391 Jan 22 '16 at 13:12
After some testing I found out that the combination of WD SE 4TB SATA drives and the onboard port on the microserver is most likely to blame. A cheaper and smaller disk had no problem with hotplugging and the drives with another controller were also fine. With the onboard controller there would be power-up diagnostics failed in about 80% of cases, changing the cable and enclosure did not help. – user121391 Jan 25 '16 at 08:22

score 1 · Answer 2 · answered Jan 21 '16 at 13:51

Interesting question... a bit of a science experiment, as I'd probably just use USB or send remotely or have this on a schedule...

But in your case, I wouldn't try to "look" for the disk at all from a cfgadm or log parsing manner. That's not scalable.

I'd simply give the removable disk a unique ZFS pool name and script logic around a periodic zpool import. In ZFS under Linux, the pool import process is a system service/daemon. But there's no cost to running it periodically. It'll detect the drive and associated pool.

I hope you're exporting the pool when you're done with the backup as well. That would cover situations where the drive remains in the server for multiple backup cycles. Like leaving a backup tape in its drive.
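
A minimal sketch of that polling idea, under some loud assumptions: the pool name `tape0` is borrowed from the comments below, `/path/to/backup.sh` is a placeholder for the existing backup script, and the `ZPOOL`, `POOL` and `BACKUP_CMD` variables and the `run` argument are hypothetical conveniences for dry runs. It relies only on `zpool list`, `zpool import` and `zpool export`.

```shell
#!/bin/sh
# Sketch: try to import the backup pool, run the backup, export again.
ZPOOL="${ZPOOL:-zpool}"                         # assumption: overridable for dry runs
POOL="${POOL:-tape0}"                           # assumption: unique name of the backup pool
BACKUP_CMD="${BACKUP_CMD:-/path/to/backup.sh}"  # placeholder for the real backup script

poll_once() {
    # Pool already imported: an earlier run is still busy or never exported it.
    if "$ZPOOL" list -H -o name "$POOL" >/dev/null 2>&1; then
        echo "busy"
        return 0
    fi
    # Import succeeds only if a disk carrying this pool is visible to the system.
    if "$ZPOOL" import "$POOL" >/dev/null 2>&1; then
        "$BACKUP_CMD" "$POOL"
        "$ZPOOL" export "$POOL"   # export even if the backup failed
        echo "done"
    else
        echo "absent"
    fi
}

# Only act when invoked as "script run", for example from cron.
case "${1:-}" in
    run) poll_once ;;
esac
```

A crontab entry such as `0,10,20,30,40,50 * * * * /path/to/poll.sh run` would call it every ten minutes; the minutes are listed explicitly because the cron shipped with Solaris-derived systems may not accept the `*/10` step syntax.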

ewwhite

Yes, import/export is included, I just left it out of the question because once the connect status is achieved, it all works. I also already check for the unique names (tape0, tape1, etc.) when importing. Remote is not an option, because this NAS is already the remote endpoint of the hourly backup, but the data should also be offsite for disaster recovery. – user121391 Jan 21 '16 at 13:54
I'm saying that all you need is to `zpool import pool name` to discover your disk. – ewwhite Jan 21 '16 at 13:55
That does not work, because the disk is not yet connected. Or do you mean to scrap the configure/connect/unconfigure/disconnect stuff from the script completely, using only zpool import/export and hotswapping? – user121391 Jan 21 '16 at 13:57
I'm suggesting just using `zpool import` and `zpool export`. That's what I do in Linux for this situation. You don't care about the disk itself. Don't treat it like a disk. Think of it as a portable ZFS pool. – ewwhite Jan 21 '16 at 13:59
I have tried your suggestion, unfortunately it does not work. The device is always marked as "faulty" after such an unclean removal and reinsertion. Therefore, no pools are found until I manually do `fmadm faulty`, `cfgadm -f -c connect` and `cfgadm -f -c configure`. I think I remember something along those lines was the reason for adding all the cfgadm commands to the script in the first place - because it did not work at all without them. – user121391 Jan 21 '16 at 16:02
I'm sorry. I think this is a very narrow use case, but I hope you find a solution. – ewwhite Jan 21 '16 at 16:17
No problem, and thanks for your suggestion! I have thought about posting to unix.stackexchange.com, but unfortunately this problem falls somewhere between the two sites... – user121391 Jan 21 '16 at 16:24
Sure. Go ahead and ask. – ewwhite Jan 21 '16 at 16:27

score 1 · Answer 3 · edited Apr 13 '17 at 12:37

I'll add this answer to document what I found out about monitoring events (may also be useful in other cases):

While trying to ask the question on unix/linux.SE, I noticed a useful thread about using udev on Linux to monitor for kernel events. As an equivalent tool for Solaris, I stumbled upon the suggestion to use syseventadm, which watches for sysevents and triggers defined actions/scripts.

At first I did not find much except copies of the man page and some discussions about a problem with Xen Hypervisor, but the supported events are listed in /usr/include/sys/sysevent/eventdefs.h (or online at /usr/src/uts/common/sys/sysevent/eventdefs.h in various repos) and other files in that directory.

Using the first example from the manpage and syseventadm add -c EC_zfs -s ESC_ZFS_scrub_start /path/to/script.sh \$pool_name I successfully tested a sample event that fires every time a scrub is initiated and returns the pool name as first argument.

After some trial and error, I found the correct way to monitor for newly added disks:

syseventadm add -c EC_dev_add -s disk /path/to/script.sh \$version \$dev_name \$phys_path \$driver_name \$instance
syseventadm restart

Everything after disk is optional and directly passed to the script as arguments $1 to $5.

Now, as soon as the newly added disk comes online, the script is triggered. It can then check whether the device ID is correct (optional) and import the pool by name.
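
As a rough sketch of what such a handler could look like (not the script I actually use): it assumes the backup pools are named `tape0` and `tape1`, as mentioned in the comments on another answer, and that `zpool import` without arguments lists importable pools on lines containing `pool: NAME`. The `ZPOOL` and `BACKUP_POOLS` variables and both function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of a handler for the EC_dev_add registration shown above.
# syseventd passes: $1 version, $2 dev_name, $3 phys_path, $4 driver_name, $5 instance

ZPOOL="${ZPOOL:-zpool}"                      # assumption: overridable for dry runs
BACKUP_POOLS="${BACKUP_POOLS:-tape0 tape1}"  # assumption: names of the rotating backup pools

# Print the first backup pool that "zpool import" reports as importable.
find_backup_pool() {
    available=$("$ZPOOL" import 2>/dev/null) || true
    for name in $BACKUP_POOLS; do
        if printf '%s\n' "$available" | grep -q "pool: $name\$"; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

handle_disk_event() {
    phys_path=$3
    if ! pool=$(find_backup_pool); then
        echo "no backup pool found after event for $phys_path"
        return 1
    fi
    if "$ZPOOL" import "$pool"; then
        echo "imported $pool"
        # The existing backup script (zfs send/recv, then zpool export) would start here.
    fi
}

# Run only when called with the five event arguments.
if [ "$#" -eq 5 ]; then
    handle_disk_event "$@"
fi
```

Because the event fires for any added disk, the handler simply does nothing useful when the new disk carries no known pool. A stricter version could additionally compare `$3` against the expected physical path of the tray.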
