Fibre channel: LTO tape overwritten on bus reset

Question

We had a situation at a customer site that I'd like to understand better.

Here's what happened:

A library with LTO tape drives is connected to a fibre channel environment
Archiving software running on Windows Server 2008 is writing data to the tapes
At some point the tape was rewound without the software being aware of it, and the subsequent writes overwrote the tape
The situation was detected by comparing the expected position on the tape with the actual one

I don't have details about the equipment vendors.

It seems that a reset happened on the tape drive and caused the tape to rewind, but that condition was not reported as an error back to the driver and the software, so the software assumed that the write was successful.

I have read a lot of documentation to understand why this happened, but I can't draw any firm conclusions to assist the customer.

Can a FC HBA or switch on its own retransmit the SCSI write on bus reset?
- Can something like this be configurable?
Did FC HBA or switch ignore the reported Unit Attention?
Can the OS driver be to blame?
Is this vendor specific?

I'd be very thankful if someone could point me in a direction to continue investigating.

This seems apropos to your situation: http://serverfault.com/questions/237281/do-i-need-to-run-a-verfication-on-lto-tape-backups-even-though-the-drives-themse – Andrew Henle, Apr 08 '16 at 14:16
@AndrewHenle: I don't think so. My question is about an unwanted tape rewind happening unnoticed while writing data to tape in an FC environment. – matejk, Apr 08 '16 at 14:28
In the past I spent a few years administering a large deployment of tape libraries with LTO drives, and I never saw what you describe. The major differences between your setup and the one I administered appear to be the FC switch in your setup and your use of Windows. If the API Windows uses for writing to tape drives is similar to Linux's, then there is a quite easy way to introduce a bug that occasionally misses errors and causes data corruption, but I have never seen it lose data, only corrupt it during write. – kasperd, Apr 12 '16 at 07:57
My first guess as to what happened would be that somehow two of the hosts connected to this FC switch were trying to use the same drive simultaneously, maybe in connection with a software bug causing error conditions to be missed. – kasperd, Apr 12 '16 at 08:30
Is the tape drive accessible to more than one host via FC? Have you checked the tape drive's error logs with the vendor's tape drive diagnostic tools? Also, drives can be configured to not rewind on device reset. What make is the tape drive and what backup software is it? Have you contacted the backup software vendor? It seems strange that the overwrite happened at all, as the enterprise backup software I've used generally queries the tape drive periodically for the logical tape position and verifies that it matches the amount of data written so far to the tape. – etherfish, Apr 18 '16 at 04:44
Here's a link to a description of the tape position verification I mentioned in my above comment. See the header, "External event has caused rewind." https://www.veritas.com/support/en_US/article.TECH169477 – etherfish, Apr 18 '16 at 05:05

MadHatter · Answer 1 · 2016-04-12T22:25:48.810

3

This is a known problem with tape drives, and the way that they are trivially-easy to rewind merely by looking sideways at the device (ie, opening it in the wrong way - via the rewinding device - just eg to check status).

At least one major piece of UNIX backup software is so worried by this that it simply refuses to write to a tape a second time until that tape is ready to be erased; this from the amanda FAQ (which specifically mentions bus resets as a problem area):

Why does Amanda not append to a tape?

One run of Amanda = one (set of) tapes. Amanda opens the tape device once, writes all the images and filemarks, and closes the device once. Using that sequence, there is no possibility that other programs interrupt the sequence and rewind the tape, without Amanda noticing.

Doing "mt -f /dev/st0 status" could be enough, or even "amcheck daily". Also, an error like a scsi bus reset implies a rewind.

If Amanda would close and reopen the tape drive for each backup image, there is a window of vulnerability that the tape gets rewound accidentally, and the next image will overwrite all the good backups on the tape. And you wouldn't know unless you tried to restore from the tape.

When appending to a tape, there is the possibility that, between the time that Amanda positions to the last image (that already is not really trivial!), and opening the device for writing, a tape rewind happens, and in that case Amanda would happily erase ALL of the tape, containing possibly many days worth of backup.

Bacula similarly addresses the issue by never closing the tape device, so no one else can open it wrongly while a tape is loaded. But that doesn't get around the bus reset problem.

Essentially, this is a problem, and it's a hard one. I might argue that your backup hardware should be sufficiently rock-solid that these don't happen often; if FC seems particularly prone to these, it's time to get a SAS tape drive instead, or at least directly-attach the tape device to the backup server in order to remove fibre switches etc. from the path. Other than that, I can't see how you can do much more than you have, since you caught the problem before the usual point, ie "our restores don't work, we're screwed".
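The position comparison that caught the problem here can be sketched roughly as below. This is only an illustration under assumptions: `PositionGuard`, `TapePositionError` and the `read_position` callable are hypothetical names, and `read_position` stands in for whatever call your platform provides to ask the drive for its current logical block number. It is not how any particular backup product does it.

```python
from dataclasses import dataclass
from typing import Callable


class TapePositionError(RuntimeError):
    """Raised when the drive's reported position disagrees with our own count."""


@dataclass
class PositionGuard:
    # Hypothetical callable returning the drive's current logical block
    # number; supply a wrapper around your platform's position query.
    read_position: Callable[[], int]
    expected: int = 0

    def sync(self) -> None:
        """Adopt the drive's position as the baseline, e.g. after seeking to EOD."""
        self.expected = self.read_position()

    def before_write(self) -> None:
        """Refuse to write if the tape has moved since our last operation."""
        actual = self.read_position()
        if actual != self.expected:
            raise TapePositionError(
                f"before write: expected block {self.expected}, drive reports {actual}"
            )

    def after_write(self, blocks_written: int) -> None:
        """Confirm the drive advanced by exactly the number of blocks written."""
        target = self.expected + blocks_written
        actual = self.read_position()
        if actual != target:
            raise TapePositionError(
                f"after write: expected block {target}, drive reports {actual}"
            )
        self.expected = target
```

Note that a check before the write only narrows the window: a reset between the check and the write itself can still slip through, so the check after the write is the one that tells you damage has already been done and that writing must stop.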

edited Apr 12 '16 at 22:25

answered Apr 12 '16 at 08:43

MadHatter


Thanks for your answer. Do you have concrete information about what can happen in a TC environment to cause what we experienced? – matejk Apr 12 '16 at 14:20
@matejk TC environment? – MadHatter Apr 12 '16 at 14:26
I personally spent years administering a large backup system using LTO-4 drives, and I never saw this problem. That system would happily append to previously written tapes, and it would do verification reads. So it appears the developers of that system somehow found a way to avoid the problem. – kasperd Apr 12 '16 at 17:18
@MadHatter: Sorry, of course I meant FC (fibre channel). – matejk Apr 12 '16 at 20:21
@MadHatter I can understand that what Bacula does makes a lot of sense. It is very similar to what our software does: seek to EOD before starting to write, without closing the device in between (plus a lot of other precautions). However, the rewind happened *unnoticed* while the device was open and while we were writing to the tape. We detected that by comparing the position reported by the drive: it suddenly dropped to 1 after a write, meaning that the tape was overwritten. Based on your feedback I can assume that the customer's old FC equipment is to blame. I don't know what to advise them... – matejk Apr 12 '16 at 20:30
@kasperd Unless you restored from every tape you wrote, you don't know that you didn't have the problem, only that you never found out that you had it. Verification reads tell you only that the data you just laid down went to tape, which isn't at issue here; what's at issue is whether in doing so, it overwrote earlier data on the tape. It would be a very sophisticated piece of software that verified *last* week's backups, having written *this* week's ones. – MadHatter Apr 12 '16 at 22:15
@matejk I don't know what to tell you. If you are sure that bus resets are happening, then I have provided you with a reference that these can indeed be a major problem when they happen. You will need to find a hardware combination that is less prone to them, or an OS that reports them as a screamingly fatal event to the backup software. – MadHatter Apr 12 '16 at 22:23
@MadHatter I am not talking about the verification reads automatically performed by the drive. The software would do full tape reads after the tape had been filled. Also accidentally overwriting a tape from the start when appending data would have caused outliers in tape capacity, which would have been noticed when analyzing the health of the system. – kasperd Apr 13 '16 at 06:43
@kasperd I didn't think you were. Full tape reads after the tape has been filled are only possible if you only write one set of backups to a tape: otherwise, what are you verifying earlier backups against? Only writing once per tape is exactly how amanda guards against overwriting. As for outliers, I agree with you that those are very strong signifiers of accidental overwriting, and that seems to be analogous to how such was noticed in this case, also. – MadHatter Apr 13 '16 at 06:49
@MadHatter If you know how many times you have been writing to the tape and how many bytes you wrote each time, it would be quite obvious at read time if something was missing. – kasperd Apr 13 '16 at 07:09
@kasperd I agree. I think that needs a less ambiguous name than *verification reads*, but if that's what your software did after each tape went full, that would definitely catch accidental-rewind-based overwriting. Care to share the name of your software that does this? – MadHatter Apr 13 '16 at 07:17
@MadHatter It was developed in house and never made publicly available. – kasperd Apr 13 '16 at 07:20
@kasperd that's fair enough, and three cheers for you and your team for making something elegant. I only wish more commercial - and free - backup software did this as an end-stop check. – MadHatter Apr 13 '16 at 07:21
Although I'm sure you already know, I thought it's worth adding for the benefit of others that there are rewind and non-rewind dev files for tape drives. Traditionally, there are two sets of /dev/ entries for each SCSI tape device, stN and nstN. When the device is closed, the SCSI tape device driver will automatically rewind the tape when using the stN devices and will not rewind when using the nstN devices. Anyone curious should see st(4) man page. – etherfish Apr 18 '16 at 04:37
@etherfish I know. But the rewinding device is usually the default (try `mt eod` and see where you end up on the tape!) which makes it very easy to rewind a tape unintentionally. For myself, I think that tape devices should *only* have a non-rewinding device, and *require* eg an `mt rewind` in order to rewind. But what do I know?! – MadHatter Apr 20 '16 at 10:42
@MadHatter I couldn't agree more. I /always/ use /dev/nst0. I imagine there's a historical reason for the rewind-on-close default behavior. Likely it's because the first driver to enjoy popular use did that, so no one bothered explicitly rewinding in their scripts, and it all went downhill from there. With modern single-spool tapes like DLT and LTO, you can only eject the tape if it's fully rewound, but as far as I can remember, tape cartridges from 20 years ago all had two spools like VHS and could be ejected at any time - so maybe auto rewind is more appropriate now than before? – etherfish Apr 20 '16 at 19:54
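
The read-time check discussed in the comments above (knowing how many times the tape was written and how many bytes went down each time) might look something like this minimal sketch. It assumes the backup software keeps its own catalogue of session sizes somewhere other than the tape; `first_mismatch`, `recorded` and `found` are hypothetical names for illustration.

```python
from typing import List, Optional


def first_mismatch(recorded: List[int], found: List[int]) -> Optional[int]:
    """Compare the catalogue of session sizes against what a full read found.

    recorded: bytes written in each append session, from the software's own log.
    found:    bytes read back for each session during a full tape read.
    Returns the index of the first session that differs, or None if they agree.
    """
    for index, expected_size in enumerate(recorded):
        # A missing or differently sized session means earlier data was lost.
        if index >= len(found) or found[index] != expected_size:
            return index
    # Extra sessions on tape that the catalogue does not know about.
    if len(found) > len(recorded):
        return len(recorded)
    return None
```

A tape that was rewound and overwritten mid-run would typically read back with only the last session on it, so the mismatch shows up at index 0.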
