9

When creating a Linux software RAID device as a RAID10 device, I am confused about why it must be initialized. The same question applies to RAID1 or RAID0, really.

Ultimately most people would put a file system of some sort on top of it, and that filesystem should not assume any state of the disk's data. Each write in a RAID10 or RAID1 setup goes to all N mirrors. There should be no reason whatsoever for a RAID10 to be initialized up front, as it will happen over time.

I can understand why for a raid5/6 setup where there is a parity requirement, but even then it seems like this could be done lazily.

Is it just so people feel better about it?

Michael Graff
  • 1
    Good question. It is possible to skip the synchronization when the RAID is being created, and I have come across recommendations for doing so in case one or more underlying devices are SSD. I don't know if scenarios exist in which the synchronization is needed for correct operation. – kasperd Jan 27 '16 at 08:50

5 Answers

7

RAID 1, being a mirror, depends on all disks in the mirror being exact copies of each other. Take one random hard drive and another random hard drive, and they quite possibly hold different data, violating this assumption. This is why initialization is needed: it simply copies the contents of the first drive to the others. Note that under some conditions you can get away with not initializing the drives: factory-new devices usually have zeros all over the place, so there is nothing to synchronize. The mdadm option --assume-clean skips the initialization, but the man page warns you:

   --assume-clean

Tell mdadm that the array pre-existed and is known to be clean. It can be useful when trying to recover from a major failure as you can be sure that no data will be affected unless you actually write to the array. It can also be used when creating a RAID1 or RAID10 if you want to avoid the initial resync, however this practice -- while normally safe -- is not recommended. Use this only if you really know what you are doing.

If you skip it and a block that differs between the drives is read, there is no knowing which drive's contents the read will return. You should be pretty safe with a filesystem (but see the caveats below), because you will most probably write a block before you ever read it, and from then on the copies agree.

Note that at least Linux's mdadm will initialize the array in the background. You can happily create a filesystem on top of it from the first second. Performance will suffer until the initialization is finished, but that's all.

But:

a) When doing mkfs some utilities check if there's something on that drive already. While this only touches a few well-known regions of drive, it reads before you write anything, thus putting you in danger.

b) If you do a periodic resync of your array, the RAID device knows nothing of your FS. It simply reads every block from every device and compares those. And if you are not using a copy-on-write FS (e.g. ZFS or BTRFS) and never fill your FS, it's perfectly plausible for a block to stay uninitialized from FS perspective for years.
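
Point (b) can be made concrete with a toy model. This is only a sketch and not how md works internally: two "disks" are plain Python lists with different leftover contents, the mirror is created without an initial sync, and the scrub compares every block. The names `Mirror`, `write` and `scrub` are invented for illustration.

```python
import random


class Mirror:
    """Toy two-way mirror created without an initial sync (hypothetical model)."""

    def __init__(self, disk_a, disk_b):
        assert len(disk_a) == len(disk_b)
        self.disks = [list(disk_a), list(disk_b)]

    def write(self, block, data):
        # A write always goes to every copy, so written blocks become consistent.
        for disk in self.disks:
            disk[block] = data

    def scrub(self):
        # Like a periodic check: compare every block, with no filesystem knowledge.
        a, b = self.disks
        return [i for i in range(len(a)) if a[i] != b[i]]


rng = random.Random(0)
n_blocks = 16
# Two "used" drives with unrelated leftovers; the offset guarantees no block matches by chance.
disk_a = [rng.randrange(1, 1000) for _ in range(n_blocks)]
disk_b = [value + 1000 for value in disk_a]

mirror = Mirror(disk_a, disk_b)
for block in range(4):  # the filesystem only ever writes blocks 0..3
    mirror.write(block, 0)

mismatches = mirror.scrub()
print(len(mismatches), "mismatched blocks:", mismatches)
```

In this model the four written blocks agree, while the twelve blocks the filesystem never touched still show up as mismatches in the scrub.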

Why resync RAID1 devices?

For the same reason you resync RAID5 devices or any other level (except RAID0). It reads all the data and compares the copies, or verifies RAID parity (in RAID 5 or 6). If a bit was flipped in any way (because the drive's memory suffered a spontaneous flip, because your cellphone and those of your 5 neighbours happened to interfere over this particular region of the platter, whatever), it will detect the inconsistency but won't be able to help you. If, OTOH, one of the hard drives simply reports "I cannot read that block", which is more probable with a failing drive, you have detected a failure early and reduced the time you run in degraded mode (counting from the drive failure, not from when you notice it). RAID won't help you if one drive fails and the other fails a month later, and you did not notice the first failure during that month.

RAID10

Now, for RAID10 all of the above holds. After all, RAID10 is just a clever way of saying 'I'm putting my two RAID1 devices in a RAID0 pair'.

Caveat:

This is all undefined behaviour. While I've checked on Linux using mdadm, other software RAID implementations may behave differently. Versions of the Linux kernel and/or mdadm tools other than the ones I'm using may also behave differently.

Torinthiel
  • 1
    Please provide a citation for `If you don't do it, there is a discrepancy between the drives and it's read, the RAID device will report failure of a drive`. I believe that statement is incorrect. At least provide an example of the error message such that it is possible to consult the source to verify under what circumstances it is produced. – kasperd Jan 27 '16 at 12:36
  • @kasperd You are right. I've just checked that it does not report errors, and even survives adding a third drive, at least with `mdadm`. I've rephrased slightly. – Torinthiel Jan 27 '16 at 13:07
  • 1
    That's better. Did you verify the statement about writing zeros? I think it doesn't write zeros but rather copy one of the disks to the other(s). – kasperd Jan 27 '16 at 13:09
  • 1
    `While this only touches a few well-known regions of drive, it reads before you write anything, thus putting you in danger.` In danger of what? I realize that the read may result in anything, but why would that result in some kind of danger for the user if (a) the information being read is not used anywhere and (b) a write is about to happen? – Vegard Jan 27 '16 at 14:25
  • 1
    @kasperd you are right, it copies the first device to the second one. A test on a `urandom`-initialized device with Linux mdadm shows that the first 80k remain different, as well as the last 48k. The latter is probably due to rounding of the RAID size to the block size. I've not tested with different device sizes, but the 80+48 is exactly the difference in size between the RAID device and the underlying block device. – Torinthiel Jan 27 '16 at 16:55
  • @Vegard in danger of swimming on uncharted waters. This is undefined behaviour, and 'software raid' the OP referred to could be Linux `mdadm` or BSD or anything else, each one could behave differently and behaviour can change between versions. Also your (a) is not correct, this information can be used by `mkfs` to check if it's safe to create an FS on the device. Not highly vulnerable to random data, but still used after being read. – Torinthiel Jan 27 '16 at 16:59
  • @Torinthiel Checking for the presence of a file system on the device sounds like it is only done to avoid accidentally wiping an existing file system. If you create a RAID-1 without synchronizing existing data between the devices, you have already written off any previously existing data on the disk. So in that case I wouldn't worry about the risk of mkfs incorrectly concluding the md device has no indications of a file system previously existing on the device. – kasperd Jan 27 '16 at 17:07
  • With the latest edit I consider this answer to be the best of the answers so far. So you get my vote for that. – kasperd Jan 27 '16 at 17:09
  • @Vegard I agree the risk is low (and the impact even lower). It's just taking random data from somewhere on the old disk, and compare it to some pattern. Could be an issue if for whatever reason you'd store image of an FS at that specific place, which is not impossible, but not probable either. And the worst that can happen is that you'd have to provide one extra parameter or answer to create your FS. Anyway this is getting long, if we are to continue I'd like to ask someone to move it to chat. – Torinthiel Jan 27 '16 at 17:36
  • 1
    One thing to consider is that usually during initialization, the RAID system will ALWAYS read disk A and copy it to disk B. Why? Since you can use the disk while it is initializing, you may have written data at block 100,000. Once the RAID init gets to that block, both A and B are already identical, so nothing happens. If it were instead zeroing blocks, it would wipe good data. Thus, once again, I see two reasons to ensure the blocks are identical: "it's always been done" and "so you can run a check later". I also question that check's usefulness. Reading is good; comparing? Not sure. – Michael Graff Jan 27 '16 at 17:50
5

Remember that RAID 1 is a mirror, and that RAID 10 is a stripe of mirrors.

The question is, on which disk in each mirror is the data valid? In a freshly created array, this cannot be known, as the disks may have different data.

Remember also that RAID operates at a very low level; it knows nothing of filesystems or whatever data might be stored on the disk. There might not even be a filesystem in use.

Thus, initialization in these arrays consists of the data from one disk in each mirror being copied as-is to the other disk.

This also means that the array is safe to use from the moment of creation, and can be initialized in the background; most RAID controllers (and Linux mdraid) have an option for this, or do it automatically.
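
A toy sketch of why background initialization is safe, assuming the sync copies from the first disk to the second as described above. The functions `resync_step` and `mirrored_write` are hypothetical, and a real implementation also has to handle writes racing with the sync, which this model ignores.

```python
def resync_step(source, target, position):
    """Copy one block from the first disk to the second; return the next position."""
    target[position] = source[position]
    return position + 1


def mirrored_write(source, target, block, data):
    """A normal array write goes to both copies, whether or not the sync has reached it."""
    source[block] = data
    target[block] = data


disk_a = ["a-junk"] * 8  # leftover contents of the first disk
disk_b = ["b-junk"] * 8  # different leftovers on the second disk

position = 0
position = resync_step(disk_a, disk_b, position)  # sync handles block 0
position = resync_step(disk_a, disk_b, position)  # sync handles block 1

# The array is already in use: a write lands ahead of the sync position.
mirrored_write(disk_a, disk_b, 5, "user-data")

while position < len(disk_a):  # the sync finishes in the background
    position = resync_step(disk_a, disk_b, position)

print(disk_a == disk_b, disk_b[5])
```

Because the sync copies rather than zeroes, the block written while the sync was still running is simply copied onto an identical block and the user data survives.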

Michael Hampton
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/34929/discussion-on-answer-by-michael-hampton-why-does-a-raid-10-device-need-to-be-ini). – Michael Hampton Jan 27 '16 at 17:53
2

Initial synchronization is needed because any differences between the mirrors would show up as errors during the periodic check.

And you should be doing periodic checks.
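
As a rough sketch of what such a check looks like on Linux md, the helpers below use the `sync_action` and `mismatch_cnt` files that md exposes in sysfs. The array name `md0`, the function names and the `sysfs_root` parameter are assumptions for illustration; writing to `sync_action` needs root, and many distributions already schedule this check for you.

```python
from pathlib import Path


def request_check(md_name, sysfs_root="/sys/block"):
    """Ask md to start a read-and-compare pass over the array (needs root)."""
    (Path(sysfs_root) / md_name / "md" / "sync_action").write_text("check\n")


def read_mismatch_count(md_name, sysfs_root="/sys/block"):
    """Return the mismatch counter that md reports for the array."""
    text = (Path(sysfs_root) / md_name / "md" / "mismatch_cnt").read_text()
    return int(text.strip())
```

A call such as `request_check("md0")` starts the pass; once it has finished, `read_mismatch_count("md0")` shows whether any differences were found. On an array that skipped the initial sync, a non-zero value does not necessarily point to a hardware problem.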

Simon Richter
  • 1
    I can see why periodic checks for readability of data can be useful. But what good does a periodic check for the replicas being identical do? Such checks can be useful if performed by a file system which checksums data. But at the RAID layer without file system knowledge you cannot know which of the two different replicas is good, you cannot know how the discrepancy happened in the first place, and you cannot know which file (if any) is affected. So it appears alerts about inconsistencies at this layer are mostly useless as there is nothing the administrator can do with the alerts anyway. – kasperd Jan 27 '16 at 10:11
  • As you need to read the data anyway, the cost of comparing it is minimal, but it can show you that one of the disks has developed an otherwise undetected problem (e.g. bad RAM in the drive's own cache). The administrator would then break up the array, manually look at the differences and choose which drive to replace. – Simon Richter Jan 27 '16 at 10:18
  • You should expand on that in your answer then. – kasperd Jan 27 '16 at 10:20
  • I know it's been many years, but this is the only valid reason I can see. I do not think it matters otherwise if the data is out of sync, as that data is by definition not written to yet, so the filesystem applied to the raid drive will never read from those blocks. Making sure the periodic checks pass from the start, though, makes this necessary. Thanks! – Michael Graff Oct 10 '19 at 08:40
1

Simply put, because two new disks are not expected to be perfect mirror copies of each other from the outset.

They need to be turned into perfect copies of each other.

In addition, initialization includes setting up the metadata superblock with information about the array configuration.

The /proc/mdstat file should tell you that the device has been started, that the mirror is being reconstructed, and an ETA of the completion of the reconstruction. Reconstruction is done using idle I/O bandwidth. So, your system should still be responsive, although your disk LEDs will also be showing lots of activity.

The reconstruction process is transparent, so you can actually use the device even though the mirror is currently under reconstruction.
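
For monitoring that progress from a script, a minimal sketch of pulling the numbers out of `/proc/mdstat` could look like this. The sample text is illustrative only (real output varies with kernel version and array layout), and the `sync_progress` helper and its regular expression are assumptions rather than an official parser.

```python
import re

# Illustrative sample only; real output varies with kernel version and array layout.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]
      [==>..................]  resync = 12.6% (132096/1048512) finish=0.3min speed=44032K/sec

unused devices: <none>
"""

PROGRESS = re.compile(
    r"(?P<action>resync|recovery|check)\s*=\s*(?P<percent>[\d.]+)%"
    r".*?finish=(?P<minutes>[\d.]+)min"
)


def sync_progress(mdstat_text):
    """Return (action, percent done, minutes left) for the first progress line, else None."""
    match = PROGRESS.search(mdstat_text)
    if match is None:
        return None
    return match["action"], float(match["percent"]), float(match["minutes"])


print(sync_progress(SAMPLE_MDSTAT))
```

On a live system the same function could be fed `open("/proc/mdstat").read()`; it returns `None` when no array is currently syncing.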

HBruijn
  • 2
    But **why** do they need to be perfect copies of each other? What could possibly break from the two being inconsistent in sectors that were never used by the file system? – kasperd Jan 27 '16 at 08:46
  • @kasperd RAID is implemented at a lower level than any file system. So the question becomes, what is the "file system" to which you refer. – Taemyr Jan 27 '16 at 10:26
  • @Taemyr I am not referring to any specific file system. Pick whichever you prefer and explain what would break by using it on a RAID-1 where the replicas were not in sync before initializing the file system. – kasperd Jan 27 '16 at 10:51
  • @kasperd There is *no* file system to break at the level RAID operates. – Taemyr Jan 27 '16 at 11:32
  • @Taemyr Then how would skipping the synchronization cause it to break? – kasperd Jan 27 '16 at 12:37
  • 1
    In my case, as the original poster, I don't care what file system. I know of no file systems that will read sectors that have never been written to, thus any indeterminate state of those unwritten sectors does not matter. – Michael Graff Jan 27 '16 at 13:40
  • I've been a bit too distracted with other stuff, but the why can be found in the "Scrubbing" section of `man 4 md`: *"Requesting a scrub will cause md to read every block on every device in the array, and check that the data is consistent. **For RAID1 and RAID10, this means checking that the copies are identical**."* Since RAID operates below the file system, it doesn't know which blocks are in use and contain real data and which don't. It can't be smart about the consistency check and acts pretty dumb: identical blocks mean consistent, non-identical blocks are an indication of issues. – HBruijn Jan 27 '16 at 14:21
  • @HBruijn So what you are saying is that without the initial synchronization, the scrubbing can flag a healthy RAID-1 as having problems. In that case I think it doesn't do any harm, but it means that you cannot rely on scrubbing to tell you about inconsistencies. – kasperd Jan 27 '16 at 17:21
  • I think that was my thought, yes. You need the initialisation for a simple, cheap and robust consistency check that is unaware of the actual data. – HBruijn Jan 27 '16 at 18:09
  • @HBruijn Then I recommend you update your answer to reflect that. Because without it, I think you fail to answer the question as it was asked. – kasperd Jan 27 '16 at 18:37
0

If you are using Linux LVM to create a RAID 1 (or 10) filesystem that you will immediately load with data, here's how you can avoid much of the unnecessary initialization I/O.

First create an ordinary linear (non-RAID) filesystem and load it with your data. Then convert it to a RAID filesystem with lvconvert. The mirror device will be initialized with your already-loaded filesystem data, so the only "unnecessary" I/O will be when the unallocated blocks in your already-loaded filesystem are copied. This is better than first copying every block from one uninitialized device to another and then writing your data to both devices. By serializing the two operations (loading the filesystem and then creating the mirror) you will also allow the disks to perform sequential I/O, which is much faster than the random seeking that occurs when writing to a RAID mirror pair that is still initializing.
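
A back-of-the-envelope model of the saving, under loudly simplified assumptions: the initial sync writes every block to the second device exactly once, and loading the data writes each data block once per device it lands on. The function name and the sizes are made up, and the model ignores the sequential-versus-random I/O effect mentioned above.

```python
def device_writes(total_blocks, data_blocks, raid_first):
    """Count block writes to the underlying devices under a simplified model."""
    sync_writes = total_blocks                      # full copy onto the second device
    copies_per_data_write = 2 if raid_first else 1  # mirror already exists vs. linear volume
    return sync_writes + copies_per_data_write * data_blocks


total, data = 1_000_000, 600_000  # made-up sizes, in blocks
raid_then_load = device_writes(total, data, raid_first=True)
load_then_convert = device_writes(total, data, raid_first=False)
print(raid_then_load, load_then_convert, raid_then_load - load_then_convert)
```

In this model the difference equals the number of data blocks: loading first means each data block reaches the second device through the sync itself instead of being written there a second time.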

Phil Karn