
First of all, thanks for reading, and sorry for asking something related to my job. I understand that this is something I should solve by myself, but as you will see, it's a bit difficult.

A small description:

Now

Storage => 1PB using DDN S2A9900 storage for the OSTs, 4 OSS, 10GigE network (Lustre 1.6)

100 compute nodes with 2x InfiniBand

1 InfiniBand switch with 36 ports

After

Storage => Previous storage + another 1PB using DDN S2A9900 or LSI E5400 (still to decide) (Lustre 2.0)

8 OSS, 10GigE network

100 compute nodes with 2x InfiniBand

Previous experience: transferred 120 TB in less than 3 days using the following command:

 tar -C /old --record-size 2048 -b 2048 -cf - dir | \
   tar -C /new --record-size 2048 -b 2048 -xvf - 2>&1 | tee /tmp/dir.log

So, the big problem here: back-of-the-envelope math says we are going to need about a month to transfer the data from the old storage to the new one. During this time the researchers will have to stand aside, and I'm personally not happy with this.
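
For reference, the arithmetic is easy to reproduce (a rough estimate, taking 1PB as 1000TB and assuming the rate of the 120TB run holds):

    # 120 TB in 3 days ~= 463 MB/s sustained
    # 1000 TB at that rate: 1000 / (120 / 3) = 25 days, i.e. roughly a month
    echo "scale=1; 1000 / (120 / 3)" | bc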

I mention the InfiniBand connections because I think there may be a chance to use them: 18 compute nodes (18 x 2 IB = 36 ports) could move the data from one storage to the other, along the lines of the sketch below. I'm still trying to figure out whether the IB switch will handle all the traffic, but even if it saturates, it should still be faster than 10GigE.
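
Something like this is what I have in mind (only a sketch; the node names, mount points, and per-directory split are assumptions, since I don't yet know how evenly the data divides):

    # Fan one tar pipeline out to each of the 18 nodes, one subtree per node.
    # Assumes /old and /new are mounted on every node and that the top-level
    # directories dir1..dir18 split the data roughly evenly.
    for i in $(seq 1 18); do
        ssh node$i "tar -C /old -cf - dir$i | tar -C /new -xf -" &
    done
    wait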

Also, running Lustre 1.6 and 2.0 clients on the same server works quite well, so there is no need to go through 1.8 and upgrade the metadata servers in two steps.
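
For example, both filesystems can be mounted side by side on the same client (the NIDs and fsnames below are placeholders):

    # Lustre 1.6 (old) and Lustre 2.0 (new) mounted on one transfer node
    mount -t lustre 10.0.0.1@o2ib:/oldfs /old
    mount -t lustre 10.0.0.2@o2ib:/newfs /new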

Any ideas?

Many thanks

Note 1: Zoredache, we can divide it into two blocks: (A) 600TB and (B) 400TB. The idea is to move (A) to the new storage, which is formatted with Lustre 2.0, then reformat the space where (A) was with Lustre 2.0, move (B) onto it, and extend that filesystem with the space freed by (B).

This way we will end up with (A) and (B) on separate filesystems, 1PB each.

Marc Riera
  • Is there no way to break up your data into smaller chunks so that you can move it piecemeal without interrupting all activities for the entire transfer time? – Zoredache Apr 03 '12 at 21:37
  • I don't understand why you are moving the data; if the DDN devices are so scalable, why move it at all? If this was a NetApp you'd add the extra PB and move on. – tony roth Apr 03 '12 at 22:17
  • Tony, we need to reformat the filesystem; Lustre 1.6 is no longer supported. We are lucky to have 1 extra PB to play with. – Marc Riera Apr 03 '12 at 23:09
  • By the way, Tony, I see you only ask and answer Windows-related questions, so why did you read this one? A polite question, I don't want to be rude. :) – Marc Riera Apr 03 '12 at 23:15
  • My naive guess is similar to what it sounds like you're thinking: dedicate some fraction of your compute nodes to transfer the data. Figuring out how to partition the data among the computes is challenging unless you already have a good idea of how it's distributed, though. But are you mounting Lustre over ethernet? (As it looks like the OSS's have only 10GigE.) If you could get away with it, I'd be tempted to take a short downtime, stuff IB cards in your OSS's (steal from computes if needed) and transfer everything over IB for the better bandwidth. – ajdecon Apr 04 '12 at 02:50
  • 1
    Windows pays my bills, but I do enough with storage to deal with these type of issue quite frequently, just not with lustre based FS's. As far as the IB based communications is concerned why not purchase some mellanox connectx3 qdr adapters they are quite cheap now days. – tony roth Apr 04 '12 at 04:59
  • Tony, NetApp isn't used in high-performance computing. HPC shops have a completely different set of requirements for their storage, and NetApp is aimed at corporate computing. – Basil Apr 04 '12 at 13:50
  • @Basil Wrong, check out the E5400. – tony roth Apr 04 '12 at 17:13
  • I'd never seen that before. Anyway, the question wasn't "what should I have bought", it was "how can I migrate". – Basil Apr 04 '12 at 17:23
  • @Basil I don't think I told him to buy anything other than QDR IB cards. You just questioned NetApp being used in an HPC situation, and I pointed out the E5400, which he is/was considering already. – tony roth Apr 04 '12 at 20:00

1 Answer


The goal is to make every layer between the old storage and the new storage faster than the maximum read speed you can get from your old machine. Its specs claim 6 GB/s sequential (which this workload should be). That means the minimum possible time to move 1 PB would be in the realm of 46 hours (1 PB / 6 GB/s is roughly 167,000 seconds), if you can actually get the advertised speed.

When you used tar to move 120 TB in 3 days, you must have averaged just shy of half a GB per second, considerably less than the 6 GB/s the specs claim. Your real-world ceiling probably lies somewhere in between.

First, tar might be your problem. I'm a storage guy, not a Unix guy, but as far as I know it can limit your throughput based on processor speed, since each tar stream is a single process. If you stick with this methodology, you can shrink the migration window by increasing the number of nodes running the migration and having them work on different parts of the dataset, as sketched below. Keep adding nodes until the old machine can't serve files any faster.
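
As a sketch of what I mean (the eight-way split, paths, and shard file names are assumptions, not something tuned to your layout):

    # Shard the top-level directories round-robin into 8 lists (GNU split).
    find /old -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | split -n r/8 - /tmp/shard.
    # Each migration node then streams its own shard, e.g. shard.aa:
    while read d; do
        tar -C /old -cf - "$d" | tar -C /new -xf -
    done < /tmp/shard.aa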

Second, make sure that you're able to write from your migration node to your new storage as fast as you can read off the old storage. This might mean tweaking some settings on the new storage (especially if it has an old-fashioned mirrored write cache) as well as ensuring there are no network bottlenecks.
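
A quick way to sanity-check raw streaming rates on both sides (the paths are placeholders; run it while the filesystems are otherwise quiet):

    # Read rate from the old storage: pick a large existing file
    dd if=/old/some/large/file of=/dev/null bs=1M count=10000
    # Write rate to the new storage, bypassing the page cache
    dd if=/dev/zero of=/new/ddtest bs=1M count=10000 oflag=direct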

Lastly, and this might be a bit far-fetched: if you can take the downtime and this box is serving LUNs over FC, you could insert a storage virtualization device into the data path, which would let you keep using the storage, albeit more slowly, while you do the migration. IBM's SAN Volume Controller, FalconStor's virtualization appliance, and some HDS storage arrays are all capable of automating data migration in the background without interrupting host access. None of them will be as fast as what you're used to, but they would let you keep working while you migrate, after the brief interruption needed to point the nodes at the new storage heads.

It's probably not worth buying one since you won't be using it after you finish the migration, but you might be able to borrow or rent one.

Basil
  • I'll take a look at the virtual storage; I'm not sure it can be applied here, but it sounds quite promising. Another thing: if I'm not wrong, the slow speed of the 120TB migration was due to the huge number of small files. Opening and closing files for reading takes a bunch of time, and these people work with genomic data, lots of small files. :-/ – Marc Riera Apr 07 '12 at 15:45