16

This is a really tricky one, and to some extent it's not a technical problem, and might not belong here, but

Server Fault is for system administrators ... who manage or maintain computers in a professional capacity

And I do.. and this is one of my tasks.. Anyway.

Imagine you had 5,000+ double-sided pages of A4. Company data, all business critical.
You need to back it up, somehow. Proposed solutions so far are:

  1. PDF -> Online storage
  2. PDF -> DVD / BluRay / Tape
  3. PDF -> Portable HDD / SSD / Flash drive.
  4. Buy/Lease/Hire/'Steal' a big photocopier, and make copies.
  5. ???

Immediate problems with the above:

  1. What if the storage partner goes bust?
  2. DVDs do rot over time. Tapes similarly.
  3. These too, break over time.
  4. Expensive. Slow. Heavy. Not Tree Friendly.

The Question(s):

What is the gold-standard for long-to-medium term data preservation and archiving? Have you solved a similar problem in the workplace?

After the initial loading, there is some requirement to add to the collection roughly 100 pages a month. Retrieval should be possible, easily, but probably is infrequent.
Ideally I'd like to guarantee that the solution will be workable long after I have left the company, and that it won't take a massive amount of effort to maintain, so storing a great many DVDs is neither ideal nor a good long-term solution.

While just making paper copies is certainly the easiest, it's not the most environmentally friendly, not by a long way. It's also not very manageable: paper is difficult to search and index, heavy, and difficult to store physically.

I quite like the idea in principle of having everything stored electronically, but the actual mechanism of doing this needs to be transparent and easy. I really don't want to be responsible for this forever and a day, supporting office users as they cock it up and lose documents. I also don't want to be reliant on a single storage vendor. What if Dropbox (we have an online backup solution ATM, but it isn't Dropbox) were to go bust, or otherwise experience a catastrophic event? How many businesses using their services would be up the creek, sans paddle?

There's some budget flexibility here, but I suspect anything that costs more than our current online backup (about 2,500 USD/year) would be viewed less than favourably compared to just putting it in a shoebox under a bed, which is no doubt what would happen if I did nothing and resigned tomorrow.

Any ideas?

-Edit-

The reason for doing this is twofold.

1) To provide a sensible, secure backup of business-critical paperwork in the event the office burns down.

2) To satisfy data archiving laws WRT UK tax law for businesses and so on.

Edit 2:

Having some mechanism for indexing the documents would be bloody useful too..

Tom O'Connor
  • I knocked up a quick script for doing PDF->Txt with ghostscript and gocr. Idly considering squirting everything into a SQLite db to make some interesting index for the data. – Tom O'Connor Mar 02 '11 at 13:51
  • Redundancy (of digital media) and maintaining redundancy throughout the years may be your best friend. – Vortico Sep 25 '12 at 13:23
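
The SQLite index idea from the comment above might look roughly like the sketch below. It assumes the OCR step has already produced plain text per document; the table layout and the `build_index` and `search` helpers are hypothetical, not the script mentioned in the comment.

```python
import re
import sqlite3


def build_index(conn, docs):
    """Index OCR text. `docs` maps a document id to its extracted text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT, doc_id TEXT, PRIMARY KEY (word, doc_id))"
    )
    for doc_id, text in docs.items():
        # Lower-case and split into alphanumeric tokens; one row per distinct word.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        conn.executemany(
            "INSERT OR IGNORE INTO words VALUES (?, ?)",
            [(word, doc_id) for word in words],
        )
    conn.commit()


def search(conn, word):
    """Return the ids of all documents containing `word`, sorted."""
    rows = conn.execute(
        "SELECT doc_id FROM words WHERE word = ? ORDER BY doc_id",
        (word.lower(),),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a real index
    build_index(conn, {"0001": "Invoice for widgets", "0002": "Tax return 2010"})
    print(search(conn, "invoice"))
```

A single-file SQLite database has the advantage that the index can be archived alongside the PDFs and rebuilt from them if it is ever lost.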

7 Answers

6

There are specific systems that internally use DVDs and migrate the data to new media every so often. Look up digital preservation.

Since the storage requirements rise pretty quickly, it is advisable to switch to a newer, bigger type of media every few years anyways.

Assuming you get the data in paper form, you need to:

  1. List the data at mail entry. This may mean giving each sheet a unique barcode.
  2. Scan it. Use the barcode identifier as filename. Archive the paper.
  3. Archive the data. Put the data on a revision secure archiving system. A fileserver will not be good enough because something might happen to the files if they are write accessible.
  4. Make it read accessible for other systems.
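
Steps 2 and 3 can be sketched as follows. This is only an illustration: the `archive_scan` helper, the `.pdf` naming and the `MANIFEST.sha256` file are assumptions, and clearing the write bits on a file is not a substitute for a real revision-secure archive system.

```python
import hashlib
import os
import shutil
import stat
from pathlib import Path


def archive_scan(scan_path, barcode, archive_dir):
    """Copy a scan into the archive under its barcode id and record a checksum."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / f"{barcode}.pdf"
    if dest.exists():
        # Never overwrite: an archived document must not change afterwards.
        raise FileExistsError(f"{dest} is already archived")
    shutil.copyfile(scan_path, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    # Drop write permission as a basic guard against accidental edits.
    os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    # Append "<sha256>  <filename>" so later checks can detect silent changes.
    with open(archive_dir / "MANIFEST.sha256", "a", encoding="utf-8") as manifest:
        manifest.write(f"{digest}  {dest.name}\n")
    return digest
```

The manifest uses the same two-space layout that `sha256sum -c` reads, so it can be checked without this script.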

In the customer's case, it is all the invoices of a large organization that have to be transferred to an online system (SAP). The archive storage has gone through several iterations by now. Currently they are moving to Blu-ray.

On the other hand, nowadays everything goes onto disks, so maybe something along these lines would be your way to go: http://www.eurostor.com/german/iTernity.D.php

Posipiet
  • Keep in mind that most DVD-R (et al.) media only lasts a few years before degrading. The expensive "archival" stuff lasts longer if kept according to the instructions. – Chris S Mar 01 '11 at 16:26
  • So do most disks, servers, file systems, or document formats. Archiving means moving the data. Try reading a pdf in 20 years. Do you remember what the standard was 10 years ago? We have nothing that comes close to paper, really. Except for copy and paste... – Posipiet Mar 01 '11 at 16:32
  • Pure text documents are pretty readable still. – Bart Silverstrim Mar 01 '11 at 16:36
  • There are Free tools to read PDFs. As long as they're "plain vanilla" I wouldn't be too worried. TIFF is a good option, too. – Evan Anderson Mar 01 '11 at 16:39
5

Keeping the data in a format like PDF is probably safe, because there are Free tools to read it. The volume of data you're talking about is fairly small (1,200 pages / year) so even at a 300 dpi scan resolution you're only talking about tens of gigabytes per year.
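
As a rough check on that estimate, the arithmetic for uncompressed scans can be sketched like this. The page size (A4, 8.27 x 11.69 inches), the colour depth and the `estimate_bytes` helper are assumptions for illustration; compressed PDFs will normally be far smaller.

```python
def estimate_bytes(pages, dpi=300, bits_per_pixel=24,
                   width_in=8.27, height_in=11.69):
    """Uncompressed size in bytes of `pages` scanned page images."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return pages * width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    colour = estimate_bytes(1200)                  # 24-bit colour
    bilevel = estimate_bytes(1200, bits_per_pixel=1)  # black and white
    print(f"colour:  {colour / 1e9:.1f} GB per year")
    print(f"bilevel: {bilevel / 1e9:.1f} GB per year")
```

For 1,200 page images a year this gives about 31 GB of raw 24-bit colour data, which supports "tens of gigabytes" as an upper bound rather than a typical figure.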

The physical storage device problem is never going to go away, though. Whatever media you use to store electronic data (tape, optical, etc.) will eventually need to be migrated to newer media. Plan and budget for "kicking the data down the road" to new formats as new formats replace older formats.

I'd probably look at optical media as a first choice simply because you have so little data. I'd also plan on burning 3x duplicates of everything and refreshing the media every 2 - 3 years.
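
Keeping three copies only helps if somebody checks them. A minimal sketch of such a check, assuming each copy is mounted as a directory and that a manifest of SHA-256 checksums was recorded at burn time (the `find_damaged` helper and the manifest dictionary are hypothetical):

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged(manifest, copy_dirs):
    """Return {copy_dir: [file names that are missing or fail their checksum]}.

    `manifest` maps file name to the SHA-256 hex digest recorded at burn time.
    Copies with no problems are left out of the result.
    """
    damaged = {}
    for copy_dir in copy_dirs:
        bad = []
        for name, expected in sorted(manifest.items()):
            path = Path(copy_dir) / name
            if not path.is_file() or sha256_of(path) != expected:
                bad.append(name)
        if bad:
            damaged[str(copy_dir)] = bad
    return damaged
```

Running this at each refresh shows which disc to discard and which surviving copy to use as the source for the new media.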

If optical media is too small I'd go with LTO tape and refresh the media every 4 - 5 years. That's going to be pretty expensive, though, for such a small amount of data.

Evan Anderson
  • DVD? Or Blu-ray? Is BR a sensible format for data yet? – Tom O'Connor Mar 01 '11 at 16:41
  • 1
    @Tom O'Connor: They seem the same to me. I'd look at cost to make that decision. There are "archival grade" Blu-Ray blanks out there now, so from a technical perspective it seems like a viable format. (Manufacturers saying that Blu-Ray media has a "rated" life of 200 years doesn't give me any more confidence than those who said that archival DVD media had a 100 year "rated" lifetime...) – Evan Anderson Mar 01 '11 at 16:44
  • I have some Kodak GOLD CD-Rs from about 1998, or so, and they're still readable. I also have some free ones, where the dye layer has separated, and they're screwed. – Tom O'Connor Mar 01 '11 at 16:49
  • I'm not suggesting that there isn't a difference in archival versus non-archival media. I'm simply saying that Blu-Ray archival grade media versus DVD archival grade media don't seem a lot different to me. Comparing archival grade versus "spindle of 100 for $20.00" media is a whole different story. – Evan Anderson Mar 01 '11 at 16:52
  • One item not mentioned was Solid State storage (USB thumb drives): Bigger than optical, smaller than LTO. USB is likely to be around for at least another decade or two, and a 16GB or 32GB thumb drive (or two, or three) is pretty cheap in terms of archiving cost. Since it's going to Write-Once-Read-Many you don't have to worry about the SSD cells wearing out, so you could theoretically keep sticks for 5 or more years in a fireproof vault. – voretaq7 Sep 25 '12 at 19:11
3

Our solution: Scan to PDF -> Backup to Tape

We have a document scanner that does ~30 pages/min and produces OCRed PDF files. We back those up to tape (LTO4 specifically), which has a shelf life of 50 to 100 years (finding a tape drive might be difficult in that time frame, but there are still data recovery places around that will recover 8" floppy disks).

Chris S
  • 2
    I had to google 8" floppy.... – Holocryptic Mar 01 '11 at 16:18
  • I do retain my backup tapes from Mac OS 7.5. But the disk broke, and the backup program's media is lost. I did manage to reinstall the Mac, but I can't read the tape because I don't have the backup program. And frankly, I don't even remember its name. Yes, the tape may last 100 years. But the reader doesn't. – Posipiet Mar 01 '11 at 16:35
  • @Holocryptic: Only a few months ago I threw out an unopened box of Verbatim hard sectored 8" floppy disks. – user9517 Mar 01 '11 at 16:36
  • 1
    @Holocryptic: NSFW! NSFW!! – Bart Silverstrim Mar 01 '11 at 16:37
  • @Posipiet, I think I covered the fact that drives don't last forever but there are companies that specialize in recovering data from just about any commonly used media. – Chris S Mar 01 '11 at 17:59
3

I think Amazon's new Glacier service is an interesting offering in this space.

Amazon Glacier is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. With Amazon Glacier, customers can reliably store large or small amounts of data for as little as $0.01 per gigabyte per month, a significant savings compared to on-premises solutions.

ewwhite
2

Step one, Backup: OCR the documents, and then re-arrange all the words into a series of novels about the Catholic Church, Opus Dei and Templars. You should have enough input data for about 10 novels, and about one more every year or so forever. Maintain a lookup table which holds each word's original location in the source documents (in source order), and its final location in the novels; store duplicate words in one entry in the table. Secure a publishing deal and get millions of the novels published. Use the revenue from the book sales to fund the OCR and word rearrangement operation. Shred the original documents and sell them as hamster bedding. It may occasionally be necessary to place purchase orders for crucifixes, anti-matter, or plane tickets to exotic locations, if you find you are missing vocabulary from your input documents.

Step two, Recovery/ access: There is no need to store copies of the data - all you need is your lookup table and a second hand bookshop.

As the lookup table is your single point of failure, you will still need to back this up. Thanks to the Huffman encoding scheme employed, this will be quite small compared to your input documents, so it could probably be copied to DVD. For offsite backup, sit in front of a log fire and read out the lookup table, while videoing yourself. Place your video performance piece on the fusion of art and technology into the Tate Modern Gallery, on permanent display.

Duncan Lock
1

Bit too early to buy but it seems like HDS have come up with a permanent data storage mechanism based on quartz - take a READ.

Chopper3
  • 1
    I've heard the permanency claim so many times that I'm never going to believe it and none of us are going to live long enough to ever see it proven. *Predictions* of permanency are absolutely worthless and invariably become proven wrong. – John Gardeniers Sep 25 '12 at 12:01
-2

I have to put forward Humyo.com (bought by Trend Micro, whose middle name is security)

They encrypt all user data and their servers are housed in the Bank Of England in a vault.

Pretty secure :)

benhowdle89