The conventional wisdom is that when you store data long-term, you should convert it into a format that will last for decades. It should be stored on NTFS or FAT, and the file format should be one that will still be around decades from now.

For example, the usual advice is not to leave files in docx (Word format) because you don't know if Microsoft will be around 10 years from now. Formats such as txt are suggested instead.

HOWEVER, most of my documents are in Microsoft Word and Excel, and the formatting of each document is extremely important to me. It is what makes the document readable. It's critical. If everything is in the same plain text, the document becomes impossible to read.

What do you suggest I do? I want to burn my long-term files to Blu-ray M-Discs, but which format do you recommend I save them in?

  • I can basically guarantee you that word and excel files will be readable in 10 years. Docx is XML-based for that very reason. I would also stick with usb rather than optical discs, which seem to be rarer each year. – dandavis Oct 13 '20 at 18:33
  • @dandavis: *"I can basically guarantee you that word and excel files will be readable in 10 years."* - readable probably. But will they look exactly the same? Maybe not, that's not the focus of Docx and interpretation of the file is up to the application. Given that specifically the look is relevant Docx might not be a good idea. Also, just because something is XML doesn't make it future-proof. It depends on how granular the specification actually is and not if the format is XML or JSON or some binary. – Steffen Ullrich Oct 13 '20 at 18:47
  • there is Rich Text Format, too. If you're going to go through the trouble of converting all those files, it's not a bad choice as it's human readable... or even HTML? (personally I wouldn't worry about lack of support for 10 year old .doc or .xls files... XML is actually pretty good for forwards/backwards compatibility.) – pcalkins Oct 13 '20 at 21:08
  • PCalkins, I considered RTF but maybe I am being unreasonable but I would love to keep all the styles that I created in Word. But, you'll probably say that the styles are proprietary for Microsoft, right? You are right, of course. I guess I'm hoping that there is a way that I can keep all the styles that I spent years creating and refining. They look beautiful and without a doubt, make readability exponentially easier. Headers separate topics so you can quickly move to the spot in the document that refers to what you're looking for, right? – QuietInMontana Oct 15 '20 at 15:58
  • PDF/A and PDF do not work because they are basically a screenshot of your document. They will preserve all the nice styles that I have, but I will not be able to manipulate them, say, 30 years from now. Maybe 30 years from now there will be word processors that can determine the styles via OCR. That may or may not be true. – QuietInMontana Oct 15 '20 at 16:01
  • Honestly I don't think you'll have a problem. The commodore-64 came out in 1982. There's emulators for it still today... so if you have documents stored in Print Shop format, you can still use them. The trick is actually keeping your archive format viable. (physically) – pcalkins Oct 16 '20 at 17:28
  • Wow, commodore 64's...old times. Emulators are kinda' clunky though, no? If I had say 30,000 documents, would I be able to use the emulator to convert the documents to something modern? The emulator would allow me to open the document but would I be able to easily convert the documents? (I'm obviously not a computer guy.) – QuietInMontana Oct 16 '20 at 20:30
  • I suppose you could with some work... I don't know of anyone who's made a converter from C-64 print shop format to .doc. But amazingly enough you can print them from the C-64 emulator. For .doc and .xml files, though, it would be easier to convert because of the XML structure. There are currently many libraries for reading/writing .xls files, for instance. I think there's an Apache POI for .doc files too, though I've never used it. – pcalkins Oct 19 '20 at 17:56
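To illustrate the point made in the comments that DOCX is XML-based: a .docx file is a ZIP archive whose parts are XML, so the text and the paragraph style names can be read without Word. The following is only a minimal sketch using the Python standard library. It assumes the main document part is stored at `word/document.xml`, which is where Word normally writes it, and the names `extract_paragraphs` and `make_sample_docx` are hypothetical helpers invented for this example. The sample built in memory is a stand-in, not a complete file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return a list of (style_id, text) pairs, one per paragraph.

    docx_file may be a path or a binary file-like object.
    style_id is None when a paragraph has no explicit style.
    Assumption: the main part lives at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The style reference sits in <w:pPr><w:pStyle w:val="..."/></w:pPr>.
        style = para.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # The visible text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs


def make_sample_docx():
    """Build a tiny stand-in archive in memory (hypothetical test data)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        "<w:r><w:t>Archive plan</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Keep a PDF/A copy too.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for style_id, text in extract_paragraphs(make_sample_docx()):
        print(style_id, "|", text)
```

This recovers only the style names and the text, not the exact rendering. How a style such as `Heading1` looks on the page is still up to whatever application interprets the file, which is the concern raised above about DOCX not guaranteeing an identical appearance.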

1 Answer

The first hit for me when searching for "long term document storage format" is Recommended Preservation Formats for Electronic Records from the Smithsonian Institution Archives. It contains a table of recommended formats which clearly shows PDF/A and PDF as the preferred formats for text/word processing applications. Similar recommendations appear in later search results from other archives and libraries, and these institutions likely know best.

It is actually no surprise that PDF/A is chosen - because it was actually designed for this purpose. The "/A" in PDF/A stands for Archiving, i.e. this is a version of PDF specifically designed for long-term archiving while still preserving the original formatting.

Steffen Ullrich
  • I saw the exact same site. The problem, like I said in another comment, is that PDF/A and PDF are just a screenshot of your document, and you won't be able to manipulate that document if you want to. You'd first have to convert it to a word processor format. For example, 30 years later, what if you want to add the text to the bottom of another document? Do you think OCR will improve enough that it will be able to recognize formats and styles? – QuietInMontana Oct 15 '20 at 16:04
  • @QuietInMontana: Your question clearly focuses on preserving the formatting for readability, which is exactly the point of PDF/A. That's why my answer focused on PDF/A. But if you read the recommendations from the Smithsonian Institution Archives, they mention RTF, TXT and XML with schema (i.e. DOCX, ODT) as other acceptable formats. While these don't focus on keeping the exact formatting (i.e. preserving kerning, implicit line and page breaks etc.), you can use them in addition to PDF/A in order to maybe have something which looks similar in the future and can be edited more easily. – Steffen Ullrich Oct 15 '20 at 17:09
  • The problem is that I personally have 50,000 documents. How do I convert DOCX to RTF and maintain the format sufficiently? – QuietInMontana Oct 15 '20 at 19:00
  • @QuietInMontana: How to automatically convert DOCX to RTF is a completely different question and not an information security question. First, I've explicitly mentioned DOCX as an acceptable format too. Second, there are ways to programmatically control Office to do the conversion, but details about this are off-topic here. – Steffen Ullrich Oct 15 '20 at 19:03
  • Steffen, you seem offended and I think you misinterpreted the tenor of my response. I've actually read a few articles that said NOT to use DOCX even though I desperately want to. I use DOCX all the time so I would love to keep that format. The Smithsonian is a good source, but they may be wrong. I'm sitting in front of my computer, trying to think objectively. If I want to access a file 30 years from now, I have a feeling that DOCX might not be around. That's because I have a bad feeling about Microsoft. I think they are at about the stage IBM was about 20 years ago. Personal opinion. – QuietInMontana Oct 16 '20 at 20:25
  • @QuietInMontana: I'm not offended at all. I only want to focus on your original question, which was about choosing a format for long-term storage with a focus on readability. It was not about future editing and it was not about how to convert files from one format into another. This is not a discussion forum but a strict Q+A site. If you have additional questions please ask these as new questions and don't broaden or refocus the original one in comments. – Steffen Ullrich Oct 16 '20 at 20:45
  • Steffen, has StackExchange given a directive to make the questions and answers more strict, without any superfluous dialogue in the last few months? It makes this website so cold and not fun. – QuietInMontana Oct 17 '20 at 18:58
  • @QuietInMontana: If you feel that you are being treated unfairly or have questions about how the site is supposed to work, please take them to security.meta.stackexchange.com. This question is not the place to discuss in depth how this site currently works or should work. – Steffen Ullrich Oct 17 '20 at 19:26
  • This is what I mean. I was just politely asking your opinion and if StackExchange management has issued a mandate. You didn't answer the question even though it's in the comment section. If I go to the security.meta site, it would take me literally 30-60 minutes to ask this question as I will have to provide background information but right now, you know exactly what I'm asking. – QuietInMontana Oct 17 '20 at 19:42
  • @QuietInMontana: No SE management is instructing me to do anything. I myself don't like it if the OP diverts too much from their own question. This is no different from what I've been doing for years. So far it works well for me and helps to keep the discussions on-topic. – Steffen Ullrich Oct 17 '20 at 19:55
  • Steffen, it isn't just you, but in the past few months I've noticed that moderators are making people stick strictly to asking questions with no diversions. Answers also have to be concise and to the point. No sociable interactions allowed. Maybe I'm just hallucinating. But the lack of social interaction makes it less fun and I think it deters people from helping. MHO. – QuietInMontana Oct 17 '20 at 20:03