How to know whether a textfile has been edited or tampered with?

Question

Is it possible to know whether a textfile, e.g. in XML format, has been edited or tampered with over time?

The context to my question follows:

I am a scientist in industry using a technology called 'mass spectrometry (MS)'. MS is an analytical technique used, e.g. in forensic analysis to determine whether a particular compound is present in a sample (e.g. drug of abuse in blood or urine).

Mass spec. data files are usually stored in flat-file format to the instrument vendor's private binary specification: their software can process them, but nothing else can. However, open standards for MS data exist, and most vendors support export to at least one open specification. These open standards are mainly XML-based these days (e.g. mzML). They allow processing with open-source applications, and they allow long-term storage (> 10 years) of the data in a format that doesn't require us to maintain an archived computer, its OS (or VM) and the processing software for long periods.

The vendor binary format provides at least some security against data tampering; the XML formats do not. Hence the issue: the open formats are very useful for providing access to data over archival timescales, but security is a problem.

You could calculate hashes of the files and keep them in a secured database (with backups of the originals). Then if you ever suspect tampering you can simply recalculate the hashes and compare, then replace with the backups if required. — Jonathan Gray, Jan 09 '16 at 13:20
Who are you worried about tampering with them? What is your threat model? — iAdjunct, Jan 09 '16 at 13:42
*The vendor binary format provides at least some security against data tampering* - I am pretty certain that it does not. Just because *you* can't read and edit it when you open it with a text editor doesn't mean nobody else can reverse-engineer the format and build an editor for it. — Philipp, Jan 09 '16 at 14:18
@philipp is correct - at best, this is "security by obscurity" and it's no protection at all against anyone with rudimentary knowledge, a hex editor and a modicum of patience. — Shadur, Jan 09 '16 at 20:20
@JonathanGray - assuming that the original files aren't that large, how is your hash solution any better than just storing a backup of the data? — Neil Smithline, Jan 09 '16 at 21:02
@iAdjunct I presume the OP is worried about falsified test results. When you're dealing with drug testing, that's a legitimate concern - imagine what'd happen if someone skewed the data of a competitor for high-paying job, making it look like they're a junkie! — etherealflux, Jan 09 '16 at 21:06
Uhm, read it before and after. If it's different, then it's been edited. If not, it's the same. — PyRulez, Jan 10 '16 at 01:44
You made a typo: vendor binary format provides **zero** security against data tampering — Lie Ryan, Jan 10 '16 at 02:32
@NeilSmithline Because the hashes could be sent for verification instead of entire files. — Jonathan Gray, Jan 10 '16 at 04:53
As our [help/on-topic] says, "Security is a very contextual topic: threats that are deemed important in your environment may be inconsequential in somebody else's, and vice versa. [...] To get the most helpful answers you should tell us: what assets you are trying to protect; who uses the asset you're trying to protect, and who you think might want to abuse it (and why); what steps you've already taken to protect that asset; what risks you think you still need to mitigate". I encourage you to edit the question to add this information, so that we can provide you the best quality answers. — D.W., Jan 11 '16 at 04:41
@philipp makes an excellent point. The first thing that sprang to my mind was "given the plain text XML and the binary, it won't take me long at all to reverse engineer the proprietary file format". Unless they are actually encrypting, it should be straightforward. At most, they will tack on some identifying header to each value (https://en.wikipedia.org/wiki/Type-length-value). I fear that you would have to contact each vendor individually and, even then, don't expect them to disclose details of their "secret sauce"; at most, I would expect vague reassurances of security, with no detail. — Mawg says reinstate Monica, Jan 11 '16 at 09:31
You might want to look at a software product specifically designed for storing and managing laboratory data, such as a LIMS, ELN (electronic lab notebook) or SDMS (scientific document management system) - these are often used within quality systems that have to meet regulatory standards such as GMP, so the vendors should be well versed in what those standards expect and how to meet them. — nekomatic, Jan 11 '16 at 09:59
Thank you for all of the useful comments. The issue is compliance with regulatory agencies' data security requirements. Those agencies may want to review any aspect of the development of a pharmaceutical compound, and data integrity is high on their agenda, and rightly so. — Drew Gibson, Jan 11 '16 at 21:57
If this is for pharma I strongly suspect you should hire in some professional expertise on regulatory compliance - I assume your employer is not actually a pharma company otherwise you'd already have that in house? — nekomatic, Jan 12 '16 at 11:50
This is a commercial solution, but they probably tick all your boxes : proof of integrity and time, auditability, long term solution... [www.guardtime.com](http://www.guardtime.com) — user47516, Jan 14 '16 at 11:54

score 81 · Accepted Answer · edited Jan 11 '16 at 19:04

The default solution would be to use cryptographic signatures. Have every technician generate a PGP keypair, publishing the public key and keeping the private key secure.

When a technician completes an analysis, they sign the result file with their private key. Anyone who wants to verify the file can then check the signature using the technician's public key. If anyone changes the file, the signature will no longer be valid.
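To make the sign-then-verify flow concrete, here is a minimal sketch that swaps PGP for textbook RSA with toy parameters so that it runs with the Python standard library alone. The primes, the sample XML and the function names are assumptions chosen for illustration; a key this small is trivially breakable, and real signing would go through GnuPG or a vetted cryptographic library.

```python
import hashlib

# Toy parameters, assumed for illustration only: two known Mersenne primes.
# Real keys are generated by GnuPG or a vetted library, never by hand.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest_as_int(data: bytes) -> int:
    """Hash the file contents and reduce the digest into the range of N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """Technician side: requires the private exponent D."""
    return pow(digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Verifier side: requires only the public values N and E."""
    return pow(signature, E, N) == digest_as_int(data)


original = b"<mzML><run id='sample-001'/></mzML>"
signature = sign(original)
tampered = original.replace(b"sample-001", b"sample-002")

print(verify(original, signature))   # signature matches the untouched file
print(verify(tampered, signature))   # one changed byte breaks the match
```

The point of the sketch is the asymmetry: `verify` never touches `D`, so publishing `N` and `E` lets anyone check a file without being able to produce a new valid signature.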

Security consideration: Should a technician's private key become known to someone else, that person can change the files and also replace the signature with one that is valid. This problem can be mitigated by having multiple people sign each result file. An attacker would then need all of their keys to replace all signatures with valid ones.

Alternative low-tech solution: Print out each result file, have the technician sign it the old-school way (with a pen) and deposit the file in a physically secure archive.

By the way: Do not assume that the vendor-specific binary format provides any more security against tampering than XML does. Just because you can't read and edit it when you open it with a text editor doesn't mean nobody else can reverse-engineer the format and build an editor for it.

edited Jan 11 '16 at 19:04 by Monty Harder
answered Jan 09 '16 at 14:27 by Philipp

Vendor-specific binaries can be anywhere from really easy to change (there is plaintext, just surrounded by word stuff) to really hard (if they use cryptography, as this answer suggests you do). You probably can't know without trying (unless it's open source). – PyRulez Jan 10 '16 at 01:46
It is VERY unlikely for the vendor binary to include cryptography. If it did, it would have been heavily advertised as a selling point, since it costs money to implement. – Nelson Jan 11 '16 at 05:19
To mitigate against leaked private keys of single users, separate signatures by two different users might be appropriate. For very long-term storage (i.e., when the keys have to be considered leaked simply because of their age), it may be appropriate to re-sign at regular intervals ... – Hagen von Eitzen Jan 11 '16 at 10:13
A small technicality, but isn't there a problem with `Give each technician a keypair`, in that the private key should only be known to the owner? Shouldn't each technician create their own key pair? – Qwerky Jan 11 '16 at 14:09
@Qwerky In a perfect world this would be true, but in the real world they might require assistance. – Philipp Jan 11 '16 at 14:26

score 27 · Answer 2 · edited Jan 11 '16 at 19:29

Any form of digital signature will do. Here are a few pointers:

For XML data, there is a digital signature standard, XML Signature (XMLDSig). Unfortunately, this standard is rather poor and has an important security loophole: documents need to be normalized through an XML transform before they can be signed. This is extremely hard to do securely, since the transform itself becomes an important part of the signature.
You can also use PGP or S/MIME to digitally sign documents. These will produce new, text-based and mostly readable but still tamper-evident documents.
Finally, you can use detached signatures. A detached signature is a separate file that contains the digital signature for another document and can be used to validate the original data, whatever the original format.

Let me add some extra information here:

Picking the right properties for the signature (algorithm, key type and size, etc.) depends heavily on your requirements: how long do you intend to keep the data secure? Against what type of adversary do you intend to protect it (what is the value of a forgery, and what would be the value of an attack that breaks all documents signed with the same key)? Is there any regulatory requirement? This means that you should consult a specialist who can translate these business requirements into technical ones.
I strongly advise you to add a secure timestamp to your signature. This will not only allow you to prove that a document hasn't been tampered with but also allow you to prove when the signature occurred.
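As a rough illustration of how a timestamp binds to a signature, the sketch below has a stand-in "authority" seal a digest of the signature together with the current time. Everything here is an assumption for illustration: real timestamp services (for example those following RFC 3161) sign with a private key and use a structured token format, whereas this sketch uses an HMAC with a made-up key and a JSON dictionary purely so that it is self-contained.

```python
import hashlib
import hmac
import json
import time

# Hypothetical stand-in for the timestamp authority's signing key. A real
# authority signs with a private key; HMAC only keeps this sketch runnable.
AUTHORITY_KEY = b"demo-authority-key"


def authority_timestamp(imprint_hex: str) -> dict:
    """Authority side: bind the received digest to the current time."""
    token = {"imprint": imprint_hex, "time": int(time.time())}
    payload = json.dumps(token, sort_keys=True).encode()
    token["seal"] = hmac.new(AUTHORITY_KEY, payload, hashlib.sha256).hexdigest()
    return token


def check_token(signature_bytes: bytes, token: dict) -> bool:
    """Check that the token covers this signature and was not altered."""
    body = {"imprint": token["imprint"], "time": token["time"]}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(AUTHORITY_KEY, payload, hashlib.sha256).hexdigest()
    imprint = hashlib.sha256(signature_bytes).hexdigest()
    return hmac.compare_digest(expected, token["seal"]) and imprint == token["imprint"]


signature_bytes = b"detached signature bytes for run-001.mzML"
# Only this digest is sent to the authority, never the data file itself.
imprint = hashlib.sha256(signature_bytes).hexdigest()
token = authority_timestamp(imprint)

# Someone who edits the recorded time cannot recompute the seal.
backdated = dict(token, time=token["time"] - 86400)
print(check_token(signature_bytes, token))
print(check_token(signature_bytes, backdated))
```

Because the seal covers both the digest and the time, changing either one invalidates the token, which is what lets the timestamp serve as evidence of when the signature existed.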

edited Jan 11 '16 at 19:29 by StackzOfZtuff
answered Jan 09 '16 at 14:25 by Stephane

Secure timestamp? How do you prove that a signature occurred at a specific time? – Blacklight Shining Jan 09 '16 at 21:27
The protocol is described in RFC 3161. Basically, you take a hash of your signature data and send it to a secure timestamp server, which sends you back a signed version of the hash. You then add that to your signature. – Stephane Jan 09 '16 at 22:18
Ahh, so it requires trust to be placed in a third party. – Blacklight Shining Jan 09 '16 at 22:52
@BlacklightShining It does, but it prevents very real attack vectors. For example, a malicious insider (e.g. your own technicians) or an attacker with access to all *your* keys will still be unable to fake the timestamps, and if that third party is malicious or compromised, that *by itself* is not sufficient to disclose or modify your data. A drawback is that the network connection to the timestamp server can expose how many signatures you're making and exactly when; depending on your situation, that may be irrelevant or dangerous. – Peteris Jan 10 '16 at 10:22
You could embed the hash of the signature into the bitcoin blockchain, then you don't have to trust a third party. It's not quite free though. – Buge Jan 10 '16 at 22:29
All digital signature schemes rely, at one level or another, on trust placed in a third party: it is necessary to assert the identity of the key used for the signature. That doesn't mean that you need to place much trust in third parties, though: for instance, the timestamping authority is only responsible for guaranteeing that, at a given time, a specific datum already existed (through its hash). – Stephane Jan 11 '16 at 09:24
+1 for the timestamp, especially since many court cases have had key evidence ruled inadmissible because the computers producing it had the wrong time set. Many major X.509 certificate authorities provide timestamp services, but you'll have to be using a compatible file format. – billc.cn Jan 11 '16 at 10:19
@billc.cn Actually, no, you don't have to use a compatible file format. That's what I explained in my post: you can either envelop the data in PGP or S/MIME or simply use a detached signature. – Stephane Jan 11 '16 at 11:30
Why not just sign the .xml file with PGP? – Joshua Jan 11 '16 at 21:49
@Joshua PGP has no secure timestamps. – Josef Jan 12 '16 at 12:41
@Josef: http://www.itconsult.co.uk/stamper.htm – Joshua Jan 12 '16 at 16:12
@Joshua Well, they are using PGP/GPG, but you can't use your own GPG/PGP and just get a timestamp. You have to send them the file (so encrypt it first!), then they sign it with PGP, and you have to trust them to use the correct date and not lose the keys. Seems not really suited for this use case. I didn't know this service, so thanks for mentioning it exists! – Josef Jan 12 '16 at 16:17
You could just send them the .asc file of the detached signature to have it signed. – Joshua Jan 12 '16 at 16:21

score 6 · Answer 3 · answered Jan 10 '16 at 20:02

I will outline the three main options and pros/cons of each.

Store backups of the files in a secure location

Pretty self-explanatory. The "secure location" can be a read-only medium (like CDs), or a network drive that everyone can read but only the supervisor can write to, or an online storage service (e.g. Dropbox) that makes it reasonably hard to forge file modification dates.

Pros

You should have a backup system anyway

Cons

If files are large, downloading them for verification can be time-consuming
If the forger breaks into the secure location, he can cover his tracks

Store hashes in a secure location

A hash is a fingerprint of a file that looks something like `8f2e3f53aa90b27bda31dea3c6fc72f6`; if two files are just slightly different they will have a different hash. Take a hash of the original file and store it securely, then to verify a file has not been modified, take a hash of it and compare it to the stored hash.
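A minimal sketch of that store-then-compare routine, assuming SHA-256 as the hash (which gives a 64-hex-digit fingerprint rather than the 32-digit one in the example above) and a made-up file name and contents:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "run-001.mzML"   # hypothetical data file
    data_file.write_bytes(b"<mzML><run id='sample-001'/></mzML>")

    # Step 1: record the fingerprint and keep it somewhere secure.
    stored_hash = sha256_of_file(data_file)

    # Step 2 (later): recompute and compare.
    unchanged = sha256_of_file(data_file) == stored_hash

    # A single edited character is enough to change the fingerprint.
    data_file.write_bytes(b"<mzML><run id='sample-002'/></mzML>")
    still_matches = sha256_of_file(data_file) == stored_hash

print(unchanged, still_matches)
```

The comparison is only as trustworthy as the place `stored_hash` is kept, which is the second "Cons" point below.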

Pros

You need to securely store/check a ~32 digit code instead of an entire file

Cons

You still need to access an external resource to check the file
If the forger breaks into the secure location, he can cover his tracks

Cryptographic signatures

In this case, one or more people can "sign" the file, and if any changes are made these signatures will be invalidated. Of course, if everyone who needs to sign the file is willing to sign a tampered file (or is tricked into doing so), the tampered file will pass verification.

Pros

The security information can be kept within the file itself, or otherwise on the same drive, meaning easier verification.

Cons

Everyone who signs files needs to be very careful to prevent someone stealing their private key.
Everyone who signs files needs to be very careful they know what they are signing.

score 2 · Answer 4 · answered Jan 10 '16 at 21:08

Take your xml file, and your favorite holiday photo. Concatenate the files and compute several hash values of the resulting file.

The holiday picture ensures that it is extremely hard to produce a collision, even if the holiday photo file is public. Also, if you use several hash algorithms, it is unlikely that all of them will be broken within a 10-year span.
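The mechanics of this suggestion can be sketched as follows. The choice of algorithms, the function name and the sample bytes are assumptions made for illustration, and the sketch only shows how the digests are computed; it does not settle how much security the photo adds.

```python
import hashlib

# Assumed selection of algorithms; all four ship with Python's hashlib.
ALGORITHMS = ("sha256", "sha512", "sha3_256", "blake2b")


def multi_hash(xml_bytes: bytes, photo_bytes: bytes) -> dict:
    """Concatenate the data file and the photo, then hash with each algorithm."""
    combined = xml_bytes + photo_bytes
    return {name: hashlib.new(name, combined).hexdigest() for name in ALGORITHMS}


xml_bytes = b"<mzML><run id='sample-001'/></mzML>"
photo_bytes = b"\xff\xd8\xff\xe0 stand-in bytes for a holiday photo"

recorded = multi_hash(xml_bytes, photo_bytes)
recheck = multi_hash(xml_bytes.replace(b"001", b"002"), photo_bytes)

for name in ALGORITHMS:
    print(name, recorded[name] == recheck[name])
```

Every digest has to match for the file to be accepted, so a forger would need to defeat all of the chosen algorithms at once.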

answered Jan 10 '16 at 21:08 by Per Alexandersson

Concatenating all data files with the same photo won't help much. You're better off using more computationally expensive hash algorithms on pure data. – Dmitry Grigoryev Jan 11 '16 at 14:35
Isn't this "trivially" defeated by a length extension attack? – NikoNyrh Jan 11 '16 at 20:56
If the holiday photo is not known to the public, it is very hard, and with multiple hashes, even harder. – Per Alexandersson Jan 11 '16 at 22:36

score 2 · Answer 5 · answered Jan 11 '16 at 09:19

Addressing vendor file-format security, expanding on what @Philipp says in the comments.

I've had a poke around a vendor file format (not mass spec but near enough for these purposes). It was made a lot easier by having the software installed, but I'm no expert in these things. I could easily change metadata (extracting the metadata was my goal in the first place); the real data would have been harder, but by no means impossible, to modify. As metadata includes things like sample ID and date of test, that's a big enough vulnerability for things like "whose sample was clean and when?" as seems relevant to you, or "who first discovered this drug?" in other fields.

Some software provides some anti-tamper features (e.g. internal use of -- not necessarily crypto-grade -- hashes; user permissions when editing using their software). Reverse engineering these would be little more than trivial for someone with a decent bit of skill in most cases. With the software installed even circumventing the built-in features could be as simple as writing a front-end to call the vendor's DLLs, as these anti-tamper features are normally optional add-ons (in many fields they're not required or deprecated).

(This could have been a sequence of comments, but as my goal was to make the vendor-file issue clearer, it seemed better to write it properly).

score 1 · Answer 6 · answered Jan 11 '16 at 10:30

How about making the technicians post pairs of unique file IDs and their hashes to Twitter using their own accounts?

This will prove that:

Data file with said id and hash existed at the time of posting
The person who has access to the account trusts the content of the file at that point
The file is not modified after the fact as Twitter does not allow tweets to be edited

This method provides at least comparable security to many of the digital signature-based answers and benefits like:

Much simpler to learn and use (no complicated private key generation, opening or back up procedures)
High redundancy (through twitter's backups and third-party twitter scraping sites)
Built-in timestamp (that will probably stand in a legal proceeding without much explanation)

I recommend using at least SHA-256 as the hash algorithm.

score 1 · Answer 7 · answered Jan 11 '16 at 15:14

One of the easiest ways is to create a hash of the file and store it elsewhere so you know if it gets changed. Intrusion detection programs use this technique all the time to verify the integrity (or at least indicate if some attacker has been fiddling with system files).

Look at a program such as AIDE. You could run it against the directory containing the files (and possibly on demand when a file is added) to update its database of hashes, then run it nightly to check the files and email you a report showing all file changes.
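The underlying idea, a database of hashes that is rebuilt and compared on a schedule, can be sketched in a few lines. This is not how AIDE itself is implemented or configured; the function names, the dictionary layout and the sample files are assumptions for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest


def compare(old: dict, new: dict) -> dict:
    """Report files added, removed or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run-001.mzML").write_text("<mzML>one</mzML>")
    (root / "run-002.mzML").write_text("<mzML>two</mzML>")
    baseline = build_manifest(root)            # stored somewhere secure

    (root / "run-002.mzML").write_text("<mzML>TWO</mzML>")     # edited
    (root / "run-003.mzML").write_text("<mzML>three</mzML>")   # new file
    report = compare(baseline, build_manifest(root))            # nightly check

print(report)
```

In practice the baseline manifest has to live where an attacker who can edit the data files cannot also rewrite it, otherwise the report proves nothing.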

If you need to know the original, then a versioned filesystem might be a good idea. Every change that is made to a file is recorded and old versions can be extracted. Alternatively, a backup system that detects new files and backs them up to a secure location could be used, provided it keeps all the old versions; otherwise an attacker could just modify the file repeatedly until the original is deleted.

score -2 · Answer 8 · answered Jan 10 '16 at 07:28

*the open formats are very useful for providing access to data over archival timescales, but security is a problem*

Big question: how are the archives being accessed?

The issue with hashing a plain-text file is the hash is character-accurate. Change one character and the hash will be completely different. Works very well for binary files like executable programs (where one byte out of place is usually disastrous) but fails on things like markup files - normalizing (or compacting) the whitespace will change the hash but have no effect on the data.
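The whitespace problem, and one common way around it, can be sketched with the standard library. The example assumes Python 3.8 or later for `xml.etree.ElementTree.canonicalize`, and it assumes that whitespace around text content carries no meaning in the data, which has to be checked for any real format; the sample XML is made up.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

# Two serializations of the same data, differing only in layout.
pretty = '<run>\n  <scan id="1">42.0</scan>\n</run>'
compact = '<run><scan id="1">42.0</scan></run>'


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_hash(xml_text: str) -> str:
    """Hash the canonical (C14N 2.0) form, with surrounding whitespace stripped."""
    return sha256_hex(canonicalize(xml_text, strip_text=True))


print(sha256_hex(pretty) == sha256_hex(compact))          # raw bytes differ
print(canonical_hash(pretty) == canonical_hash(compact))  # canonical forms agree
```

Hashing the canonical form makes reformatting harmless, at the cost of making the canonicalization step itself part of what has to be trusted.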

If you are handing the files around by email or read-write network share you will have to have secure storage for the hash, or anyone with half a brain can edit the file and then update the hash. If you have secure storage for the hash, why not store the data file in the same place and forget about the hash?

This is going to sound strange at first, but look at uploading the file and description to a local installation of something like wordpress or mediawiki. Access can be as open or secure as you want, and the platforms have user-specific file upload controls. Once the IT department has set it up properly, the write access to the files can be locked up as tight as necessary.

"*or anyone with half a brain can edit the file and then update the hash*": this is not possible when using digital signature unless the private key has been compromised. — WhiteWinterWolf, Jan 10 '16 at 10:54
-1, misses the standard solutions, which is use cryptographic signatures. — Quora Feans, Jan 10 '16 at 23:00
