Opened a JPG picture with notepad, pasted all the "text" to a new notepad file, changed to .JPG and it no longer opens. Why?

83

22

This phenomenon has left me with questions.

Here is the detailed experiment, my OS is Windows 7 x64 SP1:

  • I changed a picture (JPG) file to TXT by simply changing its extension (or one could just choose to open the JPG with notepad, same thing)

It should look like this: odd-looking sequences of text, a few of which (very rarely) are actually meaningful, like "creator: dg-jpeg v1.0..." in the screenshot below.

Sample JPG text

  • I disabled wrapping and selected all the text using Ctrl+A (to make sure nothing's missed)
  • I pasted the copied text into another blank TXT file and saved it as JPG, then compared the new file's size with the original JPG. All of them (the original JPG, the converted TXT file and the newly created TXT file) are exactly the same size, to the byte.

When I tried to open, Windows would say "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted, or is too large".

I even tested it another way: I opened the JPG with Notepad, cut ONE known character from a location that is easy to remember (like the first character of the 2nd line), then saved the file. The viewer would of course display the same message. Then I opened it again and pasted the character back at the EXACT location (Notepad remembers its exit state, such as window position, wrapping and font size, so I have no problem getting this right).

And still the same error. You can try this to get the idea; remember to choose a small picture, or else Notepad will act like an old rusty man.

What could have been the cause of this phenomenon?

Nguyễn Tuấn Danh

Posted 2014-07-13T20:50:32.143

Reputation: 975

4Try the fc command. open a cmd prompt and do- C:\blah>fc file1 file2 It is possible for files to be the same size but different. (though usually some random change doesn't tend to leave a file the same size but it easily could). The fc command will be very useful to you in investigating what is happening. You can also use the xxd command, this is in cygwin, and also comes with vim7. xxd -p file1 That will dump the hex of a file. You can compare the hex of the two files with that and fc. Or even open the hex in notepad and flick between the two notepad windows with alt-tab. – barlop – 2014-07-13T21:18:01.377

23You are trying to read a binary file with a simple text editor like notepad. It won't be able to read the ANSI encoding correctly and thus it will convert it. When you save it then the file won't be binary anymore and thus the parser can't read the data inside the file. (Lookup the difference between XML based file saving and Binary file saving it's an interesting topic.) If you would try the same experiment with Notepad++ you'll succeed in what you were trying. – woutervs – 2014-07-14T10:32:55.433

1

possible duplicate of Why does an exe file not appear as ones and zeros in a text editor such as Notepad?

– allquixotic – 2014-07-14T15:07:53.937

3

For the interested: You can edit images in Vim: However, the trick is, that Vim converts the file in the XPM format, which is plain ASCII.

– Boldewyn – 2014-07-14T15:35:37.230

1

@ÃŁŁǫǛȉЖΦΤїҪ (@allquixotic) I disagree; this question is not a dup of Why does an exe file not appear as ones and zeros …?; that question does not address the fact that Notepad makes changes to binary files that are not requested by the user (which is what this question is all about). OTOH, Save Raw Image Data As Image is somewhat close – but still not similar enough for me to VTC.

– Scott – 2014-07-14T17:30:48.173

@Scott Well it does say possible duplicate... and that question was mostly for reference anyway; I should've posted a comment saying "See Also: <link>" but I was too lazy :S – allquixotic – 2014-07-14T17:39:36.927

Try using a hex editor. – Panzercrisis – 2014-07-14T20:34:36.320

4Long story short, Notepad modifies your file before displaying it to you. – Derek 朕會功夫 – 2014-07-15T23:21:43.723

Any idea that it'll work in Notepad++? – Nguyễn Tuấn Danh – 2014-07-18T10:44:29.580

@NguyễnTuấnDanh Why even ask about Notepad++? Don't use a text editor to edit arbitrary data, it's entirely the wrong tool for the job. Use a proper hex editor or, even better, something designed to edit JPEG images (like e.g. an image editor). I'm not sure what you were expecting to happen by opening random strings of bytes with a program intended for text editing. It's like trying to edit an MP3 with Photoshop. – Jason C – 2014-07-19T15:57:29.737

I.e. You know that game that's sometimes fun to play where you type some text into a translator, translate it to a language then back to yours, and laugh at the poor grammar in the results? That's essentially what you've done by loading and saving JPEG data in a text editor. – Jason C – 2014-07-19T16:01:49.190

@woutervs "If you would try the same experiment with Notepad++ you'll succeed". tried with notepad++ and sublime Text failed. – Saif – 2015-12-07T05:12:18.033

@Saif For the sake of the experiment I've also tried it, worked perfectly on my end with default N++ settings. – woutervs – 2015-12-07T08:29:03.160

Answers

82

Depending on the encoding used to open the file, you might see different behaviour. My Windows 7 Notepad can open a file as ANSI, UTF-8, Unicode or Unicode big endian.

I tested this with a small 2x2 pixel JPEG image created with GIMP, opening and saving the image file with ANSI encoding. Comparing the original and the saved image in a hex editor, I see that every 00 byte (two hex digits, the NUL control character) has been converted to 20 (the space character).

Replacing all the 20s with 00 again in the hex editor restores the image.
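The observation can be reproduced without Notepad. What follows is only a minimal Python sketch, assuming the single change made in ANSI mode is turning each 00 byte into 20. The helper name `simulate_notepad_ansi` and the sample bytes are made up for illustration; they are not a real JPEG and not Notepad's actual code.

```python
def simulate_notepad_ansi(data: bytes) -> bytes:
    """Assumed model of the observed behaviour: every NUL byte becomes a space."""
    return data.replace(b"\x00", b"\x20")


# Hypothetical sample: JPEG-like header bytes containing NULs and one real space.
original = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00 ok"
saved = simulate_notepad_ansi(original)

print(len(original) == len(saved))  # True: the size is unchanged
print(original == saved)            # False: the content is not

# Undoing the change blindly also rewrites the space that was there originally.
restored = saved.replace(b"\x20", b"\x00")
print(restored == original)         # False
```

The last line hints at why the restore is only safe when the original data contained no genuine 20 bytes; a real photo is unlikely to be that lucky.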

I've googled it a bit and didn't find any references that explain why it does that, only a reference to a post that warns about it (google cache link, the page is not available).

If you save/open the file as UTF-8, it seems to still convert NUL characters to spaces, and the resulting file also grows because single-byte characters are converted to multi-byte UTF-8 sequences.

If you save/open the file as Unicode, it seems to still convert NUL characters to spaces, and it also adds a byte order mark (BOM) to the beginning of the file.
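To get a feel for the size changes, here is a rough Python sketch. It assumes the bytes are first read as single-byte characters (Latin-1 is used as a stand-in because it maps every byte value to a character) and then written out in another encoding. The sample bytes are hypothetical, and Notepad's real conversion path may differ.

```python
import codecs

# Hypothetical sample bytes (a JPEG-like header), not a real image.
raw = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Assumption: interpret each byte as one character.
text = raw.decode("latin-1")

# Characters above 0x7F need two bytes each in UTF-8.
as_utf8 = text.encode("utf-8")

# UTF-16 little endian with an explicit byte order mark in front.
as_utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(len(raw), len(as_utf8), len(as_utf16))  # 10 14 22
print(as_utf16[:2].hex())                     # fffe
```

Either way the bytes on disk no longer match the original, so a JPEG decoder has nothing valid to work with.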

mangper

Posted 2014-07-13T20:50:32.143

Reputation: 846

22 0x00 is a string terminator in C strings. They may have replaced them since a text file should not contain them. Notepad is a very old program. – Zonder – 2014-07-14T09:07:44.103

@Zonder And what do C strings have to do with that? C strings exist only in RAM. A file cannot contain C strings. Also Windows programs are usually written in .NET languages, so C doesn't have anything to do with all that in the first place. – Bakuriu – 2014-07-14T13:09:46.553

3If you loaded a file into RAM and then referenced it with a string pointer, it would only look up to the first 0x00. Notepad predates .NET and will not use the CLR. (I know for a fact it only gets bare-minimum maintenance: if it ain't broke, don't fix it.) – Zonder – 2014-07-14T13:15:24.353

25I doubt that notepad.exe is a .NET executable. – knittl – 2014-07-14T13:15:26.803

11@Bakuriu A C string most certainly can exist in a file; I can think of numerous file formats that contain them. And the vast majority of apps that ship with Windows are native, not .NET. That said, notepad does not write null-terminated strings to files. – Carey Gregory – 2014-07-14T14:35:34.743

2"C strings can exist in a file" --> Well, if you serialize them to disk they can; but the process of serializing the data from an in-memory data structure to a file on a block device, kind of makes it no longer a C string, because C itself can no longer natively read/write the serialized data without calling OS routines for reading and writing files. Also, when the C string is written to disk, it's usually encoded in some text format, like UTF-8 or ASCII. This could involve transforming (in various ways) the original bits that comprised the C string buffer that was in virtual memory. – allquixotic – 2014-07-14T15:06:22.500

4@Bakuriu : Windows programs are usually not written in .Net. It's C/C++ and native at the core. One of the .Net applications developed by microsoft was live writer which is now discontinued. – bhathiya-perera – 2014-07-14T15:58:24.060

2Any character array semantically terminated by a \0 could be called a C string (depending on its purpose), so they may absolutely be found in files! Don't confuse C strings for string literals. – Lightness Races with Monica – 2014-07-14T21:08:52.850

1@Zonder I think any half-decent C/C++-based text editor out there doesn't use C strings directly; instead they implement some kind of line-based character buffer with a counter for the length, so they can store anything, even control characters. – mangper – 2014-07-14T22:15:28.863

2The why is because all text files are binary files, but not all binary files are text files. Text files follow rules, called character encoding. If a file does not abide by the rules, it will be altered to match the rules. JPEG files follow rules specific to JPEG, and don't conform to any character encoding, which would be wasteful if it did (character encoding does not use all of the possible bits in every byte). The translation into a character encoding causes data loss/damage. Some text editors do support binary, though. – phyrfox – 2014-07-15T05:17:03.620

@CareyGregory when you deal with binary data in C++ you do not use strings, but byte[], with the length being a different data item (metadata). "Strings" do not need that metadata because a special character (\0) signals the end of the data. Of course, this means that at least \0 is illegal as the content of a string (they could have allowed to register an "escaped" \0, but that would require modifying the raw data too, so it is not a solution to the OP). – SJuan76 – 2014-07-15T07:56:25.540

5@SJuan76 Huh? C++ does not define a data type named byte. Perhaps you're thinking of some other language. And the application developers can deal with binary data however they see fit, including the use of C strings if they so choose. As I said before, I can think of numerous binary file formats that contain C strings. – Carey Gregory – 2014-07-15T14:27:34.200

@SJuan76, C++ does not have a byte type, though some developers add to header files "typedef unsigned char byte" to act as if it does. – AresAvatar – 2014-07-15T17:51:50.277

But Windows defines BYTE, which should be used when you mean, well, to use bytes ;-) – Sebastian Godelet – 2014-07-16T20:05:42.360

2

It does not matter that Notepad is old and it does not matter what language it was written in. Notepad uses a standard edit control, standard edit control text is set via WM_SETTEXT, and this message assumes strings are null terminated and does not provide a parameter for string length. The only way Notepad could include nulls in its editor is if it used a custom component and a custom message for setting the text, which it (rightfully) does not. Discussions of C/C++/VB/.NET are completely irrelevant.

– Jason C – 2014-07-19T14:59:22.817

What do you do if the original file did contain space characters, in which case converting all spaces to null characters would corrupt the file further instead of uncorrupting it? – Sean – 2019-12-07T23:53:19.263

39

Why it fails :

Notepad substitutes a space (ASCII code 32) for characters like NUL (ASCII code 0) because the Windows API's text box only accepts null-terminated strings (char *, ASCIIZ: a pointer to a character array). Otherwise the text would be cut off at the first NUL.

That happens because the Windows API is mostly written in C, where null-terminated strings are a common feature. Even in modern Windows with Unicode, the same null-terminated strings occur. So Notepad simply replaces NULs with spaces so you can view the complete file.

So when you save the file it is corrupted.

wikipedia-null terminated strings
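The cut-off at the first NUL can be seen with any null-terminated string API. This is a small Python sketch using the standard `ctypes` module; it illustrates C string semantics in general and is not a reproduction of Notepad's code or of the Windows edit control. The sample bytes are hypothetical.

```python
import ctypes

# Hypothetical file content with an embedded NUL byte.
data = b"JFIF\x00more bytes"

# The buffer holds every byte, plus one terminating NUL appended by ctypes.
buf = ctypes.create_string_buffer(data)

print(buf.raw)    # b'JFIF\x00more bytes\x00' (the whole buffer)
print(buf.value)  # b'JFIF' (read as a null-terminated string)
```

Everything after the first NUL is still in memory, but code that treats the buffer as a C string never sees it. Swapping NUL for some other character is one way to keep the rest visible, at the cost of changing the data.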


How to do further research :

You may use a comparison tool like Beyond Compare (commercial, trial available) to see the character replacement effect. Also see other binary compare tools.

hex comparison

Note: 20 in hexadecimal = 32 in decimal
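If no comparison tool is at hand, a few lines of Python can list the differing bytes. This is only a sketch: `diff_offsets` is a made-up helper, and the two byte strings are hypothetical stand-ins for the original file and the copy saved from Notepad.

```python
def diff_offsets(a: bytes, b: bytes):
    """Yield (offset, byte_in_a, byte_in_b) for each position where a and b differ."""
    # zip stops at the shorter input, so compare the lengths separately if needed.
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield offset, x, y


# Hypothetical stand-ins for the original file and the re-saved copy.
original = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
saved = b"\xff\xd8\xff\xe0\x20\x10JFIF\x20"

for offset, x, y in diff_offsets(original, saved):
    print(f"offset {offset:#06x}: {x:02x} -> {y:02x}")
# offset 0x0004: 00 -> 20
# offset 0x000a: 00 -> 20
```

For real files, the two inputs would come from `open(path, "rb").read()`; opening in binary mode matters, since text mode would apply its own conversions.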


Why Notepad is slow on large files

It checks each character and replaces special characters with spaces. Other software does not do in-memory conversions (at least not ones as primitive as Notepad's); it just renders special characters differently and uses more advanced buffering techniques.

Looking into Notepad.exe (XP 32 bit)

(I'm assuming it's still written in C++, or at least uses a comparable linker.)

notepad

I'm using the PEiD tool (which stopped development with introduction of PE+/64 exes)

PEiD can be found bundled in the bin folder of Universal Extractor

I extracted the notepad.ex_ file from the Windows XP ISO. Try it out: it's a CAB file, which you can extract using 7z.

Warning! Your virus scanner might detect Universal Extractor/PEiD as hack tools or viruses. If you don't trust it, don't download it!


Further info about windows API

credits:Jason C

It's not just the text box; WM_SETTEXT in general provides no parameter for specifying the string length, and strings are always assumed to terminate at null. You could always create a custom text box with a custom message that specified the string length, but Notepad and most other programs reasonably do not. The SetWindowText function does not provide a length parameter either.

bhathiya-perera

Posted 2014-07-13T20:50:32.143

Reputation: 599

1It is a little strange that you show the property sheet for a Notepad executable bundled with a version of Windows XP, yet judging by the window theme, you're clearly running some version of Windows 8. That would explain why the executable was linked with version 7.1 of the toolset—that's what they used to compile Windows XP and associated utilities. The Windows 8 version of Notepad will undoubtedly be compiled with a newer version of the SDK tools. – Cody Gray – 2014-07-17T09:27:15.040

2

It's not just the text box; WM_SETTEXT in general provides no parameter for specifying the string length, and strings are always assumed to terminate at null. You could always create a custom text box with a custom message that specified the string length, but Notepad and most other programs reasonably do not.

– Jason C – 2014-07-19T14:55:31.430

@BhathiyaPerera Because I'm satisfied with the level of work that I've done by adding info in a comment. You are welcome to improve your answer with that information if you'd like. – Jason C – 2014-07-19T16:19:49.587

28

Notepad does not preserve all special / extended characters exactly as they are. I don't have a reference for this behaviour immediately at hand but have found this to be the case for example with UNIX-style end of line LF which Notepad will convert into CRLF and null (0x00) which it will ignore. In a binary file such as a JPG there are liable to be random occurrences of the character(s) that Notepad does not preserve. Try your experiment with a HEX-aware editor and it should work then. I'll update my answer if I find a good reference and once I've tested a HEX editor.

Update: I tried a few well-known programmer's editors, but only one of them worked right off the bat: HxD by Maël Hörz. I had never used HxD before but found it thanks to an answer to this Stack article, A hex viewer / editor plugin for Notepad++.

The other editors that didn't work after a few minutes effort were Notepad++, Notepad2 and UltraEdit (v17.3, older version). A couple of these had problems with the copy / paste of the first few bytes, the JPEG file signature magic number FF D8 FF. Maybe they would work with a little more fiddling than I have time for at present.
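A quick way to tell whether a copy/paste kept those first bytes intact is to test for the signature directly. This is a minimal Python sketch; `looks_like_jpeg` is a hypothetical helper, the sample bytes are invented, and a matching signature says nothing about whether the rest of the file survived.

```python
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Check only the leading signature bytes; this does not validate the whole file."""
    return data.startswith(JPEG_SIGNATURE)


# Hypothetical samples: an intact header and one mangled by an editor.
print(looks_like_jpeg(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # True
print(looks_like_jpeg(b"\x3f\xd8\x3f\xe0\x20\x10JFIF"))  # False
```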

JohnC

Posted 2014-07-13T20:50:32.143

Reputation: 651

Sublime Text (2/3) automatically opens a binary file by showing it in hex format. As an example, the start of JPEG file by just clicking "open": http://puu.sh/aaAVx/bd08dab46e.png

– tomsmeding – 2014-07-14T07:11:06.210

3Actually, more often than notepad will convert LF to CRLF, it will leave the LF the way it is and display the text as if there was no line break at all! – Moshe Katz – 2014-07-15T03:18:08.420

6

You used to be able to do this with Write back in the day. It was a standard program in Windows 3.1 but I can't remember if Windows 95 included it. Write would allow binary safe editing of any file it could open (probably very limited file size). Notepad is definitely not binary safe (the text remains the same but the actual bytes of non-text characters [e.g. control codes] may change) which is why your JPG example is not working. Try getting a copy of Write (and very old Windows) and try your experiment again!

According to Wikipedia's "Windows Write" article Write was included up to Windows NT 3.5. It was replaced by Wordpad in Windows 95 onwards. write.exe was still present in the Windows directory but was simply a wrapper for opening Wordpad.

CJ Dennis

Posted 2014-07-13T20:50:32.143

Reputation: 805

5

I think it's not just a problem of encoding but also of character set. The JPG format is basically a byte stream, which allows non-printable characters like NUL, ETX, STX, SOH, DLE, etc.

Microsoft Notepad can't display those non-printable characters. It may display placeholders of some kind like a space for a null-character. So opening the file with Notepad doesn't show the actual content but the content decoded by the selected encoding (utf-8, utf-16, etc) and displayed by a certain character set (unicode, ascii, etc) excluding the non-printable characters.

When selecting all the displayed text and copying the text to the clipboard, you only copy the printable characters including the placeholders. Thus automatically converting null-characters to spaces and ignoring other non-printable characters entirely.

So basically you just lose content doing it this way. If you use a hex-editor instead, it will copy all the content entirely.


Update: Bhathiya Perera's answer is right: https://superuser.com/a/782885/322784 Non-printable characters aren't ignored when copying text to the clipboard.

sbecker

Posted 2014-07-13T20:50:32.143

Reputation: 181

Every file is "basically a byte stream". – Jason C – 2014-07-19T15:01:05.357

1@JasonC I would disagree. While every file can be read as a byte stream. Structured files like XML files are not readable as a stream of data. The content would not be valid until the end of the file has been read. A cut in half jpg is still valid and can be displayed. It's just missing half the picture. – sbecker – 2014-07-22T07:49:57.117

There isn't really room for disagreement on that. :) XML is a stream of bytes like anything else, and XML (along with character encoding) defines a format for those bytes. It is certainly readable as a stream of data. Open it in a hex editor, for example. That stream of data just happens to be parseable as XML. – Jason C – 2014-07-29T02:59:31.160

@JasonC Can't argue with that actually. :) Touché! – sbecker – 2014-07-29T12:27:10.253

2

The JPEG file contains non text data except for some fields, basically any byte values between 0 and 255 will be found, especially in the area representing the encoded compressed image that contains nearly pseudorandom data.

But Notepad will treat the data as ANSI text by default, so it will do various things that alter the original data, such as:

  • replace bytes that map to special / undefined / forbidden characters, as they do not make sense in valid ANSI text

  • re-encode null characters, end-of-line and end-of-file sequences to Windows/DOS conventions

Which means if you edit and save the data as text it will change the jpeg in the best case, and make it unusable in the worst.
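The lossy round trip through a text encoding can be sketched in Python. The big assumption here is that Python's "cp1252" codec is a fair stand-in for "ANSI"; Notepad's own conversion may handle the undefined bytes differently, so treat this as an illustration of the principle rather than a model of Notepad.

```python
# Every possible byte value, as arbitrary binary data would contain.
all_bytes = bytes(range(256))

# Assumption: Python's "cp1252" codec stands in for "ANSI" text.
# Bytes with no character assigned are replaced during decoding.
text = all_bytes.decode("cp1252", errors="replace")

# Encoding back cannot recover the bytes that were replaced.
round_trip = text.encode("cp1252", errors="replace")

changed = [f"{b:02x}" for b, r in zip(all_bytes, round_trip) if b != r]
print(round_trip == all_bytes)
print(changed)
```

Any byte that comes back different is one that compressed image data would have needed unchanged.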

Dice9

Posted 2014-07-13T20:50:32.143

Reputation: 191

"ANSI" is not technically correct, although it is commonly understood. – Jason C – 2014-07-19T15:04:42.303