Every other line is blank when copying an HTML document and pasting it as plain text

When I copy text from an HTML document (e.g. from Firefox, Thunderbird, Internet Explorer...) and paste it as plain text, an empty line is inserted after every HTML paragraph (<p>).
This also happens when I export the HTML document as text.

HTML documents often have every line marked up as a separate paragraph.
I frequently receive e-mails formatted this way, and I think they are produced when plain text is pasted into an HTML e-mail under certain conditions.

Example

What the HTML document looks like:

line 1
line 2

What its HTML code looks like:

<p>line 1</p>
<p>line 2</p>

What the text looks like after pasting as plain text or exporting as text:

line 1

line 2

Is there a way to avoid the inserted empty lines without having to post-process the text document?

The behaviour was observed both on Linux (X Org) and Windows.

pabouk

Posted 2013-12-17T16:54:21.793

Reputation: 5 358

The example output you gave is wrong. For the given example HTML code it should put a chunk of whitespace in between the lines (to separate the paragraphs). What are you pasting this into? Is it actually putting a real blank line in (where you could type text) or is it just a whitespace gap? – Ƭᴇcʜιᴇ007 – 2013-12-17T17:01:38.653

@techie007: I must admit that the example is a simplified HTML document and I did not test this simplified form. I am going to test it. --- The blank lines are probably inserted during any paste into a plain-text format. I tested this a while ago in Ubuntu 12.04.3 (gnome-terminal, gedit) and in Windows XP Notepad. --- It puts a real empty line in, i.e. after line 1 there are two newlines instead of one. – pabouk – 2013-12-17T17:11:40.950

It does it for me as well. It's the way it's interpreted by the browser, the clipboard and/or the paste target. Again, it's normal: there is supposed to be a blank line between paragraphs, so it's doing what it should. From W3Schools: "Note: Browsers automatically add an empty line before and after a paragraph."

– Ƭᴇcʜιᴇ007 – 2013-12-17T17:20:20.553

@techie007: I have just tested the HTML code from my question. I saved it as test.html, opened it in Firefox on Ubuntu 12.04.3, exported it as text into test-export.txt, and here is the hex dump: od -tx1 test-export.txt --- 0000000 6c 69 6e 65 20 31 0a 0a 6c 69 6e 65 20 32 0a 0a. I get exactly the same result when I copy from Firefox and paste into gedit, gnome-terminal or anything else that accepts plain text. You can see that every paragraph is followed by two newlines (0a) instead of one. – pabouk – 2013-12-17T17:23:09.887

@techie007: OK, I suspected that this behaviour is probably defined, but I would like to know if there is a simple way to change or override it, because I often receive e-mails formatted this way. – pabouk – 2013-12-17T17:26:08.903

The browsers are following standards; to have it not follow those standards you need a non-standards-compliant browser. Finding a browser that doesn't follow the HTML 1.0 standards is going to be tough, if not impossible. Unfortunately you may be best off just getting a text editor with a blank-line removal add-on and post-processing it (as you suggest). – Ƭᴇcʜιᴇ007 – 2013-12-17T17:29:22.573
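
For reference, the post-processing step mentioned above can be very small. The following is only a sketch in Python, assuming the pasted or exported text has exactly the shape shown in the hex dump (every paragraph followed by two newlines); the function name `halve_newlines` is made up for this illustration.

```python
def halve_newlines(text: str) -> str:
    """Turn every pair of consecutive newlines into a single newline."""
    # Normalize Windows line endings so the same rule works on both platforms.
    text = text.replace("\r\n", "\n")
    # str.replace scans left to right without overlapping matches, so
    # "\n\n" becomes "\n" and "\n\n\n\n" becomes "\n\n". A genuinely empty
    # paragraph in the source therefore survives as one blank line.
    return text.replace("\n\n", "\n")


# Same bytes as in the od dump: 6c 69 6e 65 20 31 0a 0a 6c 69 6e 65 20 32 0a 0a
exported = "line 1\n\nline 2\n\n"
print(repr(halve_newlines(exported)))  # 'line 1\nline 2\n'
```

This treats every doubled newline the same way, so it is only appropriate for text where each line really was its own paragraph; in a document with intentional blank lines between multi-line paragraphs it would remove those too.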

HTML emails usually are sent with both a plain-text and html version, and a reader like Thunderbird can be set to display the plain text rather than the html. You may find that copying from the plain text version has a format you prefer. – mgkrebbs – 2013-12-17T19:28:10.253

@mgkrebbs: Thanks for the tip! This could be one of the answers here! In the last e-mail with the problem, the plain-text variant does not have the doubled line endings. This makes me wonder even more: why does an e-mail client make paragraphs from the lines? Why doesn't it use <br>? In this case it was probably MS Exchange. – pabouk – 2013-12-17T19:54:09.427
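
When the HTML source itself is available (for example a saved test.html), another possibility is to do the HTML-to-text conversion yourself instead of relying on the browser's copy or export. The sketch below uses Python's standard `html.parser` module and emits one newline per `<p>`, treating `<br>` as a line break inside a paragraph. The class and function names are hypothetical, and text outside `<p>` elements is deliberately ignored, which is a simplification and not how a browser behaves.

```python
from html.parser import HTMLParser


class ParagraphLines(HTMLParser):
    """Collect the text of each <p> element as one entry in self.lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = None  # None while we are outside any <p>

    def _flush(self):
        # Store the paragraph collected so far, if there is one.
        if self._buf is not None:
            self.lines.append("".join(self._buf))
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an unclosed one
            self._buf = []
        elif tag == "br" and self._buf is not None:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Whitespace between </p> and the next <p> is skipped here.
        if self._buf is not None:
            self._buf.append(data)


def html_paragraphs_to_text(html: str) -> str:
    parser = ParagraphLines()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    # One newline per paragraph instead of the two seen in the export.
    return "".join(line + "\n" for line in parser.lines)


source = "<p>line 1</p>\n<p>line 2</p>\n"
print(repr(html_paragraphs_to_text(source)))  # 'line 1\nline 2\n'
```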

No answers