Delete all lines of text in an HTML file except the first?

1

I have to rewrite a lot of HTML files, for example:

*--file1.html--*

<p>text1</p><br>
**<p>text2</p><br>
...<br>
<p>text(n)</p>**

*--file2.html--*

<img1...<br>
<img2...<br>
<p>text1</p><br>
**<p>text2</p><br>
...<br>
<p>text(n)</p>**

*--file3.html--*

<blockquote><br>
<p>text1</p><br>
**<img...<br>
<p>text2</p><br>
...<br>
<p>text(n)</p>**


*--file(n).html--*

... - various combinations of tags.

Each <p>...</p> tag is on its own line. I need to delete every p tag except the first (the part to delete is marked from ** to **), for example:

*--file1.html--*

<p>text1</p><br>


*--file2.html--*

<img1...<br>
<img2...<br>
<p>text1</p><br>

*--file3.html--*

<blockquote><br>
<p>text1</p><br>

I tried the following, but it does not work:

sed '/<p>/,</p>/d;1/<p>/!d' file*.html - this deletes all the lines starting with a p tag; I cannot get it to leave a single p-tag line.

sed '1!d' file*.html - this works if the first line is a p tag, but the first line can be an img tag, so it is no good.

How can I keep the first p tag and remove the rest (from the second p tag onward)? What is wrong?

user2435244

Posted 2013-08-08T10:57:24.470

Reputation: 15

This might get you better exposure on SO – Somesh Mukherjee – 2013-08-08T11:24:05.013

Answers

0

You may try this Perl one-liner:

perl -0777 -ne 'm#(^.*?<p>.*?</p>.*?\n).*</p>.*?\n(.*)$#s; print $1, $2' <file>

For example, if you have a file named test with the following content

<blockquote><br>
<p>text1</p><br>
**<img...<br>
<p>text2</p><br>
...<br>
<p>text(n)</p>**
appendix

and you process it with the one-liner above, it prints

<blockquote><br>
<p>text1</p><br>
appendix

as a result on the screen.
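
A line-based alternative can be sketched with sed and awk. This is only a sketch under stated assumptions: each <p>...</p> element sits on a single line, and the opening tag is written exactly as <p> with no attributes. The function names keep_through_first_p and drop_later_paragraphs are made up for this illustration. The first function matches the expected output in the question (nothing after the first paragraph is kept); the second keeps trailing lines such as "appendix", like the example above, and prints the file unchanged when it contains no <p> line.

```shell
#!/bin/sh
# Sketch only: assumes every <p>...</p> is on one line and the opening tag is
# literally "<p>". Both function names are hypothetical.

# Print everything up to and including the first line containing <p>.
# sed's 'q' command prints the current line and then stops reading input.
keep_through_first_p() {
    sed '/<p>/q' "$1"
}

# Keep lines up to the first <p> line and lines after the last </p> line.
# The file is read twice: pass 1 records the line numbers, pass 2 prints.
drop_later_paragraphs() {
    awk 'NR == FNR {
             if (/<p>/ && !first) first = FNR
             if (/<\/p>/) last = FNR
             next
         }
         !first || FNR <= first || FNR > last' "$1" "$1"
}

# Demo on a throwaway copy of the test file used above.
demo=$(mktemp)
printf '%s\n' '<blockquote><br>' '<p>text1</p><br>' '**<img...<br>' \
    '<p>text2</p><br>' '...<br>' '<p>text(n)</p>**' 'appendix' > "$demo"

keep_through_first_p "$demo"     # prints the first two lines only
drop_later_paragraphs "$demo"    # prints the first two lines, then "appendix"
rm -f "$demo"

# To rewrite many files, one option is a loop with a temporary file:
#   for f in file*.html; do
#       keep_through_first_p "$f" > "$f.tmp" && mv "$f.tmp" "$f"
#   done
```

The loop in the final comment overwrites the originals, so it is worth running it on copies first.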

user1146332

Posted 2013-08-08T10:57:24.470

Reputation: 156