Remove span class through regex in Notepad++

1

0

I have a big file containing 1000+ span elements (class pagenum) for page numbers. I would like to remove the complete markup, including its text, in Notepad++ using a regex. Example: <p>Cillacepro di to tem endelias eaquunto maximint eostrum eos dolorit et laboria estiati<span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span>Cillacepro di to tem endelias</p>

I would like to replace <span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span> with " " such that the pattern also matches two- and three-digit page numbers. I am new to regex searching, so I would be grateful if anyone could help me with a search and replace string for this. Regards, Aman Mittal

Aman Mittal

Posted 2018-05-22T13:27:02.833

Reputation: 31

It is not clear what you are trying to find and what you want to replace it with, so I suggest you look at this tutorial.

– AFH – 2018-05-22T13:42:06.493

Hi, thank you for the reply. I am simply trying to replace <span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span> with a space. I am looking for a regex search string that removes all the page numbers at once so that I don't have to manually remove each one. – Aman Mittal – 2018-05-22T14:10:00.367

I imagine that matching <span class="pagenum">.*?</span> will be sufficient, as only this span class is likely to contain the page number data you want to eliminate. Note that .*? will match the minimum number of arbitrary characters, ensuring that the </span> in the match is paired with the same leading <span ...>. – AFH – 2018-05-22T17:14:13.767

Thank you so much for taking out the time and providing a solution. It worked wonders for me. I am too grateful to you. Thanks a lot! Stay blessed! – Aman Mittal – 2018-05-23T05:14:11.710

@AFH I wish I could like your profile and let everyone know how great of a person you are. Thanks a lot! – Aman Mittal – 2018-05-23T05:21:34.627

@AFH: I am sorry to trouble you again. But in my project, I came across another type of page number span: <span class="tei tei-pb" id="page001">[pg 001]</span><a name="Pg001" id="Pg001" class="tei tei-anchor"></a> I tried this regex to find them all at once: <span class="tei tei-pb" id="page\d+">[pg \d+]</span><a name="Pg\d+" id="Pg\d+" class="tei tei-anchor"></a> However, it does not seem to work. Could you please let me know the errors in the regex search? I would be highly obliged to you. – Aman Mittal – 2018-05-25T04:41:25.757

Also, if I would like to search for Roman numerals (i, ii, iii...) in the same regex, how would I go about it? – Aman Mittal – 2018-05-25T04:43:52.027

Answers

1

I would like to thank @AFH for providing a generic answer that caters to Page, Pg and other types as well. I imagine that matching <span class="pagenum">.*?</span> will be sufficient, as only this span class is likely to contain the page number data you want to eliminate. Note that .*? will match the minimum number of arbitrary characters, ensuring that the </span> in the match is paired with the same leading <span ...>. – AFH 12 hours ago
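For readers who want to check the lazy match outside Notepad++, here is a minimal sketch in Python. It assumes Python's re module behaves like Notepad++'s regex engine for this simple pattern, and the two-span sample text (including the Pg_12 variant) is made up for illustration. The greedy version is shown only for contrast.

```python
import re

# Made-up sample: two page-number spans inside one paragraph,
# one using Page_ and one using Pg_ in the anchor name.
text = (
    '<p>First<span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span>'
    'middle<span class="pagenum"><a name="Pg_12" id="Pg_12">[Pg 12]</a></span>last</p>'
)

# Lazy quantifier: each match stops at the first </span> after the opening tag.
# re.DOTALL stands in for the ". matches newline" checkbox in Notepad++.
lazy = re.compile(r'<span class="pagenum">.*?</span>', re.DOTALL)

# Greedy variant, for contrast only: one match that runs to the LAST </span>.
greedy = re.compile(r'<span class="pagenum">.*</span>', re.DOTALL)

cleaned_lazy = lazy.sub(" ", text)
cleaned_greedy = greedy.sub(" ", text)

print(cleaned_lazy)    # both spans removed, "middle" kept
print(cleaned_greedy)  # "middle" is swallowed along with the spans
```

The greedy pattern removes the text between the two spans as well, which is why the ? after .* matters here.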

I would also like to thank @alzaj for providing the right direction, as well. Thanks a lot! Saved my day and effort! Regards, Aman Mittal

Aman Mittal

Posted 2018-05-22T13:27:02.833

Reputation: 31

1

Escape the square brackets and use the digit shorthand ("\d") followed by the repetition sign "+" to match the page numbers:

<span class="pagenum"><a name="Page_\d+" id="Page_\d+">\[Pg \d+\]</a></span>

You can validate the above regex on the following sample:

placeholdertext<span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span>placeholdertext
placeholdertext
<span class="pagenum"><a name="Page_111" id="Page_111">[Pg 111]</a></span>
placeholdertext<span class="pagenum"><a name="Page_222" id="Page_222">[Pg 222]</a></span>
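As a rough cross-check outside Notepad++, the same regex and sample can be run through Python's re module. This is only a sketch and assumes the two engines agree on this pattern, which uses nothing beyond \d, + and escaped brackets.

```python
import re

# The regex from this answer, unchanged.
pattern = re.compile(
    r'<span class="pagenum"><a name="Page_\d+" id="Page_\d+">\[Pg \d+\]</a></span>'
)

# The sample text from this answer, unchanged.
sample = """placeholdertext<span class="pagenum"><a name="Page_4" id="Page_4">[Pg 4]</a></span>placeholdertext
placeholdertext
<span class="pagenum"><a name="Page_111" id="Page_111">[Pg 111]</a></span>
placeholdertext<span class="pagenum"><a name="Page_222" id="Page_222">[Pg 222]</a></span>"""

# subn returns the new text and the number of replacements made.
result, count = pattern.subn(" ", sample)

print(count)
print(result)
```

All three spans (one-digit and three-digit page numbers) are replaced by a single space each.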

alzaj

Posted 2018-05-22T13:27:02.833

Reputation: 36

Thank you so much for taking the time to provide a solution. It is specific to Page_1 and does not work for Pg_1, but we can always tweak the regex. Thanks a lot for all the help. Your solution has been of great help. I am very grateful to you. Thanks a lot! Stay blessed! – Aman Mittal – 2018-05-23T05:15:53.860

You're welcome! One more advantage of the @AFH solution: his regex also matches if there is a line break inside the span tag (with the ". matches newline" checkbox ticked in Notepad++). But the solution of AFH could also have a drawback if your span tag contains a nested span tag. – alzaj – 2018-05-23T06:59:27.113

@alzaj - I have never seen nested <span> tags and, though allowed, they are very unlikely in page numbering unless the page number is the innermost, which will not affect my match string. I should have mentioned checking ". matches newline": thanks for pointing that out. – AFH – 2018-05-23T10:26:53.707
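The two points in these comments (line breaks inside the span, and a nested span) can be illustrated with a small Python sketch. Both sample strings are hypothetical, and re.DOTALL is used here as the counterpart of the ". matches newline" checkbox.

```python
import re

PATTERN = r'<span class="pagenum">.*?</span>'

# Case 1 (hypothetical): a line break inside the page-number span.
broken = '<span class="pagenum"><a name="Page_4" id="Page_4">\n[Pg 4]</a></span>tail'

# Without DOTALL, "." does not cross the newline, so nothing is replaced.
without_dotall = re.sub(PATTERN, " ", broken)
# With DOTALL, the whole span is matched and replaced.
with_dotall = re.sub(PATTERN, " ", broken, flags=re.DOTALL)

# Case 2 (hypothetical): the pagenum span is the OUTER span and wraps another span.
nested = '<span class="pagenum"><span class="inner">[Pg 4]</span></span>tail'

# The lazy match ends at the inner </span>, leaving the outer one behind.
nested_result = re.sub(PATTERN, " ", nested, flags=re.DOTALL)

print(repr(without_dotall))
print(repr(with_dotall))
print(repr(nested_result))
```

In the nested case a stray </span> remains in the output, which is the drawback mentioned above; when the page-number span is the innermost one, the match is unaffected.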

@alzaj - I am sorry to trouble you again. But in my project, I came across another type of page number span: <span class="tei tei-pb" id="page001">[pg 001]</span><a name="Pg001" id="Pg001" class="tei tei-anchor"></a> I tried this regex to find them all at once: <span class="tei tei-pb" id="page\d+">[pg \d+]</span><a name="Pg\d+" id="Pg\d+" class="tei tei-anchor"></a> However, it does not seem to work. Could you please let me know the errors in the regex search? I would be highly obliged to you. – Aman Mittal – 2018-05-25T04:44:19.313

Also, if I would like to search for Roman numerals (i, ii, iii...) in the same regex, how would I go about it? – Aman Mittal – 2018-05-25T04:46:19.247

@AFH Request you to help me out once more, if possible. Would greatly appreciate your valuable time and expertise. – Aman Mittal – 2018-05-26T06:14:15.920

2

Using this site I was able to see that the problem is the square brackets, which need to be escaped to be matched literally (\[ and \]); otherwise, they are treated as delimiting a character set that matches a single character. Note that the site has no exact flavour for Notepad++, but I was able to use the "golang" flavour. For Roman numerals you can simply replace \d+ by .+: this means a non-numeric page number could also match, but I cannot imagine that this would happen. You could tighten the criteria with [0-9ivxlcdm]+.
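To see the effect of the escaping, here is a small Python sketch run against the two tei samples quoted in this thread. It assumes Python's re module treats character sets the same way as Notepad++ does here. The character set [0-9ivxlcdm] is one possible tightening; it includes l and c so that numerals such as xl or xc would also match, and the surrounding "before"/"after" text is made up.

```python
import re

# Samples from the thread, wrapped in made-up surrounding text.
arabic = ('before<span class="tei tei-pb" id="page001">[pg 001]</span>'
          '<a name="Pg001" id="Pg001" class="tei tei-anchor"></a>after')
roman = ('before<span class="tei tei-pb" id="pageix">[pg ix]</span>'
         '<a name="Pgix" id="Pgix" class="tei tei-anchor"></a>after')

# Unescaped brackets: [pg \d+] is a character set matching ONE character
# (p, g, space, a digit or a literal +), so the literal "[pg 001]" never matches.
unescaped = re.compile(r'<span class="tei tei-pb" id="page\d+">[pg \d+]</span>')

# Escaped brackets, with a character set covering digits and lowercase Roman numerals.
NUM = r'[0-9ivxlcdm]+'
escaped = re.compile(
    r'<span class="tei tei-pb" id="page' + NUM + r'">\[pg ' + NUM + r'\]</span>'
    r'<a name="Pg' + NUM + r'" id="Pg' + NUM + r'" class="tei tei-anchor"></a>'
)

unescaped_hit = unescaped.search(arabic)
cleaned = [escaped.sub(" ", s) for s in (arabic, roman)]

print(unescaped_hit)
print(cleaned)
```

Compared with .+, the explicit character set cannot run past the closing quote, so it stays within one tag even when several page markers share a line.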

– AFH – 2018-05-26T11:59:30.970

@AFH Thank you for taking the time to reply to my query. I am extremely grateful to you for helping me out so much. Regarding the span regex, it worked once the escaped brackets were included, and not when the brackets were removed. Moreover, your .+ trick worked for the Roman numerals as well. To search for <span class="tei tei-pb" id="pageix">[pg ix]</span><a name="Pgix" id="Pgix" class="tei tei-anchor"></a> I used this regex and it worked wonders: <span class="tei tei-pb" id="page.+">\[pg .+\]</span><a name="Pg.+" id="Pg.+" class="tei tei-anchor"></a> – Aman Mittal – 2018-05-29T05:31:45.777

@AFH Thank you so much for helping me out. Overwhelmed and grateful! May you always stay blessed and happy! – Aman Mittal – 2018-05-29T05:34:17.257