Notepad++ and XML - replacing </ in closing element tag

0

I have an XML file (thousands of records, simplified here) with a structure like this:

<LIST>
<ITEM_0>
<NAME>Item Name</NAME>
</ITEM_0>
...
<ITEM_9999>
<NAME>Item Name</NAME>
</ITEM_9999>
</LIST>

I need this result:

<LIST>
<ITEM>
<ID>0</ID>
<NAME>Item Name</NAME>
</ITEM>
...
<ITEM>
<ID>9999</ID>
<NAME>Item Name</NAME>
</ITEM>
</LIST>

Using Regex:

Find: \<ITEM_(.*)(>)
Replace: ITEM>\n<ID>\1\</ID>

I get:

<LIST>
<ITEM>
<ID>0</ID>
<NAME>Item Name</NAME>
</ITEM>
<ID>0</ID> <-- This line not wanted
...
<ITEM>
<ID>9999</ID>
<NAME>Item Name</NAME>
</ITEM>
<ID>9999</ID> <-- This line not wanted
</LIST>

It's replacing </ITEM> as well, even though (I think) I'm asking it to replace only <ITEM>. What am I doing wrong, and how do I fix it? I may be missing something about grouping (or 'greedy' matching?), but I'm not sure what, and I have looked all over for similar questions. There are a million ways to slice and dice it with some other tool, but it bugs me to get so close with NPP and not quite get there.

Help appreciated- thanks.

Late Edit: Even if I get the first replace to work right on just the <ITEM_#> tag, I'm still left with the </ITEM_#> closing tag as a separate search/replace operation. The problem here is that the current operation replaces both the <ITEM and </ITEM tags...
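One possible reading of the output above, sketched in Python rather than Notepad++: if the engine treats \< as a "start of word" anchor instead of a literal <, then the pattern starts matching at ITEM_ in both the opening and the closing tag. This is an assumption about Notepad++'s regex flavour, not something confirmed in this thread. Python has no \< anchor, so \b stands in for it, and the sample string and variable names are made up for illustration.

```python
import re

# One record in the question's layout, one tag per line.
xml = "<LIST>\n<ITEM_0>\n<NAME>Item Name</NAME>\n</ITEM_0>\n</LIST>"

# Assumption: \< acts as a start-of-word anchor. \b is Python's nearest
# equivalent, so the match begins at "ITEM_" and never consumes "<" or "</".
with_anchor = re.sub(r"\bITEM_(.*)(>)", r"ITEM>\n<ID>\1</ID>", xml)
print(with_anchor)

# Matching a literal "<" instead: "</ITEM_0>" has "/" before "ITEM_",
# so only the opening tag is rewritten.
with_literal = re.sub(r"<ITEM_(.*)(>)", r"<ITEM>\n<ID>\1</ID>", xml)
print(with_literal)
```

Under that assumption the first call reproduces the unwanted extra <ID> line after the closing tag, while the second leaves </ITEM_0> untouched.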

Catch21

Posted 2016-07-31T11:26:56.903

Reputation: 1

Why not do a regular replace and replace the </ITEM_ with something else and then run your regex replacement? – Blerg – 2016-08-01T20:02:47.377

Yes, thanks, that would work but takes two replaces, whereas the solution below does both search/replaces in one regex and works OK (though the question there is still outstanding). – Catch21 – 2016-08-02T21:17:38.570

Answers

0

Yes, it's likely that the .* is too "greedy" and captures as many characters as it can; you need the opposite – the shortest possible match instead.

One method would be to use [^>]* instead – this would still match as many as possible, but only until the first >, so <ITEM_([^>]*)> would only match the opening tag and nothing more.

Depending on regex syntax, .*? might also work – this explicitly switches the * to "non-greedy".
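A minimal sketch of the three quantifier styles, using Python's re module as a stand-in for Notepad++'s engine. The single-line sample string is hypothetical, chosen so that a greedy .* has room to run past the first >.

```python
import re

# Hypothetical sample with the whole record on one line.
line = "<ITEM_0><NAME>Item Name</NAME></ITEM_0>"

# Greedy: .* extends to the LAST ">" on the line.
greedy = re.search(r"<ITEM_(.*)>", line).group(1)

# Negated class: stops at the first ">" because ">" itself is excluded.
negated = re.search(r"<ITEM_([^>]*)>", line).group(1)

# Lazy: .*? takes as few characters as possible before the next ">".
lazy = re.search(r"<ITEM_(.*?)>", line).group(1)

print(repr(greedy))
print(repr(negated))
print(repr(lazy))
```

When every tag sits on its own line and . does not match newlines, all three patterns capture the same text, so the difference only shows up once several tags share a line.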

user1686

Posted 2016-07-31T11:26:56.903

Reputation: 283 655

0

Thanks grawity, that helped me broaden my search to cover multiple search-and-replace operations in one regex.

Trying the following works:

Find: </ITEM_.*(>)|<ITEM_(.*)(>)
Replace: (?1</ITEM>)(?2<ITEM>\n<ID>\2</ID>)
RegEx

The | separates the two alternatives being searched for, and (?1...) and (?2...) insert their text only when capture group 1 or capture group 2, respectively, took part in the match.
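For comparison, the same one-pass idea can be sketched in Python. Python's re module has no (?1...) conditional in replacement strings, so a callback function plays that role here. The name rewrite_tag and the sample string are made up for illustration, and [^>]* is swapped in for .* as suggested in the other answer.

```python
import re

xml = (
    "<LIST>\n<ITEM_0>\n<NAME>Item Name</NAME>\n</ITEM_0>\n"
    "<ITEM_9999>\n<NAME>Item Name</NAME>\n</ITEM_9999>\n</LIST>"
)

# Closing tag first, opening tag second; only the opening tag captures.
pattern = re.compile(r"</ITEM_[^>]*>|<ITEM_([^>]*)>")

def rewrite_tag(match):
    # Group 1 is set only when the opening-tag alternative matched.
    if match.group(1) is not None:
        return f"<ITEM>\n<ID>{match.group(1)}</ID>"
    return "</ITEM>"

result = pattern.sub(rewrite_tag, xml)
print(result)
```

Because only one group exists, the callback needs a single test to tell the two alternatives apart.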

But I have to look for the closing </ITEM tag first, not the <ITEM tag as you would logically expect. So I have a solution, but can anyone explain why the above works while the following, which looks for the <ITEM tag first, fails, when all we're doing is reversing the order of the alternatives?

Find: <ITEM_(.*)(>)|</ITEM_.*(>)
Replace: (?1<ITEM>\n<ID>\1</ID>)(?2</ITEM>
RegEx

Not essential, but enquiring minds might like to know. Thanks.
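A sketch that may shed light on the ordering question, again using Python's re module as a stand-in: capture groups are numbered by the position of their opening parenthesis across the whole pattern, not per alternative. The dictionary name groups_seen is made up for illustration.

```python
import re

# The two Find patterns from above, unchanged.
closing_first = re.compile(r"</ITEM_.*(>)|<ITEM_(.*)(>)")
opening_first = re.compile(r"<ITEM_(.*)(>)|</ITEM_.*(>)")

groups_seen = {}
for label, pattern in (("closing first", closing_first),
                       ("opening first", opening_first)):
    for tag in ("<ITEM_0>", "</ITEM_0>"):
        # groups() lists groups 1, 2, 3; None means "did not take part".
        groups_seen[(label, tag)] = pattern.fullmatch(tag).groups()
        print(label, tag, groups_seen[(label, tag)])
```

With the closing tag first, group 1 is set only for closing tags and group 2 only for opening tags, so ?1 and ?2 each fire once. With the opening tag first, groups 1 and 2 both belong to the opening-tag alternative and the closing tag sets only group 3, so a replacement that tests ?1 and ?2 would presumably emit both pieces for an opening tag and nothing for a closing tag.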

Catch21

Posted 2016-07-31T11:26:56.903

Reputation: 1