How can I delete sections of HTML files in a batch of 700+ files?

First off:
I'm using the latest OSX and can edit the HTML files with CotEditor and KompoZer. I am, however, inexperienced with HTML editors in general :/
If I have to - because one of your answers is super simple and convenient that way - I could switch to a Win7 machine for this task.

The Problem:
I have a little over 700 HTML files that share the same basic structure. They are organized into many tables, and I need to delete certain content from all of them: always the same rows with titles, and below them columns with varying content. If I could specify something like "delete the whole column that contains e.g. "Name" in the top cell", that would do. I also need to delete recurring parts, which in theory can be found & replaced in every file, but I need to do that in batch somehow.

Can you help me out? Will KompoZer do the batch-trick or do you have another recommendation? Thanks :)

-----EDIT-----
I tried TextWrangler for its batch find & replace capabilities, and it works very well for finding recurring code across many files, so I know how I'll get the exact same bits out of every file. That leaves me with the varying content.

Is there a way to find content between two recurring points? For example, if the text before and after the content I wish to delete is always the same:

<tag> txt_a Content1_to_delete txt_b </tag>
<tag> txt_a Content2_to_delete txt_b </tag>
<tag> txt_a Content3_to_delete txt_b </tag>

So I'd need something like find & replace between <tag> txt_a and txt_b </tag>, or even find & replace starting at <tag> txt_a up to and including txt_b </tag>.

This is the troublesome bit for me, where I really need assistance.
-----EDIT2-----
After Gombai Sándor's answer in combination with Dooley_labs' comment I got some ideas, and while the sed variant will work from the terminal, I chose TextWrangler to do the work.
TextWrangler can do Find & Replace across multiple files, and it also accepts regular expressions via a "grep" option. I learned about regular expressions and was able to resolve my issue. The "magical" bit for me was getting the wildcards right, especially the simple .*. To anyone who'd like to mess around with regular expressions I recommend this site, which I found very useful: regexr.com

QuentinS

Posted 2016-03-02T13:53:56.053

Reputation: 13

Maybe a regex would help in this case? I've never seen a text editor that can do that, but I've not looked into it. If you can find one, I'm interested. – Dooley_labs – 2016-03-02T14:16:44.423

@Dooley_labs I found that TextWrangler (or its feature-enhanced, paid version BBEdit) does Find & Replace not only across multiple files, but also has a grep option to enter regular expressions to find :) – QuentinS – 2016-03-05T13:08:06.120

I just found that out yesterday, but thanks! xD – Dooley_labs – 2016-03-07T03:14:29.587

Answers

Most common general-purpose IDEs offer (regexp) search & replace in files (within a directory structure). Even small editors tend to offer this feature; on Windows, Notepad++ is a good example.

For OSX, where you have the usual shell tools, this is a typical task for sed, which is an editor itself... a very special editor.

Provided that all the files are in the same directory and you are standing in that directory, you can use this to delete the unneeded parts and write the output to files ending in .htm, which (after checking them) you can rename to .html.

$ cat just-an-html.html
<tag> txt_a Content1_to_delete txt_b </tag>
<tag> txt_a Content2_to_delete txt_b </tag>
<tag> txt_a Content3_to_delete txt_b </tag>
$ for HTML in *.html; do sed -e 's@\(tag> txt_a\) .*\(txt_b </tag\)@\1 \2@g' "$HTML" > "$(basename "$HTML" html)htm" ; done
$ ls *.htm
just-an-html.htm
$ cat just-an-html.htm
<tag> txt_a txt_b </tag>
<tag> txt_a txt_b </tag>
<tag> txt_a txt_b </tag>
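
One caveat worth checking before running this on all 700 files: .* is greedy, so if two such tags ever share a single line, everything from the first txt_a to the last txt_b is removed. The following is only a sketch under assumptions, not a definitive fix: the file name sample-one-line.html and its one-line layout are made up for illustration, and the [^<]* variant assumes the content to delete never contains a "<".

```shell
# Hypothetical sample: two tags on ONE line (not the layout from the question).
printf '%s\n' '<tag> txt_a one txt_b </tag><tag> txt_a two txt_b </tag>' > sample-one-line.html

# Greedy .* runs from the first txt_a to the last txt_b,
# so the closing and opening tags in the middle disappear too.
sed -e 's@\(tag> txt_a\) .*\(txt_b </tag\)@\1 \2@g' sample-one-line.html > greedy.out

# [^<]* cannot cross a "<", so each tag is handled on its own.
# Assumption: the content to delete never contains "<".
sed -e 's@\(tag> txt_a\) [^<]*\(txt_b </tag\)@\1 \2@g' sample-one-line.html > bounded.out

cat greedy.out bounded.out
```

With one tag per line, as in the question, both patterns should give the same result; the bounded one only matters when tags share a line.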

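It's also possible to delete the substring directly inside the files (-i: in place), but I would not recommend that unless you have up-to-date backups. Note that the BSD sed on OSX expects a backup suffix after -i (possibly an empty one, as in -i ''), while GNU sed takes the suffix attached to the option; writing it as -i.bak works with both and keeps a copy of each original.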

$ cat just-an-html.html
<tag> txt_a Content1_to_delete txt_b </tag>
<tag> txt_a Content2_to_delete txt_b </tag>
<tag> txt_a Content3_to_delete txt_b </tag>
$ for HTML in *.html; do sed -i.bak -e 's@\(tag> txt_a\) .*\(txt_b </tag\)@\1 \2@g' "$HTML" ; done
$ cat just-an-html.html
<tag> txt_a txt_b </tag>
<tag> txt_a txt_b </tag>
<tag> txt_a txt_b </tag>
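
If the files are spread over subdirectories instead of sitting in one directory, the same substitution can probably be driven by find. This is a minimal sketch under assumptions: the demo-site tree and its file names are hypothetical and exist only so the example runs as-is, and the pattern is the one from the transcript above.

```shell
# Hypothetical directory tree, created only so the sketch can be run as-is.
mkdir -p demo-site/sub
printf '%s\n' '<tag> txt_a Content1_to_delete txt_b </tag>' > demo-site/top.html
printf '%s\n' '<tag> txt_a Content2_to_delete txt_b </tag>' > demo-site/sub/deep.html

# Apply the substitution to every .html file below demo-site.
# -i.bak is accepted by both GNU sed and the BSD sed shipped with OSX,
# and keeps each original next to the edited file as <name>.html.bak.
find demo-site -type f -name '*.html' \
    -exec sed -i.bak -e 's@\(tag> txt_a\) .*\(txt_b </tag\)@\1 \2@g' {} +

# Review the changes; diff exits non-zero when files differ, hence "|| true".
find demo-site -type f -name '*.html' | while IFS= read -r HTML; do
    diff "$HTML.bak" "$HTML" || true
done
```

Once the diffs look right, the .bak files can be removed; until then they serve as the backup.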

Gombai Sándor

Posted 2016-03-02T13:53:56.053

Reputation: 3 325

Do I understand you correctly: you basically suggest taking everything before and after the content to delete and then merging it into one new .htm file? – QuentinS – 2016-03-02T22:02:35.427

Yes, the safe one creates one .htm for each .html, with almost the same content but with the pattern deleted. – Gombai Sándor – 2016-03-02T23:34:33.553

Remember, you can't parse [X]HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

– aaaaa says reinstate Monica – 2019-01-23T20:26:36.033