How to exclude HTML elements by ID or class using HTTrack or bash script

This isn't a full answer, but it was too big for the comment box and I hope it puts you on the right track.

Stripping those things out while HTTrack is processing the files might be hard. HTTrack has a C API for plug-ins that seems to provide hooks you could use to strip chunks out of a document before HTTrack scans it for further URLs to download, but I don't see a scripting interface for that (someone else might have written one, though).

Stripping those things out after HTTrack has downloaded the files would be easier, but bash probably isn't the best tool for it: it has no notion of HTML structure, so you'd spend a lot of time teaching it how to parse HTML. You'd probably do better with a scripting language that has good HTML/DOM libraries built in or freely available (Perl, Python, Ruby, etc.).
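To give a rough idea of the post-processing approach, here is a minimal sketch in Python that uses only the standard library's `html.parser`. The names `ElementStripper`, `strip_elements` and `strip_mirror` are made up for this example, and so are the `nav` ID and `ad` class in the usage note. It assumes the markup is reasonably well nested and that the mirrored files are UTF-8; on messier pages a real DOM library such as Beautiful Soup or lxml would likely hold up better.

```python
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ElementStripper(HTMLParser):
    """Re-emit a document, dropping elements whose id or class matches."""

    def __init__(self, ids=(), classes=()):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. untouched
        self.ids = set(ids)
        self.classes = set(classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def _excluded(self, attrs):
        a = dict(attrs)
        if a.get("id") in self.ids:
            return True
        return bool(self.classes & set((a.get("class") or "").split()))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._excluded(attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested scope.
        if not self.skip_depth and not self._excluded(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_elements(html, ids=(), classes=()):
    """Return html with the matching elements (and their contents) removed."""
    parser = ElementStripper(ids, classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def strip_mirror(root, ids=(), classes=()):
    """Rewrite every .html file under root in place (assumes UTF-8 files)."""
    for path in Path(root).rglob("*.html"):
        text = path.read_text(encoding="utf-8")
        path.write_text(strip_elements(text, ids, classes), encoding="utf-8")
```

You would run something like `strip_mirror("path/to/mirror", ids=["nav"], classes=["ad"])` after HTTrack finishes, ideally on a copy of the mirror first. Because the sketch tracks nesting with a simple depth counter, unclosed tags such as a bare `<p>` or `<li>` inside an excluded element can throw it off, which is the main reason to reach for a proper DOM library on real-world pages.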

Spiff

Posted 2010-05-28T17:49:31.560

