How to exclude HTML elements by ID or class using HTTrack or bash script

0

I need to copy a website, but I want to exclude specific HTML elements, selected by ID or class, from every page (using a bash script).

I need this because I don't want the ads or other annoying features that some websites include.

pak memek

Posted 2010-05-28T17:49:31.560

Reputation: 1

Answers

0

This isn't a full answer, but it was too big for the comment box and I hope it puts you on the right track.

Stripping those elements out while HTTrack is processing the files might be hard. HTTrack has a C API for plug-ins that seems to provide hooks you could use to strip out chunks of the document before HTTrack scans it for further URLs to download, but I don't see a scripting interface for that (someone else might have written one, though).

Stripping those things out after HTTrack has downloaded the files would be easier, but bash probably isn't the best choice here, because you'd have to spend a lot of time teaching bash how to parse HTML. You'd probably do better to use a more advanced scripting language that has good HTML/DOM libraries built-in or freely available (Perl, Python, Ruby, etc.).
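To give a rough idea of the post-processing route, here is a minimal Python sketch that uses only the standard library's `html.parser`. The names `ElementStripper`, `strip_elements` and `clean_mirror`, and the example IDs and classes, are made up for illustration and are not part of HTTrack. It assumes the pages are reasonably well-formed (every non-void element has a closing tag) and UTF-8 encoded; for messy real-world HTML a proper DOM library such as Beautiful Soup or lxml would likely be more robust.

```python
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementStripper(HTMLParser):
    """Re-emits HTML, dropping elements whose id or class is excluded."""

    def __init__(self, ids=(), classes=()):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.ids = set(ids)
        self.classes = set(classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def _excluded(self, attrs):
        attrs = dict(attrs)
        if attrs.get("id") in self.ids:
            return True
        return any(c in self.classes for c in (attrs.get("class") or "").split())

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._excluded(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if not self.skip_depth and not self._excluded(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_elements(html, ids=(), classes=()):
    parser = ElementStripper(ids, classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def clean_mirror(root, ids=(), classes=()):
    """Rewrite every .html file under root in place (assumes UTF-8 files)."""
    for path in Path(root).rglob("*.html"):
        text = path.read_text(encoding="utf-8", errors="surrogateescape")
        path.write_text(strip_elements(text, ids, classes),
                        encoding="utf-8", errors="surrogateescape")
```

You would run something like `clean_mirror("path/to/mirror", ids=["popup"], classes=["ad"])` on a copy of the directory HTTrack produced, with the path, IDs and classes replaced by whatever the site you mirrored actually uses. The depth counter is the fragile part: a page that omits closing tags (for example an unclosed `<p>` inside an ad block) will throw it off, which is the main reason a real DOM parser is the safer choice.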

Spiff

Posted 2010-05-28T17:49:31.560

Reputation: 84 656