Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. imagery, CSS, JS, etc.). I only want the HTML files.
Google searches are completely useless.
Here's a command I've tried:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36" -A html --domain=www.example.com http://www.example.com
Our site is hybrid flat-PHP and CMS, so HTML "files" could be /path/to/page, /path/to/page/, /path/to/page.php, or /path/to/page.html.
I've even included -R js,css, but it still downloads the files, THEN rejects them (a pointless waste of bandwidth, CPU, and server load!).
What's the command you've tried so far? If the naming of files is consistent, you should be able to use the -R flag. Alternatively, you could use the --ignore-tags flag and ignore script and img tags. – ernie – 2014-01-31T17:12:45.297
Opposite: Exclude list of specific files in wget
– Ƭᴇcʜιᴇ007 – 2014-01-31T17:26:01.647

I've tried using --accept=html, but it downloads CSS files THEN deletes them. I want to prevent them from ever downloading. A HEAD request is fine, though -- e.g. I notice Length: 558 [text/css] on the files I don't want. If I could stop the request when the header doesn't return text/html, I'd be elated. – Nathan J.B. – 2014-01-31T17:36:58.467
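Classic wget (1.x) can't abort a transfer based on the Content-Type header, but the "check the header before fetching the body" idea can be scripted around it. Below is a minimal sketch of that approach: a helper that inspects a raw HTTP response header block and reports whether it declares text/html, which you could feed from a `curl -sI` (HEAD) request before handing the URL to wget. The function name `is_html` and the wiring around it are illustrative, not a standard tool.

```shell
#!/bin/bash
# is_html: exit 0 if the raw HTTP header block passed as $1 declares
# a text/html Content-Type, exit 1 otherwise.
is_html() {
  printf '%s' "$1" \
    | tr -d '\r' \
    | awk -F': *' 'tolower($1) == "content-type" && $2 ~ /^text\/html/ { found = 1 }
                   END { exit !found }'
}

# Example header blocks, as a HEAD request (e.g. `curl -sI "$url"`) would return.
html_hdrs=$'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n'
css_hdrs=$'HTTP/1.1 200 OK\r\nContent-Type: text/css\r\nContent-Length: 558\r\n'

# In a crawler loop you would do something like:
#   hdrs=$(curl -sI "$url")
#   if is_html "$hdrs"; then wget ... "$url"; else echo "skip $url"; fi
is_html "$html_hdrs" && echo "fetch"
is_html "$css_hdrs"  || echo "skip"
```

This costs one extra HEAD request per URL, but HEAD responses carry no body, so the CSS/JS/image payloads are never transferred, which addresses the "downloads THEN rejects" waste described above.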