First off, this seems to be an OS X only problem. I can use the above command on Ubuntu 14.04 LTS and it works out of the box! A few suggestions:
`.css` files, images, etc. do not seem to be downloaded - at least, not up to the point my run has reached (OK, maybe they would be downloaded if the process completed, so we may skip this one).
- When you say `--domains wikispaces.com`, you will not be downloading linked CSS files located on other domains. Some of the stylesheets on that website are located on http://c1.wikicdn.com, as the source of index.html suggests.
- Some websites do not allow you to access their linked files (referenced images) directly via their URL (see this page); you can only view them through the website itself. That does not seem to be the case here, though.
- Wget does not seem to recognize comments while parsing the HTML. I see the following while Wget is running:
--2016-07-01 04:01:12-- http://chessprogramming.wikispaces.com/%3C%25-%20ws.context.user.imageUrlPrefix%20%25%3Elg.jpg
Reusing existing connection to chessprogramming.wikispaces.com:80.
HTTP request sent, awaiting response... 404 Not Found
2016-07-01 04:01:14 ERROR 404: Not Found.
Opening the link in a browser takes you to a login page. The name of the file suggests that it occurred somewhere in the comments.
- Many sites do not allow themselves to be downloaded with download managers, so they check which client originated the HTTP request (i.e. the browser, or whatever client you used to request a file from their server). Use `-U somebrowser` to fake the client and pretend to be a browser. For example, `-U mozilla` can be added to tell the server that Mozilla/Firefox is requesting the page (see the sketch after this list). This, however, is not the issue here, since I can download the site without this argument.
- The download and request rate matters. Servers do not want their performance dragged down by robots requesting data from their site. Use the `--limit-rate=` and `--wait=` arguments in Wget to limit the download rate and to wait a few seconds between the GET requests for individual files, e.g.

wget -r --wait=5 --limit-rate=100K <other arguments>

to wait 5 seconds between GET requests and limit the download rate to 100 KB/s. Once again, this is not the issue here, because the server did not require me to limit the download rate to fetch the website.
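As a sketch of the last two points combined (the user-agent string and the wait/rate values below are arbitrary examples of mine, not something this particular server is known to require):

wget -r --wait=5 --limit-rate=100k -U "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0" http://chessprogramming.wikispaces.com/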
The most likely case here is the first point: the `--domains wikispaces.com` restriction. As far as I can tell, `--domains` only matters once host spanning is enabled; without `-H` (`--span-hosts`), Wget will not follow links onto other hosts at all, so the stylesheets on wikicdn.com are never fetched. Add `-H`, and either drop `--domains` or extend its list to include the CDN domain, then try again. Let's see where we get; you should be able to fetch the CSS files at least.
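For example, something along these lines (only a sketch: the domain list is my guess based on the stylesheet host mentioned above, and the remaining flags just stand in for whatever you are already passing):

wget -r -p -k -E -H --domains=wikispaces.com,wikicdn.com http://chessprogramming.wikispaces.com/

Here `-H` (`--span-hosts`) lets Wget follow links onto other hosts, while `--domains` keeps it from wandering off to unrelated sites.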
"NO html extension is being added"

The HTML extension is being added when I run the command.
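For what it is worth, the flag that controls this is `-E` (called `--adjust-extension` in newer Wget releases, `--html-extension` in older ones); assuming it is not already in your command, it would be added like so, with the other flags just standing in for whatever you already pass:

wget -r -E -k -p http://chessprogramming.wikispaces.com/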
"Links are not converted"

I am not sure I am totally correct here, but do not expect links to work out of the box when you mirror a site.
When you pass arguments in an HTTP GET request (for example, http://chessprogramming.wikispaces.com/wiki/xmla?v=rss_2_0 has the argument v=rss_2_0), the request is handled by some script running on the server, for example PHP. The arguments help the server return the correct version of the page for the given argument(s). Remember, when you are mirroring a site, especially a wiki that runs on PHP, you cannot exactly mirror it unless you fetch the original PHP scripts. The HTML pages returned by the PHP scripts are just one face of the page you can expect to see for a given script; the actual logic that generates the page is stored on the server, and the site will only mirror correctly if you fetch the original PHP files, which you cannot do over HTTP. For that you would need FTP access to the server.
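As a side note: if you decide you do not need those parameterised URLs in the mirror at all, Wget 1.14 and later can filter them out with a URL regular expression; a sketch, assuming the query-string pages are expendable and that the other flags stand in for whatever you already pass:

wget -r -p -k -E --reject-regex '\?' http://chessprogramming.wikispaces.com/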
Hope this helps.
I have a similar problem. Using OS X 10.10 with wget 1.18. I run `wget -mkpr https://consoreddomain.com` and all I get is a directory with a single index.html page in it. Would be nice if this could receive an answer. – Julian – 2016-06-30T13:49:02.333

I did something like this in the past and ended up abandoning some wget-based solutions and installing Heritrix (open source). It was a little challenging to get it set up, but it did an excellent job of archiving the site. – GuitarPicker – 2016-06-30T18:39:30.767

@Dr.Kameleon Um... wget seems to have a lot of bugs on OSX... do you want an alternative answer using cURL? – rahuldottech – 2016-07-03T10:38:49.777
@Julian If you are not able to fix the problem under OS X you can always "break a (butter)fly on the wheel": use an Ubuntu live system (pen drive), or a virtual machine, just to download it. :-) The second can be cosy for many other purposes. – Hastur – 2016-07-07T08:31:55.157