wget --convert-links appending 'index.html'

2

I'm trying to mirror a website using wget.

Most of the links on the website point to subfolders, such as http://foo.com/x/.

However, when I use --convert-links, it rewrites the URL to http://foo.com/x/index.html.

Can anyone offer a solution to stop this filename from being appended to the link URL?

maxp

Posted 2011-04-26T09:05:02.580

Reputation: 248

Answers

3

When you browse to a URL such as http://example.com/foo/bar, what actually happens is this:

  1. You request http://example.com/foo/bar
  2. Website redirects you to http://example.com/foo/bar/
  3. You request http://example.com/foo/bar/
  4. Website looks for a default entry in the directory (what that is depends on the web server) and returns it. If there is no default entry, it returns either a directory listing or "Forbidden".

The default entry, as I said, depends on the web server and its settings.

Default entries include:

  • index.html
  • index.htm
  • index.php
  • index.cgi
  • default.htm¹
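
The request sequence above can be sketched roughly as follows. This is only an illustration of the idea, not how any particular web server is implemented: the function name `resolve`, the `allow_listing` switch, and the order of `DEFAULT_ENTRIES` are assumptions made for the example, and real servers make all of this configurable.

```python
import os

# Illustrative default-entry names, tried in order (an assumption for this sketch).
DEFAULT_ENTRIES = ["index.html", "index.htm", "index.php", "index.cgi", "default.htm"]


def resolve(doc_root, url_path, allow_listing=False):
    """Return (status, detail) for a request path, mimicking the steps above."""
    local = os.path.join(doc_root, url_path.lstrip("/"))
    if os.path.isdir(local):
        if not url_path.endswith("/"):
            # Step 2: directory requested without a trailing slash -> redirect.
            return 301, url_path + "/"
        for name in DEFAULT_ENTRIES:
            candidate = os.path.join(local, name)
            if os.path.isfile(candidate):
                # Step 4: serve the first default entry that exists.
                return 200, candidate
        if allow_listing:
            # No default entry: fall back to a directory listing.
            return 200, sorted(os.listdir(local))
        return 403, "Forbidden"
    if os.path.isfile(local):
        return 200, local
    return 404, "Not Found"
```

The sketch ignores path traversal and other concerns a real server has to handle; it only shows where the default entry comes into play.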

When operating locally rather than through a web server, the filesystem has no way to reply with a default entry, as it has no concept of websites, index.html, or anything like that. The sequence of events for a local filesystem would be like this:

  1. Open /path/to/example.com/foo/bar
  2. This file is a directory. Here's the list of files.
  3. Display the list of files.

When mirroring a website, wget cannot store the response for a directory URL as a file unless there is a default entry file inside that directory to hold the data, so it creates one (default: index.html). The --convert-links option rewrites the URLs in the files to ensure that they point to this newly created index.html file and not just the directory name.
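
As a rough illustration of that mapping, here is a minimal sketch. It approximates the idea only and is not wget's actual code: the helper names `local_path_for` and `converted_link` are hypothetical, and wget's real naming rules cover more cases (query strings, redirects, and so on).

```python
import posixpath
from urllib.parse import urlsplit


def local_path_for(url, default_name="index.html"):
    """Approximate the mirror path for a URL: host directory plus path, with a
    default file name appended when the URL names a directory."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # A directory URL has no file name of its own, so one has to be invented.
        path += default_name
    return parts.netloc + path


def converted_link(from_url, to_url):
    """Relative link from one mirrored page to another, as it would work on disk."""
    start_dir = posixpath.dirname(local_path_for(from_url))
    return posixpath.relpath(local_path_for(to_url), start=start_dir)
```

Under these assumptions, `local_path_for("http://foo.com/x/")` gives `foo.com/x/index.html`, and a link from `http://foo.com/` to `http://foo.com/x/` becomes `x/index.html`, which is the rewriting described in the question.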

If the website doesn't have a default entry, it will send the directory listing, nicely formatted (if permissions allow). This will get saved in the index.html file.

This is the desired behavior, as it ensures that when you click a link locally it points to the file you want to see and not the directory that contains the file. This is the whole point of using the --convert-links option. You cannot have a local copy of the website without local index.html files. Anything else would break the local copy of the site.

So no, you cannot stop --convert-links from appending index.html, as the converted links need it in order to work locally.
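
If the mirror is going to be re-hosted behind a web server that serves index.html as its default entry, one possible workaround is to post-process the converted files afterwards. The sketch below is an assumption-laden illustration, not a wget feature: `strip_index` is a hypothetical helper, it only handles quoted `href` attributes, and editing HTML with a regular expression is fragile. Links rewritten this way will no longer work when browsing the copy straight from the filesystem, for the reasons given above.

```python
import re

# Matches href="...index.html" where index.html is the final path segment,
# optionally followed by a #fragment. Only quoted attribute values are handled.
_INDEX_HREF = re.compile(
    r"""(href\s*=\s*)(["'])((?:[^"'#?]*/)?)index\.html(#[^"']*)?\2""",
    re.IGNORECASE,
)


def strip_index(html):
    """Rewrite links ending in index.html so they point at the directory instead."""
    def repl(match):
        prefix, quote, directory = match.group(1), match.group(2), match.group(3)
        fragment = match.group(4) or ""
        # A bare "index.html" refers to the current directory, written as "./".
        target = directory or "./"
        return f"{prefix}{quote}{target}{fragment}{quote}"

    return _INDEX_HREF.sub(repl, html)
```

Applied to each mirrored HTML file, this would turn `href="x/index.html"` back into `href="x/"` while leaving links to other file names untouched.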

¹ This one is Microsoft-specific - trust them to do it completely differently from everyone else.

Majenko

Posted 2011-04-26T09:05:02.580

Reputation: 29 007

1 I have no issues with the file it's creating, or its name, but when wget rewrites the link it insists on including 'index.html' in the anchor tag, which is great if I only use my filesystem and web browser, but it won't allow me to host it on my web server and use 'default entries' to specify what default filename to look for. – maxp – 2011-04-26T10:52:09.427