Downloading a large site with wget

I'm trying to mirror a very large site but wget never seems to finish properly. I am using the command:

wget -r -l inf -nc -w 0.5 {the-site}

I have downloaded a good portion of the site, but not the whole thing. The content does not change fast enough to bother using time-stamping.

After running overnight, this message appears:

File `{filename}.html' already there; not retrieving.
File `{filename}.html' already there; not retrieving.
File `{filename}.html' already there; not retrieving.
File `{filename}.html' already there; not retrieving.
Killed

Does anyone know what is happening and how I can fix it?

Evan Gill


Answers

Have you tried the '-m' option? It is a shortcut for:

-N -r -l inf --no-remove-listing

You can also experiment on the site with a deeper URL that covers a limited set of files, and avoid fetching parent paths with:

-np
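For example, a limited test run on one section of the site might look like this (the URL is only a placeholder for a deeper path on the real site):

wget -m -np http://www.example.com/section/subsection/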

nik


I have not tried -m because it is not compatible with -nc, and I don't want to overload the servers by downloading every single page when I already have most of them. I have also tried -np and it made little difference for this site. I will try using a series of deeper URLs.

Thanks. – Evan Gill – 2010-06-15T17:16:07.340

@Evan: -m works in a similar way to -nc -- it will not re-download files unless the server has a newer version. Most web servers support those checks. – user1686 – 2010-06-15T20:19:47.603

Thanks. I have written up a script to use the -m flag and wget each level 2 directory separately, sleeping between each directory. I'll run it tonight and update the question/answers. – Evan Gill – 2010-06-15T20:36:36.170

You might want to look at the -w option too. – nik – 2010-06-16T02:27:30.717

Downloading each subdirectory separately solved the problem. I tried to get the entire site with one large recursive wget, but it always used up my entire 4 GB of memory. In the end, I fetched everything 2 levels deep with -l 2, then ran a for loop to perform a recursive wget on each of the directories in the site with this command:

wget -m -w 1 --random-wait -np www.fakesite.com/directory – Evan Gill – 2010-06-22T16:34:34.800
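A minimal sketch of that kind of loop, assuming the level-2 directory names have already been collected into a plain-text list (the dirs.txt file name and the 30-second pause between runs are only placeholders):

#!/bin/sh
# Mirror each directory in its own wget run, pausing between runs
# to avoid hammering the server. dirs.txt holds one directory name per line.
while read dir; do
    wget -m -w 1 --random-wait -np "www.fakesite.com/$dir"
    sleep 30
done < dirs.txt

As noted in the comments above, -m will not re-download files unless the server has a newer version, so a loop like this can be restarted if it gets killed partway through.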