Trouble using wget or httrack to mirror archived website

12

9

I am trying to use wget to create a local mirror of a website. But I am finding that I am not getting all the linking pages.

Here is the website

http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/

I don't want all pages that begin with web.archive.org, but I do want all pages that begin with http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/.

When I use wget -r, in my file structure I find

web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/index.html,

but I don't have all files that are part of this database, e.g.

web.archive.org/web/20110808041151/http://cst-www.nrl.navy.mil/lattice/struk/d0c.html.

Perhaps httrack would do better, but right now that's grabbing too much.

So, by which means is it possible to grab a local copy of an archived website from the Internet Archive Wayback Machine?

user695322

Posted 2013-01-10T15:10:38.027

Reputation: 121

Man! I tried to mirror exactly the same page (and got really angry that I hadn't done it while the original site was still online, which would have been much easier). I think one problem is that not all files are accessible under the 20110722080716 snapshot, hence wget's -np option won't help. – mpy – 2014-02-01T11:16:05.250

Have you checked manually that the missing pages are actually archived? Archive.org doesn't always archive every single page. – nitro2k01 – 2014-02-03T09:44:32.120

Answers

20

While helpful, prior responses fail to concisely, reliably, and repeatably solve the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest httrack-based solution.

Background

Before we get to that, however, consider perusing mpy's well-written response. In h[is|er] sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and honestly obfuscatory) archival scheme.

Unsurprisingly, it ain't pretty. Rather than sanely archiving sites into a single directory, The Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement.

Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?

Prior Solution 1: wget

The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending wget for Wayback mirroring. Naturally, that recommendation is fundamentally unsound.

In the absence of complex external URL rewriting (e.g., Privoxy), wget cannot be used to reliably mirror Wayback-archived sites. As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site – which, in the worst case, means "the entire Internet."

A concrete example is in order. When mirroring the example domain kearescue.com, your mirroring tool must:

  • Include all URLs matching https://web.archive.org/web/*/http://kearescue.com. These are assets provided by the target site (e.g., https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js).
  • Exclude all other URLs. These are assets provided by other sites merely linked to from the target site (e.g., https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js).

Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally-hosted assets (e.g., YouTube videos).
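To make the include/exclude rule above concrete, here is a minimal sketch in plain POSIX shell. The function name is_target_url is made up for illustration and is not part of any tool; it only checks whether a Wayback URL wraps the target domain, using the same "*/domain/*" idea as the rule above.

```shell
# Hypothetical helper (not part of wget or httrack): succeed only when
# the Wayback URL in $2 wraps a resource of the target domain in $1.
is_target_url() {
    case "$2" in
        # Any snapshot directory, then ".../<domain>/..." somewhere after it.
        *web.archive.org/web/*/"$1"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: keep only the lines of a URL list that belong to the target site.
# while read -r url; do
#     is_target_url kearescue.com "$url" && printf '%s\n' "$url"
# done < urls.txt
```

Note that, like the pattern it imitates, this requires a trailing slash after the domain name, so a bare top-level URL without one would not match.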

That would be bad. While wget does provide a command-line --exclude-directories option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose * syntax matches zero or more characters excluding /. Since the URLs to be excluded contain arbitrarily many / characters, wget cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. Period. End of unfortunate story.

This issue has been on public record since at least 2009. It has yet to be resolved. Next!

Prior Solution 2: Scrapbook

Prinz recommends ScrapBook, a Firefox plugin. A Firefox plugin.

That was probably all you needed to know. While ScrapBook's Filter by String... functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution" – namely, the problem of extraneous duplicates.

It's questionable whether ScrapBook even adequately addresses the former problem. As mpy admits:

Although Scrapbook failed so far to grab the site completely...

Unreliable and overly simplistic solutions are non-solutions. Next!

Prior Solution 3: wget + Privoxy

mpy then provides a robust solution leveraging both wget and Privoxy. While wget is reasonably simple to configure, Privoxy is anything but reasonable. Or simple.

Due to the imponderable technical hurdle of properly installing, configuring, and using Privoxy, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate to large-scale automation than the average webmaster attempting to recover small- to medium-scale sites.

Is wget + Privoxy worth a look? Absolutely. But most superusers might be better serviced by simpler, more readily applicable solutions.

New Solution: httrack

Enter httrack, a command-line utility implementing a superset of wget's mirroring functionality. httrack supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."

In the abstract example below, replace:

  • ${wayback_url} by the URL of the top-level directory archiving the entirety of your target site (e.g., 'https://web.archive.org/web/20140517175612/http://kearescue.com').
  • ${domain_name} by the same domain name present in ${wayback_url} excluding the prefixing http:// (e.g., 'kearescue.com').

Here we go. Install httrack, open a terminal window, cd to the local directory you'd like your site to be downloaded to, and run the following command:

httrack\
    ${wayback_url}\
    '-*'\
    '+*/${domain_name}/*'\
    -N1005\
    --advanced-progressinfo\
    --can-go-up-and-down\
    --display\
    --keep-alive\
    --mirror\
    --robots=0\
    --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'\
    --verbose

On completion, the current directory should contain one subdirectory for each filetype mirrored from that URL. This usually includes at least:

  • css, containing all mirrored CSS stylesheets.
  • html, containing all mirrored HTML pages.
  • js, containing all mirrored JavaScript.
  • ico, containing one mirrored favicon.

Since httrack internally rewrites all downloaded content to reflect this structure, your site should now be browsable as is without modification. If you prematurely halted the above command and would like to continue downloading, append the --continue option to the exact same command and retry.
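If you would rather pass the two values as shell arguments than paste them into the command by hand, something like the following sketch may help. The function name mirror_wayback and the HTTRACK override variable are assumptions made for this example; the httrack options are a trimmed subset of the ones listed above. The filter is double-quoted here so the shell expands ${domain_name} while the * characters still reach httrack untouched.

```shell
# Hypothetical wrapper around the command above.
#   $1 = Wayback URL of the site's top-level directory
#   $2 = bare domain name (no http:// prefix)
mirror_wayback() {
    wayback_url=$1
    domain_name=$2
    # HTTRACK is an assumed override for dry runs; it defaults to httrack.
    ${HTTRACK:-httrack} \
        "$wayback_url" \
        '-*' \
        "+*/${domain_name}/*" \
        -N1005 \
        --can-go-up-and-down \
        --keep-alive \
        --mirror \
        --robots=0
}
```

Running HTTRACK=echo mirror_wayback 'https://web.archive.org/web/20140517175612/http://kearescue.com' kearescue.com prints the arguments instead of downloading anything, which is a cheap way to check the filters before committing to a long mirror.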

That's it. No external contortions, error-prone URL rewriting, or rule-based proxy servers required.

Enjoy, fellow superusers.

Cecil Curry

Posted 2013-01-10T15:10:38.027

Reputation: 310

I'm glad to hear that at least one person read my answer thoroughly. And thanks for your further analysis and the httrack solution. +1 – mpy – 2014-06-21T21:28:09.037

The httrack solution was perfect, thank you so much! – ChrisChinchilla – 2015-03-25T14:03:51.210

Glad to be of minor assistance, guys. Given how gut-wrenchingly awful this tapestry of woe and deceit was to unravel, I just had to share my findings. – Cecil Curry – 2015-04-20T05:45:31.780

To remove rate transfer limit add these parameters: --disable-security-limits --max-rate=0 – Oswaldo – 2017-06-19T18:45:08.127

7

Unfortunately, none of the answers was able to solve the problem of making a complete mirror of an archived website (without duplicating every file dozens of times). So I hacked together another approach. Hacked is the important word, as my solution is neither a general solution nor a very simple (read: copy&paste) one. I used the Privoxy Proxy Server to rewrite the files on-the-fly while mirroring with wget.

But first, what is so difficult about mirroring from the Wayback Machine?

Problem 1 + Solution

The Wayback toolbar is handy for interactive use, but might interfere with wget. So get rid of it with a privoxy filter rule

FILTER: removewaybacktoolbar remove Wayback toolbar
s|BEGIN WAYBACK TOOLBAR INSERT.*END WAYBACK TOOLBAR INSERT|Wayback Toolbar removed|s

Problem 2 + Solution

I wanted to capture the whole site, so needed a not-too-small recursion depth. But I don't want wget to crawl the whole server. Usually you use the no-parent option -np of wget for that purpose. But that will not work here, because you want to get

http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/struk/hcp.html

but also

http://web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html

(notice the changed timestamp in the paths). Omitting -np makes wget crawl up to (...)http://cst-www.nrl.navy.mil and eventually retrieve the whole navy.mil site. I definitely don't want that! So this filter tries to emulate the -np behavior with the Wayback Machine:

FILTER: blocknonparentpages emulate wget -np option
s|/web/([0-9].*)/http://cst-www.nrl.navy.mil/lattice/|THIS_IS_A_GOOD_$1_ADDRESS|gU
s|/web/(.*)/http(.*)([" ])|http://some.local.server/404$3|gU
s|THIS_IS_A_GOOD_(.*)_ADDRESS|/web/$1/http://cst-www.nrl.navy.mil/lattice/|gU

I'll leave it as an exercise to dig into the syntax. What this filter does is the following: It replaces all Wayback URLs like http://web.archive.org/web/20110801041529/http://www.nrl.navy.mil/ with http://some.local.server/404 as long as they do not contain http://cst-www.nrl.navy.mil/lattice/.

You have to adjust http://some.local.server/404. Its purpose is to send a 404 error to wget. Privoxy can probably do that more elegantly. However, the easiest way for me was just to rewrite the link to a non-existent page on a local HTTP server, so I stuck with this.

You also need to adjust both occurrences of http://cst-www.nrl.navy.mil/lattice/ to reflect the site you want to mirror.
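For anyone who wants to see the mark / blank / restore trick in action without setting up Privoxy, here is a rough stand-in using sed. It is not the Privoxy filter itself: sed has no non-greedy matching, so the patterns use character classes instead, and the function name emulate_no_parent is invented for this sketch. It reads HTML on stdin and writes the rewritten HTML to stdout.

```shell
# Hypothetical sed stand-in for the three filter lines above.
emulate_no_parent() {
    # Site to keep; the dots are left unescaped, so they match any character.
    site='http://cst-www.nrl.navy.mil/lattice/'
    sed -E \
        -e "s|/web/([0-9][^/\"]*)/${site}|THIS_IS_A_GOOD_\1_ADDRESS|g" \
        -e 's|/web/[^/"]*/http[^" ]*|http://some.local.server/404|g' \
        -e "s|THIS_IS_A_GOOD_([0-9]+[a-z_]*)_ADDRESS|/web/\1/${site}|g"
}

# Step 1 hides the wanted links behind a marker, step 2 points every
# remaining Wayback link at the dummy 404 page, step 3 restores the
# wanted links from the marker.
```

For example, printf '%s\n' '<a href="/web/20110801041529/http://www.nrl.navy.mil/">' | emulate_no_parent should turn the link into http://some.local.server/404, while a link below /lattice/ passes through unchanged.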

Problem 3 + Solution

And finally, some archived version of a page might link to a page in another snapshot. And that to yet another one. And so on... and you'll end up with a lot of snapshots of the same page -- and wget will never manage to finish until it has fetched all snapshots. I really don't want that either! Here it helps a lot that the Wayback Machine is very smart. You can request a file

http://web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html

even if it's not included in the 20110801041529 snapshot. It automatically redirects you to the correct one:

http://web.archive.org/web/20110731225728/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html

So, another privoxy filter to rewrite all snapshots to the most recent one

FILTER: rewritewaybackstamp rewrite Wayback snapshot date
s|/([0-9]{14})(.{0,3})/|/20120713212803$2/|g

Effectively every 14-digit-number enclosed in /.../ gets replaced with 20120713212803 (adjust that to the most recent snapshot of your desired site). This might be an issue if there are such numbers in the site structure not originating from the Wayback machine. Not perfect, but fine for the Strukturtypen site.
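The same substitution can be tried out on its own with sed, which may be handy for checking what it does to a given link before putting it into Privoxy. This is only a sketch: the function name normalize_snapshot is made up, and the target timestamp is passed as an argument instead of being hard-coded.

```shell
# Hypothetical sed equivalent of the rewritewaybackstamp filter.
#   $1 = the 14-digit snapshot timestamp to rewrite everything to
# Reads text on stdin, writes the rewritten text to stdout.
normalize_snapshot() {
    # Group 2 keeps an optional suffix of up to three characters (im_, js_, cs_).
    sed -E "s|/([0-9]{14})(.{0,3})/|/$1\2/|g"
}
```

For instance, piping /web/20110801041529im_/http://cst-www.nrl.navy.mil/lattice/a.gif through normalize_snapshot 20120713212803 should give /web/20120713212803im_/http://cst-www.nrl.navy.mil/lattice/a.gif.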

The nice thing about that is that wget ignores the new location it is redirected to and saves the file -- in the above example -- as web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html.

Using wget to mirror archived site

So, finally with these privoxy filters (defined in user.filter) enabled in user.action via

{ +filter{removewaybacktoolbar} +filter{blocknonparentpages} +filter{rewritewaybackstamp} }
web.archive.org

you can use wget as usual. Don't forget to tell wget to use the proxy:

export http_proxy="localhost:8118"
wget -r -p -k -e robots=off http://web.archive.org/web/20120713212803/http://cst-www.nrl.navy.mil/lattice/index.html

I used these options, but -m should work, too. You'll end up with the folders

20120713212803
20120713212803cs_
20120713212803im_
20120713212803js_

as the Wayback machine separates images (im_), style sheets (cs_) etc. I merged everything together and used some sed magic to replace the ugly relative links (../../../../20120713212803js_/http:/cst-www.nrl.navy.mil/lattice) accordingly. But this isn't really necessary.
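The merging step could look roughly like the sketch below. It only covers copying the sibling snapshot folders into one tree; the link rewriting is not shown, and the function name merge_snapshot_dirs and its arguments are assumptions for illustration.

```shell
# Hypothetical helper: copy the contents of every <stamp>* folder
# (e.g. 20120713212803, 20120713212803im_, ...) into one destination.
#   $1 = directory holding the snapshot folders (e.g. web.archive.org/web)
#   $2 = snapshot timestamp
#   $3 = destination directory
merge_snapshot_dirs() {
    web_dir=$1
    stamp=$2
    dest=$3
    mkdir -p "$dest"
    for dir in "$web_dir/$stamp"*/; do
        # Skip the unexpanded pattern if nothing matched.
        [ -d "$dir" ] || continue
        # "dir/." copies the folder's contents rather than the folder itself.
        cp -R "$dir". "$dest"/
    done
}
```

Files with the same relative path in two folders overwrite each other, so it is probably worth trying this on a copy first.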

mpy

Posted 2013-01-10T15:10:38.027

Reputation: 20 866

1

This was an invaluable answer. Your precise dissection of The Wayback Machine's internal site structure was key to the httrack-based solution I eventually stumbled upon. You rock, mpy.

– Cecil Curry – 2015-04-20T05:39:21.193

5

wget

--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using -r together with -l can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with "leaf documents" that are missing their requisites.

For instance, say document 1.html contains an "<IMG>" tag referencing 1.gif and an "<A>" tag pointing to external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to 3.html. Say this continues up to some arbitrarily high number.

-m
--mirror

Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.

Note that Wget will behave as if -r had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:

wget -E -H -k -K -p http://<site>/<document>

So wget -E -H -k -K -p http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice will suit you best. But I recommend another tool: ScrapBook, a Firefox extension.

scrapbook

ScrapBook is a Firefox extension, which helps you to save Web pages and easily manage collections. Key features are lightness, speed, accuracy and multi-language support. Major features are:
* Save Web page
* Save snippet of Web page
* Save Web site
* Organize the collection in the same way as Bookmarks
* Full text search and quick filtering search of the collection
* Editing of the collected Web page
* Text/HTML edit feature resembling Opera's Notes

How to mirror a site
Install scrapbook and restart firefox

  1. Load page in browser [web page to be mirrored]
  2. Right click on the page -> Save page as ...
  3. Select the level from In depth Save and press Save
  4. Select Restrict to Directory/Domain from Filter

Wait for the mirroring to complete. After mirroring, you can access the website offline from the ScrapBook menu.

Prinz

Posted 2013-01-10T15:10:38.027

Reputation: 376

Although Scrapbook failed so far to grab the site completely, it was closer to a possible solution than the other suggestions. Especially its Filter by String... option was more helpful than to filter by host/domain. Hence, I award the bounty to you :) – mpy – 2014-02-08T11:16:55.317

0

There is already a tool that does that better:

wayback_machine_downloader domain.org 

To get it you need to have ruby installed. And then:

gem install wayback_machine_downloader

Eduard Florinescu

Posted 2013-01-10T15:10:38.027

Reputation: 2 116

0

Be careful with the below command because it grabs a lot. The 1 after the 'l' tells it to grab all pages for links on the site that are 1 level deep. If you want it to spider deeper change this to a 2 but it might never end because it could get caught in a loop.

wget -rHpkl 1 -e robots=off http://www.example.com/

I'm not sure which parts of the site you want to keep and which parts you don't care about but you should probably white list and/or blacklist the different parts of the site to get only what you want and to prevent yourself from downloading all of archive.org or the internet.

Use -D www.example.com,www.another.example.com to whitelist only the domains you want or use --exclude-domains www.example.com,www.another.example.com to blacklist what you don't want.

Michael Yasumoto

Posted 2013-01-10T15:10:38.027

Reputation: 583

Thanks, but the problem with white/blacklisting is that all archived websites come from the web.archive.org host. I want to mirror everything that wget -np would have mirrored when the original site was still online. -l doesn't help much either, since it has to be increased to 3 or 4, hence resulting in ascending the website hierarchy too much. – mpy – 2014-02-03T09:47:34.807

0

The format of Internet Archive URLs includes the date and time the site was archived. To save space, assets that haven't changed are linked back to a previous version of a site.

For example in this url http://web.archive.org/web/20000229123340/http://www.yahoo.com/ the date the site was crawled was Feb 29, 2000 at 12:33 and 40 seconds.

So to get all of http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/ you need to start at that but also grab all linked assets from http://web.archive.org/web/*/http://cst-www.nrl.navy.mil/lattice/.
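As a small illustration of the timestamp format described above, this sketch pulls the 14-digit stamp out of a Wayback URL and prints it as a readable date and time. The function name wayback_timestamp is invented for the example; it assumes the stamp directly follows /web/ as in the URLs shown here.

```shell
# Hypothetical helper: print the crawl date encoded in a Wayback URL
# as "YYYY-MM-DD hh:mm:ss", or fail if no 14-digit stamp is found.
wayback_timestamp() {
    # Extract the 14 digits after /web/ (ignoring suffixes such as im_).
    ts=$(printf '%s\n' "$1" | sed -n -E 's|^.*/web/([0-9]{14})[^/]*/.*$|\1|p')
    [ -n "$ts" ] || return 1
    # Split YYYYMMDDhhmmss into its parts.
    printf '%s\n' "$ts" |
        sed -E 's|^(.{4})(.{2})(.{2})(.{2})(.{2})(.{2})$|\1-\2-\3 \4:\5:\6|'
}
```

Calling wayback_timestamp 'http://web.archive.org/web/20000229123340/http://www.yahoo.com/' prints 2000-02-29 12:33:40, matching the example above.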

Brian

Posted 2013-01-10T15:10:38.027

Reputation: 8 439

Exactly, and that is the problem. Let's say page A links to B. So, the current version A links to old version B. But B includes also a link to A. So the old version of A gets retrieved, too and links again to older version. This (at a (needed) crawl depths of 4) leads to the result, that you end up with dozens of versions of the index page, but not all needed files. – mpy – 2014-02-08T10:15:14.363