How to download a website from the archive.org Wayback Machine?

88

42

I want to get all the files for a given website at archive.org. Reasons might include:

  • the original author did not archive their own website and it is now offline; I want to make a public cache of it
  • I am the original author of some website and lost some content. I want to recover it
  • ...

How do I do that?

Keep in mind that the archive.org Wayback Machine is very special: webpage links do not point to the archive itself, but to a web page that might no longer exist. JavaScript is used client-side to rewrite the links, so a trick like a recursive wget won't work.

user36520

Posted 2014-10-20T10:16:39.213

Reputation: 1 795

14

I've come across the same issue and coded a gem. To install: gem install wayback_machine_downloader. Run wayback_machine_downloader with the base URL of the website you want to retrieve as a parameter: wayback_machine_downloader http://example.com. More information: https://github.com/hartator/wayback_machine_downloader

– Hartator – 2015-08-10T06:32:40.320

3

A step-by-step guide for Windows users (Windows 8.1 64-bit for me) new to Ruby; here is what I did to make it work: 1) I installed Ruby from http://rubyinstaller.org/downloads/ and ran "rubyinstaller-2.2.3-x64.exe". 2) Downloaded the zip file https://github.com/hartator/wayback-machine-downloader/archive/master.zip. 3) Unzipped it on my computer. 4) Searched the Windows Start menu for "Start command prompt with Ruby" (to be continued)

– Erb – 2015-10-02T07:40:28.233

3

  • 5) Follow the instructions at https://github.com/hartator/wayback_machine_downloader (e.g. copy and paste "gem install wayback_machine_downloader" into the prompt, hit Enter, and it will install the program; then follow the "Usage" guidelines). 6) Once your website is captured, you will find the files in C:\Users\YOURusername\websites

– Erb – 2015-10-02T07:40:33.497

    Answers

    65

    I tried different ways to download a site and finally found the Wayback Machine Downloader, which Hartator mentioned before (so all credit goes to him), but I simply did not notice his comment on the question. To save you time, I decided to add the wayback_machine_downloader gem as a separate answer here.

    The site at http://www.archiveteam.org/index.php?title=Restoring lists these ways to download from archive.org:

    • Wayback Machine Downloader, a small tool in Ruby to download any website from the Wayback Machine. Free and open source. My choice!
    • Warrick - main site seems down.
    • Wayback downloader, a service that will download your site from the Wayback Machine and even add a plugin for WordPress. Not free.

    Comic Sans

    Posted 2014-10-20T10:16:39.213

    Reputation: 788

    1

    i also wrote a "wayback downloader", in php, downloading the resources, adjusting links, etc: https://gist.github.com/divinity76/85c01de416c541578342580997fa6acf

    – hanshenrik – 2017-10-18T18:08:00.333

    @ComicSans, On the page you've linked, what is an Archive Team grab? – Pacerier – 2018-03-15T14:17:10.377

    1

    As of October 2018, the Wayback Machine Downloader still works. – That Brazilian Guy – 2018-10-02T17:43:02.023

    @Pacerier it means (sets of) WARC files produced by Archive Team (and usually fed into Internet Archive's wayback machine), see http://archive.org/details/archiveteam

    – Nemo – 2019-01-20T14:47:12.837

    15

    This can be done using a bash shell script combined with wget.

    The idea is to use some of the URL features of the wayback machine:

    • http://web.archive.org/web/*/http://domain/* will list all saved pages from http://domain/ recursively. It can be used to construct an index of pages to download, avoiding heuristics to detect links in webpages. For each link, there is also the date of the first and the last archived version.
    • http://web.archive.org/web/YYYYMMDDhhmmss*/http://domain/page will list all versions of http://domain/page for the year YYYY. Within that page, specific links to versions can be found (with exact timestamps).
    • http://web.archive.org/web/YYYYMMDDhhmmssid_/http://domain/page will return the unmodified page http://domain/page at the given timestamp. Note the id_ token.

    These are the basics to build a script to download everything from a given domain.
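As a minimal sketch of the last URL pattern above (the helper name is my own, not an official tool), a bash function can build the id_ URL for a raw snapshot, which can then be fetched with wget:

```shell
#!/usr/bin/env bash
# Build the Wayback Machine URL that returns the unmodified ("id_") snapshot
# of a page at a given timestamp. The function name is illustrative.
wayback_raw_url() {
  local timestamp="$1"   # e.g. 20150415082949
  local page_url="$2"    # e.g. http://example.com/index.html
  printf 'http://web.archive.org/web/%sid_/%s\n' "$timestamp" "$page_url"
}

# Example: print the raw-snapshot URL; it could then be passed to wget -x
# to download it into a directory tree mirroring the original site.
wayback_raw_url 20150415082949 http://example.com/index.html
```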

    user36520

    Posted 2014-10-20T10:16:39.213

    Reputation: 1 795

    7

    You should really use the API instead: https://archive.org/help/wayback_api.php. Wikipedia help pages are for editors, not for the general public, so that page is focused on the graphical interface, which is both superseded and inadequate for this task.

    – Nemo – 2015-01-21T22:41:41.383
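The availability API that Nemo links to returns JSON describing the closest archived snapshot. A minimal parsing sketch (the sample response below is illustrative; a real call would fetch https://archive.org/wayback/available?url=... over HTTP):

```python
import json

# Illustrative sample of the JSON shape returned by the Wayback availability
# API (https://archive.org/help/wayback_api.php); not a live response.
sample = '''{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20150415082949/http://example.com/",
      "timestamp": "20150415082949",
      "status": "200"
    }
  }
}'''

data = json.loads(sample)
closest = data["archived_snapshots"].get("closest")
if closest and closest["available"]:
    # Insert the "id_" token after the timestamp to get the unmodified page.
    raw_url = closest["url"].replace(
        closest["timestamp"], closest["timestamp"] + "id_"
    )
    print(raw_url)
```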

    It'd probably be easier to just say take the URL (like http://web.archive.org/web/19981202230410/http://www.google.com/) and add id_ to the end of the "date numbers". Then, you would get something like http://web.archive.org/web/19981202230410id_/http://www.google.com/.

    – haykam – 2016-07-09T21:57:41.520

    1

    A python script can also be found here: https://gist.github.com/ingamedeo/50df5def5bce7edfbd4c6b71aa385328#file-webarchive-py

    – Amedeo Baragiola – 2018-06-22T20:24:04.553

    4

    There is a tool specifically designed for this purpose, Warrick: https://code.google.com/p/warrick/

    It's based on the Memento protocol.

    Nemo

    Posted 2014-10-20T10:16:39.213

    Reputation: 1 050

    3

    As far as I managed to use this (in May 2017), it just recovers what archive.is holds, and pretty much ignores what is at archive.org; it also tries to get documents and images from the Google/Yahoo caches but utterly fails. Warrick has been cloned several times on GitHub since Google Code shut down; maybe there are some better versions there. – Gwyneth Llewelyn – 2017-05-31T16:41:47.160

    0

    You can do this easily with wget.

    wget -rc --accept-regex '.*ROOT.*' START
    

    Where ROOT is the root URL of the website and START is the starting URL. For example:

    wget -rc --accept-regex '.*http://www.math.niu.edu/~rusin/known-math/.*' http://web.archive.org/web/20150415082949fw_/http://www.math.niu.edu/~rusin/known-math/
    

    Note that you should bypass the Web Archive's wrapping frame for the START URL. In most browsers, you can right-click on the page and select "Show Only This Frame".

    jcoffland

    Posted 2014-10-20T10:16:39.213

    Reputation: 197