How to: Download a page from the Wayback Machine over a specified interval

11

2

What I mean is to download each page available from the Wayback Machine over a specified time period and interval. For example, I want to download each page available from each day from nature.com from January of 2012 to December of 2012. (Not precisely what I want to do, but it's close enough -- and provides a good example.)

Unfortunately, wget on its own won't work because of the unique way the Wayback Machine works.

Tools like Wayback Machine downloader only download the most recent version of the page, it seems.

Interacting with the IA API seems like a viable route, but I'm not sure how that would work.

Thanks!

orlando marinella

Posted 2017-03-13T20:49:43.923

Reputation: 113

You would definitely need to write a script for this. Maybe cURL? – PulseJet – 2017-03-15T14:50:22.863

I think it'd be possible to write a script and lean on cURL, but I'm unfamiliar with the Memento API that the Internet Archive uses, and don't think I've seen it used in this way. – orlando marinella – 2017-03-16T14:38:57.900

I need to a) Do multiple sites at once, b) grab a snapshot of each site over a long interval (say, 1998 to 2001), and c) be able to specify how many snapshots I want to take over that interval. – orlando marinella – 2017-03-16T14:52:27.760

Same problem. They just want one page, it seems -- the documentation for the WB Machine downloader is vague about whether it works over an interval like that. – orlando marinella – 2017-03-16T22:02:33.593

Just try it out? – duenni – 2017-03-17T08:37:26.350

@duenni Yeah, no, that's not how it's working. – orlando marinella – 2017-03-20T12:27:21.963

Answers

5

Wayback URLs are formatted as follows:

$BASEURL/$TIMESTAMP/$TARGET

Here, BASEURL is usually http://web.archive.org/web (I say usually because I am unsure whether it is the only BASEURL).

TARGET is self-explanatory (in your case http://nature.com, or some similar URL).

TIMESTAMP is the time the capture was made (in UTC), written as YYYYmmddHHMMss:

  • YYYY: Year
  • mm: Month (2 digit - 01 to 12)
  • dd: Day of month (2 digit - 01 to 31)
  • HH: Hour (2 digit - 00 to 23)
  • MM: Minute (2 digit - 00 to 59)
  • ss: Second (2 digit - 00 to 59)
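As a minimal sketch of the format described above, the following helper assembles a Wayback URL from a target and a UTC time. The function name wayback_url and the "YYYY-mm-dd HH:MM:ss" input style are assumptions made for illustration; they are not part of any Internet Archive tool.

```shell
#!/bin/sh
# Hypothetical helper: build $BASEURL/$TIMESTAMP/$TARGET from a readable UTC time.
BASEURL='http://web.archive.org/web'

wayback_url() {
    # $1 = target URL, $2 = UTC time written as "YYYY-mm-dd HH:MM:ss"
    # Keep only the digits, which yields YYYYmmddHHMMss
    timestamp=$(printf '%s' "$2" | tr -cd '0-9')
    if [ "${#timestamp}" -ne 14 ]; then
        echo "expected 14 digits, got: $timestamp" >&2
        return 1
    fi
    printf '%s/%s/%s\n' "$BASEURL" "$timestamp" "$1"
}

wayback_url 'http://nature.com' '2012-01-08 12:00:00'
```

The last line prints the URL for noon UTC on 8 January 2012; requesting it should land on (or redirect to) the closest capture.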

If you request a capture time that doesn't exist, the Wayback Machine redirects to the closest capture for that URL, whether in the future or the past.

You can use that feature to get each daily URL using curl -I (HTTP HEAD) to get the set of URLs:

BASEURL='http://web.archive.org/web'
TARGET="SET_THIS"
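START=1325419200 # Sun Jan  1 12:00:00 UTC 2012 (noon)
END=1356998400   # Tue Jan  1 00:00:00 UTC 2013
# Choose the date invocation that turns epoch seconds into YYYYmmddHHMMSS.
# No quotes around the format: they would be passed to date literally.
if uname -s | grep -q 'Darwin'; then
    DATECMD="date -u +%Y%m%d%H%M%S -r "  # BSD date: -r SECONDS
elif uname -s | grep -q 'Linux'; then
    DATECMD="date -u +%Y%m%d%H%M%S -d @" # GNU date: -d @SECONDS
else
    echo "Unsupported OS: $(uname -s)" >&2
    exit 1
fi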


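while [[ $START -lt $END ]]; do
    TIMESTAMP=$(${DATECMD}$START)
    # Header names are case-insensitive and header lines end in CR LF,
    # so match the name loosely and strip the trailing CR
    REDIRECT="$(curl -sI "$BASEURL/$TIMESTAMP/$TARGET" | awk 'tolower($1) == "location:" {print $2}' | tr -d '\r')"
    if [[ -z "$REDIRECT" ]]; then
        echo "$BASEURL/$TIMESTAMP/$TARGET"
    else
        echo "$REDIRECT"
    fi
    START=$((START + 86400)) # add 24 hours
done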

This gets you the URLs that are closest to noon on each day of 2012. Just remove the duplicates and download the pages.
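One way to remove the duplicates and download the pages could look like the sketch below. The function names (unique_urls, snapshot_filename, download_unique), the .html file naming, and the one-second pause are assumptions chosen for illustration; adjust them to taste.

```shell
#!/bin/sh
# Read capture URLs on stdin, drop stray CRs and duplicate lines.
unique_urls() {
    tr -d '\r' | sort -u
}

# Name the local file after the 14-digit capture timestamp in the URL.
snapshot_filename() {
    ts=$(printf '%s\n' "$1" | sed -n 's|.*/web/\([0-9]\{14\}\)[^/]*/.*|\1|p')
    printf '%s.html\n' "${ts:-unknown}"
}

# Download every unique capture URL read from stdin (needs network access).
download_unique() {
    unique_urls | while IFS= read -r url; do
        curl -sL -o "$(snapshot_filename "$url")" "$url"
        sleep 1 # be gentle with the archive
    done
}
```

If the output of the loop above was saved to a file, say urls.txt (a hypothetical name), then `download_unique < urls.txt` would fetch each capture once.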

Note: The script above can probably be greatly improved to jump forward when the REDIRECT points to a capture more than 1 day in the future, but that requires deconstructing the returned URL and adjusting START to the correct date value.
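A rough sketch of that improvement follows, assuming the returned URL has the form $BASEURL/$TIMESTAMP/$TARGET described earlier. The helper names (capture_timestamp, timestamp_to_epoch, next_start) are hypothetical, and the date invocations assume GNU date on Linux and BSD date on macOS. It skips whole days only, so the noon alignment of START is kept.

```shell
#!/bin/sh
# Pull the 14-digit capture timestamp out of a Wayback URL.
capture_timestamp() {
    printf '%s\n' "$1" | sed -n 's|.*/web/\([0-9]\{14\}\)[^/]*/.*|\1|p'
}

# Convert YYYYmmddHHMMss (UTC) to epoch seconds.
timestamp_to_epoch() {
    case "$(uname -s)" in
        Darwin) # BSD date parses the raw timestamp with -j -f
            date -u -j -f '%Y%m%d%H%M%S' "$1" +%s ;;
        *)      # GNU date wants "YYYY-mm-dd HH:MM:ss"
            pretty=$(printf '%s' "$1" |
                sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/')
            date -u -d "$pretty" +%s ;;
    esac
}

# $1 = current START (epoch seconds), $2 = redirect URL (may be empty).
# Prints the next START: one day later, or several whole days later when
# the closest capture lies more than a day in the future.
next_start() {
    start=$1
    skip=1
    ts=$(capture_timestamp "$2")
    if [ -n "$ts" ]; then
        capture=$(timestamp_to_epoch "$ts")
        days=$(( (capture - start) / 86400 ))
        if [ "$days" -gt 1 ]; then
            skip=$days
        fi
    fi
    echo $(( start + skip * 86400 ))
}
```

In the loop, `START=$((START + 86400))` would then become `START=$(next_start "$START" "$REDIRECT")`. Days that would only have redirected to the same future capture are skipped, which saves requests but does not change the set of unique URLs.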

Samveen

Posted 2017-03-13T20:49:43.923

Reputation: 290

This is great. Why? Because we have facts and proof of when somebody archived content, and web.archive.org has removed archived content in the past.

This script above would save archived content. Awesome. – DeerSpotter – 2017-03-23T13:10:51.017

4

duenni

Posted 2017-03-13T20:49:43.923

Reputation: 2 109

This is awesome. – DeerSpotter – 2017-03-21T13:08:10.380