Wayback Machine URLs are formatted as follows:
http://$BASEURL/$TIMESTAMP/$TARGET
Here BASEURL is usually http://web.archive.org/web (I say usually because I am unsure whether it is the only BASEURL). TARGET is self-explanatory (in your case http://nature.com, or some similar URL). TIMESTAMP is YYYYmmddHHMMss, the time the capture was made (in UTC):
YYYY: Year
mm: Month (2 digits, 01 to 12)
dd: Day of month (2 digits, 01 to 31)
HH: Hour (2 digits, 00 to 23)
MM: Minute (2 digits, 00 to 59)
ss: Second (2 digits, 00 to 59)
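For example, noon UTC on 1 January 2012 for http://nature.com would be requested as:
http://web.archive.org/web/20120101120000/http://nature.com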
If you request a capture time that doesn't exist, the Wayback Machine redirects to the closest capture of that URL, whether in the future or the past.
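You can see that redirect in the response headers with curl -I (the Location value below is illustrative, not a real capture):
curl -sI "http://web.archive.org/web/20120101120000/http://nature.com"
# Among the response headers you should see something like:
# Location: http://web.archive.org/web/20120101103000/http://www.nature.com/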
You can script that feature to fetch the closest capture for each day, using curl -I (HTTP HEAD) to collect the set of URLs:
BASEURL='http://web.archive.org/web'
TARGET="SET_THIS"
START=1325419200 # Jan 1 2012 12:00:00 UTC (Noon)
END=1356998400 # Tue Jan 1 00:00:00 UTC 2013
if uname -s | grep -q 'Darwin'; then
  # BSD date (macOS): the epoch goes after -r and the +format must come last
  epoch_to_ts() { date -u -r "$1" +%Y%m%d%H%M%S; }
elif uname -s | grep -q 'Linux'; then
  # GNU date: the epoch goes after -d @
  epoch_to_ts() { date -u -d "@$1" +%Y%m%d%H%M%S; }
fi
while [[ $START -lt $END ]]; do
  TIMESTAMP=$(epoch_to_ts "$START")
  # Match the Location header case-insensitively and strip any trailing CR
  REDIRECT="$(curl -sI "$BASEURL/$TIMESTAMP/$TARGET" | awk 'tolower($1) == "location:" {print $2}' | tr -d '\r')"
  if [[ -z "$REDIRECT" ]]; then
    echo "$BASEURL/$TIMESTAMP/$TARGET"
  else
    echo "$REDIRECT"
  fi
  START=$((START + 86400)) # add 24 hours
done
This gets you the URLs that are closest to noon on each day of 2012. Just remove the duplicates and download the pages.
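For example, assuming the script above is saved as wayback-urls.sh (a name chosen here for illustration):
bash wayback-urls.sh | sort -u > urls.txt
wget --input-file=urls.txt --wait=1  # --wait pauses between requests to be polite to archive.org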
Note: the script above could probably be greatly improved to jump forward whenever the REDIRECT points more than one day into the future, but that requires deconstructing the returned URL and adjusting START to the corresponding epoch value.
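A rough sketch of that adjustment, assuming GNU date and a redirect of the form shown above (the sed pattern assumes the 14-digit timestamp follows /web/), might look like this inside the loop:
# Extract the 14-digit timestamp from the redirect URL
TS=$(echo "$REDIRECT" | sed -E 's|.*/web/([0-9]{14}).*|\1|')
# Convert it back to epoch seconds (GNU date syntax)
EPOCH=$(date -u -d "${TS:0:4}-${TS:4:2}-${TS:6:2} ${TS:8:2}:${TS:10:2}:${TS:12:2}" +%s)
# Jump ahead if the capture is more than a day past the current position
if [[ $EPOCH -gt $((START + 86400)) ]]; then START=$EPOCH; fi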
You would definitely need to write a script for this. Maybe cURL? – PulseJet – 2017-03-15T14:50:22.863
I think it'd be possible to write a script and lean on cURL, but I'm unfamiliar with the Memento API that the Internet Archive uses, and don't think I've seen it used in this way. – orlando marinella – 2017-03-16T14:38:57.900
I need to a) Do multiple sites at once, b) grab a snapshot of each site over a long interval (say, 1998 to 2001), and c) be able to specify how many snapshots I want to take over that interval. – orlando marinella – 2017-03-16T14:52:27.760
Possible duplicate: https://superuser.com/questions/828907/how-to-download-a-website-from-the-archive-org-wayback-machine – PulseJet – 2017-03-16T18:17:35.877
Same problem. They just want one page, it seems -- the documentation for the WB Machine downloader is vague about whether it works over an interval like that, or not. – orlando marinella – 2017-03-16T22:02:33.593
Just try it out? – duenni – 2017-03-17T08:37:26.350
@duenni Yeah, no, that's not how it's working. – orlando marinella – 2017-03-20T12:27:21.963