How to automate regular Google Takeout backups to cloud storage


I would like to create regular Google Takeout backups (let's say every 3 months) and store them encrypted in some other cloud storage like Dropbox or S3.

It does not have to be a cloud-to-cloud solution, though that would be preferred. It does not have to be 100% automated; however, the more automated, the better.

Thank you in advance for any ideas.

Michał Šrajer


Answers


This is a partial answer with partial automation. It may stop working in the future if Google chooses to crack down on automated access to Google Takeout. Features currently supported in this answer:

+---------------------------------------------+------------+---------------------+
|             Automation Feature              | Automated? | Supported Platforms |
+---------------------------------------------+------------+---------------------+
| Google Account log-in                       | No         |                     |
| Get cookies from Mozilla Firefox            | Yes        | Linux               |
| Get cookies from Google Chrome              | Yes        | Linux, macOS        |
| Request archive creation                    | No         |                     |
| Schedule archive creation                   | Kinda      | Takeout website     |
| Check if archive is created                 | No         |                     |
| Get archive list                            | Yes        | Cross-platform      |
| Download all archive files                  | Yes        | Linux, macOS        |
| Encrypt downloaded archive files            | No         |                     |
| Upload downloaded archive files to Dropbox  | No         |                     |
| Upload downloaded archive files to AWS S3   | No         |                     |
+---------------------------------------------+------------+---------------------+

Firstly, a cloud-to-cloud solution can't really work because there is no interface between Google Takeout and any known object storage provider. You've got to process the backup files on your own machine (which could be hosted in the public cloud, if you wanted) before sending them off to your object storage provider.

Secondly, as there is no Google Takeout API, an automation script needs to pretend to be a user with a browser to walk through the Google Takeout archive creation and download flow.


Automation Features

Google Account log-in

This is not yet automated. The script would need to pretend to be a browser and navigate possible hurdles such as two-factor authentication, CAPTCHAs, and other increased security screening.

Get cookies from Mozilla Firefox

I have a script for Linux users to grab the Google Takeout cookies from Mozilla Firefox and export them as environment variables. For this to work, the default/active profile must have visited https://takeout.google.com while logged in.

As a one-liner:

cookie_jar_path=$(mktemp) ; source_path=$(mktemp) ; firefox_profile=$(cat "$HOME/.mozilla/firefox/profiles.ini" | awk -v RS="" '{ if($1 ~ /^\[Install[0-9A-F]+\]/) { print } }' | sed -nr 's/^Default=(.*)$/\1/p' | head -1) ; cp "$HOME/.mozilla/firefox/$firefox_profile/cookies.sqlite" "$cookie_jar_path" ; sqlite3 "$cookie_jar_path" "SELECT name,value FROM moz_cookies WHERE baseDomain LIKE 'google.com' AND (name LIKE 'SID' OR name LIKE 'HSID' OR name LIKE 'SSID' OR (name LIKE 'OSID' AND host LIKE 'takeout.google.com')) AND originAttributes LIKE '^userContextId=1' ORDER BY creationTime ASC;" | sed -e 's/|/=/' -e 's/^/export /' | tee "$source_path" ; source "$source_path" ; rm -f "$source_path" ; rm -f "$cookie_jar_path"

As a prettier Bash script:

#!/bin/bash
# Extract Google Takeout cookies from Mozilla Firefox and export them as envvars
#
# The browser must have visited https://takeout.google.com as an authenticated user.

# Warn the user if they didn't run the script with `source`
[[ "${BASH_SOURCE[0]}" == "${0}" ]] &&
       echo 'WARNING: You should source this script to ensure the resulting environment variables get set.'

cookie_jar_path=$(mktemp)
source_path=$(mktemp)

# In case the cookie database is locked, copy the database to a temporary file.
# Edit the $firefox_profile variable below to select a specific Firefox profile.
firefox_profile=$(
    cat "$HOME/.mozilla/firefox/profiles.ini" |
    awk -v RS="" '{
        if($1 ~ /^\[Install[0-9A-F]+\]/) {
            print
        }
    }' |
    sed -nr 's/^Default=(.*)$/\1/p' |
    head -1
)
cp "$HOME/.mozilla/firefox/$firefox_profile/cookies.sqlite" "$cookie_jar_path"

# Get the cookies from the database
sqlite3 "$cookie_jar_path" \
       "SELECT name,value
        FROM moz_cookies
        WHERE baseDomain LIKE 'google.com'
        AND (
                name LIKE 'SID' OR
                name LIKE 'HSID' OR
                name LIKE 'SSID' OR
                (name LIKE 'OSID' AND host LIKE 'takeout.google.com')
        ) AND
        originAttributes LIKE '^userContextId=1'
        ORDER BY creationTime ASC;" |
                # Reformat the output into Bash exports
                sed -e 's/|/=/' -e 's/^/export /' |
                # Save the output into a temporary file
                tee "$source_path"

# Load the cookie values into environment variables
source "$source_path"

# Clean up
rm -f "$source_path"
rm -f "$cookie_jar_path"

Get cookies from Google Chrome

I have a script for Linux and possibly macOS users to grab the Google Takeout cookies from Google Chrome and export them as environment variables. The script assumes that the Python 3 venv module is available and that the default Chrome profile has visited https://takeout.google.com while logged in.

As a one-liner:

if [ ! -d "$venv_path" ] ; then venv_path=$(mktemp -d) ; fi ; if [ ! -f "${venv_path}/bin/activate" ] ; then python3 -m venv "$venv_path" ; fi ; source "${venv_path}/bin/activate" ; python3 -c 'import pycookiecheat, dbus' ; if [ $? -ne 0 ] ; then pip3 install git+https://github.com/n8henrie/pycookiecheat@dev dbus-python ; fi ; source_path=$(mktemp) ; python3 -c 'import pycookiecheat, json; cookies = pycookiecheat.chrome_cookies("https://takeout.google.com") ; [print("export %s=%s;" % (key, cookies[key])) for key in ["SID", "HSID", "SSID", "OSID"]]' | tee "$source_path" ; source "$source_path" ; rm -f "$source_path" ; deactivate

As a prettier Bash script:

#!/bin/bash
# Extract Google Takeout cookies from Google Chrome and export them as envvars
#
# The browser must have visited https://takeout.google.com as an authenticated user.

# Warn the user if they didn't run the script with `source`
[[ "${BASH_SOURCE[0]}" == "${0}" ]] &&
       echo 'WARNING: You should source this script to ensure the resulting environment variables get set.'

# Create a path for the Chrome cookie extraction library
if [ ! -d "$venv_path" ]
then
       venv_path=$(mktemp -d)
fi

# Create a Python 3 venv, if it doesn't already exist
if [ ! -f "${venv_path}/bin/activate" ]
then
        python3 -m venv "$venv_path"
fi

# Enter the Python virtual environment
source "${venv_path}/bin/activate"

# Install dependencies, if they are not already installed
python3 -c 'import pycookiecheat, dbus'
if [ $? -ne 0 ]
then
        pip3 install git+https://github.com/n8henrie/pycookiecheat@dev dbus-python
fi

# Get the cookies from the database
source_path=$(mktemp)
read -r -d '' code << EOL
import pycookiecheat, json
cookies = pycookiecheat.chrome_cookies("https://takeout.google.com")
for key in ["SID", "HSID", "SSID", "OSID"]:
        print("export %s=%s" % (key, cookies[key]))
EOL
python3 -c "$code" | tee "$source_path"

# Clean up
source "$source_path"
rm -f "$source_path"
deactivate
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && rm -rf "$venv_path"

If you sourced the script, the Python virtual environment is kept so it can be reused on later runs. To remove it manually:

rm -rf "$venv_path"

Request archive creation

This is not yet automated. The script would need to fill out the Google Takeout form and then submit it.

Schedule archive creation

There is no fully automated way to do this yet, but in May 2019, Google Takeout introduced a feature that automates the creation of 1 backup every 2 months for 1 year (6 backups total). This has to be done in the browser at https://takeout.google.com while filling out the archive request form:

(Screenshot: Google Takeout "Customize archive format" step of the archive request form)

Check if archive is created

This is not yet automated. If an archive has been created, Google sometimes sends an email to the user's Gmail inbox, but in my testing, this doesn't always happen for reasons unknown.

The only other way to check if an archive has been created is by polling Google Takeout periodically.
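
As a rough sketch of that polling, assuming the cookie environment variables from the "Get cookies" sections above are already set, the loop below reuses the archive-list command from the next section; the 10-minute interval is an arbitrary choice:

# Poll the Takeout downloads page until at least one archive download link appears.
# Assumes SID, HSID, SSID, and OSID are already exported (see "Get cookies" above).
while true
do
    archive_count=$(
        curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
            'https://takeout.google.com/settings/takeout/downloads' |
        grep -Po '(?<=")https://storage\.cloud\.google\.com/[^"]+(?=")' |
        wc -l
    )
    if [ "$archive_count" -gt 0 ]
    then
        echo "Found ${archive_count} archive download link(s)."
        break
    fi
    sleep 600  # Check again in 10 minutes
done

Note that older archives also show up on the downloads page, so in practice you may want to compare the link list against a previously saved copy rather than just counting links.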

Get archive list

I have a command to do this, assuming that the cookies have been set as environment variables in one of the "Get cookies" sections above:

curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
'https://takeout.google.com/settings/takeout/downloads' |
grep -Po '(?<=")https://storage\.cloud\.google\.com/[^"]+(?=")' |
awk '!x[$0]++'

The output is a line-delimited list of URLs that lead to downloads of all available archives. The list is scraped out of the page HTML with a regular expression, so it may break if Google changes the markup.

Download all archive files

Here is the code in Bash to get the URLs of the archive files and download them all, assuming that the cookies have been set as environment variables in one of the "Get cookies" sections above:

curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
'https://takeout.google.com/settings/takeout/downloads' |
grep -Po '(?<=")https://storage\.cloud\.google\.com/[^"]+(?=")' |
awk '!x[$0]++' |
xargs -n1 -P1 -I{} curl -LOJ -C - -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" {}

I've tested it on Linux, but the syntax should be compatible with macOS, too.

Explanation of each part:

  1. curl command with authentication cookies:

    curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
    
  2. URL of the page that has the download links

    'https://takeout.google.com/settings/takeout/downloads' |
    
  3. Filter that matches only the download links:

    grep -Po '(?<=")https://storage\.cloud\.google\.com/[^"]+(?=")' |
    
  4. Filter out duplicate links

    awk '!x[$0]++' |
    
  5. Download each file in the list, one by one:

    xargs -n1 -P1 -I{} curl -LOJ -C - -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" {}
    

    Note: Parallelizing the downloads (changing -P1 to a higher number) is possible, but Google seems to throttle all but one of the connections.

    Note: -C - skips files that already exist, but it might not successfully resume downloads for existing files.

Encrypt downloaded archive files

This is not automated. The implementation depends on how you like to encrypt your files, and you need enough free local disk space to hold both the original and the encrypted copy of each file you are encrypting.
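
As one possibility, here is a minimal sketch using GnuPG symmetric encryption (AES-256). The takeout-*.zip glob, the output naming, and interactive passphrase entry are assumptions you would adapt to your own setup:

# Symmetrically encrypt each downloaded Takeout archive with GnuPG.
# Assumes the archives were downloaded into the current directory as takeout-*.zip
# and that the passphrase is entered interactively.
for takeout_file in ./takeout-*.zip
do
    gpg --symmetric --cipher-algo AES256 --output "${takeout_file}.gpg" "$takeout_file"
done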

Upload downloaded archive files to Dropbox

This is not yet automated.
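
One hedged possibility is the Dropbox HTTP API. The sketch below uses the /2/files/upload endpoint with an access token generated for your own Dropbox app; the DROPBOX_ACCESS_TOKEN variable, the target folder, and the file name are assumptions. Note that this single-call endpoint only accepts files up to roughly 150 MB, so the large archives Takeout typically produces would need the upload-session endpoints or a tool such as rclone instead.

# Upload one encrypted archive to Dropbox via the files/upload endpoint.
# DROPBOX_ACCESS_TOKEN must hold a valid token for your own Dropbox app (assumption).
# This endpoint is limited to roughly 150 MB per call.
takeout_file="takeout-archive.zip.gpg"  # Hypothetical file name
curl -s -X POST 'https://content.dropboxapi.com/2/files/upload' \
    -H "Authorization: Bearer ${DROPBOX_ACCESS_TOKEN}" \
    -H "Dropbox-API-Arg: {\"path\": \"/Google Takeout/${takeout_file}\", \"mode\": \"add\"}" \
    -H 'Content-Type: application/octet-stream' \
    --data-binary @"${takeout_file}"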

Upload downloaded archive files to AWS S3

This is not yet automated, but it should simply be a matter of iterating over the list of downloaded files and running a command like:

aws s3 cp TAKEOUT_FILE "s3://MYBUCKET/Google Takeout/"
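
A minimal sketch of that loop, assuming the AWS CLI is already configured with credentials, the encrypted archives sit in the current directory, and MYBUCKET is replaced with your bucket name:

# Upload every encrypted archive to S3, one at a time.
# Assumes `aws configure` has already been run and the bucket exists.
for takeout_file in ./takeout-*.gpg
do
    aws s3 cp "$takeout_file" "s3://MYBUCKET/Google Takeout/"
done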

Deltik


Thank you so much for this! I had to hack around a little bit to get it working on macOS (brew install pkg-config dbus was some of it, and I'll try to come back and edit in the rest) but it saved me a lot of hassle. – jlucktay – 2020-02-05T08:23:22.637

@jlucktay: Please improve this answer if you can! The more automated and compatible the process is, the better! – Deltik – 2020-02-05T10:44:46.203


Instead of using direct APIs to back up Google Takeout (which seems to be almost impossible to do as of now), you can back up your data to third-party storage solutions via Google Drive. Many Google services allow backup to Google Drive, and you can back up Google Drive using the following tools:

GoogleCL - GoogleCL brings Google services to the command line.

gdatacopier - Command line document management utilities for Google docs.

FUSE Google Drive - A FUSE user-space filesystem for Google Drive, written in C.

Grive - An independent open-source implementation of a Google Drive client. It uses the Google Document List API to talk to Google's servers. The code is written in C++.

gdrive-cli - A command-line interface for GDrive. This uses the GDrive API, not the GDocs API, which is interesting. To use it, you need to register a Chrome application. It must be at least installable by you, but need not be published. There is a boilerplate app in the repo that you can use as a starting point.

python-fuse example - Contains some slides and examples of Python FUSE filesystems.

Most of these seem to be in the Ubuntu repositories. I've used FUSE, gdrive, and GoogleCL myself, and they all work fine. Depending on the level of control you want, this will be really easy or really complex. That's up to you. It should be straightforward to do from an EC2/S3 server. Just figure out the commands one by one for everything you need and put them in a script on a cron job, as sketched below.
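
A minimal sketch of such a cron entry, assuming you have collected your Google Drive backup commands into a hypothetical /home/user/bin/gdrive-backup.sh script; the schedule matches the quarterly cadence asked about in the question:

# m h dom mon dow  command
# Run the backup script at 03:00 on the first day of every third month.
# /home/user/bin/gdrive-backup.sh is a hypothetical script containing your
# download, encryption, and upload commands.
0 3 1 */3 * /home/user/bin/gdrive-backup.sh >> /home/user/gdrive-backup.log 2>&1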

If you don't want to work so hard, you can also just use a service like Spinbackup. I'm sure there are others just as good but I haven't tried any.

krowe


Google Takeout is the best tool for this because it supports more services than these other tools. The question is valid. – jl6 – 2015-10-20T22:07:29.527

@krowe: Your answer is really useful, however it relates only to Google Drive. Google Takeout lets you download all your data from 25 different Google services, not just Google Drive. – Bjarke Freund-Hansen – 2015-12-08T08:55:36.447

@BjarkeFreund-Hansen 1) Many of those 25 services can be saved to GDrive and backed up automatically as part of an automated GDrive backup. 2) Most of the remaining services are either pointless to back up (+1s, Circles, etc.) or defunct (Google Code). 3) I'm tired of explaining this to people who don't have a better answer. I'm fairly certain that there is no way to automate Takeout (aside from using client-side macros, which aren't very reliable anyway). 4) If you can prove me wrong, then post your better solution and we can talk. If not, then refer to my previous comment on this same issue. – krowe – 2015-12-09T11:31:21.843

@krowe: Gmail, Calendar, Contacts, Photos, Hangouts history, and Location History are services I use extensively and would like to secure against data loss at Google. None of those services' data is included in Google Drive. Just because I don't know a better solution, or whether one exists at all, does not make your answer any more correct. Again, I am not saying that your answer is bad, it just does not answer the actual question. – Bjarke Freund-Hansen – 2015-12-21T04:52:48.217

@BjarkeFreund-Hansen I understand your frustration, and some of those services CAN be synced with your GDrive (so they will back up along with it). For example, Google Photos can do that: Backup Photos. I believe that Calendar and Contacts can be synced in the same way. Gmail can be backed up as well: Backup GMail. The other things you mention I don't know about, but that is mostly because I personally wouldn't bother backing them up anyway. – krowe – 2015-12-21T07:21:19.140

GoogleCL is no longer supported due to an issue in authentication. – Bartosz Klimek – 2017-06-27T07:53:14.903


I found this question while searching for how to fix my Google Photos not showing up properly in Google Drive (which I'm already automatically backing up!).

So, to get your photos to show up in Google Drive, go to https://photos.google.com, open Settings, and enable the option to show photos in a folder in Drive.

Then use https://github.com/ncw/rclone to clone your entire Google Drive (which now includes Photos as a 'normal' directory) down to your local storage, as sketched below.
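
A minimal sketch of that sync, assuming you have already created a Google Drive remote with rclone config and named it gdrive; the remote name and the local destination path are assumptions:

# One-time setup: interactively create a Google Drive remote (here named "gdrive")
rclone config

# Mirror the entire drive, including the Photos folder, to local storage
# (sync makes the local copy match the remote, including deletions)
rclone sync gdrive: /path/to/local/backup --progress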

djsmiley2k TMW


rclone looks great, seems like a mature project. Just the solution I was looking for. – steampowered – 2018-05-03T14:33:20.623

It's really, REALLY nice, though with my many thousands of photos it now takes a while to crunch through them. I do wonder if I can just make it blindly download everything, rather than checking for dupes. – djsmiley2k TMW – 2018-05-03T14:55:28.257