I am defining a one-shot service in my CoreOS cloud-config, but it is failing due to not being able to download files from Google Cloud Storage (via wget):

Apr 13 11:09:56 staging-node-ys9y.c.experimentalberlin.internal sh[1132]: Connecting to storage.googleapis.com|74.125.133.128|:443... failed: Connection timed out.

How can I ensure that the service is able to download files from the Internet?

My cloud-config

#cloud-config
coreos:
  units:
    - name: bootstrap.service
      command: start
      content: |
        [Unit]
        Description=Bootstrap instance
        After=network-online.target
        Requires=network-online.target

        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/usr/bin/mkdir -p /tmp/kubernetes-staging
        ExecStart=cd /tmp/kubernetes-staging
        ExecStart=/bin/sh -c "cd /tmp/kubernetes-staging && wget https://storage.googleapis.com/experimentalberlin/staging.tar.gz && tar xf staging.tar.gz"
        ExecStart=/tmp/kubernetes-staging/worker/bootstrap.sh

        [Install]
        WantedBy=local.target
aknuds1

1 Answer

I would take a multi-step approach to troubleshooting this. Pardon the extra info and over-explanation; everyone here at CoreOS has to deal with this from me. ;)

First and foremost, make sure that the URL you are trying to download from can be retrieved from inside the cluster. At present I don't see any reason why it couldn't be, as I was able to wget it myself. (As an aside, it's generally better not to put private key material in a publicly accessible tarball. While still not optimal, it would be better to include those assets in the user-data, or at the very least to protect the tarball with symmetric encryption.)

As cloud-init runs after the network is online, ordering the unit after network-online.target should be sufficient. (The meta-data service resides at http://169.254.169.254, so the cloud-config cannot even be retrieved until the network is up.) That leaves transient network issues, or some other detail of your setup, as the likely culprits.

When I attempt to run through this I get the following error:

core@rbtest ~ $ journalctl -u bootstrap.service
-- Logs begin at Wed 2016-04-13 17:31:35 UTC, end at Wed 2016-04-13 17:33:09 UTC. --
Apr 13 17:31:47 rbtest.c.coreos-support.internal systemd[1]: [/etc/systemd/system/bootstrap.service:10] Executable path is not absolute, ignoring: cd /tmp/kubernetes-staging
Apr 13 17:31:47 rbtest.c.coreos-support.internal systemd[1]: Starting Bootstrap instance...
Apr 13 17:31:47 rbtest.c.coreos-support.internal sh[1074]: --2016-04-13 17:31:47--  https://storage.googleapis.com/experimentalberlin/staging.tar.gz
Apr 13 17:31:47 rbtest.c.coreos-support.internal sh[1074]: Resolving storage.googleapis.com... 209.85.200.128, 2607:f8b0:4001:c08::80
Apr 13 17:31:47 rbtest.c.coreos-support.internal sh[1074]: Connecting to storage.googleapis.com|209.85.200.128|:443... connected.
Apr 13 17:31:48 rbtest.c.coreos-support.internal sh[1074]: HTTP request sent, awaiting response... 200 OK
Apr 13 17:31:48 rbtest.c.coreos-support.internal sh[1074]: Length: 4722 (4.6K) [application/x-tar]
Apr 13 17:31:48 rbtest.c.coreos-support.internal sh[1074]: Saving to: 'staging.tar.gz'
Apr 13 17:31:48 rbtest.c.coreos-support.internal sh[1074]: 0K ....                                                  100% 47.4M=0s
Apr 13 17:31:48 rbtest.c.coreos-support.internal sh[1074]: 2016-04-13 17:31:48 (47.4 MB/s) - 'staging.tar.gz' saved [4722/4722]
Apr 13 17:31:48 rbtest.c.coreos-support.internal systemd[1]: bootstrap.service: Main process exited, code=exited, status=203/EXEC
Apr 13 17:31:48 rbtest.c.coreos-support.internal systemd[1]: Failed to start Bootstrap instance.
Apr 13 17:31:48 rbtest.c.coreos-support.internal systemd[1]: bootstrap.service: Unit entered failed state.
Apr 13 17:31:48 rbtest.c.coreos-support.internal systemd[1]: bootstrap.service: Failed with result 'exit-code'.

The clue here is the line:

    bootstrap.service: Main process exited, code=exited, status=203/EXEC

This message is telling you that systemd could not execute one of the commands, in this case the script itself. Digging in, this makes complete sense: when I look at the top of that shell script, there is no shebang line saying which interpreter should run it. The script is all Bourne Shell/Bourne-Again Shell compatible commands, so the shebang should likely be either #!/bin/sh or #!/bin/bash. Adding a shebang should fix this issue.
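If you want to check for this before shipping a tarball, a minimal sketch along these lines could help. The function name has_shebang is hypothetical (not part of systemd or CoreOS), and it only inspects the first line of the file; it does not verify that the interpreter named there actually exists.

```shell
#!/bin/sh
# has_shebang FILE: succeed if FILE's first line starts with "#!".
# Hypothetical helper for illustration only.
has_shebang() {
    first_line=''
    # read returns non-zero on an empty file or a missing trailing newline,
    # so ignore its status and inspect whatever was read.
    IFS= read -r first_line < "$1" || true
    case $first_line in
        '#!'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (path taken from the unit file in the question):
#   has_shebang /tmp/kubernetes-staging/worker/bootstrap.sh || echo "missing shebang"
```

Running it against the extracted bootstrap.sh on a test node is a quick way to catch the 203/EXEC failure before systemd does.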

Some other minor nits:

  • When using wget, specify the download location:

    wget -O /tmp/kubernetes-staging/staging.tar.gz https://storage.googleapis.com/experimentalberlin/staging.tar.gz

  • When expanding your tarball, you can extract it to a specific location with -C:

    tar xf /tmp/kubernetes-staging/staging.tar.gz -C /tmp/kubernetes-staging/

With explicit paths you no longer need the cd line (which systemd ignores anyway, as the log above shows), and you can separate the download and the extraction into their own ExecStart= options, which provides additional logging.

  • As most of these commands are preamble to the execution of the actual bootstrap.sh script, I would change all of the ExecStart= options (with the exception of the last) to ExecStartPre=.
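
Putting those nits together, the unit could look roughly like the sketch below. It is untested, and the absolute paths /usr/bin/wget and /usr/bin/tar are assumptions about where those binaries live on your image (systemd requires absolute executable paths, as the "Executable path is not absolute" log line shows). The [Install] section from the question is left out for brevity.

```yaml
#cloud-config
coreos:
  units:
    - name: bootstrap.service
      command: start
      content: |
        [Unit]
        Description=Bootstrap instance
        After=network-online.target
        Requires=network-online.target

        [Service]
        Type=oneshot
        RemainAfterExit=true
        # Preparation steps; each one is logged separately by systemd.
        # The wget and tar paths below are assumptions, adjust as needed.
        ExecStartPre=/usr/bin/mkdir -p /tmp/kubernetes-staging
        ExecStartPre=/usr/bin/wget -O /tmp/kubernetes-staging/staging.tar.gz https://storage.googleapis.com/experimentalberlin/staging.tar.gz
        ExecStartPre=/usr/bin/tar xf /tmp/kubernetes-staging/staging.tar.gz -C /tmp/kubernetes-staging/
        # The actual work; bootstrap.sh needs a shebang line to run this way.
        ExecStart=/tmp/kubernetes-staging/worker/bootstrap.sh
```

If any ExecStartPre= step fails, systemd stops there and does not run bootstrap.sh, so the journal points directly at the step that broke.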
Brian Redbeard
  • Thanks for reminding me about the private data in there! :) It's just for test purposes now anyway, so it doesn't matter much, but it shouldn't be like that! – aknuds1 Apr 13 '16 at 17:41
  • I had to add a cheeky `ExecStartPre` to wait for the network to be available before I could get past wget failing: `ExecStartPre=/bin/sh -c 'until ping -c1 google.com; do sleep 1; done;'`. I guess you might have a point about other stuff being `ExecStartPre` stuff as well. – aknuds1 Apr 13 '16 at 17:43
  • Interesting, my boot on GCE worked on the first go-round. I'm curious as to why this isn't being successfully handled by `network-online.target`. – Brian Redbeard Apr 13 '16 at 17:44
  • You and me both! /curious – aknuds1 Apr 13 '16 at 17:45
  • I confirmed that it keeps failing even over my home network (was at a coworking space earlier), so it's reproducible. – aknuds1 Apr 13 '16 at 19:34
  • I found out what was happening... I assigned an external IP only upon SSH-ing into nodes, and before having an external IP nodes couldn't connect to the Internet... – aknuds1 Apr 15 '16 at 14:56