0

We are deploying services written in Golang on the App engine standard environment.

We did not have issues for years with this, with deployment taking 2 to 5 minutes to complete.

Since January we have seen the deployment time of one of the service growing until it started to sometimes timeout last month, even tho this service did not change significantly.

Right now it is mostly always timing out rather than deploying before the 10 min timeout.

Our first step was to try to increase this timeout limit.

We have looked into the documentation and found that gcloud config set app/cloud_build_timeout 900 could be used to increase this timeout to 15 min. So we tried that but it is still timing out after 10 min. We now found out in the documentation that this will change the timeout for the Flex environment, but not for the Standard one.

As said above we do not really understand why the deployment time would be increasing in the first place, and as a matter of fact, this service does deploy properly in production in less than 5 min.

We now are looking into the Cloud build summary to see what differ, and here is our finding.

The builds that timeout are using a go cache layer that takes a long time to complete.

The build summary will look like this when timing out:

  • Step 0 fetcher finish in 2 seconds
  • Step 1 prep finish in 0 seconds
  • Step 2 pre-tag finish in 7 seconds
  • Step 3 detector finish in 17 seconds
  • Step 4 analyzer finish in 2 seconds
  • Step 5 restorer finish in 1 minutes 4 seconds
  • Step 6 builder finish in 34 seconds
  • Step 7 exporter timeout after 7 minutes 54 minutes

In a build that deployed properly (back in January, or in production where it still deploy properly), we see that Step 5 takes seconds and Step 7 takes ~2 minutes.

We also see that this gradually increased over time, taking more and more time until it now mostly timeout.

Here is log from the build summary regarding this cache layer in Step 5

2021-04-27 17:16:44.658 ICTStarting Step #5 - "restorer"
Info
2021-04-27 17:16:44.660 ICTStep #5 - "restorer": Already have image (with digest): eu.gcr.io/gae-runtimes/buildpacks/go113/builder:go113_20210406_1_13_15_RC00
Info
2021-04-27 17:16:46.238 ICTStep #5 - "restorer": Restoring data for "google.go.build:gocache" from cache
Info
2021-04-27 17:16:46.238 ICTStep #5 - "restorer": Retrieving data for "sha256:2b4be425fac1c997f7350663ff0406cc65bcff4acc894f8b07c0af09284a950e"
Info
2021-04-27 17:17:39.732 ICTFinished Step #5 - "restorer"

Here retriving the data for the cache layer took 53 seconds, while it used to take a few seconds.

And here is log from the Step 7

Info
2021-04-27 18:33:29.861 ICTStep #7 - "exporter": Adding cache layer 'google.go.build:gocache'
Info
2021-04-27 18:33:29.861 ICTStep #7 - "exporter": Layer 'google.go.build:gocache' SHA: sha256:6851017222d45b350217544670d113df46d630ffb51fc0c0a4bbb669c65d2178
Info
2021-04-27 18:39:57.199 ICTTIMEOUT
Info
2021-04-27 18:39:57.199 ICTERROR: context deadline exceeded

Here adding the cache layer could not complete as the whole build was now taking more than 10 min, while it used to take ~15 to ~25 seconds to complete Step 7.

We also noticed some errors related to this cache layer that occur sometimes on successful builds, we are not sure it is related to the time increase tho.

Unable to delete previous cache image: DELETE https://eu.gcr.io/v2/xxxx/app-engine-tmp/build-cache/ttl-7d/xxxx/buildpack-cache/manifests/sha256:872c0a8965bc92689e2184fb277146bb05716e38352fb539115485c7d982dbcc: GOOGLE_MANIFEST_DANGLING_TAG: Manifest is still referenced by tag: dcd74554-d009-4147-8ddf-8fb36a67e92f

Or

Warning: Failed to export cache: committing cache: saving image 'eu.gcr.io/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache:latest': failed to write image to the following tags: [eu.gcr.io/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache:latest: PUT https://eu.gcr.io/v2/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache/manifests/latest: MANIFEST_INVALID: Failed to parse manifest for request "/v2/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache/manifests/latest": Manifest is missing required field "layers".]

We are trying to investigate this further but so far no luck.

What could be causing this ?

Cordially.

1 Answers1

0

We solved this with a workaround, by deleting the cloud storage eu.artifacts.xxx/container folder as it contains all the data restored and exported by cloud build.

The build now complete in less than 2 minutes again.

We also added a life-cycle rule to have this folder deleted automatically one day after it's being re-created. Hopefully this will prevent the build time from increasing again.

Be aware that this workaround is fine for us as we do not need the cache layer, all our builds complete around 2 min, and do not complete faster with the cache layer, it might not be the case for your builds.

Note as well that the folder might start with us.container or another prefix based on the build location.

If someone has a proper answer to this problem tho, we are still interested to hear it.