We are deploying services written in Golang on the App engine standard environment.
We did not have issues for years with this, with deployment taking 2 to 5 minutes to complete.
Since January we have seen the deployment time of one of the service growing until it started to sometimes timeout last month, even tho this service did not change significantly.
Right now it is mostly always timing out rather than deploying before the 10 min timeout.
Our first step was to try to increase this timeout limit.
We have looked into the documentation and found that gcloud config set app/cloud_build_timeout 900
could be used to increase this timeout to 15 min. So we tried that but it is still timing out after 10 min. We now found out in the documentation that this will change the timeout for the Flex environment, but not for the Standard one.
As said above we do not really understand why the deployment time would be increasing in the first place, and as a matter of fact, this service does deploy properly in production in less than 5 min.
We now are looking into the Cloud build summary to see what differ, and here is our finding.
The builds that timeout are using a go cache layer that takes a long time to complete.
The build summary will look like this when timing out:
- Step 0 fetcher finish in 2 seconds
- Step 1 prep finish in 0 seconds
- Step 2 pre-tag finish in 7 seconds
- Step 3 detector finish in 17 seconds
- Step 4 analyzer finish in 2 seconds
- Step 5 restorer finish in 1 minutes 4 seconds
- Step 6 builder finish in 34 seconds
- Step 7 exporter timeout after 7 minutes 54 minutes
In a build that deployed properly (back in January, or in production where it still deploy properly), we see that Step 5 takes seconds and Step 7 takes ~2 minutes.
We also see that this gradually increased over time, taking more and more time until it now mostly timeout.
Here is log from the build summary regarding this cache layer in Step 5
2021-04-27 17:16:44.658 ICTStarting Step #5 - "restorer"
Info
2021-04-27 17:16:44.660 ICTStep #5 - "restorer": Already have image (with digest): eu.gcr.io/gae-runtimes/buildpacks/go113/builder:go113_20210406_1_13_15_RC00
Info
2021-04-27 17:16:46.238 ICTStep #5 - "restorer": Restoring data for "google.go.build:gocache" from cache
Info
2021-04-27 17:16:46.238 ICTStep #5 - "restorer": Retrieving data for "sha256:2b4be425fac1c997f7350663ff0406cc65bcff4acc894f8b07c0af09284a950e"
Info
2021-04-27 17:17:39.732 ICTFinished Step #5 - "restorer"
Here retriving the data for the cache layer took 53 seconds, while it used to take a few seconds.
And here is log from the Step 7
Info
2021-04-27 18:33:29.861 ICTStep #7 - "exporter": Adding cache layer 'google.go.build:gocache'
Info
2021-04-27 18:33:29.861 ICTStep #7 - "exporter": Layer 'google.go.build:gocache' SHA: sha256:6851017222d45b350217544670d113df46d630ffb51fc0c0a4bbb669c65d2178
Info
2021-04-27 18:39:57.199 ICTTIMEOUT
Info
2021-04-27 18:39:57.199 ICTERROR: context deadline exceeded
Here adding the cache layer could not complete as the whole build was now taking more than 10 min, while it used to take ~15 to ~25 seconds to complete Step 7.
We also noticed some errors related to this cache layer that occur sometimes on successful builds, we are not sure it is related to the time increase tho.
Unable to delete previous cache image: DELETE https://eu.gcr.io/v2/xxxx/app-engine-tmp/build-cache/ttl-7d/xxxx/buildpack-cache/manifests/sha256:872c0a8965bc92689e2184fb277146bb05716e38352fb539115485c7d982dbcc: GOOGLE_MANIFEST_DANGLING_TAG: Manifest is still referenced by tag: dcd74554-d009-4147-8ddf-8fb36a67e92f
Or
Warning: Failed to export cache: committing cache: saving image 'eu.gcr.io/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache:latest': failed to write image to the following tags: [eu.gcr.io/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache:latest: PUT https://eu.gcr.io/v2/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache/manifests/latest: MANIFEST_INVALID: Failed to parse manifest for request "/v2/xxx/app-engine-tmp/build-cache/ttl-7d/xxx/buildpack-cache/manifests/latest": Manifest is missing required field "layers".]
We are trying to investigate this further but so far no luck.
What could be causing this ?
Cordially.