I have the following configuration of an external Google Cloud load balancer:

[Image: diagram of the load balancer]

  • GlobalNetworkEndpointGroupToClusterByIp is an Internet NEG of type INTERNET_IP_PORT pointing to the Kubernetes cluster's IP.
  • GlobalNetworkEndpointGroupToManagedS3 is an Internet NEG of type INTERNET_FQDN_PORT pointing to an S3 service managed by Yandex.

For some reason, some backend services fail, and when I try to connect to them they respond with an HTML page showing a 502 Server Error:

Error: Server Error

The server encountered a temporary error and could not complete your request.

Please try again in 30 seconds.

The logs of a failed backend service always contain the following error:

jsonPayload: {
  cacheId: "GRU-c0ee45d8"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_pick_backend"
}

Requests to the failing backend services fail in 1 ms (according to the logs), so it seems the load balancer doesn't even try to connect to my Kubernetes cluster's IP or the managed S3 and fails instantly.

At the time of posting this question, the S3 and Imgproxy backend services are healthy, but the others are not working:

[Image: uptime status]

If I redeploy everything, a different set of services may fail, for example:

  • API and Docs will work, others will fail
  • API, Docs, FPS and Imgproxy will work, S3 will fail
  • S3 will work, others will fail

So it's completely random, and I can't understand why it happens. If I'm very lucky, all backend services work after a redeployment. It's also possible that none of them work.

The Kubernetes cluster works and accepts connections, and the managed S3 works well too. It looks like a bug, but I couldn't find anything about it on Google.

Here's how my Terraform configuration looks:

resource "google_compute_global_network_endpoint_group" "kubernetes-cluster" {
  name                  = "kubernetes-cluster-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_IP_PORT"

  depends_on = [
    module.kubernetes-resources
  ]
}

resource "google_compute_global_network_endpoint" "kubernetes-cluster" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.kubernetes-cluster.name
  port                          = 80
  ip_address                    = yandex_vpc_address.kubernetes.external_ipv4_address.0.address
}

resource "google_compute_global_network_endpoint_group" "s3" {
  name                  = "s3-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_FQDN_PORT"
}

resource "google_compute_global_network_endpoint" "s3" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.s3.name
  port                          = 443
  fqdn                          = trimprefix(local.s3.endpoint, "https://")
}

resource "google_compute_backend_service" "s3" {
  name = "s3-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.s3.self_link
  }

  custom_request_headers = [
    "Host:${google_compute_global_network_endpoint.s3.fqdn}"
  ]

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "https"
  protocol    = "HTTPS"
  timeout_sec = 60
}

resource "google_compute_backend_service" "imgproxy" {
  name = "imgproxy-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  custom_request_headers = [
    "Access-Control-Allow-Origin:${var.ALLOWED_CORS_ORIGIN}"
  ]

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "front" {
  name = "front-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = true
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_url_map" "default" {
  name            = "default-${var.ENVIRONMENT_NAME}"
  default_service = google_compute_backend_service.front.self_link

  host_rule {
    hosts = [
      local.hosts.api,
      local.hosts.fps
    ]
    path_matcher = "api"

  }

  host_rule {
    hosts = [
      local.hosts.s3
    ]
    path_matcher = "s3"
  }

  host_rule {
    hosts = [
      local.hosts.imgproxy
    ]
    path_matcher = "imgproxy"
  }

  path_matcher {
    default_service = google_compute_backend_service.api.self_link
    name            = "api"
  }

  path_matcher {
    default_service = google_compute_backend_service.s3.self_link
    name            = "s3"
  }

  path_matcher {
    default_service = google_compute_backend_service.imgproxy.self_link
    name            = "imgproxy"
  }

  test {
    host    = local.hosts.docs
    path    = "/"
    service = google_compute_backend_service.front.self_link
  }

  test {
    host    = local.hosts.api
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.fps
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.s3
    path    = "/"
    service = google_compute_backend_service.s3.self_link
  }

  test {
    host    = local.hosts.imgproxy
    path    = "/"
    service = google_compute_backend_service.imgproxy.self_link
  }
}

# See: https://github.com/hashicorp/terraform-provider-google/issues/5356
resource "random_id" "managed-certificate-name" {
  byte_length = 4
  prefix      = "default-${var.ENVIRONMENT_NAME}-"

  keepers = {
    domains = join(",", values(local.hosts))
  }
}

resource "google_compute_managed_ssl_certificate" "default" {
  name = random_id.managed-certificate-name.hex

  lifecycle {
    create_before_destroy = true
  }

  managed {
    domains = values(local.hosts)
  }
}

resource "google_compute_ssl_policy" "default" {
  name    = "default-${var.ENVIRONMENT_NAME}"
  profile = "MODERN"
}

resource "google_compute_target_https_proxy" "default" {
  name       = "default-${var.ENVIRONMENT_NAME}"
  url_map    = google_compute_url_map.default.self_link
  ssl_policy = google_compute_ssl_policy.default.self_link
  ssl_certificates = [
    google_compute_managed_ssl_certificate.default.self_link
  ]
}

resource "google_compute_global_forwarding_rule" "default" {
  name                  = "default-${var.ENVIRONMENT_NAME}"
  load_balancing_scheme = "EXTERNAL"
  port_range            = "443-443"
  target                = google_compute_target_https_proxy.default.self_link
}

UPD. I figured out that recreating the NEGs resolves the issue:

  1. Wait until Terraform finishes the deployment.
  2. Create NEGs with the same configuration via the Google Cloud Platform Console.
  3. Edit the backend services to use the newly created NEGs.
  4. It works!

But it's definitely a hack, and it seems there is no way to automate it with Terraform. I will continue investigating the issue.

Petr Flaks

1 Answer

Glad to hear that your issue has been fixed. I understand that you achieved this by manually creating the NEGs through the GCP console and then editing the backend services, rather than through Terraform.

The most likely cause of this issue is a race condition. In Terraform, resources are usually defined in a chain, so each resource depends on another. Here, both the backend service creation and the network endpoint (NE) attachment depend on the NEG creation, so the two operations tend to run in parallel. In that case the NE attachment is not reflected in the backend service correctly, because the state of the Internet NEG is read at the moment the backend service is created or updated. The NE attachment therefore has to happen before the backend service is created.

So, in Terraform, the backend service has to be declared with depends_on (a meta-argument) [1] on the NE attachment, i.e. the backend service should be created only after the NE attachment has completed.

[1] https://www.terraform.io/docs/language/meta-arguments/depends_on.html
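
As a minimal sketch of the depends_on suggestion, here is how the api backend service from the question might look. It reuses the resource names from the question's configuration; the only addition is the depends_on block, and whether this fully removes the race in your setup is an assumption you would need to verify by redeploying a few times.

```hcl
resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  custom_request_headers = [
    "Access-Control-Allow-Origin:${var.ALLOWED_CORS_ORIGIN}"
  ]

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  # Added: create or update the backend service only after the endpoint
  # has been attached to the NEG. Referencing the NEG's self_link above
  # orders this resource after the NEG itself, but not after the endpoint.
  depends_on = [
    google_compute_global_network_endpoint.kubernetes-cluster
  ]

  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}
```

The same depends_on block would presumably go into the front and imgproxy backend services, which use the same NEG, and the s3 backend service would list google_compute_global_network_endpoint.s3 instead.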

Hope this clarifies your doubt.

Dave M