I have the following configuration of an external Google Cloud Load Balancer:
- GlobalNetworkEndpointGroupToClusterByIp is an Internet NEG of type `INTERNET_IP_PORT` pointing to the Kubernetes cluster's IP.
- GlobalNetworkEndpointGroupToManagedS3 is an Internet NEG of type `INTERNET_FQDN_PORT` pointing to the S3 service managed by Yandex.
For some reason, some backend services fail, and when I try to connect to them they respond with an HTML page showing a 502 Server Error:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The logs of the failed backend services always contain the following error:
```
jsonPayload: {
  cacheId: "GRU-c0ee45d8"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_pick_backend"
}
```
Requests to the failing backend services fail within 1 ms (according to the logs), so it seems the load balancer doesn't even try to connect to my Kubernetes cluster's IP or to Managed S3 and fails instantly.
At the time of posting this question, the S3 and Imgproxy backend services are working, but the others are not.
If I re-deploy everything, a different set of services may fail, for example:
- API and Docs will work, others will fail
- API, Docs, FPS and Imgproxy will work, S3 will fail
- S3 will work, others will fail
So it looks completely random, and I can't understand why it happens. If I'm very lucky, all backend services work after a re-deployment. It's also possible that none of them will work.
The Kubernetes cluster works and accepts connections, and Managed S3 works well too. It looks like a bug, but I couldn't find anything about it on Google.
Here's how my Terraform configuration looks:
```hcl
resource "google_compute_global_network_endpoint_group" "kubernetes-cluster" {
  name                  = "kubernetes-cluster-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_IP_PORT"

  depends_on = [
    module.kubernetes-resources
  ]
}

resource "google_compute_global_network_endpoint" "kubernetes-cluster" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.kubernetes-cluster.name
  port                          = 80
  ip_address                    = yandex_vpc_address.kubernetes.external_ipv4_address.0.address
}

resource "google_compute_global_network_endpoint_group" "s3" {
  name                  = "s3-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_FQDN_PORT"
}

resource "google_compute_global_network_endpoint" "s3" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.s3.name
  port                          = 443
  fqdn                          = trimprefix(local.s3.endpoint, "https://")
}

resource "google_compute_backend_service" "s3" {
  name = "s3-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.s3.self_link
  }

  custom_request_headers = [
    "Host:${google_compute_global_network_endpoint.s3.fqdn}"
  ]

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "https"
  protocol    = "HTTPS"
  timeout_sec = 60
}

resource "google_compute_backend_service" "imgproxy" {
  name = "imgproxy-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  custom_request_headers = [
    "Access-Control-Allow-Origin:${var.ALLOWED_CORS_ORIGIN}"
  ]

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "front" {
  name = "front-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = true
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_url_map" "default" {
  name            = "default-${var.ENVIRONMENT_NAME}"
  default_service = google_compute_backend_service.front.self_link

  host_rule {
    hosts = [
      local.hosts.api,
      local.hosts.fps
    ]
    path_matcher = "api"
  }

  host_rule {
    hosts = [
      local.hosts.s3
    ]
    path_matcher = "s3"
  }

  host_rule {
    hosts = [
      local.hosts.imgproxy
    ]
    path_matcher = "imgproxy"
  }

  path_matcher {
    default_service = google_compute_backend_service.api.self_link
    name            = "api"
  }

  path_matcher {
    default_service = google_compute_backend_service.s3.self_link
    name            = "s3"
  }

  path_matcher {
    default_service = google_compute_backend_service.imgproxy.self_link
    name            = "imgproxy"
  }

  test {
    host    = local.hosts.docs
    path    = "/"
    service = google_compute_backend_service.front.self_link
  }

  test {
    host    = local.hosts.api
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.fps
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.s3
    path    = "/"
    service = google_compute_backend_service.s3.self_link
  }

  test {
    host    = local.hosts.imgproxy
    path    = "/"
    service = google_compute_backend_service.imgproxy.self_link
  }
}

# See: https://github.com/hashicorp/terraform-provider-google/issues/5356
resource "random_id" "managed-certificate-name" {
  byte_length = 4
  prefix      = "default-${var.ENVIRONMENT_NAME}-"

  keepers = {
    domains = join(",", values(local.hosts))
  }
}

resource "google_compute_managed_ssl_certificate" "default" {
  name = random_id.managed-certificate-name.hex

  lifecycle {
    create_before_destroy = true
  }

  managed {
    domains = values(local.hosts)
  }
}

resource "google_compute_ssl_policy" "default" {
  name    = "default-${var.ENVIRONMENT_NAME}"
  profile = "MODERN"
}

resource "google_compute_target_https_proxy" "default" {
  name       = "default-${var.ENVIRONMENT_NAME}"
  url_map    = google_compute_url_map.default.self_link
  ssl_policy = google_compute_ssl_policy.default.self_link

  ssl_certificates = [
    google_compute_managed_ssl_certificate.default.self_link
  ]
}

resource "google_compute_global_forwarding_rule" "default" {
  name                  = "default-${var.ENVIRONMENT_NAME}"
  load_balancing_scheme = "EXTERNAL"
  port_range            = "443-443"
  target                = google_compute_target_https_proxy.default.self_link
}
```
UPD. I figured out that recreating the NEGs resolves the issue:
- Wait until Terraform finishes the deployment.
- Create NEGs with the same configuration via the Google Cloud Platform Console.
- Edit the backend services to use the newly created NEGs.
- It works!
But it's definitely a hack, and there seems to be no way to automate it with Terraform. I will continue investigating the issue.
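For reference, the manual steps above would look roughly like the following if written as Terraform. This is only an untested sketch: the `kubernetes-cluster-recreated` resource label and name suffix are hypothetical, and it is an assumption, not a confirmed result, that a NEG created by Terraform this way behaves like one created through the Console.

```hcl
# Hypothetical second NEG with the same settings as "kubernetes-cluster".
resource "google_compute_global_network_endpoint_group" "kubernetes-cluster-recreated" {
  name                  = "kubernetes-cluster-recreated-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_IP_PORT"
}

# Same endpoint (IP and port) as the original NEG, attached to the new group.
resource "google_compute_global_network_endpoint" "kubernetes-cluster-recreated" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.kubernetes-cluster-recreated.name
  port                          = 80
  ip_address                    = yandex_vpc_address.kubernetes.external_ipv4_address.0.address
}

# Each affected backend service would then reference the new NEG, e.g.:
#   backend {
#     group = google_compute_global_network_endpoint_group.kubernetes-cluster-recreated.self_link
#   }
```

The same pattern would apply to the `s3` NEG. Whether this actually avoids `failed_to_pick_backend` is not verified.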