2021-10-05: Updated question and text after more analysis; stripped down to a minimal case.

Short description

A Nomad / Consul cluster is running, with Traefik (with minimal configuration) as a system task on each Nomad client. There are 3 Nomad servers, 3 Consul servers, 3 Nomad clients and 3 Gluster servers at this point. The set-up is very similar to this article series on setting up a Nomad / Consul cluster.

Basic images and sites work well.

The issue

I've started porting the first bigger PHP-based site (one with a larger number of page dependencies) to this cluster and am running into a weird issue that I have pinpointed, but cannot resolve properly.

The tasks load well and register as up in Consul, Traefik and Nomad. Small pages (with few dependencies) work well.
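
The 'up' state can also be checked from the CLI (service and job names are the ones from the job file further down; these are standard Consul / Nomad commands, nothing specific to this set-up):

# Service is registered and its health check passes in Consul
consul catalog services
curl -s http://127.0.0.1:8500/v1/health/checks/test-staging

# Nomad reports the allocation as running
nomad job status test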

Whenever a page loads too many dependencies, Apache stalls those specific connections.

When I open a fresh Incognito browser window and go to the URL, the main page and around 10-15 of the dependencies load. The others stay in a pending state in the browser, and the browser status keeps 'spinning' (as in loading). Closing the window and opening a new one allows me to repeat the process.
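
The stall can presumably also be reproduced without a browser by reusing one session cookie across many parallel requests, along these lines (the client IP, the PHP asset path and the assumption that Traefik listens on port 80 on the client are placeholders / guesses; the Host header matches the Traefik rule below):

# Get a session cookie from the main page, store it in a cookie jar
curl -s -c /tmp/cookies -H 'Host: staging.xxxxxx.com' http://<client-ip>/ > /dev/null

# Fire a batch of parallel requests that all reuse the same session
for i in $(seq 1 30); do
  curl -s -b /tmp/cookies -H 'Host: staging.xxxxxx.com' "http://<client-ip>/<some-php-asset>" > /dev/null &
done
wait   # with the Gluster-backed session dir, some of these hang here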

I've nailed down the issue to the fact that the PHP sessions directory is mapped (via Docker) to a directory on a GlusterFS mount.

Moving the volume mapping to a host-local directory on the same server removes the issue, and the site loads as it should.

Conclusion: The interaction between Docker volumes and the Gluster mount causes issues under 'heavy load'. With just a few requests everything works well; with lots of requests accessing the PHP session file, things stall and do not recover.
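
A likely mechanism: PHP's default file session handler holds an exclusive lock on the session file for the duration of a request, so parallel requests from one browser session all queue on a single lock that now lives on the Gluster mount. Whether competing POSIX locks on the mount behave sanely at all can be checked roughly like this (the lock file is just a scratch file, not something PHP uses):

# Terminal 1: hold an exclusive lock on a scratch file for 30 seconds
flock /data/storage/test/php_sessions/locktest -c 'sleep 30'

# Terminal 2: a competing lock should block for the remainder of those
# 30 seconds and then return promptly; anything else points at the mount
time flock /data/storage/test/php_sessions/locktest -c 'true'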

Question: This is probably caused either by a Gluster configuration issue or by the way the mount is configured in /etc/fstab. Please help me fix this issue!

Isolation

The PHP sessions directory is set to /var/php_sessions in the image's PHP config and mapped via Nomad / Docker to /data/storage/test/php_sessions on the host.

The /data/storage/test/php_sessions directory is owned by user 20000 to make sure all nodes have access to the same PHP sessions:

client:/data/storage/test$ ls -ln .
drwxr-xr-x  2 20000 20000     6 Oct  5 14:53 php_sessions
drwxr-xr-x  2 20000 20000     6 Oct  5 14:53 upload
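
To double-check that the container really writes its sessions to the mounted path, something like this can be run on the client (the name filter and <container-id> are placeholders; Nomad's docker driver names containers after the task and allocation ID):

# Find the container for the task and ask PHP where it stores sessions
docker ps --filter name=test
docker exec <container-id> php -r 'echo session_save_path(), PHP_EOL;'

# The session files themselves should show up on the Gluster mount
ls -l /data/storage/test/php_sessions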

When changing the Nomad host volume definition (in /etc/nomad/nomad.hcl) from:

client {

  host_volume "test-sessions" {
    path      = "/data/storage/test/php_sessions"
    read_only = false
  }

}

to

client {

  host_volume "test-sessions" {
    path      = "/tmp/php_sessions"
    read_only = false
  }

}

(And making sure /tmp/php_sessions is also owned by user 20000)

Everything works again.
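
For reference, preparing that host-local directory amounts to nothing more than:

# Create the host-local session directory and hand it to the task user
sudo mkdir -p /tmp/php_sessions
sudo chown 20000:20000 /tmp/php_sessions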

Detailed data (More on request)

Contents of /etc/fstab:

LABEL=cloudimg-rootfs   /    ext4   defaults    0 1
LABEL=UEFI  /boot/efi   vfat    defaults    0 1
gluster-01,gluster-02,gluster-03:/storage       /data/storage   glusterfs   _netdev,defaults,direct-io-mode=disable,rw
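
The volume name storage and the mount options come straight from that line. For anyone digging into this: the translator settings that usually matter for many-small-files plus locking workloads (write-behind, flush-behind, open-behind, the various caches) can be listed with the standard Gluster commands below; I'm not claiming any particular value is wrong, this is just where I'd look:

# Show the volume layout and any non-default options
gluster volume info storage

# List the current value of every volume option
gluster volume get storage all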

Dockerfile for site image:

FROM php:7.4.1-apache
ENV APACHE_DOCUMENT_ROOT /var/www/htdocs
WORKDIR /var/www

RUN docker-php-ext-install mysqli pdo_mysql

# Make Apache root configurable
RUN sed -ri -e 's!/var/www/html!${APACHE_DOCUMENT_ROOT}!g' /etc/apache2/sites-available/*.conf
RUN sed -ri -e 's!/var/www/!${APACHE_DOCUMENT_ROOT}!g' /etc/apache2/apache2.conf /etc/apache2/conf-available/*.conf

# Listen on port 1080 by default for the non-root user
RUN sed -ri 's/Listen 80/Listen 1080/g' /etc/apache2/ports.conf
RUN sed -ri 's/:80/:1080/g' /etc/apache2/sites-enabled/*

# Use own config
COPY data/000-default.conf /etc/apache2/sites-enabled/

# Enable Production ini
RUN cp /usr/local/etc/php/php.ini-production /usr/local/etc/php/php.ini

RUN a2enmod rewrite && a2enmod remoteip

COPY --from=composer:latest /usr/bin/composer /usr/local/bin/composer
COPY --chown=www-data:www-data . /var/www

RUN /usr/local/bin/composer --no-cache --no-ansi --no-interaction install

# Finally add security changes
COPY data/changes.ini /usr/local/etc/php/conf.d/

The Nomad job file, stripped down to what triggers the issue:

job "test" {
  datacenters = ["dc1"]

  group "test-staging" {
    count = 1

    network {
      port "php_http" {
        to = 1080
      }
    }

    volume "test-sessions" {
      type      = "host"
      read_only = false
      source    = "test-sessions"
    }

    volume "test-upload" {
      type      = "host"
      read_only = false
      source    = "test-upload"
    }

    service {
      name = "test-staging"
      port = "php_http"

      tags = [
        "traefik.enable=true",
        "traefik.http.routers.test.php_staging.rule=Host(`staging.xxxxxx.com`)",
      ]

      check {
        type     = "tcp"
        port     = "php_http"
        interval = "5s"
        timeout  = "2s"
      }
    }

    task "test" {
      driver = "docker"
      user = "20000"

      config {
        image = "docker-repo:5000/test/test:latest"
        ports = ["php_http"]
      }

      volume_mount {
        volume      = "test-sessions"
        destination = "/var/php_sessions"
        read_only   = false
      }

      volume_mount {
        volume      = "test-upload"
        destination = "/var/upload"
        read_only   = false
      }

      template {
        data = <<EOF
1.2.3.4
EOF

        destination = "local/trusted-proxies.lst"
      }
    }
  }
}
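
For completeness, the job is planned and submitted the usual way (the file name test.nomad is just what I call it here):

# Dry-run, then submit the job and check where the allocation landed
nomad job plan test.nomad
nomad job run test.nomad
nomad job status test
nomad alloc status <alloc-id>   # <alloc-id> taken from the status output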
  • You are describing a complex environment and an issue that is hard to reproduce, because we don't have access to your specific PHP application. Questions like this are hard to answer. To increase your chances of getting answers, try limiting the scope of the question or describing what the success criteria are for an answer. At any rate, I have some idea about the cause of the issue. Please provide data/000-default.conf and a concise / minimal overview of how you have set up Traefik, Nomad and Consul (even more than just the config files already mentioned). – Ярослав Рахматуллин Oct 01 '21 at 18:26
  • Thanks @ЯрославРахматуллин, I've further isolated the issue. It's caused by the PHP session directory in the image being mapped to a directory on a GlusterFS volume on the host. Hopefully somebody can now tell me how to fix this! – Paul Oct 05 '21 at 15:39

0 Answers