14

I am running a Spring Boot application on Docker Swarm and I use PostgreSQL as the database. When I run both of them as Docker services, the database connection fails repeatedly at irregular intervals (see the timestamps), and the database log says:

2017-10-26T17:14:15.200415747Z app-db.1.1ayo6h8ro1og@scw-c2964a | LOG: could not receive data from client: Connection reset by peer

2017-10-26T17:43:36.481718562Z app-db.1.1ayo6h8ro1og@scw-c2964a | LOG: could not receive data from client: Connection reset by peer

2017-10-26T17:43:56.954152654Z app-db.1.1ayo6h8ro1og@scw-c2964a | LOG: could not receive data from client: Connection reset by peer

2017-10-26T17:44:17.434171472Z app-db.1.1ayo6h8ro1og@scw-c2964a | LOG: could not receive data from client: Connection reset by peer

2017-10-26T17:49:04.154174253Z app-db.1.1ayo6h8ro1og@scw-c2964a | LOG: could not receive data from client: Connection reset by peer

I couldn't understand or discover the reason for this. I'd appreciate any ideas.

Edit:

We realized that, when testing the application, it also throws an error like this:

SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 937517ms

Thanks.

Elifcan Çakmak

2 Answers

11

I got the same error when deploying a Docker Swarm stack with a Spring Boot app and PostgreSQL. After battling with it for about a week, I figured out that the cause was a firewall dropping connections between containers because of inactivity. Quick answer: lower the TCP keepalive timers on the Linux host (the kernel default for tcp_keepalive_time is 7200 seconds, i.e. two hours) by running:

sudo sysctl -w \
net.ipv4.tcp_keepalive_time=600 \
net.ipv4.tcp_keepalive_intvl=60 \
net.ipv4.tcp_keepalive_probes=3

I also added the following Tomcat connection pool properties (in Spring Boot they sit under spring.datasource):

tomcat:
  max-active: 10
  initial-size: 5
  max-idle: 8
  min-idle: 5
  test-on-borrow: true
  test-while-idle: true
  test-on-return: false
  test-on-connect: true
  validation-query: SELECT 1
  validation-interval: 30000
  max-wait: 30000
  min-evictable-idle-time-millis: 60000
  time-between-eviction-runs-millis: 5000
  remove-abandoned: true
  remove-abandoned-timeout: 60

The solution came from this blog post: "Dealing with NodeNotAvailable Exceptions in Elasticsearch"
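
The error in the question mentions HikariPool, so the application there appears to use HikariCP rather than the Tomcat pool. A rough HikariCP counterpart of the settings above might look like this in `application.yml`. The values are illustrative assumptions, not tested settings, and `keepalive-time` is only available in HikariCP 4.x and later.

```yaml
# Sketch only: values are illustrative assumptions, not tuned settings.
spring:
  datasource:
    hikari:
      maximum-pool-size: 10
      minimum-idle: 5
      connection-timeout: 30000   # ms to wait for a connection from the pool
      idle-timeout: 300000        # ms before an idle connection above minimum-idle is retired
      max-lifetime: 600000        # ms; maximum age of a pooled connection
      keepalive-time: 120000      # ms between keepalive checks on idle connections (HikariCP 4.x+)
```

Keeping `max-lifetime` below the 15-minute idle cutoff described in the other answer means no pooled connection can sit idle long enough to be dropped silently. HikariCP expects `idle-timeout` and `keepalive-time` to be shorter than `max-lifetime`.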

  • I will try this as soon as possible. Thanks for your help! – Elifcan Çakmak Nov 27 '17 at 16:34
  • Hi, I tried the solution and only applied the first part. It's been up since yesterday and hasn't failed. I guess it works :) Thanks a lot! – Elifcan Çakmak Nov 29 '17 at 08:51
  • Containers running kernel 4.13 or later will no longer inherit `tcp_keepalive_time` from the host (source: https://success.docker.com/article/ipvs-connection-timeout-issue), so this approach will no longer work with newer containers. However, as of Docker 19.03 there is a `sysctl` option that can be supplied to services (e.g. in a compose file). This can be used to set the above flags directly in the containers without messing with the host. https://docs.docker.com/compose/compose-file/#sysctls – benbotto Dec 12 '19 at 17:36
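
The `sysctls` option mentioned in the last comment could look roughly like the sketch below. It reuses the keepalive values from the answer. The service name `app-db`, the image tag and the file version are assumptions for illustration, and per the comment it needs Docker 19.03 or later when deployed as a stack.

```yaml
# Sketch only: service name, image tag and file version are assumptions.
version: '3.8'

services:
  app-db:
    image: postgres:12
    # Per-container keepalive settings, same values as the host-level
    # sysctl command in the answer.
    sysctls:
      - net.ipv4.tcp_keepalive_time=600
      - net.ipv4.tcp_keepalive_intvl=60
      - net.ipv4.tcp_keepalive_probes=3
```

With these values the first keepalive probe goes out after 600 seconds of idleness, which is under the 15-minute cutoff described in the other answer. The same block would presumably be needed on any other service whose containers hold long-lived connections.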
6

There is another way to prevent idle connections from being closed. The problem is related to the default Swarm service discovery mode (a virtual IP), which closes idle connections after 15 minutes.
Explicitly specifying the dnsrr endpoint mode resolves the problem, e.g.:

version: '3.3'

services:
  foo-service:
    image: example/foo-service:latest
    hostname: foo-service
    networks:
      - foo_network
    deploy:
      endpoint_mode: dnsrr
      # ...

networks:
  foo_network:
    external: true
    driver: overlay
Mikolasan
xxxception
  • Amazing. I haven't found anything about this behavior in the Docker docs. How did you figure it out? Maybe we should submit a PR to the Docker docs? – SilverFire Jan 30 '21 at 17:19