ArangoDB Cluster - Unable to interact with coordinator node after restarting the cluster when there is data in the data directory

  • ArangoDB Version: 3.4.4
  • Storage Engine: RocksDB
  • Deployment Mode: Cluster
  • Deployment Strategy: Manual Start in Docker
  • Configuration: see server settings below
  • Infrastructure: own
  • Operating System: CentOS 7
  • Total RAM in your machine: 12GB each (Non-prod Cluster)
  • Disks in use: SSD
  • Used Package: Docker (official)

Server Settings

Agency settings:

arangod --server.storage-engine rocksdb --server.endpoint tcp://0.0.0.0:8529 \
        --agency.my-address=tcp://agency1:8529 --server.authentication false \
        --agency.activate true --agency.size 3 --agency.supervision true

arangod --server.storage-engine rocksdb --server.endpoint tcp://0.0.0.0:8529 \
        --agency.my-address=tcp://agency2:8529 --server.authentication false \
        --agency.activate true --agency.size 3 --agency.supervision true

arangod --server.storage-engine rocksdb --server.endpoint tcp://0.0.0.0:8529 \
        --agency.my-address=tcp://agency3:8529 --server.authentication false \
        --agency.activate true --agency.size 3 \
        --agency.endpoint tcp://agency1:8529 --agency.endpoint tcp://agency2:8529 \
        --agency.endpoint tcp://agency3:8529 --agency.supervision true

Primary Settings:

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://dbserver1:8529 \
        --cluster.my-local-info db1 --cluster.my-role PRIMARY \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://dbserver2:8529 \
        --cluster.my-local-info db2 --cluster.my-role PRIMARY \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://dbserver3:8529 \
        --cluster.my-local-info db3 --cluster.my-role PRIMARY \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529

Coordinator settings:

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://coordinator1:8529 \
        --cluster.my-local-info coord1 --cluster.my-role COORDINATOR \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://coordinator2:8529 \
        --cluster.my-local-info coord2 --cluster.my-role COORDINATOR \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529

arangod --server.storage-engine rocksdb --server.authentication=false \
        --server.endpoint tcp://0.0.0.0:8529 \
        --cluster.my-address tcp://coordinator3:8529 \
        --cluster.my-local-info coord3 --cluster.my-role COORDINATOR \
        --cluster.agency-endpoint tcp://agency1:8529 \
        --cluster.agency-endpoint tcp://agency2:8529 \
        --cluster.agency-endpoint tcp://agency3:8529
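Each of the commands above runs inside a container from the official Docker image. For completeness, a sketch of how, e.g., agency1 is launched with a persistent data volume (the network name and host paths here are illustrative, not the exact values in use):

# User-defined network so the hostnames (agency1, dbserver1, ...) resolve between containers
docker network create arango-net

# ARANGO_NO_AUTH is required by the official image's entrypoint when no root password is set
docker run -d --name agency1 --network arango-net \
    -e ARANGO_NO_AUTH=1 \
    -v /data/agency1:/var/lib/arangodb3 \
    arangodb/arangodb:3.4.4 \
    arangod --server.storage-engine rocksdb --server.endpoint tcp://0.0.0.0:8529 \
            --agency.my-address=tcp://agency1:8529 --server.authentication false \
            --agency.activate true --agency.size 3 --agency.supervision true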

Current Server Config:

{
"check-configuration": false,
  "config": "/tmp/arangod.conf",
  "configuration": "/tmp/arangod.conf",
  "console": false,
  "daemon": false,
  "default-language": "en_US",
  "define": [],
  "dump-dependencies": false,
  "dump-options": false,
  "fortune": false,
  "gid": "",
  "hund": false,
  "log": [
    "info",
    "info"
  ],
  "pid-file": "",
  "supervisor": false,
  "uid": "",
  "version": false,
  "working-directory": "/var/tmp",
  "agency.activate": false,
  "agency.compaction-keep-size": 50000,
  "agency.compaction-step-size": 1000,
  "agency.disaster-recovery-id": "",
  "agency.election-timeout-max": 5,
  "agency.election-timeout-min": 1,
  "agency.endpoint": [],
  "agency.max-append-size": 250,
  "agency.my-address": "",
  "agency.pool-size": 1,
  "agency.size": 1,
  "agency.supervision": false,
  "agency.supervision-frequency": 1,
  "agency.supervision-grace-period": 10,
  "agency.wait-for-sync": true,
  "arangosearch.threads": 0,
  "arangosearch.threads-limit": 0,
  "cache.rebalancing-interval": 2000000,
  "cache.size": 2612399104,
  "cluster.agency-endpoint": [
    "tcp://agency1:8529",
    "tcp://agency2:8529",
    "tcp://agency3:8529"
  ],
  "cluster.agency-prefix": "arango",
  "cluster.create-waits-for-sync-replication": true,
  "cluster.index-create-timeout": 3600,
  "cluster.my-address": "tcp://coordinator1:8529",
  "cluster.my-advertised-endpoint": "",
  "cluster.my-role": "COORDINATOR",
  "cluster.require-persisted-id": false,
  "cluster.synchronous-replication-timeout-factor": 1,
  "cluster.synchronous-replication-timeout-per-4k": 0.1,
  "cluster.system-replication-factor": 2,
  "compaction.db-sleep-time": 1,
  "compaction.dead-documents-threshold": 16384,
  "compaction.dead-size-percent-threshold": 0.1,
  "compaction.dead-size-threshold": 131072,
  "compaction.max-file-size-factor": 3,
  "compaction.max-files": 3,
  "compaction.max-result-file-size": 134217728,
  "compaction.min-interval": 10,
  "compaction.min-small-data-file-size": 131072,
  "database.auto-upgrade": false,
  "database.check-version": false,
  "database.directory": "coord1",
  "database.force-sync-properties": true,
  "database.ignore-datafile-errors": false,
  "database.init-database": false,
  "database.maximal-journal-size": 33554432,
  "database.password": "",
  "database.required-directory-state": "any",
  "database.restore-admin": false,
  "database.throw-collection-not-loaded-error": false,
  "database.upgrade-check": true,
  "database.wait-for-sync": false,
  "foxx.queues": true,
  "foxx.queues-poll-interval": 1,
  "frontend.proxy-request-check": true,
  "frontend.trusted-proxy": [],
  "frontend.version-check": true,
  "http.allow-method-override": false,
  "http.hide-product-header": false,
  "http.keep-alive-timeout": 300,
  "http.trusted-origin": [],
  "javascript.allow-admin-execute": false,
  "javascript.app-path": "/var/lib/arangodb3-apps",
  "javascript.copy-installation": false,
  "javascript.enabled": true,
  "javascript.gc-frequency": 60,
  "javascript.gc-interval": 2000,
  "javascript.module-directory": [],
  "javascript.script": [],
  "javascript.script-parameter": [],
  "javascript.startup-directory": "/usr/share/arangodb3/js",
  "javascript.v8-contexts": 64,
  "javascript.v8-contexts-max-age": 60,
  "javascript.v8-contexts-max-invocations": 0,
  "javascript.v8-contexts-minimum": 1,
  "javascript.v8-max-heap": 3072,
  "javascript.v8-options": [],
  "log.color": true,
  "log.escape": true,
  "log.file": "-",
  "log.force-direct": false,
  "log.foreground-tty": false,
  "log.keep-logrotate": false,
  "log.level": [
    "info",
    "info"
  ],
  "log.line-number": false,
  "log.output": [
    "-"
  ],
  "log.performance": false,
  "log.prefix": "",
  "log.request-parameters": true,
  "log.role": false,
  "log.shorten-filenames": true,
  "log.thread": false,
  "log.thread-name": false,
  "log.use-local-time": false,
  "log.use-microtime": false,
  "nonce.size": 4194304,
  "query.cache-entries": 128,
  "query.cache-entries-max-size": 268435456,
  "query.cache-entry-max-size": 16777216,
  "query.cache-include-system-collections": false,
  "query.cache-mode": "off",
  "query.fail-on-warning": false,
  "query.memory-limit": 0,
  "query.optimizer-max-plans": 128,
  "query.registry-ttl": 600,
  "query.slow-streaming-threshold": 10,
  "query.slow-threshold": 10,
  "query.tracking": true,
  "query.tracking-with-bindvars": true,
  "random.generator": 1,
  "replication.active-failover": false,
  "replication.auto-start": true,
  "replication.automatic-failover": false,
  "rocksdb.block-align-data-blocks": false,
  "rocksdb.block-cache-shard-bits": -1,
  "rocksdb.block-cache-size": 3134878924,
  "rocksdb.compaction-read-ahead-size": 2097152,
  "rocksdb.debug-logging": false,
  "rocksdb.delayed_write_rate": 0,
  "rocksdb.dynamic-level-bytes": true,
  "rocksdb.enable-pipelined-write": false,
  "rocksdb.enable-statistics": false,
  "rocksdb.enforce-block-cache-size-limit": false,
  "rocksdb.intermediate-commit-count": 1000000,
  "rocksdb.intermediate-commit-size": 536870912,
  "rocksdb.level0-compaction-trigger": 2,
  "rocksdb.level0-slowdown-trigger": 20,
  "rocksdb.level0-stop-trigger": 36,
  "rocksdb.max-background-jobs": 4,
  "rocksdb.max-bytes-for-level-base": 268435456,
  "rocksdb.max-bytes-for-level-multiplier": 10,
  "rocksdb.max-subcompactions": 0,
  "rocksdb.max-total-wal-size": 83886080,
  "rocksdb.max-transaction-size": 18446744073709552000,
  "rocksdb.max-write-buffer-number": 2,
  "rocksdb.min-write-buffer-number-to-merge": 1,
  "rocksdb.num-levels": 7,
  "rocksdb.num-threads-priority-high": 2,
  "rocksdb.num-threads-priority-low": 2,
  "rocksdb.num-uncompressed-levels": 2,
  "rocksdb.optimize-filters-for-hits": false,
  "rocksdb.recycle-log-file-num": 0,
  "rocksdb.sync-interval": 100,
  "rocksdb.table-block-size": 16384,
  "rocksdb.throttle": true,
  "rocksdb.total-write-buffer-size": 4179838566,
  "rocksdb.transaction-lock-timeout": 1000,
  "rocksdb.use-direct-io-for-flush-and-compaction": false,
  "rocksdb.use-direct-reads": false,
  "rocksdb.use-file-logging": false,
  "rocksdb.use-fsync": false,
  "rocksdb.wal-directory": "",
  "rocksdb.wal-file-timeout": 10,
  "rocksdb.wal-file-timeout-initial": 180,
  "rocksdb.wal-recovery-skip-corrupted": false,
  "rocksdb.write-buffer-size": 67108864,
  "server.allow-use-database": false,
  "server.authentication": false,
  "server.authentication-system-only": true,
  "server.authentication-timeout": 0,
  "server.authentication-unix-sockets": true,
  "server.check-max-memory-mappings": true,
  "server.descriptors-minimum": 0,
  "server.endpoint": [
    "tcp://0.0.0.0:8529"
  ],
  "server.flush-interval": 1000000,
  "server.gid": "",
  "server.jwt-secret": "",
  "server.jwt-secret-keyfile": "",
  "server.local-authentication": true,
  "server.maintenance-actions-block": 2,
  "server.maintenance-actions-linger": 3600,
  "server.maintenance-threads": 2,
  "server.maximal-queue-size": 4096,
  "server.maximal-threads": 64,
  "server.minimal-threads": 2,
  "server.prio1-size": 1048576,
  "server.rest-server": true,
  "server.scheduler-queue-size": 128,
  "server.statistics": true,
  "server.storage-engine": "rocksdb",
  "server.uid": "",
  "ssl.cafile": "",
  "ssl.cipher-list": "HIGH:!EXPORT:!aNULL@STRENGTH",
  "ssl.ecdh-curve": "prime256v1",
  "ssl.keyfile": "",
  "ssl.options": 2147485780,
  "ssl.protocol": 5,
  "ssl.session-cache": false,
  "tcp.backlog-size": 64,
  "tcp.reuse-address": true,
  "temp.path": "",
  "vst.maxsize": 30720,
  "wal.allow-oversize-entries": true,
  "wal.directory": "",
  "wal.flush-timeout": 15000,
  "wal.historic-logfiles": 10,
  "wal.ignore-logfile-errors": false,
  "wal.ignore-recovery-errors": false,
  "wal.logfile-size": 33554432,
  "wal.open-logfiles": 0,
  "wal.reserve-logfiles": 3,
  "wal.slots": 1048576,
  "wal.sync-interval": 100000,
  "wal.throttle-wait": 15000,
  "wal.throttle-when-pending": 0,
  "wal.use-mlock": false
}
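
For reference, a dump in this form can be produced with the --dump-options flag that appears in the listing above; a sketch using the config path from the dump:

arangod --configuration /tmp/arangod.conf --dump-options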

The cluster runs and works as expected when starting fresh. If I shut it down and then restart it (so there are now files in the data directories that were empty before), I get a 503 error when trying to access the web UI. In addition, the coordinator node logs this error:

2019-03-26T21:58:47Z [1] ERROR {cluster} ClusterComm::performRequests: got BACKEND_UNAVAILABLE or TIMEOUT from shard:s3010020:/_db/_system/_api/document?collection=s3010020&waitForSync=false&returnNew=false&returnOld=false&isRestore=false&overwrite=false
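
When the cluster is in this state, the health status the agency has recorded for each server can be inspected directly, which should show which DB-Servers the cluster considers unreachable. A sketch against the endpoints above (the "arango" path segment matches the cluster.agency-prefix setting in the dump):

# Ask a coordinator for the cluster health overview
curl -s http://coordinator1:8529/_admin/cluster/health

# Or read the supervision's health records straight from an agent
curl -s -X POST http://agency1:8529/_api/agency/read \
    -d '[["/arango/Supervision/Health"]]'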

The only way I have found to remedy this is to shut the cluster down, remove all files from the data directories of the primary (DB-Server) nodes, and then restart (sketched below). I would expect a database cluster to be able to start with existing files in its data directory, so I am guessing this is a settings issue on my end. Any help or suggestions would be much appreciated.

At this point I am running the cluster with each node NOT having a persistent volume, so that I don't lose communication with the cluster. The nodes start up and the data gets sharded across them, which works, but if the whole cluster goes down I will lose all data (again, this is a TEST/non-prod cluster).

One thing to note: I have watched the logs during the shutdown process for all the nodes in the cluster, and they all shut down cleanly (i.e. not forced down).
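
Concretely, the workaround looks like this (container names and host paths are illustrative):

# Stop everything, wipe only the DB-Server data directories, then bring the
# cluster back up: agents first, then DB-Servers, then coordinators
docker stop coordinator1 coordinator2 coordinator3 \
            dbserver1 dbserver2 dbserver3 agency1 agency2 agency3
rm -rf /data/dbserver1/* /data/dbserver2/* /data/dbserver3/*
docker start agency1 agency2 agency3
docker start dbserver1 dbserver2 dbserver3
docker start coordinator1 coordinator2 coordinator3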

YeeP

Posted 2019-03-28T16:32:13.780

Reputation: 1

No answers