
Our Elasticsearch cluster is a mess. The cluster health is always red, and I've decided to look into it and salvage it if possible, but I have no idea where to begin. Here is some info regarding our cluster:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 91,
  "active_shards" : 91,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 201,
  "number_of_pending_tasks" : 0
}

The 6 nodes:

host               ip         heap.percent ram.percent load node.role master name
es04e.p.comp.net 10.0.22.63            30          22 0.00 d         m      es04e-es
es06e.p.comp.net 10.0.21.98            20          15 0.37 d         m      es06e-es
es08e.p.comp.net 10.0.23.198            9          44 0.07 d         *      es08e-es
es09e.p.comp.net 10.0.32.233           62          45 0.00 d         m      es09e-es
es05e.p.comp.net 10.0.65.140           18          14 0.00 d         m      es05e-es
es07e.p.comp.net 10.0.11.69            52          45 0.13 d         m      es07e-es
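
For reference, the two listings above come from the cluster health and cat nodes APIs; assuming Elasticsearch is listening on the default localhost:9200 (an assumption on my part), they can be reproduced with:

    curl -s 'http://localhost:9200/_cluster/health?pretty'
    curl -s 'http://localhost:9200/_cat/nodes?v'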

Straight away you can see I have a very large number of unassigned shards (201). I came across this answer and tried it and got 'acknowledged: true', but there was no change in either of the sets of info posted above.

Next I logged into one of the nodes, es04, and went through the log files. The first log file has a few lines that caught my attention:

[2015-05-21 19:44:51,561][WARN ][transport.netty          ] [es04e-es] exception caught on transport layer [[id: 0xbceea4eb]], closing connection

and

[2015-05-26 15:14:43,157][INFO ][cluster.service          ] [es04e-es] removed {[es03e-es][R8sz5RWNSoiJ2zm7oZV_xg][es03e.p.sojern.net][inet[/10.0.2.16:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:22:28,721][INFO ][cluster.service          ] [es04e-es] removed {[es02e-es][XZ5TErowQfqP40PbR-qTDg][es02e.p.sojern.net][inet[/10.0.2.229:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:32:00,448][INFO ][discovery.ec2            ] [es04e-es] master_left [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]], reason [shut_down]
[2015-05-26 15:32:00,449][WARN ][discovery.ec2            ] [es04e-es] master left (reason = shut_down), current nodes: {[es07e-es][etJN3eOySAydsIi15sqkSQ][es07e.p.sojern.net][inet[/10.0.2.69:9300]],[es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]],[es05e-es][ZoLnYvAdTcGIhbcFRI3H_A][es05e.p.sojern.net][inet[/10.0.1.140:9300]],[es08e-es][FPa4q07qRg-YA7hAztUj2w][es08e.p.sojern.net][inet[/10.0.2.198:9300]],[es09e-es][4q6eACbOQv-TgEG0-Bye6w][es09e.p.sojern.net][inet[/10.0.2.233:9300]],[es06e-es][zJ17K040Rmiyjf2F8kjIiQ][es06e.p.sojern.net][inet[/10.0.1.98:9300]],}
[2015-05-26 15:32:00,450][INFO ][cluster.service          ] [es04e-es] removed {[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]],}, reason: zen-disco-master_failed ([es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]])
[2015-05-26 15:32:36,741][INFO ][cluster.service          ] [es04e-es] new_master [es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]], reason: zen-disco-join (elected_as_master)

In this section I realized there were a few nodes, es01, es02 and es03, which had been removed from the cluster.

After this, all the log files (around 30 of them) have only one line:

[2015-05-26 15:43:49,971][DEBUG][action.bulk              ] [es04e-es] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

I have checked all the nodes and they have the same versions of Elasticsearch and Logstash. I realize this is a big, complicated issue, but if anyone can figure out the problem and nudge me in the right direction it will be a HUGE help.

EDIT:

Indices:
health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2015.07.20   5   1   95217146            0     30.8gb         30.8gb
yellow open   logstash-2015.07.12   5   1      66254            0     10.5mb         10.5mb
yellow open   logstash-2015.07.21   5   1   51979343            0     17.8gb         17.8gb
yellow open   logstash-2015.07.17   5   1     184206            0     27.9mb         27.9mb
red    open   logstash-2015.08.18   5   1
red    open   logstash-2015.08.19   5   1
yellow open   logstash-2015.07.25   5   1  116490654            0       55gb           55gb
red    open   logstash-2015.08.11   5   1
red    open   logstash-2015.08.20   5   1
yellow open   logstash-2015.07.28   5   1  171527842            0     79.5gb         79.5gb
red    open   logstash-2015.08.03   5   1
yellow open   logstash-2015.07.26   5   1  130029870            0     61.1gb         61.1gb
red    open   logstash-2015.08.01   5   1
yellow open   logstash-2015.07.18   5   1     143834            0     21.5mb         21.5mb
red    open   logstash-2015.08.05   5   1
yellow open   logstash-2015.07.19   5   1      94908            0       15mb           15mb
yellow open   logstash-2015.07.22   5   1   52295727            0     18.2gb         18.2gb
red    open   logstash-2015.07.29   5   1
red    open   logstash-2015.08.02   5   1
yellow open   logstash-2015.07.16   5   1     185120            0     25.8mb         25.8mb
red    open   logstash-2015.08.04   5   1
yellow open   logstash-2015.07.24   5   1  144885713            0     68.3gb         68.3gb
red    open   logstash-2015.07.30   5   1
yellow open   logstash-2015.07.14   5   1   65650867            0     22.1gb         22.1gb
yellow open   logstash-2015.07.27   5   1  170717799            0     79.3gb         79.3gb
red    open   logstash-2015.07.31   5   1
yellow open   .kibana               1   1          7            0     30.3kb         30.3kb
yellow open   logstash-2015.07.13   5   1      87420            0     13.5mb         13.5mb
yellow open   logstash-2015.07.23   5   1  161453183            0     75.7gb         75.7gb
yellow open   logstash-2015.07.15   5   1     189168            0     34.9mb         34.9mb
yellow open   logstash-2015.07.11   5   1      58411            0      8.9mb          8.9mb
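
For completeness, the index listing above is from the cat indices API; assuming the same localhost:9200 endpoint as before, the red indices can be isolated with:

    curl -s 'http://localhost:9200/_cat/indices?v'
    curl -s 'http://localhost:9200/_cat/indices' | grep '^red'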

elasticsearch.yml

cloud:
  aws:
    protocol: http
    region: us-east
discovery:
  type: ec2
node:
  name: es04e-es
path:
  data: /var/lib/es-data-elasticsearch
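
Note that the config has no discovery.zen.minimum_master_nodes setting. With six master-eligible nodes, the usual quorum guard against split-brain would look like the sketch below; the value 4 is simply (6 / 2) + 1 and is an assumption based on the current node count, not something taken from this cluster:

    discovery:
      zen:
        # quorum for 6 master-eligible nodes: (6 / 2) + 1 (assumed value)
        minimum_master_nodes: 4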

Shards - http://pastie.org/10364963

_cluster/settings?pretty=true

{
  "persistent" : { },
  "transient" : { }
}
Beginner
  • Hm... Let me poke around this stuff tomorrow and see if I can help. – GregL Aug 20 '15 at 22:08
  • It seems to me that the indices which are red were previously hosted entirely on es01, es02 and es03, with none of their shards being on the remaining hosts. When the nodes dropped out of the cluster, effectively so did the indices they hosted. Are those 3 nodes now gone for good? The rest of the indices appear to have their primary shards STARTED, but none of their replicas are. Can you post the output of `/_cluster/settings?pretty`? – GregL Aug 21 '15 at 18:17
  • @GregL - added. Another issue I found was that no minimum master nodes setting (discovery.zen.minimum_master_nodes) was configured, which might've led to the split-brain issue. – Beginner Aug 21 '15 at 18:40

1 Answer


The main reason for the red status is clear: unassigned shards. OK, let's try to resolve it:

  1. Let's check why the shards are in an unassigned state:

    _cat/shards?h=index,shard,prirep,state,unassigned.reason
    
  2. Check that you have enough free disk space on the nodes:

    _cat/allocation?v
    
  3. You can try to allocate a shard to a data node manually (by POSTing the command below to the `_cluster/reroute` API) and check the Elasticsearch response for a full explanation of any error; a curl sketch follows this list:

    {
      "commands": [
        {
          "allocate": {
            "index": "index_name",
            "shard": shard_num,
            "node": "datanode_name",
            "allow_primary": true
          }
        }
      ]
    }
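
A minimal sketch of how the allocation command above is typically submitted on a 1.x cluster, via the `_cluster/reroute` API. The host, index, shard number and node name are placeholders taken from the question, not a recommendation for that specific shard; also note that `allow_primary: true` will bring a lost primary up empty, so any data that only lived on the missing nodes is gone:

    curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
      "commands": [
        {
          "allocate": {
            "index": "logstash-2015.08.18",
            "shard": 0,
            "node": "es04e-es",
            "allow_primary": true
          }
        }
      ]
    }'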
    
adel-s