
I've created a small Ceph cluster following the quick start guide, with one exception: I've used a separate disk for the OSDs rather than a folder. Instead of

ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1
ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1

I've issued

ceph-deploy osd prepare node2:/dev/sdb node3:/dev/sdb
ceph-deploy osd activate node2:/dev/sdb1 node3:/dev/sdb1

In the same environment the folder approach works fine and the cluster reaches the active+clean state.

I've checked that both OSDs are showing as up, and I've tried to follow the troubleshooting guide, but none of the approaches described there seem to work.
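
For reference, the checks I ran were roughly along these lines, listing the stuck PGs and querying one of them (the PG ID here is just an example):

ceph health detail
ceph pg dump_stuck unclean
ceph pg 0.1 query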

Below is the output from ceph osd tree, ceph -s and ceph osd dump:

# id    weight  type name   up/down reweight
-1  0   root default
-2  0       host node2
0   0           osd.0   up  1
-3  0       host node3
1   0           osd.1   up  1


cluster 5d7d7a6f-63c9-43c5-aebb-5458fd3ae43e
 health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
 monmap e1: 1 mons at {node1=10.10.10.12:6789/0}, election epoch 1, quorum 0 node1
 osdmap e8: 2 osds: 2 up, 2 in
  pgmap v15: 192 pgs, 3 pools, 0 bytes data, 0 objects
        68476 kB used, 6055 MB / 6121 MB avail
             192 active+degraded

 epoch 8
 fsid 5d7d7a6f-63c9-43c5-aebb-5458fd3ae43e
 created 2015-04-04 21:45:58.089596
 modified 2015-04-04 23:26:06.840590
 flags
 pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
 pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
 pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
 max_osd 2
 osd.0 up   in  weight 1 up_from 4 up_thru 4 down_at 0 last_clean_interval [0,0) 10.10.10.13:6800/1749 10.10.10.13:6801/1749 10.10.10.13:6802/1749 10.10.10.13:6803/1749 exists,up 42d5622d-8907-4991-a6b6-869190c21678
 osd.1 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) 10.10.10.14:6800/1750 10.10.10.14:6801/1750 10.10.10.14:6802/1750 10.10.10.14:6803/1750 exists,up b0a515d3-5f24-4e69-a5b3-1e094617b5b4

1 Answer


After some more research it turns out that the clue was in the ceph osd tree output: the weights were all set to 0. This seems to be a problem with either Ceph or the ceph-deploy script, as it's 100% reproducible. Resetting the OSD weights in the CRUSH map fixes the issue. All I had to do was issue the commands below:

ceph osd crush reweight osd.0 6
ceph osd crush reweight osd.1 6
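
To verify, ceph osd tree should now show non-zero weights and ceph -s should report the PGs as active+clean:

ceph osd tree
ceph -s

As far as I understand, the CRUSH weight is normally derived from the disk size in TB, so for tiny test disks like these it presumably gets rounded down to 0. The value 6 isn't significant here; any non-zero weight does the trick.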