
CentOS 7.2, Ceph 0.94.6 (Hammer) with 3 OSDs and 1 MON running on the same node. radosgw and all the other daemons run on that node as well, and everything was working fine. After rebooting the server, the OSDs apparently cannot communicate with each other, and radosgw no longer works properly; its log says:

2016-03-09 17:03:30.916678 7fc71bbce880  0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 24181
2016-03-09 17:08:30.919245 7fc712da8700 -1 Initialization timeout, failed to initialize

ceph health shows:

HEALTH_WARN 1760 pgs stale; 1760 pgs stuck stale; too many PGs per OSD (1760 > max 300); 2/2 in osds are down
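
For reference, these are standard Hammer-era commands to see which PGs are stuck and why (a sketch, assuming the default cluster name and an admin keyring; the mon_pg_warn_max_per_osd option is the stock Hammer setting behind the "too many PGs" warning):

# Show per-PG detail behind the health warnings above
ceph health detail

# Dump only the PGs stuck in the 'stale' state
ceph pg dump_stuck stale

# The 'too many PGs per OSD' warning threshold comes from
# mon_pg_warn_max_per_osd (default 300); it can be raised in ceph.conf:
#   [mon]
#   mon pg warn max per osd = 2000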

and ceph osd tree gives:

ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.01999 root default
-2 1.01999     host app112
 0 1.00000         osd.0              down  1.00000          1.00000
 1 0.01999         osd.1              down        0          1.00000
-3 1.00000     host node146
 2 1.00000         osd.2              down  1.00000          1.00000

and service ceph status returns:

=== mon.app112 ===
mon.app112: running {"version":"0.94.6"}
=== osd.0 ===
osd.0: running {"version":"0.94.6"}
=== osd.1 ===
osd.1: running {"version":"0.94.6"}
=== osd.2 ===
osd.2: running {"version":"0.94.6"}
=== osd.0 ===
osd.0: running {"version":"0.94.6"}
=== osd.1 ===
osd.1: running {"version":"0.94.6"}
=== osd.2 ===
osd.2: running {"version":"0.94.6"}

and this is the output of service radosgw status:

Redirecting to /bin/systemctl status  radosgw.service
● ceph-radosgw.service - LSB: radosgw RESTful rados gateway
   Loaded: loaded (/etc/rc.d/init.d/ceph-radosgw)
   Active: active (exited) since Wed 2016-03-09 17:03:30 CST; 1 day 23h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 24134 ExecStop=/etc/rc.d/init.d/ceph-radosgw stop (code=exited, status=0/SUCCESS)
  Process: 2890 ExecReload=/etc/rc.d/init.d/ceph-radosgw reload (code=exited, status=0/SUCCESS)
  Process: 24153 ExecStart=/etc/rc.d/init.d/ceph-radosgw start (code=exited, status=0/SUCCESS)

Seeing this, I tried sudo /etc/init.d/ceph -a start osd.1 and the corresponding stop a couple of times, but the result is the same as above.

sudo /etc/init.d/ceph -a stop osd.1
=== osd.1 ===
Stopping Ceph osd.1 on open-kvm-app92...kill 12688...kill 12688...done

sudo /etc/init.d/ceph -a start osd.1
=== osd.1 ===
create-or-move updated item name 'osd.1' weight 0.02 at location {host=open-kvm-app92,root=default} to crush map
Starting Ceph osd.1 on open-kvm-app92...
Running as unit ceph-osd.1.1457684205.040980737.service.

Please help. Thanks.

EDIT: It seems the MON cannot talk to the OSDs, even though both daemons are running fine. The OSD log shows:

2016-03-11 17:35:21.649712 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:22.649982 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:23.650262 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:24.650538 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:25.650807 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:25.779693 7f0024c96700  5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
2016-03-11 17:35:26.651059 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:27.651314 7f003c633700  5 osd.0 234 tick
2016-03-11 17:35:28.080165 7f0024c96700  5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
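
The empty peers []/[] in those heartbeat lines shows the OSD has no heartbeat peers at all. A quick way to confirm what state the daemon itself thinks it is in is the admin socket (a sketch, assuming the default socket path under /var/run/ceph):

# Ask osd.0 for its own status; 'state' typically shows 'booting'
# when the daemon is running but cannot reach the monitor or its peers
ceph daemon osd.0 status

# Equivalent form with an explicit socket path:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status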
  • Did you ever work out your issue? I'm having the same problem. – Spongman Feb 06 '18 at 01:21
  • @Spongman Yes, we finally solved the problem. It was a dark "kept trying until the problem was solved" epoch, together with quite a few other pitfalls. – Tiina Feb 06 '18 at 01:56
  • Any clues as to what the problem was? – Spongman Feb 06 '18 at 17:48
  • @Spongman I was absent for Chinese New Year. This was just one of tens of puzzles we ran into; I cannot remember whether it was solved or just bypassed. But I do remember there is a mailing list you can find on the Ceph website, and the Ceph masters there were really helpful. – Tiina Feb 22 '18 at 01:32
  • I did eventually work out what was wrong. The issue is with running everything on one node. For anyone else running into this: we had to manually change 'type node' to 'type osd' in our crushmap. – Spongman Feb 23 '18 at 23:09

1 Answer


I did eventually work out what was wrong: I had to manually change 'type host' to 'type osd' in our crushmap. Note this is slightly different from Spongman's suggestion above ('type node'); the CRUSH type in question is 'host'.

After booting rgw, I found that the owner of the radosgw process was "root", not "ceph". The command "ceph -s" also showed "100.000% pgs not active".

I searched for the clue "100.000% pgs not active", and the post https://www.cnblogs.com/boshen-hzb/p/13305560.html tells how to solve it: change 'type host' to 'type osd' in the crushmap. As a result, "ceph -s" shows "HEALTH_OK", the owner of the radosgw process becomes "ceph", and the rgw web service (port 7480) is listening.

(Screenshot: the owner of the radosgw process is root.)
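
For completeness, the usual workflow for that edit is to decompile the CRUSH map, change the rule's failure domain, and inject the map back (a sketch; file names are arbitrary, and the rule shown is the stock Hammer replicated_ruleset):

# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# In crushmap.txt, change the failure domain from 'host' to 'osd' so
# replicas may be placed on different OSDs of the same host:
#   rule replicated_ruleset {
#       ruleset 0
#       type replicated
#       min_size 1
#       max_size 10
#       step take default
#       step chooseleaf firstn 0 type host   # <- change 'host' to 'osd'
#       step emit
#   }

# Recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin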
