
In the Mesos UI, under Completed Tasks → Sandbox, in the stdout file I can see the killTask signal:

Received killTask for task sources.b4e2c8e6-5b42-11e7-aec0-024227901b13

A full snapshot of the stdout file follows. You can see that even after receiving the killTask signal my process is still running, i.e. my process does not terminate itself.

 
2017-06-27 14:16:08,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:16:18,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:16:28,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:16:38,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:16:48,337 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:16:58,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:08,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:18,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:28,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:38,334 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:48,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 1, bytes sent 188 so far
2017-06-27 14:17:58,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:08,334 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:18,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:28,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:38,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:48,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:18:58,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:08,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:18,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:28,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:38,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:48,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:19:58,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:20:08,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:20:18,334 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:20:28,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:20:38,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
2017-06-27 14:20:48,332 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far
Received killTask for task sources.b4e2c8e6-5b42-11e7-aec0-024227901b13
2017-06-27 14:20:58,333 INFO  [Timer-0] com.informatica.vds.transport.ws.WSClient - appmonitor messages sent 2, bytes sent 376 so far

A full snapshot of the stderr file follows:

I0627 19:42:51.959991  7613 fetcher.cpp:533] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":true,"value":"file:\/\/\/etc\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0\/frameworks\/0e528b66-37aa-4d7a-933e-4638aabf494a-0000\/executors\/sources.b4e2c8e6-5b42-11e7-aec0-024227901b13\/runs\/219c102b-28ae-41d5-b98f-11829315119e"}
I0627 19:42:51.963241  7613 fetcher.cpp:444] Fetching URI 'file:///etc/docker.tar.gz'
I0627 19:42:51.963279  7613 fetcher.cpp:285] Fetching directly into the sandbox directory
I0627 19:42:51.963295  7613 fetcher.cpp:222] Fetching URI 'file:///etc/docker.tar.gz'
I0627 19:42:51.964923  7613 fetcher.cpp:207] Copied resource '/etc/docker.tar.gz' to '/var/lib/mesos/slaves/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.b4e2c8e6-5b42-11e7-aec0-024227901b13/runs/219c102b-28ae-41d5-b98f-11829315119e/docker.tar.gz'
I0627 19:42:52.070482  7613 fetcher.cpp:123] Extracted '/var/lib/mesos/slaves/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.b4e2c8e6-5b42-11e7-aec0-024227901b13/runs/219c102b-28ae-41d5-b98f-11829315119e/docker.tar.gz' into '/var/lib/mesos/slaves/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.b4e2c8e6-5b42-11e7-aec0-024227901b13/runs/219c102b-28ae-41d5-b98f-11829315119e'
I0627 19:42:52.070533  7613 fetcher.cpp:582] Fetched 'file:///etc/docker.tar.gz' to '/var/lib/mesos/slaves/632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.b4e2c8e6-5b42-11e7-aec0-024227901b13/runs/219c102b-28ae-41d5-b98f-11829315119e/docker.tar.gz'
I0627 19:42:56.096325  7643 exec.cpp:162] Version: 1.3.0
I0627 19:42:56.101958  7647 exec.cpp:237] Executor registered on agent 632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   221  100   138  100    83   8657   5207 --:--:-- --:--:-- --:--:--  9200
E0627 19:51:03.219312  7652 process.cpp:951] Failed to accept socket: future discarded

The messages "Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap." and "Failed to accept socket: future discarded" seem to be the culprits that are killing my container.

My question is: what is killing my container after 5 to 10 minutes, again and again?

I also updated the /etc/default/grub file with

GRUB_CMDLINE_LINUX_DEFAULT="cgroup_enable=memory swapaccount=1"

and rebooted my system, but that made no difference.
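
For completeness, on Ubuntu that kernel flag change only takes effect after regenerating the GRUB config and rebooting; here is a minimal sketch of how to apply and verify it (the docker info check is just one way to confirm swap accounting is on):

sudo update-grub                        # regenerate /boot/grub/grub.cfg from /etc/default/grub
sudo reboot
cat /proc/cmdline                       # should now contain cgroup_enable=memory swapaccount=1
docker info 2>&1 | grep -i swap         # the "No swap limit support" warning should be gone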

Any insights on this issue would be appreciated.

My Ubuntu VMware configuration is as follows:

  1. Cores assigned: 3
  2. Memory: 6 GB
  3. HDD: 32 GB
  4. I am running only one container, and it still exits after a few minutes.

[EDIT: Adding stderr file contents from the Mesos UI at /var/lib/mesos/slaves/29df799b-4797-41df-a005-465f211d286b-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.a634642c-5bbc-11e7-ba8b-024239f32c24/runs/1bda209c-c2b8-4bb5-a41b-26361e00a284]

Here is the stderr file content of another job:

I0628 10:15:45.951104  4357 fetcher.cpp:533] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/29df799b-4797-41df-a005-465f211d286b-S0","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":true,"value":"file:\/\/\/etc\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/29df799b-4797-41df-a005-465f211d286b-S0\/frameworks\/0e528b66-37aa-4d7a-933e-4638aabf494a-0000\/executors\/sources.a634642c-5bbc-11e7-ba8b-024239f32c24\/runs\/1bda209c-c2b8-4bb5-a41b-26361e00a284"}
I0628 10:15:45.953835  4357 fetcher.cpp:444] Fetching URI 'file:///etc/docker.tar.gz'
I0628 10:15:45.953881  4357 fetcher.cpp:285] Fetching directly into the sandbox directory
I0628 10:15:45.953974  4357 fetcher.cpp:222] Fetching URI 'file:///etc/docker.tar.gz'
I0628 10:15:45.956663  4357 fetcher.cpp:207] Copied resource '/etc/docker.tar.gz' to '/var/lib/mesos/slaves/29df799b-4797-41df-a005-465f211d286b-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.a634642c-5bbc-11e7-ba8b-024239f32c24/runs/1bda209c-c2b8-4bb5-a41b-26361e00a284/docker.tar.gz'
I0628 10:15:46.061069  4357 fetcher.cpp:123] Extracted '/var/lib/mesos/slaves/29df799b-4797-41df-a005-465f211d286b-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.a634642c-5bbc-11e7-ba8b-024239f32c24/runs/1bda209c-c2b8-4bb5-a41b-26361e00a284/docker.tar.gz' into '/var/lib/mesos/slaves/29df799b-4797-41df-a005-465f211d286b-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.a634642c-5bbc-11e7-ba8b-024239f32c24/runs/1bda209c-c2b8-4bb5-a41b-26361e00a284'
I0628 10:15:46.061148  4357 fetcher.cpp:582] Fetched 'file:///etc/docker.tar.gz' to '/var/lib/mesos/slaves/29df799b-4797-41df-a005-465f211d286b-S0/frameworks/0e528b66-37aa-4d7a-933e-4638aabf494a-0000/executors/sources.a634642c-5bbc-11e7-ba8b-024239f32c24/runs/1bda209c-c2b8-4bb5-a41b-26361e00a284/docker.tar.gz'
I0628 10:15:49.898803  4389 exec.cpp:162] Version: 1.3.0
I0628 10:15:49.903390  4390 exec.cpp:237] Executor registered on agent 29df799b-4797-41df-a005-465f211d286b-S0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   221  100   138  100    83   5385   3239 --:--:-- --:--:-- --:--:-- 11500
W0628 10:15:49.903390  4389 logging.cpp:91] RAW: Received signal SIGTERM from process 3287 of user 0; exiting
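
To see which process sent that SIGTERM, one can look up PID 3287 while it is still alive (a sketch; the PID comes from the log line above):

ps -p 3287 -o pid,user,comm,args        # command name and arguments of PID 3287
readlink /proc/3287/exe                 # full path of its executable

In my case this turned out to be the mesos-slave process (see the comments below).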

No new logs were created in the /var/log/mesos-master.ERROR file today. Content of the /var/log/mesos-master.WARNING file:

Log file created at: 2017/06/28 10:04:56
Running on machine: ubuntu
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0628 10:04:56.387049  3193 authenticator.cpp:512] No credentials provided, authentication requests will be refused
W0628 10:14:56.617103  3221 master.cpp:2011] Agent 632f9d21-ae71-4cca-95e4-63e2b3dbd78e-S0 (ubuntu) did not re-register within 10mins after master failover; marking it unreachable

The content of the /var/log/mesos-slave.WARNING file is the same as that of the mesos-slave.ERROR file. Content of the /var/log/mesos-slave.ERROR file:

Log file created at: 2017/06/28 10:05:00
Running on machine: ubuntu
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0628 10:05:00.712286  3287 shell.hpp:107] Command 'hadoop version 2>&1' failed; this is the output:
sh: 1: hadoop: not found
E0628 10:24:45.502921  3326 slave.cpp:4496] Failed to update resources for container 1bda209c-c2b8-4bb5-a41b-26361e00a284 of executor 'sources.a634642c-5bbc-11e7-ba8b-024239f32c24' running task sources.a634642c-5bbc-11e7-ba8b-024239f32c24 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/4469/cgroup: Failed to open file: No such file or directory
E0628 10:33:45.789072  3327 slave.cpp:4496] Failed to update resources for container 858170ce-0775-48be-8c85-3a1dbf320569 of executor 'sources.e7e069ed-5bbd-11e7-ba8b-024239f32c24' running task sources.e7e069ed-5bbd-11e7-ba8b-024239f32c24 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/5215/cgroup: Failed to open file: No such file or directory

I observed that the message:

Failed to read /proc/5215/cgroup: Failed to open file: No such file or directory

appears only when a container/task gets killed, whereas these files exist for currently running containers; presumably /proc/<pid>/cgroup vanishes once the process exits, so this error looks like a consequence of the kill rather than its cause. Thanks.

  • Can you provide logs from the scheduler/Mesos master and the Mesos agent? Usually it's the scheduler that asks Mesos to kill tasks. There are some exceptions to this rule, e.g. when the scheduler disconnects, Mesos waits until the failover timeout expires and then kills the tasks. Can you show your application definition? –  Jun 27 '17 at 21:30
  • Hi @janisz, this time I am explicitly receiving a SIGTERM signal. Any idea why this signal is received after 5 to 10 minutes of running? `logging.cpp:91] RAW: Received signal SIGTERM from process 3287 of user 0; exiting` –  Jun 28 '17 at 05:00
  • Can you check the name of process `3287`? –  Jun 28 '17 at 16:24
  • Hi Janisz, sorry for the long delay; I got busy with other stuff. Today I faced the same problem. I checked the process ID: it is the mesos-slave process. – ramanKC Jul 13 '17 at 09:36
  • Hi @janisz, I figured out why mesos-slave was killing my task. In the app config I provided an HTTP health check that we are supposed to implement ourselves (I assume so). **I removed the health check from the app config and after that it's working like a charm.** No more killed tasks. Thank you so much. Closing the issue. – ramanKC Jul 17 '17 at 05:40

1 Answer


It seems that Marathon relies on the user to implement the health check, i.e. if we provide a health check in the app config, we have to implement the corresponding endpoint in the application ourselves. I removed all the health checks that I had provided in the application configuration. After that, Marathon shows the app health as unknown, but Marathon (specifically mesos-slave) no longer kills the task.
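
For context, the health check block in a Marathon app definition looks roughly like this (a sketch; the path, port index, and timing values here are illustrative, not my exact configuration):

"healthChecks": [
  {
    "protocol": "HTTP",
    "path": "/health",
    "portIndex": 0,
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "maxConsecutiveFailures": 3
  }
]

With a block like this, Marathon polls the given path; if the application never serves that endpoint, every check fails, and once maxConsecutiveFailures is exceeded Marathon asks Mesos to kill the task, which matches the repeated 5 to 10 minute kill cycle described above. So either implement the endpoint in the application or drop the healthChecks array from the app definition entirely.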
