5

At job we have an old C program which works with industrial hand terminals from Honeywell.

That terminal has its own ssh client to connect to a linux redhat 6.6 server. Once it is connected to the linux box (using a certain user), a C program is launched by the bash shell with the following parameters

export TERM=vt200
stty raw icrnl -echo
$APLI_EXEC/program param1 param2

so the flow is like => client ssh --> ssh server-> bash --> c program

The application (or it seems) is working fine but sometimes (1-3-5 times per week) a randomly terminal stops receiving data from the server but the application receives the inputs from it. It is like if you writes Ctrl+S in a shell.

Debuging the application and the ssh process using strace I realized about something strange:

The app strace is fine

write(1, "1", 7) = 1

But the strace of the ssh process is not fine (I think.. and yes I saw the ioctl no echo param, but...)

rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(3, "\227\316\242\350\261\330)\300e\210\352\367\2VX\24\305\2474\272\371\34\273n{\323p.\211\17H\327"..., 16384) = 48
select(14, [3 9], [11], NULL, {900, 0}) = 1 (out [11], left {899, 999996})
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(11, "1", 1)         = 1
ioctl(11, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 -opost -isig -icanon -echo ...}) = 0
select(14, [3 9], [], NULL, {900, 0}                   <<<<

file descriptor used by the ssh process:

lr-x------ 1 root     root 64 Feb 15 17:12 9 -> pipe:[383586491]
lr-x------ 1 root     root 64 Feb 15 17:12 8 -> /var/lib/sss/mc/group
lrwx------ 1 root     root 64 Feb 15 17:12 7 -> socket:[383586484]
lrwx------ 1 root     root 64 Feb 15 17:12 6 -> socket:[383586478]
lrwx------ 1 root     root 64 Feb 15 17:12 5 -> socket:[383586458]
lrwx------ 1 root     root 64 Feb 15 17:12 4 -> socket:[383586457]
lrwx------ 1 root     root 64 Feb 15 17:12 3 -> socket:[383585929]
lrwx------ 1 root     root 64 Feb 15 17:12 2 -> /dev/null
lrwx------ 1 root     root 64 Feb 15 17:12 14 -> /dev/ptmx
lrwx------ 1 root     root 64 Feb 15 17:12 13 -> /dev/ptmx
lrwx------ 1 root     root 64 Feb 15 17:12 11 -> /dev/ptmx
l-wx------ 1 root     root 64 Feb 15 17:12 10 -> pipe:[383586491]
lrwx------ 1 root     root 64 Feb 15 17:12 1 -> /dev/null
lrwx------ 1 root     root 64 Feb 15 17:12 0 -> /dev/null

In the select call, I miss fd #11 or fd #13 in there

Comparing this with another call

rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(3, "\365\354\354C\10|\336-\4\342\327B0P\275&\213)\367\32\24\333)#\364\355V\3\237\337\33\204"..., 16384) = 52
select(14, [3 9 13], [11], NULL, {900, 0}) = 1 (out [11], left {899, 999997})
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(11, "a", 1)         = 1
ioctl(11, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(14, [3 9 13], [], NULL, {900, 0}          <<<

What it is going on with fd #13 in the other call?

Is it possible the C program is doing something to lock the file descriptor of ssh process?

It is possible the hand terminal sends a combination of ctrl keys which they 'hangs' the standard output?

I ran out of ideas. Can anybody drive me to the right direction?

February, the 25th. 2016

I have some more information about this:

select(14, [3 9 13], [], NULL, {900, 0})       = 1 (in [13], left {899, 998835}) <<- sshd realizes about data in fd #13 from C application
read(13, "\33[1;23H1\33[1;24H", 16384) = 15          <<- sshd check data from th fd#13
select(14, [3 9 13], [3], NULL, {900, 0}) = 1 (out [3], left {899, 999998}) <<- sshd sends data to fd#3 (socket)
write(3, "\301\236W\250\333\260\r\204\316o]:*1K\203\242\204\257Vb,V\347l\242\352K\341,,\307d\273\277\202.l\32F\2471\257DJt3\36\303\5\256\21K6\27\212\253\326|l\33\270\262S", 64) = 64 (1)  <<- sshd encrypts data to be sent
select(14, [3 9 13], [3], NULL, {900, 0}) = 1 (out [3], left {899, 999998})   <<-- sshd sends data thru the socket
select(14, [3 9 13], [], NULL, {900, 0}) = 1 (in [13], left {899, 998569})    <<- sshd realizes about data in fd #13 from C application 
read(13, "\7\33[1;16H        \33[6;6H_______\33[7;1H -INFORME CANT. RECOGIDA-\33[7;26H", 16384) = 67   <<- sshd check data from th fd#13
select(14, [3 9], [], NULL, {900, 0}) = 1 (in [3], left {892, 12016})  <<- sshd sends data to fd#3 (socket) but... where is fd#13 where sshd has to read it from?  

read it from?            

the terminal receives "\7\33[1;16H " but the rest of the string " \33[6;6H_______\33[7;1H -INFORME CANT. RECOGIDA-\33[7;26H"

is not received

why?

when the application writes into its output file descriptor (fd#1), sshd receives it (fd#13)..but is fd#1 cleared up at this moment? Or is sshd reading from the file descriptor char by char while it is sendind/encripting data?

1 Answers1

1

I answer to myself for saying I have just found the problem I hope the following helps to anyone

Finally I recompiled the ssh source code (openssh-5.3p1) inserting several 'traps' within the code to see what was going on there

channel_pre_open(Channel *c, fd_set *readset, fd_set *writeset)
{
u_int limit = compat20 ? c->remote_window : packet_get_maxsize();

int aux = buffer_len(&c->input);
debugtrap("en pre_open c-istate: %d limit %d buffer_len %d c_ostate %d
ctl_fd %d\n",c->istate,limit,aux,c->ostate,c->ctl_fd);

/* the rest of the function code */

limit variable at the beginning (by default) is 1024 * 1024

In normal conditions the limit variable is adjusting its window size each time channe_pre_open function is called (example using putty)

En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048495 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048495 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048415 buffer_len 0 c_ostate 0   ctl_fd -1
...... time later
En pre_open c-istate: 0 limit 1002267 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1002267 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1002267 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 998560 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 998560 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1

But, If I compare the same trace when the hand-terminal connects to the application I can see the buffer is consuming (it isn't renegotiate each time) the whole size

En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048576 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 1048476 buffer_len 0 c_ostate 0   ctl_fd -1
......
En pre_open c-istate: 0 limit 985 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 647 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 647 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 647 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 647 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 632 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 322 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 322 buffer_len 0 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1
En pre_open c-istate: 0 limit 0 buffer_len 16 c_ostate 0   ctl_fd -1

and finally limit variable comes to 0

When it happens fd#10 is lost from readfds in select call

0.000025 select(11, [3 6 10], [3], NULL, {900, 0}) = 1 (in [10], left {899, 999997})
0.000025 select(11, [3 6 10], [3], NULL, {900, 0}) = 1 (in [10], left {899, 999994})
0.000025 select(11, [3 6 10], [3], NULL, {900, 0}) = 1 (in [10], left {899, 999995})
0.000025 select(11, [3 6], [3], NULL, {900, 0}) = 1 (out [3], left {899, 908736})  
0.000026 select(11, [3 6], [3], NULL, {900, 0}) = 1 (out [3], left {899, 986906})  
0.000025 select(11, [3 6], [3], NULL, {900, 0}) = 1 (out [3], left {899, 992061})

The thing is that select call doesn't include this file descriptor in the set because its blocked by the other side until the buffer (client<->sshd) will be empty (it supposes the sshd cannot send more bytes to the ssh client because the window size is 0, so the fd has to be blocked to prevent send more information from the shell side)

This behaviour doesn't happen using putty client, and it seems its related ssh client of Honeywell hand-terminal which is based on Openssh (don't know the version)

Anyway, I have just confirmed the following version: OpenSSH_3.8.1p1, OpenSSL 0.9.7d 17 Mar 2004 is not affected (tested on Windows 10 against RHEL6 ssh-server 5.3.p1)

Nacho.