Sunday, August 3, 2014

linux, process go to "S" status for a while. socket programming problem.


You could try and trace the system calls and signals of one of the concerned processes.
Maybe you'll find a hint on what's goung on.

strace -p pid

where pid is the process id as found in the second column of "ps -ef".

You could add the "-f" flag to trace forked child processes as well:

strace -fp pid


Checking a strace -fp pid as suggested, I'm getting a huge amount of the following messages until I interrupt the command:

==========================

======================
strace -fp 30247
Process 30247 attached - interrupt to quit
SYS_7(0x3ffffaf9078, 0, 0xc350, 0, 0, 0x3ffffafc070, 0x800291fc, 0, 0x2000336dc40, 0x3ffffaf9088, 0x2000076e000, 0x20000741518, 0x200006e7840, 0x3ffffaf8fd8, 0x200007a7f30, 0, 0, 0, 0, 0, 0, 0, 0x3ffffaf9078, 0x8000000000000, 0x4050000000000000, 0, 0, 0, 0x4050000000000000, 0, 0, 0) = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000}, NULL)          = 0
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 60000) = 1
recv(3, "", 8192, MSG_DONTWAIT)         = 0
nanosleep({0, 50000000},  <unfinished ...>
Process 30247 detached
================================================

Any ideas? :S


all one can see is that the program polls for an event on filedescriptor 3 (which is obviously a socket), with a timeout of 60 seconds.
A POLLIN event is received, which means "there is data to read", and the return value of "1" means that one single structure  has been returned (the one indicating "POLLIN").

The subsequent nonblocking recv() receives a message of length zero from that socket, so the process decides to go to sleep for a while to then reissue the poll() call.

So the reason why this process passes most of its time sleeping is that the socket becomes well ready to present data, but these data are of zero length - obviously either a communications problem or desired behaviour - maybe the other end of the communication path would post an event regularly to keep your process from timing out.

You could additionally issue

lsof -p pid

to check where file descriptor (socket) 3 is connected to. The descriptor number is in the FD column, connection info is in the NAME column.

If the process whose data you posted has ended in the meantime and you're going to examine a different process please make sure to check for the correct socket descriptor - that's the "fd=..." number in the poll() call.

In any case (desired behaviour or communications problem) you should see your development folks to present this analysis to them and ask them what's the deal.

No comments: