Issue
Process A opens and mmaps thousands of files while running. Then kill -9 <pid of process A> is issued. I have a question about the order of the following two events:
a) /proc/<pid of process A> can no longer be accessed.
b) All files opened by process A are closed.
More background about the question:
Process A is a multi-threaded background service. It is started with the command ./process_A args1 arg2 arg3.
There is also a watchdog process that periodically (every 1 second) checks whether process A is still alive and restarts it if it is dead. The watchdog checks process A as follows:
1) Collect all numeric subdirectories under /proc/.
2) Compare each /proc/<pid>/cmdline with the cmdline of process A. If some /proc/<some-pid>/cmdline matches, process A is alive and the watchdog does nothing; otherwise it restarts process A.
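For reference, the check described above can be sketched roughly as follows. This is only an illustration of the approach in the question, not the actual watchdog code; the function names read_cmdline and find_pids are made up for the example, and it assumes a Linux /proc filesystem.

```python
import os


def read_cmdline(pid):
    """Return the argv list of `pid`, or None if its /proc entry cannot be read."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:
        # The process may have disappeared between listing /proc and opening the file.
        return None
    if not raw:
        # Some entries (for example kernel threads) report an empty command line.
        return []
    # Arguments are NUL-separated, with a trailing NUL after the last one.
    return [arg.decode(errors="replace") for arg in raw.rstrip(b"\0").split(b"\0")]


def find_pids(expected_argv):
    """Return the PIDs whose command line equals `expected_argv` (hypothetical helper)."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # step 1: only numeric subdirectories are processes
        if read_cmdline(int(entry)) == expected_argv:  # step 2: compare cmdline
            matches.append(int(entry))
    return matches
```

A call such as find_pids(["./process_A", "args1", "arg2", "arg3"]) returning an empty list only shows that no /proc entry currently reports that command line. It says nothing about whether the old process's locks and mappings have already been released, which is the race this question is about.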
Process A does the following during initialization:
1) open fileA
2) flock fileA
3) mmap fileA into memory
4) close fileA
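The four initialization steps could look roughly like this. It is a minimal sketch, not the real service code: the function name lock_and_map is hypothetical, and an exclusive lock (LOCK_EX) and a read-write shared mapping of the whole file are assumptions, since the question does not say which lock type or mapping flags are used.

```python
import fcntl
import mmap
import os


def lock_and_map(path):
    """Open, flock, mmap and close `path`; return the mapping (hypothetical helper)."""
    fd = os.open(path, os.O_RDWR)        # 1) open fileA
    try:
        # 2) flock fileA. Assumption: exclusive lock. This call blocks for as
        #    long as another open file description still holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        # 3) mmap the whole file (length 0 means "entire file"; it must not be empty).
        mapping = mmap.mmap(fd, 0)
    finally:
        os.close(fd)                     # 4) close fileA
    return mapping
```

Step 2 is the call that can block in a restarted process. In this Python version the mmap object keeps its own duplicate of the descriptor, so the lock stays held after step 4 until the mapping is closed or the process exits. That appears consistent with the behaviour reported in the question, where the unlock is only observed once process A is killed.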
After initialization, process A mmaps thousands of files.
After several minutes, kill -9 <pid of process A> is issued.
The watchdog detects the death of process A and restarts it. But sometimes the new process A gets stuck at step 2, flock fileA. After some debugging, we found that fileA is unlocked when process A is killed, but sometimes this unlock happens after the new process has already reached step 2, flock fileA.
So we guess that checking whether the process is alive by monitoring /proc/<pid of process A> is not correct.
Solution
Don't scan /proc/PID to find out whether a specific process has terminated. There are much better ways to do that, such as having your watchdog program actually launch the server program and wait for it to terminate.
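A minimal sketch of the launch-and-wait idea, assuming the watchdog can be the parent of the server. The function name supervise and its parameters are invented for this example; a real watchdog would loop indefinitely rather than stop after max_runs.

```python
import subprocess
import time


def supervise(argv, max_runs, restart_delay=0.0):
    """Run `argv` up to `max_runs` times, restarting it each time it exits.

    Returns the list of exit statuses. On POSIX, a negative value -N means
    the child was terminated by signal N (for example -9 after kill -9).
    """
    exit_codes = []
    while len(exit_codes) < max_runs:
        child = subprocess.Popen(argv)
        # wait() blocks until the kernel reports that the child has terminated,
        # so no polling of /proc is needed.
        exit_codes.append(child.wait())
        time.sleep(restart_delay)
    return exit_codes
```

For the service in the question this might be called as supervise(["./process_A", "args1", "arg2", "arg3"], max_runs=1000, restart_delay=1.0). The watchdog then learns about the death from the operating system's own child-termination notification instead of inferring it from a directory listing.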
Or have the watchdog listen on a TCP socket, and have the server process connect to it and send its PID. If either end dies, the other can notice that the connection was closed (hint: send a heartbeat packet every so often to detect a frozen peer). If the watchdog receives a connection from another server while the first is still running, it can decide to allow it or tell one of the instances to shut down (via TCP or kill()).
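The socket variant might be sketched like this. All names here (open_listener, register_with_watchdog, watch_one_peer) and the one-line-per-message framing are assumptions made for the illustration. It handles a single peer and leaves out the heartbeat timeout that would be needed to detect a frozen, as opposed to dead, peer.

```python
import os
import socket


def open_listener(host="127.0.0.1", port=0):
    """Watchdog side: create a listening TCP socket (port 0 picks a free port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def register_with_watchdog(address):
    """Server side: connect and announce our PID as one text line."""
    conn = socket.create_connection(address)
    conn.sendall(f"{os.getpid()}\n".encode())
    return conn  # keep this open for the lifetime of the server process


def watch_one_peer(listener):
    """Watchdog side: accept one peer, then block until its connection ends.

    Returns the PID the peer announced. Any further lines are treated as
    heartbeats and ignored.
    """
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as reader:
        pid = int(reader.readline().decode().strip())
        try:
            while reader.readline():  # b"" means the peer closed the connection
                pass
        except ConnectionResetError:
            pass  # an abrupt death can also show up as a reset
    return pid
```

When the server process is killed, the kernel closes its end of the connection, so watch_one_peer returns and the watchdog can restart the server or, knowing the PID, signal it with os.kill().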
Answered By - John Zwinck