Friday, October 15, 2010

A Very Nice and Very UNIXy Bug

I encountered a very puzzling bug in a user-land app I helped write. I wrote the socket comms library and the database backend (using SQLite3).

Once in a blue moon one of the sqlite3 databases would become corrupted and would be written off by the integrity_check pragma. The first thing to blame was sqlite3 with multithreading (tho I had it compiled for that). It ended up not being this.

What happened was soo much better: the socket library was being used by a thread pool so when a new request came it was handed to a worker thread which had to talk to some innards of the app and formulate a result. In the meanwhile the accepted socket was being kept open.

The comms library was used on the client side by ephemeral CGIs that were called by a YUI2 webapp. The web server was thttpd -- it has a CGI timeout cut-off after which it kills the child CGIs.

The innards of the app could sometimes take longer to respond that the CGI cut-off time so the CGI vanished and the client socket was closed. But on the app side (=server) the TCP socket was being used. When finally the innards finished doing whatever they were doing the data would be written to the dangling socket using write(2).

But in the meanwhile in another galax^H^H^H^Hthread a sqlite3 db would be opened, used and closed. UNIX has a policy of reusing the lowest numerical socket that becomes free.

See the conflict? The worker thread would write some late-arriving data to what it thinks it's a socket but now it's a file descriptor!

I fixed this by changing all calls for read/write in the comms library to recv/send -- the latter pair only works on sockets. Also for added paranoia I sprinkled the comms code with getpeername(2) and would log a critical error if a socket descriptor did not look as a socket.

Only took me two days to get to the bottom of this.

-ulianov