In an effort to write more, I'm considering cataloging more of the things I learn on a day-to-day basis. This post certainly falls into that category. We recently solved the mystery behind a pattern of failures that had been happening very infrequently over a long period of time. The culprit was surprising and a little unsettling, so I thought it might be worthy of a public blog post. Before I dive into the details, though, let me be very clear that while I contributed to this investigation, this work is not mine alone. As findings accumulated each time these issues surfaced, many folks in our engineering group added to the pile of insights that would eventually lead us to a bug in PHP's mysqlnd database driver. At the time of writing, the bug has been reported a number of times but still exists in the latest version of PHP (7.2.8).
Like many shops that use MySQL, we run pairs of servers in a master/master configuration. At any given time, we are only writing to one server (the primary), and if we happen to lose that primary, we fail over to the secondary and continue running. We try to minimize the impact of these failovers, but they do not happen transparently. For most long-running jobs, a failover means that the job receives an error; it then terminates and can be retried later, operating on the secondary database that has taken over for the dead primary. A few months ago, we had a database primary go down. During the cleanup that followed, we found the data associated with a particular job that had to retry was in a completely unexpected state. Here's a rough sketch, in pseudocode, of what that job looked like.
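Something along these lines — the schema and the helper functions here are my invention for illustration, not the real code:

```php
<?php
// Rough sketch of the job. Table, column, and helper names are hypothetical.
$pdo = new PDO('mysql:host=primary;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare('SELECT id FROM items WHERE job_id = ? AND processed = 0');
$stmt->execute([$jobId]);

while ($item = $stmt->fetch(PDO::FETCH_ASSOC)) {
    processItem($item['id']);              // do the actual work for this item
    markItemProcessed($item['id']);        // e.g. UPDATE items SET processed = 1 ...
}

// Only reached once the loop above has consumed the entire result set --
// or so we believed.
markJobCompleted($jobId);
```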
While the real code is obviously much more complex, the basic flow here is simple. The database holds a number of items belonging to the job. The job runs a query that selects all the items that have not been processed. It then processes each of those items, marking each one completed. Once it completes all the items, the job is marked as completed and will not run again. If we had a failover in the middle of running this code, we expected that the job would fail, either through an exception or a fatal error. The job would then be restarted, pick up processing where it was previously interrupted, and only after it successfully reached the end of the query results would it mark the job as completed.
Instead we found jobs that were marked completed that had some number of items that were not processed at all. Since the job was marked completed, it was clear that the code above had successfully iterated through the entire loop. This caused us to initially suspect this was an issue with asynchronous replication. But in some cases, we even saw that none of the items were processed. That would have indicated significant lag in our replication that just wasn't realistic. The cause would remain a mystery for a while longer since the impacts were still small and we had no reason to believe the issue wasn't isolated to this particular job.
A few months later, though, an attempt to lower some timeouts on database connections clarified what was causing the grief. Shortly after pushing out configuration changes to lower the timeouts, we observed a very similar pattern in another place in the codebase. It appeared that if the connection timed out in the middle of a result set, we were not getting a proper error, so a variety of problems surfaced, all stemming from the erroneous assumption that the entire result set had been returned. Much of this code was also well over a decade old and had barely changed since it was written to run on PHP 5.3. It was certainly still possible that a recent change had caused all this grief, but all the signs pointed to PHP behaving in a way we had not anticipated. While I fully expected to simply rule out PHP's error handling as a culprit, I set about answering the question: how does PHP handle MySQL completely going away in the middle of returning a query?
I created a controlled environment for the experiments. This included a PHP node for running queries, a MySQL server node, and a proxy server written in Go that made it easy to programmatically disconnect MySQL to simulate failure. In each of the tests, I would execute a straightforward SELECT statement that was expected to return 10,000 rows. The proxy would allow a fixed number of bytes to be returned to the PHP client before closing the socket in both directions. The PHP code in question looks something like this.
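A sketch reconstructed from the description above — the connection details and table name are placeholders, and the real test harness is more involved:

```php
<?php
// Connect to MySQL through the failure-injecting proxy.
$pdo = new PDO('mysql:host=proxy;dbname=test', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// This query is expected to return 10,000 rows; the proxy will cut the
// connection partway through the result set.
$stmt = $pdo->prepare('SELECT * FROM test_table');
$stmt->execute();

$count = 0;
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $count++;
}

echo "Number of rows returned $count\n";
```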
Given that we use PDO::ERRMODE_EXCEPTION, we would certainly expect an exception to be thrown the moment the proxy closes the connection. We would not expect this example to print the output “Number of rows returned”. And we would certainly not expect it to print output with a number less than 10,000. But when you run this example, you see the following output:
Number of rows returned 0
That didn't go well, but why zero rows and not some arbitrary number?
PHP's MySQL drivers support two query modes, buffered and unbuffered. In buffered mode, calling execute causes all the results to be transferred back to PHP and stored in memory before the call to execute returns. This is the default mode and is generally favored in that it allows the result set to be treated like a plain ole iterator in PHP (including rewinding and counting rows). As you might expect, buffered mode is problematic if the result set is large. In unbuffered mode, execute returns before the results are transferred back to PHP, and the results are read in from the socket as the code iterates. The biggest downside of an unbuffered query is that the connection cannot run any other queries while iterating over a result set.
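For reference, toggling between the two modes in PDO's MySQL driver is a single attribute, shown here on an already-established `$pdo` connection:

```php
<?php
// With MYSQL_ATTR_USE_BUFFERED_QUERY set to false, rows are read from
// the socket as the code iterates, instead of all at once inside execute().
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
```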
Returning to the example above, the reason why we got zero rows is that we were running in buffered mode. The disturbing part here is that no exception was thrown and no error occurred even though we got none of the rows that we should have gotten. So what happens if we change into unbuffered mode?
Number of rows returned 98
This time we received 98 of our expected 10,000 rows and, again, no error or exception. This sucks! This means that we can successfully iterate through a result set in PHP and still not know if we have seen all the rows that were returned. It should also be clear at this point that the original job that tipped us off to this problem was using unbuffered queries.
Note that the code for the controlled environment that reproduces this error is on GitHub.
One thing that is hard to miss about running the examples above is that they produce a warning in PHP with the text “Empty row packet body”. Clearly PHP is aware of the error condition, so I hoped that we were just missing some configuration that prevented it from being promoted to an exception. But alas, no such luck. You can tell by these lines of C that the MySQL driver only ever emits a warning when there is an erroneous socket read. That's not good.
After some research, I found this was definitely not an unknown PHP bug. I'm a little surprised this behavior is not more widely known; I guess it really reinforces that these failures are relatively rare and receive very little testing. I found three different reports in the PHP Bug Tracking System about this issue: #64763, #66370, and #65825. In one case there had been an attempt at a fix, but the patch was later abandoned. You'll also notice the bug is present in PHP 7.2.8, which is the latest release as of this writing. However, I was able to reproduce the bug on every version of PHP that used the mysqlnd driver. I suspect this case worked properly when PHP used libmysql, but I have not been able to confirm that.
We ultimately decided that this behavior was just too dangerous for us. We have a PHP codebase that has been around for over a decade and reviewing all of our result set handling is no small task. Instead, we are using a custom error handler to promote the warning to a PDOException. This is a solution we have all agreed is “gross but necessary”. Hopefully, we'll see a proper fix in a future PHP release and we'll be able to remove the duct tape, but for now it looks like our best course of action.
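A sketch of that "gross but necessary" workaround — the exact message matching below is illustrative, not our production code:

```php
<?php
// Promote mysqlnd's "Empty row packet body" warning to an exception so a
// severed connection can no longer masquerade as an exhausted result set.
set_error_handler(function ($severity, $message, $file, $line) {
    if (strpos($message, 'Empty row packet body') !== false) {
        throw new PDOException($message);
    }
    // Defer to PHP's normal error handling for every other warning.
    return false;
}, E_WARNING);
```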