hq already gave a non-technical summary of these events. We thought that you might enjoy a little more detail, though.
I warn you ahead of time that this entry assumes some basic familiarity with the techniques associated with data storage on large websites. If you've never heard the term "shard" applied to databases before, this might not make sense in places. In a later post we're going to do an overview of how deviantART's servers are structured which will cover this.
Last Monday at about 1:30am, people started noting some really weird things happening on deviantART. Their comments would appear somewhere other than where they were left. Their message centers filled up with notification of journals from people they didn't watch. They found inexplicable notes in their Message Center. Polls they posted had the options from someone else's poll. All sorts of things were failing: posting deviations, creating galleries and collections, almost anything that involved making something new on the site. Obviously, something was wrong.
Being that it was the middle of the night for most of us, it took a little while for reports to filter through helpdesk tickets and reach someone who could react. The only people were a few of our more insomniac and/or European developers: 20after4, kouiskas, and Pachunka. All that had been reported so far was that people in devBUG were getting journals in their message center from people they didn't watch. The natural assumption was that it was something to do with the new features we were having devBUG beta test at the time. Before long the mixed up comments were noticed, and more reports from non-devBUG people came in, so that was ruled out.
At that point we knew that something was really messed up, so the site was thrown into read-only mode. This is something we try to avoid doing, as you might imagine. We get somewhere close to 100,000 new deviations every day, well over a million comments, and thousands of new deviants... and the site being read-only shuts all of that down. But data corruption is serious business.
With some time to look into the issue in relative peace we found out that data with the same id was appearing on different database servers in our cluster. Since we shard the data by user, this was an indication that things were seriously mixed up; apparently we'd been giving out the same id to multiple pieces of content.
With this clue we investigated the source of our ids. Since we're using sharded servers, we can't rely on auto_increment for these, so we store sequence values on one database server and increment them whenever we assign an id. Looking at these, we saw that some of the sequence values had decreased at about 1:30am, when the trouble started. We then worked out that at about this time some fairly routine database maintenance had occurred, which involved swapping that server with its backup. These servers are supposed to be identical, and their replication was up to date when the swap occurred.
As far as we can tell, what happened was a failure in statement-based replication. When we update the sequence values we don't set them directly to a new number, we just send "value = value + 1", and rely on MySQL's LAST_INSERT_ID to get the new value. So if occasionally one of these queries just didn't get replicated, the backup's sequence values would slowly fall behind.
We immediately manually added a fairly large number to all of the sequence values, making sure that they were all above the largest-observed id value in use, and took the site out of read-only once we were sure that stopped new data corruption.
Now we were stuck with the problem of fixing as much of the broken data as we could.
(This is about where I woke up, and can legitimately claim to be part of "we" in this story. Because I'm in charge of our "Reactor" team, I got to coordinate the cleanup effort.)
Fortunately, because of our sharded servers, the problem was often just that different data with the same id existed on multiple servers, and the wrong server was being read from. In those cases the solution was to find the duplicate ids, and to assign new ids to some of them. That was pretty easy; I wrote a quick tool to find the duplicate ids, and the new ids were seamlessly handed out.
However, some of the older parts of the site maintained data on a non-sharded server. This old code also often didn't check whether or not the initial insert succeeded... resulting in data loss when further queries went ahead. Journal entries got mixed up terribly, for instance. In these cases we unmixed them as well as we could, generally putting one of the entries back together, and deleting the shattered remnants of the other.
Notes turned out to be especially bad, because they had mixed together all the recipient lists for the original notes, and we had no way of telling what the original list was. Notes are some of our very few totally private pieces of information, and we absolutely couldn't risk anyone reading a note that wasn't meant for them. So we deleted the 2,552 notes that were sent during the incident.
Comments were intimidating just because of their scale. 33,212 comments were jumbled up, the text for one comment appearing in the thread where another was supposed to be. Luckily this turned out to be mostly fixable; 32,861 comments were put back in their proper place, and we only had to delete 351.
This could have been a lot worse. We were initially worried that deviant credit cards might have been getting mixed up, but this turned out to not be possible. Similarly, user/group widgets and privileges could also potentially have been mixed up... and we were happy to discover that they were immune.
As you might imagine, this has made us start looking into ways to avoid problems with replication in the future. We'd never seen a replication bug this subtle before, and the preventative measures we had in place were for more blatant issues. To patch up this particular hole we've put some automatic monitoring on the sequence values, so that if the master and backup drift out of alignment we'll know immediately. More generally we're evaluating switching to row-based replication. We've also been considering moving to a system like Twitter's Snowflake to get ids, without having to rely on database integrity.
Because it's not paranoia when MySQL really is out to get you, while we were cleaning up we added error checking to a lot of old code which was assuming that its inserts couldn't possibly fail. This means that if our sequences ever do fall back again in spite of the precautions we mentioned there won't be any corruption occurring before we can respond.
This incident has added some twists to how we're going to investigate similar problems in the future. Normally we start out by looking at the last code to be launched before an issue began, since that's probably related. We also didn't immediately think it would be the database maintenance, because the sequence slip meant that timestamps on affected content were from before the maintenance. Now we know that this sort of replication bug is possible we can check for it early, instead of looking for it as a last resort.
A lot of credit must go to our insomniac heroes, 20after4, kouiskas, and Pachunka who quickly realized how screwed we were and made the judgement call to pull the plug on the site until we could fix it.