Shop Mobile More Submit  Join Login
×

:icondt: More from dt


Featured in Collections

JOURNALS AND NEWS by Elandria

Journals, News, and Features by Topicality

README.TXT by Thunderstatement


More from DeviantArt



Details

Submitted on
February 17, 2012
Submitted with
Sta.sh
Link
Thumb

Stats

Views
50,233 (3 today)
Favourites
22 (who?)
Comments
50
×
We've just finished moving every message in every user's message center — all 6 billion of them — to a brand new set of database servers, and we managed to do it without anyone noticing. We didn't find much published material about the best way to do this, so we wanted to share a few details with you. It does get a little bit technical though so I won't be offended if you skip over some parts!

The data that powers message center was stored on one of our main sharded MySQL database clusters. That same cluster powers a lot of different features of the site and we recognised a lot of traffic was caused by this relatively simple but large set of data.

To free up capacity on that cluster, and help keep the site quick as we continue to grow, we decided to move the message center data to its own servers.

Some fun figures:
  • ~6 billion rows of data in 13 tables across 16 shards
  • Average 5k, daily peak of 10k queries per second for this data across the cluster
  • Average 100k, daily peak of over 150k rows read per second
  • Average 500, daily peak of over 3k rows written per second

So we had a pretty large set of data and the high traffic to it meant that we'd have to be very careful how we copied it over without losing any data, taking the message center down for a few days or otherwise affecting users.

To add a bit of fun to the project, the old tables were still using MyISAM storage engine as they've been around from before InnoDB's more recent rise to dominance. We decided it would be a good time to switch to InnoDB. If this doesn't mean much to you, then it means we decided to switch to a different type of database which has become a much better choice than it was a few years ago when we first designed message center.

How Much Hardware?


We started out not totally sure of how much hardware we would need because it was quite hard to get meaningful measurements of our current database server load when it was only a part of the traffic on the old cluster.

So to enable us to experiment and prove the new setup, we developed an extension for our PHP database API to allow us to duplicate query traffic from the current tables to the new cluster. We made it possible to control the percentage of read, write and delete queries mirrored individually though our distributed config system, and ensured that failures in the new cluster would not impact the real traffic.

After some experimentation we found a setup which worked well and gave us much more capacity to grow.

The new cluster has 8 servers as masters, replicating to 8 identical machines as a hot-backup.

Each machine has these specs:
  • 12 Cores
  • 96GB RAM
  • 200GB Intel SSDs

Since our existing cluster is sharded over 16 masters, we kept the data split into 16 logical shards and put 2 on each master machine in separate databases.

Syncing the data


Satisfied with the hardware setup, we needed to find a way to synchronise all the data. The initial import step was made more difficult by the move to InnoDB - it took several days to convert the most recent backup to InnoDB before the new cluster could start receiving new write traffic.

This left us with a gap in the data between the old and new clusters and we needed to find a way to get them back in sync without impacting the performance of the live cluster.

The nature of the data meant that even if we could efficiently copy only rows inserted within the missing period, that would still leave the data inconsistent where deletes happened during that time. People would start to see notifications they had already dismissed popping back up again.

So to be thorough, we needed to crawl all the data and ensure its consistency. Rather than copying everything again, we looked into ways to compare consistency between the live data and the new cluster without impacting users.

Checksums


We couldn't find any documented technologies or techniques used by others that would work for us without taking the production cluster off-line so we came up with a solution that although not perfect, seems to have proven useful in this case.

It's all based on checksums.

A checksum is a number which is obtained by doing some calculations on the numbers that make up any piece of data on a computer. In general, if the checksum for two pieces of data is the same, they are very very likely to be identical. If the checksums are different, the two pieces of data are guaranteed to be different. The important thing is that using a checksum we can compare many thousands of rows with just one query to each cluster and very little data transferred.

We came up with a query that could calculate the CRC32 checksum of all the primary keys in a batch of in-boxes surprisingly efficiently. In this dataset, rows are never updated - just inserted or deleted - so primary keys are the only data points needed to ensure consistency.

The query looks something like this:

SELECT CRC32(SUM(row_cs)) as cs
  FROM (
      SELECT CRC32(CONCAT(folderid, type, itemid)) as row_cs
        FROM table
       WHERE folderid IN (1,2,3,4,...)
    ORDER BY folderid, type, itemid
  ) as tmp

[edit] rehael pointed out correctly in the comments below that the ORDER BY is unnecessary here. SUM is commutative so it doesn't actually matter if natural order is inconsistent in the inner query. This must have slipped in from an earlier attempt to accumulate rows without the sub query. In this case sort is on primary key so I guess it made little difference, but a good spot. I've left it in for posterity.[/edit]

[edit2] Another interesting potential flaw here is that CONCAT could produce false positives on some data sets. A more correct solution would be to use CONCAT_WS() with a delimiter char such that a row like (123, 456, 78) would not falsely checksum equal to (12, 345, 678). I'll credit pt-table-sync for this realisation (see bottom of article).[/edit2]

Using MySQL's EXPLAIN showed that this query didn't create a temporary table, and could select and sort efficiently using the primary key. This meant it is more likely to run quickly and not cause problems for real users of the site. In testing, we found it ran in under 200ms even for the biggest batches of folders we could find.

Despite the checksum being apparently quick, we still couldn't actually copy the out-of-sync data from big in-boxes from our master servers without locking extremely high-traffic tables, so we ran the checksums and copied the data from our hot backups.

To balance the amount of data copied with number of checksums needed, we first checksummed an entire batch of 50 in-boxes at a time. If there was a mis-match, we checksummed each in-box in the batch in turn and any that didn't match we deleted all rows from the new cluster and copied all rows from the old. There are more sophisticated options we could have used at this level to optimise for big in-boxes where not a lot changed, but none that were without downsides.

In reality the checksum query turned out not to be super-fast all the time, but we found more than 99.5% of the few million we ran in the end executed in under 200ms and over 99.8% in less than 1 second. In general checksum query load was minute compared to the load of the data-copying queries for big in-boxes.

You may not want to try this at home


There are a few flaws in this technique.

Firstly, the fact that we had to sync from replication slaves meant that any replication lag between master and slave caused inconsistency in the copied data. Once there were a few slow copy queries running, the replication lag would get worse. We found generally it kept up very well but we couldn't guarantee 100% consistency in one pass.

Secondly, even ignoring the replication lag, checksums could not be not atomic between the clusters - a user may have deleted or received a message in between the time you get the checksum from one cluster and the other.

So we knew a single pass through the data would not be sufficient. However, thanks to the checksum trick, we could continue running the sync job through the dataset and each time a few orders of magnitude less data would be found out of sync and need to be copied. Each pass also got substantially quicker as less data needed to be transferred.

After just 3 passes (several days for the first, a few hours for subsequent) we measured 99.9995% consistency. We called the job good and switched users over to read and write primarily from the new cluster. We're still writing to the old one for a week or so until we are totally sure of the new cluster.

Obviously this sort of slightly fuzzy consistency would not suitable for other data that may require absolute integrity (like customer's orders). For us, it's acceptable because:
  • a significant proportion of the < 0.0005% of in-boxes still measured as out of sync will actually be fine - just reported due to the measurement error noted above
  • statistically it is likely that any in-boxes genuinely out of sync are very large ones (100k+ notifications) and that their owners never look through and curate or remove them
  • in the highly unlikely case that a user did notice a discrepancy, it is more likely to be a notice they deleted re-appearing (they can fix that in one click)
  • these are not personal messages or unique content, just notifications of activity elsewhere on site


The useful take-aways


Despite not being a perfect solution to all live migration problems, I hope this has been somewhat interesting and may even help others with similar tasks. The exact plan we used is pretty specific to our situation but I think there are a couple of more general points to take away from this:
  • Migrating large amounts of data from one MySQL server cluster to another without affecting operation or losing writes is hard and apparently no one has come up with a really good solution yet
  • Under different circumstances, it may be possible to be more robust than this although probably at the expense of taking longer, requiring more resources, or making the service unavailable for some time
  • Checksum queries, although not a panacea, are a potentially useful tool that are not widely discussed for on-line MySQL consistency checking

Big thanks to chris and randomduck for their hard work on this upgrade.

[edit] A number of people pointed us at pt-table-sync from Percona's Toolkit. That script actually ends up doing something almost exactly like what we did here (with different checksumming techniques). I didn't know of it and somehow didn't find it when looking for similar solutions. It looks like a much more robust solution in general although would not (easily) have been able to take advantage of our distributed job cluster which enabled our migration to run sync queries on many folder chunks in parallel, so it probably would have taken much longer.[/edit]

Add a Comment:
 
:iconrehael:
rehael Featured By Owner Feb 23, 2012  Hobbyist Photographer
First, thanks for the answers. I guess we like our optimisers to be clever. And it's good you didn't (I hope) hit any collisions — CRC32 for 6 billion rows is risky business (p(collision)=39.6% for 6b. in short scale) — but I guess you compared these 16 shards and 13 tables separately, not as one 6b. dataset. :D

I feel I need to add some more thoughts about the migration. Without going into too much details ($DEITY bless NDAs) — recently we finished rather large migration project: several years of worldwide market data (raw, not aggregated dimensions), changing DB engines, instance placement and even datacenter switch along the way. All of it with live system, without degrading preformance for end users. The full load took about a week («Monitoring: saving lives, assuring sanity» :giggle:). The scenario I gave up there is bulletproof, every single time I see it used. Split, Snap, Load, Housekeep, Switch, Done. It's true it puts the NEW under much stress, but only for initial load of the snapshot, and it's somewheat offline process without an impact on the LIVE one.

I don't get why „the incredibly expensive process of converting to InnoDB which would have made it very likely the new cluster would have errors on queries that would then be missing from the final set on new one”emphasis mine — could you elaborate on this one a little more?

Anyway — I really love reading the #dt technical journals — keep 'em coming please. :XD: And in future projects — maybe source knowledge from deviants here, I think you can find some smart souls in the community.
Reply
:iconbanks:
banks Featured By Owner Feb 23, 2012
CRC32 collisions could come into play across whole dataset but remember we were only comparing "same" relatively small (less than 150k) batch of rows on each cluster at a time. It's very unlikely for collision to occur in this case, especially given that a large folder almost certainly changed between the multiple passes we ran (and so should be picked up on subsequent pass EVEN if it collided once). Also the birthday paradox doesn't come in since we are not comparing any batch with any other. But you are right -- it is important to consider in each situation.

To clarify my point about InnoDB conversion greatly increasing likelihood of degraded performance and dropped queries: as far as I understand (and I wasn't the person in charge of this aspect) the conversion from MyISAM to InnoDB took several days on each server at pretty much full disk IO. The new cluster would have been totally unable to keep up with our actual write traffic which would have meant queries hung causing too many connection errors (which would be silently ignored so as not to impact real users). Basically the new cluster would not have enough disk IO to cope with both the data conversion and the real write traffic. Throttling the import would perhaps be a possibility but it probably would then have taken a lot longer overall to complete the process.

Your solution seems very sound and robust in general though and would be much more widely applicable to other situations.

Thanks for the sharing. If you have any thoughts on ways we can get technical deviants more involved I'm sure we'd be interested to hear them!
Reply
:iconrehael:
rehael Featured By Owner Feb 23, 2012  Hobbyist Photographer
Ok, thanks for clarification, got your point now. I haven't played with MySQL for a looong time — currently only Oracle and PGSQL and some esoteric ones) are around me.

As for getting tech–savvy deviants attention — advertising the #dt globally from time to time could help. And then — just post a journal (one a month?) with a problem to solve, and wait for comments. It should work, don't expect miracles tho'. ;)

:salute: :ahoy:
Reply
:icontleach0608:
tleach0608 Featured By Owner Feb 23, 2012  Hobbyist Photographer
Though I never handled the back end of things at my old job in IT, I nonetheless could at least follow what you said and am very very impressed. Great job, everyone, the hard work was very much worth it and paid off! You guys rock!
Reply
:iconraquelvonkaminaru:
raquelvonkaminaru Featured By Owner Feb 23, 2012  Hobbyist Photographer
96GB RAM

Now THAT'S impressive!!! ;)
Reply
:iconnamenotrequired:
namenotrequired Featured By Owner Feb 22, 2012  Student Interface Designer
I read it all :D Awesome job :clap:
Reply
:iconbranchewski:
Branchewski Featured By Owner Feb 22, 2012  Hobbyist Photographer
Amazing, just amazing :worship:
Reply
:icontimberclipse:
TimberClipse Featured By Owner Feb 22, 2012  Professional Filmographer
My 17,000 notes could not possibly of helped with that :lol:

Anyway great job guys! But the truth is...



I noticed

I just didn't say anything :eyes:


:B
:heart:
Reply
:iconi-am-lady-timeless:
I-Am-Lady-Timeless Featured By Owner Feb 22, 2012  Hobbyist Digital Artist
You all really are SHEER GENUS!!! And thanks for explaining this to all of us! :heart:
Reply
:iconendosage:
endosage Featured By Owner Feb 22, 2012  Professional Interface Designer
Every time one of these dT blogs are posted, I know they're going to be entertaining. I love hearing all the little details behind what goes on in dT and what you guys have to deal with –– always interesting and incredibly informative. Thanks for the great read, Paul! (and thanks for making nice cozy homes for our data to live in!)
Reply
Add a Comment: