
#DT and LOGR

Mon Nov 5, 2012, 11:42 AM by fartprincess
It's Friday evening and after a long day, you check the code you were working on into git, have the commit reviewed, accepted, merged, and sync it live. All seems right with the world. You let out a sigh of relief, back your chair away from your desk, and walk away in a satisfied mist of ease. In fact, you're excited because you're going to a concert with your friends tonight.

But then, twenty minutes after you leave, it begins. Errors. Fatal errors. And you're not around to know. So what happens?

In dt, we look out for one another. One of the ways we manage to do this is through an error logging service we've built called LOGR. If you read our article on DRE, you'll remember we mentioned a variety of functions we use to generate messages. One of these is called dre_alert().

For example, you set up an alert in your code that gets triggered when a draft in Sta.sh Writer can't be loaded:

dre_alert("Failed to load draft");


And it seems to be happening every 5 seconds. Oops. It turns out the error is caused by a simple typo. Fortunately, everyone on our team gets e-mailed when these severe types of errors are logged, and since we're situated all over the world, someone is almost always around to take responsibility for problems and emergencies. So yury-n kindly steps in, fixes the problem, and everyone is happy again, yay.

How does LOGR work?


We store a message and a "payload" (any additional debugging information you pass along), together with host and backtrace information to help figure out the chain of events that led to the alert. All of this information gets sent to a Redis server. We rotate the data using a round-robin algorithm similar to RRDTool's to keep memory usage constant, despite the fact that we store millions of different messages within LOGR.
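To make that concrete, here is a minimal sketch of what recording an alert might look like. The Redis key layout, host name, record fields, and the 10,000-entry cap are all illustrative assumptions, not LOGR's actual implementation:

function log_alert($message, array $payload = array())
{
    // Hypothetical sketch only -- key names and record layout are invented.
    $redis = new Redis(); // phpredis extension
    $redis->connect('logr-redis.example.internal', 6379);

    $record = json_encode(array(
        'message'   => $message,
        'payload'   => $payload,   // extra debugging info passed by the caller
        'host'      => gethostname(),
        'backtrace' => debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS),
        'time'      => time(),
    ));

    // Group incidents of the same message under one key.
    $key = 'logr:' . md5($message);

    // Push the newest incident and trim the list so memory use stays bounded,
    // loosely mimicking the round-robin rotation described above.
    $redis->lPush($key, $record);
    $redis->lTrim($key, 0, 9999);
}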

Is it just for emergencies?


Nope. We use it for several things, often for recognized problems that aren't emergencies. Every time an alert fires, the incident gets grouped together with other alerts of the same type, so we can get a better picture of how frequently it's occurring, whether there are any patterns, and, using the payloads available, whether there's a particular reason why it's occurring.

We can tag our alerts however we want. So say we have 5 alerts set up for deviantART muro. We can tag each one accordingly and then search for that tag later to find alerts related to that part of the site. Cool, right?

We can also set thresholds on alerts. Once a threshold is established, if the incident occurs more than the allowed number of times, an e-mail is sent to a particular developer.

For example, in this particular alert, $allixsenos is set up to receive an e-mail any time this particular alert happens more than 40 times in one hour:



Mousing over the chart, we can see a bit more information, like the exact number of occurrences that happened at different points in time.
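Under the hood, that kind of check comes down to counting occurrences within a time window. Here's a rough sketch of the idea; the counter storage, function names, and e-mail handling are hypothetical, not LOGR's real threshold code:

function check_threshold($alertKey, $ownerEmail, $threshold = 40, $windowSeconds = 3600)
{
    // Hypothetical sketch only.
    $redis = new Redis();
    $redis->connect('logr-redis.example.internal', 6379);

    // Count occurrences in hour-sized buckets; each counter expires with its window.
    $bucket = $alertKey . ':count:' . floor(time() / $windowSeconds);
    $count  = $redis->incr($bucket);
    $redis->expire($bucket, $windowSeconds);

    // Notify exactly once per window, when the limit is first crossed.
    if ($count == $threshold + 1) {
        mail($ownerEmail,
             "LOGR threshold exceeded: $alertKey",
             "This alert has fired more than $threshold times in the last hour.");
    }
}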

And, as mentioned, we also get a nice payload of information:



With LOGR, we're able to get an idea of which of these alerts are exceeding their thresholds, which are happening frequently, which alerts are most recent, and when they were first seen. This helps us easily separate events that were transitory or already resolved from those that are continuing to happen and need fast attention.

We also generate histograms so that we can see a breakdown of how an error is distributed. This is available for any piece of data stored, so you can see the different variations that occur. For example, we could see how the error is spread out across different servers:



But again, this isn't limited to just servers. It's available for any data stored in the payload.
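Building such a breakdown is essentially just counting payload values. As a tiny, hypothetical illustration (the record format and field names are made up):

function payload_histogram(array $incidents, $field)
{
    // $incidents would be the decoded records stored for one alert.
    $values = array();
    foreach ($incidents as $incident) {
        if (isset($incident['payload'][$field])) {
            $values[] = $incident['payload'][$field];
        }
    }
    // e.g. array('web14' => 120, 'web07' => 3) for a hypothetical 'server' field
    return array_count_values($values);
}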

We're always looking for ways to improve LOGR so we can spot problems with the site in a timely fashion and minimize their impact on deviants, so you can continue to enjoy the site as usual :)

How We Debug in DT

Thu Sep 13, 2012, 5:31 PM by fartprincess
As developers, we often have to debug complex behavior across several different layers: server-side PHP, JavaScript, load times, server information, memcache events, how pages are routed, and several other pieces of data.

In dt, we use an internal system called DRE (Developer Runtime Environment), which allows us to grab information about how the code we write is executed while making a minimal impact on the code itself.

What makes DRE especially awesome?


We can use it to help debug and solve problems that other users are having. Once we know that a particular user is having an issue, we can enable DRE logging for that user, usually limiting the logging to the part of the site where they're having the problem. We can then check the log for any errors that show up.

For example, say a user named testera is having trouble with a journal. We can log debug data for him when he visits journal pages. Examining that log, we see several pieces of information, many of which give us a better idea of where the point of failure is. We might want to check what privileges the user has, or what roles they have in a group if that's where the journal is being posted.






We take this type of logging very seriously, so don't be :paranoid:. We only use it as a last resort, and typically only within our own personal development spaces with test data.

We also store the logs so we're able to share them with other developers (and even then, we only keep those logs for a few hours).

Why is it helpful to developers?


All debug information loads in a single window, accessible either by clicking a box at the top of any page or by pressing F2. DRE can log simple data types, but it can also provide a highly readable breakdown of complex objects using nested tables.



We also use it to set up alerts. If a high-risk condition occurs on a particular page, the box used to display DRE will turn red and pulsate to make it more apparent to us that there is a problem. This could be something as simple as a failure to retrieve data or something as severe as a failure to include a file (or worse, a fatal error).

More importantly, it allows us to trace data and debug across multiple layers, including non-web-facing processes and asynchronous tasks, where the data may not be immediately available at runtime.

It's not just for fixing bugs; it also helps us spot slow-loading modules and queries and see where we can better optimize the number of queries we do need to run on any given page.

For example, on this page, we have 12 separate queries that run:



This is pretty good, given all the data that has to be loaded to make deviantART run.

But on a different page, we might have 52 queries, some of which are very similar and load data from the same tables. If we can find a way to combine them, we can reduce the number of queries, which will help the site perform better. We have a way of handling this in dt, a technique called wenching (combining several queries into a single query--we'll get to this in a future article), but we might not be aware that something needs to be wenched until we get a composite look at everything taking place on the page. This is where DRE comes into play.
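To give a flavor of the idea before that future article, here's a generic sketch of combining per-row lookups into one query. It uses plain PDO and a made-up table name rather than our internal database layer:

// Before: one query per deviation -- dozens of similar round trips on a busy page.
function load_titles_slow(PDO $db, array $deviationIds)
{
    $titles = array();
    foreach ($deviationIds as $id) {
        $stmt = $db->prepare('SELECT title FROM deviations WHERE id = ?');
        $stmt->execute(array($id));
        $titles[$id] = $stmt->fetchColumn();
    }
    return $titles;
}

// After: the same data fetched with a single query.
function load_titles_combined(PDO $db, array $deviationIds)
{
    $placeholders = implode(',', array_fill(0, count($deviationIds), '?'));
    $stmt = $db->prepare("SELECT id, title FROM deviations WHERE id IN ($placeholders)");
    $stmt->execute(array_values($deviationIds));
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // id => title
}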

We can also use it for profiling, to see when a particular file is taking a long time to load. This might be because the code in that file has inefficiencies that need to be refactored. For example, here's a page that shows how this might look:



Eek. Bad, right? Fortunately, this isn't something we see too much of. And when we do, we try to remedy it.

So how does it work?


This is a silly impractical example, but it'll do.

$orders = $this->get_orders();
dre_trace('orders', $orders);


If get_orders() returns an object of orders, DRE will trace that object into a table, labeled "orders". Simple enough.

"This sounds like it could be kind of messy when you have a bunch of developers all using it! If everyone is using DRE, the panel would be completely overwhelmed with junk I don't care about!"

Yep. That's why we use dre_trace() sparingly. We have several functions for DRE, one of which takes a parameter to filter the message to only those who want to see it. For example, if I wanted to make sure only I saw $orders from the aforementioned example in our DRE panel:

$orders = $this->get_orders();
dre_cond_trace('fartprincess', 'orders', $orders);


At the end of the day, this utility saves developers a massive amount of time.

Last year, we released the deviantART and Sta.sh APIs, our first official support for third-party app integrations. Eager developers have asked for many new APIs since then, anxious to step away from page-scraping and toward officially supported integration. We're eager for that too! :eager: 

Today we are happy to announce three new Sta.sh APIs which give developers greater access to their users' Sta.shes!

---

The New Sta.sh APIs

  • Fetch Submission Media - Now developers can request the filesize, dimensions and URL for the original media associated with any Sta.sh submission.

  • Fetch Submission or Folder Metadata - Apps can now get all of a submission's dirty details (keywords, artist comments, thumbnail URLs, etc). It can also be used with Sta.sh folders for greater organization capabilities!

  • List Folders and Submissions - My personal favorite and the workhorse of this release: this API call gives developers full read access to a user's Sta.sh in an intelligent way. We'll look more at the tech behind this call below.
--- 

The Challenges of a Delta

List Folders and Submissions (aka /delta), the most powerful call in this release, uses an incremental approach to allow apps to retrieve and store the current state of a user's Sta.sh in the most efficient way possible. By using the current state, developers and apps can recreate the Sta.sh experience on any device with any interface they desire. Let's look at what it does, how it works, and some of the challenges that dT faced during development.

Users can have up to 10GB of deviation submissions in their Sta.sh - literally thousands of 1MB submissions. Each submission belongs to a folder, and folders can have many submissions. For an application to represent this to the user, it needs to know everything about the submissions as well as the containing folders. Because submissions can be updated, published, deleted, and moved between folders, an application needs to check for changes constantly.

Challenge 1: How can a third party application check the status of thousands of deviation submissions and Sta.sh folders repeatedly without drastically increasing server load and lagging the app?

Challenge.. Accepted: The solution was inspired by Dropbox's /delta API (though its fundamentals can be seen throughout the world of computer science). The first time an application loads a user's Sta.sh state, we give it everything: all of the folders, all of the submissions. That kind of transmission could be megabytes, so we chunk it up into result sets of roughly 120 submissions. The application makes as many calls as necessary to retrieve the full list and stores all of this information locally; when it completes, it has the exact state of that user's Sta.sh.

We also give the app a unique cursor which represents that exact state of that user's Sta.sh. After the user interacts with Sta.sh (submitting new deviations, modifying existing ones, publishing and deleting, moving deviations between folders), the application can send us the cursor and we'll give it everything that has changed. Checking for updates becomes a very fast, repeatable process for the application.
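From the app's side, consuming the call is a simple loop. The sketch below is only illustrative: the endpoint URL is passed in, and the parameter and response field names (cursor, has_more, entries) are simplified stand-ins rather than the exact API spec, so check the developer docs for the real details:

function fetch_delta($endpointUrl, $accessToken, $cursor = null)
{
    // Illustrative only -- real parameter names may differ.
    $params = array('access_token' => $accessToken);
    if ($cursor !== null) {
        $params['cursor'] = $cursor;
    }
    $json = file_get_contents($endpointUrl . '?' . http_build_query($params));
    return json_decode($json, true);
}

function sync_stash($endpointUrl, $accessToken, array &$localState, &$cursor)
{
    do {
        $page = fetch_delta($endpointUrl, $accessToken, $cursor);
        foreach ($page['entries'] as $entry) {
            // A real app would apply each add/update/remove to its stored state;
            // here we just collect the entries.
            $localState[] = $entry;
        }
        $cursor = $page['cursor'];       // remember where we are
    } while (!empty($page['has_more'])); // keep paging until nothing is left
}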

Challenge 2: How do we build the list of modified submissions and folders?

Challenge.. Accepted: Let's reconsider the data challenge above: 10GB of deviations = thousands of submissions = tons of data. Per user. Storing the exact state of a deviation at each step in the process would.. well, it'd be like striking the immovable object with the unstoppable force. Chuck Norris would disappear from existence. :confused:

Aiming to save ol' Chuck, we implemented a circular log table to track each user action in Sta.sh; we can store as many as one hundred and sixty thousand user actions. Anything above that wraps the table and automatically deletes old entries.

Each cursor corresponds to a specific log entry, so we can quickly compile a list of modified items since that cursor. Note that we do not record the state of the item in the log - we just record the fact that it changed in some way. This lets us fetch data for a small list on request, making the delta call cheap and fast.
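In rough terms, the circular log behaves like a fixed number of slots that new actions keep overwriting. The sketch below only illustrates that idea: the table, columns, helper function, and REPLACE-based overwrite (which assumes a unique key on userid and slot) are all made up rather than our actual schema:

define('LOG_SLOTS', 160000); // roughly the capacity mentioned above

function next_sequence_number(PDO $db, $userId)
{
    // Simplistic (and racy) stand-in for a real sequence generator.
    $stmt = $db->prepare('SELECT COALESCE(MAX(seq), 0) + 1 FROM stash_changelog WHERE userid = ?');
    $stmt->execute(array($userId));
    return (int) $stmt->fetchColumn();
}

function record_stash_action(PDO $db, $userId, $itemId, $action)
{
    $seq  = next_sequence_number($db, $userId);
    $slot = $seq % LOG_SLOTS; // wraps around, overwriting the oldest entry

    $stmt = $db->prepare(
        'REPLACE INTO stash_changelog (userid, slot, seq, itemid, action, changed_at)
         VALUES (?, ?, ?, ?, ?, NOW())'
    );
    $stmt->execute(array($userId, $slot, $seq, $itemId, $action));

    return $seq; // a cursor handed to an app corresponds to a sequence number like this
}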

Challenge 3: What happens when the circular table rolls over and old entries are lost? What if my application's last cursor becomes invalid? What if a submission is added and deleted between delta calls?

Challenge.. Accepted: The first two questions are examples of edge conditions where an incremental approach breaks down. To compensate, we have a special flag called reset that tells the application it needs to throw away its stored data and start over.

Because we only log the change in items, the list of those changes is likely to contain useless transactions, like a submission that was added and deleted between delta calls. Applications don't care about this kind of entry - it won't appear in the app anywhere, doesn't exist anymore, and is simply confusing to see in the delta list. We applied an extra layer of filtering on the changelog to ensure that the application receives relevant, useful updates.

---

That's cool and all.. but where is the API for [awesome feature X]?

As kouiskas pointed out in last year's journal, adding new APIs is a time-consuming process; once we release an API, we want to support it forever. This limits our choices to APIs that are stable, scalable, and compatible with our long-term development goals. 

There have been numerous requests for great APIs over the last year; unfortunately, we can't build them all. Many would rely on infrastructure that will be refactored (e.g., Message Center) or that is in the process of being refactored (e.g., Search). 

Does this mean you should stop requesting those APIs? Not at all. Keep the requests coming! If you're an app developer and really, really want to see a specific new API call, here are some things that will help us review your request:
  1. Give us some stats. 
    If you have a successful app that is being used daily by deviants worldwide, growing in popularity and proving to be a benefit to the community, throw some numbers our way. Although popularity doesn't affect what is and isn't possible, it can be an important factor in deciding where to concentrate our efforts for the most benefit to the most users.

  2. Show us what your app will do with the call. 
    Most of you already have very successful dA apps in the desktop and mobile world. We realize that there is a lot of DiFi hacking and page scraping necessary to make your app do what you want it to do. Show us how the new API will reduce your reliance on these methods. Include screen captures if you can; we like to look at shiny XHR logs and fail messages.

  3. Be specific and technical.
    We're developers, just like you. If you see a need for an official API, take the time to help us understand exactly what you want from the API (request parameters and result objects). This doesn't mean that we will honor your spec request exactly (we do what we can with the infrastructure we have), but it will help us understand your need more clearly.

DT 2012 Baguette meet at Breadlands

Tue Jul 17, 2012, 1:19 AM
Breadlands Logo by randomduck

Place

Our dt development team is 100% remote. We documented our first foray into cautiously meeting each other last summer in the blog DT 2011 Army Barracks Raccoon-Con, an event that took place in the Marin Headlands near San Francisco. Our team is also fairly evenly distributed across a couple of continents and several timezones, with many of the developers hailing from the UK, France and other European countries. This year, those Europeans managed to coerce the North America-based developers into getting their papers in order and tricked them into embarking on the long flight. Destination: the beautiful village of Les Sauvages (which literally means “the savages”) in the French countryside near Lyon and the Beaujolais region. The event was named the "Baguette meet", and the camp we would be staying in for the next two weeks became "Breadlands", in homage to our previous year's meet at the Marin Headlands.

Les Sauvages by micahgoulart

Planet Breadlands by ArtBIT

And this is our own germ city on the tennis ball!


Untitled by randomduck   Untitled by randomduck


The morning ride by samshull


Work

We did presentations, ignite talks, and many pair-coding sessions encompassing work on small ideas and tech debt. We completed two project hackathons with 19 mini project sprints. And what didn't we have this year? We did not have a story of thousands of trunk commits, hundreds of deployments, rogue servers, long read-only periods, and broken-site disasters! Why? Because the first order of business during the meet was to switch our repository and deployment system from svn to git. Several months of planning went into this switch, and it's a subject for its own future blog post. All the work was done in private or shared feature branches and code reviewed using our now-standard Phabricator process. Some successful bits have been merged already, and the rest are to follow in the coming months.


DT Discussions by micahgoulart

Untitled by randomduck  Sergey by randomduck

Untitled by randomduck  Untitled by randomduck

The morning when we found out about our remote workers' attendance practices


People

Last year, a little over 20 people converged on San Francisco for our meet. This year we had 39 people in attendance. DT has grown, and in addition to the usual crew, we had some new faces to meet and welcome. We also invited our friends and occasional nemeses from Product and Product Marketing to join us. On the second day we had our first infirmary patient, stricken down by what was later diagnosed as the "Tahitian Plague". This is a cautionary tale about the dangers of gathering so many remote developers together and exposing their untrained immune systems to germs, presumably spread by office workers in Product Marketing! Also, a word of caution to others who might plan a meetup of a large group of net addicts in the tranquil countryside: do not count on the internet to function! Even with a top-of-the-line bandwidth-shaping system, and after throttling every bandwidth-hungry service and chasing down every rogue downloader, the internet was .. spotty at best. Complaints about reddit-withdrawal suffering rolled in regularly, but what else would you expect from a group where you won't blink an eye when you see two people sitting shoulder to shoulder in an empty room, having a very lively conversation on ... Skype chat!


Untitled by randomduck

pachunka insists that he is "on the mountain"


Breadlands 2012 by micahgoulart

Untitled by randomduck      Untitled by randomduck

Common scene: Product team is pacing and thinking, while DT developers are eating gravel!



Untitled by randomduck   Untitled by randomduck

Untitled by randomduck  Coding Hat By Artbit-d545fje by randomduck


Untitled by randomduck  Untitled by randomduck

Performance of a Traditional Canadian folk music band



Untitled by randomduck

Introducing Mr. June, from "Accordian Babes" pinup calendar


Da Breadlands 2012 By Knightar Web By Micahgoulart by randomduck

This is the group shot where we were told to act normal ..


Food

When we called our camp "Breadlands", we had no idea how much bread we were going to see and eat over the next two weeks. And cheese, and sausage, and more cheese and bread. And wine, of course. But despite three healthy, regularly scheduled meals each day, the fine fast food establishments of the nearby town were raided regularly with the aid of a gypsy cab.

Untitled by randomduck    Untitled by randomduck

Here is about a three-hour supply of beverages. And there is jekor wondering how to say "omelet" in French.


Untitled by randomduck     Untitled by randomduck

"I would kill for the cheeseburger"


Untitled by randomduck   Untitled by randomduck


Fire 

The tradition of very large bonfires continued. And it appears that several people spent the offseason practicing their ax handling. Unlike last year, we also did not have any contractual obligations not to consume alcohol, which interestingly led to a fairly mellow pace of said consumption.
 
Untitled by randomduck Untitled by randomduck


Animals

Attentive readers might remember that one of the hits of last year's meetup was the encounters with ravaging troops of raccoons that possessed no fear of humans. In contrast, Breadlands had an abundance of rather mellow farm animals that had no interest in stealing candy and ibuprofen.


2012-06-13 21.30.26 Hdr 2 by randomduck

Where DT sees a real cow for the first time


Untitled by randomduck  Untitled by randomduck Untitled by randomduck









We've just finished moving every message in every user's message center — all 6 billion of them — to a brand new set of database servers, and we managed to do it without anyone noticing. We didn't find much published material about the best way to do this, so we wanted to share a few details with you. It does get a little bit technical though so I won't be offended if you skip over some parts!

The data that powers message center was stored on one of our main sharded MySQL database clusters. That same cluster powers a lot of different features of the site, and we recognised that a lot of its traffic was caused by this relatively simple but large set of data.

To free up capacity on that cluster, and help keep the site quick as we continue to grow, we decided to move the message center data to its own servers.

Some fun figures:
  • ~6 billion rows of data in 13 tables across 16 shards
  • Average 5k, daily peak of 10k queries per second for this data across the cluster
  • Average 100k, daily peak of over 150k rows read per second
  • Average 500, daily peak of over 3k rows written per second

So we had a pretty large set of data, and the high traffic to it meant that we'd have to be very careful about how we copied it over without losing any data, taking the message center down for a few days, or otherwise affecting users.

To add a bit of fun to the project, the old tables were still using the MyISAM storage engine, as they've been around since before InnoDB's more recent rise to dominance. We decided it would be a good time to switch to InnoDB. If this doesn't mean much to you, it means we decided to switch to a different type of database engine which has become a much better choice than it was a few years ago when we first designed message center.

How Much Hardware?


We started out not totally sure of how much hardware we would need because it was quite hard to get meaningful measurements of our current database server load when it was only a part of the traffic on the old cluster.

So to enable us to experiment and prove the new setup, we developed an extension for our PHP database API that allowed us to duplicate query traffic from the current tables to the new cluster. We made it possible to control the percentage of read, write and delete queries mirrored individually through our distributed config system, and ensured that failures in the new cluster would not impact the real traffic.
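Conceptually, the mirroring worked something like the sketch below. The class, the config array, and the use of plain PDO are placeholders for illustration; our real database API and config system look different:

class MirroringDb
{
    private $oldCluster;
    private $newCluster;
    private $mirrorPct; // e.g. array('read' => 10, 'write' => 100, 'delete' => 100)

    public function __construct(PDO $old, PDO $new, array $mirrorPct)
    {
        $this->oldCluster = $old;
        $this->newCluster = $new;
        $this->mirrorPct  = $mirrorPct;
    }

    public function query($sql, array $params, $type)
    {
        // The real query always runs against the old cluster and its result is returned.
        $stmt = $this->oldCluster->prepare($sql);
        $stmt->execute($params);

        // A configurable percentage of traffic is duplicated to the new cluster.
        // Failures there are swallowed so they can never affect real users.
        if (mt_rand(1, 100) <= $this->mirrorPct[$type]) {
            try {
                $mirror = $this->newCluster->prepare($sql);
                $mirror->execute($params);
            } catch (Exception $e) {
                // Log and ignore: the new cluster is still being proven.
            }
        }

        return $stmt;
    }
}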

After some experimentation we found a setup which worked well and gave us much more capacity to grow.

The new cluster has 8 servers as masters, replicating to 8 identical machines as a hot-backup.

Each machine has these specs:
  • 12 Cores
  • 96GB RAM
  • 200GB Intel SSDs

Since our existing cluster is sharded over 16 masters, we kept the data split into 16 logical shards and put 2 on each master machine in separate databases.
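As a small illustration of that layout, mapping a logical shard to its physical home might look like this (the host and database names are made up):

function shard_location($shardId) // logical shard 0..15
{
    return array(
        'host'     => sprintf('mcdb%02d.example.internal', (int) ($shardId / 2) + 1), // 8 masters, 2 shards each
        'database' => sprintf('message_center_shard_%02d', $shardId),                 // separate database per shard
    );
}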

Syncing the data


Satisfied with the hardware setup, we needed to find a way to synchronise all the data. The initial import step was made more difficult by the move to InnoDB - it took several days to convert the most recent backup to InnoDB before the new cluster could start receiving new write traffic.

This left us with a gap in the data between the old and new clusters and we needed to find a way to get them back in sync without impacting the performance of the live cluster.

The nature of the data meant that even if we could efficiently copy only rows inserted within the missing period, that would still leave the data inconsistent where deletes happened during that time. People would start to see notifications they had already dismissed popping back up again.

So to be thorough, we needed to crawl all the data and ensure its consistency. Rather than copying everything again, we looked into ways to compare consistency between the live data and the new cluster without impacting users.

Checksums


We couldn't find any documented technologies or techniques used by others that would work for us without taking the production cluster off-line, so we came up with a solution that, although not perfect, seems to have proven useful in this case.

It's all based on checksums.

A checksum is a number which is obtained by doing some calculations on the numbers that make up any piece of data on a computer. In general, if the checksum for two pieces of data is the same, they are very very likely to be identical. If the checksums are different, the two pieces of data are guaranteed to be different. The important thing is that using a checksum we can compare many thousands of rows with just one query to each cluster and very little data transferred.

We came up with a query that could calculate the CRC32 checksum of all the primary keys in a batch of in-boxes surprisingly efficiently. In this dataset, rows are never updated - just inserted or deleted - so primary keys are the only data points needed to ensure consistency.

The query looks something like this:

SELECT CRC32(SUM(row_cs)) as cs
  FROM (
      SELECT CRC32(CONCAT(folderid, type, itemid)) as row_cs
        FROM table
       WHERE folderid IN (1,2,3,4,...)
    ORDER BY folderid, type, itemid
  ) as tmp

[edit] rehael pointed out correctly in the comments below that the ORDER BY is unnecessary here. SUM is commutative so it doesn't actually matter if natural order is inconsistent in the inner query. This must have slipped in from an earlier attempt to accumulate rows without the sub query. In this case sort is on primary key so I guess it made little difference, but a good spot. I've left it in for posterity.[/edit]

[edit2] Another interesting potential flaw here is that CONCAT could produce false positives on some data sets. A more correct solution would be to use CONCAT_WS() with a delimiter char such that a row like (123, 456, 78) would not falsely checksum equal to (12, 345, 678). I'll credit pt-table-sync for this realisation (see bottom of article).[/edit2]

Using MySQL's EXPLAIN showed that this query didn't create a temporary table and could select and sort efficiently using the primary key. This meant it was more likely to run quickly and not cause problems for real users of the site. In testing, we found it ran in under 200ms even for the biggest batches of folders we could find.

Despite the checksum being apparently quick, we still couldn't actually copy the out-of-sync data from big in-boxes from our master servers without locking extremely high-traffic tables, so we ran the checksums and copied the data from our hot backups.

To balance the amount of data copied against the number of checksums needed, we first checksummed an entire batch of 50 in-boxes at a time. If there was a mismatch, we checksummed each in-box in the batch in turn, and for any that didn't match we deleted all of its rows from the new cluster and copied all rows over from the old. There are more sophisticated options we could have used at this level to optimise for big in-boxes where not a lot changed, but none that were without downsides.
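In outline, one pass over a batch looked roughly like the sketch below. It reuses the checksum query from above (with the CONCAT_WS fix from the second edit), but the table name is a placeholder, the copy only moves the key columns for brevity, and the helper structure exists purely to illustrate the control flow, not our actual job code:

function folder_checksum(PDO $db, array $folderIds)
{
    $in  = implode(',', array_map('intval', $folderIds));
    $sql = "SELECT CRC32(SUM(row_cs)) FROM (
                SELECT CRC32(CONCAT_WS('#', folderid, type, itemid)) AS row_cs
                  FROM mc_items -- placeholder table name
                 WHERE folderid IN ($in)
            ) AS tmp";
    return $db->query($sql)->fetchColumn();
}

function sync_batch(PDO $oldCluster, PDO $newCluster, array $folderIds)
{
    // Cheap check first: one checksum per cluster for the whole batch of ~50 in-boxes.
    if (folder_checksum($oldCluster, $folderIds) === folder_checksum($newCluster, $folderIds)) {
        return; // the whole batch already matches
    }

    // Batch mismatch: narrow it down in-box by in-box.
    foreach ($folderIds as $folderId) {
        if (folder_checksum($oldCluster, array($folderId)) === folder_checksum($newCluster, array($folderId))) {
            continue;
        }
        // Simplest safe repair: wipe the in-box on the new cluster and re-copy
        // every row from the old (hot-backup) cluster. A real copy would move
        // all columns, not just the primary key.
        $newCluster->prepare('DELETE FROM mc_items WHERE folderid = ?')->execute(array($folderId));
        $rows = $oldCluster->prepare('SELECT folderid, type, itemid FROM mc_items WHERE folderid = ?');
        $rows->execute(array($folderId));
        $insert = $newCluster->prepare('INSERT INTO mc_items (folderid, type, itemid) VALUES (?, ?, ?)');
        foreach ($rows->fetchAll(PDO::FETCH_NUM) as $row) {
            $insert->execute($row);
        }
    }
}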

In reality the checksum query turned out not to be super-fast all the time, but we found more than 99.5% of the few million we ran in the end executed in under 200ms and over 99.8% in less than 1 second. In general checksum query load was minute compared to the load of the data-copying queries for big in-boxes.

You may not want to try this at home


There are a few flaws in this technique.

Firstly, the fact that we had to sync from replication slaves meant that any replication lag between master and slave caused inconsistency in the copied data. Once there were a few slow copy queries running, the replication lag would get worse. We found generally it kept up very well but we couldn't guarantee 100% consistency in one pass.

Secondly, even ignoring replication lag, checksums could not be atomic between the clusters - a user may have deleted or received a message between the time we got the checksum from one cluster and the time we got it from the other.

So we knew a single pass through the data would not be sufficient. However, thanks to the checksum trick, we could continue running the sync job through the dataset and each time a few orders of magnitude less data would be found out of sync and need to be copied. Each pass also got substantially quicker as less data needed to be transferred.

After just 3 passes (several days for the first, a few hours for each subsequent pass) we measured 99.9995% consistency. We called the job good and switched users over to read and write primarily from the new cluster. We're still writing to the old one for a week or so until we are totally sure of the new cluster.

Obviously this sort of slightly fuzzy consistency would not be suitable for other data that requires absolute integrity (like customers' orders). For us, it's acceptable because:
  • a significant proportion of the < 0.0005% of in-boxes still measured as out of sync will actually be fine - just reported due to the measurement error noted above
  • statistically it is likely that any in-boxes genuinely out of sync are very large ones (100k+ notifications) and that their owners never look through and curate or remove them
  • in the highly unlikely case that a user did notice a discrepancy, it is more likely to be a notice they deleted re-appearing (they can fix that in one click)
  • these are not personal messages or unique content, just notifications of activity elsewhere on site


The useful take-aways


Despite not being a perfect solution to all live migration problems, I hope this has been somewhat interesting and may even help others with similar tasks. The exact plan we used is pretty specific to our situation but I think there are a couple of more general points to take away from this:
  • Migrating large amounts of data from one MySQL server cluster to another without affecting operation or losing writes is hard and apparently no one has come up with a really good solution yet
  • Under different circumstances, it may be possible to be more robust than this although probably at the expense of taking longer, requiring more resources, or making the service unavailable for some time
  • Checksum queries, although not a panacea, are a potentially useful tool that are not widely discussed for on-line MySQL consistency checking

Big thanks to chris and randomduck for their hard work on this upgrade.

[edit] A number of people pointed us at pt-table-sync from Percona's Toolkit. That script actually ends up doing something almost exactly like what we did here (with different checksumming techniques). I didn't know of it and somehow didn't find it when looking for similar solutions. It looks like a much more robust solution in general although would not (easily) have been able to take advantage of our distributed job cluster which enabled our migration to run sync queries on many folder chunks in parallel, so it probably would have taken much longer.[/edit]


We're Hiring Developers

We're looking for talented web developers to join our team! :la: Interested? Check out deviantart.theresumator.com/ap…
