Our native auction ad server needs to be fast. At the 99th percentile we consistently serve ads back to our client in under 15 milliseconds. One of the crucial pieces of tech that makes this happen is Redis. But last week we found that the way we were using Redis took us down for two hours.
All of this allows us to keep response times very low: 6ms on average, and between 10 and 20ms at the 99th percentile.
Last Thursday, however, our response times quickly deteriorated.
As a consequence, we were only able to serve about one-sixth of the requests from our clients.
While this alone was bad enough, less than an hour later we saw the same slowdown in response times, this time with a more serious impact on our application: it could no longer communicate with Redis properly and was throwing exceptions in the process. This effectively took our ad server offline.
Once we saw that we were down, we were able to restore service within 15 minutes by spinning up new EC2 instances for our ad servers and Redis. The data stored in Redis is quickly restored from our primary MySQL database, so we keep loss of state to a minimum: just the changes in Redis that hadn't yet been pulled back into MySQL.
So what was it that caused our response times to slow down, and what caused the ad servers to lose their connection with Redis?
In the Redis logs, during the initial slowdown we saw this:
 31 May 02:16:14.084 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting
We have Redis configured to use AOF (Append Only File) persistence because we were concerned about the durability of the data stored there. The Redis docs contain a great overview of Redis persistence models, but in summary, AOF keeps data loss to a minimum (with the default settings you would lose at most one second's worth of writes). When fsyncing the AOF file takes too long, Redis compensates by slowing down writes to ensure data is not lost.
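For reference, this behavior is controlled by a couple of directives in redis.conf; a minimal sketch of a configuration like ours (values illustrative):

```conf
# redis.conf -- AOF persistence (illustrative)
appendonly yes        # log every write operation to the append-only file
appendfsync everysec  # fsync once per second: lose at most ~1s of writes on crash
```

`everysec` is the usual middle ground; `appendfsync always` fsyncs on every write, and `appendfsync no` leaves flushing entirely to the OS.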
The message in the log above indicates one of two likely causes:
- We are writing a lot of data in a short amount of time.
- The underlying hardware is slow to write to disk.
We can easily rule out the first possibility. Our Redis instance only handles about 2,000 transactions per second on an m1.large EC2 instance, and we saw no change in the number of transactions at any point before the slowdown occurred.
That leaves the second option. On EC2 we run the risk of noisy neighbors (an issue where other 'nearby' instances consume a significant share of the shared hardware's resources). Unfortunately we don't have the visibility today to determine whether this was the root cause, but we've seen reports from others with similar issues.
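We didn't have disk I/O metrics at the time, but a crude way to sanity-check synchronous write performance on a box is a `dd` probe (path and size are illustrative):

```shell
# Quick probe of synchronous write throughput on the volume Redis writes to.
# oflag=dsync forces each 1MB block to hit the disk before dd continues,
# which is roughly the cost an AOF fsync pays.
dd if=/dev/zero of=/tmp/aof-disk-probe bs=1M count=16 oflag=dsync
rm /tmp/aof-disk-probe
```

Throughput far below the volume's baseline is a hint that the disk (or a neighbor) is the bottleneck. With sysstat installed, `iostat -x 1` also shows per-device latency and utilization over time.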
What happened during the second slowdown which led to complete application meltdown?
From our application logs we saw this:
com.twitter.finagle.redis.ServerError: ERR operation not permitted
This indicates that our ad server was somehow unable to authenticate with the Redis server. Now, the Redis security docs recommend against relying on authentication as the sole means of protection, favoring network-level security instead. Unfortunately, timing is everything! We're in the final stages of moving off of Heroku, and until that's complete we're temporarily relying on Redis AUTH.
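For context, Redis AUTH is just a shared password set on the server; a minimal sketch (password illustrative):

```conf
# redis.conf (password illustrative)
requirepass a-long-random-password
```

Every client connection must then send `AUTH a-long-random-password` before any other command; a client that hasn't authenticated gets an error much like the one above.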
So we have two problems now:
- How do we address the performance issue caused by writing to AOF?
- How do we fix our AUTH issues?
Our solution to both problems is simple: if something is causing us pain, stop doing it!
In the past few months we improved our application so that it no longer requires the same degree of durability, but we had nevertheless kept this option enabled. So the quickest solution turned out to be to simply turn AOF off. We can recreate nearly all of the data in Redis from our MySQL datastore; the only exception is daily spend information, and for that we take Redis RDB snapshots every minute.
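The change itself is small; a sketch of the resulting redis.conf (snapshot thresholds illustrative):

```conf
# redis.conf -- AOF off, rely on RDB snapshots (thresholds illustrative)
appendonly no   # no longer append every write to disk
save 60 1       # snapshot to dump.rdb if at least 1 key changed in 60 seconds
```

Redis can also flip this at runtime with `CONFIG SET appendonly no`, which avoids a restart.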
Furthermore, we can eliminate the AUTH issues by not using AUTH at all. This means moving every application that needs access to Redis inside EC2 and setting up network-level security with EC2 security groups. We're in the process of doing so now.
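As a sketch of that network-level setup, assuming the AWS CLI and illustrative group IDs, a security-group rule that only lets the ad-server group reach Redis might look like:

```
# Allow inbound Redis traffic (port 6379) only from the ad-server group.
# Group IDs below are illustrative placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-redis0000 \
  --protocol tcp --port 6379 \
  --source-group sg-adserver0
```

With no rule allowing 0.0.0.0/0 on that port, Redis is unreachable from outside EC2 regardless of whether AUTH is configured.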
The incident also reinforced something we've found invaluable: having visibility into your application metrics helps you detect and pinpoint failures. We had a lot of good data on our application's behavior, such as our ad server and Redis performance metrics, yet we still found ourselves needing data we weren't capturing, such as disk I/O performance.
Stepping back, the biggest thing we learned is to routinely question the assumptions made earlier when building our application. Over time, your application and infrastructure build up cruft that was necessary at one point but may no longer be needed. Eliminating variability in your application configuration reduces the search space for diagnosing problems, and coupled with more data, helps isolate root causes faster.
If building high-performance services with Scala, Finagle, Redis, and the like, and working with a team that respects the time and effort it takes to build solid infrastructure sounds interesting to you, Sharethrough is hiring :)