We made a few changes to our push API today, and it almost went well. In the old version we relied on a MongoDB connection to authenticate the gems pushing data to AppSignal. For each request it asked MongoDB whether an account with the given API key existed and whether that account was active.
Besides the MongoDB connection, we also keep a connection open to Redis for our queue. We wanted to remove MongoDB as a dependency of the push endpoints, so we implemented a way to look up valid API keys in Redis instead of MongoDB.
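As a rough illustration, a Redis-backed key check could be sketched as below. This is not the actual AppSignal code: the class names, the `active_api_keys` set name, and the in-memory `FakeStore` are assumptions made so the example runs on its own. With the redis-rb client, the same `sismember` call would be made against a real Redis set.

```ruby
require "set"

# Hypothetical in-memory stand-in that mimics the Redis SADD and SISMEMBER
# commands, so the sketch runs without a Redis server.
class FakeStore
  def initialize
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Adds a member to the set stored at key; returns 1 if it was new, else 0.
  def sadd(key, member)
    @sets[key].add?(member) ? 1 : 0
  end

  # Returns true if member is in the set stored at key.
  def sismember(key, member)
    @sets[key].include?(member)
  end
end

# Hypothetical checker: an API key is valid if it is in the set of active keys.
class ApiKeyChecker
  ACTIVE_KEYS = "active_api_keys" # assumed key name, not from the original post

  def initialize(store)
    @store = store
  end

  def valid?(api_key)
    return false if api_key.nil? || api_key.empty?

    @store.sismember(ACTIVE_KEYS, api_key)
  end
end

store = FakeStore.new
store.sadd(ApiKeyChecker::ACTIVE_KEYS, "abc123")
checker = ApiKeyChecker.new(store)
puts checker.valid?("abc123")
puts checker.valid?("unknown")
```

Note that a check like this rejects every key while the set is still empty, so the set has to be filled before the endpoint starts relying on it.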
Unfortunately, the deploy of this new version went less than perfectly. For a brief period (less than a minute) the push endpoint returned 401 status codes instead of accepting valid API keys.
The cause was the switch itself: the Redis store wasn't populated fast enough, so the push endpoint couldn't find the API keys it was asked to verify.
The graph above shows the result of this mishap. Our gem is configured to disengage and shut down if it gets a 401 status code, so it doesn't hammer our API.
When a new Passenger thread is started the gem tries again, which is why you can see the requests slowly coming back to the original level.
If you use a server other than Passenger that has long-lived processes, it is possible that no data is sent to AppSignal until the gem is restarted.
What we'll do to prevent this in the future
While only a very small number of our customers were affected, that is still too many, so we'll do the following:
- Add a feature to the gem to not disengage immediately, but try again for a few minutes.
- Review our deploy process to prevent the API from returning 401s during a deploy.
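
The first item could look roughly like the sketch below: on a 401 the client keeps retrying with an increasing delay and only disengages once a grace period has passed. This is an assumption about how such a feature might work, not the gem's actual implementation. `PushRetrier`, its parameters, and the default timings are hypothetical, and the clock and sleep function are injectable so the behaviour can be exercised without waiting.

```ruby
# Hypothetical sketch: retry on 401 for a grace period before disengaging.
class PushRetrier
  def initialize(grace_period: 300, base_delay: 5, max_delay: 60, clock: nil, sleeper: nil)
    @grace_period = grace_period # seconds to keep retrying after the first 401
    @base_delay = base_delay     # first wait between attempts, in seconds
    @max_delay = max_delay       # upper bound for the wait between attempts
    @clock = clock || -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    @sleeper = sleeper || ->(seconds) { sleep(seconds) }
  end

  # Yields to a block that performs the push and returns the HTTP status code.
  # Returns :sent on a 2xx response, :disengaged when 401s outlast the grace
  # period, and :failed for any other status (left to the caller to handle).
  def push
    started_at = @clock.call
    delay = @base_delay

    loop do
      status = yield
      return :sent if status.between?(200, 299)
      return :failed unless status == 401
      return :disengaged if @clock.call - started_at >= @grace_period

      @sleeper.call(delay)
      delay = [delay * 2, @max_delay].min # back off so the API is not hammered
    end
  end
end

# Usage with a simulated endpoint that rejects the first two attempts.
responses = [401, 401, 200]
retrier = PushRetrier.new(sleeper: ->(_seconds) {}) # skip real sleeping in the demo
puts retrier.push { responses.shift }
```

Doubling the delay up to a cap keeps the original intent of not hammering the API while still recovering from a short outage like this one.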