TheCatAPI // Incident report 15-17.11.2018

TheCatAPI had some downtime overnight (GMT+11) on the 15th Nov, and then very high latency until the 17th.

It’s always good practice to write up an incident report if there’s been a service interruption, even for a side project. Here’s the template I normally use.

Forensically walking through the root causes & resolution steps leads to improving the service as a whole, as well as providing a reference for your future self. It should be an honest, unemotive & blame-free account.

It’s a natural human instinct to avoid embarrassment by not talking about it again and trying to quickly move on. Processing the experience is cathartic; covering it up is not.

Incident Report – 15-17.11.2018


  • At 21:45 (GMT+11) on 15th Nov 2018, there was a significant increase in requests to a new API route that was not properly cached. This overloaded the database, leading to all users seeing 503 responses for any public request.
  • This was wrongly identified as a broken code deployment due to conflicting errors, and a rollback to a previous release had minimal effect.
  • Extra database resources were provisioned, along with increasing the size of the Servers. This stabilised the service but left the response times in the seconds.
  • A patch put live at 13:37 on the 17th brought the response times back down to ~15ms, after which all services operated normally.

Timeline: (GMT +11)

  • 2018-11-14 14:10: A release went live that would later produce confusing error reports due to new logging.
  • 2018-11-15 21:20: Increase in traffic to the images/{image_id}/analysis route, which was not cached.
  • 2018-11-15 21:30: The application overwhelms the database; most responses are 503s.
  • 2018-11-16 08:30: I saw the errors and the messages on the forum, and began investigating.
  • 2018-11-16 08:35: Error logs indicated that the codebase was producing errors, so a rollback was initiated. This restored the service, albeit with a high response time.
  • 2018-11-16 09:45: Response times were still high, so Database & Server capacity was increased, bringing them down.
  • 2018-11-16 10:30: Response time climbs again, however no 503s are returned and the service responds successfully. Log investigation reveals no caching on the images/{image_id}/analysis route.
  • 2018-11-16 11:10: Code fix is created and tested locally. However, no time was available to deploy until the next day.
  • 2018-11-17 13:37: Fix deployed successfully, bringing the response time back down to ~15ms.

Root Cause:

  • An API route that made a request to the Database on every call, without caching or rate-limiting. This should have been caught in testing, or profiled in load-testing.

Resolution:

  • Used the Memoization pattern to cache the response from the DB within the API route, with a fallback to Redis if available, then ultimately the Read DB(s) – see the sketch after this list.
  • Revised error logging to prevent misdiagnosis in the future.
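
As a rough illustration of that fix, here’s a minimal sketch of the read path, assuming Express-style handlers, the ioredis client, and a hypothetical readDb.getAnalysis() helper (all names are illustrative, not the actual codebase):

```javascript
const Redis = require('ioredis');
const readDb = require('./read-db'); // hypothetical module wrapping the Read DB(s)
const redis = new Redis(process.env.REDIS_URL);

// In-process memo, shared by all requests hitting this server process.
const memo = new Map();
const MEMO_TTL_MS = 60 * 1000;

async function getAnalysis(imageId) {
  // 1. In-memory memoization - the cheapest lookup, per server.
  const hit = memo.get(imageId);
  if (hit && hit.expires > Date.now()) return hit.value;

  // 2. Redis - shared across all API servers.
  const key = `analysis:${imageId}`;
  const cached = await redis.get(key);
  if (cached) {
    const value = JSON.parse(cached);
    memo.set(imageId, { value, expires: Date.now() + MEMO_TTL_MS });
    return value;
  }

  // 3. Ultimately, fall through to a Read DB and populate both caches.
  const value = await readDb.getAnalysis(imageId); // hypothetical DB helper
  await redis.set(key, JSON.stringify(value), 'EX', 3600); // cache for an hour
  memo.set(imageId, { value, expires: Date.now() + MEMO_TTL_MS });
  return value;
}
```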

Future prevention:

  • Reduce the time between incident and investigation, by having my phone create an alarm on receiving error alerts, or by using a service like PagerDuty.
  • Research adding a load testing service like LoadImpact into the CI/CD testing process via CircleCI.
  • Integrate Siege into local automated tests.
  • Consider lowering the rate-limit for new routes.
  • Use the Strangler application pattern to split routes like signup & upload over to serverless microservices to prevent them getting interrupted in the future.

TheCatAPI // How it’s built

In 2018 I completely rebuilt the tech stack from the ground up.

Here’s an overview of the new stack:

The whole stack running

1. API Clients

There’s a never-ending list of weird, wonderful – but always interesting – uses that thousands of developers have found for the Cat API. From being baked into the Kubernetes tests, Tinder for cats, Discord bots, weekend hacks with Raspberry Pis, Games, IoT kiosks, and office dashboards, to frequently teaching classrooms full of students, young and old, to code.

The API needs to accept requests from them all reliably under any load, with a predictable response, and yet be nimble enough to innovate upon.
Their requests hit the API’s endpoint – and from there, on to the…

2. Load Balancer

This allows the API to scale based on the load – how many requests it receives in a given period. It spreads the requests across all the API Servers, and if the load gets too high it will trigger the automatic creation of another Server – scaling horizontally. If any Servers become unavailable it will reroute traffic to the others.
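
For rerouting to work, each server just needs to expose something the Load Balancer can poll. A minimal sketch, assuming an Express app and a /health route (the route name is my choice for illustration):

```javascript
const express = require('express');
const app = express();

// The Load Balancer polls this route on each server; a non-200
// response (or a timeout) marks the server unhealthy, and traffic
// is rerouted to the remaining healthy servers.
app.get('/health', (req, res) => res.status(200).send('ok'));

app.listen(process.env.PORT || 3000);
```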

3. API Servers

These run the main application, which is written in NodeJS. They receive API requests via the Load Balancer, process them according to the business logic, communicate with backend services & data storage, then send a response back to the client.

The business logic might be:

  • ‘search for a random image of cats wearing hats’ – find a random image from the Data Store with category_id=4
  • ‘save an image as a favourite with a custom sub_id’ – validate the image exists, then save the data to the Write DB.

Any tasks that will take a long or unknown amount of time are turned into Jobs and added to the Queue so they don’t hold up other requests.
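
As a sketch of how that might look, assuming an SQS-backed queue and a hypothetical saveUpload() helper (the route, queue URL, and message shape are illustrative):

```javascript
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
// `app` is the Express app from the main application

// Rather than analysing the image inline (slow, unknown duration),
// enqueue a Job and respond to the client straight away.
app.post('/images/upload', async (req, res) => {
  const imageId = await saveUpload(req); // hypothetical helper: stores the file
  await sqs.sendMessage({
    QueueUrl: process.env.JOB_QUEUE_URL,
    MessageBody: JSON.stringify({ type: 'analyse_image', imageId }),
  }).promise();
  res.status(201).json({ id: imageId, status: 'analysis pending' });
});
```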

I’ve opted for AWS Elastic Beanstalk instead of GKE (Kubernetes) – GKE would be cheaper and more powerful, however Elastic Beanstalk is quicker to get going and maintain for a tightly scoped project such as this.

4. Data Storage

Object Storage – Stores files like uploaded Images, and Logs from the servers.

Job Queue – Temporary queue of jobs to be picked up by backend workers e.g. ‘analyse an uploaded image’, ‘roll up analytics from log files’, ‘webhook some data’, ‘create & email a report’. Some tasks like ‘send a welcome email’ skip to the front of the queue.

Data Cache – A Redis in-memory (RAM) data store, which provides faster read & write access than a Database (HDD). Data is only kept here for a short time, or until a change is made to it, e.g. if a favourite is deleted then it would be ‘invalidated’ (removed) from the Cache. It stores the response sent to the user, rather than the raw data from the DB, so the same business logic doesn’t need doing again.
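
Invalidation on write is the other half of that; a sketch, again with ioredis, illustrative key names, and a hypothetical writeDb helper:

```javascript
// When a favourite is deleted, remove the cached response so the next
// read misses the Cache, falls through to the DB, and re-caches fresh data.
async function deleteFavourite(userId, favouriteId) {
  await writeDb.deleteFavourite(favouriteId); // hypothetical Write DB helper
  await redis.del(`favourites:${userId}`);    // invalidate the cached response
}
```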

Read DB(s) – These are replicas of the Write DB. If the data is not found in the Cache then it is read from one of the Read Databases. The replicas provide redundancy if the Write (Master) database becomes unavailable, and a way to read data without holding up the saving of new data.

Write DB – This master database saves any new data (Images, Votes, Favourites etc.), and quickly syncs the new data across to all the Read (replica) databases. Data is typically saved in batches to prevent it becoming a bottleneck during heavy write traffic.

The main application uses MySQL as the database – it’s safe and ‘boring’, which is perfectly fine for storing structured relational data, i.e. data that relates to other data, e.g. Votes/Favourites to Images.
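
One way to route reads to the replicas and writes to the master from Node is the mysql package’s PoolCluster; a sketch with placeholder host names:

```javascript
const mysql = require('mysql');
const cluster = mysql.createPoolCluster();

// One master for writes, several replicas for reads (hosts are placeholders).
cluster.add('WRITE', { host: 'write-db.internal', user: 'api', database: 'cats' });
cluster.add('READ1', { host: 'read-db-1.internal', user: 'api', database: 'cats' });
cluster.add('READ2', { host: 'read-db-2.internal', user: 'api', database: 'cats' });

// 'RR' round-robins read connections across the replicas.
cluster.of('READ*', 'RR').getConnection((err, conn) => {
  if (err) throw err;
  conn.query('SELECT * FROM images WHERE category_id = 4 ORDER BY RAND() LIMIT 1',
    (err, rows) => {
      conn.release();
      // ... send the random cat-in-hat image back to the client
    });
});
```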

Some of the job workers use NoSQL databases as they communicate with external services which might return data of differing sizes, formats & types.

5. Serverless Job Workers

Serverless (Lambda) functions are perfect for scheduled or ad-hoc short tasks because they can scale up instantly to meet demand, automatically retry failed jobs (e.g. communicating with an external email service), and don’t hang around racking up bills like a Server would.

I use them to:

  • Sending uploaded images to image analysis to check for inappropriate content, to categorise them, and to confirm they actually contain a kitty.
  • Deleting any images that fail analysis.
  • Resizing images into the different sizes available via query parameters (thumb, small, med, full e.g. size=full) – one such worker is sketched after this list.
  • Creating ‘rollups’ from the Log files with Athena for analytics.
  • Sending welcome emails to new signups.
  • Analysing the user Votes for unpopular images to take out of rotation.
  • Sending me emails of potential issues before the CloudWatch alarms kick in.
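
As an example, a sketch of the resize worker using the sharp library (the bucket layout and pixel widths are my illustration):

```javascript
const AWS = require('aws-sdk');
const sharp = require('sharp');
const s3 = new AWS.S3();

const SIZES = { thumb: 100, small: 300, med: 600 }; // illustrative widths

// Invoked once per uploaded image; writes one resized copy per size,
// which the API then serves according to the `size` query parameter.
exports.handler = async (event) => {
  const { bucket, key } = event; // assumes the job message carries these
  const original = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  for (const [name, width] of Object.entries(SIZES)) {
    const resized = await sharp(original.Body).resize({ width }).toBuffer();
    await s3.putObject({
      Bucket: bucket,
      Key: `${name}/${key}`,
      Body: resized,
      ContentType: original.ContentType,
    }).promise();
  }
};
```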

6. Content Distribution Network (CDN)

This bucket acts as the source location for all images, but as it is in one location, users further away would take longer to load an image than users nearby. To get over this I use CloudFront as the Content Distribution Network (CDN) – a network of storage servers around the world that each hold a copy of the Image and serve it to users nearby, instead of from the source location.

7. External Services

There are some things that I don’t use the AWS or GCP stacks for:

  • SendGrid: Sending users an email containing their API Key when they sign up – it’s secure and reliable.
  • Slack: Sending me real-time telemetry updates like ‘XXX number of signups today’, ‘Image XXX is popular today’, etc.
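
Both take only a few lines from Node; a sketch using @sendgrid/mail and a Slack incoming webhook (the sender address and webhook URL are placeholders):

```javascript
const sgMail = require('@sendgrid/mail');
const https = require('https');

sgMail.setApiKey(process.env.SENDGRID_API_KEY);

// Welcome email containing the new user's API Key.
async function sendWelcomeEmail(to, apiKey) {
  await sgMail.send({
    to,
    from: 'hello@example.com', // placeholder sender address
    subject: 'Your API Key',
    text: `Welcome! Your API Key is: ${apiKey}`,
  });
}

// Telemetry ping into a Slack channel via an incoming webhook.
function postToSlack(text) {
  const req = https.request(process.env.SLACK_WEBHOOK_URL, { method: 'POST' });
  req.end(JSON.stringify({ text }));
}
```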

8. Image Analysis

The data from Rekognition for an image

I had to build a basic image analysis engine years ago when the Cat API first launched in 2012, as there wasn’t anything on the market – it was far from perfect. The AWS & GCP versions of today are a vast improvement, although neither handles .gif files.

AWS Rekognition has different services, each chargeable (a sketch of calling them follows this list):

  • Labels – a list of category-style objects it has found in the image, along with a 0-100 confidence score e.g. Mammal – Confidence 80.123, Cat – Confidence 80.67. As you can see from the image above, these are of mixed usefulness.
  • Moderation Labels – a list of anything that would mark the image as ‘Unsafe Content’, like nudity or suggestive content. As one use-case for the API is in classrooms, anything here causes the image to be rejected.
  • Text – any machine-readable text in the image. This generally causes an image to be rejected, to be on the safe side.
  • Faces – any human faces in the image. These generally cause the image to be rejected too – the API’s about Cat images after all, not humans.
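
A sketch of calling those four services with the AWS SDK and applying the rejection rules above (the confidence threshold is my illustration):

```javascript
const AWS = require('aws-sdk');
const rekognition = new AWS.Rekognition();

async function analyseImage(bucket, key) {
  const Image = { S3Object: { Bucket: bucket, Name: key } };

  const [labels, moderation, text, faces] = await Promise.all([
    rekognition.detectLabels({ Image, MinConfidence: 70 }).promise(),
    rekognition.detectModerationLabels({ Image }).promise(),
    rekognition.detectText({ Image }).promise(),
    rekognition.detectFaces({ Image }).promise(),
  ]);

  // Reject on any moderation label, machine-readable text, or human face.
  const reject =
    moderation.ModerationLabels.length > 0 ||
    text.TextDetections.length > 0 ||
    faces.FaceDetails.length > 0;

  // Does it actually contain a kitty?
  const isCat = labels.Labels.some((l) => l.Name === 'Cat');

  return { reject, isCat, labels: labels.Labels };
}
```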

GCP Vision is different, and in some ways superior due to Google’s vast index of search results. I’ll go into more detail about it, how I use the results to validate images, and how .gifs are moderated in a full article.

All this data is available via the API when requesting an Image via ‘/images/{image_id}/analysis’.

9. API Event Logging & Analysis

As CTO at both my own and other companies, I’ve spent millions of dollars over the years on data pipelines, warehouses, 3rd-party vendors, and specialist contractors. Thankfully, the price of doing the same thing with mature stacks has come down by orders of magnitude over the years.

The closer to real-time the data needs to be, the more expensive it is, and in the vast majority of cases hourly data is fine. Being pragmatic about this use-case, there is no need to use BigQuery or Redshift – the respective GCP & AWS platforms for storing per-request logs. If the output is well defined and simple, then ‘timeboxed’ totals – or ‘rollups’ – can be used instead, and the raw logs backed up. This has the added benefit of not storing more data than needed, like raw user-agents; it’s all too tempting in most companies to simply store everything “just in case it’s needed later”, and that’s simply not acceptable in today’s world.

By accepting a delay in serving analytics, I can also do away with a data firehose via Kinesis & real-time ETL, and instead rotate the logs from the servers themselves into S3 and use AWS Athena to run queries against them. Creating an AWS Kinesis/Glue/Redshift stack to process the same amount of data would eventually cost thousands vs Athena’s pennies.
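
A sketch of kicking off one such rollup query from Node (the table name, log schema, and output bucket are illustrative):

```javascript
const AWS = require('aws-sdk');
const athena = new AWS.Athena();

// Hourly rollup: request totals per route from the rotated logs in S3.
async function startHourlyRollup() {
  const { QueryExecutionId } = await athena.startQueryExecution({
    QueryString: `
      SELECT route, count(*) AS requests
      FROM api_logs -- illustrative table over the S3 logs
      GROUP BY route`,
    QueryExecutionContext: { Database: 'logs' },
    ResultConfiguration: { OutputLocation: 's3://my-rollup-results/' }, // placeholder bucket
  }).promise();
  return QueryExecutionId; // poll getQueryExecution() until it completes
}
```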

10. Business Intelligence Tools

My aim is to provide as much value to as many people as possible. To know if I’m successful in that, and where to do better, visualisation and reporting tools are essential.

For this API I picked Metabase. It’s a free, open-source tool that was dead easy to spin up via their Docker image. I pointed it at the Read DB(s), added some queries, and was able to create a dashboard, plus automated email & Slack reporting, in minutes.

For the cost of running the small EC2 server & Postgres DB (~$20 per month) it’s well worth it. If you don’t need the automation then Google Data Studio is a great free option.

Metabase in action

11. Alerts & Issue Management 

Things break. The key is to know about it quickly (ideally beforehand), and to track the progress made to diagnose & fix it.

CloudWatch, along with some Lambda functions, lets me know if there are any spikes in traffic that aren’t being handled well, or if any Job workers are having recurring errors. These are added to the internal Trello board, along with any related reports or log files.
If any bugs crop up that I should make the public aware of, the Trello ticket is mirrored across to the Public Roadmap board.