Midtrans notification system is a fundamental building block of our Payments Infrastructure. This is the subsystem that sends merchants and the customers the email, http and SMS notifications after transactions.
The internal architecture of the notification system is driven out of a Kafka based transaction event stream, and uses Sidekiq for the Job processing and retry mechanisms that merchants depend on. These jobs are written in Ruby, and Sidekiq processes runs these jobs in multiple threads to achieve high concurrency.
Sidekiq has scaled well at processing the jobs over the last few years. But as the number of jobs started to increase, we started to hit some performance and resource related problems with Sidekiq.
Our workload is very bursty. Feel free to join the party. The job processing system would be nearly idle during the night, handle the normal load during the working hours. But at settlement times for few minutes we process about 20-30x of the normal workload.
[The spike of notifications sent in settlement time]
To avoid running out of memory during these periods, we are generous with the RAM and the CPU on the machines. Also high load affects the latencies of the jobs during this period.
Sending notifications is inherently a IO heavy process where we read from the Database and write to the HTTP endpoints, the response times of endpoints range from 300 ms to 30seconds(our timeout)
MRI Ruby is notoriously inefficient at threading, especially in high IO systems, We needed a platform that provided a high IO efficiency. Elixir turns out to be a good choice, as it was a simple switch from Ruby language, and provided the OTP Platform for delivering fantastic IO performance. Also the reliability patterns build in to OTP provided a good starting point for the building a solid system.
Rewriting large parts of any big system is prone to failure. So we restricted the scope to the worker part of the system. All the systems that push jobs to the queues should remain untouched. Exq which uses the same storage and internal structure as sidekiq, but written in Elixir looked like an perfect fit.
After initial tests for viability, we migrated some of the job types to the new system and deployed it in production in parallel with the old system. This provided us the cushion to make mistakes, and if there were failures, the older system would pick up the jobs and process it normally.
We gradually started to divert more load to the new system and in the meantime started to migrate the remaining job types to the new system.
The new system has now been running exclusively in production for more than a month without any major problems. We can confirm that the new system has provided a significant boost in performance and reliability of the system. It now takes about 20x less memory than the previous system, and provides us about 10x reduced latencies with the notifications.
[Performance changes in the new system(red) and the old system(old). Lower the better]
The peak RAM usage we have observed is ~300 MB as against ~5.3GB during the normal load in the old system. This has allowed us to significantly downsize our notifications infrastructure.
Measuring job latency is a little tricky because it also depends on the latency of external systems. For a sample group of job which uses the same external system at peak loading settlement time, the latency has decreased from 2.9s to 198ms.
Overall the new system seems to perform much better than we had expected and has helped us to scale without having to throw more hardware at the problem.
-- The Notifications team.