Scaling Notifications using Kafka Queueing

At Midtrans, we send out hundreds of thousands of notifications to merchants a day. These are generally email, http or sms notifications telling customers that the transaction has succeeded, postbacks to the merchants' endpoints confirming transaction amounts for reconciliation.

There are a few constraints in how we deliver the notifications, which makes sending something as simple as a http postback pretty complicated

  • It needs to be only once delivery
  • It needs to be sent in sequential order of transactions
  • It needs to be sent immediately (in the order of seconds) after the transaction event has occurred.

For example, if notification is not sent in time, merchant may assume that transaction has failed and cancel the transaction, while the customer may have already transferred money. Or vice versa, cancellation notification is not sent before settlement notification, then merchant may possibly update status to "settled" when transaction is really "canceled".

Even if you do not work with distributed system, it's not hard to recognize that this is a hard problem to solve.

We initially used a fairly simple implementation of Notification service. It accepted the notification payload over http and queued it for future delivery. This worked very well for a while but as number of transactions increased, we started seeing some failed requests and time out because the notification service loads. This is when the flaws in the standard synchronous REST based HTTP protocol becomes apparent.

The Payment API threads were blocked waiting on response from Notification service especially under high load. When we waited too long and timed out, we did not know whether to resend it again (we may send multiple notifications) or just ignore it (we may not send the notification at all).
Depending on the differentials in load on nodes of notification service, notifications could be queued out of order.

Kafka as Queuing system

We chose Kafka as our queuing system, using Payment API as message producer and Notification Service as message consumer. Payment API would queue notification payload (message) to Kafka broker and the Notification Service consumer would process the payload and send the notifications.
Schematic Diagram

Why do we choose Kafka?

  • Kafka handles scale very easily as a distributed platform, and its high availability setup is pretty convenient.
  • Kafka provides us the sequential consistency semantics for strict ordering of transaction messages, so we can always expect notifications to be in sequence.
  • Kafka provides at-least-once delivery semantics so we never lost notifications.
  • Kafka Storage provides a convenient mechanism to resend notifications in case of failures, and to double check in case of mismatches.

Implementation

On the Producer side, the Payment API threads use asynchronous send API and is load balancing by multiple topic partitions. We hash the transaction IDs to the same partitions, so all notifications for a transaction go to the same partition and guarantee sequential consistency.

On the Consumer side, Notification Service uses specific consumer group to subscribe to topics and create multiple consumers to receive notification payload. Each consumer consumes the message from different partition of topics and commit the offset.
Therefore, it's easy to do scaling and manage the performance by controlling the number of consumers. Also because we transactionally commit the offset of the Kafka message on success, we can provide only once delivery guarantee.

We also have developed a bunch of tools to allow us to find, retry or recreate and visualize notifications in case of failures that may happen at our end.

Result

We have been using this system for about 3 months. Most of the payments have migrated to this infrastructure. As of now, not a single notification was lost from Payment API to Notification Service with this queuing system.

Although we just have a few consumers, we have enough performance (fetch rate 150/s at peak time). It's easy to scale up in the future if we need. Furthermore, it deals with spikes in transactional notifications extremely well, especially during settlement period.

example - message receive per minute

Artikel - Artikel Terkait