Using Akka Streaming for “saving alerts” – Part 2

Diogo Pires

This blog post is the second and final part of Using Akka Streaming for “saving alerts”. In this part we go into the details of how the application was designed.

Application Actor System

The following illustration gives you a schematic view of all the actors used in our application, and (hopefully) some of the mechanics of their interaction:

As previously mentioned, one can divide the application lifecycle logically into three main stages. We will go into more detail about each of them next, but for now let us walk through them briefly and map them to our application.

Main Actors

The main actors in our application are: KafkaConsumerActor, CouponRouterActor and PushNotificationRouterActor. They perform the core business tasks of this application:

  • Consume events from Kafka and validate them – this is done by KafkaConsumerActor, which also controls the whole Akka Streaming pipeline/flow. The flow is controlled so that we never overwhelm the main downstream actors – CouponRouterActor and PushNotificationRouterActor – with more events than they can handle.
  • Query the Coupon API for results – for a given merchant and a given user, we query the Coupon API for available coupons. Those results are sent back to the Akka Streaming pipeline.
  • Apply business rules & decide whether to fire a push notification – the last key stage sends the returned results to PushNotificationRouterActor, which applies a given set of business rules. If those rules consider the event valid, a push notification may be fired, provided none has been sent in the last X hours. A simplified sketch of this flow is shown right after this list.
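
The essence of this flow can be sketched with the Alpakka Kafka connector. This is a minimal, hedged sketch rather than the production code: the topic name, actor protocols (CouponRequest, ApplyRules, …), parallelism and timeout values are hypothetical, and it assumes Akka 2.6+ (implicit ActorSystem acting as materializer) and a connector version where commitScaladsl() is still available.

```scala
import akka.actor.{ActorRef, ActorSystem}
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.pattern.ask
import akka.stream.scaladsl.Sink
import akka.util.Timeout
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.duration._

object PipelineSketch {
  // Hypothetical actor protocol, for illustration only.
  final case class CouponRequest(rawEvent: String)
  final case class CouponResponse(coupons: Seq[String])
  final case class ApplyRules(coupons: Seq[String])
  final case class RulesResult(pushFired: Boolean)

  def run(couponRouter: ActorRef, pushRouter: ActorRef)(implicit system: ActorSystem): Unit = {
    import system.dispatcher
    implicit val askTimeout: Timeout = 3.seconds // TTL for the Ask pattern

    val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("saving-alerts") // all "Pods" share this consumer group (see "Scalability & HA")

    Consumer
      .committableSource(settings, Subscriptions.topics("user-events"))
      .mapAsync(parallelism = 4) { msg =>
        // Asking the routers (instead of fire-and-forget) is what back-pressures the source.
        for {
          coupons <- (couponRouter ? CouponRequest(msg.record.value)).mapTo[CouponResponse]
          _       <- (pushRouter ? ApplyRules(coupons.coupons)).mapTo[RulesResult]
        } yield msg.committableOffset
      }
      // Committing the offset acknowledges consumption back to Kafka
      // (newer connector versions would use Committer.sink instead of commitScaladsl()).
      .mapAsync(parallelism = 1)(_.commitScaladsl())
      .runWith(Sink.ignore)
  }
}
```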

Not mentioned yet is MetaInfoRouterActor. It is used with the sole purpose of monitoring statistics throughout the whole Akka Streaming pipeline. As noted in the illustration, this is not a core feature of the application itself, so we send all messages to our monitoring service in a “fire and forget” manner – that is, we do not wait for acknowledgement. This implies that messages may not be delivered, and ultimately may never land in our monitoring system; however, we considered this a minor and negligible risk.
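
“Fire and forget” here is just a plain tell (!), so the pipeline never blocks on monitoring. A small hedged sketch, where the StatsEvent message is a hypothetical name:

```scala
import akka.actor.ActorRef

// Hypothetical metric message; the real payload is whatever the monitoring service expects.
final case class StatsEvent(stage: String, eventId: String)

// tell (!) returns immediately and never waits for an acknowledgement; delivery is best-effort,
// which is acceptable for metrics.
def reportStat(metaInfoRouter: ActorRef, stage: String, eventId: String): Unit =
  metaInfoRouter ! StatsEvent(stage, eventId)
```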

Secondary Actors

On the sidelines, as secondary services supporting the main actors, we have three actors: AppMasterActor, MetaInfoActor and RulesActor.

AppMasterActor has two main functions: controlling the discovery protocol that we implemented, and hosting the healthcheck endpoint used for outside application monitoring.

The so-called discovery protocol basically makes sure that all actors know where – on which servers – the other actors are, so that, theoretically speaking, we could separate the actors onto different servers in a scale-out fashion. As a side note, we would like to highlight that this discovery protocol could have been implemented using the Distributed PubSub module from Akka – which would definitely be more advisable if our application were to grow in the number of actors. Full disclosure: at the time, due to the simplicity of the application, it seemed easier to implement it ourselves and keep the project smaller, which might be a questionable technical architecture decision.
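
For reference, the alternative would look roughly like the sketch below. It assumes Akka Cluster with the akka-cluster-tools module; the actor path and message are hypothetical placeholders, not our actual protocol.

```scala
import akka.actor.{ActorRef, ActorSystem}
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Put, Send}

// Each node registers its local actors with the cluster-wide mediator; any other node can
// then reach them by actor path without knowing which server actually hosts them.
def registerAndUse(system: ActorSystem, couponRouter: ActorRef): Unit = {
  val mediator = DistributedPubSub(system).mediator

  // Register the local router under its path (e.g. /user/couponRouterActor).
  mediator ! Put(couponRouter)

  // From any node: deliver a message to whichever node hosts that path,
  // preferring a local instance when one exists.
  mediator ! Send("/user/couponRouterActor", "some-event", localAffinity = true)
}
```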

Technically speaking, MetaInfoActor and RulesActor are almost identical in their implementation: they basically have a scheduled timer reminding them to check a given key in an S3 bucket, stream-load it into memory, and broadcast it to their respective client actors.

As explained in the previous section, routers host many workers (so-called “routees”) behind them, serving as a … well, yes, router in front that directs traffic to them. All the actors that are routers have it explicitly referenced in their name. Thus, when we say that MetaInfoActor or RulesActor broadcasts a message, in fact we are just sending one single message to the respective router wrapped in a Broadcast() case class; the router then knows that it should broadcast the intended message to all its routees.
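
A hedged sketch of this timer-plus-broadcast pattern follows. The message names, reload interval and loadRulesFromS3() helper are hypothetical; the S3 read itself would go through the AWS SDK or Alpakka S3.

```scala
import akka.actor.{Actor, ActorRef, Timers}
import akka.routing.Broadcast

import scala.concurrent.duration._

// Hypothetical messages; loadRulesFromS3() stands in for streaming the S3 object into memory.
case object Reload
final case class RulesUpdated(rules: Map[String, String])

class RulesActorSketch(rulesRouter: ActorRef) extends Actor with Timers {
  // Periodically re-check the S3 bucket for a new version of the rules.
  timers.startPeriodicTimer("reload", Reload, 5.minutes)

  def receive: Receive = {
    case Reload =>
      val rules = loadRulesFromS3()
      // Wrapping the message in Broadcast() tells the router to forward it to every routee,
      // instead of routing it to just one of them.
      rulesRouter ! Broadcast(RulesUpdated(rules))
  }

  private def loadRulesFromS3(): Map[String, String] = Map.empty // stub
}
```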

Scalability & HA

All the actors depicted in the illustration live on the same server. As a matter of fact, for the time being we are scaling out the application in a somewhat “schizophrenic” manner – we deploy the application on several servers, and each server runs a completely isolated application, unaware of the existence of its twins. In other words, actors living inside server 1 do not communicate with any actor living on server 2. Thus we like to call our current deployment “Pod mode”. We are able to achieve this because all the application “Pods” consume events from Kafka using the same consumer group. Kafka intelligently assigns partition IDs to the several consumers; in other words, Kafka controls the distribution of partitions to each Pod, thus allowing us to scale out in a very simple manner:

 

To increase performance, we can scale out the number of KafkaConsumerActors up to the number of Kafka partitions. So, for example, if we had a topic with three partitions, we could improve consumption performance by scaling up to three KafkaConsumerActors.

To address High Availability (HA), we could, theoretically speaking, run N+1 KafkaConsumerActors (where N is the number of partitions) if our application were mission critical and very sensitive to performance. This would, however, only potentially improve the HA of the application, as the additional KafkaConsumerActor would sit idle (i.e. not consuming anything from Kafka) until another KafkaConsumerActor failed.

Moreover, in case you are wondering, not having N+1 KafkaConsumerActors does not severely harm the HA of our application, as Kafka will reassign partition IDs among the remaining consumers in the event of a consumer failure. However, this obviously means that the consumers will be unbalanced, as one of them will be consuming from two partitions simultaneously.

Now, you may ask what happens in the case of failure of a node that was processing a given event. Well, at the end of the Akka Streaming pipeline each KafkaConsumerActor commits the message offset back to Kafka – thus acknowledging consumption. So in this case, after the default TTL of message consumption configured in Kafka passes, we know that the message was not successfully processed (no acknowledgement), and another KafkaConsumerActor will read that same message again from Kafka, thereby reprocessing it.

As mentioned previously, when an event has been processed by the system, KafkaConsumerActor commits that event’s offset back to Kafka, thereby acknowledging that the message has been successfully consumed for its Kafka consumer group. We can’t stress this enough (and thus we repeat ourselves): this is how we are able to guarantee at-least-once semantics when processing a given message. Note that, since we are storing the offsets in Kafka, in our implementation we cannot guarantee exactly-once semantics. Nonetheless, this does not constitute a problem, as we later use a Redis cache to deduplicate events. For more information about the Akka Kafka consumer, please check here.
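
Since at-least-once delivery means the same event can occasionally be processed twice, the push decision can be guarded by a Redis key. A hedged sketch assuming the Jedis client; the key scheme and the 24-hour window are hypothetical placeholders:

```scala
import redis.clients.jedis.Jedis

// Guard against duplicate pushes when an event is reprocessed.
def shouldSendPush(redis: Jedis, userId: String, merchantId: String): Boolean = {
  val key = s"push-sent:$userId:$merchantId"
  // SETNX succeeds only for the first writer; duplicates within the TTL window are suppressed.
  val firstWriter = redis.setnx(key, "1") == 1L
  if (firstWriter) redis.expire(key, 24 * 60 * 60) // forget the key after 24 hours
  firstWriter
}
```

In production, a single atomic SET with the NX and EX options would avoid the small window between the two calls above.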

Let us address scalability in the rest of the application by taking the CouponRouterActor architecture into consideration.

 

As shown in the previous illustration, performance is scaled by using Akka “routees” behind CouponRouterActor (as well as behind PushNotificationRouterActor). One of the beauties of Akka is that it allows us to code CouponRouterActor 99% the same as if it were not operating as an Akka Router. We simply declare its router nature on actor instantiation, and the rest is handled by Akka.
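
A hedged sketch of what that instantiation looks like; the pool size, actor names and the stub worker below are illustrative, not the production values:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Stub routee: the worker's code does not know (or care) that it runs behind a router.
class CouponWorker extends Actor {
  def receive: Receive = {
    case request => sender() ! s"coupons for $request" // placeholder reply
  }
}

object RouterSetup {
  // Only the instantiation declares the router nature; the rest is handled by Akka.
  def createRouter(system: ActorSystem) =
    system.actorOf(RoundRobinPool(nrOfInstances = 8).props(Props[CouponWorker]), "couponRouterActor")
}
```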

Final remarks

We will dive into more detail on each stage next. However, we would like to highlight the importance of the Akka Streaming pipeline: it is able to control how many messages are read from Kafka because it sends messages to CouponRouterActor and PushNotificationRouterActor using the Ask pattern – which waits for a response (up to a given time-to-live (TTL)).
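
One possible way to handle an expired TTL – assumed here, not necessarily exactly what our production code does – is to recover the timed-out ask into a “skipped” outcome so the stream keeps flowing and the offset still gets committed:

```scala
import akka.actor.ActorRef
import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout

import scala.concurrent.duration._
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical outcome type for illustration.
sealed trait Outcome
case object Processed extends Outcome
case object Skipped extends Outcome

def askWithTtl(router: ActorRef, event: Any)(implicit ec: ExecutionContext): Future[Outcome] = {
  implicit val ttl: Timeout = 3.seconds // the "up to a given TTL" part of the Ask pattern
  (router ? event)
    .map(_ => Processed: Outcome)
    .recover { case _: AskTimeoutException => Skipped } // give up on this event, keep the stream flowing
}
```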

Also note that no matter how far an event gets down the flow (an event may, for example, be filtered right at the beginning if it is considered invalid), we always log to Datadog that a given message was read from Kafka and successfully processed. Note that “successfully processed” can have different meanings: the event may be considered invalid right at the beginning of the streaming pipeline, the Coupon API may return no available coupons, or the business rules may determine that the system should not send a push notification to the Kepler API because the event is unfit.

Moreover, please note that when an event’s processing is finished – again, no matter how far it got down the streaming pipeline – KafkaConsumerActor has the final task of committing that event’s offset back to Kafka. In other words, we acknowledge to Kafka that the message has been processed. This is an important detail: in case of failure while processing a given event (let’s say one of the application servers crashes), after the default TTL of message consumption that is configured in Kafka passes, another KafkaConsumerActor will read that same message again from Kafka, thus reprocessing it.

Docker environment

Currently we only use Docker for local development, although this application would fit quite well in, say, a Kubernetes cluster.

We have set up a complete emulation of the production environment locally via Docker:

This is (extremely) useful not only for getting a better grip on how the system works in day-to-day development, but also for running harder-to-emulate behavioural tests, such as High Availability (HA) tests.

Final Notes

Like any application, there are a number of things that could have been done better but, due to practical constraints (mainly time), were not. Let us start with some of the things we do not regret:

  • Using Akka: there are many ways we could have implemented this application. Overall, Akka is a mature, full-fledged framework – it contains modules for almost anything you may require while building distributed, highly available, asynchronous applications – and it offers very satisfactory performance.
  • Using Akka Streaming: there are many blogs out there with horror stories about constant performance issues with pure Akka implementations. The Akka Streaming module not only increases stability via back-pressure, it also provides a very intuitive and fun-to-work-with API.
  • Using Docker locally: this allowed us to test, very easily and especially rapidly on our local machines, rarer scenarios such as simulating failures of every point in the application: Kafka nodes, Redis, S3, and, of course, the Akka application itself.

Some open topics for further reflection:

  • Using our own discovery protocol ended up being a questionable technical decision. One possible alternative could have been using the Akka “DistributedPubSub” module.
  • Ideally, this application would be a very nice initial use case for container orchestration tools such as Kubernetes.

And … that’s all folks. We hope that this post was useful to you.