
Common pitfalls of cloud event-based architectures

Introduction

Event-based architectures are becoming increasingly popular within the movement towards microservice-oriented systems. This article highlights some of the common pitfalls one can encounter when building an event-based system, and describes mitigations that can be put in place to increase the likelihood of a successful event-based architecture design.

What is an event-based architecture?

An event-based architecture is a type of software architecture built around event-triggered messages and communication. In this architecture, events are the primary means of communication between the different components of the system. Each component is responsible for listening to, processing, and responding to events as they occur. Events can be generated by various sources such as user interactions, system events, or external integrations. An event-based architecture allows systems to be highly decoupled and responsive to changing conditions. It also allows for the easy addition and removal of components, as well as the ability to scale individual components separately from each other.

(1) Event traceability

It can be challenging to trace an event as it gets processed through the system. Lack of visibility into an event's processing state makes debugging and troubleshooting a time-consuming process. This becomes even more challenging when dealing with microservice architectures built with different frameworks and tech stacks, and at times even external service integrations. Thankfully, in recent years work has been done to create a tracing standard via W3C Trace Context. This standard describes a pattern for sending tracing data in an HTTP request, allowing events to be easily correlated as they move through the system. Additionally, two popular telemetry clients have merged to form an industry-accepted set of SDKs, OpenTelemetry, which is now the primary integration standard for many major vendors and frameworks. Systems that make use of these standards can trace their events far more easily. Knowing where and why an event failed is critical to debugging and troubleshooting a failure and implementing any corrective actions.
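In practice the OpenTelemetry SDKs handle context propagation for you, but the core of the W3C Trace Context standard is small enough to sketch directly. The following is a minimal, illustrative parser and propagator for the `traceparent` header (the field layout is from the standard; the function names are this sketch's own):

```python
import re

# W3C Trace Context "traceparent" header: version-traceid-spanid-flags,
# e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return the traceparent fields as a dict, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.groupdict() if match else None

def propagate(header: str, new_span_id: str) -> str:
    """Build the traceparent for a child operation: same trace id, new span id."""
    parts = parse_traceparent(header)
    if parts is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    return f"{parts['version']}-{parts['trace_id']}-{new_span_id}-{parts['flags']}"
```

Because the trace id stays constant while each hop mints a new span id, every service that an event touches can be correlated back to the same trace.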

(2) Complexity in scalability

Over time the number of events the system needs to process is likely to increase. Sometimes this scaling requirement arrives suddenly. This can result in bottlenecks in the parts of the system that cannot scale to the same degree as others, so protecting the system against bottlenecks becomes very important. Depending on the resource that experiences the bottleneck, the constraint may drag throughput even lower across the rest of the system. Scalability also has a cost element: a given technology can become cost-inefficient at the required level of scale. Managing the scalability of the system is an ongoing endeavour to balance the acceptable level of throughput against the acceptable level of cost. When events arrive at variable rates and event processing is not time-critical, there is an alternative to designing the system to always run at its maximum potential throughput rate, which wastes money whenever that processing power is not needed: the system can implement a queue-based pattern to regulate the flow of events being processed. This enables the system to process events at a consistent throughput rate that it can reliably handle.
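The queue-based pattern above (sometimes called queue-based load leveling) can be sketched as follows. This is an in-memory illustration only; a real system would use a durable broker such as SQS, Service Bus, or Kafka, and the class and parameter names here are this sketch's own:

```python
from collections import deque

class LoadLevelingQueue:
    """Buffer bursty event arrivals and drain them at a steady, sustainable rate."""

    def __init__(self, max_per_tick: int):
        self.max_per_tick = max_per_tick  # rate the workers can reliably handle
        self.buffer = deque()

    def enqueue(self, event) -> None:
        """Accept an event immediately, however fast events are arriving."""
        self.buffer.append(event)

    def drain_tick(self, handler) -> int:
        """Process at most max_per_tick events; the rest wait for the next tick."""
        processed = 0
        while self.buffer and processed < self.max_per_tick:
            handler(self.buffer.popleft())
            processed += 1
        return processed
```

A burst of eight events against a worker sized for three per tick simply takes three ticks to clear, rather than forcing the system to be provisioned for the burst.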

(3) Complexity of processing

There are many complexities that arise when processing events, depending on the business rules required by the system. The cheapest time to address these risks is in the design phase of a project, rather than through rework while building the system. Some of the complexities one could encounter are listed below.

Event sequence

Many event architecture patterns cater for processing events asynchronously. This can create problems when events must be processed in a certain order: related events may need to be handled in a specific sequence relative to one another. Additionally, the need to fan out (trigger multiple parallel processes from the same event) or fan in (consolidate multiple events into one) adds further levels of complexity to the processing.
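One common way to restore ordering on the consumer side is to stamp each event with a sequence number and buffer anything that arrives early. A minimal sketch of that idea (the class name and the assumption of a gap-free, per-stream sequence number are this example's own):

```python
class SequencedBuffer:
    """Re-order asynchronously delivered events using a per-stream sequence number."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}  # seq -> event, held until its predecessors arrive

    def receive(self, seq: int, event) -> list:
        """Accept one event; return every event that is now ready, in order."""
        self.pending[seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If event 2 arrives before event 1, it is simply held back until event 1 lands, at which point both are released in order.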

Event replay

Events can sometimes get lost due to factors like networking, so it is common to build in redundant processes that trigger a replay of an event when such a failure is detected. While replaying events mitigates the risk of the data state becoming out of sync, it also creates some new challenges. You need to make sure that you only replay the correct events, i.e. that you do not replay events that cannot be processed by the system. You also need to protect against duplicate events being generated in the process, since duplicates risk driving the data into an inconsistent state. One way to mitigate duplication is to make event processing idempotent. This involves uniquely identifying each event and tracking the status of its processing, which allows the system to check whether an event has already been processed before handling it.
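The idempotent-consumer idea can be sketched in a few lines. Here the set of processed ids lives in memory purely for illustration; in a real system it would be a durable store checked transactionally with the processing itself:

```python
class IdempotentConsumer:
    """Skip events that have already been processed, making replays safe."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # in production: durable, transactional storage

    def handle(self, event_id: str, payload) -> bool:
        """Process the event at most once; return False if it was a duplicate."""
        if event_id in self.processed_ids:
            return False
        self.handler(payload)
        self.processed_ids.add(event_id)
        return True
```

With this in place, a replay mechanism can safely re-deliver an event whose outcome was uncertain: if it already ran, the duplicate is a no-op.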

Event schema changes

Over time, new requirements and functionality will be added to the system, and this will inevitably require changes to various parts of it. These changes may require a backwards-compatibility model to handle integration between the different parts of the system, since it may not always be possible to upgrade every component at the same release cadence. Versioning of the event schema becomes a critical tool for managing these layers of change. The processing services then need to recognise the different versioned events they receive and execute the corresponding business rules. Event schema changes and versioning are also important for event observability: knowing which version of an event was being processed aids debugging and troubleshooting.
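One way consumers can cope with multiple schema versions is to "upcast" older events to the latest shape before the business rules run. The sketch below assumes a hypothetical schema where v2 split a single `name` field into `first_name`/`last_name`; both that schema and the function names are invented for illustration:

```python
def upcast_v1_to_v2(event: dict) -> dict:
    """Hypothetical change: v2 split the single 'name' field into first/last name."""
    first, _, last = event["name"].partition(" ")
    return {"schema_version": 2, "first_name": first, "last_name": last}

# Chain of converters: version n -> version n+1.
UPCASTERS = {1: upcast_v1_to_v2}

def to_latest(event: dict, latest: int = 2) -> dict:
    """Upgrade an event one version at a time until it matches the latest schema."""
    while event["schema_version"] < latest:
        event = UPCASTERS[event["schema_version"]](event)
    return event
```

The advantage of chaining single-step upcasters is that adding version n+1 only requires one new converter, regardless of how many old versions remain in flight.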

Event security

As events flow through the system, they can contain sensitive information, so it is important to make sure that they are sufficiently protected. Encrypting the data at rest and in transit during an event's lifecycle helps reduce the risk of a bad actor gaining access to it. Moreover, event-based systems often rely on many internal networks over which services communicate via events; it is important that these internal networks are locked down to prevent unauthorised access to event data. Systems that rely on external parties generating events also need adequate validation in place to protect against inadvertently processing an invalid or insecure event.
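For the validation of externally generated events, a schema library (e.g. JSON Schema) is the usual choice, but the principle is just to reject anything that does not match the expected shape before it reaches the business logic. A minimal, hand-rolled sketch (the function name and the example schema are this sketch's own):

```python
def validate_event(payload: dict, required: dict) -> list:
    """Return a list of validation errors; an empty list means the event is acceptable."""
    errors = []
    for field, expected_type in required.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors
```

Rejecting malformed events at the boundary, before they are queued or routed, keeps a bad external payload from propagating failures deep into the system.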

Event routing

In an event-based architecture that supports many different event schemas processed by different services, it becomes critical to have a controlled and observable routing mechanism in place to deliver each event to the correct service. This routing mechanism also needs to scale to the system's required level of throughput.
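In a cloud setting the routing mechanism is usually a managed service (topic subscriptions, rule-based buses, and so on), but its core behaviour is a dispatch table from event type to handler. A minimal sketch, with all names invented for illustration:

```python
class EventRouter:
    """Dispatch events to the handler registered for their type."""

    def __init__(self):
        self.routes = {}

    def register(self, event_type: str, handler) -> None:
        """Bind an event type to the service handler that should receive it."""
        self.routes[event_type] = handler

    def route(self, event: dict):
        """Look up the handler by the event's 'type' field and invoke it."""
        handler = self.routes.get(event["type"])
        if handler is None:
            # In a real system this would go to a dead-letter queue for inspection.
            raise LookupError(f"no route for event type {event['type']!r}")
        return handler(event)
```

The explicit unknown-type branch is what makes the routing observable: events that match no route surface as a signal rather than disappearing silently.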

(4) High latency

There is often a high degree of latency in event-based architectures, which makes them best suited to cases where there is no need to respond immediately. The cause of this latency is often the pairing of event-based systems with microservice architectures. Microservice architectures separate the different parts of a system into individual, isolated components. These components often talk to each other via event-based patterns over TCP or gRPC, which introduces network latency into the communication.

(5) Event state management

As events flow through your system, they may be used to alter the state of data stored within a component. It is important to consider what should happen to that state if a downstream event fails in a way that could affect the component's data consistency. Where you need strong data consistency between components, one option is to use the saga pattern, which links data state changes together. With this pattern in place, should there be any downstream failure, the upstream components have an opportunity to alter their state back to the required correct state.
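The essence of the saga pattern is pairing each step with a compensating action that undoes it. A minimal orchestration-style sketch (real sagas run across services via events; the class and method names here are this example's own):

```python
class Saga:
    """Run steps in order; on failure, compensate completed steps in reverse."""

    def __init__(self):
        self.steps = []  # list of (action, compensation) pairs

    def add_step(self, action, compensation) -> "Saga":
        self.steps.append((action, compensation))
        return self

    def execute(self) -> bool:
        completed = []
        for action, compensation in self.steps:
            try:
                action()
                completed.append(compensation)
            except Exception:
                # Restore upstream components to a consistent state.
                for undo in reversed(completed):
                    undo()
                return False
        return True
```

When the third step of a three-step saga fails, the compensations for steps two and one run in reverse order, bringing every component back in line.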

Conclusion

Event-based architecture brings a large degree of flexibility to a system, allowing components to be altered independently. However, this flexibility comes with costs that one needs to be mindful of when designing and building an event-based system. Putting in place a way to properly observe events as they flow through the system will make debugging and troubleshooting easier. Being mindful of the operational cost of each component, and of how those costs change with scale, will prevent building a system that is too costly for the required level of throughput; one also needs to plan for how the system will scale out as demand changes. The complexities of processing an event can be daunting and numerous, and it is almost always cheaper to solve these issues in the design stage than in the development stage. Therefore, make sure to thoroughly interrogate the event flow your system is expected to encounter and design the processing components accordingly.
