How to Handle Bad Data in Event Streams
A recent Gartner survey found that poor data quality costs organizations an average of $12.9 million annually and adds complexity to their data ecosystems. “Bad data” here means corrupted or malformed data that doesn’t conform to developers’ expectations. It can cause outages and other downstream problems for data scientists, analysts, machine learning, AI and other data practitioners.
Apache Kafka topics are immutable: once an event is written to an event stream, it cannot be edited or deleted. This design tradeoff ensures that every consumer of the data ends up with exactly the same copy, and that no data changes after it has been read. Because bad data cannot be edited out once written, it is essential to prevent it from entering the stream in the first place. If bad data does get in anyway, there are still a few things you can do, even though you can’t edit it in place.
Here are four tips to help you effectively prevent and fix bad data in event streams.
1. Use Schemas to Prevent Bad Data From Entering
Schemas explicitly define what data should and should not be in the event, including field names, types, defaults, ranges of acceptable values and human-readable documentation. Popular schema technologies for event streams include Avro, Protobuf and JSON Schema.
Schemas significantly reduce data errors by preventing the producer from writing bad data. If data does not adhere to the schema, the producing application throws an exception instead of writing the event. Schemas also let consumers focus on using the data instead of making best-effort attempts to infer what the producer actually meant.
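As a concrete illustration, here is a minimal sketch of producer-side validation in Python using the jsonschema library. The “order_placed” event, its field names and ranges, and the topic name are illustrative assumptions rather than a prescribed format; Avro or Protobuf serializers backed by a schema registry achieve the same goal.

```python
# Minimal sketch: validate an event against a JSON Schema before producing it.
# The schema, field names and topic are illustrative assumptions.
import json
from jsonschema import validate, ValidationError

ORDER_PLACED_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "total_cents": {"type": "integer", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["order_id", "customer_id", "total_cents", "currency"],
    "additionalProperties": False,
}

def publish_order_placed(producer, event: dict) -> None:
    """Reject malformed events at the producer, before they reach the stream."""
    try:
        validate(instance=event, schema=ORDER_PLACED_SCHEMA)
    except ValidationError as err:
        # The bad event never reaches the topic; consumers never see it.
        raise ValueError(f"refusing to publish malformed event: {err.message}")
    producer.produce("orders.order_placed", value=json.dumps(event).encode("utf-8"))
```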
Strongly defined explicit schemas are important for ensuring clear meaning. It is common in an event-driven system to have different independent consumers read the same topic.
In the figure above, there are eight opportunities for a consumer to misinterpret the data from an event stream. The more consumers and topics you have, the greater the chance that they interpret the data differently from their peers, unless you use clearly defined, explicit schemas.
The risk is that each consumer misinterprets the data just slightly differently, producing computations and results that deviate from one another. Reconciling which system is misinterpreting the data can take significant effort. Instead, eliminate this possibility by using schemas.
2. Test Your Schemas With Your Applications
Testing is essential for preventing bad data from entering your streams. While a runtime exception from the producing service may prevent the bad data from getting into the stream, it’ll likely degrade the experience for other applications and users that depend on that service.
Schemas provide everything you need to mock out test data for testing your code. Your producer service tests can exercise all your code paths to ensure that they create only properly formatted events. Meanwhile, your consumer applications can write all of their business logic and tests against the same schema so they don’t throw any exceptions or miscompute results when they receive and process the events.
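For example, a producer-side unit test can exercise the publishing path against the schema. This sketch assumes pytest plus the publish_order_placed and ORDER_PLACED_SCHEMA sketch from the previous section; the fake producer and event values are illustrative.

```python
# Sketch of producer tests, assuming the publish_order_placed/ORDER_PLACED_SCHEMA
# sketch above; the fake producer and event values are illustrative.
import json
import pytest
from jsonschema import validate

class FakeProducer:
    """Captures produced messages in memory instead of sending them to Kafka."""
    def __init__(self):
        self.messages = []

    def produce(self, topic, value, key=None):
        self.messages.append((topic, key, value))

def test_producer_emits_only_schema_valid_events():
    producer = FakeProducer()
    publish_order_placed(producer, {
        "order_id": "o-123",
        "customer_id": "c-456",
        "total_cents": 1999,
        "currency": "USD",
    })
    topic, _, value = producer.messages[0]
    assert topic == "orders.order_placed"
    validate(instance=json.loads(value), schema=ORDER_PLACED_SCHEMA)

def test_producer_rejects_malformed_events():
    with pytest.raises(ValueError):
        publish_order_placed(FakeProducer(), {"order_id": "o-123"})  # missing required fields
```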
Integrate testing into your CI/CD pipeline so you can verify that your code and schemas work correctly together before you deploy your applications and services. You can also have the pipeline validate your schemas against the latest versions in the schema registry, ensuring that your application remains compatible with all of its dependent schemas in case you missed an evolution or update.
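If you use Confluent Schema Registry, a small CI step can check a candidate schema against the latest registered version before deployment. This sketch assumes the confluent-kafka Python client and reuses the illustrative schema above; the registry URL and subject name are assumptions.

```python
# Sketch of a CI compatibility check against a schema registry, assuming the
# confluent-kafka Python client; the URL and subject name are assumptions.
import json
import sys
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

def check_schema_compatibility() -> None:
    client = SchemaRegistryClient({"url": "http://schema-registry:8081"})
    candidate = Schema(json.dumps(ORDER_PLACED_SCHEMA), schema_type="JSON")
    if not client.test_compatibility("orders.order_placed-value", candidate):
        sys.exit("Candidate schema is incompatible with the latest registered version.")

if __name__ == "__main__":
    check_schema_compatibility()
```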
3. Prioritize Event Design
Despite efforts to prevent bad data from entering a stream, sometimes a typo is all it takes to corrupt an input. Event design plays another pivotal role in preventing bad data in your event streams. A well-thought-out event design can allow for corrections, like overwriting previous bad data by publishing new records with the correct data. Prioritizing careful, deliberate event design during the application development phase can significantly ease issues related to bad data remediation.
State events (also known as event-carried state transfers) provide a complete picture of the entity at a given point in time. Delta events provide only the change from the previous delta event. The following image shows delta events as analogous to the moves in a game of chess, while the state event shows the full current state of the board.
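To make the distinction concrete, here is a small sketch contrasting the two shapes using the chess analogy; the field names and values are illustrative assumptions.

```python
# Illustrative contrast between a delta event and a state event for the chess
# analogy above; field names and values are assumptions for the example.

# Delta event: only the change since the previous event (a single move).
delta_event = {
    "game_id": "g-42",
    "move_number": 7,
    "piece": "knight",
    "from_square": "g1",
    "to_square": "f3",
}

# State event (event-carried state transfer): the complete board after the move.
state_event = {
    "game_id": "g-42",
    "move_number": 7,
    "next_to_move": "black",
    "board": {
        "f3": "white_knight",
        "e4": "white_pawn",
        "e5": "black_pawn",
        # ...every other occupied square
    },
}
```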
State events can simplify the process of correcting previously published bad data. You simply publish a new state event containing the corrected state, then rely on compaction to (asynchronously) delete the old, incorrect record. Each consumer receives a copy of the corrected state and can infer the changes by comparing it with any previous state stored within its own domain boundary.
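Here is a hedged sketch of what such a correction can look like, assuming the confluent-kafka Python producer and a compacted topic (cleanup.policy=compact); the topic name, key and payload are illustrative.

```python
# Sketch: correcting bad data on a compacted topic by republishing the full,
# corrected state under the same key; compaction later removes the old record.
# Assumes the confluent-kafka producer; topic, key and payload are illustrative.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

corrected_state = {
    "order_id": "o-123",
    "customer_id": "c-456",
    "total_cents": 1999,   # the previously published event carried the wrong total
    "currency": "USD",
}

# Same key as the bad record, so compaction keeps only this latest state.
producer.produce(
    "orders.state",
    key="o-123",
    value=json.dumps(corrected_state).encode("utf-8"),
)
producer.flush()
```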
While deltas provide a smaller event size, you cannot compact them away. The best you can do is issue a delta that undoes a previous one, but then every consumer must be able to handle the reversal events. There are also many ways to produce bad deltas (e.g., illegal moves, one player moving several turns in a row), and each undo event must be a precise fix. In practice this is very hard to do at any meaningful scale, and all of the previous bad data remains in your event stream; if you choose deltas, you simply can’t clean it up.
Event design allows for rectifying errors without having to delete everything and start from square one. However, only state events provide the means to issue a correction (a new event with the total fixed state) and delete the old bad data (compaction).
4. When All Else Fails, Rewind, Rebuild and Retry
In the world of data streaming, prevention is always better than a fix. As a last resort, be prepared to dig into the event stream. While the process can be applied to any topic with bad data — whether it’s state, delta or a hybrid — it is labor-intensive and easy to mess up. Proceed with caution.
Rebuilding data from an external source requires searching for the bad data and producing a new stream with the corrected data. You must pause producers and consumers, rewind to the start of the topic, then fix and rewrite the data into a new stream, to which all parties eventually migrate.
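A rough sketch of what that can look like with the confluent-kafka Python client; the topic names, the fix_event helper and the end-of-stream handling are hypothetical placeholders for whatever remediation your data actually requires.

```python
# Rough sketch of rewind-rebuild-retry: read the old topic from the beginning,
# repair each event, and write it to a new topic that consumers migrate to.
# Topic names and fix_event() are hypothetical placeholders.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-rebuild",
    "auto.offset.reset": "earliest",   # rewind to the start of the topic
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.v1"])

def fix_event(event: dict) -> dict:
    """Hypothetical repair logic; replace with your actual correction."""
    event["currency"] = event.get("currency", "usd").upper()
    return event

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # for this sketch, assume no message means we've reached the end
    if msg.error():
        continue
    fixed = fix_event(json.loads(msg.value()))
    producer.produce("orders.v2", key=msg.key(), value=json.dumps(fixed).encode("utf-8"))

producer.flush()
consumer.close()
```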
Although this expensive, complex solution should be a last resort, it’s an essential strategy to have in your arsenal.
Mitigate the Impact of Bad Data
Handling bad data in event streams doesn’t have to be a daunting task. By understanding the nature of bad data, preventing it from entering your event streams, using event design to overwrite bad data, and being prepared to rewind, rebuild and retry when necessary, you can effectively mitigate its impact. Good data practices not only save time and effort but also let you focus on getting work done with the data instead of fixing it.