To Stream or Not to Stream: The Data Quality Edition


The old way: data quality for stream processing

A new approach to data quality for stream processing

Image courtesy of Barr Moses.
  • Freshness: Freshness measures how up-to-date your tables are and the cadence at which they're updated. This is essential for stream processing, where data needs to be ready and reliable in real time; otherwise, stale data drives poor decisions and wastes time and money. Data observability gives your data team insight into how current the data you're streaming actually is.
  • Distribution: Distribution measures whether the data falls within an accepted range. For streaming data this is crucial: in most business use cases, even one anomaly is too many, and anomalies can erode stakeholder trust in the data. With data observability, data teams gain visibility into anomalies and can stop them from working their way downstream.
  • Volume: Volume refers to the completeness of your data tables and offers insight into the health of your data sources. If you are stream processing and 10,000 rows suddenly turn into 100,000, you know something is wrong. Through data observability, data teams stay informed about any unexpected volume changes.
  • Schema: Schema refers to changes in the organization of your data and is often an indicator of broken data. Since stream processing frequently ingests data from third-party sources that live outside your ecosystem, data observability alerts you to unexpected schema changes, giving your data team a clearer picture of the health of your incoming data.
  • Lineage: When data breaks, the first question is always "where?" Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, which teams are generating the data, and who is accessing it. By applying data observability, data teams can trace how their streaming data flows through their ecosystem, leading to faster root cause and impact analysis when responding to data incidents.
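Taken together, the freshness, distribution, volume, and schema pillars above can be sketched as a small stream-side monitor that checks each incoming micro-batch. This is a hypothetical illustration, not Monte Carlo's product or API; the class name, field names, and thresholds are all invented for the example:

```python
import time


class StreamQualityMonitor:
    """Toy monitor applying four observability pillars to incoming micro-batches."""

    def __init__(self, expected_fields, value_range, expected_batch_size, max_staleness_s):
        self.expected_fields = set(expected_fields)     # schema: expected field names
        self.value_range = value_range                  # distribution: (low, high) for "amount"
        self.expected_batch_size = expected_batch_size  # volume: typical rows per batch
        self.max_staleness_s = max_staleness_s          # freshness: max gap between batches
        self.last_seen = None

    def check_batch(self, records, now=None):
        now = time.time() if now is None else now
        issues = []
        # Freshness: flag when too much time has passed since the last batch.
        if self.last_seen is not None and now - self.last_seen > self.max_staleness_s:
            issues.append(f"freshness: no data for {now - self.last_seen:.0f}s")
        self.last_seen = now
        # Volume: flag an order-of-magnitude jump in row counts.
        if len(records) > 10 * self.expected_batch_size:
            issues.append(f"volume: {len(records)} rows, expected ~{self.expected_batch_size}")
        # Schema: unexpected or missing fields usually mean an upstream change.
        for rec in records:
            if set(rec) != self.expected_fields:
                issues.append(f"schema: got fields {sorted(rec)}")
                break
        # Distribution: count values that fall outside the accepted range.
        low, high = self.value_range
        bad = [r["amount"] for r in records
               if "amount" in r and not low <= r["amount"] <= high]
        if bad:
            issues.append(f"distribution: {len(bad)} values outside [{low}, {high}]")
        return issues
```

In a real pipeline these checks would run inside the stream processor (for example, as a Flink or Kafka Streams operator) and route issues to an alerting system rather than returning a list.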
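The lineage pillar works differently from the per-record checks: it is a graph problem. Given a broken source, which downstream tables and dashboards are affected? A minimal sketch of that "where?" query, using an invented lineage graph (the table names are illustrative):

```python
from collections import deque

# Hypothetical lineage graph: each table maps to the tables that consume it.
LINEAGE = {
    "kafka.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue", "mart.orders_daily"],
    "mart.revenue": ["dashboard.exec_kpis"],
}


def impacted_downstream(source):
    """Breadth-first walk over the lineage graph, collecting every
    downstream table reachable from the broken source."""
    seen = set()
    queue = deque(LINEAGE.get(source, []))
    while queue:
        table = queue.popleft()
        if table not in seen:
            seen.add(table)
            queue.extend(LINEAGE.get(table, []))
    return sorted(seen)
```

A breadth-first traversal answers impact analysis (everything downstream of a break); reversing the edges and running the same walk answers root cause analysis (everything upstream that could have caused it).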

The future of data quality, in real time

Lior Gavish

Lior Gavish is Co-founder and CTO of Monte Carlo. Interests: cybersecurity, data engineering, & F1 racing. #datadowntime