Architecting Systems for Bad Data

Table of contents

Introduction
Quick Summary
When a Data Processing Pipeline Rejects Data
Correcting the Errors
Giving Business Users Access To Their Bad Data
There’s Gold in Data Error Management
Effective Handling of Errors
Dealing with Different Data Formats
Reconciliation
Auditing
Reporting
Conclusion

Data Integration [email protected] www.cloveretl.com

Quick Summary

In this paper we discuss the benefits of automated data processing pipelines designed for error management. We outline tools and practices that enable business users to identify and correct bad data and feed it back into the processing pipeline. This data correction loop, coupled with performance analysis, audit trails and smart detection of anomalies, helps eliminate unexpected downtime, prevent data loss and avoid delays in business operations.

When a Data Processing Pipeline Rejects Data

IT systems are great at detecting issues and bringing invalid data to a user’s attention, but they often lack the capability to fix the problems.

Whenever data enters an organization or a certain business process, it’s pushed through an (ideally automated) data pipeline, which moves the data through complex processing stages and eventually produces one or more transformed data sets that serve further business functions. That’s true when everything is fine. What if there’s a problem with the data? At worst, the whole process fails, leaving a mess and inconsistencies in the systems involved. At best, the process produces rejected data sets: lists of the records that couldn’t be processed, augmented with additional information describing the nature of the problem.
Data can be rejected at any point of its journey through the pipeline: typos or missing references during input validation; duplicates or records violating business logic along the way; and, at the end, issues with pushing data to its targets, such as mismatched or changed data structures or network congestion.

[Figure: Automated data processing pipeline. Labels shown: Data Source, Input Validation, Logic Step, Logic Step, Output, Data Target, Rejected Data.]
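To make the idea of a rejected data set more concrete, here is a minimal Python sketch of a pipeline that rejects records at different stages and stores each one together with the stage name and a description of the problem. Everything in it is an assumption made for illustration: the names (`Rejection`, `validate_input`, `make_duplicate_check`, `run_pipeline`), the fields `customer_id` and `email`, and the checks themselves do not come from this paper or from any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict

# Hypothetical structure: one rejected record plus where and why it failed.
@dataclass
class Rejection:
    record: Record
    stage: str
    reason: str

# A check returns None when the record passes, or a reason string when it fails.
Check = Callable[[Record], Optional[str]]

def validate_input(record: Record) -> Optional[str]:
    """Input validation: catch missing references and obvious typos."""
    if not record.get("customer_id"):
        return "missing customer_id"
    if "@" not in (record.get("email") or ""):
        return "malformed email"
    return None

def make_duplicate_check() -> Check:
    """Logic step: reject records whose customer_id was already seen."""
    seen = set()

    def check(record: Record) -> Optional[str]:
        key = record["customer_id"]
        if key in seen:
            return f"duplicate customer_id {key}"
        seen.add(key)
        return None

    return check

def run_pipeline(records, stages):
    """Run every record through the stages in order.

    A record is rejected at the first stage whose check fails; records
    that pass all stages are accepted.
    """
    accepted, rejected = [], []
    for record in records:
        for name, check in stages:
            reason = check(record)
            if reason is not None:
                rejected.append(Rejection(record, name, reason))
                break
        else:
            accepted.append(record)
    return accepted, rejected

if __name__ == "__main__":
    stages = [
        ("input_validation", validate_input),
        ("deduplication", make_duplicate_check()),
    ]
    records = [
        {"customer_id": "C1", "email": "a@example.com"},
        {"customer_id": "", "email": "b@example.com"},
        {"customer_id": "C1", "email": "c@example.com"},
    ]
    accepted, rejected = run_pipeline(records, stages)
    for r in rejected:
        print(r.stage, "-", r.reason)
```

Keeping the stage and reason next to each rejected record is what would later allow someone to correct the record and feed it back into the pipeline, rather than having to hunt for the cause in log files.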