Data cleansing and validation of 3 billion records

EDW offloading, Technology consolidation
Java developers
Key themes
Data mapping and transformation, code-free transformations
Industry segment
Large enterprise
OSS CDAP on-premise


The customer, a Fortune 500 company in the financial sector, had custom-built a data pipeline to perform data validation and correction transforms. The pipeline was constructed using multiple complex technologies. Examples of the checks performed on the 3 billion records included:

  • Standardization, verification, and cleansing of USPS codes
  • Domain set validation, null checks, length checks
  • Regular expression validation (email, SSN, dates, etc.)
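
To make these checks concrete, the following is a minimal Java sketch of what each kind of rule might look like in code. It is not the customer's pipeline or a CDAP API: the class name RecordChecks, the method names, the regular expressions, and the four-entry state code set are all hypothetical and deliberately simplified for illustration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only: every name and pattern here is an assumption,
// not the customer's actual rule set and not part of CDAP.
final class RecordChecks {

    // Deliberately tiny subset of USPS state abbreviations, for illustration.
    static final Set<String> STATE_CODES = Set.of("CA", "NY", "TX", "WA");

    // Simplified patterns that check shape only; real rules are usually stricter.
    static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");
    static final Pattern SSN = Pattern.compile("^\\d{3}-\\d{2}-\\d{4}$");
    static final Pattern ISO_DATE = Pattern.compile("^\\d{4}-\\d{2}-\\d{2}$");

    // Null check: the field must be present and not blank.
    static boolean isPresent(String value) {
        return value != null && !value.trim().isEmpty();
    }

    // Length check with inclusive bounds; null fails the check.
    static boolean hasLength(String value, int min, int max) {
        return value != null && value.length() >= min && value.length() <= max;
    }

    // Standardization: trim whitespace and upper-case the code.
    static String standardizeStateCode(String raw) {
        return raw == null ? null : raw.trim().toUpperCase(Locale.ROOT);
    }

    // Domain set validation: the standardized value must be in the allowed set.
    static boolean isKnownStateCode(String raw) {
        String code = standardizeStateCode(raw);
        return code != null && STATE_CODES.contains(code);
    }

    // Regular expression validation: the whole value must match the pattern.
    static boolean matches(Pattern pattern, String value) {
        return value != null && pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKnownStateCode(" ca "));           // true
        System.out.println(matches(EMAIL, "user@example.com")); // true
        System.out.println(matches(SSN, "123456789"));          // false
        System.out.println(hasLength("94105", 5, 10));          // true
    }
}
```

Each rule is a small pure function, which is the kind of logic a visual tool can expose as a configurable step, leaving only unusual patterns to be written by hand.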

The legacy pipeline ran overnight, required multiple teams to keep it operating, and depended on costly experts for maintenance. For these basic mapping, transformation, and validation tasks, the customer wanted a visual tool that would avoid ad hoc coding and one-off solutions.

CDAP value proposition(s)

In-house Java programmers developed, tested, and ran the replacement pipeline using the drag-and-drop visual interface.

The new pipeline required only limited coding, which was needed to integrate custom regular expressions.